May 12, 2026

How PDF to Markdown Conversion Works

Ever wondered what happens behind the scenes when you drop a PDF into our converter and get clean Markdown out? The entire process runs 100% in your browser: no server uploads and no API calls, so your file never leaves your device. Here's a deep dive into the pipeline.

The Core Challenge

PDF is a visual format. It tells a renderer where to place characters on a page, but unless the file is tagged, it carries no semantic structure. In a typical PDF there are no "headings," no "paragraphs," no "lists." There are just positioned glyphs with font metadata.

Converting PDF to Markdown means reconstructing structure from visual cues: font sizes, weights, positions, and spacing. Our heuristic engine infers what a human would read as a heading, a list, or a table — entirely from spatial and typographic relationships.

The Conversion Pipeline

The conversion happens in a series of well-defined stages:

PDF Upload → Validation → Text Extraction → Header/Footer Removal → Column Detection → Heading Classification → Table Detection → List Detection → Paragraph Merging → Inline Formatting → Markdown Output

Step 1: Text Extraction with PDF.js

We use PDF.js (the same library that powers Firefox's PDF viewer) to extract every text element from your document. PDF.js runs its heavy parsing inside a Web Worker, keeping the main browser thread responsive while processing large files.
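As a rough illustration (not the converter's actual code), here is how one text item returned by PDF.js's `page.getTextContent()` could be normalized into a flat record for the later stages. The `str`, `transform`, `width`, `height`, and `fontName` fields are part of PDF.js text items; the `Fragment` shape and the `toFragment` helper are hypothetical names introduced for this sketch.

```typescript
// Subset of the fields PDF.js exposes on each text item.
interface PdfJsTextItem {
  str: string;
  transform: number[]; // text matrix [a, b, c, d, e, f]
  width: number;
  height: number;
  fontName: string;
}

// Hypothetical normalized record passed to the later pipeline stages.
interface Fragment {
  text: string;
  x: number;
  y: number;
  width: number;
  fontSize: number;
  fontName: string;
  page: number;
}

function toFragment(item: PdfJsTextItem, page: number): Fragment {
  const c = item.transform[2];
  const d = item.transform[3];
  return {
    text: item.str,
    x: item.transform[4], // horizontal offset of the text origin
    y: item.transform[5], // vertical offset of the text origin
    width: item.width,
    // The vertical scale of the text matrix approximates the font size,
    // and still works when the text is rotated.
    fontSize: Math.sqrt(c * c + d * d),
    fontName: item.fontName,
    page,
  };
}
```

Note that these coordinates are in PDF user space, where y usually grows upward from the bottom of the page, so a later stage may need to flip them.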

For each text fragment, we extract:

Step 2: Header & Footer Removal

Repeated headers, footers, and page numbers clutter the output. Our stripper algorithm:

  1. Scans text in the top 10% and bottom 10% of each page
  2. Compares content across all pages (with a 3pt position tolerance)
  3. If identical text appears at the same position on ≥80% of pages, it's classified as header/footer and removed
  4. Isolated numbers in those zones are treated as page numbers
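The four rules above can be sketched as follows. This is a simplified illustration under stated assumptions, not our production code: `PageItem` and `stripHeadersFooters` are hypothetical names, `y` is measured from the top of the page, all pages share one height, and the 3pt tolerance is approximated by rounding positions into 3pt buckets.

```typescript
interface PageItem {
  text: string;
  x: number;
  y: number;    // distance from the top of the page, in points (assumed)
  page: number; // 0-based page index
}

function stripHeadersFooters(
  items: PageItem[],
  pageCount: number,
  pageHeight: number,
): PageItem[] {
  const TOL = 3;         // position tolerance in points
  const ZONE = 0.1;      // top and bottom 10% of the page
  const MIN_SHARE = 0.8; // text must repeat on at least 80% of pages

  const inZone = (it: PageItem) =>
    it.y <= pageHeight * ZONE || it.y >= pageHeight * (1 - ZONE);

  // Same text in the same 3pt position bucket yields the same key.
  const keyOf = (it: PageItem) =>
    `${it.text.trim()}|${Math.round(it.x / TOL)}|${Math.round(it.y / TOL)}`;

  // Record on which pages each (text, position) key occurs.
  const pagesByKey = new Map<string, Set<number>>();
  for (const it of items) {
    if (!inZone(it)) continue;
    const key = keyOf(it);
    let pages = pagesByKey.get(key);
    if (!pages) {
      pages = new Set<number>();
      pagesByKey.set(key, pages);
    }
    pages.add(it.page);
  }

  return items.filter((it) => {
    if (!inZone(it)) return true;
    // An isolated number in a margin zone is treated as a page number.
    if (/^\d+$/.test(it.text.trim())) return false;
    const pages = pagesByKey.get(keyOf(it));
    const share = pages ? pages.size / pageCount : 0;
    return share < MIN_SHARE;
  });
}
```

Bucketing is cheaper than pairwise comparison, but two items less than 3pt apart can still fall into neighboring buckets; a stricter version would compare positions directly.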

Step 3: Column Detection

Many PDFs use multi-column layouts. Without column detection, a two-column document would interleave content from both columns line by line — producing unreadable output.

Our detector builds an x-position histogram of all text items, identifies gaps ≥20pt as column separators, and reorders content so each column reads top-to-bottom before moving to the next. We support up to 3 columns per page.
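A minimal sketch of the histogram idea, assuming each item carries its left edge, width, and a `y` measured from the top of the page. The names `Box` and `orderByColumns` are hypothetical, and keeping only the widest gaps when more than two are found is our assumption for enforcing the 3-column limit.

```typescript
interface Box {
  text: string;
  x: number;
  y: number; // distance from the top of the page (assumed)
  width: number;
}

function orderByColumns(items: Box[], pageWidth: number): Box[] {
  const MIN_GAP = 20;    // minimum empty width that counts as a separator
  const MAX_COLUMNS = 3;

  // Histogram with 1pt bins: how many items cover each x position.
  const bins: number[] = new Array(Math.ceil(pageWidth)).fill(0);
  for (const it of items) {
    const start = Math.max(0, Math.floor(it.x));
    const end = Math.min(bins.length, Math.ceil(it.x + it.width));
    for (let i = start; i < end; i++) bins[i]++;
  }

  // Collect empty runs of at least MIN_GAP that have text on both sides.
  const gaps: { mid: number; width: number }[] = [];
  let runStart = -1;
  let seenText = false;
  for (let i = 0; i < bins.length; i++) {
    if (bins[i] === 0) {
      if (runStart < 0) runStart = i;
    } else {
      if (seenText && runStart >= 0 && i - runStart >= MIN_GAP) {
        gaps.push({ mid: (runStart + i) / 2, width: i - runStart });
      }
      seenText = true;
      runStart = -1;
    }
  }

  // Keep the widest gaps so that at most MAX_COLUMNS columns remain.
  const cuts = gaps
    .sort((p, q) => q.width - p.width)
    .slice(0, MAX_COLUMNS - 1)
    .map((g) => g.mid)
    .sort((p, q) => p - q);

  // Column index = number of separators to the left of the item.
  const columnOf = (it: Box) => cuts.filter((c) => it.x >= c).length;
  return items
    .slice()
    .sort((p, q) => columnOf(p) - columnOf(q) || p.y - q.y);
}
```

One limitation of this simple version: a full-width element such as a title that spans both columns fills the gap in the histogram, so it would need to be excluded before the gaps are measured.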

Step 4: Heading Classification

Since PDFs don't mark headings explicitly, we infer them from font size relative to body text:

  1. Detect the "body size" — the most frequently used font size in the document
  2. Collect all font sizes larger than body size + 1pt
  3. Sort them descending and map: largest → H1, second → H2, third → H3, fourth → H4

This means if your PDF uses 12pt for body text, 24pt for titles, and 18pt for sections — those map cleanly to # H1 and ## H2.
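The mapping can be sketched in a few lines. The function names here are hypothetical, and leaving any size beyond the fourth-largest as body text is an assumption of this sketch.

```typescript
// Build a lookup from font size to heading level (1 to 4).
function buildHeadingLevels(fontSizes: number[]): Map<number, number> {
  // 1. Body size: the most frequently used font size.
  const counts = new Map<number, number>();
  for (const s of fontSizes) counts.set(s, (counts.get(s) || 0) + 1);

  let bodySize = 0;
  let best = 0;
  const sizes: number[] = [];
  counts.forEach((n, size) => {
    sizes.push(size);
    if (n > best) {
      best = n;
      bodySize = size;
    }
  });

  // 2. Candidate heading sizes: more than 1pt larger than the body size.
  const larger = sizes.filter((s) => s > bodySize + 1).sort((a, b) => b - a);

  // 3. Largest maps to H1, then H2, H3, H4.
  const levels = new Map<number, number>();
  larger.slice(0, 4).forEach((size, i) => levels.set(size, i + 1));
  return levels;
}

// Markdown prefix for a line set in the given font size ("" for body text).
function headingPrefix(size: number, levels: Map<number, number>): string {
  const level = levels.get(size);
  return level ? "#".repeat(level) + " " : "";
}
```

With sizes of 12pt (most frequent), 24pt, and 18pt, `headingPrefix(24, levels)` yields `"# "` and `headingPrefix(18, levels)` yields `"## "`. In practice, extracted sizes often carry small rounding differences, so rounding them before counting may help.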

Step 5: Table Detection

Tables are the hardest element to extract. We detect them by finding grid-aligned text regions:

  1. Cluster by x-position (5pt tolerance) to identify columns
  2. Cluster by y-position (2pt tolerance) to identify rows
  3. If ≥2 columns AND ≥2 rows are detected, it's a table
  4. First row becomes the header, with a --- separator below

The output is a standard Markdown pipe table — compatible with GitHub, Obsidian, and all major renderers. If merged cells are detected, we fall back to plain text to avoid mangled output.
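Here is a simplified sketch of the clustering and the pipe-table output. `Cell`, `cluster`, `nearestIndex`, and `toMarkdownTable` are hypothetical names, `y` is assumed to be measured from the top of the page, and merged-cell detection is left out.

```typescript
interface Cell {
  text: string;
  x: number;
  y: number; // distance from the top of the page (assumed)
}

// Cluster sorted values: a value starts a new cluster when it lies more
// than `tol` beyond the first value of the current cluster.
function cluster(values: number[], tol: number): number[] {
  const sorted = values.slice().sort((a, b) => a - b);
  const anchors: number[] = [];
  for (const v of sorted) {
    if (anchors.length === 0 || v - anchors[anchors.length - 1] > tol) {
      anchors.push(v);
    }
  }
  return anchors;
}

function nearestIndex(anchors: number[], v: number): number {
  let best = 0;
  for (let i = 1; i < anchors.length; i++) {
    if (Math.abs(anchors[i] - v) < Math.abs(anchors[best] - v)) best = i;
  }
  return best;
}

// Returns a Markdown pipe table, or null if the region is not a grid.
function toMarkdownTable(cells: Cell[]): string | null {
  const cols = cluster(cells.map((c) => c.x), 5); // 5pt x tolerance
  const rows = cluster(cells.map((c) => c.y), 2); // 2pt y tolerance
  if (cols.length < 2 || rows.length < 2) return null;

  const grid: string[][] = rows.map(() => cols.map(() => ""));
  for (const c of cells) {
    const r = nearestIndex(rows, c.y);
    const k = nearestIndex(cols, c.x);
    grid[r][k] = (grid[r][k] + " " + c.text).trim();
  }

  const line = (row: string[]) => "| " + row.join(" | ") + " |";
  const separator = line(cols.map(() => "---"));
  // The first row becomes the header, followed by the --- separator.
  return [line(grid[0]), separator]
    .concat(grid.slice(1).map((row) => line(row)))
    .join("\n");
}
```

A fuller version would also escape `|` characters inside cell text so they do not break the table.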

Step 6: List Detection

We identify lists by looking for bullet characters (such as -) or ordered patterns (1., a)) at the start of lines. Key rules:
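The marker check itself can be sketched with two regular expressions. The exact set of bullet characters below is an assumption for illustration, as are the names `matchListItem` and `toMarkdownListLine`.

```typescript
interface ListMatch {
  ordered: boolean;
  content: string;
}

// Assumed marker set; the converter's exact list may differ.
const BULLET = /^\s*[•\-\*▪◦]\s+(.*)$/;
// Ordered markers such as "1.", "2)" or "a)".
const ORDERED = /^\s*(?:\d+[.)]|[a-zA-Z]\))\s+(.*)$/;

function matchListItem(line: string): ListMatch | null {
  const bullet = BULLET.exec(line);
  if (bullet) return { ordered: false, content: bullet[1] };
  const ordered = ORDERED.exec(line);
  if (ordered) return { ordered: true, content: ordered[1] };
  return null;
}

// Rewrite a detected list line with a Markdown marker; leave others as-is.
function toMarkdownListLine(line: string): string {
  const m = matchListItem(line);
  if (!m) return line;
  return (m.ordered ? "1. " : "- ") + m.content;
}
```

Requiring whitespace after the marker keeps lines like `-5 degrees` or `3.14 is pi` from being mistaken for list items. Emitting `1.` for every ordered item is a shortcut: Markdown renderers number the items sequentially.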

Step 7: Paragraph Merging

Raw PDF text comes as scattered fragments. We reassemble them into coherent paragraphs using vertical gap analysis:

This ensures natural paragraph breaks without losing text or creating excessive whitespace.
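A minimal sketch of gap-based merging, assuming lines arrive in reading order with a baseline `y` measured from the top of the page. The 1.5× font-size threshold and the names `Line` and `mergeParagraphs` are assumptions for illustration, not the converter's actual values.

```typescript
interface Line {
  text: string;
  y: number;        // baseline distance from the top of the page (assumed)
  fontSize: number;
}

function mergeParagraphs(lines: Line[], gapFactor = 1.5): string[] {
  const paragraphs: string[] = [];
  let current = "";
  let prev: Line | null = null;

  for (const line of lines) {
    // A baseline-to-baseline jump well above normal line spacing
    // starts a new paragraph.
    const startsNew =
      prev === null || line.y - prev.y > prev.fontSize * gapFactor;
    if (startsNew) {
      if (current) paragraphs.push(current);
      current = line.text.trim();
    } else {
      current += " " + line.text.trim();
    }
    prev = line;
  }
  if (current) paragraphs.push(current);
  return paragraphs;
}
```

For 12pt text with 14pt line spacing, consecutive lines stay together (14 ≤ 18), while a 32pt jump opens a new paragraph.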

Step 8: Inline Formatting

Finally, we apply Markdown formatting based on font metadata:

Link annotations from the PDF are also matched to their corresponding text and wrapped in [text](url) syntax.
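As one possible illustration, the sketch below infers bold and italic from the font's name and wraps linked text. It assumes the real font name (for example `Helvetica-Bold`) and any matched link URL are already attached to each run; `Run` and `formatRun` are hypothetical names, and name matching is a common heuristic rather than a confirmed description of our rules.

```typescript
interface Run {
  text: string;
  fontName: string; // real font name, e.g. "Helvetica-Bold" (assumed available)
  url?: string;     // target of a matching link annotation, if any
}

function formatRun(run: Run): string {
  const core = run.text.trim();
  if (core === "") return run.text;

  // Keep surrounding whitespace outside the markers, because "**word **"
  // is not parsed as emphasis by Markdown renderers.
  const start = run.text.indexOf(core);
  const lead = run.text.slice(0, start);
  const trail = run.text.slice(start + core.length);

  const name = run.fontName.toLowerCase();
  const bold = /bold/.test(name);
  const italic = /italic|oblique/.test(name);

  let out = core;
  if (bold && italic) out = `***${out}***`;
  else if (bold) out = `**${out}**`;
  else if (italic) out = `*${out}*`;

  if (run.url) out = `[${out}](${run.url})`;
  return lead + out + trail;
}
```

For example, a run `{ text: "Hello", fontName: "Helvetica-Bold" }` becomes `**Hello**`.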

Why Client-Side Only?

Running everything in your browser provides key advantages:

Error Handling

Not every PDF converts perfectly. Our system handles edge cases gracefully:
