How PDF to Markdown Conversion Works

Ever wondered what happens behind the scenes when you drop a PDF into our converter and get clean Markdown out? The entire process runs in your browser: there are no server uploads and no API calls, so your document never leaves your device. Here's a deep dive into the pipeline.

The Core Challenge

PDF is a visual format. It tells a renderer where to place characters on a page, but it usually carries little or no semantic information. Unless the file is tagged (and many are not), there are no "headings," no "paragraphs," no "lists" in a PDF. Just positioned glyphs with font metadata.

Converting PDF to Markdown means reconstructing structure from visual cues: font sizes, weights, positions, and spacing. Our heuristic engine infers what a human would read as a heading, a list, or a table — entirely from spatial and typographic relationships.

The Conversion Pipeline

The conversion happens in a series of well-defined stages:

PDF Upload → Validation → Text Extraction → Header/Footer Removal → Column Detection → Heading Classification → Table Detection → List Detection → Paragraph Merging → Inline Formatting → Markdown Output

Step 1: Text Extraction with PDF.js

We use PDF.js (the same library that powers Firefox's PDF viewer) to extract every text element from your document. PDF.js runs its heavy parsing inside a Web Worker, keeping the main browser thread responsive while processing large files.

For each text fragment, we extract:

Content — the actual text string
Position — exact x/y coordinates on the page
Font size — derived from the PDF's transformation matrix
Font weight — inferred from the font name (e.g., "Helvetica-Bold")
Style — italic/oblique detection from font descriptors
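
To make the field list above concrete, here is a minimal sketch of how one extracted item could be normalized. The `RawTextItem` shape mirrors the `str`, `transform`, and `fontName` fields that PDF.js reports for text items; the `Fragment` type and the `toFragment` helper are hypothetical names used only for illustration, and the sketch assumes `fontName` has already been resolved to a real font name such as "Helvetica-Bold".

```typescript
// Fields modelled on a PDF.js text item; fontName is ASSUMED to be the resolved font name.
interface RawTextItem {
  str: string;
  transform: number[]; // [a, b, c, d, e, f] text matrix
  fontName: string;
}

// Hypothetical normalized shape used by the later stages.
interface Fragment {
  text: string;
  x: number;
  y: number;
  fontSize: number;
  bold: boolean;
  italic: boolean;
}

function toFragment(item: RawTextItem): Fragment {
  const t = item.transform;
  const c = t[2] ?? 0;
  const d = t[3] ?? 0;
  // The length of the matrix's vertical axis is the rendered font size,
  // which stays correct even when the text is rotated.
  const fontSize = Math.sqrt(c * c + d * d);
  return {
    text: item.str,
    x: t[4] ?? 0, // horizontal translation
    y: t[5] ?? 0, // vertical translation (PDF origin is bottom-left)
    fontSize: Math.round(fontSize * 100) / 100,
    bold: /bold/i.test(item.fontName),
    italic: /italic|oblique/i.test(item.fontName),
  };
}
```

Rounding the size to two decimals keeps tiny floating-point differences from splitting one visual size into several buckets later on.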

Step 2: Header & Footer Removal

Repeated headers, footers, and page numbers clutter the output. Our stripper algorithm:

Scans text in the top 10% and bottom 10% of each page
Compares content across all pages (with a 3pt position tolerance)
If identical text appears at the same position on ≥80% of pages, it's classified as header/footer and removed
Isolated numbers in those zones are treated as page numbers
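
The rules above can be sketched roughly as follows. This is an illustrative approximation rather than our production code: the `Page` and `PageItem` types and all function names are hypothetical, and the sketch simply treats both outer 10% bands of the page as the header/footer zone.

```typescript
interface PageItem { text: string; x: number; y: number; }
interface Page { height: number; items: PageItem[]; }

const ZONE = 0.1;       // top and bottom 10% of the page
const TOLERANCE_PT = 3; // position tolerance in points
const MIN_SHARE = 0.8;  // must repeat on at least 80% of pages

function inMargin(item: PageItem, page: Page): boolean {
  return item.y <= page.height * ZONE || item.y >= page.height * (1 - ZONE);
}

function samePlace(a: PageItem, b: PageItem): boolean {
  return a.text.trim() === b.text.trim()
    && Math.abs(a.x - b.x) <= TOLERANCE_PT
    && Math.abs(a.y - b.y) <= TOLERANCE_PT;
}

// Count the pages that carry the same text at (almost) the same position.
function isRepeated(item: PageItem, pages: Page[]): boolean {
  const hits = pages.filter((p) =>
    p.items.some((other) => inMargin(other, p) && samePlace(item, other))
  ).length;
  return hits >= pages.length * MIN_SHARE;
}

function stripHeadersAndFooters(pages: Page[]): Page[] {
  return pages.map((page) => ({
    height: page.height,
    items: page.items.filter((item) => {
      if (!inMargin(item, page)) return true;           // body text is always kept
      if (/^\d+$/.test(item.text.trim())) return false; // isolated number: page number
      return !isRepeated(item, pages);
    }),
  }));
}
```

Comparing every margin item against every page is quadratic, which is fine for a sketch; a real implementation would probably index margin items by text first.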

Step 3: Column Detection

Many PDFs use multi-column layouts. Without column detection, a two-column document would interleave content from both columns line by line — producing unreadable output.

Our detector builds an x-position histogram of all text items, identifies gaps ≥20pt as column separators, and reorders content so each column reads top-to-bottom before moving to the next. We support up to 3 columns per page.
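
As a rough illustration of the idea, the sketch below finds column boundaries by sweeping over the horizontal extent of each text item instead of building a binned histogram; the effect is the same (empty horizontal bands of at least 20pt become separators), but it is a swapped-in simplification. The `Item` type, the `width` field, and both function names are assumptions made for this example.

```typescript
interface Item { text: string; x: number; y: number; width: number; }

const MIN_GAP_PT = 20; // empty horizontal band that counts as a column separator
const MAX_COLUMNS = 3;

// Returns the x-coordinates at which the second and third columns start.
function findColumnStarts(items: Item[]): number[] {
  const spans = items
    .map((i) => ({ start: i.x, end: i.x + i.width }))
    .sort((p, q) => p.start - q.start);
  const starts: number[] = [];
  let coveredUntil = -Infinity;
  for (const span of spans) {
    if (coveredUntil !== -Infinity && span.start - coveredUntil >= MIN_GAP_PT) {
      starts.push(span.start);
    }
    coveredUntil = Math.max(coveredUntil, span.end);
  }
  // Anything beyond the supported number of columns is folded into the last column.
  return starts.slice(0, MAX_COLUMNS - 1);
}

function reorderByColumn(items: Item[]): Item[] {
  const starts = findColumnStarts(items);
  const columnOf = (i: Item) => starts.filter((s) => i.x >= s).length;
  // Column first, then top to bottom (PDF y grows upward, so larger y is higher).
  return items.slice().sort((p, q) => columnOf(p) - columnOf(q) || q.y - p.y);
}
```

One known weakness of this simplification: a full-width title spanning both columns covers the gap and hides the separator, so titles would need to be set aside before the sweep.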

Step 4: Heading Classification

Since PDFs don't mark headings explicitly, we infer them from font size relative to body text:

Detect the "body size" — the most frequently used font size in the document
Collect all font sizes larger than body size + 1pt
Sort them descending and map: largest → H1, second → H2, third → H3, fourth → H4

This means that if your PDF uses 12pt for body text, 24pt for titles, and 18pt for sections, the 24pt titles become # (H1) and the 18pt sections become ## (H2).
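
The mapping can be sketched in a few lines. The function names are hypothetical, and two details are assumptions: frequency is counted per text fragment, and sizes beyond the fourth-largest are left as body text.

```typescript
// Most frequent font size, counted per fragment (an assumption; counting per
// character would be another reasonable choice).
function detectBodySize(sizes: number[]): number {
  const counts: Record<string, number> = {};
  let best = 0;
  let bestCount = 0;
  for (const s of sizes) {
    const n = (counts[String(s)] ?? 0) + 1;
    counts[String(s)] = n;
    if (n > bestCount) {
      best = s;
      bestCount = n;
    }
  }
  return best;
}

// Maps each qualifying font size to a heading level 1..4.
function buildHeadingMap(sizes: number[]): Record<string, number> {
  const body = detectBodySize(sizes);
  const larger = sizes
    .filter((s, i) => s > body + 1 && sizes.indexOf(s) === i) // distinct sizes only
    .sort((a, b) => b - a);
  const map: Record<string, number> = {};
  larger.slice(0, 4).forEach((size, i) => {
    map[String(size)] = i + 1;
  });
  return map;
}

function headingPrefix(size: number, map: Record<string, number>): string {
  const level = map[String(size)];
  return level ? "####".slice(0, level) + " " : "";
}
```

With sizes `[12, 12, 12, 24, 18]`, `headingPrefix(24, map)` yields `"# "` and `headingPrefix(18, map)` yields `"## "`, while 12pt text gets no prefix.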

Step 5: Table Detection

Tables are the hardest element to extract. We detect them by finding grid-aligned text regions:

Cluster by x-position (5pt tolerance) to identify columns
Cluster by y-position (2pt tolerance) to identify rows
If ≥2 columns AND ≥2 rows are detected, it's a table
First row becomes the header, with a --- separator below

The output is a standard Markdown pipe table, which renders on GitHub, in Obsidian, and in most other renderers that support GitHub Flavored Markdown tables. If merged cells are detected, we fall back to plain text to avoid mangled output.
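
A simplified sketch of the grid detection and pipe-table output might look like the following. It assumes the candidate region has already been isolated, uses hypothetical names (`Cell`, `cluster`, `toMarkdownTable`), and leaves out the merged-cell fallback.

```typescript
interface Cell { text: string; x: number; y: number; }

// Groups sorted values into clusters whose neighbours differ by at most `tol`
// and returns the smallest value of each cluster as its anchor.
function cluster(values: number[], tol: number): number[] {
  const sorted = values.slice().sort((a, b) => a - b);
  const anchors: number[] = [];
  let last = -Infinity;
  for (const v of sorted) {
    if (v - last > tol) anchors.push(v);
    last = v;
  }
  return anchors;
}

// Index of the cluster a value belongs to (the last anchor not greater than it).
const indexIn = (anchors: number[], v: number): number =>
  anchors.filter((a) => a <= v).length - 1;

function toMarkdownTable(cells: Cell[]): string | null {
  const colAnchors = cluster(cells.map((c) => c.x), 5);
  const rowAnchors = cluster(cells.map((c) => c.y), 2);
  if (colAnchors.length < 2 || rowAnchors.length < 2) return null; // not a table

  const grid: string[][] = rowAnchors.map(() => colAnchors.map(() => ""));
  for (const c of cells) {
    // PDF y grows upward, so the highest y becomes row 0 (the header row).
    const r = rowAnchors.length - 1 - indexIn(rowAnchors, c.y);
    const k = indexIn(colAnchors, c.x);
    const row = grid[r];
    if (!row) continue;
    const text = c.text.trim().replace(/\|/g, "\\|"); // escape literal pipes
    const prev = row[k] ?? "";
    row[k] = prev ? prev + " " + text : text;
  }

  const toRow = (cols: string[]) => "| " + cols.join(" | ") + " |";
  const lines = [toRow(grid[0] ?? []), toRow(colAnchors.map(() => "---"))];
  for (const row of grid.slice(1)) lines.push(toRow(row));
  return lines.join("\n");
}
```

Returning `null` when fewer than two rows or columns are found lets the caller hand the same fragments on to the later stages unchanged.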

Step 6: List Detection

We identify lists by looking for bullet characters (•, -, ▪, ○) or ordered markers (such as "1." or "a)") at the start of lines. Key rules:

A bullet that isn't adjacent to other list items is treated as a regular paragraph, which avoids most false positives
Nesting is detected from x-position indentation — up to 3 levels deep
Each nesting level adds 2 spaces of indentation in the output
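
Here is one way these rules could be expressed. Treat it as a sketch under assumptions: the 18pt indentation step, the rewriting of lettered markers to "1.", and all names (`Line`, `parseItem`, `renderLines`) are illustrative choices rather than documented behavior.

```typescript
interface Line { text: string; x: number; }
interface ListItem { marker: string; body: string; }

const BULLET = /^[•\-▪○]\s+(.*)$/;
const ORDERED = /^(\d+\.|[a-z]\))\s+(.*)$/i;
const MAX_LEVEL = 2; // levels 0..2, i.e. three levels deep

function parseItem(text: string): ListItem | null {
  const t = text.trim();
  const bullet = BULLET.exec(t);
  if (bullet) return { marker: "-", body: bullet[1] ?? "" };
  const ordered = ORDERED.exec(t);
  if (ordered) {
    const raw = ordered[1] ?? "1.";
    // ASSUMPTION: lettered markers such as "a)" are rewritten to "1."
    // so that Markdown renderers still number the item.
    return { marker: /^\d/.test(raw) ? raw : "1.", body: ordered[2] ?? "" };
  }
  return null;
}

function renderLines(lines: Line[], indentStepPt = 18): string[] {
  const items = lines.map((l) => parseItem(l.text));
  const xs = lines.filter((_, i) => items[i]).map((l) => l.x);
  const baseX = Math.min(...xs); // left edge of the outermost list level
  return lines.map((line, i) => {
    const item = items[i];
    const hasNeighbour = Boolean(items[i - 1]) || Boolean(items[i + 1]);
    // A lone bullet with no list item before or after it stays a paragraph.
    if (!item || !hasNeighbour) return line.text.trim();
    const steps = Math.round((line.x - baseX) / indentStepPt);
    const level = Math.min(MAX_LEVEL, Math.max(0, steps));
    return "    ".slice(0, level * 2) + item.marker + " " + item.body;
  });
}
```

A lone line that starts with "-" is returned unchanged here; a fuller implementation would likely escape the hyphen so a renderer does not read it as a list.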

Step 7: Paragraph Merging

Raw PDF text comes as scattered fragments. We reassemble them into coherent paragraphs using vertical gap analysis:

Gap ≤ 1.5× line height and same font size → same paragraph (joined with spaces)
Gap > 1.5× line height → new paragraph (blank line inserted)

This ensures natural paragraph breaks without losing text or creating excessive whitespace.
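
The gap rule can be sketched as below. Two points are assumptions on our part for the sake of a runnable example: the gap is measured baseline to baseline, and a change in font size also starts a new paragraph. The `TextLine` type and `mergeParagraphs` name are hypothetical.

```typescript
interface TextLine { text: string; y: number; fontSize: number; lineHeight: number; }

// Expects lines in reading order (top to bottom).
function mergeParagraphs(lines: TextLine[]): string {
  const paragraphs: string[] = [];
  let current = "";
  let prev: TextLine | undefined;
  for (const line of lines) {
    // ASSUMPTION: gap is the baseline-to-baseline distance between consecutive lines.
    const gap = prev ? Math.abs(prev.y - line.y) : 0;
    const sameParagraph =
      prev !== undefined &&
      gap <= 1.5 * prev.lineHeight &&
      prev.fontSize === line.fontSize; // ASSUMPTION: a size change breaks the paragraph
    if (sameParagraph) {
      current += " " + line.text.trim();
    } else {
      if (current) paragraphs.push(current);
      current = line.text.trim();
    }
    prev = line;
  }
  if (current) paragraphs.push(current);
  return paragraphs.join("\n\n"); // blank line between paragraphs
}
```

For 12pt text with a 14pt line height, consecutive lines 14pt apart stay together, while a 28pt jump (one skipped line) exceeds the 21pt threshold and starts a new paragraph.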

Step 8: Inline Formatting

Finally, we apply Markdown formatting based on font metadata:

Font name contains "Bold" or weight ≥ 700 → **bold**
Font name contains "Italic" or "Oblique" → *italic*
Both conditions → ***bold italic***

Link annotations from the PDF are also matched to their corresponding text and wrapped in [text](url) syntax.
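
A small sketch of these rules follows. The `Span` shape, including the optional `fontWeight` and `url` fields, is hypothetical and assumes the link annotation has already been matched to the span. Whitespace is moved outside the markers because Markdown does not treat `**bold **` as emphasis.

```typescript
// Hypothetical span shape; url is present only when a link annotation matched.
interface Span { text: string; fontName: string; fontWeight?: number; url?: string; }

function formatSpan(span: Span): string {
  const core = span.text.trim();
  if (!core) return span.text; // whitespace-only spans are passed through

  // Keep leading/trailing whitespace outside the emphasis markers.
  const start = span.text.indexOf(core);
  const lead = span.text.slice(0, start);
  const trail = span.text.slice(start + core.length);

  const bold = /bold/i.test(span.fontName) || (span.fontWeight ?? 400) >= 700;
  const italic = /italic|oblique/i.test(span.fontName);
  const marker = bold && italic ? "***" : bold ? "**" : italic ? "*" : "";

  let out = marker + core + marker;
  if (span.url) out = "[" + out + "](" + span.url + ")";
  return lead + out + trail;
}
```

For example, `formatSpan({ text: "Warning ", fontName: "Helvetica-Bold" })` returns `"**Warning** "`, with the trailing space preserved after the closing marker.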

Why Client-Side Only?

Running everything in your browser provides key advantages:

Privacy — Your documents never leave your device
Speed — No upload/download latency; processing starts instantly
Offline capable — Works without an internet connection after initial load
No server limits — No upload quotas or rate limiting; the only caps are local safeguards on file size and processing time

Error Handling

Not every PDF converts perfectly. Our system handles edge cases gracefully:

Image-only PDFs — Detected when pages return zero text items; user is advised to run OCR first
Password-protected files — Caught immediately with a clear error message
Memory pressure — Large files are subject to a 50 MB limit and a 10-second timeout to prevent browser crashes
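
The size, image-only, and timeout checks could be sketched like this. All names are hypothetical, and treating "50 MB" as 50 × 1024 × 1024 bytes is an assumption. Note that racing against a timer only stops waiting for the result; actually halting the work would also require terminating the worker.

```typescript
const MAX_BYTES = 50 * 1024 * 1024; // ASSUMPTION: "50 MB" interpreted as 50 MiB
const TIMEOUT_MS = 10_000;

function fitsSizeLimit(byteLength: number): boolean {
  return byteLength <= MAX_BYTES;
}

// A document whose pages all return zero text items is probably scanned images.
function isImageOnly(textItemsPerPage: number[]): boolean {
  return textItemsPerPage.length > 0 && textItemsPerPage.every((n) => n === 0);
}

// Rejects if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number = TIMEOUT_MS): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("Conversion timed out after " + ms + " ms")), ms);
  });
  return Promise.race([work, timeout]).then(
    (value) => {
      clearTimeout(timer);
      return value;
    },
    (err) => {
      clearTimeout(timer);
      throw err;
    }
  );
}
```

A caller would typically run `fitsSizeLimit(file.size)` before reading the file at all, then wrap the conversion promise in `withTimeout(...)` and show the rejection message to the user.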
