How PDF to Markdown Conversion Works
Ever wondered what happens behind the scenes when you drop a PDF into our converter and get clean Markdown out? The entire process runs 100% in your browser — no server uploads, no API calls, no privacy concerns. Here's a deep dive into the pipeline.
The Core Challenge
PDF is a visual format. It tells a renderer where to place characters on a page, but most files carry little or no semantic information. Unless a PDF has been explicitly tagged, there are no "headings," no "paragraphs," and no "lists" in it, just positioned glyphs with font metadata.
Converting PDF to Markdown means reconstructing structure from visual cues: font sizes, weights, positions, and spacing. Our heuristic engine infers what a human would read as a heading, a list, or a table — entirely from spatial and typographic relationships.
The Conversion Pipeline
The conversion happens in a series of well-defined stages:
Step 1: Text Extraction with PDF.js
We use PDF.js (the same library that powers Firefox's PDF viewer) to extract every text element from your document. PDF.js runs its heavy parsing inside a Web Worker, keeping the main browser thread responsive while processing large files.
For each text fragment, we extract:
- Content — the actual text string
- Position — exact x/y coordinates on the page
- Font size — derived from the PDF's transformation matrix
- Font weight — inferred from the font name (e.g., "Helvetica-Bold")
- Style — italic/oblique detection from font descriptors
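As a rough illustration of how those fields can be derived, here is a minimal sketch rather than the converter's actual code. The `RawTextItem` interface only mirrors the `str` and `transform` fields that PDF.js text items expose; `Fragment`, `toFragment`, and the `resolvedFontName` parameter are hypothetical names. The sketch assumes the real font name (such as "Helvetica-Bold") has already been resolved, since the font identifier attached to a text item may be an internal ID rather than a readable name.

```typescript
// Local stand-in for the fields read from a PDF.js text item.
interface RawTextItem {
  str: string;
  transform: number[]; // affine matrix [a, b, c, d, e, f]
}

// Hypothetical normalized fragment passed to later stages.
interface Fragment {
  text: string;
  x: number;
  y: number;
  fontSize: number;
  bold: boolean;
  italic: boolean;
}

// resolvedFontName is assumed to be the readable font name.
function toFragment(item: RawTextItem, resolvedFontName: string): Fragment {
  const [, , c, d, e, f] = item.transform;
  return {
    text: item.str,
    x: e, // horizontal translation
    y: f, // vertical translation (PDF origin is bottom-left)
    // Length of the matrix's vertical axis; also valid for rotated text.
    fontSize: Math.sqrt(c * c + d * d),
    bold: /bold/i.test(resolvedFontName),
    italic: /italic|oblique/i.test(resolvedFontName),
  };
}
```

For example, `toFragment({ str: "Hi", transform: [12, 0, 0, 12, 72, 700] }, "Helvetica-Bold")` describes a bold 12pt fragment at (72, 700).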
Step 2: Header & Footer Removal
Repeated headers, footers, and page numbers clutter the output. Our stripper algorithm:
- Scans text in the top 10% and bottom 10% of each page
- Compares content across all pages (with a 3pt position tolerance)
- If identical text appears at the same position on ≥80% of pages, it's classified as header/footer and removed
- Isolated numbers in those zones are treated as page numbers
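The rules above can be sketched as follows. This is an illustrative approximation, not the production stripper: `Item`, `Candidate`, and `stripHeadersFooters` are hypothetical names, every page is assumed to share one `pageHeight`, and coordinates are assumed to follow the PDF convention where y = 0 is the bottom of the page.

```typescript
interface Item { text: string; x: number; y: number; }
interface Candidate { item: Item; pageCount: number; lastPage: number; }

const ZONE = 0.1;      // top and bottom 10% of the page
const TOLERANCE = 3;   // position tolerance in points
const MIN_SHARE = 0.8; // must repeat on at least 80% of pages

function samePlace(a: Item, b: Item): boolean {
  return a.text === b.text &&
    Math.abs(a.x - b.x) <= TOLERANCE &&
    Math.abs(a.y - b.y) <= TOLERANCE;
}

function stripHeadersFooters(pages: Item[][], pageHeight: number): Item[][] {
  const inZone = (it: Item) =>
    it.y >= pageHeight * (1 - ZONE) || it.y <= pageHeight * ZONE;

  // Count on how many distinct pages each margin-zone item recurs.
  const candidates: Candidate[] = [];
  pages.forEach((items, pageIndex) => {
    for (const it of items) {
      if (!inZone(it)) continue;
      const match = candidates.filter(c => samePlace(c.item, it))[0];
      if (!match) {
        candidates.push({ item: it, pageCount: 1, lastPage: pageIndex });
      } else if (match.lastPage !== pageIndex) {
        match.pageCount++;
        match.lastPage = pageIndex;
      }
    }
  });

  const repeated = candidates.filter(c => c.pageCount / pages.length >= MIN_SHARE);
  const isRepeated = (it: Item) => repeated.some(c => samePlace(c.item, it));
  // A bare number inside a margin zone is treated as a page number.
  const isPageNumber = (it: Item) => /^\d+$/.test(it.text.trim());

  return pages.map(items =>
    items.filter(it => !(inZone(it) && (isRepeated(it) || isPageNumber(it)))));
}
```

Text that sits in a margin zone but appears on only a few pages, such as a chapter title, is kept.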
Step 3: Column Detection
Many PDFs use multi-column layouts. Without column detection, a two-column document would interleave content from both columns line by line — producing unreadable output.
Our detector builds an x-position histogram of all text items, identifies gaps ≥20pt as column separators, and reorders content so each column reads top-to-bottom before moving to the next. We support up to 3 columns per page.
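To make the idea concrete, here is a simplified sketch. Instead of a binned histogram it merges the horizontal extents of all items and looks for uncovered gaps, which finds the same empty bands; treat that as a stand-in, not the converter's implementation. `Box` and `reorderByColumns` are hypothetical names, each item is assumed to carry a `width`, and when more separators than allowed are found the sketch keeps the widest ones (an assumption the text does not specify).

```typescript
interface Box { text: string; x: number; y: number; width: number; }

const MIN_GAP = 20;    // gaps at least this wide (points) separate columns
const MAX_COLUMNS = 3;

function reorderByColumns(items: Box[]): Box[] {
  if (items.length === 0) return [];

  // Merge every item's horizontal extent into covered ranges.
  const spans = items
    .map(it => [it.x, it.x + it.width] as [number, number])
    .sort((a, b) => a[0] - b[0]);
  const covered: [number, number][] = [];
  for (const [start, end] of spans) {
    const last = covered[covered.length - 1];
    if (last && start <= last[1]) last[1] = Math.max(last[1], end);
    else covered.push([start, end]);
  }

  // Wide uncovered bands between ranges are candidate separators.
  const gaps: { mid: number; width: number }[] = [];
  for (let i = 1; i < covered.length; i++) {
    const width = covered[i][0] - covered[i - 1][1];
    if (width >= MIN_GAP) gaps.push({ mid: covered[i - 1][1] + width / 2, width });
  }
  const separators = gaps
    .sort((a, b) => b.width - a.width)
    .slice(0, MAX_COLUMNS - 1)
    .map(g => g.mid)
    .sort((a, b) => a - b);

  // Column index = number of separators to the left of the item.
  const columnOf = (it: Box) => separators.filter(s => it.x > s).length;

  // Read each column top-to-bottom (PDF y grows upward), left column first.
  return items.slice().sort((a, b) => columnOf(a) - columnOf(b) || b.y - a.y);
}
```

An element that spans the full page width, such as a title, would bridge the gap in this sketch, so a real detector would need to handle such items separately.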
Step 4: Heading Classification
Since PDFs don't mark headings explicitly, we infer them from font size relative to body text:
- Detect the "body size" — the most frequently used font size in the document
- Collect all font sizes larger than body size + 1pt
- Sort them descending and map: largest → H1, second → H2, third → H3, fourth → H4
For example, if your PDF uses 12pt for body text, 24pt for titles, and 18pt for section headings, the titles map to # H1 and the section headings to ## H2.
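A minimal sketch of this mapping is shown below. `classifyHeadings` is a hypothetical helper, frequency is counted per text item (counting per character would be an equally plausible choice), and sizes beyond the fourth-largest are simply left unmapped because the rules above do not say what happens to them.

```typescript
function classifyHeadings(
  fontSizes: number[],
): { bodySize: number; levels: Record<number, number> } {
  if (fontSizes.length === 0) return { bodySize: 0, levels: {} };

  // Count how often each font size occurs.
  const counts: Record<number, number> = {};
  for (const size of fontSizes) counts[size] = (counts[size] || 0) + 1;
  const distinct = Object.keys(counts).map(Number);

  // Body size: the most frequently used size.
  const bodySize = distinct.reduce(
    (best, s) => (counts[s] > counts[best] ? s : best),
    distinct[0],
  );

  // Sizes more than 1pt above body, largest first, mapped to H1..H4.
  const levels: Record<number, number> = {};
  distinct
    .filter(s => s > bodySize + 1)
    .sort((a, b) => b - a)
    .slice(0, 4)
    .forEach((size, i) => { levels[size] = i + 1; });

  return { bodySize, levels };
}
```

With mostly 12pt items plus a few at 24pt and 18pt, `levels` maps 24 to 1 and 18 to 2, while a 12.5pt item stays body text because it is within 1pt of the body size.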
Step 5: Table Detection
Tables are the hardest element to extract. We detect them by finding grid-aligned text regions:
- Cluster by x-position (5pt tolerance) to identify columns
- Cluster by y-position (2pt tolerance) to identify rows
- If ≥2 columns AND ≥2 rows are detected, it's a table
- First row becomes the header, with a --- separator below
The output is a standard Markdown pipe table — compatible with GitHub, Obsidian, and all major renderers. If merged cells are detected, we fall back to plain text to avoid mangled output.
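The clustering rules can be sketched like this. It is a simplified illustration under stated assumptions: `Cell`, `cluster`, and `toMarkdownTable` are hypothetical names, the input is assumed to be a region already suspected of being a table, and the merged-cell fallback is not implemented.

```typescript
interface Cell { text: string; x: number; y: number; }

// Sort values and start a new cluster whenever the jump from the previous
// value exceeds the tolerance. Returns the smallest value of each cluster.
function cluster(values: number[], tolerance: number): number[] {
  const sorted = values.slice().sort((a, b) => a - b);
  const anchors: number[] = [];
  let prev = 0;
  for (const v of sorted) {
    if (anchors.length === 0 || v - prev > tolerance) anchors.push(v);
    prev = v;
  }
  return anchors;
}

function toMarkdownTable(cells: Cell[]): string | null {
  const cols = cluster(cells.map(c => c.x), 5); // 5pt column tolerance
  const rows = cluster(cells.map(c => c.y), 2); // 2pt row tolerance
  if (cols.length < 2 || rows.length < 2) return null; // not a table

  // Index of the cluster a value belongs to (anchors are ascending minima).
  const indexIn = (anchors: number[], v: number) =>
    anchors.filter(a => a <= v).length - 1;

  const grid: string[][] = rows.map(() => cols.map(() => ""));
  for (const c of cells) {
    // PDF y grows upward, so the highest y is the first table row.
    const r = rows.length - 1 - indexIn(rows, c.y);
    const k = indexIn(cols, c.x);
    const text = c.text.replace(/\|/g, "\\|"); // escape pipes inside cells
    grid[r][k] = grid[r][k] ? grid[r][k] + " " + text : text;
  }

  const line = (row: string[]) => "| " + row.join(" | ") + " |";
  const separator = line(cols.map(() => "---"));
  return [line(grid[0]), separator].concat(grid.slice(1).map(line)).join("\n");
}
```

Returning `null` when fewer than two columns or rows are found lets the caller fall through to ordinary paragraph handling.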
Step 6: List Detection
We identify lists by looking for bullet characters (•, -, ▪, ○) or ordered markers (such as "1." or "a)") at the start of lines. Key rules:
- A bullet that isn't adjacent to other list items is treated as a regular paragraph, which reduces false positives
- Nesting is detected from x-position indentation — up to 3 levels deep
- Each nesting level adds 2 spaces of indentation in the output
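Here is a hedged sketch of those rules. `Line`, `formatLists`, and the 18pt `INDENT_STEP` are assumptions made for illustration; the text above does not say how much horizontal indentation counts as one nesting level. Bullets are normalized to "- ", and ordered markers are kept as they appear in the source.

```typescript
interface Line { text: string; x: number; }

const BULLET = /^[•\-▪○]\s+(.*)$/;
const ORDERED = /^(\d+\.|[a-z]\))\s+(.*)$/;
const INDENT_STEP = 18; // assumed points of indentation per nesting level
const MAX_LEVEL = 2;    // zero-based, so three levels in total

const isListLine = (l: Line) => BULLET.test(l.text) || ORDERED.test(l.text);

function formatLists(lines: Line[]): string[] {
  const out: string[] = [];
  let i = 0;
  while (i < lines.length) {
    if (!isListLine(lines[i])) {
      out.push(lines[i].text);
      i++;
      continue;
    }
    // Collect the run of adjacent list-like lines.
    let end = i;
    while (end < lines.length && isListLine(lines[end])) end++;
    const run = lines.slice(i, end);

    if (run.length === 1) {
      out.push(run[0].text); // isolated marker: leave as ordinary text
    } else {
      const baseX = Math.min(...run.map(l => l.x));
      for (const l of run) {
        const level = Math.min(MAX_LEVEL, Math.round((l.x - baseX) / INDENT_STEP));
        const indent = new Array(level + 1).join("  "); // 2 spaces per level
        const bullet = BULLET.exec(l.text);
        const ordered = ORDERED.exec(l.text);
        if (bullet) out.push(indent + "- " + bullet[1]);
        else if (ordered) out.push(indent + ordered[1] + " " + ordered[2]);
      }
    }
    i = end;
  }
  return out;
}
```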
Step 7: Paragraph Merging
Raw PDF text comes as scattered fragments. We reassemble them into coherent paragraphs using vertical gap analysis:
- Gap ≤ 1.5× line height and same font size → same paragraph (joined with spaces)
- Gap > 1.5× line height → new paragraph (blank line inserted)
This ensures natural paragraph breaks without losing text or creating excessive whitespace.
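The gap rule can be sketched as below. `TextLine` and `mergeParagraphs` are hypothetical names, and two points are assumptions rather than facts from the rules above: line height is estimated as 1.2× the font size, and a small gap combined with a change in font size starts a new paragraph. Lines are assumed to arrive in reading order with PDF coordinates, where y decreases down the page.

```typescript
interface TextLine { text: string; y: number; fontSize: number; }

const LINE_HEIGHT_FACTOR = 1.2; // assumed estimate: line height = 1.2 × font size
const GAP_FACTOR = 1.5;         // threshold from the rule: 1.5 × line height

function mergeParagraphs(lines: TextLine[]): string {
  const paragraphs: string[] = [];
  let prev: TextLine | null = null;

  for (const line of lines) {
    const sameParagraph =
      prev !== null &&
      prev.y - line.y <= GAP_FACTOR * LINE_HEIGHT_FACTOR * prev.fontSize &&
      prev.fontSize === line.fontSize;

    if (sameParagraph) {
      paragraphs[paragraphs.length - 1] += " " + line.text; // join with a space
    } else {
      paragraphs.push(line.text); // start a new paragraph
    }
    prev = line;
  }
  // A blank line between paragraphs in the Markdown output.
  return paragraphs.join("\n\n");
}
```

Words hyphenated across line ends are not rejoined in this sketch; they would simply be joined with a space.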
Step 8: Inline Formatting
Finally, we apply Markdown formatting based on font metadata:
- Font name contains "Bold" or weight ≥ 700 → **bold**
- Font name contains "Italic" or "Oblique" → *italic*
- Both conditions → ***bold italic***
Link annotations from the PDF are also matched to their corresponding text and wrapped in [text](url) syntax.
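A small sketch of the formatting step, under the assumption that each run of text arrives as a span with its font name, an optional numeric weight, and an optional link URL already matched to it. `Span` and `formatSpan` are hypothetical names. Surrounding whitespace is kept outside the markers, since Markdown emphasis generally needs the markers to sit directly against the text.

```typescript
interface Span {
  text: string;
  fontName: string;
  weight?: number; // numeric font weight, if known
  url?: string;    // target of a matched link annotation, if any
}

function formatSpan(span: Span): string {
  const bold = /bold/i.test(span.fontName) || (span.weight || 0) >= 700;
  const italic = /italic|oblique/i.test(span.fontName);
  const marker = bold && italic ? "***" : bold ? "**" : italic ? "*" : "";

  // Split into leading whitespace, core text, and trailing whitespace.
  const parts = /^(\s*)([\s\S]*?)(\s*)$/.exec(span.text);
  if (!parts || parts[2] === "") return span.text; // nothing to format

  let core = marker + parts[2] + marker;
  if (span.url) core = "[" + core + "](" + span.url + ")";
  return parts[1] + core + parts[3];
}
```

For instance, a span with text "Total" in "Helvetica-Bold" becomes `**Total**`, and a plain span with a URL becomes `[text](url)`.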
Why Client-Side Only?
Running everything in your browser provides key advantages:
- Privacy — Your documents never leave your device
- Speed — No upload/download latency; processing starts instantly
- Offline capable — Works without an internet connection after initial load
- No limits — No server quotas or rate limiting
Error Handling
Not every PDF converts perfectly. Our system handles edge cases gracefully:
- Image-only PDFs — Detected when pages return zero text items; user is advised to run OCR first
- Password-protected files — Caught immediately with a clear error message
- Memory pressure — Large files are subject to a 50 MB limit and a 10-second timeout to prevent browser crashes
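These guards could be sketched as follows. The names `Problem`, `checkFileSize`, `checkImageOnly`, and `makeDeadline` are hypothetical, and the sketch assumes "50 MB" means 50 × 1024 × 1024 bytes. The timeout is modeled as a cooperative deadline that the pipeline checks between pages, which is one possible approach among several. Password detection is left out because it depends on the PDF library's own error reporting.

```typescript
type Problem = "too-large" | "image-only" | "timeout";

const MAX_BYTES = 50 * 1024 * 1024; // assumed meaning of the 50 MB limit
const TIMEOUT_MS = 10000;           // 10-second budget

// Reject oversized files before any parsing starts.
function checkFileSize(byteLength: number): Problem | null {
  return byteLength > MAX_BYTES ? "too-large" : null;
}

// If every page yields zero text items, the file is probably scanned images.
function checkImageOnly(textItemsPerPage: number[]): Problem | null {
  const allEmpty =
    textItemsPerPage.length > 0 && textItemsPerPage.every(n => n === 0);
  return allEmpty ? "image-only" : null;
}

// Returns a function to call between pages; it reports "timeout" once the
// budget is spent. The clock is injectable so the logic can be tested.
function makeDeadline(now: () => number = Date.now): () => Problem | null {
  const start = now();
  return () => (now() - start > TIMEOUT_MS ? "timeout" : null);
}
```

A caller would run `checkFileSize` first, then create a deadline with `makeDeadline()` and check it after each page, stopping with a user-facing message as soon as any check returns a non-null problem.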