PDF to Markdown API — Build Document Pipelines Programmatically
You've got a document pipeline. PDFs are coming in — research papers, contracts, reports — and you need to process them downstream: feed them into an LLM, index them in a vector store, or store them as structured content. You don't want to manually convert each file. You want a PDF to Markdown API.
This post covers your real options: browser-native JavaScript libraries, Node.js tools, Python packages, and cloud APIs — with actual code you can use today.
Option 1: PDF.js in the Browser (No API Required)
If your use case is client-side — a web app where users upload PDFs — you don't need an API at all. PDF.js runs entirely in the browser and gives you raw text extraction with positional metadata. You can layer heuristics on top to produce structured Markdown.
This is exactly how our converter works — the full conversion pipeline is documented here. The short version: PDF.js extracts text items with x/y positions and font metadata; you then classify headings by font size, detect tables by grid alignment, and assemble paragraphs from vertical gaps.
// Browser: minimal PDF.js text extraction
// Note: set pdfjsLib.GlobalWorkerOptions.workerSrc to the worker file shipped with pdfjs-dist
// before calling getDocument (the exact path depends on your bundler).
import * as pdfjsLib from 'pdfjs-dist';
async function extractText(arrayBuffer: ArrayBuffer): Promise<string> {
  const pdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise;
  let fullText = '';
  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const content = await page.getTextContent();
    const pageText = content.items
      // Marked-content items carry no text, so they contribute an empty string
      .map((item) => ('str' in item ? item.str : ''))
      .join(' ');
    fullText += pageText + '\n\n';
  }
  return fullText;
}
For production use, run the heavy parsing inside a Web Worker to keep the main thread responsive. PDF.js ships with a worker file for exactly this purpose.
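The heading and paragraph heuristics described earlier in this section can be sketched roughly as follows. This is a minimal illustration, not the converter's actual implementation: the `PositionedItem` shape, the `itemsToMarkdown` name, and the thresholds (1.2 and 1.6 times the body font size for headings, a gap of 1.5 times the font size for paragraph breaks) are all assumptions chosen for the example. Real PDF.js text items expose `str` plus a `transform` matrix, from which you would derive the position fields used here. Table detection is left out.

```typescript
// Hypothetical simplified item shape (an assumption for this sketch).
interface PositionedItem {
  str: string;
  x: number;
  y: number; // PDF coordinates: larger y is higher on the page
  fontSize: number;
}

interface Line {
  y: number;
  fontSize: number;
  text: string;
}

function itemsToMarkdown(items: PositionedItem[]): string {
  if (items.length === 0) return '';

  // Treat the font size that covers the most characters as body text.
  const charsBySize = new Map<number, number>();
  for (const item of items) {
    charsBySize.set(item.fontSize, (charsBySize.get(item.fontSize) ?? 0) + item.str.length);
  }
  let bodySize = items[0].fontSize;
  let mostChars = -1;
  charsBySize.forEach((chars, size) => {
    if (chars > mostChars) {
      mostChars = chars;
      bodySize = size;
    }
  });

  // Sort top-to-bottom, then left-to-right, and merge items that share a baseline.
  const sorted = [...items].sort((a, b) => b.y - a.y || a.x - b.x);
  const lines: Line[] = [];
  for (const item of sorted) {
    const last = lines[lines.length - 1];
    if (last && Math.abs(last.y - item.y) < 2) {
      last.text += ' ' + item.str;
      last.fontSize = Math.max(last.fontSize, item.fontSize);
    } else {
      lines.push({ y: item.y, fontSize: item.fontSize, text: item.str });
    }
  }

  const blocks: string[] = [];
  let paragraph: string[] = [];
  let prevY: number | null = null;
  const flush = (): void => {
    if (paragraph.length > 0) blocks.push(paragraph.join(' '));
    paragraph = [];
  };

  for (const line of lines) {
    const ratio = line.fontSize / bodySize;
    if (ratio >= 1.2) {
      // Noticeably larger than body text: emit as a heading.
      flush();
      blocks.push((ratio >= 1.6 ? '# ' : '## ') + line.text);
      prevY = null;
    } else {
      // A large vertical gap starts a new paragraph.
      if (prevY !== null && prevY - line.y > 1.5 * line.fontSize) flush();
      paragraph.push(line.text);
      prevY = line.y;
    }
  }
  flush();
  return blocks.join('\n\n');
}
```

Thresholds like these usually need tuning per document family; a ratio that works for reports can misclassify slide decks or forms.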
Option 2: Node.js with pdf-parse or pdf2md
For server-side Node.js pipelines, the choice depends on whether you need plain text or structured Markdown:
pdf-parse (text only, simple)
import pdfParse from 'pdf-parse';
import fs from 'fs';
const buffer = fs.readFileSync('document.pdf');
const data = await pdfParse(buffer);
// data.text contains the extracted plain text
console.log(data.text);
pdf-parse is a thin wrapper around PDF.js for Node.js. It gives you plain text: no structure, no headings, no tables. Good enough for feeding into an LLM where the model handles structure inference. Not suitable if you need clean Markdown with headers and tables preserved.
For structured Markdown output
If you need Markdown with structure preserved (headings, bold, tables, lists), you need either a more sophisticated open-source tool or a commercial API. The open-source landscape here is sparse — most tools either produce flat text or require significant post-processing. For high-fidelity conversion, look at the commercial options below.
Option 3: Python — pdfminer, pypdf, or marker
Python has a richer ecosystem for PDF extraction.
pypdf (simple text extraction)
from pypdf import PdfReader
reader = PdfReader("document.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n\n"
print(text)
marker (structured Markdown, open-source)
marker is a Python library from Datalab that produces clean Markdown from PDFs. It uses a local ML model to understand document structure — handling multi-column layouts, mathematical equations, and tables far better than heuristic-only approaches. The tradeoff: it requires a GPU for reasonable speed on large files.
# Note: these imports match marker releases before 1.0; later releases reorganized
# the Python API, so check the project README for the version you have installed.
from marker.convert import convert_single_pdf
from marker.models import load_all_models
models = load_all_models()
full_text, images, metadata = convert_single_pdf("document.pdf", models)
# full_text is clean Markdown
print(full_text)
For RAG pipelines where document quality matters (research papers, technical manuals), marker produces noticeably better output than heuristic tools. See our post on using converted Markdown with LangChain and LlamaIndex for how to integrate this into an LLM pipeline.
Option 4: Commercial APIs
If you need reliable, high-volume conversion without managing your own infrastructure, several commercial APIs offer PDF-to-Markdown or PDF-to-structured-text endpoints:
- Anthropic Claude API — Claude's vision capability can read PDF files directly and extract structured content. Pass the PDF as a base64-encoded document in the message. Best for complex documents where ML-level understanding is needed.
- Google Document AI — Enterprise document processing API. Produces structured output with headings, tables, and form fields identified. Strong on scanned/OCR documents.
- AWS Textract — Similar to Document AI; excellent table and form extraction. Returns structured JSON that you can convert to Markdown programmatically.
- Mathpix — Specialized in scientific documents with math. Best for LaTeX-heavy papers.
- LlamaParse — A newer option from LlamaIndex optimized for RAG use cases. Returns chunked Markdown designed for embedding and retrieval.
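As an illustration of converting structured JSON to Markdown programmatically, here is a rough sketch that renders one table from a Textract-style block list. The `Block` interface is a trimmed-down assumption covering only the fields used here (`BlockType`, `RowIndex`, `ColumnIndex`, and `CHILD` relationships), and `tableToMarkdown` is a hypothetical helper, not part of any SDK. It ignores merged cells and treats the first row as the header.

```typescript
// Trimmed-down block shape (assumption: only the fields this sketch needs).
interface Block {
  Id: string;
  BlockType: string; // e.g. 'TABLE', 'CELL', 'WORD'
  Text?: string;
  RowIndex?: number; // 1-based
  ColumnIndex?: number; // 1-based
  Relationships?: { Type: string; Ids: string[] }[];
}

// Collect the ids listed in a block's CHILD relationships.
function childIds(block: Block): string[] {
  const ids: string[] = [];
  for (const rel of block.Relationships ?? []) {
    if (rel.Type === 'CHILD') ids.push(...rel.Ids);
  }
  return ids;
}

// Hypothetical helper: render one TABLE block as a Markdown table.
function tableToMarkdown(table: Block, blocks: Block[]): string {
  const byId = new Map<string, Block>();
  for (const b of blocks) byId.set(b.Id, b);

  const rows: string[][] = [];
  for (const cellId of childIds(table)) {
    const cell = byId.get(cellId);
    if (!cell || cell.BlockType !== 'CELL' || !cell.RowIndex || !cell.ColumnIndex) continue;
    const words: string[] = [];
    for (const wordId of childIds(cell)) {
      const text = byId.get(wordId)?.Text;
      if (text) words.push(text);
    }
    const r = cell.RowIndex - 1;
    const c = cell.ColumnIndex - 1;
    if (!rows[r]) rows[r] = [];
    // Escape pipes so cell text cannot break the table syntax.
    rows[r][c] = words.join(' ').replace(/\|/g, '\\|');
  }
  if (rows.length === 0) return '';

  let width = 0;
  for (let r = 0; r < rows.length; r++) width = Math.max(width, (rows[r] ?? []).length);

  // Pad every row to the same width so missing cells render as empty.
  const formatRow = (cells: string[]): string => {
    const padded: string[] = [];
    for (let c = 0; c < width; c++) padded.push(cells[c] ?? '');
    return '| ' + padded.join(' | ') + ' |';
  };

  const out: string[] = [];
  for (let r = 0; r < rows.length; r++) {
    out.push(formatRow(rows[r] ?? []));
    // Markdown tables need a separator row after the header.
    if (r === 0) {
      const sep: string[] = [];
      for (let c = 0; c < width; c++) sep.push('---');
      out.push(formatRow(sep));
    }
  }
  return out.join('\n');
}
```

The same approach extends to the other services in this list that return JSON: normalize their output into rows and cells first, then render.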
Example: PDF to Markdown via Claude API
import Anthropic from '@anthropic-ai/sdk';
import fs from 'fs';
const client = new Anthropic();
const pdfBuffer = fs.readFileSync('document.pdf');
const base64Pdf = pdfBuffer.toString('base64');
const response = await client.messages.create({
  model: 'claude-opus-4-5',
  max_tokens: 4096, // long documents may need a higher limit or page-by-page requests
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'document',
          source: {
            type: 'base64',
            media_type: 'application/pdf',
            data: base64Pdf,
          },
        },
        {
          type: 'text',
          text: 'Convert this PDF to clean Markdown. Preserve headings, lists, and tables. Output only the Markdown, no commentary.',
        },
      ],
    },
  ],
});
// The response is a list of content blocks; keep only the text blocks.
const markdown = response.content
  .map((block) => (block.type === 'text' ? block.text : ''))
  .join('');
console.log(markdown);
Option 5: Build Your Own Serverless API
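If you want a private, self-hosted endpoint, you can put pdf-parse or marker behind a serverless function. Here's a minimal request handler written as a Cloudflare Worker; the same pattern adapts to other serverless environments. Keep in mind that marker is a Python library, so it has to run in a Python-capable service that the handler calls rather than inside a JavaScript runtime.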
// Cloudflare Worker example
// convertPdfToMarkdown is a placeholder for your own conversion logic
export default {
  async fetch(request: Request): Promise<Response> {
    if (request.method !== 'POST') {
      return new Response('Method Not Allowed', { status: 405 });
    }
    const formData = await request.formData();
    const file = formData.get('file');
    // get() returns a string for plain fields and null when the field is missing
    if (!(file instanceof File) || file.type !== 'application/pdf') {
      return new Response('Invalid file', { status: 400 });
    }
    const buffer = await file.arrayBuffer();
    // Pass to your conversion logic
    const markdown = await convertPdfToMarkdown(buffer);
    return new Response(JSON.stringify({ markdown }), {
      headers: { 'Content-Type': 'application/json' },
    });
  },
};
The key consideration: PDF parsing is CPU-intensive. A 100-page PDF can take several seconds to process. Use a queue-based architecture (upload → job ID → poll for result) rather than a synchronous request-response pattern for anything beyond small files.
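The upload → job ID → poll flow can be sketched as below. Everything here is illustrative: `ConversionQueue`, `submit`, and `poll` are hypothetical names, the converter is injected as a plain function, and jobs live in an in-memory `Map`, which would not survive restarts or work across multiple instances. A real deployment would swap in a durable queue and job store.

```typescript
type Job =
  | { status: 'pending' }
  | { status: 'done'; markdown: string }
  | { status: 'failed'; error: string };

type Converter = (pdf: ArrayBuffer) => Promise<string>;

// Hypothetical in-memory job queue (assumption: single process, no persistence).
class ConversionQueue {
  private jobs = new Map<string, Job>();
  private nextId = 1;
  private convert: Converter;

  constructor(convert: Converter) {
    this.convert = convert;
  }

  // Accept an upload, start converting in the background, and return a job ID immediately.
  submit(pdf: ArrayBuffer): string {
    const id = `job-${this.nextId++}`;
    this.jobs.set(id, { status: 'pending' });
    Promise.resolve()
      .then(() => this.convert(pdf))
      .then((markdown) => {
        this.jobs.set(id, { status: 'done', markdown });
      })
      .catch((err: unknown) => {
        // Record the failure so pollers see it instead of waiting forever.
        this.jobs.set(id, { status: 'failed', error: String(err) });
      });
    return id;
  }

  // Clients call this repeatedly until the status is no longer 'pending'.
  poll(id: string): Job | undefined {
    return this.jobs.get(id);
  }
}
```

An HTTP layer would map `submit` to a `POST` that responds with the job ID, and `poll` to a `GET` on that ID.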
Choosing the Right Approach
The right tool depends on your context:
- Web app, client-side → PDF.js in a Web Worker. No server costs, no upload round trip, and files never leave the user's device.
- Simple text extraction, Node.js → pdf-parse. Minimal setup, good enough for LLM input.
- High-fidelity Markdown, Python pipeline → marker. Best open-source output quality for complex documents.
- Scanned/image PDFs → Google Document AI or AWS Textract. OCR is built-in.
- Scientific papers with math → Mathpix API.
- RAG pipeline, managed service → LlamaParse. Optimized chunking and structured output.
- General purpose, flexible output → Claude or GPT-4 with vision. Understands context, handles edge cases, but higher cost per page.
For understanding what happens during conversion at the technical level, see how PDF to Markdown conversion works. For a comparison of the major tools in this space, see our tools comparison guide.