July 4, 2026

How to Use PDF to Markdown with LangChain and LlamaIndex

Most RAG tutorials show you how to load a PDF directly with PyPDFLoader or SimpleDirectoryReader. That works — but it often produces poor retrieval results on complex documents. The chunks are noisy, headings and table structure are lost, and the LLM has to work harder to extract meaning from unstructured text.

A better pattern: convert PDFs to Markdown first, then load the Markdown. Markdown preserves semantic structure: headings become chunk boundaries, tables stay intact, and lists stay grouped. The result is typically higher retrieval precision and more accurate LLM answers.

This guide shows you how to do it end-to-end with both LangChain and LlamaIndex.

Why Markdown Beats Raw PDF Text for RAG

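Consider a typical scenario: a technical specification with sections, subsections, tables, and bullet points. Loaded directly via PyPDFLoader, it comes out as noisy chunks in which headings and table structure are lost. Converted to Markdown first, the same document keeps its headings as chunk boundaries, its tables intact, and its lists grouped.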

The key insight: Markdown headers are the ideal chunk boundary signal. Splitting on #, ##, ### gives you semantically coherent chunks that align with how the document is actually organized.
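
To make the idea concrete, here is a minimal standard-library sketch of header-based splitting. It is not how LangChain or LlamaIndex implement it; `split_on_headers` and the chunk dictionary layout are hypothetical names chosen for illustration, and the sketch only handles `#`-style headings.

```python
import re

HEADER_RE = re.compile(r"^(#{1,3})\s+(.*)$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def split_on_headers(markdown: str) -> list[dict]:
    """Split Markdown at #, ##, ### headings and record each chunk's heading path."""
    chunks = []
    path = {}        # heading level -> title currently in effect
    lines = []       # lines of the chunk being built
    in_fence = False

    def flush():
        text = "\n".join(lines).strip()
        if text:
            chunks.append({"headings": dict(path), "text": text})
        lines.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Ignore '#' lines inside fenced code (they are comments, not headings).
        match = None if in_fence else HEADER_RE.match(line)
        if match:
            flush()  # close the previous chunk under the old heading path
            level = len(match.group(1))
            # A new heading replaces its own level and clears deeper levels.
            for deeper in [k for k in path if k >= level]:
                del path[deeper]
            path[level] = match.group(2).strip()
        lines.append(line)  # the heading line stays in the chunk text
    flush()
    return chunks


doc = "# Spec\nIntro\n## Install\nStep 1\n### Linux\napt\n## Usage\nRun it"
for chunk in split_on_headers(doc):
    print(chunk["headings"], "->", repr(chunk["text"]))
```

Each chunk carries the headings it sits under, which is the same information the library splitters below attach as metadata.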

Approach 1: LangChain with marker + MarkdownHeaderTextSplitter

Step 1: Convert PDF to Markdown with marker

Install marker and convert your PDFs to a folder of Markdown files:

pip install marker-pdf

# These imports match older marker-pdf releases; newer releases reorganized
# the converter API, so check the version you have installed.
from marker.convert import convert_single_pdf
from marker.models import load_all_models
import os

def pdf_to_markdown(pdf_path: str, output_dir: str, models) -> str:
    """Convert a PDF to Markdown and save it in output_dir."""
    full_text, images, metadata = convert_single_pdf(pdf_path, models)

    # Name the Markdown file after the PDF and write it to output_dir
    base_name = os.path.splitext(os.path.basename(pdf_path))[0]
    output_path = os.path.join(output_dir, f"{base_name}.md")

    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(full_text)

    return output_path


# Convert all PDFs in a directory
pdf_dir = "./documents"
md_dir = "./documents_md"
os.makedirs(md_dir, exist_ok=True)

# Load the models once; reloading them for every PDF is slow
models = load_all_models()

for filename in os.listdir(pdf_dir):
    if filename.lower().endswith('.pdf'):
        pdf_path = os.path.join(pdf_dir, filename)
        md_path = pdf_to_markdown(pdf_path, md_dir, models)
        print(f"Converted: {md_path}")

Step 2: Load and Split with LangChain

LangChain's MarkdownHeaderTextSplitter splits on heading hierarchy — this is the key advantage over character-based splitting.

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import MarkdownHeaderTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Load all Markdown files
loader = DirectoryLoader(
    "./documents_md",
    glob="**/*.md",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"}
)
documents = loader.load()

# Split on Markdown headers — creates semantically coherent chunks
headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]

md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False  # Keep headers in chunk content for context
)

all_chunks = []
for doc in documents:
    chunks = md_splitter.split_text(doc.page_content)
    # Preserve source metadata
    for chunk in chunks:
        chunk.metadata["source"] = doc.metadata.get("source", "unknown")
    all_chunks.extend(chunks)

print(f"Total chunks: {len(all_chunks)}")

# Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=all_chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

print("Vector store created.")

Step 3: Query with a RAG chain

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximal marginal relevance for diverse results
    search_kwargs={"k": 6, "fetch_k": 20}
)

prompt_template = """Use the following context from the documentation to answer the question.
If the context includes a table, use it directly in your answer when relevant.

Context:
{context}

Question: {question}

Answer:"""

prompt = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)

llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True
)

result = qa_chain.invoke({"query": "What are the system requirements?"})
print(result["result"])

# Show source sections used
for doc in result["source_documents"]:
    print(f"\nSource: {doc.metadata.get('source')} — {doc.metadata.get('h1', '')} > {doc.metadata.get('h2', '')}")
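
The retriever above uses search_type="mmr". As a rough illustration of what maximal marginal relevance does, the sketch below picks results that are relevant to the query but not redundant with what has already been picked. It is a simplified standard-library version written for this article, not LangChain's implementation; the `mmr` and `cosine` helpers and the toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of k documents chosen by maximal marginal relevance."""
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalize similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.8, -0.6]]  # docs 0 and 1 are near-duplicates

print(mmr(query, docs, k=2, lambda_mult=0.5))  # balances relevance and diversity
print(mmr(query, docs, k=2, lambda_mult=1.0))  # pure relevance ranking
```

With lambda_mult=0.5 the second pick skips the near-duplicate in favor of a different document; with lambda_mult=1.0 the ranking falls back to plain similarity. In the LangChain retriever, fetch_k controls how many candidates are fetched before this selection step and k how many are kept.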

Approach 2: LlamaIndex with LlamaParse

LlamaIndex has native integration with LlamaParse — a document parser from the same team, optimized for RAG. LlamaParse handles the PDF-to-Markdown conversion in the cloud, then LlamaIndex handles indexing and retrieval.

pip install llama-index llama-parse

import os
from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex, Settings
from llama_index.core.node_parser import MarkdownNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# LlamaParse converts PDF to structured Markdown
parser = LlamaParse(
    api_key=os.environ["LLAMA_CLOUD_API_KEY"],
    result_type="markdown",       # Get Markdown output
    num_workers=4,                # Parallel processing
    verbose=True,
    language="en"
)

# Parse PDFs — returns LlamaIndex Document objects with Markdown content
documents = parser.load_data("./documents/report.pdf")

# Use MarkdownNodeParser for structure-aware chunking
node_parser = MarkdownNodeParser()

# Configure global settings
Settings.llm = OpenAI(model="gpt-4o", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Build index with Markdown-aware node parsing
index = VectorStoreIndex.from_documents(
    documents,
    transformations=[node_parser]
)

# Query
query_engine = index.as_query_engine(
    similarity_top_k=5,
    response_mode="compact"
)

response = query_engine.query(
    "Summarize the key findings from section 3"
)
print(response)

Approach 3: Using Pre-converted Markdown Files (Any Source)

If you've already converted PDFs to Markdown files (using any tool — our browser converter, marker, or another solution), loading them in LlamaIndex is straightforward:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import MarkdownNodeParser

# Load pre-converted Markdown files
reader = SimpleDirectoryReader(
    input_dir="./documents_md",
    required_exts=[".md"],
    recursive=True
)
documents = reader.load_data()

# Parse with Markdown-aware splitter
node_parser = MarkdownNodeParser(include_metadata=True)
nodes = node_parser.get_nodes_from_documents(documents)

print(f"Loaded {len(documents)} documents → {len(nodes)} nodes")

# Build index
index = VectorStoreIndex(nodes)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("What installation steps are required?")
print(response)

Key Tips for Better Retrieval Quality

1. Use header-based chunking, not character-based

Character-based splitters (RecursiveCharacterTextSplitter) are a fallback for unstructured text. With Markdown, use MarkdownHeaderTextSplitter (LangChain) or MarkdownNodeParser (LlamaIndex). Chunks will align with actual document sections.

2. Preserve heading metadata

Both LangChain and LlamaIndex can attach heading-path metadata to each chunk (e.g., h1: "Chapter 3", h2: "Installation"). This metadata helps during retrieval — you can filter chunks by section, and the source citation in the response is more meaningful.
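
As a small illustration of both uses, the sketch below filters chunks by section and builds a readable citation from heading metadata. Plain dictionaries stand in for the libraries' document objects, the h1/h2/h3 keys follow the names used in the LangChain example above, and `filter_by_section` and `citation` are hypothetical helpers.

```python
def filter_by_section(chunks, **wanted):
    """Keep chunks whose metadata matches every given heading key/value pair."""
    return [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in wanted.items())
    ]


def citation(metadata):
    """Build a 'source > h1 > h2 > h3' path, skipping heading levels that are missing."""
    parts = [metadata.get("source", "unknown")]
    parts += [metadata[key] for key in ("h1", "h2", "h3") if metadata.get(key)]
    return " > ".join(parts)


# Toy chunks; real ones would come from the splitter, with the same metadata shape.
chunks = [
    {"text": "Run the installer.",
     "metadata": {"source": "manual.md", "h1": "Chapter 3", "h2": "Installation"}},
    {"text": "Set the API key.",
     "metadata": {"source": "manual.md", "h1": "Chapter 3", "h2": "Configuration"}},
    {"text": "Overview of the product.",
     "metadata": {"source": "manual.md", "h1": "Chapter 1"}},
]

for chunk in filter_by_section(chunks, h2="Installation"):
    print(citation(chunk["metadata"]), "|", chunk["text"])
```

In a real pipeline the same filter would usually be pushed down to the vector store as a metadata filter rather than applied in Python after retrieval.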

3. Don't over-chunk

A common mistake is setting chunk size too small. Markdown-based chunking naturally produces larger, coherent chunks. Trust it. Chunks that split in the middle of a table or list are worse than larger chunks that keep content together.

4. Handle tables specially

Tables extracted as Markdown pipe format are readable to LLMs. Don't split a table across chunks. Both parsers above attempt to keep tables intact — verify this for your document type. For documents where table accuracy is critical, see our guide on accurate table extraction.
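
One way to run that verification is a quick heuristic over your chunks. The sketch below flags any chunk containing a run of pipe rows whose second line is not a header separator (such as `| --- | --- |`), which usually means the table's header ended up in an earlier chunk. It assumes table rows start with a leading pipe, and `has_orphaned_table_rows` is a hypothetical helper, not part of either library.

```python
import re


def is_table_row(line: str) -> bool:
    """Treat any line that starts with a pipe as a table row (an assumption)."""
    return line.strip().startswith("|")


def is_separator(line: str) -> bool:
    """True for a header separator row such as '| --- | :-: |'."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    return all(re.fullmatch(r":?-+:?", cell) for cell in cells)


def has_orphaned_table_rows(chunk_text: str) -> bool:
    """True if some run of pipe rows lacks a separator as its second line."""
    lines = chunk_text.splitlines()
    i = 0
    while i < len(lines):
        if not is_table_row(lines[i]):
            i += 1
            continue
        start = i
        while i < len(lines) and is_table_row(lines[i]):
            i += 1
        run = lines[start:i]
        if len(run) < 2 or not is_separator(run[1]):
            return True
    return False


whole = "Limits\n\n| Name | Value |\n| --- | --- |\n| cpu | 4 |"
fragment = "| ram | 16 |\n| disk | 100 |\n\nNext paragraph"

print(has_orphaned_table_rows(whole))
print(has_orphaned_table_rows(fragment))
```

This only catches chunks that begin partway through a table. A chunk that ends mid-table still has its header and passes the check, so its continuation in the next chunk is what gets flagged.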

5. Quality of source Markdown matters

Garbage in, garbage out. The quality of your PDF conversion directly affects retrieval performance. For complex documents (multi-column, dense tables, mixed layouts), invest in a higher-quality converter. Our tools comparison covers the options and their tradeoffs.
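
A cheap first check on conversion quality is to count the structure that survived. The sketch below is a rough heuristic, assuming `#`-style headings and pipe tables; `markdown_structure_report` is a hypothetical helper, and it does not exclude lines inside fenced code blocks.

```python
import re


def markdown_structure_report(markdown: str) -> dict:
    """Count structural elements a good conversion should preserve."""
    lines = markdown.splitlines()
    return {
        "headings": sum(1 for line in lines if re.match(r"#{1,6}\s+\S", line)),
        "table_rows": sum(1 for line in lines if line.strip().startswith("|")),
        "list_items": sum(
            1 for line in lines if re.match(r"\s*([-*+]|\d+[.)])\s+\S", line)
        ),
        "characters": len(markdown),
    }


sample = "# Title\n\n- a\n- b\n\n| x | y |\n| - | - |\n| 1 | 2 |\n\n1. first"
print(markdown_structure_report(sample))
```

A long document that reports zero headings or zero table rows when the PDF clearly has them is a sign the converter flattened the layout, and is worth inspecting before indexing.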

The Conversion Step is an Investment

It might feel like extra work to add a PDF-to-Markdown conversion step before ingesting into your pipeline. But it pays off: better chunks, more accurate retrieval, cleaner LLM context. The documents that matter most in your knowledge base — technical manuals, product specs, research papers — tend to be exactly the complex, structured PDFs that benefit most from this approach.

To understand what the conversion step is actually doing under the hood, see how PDF to Markdown conversion works. For choosing the right conversion tool for your pipeline, see our tools comparison.
