# Talk2PDF Agent
In this tutorial, we'll build a custom AI agent that can parse PDF documents and make them consumable by LLMs. We'll walk through the process step by step, explaining what we're doing and why.
## What We're Building
We want to create an agent that can:
- Parse PDF files and extract text content
- Search for specific terms within documents
- Provide intelligent analysis and summaries
- Handle errors gracefully
The key insight is that we'll separate concerns: a custom MCP server handles the low-level PDF parsing, while the agent provides intelligent interaction.
## Step 1: Understanding the Architecture
Our agent will have two main components:
- MCP Server: Handles PDF parsing operations (tools)
- Agent: Provides intelligent analysis and user interaction
This separation allows the agent to focus on understanding and analysis while the MCP server handles the technical PDF processing.
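Conceptually, the pieces line up like this:

```
User --question-->  Agent (intelligence)  --tool call-->  MCP server (PDF parsing)
User <--answer----  Agent                 <--JSON data--  MCP server
```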
## Step 2: Quick Setup
The talk2pdf agent uses a published MCP server that's installed automatically:

```bash
# From the saiki project root
saiki --agent agents/talk2pdf-agent/talk2pdf-agent.yml
```

That's it! The MCP server (`@truffle-ai/talk2pdf-mcp`) is downloaded and installed automatically via `npx` on first run.
## Step 3: Understanding the MCP Server
The core functionality is provided by the Talk2PDF MCP Server, a TypeScript-based server built with the Model Context Protocol SDK. To understand the complete MCP server implementation, refer to the mcp-servers repository:
- **Talk2PDF Server**: `src/talk2pdf` - PDF parsing, text extraction, and content analysis tools
### What's Happening Behind the Scenes
The published MCP server includes these key dependencies:
- `@modelcontextprotocol/sdk`: The MCP framework for server communication
- `pdf-parse-debugging-disabled`: PDF parsing without debug console output
- `zod`: Runtime type validation for tool parameters
- TypeScript: Compiled to JavaScript for reliable execution
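For reference, a minimal `package.json` for a server like this might look as follows (the layout and version ranges are illustrative, not copied from the published package):

```json
{
  "name": "@truffle-ai/talk2pdf-mcp",
  "version": "1.0.0",
  "type": "module",
  "bin": { "talk2pdf-mcp": "dist/index.js" },
  "scripts": { "build": "tsc" },
  "dependencies": {
    "@modelcontextprotocol/sdk": "^1.0.0",
    "pdf-parse-debugging-disabled": "^1.1.1",
    "zod": "^3.23.0"
  },
  "devDependencies": { "typescript": "^5.4.0" }
}
```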
## Step 4: Available Tools
The talk2pdf MCP server provides two main tools:
- `parse_pdf`: Extract all text and metadata from a PDF
- `extract_section`: Search for specific content within a PDF
Here's how the MCP server is structured (you can view the full implementation at https://github.com/truffle-ai/mcp-servers/tree/main/src/talk2pdf):
```typescript
// Core server structure
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
import { z } from 'zod';
import { readFileSync, existsSync } from 'fs';
import { extname } from 'path';
import pdf from 'pdf-parse-debugging-disabled';

class Talk2PDFMCPServer {
    private server: McpServer;

    constructor() {
        this.server = new McpServer(
            { name: 'talk2pdf', version: '1.0.0' },
            { capabilities: { tools: {}, resources: {} } }
        );
        this.registerTools();
    }

    private registerTools(): void {
        // Tools are registered here (see Steps 5 and 6)
    }

    async start(): Promise<void> {
        const transport = new StdioServerTransport();
        await this.server.connect(transport);
        // Log to stderr: stdout is reserved for MCP protocol messages
        console.error('Talk2PDF MCP Server started');
    }
}
```
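The class also needs a small entry point so `npx` can launch it; a minimal sketch (the exact bootstrap in the published package may differ):

```typescript
// Start the server; exit non-zero if startup fails
const server = new Talk2PDFMCPServer();
server.start().catch((error) => {
    console.error('Failed to start Talk2PDF MCP Server:', error);
    process.exit(1);
});
```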
**What's happening here?**
- We create an MCP server with the name 'talk2pdf'
- The server registers tools that the agent can call
- The server communicates via stdio (standard input/output)
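To make the stdio transport concrete: the agent and server exchange JSON-RPC 2.0 messages over stdin/stdout. A tool invocation from the agent looks roughly like this (simplified, with illustrative values):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "parse_pdf",
    "arguments": { "filePath": "/path/to/document.pdf" }
  }
}
```

This is also why the server must never print logs to stdout: anything written there would be parsed as a protocol message.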
## Step 5: Adding the Parse Tool
The `parse_pdf` tool is our main workhorse. It needs to:
- Validate the file exists and is a PDF
- Extract text content
- Extract metadata (page count, title, author, etc.)
- Return structured data
```typescript
this.server.tool(
    'parse_pdf',
    'Parse a PDF file and extract its text content and metadata',
    {
        filePath: z.string().describe('Path to the PDF file'),
        includeMetadata: z.boolean().optional().default(true),
        maxPages: z.number().optional(),
    },
    async ({ filePath, includeMetadata = true, maxPages }) => {
        // Validate that the file exists and is a PDF
        if (!existsSync(filePath)) {
            throw new Error(`File not found: ${filePath}`);
        }
        if (extname(filePath).toLowerCase() !== '.pdf') {
            throw new Error(`File is not a PDF: ${filePath}`);
        }

        // Parse the PDF and extract its text content
        const dataBuffer = readFileSync(filePath);
        const pdfData = await pdf(dataBuffer, { max: maxPages });

        // Collect metadata (page count, title, author, size, name)
        const metadata = {
            pageCount: pdfData.numpages,
            title: pdfData.info?.Title,
            author: pdfData.info?.Author,
            fileSize: dataBuffer.length,
            fileName: filePath.split('/').pop() || filePath,
        };

        // Return a structured result; trim metadata if not requested
        const result = {
            content: pdfData.text,
            metadata: includeMetadata ? metadata : {
                pageCount: metadata.pageCount,
                fileSize: metadata.fileSize,
                fileName: metadata.fileName,
            },
        };

        return {
            content: [{ type: 'text', text: JSON.stringify(result, null, 2) }],
        };
    }
);
```
**Key points:**
- We validate the file first (safety)
- We use `pdf-parse-debugging-disabled` so no debug output leaks onto stdout and corrupts the protocol stream
- We return JSON so the agent can easily parse the result
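For illustration, a successful `parse_pdf` call returns JSON shaped like this (all values hypothetical):

```json
{
  "content": "Q3 Financial Report\nRevenue grew 12% year over year...",
  "metadata": {
    "pageCount": 24,
    "title": "Q3 Financial Report",
    "author": "Finance Team",
    "fileSize": 482133,
    "fileName": "report.pdf"
  }
}
```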
## Step 6: Adding the Search Tool
The `extract_section` tool allows searching within PDFs:
```typescript
this.server.tool(
    'extract_section',
    'Extract specific content from a PDF document',
    {
        filePath: z.string().describe('Path to the PDF file'),
        searchTerm: z.string().optional().describe('Search term to find'),
    },
    async ({ filePath, searchTerm }) => {
        // Validate that the file exists and is a PDF
        if (!existsSync(filePath)) {
            throw new Error(`File not found: ${filePath}`);
        }
        const fileExtension = extname(filePath).toLowerCase();
        if (fileExtension !== '.pdf') {
            throw new Error(`File is not a PDF: ${filePath}`);
        }

        // Parse the PDF to get all of its text
        const pdfData = await pdf(readFileSync(filePath));
        let extractedContent = pdfData.text;

        // If a search term was provided, keep only matching lines
        if (searchTerm) {
            const lines = extractedContent.split('\n');
            const matchingLines = lines.filter(line =>
                line.toLowerCase().includes(searchTerm.toLowerCase())
            );
            extractedContent = matchingLines.join('\n');
        }

        const result = {
            fileName: filePath.split('/').pop() || filePath,
            totalPages: pdfData.numpages,
            extractedContent,
            searchTerm: searchTerm || null,
            contentLength: extractedContent.length,
        };

        return { content: [{ type: 'text', text: JSON.stringify(result, null, 2) }] };
    }
);
```
**What this does:**
- Parses the PDF to get all text
- If a search term is provided, filters lines containing that term
- Returns the filtered content
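The line-by-line filter is deliberately simple, but it drops the text around each hit. If you fork the server, one easy improvement is to keep a line of context on either side of every match; a sketch (the `filterWithContext` helper is hypothetical, not part of the published server):

```typescript
// Keep each matching line plus one line of context before and after,
// de-duplicating overlapping windows via a set of line indices.
function filterWithContext(text: string, searchTerm: string): string {
    const lines = text.split('\n');
    const term = searchTerm.toLowerCase();
    const keep = new Set<number>();
    lines.forEach((line, i) => {
        if (line.toLowerCase().includes(term)) {
            for (let j = Math.max(0, i - 1); j <= Math.min(lines.length - 1, i + 1); j++) {
                keep.add(j);
            }
        }
    });
    return lines.filter((_, i) => keep.has(i)).join('\n');
}
```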
## Step 7: Understanding the Agent Configuration
The agent configuration connects to the published MCP server:
```yaml
# agents/talk2pdf-agent/talk2pdf-agent.yml
mcpServers:
  talk2pdf:
    type: stdio
    command: npx
    args:
      - "@truffle-ai/talk2pdf-mcp"
    timeout: 30000
    connectionMode: strict

systemPrompt:
  contributors:
    - id: primary
      type: static
      priority: 0
      content: |
        You are a Talk2PDF Agent. You can parse PDF files, extract their text, metadata, and provide summaries or extract specific sections for LLM consumption.

        ## Your Capabilities
        - Parse PDF files and extract all text content and metadata
        - Extract specific sections or search for terms within a PDF
        - Provide intelligent analysis, summarization, and insights based on the extracted content
        - Handle errors gracefully and provide clear feedback

        Always ask for the file path if not provided. If a file is not a PDF or does not exist, inform the user.

llm:
  provider: openai
  model: gpt-4o-mini
  apiKey: $OPENAI_API_KEY
```
**Key configuration points:**
- The agent connects to the published MCP server via stdio
- We use `npx @truffle-ai/talk2pdf-mcp` to run the compiled server
- The system prompt tells the agent what it can do and how to behave
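The `apiKey: $OPENAI_API_KEY` line pulls the key from your environment, so export it before starting the agent:

```bash
export OPENAI_API_KEY=your-api-key-here
```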
## Step 8: Testing the Agent
Now we can test our agent:
```bash
# From the project root
saiki --agent ./agents/talk2pdf-agent/talk2pdf-agent.yml
```
Once started, try these interactions:
**Parse and summarize:**

```
Parse the PDF at /path/to/document.pdf and summarize the key points
```

**Search for content:**

```
Find all mentions of "budget" in the financial report at /path/to/report.pdf
```
## How It All Works Together
- **User** asks a question about a PDF
- **Agent** understands what the user wants
- **Agent** calls the appropriate tool (`parse_pdf` or `extract_section`)
- **MCP server** processes the PDF and returns structured data
- **Agent** analyzes the returned data and provides an intelligent response
- **User** gets a helpful, contextual answer
## What We've Accomplished
We've created a complete PDF parsing agent that demonstrates:
- Separation of concerns: Tools handle technical operations, agent handles intelligence
- Error handling: Proper validation and graceful error messages
- Flexible architecture: Easy to extend with more tools
- Distributed architecture: Published MCP server for easy deployment and updates
The agent can now parse PDFs, extract content, search for terms, and provide intelligent analysis - all through natural language interaction, using a published MCP server that's automatically installed.
## Next Steps
This pattern can be extended to:
- Add more document formats (DOCX, TXT; see the sketch below)
- Implement document comparison
- Add OCR for scanned PDFs
- Create document classification
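As an example of the first item, a plain-text tool could be registered alongside the PDF tools inside `registerTools()`. A minimal sketch (the `parse_txt` tool is hypothetical, not part of the published server):

```typescript
// Hypothetical companion tool: serve .txt files the same way
// parse_pdf serves PDFs, so the agent can handle plain text too.
this.server.tool(
    'parse_txt',
    'Read a plain-text file and return its content',
    {
        filePath: z.string().describe('Path to the .txt file'),
    },
    async ({ filePath }) => {
        if (!existsSync(filePath)) {
            throw new Error(`File not found: ${filePath}`);
        }
        if (extname(filePath).toLowerCase() !== '.txt') {
            throw new Error(`File is not a .txt file: ${filePath}`);
        }
        const content = readFileSync(filePath, 'utf-8');
        const result = {
            fileName: filePath.split('/').pop() || filePath,
            content,
            contentLength: content.length,
        };
        return { content: [{ type: 'text', text: JSON.stringify(result, null, 2) }] };
    }
);
```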
The key insight is that by separating the technical operations (published MCP server) from the intelligence (agent), we create a flexible, maintainable system that's easy to extend and debug. The published server approach means updates and improvements can be distributed automatically without requiring changes to the agent configuration.