Transform PDF tables into clean JSON data structures ready for your database, API, or analytics pipeline.
Stop copying PDF tables cell by cell into your database. Whether you're processing invoices, financial reports, or research data, manually transferring table data wastes hours and introduces errors. You need that quarterly sales report in your analytics pipeline, those invoice line items in your accounting system, and those research results in your data warehouse, all in clean, structured JSON.
Understanding PDF Table Structure
PDF tables aren't stored as tables. They're collections of positioned text elements that happen to look like tables when rendered. Each cell is just text at specific x,y coordinates. The visual alignment creates the illusion of structure, but there's no underlying table object to query.
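You can see this for yourself by dumping the positioned text runs from a page. A minimal sketch using pdfjs-dist (the library choice here is an assumption; any low-level PDF reader exposes similar data):

import fs from 'fs/promises';
import { getDocument } from 'pdfjs-dist';

// Dump every text run on page 1 with its x,y position -- there is no
// table object in the file, only strings placed at coordinates
async function dumpTextPositions(pdfPath: string) {
  const data = new Uint8Array(await fs.readFile(pdfPath));
  const doc = await getDocument({ data }).promise;
  const page = await doc.getPage(1);
  const content = await page.getTextContent();
  for (const item of content.items) {
    if ('str' in item) {
      // transform[4] and transform[5] are the x and y translation values
      console.log(`(${item.transform[4]}, ${item.transform[5]}) "${item.str}"`);
    }
  }
}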
Common table formats challenge extraction tools differently. Simple tables with clear borders and consistent spacing extract cleanly. Tables with merged cells require intelligent parsing to understand which cells span multiple columns or rows. Nested tables, where cells contain sub-tables, demand recursive extraction strategies.
JSON excels as the output format because it naturally represents hierarchical data, supports various data types, integrates with every programming language, and validates against schemas. Your downstream systems already speak JSON, making integration seamless.
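Concretely, every method below aims to produce output shaped roughly like this (the exact field names are up to you):

interface ExtractedTable {
  headers: string[];
  rows: Record<string, string>[];
}

// Example: a two-column pricing table becomes
const example: ExtractedTable = {
  headers: ['Item', 'Price'],
  rows: [
    { Item: 'Widget', Price: '9.99' },
    { Item: 'Gadget', Price: '24.50' }
  ]
};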
Method 1: Using Tabula for Simple Tables
Tabula specializes in extracting tables from PDFs with remarkable accuracy for well-structured documents.
Basic Table to JSON Conversion
import { exec } from 'child_process';
import { promisify } from 'util';
import fs from 'fs/promises';

const execAsync = promisify(exec);

async function extractWithTabula(pdfPath: string): Promise<any[]> {
  // Extract tables to JSON using the Tabula CLI
  const outputPath = 'tables.json';
  await execAsync(
    `java -jar tabula.jar --format JSON --pages all "${pdfPath}" > "${outputPath}"`
  );
  const jsonContent = await fs.readFile(outputPath, 'utf-8');
  const rawTables = JSON.parse(jsonContent);

  // Transform Tabula output to structured JSON. Tabula emits one object
  // per table; its cells live in the `data` property as rows of
  // { text, top, left, width, height } objects.
  return rawTables
    .map((table: any) => {
      const cellRows: any[][] = table.data || [];
      if (cellRows.length === 0) return null;

      // First row as headers
      const headers = cellRows[0].map((cell: any) => cell.text || '');

      // Convert remaining rows to objects keyed by header
      const rows = cellRows.slice(1).map((row: any[]) => {
        const obj: Record<string, string> = {};
        row.forEach((cell: any, index: number) => {
          const header = headers[index] || `column_${index}`;
          obj[header] = cell.text || '';
        });
        return obj;
      });

      return {
        headers,
        rows,
        rowCount: rows.length,
        columnCount: headers.length
      };
    })
    .filter((table: any) => table !== null);
}

// Convert specific table types
async function extractInvoiceTable(pdfPath: string) {
  const tables = await extractWithTabula(pdfPath);
  if (tables.length === 0) throw new Error('No tables found');

  // Find the line items table (usually the one with the most rows)
  const lineItemsTable = tables.reduce((largest, current) =>
    current.rowCount > largest.rowCount ? current : largest
  );

  // Transform to invoice structure
  const invoice = {
    lineItems: lineItemsTable.rows.map((row: any) => ({
      description: row['Description'] || row['Item'],
      quantity: parseFloat(row['Qty'] || row['Quantity'] || '0'),
      unitPrice: parseFloat(row['Unit Price'] || row['Price'] || '0'),
      total: parseFloat(row['Total'] || row['Amount'] || '0')
    })),
    subtotal: 0,
    tax: 0,
    total: 0
  };

  // Calculate totals
  invoice.subtotal = invoice.lineItems.reduce((sum, item) => sum + item.total, 0);
  return invoice;
}

// Usage
const tables = await extractWithTabula('report.pdf');
console.log(JSON.stringify(tables, null, 2));
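One caveat with the invoice mapping above: parseFloat chokes on values like "$1,234.56" or "(500.00)", which are common in real invoices. A small normalizer makes the mapping more robust (this helper is a sketch assuming period-decimal formatting; adjust for comma-decimal locales):

// Strip currency symbols and thousands separators before parsing;
// treat parenthesized values as negative (accounting notation)
function parseAmount(raw: string | undefined): number {
  if (!raw) return 0;
  const negative = /^\(.*\)$/.test(raw.trim());
  const cleaned = raw.replace(/[^0-9.\-]/g, '');
  const value = parseFloat(cleaned);
  if (isNaN(value)) return 0;
  return negative ? -Math.abs(value) : value;
}

// e.g. parseAmount('$1,234.56') === 1234.56, parseAmount('(500.00)') === -500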
Pros:
- Excellent accuracy for well-formatted tables
- Free and open-source
- Supports multiple output formats
- Works well with bordered tables
Cons:
- Requires Java runtime
- Struggles with borderless tables
- No built-in schema validation
- Limited customization options
Method 2: Using PDFPlumber with Custom Parsing
PDFPlumber provides fine-grained control over table extraction in Python; here we approximate the same positional approach in TypeScript using pdf-parse.
Advanced Table Detection
import PDFParser from 'pdf-parse';
import fs from 'fs/promises';

interface TableCell {
  text: string;
  x: number;
  y: number;
  width: number;
  height: number;
}

async function extractWithCustomParsing(pdfPath: string) {
  const dataBuffer = await fs.readFile(pdfPath);
  const data = await PDFParser(dataBuffer);

  // Extract positioned text elements
  const textElements = extractTextElements(data);

  // Detect table regions
  const tables = detectTables(textElements);

  // Convert to structured JSON
  return tables.map(table => convertTableToJSON(table));
}
function extractTextElements(pdfData: any): TableCell[] {
  const elements: TableCell[] = [];

  // Parse PDF content (implementation depends on PDF structure).
  // This is simplified -- a real implementation needs proper PDF parsing
  // with actual glyph coordinates rather than character counting.
  const lines: string[] = pdfData.text.split('\n');
  let y = 0;

  lines.forEach(line => {
    if (line.trim()) {
      // Simple column detection based on runs of two or more spaces
      const cells = line.split(/\s{2,}/);
      let x = 0;
      cells.forEach(cell => {
        elements.push({
          text: cell.trim(),
          x,
          y,
          width: cell.length * 8, // Approximate width
          height: 12
        });
        x += cell.length * 8 + 20;
      });
    }
    y += 15;
  });

  return elements;
}
function detectTables(elements: TableCell[]): TableCell[][] {
  // Group elements by proximity
  const tables: TableCell[][] = [];
  const used = new Set<number>();

  elements.forEach((element, index) => {
    if (used.has(index)) return;

    const table: TableCell[] = [element];
    used.add(index);

    // Find nearby elements that share a row (similar y) or column (similar x)
    elements.forEach((other, otherIndex) => {
      if (used.has(otherIndex)) return;
      const yDistance = Math.abs(element.y - other.y);
      const xDistance = Math.abs(element.x - other.x);
      if (yDistance < 20 || xDistance < 10) {
        table.push(other);
        used.add(otherIndex);
      }
    });

    if (table.length > 3) { // Minimum cells to count as a table
      tables.push(table);
    }
  });

  return tables;
}
function convertTableToJSON(tableCells: TableCell[]): any {
  // Sort cells top-to-bottom, then left-to-right within a row
  tableCells.sort((a, b) => {
    if (Math.abs(a.y - b.y) < 5) {
      return a.x - b.x;
    }
    return a.y - b.y;
  });

  // Group cells into rows by y proximity
  const rows: TableCell[][] = [];
  let currentRow: TableCell[] = [];
  let currentY = tableCells[0]?.y || 0;

  tableCells.forEach(cell => {
    if (Math.abs(cell.y - currentY) > 10) {
      if (currentRow.length > 0) {
        rows.push(currentRow);
      }
      currentRow = [cell];
      currentY = cell.y;
    } else {
      currentRow.push(cell);
    }
  });
  if (currentRow.length > 0) {
    rows.push(currentRow);
  }

  // Convert to JSON structure
  if (rows.length === 0) return null;

  const headers = rows[0].map(cell => cell.text);
  const data = rows.slice(1).map(row => {
    const obj: Record<string, string> = {};
    row.forEach((cell, index) => {
      const header = headers[index] || `col${index}`;
      obj[header] = cell.text;
    });
    return obj;
  });

  return {
    headers,
    data,
    metadata: {
      rowCount: data.length,
      columnCount: headers.length,
      position: {
        x: Math.min(...tableCells.map(c => c.x)),
        y: Math.min(...tableCells.map(c => c.y))
      }
    }
  };
}
// Usage with schema mapping
async function extractFinancialData(pdfPath: string) {
  const tables = await extractWithCustomParsing(pdfPath);

  // Map to financial schema
  const financialData = {
    quarters: [] as any[],
    totals: {} as any
  };

  tables.forEach(table => {
    if (!table) return; // convertTableToJSON returns null for empty regions
    if (table.headers.includes('Q1') || table.headers.includes('Quarter')) {
      // Quarterly data table
      financialData.quarters = table.data.map((row: any) => ({
        metric: row['Metric'] || row['Description'],
        q1: parseFloat(row['Q1'] || '0'),
        q2: parseFloat(row['Q2'] || '0'),
        q3: parseFloat(row['Q3'] || '0'),
        q4: parseFloat(row['Q4'] || '0'),
        total: parseFloat(row['Total'] || '0')
      }));
    }
  });

  return financialData;
}
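To see the row-grouping heuristic in action, you can feed convertTableToJSON a few hand-built cells (the coordinates below are made up for illustration):

// Two cells land in the same row when their y values differ by less
// than the 10-point threshold used in convertTableToJSON
const demoCells: TableCell[] = [
  { text: 'Metric', x: 0, y: 0, width: 48, height: 12 },
  { text: 'Q1', x: 120, y: 2, width: 16, height: 12 },
  { text: 'Revenue', x: 0, y: 20, width: 64, height: 12 },
  { text: '1200', x: 120, y: 21, width: 32, height: 12 }
];

console.log(convertTableToJSON(demoCells));
// → { headers: ['Metric', 'Q1'], data: [{ Metric: 'Revenue', Q1: '1200' }], ... }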
Pros:
- Full control over extraction logic
- Can handle complex layouts
- Custom schema mapping
- No external dependencies
Cons:
- Requires significant development effort
- Complex coordinate calculations
- May miss edge cases
- Harder to maintain
Method 3: PDF Vector's Ask API
PDF Vector's Ask API uses AI to understand table context and extract data directly into your defined JSON schema.
Define Your JSON Schema
import fs from 'fs/promises';
import { PDFVector } from 'pdfvector';

const client = new PDFVector({
  apiKey: 'pdfvector_your_api_key'
});

// Extract specific business data
async function extractSalesReport(pdfPath: string) {
  const pdfBuffer = await fs.readFile(pdfPath);

  const result = await client.ask({
    data: pdfBuffer,
    contentType: 'application/pdf',
    prompt: 'Extract sales data including products, quantities, revenues, and regional breakdowns',
    mode: 'json',
    schema: {
      type: 'object',
      properties: {
        reportPeriod: { type: 'string' },
        currency: { type: 'string' },
        productSales: {
          type: 'array',
          items: {
            type: 'object',
            properties: {
              productName: { type: 'string' },
              sku: { type: 'string' },
              category: { type: 'string' },
              unitsSold: { type: 'number' },
              revenue: { type: 'number' },
              averagePrice: { type: 'number' }
            },
            required: ['productName', 'unitsSold', 'revenue']
          }
        },
        regionalBreakdown: {
          type: 'array',
          items: {
            type: 'object',
            properties: {
              region: { type: 'string' },
              revenue: { type: 'number' },
              percentageOfTotal: { type: 'number' }
            }
          }
        },
        totals: {
          type: 'object',
          properties: {
            totalRevenue: { type: 'number' },
            totalUnits: { type: 'number' },
            averageOrderValue: { type: 'number' }
          }
        }
      }
    }
  });

  return result.json;
}

// Usage
const salesData = await extractSalesReport('Q4-sales-report.pdf');
console.log(`Total Revenue: ${salesData.totals.totalRevenue}`);
console.log(`Top Product: ${salesData.productSales[0].productName}`);
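Because the schema is standard JSON Schema, you can also re-validate the returned object before it enters your pipeline. A minimal sketch using Ajv (the validator choice is an assumption, and it presumes you lift the inline schema above into a salesReportSchema constant):

import Ajv from 'ajv';

// Re-check the extracted object against the same schema sent in the request
// (salesReportSchema = the schema object defined above, pulled into a constant)
const ajv = new Ajv();
const validateSalesReport = ajv.compile(salesReportSchema);

if (!validateSalesReport(salesData)) {
  console.error('Extraction did not match schema:', validateSalesReport.errors);
}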
Pros:
- AI understands context and relationships
- Works with any table format
- Direct schema-based extraction
- Handles complex and nested tables
- No preprocessing required
Cons:
- Requires API key
- Costs 3 credits per page
- Internet connection required
- Less control over extraction logic
Method 4: Using Apache Tika
Apache Tika provides a robust content extraction framework with table support.
Server Setup and Table Extraction
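Tika server runs as a standalone HTTP service, commonly via the apache/tika Docker image or the server jar, listening on port 9998 by default. Before sending documents, a quick reachability check is cheap insurance (endpoint and port below are Tika's defaults):

// GET /tika returns a plain-text greeting when the server is up
async function assertTikaAvailable(baseUrl = 'http://localhost:9998') {
  const response = await fetch(`${baseUrl}/tika`);
  if (!response.ok) {
    throw new Error(`Tika server not reachable at ${baseUrl}`);
  }
}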
import fs from 'fs/promises';

async function extractWithTika(pdfPath: string) {
  const pdfBuffer = await fs.readFile(pdfPath);

  // Call the Tika server
  const response = await fetch('http://localhost:9998/tika', {
    method: 'PUT',
    headers: {
      'Accept': 'application/json',
      'Content-Type': 'application/pdf'
    },
    body: pdfBuffer
  });
  const tikaOutput = await response.json();

  // Parse Tika's structured content (the exact output shape depends on
  // your Tika version and configured content handlers)
  return parseTableFromTikaOutput(tikaOutput);
}
function parseTableFromTikaOutput(tikaData: any): any[] {
  const tables: any[] = [];

  // Tika returns structured content with table markers
  if (tikaData['table:table']) {
    const rawTables = Array.isArray(tikaData['table:table'])
      ? tikaData['table:table']
      : [tikaData['table:table']];

    rawTables.forEach((table: any) => {
      const rows = table['table:row'] || [];
      const structuredTable = {
        headers: [] as string[],
        data: [] as any[]
      };

      rows.forEach((row: any, index: number) => {
        const cells = row['table:cell'] || [];
        if (index === 0) {
          // First row as headers
          structuredTable.headers = cells.map((cell: any) =>
            cell['table:content'] || ''
          );
        } else {
          // Data rows
          const rowData: Record<string, any> = {};
          cells.forEach((cell: any, cellIndex: number) => {
            const header = structuredTable.headers[cellIndex] || `col${cellIndex}`;
            rowData[header] = cell['table:content'] || '';
          });
          structuredTable.data.push(rowData);
        }
      });

      tables.push(structuredTable);
    });
  }

  return tables;
}
// Transform to specific formats
async function extractAndTransform(pdfPath: string) {
  const tables = await extractWithTika(pdfPath);

  // Transform for data analysis
  const analysisReady = tables.map(table => ({
    dimensions: {
      rows: table.data.length,
      columns: table.headers.length
    },
    headers: table.headers,
    data: table.data,
    statistics: calculateTableStats(table.data)
  }));

  return analysisReady;
}

function calculateTableStats(data: any[]): any {
  const stats: any = {
    numericColumns: {},
    textColumns: {}
  };
  if (data.length === 0) return stats;

  // Analyze each column
  Object.keys(data[0]).forEach(column => {
    const values = data.map(row => row[column]);
    const numericValues = values
      .map(v => parseFloat(v))
      .filter(v => !isNaN(v));

    if (numericValues.length > values.length * 0.5) {
      // Mostly numeric column
      stats.numericColumns[column] = {
        min: Math.min(...numericValues),
        max: Math.max(...numericValues),
        avg: numericValues.reduce((a, b) => a + b, 0) / numericValues.length,
        sum: numericValues.reduce((a, b) => a + b, 0)
      };
    } else {
      // Text column
      stats.textColumns[column] = {
        uniqueValues: new Set(values).size,
        mostCommon: findMostCommon(values)
      };
    }
  });

  return stats;
}

function findMostCommon(arr: string[]): string {
  const counts = arr.reduce((acc, val) => {
    acc[val] = (acc[val] || 0) + 1;
    return acc;
  }, {} as Record<string, number>);
  return Object.entries(counts)
    .sort(([, a], [, b]) => b - a)[0]?.[0] || '';
}
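Called on a report, extractAndTransform returns one analysis-ready object per table; for example (the file name and values below are illustrative):

const report = await extractAndTransform('dataset.pdf');
console.log(report[0].statistics);
// → { numericColumns: { Revenue: { min: ..., max: ..., avg: ..., sum: ... } },
//     textColumns: { Region: { uniqueValues: 4, mostCommon: 'EMEA' } } }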
Pros:
- Handles many file formats
- Good multilingual support
- Metadata extraction included
- Can run as a service
Cons:
- Requires server setup
- Java dependency
- Generic extraction not table-specific
- May need post-processing
Making the Right Choice
Use Tabula when:
- Your PDFs have well-formatted, bordered tables
- You need a simple, reliable solution
- You're comfortable with Java dependencies
- You can preprocess PDFs to improve quality
- Free and open-source is a requirement
Use PDFPlumber when:
- You need complete control over extraction
- Your tables have unique formatting
- You're building a specialized solution
- You have development resources
- Performance is critical
Use PDF Vector when:
- You need to handle any table format
- Accuracy is more important than cost
- You want structured JSON output immediately
- You're processing diverse document types
- Development speed matters
Use Apache Tika when:
- You're already using Tika for other extraction
- You need to process multiple file formats
- Metadata extraction is also required
- You can run a Tika server
- You need multilingual support
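If your pipeline handles mixed document sources, these heuristics can be encoded directly. A rough sketch (the traits and their precedence are illustrative, not prescriptive):

type Extractor = 'tabula' | 'custom' | 'pdfvector' | 'tika';

interface DocumentTraits {
  hasBorderedTables: boolean;   // clean ruled tables extract well locally
  diverseLayouts: boolean;      // many table formats favor AI extraction
  multipleFileFormats: boolean; // non-PDF inputs favor Tika
  offlineRequired: boolean;     // no network rules out API-based tools
}

function chooseExtractor(t: DocumentTraits): Extractor {
  if (t.multipleFileFormats) return 'tika';
  if (!t.offlineRequired && t.diverseLayouts) return 'pdfvector';
  if (t.hasBorderedTables) return 'tabula';
  return 'custom';
}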