Transform PDF tables into clean JSON data structures ready for your database, API, or analytics pipeline.
Stop copying PDF tables cell by cell into your database. Whether you're processing invoices, financial reports, or research data, manually transferring table data wastes hours and introduces errors. You need that quarterly sales report in your analytics pipeline, those invoice line items in your accounting system, and those research results in your data warehouse, all in clean, structured JSON.
Understanding PDF Table Structure
PDF tables aren't stored as tables. They're collections of positioned text elements that happen to look like tables when rendered. Each cell is just text at specific x,y coordinates. The visual alignment creates the illusion of structure, but there's no underlying table object to query.
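You can see this for yourself by dumping the positioned text runs from a page. A minimal sketch using pdfjs-dist (the library choice here is an assumption; any low-level PDF reader exposes similar data):

import fs from 'fs/promises';
import { getDocument } from 'pdfjs-dist';

// Dump every text run on page 1 with its x,y position -- there is no
// table object in the file, only strings placed at coordinates
async function dumpTextPositions(pdfPath: string) {
  const data = new Uint8Array(await fs.readFile(pdfPath));
  const doc = await getDocument({ data }).promise;
  const page = await doc.getPage(1);
  const content = await page.getTextContent();
  for (const item of content.items) {
    if ('str' in item) {
      // transform[4] and transform[5] are the x and y translation values
      console.log(`(${item.transform[4]}, ${item.transform[5]}) "${item.str}"`);
    }
  }
}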
Common table formats challenge extraction tools differently. Simple tables with clear borders and consistent spacing extract cleanly. Tables with merged cells require intelligent parsing to understand which cells span multiple columns or rows. Nested tables, where cells contain sub-tables, demand recursive extraction strategies.
JSON excels as the output format because it naturally represents hierarchical data, supports various data types, integrates with every programming language, and validates against schemas. Your downstream systems already speak JSON, making integration seamless.
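Concretely, every method below aims to produce output shaped roughly like this (the exact field names are up to you):

interface ExtractedTable {
  headers: string[];
  rows: Record<string, string>[];
}

// Example: a two-column pricing table becomes
const example: ExtractedTable = {
  headers: ['Item', 'Price'],
  rows: [
    { Item: 'Widget', Price: '9.99' },
    { Item: 'Gadget', Price: '24.50' }
  ]
};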
Method 1: Using Tabula for Simple Tables
Tabula specializes in extracting tables from PDFs with remarkable accuracy for well-structured documents.
Basic Table to JSON Conversion
import { exec } from 'child_process';
import { promisify } from 'util';
import fs from 'fs/promises';

const execAsync = promisify(exec);

async function extractWithTabula(pdfPath: string): Promise<any[]> {
  // Extract tables to JSON using the Tabula CLI
  const outputPath = 'tables.json';
  await execAsync(
    `java -jar tabula.jar --format JSON --pages all "${pdfPath}" > "${outputPath}"`
  );
  const jsonContent = await fs.readFile(outputPath, 'utf-8');
  const rawTables = JSON.parse(jsonContent);

  // Transform Tabula output to structured JSON. Tabula emits one object
  // per table; its cells live in the `data` property as rows of
  // { text, top, left, width, height } objects.
  return rawTables
    .map((table: any) => {
      const cellRows: any[][] = table.data || [];
      if (cellRows.length === 0) return null;

      // First row as headers
      const headers = cellRows[0].map((cell: any) => cell.text || '');

      // Convert remaining rows to objects keyed by header
      const rows = cellRows.slice(1).map((row: any[]) => {
        const obj: Record<string, string> = {};
        row.forEach((cell: any, index: number) => {
          const header = headers[index] || `column_${index}`;
          obj[header] = cell.text || '';
        });
        return obj;
      });

      return {
        headers,
        rows,
        rowCount: rows.length,
        columnCount: headers.length
      };
    })
    .filter((table: any) => table !== null);
}

// Convert specific table types
async function extractInvoiceTable(pdfPath: string) {
  const tables = await extractWithTabula(pdfPath);
  if (tables.length === 0) throw new Error('No tables found');

  // Find the line items table (usually the one with the most rows)
  const lineItemsTable = tables.reduce((largest, current) =>
    current.rowCount > largest.rowCount ? current : largest
  );

  // Transform to invoice structure
  const invoice = {
    lineItems: lineItemsTable.rows.map((row: any) => ({
      description: row['Description'] || row['Item'],
      quantity: parseFloat(row['Qty'] || row['Quantity'] || '0'),
      unitPrice: parseFloat(row['Unit Price'] || row['Price'] || '0'),
      total: parseFloat(row['Total'] || row['Amount'] || '0')
    })),
    subtotal: 0,
    tax: 0,
    total: 0
  };

  // Calculate totals
  invoice.subtotal = invoice.lineItems.reduce((sum, item) => sum + item.total, 0);
  return invoice;
}

// Usage
const tables = await extractWithTabula('report.pdf');
console.log(JSON.stringify(tables, null, 2));
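One caveat with the invoice mapping above: parseFloat chokes on values like "$1,234.56" or "(500.00)", which are common in real invoices. A small normalizer makes the mapping more robust (this helper is a sketch assuming period-decimal formatting; adjust for comma-decimal locales):

// Strip currency symbols and thousands separators before parsing;
// treat parenthesized values as negative (accounting notation)
function parseAmount(raw: string | undefined): number {
  if (!raw) return 0;
  const negative = /^\(.*\)$/.test(raw.trim());
  const cleaned = raw.replace(/[^0-9.\-]/g, '');
  const value = parseFloat(cleaned);
  if (isNaN(value)) return 0;
  return negative ? -Math.abs(value) : value;
}

// e.g. parseAmount('$1,234.56') === 1234.56, parseAmount('(500.00)') === -500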
Pros:
- Excellent accuracy for well-formatted tables
- Free and open-source
- Supports multiple output formats
- Works well with bordered tables
Cons:
- Requires Java runtime
- Struggles with borderless tables
- No built-in schema validation
- Limited customization options
Method 2: Using PDFPlumber with Custom Parsing
PDFPlumber provides fine-grained control over table extraction in Python; here we approximate the same positional approach in TypeScript using pdf-parse.
Advanced Table Detection
import PDFParser from 'pdf-parse';
import fs from 'fs/promises';

interface TableCell {
  text: string;
  x: number;
  y: number;
  width: number;
  height: number;
}

async function extractWithCustomParsing(pdfPath: string) {
  const dataBuffer = await fs.readFile(pdfPath);
  const data = await PDFParser(dataBuffer);

  // Extract positioned text elements
  const textElements = extractTextElements(data);

  // Detect table regions
  const tables = detectTables(textElements);

  // Convert to structured JSON
  return tables.map(table => convertTableToJSON(table));
}
function extractTextElements(pdfData: any): TableCell[] {
  const elements: TableCell[] = [];

  // Parse PDF content (implementation depends on PDF structure).
  // This is simplified -- a real implementation needs proper PDF parsing
  // with actual glyph coordinates rather than character counting.
  const lines: string[] = pdfData.text.split('\n');
  let y = 0;

  lines.forEach(line => {
    if (line.trim()) {
      // Simple column detection based on runs of two or more spaces
      const cells = line.split(/\s{2,}/);
      let x = 0;
      cells.forEach(cell => {
        elements.push({
          text: cell.trim(),
          x,
          y,
          width: cell.length * 8, // Approximate width
          height: 12
        });
        x += cell.length * 8 + 20;
      });
    }
    y += 15;
  });

  return elements;
}
function detectTables(elements: TableCell[]): TableCell[][] {
  // Group elements by proximity
  const tables: TableCell[][] = [];
  const used = new Set<number>();

  elements.forEach((element, index) => {
    if (used.has(index)) return;

    const table: TableCell[] = [element];
    used.add(index);

    // Find nearby elements that share a row (similar y) or column (similar x)
    elements.forEach((other, otherIndex) => {
      if (used.has(otherIndex)) return;
      const yDistance = Math.abs(element.y - other.y);
      const xDistance = Math.abs(element.x - other.x);
      if (yDistance < 20 || xDistance < 10) {
        table.push(other);
        used.add(otherIndex);
      }
    });

    if (table.length > 3) { // Minimum cells to count as a table
      tables.push(table);
    }
  });

  return tables;
}
function convertTableToJSON(tableCells: TableCell[]): any {
  // Sort cells top-to-bottom, then left-to-right within a row
  tableCells.sort((a, b) => {
    if (Math.abs(a.y - b.y) < 5) {
      return a.x - b.x;
    }
    return a.y - b.y;
  });

  // Group cells into rows by y proximity
  const rows: TableCell[][] = [];
  let currentRow: TableCell[] = [];
  let currentY = tableCells[0]?.y || 0;

  tableCells.forEach(cell => {
    if (Math.abs(cell.y - currentY) > 10) {
      if (currentRow.length > 0) {
        rows.push(currentRow);
      }
      currentRow = [cell];
      currentY = cell.y;
    } else {
      currentRow.push(cell);
    }
  });
  if (currentRow.length > 0) {
    rows.push(currentRow);
  }

  // Convert to JSON structure
  if (rows.length === 0) return null;

  const headers = rows[0].map(cell => cell.text);
  const data = rows.slice(1).map(row => {
    const obj: Record<string, string> = {};
    row.forEach((cell, index) => {
      const header = headers[index] || `col${index}`;
      obj[header] = cell.text;
    });
    return obj;
  });

  return {
    headers,
    data,
    metadata: {
      rowCount: data.length,
      columnCount: headers.length,
      position: {
        x: Math.min(...tableCells.map(c => c.x)),
        y: Math.min(...tableCells.map(c => c.y))
      }
    }
  };
}
// Usage with schema mapping
async function extractFinancialData(pdfPath: string) {
  const tables = await extractWithCustomParsing(pdfPath);

  // Map to financial schema
  const financialData = {
    quarters: [] as any[],
    totals: {} as any
  };

  tables.forEach(table => {
    if (!table) return; // convertTableToJSON returns null for empty regions
    if (table.headers.includes('Q1') || table.headers.includes('Quarter')) {
      // Quarterly data table
      financialData.quarters = table.data.map((row: any) => ({
        metric: row['Metric'] || row['Description'],
        q1: parseFloat(row['Q1'] || '0'),
        q2: parseFloat(row['Q2'] || '0'),
        q3: parseFloat(row['Q3'] || '0'),
        q4: parseFloat(row['Q4'] || '0'),
        total: parseFloat(row['Total'] || '0')
      }));
    }
  });

  return financialData;
}
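To see the row-grouping heuristic in action, you can feed convertTableToJSON a few hand-built cells (the coordinates below are made up for illustration):

// Two cells land in the same row when their y values differ by less
// than the 10-point threshold used in convertTableToJSON
const demoCells: TableCell[] = [
  { text: 'Metric', x: 0, y: 0, width: 48, height: 12 },
  { text: 'Q1', x: 120, y: 2, width: 16, height: 12 },
  { text: 'Revenue', x: 0, y: 20, width: 64, height: 12 },
  { text: '1200', x: 120, y: 21, width: 32, height: 12 }
];

console.log(convertTableToJSON(demoCells));
// → { headers: ['Metric', 'Q1'], data: [{ Metric: 'Revenue', Q1: '1200' }], ... }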
Pros:
- Full control over extraction logic
- Can handle complex layouts
- Custom schema mapping
- No external dependencies
Cons:
- Requires significant development effort
- Complex coordinate calculations
- May miss edge cases
- Harder to maintain
Method 3: PDF Vector's Ask API
PDF Vector's Ask API uses AI to understand table context and extract data directly into your defined JSON schema.
Define Your JSON Schema
import fs from 'fs/promises';
import { PDFVector } from 'pdfvector';

const client = new PDFVector({
  apiKey: 'pdfvector_your_api_key'
});

// Extract specific business data
async function extractSalesReport(pdfPath: string) {
  const pdfBuffer = await fs.readFile(pdfPath);

  const result = await client.ask({
    data: pdfBuffer,
    contentType: 'application/pdf',
    prompt: 'Extract sales data including products, quantities, revenues, and regional breakdowns',
    mode: 'json',
    schema: {
      type: 'object',
      properties: {
        reportPeriod: { type: 'string' },
        currency: { type: 'string' },
        productSales: {
          type: 'array',
          items: {
            type: 'object',
            properties: {
              productName: { type: 'string' },
              sku: { type: 'string' },
              category: { type: 'string' },
              unitsSold: { type: 'number' },
              revenue: { type: 'number' },
              averagePrice: { type: 'number' }
            },
            required: ['productName', 'unitsSold', 'revenue']
          }
        },
        regionalBreakdown: {
          type: 'array',
          items: {
            type: 'object',
            properties: {
              region: { type: 'string' },
              revenue: { type: 'number' },
              percentageOfTotal: { type: 'number' }
            }
          }
        },
        totals: {
          type: 'object',
          properties: {
            totalRevenue: { type: 'number' },
            totalUnits: { type: 'number' },
            averageOrderValue: { type: 'number' }
          }
        }
      }
    }
  });

  return result.json;
}

// Usage
const salesData = await extractSalesReport('Q4-sales-report.pdf');
console.log(`Total Revenue: ${salesData.totals.totalRevenue}`);
console.log(`Top Product: ${salesData.productSales[0].productName}`);
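Because the schema is standard JSON Schema, you can also re-validate the returned object before it enters your pipeline. A minimal sketch using Ajv (the validator choice is an assumption, and it presumes you lift the inline schema above into a salesReportSchema constant):

import Ajv from 'ajv';

// Re-check the extracted object against the same schema sent in the request
// (salesReportSchema = the schema object defined above, pulled into a constant)
const ajv = new Ajv();
const validateSalesReport = ajv.compile(salesReportSchema);

if (!validateSalesReport(salesData)) {
  console.error('Extraction did not match schema:', validateSalesReport.errors);
}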
Pros:
- AI understands context and relationships
- Works with any table format
- Direct schema-based extraction
- Handles complex and nested tables
- No preprocessing required
Cons:
- Requires API key
- Costs 3 credits per page
- Internet connection required
- Less control over extraction logic
Method 4: Using Apache Tika
Apache Tika provides a robust content extraction framework with table support.
Server Setup and Table Extraction
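Tika server runs as a standalone HTTP service, commonly via the apache/tika Docker image or the server jar, listening on port 9998 by default. Before sending documents, a quick reachability check is cheap insurance (endpoint and port below are Tika's defaults):

// GET /tika returns a plain-text greeting when the server is up
async function assertTikaAvailable(baseUrl = 'http://localhost:9998') {
  const response = await fetch(`${baseUrl}/tika`);
  if (!response.ok) {
    throw new Error(`Tika server not reachable at ${baseUrl}`);
  }
}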
import fs from 'fs/promises';

async function extractWithTika(pdfPath: string) {
  const pdfBuffer = await fs.readFile(pdfPath);

  // Call the Tika server
  const response = await fetch('http://localhost:9998/tika', {
    method: 'PUT',
    headers: {
      'Accept': 'application/json',
      'Content-Type': 'application/pdf'
    },
    body: pdfBuffer
  });
  const tikaOutput = await response.json();

  // Parse Tika's structured content (the exact output shape depends on
  // your Tika version and configured content handlers)
  return parseTableFromTikaOutput(tikaOutput);
}
function parseTableFromTikaOutput(tikaData: any): any[] {
  const tables: any[] = [];

  // Tika returns structured content with table markers
  if (tikaData['table:table']) {
    const rawTables = Array.isArray(tikaData['table:table'])
      ? tikaData['table:table']
      : [tikaData['table:table']];

    rawTables.forEach((table: any) => {
      const rows = table['table:row'] || [];
      const structuredTable = {
        headers: [] as string[],
        data: [] as any[]
      };

      rows.forEach((row: any, index: number) => {
        const cells = row['table:cell'] || [];
        if (index === 0) {
          // First row as headers
          structuredTable.headers = cells.map((cell: any) =>
            cell['table:content'] || ''
          );
        } else {
          // Data rows
          const rowData: Record<string, any> = {};
          cells.forEach((cell: any, cellIndex: number) => {
            const header = structuredTable.headers[cellIndex] || `col${cellIndex}`;
            rowData[header] = cell['table:content'] || '';
          });
          structuredTable.data.push(rowData);
        }
      });

      tables.push(structuredTable);
    });
  }

  return tables;
}
// Transform to specific formats
async function extractAndTransform(pdfPath: string) {
  const tables = await extractWithTika(pdfPath);

  // Transform for data analysis
  const analysisReady = tables.map(table => ({
    dimensions: {
      rows: table.data.length,
      columns: table.headers.length
    },
    headers: table.headers,
    data: table.data,
    statistics: calculateTableStats(table.data)
  }));

  return analysisReady;
}

function calculateTableStats(data: any[]): any {
  const stats: any = {
    numericColumns: {},
    textColumns: {}
  };
  if (data.length === 0) return stats;

  // Analyze each column
  Object.keys(data[0]).forEach(column => {
    const values = data.map(row => row[column]);
    const numericValues = values
      .map(v => parseFloat(v))
      .filter(v => !isNaN(v));

    if (numericValues.length > values.length * 0.5) {
      // Mostly numeric column
      stats.numericColumns[column] = {
        min: Math.min(...numericValues),
        max: Math.max(...numericValues),
        avg: numericValues.reduce((a, b) => a + b, 0) / numericValues.length,
        sum: numericValues.reduce((a, b) => a + b, 0)
      };
    } else {
      // Text column
      stats.textColumns[column] = {
        uniqueValues: new Set(values).size,
        mostCommon: findMostCommon(values)
      };
    }
  });

  return stats;
}

function findMostCommon(arr: string[]): string {
  const counts = arr.reduce((acc, val) => {
    acc[val] = (acc[val] || 0) + 1;
    return acc;
  }, {} as Record<string, number>);
  return Object.entries(counts)
    .sort(([, a], [, b]) => b - a)[0]?.[0] || '';
}
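Called on a report, extractAndTransform returns one analysis-ready object per table; for example (the file name and values below are illustrative):

const report = await extractAndTransform('dataset.pdf');
console.log(report[0].statistics);
// → { numericColumns: { Revenue: { min: ..., max: ..., avg: ..., sum: ... } },
//     textColumns: { Region: { uniqueValues: 4, mostCommon: 'EMEA' } } }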
Pros:
- Handles many file formats
- Good multilingual support
- Metadata extraction included
- Can run as a service
Cons:
- Requires server setup
- Java dependency
- Generic extraction not table-specific
- May need post-processing
Making the Right Choice
Use Tabula when:
- Your PDFs have well-formatted, bordered tables
- You need a simple, reliable solution
- You're comfortable with Java dependencies
- You can preprocess PDFs to improve quality
- Free and open-source is a requirement
Use PDFPlumber when:
- You need complete control over extraction
- Your tables have unique formatting
- You're building a specialized solution
- You have development resources
- Performance is critical
Use PDF Vector when:
- You need to handle any table format
- Accuracy is more important than cost
- You want structured JSON output immediately
- You're processing diverse document types
- Development speed matters
Use Apache Tika when:
- You're already using Tika for other extraction
- You need to process multiple file formats
- Metadata extraction is also required
- You can run a Tika server
- You need multilingual support
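If your pipeline handles mixed document sources, these heuristics can be encoded directly. A rough sketch (the traits and their precedence are illustrative, not prescriptive):

type Extractor = 'tabula' | 'custom' | 'pdfvector' | 'tika';

interface DocumentTraits {
  hasBorderedTables: boolean;   // clean ruled tables extract well locally
  diverseLayouts: boolean;      // many table formats favor AI extraction
  multipleFileFormats: boolean; // non-PDF inputs favor Tika
  offlineRequired: boolean;     // no network rules out API-based tools
}

function chooseExtractor(t: DocumentTraits): Extractor {
  if (t.multipleFileFormats) return 'tika';
  if (!t.offlineRequired && t.diverseLayouts) return 'pdfvector';
  if (t.hasBorderedTables) return 'tabula';
  return 'custom';
}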