Extract PDF Data to JSON Format

Learn how to convert PDF documents into structured JSON data using four different methods, from open-source libraries to API services.

You've got 50 invoices to process, and manually copying data is not an option. We've all been there, staring at a pile of PDFs that need to become structured data for your database, CRM, or analytics tool. The good news? You can automate this entire process and get clean JSON output in minutes, not hours.

Understanding PDF Data Extraction

PDFs were designed for consistent visual presentation, not data extraction. Unlike HTML or XML, PDFs don't have a logical structure that makes extracting data straightforward. Text might be stored as individual characters, tables could be just positioned text blocks, and scanned documents contain no text at all, only images of pages.

That's where JSON comes in. As the universal data exchange format, JSON lets you transform unstructured PDF content into something your applications can actually use. Whether you're building an invoice processing system, extracting research data, or parsing forms, converting to JSON opens up endless possibilities.
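
To make the "individual characters" point concrete, here is a minimal sketch of what an extraction library has to do before you ever see a line of text. The character records are made up for illustration; they are shaped like the per-character dicts that pdfplumber exposes in `page.chars` (text plus coordinates), and the `y_tolerance` value is an assumption you would tune per document.

```python
# Hypothetical character records in arbitrary order, as a PDF may store them
chars = [
    {"text": "4", "x0": 50.0, "top": 120.0},
    {"text": "T", "x0": 50.0, "top": 100.2},
    {"text": "o", "x0": 56.0, "top": 100.0},
    {"text": "2", "x0": 56.0, "top": 120.1},
    {"text": "t", "x0": 61.0, "top": 100.1},
    {"text": "a", "x0": 66.0, "top": 100.2},
    {"text": "l", "x0": 71.0, "top": 100.0},
]


def chars_to_lines(chars, y_tolerance=2.0):
    """Rebuild text lines from positioned characters."""
    lines = []
    # Walk the characters top to bottom, starting a new line whenever
    # the vertical position jumps by more than the tolerance
    for ch in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        if lines and abs(lines[-1]["top"] - ch["top"]) <= y_tolerance:
            lines[-1]["chars"].append(ch)
        else:
            lines.append({"top": ch["top"], "chars": [ch]})
    # Within each line, order characters left to right
    return [
        "".join(c["text"] for c in sorted(line["chars"], key=lambda c: c["x0"]))
        for line in lines
    ]


lines = chars_to_lines(chars)
print(lines)
```

Real libraries also have to infer spaces, columns, and reading order, which is why the same PDF can come out differently from different tools.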

Method 1: Using Python with pdfplumber

pdfplumber is a Python library that excels at extracting text and tables from PDFs. It's particularly good with tabular data, making it a solid choice for invoices and reports.

Installation

pip install pdfplumber

Implementation

import pdfplumber
import json

def extract_invoice_data(pdf_path):
    invoice_data = {
        "invoice_number": "",
        "date": "",
        "total": 0,
        "line_items": []
    }
    
    with pdfplumber.open(pdf_path) as pdf:
        first_page = pdf.pages[0]
        # extract_text() can return None for a page with no text layer
        text = first_page.extract_text() or ""
        
        # Extract invoice number (simple pattern matching)
        if "Invoice #" in text:
            invoice_data["invoice_number"] = text.split("Invoice #")[1].split("\n")[0].strip()
        
        # Extract tables for line items
        tables = first_page.extract_tables()
        if tables:
            # Assume first table contains line items
            for row in tables[0][1:]:  # Skip header row
                if len(row) >= 3:
                    invoice_data["line_items"].append({
                        "description": row[0],
                        "quantity": row[1],
                        "price": row[2]
                    })
    
    return json.dumps(invoice_data, indent=2)

# Usage
result = extract_invoice_data("invoice.pdf")
print(result)
# Output: {"invoice_number": "INV-2024-001", "date": "", "total": 0, "line_items": [...]}
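
Notice that "date" and "total" come back empty: the function never fills them. One way to close that gap is a pair of regular expressions over the page text. This is only a sketch; the label formats ("Date: YYYY-MM-DD", "Total: $1,234.50") and the `parse_invoice_fields` helper are assumptions for illustration, so adjust the patterns to whatever your invoices actually print.

```python
import re


def parse_invoice_fields(text):
    """Pull a date and a total out of raw page text (assumed label formats)."""
    fields = {"date": "", "total": 0.0}

    # Assumes an ISO-style date after a "Date:" label
    date_match = re.search(r"Date:\s*(\d{4}-\d{2}-\d{2})", text)
    if date_match:
        fields["date"] = date_match.group(1)

    # Anchored to the start of a line so "Subtotal:" is not picked up
    total_match = re.search(r"^Total:?\s*\$?([\d,]+\.\d{2})", text, re.MULTILINE)
    if total_match:
        fields["total"] = float(total_match.group(1).replace(",", ""))

    return fields


sample_text = (
    "Invoice # INV-2024-001\n"
    "Date: 2024-01-15\n"
    "Subtotal: $1,200.00\n"
    "Total: $1,234.50"
)
fields = parse_invoice_fields(sample_text)
print(fields)
```

Inside `extract_invoice_data` you could merge the result with `invoice_data.update(parse_invoice_fields(text))`.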

Pros:

Free and open-source
Excellent table extraction capabilities
Works well with standard PDF layouts
Good documentation and community support

Cons:

Struggles with complex layouts or rotated text
No built-in OCR, so scanned PDFs need a separate OCR step first
Requires custom logic for each document type
No built-in AI understanding of content
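
The "custom logic" point shows up quickly with tables. pdfplumber returns every cell as a string, or `None` for an empty cell, so quantities and prices need cleaning before they are useful as JSON numbers. A minimal sketch, assuming the three-column layout from the example above (the `clean_row` helper and the sample rows are hypothetical):

```python
def clean_row(row):
    """Convert one raw table row into a typed line item, or None if unusable."""
    # Cells may be None or contain line breaks; normalize to single-spaced text
    cells = [" ".join((cell or "").split()) for cell in row]
    if len(cells) < 3 or not cells[0]:
        return None
    try:
        quantity = float(cells[1].replace(",", ""))
        price = float(cells[2].replace("$", "").replace(",", ""))
    except ValueError:
        # Not numeric: likely a repeated header or a label row
        return None
    return {"description": cells[0], "quantity": quantity, "price": price}


raw_rows = [
    ["Widget", "2", "$1,050.00"],
    ["Description", "Qty", "Price"],  # repeated header
    [None, None, None],               # empty row
    ["Support\nplan", "1", "99.5"],   # wrapped cell
]
line_items = [item for item in map(clean_row, raw_rows) if item]
print(line_items)
```

Only the "$" symbol and comma thousands separators are handled here; other currencies or decimal commas would need their own rules.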

Method 2: Using Node.js with pdf-parse

pdf-parse is a lightweight Node.js library for basic PDF text extraction. While it doesn't have advanced features, it's perfect for simple extraction tasks.

Installation

npm install pdf-parse

Implementation

const fs = require('fs');
const pdf = require('pdf-parse');

async function extractPDFData(pdfPath) {
    const dataBuffer = fs.readFileSync(pdfPath);
    
    try {
        const data = await pdf(dataBuffer);
        
        // Simple extraction requires parsing the text
        const lines = data.text.split('\n');
        const jsonData = {
            totalPages: data.numpages,
            extractedText: lines,
            metadata: data.info
        };
        
        // Custom parsing logic based on your PDF structure
        const invoiceData = {
            invoice_number: '',
            items: []
        };
        
        lines.forEach(line => {
            if (line.includes('Invoice #')) {
                invoiceData.invoice_number = line.split('#')[1]?.trim();
            }
            // Add more parsing logic as needed
        });
        
        return JSON.stringify(invoiceData, null, 2);
    } catch (error) {
        console.error('Error:', error);
        return null;
    }
}

// Usage
extractPDFData('./invoice.pdf').then(result => {
    console.log(result);
});

Pros:

Very lightweight (minimal dependencies)
Fast processing for simple PDFs
Easy to integrate into Node.js applications
Good for basic text extraction

Cons:

No table extraction capabilities
Limited formatting preservation
Requires extensive custom parsing logic
Not suitable for complex documents
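
To give a feel for what "custom parsing logic" means when all you have is lines of text, here is a sketch of recovering line items with a regular expression. It is written in Python to match the other sketches in this article; the same pattern ports to JavaScript with `(?<name>...)` groups. The row shape (description, integer quantity, price) and the sample lines are assumptions.

```python
import re

# Assumed row shape: "<description> <quantity> <price>", e.g. "Blue widget 2 19.99"
LINE_ITEM = re.compile(
    r"^(?P<description>.+?)\s+(?P<quantity>\d+)\s+\$?(?P<price>\d+(?:\.\d{2})?)$"
)


def parse_line_items(lines):
    """Keep only the lines that look like table rows and type their fields."""
    items = []
    for line in lines:
        match = LINE_ITEM.match(line.strip())
        if match:
            items.append({
                "description": match["description"],
                "quantity": int(match["quantity"]),
                "price": float(match["price"]),
            })
    return items


text_lines = [
    "Invoice # INV-2024-001",
    "Description Qty Price",
    "Blue widget 2 19.99",
    "Setup fee 1 $150.00",
    "Thank you!",
]
items = parse_line_items(text_lines)
print(items)
```

This breaks as soon as a description wraps onto a second line or a price uses thousands separators, which is exactly the fragility the list above warns about.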

Method 3: Using PDF Vector's Ask API

PDF Vector's Ask API is an AI-powered service that extracts structured data directly into a custom JSON schema you define. This eliminates the need for complex parsing logic.

Installation

npm install pdfvector

Implementation

import { PDFVector } from 'pdfvector';

const client = new PDFVector({ 
    apiKey: 'pdfvector_your_api_key' 
});

async function extractInvoiceToJSON(pdfUrl: string) {
    const result = await client.ask({
        url: pdfUrl,
        prompt: "Extract invoice information including all line items",
        mode: "json",
        schema: {
            type: "object",
            properties: {
                invoiceNumber: { type: "string" },
                issueDate: { type: "string" },
                dueDate: { type: "string" },
                vendorInfo: {
                    type: "object",
                    properties: {
                        name: { type: "string" },
                        address: { type: "string" },
                        taxId: { type: "string" }
                    }
                },
                customerInfo: {
                    type: "object",
                    properties: {
                        name: { type: "string" },
                        address: { type: "string" }
                    }
                },
                lineItems: {
                    type: "array",
                    items: {
                        type: "object",
                        properties: {
                            description: { type: "string" },
                            quantity: { type: "number" },
                            unitPrice: { type: "number" },
                            total: { type: "number" }
                        }
                    }
                },
                subtotal: { type: "number" },
                tax: { type: "number" },
                total: { type: "number" }
            },
            required: ["invoiceNumber", "total", "lineItems"]
        }
    });
    
    return result.json;
}

// Usage with URL
const invoiceData = await extractInvoiceToJSON('https://example.com/invoice.pdf');
console.log(JSON.stringify(invoiceData, null, 2));

// Usage with local file
import { readFileSync } from 'fs';
const fileBuffer = readFileSync('./invoice.pdf');
const localResult = await client.ask({
    data: fileBuffer,
    contentType: 'application/pdf',
    prompt: "Extract invoice information",
    mode: "json",
    schema: { /* same schema */ }
});
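
Whichever extractor produces the JSON, it is worth checking the result before it reaches your database. The sketch below is a deliberately small validator, not a full JSON Schema implementation: it understands only the keywords used in the example schema above (`type`, `properties`, `required`, `items`). The `validate` function and the trimmed schema are illustrative, not part of any SDK.

```python
def validate(value, schema, path="$"):
    """Return a list of problems; an empty list means the value fits."""
    type_checks = {
        "object": lambda v: isinstance(v, dict),
        "array": lambda v: isinstance(v, list),
        "string": lambda v: isinstance(v, str),
        # bool is a subclass of int in Python, so exclude it explicitly
        "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    }
    expected = schema.get("type")
    if expected in type_checks and not type_checks[expected](value):
        return [f"{path}: expected {expected}"]

    errors = []
    if expected == "object":
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}.{key}: missing required field")
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                errors.extend(validate(value[key], sub_schema, f"{path}.{key}"))
    elif expected == "array" and "items" in schema:
        for index, item in enumerate(value):
            errors.extend(validate(item, schema["items"], f"{path}[{index}]"))
    return errors


# Trimmed-down version of the invoice schema
schema = {
    "type": "object",
    "properties": {
        "invoiceNumber": {"type": "string"},
        "total": {"type": "number"},
        "lineItems": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"quantity": {"type": "number"}},
            },
        },
    },
    "required": ["invoiceNumber", "total", "lineItems"],
}

# Hypothetical extraction result with two problems
extracted = {"invoiceNumber": "INV-1", "lineItems": [{"quantity": "2"}]}
problems = validate(extracted, schema)
print(problems)
```

For production use, a maintained JSON Schema validator is the safer choice; this version only shows the idea.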

Pros:

AI understands context and document structure
No parsing logic needed since you just define your schema
Handles complex layouts, tables, and multi-page documents
Works with scanned PDFs (OCR built-in)

Cons:

Requires API key (not self-hosted)
Costs 2 credits per page
Internet connection required
Processing time depends on document size
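
Because pricing is per page, a quick estimate before a batch run is simple arithmetic. A rough sketch for the 50 invoices from the introduction, assuming (hypothetically) two pages per invoice:

```python
CREDITS_PER_PAGE = 2  # rate quoted in the list above


def estimate_credits(page_counts):
    """Total credits for a batch, given the page count of each document."""
    return CREDITS_PER_PAGE * sum(page_counts)


# Hypothetical batch: 50 invoices at 2 pages each
page_counts = [2] * 50
print(estimate_credits(page_counts))  # 200
```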

Method 4: Using Adobe PDF Services API

Adobe PDF Services API offers enterprise-grade PDF processing. Its Extract API returns a document's text and tables as structured JSON.

Implementation

const PDFServicesSdk = require('@adobe/pdfservices-node-sdk');

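// Written against the 3.x line of the SDK, where OAuth credentials
// (client ID + client secret) use the service principal builder
const credentials = PDFServicesSdk.Credentials
    .servicePrincipalCredentialsBuilder()
    .withClientId("YOUR_CLIENT_ID")
    .withClientSecret("YOUR_CLIENT_SECRET")
    .build();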

const executionContext = PDFServicesSdk.ExecutionContext.create(credentials);
const extractPDFOperation = PDFServicesSdk.ExtractPDF.Operation.createNew();

const input = PDFServicesSdk.FileRef.createFromLocalFile('invoice.pdf');
extractPDFOperation.setInput(input);

const options = new PDFServicesSdk.ExtractPDF.options.ExtractPdfOptions.Builder()
    .addElementsToExtract(
        PDFServicesSdk.ExtractPDF.options.ExtractElementType.TEXT,
        PDFServicesSdk.ExtractPDF.options.ExtractElementType.TABLES
    )
    .build();

extractPDFOperation.setOptions(options);

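// Extract returns a ZIP archive; the JSON is in structuredData.json inside it.
// Reading the archive here uses the adm-zip package (npm install adm-zip).
const AdmZip = require('adm-zip');

extractPDFOperation.execute(executionContext)
    .then(result => result.saveAsFile('output.zip'))
    .then(() => {
        // Open the archive and parse the structured JSON
        const zip = new AdmZip('output.zip');
        const extractedData = JSON.parse(zip.readAsText('structuredData.json'));
        console.log(JSON.stringify(extractedData, null, 2));
    })
    .catch(err => console.error('Error:', err));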
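
The JSON you get back describes the document's structure, not invoice fields, so there is still a mapping step. As a sketch, assuming the output follows the documented shape (a top-level `elements` array whose entries carry a `Path` and, for text, a `Text` value), you could collect text by element type like this. The sample data is hand-written for illustration, not real API output.

```python
def collect_text(structured, path_suffix=None):
    """Return the Text of each element, optionally filtered by the end of its Path."""
    texts = []
    for element in structured.get("elements", []):
        text = element.get("Text")
        if text is None:
            continue  # containers such as tables carry no Text of their own
        if path_suffix is None or element.get("Path", "").endswith(path_suffix):
            texts.append(text.strip())
    return texts


# Hand-written stand-in for the extraction result
structured = {
    "elements": [
        {"Path": "//Document/H1", "Text": "Invoice INV-2024-001 "},
        {"Path": "//Document/P", "Text": "Thank you for your business. "},
        {"Path": "//Document/Table"},
    ]
}

headings = collect_text(structured, "/H1")
everything = collect_text(structured)
print(headings)
```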

Pros:

Enterprise-grade reliability and support
Excellent for high-volume processing
Comprehensive extraction options
Strong security and compliance features

Cons:

Complex setup and authentication
Higher cost for small projects
Requires Adobe account and credentials
More suited for enterprise applications

Comparing the Methods

Feature	pdfplumber	pdf-parse	PDF Vector	Adobe PDF Services
Free to Use	Yes	Yes	No	No
Easy Setup	Yes	Yes	Yes	No
AI-Powered	No	No	Yes	No
Extracts Tables	Yes	No	Yes	Yes
Handles Scanned PDFs	No	No	Yes	Yes
Custom JSON Schemas	No	No	Yes	No
Self-Hosted	Yes	Yes	No	No
Enterprise Support	No	No	No	Yes

Making the Right Choice

Use pdfplumber when:

You're working with simple PDFs that have clear table structures
You need a free, open-source solution
You're comfortable writing custom parsing logic
Your documents follow consistent layouts

Use pdf-parse when:

You only need basic text extraction
You're already in a Node.js environment
File size and performance are critical
You don't need table or formatting preservation

Use PDF Vector when:

You need structured JSON output with custom schemas
You're dealing with complex or variable document layouts
You want AI-powered understanding of content
You need to handle scanned PDFs with OCR
Development speed is more important than infrastructure control

Use Adobe PDF Services when:

You're building enterprise applications
You need guaranteed uptime and support
You have high-volume processing requirements
You're already invested in the Adobe ecosystem
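
Whichever tool you pick, the batch job around it looks much the same: loop over a folder, extract, write one JSON file per PDF, and keep going when a file fails. A minimal sketch, where `extract` is any function you supply that takes a path and returns a dict (the stand-in extractor and file names below are made up for the demo):

```python
import json
import tempfile
from pathlib import Path


def convert_folder(pdf_dir, out_dir, extract):
    """Run `extract` on every PDF in pdf_dir and write one JSON file per PDF.

    Failures are collected and returned instead of stopping the batch.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failures = {}
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        try:
            data = extract(pdf_path)
        except Exception as exc:  # record the error and move on
            failures[pdf_path.name] = str(exc)
            continue
        out_file = out_dir / f"{pdf_path.stem}.json"
        out_file.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return failures


def fake_extract(path):
    """Stand-in extractor so the demo runs without a PDF library."""
    if path.stem == "broken":
        raise ValueError("unreadable PDF")
    return {"source": path.name}


with tempfile.TemporaryDirectory() as tmp:
    pdf_dir = Path(tmp) / "pdfs"
    pdf_dir.mkdir()
    for name in ("a.pdf", "broken.pdf"):
        (pdf_dir / name).write_bytes(b"%PDF-1.4")
    failures = convert_folder(pdf_dir, Path(tmp) / "json", fake_extract)
    written = sorted(p.name for p in (Path(tmp) / "json").glob("*.json"))

print(written, failures)
```

Swap `fake_extract` for a wrapper around your chosen method and review the returned failures before trusting the batch.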

The key is to match your tool to your specific needs. Start small, test with your actual PDFs, and scale up as needed. You now have everything you need to turn those PDFs into useful JSON data.
