Learn how to convert PDF documents into structured JSON data using four different methods, from open-source libraries to API services.
You've got 50 invoices to process, and manually copying data is not an option. We've all been there, staring at a pile of PDFs that need to become structured data for your database, CRM, or analytics tool. The good news? You can automate this entire process and get clean JSON output in minutes, not hours.
Understanding PDF Data Extraction
PDFs were designed for consistent visual presentation, not data extraction. Unlike HTML or XML, most PDFs carry no logical structure that makes extracting data straightforward. Text may be stored as individually positioned characters, tables are often just text blocks placed so they look like a grid, and don't even get me started on scanned documents.
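To make the "positioned text" problem concrete, here is a minimal Python sketch of how rows can be rebuilt from word positions. The word boxes are hand-written sample data, shaped like the dictionaries pdfplumber's `extract_words()` returns (`text`, `x0`, `top`); the `group_into_rows` helper and the 2-point tolerance are illustrative assumptions, not part of any library.

```python
# Hypothetical word boxes: "x0" is the left edge, "top" the distance from the page top.
# Note they are not stored in reading order, which is common in real PDFs.
words = [
    {"text": "Widget", "x0": 40.0, "top": 120.2},
    {"text": "Qty", "x0": 200.0, "top": 100.0},
    {"text": "2", "x0": 200.0, "top": 120.0},
    {"text": "Item", "x0": 40.0, "top": 100.1},
    {"text": "Price", "x0": 300.0, "top": 99.8},
    {"text": "9.99", "x0": 300.0, "top": 119.9},
]


def group_into_rows(words, tolerance=2.0):
    """Cluster words whose vertical position is within `tolerance` of the row's first word."""
    rows = []  # each entry: {"top": float, "words": [word, ...]}
    for word in sorted(words, key=lambda w: w["top"]):
        if rows and abs(word["top"] - rows[-1]["top"]) <= tolerance:
            rows[-1]["words"].append(word)
        else:
            rows.append({"top": word["top"], "words": [word]})
    # Within each row, read left to right.
    return [
        [w["text"] for w in sorted(row["words"], key=lambda w: w["x0"])]
        for row in rows
    ]


rows = group_into_rows(words)
print(rows)
```

Libraries such as pdfplumber do a more careful version of this for you, but the idea is the same: structure has to be inferred from coordinates.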
That's where JSON comes in. As a widely supported data exchange format, JSON gives unstructured PDF content a shape your applications can actually use. Whether you're building an invoice processing system, extracting research data, or parsing forms, once the content is JSON you can load it into a database, CRM, or analytics tool like any other data.
Method 1: Using Python with pdfplumber
pdfplumber is a Python library that excels at extracting text and tables from PDFs. It's particularly good with tabular data, making it a solid choice for invoices and reports.
Installation
pip install pdfplumber
Implementation
import json

import pdfplumber


def extract_invoice_data(pdf_path):
    invoice_data = {
        "invoice_number": "",
        "date": "",
        "total": 0,
        "line_items": []
    }
    with pdfplumber.open(pdf_path) as pdf:
        first_page = pdf.pages[0]
        # extract_text() may return None or an empty string for pages without a text layer
        text = first_page.extract_text() or ""
        # Extract invoice number (simple pattern matching)
        if "Invoice #" in text:
            invoice_data["invoice_number"] = text.split("Invoice #")[1].split("\n")[0].strip()
        # Extract tables for line items
        tables = first_page.extract_tables()
        if tables:
            # Assume first table contains line items
            for row in tables[0][1:]:  # Skip header row
                if len(row) >= 3:
                    invoice_data["line_items"].append({
                        "description": row[0],
                        "quantity": row[1],
                        "price": row[2]
                    })
    return json.dumps(invoice_data, indent=2)


# Usage
result = extract_invoice_data("invoice.pdf")
print(result)
# Example output (date and total keep their defaults because this sketch never fills them):
# {"invoice_number": "INV-2024-001", "date": "", "total": 0, "line_items": [...]}
Pros:
- Free and open-source
- Excellent table extraction capabilities
- Works well with standard PDF layouts
- Good documentation and community support
Cons:
- Struggles with complex layouts or rotated text
- No built-in OCR, so scanned PDFs need a separate OCR step first
- Requires custom logic for each document type
- No built-in AI understanding of content
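The "custom logic for each document type" point is easiest to see with the fields the example above leaves empty. Below is a rough sketch of how the date and total could be filled in with regular expressions, and how the string cells from a table could be turned into numbers. The label formats (`Date: YYYY-MM-DD`, a line starting with `Total:`) and the helper names are assumptions about one hypothetical invoice layout; your documents will almost certainly need different patterns.

```python
import re


def parse_money(value):
    """Convert a string such as '$1,234.50' to a float; return None if it is not numeric."""
    if value is None:
        return None
    cleaned = re.sub(r"[^\d.\-]", "", value)  # drop currency symbols, commas, spaces
    try:
        return float(cleaned)
    except ValueError:
        return None


def parse_header_fields(text):
    """Pull the date and total out of page text using the assumed label formats."""
    fields = {"date": "", "total": 0}
    date_match = re.search(r"Date:\s*(\d{4}-\d{2}-\d{2})", text)
    if date_match:
        fields["date"] = date_match.group(1)
    # Anchoring at the start of a line keeps "Subtotal:" from matching.
    total_match = re.search(r"^Total:\s*(\S+)", text, flags=re.MULTILINE)
    if total_match:
        amount = parse_money(total_match.group(1))
        if amount is not None:
            fields["total"] = amount
    return fields


sample_text = (
    "Invoice # INV-2024-001\n"
    "Date: 2024-03-15\n"
    "Subtotal: $1,200.00\n"
    "Total: $1,234.50\n"
)
fields = parse_header_fields(sample_text)
print(fields)
```

The same `parse_money` helper can be applied to the `quantity` and `price` cells, which `extract_tables()` returns as strings (or `None` for empty cells).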
Method 2: Using Node.js with pdf-parse
pdf-parse is a lightweight Node.js library for basic PDF text extraction. While it doesn't have advanced features, it's perfect for simple extraction tasks.
Installation
npm install pdf-parse
Implementation
const fs = require('fs');
const pdf = require('pdf-parse');

async function extractPDFData(pdfPath) {
  const dataBuffer = fs.readFileSync(pdfPath);
  try {
    const data = await pdf(dataBuffer);
    // Simple extraction requires parsing the text
    const lines = data.text.split('\n');
    // Raw view of the document (not returned below, but handy while debugging)
    const jsonData = {
      totalPages: data.numpages,
      extractedText: lines,
      metadata: data.info
    };
    // Custom parsing logic based on your PDF structure
    const invoiceData = {
      invoice_number: '',
      items: []
    };
    lines.forEach(line => {
      if (line.includes('Invoice #')) {
        invoiceData.invoice_number = line.split('#')[1]?.trim();
      }
      // Add more parsing logic as needed
    });
    return JSON.stringify(invoiceData, null, 2);
  } catch (error) {
    console.error('Error:', error);
    return null;
  }
}

// Usage
extractPDFData('./invoice.pdf').then(result => {
  console.log(result);
});
Pros:
- Very lightweight (minimal dependencies)
- Fast processing for simple PDFs
- Easy to integrate into Node.js applications
- Good for basic text extraction
Cons:
- No table extraction capabilities
- Limited formatting preservation
- Requires extensive custom parsing logic
- Not suitable for complex documents
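To give a feel for what "extensive custom parsing logic" means when all you have is plain text, here is a small sketch of recovering table rows from extracted lines. It is written in Python to match the other added examples, but the idea ports directly to JavaScript. It assumes the extracted text keeps at least two spaces between columns, which is not guaranteed for every PDF, and the sample lines and helper names are made up for illustration.

```python
import re


def split_columns(line):
    """Split a line on runs of two or more whitespace characters."""
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


def parse_line_items(lines):
    """Keep only rows that look like: description, integer quantity, decimal price."""
    items = []
    for line in lines:
        cells = split_columns(line)
        if len(cells) != 3 or not cells[1].isdigit():
            continue  # headers, totals, and free text are skipped
        try:
            price = float(cells[2].replace(",", ""))
        except ValueError:
            continue
        items.append({
            "description": cells[0],
            "quantity": int(cells[1]),
            "price": price,
        })
    return items


sample_lines = [
    "Invoice # INV-2024-001",
    "Description        Qty    Price",
    "Blue widget        2      9.99",
    "Setup fee (one-off)   1   150.00",
]
items = parse_line_items(sample_lines)
print(items)
```

Heuristics like this break as soon as a description wraps onto a second line or the spacing collapses, which is why the tools with real table extraction exist.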
Method 3: Using PDF Vector's Ask API
PDF Vector provides an AI-powered Ask API that can extract structured data directly into a custom JSON schema. This eliminates the need for complex parsing logic.
Installation
npm install pdfvector
Implementation
import { readFileSync } from 'node:fs';
import { PDFVector } from 'pdfvector';

const client = new PDFVector({
  apiKey: 'pdfvector_your_api_key'
});

async function extractInvoiceToJSON(pdfUrl: string) {
  const result = await client.ask({
    url: pdfUrl,
    prompt: "Extract invoice information including all line items",
    mode: "json",
    schema: {
      type: "object",
      properties: {
        invoiceNumber: { type: "string" },
        issueDate: { type: "string" },
        dueDate: { type: "string" },
        vendorInfo: {
          type: "object",
          properties: {
            name: { type: "string" },
            address: { type: "string" },
            taxId: { type: "string" }
          }
        },
        customerInfo: {
          type: "object",
          properties: {
            name: { type: "string" },
            address: { type: "string" }
          }
        },
        lineItems: {
          type: "array",
          items: {
            type: "object",
            properties: {
              description: { type: "string" },
              quantity: { type: "number" },
              unitPrice: { type: "number" },
              total: { type: "number" }
            }
          }
        },
        subtotal: { type: "number" },
        tax: { type: "number" },
        total: { type: "number" }
      },
      required: ["invoiceNumber", "total", "lineItems"]
    }
  });
  return result.json;
}

// Usage with URL (top-level await requires an ES module)
const invoiceData = await extractInvoiceToJSON('https://example.com/invoice.pdf');
console.log(JSON.stringify(invoiceData, null, 2));

// Usage with local file
const fileBuffer = readFileSync('./invoice.pdf');
const localResult = await client.ask({
  data: fileBuffer,
  contentType: 'application/pdf',
  prompt: "Extract invoice information",
  mode: "json",
  schema: { /* same schema */ }
});
Pros:
- AI understands context and document structure
- No parsing logic needed since you just define your schema
- Handles complex layouts, tables, and multi-page documents
- Works with scanned PDFs (OCR built-in)
Cons:
- Requires API key (not self-hosted)
- Costs 2 credits per page (see the pricing details)
- Internet connection required
- Processing time depends on document size
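Even with schema-driven extraction, it's worth sanity-checking the result before it lands in your database. Here's a minimal Python sketch that assumes the returned JSON follows the invoice schema shown above. The `validate_invoice` helper, its messages, and the 0.01 tolerance are illustrative choices, not part of the PDF Vector SDK.

```python
def validate_invoice(invoice, tolerance=0.01):
    """Return a list of problems found in an extracted invoice (empty list = looks OK)."""
    problems = []
    # Same fields the schema marks as required.
    for field in ("invoiceNumber", "total", "lineItems"):
        if field not in invoice:
            problems.append(f"missing required field: {field}")

    items = invoice.get("lineItems", [])
    for i, item in enumerate(items):
        qty, unit, total = item.get("quantity"), item.get("unitPrice"), item.get("total")
        # Only check the arithmetic when all three numbers are present.
        if None not in (qty, unit, total) and abs(qty * unit - total) > tolerance:
            problems.append(f"lineItems[{i}]: quantity * unitPrice != total")

    if "subtotal" in invoice and items:
        items_sum = sum(item.get("total", 0) for item in items)
        if abs(items_sum - invoice["subtotal"]) > tolerance:
            problems.append("subtotal does not match sum of line item totals")
    return problems


good = {
    "invoiceNumber": "INV-2024-001",
    "lineItems": [
        {"description": "Blue widget", "quantity": 2, "unitPrice": 9.99, "total": 19.98},
        {"description": "Setup fee", "quantity": 1, "unitPrice": 150.0, "total": 150.0},
    ],
    "subtotal": 169.98,
    "total": 169.98,
}
bad = {
    "total": 25.0,
    "lineItems": [{"description": "Gadget", "quantity": 3, "unitPrice": 10.0, "total": 25.0}],
}
print(validate_invoice(good))
print(validate_invoice(bad))
```

Checks like these catch the occasional misread digit regardless of which extraction method produced the JSON.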
Method 4: Using Adobe PDF Services API
Adobe PDF Services API offers enterprise-grade PDF processing, including data extraction capabilities.
Implementation
// Assumes: npm install @adobe/pdfservices-node-sdk adm-zip (SDK 3.4+ for the builder below)
const PDFServicesSdk = require('@adobe/pdfservices-node-sdk');
const AdmZip = require('adm-zip');

// OAuth server-to-server credentials (client ID + client secret)
const credentials = PDFServicesSdk.Credentials
  .servicePrincipalCredentialsBuilder()
  .withClientId("YOUR_CLIENT_ID")
  .withClientSecret("YOUR_CLIENT_SECRET")
  .build();
const executionContext = PDFServicesSdk.ExecutionContext.create(credentials);
const extractPDFOperation = PDFServicesSdk.ExtractPDF.Operation.createNew();
const input = PDFServicesSdk.FileRef.createFromLocalFile('invoice.pdf');
extractPDFOperation.setInput(input);
const options = new PDFServicesSdk.ExtractPDF.options.ExtractPdfOptions.Builder()
  .addElementsToExtract(
    PDFServicesSdk.ExtractPDF.options.ExtractElementType.TEXT,
    PDFServicesSdk.ExtractPDF.options.ExtractElementType.TABLES
  )
  .build();
extractPDFOperation.setOptions(options);

// The Extract API returns a ZIP archive; the JSON lives in structuredData.json inside it
extractPDFOperation.execute(executionContext)
  .then(result => result.saveAsFile('output.zip'))
  .then(() => {
    // Read and process the JSON inside the archive
    const zip = new AdmZip('output.zip');
    const extractedData = JSON.parse(zip.readAsText('structuredData.json'));
    console.log(JSON.stringify(extractedData, null, 2));
  })
  .catch(err => console.error('Error:', err));
Pros:
- Enterprise-grade reliability and support
- Excellent for high-volume processing
- Comprehensive extraction options
- Strong security and compliance features
Cons:
- Complex setup and authentication
- Higher cost for small projects
- Requires Adobe account and credentials
- More suited for enterprise applications
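One thing to keep in mind: Adobe's output describes the document's structure, not your invoice fields, so you still need a mapping step. The Python sketch below assumes the extracted JSON has a top-level `elements` list whose entries carry `Path` and `Text` keys, with table cells appearing under paths like `//Document/Table/TR[2]/TD[3]/P`. The sample data is hand-written for illustration, the helper is hypothetical, and it only handles a single table, so check it against your own output before relying on it.

```python
import re

# Matches table-cell paths and captures the row index (absent for the first row).
CELL_PATH = re.compile(r"/Table(?:\[\d+\])?/TR(?:\[(\d+)\])?/T[HD](?:\[\d+\])?")


def table_rows(structured):
    """Group table-cell text by row, assuming the document contains one table."""
    rows = {}
    for element in structured.get("elements", []):
        match = CELL_PATH.search(element.get("Path", ""))
        if match and "Text" in element:
            row_index = int(match.group(1) or 1)
            rows.setdefault(row_index, []).append(element["Text"].strip())
    return [rows[i] for i in sorted(rows)]


# Hand-written sample in the assumed shape, not real API output.
structured = {
    "elements": [
        {"Path": "//Document/H1", "Text": "Invoice INV-2024-001 "},
        {"Path": "//Document/Table/TR/TH/P", "Text": "Description "},
        {"Path": "//Document/Table/TR/TH[2]/P", "Text": "Qty "},
        {"Path": "//Document/Table/TR/TH[3]/P", "Text": "Price "},
        {"Path": "//Document/Table/TR[2]/TD/P", "Text": "Blue widget "},
        {"Path": "//Document/Table/TR[2]/TD[2]/P", "Text": "2 "},
        {"Path": "//Document/Table/TR[2]/TD[3]/P", "Text": "9.99 "},
    ]
}
extracted_rows = table_rows(structured)
print(extracted_rows)
```

From here, turning rows into line items is the same kind of mapping you would write for pdfplumber's table output.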
Comparing the Methods
| Feature | pdfplumber | pdf-parse | PDF Vector | Adobe PDF Services |
|---|---|---|---|---|
| Free to Use | Yes | Yes | No | No |
| Easy Setup | Yes | Yes | Yes | No |
| AI-Powered | No | No | Yes | No |
| Extracts Tables | Yes | No | Yes | Yes |
| Handles Scanned PDFs | No | No | Yes | Yes |
| Custom JSON Schemas | No | No | Yes | No |
| Self-Hosted | Yes | Yes | No | No |
| Enterprise Support | No | No | No | Yes |
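If you'd rather query the comparison than eyeball it, the table can be written down as data. This is just a Python sketch that transcribes the rows above; the feature keys and the `candidates` helper are names I made up for the example.

```python
FEATURES = (
    "free", "easy_setup", "ai", "tables",
    "scanned", "custom_schema", "self_hosted", "enterprise_support",
)

# One tuple of yes/no values per tool, in the same order as the table rows above.
TOOLS = {
    "pdfplumber": (True, True, False, True, False, False, True, False),
    "pdf-parse": (True, True, False, False, False, False, True, False),
    "PDF Vector": (False, True, True, True, True, True, False, False),
    "Adobe PDF Services": (False, False, False, True, True, False, False, True),
}


def candidates(**requirements):
    """Return the tools whose features match every keyword requirement."""
    matches = []
    for name, values in TOOLS.items():
        features = dict(zip(FEATURES, values))
        if all(features[key] == wanted for key, wanted in requirements.items()):
            matches.append(name)
    return matches


print(candidates(free=True, tables=True))
print(candidates(scanned=True))
print(candidates(self_hosted=True, scanned=True))
```

The last query comes back empty, which is the table's main trade-off in one line: none of the four options is both self-hosted and able to handle scanned PDFs out of the box.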
Making the Right Choice
Use pdfplumber when:
- You're working with simple PDFs that have clear table structures
- You need a free, open-source solution
- You're comfortable writing custom parsing logic
- Your documents follow consistent layouts
Use pdf-parse when:
- You only need basic text extraction
- You're already in a Node.js environment
- A small dependency footprint and fast processing are critical
- You don't need table or formatting preservation
Use PDF Vector when:
- You need structured JSON output with custom schemas
- You're dealing with complex or variable document layouts
- You want AI-powered understanding of content
- You need to handle scanned PDFs with OCR
- Development speed is more important than infrastructure control
Use Adobe PDF Services when:
- You're building enterprise applications
- You need guaranteed uptime and support
- You have high-volume processing requirements
- You're already invested in the Adobe ecosystem
The key is to match your tool to your specific needs. Start small, test with your actual PDFs, and scale up as needed. You now have everything you need to turn those PDFs into useful JSON data.



