Transform ArXiv's complex XML responses into clean, structured data you can actually use in your TypeScript applications.
If you’ve tried to get paper data from ArXiv’s API, you’ve probably hit the same wall we all do. Instead of nice JSON, you get complex XML with multiple namespaces that breaks standard parsing. Let’s fix that. We’ll explore three ways to extract titles, authors, and abstracts from ArXiv API responses using TypeScript.
Understanding ArXiv API XML Structure
The ArXiv API returns data in Atom 1.0 format, which relies on XML namespaces: a default Atom namespace plus prefixed opensearch: and arxiv: extension elements. Here’s a simplified version of a typical response:
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Quantum Computing Fundamentals</title>
    <author>
      <name>John Doe</name>
    </author>
    <summary>This paper explores...</summary>
  </entry>
</feed>
The challenge? Every element belongs to the default namespace http://www.w3.org/2005/Atom, so namespace-unaware lookups (an XPath query for //entry, for example) match nothing. Without handling this namespace correctly, you’ll get empty results even when the data is right there.
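To see why such a lookup comes back empty, it helps to compare names the way a namespace-aware tool does: by namespace URI plus local name rather than by the text in the tag. The sketch below is only an illustration; expandName and NamespaceMap are hypothetical names, not part of any parser library.

```typescript
// Maps a prefix to its namespace URI; the empty string stands for the default namespace.
type NamespaceMap = Record<string, string>;

// Expand a qualified name such as "arxiv:comment" into "{uri}localName" form.
// Hypothetical helper for illustration only.
function expandName(qualifiedName: string, namespaces: NamespaceMap): string {
  const colon = qualifiedName.indexOf(':');
  const prefix = colon === -1 ? '' : qualifiedName.slice(0, colon);
  const localName = colon === -1 ? qualifiedName : qualifiedName.slice(colon + 1);
  const uri = namespaces[prefix];
  return uri ? `{${uri}}${localName}` : localName;
}

// Namespace declarations in scope inside an ArXiv feed.
const feedNamespaces: NamespaceMap = { '': 'http://www.w3.org/2005/Atom' };

// The name the document actually contains.
const nameInDocument = expandName('entry', feedNamespaces);
// The name a namespace-unaware lookup asks for.
const nameRequested = expandName('entry', {});

console.log(nameInDocument === nameRequested); // false
```

The two names differ even though both are written as entry, which is why the lookup silently matches nothing instead of raising an error.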
Method 1: Using xml2js Library
Implementation Guide
Install the xml2js library and its types
Configure the parser to handle namespaces
Parse the XML and extract the data
Code Example
import { parseString } from 'xml2js';
import { promisify } from 'util';

const parseXML = promisify(parseString);

interface ArxivPaper {
  title: string;
  authors: string[];
  abstract: string;
  id: string;
}

async function fetchArxivPapers(query: string, maxResults: number = 10): Promise<ArxivPaper[]> {
  try {
    // Build URL with query parameters
    const params = new URLSearchParams({
      search_query: query,
      max_results: maxResults.toString()
    });

    // Make request to ArXiv API
    const response = await fetch(`http://export.arxiv.org/api/query?${params}`);
    if (!response.ok) {
      throw new Error(`ArXiv API responded with status ${response.status}`);
    }
    const xmlData = await response.text();

    // Parse XML: explicitArray: false unwraps single children,
    // ignoreAttrs: true drops xmlns declarations and other attributes
    const result = await parseXML(xmlData, {
      explicitArray: false,
      ignoreAttrs: true
    });

    // A feed with no matches has no entry key; a single match is an object, not an array
    const rawEntries = result.feed?.entry ?? [];
    const entries = Array.isArray(rawEntries) ? rawEntries : [rawEntries];

    return entries.map((entry: any) => ({
      title: entry.title.replace(/\s+/g, ' ').trim(),
      authors: Array.isArray(entry.author)
        ? entry.author.map((a: any) => a.name)
        : [entry.author.name],
      abstract: entry.summary.replace(/\s+/g, ' ').trim(),
      // e.g. "http://arxiv.org/abs/2301.00001v1" -> "2301.00001v1"
      id: entry.id.split('/').pop()
    }));
  } catch (error) {
    console.error('Failed to fetch ArXiv papers:', error);
    return [];
  }
}
// Usage
const papers = await fetchArxivPapers('quantum computing', 5);
console.log(papers);
// Output: [{ title: "...", authors: ["..."], abstract: "...", id: "..." }, ...]
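With explicitArray: false, a field is an object when it occurs once, an array when it repeats, and missing when it does not occur at all. The branching for that can be pulled into small helpers. This is a minimal sketch; toArray, collapseWhitespace, and the ParsedEntry shape are illustrative names assumed here, not part of xml2js.

```typescript
// Wrap a value that may be missing, a single item, or an array into an array.
function toArray<T>(value: T | T[] | undefined | null): T[] {
  if (value === undefined || value === null) return [];
  return Array.isArray(value) ? value : [value];
}

// Collapse the line breaks and runs of spaces found in ArXiv titles and abstracts.
function collapseWhitespace(text: string): string {
  return text.replace(/\s+/g, ' ').trim();
}

// Assumed shape of one parsed entry when explicitArray is false.
interface ParsedAuthor { name: string }
interface ParsedEntry { title: string; author?: ParsedAuthor | ParsedAuthor[] }

const singleAuthorEntry: ParsedEntry = {
  title: 'Quantum\n  Computing Fundamentals',
  author: { name: 'John Doe' }
};

console.log(collapseWhitespace(singleAuthorEntry.title));           // "Quantum Computing Fundamentals"
console.log(toArray(singleAuthorEntry.author).map(a => a.name));    // [ 'John Doe' ]
console.log(toArray(undefined));                                    // []
```

Normalising at the edge like this keeps the mapping code free of Array.isArray checks, at the cost of one extra helper to maintain.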
Advantages and Limitations
Pros:
✅ Extensive configuration options for complex XML structures
✅ Mature namespace support with granular control
✅ Large community with extensive Stack Overflow coverage
Cons:
❌ 30-45x slower than fast-xml-parser on large files
❌ No releases since 2023 (v0.6.2) - appears unmaintained
Common Issues:
Namespace pollution: Default settings include namespace prefixes in keys, cluttering the output
Memory exhaustion: 80-90MB files can take 45+ seconds and spike RAM usage
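For the prefix clutter, xml2js accepts a tagNameProcessors option: a list of functions applied to every tag name. The sketch below defines a local prefix-stripping function so it runs without dependencies; xml2js also ships a similar processors.stripPrefix. The names stripNamespacePrefix and xml2jsOptions are assumptions made for this example.

```typescript
// Remove a namespace prefix from a tag name: "arxiv:comment" -> "comment".
// Local stand-in for the processor that xml2js provides.
const stripNamespacePrefix = (tagName: string): string => tagName.replace(/^[^:]+:/, '');

// Options object intended to be passed as the second argument to xml2js's parser.
const xml2jsOptions = {
  explicitArray: false,
  ignoreAttrs: true,
  tagNameProcessors: [stripNamespacePrefix]
};

console.log(stripNamespacePrefix('opensearch:totalResults')); // "totalResults"
console.log(stripNamespacePrefix('title'));                   // "title"
```

Stripping prefixes is convenient but lossy: two elements that differ only by prefix would end up under the same key, so it is worth checking the feed for such collisions first.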
Method 2: Using fast-xml-parser
Different Implementation
fast-xml-parser offers better performance and a more modern API. It can strip namespace prefixes for you through the removeNSPrefix option and provides TypeScript support out of the box.
Code Example
import { XMLParser } from 'fast-xml-parser';

interface ArxivEntry {
  title: string;
  author: { name: string } | { name: string }[];
  summary: string;
  id: string;
  published: string;
}

async function fetchArxivWithFastParser(query: string, maxResults: number = 10) {
  try {
    // Build URL with query parameters
    const params = new URLSearchParams({
      search_query: query,
      max_results: maxResults.toString()
    });

    const response = await fetch(`http://export.arxiv.org/api/query?${params}`);
    if (!response.ok) {
      throw new Error(`ArXiv API responded with status ${response.status}`);
    }
    const xmlData = await response.text();

    // Configure parser
    const parser = new XMLParser({
      ignoreAttributes: true,
      removeNSPrefix: true, // Strips prefixes such as arxiv: from tag names
      parseTagValue: false  // Keep every value as a string
    });
    const result = parser.parse(xmlData);

    // Handle zero, single, and multiple entries
    const rawEntries = result.feed?.entry;
    const entries: ArxivEntry[] = rawEntries
      ? (Array.isArray(rawEntries) ? rawEntries : [rawEntries])
      : [];

    return entries.map(entry => ({
      // Titles and abstracts contain line breaks, so collapse whitespace
      title: entry.title.replace(/\s+/g, ' ').trim(),
      authors: Array.isArray(entry.author)
        ? entry.author.map(a => a.name)
        : [entry.author.name],
      abstract: entry.summary.replace(/\s+/g, ' ').trim(),
      id: entry.id.split('/').pop(),
      published: entry.published
    }));
  } catch (error) {
    console.error('Failed to parse ArXiv response:', error);
    return [];
  }
}

// Usage with async/await
const papers = await fetchArxivWithFastParser('machine learning', 10);
papers.forEach(paper => {
  console.log(`Title: ${paper.title}`);
  console.log(`Authors: ${paper.authors.join(', ')}`);
  console.log('---');
});
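The single-versus-multiple checks above can also be moved into the parser configuration. fast-xml-parser has an isArray option that is called for each tag and forces an array when it returns true. The sketch below assumes the v4 behaviour, where the second argument is a dot-separated path string; alwaysArrayPaths and fastParserOptions are names chosen for this example.

```typescript
// Tag paths that should always be parsed as arrays, even with one occurrence.
const alwaysArrayPaths = new Set(['feed.entry', 'feed.entry.author']);

// Options object intended for fast-xml-parser's XMLParser constructor.
const fastParserOptions = {
  ignoreAttributes: true,
  removeNSPrefix: true,
  parseTagValue: false,
  // Returning true forces the tag at this path into an array.
  isArray: (_tagName: string, jPath: string): boolean => alwaysArrayPaths.has(jPath)
};

console.log(fastParserOptions.isArray('entry', 'feed.entry'));       // true
console.log(fastParserOptions.isArray('title', 'feed.entry.title')); // false
```

With this in place, result.feed.entry and entry.author would already be arrays whenever they are present, so the mapping code only needs to handle the missing case.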
Advantages and Limitations
Pros:
✅ Roughly 30-45x faster than xml2js on large files (parses 80MB in ~1.3 seconds)
✅ Built-in TypeScript definitions
✅ Built-in XML validation
Cons:
❌ Self-closing tags handled differently than empty tags
❌ Bundle size 4x larger due to HTML entity decoder
Common Issues:
Missing attributes: Default config ignores attributes, so set ignoreAttributes: false when you need attribute data from ArXiv
Boolean attribute parsing: Can fail in self-closing tags like <entry published="true"/>
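Attributes matter for ArXiv mainly because the PDF address lives in a link element's href attribute rather than in element text. The sketch below works on the object shape fast-xml-parser produces with ignoreAttributes: false and its default "@_" attribute prefix. ParsedLink and findPdfUrl are hypothetical names, and the sample data is hand-written for illustration.

```typescript
// A <link> element as parsed with ignoreAttributes: false and the default "@_" prefix.
interface ParsedLink {
  '@_href'?: string;
  '@_title'?: string;
  '@_rel'?: string;
  '@_type'?: string;
}

// Pick the PDF link out of an entry's link field, which may be one object or an array.
function findPdfUrl(link: ParsedLink | ParsedLink[] | undefined): string | undefined {
  const links = link === undefined ? [] : Array.isArray(link) ? link : [link];
  const pdfLink = links.find(
    l => l['@_title'] === 'pdf' || l['@_type'] === 'application/pdf'
  );
  return pdfLink?.['@_href'];
}

// Hand-written sample mirroring the two links an ArXiv entry usually carries.
const sampleLinks: ParsedLink[] = [
  { '@_href': 'http://arxiv.org/abs/2301.00001v1', '@_rel': 'alternate', '@_type': 'text/html' },
  { '@_href': 'http://arxiv.org/pdf/2301.00001v1', '@_title': 'pdf', '@_rel': 'related', '@_type': 'application/pdf' }
];

console.log(findPdfUrl(sampleLinks)); // "http://arxiv.org/pdf/2301.00001v1"
```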
Method 3: Using PDF Vector’s Academic Search API
PDF Vector’s Academic Search API provides a different approach by offering a unified API for multiple academic databases including ArXiv.
Code Example
import { PDFVector } from 'pdfvector';

const pdfvector = new PDFVector({
  apiKey: 'pdfvector_xxx' // Get from dashboard
});

// Search for papers by query
async function searchArxivViaPDFVector(query: string) {
  try {
    const response = await pdfvector.academicSearch({
      query: query,
      providers: ['arxiv'], // Can add more: ['pubmed', 'semantic-scholar']
      limit: 20,
      yearFrom: 2020, // Built-in date filtering
      yearTo: 2024,
      fields: ['title', 'authors', 'abstract', 'arxivId', 'pdfURL']
    });

    return response.results.map(paper => ({
      title: paper.title,
      authors: paper.authors.map(a => a.name),
      abstract: paper.abstract,
      arxivId: paper.providerData?.arxivId,
      pdfUrl: paper.pdfURL
    }));
  } catch (error) {
    console.error('PDF Vector search failed:', error);
    return [];
  }
}

// Fetch specific papers by ArXiv ID
async function fetchArxivPaperByID(arxivIds: string[]) {
  try {
    const response = await pdfvector.academicFetch({
      ids: arxivIds,
      fields: ['title', 'authors', 'abstract', 'arxivId', 'pdfURL', 'date']
    });

    return response.results.map(paper => ({
      id: paper.id,
      title: paper.title,
      authors: paper.authors.map(a => a.name),
      abstract: paper.abstract,
      publishedDate: paper.date,
      pdfUrl: paper.pdfURL
    }));
  } catch (error) {
    console.error('PDF Vector fetch failed:', error);
    return [];
  }
}

const papers = await searchArxivViaPDFVector('quantum computing');
console.log(`Found ${papers.length} papers`);

const specificPapers = await fetchArxivPaperByID(['2301.00001', '2103.14030']);
console.log(specificPapers);
// Output: [{ id: '2301.00001', title: '...', authors: [...], ... }]
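The bare IDs used above differ from the id values Methods 1 and 2 produce. Those come from Atom id URLs such as http://arxiv.org/abs/2301.00001v1, so they keep the version suffix, and splitting on "/" drops the archive part of pre-2007 identifiers like hep-th/9901001. If you want to move IDs between the approaches, something like the following sketch could normalise them. parseArxivId is a hypothetical helper, and whether a given service accepts versioned IDs is not covered here.

```typescript
interface ArxivId {
  id: string;        // e.g. "2301.00001" or "hep-th/9901001"
  version?: number;  // e.g. 1 for a "v1" suffix
}

// Split an Atom id URL into the bare identifier and an optional version number.
function parseArxivId(idUrl: string): ArxivId | undefined {
  const match = /arxiv\.org\/abs\/(.+?)(?:v(\d+))?$/.exec(idUrl.trim());
  if (!match) return undefined;
  const id = match[1];
  const version = match[2];
  if (!id) return undefined;
  return version === undefined ? { id } : { id, version: Number(version) };
}

console.log(parseArxivId('http://arxiv.org/abs/2301.00001v1'));
// { id: '2301.00001', version: 1 }
console.log(parseArxivId('http://arxiv.org/abs/hep-th/9901001v2'));
// { id: 'hep-th/9901001', version: 2 }
```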
Advantages and Limitations
Pros:
✅ Returns clean JSON, no XML parsing needed
✅ Built-in date filtering (yearFrom/yearTo)
✅ Search multiple databases in one call
✅ Enrich each paper with additional data
Cons:
❌ Requires API key and registration
❌ Costs 2 credits per search or fetch
Common Issues:
Rate limiting: Free tier limited to 100 credits/month.
Making the Right Decision
Time Investment Reality
Consider the full lifecycle of your integration:
Initial Setup Time:
How long will namespace handling and XML parsing take to implement?
What’s the learning curve for your team?
How quickly do you need to ship?
Ongoing Maintenance Burden:
Who handles edge cases and format changes?
What happens when ArXiv updates their API?
Will future developers understand the XML parsing logic?
Key Considerations
Technical factors to evaluate:
Single source (ArXiv only) versus multi-database needs
Monthly query volume and rate limit constraints
Project type (prototype versus production application)
Team’s XML parsing expertise and maintenance capacity
Budget constraints versus development time costs
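On the query-volume point, it helps to know what the direct route involves. ArXiv's API pages results with the start and max_results parameters, and its documentation asks clients making repeated calls to pause about three seconds between requests. The sketch below shows one way to plan and pace paged requests; buildPageUrls and fetchSequentially are illustrative names, and the fetcher is passed in as a parameter so the pacing logic stays independent of any HTTP client.

```typescript
const ARXIV_ENDPOINT = 'http://export.arxiv.org/api/query';

// Build one URL per page until `total` results are covered.
function buildPageUrls(query: string, total: number, pageSize: number): string[] {
  const urls: string[] = [];
  for (let start = 0; start < total; start += pageSize) {
    const params = new URLSearchParams({
      search_query: query,
      start: String(start),
      max_results: String(Math.min(pageSize, total - start))
    });
    urls.push(`${ARXIV_ENDPOINT}?${params.toString()}`);
  }
  return urls;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>(resolve => setTimeout(resolve, ms));

// Fetch pages one at a time, waiting between requests (3000 ms by default).
async function fetchSequentially(
  urls: string[],
  fetchText: (url: string) => Promise<string>,
  delayMs: number = 3000
): Promise<string[]> {
  const pages: string[] = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) await sleep(delayMs);
    pages.push(await fetchText(urls[i]!));
  }
  return pages;
}

console.log(buildPageUrls('all:electron', 25, 10).length); // 3
```

At that pace, a job needing 100 pages takes about five minutes of waiting alone, which is worth weighing against the credit costs of a hosted API.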
API service benefits to consider:
Multiple database access through one interface
Consistent JSON responses across all providers
Time saved on parsing and error handling
Built-in metadata enrichment (citations, references)
Someone else maintains the integration
The best choice depends on your specific context, timeline, and resources.



