Extract ArXiv Paper Metadata from XML Responses

Transform ArXiv's complex XML responses into clean, structured data you can actually use in your TypeScript applications.

If you’ve tried to get paper data from ArXiv’s API, you’ve probably hit the same wall we all do. Instead of JSON, you get an Atom XML feed whose namespaces trip up naive parsing. Let’s fix that. We’ll explore three ways to extract titles, authors, and abstracts from ArXiv API responses using TypeScript.

Understanding ArXiv API XML Structure

The ArXiv API returns data in Atom 1.0 format, which uses XML namespaces extensively. Here’s what a typical response looks like:

<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Quantum Computing Fundamentals</title>
    <author>
      <name>John Doe</name>
    </author>
    <summary>This paper explores...</summary>
  </entry>
</feed>

The challenge? Every element in the feed belongs to the default namespace http://www.w3.org/2005/Atom. Namespace-aware lookups (XPath, getElementsByTagNameNS) for a plain title or entry match nothing unless you supply that namespace, so you get empty results even when the data is right there.
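
To make the namespace problem concrete, here is a minimal sketch of how a namespace-aware parser sees element names. The expandName helper and the ARXIV_NAMESPACES map are hypothetical names introduced for illustration; the URIs are the ones ArXiv responses use for the Atom, arxiv, and opensearch vocabularies.

```typescript
// Prefix-to-URI bindings found in ArXiv responses.
// The empty-string key stands for the default (unprefixed) namespace.
const ARXIV_NAMESPACES: Record<string, string> = {
  '': 'http://www.w3.org/2005/Atom',
  arxiv: 'http://arxiv.org/schemas/atom',
  opensearch: 'http://a9.com/-/spec/opensearch/1.1/',
};

// Hypothetical helper: expand "prefix:local" into "{uri}local".
// A namespace-aware parser compares these expanded names, not the raw tag text.
function expandName(qualifiedName: string, bindings: Record<string, string>): string {
  const colon = qualifiedName.indexOf(':');
  const prefix = colon === -1 ? '' : qualifiedName.slice(0, colon);
  const local = colon === -1 ? qualifiedName : qualifiedName.slice(colon + 1);
  const uri = bindings[prefix];
  return uri ? `{${uri}}${local}` : local;
}

// What the feed actually contains:
console.log(expandName('title', ARXIV_NAMESPACES));
// {http://www.w3.org/2005/Atom}title

// What a lookup without any namespace binding asks for:
console.log(expandName('title', {}));
// title

// Prefixed ArXiv extension elements resolve the same way:
console.log(expandName('arxiv:comment', ARXIV_NAMESPACES));
// {http://arxiv.org/schemas/atom}comment
```

The first two names differ, which is why a query for a bare title finds nothing in an Atom feed.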

Method 1: Using xml2js Library

Implementation Guide

Install the xml2js library and its types (@types/xml2js)
Configure the parser to flatten single-element arrays and drop attributes
Parse the XML and extract the data

Code Example

import { parseStringPromise } from 'xml2js';

interface ArxivPaper {
  title: string;
  authors: string[];
  abstract: string;
  id: string;
}

async function fetchArxivPapers(query: string, maxResults: number = 10): Promise<ArxivPaper[]> {
  try {
    // Build URL with query parameters
    const params = new URLSearchParams({
      search_query: query,
      max_results: maxResults.toString()
    });

    // Make request to ArXiv API
    const response = await fetch(`http://export.arxiv.org/api/query?${params}`);
    if (!response.ok) {
      throw new Error(`ArXiv API returned HTTP ${response.status}`);
    }
    const xmlData = await response.text();

    // Parse XML: collapse single-child arrays and drop attributes.
    // Prefixed tags (arxiv:comment, opensearch:totalResults) keep their prefix in the keys.
    const result = await parseStringPromise(xmlData, {
      explicitArray: false,
      ignoreAttrs: true
    });

    // Extract papers from the feed; entry is missing when the query has no results
    const rawEntries = result.feed.entry;
    const entries = rawEntries === undefined
      ? []
      : Array.isArray(rawEntries) ? rawEntries : [rawEntries];

    return entries.map((entry: any) => ({
      // ArXiv wraps long titles and abstracts, so collapse runs of whitespace
      title: entry.title.replace(/\s+/g, ' ').trim(),
      authors: Array.isArray(entry.author)
        ? entry.author.map((a: any) => a.name)
        : [entry.author.name],
      abstract: entry.summary.replace(/\s+/g, ' ').trim(),
      // Strip the URL prefix but keep old-style archive paths such as hep-th/9901001v1
      id: entry.id.replace(/^https?:\/\/arxiv\.org\/abs\//, '')
    }));
  } catch (error) {
    console.error('Failed to fetch ArXiv papers:', error);
    return [];
  }
}

// Usage
const papers = await fetchArxivPapers('quantum computing', 5);
console.log(papers);
// Output: [{ title: "...", authors: ["..."], abstract: "...", id: "..." }, ...]
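
The id that ArXiv returns for each entry is a URL such as http://arxiv.org/abs/2301.00001v1, and older papers use an archive path such as http://arxiv.org/abs/hep-th/9901001v1. If you need the identifier and the version separately, a small helper along these lines may be enough. This is a sketch; parseArxivId and the ArxivId type are hypothetical names, not part of any library.

```typescript
interface ArxivId {
  id: string;       // e.g. "2301.00001" or "hep-th/9901001"
  version?: number; // e.g. 1 for a trailing "v1"; absent when there is no suffix
}

// Hypothetical helper: split an entry id URL into identifier and version.
function parseArxivId(entryId: string): ArxivId {
  // Keep only what follows "/abs/", if that marker is present.
  const marker = '/abs/';
  const at = entryId.indexOf(marker);
  const raw = at === -1 ? entryId : entryId.slice(at + marker.length);

  // Lazily capture the identifier, then an optional trailing "v<digits>".
  const match = /^(.+?)(?:v(\d+))?$/.exec(raw);
  if (!match) {
    return { id: raw };
  }
  const parsed: ArxivId = { id: match[1] };
  if (match[2] !== undefined) {
    parsed.version = Number(match[2]);
  }
  return parsed;
}

console.log(parseArxivId('http://arxiv.org/abs/2301.00001v1'));
console.log(parseArxivId('http://arxiv.org/abs/hep-th/9901001v2'));
console.log(parseArxivId('2103.14030'));
```

Keeping the archive segment matters for old-style identifiers: 9901001 on its own is ambiguous, while hep-th/9901001 is not.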

Advantages and Limitations

Pros:

✅ Extensive configuration options for complex XML structures
✅ Mature namespace support with granular control
✅ Large community with extensive Stack Overflow coverage

Cons:

❌ 30-45x slower than fast-xml-parser on large files
❌ No releases since 2023 (v0.6.2) - appears unmaintained

Common Issues:

Namespace pollution: Default settings include namespace prefixes in keys, cluttering the output
Memory exhaustion: 80-90MB files can take 45+ seconds and spike RAM usage
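
The namespace pollution issue can be handled after parsing. The sketch below walks a parsed object and drops the prefix from every key. stripKeyPrefixes and the Parsed type are hypothetical names, and the sample object only approximates what xml2js produces with the options used earlier. xml2js also accepts tag name processors (tagNameProcessors with processors.stripPrefix) that should achieve a similar result at parse time; check the library's README before relying on that.

```typescript
// Simplified shape of a parsed document: text, arrays, or keyed objects.
type Parsed = string | Parsed[] | { [key: string]: Parsed };

// Hypothetical helper: recursively rename "prefix:local" keys to "local".
// Caveat: two keys that differ only by prefix collapse into one (last one wins).
function stripKeyPrefixes(node: Parsed): Parsed {
  if (typeof node === 'string') {
    return node;
  }
  if (Array.isArray(node)) {
    return node.map((child) => stripKeyPrefixes(child));
  }
  const cleaned: { [key: string]: Parsed } = {};
  Object.keys(node).forEach((key) => {
    const colon = key.indexOf(':');
    const local = colon === -1 ? key : key.slice(colon + 1);
    cleaned[local] = stripKeyPrefixes(node[key]);
  });
  return cleaned;
}

// Illustrative input, not a captured API response.
const parsedFeed: Parsed = {
  feed: {
    'opensearch:totalResults': '1',
    entry: {
      title: 'Quantum Computing Fundamentals',
      'arxiv:comment': '10 pages',
    },
  },
};

console.log(JSON.stringify(stripKeyPrefixes(parsedFeed)));
```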

Method 2: Using fast-xml-parser

Different Implementation

fast-xml-parser offers better performance and a more modern API. It keeps namespace prefixes by default but strips them when you set removeNSPrefix: true, and it ships its own TypeScript definitions.

Code Example

import { XMLParser } from 'fast-xml-parser';

interface ArxivEntry {
  title: string;
  author: { name: string } | { name: string }[];
  summary: string;
  id: string;
  published: string;
}

async function fetchArxivWithFastParser(query: string, maxResults: number = 10) {
  try {
    // Build URL with query parameters
    const params = new URLSearchParams({
      search_query: query,
      max_results: maxResults.toString()
    });
    
    const response = await fetch(`http://export.arxiv.org/api/query?${params}`);
    const xmlData = await response.text();

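    // Configure parser
    const parser = new XMLParser({
      ignoreAttributes: true, // Drops attributes such as <link href> and <category term>
      removeNSPrefix: true, // Strips prefixes, e.g. arxiv:comment becomes comment
      parseTagValue: false // Keep element text as strings (no number coercion)
    });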

    const result = parser.parse(xmlData);
    
    // Handle single vs multiple entries
    const entries: ArxivEntry[] = result.feed.entry 
      ? (Array.isArray(result.feed.entry) ? result.feed.entry : [result.feed.entry])
      : [];

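    return entries.map(entry => ({
      // ArXiv wraps long titles and abstracts, so collapse runs of whitespace
      title: entry.title.replace(/\s+/g, ' ').trim(),
      authors: Array.isArray(entry.author) 
        ? entry.author.map(a => a.name)
        : [entry.author.name],
      abstract: entry.summary.replace(/\s+/g, ' ').trim(),
      // Strip the URL prefix but keep old-style archive paths such as hep-th/9901001v1
      id: entry.id.replace(/^https?:\/\/arxiv\.org\/abs\//, ''),
      published: entry.published
    }));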
  } catch (error) {
    console.error('Failed to parse ArXiv response:', error);
    return [];
  }
}

// Usage with async/await
const papers = await fetchArxivWithFastParser('machine learning', 10);
papers.forEach(paper => {
  console.log(`Title: ${paper.title}`);
  console.log(`Authors: ${paper.authors.join(', ')}`);
  console.log('---');
});
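
Both parser examples repeat the same single-versus-array check, because a tag that appears once parses as an object and a tag that appears several times parses as an array. The check can be factored into small helpers, sketched below. asArray, collapseWhitespace, normalizeEntry, and the Raw* types are hypothetical names introduced for illustration. fast-xml-parser also has an isArray option that can force chosen tags to always parse as arrays; consult its documentation for the exact callback signature.

```typescript
interface RawAuthor {
  name: string;
}

interface RawEntry {
  title: string;
  summary: string;
  author?: RawAuthor | RawAuthor[];
}

// Wrap a single value in an array, pass arrays through, map undefined to [].
function asArray<T>(value: T | T[] | undefined): T[] {
  if (value === undefined) {
    return [];
  }
  return Array.isArray(value) ? value : [value];
}

// Replace newlines and repeated spaces with single spaces.
function collapseWhitespace(text: string): string {
  return text.replace(/\s+/g, ' ').trim();
}

// Turn one parsed entry into the flat shape used throughout this article.
function normalizeEntry(entry: RawEntry) {
  return {
    title: collapseWhitespace(entry.title),
    authors: asArray<RawAuthor>(entry.author).map((a) => a.name),
    abstract: collapseWhitespace(entry.summary),
  };
}

const single = normalizeEntry({
  title: 'Quantum Computing\n  Fundamentals',
  summary: '  This paper explores...  ',
  author: { name: 'John Doe' },
});
console.log(single.authors); // [ 'John Doe' ]
console.log(single.title);   // Quantum Computing Fundamentals
```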

Advantages and Limitations

Pros:

✅ Much faster than xml2js on large files (parses 80MB in ~1.3 seconds, versus the 45+ seconds cited for xml2js above)
✅ Built-in TypeScript definitions
✅ Built-in XML validation

Cons:

❌ Self-closing tags handled differently than empty tags
❌ Bundle size 4x larger due to HTML entity decoder

Common Issues:

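Missing attributes: The default config ignores attributes. That is fine for titles, authors, and abstracts, but ArXiv stores link URLs (<link href>) and categories (<category term>) as attributes, so set ignoreAttributes: false if you need them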
Boolean attribute parsing: Can fail in self-closing tags like <entry published="true"/>

Method 3: Using PDF Vector’s Academic Search API

PDF Vector’s Academic Search API provides a different approach by offering a unified API for multiple academic databases including ArXiv.

Code Example

import { PDFVector } from 'pdfvector';

const pdfvector = new PDFVector({
  apiKey: 'pdfvector_xxx' // Get from dashboard
});

// Search for papers by query
async function searchArxivViaPDFVector(query: string) {
  try {
    const response = await pdfvector.academicSearch({
      query: query,
      providers: ['arxiv'], // Can add more: ['pubmed', 'semantic-scholar']
      limit: 20,
      yearFrom: 2020,  // Built-in date filtering
      yearTo: 2024,
      fields: ['title', 'authors', 'abstract', 'arxivId', 'pdfURL']
    });

    return response.results.map(paper => ({
      title: paper.title,
      authors: paper.authors.map(a => a.name),
      abstract: paper.abstract,
      arxivId: paper.providerData?.arxivId,
      pdfUrl: paper.pdfURL
    }));
  } catch (error) {
    console.error('PDF Vector search failed:', error);
    return [];
  }
}

// Fetch specific papers by ArXiv ID
async function fetchArxivPaperByID(arxivIds: string[]) {
  try {
    const response = await pdfvector.academicFetch({
      ids: arxivIds,
      fields: ['title', 'authors', 'abstract', 'arxivId', 'pdfURL', 'date']
    });

    return response.results.map(paper => ({
      id: paper.id,
      title: paper.title,
      authors: paper.authors.map(a => a.name),
      abstract: paper.abstract,
      publishedDate: paper.date,
      pdfUrl: paper.pdfURL
    }));
  } catch (error) {
    console.error('PDF Vector fetch failed:', error);
    return [];
  }
}

const papers = await searchArxivViaPDFVector('quantum computing');
console.log(`Found ${papers.length} papers`);

const specificPapers = await fetchArxivPaperByID(['2301.00001', '2103.14030']);
console.log(specificPapers);
// Output: [{ id: '2301.00001', title: '...', authors: [...], ... }]

Advantages and Limitations

Pros:

✅ Returns clean JSON, no XML parsing needed
✅ Built-in date filtering (yearFrom/yearTo)
✅ Search multiple databases in one call
✅ Enrich each paper with additional data

Cons:

❌ Requires API key and registration
❌ Costs 2 credits per search or fetch

Common Issues:

Credit quota: Free tier limited to 100 credits/month.
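
As a quick budget check using the figures above (100 free credits per month, 2 credits per search or fetch), the free tier covers about 50 calls a month. The callsPerMonth helper is a hypothetical name used only for this arithmetic.

```typescript
// Hypothetical helper: how many whole API calls a monthly credit budget covers.
function callsPerMonth(monthlyCredits: number, creditsPerCall: number): number {
  if (creditsPerCall <= 0) {
    throw new Error('creditsPerCall must be positive');
  }
  return Math.floor(monthlyCredits / creditsPerCall);
}

console.log(callsPerMonth(100, 2)); // 50
```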

Making the Right Decision

Time Investment Reality

Consider the full lifecycle of your integration:

Initial Setup Time:

How long will namespace handling and XML parsing take to implement?
What’s the learning curve for your team?
How quickly do you need to ship?

Ongoing Maintenance Burden:

Who handles edge cases and format changes?
What happens when ArXiv updates their API?
Will future developers understand the XML parsing logic?

Key Considerations

Technical factors to evaluate:

Single source (ArXiv only) versus multi-database needs
Monthly query volume and rate limit constraints
Project type (prototype versus production application)
Team’s XML parsing expertise and maintenance capacity
Budget constraints versus development time costs

API service benefits to consider:

Multiple database access through one interface
Consistent JSON responses across all providers
Time saved on parsing and error handling
Built-in metadata enrichment (citations, references)
Someone else maintains the integration

The best choice depends on your specific context, timeline, and resources.
