
undms: Document Text & Metadata Extraction

High-performance document processing with built-in similarity comparison

📄

Multi-Format Support

Extract text from PDF, DOCX, XLSX, images, and plain text files with a unified API

🔍

Similarity Comparison

Compare documents against reference texts using Jaccard, N-gram, Levenshtein, or hybrid algorithms

📊

Rich Metadata

Extract format-specific metadata including EXIF data, PDF properties, DOCX statistics, and more

🖼️

OCR Support

Extract text from images using Tesseract OCR with automatic language detection

Parallel Processing

Documents are processed concurrently using Rayon for maximum performance

💎

TypeScript Support

Full type definitions included with intelligent autocomplete and type safety
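
To make the similarity metrics named above concrete, here is a minimal TypeScript sketch of two of them, Jaccard and Levenshtein. It only illustrates what the metrics compute; it is not the undms implementation, and the function names and the word-level tokenization are assumptions made for this example.

```typescript
// Illustrative only: undms implements its metrics natively, and its
// tokenization and normalization may differ from this sketch.

/** Jaccard similarity over lowercase ASCII word sets: |A ∩ B| / |A ∪ B|. */
function jaccardSimilarity(a: string, b: string): number {
  const tokenize = (s: string): Set<string> =>
    new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const setA = tokenize(a);
  const setB = tokenize(b);
  if (setA.size === 0 && setB.size === 0) return 1; // two empty texts match
  let intersection = 0;
  setA.forEach((token) => {
    if (setB.has(token)) intersection++;
  });
  const union = setA.size + setB.size - intersection;
  return intersection / union;
}

/** Levenshtein edit distance, computed with two rolling rows. */
function levenshteinDistance(a: string, b: string): number {
  let previous: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const current: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const substitutionCost = a[i - 1] === b[j - 1] ? 0 : 1;
      current[j] = Math.min(
        previous[j] + 1, // delete a character from `a`
        current[j - 1] + 1, // insert a character into `a`
        previous[j - 1] + substitutionCost, // substitute or keep
      );
    }
    previous = current;
  }
  return previous[b.length];
}

/** Scales the edit distance to a similarity score between 0 and 1. */
function levenshteinSimilarity(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshteinDistance(a, b) / longest;
}
```

For example, `jaccardSimilarity('the quick brown fox', 'the quick red fox')` shares three of five distinct words and yields 0.6, while `levenshteinDistance('kitten', 'sitting')` is 3. Jaccard ignores word order and repetition, whereas Levenshtein is sensitive to every character edit, which is one reason a hybrid of several metrics can be useful.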

Quick Example

Extract text and metadata from documents with a simple function call:

ts
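import { readFileSync } from 'fs';
import { extract, computeDocumentSimilarity } from 'undms';

// Load the PDF into memory; 'report.pdf' is a placeholder path.
const pdfData = readFileSync('report.pdf');

const documents = [
  {
    name: 'report.pdf',
    size: pdfData.length,
    type: 'application/pdf',
    lastModified: Date.now(),
    webkitRelativePath: '',
    buffer: pdfData,
  },
];

const result = extract(documents);

// Print the extracted text and metadata of the first document.
console.log(result[0].documents[0].content);
console.log(result[0].documents[0].metadata);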
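
When processing several files, building each descriptor by hand gets repetitive. The sketch below shows one way to derive the fields used in the Quick Example from a file on disk. `DocumentInput`, `describeBuffer`, `loadDocument`, and the extension-to-MIME table are hypothetical names introduced for this example and are not part of undms; whether undms accepts the `application/octet-stream` fallback is also an assumption.

```typescript
import { readFileSync, statSync } from 'fs';
import { basename, extname } from 'path';

// Hypothetical type mirroring the fields shown in the Quick Example.
interface DocumentInput {
  name: string;
  size: number;
  type: string;
  lastModified: number;
  webkitRelativePath: string;
  buffer: Buffer;
}

// Assumed extension-to-MIME mapping; extend it for the formats you process.
const MIME_TYPES: Record<string, string> = {
  '.pdf': 'application/pdf',
  '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  '.xlsx': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
  '.png': 'image/png',
  '.jpg': 'image/jpeg',
  '.jpeg': 'image/jpeg',
  '.txt': 'text/plain',
};

/** Builds a descriptor from an in-memory buffer and a file name. */
function describeBuffer(
  name: string,
  buffer: Buffer,
  lastModified: number = Date.now(),
): DocumentInput {
  const extension = extname(name).toLowerCase();
  return {
    name,
    size: buffer.length, // actual byte length rather than a hard-coded value
    type: MIME_TYPES[extension] ?? 'application/octet-stream',
    lastModified,
    webkitRelativePath: '',
    buffer,
  };
}

/** Reads a file from disk and wraps it in a descriptor. */
function loadDocument(filePath: string): DocumentInput {
  const buffer = readFileSync(filePath);
  return describeBuffer(basename(filePath), buffer, statSync(filePath).mtimeMs);
}
```

With helpers like these, a batch could be assembled as `['a.pdf', 'b.docx'].map(loadDocument)` and passed to `extract` in a single call.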

Performance

Built with Rust using napi-rs for native Node.js performance:

Operation            Time
Extract 10 PDFs      ~50ms
Extract 10 DOCX      ~30ms
Extract 10 Images    ~120ms
Similarity Check     ~5ms
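
Timings like these depend on hardware and on the documents themselves, so it can be worth measuring on your own inputs. The following is a small, generic timing sketch; `measureMedianMs` is a hypothetical helper, not part of undms, and `task` stands in for a call such as `() => extract(documents)`.

```typescript
/**
 * Runs `task` several times and returns the median wall-clock time in
 * milliseconds. The median is less sensitive to one-off pauses (such as
 * garbage collection) than the mean.
 */
function measureMedianMs(task: () => unknown, runs: number = 10): number {
  if (runs < 1) throw new RangeError('runs must be at least 1');
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = process.hrtime();
    task();
    // process.hrtime(start) returns the elapsed [seconds, nanoseconds].
    const [seconds, nanoseconds] = process.hrtime(start);
    samples.push(seconds * 1e3 + nanoseconds / 1e6);
  }
  samples.sort((x, y) => x - y);
  const middle = Math.floor(samples.length / 2);
  return samples.length % 2 === 1
    ? samples[middle]
    : (samples[middle - 1] + samples[middle]) / 2;
}
```

A call such as `measureMedianMs(() => extract(documents), 20)` would then report a median over 20 runs. Adding a few untimed warm-up calls first is a common refinement.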

Supported Platforms

  • Node.js 12.22+ (except 13.x)
  • Node.js 14.17+, 15.12+, 16+
  • Bun
  • Web browsers (via browser.js)

License

MIT License - View on GitHub
