undms: Document Text & Metadata Extraction
High-performance document processing with built-in similarity comparison

- Extract text from PDF, DOCX, XLSX, image, and plain-text files through a unified API.
- Compare documents against reference texts using Jaccard, N-gram, Levenshtein, or hybrid algorithms.
- Extract format-specific metadata, including EXIF data, PDF properties, and DOCX statistics.
- Extract text from images using Tesseract OCR with automatic language detection.
- Process documents concurrently using Rayon.
- Use the bundled type definitions for editor autocomplete and type safety.

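
To make the Jaccard option named above concrete, here is a minimal sketch of word-level Jaccard similarity: the size of the intersection of two token sets divided by the size of their union. It is an illustration only, not the undms implementation; the function names `tokenize` and `jaccardSimilarity` and the tokenization rule (lowercase, split on non-alphanumeric characters) are assumptions made for this example.

```typescript
// Illustrative sketch only; undms may tokenize and score differently.

// Lowercase the text and split it into a set of alphanumeric tokens.
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, ranging from 0 to 1.
function jaccardSimilarity(a: string, b: string): number {
  const setA = tokenize(a);
  const setB = tokenize(b);

  // Assumption: two texts with no tokens are treated as identical.
  if (setA.size === 0 && setB.size === 0) return 1;

  let intersection = 0;
  setA.forEach((token) => {
    if (setB.has(token)) intersection++;
  });

  const union = setA.size + setB.size - intersection;
  return intersection / union;
}

// Shares "the", "quick", "fox" (3 tokens) out of 5 distinct tokens.
console.log(jaccardSimilarity('the quick brown fox', 'the quick red fox')); // 0.6
```

Set-based scores like this ignore word order and repetition, which is one reason a library might also offer N-gram and edit-distance (Levenshtein) measures.
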
Extract text and metadata from documents with a simple function call:
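```typescript
import { readFileSync } from 'node:fs';
import { extract, computeDocumentSimilarity } from 'undms';

// Raw bytes of the file to process
const pdfData = readFileSync('report.pdf');

// Each document mirrors the browser File shape, plus a buffer with the contents
const documents = [
  {
    name: 'report.pdf',
    size: 1024,
    type: 'application/pdf',
    lastModified: Date.now(),
    webkitRelativePath: '',
    buffer: Buffer.from(pdfData),
  },
];

const result = extract(documents);

// Extracted text and format-specific metadata of the first document
console.log(result[0].documents[0].content);
console.log(result[0].documents[0].metadata);
```

Built with Rust using napi-rs for native Node.js performance:
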
| Operation | Time |
|---|---|
| Extract 10 PDFs | ~50ms |
| Extract 10 DOCX | ~30ms |
| Extract 10 Images | ~120ms |
| Similarity Check | ~5ms |
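
The document objects passed to `extract` in the example above carry a name, size, MIME type, timestamp, and a `Buffer`. As a convenience, the sketch below builds such an object from a file name and its bytes. The `DocumentInput` interface, the `toDocumentInput` helper, and the extension-to-MIME table are hypothetical names introduced for illustration; they are not part of the undms API, and which MIME types undms accepts is an assumption here.

```typescript
import { extname } from 'node:path';

// Hypothetical shape mirroring the fields used in the extract example;
// not copied from the undms type definitions.
interface DocumentInput {
  name: string;
  size: number;
  type: string;
  lastModified: number;
  webkitRelativePath: string;
  buffer: Buffer;
}

// Assumed mapping from file extension to MIME type for the listed formats.
const MIME_TYPES: Record<string, string> = {
  '.pdf': 'application/pdf',
  '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  '.xlsx': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
  '.png': 'image/png',
  '.jpg': 'image/jpeg',
  '.jpeg': 'image/jpeg',
  '.txt': 'text/plain',
};

// Build a document object from a file name and its raw bytes.
function toDocumentInput(
  name: string,
  buffer: Buffer,
  lastModified: number = Date.now(),
): DocumentInput {
  return {
    name,
    size: buffer.length, // derive the size from the actual bytes
    type: MIME_TYPES[extname(name).toLowerCase()] ?? 'application/octet-stream',
    lastModified,
    webkitRelativePath: '',
    buffer,
  };
}
```

With a helper like this, a file read via `readFileSync` from `node:fs` could be wrapped as `toDocumentInput('report.pdf', readFileSync('report.pdf'))` before being passed to `extract`, so the `size` field always matches the buffer.
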

Released under the MIT License.