Chandra: Layout-aware, multilingual OCR for structured document extraction

Chandra is an OCR and structured-extraction platform for complex documents, emphasizing multilingual support, layout preservation, and strong table/math recognition—suited for organizations needing high-quality document digitization and structured outputs.

GitHub: datalab-to/chandra · Updated: 2026-03-27 · Branch: main · Stars: 7.6K · Forks: 762

Tags: OCR, Document Intelligence, Multilingual, Layout-aware, Tables/Math Recognition, vLLM/HuggingFace, CLI/Streamlit, PDF Processing, Apache-2.0 (code), OpenRAIL-M (model)

💡 Deep Analysis

How should a pipeline be designed for large-scale document processing to avoid OOM, truncation and performance degradation?

Core Analysis

Project Positioning: Chandra exposes batch and pagination controls, but large-scale stability requires careful pipeline design and resource governance.

Technical Strategies

Sharding & pagination: Use --page-range and region-level splits for very large/high-res images to avoid single-input OOMs.
Batch & concurrency control: Tune --batch-size and --max-workers to fit the available GPU memory (vLLM supports larger batches); set --max-output-tokens to cap runaway generation.
Pre/post-processing: Add denoising, deskew, and resolution adjustments; keep cropped images for human review.
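
The page-level sharding described above can be sketched as a small helper that turns a page count into range strings, one per shard. This is a minimal sketch, not part of Chandra: `page_range_shards` is a hypothetical name, and it assumes --page-range accepts 1-based inclusive "start-end" strings, which should be checked against the CLI help.

```python
def page_range_shards(num_pages: int, shard_size: int) -> list[str]:
    """Split pages 1..num_pages into inclusive 'start-end' range strings.

    Assumption: the target CLI takes 1-based, inclusive page ranges.
    """
    if num_pages < 0 or shard_size < 1:
        raise ValueError("num_pages must be >= 0 and shard_size >= 1")
    shards = []
    for start in range(1, num_pages + 1, shard_size):
        end = min(start + shard_size - 1, num_pages)
        # A single-page shard is written as just the page number.
        shards.append(f"{start}-{end}" if end > start else str(start))
    return shards


print(page_range_shards(10, 4))  # ['1-4', '5-8', '9-10']
```

Each returned string can then be submitted as its own job, so a failure in one shard does not force the whole document to be reprocessed.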

Practical Pipeline Template

Preprocessing: Auto-denoise, deskew, resize, and page cropping.
Sharding & scheduling: Submit small files at file level; shard large docs by page and submit batches to vLLM with controlled batch-size.
Monitoring & adaptation: Use _metadata.json to log tokens/latency/errors and adapt batch/concurrency dynamically.
Retry strategy: For timeouts/OOMs, retry with smaller shards/batches and retain original crops for inspection.
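
The retry step above can be sketched as a loop that halves the batch size after a failure and sets aside pages that still fail at the minimum size. This is a hedged illustration: `process_with_backoff` and `run_batch` are hypothetical names, and the exception types are placeholders. A real backend will likely raise something else on GPU OOM or timeout (for example a framework-specific error or an HTTP error), so the `except` clause needs adapting.

```python
from typing import Callable, Sequence


def process_with_backoff(
    pages: Sequence[str],
    run_batch: Callable[[Sequence[str]], list],
    batch_size: int,
    min_batch_size: int = 1,
) -> tuple[list, list]:
    """Process pages in batches, halving the batch size after a failure.

    Returns (results, failed_pages). Pages that still fail at
    min_batch_size are collected so they can be inspected by a human.
    """
    results, failed = [], []
    size = max(batch_size, min_batch_size)
    i = 0
    while i < len(pages):
        batch = pages[i:i + size]
        try:
            results.extend(run_batch(batch))  # run_batch is caller-supplied
            i += len(batch)
        except (MemoryError, TimeoutError):   # placeholder exception types
            if size > min_batch_size:
                size = max(min_batch_size, size // 2)  # retry the same pages
            else:
                failed.extend(batch)  # give up on these pages, keep for review
                i += len(batch)
    return results, failed
```

The batch size never grows back in this sketch; a production version might raise it again after a run of successes, using the logged metrics as a guide.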

Important Notice: Don’t increase batch-size blindly—benchmark on target hardware and use metadata metrics for dynamic tuning.

Summary: Preprocessing, sharding, tuned batching and concurrency, and metadata-driven monitoring and retries are all needed to avoid OOMs and keep throughput stable.


How does Chandra's technical architecture support high-fidelity layout awareness and table/formula reconstruction?

Core Analysis

Project Positioning: Chandra treats pages as layout-aware inputs and trains models to directly emit structured markup, making it more robust than traditional multi-step OCR for tables/formulas/complex layouts.

Technical Features

Layout-aware end-to-end generation: The model considers semantic and layout boundaries when producing Markdown/HTML/JSON, reducing reliance on handcrafted assembly rules.
Dual inference paths: vLLM (Dockerized, production-optimized) for high throughput and GPU acceleration; HuggingFace local for experimentation and fine-tuning.
Engineering metadata: _metadata.json includes page info and token counts, which enables sharding, monitoring, and retry strategies.

Practical Recommendations

Validate table/formula outputs post-hoc: Use rule-based checks or secondary models to verify cell boundaries and math expression integrity.
Use metadata for sharding: Split very long pages or high-res images into chunks to avoid OOMs and token truncation.
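
One way to use the metadata for this is to flag pages whose token count sits at or near the output cap, since those are the likeliest to have been truncated and are candidates for re-sharding. The sketch below is an assumption-heavy illustration: the key names `pages`, `page_num`, and `token_count` are guesses at the _metadata.json layout and must be checked against a real file, and `likely_truncated_pages` is a hypothetical helper.

```python
import json


def likely_truncated_pages(metadata: dict, max_output_tokens: int,
                           margin: float = 0.98) -> list:
    """Return page identifiers whose token count is at or near the cap.

    Assumed layout: {"pages": [{"page_num": ..., "token_count": ...}, ...]}.
    """
    threshold = max_output_tokens * margin
    flagged = []
    for page in metadata.get("pages", []):
        if page.get("token_count", 0) >= threshold:
            flagged.append(page.get("page_num"))
    return flagged


# Inline sample standing in for a real _metadata.json file.
sample = json.loads(
    '{"pages": [{"page_num": 0, "token_count": 812},'
    ' {"page_num": 1, "token_count": 8192}]}'
)
print(likely_truncated_pages(sample, max_output_tokens=8192))  # [1]
```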

Important Notice: End-to-end generation improves reconstruction for complex layouts, but severely damaged pages or bleed-through still require human review or additional processing.

Summary: Architecturally, Chandra embeds layout in the generation step and combines flexible backends with metadata plumbing — this is its core strength for reconstructing complex documents.


For production deployment, how should one choose and optimize between vLLM (Docker) and HuggingFace (local)?

Core Analysis

Project Positioning: vLLM (Docker) is aimed at production-scale processing; HuggingFace (local) is for development, fine-tuning, or low-throughput offline work.

Technical Trade-offs

vLLM (production-first): Unified image, GPU orchestration, horizontal scaling; larger default batch-size (README example: 28), fits batch processing and low-latency services.
HuggingFace (dev-first): Flexible for fine-tuning and local experimentation but constrained by GPU memory and local deps (torch, flash attention), with smaller default batch-size.

Practical Recommendations

Production: Use the chandra_vllm Docker container, configure VLLM_MODEL_NAME, set appropriate --max-workers and --batch-size, and rely on _metadata.json for monitoring and cost allocation.
Dev/validation: Use --method hf for quick quality checks on small samples before migrating to vLLM.
Performance tuning checklist: Tune batch-size, max-output-tokens, and worker concurrency; shard very large documents.
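
To keep these tuning knobs in one reviewable place, a pipeline might build the command line from explicit parameters rather than hand-written strings. The sketch below only assembles an argument list and does not run anything. The flag names come from the text above; the positional order (input path, then output directory), the `vllm` value for --method, and the helper name `build_chandra_argv` are assumptions to verify against the CLI help.

```python
import shlex


def build_chandra_argv(input_path, output_dir, method="vllm", batch_size=None,
                       max_workers=None, max_output_tokens=None,
                       page_range=None):
    """Assemble an argv list; options left as None are omitted."""
    # Assumed positional layout: chandra <input> <output> [options]
    argv = ["chandra", input_path, output_dir, "--method", method]
    optional = {
        "--batch-size": batch_size,
        "--max-workers": max_workers,
        "--max-output-tokens": max_output_tokens,
        "--page-range": page_range,
    }
    for flag, value in optional.items():
        if value is not None:
            argv += [flag, str(value)]
    return argv


# Print the command for review instead of executing it.
print(shlex.join(build_chandra_argv("report.pdf", "out",
                                    batch_size=8, max_workers=4)))
```

The resulting list can be passed to `subprocess.run` once the layout has been confirmed; logging the joined string alongside the run's metadata makes later benchmarking comparisons easier.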

Important Notice: HF local mode can OOM or be extremely slow—benchmark on target GPUs and install acceleration libs (e.g. flash attention).

Summary: Default to vLLM for production, HF for development; stabilize performance via batching, concurrency, token limits and metadata-driven monitoring.


How does Chandra perform on tables, math formulas, handwriting and forms? What are common failure modes?

Core Analysis

Project Positioning: Chandra provides specialized improvements for tables, math formulas, handwriting and forms, producing consumable structured outputs—but it is not infallible.

Technical Behavior and Failure Modes

Tables: End-to-end generation reconstructs cell and row/column relations for most financial/statistical tables; failures include wrong merges/splits and nested table misalignment.
Math formulas: Better structural preservation than classic OCR, but small symbols or handwritten math can suffer character loss or syntax misplacement.
Handwriting: Good for common handwriting and educational notes; accuracy drops for idiosyncratic cursive or rare styles.
Forms/checkboxes: Can reconstruct fields and detect checkboxes, but tilted, occluded or partially missing checkboxes can be misread.

Practical Recommendations

Post-hoc validation: Use rules or secondary models (regex, syntax parsers) to validate table cells and math expressions.
Keep cropped images: Save cell/field crops for human verification and correction.
Stratified sampling: In production, sample outputs from each document type (stratified by type) to catch edge-case failures that a uniform sample would miss.
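
As one example of a rule-based post-hoc check, the sketch below tests whether an HTML table is rectangular, meaning every row covers the same number of columns once colspan and rowspan are counted. Wrong merges and splits often show up as ragged rows. It uses only Python's standard `html.parser`; the helper names are hypothetical, and it assumes one table is passed at a time and does not handle nested tables.

```python
from html.parser import HTMLParser


def _span(value):
    """Parse a colspan/rowspan attribute, falling back to 1 on bad input."""
    return int(value) if value and value.isdigit() and int(value) > 0 else 1


class _RowWidths(HTMLParser):
    """Record the effective column count of every <tr> in the input."""

    def __init__(self):
        super().__init__()
        self.widths = []    # effective width of each finished row
        self._carry = []    # [colspan, rows_left] for cells spanning downward
        self._pending = []  # rowspan cells opened in the current row
        self._in_row = False
        self._current = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._finish_row()  # tolerate a missing </tr>
            self._in_row = True
            # Columns already occupied by rowspans from earlier rows.
            self._current = sum(cols for cols, _ in self._carry)
            self._carry = [[c, r - 1] for c, r in self._carry if r > 1]
        elif tag in ("td", "th") and self._in_row:
            attr = dict(attrs)
            cols = _span(attr.get("colspan"))
            rows = _span(attr.get("rowspan"))
            self._current += cols
            if rows > 1:
                self._pending.append([cols, rows - 1])

    def handle_endtag(self, tag):
        if tag == "tr":
            self._finish_row()

    def _finish_row(self):
        if self._in_row:
            self.widths.append(self._current)
            self._carry.extend(self._pending)
            self._pending = []
            self._in_row = False


def row_widths(table_html):
    """Return the effective column count of each row."""
    parser = _RowWidths()
    parser.feed(table_html)
    parser.close()
    parser._finish_row()
    return parser.widths


def is_rectangular(table_html):
    """True when every row covers the same number of columns."""
    return len(set(row_widths(table_html))) <= 1
```

Tables that fail the check can be routed to human review together with their saved crops, rather than being rejected outright.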

Important Notice: For extremely poor scans, bleed-through, or missing regions, automatic reconstruction may be unacceptable—use denoising/deskew or human intervention.

Summary: Chandra significantly improves complex-structure extraction but should be paired with validation and human-in-the-loop for edge cases.


How should one validate and calibrate the model on target languages or specialty handwriting to ensure usability?

Core Analysis

Project Positioning: Chandra claims 90+ language support and handwriting optimizations, but users must validate and calibrate for minority languages or unusual handwriting styles.

Validation & Calibration Workflow

Sampling: Collect 10–100 representative documents covering paper quality, fonts/handwriting styles, and common variations.
End-to-end evaluation: Run chandra and measure recall/precision on key fields (form fields, table columns, formulas) and categorize errors.
Error analysis: Identify frequent failure modes (character substitution, row/column misalignment, cursive misreads) and plan targeted fixes.
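
For the evaluation step, a minimal way to score key fields is exact-match precision and recall over (field, value) pairs per document. The sketch below is illustrative only: the field names and the `field_precision_recall` helper are hypothetical, and exact matching after whitespace stripping is a deliberately strict choice. Normalized or fuzzy matching may suit some fields better.

```python
def field_precision_recall(predicted: dict, expected: dict) -> tuple:
    """Exact-match precision and recall over (field, value) pairs.

    Empty or whitespace-only values are treated as "not extracted".
    """
    pred = {(k, v.strip()) for k, v in predicted.items() if v and v.strip()}
    gold = {(k, v.strip()) for k, v in expected.items() if v and v.strip()}
    correct = len(pred & gold)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical invoice fields; "1O0.00" simulates a letter/digit confusion.
predicted = {"invoice_no": "A-17", "total": "1O0.00", "date": "2024-01-05"}
expected = {"invoice_no": "A-17", "total": "100.00",
            "date": "2024-01-05", "vendor": "Acme"}
precision, recall = field_precision_recall(predicted, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of three predicted fields are correct and two of four expected fields are recovered. The mismatched pairs are also the raw material for the error-analysis step.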

Possible Remediation Steps

Post-processing rules: Use regex, domain dictionaries or lexicons to fix frequent mistakes.
Fine-tuning / few-shot: If licensing permits, fine-tune on a small labeled set to improve target-language/handwriting performance.
Secondary models: Use lightweight specialized models for hard fields (handwritten words or math symbols).
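
The post-processing rules above can start very small. The sketch below shows two illustrative rules using only the standard library: a look-alike character fix for fields already known to be numeric, and snapping a word to the closest entry in a domain lexicon. The substitution table, the cutoff, and the helper names are assumptions for illustration and would need tuning against the errors found during error analysis.

```python
import difflib

# Look-alike substitutions (illustrative, not exhaustive).
_DIGIT_FIXES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)


def fix_numeric_field(value: str) -> str:
    """Apply look-alike fixes to a field that is known to be numeric.

    Do not run this on free text: it would corrupt ordinary words.
    """
    return value.translate(_DIGIT_FIXES)


def snap_to_lexicon(word: str, lexicon: list, cutoff: float = 0.8) -> str:
    """Replace word with its closest lexicon entry, if one is close enough."""
    matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


print(fix_numeric_field("1O0.0O"))                       # 100.00
print(snap_to_lexicon("Invoce", ["Invoice", "Receipt"]))  # Invoice
```

Restricting each rule to the fields where it is known to be safe keeps the fixes from introducing new errors elsewhere.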

Important Notice: Verify model licensing for large-scale commercial use (OpenRAIL-M style limits) and validate thoroughly on the target language.

Summary: A loop of small-sample benchmarking → error analysis → targeted rules/fine-tuning/secondary models reliably adapts Chandra to specific languages and handwriting styles.


✨ Highlights

Structured outputs (HTML/MD/JSON) that preserve layout
Supports 90+ languages with strong handwriting recognition
Offers both local (HuggingFace) and server-based (vLLM) inference modes
Model weights use OpenRAIL-M; commercial use is restricted

🔧 Engineering

Uses vLLM/HuggingFace backends to produce layout-preserving HTML/Markdown/JSON, extracting images and metadata
Specifically optimized and benchmarked for tables, math, forms, and complex multi-column layouts

⚠️ Risks

OpenRAIL-M license for model weights imposes clear commercial limits; verify compliance before deployment
No release or commit statistics were reported for the repo, so community contribution could not be assessed; evaluate long-term maintenance risk before adopting
Local backends demand significant GPU/memory resources for inference
Benchmarks are largely in-house; limited public third-party validation

👥 Who is it for?

Geared to enterprises and research teams needing high-fidelity document digitization—invoices, textbooks, forms, and handwritten notes
Also suitable for engineers building document pipelines and for data/annotation teams that need batch processing or API integration