Chandra: Layout-aware, multilingual OCR for structured document extraction
Chandra is an OCR and structured-extraction platform for complex documents, emphasizing multilingual support, layout preservation, and strong table/math recognition—suited for organizations needing high-quality document digitization and structured outputs.
GitHub: datalab-to/chandra · Updated: 2026-03-27 · Branch: main · Stars: 7.6K · Forks: 762
Tags: OCR, Document Intelligence, Multilingual, Layout-aware, Tables/Math Recognition, vLLM/HuggingFace, CLI/Streamlit, PDF Processing, Apache-2.0 (code), OpenRAIL-M (model)

💡 Deep Analysis

How should a pipeline be designed for large-scale document processing to avoid OOM, truncation and performance degradation?

Core Analysis

Project Positioning: Chandra exposes batch and pagination controls, but large-scale stability requires careful pipeline design and resource governance.

Technical Strategies

  • Sharding & pagination: Use --page-range and region-level splits for very large/high-res images to avoid single-input OOMs.
  • Batch & concurrency control: Tune --batch-size and --max-workers to fit available GPU memory (vLLM supports larger batches); set --max-output-tokens to cap runaway generation.
  • Pre/post-processing: Add denoising, deskew, and resolution adjustments; keep cropped images for human review.
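
As an illustration of page-level sharding, the sketch below splits a document into consecutive page ranges that could be passed to --page-range one at a time. It assumes ranges are written as 1-based, inclusive "start-end" strings; that format and the helper name page_range_shards are assumptions for this example, so check the CLI help before relying on them.

```python
from typing import List


def page_range_shards(total_pages: int, shard_size: int) -> List[str]:
    """Split pages 1..total_pages into consecutive ranges such as "1-10".

    Assumption: the CLI accepts 1-based, inclusive "start-end" strings.
    """
    if total_pages < 1 or shard_size < 1:
        raise ValueError("total_pages and shard_size must be positive")
    shards = []
    for start in range(1, total_pages + 1, shard_size):
        end = min(start + shard_size - 1, total_pages)
        # A single-page shard is written as "7" rather than "7-7".
        shards.append(str(start) if start == end else f"{start}-{end}")
    return shards
```

A 25-page PDF with a shard size of 10 yields three ranges, the last one shorter than the others. Smaller shards reduce peak memory per job at the cost of more scheduling overhead.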

Practical Pipeline Template

  1. Preprocessing: Auto-denoise, deskew, resize, and page cropping.
  2. Sharding & scheduling: Submit small files whole; shard large documents by page and submit the shards to vLLM in batches with a controlled batch size.
  3. Monitoring & adaptation: Use _metadata.json to log tokens/latency/errors and adapt batch/concurrency dynamically.
  4. Retry strategy: For timeouts/OOMs, retry with smaller shards/batches and retain original crops for inspection.
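
The retry step above can be sketched as a recursive halving of the failed page span. This is a minimal sketch, not Chandra's own retry logic: run_shard is a hypothetical callable supplied by the pipeline (for example, a wrapper that invokes the OCR job), and ShardFailed is a hypothetical exception that wrapper raises on a timeout or OOM.

```python
from typing import Callable, List, Tuple

PageSpan = Tuple[int, int]  # inclusive (first_page, last_page)


class ShardFailed(Exception):
    """Hypothetical error raised by the caller's worker on timeout or OOM."""


def process_with_backoff(
    span: PageSpan,
    run_shard: Callable[[PageSpan], None],
    min_pages: int = 1,
) -> List[PageSpan]:
    """Process a page span, halving it whenever the worker fails.

    Returns the spans that still failed at the minimum size so they can be
    queued for human inspection together with their original crops.
    """
    first, last = span
    try:
        run_shard(span)
        return []
    except ShardFailed:
        size = last - first + 1
        if size <= min_pages:
            return [span]  # cannot shrink further; report it
        mid = first + size // 2 - 1
        failed = process_with_backoff((first, mid), run_shard, min_pages)
        failed += process_with_backoff((mid + 1, last), run_shard, min_pages)
        return failed
```

Halving isolates a single problematic page in a logarithmic number of retries, while the healthy pages around it still complete.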

Important Notice: Don’t increase batch-size blindly—benchmark on target hardware and use metadata metrics for dynamic tuning.
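
One possible form of dynamic tuning is an additive-increase / multiplicative-decrease controller, sketched below. The latency and error inputs are assumed to come from whatever the pipeline logs per batch; the class name, the ceiling of 64, and the 30-second latency budget are illustrative values, not Chandra defaults.

```python
class BatchSizeController:
    """Additive-increase / multiplicative-decrease tuning of the batch size."""

    def __init__(self, batch_size, min_size=1, max_size=64, target_latency_s=30.0):
        # max_size and target_latency_s are illustrative, not Chandra defaults.
        self.batch_size = batch_size
        self.min_size = min_size
        self.max_size = max_size
        self.target_latency_s = target_latency_s

    def update(self, latency_s, had_error):
        """Return the batch size to use for the next batch."""
        if had_error or latency_s > self.target_latency_s:
            # Back off quickly after an OOM, timeout, or slow batch.
            self.batch_size = max(self.min_size, self.batch_size // 2)
        else:
            # Probe upward slowly while batches stay healthy.
            self.batch_size = min(self.max_size, self.batch_size + 1)
        return self.batch_size
```

The asymmetry is deliberate: a failed batch wastes a full retry, so the controller gives up capacity fast and regains it one step at a time.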

Summary: Preprocessing, sharding, tuned batching and concurrency, and metadata-driven monitoring and retry are all essential to avoid OOMs and keep throughput stable.

89.0%
How does Chandra's technical architecture support high-fidelity layout awareness and table/formula reconstruction?

Core Analysis

Project Positioning: Chandra treats pages as layout-aware inputs and trains models to directly emit structured markup, making it more robust than traditional multi-step OCR for tables/formulas/complex layouts.

Technical Features

  • Layout-aware end-to-end generation: The model considers semantic and layout boundaries when producing Markdown/HTML/JSON, reducing reliance on handcrafted assembly rules.
  • Dual inference paths: vLLM (Dockerized, production-optimized) for high throughput and GPU acceleration; HuggingFace local for experimentation and fine-tuning.
  • Engineering metadata: _metadata.json includes page info and token counts, which enable sharding, monitoring, and retry strategies.

Practical Recommendations

  1. Validate table/formula outputs post-hoc: Use rule-based checks or secondary models to verify cell boundaries and math expression integrity.
  2. Use metadata for sharding: Split very long pages or high-res images into chunks to avoid OOMs and token truncation.
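
A rule-based check on table output can be as simple as verifying that every row has the same effective width. The sketch below assumes the table arrives as HTML and uses only the Python standard library; it counts colspan but ignores rowspan, and it treats all rows in the input as one table, so it is a first-pass filter and not a full validator.

```python
from html.parser import HTMLParser
from typing import List


class TableShapeParser(HTMLParser):
    """Collect the effective column count of each <tr> in an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.row_widths: List[int] = []
        self._current = None  # width of the row being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current = 0
        elif tag in ("td", "th") and self._current is not None:
            # colspan widens a cell; fall back to 1 when absent or malformed.
            span = dict(attrs).get("colspan") or "1"
            self._current += int(span) if span.isdigit() else 1

    def handle_endtag(self, tag):
        if tag == "tr" and self._current is not None:
            self.row_widths.append(self._current)
            self._current = None


def ragged_rows(html: str) -> List[int]:
    """Return indexes of rows whose width differs from the first row."""
    parser = TableShapeParser()
    parser.feed(html)
    widths = parser.row_widths
    return [i for i, w in enumerate(widths) if w != widths[0]]
```

Rows flagged here are good candidates for the human-review queue, since a width mismatch often signals a wrong merge or split.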

Important Notice: End-to-end generation improves reconstruction for complex layouts, but extreme page damage or bleed-through still requires human review or additional processing.

Summary: Architecturally, Chandra embeds layout in the generation step and combines flexible backends with metadata plumbing. This combination is its core strength for reconstructing complex documents.

88.0%
For production deployment, how should one choose and optimize between vLLM (Docker) and HuggingFace (local)?

Core Analysis

Project Positioning: vLLM (Docker) is aimed at production-scale processing; HuggingFace (local) is for development, fine-tuning, or low-throughput offline work.

Technical Trade-offs

  • vLLM (production-first): Unified image, GPU orchestration, and horizontal scaling, with a larger default batch-size (README example: 28); it fits batch processing and low-latency services.
  • HuggingFace (dev-first): Flexible for fine-tuning and local experimentation but constrained by GPU memory and local deps (torch, flash attention), with smaller default batch-size.

Practical Recommendations

  1. Production: Use the chandra_vllm Docker container, configure VLLM_MODEL_NAME, set appropriate --max-workers and --batch-size, and rely on _metadata.json for monitoring and cost allocation.
  2. Dev/validation: Use --method hf for quick quality checks on small samples before migrating to vLLM.
  3. Performance tuning checklist: Tune batch-size, max-output-tokens, and worker concurrency; shard very large documents.
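
To keep tuning parameters in one place, a pipeline might assemble the command line programmatically. The sketch below uses only the flags named in this analysis (--method, --batch-size, --max-workers, --max-output-tokens, --page-range). The positional order of input path and output directory is an assumption, as is the helper name build_chandra_command, so confirm both against the CLI help.

```python
from typing import List, Optional


def build_chandra_command(
    input_path: str,
    output_dir: str,
    method: str = "hf",
    batch_size: Optional[int] = None,
    max_workers: Optional[int] = None,
    max_output_tokens: Optional[int] = None,
    page_range: Optional[str] = None,
) -> List[str]:
    """Build an argv list; unset options are left to the CLI's defaults."""
    # Assumption: the CLI takes the input path, then the output directory.
    cmd = ["chandra", input_path, output_dir, "--method", method]
    optional = [
        ("--batch-size", batch_size),
        ("--max-workers", max_workers),
        ("--max-output-tokens", max_output_tokens),
        ("--page-range", page_range),
    ]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd
```

The returned list can be passed to subprocess.run without shell quoting, and logging it alongside each job makes benchmark runs reproducible.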

Important Notice: HF local mode can OOM or be extremely slow—benchmark on target GPUs and install acceleration libs (e.g. flash attention).

Summary: Default to vLLM for production, HF for development; stabilize performance via batching, concurrency, token limits and metadata-driven monitoring.

87.0%
How does Chandra perform on tables, math formulas, handwriting and forms? What are common failure modes?

Core Analysis

Project Positioning: Chandra provides specialized improvements for tables, math formulas, handwriting and forms, producing consumable structured outputs—but it is not infallible.

Technical Behavior and Failure Modes

  • Tables: End-to-end generation reconstructs cell and row/column relations for most financial/statistical tables; failures include wrong merges/splits and nested table misalignment.
  • Math formulas: Better structural preservation than classic OCR, but small symbols or handwritten math can suffer character loss or syntax misplacement.
  • Handwriting: Good for common handwriting and educational notes; accuracy drops for idiosyncratic cursive or rare styles.
  • Forms/checkboxes: Can reconstruct fields and detect checkboxes, but tilted, occluded or partially missing checkboxes can be misread.

Practical Recommendations

  1. Post-hoc validation: Use rules or secondary models (regex, syntax parsers) to validate table cells and math expressions.
  2. Keep cropped images: Save cell/field crops for human verification and correction.
  3. Stratified sampling: Perform stratified sampling in production to detect edge case failures across document types.
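
Two of these recommendations lend themselves to small helpers. The sketch below shows a cheap brace-balance check for extracted math (assuming LaTeX-style output) and a stratified sampler that draws a fixed number of documents per document type for review. Both function names are illustrative, and the brace check catches only truncated or misnested groups, not semantic errors.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def braces_balanced(latex: str) -> bool:
    """Return True if unescaped { and } are properly nested."""
    depth = 0
    escaped = False
    for ch in latex:
        if escaped:
            escaped = False  # skip the character after a backslash, e.g. \{
        elif ch == "\\":
            escaped = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def stratified_sample(
    items: Iterable[Tuple[str, str]], per_stratum: int, seed: int = 0
) -> Dict[str, List[str]]:
    """items are (doc_id, doc_type) pairs; draw up to per_stratum ids per type."""
    by_type = defaultdict(list)
    for doc_id, doc_type in items:
        by_type[doc_type].append(doc_id)
    rng = random.Random(seed)  # fixed seed keeps audits reproducible
    return {t: rng.sample(ids, min(per_stratum, len(ids))) for t, ids in by_type.items()}
```

Sampling per document type keeps rare categories, such as handwritten forms, from being drowned out by the most common ones.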

Important Notice: For extremely poor scans, bleed-through, or missing regions, automatic reconstruction may be unacceptable—use denoising/deskew or human intervention.

Summary: Chandra significantly improves complex-structure extraction but should be paired with validation and human-in-the-loop for edge cases.

86.0%
How should one validate and calibrate the model on target languages or specialty handwriting to ensure usability?

Core Analysis

Project Positioning: Chandra claims 90+ language support and handwriting optimizations, but users must validate and calibrate for minority languages or unusual handwriting styles.

Validation & Calibration Workflow

  • Sampling: Collect 10–100 representative documents covering paper quality, fonts/handwriting styles, and common variations.
  • End-to-end evaluation: Run chandra and measure recall/precision on key fields (form fields, table columns, formulas) and categorize errors.
  • Error analysis: Identify frequent failure modes (character substitution, row/column misalignment, cursive misreads) and plan targeted fixes.
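
The evaluation step can be made concrete with two standard metrics: exact-match precision and recall over extracted fields, and character error rate (Levenshtein distance divided by reference length). The sketch below assumes fields have already been parsed into name-to-value dictionaries; that representation and the function names are choices made for this example.

```python
from typing import Dict, Tuple


def field_precision_recall(
    predicted: Dict[str, str], expected: Dict[str, str]
) -> Tuple[float, float]:
    """Exact-match scoring of extracted fields against a labeled reference."""
    correct = sum(1 for k, v in predicted.items() if expected.get(k) == v)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


def char_error_rate(hypothesis: str, reference: str) -> float:
    """Levenshtein distance between the strings, divided by reference length."""
    prev = list(range(len(reference) + 1))
    for i, h in enumerate(hypothesis, start=1):
        curr = [i]
        for j, r in enumerate(reference, start=1):
            curr.append(
                min(
                    prev[j] + 1,             # drop a hypothesis character
                    curr[j - 1] + 1,         # add a missing reference character
                    prev[j - 1] + (h != r),  # substitute, or match at no cost
                )
            )
        prev = curr
    return prev[-1] / max(len(reference), 1)
```

Field-level scores show whether the output is usable downstream, while character error rate helps separate systematic substitutions from structural misreads during error analysis.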

Possible Remediation Steps

  1. Post-processing rules: Use regex, domain dictionaries or lexicons to fix frequent mistakes.
  2. Fine-tuning / few-shot: If licensing permits, fine-tune on a small labeled set to improve target-language/handwriting performance.
  3. Secondary models: Use lightweight specialized models for hard fields (handwritten words or math symbols).
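
A lexicon-based post-processing rule might look like the sketch below, which snaps out-of-vocabulary words to their closest lexicon entry using difflib from the Python standard library. The function name and the 0.8 similarity cutoff are illustrative. Because it rewrites any sufficiently similar word, it is best limited to fields with a closed vocabulary (labels, units, product names) and not applied to free text.

```python
import difflib
import re
from typing import Iterable


def correct_with_lexicon(text: str, lexicon: Iterable[str], cutoff: float = 0.8) -> str:
    """Replace out-of-lexicon words with their closest lexicon entry, if any."""
    vocab = list(lexicon)
    known = set(vocab)

    def fix(match):
        word = match.group(0)
        if word in known:
            return word
        close = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
        # Leave the word untouched when nothing is similar enough.
        return close[0] if close else word

    # [^\W\d_]+ matches runs of letters (Unicode-aware), skipping digits.
    return re.sub(r"[^\W\d_]+", fix, text)
```

Raising the cutoff makes the rule more conservative; logging every substitution provides an audit trail for the error-analysis loop.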

Important Notice: Verify model licensing for large-scale commercial use (OpenRAIL-M style limits) and validate thoroughly on the target language.

Summary: A loop of small-sample benchmarking → error analysis → targeted rules/fine-tuning/secondary models reliably adapts Chandra to specific languages and handwriting styles.

86.0%

✨ Highlights

  • Structured outputs (HTML/MD/JSON) that preserve layout
  • Supports 90+ languages with strong handwriting recognition
  • Offers both local (HuggingFace) and server-based (vLLM, Dockerized) inference modes
  • Model weights use OpenRAIL-M; commercial use is restricted

🔧 Engineering

  • Uses vLLM/HuggingFace backends to produce layout-preserving HTML/Markdown/JSON, extracting images and metadata
  • Specifically optimized and benchmarked for tables, math, forms, and complex multi-column layouts

⚠️ Risks

  • OpenRAIL-M license for model weights imposes clear commercial limits; verify compliance before deployment
  • No release or commit statistics were reported for the repo, so community contribution is hard to gauge; assess long-term maintenance risk
  • Local backends demand significant GPU/memory resources for inference
  • Benchmarks are largely in-house; limited public third-party validation

👥 Who is it for?

  • Enterprises and research teams needing high-fidelity digitization of documents such as invoices, textbooks, forms, and handwritten notes
  • Engineers building document pipelines, and data/annotation teams that need batch processing or API integration