Dolphin: Document-image parsing via heterogeneous anchor prompting

Dolphin introduces a two-stage, VLM-based analyze-then-parse approach: it first analyzes the page to produce a sequence of elements in reading order, then uses heterogeneous anchor prompting to parse those elements in parallel. It targets research use and GPU-backed production deployments.

GitHub: bytedance/Dolphin | Updated: 2025-09-25 | Branch: main | Stars: 7.2K | Forks: 580

Document understanding, Vision-language models (VLM), Page- and element-level parsing, Efficient parallel inference, Hugging Face integration, TensorRT / vLLM acceleration

💡 Deep Analysis

What are the common failure sources when using Dolphin, and how do you locate and troubleshoot them?

Core Analysis

Core Concern: Common failures when running Dolphin stem from environment configuration, input quality, model prompt/parameter mismatches, and post-processing. A structured step-by-step troubleshooting flow helps locate issues quickly.

Technical Analysis (Common Failure Sources)

Environment & dependency errors: Incorrect CUDA, GPU driver, TensorRT-LLM, or vLLM installation/version mismatch can cause inference failures or poor performance.
Input quality issues: Low resolution, skew, or noisy scans break element detection and OCR.
Page-sequence or prompt mismatch: If page analysis produces a wrong element sequence, downstream element parsing will fail or return invalid structures.
Post-processing/mapping errors: JSON/Markdown output schema mismatches or coordinate system errors lead to downstream failures.

Troubleshooting Steps

Environment sanity check: Verify GPU, CUDA, drivers, TensorRT/vLLM match README requirements and run a simple inference test.
Run demos: Use demo_page.py and demo_element.py to reproduce official examples and confirm generation of page-level JSON and element-level outputs.
Inspect intermediate outputs: Check the generated page element sequence to see whether the failure occurs in page analysis or element parsing stage.
Input inspection: Compare original images and visualized intermediate outputs on representative samples; apply denoising/dewarping if needed.
Tune parameters & prompts: Adjust --max_batch_size, prompt templates, or anchor settings one at a time and check whether results improve.
Verify post-processing: Ensure output schema and coordinate systems (pixels vs normalized) match downstream expectations.
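
The "inspect intermediate outputs" step can be sketched as a small triage function. This is a minimal illustration, not Dolphin's own tooling: the element fields (`type`, `bbox`), the label set in `KNOWN_TYPES`, and the function name are all assumptions that should be adapted to the page-level JSON your installation actually produces.

```python
from typing import Any

# Hypothetical label set; replace with the element types in your page-level output.
KNOWN_TYPES = {"text", "table", "formula", "figure"}


def locate_failure_stage(page_elements: list[dict[str, Any]],
                         element_outputs: dict[int, str]) -> str:
    """Guess which stage failed: 'page_analysis', 'element_parsing', or 'ok'."""
    # Stage 1 problems: no elements, unknown types, or malformed boxes.
    if not page_elements:
        return "page_analysis"
    for el in page_elements:
        bbox = el.get("bbox")
        if el.get("type") not in KNOWN_TYPES or not bbox or len(bbox) != 4:
            return "page_analysis"
        x0, y0, x1, y1 = bbox
        if x1 <= x0 or y1 <= y0:
            return "page_analysis"
    # Stage 2 problems: an element was detected but its parsed content is empty.
    for idx in range(len(page_elements)):
        if not element_outputs.get(idx, "").strip():
            return "element_parsing"
    return "ok"
```

If the function reports `page_analysis`, tuning element prompts is unlikely to help; look at input quality and the page-level stage first.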

Important Notice: If environment or accelerator issues arise, prioritize stabilizing basic demos in the target environment before complex tuning.
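
Before running the demos, a quick presence check can rule out the most basic environment problems. The sketch below uses only the Python standard library; the module list is an assumption (check the README for the packages and versions your chosen backend really needs), and it only tests whether packages are importable, not whether their versions are compatible.

```python
import importlib.util
import shutil


def check_environment(modules=("torch", "transformers", "vllm", "tensorrt_llm")):
    """Report whether the GPU tool and each Python package can be found."""
    # nvidia-smi on PATH is a rough proxy for an installed NVIDIA driver.
    report = {"nvidia-smi": shutil.which("nvidia-smi") is not None}
    for name in modules:
        # find_spec locates a top-level package without importing it.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_environment().items():
        print(f"{item:15s} {'found' if ok else 'MISSING'}")
```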

Summary: Follow the “environment → demo → intermediate outputs → input → parameters → post-processing” path to efficiently locate most issues during Dolphin productionization.

When deploying Dolphin in production, how should one balance performance (latency/throughput) and resource requirements?

Core Analysis

Project Positioning: Dolphin supports parallel element decoding and multiple inference accelerators (TensorRT-LLM, vLLM), indicating that its design allows trade-offs between latency and throughput via engineering choices.

Technical Analysis

Parallel batch (--max_batch_size): Increasing batch size boosts throughput but consumes more GPU memory and can increase per-document tail latency; decreasing batch size reduces latency at the cost of GPU utilization.
Inference accelerators (vLLM/TensorRT-LLM): Can significantly reduce latency and increase throughput but require environment setup and compatibility tuning.
Input complexity matters: Multi-page, table-dense, or high-resolution inputs increase memory and processing time and may require sharding or preprocessing.
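
The batch-size trade-off is easiest to judge with measurements. The harness below is a minimal sketch: `fake_parse_batch` is a stand-in that only sleeps, and both function names are hypothetical. In practice you would pass a callable that wraps your real inference call and run it on representative documents.

```python
import statistics
import time
from typing import Callable, Sequence


def benchmark(parse_batch: Callable[[Sequence[str]], list],
              items: Sequence[str], batch_size: int) -> dict:
    """Measure per-batch latency and overall throughput for one batch size."""
    if not items or batch_size < 1:
        raise ValueError("need at least one item and batch_size >= 1")
    latencies = []
    start = time.perf_counter()
    for i in range(0, len(items), batch_size):
        t0 = time.perf_counter()
        parse_batch(items[i:i + batch_size])
        latencies.append(time.perf_counter() - t0)
    total = time.perf_counter() - start
    ordered = sorted(latencies)
    # Simple nearest-rank estimate of the 95th percentile.
    p95 = ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]
    return {
        "batch_size": batch_size,
        "batches": len(latencies),
        "throughput_items_per_s": len(items) / total if total > 0 else float("inf"),
        "median_batch_latency_s": statistics.median(latencies),
        "p95_batch_latency_s": p95,
    }


def fake_parse_batch(batch: Sequence[str]) -> list:
    # Stand-in for real inference: fixed overhead plus a per-item cost.
    time.sleep(0.001 + 0.0005 * len(batch))
    return [f"parsed:{x}" for x in batch]
```

Running `benchmark(fake_parse_batch, items, b)` for several values of `b` and comparing throughput against p95 latency gives a first view of where larger batches stop paying off. GPU memory has to be recorded separately.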

Practical Recommendations

Set strategy by scenario:
- Real-time/interactive: Use smaller max_batch_size (e.g., 1–4) and a high-performance inference stack (TensorRT-LLM) to prioritize latency.
- Batch/offline: Increase max_batch_size (e.g., 8–16 or more depending on memory) to maximize throughput and cost-efficiency.
Tune iteratively: Run benchmarks on representative hardware and documents, record memory, latency, throughput, and parsing quality, then adjust settings.
Engineering safeguards: Implement async queues, sharding, and memory monitoring to prevent a single oversized document from degrading service.
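
The safeguards above can be sketched with standard-library pieces: splitting elements into batches of at most `max_batch_size`, sharding long documents into page ranges, and using a bounded queue so that a burst of work is rejected instead of exhausting memory. All function names here are hypothetical helpers, not part of Dolphin.

```python
import queue
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], max_batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most max_batch_size items."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be >= 1")
    for i in range(0, len(items), max_batch_size):
        yield items[i:i + max_batch_size]


def shard_pages(num_pages: int, max_pages_per_shard: int) -> list[range]:
    """Split a document into page ranges so one huge PDF cannot monopolize a worker."""
    if max_pages_per_shard < 1:
        raise ValueError("max_pages_per_shard must be >= 1")
    return [range(s, min(s + max_pages_per_shard, num_pages))
            for s in range(0, num_pages, max_pages_per_shard)]


def try_enqueue(jobs: queue.Queue, job) -> bool:
    """Apply backpressure: refuse new work when the bounded queue is full."""
    try:
        jobs.put_nowait(job)
        return True
    except queue.Full:
        return False
```

A caller that receives `False` from `try_enqueue` can return a retry-later response or route the document to an offline batch path.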

Important Notice: Misconfigured accelerators or insufficient memory will yield much worse performance than expected. End-to-end load testing prior to production is essential.

Summary: Tune max_batch_size, enable appropriate accelerators, and use preprocessing/sharding strategies to achieve an acceptable latency-throughput trade-off for your deployment scenario.

As a developer, what are the key considerations when integrating Dolphin into an existing IDP/RPA pipeline?

Core Analysis

Core Concern: When embedding Dolphin into an IDP/RPA pipeline, the key considerations are interface compatibility, input preprocessing, inference environment configuration, and license/compliance. Together these determine parsing reliability and engineering cost.

Technical Analysis

Output format & interface: Dolphin outputs JSON/Markdown; verify that the output schema (coordinate system, reading-order fields, element type tags) matches what upstream and downstream systems expect, to avoid conversion overhead.
Input quality control: Low resolution, skew, or noisy scans degrade parsing accuracy; add denoising, cropping, and dewarping before inference.
Inference environment & dependencies: Production typically needs GPU and accelerators (vLLM/TensorRT-LLM); test compatibility and prepare fallbacks for environment mismatches.
Compliance & license: README lacks a license; legal review is required before commercial use.

Practical Recommendations

Adapter layer: Build a converter mapping Dolphin JSON to your existing data model (fields, coordinate transforms, order checks).
Preprocessing module: Integrate image enhancement/dewarping and quality checks; route low-quality docs to a fallback path.
Performance & fallback: Use small batches under resource constraints to control latency and implement CPU or rule-based fallbacks.
Legal review: Confirm licensing before integration; contact the authors or choose an alternative with clear licensing if needed.
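
An adapter layer can be as small as one function that validates and converts each element. The sketch below assumes a hypothetical page dictionary with `width`, `height`, and an `elements` list whose entries carry `type`, `bbox` (pixel coordinates), and `text`; these field names are illustrative and must be mapped to the schema Dolphin actually emits in your setup.

```python
from dataclasses import dataclass


@dataclass
class InternalElement:
    order: int                                      # reading-order index
    kind: str                                       # element type tag
    text: str                                       # parsed content
    bbox_norm: tuple[float, float, float, float]    # x0, y0, x1, y1 in [0, 1]


def adapt_page(page: dict) -> list[InternalElement]:
    """Convert one parsed page (assumed schema) into the internal data model."""
    width, height = page["width"], page["height"]
    adapted = []
    for order, el in enumerate(page["elements"]):
        x0, y0, x1, y1 = el["bbox"]
        # Order check and bounds check catch coordinate-system mix-ups early.
        if not (0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height):
            raise ValueError(
                f"element {order}: bbox {el['bbox']} does not fit page {width}x{height}")
        adapted.append(InternalElement(
            order=order,
            kind=el["type"],
            text=el.get("text", ""),
            bbox_norm=(x0 / width, y0 / height, x1 / width, y1 / height),
        ))
    return adapted
```

Failing loudly on an out-of-bounds box is a deliberate choice: a silent pixel-versus-normalized mismatch is much harder to diagnose once it reaches downstream systems.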

Important Notice: Using the project commercially without license confirmation poses legal risk; extensive testing on representative docs is mandatory before production.

Summary: Focus on data mapping, preprocessing, inference environment, and compliance. Designing adapters and fallbacks upfront reduces deployment risk.

How can you evaluate and improve Dolphin's parsing accuracy on domain-specific documents (e.g., legal/research)?

Core Analysis

Core Concern: Improving Dolphin’s accuracy for vertical domains like legal or research requires a representative evaluation framework and failure-driven improvement strategies (fine-tuning, prompt engineering, post-processing).

Technical Analysis (Evaluation & Improvement Steps)

Build labeled test set: Collect representative multi-page samples and annotate page-level element sequences and element-level semantics (table structure, formula LaTeX/semantics, paragraph boundaries).
Run baseline evaluation: Use the official demo/pretrained model to compute metrics (element detection accuracy, table-structure F1, reading-order error rate, OCR CER/WER).
Analyze error patterns: Identify frequent failure modes (merged/shifted table cells, row/column inversion, missing formulas, OCR omissions) to prioritize fixes.
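
Two of the metrics named above can be computed with a few lines of standard-library Python. CER and WER below use Levenshtein edit distance, which is the usual definition. The reading-order error is one possible definition (the fraction of element pairs whose relative order is inverted) and assumes the predicted sequence contains exactly the same element ids as the reference.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate relative to the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate on whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def reading_order_error(ref_ids: Sequence, pred_ids: Sequence) -> float:
    """Fraction of element pairs whose relative order is inverted in the prediction."""
    pos = {e: i for i, e in enumerate(pred_ids)}
    n = len(ref_ids)
    if n < 2:
        return 0.0
    inversions = sum(1 for i in range(n) for j in range(i + 1, n)
                     if pos[ref_ids[i]] > pos[ref_ids[j]])
    return inversions / (n * (n - 1) / 2)
```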

Improvement Paths

Fine-tuning (preferred): If model weights and training interfaces are accessible, perform few-shot fine-tuning on a Hugging Face-format model to cover domain-specific layouts/terminology.
Prompt engineering & anchor tuning: Create domain-specific prompts/anchors for frequent error types and iteratively refine them.
Post-processing & rule-based fixes: Implement geometric and syntactic validation for tables, formulas, and references to correct model outputs.
Input enhancement: Improve image resolution, denoising, and dewarping to boost OCR and downstream structure parsing.
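
As an example of syntactic validation for formulas, the sketch below checks that a LaTeX string has balanced braces and properly nested `\begin{...}`/`\end{...}` pairs. It is a cheap sanity filter under simple assumptions (it does not parse LaTeX fully), and the function name is hypothetical. Outputs that fail the check can be re-parsed or routed to review.

```python
import re


def latex_is_balanced(src: str) -> bool:
    """Return True if braces and begin/end environments are balanced."""
    depth = 0
    i = 0
    while i < len(src):
        ch = src[i]
        if ch == "\\":
            i += 2          # skip the escaped character, e.g. \{ or \\
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:   # closing brace with no matching opener
                return False
        i += 1
    if depth != 0:
        return False
    # Environments must close in the reverse order they were opened.
    stack = []
    for kind, name in re.findall(r"\\(begin|end)\{([^}]*)\}", src):
        if kind == "begin":
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack
```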

Important Notice: README lacks clear training pipeline and license information. For large-scale fine-tuning or commercial deployment, first verify licensing and availability of weights/training interfaces.

Summary: Use a representative evaluation set and failure-driven approach (fine-tuning where possible, otherwise prompt and post-processing) to improve domain-specific parsing. If training is not available, prioritize prompt engineering and post-processing and consider legal review.

For documents with complex tables and nested structures, what are Dolphin's table-parsing limitations and how can they be mitigated?

Core Analysis

Core Concern: Complex nested tables (multi-level headers, merged cells, hand-drawn borders, noisy scans) remain challenging for VLM-based parallel prompting approaches and may cause cell misalignment, inverted structure, or loss of hierarchy.

Technical Analysis

Generalization depends on training coverage: Heterogeneous prompts work well on typical tables, but complex layouts absent from the training data increase parsing errors.
Input quality greatly affects outcomes: Low resolution or noise undermines OCR and geometric reasoning used for table reconstruction.
Lack of a dedicated post-processing pipeline: README lacks detailed table post-processing steps; relying solely on model output is risky.

Practical Recommendations (Mitigations)

Implement post-processing rules: Use geometric relations (row/column projection, connected components, merged cell detection) to validate and correct model outputs.
Hybrid approach: For complex tables, combine classic table detection + OCR + structure reconstruction with Dolphin outputs in an ensemble to improve robustness.
Sample-driven fine-tuning: Collect failure cases and perform targeted fine-tuning or prompt template augmentation to adapt the model to specific complex layouts.
Input enhancement: Increase resolution, denoise, and dewarp inputs to improve OCR baseline quality.
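
A geometric check of the kind described above can be sketched with row/column projection: cluster the left and top edges of cell boxes to estimate the grid, then flag cells that cover more than one column or row start as likely merged cells. This assumes you already have cell bounding boxes in pixels (from Dolphin or a classic detector); the tolerance and function names are illustrative.

```python
Cell = tuple[float, float, float, float]  # x0, y0, x1, y1 in pixels


def cluster_1d(values: list[float], tol: float) -> list[float]:
    """Group sorted values whose gap to the previous value is <= tol; return cluster means."""
    clusters: list[list[float]] = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= tol:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return [sum(c) / len(c) for c in clusters]


def analyze_grid(cells: list[Cell], tol: float = 5.0) -> dict:
    """Estimate row/column counts and list cells that span several rows or columns."""
    col_starts = cluster_1d([c[0] for c in cells], tol)  # projected left edges
    row_starts = cluster_1d([c[1] for c in cells], tol)  # projected top edges
    merged = []
    for idx, (x0, y0, x1, y1) in enumerate(cells):
        # Count how many column/row starts fall inside this cell's extent.
        col_span = sum(1 for e in col_starts if x0 - tol <= e < x1 - tol)
        row_span = sum(1 for e in row_starts if y0 - tol <= e < y1 - tol)
        if col_span > 1 or row_span > 1:
            merged.append((idx, row_span, col_span))
    return {"rows": len(row_starts), "cols": len(col_starts), "merged": merged}
```

Comparing the estimated grid and the merged-cell list with the structure in the model's table output gives a simple consistency signal; disagreements are good candidates for human review.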

Important Notice: For tables that are critical data sources (financials, contracts), do not rely solely on unvalidated model outputs; introduce validation and human review.

Summary: Dolphin handles common tables well but for extreme nested/noisy tables, combine post-processing, hybrid parsing, and targeted fine-tuning to achieve reliable results.

✨ Highlights

Introduces a two-stage analyze-then-parse paradigm
Heterogeneous-anchor prompting enables parallel element parsing and speedup
Provides both page-level and element-level inference interfaces (with examples)
Repository license and contributor details are unclear; verify compliance before use
Model and acceleration dependencies require large-model inference resources; deployment cost can be high

🔧 Engineering

Two-stage flow: generate page element sequence in reading order, then parse elements in parallel
Heterogeneous anchors and task-specific prompts tailor parsing for tables, formulas and paragraphs
Compatible with Hugging Face model format and provides TensorRT / vLLM acceleration options
Supports multi-page PDF, demo scripts and pretrained model download flows for reproducibility
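
The two-stage flow can be illustrated with a toy pipeline. Everything in this sketch is a stand-in: `analyze_page` returns hard-coded elements instead of calling a model, the prompt strings are illustrative rather than the project's actual prompts, and a thread pool substitutes for the batched GPU decoding a real deployment would use.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative type-specific prompts; the project's real prompt strings may differ.
PROMPTS = {
    "table": "Parse the table in the image.",
    "formula": "Read the formula in the image.",
    "text": "Read the text in the image.",
}


def analyze_page(page_image) -> list[dict]:
    """Stage 1 stand-in: return layout elements in reading order."""
    return [
        {"type": "text", "bbox": (0, 0, 100, 20)},
        {"type": "table", "bbox": (0, 30, 100, 80)},
        {"type": "formula", "bbox": (0, 90, 100, 110)},
    ]


def parse_element(page_image, element: dict) -> str:
    """Stage 2 stand-in: a real version would crop the bbox and query the model."""
    prompt = PROMPTS.get(element["type"], PROMPTS["text"])
    return f"[{element['type']}] <- {prompt}"


def parse_page(page_image, max_workers: int = 4) -> list[str]:
    elements = analyze_page(page_image)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results stay in reading order
        # even though elements are processed concurrently.
        return list(pool.map(lambda el: parse_element(page_image, el), elements))
```

The point of the sketch is the shape of the flow: one sequential pass fixes the reading order, after which each element is independent and can be parsed concurrently with a prompt chosen by its type.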

⚠️ Risks

Repository license is unspecified, which may affect commercial use or redistribution
README and metadata inconsistencies (e.g., contributors and commits listed as 0) may complicate maintenance assessment
Depends on large models and external downloads (Baidu/Drive); availability and compliance should be evaluated
No formal releases; perform full validation and regression testing before production deployment

👥 Who is it for?

Researchers and engineers in document understanding and information extraction
ML/data engineering teams integrating document parsing into production pipelines
Users requiring performant, low-latency parallel parsing of multiple element types (tables/formulas/paragraphs)