arXiv Paper Curator: Learner-focused, production-grade RAG hands-on course

Hands-on course repo for engineers and researchers offering end-to-end templates and weekly lessons to build production-grade RAG systems from ingestion to hybrid retrieval and deployment.

GitHub: jamwithai/arxiv-paper-curator · Updated 2025-11-09 · Branch main · Stars 1.3K · Forks 424

RAG (Retrieval-Augmented Generation) · Academic paper ingestion · Production infrastructure · Educational engineering template

💡 Deep Analysis

What role do intelligent chunking and hybrid retrieval play in improving QA quality? What are the trade-offs and configuration considerations?

Core Analysis

Core Issue: For long papers, extracting the right context for the LLM is crucial. Intelligent chunking and hybrid retrieval address this by producing context units that are both relevant and small enough to fit the model’s context window.

Technical Analysis

Value of chunking: Splitting a paper into semantically coherent, appropriately sized chunks ensures candidate contexts contain enough information without exceeding the LLM’s context window, reducing noise.
Common chunking strategies: Section/heading-based, sentence sliding windows (with overlap), or semantic paragraph boundaries—each trades off coherence versus chunk size.
Hybrid retrieval patterns:
BM25 recall -> vector rerank: Fast BM25 filters large collections; vector reranking adds semantic relevance.
Parallel recall + fusion: Keywords and vectors recall candidates in parallel and merge by weighted scoring.
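The sentence sliding window with overlap mentioned above can be sketched in a few lines. This is a minimal illustration, not the repository's chunker: the function name `chunk_words` and its defaults are hypothetical, and whitespace-separated words stand in for model tokens, which a real pipeline would count with the model's tokenizer.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into word windows of `chunk_size` that share `overlap` words.

    Assumption: words approximate tokens; swap in a real tokenizer if needed.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(str(i) for i in range(10))
    # Each chunk repeats the last word of the previous one (overlap=1).
    print(chunk_words(sample, chunk_size=4, overlap=1))
```

A larger overlap preserves more continuity across chunk boundaries at the cost of indexing more duplicated text.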

Config considerations & trade-offs

Chunk size & overlap: Set chunk size by token/character limits and use modest overlap to maintain continuity. Too-large chunks exceed model windows; too-small lose context.
Candidate set size: BM25 initial recall size affects cost/quality—50–200 candidates before vector rerank is a practical compromise.
Fusion weight tuning: Tune BM25 vs. vector weights on a validation query set (e.g., grid search).
Performance cost: Vector storage and similarity search add latency/cost—use ANN libraries to reduce query time.
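As one way to picture the fusion weight tuning described above, the sketch below min-max normalizes both score sets, combines them with a single weight `alpha`, and grid-searches `alpha` against mean reciprocal rank on a labeled query set. All names (`fuse`, `grid_search_alpha`) and the choice of MRR as the metric are assumptions for illustration, not the project's implementation.

```python
from typing import Dict, List, Sequence, Set, Tuple

Scores = Dict[str, float]


def min_max(scores: Scores) -> Scores:
    """Rescale scores to [0, 1] so BM25 and vector scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(bm25: Scores, vec: Scores, alpha: float) -> List[str]:
    """Rank docs by alpha * BM25 + (1 - alpha) * vector; a missing score counts as 0."""
    b, v = min_max(bm25), min_max(vec)
    combined = {d: alpha * b.get(d, 0.0) + (1 - alpha) * v.get(d, 0.0)
                for d in set(b) | set(v)}
    # Sort by descending score, then by id so ties are deterministic.
    return sorted(combined, key=lambda d: (-combined[d], d))


def reciprocal_rank(ranking: Sequence[str], relevant: Set[str]) -> float:
    """1 / position of the first relevant doc, or 0 if none was retrieved."""
    for pos, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / pos
    return 0.0


def grid_search_alpha(queries: Sequence[Tuple[Scores, Scores, Set[str]]],
                      alphas: Sequence[float]) -> Tuple[float, float]:
    """Return (best_alpha, best_mrr) over a labeled validation query set."""
    best_alpha, best_mrr = alphas[0], -1.0
    for alpha in alphas:
        mrr = sum(reciprocal_rank(fuse(b, v, alpha), rel)
                  for b, v, rel in queries) / len(queries)
        if mrr > best_mrr:
            best_alpha, best_mrr = alpha, mrr
    return best_alpha, best_mrr
```

The validation set here is a list of `(bm25_scores, vector_scores, relevant_ids)` triples; in practice these would come from logged candidate sets and human relevance labels.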

Important Notice: For academic papers, using structural cues (sections, abstracts, captions) to guide chunking tends to improve both retrieval and generation quality.

Summary: Intelligent chunking secures high-quality contexts, and hybrid retrieval balances precision and recall. Both require dataset-aware tuning to maximize QA performance.

89.0%

Why adopt a 'BM25-first, then vector/hybrid retrieval' architecture? What are the technical advantages of this choice?

Core Analysis

Core Issue: Relying solely on vector search can cause hard-to-explain false positives, higher resource use, and harder debugging. The project advocates establishing a keyword search foundation (BM25) first, then adding vector search for a hybrid approach.

Technical Analysis

Better explainability: BM25 scoring (a TF-IDF-style function of term frequency, inverse document frequency, and document length) and filter conditions are easier to diagnose and tune, making it clear why a document was retrieved.
Performance and cost: Keyword search typically has lower latency and lighter index requirements than vector search, making it suitable for quickly filtering large corpora.
Robustness and fallback: BM25 serves as a stable fallback when vector indices are unavailable or of poor quality.
Hybrid retrieval benefits: After BM25 filtering, vector search can be used to semantically rerank candidates or increase recall, balancing precision and recall.
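The "BM25 filter, then semantic rerank" flow described above might look like the following. It is a sketch under stated assumptions: BM25 hits arrive as `(doc_id, score)` pairs, embeddings are plain lists of floats held in a dict, and `recall_then_rerank`, `recall_k`, and `top_k` are hypothetical names rather than anything taken from the repository.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_then_rerank(bm25_hits: List[Tuple[str, float]],
                       query_vec: Sequence[float],
                       doc_vecs: Dict[str, Sequence[float]],
                       recall_k: int = 100,
                       top_k: int = 5) -> List[str]:
    """Keep the top `recall_k` BM25 hits, then reorder them by cosine similarity."""
    # Stage 1: cheap keyword recall narrows the corpus to a small candidate set.
    candidates = sorted(bm25_hits, key=lambda hit: -hit[1])[:recall_k]
    # Stage 2: semantic similarity decides the final order of those candidates.
    reranked = sorted(candidates,
                      key=lambda hit: -cosine(query_vec, doc_vecs[hit[0]]))
    return [doc_id for doc_id, _ in reranked[:top_k]]
```

Note the trade-off this makes visible: a document that BM25 ranks below `recall_k` never reaches the reranker, however close its embedding is to the query.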

Practical Recommendations

Tune BM25 first: Adjust tokenizers, stopwords, field weights, and BM25 params (k1, b) and run A/B tests on a query set.
Design hybrid strategy: Common approaches are “BM25 recall -> vector rerank” or parallel recall with score fusion; set fusion weights based on query samples.
Monitor and fallback: Use Langfuse or logging to capture candidate sets and scores; fallback to pure BM25 if the vector layer misbehaves.
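To make the role of `k1` and `b` concrete, here is a small sketch of the classic Okapi BM25 formula that reports each query term's contribution, which is what makes BM25 results easy to diagnose. It uses the non-negative IDF variant `ln(1 + (N - df + 0.5) / (df + 0.5))`; the exact constants in a given OpenSearch version may differ, and `bm25_explain` is a hypothetical helper, not an OpenSearch API.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_explain(query: List[str], doc: List[str], corpus: List[List[str]],
                 k1: float = 1.2, b: float = 0.75) -> Dict[str, float]:
    """Return each query term's BM25 contribution to the score of `doc`.

    k1 controls term-frequency saturation; b controls length normalization
    (b=0 ignores document length, b=1 normalizes fully).
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc)
    length_norm = 1 - b + b * len(doc) / avg_len
    contributions: Dict[str, float] = {}
    for term in query:
        freq = tf[term]
        if freq == 0:
            contributions[term] = 0.0  # term absent from the document
            continue
        df = sum(1 for d in corpus if term in d)  # documents containing the term
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        contributions[term] = idf * freq * (k1 + 1) / (freq + k1 * length_norm)
    return contributions


if __name__ == "__main__":
    corpus = [["hybrid", "retrieval", "bm25"],
              ["vector", "search"],
              ["bm25", "bm25", "tuning", "notes"]]
    parts = bm25_explain(["bm25", "vector"], corpus[2], corpus)
    print(parts, "total:", sum(parts.values()))
```

Summing the per-term values gives the document's total score, so a surprising ranking can be traced back to the specific term that caused it.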

Important Notice: Hybrid retrieval demands extra fusion logic and hyperparameter tuning. In resource-constrained environments, optimizing BM25 first usually yields the best return on effort.

Summary: BM25-first gives explainability, lower cost, and robustness—an engineering-safe starting point—after which vectors can augment semantic recall.

88.0%

What use cases is this project suitable for? In which situations is it not recommended to directly migrate this architecture to production?

Core Analysis

Core Issue: Whether the project fits a production scenario depends on traffic, data scale, availability, and compliance needs.

Suitable Use Cases

R&D prototypes & teaching: Excellent for learning the end-to-end RAG pipeline and building internal research assistant prototypes.
Small-to-medium academic search: Good for research groups or individual researchers needing arXiv retrieval and QA.
Hybrid retrieval experimentation: Useful for validating chunking, hybrid retrieval, and reranking strategies.

Situations NOT recommended for direct migration

High concurrency / massive scale: The default single-node Docker Compose deployment of OpenSearch and the database cannot meet throughput, scaling, or reliability needs.
Strict SLAs or multi-tenant enterprise use: Lacks enterprise-grade security, audit, backup, and failure isolation.
Need for highly available distributed LLM serving: Local Ollama or single-instance models cannot handle high throughput or low-latency SLAs.

Required changes to move to production

Cluster OpenSearch with shard/replica strategy
Use managed/distributed Postgres with backups and failover
Separate model serving into scalable inference layer (K8s, inference platform, or cloud APIs)
Add authentication, auditing, quotas, CI/CD and robust monitoring/alerts
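For the shard/replica item above, a small sketch of the arithmetic may help when sizing a cluster. `number_of_shards` and `number_of_replicas` are the standard OpenSearch index settings; the helper names and the example values (3 primaries, 1 replica) are illustrative assumptions, not recommendations from the repository.

```python
from typing import Any, Dict


def index_settings(primary_shards: int, replicas: int) -> Dict[str, Any]:
    """Build an index-creation body with explicit shard and replica counts."""
    if primary_shards < 1 or replicas < 0:
        raise ValueError("need at least 1 primary shard and 0 or more replicas")
    return {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,  # fixed at index creation
                "number_of_replicas": replicas,      # can be changed later
            }
        }
    }


def total_shard_copies(body: Dict[str, Any]) -> int:
    """Primaries plus all replica copies the cluster must host."""
    idx = body["settings"]["index"]
    return idx["number_of_shards"] * (1 + idx["number_of_replicas"])


def min_data_nodes(body: Dict[str, Any]) -> int:
    """A replica is never placed on the same node as its primary,
    so every copy can be allocated only with replicas + 1 data nodes."""
    return body["settings"]["index"]["number_of_replicas"] + 1


if __name__ == "__main__":
    body = index_settings(primary_shards=3, replicas=1)
    print(total_shard_copies(body), "shard copies on at least",
          min_data_nodes(body), "data nodes")
```

On a single-node deployment like the default one, any replica count above zero leaves replicas unassigned, which is one reason the single-node setup cannot provide fault tolerance.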

Important Notice: The repo is a production-oriented learning template, but default deployment targets experiments/prototypes. Do capacity testing and harden the architecture before enterprise use.

Summary: Ideal for learning, prototyping, and small-to-medium academic uses; for enterprise-scale production, plan comprehensive scaling and hardening.

88.0%

What common issues occur during data ingestion and PDF parsing in practice? How does this project mitigate them?

Core Analysis

Core Issue: The main challenges when converting arXiv PDFs to searchable content are network/API rate limits, PDF structural variability causing parser failures, and accumulated errors during bulk ingestion.

Technical Analysis

Network & rate limits: Bulk fetching can trigger API throttling or transient network failures leading to missed downloads.
PDF complexity: Papers contain formulas, tables, images, and irregular layouts that parsers (e.g., Docling) may mis-handle, causing dropped or mis-segmented content.
Bulk ingestion stability: Without checkpointing and failure logs, reruns are costly and hard to debug.

Mitigations implemented or recommended by the project:

Rate limiting & retry: The fetcher uses backoff and retry logic to avoid overwhelming APIs.
Airflow orchestration: DAGs schedule segmented ingestion with task retries and alerts, aiding visual debugging.
Failure sample logging: Parsing failures record metadata and original PDFs for offline analysis.
Small-sample validation: Validate parser settings on small batches before scaling up.
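The backoff-and-retry mitigation listed above could be sketched as follows. This is a generic pattern, not the project's fetcher: `retry_with_backoff`, its defaults, and the 10% jitter are assumptions, and production code would normally catch only the specific transient exceptions of its HTTP client rather than `Exception`.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(fetch: Callable[[], T],
                       max_attempts: int = 5,
                       base_delay: float = 1.0,
                       max_delay: float = 30.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fetch`, retrying failures with exponentially growing delays.

    `sleep` is injectable so the wait can be observed or skipped in tests.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:  # assumption: narrow this to transient errors in real code
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")  # loop always returns or raises
```

Wrapping each download in such a helper pairs naturally with task-level retries in the orchestrator: the helper absorbs short blips, while the scheduler handles longer outages.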

Practical Recommendations

Enable detailed logging & monitoring: Track parse failure counts and retry attempts via Langfuse or custom alerts.
Persist original PDFs: Store originals in Postgres or object storage to enable re-parsing.
Create a failure remediation workflow: Periodically inspect failed samples and classify failure modes (e.g., table reconstruction issues) to improve parsing or add targeted parsers.

Important Notice: Automated parsing cannot cover all edge cases. Combine automation with manual QA for high-value ingestion.

Summary: The project includes robust engineering controls—rate limiting, DAG-based ingestion, and failure logging—that significantly reduce ingestion and parsing risks, but operational monitoring and human intervention remain necessary for outliers.

87.0%

How can this project be upgraded from a teaching/single-node prototype to an enterprise-grade RAG service? What are the key steps and priority modifications?

Core Analysis

Core Issue: To move from a teaching prototype to an enterprise-grade service, you must address scalability, availability, security, and operational automation.

Key Migration Priorities (ranked)

Cluster the retrieval layer: Upgrade single-node OpenSearch to a cluster with proper shards/replicas, index lifecycle policies (Index State Management, ISM, in OpenSearch), and hot/cold tiers to support throughput and fault tolerance.
Database HA: Move Postgres to managed HA (or set up primary/replica), use partitioning and archival for large metadata volumes.
Decouple & scale model serving: Replace local LLM with a scalable inference layer (K8s with HPA or cloud inference), support GPU pools and autoscaling.
Auth & security: Implement an API gateway, auth (OAuth/mTLS), secret management, and audit logging for compliance.
CI/CD & IaC: Use Terraform/Helm for infrastructure, with GitOps (Argo CD) or CI pipelines (GitHub Actions) for repeatable deployments and rollbacks.
Monitoring & SLA alerts: Expand Langfuse tracing, add Prometheus/Grafana metrics, log aggregation, and alerting rules.
Capacity testing & fallback: Run stress tests and define degradation strategies (e.g., BM25-only) to maintain core availability.
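The BM25-only degradation strategy mentioned in the last item can be expressed as a thin wrapper around the two search paths. The sketch below assumes both are plain callables that take a query string and return a list of document ids; `search_with_fallback` and the returned mode label are hypothetical names for illustration.

```python
import logging
from typing import Callable, List, Tuple

logger = logging.getLogger("retrieval")

SearchFn = Callable[[str], List[str]]


def search_with_fallback(query: str,
                         hybrid_search: SearchFn,
                         bm25_search: SearchFn) -> Tuple[List[str], str]:
    """Try hybrid retrieval; on any failure, degrade to keyword-only search.

    Returns the results plus the mode that produced them, so the degraded
    path can be counted in metrics and traced.
    """
    try:
        return hybrid_search(query), "hybrid"
    except Exception as exc:  # assumption: narrow to vector-layer errors in real code
        logger.warning("hybrid search failed (%s); degrading to BM25-only", exc)
        return bm25_search(query), "bm25"
```

Returning the mode alongside the results makes it straightforward to alert when the share of `"bm25"` responses rises, which is usually the first visible sign that the vector layer is unhealthy.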

Practical Migration Steps

Incrementally enable OpenSearch clustering and managed Postgres in a staging environment.
Move model inference to an independent scalable service and validate performance.
Add authentication/audit and perform security reviews.
Automate deployments and run capacity/failure injection tests.
Gradually shift traffic with a rollback plan.

Important Notice: Don’t switch everything at once. Use incremental migration and canary releases with rollback paths.

Summary: Enterprise readiness requires systematic architectural hardening—prioritize retrieval and inference scalability, then strengthen security, monitoring, and CI/CD. Incremental validation and rollback capability are essential.

86.0%

✨ Highlights

Learner-focused, production-grade RAG course covering end-to-end engineering
Provides a full infra stack: Docker, FastAPI, Postgres, OpenSearch, Airflow, etc.
Week-by-week learning path covering ingestion, BM25/hybrid retrieval and monitoring
The README is detailed, but the repository shows little visible code activity and no listed contributors; verify the repo status before relying on it
Missing license and release records; poses legal and redistribution risks for adopters

🔧 Engineering

End-to-end RAG teaching repo: from arXiv ingestion to production-grade retrieval and RAG pipelines
Practical focus: BM25 keyword search, intelligent chunking, hybrid vector retrieval, and local LLM integration
Example services: FastAPI API, Gradio chat UI, Langfuse tracing and Redis caching examples

⚠️ Risks

Docs and demos are rich, but the repository shows zero contributors/commits; it may contain only teaching assets or require syncing branches
No license, no releases or versioning—hampers direct enterprise deployment and safe reuse
Depends on third-party/commercial services (e.g., JINA API, Ollama, Langfuse), potentially adding cost and integration overhead

👥 Who is it for?

AI engineers and researchers who need to build or understand production-grade RAG pipelines
Learners and course participants aiming to master ingestion, retrieval, and RAG deployment via hands-on work
Prototype teams wanting to quickly build research assistants or domain-specific retrieval systems