QMD: Lightweight local hybrid search for documents with LLM re-ranking

QMD is a privacy-oriented, on-device document retrieval tool that combines BM25 full-text search, vector semantic search, and LLM reranking. It suits individuals or teams that need offline, controllable search with structured outputs for agent workflows.

GitHub: tobi/qmd · Updated: 2026-02-09 · Branch: main · Stars: 19.5K · Forks: 1.2K

Tags: Node.js / Bun · On-device semantic search · BM25 + vector + reranking · Privacy / personal KB

💡 Deep Analysis

What concrete retrieval problems does this project solve, and how does it achieve high-quality, explainable search results locally?

Core Analysis

Project Positioning: qmd targets users who need high-quality search over Markdown, meeting transcripts, and documents in a local/offline environment. It addresses the core need to balance speed, semantic recall, and explainable match evidence, while producing structured outputs for agents.

Technical Features

Hybrid retrieval pipeline: SQLite FTS5 BM25 for fast and explainable keyword matching; local GGUF embeddings for semantic search; a lightweight local LLM for candidate reranking (as noted in README).
Query expansion and fusion strategy: A fine-tuned query expansion module generates alternative queries to boost recall. Reciprocal Rank Fusion (RRF) plus position-aware blending with a top-rank bonus then keeps highly relevant exact matches near the top.
Agent-oriented outputs: CLI --json, --files and MCP endpoints (e.g., qmd_query, qmd_get) supply structured results for downstream agents.
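
The fusion step described above can be sketched in a few lines of Python. This is a minimal illustration of Reciprocal Rank Fusion with a top-rank bonus, not qmd's actual implementation: the function name rrf_fuse, the smoothing constant k=60 (a value commonly used for RRF), and the bonus of 0.05 are all assumptions chosen for the example.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, weights=None, k=60, top_bonus=0.05):
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion.

    ranked_lists: list of lists, each ordered best-first.
    weights: optional per-list multipliers (assumed; defaults to 1.0 each).
    k: RRF smoothing constant (assumed value, not taken from qmd).
    top_bonus: assumed bonus added once to any document ranked first
               in at least one list, so strong exact matches stay on top.
    """
    if weights is None:
        weights = [1.0] * len(ranked_lists)
    scores = defaultdict(float)
    top_ranked = set()
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(docs, start=1):
            # Each list contributes weight / (k + rank) for the document.
            scores[doc_id] += weight / (k + rank)
        if docs:
            top_ranked.add(docs[0])
    for doc_id in top_ranked:
        scores[doc_id] += top_bonus
    # Highest score first; ties are broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    bm25_hits = ["a.md", "b.md", "c.md"]    # hypothetical keyword results
    vector_hits = ["c.md", "d.md", "a.md"]  # hypothetical semantic results
    for doc_id, score in rrf_fuse([bm25_hits, vector_hits], weights=[2.0, 1.0]):
        print(f"{score:.4f}  {doc_id}")
```

Because RRF works on ranks rather than raw scores, BM25 and vector results can be combined without normalizing their score scales, which is one reason it is a common choice for hybrid search.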

Practical Recommendations

Split important documents into collections (notes, meetings, docs) and add collection contexts (qmd context add) to improve reranker signals.
Validate with BM25 first (qmd search) before running hybrid rerank (qmd query) to compare and tune behavior.
Prefer local GGUF + node-llama-cpp for sensitive data to avoid cloud exposure.

Note: Query expansion increases recall but can introduce noise; control output with fusion parameters and min-score thresholds.

Summary: By combining BM25 + vector search + LLM rerank locally, qmd offers a practical, privacy-preserving retrieval stack that is well suited for embedding into agent workflows without sending data to the cloud.

What are the main performance and resource considerations when deploying qmd locally, and how can you optimize initial indexing and embedding?

Core Analysis

Core Issue: The main bottlenecks in a local deployment are embedding generation and LLM reranking inference, plus disk I/O and indexing overhead when handling large numbers of small files.

Performance and Resource Considerations

Embedding (qmd embed): Time depends on GGUF model size, GPU availability, batch size and concurrency. CPU-only inference can be slow.
Indexing (FTS5): SQLite indexing of many small files generates IO; tuning transactions and batch commits helps.
Reranking: Run the LLM reranker only on top-K candidates to avoid expensive inference per query.

Optimization Recommendations (Practical)

Batch embeddings: Process documents in chunks (e.g., 500–2000 docs per batch) with checkpointing to resume.
Incremental updates: Re-embed only new/changed files using timestamps or hashes.
Limit reranker scope: Apply reranker to top-N (N=20–50) candidates from BM25/vector search.
Model & concurrency choices: Use smaller GGUF models on constrained machines and tune node-llama-cpp thread settings.
Monitoring & thresholds: Use --min-score together with the result-count and output options (-n, --all, --files) to control result volume and downstream work.
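
The batching and incremental-update advice above can be illustrated with a small Python sketch. It is independent of qmd's internals: the manifest file, the function names, and the embed_batch callback are hypothetical, and a real setup would call whatever embedding step you use inside that callback.

```python
import hashlib
import json
from pathlib import Path


def content_hash(path: Path) -> str:
    """Return a SHA-256 digest of the file contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_files(paths, manifest: dict) -> list:
    """Return the paths whose current hash differs from the stored one."""
    return [p for p in paths if manifest.get(str(p)) != content_hash(p)]


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_incrementally(paths, manifest_path: Path, embed_batch, batch_size=500):
    """Embed only new or changed files, checkpointing after every batch.

    embed_batch is a hypothetical callback that receives a list of paths.
    The manifest (path -> hash) is rewritten after each batch, so an
    interrupted run resumes with only the unfinished files.
    """
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    pending = changed_files(paths, manifest)
    for batch in batches(pending, batch_size):
        embed_batch(batch)
        for path in batch:
            manifest[str(path)] = content_hash(path)
        manifest_path.write_text(json.dumps(manifest))  # checkpoint
    return len(pending)
```

Hashing content rather than comparing timestamps avoids re-embedding files that were merely touched or copied, at the cost of reading every file once per run.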

Note: Avoid one-shot embedding of very large corpora; estimate storage and CPU requirements first.

Summary: With batch/incremental embedding, restricted reranking, and appropriate model selection and concurrency, qmd can be run efficiently on many local setups while keeping resource consumption manageable.

Why choose the stack SQLite FTS5 + local GGUF embeddings + LLM reranker? What are the advantages and trade-offs of this architecture?

Core Analysis

Core Question: The stack combines SQLite FTS5, local GGUF embeddings, and an LLM reranker to overcome the limits of any single retrieval method, balancing fast, explainable matching against high semantic recall.

Technical Analysis

BM25 (FTS5): Provides low latency, disk-indexed retrieval, and match location information for explainability.
Local GGUF vectors: Capture synonymy and fuzzy semantics to boost recall; running models locally protects data privacy.
LLM reranker: Reranks the candidate set from BM25 and vector search to improve final ordering, especially on complex contextual queries.
Fusion strategy: RRF + position-aware blending with a top-rank bonus helps raise recall without discarding high-quality exact matches, mitigating query expansion noise.

Advantages and Trade-offs

Advantages: Balances speed, recall, and interpretability; fully local; modular for debugging and integration.
Trade-offs: Requires storage and compute for models and embeddings, platform dependencies (node-llama-cpp, FTS5), and parameter tuning to balance precision/recall.

Practical Recommendations

On low-power devices use smaller GGUF models and only rerank top-K candidates.
Tune RRF and min-score incrementally while monitoring noise from query expansion.

Note: Ensure your system SQLite supports FTS5 (macOS may require Homebrew sqlite).
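
One way to check for FTS5 is to probe the SQLite build directly by trying to create an FTS5 virtual table. The sketch below does this with Python's standard sqlite3 module, then runs a tiny BM25 query with match snippets to show the kind of explainable evidence FTS5 provides. Note the assumption: Python links its own SQLite library, which may differ from the one your Node.js or Bun runtime uses, so this shows the probing technique rather than verifying qmd's environment. The table layout and sample rows are made up for the example.

```python
import sqlite3


def fts5_available() -> bool:
    """Return True if this SQLite build can create an FTS5 virtual table."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE VIRTUAL TABLE probe USING fts5(body)")
        return True
    except sqlite3.OperationalError:
        return False
    finally:
        conn.close()


def bm25_demo(query: str):
    """Run a BM25-ranked FTS5 query over two sample rows (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(path, body)")
    conn.executemany(
        "INSERT INTO docs(path, body) VALUES (?, ?)",
        [
            ("notes/a.md", "hybrid search with an LLM reranker"),
            ("notes/b.md", "meeting notes about budget"),
        ],
    )
    # bm25() returns lower (more negative) values for better matches,
    # so ascending order puts the best hit first. snippet() marks the
    # matched terms in column 1 (body) with square brackets.
    rows = conn.execute(
        "SELECT path, bm25(docs) AS score, "
        "snippet(docs, 1, '[', ']', '...', 8) AS evidence "
        "FROM docs WHERE docs MATCH ? ORDER BY score",
        (query,),
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    if fts5_available():
        for path, score, evidence in bm25_demo("reranker"):
            print(path, round(score, 3), evidence)
    else:
        print("FTS5 is not compiled into this SQLite build")
```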

Summary: The architecture is a pragmatic engineering compromise that yields privacy-preserving, high-quality retrieval locally, at the expense of deployment complexity and tuning effort.

How do you integrate qmd seamlessly into agent workflows? What are practical tips and common pitfalls with the MCP interface?

Core Analysis

Core Issue: qmd exposes an MCP interface for agents, but seamless integration requires engineering controls around query frequency, returned size, and post-processing.

Technical Analysis

Available endpoints: qmd_search (BM25 fast), qmd_vsearch (semantic), qmd_query (hybrid+rerank), qmd_get (by path/ID).
Output options: --json and --files enable agent parsing; --min-score and -n limit result volume.

Practical Tips

Two-stage pattern: Agent performs a fast pre-filter with qmd_search, then calls qmd_query for high-quality context when needed.
Limit return size: Set -n and --min-score in MCP calls and only pass essential fields (docid, snippet, score) to the main LLM to save context.
Caching & throttling: Cache high-frequency queries and reuse results to avoid costly reranker calls.
Structured output: Use --json to include metadata, paragraph boundaries, and match snippets so downstream agents can reference precisely.
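
The tips above (two-stage retrieval, trimmed fields, caching) can be combined in a small agent-side wrapper. The Python sketch below makes several assumptions: the two search functions are injected callables standing in for calls to qmd_search and qmd_query, each hit is a dict with docid, snippet, and score keys, and scores from both stages are comparable on a 0 to 1 scale. None of this reflects qmd's actual response schema, so adapt the field names and thresholds to what your setup returns.

```python
from typing import Callable, Dict, List

Hit = Dict[str, object]
SearchFn = Callable[[str], List[Hit]]

# Fields worth passing to the main LLM; everything else is dropped.
ESSENTIAL_FIELDS = ("docid", "snippet", "score")


def _score(hit: Hit) -> float:
    return float(hit.get("score", 0.0))


def trim_hits(hits: List[Hit], min_score: float = 0.3, limit: int = 5) -> List[Hit]:
    """Drop low-scoring hits, keep the best `limit`, and strip extra fields."""
    kept = sorted((h for h in hits if _score(h) >= min_score), key=_score, reverse=True)
    return [{field: h.get(field) for field in ESSENTIAL_FIELDS} for h in kept[:limit]]


class TwoStageRetriever:
    """Try the fast search first; fall back to the hybrid query when needed."""

    def __init__(self, fast_search: SearchFn, deep_query: SearchFn, good_enough: float = 0.8):
        self.fast_search = fast_search  # stands in for a qmd_search call
        self.deep_query = deep_query    # stands in for a qmd_query call
        self.good_enough = good_enough  # assumed threshold for skipping stage two
        self._cache: Dict[str, List[Hit]] = {}

    def retrieve(self, query: str) -> List[Hit]:
        key = " ".join(query.lower().split())  # normalize case and whitespace
        if key in self._cache:
            return self._cache[key]
        hits = trim_hits(self.fast_search(query))
        if not hits or _score(hits[0]) < self.good_enough:
            hits = trim_hits(self.deep_query(query))
        self._cache[key] = hits
        return hits
```

The in-memory cache here never expires; in a long-running agent you would likely bound its size or invalidate it when the index is updated.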

Common Pitfalls & Mitigations

Pitfall: Always calling reranker causes high latency.
Mitigation: Rerank only top-K or return layered responses (fast hits + refined hits).
Pitfall: Returning full documents wastes context tokens.
Mitigation: Use qmd_get --full only when necessary.

Note: Ensure the agent's MCP config points to qmd mcp and that the agent handles error codes and timeouts.

Summary: Two-stage retrieval, output constraints, caching, and structured JSON are key to reliably embedding qmd into agent workflows.

In day-to-day use, how should you design collections and tune query expansion and fusion parameters to balance precision and recall? What are the best practices?

Core Analysis

Core Issue: Improving recall without losing critical exact matches hinges on collection design, controlling query expansion strength, and tuning fusion (RRF + position-aware) parameters.

Technical Analysis

Collection design: Partition by topic/use-case (e.g., notes, meetings, docs) and add collection context (qmd context add) so expansions and reranker signals operate within a consistent corpus, reducing noise.
Query expansion control: Limit the number and strength of alternative queries; keep the original query highly weighted (the pipeline’s ×2 original query weighting is sensible).
Fusion tuning: Use RRF with position-aware blending and a top-rank bonus to prevent high-ranked BM25 hits from being replaced by low-quality expanded matches.
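
To make position-aware blending concrete, here is a minimal Python sketch. The idea is that documents near the top of the fused retrieval list rely mostly on their retrieval rank, while documents further down rely more on the reranker. The tier boundaries and weights (0.75, 0.60, 0.40) and the use of 1/rank as a retrieval score are illustrative assumptions, not values confirmed from qmd's source.

```python
def retrieval_weight(rank: int) -> float:
    """Share of the final score taken from retrieval rank (assumed tiers)."""
    if rank <= 3:
        return 0.75
    if rank <= 10:
        return 0.60
    return 0.40


def blend(fused_ids, rerank_scores):
    """Blend fused retrieval order with reranker scores.

    fused_ids: document ids ordered best-first by the fusion step.
    rerank_scores: dict of doc id -> reranker score in [0, 1];
                   documents the reranker did not score default to 0.0.
    """
    blended = []
    for rank, doc_id in enumerate(fused_ids, start=1):
        weight = retrieval_weight(rank)
        retrieval_score = 1.0 / rank  # simple rank-based score (assumed)
        rerank = rerank_scores.get(doc_id, 0.0)
        blended.append((doc_id, weight * retrieval_score + (1 - weight) * rerank))
    # Python's sort is stable, so ties keep their fused order.
    return sorted(blended, key=lambda item: -item[1])


if __name__ == "__main__":
    fused = ["exact.md", "noisy.md", "deep.md"]  # hypothetical fused order
    reranker = {"exact.md": 0.6, "noisy.md": 0.1, "deep.md": 0.9}
    for doc_id, score in blend(fused, reranker):
        print(f"{score:.3f}  {doc_id}")
```

In this example the top exact match stays first even though the reranker prefers another document, while the low-quality second hit is demoted below a document the reranker scores highly.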

Best Practices (stepwise)

Baseline then compare: Start with qmd search to establish BM25 baseline, then compare with qmd vsearch and qmd query.
Small representative test set: Run A/B tests on a set of representative queries and monitor precision@k and recall while tuning expansion counts and RRF weights.
Keep original-query weight: Maintain the original query’s higher weight instead of relying entirely on expansions.
Filter with min-score: Use a conservative threshold to drop clearly irrelevant results.
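
For the A/B comparison step, precision@k and recall@k are straightforward to compute once you have relevance judgments for a handful of representative queries. The Python sketch below assumes you have collected, per query, the ordered list of returned document ids and a set of ids you consider relevant; the function names and data layout are hypothetical.

```python
def precision_at_k(results, relevant, k):
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc_id in results[:k] if doc_id in relevant) / k


def recall_at_k(results, relevant, k):
    """Fraction of all relevant documents found in the top-k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    return sum(1 for doc_id in results[:k] if doc_id in relevant) / len(relevant)


def evaluate(runs, judgments, k=5):
    """Average precision@k and recall@k over all judged queries.

    runs: dict of query -> ordered list of returned doc ids.
    judgments: dict of query -> set of relevant doc ids.
    """
    queries = [q for q in judgments if q in runs]
    if not queries:
        return {"precision": 0.0, "recall": 0.0}
    precision = sum(precision_at_k(runs[q], judgments[q], k) for q in queries) / len(queries)
    recall = sum(recall_at_k(runs[q], judgments[q], k) for q in queries) / len(queries)
    return {"precision": precision, "recall": recall}
```

Running evaluate once per configuration (for example, BM25 only versus the hybrid query) on the same judgments gives a like-for-like comparison while you adjust expansion counts and fusion weights.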

Note: If expansions consistently introduce noise, reduce expansion count or weight rather than disabling expansion entirely—expansions remain valuable for complex queries.

Summary: Proper collection partitioning + controlled query expansion + careful RRF/position-aware tuning is the effective path to increase recall while preserving key exact matches.

What are qmd's suitable use cases and limitations? When should you choose alternatives (like cloud search services or pure vector retrieval)?

Core Analysis

Core Issue: Identify qmd’s best-fit use cases and its boundaries so you can choose the right tool for your architecture needs.

Suitable Scenarios

Privacy-sensitive individuals/small teams: Need to keep notes, meeting transcripts, and docs on-device.
Agent integration: Embedding retrieval as a tool in desktop agents or automation workflows via MCP.
Medium-size text corpora: Collections dominated by Markdown/plain text where local control is valued over low latency at scale.

Limitations & Non-ideal Scenarios

Resource-constrained machines or extremely large corpora: Embedding and reranking costs can become prohibitive for very large datasets.
High concurrency / enterprise needs: Lacks built-in multi-tenancy, sync, audit, and access-control features for enterprise governance.
Complex binary/multimodal documents: README focuses on Markdown/text; no built-in OCR/complex table/image parsing.

When to choose alternatives

Need horizontal scaling and low-latency, high-concurrency serving: Use a cloud-hosted search or distributed retrieval service (a cloud vector DB or Elasticsearch with vector support).
Need advanced document parsing / multimodal: Use pipelines/services with OCR and table/image extraction.
Prefer lower ops burden: Cloud services reduce model and infra management and provide SLAs.

Note: qmd’s strength is local hybrid retrieval and composability—evaluate based on data scale, privacy needs, and ops capability.

Summary: Choose qmd if local privacy, agent integration, and text-first high-quality retrieval are priorities. For very large scale, enterprise governance, or complex multimodal needs, prefer cloud or specialized retrieval systems.

✨ Highlights

On-device hybrid retrieval: BM25, vector search and LLM reranking combined
Agent-friendly outputs with structured JSON and file lists
License unknown — distribution and commercial constraints unclear
Low maintenance activity: zero contributors, no releases, no recent commits

🔧 Engineering

Local hybrid retrieval pipeline supporting query expansion and RRF fusion
Combines FTS5 (BM25), vector search and qwen3 reranker to balance speed and quality
Provides CLI and MCP server for integration with agents and tools like Claude

⚠️ Risks

Repository shows zero contributors — long-term maintenance and security fixes at risk
High dependence on local models and hardware — users must manage GGUF models and resources
No clear license: legal and compliance review required before organizational deployment

👥 Who is it for?

Privacy-conscious individuals/teams: engineers and PMs needing local KB search
Agent developers and advanced use cases: agent workflows needing structured outputs and MCP
Developers experienced with local deployments: users familiar with Node/Bun and running local LLMs