turbovec: High-compression local vector index based on TurboQuant

turbovec is a TurboQuant-based, training-free vector index that delivers high compression and low-latency local search. It targets RAG and embedding-retrieval workloads that are sensitive to privacy, memory, or latency.

GitHub RyanCodrai/turbovec Updated 2026-06-08 Branch main Stars 13.3K Forks 1.2K

Rust Python Vector Search Quantization SIMD Optimized Local Deployment RAG

💡 Deep Analysis

In practice, how do you use the kernel‑level allowlist/slot bitmask for efficient filtering? What are the caveats?

Core Analysis

Key Question: How can the kernel‑level allowlist/slot bitmask be used to filter efficiently at search time and reduce compute?

Technical Analysis

Mechanics: Filtering is performed by short‑circuiting 32‑vector SIMD blocks that contain no allowed slots; such blocks are skipped before LUT lookups or scoring.
Best Case: Extremely sparse candidate sets produced by upstream filters (SQL/BM25/ACL/time ranges) — block short‑circuiting avoids most SIMD cost and prevents over‑fetch.
When It Helps Less: If the allowlist fraction is large or allowed ids are uniformly spread across blocks, short‑circuit hits are rare and benefits diminish.
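
Sketch: the block short-circuiting described above can be modelled in a few lines of pure Python. This is an illustration under stated assumptions, not turbovec's kernel: `BLOCK = 32` follows the 32-vector block size mentioned above, while `block_masks` and `blocks_to_score` are hypothetical helper names. It compares the same number of allowed ids when they are clustered versus spread evenly.

```python
BLOCK = 32  # vectors per SIMD block, as described in the text above


def block_masks(allowed_slots, n_slots):
    """Build one 32-bit mask per block; bit i is set if slot block*32+i is allowed."""
    n_blocks = (n_slots + BLOCK - 1) // BLOCK
    masks = [0] * n_blocks
    for slot in allowed_slots:
        if not 0 <= slot < n_slots:
            raise ValueError(f"slot {slot} out of range")
        masks[slot // BLOCK] |= 1 << (slot % BLOCK)
    return masks


def blocks_to_score(masks):
    """Indices of blocks with at least one allowed slot; every other block is skipped."""
    return [b for b, mask in enumerate(masks) if mask != 0]


n_slots = 32 * 1000                    # 1000 blocks in total
clustered = range(0, 320)              # 320 allowed ids packed into adjacent slots
spread = range(0, n_slots, 100)        # 320 allowed ids, one every 100 slots

clustered_blocks = len(blocks_to_score(block_masks(clustered, n_slots)))
spread_blocks = len(blocks_to_score(block_masks(spread, n_slots)))
print(clustered_blocks, spread_blocks)  # 10 320
```

With identical allowlist sizes, the clustered case touches 10 of 1000 blocks and the spread case touches 320, which is why id locality matters as much as allowlist size.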

Practical Recommendations

Do candidate narrowing first: Use cheap upstream filters to obtain a sparse set of candidate ids, then call search(allowlist=...).
Consider block granularity: Short‑circuiting is most effective when allowed ids cluster into blocks; if not, batch reranks or candidate grouping help.
Check output semantics: search returns min(k, len(allowed)) results—no fallback to disallowed items.
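
Sketch: the output semantics noted above can be demonstrated with a brute-force stand-in. `filtered_top_k` is a hypothetical function written for this example and does not call turbovec; it only shows that a filtered search over a small allowlist returns fewer than k results, which callers need to handle.

```python
import heapq


def filtered_top_k(query, vectors, allowed, k):
    """Brute-force stand-in for a filtered search: score only the allowed ids."""
    scored = (
        (sum(q * v for q, v in zip(query, vectors[i])), i)
        for i in sorted(set(allowed))
    )
    return [i for _, i in heapq.nlargest(k, scored)]


vectors = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [0.7, 0.7]}

# Ids 0 and 1 score highest for this query but are not allowed,
# so only the two allowed ids come back even though k is 10.
hits = filtered_top_k([1.0, 0.0], vectors, allowed=[2, 3], k=10)
print(hits)  # [3, 2]
```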

Note: Kernel‑level filtering is a performance optimization that depends on workload sparsity; design your pipeline accordingly.

Summary: For hybrid retrieval, combine upstream selective candidate generation with turbovec’s kernel‑level filtering to maximize performance and maintain recall.

How to choose `bit_width` (2‑bit vs 4‑bit) in production? What factors affect recall and latency?

Core Analysis

Key Question: How to choose between 2‑bit and 4‑bit in production to meet recall, memory, and latency goals?

Technical Analysis

Bits vs resolution: bit_width controls discretization granularity—lower bits yield higher compression but greater information loss.
Key factors:
Recall requirements: Lower bit widths risk losing accuracy—avoid if near‑perfect recall is required.
Vector dimensionality: High dimensions (e.g., 1536) tolerate low‑bit quantization better than low dimensions (e.g., 200).
Hardware: Presence of AVX‑512/NEON affects throughput/latency of low‑bit kernels.
Retrieval pipeline: If you have an upstream candidate generator or strong reranker, you can accept coarser initial quantization.
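
Sketch: the memory side of the trade-off is simple arithmetic. The helper below (`code_bytes`, a hypothetical name) counts only the packed codes and assumes no per-vector overhead such as norms, ids, or file headers, so real footprints will be somewhat larger.

```python
def code_bytes(n_vectors, dim, bit_width):
    """Bytes needed for the packed codes alone (no norms, ids, or headers)."""
    total_bits = n_vectors * dim * bit_width
    return (total_bits + 7) // 8  # round up to whole bytes


for bits in (2, 4, 32):  # 32 bits per coordinate is the float32 baseline
    gb = code_bytes(10_000_000, 1536, bits) / 1e9
    print(f"{bits}-bit: {gb:.2f} GB")
# 2-bit: 3.84 GB
# 4-bit: 7.68 GB
# 32-bit: 61.44 GB
```

Under these assumptions, 10M vectors at 1536 dimensions stay under 4 GB only at 2-bit; at 4-bit the codes alone need about twice that.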

Practical Recommendations

Benchmark: Measure recall@k, query latency, and memory on your target hardware and representative data.
Tiered approach: Use 2‑bit for coarse filtering in very large indexes, and 4‑bit for default production reranking.
Monitor & rollback: Track recall/user metrics post‑deployment and switch bit widths or rebuild if quality drops.
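
Sketch: recall@k, the metric recommended above, can be computed by comparing the quantized index's results with exact (brute-force) results on the same queries. `recall_at_k` is a hypothetical helper; both inputs are lists with one ranked id list per query.

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Mean fraction of the exact top-k ids that the approximate top-k recovers."""
    if not exact_ids or len(exact_ids) != len(approx_ids):
        raise ValueError("need one non-empty result list per query in both inputs")
    total = 0.0
    for exact, approx in zip(exact_ids, approx_ids):
        truth = set(exact[:k])
        if truth:
            total += len(truth & set(approx[:k])) / len(truth)
    return total / len(exact_ids)


exact = [[1, 2, 3], [4, 5, 6]]    # ground truth from an exact search
approx = [[1, 2, 9], [4, 5, 6]]   # results from the quantized index
score = recall_at_k(exact, approx, k=3)  # (2/3 + 3/3) / 2
```

Running the same queries at each bit width and comparing the scores gives a direct view of the accuracy cost of the extra compression.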

Note: 2‑bit is not universally suitable—avoid for low‑dim or precision‑critical tasks.

Summary: Decide bit_width based on recall sensitivity, memory constraints, vector dimensionality, and hardware; validate with representative benchmarks and consider hybrid strategies.

How do IdMapIndex's O(1) deletes and external uint64 id support benefit engineering? What implementation and maintenance caveats exist?

Core Analysis

Key Question: What practical benefits do IdMapIndex’s stable external uint64 ids and O(1) deletes bring, and what are the caveats?

Technical Analysis

Engineering benefits:
Stable id mapping: maps business ids directly into the vector index, easing integration with DBs/metadata.
O(1) deletes: avoids full rebuilds, enabling frequent deletes/updates (tenant cleanup, time windows).
Persistence: .tvim format for local save/load.
Maintenance caveats:
Fragmentation/holes: deletes create empty slots reducing block short‑circuit efficiency and density—periodic compaction is required.
Consistency & crash recovery: verify atomicity of writes and file durability to avoid id/index mismatches.
Licensing/metadata: README lacks license details—confirm before enterprise use.
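
Sketch: the interaction between O(1) deletes and fragmentation can be seen in a toy id map. This is an assumed tombstone design that matches the "empty slots" behaviour described above; it is not turbovec's implementation, and `ToyIdMap`, `hole_ratio`, `compact`, and the threshold value are all hypothetical.

```python
class ToyIdMap:
    """External id -> slot mapping with tombstoned deletes (illustrative only)."""

    def __init__(self):
        self.slot_of = {}  # external id -> slot
        self.id_at = []    # slot -> external id, or None for a hole

    def add(self, ext_id):
        if ext_id in self.slot_of:
            raise KeyError(f"id {ext_id} already present")
        self.slot_of[ext_id] = len(self.id_at)
        self.id_at.append(ext_id)

    def remove(self, ext_id):
        # O(1) on average: one dict pop plus one list write, no data movement.
        slot = self.slot_of.pop(ext_id)
        self.id_at[slot] = None

    def hole_ratio(self):
        return self.id_at.count(None) / len(self.id_at) if self.id_at else 0.0

    def compact(self):
        """Rebuild densely; returns old_slot -> new_slot so codes can be moved too."""
        moves, live = {}, []
        for old_slot, ext_id in enumerate(self.id_at):
            if ext_id is not None:
                moves[old_slot] = len(live)
                live.append(ext_id)
        self.id_at = live
        self.slot_of = {ext_id: slot for slot, ext_id in enumerate(live)}
        return moves


ids = ToyIdMap()
for ext_id in range(10):
    ids.add(ext_id)
for ext_id in range(0, 10, 2):
    ids.remove(ext_id)          # half the slots are now holes

COMPACT_THRESHOLD = 0.3         # hypothetical policy value; tune per workload
ratio_before = ids.hole_ratio()
moves = ids.compact() if ratio_before > COMPACT_THRESHOLD else {}
```

Deletes stay cheap because nothing moves, but the holes remain inside their blocks until a compaction pass rewrites the slots.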

Practical Recommendations

Implement periodic compaction: trigger rebuilds when deletion ratio exceeds a threshold to restore density.
Sync with business DB: keep turbovec ids aligned with primary data sources and leverage DB candidate sets for allowlists.
Test persistence semantics: simulate crashes during write/load to ensure consistency.
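
Sketch: if crash tests show that a save can leave a partial file, one common mitigation is to write to a temporary file and rename it into place. The helper below uses only the Python standard library; `atomic_write` is a hypothetical name, the payload stands in for serialized index bytes, and whether turbovec's own save path already does something similar is not stated in the source.

```python
import os
import tempfile


def atomic_write(path, payload: bytes):
    """Write payload so readers see the old file or the new one, never a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)  # rename over the target; atomic on POSIX
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)     # do not leave a stray temp file behind
        raise
```

The temporary file is created in the target directory so that the rename never crosses a file-system boundary.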

Note: O(1) deletes ease operational burden but do not eliminate the need for periodic maintenance and license verification.

Summary: IdMapIndex is valuable for CRUD‑heavy retrieval systems; pair it with fragmentation management and persistence checks for robust production use.

Which scenarios are unsuitable for turbovec? How to handle data drift or low‑dimensional vectors?

Core Analysis

Key Question: Which scenarios are unsuitable for turbovec, and how to handle data drift or low‑dimensional vectors?

Technical Analysis

Unsuitable scenarios:
Low‑dimensional vectors (e.g., d ≈ 100–300): TurboQuant’s high‑dimensional assumptions break down and 2/4‑bit quantization yields larger errors.
Significant long‑term distribution drift: TQ+ performs one‑time calibration on first writes; substantial later drift requires index rebuilds.
Need for cross‑node horizontal scaling or strong HA: turbovec is single‑node/single‑process and lacks built‑in sharding/replication.
Alternatives & mitigations:
For low‑dim or accuracy‑critical tasks, consider trained PQ/OPQ or FAISS with offline builds.
For drift, schedule periodic rebuilds or recalibration (export a representative sample and rebuild indices).
For scalability/HA, use distributed vector DBs (Milvus, Weaviate) and consider turbovec as a single‑node reranker.
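
Sketch: one simple way to decide when a rebuild is due is to compare a sample of recent vectors against the sample that was current at calibration time. The heuristic below is an assumption for illustration, not something turbovec provides: `drift_score` is a hypothetical helper that reports the largest per-coordinate mean shift in units of the reference standard deviation.

```python
from statistics import fmean, pstdev


def drift_score(reference, recent):
    """Largest per-coordinate mean shift, in units of the reference std deviation."""
    dim = len(reference[0])
    worst = 0.0
    for j in range(dim):
        ref_col = [v[j] for v in reference]
        new_col = [v[j] for v in recent]
        spread = pstdev(ref_col) or 1e-12  # guard against constant coordinates
        worst = max(worst, abs(fmean(new_col) - fmean(ref_col)) / spread)
    return worst


reference = [[0.0, 0.0], [2.0, 2.0]]  # sample taken around calibration time
recent = [[1.0, 4.0], [1.0, 6.0]]     # sample of newly ingested vectors

score = drift_score(reference, recent)  # second coordinate moved by 4 std devs
```

The threshold that triggers a rebuild has to be chosen empirically, ideally by correlating the score with measured recall on held-out queries.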

Note: Verify licensing before enterprise deployment—README lacks explicit license details.

Summary: turbovec is well suited for high‑dim, private, single‑node deployments. For low‑dim data, significant drift, or distributed requirements, choose training‑based quantizers, periodic rebuilds, or distributed DBs accordingly.

Why choose TurboQuant (random rotation + scalar quantization) instead of common PQ/OPQ? What are its architectural advantages and limitations?

Core Analysis

Key Question: Why use random orthogonal rotation + scalar quantization (TurboQuant) instead of trained PQ/OPQ? The tradeoff centers on training & reconstruction accuracy versus real‑time writes & deployment complexity.

Technical Analysis

Advantages:
Training‑free, low‑latency writes: supports online incremental add without codebook training or index rebuilds.
Data‑oblivious: suitable for privacy / air‑gapped deployments—no data export for training.
Simple, efficient implementation: rotation yields predictable per‑coordinate distributions; scalar quantization + bit‑packing maps well onto SIMD kernels.
Limitations:
Sensitive to low dimensionality: where the high‑dimensional assumptions do not hold, or the distribution is skewed, recall at 2/4‑bit may lag PQ/OPQ.
Single calibration freeze: TQ+ performs one‑time shift/scale calibration on first writes; significant later drift requires explicit rebuild.
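
Sketch: the "random rotation + scalar quantization" idea can be shown end to end in pure Python. This is a simplified illustration, not TurboQuant itself: it uses a uniform quantizer on a fixed range (the source does not specify the actual codebook), and `random_rotation`, `quantize`, `dequantize`, and the `limit` heuristic are hypothetical names and choices. No training data is involved at any step.

```python
import math
import random


def random_rotation(dim, seed=0):
    """Random orthogonal matrix via Gram-Schmidt on Gaussian rows."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:  # remove the components along earlier rows
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:  # skip the (unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return rows


def rotate(matrix, x):
    return [sum(m * a for m, a in zip(row, x)) for row in matrix]


def quantize(y, bit_width, limit):
    """Uniform scalar quantizer on [-limit, limit]; returns integer codes."""
    levels = 1 << bit_width
    step = 2 * limit / levels
    return [min(levels - 1, max(0, math.floor((c + limit) / step))) for c in y]


def dequantize(codes, bit_width, limit):
    """Map each code back to the midpoint of its quantization bin."""
    step = 2 * limit / (1 << bit_width)
    return [-limit + (q + 0.5) * step for q in codes]


dim = 64
rot = random_rotation(dim, seed=1)
rng = random.Random(2)
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
x_norm = math.sqrt(sum(a * a for a in x))
x = [a / x_norm for a in x]   # unit-length input vector
y = rotate(rot, x)            # rotated coordinates have std of about 1/sqrt(dim)
limit = 3 / math.sqrt(dim)    # clip at roughly three standard deviations


def mse(bit_width):
    y_hat = dequantize(quantize(y, bit_width, limit), bit_width, limit)
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / dim


print(f"2-bit MSE: {mse(2):.6f}  4-bit MSE: {mse(4):.6f}")
```

Because the rotation is orthogonal, distances and inner products are preserved before quantization, and the same fixed per-coordinate quantizer can be applied to every vector as it arrives.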

Practical Recommendations

Favor TurboQuant for continuous ingest and private deployments; if offline training and maximum accuracy are acceptable, benchmark against PQ/OPQ.
Run representative benchmarks pre‑deployment to compare recall at equal bit rates.

Note: TurboQuant is not a universal replacement for trained quantizers—choose based on the precision vs operational/privacy tradeoff.

Summary: TurboQuant delivers engineering benefits for online, privacy‑sensitive use cases; for the absolute best recall when training is acceptable, PQ/OPQ may still be preferable.

How does turbovec's performance vary across hardware? How to validate performance on target platforms?

Core Analysis

Key Question: turbovec’s throughput and latency depend heavily on hardware (AVX‑512/NEON, memory bandwidth, caches). How should you validate and tune for your target platform?

Technical Analysis

SIMD is critical: hand‑written AVX‑512BW and NEON kernels yield the best performance on CPUs supporting those instruction sets.
Sources of variation:
Instruction width (AVX‑512 > AVX2 > SSE) dictates parallelism.
Memory bandwidth/cache behavior affects bit‑packed LUT accesses.
Threading & NUMA influence latency/throughput in multi‑socket servers.

Validation & Tuning Steps

Baseline benchmarks: measure single‑query latency (p50/p95/p99) and throughput on the target machine; record SIMD support.
Scenario tests: run full index search, allowlist (sparse/dense), and concurrent queries to evaluate short‑circuiting and heap costs.
Resource profiling: inspect CPU utilization, cache misses, and memory bandwidth to identify bottlenecks.
Fallback plan: if AVX‑512 or modern NEON is absent, consider higher bit widths, lower concurrency, or alternative libraries (FAISS) to meet SLAs.
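
Sketch: the baseline latency measurement in the first step can be done with a small harness. `latency_percentiles` is a hypothetical helper; `run_query` is whatever callable wraps a search on the target machine, so the harness itself makes no assumptions about the turbovec API.

```python
import statistics
import time


def latency_percentiles(run_query, queries, warmup=10):
    """Time run_query(q) for each query and return p50/p95/p99 in milliseconds."""
    for q in queries[:warmup]:
        run_query(q)  # warm caches before measuring
    samples = []
    for q in queries:
        start = time.perf_counter()
        run_query(q)
        samples.append((time.perf_counter() - start) * 1e3)
    # quantiles(n=100) returns 99 cut points; index 49 is the 50th percentile.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Stand-in workload; replace the lambda with a real search call when benchmarking.
stats = latency_percentiles(lambda q: sum(range(1000)), list(range(200)))
print(stats)
```

Running the same harness for the full-index, sparse-allowlist, and dense-allowlist scenarios makes the results directly comparable across machines.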

Note: The README’s 12–20% gains are hardware/configuration specific—don’t assume they transfer to your platform.

Summary: Systematic end‑to‑end benchmarking and resource profiling on your target hardware is mandatory; adjust bit widths, concurrency, or choose alternate implementations based on the results.

✨ Highlights

Very high compression: 10M documents (1536-d) fit in 4 GB
Online indexing: no training or rebuilds required; supports incremental adds
Provides Rust and Python bindings and integrates with popular retriever frameworks
License is unspecified; perform legal/compliance review before adoption
Repository metadata incomplete (no releases / unclear contributors); maintenance risk should be verified

🔧 Engineering

TurboQuant-based, training-free quantized index achieving distortion near the Shannon lower bound
Handwritten SIMD kernels (NEON and AVX‑512BW) deliver competitive search performance on ARM/x86
Supports online ingest, filtered search (allowlist/bitmask), and stable external ids via IdMapIndex
Offers Python/Rust APIs and drop-in integrations for LangChain, LlamaIndex, Haystack, etc.

⚠️ Risks

License unknown; may restrict commercial use or code integration strategies
Repository shows incomplete contributor and release metadata; long-term maintenance and community support are unclear
Benchmarks target specific hardware and datasets; validate performance when porting to other platforms
ARM/x86 SIMD optimizations may cause compatibility issues or degraded performance on other architectures

👥 Who is it for?

Engineering teams building local RAG stacks or operating under strict privacy/VPC constraints
Retrieval systems sensitive to memory footprint and latency, deployed in resource-constrained environments
Production systems requiring stable external ids, deletions, and incremental index updates