whichllm: Hardware-aware, benchmark-driven local LLM selection tool

whichllm auto-detects hardware and consolidates benchmarks from multiple sources to recommend locally runnable LLMs, ranked by VRAM fit, speed, and evaluated quality. It is useful for engineers and buyers who need to run models efficiently on their own devices. Verify the license and the long-term maintenance outlook before depending on it.

GitHub: Andyyyy64/whichllm | Updated: 2026-06-09 | Branch: main | Stars: 3.5K | Forks: 203

Tags: local LLMs, hardware detection, model selection, CLI tool, HuggingFace, GGUF, resource planning

💡 Deep Analysis

What core problem does whichllm solve? How does it outperform a simple "model-size/VRAM-only" selection strategy in real-world model selection?

Core Analysis

Project Positioning: whichllm’s primary value is that it turns the simple “fits into VRAM” check into a multi-dimensional selection problem. Rather than looking at model size alone, it accounts for weights, KV cache, activation peaks, quantization efficiency, MoE active parameters, and backend differences.

Technical Features

Evidence merging: Aggregates scores from LiveBench, Aider, and the Open LLM Leaderboard, and discounts each score by evidence type (direct, variant, base, or self-reported) so that inherited or self-reported results carry less weight.
Architecture-aware estimation: Models VRAM usage as weights + KV cache + activations + overhead; speed is modeled as bandwidth-bound with per-quant efficiency, MoE active-vs-total splits, and unified-memory vs PCIe partial offload costs.
Quantization/backend penalties: Supports GGUF, AWQ/GPTQ, FP16/BF16 and incorporates penalties for expected speed/quality impact into ranking.
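
The VRAM model described above (weights + KV cache + activations + overhead) can be sketched as a few lines of arithmetic. This is a minimal illustration under stated assumptions, not whichllm's actual code: `ModelShape`, `kv_cache_bytes`, `estimate_vram_gib`, and the flat activation and overhead allowances are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass

GIB = 2**30


@dataclass
class ModelShape:
    """Hypothetical description of a dense decoder-only model."""
    params_billion: float  # total parameters, in billions
    n_layers: int
    n_kv_heads: int        # with GQA this is smaller than the attention head count
    head_dim: int


def kv_cache_bytes(shape: ModelShape, context_len: int, bytes_per_value: int = 2) -> int:
    # One key vector and one value vector per layer, per KV head, per token.
    return 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * context_len * bytes_per_value


def estimate_vram_gib(shape: ModelShape, bits_per_weight: float, context_len: int,
                      activation_gib: float = 0.5, overhead_gib: float = 0.8) -> float:
    # Assumed flat allowances for activation peaks and runtime overhead.
    weights_bytes = shape.params_billion * 1e9 * bits_per_weight / 8
    total_bytes = weights_bytes + kv_cache_bytes(shape, context_len)
    return total_bytes / GIB + activation_gib + overhead_gib


# Illustrative 8B-parameter shape at roughly 4.5 bits per weight, 8K context.
shape = ModelShape(params_billion=8.0, n_layers=32, n_kv_heads=8, head_dim=128)
estimate = estimate_vram_gib(shape, bits_per_weight=4.5, context_len=8192)
print(f"estimated VRAM: {estimate:.2f} GiB")
```

Under these assumptions the KV cache alone accounts for 1 GiB at an 8K context, which is why a model whose weights fit can still fail to run once the context grows.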

Practical Recommendations

Run whichllm on your machine to obtain an evidence-scored ranking rather than selecting by model size alone.
For latency-sensitive use, inspect speed_confidence and speed_range to set realistic expectations.
Use --gpu simulation and the plan command before hardware purchases to compare real-world run feasibility across cards.

Important Notice: Recommendations are estimation-driven and rely on public benchmarks; unusual drivers, CUDA versions, or proprietary backends can cause deviations.

Summary: whichllm integrates multi-source benchmarks and runtime cost modeling to reduce the engineering risk of “it should fit but doesn’t run well” or buying inappropriate hardware.

As a developer, what is the actual experience of using whichllm to download and start models? What are common errors and best practices?

Core Analysis

Issue: whichllm offers an automated flow from recommendation to launch, which enables quick local trials. It still depends on local drivers, compiled dependencies, and network resources, and these remain the common failure points.

Technical Analysis

One-command flow: whichllm run creates an isolated environment (uv), installs dependencies, downloads the model, and launches an interactive session; whichllm snippet emits a runnable Python snippet.
Common failure points:
Mismatched GPU drivers or CUDA versions cause compile/run errors.
auto-gptq/autoawq or llama-cpp-python require local compilation and may fail if system libs/build tools are missing.
Large model downloads are time-consuming and disk-heavy, susceptible to rate limits or offline failures.

Best Practices

Prepare drivers/deps: Ensure CUDA, NVIDIA drivers, and build toolchains are ready; verify hardware detection with whichllm before heavy ops.
Validate at small scale: Use whichllm snippet or --top 1 --json to fetch model info and run a small interactive test in an isolated env first.
Plan network/storage: Reserve disk space before large downloads and use caching/frozen fallbacks if bandwidth is limited.
Conservative quantization path: If hitting quant/backend issues, start with mainstream stacks (llama-cpp-python + GGUF/FP16) before experimenting with AWQ/GPTQ.
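
The "reserve disk space" practice above can be automated with a pre-flight check before any large download. The sketch below uses only the Python standard library; the function name `has_room_for_download`, the 1.2x safety factor, and the 5 GB model size are assumptions for illustration and are not part of whichllm.

```python
import shutil


def has_room_for_download(target_dir: str, model_bytes: int, safety_factor: float = 1.2) -> bool:
    """Return True if target_dir has enough free space for the model plus a margin."""
    free_bytes = shutil.disk_usage(target_dir).free
    return free_bytes >= model_bytes * safety_factor


# Hypothetical 5 GB quantized model, checked against the current directory.
needed = 5 * 10**9
print("enough space:", has_room_for_download(".", needed))
```

Running the check before the download step avoids a failure late in a multi-gigabyte transfer.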

Important Notice: whichllm shortens the path to trying models but cannot eliminate the complexity of local environments and compilation chains.

Summary: Treat whichllm as a rapid trial facilitator, but perform driver/disk/build-tool checks and small-scale verifications before a full run.

I plan to use whichllm for hardware purchase decisions (e.g., RTX 4090 vs 5090). Which features support this decision and what are the limitations?

Core Analysis

Issue: whichllm provides GPU simulation, plan, and upgrade to visualize what models will run and how well on candidate cards. It’s a valuable decision-support tool but not the sole basis for purchase.

Technical Analysis

Decision-support features:
--gpu simulation: Estimates VRAM usage and inference speed for different GPUs.
plan: Reverse-plans the minimum hardware or recommends cards for a given model.
upgrade: Compares model coverage and performance across multiple cards for cost/benefit analysis.
Modeling boundaries: Estimates rely on generic bandwidth/VRAM modeling and public benchmarks; they cannot fully capture unified-memory behavior, PCIe offload dynamics, or vendor-specific accelerations.

Practical Advice

Use whichllm’s lower-bound estimates as conservative inputs; reserve 10–30% headroom for driver/system variances.
For critical workloads, run an empirical test on the target machine: borrow or trial a representative card and compare whichllm run results against the estimates.
Include cost, power, cooling, and licensing/compliance considerations in purchase decisions; these are outside whichllm’s scope.
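
The headroom advice above reduces to a one-line comparison. The following is a sketch under stated assumptions: `fits_with_headroom` is a hypothetical helper, the 20 GiB estimate is an invented input rather than a whichllm result, and 20% is simply the midpoint of the 10–30% range. The nominal VRAM sizes (24 GB for the RTX 4090, 32 GB for the RTX 5090) are the published card capacities.

```python
def fits_with_headroom(estimated_vram_gib: float, card_vram_gib: float, headroom: float = 0.2) -> bool:
    """True if the estimate fits after reserving a fraction of the card's VRAM."""
    usable_gib = card_vram_gib * (1.0 - headroom)
    return estimated_vram_gib <= usable_gib


# Nominal VRAM per card; the estimate is a hypothetical input for illustration.
CARDS = {"RTX 4090": 24, "RTX 5090": 32}
estimated_vram_gib = 20.0

for name, vram in CARDS.items():
    verdict = "fits" if fits_with_headroom(estimated_vram_gib, vram) else "too tight"
    print(f"{name}: {verdict}")
```

A model that nominally fits in 24 GB can still be flagged as too tight once headroom is reserved, which is the case the purchase comparison is meant to surface.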

Important Notice: whichllm is not the final arbiter for hardware purchases but significantly narrows candidates and quantifies runability and expected performance.

Summary: Use whichllm for evidence-driven preselection and capacity planning; combine with empirical tests and engineering/cost constraints for the final buy decision.

How accurate are whichllm's VRAM and speed estimators? How should I interpret `speed_confidence` and `speed_range` in practice?

Core Analysis

Issue: whichllm’s VRAM/speed estimations provide meaningful expectations for common hardware/backend combinations, but they are not absolute guarantees. speed_confidence and speed_range explicitly express uncertainty.

Technical Analysis

Estimation components: VRAM = weights + GQA KV cache + activation peaks + overhead; speed depends on bandwidth, quantization efficiency, backend implementation differences, and MoE active ratios.
Sources of uncertainty: Input sequence length (activation peaks), driver/CUDA versions, backend (llama-cpp vs transformers), use of unified memory or PCIe offload, and quantization implementation efficiency.
Meaning of speed_confidence: Reflects benchmark coverage and historical consistency of a model/backend combo. High values indicate multiple similar benchmarks support the estimate; low values indicate sparse or divergent data.
Meaning of speed_range: Presents a throughput interval across different system loads, quant backends, and sequence lengths.
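
To make the bandwidth-bound speed model concrete: during token generation the active weights must be read from memory once per token, so throughput is roughly effective bandwidth divided by the bytes read per token. The sketch below illustrates that idea only; the function name, the 0.6 efficiency factor, and the 1000 GB/s bandwidth figure are assumptions, not values taken from whichllm.

```python
def estimate_tokens_per_sec(active_params_billion: float, bits_per_weight: float,
                            bandwidth_gb_s: float, efficiency: float = 0.6) -> float:
    """Rough decode throughput for a bandwidth-bound model.

    active_params_billion: parameters touched per token (for MoE, the active
    subset rather than the total).
    efficiency: assumed fraction of peak bandwidth a backend actually achieves.
    """
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 * efficiency / bytes_per_token


# Dense 8B model versus a hypothetical MoE model with 3B active parameters,
# both at about 4.5 bits per weight on a 1000 GB/s memory system.
dense = estimate_tokens_per_sec(8.0, 4.5, 1000.0)
moe = estimate_tokens_per_sec(3.0, 4.5, 1000.0)
print(f"dense: {dense:.0f} tok/s, MoE: {moe:.0f} tok/s")
```

The MoE case is faster because only the active parameters are read per token, even though the full set of weights still has to fit in memory. Real throughput also depends on the backend and offload path, which is the uncertainty that speed_range expresses.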

Practical Recommendations

If speed_confidence is high and speed_range is tight, treat the estimate as a reliable capacity-planning input; otherwise run whichllm run or a snippet to validate on target hardware.
For purchasing, simulate multiple GPUs with --gpu and use the lower bound of estimates as a conservative metric, leaving ~10–30% headroom.
With MoE, unified memory, or non-standard backends, treat whichllm results as guidance and perform an end-to-end test.

Important Notice: whichllm gives expected ranges and confidence, not production SLAs. Critical production choices should be validated empirically.

Summary: Estimations are valuable for mainstream setups; speed_confidence and speed_range are the signals telling you whether to trust the estimate or to run empirical checks.

How can I integrate whichllm into automated selection or CI workflows? What reusable outputs and integration limits exist?

Core Analysis

Issue: whichllm provides automation-friendly outputs (--json, snippet) and an isolated install flow (uv), making it suitable for embedding selection and trial steps into CI, but network, drivers, and native build dependencies limit full automation on generic runners.

Technical Analysis

Reusable outputs:
--json: Structured list with scores, speed_confidence, and speed_range—ideal for programmatic filtering (e.g., score>80 and speed_confidence>0.7).
snippet: Executable Python snippets usable as CI smoke tests to validate model load/inference on runners.
uv/uvx: Isolated install helps reproduce environment setup.
Limits & mitigations:
Network dependency: HF API calls and model downloads fail in air-gapped CI; use cache/frozen fallbacks or private mirrors.
Native builds: auto-gptq and llama-cpp-python require compilation; mitigate with containers or prebuilt binary caches.
Hardware limits: GPU-less CI can only perform static selection; place runtime validation on GPU-equipped staging runners.
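
The threshold filter mentioned above (score>80 and speed_confidence>0.7) could look like the following in a CI step. This is a sketch that parses a hand-written sample rather than real tool output: the exact shape of whichllm's --json output is not documented here, so the field names `model`, `score`, `speed_confidence`, and `speed_range`, and all sample values, are assumptions.

```python
import json

# Hand-written sample standing in for captured `--json` output (assumed schema).
SAMPLE_OUTPUT = """
[
  {"model": "example/model-a", "score": 86.0, "speed_confidence": 0.82, "speed_range": [38.0, 52.0]},
  {"model": "example/model-b", "score": 91.0, "speed_confidence": 0.55, "speed_range": [12.0, 40.0]},
  {"model": "example/model-c", "score": 74.0, "speed_confidence": 0.90, "speed_range": [60.0, 71.0]}
]
"""


def select_candidates(records, min_score=80.0, min_confidence=0.7):
    """Keep entries that clear both the quality and the confidence threshold."""
    return [
        r for r in records
        if r["score"] > min_score and r["speed_confidence"] > min_confidence
    ]


records = json.loads(SAMPLE_OUTPUT)
selected = select_candidates(records)
for r in selected:
    low, high = r["speed_range"]
    print(f'{r["model"]}: score={r["score"]}, conservative speed={low} tok/s')
```

Only the first entry passes both thresholds. Using the lower bound of the range as the planning figure keeps the selection conservative before the smoke test runs on a GPU runner.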

Practical Recommendations

Use --json as the selection engine input in CI; apply threshold filters and then dispatch snippet to test runners.
Containerize or prebuild artifacts to eliminate flaky native builds; run smoke tests on dedicated GPU runners.
Maintain frozen fallbacks and internal model mirrors for offline or air-gapped CI.

Important Notice: Automation speeds decision-making, but final compatibility checks must run on runners that mirror production hardware and drivers.

Summary: whichllm has strong automation touchpoints (JSON, snippet, uv) but requires network, driver, and build reproducibility practices to be fully reliable in CI.

How does whichllm's evidence aggregation and confidence discounting work? What are the advantages and potential pitfalls of this design?

Core Analysis

Issue: whichllm does not only aggregate multiple benchmarks; it also discounts each entry by whether it is direct, inherited/variant, or self-reported, and by recency. This reduces distortion from inflated or inherited scores.

Technical Analysis

Evidence stratification: Scores are labeled as direct / variant / base / interpolated / self-reported and weighted down by category, which prevents small forks from inheriting a high rank from the score of a large base model.
Recency weighting: Older benchmark results are down-weighted along model lineages; the README prints snapshot dates so users can spot stale information.
Multi-source cross-check: Aggregating LiveBench, Aider, Open LLM Leaderboard reduces individual test bias but relies on sufficient overlap or calibration between leaderboards.
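
One way to picture evidence stratification combined with recency weighting is a weighted average in which each entry's weight is the product of a type discount and an age decay. The sketch below is an illustration of that general idea, not whichllm's actual scheme: the discount values in `TYPE_WEIGHT`, the one-year half-life, and all names and sample entries are assumptions.

```python
from dataclasses import dataclass
from datetime import date

# Assumed discount per evidence type; the real values are not stated in the source.
TYPE_WEIGHT = {
    "direct": 1.0,
    "variant": 0.85,
    "base": 0.7,
    "interpolated": 0.6,
    "self-reported": 0.5,
}
HALF_LIFE_DAYS = 365  # assumed: an entry loses half its weight per year of age


@dataclass
class Evidence:
    source: str
    score: float   # normalized to a 0-100 scale
    kind: str      # one of the TYPE_WEIGHT keys
    measured: date


def evidence_weight(ev: Evidence, today: date) -> float:
    age_days = max((today - ev.measured).days, 0)
    recency = 0.5 ** (age_days / HALF_LIFE_DAYS)
    return TYPE_WEIGHT[ev.kind] * recency


def aggregate(entries, today):
    """Return (weighted mean score, strongest single weight as a confidence proxy)."""
    weights = [evidence_weight(e, today) for e in entries]
    total = sum(weights)
    if total == 0:
        return 0.0, 0.0
    mean = sum(w * e.score for w, e in zip(weights, entries)) / total
    return mean, max(weights)


today = date(2026, 1, 1)
entries = [
    Evidence("leaderboard-x", 70.0, "direct", date(2026, 1, 1)),
    Evidence("model-card", 90.0, "self-reported", date(2025, 1, 1)),
]
score, confidence = aggregate(entries, today)
print(f"score={score:.1f}, confidence={confidence:.2f}")
```

In this example the year-old self-reported 90 pulls the fresh direct score of 70 up only slightly, because its weight is a quarter of the direct entry's.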

Advantages

More robust recommendations: Resists manipulation by self-reported or inherited scores.
Transparent: Provides benchmark dates and confidence tags for manual verification.

Pitfalls & Recommendations

Data sparsity: Models with few benchmarks may be over-penalized; in such cases use conservative filters (--evidence strict or only direct entries).
Benchmark semantics: Different leaderboards test different tasks; inspect which metrics are being combined.
Offline/cached lag: Frozen fallbacks in offline mode may be stale.

Important Notice: Treat the evidence weights as decision parameters, not as the single authority.

Summary: Evidence aggregation plus confidence discounting increases recommendation quality but requires user interpretation of confidence tags and benchmark dates for final selection.

✨ Highlights

Evidence-based ranking using real benchmarks
Auto-detects GPU/CPU/RAM and recommends runnable models
One-command flows: simulate, download and start chat
License and contributor metadata are unclear; verify compliance risk

🔧 Engineering

Merges multi-source benchmarks and ranks models by quality, speed and VRAM fit
Supports GPU simulation, task-profile filters and multiple model formats (GGUF/AWQ/FP16)
Produces reusable JSON output and runnable snippets for scriptable integration

⚠️ Risks

No clear open-source license declared; legal uncertainty for commercial/distribution use
Repository shows no contributor/release metadata; elevated risk for long-term maintenance and security updates
Depends on downstream data (HuggingFace, leaderboards); ranking can fluctuate with source changes
Compatibility across backends and quantized formats is complex; extra validation needed for heterogeneous hardware

👥 Who is it for?

Developers and researchers who want to run large models locally and understand model formats and hardware constraints
System or procurement decision makers for quick hardware planning and performance estimation
Engineers with deployment experience who can integrate the tool into CI or automation scripts