CUA: Cross‑platform sandboxes and SDKs for desktop‑controlling AI agents

CUA provides cross‑platform sandboxes, unified SDKs, and benchmarks for AI agents that control full desktops, so that agents can be trained, evaluated, and deployed on real desktop tasks either locally or in the cloud.

GitHub: trycua/cua | Updated: 2025-10-08 | Branch: main | Stars: 20.2K | Forks: 1.3K

VM management | Multi‑platform (macOS/Linux/Windows) | AI agent framework | Model zoo & benchmarks

💡 Deep Analysis

What is the actual user experience? What are the learning curve, common pitfalls, and best practices?

Core Analysis

Core question: What practical issues will you encounter, and how can you ramp up effectively?

Technical Analysis

Learning curve: Medium-high. You need working knowledge of virtualization (including the nuances of macOS Virtualization.framework), image management, model backend configuration (OpenAI, Anthropic, or local model prefixes), and dependency management.
Common pitfalls:
Environment and permission issues (macOS permissions, drivers, container networking)
Resource bottlenecks causing latency or failures (insufficient CPU/GPU/RAM)
Third-party license constraints (e.g., AGPL) affecting production use
Model outputs that do not match the computer_call format, leading to execution errors
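The last pitfall can be caught early by checking each model output before it reaches the executor. The following is a minimal sketch, not CUA's own validator: the item layout (a "type" of "computer_call" with a nested "action" dict) and the per-action required fields in REQUIRED_FIELDS are assumptions made for illustration and should be replaced with the schema from the SDK version you pin.

```python
from typing import Any, Dict, List

# Assumed, illustrative schema: required fields per action type.
# These names are placeholders, not taken from CUA's documentation.
REQUIRED_FIELDS = {
    "click": {"x", "y"},
    "type": {"text"},
    "scroll": {"x", "y", "scroll_y"},
}


def validate_computer_call(item: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the item looks well-formed."""
    if item.get("type") != "computer_call":
        return [f"unexpected item type: {item.get('type')!r}"]

    action = item.get("action")
    if not isinstance(action, dict):
        return ["missing or non-dict 'action'"]

    kind = action.get("type")
    if kind not in REQUIRED_FIELDS:
        return [f"unknown action type: {kind!r}"]

    # Compare the required field names against the keys actually present.
    missing = REQUIRED_FIELDS[kind] - set(action)
    if missing:
        return [f"action {kind!r} missing fields: {sorted(missing)}"]
    return []
```

Rejecting or re-prompting on a non-empty result keeps malformed actions from failing inside the VM, where the error is harder to diagnose.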

Practical Recommendations

Stepwise onboarding: Follow the README: run pip install cua-agent[all], then run the example notebooks and HUD benchmarks on a small VM.
Start small: Validate logic with small models from the Model Zoo, then scale up.
Pin versions: Lock the SDK version, VM image, and model prefix for reproducibility.
Security posture: Run agents under restricted accounts in isolated VMs, and limit network and file access.
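The version-pinning recommendation can be enforced with a small drift check that runs before each agent session. This is a sketch under stated assumptions: find_drift is a hypothetical helper, and the keys and values in the example dictionaries are placeholders rather than real CUA versions or image tags.

```python
from typing import Dict, Optional, Tuple


def find_drift(
    pinned: Dict[str, str], observed: Dict[str, str]
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each drifted component to (pinned value, observed value or None)."""
    drift = {}
    for name, want in pinned.items():
        got = observed.get(name)  # None when the component was not reported
        if got != want:
            drift[name] = (want, got)
    return drift


if __name__ == "__main__":
    # Placeholder values for illustration only.
    pinned = {"sdk": "0.0.0", "image": "example-image:tag", "model": "provider/model-name"}
    observed = {"sdk": "0.0.0", "image": "example-image:other", "model": "provider/model-name"}
    print(find_drift(pinned, observed))
```

For installed Python packages, the observed value can be read with the standard library call importlib.metadata.version("cua-agent"); image and model identifiers have to come from your own configuration.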

Caveats

Important Notice: Before production, audit third-party licenses, constrain agent permissions, and monitor resource usage/latency.

Summary: Using an incremental workflow (examples → small models → scale) plus strict versioning and security controls makes the platform manageable despite its medium-high learning curve.


How to use CUA's benchmarks (HUD/Notebook, OSWorld-Verified, SheetBench-V2) for reproducible evaluation?

Core Analysis

Core question: How do you use CUA’s benchmark tools for reproducible, comparable desktop-agent evaluation?

Technical Analysis

End-to-end benchmarking: HUD/Notebook provides a one-line entry point for running benchmarks (OSWorld-Verified, SheetBench-V2) inside VMs, recording structured events, screenshots, and model usage (tokens, cost).
Reproducibility factors: For strict reproducibility you must pin:
1. VM image and snapshot hash
2. SDK and Agent versions
3. Model prefix and version
4. Hardware (CPU/GPU) and network conditions
Comparability: The unified computer_call / computer_call_output format lets different models’ behaviors be directly compared on the same tasks and supports replay and audit.
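To illustrate the comparability point, a shared action format means two models' traces on the same task can be compared with a few lines of code. The sketch below assumes each trace is a list of dicts in which computer_call items carry an "action" dict with a "type" field; that layout is an assumption for illustration, and both function names are hypothetical.

```python
from itertools import zip_longest
from typing import Any, Dict, List, Optional

Trace = List[Dict[str, Any]]


def action_types(trace: Trace) -> List[str]:
    """Extract the action type of every computer_call item, in order."""
    return [
        item["action"]["type"]
        for item in trace
        if item.get("type") == "computer_call"
    ]


def first_divergence(trace_a: Trace, trace_b: Trace) -> Optional[int]:
    """Index of the first step where the two action sequences differ, else None."""
    pairs = zip_longest(action_types(trace_a), action_types(trace_b))
    for index, (a, b) in enumerate(pairs):
        if a != b:  # also fires when one trace ends before the other
            return index
    return None
```

Comparing only action types is deliberately coarse; coordinates and typed text can be added to the comparison once the actual trace schema is confirmed.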

Practical Steps

Prepare env: Build and tag VM images (OS, browser, apps) and record image hashes.
Pick a benchmark: Run OSWorld-Verified or SheetBench-V2 samples via HUD/Notebook and save output JSON (including usage).
Pin config: Lock SDK, Model Zoo prefix, and hardware specs; save logs/screenshots.
Archive metadata: Store image, model, hardware, and network metadata for reproducibility.
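The pinned factors and archived metadata can be bundled into one manifest stored next to each result file. The following is a minimal sketch using only the Python standard library; RunManifest and its field names are hypothetical and simply mirror the items listed above, not any format defined by CUA.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    """Everything that must match for two benchmark runs to be comparable."""

    image_hash: str
    sdk_version: str
    agent_version: str
    model: str
    hardware: str
    network: str

    def to_json(self) -> str:
        # Sorted keys and fixed separators give a stable, canonical encoding.
        return json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))

    def fingerprint(self) -> str:
        """SHA-256 of the canonical JSON; equal fingerprints mean equal pins."""
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()
```

Writing the fingerprint into every output JSON makes it easy to refuse comparisons between runs whose environments differ.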

Caveats

Important Notice: One-line benchmarks are great for iteration, but rigorous research requires full metadata capture and strict version pinning.

Summary: CUA’s benchmarking tools lower the barrier to reproducible evaluation, provided you enforce strict environment and model version control.


✨ Highlights

Sandbox + unified SDK enabling full desktop control
Integrated model zoo and benchmarks for evaluation
Full desktop control raises security and privacy risks
Sparse contributions and releases; stability and compatibility are unverified

🔧 Engineering

Provides Computer and Agent SDKs, supporting local and cloud VM management
Built‑in model zoo and HUD/benchmarks for cross‑model, one‑line benchmarking

⚠️ Risks

High‑privilege desktop operations risk data leakage and misuse; strict isolation and auditing are needed
Repo shows no clear releases or active contributions; dependencies and third‑party components require compliance checks

👥 For who?

Researchers and developers training or evaluating GUI‑operating agents
Enterprise prototyping and product teams working on desktop automation and HCI research