openpi: Vision-Language-Action models and training tools for robotics

openpi provides VLA models, checkpoints and DROID tools for robotics research.

GitHub: Physical-Intelligence/openpi | Updated: 2025-09-11 | Branch: main | Stars: 8.2K | Forks: 951

Tags: Python, Robotics, ML, Vision-Language-Action (VLA), Checkpoints & Fine-tuning

💡 Deep Analysis

Should I choose LoRA or full-parameter finetuning? How should I design a staged finetuning workflow?

Core Analysis

Core Question: Choose between LoRA and full-parameter finetuning based on how far your task is from the pretraining data and on your resource budget. LoRA is the low-cost option for quick adaptation; full finetuning is for cases that require deep representation changes.
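
To make the cost difference concrete, the following sketch counts trainable parameters for a single weight matrix under a full update versus a rank-r LoRA update. The layer sizes and rank are illustrative assumptions, not openpi's actual dimensions.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_update_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a LoRA update W + B @ A.

    B has shape (d_out, rank) and A has shape (rank, d_in).
    """
    return rank * (d_out + d_in)


# Illustrative sizes only (assumption): not taken from the openpi models.
d_out, d_in, rank = 2048, 2048, 16
full = full_update_params(d_out, d_in)
lora = lora_update_params(d_out, d_in, rank)
print(f"full: {full:,}  lora: {lora:,}  reduction: {full / lora:.0f}x")
```

Fewer trainable parameters also means fewer gradients and optimizer states to hold in memory, which is where much of the saving tends to come from.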

Technical Analysis

LoRA benefits: Lower memory and compute, fast iteration, and a good fit for adapting with little data (needs more than 22.5 GB of GPU memory).
Full finetune benefits: Better when semantic or visual/action representations must be deeply adapted, but requires much larger resources (more than 70 GB of GPU memory).
Data quality: Use the repo’s data filtering (idle filter) to improve signal-to-noise for any finetuning.
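
As a rough illustration of what an idle filter does, the sketch below drops long runs of near-zero actions from an episode. It is not the repo's implementation: the function names, the threshold, and the assumption that near-zero action norm means "idle" (true for delta or velocity commands, not for absolute joint targets) are all hypothetical.

```python
import math
from typing import List, Sequence


def action_norm(action: Sequence[float]) -> float:
    """Euclidean norm of one action vector."""
    return math.sqrt(sum(a * a for a in action))


def filter_idle_steps(
    actions: Sequence[Sequence[float]],
    threshold: float = 1e-3,
    max_idle_run: int = 5,
) -> List[int]:
    """Return the indices of steps to keep.

    A step is idle when its action norm is below `threshold`. Short pauses
    (up to `max_idle_run` consecutive idle steps) are kept; anything beyond
    that in the same run is dropped.
    """
    keep: List[int] = []
    idle_run = 0
    for i, action in enumerate(actions):
        if action_norm(action) < threshold:
            idle_run += 1
            if idle_run <= max_idle_run:
                keep.append(i)
        else:
            idle_run = 0
            keep.append(i)
    return keep


# Toy episode: one moving step, three idle steps, one moving step.
episode = [[0.1, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.2, 0.0]]
kept = filter_idle_steps(episode, max_idle_run=1)
```

With `max_idle_run=1`, only the first idle step of the pause survives, so `kept` is `[0, 1, 4]`.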

Staged Finetuning Workflow (recommended)

Data prep: Run data filtering and consistency checks.
Zero-shot baseline: Run expert checkpoints to get baseline metrics.
LoRA finetune: Quick, small-batch runs to test improvement on success rate, collision rate, trajectory smoothness.
Decision point: If LoRA meets targets, proceed to deployment; otherwise consider more data or full finetuning.
Full finetune (if needed): Only with sufficient data and compute—use longer schedules and stricter validation.
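
The three metrics named in the LoRA step can be computed from logged evaluation episodes along these lines. This is a minimal sketch: the `Episode` record and the use of mean squared second differences as a smoothness proxy are assumptions made for illustration, not something the repo prescribes.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Episode:
    """Hypothetical log record for one evaluation rollout."""
    success: bool
    collided: bool
    positions: List[float]  # 1-D trajectory sampled at a fixed rate (toy case)


def success_rate(episodes: Sequence[Episode]) -> float:
    return sum(e.success for e in episodes) / len(episodes)


def collision_rate(episodes: Sequence[Episode]) -> float:
    return sum(e.collided for e in episodes) / len(episodes)


def mean_squared_acceleration(positions: Sequence[float]) -> float:
    """Smoothness proxy: mean of squared second differences (lower is smoother)."""
    if len(positions) < 3:
        return 0.0
    second_diffs = [
        positions[i + 1] - 2 * positions[i] + positions[i - 1]
        for i in range(1, len(positions) - 1)
    ]
    return sum(d * d for d in second_diffs) / len(second_diffs)


episodes = [
    Episode(success=True, collided=False, positions=[0, 1, 2, 3]),   # smooth
    Episode(success=False, collided=True, positions=[0, 1, 0, 1]),   # jerky
]
```

Comparing these numbers between the zero-shot baseline and each finetuned checkpoint gives the evidence the decision point needs.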

Important Notice: Set clear metrics and thresholds to avoid unnecessary full finetuning compute costs.
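
One way to keep the decision point honest is to write the thresholds down as code before running any finetuning. The sketch below is a hypothetical helper; the metric names, threshold values, and returned labels are placeholders to adapt to your own targets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Targets:
    """Acceptance thresholds agreed on before finetuning (values are examples)."""
    min_success_rate: float
    max_collision_rate: float


def next_step(
    success_rate: float,
    collision_rate: float,
    targets: Targets,
    can_collect_more_data: bool,
    has_full_finetune_compute: bool,
) -> str:
    """Map LoRA evaluation results onto the staged workflow's decision point."""
    meets_targets = (
        success_rate >= targets.min_success_rate
        and collision_rate <= targets.max_collision_rate
    )
    if meets_targets:
        return "deploy"
    if can_collect_more_data:
        return "collect_more_data_and_rerun_lora"
    if has_full_finetune_compute:
        return "full_finetune"
    return "revisit_targets_or_task_scope"


targets = Targets(min_success_rate=0.8, max_collision_rate=0.05)
decision = next_step(0.65, 0.02, targets, can_collect_more_data=False,
                     has_full_finetune_compute=True)
```

Here the LoRA run misses the success target and no more data is available, so `decision` is `"full_finetune"`. Trying more data before full finetuning is a design choice in this sketch, on the assumption that data is usually cheaper than compute.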

Summary: Use LoRA as the primary low-cost adaptation method and escalate to full finetuning only when necessary.

Why does the project include both flow-based (π₀/π₀.5) and autoregressive-FAST (π₀-FAST) architectures? What are the technical advantages of each?

Core Analysis

Core Question: openpi includes two modeling paradigms to cover different action representations and control requirements: flow-based (continuous probabilistic modeling) and autoregressive-FAST (tokenized sequence generation).

Technical Features and Advantages

Flow-based (π₀ / π₀.5):
Advantage: Models continuous action distributions, enabling diverse sampling and uncertainty representation—beneficial for high-precision or smooth control.
Use cases: Fine manipulation, tasks requiring continuous trajectory sampling or probabilistic exploration.
Autoregressive-FAST (π₀-FAST):
Advantage: Uses the FAST tokenizer to compress action chunks into discrete tokens for autoregressive generation, so actions are predicted the same way a language model predicts text. Decoding is token by token, which is often slower per action chunk than a few flow integration steps, so benchmark latency on your hardware before relying on it for real-time control.
Use cases: Setups that benefit from a standard next-token training recipe, and scenarios requiring explicit action tokens for logging or offline analysis.
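
The two action representations can be contrasted with a toy sketch. The first part integrates a velocity field from noise to an action with Euler steps, which is the general shape of flow-based sampling; the hand-written velocity field stands in for a learned network, and the time convention (noise at t=0, action at t=1) is a choice made for this sketch. The second part is a naive per-dimension binning, shown only to illustrate what "discretizing actions into tokens" means; it is not the FAST tokenizer, which compresses whole action chunks.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    noise: Vector,
    num_steps: int = 10,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 (action)."""
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so 1 - t is never zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Stand-in for a learned model: straight-line field that reaches `target` at t=1."""
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


def uniform_bin_tokens(
    action: Vector, low: float = -1.0, high: float = 1.0, num_bins: int = 256
) -> List[int]:
    """Naive discretization: one integer token per action dimension (NOT FAST)."""
    tokens = []
    for a in action:
        clipped = min(max(a, low), high)
        index = int((clipped - low) / (high - low) * num_bins)
        tokens.append(min(index, num_bins - 1))
    return tokens


target_action = [0.5, -0.25]
noise = [random.gauss(0.0, 1.0) for _ in target_action]
sampled = euler_sample(make_toy_velocity(target_action), noise, num_steps=10)
tokens = uniform_bin_tokens([-1.0, 0.0, 1.0])
```

In the flow case, different noise draws lead to different samples once the velocity field is a learned, observation-conditioned model; in the token case, the output is a short integer sequence that an autoregressive model can predict one token at a time.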

Practical Recommendations

Task-driven selection: Prefer flow-based models for continuous, uncertainty-aware control; consider π₀-FAST when you want discrete action tokens or an autoregressive training recipe, and measure inference latency for both on your hardware.
Hybrid approach: Consider sampling candidate trajectories with flow-based models and selecting via a tokenized autoregressive controller for online execution.

Important Notice: The repo currently supports π₀.5 only with a flow matching head—autoregressive behavior for π₀.5 may need extra implementation.

Summary: The two architectures are complementary—covering continuous probabilistic control and efficient tokenized control—giving flexibility across deployment scenarios.

Under resource constraints (limited GPU memory, single node), how should I configure training and inference to avoid OOM and iterate quickly?

Core Analysis

Core Question: Under limited GPU memory and single-node constraints, the goal is to reduce per-GPU parameter/activation peaks and adopt a staged finetuning strategy for fast iteration.

Technical Analysis

Known thresholds (GPU memory): inference > 8 GB, LoRA finetuning > 22.5 GB, full finetuning > 70 GB (A100/H100 class).
Available techniques:
LoRA: Low-rank adapters greatly cut memory and compute—first choice for finetuning.
FSDP (single-node multi-GPU): Shards parameters/activations across GPUs to lower per-GPU peaks (fsdp_devices).
AMP & gradient checkpointing: Reduce activation memory.
Gradient accumulation: Keep effective batch size without raising per-step memory.
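
Gradient accumulation is easiest to trust once you have seen that it reproduces the full-batch gradient. The sketch below checks this on a one-parameter linear model in plain Python; it is a toy demonstration of the idea, not the repo's training loop, and all names are made up for illustration.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (x, y)


def mse_grad(w: float, batch: Sequence[Sample]) -> float:
    """Gradient of mean((w * x - y)^2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w: float, batch: Sequence[Sample], micro_batch_size: int) -> float:
    """Process the batch in micro-batches, accumulating scaled gradients.

    Each micro-batch gradient is divided by the number of micro-batches, so
    the sum equals the gradient of the full batch while only one micro-batch
    needs to be in memory at a time.
    """
    if len(batch) % micro_batch_size != 0:
        raise ValueError("batch size must be a multiple of micro_batch_size")
    num_micro = len(batch) // micro_batch_size
    total = 0.0
    for i in range(num_micro):
        micro = batch[i * micro_batch_size:(i + 1) * micro_batch_size]
        total += mse_grad(w, micro) / num_micro
    return total


data: List[Sample] = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
full_batch = mse_grad(0.5, data)
accumulated = accumulated_grad(0.5, data, micro_batch_size=2)
```

`full_batch` and `accumulated` agree up to floating-point rounding, which is why the effective batch size can stay fixed while per-step memory drops. The equivalence relies on equal-sized micro-batches and a loss that averages over samples.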

Practical Recommendations (stepwise)

Prefer LoRA for quick adaptation with minimal memory.
Enable FSDP on single-node multi-GPU and tune fsdp_devices to spread memory.
Turn on AMP & checkpointing to lower activation peaks.
Inference optimizations: Lower the batch size or the number of parallel samples to avoid OOM; reducing the number of sampling steps mainly cuts latency rather than peak memory.
Scale down model if memory is still insufficient—use a smaller base for prototyping.
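
A quick way to apply these recommendations is to check which modes a single GPU can plausibly run before launching anything. The helper below is a hypothetical sketch that encodes only the memory thresholds quoted in this section; real requirements vary with model, batch size, and configuration, and sharding with FSDP can lower the per-GPU figure.

```python
from typing import List

# Single-GPU memory thresholds in GB, as quoted in this section.
MEMORY_THRESHOLDS_GB = [
    ("inference", 8.0),
    ("lora_finetune", 22.5),
    ("full_finetune", 70.0),
]


def feasible_modes(gpu_memory_gb: float) -> List[str]:
    """Return the modes whose quoted threshold is below the available memory."""
    return [name for name, needed in MEMORY_THRESHOLDS_GB if gpu_memory_gb > needed]


modes_24gb = feasible_modes(24.0)
modes_80gb = feasible_modes(80.0)
```

A 24 GB card clears the inference and LoRA thresholds but not full finetuning, which matches the advice to start with LoRA.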

Important Notice: The repo currently does not support multi-node training; extending to multi-node requires custom changes or external frameworks.

Summary: Combining LoRA, single-node FSDP, AMP, checkpointing, and gradient accumulation enables feasible training and iteration under constrained resources.

What are the most common pitfalls during deployment and runtime, and how can I avoid or quickly diagnose them?

Core Analysis

Core Question: Deployment failures usually stem from environment and data engineering issues (dependencies, LFS, memory, data format/calibration) rather than the model itself. A systematic debugging process reduces downtime.

Common Pitfalls

Dependency & installation issues: Missing submodules, running uv sync without GIT_LFS_SKIP_SMUDGE=1, or uv environment failures.
Memory / OOM: Misjudged inference/finetuning memory needs or missing FSDP/AMP configuration.
Platform/data mismatch: Camera pose, resolution, or action parameterization differing from training data.
Training script limits: No multi-node support—attempting to scale out will fail.

Fast Diagnosis & Avoidance Steps

Environment check: Prefer the official Docker setup; otherwise clone with git clone --recurse-submodules and install with GIT_LFS_SKIP_SMUDGE=1 uv sync.
Resource validation: Confirm GPU model, drivers, CUDA, and available memory match README requirements.
Data consistency: Verify observation/action formats, coordinate frames, and calibration assumptions.
Staged runs: Execute inference example → LoRA finetune → full finetune to isolate failures.
Logging & monitoring: Collect model outputs, collision events, and OOM stacks to find root causes quickly.
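
The environment and resource checks above can be scripted as a small preflight step. The following is a sketch under assumptions: the function names are made up, it only checks that `git` and `uv` are on the PATH, and it reads GPU memory through `nvidia-smi`, so it says nothing about non-NVIDIA setups or about driver/CUDA compatibility.

```python
import shutil
import subprocess
from typing import List, Sequence


def missing_tools(tools: Sequence[str] = ("git", "uv")) -> List[str]:
    """Return the command-line tools that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def parse_gpu_memory_mib(nvidia_smi_output: str) -> List[int]:
    """Parse one integer (MiB) per line, skipping blank or non-numeric lines."""
    values = []
    for line in nvidia_smi_output.splitlines():
        text = line.strip()
        if text.isdigit():
            values.append(int(text))
    return values


def query_gpu_memory_mib() -> List[int]:
    """Total memory per GPU in MiB, or [] when nvidia-smi is unavailable or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_memory_mib(result.stdout)
```

Running `missing_tools()` and `query_gpu_memory_mib()` before the first inference example turns two of the most common failure causes into an explicit message instead of a stack trace.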

Important Notice: Use Docker when dependency issues persist. When transfer to a new platform fails, run zero-shot tests in simulation and log input/output distributions before starting heavy finetuning.

Summary: Dependency/submodule/LFS correctness + resource checks + staged validation are the keys to avoiding and rapidly diagnosing deployment issues.

How can I determine whether the provided base/expert checkpoints will transfer to my robot arm or sensor configuration?

Core Analysis

Core Question: To decide whether base/expert checkpoints transfer, compare action space, sensor observation distribution, and control interface between your platform and the training setup.

Technical Analysis

Key alignment factors:
Action DOF and parameterization (continuous vs tokenized; joint vs end-effector space)
Control frequency and limits (velocity/acceleration caps change which motions the policy can actually execute)
Vision/sensor setup (camera pose, resolution, calibration, depth/RGB)
Empirical validation steps:
1. Run zero-shot inference in simulation or a safe environment and log failure modes (collisions, missed grasps, erratic motions).
2. Compare training data statistics (if available) with your platform’s observation/action distributions.
3. Use a small amount of target data to run LoRA finetuning and check improvement; if LoRA fails, consider full finetuning.
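
Step 2 can be approximated with a per-dimension comparison of means and standard deviations. The sketch below flags action (or observation) dimensions whose target-platform mean lies far from the training mean; the function names and the z-score threshold are assumptions for illustration, and a mean shift is only one of several ways distributions can differ.

```python
import statistics
from typing import List, Sequence, Tuple

Stats = List[Tuple[float, float]]  # (mean, std) per dimension


def per_dim_stats(samples: Sequence[Sequence[float]]) -> Stats:
    """Mean and population standard deviation for each dimension."""
    dims = list(zip(*samples))
    return [(statistics.fmean(d), statistics.pstdev(d)) for d in dims]


def shifted_dims(
    train_stats: Stats,
    target_samples: Sequence[Sequence[float]],
    z_threshold: float = 2.0,
    eps: float = 1e-8,
) -> List[int]:
    """Indices of dimensions whose target mean is more than `z_threshold`
    training standard deviations away from the training mean."""
    target_stats = per_dim_stats(target_samples)
    if len(target_stats) != len(train_stats):
        raise ValueError("dimension mismatch between training and target data")
    flagged = []
    for i, ((train_mean, train_std), (target_mean, _)) in enumerate(
        zip(train_stats, target_stats)
    ):
        if abs(target_mean - train_mean) / (train_std + eps) > z_threshold:
            flagged.append(i)
    return flagged


train_actions = [[0.0, 0.0], [1.0, 0.1], [2.0, -0.1]]
target_actions = [[1.0, 1.0], [1.2, 1.1]]
flagged = shifted_dims(per_dim_stats(train_actions), target_actions)
```

In this toy data the second dimension is flagged, which would be a cue to check units, sign conventions, or parameterization for that dimension before finetuning. A dimension-count mismatch raises immediately, since it indicates a different action space rather than a shifted one.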

Practical Recommendations

Try zero-shot first with provided expert checkpoints on a similar setup to get a quick signal.
Low-cost finetuning: start with LoRA to evaluate transferability before committing to full finetuning.
Mapping layers: if parameterizations differ, build an intermediate mapping (e.g., end-effector to joint mapping) and jointly finetune it.
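
As an example of an end-effector to joint mapping, the sketch below uses closed-form inverse kinematics for a planar two-link arm. It is a fixed, analytic stand-in chosen to keep the example checkable; a real arm has more joints and would normally use its own kinematics library or a learned mapping, and the link lengths here are arbitrary.

```python
import math
from typing import Tuple


def forward_kinematics(
    q1: float, q2: float, l1: float = 1.0, l2: float = 1.0
) -> Tuple[float, float]:
    """End-effector (x, y) of a planar two-link arm with joint angles q1, q2."""
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y


def inverse_kinematics(
    x: float, y: float, l1: float = 1.0, l2: float = 1.0
) -> Tuple[float, float]:
    """Joint angles reaching (x, y); returns the solution with q2 >= 0.

    Raises ValueError when the target is outside the reachable workspace.
    """
    cos_q2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if abs(cos_q2) > 1.0:
        raise ValueError("target out of reach")
    q2 = math.acos(cos_q2)
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2


# Map an end-effector target to joint angles, then verify the round trip.
target_xy = (1.2, 0.5)
joints = inverse_kinematics(*target_xy)
reached_xy = forward_kinematics(*joints)
```

The round trip (`reached_xy` matching `target_xy`) is the kind of check worth running on any mapping layer before it sits between the policy and the robot.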

Important Notice: Direct transfer to heterogeneous arms or unseen sensor layouts often fails—use simulation verification and staged finetuning.

Summary: Systematically align action/observation stats, run zero-shot tests, then apply staged finetuning (LoRA → full) to quantify transferability and required effort.

What are the most suitable and least suitable application scenarios for openpi? What alternatives exist when it's not appropriate?

Core Analysis

Core Question: Suitability depends on task type (tabletop manipulation vs large-scale mobility), sensor/robot similarity, and available training resources.

Suitable Scenarios

Tabletop manipulation: Folding, grasping, and opening containers. Tasks covered by the training distribution are where the models are strongest.
Quick adaptation on similar platforms: If your robot’s mechanics and camera poses are similar to DROID/ALOHA/LIBERO, base/expert checkpoints plus LoRA finetuning can be effective.

Unsuitable Scenarios

Large-scale mobility / complex navigation: openpi is not trained for navigation or large-scale environments and will likely generalize poorly.
Heterogeneous or uncovered sensors: Uncommon sensors (non-standard cameras, LiDAR, unusual force sensors) complicate transfer.
Resource-limited teams needing full retraining: Reproducing pretraining on 10k+ hours of robot data is infeasible without large compute and data.

Alternatives

Model-based controllers: Preferable when dynamics can be modeled—more stable and interpretable.
Task-specific RL pipelines: For navigation/large-scale tasks, use dedicated RL + sim2real workflows.
Other open-source VLAs/VLMs: If an alternative pretrained model better matches your data distribution, prefer it to reduce transfer cost.

Important Notice: Validate feasibility with simulation and small-scale LoRA finetuning before committing large compute resources.

Summary: openpi is most valuable for tabletop manipulation and closely matched platforms; for mobility, heterogeneous sensors, or low-resource settings, consider alternatives or complementary methods.

✨ Highlights

Provides pretrained VLA base models and expert fine-tuned checkpoints
Supports PyTorch and Docker deployment with training and DROID examples
High hardware requirements: inference >8GB; fine-tuning demands significantly more memory
No formal releases and limited contributors; long-term maintenance and compatibility uncertain

🔧 Engineering

Includes π₀, π₀-FAST and π₀.5 flow/autoregressive VLA models with training and inference pipelines
Provides pretrained weights from 10k+ hours of robot data and DROID fine-tuning examples

⚠️ Risks

High compute dependency: full fine-tuning requires 70GB+ VRAM or complex multi-GPU setups, raising entry barriers
Platform adaptation risk: models were developed for specific robots; cross-platform generalization and plug-and-play usability are limited

👥 Who is it for?

Robotics researchers and engineers seeking end-to-end VLA models and checkpoints
Developers with deep learning and GPU cluster experience, or those wanting to experiment on DROID/ALOHA