Peekaboo: Pixel-accurate screen capture and AI-driven GUI automation for macOS

Peekaboo combines pixel-accurate screen capture, structured UI discovery, and multi-provider AI inference on macOS to deliver reproducible, natural-language-driven GUI automation, with an optional MCP server. It is well suited to visual automation and local-model integration in engineering and testing workflows.

GitHub: steipete/Peekaboo | Updated: 2026-02-02 | Branch: main | Stars: 1.8K | Forks: 102

Tags: Swift/Swift CLI, Node.js (MCP), macOS automation, screen capture, multi-model AI support, local models (Ollama), CLI + MCP server, accessibility & screen-perms

💡 Deep Analysis

What specific automation problem does Peekaboo solve, and how does it implement a closed loop from visual capture to actionable GUI operations?

Core Analysis

Project Positioning: Peekaboo pairs native macOS capture with structured element discovery, linking visual understanding to precise GUI actions. It targets reproducible desktop automation driven by visual question answering (VQA).

Technical Features

Pixel-accurate capture and snapshot-driven flow: Uses native macOS APIs to take Retina and multi-screen captures and to produce a snapshot_id for replay and auditability.
Structured elements and menu discovery: Outputs JSON for menus and menubar items to avoid brittle coordinate probing.
Toolchain + agent layer: Offers atomic actions (see/click/type/drag/...) and links them via models (cloud or local Ollama) into multi-step agents supporting dry-run and resume.

Usage Recommendations

Initial evaluation: Run peekaboo see --app <App> to obtain a snapshot and inspect JSON element detection.
Script authoring: Prefer element IDs over coordinates, and embed snapshot_id in .peekaboo.json for stable replays.
Agent roll-out: Use dry-run and pin models; add assertions/timeouts around critical steps.
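
The initial-evaluation step above can be wrapped in a small script. The following is a minimal sketch, not Peekaboo's documented interface: it builds only the `peekaboo see --app <App>` invocation quoted above and assumes, without confirmation from the source, that the command prints a single JSON document on stdout. The function names and the `runner` parameter are hypothetical and exist so the logic can be exercised without the CLI installed.

```python
import json
import subprocess
from typing import Any, Callable, List


def build_see_command(app: str) -> List[str]:
    """Return the argv for `peekaboo see --app <App>` as quoted in the text."""
    if not app.strip():
        raise ValueError("app name must be non-empty")
    return ["peekaboo", "see", "--app", app]


def run_cli(argv: List[str]) -> str:
    """Run a command and return its stdout; raises if the exit code is non-zero."""
    return subprocess.run(argv, capture_output=True, text=True, check=True).stdout


def see(app: str, runner: Callable[[List[str]], str] = run_cli) -> Any:
    """Run `see` for one app and parse its output.

    ASSUMPTION: stdout is a single JSON document. Inspect the real output
    before relying on any particular key or structure.
    """
    return json.loads(runner(build_see_command(app)))
```

Calling `see("Safari")` on a real machine requires the peekaboo binary and the Screen Recording and Accessibility permissions; passing a fake `runner` lets the parsing path be tested in isolation.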

Important Notes

Important: macOS Screen Recording and Accessibility permissions are required; without them, capture and interactions fail.

Summary: Peekaboo creates an auditable chain from high-fidelity visual capture to structured element identification and deterministic action replay, enabling natural-language or test-driven desktop automation.

What are the reliability and testing advantages of the snapshot-driven architecture, and how should you leverage them in practical scripts?

Core Analysis

Issue Focus: Peekaboo’s snapshot model captures the visual state as a snapshot_id. This yields reproducibility, auditability, and replayability, but requires strategies to handle dynamic UIs.

Technical Analysis

Reproducibility: Snapshots freeze pixels, element IDs, and menu structure, enabling identical replays for regression tests and audits.
Testability: With strongly typed .peekaboo.json scripts, you can assert step outcomes and use --dry-run/--no-fail-fast for safe validation.
Debugging: Failures can be reproduced on a saved snapshot, avoiding the need to recreate the full live environment.

Practical Recommendations

Freeze key snapshots: Capture and embed critical snapshot_ids into .peekaboo.json before multi-step sequences.
Explicit re-capture and checks: Re-run see for volatile regions and assert element presence before acting.
Timeout and retry: Use --wait and retry logic around steps susceptible to animation or loading delays.
CI discipline: Ensure fixed resolution/Retina and screen-index settings when running scripts in CI.
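
The re-capture, assert, and retry recommendations above can be combined into one loop. This is a generic sketch of the pattern rather than anything Peekaboo ships: `capture`, `is_ready`, and `act` are hypothetical hooks that a script would implement around its own `see` calls and actions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def act_with_recapture(
    capture: Callable[[], T],
    is_ready: Callable[[T], bool],
    act: Callable[[T], None],
    attempts: int = 3,
    delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Re-capture, assert readiness, then act. Returns the attempt number used.

    All three callables are hypothetical hooks: `capture` would wrap a fresh
    capture, `is_ready` asserts the expected element is present, and `act`
    performs the step against that fresh state.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        state = capture()  # always act on fresh state, never a stale snapshot
        if is_ready(state):
            act(state)
            return attempt
        if attempt < attempts:
            sleep(delay_s)  # give animations or loading time to settle
    raise TimeoutError(f"element not ready after {attempts} attempts")
```

Raising `TimeoutError` instead of acting on an unverified state keeps the failure branch explicit, which matches the advice to handle outdated snapshots deliberately.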

Important Notes

Important: Snapshots can become outdated if windows move, the theme or language changes, or dynamic content updates. Re-capture or handle failure branches explicitly.

Summary: Treat snapshot_id as the testing and replay foundation; combine with re-capture, assertions and retry strategies to maximize robustness.

Why choose a native macOS (Swift) implementation with an optional MCP (Node.js) server, and what architectural advantages does this combination provide?

Core Analysis

Issue Focus: Peekaboo’s native macOS (Swift) core plus an optional Node.js MCP server balances high-fidelity OS integration with flexible service-level integration.

Technical Analysis

Native Swift benefits: Direct use of macOS Screen Recording and Accessibility APIs ensures pixel accuracy, low-latency event injection, and correct Retina/multi-screen coordinate mapping.
MCP (Node.js) purpose: Offers a lightweight service interface for integration with clients (Claude Desktop, Cursor), handling model provider credentials, concurrency, and inter-process communication.
Dual-mode advantages: CLI is ideal for dev scripts and CI; MCP is better for long-running services and GUI client integration.

Practical Recommendations

Prefer the native binary (Homebrew) for lowest latency and best capture fidelity.
Use MCP when integrating with desktop agents or multi-user workflows (npx @steipete/peekaboo).
Layer model responsibilities: Keep sensitive/local models (Ollama) at the native level; use MCP for cross-tool model adaptation.
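
For the MCP route, desktop clients such as Claude Desktop typically register a server through an `mcpServers` entry in a JSON configuration file. The sketch below generates such an entry under stated assumptions: the package name is the one quoted above (`@steipete/peekaboo`), while the server key `peekaboo` and the `-y` flag (npx's option to skip the install prompt) are illustrative choices. The config file location differs per client, so check the project's README before using this.

```python
import json


def mcp_server_entry(package: str = "@steipete/peekaboo") -> dict:
    """Build an `mcpServers` config fragment that launches the server via npx."""
    return {
        "mcpServers": {
            "peekaboo": {  # hypothetical server key; any unique name works
                "command": "npx",
                "args": ["-y", package],
            }
        }
    }


# Print the fragment so it can be merged into a client's config by hand.
print(json.dumps(mcp_server_entry(), indent=2))
```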

Important Notes

Important: The native implementation limits cross-platform portability. The high-fidelity features are macOS-specific, and MCP cannot substitute for native capture capabilities.

Summary: The Swift + MCP architecture provides pixel-accurate performance at the OS level and flexible integration at the service level, making it suitable for precise desktop automation that needs to integrate with external agents.

What are Peekaboo's precision and limitations in pixel-accurate and multi-screen/Retina scenarios, and how can you improve robustness in complex UIs?

Core Analysis

Issue Focus: Peekaboo supports pixel-accurate capture on Retina displays, but rendering differences and dynamic UIs can reduce precision; engineering mitigations are necessary for robustness.

Technical Analysis

Source of accuracy: Native Swift use of system APIs ensures coordinate/pixel consistency for capture and event injection.
Key limitations: GPU-accelerated compositing, animations, custom-rendered controls, and tiny/transparent UI elements can cause recognition or click offsets.
Multi-screen complexity: Mixed scaling (Retina vs non-Retina) and screen ordering (screen-index) can introduce coordinate mismatches.
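
The mixed-scaling mismatch is easiest to see as arithmetic. macOS positions windows in a global space measured in points, while a capture is measured in pixels, and the ratio differs per display (2.0 on Retina, 1.0 otherwise). The sketch below illustrates that conversion with made-up display geometry; the `Display` type and its fields are assumptions for illustration, not Peekaboo data structures.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Display:
    index: int      # position in the screen list (compare the screen-index setting)
    x: float        # origin in the global point space
    y: float
    width: float    # size in points
    height: float
    scale: float    # 2.0 for Retina, 1.0 for non-Retina

    def contains(self, px: float, py: float) -> bool:
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def point_to_pixel(displays: Sequence[Display], px: float, py: float) -> Tuple[int, int, int]:
    """Map a global point to (display index, pixel x, pixel y) in that display's capture."""
    for d in displays:
        if d.contains(px, py):
            return d.index, round((px - d.x) * d.scale), round((py - d.y) * d.scale)
    raise ValueError(f"point ({px}, {py}) is not on any display")


displays = [
    Display(0, 0, 0, 1512, 982, 2.0),      # Retina laptop panel (example values)
    Display(1, 1512, 0, 1920, 1080, 1.0),  # non-Retina external monitor
]
print(point_to_pixel(displays, 100, 50))   # (0, 200, 100)
print(point_to_pixel(displays, 1612, 50))  # (1, 100, 50)
```

The same 100-point offset lands at pixel 200 on the Retina panel and pixel 100 on the external monitor, which is why a script recorded on one display arrangement can click in the wrong place on another.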

Practical Recommendations (Improve Robustness)

Explicit display settings: Use --retina and --screen-index and keep resolution/scale consistent on target machines.
Prefer structured elements: Use JSON element IDs from see and menu/menubar commands instead of raw coordinates.
Minimize animation: Disable app animations when possible or wait for stable frames before actions.
Add redundancy checks: Assert expected outcomes after clicks and re-capture/retry on failure.
Layered verification: For custom-drawn controls, combine OCR/vision-model checks with pixel-based confirmation.
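
One way to "wait for stable frames" is to capture repeatedly until consecutive frames are identical. The sketch below shows that idea in isolation; `capture` is a hypothetical hook returning raw image bytes, and byte-exact comparison is a deliberately strict simplification.

```python
import hashlib
import time
from typing import Callable, Optional


def wait_for_stable_frame(
    capture: Callable[[], bytes],
    stable_count: int = 2,
    max_frames: int = 20,
    interval_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Return a frame once `stable_count` consecutive captures are byte-identical."""
    if stable_count < 1 or max_frames < stable_count:
        raise ValueError("need 1 <= stable_count <= max_frames")
    last_digest: Optional[bytes] = None
    run = 0  # length of the current streak of identical frames
    for i in range(max_frames):
        frame = capture()
        digest = hashlib.sha256(frame).digest()
        run = run + 1 if digest == last_digest else 1
        last_digest = digest
        if run >= stable_count:
            return frame
        if i < max_frames - 1:
            sleep(interval_s)
    raise TimeoutError(f"no stable frame within {max_frames} captures")
```

A blinking cursor or a clock in the captured region will defeat exact matching, so in practice the comparison would likely be restricted to the region of interest or replaced with a tolerance-based difference.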

Important Notes

Important: Pixel-accurate tooling cannot eliminate inherent visual inference errors; headless or minimized sessions will be especially unreliable.

Summary: With display determinism, element-ID-first practices, waits/retries and dual verification strategies, Peekaboo can be made robust in complex Retina/multi-screen setups, though inherent limits remain.

What are the best-fit use cases and unsuitable scenarios for Peekaboo, and how should one weigh trade-offs when choosing an automation tool?

Core Analysis

Issue Focus: Define Peekaboo’s applicability: it excels in macOS-native, pixel-accurate, replayable automation, but is not suited for cross-platform, headless, or highly custom-rendered UI scenarios.

Technical Analysis (Fit / Not Fit)

Well-suited for:
End-to-end UI and regression testing of macOS native apps (with reproducible snapshots)
RPA tasks interacting with system menus, menubar, and Dock
Prototypes or products that couple VQA/NL agents with desktop actions
Privacy-sensitive deployments needing local models (Ollama)
Not well-suited for:
Cross-platform (Windows/Linux) automation needs
Fully headless or minimized desktop environments
Highly custom rendering (games, complex GPU-powered UIs) where visual recognition is unreliable

Practical Trade-offs

Platform constraint first: If the target is macOS, Peekaboo is a strong candidate; otherwise pick a cross-platform tool.
Visible desktop requirement: Ensure CI/target systems can provide a visible display or virtual framebuffer.
Model/privacy needs: Local model support is a decisive advantage when data cannot leave the host.

Important Notes

Important: Evaluate not just feature parity but also environment guarantees (display, permissions) and long-term model and snapshot maintenance costs.

Summary: Peekaboo is a strong fit for macOS-native GUI automation, reproducible testing, and privacy-first scenarios; consider alternatives for cross-platform or headless use cases.

Compared to traditional coordinate-based RPA or pure visual-QA systems, what are Peekaboo's advantages and trade-offs, and when should you choose or avoid it?

Core Analysis

Issue Focus: Compare Peekaboo to coordinate-based RPA and pure visual-QA systems to understand trade-offs in reliability, testability, and NL-driven automation.

Technical Comparison (Key Points)

Coordinate-based RPA: Simple and often cross-platform, but brittle against UI changes and hard to audit/replay.
Pure VQA systems: Good at understanding visual content, but typically lack reliable event injection and replayable action sequences.
Peekaboo’s hybrid advantage:
Structured element and menu JSON reduces coordinate brittleness;
Snapshot + typed scripts enable replay/testability;
Agent layer converts natural language into auditable multi-step tool calls, filling the action gap of VQA.
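
The brittleness argument can be made concrete with a toy layout change. In the sketch below, the `Element` type and its fields are a hypothetical stand-in for structured element output, not Peekaboo's actual JSON schema. A coordinate recorded against the old layout misses after the button moves, while a lookup by ID still resolves to the right place.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Element:
    # HYPOTHETICAL shape: an identifier plus a frame in points.
    id: str
    x: float
    y: float
    width: float
    height: float

    def center(self) -> Tuple[float, float]:
        return self.x + self.width / 2, self.y + self.height / 2

    def contains(self, px: float, py: float) -> bool:
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def find(elements: Sequence[Element], element_id: str) -> Element:
    """Resolve an element by ID; fail loudly instead of clicking blindly."""
    for el in elements:
        if el.id == element_id:
            return el
    raise KeyError(f"element {element_id!r} not found")


before = [Element("save_button", 100, 200, 80, 30)]
after = [Element("save_button", 100, 260, 80, 30)]  # a banner pushed the button down

recorded = find(before, "save_button").center()       # what a coordinate script stores
print(recorded)                                       # (140.0, 215.0)
print(find(after, "save_button").contains(*recorded)) # False: the old coordinate misses
print(find(after, "save_button").center())            # (140.0, 275.0)
```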

When to choose Peekaboo

Target platform is macOS and deep system UI interaction is required.
You need replayable, testable automation with NL agent capabilities.
Privacy constraints favor local models (Ollama).

When to avoid Peekaboo

You need cross-platform support (Windows/Linux).
Execution must be headless or minimized (Peekaboo relies on visible screens).
The target UI is heavily custom-rendered (games/GPU-heavy) where visual recognition is unreliable.

Important Notes

Important: The trade-off is between “precision & reproducibility (macOS-specific)” versus “cross-platform & headless capability.”

Summary: Peekaboo is a strong hybrid solution for macOS-native automation combining VQA and reliable action replay; choose alternatives if cross-platform or headless execution is a strict requirement.

✨ Highlights

Pixel-accurate captures with Retina scaling and multi-screen support
Built-in natural-language agent chaining for multi-step GUI automation
Supports local and remote AI providers (multi-vendor integration)
Requires macOS Screen Recording and Accessibility permissions
v3 is currently in beta and has several known issues listed in the changelog

🔧 Engineering

Provides a CLI and optional MCP server with unified tools to capture, recognize, and operate UI elements
Compatible with multiple AI providers (GPT‑5.1, Claude, Grok, Gemini, Ollama local models)
Supports reproducible workflows, strict typing, and testable automation scenarios

⚠️ Risks

Repository metadata shows no releases or contributors/commits (0), which may indicate a mirrored index or metadata gap
macOS 15+ and Xcode 16+ requirements limit cross‑platform adoption
Screen recording and accessibility permissions pose privacy and compliance risks for enterprise deployment

👥 For who?

Suitable for macOS automation engineers, SREs, QA/product testers, and tooling integrators
Also fits AI developers needing local vision inference or offline model support
Better suited for users familiar with system permissions and command-line toolchains