Maigret: Collect and analyze user accounts across thousands of sites by username

An OSINT-oriented username enumeration tool that automatically searches and aggregates account profiles across thousands of sites, provides multi-format reporting and proxy/Tor support, and is suited for preliminary identity linking and lead collection.

GitHub: soxoj/maigret · Updated: 2026-04-30 · Branch: main · Stars: 34.4K · Forks: 2.6K

Python · CLI & Web UI · OSINT · Username enumeration · Proxy / Tor support · Report export (HTML/PDF/JSON)

💡 Deep Analysis

What exact problem does this project solve? How should I decide whether to adopt it for username-centric cross-site account discovery and information aggregation?

Core Analysis

Project Positioning: Maigret starts from a single username, automatically checks a large number of public sites for account existence, collects the available profile data, and aggregates these heterogeneous sources into machine-readable and visual reports. This is valuable for OSINT, digital forensics, and due diligence.

Technical Features

Data-driven site definitions: Per-site request templates and parsing rules are centralized, making extension and fixes manageable.
Asynchronous fetching & embeddable library: Python >=3.10 async core with a thin CLI wrapper, enabling integration into larger pipelines.
Proxy & anonymity support: Native HTTP/SOCKS5 (Tor) and I2P support for restricted or darknet checks.
Recursive ID extraction: Extracts additional usernames/IDs from pages and continues searches to broaden discovery.
Multi-format outputs: JSON/NDJSON, HTML, PDF, CSV, interactive graphs, facilitating downstream analysis and human review.

Usage Recommendations

Run a pilot: Start with the default scan (maigret username) to evaluate coverage across the top 500 sites; expand with -a or --tags if needed.
Integrate results: Use --json ndjson to feed results into deduplication, scoring, and analyst review pipelines.
Prepare maintenance: Set up site-definition updates and self-checks (--self-check), and plan for a local fallback.
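
The "integrate results" step can be sketched as a small post-processing pass over NDJSON output. This is a minimal illustration only: the field names (site, url, status) and the sample records are assumptions made for the example, not Maigret's documented output schema, so adjust them to the fields your export actually contains.

```python
import json


def load_ndjson(text):
    """Parse newline-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def dedupe(records, key_fields=("site", "url")):
    """Keep the first record seen for each combination of key fields."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Hypothetical sample data; real exports will have different fields.
sample = "\n".join([
    '{"site": "ExampleForum", "url": "https://forum.example/u/alice", "status": "claimed"}',
    '{"site": "ExampleForum", "url": "https://forum.example/u/alice", "status": "claimed"}',
    '',
    '{"site": "ExamplePhotos", "url": "https://photos.example/alice", "status": "claimed"}',
])
records = dedupe(load_ndjson(sample))
```

Deduplicating on a (site, url) pair is one reasonable choice when several scans of the same username are merged; a scoring step can then run over the reduced list.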

Important Notes

Compliance risk: Aggregating personal public data may be subject to GDPR/CCPA and site TOS—ensure lawful use and internal compliance.

Summary: Maigret is a strong open-source choice for systematic, locally deployable, and anonymous username-centric discovery, provided you can maintain the parsing rules and take on the operational work that anti-scraping limits require.

How do site definitions affect result accuracy? How should I design a maintenance process to ensure long-term effectiveness?

Core Analysis

Core issue: Maigret’s accuracy depends heavily on site definitions that specify how to detect usernameClaimed/usernameUnclaimed and extract profile data. Stale definitions cause false positives/negatives and data loss.

Technical Analysis

Why definitions matter: Per-site parsing rules and request templates determine detection and extraction correctness.
Causes of breakage: DOM/API changes, localization, or site redesigns break parsing logic.

Recommended Maintenance Process

Site tiering: Classify sites into high/medium/low value and apply stricter SLAs for high-value targets.
Automated self-tests (CI): Maintain test accounts or mocked responses for critical sites; run checks in CI to detect regressions and trigger alerts.
Fast-fix workflow: Version-control definitions, use PRs for fixes, and automate publishing; feed regression data back to the definitions repo.
Fallbacks & rollbacks: Provide the built-in DB as a safe fallback and keep change history for rollbacks.
Monitoring & metrics: Track match rate, failure rate, and parse errors; notify maintainers on anomalies.
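
The monitoring step can be sketched as follows. The outcome labels, the baseline values, and the 0.10 tolerance are illustrative assumptions, not values taken from Maigret; the point is only that match rate, failure rate, and parse-error rate are cheap to compute per scan and compare against a known-good baseline.

```python
from collections import Counter


def scan_metrics(outcomes):
    """Compute per-scan rates from outcome labels.

    Expected labels (hypothetical): 'claimed', 'unclaimed',
    'fetch_error', 'parse_error'.
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        return {"match_rate": 0.0, "failure_rate": 0.0, "parse_error_rate": 0.0}
    return {
        "match_rate": counts["claimed"] / total,
        "failure_rate": counts["fetch_error"] / total,
        "parse_error_rate": counts["parse_error"] / total,
    }


def anomalies(metrics, baseline, tolerance=0.10):
    """Return the names of metrics that drift from the baseline by more than tolerance."""
    return sorted(
        name
        for name, value in metrics.items()
        if abs(value - baseline.get(name, value)) > tolerance
    )


outcomes = ["claimed"] * 2 + ["unclaimed"] * 4 + ["fetch_error"] * 3 + ["parse_error"]
metrics = scan_metrics(outcomes)
baseline = {"match_rate": 0.25, "failure_rate": 0.05, "parse_error_rate": 0.10}
flagged = anomalies(metrics, baseline)
```

A non-empty result from anomalies() is the signal that would trigger a maintainer notification.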

Practical tips

Maintain a stable set of test usernames representative of common templates.
Assign confidence scores to heuristics; route low-confidence matches to manual review.
Prioritize fixes by business impact to use maintainer resources effectively.
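
Routing by confidence can be as simple as the sketch below. The confidence field and the 0.7 threshold are assumptions for illustration; how a score is derived for each match is left to your own heuristics.

```python
def route(matches, threshold=0.7):
    """Split matches into auto-accepted and manual-review lists by confidence."""
    accepted, review = [], []
    for match in matches:
        target = accepted if match["confidence"] >= threshold else review
        target.append(match)
    return accepted, review


# Hypothetical matches with scores assigned by an upstream heuristic.
matches = [
    {"site": "ExampleForum", "confidence": 0.95},
    {"site": "ExamplePhotos", "confidence": 0.40},
    {"site": "ExampleCode", "confidence": 0.70},
]
accepted, review = route(matches)
```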

Note: Maintaining site definitions requires ongoing investment; neglect will degrade tool effectiveness.

Summary: With tiering, CI-based self-tests, fast repair workflows, and monitoring, you can keep site definitions accurate and Maigret reliable over time.

89.0%

Why does the project use Python async and a data-driven site definitions architecture? What are the pros and cons of this technical design?

Core Analysis

Core question: Python async combined with data-driven site definitions is chosen to address scalability and maintainability for large-scale, IO-bound cross-site enumeration.

Technical Analysis

Why async: The task is IO-bound. async enables high concurrency in a single process, reducing resource consumption and latency when probing thousands of sites.
Why data-driven site definitions: Encoding per-site request templates and parsing rules in JSON allows rapid site additions and fixes without code changes, automated or community updates (GitHub pull), and support for different extraction types (HTML, API, heuristic).
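
To make the data-driven idea concrete, here is a toy version of the pattern. The schema (url, presence_markers, absence_markers) and the site name are invented for this sketch and are not Maigret's actual definition format; what it shows is that adding or fixing a site means editing data, not code.

```python
from urllib.parse import quote

# Hypothetical definition format, for illustration only.
SITE_DEFINITIONS = {
    "ExampleForum": {
        "url": "https://forum.example/u/{username}",
        "presence_markers": ["profile-header"],
        "absence_markers": ["User not found"],
    },
}


def build_url(site, username):
    """Fill the per-site URL template with a percent-encoded username."""
    template = SITE_DEFINITIONS[site]["url"]
    return template.format(username=quote(username, safe=""))


def classify(definition, status_code, body):
    """Classify a response as 'claimed', 'unclaimed', or 'unknown'."""
    if status_code == 404 or any(m in body for m in definition["absence_markers"]):
        return "unclaimed"
    if status_code == 200 and any(m in body for m in definition["presence_markers"]):
        return "claimed"
    return "unknown"
```

The generic classify() function never changes when a site is added; only the dictionary grows. That is also why a stale marker string silently produces wrong classifications.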

Advantages

High scalability: Add thousands of sites by updating data files.
Operational agility: Fixes can be applied to definitions without deploying new code.
Resource efficiency: Async reduces thread/process overhead.

Limitations & Risks

Parsing fragility: Definitions are sensitive to DOM/API changes, causing false positives/negatives and requiring continuous maintenance.
Shifted complexity: Error handling, proxy pools, and CAPTCHA workarounds become operational concerns and are harder to debug.
Operational constraints: Large concurrent scans risk triggering site defenses; rate-limiting, retries, and proxy strategies are necessary.

Practical Recommendations

CI for site definitions: Implement self-checks and regression tests for critical sites.
Expose concurrency controls: Offer rate limits, timeouts, and retry knobs to avoid being blocked.
Use local fallback snapshots: Fall back to built-in DB when update fetches fail.
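
The fallback recommendation can be sketched as a loader that prefers fresh definitions but never fails the scan when an update cannot be fetched. Everything here is hypothetical: fetch_remote stands for whatever callable retrieves the definitions text in your setup, and BUILTIN_DB stands for a bundled snapshot.

```python
import json

# Hypothetical bundled snapshot of site definitions.
BUILTIN_DB = {"ExampleForum": {"url": "https://forum.example/u/{username}"}}


def load_definitions(fetch_remote, builtin=BUILTIN_DB):
    """Return (definitions, source), falling back to the bundled snapshot on failure."""
    try:
        data = json.loads(fetch_remote())
        if not isinstance(data, dict) or not data:
            raise ValueError("empty or malformed definitions")
        return data, "remote"
    except (OSError, ValueError):
        # Network errors and invalid JSON both land here
        # (json.JSONDecodeError is a ValueError subclass).
        return builtin, "builtin"
```

Returning the source alongside the data makes it easy to log how often the fallback path is taken.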

Summary: The architecture is well-suited for scalable username enumeration, balancing efficiency and maintainability, but requires ongoing definition maintenance and robust network/anti-blocking strategies.

In which scenarios is Maigret particularly suitable? What are its clear limitations and what alternative solutions should be considered?

Core Analysis

Core issue: Identify Maigret’s best-fit use cases and its limits so you can choose whether to use it alone or combine it with other solutions.

Well-suited scenarios

OSINT initial discovery: Broad public account searches starting from a username to build candidate dossiers.
Local/offline deployment needs: Running inside controlled environments without third-party API keys.
Anonymous/darknet coverage: Built-in Tor/I2P support helps check .onion/.i2p and geo-restricted sites.
Product embedding: Integrate as a detection engine in due diligence or anti-fraud workflows, feeding downstream scoring and analyst review.

Clear limitations

No authenticated/paid content: Only publicly visible data is collected; Maigret cannot replace authenticated APIs or private data sources.
Accuracy tied to definition maintenance: Parsing rules require continuous updates.
Anti-scraping & rate limits: Maintaining long-term coverage requires proxies and ops investment.
Compliance/legal exposure: Aggregation of personal data may be regulated.

Alternatives & complements

Commercial APIs/data vendors: For guaranteed SLA, data completeness, and access to paid/authenticated data.
Custom scrapers for key sites: Build authenticated modules for a small set of critical sites.
Hybrid approach: Use Maigret for broad discovery, then escalate high-value targets to paid services or manual forensic analysis.

Summary: Maigret is a cost-effective option for large-scale public account discovery that can run privately and anonymously. For authenticated data, enterprise SLAs, or legal-evidence scenarios, supplement it with commercial or custom solutions.

What are common failure or false-positive scenarios in actual use, and how should I configure and operate the system to mitigate them?

Core Analysis

Core issue: Fetch failures (403/429/timeouts) and parsing false positives are the most common operational problems for username enumeration tools; they stem from target site protections and fragile site definitions.

Common Scenarios (based on evidence)

Anti-scraping defenses: High concurrency or repeated requests from the same IP lead to 403/429 or CAPTCHAs.
Stale parsing rules: DOM/API changes break usernameClaimed/usernameUnclaimed heuristics, causing misclassification.
Proxy misconfiguration: Tor/I2P daemons that are not running, or an incorrect SOCKS5 configuration, block access.
Resource/throughput strain: Full scans (-a) or many usernames scanned concurrently consume substantial network bandwidth and time.

Mitigation & Operational Recommendations

Rate limit and tier scans: Start with top 500 sites or --tags-based batches to avoid large bursts.
Use a proxy pool: Rotate proxies or Tor circuits for frequently blocked sites; tune concurrency and connection pools.
CI for site definitions: Create self-tests for critical sites and run --self-check regularly to detect parsing regressions.
Post-processing & analyst review: Export --json ndjson for dedup/scoring; route high-value matches to manual validation.
Retry & backoff policies: Implement exponential backoff for 403/429 with limited retries; reattempt timeouts in low-traffic windows.
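
A retry policy along these lines can be sketched with asyncio. The fetch callable is a hypothetical stand-in that returns an HTTP status code; it is not a Maigret function. As a simplification, timeouts are retried with the same backoff rather than deferred to a low-traffic window.

```python
import asyncio
import random

RETRYABLE = {403, 429}


async def fetch_with_backoff(fetch, url, max_retries=3, base_delay=1.0,
                             sleep=asyncio.sleep):
    """Retry `fetch(url)` on 403/429 or timeout, doubling the delay each attempt."""
    status = None
    for attempt in range(max_retries + 1):
        try:
            status = await fetch(url)
        except asyncio.TimeoutError:
            status = None
        if status is not None and status not in RETRYABLE:
            return status
        if attempt < max_retries:
            # Exponential delay plus a little jitter to avoid synchronized retries.
            await sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay / 2))
    return status  # last observed status, or None if every attempt timed out


async def _demo():
    statuses = iter([429, 429, 200])
    delays = []

    async def fake_fetch(url):
        return next(statuses)

    async def fake_sleep(seconds):
        delays.append(seconds)

    status = await fetch_with_backoff(
        fake_fetch, "https://forum.example/u/alice", sleep=fake_sleep
    )
    return status, delays


final_status, recorded_delays = asyncio.run(_demo())
```

Injecting the sleep function keeps the policy testable without real waiting; in production the default asyncio.sleep applies.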

Note: Bypassing CAPTCHAs or violating site TOS can bring legal/compliance risks. Use caution and follow laws and policies.

Summary: Combining rate limits, proxy rotation, automated definition testing, and result post-processing will reduce failures and false positives to manageable levels, but institutional compliance and analyst review remain essential.

How can I embed Maigret as a library in my Python async data pipeline? What implementation details should I watch out for?

Core Analysis

Core issue: Maigret exposes an async API suitable for embedding in Python async applications. The main challenge is coordinating event loop usage, concurrency control, and network/proxy configuration.

Technical Analysis

How to call: The README states that the CLI is a thin wrapper around an async core, so you can import Maigret and call its async APIs directly to avoid subprocess overhead.
Concurrency & resources: Manage concurrency via asyncio.Semaphore or a task pool. Expose concurrency/timeouts as config to avoid triggering anti-scraping defenses.
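Event loop concerns: Run Maigret’s coroutines on the application’s existing event loop rather than creating additional loops. From another thread, submit work with asyncio.run_coroutine_threadsafe; inside frameworks like FastAPI, await the coroutine or schedule it as a task on the running loop.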
Network/proxy config: Pass proxy/Tor/I2P settings to the library layer so all requests share the same network policy; verify Tor daemon readiness.
Result handling: Consume results as JSON/NDJSON or native objects and route them into queues or DBs for deduping, scoring, and analyst review.
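
The concurrency and timeout points can be sketched as an adapter around a per-site check. Note that check_site below is a placeholder written for this example, not Maigret's API; in a real integration you would replace its body with a call into the library's async core after consulting its documentation.

```python
import asyncio


async def check_site(site, username):
    """Hypothetical stand-in for a call into the library's async core."""
    await asyncio.sleep(0)
    return {"site": site, "username": username, "status": "unknown"}


async def run_checks(sites, username, max_concurrency=10, timeout=15.0):
    """Run per-site checks with a concurrency cap and a per-check timeout."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(site):
        async with semaphore:
            try:
                return await asyncio.wait_for(check_site(site, username), timeout)
            except asyncio.TimeoutError:
                return {"site": site, "username": username, "status": "timeout"}

    # gather() returns results in the same order as the input sites.
    return await asyncio.gather(*(guarded(site) for site in sites))


results = asyncio.run(
    run_checks(["ExampleForum", "ExamplePhotos"], "alice", max_concurrency=1)
)
```

Inside a service that already runs an event loop (for example a FastAPI handler), you would await run_checks(...) directly instead of calling asyncio.run.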

Practical Tips

Build an adapter layer: Centralize concurrency, proxy settings, and retry logic; treat Maigret as a fetching engine.
Test & monitor: Add CI checks for critical sites and monitor failure rates and latencies.
Async-friendly integration: await Maigret’s coroutines directly in services (e.g., FastAPI) to minimize overhead.

Note: When embedding, enforce access controls, logging, and analyst verification for high-value matches.

Summary: Embedding Maigret into an async pipeline is efficient and appropriate if you correctly manage the event loop, concurrency, and network settings.

How does Maigret's proxy, Tor, and I2P support help access restricted sites in practice? What operational points and risks should I be aware of?

Core Analysis

Core issue: Maigret’s native support for HTTP/SOCKS5 (Tor) and I2P allows checking restricted or anonymous-network sites, but entails trade-offs in latency, stability, and compliance.

Technical & Practical Analysis

How it helps: Using SOCKS5 (Tor) or HTTP proxies lets Maigret reach geographically restricted or darknet-only sites (.onion, .i2p) while hiding the source IP.
Coverage benefit: Enables discovery on sites not accessible from conventional networks and helps bypass basic geoblocking.
Performance cost: Tor/I2P introduce higher latency and lower throughput, increasing timeouts and scan duration.
Stability concerns: Exit node variability and proxy quality issues can cause inconsistent results; some sites block anonymized traffic.

Operational Points

Run daemons beforehand: Ensure Tor/I2P daemons are running and SOCKS5 ports are reachable.
Use a healthy proxy pool: Rotate quality proxies to minimize single-point bans.
Lower concurrency & increase timeouts: Tune settings for anonymity paths to reduce false negatives.
Implement fallback & monitoring: Switch to alternative proxies or retry strategies when anonymous paths fail.
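
The readiness check and fallback can be sketched with the standard library. Port 9050 is Tor's default SOCKS port; the backup proxy on port 1080 and the concurrency and timeout numbers are illustrative assumptions, not recommended values.

```python
import socket


def port_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_path(paths, is_up=port_reachable):
    """Return the first network path whose proxy endpoint is reachable, else None."""
    for path in paths:
        if is_up(path["host"], path["port"]):
            return path
    return None


# Ordered by preference; anonymity paths get lower concurrency and longer timeouts.
PATHS = [
    {"name": "tor", "host": "127.0.0.1", "port": 9050,
     "max_concurrency": 5, "timeout": 45},
    {"name": "backup-socks", "host": "127.0.0.1", "port": 1080,
     "max_concurrency": 10, "timeout": 30},
]
```

Calling pick_path(PATHS) before a scan and aborting when it returns None avoids silently sending traffic over an unintended route.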

Risks & Compliance

Compliance risk: Accessing and scraping sites via anonymous networks may violate TOS or local laws; jurisdictions vary, so document and assess the legal risk.

Summary: Proxy, Tor, and I2P support materially improves coverage for restricted/darknet resources but requires operational safeguards for performance, reliability, and legal compliance.

✨ Highlights

Supports 3,000+ sites; defaults to scanning the top 500 by traffic
No API keys required; offers multiple export formats and an embeddable library
Can partially bypass blocks and CAPTCHAs, but has practical limitations
Involves privacy and legal risks; verify compliance before use

🔧 Engineering

Performs recursive username searches and parses profile pages to extract linked IDs for further discovery
Provides CLI, built-in web UI, and a Python library; supports HTML/PDF/JSON report exports
Supports Tor/I2P and arbitrary HTTP/SOCKS proxies; auto-updates site DB with offline fallback

⚠️ Risks

Repository overview shows zero contributors and no releases; verify repo activity and maintainer commitment
Large-scale scanning can trigger target site protections or implicate privacy laws; commercial use requires additional compliance safeguards

👥 Who is it for?

Suitable for OSINT researchers and security analysts; requires knowledge of HTTP, proxies, and basic scripting/ops
Also suitable for enterprise investigation and compliance teams for bulk username checks and initial lead aggregation