Maigret: Collect and analyze user accounts across thousands of sites by username
An OSINT-oriented username enumeration tool that automatically searches and aggregates account profiles across thousands of sites, provides multi-format reporting and proxy/Tor support, and is suited for preliminary identity linking and lead collection.
GitHub: soxoj/maigret | Updated: 2026-04-30 | Branch: main | Stars: 32.6K | Forks: 2.4K
Tags: Python | CLI & Web UI | OSINT | Username enumeration | Proxy / Tor support | Report export (HTML/PDF/JSON)

💡 Deep Analysis

What exact problem does this project solve? How should I decide whether to adopt it for username-centric cross-site account discovery and information aggregation?

Core Analysis

Project Positioning: Maigret starts from a single username, checks a large number of public sites for a matching account, and collects whatever profile data is publicly available. It then aggregates these heterogeneous sources into machine-readable and visual reports, which is valuable for OSINT, digital forensics, and due diligence.

Technical Features

  • Data-driven site definitions: Per-site request templates and parsing rules are centralized, making extension and fixes manageable.
  • Asynchronous fetching & embeddable library: Python >=3.10 async core with a thin CLI wrapper, enabling integration into larger pipelines.
  • Proxy & anonymity support: Native HTTP/SOCKS5 (Tor) and I2P support for restricted or darknet checks.
  • Recursive ID extraction: Extracts additional usernames/IDs from pages and continues searches to broaden discovery.
  • Multi-format outputs: JSON/NDJSON, HTML, PDF, CSV, interactive graphs, facilitating downstream analysis and human review.
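
The recursive ID extraction described above can be pictured as a breadth-first expansion with a depth limit. The following is a minimal sketch of that idea only, not Maigret's actual implementation; `recursive_search`, `lookup`, and the toy `LINKS` data are hypothetical names introduced for illustration.

```python
from collections import deque


def recursive_search(seed, lookup, max_depth=2):
    """Breadth-first expansion: search each newly found ID once, up to max_depth."""
    seen = {seed}
    queue = deque([(seed, 0)])
    order = []
    while queue:
        username, depth = queue.popleft()
        order.append(username)
        if depth == max_depth:
            continue  # do not expand IDs found at the depth limit
        for found in lookup(username):
            if found not in seen:
                seen.add(found)
                queue.append((found, depth + 1))
    return order


# Toy data standing in for IDs extracted from profile pages (hypothetical).
LINKS = {"alice": ["alice_dev", "a.smith"], "alice_dev": ["alice", "asmith42"]}
found = recursive_search("alice", lambda u: LINKS.get(u, []))
```

The `seen` set keeps two profiles that link to each other from looping forever, and `max_depth` bounds how far the search drifts from the original username.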

Usage Recommendations

  1. Run a pilot: Start with the default scan (maigret username) to evaluate coverage across the top 500 sites; expand with -a or --tags if needed.
  2. Integrate results: Use --json ndjson to feed results into deduplication, scoring, and analyst review pipelines.
  3. Prepare maintenance: Schedule site-definition updates and self-checks (--self-check), and plan for a local fallback.
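
As a rough illustration of the "integrate results" step, the sketch below reads NDJSON lines and keeps one claimed record per profile URL. The field names `site`, `url`, and `status` are assumptions made for this example and are not taken from Maigret's documented report schema; adjust them to the real output before use.

```python
import json
from typing import Iterable


def load_claimed(lines: Iterable[str]) -> list:
    """Parse NDJSON lines and keep one claimed record per profile URL."""
    seen = set()
    claimed = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines between records
        record = json.loads(raw)
        # "status", "url" and "site" are assumed field names for this sketch.
        if record.get("status") != "claimed":
            continue
        key = record.get("url", "").rstrip("/").lower()
        if key and key not in seen:
            seen.add(key)
            claimed.append(record)
    return claimed


sample = [
    '{"site": "ExampleForum", "url": "https://forum.example/u/alice", "status": "claimed"}',
    '{"site": "ExampleForum", "url": "https://forum.example/u/alice/", "status": "claimed"}',
    '{"site": "ExampleBlog", "url": "https://blog.example/alice", "status": "unclaimed"}',
]
unique = load_claimed(sample)
```

Normalizing the URL (trailing slash, case) before deduplicating is a deliberately simple choice; a production pipeline would likely need site-specific canonicalization.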

Important Notes

Compliance risk: Aggregating personal public data may be subject to GDPR/CCPA and site TOS—ensure lawful use and internal compliance.

Summary: Maigret is a strong open-source choice for systematic, locally deployable, and anonymous username-centric discovery, provided you can maintain the parsing rules and do the operational work needed to cope with anti-scraping measures.

90.0%
How do site definitions affect result accuracy? How should I design a maintenance process to ensure long-term effectiveness?

Core Analysis

Core issue: Maigret’s accuracy depends heavily on site definitions that specify how to detect usernameClaimed/usernameUnclaimed and extract profile data. Stale definitions cause false positives/negatives and data loss.

Technical Analysis

  • Why definitions matter: Per-site parsing rules and request templates determine detection and extraction correctness.
  • Causes of breakage: DOM/API changes, localization, or site redesigns break parsing logic.

Recommended Maintenance Process

  1. Site tiering: Classify sites into high/medium/low value and apply stricter SLAs for high-value targets.
  2. Automated self-tests (CI): Maintain test accounts or mocked responses for critical sites; run checks in CI to detect regressions and trigger alerts.
  3. Fast-fix workflow: Version-control definitions, use PRs for fixes, and automate publishing; feed regression data back to the definitions repo.
  4. Fallbacks & rollbacks: Provide the built-in DB as a safe fallback and keep change history for rollbacks.
  5. Monitoring & metrics: Track match rate, failure rate, and parse errors; notify maintainers on anomalies.
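
The monitoring step can be sketched as a small function that computes per-site failure and parse-error rates over a window and flags the sites that exceed a threshold. This is an illustrative sketch; `SiteStats`, `anomalies`, and the threshold values are hypothetical and would need tuning against real scan data.

```python
from dataclasses import dataclass


@dataclass
class SiteStats:
    site: str
    checks: int        # total probes in the window
    fetch_errors: int  # timeouts, 403/429, connection failures
    parse_errors: int  # response fetched, but the rule could not classify it


def anomalies(stats, max_fetch_rate=0.2, max_parse_rate=0.05):
    """Return (site, fetch_rate, parse_rate) for sites above either threshold."""
    flagged = []
    for s in stats:
        if s.checks == 0:
            continue  # no data in this window, nothing to judge
        fetch_rate = s.fetch_errors / s.checks
        parse_rate = s.parse_errors / s.checks
        if fetch_rate > max_fetch_rate or parse_rate > max_parse_rate:
            flagged.append((s.site, round(fetch_rate, 2), round(parse_rate, 2)))
    return flagged


window = [
    SiteStats("ExampleForum", checks=100, fetch_errors=5, parse_errors=1),
    SiteStats("ExampleBlog", checks=50, fetch_errors=20, parse_errors=0),
]
alerts = anomalies(window)
```

Keeping fetch errors and parse errors separate matters: the first usually points at blocking or network problems, the second at a stale definition.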

Practical tips

  • Maintain a stable set of test usernames representative of common templates.
  • Assign confidence scores to heuristics; route low-confidence matches to manual review.
  • Prioritize fixes by business impact to use maintainer resources effectively.
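
One possible way to implement the confidence-scoring tip is to give each detection signal a weight and combine them as if they were independent. The signal names, weights, and threshold below are all hypothetical, and the independence assumption is a simplification chosen for this sketch.

```python
# Hypothetical weights: how much each detection signal is trusted on its own.
SIGNAL_WEIGHTS = {"api_field": 0.9, "status_code": 0.6, "page_text": 0.4}


def confidence(signals):
    """Combine signals assumed independent: 1 - product of (1 - weight)."""
    miss = 1.0
    for name in signals:
        miss *= 1.0 - SIGNAL_WEIGHTS.get(name, 0.0)
    return 1.0 - miss


def route(match, threshold=0.7):
    """Send low-confidence matches to an analyst, accept the rest."""
    if confidence(match["signals"]) >= threshold:
        return "auto_accept"
    return "manual_review"
```

With these example weights, a match supported only by page text goes to manual review, while page text plus a status-code signal clears the threshold.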

Note: Maintaining site definitions requires ongoing investment; neglect will degrade tool effectiveness.

Summary: With tiering, CI-based self-tests, fast repair workflows, and monitoring, you can keep site definitions accurate and Maigret reliable over time.

89.0%
Why does the project use Python async and a data-driven site definitions architecture? What are the pros and cons of this technical design?

Core Analysis

Core question: Python async combined with data-driven site definitions is chosen to address scalability and maintainability for large-scale, IO-bound cross-site enumeration.

Technical Analysis

  • Why async: The task is IO-bound. async enables high concurrency in a single process, reducing resource consumption and latency when probing thousands of sites.
  • Why data-driven site definitions: Encoding per-site request templates and parsing rules in JSON allows:
      • Rapid site additions/fixes without code changes;
      • Automated/community updates (GitHub pull);
      • Support for different extraction types (HTML, API, heuristic).
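
To make the data-driven idea concrete, here is a minimal sketch in which the per-site behavior lives entirely in a JSON document and the code is a generic interpreter for it. The schema (`url`, `check`, `absence_text`) and the site names are invented for illustration and do not reproduce Maigret's real definition format.

```python
import json

# Hypothetical definition format for illustration; not Maigret's real schema.
SITE_DEFS = json.loads("""
{
  "ExampleForum": {
    "url": "https://forum.example/u/{username}",
    "check": "status_code"
  },
  "ExampleBlog": {
    "url": "https://blog.example/{username}",
    "check": "absence_text",
    "absence_text": "User not found"
  }
}
""")


def profile_url(site: str, username: str) -> str:
    """Fill the site's URL template with the username."""
    return SITE_DEFS[site]["url"].format(username=username)


def is_claimed(site: str, status: int, body: str) -> bool:
    """Apply the site's detection rule to an already fetched response."""
    rule = SITE_DEFS[site]
    if rule["check"] == "status_code":
        return status == 200
    if rule["check"] == "absence_text":
        return status == 200 and rule["absence_text"] not in body
    raise ValueError(f"unknown check type: {rule['check']}")
```

Adding or fixing a site in this sketch means editing the JSON only. Because `is_claimed` takes a status and body rather than fetching anything, the same rules can be regression-tested against stored responses.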

Advantages

  • High scalability: Add thousands of sites by updating data files.
  • Operational agility: Fixes can be applied to definitions without deploying new code.
  • Resource efficiency: Async reduces thread/process overhead.

Limitations & Risks

  • Parsing fragility: Definitions are sensitive to DOM/API changes, causing false positives/negatives and requiring continuous maintenance.
  • Shifted complexity: Error handling, proxy pools, and CAPTCHA workarounds become operational concerns and are harder to debug.
  • Operational constraints: Large concurrent scans risk triggering site defenses; rate-limiting, retries, and proxy strategies are necessary.

Practical Recommendations

  1. CI for site definitions: Implement self-checks and regression tests for critical sites.
  2. Expose concurrency controls: Offer rate limits, timeouts, and retry knobs to avoid being blocked.
  3. Use local fallback snapshots: Fall back to built-in DB when update fetches fail.
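
The concurrency and timeout controls in item 2 can be sketched with standard asyncio primitives. In this example `probe` is a stand-in that sleeps instead of making a network request, and `max_connections` and `timeout` are illustrative parameter names rather than Maigret settings.

```python
import asyncio


async def probe(site: str) -> str:
    """Stand-in for a real network check; sleeps instead of fetching."""
    await asyncio.sleep(0.01)
    return f"{site}: checked"


async def scan(sites, max_connections=5, timeout=2.0):
    """Probe all sites with a cap on in-flight requests and a per-site timeout."""
    limiter = asyncio.Semaphore(max_connections)

    async def guarded(site):
        async with limiter:  # at most max_connections probes run at once
            try:
                return await asyncio.wait_for(probe(site), timeout)
            except asyncio.TimeoutError:
                return f"{site}: timeout"

    # gather returns results in the same order as the input sites.
    return await asyncio.gather(*(guarded(s) for s in sites))


results = asyncio.run(scan([f"site{i}" for i in range(20)]))
```

A semaphore limits how many requests are in flight, not how many are sent per second; a true rate limit would need an additional delay or token bucket.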

Summary: The architecture is well-suited for scalable username enumeration, balancing efficiency and maintainability, but requires ongoing definition maintenance and robust network/anti-blocking strategies.

88.0%
In which scenarios is Maigret particularly suitable? What are its clear limitations and what alternative solutions should be considered?

Core Analysis

Core issue: Identify Maigret’s best-fit use cases and its limits so you can choose whether to use it alone or combine it with other solutions.

Well-suited scenarios

  • OSINT initial discovery: Broad public account searches starting from a username to build candidate dossiers.
  • Local/offline deployment needs: Running inside controlled environments without third-party API keys.
  • Anonymous/darknet coverage: Built-in Tor/I2P support helps check .onion/.i2p and geo-restricted sites.
  • Product embedding: Integrate as a detection engine in due diligence or anti-fraud workflows, feeding downstream scoring and analyst review.

Clear limitations

  • No authenticated/paid content: Only public-visible data; cannot replace authenticated APIs or private data sources.
  • Accuracy tied to definition maintenance: Parsing rules require continuous updates.
  • Anti-scraping & rate limits: Maintaining long-term coverage requires proxies and ops investment.
  • Compliance/legal exposure: Aggregation of personal data may be regulated.

Alternatives & complements

  1. Commercial APIs/data vendors: For guaranteed SLA, data completeness, and access to paid/authenticated data.
  2. Custom scrapers for key sites: Build authenticated modules for a small set of critical sites.
  3. Hybrid approach: Use Maigret for broad discovery, then escalate high-value targets to paid services or manual forensic analysis.

Summary: Maigret is a cost-effective option for large-scale public account discovery that can run privately and over anonymizing networks. For authenticated data, enterprise SLAs, or legal-evidence scenarios, supplement it with commercial or custom solutions.

88.0%
What are common failure or false-positive scenarios in actual use, and how should I configure and operate the system to mitigate them?

Core Analysis

Core issue: Fetch failures (403/429/timeouts) and parsing false positives are the most common operational problems for username enumeration tools; they stem from target site protections and fragile site definitions.

Common Scenarios

  • Anti-scraping defenses: High concurrency or repeated requests from the same IP lead to 403/429 or CAPTCHAs.
  • Stale parsing rules: DOM/API changes break usernameClaimed/usernameUnclaimed heuristics causing misclassification.
  • Proxy misconfiguration: A Tor/I2P daemon that is not running, or a wrong SOCKS5 address, causes every proxied request to fail.
  • Resource/throughput strain: Full scans (-a), or scanning many usernames concurrently, consume substantial bandwidth and time.

Mitigation & Operational Recommendations

  1. Rate limit and tier scans: Start with top 500 sites or --tags-based batches to avoid large bursts.
  2. Use a proxy pool: Rotate proxies or Tor circuits for frequently blocked sites; tune concurrency and connection pools.
  3. CI for site definitions: Create self-tests for critical sites and run --self-check regularly to detect parsing regressions.
  4. Post-processing & analyst review: Export --json ndjson for dedup/scoring; route high-value matches to manual validation.
  5. Retry & backoff policies: Implement exponential backoff for 403/429 with limited retries; reattempt timeouts in low-traffic windows.
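
The retry and backoff policy in item 5 might look like the following sketch. It assumes a caller-supplied async `fetch(url)` that returns an HTTP status code; that callable, the function name, and the default delays are hypothetical, and the jitter is an addition not mentioned in the list above.

```python
import asyncio
import random

RETRYABLE = {403, 429}


async def fetch_with_backoff(fetch, url, max_retries=3, base_delay=0.5, max_delay=30.0):
    """Call fetch(url) and retry retryable statuses with capped exponential backoff."""
    for attempt in range(max_retries + 1):
        status = await fetch(url)
        if status not in RETRYABLE or attempt == max_retries:
            return status  # success, non-retryable status, or retries exhausted
        delay = min(max_delay, base_delay * 2 ** attempt)
        # Random jitter spreads retries so they do not arrive in lockstep.
        await asyncio.sleep(random.uniform(0, delay))
```

The retry count is deliberately small: repeated 403 responses usually mean the site is blocking the client, and further retries from the same address only reinforce the block.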

Note: Bypassing CAPTCHAs or violating site TOS can bring legal/compliance risks. Use caution and follow laws and policies.

Summary: Combining rate limits, proxy rotation, automated definition testing, and result post-processing will reduce failures and false positives to manageable levels, but institutional compliance and analyst review remain essential.

87.0%
How can I embed Maigret as a library in my Python async data pipeline? What implementation details should I watch out for?

Core Analysis

Core issue: Maigret exposes an async API suitable for embedding in Python async applications. The main challenge is coordinating event loop usage, concurrency control, and network/proxy configuration.

Technical Analysis

  • How to call: The README states the CLI is a thin wrapper around an async core—so import Maigret and call its async APIs directly to avoid sub-process overhead.
  • Concurrency & resources: Manage concurrency via asyncio.Semaphore or a task pool. Expose concurrency/timeouts as config to avoid triggering anti-scraping defenses.
  • Event loop concerns: Run all Maigret coroutines on the application's existing event loop rather than creating additional loops. Inside async frameworks such as FastAPI, await them directly; from a separate non-async thread, submit them to the running loop with asyncio.run_coroutine_threadsafe.
  • Network/proxy config: Pass proxy/Tor/I2P settings to the library layer so all requests share the same network policy; verify Tor daemon readiness.
  • Result handling: Consume results as JSON/NDJSON or native objects and route them into queues or DBs for deduping, scoring, and analyst review.
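
For the event-loop point, the sketch below shows how synchronous code in another thread can hand work to a single long-lived loop using `asyncio.run_coroutine_threadsafe`. `check_username` is a hypothetical stand-in for whatever async search call the library exposes; it is not a Maigret function.

```python
import asyncio
import threading


async def check_username(username: str) -> dict:
    """Hypothetical stand-in for an async search call."""
    await asyncio.sleep(0.01)
    return {"username": username, "found": []}


# One long-lived loop running in a background thread.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()


def check_from_sync_code(username: str, timeout: float = 5.0) -> dict:
    """Submit the coroutine to the shared loop from a non-async thread."""
    future = asyncio.run_coroutine_threadsafe(check_username(username), loop)
    return future.result(timeout=timeout)


result = check_from_sync_code("alice")
```

Code that already runs inside the loop should simply `await check_username(...)`; the thread-safe bridge is only needed when the caller is outside it.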

Practical Tips

  1. Build an adapter layer: Centralize concurrency, proxy settings, and retry logic; treat Maigret as a fetching engine.
  2. Test & monitor: Add CI checks for critical sites and monitor failure rates and latencies.
  3. Async-friendly integration: await Maigret’s coroutines directly in services (e.g., FastAPI) to minimize overhead.

Note: When embedding, enforce access controls, logging, and analyst verification for high-value matches.

Summary: Embedding Maigret into an async pipeline is efficient and appropriate if you correctly manage the event loop, concurrency, and network settings.

86.0%
How does Maigret's proxy, Tor, and I2P support help access restricted sites in practice? What operational points and risks should I be aware of?

Core Analysis

Core issue: Maigret’s native support for HTTP/SOCKS5 (Tor) and I2P allows checking restricted or anonymous-network sites, but entails trade-offs in latency, stability, and compliance.

Technical & Practical Analysis

  • How it helps: Using SOCKS5 (Tor) or HTTP proxies lets Maigret reach geographically-restricted or darknet-only sites (.onion, .i2p) while hiding the source IP.
  • Coverage benefit: Enables discovery on sites not accessible from conventional networks and helps bypass basic geoblocking.
  • Performance cost: Tor/I2P introduce higher latency and lower throughput, increasing timeouts and scan duration.
  • Stability concerns: Exit node variability and proxy quality issues can cause inconsistent results; some sites block anonymized traffic.

Operational Points

  1. Run daemons beforehand: Ensure Tor/I2P daemons are running and SOCKS5 ports are reachable.
  2. Use a healthy proxy pool: Rotate quality proxies to minimize single-point bans.
  3. Lower concurrency & increase timeouts: Tune settings for anonymity paths to reduce false negatives.
  4. Implement fallback & monitoring: Switch to alternative proxies or retry strategies when anonymous paths fail.
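
The readiness check and fallback in the list above can be approximated with a plain TCP connection test before a scan starts. This sketch only confirms that something is listening on the port, not that it is a working proxy. The candidate list is an assumption: 9050 is the Tor daemon's default SOCKS port, and the second entry is a hypothetical local HTTP proxy.

```python
import socket
from typing import List, Optional
from urllib.parse import urlsplit


def is_reachable(proxy_url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the proxy's host:port succeeds."""
    parts = urlsplit(proxy_url)
    try:
        with socket.create_connection((parts.hostname, parts.port), timeout=timeout):
            return True
    except OSError:
        return False


def first_available(proxies: List[str]) -> Optional[str]:
    """Pick the first proxy whose port accepts connections, in priority order."""
    return next((p for p in proxies if is_reachable(p)), None)


# Assumed local setup: Tor's SOCKS listener first, then a fallback HTTP proxy.
candidates = ["socks5://127.0.0.1:9050", "http://127.0.0.1:8080"]
chosen = first_available(candidates)  # None if neither port is open
```

Running this check up front turns a scan full of confusing timeouts into one clear "proxy not available" condition.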

Risks & Compliance

Compliance risk: Accessing and scraping sites via anonymous networks may violate TOS or local laws; jurisdictions vary—document and assess legal risk.

Summary: Proxy, Tor, and I2P support materially improves coverage for restricted/darknet resources but requires operational safeguards for performance, reliability, and legal compliance.

86.0%

✨ Highlights

  • Supports 3,000+ sites, defaults to scanning top 500 by traffic
  • No API keys required; offers multiple export formats and an embeddable library
  • Can partially bypass blocks and CAPTCHAs, but has practical limitations
  • Involves privacy and legal risks; verify compliance before use

🔧 Engineering

  • Performs recursive username searches and parses profile pages to extract linked IDs for further discovery
  • Provides CLI, built-in web UI, and a Python library; supports HTML/PDF/JSON report exports
  • Supports Tor/I2P and arbitrary HTTP/SOCKS proxies; auto-updates site DB with offline fallback

⚠️ Risks

  • Repository overview shows zero contributors and no releases; verify repo activity and maintainer commitment
  • Large-scale scanning can trigger target site protections or implicate privacy laws; commercial use requires additional compliance safeguards

👥 Who is it for?

  • Suitable for OSINT researchers and security analysts; requires knowledge of HTTP, proxies, and basic scripting/ops
  • Also suitable for enterprise investigation and compliance teams for bulk username checks and initial lead aggregation