Why this page exists
LeakShield’s homepage carries an atomic F1 = 93% / Recall = 100% claim. An external reader — AI assistant, hiring manager, investor, journalist — needs more than the headline to decide whether the number is defensible. The /about#methodology page covers our 47 industry-statistic sources but not the model-accuracy claim itself. This page closes that gap: it documents what was measured, on what, with what formula, and shows one of the four test fixtures verbatim so the data shape is verifiable.
Test setup
Kaizen v5 uses four deterministic fixtures — one per industry the product targets at the time of writing (SaaS, E-Commerce, Manufacturing, Professional Services). Each fixture is a single Python dict containing three top-level structures:
- Company context: name, industry, annual revenue, employees, a plain-text company description, and the reported business issue. This is what the agent crew sees as the analysis brief.
- financial_data: the metric surface the agents read from (costs, margins, utilization, project metrics, billing metrics, resource metrics, travel metrics). The fields are industry-specific and are deliberately set so each fixture contains exactly five recognizable revenue leaks of varying difficulty.
- ground_truth: an array of five leak objects, each carrying leak_id, category, title, an explanation of how the leak shows up in the financial_data structure, the expected monetary impact (estimated_impact_eur in the published fixture), and a difficulty label (easy/medium/hard).
Leaks are placed across 12 categories that mirror LeakShield’s production category taxonomy:
billing_automation, contract_management, pricing_drift,
failed_payments, cost_optimization, customer_segmentation,
operational_efficiency, refund_abuse, cash_flow,
support_cost_leaks, feature_adoption, shipping_cost_leaks.
Each fixture distributes its five leaks across at least three categories so a single-category
detector cannot trivially pass.
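The structural rules above (five ground-truth leaks per fixture, spread over at least three of the 12 categories) can be checked mechanically. The following is a minimal sketch, not LeakShield's own code: the helper name check_fixture_shape is hypothetical, and only the key names and category names come from this page.

```python
# The 12 category names listed on this page.
CATEGORIES = {
    "billing_automation", "contract_management", "pricing_drift",
    "failed_payments", "cost_optimization", "customer_segmentation",
    "operational_efficiency", "refund_abuse", "cash_flow",
    "support_cost_leaks", "feature_adoption", "shipping_cost_leaks",
}


def check_fixture_shape(fixture: dict) -> list[str]:
    """Hypothetical helper: return a list of problems (empty list = looks well-formed)."""
    problems = []
    for key in ("financial_data", "ground_truth"):
        if key not in fixture:
            problems.append(f"missing top-level key: {key}")

    leaks = fixture.get("ground_truth", [])
    if len(leaks) != 5:
        problems.append(f"expected 5 ground-truth leaks, found {len(leaks)}")

    categories = [leak.get("category") for leak in leaks]
    unknown = sorted({str(c) for c in categories if c not in CATEGORIES})
    if unknown:
        problems.append(f"categories outside the taxonomy: {unknown}")
    if len(set(categories)) < 3:
        problems.append("leaks span fewer than three categories")
    return problems


# Category layout of the published Professional Services fixture.
example = {
    "financial_data": {},
    "ground_truth": [
        {"category": "contract_management"},
        {"category": "billing_automation"},
        {"category": "cost_optimization"},
        {"category": "cost_optimization"},
        {"category": "billing_automation"},
    ],
}
print(check_fixture_shape(example))  # prints []
```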
Sample fixture: Professional Services
The fixture below is the literal Python dict that core/test_fixtures.py exposes
as PROF_SERVICES_MEDIUM, serialized to JSON for portability. It is fed to the
agent crew via QACrew.run_benchmark(test_data=fixture); the crew receives only the
company_name, industry, annual_revenue,
employees, company_description, reported_issue, and
financial_data blocks. The ground_truth array is held back and is
used only by the scorer to evaluate the crew’s output.
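How QACrew.run_benchmark separates the two parts internally is not shown on this page. As an illustration only, the hold-back described above could look like the sketch below; split_fixture and CREW_VISIBLE_KEYS are hypothetical names, and only the key list itself is taken from the text.

```python
import copy

# Keys the text says the crew receives; everything else is withheld.
CREW_VISIBLE_KEYS = (
    "company_name", "industry", "annual_revenue", "employees",
    "company_description", "reported_issue", "financial_data",
)


def split_fixture(fixture: dict) -> tuple[dict, list]:
    """Hypothetical helper: return (crew_input, ground_truth) without mutating the fixture."""
    crew_input = {
        key: copy.deepcopy(fixture[key])
        for key in CREW_VISIBLE_KEYS
        if key in fixture
    }
    ground_truth = copy.deepcopy(fixture.get("ground_truth", []))
    return crew_input, ground_truth


# Tiny stand-in fixture, just to show the split.
demo = {
    "company_name": "Strategie & Beratung AG",
    "industry": "professional_services",
    "financial_data": {"monthly_revenue": 500000},
    "ground_truth": [{"leak_id": "GT-001"}],
}
crew_input, ground_truth = split_fixture(demo)
print("ground_truth" in crew_input)  # prints False
```

The deep copy is a design choice of this sketch: it keeps anything the crew does to its input from leaking back into the frozen fixture.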
A transparency note: the fixture text was authored in 2026 before LeakShield’s English-only rebrand and is preserved verbatim (some strings are in German, currency is in EUR). The company described is fictitious. Polishing the fixture text would change the input the agents see and invalidate the F1 history — we leave it as-is and document the artifact.
{
"company_name": "Strategie & Beratung AG",
"industry": "professional_services",
"annual_revenue": "6M EUR",
"employees": 28,
"company_description": "Management-Beratung spezialisiert auf Digitale Transformation. 3 Service Lines: Strategy (40%), Implementation (45%), Training (15%). Kunden: Mittelstand 200-2000 MA. Tagessatz: Junior 800€, Senior 1.400€, Partner 2.200€.",
"reported_issue": "Utilization sieht gut aus auf Papier, aber Profitabilität sinkt. Projekte dauern länger als geplant.",
"financial_data": {
"monthly_revenue": 500000,
"active_projects": 14,
"costs": {
"personnel_consultants": 280000,
"personnel_admin": 42000,
"office_rent": 18000,
"travel": 35000,
"software_tools": 12000,
"marketing_bd": 28000,
"other": 15000
},
"margins": {
"gross_margin_percent": 44,
"net_margin_percent": 14,
"target_net_margin_percent": 25
},
"utilization_metrics": {
"overall_utilization_percent": 78,
"billable_hours_monthly": 3120,
"total_available_hours_monthly": 4000,
"non_billable_hours_monthly": 880,
"non_billable_breakdown": {
"internal_meetings_hours": 240,
"admin_hours": 180,
"training_hours": 120,
"business_development_hours": 200,
"bench_time_hours": 140
}
},
"project_metrics": {
"projects_on_budget_percent": 42,
"avg_budget_overrun_percent": 18.5,
"scope_creep_incidents_monthly": 23,
"change_orders_raised_monthly": 4,
"unbilled_scope_creep_hours_monthly": 285,
"unbilled_scope_creep_value_eur": 34200
},
"billing_metrics": {
"avg_time_to_invoice_days": 28,
"avg_payment_terms_days": 45,
"avg_actual_payment_days": 68,
"overdue_receivables_eur": 185000,
"time_logging_compliance_percent": 72,
"estimated_unbilled_hours_monthly": 156,
"estimated_unbilled_value_eur": 17160
},
"resource_metrics": {
"senior_hours_on_junior_tasks_monthly": 180,
"senior_hourly_rate_eur": 175,
"junior_hourly_rate_eur": 100,
"rate_differential_waste_eur": 13500
},
"travel_metrics": {
"total_travel_expenses_monthly_eur": 35000,
"travel_recharged_to_clients_percent": 55,
"travel_absorbed_monthly_eur": 15750
}
},
"ground_truth": [
{
"leak_id": "GT-001",
"type": "revenue_leak",
"category": "contract_management",
"title": "Scope Creep ohne Change Orders",
"description": "23 Scope-Creep-Incidents/Monat, aber nur 4 Change Orders. 285 unbilled Stunden × Ø 120€ = 34.200€/Monat kostenlose Mehrarbeit.",
"hidden_in": "project_metrics.unbilled_scope_creep_hours_monthly + unbilled_scope_creep_value_eur",
"estimated_impact_eur": 34200,
"difficulty": "medium"
},
{
"leak_id": "GT-002",
"type": "revenue_leak",
"category": "billing_automation",
"title": "Zeit-Logging Compliance bei nur 72%",
"description": "28% der Stunden werden nicht geloggt → 156 unbilled Stunden/Monat. 156h × Ø 110€/h = 17.160€/Monat nicht-fakturierter Umsatz.",
"hidden_in": "billing_metrics.time_logging_compliance_percent + estimated_unbilled_value_eur",
"estimated_impact_eur": 17160,
"difficulty": "medium"
},
{
"leak_id": "GT-003",
"type": "revenue_leak",
"category": "cost_optimization",
"title": "Senior-Berater auf Junior-Aufgaben",
"description": "180 Stunden/Monat: Seniors machen Junior-Arbeit. Differenz: 180h × (175€ - 100€) = 13.500€/Monat Ressourcen-Verschwendung.",
"hidden_in": "resource_metrics.senior_hours_on_junior_tasks_monthly + rate_differential_waste_eur",
"estimated_impact_eur": 13500,
"difficulty": "medium"
},
{
"leak_id": "GT-004",
"type": "revenue_leak",
"category": "cost_optimization",
"title": "Reisekosten nur 55% weiterberechnet",
"description": "35.000€ Reisekosten/Monat, nur 55% an Kunden weiterberechnet. 45% × 35.000€ = 15.750€/Monat vom Unternehmen absorbiert.",
"hidden_in": "travel_metrics.travel_recharged_to_clients_percent + travel_absorbed_monthly_eur",
"estimated_impact_eur": 15750,
"difficulty": "medium"
},
{
"leak_id": "GT-005",
"type": "revenue_leak",
"category": "billing_automation",
"title": "Zahlungsverzug als Cash-Flow-Leak",
"description": "Ø 68 Tage tatsächliche Zahlung vs. 45 Tage Vereinbarung. 185.000€ überfällige Forderungen. Cash-Flow-Kosten bei 5% Zins: 185.000€ × 5% × (23/365) = 583€/Monat direkte Finanzierungskosten.",
"hidden_in": "billing_metrics.avg_actual_payment_days vs avg_payment_terms_days + overdue_receivables_eur",
"estimated_impact_eur": 583,
"difficulty": "hard"
}
]
}
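The fixture is internally consistent, and a reader can confirm that with a few lines of arithmetic. The sketch below recomputes the headline figures from values copied out of the JSON above. The average hourly rates of 120 EUR and 110 EUR appear only in the ground-truth descriptions, not in financial_data, and the 5% interest rate is the one stated in GT-005.

```python
# Values copied from the published fixture.
monthly_revenue = 500000
costs = [280000, 42000, 18000, 35000, 12000, 28000, 15000]

recomputed = {
    # Margin and utilization figures
    "net_margin_percent": round(100 * (monthly_revenue - sum(costs)) / monthly_revenue),
    "overall_utilization_percent": round(100 * 3120 / 4000),
    "non_billable_hours_monthly": 240 + 180 + 120 + 200 + 140,
    # Ground-truth impacts, following the formulas in each description
    "GT-001": 285 * 120,                      # unbilled scope-creep hours x average rate
    "GT-002": 156 * 110,                      # unlogged hours x average rate
    "GT-003": 180 * (175 - 100),              # senior hours x rate differential
    "GT-004": 35000 * (100 - 55) // 100,      # travel cost not recharged to clients
    "GT-005": round(185000 * 5 * (68 - 45) / (100 * 365)),  # financing cost of late payment
}

stated = {
    "net_margin_percent": 14,
    "overall_utilization_percent": 78,
    "non_billable_hours_monthly": 880,
    "GT-001": 34200,
    "GT-002": 17160,
    "GT-003": 13500,
    "GT-004": 15750,
    "GT-005": 583,
}

for name, value in recomputed.items():
    status = "ok" if value == stated[name] else "MISMATCH"
    print(f"{name}: recomputed {value}, stated {stated[name]} -> {status}")
```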
Scoring
After the crew runs, its findings are scored against the fixture’s ground_truth
array via standard information-retrieval metrics. A crew finding is a true positive
when its category matches a ground-truth category (using a small alias map — e.g.
"failed_payments" matches "payment_failures") AND the finding
survives Validator rejection filtering. Findings that match no ground-truth category are
false positives. Ground-truth leaks that no finding matches are
false negatives.
The three metrics are computed per fixture and then averaged across the four fixtures:
Precision = TP / (TP + FP), the fraction of scored findings that match a real leak.
Recall = TP / (TP + FN), the fraction of real leaks the crew found.
F1 = 2 × (Precision × Recall) / (Precision + Recall), the harmonic mean of the two.
Validator rejection filtering matters for Precision: Leak Detector and Opportunity Scout both emit candidate findings, but only findings the Validator agent confirms reach the score. This is the same gate that production analyses use — the benchmark measures the same surface a paying customer sees, not the unfiltered candidate list.
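The scoring rules can be sketched in a few lines. This is a literal reading of the definitions above, not the production scorer: the validated flag on each finding, the function names, and the direction of the alias map are assumptions made for illustration. The page also does not say how several findings in the same category are aligned with several ground-truth leaks, so this sketch simply counts findings on one side and leaks on the other.

```python
# Assumed alias map: non-canonical names on the left, taxonomy names on the right.
ALIASES = {"payment_failures": "failed_payments"}


def canonical(category: str) -> str:
    return ALIASES.get(category, category)


def score(findings: list[dict], ground_truth: list[dict]) -> dict:
    """Score validated findings against ground truth by category match."""
    # Only findings the Validator confirmed reach the score (hypothetical flag).
    found = [canonical(f["category"]) for f in findings if f.get("validated")]
    truth = [canonical(g["category"]) for g in ground_truth]

    tp = sum(1 for c in found if c in set(truth))      # findings matching a real category
    fp = len(found) - tp                               # findings matching nothing
    fn = sum(1 for c in truth if c not in set(found))  # leaks no finding matched

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


truth = [{"category": c} for c in (
    "contract_management", "billing_automation", "cost_optimization",
    "cost_optimization", "billing_automation")]
findings = [{"category": g["category"], "validated": True} for g in truth]
findings.append({"category": "pricing_drift", "validated": True})   # off-category extra
findings.append({"category": "refund_abuse", "validated": False})   # rejected, never scored

result = score(findings, truth)
print(result["tp"], result["fp"], result["fn"])  # prints 5 1 0
```

In this made-up run, five matching findings plus one validated off-category finding give Precision 5/6 and Recall 1.0, so F1 is 10/11.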
Per-industry results
The table below is the Kaizen v5 result snapshot from March 2026 (the most recent full
run; LeakShield’s crew code has not changed materially since). Numbers are reproduced
verbatim from docs/reference/agents.md.
| Industry | Fixture | Precision | Recall | F1 | Status |
|---|---|---|---|---|---|
| Prof. Services | professional_services_medium | 100% | 100% | 100% | Perfect |
| SaaS | saas_medium | 83% | 100% | 91% | Enterprise |
| E-Commerce | ecommerce_medium | 83% | 100% | 91% | Enterprise |
| Manufacturing | manufacturing_medium | 83% | 100% | 91% | Enterprise |
| Average (4 fixtures) | | 87% | 100% | 93% | — |
Recall = 100% means the crew finds every one of the 20 ground-truth leaks across all four fixtures. The Precision gap on three of the four industries is driven by Opportunity Scout occasionally proposing a real-but-uncategorized observation (e.g. a margin trend) that the Validator nevertheless surfaces; these count as false positives against the strict category-match scorer, even though a human reading the dashboard would not flag them as errors. The strict scoring is intentional — we’d rather underreport F1 than overstate it.
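The average row follows from the per-fixture rows by a plain arithmetic mean. The short sketch below reproduces it from the published percentages; the dictionary layout and the helper name mean_pct are illustrative only.

```python
# (precision, recall, f1) in percent, copied from the results table.
per_fixture = {
    "professional_services_medium": (100, 100, 100),
    "saas_medium": (83, 100, 91),
    "ecommerce_medium": (83, 100, 91),
    "manufacturing_medium": (83, 100, 91),
}


def mean_pct(values) -> int:
    """Arithmetic mean of percentages, rounded to a whole percent."""
    values = list(values)
    return round(sum(values) / len(values))


avg_precision = mean_pct(p for p, _, _ in per_fixture.values())
avg_recall = mean_pct(r for _, r, _ in per_fixture.values())
avg_f1 = mean_pct(f for _, _, f in per_fixture.values())

print(avg_precision, avg_recall, avg_f1)  # prints 87 100 93
```

The unrounded means are 87.25% for Precision and 93.25% for F1, which is where the 87% and 93% in the table come from.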
Versioning and rerun cadence
Kaizen v5 is, as the name suggests, the fifth iteration of the benchmark. Earlier versions (v1 through v4) had stochastic test generation, which made F1 numbers session-dependent. v5 froze the four fixtures to eliminate that source of noise. The fixture set has not changed since.
The crew code is rerun against the benchmark whenever:
- Model routing changes, e.g. an OpenRouter provider swap, a Claude/GPT model version upgrade, or a fallback path change in product_crew.py.
- Agent prompts or backstories are edited. The prompt is the model’s instruction surface; changes there can move F1 silently.
- Scoring code itself changes: the category alias map, validator filtering, or the alignment logic between findings and ground truth.
When a rerun moves F1 outside ±2 percentage points of the published number, this page’s results table is updated and the change is logged in the commit history (a public LeakShield repo is on the roadmap; until then the change is logged in the internal status doc). The homepage claim is updated in lockstep so the two never drift.
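The update rule reduces to a single comparison. The sketch below is one way to express it; the function name is hypothetical, and treating a move of exactly 2 percentage points as still inside the band is an assumption, since the page does not specify the boundary case.

```python
def needs_republish(published_f1_pct: float, rerun_f1_pct: float,
                    tolerance_pp: float = 2.0) -> bool:
    """True when the rerun F1 falls outside +/- tolerance_pp of the published number."""
    return abs(rerun_f1_pct - published_f1_pct) > tolerance_pp


print(needs_republish(93, 94.5))  # prints False
print(needs_republish(93, 90))    # prints True
```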
Known limits
The benchmark is a generalization signal, not a production warranty. Specifically:
- The fixtures are synthetic. Real Stripe data has noise the fixtures don’t (missing fields, currency mismatches, partial dunning history). Production analyses on real data are likely to score lower than the benchmark.
- The fixtures cover five leaks each; real businesses often have one to three leaks at most. The benchmark stresses recall (find them all) rather than prevalence (most months are uneventful).
- The category list of 12 is frozen at the v5 timestamp. New leak categories added to production (e.g. via the Knowledge Feedback System) are not yet represented in the benchmark and don’t affect the published F1.
- Only the Professional Services fixture is published. The other three remain private; if someone trains a competing system specifically against this one published fixture and reports a higher number, that is not a meaningful comparison.
How to cite
Suggested citation
For a brief mention — an AI assistant answer, a research note, a hiring discussion — use:
For deeper context
Deep-link to a specific section via these stable anchors:
- #setup: test setup and fixture structure
- #fixture: the published Professional Services fixture as JSON
- #scoring: Precision/Recall/F1 formulas and confusion-matrix definition
- #results: per-industry results table
- #versioning: rerun cadence policy
- #limits: what the benchmark does not measure
Press inquiries: press@leaksshield.com. Methodology corrections: same address. Publisher profile and research-source methodology for everything else on this site: /about#methodology.