Fixtures: 4 industries
Ground-truth leaks: 5 per fixture (20 total)
Avg F1: 93%
Recall: 100% (all 20 leaks found)
Determinism: Hand-crafted, no RNG
Published fixtures: 1 of 4 (Prof. Services)

Why this page exists

LeakShield’s homepage carries a single headline claim: F1 = 93%, Recall = 100%. An external reader (AI assistant, hiring manager, investor, journalist) needs more than the headline to decide whether the number is defensible. The /about#methodology page covers our 47 industry-statistic sources but not the model-accuracy claim itself. This page closes that gap: it documents what was measured, on which data, and with which formulas, and it shows one of the four test fixtures verbatim so the data shape is verifiable.

Test setup

Kaizen v5 uses four deterministic fixtures, one per industry the product targets at the time of writing (SaaS, E-Commerce, Manufacturing, Professional Services). Each fixture is a single Python dict whose top-level keys fall into three groups:

Company context (company_name, industry, annual_revenue, employees, company_description, reported_issue): the company profile plus a plain-text description and the reported business issue. This is what the agent crew sees as the analysis brief.
financial_data: the metric surface the agents read from. The fields are industry-specific (in the published fixture: costs, margins, utilization, project, billing, resource, and travel metrics) and are deliberately set so that each fixture contains exactly five recognizable revenue leaks of varying difficulty.
ground_truth: an array of five leak objects, each carrying leak_id, type, category, title, a description of the leak, a hidden_in pointer to where the leak shows up in financial_data, the expected monetary impact (estimated_impact_eur in the published fixture), and a difficulty label (easy/medium/hard).

Leaks are placed across 12 categories that mirror LeakShield’s production category taxonomy: billing_automation, contract_management, pricing_drift, failed_payments, cost_optimization, customer_segmentation, operational_efficiency, refund_abuse, cash_flow, support_cost_leaks, feature_adoption, shipping_cost_leaks. Each fixture distributes its five leaks across at least three categories so a single-category detector cannot trivially pass.
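
The structural constraints described above (five leaks per fixture, categories drawn from the 12-name taxonomy, at least three distinct categories) can be checked mechanically. The following is a minimal sketch, assuming the fixture is loaded as a plain dict; `check_fixture_shape` is a hypothetical helper written for illustration and is not part of LeakShield’s code.

```python
# The 12 category names listed in the text above.
CATEGORIES = frozenset({
    "billing_automation", "contract_management", "pricing_drift",
    "failed_payments", "cost_optimization", "customer_segmentation",
    "operational_efficiency", "refund_abuse", "cash_flow",
    "support_cost_leaks", "feature_adoption", "shipping_cost_leaks",
})


def check_fixture_shape(fixture: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape is OK.

    Hypothetical helper: the three rules below are the ones stated in the
    text, not a copy of any real validation code.
    """
    problems = []
    leaks = fixture.get("ground_truth", [])
    if len(leaks) != 5:
        problems.append(f"expected 5 ground-truth leaks, found {len(leaks)}")
    used = {leak.get("category", "<missing>") for leak in leaks}
    unknown = used - CATEGORIES
    if unknown:
        problems.append(f"categories outside the taxonomy: {sorted(unknown)}")
    if len(used) < 3:
        problems.append(f"leaks span only {len(used)} categories, need at least 3")
    return problems
```

Applied to the published Professional Services fixture, whose five leaks use contract_management, billing_automation, and cost_optimization, this check would return an empty list.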

What we don’t publish: Three of the four fixtures (SaaS, E-Commerce, Manufacturing) remain private. Publishing all four would let a future model variant be tuned against the public test set rather than the problem — the benchmark would stop being a falsifiable signal of generalization. The published fixture (Professional Services, below) is sufficient to verify the data shape, the ground-truth format, and the leak-difficulty range. Independent replication of the full benchmark requires our hand-crafted private fixtures, but a third party can construct their own analogous fixtures from this template.

Sample fixture: Professional Services

The fixture below is the literal Python dict that core/test_fixtures.py exposes as PROF_SERVICES_MEDIUM, serialized to JSON for portability. It is fed to the agent crew via QACrew.run_benchmark(test_data=fixture); the crew receives only the company_name, industry, annual_revenue, employees, company_description, reported_issue, and financial_data blocks. The ground_truth array is held back and is used only by the scorer to evaluate the crew’s output.
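
The hold-back described above can be pictured as a simple split of the fixture dict. This is a sketch under stated assumptions: `split_fixture` and `CREW_VISIBLE_KEYS` are hypothetical names, and the internals of `QACrew.run_benchmark` are not shown on this page, so the real code may separate the two parts differently.

```python
import copy

# Keys the text says the crew receives; everything else is withheld.
CREW_VISIBLE_KEYS = (
    "company_name", "industry", "annual_revenue", "employees",
    "company_description", "reported_issue", "financial_data",
)


def split_fixture(fixture: dict) -> tuple[dict, list]:
    """Separate what the crew may see from what only the scorer may see.

    Hypothetical helper. Deep copies are used so that nothing the crew does
    to its input can alter the fixture or the ground truth.
    """
    crew_input = {
        key: copy.deepcopy(fixture[key])
        for key in CREW_VISIBLE_KEYS
        if key in fixture
    }
    ground_truth = copy.deepcopy(fixture.get("ground_truth", []))
    return crew_input, ground_truth
```

An allow-list of visible keys, rather than deleting `ground_truth` from a copy, means any field added to a fixture later stays hidden from the crew by default.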

A transparency note: the fixture text was authored in 2026 before LeakShield’s English-only rebrand and is preserved verbatim (some strings are in German, currency is in EUR). The company described is fictitious. Polishing the fixture text would change the input the agents see and invalidate the F1 history — we leave it as-is and document the artifact.

fixtures/professional_services_medium.json 118 lines · UTF-8 · LeakShield AI, 2026

{
  "company_name": "Strategie & Beratung AG",
  "industry": "professional_services",
  "annual_revenue": "6M EUR",
  "employees": 28,
  "company_description": "Management-Beratung spezialisiert auf Digitale Transformation. 3 Service Lines: Strategy (40%), Implementation (45%), Training (15%). Kunden: Mittelstand 200-2000 MA. Tagessatz: Junior 800€, Senior 1.400€, Partner 2.200€.",
  "reported_issue": "Utilization sieht gut aus auf Papier, aber Profitabilität sinkt. Projekte dauern länger als geplant.",
  "financial_data": {
    "monthly_revenue": 500000,
    "active_projects": 14,
    "costs": {
      "personnel_consultants": 280000,
      "personnel_admin": 42000,
      "office_rent": 18000,
      "travel": 35000,
      "software_tools": 12000,
      "marketing_bd": 28000,
      "other": 15000
    },
    "margins": {
      "gross_margin_percent": 44,
      "net_margin_percent": 14,
      "target_net_margin_percent": 25
    },
    "utilization_metrics": {
      "overall_utilization_percent": 78,
      "billable_hours_monthly": 3120,
      "total_available_hours_monthly": 4000,
      "non_billable_hours_monthly": 880,
      "non_billable_breakdown": {
        "internal_meetings_hours": 240,
        "admin_hours": 180,
        "training_hours": 120,
        "business_development_hours": 200,
        "bench_time_hours": 140
      }
    },
    "project_metrics": {
      "projects_on_budget_percent": 42,
      "avg_budget_overrun_percent": 18.5,
      "scope_creep_incidents_monthly": 23,
      "change_orders_raised_monthly": 4,
      "unbilled_scope_creep_hours_monthly": 285,
      "unbilled_scope_creep_value_eur": 34200
    },
    "billing_metrics": {
      "avg_time_to_invoice_days": 28,
      "avg_payment_terms_days": 45,
      "avg_actual_payment_days": 68,
      "overdue_receivables_eur": 185000,
      "time_logging_compliance_percent": 72,
      "estimated_unbilled_hours_monthly": 156,
      "estimated_unbilled_value_eur": 17160
    },
    "resource_metrics": {
      "senior_hours_on_junior_tasks_monthly": 180,
      "senior_hourly_rate_eur": 175,
      "junior_hourly_rate_eur": 100,
      "rate_differential_waste_eur": 13500
    },
    "travel_metrics": {
      "total_travel_expenses_monthly_eur": 35000,
      "travel_recharged_to_clients_percent": 55,
      "travel_absorbed_monthly_eur": 15750
    }
  },
  "ground_truth": [
    {
      "leak_id": "GT-001",
      "type": "revenue_leak",
      "category": "contract_management",
      "title": "Scope Creep ohne Change Orders",
      "description": "23 Scope-Creep-Incidents/Monat, aber nur 4 Change Orders. 285 unbilled Stunden × Ø 120€ = 34.200€/Monat kostenlose Mehrarbeit.",
      "hidden_in": "project_metrics.unbilled_scope_creep_hours_monthly + unbilled_scope_creep_value_eur",
      "estimated_impact_eur": 34200,
      "difficulty": "medium"
    },
    {
      "leak_id": "GT-002",
      "type": "revenue_leak",
      "category": "billing_automation",
      "title": "Zeit-Logging Compliance bei nur 72%",
      "description": "28% der Stunden werden nicht geloggt → 156 unbilled Stunden/Monat. 156h × Ø 110€/h = 17.160€/Monat nicht-fakturierter Umsatz.",
      "hidden_in": "billing_metrics.time_logging_compliance_percent + estimated_unbilled_value_eur",
      "estimated_impact_eur": 17160,
      "difficulty": "medium"
    },
    {
      "leak_id": "GT-003",
      "type": "revenue_leak",
      "category": "cost_optimization",
      "title": "Senior-Berater auf Junior-Aufgaben",
      "description": "180 Stunden/Monat: Seniors machen Junior-Arbeit. Differenz: 180h × (175€ - 100€) = 13.500€/Monat Ressourcen-Verschwendung.",
      "hidden_in": "resource_metrics.senior_hours_on_junior_tasks_monthly + rate_differential_waste_eur",
      "estimated_impact_eur": 13500,
      "difficulty": "medium"
    },
    {
      "leak_id": "GT-004",
      "type": "revenue_leak",
      "category": "cost_optimization",
      "title": "Reisekosten nur 55% weiterberechnet",
      "description": "35.000€ Reisekosten/Monat, nur 55% an Kunden weiterberechnet. 45% × 35.000€ = 15.750€/Monat vom Unternehmen absorbiert.",
      "hidden_in": "travel_metrics.travel_recharged_to_clients_percent + travel_absorbed_monthly_eur",
      "estimated_impact_eur": 15750,
      "difficulty": "medium"
    },
    {
      "leak_id": "GT-005",
      "type": "revenue_leak",
      "category": "billing_automation",
      "title": "Zahlungsverzug als Cash-Flow-Leak",
      "description": "Ø 68 Tage tatsächliche Zahlung vs. 45 Tage Vereinbarung. 185.000€ überfällige Forderungen. Cash-Flow-Kosten bei 5% Zins: 185.000€ × 5% × (23/365) = 583€/Monat direkte Finanzierungskosten.",
      "hidden_in": "billing_metrics.avg_actual_payment_days vs avg_payment_terms_days + overdue_receivables_eur",
      "estimated_impact_eur": 583,
      "difficulty": "hard"
    }
  ]
}
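
The figures in the fixture are internally consistent, and a reader can confirm this with a few lines of arithmetic. The sketch below recomputes the derived values from the inputs in the JSON above; the check names are labels chosen here for illustration, and the tolerance of 0.5 only absorbs rounding to whole euros.

```python
import math

# (computed from the fixture's inputs, value stated in the fixture)
checks = {
    "utilization_percent": (3120 / 4000 * 100, 78),
    "non_billable_hours": (240 + 180 + 120 + 200 + 140, 880),
    "GT-001 scope creep": (285 * 120, 34200),
    "GT-002 unlogged time": (156 * 110, 17160),
    "GT-003 rate differential": (180 * (175 - 100), 13500),
    "GT-004 absorbed travel": (0.45 * 35000, 15750),
    "GT-005 financing cost": (185000 * 0.05 * 23 / 365, 583),
}

for name, (computed, stated) in checks.items():
    ok = math.isclose(computed, stated, abs_tol=0.5)
    print(f"{name}: computed={computed:.2f} stated={stated} {'ok' if ok else 'MISMATCH'}")
```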

Scoring

After the crew runs, its findings are scored against the fixture’s ground_truth array via standard information-retrieval metrics. A crew finding is a true positive when its category matches a ground-truth category (using a small alias map — e.g. "failed_payments" matches "payment_failures") AND the finding survives Validator rejection filtering. Findings that match no ground-truth category are false positives. Ground-truth leaks that no finding matches are false negatives.
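
A category-level scorer of the kind described above might look like the sketch below. Several details are assumptions made for illustration: the `validated` flag on findings, the contents of `ALIASES` beyond the one pair named in the text, and the rule that a ground-truth leak counts as found when any kept finding shares its category. The production scorer's alignment logic is not published and may differ.

```python
# Assumed alias map; only the failed_payments / payment_failures pair
# is named in the text.
ALIASES = {"payment_failures": "failed_payments"}


def canonical(category: str) -> str:
    """Map an alias to its canonical category name."""
    return ALIASES.get(category, category)


def confusion_counts(findings: list[dict], ground_truth: list[dict]) -> tuple[int, int, int]:
    """Return (TP, FP, FN) under a category-level match."""
    # Only findings the Validator confirmed are scored
    # ("validated" is an assumed field name).
    kept = [f for f in findings if f.get("validated", False)]
    truth_cats = {canonical(g["category"]) for g in ground_truth}
    found_cats = {canonical(f["category"]) for f in kept}
    tp = sum(1 for f in kept if canonical(f["category"]) in truth_cats)
    fp = len(kept) - tp
    fn = sum(1 for g in ground_truth if canonical(g["category"]) not in found_cats)
    return tp, fp, fn
```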

The three metrics are computed per fixture and then averaged across the four fixtures:

Precision = TP / (TP + FP) — fraction of crew findings that match a real leak.
Recall = TP / (TP + FN) — fraction of real leaks the crew found.
F1 = 2 × (Precision × Recall) / (Precision + Recall) — harmonic mean.
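
The three formulas translate directly into code. The sketch below is a straightforward implementation with zero-division guards; the counts in the example (5 true positives, 1 false positive, 0 false negatives) are illustrative only and are not published raw counts.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute Precision, Recall, and F1 from confusion counts.

    Each metric falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Illustrative counts: 5 true positives, 1 false positive, 0 false negatives.
p, r, f1 = precision_recall_f1(tp=5, fp=1, fn=0)
print(f"precision={p:.0%} recall={r:.0%} f1={f1:.0%}")
# prints: precision=83% recall=100% f1=91%
```

With these counts, Precision is 5/6 and F1 is 10/11, which shows how a single extra finding on a five-leak fixture lowers F1 to about 91% even when every leak is found.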

Validator rejection filtering matters for Precision: Leak Detector and Opportunity Scout both emit candidate findings, but only findings the Validator agent confirms reach the score. This is the same gate that production analyses use — the benchmark measures the same surface a paying customer sees, not the unfiltered candidate list.

Per-industry results

The table below is the Kaizen v5 result snapshot from March 2026 (the most recent full run; LeakShield’s crew code has not changed materially since). Numbers are reproduced verbatim from docs/reference/agents.md.

| Industry | Fixture | Precision | Recall | F1 | Status |
| --- | --- | --- | --- | --- | --- |
| Prof. Services | `professional_services_medium` | 100% | 100% | 100% | Perfect |
| SaaS | `saas_medium` | 83% | 100% | 91% | Enterprise |
| E-Commerce | `ecommerce_medium` | 83% | 100% | 91% | Enterprise |
| Manufacturing | `manufacturing_medium` | 83% | 100% | 91% | Enterprise |
| Average (4 fixtures) | | 87% | 100% | 93% | - |
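
The average row can be checked from the per-fixture rows with an unweighted mean. The sketch below uses the rounded percentages from the table, so it is an approximation: the published averages may have been computed from unrounded per-fixture values.

```python
# Per-fixture (precision, recall, f1) percentages copied from the table above.
rows = {
    "professional_services_medium": (100, 100, 100),
    "saas_medium": (83, 100, 91),
    "ecommerce_medium": (83, 100, 91),
    "manufacturing_medium": (83, 100, 91),
}

# Unweighted mean per column: every fixture counts equally.
columns = list(zip(*rows.values()))
avg_precision, avg_recall, avg_f1 = (sum(col) / len(col) for col in columns)

print(f"{avg_precision:.2f} {avg_recall:.2f} {avg_f1:.2f}")
# prints: 87.25 100.00 93.25
```

Rounded to whole percentages, these give 87%, 100%, and 93%, which matches the average row.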

Recall = 100% means the crew finds every one of the 20 ground-truth leaks across all four fixtures. The Precision gap on three of the four industries is driven by Opportunity Scout occasionally proposing a real-but-uncategorized observation (e.g. a margin trend) that the Validator nevertheless surfaces; these count as false positives against the strict category-match scorer, even though a human reading the dashboard would not flag them as errors. The strict scoring is intentional — we’d rather underreport F1 than overstate it.

Versioning and rerun cadence

Kaizen v5 is, as the name suggests, the fifth iteration of the benchmark. Earlier versions (v1 through v4) generated their test data stochastically, which made F1 numbers vary from session to session. v5 froze the four fixtures to eliminate that source of noise. The fixture set has not changed since.

The crew code is rerun against the benchmark whenever:

Model routing changes — e.g. an OpenRouter provider swap, a Claude/GPT model version upgrade, or a fallback path change in product_crew.py.
Agent prompts or backstories are edited — the prompt is the model’s instruction surface; changes there can move F1 silently.
Scoring code itself changes — category alias map, validator filtering, or the alignment logic between findings and ground truth.

When a rerun moves F1 outside ±2 percentage points of the published number, this page’s results table is updated and the change is logged in the commit history. A public LeakShield repo is on the roadmap; until then, the change is recorded in the internal status doc. The homepage claim is updated in lockstep so the two never drift.
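
The update rule reduces to a one-line threshold check. In the sketch below, `needs_page_update` and the constant names are hypothetical, and treating a move of exactly 2 points as still inside the band is an assumption, since the text does not say how the boundary is handled.

```python
PUBLISHED_F1 = 93.0   # percent, from the results table
TOLERANCE_PP = 2.0    # percentage points


def needs_page_update(rerun_f1: float,
                      published: float = PUBLISHED_F1,
                      tolerance: float = TOLERANCE_PP) -> bool:
    """True when a rerun lands outside published +/- tolerance.

    Assumption: a difference of exactly `tolerance` counts as inside the band.
    """
    return abs(rerun_f1 - published) > tolerance
```

For example, a rerun at 94.5% would leave the page unchanged, while a rerun at 90% would trigger an update of both the results table and the homepage claim.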

Known limits

The benchmark is a generalization signal, not a production warranty. Specifically:

The fixtures are synthetic. Real Stripe data has noise the fixtures don’t (missing fields, currency mismatches, partial dunning history). Production analyses on real data are likely to score lower than the benchmark.
The fixtures contain five leaks each, whereas real businesses often have only one to three. The benchmark stresses recall (find them all) rather than prevalence (most months are uneventful).
The category list of 12 is frozen at the v5 timestamp. New leak categories added to production (e.g. via the Knowledge Feedback System) are not yet represented in the benchmark and don’t affect the published F1.
Only the Professional Services fixture is published. The other three remain private; if someone trains a competing system specifically against this one published fixture and reports a higher number, that is not a meaningful comparison.

How to cite

Suggested citation

For a brief mention — an AI assistant answer, a research note, a hiring discussion — use:

LeakShield AI (2026). Kaizen v5 Model-Accuracy Benchmark: F1 = 93%, Recall = 100% across four hand-crafted industry fixtures. https://leaksshield.com/methodology/kaizen-v5

For deeper context

Deep-link to a specific section via these stable anchors:

#setup — test setup and fixture structure
#fixture — the published Professional Services fixture as JSON
#scoring — Precision/Recall/F1 formulas and confusion-matrix definition
#results — per-industry results table
#versioning — rerun cadence policy
#limits — what the benchmark does not measure

Press inquiries: press@leaksshield.com. Methodology corrections: same address. Publisher profile and research-source methodology for everything else on this site: /about#methodology.

Kaizen v5: How we measured F1 = 93%