Methodology

Kaizen v5: How we measured F1 = 93%

The homepage claims the agent crew scores F1 = 93%, Recall = 100% on the internal Kaizen v5 benchmark. This page documents the test setup behind that number — fixture structure, scoring formulas, per-industry results, and one of the four fixtures published verbatim as JSON so the structure is reproducible.

Benchmark version: Kaizen v5 · First measured: March 2026 · 4 industries × 5 leaks · Last reviewed: May 25, 2026
Fixtures: 4 industries
Ground-truth leaks: 5 per fixture (20 total)
Avg F1: 93%
Recall: 100% (all 20 leaks found)
Determinism: Hand-crafted, no RNG
Published fixtures: 1 of 4 (Prof. Services)

Why this page exists

LeakShield’s homepage states the result as a single headline figure: F1 = 93%, Recall = 100%. An external reader (AI assistant, hiring manager, investor, journalist) needs more than the headline to decide whether the number is defensible. The /about#methodology page covers our 47 industry-statistic sources but not the model-accuracy claim itself. This page closes that gap: it documents what was measured, on what, and with what formula, and shows one of the four test fixtures verbatim so the data shape is verifiable.

Test setup

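Kaizen v5 uses four deterministic fixtures, one per industry the product targets at the time of writing (SaaS, E-Commerce, Manufacturing, Professional Services). Each fixture is a single Python dict containing three top-level structures. In the published fixture these appear as the company profile fields (company_name through reported_issue), the financial_data block, and the ground_truth array.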

Leaks are placed across 12 categories that mirror LeakShield’s production category taxonomy: billing_automation, contract_management, pricing_drift, failed_payments, cost_optimization, customer_segmentation, operational_efficiency, refund_abuse, cash_flow, support_cost_leaks, feature_adoption, shipping_cost_leaks. Each fixture distributes its five leaks across at least three categories so a single-category detector cannot trivially pass.
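The category rule above can be checked mechanically. The following is a minimal sketch, not LeakShield's code: the helper name check_category_spread is hypothetical, and the example list simply repeats the category sequence of the published Professional Services ground truth.

```python
# The 12 category names are copied from the taxonomy listed above.
CATEGORIES = frozenset({
    "billing_automation", "contract_management", "pricing_drift",
    "failed_payments", "cost_optimization", "customer_segmentation",
    "operational_efficiency", "refund_abuse", "cash_flow",
    "support_cost_leaks", "feature_adoption", "shipping_cost_leaks",
})


def check_category_spread(ground_truth, min_categories=3):
    """Return the distinct categories used by a fixture's ground truth.

    Hypothetical helper: raises ValueError if a category is outside the
    taxonomy or if fewer than `min_categories` distinct categories appear.
    """
    used = {leak["category"] for leak in ground_truth}
    unknown = used - CATEGORIES
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    if len(used) < min_categories:
        raise ValueError(
            f"only {len(used)} distinct categories, need {min_categories}"
        )
    return used


# Category sequence of GT-001 .. GT-005 in the published fixture.
example = [
    {"category": "contract_management"},
    {"category": "billing_automation"},
    {"category": "cost_optimization"},
    {"category": "cost_optimization"},
    {"category": "billing_automation"},
]
print(sorted(check_category_spread(example)))
```

The published fixture uses exactly three distinct categories, so it sits right at the stated minimum.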

What we don’t publish: Three of the four fixtures (SaaS, E-Commerce, Manufacturing) remain private. Publishing all four would let a future model variant be tuned against the public test set rather than the problem — the benchmark would stop being a falsifiable signal of generalization. The published fixture (Professional Services, below) is sufficient to verify the data shape, the ground-truth format, and the leak-difficulty range. Independent replication of the full benchmark requires our hand-crafted private fixtures, but a third party can construct their own analogous fixtures from this template.

Sample fixture: Professional Services

The fixture below is the literal Python dict that core/test_fixtures.py exposes as PROF_SERVICES_MEDIUM, serialized to JSON for portability. It is fed to the agent crew via QACrew.run_benchmark(test_data=fixture); the crew receives only the company_name, industry, annual_revenue, employees, company_description, reported_issue, and financial_data blocks. The ground_truth array is held back and is used only by the scorer to evaluate the crew’s output.
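How QACrew.run_benchmark separates the two views internally is not documented on this page. As a rough sketch of the idea only, a hypothetical split_fixture helper could hand the crew the seven input blocks and keep ground_truth aside for the scorer:

```python
# Keys the crew is allowed to see, as listed in the paragraph above.
INPUT_KEYS = (
    "company_name", "industry", "annual_revenue", "employees",
    "company_description", "reported_issue", "financial_data",
)


def split_fixture(fixture):
    """Hypothetical helper: separate crew input from held-back ground truth."""
    crew_input = {key: fixture[key] for key in INPUT_KEYS}
    ground_truth = list(fixture["ground_truth"])
    return crew_input, ground_truth


# Heavily abbreviated stand-in for a fixture, for illustration only.
demo_fixture = {
    "company_name": "Strategie & Beratung AG",
    "industry": "professional_services",
    "annual_revenue": "6M EUR",
    "employees": 28,
    "company_description": "(abbreviated)",
    "reported_issue": "(abbreviated)",
    "financial_data": {"monthly_revenue": 500000},
    "ground_truth": [{"leak_id": "GT-001", "category": "contract_management"}],
}

crew_input, ground_truth = split_fixture(demo_fixture)
print("ground_truth" in crew_input)  # the crew never sees the answers
```

Building the crew input from an explicit allow-list, rather than deleting ground_truth from a copy, means a key added to the fixture later does not reach the agents by accident.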

A transparency note: the fixture text was authored in 2026 before LeakShield’s English-only rebrand and is preserved verbatim (some strings are in German, currency is in EUR). The company described is fictitious. Polishing the fixture text would change the input the agents see and invalidate the F1 history — we leave it as-is and document the artifact.

fixtures/professional_services_medium.json · 118 lines · UTF-8 · LeakShield AI, 2026
{
  "company_name": "Strategie & Beratung AG",
  "industry": "professional_services",
  "annual_revenue": "6M EUR",
  "employees": 28,
  "company_description": "Management-Beratung spezialisiert auf Digitale Transformation. 3 Service Lines: Strategy (40%), Implementation (45%), Training (15%). Kunden: Mittelstand 200-2000 MA. Tagessatz: Junior 800€, Senior 1.400€, Partner 2.200€.",
  "reported_issue": "Utilization sieht gut aus auf Papier, aber Profitabilität sinkt. Projekte dauern länger als geplant.",
  "financial_data": {
    "monthly_revenue": 500000,
    "active_projects": 14,
    "costs": {
      "personnel_consultants": 280000,
      "personnel_admin": 42000,
      "office_rent": 18000,
      "travel": 35000,
      "software_tools": 12000,
      "marketing_bd": 28000,
      "other": 15000
    },
    "margins": {
      "gross_margin_percent": 44,
      "net_margin_percent": 14,
      "target_net_margin_percent": 25
    },
    "utilization_metrics": {
      "overall_utilization_percent": 78,
      "billable_hours_monthly": 3120,
      "total_available_hours_monthly": 4000,
      "non_billable_hours_monthly": 880,
      "non_billable_breakdown": {
        "internal_meetings_hours": 240,
        "admin_hours": 180,
        "training_hours": 120,
        "business_development_hours": 200,
        "bench_time_hours": 140
      }
    },
    "project_metrics": {
      "projects_on_budget_percent": 42,
      "avg_budget_overrun_percent": 18.5,
      "scope_creep_incidents_monthly": 23,
      "change_orders_raised_monthly": 4,
      "unbilled_scope_creep_hours_monthly": 285,
      "unbilled_scope_creep_value_eur": 34200
    },
    "billing_metrics": {
      "avg_time_to_invoice_days": 28,
      "avg_payment_terms_days": 45,
      "avg_actual_payment_days": 68,
      "overdue_receivables_eur": 185000,
      "time_logging_compliance_percent": 72,
      "estimated_unbilled_hours_monthly": 156,
      "estimated_unbilled_value_eur": 17160
    },
    "resource_metrics": {
      "senior_hours_on_junior_tasks_monthly": 180,
      "senior_hourly_rate_eur": 175,
      "junior_hourly_rate_eur": 100,
      "rate_differential_waste_eur": 13500
    },
    "travel_metrics": {
      "total_travel_expenses_monthly_eur": 35000,
      "travel_recharged_to_clients_percent": 55,
      "travel_absorbed_monthly_eur": 15750
    }
  },
  "ground_truth": [
    {
      "leak_id": "GT-001",
      "type": "revenue_leak",
      "category": "contract_management",
      "title": "Scope Creep ohne Change Orders",
      "description": "23 Scope-Creep-Incidents/Monat, aber nur 4 Change Orders. 285 unbilled Stunden × Ø 120€ = 34.200€/Monat kostenlose Mehrarbeit.",
      "hidden_in": "project_metrics.unbilled_scope_creep_hours_monthly + unbilled_scope_creep_value_eur",
      "estimated_impact_eur": 34200,
      "difficulty": "medium"
    },
    {
      "leak_id": "GT-002",
      "type": "revenue_leak",
      "category": "billing_automation",
      "title": "Zeit-Logging Compliance bei nur 72%",
      "description": "28% der Stunden werden nicht geloggt → 156 unbilled Stunden/Monat. 156h × Ø 110€/h = 17.160€/Monat nicht-fakturierter Umsatz.",
      "hidden_in": "billing_metrics.time_logging_compliance_percent + estimated_unbilled_value_eur",
      "estimated_impact_eur": 17160,
      "difficulty": "medium"
    },
    {
      "leak_id": "GT-003",
      "type": "revenue_leak",
      "category": "cost_optimization",
      "title": "Senior-Berater auf Junior-Aufgaben",
      "description": "180 Stunden/Monat: Seniors machen Junior-Arbeit. Differenz: 180h × (175€ - 100€) = 13.500€/Monat Ressourcen-Verschwendung.",
      "hidden_in": "resource_metrics.senior_hours_on_junior_tasks_monthly + rate_differential_waste_eur",
      "estimated_impact_eur": 13500,
      "difficulty": "medium"
    },
    {
      "leak_id": "GT-004",
      "type": "revenue_leak",
      "category": "cost_optimization",
      "title": "Reisekosten nur 55% weiterberechnet",
      "description": "35.000€ Reisekosten/Monat, nur 55% an Kunden weiterberechnet. 45% × 35.000€ = 15.750€/Monat vom Unternehmen absorbiert.",
      "hidden_in": "travel_metrics.travel_recharged_to_clients_percent + travel_absorbed_monthly_eur",
      "estimated_impact_eur": 15750,
      "difficulty": "medium"
    },
    {
      "leak_id": "GT-005",
      "type": "revenue_leak",
      "category": "billing_automation",
      "title": "Zahlungsverzug als Cash-Flow-Leak",
      "description": "Ø 68 Tage tatsächliche Zahlung vs. 45 Tage Vereinbarung. 185.000€ überfällige Forderungen. Cash-Flow-Kosten bei 5% Zins: 185.000€ × 5% × (23/365) = 583€/Monat direkte Finanzierungskosten.",
      "hidden_in": "billing_metrics.avg_actual_payment_days vs avg_payment_terms_days + overdue_receivables_eur",
      "estimated_impact_eur": 583,
      "difficulty": "hard"
    }
  ]
}
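Because the fixture is hand-crafted, its derived figures can be re-computed from the raw ones. The sketch below does that for the values quoted in the fixture. All numbers are copied from the JSON above; the 120 and 110 EUR blended hourly rates and the 5% interest rate appear only in the ground_truth descriptions. The 0.5 EUR tolerance is an assumption to absorb rounding.

```python
import math

# (recomputed value, value stated in the fixture)
checks = {
    "overall_utilization_percent": (3120 * 100 / 4000, 78),
    "non_billable_hours_monthly": (240 + 180 + 120 + 200 + 140, 880),
    "unbilled_scope_creep_value_eur": (285 * 120, 34200),
    "estimated_unbilled_value_eur": (156 * 110, 17160),
    "rate_differential_waste_eur": (180 * (175 - 100), 13500),
    "travel_absorbed_monthly_eur": (35000 * (100 - 55) / 100, 15750),
    # GT-005: overdue receivables x 5% interest x (68 - 45) days / 365
    "late_payment_cost_eur": (185000 * 0.05 * (68 - 45) / 365, 583),
}

mismatches = {
    name: pair
    for name, pair in checks.items()
    if not math.isclose(pair[0], pair[1], abs_tol=0.5)
}
print(mismatches)  # an empty dict means every derived figure checks out
```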

Scoring

After the crew runs, its findings are scored against the fixture’s ground_truth array via standard information-retrieval metrics. A crew finding is a true positive when its category matches a ground-truth category (using a small alias map — e.g. "failed_payments" matches "payment_failures") AND the finding survives Validator rejection filtering. Findings that match no ground-truth category are false positives. Ground-truth leaks that no finding matches are false negatives.
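A minimal sketch of that matching rule follows, read literally from the paragraph above. Several pieces are assumptions: the alias map contains only the one pair named in the text, Validator confirmation is modeled as a boolean validated field on each finding, and the page does not say how several findings or leaks sharing one category are counted, so this version does not enforce one-to-one matching.

```python
# Only alias pair named in the text; the real map may contain more.
ALIASES = {"payment_failures": "failed_payments"}


def canonical(category):
    """Map an alias to its taxonomy name, leaving other names unchanged."""
    return ALIASES.get(category, category)


def confusion_counts(findings, ground_truth):
    """Hypothetical scorer core: return (TP, FP, FN) by category match."""
    # Findings rejected by the Validator never reach the score.
    confirmed = [f for f in findings if f.get("validated", False)]
    truth = {canonical(g["category"]) for g in ground_truth}
    found = {canonical(f["category"]) for f in confirmed}

    tp = sum(canonical(f["category"]) in truth for f in confirmed)
    fp = len(confirmed) - tp
    fn = sum(canonical(g["category"]) not in found for g in ground_truth)
    return tp, fp, fn


# Toy data, not taken from any benchmark run.
truth = [{"category": "failed_payments"}, {"category": "pricing_drift"}]
findings = [
    {"category": "payment_failures", "validated": True},   # alias match: TP
    {"category": "cash_flow", "validated": True},          # no match: FP
    {"category": "pricing_drift", "validated": False},     # rejected: ignored
]
print(confusion_counts(findings, truth))
```

In the toy data the rejected pricing_drift finding is dropped before matching, so that ground-truth leak ends up as a false negative.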

The three metrics are computed per fixture and then averaged across the four fixtures:

Precision = TP / (TP + FP) — fraction of crew findings that match a real leak.
Recall = TP / (TP + FN) — fraction of real leaks the crew found.
F1 = 2 × (Precision × Recall) / (Precision + Recall) — harmonic mean.
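The three formulas translate directly into code. This is a sketch, not the project's scorer; returning 0.0 when a denominator is zero is a convention chosen here, not something the page specifies. The counts in the usage line (5 TP, 1 FP, 0 FN) are hypothetical: they are one combination that rounds to 83% / 100% / 91%, and the page does not publish per-fixture confusion counts.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute Precision, Recall and F1 from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: five leaks found plus one unmatched finding.
p, r, f1 = precision_recall_f1(tp=5, fp=1, fn=0)
print(f"{p:.0%} {r:.0%} {f1:.0%}")
```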

Validator rejection filtering matters for Precision: Leak Detector and Opportunity Scout both emit candidate findings, but only findings the Validator agent confirms reach the score. This is the same gate that production analyses use — the benchmark measures the same surface a paying customer sees, not the unfiltered candidate list.

Per-industry results

The table below is the Kaizen v5 result snapshot from March 2026 (the most recent full run; LeakShield’s crew code has not changed materially since). Numbers are reproduced verbatim from docs/reference/agents.md.

| Industry | Fixture | Precision | Recall | F1 | Status |
|---|---|---|---|---|---|
| Prof. Services | professional_services_medium | 100% | 100% | 100% | Perfect |
| SaaS | saas_medium | 83% | 100% | 91% | Enterprise |
| E-Commerce | ecommerce_medium | 83% | 100% | 91% | Enterprise |
| Manufacturing | manufacturing_medium | 83% | 100% | 91% | Enterprise |
| Average (4 fixtures) | | 87% | 100% | 93% | |

Recall = 100% means the crew finds every one of the 20 ground-truth leaks across all four fixtures. The Precision gap on three of the four industries is driven by Opportunity Scout occasionally proposing a real-but-uncategorized observation (e.g. a margin trend) that the Validator nevertheless surfaces; these count as false positives against the strict category-match scorer, even though a human reading the dashboard would not flag them as errors. The strict scoring is intentional — we’d rather underreport F1 than overstate it.
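The Average row can be recomputed from the four per-fixture rows. The sketch below assumes a plain unweighted mean of the rounded percentages as printed in the table; the original run presumably averaged unrounded values, which are not published.

```python
from statistics import mean

# (Precision, Recall, F1) in percent, copied from the results table.
rows = {
    "professional_services_medium": (100, 100, 100),
    "saas_medium": (83, 100, 91),
    "ecommerce_medium": (83, 100, 91),
    "manufacturing_medium": (83, 100, 91),
}

# Transpose the rows into columns and average each metric separately.
avg_precision, avg_recall, avg_f1 = (mean(col) for col in zip(*rows.values()))
print(avg_precision, avg_recall, avg_f1)
```

The means come out at 87.25, 100 and 93.25, which match the table's 87%, 100% and 93% after rounding.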

Versioning and rerun cadence

Kaizen v5 is, as the name suggests, the fifth iteration of the benchmark. Earlier versions (v1 through v4) had stochastic test generation, which made F1 numbers session-dependent. v5 froze the four fixtures to eliminate that source of noise. The fixture set has not changed since.

The crew code is rerun against the benchmark whenever:

When a rerun moves F1 outside ±2 percentage points of the published number, this page’s results table is updated and the change is logged in the commit history (a public LeakShield repo is on the roadmap; until then the change is logged in the internal status doc). The homepage claim is updated in lockstep so the two never drift.
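The ±2 percentage point rule amounts to a simple threshold check. In this sketch the function name needs_update is hypothetical, and treating a move of exactly 2 points as still inside the band is an assumption, since the text only says "outside".

```python
PUBLISHED_F1 = 93.0   # percent, the figure on this page and the homepage
TOLERANCE_PP = 2.0    # allowed drift in percentage points


def needs_update(rerun_f1, published=PUBLISHED_F1, tolerance=TOLERANCE_PP):
    """Return True when a rerun's F1 falls outside the tolerance band."""
    return abs(rerun_f1 - published) > tolerance


print(needs_update(94.5))  # False: within 2 points of 93
print(needs_update(90.0))  # True: 3 points below the published number
```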

Known limits

The benchmark is a generalization signal, not a production warranty. Specifically:

How to cite

Suggested citation

For a brief mention — an AI assistant answer, a research note, a hiring discussion — use:

LeakShield AI (2026). Kaizen v5 Model-Accuracy Benchmark: F1 = 93%, Recall = 100% across four hand-crafted industry fixtures. https://leaksshield.com/methodology/kaizen-v5

For deeper context

Deep-link to a specific section via these stable anchors:

  • #setup — test setup and fixture structure
  • #fixture — the published Professional Services fixture as JSON
  • #scoring — Precision/Recall/F1 formulas and confusion-matrix definition
  • #results — per-industry results table
  • #versioning — rerun cadence policy
  • #limits — what the benchmark does not measure

Press inquiries: press@leaksshield.com. Methodology corrections: same address. Publisher profile and research-source methodology for everything else on this site: /about#methodology.