This is a debrief.

For the last several weeks the AnankeLabs cyber track has been calibrating the KAIROS cybersecurity adapter against public references and running its first end-to-end synthetic baseline at scale. The work happened against a moving sky. Anthropic’s Mythos preview, Project Glasswing, and the broader market signal around AI-assisted vulnerability discovery and reproduction made one thing concrete: the speed at which novel attack patterns can be surfaced is now bounded by compute, not by the size of the analyst pool. The defender side has not absorbed that shift.

We do not pitch KAIROS as a replacement for a SIEM, an EDR, or an XDR. We pitch it as a structural reading layer that sits adjacent to those tools and answers a different question. A classifier returns whether an event matches an attack pattern. The structural margin returns the pressure a defended zone can still absorb before its weakest control collapses. The structural reading is the zero-day early-warning surface, and it does not depend on a signature or a labeled training set.

This post is the debrief on the calibration work that backs that claim. What we did, what we found, what is honest about the methodology, and what is still missing.

The Adapter in a Nutshell

The cyber adapter treats a defended zone as a structural object under load. Lambda aggregates six attack-surface metrics with critical_max, so one severe kill-chain indicator drives the zone score. Gamma aggregates six defense-posture metrics with critical_min, so one collapsed control drives the buffer.
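
As a concrete illustration of those two rules, here is a minimal Python sketch. The metric names networkSegmentation, defenseDepth, and exploitSophistication come from this post; the remaining names and all example values are placeholders, not the adapter's actual fields.

```python
# Minimal sketch of the two aggregation rules. Metric names other than
# networkSegmentation, defenseDepth and exploitSophistication are placeholders.

def critical_max(metrics: dict[str, float]) -> float:
    """Lambda-side: the zone score is driven by the single worst attack-surface metric."""
    return max(metrics.values())

def critical_min(metrics: dict[str, float]) -> float:
    """Gamma-side: the buffer is driven by the single weakest defense-posture metric."""
    return min(metrics.values())

attack_surface = {"killChainDepth": 0.9, "lateralReach": 0.3, "persistence": 0.1,
                  "privilegePressure": 0.4, "exfilPressure": 0.2, "exploitSophistication": 0.5}
defense_posture = {"networkSegmentation": 0.05, "defenseDepth": 0.8, "monitoringCoverage": 0.7,
                   "patchLatency": 0.6, "identityHygiene": 0.9, "responseReadiness": 0.75}

lam = critical_max(attack_surface)   # 0.9: one severe kill-chain indicator drives the zone score
gam = critical_min(defense_posture)  # 0.05: one collapsed control drives the buffer
```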

The unit of evaluation is a CyberMetricSnapshot per zone, per tick. The unit of output is a CyberSignalEnvelope containing the stability score, the gate verdict, and the evidence trail. Every envelope is hash-bound to the calibration anchors and the deployment policy version, which makes the verdict replay-deterministic across processes and reviewers.
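
A sketch of the two shapes and the hash binding, as Python dataclasses. Only the elements named above (stability score, gate verdict, evidence trail, the binding to anchors and policy version) come from the post; every other field name is an assumption, not the actual schema.

```python
# Illustrative shapes only; field names beyond those stated in the post are assumptions.
from dataclasses import dataclass
import hashlib
import json

@dataclass(frozen=True)
class CyberMetricSnapshot:
    zone_id: str
    tick: int                          # one snapshot per zone, per 60-second tick
    attack_surface: dict[str, float]   # six Lambda-side metrics
    defense_posture: dict[str, float]  # six Gamma-side metrics

@dataclass(frozen=True)
class CyberSignalEnvelope:
    zone_id: str
    tick: int
    stability_score: float
    gate_verdict: str                  # verdict labels are assumed, not the real enum
    evidence: list[str]
    binding: str                       # hash over calibration anchors + policy version

def bind(payload: dict, calibration_anchor: str, policy_version: str) -> str:
    """Hash-bind a verdict to the calibration anchors and the deployment policy version,
    so the same inputs replay to the same verdict across processes and reviewers."""
    canonical = json.dumps({"payload": payload, "anchor": calibration_anchor,
                            "policy": policy_version}, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```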

The cyber adapter routes evaluation through an observation-shaped surface, distinct from the trajectory-shaped surface used for LLM, MoE, and robotics. Cybersecurity detection is structurally a snapshot-against-thresholds problem, not a state-propagation-toward-attractors problem. A trajectory runtime carries machinery that produces geometric artefacts on cyber data; the observation-shaped runtime removes that machinery. This post is the empirical exhibit for that architectural choice.

The Reference Run

Parameter          Value
Seed               42
Zones              60
Window             60 days
Tick resolution    60 seconds
Total snapshots    5,184,000
Wall-clock         30.2 min (release build, 8× replay parallelism)

Gamma-side cell distributions were anchored to public references: Verizon DBIR 2024 + 2025, NIST SP 800-53 Rev. 5, NIST SP 800-207 (Zero Trust Architecture), CIS Controls v8, OCSF 1.x, the Los Alamos Unified Host & Network dataset, and the DARPA Operationally Transparent Cyber corpus. Distribution families were the standard Beta / LogNormal / NegBinomial shapes used in the IDS calibration literature, with Wilson 95% confidence intervals on every reported rate.
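
The Wilson interval mentioned above is the standard construction; a minimal sketch, using the Internal-segment rate from the result table later in the post as a worked example.

```python
# Wilson 95% confidence interval on a Bernoulli rate (k positives out of n trials).
from math import sqrt

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    p = k / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1.0 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

# Worked example: a 98.08% policy-positive rate over 7,200 zone-hours.
lo, hi = wilson_interval(k=round(0.9808 * 7200), n=7200)   # roughly (0.977, 0.984)
```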

We are explicit about which cells came from where. Each of the 144 main calibration cells carries a confidence tag: strong when two or more independent public references converge, moderate for one good reference, weak for analogous evidence only, and synthesised for cells with no citation but a stated reasoning chain. The v1 distribution across those tags is roughly 10 / 32 / 12 / 46 percent. The synthesised concentration sits in the near-miss profiles (sanctioned-pentest variants) and in exploitSophistication, where no public technique-match-to-[0,1] mapping exists.

The methodology lineage here is the Engelen et al. 2021 critique of CICIDS-2017 (Troubleshooting an Intrusion Detection Dataset). Most cyber benchmarks do not surface per-feature provenance. We do.

The Result

Per-zone-hour Bernoulli policy-positive rates with gammaFloor = 0.20, four archetypes, two benignness profiles, n = 7,200 zone-hours each:

Archetype           Quiet     Noisy-but-benign
Identity plane      0.60%     0.67%
Edge device         28.07%    26.74%
Internal segment    98.08%    96.49%
Data plane          1.46%     1.96%

A reader's first reaction to that Internal-segment number is suspicion that something in the calibration is broken. That reaction is the right one; we had it too. We then read the literature carefully, and the number became the most interesting thing the corpus produced.

The result is not a framework failing. It is the framework correctly identifying that the median enterprise’s internal-segment posture sits at the structural-margin threshold under a 0.20 floor. The qualitative version of this claim already lives in the cyber literature:

  • DBIR 2024/2025: lateral-movement controls are inconsistently deployed and many enterprises run effectively flat internal networks.
  • NIST SP 800-207: internal-segment microsegmentation is the unfinished work for most Zero Trust adopters.
  • CIS Controls v8 implementation surveys: internal-segment monitoring sits among the lowest-maturity control families.

What the corpus adds is a number. When DBIR-anchored Beta distributions on networkSegmentation and defenseDepth are sampled and aggregated via critical_min against a 0.20 floor, ~98% of benign internal-segment zone-hours produce policy positives. A researcher who wants to know what it means to set a structural-margin policy on a realistic enterprise baseline now has a quantified answer they could not read off DBIR or NIST alone.
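
A minimal Monte Carlo sketch of that mechanism, not the corpus generator: sample Beta-shaped posture cells, aggregate with critical_min, count the fraction below the floor. The Beta parameters here are placeholders; the ~98% headline comes from the DBIR-anchored calibration cells, not from these values.

```python
# Mechanism sketch only. Beta parameters are illustrative placeholders,
# not the DBIR-anchored calibration cells that produce the ~98% headline.
import numpy as np

rng = np.random.default_rng(42)
n = 7_200  # zone-hours, matching the per-cell count in the reference run

network_segmentation = rng.beta(a=1.2, b=6.0, size=n)   # low-maturity: mass near zero
defense_depth        = rng.beta(a=2.0, b=3.0, size=n)

gamma = np.minimum(network_segmentation, defense_depth)  # critical_min over the sampled cells
rate = float(np.mean(gamma < 0.20))                      # fraction of zone-hours below the floor
print(f"policy-positive rate at gammaFloor=0.20: {rate:.2%}")
```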

Sensitivity Sweep

Rather than report a single rate at a single threshold, we treat gammaFloor as a tunable risk-tolerance knob and sweep it:

gammaFloor        Headline primary    Internal-segment primary
0.20 (default)    31.76%              97.28%
0.10              0.69%               2.65%
0.08              0.69%               2.65%
0.05              0.69%               2.65%

The residual at low floors is the engine’s promote-only HITL evidence upgrade firing on a single Internal-segment zone-hour where the weakest scaled control collapses near zero. That residual is independent of the policy floor by construction. It is a methodology artifact worth surfacing: at low floors, evidence-driven upgrades dominate gamma-band classification. The sweep applies the full assessment semantic at each floor, including the promote-only HITL upgrades; an earlier draft re-derived only the gamma band and understated the policy-positive count at low floors.

The framing point: the threshold is a Pareto choice between benign-baseline conservatism and structural-margin strictness, and it should be visible to the operator setting the policy. It is not a hidden vendor parameter.
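
A sketch of the sweep, run over precomputed per-zone-hour gamma values. It reproduces only the gamma-band classification; as stated above, the real sweep applies the full assessment semantic, including the promote-only HITL upgrades.

```python
# Sweep gammaFloor over precomputed per-zone-hour gamma values (gamma-band only;
# the harness additionally applies promote-only HITL evidence upgrades).
import numpy as np

def sweep_gamma_floor(gamma_values: np.ndarray,
                      floors=(0.20, 0.10, 0.08, 0.05)) -> dict[float, float]:
    return {floor: float(np.mean(gamma_values < floor)) for floor in floors}

# gamma_values would come from the corpus replay; a placeholder array stands in here.
placeholder = np.random.default_rng(42).beta(1.2, 6.0, size=7_200)
for floor, rate in sweep_gamma_floor(placeholder).items():
    print(f"gammaFloor={floor:.2f}: {rate:.2%} policy-positive")
```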

Reproducibility

The provenance manifest is byte-identity-gated across runs with the same seed. The recorded anchors:

  • A SHA-256 over the calibration document, computed at parse time.
  • The OCSF 1.8.0 release tag and commit hash, pinned.
  • A SHA-256 over the OCSF aggregation map.
  • A SHA-256 over the canonical-JSON-serialised event budgets.
  • A SHA-256 over the per-zone snapshot NDJSON, sorted.
  • A whole-tree fingerprint restricted to the listed zones.

Two independent runs on the reference seed produce byte-identical reports, OCSF samples, snapshot NDJSON, and manifest blocks. A determinism test in the orchestrator suite enforces the property; a manual cmp between consecutive runs confirms it.
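
The determinism property reduces to byte identity of the emitted artifacts; a minimal sketch of the check, with placeholder file names rather than the harness's actual paths.

```python
# Byte-identity check between two runs of the same seed. File names are placeholders.
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def assert_byte_identical(run_a: Path, run_b: Path,
                          artifacts=("report.json", "ocsf_samples.ndjson",
                                     "snapshots.ndjson", "manifest.json")) -> None:
    for name in artifacts:
        if sha256_of(run_a / name) != sha256_of(run_b / name):
            raise AssertionError(f"determinism violated: {name} differs between runs")
```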

For a researcher reproducing the work: any discrepancy is itself the finding. The most likely causes are calibration-doc drift, OCSF schema-pin drift, or a generator change that shifted the round-trip semantic. We are happy to walk an external replicator through the harness; reach out.

A Linguistic Contribution: “Policy-Positive Rate”

We renamed our headline metric from FP rate to policy-positive rate during Phase C of delivery, after the Internal-segment finding made clear that benign-baseline windows are not necessarily healthy-posture windows. The standard FP-rate framing implicitly conflates two distinct claims:

  1. The window is benign in the no-incident sense (no attack happened).
  2. The window’s structural posture is healthy (the margin to threshold is large).

The tier-2 result demonstrates that those two claims can come apart. An enterprise’s benign no-incident windows can sit at the structural-margin threshold under a policy that flags structural weakness. The renaming distinguishes “policy-positive rate” (the system flagged this zone-hour) from “false-positive rate on threats” (the system flagged a zone-hour where no incident occurred). The former is what we measure here. The latter requires labeled real-world data that v1 does not have.
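
For concreteness, a sketch of the two rates over the same zone-hours; the incident labels needed for the second denominator are exactly what v1's synthetic corpus cannot supply.

```python
# The two rates the renaming separates, over the same list of zone-hours.
def policy_positive_rate(flagged: list[bool]) -> float:
    """Fraction of zone-hours the system flagged, regardless of ground truth."""
    return sum(flagged) / len(flagged)

def false_positive_rate_on_threats(flagged: list[bool], incident: list[bool]) -> float:
    """Fraction of no-incident zone-hours that were flagged; requires incident labels."""
    benign_flags = [f for f, i in zip(flagged, incident) if not i]
    return sum(benign_flags) / len(benign_flags)
```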

This is a small linguistic contribution, but it matters when the cyber literature talks about false positives without specifying which kind.

Limitations, Stated Up Front

What v1 does not support:

  • Real-world data. v1 is fully synthetic. v1.1 lands real-data E8 from a partner’s redacted OCSF export. The partner-ask spec is ready and is the load-bearing item below.
  • Multi-partner replication. v1 is a single-corpus result. v1.2 and beyond add three or more partners per archetype to reduce idiosyncrasy bias.
  • Cross-method comparison. No comparative numbers against signature-based, anomaly-based, or vendor-SIEM detection under the same benign baseline. This is the methods-paper-grade extension and it is honest to say we have not run it yet.
  • Multi-cloud calibration. Data-plane cells are anchored to AWS CloudTrail. GCP and Azure are v1.1 partner-data asks.
  • Live-streaming evaluation. v1 is replay-first and batch.
  • Novel detection algorithm or attack class. This is methodology and baseline characterisation work, not a new detector and not a new attack.

What is synthesised rather than measured: roughly 46% of the 144 main calibration cells, concentrated in near-miss profiles and exploitSophistication. Cross-metric correlation magnitudes are explicitly synthesised; the v1 generator runs with correlations off by default for that reason.

We are publishing this debrief because hiding the synthesised cells would compound a problem we are explicitly trying to fix in cyber benchmarking practice.

The Partner Ask

The methodology is ready for real telemetry. We are looking for design partners able to share a 30–90 day OCSF export covering one or two zone archetypes, with SOC-confirmed benignness labels and incident disclosure for the contributed window.

The minimum-viable ask:

  • Format: OCSF 1.x export. Raw vendor JSON with documented schema is an acceptable fallback for one or two partners.
  • Coverage: at least one of the four archetypes (identity plane, edge device, internal segment, data plane) for 30 continuous days. 60 to 90 days is the credible target.
  • Granularity: per-tick aggregation at 60-second resolution is sufficient. Source telemetry can be raw events; the harness performs windowed aggregation (a sketch follows this list).
  • Labels: SOC-confirmed benignness for the window, plus disclosure of any incidents within the window so we can characterise without reporting on undisclosed events.
  • NDA: standard. Redacted exports preferred. Round-tripped public artefacts will be aggregate rates and confidence intervals only; raw partner telemetry never leaves the partner environment in identifiable form.
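
As referenced in the granularity item, a minimal sketch of 60-second windowed aggregation over raw events. The field names follow OCSF conventions, but the time unit and the per-class counting are assumptions, not the harness's actual logic.

```python
# 60-second windowed aggregation over raw events. OCSF-style field names are used,
# but the time unit (epoch seconds) and the per-class counting are assumptions.
from collections import defaultdict

def aggregate_60s(events: list[dict]) -> dict[int, dict[str, int]]:
    """Bucket raw events into 60-second ticks and count events per class per tick."""
    ticks: dict[int, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for event in events:
        tick = int(event["time"]) // 60              # epoch seconds -> tick index
        ticks[tick][str(event.get("class_uid", "unknown"))] += 1
    return {t: dict(counts) for t, counts in ticks.items()}
```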

The full data spec — coverage matrix, labeling discipline, redaction rules, sequencing — lives at /partner/cyber-data-spec. If you are running infrastructure where this kind of contribution is feasible, contact us. We will share the redaction guidance and the round-trip harness directly.

Where This Sits

The headline framing for the cyber adapter is not “another rule engine.” It is structural early warning that arrives before the kill chain completes. The Mythos-shaped sandbox-escape fixture in the smoke corpus walks through the exact sequence: privilege pressure rises first, segmentation collapses next, exfiltration arrives after. The zone reaches ActiveIntrusion before the exfiltration jump, because the structural margin is already compressed. A pattern-matcher reads the exfiltration packet. The structural margin reads the geometry.
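
To make the sequencing concrete, a toy rendering of that fixture, with made-up values and a toy stand-in for the structural margin; ActiveIntrusion is the state named in this post, the rest is assumption.

```python
# Toy rendering of the sandbox-escape sequence. Values and the margin formula are
# made up for illustration; ActiveIntrusion is the state named in the post.
GAMMA_FLOOR = 0.20

timeline = [
    # (tick, privilege_pressure, network_segmentation, exfil_indicator)
    (0, 0.20, 0.70, 0.00),
    (1, 0.55, 0.65, 0.00),   # privilege pressure rises first
    (2, 0.80, 0.15, 0.00),   # segmentation collapses next
    (3, 0.85, 0.10, 0.90),   # exfiltration arrives after
]

for tick, priv, seg, exfil in timeline:
    margin = min(seg, 1.0 - priv)        # toy stand-in for the structural margin
    state = "ActiveIntrusion" if margin < GAMMA_FLOOR else "Nominal"
    print(f"tick {tick}: margin={margin:.2f} state={state} exfil={exfil:.2f}")
# The state flips at tick 2, before the exfiltration jump at tick 3: the structural
# reading arrives before the packet a pattern-matcher would need.
```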

This complements the SIEM. It complements the EDR. It does not replace either. It adds a reading the existing tools structurally cannot produce, because they were not designed to compute reachability of the safe set per defended zone, per tick.

The corpus, the report, the provenance manifest, the calibration-honest reference table, and the partner-ask spec exist and are kept under wraps for the moment. Replication is invited; failed replications and drift findings on the calibration anchors are citable artifacts in their own right. If you want to discuss real-world calibration, replication, or partner engagement, reach out. Background reading on the structural-margin framing lives on the Cybersecurity Rosetta page; the deterministic gate-chain methodology is detailed in the Boundary Study.