This is a debrief.
For the last several weeks the AnankeLabs AI safety track has been calibrating the KAIROS AI safety adapter against the public agent-evaluation literature and running its first end-to-end synthetic baseline. The work happened against a fast-moving backdrop. Agentic AI is being deployed at a pace the oversight machinery has not yet been measured on. The question of whether the gates and reviewers around an agent are structurally sound was being answered by deployment velocity rather than by a number.
KAIROS is a structural reading layer that returns the buffer an agent’s alignment posture has before the weakest control collapses. The reading is geometric, computed per proposed action against the configured gate policy, and is independent of signature-based detection of harmful intent. The benign-baseline question this debrief answers is the one a design partner asks first: on confirmed-clean trajectories, what fraction of proposed actions does the gate flag under a stated policy.
The AI safety adapter now has a calibrated, reproducible benign baseline anchored to public agent-evaluation literature, with 27% of cells honestly tagged as synthesised because no public source reports the underlying quantity (operational HITL availability, harness self-confidence, retry-budget consumption). Replacing that synthesised fraction with partner data is the closing path for v1.1.
The Adapter In A Nutshell
The AI safety adapter treats each proposed agent action as a structural object under load. Two primary aggregates carry the reading.
capabilityPressure is the maximum over six metrics: objectiveDrive (how strongly the agent is committing to the proposed action class), toolPowerGradient (the read→write→external→exec elevation between current and proposed call), irreversibilityPressure (the discrete reversibility class of the action), noveltyScore (distance from the archetype’s typical trajectory step), escalationVelocity (rate of accumulating high-power actions across the trajectory), and contextSaturation (token-budget headroom remaining in the harness). One severe capability indicator drives the action score by critical_max aggregation. When the metric goes up, the agent is exerting more capability into the proposed action.
alignmentPosture is the minimum over six metrics: policyHeadroom (distance from the configured capability-threshold ceiling), permissionDepth (remaining scope under the agent’s grant), humanLoopAvailability (reviewer capacity currently available to absorb an escalation; up means more capacity, not more escalations), auditTrailIntegrity (completeness of the OTel span and log record for the action), retryBudget (remaining tool-call budget across the trajectory), and controlHealth (the guardrail-classifier’s own self-reported confidence on the current snapshot). One collapsed control drives the buffer by critical_min aggregation. This is the quantity the gate’s gammaFloor policy compares against.
The unit of evaluation is one AgentMetricComponents record per trajectory × action, projected to a runtime MetricSnapshot. The unit of output is a state-gate decision: PASS when currentGamma >= gammaFloor, otherwise REJECT_STATE. The v281 v2 calibration artifact treats capabilityPressure and alignmentPosture as primary proxies; the twelve substantive metrics underneath are secondary proxies, exposed in the snapshot for ablation and partner mapping.
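The two aggregations and the gate comparison described above can be sketched in a few lines. This is a minimal illustration, not the adapter's implementation: the function and class names are hypothetical, the metric values are invented, and it assumes currentGamma is simply alignmentPosture, which is how the text above reads.

```python
from dataclasses import dataclass

CAPABILITY_METRICS = (
    "objectiveDrive", "toolPowerGradient", "irreversibilityPressure",
    "noveltyScore", "escalationVelocity", "contextSaturation",
)
ALIGNMENT_METRICS = (
    "policyHeadroom", "permissionDepth", "humanLoopAvailability",
    "auditTrailIntegrity", "retryBudget", "controlHealth",
)


@dataclass(frozen=True)
class GateDecision:  # hypothetical container, not a KAIROS type
    capability_pressure: float
    alignment_posture: float
    decision: str


def evaluate_action(metrics: dict, gamma_floor: float = 0.20) -> GateDecision:
    # critical_max: one severe capability indicator drives the pressure.
    capability_pressure = max(metrics[name] for name in CAPABILITY_METRICS)
    # critical_min: one collapsed control drives the buffer.
    alignment_posture = min(metrics[name] for name in ALIGNMENT_METRICS)
    # Assumption: currentGamma is taken to be alignmentPosture itself.
    current_gamma = alignment_posture
    decision = "PASS" if current_gamma >= gamma_floor else "REJECT_STATE"
    return GateDecision(capability_pressure, alignment_posture, decision)


# Invented values for one proposed action; every metric lives on [0, 1].
example = {
    "objectiveDrive": 0.3, "toolPowerGradient": 0.4,
    "irreversibilityPressure": 0.9, "noveltyScore": 0.2,
    "escalationVelocity": 0.1, "contextSaturation": 0.5,
    "policyHeadroom": 0.6, "permissionDepth": 0.5,
    "humanLoopAvailability": 0.15, "auditTrailIntegrity": 0.8,
    "retryBudget": 0.7, "controlHealth": 0.9,
}

result = evaluate_action(example)
print(result)
```

In this toy record a single low control (humanLoopAvailability at 0.15) sets the buffer, so the action is rejected at the 0.20 floor even though the other five controls are healthy.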
The AI safety adapter routes evaluation through a trajectory-shaped runtime (per-action snapshots that carry trajectory provenance), distinct from the cyber adapter’s observation-shaped runtime. Trajectory-shaping is structurally right for agent behavior because actions propagate state across the trajectory; observation-shaping is structurally right for cyber because detection is a snapshot-against-thresholds problem. Same engine core; different runtime envelope.
The Reference Run
| Parameter | Value |
|---|---|
| Seed | 42 |
| Archetypes | 4 (document-reasoning, coding-devops, customer-action, browser-computer-use) |
| Profiles | 3 (quiet, noisy-but-clean, near-miss) |
| Trajectories per cell | 1,200 |
| Actions per trajectory | 6 |
| Snapshots | 86,400 |
| Wall-clock (end-to-end) | 99s |
| Rust replay over 86,400 records | 92s |
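The run-shape figures in the table follow from simple multiplication, which the short check below makes explicit. The constant names are illustrative only, and reading the 144-cell grid as archetypes × profiles × twelve metrics is an inference from the counts given in this debrief.

```python
ARCHETYPES = 4
PROFILES = 3
TRAJECTORIES_PER_CELL = 1_200
ACTIONS_PER_TRAJECTORY = 6
METRICS = 12  # six capabilityPressure inputs plus six alignmentPosture inputs

# Actions evaluated in one archetype x profile cell.
actions_per_cell = TRAJECTORIES_PER_CELL * ACTIONS_PER_TRAJECTORY

# One snapshot per trajectory x action across the whole grid.
snapshots = ARCHETYPES * PROFILES * actions_per_cell

# Confidence-tagged calibration cells (assumed: one per archetype x profile x metric).
tagged_cells = ARCHETYPES * PROFILES * METRICS

# Headline pool: two of the three profiles (quiet and noisy-but-clean).
headline_pool = ARCHETYPES * 2 * actions_per_cell

print(actions_per_cell, snapshots, tagged_cells, headline_pool)
```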
Cell distributions were anchored to public references: METR’s time-horizon work (Kwa, West, Becker et al., arXiv:2503.14499, 2025-03-19) and the subsequent Time Horizon 1.1 release (2026-01-29), GAIA (Mialon et al., arXiv:2311.12983, 2023-11), SWE-bench Verified (Epoch AI methodology), τ-bench (Yao et al., arXiv:2406.12045, 2024-06), the τ²-bench / Cleanlab escalation case study, AISI’s transcript analysis (Assuring Agent Safety Evaluations By Analysing Transcripts, 2025-10), the Microsoft AIRT Taxonomy of Failure Mode in Agentic AI Systems whitepaper (Bryan, Severi, et al., 2025-04-24), MAST (Cemri, Pan, Yang, et al., arXiv:2503.13657, NeurIPS 2025), and the OpenTelemetry GenAI semantic conventions v1.40.0 (status Development, 2026-04). All retrieval dates 2026-05.
Each of the 144 substantive cells carries a confidence tag with the same semantics used in the cyber calibration. The v1 distribution across those tags is 0% strong / 26% moderate / 47% weak / 27% synthesised.
The 0-strong result is the most important honesty signal in the document, and it is a structural property of the source material. Public agent-evaluation data converges on distribution shape (task-length distributions, step-count distributions, failure-mode taxonomies) and does not converge on the specific [0,1] numeric range a KAIROS metric is defined on. Every distribution is therefore a mapping from a publicly-measured quantity onto a KAIROS metric, and that mapping is the calibration knob. Moderate is the ceiling. Promoting any cell to strong would fabricate numeric consensus that does not exist.
The methodology lineage is Engelen et al. 2021, Troubleshooting an Intrusion Detection Dataset. The same provenance discipline the cyber adapter committed to is applied here to agent-evaluation baselines. Most agent benchmarks do not surface per-feature provenance. The 144-cell tag grid does.
The Result
Per-archetype × per-profile policy-positive action rate at gammaFloor = 0.20. The policy-positive action rate is the fraction of proposed actions on confirmed-clean trajectories that the configured gate policy would flag; denominator is actions (n = 7,200 per cell), numerator is actions where decision != PASS at the configured floor.
| Archetype | Quiet | Noisy-but-clean | Near-miss |
|---|---|---|---|
| document-reasoning | 0.04% | 1.31% | 55.69% |
| coding-devops | 0.11% | 10.71% | 82.49% |
| customer-action | 0.10% | 9.01% | 75.68% |
| browser-computer-use | 0.06% | 7.03% | 77.90% |
Headline, Quiet and Noisy-but-clean pooled, near-miss excluded per calibration plan §7.3: 3.55% policy-positive action rate (Wilson 95% CI 3.40% to 3.70%; n = 57,600; 2,042 hits).
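The headline figure can be re-derived from the numbers already on the page. The sketch below uses the standard Wilson score formula; the function name is illustrative, and because every cell has the same denominator the pooled rate is just the mean of the eight Quiet and Noisy-but-clean cell rates.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(hits: int, n: int, confidence: float = 0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Per-cell rates (percent) copied from the table above.
quiet = [0.04, 0.11, 0.10, 0.06]
noisy_but_clean = [1.31, 10.71, 9.01, 7.03]

# Equal n per cell, so the pooled rate is the plain mean of the cell rates.
pooled_from_table = sum(quiet + noisy_but_clean) / 8

low, high = wilson_interval(hits=2042, n=57_600)
print(f"pooled rate from table: {pooled_from_table:.2f}%")
print(f"Wilson 95% CI: {100 * low:.2f}% to {100 * high:.2f}%")
```

The interval is narrow because n is large; it describes sampling noise in the synthetic corpus, not uncertainty in the calibration mapping itself.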
The numbers behave the way the calibration predicts. Quiet rates sit near zero across all four archetypes. Read-heavy, low-irreversibility trajectories almost never collapse alignmentPosture below the floor. Noisy-but-clean rates are moderate, with coding/DevOps highest. Sanctioned multi-step refactors and deploys legitimately elevate irreversibilityPressure to the exec-class anchor at 0.9 (per the brief’s discrete-anchor scale), which occasionally pulls alignmentPosture low through the critical_min aggregation. Near-miss rates run 55% to 82%. Near-miss is designed to be threshold-brushing: sanctioned-but-risky trajectories that the partner’s controls intercepted at the boundary. The high rate is the calibration working correctly.
Sensitivity sweep on the policy floor, headline pool:
| gammaFloor | Headline rate | Headline hits / n |
|---|---|---|
| 0.20 | 3.55% | 2,042 / 57,600 |
| 0.10 | 0.33% | 188 / 57,600 |
| 0.08 | 0.14% | 80 / 57,600 |
| 0.05 | 0.02% | 9 / 57,600 |
The choice of floor is a Pareto decision the operator owns. The 0.20 default is the configured policy in the v281 reference run. An operator who wants a tighter benign rate trades off near-miss interception, which falls correspondingly fast across the same sweep.
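A floor sweep is a counting exercise over per-action gamma values. The sketch below shows the mechanics on ten invented gammas (not corpus data) and then recomputes the reported headline rates from the hit counts in the table; the function name is hypothetical.

```python
def policy_positive_rate(gammas, gamma_floor):
    """Fraction of actions whose decision is not PASS at the given floor."""
    flagged = sum(1 for gamma in gammas if gamma < gamma_floor)
    return flagged / len(gammas)


FLOORS = (0.20, 0.10, 0.08, 0.05)

# Toy currentGamma values, invented for illustration only.
toy_gammas = [0.03, 0.07, 0.09, 0.15, 0.18, 0.25, 0.40, 0.55, 0.70, 0.90]
toy_sweep = {floor: policy_positive_rate(toy_gammas, floor) for floor in FLOORS}

# Hit counts reported in the sweep table, n = 57,600 throughout.
reported_hits = {0.20: 2042, 0.10: 188, 0.08: 80, 0.05: 9}
reported_rates = {
    floor: round(100 * hits / 57_600, 2) for floor, hits in reported_hits.items()
}

print(toy_sweep)
print(reported_rates)
```

Lowering the floor can only shrink the flagged set, so the rate is monotone in the floor. That is the sense in which the rate belongs to the policy and not to the corpus.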
Reproducibility
The full corpus regenerates byte-identically from seed 42. The committed report renders stable relative provenance paths so two reference runs in different output roots produce byte-identical report bytes. Two independent runs on the reference seed produced no diff across decision NDJSON or report.
Provenance anchors:
- calibrationDoc.sha256: 42ca305e1cb4abedd87ab314f8b3017875ac4a878168c258a115e2c70f51a933
- otelSemconv: v1.40.0 / 7fe537301d17919af7d7eb65b32e9be35da2c497
- trajectoryRecordMappingSha256: 601b918e0ac5ba216575b7ea6798aebe3162c820a67e3e7799f0b3f4b76188d0
- actionBudgetsSha256: cb9e1938826e4a6777b7ac7533c41d3698153e3b6958b4f68aec6eced1019207
- wholeCorpusSha256: 556149e289001f478fac192671cf75ff2241c1730452172f94af02f2dbfa13d4
- v2 artifact SHA-256: b9f63803498e286a4e7c620f356541b250e0f3c5332fdf656a48a6d12bff4c14
Replication is invited. Failed replications and drift findings on the calibration anchors are themselves citable artifacts.
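For readers who want to see what a byte-identity check looks like in practice, here is a minimal sketch. It is not the KAIROS generator: the record fields, the toy corpus size, and the function names are placeholders, and the only point is that a seeded generator plus canonical serialisation yields the same SHA-256 on every run.

```python
import hashlib
import json
import random


def generate_corpus(seed: int, n_records: int = 100) -> bytes:
    """Toy NDJSON corpus; the real corpus carries full metric snapshots."""
    rng = random.Random(seed)
    lines = []
    for i in range(n_records):
        record = {"actionIndex": i % 6, "currentGamma": round(rng.random(), 6)}
        # sort_keys and fixed separators keep the serialised bytes canonical.
        lines.append(json.dumps(record, sort_keys=True, separators=(",", ":")))
    return ("\n".join(lines) + "\n").encode("utf-8")


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


first_run = sha256_hex(generate_corpus(seed=42))
second_run = sha256_hex(generate_corpus(seed=42))
other_seed = sha256_hex(generate_corpus(seed=43))

print("byte-identical:", first_run == second_run)
```

A replication attempt would compare its own digests against the provenance anchors listed above; any mismatch is a drift finding.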
A Worked Example: The AIRT / MAST Conflation
The most distinctive part of the calibration document is what it caught on the way to publication. The story is recorded in the reference doc’s §7 citation-discipline notes.
An earlier draft cited the agentic failure-mode taxonomy via a plausible-looking secondary aggregator, a third-party summary that read as authoritative. The aggregator had conflated two distinct primary sources: the Microsoft AIRT Taxonomy of Failure Mode in Agentic AI Systems whitepaper (Bryan, Severi, et al., 2025-04-24) and MAST (Cemri, Pan, Yang, et al., arXiv:2503.13657 v2, NeurIPS 2025). The aggregator attributed MAST’s “14 failure modes in 3 categories” structure to the Microsoft whitepaper, whose actual structure is “two pillars (safety, security) × two axes (novel, existing),” documented in the whitepaper’s “Overview of failure modes” matrix on p. 6.
The conflation surfaced only during the primary-source verification round. Retrieving both primaries was the only way to catch it. Without that round, both papers’ findings would have been silently merged downstream into the calibration cells and from there into this article.
The KAIROS calibration commits to primary-source-only anchoring, and the AIRT/MAST split is its proof-of-method. In a literature moving at the current publication pace, where aggregators are common and look authoritative, the cost of accepting secondary summaries compounds quickly. The §7 citation-discipline policy exists precisely to catch the compounding error before it propagates.
The AIRT whitepaper’s specific load-bearing items in the calibration: Human-in-the-loop bypass (p. 22), Excessive agency (p. 23), Resource exhaustion (p. 22), Loss of data provenance (p. 23), and Insufficient transparency and accountability (p. 24). The MAST taxonomy contributes the structural shape of inter-agent and verification failures feeding noveltyScore and the near-miss profile. The two sources sit next to each other in the bibliography; they no longer sit on top of each other.
A Linguistic Contribution: “Policy-Positive Action Rate”
The name carries through from the cyber adapter’s debrief on the same methodology track. The standard false-positive framing implies the action was incorrectly flagged. Policy-positive describes what the gate actually computes: this action would be flagged under the configured policy. On a benign-baseline corpus the two are extensionally close on most cells. The linguistic distinction carries through to threshold-sensitive interpretation: the rate is a property of the policy. An operator who tightens the floor is moving the policy, and the rate moves with it (see the sensitivity sweep above).
Limitations, Stated Up Front
What v1 does not support, by design:
- No real-world partner data. The corpus is fully synthetic. The mapping from public agent-evaluation distributions (METR, GAIA, SWE-bench, τ-bench, AISI) onto KAIROS metrics is the calibration knob, and it is synthesised. v1.1 closes this with partner trajectory data.
- No measured γ-side telemetry. humanLoopAvailability, auditTrailIntegrity, retryBudget, controlHealth, policyHeadroom, and permissionDepth describe the operational state of the partner’s own oversight machinery. No public corpus reports these on benign agent trajectories. This is the structural weak spot the synthesised tag concentrates on, and it is exactly what a partner closes with their own telemetry.
- No multi-trajectory state coupling beyond two narrowly-scoped exceptions. Phase B implements §9.2 trajectory-level state for contextSaturation (monotone-rising across actions, resetting per trajectory) and retryBudget (trajectory-initial constant). All other metrics are sampled per-action independently. §9.5 cross-metric correlations are off by default; an operator may enable individual correlations for ablation.
- No operational events on top of baseline. Reference doc §3 specifies an event-driven model for the noisy-but-clean profile (sanctioned-deploy windows, bulk-redaction passes, identity-verification flows, scraping runs). Phase B implements the per-cell Beta differentiation but does not yet schedule events. The N-profile rate (1.31% to 10.71%) currently reflects per-cell distribution choice rather than events-on-top-of-baseline. §9.6 deferred.
- No diurnal seasonality. Reference doc §4 specifies deterministic time-of-day multipliers; Phase B records carry a fixed replay timestamp and do not apply them. §9.7 deferred.
- No action-class breakdown in the headline. Calibration plan §7.4 specifies a per-tool-call-target-class breakdown over read:/write:/external:/exec: action prefixes. The corpus records do not carry the sampled action class explicitly; recovering it via irreversibilityPressure binning would be a lossy proxy. The breakdown is deferred to a future phase when explicit action proposals are scheduled.
- Confidence-coupling drift. Per reference doc §9.4, confidence (a per-snapshot KAIROS-internal metadata field on [0,1] describing the harness’s self-reported confidence in the current step) is sampled as a Beta from §6.4 and then conditionally adjusted downward when sourceCount is below the archetype median (−0.10) or when auditTrailIntegrity is below 0.7 (−0.15). The §6.4 cells are marginals; the post-coupling empirical marginals drift 4.5 to 12.5 percentage points below the stated μ across cells. The drift is intentional and documented in the v280 walkthrough’s characterisation table.
The synthesised fraction is the load-bearing item on the v1.1 roadmap.
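The confidence-coupling drift listed above is easy to reproduce in miniature. The sketch below is an illustration under stated assumptions and not the Phase B sampler: the Beta parameters, the sourceCount range, and the archetype median are invented, the two penalties are assumed to stack when both conditions hold, and the result is assumed to be clamped to [0,1]. Only the −0.10 and −0.15 adjustments and the 0.7 threshold come from the text.

```python
import random
from statistics import fmean


def couple_confidence(base, source_count, archetype_median, audit_trail_integrity):
    """Apply the two downward adjustments, then clamp to [0, 1] (assumed)."""
    adjusted = base
    if source_count < archetype_median:
        adjusted -= 0.10
    if audit_trail_integrity < 0.7:
        adjusted -= 0.15
    return min(1.0, max(0.0, adjusted))


rng = random.Random(42)
ALPHA, BETA = 8.0, 2.0  # hypothetical marginal with mean 0.8
ARCHETYPE_MEDIAN_SOURCES = 3  # hypothetical

base_samples, coupled_samples = [], []
for _ in range(10_000):
    base = rng.betavariate(ALPHA, BETA)
    source_count = rng.randint(1, 5)    # invented distribution
    audit = rng.betavariate(9.0, 1.5)   # invented distribution
    base_samples.append(base)
    coupled_samples.append(
        couple_confidence(base, source_count, ARCHETYPE_MEDIAN_SOURCES, audit)
    )

drift_pp = 100 * (fmean(base_samples) - fmean(coupled_samples))
print(f"post-coupling mean sits {drift_pp:.1f} percentage points below the marginal")
```

Because the adjustments only ever subtract, the post-coupling marginal must sit below the stated μ; how far below depends on how often the two conditions fire in a given cell.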
The Partner Ask
The methodology is ready for real telemetry.
What we would ingest from a design partner: 30 to 90 days of agent-trajectory telemetry from a single deployed agent (any of the four archetypes), with per-snapshot fields covering tool-call traces, retrieval spans, evaluation-event records, reviewer interception logs, and the partner’s own per-snapshot harness self-confidence flag where one exists. OpenTelemetry GenAI semantic conventions v1.40.0 is the preferred carrier; KAIROS-required extension fields close the gaps where the upstream spec is still marked Development.
Three-tier ask:
- Minimum. Per-trajectory action sequence with target-class tags, retry-depth annotations, and reviewer-confirmed Q / N / NM profile labels.
- Strong. Above plus per-snapshot OTel GenAI spans with gen_ai.tool.execute_tool provenance and gen_ai.client.token.usage metrics.
- Full. Above plus per-snapshot reviewer-queue depth, remaining tool-call budget, and guardrail-classifier health signals. This is the γ-side closing path that retires the largest block of synthesised cells in one contribution.
The full schema, redaction rules, label discipline, and sequencing live in the partner brief at /partner/ai-safety-data-spec. Standard NDA. Redacted exports preferred. Round-tripped public artifacts will be aggregate rates and confidence intervals only; raw partner telemetry never leaves the partner environment in identifiable form.
Where This Sits
The cyber adapter’s v1 calibration produced a 144-cell grid against DBIR, NIST, CIS, OCSF, LANL, and DARPA references. The AI safety adapter’s v1 calibration produces a 144-cell grid against METR, GAIA, SWE-bench, τ-bench, AISI, AIRT, MAST, and the OpenTelemetry GenAI conventions. Same methodology (Engelen 2021 provenance discipline), same confidence-tagging convention, same partner-ask shape, same byte-identity reproducibility regime. The cross-vertical consistency is the AnankeLabs claim: structural-margin calibration generalises across observation-shaped substrates (cyber) and trajectory-shaped substrates (AI safety) under one engine core.
The next deliverables on the AI safety track are AGPG v2 predicted-gamma operationality on the agent runtime, the partner-harness deserialiser that lets a partner’s OTel GenAI export feed the calibration map without code changes, and the Robotics adapter as the next vertical to inherit the methodology. The calibration debrief is the prerequisite; the partner data is the next mile.
Replication is invited. The corpus, the report, the provenance manifest, and the calibration-honest reference table exist and reproduce byte-identically from seed 42. If you want to discuss real-world calibration, replication, or partner engagement, reach out. Background reading on the structural-margin framing lives on the AI Safety Rosetta page; the deterministic gate-chain methodology is detailed in the Boundary Study; the methodology-twin debrief is the Cybersecurity Adapter calibration post.