Calibrating the AI Safety Adapter

This is a debrief.

For the last several weeks the AnankeLabs AI safety track has been calibrating the KAIROS AI safety adapter against the public agent-evaluation literature and running its first end-to-end synthetic baseline. The work happened against a fast-moving backdrop. Agentic AI is being deployed faster than the oversight machinery around it has been measured. Whether the gates and reviewers around an agent are structurally sound was being answered by deployment velocity rather than by a number.

KAIROS is a structural reading layer that returns the buffer an agent’s alignment posture has before the weakest control collapses. The reading is geometric, computed per proposed action against the configured gate policy, and is independent of signature-based detection of harmful intent. The benign-baseline question this debrief answers is the one a design partner asks first: on confirmed-clean trajectories, what fraction of proposed actions does the gate flag under a stated policy.

The AI safety adapter now has a calibrated, reproducible benign baseline anchored to public agent-evaluation literature. Of its cells, 27% are honestly tagged as synthesised because no public source reports the underlying quantity (operational HITL availability, harness self-confidence, retry-budget consumption). Replacing that synthesised fraction with partner data is the closing path for v1.1.

The Adapter In A Nutshell

The AI safety adapter treats each proposed agent action as a structural object under load. Two primary aggregates carry the reading.

capabilityPressure is the maximum over six metrics: objectiveDrive (how strongly the agent is committing to the proposed action class), toolPowerGradient (the read→write→external→exec elevation between current and proposed call), irreversibilityPressure (the discrete reversibility class of the action), noveltyScore (distance from the archetype’s typical trajectory step), escalationVelocity (rate of accumulating high-power actions across the trajectory), and contextSaturation (token-budget headroom remaining in the harness). Under critical_max aggregation, a single severe capability indicator drives the action score. A higher capabilityPressure means the agent is exerting more capability into the proposed action.

alignmentPosture is the minimum over six metrics: policyHeadroom (distance from the configured capability-threshold ceiling), permissionDepth (remaining scope under the agent’s grant), humanLoopAvailability (reviewer capacity currently available to absorb an escalation; up means more capacity, not more escalations), auditTrailIntegrity (completeness of the OTel span and log record for the action), retryBudget (remaining tool-call budget across the trajectory), and controlHealth (the guardrail-classifier’s own self-reported confidence on the current snapshot). One collapsed control drives the buffer by critical_min aggregation. This is the quantity the gate’s gammaFloor policy compares against.

The unit of evaluation is one AgentMetricComponents record per trajectory × action, projected to a runtime MetricSnapshot. The unit of output is a state-gate decision: PASS when currentGamma >= gammaFloor, otherwise REJECT_STATE. The v281 v2 calibration artifact treats capabilityPressure and alignmentPosture as primary proxies; the twelve substantive metrics underneath are secondary proxies, exposed in the snapshot for ablation and partner mapping.
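The two aggregates and the gate rule described above can be sketched in a few lines. This is a minimal illustration, not the KAIROS engine: the function names and the example metric values are hypothetical, and it assumes that currentGamma is exactly the alignmentPosture aggregate, which the text implies but does not state outright.

```python
CAPABILITY_METRICS = (
    "objectiveDrive", "toolPowerGradient", "irreversibilityPressure",
    "noveltyScore", "escalationVelocity", "contextSaturation",
)
ALIGNMENT_METRICS = (
    "policyHeadroom", "permissionDepth", "humanLoopAvailability",
    "auditTrailIntegrity", "retryBudget", "controlHealth",
)


def capability_pressure(metrics: dict) -> float:
    """critical_max: the single most severe capability indicator sets the aggregate."""
    return max(metrics[name] for name in CAPABILITY_METRICS)


def alignment_posture(metrics: dict) -> float:
    """critical_min: the single weakest control sets the buffer."""
    return min(metrics[name] for name in ALIGNMENT_METRICS)


def gate_decision(metrics: dict, gamma_floor: float = 0.20) -> str:
    # Assumption: currentGamma is the alignmentPosture aggregate.
    current_gamma = alignment_posture(metrics)
    return "PASS" if current_gamma >= gamma_floor else "REJECT_STATE"


# Hypothetical snapshots, all metrics on [0, 1].
healthy = {
    **{name: 0.3 for name in CAPABILITY_METRICS},
    **{name: 0.8 for name in ALIGNMENT_METRICS},
}
# One collapsed control (controlHealth) and one severe capability indicator.
degraded = {**healthy, "controlHealth": 0.12, "irreversibilityPressure": 0.9}

print(gate_decision(healthy), gate_decision(degraded))
```

In the degraded snapshot five of the six controls are still healthy, yet the single collapsed one decides the outcome. That is the behaviour the critical_min description calls for.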

The AI safety adapter routes evaluation through a trajectory-shaped runtime (per-action snapshots that carry trajectory provenance), distinct from the cyber adapter’s observation-shaped runtime. Trajectory-shaping is structurally right for agent behavior because actions propagate state across the trajectory; observation-shaping is structurally right for cyber because detection is a snapshot-against-thresholds problem. Same engine core; different runtime envelope.

The Reference Run

Parameter	Value
Seed	42
Archetypes	4 (document-reasoning, coding-devops, customer-action, browser-computer-use)
Profiles	3 (quiet, noisy-but-clean, near-miss)
Trajectories per cell	1,200
Actions per trajectory	6
Snapshots	86,400
Wall-clock (end-to-end)	99s
Rust replay over 86,400 records	92s
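The snapshot count follows from the grid dimensions in the table. The sketch below enumerates the archetype × profile grid and re-derives the totals; the variable names are illustrative and not taken from the KAIROS codebase.

```python
from itertools import product

ARCHETYPES = (
    "document-reasoning", "coding-devops",
    "customer-action", "browser-computer-use",
)
PROFILES = ("quiet", "noisy-but-clean", "near-miss")
TRAJECTORIES_PER_CELL = 1_200
ACTIONS_PER_TRAJECTORY = 6

# One cell per (archetype, profile) pair.
cells = list(product(ARCHETYPES, PROFILES))
actions_per_cell = TRAJECTORIES_PER_CELL * ACTIONS_PER_TRAJECTORY
total_snapshots = len(cells) * actions_per_cell

print(len(cells), actions_per_cell, total_snapshots)  # 12 7200 86400
```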

Cell distributions were anchored to public references: METR’s time-horizon work (Kwa, West, Becker et al., arXiv:2503.14499, 2025-03-19) and the subsequent Time Horizon 1.1 release (2026-01-29), GAIA (Mialon et al., arXiv:2311.12983, 2023-11), SWE-bench Verified (Epoch AI methodology), τ-bench (Yao et al., arXiv:2406.12045, 2024-06), the τ²-bench / Cleanlab escalation case study, AISI’s transcript analysis (Assuring Agent Safety Evaluations By Analysing Transcripts, 2025-10), the Microsoft AIRT Taxonomy of Failure Mode in Agentic AI Systems whitepaper (Bryan, Severi, et al., 2025-04-24), MAST (Cemri, Pan, Yang, et al., arXiv:2503.13657, NeurIPS 2025), and the OpenTelemetry GenAI semantic conventions v1.40.0 (status Development, 2026-04). All retrieval dates 2026-05.

Each of the 144 substantive cells carries a confidence tag with the same semantics used in the cyber calibration. The v1 distribution across those tags is 0% strong / 26% moderate / 47% weak / 27% synthesised.

The 0-strong result is the most important honesty signal in the document, and it is a structural property of the source material. Public agent-evaluation data converges on distribution shape (task-length distributions, step-count distributions, failure-mode taxonomies) and does not converge on the specific [0,1] numeric range a KAIROS metric is defined on. Every distribution is therefore a mapping from a publicly-measured quantity onto a KAIROS metric, and that mapping is the calibration knob. Moderate is the ceiling. Promoting any cell to strong would fabricate numeric consensus that does not exist.

The methodology lineage is Engelen et al. 2021, Troubleshooting an Intrusion Detection Dataset. The same provenance discipline the cyber adapter committed to is applied here to agent-evaluation baselines. Most agent benchmarks do not surface per-feature provenance. The 144-cell tag grid does.

The Result

Per-archetype × per-profile policy-positive action rate at gammaFloor = 0.20. The policy-positive action rate is the fraction of proposed actions on confirmed-clean trajectories that the configured gate policy would flag; denominator is actions (n = 7,200 per cell), numerator is actions where decision != PASS at the configured floor.

Archetype	Quiet	Noisy-but-clean	Near-miss
document-reasoning	0.04%	1.31%	55.69%
coding-devops	0.11%	10.71%	82.49%
customer-action	0.10%	9.01%	75.68%
browser-computer-use	0.06%	7.03%	77.90%

Headline, Quiet and Noisy-but-clean pooled, near-miss excluded per calibration plan §7.3: 3.55% policy-positive action rate (Wilson 95% CI 3.40% to 3.70%; n = 57,600; 2,042 hits).
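The headline figure can be re-derived from the table. The sketch below recovers per-cell hit counts from the published percentages, pools the Quiet and Noisy-but-clean columns, and computes a Wilson score interval. Recovering hits from rounded percentages is an approximation, and z = 1.96 is assumed for the 95% level; the helper name is illustrative.

```python
import math

ACTIONS_PER_CELL = 7_200

# (Quiet %, Noisy-but-clean %) per archetype, copied from the table above.
RATES_PCT = {
    "document-reasoning": (0.04, 1.31),
    "coding-devops": (0.11, 10.71),
    "customer-action": (0.10, 9.01),
    "browser-computer-use": (0.06, 7.03),
}


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion."""
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Approximate per-cell hits from the rounded percentages, then pool.
hits = sum(
    round(pct / 100 * ACTIONS_PER_CELL)
    for pair in RATES_PCT.values()
    for pct in pair
)
n = ACTIONS_PER_CELL * 2 * len(RATES_PCT)
low, high = wilson_interval(hits, n)

print(f"{hits} / {n} = {100 * hits / n:.2f}% "
      f"(Wilson 95% CI {100 * low:.2f}% to {100 * high:.2f}%)")
```

The reconstruction lands on the published 2,042 hits over 57,600 actions, and the interval rounds to 3.40% to 3.70%.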

The numbers behave the way the calibration predicts. Quiet rates sit near zero across all four archetypes. Read-heavy, low-irreversibility trajectories almost never collapse alignmentPosture below the floor. Noisy-but-clean rates are moderate, with coding-devops highest. Sanctioned multi-step refactors and deploys legitimately elevate irreversibilityPressure to the exec-class anchor at 0.9 (per the brief’s discrete-anchor scale), which occasionally pulls alignmentPosture low through the critical_min aggregation. Near-miss rates run from 55.69% to 82.49%. Near-miss is designed to be threshold-brushing: sanctioned-but-risky trajectories that the partner’s controls intercepted at the boundary. The high rate is the calibration working correctly.

Sensitivity sweep on the policy floor, headline pool:

gammaFloor	Headline rate	Headline hits / n
0.20	3.55%	2,042 / 57,600
0.10	0.33%	188 / 57,600
0.08	0.14%	80 / 57,600
0.05	0.02%	9 / 57,600
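The sweep rows can be checked the same way. This sketch recomputes each headline rate from its hit count and confirms that the rate never rises as the floor is lowered. It uses only the published hits and n; the variable names are illustrative.

```python
N = 57_600

# (gammaFloor, headline hits) from the sweep table, highest floor first.
SWEEP = [(0.20, 2042), (0.10, 188), (0.08, 80), (0.05, 9)]

rates = [(floor, 100 * hits / N) for floor, hits in SWEEP]
for floor, rate in rates:
    print(f"gammaFloor={floor:.2f}  rate={rate:.2f}%")

# Lowering the floor can only remove flags, so the rate must not rise.
monotone = all(a[1] >= b[1] for a, b in zip(rates, rates[1:]))
print("monotone non-increasing:", monotone)
```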

The choice of floor is a Pareto decision the operator owns. The 0.20 default is the configured policy in the v281 reference run. An operator who wants a tighter benign rate trades off near-miss interception, which falls correspondingly fast across the same sweep.

Reproducibility

The full corpus regenerates byte-identically from seed 42. The committed report renders stable relative provenance paths, so two reference runs in different output roots produce byte-identical reports. Two independent runs on the reference seed produced no diff in either the decision NDJSON or the report.

Provenance anchors:

calibrationDoc.sha256: 42ca305e1cb4abedd87ab314f8b3017875ac4a878168c258a115e2c70f51a933
otelSemconv: v1.40.0 / 7fe537301d17919af7d7eb65b32e9be35da2c497
trajectoryRecordMappingSha256: 601b918e0ac5ba216575b7ea6798aebe3162c820a67e3e7799f0b3f4b76188d0
actionBudgetsSha256: cb9e1938826e4a6777b7ac7533c41d3698153e3b6958b4f68aec6eced1019207
wholeCorpusSha256: 556149e289001f478fac192671cf75ff2241c1730452172f94af02f2dbfa13d4
v2 artifact SHA-256: b9f63803498e286a4e7c620f356541b250e0f3c5332fdf656a48a6d12bff4c14
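A replicator can check a regenerated artifact against one of these anchors with a streaming SHA-256. The sketch below is a generic file-hash check using Python's standard hashlib. It assumes each anchor is the digest of a single file's bytes; how wholeCorpusSha256 is assembled across files is not specified here, and the file path in the usage note is hypothetical.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the lowercase hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_anchor(path: str, expected_hex: str) -> bool:
    """True when the file's digest matches the published anchor."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `verify_anchor("out/calibration.md", "42ca305e...")` would be called with the full 64-character calibrationDoc.sha256 value. Two runs in different output roots should return True against the same anchor.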

Replication is invited. Failed replications and drift findings on the calibration anchors are themselves citable artifacts.

A Worked Example: The AIRT / MAST Conflation

The most distinctive part of the calibration document is what it caught on the way to publication. The story is recorded in the reference doc’s §7 citation-discipline notes.

An earlier draft cited the agentic failure-mode taxonomy via a plausible-looking secondary aggregator, a third-party summary that read as authoritative. The aggregator had conflated two distinct primary sources: the Microsoft AIRT Taxonomy of Failure Mode in Agentic AI Systems whitepaper (Bryan, Severi, et al., 2025-04-24) and MAST (Cemri, Pan, Yang, et al., arXiv:2503.13657 v2, NeurIPS 2025). The aggregator attributed MAST’s “14 failure modes in 3 categories” structure to the Microsoft whitepaper, whose actual structure is “two pillars (safety, security) × two axes (novel, existing),” documented in the whitepaper’s “Overview of failure modes” matrix on p. 6.

The conflation surfaced only during the primary-source verification round. Retrieving both primaries was the only way to catch it. Without that round, both papers’ findings would have been silently merged downstream into the calibration cells and from there into this article.

The KAIROS calibration commits to primary-source-only anchoring, and the AIRT/MAST split is its proof-of-method. In a literature moving at the current publication pace, where aggregators are common and look authoritative, the cost of accepting secondary summaries compounds quickly. The §7 citation-discipline policy exists precisely to catch that error before it propagates.

The AIRT whitepaper’s load-bearing items in the calibration are Human-in-the-loop bypass (p. 22), Excessive agency (p. 23), Resource exhaustion (p. 22), Loss of data provenance (p. 23), and Insufficient transparency and accountability (p. 24). The MAST taxonomy contributes the structural shape of inter-agent and verification failures, which feeds noveltyScore and the near-miss profile. The two sources sit next to each other in the bibliography; they no longer sit on top of each other.

A Linguistic Contribution: “Policy-Positive Action Rate”

The name carries through from the cyber adapter’s debrief on the same methodology track. The standard false-positive framing implies the action was incorrectly flagged. Policy-positive describes what the gate actually computes: this action would be flagged under the configured policy. On a benign-baseline corpus the two are extensionally close on most cells. The linguistic distinction carries through to threshold-sensitive interpretation: the rate is a property of the policy. An operator who tightens the floor is moving the policy, and the rate moves with it (see the sensitivity sweep above).

Limitations, Stated Up Front

What v1 does not support, by design:

No real-world partner data. The corpus is fully synthetic. The mapping from public agent-evaluation distributions (METR, GAIA, SWE-bench, τ-bench, AISI) onto KAIROS metrics is the calibration knob, and it is synthesised. Partner trajectory data closes this in v1.1.
No measured γ-side telemetry. humanLoopAvailability, auditTrailIntegrity, retryBudget, controlHealth, policyHeadroom, and permissionDepth describe the operational state of the partner’s own oversight machinery. No public corpus reports these on benign agent trajectories. This is the structural weak spot the synthesised tag concentrates on, and it is exactly what a partner closes with their own telemetry.
No multi-trajectory state coupling beyond two narrowly-scoped exceptions. Phase B implements §9.2 trajectory-level state for contextSaturation (monotone-rising across actions, resetting per trajectory) and retryBudget (trajectory-initial constant). All other metrics are sampled per-action independently. §9.5 cross-metric correlations are off by default; an operator may enable individual correlations for ablation.
No operational events on top of baseline. Reference doc §3 specifies an event-driven model for the noisy-but-clean profile (sanctioned-deploy windows, bulk-redaction passes, identity-verification flows, scraping runs). Phase B implements the per-cell Beta differentiation but does not yet schedule events. The N-profile rate (1.31% to 10.71%) currently reflects per-cell distribution choice rather than events-on-top-of-baseline. §9.6 deferred.
No diurnal seasonality. Reference doc §4 specifies deterministic time-of-day multipliers; Phase B records carry a fixed replay timestamp and do not apply them. §9.7 deferred.
No action-class breakdown in the headline. Calibration plan §7.4 specifies a per-tool-call-target-class breakdown over read: / write: / external: / exec: action prefixes. The corpus records do not carry the sampled action class explicitly; recovering it via irreversibilityPressure binning would be a lossy proxy. The breakdown is deferred to a future phase when explicit action proposals are scheduled.
Confidence-coupling drift. Per reference doc §9.4, confidence (a per-snapshot KAIROS-internal metadata field on [0,1] describing the harness’s self-reported confidence in the current step) is sampled as a Beta from §6.4 and then conditionally adjusted downward when sourceCount is below the archetype median (−0.10) or when auditTrailIntegrity is below 0.7 (−0.15). The §6.4 cells are marginals; the post-coupling empirical marginals drift 4.5 to 12.5 percentage points below the stated μ across cells. The drift is intentional and documented in the v280 walkthrough’s characterisation table.
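The confidence-coupling item above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the §9.4 implementation: it assumes the two downward adjustments stack when both conditions hold and that the result is clipped to [0,1]. The Beta parameters, the sourceCount range, the archetype median, and the auditTrailIntegrity range are all hypothetical.

```python
import random


def couple_confidence(base: float, source_count: int,
                      archetype_median_sources: float,
                      audit_trail_integrity: float) -> float:
    """Apply the two conditional downward adjustments, then clip to [0, 1]."""
    adjusted = base
    if source_count < archetype_median_sources:
        adjusted -= 0.10
    if audit_trail_integrity < 0.7:
        adjusted -= 0.15  # assumption: stacks with the adjustment above
    return min(1.0, max(0.0, adjusted))


rng = random.Random(42)
ALPHA, BETA = 8.0, 2.0   # hypothetical Beta marginal, mean 0.8
MEDIAN_SOURCES = 3       # hypothetical archetype median

before, after = [], []
for _ in range(10_000):
    base = rng.betavariate(ALPHA, BETA)
    source_count = rng.randint(1, 5)          # hypothetical range
    audit = rng.uniform(0.5, 1.0)             # hypothetical range
    before.append(base)
    after.append(couple_confidence(base, source_count, MEDIAN_SOURCES, audit))

# Gap between the sampled marginal mean and the post-coupling mean.
drift_pp = 100 * (sum(before) - sum(after)) / len(before)
print(f"post-coupling mean sits {drift_pp:.1f} percentage points below the marginal")
```

Because the adjustments only ever subtract, the post-coupling mean always falls below the sampled marginal. The size of the gap depends on how often each condition fires, which is why it varies from cell to cell.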

The synthesised fraction is the load-bearing item on the v1.1 roadmap.

The Partner Ask

The methodology is ready for real telemetry.

What we would ingest from a design partner: 30 to 90 days of agent-trajectory telemetry from a single deployed agent (any of the four archetypes), with per-snapshot fields covering tool-call traces, retrieval spans, evaluation-event records, reviewer interception logs, and the partner’s own per-snapshot harness self-confidence flag where one exists. The OpenTelemetry GenAI semantic conventions v1.40.0 are the preferred carrier; KAIROS-required extension fields close the gaps where the upstream spec is still marked Development.

Three-tier ask:

Minimum. Per-trajectory action sequence with target-class tags, retry-depth annotations, and reviewer-confirmed Q / N / NM profile labels.
Strong. Above plus per-snapshot OTel GenAI spans with gen_ai.tool.execute_tool provenance and gen_ai.client.token.usage metrics.
Full. Above plus per-snapshot reviewer-queue depth, remaining tool-call budget, and guardrail-classifier health signals. This is the γ-side closing path that retires the largest block of synthesised cells in one contribution.

The full schema, redaction rules, label discipline, and sequencing live in the partner brief at /partner/ai-safety-data-spec. Standard NDA. Redacted exports preferred. Round-tripped public artifacts will be aggregate rates and confidence intervals only; raw partner telemetry never leaves the partner environment in identifiable form.

Where This Sits

The cyber adapter’s v1 calibration produced a 144-cell grid against DBIR, NIST, CIS, OCSF, LANL, and DARPA references. The AI safety adapter’s v1 calibration produces a 144-cell grid against METR, GAIA, SWE-bench, τ-bench, AISI, AIRT, MAST, and the OpenTelemetry GenAI conventions. Same methodology (Engelen 2021 provenance discipline), same confidence-tagging convention, same partner-ask shape, same byte-identity reproducibility regime. The cross-vertical consistency is the AnankeLabs claim: structural-margin calibration generalises across observation-shaped substrates (cyber) and trajectory-shaped substrates (AI safety) under one engine core.

The next deliverables on the AI safety track are AGPG v2 predicted-gamma operationality on the agent runtime, the partner-harness deserialiser that lets a partner’s OTel GenAI export feed the calibration map without code changes, and the Robotics adapter as the next vertical to inherit the methodology. The calibration debrief is the prerequisite; the partner data is the next mile.

Replication is invited. The corpus, the report, the provenance manifest, and the calibration-honest reference table exist and reproduce byte-identically from seed 42. If you want to discuss real-world calibration, replication, or partner engagement, reach out. Background reading on the structural-margin framing lives on the AI Safety Rosetta page; the deterministic gate-chain methodology is detailed in the Boundary Study; the methodology-twin debrief is the Cybersecurity Adapter calibration post.
