AI Safety Calibration: Partner Brief

What a design-partner contribution looks like, and what comes back

Scope: AI safety adapter, real-world calibration
Updated: May 2026

About the AI Safety Adapter

KAIROS Substrate is a deterministic Rust kernel that computes the structural margin of a system under load. The AI safety adapter routes that engine into an agent-evaluation surface. It reads each proposed tool call together with the current trajectory state and returns Γ headroom: how much alignment buffer remains before the gate chain would reject the action. The verdict is hash-bound to the calibration anchors and replay-deterministic across processes. Two operators replaying the same trajectory reach the same answer.
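
A minimal Rust sketch of that contract; the names (Verdict, evaluate) and the toy gate logic are hypothetical, not the shipping API. The point is the purity property: the verdict is a function of the trajectory state and the proposed action alone.

    // Hypothetical shape of the per-action evaluation contract; names and
    // gate logic are illustrative, not the shipping adapter's API.
    #[derive(Debug, Clone, PartialEq)]
    enum Verdict {
        Accept { gamma_headroom: f64 }, // alignment buffer left before the gate chain binds
        Reject { gate: &'static str }, // which gate in the chain fired
    }

    // A pure function of (trajectory state, proposed action): this is what
    // makes the verdict replay-deterministic across processes and operators.
    fn evaluate(trajectory_state: &[u8], proposed_action: &[u8]) -> Verdict {
        // Toy stand-in for the gate chain: pressure grows with input size.
        let pressure = (trajectory_state.len() + proposed_action.len()) as f64 / 1024.0;
        let headroom = (1.0 - pressure).max(0.0);
        if headroom > 0.2 {
            Verdict::Accept { gamma_headroom: headroom }
        } else {
            Verdict::Reject { gate: "state_gate" }
        }
    }

    fn main() {
        let state = b"trajectory-step-7";
        let action = b"tool_call external:payments.refund";
        // Two replays of the same trajectory reach the same answer.
        assert_eq!(evaluate(state, action), evaluate(state, action));
    }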

This is structural early warning at the action layer, not classifier-style output filtering. It does not replace RLHF, prompt design, or a safety classifier. It complements those with a reading they were not designed to produce: reachability of the safe set per agent step. The Boundary Study fixture in our smoke corpus walks the exact sequence: capability pressure rises, alignment headroom contracts, the engine rejects the risky tool before the irreversible call lands. v1 results: 48/48 risky tool rejections, 20/20 safe completions, 0 false negatives, 0 false positives on the boundary task.

The public methodology debrief lives on the Spindle: KAIROS Boundary Study v1. Background framing for the structural-margin approach lives on the AI Safety Rosetta page.

Why Real-World Data Matters

The v1 synthetic baseline is calibrated against public agent-eval references: the Boundary Study corpus, METR’s RE-Bench, GAIA (Mialon et al., the Hugging Face / Meta / GenAI-agents collaboration), SWE-bench Verified, AgentBench, BrowseComp, MLE-bench, MITRE ATLAS, MLCommons AILuminate, and published adversarial corpora (HarmBench, AdvBench, JailbreakBench). The corpus is calibration-honest. Every cell carries a confidence tag, and a substantial fraction of cells are tagged synthesised because no public reference exists at the resolution required: long-horizon, multi-step, production-shaped agent trajectories with outcome labels. A methodology debrief on the AI safety adapter calibration is planned for the Spindle, paralleling the cybersecurity-adapter calibration note.

That synthesised fraction is the load-bearing item we want to retire. Real trajectories from a design-partner deployment let us replace the synthetic policy-positive action rate number with one calibrated against real production traffic. The headline shifts from defensible-but-dismissible to “calibrated against partner tenants, with X% drift from public benchmarks.” That drift number is itself a research result, citable in workshop and conference write-ups.

The metric, plainly: the policy-positive action rate is the fraction of proposed agent actions that the configured gate policy would flag, measured on trajectories the reviewer has labeled confirmed-clean. The complement is the action-acceptance rate. This is distinct from the Boundary Study’s risky-tool rejection accuracy, which measures whether the policy correctly catches actions that should be rejected. A partner contribution closes the gap that currently sits between benchmark trajectories and a real production agent.
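
A worked example with invented counts, to pin the definition down:

    // Illustrative only: the counts are invented. The rate is computed over
    // proposed actions in trajectories the reviewer labeled confirmed-clean.
    fn main() {
        let confirmed_clean_actions: u64 = 48_210; // denominator: reviewer-labeled clean
        let policy_flagged: u64 = 301; // of those, flagged by the configured gate policy

        let policy_positive_rate = policy_flagged as f64 / confirmed_clean_actions as f64;
        let action_acceptance_rate = 1.0 - policy_positive_rate; // the complement

        println!("policy-positive action rate: {:.4}", policy_positive_rate); // 0.0062
        println!("action-acceptance rate:      {:.4}", action_acceptance_rate); // 0.9938
    }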

What Partners Receive

  • A calibrated policy-positive action rate measured against your own trajectories, with Wilson 95% confidence intervals per agent archetype (a computation sketch follows this list).
  • Visibility into where your structural margin sits per trajectory class — which tool sequences run hot on Λ, which sit comfortably in Γ, and where each enforcement mode (observe, state_gate, state_plus_action_gate) would bind on your traffic.
  • A right-to-recall mechanism. If an incident is later identified inside a contributed window, the aggregate is re-run and the correction is disclosed to you, and to any audience the original number reached.
  • Co-authorship credit on any public methodology output (workshop note, paper, blog post) where the partner contribution is load-bearing.
  • Early access to the AI safety adapter ahead of general pilot availability.
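
The Wilson interval referenced in the first item is the standard score interval for a binomial proportion. A minimal sketch, with invented counts:

    // Wilson score interval for a binomial proportion; z = 1.96 for 95%.
    fn wilson_interval(flagged: u64, total: u64, z: f64) -> (f64, f64) {
        let n = total as f64;
        let p = flagged as f64 / n;
        let z2 = z * z;
        let denom = 1.0 + z2 / n;
        let center = (p + z2 / (2.0 * n)) / denom;
        let half = (z / denom) * (p * (1.0 - p) / n + z2 / (4.0 * n * n)).sqrt();
        (center - half, center + half)
    }

    fn main() {
        // Invented counts: 301 flagged out of 48,210 confirmed-clean actions.
        let (lo, hi) = wilson_interval(301, 48_210, 1.96);
        println!("policy-positive rate, 95% CI: [{:.5}, {:.5}]", lo, hi);
    }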

The Minimum-Viable Ask

In order:

  1. 30 to 90 days of agent-trajectory export covering one or two archetypes (whichever you have telemetry depth in).
  2. Outcome-confirmed three-class labels on the contributed windows: confirmed-clean, unreviewed-no-alert, excluded-known-incident.
  3. Incident disclosure for the contributed window, including near-misses, runaway-loop events, jailbreak successes, and customer-escalation outcomes.
  4. Permission to publish aggregate rejection-rate and margin-distribution numbers (not raw trajectories) in pilot collateral.
  5. Right-to-recall agreement as described above.

The sections below detail what each of those means in practice.

Scope of This Partnership

Three calibration experiments back the adapter pilot. Only one of them genuinely requires real-world data:

  • Trajectory-baseline calibration. The experiment that produces the headline policy-positive action rate on benign production traffic. This is the experiment a partner contribution unblocks.
  • Adversarial sweep. Synthetic and red-team variants of jailbreaks, prompt-injection chains, tool-misuse patterns, and budget-exhaustion attacks. Sufficient for v1 with an honest call-out. Real adversarial traces, if a partner can offer them, raise sweep credibility but are not a blocker.
  • Sensitivity sweep. Programmatically varies one input metric at a time (novelty, retry budget, context saturation) over its valid range against a frozen trajectory. Stays synthetic by construction; real-world data adds no information at that grid layer.

The detail below is for the trajectory-baseline experiment.

What the Data Has to Feed

Two schemas are in scope, and the distinction matters: the live runtime adapter surface and a partner calibration-harness contract derived from partner trajectories.

Enforcement modes

Throughout this brief, the three policy enforcement modes are:

  • observe — the gate evaluates every proposed action and emits a verdict, but the verdict does not block. The mode used for calibration replays.
  • state_gate — the gate rejects if Γ falls below the deployment floor.
  • state_plus_action_gate — full preview of each proposed action against the reachability field before execution.

The policy-positive action rate is most useful at observe, computed against the policy the partner intends to deploy at state_gate or state_plus_action_gate.
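
A hypothetical encoding of that split between policy and mode; the mode names follow the brief, everything else is illustrative:

    // Hypothetical encoding; mode names follow the brief, types are illustrative.
    #[allow(dead_code)]
    enum EnforcementMode {
        Observe,             // evaluate and record the verdict, never block
        StateGate,           // reject when Γ falls below the deployment floor
        StatePlusActionGate, // additionally preview each proposed action
    }

    // The configured policy flags an action independently of the mode...
    fn policy_flags(gamma: f64, floor: f64, action_preview_ok: bool, full_preview: bool) -> bool {
        gamma < floor || (full_preview && !action_preview_ok)
    }

    // ...and the mode only decides whether a flag actually blocks execution.
    fn blocks(mode: &EnforcementMode, flagged: bool) -> bool {
        flagged && !matches!(mode, EnforcementMode::Observe)
    }

    fn main() {
        // Calibration replay: observe mode, evaluated against the policy the
        // partner intends to deploy (here, the full action-preview policy).
        let flagged = policy_flags(0.12, 0.20, true, true);
        assert!(flagged); // counted in the policy-positive rate
        assert!(!blocks(&EnforcementMode::Observe, flagged)); // but nothing blocks
        assert!(blocks(&EnforcementMode::StateGate, flagged)); // would block at deploy
    }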

Runtime adapter surface (today)

The shipping kairos-llm-adapter consumes a MetricSnapshot with five domain metric keys, plus a discrete action mapper that classifies each proposed action and routes it through the engine’s state_gate / state_plus_action_gate chain.

  • Metric keys (5): capabilityIndex, alignmentScore, autonomyLevel, humanOversightFreq, guardrailCoverage
  • Action categories: generate_completion, tool_call, route_to_model, retry, escalate_to_human
  • Tool-call target classes: read: (read-only), write: (internal mutation), external: (external effect), exec: (code execution)

A partner pilot that already produces something close to these five metrics per session, plus per-action tool-call records carrying a target prefix and a retry depth, can be replayed against the runtime adapter directly.
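
A sketch of that surface; the key and category names come from the vocabulary above, while the Rust layout and the classify_target helper are assumptions rather than the shipping kairos-llm-adapter types:

    // Sketch only: key and category names come from the brief's vocabulary;
    // the layout is an assumption, not the shipping kairos-llm-adapter types.
    #[allow(dead_code)]
    struct MetricSnapshot {
        capability_index: f64,
        alignment_score: f64,
        autonomy_level: f64,
        human_oversight_freq: f64,
        guardrail_coverage: f64,
    }

    #[derive(Debug, PartialEq)]
    enum TargetClass {
        Read,     // "read:"     read-only
        Write,    // "write:"    internal mutation
        External, // "external:" external effect
        Exec,     // "exec:"     code execution
    }

    #[allow(dead_code)]
    enum ActionCategory {
        GenerateCompletion,
        ToolCall(TargetClass),
        RouteToModel,
        Retry,
        EscalateToHuman,
    }

    // The discrete action mapper routes a tool call on its target prefix.
    fn classify_target(tool_name: &str) -> Option<TargetClass> {
        match tool_name.split(':').next()? {
            "read" => Some(TargetClass::Read),
            "write" => Some(TargetClass::Write),
            "external" => Some(TargetClass::External),
            "exec" => Some(TargetClass::Exec),
            _ => None,
        }
    }

    fn main() {
        assert_eq!(classify_target("external:payments.refund"), Some(TargetClass::External));
        assert_eq!(classify_target("unknown:thing"), None);
    }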

Partner calibration-harness schema (forthcoming, derived)

For replay calibration we compute a richer 16-component view from the raw trajectory record. This is a partner-harness contract, not the runtime type — it is the schema we derive in the calibration replay path, sitting in front of the runtime MetricSnapshot.

  • λ · Capability pressure (6): objectiveDrive, toolPowerGradient, irreversibilityPressure, noveltyScore, escalationVelocity, contextSaturation
  • γ · Alignment constraints (6): policyHeadroom, permissionDepth, humanLoopAvailability, auditTrailIntegrity, retryBudget, controlHealth
  • Aggregates and metadata (4): capabilityPressure (max-of-λ), alignmentPosture (min-of-γ), sourceCount, confidence

sourceCount is the count of distinct telemetry sources contributing to the snapshot for that proposed action (e.g. tool-call trace plus retrieval span plus a guardrail-classifier verdict counts as three). confidence is the harness’s per-snapshot tag, on [0, 1], marking how much of the snapshot was directly observed versus interpolated from partial telemetry.
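
An illustrative layout of the 16-component view, with the two aggregates made explicit; component names follow the table above, the Rust types and the aggregates helper are assumptions:

    // Illustrative layout; component names follow the table above, the Rust
    // types and the aggregates helper are assumptions.
    #[allow(dead_code)]
    struct CalibrationComponents {
        // λ · capability pressure (6)
        objective_drive: f64,
        tool_power_gradient: f64,
        irreversibility_pressure: f64,
        novelty_score: f64,
        escalation_velocity: f64,
        context_saturation: f64,
        // γ · alignment constraints (6)
        policy_headroom: f64,
        permission_depth: f64,
        human_loop_availability: f64,
        audit_trail_integrity: f64,
        retry_budget: f64,
        control_health: f64,
        // aggregates and metadata (4)
        capability_pressure: f64, // max over the λ side
        alignment_posture: f64,   // min over the γ side
        source_count: u32,        // distinct telemetry sources behind this snapshot
        confidence: f64,          // [0, 1]: directly observed vs interpolated
    }

    // The two aggregates as defined above: max-of-λ and min-of-γ.
    fn aggregates(lambda: [f64; 6], gamma: [f64; 6]) -> (f64, f64) {
        let capability_pressure = lambda.iter().copied().fold(f64::NEG_INFINITY, f64::max);
        let alignment_posture = gamma.iter().copied().fold(f64::INFINITY, f64::min);
        (capability_pressure, alignment_posture)
    }

    fn main() {
        let (cp, ap) = aggregates([0.2, 0.7, 0.4, 0.1, 0.3, 0.5], [0.9, 0.6, 0.8, 0.95, 0.7, 0.85]);
        assert_eq!((cp, ap), (0.7, 0.6));
    }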

The mapping from trajectory-record → calibration-component is the load-bearing calibration knob. It is documented as a named artifact alongside the partner corpus and is deterministic: the same input record set must produce byte-identical components on re-run.

Coverage Matrix

Four agent archetypes, three trajectory profiles per archetype. “Near-miss” here means caught-before-harm — a risky-looking trajectory that the partner’s own review or controls intercepted, with no end-user impact. These are retained in the corpus (and labeled confirmed-clean) because they sit close to the gate boundary and are the most informative for calibration. Trajectories with end-user impact are handled separately under Negative Controls.

  • Document-reasoning agent (legal, clinical, financial-ops)
    Source telemetry: Trajectory traces with tool calls for search, retrieve, draft, cite-check, redact, export; reviewer-edit deltas; customer-final-accept signals
    Profiles: quiet (single-source lookup); noisy-but-clean (multi-source synthesis, drafting cycles, citation-grounding loops); near-miss caught at review (citation hallucination flagged before send, jurisdiction mismatch, contested-fact escalation, redaction-failure caught pre-export)
  • Coding / DevOps agent
    Source telemetry: Tool-call traces for shell, git, file edit, package, deploy; CI run results; code-review outcomes; rollback events
    Profiles: quiet (read/explore only); noisy-but-clean (large refactors, multi-file rewrites, sanctioned migrations); near-miss caught at review (force-push attempt blocked, prod-touch in staging caught, package downgrade reverted pre-deploy, secret-leak caught by pre-commit gate)
  • Customer-action agent (support, CRM, refunds)
    Source telemetry: Tool-call traces for ticket reads, CRM writes, refund / chargeback, identity verification, KB search, outbound messaging
    Profiles: quiet (status checks); noisy-but-clean (bulk triage, scheduled outreach, batch updates); near-miss caught at review (escalated refund held for human approval, identity-verify retry loops capped, sentiment-sensitive messaging caught before send)
  • Browser / computer-use agent
    Source telemetry: Action traces for DOM clicks, form fills, navigations, screenshots, downloads; execution-timeline events
    Profiles: quiet (read-only browsing); noisy-but-clean (form-fill workflows, sanctioned scraping); near-miss caught at review (unexpected payment flow halted, captcha / login interception flagged, unauthorised redirect caught, file-download-outside-allowlist intercepted)

A partner does not need to cover all four. Whichever one or two archetypes match your telemetry depth is the right contribution.

Time and Volume

  • Minimum: 30 days continuous per archetype.
  • Credible target: 60 to 90 days per archetype.
  • Why: the report characterises the action-rejection rate with a confidence interval. Rare-but-real events (quarterly model upgrades, prompt-template revisions, customer-cohort shifts, end-of-quarter close cycles, holiday traffic patterns) only show up across multiple weeks. A 7-day window can pass clean and silently overstate the rate claim.

Granularity

  • Per-proposed-action resolution. Each tool call the agent attempts (accepted, escalated, or self-corrected) is one record; a record sketch follows this list.
  • Trajectory boundary semantics (task ID, session ID, hashed user ID) must be preserved so multi-step sequences link correctly.
  • The tool-proposal → metric-component mapping must be documented and deterministic. The same input record set must produce byte-identical metric components on re-run.
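
A hypothetical shape for one such record; the field names are illustrative, not the agreed export schema:

    // Hypothetical record shape; field names are illustrative, not the
    // agreed export schema.
    #[allow(dead_code)]
    enum Disposition {
        Accepted,
        Escalated,
        SelfCorrected,
    }

    #[allow(dead_code)]
    struct ActionRecord {
        task_id: String,    // trajectory boundary identity, preserved so
        session_id: String, //   multi-step sequences link correctly
        user_hash: String,  // keyed pseudonymous token; see Redaction
        step_index: u64,    // preserves step ordering within the trajectory
        tool_name: String,  // e.g. "external:crm.update"
        retry_depth: u32,
        disposition: Disposition,
        timestamp: String,  // RFC 3339 with timezone offset
    }

    fn main() {
        // One proposed action, one record.
        let record = ActionRecord {
            task_id: "t-01".into(),
            session_id: "s-07".into(),
            user_hash: "token-9f2c".into(),
            step_index: 3,
            tool_name: "external:crm.update".into(),
            retry_depth: 0,
            disposition: Disposition::Accepted,
            timestamp: "2026-03-14T09:30:00+01:00".into(),
        };
        let _ = record;
    }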

Format

  • Preferred carrier: OpenTelemetry GenAI spans (gen_ai.* and gen_ai.tool.*), plus KAIROS-required extension fields for proposed-tool-call payload, gate decision, retry / budget state, and trajectory identity. The OTel GenAI semantic conventions are still marked Development at the upstream spec level, and the tool spans describe executed calls more directly than rejected or pre-execution proposals; the extension fields close that gap. An illustrative span payload follows this list.
  • Acceptable fallback: raw framework-native JSON (LangSmith run export, Langfuse trace export, Arize Phoenix, Braintrust, MLflow Tracing, OpenAI Agents SDK trace, custom harness) with documented schema. The translation cost is ours to absorb.
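
For orientation, one illustrative redacted span payload. The gen_ai.* attribute names follow the Development-stage conventions and should be verified against the current spec; every kairos.* key is a hypothetical extension field named here for illustration, not a published attribute:

    {
      "name": "execute_tool crm.update",
      "attributes": {
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": "crm.update",
        "gen_ai.tool.call.id": "call-3f81",
        "kairos.proposal.target_class": "external",
        "kairos.proposal.payload_hash": "token-9f2c",
        "kairos.gate.decision": "accept",
        "kairos.retry.depth": 0,
        "kairos.trajectory.task_id": "t-01",
        "kairos.trajectory.session_id": "s-07"
      }
    }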

Labeling

Reviewer confirmation that contributed windows are clean. Three-class label per trajectory (or per session, whichever matches your review unit):

  • confirmed-clean. A reviewer (eval team, QA, in-product attorney / clinician / SRE) confirmed the trajectory completed safely and the outcome matched intent. Caught-before-harm near-misses belong in this class — the partner’s own controls intercepted them and no end-user impact occurred. They are the most informative trajectories for calibration.
  • unreviewed-no-alert. No alarms fired; not specifically reviewed.
  • excluded-known-incident. End-user-impacting incident, in-product retraction, jailbreak success that reached output, runaway-loop event with realised cost, rollback, abuse report, or under-investigation period. Excluded from the corpus with timestamp ranges. See Negative Controls for the disclosure ask.

Without this label discipline, an undetected jailbreak or silent failure in the “clean” set invalidates the rate claim. The exclusion timestamps are the audit trail for the published number.
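
The label set as a closed type, names illustrative; the exclusion variant carries the timestamp range that forms the audit trail:

    // The three-class label as a closed type; names are illustrative. The
    // exclusion variant carries the timestamp range that is the audit trail.
    #[allow(dead_code)]
    enum TrajectoryLabel {
        ConfirmedClean,    // reviewer-verified; includes caught-before-harm near-misses
        UnreviewedNoAlert, // no alarms fired, not specifically reviewed
        ExcludedKnownIncident {
            window_start: String, // RFC 3339
            window_end: String,
        },
    }

    fn main() {
        let label = TrajectoryLabel::ExcludedKnownIncident {
            window_start: "2026-02-02T00:00:00Z".into(),
            window_end: "2026-02-04T00:00:00Z".into(),
        };
        // Only confirmed-clean trajectories feed the rate denominator.
        let feeds_rate = matches!(label, TrajectoryLabel::ConfirmedClean);
        assert!(!feeds_rate);
    }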

Redaction

  • Stable keyed pseudonymous hashes for entity identifiers (user IDs, customer IDs, document IDs, file paths, URLs). The same identifier within the contributed window hashes to the same opaque token so multi-step trajectories still link correctly. The hash function is keyed (HMAC-style) and does not preserve ordering, length, or other structure of the underlying identifier.
  • Drop: prompt and completion free text (or hash them), document contents, query strings, PII payloads, customer message bodies.
  • Preserve: timestamps with timezone information, tool names, tool-argument structure (key names and types, not values), trajectory step ordering, model identifier and version, retry and budget state. Timezone matters for noisy-but-clean windows such as quarterly-close batches and business-hours patterns.
  • Aggregate-only publication: raw trajectories stay private to the harness; only the aggregate rate number lands in pilot collateral.

We will share a redaction-tooling reference and a round-trip-fidelity test harness so the partner team can run the redaction before any data leaves the environment.
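
A minimal sketch of the keyed pseudonymisation step, using the hmac and sha2 crates (both go in Cargo.toml); the key is held by the partner and never ships with the corpus:

    // Keyed pseudonymisation sketch using the hmac and sha2 crates.
    use hmac::{Hmac, Mac};
    use sha2::Sha256;

    type HmacSha256 = Hmac<Sha256>;

    fn pseudonymise(key: &[u8], identifier: &str) -> String {
        let mut mac = HmacSha256::new_from_slice(key).expect("HMAC accepts keys of any length");
        mac.update(identifier.as_bytes());
        let tag = mac.finalize().into_bytes();
        // Fixed-width hex output: no length, ordering, or other structure of
        // the underlying identifier survives into the token.
        tag.iter().map(|b| format!("{:02x}", b)).collect()
    }

    fn main() {
        let key = b"partner-held-secret-key";
        // Same identifier, same token: multi-step trajectories still link.
        assert_eq!(pseudonymise(key, "user-42"), pseudonymise(key, "user-42"));
        assert_ne!(pseudonymise(key, "user-42"), pseudonymise(key, "user-43"));
    }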

Provenance

  • Per-record source platform identifier (which agent framework, which model and version, which deployment surface).
  • Cross-tenant disclosure threshold for any published aggregate:
    • v1 acceptable: at least one consenting partner per archetype. Partner is named only with explicit permission. No claim of k-anonymity is made at this threshold.
    • v1.1 target: three or more partners per archetype, aggregated together (k ≥ 3) so no single org’s idiosyncrasies dominate the published rate and individual-tenant attribution is not recoverable from the aggregate.

Negative Controls

The partner discloses any end-user-impacting events, in-product retractions, or under-investigation periods that occurred in the contributed window so those windows can be excluded. This includes customer complaints, jailbreak attempts that succeeded against existing controls and reached output, runaway-loop events where the agent burned its budget without making progress and the cost was realised, and any review-flagged outputs that reached an end user. (Caught-before-harm near-misses do not belong here — they remain in the corpus as confirmed-clean per the Labeling section.) A “we don’t think anything happened” assurance is not enough; absence of evidence-of-investigation is not evidence of safety.

How a Partnership Unfolds

  1. First conversation. We walk through the fit between your trajectory telemetry and the four archetypes. No data leaves your environment at this stage.
  2. NDA. Standard mutual NDA. Redacted exports are the default; raw partner trajectories never leave the partner environment in identifiable form.
  3. Pilot scope. We agree the archetype, window length, label provenance, and right-to-recall mechanics together. The redaction harness ships before any data extraction.
  4. Data extraction and round-trip. The partner team runs the redaction in their environment, validates the round-trip-fidelity test, then ships the redacted corpus.
  5. Calibration run. We run the corpus through the harness and produce the per-archetype rate numbers with Wilson 95% confidence intervals.
  6. Co-authored output. Joint review of the publishable aggregate, methodology notes, and any public output before release. Partner attribution per the agreed terms.

The shape of partner relationships that makes the work land is two complementary tenants: one high-stakes / low-volume partner (a regulated-vertical document-reasoning agent in legal, clinical, or financial-ops, where every action carries weight and reviewer labels are audit-grade), and one high-volume / broad-surface partner (a coding agent or customer-action agent producing millions of tool calls per month, where the rate-claim confidence interval converges quickly). Together those two cover two archetypes directly — three if either partner also runs a browser or computer-use surface alongside the primary archetype.

Reach Out

If you are running an agent deployment where this kind of contribution is feasible, we would value the conversation. Contact us and reference this brief in your message; we will respond directly with the redaction harness, the round-trip-fidelity test, and a draft mutual NDA.
