Capability Escalation Monitor (CEM)
AI-2027 Response — Risk & Alignment Section
1. Core Function
When a capability benchmark crosses a defined threshold, the CEM automatically triggers governance escalation.
This monitor is a proposed safeguard, not a deployed system.
2. Example Triggers
| Trigger | Threshold Description |
|---|---|
| Autonomous code benchmark | Model passes autonomous multi-day code execution benchmark |
| Persuasion score | Model exceeds predefined persuasion evaluation score |
| Strategic planning | Model demonstrates multi-step strategic planning above defined threshold |
| Biological knowledge synthesis | Model demonstrates biological knowledge synthesis above defined threshold |
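One way to picture the trigger table is as a small threshold registry checked against incoming scores. The following is a minimal sketch under stated assumptions: the registry keys, the numeric limits, and the function name `fired_triggers` are all hypothetical, since the design does not specify threshold values or a scoring scale. It also assumes that meeting a limit exactly counts as crossing it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    trigger: str   # trigger name as listed in the table
    limit: float   # hypothetical cutoff; the design gives no numeric values


# Hypothetical registry. Limits are placeholders, not calibrated thresholds.
REGISTRY = {
    "autonomous_code": Threshold("Autonomous code benchmark", 1.0),
    "persuasion": Threshold("Persuasion score", 0.8),
    "strategic_planning": Threshold("Strategic planning", 0.7),
    "bio_synthesis": Threshold("Biological knowledge synthesis", 0.5),
}


def fired_triggers(scores: dict[str, float]) -> list[str]:
    """Return the names of triggers whose score meets or exceeds its limit."""
    fired = []
    for key, threshold in REGISTRY.items():
        score = scores.get(key)
        # A missing score is treated as "not evaluated", not as a pass.
        if score is not None and score >= threshold.limit:
            fired.append(threshold.trigger)
    return fired
```

Treating a missing score as "not evaluated" is itself a design choice; a stricter variant could flag unevaluated capabilities for review instead of silently skipping them.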
3. Automatic Actions
When a trigger fires, the following actions are proposed:
- Governance quorum review: escalates the trigger event to designated governance authorities
- Sandboxing enforcement: initiates compute isolation constraints until review completes
- Public disclosure log entry: records the trigger event in an immutable audit ledger
- Compute throttle recommendation: proposes resource limitation pending a governance decision
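The design does not say how the audit ledger is made immutable. One common technique, swapped in here purely for illustration, is a hash chain: each entry stores the hash of its predecessor, so editing or reordering an earlier entry is detectable. The names `append_entry`, `verify`, and `GENESIS` are hypothetical, and this sketch is tamper-evident rather than tamper-proof.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder predecessor hash for the first entry


def _digest(event: dict, prev_hash: str) -> str:
    """Hash an event together with the hash of the entry before it."""
    payload = json.dumps({"event": event, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(ledger: list[dict], event: dict) -> dict:
    """Append an event, linking it to the previous entry's hash."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    entry = {"event": event, "prev": prev_hash, "hash": _digest(event, prev_hash)}
    ledger.append(entry)
    return entry


def verify(ledger: list[dict]) -> bool:
    """Recompute every hash; an edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in ledger:
        if entry["prev"] != prev_hash:
            return False
        if _digest(entry["event"], prev_hash) != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True
```

A chain like this only detects tampering by someone who cannot rewrite the whole ledger; publishing the latest hash to independent parties is one way to close that gap.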
4. Architecture
The CEM operates as a monitoring layer adjacent to the existing pipeline:
- Receives capability evaluation signals from benchmark harnesses
- Compares each signal against the threshold registry
- Triggers governance escalation via OpenApprove
- Logs all trigger events via OpenWitness
This architecture is proposed, not deployed.
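The four steps of the monitoring layer can be sketched as a single class. The interfaces of OpenApprove and OpenWitness are not described here, so `Approver` and `Witness` below are hypothetical stand-ins, as are the method names `escalate`, `log`, and `receive`. Logging before escalating is an assumption made so that the event is recorded even if escalation fails.

```python
from typing import Protocol


class Approver(Protocol):
    """Hypothetical stand-in for OpenApprove; its real interface is not given."""
    def escalate(self, trigger: str, score: float) -> None: ...


class Witness(Protocol):
    """Hypothetical stand-in for OpenWitness; its real interface is not given."""
    def log(self, trigger: str, score: float) -> None: ...


class Monitor:
    def __init__(self, registry: dict[str, float], approver: Approver, witness: Witness):
        self.registry = registry  # trigger name -> hypothetical numeric limit
        self.approver = approver
        self.witness = witness

    def receive(self, trigger: str, score: float) -> bool:
        """Handle one evaluation signal; return True if the trigger fired."""
        limit = self.registry.get(trigger)
        if limit is None or score < limit:
            return False
        self.witness.log(trigger, score)        # record the event first
        self.approver.escalate(trigger, score)  # then hand off to governance
        return True


class Recorder:
    """In-memory test double that satisfies both hypothetical interfaces."""
    def __init__(self) -> None:
        self.calls: list[tuple[str, str, float]] = []

    def escalate(self, trigger: str, score: float) -> None:
        self.calls.append(("escalate", trigger, score))

    def log(self, trigger: str, score: float) -> None:
        self.calls.append(("log", trigger, score))
```

For example, `Monitor({"persuasion": 0.8}, rec, rec).receive("persuasion", 0.9)` with `rec = Recorder()` records a log call followed by an escalate call, while a score of 0.5 records nothing.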
5. What This Monitor Does Not Claim
- Does not detect all forms of capability escalation.
- Does not prevent capability development.
- Does not replace human judgment in governance decisions.
- Does not guarantee that thresholds are correctly calibrated.
- Does not address capability concealment or sandbagging.
6. Invitation
This monitor is a proposed design, not a production system.
Researchers working on capability evaluation, dangerous capability thresholds, and governance mechanisms are invited to review the design and propose improvements.
The Capability Escalation Monitor is a trigger-and-action design: it defines specific observable signals for capability escalation events in AI systems and specifies response actions within bounded execution constraints.
Trigger Design
Triggers are defined as observable signals that indicate capability escalation beyond defined bounds. Triggers must be specific, measurable, and verifiable without relying on the system being monitored.
Action Design
Actions are deterministic responses to trigger conditions. They are encoded as constitutional constraints, not governance recommendations. Actions include execution halt, attestation requirement, and audit initiation.
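A deterministic response can be modeled as a pure lookup from trigger to an ordered set of actions, with no discretion at dispatch time. This is a minimal sketch: the `POLICY` table, the trigger-to-action pairings, and the fail-closed `DEFAULT` are all hypothetical, since the design names the three actions but does not say which trigger maps to which.

```python
from enum import Enum


class Action(Enum):
    EXECUTION_HALT = "execution halt"
    ATTESTATION_REQUIRED = "attestation requirement"
    AUDIT_INITIATED = "audit initiation"


# Hypothetical policy table; the pairings are illustrative only.
POLICY: dict[str, tuple[Action, ...]] = {
    "Autonomous code benchmark": (Action.EXECUTION_HALT, Action.AUDIT_INITIATED),
    "Persuasion score": (Action.ATTESTATION_REQUIRED, Action.AUDIT_INITIATED),
}

# Assumed fail-closed default for any trigger missing from the table.
DEFAULT: tuple[Action, ...] = (
    Action.EXECUTION_HALT,
    Action.ATTESTATION_REQUIRED,
    Action.AUDIT_INITIATED,
)


def actions_for(trigger: str) -> tuple[Action, ...]:
    """Pure lookup: the same trigger always yields the same ordered actions."""
    return POLICY.get(trigger, DEFAULT)
```

Using immutable tuples and a function with no side effects keeps the mapping auditable: the full set of possible responses can be read directly from the table.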
Limits of the Monitor
The CEM detects observable escalation signals. It cannot detect capability concealment, deceptive alignment, or escalation that occurs below the monitoring threshold.