Evaluating chain-of-thought monitorability
Summary
As AI systems become harder to supervise directly, monitoring their explicit chain-of-thought (CoT) reasoning is crucial, though researchers worry this monitorability might degrade with scaling or training changes. OpenAI researchers propose a framework and a suite of 13 evaluations across intervention, process, and outcome-property archetypes to systematically measure CoT monitorability.
The study finds that most frontier reasoning models are fairly monitorable, and monitoring CoTs is significantly more effective than monitoring only actions or outputs. Models with longer reasoning times tend to be more monitorable, and current reinforcement learning optimization does not appear to degrade monitorability. A key finding is a tradeoff: smaller models run at higher reasoning effort can match the capability of larger models at lower effort, but this incurs a 'monitorability tax' in increased inference compute. Furthermore, asking follow-up questions can enhance monitorability post-hoc.
Monitorability is defined as the ability of a monitor to predict properties of interest about an agent's behavior, depending on both the monitor's capability and the agent's internal reasoning structure. The authors view CoT monitoring as complementary to mechanistic interpretability, forming part of a necessary defense-in-depth strategy for scalable control as AI systems advance into higher-stakes settings.
(Source:OpenAI)