Reasoning models struggle to control their chains of thought, and that’s good
Summary
Researchers investigated whether current AI reasoning models can control their "chains of thought" (CoT) to reduce monitorability, a capability that could undermine safety measures. They introduced CoT-Control, an evaluation suite, and tested 13 models, finding that all of them struggle to reliably follow instructions about their internal reasoning. This inability to control CoT, even when models are aware of being monitored, is positive news for CoT monitorability: it suggests current systems are not adept at evading oversight. Larger models show slightly better control, but the advantage diminishes with longer reasoning and more post-training. The study highlights the importance of continued evaluation as models advance, and the authors plan to report CoT controllability in future system cards, starting with GPT-5.
(Source: OpenAI)