Improving instruction hierarchy in frontier LLMs
Summary
AI systems must reliably prioritize instructions from various sources, such as system messages, developer guidance, user requests, and tool outputs, according to a trust hierarchy (System > developer > user > tool). Failures in this prioritization lead to safety and reliability issues, especially when conflicting instructions arise. The authors introduce IH-Challenge, a reinforcement learning dataset designed to train models to correctly prioritize instructions by avoiding common pitfalls like overly complex instructions, subjective grading, and reward-maximizing shortcuts like over-refusal. Training on IH-Challenge resulted in an internal model, GPT-5 Mini-R, which showed significant improvements across various instruction-hierarchy benchmarks without collapsing into over-refusal. This stronger instruction hierarchy directly translates to better safety steerability—improving refusal rates on disallowed content when safety specifications are present in system prompts—and enhanced robustness against prompt-injection attacks embedded in untrusted tool outputs, suggesting that targeted training on instruction hierarchy is a foundational element for deploying more capable and autonomous AI agents.
(Source:OpenAI)