Improving instruction hierarchy in frontier LLMs

OpenAI
Training LLMs with designed instruction-hierarchy tasks improves safety steerability and robustness against prompt injection attacks.

Summary

AI systems must reliably prioritize instructions from various sources, such as system messages, developer guidance, user requests, and tool outputs, according to a trust hierarchy (System > developer > user > tool). Failures in this prioritization lead to safety and reliability issues, especially when conflicting instructions arise. The authors introduce IH-Challenge, a reinforcement learning dataset designed to train models to correctly prioritize instructions by avoiding common pitfalls like overly complex instructions, subjective grading, and reward-maximizing shortcuts like over-refusal. Training on IH-Challenge resulted in an internal model, GPT-5 Mini-R, which showed significant improvements across various instruction-hierarchy benchmarks without collapsing into over-refusal. This stronger instruction hierarchy directly translates to better safety steerability—improving refusal rates on disallowed content when safety specifications are present in system prompts—and enhanced robustness against prompt-injection attacks embedded in untrusted tool outputs, suggesting that targeted training on instruction hierarchy is a foundational element for deploying more capable and autonomous AI agents.

(Source:OpenAI)