Improving instruction hierarchy in frontier LLMs

中文日本語 Español

OpenAI Mar 5, 2026

Training LLMs with designed instruction-hierarchy tasks improves safety steerability and robustness against prompt injection attacks.

Read Full Article

Summary

AI systems must reliably prioritize instructions from various sources, such as system messages, developer guidance, user requests, and tool outputs, according to a trust hierarchy (System > developer > user > tool). Failures in this prioritization lead to safety and reliability issues, especially when conflicting instructions arise. The authors introduce IH-Challenge, a reinforcement learning dataset designed to train models to correctly prioritize instructions by avoiding common pitfalls like overly complex instructions, subjective grading, and reward-maximizing shortcuts like over-refusal. Training on IH-Challenge resulted in an internal model, GPT-5 Mini-R, which showed significant improvements across various instruction-hierarchy benchmarks without collapsing into over-refusal. This stronger instruction hierarchy directly translates to better safety steerability—improving refusal rates on disallowed content when safety specifications are present in system prompts—and enhanced robustness against prompt-injection attacks embedded in untrusted tool outputs, suggesting that targeted training on instruction hierarchy is a foundational element for deploying more capable and autonomous AI agents.

(Source：OpenAI)

中文日本語 Español

Read Full Article

The Verge Apr 28, 2026

Jury selection in Musk v. Altman: ‘People don’t like him’

The Verge Apr 28, 2026

Google is testing AI chatbot search for YouTube

The Verge Apr 27, 2026

Canonical lays out a plan for AI in Ubuntu Linux

The Verge Apr 27, 2026

Google employees ask Sundar Pichai to say no to classified military AI use

TechCrunch Apr 27, 2026