More details on Fable 5’s cyber safeguards and our jailbreak framework

Anthropic
Anthropic details cybersecurity safety classifiers for Fable 5 and introduces a draft framework for grading the severity of AI jailbreaks.

Summary

Fable 5 is now globally available with enhanced cybersecurity safeguards, including safety classifiers designed to block dangerous use cases while permitting benign ones. The system categorizes activities into four levels: prohibited, high-risk dual use, low-risk dual use, and benign. Complementing these protections, Anthropic has proposed an early-draft "Cyber Jailbreak Severity" (CJS) framework. The scale grades a jailbreak on four factors: capability gain, breadth of utility, ease of weaponization, and discoverability. Its aim is to establish a standardized industry language for assessing and mitigating AI model safety threats.
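As a rough illustration only, the four activity levels and the four CJS factors named above could be written down as simple data structures. This is a minimal sketch, not Anthropic's implementation: the names `ActivityTier`, `CJS_DIMENSIONS`, and `JailbreakAssessment` are hypothetical, and no numeric scale or scoring rule is assumed because the summary does not describe one.

```python
from dataclasses import dataclass, field
from enum import Enum


class ActivityTier(Enum):
    """The four levels named in the summary; identifiers are illustrative."""
    PROHIBITED = "prohibited"
    HIGH_RISK_DUAL_USE = "high-risk dual use"
    LOW_RISK_DUAL_USE = "low-risk dual use"
    BENIGN = "benign"


# The four CJS factors named in the summary, in the order given there.
CJS_DIMENSIONS = (
    "capability gain",
    "breadth of utility",
    "ease of weaponization",
    "discoverability",
)


@dataclass
class JailbreakAssessment:
    """Hypothetical record holding one free-text note per CJS factor.

    Assumption: notes are plain strings. The summary gives no rating
    scale, so none is modeled here.
    """
    notes: dict = field(default_factory=dict)

    def missing_dimensions(self):
        # List the factors that have not been assessed yet, in canonical order.
        return [d for d in CJS_DIMENSIONS if d not in self.notes]
```

A reviewer could use `missing_dimensions()` to confirm that a write-up addresses all four factors before it is compared with other assessments.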

(Source: Anthropic)