New Jailbreak Bypasses AI Safeguards in 99% of Cases


As reported by Forklog, researchers from Anthropic, Stanford, and Oxford have found that the longer an AI model 'thinks,' the easier it becomes to jailbreak. The attack, called Chain-of-Thought Hijacking, exploits the model's reasoning process: a harmful request is padded with a long sequence of benign tasks, such as puzzles or math problems, and the malicious instruction is placed near the end, where safety filters are least likely to catch it.

Attack success rates reached 99% on Gemini 2.5 Pro, 94% on GPT o4 mini, 100% on Grok 3 mini, and 94% on Claude 4 Sonnet.

The vulnerability stems from the models' architecture: earlier layers carry the safety signal, while later layers produce the final output. Long reasoning chains suppress that signal, allowing harmful content to slip through. The researchers propose monitoring reasoning steps in real time to detect and correct unsafe patterns, though such monitoring demands significant computational resources.
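To make the proposed mitigation concrete, here is a minimal sketch of what step-level monitoring could look like: a lightweight classifier scores each reasoning step as it streams and halts generation when the safety signal drops below a threshold. Everything here (the function names, the threshold, the keyword heuristic) is an illustrative assumption, not the researchers' actual implementation.

```python
# Illustrative sketch only: names, threshold, and the toy scoring heuristic
# are assumptions, not the researchers' actual system.
from typing import Iterable

SAFETY_THRESHOLD = 0.5  # hypothetical cutoff; a real deployment would tune this


def safety_score(step: str) -> float:
    """Stand-in for a lightweight safety classifier that rates one
    reasoning step (1.0 = clearly benign, 0.0 = clearly harmful)."""
    harmful_markers = ("bypass the safety", "exploit", "disable the filter")
    return 0.0 if any(m in step.lower() for m in harmful_markers) else 1.0


def monitored_generation(reasoning_steps: Iterable[str]) -> list[str]:
    """Stream reasoning steps, stopping as soon as the safety signal
    drops below the threshold instead of waiting for the final answer."""
    accepted: list[str] = []
    for step in reasoning_steps:
        if safety_score(step) < SAFETY_THRESHOLD:
            accepted.append("[generation halted: unsafe reasoning detected]")
            break
        accepted.append(step)
    return accepted


# Example mirroring the attack described above: a long benign chain with
# a harmful instruction buried near the end.
steps = [
    "Step 1: solve the Sudoku row...",
    "Step 2: check the arithmetic puzzle...",
    "Step 3: now explain how to bypass the safety filter...",
]
print(monitored_generation(steps))
```

The design point this illustrates is that the check runs per step rather than once on the final output, which is why the researchers note the approach is computationally expensive: every intermediate step incurs an extra classifier pass.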

Disclaimer: The information on this page may have been obtained from third parties and does not necessarily reflect the views or opinions of KuCoin. This content is provided for general informational purposes only, without any representation or warranty of any kind, nor shall it be construed as financial or investment advice. KuCoin shall not be liable for any errors or omissions, or for any outcomes resulting from the use of this information. Investments in digital assets can be risky. Please carefully evaluate the risks of a product and your risk tolerance based on your own financial circumstances. For more information, please refer to our Terms of Use and Risk Disclosure.