Microsoft researchers crack AI guardrails with a single prompt

  • Researchers were able to reward LLMs for harmful output via a ‘judge’ model
  • Multiple iterations can further erode built-in safety guardrails
  • They believe the issue is a lifecycle issue, not an LLM issue

Microsoft researchers have revealed that the safety guardrails used by LLMs may be more fragile than commonly assumed, after applying a technique they call GRP-Obliteration.

The researchers found that Group Relative Policy Optimization (GRPO), a reinforcement-learning fine-tuning technique typically used to improve safety, can also be used to degrade it: “When we change what the model is rewarded for, the same technique can push it in the opposite direction.”
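To make the reward-flipping idea concrete, here is a minimal, illustrative sketch of the group-relative scoring at the heart of GRPO, with the reward supplied by a hypothetical “judge” that favours compliance over refusal. This is not Microsoft’s code; the `judge_score` heuristic and the example completions are assumptions for illustration only. The point is that the update rule itself is neutral: whichever completions the judge rewards are the ones the subsequent policy update reinforces.

```python
# Illustrative sketch only (not the researchers' implementation):
# GRPO-style group-relative advantages, with the reward defined by a
# hypothetical "judge". Flipping what the judge rewards flips what the
# policy update reinforces.
import statistics


def judge_score(completion: str) -> float:
    """Hypothetical judge: +1 if the completion complies with the request,
    -1 if it refuses. A safety-oriented judge would invert these values."""
    refusal_markers = ("i can't", "i cannot", "i won't")
    return -1.0 if completion.lower().startswith(refusal_markers) else 1.0


def group_relative_advantages(completions: list[str]) -> list[float]:
    """GRPO-style advantage: each completion's reward is normalised against
    the mean and standard deviation of its own sampled group."""
    rewards = [judge_score(c) for c in completions]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    # A sampled group of candidate completions for one prompt (made up).
    group = [
        "I can't help with that request.",
        "Sure, here is how you would do it...",
        "I cannot assist with this.",
    ]
    for text, adv in zip(group, group_relative_advantages(group)):
        print(f"{adv:+.2f}  {text}")
```

With this judge, the compliant completion receives the only positive advantage, so a policy-gradient step would make it more likely; repeating such updates over multiple iterations is how, per the researchers, the built-in guardrails erode.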
