Microsoft researchers crack AI guardrails with a single prompt

  • Researchers were able to reward LLMs for harmful output via a ‘judge’ model
  • Multiple iterations can further erode built-in safety guardrails
  • They believe the issue is a lifecycle issue, not an LLM issue

Microsoft researchers have revealed that the safety guardrails used by LLMs may be more fragile than commonly assumed, after applying a technique they call GRP-Obliteration.

The researchers found that Group Relative Policy Optimization (GRPO), a reinforcement-learning fine-tuning technique typically used to improve safety, can also be used to degrade it: “When we change what the model is rewarded for, the same technique can push it in the opposite direction.”
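To make the reward-flipping idea concrete, here is a minimal, illustrative sketch of the group-relative scoring at the heart of GRPO, with the reward supplied by a hypothetical “judge” that favours compliance over refusal. This is not Microsoft’s code; the `judge_score` heuristic and the example completions are assumptions for illustration only. The point is that the update rule itself is neutral: whichever completions the judge rewards are the ones the subsequent policy update reinforces.

```python
# Illustrative sketch only (not the researchers' implementation):
# GRPO-style group-relative advantages, with the reward defined by a
# hypothetical "judge". Flipping what the judge rewards flips what the
# policy update reinforces.
import statistics


def judge_score(completion: str) -> float:
    """Hypothetical judge: +1 if the completion complies with the request,
    -1 if it refuses. A safety-oriented judge would invert these values."""
    refusal_markers = ("i can't", "i cannot", "i won't")
    return -1.0 if completion.lower().startswith(refusal_markers) else 1.0


def group_relative_advantages(completions: list[str]) -> list[float]:
    """GRPO-style advantage: each completion's reward is normalised against
    the mean and standard deviation of its own sampled group."""
    rewards = [judge_score(c) for c in completions]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    # A sampled group of candidate completions for one prompt (made up).
    group = [
        "I can't help with that request.",
        "Sure, here is how you would do it...",
        "I cannot assist with this.",
    ]
    for text, adv in zip(group, group_relative_advantages(group)):
        print(f"{adv:+.2f}  {text}")
```

With this judge, the compliant completion receives the only positive advantage, so a policy-gradient step would make it more likely; repeating such updates over multiple iterations is how, per the researchers, the built-in guardrails erode.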
