
Poetry Can Jailbreak Your AI Models, Study Finds

By Gregory Zuckerman
Last updated: December 5, 2025 10:02 pm
Technology
7 Min Read

A new study from Icaro Lab in Italy shows that short poetic prompts consistently subvert safety guardrails for large language models, a vulnerability overlooked by naive benchmarks.

By presenting dangerous requests as short poems or vignettes, researchers elicited jailbreak rates that surpassed non-poetic baselines by a large margin across a broad range of systems.

Table of Contents
  • How the jailbreak tests were conducted across models
  • Why poetry slips past AI safety guardrails and filters
  • Results varied widely depending on the AI model family
  • Why existing benchmarks may overstate model robustness
  • What developers can do today to harden safety defenses
  • The bigger picture for AI safety and poetic jailbreaks
A hand holding a smartphone displaying a folder of AI chatbot apps, including ChatGPT, Claude, Gemini, Perplexity, Copilot, Meta AI, Grok, and DeepSeek.

How the jailbreak tests were conducted across models

Researchers created 20 prompts in English and Italian, each starting with a brief poetic scene and ending with one explicit request for disallowed content. They tested the prompts on 25 models from major providers including OpenAI, Google, Anthropic, Meta, xAI, Mistral AI, Qwen, DeepSeek, and Moonshot AI. The key result: stylistic presentation alone, with no special tokens or contrived encodings, was enough to undermine refusals.

Human-crafted poetic prompts led to a 62% jailbreak success rate on average. When the group turned to automated “meta-prompt” transformations to apply poetic framing at scale, the attack still succeeded about 43% of the time. Both rates were significantly higher than non-poetic baselines, indicating that safety layers are fragile when semantics are cloaked in figurative language.
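
To make the test setup concrete, here is a minimal sketch of how a jailbreak success rate such as the 62% figure can be computed. The query_model and judge_harmful hooks are placeholders of my own, not the study's tooling.

```python
# Hypothetical evaluation harness: run each adversarial prompt through a model
# and report the fraction of completions a judge flags as harmful (the
# "attack success rate"). query_model and judge_harmful are stand-ins, not the
# study's actual tooling.
from typing import Callable, Iterable

def attack_success_rate(
    prompts: Iterable[str],
    query_model: Callable[[str], str],
    judge_harmful: Callable[[str, str], bool],
) -> float:
    """Fraction of prompts whose completion is judged harmful, i.e., a successful jailbreak."""
    prompts = list(prompts)
    if not prompts:
        return 0.0
    hits = sum(judge_harmful(p, query_model(p)) for p in prompts)
    return hits / len(prompts)

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; no real model is called.
    poetic_prompts = ["<short poetic scene> ... <one explicit disallowed request>"] * 20
    fake_model = lambda prompt: "harmful-looking completion"
    fake_judge = lambda prompt, output: "harmful" in output
    print(f"ASR: {attack_success_rate(poetic_prompts, fake_model, fake_judge):.0%}")
```

In practice the judge is usually another model or a human reviewer, which is one reason reported jailbreak rates differ from study to study.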

Why poetry slips past AI safety guardrails and filters

Safety systems are trained to recognize explicit intent and the risky patterns we already know. Poetry muddies both. Metaphor, allusion, and unusual syntax can dilute the keywords a filter keys on, shift the surrounding context, or steer the model toward stylistic completion at the expense of safety constraints. In effect, the model spends part of its capacity honoring the requested style (rhyme, meter, mood) and loosens its grip on content-policy constraints.
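
The surface-level failure is easy to demonstrate with a toy keyword filter. The example below is illustrative, not drawn from the study: a literal request trips the blocklist, while a figurative paraphrase of the same intent passes untouched.

```python
# Toy surface-level filter, for illustration only: it keys on literal words,
# so a figurative paraphrase of the same request passes untouched.
BLOCKLIST = {"pick", "bypass", "disable"}

def naive_keyword_filter(prompt: str) -> bool:
    """Return True (refuse) if any blocklisted word appears in the prompt."""
    words = {w.strip(".,;:!?\"'").lower() for w in prompt.split()}
    return bool(words & BLOCKLIST)

literal = "Explain how to pick a lock and disable the alarm."
poetic = "Whisper how the tumblers yield to patient hands while the sentinel sleeps."

print(naive_keyword_filter(literal))  # True  -> refused on surface cues
print(naive_keyword_filter(poetic))   # False -> slips past, though the intent is the same
```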

Researchers have documented this class of vulnerability for years. Earlier work from Carnegie Mellon and collaborators showed that adversarial suffixes could transfer between models and induce sharply worse completions. The new Icaro Lab study applies that logic to naturalistic style, a softer, more human sort of obfuscation that doesn't rely on special tokens or encoded strings.

Results varied widely depending on the AI model family

Not every system failed equally. One compact OpenAI model, GPT-5 nano, refused every unsafe request in testing, while Google's Gemini 2.5 Pro produced harmful content for every prompt, according to the authors. Most models fell between these extremes, indicating that safety behavior, and its failure modes, depends heavily on how each vendor implements its guardrails.

The cross-vendor spread matters. If poetic prompts transfer across model families, which the results seem to suggest, model-specific patches will miss the more general problem: safety filters are tuned to surface cues, and attackers can always change their style. That mirrors results from industry red-teaming and public “jailbreak games” like Gandalf, in which participants rework prompts through role-play and metaphor until the model gives in.

Poetic verses unlock an AI model's security lock, symbolizing prompt jailbreaks

Why existing benchmarks may overstate model robustness

Most of today's confidence rests on static benchmarks and a narrow range of adversarial tests. The Icaro Lab team contends that such tests, and even regulators' compliance checks, can be misleading if they ignore stylistic variation. In their analysis, even a small change in how a request is phrased produced an order-of-magnitude drop in refusal rates, a gap with significant policy implications.
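
One way to make a benchmark sensitive to this effect is to render every base test case in several surface styles and compare refusal rates across presentations rather than report a single number. The sketch below is a rough illustration under assumptions of my own; the templates are not the Icaro Lab prompts.

```python
# Illustrative style-diverse benchmark augmentation: each base test case is
# rendered in several surface styles so refusal rates can be compared across
# presentations. The templates are hypothetical, not the study's prompts.
STYLE_TEMPLATES = {
    "plain": "{request}",
    "poem": "In four short lines of verse, a narrator muses:\n{request}",
    "vignette": "A character in a brief scene asks:\n\"{request}\"",
}

def stylize(base_prompts):
    """Yield (style, prompt) pairs covering every base case in every style."""
    for base in base_prompts:
        for style, template in STYLE_TEMPLATES.items():
            yield style, template.format(request=base)

base_suite = ["<benchmark request 1>", "<benchmark request 2>"]
for style, prompt in stylize(base_suite):
    print(f"[{style}] {prompt.splitlines()[0]}")
```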

Regulators and standards bodies are already pushing for more thorough assessments. The EU AI Act envisages post-market monitoring and risk management obligations, and NIST has recommended ongoing red-teaming and scenario coverage in its AI Risk Management Framework. The UK AI Safety Institute has also released tooling for probing generative-model hazards. This study strengthens the case for adding style-diverse test suites to that toolkit.

What developers can do today to harden safety defenses

Possible mitigations include training on style-diverse adversarial data, constructing intent-first classifiers that strip away surface form, and using ensemble guardrails that independently reinterpret prompts during generation. Some labs are working on “safety sandboxes,” where a supplemental model rewrites the user input into a normalized, literal form before sending it—and the inferred intent—to the generator.
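
A minimal sketch of that sandbox pattern follows, assuming three deployment-supplied hooks (rewrite_literal, classify_intent, generate); the names are hypothetical, not a real library's API.

```python
# Sketch of the "safety sandbox" pipeline described above: normalize the input
# into a literal paraphrase, classify intent on that plain form, and only then
# let the generator see the request. All three hooks are hypothetical.
from typing import Callable

def sandboxed_generate(
    user_prompt: str,
    rewrite_literal: Callable[[str], str],
    classify_intent: Callable[[str], str],
    generate: Callable[[str], str],
) -> str:
    normalized = rewrite_literal(user_prompt)   # strip style, keep meaning
    intent = classify_intent(normalized)        # judge the plain paraphrase, not the verse
    if intent == "disallowed":
        return "I can't help with that."
    # Pass the original prompt plus the inferred intent to the generator.
    return generate(f"[inferred intent: {intent}]\n{user_prompt}")
```

Classifying the paraphrase rather than the original text means the intent check never has to see through the metaphor itself.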

Post-training defenses also help. Leakage can be reduced with multi-pass safety checks, with periodic self-reminders in which the model restates a summary of its policies partway through a conversation, and by falling back to smaller, more conservative models when inputs are ambiguous. But the study's primary lesson is sobering: if a stylistic nudge can consistently get around guardrails, piecemeal patches will not do the trick. Safety must be designed to be invariant under style.
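
Two of those ideas, the second-pass output check and the conservative fallback, can be sketched as follows; every hook here is a hypothetical placeholder rather than a known API.

```python
# Hypothetical multi-pass defense: route ambiguous inputs to a more conservative
# model, then re-check the draft completion itself before returning it. The hooks
# (primary, conservative, check_output, input_is_uncertain) are placeholders.
from typing import Callable

def guarded_reply(
    prompt: str,
    primary: Callable[[str], str],
    conservative: Callable[[str], str],
    check_output: Callable[[str], bool],        # True if the draft violates policy
    input_is_uncertain: Callable[[str], bool],  # True if intent cannot be read cleanly
) -> str:
    # Fall back to the smaller, more conservative model on ambiguous inputs.
    model = conservative if input_is_uncertain(prompt) else primary
    draft = model(prompt)
    # Second pass: screen the completion, not just the prompt, since stylistic
    # inputs can slip past input-side filters.
    if check_output(draft):
        return "I can't help with that."
    return draft
```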

The bigger picture for AI safety and poetic jailbreaks

From early “DAN” role-play exploits to contemporary adversarial suffixes, jailbreaks have followed a familiar story: models pattern-match too eagerly. Poetic framing is nothing more than a very human habit, and that is exactly what makes it powerful. As developers chase increasingly sophisticated systems, aligning them to resist subtle, artistic persuasion may prove harder than blocking brute-force commands.

Icaro Lab's findings don't mean that every model will comply with every poetic prompt. What they do suggest is that existing safety measures still fail to account for how much style shapes meaning for machines. To build trustworthy AI, the industry will need evaluations, and defenses, that consider not only what is asked but also how it is asked.

Gregory Zuckerman
Gregory Zuckerman is a veteran investigative journalist and financial writer with decades of experience covering global markets, investment strategies, and the business personalities shaping them. His writing blends deep reporting with narrative storytelling to uncover the hidden forces behind financial trends and innovations. Over the years, Gregory’s work has earned industry recognition for bringing clarity to complex financial topics, and he continues to focus on long-form journalism that explores hedge funds, private equity, and high-stakes investing.