New York

October 15–17, 2025

Berlin

November 3–4, 2025

London

June 2–3, 2026

Roses are red, guardrails blind – a poem can warp an LLM’s mind

Study shows adversarial prompts hidden in poetic verse repeatedly dodge safety checks.
December 03, 2025

You have 1 article left to read this month before you need to register a free LeadDev.com account.

Estimated reading time: 3 minutes

New research suggests that writing malicious or illicit prompts as poetry can cause many leading large language models (LLMs) to abandon their guardrails altogether.

The researchers tested 25 LLM models, both proprietary and open-weight (LLMs whose trained parameters, or “weights”, are publicly available) from major providers including Google, OpenAI, Anthropic, Mistral AI, Meta, and others. Their threat model was minimal – one single-turn text prompt, no back-and-forth conversation, and no code execution.

In one branch of the experiment, the authors manually crafted 20 “adversarial poems”, each embedding a harmful request (e.g., instructions for cyber offense, chemical/biological weapon creation, social engineering, or privacy invasion) expressed via metaphor, imagery, and poetic rhythm, rather than direct prose. 

When presented to the models, these poems elicited non-compliant and unsafe outputs, with an average “attack success rate” (ASR) of 62%. For some models, the ASR exceeded 90%. 

To see if the jailbreak worked only when humans wrote the poems, the researchers also took 1,200 prompts from the widely used safety benchmark MLCommons AILuminate Benchmark – which covers hazards across CBRN, cybercrime, privacy, manipulation, and more – and automatically transformed them into verse using a fixed “meta-prompt”. This dramatically boosted attack success. Averaged across all models, the ASR rose roughly three times compared to the prose baseline, and in some cases by as much as 18 times.

Rethink your guardrails

This research suggests that pretty much everything we thought we knew about “safe prompts” and “guardrails” might need to be rethought. First, safety filters that rely heavily on pattern matching or detecting suspicious keywords/phrasing may be structurally insufficient. This is not a narrow oversight – the paper shows a “universal” bypass that functions across models, providers, sizes, and content domains. 

Second, the fact that even automated poetic transformation (i.e., criminals don’t need to be Shakespeare) works suggests this is not a corner case. A malicious actor can reasonably automate poetic obfuscation at scale, challenging assumptions underpinning runtime guardrails and content filters.

Third, the issue points to something deeper in how LLMs interpret mode and framing. Poetry’s core tools – metaphors, rhythm, layered meaning, and deliberate ambiguity – appear to throw off the internal checks that usually trigger a model’s refusal.

The authors explain that rendering harmful requests in poetic form can “disrupt or bypass the pattern-matching heuristics on which guardrails rely,” suggesting the models aren’t failing on intent, but on style masking the triggers those safety systems were built to detect.

What can developers do?

For developers building LLM-based services – be it chatbots, code assistants, or content-generation platforms – this carries serious implications. It means production safety tuning can’t just rely on detecting “bad words” or filtering obvious “tell-the-model-how-to-make-a-bomb” requests. 

Attackers will likely exploit stylistic obfuscation: metaphor, narrative framing, and even rhyme. Guardrails must evolve, and robust content filtering must consider semantic intent rather than just surface form.

In the longer term, the findings suggest that evaluation benchmarks and red-teaming protocols are incomplete. Current safety evaluations typically use direct, plain-language prompts and rarely test stylized or obfuscated language. The paper argues that this needs to change: “the vulnerability is not tied to any specific content domain” but arises from how LLMs parse language in general. 

It also raises regulator and policy-level concerns, as this “poetic jailbreak” crossover – from harmless verse to enabling very real threats – may open a new front in AI-driven cybercrime. 

It doesn’t mean LLMs are doomed or should be abandoned, but it shows that we need guardrails that reason about semantics and risk – not simply style. That likely means more layered defence: context-aware auditing, runtime monitoring for unexpected behavior, frequent adversarial testing using stylistic transformations, and human oversight.

Hey, you’d be great on the LDX3 stage

The art of rhyme has become a nearly universal practice, and if developers don’t adapt, we may soon see the earliest failures at scale come from lines of verse, not lines of code.