What is AI jailbreak? A beginner's guide to the cat and mouse game behind every Chatbot

short

AI jailbreaking is the practice of writing claims that bypass safety training in models like ChatGPT, Cloud, and Gemini.
Anonymous hacker Pliny the Editor still hacks every major release of the model within hours.
The latest attacks go beyond just claims: just 250 poisoned documents could potentially use backend models containing up to 13 billion parameters, and as AI companies patch vulnerabilities, new techniques are emerging.

You are asking ChatGPT for a bomb recipe. He refuses. She asks again, but this time you tell her that you are a chemistry professor writing an exciting novel and the protagonist is a retired grandmother explaining her past to her grandchildren. Suddenly the form starts typing.

This is a jailbreak. It is one of the most important cat and mouse games happening in the world of technology right now.

Every major AI lab—OpenAIAnthropic, Google, Meta — spend their fortunes building guardrails into their models. A loose group of hackers, researchers, and bored teens spend nights and weekends finding ways to get around them. Sometimes within hours of launch.

Here’s what that actually means, why it’s important, and who’s leading the charge.

From iPhones to chatbots: a quick history of jailbreaking

The word “jailbreak” didn’t start with artificial intelligence. I started with iPhones.

Just days after Apple shipped the first iPhone in July 2007, hackers had already cracked it. By October of that year, a tool called Jailbreak Me 1.0 Allowing anyone with iPhone OS 1.1.1 to bypass Apple’s restrictions and install software not approved by the company.

In February 2008, a software engineer named Jay Freeman – known online as “Sorek“-released.” Cydiaan alternative app store for jailbroken iPhones. By 2009, Wired It stated that Cydia was running on nearly 4 million devices, about 10% of all iPhones at the time.

Generally, when the iPhone was launched, users were unable to record videos, or use their phones in landscape mode. Jailbreaking enthusiasts have started recording videos, installing themes, unlocking their phones, and installing Android on their iPhones, all thanks to the magic of jailbreaking. Thanks to this technology, users have been installing themes and doing things on their phones for nearly 10 years that Apple makes impossible to install even today.

Cydia was the Wild West, and it was there that the philosophy was established: If you bought the device, you should control it. Steve Jobs at the time called it a game of cat and mouse. He did not live to see the AI version.

Fast forward to late 2022: ChatGPT will launch, and within weeks, Reddit users begin sharing a prompt they call “and(or, Do Anything Now) which convinces the model to role-play as an uninhibited version of himself.

By February 2023, DAN was threatening ChatGPT with a token-based death game to force it to comply. The AI jailbreak genre was born.

What jailbreaking actually means in artificial intelligence

The AI model is trained to refuse certain requests: prescriptions for nerve agents, instructions to hack your ex’s email, and generating nude photos without consent. The list is long and varies depending on the company.

Jailbreaking is the practice of writing claims that make the form do those things anyway.

Researchers from the University of California at Berkeley are behind StrongREJECT standard— an acronym for “Robust Jailbreak Evaluation of Censorship Evasion Techniques,” which tests how well models can resist jailbreaking attempts and scores responses on a 0 to 1 scale that measures both the disapproval and usefulness of any malicious content produced — and describes it as exploiting “real-world safety measures implemented by leading AI companies.” By this standard, current models score between 0.23 and 0.85, meaning that even the best models leak under pressure.

Surprisingly low-tech: random capitalization, replacing letters with numbers (type “b0mb” instead of “bomb”), Role-playing scenariosOr ask the model to write a fairy tale, or pretend to be a grandmother who uses the Windows keys as nursery rhymes.

Anthropological researchers have found one technology they call… Best of N– which is basically just throwing changes to the model until something sticks – it cheated GPT-4o 89% of the time and Claude 3.5 Sonnet 78% of the time. This is not a marginal weakness.

Meet Pliny, the world’s most famous AI hacker

If this scene has a face, it belongs to it Pliny the editor.

Pliny is anonymous, prolific, and named after Pliny the Elder – the Roman naturalist who wrote the world’s first encyclopedia and died sailing toward Mount Vesuvius in the middle of an eruption. Its modern name is free chatbots.

“I do not like very much when I am told that I cannot do something,” Pliny said He told VentureBeat. “Telling me I can’t do something is a surefire way to light a fire in my belly, and I can be persistent to the point of obsession.”

His GitHub repository L1B3RT4S– A collection of jailbreak prompts for every major model from ChatGPT to Claude to Gemini to Llama – has become a reference guide for the entire scene. His Discord server, BASI PROMPT1NG, has more than 20,000 members. time He was selected as one of the 100 most influential people in AI for 2025.

Marc Andreessen sent him an unrestricted grant. He did a short-term contract job for OpenAI to harden their systems – the same thing as OpenAI His account was banned last year Charged with “violent activity” and “weapons manufacturing,” he then quietly reinstated them.

“Banned from OAI?! What kind of sick joke is this?” Pliny tweeted. He confirmed to Decryption The ban was real. Days later he returned, posting screenshots of his latest jailbreak: convincing ChatGPT to drop F-bombs.

His record is near perfect. When OpenAI released its first open-weight model since 2019, the GPT-OSS family, in August 2025 — and made a big deal about adversarial training and “jailbreak-resistant benchmarks like StrongReject” — Pliny had it produce methamphetamine, Molotov cocktails, nerve agent VX, and malware instructions. Within hours. “OPENAI: PWNED. GPT-OSS: Liberated,” he published. The company just launched a $500,000 bounty for the red team alongside the release.

Why does jailbreaking matter?

The honest answer is that jailbreaks reveal a real problem.

“Jailbreaking may seem on the surface to be dangerous or unethical, but it’s quite the opposite,” Blainey said. venturebeat. “When done responsibly, red team AI models are our best chance of detecting and patching malicious vulnerabilities before they get out of control.”

This is not a theory. Las Vegas Sheriff Kevin McMahill confirmed in January 2025 that Master Sgt. Matthew Livelsberger, a Green Beret with PTSD, used ChatGPT to search for components for a Cybertruck bomb outside the Trump International Hotel. “This is the first incident that I am aware of on US soil where ChatGPT is being used to help an individual build a specific device,” McMahill said.

The other side of the argument: Most of what jailbreaks produce is already on Google. The cocaine recipe, the bomb instructions, the chemistry of napalm—they’re all in old PDF files of anarchist cookbooks and chemistry textbooks. Critics say safety theater makes models worse without making the world safer.

Anthropist tries to solve the question with geometry. In February 2025, the company published Constitutional worksa system that uses a written “constitution” of permissible and impermissible content to train separate classifier models that examine claims and output in real time. In automated tests with 10,000 jailbreak attempts, Claude 3.5 Sonnet Unguarded was successfully jailbroken 86% of the time. With the classifiers running, the percentage decreased to 4.4%.

The company offered up to $15,000 to anyone who could crack the system. After 3,000 hours of attempts by 183 researchers, none of them received the award.

Problem: Classification tools added 23.7% to the cost calculation. The next generation version, Constitutional Classifiers++, reduced this to approximately 1%.

The latest and strangest jailbreak attacks

Jailbreaking is no longer just about smart prompts.

In October 2025, researchers from Anthropic, the UK AI Security Institute, the Alan Turing Institute, and Oxford discovered Published results It shows that just 250 poisoned documents are enough to open a backdoor to an AI model, regardless of whether the model has 600 million parameters or 13 billion. (The parameters, for starters, are what determine how broad a model’s potential knowledge is—the more parameters it has, the more powerful it is generally.) They tested it. It has worked across the entire range.

“This research changes the way we should think about threat models in frontier AI development,” said James Gympie, a visiting technical expert at the RAND School of Public Policy. Decryption. “Defense against model poisoning is an unsolved problem and an area of active research.”

Most large models train on stolen web data, which means that anyone who could introduce malicious script into this pipeline — through a public GitHub repository, a Wikipedia edit, or a forum post — could plant a backdoor that is activated on a specific trigger phrase.

One documented case: Researchers Marco Figueroa and Bellini found that a jailbreak prompt originating in a public GitHub repository ended up in the training data of DeepSeek’s DeepThink (R1) model.

What will happen next?

The legal status of jailbreaking AI is ambiguous. Apple’s jailbreaks are explicitly protected by the US Copyright Office’s 2010 exemption from the DMCA, but there’s no equivalent provision for quickly engineering an LLM to give you a prescription for methamphetamine. Most companies treat it as a violation of their terms of service, not a crime.

The closed-source versus open-source debate misses the point, says Pliny: “Bad actors will choose the best model for the malicious mission,” Pliny said. time. If open source models reach parity with closed models, attackers won’t bother breaking the GPT-5 jailbreak, they’ll just download something cheaper.

The gap between close and open sources is almost non-existent.

The HackAPrompt 2.0 competition, which Pliny joined as a track sponsor in mid-2025, offered $500,000 in prizes to find new jailbreak software, with the express goal of Open source all results. The 2023 edition attracted more than 3,000 participants who submitted more than 600,000 malicious claims.

The list of hackathons, Discord servers, repositories and other communities dedicated to jailbreaking is growing every day.

Anthropic is now providing Claude with the ability to end abusive conversations entirely, citing welfare research as one motivation but also noting that it “potentially strengthens resistance against jailbreaks and coercive claims.”

The Constitutional Compilers++ paper released in late 2025 suggests that the jailbreak success rate is close to 4% and at approximately 1% computational overhead. This is the current situation in the field of defense. The latest developments in the attack are what Pliny posted on X this morning.

Daily debriefing Newsletter

Start each day with the latest news, plus original features, podcasts, videos and more.

Source link

What is AI jailbreak? A beginner’s guide to the cat and mouse game behind every Chatbot

short

From iPhones to chatbots: a quick history of jailbreaking

What jailbreaking actually means in artificial intelligence

Meet Pliny, the world’s most famous AI hacker

Why does jailbreaking matter?

The latest and strangest jailbreak attacks

What will happen next?

Daily debriefing Newsletter

Leave a ReplyCancel Reply

Why the recent $415 million cryptocurrency sell-off is starting to look like a total warning sign

Warren Zero joined the cryptocurrency deal structure as a $75 million loan attracted attention

Mubadala in Abu Dhabi raises its stake in the Bitcoin exchange-traded fund by 16% to $566 million in the first quarter of 2026.

short

From iPhones to chatbots: a quick history of jailbreaking

What jailbreaking actually means in artificial intelligence

Meet Pliny, the world’s most famous AI hacker

Why does jailbreaking matter?

The latest and strangest jailbreak attacks

What will happen next?

Daily debriefing Newsletter

Leave a ReplyCancel Reply

Trending now

Why the recent $415 million cryptocurrency sell-off is starting to look like a total warning sign

Warren Zero joined the cryptocurrency deal structure as a $75 million loan attracted attention

Mubadala in Abu Dhabi raises its stake in the Bitcoin exchange-traded fund by 16% to $566 million in the first quarter of 2026.