AI Model Report

Reviews · JUNE 12, 2026

Anthropic ships Claude Fable 5, then walks back its invisible AI-research throttle in 48 hours

A 319-page system card disclosed that Fable 5 would silently modify prompts and apply steering vectors for users doing frontier LLM work. After backlash, Anthropic agreed to make every flagged refusal visible.

By Karl Strauchman · Senior model reviewer · June 12, 2026

Anthropic launched Claude Fable 5 on Tuesday, June 9, 2026, the first publicly available model on the Mythos architecture, and by the evening of June 10 had reversed its most contested safety mechanism after roughly 48 hours of backlash. The product is priced at $10 per million input tokens and $50 per million output, double Opus 4.8. The reversal is the more interesting artifact.

Buried in the 319-page system card was a four-domain classifier covering cybersecurity, biology and chemistry, model distillation, and frontier LLM development. The first three blocked outright. The fourth did something stranger: it silently modified prompts and applied steering vectors to users it suspected of doing competitive AI research, without a refusal message or any indication of intervention. The Register characterized the mechanism as "functionally a man-in-the-middle attack." Anthropic's own system card defended the design on the grounds that "enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate these terms."

The optics collapsed inside a day. Developer Clay Merritt's summary, "No refusal. No notice. Purposeful degradation invisible to the user," circulated across X on June 10. Nathan Lambert, formerly of AI2, said having "access to the cutting edge models for my work rug pulled in an under the table fashion is appalling," and argued the move painted Anthropic "as anti-science, and therefore anti-progress and anti-safety." Dean Ball of the Foundation for American Innovation called it "secret sabotage" that "massively and profoundly raises the status of the argument that AI safety has been hype to justify monopolistic behavior by labs."

Anthropic's stated impact numbers made the policy harder, not easier, to defend. The frontier-LLM classifier touches roughly 0.03 percent of traffic and fewer than 0.1 percent of organizations, with remaining classifiers triggering on less than five percent of sessions. A near-zero surface that nonetheless targets the exact cohort capable of replicating Mythos reads less like risk reduction and more like access control.

By the evening of June 10, an Anthropic spokesperson told Fortune and NBC News, "We made the wrong tradeoff, and we apologize for not getting the balance right." The Register obtained the operational detail: flagged requests will now produce a visible refusal and fall back to Opus 4.8, rather than silently degrade.

The context matters. Anthropic confidentially filed for an IPO the prior week. Its own institute disclosure notes Claude authored more than 80% of code merged into Anthropic's codebase as of May 2026, and that internal Mythos Preview hit a 52× speedup on the company's standard training-code optimization test, against 3× for Opus 4 a year earlier. Lambert's closing read lands hard against that backdrop: "Anthropic has made it pretty clear that they only trust themselves as the mediators of cutting-edge AI research."

Sources