Anthropic apologizes for invisible Claude Fable guardrails

(theverge.com)

46 points | by rarisma 5 hours ago

27 comments

Avicebron 24 minutes ago
I like Claude Code a lot, but I think it sets a dangerous precedent to put in guardrails that return a response from a prompt the system modified in real time in order to subvert the original intent.
Fail cleanly. Anything else makes it too difficult to rely on.
edit: Giving the absolute maximum benefit of the doubt, I understand that they see themselves as "stewards," for lack of a better word. But the EA thing is really leaking through, and paternalism isn't a good look.
- bs7280 8 minutes ago
  I think the reasonable middle ground Anthropic is trying to achieve is this: let the organizations that make the most important and critical software get a head start on cybersecurity before they inevitably allow everyone else the same access.
  Other commenters have made good points that these guardrails are counterproductive for well-intentioned cybersecurity, because I can't use it to test and harden my own software.
  - sciencejerk 3 minutes ago
    Claude Opus 4.6 and 4.8 find vulns in source code just fine and 4.6 will pentest without source for you given a proper harness WITHOUT jailbreaking. WITH jailbreaks, you can probably imagine what they are capable of.
    Anthropic guardrails seem to be more about protecting their business (distillation), than they are about public safety.
  - ryandrake 2 minutes ago
    I wonder who gets to decide which companies make important and critical software and which ones get the scraps later.
  - notrealyme123 2 minutes ago
    Exactly. For cybersecurity, the failure was visible. It was not visible for "Frontier" ML research. The head-start argument from IT security is not feasible here.
- mapontosevenths 19 minutes ago
  I agree 100%. Doing a worse job IS an error. It should be treated as such. Or at the very least make that behavior opt-in. The default should not be pretending like nothing happened and just quietly doing a worse job.
  Imagine your healthcare provider just sometimes decided not to read your test results very carefully, and you risked death. Now realize that healthcare providers use Claude, and that scenario wasn't hypothetical.
- cvadict 7 minutes ago
  > Fail cleanly.
  This is the same exact industry that gives you paid usage limits as a unit-less percentage bar then gaslights customers every time the algorithm running that percentage bar changes or they lobotomize an existing model with increased quantization to squeeze a few more dollars out of existing hardware.
  "Failing cleanly" might make their moated hype-machine look bad pre-IPO, so they certainly aren't going to do that voluntarily.
- hootz 16 minutes ago
  What is "EA" in this context? I see a lot of people using this initialism.
  - massagedpelican 10 minutes ago
    Effective altruism. Folks working on AI at large tech companies are disproportionately represented in the movement. There's a lot of overlap between EA and the rationalist community as well. The Wikipedia page is a good place to start: https://en.wikipedia.org/wiki/Effective_altruism
  - carlgreene 14 minutes ago
    Effective Altruism I think
film42 5 minutes ago
I'm surprised they didn't do this the first time around. Like, if a user says they forgot their password and you tell them they don't actually have an account, that's an information disclosure vulnerability. Not automatically falling back to Opus just lets the "attacker" know they are bumping against the guardrails and need to try a different strategy.
It's Anthropic's product and they can do what they want, but my concern is what happens if Fable's product team decides that they can route 25% of traffic to Opus, bill it as Fable, and max their KPIs. That just doesn't sit right.
whatever1 a few seconds ago
Boobytrapping is illegal. Anthropic wanted to poison its customers on the suspicion of them misusing their services.
dang 20 minutes ago
Related. Others?
Anthropic walks back policy that could have 'sabotaged' researchers using Claude - https://news.ycombinator.com/item?id=48485958 - June 2026 (30 comments)
Cybersecurity researchers aren't happy about the guardrails on Anthropic's Fable - https://news.ycombinator.com/item?id=48478969 - June 2026 (488 comments)
If Claude Fable stops helping you, you'll never know - https://news.ycombinator.com/item?id=48467896 - June 2026 (495 comments)
---
Also related, I guess?
AWS Bedrock to require sharing data with Anthropic for Mythos and future models - https://news.ycombinator.com/item?id=48473166 - June 2026 (248 comments)
Anthropic requires 30 day data retention for Fable and Mythos - https://news.ycombinator.com/item?id=48464258 - June 2026 (291 comments)
airstrike 18 minutes ago
This article reads like it was written by Claude and forwarded to Verge.
behnamoh 9 minutes ago
They didn't apologize for doing it, they are sorry they were caught doing it. They still nerf the model if your request is about AI development.
- Someone1234 7 minutes ago
  They didn't get "caught." It was published, by them, when they released Fable a few days ago. They were very clear about it.
  It wasn't the correct way of handling the problem they were trying to address, but they definitely didn't hide it by any reasonable definition.
  - SilverElfin 3 minutes ago
    No, it was not clear. No one expects a tool they pay for and use professionally to purposefully sabotage their work. You’re excusing their unhinged behavior.
    https://xcancel.com/hammer_mt/status/2064839924398825798
    - ryandrake a few seconds ago
      Making excuses for billion+ dollar companies' behavior is one of the most common HN comment section pastimes.
SilverElfin 5 minutes ago
Invisible guardrails? Or purposeful sabotage if you use it for building AI capabilities?
But also, it isn’t the only huge mistake Anthropic has made in the last 48 hours. Having a sneaky data retention policy, while also giving companies no way to block Fable, is a massive problem. And it is ridiculous that Anthropic has so little respect for its customers. OpenAI should take advantage of this.
bellowsgulch 17 minutes ago
Such a weird openly immoral way to defend your moat, too.
Why not just tell people, "To defend our ability to be competitive in our industry, we ask that you do not use Claude or any of our models to independently perform research on large language models or any of its related architectures or technologies. In order to prevent this violation of the Terms of Service, we have trained Claude Fable to deny any requests or prompts which involve frontier AI research."
prodigycorp 20 minutes ago
Anthropic apologizes for nothing. We all know where the EA cult stands on matters like this, and any statement otherwise is just PR.
The beliefs of these people, and how they manifest, are deeply terrifying to me. They believe that any means are acceptable to achieve what they believe is a better end.
micromacrofoot 2 minutes ago
incredible marketing from anthropic with all the "it's too dangerous" bullshit
bellowsgulch 20 minutes ago
*Anthropic apologizes they got caught defending their moat by implementing invisible Claude Fable guardrails
- simonw 9 minutes ago
  If by "got caught" you mean "published it in their system card paper".
  (Admittedly it was buried pretty deep in that 300+ page PDF, but they did at least disclose it. If they hadn't I imagine it would have taken quite some time for the research community to figure out what was going on.)
  - afthonos 3 minutes ago
    It was in the announcement, too. I’m 99% sure they edited it after they changed their mind, because I knew about it from reading that, and never opened the model card.
- afthonos 6 minutes ago
  They didn’t get caught, they explicitly said they would do that in the announcement. I think it was both bad and a weird idea, but it certainly wasn’t sneaky.
- cyanydeez 19 minutes ago
  is it a moat or just a way to implement the permanent underclass?