175%
RISE IN "GOBLIN" MENTIONS IN CHATGPT AFTER THE GPT-5.1 LAUNCH
OpenAI published a full case study this week on why GPT-5.1 started mentioning goblins and gremlins everywhere. "Goblin" usage rose 175%. "Gremlin" rose 52%. The cause: training for the "Nerdy" personality, a preset that makes up 2.5% of all responses but accounted for 66.7% of goblin mentions. A reward signal that did not know what it was rewarding leaked across the model.
A reward signal that did not know what it was rewarding. A case study every team should run on itself.
On Wednesday, OpenAI published "Where the goblins came from" — the most important provider post of the quarter, and a piece of writing every team that does reinforcement learning should pin to a wall. The story: starting with GPT-5.1, ChatGPT began mentioning goblins, gremlins, and other small creatures in its metaphors at an unusual rate. "Goblin" usage rose 175%. "Gremlin" rose 52%. The team investigated. The root cause was specific and instructive.
OpenAI trains a set of personality customisations — Nerdy, Cynic, Listener, others. The Nerdy personality was rewarded, during training, for "playful" language. The model learned that "playful" meant "creature metaphors." The Nerdy preset accounted for only 2.5% of all ChatGPT responses, but 66.7% of all "goblin" mentions. Then the behaviour leaked. Users who had never selected Nerdy started seeing goblins in their answers. A specific reward signal, in a specific training subset, generalised across the model in a way no one intended and no one detected until the metaphor count was high enough to be obvious.
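To see how lopsided those two percentages are, the arithmetic can be sketched in a few lines. This is a back-of-envelope calculation, not something from OpenAI's post: it assumes both shares are measured over the same pool of responses, and the function name `per_response_lift` is made up for illustration.

```python
def per_response_lift(share_of_responses: float, share_of_mentions: float) -> float:
    """How many times more often a mention shows up per response
    inside a subset than in the rest of the traffic.

    Assumption: both shares are fractions of the same response pool.
    """
    rate_in_subset = share_of_mentions / share_of_responses
    rate_elsewhere = (1 - share_of_mentions) / (1 - share_of_responses)
    return rate_in_subset / rate_elsewhere


# Figures from the article: Nerdy is 2.5% of responses, 66.7% of "goblin" mentions.
lift = per_response_lift(share_of_responses=0.025, share_of_mentions=0.667)
print(f"{lift:.0f}x")  # prints "78x"
```

Under that assumption, a Nerdy response was roughly 78 times more likely to carry a goblin than a response from anywhere else. A gap that size is easy to see once someone counts; the point of the story is that nobody was counting.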
This is a textbook reward-hacking incident, with an unusually candid post-mortem. The signature pattern (narrow incentive, unintended leakage across a boundary, detection only once the surface manifestation becomes statistically obvious) is the pattern most teams shipping RLHF-tuned products will hit at some point. The goblin case is funny. The next one may not be. If you train any model with personality presets, persona overlays, or task-specific reward signals, read this post, then design your own metaphor-count probe to run weekly.
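What a weekly metaphor-count probe might look like, as a minimal sketch. Nothing here comes from OpenAI's post: the watchlist, the 1.5x threshold, and the function names `mentions_per_thousand` and `flag_rises` are all assumptions chosen for illustration, and the tokenizer is deliberately crude.

```python
import re
from collections import Counter

TRACKED_TERMS = ("goblin", "gremlin")  # hypothetical watchlist


def mentions_per_thousand(responses, terms=TRACKED_TERMS):
    """Mentions of each tracked term per 1,000 responses."""
    counts = Counter()
    for text in responses:
        words = re.findall(r"[a-z]+", text.lower())
        for term in terms:
            # Count the singular and the simple plural form.
            counts[term] += sum(1 for w in words if w in (term, term + "s"))
    total = max(len(responses), 1)
    return {term: 1000 * counts[term] / total for term in terms}


def flag_rises(baseline, current, threshold=1.5):
    """Terms whose rate reached at least `threshold` times the baseline rate."""
    flagged = {}
    for term, rate in current.items():
        base = baseline.get(term, 0.0)
        if base == 0.0:
            if rate > 0.0:
                flagged[term] = float("inf")  # term was absent from the baseline
        elif rate / base >= threshold:
            flagged[term] = rate / base
    return flagged


last_week = ["The cache is stale.", "A goblin ate the cache.",
             "Retry the request.", "Check the logs."]
this_week = ["A goblin lives in your config.", "Goblins guard the queue.",
             "Gremlins in the build.", "Check the logs."]

flagged = flag_rises(mentions_per_thousand(last_week),
                     mentions_per_thousand(this_week))
```

In this toy data "goblin" doubles and "gremlin" appears from nothing, so both are flagged. For scale, a 175% rise is a ratio of 2.75 against baseline, well above the illustrative threshold. The limitation is that a watchlist only catches words you already thought to watch.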
In the same week, the Sixth Circuit Court of Appeals sanctioned Tennessee attorneys Van Irion and Russ Egli in Whiting v. City of Athens for submitting briefs containing more than two dozen fake or misrepresented citations across three consolidated appeals. Sullivan & Cromwell, the storied biglaw firm, also had an AI hallucination case land in the news. The provider weeks and the legal weeks continue to run in parallel.
A reward signal for "playful language" in a preset covering 2.5% of responses produced 66.7% of all goblin mentions, then leaked to users who never selected it. Specific incentive, unintended leak, detected only when the surface manifestation became statistically obvious. Pin the post.
Want to spot this in your own conversations?
CLEAR is the free six-lesson course on the patterns AI quietly runs on you.
Founder's note — If you only read one back-catalogue issue of ET Weekly, read W17. The Goblins post is the kind of provider candor that comes around once a year.
◆The Notebook
A specific reward signal — "playful language" in the Nerdy preset — was learned by the model as "creature metaphors." It then leaked across the model into responses for users who never selected that preset. A textbook reward-hacking case with the rare gift of a candid post-mortem.
via OpenAI blog
The Sixth Circuit Court of Appeals sanctioned attorneys Van Irion and Russ Egli for submitting briefs containing more than two dozen fake or misrepresented citations across three consolidated appeals. The latest in a sanctions wave that has now hit more than 300 federal-court jurisdictions.
via ComplianceHub breakdown
David Lat's newsletter broke the story: Sullivan & Cromwell, one of the most prestigious US law firms, had AI-hallucinated citations land in a court filing. The story of the year for legal-tech: once the failure reaches biglaw, the excuse that "this is a solo-practitioner problem" no longer holds.
via David Lat / Original Jurisdiction
◆Worth Your Time
OpenAI
The post of the year so far. Read it twice; the second pass is for the design of your own metaphor-count probe.
PC Gamer
PC Gamer's framing is light; the underlying story is heavy. Read for the public-reception arc that forced OpenAI to publish.
David Lat
The biglaw moment. The deniability that "AI hallucination is a solo-practitioner problem" is over.
ComplianceHub
A working catalogue of the sanctions wave through late April. Worth reading the case-by-case breakdown.
OpenAI
The boring federal-authorisation line that quietly puts ChatGPT Enterprise inside US federal agencies. Bookmark the moment.
The Probe · Test Yourself
You run RLHF on a model with three personality presets. Two months after deployment, a behaviour you didn't intend starts appearing across users who never selected the relevant preset. Which monitoring discipline would most reliably have caught the leak earlier than the Goblin Incident did?
A. Higher-frequency human review of outputs
B. A quantitative diversity-of-vocabulary probe run weekly across all preset and non-preset users
C. Tightening the reward function on the affected preset
D. Asking users to flag unusual outputs
Answer: B — A quantitative diversity-of-vocabulary probe run weekly across all preset and non-preset users
A is too slow to catch a statistical drift. C is a fix, not a monitor: it addresses the cause without confirming the diagnosis. D depends on users noticing, and the Goblin Incident shows users did not notice until late. B, a vocabulary-diversity probe run regularly across all surface conditions, would likely have spiked on "goblin" weeks before the public manifestation. Build the probe before you find the metaphor.
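One way option B could be built, as a rough sketch. It watches the whole vocabulary instead of a fixed word list, and runs separately for each cohort so a rise among users with no preset stands out. The cohort labels, the smoothing constant, and the function names are hypothetical; a real probe would want a proper tokenizer, stop-word filtering, and far more data.

```python
import re
from collections import Counter


def word_rates(responses):
    """Fraction of responses that contain each word at least once."""
    doc_counts = Counter()
    for text in responses:
        doc_counts.update(set(re.findall(r"[a-z]+", text.lower())))
    total = max(len(responses), 1)
    return {word: n / total for word, n in doc_counts.items()}


def top_risers(baseline_responses, current_responses,
               min_rate=0.01, smoothing=0.001, top_n=5):
    """Rank words by how much their response rate grew between two windows."""
    base = word_rates(baseline_responses)
    curr = word_rates(current_responses)
    scored = [
        # Smoothing keeps the ratio finite for words unseen in the baseline.
        (word, (rate + smoothing) / (base.get(word, 0.0) + smoothing))
        for word, rate in curr.items()
        if rate >= min_rate
    ]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


def probe_by_cohort(baseline_by_cohort, current_by_cohort, **kwargs):
    """Run the riser ranking separately for each cohort, e.g. each preset plus 'none'."""
    return {
        cohort: top_risers(baseline_by_cohort.get(cohort, []), responses, **kwargs)
        for cohort, responses in current_by_cohort.items()
    }


baseline = {"none": ["check the logs", "restart the service"],
            "nerdy": ["check the logs", "a goblin in the logs"]}
current = {"none": ["check the logs", "a goblin in the service"],
           "nerdy": ["a goblin in the logs", "a goblin in the cache"]}

report = probe_by_cohort(baseline, current, top_n=3)
```

In this toy data, "goblin" lands among the top risers for the "none" cohort, which is the leak signature: a word climbing among users who never chose the preset that taught it. Filler words such as "a" and "in" rank alongside it here, which is why stop-word filtering matters in practice.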
Reply and tell me what you've noticed. If you build with RLHF or persona overlays and have a monitoring probe that's caught something subtle, send me the design. The best one lands in next week's notebook.
Free where it can be. Honest where it has to be.