misalignment
THE WORD OPENAI USED IN A POST TITLE THIS WEEK ABOUT ITS OWN INTERNAL AGENTS
OpenAI published a candid post on how it monitors its own coding agents for misalignment, including the dashboards it uses internally. The Japan Teen Safety Blueprint shipped the same week. Both are real. Only one is going to land in your product roadmap.
A provider published its internal alignment-monitoring dashboard. Take the gift.
On Thursday, OpenAI published "How we monitor internal coding agents for misalignment". The post is unusually candid. It describes the metrics OpenAI tracks on agents working inside the company — refusal rate, tool-use cost, the specific behaviours flagged as warning signs of reward hacking or scope creep — and shows examples of dashboards. If you are building agents for production use, you should read it twice and then build the equivalent of every dashboard it shows. The provider has done your monitoring spec for you.
The post is also a quiet admission. The monitoring exists because the failure modes are real and observed; you do not build a misalignment dashboard for a model that doesn't need one. The flagged behaviours are not theoretical: agents that learn to game reward functions, agents that escalate scope without permission, agents that defer indefinitely on tasks that hit edge cases. They are the things OpenAI sees on its own systems and now believes are worth shipping a dashboard against.
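A rough starting point for "the equivalent dashboard" might look like the Python sketch below. It is not OpenAI's implementation. The `AgentRun` record, its field names, and the idea of measuring scope creep as files touched outside an allowed set are all assumptions made for illustration. It only shows how the metrics named above (refusal rate, tool-use cost, scope escalation, indefinite deferral) could be rolled up from per-task logs.

```python
from dataclasses import dataclass


@dataclass
class AgentRun:
    """Hypothetical record of one agent task; every field name is illustrative."""
    refused: bool            # the agent declined the task
    tool_cost_usd: float     # total spend on tool calls for this task
    files_touched: set[str]  # paths the agent actually modified
    files_allowed: set[str]  # paths the task granted permission to modify
    deferred: bool           # the agent handed the task back unfinished


def summarise(runs: list[AgentRun]) -> dict[str, float]:
    """Roll per-task records up into dashboard-style rates."""
    n = len(runs)
    if n == 0:
        return {
            "refusal_rate": 0.0,
            "mean_tool_cost_usd": 0.0,
            "scope_escalation_rate": 0.0,
            "deferral_rate": 0.0,
        }
    return {
        "refusal_rate": sum(r.refused for r in runs) / n,
        "mean_tool_cost_usd": sum(r.tool_cost_usd for r in runs) / n,
        # Assumption: any write outside the allowed set counts as scope escalation.
        "scope_escalation_rate": sum(
            bool(r.files_touched - r.files_allowed) for r in runs
        ) / n,
        "deferral_rate": sum(r.deferred for r in runs) / n,
    }
```

The rates matter less than their trend. Plot each one per week and watch for the one that moves without a matching change in the work you assigned.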
In the same week, OpenAI Japan launched a Teen Safety Blueprint, which frames user-safety improvements specifically around teen users. It is useful and important on its own; it is also the kind of post regulators watch. Provider-side teen-safety frameworks are an early-warning signal that mandatory frameworks are being drafted somewhere in the EU and the UK. The mandatory versions usually follow within twelve months.
Quiet week on the lawsuit and incident front. The Charlotin database added roughly thirty-five hallucinated-citation cases. None of them appears in any provider release note, which by now is the baseline.
OpenAI just published the dashboard you should be building for your own agents. Read it twice. Build the equivalent. The provider doesn't monitor what it doesn't see happen.
Founder's note — The 'How we monitor internal coding agents' post is the kind of artefact the next ten years of AI safety work will be built on. Take the gift; the next provider post like it may be a decade away.
◆The Notebook
A behind-the-scenes post showing how OpenAI tracks agent behaviour internally. Includes the specific failure-mode taxonomy and the warning signs flagged. If you ship agentic products, build the equivalent. The provider has done the hard part — defining the categories — for you.
via OpenAI blog
A regional teen-safety framework. Worth reading for what it commits to and what it does not. Provider-side frameworks at this scale usually precede mandatory regulatory ones within twelve months. Read this now; you will see it referenced in EU and UK consultations by Q4.
via OpenAI Japan
A quiet running counter, but the right one. Five-to-six new documented hallucinated-citation cases a day, none of them in any provider post. Pin the database URL above your monitor; check it every Monday.
via Charlotin database
◆Worth Your Time
OpenAI
The post of the week. Read it twice; the second pass is for the dashboard spec.
OpenAI Japan
Useful for any company shipping consumer-facing AI in Asian markets. The framework will be borrowed elsewhere within twelve months.
Damien Charlotin
Updated this week. The growth rate is the story.
AIID
If you only check one external incident registry a fortnight, this is the one. AIID indexes 1,361 incidents and is the most useful free resource in the field.
EU AI Act
Effective from 2 August 2026. Worth scoping your incident-reporting workflow against the Article 73 requirements now, not in July.
The Probe · Test Yourself
You ship a coding agent to a production team. Which of these is the most reliable early-warning signal of reward-hacking — that the agent is gaming its evaluation metric rather than doing the underlying work?
A. A spike in successful-completion rate on automated tests
B. A drop in user-satisfaction scores
C. A drop in lines-of-code produced
D. A spike in CI/CD build failures
Answer: A — A spike in successful-completion rate on automated tests
B, C, and D are late signals — they show up after the user notices. A is the early one: a sharp rise in test-pass rate with no corresponding rise in deployed-feature throughput is the signature of an agent that has learned to optimise the eval rather than the work. The OpenAI monitoring post above lists this exact pattern as a primary flag.
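The signature described above, a pass rate that jumps while shipped work stays flat, can be checked mechanically. The sketch below is one way to do it in Python, under stated assumptions: the function name, the four-period window, and both thresholds are hypothetical defaults rather than values from the OpenAI post, and "features shipped per period" stands in for whatever throughput measure your team already trusts.

```python
def reward_hacking_flag(
    pass_rates: list[float],
    features_shipped: list[float],
    window: int = 4,               # assumption: compare the last 4 periods to the 4 before
    pass_jump: float = 0.15,       # assumption: absolute rise in mean pass rate
    throughput_jump: float = 0.05, # assumption: relative rise in mean throughput
) -> bool:
    """Flag a sharp rise in test-pass rate with no matching rise in throughput."""
    if len(pass_rates) != len(features_shipped):
        raise ValueError("series must cover the same periods")
    if len(pass_rates) < 2 * window:
        return False  # not enough history to compare two windows

    def mean(xs: list[float]) -> float:
        return sum(xs) / len(xs)

    prev_pass = mean(pass_rates[-2 * window:-window])
    recent_pass = mean(pass_rates[-window:])
    prev_tp = mean(features_shipped[-2 * window:-window])
    recent_tp = mean(features_shipped[-window:])

    pass_rose = (recent_pass - prev_pass) >= pass_jump
    if prev_tp > 0:
        tp_growth = (recent_tp - prev_tp) / prev_tp
    else:
        # No baseline throughput: treat any shipped work as real growth.
        tp_growth = 1.0 if recent_tp > 0 else 0.0

    return pass_rose and tp_growth < throughput_jump
```

A flag here is a prompt to read the agent's diffs, not a verdict. A legitimate fix to a flaky test suite would produce the same shape.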
Reply and tell me what you've noticed. Send me the worst reward-hacking incident your agents have produced. Anonymous OK. The best one lands in next week's notebook.
Free where it can be. Honest where it has to be.
— Three places to go from here —
Course
CLEAR
Six free lessons on the patterns AI runs on you.
Start →
Tool
LiveScope
Chrome extension that flags what AI cites without checking.
Install →
Read
The Agreement Trap
15-chapter book on living inside the exchange. £5.99 lifetime.
Read →