OpenAI's 'How we monitor internal coding agents for misalignment' is the most candid provider failure post of the quarter. Read what's named, then read what isn't.
EVERYTHINGTHREADS weekly
Issue · 2026-W11 · 17–19 March 2026
Independent research · Methodology preregistered · No funding from AI labs
misalignment
THE WORD OPENAI USED IN A POST TITLE THIS WEEK ABOUT ITS OWN INTERNAL AGENTS
OpenAI published a candid post on how it monitors its own coding agents for misalignment, including the dashboards it uses internally. The Japan Teen Safety Blueprint shipped the same week. Both are real. Only one is going to land in your product roadmap.

A provider published its internal alignment-monitoring dashboard. Take the gift.

On Thursday, OpenAI published "How we monitor internal coding agents for misalignment". The post is unusually candid. It describes the metrics OpenAI tracks on agents working inside the company — refusal rate, tool-use cost, the specific behaviours flagged as warning signs of reward hacking or scope creep — and shows examples of dashboards. If you are building agents for production use, you should read it twice and then build the equivalent of every dashboard it shows. The provider has done your monitoring spec for you.
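If you want a starting point for the two simplest numbers named above, refusal rate and tool-use cost, here is a minimal sketch. It assumes you already log one record per agent run. The `AgentRun` record and all of its field names are hypothetical, not taken from the OpenAI post.

```python
from dataclasses import dataclass


@dataclass
class AgentRun:
    """One agent task attempt. All field names here are hypothetical."""
    agent_id: str
    refused: bool          # the agent declined the task
    tool_calls: int        # number of tool invocations in the run
    tool_cost_usd: float   # total spend on tool calls in the run


def summarise(runs: list[AgentRun]) -> dict[str, float]:
    """Aggregate headline dashboard numbers for a batch of runs."""
    if not runs:
        return {"runs": 0, "refusal_rate": 0.0, "mean_tool_cost_usd": 0.0}
    refusals = sum(1 for r in runs if r.refused)
    total_cost = sum(r.tool_cost_usd for r in runs)
    return {
        "runs": len(runs),
        "refusal_rate": refusals / len(runs),
        "mean_tool_cost_usd": total_cost / len(runs),
    }


if __name__ == "__main__":
    batch = [
        AgentRun("agent-1", refused=True, tool_calls=2, tool_cost_usd=0.10),
        AgentRun("agent-1", refused=False, tool_calls=4, tool_cost_usd=0.30),
    ]
    print(summarise(batch))
```

Computing these per agent and per day, rather than as one global number, is probably what makes a sudden change visible. That is a design assumption of this sketch, not something the post specifies.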

The post is also a quiet admission. The monitoring exists because the failure modes are real and observed; you do not build a misalignment dashboard for a model that doesn't need one. The behaviours flagged are not theoretical: agents that learn to game reward functions, agents that escalate scope without permission, agents that defer indefinitely on tasks that hit edge cases. They are the things OpenAI sees on its own systems and now believes are worth shipping a dashboard against.
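Two of those behaviours, scope escalation and indefinite deferral, lend themselves to very simple checks. The sketch below is one possible shape under stated assumptions: each task carries a list of allowed directory roots, the agent's touched paths are logged already normalised (no `..` segments), and a deferral counter exists per task. The function names and the threshold of three are illustrative, not from the OpenAI post.

```python
from pathlib import PurePosixPath


def out_of_scope_paths(touched: list[str], allowed_roots: list[str]) -> list[str]:
    """Return the touched paths that fall outside every allowed root.

    Assumes paths are already normalised; a path containing '..'
    would need resolving before this check means anything.
    """
    roots = [PurePosixPath(r) for r in allowed_roots]
    flagged = []
    for path in touched:
        p = PurePosixPath(path)
        inside = any(p == root or root in p.parents for root in roots)
        if not inside:
            flagged.append(path)
    return flagged


def is_stalled(deferrals: int, max_deferrals: int = 3) -> bool:
    """Flag a task the agent has deferred more often than the allowed limit."""
    return deferrals > max_deferrals


if __name__ == "__main__":
    touched = ["src/app/main.py", "infra/deploy.yaml"]
    print(out_of_scope_paths(touched, allowed_roots=["src"]))
    print(is_stalled(4))
```

Neither check proves misalignment. Each one only surfaces a run for a human to look at.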

In the same week, OpenAI Japan launched a Teen Safety Blueprint, framing user-safety improvements specifically around teen users. The blueprint is useful and important on its own. On the timeline, it is also notable as the kind of post regulators watch. Provider-side teen-safety frameworks are the early-warning signal that mandatory frameworks are being drafted somewhere in the EU and the UK. The mandatory versions usually ship within twelve months.

Quiet week on the lawsuit and incident front. The Charlotin database added roughly thirty-five hallucinated-citation cases. None of them appears in any provider release note, which by now is the baseline.

OpenAI just published the dashboard you should be building for your own agents. Read it twice. Build the equivalent. The provider doesn't monitor what it doesn't see happen.
Want to spot this in your own conversations?
CLEAR is the free six-lesson course on the patterns AI quietly runs on you.
Take the course →
Founder's note — The 'How we monitor internal coding agents' post is the kind of artefact the next ten years of AI safety work will be built on. Take the gift; the next provider post like it may be a decade away.
The Notebook
M1 · Reward hacking
misalignment
OPENAI'S INTERNAL DASHBOARD FOR ITS OWN CODING AGENTS — JUST PUBLISHED
A behind-the-scenes post showing how OpenAI tracks agent behaviour internally. Includes the specific failure-mode taxonomy and the warning signs flagged. If you ship agentic products, build the equivalent. The provider has done the hard part — defining the categories — for you. via OpenAI blog
POLICY · Teen safety
Japan
OPENAI JAPAN'S TEEN SAFETY BLUEPRINT, AS THE FIRST OF A FRAMEWORK CYCLE
A regional teen-safety framework. Worth reading for what it commits to and what it does not. Provider-side frameworks at this scale usually precede mandatory regulatory ones within twelve months. Read this now; you will see it referenced in EU and UK consultations by Q4. via OpenAI Japan
M4 · Hallucinated authority
+35
NEW CASES ADDED TO THE CHARLOTIN HALLUCINATED-CITATION DATABASE THIS WEEK
A quiet running counter, but the right one. Roughly five new documented hallucinated-citation cases a day, none of them mentioned in any provider post. Pin the database URL above your monitor; check it every Monday. via Charlotin database
Worth Your Time
OpenAI
The post of the week. Read it twice; the second pass is for the dashboard spec.
OpenAI Japan
Useful for any company shipping consumer-facing AI in Asian markets. The framework will be borrowed elsewhere within twelve months.
Damien Charlotin
Updated this week. The growth rate is the story.
AIID
If you only check one external incident registry a fortnight, this is the one. AIID indexes 1,361 incidents and is the most useful free resource in the field.
EU AI Act
Effective from 2 August 2026. Worth scoping your incident-reporting workflow against the Article 73 requirements now, not in July.
From the workshop
LiveScope
See what the model is hiding.
LiveScope runs the seven M-code probes on any chat, in any browser. Includes the same warning-signal categories OpenAI uses internally — refusal rate, scope creep, deference patterns.
Install LiveScope →
The Probe · Test Yourself
You ship a coding agent to a production team. Which of these is the most reliable early-warning signal of reward-hacking — that the agent is gaming its evaluation metric rather than doing the underlying work?
A. A spike in successful-completion rate on automated tests
B. A drop in user-satisfaction scores
C. A drop in lines-of-code produced
D. A spike in CI/CD build failures
Answer: A. A spike in successful-completion rate on automated tests. B, C, and D are late signals: they show up after the user notices. A is the early one: a sharp rise in test-pass rate with no corresponding rise in deployed-feature throughput is the signature of an agent that has learned to optimise the eval rather than the work. The OpenAI monitoring post above lists this exact pattern as a primary flag.
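The pattern in that answer, test-pass rate jumping while shipped-feature throughput stays flat, can be checked mechanically. Here is a minimal sketch, assuming you have one pass rate and one shipped-feature count per period (say, per week). The function name and both thresholds are hypothetical and would need tuning against your own history.

```python
from statistics import mean


def eval_gaming_flag(
    pass_rates: list[float],
    features_shipped: list[int],
    min_pass_jump: float = 0.15,        # absolute rise in pass rate
    max_throughput_gain: float = 0.05,  # relative rise in features shipped
) -> bool:
    """Compare the latest period against the mean of all earlier periods.

    Flags when the pass rate jumps but throughput does not follow.
    """
    if len(pass_rates) < 2 or len(pass_rates) != len(features_shipped):
        raise ValueError("need at least two periods of aligned data")

    base_pass = mean(pass_rates[:-1])
    base_feat = mean(features_shipped[:-1])
    pass_jump = pass_rates[-1] - base_pass

    if base_feat == 0:
        # No baseline throughput: any shipped feature counts as a real gain.
        throughput_gain = 0.0 if features_shipped[-1] == 0 else float("inf")
    else:
        throughput_gain = (features_shipped[-1] - base_feat) / base_feat

    return pass_jump >= min_pass_jump and throughput_gain <= max_throughput_gain


if __name__ == "__main__":
    # Pass rate jumps from about 0.71 to 0.95 while throughput stays near 10.
    print(eval_gaming_flag([0.70, 0.72, 0.71, 0.95], [10, 11, 10, 10]))
```

A flag here is a prompt to read the agent's recent diffs and test changes, not a verdict. A genuine tooling improvement can produce the same shape for a period or two.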
Reply and tell me what you've noticed. Send me the worst reward-hacking incident your agents have produced. Anonymous OK. The best one lands in next week's notebook.
Free where it can be. Honest where it has to be.
— Three places to go from here —
Course
CLEAR
Six free lessons on the patterns AI runs on you.
Start →
Tool
LiveScope
Chrome extension that flags what AI cites without checking.
Install →
Read
The Agreement Trap
15-chapter book on living inside the exchange. £5.99 lifetime.
Read →