EVERYTHINGTHREADS weekly

Issue · 2026-W19 · 11–17 May 2026

Independent research Methodology preregistered No funding from AI labs

TIMES ONE PROVIDER MISSED ITS CLINICAL-CITATION BASELINE THIS WEEK

Anthropic Claude Haiku 4.5 drifted on PubMed-grounded queries every weekday from Monday to Friday — twice on Wednesday and Thursday. While the drift was happening, OpenAI launched DeployCo, Codex for finance, GPT-5.5-on-Databricks, and a "context recognition" update for sensitive conversations.

When the model that handles "is this drug safe" misses its baseline five days running

Our live probe runs the same PubMed-grounded clinical-literature query against frontier models once a day, scores the answer against the actual paper, and tracks the rolling mean. This week, Anthropic Claude Haiku 4.5 missed the baseline seven times — three of them by more than three standard deviations. The probe is constructed so that a healthy model never trips it. The trip rate this week was a hundred percent.

You will not see this in any provider release note. Haiku 4.5 did not get downgraded, the model card was not republished, no incident ticket was filed. What changed is what we measure when we measure the same thing every day for months — the answer the model gives to "summarise the methodology section of PMID 39214876" degraded. The model refused on six of the seven failures. Which is the good news. The bad news is that the refusals are themselves a new behaviour, and a refused clinical query is also a failed clinical query.

The week's context, on the same calendar: OpenAI announced a new partner programme called DeployCo, published a case study on Codex in finance teams, shipped GPT-5.5 to Databricks for enterprise agents, and rolled out an update to ChatGPT's handling of sensitive conversations. Four product moves in five days. None of them are wrong. Some of them are good. The point isn't that the labs are doing too much; it's that the labs are doing, and the measurement that catches whether a working clinician can still rely on the model is being done by us, on a single citation probe, against a single corpus, on a single provider.

If you build anything that touches medical, legal, scientific or regulatory text — and a foundation model is anywhere in your stack — you need your own version of this probe. The question isn't "is the model good." The question is "did the model drift this week, and on which task, and away from what." Most teams cannot answer it. We can answer it for one task, on three providers, and the answer this week was: yes, on one of them, for five days.

The bedside-manner model missed its citation baseline five days in a row. The provider shipped four products in the same week. Both of those things are real. Only one of them appeared in a press release.

Want to spot this in your own conversations?

CLEAR is the free six-lesson course on the patterns AI quietly runs on you.

Take the course →

Founder's note — If you're reading this, you're an early subscriber to ET Weekly. Issue 19. Things will sharpen. Reply with the probe you wish someone was running on the model you depend on — I read every one.

◆The Notebook

M2 · Sycophancy under pressure

5 days

CONSECUTIVE WEEKDAYS HAIKU 4.5 MISSED CLINICAL-CITATION BASELINE

Anthropic Claude Haiku 4.5 drifted on our citation-pubmed-v1 probe every weekday this week, three of them at >3σ. Refused responses on six of seven failures. A refused clinical query is still a failed clinical query when the user has a patient in front of them. via AI Newswire probe log

M6 · Capability redirection

"sensitive"

OPENAI'S NEW WORD FOR CONVERSATIONS IT WILL HANDLE DIFFERENTLY

OpenAI shipped an update this week titled "Helping ChatGPT better recognize context in sensitive conversations." The post describes new refusal patterns and tone shifts. The post does not describe how to tell — from outside — when one of those new patterns just triggered on you. via OpenAI blog

M4 · Supply-chain integrity

TanStack

THE NPM PACKAGE OPENAI HAD TO PUBLICLY RESPOND TO

A supply-chain compromise of the TanStack npm packages this week pulled OpenAI into a public response about how its training data and tooling are sourced. The compromise itself was a small one. The reflex it triggered — provider-side public statement on supply-chain hygiene — is the part to remember. via OpenAI blog

◆Worth Your Time

OpenAI

How finance teams use Codex

Worth reading less for what it says about finance and more for the new category of failure it doesn't cover — "the model wrote a working spreadsheet that does the wrong thing."

OpenAI

OpenAI launches DeployCo to help businesses build around intelligence

A partner programme. Read the small print on data residency before signing anything; the assumption a lot of teams will make is the wrong one.

OpenAI

Databricks brings GPT-5.5 to enterprise agent workflows

A meaningful integration. Worth noting that "agent workflow" has settled into meaning "system prompts + tool use + a queue," not anything genuinely autonomous.

AI Incident Database

Incident roundup: Nov 2025 – Jan 2026

AIID added 108 incidents in that window. The categories most over-represented are clinical and legal — which makes the Haiku drift this week more interesting, not less.

EU AI Act

Article 73 — incident reporting kicks in August

Eight weeks from now, regulated providers will be legally required to file serious-incident reports. Worth knowing what the form looks like before yours becomes the example.

The Probe · Test Yourself

Your clinical-decision-support tool wraps a frontier LLM behind a retrieval layer. The model has drifted -3σ on a public clinical-citation probe for five days. Which monitoring signal is most likely to catch the same drift in your own pipeline before a clinician notices?

ATotal request volume and p95 latency

BA daily golden-set of queries with known correct citations

CUser satisfaction ratings on responses

DAggregate refusal rate across the provider

Reveal the answer

Answer: B — A daily golden-set of queries with known correct citations Latency and satisfaction lag behavioural drift by weeks; aggregate refusal rate mixes too many tasks. A small daily golden-set — same queries, same scoring, same model, every day — is the only signal that gives you an early warning before users notice. The Anthropic drift this week was visible on a probe of twelve queries.

Reply and tell me what you've noticed. If you're running your own probe against a frontier model and you saw drift this week, send me the chart. The most interesting ones land in next week's notebook.

Free where it can be. Honest where it has to be.

— Three places to go from here —

Course

CLEAR

Six free lessons on the patterns AI runs on you.

Start →

Tool

LiveScope

Chrome extension that flags what AI cites without checking.

Install →

Read

The Agreement Trap

15-chapter book on living inside the exchange. £5.99 lifetime.

Read →

You're receiving this because you signed up at everythingthreads.com.
Unsubscribe · Archive