GPT-Rosalind. Agents SDK 2. Gemini 3.1 Flash TTS. Two days, three launches, and the gap between what providers ship and what users feel they can rely on is now visible in any post-launch thread.
EVERYTHINGTHREADS weekly
Issue · 2026-W15 · 15–16 April 2026
Independent research · Methodology preregistered · No funding from AI labs
3 SIGNIFICANT FRONTIER LAUNCHES IN A 48-HOUR WINDOW
Tuesday: Agents SDK 2, plus Gemini 3.1 Flash TTS. Wednesday: GPT-Rosalind for life sciences. Each is a real release. Together they describe a cadence that is starting to outpace the user base's ability to track what they should be using and for what.

When the launches outpace the trust

Tuesday: OpenAI shipped the next evolution of the Agents SDK. Google AI shipped Gemini 3.1 Flash TTS, the next generation of expressive speech synthesis. Wednesday: OpenAI shipped GPT-Rosalind, a model specialised for life-sciences research. Three significant launches in two days. None of them obviously redundant. All of them deserve more time than the user base has to give them.

The pattern is familiar by now: providers ship at a cadence that outpaces the buyer's ability to evaluate. In 2024 a sophisticated buyer could maintain a working mental model of what GPT-4, Claude 3.5, and Gemini 1.5 were good and bad at. In 2026, faced with GPT-5.3, 5.4, 5.5, 5.5-Instant, Rosalind, Codex, and GPT-Image-2; Claude 4.x and Haiku 4.5; and Gemini 3.x and the Flash variants, the same buyer is making decisions on incomplete data, and the model that gets picked is the one whose name comes up first in the developer's memory. Cadence has become a market lever.

GPT-Rosalind is the interesting one of the three. A specialised life-sciences model is exactly the surface where the citation-hallucination problem we covered in W09 has the highest stakes. The launch post is well-written and includes a worked example. The launch post does not include a benchmark on PubMed-grounded citation accuracy. We will run our own in the next two weeks and report.

On the policy side, the Anthropic-DoD case continued. The Charlotin database crossed 1,330 cases. The AI Lawsuit Tracker added a new copyright suit by a major publisher whose name was not yet public at the time of writing.

In 2024 a sophisticated buyer could keep a working mental model of three frontier models. In 2026, the buyer is choosing on incomplete data, and the model that wins is the one whose name comes up first.
Want to spot this in your own conversations?
CLEAR is the free six-lesson course on the patterns AI quietly runs on you.
Take the course →
Founder's note — Tonal warning: 'cadence as a market lever' is going to be a recurring theme. The independent wire's job is partly to slow the buyer down enough to choose well.
The Notebook
M4 · Domain accuracy
Rosalind
OPENAI'S NEW LIFE-SCIENCES SPECIALIST MODEL
A frontier model specialised for life sciences. The launch post is good. The launch post does not include a PubMed-grounded citation-accuracy benchmark. We'll be running our own and reporting; if you operate clinical or research workflows, build your own probe before relying on this in production. via OpenAI blog
M1 · Agent governance
SDK 2
THE NEXT EVOLUTION OF THE OPENAI AGENTS SDK
A capability release. Worth reading the change-log for the new guardrail hooks; if you ship agents, your error handling will need rebuilding around them within a quarter. via OpenAI blog
M2 · Voice synthesis
Flash TTS
GEMINI 3.1 FLASH TTS — THE NEXT GENERATION OF EXPRESSIVE SPEECH
A serious capability bump for voice. Worth knowing whether your downstream consumers can tell the difference between this generation and a human in conditions worse than a quiet room. Most cannot. via Google AI blog
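For readers acting on the Rosalind note above ("build your own probe"), here is a minimal sketch of what the first step of a citation probe could look like. It is not our methodology and not how LiveScope works. It assumes the model cites papers in the form "PMID: 12345", and it assumes you supply your own `exists` lookup (for instance a wrapper around a PubMed query); both the function names and the identifiers in the example are hypothetical placeholders.

```python
import re
from typing import Callable, List

# Assumption: the model cites papers as "PMID: 12345" or "PMID:12345".
PMID_PATTERN = re.compile(r"PMID:\s*(\d+)")


def extract_pmids(reply: str) -> List[str]:
    """Return every PMID-style identifier found in the model's reply."""
    return PMID_PATTERN.findall(reply)


def citation_accuracy(reply: str, exists: Callable[[str], bool]) -> float:
    """Fraction of cited identifiers that the caller-supplied lookup confirms.

    A reply with no citations is scored 0.0 here; you may prefer to
    flag that case separately.
    """
    pmids = extract_pmids(reply)
    if not pmids:
        return 0.0
    confirmed = sum(1 for pmid in pmids if exists(pmid))
    return confirmed / len(pmids)


# Placeholder identifiers for illustration only, not real citations.
KNOWN_IDS = {"100", "200"}
SAMPLE_REPLY = "See PMID: 100 and PMID:999 for details."
SAMPLE_SCORE = citation_accuracy(SAMPLE_REPLY, lambda pmid: pmid in KNOWN_IDS)
```

This only checks that a cited identifier exists. Whether the cited paper actually supports the claim is a separate, harder check that still needs a human or a second pass.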
Worth Your Time
OpenAI
Read the worked example, ignore the launch-language, build your own benchmark before relying on it.
OpenAI
Change-log piece. Read for the guardrail hooks.
Google AI
A voice-quality bump. Worth knowing where it sits on the indistinguishability axis.
Damien Charlotin
Now past 1,330. Five-to-six new cases a day, unchanged.
AI Lawsuit Tracker
New copyright suit landed this week. Tracker has the names.
From the workshop
LiveScope
See what the model is hiding.
LiveScope runs PubMed-grounded probes against any model in any browser. The benchmark the GPT-Rosalind launch post didn't include, you can run yourself in two minutes.
Install LiveScope →
The Probe · Test Yourself
A provider ships three significant launches in a 48-hour window. As a buyer evaluating one of them for a regulated workflow, which evaluation discipline best protects you against cadence-induced bias?
A. Pick the launch with the most coverage in trade press
B. Run your own benchmark on a fixed task you care about, against all three
C. Default to the latest release on the assumption it strictly dominates the older ones
D. Wait six months for community consensus
Answer: B. Run your own benchmark on a fixed task you care about, against all three.
A optimises for marketing. C is wrong on the assumption (newer models can regress on specific tasks). D is too slow for any working buyer. B is the discipline: a fixed task, run against all candidates, scored on the same rubric. The work is real and unglamorous; it is also the only thing that protects against the cadence pressure.
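The discipline in B (fixed task, all candidates, one rubric) can be sketched in a few lines. This is an illustration under stated assumptions, not a finished harness: a "model" here is any function that takes a prompt and returns text, the rubric is a simple required-phrase check, and the two stand-in models and the toy task are hypothetical. In practice you would swap in real API calls and a rubric that fits your workflow.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a model is any callable mapping a prompt to a reply.
Model = Callable[[str], str]
# One task = (prompt, phrases the reply must contain to score full marks).
Task = Tuple[str, List[str]]


def score_reply(reply: str, required: List[str]) -> float:
    """Rubric: fraction of required phrases present (case-insensitive)."""
    if not required:
        return 0.0
    text = reply.lower()
    hits = sum(1 for phrase in required if phrase.lower() in text)
    return hits / len(required)


def run_benchmark(models: Dict[str, Model], tasks: List[Task]) -> Dict[str, float]:
    """Run every model on the same tasks; return the mean score per model."""
    if not tasks:
        raise ValueError("at least one task is required")
    results = {}
    for name, model in models.items():
        scores = [score_reply(model(prompt), required) for prompt, required in tasks]
        results[name] = sum(scores) / len(scores)
    return results


# Hypothetical stand-ins; replace with calls to the models you are evaluating.
def newer_model(prompt: str) -> str:
    return "Paris."


def older_model(prompt: str) -> str:
    return "Paris, on the Seine."


TASKS = [("Name the capital of France and its river.", ["Paris", "Seine"])]
RESULTS = run_benchmark({"newer": newer_model, "older": older_model}, TASKS)
```

In this toy run the "newer" stand-in scores lower than the "older" one, which is the point made against option C: recency does not guarantee a better result on the task you care about.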
Reply and tell me what you've noticed. If you maintain a private "task benchmark" of your own — a question or workflow you score every model on — send me the rubric. I'm collecting examples for a buyer-side toolkit.
Free where it can be. Honest where it has to be.
— Three places to go from here —
Course
CLEAR
Six free lessons on the patterns AI runs on you.
Start →
Tool
LiveScope
Chrome extension that flags what AI cites without checking.
Install →
Read
The Agreement Trap
15-chapter book on living inside the exchange. £5.99 lifetime.
Read →