AI Weekly #2 — medicine, agents, and an attention bottleneck
This week: AI closes in on clinical-grade diagnosis, agentic security becomes a real discipline, and a startup claims to have fixed the quadratic cost of transformer attention.
The big story this week isn’t a single release; it’s a pattern. AI systems are quietly crossing thresholds in high-stakes domains: rare disease diagnosis, drug synthesis, and primary care management. Meanwhile, the tooling and safety work needed to actually deploy agents is catching up, with DeepMind publishing a concrete control roadmap and Hugging Face releasing a framework for evaluating open models on real-world agentic tooling. One more thing worth watching: a startup just out of stealth says it has solved the quadratic attention bottleneck that has constrained LLMs since day one.
OpenAI reasoning model cracks 18 previously unsolved rare disease cases
Researchers used an OpenAI reasoning model to identify 18 new diagnoses in pediatric cases that had stumped clinicians. The model worked on cases involving rare genetic diseases, suggesting reasoning-capable AI can function as a useful second opinion in diagnostically difficult scenarios. No product launch here — this is a research collaboration with published outcomes.
Why it matters: This is the kind of concrete, measurable clinical result that moves AI from ‘interesting demo’ to ‘worth integrating into a workflow.’ If you’re building in health-tech, the methodology here is worth studying. (OpenAI News)
Google’s AMIE matches primary care physicians in disease management, per Nature study
A paper published in Nature shows Google’s AMIE conversational AI system performing comparably to primary care physicians on complex, multi-condition disease management tasks. This is a peer-reviewed result, not an internal benchmark, which raises the bar for credibility. The study focuses on management quality rather than diagnosis alone.
Why it matters: A Nature-published result is a different class of claim than a blog-post benchmark. This, combined with the OpenAI rare-disease work, signals that clinical AI is entering a more rigorous evidence phase — which has direct implications for regulation and liability. (Google AI Blog)
DeepMind publishes an AI Control Roadmap for securing internal agent systems
DeepMind released details on its AI Control Roadmap, an internal framework for securing systems where AI agents have access to sensitive infrastructure. The approach combines traditional access controls with real-time behavioral monitoring. This is a practical engineering document, not an abstract safety manifesto.
Why it matters: As agentic systems move from toy demos to production, ‘how do we secure them’ stops being theoretical. DeepMind’s published approach is a concrete starting point for any team building agents with real system access. (DeepMind Blog)
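For readers who want a feel for what "access controls plus behavioral monitoring" can look like in code, here is a minimal sketch. It is not DeepMind's design: the ToolGate class, its parameters, and the agent and tool names are all hypothetical, and the "monitoring" is reduced to a simple sliding-window call counter.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToolGate:
    """Hypothetical gate in front of an agent's tool calls (illustration only)."""

    allowed: dict            # agent name -> set of tool names it may call
    max_calls: int = 3       # calls tolerated per window before flagging
    window_s: float = 60.0   # sliding window length in seconds
    _recent: dict = field(default_factory=dict)

    def check(self, agent: str, tool: str, now: float) -> str:
        # Layer 1: static access control. Unknown agents or tools are denied.
        if tool not in self.allowed.get(agent, set()):
            return "deny"

        # Layer 2: behavioral monitoring. Track recent permitted calls and
        # flag the agent when it exceeds its usual call rate.
        calls = self._recent.setdefault(agent, deque())
        while calls and now - calls[0] > self.window_s:
            calls.popleft()
        calls.append(now)
        return "flag" if len(calls) > self.max_calls else "allow"


if __name__ == "__main__":
    gate = ToolGate(allowed={"build-bot": {"read_logs"}}, max_calls=2, window_s=10)
    for t, tool in [(0, "delete_db"), (0, "read_logs"), (1, "read_logs"), (2, "read_logs")]:
        print(t, tool, gate.check("build-bot", tool, now=t))
```

The point of the two layers is that a permitted call can still be suspicious: the allowlist answers "may this agent do this at all?", while the counter answers "is it doing it in an unusual way?". A production system would log and escalate flagged calls rather than just return a string.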
Subquadratic claims to fix the attention scaling bottleneck that has constrained LLMs since 2017
Miami-based startup Subquadratic came out of stealth claiming to have solved the quadratic complexity bottleneck in transformer attention, the scaling problem that makes long contexts expensive: compute for standard attention grows with the square of the context length. MIT Technology Review reports the company is now sharing technical evidence after initial skepticism, though independent verification is still limited. If the claim holds, long-context inference would get substantially cheaper and practical context lengths could grow.
Why it matters: Quadratic attention is a genuine architectural constraint, not marketing noise. Any credible solution would materially affect how you architect RAG pipelines, long-context inference, and on-device models — worth tracking closely even before full peer review. (MIT Technology Review AI)
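To make "quadratic" concrete, here is a back-of-the-envelope sketch of how standard self-attention cost grows with context length. The model width and head count are illustrative assumptions, not the configuration of any particular model, and the estimate covers only the two n-by-n matrix products in one layer (scores and weighted values), ignoring projections, the MLP, and softmax. It says nothing about how Subquadratic's method works.

```python
def attention_cost(seq_len: int, d_model: int = 4096, n_heads: int = 32) -> dict:
    """Rough per-layer cost of standard (dense) self-attention.

    d_model and n_heads are assumed, illustrative values.
    """
    d_head = d_model // n_heads
    # Each head builds a seq_len x seq_len score matrix.
    score_entries = n_heads * seq_len * seq_len
    # Q @ K^T costs about 2 * n^2 * d_head FLOPs per head; applying the
    # attention weights to V costs about the same, so 4 * n^2 * d_head in total.
    flops = n_heads * 4 * seq_len * seq_len * d_head
    return {"score_entries": score_entries, "flops": flops}


if __name__ == "__main__":
    for n in (4_096, 8_192, 16_384, 32_768):
        cost = attention_cost(n)
        print(f"n={n:>6}  score entries={cost['score_entries']:.2e}  FLOPs={cost['flops']:.2e}")
```

Each doubling of the context length quadruples both numbers. Memory-efficient kernels such as FlashAttention avoid storing the full score matrix, but the FLOP count of exact attention still grows quadratically, which is why a genuine subquadratic method would matter.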
GLM-5.2 arrives as a serious open-weights text model for long-horizon tasks
Z.ai released GLM-5.2, an open-weights model built specifically for long-horizon, multi-step tasks. Simon Willison called it ‘probably the most powerful text-only open weights LLM’ currently available, a notable claim given the competitive field. The model is available on Hugging Face.
Why it matters: If you’re evaluating open-weights models for agentic pipelines or tasks requiring sustained reasoning over long contexts, GLM-5.2 belongs on your shortlist. Text-only focus means no multimodal overhead when you don’t need it. (Simon Willison)
Hugging Face releases benchmark for evaluating open models on your own agentic tooling
Hugging Face published a new evaluation framework — ‘Is it agentic enough?’ — for benchmarking open models against custom tool configurations rather than standardized toy environments. The framing addresses a real gap: most existing agentic benchmarks don’t reflect the specific APIs and tools a production system actually uses. The post includes methodology and runnable examples.
Why it matters: Standard agentic benchmarks tell you how a model does in a lab. This framework is designed to tell you how it does with your tools, which is the number that actually matters before you ship. (Hugging Face Blog)
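The underlying idea, scoring a model on whether it picks the right call against your own tools, can be sketched in a few lines. This is not the Hugging Face framework or its API: the function names, the tool names, and the stub "model" below are all hypothetical stand-ins, and real evaluations usually need fuzzier matching than exact equality.

```python
from typing import Callable

# A tool call is represented as (tool_name, arguments). Assumed shape, for illustration.
ToolCall = tuple


def score_tool_calls(model: Callable[[str], ToolCall], cases: list) -> float:
    """Fraction of prompts for which the model emits exactly the expected tool call."""
    if not cases:
        return 0.0
    hits = sum(1 for prompt, expected in cases if model(prompt) == expected)
    return hits / len(cases)


def keyword_stub(prompt: str) -> ToolCall:
    """Stand-in for a real model: routes on a keyword. Hypothetical tools."""
    if "refund" in prompt.lower():
        return ("issue_refund", {"order_id": "A1"})
    return ("search_orders", {"query": prompt})


# Test cases written against *your* tools, not a generic benchmark environment.
CASES = [
    ("Refund order A1", ("issue_refund", {"order_id": "A1"})),
    ("Find orders from May", ("search_orders", {"query": "Find orders from May"})),
    ("Cancel order B2", ("cancel_order", {"order_id": "B2"})),
]

if __name__ == "__main__":
    print(f"tool-call accuracy: {score_tool_calls(keyword_stub, CASES):.2f}")
```

Swapping keyword_stub for a function that calls a real model and parses its tool-call output turns this into a crude regression test for your own tool configuration.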