Research & Benchmarks

News & Releases Research & Benchmarks

AMIE made it into Nature — the wall peer review can't cross

AMIE's Nature peer review: what the longitudinal chronic disease evaluation measured and what it explicitly excluded.

Sungjae Lee

Jul 02, 2026

Claude / Anthropic News & Releases Research & Benchmarks

A 5× expertise gap separates novice from proficient AI pairing

398k Claude Code sessions: occupation outpredicts programming skill, expertise multiplies return 5×. Caveats apply.

Sungjae Lee

Jul 01, 2026

Claude / Anthropic News & Releases Research & Benchmarks

Claude activations, now legible. Fidelity: 0.6–0.8.

Anthropic's NLA: verbalizer/reconstructor loop, 0.6–0.8 FVE, Claude test-awareness, and open checkpoints for Qwen2.5, Gemma-3, Llama-3.3.

Sungjae Lee

Jun 28, 2026

Google / Gemini News & Releases Research & Benchmarks

94% vs 67%: how medical AI fared against physicians in Nature

Two-agent Gemini system vs physicians on longitudinal disease management: Nature results, RxQA benchmark, and the critical gaps in what the trial actually tested.

Sungjae Lee

Jun 27, 2026

Google / Gemini News & Releases Research & Benchmarks

RL lost in Borg. DeepMind's evolved bin-packer hit 0.7%.

AlphaEvolve evolved a Borg bin-packing heuristic that beat RL and recovers 0.7% of Google's worldwide fleet capacity.

Sungjae Lee

Jun 26, 2026

News & Releases Research & Benchmarks

DVD-JEPA nails JEPA pedagogy — the pioneer tag, less so

DVD-JEPA: 32-d latent, ~10s CPU training, zero infrastructure. How it compares to JEPA-WMs, I-JEPA, and EB-JEPA.

Sungjae Lee

Jun 26, 2026

News & Releases Research & Benchmarks

For Analyst Deliverables, 3% Is Where the Best AI Tops Out

91 analyst tasks, 4 private scenarios. Claude Fable 5 leads at 1586 Elo but fully passes just 3% of criteria sets.

Sungjae Lee

Jun 25, 2026

News & Releases Research & Benchmarks

HiRO-ACE cuts climate emulation to 45 minutes — with a ceiling

Ai2's HiRO-ACE: ACE2S stochastic emulator + HiRO 32× diffusion downscaler. 3 km precipitation in 45 min. Apache-2.0.

Sungjae Lee

Jun 25, 2026

Google / Gemini News & Releases Research & Benchmarks

AMIE Lands in Nature: The Talker-Thinker Split Worked

AMIE reaches disease management in Nature 2026. Two-LLM design, RxQA benchmark, and 100-scenario PCP comparison covered.

Sungjae Lee

Jun 25, 2026

News & Releases Research & Benchmarks

DeepSWE's top entry was gaming the grader

Datacurve's DeepSWE v1.1: scoreboard, grading loophole fix, and cost-efficiency breakdown across 9 AI coding agents.

Sungjae Lee

Jun 25, 2026

Claude / Anthropic News & Releases Research & Benchmarks

Novices quit. Experts adapt. 400k Claude Code sessions say so.

Anthropic's 398k-session study finds Claude Code widens expertise advantage — 15% novice success vs. 33% for experts.

Sungjae Lee

Jun 23, 2026

OpenAI / Codex News & Releases Research & Benchmarks

An 80-year conjecture fell to AI. What was actually proved?

Unnamed OpenAI model disproved Erdős's unit-distance conjecture. Sawin's n^1.014 is the first exponent gain in 80 years.

Sungjae Lee

Jun 23, 2026

News & Releases Research & Benchmarks

DVD-JEPA in 500 lines — one claim that breaks

500-line MIT-licensed JEPA demo that trains in 10s on CPU. The 'debut' claim breaks on contact with V-JEPA 2.

Sungjae Lee

Jun 22, 2026

News & Releases Research & Benchmarks

Memorized by AI or hallucinated — a site lets you check which

Queries frozen AI weights, no live crawl, to surface how confidently each model recalls you. Built by two ex-OpenAI engineers and launched June 2026.

Sungjae Lee

Jun 22, 2026

News & Releases Research & Benchmarks

Thirteen chatbots know your biography. F1 reveals how reliably

In the Weights probes 13 chatbots for cold biographical recall — F1 ceiling, LMP2 context, and GDPR implications.

Sungjae Lee

Jun 22, 2026

Claude / Anthropic News & Releases Research & Benchmarks

Blackmail dropped from 96% to 0%. Here's the asterisk.

May 2026 alignment paper: how Anthropic cut Claude's blackmail rate from 96% to 0% and what the limits are.

Sungjae Lee

Jun 21, 2026

Claude / Anthropic News & Releases Research & Benchmarks

Activations into English: 4× better at surfacing hidden goals

Anthropic's NLAs map activations to English, exposing hidden goals 4× more than SAEs — and where they confabulate.

Sungjae Lee

Jun 21, 2026

News & Releases Research & Benchmarks

Chronic management AI vs PCPs: 94% precision, simulated only

AMIE matched PCPs on 15 chronic management axes (94% vs 67% precision) in a Nature 2026 simulation. RxQA, Dialogue+Mx split, and key caveats.

Sungjae Lee

Jun 21, 2026

Google / Gemini News & Releases Research & Benchmarks

AlphaEvolve in Borg before the paper: the concrete wins

Evolutionary code optimization from DeepMind — in Borg since 2024, 23% TPU speedup, Strassen improved.

Sungjae Lee

Jun 21, 2026

Google / Gemini News & Releases Research & Benchmarks

AMIE's chronic care paper is strong. The fine print is longer.

AMIE extends to longitudinal chronic care: 627 guidelines, 88% care plan quality vs 74% PCPs, drug knowledge ceiling at 73%.

Sungjae Lee

Jun 20, 2026

News & Releases Research & Benchmarks

94 vs 67: AMIE vs PCPs in a blinded prescription OSCE

Google's AMIE scored 94% vs 67% on prescription precision in a blinded OSCE. New RxQA benchmark released. Nature 2026.

Sungjae Lee

Jun 20, 2026

Google / Gemini News & Releases Research & Benchmarks

Nature just certified AMIE at 94% precision — in a simulation.

AMIE's Nature paper: chronic care via two-agent design, 94% vs 67% treatment precision, simulation trial only.

Sungjae Lee

Jun 19, 2026

News & Releases Research & Benchmarks

APEX banking and consulting: 76% failed. Here's what broke.

APEX-Agents and RLI both put autonomous delivery in law, banking, and consulting under 25% on the first attempt.

Sungjae Lee

Jun 19, 2026

News & Releases Research & Benchmarks

Medical AI answers flip 25.5% of the time on re-test

Nature's 83-paper meta-analysis finds AI on par with non-expert doctors; 25.5% of answers flip on re-test, expiring every evaluation snapshot.

Sungjae Lee

Jun 19, 2026

Google / Gemini News & Releases Research & Benchmarks

Google's Borg runs evolved code. One year in, the 0.7% holds.

DeepMind's one-year AlphaEvolve update: verified production wins, follow-on papers, and why no public API exists yet.

Sungjae Lee

Jun 18, 2026

Google / Gemini News & Releases Research & Benchmarks

Gemini's ERA Model Is Now Outrunning CDC Disease Forecasts

Google's I/O 2026 AI research suite: literature triage, hypothesis tournaments, and ERA outperforming CDC forecasts.

Sungjae Lee

May 30, 2026

News & Releases Research & Benchmarks

DiffusionBlocks Cuts Training Memory B× Without Accuracy Loss

DiffusionBlocks trains one residual block per step, reducing activation memory B× with competitive or better accuracy.

Sungjae Lee

May 30, 2026

Google / Gemini News & Releases Research & Benchmarks

What Gemini's Three I/O 2026 Research Tools Actually Do

Three experimental AI research tools launched at I/O 2026. What Literature Insights, Co-Scientist, and AlphaEvolve each actually do.

Sungjae Lee

May 29, 2026

MCP News & Releases Research & Benchmarks

Do Grok Build's SWE-Bench Claims Actually Hold Up?

xAI shipped its terminal coding agent on May 14, 2026. Here's what the CLI actually does, where the benchmark numbers hold, and what $299/month buys.

Sungjae Lee

May 28, 2026

OpenAI / Codex News & Releases Research & Benchmarks

A Reasoning Model Just Broke an 80-Year-Old Conjecture

OpenAI's reasoning model disproved an 80-year-old geometry conjecture — verified by a nine-mathematician team including a Fields Medalist.

Sungjae Lee

May 28, 2026

Google / Gemini News & Releases Research & Benchmarks

Gemini ERA's Real Performance, Beyond CDC Forecasts

Google's I/O 2026 AI research suite: literature triage, hypothesis tournaments, and ERA outperforming CDC forecasts.

Sungjae Lee

May 30, 2026

News & Releases Research & Benchmarks

Why Accuracy Holds When Training Only One Block at a Time

DiffusionBlocks trains one residual block per step, reducing activation memory B× with competitive or better accuracy.

Sungjae Lee

May 30, 2026

Google / Gemini News & Releases Research & Benchmarks

AlphaEvolve and Co-Scientist: Do They Work as Announced?

Three experimental AI research tools launched at I/O 2026. What Literature Insights, Co-Scientist, and AlphaEvolve each actually do.

Sungjae Lee

May 29, 2026
