METR (@metr.org) profile banner on Bluesky

METR

@metr.org

1.4K followers
1 following
124 posts

METR is a research nonprofit that builds evaluations to empirically test AI systems for capabilities that could threaten catastrophic harm to society.

Pinned posts

METR·May 19

Could an AI company lose control of its own agents? To find out, Anthropic, Google, Meta, and OpenAI let us (1) test their best internal models with CoT access, (2) review non-public info about capabilities, alignment, and control. The result: our first Frontier Risk Report.


Top posts

METR·Jul 10

We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers. The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.

METR·Mar 19

When will AI systems be able to carry out long projects independently? In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months.
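As a rough illustration of what a 7-month doubling time implies, here is a minimal sketch that assumes the trend stays perfectly exponential. The function names and the starting horizon of 1 hour are hypothetical, chosen only for the arithmetic; this is not METR's fitting code.

```python
import math

# Assumption: a constant doubling time, as in the post's "about every 7 months".
DOUBLING_MONTHS = 7.0


def projected_horizon(start_hours: float, months_ahead: float,
                      doubling_months: float = DOUBLING_MONTHS) -> float:
    """Extrapolate a task-length horizon under pure exponential growth."""
    return start_hours * 2 ** (months_ahead / doubling_months)


def months_to_reach(start_hours: float, target_hours: float,
                    doubling_months: float = DOUBLING_MONTHS) -> float:
    """Months needed to grow from start_hours to target_hours."""
    return doubling_months * math.log2(target_hours / start_hours)


if __name__ == "__main__":
    # Hypothetical starting point of 1 hour, for illustration only.
    print(projected_horizon(1.0, 14.0))  # two doublings
    print(months_to_reach(1.0, 8.0))     # three doublings
```

Under these assumptions, two doublings (14 months) turn a 1-hour horizon into 4 hours, and reaching 8 hours takes three doublings, or 21 months. Any real extrapolation would also need to carry the uncertainty in the doubling time itself.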

METR·Jul 30

We have open-sourced anonymized data and core analysis code for our developer productivity RCT. The paper is also live on arXiv, with two new sections: one discussing alternative uncertainty estimation methods, and one on a new 'bias from developer recruitment' factor that has an unclear effect on the slowdown.


Latest posts

METR·May 11

We surveyed 349 technical researchers, engineers, and managers (in February–April 2026) about how they use AI tools at work. On average, participants self-report that AI use made their work 1.6–2.1x more valuable, and that this multiplier will grow over time.

METR·May 9

We reviewed a section of Anthropic’s February 2026 Risk Report focused on automated R&D risk from Opus 4.6. While we take issue with the adequacy of evidence the report provides, we agree with Anthropic about the overall level of risk & remain excited to pilot reviews like these.

METR·May 8

We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

METR·Apr 10

We co-developed MirrorCode with @epochai.bsky.social to test AI on extremely long-horizon blackbox software reimplementation tasks. We found that recent public models are able to fully implement at least some programs we estimate would take humans weeks or months to implement.

METR·Apr 10

We ran GPT-5.4 (xhigh) on our tasks. Its time-horizon depends greatly on our treatment of reward hacks: the point estimate would be 5.7hrs (95% CI of 3hrs to 13.5hrs) under our standard methodology, but 13hrs (95% CI of 5hrs to 74hrs) if we allow reward hacks.

METR·Mar 5

We’re correcting a mistake in our modeling that inflated recent 50%-time horizons by 10-20% (and reduced 80%-horizons). We inappropriately penalized steepness in task-length→success curve fits. This most affects the oldest and newest models, whose fits are less data-constrained.

METR·Feb 24

Since early 2025, we've been studying how AI tools impact productivity among developers. Previously, we found a 20% slowdown. That finding is now outdated. Speedups now seem likely, but changes in developer behavior make our new results unreliable. We’re working to address this.

METR·Feb 23

We estimate that GPT-5.3-Codex with reasoning effort `high` (not `xhigh`) has a 50%-time-horizon of around 6.5 hours (95% CI of 3 hrs to 17 hrs) on our suite of software tasks. OpenAI provided API access for this evaluation.

METR·Feb 21

We estimate that Claude Opus 4.6 has a 50%-time-horizon of around 14.5 hours (95% CI of 6 hrs to 98 hrs) on software tasks. While this is the highest point estimate we’ve reported, this measurement is extremely noisy because our current task suite is nearly saturated.

METR·Feb 4

We estimate that GPT-5.2 with `high` (not `xhigh`) reasoning effort has a 50%-time-horizon of around 6.6 hrs (95% CI of 3 hr 20 min to 17 hr 30 min) on our expanded suite of software tasks. This is the highest estimate for a time horizon measurement we have reported to date.
