METR (@metr.org) on Bluesky

1.4K followers · 1 following · 124 posts

METR is a research nonprofit that builds evaluations to empirically test AI systems for capabilities that could threaten catastrophic harm to society.

Pinned posts

METR · 2d

We’re correcting a mistake in our modeling that inflated recent 50%-time-horizons by 10-20% (and reduced 80%-horizons). We inappropriately penalized steepness in task-length→success curve fits. This most affects the oldest and newest models, whose fits are less data-constrained.

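The curve-fit correction above can be illustrated with a toy logistic model. Everything here is a sketch, not METR's actual code: the function names, the exact parameterization, and the numbers are assumptions. The idea is that success probability falls off logistically in log2 task length, the 50% horizon is where the curve crosses 0.5, and the steepness parameter controls how far below it the 80% horizon sits.

```python
import math

def p_success(t_minutes, h50, beta):
    """Toy logistic success curve in log2 task length.

    h50: 50%-time-horizon in minutes; beta: steepness of the fall-off.
    """
    return 1.0 / (1.0 + math.exp(beta * (math.log2(t_minutes) - math.log2(h50))))

def horizon(p, h50, beta):
    """Task length (minutes) at which success probability equals p."""
    return h50 * 2.0 ** (math.log((1.0 - p) / p) / beta)

# At t = h50 the model succeeds exactly half the time:
print(p_success(60.0, h50=60.0, beta=1.0))   # 0.5

# The 80% horizon sits below the 50% horizon, and a shallower fit
# (smaller beta) pushes it further down:
print(horizon(0.8, h50=60.0, beta=1.0))      # ~23 min
print(horizon(0.8, h50=60.0, beta=0.5))      # ~9 min
```

This makes the post's point concrete: a fit biased toward shallower (smaller-beta) curves drags the 80% horizon down while distorting the 50% estimate.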

Top posts

METR · Jul 10

We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers. The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.

METR · Mar 19

When will AI systems be able to carry out long projects independently? In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months.

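As a quick arithmetic check on the trend above (assuming clean exponential growth with a 7-month doubling time, which is a simplification of the reported fit):

```python
DOUBLING_MONTHS = 7  # doubling time reported in the post

def growth_factor(months):
    """Multiplier on task length after `months` of 7-month doubling."""
    return 2 ** (months / DOUBLING_MONTHS)

print(growth_factor(7))    # 2.0  (one doubling)
print(growth_factor(35))   # 32.0 (five doublings in just under 3 years)
```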
METR · Jul 30

We have open-sourced anonymized data and core analysis code for our developer productivity RCT. The paper is also live on arXiv, with two new sections: one discussing alternative uncertainty estimation methods, and one adding a 'bias from developer recruitment' factor whose effect on the slowdown estimate is unclear.


Latest posts

METR · Feb 24

Since early 2025, we've been studying how AI tools impact productivity among developers. Previously, we found a 20% slowdown. That finding is now outdated. Speedups now seem likely, but changes in developer behavior make our new results unreliable. We’re working to address this.

METR · Feb 23

We estimate that GPT-5.3-Codex with reasoning effort `high` (not `xhigh`) has a 50%-time-horizon of around 6.5 hours (95% CI of 3 hrs to 17 hrs) on our suite of software tasks. OpenAI provided API access for this evaluation.

METR · Feb 21

We estimate that Claude Opus 4.6 has a 50%-time-horizon of around 14.5 hours (95% CI of 6 hrs to 98 hrs) on software tasks. While this is the highest point estimate we’ve reported, this measurement is extremely noisy because our current task suite is nearly saturated.

METR · Feb 4

We estimate that GPT-5.2 with `high` (not `xhigh`) reasoning effort has a 50%-time-horizon of around 6.6 hrs (95% CI of 3 hr 20 min to 17 hr 30 min) on our expanded suite of software tasks. This is the highest estimate for a time horizon measurement we have reported to date.

METR · Feb 3

We’ve started to measure time horizons for recent models using our updated methodology. On this expanded suite of software tasks, we estimate that Gemini 3 Pro has a 50%-time-horizon of around 4 hrs (95% CI of 2 hr 10 min to 7 hr 20 min).

METR · Feb 3

We’re updating the way we measure model time horizons on software tasks (TH 1.0→1.1). The updated methodology incorporates more of the tasks from HCAST, expanding our total from 170 to 228. This produces tighter estimates, especially at longer horizons.

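The intuition for why more tasks produce tighter estimates can be shown with a back-of-the-envelope standard-error calculation. This is a simplification that assumes independent task outcomes and a plain success-rate estimate, not METR's actual analysis:

```python
import math

def se_success_rate(n_tasks, p=0.5):
    """Standard error of a success-rate estimate over n_tasks i.i.d. tasks."""
    return math.sqrt(p * (1 - p) / n_tasks)

print(se_success_rate(170))  # ~0.038 with the old 170-task suite
print(se_success_rate(228))  # ~0.033 with the expanded 228-task suite
```

The 1/sqrt(n) scaling means the 170→228 expansion buys roughly a 14% reduction in uncertainty under these assumptions.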
METR · Jan 23

How well can AI-based monitoring detect when an agent is covertly pursuing a side objective? In early work, we find a clear trend: more capable models (as measured by time horizon) are better at detecting covert behavior.

METR · Sep 2

We estimate that Claude Opus 4.1 has a 50%-time-horizon of around 1 hr 45 min (95% confidence interval of 50 to 195 minutes) on our agentic multi-step software engineering tasks. This estimate is lower than the current highest time-horizon point estimate of around 2 hr 15 min.

METR · Aug 13

We tested how autonomous AI agents perform on real software tasks from our recent developer productivity RCT. We found a gap between algorithmic scoring and real-world usability that may help explain why AI benchmarks feel disconnected from reality.

METR · Aug 12

Before publishing our recent developer productivity RCT, we thought hard about how to accurately and clearly communicate our results. In a new blog post, we outline some of our key considerations regarding scientific integrity and communication.
