Anthropic, OpenAI, or Cursor model for your agent skills? 7 learnings from running 880 evals (including Opus 4.7)

Relevance: 8.4
Score breakdown: technical depth 9, novelty 8, actionability 9, community 6, strategic 7, personal 10

Scored daily by a customisable AI persona to surface the most relevant engineering leadership news.

Rigorous evaluation of AI models for agent skills, highly actionable and timely.

AI/ML dev.to
Summary

Claude Opus 4.7 tops the baseline (no-skill) leaderboard with an 80.5% native behavior rate. Across 880 evals on nine models, however, loading an agent skill consistently lifts performance by 11 to 23 percentage points, and weaker models gain the most: Haiku 4.5 improves by 23.1 points, which implies a baseline of about 61.2%. A cheap model with a skill (Haiku 4.5 at 84.3%) outperforms every unskilled model, including Opus 4.7, which suggests that skill selection now matters more than model tier for agentic coding tasks.
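The arithmetic behind that comparison can be sketched in a few lines of Python. This is only an illustration: the three figures come from the summary, while the function and constant names are hypothetical, and the implied baseline is derived by subtraction rather than quoted from the article.

```python
# Figures quoted in the summary; all names here are illustrative assumptions.
OPUS_47_BASELINE = 80.5      # Opus 4.7 without a skill (%)
HAIKU_45_WITH_SKILL = 84.3   # Haiku 4.5 with a skill (%)
HAIKU_45_LIFT = 23.1         # reported gain, in percentage points


def implied_baseline(with_skill: float, lift: float) -> float:
    """Recover the unskilled rate from the skilled rate and the reported lift."""
    return round(with_skill - lift, 1)


def margin(a: float, b: float) -> float:
    """Difference between two rates, in percentage points."""
    return round(a - b, 1)


haiku_baseline = implied_baseline(HAIKU_45_WITH_SKILL, HAIKU_45_LIFT)
lead_over_opus = margin(HAIKU_45_WITH_SKILL, OPUS_47_BASELINE)

print(f"Haiku 4.5 implied baseline: {haiku_baseline}%")
print(f"Skilled Haiku 4.5 vs unskilled Opus 4.7: {lead_over_opus:+.1f} points")
```

Under these assumptions, the skilled cheap model's lead over the best unskilled model is small compared with the lift the skill itself provides, which is the basis for the claim that skill selection matters more than model tier.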

Author

Tessl