Open-world evaluations for measuring frontier AI capabilities
CRUX is a new evaluation project for measuring frontier AI capabilities on long, messy open-world tasks, moving beyond traditional benchmarks.
A new paper investigates AI agent reliability by quantifying the gap between claimed capabilities and actual performance. The study proposes methods to measure this divergence and improve the robustness of agent systems.
AI will not automatically reduce legal services costs. The article applies the 'AI as Normal Technology' framework to the legal sector, questioning the assumption that AI automation will systematically drive down prices.
A critique of the relevance of Moravec's paradox, the famous claim that tasks easy for humans are hard for AI and vice versa. The article questions whether the principle is valid or useful in the current context.
The article positions AI as a normal technology rather than a revolutionary one. It challenges the dominant hype discourse and proposes a more nuanced view of the actual capabilities and limitations of current systems.
The article asks whether AI could slow science by creating a production-progress paradox: a rising volume of publications without a proportional improvement in quality or genuine scientific understanding.
The article challenges the notion that AGI represents a capability threshold triggering sudden impacts. It questions the staged progression model toward general intelligence.
An analysis of recent AI technology trends to assess whether progress is slowing. It examines current technological claims and their empirical foundations.
Analysis of 78 election deepfakes: political misinformation is not primarily an AI problem. Electoral manipulation issues predate the technology and cannot be solved by technical solutions alone.
The UK's liver transplant matching algorithm may systematically exclude younger patients. Seemingly minor technical decisions can have life-or-death effects.
A new benchmark measures AI's ability to automate computational reproducibility in science. The study assesses how AI models could improve the practice of reproducing scientific results.
AI companies are shifting from AGI rhetoric to building concrete products. The article identifies five major challenges in this transition: monetization, user integration, inference costs, technical differentiation, and regulatory compliance.
A critique of AI existential risk estimates presented as quantified probabilities. The article argues that speculation is laundered through pseudo-quantification to influence policy, without solid empirical grounding.
A critical article on AI agent evaluation. It questions current benchmarking methods and proposes rethinking what makes an AI agent meaningful.
The article challenges scaling myths in AI, asserting that model growth will hit limits. The timing of this saturation remains uncertain.
Scientists must treat AI as a tool, not an infallible oracle. AI hype leads to flawed research that fuels more hype, creating a vicious cycle.
Traditional AI leaderboards are becoming obsolete as cost-performance tradeoffs grow more complex. The article advocates replacing leaderboards with Pareto curves to evaluate AI agents, showing how $2,000 in spending reveals the real tradeoffs between efficiency and resources.
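
To make the Pareto-curve idea concrete, here is a minimal Python sketch of how a cost-accuracy frontier could be computed. It is not taken from the article: the `AgentResult` type, the `pareto_frontier` function, and all agent names and numbers are hypothetical, chosen only to illustrate the technique. An agent is kept on the frontier only if no other agent is at least as cheap and at least as accurate.

```python
from typing import NamedTuple


class AgentResult(NamedTuple):
    """One evaluated agent (hypothetical structure for illustration)."""
    name: str
    cost_usd: float   # total spend to run the evaluation
    accuracy: float   # fraction of tasks solved, in [0, 1]


def pareto_frontier(results: list[AgentResult]) -> list[AgentResult]:
    """Return the agents not dominated on (lower cost, higher accuracy)."""
    # Sort by ascending cost; at equal cost, put the more accurate agent first.
    ordered = sorted(results, key=lambda r: (r.cost_usd, -r.accuracy))
    frontier: list[AgentResult] = []
    best_accuracy = float("-inf")
    for r in ordered:
        # A costlier agent earns its place only by beating every cheaper one.
        if r.accuracy > best_accuracy:
            frontier.append(r)
            best_accuracy = r.accuracy
    return frontier


# Made-up numbers, for illustration only.
agents = [
    AgentResult("agent_a", 1.0, 0.60),
    AgentResult("agent_b", 5.0, 0.58),   # dominated by agent_a
    AgentResult("agent_c", 8.0, 0.75),
    AgentResult("agent_d", 20.0, 0.75),  # dominated by agent_c
    AgentResult("agent_e", 50.0, 0.80),
]

for r in pareto_frontier(agents):
    print(f"{r.name}: ${r.cost_usd:.2f}, accuracy {r.accuracy:.2f}")
```

A single-number leaderboard would rank `agent_e` first; the frontier instead shows what each extra dollar buys, and it drops agents that cost more without being more accurate.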