
5531 articles
Reddit r/MachineLearning

The famous METR AI time horizons graph contains numerous severe errors [D]

Nathan Witkin (NYU Stern) harshly critiques METR's AI time horizons graph. The alleged errors include human baselines that were estimated rather than measured, hourly-paid benchmarkers with an incentive to work slowly, a sample biased toward the authors' peers, and a failure to account for the familiarity advantage (5-18x faster). Witkin concludes that the graph contains too many compounding errors to be salvaged.

Benchmarks, Evals, AI safety
SIG 75 · HYP 45

Reddit r/MachineLearning

We gave an LLM a structural graph of a codebase before exploring. It used 54% MORE context than without one. Paper + explanation inside [R]

Controlled study on a TypeScript codebase (25 sections, 3,250 files): an LLM (Kimi K2.6) equipped with a structural graph (Blueprint: Universal Ctags + ast-grep + BM25) consumed 54% more input tokens (63,541 vs. 41,327) but explored deeper (6 turns vs. 5). The graph itself costs ~6,500 tokens and increases the model's navigational confidence.
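
A quick check of the headline arithmetic, using only the figures quoted in the summary. The per-turn split is an illustration added here, not a number from the post, and it assumes the turn counts (6 vs. 5) are averages over the same runs as the token totals.

```python
# Figures as quoted in the summary above.
tokens_with_graph = 63_541
tokens_without_graph = 41_327
turns_with_graph = 6
turns_without_graph = 5

# Relative increase in total input tokens.
extra_tokens = tokens_with_graph - tokens_without_graph
relative_increase = extra_tokens / tokens_without_graph

# Assumption: dividing totals by turn counts gives a rough per-turn average.
per_turn_with = tokens_with_graph / turns_with_graph
per_turn_without = tokens_without_graph / turns_without_graph
per_turn_increase = per_turn_with / per_turn_without - 1

print(f"extra input tokens: {extra_tokens}")
print(f"relative increase in total: {relative_increase:.1%}")
print(f"relative increase per turn: {per_turn_increase:.1%}")
```

Under that assumption, the per-turn increase is noticeably smaller than the total increase, which suggests that part of the extra consumption comes from the additional turn rather than from heavier turns.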

Code generation, RAG, Benchmarks
SIG 75 · HYP 25

GitHub Trending

github/copilot-sdk

GitHub releases a multi-platform SDK for integrating Copilot Agent into third-party apps and services. Enables developers to access Copilot's AI capabilities through a standardized API.

AI Agents, Code generation, Tools
SIG 75 · HYP 25

GitHub Trending

microsoft/agent-governance-toolkit

Microsoft releases a governance toolkit for autonomous AI agents. It includes policy enforcement, zero-trust identity, execution sandboxing, and reliability engineering, and covers all 10 risks in the OWASP Agentic Top 10.

AI Agents, AI safety, Tools
SIG 75 · HYP 25

GitHub Trending

google-research/timesfm

TimesFM is a pretrained foundation model developed by Google Research for time-series forecasting. The GitHub repository provides an open-source implementation of this specialized model.

DeepMind, Open source, Benchmarks
SIG 75 · HYP 20

Simon Willison

FTC to Require Cox Media Group, Two Other Firms to Pay Nearly $1 Million to Settle Charges They Deceived Customers About “Active Listening” AI-Powered Marketing Service

FTC requires Cox Media Group and two other firms to pay nearly $1 million to settle charges they deceived customers about an "Active Listening" AI marketing service. The service claimed to listen to conversations via smart devices for ad targeting, but actually used no voice data at all.

Regulation, AI safety, Business
SIG 75 · HYP 25

arXiv cs.CL

FlyRoute: Self-Evolving Agent Profiling via Data Flywheel for Adaptive Task Routing

FlyRoute is a self-evolving agent profiling framework that improves enterprise query routing. Using a data flywheel mechanism, it collects capability evidence from real traffic, distills it into learned agent descriptions, and injects those descriptions into an LLM router together with BM25-retrieved examples of past successes. On a proprietary dataset, routing accuracy improves from 72.57% (zero-shot) to 89.83% after 7,211 labeled queries.

AI Agents, Multi-agent, Prompt engineering
SIG 75 · HYP 25

arXiv cs.LG

Calibration, Uncertainty Communication, and Deployment Readiness in CKD Risk Prediction: A Framework Evaluation Study

Comparative study of 5 classifiers (logistic regression, random forest, XGBoost, SVM, naive Bayes) for chronic kidney disease risk prediction. All achieve AUROC 1.00 on internal data (UCI, 400 patients) but collapse on external MIMIC-IV data (AUROC 0.48-0.58). Calibration and conformal coverage also degrade severely. No model meets clinical deployment criteria.
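
For readers unfamiliar with the metric, the sketch below illustrates why an external AUROC near 0.5 amounts to chance-level ranking. The `auroc` helper and the toy labels and scores are hypothetical, made up for illustration; they are not the paper's data or code.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: same labels, two different sets of model scores.
labels = [1, 1, 1, 0, 0, 0]
internal_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]  # positives always rank higher
external_scores = [0.9, 0.2, 0.5, 0.8, 0.3, 0.5]  # ranking is no better than chance

internal_auroc = auroc(labels, internal_scores)
external_auroc = auroc(labels, external_scores)
print(internal_auroc, external_auroc)
```

A perfect internal score paired with a near-0.5 external score, as reported above, is the pattern one would expect when a model has latched onto dataset-specific artifacts rather than transferable signal.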

Evals, AI safety
SIG 75 · HYP 15