5477 articles
arXiv cs.AI·

From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

Systematic study of the full lifecycle of model-generated agent skills: extraction, consumption, and transfer. The evaluation framework spans 5 agentic task domains. Findings: skills are beneficial on average but exhibit non-trivial negative transfer, and extractor/consumer performance is independent of model scale. Introduces a meta-skill to improve skill quality and reduce negative transfer.

AI Agents · Multi-agent · Reinforcement learning
SIG 78 · HYP 15
arXiv cs.CL·

Query-Adaptive Semantic Chunking for Retrieval-Augmented Generation: A Dynamic Strategy with Contextual Window Expansion

QASC (Query-Adaptive Semantic Chunking) improves document segmentation for RAG by integrating the user query at the chunking stage. Using cosine similarity scoring, contextual window expansion, and chunk-level aggregation, QASC achieves F1=0.85, an 18–27% relative improvement over fixed chunking and 8–12% over semantic/agentic methods on 100 technical documents and 200 queries.

RAG · Benchmarks · Papers
SIG 78 · HYP 15
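
The three steps named in the QASC summary above (cosine similarity scoring, contextual window expansion, chunk-level aggregation) can be illustrated with a rough sketch. This is not the paper's implementation: the toy bag-of-words `embed`, the `threshold` and `window` parameters, the span-merging rule, and the use of the maximum sentence score as the chunk score are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    # Assumption: toy bag-of-words vector standing in for a real embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def query_adaptive_chunks(sentences, query, threshold=0.2, window=1):
    # Step 1: score every sentence against the query.
    q = embed(query)
    scores = [cosine(embed(s), q) for s in sentences]

    # Step 2: expand each matching sentence by `window` neighbours on each side.
    last = len(sentences) - 1
    spans = [(max(0, i - window), min(last, i + window))
             for i, score in enumerate(scores) if score >= threshold]

    # Merge overlapping or adjacent spans into contiguous chunks.
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))

    # Step 3: aggregate sentence scores per chunk (assumption: take the max).
    chunks = [{"span": (s, e),
               "text": " ".join(sentences[s:e + 1]),
               "score": max(scores[s:e + 1])}
              for s, e in merged]
    return sorted(chunks, key=lambda c: c["score"], reverse=True)
```

Calling `query_adaptive_chunks(sentences, "LRU eviction policy")` on a list of sentences returns chunks ordered by relevance, each carrying its sentence span, so the same document can be segmented differently for different queries.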
arXiv cs.CL·

Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning

Study of 16 language models (1.5B–72B parameters) showing representational convergence does not extend to reasoning processes. Models align more on collectively failed problems (CKA=0.897) than solved ones (CKA=0.830). Post-decision representations diverge sharply (CKA=0.274), and shared information exerts minimal causal influence (1.5–5.5% flip rate).

Papers · Reasoning · Evals
SIG 78 · HYP 15
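
For readers unfamiliar with the CKA scores quoted in the entry above, the following is a minimal sketch of centered kernel alignment. The summary does not say which variant the study uses; this assumes the commonly used linear form, where each input is a matrix of activations with one row per example and one column per feature. The function names are illustrative.

```python
import math


def center_columns(m):
    # Subtract each feature's mean so the comparison ignores constant offsets.
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_sq_norm(a, b):
    # Squared Frobenius norm of a^T b, computed column pair by column pair.
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    # x and y: lists of rows with the same number of examples (rows);
    # the number of features (columns) may differ between them.
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of examples")
    xc, yc = center_columns(x), center_columns(y)
    denom = math.sqrt(cross_sq_norm(xc, xc) * cross_sq_norm(yc, yc))
    if denom == 0:
        raise ValueError("CKA is undefined for constant representations")
    return cross_sq_norm(xc, yc) / denom
```

Under this definition the score lies between 0 and 1, equals 1 (up to floating-point error) for identical representations, and is unchanged by rotating or uniformly rescaling either representation, which is why it is used to compare models with different internal bases.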
arXiv cs.CL·

Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals

MaR (Metacognition-as-Reward) is an RL framework that improves LLM reasoning along two dimensions: metacognitive knowledge (identifying task-relevant information) and metacognitive regulation (planning the reasoning process). Tested on 22 benchmarks, Qwen3.5-9B + MaR gains up to 7.7% over the base model and 11.0% over vanilla DAPO, surpassing GPT-OSS-120B on average.

Reinforcement learning · Reasoning · Qwen
SIG 78 · HYP 25
Reddit r/LocalLLaMA·

Benchmarked Needle 26M vs Qwen3-0.6B on CPU function calling, 50 queries across 5 difficulty tiers. The 23x smaller model wins on accuracy and is 4.4x faster.

CPU benchmark of Needle (26M) vs Qwen3-0.6B on function calling: 50 queries across 5 difficulty tiers. Needle wins on accuracy (72% vs 56% tool_match) and latency (10.9s vs 47.9s). Needle's failures are in tool selection, Qwen3's in tag emission. Qwen3 dominates on multilingual queries (Hindi, French).

Qwen · Benchmarks · Code generation
SIG 78 · HYP 15