arXiv cs.AI·

Memory-Guided Tree Search with Cross-Branch Knowledge Transfer for LLM Solver Synthesis

MEMOIR, a memory-guided tree-search framework, automatically synthesizes solvers for combinatorial optimization using LLMs. With a two-level memory hierarchy (branch-local and global), it achieves 96.7% solution validity across 7 problems (scheduling, routing, packing), outperforming baselines by 9.2 points and reducing run-to-run validity variance by over an order of magnitude.

AI Agents · Reasoning · Code generation
SIG 78 · HYP 25
arXiv cs.AI·

Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents

A dual-process memory architecture for scientific agents that decouples an episodic window (10 messages) from semantic consolidation (3 tokens/message). Evaluated on 15,000 messages across 6 LLMs (OpenAI, Anthropic, Google), it maintains 70-85% accuracy at 10,000 messages with 62% fewer tokens. It also identifies a trade-off: Dual Process excels at numeric/temporal queries, while RAG is stronger for historical retrieval.

AI Agents · Reasoning · RAG
SIG 78 · HYP 25
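
The episodic/semantic split described in the summary above can be illustrated with a minimal sketch. It assumes a fixed window of 10 verbatim messages and a budget of 3 tokens per consolidated message, as the summary states; everything else is hypothetical. In particular, the class name `DualProcessMemory` and the consolidation rule (keep the first three whitespace tokens) are stand-ins, since the summary does not say how the paper compresses old messages.

```python
from collections import deque


class DualProcessMemory:
    """Toy two-tier memory: recent messages verbatim, older ones compressed."""

    def __init__(self, window=10, tokens_per_message=3):
        self.window = window
        self.tokens_per_message = tokens_per_message
        self.episodic = deque()   # most recent messages, kept in full
        self.semantic = []        # compressed traces of evicted messages

    def _consolidate(self, message):
        # Placeholder compressor (an assumption): keep the first few tokens.
        # A real system would more plausibly use an LLM or an extractor here.
        return message.split()[: self.tokens_per_message]

    def add(self, message):
        self.episodic.append(message)
        # Once the window overflows, move the oldest message to semantic memory.
        if len(self.episodic) > self.window:
            oldest = self.episodic.popleft()
            self.semantic.append(self._consolidate(oldest))

    def context(self):
        # Context handed to the model: compact summary plus the verbatim window.
        summary = " | ".join(" ".join(tokens) for tokens in self.semantic)
        return summary, list(self.episodic)

    def token_count(self):
        # Whitespace tokens are a rough proxy for model tokens.
        semantic_tokens = sum(len(tokens) for tokens in self.semantic)
        episodic_tokens = sum(len(m.split()) for m in self.episodic)
        return semantic_tokens + episodic_tokens
```

With this layout the context grows by at most 3 tokens per evicted message instead of the full message length, which is the kind of saving the summary reports, although the 62% figure depends on the paper's actual consolidation method and data.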
arXiv cs.AI·

KISS - Knowledge Infrastructure for Scientific Simulation: A Scaffolding for Agentic Earth Science

KISS introduces a Knowledge Infrastructure (KI) enabling AI agents to execute complex Earth science simulations. On 3,000 trials, KI-equipped agents produced physically plausible simulations in 84% of cases vs. <40% without KI. An automated Knowledge Dissection Toolkit (KDT) generated 119 KIs across 14 Earth-science domains, showing operational expertise is structured and extractable rather than ad hoc.

AI Agents · Reasoning · Benchmarks
SIG 78 · HYP 25
arXiv cs.AI·

Unleashing LLMs in Bayesian Optimization: Preference-Guided Framework for Scientific Discovery

LGBO (LLM-Guided Bayesian Optimization) embeds LLM semantic reasoning into every Bayesian Optimization iteration via a region-lifted preference mechanism. Tested on physics, chemistry, biology, and materials science benchmarks, LGBO reaches 90% of best observed value in 6 iterations for Fe-Cr battery electrolyte optimization, versus 10+ for standard BO.

Reasoning · Benchmarks · Papers
SIG 78 · HYP 25
arXiv cs.AI·

TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning

TaskGround is a task-inference framework for household agents operating on complete scenes. It structures reasoning in three steps: grounding (extracting relevant context), inference (executable structure), execution (action sequences). Evaluated on FullHome (400 tasks), it improves success rates and makes Qwen3.5-9B competitive with GPT-4 while reducing token costs by 18x.

AI Agents · Reasoning · Robotics
SIG 78 · HYP 25
arXiv cs.AI·

LARGER: Lexically Anchored Repository Graph Exploration and Retrieval

LARGER is a context retrieval framework for repository-level coding agents that combines lexical search with structural graph exploration (imports, call chains, type hierarchies) without external databases. On LocBench, it improves file-level Acc@5 by 13.9 points (11.8 with fixed hyperparameters) and shows consistent gains on test generation and codebase QA benchmarks.

AI Agents · Code generation · Benchmarks
SIG 78 · HYP 15
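
The combination of lexical search and graph exploration in the LARGER summary above can be sketched roughly as follows. This is not the paper's algorithm: the function names, the term-overlap score, the one-hop expansion, and the `decay` factor are all assumptions chosen to show the general idea of seeding with lexical hits and then following structural edges such as imports.

```python
import re
from collections import defaultdict


def lexical_scores(query, files):
    """Count how many words in each file match a query term (toy scorer)."""
    terms = set(re.findall(r"\w+", query.lower()))
    scores = {}
    for path, text in files.items():
        words = re.findall(r"\w+", text.lower())
        scores[path] = sum(1 for w in words if w in terms)
    return scores


def retrieve(query, files, edges, k=5, hops=1, decay=0.5):
    """Rank files by lexical score, then propagate score along graph edges.

    `edges` maps a file to the files it references (imports, calls, ...).
    """
    scores = defaultdict(float, lexical_scores(query, files))
    frontier = {p: s for p, s in scores.items() if s > 0}
    for _ in range(hops):
        reached = {}
        for path, score in frontier.items():
            for neighbor in edges.get(path, ()):
                bonus = score * decay
                # Keep the strongest signal if several seeds reach a neighbor.
                reached[neighbor] = max(reached.get(neighbor, 0.0), bonus)
        for neighbor, bonus in reached.items():
            scores[neighbor] += bonus
        frontier = reached
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return [p for p in ranked if scores[p] > 0][:k]
```

In this sketch a file with no lexical match can still be returned if a matching file imports it, which is the behavior that purely lexical retrieval misses. Everything lives in in-memory dictionaries, in keeping with the summary's note that no external database is required.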
arXiv cs.AI·

Strategic Over-Parameterization for Generalizable Low-Rank Adaptation

LoRA-Over improves parameter-efficient fine-tuning (PEFT) by enriching the optimization landscape during training via auxiliary over-parameterization, then collapsing this enrichment into standard LoRA structure at inference. Evaluated on GLUE, MT-Bench, GSM8K, and HumanEval with LLaMA 2-7B and 3.1-8B, the framework consistently outperforms vanilla LoRA with no additional inference cost.

Fine-tuning · Llama · Benchmarks
SIG 78 · HYP 18
arXiv cs.AI·

Causely: A Causal Intelligence Layer for Enterprise AI - A Benchmark Study on SRE and Reliability Workflows

Causely is a causal intelligence layer for SRE workflows that structures environment topology and causal dependencies. In a benchmark across 4 agent configurations (Claude Code, OpenAI Codex, HolmesGPT), adding Causely cut mean time-to-diagnosis by 63%, token consumption by 60%, tool calls by 78%, and API cost per run by 57%, while root-cause accuracy rose from 75% to 100%.

AI Agents · Benchmarks · Claude Code
SIG 78 · HYP 25
arXiv cs.AI·

AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents

AgentKernelArena is an open-source benchmark for evaluating AI coding agents on GPU kernel optimization. It contains 196 tasks (HIP-to-HIP, Triton-to-Triton, PyTorch-to-HIP) and tests generalization to unseen configurations. Cursor Agent, Claude Code, and Codex Agent achieve speedups up to 6.89x, but PyTorch-to-HIP optimizations show correctness drops on unseen configurations.

AI Agents · Code generation · Benchmarks
SIG 78 · HYP 15
arXiv cs.CL·

Ancient Greek to Modern Greek Machine Translation: A Novel Benchmark and Fine-Tuning Experiments on LLMs and NMT Models

A new AG-MG parallel corpus with 132,481 sentence pairs for Ancient-to-Modern Greek translation. The creation pipeline combines web scraping, VecAlign alignment with fine-tuned LaBSE embeddings, and LLM-based correction with Gemini 2.5 Flash. In a benchmark of NMT models (NLLB, M2M100) and a Greek LLM (Llama-Krikri-8B), full fine-tuning achieves 13.16 BLEU, with gains of up to 10.3 points.

Benchmarks · Fine-tuning · Embeddings
SIG 78 · HYP 15
arXiv cs.AI·

Adversarial Fragility and Language Vulnerability in Clinical AI: A Systematic Audit of Diagnostic Collapse Under Imperceptible Perturbations and Cross-Lingual Drift in Low-Resource Healthcare Settings

Systematic audit of two critical vulnerabilities in clinical AI: adversarial fragility and cross-lingual drift. On CheXNet (DenseNet121), accuracy collapses from 89.3% to 62.0% under imperceptible FGM perturbation (epsilon=0.021). Llama3.1:8b and NatLAS show major degradation on Nigerian Pidgin and Yoruba (80%→65%, 85%→55%). Standard defenses fail.

AI safety · Alignment · Evals
SIG 78 · HYP 25
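
For readers unfamiliar with the FGM (fast gradient method) perturbation mentioned in the clinical AI audit above, here is a minimal sketch on a toy logistic-regression model instead of CheXNet. The model, weights, and inputs are hypothetical. The summary gives epsilon=0.021 but not the norm, so the L2-normalized step used by default here is an assumption; the `linf` branch gives the sign-based variant usually called FGSM.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Probability of the positive class under a logistic model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fgm_perturb(w, b, x, y, eps, norm="l2"):
    """Take one gradient-ascent step on the loss with respect to the input.

    For binary cross-entropy with a logistic model, dLoss/dx = (p - y) * w.
    """
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    if norm == "linf":
        # Sign-based step (FGSM): every coordinate moves by exactly eps.
        step = [math.copysign(1.0, g) if g != 0 else 0.0 for g in grad]
    else:
        # L2-normalized step: the whole perturbation has length eps.
        length = math.sqrt(sum(g * g for g in grad))
        step = [g / length if length > 0 else 0.0 for g in grad]
    return [xi + eps * si for xi, si in zip(x, step)]
```

Even on this two-feature toy model, a small step along the loss gradient lowers the confidence in the true label, and a larger one flips the decision. The audit reports the analogous effect on a deep image classifier, where the same budget is spread over many pixels and can remain imperceptible.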