Running out of messages fast? Claude adds a new control
Anthropic adds a new control to manage message limits on Claude. The feature improves visibility without fully solving the quota problem.
LinTree improves LLM reasoning by explicitly representing the tree structure of search traces. Researchers show raw access to search history alone fails to reliably outperform LLM-guided heuristic search. Adding parent pointers to explicitly represent the linearized tree structure improves performance and search efficiency on Blocks World, grid Navigation, and Sokoban.
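As a rough illustration of what "parent pointers over a linearized tree" can look like, the sketch below flattens a small search tree into a list and stores, for each entry, the index of its parent. The `TraceEntry` record, the dict-based tree format, and both function names are assumptions made for this example, not the paper's actual data format.

```python
from dataclasses import dataclass


@dataclass
class TraceEntry:
    index: int   # position in the linearized trace
    parent: int  # index of the parent entry, -1 for the root
    state: str   # hypothetical textual description of the search state


def linearize(tree, root):
    """Flatten a tree (dict: state -> list of child states) in depth-first order."""
    entries = []
    stack = [(root, -1)]
    while stack:
        state, parent = stack.pop()
        idx = len(entries)
        entries.append(TraceEntry(idx, parent, state))
        # Push children in reverse so the leftmost child is expanded first.
        for child in reversed(tree.get(state, [])):
            stack.append((child, idx))
    return entries


def path_to_root(entries, index):
    """Follow parent pointers from an entry back to the root."""
    path = []
    while index != -1:
        path.append(entries[index].state)
        index = entries[index].parent
    return path[::-1]


if __name__ == "__main__":
    toy_tree = {"s0": ["s1", "s2"], "s1": ["s3"]}
    trace = linearize(toy_tree, "s0")
    print(path_to_root(trace, 2))
```

With the pointers in place, the branch leading to any entry can be recovered from the flat list alone, without guessing the tree shape from the order of the raw history.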
New approach combining Answer-Set Programming (ASP) and Reinforcement Learning to create logical abstractions of state spaces. Authors reimplement the CARCASS framework (originally in Prolog) using ASP, a fully declarative language, and evaluate it on Blocks World and Minigrid. ASP provides richer modelling for logical representations of Markov Decision Processes.
HypoAgent is a multi-agent framework for interactive abductive hypothesis generation over knowledge graphs. Three coordinated agents (intent recognition, hypothesis generation, root cause analysis) enable multi-turn dialogue and fine-grained diagnosis of failed hypotheses. SOTA on commonsense and biomedical KGs.
SCALE is a self-improving framework for web agents using MLLMs. It employs three adversarial roles (Selector, Predictor, Judger) to autonomously explore agent limitations and expand cognitive boundaries. SCALE-Hop optimizes global planning via graph exploration. A SCALE-20k dataset from 19 real websites with 20k structured demonstrations validates the approach across multiple MLLMs.
GLIDE is an open-source Python library unifying prediction-powered inference methods (PPI++, Stratified PPI, Predict-Then-Debias) for evaluating agentic systems. It combines human annotations and LLM judgments into unbiased estimates with valid confidence intervals, reducing annotation costs while maintaining precision.
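To make the idea of combining human annotations with LLM judgments concrete, here is a minimal sketch of the basic prediction-powered mean estimator: the judge's average on unlabeled items, corrected by the average human-minus-judge gap on the labeled subset. It does not use GLIDE's API, which the summary does not describe. The function and argument names are assumptions, and the PPI++, stratified, and confidence-interval parts are omitted.

```python
from statistics import mean


def ppi_mean(judge_unlabeled, judge_labeled, human_labeled):
    """Basic prediction-powered estimate of a mean score.

    judge_unlabeled: LLM-judge scores on items with no human label
    judge_labeled:   LLM-judge scores on the human-labeled items
    human_labeled:   human scores on those same items (same order)
    """
    # The rectifier estimates how far the judge is off, on average.
    rectifier = mean(h - j for h, j in zip(human_labeled, judge_labeled))
    return mean(judge_unlabeled) + rectifier


if __name__ == "__main__":
    estimate = ppi_mean(
        judge_unlabeled=[1, 1, 0, 1],
        judge_labeled=[1, 1, 0, 0],
        human_labeled=[1, 0, 0, 0],
    )
    print(estimate)
```

In this toy case the judge is too generous on one of the four labeled items, so the estimate is pulled below the judge's raw average.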
DecomposeR, a deep research framework, trains Qwen3-8B in two RL stages: planner RL learns typed DAG structures and query decomposition, then answerer RL learns branch execution and synthesis. Achieves 5.1-8.0 point improvements on long-form benchmarks through explicit planning and structured rewards.
AdaCoM, an external LLM system, manages context for frozen LLM agents via reinforcement learning on long-horizon tasks (web search, deep research). Learned strategies reveal a Fidelity-Reliability Trade-off: high-performing agents benefit from higher-fidelity context preservation, while lower-performing agents require aggressive compression.
Researchers reframe healthcare mechanism design as program synthesis for LLMs. Medi-Sim, a multi-agent simulator, evaluates rule programs against strategic provider responses (coding, selection, delay, effort, triage). LLM-guided evolutionary code search synthesizes a mixed-objective program that eliminates up-coding, halves rejections, and retains baseline profitability.
√LTS algorithm for tree search with implicit rerooting. Three rerooter designs proposed: clustering-based on state-space structure, heuristic-based with cost-to-go estimates, and hybrid. Avoids explicit subgoal generation, reduces computational overhead, and achieves state-of-the-art online training efficiency on tested domains.
Reinforcement learning framework for autonomous driving using uncertainty-aware expert advice with adaptive thresholds. Epistemic and aleatoric uncertainty trigger expert intervention; commitment-cooldown strategy prevents long-term dependence. CARLA experiments: +5-7% success rate vs IQN baseline.
PhyDrawGen is a neuro-symbolic pipeline that generates physics diagrams from text while respecting physical laws. An LLM extracts a typed scene graph, a deterministic solver converts it to a planar straight-line graph, and a fine-tuned Qwen-VL model runs a propose-verify loop. Evaluated on 1,449 problems (mechanics, optics, electromagnetism), it outperforms GPT-5-image and Gemini.
COLLEAGUE.SKILL is an automated trace-to-skill distillation system for generating person-grounded AI skills via expert knowledge extraction. The system produces versioned packages with two coordinated tracks: capability (practices, mental models, decision heuristics) and bounded behavior (communication style, interaction rules). 18.5k GitHub stars, 215 skills from 165 contributors.
Formalizes causal pathways for rare events in structural equation models. Proposes a formal definition of causal pathways and identifies conditions where testable implications depend only on the causal abstraction defined by the rare event pathway, rather than the full causal graph.
Novel MADQI evaluation metric for unsupervised anomaly detection in maritime AIS datasets. Combines four indices (ARC, PPS, SDS, ECE) through automatic normalization. Achieves MADQI score of 80.37% on AIS data, with ECE=0.907 and ARC=1.000 for detecting abnormal vessel behavior.
Unicorn, a multi-dataset pretraining framework, bridges the trade-off between channel-independent models (scalable but ignoring dependencies) and channel-dependent models (expressive but dimension-bounded). Using a latent prototype codebook, it projects heterogeneous channels into a shared space to learn identity-agnostic, reusable correlation patterns transferable across domains.
DSFM (Dual-Spectral Flow Matching) generates synthetic fMRI time series by combining discrete wavelet transform (DWT) and discrete cosine transform (DCT) with spectral flow matching. The model captures non-stationarity and spatiotemporal dynamics of BOLD signals to improve brain network classification.
Paper proposes an alternative to deep neural networks for LLMs using RBF networks. The model finds the global optimum of the loss function in closed form in a single iteration, eliminating iterative training. High-level overview with case study and comparison to similar methods provided.
VeriGate extends GRPO by combining verifier rewards with step-level supervision. The method uses a Process Reward Model (PRM) to assign fine-grained credit to tokens, avoiding gradient collapse when all trajectories receive identical rewards. On MATH with Qwen2.5-Instruct (1.5B/7B), VeriGate improves accuracy by ~20% and ~12% respectively.
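The gradient-collapse case mentioned above can be seen from the group-relative advantage that GRPO-style methods compute: each reward is centered on the group mean and divided by the group standard deviation. The sketch below is a simplified version with an assumed `eps` stabilizer and population standard deviation. It is not VeriGate's implementation and leaves out the PRM-based token credit entirely.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of sampled trajectories."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Mixed outcomes: correct trajectories get positive advantage.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
    # Identical outcomes: every advantage is zero, so no learning signal.
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

When every trajectory in a group receives the same verifier reward, all advantages are zero and the policy gradient for that group vanishes, which is the gap that step-level supervision is meant to fill.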
Study on KV cache prompt redundancy during decoding. Researchers show upper-layer prompt cache can be replaced with chat template scaffolds without significant accuracy loss, revealing redundancy is structural rather than semantic. Results validated across Qwen3, Gemma 3, and Llama 3 families.
arXiv study on LLM security against untrusted inputs. Researchers test whether wrapping untrusted content in mock tool calls improves robustness across 7 models and 3 LLM-as-a-Judge tasks. Finding: the approach fails and typically increases attack success rates, inverting the expected instruction hierarchy.
CanLegalRAGBench is an evaluation benchmark for RAG systems applied to Canadian law, based on realistic queries and expert-annotated answers. The study shows open-source embedding models are competitive with closed-source alternatives, but identifies hallucinations in 8-29% of generated answers unsupported by retrieved documents.
Bias evaluation of multimodal speech recognition models (audio-visual). Researchers create videos pairing different faces with identical audio and measure transcription accuracy variations. Findings: quality-of-service gaps up to 4.05 word error rate points across gender, ethnicity, and intersections on Whisper-Flamingo and Gemini.
Study on LLM teams playing ChGK (collective reasoning quiz). Three strategies tested: Voting, Silent Team (captain observes answers), Talkative Team (captain observes answers + rationales). On 572 questions from 2025, teams outperform single models (+20 points). Best team: 44.23% accuracy, approaching human performance. Sharing rationales mitigates errors.
Protocol to evaluate ChatGPT's ability to generate disease-centric biomedical associations. Uses RAG with open-source LLMs for semantic verification and hallucination detection through cross-model majority voting.
A new counterfactual evaluation metric (CSS) reveals that six frontier models ranked similarly on traditional coverage-based metrics rank in nearly opposite order when assessed on their ability to update clinical recommendations in response to oncology case mutations. All models fail on surgery-status interventions, a safety blind spot invisible to coverage metrics.
Formal study of calibration for probabilistic label ranking. Authors define a hierarchy of notions (full rankings, sub-rankings, top-k) and show popular models are poorly calibrated. Application to RLHF reward models reveals calibration and accuracy are not perfectly correlated.
LongDS-Bench evaluates AI agents' ability to maintain analytical context over long horizons. The benchmark contains 68 multi-turn data analysis tasks (2,225 turns) from real Kaggle notebooks. Best models reach only 48.45% accuracy, with a 47-point performance drop from early to late turns. Long-horizon errors account for 52–69% of failures.
NumLeak measures memorization of public benchmarks in frontier LLMs. Models recall Fama-French data (r=0.97-0.99), US unemployment, and NOAA temperature with high fidelity. On recent unseen data, parse rate drops to 21-57% but r stays ~0.99 for answered months. A one-line system-prompt defense blocks 99.8% of attacks.
New differentially private sketching mechanism using fast transforms (Hadamard matrices). Combines matrix compression with privacy guarantees for DP linear regression. First fast method for DP ordinary least squares with improved runtime and utility guarantees.
AMNESIA is the first large-scale open-source benchmark for machine unlearning in medical LLMs. It contains 70,560 question-answer pairs from 8,820 patient notes across 11 disease categories. The authors evaluate 4 unlearning methods and show that forgetting individual patients erodes knowledge of others with the same condition.
XOResNet introduces OR-ADD shortcut connections and XOR meta-residuals to improve learning in deep spiking neural networks. Tested on Fashion-MNIST, CIFAR-10, CIFAR-100, and miniImageNet, the model outperforms existing SNNs by reducing spike redundancy and information loss.
Multi-model study (Pythia-1.4B, Gemma-2, Qwen2.5-7B, Llama-3.1-8B) on linear representations of synthetic dishonesty. Linear probes detect deception with AUC ≥0.99 as early as layers 1-3. Dishonesty representations consolidate progressively in deeper layers, with implications for activation-based monitoring.
Unified Gait2Hip-60 benchmark comparing LSTM, Transformer, and Mamba to predict hip muscle forces and joint moments from gait kinematics. Transformer outperforms other models (R²=0.819 for forces, R²=0.862 for moments). External validation on 9 femoral head osteonecrosis patients shows moderate generalization (R²=0.537–0.569).
EGGROLL, a low-rank factorization of Evolution Strategies perturbations, reduces memory complexity from O(mn) to O(r(m+n)) for gradient-free training of Spiking Neural Networks. On N-MNIST, the method achieves 79.21% test accuracy with 2.23× speedup versus full-rank ES, enabling on-chip learning on neuromorphic hardware without surrogate gradients.
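The memory claim can be checked with simple counting: a full-rank perturbation of an m × n weight matrix stores m·n numbers, while a rank-r perturbation built from two thin factors stores r·(m+n). The sketch below uses hypothetical helper names and plain Python lists. Any scaling constant the method applies to the product is omitted.

```python
def full_rank_entries(m, n):
    """Numbers stored for a dense m x n perturbation."""
    return m * n


def low_rank_entries(m, n, r):
    """Numbers stored for factors A (m x r) and B (n x r)."""
    return r * (m + n)


def low_rank_perturbation(a, b):
    """Materialize E = A B^T from factors given as lists of rows."""
    r = len(a[0])
    return [[sum(a_row[k] * b_row[k] for k in range(r)) for b_row in b]
            for a_row in a]


if __name__ == "__main__":
    m, n, r = 1024, 1024, 4
    print(full_rank_entries(m, n), low_rank_entries(m, n, r))
```

For a 1024 × 1024 layer at rank 4, that is 1,048,576 stored values against 8,192, a 128-fold reduction per perturbation.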
arXiv study on iterative refinement of LLM-generated reward functions for sparse structured RL. Authors identify two dominant failure modes (reward flooding, semantic misunderstanding) and propose diagnostic-driven refinement guided by failure-mode taxonomy. Results: DoorKey-8x8 improves from 2.3% to 97.6%, KeyCorridor from 31.2% to 86.7%. Limitations: method restricted to PPO and sparse structured tasks.
New arXiv paper introducing context-dependent argumentation frameworks (CDAFs), extending Dung's theory. An agent strategically manipulates context through a defeat function to make target arguments accepted. Introduces ACTIVATION-MANIPULATION decision problem with baseline complexity bounds.
AutoSci is a memory-centric multi-agent system automating the full scientific research lifecycle. It combines SciMem (structured memory), SciFlow (5-stage execution), SciDAG (multi-agent operators), and SciEvolve (continuous learning). Code available on GitHub.
FAM-Bench is a 2500-instance multimodal benchmark verified by nutrition experts, evaluating Food-as-Medicine reasoning across 13 health conditions. Two tasks: assess dish suitability for a condition (image + ingredients) and rank 4 dishes by clinical relevance. Tests integration of nutritional constraints, visual cues, and ingredient evidence.
CoSee, an auditing framework, analyzes failure modes of modular visual reasoning systems using shared working memory. On 4B–8B models, two dominant failure modes emerge: Noise Reinforcement (reusing ungrounded notes) and Policy Collapse (under-specified answers). The study shows naive shared workspaces amplify hallucinations without explicit verification.
TraceGraph is a graph-based framework that transforms multi-model agent trajectories into shared decision landscapes. It builds graphs over state-action-observation spaces, identifies productive cores and trap regions, then proposes a trap-aware recovery pipeline. On SWE-bench, this approach improves resolution rate from 40.4% to 43.5%.
XLGoBench is a benchmark of synthetic algorithmic tasks to detect cross-lingual gaps in LLM abilities. The benchmark is commensurate across languages, scalable (variable complexity), quantifiable (objective correctness), and transparent (auditable templates). Experiments reveal persistent cross-lingual gaps in multiple state-of-the-art models.
Eywa is a provenance-grounded memory architecture for persistent AI agents, storing immutable source evidence before deriving facts and validating memories against typed signals. Retrieval uses a deterministic multi-route read path with zero LLM calls. Results: 90.19% judge accuracy on LoCoMo C1-C4, 88.2% on LongMemEval-S, 81.45% mean nugget score on BEAM.
Theoretical paper defining pairwise reference alignment as an ordinal observable for language model evaluation. Formulates statistical framework to measure whether a model ranks preferred responses above rejected responses, with finite-sample estimators and concentration bounds. Empirical validation on Qwen2.5 and RewardBench.
MASA (Model-Aware Skill Alignment) adapts procedural skills for LLM agents to each model backbone without weight modification. A hierarchical evolution pipeline rewrites skills via hill climbing and UCB-driven tree search, then a lightweight rewriter trained on trajectories reproduces adaptation in a single forward pass. Gains up to 25.8 points across three interactive environments and four backbones.
Study of gender-specific neurons in language models (feminine, masculine, gender-neutral). Authors propose neuron-level intervention method to identify and control gendered language generation. Experiments on two open-source LMs show gender neurons concentrate in early layers. Code and datasets released.
Method to assign consistent predictability scores to short trajectory windows on a deterministic-stochastic continuum. GON (Gauge-Fixed Ordinal Network), a temporal convolutional model, resolves cross-system ambiguity via anchor-and-variance objective. Transfer validated on 5 dynamical systems, outperforming scratch training across all window budgets.
Zeroth-order sampling method with variance reduction for non-log-concave distributions in black-box settings. Proposes ZO-APMC for inverse problems with generative priors. First non-asymptotic convergence guarantees established.
ElasticMem introduces a learnable latent memory framework for LLM agents with adaptive retrieval and elastic budget allocation via a learned policy. On Qwen2.5-3B and 7B backbones, it achieves QA accuracy gains of 26.2% and 24.6% and ALFWorld success improvements of 66.3% and 27.2%, respectively, at the lowest token cost.
Study of alignment between LLM uncertainty and human uncertainty through behavioral analysis and internal activation patterns. Authors measure calibration and alignment across multiple-choice and open-ended factual recall datasets, assessing impact of instruction fine-tuning.
CobSeg is a multi-branch architecture for dialogue topic segmentation. It separates semantic continuity from lexical transitions and uses boundary informativeness weighting. Evaluated on 5 benchmarks, it reduces the Pk error metric (lower is better) by 0.7 points on VHF and reaches Pk=1.0 on DialSeg711, without LLM calls at inference.
Untrained neural networks match early visual cortex better than trained networks. Study on 720 THINGS images and fMRI from 3 subjects shows one training epoch reduces V1 alignment by 25-90% depending on learning rule. Backpropagation degrades most (Δr = -0.080), while predictive coding and STDP preserve alignment better (Δr ~ -0.04).
Method to establish correspondences between embedding vectors from different black-box encoders. Exploits local geometric consistency of independently trained contrastive encoders: short-range distances preserved up to scale factor. Uses iterative reference-based geometric embedding hashing with paired anchors to recover vector links. Code released.
GraphARC is an AI benchmark for abstract reasoning on graph-structured data, generalizing the ARC paradigm to graph transformations. Current language models fail on full graph transformation tasks despite understanding graph properties, revealing a comprehension-execution gap.
Novel transformer-based architecture for autonomous resource management in heterogeneous satellite clusters (optical and SAR). Uses model-free reinforcement learning for real-time decision-making in Earth Observation missions. Demonstrates significant performance improvements and transferability across varying cluster sizes.
Persona-based evaluation framework for pluralistic alignment in generative AI. Replaces monolithic benchmarks with structured manifold of synthetic cognitive profiles representing diverse human perspectives. Reveals systematic degradation of persona coherence under sequential inference, suggesting need for dynamic regulatory mechanisms.
UniScale unifies model routing and test-time scaling (TTS) in a single optimization space to balance LLM inference quality and computational cost. The framework uses LinUCB and contextual multi-armed bandit theory to learn adaptive inference policies online, with cost modeling and efficiency-aware learning.
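For readers unfamiliar with LinUCB, the sketch below shows the standard disjoint variant in pure Python: each arm keeps a ridge-regression estimate of reward given the context and is scored by its predicted reward plus an exploration bonus. Treating each arm as a (model, test-time-scaling budget) pair and folding cost into the reward are assumptions for illustration; the summary does not specify UniScale's exact formulation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class LinUCBArm:
    """One arm of disjoint LinUCB; keeps A^{-1} and b for ridge regression."""

    def __init__(self, dim):
        # A starts as the identity, so its inverse is the identity too.
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim

    def _a_inv_times(self, v):
        return [dot(row, v) for row in self.a_inv]

    def ucb(self, x, alpha):
        theta = self._a_inv_times(self.b)  # ridge estimate A^{-1} b
        width = math.sqrt(dot(x, self._a_inv_times(x)))
        return dot(theta, x) + alpha * width

    def update(self, x, reward):
        # Sherman-Morrison update of A^{-1} after A += x x^T.
        a_inv_x = self._a_inv_times(x)
        denom = 1.0 + dot(x, a_inv_x)
        for i in range(len(x)):
            for j in range(len(x)):
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        for i in range(len(x)):
            self.b[i] += reward * x[i]


def select_arm(arms, x, alpha=1.0):
    """Return the index of the arm with the highest upper confidence bound."""
    scores = [arm.ucb(x, alpha) for arm in arms]
    return max(range(len(arms)), key=scores.__getitem__)


if __name__ == "__main__":
    arms = [LinUCBArm(1), LinUCBArm(1)]
    for _ in range(5):
        arms[0].update([1.0], 1.0)  # hypothetical: this arm keeps paying off
        arms[1].update([1.0], 0.0)  # hypothetical: this arm never does
    print(select_arm(arms, [1.0]))
```

The `alpha` parameter controls how strongly rarely tried arms are favored. In a routing setting the reward would typically be a quality score penalized by inference cost.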
New post-training method for reasoning models: Feedback Distillation trains the model to match its own distribution conditioned on LLM-generated feedback at token level. Tested on Lean4 theorem-proving, it maintains greater trajectory diversity than GRPO, improves policy entropy and pass@k scaling. Combined with GRPO, it outperforms either method alone.
COMPASS is a safety alignment framework for multi-step LLM search agents. It combines Cognitive Tree Exploration (CTE) to synthesize stealthy attack trajectories and Introspective Step-wise Alignment (ISA) to supervise risky intermediate actions. Results: favorable safety-utility trade-off requiring substantially less training data.
SLAT is an RL framework that reduces redundancy in chain-of-thought reasoning by selectively suppressing low-utility segments. On standard benchmarks, the method achieves 50% reasoning length reduction while maintaining competitive accuracy.
Theoretical paper positioning diffusion models as part of a family of learning techniques that withhold information and train models to recover it. Author argues destruction-based information withholding is more flexible than hand-crafted techniques, especially in data-scarce settings. Raises exploration challenges and proposes diffusion-native research directions.
SubsurfaceGen is a GPU-accelerated generator for 3D velocity models and seismic data at field scale. Authors release a dataset of 4,276 2D slices covering 6 geological settings (10 km × 10 km × 6.19 km at 10 m resolution). Evaluation of neural operators on wavefield prediction and end-to-end velocity inversion with out-of-distribution testing.
Multi-agent LLM systems assume agreement between agents indicates reliability. Authors show communication induces correlated failures and false consensus. They propose CAGE-CAL, a counterfactual agent-graph calibration framework comparing post-communication dependencies with no-communication scenarios to adjust confidence accordingly.
COFT is a training-free decoding method that reduces biases in LLM chain-of-thought generation. It uses masked counterfactual prompts and logit fusion to attenuate attribute-driven biases, with distribution-free marginal validity guarantees. Evaluation across 6 models: 30-55% bias reduction (median 38%) with negligible utility loss and ≤11% computational overhead.
Study on long-term effects of data selection during multi-stage LLM fine-tuning. Authors show that short-term optimal strategies (loss-based, gradient-based, diversity-based) can slow future learning and increase catastrophic forgetting. They propose LHAS (Long-Horizon Aware Selection) to evaluate selection as a global training intervention.
Method to detect and classify dataset usage in research literature using a multitask GLiNER framework. Combines dataset mention extraction, relation identification, and usage-context classification. Leverages synthetic data generation and LLM-based revalidation to address label scarcity.
Researchers improve multilingual speculative decoding by comparing three strategies: fine-tuning draft models on task-specific data, fine-tuning on unlabeled monolingual corpora, and training n-gram draft models. Evaluation across 11 languages on translation and story generation tasks. N-gram models provide consistent speedups despite lower acceptance rates.
Method to automatically generate fine-grained evaluation rubrics without human annotation, tested on four benchmarks. Training-free approach, then iterative fine-tuning via meta-judge reward signals. A fine-tuned 14B rubric generator outperforms larger proprietary models.
LLM-FACETS is an open-source framework for evaluating LLM factuality, epistemic calibration, and reproducibility. Web interface, plugin architecture, deterministic metrics (BLEU, ROUGE, BERTScore) run locally, log-probability visualization, multi-judge consensus, RAG Triad metrics. Designed for technical experts, domain experts, and compliance officers per EU AI Act and NIST standards.
Comparative study of generic vs domain-specific embeddings for multilingual clinical search (ICD-10-CM). A bi-encoder fine-tuned on Gemini-generated synthetic data (6 languages) outperforms BioBERT-ST: R@5=0.822 vs 0.790, with major gains in Portuguese (+0.115). Open recipe for LLM-based medical retrievers.
Item Response Theory-based method detects mislabels in 7 LLM benchmarks at 95% precision on top 200 examples across 114 models. Analysis reveals errors from mechanical labeling heuristics, inherited annotation mistakes, and fundamentally ambiguous items. Reward models specialize in stylistic preference over factual knowledge; one frontier model agrees with detected mislabels at 78% accuracy versus 38% for peers.
Researchers reveal that statistical watermarks in LLMs are vulnerable to linear ensembles. Averaging probability distributions across 3-5 models cancels out watermark perturbations. WASH (Watermark Attenuation via Statistical Hybridisation) defeats detection across 6 watermarking schemes, reducing z-scores from 5-300 to <2 (threshold: 4), while improving output quality by 27.5%.
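The core operation described here, averaging next-token distributions across several models, can be sketched in a few lines. The toy distributions below are invented for illustration: each hypothetical model's watermark boosts a different token, and the average dilutes every boost. This is not the WASH implementation, whose details the summary does not give.

```python
def average_distributions(distributions):
    """Uniformly average next-token distributions over a shared vocabulary."""
    vocab = distributions[0].keys()
    k = len(distributions)
    return {tok: sum(d[tok] for d in distributions) / k for tok in vocab}


if __name__ == "__main__":
    # Hypothetical: three watermarked models, each favoring a different token.
    watermarked = [
        {"a": 0.40, "b": 0.20, "c": 0.20, "d": 0.20},
        {"a": 0.20, "b": 0.40, "c": 0.20, "d": 0.20},
        {"a": 0.20, "b": 0.20, "c": 0.40, "d": 0.20},
    ]
    print(average_distributions(watermarked))
```

Because each watermark perturbs a different part of the vocabulary, no single model's bias dominates the mixture, which is consistent with the drop in detection z-scores reported above.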
ImmigrationQA: source-grounded QA dataset of 17,058 pairs across 13 U.S. immigration law subdomains. Llama 3.2 3B is fine-tuned with LoRA on a corpus of 10,056 validated documents. The fine-tuned model scores 1.08/3.0 on average (16.8% fully correct) versus 0.85/3.0 for the Llama 3 8B base model (4% fully correct), a 27% relative improvement in mean score. Cost: ~$29. Dataset, model, and code publicly released.
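The 27% figure above follows from the two mean scores rather than from the fully-correct rates, as a quick check shows. The helper name is an assumption for this example.

```python
def relative_improvement(new, baseline):
    """Relative change of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


if __name__ == "__main__":
    gain = relative_improvement(1.08, 0.85)
    print(f"{gain:.1%}")
```

The same calculation on the fully-correct rates (16.8% versus 4%) would give a much larger ratio, so the mean score is the only basis that matches the reported 27%.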
Study of global narrative dominance in LLMs via CulturalNB, a dataset of 717 Bengali cultural instances with parallel English-Bangla question-answer pairs. Evaluation of 9 LLMs shows English questions increase global substitution and reduce local perspective coverage, even with local evidence provided.
Comparative study of zero-shot multi-label topic classification using knowledge graphs extracted from documents. Framework tested on 15 LLMs and 8 datasets: keyword-enhanced variant outperforms baseline, graph augmentation helps small models but hurts large ones, and self-consistency decoding increases costs fivefold without performance gains.
GRiD, a diffusion-model framework, generates graph-like rules for knowledge graph reasoning. Combines supervised pre-training and reinforcement learning to discover complex rules (cycles, branches) beyond simple chains. Evaluated on 6 benchmarks with open-source code.
MAVEN is a lightweight symbolic reasoning scaffold to improve generalization of LLM agents in tool-calling tasks. Evaluated on BFCL v3, TauBench, Tau2Bench, AceBench and a new MAVEN-Bench benchmark, it increases GPT-OSS-120b accuracy from 48% to 71% without additional training, at roughly 1/10 the cost of proprietary baselines.
CSRM (Configurable Safety Reward Model) jointly optimizes calibrated safety compliance and reward modeling to adapt LLMs to heterogeneous and evolving safety requirements. Achieves 94.6% F1 on CoSApien and 75.8% F1 on DynaBench without additional human annotation.
Activation steering study across four multilingual LLMs (5 figurative categories, 6 languages). Directions learned in one language transfer effectively to others, particularly German. Composite cross-lingual directions match or exceed native directions, providing direct evidence of reusable but target-dependent figurative signals across languages.
arXiv paper on autonomous agentic data engineering for model specialization. GPT-5.2 constructs a training curriculum improving a student model by 57.29% through iterative, agent-driven data adaptation. Formalizes a novel task evaluating LLMs as autonomous data engineers.
Study of stylistic signatures introduced by LLM alignment. Researchers show post-training creates a detectable AI-like style. They propose PASTA, a training-free method that localizes and ablates this signature during decoding, reducing detection rates across 11 aligned models and 6 AI detectors.
EHRBench is an automated and reliable benchmark for evaluating LLMs on clinical decision-making tasks. Built via an EHR-LLM-KB pipeline, it generates ~960k QA items covering diagnosis, treatment, and prognosis. 30+ LLMs benchmarked reveal persistent gaps toward clinical reliability.
Study on harness self-evolution (prompts, skills, memories, tools) in LLM agents. Analyzes two capabilities: harness-updating (producing useful updates) and harness-benefit (benefiting from them). Findings: harness-updating is capability-agnostic (Qwen3.5-9B matches Claude Opus gains), while harness-benefit is non-monotonic (mid-tier models benefit most).
Study applying MAP-Elites (quality diversity algorithm) to procedural generation of FPS levels. Two novel representations (Point-Line, Spatial-Layout) improve map characterization. Topological and emergent metrics defined. MESB generates map populations with higher diversity and quality than previous approaches.
Study on encoding factored tasks (FTS) into SAT for planning. Authors propose multiple strategies for translating the factored transition relation into propositional logic and analyze the impact of task transformations and parallelism on SAT-based planners.
DisjunctiveNet introduces a neuro-symbolic framework to embed hard mixed-integer linear constraints and logical rules directly into neural networks using differentiable optimization layers. Through hierarchical convex relaxations, the approach ensures exact rule satisfaction while maintaining strong predictive performance on real-world datasets.
Scientific ML framework for turbine Remaining Useful Life (RUL) prediction. A shared encoder (CNN + bidirectional LSTM + attention pooling) with task-specific heads predicts turbine gas temperature (TGT), Delta TGT, and RUL with quantified uncertainty intervals. Evaluated on heterogeneous real-world fleet data using MAE, PICP, MPIW, and coverage-width criterion metrics.
Benchmarking of 5 uncertainty quantification methods (Delta, Bayesian Monte Carlo Dropout, Bootstrap, LUBE, MVE) for turbine gas temperature degradation prediction. Evaluation on real dataset using coverage probability and prediction interval width metrics. Trade-offs identified between accuracy and reliability.
Process-Level Latent Variable Model (PLVM) predicts future behavioral strategies from partial cross-task process traces. Tested on PowerWash Simulator: fusing traces from two cleaning tasks predicts whether a player adopts "Zone Planner" or "Zone Hopper" strategy on unseen Fire Station level. Applicable to adaptive systems (tutors, games, human-AI collaboration).
DisasterLex is a knowledge-graph-mediated text-to-SQL framework for querying geospatial disaster-analytics databases. It uses an Expert Knowledge Graph (107 concepts, 117 causal edges) to route natural-language queries across 36 heterogeneous tables. On 75 test queries, it outperforms 4 baselines (LightRAG, HippoRAG 2, ReFoRCE, CHESS) by 1.4x to 2.75x.
5WBENCH, a balanced 5,000-sample benchmark across 5W categories, reveals unlearning methods fail on causal (Why) questions. MAAT, a three-phase framework operating on LoRA weights, combines gradient-projected ascent, SVD rank pruning, and KL-hidden-state repair to simultaneously achieve high forgetting and retention on causal knowledge.
Researchers train a small encoder-decoder transformer on the zeta map, a classical bijection in q,t-Catalan combinatorics. Mechanistic interpretability tools (cross-attention analysis, linear probing, causal intervention) reveal a level-based mechanism. Translation into an explicit peak-centered traversal algorithm (scaffolding map) proven equivalent to the zeta map.
Distributed approach for constrained multi-agent reinforcement learning combining state-augmented policy learning with consensus over Lagrange multipliers. Agents learn offline policies and coordinate via local communication. Linear scalability to thousands of agents, demonstrated on smart grid demand response.
Unified framework for gradient aggregation in multi-objective optimization. Authors establish convergence rates to Pareto stationarity via sufficient alignment condition, showing non-conflicting directions within gradient convex hull ensure convergence. Introduces capped MGDA from CVaR formulation, validated on synthetic and practical benchmarks.
Study of black-box LLM distillation through bounded behavioral indistinguishability. Authors evaluate Qwen and Llama pairs with 5,000-prompt suite, showing LoRA improves semantic similarity (0.788→0.862 for Qwen, 0.814→0.874 for Llama) but leaves detectable behavioral differences exploitable by adversaries.
New method to accelerate diffusion-based language models (dLLMs). Temporal-Spatial Parallel Decoding (TSPD) and Confidence Extrapolation (CE) reduce unnecessary denoising iterations by analyzing token-wise trajectories and forecasting future logit trends without additional training.
SAGE is an adaptive gate using von Mises-Fisher density estimation to control memory evolution in agentic LLMs. It classifies candidate facts as ADD (novel), NOOP (redundant), or MERGE (uncertain), reducing expensive LLM calls. On LoCoMo, SAGE cuts API cost by 3.4× and latency by 2.5× with GPT-4o-mini.
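A three-way gate of this kind can be sketched with a simple stand-in for the density estimate. A von Mises-Fisher log-density grows with the cosine similarity between a unit vector and the distribution's mean direction, so the example below thresholds cosine similarity to the nearest stored embedding instead of fitting a full vMF model. The thresholds `low` and `high` and the function names are hypothetical, and this simplification is not SAGE's actual estimator.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def gate(candidate, memory, low=0.3, high=0.9):
    """Classify a candidate fact embedding against stored embeddings."""
    if not memory:
        return "ADD"
    best = max(cosine(candidate, stored) for stored in memory)
    if best >= high:
        return "NOOP"   # close to something already stored: redundant
    if best <= low:
        return "ADD"    # far from everything stored: novel
    return "MERGE"      # in between: uncertain, needs closer inspection


if __name__ == "__main__":
    memory = [[1.0, 0.0]]
    for candidate in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
        print(gate(candidate, memory))
```

Only the middle band would need a more expensive check, which is one plausible way a gate of this shape could reduce the number of LLM calls.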
TeachObs is a human-validated multimodal benchmark for classroom video analysis. It contains 30 public lessons from 8 countries split into 5,158 15-second scenes, annotated by 7 researchers with 39 observation codes (20 visual, 19 non-visual). Evaluation of 5 vision-capable LLMs across 3 tasks: no single model consistently outperforms others.
EUDAIMONIA is a benchmark evaluating harmful social dynamics in LLMs. It contains 969 user inputs and 3,147 design-violation checks, testing 22 recent models. Claude-Opus-4.7 and GPT-5.5 violate 30.7% and 27.2% of checks respectively, revealing persistent social-alignment failures not resolved by extended thinking.
Evaluation of semantic stability in 16 LLMs (general-purpose and medical) under clinically equivalent prompt reformulations. Proposes NLI-based verification framework and three sensitivity metrics (MVS, ΔC, WCI). Finding: domain specialization does not consistently improve robustness to meaning-preserving variations.