Page 25 of 192

7679 articles

CAIT: A Syntactic Parsing Toolkit for Child-Adult InTeractions

CAIT is an open-source toolkit for syntactic parsing of child-adult interactions in CHILDES. It includes a dependency parser trained on UD-English-CHILDES, a POS tagger, and a construction tagger. The parser outperforms spaCy and Stanza on this specialized domain.

Open source Benchmarks

arXiv cs.CL·May 20

SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models

SciCustom is a framework for building custom benchmarks to evaluate application-specific scientific capabilities in LLMs. It organizes scientific knowledge into ontology-grounded units, uses multi-model consensus voting to identify relevant units, and generates benchmarks from real data in chemistry and healthcare without expert annotation.

Benchmarks Evals Papers

Reddit r/LocalLLaMA·May 19

Carbon: Decoding the Language of Life

Hugging Face releases Carbon, a family of open-source DNA foundation models. Carbon-3B matches the state of the art (Evo2-7B) while running 275× faster. The approach adapts modern LLM techniques: deterministic 6-mer tokenization, factorized loss (FNS) mid-training, and curation of functional biological data.

Open source Benchmarks Fine-tuning

Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents

KVCapsule: Efficient Sequential KV Cache Compression for Vision-Language Models with Asymmetric Redundancy

FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information

Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management

PH-Dreamer: A Physics-Driven World Model via Port-Hamiltonian Generative Dynamics

Supervising the search process produces reliable and generalizable information-seeking agents

Reasoning Can Be Restored by Correcting a Few Decision Tokens

From Volume to Value: Preference-Aligned Memory Construction for On-Device RAG

Post-Trained MoE Can Skip Half Experts via Self-Distillation

Spatial Blindness in Whole-Slide Multiple Instance Learning

Membership Inference Attacks on Discrete Diffusion Language Models

BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting

Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks

Locally Coherent Parallel Decoding in Diffusion Language Models

GRASP: Graph Agentic Search over Propositions for Multi-hop Question Answering

Surgical Post-Training: Proximal On-Policy Distillation for Reasoning with Knowledge Retention

Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

PRISMat: Policy-Driven, Permutation-Invariant Autoregressive Material Generation

Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs

AdaGraph: A Graph-Native Clustering Algorithm That Overcomes the Curse of Dimensionality and Enables Scientific Discovery

From Prompts to Protocols: An AI Agent for Laboratory Automation

AgentWall: A Runtime Safety Layer for Local AI Agents

Beyond Inference-Time Search: Reinforcement Learning Synthesizes Reusable Solvers

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

PopPy: Opportunistically Exploiting Parallelism in Python Compound AI Applications

Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation

Improved Baselines with Representation Autoencoders

Language-Switching Triggers Take a Latent Detour Through Language Models

The Expressive Power of Low Precision Softmax Transformers with (Summarized) Chain-of-Thought

Mechanistically Interpretable Neural Encoding Reveals Fine-Grained Functional Selectivity in Human Visual Cortex

Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback

Retrieval-Based Multi-Label Legal Annotation: Extensible, Data-Efficient and Hallucination-Free

EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents

FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing

SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark

Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency