Mellum 2 12B A2.5B
JetBrains releases Mellum 2, a coding-focused 12B/2.5B MoE. Reasoning performance matches Qwen 3.5 9B, but the model underperforms Qwen 3.5 4B on general tasks. Technical report published.
Comparative benchmark of MTP (Multi-Token Prediction) quantizations from unsloth and bartowski on Qwen 3.5-4B, 3.5-9B, and 3.6-27B. Bartowski uses Q8_0 for the MTP head, which produces larger files. Tests for Snapdragon with Q4_0, IQ4_NL, Q4_1, MXFP4_MOE, Q8_0 limited to 24GB VRAM RTX 3090. Unsloth is generally faster in decoding throughput and more VRAM-efficient.
VibeETL: open-source visual ETL platform built in 3 months by former data scientist. Polars + Rust backend, React Flow frontend with native BFS layout algorithm. Zero external dependencies, sandboxed Python execution (30s timeout). Lightweight Alteryx alternative.
Theoretical paper on universal multiclass transductive online learning with unbounded label space. Characterizes learnability: only two possible optimal rates (bounded or logarithmic). Introduces LCLL tree combinatorial structure and extends results to agnostic and stochastic settings.
Formal study of calibration for probabilistic label ranking. Authors define a hierarchy of notions (full rankings, sub-rankings, top-k) and show popular models are poorly calibrated. Application to RLHF reward models reveals calibration and accuracy are not perfectly correlated.
DSFM (Dual-Spectral Flow Matching) generates synthetic fMRI time series by combining discrete wavelet transform (DWT) and discrete cosine transform (DCT) with spectral flow matching. The model captures non-stationarity and spatiotemporal dynamics of BOLD signals to improve brain network classification.
Unicorn, a multi-dataset pretraining framework, bridges the trade-off between channel-independent models (scalable but ignoring dependencies) and channel-dependent models (expressive but dimension-bounded). Using a latent prototype codebook, it projects heterogeneous channels into a shared space to learn identity-agnostic, reusable correlation patterns transferable across domains.
Novel MADQI evaluation metric for unsupervised anomaly detection in maritime AIS datasets. Combines four indices (ARC, PPS, SDS, ECE) through automatic normalization. Achieves MADQI score of 80.37% on AIS data, with ECE=0.907 and ARC=1.000 for detecting abnormal vessel behavior.
Formalizes causal pathways for rare events in structural equation models. Proposes a formal definition of causal pathways and identifies conditions where testable implications depend only on the causal abstraction defined by the rare event pathway, rather than the full causal graph.
COLLEAGUE.SKILL is an automated trace-to-skill distillation system for generating person-grounded AI skills via expert knowledge extraction. The system produces versioned packages with two coordinated tracks: capability (practices, mental models, decision heuristics) and bounded behavior (communication style, interaction rules). 18.5k GitHub stars, 215 skills from 165 contributors.
XOResNet introduces OR-ADD shortcut connections and XOR meta-residuals to improve learning in deep spiking neural networks. Tested on Fashion-MNIST, CIFAR-10, CIFAR-100, and miniImageNet, the model outperforms existing SNNs by reducing spike redundancy and information loss.
Reinforcement learning framework for autonomous driving using uncertainty-aware expert advice with adaptive thresholds. Epistemic and aleatoric uncertainty trigger expert intervention; commitment-cooldown strategy prevents long-term dependence. CARLA experiments: +5-7% success rate vs IQN baseline.
√LTS algorithm for tree search with implicit rerooting. Three rerooter designs proposed: clustering-based on state-space structure, heuristic-based with cost-to-go estimates, and hybrid. Avoids explicit subgoal generation, reduces computational overhead, and achieves state-of-the-art online training efficiency on tested domains.
Researchers reframe healthcare mechanism design as program synthesis for LLMs. Medi-Sim, a multi-agent simulator, evaluates rule programs against strategic provider responses (coding, selection, delay, effort, triage). LLM-guided evolutionary code search synthesizes a mixed-objective program that eliminates up-coding, halves rejections, and retains baseline profitability.
AdaCoM, an external LLM system, manages context for frozen LLM agents via reinforcement learning on long-horizon tasks (web search, deep research). Learned strategies reveal a Fidelity-Reliability Trade-off: high-performing agents benefit from higher-fidelity context preservation, while lower-performing agents require aggressive compression.
Unified Gait2Hip-60 benchmark comparing LSTM, Transformer, and Mamba to predict hip muscle forces and joint moments from gait kinematics. Transformer outperforms other models (R²=0.819 for forces, R²=0.862 for moments). External validation on 9 femoral head osteonecrosis patients shows moderate generalization (R²=0.537–0.569).
SCALE is a self-improving framework for web agents using MLLMs. It employs three adversarial roles (Selector, Predictor, Judger) to autonomously explore agent limitations and expand cognitive boundaries. SCALE-Hop optimizes global planning via graph exploration. A SCALE-20k dataset from 19 real websites with 20k structured demonstrations validates the approach across multiple MLLMs.
EGGROLL, a low-rank factorization of Evolution Strategies perturbations, reduces memory complexity from O(mn) to O(r(m+n)) for gradient-free training of Spiking Neural Networks. On N-MNIST, the method achieves 79.21% test accuracy with 2.23× speedup versus full-rank ES, enabling on-chip learning on neuromorphic hardware without surrogate gradients.
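The memory reduction from O(mn) to O(r(m+n)) can be illustrated with a minimal sketch. It assumes the perturbation of an m x n weight matrix is stored as two thin factors A (m x r) and B (n x r); the function names, the Gaussian sampling, and the 1/sqrt(r) scaling are illustrative assumptions rather than details taken from the summary.

```python
import math
import random

def full_rank_param_count(m, n):
    """Numbers stored for one dense m x n perturbation."""
    return m * n

def low_rank_param_count(m, n, r):
    """Numbers stored for the factors A (m x r) and B (n x r)."""
    return r * (m + n)

def sample_low_rank_perturbation(m, n, r, rng):
    """Sample both factors with i.i.d. standard normal entries (assumed)."""
    A = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(m)]
    B = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(n)]
    return A, B

def perturbation_entry(A, B, i, j):
    """Entry (i, j) of A @ B.T / sqrt(r), without building the full matrix."""
    r = len(A[0])
    return sum(A[i][k] * B[j][k] for k in range(r)) / math.sqrt(r)

rng = random.Random(0)
m, n, r = 256, 128, 4
A, B = sample_low_rank_perturbation(m, n, r, rng)
print(full_rank_param_count(m, n))    # 32768
print(low_rank_param_count(m, n, r))  # 1536
```

For these hypothetical sizes the factored form stores about 21 times fewer numbers per perturbation, and any single entry can still be recovered on demand.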
Method to align speech and co-speech gestures using semantic motion anchors: discretizes 3D gestures into motion primitives, verbalizes them into structured descriptions, and provides contrastive supervision. 8.2% R@1 improvement on BEAT2; retrieved gestures are semantically meaningful rather than generic motion patterns.
New method to accelerate diffusion-based language models (dLLMs). Temporal-Spatial Parallel Decoding (TSPD) and Confidence Extrapolation (CE) reduce unnecessary denoising iterations by analyzing token-wise trajectories and forecasting future logit trends without additional training.
Study of black-box LLM distillation through bounded behavioral indistinguishability. Authors evaluate Qwen and Llama pairs with 5,000-prompt suite, showing LoRA improves semantic similarity (0.788→0.862 for Qwen, 0.814→0.874 for Llama) but leaves detectable behavioral differences exploitable by adversaries.
Unified framework for gradient aggregation in multi-objective optimization. Authors establish convergence rates to Pareto stationarity via sufficient alignment condition, showing non-conflicting directions within gradient convex hull ensure convergence. Introduces capped MGDA from CVaR formulation, validated on synthetic and practical benchmarks.
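The idea of a non-conflicting direction inside the gradient convex hull can be sketched for the two-objective case, where the classic MGDA minimum-norm point has a closed form. This is the textbook two-gradient formula, not the capped MGDA variant from the paper; the function names are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def min_norm_direction(g1, g2):
    """Minimum-norm point gamma*g1 + (1-gamma)*g2 with gamma in [0, 1]."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:
        # Identical gradients: any convex combination is the same vector.
        return list(g1), 1.0
    # Unconstrained minimizer of ||g2 + gamma*(g1 - g2)||^2, then clipped.
    gamma = dot([b - a for a, b in zip(g1, g2)], g2) / denom
    gamma = max(0.0, min(1.0, gamma))
    d = [gamma * a + (1.0 - gamma) * b for a, b in zip(g1, g2)]
    return d, gamma

# Two partially conflicting gradients (negative inner product).
g1 = [1.0, 0.0]
g2 = [-1.0, 1.0]
d, gamma = min_norm_direction(g1, g2)
print(gamma, d)
print(dot(d, g1) >= 0.0, dot(d, g2) >= 0.0)  # True True
```

Stepping along -d decreases both objectives to first order because d has a non-negative inner product with each gradient.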
arXiv study on iterative refinement of LLM-generated reward functions for sparse structured RL. Authors identify two dominant failure modes (reward flooding, semantic misunderstanding) and propose diagnostic-driven refinement guided by failure-mode taxonomy. Results: DoorKey-8x8 improves from 2.3% to 97.6%, KeyCorridor from 31.2% to 86.7%. Limitations: method restricted to PPO and sparse structured tasks.
Process-Level Latent Variable Model (PLVM) predicts future behavioral strategies from partial cross-task process traces. Tested on PowerWash Simulator: fusing traces from two cleaning tasks predicts whether a player adopts "Zone Planner" or "Zone Hopper" strategy on unseen Fire Station level. Applicable to adaptive systems (tutors, games, human-AI collaboration).
Benchmarking of 5 uncertainty quantification methods (Delta, Bayesian Monte Carlo Dropout, Bootstrap, LUBE, MVE) for turbine gas temperature degradation prediction. Evaluation on real dataset using coverage probability and prediction interval width metrics. Trade-offs identified between accuracy and reliability.
CoSee, an auditing framework, analyzes failure modes of modular visual reasoning systems using shared working memory. On 4B–8B models, two dominant failure modes emerge: Noise Reinforcement (reusing ungrounded notes) and Policy Collapse (under-specified answers). The study shows naive shared workspaces amplify hallucinations without explicit verification.
XLGoBench is a benchmark of synthetic algorithmic tasks to detect cross-lingual gaps in LLM abilities. The benchmark is commensurate across languages, scalable (variable complexity), quantifiable (objective correctness), and transparent (auditable templates). Experiments reveal persistent cross-lingual gaps in multiple state-of-the-art models.
Scientific ML framework for turbine Remaining Useful Life (RUL) prediction. Shared encoder (CNN + bidirectional LSTM + attention pooling) with task-specific heads predicts turbine gas temperature, Delta TGT, and RUL with quantified uncertainty intervals. Evaluated on heterogeneous real-world fleet data using MAE, PICP, MPIW, and coverage-width criterion metrics.
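For readers unfamiliar with the interval metrics, here is a minimal sketch of PICP (prediction interval coverage probability) and MPIW (mean prediction interval width) using their standard definitions. The data values are made up for illustration, and the coverage-width criterion is omitted because its exact form varies between papers.

```python
def picp(y_true, lower, upper):
    """Fraction of targets that fall inside their prediction interval."""
    inside = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return inside / len(y_true)

def mpiw(lower, upper):
    """Average width of the prediction intervals."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)

# Hypothetical targets and interval bounds.
y_true = [10.0, 12.0, 15.0, 20.0]
lower = [9.0, 13.0, 14.0, 18.0]
upper = [11.0, 14.0, 16.0, 19.0]

print(picp(y_true, lower, upper))  # 0.5
print(mpiw(lower, upper))          # 1.5
```

A useful interval estimator keeps PICP near the nominal coverage level while keeping MPIW small, which is the trade-off these metrics expose.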
Theoretical paper defining pairwise reference alignment as an ordinal observable for language model evaluation. Formulates statistical framework to measure whether a model ranks preferred responses above rejected responses, with finite-sample estimators and concentration bounds. Empirical validation on Qwen2.5 and RewardBench.
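A minimal sketch of what such a finite-sample estimator could look like: the empirical fraction of pairs in which the preferred response scores strictly higher, paired with a Hoeffding-style confidence half-width. The Hoeffding bound is a standard stand-in here and is an assumption; the paper's own estimators and bounds may differ. Scores are made up.

```python
import math

def pairwise_alignment(scores_preferred, scores_rejected):
    """Fraction of pairs where the preferred response scores strictly higher."""
    n = len(scores_preferred)
    wins = sum(1 for p, r in zip(scores_preferred, scores_rejected) if p > r)
    return wins / n

def hoeffding_half_width(n, delta):
    """Half-width eps such that |estimate - truth| <= eps with prob. >= 1 - delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

# Hypothetical model scores for four (preferred, rejected) pairs.
preferred = [0.9, 0.2, 0.7, 0.5]
rejected = [0.1, 0.4, 0.3, 0.5]

print(pairwise_alignment(preferred, rejected))  # 0.5 (the tie counts as a miss)
print(hoeffding_half_width(n=200, delta=0.05))
```

Because the observable is ordinal, only the sign of each score difference matters, so the estimate is invariant to any strictly increasing rescaling of the scores.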
Study of alignment between LLM uncertainty and human uncertainty through behavioral analysis and internal activation patterns. Authors measure calibration and alignment across multiple-choice and open-ended factual recall datasets, assessing impact of instruction fine-tuning.
CobSeg is a multi-branch architecture for dialogue topic segmentation. It separates semantic continuity from lexical transitions and uses boundary informativeness weighting. Evaluated on 5 benchmarks, it reduces Pk by 0.7 points on VHF and achieves Pk=1.0 on DialSeg711, without LLM calls at inference.
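For context on the Pk figures, the following is a minimal sketch of the standard Pk segmentation error (lower is better). It assumes each utterance carries a segment id that is unique per segment, and sets the window k to half the mean reference segment length; papers often report the value multiplied by 100. The example labels are made up.

```python
def pk(reference, hypothesis, k=None):
    """Pk: share of window positions where reference and hypothesis disagree
    on whether utterances i and i + k belong to the same segment."""
    n = len(reference)
    if k is None:
        num_segments = len(set(reference))  # assumes one unique id per segment
        k = max(1, round(n / num_segments / 2))
    errors = 0
    for i in range(n - k):
        same_ref = reference[i] == reference[i + k]
        same_hyp = hypothesis[i] == hypothesis[i + k]
        if same_ref != same_hyp:
            errors += 1
    return errors / (n - k)

# Hypothetical dialogue of 8 utterances; the predicted boundary is 2 turns late.
reference = [0, 0, 0, 0, 1, 1, 1, 1]
hypothesis = [0, 0, 0, 0, 0, 0, 1, 1]

print(pk(reference, reference))   # 0.0
print(pk(reference, hypothesis))  # 4 of 6 windows disagree
```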
Method to establish correspondences between embedding vectors from different black-box encoders. Exploits local geometric consistency of independently trained contrastive encoders: short-range distances preserved up to scale factor. Uses iterative reference-based geometric embedding hashing with paired anchors to recover vector links. Code released.
Novel transformer-based architecture for autonomous resource management in heterogeneous satellite clusters (optical and SAR). Uses model-free reinforcement learning for real-time decision-making in Earth Observation missions. Demonstrates significant performance improvements and transferability across varying cluster sizes.
Persona-based evaluation framework for pluralistic alignment in generative AI. Replaces monolithic benchmarks with structured manifold of synthetic cognitive profiles representing diverse human perspectives. Reveals systematic degradation of persona coherence under sequential inference, suggesting need for dynamic regulatory mechanisms.
COMPASS is a safety alignment framework for multi-step LLM search agents. It combines Cognitive Tree Exploration (CTE) to synthesize stealthy attack trajectories and Introspective Step-wise Alignment (ISA) to supervise risky intermediate actions. Results: favorable safety-utility trade-off requiring substantially less training data.
Multi-agent LLM systems assume agreement between agents indicates reliability. Authors show communication induces correlated failures and false consensus. They propose CAGE-CAL, a counterfactual agent-graph calibration framework comparing post-communication dependencies with no-communication scenarios to adjust confidence accordingly.
Method to detect and classify dataset usage in research literature using a multitask GLiNER framework. Combines dataset mention extraction, relation identification, and usage-context classification. Leverages synthetic data generation and LLM-based revalidation to address label scarcity.
Researchers improve multilingual speculative decoding by comparing three strategies: fine-tuning draft models on task-specific data, fine-tuning on unlabeled monolingual corpora, and training n-gram draft models. Evaluation across 11 languages on translation and story generation tasks. N-gram models provide consistent speedups despite lower acceptance rates.
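How an n-gram draft model slots into speculative decoding can be sketched with a toy bigram drafter and greedy verification. Everything here is an illustrative assumption: the corpus, the `target_next` stand-in for the large model, and the exact-match acceptance rule (real systems typically use a probabilistic acceptance test over token distributions).

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Map each token to its most frequent successor in the corpus."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return {prev: c.most_common(1)[0][0] for prev, c in counts.items()}

def draft(bigram, context, k):
    """Propose up to k tokens by following the bigram table greedily."""
    proposal, last = [], context[-1]
    for _ in range(k):
        if last not in bigram:
            break
        last = bigram[last]
        proposal.append(last)
    return proposal

def verify(target_next, context, proposal):
    """Accept drafted tokens while they match the target's greedy choice."""
    accepted = []
    for tok in proposal:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # target's correction ends the step
            return accepted
        accepted.append(tok)
    accepted.append(target_next(context + accepted))  # bonus token
    return accepted

corpus = "the cat sat on the mat and the cat sat on the rug".split()
bigram = train_bigram(corpus)

# Hypothetical target model: deterministically continues a fixed sentence.
sentence = "the cat sat on the mat".split()
def target_next(ctx):
    return sentence[len(ctx)] if len(ctx) < len(sentence) else "<eos>"

proposal = draft(bigram, ["the"], k=4)
print(proposal)                                # ['cat', 'sat', 'on', 'the']
print(verify(target_next, ["the"], proposal))  # ['cat', 'sat', 'on', 'the', 'mat']
```

The drafter costs only a dictionary lookup per token, which is one plausible reason such models can still give speedups when their acceptance rate is lower than that of a neural draft model.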
Comparative study of zero-shot multi-label topic classification using knowledge graphs extracted from documents. Framework tested on 15 LLMs and 8 datasets: keyword-enhanced variant outperforms baseline, graph augmentation helps small models but hurts large ones, and self-consistency decoding increases costs fivefold without performance gains.
Study applying MAP-Elites (quality diversity algorithm) to procedural generation of FPS levels. Two novel representations (Point-Line, Spatial-Layout) improve map characterization. Topological and emergent metrics defined. MESB generates map populations with higher diversity and quality than previous approaches.
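The core MAP-Elites loop can be sketched in a few lines. This is a generic toy version on real-valued vectors, not the study's level representations or metrics: the fitness function, the two-dimensional behavior descriptor, the mutation scale, and the 90% mutation probability are all illustrative assumptions.

```python
import random

def map_elites(fitness, descriptor, dim, bins, iterations, rng):
    """Keep the best solution found in each cell of a descriptor grid."""
    archive = {}  # cell index tuple -> (fitness, solution)

    def try_insert(sol):
        # Descriptor values are assumed to lie in [0, 1].
        cell = tuple(min(bins - 1, int(d * bins)) for d in descriptor(sol))
        fit = fitness(sol)
        if cell not in archive or fit > archive[cell][0]:
            archive[cell] = (fit, sol)

    for _ in range(iterations):
        if archive and rng.random() < 0.9:
            # Mutate a randomly chosen elite with clipped Gaussian noise.
            _, parent = rng.choice(list(archive.values()))
            child = [min(1.0, max(0.0, x + rng.gauss(0.0, 0.1))) for x in parent]
        else:
            child = [rng.random() for _ in range(dim)]
        try_insert(child)
    return archive

def toy_fitness(sol):
    """Higher is better; peaks at the centre of the unit square."""
    return -sum((x - 0.5) ** 2 for x in sol)

def toy_descriptor(sol):
    """Use the two coordinates themselves as behavior features."""
    return (sol[0], sol[1])

archive = map_elites(toy_fitness, toy_descriptor, dim=2, bins=5,
                     iterations=500, rng=random.Random(0))
print(len(archive), "cells filled out of", 5 * 5)
```

The archive size measures diversity (coverage of the descriptor grid), while the stored fitness values measure quality within each niche.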