#Fine-tuning

Fine-tuning means continuing to train a pre-trained AI model on a task-specific dataset to adapt it to a particular task. For example, OpenAI fine-tuned GPT-4 to follow instructions more accurately in ChatGPT.

40 Articles

5 Sources

72 Avg. signal

arXiv cs.CL·Jun 18

Towards Scalable Customization and Deployment of Multi-Agent Systems for Enterprise Applications

Framework for customization and efficient deployment of LLM-based multi-agent systems in enterprise settings. Combines continual pretraining, supervised fine-tuning, and preference optimization to adapt compact models to specialized domains. Integrates speculative decoding and FP8 quantization to reduce latency and costs. Achieves 4.48x throughput speedup while maintaining performance.

Multi-agent · Fine-tuning · Business

arXiv cs.LG·Jun 18

ASTRA: A Scalable Next-Generation ATCO Training Simulator with Autonomous Simpilots

ASTRA is an air traffic control training simulator automating pilot roles through speech recognition, instruction interpretation, and response generation. The system reduces Word Error Rate from 107.80% to 23.45% on Singaporean-accented aviation speech, and evaluates trainee radiotelephony communications achieving 91.7% accuracy, 88.2% brevity, and 86.9% completeness scores.

Voice · Fine-tuning · Evals
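
A WER above 100%, like the 107.80% starting point in the ASTRA entry, is possible because insertions count as errors while the denominator is only the reference length. A minimal Python sketch of the standard word-level edit-distance definition follows; the function name and the example utterances are made up for illustration and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: a 2-word reference and a 5-word hypothesis.
# Four edits are needed (1 substitution + 3 insertions), so WER = 4 / 2 = 200%.
example_wer = word_error_rate("descend now", "the send it now please")
```

Because the hypothesis can be arbitrarily longer than the reference, WER has no upper bound.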

arXiv cs.LG·Jun 18

Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression

Structural pruning framework for Mixture-of-Experts models operating at channel level rather than expert level. Attribution-based method reformulates pruning as channel-score coverage maximization. Experiments on DeepSeek and Qwen models achieve 50% structured pruning with 4-bit quantization, 5.27× memory reduction on Qwen3-30B-A3B.

DeepSeek · Qwen · Benchmarks

arXiv cs.AI·Jun 18

ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch

ProfiLLM is an agentic LLM pipeline deployed at DiDi to extract semantic user profiles from massive behavioral logs. The system uses 27 analytical tools to mine platform-scale data and generates utility-aligned profiles, achieving +6.14% AUC improvement and +0.47% GMV gain in A/B testing.

AI Agents · Llama · RAG

arXiv cs.CL·Jun 18

Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

Activation steering improves synthetic data generation for low-resource languages. Two strategies tested: Language Steering (linguistic identity) and Quality Steering (well-formedness). Evaluation across 4 open-source LLMs, 11 languages, classification tasks. Early-layer steering increases diversity and downstream performance.

Prompt engineering · Fine-tuning · Benchmarks

arXiv cs.CL·Jun 18

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

PragReST is a self-supervised framework improving LLM pragmatic reasoning through counterfactual reasoning traces. Without human-labeled data, it combines supervised fine-tuning and reinforcement learning. On 4 benchmarks (PragMega, Ludwig, MetoQA, AltPrag), it gains +5.37% and +5.50% absolute for Qwen3-8B and Qwen3-14B, respectively.

Reasoning · Reinforcement learning · Fine-tuning

arXiv cs.CL·Jun 18

Efficient Financial Language Understanding via Distillation with Synthetic Data

Distillation framework with synthetic data for financial sentiment analysis. Knowledge transfer from large instruction-tuned teacher to compact student models. Clustering-based seed selection generates synthetic examples via few-shot prompting. Compact model outperforms teacher on complex/noisy text with minimal supervision.

Fine-tuning · RAG · Prompt engineering

arXiv cs.LG·Jun 18

CODEBLOCK: Learning to Supervise Code at the Right Granularity

CodeBlock is a structure-aware sparse supervision framework for code LLM fine-tuning. It selects syntactically coherent code blocks rather than isolated tokens, estimating utility via generalized cross-entropy and data-flow signals. On 6 code-generation benchmarks, CodeBlock outperforms full-token SFT while using only 1.9% of supervised response tokens.

Code generation · Fine-tuning · Papers

arXiv cs.LG·Jun 18

DRIFT: Refining Instruction Data via On-Policy Data Attribution

DRIFT refines the SFT training data distribution using on-policy Influence Functions. The method uses model rollouts as validation targets to minimize the proximity gap and correct gradient-norm bias. Experiments on 7B instruction and reasoning models show consistent performance ceiling improvements over existing curation baselines.

Fine-tuning · Reinforcement learning · Evals

arXiv cs.LG·Jun 18

SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

Study shows SFT overtraining can invert model rankings during RLVR fine-tuning. On Qwen2.5-Coder-3B, increasing SFT depth raises pre-RL pass@1 but reduces GRPO pass@10 from 0.806 to 0.481. Pre-RL entropy positively correlates with RLVR outcomes (ρ=+0.69). Two-stage entropy-based diagnostic identifies high-risk checkpoints.

Reinforcement learning · Fine-tuning · Reasoning
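
The pass@1 and pass@10 figures in the entry above measure different things: pass@1 rewards a high per-sample success rate, while pass@10 rewards having at least one success among ten samples, which depends on output diversity. A minimal Python sketch of the widely used unbiased pass@k estimator (from n samples with c correct) is shown below. Whether the paper computes its numbers this way is an assumption, and the sample counts in the example are invented.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given that c out of n generated samples passed.

    pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 20 samples drawn, 2 of them pass the tests.
p1 = pass_at_k(n=20, c=2, k=1)    # 1 - 18/20
p10 = pass_at_k(n=20, c=2, k=10)  # 1 - (10*9)/(20*19)
```

A checkpoint that concentrates probability on one answer can raise pass@1 while leaving little variety for pass@k at larger k to exploit, which is consistent with the entropy-collapse framing in the summary.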

arXiv cs.AI·Jun 18

ARIADNE: Agnostic Routing for Inference-time Adapter DyNamic sElection

ARIADNE is a training-free framework for dynamic adapter selection at inference time. It represents each adapter through centroids computed from embeddings of its training set. Tested on Llama 3.2 1B across 23 NLP tasks, it recovers 97.44% of upper-bound performance and achieves 89.7% average selection accuracy on 44 tasks.

Fine-tuning · Llama · Benchmarks
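
The centroid-based routing idea in the ARIADNE entry can be illustrated with a short Python sketch: average the training-set embeddings of each adapter, then pick the adapter whose centroid is closest to the query embedding. This is a simplified illustration, not the paper's method. Using a single centroid per adapter, cosine similarity as the distance, and the toy 3-dimensional embeddings and adapter names are all assumptions.

```python
from math import sqrt


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_router(training_embeddings):
    """Map each adapter name to the centroid of its training-set embeddings."""
    return {name: centroid(vecs) for name, vecs in training_embeddings.items()}


def select_adapter(router, query_embedding):
    """Return the adapter whose centroid is most similar to the query."""
    return max(router, key=lambda name: cosine(router[name], query_embedding))


# Toy embeddings standing in for the output of a real sentence encoder.
toy_training_embeddings = {
    "summarization": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "translation": [[0.1, 0.9, 0.0], [0.0, 0.8, 0.2]],
}
router = build_router(toy_training_embeddings)
chosen = select_adapter(router, [0.7, 0.3, 0.0])
```

No training step is involved: adding an adapter only requires computing one more centroid, which is what makes this style of routing training-free.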

arXiv cs.AI·Jun 18

User as Engram: Internalizing Per-User Memory as Local Parametric Edits

Novel LLM personalization: store user facts as surgical edits in a hash-keyed memory table (Engram) instead of global LoRA. Reduces memory footprint by 33,000x, improves indirect-reasoning accuracy by 5.6x on average, and enables stacking multiple users without cross-contamination.

Fine-tuning · Reasoning · Papers

Reddit r/MachineLearning·Jun 17

Contrastive targeted SFT as a mechinterp method - has anyone mapped causal dependency interactions this way? [D]

Researcher experiments with iterative targeted SFT combined with mechanistic interpretability on a 31B model. Strategy: contrastive training on specific capability dimensions, then circuit ablation to map causal dependencies between dimensions and optimize future training order.

Fine-tuning · Reasoning · Evals

Reddit r/LocalLLaMA·Jun 17

GLM-5.2 is a win for local AI

GLM-5.2 (744B) under MIT license marks progress for local AI despite its massive footprint. The community can distill its reasoning capabilities into 8B/70B models, significantly improving local setups.

Open source · Fine-tuning · Reasoning

arXiv cs.CL·Jun 17

Self-Generated Error Training for Token Editing in Diffusion Language Models

Training method to improve token editing in diffusion language models (LLaDA2.1). Addresses training-inference mismatch between random corruptions and model's own errors. Uses no-gradient draft pass followed by supervision on self-generated corruptions via LoRA. Reduces edit intensity and transcription errors.

Code generation · Fine-tuning · Reasoning

arXiv cs.CL·Jun 17

Are you speaking my languages? On spoken language adherence in multimodal LLMs

LLM-based ASR systems often misidentify output languages in multilingual contexts. Authors propose three mitigation strategies: zero-shot prompting, supervised fine-tuning, and Chain-of-Thought reasoning to improve language adherence while preserving code-switching flexibility and ASR performance.

Voice · Prompt engineering · Fine-tuning

arXiv cs.LG·Jun 17

ProCUA-SFT Technical Report

ProCUA-SFT is a dataset of 3.1M step-level SFT samples generated automatically from 93K synthetic trajectories across 2,484 application combinations. Fine-tuning UI-TARS 7B on ProCUA-SFT achieves 45.0% on OSWorld, a +18.7 percentage-point improvement over the base model and +35% above AgentNet. The pipeline uses Kimi-K2.5 as task generator, precondition judge, and trajectory executor.

AI Agents · Benchmarks · Fine-tuning

arXiv cs.CL·Jun 17

The Benchmark Illusion: Pruned LLMs Can Pass Multiple Choice but Fail to Answer

Pruned models pass multiple-choice benchmarks but fail in open generation. Multilingual study shows that under high-sparsity pruning (Wanda), correct answers are demoted rather than erased: they reappear with beam search or sampling. Multiple-choice benchmarks overstate the usability of compressed LLMs.

Benchmarks · Evals · Fine-tuning

arXiv cs.CL·Jun 17

Improving low-resource ASR using bilingual fine-tuning with language identification: a cross-linguistic evaluation

Study on bilingual fine-tuning for low-resource ASR across 9 language pairs. Uses language identification tokens prepended to input text. Results: bilingual fine-tuning improves performance when language ID accuracy is high; providing the token at inference mitigates low language ID performance.

Voice · Fine-tuning · Benchmarks

arXiv cs.CL·Jun 17

Learning task-specific subspaces via interventional post-training of speech foundation models

Post-training refinement method for speech foundation models using interventional contrastive learning. Transforms entangled representations into separate content and speaker subspaces via interventional dataset and multi-part contrastive loss. Improves out-of-domain speaker verification and keyword spotting performance.

Voice · Fine-tuning · Papers

arXiv cs.CL·Jun 17

Fine-tuning LLMs for Passive Depression Severity Estimation from AI Mental Health Dialogue

Fine-tuning Qwen3.5-27B to predict PHQ-9 depression scores directly from transcripts of conversations with an AI mental health application. 6,283 users (3,111 ground-truth labels + Claude Opus pseudolabels). Performance: MAE=2.6, RMSE=4.0, r=0.80, AUC=0.91 at PHQ-9≥10 clinical threshold.

Fine-tuning · Reasoning · Qwen

arXiv cs.LG·Jun 17

The Critical Role of Model Selection in Causal Inference: A Comparative Analysis of Classification Models within the InferBERT Framework for Pharmacovigilance

InferBERT combines transformers with Do-calculus to detect causal adverse drug events in pharmacovigilance. Comparative study on AILF and TRAM benchmarks: BioBERT outperforms XGBoost, ALBERT, and Med-LLaMA. Finding: domain-specific pre-training outweighs model size.

Benchmarks · Fine-tuning · AI safety

arXiv cs.LG·Jun 17

PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation

PowerOPD stabilizes on-policy distillation for LLMs by replacing unbounded log-ratio rewards with Box-Cox power transformation. On 6 mathematical reasoning benchmarks with Qwen3, achieves +6.37 Avg@8/+5.71 Pass@8 gains vs vanilla OPD, reduces wall-clock time by 59.2% and peak GPU memory by 23.1%.

Fine-tuning · Reinforcement learning · Benchmarks
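
For context on the PowerOPD entry, the textbook Box-Cox power transformation of a positive ratio $r$ is written out below, together with the property that makes it attractive as a replacement for a log-ratio. How the paper defines $r$, chooses $\lambda$, and handles the upper tail is not stated in the summary, so this is only the standard definition and not the paper's exact reward.

```latex
B_\lambda(r) = \frac{r^{\lambda} - 1}{\lambda}, \quad \lambda \neq 0,
\qquad
\lim_{\lambda \to 0} B_\lambda(r) = \log r .

\text{For } \lambda > 0 \text{ and } r > 0: \quad
B_\lambda(r) > -\frac{1}{\lambda},
\quad \text{whereas} \quad
\log r \to -\infty \text{ as } r \to 0^{+} .
```

With a positive $\lambda$, a vanishing probability ratio can no longer produce an arbitrarily large negative reward, while small $\lambda$ keeps the transform close to the original log-ratio.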

arXiv cs.CL·Jun 17

Revisiting LLM Adaptation for 3D CT Report Generation: A Study of Scaling and Diagnostic Priors

Study of LLM adaptation for 3D CT report generation in medical imaging. RAD3D-Prefix, a lightweight diagnostic-prior framework, integrates image embeddings and multi-label classification logits. Across LLMs from 96.1M to 1.6B parameters, freezing the model and training only projection layers outperforms full fine-tuning, reducing clinical hallucination and overfitting.

Fine-tuning · Vision

arXiv cs.LG·Jun 17

Correct When Paired, Wrong When Split: Decoupling and Editing Modality-Specific Neurons in MLLMs

Researchers identify a critical issue in knowledge editing for MLLMs: updates work with multimodal inputs (text+image) but fail with unimodal inputs alone. They propose DECODE, a method that localizes and decouples modality-specific neurons to propagate edits consistently across all input types.

Fine-tuning · Vision · Evals

arXiv cs.LG·Jun 17

Toward Controllable Catalyst Inverse Design via Large-Scale Autoregressive Pretraining

Generative model based on GPT architecture for inverse design of heterogeneous catalysts. Pretrained on 133 million structures, fine-tuned on ~460,000 optimized structures. Achieves 98% structural validity, 95% optimization validity, and improves screening efficiency 1.5–4× for reaction-targeted catalyst discovery.

Papers · Benchmarks · Fine-tuning

arXiv cs.LG·Jun 17

ReRAM-aware Model Finetuning addressing I-V Non-linearity and Retention Errors

Hardware-aware finetuning method for DNN deployment on ReRAM crossbar arrays. Uses range-shrunk sinh transformation to mitigate I-V non-linearity and incorporates retention errors into regularization loss. Results: no degradation for ResNet18 and DeiT-Tiny and under 2% degradation for MobileNetV3 on ImageNet, and a 1-point F1 drop on SQuAD v2.

Fine-tuning · Benchmarks

arXiv cs.LG·Jun 17

When the Next Step Is Not One Step: Distribution-Aware Execution Modeling for Concurrent Go Programs

7B model fine-tuned to predict next step in concurrent Go programs by learning event distributions rather than single labels. On 798 predictions from real bugs (CockroachDB, Kubernetes, gRPC, etcd), achieves 36.2% accuracy with <1000 traces, outperforming Gemini 3.5 Flash zero-shot (34.8%). Dataset, adapters, and tooling released.

Code generation · Benchmarks · Fine-tuning

arXiv cs.AI·Jun 17

MathVis-Fine: Aligning Visual Supervision with Necessity via Progressive Dependency-Guided Training for Multimodal Mathematical Reasoning

MathVis-Fine introduces a framework for fine-grained visual dependency modeling in mathematical reasoning. A new dataset augments visual annotations with visual dependency ratings. Two-stage progressive training balances answer correctness and visual grounding rewards according to each sample's intrinsic visual necessity, reducing reward bias.

Reasoning · Vision · Benchmarks

Reddit r/LocalLLaMA·Jun 16

Get in here: Community model build thread

A Reddit thread proposes building a community model through distributed compute using a Mixture-of-Experts (MoE) approach. The 'Branch-Train-Stitch' strategy distributes a dense prototype model to participants, who train it independently on their own hardware; the resulting submodels are then merged into an MoE. Key decisions include prototype size (2B or 7B) based on available VRAM.

Open source · Fine-tuning

Reddit r/LocalLLaMA·Jun 16

Qwen3.6 27B quants

User benchmarks Qwen3.6 27B extreme quantization (IQ3 XXS turbo4) vs Q8 on code review task. IQ3 XXS (5min, 1230pp/50tg) generates comparable recommendations to Q8 (1h56m, 306pp/3tg). Finding: aggressive quantization adequate for coding tasks with good prompting.

Qwen · Code generation · Fine-tuning

Reddit r/LocalLLaMA·Jun 16

Be wary of Qwen/Claude distillations - they're often worse than the base model

Qwen/Claude distillations circulating on r/LocalLLaMA (Qwopus, Fable 5 on Qwen 3.6) use 4k-10k training samples, insufficient to improve performance. Compared to 700k samples in official DeepSeek-R1 distillations, these models don't exceed base Qwen and slightly degrade quality despite different reasoning style.

Qwen · Claude · Fine-tuning

arXiv cs.LG·Jun 16

FastMix: Fast Data Mixture Optimization via Gradient Descent

FastMix automates data mixture optimization for model training via gradient descent. The method reformulates mixture selection as a bilevel optimization problem, jointly optimizing mixture coefficients and model parameters. A single proxy model suffices, drastically reducing search cost compared to prior approaches.

Fine-tuning · Benchmarks · Papers

arXiv cs.LG·Jun 16

StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling

StarOR synergizes Monte Carlo Tree Search with test-time reinforcement learning for optimization modeling. The framework decomposes modeling into four stages, refines a transient LoRA adapter via GRPO at each node, and employs an unsupervised multi-faceted reward system. Achieves state-of-the-art results across five optimization benchmarks with a 4B backbone.

Reasoning · Reinforcement learning · Fine-tuning

arXiv cs.AI·Jun 16

ChatPlanner: A Large Language Model Framework for Personalized Public Transit Routing

ChatPlanner is a framework using fine-tuned LLMs with RAG to extract user preferences from natural language and integrate them into public transit routing optimization. Evaluated on 8 personas and 5 contexts, the system combines fine-tuning (output structure) and RAG (query-specific context) to identify solutions overlooked by existing planners.

RAG · Fine-tuning · Prompt engineering

arXiv cs.CL·Jun 16

Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

NVIDIA introduces Nemotron 3 Ultra, a 550B-parameter (55B active) Mamba-Transformer MoE hybrid model pre-trained on 20T tokens with 1M context length. Uses SFT, RL, and multi-teacher distillation. Achieves ~6x inference throughput of public LLMs with comparable accuracy. Base, post-trained, and quantized checkpoints, training data, and recipe open-sourced on HuggingFace.

AI Agents · Reasoning · Open source

arXiv cs.CL·Jun 16

Beyond Layer Importance in Layer-wise Sparsity: An Inter-Layer Perturbation-Absorption Perspective

Study on layer-wise redundancy in LLMs. Authors characterize how layers absorb or amplify perturbations during pruning: early layers amplify, middle and late layers absorb. They propose absorption-aware correction using a per-layer absorption coefficient, improving on OWL and AlphaPruning with a 7.13% perplexity reduction and a 1.02% zero-shot accuracy gain at 70% sparsity.

Papers · Benchmarks · Fine-tuning

arXiv cs.CL·Jun 16

Spokes: Optimizing for Diverse Pretraining Data Selection

SPOKES optimizes pretraining data selection through a probabilistic diversification framework based on G-Vendi score and exponentiated gradient descent. On FineWeb and DCLM, the method improves downstream performance by +1.5 and +1.4 points when jointly optimizing quality and diversity, outperforming semantic deduplication.

Benchmarks · Papers · Fine-tuning

arXiv cs.CL·Jun 16

CHILLGuard: Towards Fine-Grained Chinese LLM Safety Guardrail with Scalable Data Construction and Model-aware Preference Alignment

CHILLGuard is a safety guardrail system for Chinese LLMs with fine-grained taxonomy (5 macro, 31 micro categories). Authors construct 405k training samples via RAG and prompt rewriting, plus 51k annotated test samples. Model achieves +15.92% F1 improvement over Qwen3Guard-8B-Strict using Direct Preference Optimization.

AI safety · Alignment · Fine-tuning

arXiv cs.CL·Jun 16

SHARD: Safe and Helpful Alignment via Self-Reframing Distillation

SHARD is a self-reframing distillation method to improve the balance between safety and helpfulness in LLMs. It rewrites sensitive prompts using philosophical guidelines to surface benign intent, reframes responses into safer and more helpful versions, then fine-tunes the model on the self-reframed responses. Tested on DNA and LINGUASAFE, SHARD improves helpfulness while preserving safety.

Fine-tuning · AI safety · Alignment
