Topic

#Voice

In AI, voice refers to speech synthesis and recognition technologies that enable machines to speak or understand human speech. ElevenLabs, for instance, generates realistic synthetic voices from text.

40 Articles

8 Sources

71 Avg. signal

arXiv cs.CL·Jun 18

Speech-Driven End-to-End Language Discrimination towards Chinese Dialects

Paper presents a speech-driven approach to Chinese dialect discrimination that combines MFCC features, an HMM-DNN speech recognition model, an attention mechanism, and a CNN. Evaluation on two benchmark Chinese dialect corpora shows improvement over state-of-the-art methods.

Voice Benchmarks Papers


arXiv cs.CL·Jun 18

Montreal Forced Aligner and the state of speech-to-text alignment in 2026

Montreal Forced Aligner 3.0, the reference tool since 2016 for forced speech-to-text alignment, achieves state-of-the-art performance on English, Japanese, and Korean with boundary errors <15ms. New capabilities: model adaptation, cross-language phone remapping, expanded language/dialect coverage, harmonized IPA dictionaries.

Voice Benchmarks Open source


arXiv cs.CL·Jun 18

Continuous Audio Thinking for Large Audio Language Models

Continuous Audio Thinking (CoAT) adds a continuous latent workspace to large audio language models to preserve acoustic information (phonetics, prosody, affect, pitch) before text generation. Tested on Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo, CoAT improves performance on audio reasoning, music classification, and transcription with no additional decoding cost.

Reasoning Voice Qwen


arXiv cs.LG·Jun 18

ASTRA: A Scalable Next-Generation ATCO Training Simulator with Autonomous Simpilots

ASTRA is an air traffic control training simulator automating pilot roles through speech recognition, instruction interpretation, and response generation. The system reduces Word Error Rate from 107.80% to 23.45% on Singaporean-accented aviation speech, and evaluates trainee radiotelephony communications achieving 91.7% accuracy, 88.2% brevity, and 86.9% completeness scores.

Voice Fine-tuning Evals
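
A Word Error Rate above 100%, like the 107.80% baseline reported above, is possible because WER divides the number of word edits by the reference length, and inserted words count as edits. The following is a minimal sketch of the standard word-level edit-distance definition, not the ASTRA authors' scoring code; the function name and the example phrases are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: four inserted words against a two-word reference.
rate = word_error_rate("descend now", "uh descend descend right now please")
print(f"WER = {rate:.0%}")
```

Because insertions are unbounded, a recognizer that hallucinates extra words on accented or noisy audio can exceed 100% even when it gets every reference word right.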


arXiv cs.CL·Jun 18

Low-resource Language Discrimination Towards Chinese Dialects with Transfer learning and Data Augmentation

CDDTLDA framework for Chinese dialect discrimination with scarce annotation resources. Uses transfer learning on ASR models, data augmentation (speed, pitch, noise), and self-attention to capture shared semantic features. Outperforms state-of-the-art on two benchmark corpora.

Voice Benchmarks
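
The augmentation step mentioned above (speed, pitch, noise) can be illustrated with a small standard-library sketch. This is an assumption-laden illustration, not the CDDTLDA pipeline: the function names are hypothetical, the resampler is naive linear interpolation, and it changes speed and pitch together when the result is played at the original sample rate. Independent pitch shifting needs a time-stretching method and is left out here.

```python
import math
import random


def change_speed(samples, factor):
    """Resample by linear interpolation; factor > 1 shortens the clip."""
    if not samples or factor <= 0:
        raise ValueError("need a non-empty clip and a positive factor")
    n_out = max(1, int(round(len(samples) / factor)))
    out = []
    for k in range(n_out):
        pos = k * factor          # fractional position in the input clip
        i = int(pos)
        if i >= len(samples) - 1:
            out.append(samples[-1])
        else:
            frac = pos - i
            out.append((1 - frac) * samples[i] + frac * samples[i + 1])
    return out


def add_noise(samples, snr_db, rng):
    """Add white Gaussian noise scaled to a target signal-to-noise ratio."""
    if not samples:
        raise ValueError("need a non-empty clip")
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


# One second of a 440 Hz tone at an assumed 16 kHz sample rate.
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
faster = change_speed(tone, 1.1)
noisy = add_noise(tone, snr_db=20, rng=random.Random(0))
print(len(tone), len(faster), len(noisy))
```

Each transform produces a new training example from the same labeled clip, which is how augmentation stretches a scarce annotated corpus.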


Reddit r/LocalLLaMA·Jun 17

I released Inflect-Nano, an ultra-extreme tiny 4.63m parameter TTS model.

Inflect-Nano-v1, a 4.63M parameter TTS model, is the 2nd smallest publicly released speech synthesis model. Comprises acoustic model (3.46M) and vocoder (1.17M), generates 24 kHz English audio. ~17x smaller than Kokoro, ~108x smaller than Chatterbox. Runs locally via PyTorch, suited for embedded devices and offline voice assistants.

Voice Open source Tools


Reddit r/LocalLLaMA·Jun 17

A Year Building a Fully Local Home Voice Assistant · Fulloch

Developer shares 12-month journey building a fully local home voice assistant using open-source models as Alexa alternative. Documents what worked and what didn't throughout the project.

Open source Voice AI Agents


Reddit r/MachineLearning·Jun 17

Mel AI just shared a demo of video-native AI characters that can talk, react, and respond to camera context in real time [N]

Mel AI demonstrates video-native AI characters that talk, lip-sync, show facial reactions, and respond in real time to camera context. The system detects user environment and adapts responses accordingly. This approach moves beyond text-based Character AI (founded by former Google/LaMDA developers).

AI Agents Vision Voice


arXiv cs.CL·Jun 17

Are you speaking my languages? On spoken language adherence in multimodal LLMs

LLM-based ASR systems often misidentify output languages in multilingual contexts. Authors propose three mitigation strategies: zero-shot prompting, supervised fine-tuning, and Chain-of-Thought reasoning to improve language adherence while preserving code-switching flexibility and ASR performance.

Voice Prompt engineering Fine-tuning


arXiv cs.AI·Jun 17

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

SpeechDx is a multi-task benchmark for clinical speech AI covering 12 datasets and 27 tasks across diverse health conditions. Tasks are structured by speech production stages (conceptualization, formulation, articulation). Evaluation of 12 audio encoders shows large-scale speech models outperform domain-specific ones, but none generalize reliably across clinical speech.

Benchmarks Voice Evals


arXiv cs.CL·Jun 17

When Multiple Scripts Matter: Evaluating ASR in Clinical Settings

MultiClin, a clinical ASR benchmark, evaluates speech recognition model robustness to multiscript variability (multiple valid orthographic forms of the same term). Conventional metrics underestimate performance. Script unification consistently yields best ASR performance.

Benchmarks Voice Evals


arXiv cs.CL·Jun 17

Perceptual compensation for tonal context in self-supervised speech models

Study of wav2vec2.0 examining perceptual compensation for tonal context in Mandarin Chinese. Purely self-supervised pre-trained models show no evidence of compensation in embedding similarities. Probing classifiers reveal partial compensation but fail to replicate human performance. Supervised objectives appear necessary to abstract certain phonological regularities.

Papers Evals Voice


arXiv cs.CL·Jun 17

Learning task-specific subspaces via interventional post-training of speech foundation models

Post-training refinement method for speech foundation models using interventional contrastive learning. Transforms entangled representations into separate content and speaker subspaces via interventional dataset and multi-part contrastive loss. Improves out-of-domain speaker verification and keyword spotting performance.

Voice Fine-tuning Papers


arXiv cs.CL·Jun 17

Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings

Comparative study of LLM abilities to predict next speaker, turn changes, and addressee in multi-party conversations. On the AMI corpus, LLMs outperform supervised models and humans in next speaker prediction without audio-visual access. MM-LLMs exceed text-based LLMs but remain below human performance for addressee and turn-change prediction.

Benchmarks Evals Vision


arXiv cs.CL·Jun 17

Improving low-resource ASR using bilingual fine-tuning with language identification: a cross-linguistic evaluation

Study on bilingual fine-tuning for low-resource ASR across 9 language pairs. Uses language identification tokens prepended to input text. Results: bilingual fine-tuning improves performance when language ID accuracy is high; providing the token at inference mitigates low language ID performance.

Voice Fine-tuning Benchmarks
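
The language-ID token idea above can be sketched in a few lines. The `<xx>` token format, the function names, and the sample transcripts are assumptions for illustration; the paper's actual token vocabulary is not given in this summary.

```python
import re

# Assumed token shape: a two- or three-letter language code in angle brackets.
_LID_PATTERN = re.compile(r"^<([a-z]{2,3})>\s*(.*)$", re.DOTALL)


def add_lid_token(transcript: str, lang: str) -> str:
    """Prepend a language-ID token to a training transcript."""
    return f"<{lang}> {transcript}"


def split_lid_token(decoded: str):
    """Split decoded text into (language or None, transcript)."""
    match = _LID_PATTERN.match(decoded)
    if match is None:
        return None, decoded
    return match.group(1), match.group(2)


# Hypothetical bilingual fine-tuning targets for a low-resource language pair.
targets = [
    add_lid_token("good morning", "en"),
    add_lid_token("habari ya asubuhi", "sw"),
]
print(targets)
print(split_lid_token(targets[1]))
```

Supplying the known token as a forced decoding prefix at inference, rather than letting the model predict it, corresponds to the mitigation the summary describes for cases where language-ID accuracy is low.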


Reddit r/LocalLLaMA·Jun 16

A fast, optimised, and open source application for running local AI easily (made for Apple Silicon only)

AeroLLM, open-source app optimized for Apple Silicon, runs local LLMs, TTS, and STT through a GUI. Uses MLX backend for native inference, downloads models from Hugging Face with RAM-based recommendations, exposes optional API endpoint. v0.1.0 released.

Open source Tools Llama


arXiv cs.CL·Jun 16

A Practical Evaluation Method for Long-Form Simultaneous Speech-to-Speech Translation

Practical evaluation method for long-form simultaneous speech-to-speech translation (SimulS2ST) on continuous input. Uses ASR, forced alignment, and sentence embeddings to recover timestamps and align target text to source sentences, then computes sentence-level latency and quality metrics (YAAL, xCOMET). Reveals substantial latency accumulation in current systems on long speech.

Voice Evals Benchmarks


arXiv cs.CL·Jun 16

Evaluating and Preserving Lexical Stress in English-to-Chinese Speech-to-Speech Translation

Study on lexical stress transfer in English-to-Chinese speech-to-speech translation. Authors construct a stress-annotated Mandarin dataset, develop an XLS-R-based stress detector, and propose an objective cross-lingual evaluation metric. A fine-tuned CosyVoice3 S2ST system outperforms existing systems in stress preservation while maintaining competitive translation quality.

Voice Benchmarks Evals


arXiv cs.CL·Jun 15

The Holistic Storage of Verb+Up Phrases in Text-based and Audio-based Language Models

Study of internal representations in text-based LLMs and an ASR model examining whether V+up phrasal verbs develop distinct representations as a function of frequency and predictability. All models show evidence of holistic storage driven by these factors, supporting usage-based theories of language.

Papers Reasoning Voice


arXiv cs.CL·Jun 15

Learning to Hear Hesitation: Continual Learning for Disfluency-Aware ASR

New continual learning approach to improve Automatic Speech Recognition (ASR) on disfluent speech. Researchers introduce explicit disfluency tokens into a pretrained ASR model, then continue training on datasets with varying disfluency distributions. Analysis reveals trade-off between marker learning and ASR performance.

Voice Papers


arXiv cs.CL·Jun 15

MoDiCoL: A Modular Diagnostic Continual Learning Dataset for Robust Speech Recognition

MoDiCoL is a modular continual learning dataset for evaluating ASR robustness under real-world distribution shifts (accents, noise, recording conditions, speech impairments). Authors propose a real-world-inspired curriculum and evaluate three continual learning strategies to analyze how robustness develops, transfers, and is forgotten.

Benchmarks Evals Voice


arXiv cs.CL·Jun 15

BayLing-Duplex: Native Full-Duplex Speech Dialogue with a Single Autoregressive LLM

BayLing-Duplex is a native full-duplex speech language model using a single autoregressive LLM without external VAD module. Fine-tuned on 400K samples with DPO, it achieves 92% turn-taking success and 100% interruption success on InstructS2S-Eval, improving speech-response score from 2.17 to 3.39 over Moshi.

Voice AI Agents Benchmarks


Reddit r/LocalLLaMA·Jun 14

Voice-to-voice chatbot update

Real-time local voice chatbot using Qwen3.5-397B (Unsloth UD-Q3_K_XL), Whisper-small STT, and Orpheus Q4_K_XL TTS with custom SNAC decoder on ONNX. Interruptible with context preservation, 21.3 GB VRAM max on a 24 GB GPU, bf16 KV cache at 131k tokens. GitHub code coming soon.

Qwen Voice Code generation


Reddit r/LocalLLaMA·Jun 13

ZONOS2: real-time TTS with 8B params, 900M active, and high-fidelity voice cloning

Zyphra releases ZONOS2, an open-source TTS model (Apache 2.0) with 8B parameters and 900M active at inference. Sparse MoE focused on zero-shot high-fidelity voice cloning (44.1 kHz DAC). Prosody score 88.7, outperforming Qwen 3 TTS (87.6) and ElevenLabs V3 (83.2). Trained on 6M+ audio hours, reads raw UTF-8 without phonemizer.

Voice Open source Benchmarks


Simon Willison·Jun 12

OpenAI WebRTC Audio Session, now with document context

Simon Willison updated his OpenAI WebRTC tool to support the GPT-Realtime-2 model (GPT-5-class reasoning) and added a document context feature. Users can now paste text to have audio conversations about specific documents directly in the browser.

OpenAI Voice Tools


ActuIA·Jun 12

OVHcloud-Gladia: the voice building block the sovereign cloud was missing

OVH Groupe is in exclusive negotiations to acquire Gladia, a French startup specializing in speech recognition and AI transcription. The acquisition aims to strengthen OVH's sovereign cloud offering by adding native audio processing capabilities.

Voice Open source Business


GitHub Trending·Jun 12

NVIDIA-NeMo / NeMo

NVIDIA NeMo is an open-source framework for building generative AI models: LLMs, multimodal, ASR, and TTS. Designed for researchers and developers, it provides a scalable foundation for training and deployment.

Open source Infrastructure Code generation


arXiv cs.CL·Jun 12

PRISM: Prosody-Integrated Multi-Agent Reasoning Framework for Empathetic Spoken Dialogue

PRISM is a multi-agent framework for empathetic spoken dialogue that decouples speech perception, response generation, and speech synthesis. It introduces a prosody-to-language translation mechanism to stabilize LLM reasoning and integrates external knowledge tools. Results show improvements in empathy, prosodic appropriateness, and response quality across metrics.

Multi-agent Voice AI Agents


arXiv cs.CL·Jun 12

NaturalFlow: Reducing Disruptive Pauses for Natural Speech Flow in Simultaneous Speech-to-Speech Translation

NaturalFlow optimizes simultaneous speech-to-speech translation by reducing inter-chunk pauses to improve acoustic fluency. The framework leverages model-internal signals (linguistic diversity, temporal variability) to balance low latency with natural speech flow, validated on short- and long-form benchmarks.

Voice Papers Benchmarks


Reddit r/LocalLLaMA·Jun 11

How I implemented ASR bias for voice transcription models [Open Source]

Developer implemented ASR biasing in an open source Whisper Flow clone. This transcription technique injects custom vocabulary into the model's system prompt to improve recognition of specific words and phrases. Works with Groq, OpenAI, Deepgram, and local models (whisper.cpp, MLX).

Voice Code generation Open source
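
The biasing technique described above amounts to building a short text prompt from a custom vocabulary and passing it to the transcription model alongside the audio. Below is a minimal sketch of the prompt-building half only; `build_bias_prompt`, the `"Vocabulary: "` prefix, and the character budget are assumptions for illustration, not the project's code or any provider's API.

```python
def build_bias_prompt(terms, max_chars=400):
    """Join unique vocabulary terms into one prompt string within a budget."""
    prefix = "Vocabulary: "
    seen = set()
    kept = []
    length = len(prefix)
    for term in terms:
        cleaned = term.strip()
        key = cleaned.lower()
        if not cleaned or key in seen:
            continue  # skip blanks and case-insensitive duplicates
        extra = len(cleaned) + (2 if kept else 0)  # 2 covers the ", " joiner
        if length + extra > max_chars:
            break  # stop once the budget is spent; earlier terms take priority
        seen.add(key)
        kept.append(cleaned)
        length += extra
    return prefix + ", ".join(kept) if kept else ""


# Hypothetical user dictionary, ordered by priority.
prompt = build_bias_prompt(["Kubernetes", "kubernetes", " Groq ", "", "whisper.cpp"])
print(prompt)
```

The resulting string would then go into whichever prompt field the chosen backend accepts. Prompt length limits differ between backends, so the budget here is a placeholder to tune per backend.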


Reddit r/LocalLLaMA·Jun 11

Infinite Music Glitch on my Arduino with Magenta Realtime 2

User built a local realtime music AI system on ESP32 and MacBook M4 Pro. ESP32 captures voice via MLX Whisper, a Qwen model decides tool calls (add drums, Lo-fi, Jazz, remove guitar), and Magenta Realtime 2 generates music locally over WebSockets.

AI Agents Voice Open source


arXiv cs.AI·Jun 11

MA-DLE: Speech-based Automatic Depression Level Estimation via Memory Augmentation

MA-DLE introduces a memory augmentation method for automatic depression level estimation from speech. The system enhances GRU-extracted features using a selective memory bank (historical temporal features + dynamic memory features based on variability) and a Hierarchical Attention Fusion module. Evaluated on DAIC-WOZ and E-DAIC datasets, it achieves state-of-the-art performance.

Voice Reasoning Benchmarks


arXiv cs.CL·Jun 11

Context-Aware Multimodal Claim Verification in Spoken Dialogues

MAD2 is a benchmark of 1,000 two-speaker audio dialogues (3,368 verifiable claims, ~10h audio) for misinformation detection in conversation. Authors propose calibrated multimodal fusion combining context-aware audio encoder and dialogue-aware text model. Conversational structure improves verification more than misinformation framing.

Benchmarks Vision Voice


arXiv cs.CL·Jun 11

Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents

Afrispeech Semantics evaluates semantic reasoning in audio language models across five tasks: entailment, consistency, plausibility, accent drift, and accent restraint. The study reveals critical limitations in audio reasoning evaluation beyond transcription, particularly regarding accent variation and domain shift effects.

Benchmarks Voice Evals


arXiv cs.CL·Jun 11

Pretrained self-supervised speech models can recognize unseen consonants

Pretrained self-supervised speech models (Wav2Vec2, HuBERT) fine-tuned on Khoisan languages (G|ui, West !Xoon) recognize click consonants more accurately than non-clicks, showing self-supervision generalizes to rare phonemes despite training data skewed toward high-resource languages.

Papers Benchmarks Voice


Reddit r/LocalLLaMA·Jun 11

I wired a fully offline voice loop to Ollama + LM Studio — 100% CPU, no GPU, nothing leaves your machine (Silero VAD + Parakeet STT + Supertonic TTS 3)

Developer builds fully offline voice loop for Ollama + LM Studio. Stack: Silero VAD (voice activity detection), Parakeet TDT 0.6B (ONNX INT8 STT, 25 languages), Supertonic TTS 3 (multilingual ONNX synthesis). CPU-only, zero data leaves machine. Cross-platform (macOS/Linux/Windows), tested on 4-year-old ThinkPad.

Voice Open source Tools


Reddit r/LocalLLaMA·Jun 10

Tried to benchmark Google’s new on-device dictation models (Eloquent) and basically couldn’t

A developer tried to benchmark Eloquent, Google's new on-device dictation app, which runs proprietary models. Roughly 50% of dictations come back truncated (20+ words reduced to 5-10). When transcription completes (15/50 tests), accuracy is competitive (~24% WER vs ~21% for Qwen3-ASR), but the chat-style model often refuses to transcribe instead of producing text.

DeepMind Benchmarks Voice


Reddit r/LocalLLaMA·Jun 10

Anyone gotten Gemma 4 12B (unified audio) to actually attend to speech with a large system prompt?

User reports Gemma 4 12B (unified audio/vision/text model) ignores audio input when system prompt exceeds ~21k tokens. Model works well with minimal prompt but generates generic/hallucinated responses with dense context. Behavior reproduced across vLLM, llama.cpp, and LiteRT-LM. Appears to be an inherent attention saturation limit.

Gemini Voice Multi-agent


arXiv cs.CL·Jun 10

ParaBridge: Bridging Paralinguistic Perception and Dialogue Behavior in Speech Language Models

ParaBridge is an on-policy self-distillation method that teaches Speech Language Models to use paralinguistic cues (tone, emotion, noise) in dialogue. On Qwen3-Omni-thinking, it raises VoxSafeBench SAR from 14.6% to 40.3% and improves EchoMind from 3.27 to 3.92, while preserving general abilities.

Voice Reasoning Fine-tuning


arXiv cs.AI·Jun 10

From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs

Interpretability study on audio-visual LLMs (AVLLMs): traces information flow between audio and visual tokens in Qwen2.5-Omni and Video-SALMONN2 Plus (3B/7B scales). Authors demonstrate audio-visual tokens can be discarded post-information-transfer without prediction degradation, improving inference efficiency.

Vision Voice Evals
