# Llama

Llama Open source Infrastructure

llama.cpp now supports model management (downloading etc) via API

llama.cpp merged PR #23976, which adds model management via the API: on-demand downloading, loading, and unloading of models from a directory. A UI is coming soon. The full deployment and management lifecycle can be handled through the API alone.

Llama Open source Infrastructure

llama.cpp - how to free up even more space on your GPU

llama.cpp offers several flags for freeing GPU memory: --no-mmproj-offload frees 1GB for vision models, --cache-type-k and --cache-type-v quantize the KV cache to cut its size by 50-75%, and --spec-draft-n-max=2 tunes speculative decoding. Flash attention is enabled by default. Tested on Qwen 3.6-27B with 150k context on an RTX 3090.
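
The 50-75% KV cache figure can be sanity-checked with a little arithmetic. The sketch below is an illustration, not code from the post: it assumes K and V use the same cache type, uses the standard ggml block sizes for q8_0 and q4_0, and the model shape passed in is hypothetical rather than the real Qwen configuration.

```python
# Storage cost per cached value. f16 uses 2 bytes; q8_0 and q4_0 pack
# 32 values into 34-byte and 18-byte blocks (payload plus a 2-byte scale).
BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Combined size of the K and V caches in bytes (same type for both)."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = K plus V
    return values * BYTES_PER_VALUE[cache_type]


def reduction_vs_f16(cache_type):
    """Fraction of KV-cache memory saved relative to f16."""
    return 1.0 - BYTES_PER_VALUE[cache_type] / BYTES_PER_VALUE["f16"]


if __name__ == "__main__":
    # HYPOTHETICAL model shape, chosen only for illustration.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=150_000)
    for cache_type in BYTES_PER_VALUE:
        gib = kv_cache_bytes(cache_type=cache_type, **shape) / 2**30
        saved = reduction_vs_f16(cache_type)
        print(f"{cache_type}: {gib:.2f} GiB ({saved:.0%} smaller than f16)")
```

With these block sizes q8_0 saves about 47% and q4_0 about 72% relative to f16, which is roughly the 50-75% range quoted in the summary.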

Local models went from mostly useless to actually useful really fast. What changed?

Local models shifted from marginal tools to viable solutions in one year. Gemma, Qwen, GLM, Kimi now replace some API calls for coding, private documents, and local workflows, though gaps remain on complex tasks requiring planning and error correction.

Llama Open source Qwen

Llama Prompt engineering Evals

Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories

Analysis of low narrative diversity in LLM-generated stories. The author examines why models produce repetitive tales with similar characters and structures despite varied prompts.

Hacker News (AI)·Jun 16

GPT‑NL: a sovereign language model for the Netherlands

GPT-NL is a sovereign language model trained for Dutch and developed in the Netherlands. The project aims to reduce dependence on American models and preserve linguistic and technological independence.

Open source Llama

Llama Open source Benchmarks

Glimmer 1 - Glint Research. A foundational 10,000 parameter language model

Glint Research introduces Glimmer 1, a foundational 10k parameter language model trained on 500K tokens of FineWeb-Edu. Standard Llama architecture with 16 hidden dims, 2 layers, 4 attention heads, 512 token context window. Benchmarks: arc_easy 25.46%, wikitext-2 byte perplexity 14.73.

Open source Llama Alignment

[Article] The Case For Open-Weight Models And Why We Can't Trust Frontier Labs | provos.org

Article arguing for open-weight models against frontier labs. Criticizes power concentration among few companies and advocates for accessibility and transparency of AI model weights.

Llama Code generation Benchmarks

Nex-N2 Pro is the real deal

Nex-N2 Pro (rebranded as Rio-3.5) shows strong performance on coding benchmarks on a 128GB macOS machine. The user reports 100% consistency without hallucinations on private llama.cpp tests, outperforming all previously tested models except GPT-5.x.

A fast, optimised, and open source application for running local AI easily (made for Apple Silicon only)

AeroLLM is an open-source app optimized for Apple Silicon that runs local LLMs, TTS, and STT through a GUI. It uses the MLX backend for native inference, downloads models from Hugging Face with RAM-based recommendations, and exposes an optional API endpoint. v0.1.0 released.

Open source Tools Llama

arXiv cs.AI·Jun 16

Frame-Conditioned Moral Computation in LLaMA 3.1-8B-Instruct: A Mechanistic Interpretability Audit of Ethical Reasoning

Mechanistic interpretability audit of LLaMA 3.1-8B-Instruct on 54 moral prompts using the Transluce platform. Reveals a Situational Anchor Effect: domain-specific representations dominate activation rankings regardless of ethical content. Ethics capacity remains constant, but salience is highly sensitive to the prompt's interpretive frame. Identifies a candidate ethics neuron (L16/N3837) that is stable across temperatures.

Llama Alignment Evals

Open source Llama Code generation

Nex2 mini Phase Twin - 16gb footprint, 30b model

Nex2 mini Phase Twin: 30B model optimized for 16GB VRAM. Designed for Intel A770 cards, runs on single GPU and scales with two. Achieves 89 tok/s on A770 16GB. Auto-calibrates to hardware.

Reddit r/LocalLLaMA·Jun 15

UI/svg block rendering by ServeurpersoCom · Pull Request #24080 · ggml-org/llama.cpp

Pull request #24080 on llama.cpp adds UI/SVG block rendering. Video demonstration shows SVG rendering capabilities integrated into the project.

Llama Open source Tools

arXiv cs.LG·Jun 15

Efficient On-Device Diffusion LLM Inference with Mobile NPU

llada.cpp is the first NPU-aware inference framework accelerating diffusion LLMs on mobile devices. Three techniques optimize execution: Multi-Block Speculative Decoding, Dual-Path Progressive Revision, and Swap-Optimized Memory Runtime. LLaDA-8B achieves 17x-42x latency reduction vs CPU baseline.

arXiv cs.LG·Jun 15

Small LLMs: Pruning vs. Training from Scratch

Comparative study of pruning vs. training from scratch on Llama-3.1-8B (ratios 0.5–0.8, 6 methods). Pruning outperforms random initialization at an equal token budget, but the advantage narrows with more tokens. Fine-grained pruning retains its benefit even with an unlimited budget; coarse structured pruning can be matched by training from scratch.

Llama Benchmarks Papers

Reddit r/LocalLLaMA·Jun 14

EAGLE support merged into llama.cpp

EAGLE support has been merged into llama.cpp. EAGLE is a speculative decoding technique for language models: a lightweight draft head proposes several tokens at once, and the main model verifies them in a single pass, which reduces latency.

Open source Llama AI safety
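
The draft-then-verify idea behind this family of techniques can be shown with a toy example. This is a minimal sketch of generic greedy speculative decoding, not EAGLE's actual draft head or llama.cpp's implementation; `target_next` and `draft_next` are hypothetical stand-ins for the main and draft models, and tokens are plain integers.

```python
def speculative_step(target_next, draft_next, prefix, k):
    """One draft-then-verify round of greedy speculative decoding.

    Returns the tokens accepted this round (between 1 and k + 1).
    """
    # 1. The cheap draft model proposes k tokens autoregressively.
    draft = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2. The target model checks each drafted position. A real system
    #    scores all positions in one batched forward pass; this toy
    #    version calls the target once per position for clarity.
    accepted = []
    ctx = list(prefix)
    for tok in draft:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the target adds one bonus token.
    accepted.append(target_next(ctx))
    return accepted


def toy_target(ctx):
    # Hypothetical "main model": counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx):
    # Hypothetical "draft model": agrees with the target except after a 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step(toy_target, toy_draft, [1], k=4))  # [2, 3, 4, 5]
    print(speculative_step(toy_target, toy_draft, [5], k=3))  # [6, 7, 8, 9]
```

The output always matches what the target model alone would have produced; the gain comes from accepting several tokens per target pass when the draft guesses well.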

Hacker News (AI)·Jun 14

Cloud-based LLM gold rush is ending

The cloud-based LLM gold rush is ending. Inference costs are falling, competition is intensifying, and margins are compressing. Providers must innovate beyond API access to survive.

Business Llama OpenAI

Reddit r/LocalLLaMA·Jun 14

Introducing the Heretic Grimoire: The takedown-resilient, local-first backup system that keeps uncensored models available forever

Heretic announces a decentralized backup system for uncensored local models. Models compressed to 9 KB enable phone storage. The project builds takedown-resilient infrastructure with official website and redundant documentation.

Llama Open source Code generation

#24260 merged Llama.cpp Arch Cohere-Moe Support Added

PR #24260, merged into llama.cpp, adds Cohere-MoE architecture support. Users are testing North Mini Code models, which are ~3GB smaller at Q8 quantization than Qwen 3.6 27B, for local inference on homelabs.

Open source Regulation Llama

A single federal order switched off the best cloud model overnight. Clearest case for running local I've seen yet.

A frontier cloud model was globally suspended within 48 hours after a U.S. Commerce Department order restricted access for foreign nationals. The incident highlights the risks of cloud dependency: a local 70B model keeps running without interruption, slower but autonomous.

llama-launcher v1.3 release -> Bayesian Optimisation

llama-launcher v1.3 adds Bayesian optimization via Optuna to automatically tune llama-server parameters. The tool reports up to 15% speed improvement on Gemma 12B MTP with no manual intervention.

Llama Tools Open source

Open source Llama Infrastructure

Some thoughts on decentralized model sharing: What models should we share, and how?

Discussion on decentralized distribution of open-source LLM models. Author proposes prioritizing sharing of unquantized base models (fp16/bf16) over derived variants, arguing base models are essential primary data to preserve against growing restrictions from closed model providers.

PWA Support has been merged

PWA support merged into llama.cpp (PR #23871). The llama-server web UI can now install as a native app on desktop/home screen, with standalone window mode and proper icons for faster reopening and better caching.

Llama Open source Tools

Llama Benchmarks Code generation

Comparing dual-GPU inference speed between llama.cpp row/tensor split and ik_llama graph split

Dual-GPU benchmark (2× RTX 3080 20GB) comparing llama.cpp (row/tensor split) vs ik_llama (graph split) on Qwen3.6-27B-Q8_0. Row split: 1732 t/s prompt, 23 t/s generation, VRAM 18.2/18.5 GB. Tensor and graph split results incomplete in excerpt.

Llama Code generation Benchmarks

Not All MTP Assistants Are Created Equal

Hands-on experience with MTP (Multi-Token Prediction) speculative decoding in llama.cpp. MTP assistants are not interchangeable: identical names and architectures don't guarantee same performance. Gemma 4 26B Q4: ~30 t/s → 55-62 t/s with correct assistant. Unquantized assistant models outperform Q4 versions (~10 t/s faster).

EAGLE3 has landed in llama.cpp

EAGLE3 merged into llama.cpp after 6 months of development. The helper model receives guidance from the main model, unlike MTP where it operates independently.

Llama Open source

Open source Infrastructure Llama

Open sourcing InfiniteKV: a KV cache that files old tokens as 104-byte searchable records in RAM or on disk instead of deleting them. Mistral-7B answered from token 76,747, 2.3x past its trained window. Colab demo

InfiniteKV compresses the KV cache into 104-byte searchable records stored in RAM or on disk instead of deleting old tokens. Mistral-7B correctly answers at token 76,747 (2.3× its 32,768-token training window). One million tokens require ~3 GB instead of 122 GB.
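
The memory figures can be reproduced with back-of-the-envelope arithmetic. The sketch below uses Mistral-7B's public attention shape (32 layers, 8 KV heads, head dimension 128) with an fp16 cache. The "one 104-byte record per token per layer" layout is an assumption made here because it lands close to the quoted ~3 GB; the post itself does not state how records map to layers.

```python
# Mistral-7B attention shape, taken from its public model config.
N_LAYERS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
FP16_BYTES = 2

GIB = 2**30


def full_kv_bytes(n_tokens):
    """Uncompressed fp16 KV cache: K and V for every layer and KV head."""
    return n_tokens * 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * FP16_BYTES


def record_bytes(n_tokens, record_size=104, records_per_token=N_LAYERS):
    """Compressed store size.

    ASSUMPTION: one 104-byte record per token per layer. This layout is
    a guess for illustration, not something the post confirms.
    """
    return n_tokens * record_size * records_per_token


if __name__ == "__main__":
    n = 1_000_000
    print(f"full fp16 KV cache: {full_kv_bytes(n) / GIB:.1f} GiB")
    print(f"104-byte records:   {record_bytes(n) / GIB:.1f} GiB")
    print(f"context stretch:    {76_747 / 32_768:.2f}x")
```

Under these assumptions the full cache comes to about 122 GiB and the record store to about 3.1 GiB, and 76,747 tokens is about 2.34 times the 32,768-token window.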

LLM context compression at 16x beats KV cache

LLM context compression technique achieves 16x compression ratio, outperforming traditional KV cache approaches. Method significantly reduces memory usage while maintaining response quality.

Llama

arXiv cs.AI·Jun 12

Zero-source LLM Hallucination Detection with Human-like Criteria Probing

HCPD is a hallucination detection method that needs no access to model internals or external references. An LLM agent adaptively decomposes the judgment into weighted, interpretable criteria aligned via weak supervision on semantic consistency. Code released.

Llama AI Agents Evals

arXiv cs.CL·Jun 12

Localizing Anchoring Pathways in Language Models

Study of internal anchoring mechanisms in language models. Researchers localize circuits responsible for anchoring bias (where irrelevant numbers influence answers) in Qwen and Llama 7B-8B models. Edge-level attribution methods recover this signal more faithfully than node-level methods.

Qwen Llama Reasoning

PSA: Test your "threads" argument in llama.cpp (+80% performance in my case)

llama.cpp benchmark on an Intel 250K Plus CPU: tuning the --threads argument yields a +80% performance gain (49 → 88 tok/s). 16 threads is optimal, compared with 6 threads (P-cores only). Using all 18 cores lowers performance, with no throttling detected.
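
The quoted gain is easy to verify, and the same helper can rank results from a thread sweep. This is a small sketch: the `measured` mapping of 6 threads to 49 tok/s and 16 threads to 88 tok/s is how the summary is read here, and both function names are hypothetical.

```python
def percent_gain(baseline_tps, tuned_tps):
    """Relative throughput improvement, in percent."""
    return (tuned_tps - baseline_tps) / baseline_tps * 100.0


def best_thread_count(results):
    """Thread count with the highest measured tokens per second."""
    return max(results, key=results.get)


if __name__ == "__main__":
    # Threads -> tok/s, as read from the summary (only two points reported).
    measured = {6: 49.0, 16: 88.0}
    best = best_thread_count(measured)
    gain = percent_gain(measured[6], measured[best])
    print(f"best: {best} threads, gain over 6 threads: {gain:.1f}%")
```

The gain works out to roughly 79.6%, which rounds to the +80% in the headline.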

advice for dual-gpu asymmetric

A user with an RTX 3080 Ti 12GB and an RTX 3080 20GB is optimizing asymmetric dual-GPU inference. Gemma 4 31B Q4_K_XL reaches 20 t/s with the standard cache and 70 t/s when the K/V cache is compressed to q4_0. The user seeks clarification on GGUF memory expansion and advice on dual-GPU configuration.

Llama Open source Infrastructure

Reviewing speed optimizations on llamacpp for large MoE models on multiGPU rigs? (fitparams vs -ngl/-ncmoe vs other flags, P2P, overclocking)

Discussion on speed optimizations for llama.cpp with MoE models on multi-GPU setups. The author explores the -ngl, -ncmoe, -fitt, and -ub flags and their impact on throughput (50 → 120 tps in prompt processing), and questions how relevant these optimizations are to AI career prospects.

Llama Open source Benchmarks

NVFP4 with llama.cpp - FAQs?

Community discussion on NVFP4 in llama.cpp. Users compare NVFP4 against Q4-Q8 quantizations for 8GB GPUs (RTX 4060, AMD, Intel). Questions: NVFP4 quality vs Q6/Q8, benchmarks (speed, perplexity), recommended models (Qwen 3.5-9B, Gemma-4-12B). Resources: HuggingFace NVFP4 and GGUF lists.
