Topic

#Embeddings

Embeddings are numerical vector representations of text, images, or audio that capture their semantic meaning. For example, OpenAI's text-embedding-3-small model converts sentences into vectors used for search or similarity tasks.
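
As a minimal sketch of the similarity use case mentioned above, the snippet below ranks documents against a query by cosine similarity. The 4-dimensional vectors and the names `query`, `doc_a`, and `doc_b` are made up for illustration; they are not output from any real embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings; real models emit far more dimensions.
query = [0.9, 0.1, 0.0, 0.2]
docs = {
    "doc_a": [0.8, 0.2, 0.1, 0.1],
    "doc_b": [0.0, 0.9, 0.4, 0.0],
}

# Rank document names from most to least similar to the query.
ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
```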

40 Articles

7 Sources

69 Avg. signal

arXiv cs.CL·Jun 18

Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

Morpheus is a morphology-aware neural tokenizer for agglutinative Turkish. The model uses differentiable Poisson-binomial dynamic programming to segment morphemes with 1.425 bits-per-character compression and MorphScore macro-F1 of 0.61 (vs ~0.32 for subword tokenizers). Lossless by construction: decode(encode(w)) = w.

Embeddings Papers Open source

SIG

HYP

arXiv cs.CL·Jun 18

CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents

CoreMem introduces a memory architecture for personalized dialogue agents on edge devices (8 GB VRAM). Replaces cosine similarity with Fisher-Rao metric for retrieval and uses Fisher-guided token distillation for compression. Achieves +4.51 pp gains in open-domain reasoning and +4.17 pp in temporal reasoning on LOCOMO and LongMemEval-S benchmarks.

AI Agents RAG Embeddings

SIG

HYP

arXiv cs.CL·Jun 18

Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation

DICE improves long-document retrieval by splitting documents into chunks, encoding each independently, then aggregating vectors into a single representation. On LongEmbed, scores reach 90.0 on Dream Passkey >4k (vs 30.0) and 74.0 on Needle >4k (vs 23.3). The approach reduces Evidence Dilution Index (EDI) in 92.8% of cases.

RAG Embeddings Vector search

SIG

HYP
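
The chunk, encode, and aggregate flow described in the DICE summary above can be sketched roughly as follows. This is not the paper's method: the summary does not say how DICE aggregates chunk vectors, so plain mean pooling is swapped in here, and `toy_encode` is a hypothetical hashed bag-of-words stand-in for a real embedding model.

```python
import math
import zlib

DIM = 16  # toy dimensionality, chosen only for illustration

def toy_encode(text):
    """Hypothetical stand-in encoder: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        # crc32 is deterministic across runs, unlike the built-in hash() for str.
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def split_into_chunks(text, chunk_size):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def encode_document(text, chunk_size=8):
    """Encode each chunk independently, then mean-pool into one vector.

    Mean pooling is an assumption made for this sketch, not DICE's aggregator.
    """
    chunks = split_into_chunks(text, chunk_size)
    if not chunks:
        return [0.0] * DIM
    vectors = [toy_encode(chunk) for chunk in chunks]
    return [sum(column) / len(vectors) for column in zip(*vectors)]
```

Encoding chunks separately keeps a short but relevant passage from being averaged away by token-level pooling over the whole document; how much of that benefit survives depends on the aggregation step, which is where the paper's contribution lies.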

arXiv cs.CL·Jun 18

MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval

MCompassRAG improves RAG systems by using topic-level metadata as a semantic compass for paragraph-level retrieval. The method enriches chunk representations with topic signals in the same embedding space and trains a lightweight retriever via LLM-teacher distillation. Across six benchmarks, it gains 8.24% in information efficiency with 5× lower latency than efficient RAG baselines.

RAG Embeddings Benchmarks

SIG

HYP

arXiv cs.CL·Jun 17

Examining the Limits of Word2Vec with Toki Pona

Word2Vec study on Toki Pona, a constructed language with ~130 words, trained on 1.4M sentences (7.95M tokens). Compares two models: with and without non-Toki Pona tokens (named entities, loanwords). Finding: sparse tokens bring similar words closer; Word2Vec works even with an extremely reduced vocabulary, relying on distributional patterns rather than lexicon size.

Embeddings Papers Benchmarks

SIG

HYP

arXiv cs.AI·Jun 16

Fusion is not one-size-fits-all: Cross-Modal Representation Alignment for Time-to-Event Modeling

Multimodal fusion framework for time-to-event prediction (PE mortality, CVD outcomes) aligning CT and longitudinal EHR representations using foundation models. Four strategies tested (late fusion, contrastive alignment, cross-attention, co-attention) on 3,099–2,951 patients. Contrastive fusion improves concordance index by 1.5–5.4% vs unimodal baselines.

Benchmarks Embeddings Vision

SIG

HYP

arXiv cs.CL·Jun 16

AthDGC: An Open Diachronic Greek Treebank with Indo-European Parallels

AthDGC is an open dependency-parsed treebank of Greek spanning 8 diachronic periods (Archaic to Modern) under PROIEL XML 2.0 schema. Verse-level cross-alignment of New Testament with Latin, Gothic, Old Church Slavonic, and Classical Armenian. Annotation via Stanford Stanza, sentence alignment via LaBSE, word alignment via multilingual-BERT. v0.4 released open-source.

Benchmarks Open source Embeddings

SIG

HYP

arXiv cs.AI·Jun 16

Hierarchical Modeling of ICD Codes in EHR Foundation Models

Study on integrating ICD-10-CM hierarchy into EHR foundation models. Authors compare two approaches: augmenting BERT sequences with hierarchical tokens and injecting hierarchy into graph-based code representations. Experiments on MIMIC-IV and eICU show explicit hierarchy encoding improves predictions in-domain and in cross-dataset transfer.

Papers Embeddings RAG

SIG

HYP

arXiv cs.CL·Jun 16

Transfer Learning for FHIR Questionnaire Terminology Binding

Retrieval study to automatically bind LOINC codes to FHIR Questionnaire items in healthcare. Six methods tested (TF-IDF, MiniLM, BioBERT, BioLORD, contrastive fine-tuning, GPT reranker) on 97,314 codes. BioLORD (encoder pre-trained on biomedical ontologies) achieves R@1=0.185 without task-specific data; contrastive fine-tuning reaches R@5=0.389. GPT augmentation degrades performance.

Embeddings Fine-tuning RAG

SIG

HYP

Reddit r/MachineLearning·Jun 15

Concept-Vector: A design framework for human-interpretable word embeddings [P]

Concept-Vector presents a design framework to distill word embeddings into human-interpretable concept-vectors, where each component tracks semantic, syntactic, or statistical aspects with human-readable labels. Data design project without empirical model validation, shared for critical feedback.

Embeddings Papers

SIG

HYP

arXiv cs.LG·Jun 15

Numbers Already Carry Their Own Embeddings

AOE (Adelic operation-preserved embeddings) is a training-free representation encoding numbers while preserving additive and multiplicative structure via p-adic signatures. Plug-and-play, it achieves 100% on Weaving Pattern benchmark and improves algebraic combinatorics performance without task-specific retraining.

Embeddings Benchmarks Papers

SIG

HYP
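
To illustrate why p-adic information can preserve multiplicative structure, as the AOE summary above claims, here is a small sketch built on p-adic valuations. It is not the paper's AOE construction: the `signature` function and the choice of `PRIMES` are assumptions for illustration. The only fact relied on is that the valuation of a product equals the sum of the valuations.

```python
PRIMES = (2, 3, 5, 7, 11)  # hypothetical choice of primes for the sketch

def p_adic_valuation(n, p):
    """Largest k such that p**k divides n (n must be a nonzero integer)."""
    if n == 0:
        raise ValueError("the p-adic valuation of 0 is infinite")
    n = abs(n)
    k = 0
    while n % p == 0:
        n //= p
        k += 1
    return k

def signature(n):
    """Tuple of p-adic valuations of n over PRIMES."""
    return tuple(p_adic_valuation(n, p) for p in PRIMES)

# Multiplying integers corresponds to adding their signatures componentwise.
a, b = 12, 45
summed = tuple(x + y for x, y in zip(signature(a), signature(b)))
```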

arXiv cs.AI·Jun 15

Hyperdimensional computing for structured querying on tabular data embeddings

Hyperdimensional Computing (HDC) and Holographic Reduced Representations applied to tabular row embeddings. Derives interpretable similarity thresholds for structured queries (equality/inequality predicates), evaluated on two real-world datasets against EmbDI baseline. HDC reliably identifies zero-match predicates.

Embeddings Vector search Papers

SIG

HYP

arXiv cs.CL·Jun 15

Fusing Stylometric and Embedding Systems to Estimate Authorship Likelihood Ratios in Japanese

First application of likelihood ratio framework for authorship attribution in Japanese. Fusion of stylometric systems and embeddings from pre-trained language models on ~1000-character blog excerpts. Fused system improves discriminability (log-likelihood-ratio cost: 0.32484) while maintaining excellent calibration.

Embeddings Benchmarks Papers

SIG

HYP

arXiv cs.LG·Jun 12

Bernstein-Schur Kernels: Random Features by Sketched Modulation and Radial Randomization

Bernstein-Schur kernels: random feature construction combining sketched finite modulation and radial randomization via Bernstein-Widder scale. Feature dimension Dm without O(d²) cost of exact modulation. Exact variance guarantees and operator-norm bounds controlled by intrinsic dimension, with kernel ridge regression applications.

Papers Benchmarks Embeddings

SIG

HYP

arXiv cs.CL·Jun 11

Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite

First end-to-end RAG pipeline running all neural stages on mobile NPU (Snapdragon X Elite Hexagon). Embedding, reranking, LLM generation on-device. On 120-query Wikipedia benchmark: 18.1x faster LLM prefilling, 4.0x lower system energy vs CPU, answer quality parity (GPT-4.1 judge: 9.32 vs 8.95 CPU).

RAG Embeddings

SIG

HYP

arXiv cs.LG·Jun 11

Bernstein-Schur Kernels: Random Features by Sketched Modulation and Radial Randomization

Bernstein-Schur kernels: random feature construction combining sketched finite modulation and randomized completely monotone radial factors. Feature dimension = Dm (sketch size m × radial draws D), avoiding O(d²) dependence. Guarantees: unbiasedness, operator-norm bounds controlled by intrinsic dimensions, spectral stability for kernel ridge regression.

Papers Benchmarks Embeddings

SIG

HYP

arXiv cs.AI·Jun 10

One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA

Latent Memory replaces each memory item (text/image) with a single compressed latent token, reducing generator token consumption by 3-10x. Trained with reconstruction, contrastive, and distillation objectives, the system achieves competitive performance on HotpotQA and multimodal benchmarks while lowering memory pressure.

RAG Embeddings Vision

SIG

HYP

arXiv cs.AI·Jun 10

A Unified Multi-Modal Framework for Intelligent Financial Systems: Integrating Reinforcement Learning, High-Frequency Trading, and Game-Theoretic Approaches with Cross-Modal Sentiment Analysis

Unified framework integrating PPO, time-series prediction, in-context learning, game theory, and cross-modal sentiment analysis for financial systems. Results: +23.7% portfolio optimization, -31.2% high-frequency trading error, +18.9% recommendation accuracy, +27.4% Nash convergence, +15.6% sentiment analysis.

Reinforcement learning Benchmarks Reasoning

SIG

HYP

arXiv cs.AI·Jun 10

Cross-Modal Knowledge Distillation without Paired Data: Theoretical Foundation and Algorithm

New cross-modal knowledge distillation method without paired data. Framework establishes distributional relationship between teacher and student models, identifying two key quantities: feature alignment and label alignment. Significant improvements on multimodal benchmarks.

Papers RAG Embeddings

SIG

HYP

Reddit r/LocalLLaMA·Jun 9

Semantic distance as routing layer: an on-device, serverless alternative to the central-index model

Decentralized prototype using local embeddings (EmbeddingGemma-300M) to replace central indexes. Devices communicate peer-to-peer, rank content by semantic distance (cosine similarity) without server or global ranking. Proposed extension to AI agents discovering each other's needs/offers through semantic proximity.

Embeddings AI Agents Open source

SIG

HYP

GitHub Trending·Jun 9

chroma-core / chroma

Chroma is vector search infrastructure for AI applications. The trending GitHub project provides storage and querying tools for embeddings to support RAG and language model-based systems.

Vector search Embeddings RAG

SIG

HYP

Reddit r/MachineLearning·Jun 8

Why I stopped using semantic embeddings for tool selection and switched back to BM25 [D]

An agent developer ditched semantic embeddings for tool selection, switching to BM25. With 140 MCP tools in production, cosine similarity on short descriptions (<50 tokens) failed (64% accuracy): key discriminators (specific nouns) diluted in embedding space. BM25 on flat-text projection achieves 81% top-1.

AI Agents MCP RAG

SIG

HYP
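
For readers unfamiliar with the lexical scoring the post above switched to, here is a minimal Okapi BM25 sketch in pure Python. The tool descriptions are invented, the `k1` and `b` values are common defaults rather than the author's settings, and the whitespace tokenizer is a simplification.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification for the sketch)."""
    return text.lower().split()

def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    docs = [tokenize(d) for d in documents]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Non-negative IDF variant (adds 1 inside the logarithm).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            denom = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / denom
        scores.append(score)
    return scores

# Hypothetical tool descriptions, not taken from the post.
tools = [
    "create a calendar event with a title and a time",
    "send an email message to a recipient",
    "search files by name in cloud storage",
]
scores = bm25_scores("send email", tools)
best = max(range(len(tools)), key=scores.__getitem__)
```

Because BM25 rewards exact matches on rare terms, a distinctive noun in a short description carries the full weight of its IDF instead of being blended into a dense vector, which is consistent with the failure mode the post describes.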

Reddit r/LocalLLaMA·Jun 8

Used local Ollama (gemma4:e4b + nomic-embed-text) to bulk-generate AI summaries for 4300 arXiv papers and push them to a remote Cloudflare DB — pipeline walkthrough

Developer built ArxivExplorer, semantic arXiv search engine with AI-generated summaries. Local pipeline uses Ollama: gemma4:e4b (8B) for structured JSON summaries, nomic-embed-text (137M) for 768-dim embeddings. 4300 papers processed, ~95% first-pass success rate, storage via Cloudflare D1/Vectorize. REST API 100× faster than wrangler.

RAG Embeddings Open source

SIG

HYP

Reddit r/MachineLearning·Jun 8

Memanto vs SQLite R_A_G Benchmark Results - Cloud vs Local Memory Systems [P]

Head-to-head benchmark of Memanto (cloud memory system) vs custom SQLite RAG on LoCoMo conversational dataset. Memanto achieves 90% accuracy in 1.878s vs 80% in 2.680s for SQLite. Analysis shows SQLite failures stem from API rate limits (HTTP 429), while Memanto's decoupled architecture buffers against shared quota exhaustion.

RAG Benchmarks Vector search

SIG

HYP

arXiv cs.AI·Jun 8

Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs

ZEDD (Zero-Shot Embedding Drift Detection) detects prompt injections by measuring semantic shifts in embedding space between benign and suspect inputs. Without model internals access or retraining, the method achieves >93% accuracy on Llama 3, Qwen 2, and Mistral with <3% false positive rate.

AI safety Embeddings Prompt engineering

SIG

HYP

arXiv cs.AI·Jun 8

Trading Engagement for Sustainability: Carbon-Aware Re-ranking for E-commerce Recommendations

Study on e-commerce recommender systems incorporating product carbon footprint. Researchers estimate missing PCF via semantic search and LLM prompting, then apply post-hoc re-ranking on BPR, NeuMF, and LightGCN. On Amazon Reviews (3 categories), substantial carbon reductions achievable with minimal engagement cost.

RAG Embeddings

SIG

HYP

GitHub Trending·Jun 7

RyanCodrai / turbovec

TurboVec is a vector index built on TurboQuant, written in Rust with Python bindings. Optimized for high-performance vector search.

Vector search Embeddings Open source

SIG

HYP

arXiv cs.CL·Jun 5

Multilingual Coreference Resolution via Cycle-Consistent Machine Translation

Multilingual coreference resolution pipeline using cycle-consistent machine translation (English → target language → English) to generate training data. Translation quality validated via cosine similarity in BERT latent space. Significant performance gains on 4 low-resource languages.

Benchmarks Embeddings Papers

SIG

HYP

arXiv cs.CL·Jun 5

Predict and Reconstruct: Joint Objectives for Self-Supervised Language Representation Learning

New pre-training approach combining MLM (Masked Language Modeling) and JEPA (Joint Embedding Predictive Architecture) for text encoders. Hybrid model trained on English Wikipedia with identical compute budget. Results: more uniform embeddings (-0.16 vs -0.05), richer spectral geometry, better semantic-to-lexical balance on GLUE benchmarks.

Papers Fine-tuning Embeddings

SIG

HYP

Hacker News (AI)·Jun 4

Inside FAISS: Billion-Scale Similarity Search

Technical deep-dive into FAISS, Meta's library for billion-scale similarity search. Covers internal architecture, indexing algorithms, and optimizations for massive query workloads.

Vector search Embeddings Infrastructure

SIG

HYP

Reddit r/LocalLLaMA·Jun 4

I Built a Practical Guide to LLM Engineering: RAG, Retrieval, Rerankers, and Evaluation

Open-source practical guide to LLM engineering patterns: RAG, hybrid retrieval, rerankers, evaluation. Covers pre-filtering, in-memory scoring vs vector databases, batching, cleanup. Python examples included. The author emphasizes that the quality of the engineering harness matters as much as model quality for production solutions.

RAG Vector search Embeddings

SIG

HYP

arXiv cs.CL·Jun 4

Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues

New FFR approach retrieves coherent multi-utterance, multi-image fragments from long-form multimodal dialogues. Two models: F2RVLM (generation + RL with multi-objective rewards) for single-dialogue, FFRS (two-stage indexing + retrieval) for corpus-scale. MLDR dataset introduced, superior performance on benchmarks.

RAG Vision Embeddings

SIG

HYP

arXiv cs.LG·Jun 4

KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models

KODA is a kernel-based framework for comparing and aligning vision-language model representations (CLIP, SigLIP). The method identifies sample subsets weakly clustered under one representation but strongly clustered under another through constrained optimization and low-rank approximations. Code released.

Vision Embeddings Benchmarks

SIG

HYP

arXiv cs.LG·Jun 4

Training-Free Lexical-Dense Fusion for Conversational-Memory Retrieval

Training-free lexical-dense fusion study for long-term conversational memory retrieval. Score-level fusion of late-interaction dense + BM25 improves Hit@1 by +8.8 to +17.2 points across six encoders (Hit@1 0.752 with e5-large-v2). Web search cross-encoder reranker degrades results (-6.9 pp). Analysis shows division of labor: dense excels on multi-hop/temporal questions, BM25 on adversarial ones.

RAG Embeddings Benchmarks

SIG

HYP
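
Score-level fusion, as named in the summary above, can be sketched as normalizing each retriever's scores and taking a weighted sum. The min-max normalization, the `alpha` weight, and the example scores are assumptions for illustration; the summary does not specify the paper's exact fusion rule.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def fuse_scores(dense_scores, lexical_scores, alpha=0.5):
    """Weighted sum of normalized dense and lexical scores per candidate."""
    if len(dense_scores) != len(lexical_scores):
        raise ValueError("both score lists must cover the same candidates")
    dense = min_max_normalize(dense_scores)
    lexical = min_max_normalize(lexical_scores)
    return [alpha * d + (1.0 - alpha) * l for d, l in zip(dense, lexical)]

# Hypothetical scores for three candidate memories.
dense_scores = [0.82, 0.80, 0.40]   # e.g. similarity from a dense encoder
lexical_scores = [1.0, 7.5, 0.0]    # e.g. BM25 scores
fused = fuse_scores(dense_scores, lexical_scores)
top = max(range(len(fused)), key=fused.__getitem__)
```

In this toy case the dense scores barely separate the first two candidates, so the lexical signal decides the ranking; no training is involved, only normalization and a fixed weight.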

Reddit r/LocalLLaMA·Jun 3

Mellum & Granite Embedding models are ready on llama.cpp

Mellum and Granite embedding models are now available on llama.cpp. Two pull requests add support for these models in the framework.

Embeddings Open source Tools

SIG

HYP

arXiv cs.CL·Jun 3

SEA-Embedding: Open and Reproducible Text Embeddings for Southeast Asia

SEA-Embedding is an open and reproducible text-embedding pipeline for Southeast Asian languages trained exclusively on public data. The study examines three core factors: data composition, training objective, and base encoder initialization. Achieves state-of-the-art results on SEA-BED.

Embeddings Open source Papers

SIG

HYP

arXiv cs.LG·Jun 3

Learning Coherent Representations: A Topological Approach to Interpretability

Novel topological approach to interpretability in deep neural networks. Authors introduce 'coherence', a geometric property where each neuron responds to contiguous regions of state space. They propose Coh, a differentiable objective function based on Fréchet variance, validated on MNIST and BERT embeddings.

Papers Embeddings

SIG

HYP

arXiv cs.LG·Jun 3

CL-DMDF: Dynamic Multimodal Data Fusion Model Based on Contrastive Learning

CL-DMDF introduces a dynamic multimodal data fusion model using contrastive learning to handle missing or uncertain modalities. It features a dual-dimension attention mechanism (features and modalities) and entity-centroid contrastive learning module for enhanced discrimination. Validated across three datasets.

Embeddings Papers

SIG

HYP

arXiv cs.LG·Jun 3

Cross-Modal Contrastive Learning of ECG and Angiography Representations for Severe Stenosis Classification

StenCE, a cross-modal contrastive learning framework, detects severe coronary artery stenosis from non-invasive ECGs. Evaluated across stenosis severity thresholds, the model outperforms prior work and enables early risk stratification in asymptomatic patients.

Vision Embeddings Benchmarks

SIG

HYP

Embeddings — AI news · Signal IA