Topic

#Embeddings

Embeddings are numerical vector representations of text, images, or audio that capture their semantic meaning. For example, OpenAI's text-embedding-3-small model converts sentences into vectors used for search or similarity tasks.
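As a rough illustration of the "similarity tasks" mentioned above, the sketch below ranks items by cosine similarity between their vectors. The three-dimensional vectors and the helper names are hypothetical stand-ins chosen for readability; they are not output from text-embedding-3-small or any other real model, which would produce far longer vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (assumption: made-up numbers, not model output).
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}


def rank_by_similarity(query, embeddings):
    """Return (name, score) pairs for every other item, most similar first."""
    query_vec = embeddings[query]
    scores = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in embeddings.items()
        if name != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = rank_by_similarity("cat", toy_embeddings)
```

With these toy numbers, "kitten" ranks above "invoice" for the query "cat", which is the behavior a semantic search built on embeddings relies on.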

40 Articles
6 Sources
69 Avg. signal
Reddit r/LocalLLaMA

Building a free, offline LLM “tutor” grounded in one university textbook — RAG, LoRA, or both? Sanity check wanted

Developer seeks to build a free offline AI tutor grounded in a university textbook. Proposed architecture: RAG as core component (chunking, embedding, retrieval with page/section citations) + optional LoRA for pedagogical style. Questions on model selection (Qwen, Gemma), handling complex structures (figures, equations), and packaging for non-technical users.

RAG · Fine-tuning · Open source
SIG 35 · HYP 00
arXiv cs.AI

Better Later Than Sooner: Neuro-Symbolic Knowledge Graph Construction via Ontology-grounded Post-extraction Correction

Neuro-symbolic framework for ontology-grounded knowledge graph construction combining open-domain extraction, embedding-based canonicalization, and targeted LLM-based correction of ontology violations. Defers corrections to post-extraction stage to reduce token usage, improve KG consistency, and preserve QA quality for multi-hop reasoning and symbolic operations.

RAG · Reasoning · Embeddings
SIG 72 · HYP 00
Reddit r/LocalLLaMA

I built an enforcement layer for AI coding agents using a local knowledge graph and hybrid RAG

Writ is an enforcement layer for AI coding agents using a local Neo4j knowledge graph and hybrid RAG. A 5-stage retrieval pipeline (BM25, HNSW vector similarity, graph traversal, reciprocal rank fusion) surfaces only relevant rules per task. 30 bash hook scripts enforce execution: no code without approved plan, mandatory tests, static analysis required.

AI Agents · Code generation · RAG
SIG 72 · HYP 00
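The reciprocal rank fusion step named in the Writ summary can be sketched as follows. This is a generic version of the technique, not Writ's code: the function name, the rule IDs, and the three example rankings are hypothetical, and k=60 is the commonly used default constant rather than a value taken from the post.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list.

    Each ID scores the sum of 1 / (k + rank) over the lists it appears in,
    with rank starting at 1. Ties are broken alphabetically by ID.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical per-retriever results (assumption: illustrative IDs only).
bm25_hits = ["rule-a", "rule-b", "rule-c"]
vector_hits = ["rule-b", "rule-d", "rule-a"]
graph_hits = ["rule-b", "rule-a"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits, graph_hits])
```

Because the fusion uses ranks only, the BM25, vector, and graph scores never need to be normalized onto a common scale, which is one reason the method is popular for hybrid retrieval.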
arXiv cs.CL

BioELX: Cross-lingual Biomedical Entity Linking via Alias-based Retrieval and LLM Ranking

BioELX is a two-stage cross-lingual biomedical entity linking system requiring no annotated training data. It enriches SapBERT with Wikidata-derived multilingual aliases and uses an LLM for context-aware disambiguation. On five benchmarks, it achieves +19.2 Recall@1 on XL-BEL, with major gains for low-resource languages (Turkish +21.6, Korean +22.1, Thai +30.8).

Benchmarks · Papers · RAG
SIG 78 · HYP 00
GitHub Trending

meilisearch / meilisearch

Meilisearch is a lightning-fast search engine API providing AI-powered hybrid search for websites and applications.

Vector search · Embeddings · Tools
SIG 45 · HYP 00
arXiv cs.CL

Hubness, Not Anisotropy, Drives Cross-Lingual Retrieval Asymmetry in Multilingual Embedding Models

Study of cross-lingual retrieval asymmetry in 5 multilingual embedding models (Gemini, Mistral, OpenAI, Qwen), based on 6,518 idiomatic expressions in English, Bengali, Hindi, and Arabic. Finding: hubness (a few vectors turning up as nearest neighbors of disproportionately many queries) is the dominant causal driver (49.5% dominance share), far exceeding anisotropy. CSLS correction closes 63.5% of the reciprocity gap.

Embeddings · Benchmarks · Multi-agent
SIG 82 · HYP 00
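To make the CSLS correction in the hubness summary concrete, here is a minimal sketch of Cross-domain Similarity Local Scaling in its usual form: CSLS(x, y) = 2·cos(x, y) − r(x) − r(y), where r is the mean cosine similarity to the k nearest neighbors on the other side. The vectors are contrived 2-D toys built so that one target acts as a hub; the function names and k=2 are assumptions for illustration, not the paper's setup.

```python
import math


def _unit(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_matrix(queries, targets):
    """cos[i][j] = cosine similarity between query i and target j."""
    q = [_unit(v) for v in queries]
    t = [_unit(v) for v in targets]
    return [[sum(a * b for a, b in zip(qv, tv)) for tv in t] for qv in q]


def csls_matrix(cos, k=2):
    """Apply CSLS: 2*cos minus each side's mean similarity to its k nearest neighbors."""
    def mean_top_k(values):
        top = sorted(values, reverse=True)[:k]
        return sum(top) / len(top)

    r_query = [mean_top_k(row) for row in cos]          # query -> nearest targets
    r_target = [mean_top_k(col) for col in zip(*cos)]   # target -> nearest queries
    return [
        [2 * cos[i][j] - r_query[i] - r_target[j] for j in range(len(r_target))]
        for i in range(len(r_query))
    ]


def best_match(matrix):
    """Index of the highest-scoring target for each query."""
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]


# Contrived toy data (assumption): target 0 is a hub that sits between both queries.
queries = [[1.0, 0.0], [0.0, 1.0]]
targets = [[1.0, 1.0], [0.6, -0.8], [-0.8, 0.6]]

cos = cosine_matrix(queries, targets)
raw_choice = best_match(cos)
csls_choice = best_match(csls_matrix(cos, k=2))
```

In this toy setup plain cosine retrieval sends both queries to the hub (target 0), while the CSLS scores penalize the hub's high average similarity and pick targets 1 and 2 instead.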
Reddit r/MachineLearning

Added a Chrome Dino-style game to my research tool's pipeline wait screen driven by real SSE events [P]

ScholarScout v1.5.3 adds a Chrome Dino-style game to the pipeline wait screen (2-3 min). A pixel owl runs through a parallax forest; each spawned paper dot maps to a real SSE backend event (600ms intervals). Colors indicate source (arXiv white, PubMed green, Crossref purple). New features: k-means clustering on embeddings, per-cluster synthesis, paper freshness management with least-used prioritization.

Tools · RAG · Embeddings
SIG 65 · HYP 00
Reddit r/MachineLearning

[P] I built a system that lets you ask questions about any GitHub repo and get answers grounded in the actual source code

GitRAG lets users ask questions about any public GitHub repo and get answers grounded in source code with exact file paths and line numbers. System combines AST-aware parsing, dense embeddings, BM25 index, RRF fusion, and Cohere reranking before generation via llama-3.3-70b on Groq. Supports 15+ languages.

RAG · Embeddings · Code generation
SIG 72 · HYP 00
arXiv cs.LG

Uncovering the Latent Potential of Deep Intermediate Representations

Study showing that task-relevant information is distributed non-monotonically across layers in foundation models. Introduces LOES (Layer-wise Optimal Embedding Selection), a spectral method identifying task-discriminative subspaces, and GeoReg, a geometric regularization enforcing simplicial structure on class manifolds. Reports consistent improvements across architectures and modalities.

Fine-tuning · Embeddings · Papers
SIG 72 · HYP 00
GitHub Trending

qdrant / qdrant

Qdrant is a high-performance vector database designed for large-scale AI applications. Available as open-source and cloud service.

Vector search · Embeddings · Infrastructure
SIG 45 · HYP 00
arXiv cs.CL

Evaluation of Chunking Strategies for Effective Text Embedding in Low-Resource Language on Agricultural Documents

Comparative study of four chunking strategies (Recursive, Khmer-Aware, Sentence-Based, LLM-Based) for RAG on Khmer agricultural documents. Recursive chunking with 300-character chunks performs best: L2 distance 0.4295, Answer Relevance 0.8663, Khmer IoU 0.6441. The improvement over Sentence-Based chunking is statistically significant (p=0.0121).

RAG · Embeddings · Benchmarks
SIG 72 · HYP 00
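For readers unfamiliar with recursive chunking, the sketch below shows one plausible shape of the idea with a 300-character limit: split on the coarsest separator first, and fall back to finer separators, then to a hard cut, only for pieces that are still too long. The function name and the separator list are assumptions; this is not the implementation evaluated in the paper.

```python
def recursive_chunks(text, max_chars=300, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most max_chars, preferring coarse separators."""
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut every max_chars characters.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_chars:
            current = piece
        else:
            # The piece alone is too long: retry it with the finer separators.
            chunks.extend(recursive_chunks(piece, max_chars, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


# Hypothetical input: two short paragraphs and one long run without any separator.
sample = "a" * 250 + "\n\n" + "b" * 250 + "\n\n" + "c" * 700
sample_chunks = recursive_chunks(sample, max_chars=300)
```

Khmer is written without spaces between words, so the space separator rarely applies and the hard-cut fallback does more of the work; note that cutting by Python string index can split a multi-code-point syllable, which a production splitter would likely want to avoid.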