Topic

#Vision

Computer vision is the field of AI that enables machines to analyze and interpret images or videos. GPT-4o, for instance, can describe the content of a photo, read printed text, or identify objects within a scene.

40 Articles

10 Sources

70 Avg. signal

Latent Space·Jun 18

[AINews] Midjourney Medical: scan your organs like you step on a scale

Midjourney announces its second product: a medical application enabling organ scanning via smartphone without specialized medical equipment. The AI model analyzes captured images to provide preliminary diagnostics.

Image generation Vision Business

arXiv cs.CL·Jun 18

Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction

RPCL, a training-only framework for multimodal emotion-cause pair extraction, improves pair-confidence robustness. Using margin constraints and contextual corruption, it increases Pair F1 by 2.58–2.83 points on ECF/MECAD/MEC4 without changing inference.

Papers Benchmarks Vision

arXiv cs.CL·Jun 18

VISUALSKILL: Multimodal Skills for Computer-Use Agents

VISUALSKILL introduces hierarchical multimodal skills for computer-use agents. Combining authored documentation with live UI exploration, the system improves Claude Opus 4.6 performance by 15.3 points on CUA-World and OSExpert-Eval (0.456 vs. a 0.303 baseline). Visual figures outperform text-only descriptions by 8.3 points.

Claude AI Agents MCP

arXiv cs.AI·Jun 18

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

RTSGameBench is a benchmark to evaluate strategic reasoning in Vision-Language Models (VLMs) using real-time strategy games. Built on Beyond All Reason, it offers multi-scenario evaluations, diagnostic mini-games targeting specific competencies, and a self-evolving generation framework. Current state-of-the-art VLMs fail at multi-agent coordination and complex task scaling.

Vision Reasoning Multi-agent

arXiv cs.AI·Jun 18

ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection

ThinkDeception introduces a progressive reinforcement learning framework for interpretable multimodal deception detection. Using MLLMs, it converts binary classification into explicit reasoning via Chain of Thought. VAC-GRPO with curriculum learning stratified into 4 difficulty tiers achieves SOTA on mainstream benchmarks.

Reasoning Reinforcement learning Vision

arXiv cs.LG·Jun 18

A Cross-Model VLM-Judge Protocol for Single-Image 3D Mesh Quality (and Why Cheap Proxies Fall Short)

Evaluation protocol for single-image-to-3D mesh quality using VLM judges (vision-language models). Authors demonstrate that cheap proxies (CLIP similarity, geometry validity stats) fail to correlate with perceived quality. Their VLM-judge protocol with position-bias correction achieves Cohen's kappa = 0.66 between two independent judge families.

Vision Evals Benchmarks
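
For readers unfamiliar with the agreement statistic quoted above, here is a minimal sketch of Cohen's kappa for two judges. The verdict lists are hypothetical illustration data, not results from the paper, and the paper's position-bias correction is not reproduced here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both judges agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the judges' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both judges always gave the same single label
    return (p_o - p_e) / (1 - p_e)


# Hypothetical verdicts from two judge families on ten items.
judge_1 = ["A", "A", "B", "B", "A", "B", "A", "A", "B", "B"]
judge_2 = ["A", "A", "B", "A", "A", "B", "A", "B", "B", "B"]
kappa = cohens_kappa(judge_1, judge_2)
print(f"kappa = {kappa:.2f}")
```

Kappa discounts the agreement two judges would reach by chance, which is why it is lower than the raw agreement rate on the same data.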

arXiv cs.AI·Jun 18

CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework

CaVe-VLM-CoT is a modular agentic-RAG framework reducing VLM hallucinations through a five-stage closed-loop pipeline (Extractor, Retriever, Solver, Citation Injector, Verifier). Ungrounded claims trigger targeted re-retrieval. 23 component-wise metrics and CaVeScore measure citation faithfulness and cross-modal grounding. Results: 87.1% accuracy on ScienceQA, 55.2% on MMMU.

Vision RAG AI Agents

Reddit r/LocalLLaMA·Jun 17

llama.cpp - how to free up even more space on your GPU

llama.cpp optimizes GPU memory management. Key parameters: --no-mmproj-offload frees 1GB for vision models, --cache-type-k/v reduces KV cache by 50-75%, --spec-draft-n-max=2 optimizes speculative decoding. Flash attention enabled by default. Tested on Qwen 3.6-27B with 150k context on RTX 3090.

Llama Open source Infrastructure
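
The KV-cache saving quoted above can be sanity-checked with some arithmetic. The sketch below assumes the ggml block layouts for `q8_0` and `q4_0` (32 values plus one 2-byte scale per block) and a hypothetical model shape in which every layer keeps a standard attention KV cache. It is not the configuration tested in the post.

```python
# Bytes per cached element. q8_0 packs 32 values into 34 bytes and
# q4_0 packs 32 values into 18 bytes (values plus one fp16 scale).
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    # Factor 2: one K tensor and one V tensor per layer.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * BYTES_PER_ELEMENT[cache_type]


# Hypothetical model shape, chosen only for illustration.
shape = dict(n_layers=48, n_kv_heads=4, head_dim=128, n_ctx=150_000)

baseline = kv_cache_bytes(cache_type="f16", **shape)
savings = {
    t: 1 - kv_cache_bytes(cache_type=t, **shape) / baseline
    for t in ("q8_0", "q4_0")
}
for cache_type, saved in savings.items():
    print(f"{cache_type}: {saved:.1%} smaller than f16")
```

Under these assumptions the relative saving depends only on the cache type, not on the model shape, so it lands close to the 50-75% range reported in the post.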

The Decoder·Jun 17

Amazon, Nvidia, and AMD bet $310 million on AI startup building 3D world models

Amazon, Nvidia, and AMD invest $310 million in Odyssey ML, a 3D world model startup valued at $1.45 billion. IQT fund and Google's Jeff Dean join the round. World models are emerging as the next major AI bet after language models.

Funding Reasoning Vision

Hugging Face Blog·Jun 17

MolmoMotion: Language-guided 3D motion forecasting

Hugging Face introduces MolmoMotion, a language-guided 3D motion forecasting model. The system combines vision and language to predict future trajectories from videos, enabling applications in robotics and animation.

Vision Robotics

GitHub Trending·Jun 17

bytedance / UI-TARS-desktop

ByteDance releases UI-TARS-desktop, an open-source multimodal AI agent stack. The project connects cutting-edge AI models and agent infrastructure to automate UI-based tasks.

AI Agents Multi-agent Open source

GitHub Trending·Jun 17

bytedance / UI-TARS-desktop

ByteDance releases UI-TARS-desktop, an open-source multimodal AI agent stack connecting cutting-edge AI models and agent infrastructure. It is a platform for building agents that can interact with user interfaces.

AI Agents Multi-agent Open source

Reddit r/MachineLearning·Jun 17

Mel AI just shared a demo of video-native AI characters that can talk, react, and respond to camera context in real time [N]

Mel AI demonstrates video-native AI characters that talk, lip-sync, show facial reactions, and respond in real time to camera context. The system detects user environment and adapts responses accordingly. This approach moves beyond text-based Character AI (founded by former Google/LaMDA developers).

AI Agents Vision Voice

arXiv cs.CL·Jun 17

Implicit vs. Explicit Prompting Strategies for LVLMs in Referential Communication

Two recent studies reach contradictory conclusions about LVLMs' ability to coordinate efficient referring expressions. This research controls for task differences and directly compares prompting styles. Models coordinate efficiently with explicit prompting but fail to infer communicative efficiency needs from implicit prompts.

Prompt engineering Vision Evals

arXiv cs.LG·Jun 17

ProCUA-SFT Technical Report

ProCUA-SFT is a dataset of 3.1M step-level SFT samples generated automatically from 93K synthetic trajectories across 2,484 application combinations. Fine-tuning UI-TARS 7B on ProCUA-SFT achieves 45.0% on OSWorld, a +18.7 percentage-point improvement over the base model and +35% above AgentNet. The pipeline uses Kimi-K2.5 as task generator, precondition judge, and trajectory executor.

AI Agents Benchmarks Fine-tuning

arXiv cs.CL·Jun 17

Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings

Comparative study of LLM abilities to predict next speaker, turn changes, and addressee in multi-party conversations. On the AMI corpus, LLMs outperform supervised models and humans in next speaker prediction without audio-visual access. MM-LLMs exceed text-based LLMs but remain below human performance for addressee and turn-change prediction.

Benchmarks Evals Vision

arXiv cs.CL·Jun 17

The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports

Study of 450 chest X-ray reports showing LLM rewriting for standardization preserves image-text alignment (2.5% degradation) but erodes 26.8–29.3% of clinical entities and 14.9–16.5% of uncertainty language. The paradox: tasks producing 'cleaner' text pull content away from images.

Vision RAG Evals

arXiv cs.LG·Jun 17

MODE: Modality-Decomposed Expert-Level Mixed-Precision Quantization for MoE Multimodal LLMs

MODE is an expert-level mixed-precision quantization framework for MoE multimodal LLMs. It decomposes expert selection frequency by modality (vision/text) and filters redundant vision tokens to correct estimation biases. Results: <2.9% performance loss at W3A16.

Vision Benchmarks Papers

arXiv cs.LG·Jun 17

Reconfigurable Computing Challenge: Transformer for Jet Tagging on Versal AI Engines

Quantized integer-only transformer implementation for jet tagging on AMD Versal AI Engine (AIE). Reusable software framework automatically converts Python model descriptions to Vitis graph code for low-latency, resource-constrained deployment. Open-source release.

Vision Benchmarks Open source

arXiv cs.AI·Jun 17

FinAcumen: Financial Multimodal Reasoning via Self-Evolving Experience Memory Harness

FinAcumen is a financial multimodal reasoning agent that accumulates experience from prior trajectories in persistent memory. The system improves a frozen 8B vision-language model across four financial benchmarks using selective experience activation and a deterministic tool environment for numerical computation and verification.

AI Agents Multi-agent Vision

arXiv cs.AI·Jun 17

FllumaOne: A Code-Native Multimodal CAD Dataset with Executable Programs and Kernel-Validated Feature Histories

FllumaOne is a multimodal CAD dataset of 100,000 models generated by executable Python programs in Flluma (OpenCASCADE-based CAD system). Each sample aligns the program with a feature tree, STEP representation, point cloud, and natural-language descriptions. A Qwen2.5-Coder-1.5B baseline achieves 99.98% Python syntax validity and 99.14% STEP-export validity.

Code generation Benchmarks Vision

arXiv cs.AI·Jun 17

LongWebBench: Evaluating Structural and Functional Webpage Generation in Long-Horizon Settings

LongWebBench is a benchmark evaluating long-horizon webpage generation by vision-language models. It contains 490 real-world pages for structural evaluation and 507 goal-oriented interaction tasks over 129 pages. Experiments show structural fidelity degrades with webpage length, and visually plausible generations often fail to support multi-step executable interactions.

Vision Benchmarks AI Agents

arXiv cs.CL·Jun 17

Revisiting LLM Adaptation for 3D CT Report Generation: A Study of Scaling and Diagnostic Priors

Study of LLM adaptation for 3D CT report generation in medical imaging. RAD3D-Prefix, a lightweight diagnostic-prior framework, integrates image embeddings and multi-label classification logits. Across LLMs from 96.1M to 1.6B parameters, freezing the model and training only projection layers outperforms full fine-tuning, reducing clinical hallucination and overfitting.

Fine-tuning Vision

arXiv cs.CL·Jun 17

MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation

MODE-RAG is a multi-agent system driven by Variational Free Energy to reduce hallucinations in Multimodal Retrieval-Augmented Generation (M-RAG). It uses Monte Carlo Tree Search, logit perturbations, and specialized agents to route high-risk queries and perform post-hoc factual verification. The authors introduce ModeVent, a challenging subset of the MultiVent dataset, to evaluate M-RAG robustness.

RAG Multi-agent Vision

arXiv cs.LG·Jun 17

Correct When Paired, Wrong When Split: Decoupling and Editing Modality-Specific Neurons in MLLMs

Researchers identify a critical issue in knowledge editing for MLLMs: updates work with multimodal inputs (text+image) but fail with unimodal inputs alone. They propose DECODE, a method that localizes and decouples modality-specific neurons to propagate edits consistently across all input types.

Fine-tuning Vision Evals

arXiv cs.LG·Jun 17

Probing, Fusion, and Trustworthiness: A Systematic Evaluation of Foundation Model Representations for Multimodal Cancer Analysis

Systematic evaluation of foundation model representations (5 FMs) on computational pathology tasks using whole-slide images and transcriptomic profiles (IH-BC, IH-NSCLC cohorts). Multimodal fusion improves performance when no single modality dominates. Conformal prediction shows that the true diagnosis remains recoverable in the prediction set for the majority of failed predictions.

Vision Benchmarks AI safety
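
To illustrate how a wrong top-1 prediction can still be "recoverable" in a prediction set, here is a minimal sketch of split conformal prediction. It assumes the common `1 - p(true class)` nonconformity score. The calibration values and the test case are hypothetical, and the paper's exact procedure may differ.

```python
import math


def conformal_threshold(cal_true_probs, alpha=0.25):
    """Score threshold from calibration data (split conformal)."""
    # Nonconformity score: 1 minus the probability given to the true class.
    scores = sorted(1.0 - p for p in cal_true_probs)
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return 1.0  # too little calibration data: keep every class
    return scores[rank - 1]


def prediction_set(probs, threshold):
    """All classes whose score falls within the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]


# Hypothetical probabilities assigned to the true class on 11 calibration cases.
cal_true_probs = [0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.45, 0.3, 0.2]
threshold = conformal_threshold(cal_true_probs, alpha=0.25)

# Hypothetical test case: the top-1 class is 0, but the true class is 1.
test_probs = [0.5, 0.46, 0.04]
true_label = 1
pred_set = prediction_set(test_probs, threshold)
print(pred_set, true_label in pred_set)
```

In this toy case the classifier's top choice is wrong, yet the true class is still inside the prediction set, which is the kind of recoverability the summary describes.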

arXiv cs.AI·Jun 17

Surrogate Assisted Pedestrian Protection Design via a Foundation Model Orchestrated Workflow

Foundation model-orchestrated workflow for pedestrian protection design. Integrates ML surrogate (R²=0.87), multi-objective evolutionary search, geometry generator, and LLM interface. Reduces evaluation time from hours to seconds; generates 35 safety-compliant alternatives in automotive bumper case study.

AI Agents Vision Reasoning

arXiv cs.AI·Jun 17

StepGuard: Guarding Web Navigation via Single-Step Calibration

StepGuard improves web navigation for AI agents via Dynamic Dual-Policy Optimization (DDPO) to handle reward conflicts and Confidence-Guided Adaptive Navigation Reflection (CANR) to calibrate per-step errors. The framework achieves state-of-the-art performance on standard web navigation benchmarks.

AI Agents Reinforcement learning Vision

arXiv cs.AI·Jun 17

MathVis-Fine: Aligning Visual Supervision with Necessity via Progressive Dependency-Guided Training for Multimodal Mathematical Reasoning

MathVis-Fine introduces a framework for fine-grained visual dependency modeling in mathematical reasoning. A new dataset augments visual annotations with visual dependency ratings. Two-stage progressive training balances answer correctness and visual grounding rewards according to each sample's intrinsic visual necessity, reducing reward bias.

Reasoning Vision Benchmarks

Le Big Data·Jun 17

Snap's AR glasses are here... but who will really dare to wear them?

Snap launches its consumer AR glasses. The article questions actual product adoption amid competition and social acceptance challenges for users.

Vision

arXiv cs.AI·Jun 16

Forced Deferral: Manipulating Routing Decisions in Multimodal LLM Cascades

Researchers identify a vulnerability in multimodal LLM cascades: an adversarial attack (Forced Deferral Attack) manipulates weak-model confidence to force routing to the strong model, increasing compute costs without targeting answer correctness.

AI safety Vision Benchmarks
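
The cost mechanism behind such an attack can be sketched with a toy confidence-threshold cascade. The threshold, the per-query costs, and the confidence values below are all hypothetical. The sketch shows only why lowering weak-model confidence raises compute cost, not how the paper crafts its perturbations.

```python
def route(weak_confidence, threshold=0.8):
    """Keep the weak model's answer only when it is confident enough."""
    return "weak" if weak_confidence >= threshold else "strong"


def mean_cost(confidences, weak_cost=1.0, strong_cost=20.0, threshold=0.8):
    """Average per-query cost of the cascade."""
    total = 0.0
    for confidence in confidences:
        total += weak_cost  # the weak model always runs first
        if route(confidence, threshold) == "strong":
            total += strong_cost  # deferred queries also pay for the strong model
    return total / len(confidences)


# Hypothetical weak-model confidences on five queries.
clean = [0.95, 0.91, 0.88, 0.60, 0.97]
# Same queries after a perturbation that pushes confidence below the threshold.
attacked = [0.55, 0.62, 0.49, 0.60, 0.58]

clean_cost = mean_cost(clean)
attacked_cost = mean_cost(attacked)
print(clean_cost, attacked_cost)
```

Answer correctness never enters the calculation, which matches the summary's point that the attack targets cost rather than accuracy.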

arXiv cs.AI·Jun 16

Do we have the knowledge we need? Rethinking human-AI decision-making in corporations

Position paper on integrating AI into organizational decision-making. Authors propose a framework to allocate agency between humans and AI systems based on task attributes and knowledge availability. Two manufacturing case studies: visual quality inspection and factory location decisions.

AI Agents Business Vision

arXiv cs.CL·Jun 16

ReportQA: QA-Based Radiology Report Evaluation

ReportQA introduces a QA-based evaluation metric for automated radiology report generation. The framework uses LLMs to extract structured information, generate QA pairs from templates, and evaluate alignment with radiologist judgments. Authors release knowledge trees, structured reports, and code for QA construction and evaluation.

Papers Vision Evals

arXiv cs.CL·Jun 16

Can Agents Read the Room? Benchmarking Visual Social Intelligence in Multimodal Simulation

The AgentViSS benchmark evaluates the visual social intelligence of multimodal agents in social simulations. Its 240 scenarios, 585 roles, and 2,340 instances test whether MLLMs use visual cues (expressions, posture, gaze) to guide interactions. The seven models evaluated show a gap: expression and conflict handling are near saturation, while interaction regulation and visually grounded outcomes remain substantially harder.

Benchmarks Vision AI Agents

arXiv cs.CL·Jun 16

Adapting Reinforcement Learning with Chain-of-Thought Supervision for Explainable Detection of Hateful and Propagandistic Memes

Reinforcement learning post-training method (GRPO) to improve hateful and propagandistic meme detection in thinking-based MLLMs. Gains of 2.1 percentage points on Hateful Memes (79.9%→82.0%) and 7.6 macro-F1 points on ArMeme (0.536→0.612), with chain-of-thought explanations. Code and data publicly released.

Reinforcement learning Reasoning Vision

arXiv cs.LG·Jun 16

QPILOTS: Efficient Test-Time Q-Steering for Flow Policies

QPILOTS optimizes flow-matching and diffusion policies at inference time via Q-steering. The method projects noisy intermediate actions to clean action estimates before evaluating the critic, avoiding numerical instability. Results: 90% success rate across 50 offline-to-online tasks, and outperforms existing approaches on 6 manipulation tasks with frozen VLA models.

Reinforcement learning AI Agents Reasoning

arXiv cs.LG·Jun 16

Unlocking Latent Dimensions: Exploring Representations of Large-Scale X-ray Scattering Data using Variational Autoencoders

Variational Autoencoder (C-VAE) trained on 1.5 million X-ray scattering images to learn low-dimensional representations. Model reveals organized clusters and generates controlled synthetic images. Deployed without retraining across two synchrotron facilities, outperforms DINOv3 in interpretability. Integrated into Latent Space Explorer (MLExchange).

Vision Benchmarks Tools

arXiv cs.AI·Jun 16

Fusion is not one-size-fits-all: Cross-Modal Representation Alignment for Time-to-Event Modeling

Multimodal fusion framework for time-to-event prediction (PE mortality, CVD outcomes) aligning CT and longitudinal EHR representations using foundation models. Four strategies tested (late fusion, contrastive alignment, cross-attention, co-attention) on cohorts of 3,099 and 2,951 patients. Contrastive fusion improves the concordance index by 1.5–5.4% vs. unimodal baselines.

Benchmarks Embeddings Vision
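
The concordance index reported above measures how well predicted risks order patients by time to event. Below is a minimal sketch of Harrell's C-index on hypothetical data. It handles tied risks with half credit and ignores tied event times, and it is not necessarily the exact estimator used in the paper.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: share of comparable pairs ranked correctly."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier event in a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0  # earlier event got the higher risk
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risks count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical follow-up times, event flags (1 = event, 0 = censored), and risks.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]
c_index = concordance_index(times, events, risks)
print(f"C-index = {c_index:.2f}")
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so gains of a few percent over unimodal baselines are measured on that scale.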

arXiv cs.AI·Jun 16

Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning

Visual-Seeker is a multimodal deep search agent that enhances visual reasoning in MLLMs for complex scenarios. The approach uses an active visual reasoning data pipeline and 5K synthetic multimodal trajectories for training. The agent achieves SOTA performance across five multimodal search benchmarks, surpassing some proprietary models.

AI Agents Vision Multi-agent

arXiv cs.AI·Jun 16

QoS-Aware Token Scheduling and Private Data Valuation for Multi-Modal Agentic Networks

Fair token allocation system for decentralized agentic networks. Approach combines multi-modal representations, differentially private prototypes, and reward scheme robust to data heterogeneity. Simulations show improved fairness and QoS, with enhanced resistance to image reconstruction attacks.

AI Agents Multi-agent Vision
