Topic

#Vision

Computer vision is the field of AI that enables machines to analyze and interpret images or videos. GPT-4o, for instance, can describe the content of a photo, read printed text, or identify objects within a scene.

40 Articles
12 Sources
67 Avg. signal
arXiv cs.AI

TIGER: Traceable Inference with Graph-Based Evidence Routing for Mitigating Hallucinations in Multimodal Generation

TIGER is an inference-time framework for mitigating hallucinations in multimodal generation. It independently extracts an observation graph from the input and a claim graph from the output, then assigns each claim a risk score based on how much the observations support or conflict with it. High-risk claims are repaired while the backbone stays frozen. A convergence analysis shows that risk decreases geometrically toward an explicit asymptotic bound.
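The support/conflict scoring idea can be illustrated with a toy example. This is a minimal sketch under loud assumptions, not TIGER's actual method: graphs are reduced to sets of (subject, relation, object) triples, each relation is assumed to allow only one object per subject, and the names `claim_risk` and `high_risk_claims`, the three risk values, and the threshold are all hypothetical.

```python
# Toy illustration only: the triple representation, risk values and threshold
# are assumptions made for this sketch, not details taken from the paper.
Triple = tuple[str, str, str]  # (subject, relation, object)


def claim_risk(claim: Triple, observations: set[Triple]) -> float:
    """Return 0.0 if the claim is supported, 1.0 if contradicted, 0.5 if unverified."""
    if claim in observations:
        return 0.0  # an identical observed triple supports the claim
    subject, relation, _ = claim
    # Assumption: one object per (subject, relation), so a different observed
    # object for the same pair counts as a conflict.
    conflicts = any(s == subject and r == relation for s, r, _ in observations)
    return 1.0 if conflicts else 0.5


def high_risk_claims(
    claims: list[Triple], observations: set[Triple], threshold: float = 0.75
) -> list[Triple]:
    """Select the claims whose risk meets the (hypothetical) repair threshold."""
    return [c for c in claims if claim_risk(c, observations) >= threshold]


observations = {("cat", "color", "black"), ("cat", "on", "sofa")}
claims = [
    ("cat", "color", "black"),  # supported
    ("cat", "color", "white"),  # conflicts with the observed color
    ("dog", "on", "floor"),     # no evidence either way
]
to_repair = high_risk_claims(claims, observations)
```

Here only the contradicted claim is selected for repair; a real system would presumably use softer matching and graded evidence rather than exact triple equality.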

Reasoning, Vision, Papers
SIG 78, HYP 00
arXiv cs.AI

Closed-Loop Neural Activation Control in Vision-Language-Action Models

CTRL-STEER introduces a closed-loop control framework for Vision-Language-Action (VLA) models. Instead of fixed steering coefficients, it adaptively adjusts intervention strength over time using PID or reinforcement learning controllers. Experiments on OpenVLA with LIBERO task suites demonstrate improved concept regulation stability and better steering-task success trade-offs without retraining the base model.
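The contrast between a fixed steering coefficient and a feedback-adjusted one can be sketched with a textbook discrete PID loop. This is an illustration under stated assumptions, not the CTRL-STEER implementation: `PIDController`, the gains, and the linear `concept_response` stand-in for the model are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PIDController:
    """Textbook discrete PID controller; the gains used below are illustrative."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, target: float, measured: float, dt: float = 1.0) -> float:
        """Return the control output for one time step."""
        error = target - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def concept_response(strength: float) -> float:
    """Hypothetical stand-in for the model: concept score is linear in steering strength."""
    return 0.5 * strength


def run_closed_loop(target: float, steps: int = 60) -> list[float]:
    """Measure the concept score each step and let the controller set the next strength."""
    controller = PIDController(kp=0.5, ki=0.5, kd=0.1)
    strength = 0.0
    scores = []
    for _ in range(steps):
        score = concept_response(strength)
        scores.append(score)
        strength = controller.step(target, score)  # adaptive, not a fixed coefficient
    return scores


scores = run_closed_loop(target=1.0)
```

In this toy setup the measured score starts at zero and settles at the target as the integral term accumulates; a real VLA model would respond nonlinearly, which is where gain tuning or a learned controller would matter.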

Vision, AI Agents, Reinforcement learning
SIG 72, HYP 00
arXiv cs.LG

Beyond Augmentation: Score-Guided Pathological Prior for EEG-based Depression Detection

A novel approach to detecting Major Depressive Disorder from EEG without data augmentation. SGC (Score-Guided Classification) uses an unsupervised generative network to model pathological anomalies as a prior, which is fused with deep feature representations. A Cross-Channel Spatial Adaptation module handles channel heterogeneity across multiple centers. The method is validated on the Mumtaz2016 and MODMA datasets.

Papers, Evals, Vision
SIG 72, HYP 00
arXiv cs.AI

Diagnosing Failure Modes of Shared-State Collaboration in Resource-Constrained Visual Agents

CoSee is an auditing framework that analyzes the failure modes of modular visual reasoning systems built around a shared working memory. On 4B–8B models, two dominant failure modes emerge: Noise Reinforcement (reusing ungrounded notes) and Policy Collapse (under-specified answers). The study shows that naive shared workspaces amplify hallucinations unless explicit verification is added.

Vision, AI Agents, Multi-agent
SIG 72, HYP 00
arXiv cs.AI

Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration

SCALE is a self-improving framework for web agents using MLLMs. It employs three adversarial roles (Selector, Predictor, Judger) to autonomously explore agent limitations and expand cognitive boundaries. SCALE-Hop optimizes global planning via graph exploration. A SCALE-20k dataset from 19 real websites with 20k structured demonstrations validates the approach across multiple MLLMs.

AI Agents, Vision, Reinforcement learning
SIG 72, HYP 00
arXiv cs.AI

TRINE: A Token-Aware, Runtime-Adaptive FPGA Inference Engine for Multimodal AI

TRINE is an FPGA accelerator and compiler for end-to-end multimodal inference (ViT, CNN, GNN, transformers) without reconfiguration. It expresses all layers as matrix operations, switches between systolic and SIMD architectures at runtime, and applies in-stream token pruning. On the Alveo U50 and ZCU104, it achieves a 22.57x latency reduction versus an RTX 4090 while consuming 20-21 W.
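Token pruning in general means dropping low-importance tokens so later layers process a shorter sequence. The summary does not say how TRINE scores or selects tokens, so the following is only a generic top-k sketch in software: the function `prune_tokens`, the importance scores, and the `keep_ratio` parameter are all hypothetical.

```python
def prune_tokens(tokens: list[str], scores: list[float], keep_ratio: float = 0.5) -> list[str]:
    """Keep the highest-scoring fraction of tokens, preserving their original order.

    Generic top-k pruning; the scoring source and ratio are assumptions for
    this sketch, not details of TRINE's hardware pipeline.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    keep = max(1, int(len(tokens) * keep_ratio))  # always keep at least one token
    # Indices of the `keep` most important tokens.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    # Restore sequence order so downstream layers see tokens in position order.
    return [tokens[i] for i in sorted(top)]


kept = prune_tokens(["a", "b", "c", "d"], [0.1, 0.9, 0.4, 0.8], keep_ratio=0.5)
# kept == ["b", "d"]
```

Halving the sequence this way reduces the work of every subsequent matrix operation, which is the usual motivation for pruning tokens during inference.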

Vision, Code generation, Infrastructure
SIG 78, HYP 00
GitHub Trending

PaddlePaddle / PaddleOCR

PaddleOCR is a lightweight, multilingual OCR toolkit (100+ languages) designed to convert PDF and image documents into structured data for LLM consumption.

Open source, Vision, Tools
SIG 65, HYP 00
arXiv cs.CL

Analyzing Persona Effects in Generated Explanations from Multimodal LLM Agents in Urban Perception

A study of how personas affect the explanations generated by multimodal LLM agents in urban perception. An analysis of 59,808 annotations from 1,200 persona-conditioned agents finds that captions converge strongly, justifications vary systematically with socioeconomic and political attributes, and perception tags show no significant persona-related differences.

Vision, AI Agents, Prompt engineering
SIG 72, HYP 00