Vision-Language Models (VLMs) typically assume uniform spatial fidelity across the entire field of view of their visual inputs, dedicating equal precision even to uninformative regions. By contrast, human vision is neither uniform nor static: it is adaptive, selective, and resource-efficient. In light of this, we present the first systematic analysis of bio-inspired visual representation methods, providing insights toward more efficient and adaptive VLMs. We propose LLMind (Looking Like the Mind), a novel training-free framework that mimics foveated encoding and cortical magnification in human vision to achieve adaptive, efficient representations for VLMs under tight pixel budgets. Our key idea is a Bio-inspired Adaptive Sampling Strategy (BASS): a Möbius-parameterized module that performs non-uniform sampling while preserving global scene structure. On top of BASS, we introduce Closed-loop Semantic Feedback (CSF) via test-time adaptation to align perceptual saliency with textual information from the frozen VLM. We evaluate LLMind against uniform and other sampling baselines across diverse scene-level and region-guided visual question answering benchmarks. The results show substantial gains, with average improvements of +20% on VQAv2, +38% on Seed-Bench, and +37% on A-OKVQA over uniform sampling under tight pixel budgets. More surprisingly, LLMind retains up to 82%, 92%, and 97% of full-resolution performance using only 1%, 3%, and 5% of the pixels, respectively. Moreover, LLMind is lightweight, plug-and-play, and compatible with existing VLMs without architectural changes.
Core Question
Can biologically inspired sampling strategies enable VLMs to achieve higher reasoning efficiency and accuracy than conventional uniform sampling under limited pixel budgets?
Motivation
Modern Vision-Language Models typically process images with uniform spatial sampling, allocating equal resolution to every region regardless of its relevance. Human vision operates differently: it concentrates high resolution in a small foveal region while maintaining coarse peripheral awareness, dynamically shifting attention to informative parts of a scene. Inspired by this principle, LLMind introduces a bio-inspired adaptive sampling strategy that redistributes spatial resolution across the image, magnifying semantically important regions and compressing less informative areas to enable efficient reasoning under strict pixel budgets.
Overview
Framework Summary
LLMind is a training-free adaptive visual representation framework designed to improve reasoning efficiency in Vision-Language Models (VLMs) under strict pixel budgets. The framework consists of two key components that iteratively refine the sampling parameters based on semantic feedback from the frozen VLM during inference.
Component 01
Bio-inspired Adaptive Sampling Strategy (BASS)
BASS dynamically redistributes spatial resolution across the input image. Using a Möbius transformation, LLMind magnifies task-relevant regions while compressing peripheral content.
This directly reflects the cortical magnification principle in human vision, where important visual stimuli occupy a larger representational space.
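The cortical-magnification idea can be sketched with a toy Möbius warp. The snippet below is an illustrative sketch, not the paper's implementation: the function name `mobius_sample`, the unit-disk parameterization, and the nearest-neighbour lookup are all assumptions. It builds a sampling grid from a disk automorphism so that output pixels are densest around a chosen fixation point:

```python
import numpy as np

def mobius_sample(img, focus, out_size):
    """Resample `img` so the region around `focus` (x, y in (-1, 1))
    receives more output pixels; the periphery is compressed."""
    h, w = img.shape[:2]
    a = complex(focus[0], focus[1])          # fixation point inside the unit disk
    # Uniform grid over the output, read as complex coordinates in [-1, 1]^2.
    xs = np.linspace(-1.0, 1.0, out_size)
    gx, gy = np.meshgrid(xs, xs)
    z = gx + 1j * gy
    r = np.abs(z)
    z = np.where(r > 1.0, z / r, z)          # clip corners onto the unit disk
    # Inverse disk automorphism: the output centre pulls samples from `focus`,
    # so sampling density is highest there (cortical magnification).
    src = (z + a) / (1.0 + np.conj(a) * z)
    # Convert back to pixel indices; nearest-neighbour gather for simplicity.
    col = np.clip(np.round((src.real + 1) / 2 * (w - 1)).astype(int), 0, w - 1)
    row = np.clip(np.round((src.imag + 1) / 2 * (h - 1)).astype(int), 0, h - 1)
    return img[row, col]
```

Because a Möbius map is smooth and invertible on the disk, the warp magnifies the fixation region without tearing the scene apart, which matches the "non-uniform sampling while preserving global scene structure" property described above.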
Component 02
Closed-loop Semantic Feedback (CSF)
CSF adds an inference-time semantic feedback loop. Instead of relying only on perceptual similarity, LLMind evaluates predicted answers and uses that signal to adjust sampling parameters.
Because the VLM is treated as a black box, gradients are estimated with Simultaneous Perturbation Stochastic Approximation (SPSA).
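For reference, a minimal SPSA update looks like the following. This is a generic textbook sketch, not the paper's configuration: the helper name `spsa_step` and the step sizes `lr` and `c` are illustrative.

```python
import numpy as np

def spsa_step(params, loss_fn, lr=0.05, c=0.01, rng=None):
    """One SPSA update: approximate the gradient of a black-box loss
    with just two evaluations, independent of the parameter dimension."""
    rng = rng or np.random.default_rng()
    delta = rng.choice([-1.0, 1.0], size=params.shape)   # Rademacher perturbation
    # Finite difference along the random direction; dividing by delta_i
    # equals multiplying, since each delta_i is +/-1.
    g_hat = (loss_fn(params + c * delta) - loss_fn(params - c * delta)) / (2 * c) * delta
    return params - lr * g_hat
```

The appeal for a frozen VLM is that only two forward passes are needed per update, regardless of how many sampling parameters are being tuned; no gradients ever flow through the model.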
LLMind Pipeline
Given an image and a question, BASS first generates an adaptive sampling representation under a predefined pixel budget, which is then reconstructed to the original resolution and fed to the frozen VLM for reasoning. The predicted answers are subsequently used by CSF to iteratively update the sampling parameters, progressively refining the visual representation so that more resolution is allocated to question-relevant regions.
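The loop above can be sketched end-to-end. Every name here is hypothetical: `vlm_score` stands in for a scalar feedback signal derived from the frozen VLM's predicted answer, `sample` stands in for the BASS warp, and simple hill climbing replaces CSF's SPSA updates to keep the sketch short.

```python
import numpy as np

def llmind_loop(image, question, sample, vlm_score, steps=200, sigma=0.1, seed=0):
    """Closed-loop refinement sketch: propose a perturbed fixation point,
    re-sample the image under the pixel budget, query the frozen VLM,
    and keep the parameters whenever the feedback score improves."""
    rng = np.random.default_rng(seed)
    params = np.zeros(2)                       # e.g. the warp's fixation point
    best = vlm_score(sample(image, params), question)
    for _ in range(steps):
        cand = params + rng.normal(scale=sigma, size=2)
        score = vlm_score(sample(image, cand), question)
        if score > best:                       # resolution follows the feedback
            params, best = cand, score
    return sample(image, params), params
```

The key structural point is that the image representation, not the model, is what gets optimized at test time: each iteration re-allocates the fixed pixel budget toward whatever region raises the feedback score.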
Quantitative Results
VQA Accuracy Under Pixel Constraints
Performance comparison across VQAv2, A-OKVQA, and Seed-Bench at 1%, 3%, and 5% pixel budgets. Each card reports one VLM, with LLMind (Ours) highlighted against alternative sampling strategies.
Uniform Sampling
Static Foveated
Sunflower Inspired
Radial Sampling
LLMind (Ours)
Qualitative Results
Optimization Dynamics
The video below illustrates how LLMind iteratively refines its sampling strategy during inference, progressively reallocating visual resolution toward question-relevant regions as optimization proceeds.
Category-wise Performance
Comparison of LLMind against Uniform Sampling across A-OKVQA question categories using Qwen2.5-VL under three pixel budgets. The plots below show how performance changes category by category as the visual budget increases.
Prediction Outcomes
The panel below compares the distribution of correct, wrong, and partially correct predictions between Uniform Sampling and LLMind across different pixel budgets using Qwen3-VL on the LVIS dataset.
Contributions
01
Inspired by neuroscience-grounded principles, we identify a fundamental limitation in current VLM visual representations and conduct the first comprehensive analysis of visual representation strategies in VLMs.
02
We propose BASS, a bio-inspired sampling strategy that dynamically reallocates the pixel budget toward perceptually and semantically salient regions, mimicking human visual perception.
03
We introduce CSF, a training-free test-time optimization mechanism that aligns visual perception with task-driven reasoning and is compatible with both white-box and black-box VLMs.
04
We validate the proposed framework on standard VQA benchmarks, demonstrating consistent improvements in reasoning accuracy under constrained pixel budgets for both scene-level and region-guided settings.
Citation
@misc{debnath2026llmindbioinspiredtrainingfreeadaptive,
title={LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models},
author={Soumyaratna Debnath and Bui Duc Manh and Zinan Liu and Lin Wang},
year={2026},
eprint={2603.14882},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.14882},
}