Seeing More with Less:

Human-Like Representations in Vision Models

Andrey Gizdov1,2   Shimon Ullman1,3   Daniel Harari1,3

CVPR 2025 • Spotlight
Paper (PDF) Video Presentation Code (coming soon)

Foveated sampling boosts accuracy of leading LMMs (BLIP-2, LLaVA, ViLT, MDETR, InstructBLIP) by up to 2.7 % under identical pixel budgets.

Only 3 % of the pixels → 80 % of full performance. Scaling curves reveal strong diminishing returns of resolution.

Variable sampling induces human-like global self-attention & resolution-selective neurons in CNNs and transformers.

Enables vision on 5–10 × lower bandwidth for UAVs, IoT cameras & wearables while retaining 80–95 % accuracy.

Abstract – in plain English

The human eye keeps only a thumbnail-sized region in razor-sharp focus, while resolution falls towards the periphery. This mechanism, known as foveation, is a clever way to reduce the amount of visual information our brain needs to process while still capturing fine details and a wide field of view. Modern vision–language models, however, treat every pixel equally, incurring extra compute and memory for information that often matters less.

We explore a simple thought experiment: keep the total pixel count unchanged, but concentrate pixels around a chosen fixation point and reduce them in the periphery, mirroring the retina’s layout.

Foveated Sampling vs. Uniform Downscaling

The two motorbike representations above comprise the same number of pixels, just distributed differently, so we ask a simple question: which representation is better for vision models? The answer is measurable and task-agnostic: with the same 3 % pixel budget, VLMs gain +2.7 % accuracy on visual question answering and +2.2 % on object detection. In short, variable sampling delivers consistent gains even without re-training the models.

Why foveation?

The image below is too large (1) to be transmitted over most network channels in real time, (2) to store in memory, and (3) to process with VLMs.

Full image (14,400 px)
Downscaled image (1,024 px)
While downscaling the image solves the problem, it loses critical details. Existing methods such as dynamic tokenization or token merging can serve as workarounds, but they still require the full image to begin with. This is a problem when the bottleneck of a computer-vision pipeline is network bandwidth, as is the case for long-range communication devices such as UAVs, IoT cameras and wearables. The question then becomes: which pixels should be sent to the receiver? We show that, even without training, one is better off sending pixels sampled in a variable manner than downscaling the image.

How we build the sampling maps

The sampling schemes are defined on the same log-polar coordinate transform \[ (r,\theta)=\Bigl(\log\!\bigl(\sqrt{(x-x_f)^2+(y-y_f)^2}\bigr),\; \arctan\!\frac{y-y_f}{x-x_f}\Bigr), \] where \((x_f,y_f)\) is the fixation point. For \(S_{\mathrm{var}}\) we allocate a fixed quota of samples to concentric annuli whose area decreases linearly with radius, mimicking the widening receptive fields of retinal ganglion cells. For \(S_{\mathrm{uni}}\) each annulus contains the same number of samples, yielding an even pixel density.
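As a concrete illustration, the sketch below builds boolean sampling masks in this spirit: log-spaced annuli around the fixation point with a fixed sample quota per annulus for the variable map, and an evenly strided grid for the uniform map. The annulus spacing, the random within-annulus selection and the function names are illustrative assumptions, not the exact construction used in the paper.

```python
import numpy as np

def variable_sampling_map(h, w, fixation, budget=0.03, n_annuli=16, seed=0):
    """Boolean mask with roughly budget*h*w samples, densest near the fixation point."""
    rng = np.random.default_rng(seed)
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(xx - fixation[0], yy - fixation[1])           # radial distance to fixation
    edges = np.concatenate(([0.0], np.geomspace(1.0, r.max() + 1e-6, n_annuli)))
    quota = int(budget * h * w / n_annuli)                     # fixed sample quota per annulus
    mask = np.zeros((h, w), dtype=bool)
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = np.flatnonzero((r >= lo) & (r < hi))             # pixels falling in this annulus
        take = min(quota, idx.size)                            # inner annuli may hold fewer pixels
        mask.flat[rng.choice(idx, size=take, replace=False)] = True
    return mask

def uniform_sampling_map(h, w, budget=0.03):
    """Boolean mask with an approximately even sample density everywhere."""
    step = int(round(1.0 / np.sqrt(budget)))                   # ~one sample per step x step block
    mask = np.zeros((h, w), dtype=bool)
    mask[::step, ::step] = True
    return mask
```

With the same budget argument, the two masks contain a comparable number of samples, so any difference downstream comes from how those samples are distributed rather than from how many there are.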

We mask the original image \(I(x,y)\) with a sampling map \(S(x,y)\) and reconstruct a full-resolution frame with bilinear interpolation, \[ \hat I = \mathcal I\bigl(I(x,y)\cdot S(x,y)\bigr), \] where \(\mathcal I\) denotes the interpolation operator, so that model architectures remain unchanged. At 3 % density a 720p frame shrinks from 1.3 MB to ~45 kB on the wire.
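A minimal sketch of this mask-and-interpolate step is shown below, assuming a boolean sampling map like the ones above; SciPy's griddata with method="linear" stands in for the bilinear interpolation \(\mathcal I\), with a nearest-neighbour fallback outside the convex hull of the samples.

```python
import numpy as np
from scipy.interpolate import griddata

def reconstruct(image, mask):
    """image: (H, W, C) float array; mask: (H, W) boolean sampling map -> full-resolution frame."""
    h, w = mask.shape
    ys, xs = np.nonzero(mask)                          # coordinates of the sampled pixels
    grid_y, grid_x = np.mgrid[0:h, 0:w]                # full-resolution target grid
    channels = []
    for c in range(image.shape[2]):
        vals = image[ys, xs, c]
        lin = griddata((ys, xs), vals, (grid_y, grid_x), method="linear")
        near = griddata((ys, xs), vals, (grid_y, grid_x), method="nearest")
        channels.append(np.where(np.isnan(lin), near, lin))   # fill gaps outside the hull
    return np.stack(channels, axis=-1)
```

For a 720p frame, something like `reconstruct(frame, variable_sampling_map(720, 1280, (640, 360)))` yields the \(\hat I\) that is passed to the unchanged model.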

Foveated Sampling (variable-size receptive fields) vs. Uniform Downscaling (uniform-size receptive fields)
Foveated Sampling (variable sampling) vs. Uniform Downscaling (uniform sampling)

Main experiments with a 3 % pixel budget

We evaluate three vision tasks across seven representative models.

Visual-QA results (3 % pixel budget)

Foveated inputs outperform uniform downscaling across all benchmarks. STables 2–6 (supplementary) give full breakdowns by question category; a snapshot for ViLT and BLIP-2 on VQAv2 is shown below.

Does the fixation point matter?

We shift the fovea 100 px toward each corner. Accuracy varies by < 0.5 % (STable 1). Conclusion: performance stems from variable density—not from a lucky fixation bias.

STable 1. Performance accuracy comparison between variable central fixation and corner fixations across different models and datasets

| Model | #Total Params | Dataset | Variable | Variable Std. | Uniform |
|---|---|---|---|---|---|
| MDETR-ResNet101-RoBERTa | 169M | GQA | 46.79% | ±0.01% | 44.13% |
| BLIP-2-FlanT5XL | 3.4B | GQA | 42.27% | ±0.21% | 40.72% |
| BLIP-2-FlanT5XL | 3.4B | VQAv2 | 57.89% | ±0.46% | 56.19% |
| InstructBLIP-FlanT5XL | 4B | VQAv2 | 66.37% | ±0.56% | 66.48% |
| ViLT-B/32 | 87.4M | VQAv2 | 64.90% | ±0.82% | 63.01% |
| LLaVa-v1.5 | 13B | VQAv2 | 65.91% | ±0.75% | 65.14% |

Object detection & segmentation

On GQA we observe a consistent +2–3 pp gain in mAP50:95. To isolate resolution effects we also evaluate only those objects whose masks cover an equal number of samples in both schemes (a sample-equalised COCO subset). The table below shows that variable sampling still edges out uniform sampling, especially for small objects.

Object Detection Results: Performance comparison across different sampling strategies

| Model | Sampling | AR | AR_S (small) | AR_M (medium) | AR_L (large) |
|---|---|---|---|---|---|
| DETR-R101 | Baseline | 54.8 | 15.3 | 51.9 | 72.0 |
| DETR-R101 | Uniform | 37.1 | 1.2 | 28.0 | 51.6 |
| DETR-R101 | Variable | 38.5 | 2.1 | 31.0 | 54.7 |
| Mask R-CNN-R101 | Baseline | 52.5 | 25.6 | 49.7 | 64.2 |
| Mask R-CNN-R101 | Uniform | 34.7 | 1.6 | 27.7 | 48.2 |
| Mask R-CNN-R101 | Variable | 36.9 | 3.4 | 31.7 | 48.6 |

STable 7 (supplementary) provides additional detailed breakdowns by object size and category.

Human-like representations

Transformers. Average attention distance grows by 30 % in foveated inputs, paralleling the long-range lateral connections of V1. CNNs. A permutation test on 5 k tensors (see §5 in the supplement) rejects \(H_0\) (\(p < 10^{-3}\)): neurons specialise for high- vs low-resolution inputs.
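For readers unfamiliar with the procedure, here is a generic sketch of a permutation test of this kind, assuming per-neuron activation samples collected on high- and low-resolution inputs; the statistic (absolute difference of mean activations) and the sampling details are illustrative assumptions rather than the paper's exact protocol.

```python
import numpy as np

def permutation_test(acts_high, acts_low, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for a difference in mean activation."""
    rng = np.random.default_rng(seed)
    observed = abs(acts_high.mean() - acts_low.mean())       # statistic on the real labelling
    pooled = np.concatenate([acts_high, acts_low])
    n_high = len(acts_high)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                                   # break the high/low labelling
        hits += abs(pooled[:n_high].mean() - pooled[n_high:].mean()) >= observed
    return (hits + 1) / (n_perm + 1)                          # add-one correction avoids p = 0
```

Under this scheme, a neuron whose p-value falls below the chosen significance threshold would be counted as resolution-selective.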

Attention maps & neuron specialisation

Edge-compute impact

A LoRa-equipped drone is capped at ≈ 70 kbit/s. Our 3 % pipeline streams a 720p scene at ≈ 45 kbit/s while retaining ~80 % of full-resolution accuracy, enabling AI-driven control at minimal bandwidth.

Limitations & future work