Seeing More with Less:

Human-Like Representations in Vision Models

Andrey Gizdov1,2   Shimon Ullman1,3   Daniel Harari1,3

CVPR 2025 • Spotlight
Paper (PDF) Video Presentation Code (coming soon)

Foveated sampling boosts accuracy of leading LMMs (BLIP-2, LLaVA, ViLT, MDETR, InstructBLIP) by up to 2.7 % under identical pixel budgets.

Only 3 % of the pixels → 80 % of full performance. Scaling curves reveal strong diminishing returns of resolution.

Variable sampling induces human-like global self-attention & resolution-selective neurons in CNNs and transformers.

Enables vision on 5–10 × lower bandwidth for UAVs, IoT cameras & wearables while retaining 80–95 % accuracy.

Abstract – in plain English

The human eye keeps only a thumbnail-sized region in razor-sharp focus, while resolution falls towards the periphery. This mechanism, known as foveation, is a clever way to reduce the amount of visual information our brain needs to process while still capturing fine details and a wide field of view. Modern vision–language models, however, treat every pixel equally, incurring extra compute and memory for information that often matters less.

We explore a simple thought experiment: keep the total pixel count unchanged, but concentrate pixels around a chosen fixation point and reduce them in the periphery, mirroring the retina’s layout.

Foveated Sampling vs. Uniform Downscaling

The two motorbike representations above comprise the same number of pixels, just distributed differently, so we ask a simple question: which representation is better for vision models? The answer is measurable and task-agnostic: with the same 3 % pixel budget, VLMs gain +2.7 % accuracy on visual question answering and +2.2 % on object detection. In short, variable sampling delivers consistent gains even without re-training the models.

Why foveation?

The image below is too large (1) to be transmitted over most network channels in real time, (2) to store in memory, and (3) to process with VLMs.

Full image (14,400 px)
Downscaled image (1,024 px)
While downscaling the image solves the problem, it loses critical details. Existing methods such as dynamic tokenization or token merging can serve as workarounds, but they still require the full image to begin with. This is a problem when the bottleneck of a computer-vision pipeline is network bandwidth, as is the case for long-range communication devices such as UAVs, IoT cameras and wearables. The question then becomes: which pixels should be sent to the receiver? We show that, even without training, one is better off sending pixels sampled in a variable manner than downscaling the image.

How we build the sampling maps

The sampling schemes are defined on the same log-polar coordinate transform \[ (r,\theta)=\Bigl(\log\!\bigl(\sqrt{(x-x_f)^2+(y-y_f)^2}\bigr),\; \arctan\!\frac{y-y_f}{x-x_f}\Bigr), \] where \((x_f,y_f)\) is the fixation point. For \(S_{\mathrm{var}}\) we allocate a fixed quota of samples to concentric annuli whose area decreases linearly with radius, mimicking the widening receptive fields of retinal ganglion cells. For \(S_{\mathrm{uni}}\) each annulus contains the same number of samples, yielding an even pixel density.
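As a concrete illustration, the sketch below builds boolean sampling masks in this spirit: log-spaced annuli around the fixation point with a fixed sample quota per annulus for the variable map, and an evenly strided grid for the uniform map. The annulus spacing, the random within-annulus selection and the function names are illustrative assumptions, not the exact construction used in the paper.

```python
import numpy as np

def variable_sampling_map(h, w, fixation, budget=0.03, n_annuli=16, seed=0):
    """Boolean mask with roughly budget*h*w samples, densest near the fixation point."""
    rng = np.random.default_rng(seed)
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(xx - fixation[0], yy - fixation[1])           # radial distance to fixation
    edges = np.concatenate(([0.0], np.geomspace(1.0, r.max() + 1e-6, n_annuli)))
    quota = int(budget * h * w / n_annuli)                     # fixed sample quota per annulus
    mask = np.zeros((h, w), dtype=bool)
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = np.flatnonzero((r >= lo) & (r < hi))             # pixels falling in this annulus
        take = min(quota, idx.size)                            # inner annuli may hold fewer pixels
        mask.flat[rng.choice(idx, size=take, replace=False)] = True
    return mask

def uniform_sampling_map(h, w, budget=0.03):
    """Boolean mask with an approximately even sample density everywhere."""
    step = int(round(1.0 / np.sqrt(budget)))                   # ~one sample per step x step block
    mask = np.zeros((h, w), dtype=bool)
    mask[::step, ::step] = True
    return mask
```

With the same budget argument, the two masks contain a comparable number of samples, so any difference downstream comes from how those samples are distributed rather than from how many there are.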

We mask the original image \(I(x,y)\) with a sampling map \(S(x,y)\) and reconstruct a full-resolution frame with bilinear interpolation, \[ \hat I = \mathcal I\bigl(I(x,y)\cdot S(x,y)\bigr), \] where \(\mathcal I\) denotes the interpolation operator, so that model architectures remain unchanged. At 3 % density a 720p frame shrinks from 1.3 MB to ~45 kB on the wire.
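A minimal sketch of this mask-and-interpolate step is shown below, assuming a boolean sampling map like the ones above; SciPy's griddata with method="linear" stands in for the bilinear interpolation \(\mathcal I\), with a nearest-neighbour fallback outside the convex hull of the samples.

```python
import numpy as np
from scipy.interpolate import griddata

def reconstruct(image, mask):
    """image: (H, W, C) float array; mask: (H, W) boolean sampling map -> full-resolution frame."""
    h, w = mask.shape
    ys, xs = np.nonzero(mask)                          # coordinates of the sampled pixels
    grid_y, grid_x = np.mgrid[0:h, 0:w]                # full-resolution target grid
    channels = []
    for c in range(image.shape[2]):
        vals = image[ys, xs, c]
        lin = griddata((ys, xs), vals, (grid_y, grid_x), method="linear")
        near = griddata((ys, xs), vals, (grid_y, grid_x), method="nearest")
        channels.append(np.where(np.isnan(lin), near, lin))   # fill gaps outside the hull
    return np.stack(channels, axis=-1)
```

For a 720p frame, something like `reconstruct(frame, variable_sampling_map(720, 1280, (640, 360)))` yields the \(\hat I\) that is passed to the unchanged model.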

Foveated Sampling (variable-size receptive fields) vs. Uniform Downscaling (uniform-size receptive fields)
Foveated Sampling (variable sampling) vs. Uniform Downscaling (uniform sampling)

Main experiments with a 3 % pixel budget

We evaluate three vision tasks across seven representative models.

Visual-QA results (3 % pixel budget)

Foveated inputs outperform uniform downscaling across all benchmarks. STables 2–6 (supplementary) give full breakdowns by question category; a snapshot for ViLT and BLIP-2 on VQAv2 is shown below.

Does the fixation point matter?

We shift the fovea 100 px toward each corner. Accuracy varies by < 0.5 % (STable 1). Conclusion: performance stems from variable density—not from a lucky fixation bias.

STable 1. Performance accuracy comparison between variable central fixation and corner fixations across different models and datasets

| Model | #Total Params | Dataset | Variable | Variable Std. | Uniform |
|---|---|---|---|---|---|
| MDETR-ResNet101-RoBERTa | 169M | GQA | 46.79% | ±0.01% | 44.13% |
| BLIP-2-FlanT5XL | 3.4B | GQA | 42.27% | ±0.21% | 40.72% |
| BLIP-2-FlanT5XL | 3.4B | VQAv2 | 57.89% | ±0.46% | 56.19% |
| InstructBLIP-FlanT5XL | 4B | VQAv2 | 66.37% | ±0.56% | 66.48% |
| ViLT-B/32 | 87.4M | VQAv2 | 64.90% | ±0.82% | 63.01% |
| LLaVa-v1.5 | 13B | VQAv2 | 65.91% | ±0.75% | 65.14% |

Object detection & segmentation

On GQA we observe a consistent +2–3 pp gain in mAP50:95. To isolate resolution effects we also evaluate only those objects whose masks cover an equal number of samples in both schemes (a sample-equalised COCO subset). The table below shows that variable sampling still edges out uniform sampling, especially for small objects.

Object Detection Results: Performance comparison across different sampling strategies

| Model | Sampling | AR | AR_S (small) | AR_M (medium) | AR_L (large) |
|---|---|---|---|---|---|
| DETR-R101 | Baseline | 54.8 | 15.3 | 51.9 | 72.0 |
| DETR-R101 | Uniform | 37.1 | 1.2 | 28.0 | 51.6 |
| DETR-R101 | Variable | 38.5 | 2.1 | 31.0 | 54.7 |
| Mask R-CNN-R101 | Baseline | 52.5 | 25.6 | 49.7 | 64.2 |
| Mask R-CNN-R101 | Uniform | 34.7 | 1.6 | 27.7 | 48.2 |
| Mask R-CNN-R101 | Variable | 36.9 | 3.4 | 31.7 | 48.6 |

STable 7 (supplementary) provides additional detailed breakdowns by object size and category.

Human-like representations

Transformers. Average attention distance grows by 30 % in foveated inputs, paralleling the long-range lateral connections of V1. CNNs. A permutation test on 5 k tensors (see §5 in the supplement) rejects \(H_0\) (\(p < 10^{-3}\)): neurons specialise for high- vs low-resolution inputs.
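For readers unfamiliar with the procedure, here is a generic sketch of a permutation test of this kind, assuming per-neuron activation samples collected on high- and low-resolution inputs; the statistic (absolute difference of mean activations) and the sampling details are illustrative assumptions rather than the paper's exact protocol.

```python
import numpy as np

def permutation_test(acts_high, acts_low, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for a difference in mean activation."""
    rng = np.random.default_rng(seed)
    observed = abs(acts_high.mean() - acts_low.mean())       # statistic on the real labelling
    pooled = np.concatenate([acts_high, acts_low])
    n_high = len(acts_high)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                                   # break the high/low labelling
        hits += abs(pooled[:n_high].mean() - pooled[n_high:].mean()) >= observed
    return (hits + 1) / (n_perm + 1)                          # add-one correction avoids p = 0
```

Under this scheme, a neuron whose p-value falls below the chosen significance threshold would be counted as resolution-selective.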

Attention maps & neuron specialisation

Edge-compute impact

A LoRa-equipped drone is capped at ≈ 70 kbit/s. Our 3 % pipeline streams a 720p scene at ≈ 45 kbit/s while retaining ~80 % of full-resolution accuracy, enabling AI-driven control at minimal bandwidth.

Limitations & future work