Please check out our fast forward presentation on July 19th and our paper presentation on July 23rd at 2:50 pm in Los Angeles!
Ultra-high-resolution image sensors offer the potential to capture fine spatial details critical for many visual perception tasks, but acquiring and processing all pixels at full resolution is often infeasible under realistic bandwidth, latency, and power constraints. Existing approaches address this challenge through acquisition strategies such as spatial or temporal downsampling, which irrevocably discard information before task relevance can be assessed. This results in degraded performance across multiple perception tasks, including object tracking, text recognition, and robotic manipulation.
In this work, we introduce a real-time, predictive, and task-aware foveated imaging system that operates directly at image acquisition time. Leveraging emerging dual-stream sensor architectures, our method dynamically allocates limited pixel bandwidth to task-relevant regions of interest while maintaining a low-resolution global context. We formulate foveated acquisition as a sensor attention policy-learning problem, in which past observations guide actions that determine future measurements, closing the perception-acquisition loop.
Through extensive simulation across multiple perception tasks, we demonstrate that our approach achieves high task performance under strict pixel budgets and significantly outperforms relevant baselines operating at the same bandwidth. We further validate our system on a 200-megapixel dual-stream sensor, capturing real-world videos under realistic bandwidth and latency constraints, demonstrating the practical feasibility of task-driven, acquisition-time foveated imaging.
Rather than learning the sensor attention policy end-to-end from raw pixels, we decompose the problem into three lightweight, interpretable components: (i) a saliency detector, (ii) a motion model, and (iii) a scanpath selection policy. This modular design enables real-time inference on edge hardware and avoids the instability and latency of monolithic policies.
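The three-stage decomposition above can be pictured with a minimal Python sketch. Everything in it is an illustrative assumption rather than the paper's implementation: the saliency detector is a brightest-pixel argmax on the low-resolution context frame, the motion model is a constant-velocity extrapolation, and the scanpath policy simply centers a fixed-size ROI on the predicted location, clamped to the sensor bounds. The function names (`detect_salient_point`, `predict_next`, `select_roi`, `run_policy`) are hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]      # (x, y) in full-resolution pixel coordinates
ROI = Tuple[int, int, int, int]  # (x0, y0, width, height)


def detect_salient_point(context: List[List[float]], scale: int) -> Point:
    """Stand-in saliency detector (assumption): brightest pixel of the
    low-resolution context frame, mapped to full-resolution coordinates."""
    best_val, best_rc = float("-inf"), (0, 0)
    for r, row in enumerate(context):
        for c, val in enumerate(row):
            if val > best_val:
                best_val, best_rc = val, (r, c)
    r, c = best_rc
    # Use the center of the low-resolution pixel.
    return ((c + 0.5) * scale, (r + 0.5) * scale)


def predict_next(prev: Optional[Point], curr: Point) -> Point:
    """Stand-in motion model (assumption): constant velocity, one frame ahead."""
    if prev is None:
        return curr
    return (2 * curr[0] - prev[0], 2 * curr[1] - prev[1])


def select_roi(center: Point, roi_size: Tuple[int, int],
               sensor_size: Tuple[int, int]) -> ROI:
    """Stand-in scanpath policy (assumption): center the ROI on the
    prediction and clamp it so it stays inside the sensor."""
    w, h = roi_size
    sw, sh = sensor_size
    x0 = int(round(center[0] - w / 2))
    y0 = int(round(center[1] - h / 2))
    x0 = max(0, min(x0, sw - w))
    y0 = max(0, min(y0, sh - h))
    return (x0, y0, w, h)


def run_policy(context_frames: List[List[List[float]]], scale: int,
               roi_size: Tuple[int, int],
               sensor_size: Tuple[int, int]) -> List[ROI]:
    """Closed loop: each context frame determines the ROI requested next."""
    rois: List[ROI] = []
    prev: Optional[Point] = None
    for context in context_frames:
        curr = detect_salient_point(context, scale)
        target = predict_next(prev, curr)
        rois.append(select_roi(target, roi_size, sensor_size))
        prev = curr
    return rois
```

Keeping the stages as separate functions mirrors the modularity argument: any one of them could be swapped for a learned component without touching the other two.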
Please check our paper for further method details.
We demonstrate both smooth-pursuit scanpaths for object tracking and saccadic scanpaths for scene text recognition. The predictive attention policy consistently directs high-resolution sensing to task-relevant regions that remain indistinguishable in the global context stream. This enables recovery of fine spatial details such as object boundaries and textures under real-world lighting conditions and sensor noise.
Our policy-based foveated imaging consistently outperforms task-agnostic baselines and, in some cases, matches full-resolution performance while using less than one-eighth of the pixel bandwidth.
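To make the bandwidth claim concrete, the following sketch computes the fraction of full-resolution pixels that a dual-stream readout would transmit per frame. The stream layout (one context stream uniformly downsampled by an integer factor plus one full-resolution ROI) and the function name `pixel_budget_fraction` are assumptions for illustration, not the sensor configuration used in the paper.

```python
def pixel_budget_fraction(sensor_w: int, sensor_h: int, context_factor: int,
                          roi_w: int, roi_h: int) -> float:
    """Fraction of full-resolution pixels read out per frame, assuming one
    downsampled context stream plus one full-resolution ROI (hypothetical layout)."""
    if context_factor < 1:
        raise ValueError("context_factor must be >= 1")
    if roi_w > sensor_w or roi_h > sensor_h:
        raise ValueError("ROI must fit inside the sensor")
    full_pixels = sensor_w * sensor_h
    # Context stream: every context_factor-th pixel along each axis.
    context_pixels = (sensor_w // context_factor) * (sensor_h // context_factor)
    roi_pixels = roi_w * roi_h
    return (context_pixels + roi_pixels) / full_pixels
```

For a hypothetical 16384 × 12288 array (about 201 megapixels), a 4× downsampled context stream plus a 2048 × 2048 ROI gives `pixel_budget_fraction(16384, 12288, 4, 2048, 2048)`, which evaluates to 1/12 and therefore stays under a one-eighth budget.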
| Task | Metric | Full-resolution | GT Oracle | Spatial downsampling | Temporal downsampling | Foveated (Ours) |
|---|---|---|---|---|---|---|
| Object Tracking | IoU ↑ | 0.281 | 0.405 | 0.122 | 0.148 | 0.283 |
| Text Recognition | Transcription Rate ↑ | 0.333 | 0.271 | 0.067 | 0.248 | 0.264 |
| Robotic Manipulation | Success Rate ↑ (Complete / Partial) | 0.15 / 0.61 | N/A | 0.10 / 0.51 | 0.07 / 0.30 | 0.12 / 0.57 |
Quantitative results of policy-based foveated perception. We compare downstream task performance of our method against same-pixel-bandwidth downsampling baselines and include full-resolution and oracle performance with ground-truth (GT) ROI selections. Row 1: downstream soccer ball tracking Intersection-over-Union (IoU) of the baseline methods and our approach. Temporal downsampling IoU is computed only on kept frames. Row 2: fraction of distinct text objects correctly detected and transcribed in the road scene text recognition task. Row 3: complete and partial task success rates over 100 trials of the ALOHA insertion task. The GT Oracle is not applicable because no ground-truth foveation labels are available for this task. Our approach outperforms all relevant baselines at comparable bandwidth.
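For reference, the tracking metric in Row 1 is the standard box Intersection-over-Union. The sketch below shows one common way to compute it for axis-aligned boxes and to average it over only the frames that a temporal-downsampling baseline keeps. The `(x0, y0, x1, y1)` box format, the helper names, and the keep-every-k-th-frame rule are assumptions, since the evaluation code is not given here.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) with x1 >= x0, y1 >= y0


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes contribute no intersection.
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou_on_kept_frames(preds: Sequence[Box], gts: Sequence[Box],
                            keep_every: int = 1) -> float:
    """Mean IoU over frames 0, k, 2k, ... (assumed temporal-downsampling rule);
    dropped frames are ignored rather than scored as zero."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if keep_every < 1:
        raise ValueError("keep_every must be >= 1")
    scores = [box_iou(preds[i], gts[i]) for i in range(0, len(preds), keep_every)]
    return sum(scores) / len(scores) if scores else 0.0
```

Because dropped frames are excluded from the average, a score computed this way says nothing about how well the target is localized between kept frames.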
Please see our paper for more detailed analysis and further evaluations.
@inproceedings{xiao2026foveated,
author = {Xiao, Howard and Ackermann, Jan and Deng, Boyang and Wetzstein, Gordon},
title = {Policy-based Foveated Imaging and Perception},
year = {2026},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
booktitle = {Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers (SIGGRAPH Conference Papers '26), July 19--23, 2026, Los Angeles, CA, USA},
numpages = {11},
location = {Los Angeles, CA, USA},
series = {SIGGRAPH Conference Papers '26}
}