Policy-based Foveated Imaging and Perception

Please check out our fast-forward presentation on July 19 and our paper presentation on July 23 at 2:50 pm in Los Angeles!

Dual-Stream Ultra-High-Resolution Sensors + Real-Time, Task-Aware Sensor Attention Policy = Foveated Imaging And Perception
Figure panels: Rapid Sensor Resolution Growth; Low-Resolution Global Context (low-resolution global field of view, ~1 GP scene); Foveated Full-Resolution Region of Interest (ROI).
Image credit: Turgot map of Paris (1739), Norman B. Leventhal Map Center, via Wikimedia Commons.

Ultra-high-resolution sensors capture fine details critical for visual perception, but acquiring all pixels at full resolution is often infeasible under realistic constraints. We introduce a real-time, task-aware foveated imaging framework that learns a sensor attention policy to dynamically allocate pixel bandwidth, extensively validated in simulation and on a 200-megapixel prototype.


Abstract

Ultra-high-resolution image sensors offer the potential to capture fine spatial details critical for many visual perception tasks, but acquiring and processing all pixels at full resolution is often infeasible under realistic bandwidth, latency, and power constraints. Existing approaches address this challenge through acquisition strategies such as spatial or temporal downsampling, which irrevocably discard information before task relevance can be assessed. This results in degraded performance across multiple perception tasks, including object tracking, text recognition, and robotic manipulation.

Figure: Full Resolution Reference compared with Naive Spatial Downsampling and Naive Temporal Downsampling, each shown at 1/16 x Full Bandwidth (bandwidth settings: 1/16, 1/8, 1/4, 1/2, 1).
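To make the two naive baselines concrete, here is a minimal sketch of spatial and temporal downsampling at a fixed bandwidth fraction. The function names, the nested-list frame representation, and the toy video are illustrative assumptions and are not taken from the paper.

```python
import math

def spatial_downsample(frame, fraction):
    """Keep every k-th pixel along each axis so roughly `fraction` of the pixels remain.

    The stride is rounded, so the result is exact only when 1/fraction is a
    perfect square (for example 1/4 or 1/16).
    """
    stride = max(1, round(1 / math.sqrt(fraction)))
    return [row[::stride] for row in frame[::stride]]

def temporal_downsample(frames, fraction):
    """Keep every k-th frame at full resolution so roughly `fraction` of the frames remain."""
    step = max(1, round(1 / fraction))
    return frames[::step]

def pixel_count(frames):
    """Total number of pixels across a list of frames."""
    return sum(len(row) for frame in frames for row in frame)

# Toy video (assumption): 32 frames of 8 x 8 pixels, 2048 pixels in total.
video = [[[t] * 8 for _ in range(8)] for t in range(32)]

spatial = [spatial_downsample(frame, 1 / 16) for frame in video]
temporal = temporal_downsample(video, 1 / 16)
```

Both baselines spend the same pixel budget here (1/16 of the full video), but each decides what to discard before any task relevance is known: the spatial variant loses fine detail in every frame, and the temporal variant loses most frames entirely.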

In this work, we introduce a real-time, predictive, and task-aware foveated imaging system that operates directly at image acquisition time. Leveraging emerging dual-stream sensor architectures, our method dynamically allocates limited pixel bandwidth to task-relevant regions of interest while maintaining a low-resolution global context. We formulate foveated acquisition as a sensor attention policy-learning problem, in which past observations guide actions that determine future measurements, closing the perception-acquisition loop.
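As a rough illustration of the dual-stream idea, the sketch below reads one frame as a strided low-resolution context plus a full-resolution ROI crop and checks the combined pixel count against a budget. The function names, the `(top, left, height, width)` ROI convention, and the toy sizes are assumptions made for this example; the actual sensor readout is described in the paper.

```python
def dual_stream_capture(frame, roi, context_stride):
    """Return (context, fovea) for one frame.

    context: every `context_stride`-th pixel along each axis (global view).
    fovea:   the full-resolution crop given by roi = (top, left, height, width).
    """
    top, left, height, width = roi
    context = [row[::context_stride] for row in frame[::context_stride]]
    fovea = [row[left:left + width] for row in frame[top:top + height]]
    return context, fovea

def pixels_used(context, fovea):
    """Number of pixels read out across both streams."""
    return sum(len(row) for row in context) + sum(len(row) for row in fovea)

# Toy 64 x 64 sensor (assumption); each pixel stores its own (y, x) coordinate.
frame = [[(y, x) for x in range(64)] for y in range(64)]

context, fovea = dual_stream_capture(frame, roi=(20, 30, 12, 12), context_stride=8)

# Budget of 1/16 of the full frame: 4096 / 16 = 256 pixels.
budget = len(frame) * len(frame[0]) // 16
within_budget = pixels_used(context, fovea) <= budget
```

In this toy setting the context stream costs 64 pixels and the ROI costs 144, so the pair fits inside the 256-pixel budget while still delivering full resolution where the ROI is placed.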

Dual-Stream Capture with 200 MP Foveated Imaging Prototype

Figure panels: Full Field-of-View (FFoV); Zoomed-In Low-Resolution FFoV Crop; Foveated Full-Resolution ROI.

Through extensive simulation across multiple perception tasks, we demonstrate that our approach achieves high task performance under strict pixel budgets and significantly outperforms relevant baselines operating at the same bandwidth. We further validate our system on a 200-megapixel dual-stream sensor, capturing real-world videos under realistic bandwidth and latency constraints, demonstrating the practical feasibility of task-driven, acquisition-time foveated imaging.

See more of our results in the Results section below.

Method

Rather than learning the sensor attention policy end-to-end from raw pixels, we decompose the problem into three lightweight, interpretable components: (i) a saliency detector, (ii) a motion model, and (iii) a scanpath selection policy. This modular design enables real-time inference on edge hardware and avoids the instability and latency of monolithic policies.
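The three-part decomposition can be sketched as a single policy step. Every component below is a deliberately simple stand-in chosen for illustration (a brightest-pixel saliency detector, a constant-velocity motion model, and a clamp-to-sensor ROI selector); these are assumptions, not the components used in the paper, and all names are hypothetical.

```python
def detect_salient_point(context, stride):
    """Stand-in saliency detector: brightest context pixel, mapped to full-resolution coordinates."""
    _, y, x = max(
        (value, y, x)
        for y, row in enumerate(context)
        for x, value in enumerate(row)
    )
    # Map the context cell back to the center of the sensor block it covers.
    return (y * stride + stride // 2, x * stride + stride // 2)

def predict_next(previous, current):
    """Stand-in motion model: constant-velocity extrapolation from the last two points."""
    if previous is None:
        return current
    return (2 * current[0] - previous[0], 2 * current[1] - previous[1])

def select_roi(center, roi_size, sensor_size):
    """Stand-in scanpath policy: center a square ROI on the prediction, clamped to the sensor."""
    half = roi_size // 2
    top = min(max(center[0] - half, 0), sensor_size[0] - roi_size)
    left = min(max(center[1] - half, 0), sensor_size[1] - roi_size)
    return (top, left, roi_size, roi_size)

def attention_step(context, stride, previous_point, roi_size, sensor_size):
    """One policy step: past observation in, next ROI (the sensor action) out."""
    point = detect_salient_point(context, stride)
    predicted = predict_next(previous_point, point)
    return select_roi(predicted, roi_size, sensor_size), point

# Toy 8 x 8 context (assumption) from a 64 x 64 sensor with one bright cell.
context = [[0] * 8 for _ in range(8)]
context[2][3] = 1

roi, point = attention_step(
    context, stride=8, previous_point=(16, 20), roi_size=12, sensor_size=(64, 64)
)
```

Because the ROI is placed at the predicted location rather than the last observed one, the next full-resolution readout is aimed at where the target is expected to be, which is the sense in which the loop between perception and acquisition is closed.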

Please check our paper for further method details.

Results

200 MP Foveated Imaging Prototype

We demonstrate both smooth-pursuit scanpaths for object tracking and saccadic scanpaths for scene text recognition. The predictive attention policy consistently directs high-resolution sensing to task-relevant regions that remain indistinguishable in the global context stream. This enables recovery of fine spatial details such as object boundaries and textures under real-world lighting conditions and sensor noise.

Simulation Results for Multiple Perception Tasks

Our policy-based foveated imaging consistently outperforms task-agnostic baselines and, in some cases, matches full-resolution performance while using less than one-eighth of the pixel bandwidth.

Quantitative Results
Task                  Metric                                Full-resolution  GT Oracle  Spatial downsampling  Temporal downsampling  Foveated (Ours)
Object Tracking       IoU ↑                                 0.281            0.405      0.122                 0.148                  0.283
Text Recognition      Transcription Rate ↑                  0.333            0.271      0.067                 0.248                  0.264
Robotic Manipulation  Success Rate ↑ (Complete | Partial)   0.15 | 0.61      N/A        0.10 | 0.51           0.07 | 0.30            0.12 | 0.57

Quantitative results of policy-based foveated perception. We compare the downstream task performance of our method against downsampling baselines at the same pixel bandwidth, and include full-resolution performance and oracle performance with ground-truth (GT) ROI selections. Row 1: Intersection-over-Union (IoU) for downstream soccer ball tracking. Temporal downsampling IoU is computed only on kept frames. Row 2: fraction of distinct text objects correctly detected and transcribed in the road scene text recognition task. Row 3: complete and partial task success rates over 100 trials of the ALOHA insertion task. The GT Oracle is not applicable because no ground-truth foveation labels are available for this task. Our approach outperforms all relevant baselines at comparable bandwidth.
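For readers unfamiliar with the tracking metric, here is a minimal sketch of Intersection-over-Union for two axis-aligned boxes. The `(x_min, y_min, x_max, y_max)` box convention is an assumption for this example; the source does not state whether the reported IoU is computed on boxes or on masks.

```python
def box_area(box):
    """Area of an axis-aligned box given as (x_min, y_min, x_max, y_max)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def box_iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes; 0.0 if both boxes are empty."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    union = box_area(a) + box_area(b) - intersection
    return intersection / union if union > 0 else 0.0

# Two 2 x 2 boxes overlapping in a 1 x 1 square: IoU = 1 / (4 + 4 - 1) = 1/7.
overlap_iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```

An IoU of 1.0 means the predicted and reference boxes coincide, and 0.0 means they do not overlap at all.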

Please see our paper for more detailed analysis and further evaluations.

BibTeX

@inproceedings{xiao2026foveated,
  author    = {Xiao, Howard and Ackermann, Jan and Deng, Boyang and Wetzstein, Gordon},
  title     = {Policy-based Foveated Imaging and Perception},
  year      = {2026},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  booktitle = {Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers (SIGGRAPH Conference Papers '26), July 19--23, 2026, Los Angeles, CA, USA},
  numpages  = {11},
  location  = {Los Angeles, CA, USA},
  series    = {SIGGRAPH Conference Papers '26}
}