TARDIS STRIDE: A Spatio-Temporal Road Image Dataset for Exploration and Autonomy

Héctor Carrión1,2*, Yutong Bai1,3*, Víctor A. Hernández Castro1*,
Kishan Panaganti4, Ayush Zenith1, Matthew Trang1, Tony Zhang1, Pietro Perona4, Jitendra Malik3
1Tera AI    2UC Santa Cruz    3UC Berkeley    4California Institute of Technology
*Equal Contribution  

Abstract

World models aim to simulate environments and enable effective agent behavior. However, modeling real-world environments presents unique challenges as they dynamically change across both space and, crucially, time. To capture these composed dynamics, we introduce a Spatio-Temporal Road Image Dataset for Exploration (STRIDE) permuting 360° panoramic imagery into rich interconnected observation, state, and action nodes. Leveraging this structure, we can simultaneously model the relationship between egocentric views, positional coordinates, and movement commands across both space and time.
We benchmark this dataset via TARDIS, a transformer-based generative world model that integrates spatial and temporal dynamics through a unified autoregressive framework trained on STRIDE. We demonstrate robust performance across a range of agentic tasks such as controllable photorealistic image synthesis, instruction following, autonomous self-control, and state-of-the-art georeferencing.

Method Overview

STRIDE Data Structure and TARDIS Modeling Process
TARDIS takes an observation O₀ as input, which conditions the state S₀ in both space (coordinates) and time (month, year). O₀ and S₀ then condition the action A₀, expressed spatially as a move distance in meters with a heading in degrees and temporally as month and year offsets. Finally, f_O: (O₀, S₀, A₀) → O₁ produces the next observation, and the autoregressive cycle repeats.
TARDIS operates through an interactive, real-time autoregressive loop that processes observations, predicts state coordinates, and executes navigation actions in both spatial and temporal dimensions. The system treats traditionally separate challenges as a single, integrated sequential prediction problem.
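
The cycle described above can be sketched as a short loop. This is only an illustration of the control flow, not the TARDIS implementation: `State`, `Action`, and `rollout` are hypothetical names, and the three predictor callables stand in for the model's learned functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class State:
    lat: float
    lon: float
    month: int
    year: int


@dataclass(frozen=True)
class Action:
    distance_m: float    # spatial move distance in meters
    heading_deg: float   # spatial heading in degrees
    month_offset: int    # temporal offset in months
    year_offset: int     # temporal offset in years


Observation = str  # placeholder type; a real system would use an image or its tokens


def rollout(
    o0: Observation,
    predict_state: Callable[[Observation], State],
    predict_action: Callable[[Observation, State], Action],
    predict_observation: Callable[[Observation, State, Action], Observation],
    steps: int,
) -> List[Tuple[Observation, State, Action]]:
    """Run the O -> S -> A -> next-O cycle for a fixed number of steps."""
    trajectory = []
    obs = o0
    for _ in range(steps):
        state = predict_state(obs)            # the observation conditions the state
        action = predict_action(obs, state)   # observation and state condition the action
        trajectory.append((obs, state, action))
        # Next observation from (O, S, A); it feeds the following iteration.
        obs = predict_observation(obs, state, action)
    return trajectory
```

Passing the predictors in as callables keeps the loop independent of how each function is realized. In this sketch, a user-issued command would simply replace `predict_action`.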

🗺️ Spatial Localization

Given an egocentric observation Oₙ, the spatial-state function f_ss determines precise latitude and longitude coordinates at meter-level accuracy.

⏰ Temporal Localization

Using the observation and spatial coordinates, the temporal-state function f_ts determines the temporal position (month, year) to understand environmental changes.

🚗 Spatial Action

The spatial action function f_sa determines the navigational move distance in meters and heading in degrees for commanded spatial movement.
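
To make the effect of a spatial action concrete, the sketch below moves a latitude/longitude pair by a distance and heading using the standard spherical destination-point formula. It rests on several assumptions that do not come from the project: a spherical Earth with mean radius 6,371 km, heading measured clockwise from true north, and the hypothetical function name `apply_spatial_action`.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius (spherical approximation)


def apply_spatial_action(lat_deg, lon_deg, distance_m, heading_deg):
    """Return the (lat, lon) reached by moving distance_m along heading_deg.

    Heading is assumed to be clockwise from true north, in degrees.
    """
    lat1 = math.radians(lat_deg)
    lon1 = math.radians(lon_deg)
    bearing = math.radians(heading_deg)
    delta = distance_m / EARTH_RADIUS_M  # angular distance in radians

    lat2 = math.asin(
        math.sin(lat1) * math.cos(delta)
        + math.cos(lat1) * math.sin(delta) * math.cos(bearing)
    )
    lon2 = lon1 + math.atan2(
        math.sin(bearing) * math.sin(delta) * math.cos(lat1),
        math.cos(delta) - math.sin(lat1) * math.sin(lat2),
    )
    return math.degrees(lat2), math.degrees(lon2)
```

For example, `apply_spatial_action(37.5, -122.3, 10.0, 0.0)` shifts the point roughly 0.00009° north and leaves the longitude unchanged.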

📅 Temporal Action

The temporal action function f_ta enables explicit action in the temporal dimension through month and year change commands.
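
A temporal action can be illustrated as simple calendar arithmetic on a (month, year) pair. The sketch below assumes signed offsets and assumes that month offsets carry into the year. Neither detail is specified here, and `apply_temporal_action` is a hypothetical helper.

```python
def apply_temporal_action(month, year, month_offset, year_offset):
    """Shift a (month, year) pair by signed month and year offsets."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    # Count months from year 0 with a zero-based month so divmod handles wrap-around.
    total_months = (year + year_offset) * 12 + (month - 1) + month_offset
    new_year, new_month_zero_based = divmod(total_months, 12)
    return new_month_zero_based + 1, new_year
```

For example, `apply_temporal_action(11, 2015, 3, 2)` returns `(2, 2018)`: November 2015 moved forward two years and three months.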

Key Contributions

Experimental Results

🎨 Image Generation

41% FID

improvement over Chameleon7B in controllable photorealistic image synthesis

🌍 Georeferencing

60%

predictions within 10m error (vs SVG's <10%)
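
A "within 10 m" accuracy figure of this kind can be computed as the fraction of predictions whose great-circle distance to the ground truth is at most a threshold. The sketch below uses the haversine formula. The function names and the spherical-Earth radius are assumptions, and the project's actual evaluation code may differ.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000.0):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * radius_m * math.asin(math.sqrt(a))


def fraction_within(predicted, actual, threshold_m=10.0):
    """Fraction of predicted (lat, lon) pairs within threshold_m of the truth."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        haversine_m(p[0], p[1], a[0], a[1]) <= threshold_m
        for p, a in zip(predicted, actual)
    )
    return hits / len(predicted)
```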

🚘 Self-Control

77.4%

road adherence at 4m lane width with autonomous action generation

⏳ Temporal Consistency

R² = 0.94

linear SSIM decay over 5-year intervals
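
The R² value summarizes how well a straight line describes SSIM as a function of temporal offset. Below is a minimal ordinary-least-squares sketch of that statistic, assuming paired lists of offsets and SSIM scores. `linear_fit_r2` is a hypothetical helper, and the protocol behind the reported 0.94 is not reproduced here.

```python
def linear_fit_r2(xs, ys):
    """Fit y = slope * x + intercept by least squares; return (slope, intercept, r2)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("xs and ys must each contain at least two distinct values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, 1.0 - ss_res / ss_tot
```

A negative slope with R² close to 1 would indicate that similarity falls off at a nearly constant rate as the temporal gap grows.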

Dataset Statistics
STRIDE Dataset Composition:
82B tokens arranged into 6M visual "sentences"
131k panoramic images expanded to 3.6M projected images
27× data augmentation efficiency without traditional augmentation
16-year temporal span (2008-2024) with geographical coverage of San Mateo County
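
The headline ratios can be checked against the listed counts with a few lines of arithmetic. The inputs are the rounded figures above, so the results are approximate. Reading the model's "16K" context length as 16,384 tokens is an assumption.

```python
PANORAMAS = 131_000
PROJECTED_IMAGES = 3_600_000
TOTAL_TOKENS = 82_000_000_000
VISUAL_SENTENCES = 6_000_000
CONTEXT_LENGTH = 16 * 1024  # assumption: "16K" means 16,384 tokens

expansion = PROJECTED_IMAGES / PANORAMAS
tokens_per_sentence = TOTAL_TOKENS / VISUAL_SENTENCES

print(f"expansion factor: {expansion:.1f}x")                        # 27.5x
print(f"mean tokens per sentence: {tokens_per_sentence:,.0f}")      # 13,667
print(f"fits in context: {tokens_per_sentence <= CONTEXT_LENGTH}")  # True
```

The roughly 27.5-fold expansion from panoramas to projected images is consistent with the 27× figure, and an average visual sentence of about 13.7k tokens fits within the 16K context.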
Example Visual Sentence:

Architecture Details

TARDIS adopts a 1B-parameter transformer architecture following the LLaMA design with specialized tokenization for multi-modal spatiotemporal data. The model uses a 16K context length and processes images via VQGAN tokenization (8192 vocabulary, 1024 tokens per 512×512 image).
Multi-Modal Tokenization

🖼️ Image Tokens

VQGAN tokenizer with 8192 vocabulary size, 1024 tokens per 512×512 image

📍 Spatial Tokens

Latitude/longitude with 1e-5 precision, dynamically allocated token bins

📅 Temporal Tokens

Month (1-12) and year (2000-2030) discrete representations

🎯 Action Tokens

Distance (0-50m, 0.1m precision) and heading (0-359.9°, 0.1° precision)
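
The fixed-range token types above amount to uniform quantization. The sketch below shows one plausible way to map values to bin indices and to count the bins each range implies. The helper names are hypothetical and the real vocabulary layout is not specified here. Latitude and longitude are left out because their bins are described as dynamically allocated.

```python
def num_bins(lo, hi, step):
    """Number of uniformly spaced bins covering [lo, hi] inclusive."""
    return round((hi - lo) / step) + 1


def quantize(value, lo, hi, step):
    """Index of the bin nearest to value within [lo, hi]."""
    if not lo <= value <= hi:
        raise ValueError(f"{value} is outside [{lo}, {hi}]")
    return round((value - lo) / step)


def dequantize(index, lo, step):
    """Representative value of a bin index."""
    return lo + index * step


# Bin counts implied by the stated ranges and precisions.
DISTANCE_BINS = num_bins(0.0, 50.0, 0.1)   # 0 to 50 m at 0.1 m
HEADING_BINS = num_bins(0.0, 359.9, 0.1)   # 0 to 359.9 degrees at 0.1 degrees
MONTH_BINS = num_bins(1, 12, 1)
YEAR_BINS = num_bins(2000, 2030, 1)

# 1024 tokens per 512x512 image corresponds to a 32x32 grid of codes,
# i.e. one code per 16x16 pixel patch (inferred from the stated numbers).
IMAGE_TOKENS = (512 // 16) ** 2
```

Under these assumptions the action vocabulary needs 501 distance bins and 3,600 heading bins, while the temporal tokens need only 12 month bins and 31 year bins.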

Applications

TARDIS handles multiple agentic tasks with a single model, pointing toward generalist agents that can understand and manipulate both the spatial and temporal aspects of their environments through embodied reasoning.

🎨 Controllable Generation

Fine-grained control over photorealistic image synthesis with explicit spatial and temporal commands

📍 Instruction Following

Precise navigation based on distance and heading instructions in real-world environments

🤖 Autonomous Navigation

Self-control capabilities with valid action generation adhering to road networks

🌍 Georeferencing

State-of-the-art location prediction from single images without aerial reference data

Video

Demo Video Coming Soon
Interactive demonstrations of TARDIS performing spatiotemporal navigation, controllable generation, and autonomous exploration tasks.

BibTeX

@article{carrion2025_tardis_stride,
    title={{TARDIS STRIDE}: A Spatio-Temporal Road Image Dataset for Exploration and Autonomy},
    author={Héctor Carrión and Yutong Bai and Víctor A. Hernández Castro and Kishan Panaganti and Ayush Zenith and Matthew Trang and Tony Zhang and Pietro Perona and Jitendra Malik},
    journal={arXiv preprint},
    year={2025},
}