TARDIS STRIDE: A Spatio-Temporal Road Image Dataset for Exploration and Autonomy
Héctor Carrión1,2*,
Yutong Bai1,3*,
Víctor A. Hernández Castro1*,
Kishan Panaganti4,
Ayush Zenith1,
Matthew Trang1,
Tony Zhang1,
Pietro Perona4,
Jitendra Malik3
1Tera AI
2UC Santa Cruz
3UC Berkeley
4California Institute of Technology
*Equal Contribution
World models aim to simulate environments and enable effective agent behavior. However, modeling real-world environments presents unique challenges, as they change dynamically across both space and, crucially, time. To capture these composed dynamics, we introduce the Spatio-Temporal Road Image Dataset for Exploration (STRIDE), which permutes 360° panoramic imagery into rich, interconnected observation, state, and action nodes. Leveraging this structure, we can simultaneously model the relationship between egocentric views, positional coordinates, and movement commands across both space and time.
We benchmark this dataset via TARDIS, a transformer-based generative world model that integrates spatial and temporal dynamics through a unified autoregressive framework trained on STRIDE. We demonstrate robust performance across a range of agentic tasks such as controllable photorealistic image synthesis, instruction following, autonomous self-control, and state-of-the-art georeferencing.
Method Overview
STRIDE Data Structure and TARDIS Modeling Process
TARDIS takes as input observation O₀, which conditions state S₀ in both space (coordinates) and time (month, year). O₀ and S₀ then condition action A₀, spatially as a move distance in meters with a heading in degrees, and temporally via month and year offsets. Finally, fₒ: (O₀, S₀, A₀) → O₁, and the auto-regressive cycle repeats.
TARDIS operates through an interactive, real-time auto-regressive loop that processes observations, infers state coordinates, and executes navigation actions in both spatial and temporal dimensions. The system treats these traditionally separate challenges as a single, integrated sequential prediction problem.
🗺️ Spatial Localization
Given an egocentric observation Oₙ, the spatial-state function fₛₛ determines precise latitude and longitude coordinates at meter-level accuracy.
⏰ Temporal Localization
Using the observation and spatial coordinates, the temporal-state function fₜₛ determines the temporal position (month, year) to understand environmental changes.
🚗 Spatial Action
The spatial action function fₛₐ determines the navigational move distance in meters and heading in degrees for commanded spatial movement.
📅 Temporal Action
The temporal action function fₜₐ enables explicit action in the temporal dimension through month- and year-change commands.
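Putting the four functions together, the full loop can be sketched in a few lines. This is a minimal illustration of the interface implied above, assuming a model object with hypothetical localize, propose_action, and generate heads; it is not the released TARDIS API.

from dataclasses import dataclass
from typing import Optional

@dataclass
class State:
    lat: float          # spatial state (f_ss): latitude
    lon: float          #                       longitude
    month: int          # temporal state (f_ts): month, year
    year: int

@dataclass
class Action:
    dist_m: float       # spatial action (f_sa): move distance in meters
    heading_deg: float  #                        heading in degrees
    d_month: int        # temporal action (f_ta): month/year offsets
    d_year: int

def step(model, obs, commanded: Optional[Action] = None):
    """One turn of the loop: localize, choose or follow an action, generate."""
    state = model.localize(obs)                              # f_ss, f_ts
    action = commanded or model.propose_action(obs, state)   # f_sa, f_ta, or an explicit instruction
    next_obs = model.generate(obs, state, action)            # f_O: (O_n, S_n, A_n) -> O_(n+1)
    return next_obs, state, action

# Auto-regressive rollout: each generated observation conditions the next step.
# obs = tokenize(initial_view)
# for _ in range(horizon):
#     obs, state, action = step(model, obs)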
Key Contributions
STRIDE Dataset: A novel dataset creation method that expands 131k panoramas into 3.6M projected images, achieving 27× augmentation efficiency with temporal consistency of SSIM > 0.81.
Controllable Spatiotemporal Generation: TARDIS can be explicitly instructed how to move, enabling fine-grained control over image generation with a 41% FID improvement over Chameleon7B.
Advanced Georeferencing: State-of-the-art meter-level precision, with 60% of predictions under 10m error versus SVG's <10% at the same threshold.
Valid Self-Control: The auto-regressive formulation allows autonomous action generation with 77.4% road adherence at a 4m lane width.
Temporal Sensitivity: Explicit time modeling enables adaptation to temporal change, with SSIM decaying linearly (R² = 0.94) over 5-year intervals.
Interactive Adaptation: Flexible sequential structure enables interactive updates and refinements of predictions based on new observations.
Experimental Results
🎨 Image Generation
41% FID
improvement over Chameleon7B in controllable photorealistic image synthesis
🌍 Georeferencing
60%
predictions within 10m error (vs SVG's <10%)
🚘 Self-Control
77.4%
road adherence at 4m lane width with autonomous action generation
⏳ Temporal Consistency
R² = 0.94
linear SSIM decay over 5-year intervals
Dataset Statistics
STRIDE Dataset Composition:
• 82B tokens arranged into 6M visual "sentences"
• 131k panoramic images expanded to 3.6M projected images
• 27× data augmentation efficiency without traditional augmentation (see the sketch below)
• 16-year temporal span (2008-2024) with geographical coverage of San Mateo County
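The 27× expansion comes from projecting each panorama into many perspective views. As a rough illustration (not the STRIDE pipeline; the function name, 90° field of view, fixed pitch, and nearest-neighbor sampling are our assumptions), the sketch below renders a pinhole view at a given yaw from an equirectangular panorama:

import numpy as np

def perspective_from_equirect(pano, yaw_deg, fov_deg=90.0, out_hw=(512, 512)):
    """Render one pinhole view (pitch = 0) from an equirectangular panorama."""
    H, W = pano.shape[:2]
    h, w = out_hw
    f = (w / 2) / np.tan(np.radians(fov_deg) / 2)   # focal length in pixels
    # Pixel grid centered on the principal point.
    u, v = np.meshgrid(np.arange(w) - w / 2 + 0.5,
                       np.arange(h) - h / 2 + 0.5)
    # Camera-frame ray directions (x right, y down, z forward), yaw-rotated.
    d = np.stack([u, v, np.full_like(u, f)], axis=-1)
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    yaw = np.radians(yaw_deg)
    x = d[..., 0] * np.cos(yaw) + d[..., 2] * np.sin(yaw)
    z = -d[..., 0] * np.sin(yaw) + d[..., 2] * np.cos(yaw)
    lon = np.arctan2(x, z)                           # [-pi, pi]
    lat = np.arcsin(np.clip(d[..., 1], -1.0, 1.0))   # [-pi/2, pi/2], + is down
    # Map sphere coordinates back to equirectangular pixels (nearest neighbor).
    px = ((lon / np.pi + 1) / 2 * W).astype(int) % W
    py = np.clip(((lat / np.pi + 0.5) * H).astype(int), 0, H - 1)
    return pano[py, px]

# Sampling e.g. 27 yaw angles per panorama would match the reported 27x factor:
# views = [perspective_from_equirect(pano, yaw)
#          for yaw in np.linspace(0.0, 360.0, 27, endpoint=False)]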
Example Visual Sentence:
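Concretely, a visual sentence interleaves observation, state, and action tokens step after step. The serialization below is our schematic of that structure, assuming an ordering (image tokens, then state, then action) that the released format may not follow exactly:

def serialize_sentence(steps):
    """Flatten [(obs, state, action), ...] into one token sequence.
    obs: 1024 VQGAN image tokens (O_n); state: [lat, lon, month, year] (S_n);
    action: [dist, heading, d_month, d_year] (A_n)."""
    sentence = []
    for obs, state, action in steps:
        sentence += obs      # O_n
        sentence += state    # S_n
        sentence += action   # A_n
    return sentence

# ~1032 tokens per step; 82B tokens / 6M sentences ≈ 13.7K tokens per sentence,
# i.e. roughly 13 steps, consistent with the 16K context described below.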
Architecture Details
TARDIS adopts a 1B-parameter transformer architecture following the LLaMA design with specialized tokenization for multi-modal spatiotemporal data. The model uses a 16K context length and processes images via VQGAN tokenization (8192 vocabulary, 1024 tokens per 512×512 image).
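Read as a configuration, the stated numbers amount to the following; the dict keys are ours, and anything not given in the text (e.g. layer count, hidden size, head count) is deliberately left out.

TARDIS_CONFIG = {
    "architecture": "LLaMA-style decoder-only transformer",
    "parameters": "~1B",
    "context_length": 16_384,   # 16K tokens
    "image_tokenizer": "VQGAN",
    "image_vocab_size": 8192,
    "tokens_per_image": 1024,   # 512x512 image -> 32x32 latent grid
}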
Multi-Modal Tokenization
🖼️ Image Tokens
VQGAN tokenizer with 8192 vocabulary size, 1024 tokens per 512×512 image
📍 Spatial Tokens
Latitude/longitude with 1e-5 precision, dynamically allocated token bins
📅 Temporal Tokens
Month (1-12) and year (2000-2030) discrete representations
🎯 Action Tokens
Distance (0-50m, 0.1m precision) and heading (0-359.9°, 0.1° precision)
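A uniform-binning tokenizer consistent with the ranges above might look like the sketch below. The ranges and step sizes come from the list; the binning scheme itself is our simplification, and latitude/longitude in particular would use the dynamically allocated bins mentioned above rather than globe-wide uniform bins.

def quantize(value, lo, hi, step):
    """Clamp a continuous value to [lo, hi] and map it to a discrete bin index."""
    value = min(max(value, lo), hi)
    return round((value - lo) / step)

heading_tok = quantize(127.3, 0.0, 359.9, 0.1)  # 3600 heading bins at 0.1 deg
dist_tok    = quantize(12.5, 0.0, 50.0, 0.1)    # 501 distance bins at 0.1 m
month_tok   = quantize(6, 1, 12, 1)             # 12 month bins
year_tok    = quantize(2016, 2000, 2030, 1)     # 31 year bins
# Lat/lon at 1e-5 deg over the whole globe would need ~10^7 bins per axis;
# allocating bins only over the covered region keeps the vocabulary tractable.
lat_tok = quantize(37.43210, 37.0, 37.8, 1e-5)  # illustrative county-scale bounds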
Applications
TARDIS demonstrates versatility across multiple agentic tasks, pointing toward generalist agents that can understand and manipulate the spatial and temporal aspects of their environments through embodied reasoning.
🎨 Controllable Generation
Fine-grained control over photorealistic image synthesis with explicit spatial and temporal commands
📍 Instruction Following
Precise navigation based on distance and heading instructions in real-world environments
🤖 Autonomous Navigation
Self-control capabilities with valid action generation adhering to road networks
🌍 Georeferencing
State-of-the-art location prediction from single images without aerial reference data
Video
Demo Video Coming Soon
Interactive demonstrations of TARDIS performing spatiotemporal navigation, controllable generation, and autonomous exploration tasks.
BibTeX
@article{carrion2025_tardis_stride,
title={{TARDIS STRIDE}: A Spatio-Temporal Road Image Dataset for Exploration and Autonomy},
author={Héctor Carrión and Yutong Bai and Víctor A. Hernández Castro and Kishan Panaganti and Ayush Zenith and Matthew Trang and Tony Zhang and Pietro Perona and Jitendra Malik},
journal={arXiv preprint},
year={2025},
}