TARDIS STRIDE: A Spatio-Temporal Road Image Dataset for Exploration and Autonomy
Héctor Carrión1,2*,
Yutong Bai1,3*,
Víctor A. Hernández Castro1*,
Kishan Panaganti4,
Ayush Zenith1,
Matthew Trang1,
Tony Zhang1,
Pietro Perona4,
Jitendra Malik3
1Tera AI
2UC Santa Cruz
3UC Berkeley
4California Institute of Technology
*Equal Contribution
World models aim to simulate environments and enable effective agent behavior. However, modeling real-world environments presents unique challenges, as they change dynamically across both space and, crucially, time. To capture these composed dynamics, we introduce the Spatio-Temporal Road Image Dataset for Exploration (STRIDE), which permutes 360° panoramic imagery into rich, interconnected observation, state, and action nodes. Leveraging this structure, we can simultaneously model the relationship between egocentric views, positional coordinates, and movement commands across both space and time.
We benchmark this dataset via TARDIS, a transformer-based generative world model that integrates spatial and temporal dynamics through a unified autoregressive framework trained on STRIDE. We demonstrate robust performance across a range of agentic tasks such as controllable photorealistic image synthesis, instruction following, autonomous self-control, and state-of-the-art georeferencing.
Method Overview
STRIDE Data Structure and TARDIS Modeling Process
TARDIS takes as input observation O₀, which conditions state S₀ in both space (coordinates) and time (month, year). O₀ and S₀ then condition action A₀, spatially as a move distance in meters with a heading in degrees, and temporally via month and year offsets. Finally, fₒ: (O₀, S₀, A₀) → O₁, and the auto-regressive cycle repeats.
TARDIS operates through an interactive, real-time auto-regressive loop that processes observations, infers state coordinates, and executes navigation actions in both spatial and temporal dimensions. The system treats these traditionally separate challenges as a single, integrated sequential prediction problem.
🗺️ Spatial Localization
Given an egocentric observation Oₙ, the spatial-state function fₛₛ determines precise latitude and longitude coordinates at meter-level accuracy.
⏰ Temporal Localization
Using the observation and spatial coordinates, the temporal-state function fₜₛ determines the temporal position (month, year) to understand environmental changes.
🚗 Spatial Action
The spatial action function fₛₐ determines the navigational move distance in meters and heading in degrees for commanded spatial movement.
📅 Temporal Action
The temporal action function fₜₐ enables explicit action in the temporal dimension through month- and year-change commands.
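Putting the four functions together, the full loop can be sketched in a few lines. This is a minimal illustration of the interface implied above, assuming a model object with hypothetical localize, propose_action, and generate heads; it is not the released TARDIS API.

from dataclasses import dataclass
from typing import Optional

@dataclass
class State:
    lat: float          # spatial state (f_ss): latitude
    lon: float          #                       longitude
    month: int          # temporal state (f_ts): month, year
    year: int

@dataclass
class Action:
    dist_m: float       # spatial action (f_sa): move distance in meters
    heading_deg: float  #                        heading in degrees
    d_month: int        # temporal action (f_ta): month/year offsets
    d_year: int

def step(model, obs, commanded: Optional[Action] = None):
    """One turn of the loop: localize, choose or follow an action, generate."""
    state = model.localize(obs)                              # f_ss, f_ts
    action = commanded or model.propose_action(obs, state)   # f_sa, f_ta, or an explicit instruction
    next_obs = model.generate(obs, state, action)            # f_O: (O_n, S_n, A_n) -> O_(n+1)
    return next_obs, state, action

# Auto-regressive rollout: each generated observation conditions the next step.
# obs = tokenize(initial_view)
# for _ in range(horizon):
#     obs, state, action = step(model, obs)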
Key Contributions
STRIDE Dataset: A novel dataset creation method that expands 131k panoramas into 3.6M projected images, achieving 27× augmentation efficiency with temporal consistency of SSIM > 0.81.
Controllable Spatiotemporal Generation: TARDIS can be explicitly instructed how to move, enabling fine-grained control over image generation with a 41% FID improvement over Chameleon7B.
Advanced Georeferencing: State-of-the-art meter-level precision, with 60% of predictions under 10m error versus SVG's <10% at the same threshold.
Valid Self-Control: The auto-regressive formulation allows autonomous action generation with 77.4% road adherence at a 4m lane width.
Temporal Sensitivity: Explicit time modeling enables adaptation to temporal change, with SSIM decaying linearly (R² = 0.94) over 5-year intervals.
Interactive Adaptation: Flexible sequential structure enables interactive updates and refinements of predictions based on new observations.
Experimental Results
🎨 Image Generation
41% FID
improvement over Chameleon7B in controllable photorealistic image synthesis
🌍 Georeferencing
60%
predictions within 10m error (vs SVG's <10%)
🚘 Self-Control
77.4%
road adherence at 4m lane width with autonomous action generation
⏳ Temporal Consistency
R² = 0.94
linear SSIM decay over 5-year intervals
Dataset Statistics
STRIDE Dataset Composition:
• 82B tokens arranged into 6M visual "sentences"
• 131k panoramic images expanded to 3.6M projected images
• 27× data augmentation efficiency without traditional augmentation (see the sketch below)
• 16-year temporal span (2008-2024) with geographical coverage of San Mateo County
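The 27× expansion comes from projecting each panorama into many perspective views. As a rough illustration (not the STRIDE pipeline; the function name, 90° field of view, fixed pitch, and nearest-neighbor sampling are our assumptions), the sketch below renders a pinhole view at a given yaw from an equirectangular panorama:

import numpy as np

def perspective_from_equirect(pano, yaw_deg, fov_deg=90.0, out_hw=(512, 512)):
    """Render one pinhole view (pitch = 0) from an equirectangular panorama."""
    H, W = pano.shape[:2]
    h, w = out_hw
    f = (w / 2) / np.tan(np.radians(fov_deg) / 2)   # focal length in pixels
    # Pixel grid centered on the principal point.
    u, v = np.meshgrid(np.arange(w) - w / 2 + 0.5,
                       np.arange(h) - h / 2 + 0.5)
    # Camera-frame ray directions (x right, y down, z forward), yaw-rotated.
    d = np.stack([u, v, np.full_like(u, f)], axis=-1)
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    yaw = np.radians(yaw_deg)
    x = d[..., 0] * np.cos(yaw) + d[..., 2] * np.sin(yaw)
    z = -d[..., 0] * np.sin(yaw) + d[..., 2] * np.cos(yaw)
    lon = np.arctan2(x, z)                           # [-pi, pi]
    lat = np.arcsin(np.clip(d[..., 1], -1.0, 1.0))   # [-pi/2, pi/2], + is down
    # Map sphere coordinates back to equirectangular pixels (nearest neighbor).
    px = ((lon / np.pi + 1) / 2 * W).astype(int) % W
    py = np.clip(((lat / np.pi + 0.5) * H).astype(int), 0, H - 1)
    return pano[py, px]

# Sampling e.g. 27 yaw angles per panorama would match the reported 27x factor:
# views = [perspective_from_equirect(pano, yaw)
#          for yaw in np.linspace(0.0, 360.0, 27, endpoint=False)]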
Example Visual Sentence:
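Concretely, a visual sentence interleaves observation, state, and action tokens step after step. The serialization below is our schematic of that structure, assuming an ordering (image tokens, then state, then action) that the released format may not follow exactly:

def serialize_sentence(steps):
    """Flatten [(obs, state, action), ...] into one token sequence.
    obs: 1024 VQGAN image tokens (O_n); state: [lat, lon, month, year] (S_n);
    action: [dist, heading, d_month, d_year] (A_n)."""
    sentence = []
    for obs, state, action in steps:
        sentence += obs      # O_n
        sentence += state    # S_n
        sentence += action   # A_n
    return sentence

# ~1032 tokens per step; 82B tokens / 6M sentences ≈ 13.7K tokens per sentence,
# i.e. roughly 13 steps, consistent with the 16K context described below.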
Architecture Details
TARDIS adopts a 1B-parameter transformer architecture following the LLaMA design with specialized tokenization for multi-modal spatiotemporal data. The model uses a 16K context length and processes images via VQGAN tokenization (8192 vocabulary, 1024 tokens per 512×512 image).
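Read as a configuration, the stated numbers amount to the following; the dict keys are ours, and anything not given in the text (e.g. layer count, hidden size, head count) is deliberately left out.

TARDIS_CONFIG = {
    "architecture": "LLaMA-style decoder-only transformer",
    "parameters": "~1B",
    "context_length": 16_384,   # 16K tokens
    "image_tokenizer": "VQGAN",
    "image_vocab_size": 8192,
    "tokens_per_image": 1024,   # 512x512 image -> 32x32 latent grid
}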
Multi-Modal Tokenization
🖼️ Image Tokens
VQGAN tokenizer with 8192 vocabulary size, 1024 tokens per 512×512 image
📍 Spatial Tokens
Latitude/longitude with 1e-5 precision, dynamically allocated token bins
📅 Temporal Tokens
Month (1-12) and year (2000-2030) discrete representations
🎯 Action Tokens
Distance (0-50m, 0.1m precision) and heading (0-359.9°, 0.1° precision)
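A uniform-binning tokenizer consistent with the ranges above might look like the sketch below. The ranges and step sizes come from the list; the binning scheme itself is our simplification, and latitude/longitude in particular would use the dynamically allocated bins mentioned above rather than globe-wide uniform bins.

def quantize(value, lo, hi, step):
    """Clamp a continuous value to [lo, hi] and map it to a discrete bin index."""
    value = min(max(value, lo), hi)
    return round((value - lo) / step)

heading_tok = quantize(127.3, 0.0, 359.9, 0.1)  # 3600 heading bins at 0.1 deg
dist_tok    = quantize(12.5, 0.0, 50.0, 0.1)    # 501 distance bins at 0.1 m
month_tok   = quantize(6, 1, 12, 1)             # 12 month bins
year_tok    = quantize(2016, 2000, 2030, 1)     # 31 year bins
# Lat/lon at 1e-5 deg over the whole globe would need ~10^7 bins per axis;
# allocating bins only over the covered region keeps the vocabulary tractable.
lat_tok = quantize(37.43210, 37.0, 37.8, 1e-5)  # illustrative county-scale bounds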
Applications
TARDIS demonstrates versatility across multiple agentic tasks, pointing toward generalist agents that can understand and manipulate the spatial and temporal aspects of their environments through embodied reasoning.
🎨 Controllable Generation
Fine-grained control over photorealistic image synthesis with explicit spatial and temporal commands
📍 Instruction Following
Precise navigation based on distance and heading instructions in real-world environments
🤖 Autonomous Navigation
Self-control capabilities with valid action generation adhering to road networks
🌍 Georeferencing
State-of-the-art location prediction from single images without aerial reference data
Video
Demo Video Coming Soon
Interactive demonstrations of TARDIS performing spatiotemporal navigation, controllable generation, and autonomous exploration tasks.
BibTeX
@article{carrion2025_tardis_stride,
title={{TARDIS STRIDE}: A Spatio-Temporal Road Image Dataset for Exploration and Autonomy},
author={Héctor Carrión and Yutong Bai and Víctor A. Hernández Castro and Kishan Panaganti and Ayush Zenith and Matthew Trang and Tony Zhang and Pietro Perona and Jitendra Malik},
journal={arXiv preprint},
year={2025},
}