STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving

1Institute of Visual Computing, Graz University of Technology
2Christian Doppler Laboratory for Embedded Machine Learning
3Institute for Machine Learning, Johannes Kepler University Linz
4Amazon
NeurIPS 2025

*Indicates Equal Contribution

This work is independent of the author’s employment at Amazon.

TL;DR

STSBench is a new framework for benchmarking Vision-Language Models (VLMs) in autonomous driving. STSBench uses multi-view videos and LiDAR to test spatio-temporal reasoning and complex traffic interactions. Evaluations on 971 human-verified questions reveal that current models struggle with fundamental traffic dynamics, highlighting an urgent need for improved model architectures.
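
For orientation, here is a minimal sketch of how accuracy could be computed over the benchmark's question set; the file name stsbench_qa.json and the fields answer and prediction are hypothetical placeholders, not the released data format.

import json

# Hypothetical sketch: score a model on STSBench-style questions.
# The file name and field names are placeholders, not the official
# release format.
with open("stsbench_qa.json") as f:
    samples = json.load(f)

correct = sum(s["prediction"] == s["answer"] for s in samples)
print(f"Accuracy: {correct / len(samples):.1%} ({correct}/{len(samples)})")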

Scenario Prompt

Ego waiting for pedestrian to cross

Scenario Categories

STSBench covers common ego-vehicle actions (blue), e.g., ego lane change, and ego-agent interactions (orange), e.g., ego overtaking agent, both important for vehicle control. In addition, to test complex spatio-temporal understanding, we evaluate agent actions, e.g., agent left turn, and interactions between agents, e.g., agent waiting for pedestrian to cross.
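
These four groups can be pictured as a simple taxonomy. The sketch below uses only the example scenarios named above; the group keys are our own labels, and the released benchmark contains many more scenario types per group.

# Illustrative taxonomy of the four STSBench scenario groups, built
# from the examples named above. Group keys are our own labels; the
# benchmark defines many more scenario types per group.
SCENARIO_CATEGORIES = {
    "ego_action": ["ego lane change"],
    "ego_agent_interaction": ["ego overtaking agent"],
    "agent_action": ["agent left turn"],
    "agent_agent_interaction": ["agent waiting for pedestrian to cross"],
}

for group, examples in SCENARIO_CATEGORIES.items():
    print(f"{group}: {', '.join(examples)}")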

Mined nuScenes Scenarios

Scenario Statistics

Number of Scenarios


Number of mined scenarios in total (gray) and the remaining samples (green) after sub-sampling and verification. Scenarios with more than 50 samples (dashed red line) have been sub-sampled considering spatial distribution, occlusion, and distance to the ego-vehicle.
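
As a rough illustration of this sub-sampling step, the sketch below caps a scenario type at 50 samples while preferring visible, nearby instances. The record fields occlusion and distance_to_ego and the ranking rule are assumptions, not the exact procedure used to build the benchmark, which additionally balances the spatial distribution.

import random

MAX_SAMPLES = 50  # cap per scenario type (dashed red line)

def subsample(samples, seed=0):
    # Hypothetical sketch: keep at most MAX_SAMPLES per scenario type,
    # preferring lightly occluded samples close to the ego-vehicle.
    # Field names and ranking are assumptions, not the exact procedure.
    if len(samples) <= MAX_SAMPLES:
        return samples
    ranked = sorted(samples, key=lambda s: (s["occlusion"], s["distance_to_ego"]))
    kept = ranked[:MAX_SAMPLES]
    random.Random(seed).shuffle(kept)  # avoid ordering bias downstream
    return kept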



Number of Scenarios per Category

Figure: distribution of scenarios across categories.

Evaluations

Average Performance

Figure: average scores of the evaluated models.

Detailed Performance

Tables: detailed results for LLMs, VLMs, and driving experts.

BibTeX

@inproceedings{neurips2025stsbench,
    title={{STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving}},
    author={Fruhwirth-Reisinger, Christian and Mali{\'c}, Du{\v{s}}an and Lin, Wei and Schinagl, David and Schulter, Samuel and Possegger, Horst},
    booktitle={Proceedings of the Conference on Neural Information Processing Systems (NeurIPS)},
    year={2025}
}