SpaceVista
All-Scale Visual Spatial Reasoning from mm to km

1Multimedia Lab, Chinese University of Hong Kong, 2Astribot,
3Beijing University of Posts and Telecommunications,
4Hong Kong University of Science and Technology

News

[2025.10.10] Our SFT code base is released for preview at .
[2025.10.10] A preview 100K subset of SpaceVista-1M is now available at .
[2025.10.10] Our initial paper is now accessible at .

Abstract

With the current surge of interest in spatial reasoning, researchers have made significant progress in understanding indoor scenes, but still struggle with diverse applications such as robotics and autonomous driving. This paper aims to advance all-scale spatial reasoning across diverse scenarios by tackling two key challenges: 1) the heavy reliance on indoor 3D scans and labor-intensive manual annotations for dataset curation; 2) the absence of effective all-scale scene modeling, which often leads to overfitting to individual scenes. We introduce a holistic solution that integrates a structured spatial reasoning knowledge system, scale-aware modeling, and a progressive training paradigm, which is, to the best of our knowledge, the first attempt to broaden the all-scale spatial intelligence of MLLMs. Using a task-specific, specialist-driven automated pipeline, we curate over 38K video scenes across 5 spatial scales to create SpaceVista-1M, a dataset comprising approximately 1M spatial QA pairs spanning 19 diverse task types. While specialist models can inject useful domain knowledge, they are not reliable enough for evaluation, so we also build an all-scale benchmark with precise annotations by manually recording, retrieving, and assembling video-based data. However, naive training with SpaceVista-1M often yields suboptimal results due to potential knowledge conflict. Accordingly, we introduce SpaceVista-7B, a spatial reasoning model that accepts dense inputs beyond semantics and uses scale as an anchor for scale-aware experts and progressive rewards. Finally, extensive evaluations across 5 benchmarks, including our SpaceVista-Bench, demonstrate competitive performance and strong generalization across all scales and scenarios.

What is All-Scale Spatial Reasoning?

[Figure: All-scale spatial reasoning scenarios, from millimeter-scale components to kilometer-scale aerial views]

Spatial reasoning is the ability to perceive, interpret, and act across spatial scales, from millimeter-sized components to distant aerial scenes. All-scale spatial reasoning is fundamental to next-generation intelligent systems and supports diverse applications: mm-level sensing for advanced manufacturing, cm- and m-level perception for embodied agents, 10m-level operation for autonomous driving, and 100m-level drone-based sensing.
Despite this progress, existing work shows clear limitations in both model design and dataset coverage. Current scene perception research mostly targets indoor scenes, narrow object classes, and limited spatial ranges, and lacks training paradigms engineered for end-to-end, cross-scale reasoning. SpaceVista addresses this gap by presenting the first systematic optimization across both data and model dimensions to enable robust, full-scene spatial reasoning.

Dataset: SpaceVista-1M

[Figure: SpaceVista-1M construction pipeline]
The limited data and performance constraints of existing models necessitate a dataset with all-scale spatial context. We propose SpaceVista-1M, a diverse, real-world, all-scale reasoning dataset, which is, to the best of our knowledge, the first of its kind. SpaceVista-1M primarily comprises diverse spatial reasoning question–answer pairs with rich semantic (category, rationale), 2D (mask, box, point), and 3D (depth, camera parameters, point cloud) annotations, obtained either natively or through processing. The construction pipeline in the figure above follows a step-by-step procedure of preparing, transforming, and generating, integrating specialized models to obtain an all-scale dataset.
[Figure: SpaceVista-1M dataset overview]
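For concreteness, the sketch below shows how one might represent a single SpaceVista-1M record as a Python dataclass. The field names and structure are our own illustrative assumptions, not the released file format:

# Minimal sketch of what one SpaceVista-1M sample could look like.
# Field names and structure are illustrative assumptions, not the released schema.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Annotations2D:
    masks: List[str] = field(default_factory=list)           # paths to per-object segmentation masks
    boxes: List[List[float]] = field(default_factory=list)   # [x1, y1, x2, y2] per object
    points: List[List[float]] = field(default_factory=list)  # [x, y] keypoints

@dataclass
class Annotations3D:
    depth_map: Optional[str] = None      # path to per-frame depth
    camera_params: Optional[str] = None  # intrinsics / extrinsics file
    point_cloud: Optional[str] = None    # reconstructed scene point cloud

@dataclass
class SpaceVistaSample:
    video: str                # source video clip
    scale: str                # e.g. "tiny", "tabletop", "indoor", "outdoor", "drone"
    task_type: str            # one of the 19 spatial reasoning task types
    question: str
    answer: str
    rationale: Optional[str]  # semantic annotation: category / reasoning trace
    ann_2d: Annotations2D = field(default_factory=Annotations2D)
    ann_3d: Annotations3D = field(default_factory=Annotations3D)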

Basic information

We collect a large number of spatial reasoning videos from both open-source datasets and data we collect ourselves. Specifically, we select tabletop, indoor, outdoor, and drone-view scenes, and design 19 types of spatial reasoning tasks covering all scales from millimeters to kilometers. The dataset contains diverse spatial reasoning question–answer pairs, enriched with semantic, 2D, and 3D annotations.

Characteristics

  • Scenes spanning 5 spatial scales
  • 19 spatial reasoning task types
  • All-scale coverage: from mm to km
  • 38,000 videos across diverse scenes
  • Over 50 sub-scene categories
  • 1 million QA pairs
  • Video data with 3D modeling
  • Comprehensive annotations & metadata

Evaluation

Although we apply limited manual filtering to open-source data, its suitability for accurately evaluating real-world perception remains uncertain. To address this, we collect higher-fidelity data of two types:

  • Measured, recorded, and manually annotated data for tiny and tabletop objects.
  • Existing videos enhanced through retrieval and verification of public information for indoor and outdoor scenes.

Why not simply train with all-scale data?

[Figure: Knowledge conflict arising from naive all-scale training]

Mixing different types of knowledge without distinction hinders, rather than facilitates, the model's reasoning, as shown in the figure above; this problem is known as knowledge conflict. In all-scale reasoning, the conflict appears when similar visual patterns must be interpreted differently at different scales.
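As a toy illustration of this ambiguity (our own example, not drawn from the paper), a simple pinhole-camera calculation shows how the same image extent maps to very different metric sizes once the scene scale, and hence the viewing distance, changes:

# Toy pinhole-camera example (our own illustration, not from the paper):
# the same 200-pixel image extent implies very different metric sizes
# once the camera-to-object distance changes with scene scale.
def metric_size(pixel_extent, depth_m, focal_px=1000.0):
    """Back-project an image-plane extent to metric size at a given depth."""
    return pixel_extent * depth_m / focal_px

for scale, depth_m in [("tabletop", 0.5), ("indoor", 4.0), ("outdoor", 40.0), ("drone", 400.0)]:
    print(f"{scale:>8}: 200 px at {depth_m:>5.1f} m depth -> {metric_size(200, depth_m):.2f} m object")
# Without a scale anchor, the model must resolve this ambiguity from context alone,
# which is exactly where conflicting interpretations arise.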

Model: SpaceVista-7B

[Figure: SpaceVista-7B architecture and training pipeline]
SpaceVista-7B ingests a question with videos and self-supervised dense features, encodes them, projects features to a shared space, and fuses them in an LLM through learnable interaction. A LoRA-like scale expert with a scale-aware router adapts the model to different spatial scales. Training uses reinforcement learning with stepwise rewards to align reasoning and final answers.
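The sketch below is a minimal, PyTorch-style rendering of the scale-expert idea: a scale-aware router mixes per-scale LoRA-like experts on top of a frozen projection. Expert count, ranks, and the routing rule are our own illustrative assumptions, not the released SpaceVista-7B implementation:

import torch
import torch.nn as nn

class ScaleAwareLoRA(nn.Module):
    """Minimal sketch: a frozen linear projection plus per-scale LoRA experts
    mixed by a scale-aware router. Hyperparameters are illustrative."""
    def __init__(self, dim=1024, rank=16, num_scales=5):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.requires_grad_(False)            # frozen backbone projection
        self.lora_a = nn.Parameter(torch.randn(num_scales, dim, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(num_scales, rank, dim))  # zero-init: starts as identity delta
        self.router = nn.Linear(dim, num_scales)   # scale-aware routing head

    def forward(self, x):                          # x: [batch, tokens, dim]
        weights = self.router(x.mean(dim=1)).softmax(dim=-1)           # [batch, num_scales]
        delta = torch.einsum("btd,sdr,sre->bste", x, self.lora_a, self.lora_b)
        delta = torch.einsum("bs,bste->bte", weights, delta)           # mix experts per sample
        return self.base(x) + delta

feats = torch.randn(2, 128, 1024)                  # fused video + dense-feature tokens
out = ScaleAwareLoRA()(feats)                      # [2, 128, 1024]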

Experiment

[Figure: Benchmark results of SpaceVista-7B and baselines]

Results overview. SpaceVista-7B achieves consistent improvements across all benchmarks, highlighting its advantages in spatial reasoning tasks. Although other models, including LLAVA-Onevision-7B, also demonstrate competitive performance, SpaceVista-7B consistently shows greater robustness and adaptability across tasks, solidifying its position as a leading model for spatial reasoning.

Evaluation suite. The comparison across models is conducted on multiple spatial reasoning benchmarks. We comprehensively evaluate LLAVA-Onevision-7B, LLAVA-Next-Video-7B, InternVL3.5-8B, Qwen2.5-VL-7B, SpaceR-7B, Spatial-MLLM-4B, VILASR-7B, and our SpaceVista-7B on VSI-Bench, STI-Bench, MMSI-Bench, SPAR-Bench, and SpaceVista-Bench, highlighting the robustness and competitiveness of our model.

Leaderboard on SpaceVista-Bench

Models marked with medal icons (🥇🥈🥉) are the top three overall performers on SpaceVista-Bench.
#   Model                 Source  Date     Overall  Tiny Tabletop  Tabletop  Indoor  Outdoor
3   GPT-5 🥉              Link    2025-08  33.7     32.2           20.3      39.0    43.0
8   GPT-4o                Link    2024-05  26.9     21.7           13.3      34.3    38.3
2   Gemini-2.5-Pro 🥈     Link    2025-06  33.8     33.0           38.7      34.5    29.0
11  Gemini-2.5-Flash      Link    2025-06  24.4     20.7           30.0      19.9    26.9
6   Claude-Sonnet-4       Link    2025-05  29.7     27.3           19.3      38.1    34.1
10  Claude-Opus-4.1       Link    2025-08  26.4     21.7           29.5      24.3    30.0
5   Internvl3.5-38B       Link    2025-08  30.7     29.3           25.2      41.2    27.0
10  Internvl3.5-14B       Link    2025-08  26.4     27.7           22.3      31.3    24.3
4   Internvl3-78B         Link    2025-04  33.5     38.3           23.3      42.2    30.3
9   Internvl3-38B         Link    2025-04  26.5     18.7           14.3      34.8    38.0
13  GLM-4.5V              Link    2025-08  23.3     23.0           17.8      27.3    25.2
14  GLM-4.1V-Thinking     Link    2025-07  23.1     30.7           19.3      29.0    13.3
10  Qwen2.5VL-72B         Link    2025-01  26.4     27.7           20.3      29.6    28.0
7   Qwen2.5VL-32B         Link    2025-01  28.4     25.3           19.3      38.1    30.7
16  LLAVA-Onevision-72B   Link    2024-08  16.0     25.0           12.0      15.3    11.7
17  LLAVA-Onevision-7B    Link    2024-08  12.6     17.5           8.0       13.3    11.6
15  SpaceR                Link    2025-04  21.2     12.9           17.3      34.9    19.8
12  Spatial-MLLM          Link    2025-05  24.2     17.3           20.3      36.1    23.1
1   SpaceVista-7B 🥇      Link    2025-09  36.7     33.4           37.1      42.2    34.1

Disclaimer

A preview subset of our dataset, SpaceVista-1M, has been released, and the GitHub homepage is also available. We will continue to update both and launch the full version in subsequent phases.

BibTeX

@article{sun2025spacevista,
  title={SpaceVista: All-Scale Visual Spatial Reasoning from mm to km}, 
  author={Sun, Peiwen and Lang, Shiqiang and Wu, Dongming and Ding, Yi and Feng, Kaituo and Liu, Huadai and Ye, Zhen and Liu, Rui and Liu, Yun-Hui and Wang, Jianan and Yue, Xiangyu},
  journal={arXiv preprint arXiv:2510.09606},
  year={2025}
}