SpaceVista
All-Scale Visual Spatial Reasoning from mm to km

1Multimedia Lab, Chinese University of Hong Kong, 2Astribot,
3Beijing University of Posts and Telecommunications,
4Hong Kong University of Science and Technology

News

[2025.10.10] Our SFT code base is released for preview at .
[2025.10.10] A preview 100K subset of SpaceVista-1M is now available at .
[2025.10.10] Our initial paper is now accessible at .

Abstract

With the current surge of interest in spatial reasoning, researchers have made significant progress in understanding indoor scenes, but still struggle with diverse applications such as robotics and autonomous driving. This paper aims to advance all-scale spatial reasoning across diverse scenarios by tackling two key challenges: 1) the heavy reliance on indoor 3D scans and labor-intensive manual annotations for dataset curation; 2) the absence of effective all-scale scene modeling, which often leads to overfitting to individual scenes. We introduce a holistic solution that integrates a structured spatial reasoning knowledge system, scale-aware modeling, and a progressive training paradigm, which is, to the best of our knowledge, the first attempt to broaden the all-scale spatial intelligence of MLLMs. Using a task-specific, specialist-driven automated pipeline, we curate over 38K video scenes across 5 spatial scales to create SpaceVista-1M, a dataset comprising approximately 1M spatial QA pairs spanning 19 diverse task types. While specialist models can inject useful domain knowledge, they are not reliable enough for evaluation, so we also build an all-scale benchmark with precise annotations by manually recording, retrieving, and assembling video-based data. However, naive training with SpaceVista-1M often yields suboptimal results due to potential knowledge conflict. Accordingly, we introduce SpaceVista-7B, a spatial reasoning model that accepts dense inputs beyond semantics and uses scale as an anchor for scale-aware experts and progressive rewards. Finally, extensive evaluations across 5 benchmarks, including our SpaceVista-Bench, demonstrate competitive performance and strong generalization across all scales and scenarios.

What is All-Scale Spatial Reasoning?

[Figure: All-scale spatial reasoning scenarios, from millimeter-scale components to kilometer-scale aerial views]

Spatial reasoning is the ability to perceive, interpret, and act across spatial scales, from millimeter-sized components to distant aerial scenes. All-scale spatial reasoning is fundamental to next-generation intelligent systems and supports diverse applications: mm-level sensing for advanced manufacturing, cm- and m-level perception for embodied agents, 10m-level operation for autonomous driving, and 100m-level drone-based sensing.
Despite this progress, existing work shows clear limitations in both model design and dataset coverage. Current scene perception research mostly targets indoor scenes, narrow object classes, and limited spatial ranges, and lacks training paradigms engineered for end-to-end, cross-scale reasoning. SpaceVista addresses this gap by presenting the first systematic optimization across both data and model dimensions to enable robust, full-scene spatial reasoning.

Dataset: SpaceVista-1M

[Figure: SpaceVista-1M construction pipeline]
The limited data and performance constraints of existing models necessitate a dataset with all-scale spatial context. We propose SpaceVista-1M, a diverse, real-world, all-scale reasoning dataset, which is, to the best of our knowledge, the first of its kind. SpaceVista-1M primarily comprises diverse spatial reasoning question–answer pairs with rich semantic (category, rationale), 2D (mask, box, point), and 3D (depth, camera parameters, point cloud) annotations, obtained either natively or through processing. The construction pipeline in the figure above follows a step-by-step procedure of preparing, transforming, and generating, integrating specialized models to obtain an all-scale dataset.
[Figure: SpaceVista-1M dataset overview]
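For concreteness, the sketch below shows how one might represent a single SpaceVista-1M record as a Python dataclass. The field names and structure are our own illustrative assumptions, not the released file format:

# Minimal sketch of what one SpaceVista-1M sample could look like.
# Field names and structure are illustrative assumptions, not the released schema.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Annotations2D:
    masks: List[str] = field(default_factory=list)           # paths to per-object segmentation masks
    boxes: List[List[float]] = field(default_factory=list)   # [x1, y1, x2, y2] per object
    points: List[List[float]] = field(default_factory=list)  # [x, y] keypoints

@dataclass
class Annotations3D:
    depth_map: Optional[str] = None      # path to per-frame depth
    camera_params: Optional[str] = None  # intrinsics / extrinsics file
    point_cloud: Optional[str] = None    # reconstructed scene point cloud

@dataclass
class SpaceVistaSample:
    video: str                # source video clip
    scale: str                # e.g. "tiny", "tabletop", "indoor", "outdoor", "drone"
    task_type: str            # one of the 19 spatial reasoning task types
    question: str
    answer: str
    rationale: Optional[str]  # semantic annotation: category / reasoning trace
    ann_2d: Annotations2D = field(default_factory=Annotations2D)
    ann_3d: Annotations3D = field(default_factory=Annotations3D)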

Basic information

We collect a large number of spatial reasoning videos from both open-source datasets and data we collect ourselves. Specifically, we select tabletop, indoor, outdoor, and drone-view scenes, and design 19 types of spatial reasoning tasks covering all scales from millimeters to kilometers. The dataset contains diverse spatial reasoning question–answer pairs, enriched with semantic, 2D, and 3D annotations.

Characteristics

  • Scenes spanning 5 spatial scales
  • 19 spatial reasoning task types
  • All-scale coverage: from mm to km
  • 38,000 videos across diverse scenes
  • Over 50 sub-scene categories
  • 1 million QA pairs
  • Video data with 3D modeling
  • Comprehensive annotations & metadata

Evaluation

Although we apply limited manual filtering to open-source data, its suitability for accurately evaluating real-world perception remains uncertain. To address this, we collect higher-fidelity data of two types:

  • Measured, recorded, and manually annotated data for tiny and tabletop objects.
  • Existing videos enhanced through retrieval and verification of public information for indoor and outdoor scenes.

Why not simply train with all-scale data?

[Figure: Knowledge conflict arising from naive all-scale training]

Mixing different types of knowledge without distinction hinders, rather than facilitates, the model's reasoning, as shown in the figure above; this problem is known as knowledge conflict. In all-scale reasoning, the conflict appears when similar visual patterns must be interpreted differently at different scales.
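As a toy illustration of this ambiguity (our own example, not drawn from the paper), a simple pinhole-camera calculation shows how the same image extent maps to very different metric sizes once the scene scale, and hence the viewing distance, changes:

# Toy pinhole-camera example (our own illustration, not from the paper):
# the same 200-pixel image extent implies very different metric sizes
# once the camera-to-object distance changes with scene scale.
def metric_size(pixel_extent, depth_m, focal_px=1000.0):
    """Back-project an image-plane extent to metric size at a given depth."""
    return pixel_extent * depth_m / focal_px

for scale, depth_m in [("tabletop", 0.5), ("indoor", 4.0), ("outdoor", 40.0), ("drone", 400.0)]:
    print(f"{scale:>8}: 200 px at {depth_m:>5.1f} m depth -> {metric_size(200, depth_m):.2f} m object")
# Without a scale anchor, the model must resolve this ambiguity from context alone,
# which is exactly where conflicting interpretations arise.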

Model: SpaceVista-7B

[Figure: SpaceVista-7B architecture and training pipeline]
SpaceVista-7B ingests a question with videos and self-supervised dense features, encodes them, projects features to a shared space, and fuses them in an LLM through learnable interaction. A LoRA-like scale expert with a scale-aware router adapts the model to different spatial scales. Training uses reinforcement learning with stepwise rewards to align reasoning and final answers.
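The sketch below is a minimal, PyTorch-style rendering of the scale-expert idea: a scale-aware router mixes per-scale LoRA-like experts on top of a frozen projection. Expert count, ranks, and the routing rule are our own illustrative assumptions, not the released SpaceVista-7B implementation:

import torch
import torch.nn as nn

class ScaleAwareLoRA(nn.Module):
    """Minimal sketch: a frozen linear projection plus per-scale LoRA experts
    mixed by a scale-aware router. Hyperparameters are illustrative."""
    def __init__(self, dim=1024, rank=16, num_scales=5):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.requires_grad_(False)            # frozen backbone projection
        self.lora_a = nn.Parameter(torch.randn(num_scales, dim, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(num_scales, rank, dim))  # zero-init: starts as identity delta
        self.router = nn.Linear(dim, num_scales)   # scale-aware routing head

    def forward(self, x):                          # x: [batch, tokens, dim]
        weights = self.router(x.mean(dim=1)).softmax(dim=-1)           # [batch, num_scales]
        delta = torch.einsum("btd,sdr,sre->bste", x, self.lora_a, self.lora_b)
        delta = torch.einsum("bs,bste->bte", weights, delta)           # mix experts per sample
        return self.base(x) + delta

feats = torch.randn(2, 128, 1024)                  # fused video + dense-feature tokens
out = ScaleAwareLoRA()(feats)                      # [2, 128, 1024]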

Experiment

[Figure: Benchmark results of SpaceVista-7B and baselines]

Results overview. SpaceVista-7B achieves consistent improvements across all benchmarks, highlighting its advantages in spatial reasoning tasks. Although other models, including LLAVA-Onevision-7B, also demonstrate competitive performance, SpaceVista-7B consistently shows greater robustness and adaptability across tasks, solidifying its position as a leading model for spatial reasoning.

Evaluation suite. The comparison across models is conducted on multiple spatial reasoning benchmarks. We comprehensively evaluate LLAVA-Onevision-7B, LLAVA-Next-Video-7B, InternVL3.5-8B, Qwen2.5-VL-7B, SpaceR-7B, Spatial-MLLM-4B, VILASR-7B, and our SpaceVista-7B on VSI-Bench, STI-Bench, MMSI-Bench, SPAR-Bench, and SpaceVista-Bench, highlighting the robustness and competitiveness of our model.

Leaderboard on SpaceVista-Bench

Models marked with medal icons (🥇🥈🥉) are the top three overall performers on SpaceVista-Bench.
#   Model                 Source  Date     Overall  Tiny Tabletop  Tabletop  Indoor  Outdoor
3   GPT-5 🥉              Link    2025-08  33.7     32.2           20.3      39.0    43.0
8   GPT-4o                Link    2024-05  26.9     21.7           13.3      34.3    38.3
2   Gemini-2.5-Pro 🥈     Link    2025-06  33.8     33.0           38.7      34.5    29.0
11  Gemini-2.5-Flash      Link    2025-06  24.4     20.7           30.0      19.9    26.9
6   Claude-Sonnet-4       Link    2025-05  29.7     27.3           19.3      38.1    34.1
10  Claude-Opus-4.1       Link    2025-08  26.4     21.7           29.5      24.3    30.0
5   Internvl3.5-38B       Link    2025-08  30.7     29.3           25.2      41.2    27.0
10  Internvl3.5-14B       Link    2025-08  26.4     27.7           22.3      31.3    24.3
4   Internvl3-78B         Link    2025-04  33.5     38.3           23.3      42.2    30.3
9   Internvl3-38B         Link    2025-04  26.5     18.7           14.3      34.8    38.0
13  GLM-4.5V              Link    2025-08  23.3     23.0           17.8      27.3    25.2
14  GLM-4.1V-Thinking     Link    2025-07  23.1     30.7           19.3      29.0    13.3
10  Qwen2.5VL-72B         Link    2025-01  26.4     27.7           20.3      29.6    28.0
7   Qwen2.5VL-32B         Link    2025-01  28.4     25.3           19.3      38.1    30.7
16  LLAVA-Onevision-72B   Link    2024-08  16.0     25.0           12.0      15.3    11.7
17  LLAVA-Onevision-7B    Link    2024-08  12.6     17.5           8.0       13.3    11.6
15  SpaceR                Link    2025-04  21.2     12.9           17.3      34.9    19.8
12  Spatial-MLLM          Link    2025-05  24.2     17.3           20.3      36.1    23.1
1   SpaceVista-7B 🥇      Link    2025-09  36.7     33.4           37.1      42.2    34.1

Disclaimer

A preview subset of our dataset, SpaceVista-1M, has been released, and the GitHub homepage is also available. We will continue to update both and launch the full version in subsequent phases.

BibTeX

@article{sun2025spacevista,
  title={SpaceVista: All-Scale Visual Spatial Reasoning from mm to km}, 
  author={Sun, Peiwen and Lang, Shiqiang and Wu, Dongming and Ding, Yi and Feng, Kaituo and Liu, Huadai and Ye, Zhen and Liu, Rui and Liu, Yun-Hui and Wang, Jianan and Yue, Xiangyu},
  journal={arXiv preprint arXiv:2510.09606},
  year={2025}
}