SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards

1 Perceptual Intelligence and Extended Reality Lab (PIXL), University of Oxford
2 VLAA Lab, University of California, Santa Cruz

*Corresponding Author


NeurIPS 2025 Workshops on Space in Vision, Language, and Embodied AI (SpaVLE), Embodied World Models for Decision Making (EWM), Aligning Reinforcement Learning Experimentalists and Theorists (ARLET), and Scaling Environments for Agents (SEA)

Overview

We introduce SpatialThinker, a 3D-aware reasoning MLLM trained via RL with dense spatial rewards on STVQA-7K, a 7K-sample synthetic spatial VQA dataset we generate. SpatialThinker achieves roughly 2x the gains of vanilla RL and surpasses GPT-4o on several spatial reasoning tasks.



SpatialThinker Results Overview

SpatialThinker in action. The model first identifies and localizes the region of interest, then constructs a 3D relational scene graph, and finally performs scene-grounded reasoning. This enables SpatialThinker to think in 3D space, beyond 2D projections of images—mirroring how humans build and reason over a mental 3D model of what they see.


Abstract

Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language tasks, but they continue to struggle with spatial understanding. Existing spatial MLLMs often rely on explicit 3D inputs or architecture-specific modifications, and remain constrained by large-scale datasets or sparse supervision. To address these limitations, we introduce SpatialThinker, a 3D-aware MLLM trained with RL to integrate structured spatial grounding with multi-step reasoning. The model simulates human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations, and reasoning towards an answer via dense spatial rewards. SpatialThinker rests on two key contributions: (1) a data synthesis pipeline that generates STVQA-7K, a high-quality spatial VQA dataset, and (2) online RL with a multi-objective dense spatial reward that enforces spatial grounding. SpatialThinker-7B outperforms supervised fine-tuning and the sparse RL baseline on spatial understanding and real-world VQA benchmarks, nearly doubling the base-model gain compared to sparse RL and surpassing GPT-4o. These results showcase the effectiveness of combining spatial supervision with reward-aligned reasoning in enabling robust 3D spatial understanding with limited data, advancing MLLMs towards human-level visual reasoning.



Motivation

MLLMs still struggle with 3D spatial understanding — they see images but don't understand the spatial structure behind them. We aim to teach models to see in 3D, reason over object relations, and move beyond flat 2D perception — akin to how humans form mental 3D models of a scene.

🔒 Challenges

(1) Lack of 3D Spatial Knowledge: Existing MLLMs lack supervision connecting 2D pixels to 3D relational structure.
(2) Data Inefficiency & Limited Coverage: Prior spatial VLMs rely on massive training sets yet still generalize poorly and capture only a narrow subset of spatial relations. Our generated STVQA-7K dataset covers 84 distinct 2D and 3D relations, spanning relation, distance, depth, orientation, size, reach, and instance-location reasoning.
(3) Sparse RL Signals: Naive reinforcement learning provides weak scalar rewards, failing to shape structured spatial reasoning.
(4) Disjoint Scene Graph Usage: Scene graphs are often treated as external pre-processing tools rather than being integrated into the model's reasoning loop.

🔑 Solutions

(1) End-to-End Scene-Grounded Reasoning: SpatialThinker integrates scene graph grounding directly into multimodal reasoning, forming 3D relational mental graphs within its thought process.
(2) Data-Efficient Synthetic Supervision: Our STVQA-7K dataset (scalable to 108K) provides dense, scene-graph-grounded spatial supervision across 84 diverse 2D + 3D relations.
(3) Dense Multi-Objective RL Reward: A lexicographically gated dense reward progressively guides reasoning from format validity → count fidelity → accuracy → spatial grounding.
(4) Compact Yet High-Performance Training: Trained on only 7K samples, SpatialThinker nearly doubles the gains of vanilla RL and outperforms GPT-4o on spatial reasoning benchmarks.

Method

Dense, lexicographically gated rewards teach SpatialThinker to ground every reasoning step before answering.
  • Reward design (lexicographic). Format → {count, accuracy} → spatial. Spatial credit is given only if the final answer is correct. We use weights w_format = 0.1, w_count = 0.2, w_accuracy = 0.5, w_spatial = 0.2 (see the sketch after this list).
  • Format reward. Enforces the <observe>–<scene>–<think>–<answer> template and validates the JSON in <scene> (parseable; each object has an ID and a bbox; relations are valid subject–predicate–object triplets).
  • Count reward. Encourages the right number of objects/relations w.r.t. the ground truth and penalizes over-/under-generation to avoid reward hacking (λ_obj = 0.7, λ_rel = 0.3).
  • Spatial reward. On correct answers only: match predicted ↔ ground-truth objects via the Hungarian algorithm with cost λ_spatial(1 − IoU) + λ_semantic(1 − sim) (λ_spatial = 1.0, λ_semantic = 2.0), then average CIoU over the matches.
  • GRPO training. Sample N rollouts per input, score them with the dense reward, compute group-normalized advantages A = (r − μ)/(σ + ε), and optimize a PPO-style clipped loss (clip range ε_low = 0.2, ε_high = 0.3) with token-level KL regularization (β = 1e−2); a sketch follows the method figure below.
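
To make the reward concrete, here is a minimal Python sketch of the gated computation. The weights and gating follow the bullets above, but the ground-truth dictionary layout, the exact-match answer check, the `semantic_sim` stand-in, and the use of plain IoU in the final score (the paper averages CIoU over matches) are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the lexicographically gated dense reward described above.
import json
import re

import numpy as np
from scipy.optimize import linear_sum_assignment

W_FORMAT, W_COUNT, W_ACC, W_SPATIAL = 0.1, 0.2, 0.5, 0.2  # reward weights
LAMBDA_OBJ, LAMBDA_REL = 0.7, 0.3                          # count sub-weights
LAMBDA_SPATIAL, LAMBDA_SEM = 1.0, 2.0                      # matching-cost weights

TEMPLATE = re.compile(
    r"<observe>.*?</observe>\s*<scene>(.*?)</scene>\s*"
    r"<think>.*?</think>\s*<answer>(.*?)</answer>",
    re.DOTALL,
)

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes (the paper scores CIoU here)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def semantic_sim(pred_label, gt_label):
    """Stand-in for a semantic similarity between object labels."""
    return float(pred_label.strip().lower() == gt_label.strip().lower())

def count_score(n_pred, n_gt):
    """Peaks at the ground-truth count; decays on over-/under-generation."""
    return max(0.0, 1.0 - abs(n_pred - n_gt) / max(n_gt, 1))

def dense_reward(response, gt):
    m = TEMPLATE.search(response)
    if m is None:
        return 0.0                          # format gate: invalid template scores 0
    try:
        scene = json.loads(m.group(1))      # <scene> must contain parseable JSON
    except json.JSONDecodeError:
        return 0.0
    reward = W_FORMAT
    objs = scene.get("objects", [])
    rels = scene.get("relations", [])
    reward += W_COUNT * (LAMBDA_OBJ * count_score(len(objs), len(gt["objects"]))
                         + LAMBDA_REL * count_score(len(rels), len(gt["relations"])))
    correct = m.group(2).strip() == gt["answer"]   # exact-match check (assumed)
    reward += W_ACC * float(correct)
    if correct and objs and gt["objects"]:  # spatial credit only on correct answers
        cost = np.array([[LAMBDA_SPATIAL * (1.0 - iou(p["bbox"], g["bbox"]))
                          + LAMBDA_SEM * (1.0 - semantic_sim(p["id"], g["name"]))
                          for g in gt["objects"]] for p in objs])
        rows, cols = linear_sum_assignment(cost)   # Hungarian matching
        reward += W_SPATIAL * float(np.mean(
            [iou(objs[i]["bbox"], gt["objects"][j]["bbox"])
             for i, j in zip(rows, cols)]))
    return reward
```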

Method overview. Dense lexicographic rewards (format, count, accuracy, spatial) with GRPO enforce grounded 3D reasoning.
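
The GRPO step itself reduces to a few lines. Below is a minimal sketch under the hyperparameters above, assuming per-token log-probabilities for the new, old, and frozen reference policies have already been gathered; the tensor shapes, masking convention, and k3-style KL estimator are assumptions for illustration.

```python
# Minimal sketch of one GRPO update over a group of N rollouts.
import torch

EPS_LOW, EPS_HIGH, BETA = 0.2, 0.3, 1e-2

def grpo_loss(logp_new, logp_old, logp_ref, rewards, mask):
    """logp_*: [N, T] token log-probs; rewards: [N] scalars; mask: [N, T] 0/1."""
    # Group-normalized advantage: A = (r - mean) / (std + eps), one value per rollout.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    adv = adv[:, None]                                   # broadcast across tokens
    ratio = torch.exp(logp_new - logp_old)               # per-token importance ratio
    # PPO-style clipped surrogate with the asymmetric clip range [1-0.2, 1+0.3].
    surrogate = torch.minimum(
        ratio * adv,
        torch.clamp(ratio, 1.0 - EPS_LOW, 1.0 + EPS_HIGH) * adv,
    )
    # Token-level KL regularization toward the frozen reference policy.
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
    loss = -(surrogate - BETA * kl)
    return (loss * mask).sum() / mask.sum()              # mean over valid tokens
```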

STVQA-7K: Dataset Construction
  • Source. Built from Visual Genome scene graphs, yielding 7,587 multiple‑choice VQA pairs spanning 2D and 3D spatial understanding.
  • Coverage. Nine reasoning types: relations, size, orientation, distance, depth, reach, location, count, existence; VG150 predicates augmented with 34 additional spatial relations (e.g., near/far, bigger/taller, facing away, inside/beneath).
  • Generation and filtering. Questions generated from scene graphs using Claude Sonnet 4, rated for difficulty/quality, and filtered via pass@2 consistency checks with GPT‑4o—downselecting from 56,224 to ~7.5K high‑quality items.
  • Localized supervision. Per‑question subgraphs selected via lemmatized keyword matching (a sketch follows this list); absolute‑pixel bounding boxes are retained to support CIoU‑based spatial rewards.
  • Scalability. Pipeline scales to ~108K samples (upper bound of Visual Genome) for future post‑training or RL fine‑tuning.
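
As referenced in the list above, here is a minimal sketch of what lemmatized keyword matching for per-question subgraph selection could look like; the scene-graph field names and the choice of the NLTK lemmatizer are assumptions, not the paper's pipeline.

```python
# Illustrative sketch of per-question subgraph selection via lemmatized
# keyword matching. Requires a one-time nltk.download("wordnet").
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemmas(text):
    """Lowercased lemmas for every whitespace token in a string."""
    return {lemmatizer.lemmatize(tok) for tok in text.lower().split()}

def question_subgraph(question, scene_graph):
    """Keep only objects whose names share a lemma with the question,
    plus the relations that connect two kept objects."""
    keys = lemmas(question)
    objects = [o for o in scene_graph["objects"] if lemmas(o["name"]) & keys]
    kept = {o["id"] for o in objects}
    relations = [r for r in scene_graph["relations"]
                 if r["subject"] in kept and r["object"] in kept]
    # Absolute-pixel bboxes stay on each object to support CIoU-based rewards.
    return {"objects": objects, "relations": relations}
```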

STVQA-7K pipeline (top) and QA-type distribution with examples (below).

Qualitative Examples: Scene‑Grounded Spatial Reasoning

Results

Performance on Spatial VQA

SpatialThinker‑7B delivers near‑proprietary accuracy and SOTA among open models on spatial VQA with only ~7.5K training samples.

| Model | 3DSRBench | CV-Bench 2D | CV-Bench 3D | CV-Bench Avg | BLINK Rel | BLINK Depth | BLINK Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Proprietary Models | | | | | | | |
| GPT‑4o | 44.3 | 75.8 | 83.0 | 79.4 | 82.5 | 78.2 | 80.4 |
| Claude 3.5 Sonnet | 48.2 | 60.2 | 71.5 | 65.9 | 58.7 | 67.7 | 63.2 |
| Open‑Source General MLLMs | | | | | | | |
| Qwen2.5‑VL‑3B | 44.0 | 59.9 | 60.2 | 60.0 | 66.4 | 54.0 | 60.2 |
| Qwen2.5‑VL‑7B | 48.4 | 69.1 | 68.0 | 68.6 | 84.0 | 52.4 | 68.2 |
| VLAA‑Thinker‑Qwen2.5‑VL‑7B | 52.2 | 60.8 | 60.3 | 60.6 | 81.2 | 71.0 | 76.1 |
| LLaVA‑NeXT‑8B | 48.4 | 62.2 | 65.3 | 63.8 | – | – | – |
| Cambrian‑1‑8B | 42.2 | 72.3 | 72.0 | 72.2 | 69.9 | 73.4 | 71.7 |
| Open‑Source Spatial MLLMs | | | | | | | |
| RoboPoint‑13B | – | – | – | 61.2 | 60.8 | 61.3 | 61.1 |
| SpatialBot‑3B | 41.1 | – | – | 69.1 | 67.8 | 67.7 | 67.8 |
| SpaceLLaVA‑13B | 42.0 | – | – | 68.5 | 72.7 | 62.9 | 67.8 |
| SATORI‑R1 | 48.0 | 54.6 | 69.4 | 62.0 | 77.0 | 58.9 | 68.0 |
| Spatial‑RGPT‑7B w/ depth | 48.4 | – | – | 60.7 | 65.7 | 82.3 | 74.0 |
| SpaceThinker | 51.1 | 65.1 | 65.9 | 65.5 | 73.4 | 59.9 | 66.7 |
| SpaceOm | 52.2 | 72.1 | 69.3 | 70.7 | 81.1 | 65.3 | 73.2 |
| Method Comparison (Trained on STVQA‑7K) | | | | | | | |
| Qwen2.5‑VL‑3B + SFT | 50.8 | 53.9 | 68.4 | 61.1 | 65.0 | 66.9 | 66.0 |
| Qwen2.5‑VL‑3B + Vanilla GRPO | 50.1 | 70.6 | 66.6 | 68.6 | 73.4 | 55.6 | 64.5 |
| SpatialThinker‑3B (Ours) | 52.9 | 71.0 | 76.3 | 73.6 | 81.8 | 66.9 | 74.4 |
| Qwen2.5‑VL‑7B + SFT | 53.6 | 56.1 | 71.3 | 63.7 | 75.5 | 64.5 | 70.0 |
| Qwen2.5‑VL‑7B + Vanilla GRPO | 54.7 | 68.9 | 76.5 | 72.7 | 80.4 | 75.0 | 77.7 |
| SpatialThinker‑7B (Ours) | 56.4 | 77.7 | 78.7 | 78.2 | 86.0 | 72.6 | 79.3 |

– = not reported.

| Model | MMVP | SpatialReasonerEval | SpatialBench |
| --- | --- | --- | --- |
| Proprietary Models | | | |
| GPT‑4o | 70.7 | 85.8 | 67.0 |
| Claude 3.5 Sonnet | 71.3 | 84.1 | 63.2 |
| Open‑Source General & Spatial MLLMs | | | |
| Qwen2.5‑VL‑3B | 67.0 | 68.0 | 49.9 |
| Qwen2.5‑VL‑7B | 72.3 | 70.6 | 62.5 |
| VLAA‑Thinker‑7B | 75.3 | 61.2 | 66.2 |
| SpaceThinker | 63.0 | 69.6 | 57.9 |
| SpaceOm | 66.3 | 68.9 | 58.6 |
| SpatialReasoner | 64.0 | 76.4 | 59.2 |
| SATORI‑R1 | 67.7 | 70.5 | 60.3 |
| Visionary‑R1 | 70.3 | 72.9 | 59.8 |
| Method Comparison (Trained on STVQA‑7K) | | | |
| Qwen2.5‑VL‑3B + SFT | 62.7 | 67.5 | 56.3 |
| Qwen2.5‑VL‑3B + Vanilla GRPO | 68.3 | 69.3 | 56.9 |
| SpatialThinker‑3B (Ours) | 69.0 | 76.5 | 61.5 |
| Qwen2.5‑VL‑7B + SFT | 68.3 | 70.8 | 63.5 |
| Qwen2.5‑VL‑7B + Vanilla GRPO | 74.3 | 79.6 | 64.2 |
| SpatialThinker‑7B (Ours) | 78.0 | 82.7 | 66.4 |


Performance on Real‑World and General VQA

Dense spatial rewards transfer to real‑world VQA; SpatialThinker‑7B leads MM‑Star, VStarBench, and RoboSpatial‑Home, while remaining competitive on hallucination and real‑world tests.

| Model | MM‑Star | VStar | RealWorldQA | MME‑RW‑Lite | RoboSpatial‑Home | HallusionBench |
| --- | --- | --- | --- | --- | --- | --- |
| Proprietary and Open‑Source MLLMs | | | | | | |
| GPT‑4o | 64.7 | 66.0 | 75.4 | 51.6 | 68.4 | 55.0 |
| Claude 3.5 Sonnet | 65.1 | 51.8 | 60.1 | 45.2 | 57.0 | 55.5 |
| Qwen2.5‑VL‑3B | 55.9 | 74.9 | 58.2 | 41.9 | 58.7 | 46.3 |
| Qwen2.5‑VL‑7B | 63.9 | 75.9 | 68.4 | 44.1 | 70.6 | 52.9 |
| VLAA‑Thinker‑7B | 63.8 | 58.1 | 66.4 | 44.6 | 68.9 | 68.9 |
| SpaceThinker | 54.5 | 56.5 | 61.6 | – | 52.6 | 65.4 |
| SpaceOm | 57.7 | 56.5 | 53.3 | – | 68.9 | 62.9 |
| Method Comparison (Trained on STVQA‑7K) | | | | | | |
| Qwen2.5‑VL‑3B + SFT | 53.9 | 73.3 | 64.8 | 43.0 | 69.8 | 58.9 |
| Qwen2.5‑VL‑3B + Vanilla GRPO | 56.7 | 74.3 | 64.4 | 46.7 | 64.0 | 59.0 |
| SpatialThinker‑3B (Ours) | 57.6 | 78.0 | 66.3 | 46.5 | 70.6 | 62.5 |
| Qwen2.5‑VL‑7B + SFT | 63.2 | 78.0 | 65.4 | 47.4 | 72.4 | 66.2 |
| Qwen2.5‑VL‑7B + Vanilla GRPO | 63.4 | 73.9 | 66.6 | 46.3 | 76.2 | 60.7 |
| SpatialThinker‑7B (Ours) | 65.9 | 81.7 | 69.2 | 48.3 | 76.3 | 66.4 |

– = not reported.

BibTeX


        @misc{batra2025spatialthinkerreinforcing3dreasoning,
          title={SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards}, 
          author={Hunar Batra and Haoqin Tu and Hardy Chen and Yuanze Lin and Cihang Xie and Ronald Clark},
          year={2025},
          eprint={2511.07403},
          archivePrefix={arXiv},
          primaryClass={cs.CV},
          url={https://arxiv.org/abs/2511.07403}, 
        }