SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards

1 Perceptual Intelligence and Extended Reality Lab (PIXL), University of Oxford
2 VLAA Lab, University of California, Santa Cruz

*Corresponding Author


NeurIPS 2025 Workshops on Space in Vision, Language, and Embodied AI (SpaVLE), Embodied World Models for Decision Making (EWM), Aligning Reinforcement Learning Experimentalists and Theorists (ARLET), and Scaling Environments for Agents (SEA)

Overview

We introduce SpatialThinker, a 3D-aware reasoning MLLM trained via RL with dense spatial rewards on STVQA-7K, a 7K-sample synthetic spatial VQA dataset we generate. SpatialThinker achieves roughly 2x the gains of vanilla RL and surpasses GPT-4o on several spatial reasoning tasks.



SpatialThinker Results Overview

SpatialThinker in action. The model first identifies and localizes the region of interest, then constructs a 3D relational scene graph, and finally performs scene-grounded reasoning. This enables SpatialThinker to think in 3D space, beyond 2D projections of images—mirroring how humans build and reason over a mental 3D model of what they see.


Abstract

Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language tasks, but they continue to struggle with spatial understanding. Existing spatial MLLMs often rely on explicit 3D inputs or architecture-specific modifications, and remain constrained by large-scale datasets or sparse supervision. To address these limitations, we introduce SpatialThinker, a 3D-aware MLLM trained with RL to integrate structured spatial grounding with multi-step reasoning. The model simulates human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations, and reasoning towards an answer via dense spatial rewards. SpatialThinker rests on two key contributions: (1) a data synthesis pipeline that generates STVQA-7K, a high-quality spatial VQA dataset, and (2) online RL with a multi-objective dense spatial reward that enforces spatial grounding. SpatialThinker-7B outperforms supervised fine-tuning and the sparse RL baseline on spatial understanding and real-world VQA benchmarks, nearly doubling the base-model gain compared to sparse RL and surpassing GPT-4o. These results showcase the effectiveness of combining spatial supervision with reward-aligned reasoning in enabling robust 3D spatial understanding with limited data, advancing MLLMs towards human-level visual reasoning.



Motivation

MLLMs still struggle with 3D spatial understanding — they see images but don't understand the spatial structure behind them. We aim to teach models to see in 3D, reason over object relations, and move beyond flat 2D perception — akin to how humans form mental 3D models of a scene.

🔒 Challenges

(1) Lack of 3D Spatial Knowledge: Existing MLLMs lack supervision connecting 2D pixels to 3D relational structure.
(2) Data Inefficiency & Limited Coverage: Prior spatial VLMs rely on massive training sets yet still generalize poorly and capture only a narrow subset of spatial relations. Our generated STVQA-7K dataset covers 84 distinct 2D and 3D relations, spanning relation, distance, depth, orientation, size, reach, and instance-location reasoning.
(3) Sparse RL Signals: Naive reinforcement learning provides weak scalar rewards, failing to shape structured spatial reasoning.
(4) Disjoint Scene Graph Usage: Scene graphs are often treated as external pre-processing tools rather than being integrated into the model's reasoning loop.

🔑 Solutions

(1) End-to-End Scene-Grounded Reasoning: SpatialThinker integrates scene graph grounding directly into multimodal reasoning, forming 3D relational mental graphs within its thought process.
(2) Data-Efficient Synthetic Supervision: Our STVQA-7K dataset (scalable to 108K) provides dense, scene-graph-grounded spatial supervision across 84 diverse 2D + 3D relations.
(3) Dense Multi-Objective RL Reward: A lexicographically gated dense reward progressively guides reasoning from format validity → count fidelity → accuracy → spatial grounding.
(4) Compact Yet High-Performance Training: Trained on only 7K samples, SpatialThinker nearly doubles the gains of vanilla RL and outperforms GPT-4o on spatial reasoning benchmarks.

Method

Dense, lexicographically gated rewards teach SpatialThinker to ground every reasoning step before answering.
  • Reward design (lexicographic). Format → {count, accuracy} → spatial. Spatial credit is given only if the final answer is correct. We use weights w_format = 0.1, w_count = 0.2, w_accuracy = 0.5, w_spatial = 0.2 (see the sketch after this list).
  • Format reward. Enforces the <observe>–<scene>–<think>–<answer> template and validates the JSON in <scene> (parseable; each object has an ID and a bbox; relations are valid subject–predicate–object triplets).
  • Count reward. Encourages the right number of objects/relations w.r.t. the ground truth and penalizes over-/under-generation to avoid reward hacking (λ_obj = 0.7, λ_rel = 0.3).
  • Spatial reward. On correct answers only: match predicted ↔ ground-truth objects via the Hungarian algorithm with cost λ_spatial(1 − IoU) + λ_semantic(1 − sim) (λ_spatial = 1.0, λ_semantic = 2.0), then average CIoU over the matches.
  • GRPO training. Sample N rollouts per input, score them with the dense reward, compute group-normalized advantages A = (r − μ)/(σ + ε), and optimize a PPO-style clipped loss (clip range ε_low = 0.2, ε_high = 0.3) with token-level KL regularization (β = 1e−2); a sketch follows the method figure below.
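
To make the reward concrete, here is a minimal Python sketch of the gated computation. The weights and gating follow the bullets above, but the ground-truth dictionary layout, the exact-match answer check, the `semantic_sim` stand-in, and the use of plain IoU in the final score (the paper averages CIoU over matches) are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the lexicographically gated dense reward described above.
import json
import re

import numpy as np
from scipy.optimize import linear_sum_assignment

W_FORMAT, W_COUNT, W_ACC, W_SPATIAL = 0.1, 0.2, 0.5, 0.2  # reward weights
LAMBDA_OBJ, LAMBDA_REL = 0.7, 0.3                          # count sub-weights
LAMBDA_SPATIAL, LAMBDA_SEM = 1.0, 2.0                      # matching-cost weights

TEMPLATE = re.compile(
    r"<observe>.*?</observe>\s*<scene>(.*?)</scene>\s*"
    r"<think>.*?</think>\s*<answer>(.*?)</answer>",
    re.DOTALL,
)

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes (the paper scores CIoU here)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def semantic_sim(pred_label, gt_label):
    """Stand-in for a semantic similarity between object labels."""
    return float(pred_label.strip().lower() == gt_label.strip().lower())

def count_score(n_pred, n_gt):
    """Peaks at the ground-truth count; decays on over-/under-generation."""
    return max(0.0, 1.0 - abs(n_pred - n_gt) / max(n_gt, 1))

def dense_reward(response, gt):
    m = TEMPLATE.search(response)
    if m is None:
        return 0.0                          # format gate: invalid template scores 0
    try:
        scene = json.loads(m.group(1))      # <scene> must contain parseable JSON
    except json.JSONDecodeError:
        return 0.0
    reward = W_FORMAT
    objs = scene.get("objects", [])
    rels = scene.get("relations", [])
    reward += W_COUNT * (LAMBDA_OBJ * count_score(len(objs), len(gt["objects"]))
                         + LAMBDA_REL * count_score(len(rels), len(gt["relations"])))
    correct = m.group(2).strip() == gt["answer"]   # exact-match check (assumed)
    reward += W_ACC * float(correct)
    if correct and objs and gt["objects"]:  # spatial credit only on correct answers
        cost = np.array([[LAMBDA_SPATIAL * (1.0 - iou(p["bbox"], g["bbox"]))
                          + LAMBDA_SEM * (1.0 - semantic_sim(p["id"], g["name"]))
                          for g in gt["objects"]] for p in objs])
        rows, cols = linear_sum_assignment(cost)   # Hungarian matching
        reward += W_SPATIAL * float(np.mean(
            [iou(objs[i]["bbox"], gt["objects"][j]["bbox"])
             for i, j in zip(rows, cols)]))
    return reward
```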

Method overview. Dense lexicographic rewards (format, count, accuracy, spatial) with GRPO enforce grounded 3D reasoning.
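
The GRPO step itself reduces to a few lines. Below is a minimal sketch under the hyperparameters above, assuming per-token log-probabilities for the new, old, and frozen reference policies have already been gathered; the tensor shapes, masking convention, and k3-style KL estimator are assumptions for illustration.

```python
# Minimal sketch of one GRPO update over a group of N rollouts.
import torch

EPS_LOW, EPS_HIGH, BETA = 0.2, 0.3, 1e-2

def grpo_loss(logp_new, logp_old, logp_ref, rewards, mask):
    """logp_*: [N, T] token log-probs; rewards: [N] scalars; mask: [N, T] 0/1."""
    # Group-normalized advantage: A = (r - mean) / (std + eps), one value per rollout.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    adv = adv[:, None]                                   # broadcast across tokens
    ratio = torch.exp(logp_new - logp_old)               # per-token importance ratio
    # PPO-style clipped surrogate with the asymmetric clip range [1-0.2, 1+0.3].
    surrogate = torch.minimum(
        ratio * adv,
        torch.clamp(ratio, 1.0 - EPS_LOW, 1.0 + EPS_HIGH) * adv,
    )
    # Token-level KL regularization toward the frozen reference policy.
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
    loss = -(surrogate - BETA * kl)
    return (loss * mask).sum() / mask.sum()              # mean over valid tokens
```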

STVQA-7K: Dataset Construction
  • Source. Built from Visual Genome scene graphs, yielding 7,587 multiple‑choice VQA pairs spanning 2D and 3D spatial understanding.
  • Coverage. Nine reasoning types: relations, size, orientation, distance, depth, reach, location, count, existence; VG150 predicates augmented with 34 additional spatial relations (e.g., near/far, bigger/taller, facing away, inside/beneath).
  • Generation and filtering. Questions generated from scene graphs using Claude Sonnet 4, rated for difficulty/quality, and filtered via pass@2 consistency checks with GPT‑4o—downselecting from 56,224 to ~7.5K high‑quality items.
  • Localized supervision. Per‑question subgraphs selected via lemmatized keyword matching (a sketch follows this list); absolute‑pixel bounding boxes are retained to support CIoU‑based spatial rewards.
  • Scalability. Pipeline scales to ~108K samples (upper bound of Visual Genome) for future post‑training or RL fine‑tuning.
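
As referenced in the list above, here is a minimal sketch of what lemmatized keyword matching for per-question subgraph selection could look like; the scene-graph field names and the choice of the NLTK lemmatizer are assumptions, not the paper's pipeline.

```python
# Illustrative sketch of per-question subgraph selection via lemmatized
# keyword matching. Requires a one-time nltk.download("wordnet").
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemmas(text):
    """Lowercased lemmas for every whitespace token in a string."""
    return {lemmatizer.lemmatize(tok) for tok in text.lower().split()}

def question_subgraph(question, scene_graph):
    """Keep only objects whose names share a lemma with the question,
    plus the relations that connect two kept objects."""
    keys = lemmas(question)
    objects = [o for o in scene_graph["objects"] if lemmas(o["name"]) & keys]
    kept = {o["id"] for o in objects}
    relations = [r for r in scene_graph["relations"]
                 if r["subject"] in kept and r["object"] in kept]
    # Absolute-pixel bboxes stay on each object to support CIoU-based rewards.
    return {"objects": objects, "relations": relations}
```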

STVQA-7K pipeline (top) and QA-type distribution with examples (below).

Qualitative Examples: Scene‑Grounded Spatial Reasoning

Results

Performance on Spatial VQA

SpatialThinker‑7B delivers near‑proprietary accuracy and SOTA among open models on spatial VQA with only ~7.5K training samples.

| Model | 3DSRBench | CV-Bench 2D | CV-Bench 3D | CV-Bench Avg | BLINK Rel | BLINK Depth | BLINK Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Proprietary Models | | | | | | | |
| GPT‑4o | 44.3 | 75.8 | 83.0 | 79.4 | 82.5 | 78.2 | 80.4 |
| Claude 3.5 Sonnet | 48.2 | 60.2 | 71.5 | 65.9 | 58.7 | 67.7 | 63.2 |
| Open‑Source General MLLMs | | | | | | | |
| Qwen2.5‑VL‑3B | 44.0 | 59.9 | 60.2 | 60.0 | 66.4 | 54.0 | 60.2 |
| Qwen2.5‑VL‑7B | 48.4 | 69.1 | 68.0 | 68.6 | 84.0 | 52.4 | 68.2 |
| VLAA‑Thinker‑Qwen2.5‑VL‑7B | 52.2 | 60.8 | 60.3 | 60.6 | 81.2 | 71.0 | 76.1 |
| LLaVA‑NeXT‑8B | 48.4 | 62.2 | 65.3 | 63.8 | – | – | – |
| Cambrian‑1‑8B | 42.2 | 72.3 | 72.0 | 72.2 | 69.9 | 73.4 | 71.7 |
| Open‑Source Spatial MLLMs | | | | | | | |
| RoboPoint‑13B | – | – | – | 61.2 | 60.8 | 61.3 | 61.1 |
| SpatialBot‑3B | 41.1 | – | – | 69.1 | 67.8 | 67.7 | 67.8 |
| SpaceLLaVA‑13B | 42.0 | – | – | 68.5 | 72.7 | 62.9 | 67.8 |
| SATORI‑R1 | 48.0 | 54.6 | 69.4 | 62.0 | 77.0 | 58.9 | 68.0 |
| Spatial‑RGPT‑7B w/ depth | 48.4 | – | – | 60.7 | 65.7 | 82.3 | 74.0 |
| SpaceThinker | 51.1 | 65.1 | 65.9 | 65.5 | 73.4 | 59.9 | 66.7 |
| SpaceOm | 52.2 | 72.1 | 69.3 | 70.7 | 81.1 | 65.3 | 73.2 |
| Method Comparison (Trained on STVQA‑7K) | | | | | | | |
| Qwen2.5‑VL‑3B + SFT | 50.8 | 53.9 | 68.4 | 61.1 | 65.0 | 66.9 | 66.0 |
| Qwen2.5‑VL‑3B + Vanilla GRPO | 50.1 | 70.6 | 66.6 | 68.6 | 73.4 | 55.6 | 64.5 |
| SpatialThinker‑3B (Ours) | 52.9 | 71.0 | 76.3 | 73.6 | 81.8 | 66.9 | 74.4 |
| Qwen2.5‑VL‑7B + SFT | 53.6 | 56.1 | 71.3 | 63.7 | 75.5 | 64.5 | 70.0 |
| Qwen2.5‑VL‑7B + Vanilla GRPO | 54.7 | 68.9 | 76.5 | 72.7 | 80.4 | 75.0 | 77.7 |
| SpatialThinker‑7B (Ours) | 56.4 | 77.7 | 78.7 | 78.2 | 86.0 | 72.6 | 79.3 |

– = not reported.

| Model | MMVP | SpatialReasonerEval | SpatialBench |
| --- | --- | --- | --- |
| Proprietary Models | | | |
| GPT‑4o | 70.7 | 85.8 | 67.0 |
| Claude 3.5 Sonnet | 71.3 | 84.1 | 63.2 |
| Open‑Source General & Spatial MLLMs | | | |
| Qwen2.5‑VL‑3B | 67.0 | 68.0 | 49.9 |
| Qwen2.5‑VL‑7B | 72.3 | 70.6 | 62.5 |
| VLAA‑Thinker‑7B | 75.3 | 61.2 | 66.2 |
| SpaceThinker | 63.0 | 69.6 | 57.9 |
| SpaceOm | 66.3 | 68.9 | 58.6 |
| SpatialReasoner | 64.0 | 76.4 | 59.2 |
| SATORI‑R1 | 67.7 | 70.5 | 60.3 |
| Visionary‑R1 | 70.3 | 72.9 | 59.8 |
| Method Comparison (Trained on STVQA‑7K) | | | |
| Qwen2.5‑VL‑3B + SFT | 62.7 | 67.5 | 56.3 |
| Qwen2.5‑VL‑3B + Vanilla GRPO | 68.3 | 69.3 | 56.9 |
| SpatialThinker‑3B (Ours) | 69.0 | 76.5 | 61.5 |
| Qwen2.5‑VL‑7B + SFT | 68.3 | 70.8 | 63.5 |
| Qwen2.5‑VL‑7B + Vanilla GRPO | 74.3 | 79.6 | 64.2 |
| SpatialThinker‑7B (Ours) | 78.0 | 82.7 | 66.4 |


Performance on Real‑World and General VQA

Dense spatial rewards transfer to real‑world VQA; SpatialThinker‑7B leads MM‑Star, VStarBench, and RoboSpatial‑Home, while remaining competitive on hallucination and real‑world tests.

| Model | MM‑Star | VStar | RealWorldQA | MME‑RW‑Lite | RoboSpatial‑Home | HallusionBench |
| --- | --- | --- | --- | --- | --- | --- |
| Proprietary and Open‑Source MLLMs | | | | | | |
| GPT‑4o | 64.7 | 66.0 | 75.4 | 51.6 | 68.4 | 55.0 |
| Claude 3.5 Sonnet | 65.1 | 51.8 | 60.1 | 45.2 | 57.0 | 55.5 |
| Qwen2.5‑VL‑3B | 55.9 | 74.9 | 58.2 | 41.9 | 58.7 | 46.3 |
| Qwen2.5‑VL‑7B | 63.9 | 75.9 | 68.4 | 44.1 | 70.6 | 52.9 |
| VLAA‑Thinker‑7B | 63.8 | 58.1 | 66.4 | 44.6 | 68.9 | 68.9 |
| SpaceThinker | 54.5 | 56.5 | 61.6 | – | 52.6 | 65.4 |
| SpaceOm | 57.7 | 56.5 | 53.3 | – | 68.9 | 62.9 |
| Method Comparison (Trained on STVQA‑7K) | | | | | | |
| Qwen2.5‑VL‑3B + SFT | 53.9 | 73.3 | 64.8 | 43.0 | 69.8 | 58.9 |
| Qwen2.5‑VL‑3B + Vanilla GRPO | 56.7 | 74.3 | 64.4 | 46.7 | 64.0 | 59.0 |
| SpatialThinker‑3B (Ours) | 57.6 | 78.0 | 66.3 | 46.5 | 70.6 | 62.5 |
| Qwen2.5‑VL‑7B + SFT | 63.2 | 78.0 | 65.4 | 47.4 | 72.4 | 66.2 |
| Qwen2.5‑VL‑7B + Vanilla GRPO | 63.4 | 73.9 | 66.6 | 46.3 | 76.2 | 60.7 |
| SpatialThinker‑7B (Ours) | 65.9 | 81.7 | 69.2 | 48.3 | 76.3 | 66.4 |

– = not reported.

BibTeX


        @misc{batra2025spatialthinkerreinforcing3dreasoning,
          title={SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards}, 
          author={Hunar Batra and Haoqin Tu and Hardy Chen and Yuanze Lin and Cihang Xie and Ronald Clark},
          year={2025},
          eprint={2511.07403},
          archivePrefix={arXiv},
          primaryClass={cs.CV},
          url={https://arxiv.org/abs/2511.07403}, 
        }