(1) Lack of 3D Spatial Knowledge: Existing MLLMs lack supervision connecting 2D pixels to 3D relational structure.
(2) Data Inefficiency & Limited Coverage: Prior Spatial VLMs rely on massive training sets, yet they still generalize poorly and capture only a narrow subset of spatial relations. In contrast, our generated STVQA-7K dataset covers 84 distinct 2D and 3D relations across relation, distance, depth, orientation, size, reach, and instance-location reasoning.
(3) Sparse RL Signals: Naive reinforcement learning provides weak scalar rewards, failing to shape structured spatial reasoning.
(4) Disjoint Scene Graph Usage: Scene graphs are often treated as external pre-processing tools rather than being integrated into the model's reasoning loop.
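
To make point (3) concrete, the following is a minimal sketch of an answer-only scalar reward. The function `binary_reward`, the example question, and the rollouts are hypothetical illustrations, not the reward or data of any particular method. The sketch shows that such a reward cannot distinguish a rollout with grounded spatial reasoning from one that reaches the right answer without it.

```python
def binary_reward(predicted_answer: str, gold_answer: str) -> float:
    """Hypothetical sparse reward: 1.0 on a case-insensitive answer match, else 0.0."""
    return 1.0 if predicted_answer.strip().lower() == gold_answer.strip().lower() else 0.0


# Hypothetical rollouts for the question "Is the mug left of the laptop?" (gold: "yes").
rollouts = [
    # Grounded reasoning, correct answer.
    {"reasoning": "mug at x=120, laptop at x=400, so the mug is to the left", "answer": "yes"},
    # Ungrounded reasoning, correct answer by luck.
    {"reasoning": "the image looks bright", "answer": "yes"},
    # Correct localization, wrong final relation.
    {"reasoning": "mug at x=120, laptop at x=400, so the mug is to the right", "answer": "no"},
]

# The reward looks only at the final answer and ignores the reasoning field.
rewards = [binary_reward(r["answer"], "yes") for r in rollouts]
print(rewards)
```

Under this reward the first two rollouts are indistinguishable, and the third receives no credit for its correct localization step, which is the sense in which the signal is sparse.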