See, Symbolize, Act

Grounding VLMs with Spatial Representations for Better Gameplay

Abstract

Vision-Language Models (VLMs) excel at describing visual scenes, yet struggle to translate perception into precise, grounded actions. We investigate whether providing VLMs with both the visual frame and a symbolic representation of the scene improves their performance in interactive environments. We evaluate three state-of-the-art VLMs across Atari games, VizDoom, and AI2-THOR, comparing frame-only, frame with self-extracted symbols, frame with ground-truth symbols, and symbol-only pipelines. Our results indicate that all models benefit when the symbolic information is accurate. When VLMs extract symbols themselves, however, performance depends on model capability and scene complexity. Our findings reveal that symbolic grounding benefits VLMs only when symbol extraction is reliable, and highlight perception quality as a central bottleneck for future VLM-based agents.

Key Findings

  • Ground-truth symbols help every model in every game. When object positions come from the game RAM (via OCAtari), all three VLMs improve across Pong, Breakout, Space Invaders, VizDoom, and AI2-THOR.
  • Self-extracted symbols help Claude but hurt GPT-4o and Gemini. The difference traces directly to object detection quality. Claude's F1 is 0.715; GPT-4o's is 0.124. Inaccurate coordinates degrade decision-making.
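In the ground-truth setting, object symbols read from the game RAM are passed to the model alongside the frame. As a rough illustration of what such a symbolic scene description might look like in a prompt, here is a minimal sketch; the field names (`category`, `x`, `y`, `w`, `h`) and the output format are illustrative assumptions, not the paper's exact schema.

```python
# Hypothetical sketch: serializing OCAtari-style object records into a compact
# text block that can be appended to a VLM prompt. Field names are assumed.

def symbols_to_prompt(objects):
    """Render a list of object dicts as one line per object."""
    lines = ["Objects in the current frame:"]
    for obj in objects:
        lines.append(
            f"- {obj['category']} at (x={obj['x']}, y={obj['y']}), "
            f"size {obj['w']}x{obj['h']}"
        )
    return "\n".join(lines)

# Example: two objects from a Pong-like frame (made-up coordinates).
frame_symbols = [
    {"category": "Player", "x": 140, "y": 180, "w": 4, "h": 16},
    {"category": "Ball", "x": 80, "y": 95, "w": 2, "h": 4},
]
print(symbols_to_prompt(frame_symbols))
```

In the self-extraction setting, the VLM would produce this object list itself from the pixels, which is where detection quality becomes the deciding factor.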

Main Results

Ground-truth symbols improve every model, but self-extracted symbols only help Claude. GPT-4o and Gemini degrade in complex scenes because their low detection accuracy (F1 of 0.124 and 0.189, respectively) introduces more noise than signal.

Claude-4-Sonnet performance across games, normalized relative to F+S-GT score (see paper for absolute values).

Detection Quality

The difference between models traces directly to how accurately they extract object positions from the frame.

Model            F1 Score   IoU
Claude-4-Sonnet  0.715      0.533
Gemini-2.5-Pro   0.189      0.202
GPT-4o           0.124      0.128
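For concreteness, the two metrics in the table can be sketched as follows: IoU is intersection-over-union between a predicted and a ground-truth bounding box, and F1 combines precision and recall after matching predictions to ground-truth objects. The greedy matching and the 0.5 IoU threshold below are common conventions and are assumptions here, not necessarily the paper's exact protocol.

```python
# Sketch of detection metrics: IoU between two boxes (x1, y1, x2, y2), and
# F1 from greedily matching predicted boxes to ground-truth boxes at an
# assumed IoU threshold of 0.5.

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def detection_f1(preds, gts, thresh=0.5):
    """F1 score: each ground-truth box may be matched by at most one prediction."""
    matched, tp = set(), 0
    for p in preds:
        best, best_j = 0.0, None
        for j, g in enumerate(gts):
            if j in matched:
                continue
            v = iou(p, g)
            if v > best:
                best, best_j = v, j
        if best >= thresh and best_j is not None:
            matched.add(best_j)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

Under this scheme, a model that hallucinates extra objects loses precision and a model that misses objects loses recall, both of which drag F1 down.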

Conclusion

Symbolic grounding improves VLM gameplay, but only when the model can extract symbols accurately. When detection quality is low, self-extracted coordinates introduce noise that hurts performance more than having no symbols at all.

Read the full paper

BibTeX

@misc{baghel2026seesymbolizeactgrounding,
      title={See, Symbolize, Act: Grounding VLMs with Spatial Representations for Better Gameplay},
      author={Ashish Baghel and Paras Chopra},
      year={2026},
      eprint={2603.11601},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2603.11601},
}

Contact

ashish.baghel@lossfunk.com  ·  paras@lossfunk.com