RoboGolf: Mastering Real-World Minigolf with a Reflective Multi-Modality Vision-Language Model

Hantao Zhou¹*, Tianying Ji²*, Lukas Sommerhalder¹, Michael Goerner¹, Norman Hendrich¹, Jianwei Zhang¹, Fuchun Sun², Huazhe Xu²
¹University of Hamburg, ²Tsinghua University, *Equal contribution

Abstract

Minigolf is an exemplary real-world game for examining embodied intelligence: it not only challenges spatial and kinodynamic reasoning but also requires the ability to reflect and to modify impractical course designs. We introduce RoboGolf, a framework that combines dual-camera perception with closed-loop control, augmented by a reflective equilibrium loop. The core of both loops is powered by fine-tuned vision-language models (VLMs). Extensive experiments on both challenging courts and impractical courts demonstrate the effectiveness of our approach.
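To make the two-loop structure concrete, here is a minimal Python sketch under stated assumptions. It is not the RoboGolf implementation. The VLM planner, the robot with its dual-camera observation, and the reflective VLM are all replaced by hypothetical stand-in callables (`propose`, `execute`, `reflect`), and the toy court reduces a shot to a single normalized force. The inner loop (`closed_loop_play`) retries shots using feedback from earlier outcomes. The outer loop (`reflective_play`) asks for a court modification only when the inner loop keeps failing.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Shot:
    angle_deg: float  # hitting direction in the court frame (unused by the toy court)
    force: float      # normalized hitting force; the toy court clamps it to [0, 1]


@dataclass
class Outcome:
    holed: bool
    error: float  # hypothetical convention: > 0 overshoot, < 0 fell short


History = List[Tuple[Shot, Outcome]]
Policy = Callable[[History], Shot]                   # stand-in for the VLM planner
Executor = Callable[[Shot], Outcome]                 # stand-in for robot + cameras
Reflector = Callable[[History], Optional[Executor]]  # stand-in for the reflective VLM


def closed_loop_play(propose: Policy, execute: Executor,
                     max_attempts: int = 5) -> Tuple[bool, History]:
    """Inner loop: propose a shot, observe the outcome, feed it back."""
    history: History = []
    for _ in range(max_attempts):
        shot = propose(history)
        outcome = execute(shot)
        history.append((shot, outcome))
        if outcome.holed:
            return True, history
    return False, history


def reflective_play(propose: Policy, execute: Executor, reflect: Reflector,
                    max_rounds: int = 2) -> bool:
    """Outer loop: after a failed inner loop, ask for a modified court."""
    for _ in range(max_rounds):
        success, history = closed_loop_play(propose, execute)
        if success:
            return True
        modified = reflect(history)  # None means no modification was proposed
        if modified is None:
            return False
        execute = modified  # continue playing on the modified court
    return closed_loop_play(propose, execute)[0]


def toy_court(target_force: float, tol: float = 0.02) -> Executor:
    """Hypothetical 1-D court: only force matters; a target above 1.0 is unreachable."""
    def execute(shot: Shot) -> Outcome:
        error = min(max(shot.force, 0.0), 1.0) - target_force
        return Outcome(holed=abs(error) <= tol, error=error)
    return execute


def proportional_policy(history: History) -> Shot:
    """Hypothetical planner: start soft, then correct the force by half the last error."""
    if not history:
        return Shot(angle_deg=0.0, force=0.3)
    last_shot, last_outcome = history[-1]
    new_force = min(max(last_shot.force - 0.5 * last_outcome.error, 0.0), 1.0)
    return Shot(angle_deg=last_shot.angle_deg, force=new_force)
```

For example, `closed_loop_play(proportional_policy, toy_court(0.6))` holes the ball on the fifth attempt. By contrast, `toy_court(1.2)` can never be solved with a force capped at 1.0, so `reflective_play` succeeds on it only if `reflect` returns a feasible modified court. Keeping reflection outside the retry loop means the court is modified only after ordinary shot corrections have been exhausted.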

RoboGolf Handles Diverse Courts

Simple Courts


Medium Courts

Variations of Endpoint Positions

These courts require a precise hitting force: strong enough to carry the ball up the ramp, yet not so strong that it overshoots the volcano.
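The following is a rough illustration of the lower end of this force window. It is a textbook estimate, not a model used in this work. Assume a solid ball rolling without slipping, and ignore rolling resistance and collisions. To crest a ramp of height h, the launch speed v_0 must supply both translational and rotational kinetic energy:

```latex
\tfrac{1}{2} m v_0^2 + \tfrac{1}{2} I \omega^2
  = \tfrac{1}{2} m v_0^2 + \tfrac{1}{2}\left(\tfrac{2}{5} m r^2\right)\left(\tfrac{v_0}{r}\right)^2
  = \tfrac{7}{10} m v_0^2 \;\ge\; m g h
  \quad\Longrightarrow\quad
  v_0 \ge \sqrt{\tfrac{10}{7}\, g h}
```

For a hypothetical 5 cm ramp, this gives v_0 ≥ √(10 · 9.81 · 0.05 / 7) ≈ 0.84 m/s. Friction raises this minimum. The upper bound comes from requiring the ball to stop before the volcano, which depends on course geometry and surface friction that are not given here.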

Variations of Endpoint Types

Different endpoint types and obstacles require the model to recognize them and plan accurately to avoid overshooting or falling short.

Courts with Multiple Feasible Solutions

The model must identify the various feasible solutions and plan the shot accurately to navigate the obstacles and reach the target efficiently.

Complex Kinodynamics

The ball must be hit with precise force and direction so that it does not get stuck, strike the walls, or lose too much momentum.

Hard Courts



Bilateral Golf Ball Impact Challenge

Hit the red ball from the starting point so that it bumps the white ball, initially positioned at the center of the court, into the round yellow endpoint.