Vision-Language-Autonomy

IROS'25 Workshop on AI Meets Autonomy: CMU VLA Challenge (4th place)
Urban Robotics Lab, Korea Advanced Institute of Science and Technology
*These authors contributed equally; †Corresponding authors

Abstract

We developed a Vision-Language-Autonomy framework that takes in natural language queries or commands about a scene and generates the appropriate navigation-based response by reasoning about semantic and spatial relationships. The framework operates in an initially unknown environment and navigates to appropriate viewpoints to discover and validate spatial relations and attributes. Our system is equipped with a 3D LiDAR and a 360 camera, along with an autonomy system capable of estimating the sensor pose, analyzing the terrain, avoiding collisions, and navigating to waypoints.

Our system is composed of four main components:

  • Perception Module: Robust object detection and tracking using 3D LiDAR and 360 camera.
  • 3D Scene Graph Module: Building a hierarchical 3D scene graph using the detected objects.
  • Object-Goal Navigation Module: Navigating to the goal object using the sensor input and natural language query.
  • Visual Grounding Module: Grounding the language query in the given 3D environment and reasoning to generate the appropriate navigation-based response.

The integrated Vision-Language-Autonomy framework can reason about three types of questions:

  • Numerical questions: "How many chairs are there between the table and the sofa?"
  • Object reference questions: "Find the potted plant closest to the table."
  • Instruction-following questions: "Please go to the kitchen avoiding the path between the fridge and the oven."

Method

Perception Module

We use YOLO-World for object detection and tracking. 3D LiDAR points are projected into the 2D image, and the image regions are masked with SAM. With class labels verified by YOLO-World and the SAM masks, we project the masked point clouds back into 3D space together with their class labels and CLIP features.
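A minimal sketch of this projection-and-masking step is shown below, assuming an ideal equirectangular 360-camera model and a hypothetical interface where each SAM mask comes paired with a YOLO-World-verified class ID; the actual camera model and data structures in our stack may differ.

```python
import numpy as np

def project_points_equirect(points_cam: np.ndarray, img_w: int, img_h: int) -> np.ndarray:
    """Project 3D points (in the 360-camera frame) to equirectangular pixel coords.

    Assumes an ideal equirectangular model; this is a simplification of the
    real camera calibration.
    """
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    r = np.linalg.norm(points_cam, axis=1) + 1e-9
    lon = np.arctan2(y, x)                         # azimuth in [-pi, pi]
    lat = np.arcsin(np.clip(z / r, -1.0, 1.0))     # elevation in [-pi/2, pi/2]
    u = (lon / (2 * np.pi) + 0.5) * img_w
    v = (0.5 - lat / np.pi) * img_h
    return np.stack([u, v], axis=1)

def label_points_with_masks(points_cam, masks, class_ids, img_w, img_h):
    """Assign each LiDAR point the class ID of the SAM mask it falls into (or -1).

    `masks` is a list of boolean HxW arrays from SAM; `class_ids` holds the
    YOLO-World class verified for each mask (hypothetical interface).
    """
    uv = project_points_equirect(points_cam, img_w, img_h).astype(int)
    labels = np.full(len(points_cam), -1, dtype=int)
    in_img = (uv[:, 0] >= 0) & (uv[:, 0] < img_w) & (uv[:, 1] >= 0) & (uv[:, 1] < img_h)
    for mask, cid in zip(masks, class_ids):
        hit = in_img & mask[uv[:, 1].clip(0, img_h - 1), uv[:, 0].clip(0, img_w - 1)]
        labels[hit] = cid
    return labels
```

The labeled points can then be clustered into per-object point clouds and carried into 3D together with their CLIP features.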

3D Scene Graph Module

We build a hierarchical 3D scene graph from the detected objects. The layers consist of detection, object, keyframe, place, room, and scene. The object layer contains detected objects that are registered with their 3D bounding boxes, while the detection layer contains detections that are not yet registered with 3D bounding boxes. The keyframe layer contains keyframes registered using the following rules (a minimal sketch of this logic follows the list):

  • If the pose difference between the current frame and the previous frame is greater than a threshold, the current frame is registered as a keyframe.
  • If a new object is detected, the current frame is registered as a keyframe.
  • If a new detection is found, the current frame is registered as a keyframe.
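The keyframe decision can be sketched as below; the pose-difference thresholds shown are hypothetical placeholders, not the values used in the actual system.

```python
import numpy as np

def should_register_keyframe(curr_pose: np.ndarray,
                             last_kf_pose: np.ndarray,
                             new_object_found: bool,
                             new_detection_found: bool,
                             trans_thresh: float = 0.5,
                             rot_thresh_rad: float = 0.3) -> bool:
    """Decide whether the current frame becomes a keyframe.

    Poses are 4x4 homogeneous transforms; thresholds are illustrative only.
    """
    # Rules 2 and 3: any newly observed object or detection triggers a keyframe.
    if new_object_found or new_detection_found:
        return True

    # Rule 1: pose difference (translation or rotation) exceeds a threshold.
    delta = np.linalg.inv(last_kf_pose) @ curr_pose
    trans = np.linalg.norm(delta[:3, 3])
    # Rotation angle recovered from the trace of the relative rotation matrix.
    cos_angle = np.clip((np.trace(delta[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    rot = np.arccos(cos_angle)
    return trans > trans_thresh or rot > rot_thresh_rad
```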

Object-Goal Navigation Module

We developed an exploration strategy that guides the agent to explore the environment efficiently and effectively in finding the goal object. We use three types of exploration strategies (a sketch of the frontier selection follows the list):

  • Geometric frontier-based exploration: Navigates to the geometrically closest frontier.
  • Semantic frontier-based exploration: Navigates to the semantically closest frontier using CLIP embedded semantic value map.
  • Contour-based exploration: Navigates along the contour of the environment for thorough area coverage.
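Below is a minimal sketch of how a frontier could be selected under the geometric and semantic strategies, assuming a 2D CLIP-similarity value map over a grid (`value_map`, `grid_origin`, `resolution` are hypothetical names) and an illustrative distance/semantic trade-off weight; the contour-following strategy is omitted here.

```python
from typing import Optional
import numpy as np

def select_frontier(frontiers: np.ndarray,
                    robot_xy: np.ndarray,
                    value_map: Optional[np.ndarray] = None,
                    grid_origin: Optional[np.ndarray] = None,
                    resolution: float = 0.1,
                    mode: str = "geometric") -> np.ndarray:
    """Pick the next frontier waypoint.

    `frontiers` is an (N, 2) array of frontier centers in world coordinates;
    `value_map` is a 2D grid of CLIP similarities between the goal text and
    the observed imagery (how that map is built is outside this sketch).
    """
    dists = np.linalg.norm(frontiers - robot_xy, axis=1)
    if mode == "geometric" or value_map is None or grid_origin is None:
        # Geometric strategy: the closest frontier wins.
        return frontiers[np.argmin(dists)]

    # Semantic strategy: look up each frontier's value in the CLIP value map
    # and prefer high semantic value, penalized slightly by travel distance.
    cells = ((frontiers - grid_origin) / resolution).astype(int)
    cols = cells[:, 0].clip(0, value_map.shape[1] - 1)
    rows = cells[:, 1].clip(0, value_map.shape[0] - 1)
    values = value_map[rows, cols]
    score = values - 0.05 * dists  # hypothetical trade-off weight
    return frontiers[np.argmax(score)]
```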
Figures: geometric frontier-based exploration, semantic frontier-based exploration, and contour-based exploration.

Visual Grounding Module

Figures: find-task reasoning flow, numerical-task reasoning flow, multiview observation, and confidence aggregation.

We utilize VLMs to reason about the language query and the 3D environment. For each related object detection, the agent performs multiview observation using the object hull. After observing multiple views for each group, the visual grounding module performs reasoning. We run multiple VLM models in parallel and aggregate their results for more reliable reasoning. Entropy-based confidence aggregation is used to determine the most confident answer to the question, and the module outputs the answer candidate whose confidence exceeds a threshold.
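Below is a minimal sketch of such entropy-based aggregation, assuming a hypothetical interface where each VLM returns a list of answers across the observed views (`per_model_answers`); more self-consistent (lower-entropy) models are weighted more heavily, and the best candidate is returned only if its aggregated confidence exceeds the threshold.

```python
from collections import Counter
import numpy as np

def entropy(probs: np.ndarray) -> float:
    """Shannon entropy of a discrete distribution (natural log)."""
    p = probs[probs > 0]
    return float(-(p * np.log(p)).sum())

def aggregate_answers(per_model_answers, conf_thresh: float = 0.7):
    """Combine answers from multiple VLMs observed over multiple views.

    Each element of `per_model_answers` is one model's list of answers across
    views (hypothetical interface). Returns (answer, confidence) or None if no
    candidate is confident enough.
    """
    candidates = sorted({a for answers in per_model_answers for a in answers})
    votes = np.zeros(len(candidates))
    for answers in per_model_answers:
        counts = Counter(answers)
        probs = np.array([counts.get(c, 0) for c in candidates], dtype=float)
        probs /= probs.sum()
        # Low-entropy (self-consistent) models contribute more to the vote.
        weight = 1.0 / (1.0 + entropy(probs))
        votes += weight * probs
    votes /= votes.sum()
    best = int(np.argmax(votes))
    confidence = float(votes[best])
    return (candidates[best], confidence) if confidence >= conf_thresh else None
```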

Demonstration

Our framework successfully reasons about language queries in an initially unknown 3D environment.

How many sofas are below a window? Ans: 2 (with confidence 0.93)