Vision-Language-Autonomy
IROS'25 Workshop on AI Meets Autonomy: CMU VLA Challenge (4th place)
We developed a Vision-Language-Autonomy framework that takes natural language queries or commands about a scene and generates the appropriate navigation-based response by reasoning about semantic and spatial relationships. The framework operates in an initially unknown environment and navigates to appropriate viewpoints to discover and validate spatial relations and attributes. Our system is equipped with a 3D LiDAR and a 360° camera, along with an autonomy stack capable of estimating the sensor pose, analyzing the terrain, avoiding collisions, and navigating to waypoints.
Our system is composed of four main components, each described below:
- Open-vocabulary object detection and tracking
- Hierarchical 3D scene graph construction
- Autonomous exploration
- VLM-based visual grounding
The integrated Vision-Language-Autonomy framework can reason about the three types of questions posed in the CMU VLA Challenge:
- Numerical questions (e.g., "How many chairs are in the room?")
- Object reference questions (e.g., "Find the potted plant closest to the window.")
- Instruction-following commands (e.g., "Take the path past the table to the refrigerator.")


We use YOLO-World for open-vocabulary object detection and tracking. 3D LiDAR points are projected into the 2D images, and the images are segmented with SAM masks. With class labels verified by YOLO-World and refined by the SAM masks, we back-project the masked point clouds into 3D space together with their class labels and CLIP features.
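To make the detection-to-3D step concrete, here is a minimal sketch of the projection and label-fusion logic. It assumes a pinhole camera model for simplicity (the actual system uses a 360° camera), and the function name and interfaces are hypothetical, not our exact implementation:

```python
import numpy as np

def lidar_points_to_labels(points_lidar, K, T_cam_lidar, sam_mask, class_label, clip_feature):
    """Project LiDAR points into the image, keep those inside a SAM mask,
    and return labeled 3D points. (Illustrative sketch; interfaces are assumptions.)

    points_lidar: (N, 3) points in the LiDAR frame
    K:            (3, 3) camera intrinsics (pinhole approximation)
    T_cam_lidar:  (4, 4) LiDAR-to-camera extrinsics
    sam_mask:     (H, W) boolean segmentation mask for one detection
    """
    # Transform points into the camera frame.
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]

    # Keep only points in front of the camera.
    in_front = pts_cam[:, 2] > 0.1
    pts_cam = pts_cam[in_front]

    # Pinhole projection to pixel coordinates.
    uv = (K @ pts_cam.T).T
    u = (uv[:, 0] / uv[:, 2]).astype(int)
    v = (uv[:, 1] / uv[:, 2]).astype(int)

    # Keep points that land inside the image and inside the SAM mask.
    H, W = sam_mask.shape
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    hit = np.zeros(len(u), dtype=bool)
    hit[inside] = sam_mask[v[inside], u[inside]]

    # Attach the verified class label and CLIP feature to the surviving points.
    return {
        "points": points_lidar[in_front][hit],
        "label": class_label,          # verified by YOLO-World
        "clip_feature": clip_feature,  # image-crop embedding
    }
```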

We build a hierarchical 3D scene graph from the detected objects. The layers consist of detection, object, keyframe, place, room, and scene. The object layer contains detected objects that have been registered with their 3D bounding boxes; the detection layer contains detected objects that have not yet been registered with 3D bounding boxes. The keyframe layer contains keyframes that are registered using the following rules:

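As a rough illustration of the six-layer hierarchy, here is a minimal sketch of the graph's node types; the class and field names are hypothetical, not our actual code:

```python
from dataclasses import dataclass, field

@dataclass
class Detection:            # observation not yet registered with a 3D box
    label: str
    clip_feature: list

@dataclass
class SceneObject:          # registered object with a 3D bounding box
    label: str
    bbox_3d: tuple          # e.g. (center_xyz, size_xyz)
    detections: list = field(default_factory=list)

@dataclass
class Keyframe:             # registered camera pose with its observations
    pose: tuple
    objects: list = field(default_factory=list)

@dataclass
class Place:                # local region grouping nearby keyframes (assumed semantics)
    keyframes: list = field(default_factory=list)

@dataclass
class Room:
    places: list = field(default_factory=list)

@dataclass
class Scene:                # root of the hierarchy
    rooms: list = field(default_factory=list)
```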
We developed an exploration strategy that guides the agent to explore the environment efficiently and effectively in search of the goal object (a generic viewpoint-scoring sketch follows). We use three types of exploration strategies:





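As a hedged illustration of how such a strategy might pick the next viewpoint, here is a minimal frontier-scoring loop; the scoring weights and the goal_relevance helper are assumptions for illustration, not our actual implementation:

```python
import math

def select_next_viewpoint(frontiers, robot_pose, goal_relevance, w_gain=1.0, w_cost=0.5):
    """Pick the frontier maximizing (information gain + goal relevance - travel cost).

    frontiers:      list of (x, y, expected_unknown_area) candidate viewpoints
    robot_pose:     (x, y) current robot position
    goal_relevance: maps (x, y) -> score in [0, 1] of how likely the goal object
                    is nearby (hypothetical helper, e.g. CLIP similarity of
                    detections near the candidate)
    """
    best, best_score = None, -math.inf
    for (x, y, unknown_area) in frontiers:
        travel_cost = math.hypot(x - robot_pose[0], y - robot_pose[1])
        score = w_gain * unknown_area + goal_relevance((x, y)) - w_cost * travel_cost
        if score > best_score:
            best, best_score = (x, y), score
    return best
```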
We use VLMs to reason about the language query and the 3D environment. For each related object detection, the agent performs multiview observation around the object's hull. After observing multiple views for each group, the visual grounding module performs reasoning. We run multiple VLM models in parallel and aggregate their results for more robust reasoning. Entropy-based confidence aggregation is used to determine the most confident answer to the question. The module outputs the answer candidate whose confidence exceeds a threshold.
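A minimal sketch of entropy-based confidence aggregation across parallel VLMs, assuming each model returns a probability distribution over a shared set of answer candidates; the specific weighting scheme below is an assumption, not necessarily our exact formulation:

```python
import numpy as np

def aggregate_vlm_answers(distributions, candidates, threshold=0.5):
    """Aggregate per-model answer distributions, weighting low-entropy (confident)
    models higher, and return an answer only above a confidence threshold.

    distributions: (M, C) array; row m is model m's probabilities over C candidates
    candidates:    list of C answer strings
    """
    P = np.asarray(distributions)
    eps = 1e-12

    # Shannon entropy of each model's distribution; low entropy = confident model.
    entropy = -(P * np.log(P + eps)).sum(axis=1)
    max_entropy = np.log(P.shape[1])

    # Weight each model by how far its entropy falls below the maximum.
    weights = (max_entropy - entropy) + eps
    weights /= weights.sum()

    # Confidence-weighted average over the candidate distributions.
    agg = weights @ P
    best = int(np.argmax(agg))
    if agg[best] >= threshold:
        return candidates[best], float(agg[best])
    return None, float(agg[best])  # abstain when no candidate is confident enough
```

For example, if two models concentrate their probability mass on the same candidate while a third is near-uniform, the uncertain model receives little weight and the agreed answer dominates the aggregate.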