CLUE: Adaptively Prioritized Contextual Cues by Leveraging a Unified Semantic Map for Effective Zero-Shot Object-Goal Navigation

ICRA'26 (Under Review)

Taeyun Kim¹, Alvin Jinsung Choi¹, Dasol Hong¹, Hyun Myung†¹

¹Urban Robotics Lab, Korea Advanced Institute of Science and Technology
†Corresponding author

paper (Coming Soon) | arxiv (Coming Soon) | code (Coming Soon)

Abstract

Zero-shot object-goal navigation (ZSON) is a challenging problem in robotics that requires a comprehensive understanding of both language and visual observations. Contextual cues from rooms and objects are critical, but their relative importance depends on the target: some objects are strongly tied to specific room types, while others are better predicted by nearby co-located objects. Existing methods overlook this distinction, leading to inefficient and inaccurate exploration.

We present CLUE, a novel navigation framework that adaptively balances the use of contextual rooms and objects by leveraging commonsense knowledge extracted from an offline large language model (LLM). By estimating a target’s association with room types using the LLM, the agent prioritizes room cues for targets whose location is predictable from room type and object cues for targets with weak room associations. Our framework constructs a unified semantic value map that integrates both types of contextual information, adaptively weighted by the target’s ambiguity, to guide exploration. Combined with multi-viewpoint verification and an exploration strategy informed by contextual cues, CLUE achieves robust and efficient navigation. Extensive experiments in simulation and real-world deployments show that our method consistently outperforms state-of-the-art baselines in both success rate (SR) and success weighted by path length (SPL), demonstrating its effectiveness and practicality for real-world navigation tasks.
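For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how SR and SPL are commonly computed over a set of episodes, using the standard SPL definition (success multiplied by the ratio of shortest-path length to the larger of the agent's path length and the shortest-path length, averaged over episodes). The `Episode` class, its field names, and the sample values are hypothetical and are not results from this work.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One navigation episode (hypothetical container for illustration)."""
    success: bool
    shortest_path: float  # shortest-path distance from start to goal, in meters
    agent_path: float     # length of the path the agent actually traveled, in meters


def success_rate(episodes):
    """Fraction of episodes in which the agent reached the target."""
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes):
    """Success weighted by path length, averaged over all episodes."""
    total = 0.0
    for e in episodes:
        denom = max(e.agent_path, e.shortest_path)
        # Failed episodes contribute zero; guard against a degenerate zero-length path.
        if e.success and denom > 0:
            total += e.shortest_path / denom
    return total / len(episodes)


# Made-up episodes, purely to exercise the functions.
episodes = [
    Episode(success=True, shortest_path=5.0, agent_path=5.0),
    Episode(success=True, shortest_path=4.0, agent_path=8.0),
    Episode(success=False, shortest_path=6.0, agent_path=10.0),
    Episode(success=True, shortest_path=3.0, agent_path=4.0),
]

sr = success_rate(episodes)
spl_value = spl(episodes)
print(f"SR = {sr:.4f}, SPL = {spl_value:.4f}")
```

SPL can never exceed SR, because each successful episode contributes at most 1 and each detour lowers its contribution.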

Method

Unified Semantic Value Map: CLUE constructs a unified semantic value map that balances contextual room and contextual object cues through entropy-based weighting determined by the target's characteristics, enabling more reliable navigation.
Contextual Rooms for Global Spatial Understanding: We use an LLM to obtain contextual room types for the target object, then compute a contextual room value for the semantic value map using a VLM and confidence aggregation.
Contextual Objects for Local Spatial Understanding: We use an LLM to obtain contextual objects for the target object, then compute a contextual object value for the semantic value map using Gaussian modeling.
Real-Time Capability: CLUE maintains real-time capability by relying on offline LLM queries to obtain the commonsense knowledge used for navigation.
Extensive Experiments: We conduct extensive experiments on the HM3D dataset and real-world deployments, demonstrating the effectiveness of our approach.
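The page does not give the formulas behind these components, so the following is only an illustrative sketch of one way a Gaussian contextual object value and a weighted fusion with a room value map could look. Everything here is an assumption: the function names (`gaussian_object_value`, `fuse`, `best_cell`), the isotropic Gaussian with a fixed `sigma`, taking the maximum over detections, the convex combination of the two maps, and the toy grid values. It should not be read as the authors' implementation.

```python
import math


def gaussian_object_value(height, width, detections, sigma=2.0):
    """Spread each detected contextual object over the grid as a Gaussian bump.

    detections: list of (row, col, confidence). Overlapping bumps are combined
    with max, which is an assumption made for this sketch.
    """
    grid = [[0.0] * width for _ in range(height)]
    for r0, c0, conf in detections:
        for r in range(height):
            for c in range(width):
                d2 = (r - r0) ** 2 + (c - c0) ** 2
                value = conf * math.exp(-d2 / (2 * sigma ** 2))
                grid[r][c] = max(grid[r][c], value)
    return grid


def fuse(room_value, object_value, object_weight):
    """Convex combination of the two maps; object_weight must lie in [0, 1]."""
    if not 0.0 <= object_weight <= 1.0:
        raise ValueError("object_weight must be in [0, 1]")
    return [
        [(1 - object_weight) * rv + object_weight * ov
         for rv, ov in zip(room_row, object_row)]
        for room_row, object_row in zip(room_value, object_value)
    ]


def best_cell(grid):
    """Return the (row, col) of the highest-valued cell."""
    cells = ((r, c) for r in range(len(grid)) for c in range(len(grid[0])))
    return max(cells, key=lambda rc: grid[rc[0]][rc[1]])


# Toy 5x5 scene: the left column looks like a relevant room,
# and one contextual object was detected at cell (2, 3).
H, W = 5, 5
room_value = [[0.8 if c == 0 else 0.2 for c in range(W)] for _ in range(H)]
object_value = gaussian_object_value(H, W, [(2, 3, 0.9)])

# A large object weight pulls the agent toward the detected object,
# while a small one lets the room cue dominate.
object_driven = best_cell(fuse(room_value, object_value, object_weight=0.8))
room_driven = best_cell(fuse(room_value, object_value, object_weight=0.1))
print(object_driven, room_driven)
```

In this sketch the weight is passed in by hand; in an entropy-based scheme it would be derived from how ambiguous the target's room association is.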

Experiments

Unified Semantic Value Map

(a) An example of a low-entropy object (toilet), where contextual rooms provide distinctive guidance while contextual objects do not. (b) An example of a high-entropy object (TV), where the unified map is more strongly influenced by local contextual objects due to the lack of distinctive spatial evidence from contextual rooms.
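To make the low-entropy versus high-entropy distinction concrete, here is a small sketch that computes a normalized Shannon entropy over room-type probabilities for the two targets mentioned in the caption. The room list and the probability values are invented for illustration (they are not the LLM outputs used in this work), and using the normalized entropy directly as the weight on contextual objects is an assumption about how such a weighting might be set up.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a distribution, scaled to [0, 1] by log(number of classes)."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    p = [x / total for x in probs]
    if len(p) < 2:
        return 0.0
    entropy = -sum(x * math.log(x) for x in p if x > 0)
    return entropy / math.log(len(p))


# Hypothetical room-association scores, for illustration only.
ROOMS = ["bathroom", "bedroom", "kitchen", "living room"]
room_probs = {
    "toilet": [0.94, 0.02, 0.02, 0.02],  # strongly tied to a single room type
    "tv": [0.05, 0.35, 0.15, 0.45],      # plausible in several room types
}

for target, probs in room_probs.items():
    h = normalized_entropy(probs)
    # Assumed weighting: more ambiguity means more reliance on contextual objects.
    object_weight = h
    room_weight = 1.0 - h
    print(f"{target}: entropy={h:.3f}, room_weight={room_weight:.3f}, "
          f"object_weight={object_weight:.3f}")
```

Under these made-up numbers the toilet receives a much larger room weight than the TV, which matches the qualitative behavior described in panels (a) and (b).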

Qualitative Results

Resource Analysis

Real-World Deployment

CLUE is deployed on a customized UGV platform based on a Clearpath Jackal, equipped with an Intel NUC, a Velodyne VLP-16 LiDAR, and a Ricoh Theta Z1 camera.

Video Demo

Target object: piano

Citation

@unpublished{kim2025clue,
  title = {CLUE: Adaptively Prioritized Contextual Cues by Leveraging a Unified Semantic Map for Effective Zero-Shot Object-Goal Navigation},
  author = {Kim, Taeyun and Choi, Alvin Jinsung and Hong, Dasol and Myung, Hyun},
  note = {Under review at the IEEE International Conference on Robotics and Automation (ICRA), 2026},
  year = {2025}
}

Relevant Projects

IROS'25 Workshop on AI Meets Autonomy: CMU VLA Challenge (4th place)

Vision-Language-Autonomy

Our Vision-Language-Autonomy framework takes in natural language queries or commands about a scene and generates the appropriate navigation-based response by reasoning about semantic and spatial relationships.

AAAI'26 (Under Review)

NeuDonatello

NeuDonatello is a novel neural scene reconstruction method that explicitly models SDF uncertainty and adaptively regularizes the reconstruction process.