Language Guided Localization and Navigation
Abstract
Embodied tasks that require active perception are key to improving language grounding models and building holistic social agents. In this talk, we explore two multi-modal embodied perception tasks that require an agent to localize itself or navigate in an unknown 3D space with limited information about the environment. First, we present the Where Are You? (WAY) dataset, which contains over 6k dialogs in which two humans perform a localization task. On top of this dataset, we propose the task of Localization from Embodied Dialog (LED). The LED task takes a natural language dialog between two agents, an observer and a locator, and predicts the location of the observer agent. The second task we examine is the Vision Language Navigation (VLN) task, in which an agent navigates by following natural language instructions. For both tasks, we address the objective of improving model accuracy and demonstrate that this can be done using passive data, which can introduce more semantically rich and diverse information during training than additional interaction data can. We additionally introduce a novel analysis pipeline for both tasks to diagnose and reveal the limitations and failure modes of these common multi-modal models.