Learning action representations for self-supervised visual exploration
Learning to efficiently navigate an environment using only an on-board camera is difficult for an agent when the final goal is far from the initial state and extrinsic rewards are sparse. To address this problem, we present a self-supervised prediction network that trains the agent with intrinsic rewards related to achieving the desired final goal. The network learns to predict its future camera view (the future state) from the current state-action pair through an Action Representation Module that decodes input actions as higher-dimensional representations. To increase the representational power of the network during exploration, we fuse the responses from the Action Representation Module in the transition network, which predicts the future state. Moreover, to enhance discrimination between predictions from different input actions, we introduce joint regression and triplet ranking loss functions. We show that, despite the sparse extrinsic rewards, learning action representations yields faster training convergence than state-of-the-art methods with only a small increase in the number of model parameters.
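
As an illustration of how a joint regression and triplet ranking objective might be combined, the following is a minimal sketch and not the formulation used in this work. It assumes that the anchor is the feature vector of the observed future state, the positive is the prediction for the action actually taken, and the negative is the prediction for a different action. The function names, the Euclidean distance, the margin, and the weighting term are all hypothetical choices made for the example.

```python
import math


def mean_squared_error(pred, target):
    """Regression term: mean squared error between two feature vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_ranking_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss that is zero once the positive is closer to the anchor
    than the negative by at least `margin` (margin value is an assumption)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def joint_loss(true_next, pred_taken, pred_other, margin=1.0, weight=1.0):
    """Regression on the taken-action prediction plus a weighted ranking term.

    true_next  : features of the observed future state (assumed anchor)
    pred_taken : prediction for the action actually executed (assumed positive)
    pred_other : prediction for a different action (assumed negative)
    """
    regression = mean_squared_error(pred_taken, true_next)
    ranking = triplet_ranking_loss(true_next, pred_taken, pred_other, margin)
    return regression + weight * ranking


# Toy two-dimensional features, chosen only for illustration.
true_next = [1.0, 0.0]
pred_taken = [1.0, 0.0]   # perfect prediction for the executed action
pred_far = [4.0, 4.0]     # other-action prediction, well separated
pred_near = [1.0, 0.5]    # other-action prediction, hard to tell apart

loss_separated = joint_loss(true_next, pred_taken, pred_far)
loss_confusable = joint_loss(true_next, pred_taken, pred_near)
print(loss_separated, loss_confusable)
```

In this toy setting the regression term is zero in both cases, so any remaining loss comes from the ranking term: it vanishes when the other-action prediction lies beyond the margin and stays positive when the two predictions are nearly indistinguishable.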