dc.description.abstract | This thesis aims to reproduce the video of an indoor scene as seen from a different, targeted view, using modalities such as depth and skeleton as guidance. Synthesizing a video that contains a moving person is challenging because the camera placement in the scene causes scale differences and self-occlusion. The other key challenge is maintaining temporal consistency across the synthesized frames. Current state-of-the-art methods synthesize each frame separately, which can lose the motion information contained in the input view. Temporal consistency must therefore be modeled to obtain smooth transitions between the synthesized frames. We consider a neural network-based approach that uses the body skeleton as a driving cue, visible texture transfer to handle self-occlusion, and a recurrent neural network to maintain temporal consistency in the feature space. We propose a 2D-based synthesis network that explicitly disentangles the encoding of the input image from that of the target pose, which allows the network to learn better features and, in turn, synthesize better images. We also propose a training strategy based on a pixel-wise loss function that improves high-frequency details and thereby enhances the visual quality of the synthesized images. Moreover, we propose a novel masking scheme to account for the scale difference, spatial shift, and deformation between the input and output skeletons. We further propose a new formulation of the 2D-based synthesis network that addresses the temporal consistency constraint on the synthesized multi-view frames. In particular, we extend recurrent neural networks to learn a spatiotemporal feature space that preserves the texture and approximates the targeted view. In addition, we propose a hybrid approach that combines a direct texture transfer of the visible pixels from the input to the targeted view with a 3D-based synthesis network for refinement.
Experimental results on standard image and multi-view video benchmarks show improvements over existing alternatives in terms of visual quality and the smoothness of the synthesized frames. | en_US |