Real-world Human Re-identification: Attributes and Beyond.
Surveillance systems capable of supporting a diverse range of human intelligence and analytical tasks are becoming widespread and crucial, driven by increasing threats to national infrastructure and evolving business and governmental analytical requirements. Surveillance data can be critical for crime prevention, forensic analysis, and counter-terrorism in civilian and governmental agencies alike. However, visual surveillance data must currently be parsed by trained human operators, so its utility is offset by the associated training and staffing costs. The automated analysis of surveillance video is therefore of great scientific interest. One open problem in this area is reliably matching people between disjoint surveillance camera views, termed re-identification. Automated re-identification improves operator efficiency by grouping disparate and fragmented observations of people across space and time into individual identities, a prerequisite for higher-level surveillance tasks. However, owing to the complexity of real-world scenes and the high variability of human appearance, reliable re-identification is non-trivial. Most re-identification approaches developed so far rely on low-level visual features to match human detections against a known gallery of candidates. However, for many applications an initial detection of a person may be unavailable, or a low-level feature representation may not be sufficiently invariant to the photometric and geometric variability between camera views. This thesis begins by proposing a “mid-level” human-semantic representation that exploits expert knowledge of how surveillance tasks are executed in order to compute an attribute-based description of a person.
It further shows that this attribute-based description is synergistic with low-level data-derived features, enhancing re-identification accuracy, and that additional performance gains follow from employing a discriminatively learned distance metric. Finally, a novel “zero-shot” scenario is proposed in which a visual probe is unavailable, yet re-identification remains possible via a manually provided semantic attribute description. The approach is extensively evaluated on several public benchmark datasets. One challenge in constructing an attribute-based, human-semantic representation is the requirement for extensive annotation. Mitigating this annotation cost, in order to produce a realistic and scalable re-identification system, motivates the second technical area of this thesis, in which transfer learning and data mining are investigated in two different approaches. Discriminative methods trade annotation cost for enhanced performance. Because discriminative person re-identification models operate between pairs of camera views, annotation cost scales quadratically with the number of cameras in the network; for practical re-identification, this is an unreasonable expectation and prohibitively expensive. By leveraging flexible multi-source transfer of re-identification models, part of this cost can be alleviated: prior re-identification models learned for a set of source view pairs (domains) can be flexibly combined to obtain good re-identification performance for a given target view pair with greatly reduced annotation requirements. The exhaustive annotation effort required for attribute-driven re-identification scales linearly with the number of cameras and attributes, so real-world operation of an attribute-enabled, distributed camera network would likewise demand prohibitive quantities of annotation by human experts.
This effort is avoided entirely by taking a data-driven approach to attribute computation, learning an effective attribute representation by crawling large volumes of Internet data. Trained on a larger and more diverse set of examples, this representation is more view-invariant and generalisable than attributes trained at conventional scale. These automatically discovered attributes are shown to provide a valuable representation that significantly improves re-identification performance, and a method is contributed to map them onto existing expert-annotated ontologies. In the final contribution of this thesis, the underlying assumptions about visual surveillance equipment and re-identification are challenged, motivating a novel research area based on dynamic, mobile platforms. Such platforms violate the assumption shared by most previous research that surveillance devices are stationary relative to the observed scene. The most important new challenge in this area is that unconstrained video defeats traditional discriminative methods, which rely on explicitly modelling appearance translations between view pairs, or even within a single view. A new dataset was collected by a remotely operated vehicle, using control software developed to simulate a fully autonomous re-identification unmanned aerial vehicle programmed to fly in proximity to people until images of sufficient quality for re-identification are obtained. Variations of the standard re-identification model are investigated in this enhanced re-identification paradigm, and the new challenges posed by this distinct form of re-identification are elucidated. Finally, conventional wisdom regarding re-identification is re-examined in light of these observations.
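To make the zero-shot scenario concrete, it can be viewed abstractly as nearest-neighbour search: an operator supplies an attribute vector describing the target (no visual probe), and gallery images are ranked by the distance between that vector and each gallery member's predicted attributes, optionally under a learned metric. The following is a minimal illustrative sketch of that idea, not the thesis's actual pipeline; all function names, the identity metric, and the toy attribute values are hypothetical.

```python
import numpy as np

def metric_dist(x, y, M):
    """Distance under a (possibly learned) metric matrix M.
    With M = identity this reduces to Euclidean distance."""
    d = x - y
    return float(np.sqrt(d @ M @ d))

def zero_shot_rank(probe_attrs, gallery_attrs, M=None):
    """Rank gallery identities by attribute distance to a manually
    specified probe attribute vector (no visual probe required)."""
    if M is None:
        M = np.eye(len(probe_attrs))  # placeholder for a learned metric
    dists = [metric_dist(probe_attrs, g, M) for g in gallery_attrs]
    return np.argsort(dists)  # best match first

# Toy example with four binary attributes,
# e.g. (male, carrying-backpack, wearing-jeans, wearing-shorts).
probe = np.array([1.0, 1.0, 0.0, 1.0])     # operator-supplied description
gallery = np.array([[0.9, 0.1, 0.8, 0.0],  # attribute classifier outputs
                    [0.8, 0.9, 0.1, 0.9],
                    [0.2, 0.3, 0.9, 0.1]])
ranking = zero_shot_rank(probe, gallery)
print(ranking[0])  # gallery identity 1 is closest to the description
```

Substituting a discriminatively learned matrix for `M` corresponds to the metric-learning enhancement described above, and replacing the manual probe vector with attributes predicted from an image recovers the standard probe-to-gallery setting.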
Authors: Layne, Ryan David Conway