Person re-identification is a critical component of target identification and tracking in perception systems, particularly when long-term tracking is required and kinematic tracks may not be reliable. Identifying an individual regardless of outward visual appearance is a particularly challenging problem for many existing re-identification models. One growing area of research for appearance-agnostic identification is the use of time-sequence images to identify an individual's gait. Several gait identification methods exist today, but they require image preprocessing such as human pose estimation, which in turn requires either human keypoint labels or a pre-trained model that may not be optimized for the data variance of a new scene (for example, an aerial perspective). We propose an architecture that performs gait classification for person re-identification without the need for additional labels during training, using the concept of pose transfer. Our framework learns human pose estimation landmarks simultaneously with a gait encoder whose output may be used as a time-sequence fingerprint of a person in long-term tracking systems.
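The idea of a gait embedding serving as a time-sequence fingerprint can be illustrated with a minimal numpy sketch. Everything here is hypothetical: the abstract's actual encoder is learned, whereas this toy version just summarizes a (time, keypoints, 2) pose sequence with per-joint temporal statistics and compares identities by cosine similarity.

```python
import numpy as np

def gait_fingerprint(keypoints: np.ndarray) -> np.ndarray:
    """Collapse a (T, K, 2) sequence of 2-D pose keypoints into a
    fixed-length embedding: per-joint temporal mean and std,
    L2-normalized so cosine similarity compares identities."""
    mean = keypoints.mean(axis=0).ravel()   # (K*2,) average joint positions
    std = keypoints.std(axis=0).ravel()     # (K*2,) per-joint motion spread
    emb = np.concatenate([mean, std])
    return emb / (np.linalg.norm(emb) + 1e-8)

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two unit-normalized fingerprints."""
    return float(np.dot(a, b))

# Two walks of the "same" person (small perturbation) vs. a different person.
rng = np.random.default_rng(0)
base = rng.normal(size=(30, 17, 2))
anchor = gait_fingerprint(base)
same = gait_fingerprint(base + 0.01 * rng.normal(size=base.shape))
other = gait_fingerprint(rng.normal(size=(30, 17, 2)))
```

A re-identification system would then match a query fingerprint against a gallery of stored fingerprints by nearest cosine similarity.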
Performing many simultaneous tasks on a resource-limited device is challenging because of the limited computational resources available. Efficient, universal model architectures are key to solving this problem. Existing subfields of machine learning, such as Multi-Task Learning (MTL), have shown that learning multiple tasks with a single neural network architecture is possible and can even improve sample efficiency and memory efficiency while being less prone to overfitting. In Visual Question Answering (VQA), a model ingests multi-modal input to produce text-based responses in the context of an image. Our proposed architecture merges the MTL and VQA concepts to form TaskNet. TaskNet solves the visual MTL problem by using an input task to provide context to the network and guide its attention mechanism toward a relevant response. Our approach saves memory without sacrificing performance relative to naively training independent models. TaskNet efficiently provides multiple fine-grained classifications on a single input image and seamlessly incorporates context-specific metadata to further boost performance under high variance.
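The core mechanism described, using a task input to steer attention over image features, can be sketched in a few lines of numpy. This is an illustrative stand-in, not TaskNet itself: the function names, dimensions, and scaled dot-product attention form are assumptions.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

def task_conditioned_pool(features: np.ndarray, task_emb: np.ndarray) -> np.ndarray:
    """features: (N, D) spatial image features; task_emb: (D,) embedding
    of the requested task. Attention weights are the softmax of each
    location's similarity to the task, so the pooled context vector
    emphasizes task-relevant regions of the image."""
    scores = features @ task_emb / np.sqrt(features.shape[1])
    weights = softmax(scores)            # (N,) attention over locations
    return weights @ features            # (D,) task-specific context

# One shared feature map, two different task embeddings -> two contexts.
rng = np.random.default_rng(1)
feats = rng.normal(size=(16, 8))
ctx_a = task_conditioned_pool(feats, rng.normal(size=8))
ctx_b = task_conditioned_pool(feats, rng.normal(size=8))
```

In a full model, each task-specific context vector would feed its own lightweight classification head, so the expensive feature backbone is computed once and shared, which is where the memory savings come from.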
Multi-object tracking (MOT) is a crucial component of situational awareness in military defense applications. With the growing use of unmanned aerial systems (UASs), MOT methods for aerial surveillance are in high demand. Applying MOT to UAS imagery presents specific challenges such as a moving sensor, changing zoom levels, dynamic backgrounds, illumination changes, obscurations, and small objects. In this work, we present a robust object tracking architecture designed to accommodate the noise of real-time situations. Our work is based on the tracking-by-detection paradigm, in which an independent object detector first isolates all potential detections and an object tracking model then links unique objects between frames. Object trajectories are constructed using a multiple hypothesis tracking (MHT) framework that produces the best hypothesis based on kinematic and visual scores. We propose a kinematic prediction model, called Deep Extended Kalman Filter (DeepEKF), in which a sequence-to-sequence architecture predicts entity trajectories in latent space. DeepEKF utilizes a learned image embedding along with an attention mechanism trained to weight the importance of areas in an image to predict future states. For visual scoring, we experiment with different similarity measures that calculate distance based on entity appearance, including a convolutional neural network (CNN) encoder pre-trained using Siamese networks. In initial evaluation experiments, we show that our method, which combines the kinematic and visual scores within an MHT framework, improves performance, especially in edge cases where entity motion is unpredictable or the data contains frames with significant gaps.
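The scoring structure described, a kinematic prediction scored against each detection and blended with a visual similarity, can be sketched as follows. This is a simplified stand-in under stated assumptions: a classical constant-velocity Kalman prediction replaces the learned DeepEKF, the Gaussian kinematic score and the weight `w` are illustrative, and the visual score is passed in as a plain number rather than computed by a CNN encoder.

```python
import numpy as np

def kf_predict(x: np.ndarray, P: np.ndarray, dt: float = 1.0, q: float = 1e-2):
    """Constant-velocity Kalman prediction. State x = [px, py, vx, vy]."""
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    return F @ x, F @ P @ F.T + q * np.eye(4)

def kinematic_score(x_pred: np.ndarray, P: np.ndarray, z: np.ndarray) -> float:
    """Score a detection z = [px, py] against the predicted position
    using its squared Mahalanobis distance under the covariance."""
    H = np.eye(2, 4)                     # observe position only
    S = H @ P @ H.T
    r = z - H @ x_pred
    d2 = r @ np.linalg.solve(S, r)
    return float(np.exp(-0.5 * d2))

def hypothesis_score(kin: float, vis: float, w: float = 0.5) -> float:
    """Blend kinematic and visual (appearance) scores; the MHT framework
    keeps the hypothesis-tree branch with the highest combined score."""
    return w * kin + (1 - w) * vis

# A track at the origin moving with velocity (1.0, 0.5), scored against
# a detection at the predicted position and a distant distractor.
x = np.array([0.0, 0.0, 1.0, 0.5])
P = 0.1 * np.eye(4)
x_pred, P_pred = kf_predict(x, P)
near = hypothesis_score(kinematic_score(x_pred, P_pred, np.array([1.0, 0.5])), vis=0.9)
far = hypothesis_score(kinematic_score(x_pred, P_pred, np.array([5.0, -3.0])), vis=0.9)
```

In the full system, the learned sequence-to-sequence predictor would supply `x_pred`, and the visual term would come from the Siamese-trained CNN encoder; the blend makes the tracker robust when one cue fails, such as during erratic motion or long frame gaps.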