Leveraging the power of deep neural networks, single-person pose estimation has made substantial progress in recent years. More recently, multi-person pose estimation has also grown in importance, mainly driven by the high demand for reliable video surveillance systems in public security. To keep up with these demands, efforts have been made to improve the performance of such systems, which is still limited by the insufficient amount of available training data. This work addresses this lack of labeled data: by diminishing the often-faced problem of domain shift between synthetic images from computer game graphics engines and real-world data, annotated training data shall be provided at zero labeling cost. To this end, generative adversarial networks are applied as a domain adaptation framework, adapting the data of a novel synthetic pose estimation dataset to several real-world target domains. State-of-the-art domain adaptation methods are extended to meet the important requirement of exact content preservation between synthetic and adapted images. Subsequent experiments indicate the improved suitability of the adapted data, as human pose estimators trained on it outperform those trained on purely synthetic images.
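The loss structure described above, an adversarial objective extended by a content-preservation term, can be sketched as follows. This is a minimal numerical illustration under assumptions, not the paper's actual formulation: the non-saturating adversarial term and the `content_weight` value are assumed, not taken from the work.

```python
import numpy as np

def generator_loss(disc_score_adapted, synthetic, adapted, content_weight=10.0):
    """Sketch of a content-preserving generator objective: a non-saturating
    adversarial term on the discriminator's score for the adapted image,
    plus a weighted L1 pixel loss penalizing content changes between the
    synthetic input and its adapted version. `content_weight` is an
    assumed hyperparameter, not a value from the paper."""
    adversarial = -np.log(disc_score_adapted + 1e-12)
    content = np.mean(np.abs(synthetic - adapted))
    return adversarial + content_weight * content
```

With identical synthetic and adapted images, the content term vanishes and only the adversarial term remains; any content change increases the loss.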
Person re-identification is the task of matching visual appearances of the same person in image or video data while distinguishing appearances of different persons. With falling hardware costs, cameras mounted on unmanned aerial vehicles (UAVs) have become increasingly useful for security and surveillance tasks in recent years. Re-identification approaches have to adapt to the new challenges posed by this type of data, such as unusual and changing viewpoints or camera motion. Furthermore, the characteristics of the data will change between the scenarios the UAV is used in. This requires robust models that can handle a wide range of characteristics.
In this work, we train convolutional neural networks for person re-identification. However, datasets of sufficient size for training all consist of data from fixed camera networks. We show that the resulting models, while performing strongly on camera network data, struggle to handle the different characteristics of aerial imagery, likely because of overfitting to biases inherent in the training data. To address this issue, we combine the deep features with hand-crafted covariance features, which introduce a higher degree of invariance into our combined representation. The fusion of both types of features is achieved by including the covariance information in the training process of the deep model.
We evaluate the combined representation on a dataset consisting of twelve people moving through a scene recorded by four fixed cameras and one mobile aerial camera. We discuss strengths and weaknesses of the features and show that our combined approach outperforms baselines as well as previous work.
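The hand-crafted covariance features mentioned above follow the region covariance idea: per-pixel feature vectors are summarized by their covariance matrix over an image region. The feature set used here (pixel position, intensity, and gradient magnitudes) is an assumed, typical choice, not necessarily the one in the paper:

```python
import numpy as np

def region_covariance(patch):
    """Compute a region covariance descriptor for a grayscale patch.
    Per-pixel features: x, y, intensity, |dI/dx|, |dI/dy| (an assumed,
    commonly used set). Returns a 5x5 symmetric covariance matrix."""
    H, W = patch.shape
    ys, xs = np.mgrid[0:H, 0:W]
    Iy, Ix = np.gradient(patch.astype(float))
    F = np.stack([xs.ravel().astype(float), ys.ravel().astype(float),
                  patch.ravel().astype(float),
                  np.abs(Ix).ravel(), np.abs(Iy).ravel()])
    return np.cov(F)  # rows of F are variables, columns are pixels
```

The descriptor is compact (size depends only on the number of features, not the region size) and invariant to the ordering of pixels within the region, which contributes the invariance referred to above.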
Capturing vital signs, specifically heart rate and oxygen saturation, is essential in care situations. Clinical pulse oximetry solutions are contact-based, using clips or otherwise fixed sensor units that can have an undesired impact on the patient. A typical example is pre-term infants in neonatal care, who require permanent monitoring and have very fragile skin. This requires the staff to regularly change the sensor unit location to avoid skin damage. To improve patient comfort and to reduce care effort, a feasibility study with a camera-based passive optical method for contactless pulse oximetry from a distance is performed. In contrast to most existing research on contactless pulse oximetry, a task-optimized multi-spectral sensor unit instead of a standard RGB camera is proposed. This first allows avoiding the widely used green spectral range for distant heart rate measurement, which is unsuitable for pulse oximetry due to nearly equal spectral extinction coefficients of saturated oxy-hemoglobin and non-saturated hemoglobin. Second, it better addresses the challenge of a signal-to-noise ratio worse than in contact-based or active measurements, e.g., caused by background illumination. Signal noise from background illumination is addressed in several ways. The key part is an automated reference measurement of background illumination, enabled by automated patient localization in the acquired images through extraction of skin and background regions with a CNN-based detector. Due to the custom spectral ranges, the detector is trained and optimized for this specific setup. Altogether, by allowing a contactless measurement, the studied concept promises to improve the care of patients for whom skin contact has negative effects.
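For context, pulse oximetry commonly estimates oxygen saturation from the so-called ratio of ratios of the pulsatile (AC) and static (DC) signal components at two wavelengths; the same principle underlies the camera-based variant. The linear calibration constants below are illustrative textbook values, not those of this study:

```python
def spo2_ratio_of_ratios(ac_red, dc_red, ac_ir, dc_ir, a=110.0, b=25.0):
    """Classic empirical SpO2 estimate from the ratio of ratios R:
    R = (AC_red / DC_red) / (AC_ir / DC_ir), SpO2 ~= a - b * R.
    The constants a and b are illustrative textbook values; real devices
    use sensor-specific calibration curves."""
    R = (ac_red / dc_red) / (ac_ir / dc_ir)
    return a - b * R
```

The relevance of distinguishable extinction coefficients is visible here: if both wavelengths responded identically to oxygenated and deoxygenated hemoglobin, R would not vary with saturation, which is why the green spectral range is avoided.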
Person re-identification is the task of correctly matching visual appearances of the same person in image or video data while distinguishing appearances of different persons. The traditional setup for re-identification is a network of fixed cameras. However, in recent years mobile aerial cameras mounted on unmanned aerial vehicles (UAVs) have become increasingly useful for security and surveillance tasks. Aerial data has many characteristics different from typical camera network data. Thus, re-identification approaches designed for a camera network scenario can be expected to suffer a drop in accuracy when applied to aerial data.
In this work, we investigate the suitability of features, which were shown to give robust results for re-identification in camera networks, for the task of re-identifying persons between a camera network and a mobile aerial camera. Specifically, we apply hand-crafted region covariance features and features extracted by convolutional neural networks which were learned on separate data. We evaluate their suitability for this new and as yet unexplored scenario. We investigate common fusion methods to combine the hand-crafted and learned features and propose our own deep fusion approach which is already applied during training of the deep network.
We evaluate features and fusion methods on our own dataset. The dataset consists of fourteen people moving through a scene recorded by four fixed ground-based cameras and one mobile camera mounted on a small UAV. We discuss strengths and weaknesses of the features in the new scenario and show that our fusion approach successfully leverages the strengths of each feature and outperforms all single features significantly.
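A simple baseline among the common fusion methods mentioned above is early fusion by normalized concatenation: L2-normalize each feature vector so neither modality dominates by scale, then concatenate. The paper's own deep fusion instead injects the covariance information during network training; this sketch only illustrates the baseline:

```python
import numpy as np

def fuse_features(deep_feat, cov_feat):
    """Baseline early fusion: L2-normalize each modality, then concatenate.
    A stand-in for comparison purposes, not the paper's deep fusion."""
    d = deep_feat / (np.linalg.norm(deep_feat) + 1e-12)
    c = cov_feat / (np.linalg.norm(cov_feat) + 1e-12)
    return np.concatenate([d, c])
```

Without the per-modality normalization, the feature with the larger magnitude would dominate any subsequent distance computation, masking the complementary strengths that fusion is meant to exploit.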
Monitoring of the heart rhythm is the cornerstone of the diagnosis of cardiac arrhythmias. It is done by means of electrocardiography, which relies on electrodes attached to the skin of the patient. We present a new system approach based on the so-called vibrocardiogram that allows an automatic non-contact registration of the heart rhythm. Because of the contactless principle, the technique offers potential application advantages in medical fields like emergency medicine (burn patients) or premature baby care, where adhesive electrodes are not easily applicable. A laser-based, mobile, contactless vibrometer for on-site diagnostics that works on the principle of laser Doppler vibrometry allows the acquisition of vital functions in the form of a vibrocardiogram. Preliminary clinical studies at the Klinikum Karlsruhe have shown that the region around the carotid artery and the chest region are appropriate for this purpose. However, the challenge is to find a suitable measurement point in these parts of the body, which differs from person to person due to, e.g., physiological properties of the skin. Therefore, we propose a new Microsoft Kinect-based approach. Once a suitable measurement area on the appropriate parts of the body is detected by processing the Kinect data, the vibrometer is automatically aligned to an initial location within this area. Then, vibrocardiograms at different locations within this area are successively acquired until a sufficient measuring quality is achieved. This optimal location is found by exploiting the autocorrelation function.
In order to control riots in crowds, it is helpful to get ringleaders under control and pull them out of the crowd once one has become an offender. A great support in achieving these tasks is the capability to observe the crowd and ringleaders automatically using cameras. It also allows a better conservation of evidence in riot control. A ringleader who has become an offender should be tracked across and recognized by several cameras, regardless of whether the cameras' fields of view overlap or not. We propose a context-based approach for the handover of persons between different camera fields of view. This approach can be applied to overlapping as well as non-overlapping fields of view, so that a fast and accurate identification of individual persons in camera networks is feasible. Within the scope of this paper, the approach is applied to a handover of persons between single images without any temporal information. It is particularly developed for semiautomatic video editing and a handover of persons between cameras in order to improve conservation of evidence. The approach has been developed on a dataset collected during a Crowd and Riot Control (CRC) training of the German armed forces. The training comprised three different levels of escalation: first, the crowd started with a peaceful demonstration; second, there were violent protests; and third, the riot escalated and offenders bumped into the chain of guards. One result of the work is a reliable context-based method for person re-identification between single images of different camera fields of view in crowd and riot scenarios. Furthermore, a qualitative assessment shows that the use of contextual information can additionally support this task. It can decrease the time needed for handover and the number of confusions, which supports the conservation of evidence in crowd and riot scenarios.
Military Operations in Urban Terrain (MOUT) require the capability to perceive and to analyze the situation around a patrol in order to recognize potential threats. A permanent monitoring of the surrounding area is essential in order to react appropriately to the given situation, where one relevant task is the detection of objects that can pose a threat. Especially the robust detection of persons is important, as in MOUT scenarios threats usually arise from persons. This task can be supported by image processing systems. However, depending on the scenario, person detection in MOUT can be challenging: persons are often occluded in complex outdoor scenes, and person detection also suffers from low image resolution. Furthermore, there are several requirements on person detection systems for MOUT, such as the detection of non-moving persons, as they can be part of an ambush. Existing detectors therefore have to operate on single images with low detection thresholds in order not to miss any person. This, in turn, leads to a comparatively high number of false positive detections, which renders an automatic vision-based threat detection system ineffective. In this paper, a hybrid detection approach is presented that combines a discriminative and a generative model. The objective is to increase the accuracy of existing detectors by integrating a separate hypothesis confirmation and rejection step built from both models. This enables the overall detection system to make use of both the discriminative power of the one model and the capability of the other to detect partly occluded objects. The approach is evaluated on benchmark data sets generated from real-world image sequences captured during MOUT exercises. The extension shows a significant improvement of the false positive detection rate.
Tracking of persons and analysis of their trajectories are important tasks of surveillance systems, as they support the monitoring personnel. At the same time, there is an increasing demand for smarter camera networks that carry out surveillance tasks autonomously. The resulting higher system complexity raises the requirements on the video analysis algorithms as well. In this paper, we present a system concept and application for anonymously gathering, processing, and analyzing trajectories in distributed smart camera networks. It allows a multitude of analysis techniques, such as inspecting individual properties of the observed movement in real time. Additionally, the anonymous movement data allows long-term storage and big data analyses for statistical purposes. The system described in this paper has been implemented as a prototype and deployed for proof of concept under real conditions in the entrance hall of the Leibniz University Hannover. It shows an overall stable performance, particularly with respect to significant illumination changes over hours, as well as regarding the reduction of false positives by post-processing and trajectory merging performed on top of a panorama-based person detection module.
Counting people is a common topic in the area of visual surveillance and crowd analysis. While many image-based solutions are designed to count only a few persons at the same time, like pedestrians entering a shop or watching an advertisement, there is hardly any solution for counting large crowds of several hundred persons or more. We addressed this problem previously by designing a semi-automatic system able to count crowds consisting of hundreds or thousands of people based on aerial images of demonstrations or similar events. This system requires major user interaction to segment the image. Our principal aim is to reduce this manual interaction. To achieve this, we propose a new, automatic system. Besides counting the people in large crowds, the system yields the positions of people, allowing a plausibility check by a human operator. In order to automate the people counting system, we use crowd density estimation. The determination of crowd density is based on several features, like edge intensity or spatial frequency. They indicate the density and discriminate between a crowd and other image regions like buildings, bushes, or trees. We compare the performance of our automatic system to the previous semi-automatic system and to manual counting in images. By counting a test set of aerial images showing large crowds containing up to 12,000 people, we measure the performance gain of our new system. By improving our previous system, we increase the benefit of an image-based solution for counting people in large crowds.
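Two of the density features named above, edge intensity and spatial frequency, can be sketched per image block as follows. The exact feature set and normalization of the system may differ; this is an illustrative computation:

```python
import numpy as np

def density_features(block):
    """Per-block crowd density cues: mean gradient magnitude as edge
    intensity, and mean non-DC Fourier magnitude as a rough spatial-
    frequency measure. Textured crowd regions score high on both;
    flat regions (road, sky) score near zero."""
    b = np.asarray(block, dtype=float)
    gy, gx = np.gradient(b)
    edge_intensity = float(np.hypot(gx, gy).mean())
    spectrum = np.abs(np.fft.fft2(b))
    spectrum[0, 0] = 0.0  # drop the DC component (mean brightness)
    spatial_freq = float(spectrum.mean())
    return edge_intensity, spatial_freq
```

Thresholding or classifying such per-block feature pairs is one way to separate crowd regions from buildings, bushes, or trees before estimating density.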
KEYWORDS: Image segmentation, Video surveillance, Video, Systems modeling, Data modeling, Time metrology, Image processing, Cameras, Image analysis, Imaging systems
Counting people in crowds is a common problem in visual surveillance. Many solutions are designed to count fewer than one hundred people. Only a few systems have been tested on large crowds of several hundred people, and no known counting system has been tested on crowds of several thousand people. Furthermore, none of these large-scale systems delivers people's positions; they just estimate the number. But having the positions of people would be a great benefit, since this would enable a human observer to carry out a plausibility check. In addition, most approaches require video data as input or a scene model. In order to solve the problem in general, these assumptions must not be made. We propose a system that can count people in single aerial images, including mosaic images generated from video data. No assumptions about crowd density are made, i.e., the system has to work from low to very high density. The main challenge is the large variety of possible input data. Typical scenarios would be public events such as demonstrations or open-air concerts. Our system uses a model-based detection of individual humans. This includes the determination of their positions and the total number. In order to cope with the given challenges, we divide our system into three steps: foreground segmentation, person size determination, and person detection. We evaluate our proposed system on a variety of aerial images showing large crowds with up to several thousand people.
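A much-simplified stand-in for the three-step pipeline shows how the first two steps already yield a count estimate. The actual system performs model-based detection of individuals and also returns their positions; this sketch only produces the total:

```python
import numpy as np

def count_by_area(foreground_mask, person_size):
    """Naive count estimate from the first two pipeline steps: a binary
    foreground mask (step 1, crowd vs. background) divided by the
    estimated per-person pixel area (step 2). Step 3, model-based
    detection, would replace this with individual positions."""
    return int(round(foreground_mask.sum() / float(person_size)))
```

The gap between this area-based estimate and per-person detection is exactly the positional information that enables the human plausibility check discussed above.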
Military Operations in Urban Terrain (MOUT) require the capability to perceive and to analyse the situation around a patrol in order to recognize potential threats. As in MOUT scenarios threats usually arise from humans, one important task is the robust detection of humans.
Detection of humans in MOUT by image processing systems can be very challenging, e.g., due to complex outdoor scenes where humans have a weak contrast against the background or are partially occluded. Porikli et al. introduced covariance descriptors and showed their usefulness for human detection in complex scenes. However, these descriptors do not lie in a vector space, so well-known machine learning techniques need to be adapted to train covariance descriptor classifiers. We present a novel approach based on manifold learning that simplifies the classification of covariance descriptors.
In this paper, we apply this approach to detecting humans. We describe our human detection method and evaluate the detector on benchmark data sets generated from real-world image sequences captured during MOUT exercises.
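One common way such manifold-valued descriptors are made amenable to standard vector-space classifiers is the log-Euclidean mapping: the matrix logarithm flattens the manifold of symmetric positive-definite (SPD) matrices into a Euclidean space. This is shown as an illustration of the general idea, not necessarily the paper's exact method:

```python
import numpy as np

def log_euclidean_vector(cov):
    """Map an SPD covariance descriptor to a Euclidean vector via the
    matrix logarithm, so standard classifiers can be trained on it.
    The off-diagonal sqrt(2) scaling preserves the Frobenius norm."""
    w, V = np.linalg.eigh(cov)            # eigendecomposition of the SPD matrix
    log_cov = (V * np.log(w)) @ V.T       # matrix logarithm
    i, j = np.triu_indices_from(log_cov)  # keep the upper triangle only
    scale = np.where(i == j, 1.0, np.sqrt(2.0))
    return scale * log_cov[i, j]
```

A d x d descriptor becomes a d(d+1)/2-dimensional vector; the identity matrix maps to the zero vector, which serves as the origin of the flattened space.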
In order to control riots in crowds, it is helpful to get the ringleader under control. A great support in achieving this task is the capability to automatically track individual persons in a video sequence taken from a crowd. In this paper we address the robustness of such a tracking function.
We start from the results of a previous evaluation of tracking methods, where a so-called Covariance Tracker was found to be most appropriate. This tracker uses covariance matrices as object descriptors, as proposed by Porikli et al. The set of all covariance matrices describes a Riemannian manifold that is used to compare and update the covariance descriptors during tracking.
We propose Covariance Tracker adaptations to improve its performance. Furthermore, we summarize the performance evaluation results of the original method and compare these with the results of the adapted one. The result is a robust method for tracking people in crowds which can improve situational awareness.
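Comparing covariance descriptors on the Riemannian manifold is typically done with the affine-invariant (Förstner) metric based on generalized eigenvalues, which is the distance Porikli et al.'s covariance tracker uses to match the model descriptor against candidate regions. An illustrative sketch:

```python
import numpy as np

def covariance_distance(c1, c2):
    """Affine-invariant (Forstner) metric between two SPD covariance
    descriptors: the root sum of squared logarithms of the generalized
    eigenvalues of (c1, c2). Zero iff the descriptors are equal."""
    lam = np.linalg.eigvals(np.linalg.solve(c2, c1)).real  # generalized eigenvalues
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))
```

During tracking, the candidate window minimizing this distance to the (manifold-updated) model descriptor is selected as the new object location.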
KEYWORDS: Error control coding, Detection and tracking algorithms, Sensors, Data processing, Data acquisition, Control systems, Situational awareness sensors, Information security, Cameras, Unmanned aerial vehicles
If, for a given application, candidate tracking methods for humans need to be selected and optimized, then relevant sensor and truth data as well as appropriate assessment criteria are required. In the work reported in this contribution, we used data recently collected in a riot control scenario. We then processed the sensor data using a set of tracking methods from the literature. Tracking results and truth data allowed us to deduce metrics that reflect the usefulness of a tracking method for the selected scenario. The software implementation of the assessment criteria, together with sensor and truth data, forms a benchmark for tracking algorithms in a riot control scenario. It can be used by developers to optimize their tracking systems and to demonstrate their usefulness for application in a riot control scenario. The performance and robustness of optimized tracking methods can considerably improve situational awareness in a riot control scenario.
Quick and precise response is essential for riot squads when coping with escalating violence in crowds. Often it is just a single person, known as the leader of the gang, who instigates other people and thus is responsible for excesses. Putting this single person out of action in most cases leads to de-escalation. Fostering de-escalation is one of the main tasks of crowd and riot control. To do so, extensive situation awareness is mandatory for the squads and can be promoted by technical means such as video surveillance using sensor networks.
To develop software tools for situation awareness, appropriate input data of well-known quality is needed. Furthermore, the developer must be able to measure algorithm performance and ongoing improvements. Last but not least, after algorithm development has finished and marketing aspects emerge, compliance with specifications must be proven.
This paper describes a multisensor benchmark which serves exactly this purpose. We first define the underlying algorithm task. Then we explain details about data acquisition and sensor setup, and finally we give some insight into quality measures of multisensor data. Currently, the multisensor benchmark described in this paper is applied to the development of basic algorithms for situational awareness, e.g., tracking of individuals in a crowd.
Military Operations in Urban Terrain (MOUT) require the capability to perceive and to analyse the situation around a patrol in order to recognize potential threats. Human operators can only observe a limited field of regard. Sensors can enhance the field of regard up to 360°, but then the amount of data cannot be fully exploited by a human operator any more. For this reason, an intelligent assistance system is required that monitors the circumference of a moving platform and warns the driver of a threatening situation. A first processing step of such a system is the recognition of humans. There are numerous approaches to the detection of humans, mainly from stationary cameras. Moving cameras play a role in the field of pedestrian protection from a moving road vehicle. There are two principal differences to this latter application domain. Firstly, the threat in a MOUT scenario potentially arises from humans in the scene. Secondly, not only the trajectories of individual humans are relevant, but also the motion and the behavior of groups of humans. As a first step towards an assistance system that automatically warns drivers in a MOUT scenario, we implemented an approach to the detection of humans in video images and applied it to a relevant set of image sequences taken in a MOUT scenario. In the paper we assess the obtained results and outline further research activities.