1. An Empirical Active Learning Study for Temporal Segment Networks. Mao, Jilei (January 2022)
Video classification is the task of assigning a relevant label to a video given its frames. Active learning aims to achieve higher accuracy with fewer labeled training instances by means of a designed query strategy that selects representative instances from the unlabeled pool and sends them to an oracle for labeling; it has been used successfully in many modern machine learning problems. To investigate how different active learning strategies behave on video classification, we test three strategies, margin sampling, standard deviation sampling, and center sampling, on Temporal Segment Networks (TSN), a classic neural network designed for video classification. We profile the three strategies in systematic controlled experiments and, after the first query round, compare the resulting models' confusion matrices, data distributions, and training logs with those of the baseline models. We observe that the comparison between models changes depending on the evaluation criterion. Across all the criteria we use, the average performance of center sampling is better than that of random sampling, while margin sampling and standard deviation sampling perform much worse than both random sampling and center sampling. The training logs and data distributions indicate that margin sampling and standard deviation sampling tend to select outliers that are hard to learn but apparently do not help to improve model performance. Center sampling readily outperforms random sampling in terms of F1-score. The evaluation criteria should therefore be formulated according to the actual application requirements.
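The abstract does not give the exact scoring rules of the three query strategies, so the sketch below only illustrates their common formulations on the softmax outputs and feature embeddings of a video classifier such as TSN: margin sampling picks the clips with the smallest gap between the two most probable classes, standard deviation sampling picks the clips with the flattest class-probability vectors, and center sampling picks the clips whose embeddings lie closest to the mean embedding of the unlabeled pool. The function names, the selection directions, and the toy data are assumptions for illustration, not the thesis' implementation.

import numpy as np

def margin_sampling(probs, k):
    # Smallest gap between the two highest class probabilities = most ambiguous clips.
    top2 = np.sort(probs, axis=1)[:, -2:]
    margins = top2[:, 1] - top2[:, 0]
    return np.argsort(margins)[:k]

def std_sampling(probs, k):
    # Lowest standard deviation of the class-probability vector = flattest,
    # least confident prediction (assumed selection direction).
    return np.argsort(probs.std(axis=1))[:k]

def center_sampling(features, k):
    # Closest to the mean feature embedding of the unlabeled pool = most representative.
    dists = np.linalg.norm(features - features.mean(axis=0), axis=1)
    return np.argsort(dists)[:k]

# Toy pool: 100 unlabeled clips, 10 classes, 256-dimensional embeddings.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=100)
features = rng.normal(size=(100, 256))
query_indices = margin_sampling(probs, k=8)   # clips to send to the oracle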
2. Spatio-Temporal Networks for Human Activity Recognition based on Optical Flow in Omnidirectional Image Scenes. Seidel, Roman (29 February 2024)
The ability of human beings to perceive movement in the environment around them with their visual system is called motion perception: the attention of our visual system is primarily drawn to objects that are moving. This property of human motion perception is used in this dissertation to infer human activity from data using artificial neural networks. One of the main aims of the thesis is to discover which modalities, namely RGB images, optical flow, and human keypoints, are best suited for Human Activity Recognition (HAR) in omnidirectional data. Since such data are not yet available for omnidirectional cameras, they are generated synthetically and additionally captured with a real omnidirectional camera. A distinction is made between a synthetically generated omnidirectional dataset and a real omnidirectional dataset that was recorded in a Living Lab at Chemnitz University of Technology and subsequently annotated by hand. The synthetic dataset, called OmniFlow, consists of RGB images, optical flow in forward and backward directions, segmentation masks, bounding boxes for the class people, and human keypoints. The real-world dataset, OmniLab, contains RGB images from two top-view scenes as well as manually annotated human keypoints and estimated forward optical flow.
This thesis explains the generation of both the synthetic and the real-world dataset. The OmniFlow dataset is generated with the 3D rendering engine Blender, in which a fully configurable 3D indoor environment is created with artificially textured rooms, human activities, objects and different lighting scenarios. A randomly placed virtual camera following the omnidirectional camera model renders the RGB images and all other modalities for 15 predefined activities. The result of modelling this 3D indoor environment is the OmniFlow dataset. Because no other omnidirectional optical flow data are available, OmniFlow is validated using Test-Time Augmentation (TTA). Compared to the baseline, Recurrent All-Pairs Field Transforms (RAFT) trained on the FlyingChairs and FlyingThings3D datasets, only about 1000 images are needed for fine-tuning to obtain a very low End-point Error (EE). Furthermore, TTA on the OmniFlow test set was shown to change the EE by about a factor of three. As a basis for generating artificial keypoints with action labels on OmniFlow, the Carnegie Mellon University motion capture database is used, which provides a large number of sports and household activities as skeleton data in the BVH format. From the BVH skeleton data, the skeletal points of the people performing the activities can be derived directly or extrapolated by projecting these points from the 3D world into an omnidirectional 2D image. The real-world dataset, OmniLab, was recorded in two rooms of the Living Lab with five different people mimicking the 15 actions of OmniFlow. The human keypoint annotations were added manually in two passes to reduce the rate of incorrect annotations.
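The abstract states that the BVH keypoints are projected from the 3D world into the omnidirectional 2D image but does not name the camera model, so the following sketch assumes a simple equidistant fisheye model (r = f * theta); the focal length, principal point and the example keypoint are illustrative values only, not parameters from the thesis.

import numpy as np

def project_equidistant(point_cam, f, cx, cy):
    # point_cam: 3D keypoint in camera coordinates (x, y, z), z along the optical axis.
    x, y, z = point_cam
    theta = np.arctan2(np.hypot(x, y), z)   # angle between the viewing ray and the optical axis
    phi = np.arctan2(y, x)                  # azimuth around the optical axis
    r = f * theta                           # equidistant fisheye mapping
    return cx + r * np.cos(phi), cy + r * np.sin(phi)

# Example: a keypoint one metre below a top-view camera, slightly off-centre,
# projected into an assumed 1280 x 1280 px omnidirectional image.
u, v = project_equidistant(np.array([-0.2, 0.1, 1.0]), f=320.0, cx=640.0, cy=640.0)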
The activity-level evaluation was carried out with a TSN and a PoseC3D network. The TSN consists of two CNNs, a spatial component trained on RGB images and a temporal component trained on the dense optical flow fields of OmniFlow. The PoseC3D network, an approach to skeleton-based activity recognition, uses a heatmap stack of keypoints in combination with 3D convolution, making the network more effective at learning spatio-temporal features than methods based on 2D convolution. In a first step, the networks were trained and validated on the synthetically generated OmniFlow dataset. In a second step, training was performed on OmniFlow and validation on the real-world OmniLab dataset. For both networks, TSN and PoseC3D, three hyperparameters were varied and the top-1, top-5 and mean accuracy are reported: first, the learning rate of the Stochastic Gradient Descent (SGD) optimizer; second, the clip length, i.e. the number of consecutive frames the network sees; and third, the spatial resolution of the input data, for which five different image sizes were generated by cropping the original OmniFlow and OmniLab data. Keypoint-based HAR with PoseC3D performed best compared to activity classification based on optical flow and RGB images, reaching a top-1 accuracy of 0.3636, a top-5 accuracy of 0.7273 and a mean accuracy of 0.3750, with 128 px × 128 px as the most suitable resolution and a clip length of at least 24 consecutive frames. The best results were achieved with a PoseC3D learning rate of 10⁻³.
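For reference, the three reported metrics can be computed from raw class scores as sketched below; this is not the thesis' evaluation code (which presumably relies on the utilities of its training framework), merely the standard definitions of top-k accuracy and mean (per-class) accuracy applied to toy data.

import numpy as np

def topk_accuracy(scores, labels, k):
    # Fraction of clips whose ground-truth label is among the k highest scores.
    topk = np.argsort(scores, axis=1)[:, -k:]
    return float(np.mean([labels[i] in topk[i] for i in range(len(labels))]))

def mean_class_accuracy(scores, labels, num_classes):
    # Average of the per-class recalls, so every class contributes equally.
    preds = scores.argmax(axis=1)
    per_class = [np.mean(preds[labels == c] == c)
                 for c in range(num_classes) if np.any(labels == c)]
    return float(np.mean(per_class))

# Toy example with 15 activity classes as in OmniFlow/OmniLab.
rng = np.random.default_rng(1)
scores = rng.normal(size=(44, 15))          # one score vector per clip
labels = rng.integers(0, 15, size=44)       # ground-truth activity per clip
top1 = topk_accuracy(scores, labels, k=1)
top5 = topk_accuracy(scores, labels, k=5)
mean_acc = mean_class_accuracy(scores, labels, num_classes=15)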
In addition, confusion matrices indicating the class-wise accuracy of the 15 activity classes are given for the modalities RGB images, optical flow and human keypoints. The confusion matrix for the RGB modality shows the best classification result of the TSN for the action walk with an accuracy of 1.00, but on real-world data almost all other actions are also classified as walk. The classification of human actions based on optical flow works best for the action sit in chair and stand up with an accuracy of 1.00 and for walk with 0.50. It is also noticeable that almost all actions are classified as sit in chair and stand up, which indicates that the inter-class variance in the flow data is low, so that the TSN is not able to distinguish between the selected action classes. Validated on real-world data for the keypoint modality, the actions rugpull (1.00) and cleaning windows (0.75) perform best. The PoseC3D network operating on a time series of human keypoints is therefore less sensitive to variations in the viewing angle between the synthetic and real-world data than the RGB and optical flow modalities.
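As a small self-contained illustration of how such class-wise accuracies are read off, a confusion matrix counts ground-truth classes along the rows and predicted classes along the columns; row-normalising it puts the per-class accuracy on the diagonal. The toy labels below are not taken from the thesis.

import numpy as np

def confusion_matrix(labels, preds, num_classes):
    # Rows: ground-truth classes, columns: predicted classes.
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(labels, preds):
        cm[t, p] += 1
    return cm

labels = np.array([0, 0, 1, 2, 2, 2])       # toy ground truth for three classes
preds = np.array([0, 1, 1, 2, 2, 0])        # toy predictions
cm = confusion_matrix(labels, preds, num_classes=3)
class_wise_accuracy = cm.diagonal() / np.maximum(cm.sum(axis=1), 1)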
The pipeline for generating synthetic data needs to be investigated in future work with regard to a more uniform distribution of the motion magnitudes. Random placement of the person and other objects is not sufficient to cover all movement magnitudes. A further improvement of the synthetic data could be to rotate the person around their own axis, so that the person moves in a different direction while performing the activity and the movement magnitudes contain more variance. Furthermore, the domain transition between synthetic and real-world data should be examined further in terms of viewpoint invariance and augmentation methods. It may be necessary to generate a new synthetic dataset with only top-view data and to re-train the TSN and PoseC3D. As an augmentation method, Fourier Domain Adaptation (FDA), for example, could reduce the domain gap between the synthetically generated and the real-world dataset.
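FDA is only proposed here, not implemented in the thesis, so the sketch below merely illustrates the core idea under simplifying assumptions (single-channel images, a fixed window size beta): the low-frequency amplitude spectrum of a synthetic source image is replaced with that of a real target image while the source phase is kept, transferring the global appearance of the target domain.

import numpy as np

def fda_source_to_target(src, trg, beta=0.05):
    # src, trg: single-channel float images of shape (H, W); returns the adapted source.
    fft_src = np.fft.fft2(src)
    fft_trg = np.fft.fft2(trg)
    amp_src, pha_src = np.abs(fft_src), np.angle(fft_src)
    amp_trg = np.abs(fft_trg)

    # Replace the centred low-frequency block of the source amplitude spectrum
    # with the corresponding block of the target amplitude spectrum.
    amp_src = np.fft.fftshift(amp_src)
    amp_trg = np.fft.fftshift(amp_trg)
    h, w = src.shape
    b = int(min(h, w) * beta)
    ch, cw = h // 2, w // 2
    amp_src[ch - b:ch + b, cw - b:cw + b] = amp_trg[ch - b:ch + b, cw - b:cw + b]
    amp_src = np.fft.ifftshift(amp_src)

    # Recombine the swapped amplitude with the original source phase.
    return np.real(np.fft.ifft2(amp_src * np.exp(1j * pha_src)))

# Example: adapt a synthetic OmniFlow-style frame towards the real OmniLab domain
# (random arrays stand in for the actual images here).
rng = np.random.default_rng(2)
synthetic_frame = rng.random((128, 128))
real_frame = rng.random((128, 128))
adapted_frame = fda_source_to_target(synthetic_frame, real_frame, beta=0.05)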
1 Introduction
2 Theoretical Background
3 Related Work
4 Omnidirectional Synthetic Human Optical Flow
5 Human Keypoints for Pose in Omnidirectional Images
6 Human Activity Recognition in Indoor Scenarios
7 Conclusion and Future Work
A Chapter 4: Flow Dataset Statistics
B Chapter 5: 3D Rotation Matrices
C Chapter 6: Network Training Parameters