Global ETD Search

1	Understanding Human Activities at Large Scale Caba Heilbron, Fabian David 03 1900 With the growth of online media, surveillance and mobile cameras, the number and size of video databases are increasing at an incredible pace. For example, YouTube reported that over 400 hours of video are uploaded every minute to their servers. Arguably, people are the most important and interesting subjects of such videos. The computer vision community has embraced this observation to validate the crucial role that human action recognition plays in building smarter surveillance systems, semantically aware video indexes and more natural human-computer interfaces. However, despite the explosion of video data, the ability to automatically recognize and understand human activities is still somewhat limited. In this work, I address four different challenges in scaling up action understanding. First, I tackle existing dataset limitations by using a flexible framework that allows continuous acquisition, crowdsourced annotation, and segmentation of online videos, culminating in a large-scale, rich, and easy-to-use activity dataset, known as ActivityNet. Second, I develop an action proposal model that takes a video and directly generates temporal segments that are likely to contain human actions. The model has two appealing properties: (a) it retrieves temporal locations of activities with high recall, and (b) it produces these proposals quickly. Third, I introduce a model that exploits action-object and action-scene relationships to improve the localization quality of a fast generic action proposal method and to quickly prune irrelevant activities in a cascade fashion. These two features lead to an efficient and accurate cascade pipeline for temporal activity localization. Lastly, I introduce a novel active learning framework for temporal localization that aims to mitigate the data dependency issue of contemporary action detectors.
By creating a large-scale video benchmark, designing efficient action scanning methods, enriching approaches with high-level semantics for activity localization, and devising an effective strategy to build action detectors with limited data, this thesis takes a step closer to general video understanding. Keywords: Computer Vision, Machine Learning, Video Understanding, Activity Localization, ActivityNet
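The abstract above describes proposals as temporal segments that should recover activities "with high recall". As a rough illustration only, the sketch below shows one common way such recall is measured, using temporal intersection-over-union (IoU) between segments. The function names, the `(start, end)` tuple representation, and the 0.5 threshold are assumptions made for this example and are not taken from the thesis.

```python
def temporal_iou(a, b):
    """IoU of two temporal segments given as (start, end) in seconds."""
    # Overlap is zero when the segments do not intersect.
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def proposal_recall(ground_truth, proposals, threshold=0.5):
    """Fraction of ground-truth segments matched by at least one proposal.

    A ground-truth segment counts as retrieved when some proposal overlaps
    it with temporal IoU >= threshold (0.5 is an assumed default).
    """
    if not ground_truth:
        return 0.0
    hits = sum(
        1
        for gt in ground_truth
        if any(temporal_iou(gt, p) >= threshold for p in proposals)
    )
    return hits / len(ground_truth)


if __name__ == "__main__":
    # Hypothetical annotations and proposals for a single video.
    annotated = [(0.0, 10.0), (20.0, 30.0)]
    proposed = [(1.0, 9.0), (40.0, 50.0)]
    print(proposal_recall(annotated, proposed))
```

In this toy case only the first annotated segment is covered by a proposal, so half of the ground truth is retrieved.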
2	Efficient Utilization of Video Embeddings from Video-Language Models Lindgren, Felix January 2023 In the digital age where video content is abundant, this thesis investigates the efficient adaptation of an existing video-language model (VLM) to new data. The research leverages CLIP, a robust language-vision model, for various video-related tasks including video retrieval. The study explores using pre-trained VLMs to extract video embeddings without the need for extensive retraining. The effectiveness of a smaller model using aggregation is compared with larger models, and the application of logistic regression for few-shot learning on video embeddings is examined. Aggregation was performed both without learning, through mean-pooling, and with a transformer. The video-retrieval models were evaluated on the ActivityNet Captions dataset, which contains long videos with dense descriptions, while the linear probes were evaluated on ActivityNet200, a video classification dataset. The study's findings suggest that most models improved when additional frames were incorporated through aggregation. A model trained with fewer frames was able to surpass those trained with two or four times more frames by using aggregation instead. The incorporation of patch dropout and the freezing of embeddings proved advantageous, enhancing performance and conserving training resources. Furthermore, using a linear probe showed that the extracted features were of high quality, requiring only 2-4 samples per class to match the zero-shot performance. Keywords: VLM, CLIP, transformers, machine learning, video retrieval, ActivityNet, efficient training, aggregation
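The non-learned aggregation mentioned in the abstract above (mean-pooling per-frame embeddings into one video embedding) can be sketched roughly as follows, together with a cosine-similarity ranking for text-to-video retrieval. This is a minimal illustration under stated assumptions, not the thesis's implementation: embeddings are plain Python lists standing in for the outputs of a model such as CLIP, and the names `mean_pool`, `cosine`, and `rank_videos` are hypothetical.

```python
import math


def mean_pool(frame_embeddings):
    """Average per-frame embeddings into a single video embedding."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(frame[i] for frame in frame_embeddings) / n for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_videos(text_embedding, videos):
    """Rank videos by similarity between a text query and pooled frames.

    `videos` maps a video name to its list of frame embeddings (assumed to
    be precomputed by a frozen encoder).
    """
    scored = [
        (name, cosine(text_embedding, mean_pool(frames)))
        for name, frames in videos.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy 2-dimensional embeddings, purely for illustration.
    library = {
        "video_a": [[1.0, 0.0], [0.8, 0.2]],
        "video_b": [[0.0, 1.0], [0.1, 0.9]],
    }
    for name, score in rank_videos([1.0, 0.0], library):
        print(name, round(score, 3))
```

Because pooling happens after the frames are embedded, more frames can be added at inference time without retraining the encoder, which is the property the abstract attributes to aggregation.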
