1 |
Efficient Temporal Action Localization in Videos / Alwassel, Humam, 17 April 2018 (has links)
State-of-the-art temporal action detectors inefficiently search the entire video for specific actions. Despite the encouraging progress these methods achieve, it is crucial to design automated approaches that explore only the parts of the video most relevant to the actions being searched. To address this need, we propose the new problem of action spotting in videos, which we define as finding a specific action in a video while observing only a small portion of that video. Inspired by the observation that humans are extremely efficient and accurate at spotting and finding action instances in a video, we propose Action Search, a novel Recurrent Neural Network approach that mimics the way humans spot actions. Moreover, to address the absence of data recording the behavior of human annotators, we put forward the Human Searches dataset, which compiles the search sequences employed by human annotators spotting actions in the AVA and THUMOS14 datasets. We consider temporal action localization as an application of the action spotting problem. Experiments on the THUMOS14 dataset reveal that our model not only explores the video efficiently (observing on average 17.3% of the video) but also accurately finds human activities, achieving 30.8% mAP (at 0.5 tIoU) and outperforming state-of-the-art methods.
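To make the spotting mechanism concrete, here is a minimal sketch of such a search loop in Python, assuming a hypothetical frame-feature extractor and a recurrent policy trained to emit the next temporal location to observe; the names and dimensions are illustrative, not taken from the thesis.

```python
# Hypothetical sketch of an Action Search-style spotting loop.
# Assumes `extract_feature` (frame -> feature vector) and a trained
# LSTM policy whose head maps the hidden state to a position in [0, 1].
import torch
import torch.nn as nn

class SearchPolicy(nn.Module):
    def __init__(self, feat_dim=512, hidden=128):
        super().__init__()
        self.rnn = nn.LSTMCell(feat_dim + 1, hidden)  # feature + current position
        self.head = nn.Linear(hidden, 1)              # predicts next position

    def step(self, feat, pos, state):
        h, c = self.rnn(torch.cat([feat, pos], dim=-1), state)
        next_pos = torch.sigmoid(self.head(h))        # normalized video time
        return next_pos, (h, c)

def spot(video_frames, extract_feature, policy, max_steps=20):
    """Observe a few frames, hopping to where the policy predicts the action."""
    n = len(video_frames)
    pos = torch.tensor([[0.5]])                       # start in the middle
    state = None
    visited = []
    for _ in range(max_steps):
        idx = int(pos.item() * (n - 1))
        visited.append(idx)
        feat = extract_feature(video_frames[idx]).unsqueeze(0)
        pos, state = policy.step(feat, pos, state)
    return visited                                    # frames actually observed
```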
|
2 |
Understanding Human Activities at Large Scale / Caba Heilbron, Fabian David, 03 1900 (has links)
With the growth of online media, surveillance and mobile cameras, the amount and size of video databases are increasing at an incredible pace. For example, YouTube reported that over 400 hours of video are uploaded every minute to their servers. Arguably, people are the most important and interesting subjects of such videos. The computer vision community has embraced this observation to validate the crucial role that human action recognition plays in building smarter surveillance systems, semantically aware video indexes and more natural human-computer interfaces. However, despite the explosion of video data, the ability to automatically recognize and understand human activities is still somewhat limited.
In this work, I address four different challenges in scaling up action understanding. First, I tackle existing dataset limitations with a flexible framework that allows continuous acquisition, crowdsourced annotation, and segmentation of online videos, culminating in a large-scale, rich, and easy-to-use activity dataset known as ActivityNet. Second, I develop an action proposal model that takes a video and directly generates temporal segments that are likely to contain human actions. The model has two appealing properties: (a) it retrieves temporal locations of activities with high recall, and (b) it produces these proposals quickly. Third, I introduce a model that exploits action-object and action-scene relationships to improve the localization quality of a fast generic action proposal method and to quickly prune out irrelevant activities in a cascade fashion. These two features lead to an efficient and accurate cascade pipeline for temporal activity localization. Lastly, I introduce a novel active learning framework for temporal localization that aims to mitigate the data dependency of contemporary action detectors. By creating a large-scale video benchmark, designing efficient action scanning methods, enriching approaches with high-level semantics for activity localization, and developing an effective strategy to build action detectors with limited data, this thesis takes a step closer towards general video understanding.
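For illustration, temporal proposals of the kind described above are typically scored and deduplicated using temporal IoU (tIoU) and greedy non-maximum suppression; the helpers below are a generic sketch of that machinery, not code from the thesis.

```python
# Illustrative helpers for temporal action proposals:
# tIoU between two segments, and greedy NMS over scored proposals.
def tiou(a, b):
    """Temporal IoU of two segments, each (start, end) in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def nms(proposals, thresh=0.5):
    """proposals: list of (start, end, score); keep high-score, non-overlapping."""
    kept = []
    for p in sorted(proposals, key=lambda p: p[2], reverse=True):
        if all(tiou(p, q) < thresh for q in kept):
            kept.append(p)
    return kept
```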
|
3 |
Attribute learning for image/video understanding / Fu, Yanwei, January 2015 (has links)
For the past decade, computer vision research has achieved increasing success in visual recognition, including object detection and video classification. Nevertheless, these achievements still cannot meet the urgent needs of image and video understanding. The recent rapid development of social media sharing has created a huge demand for automatic media classification and annotation techniques. In particular, such media data usually contain very complex social activities of a group of people (e.g. a YouTube video of a wedding reception) and are captured by consumer devices with poor visual quality. It is therefore extremely challenging to automatically understand such a large number of complex image and video categories, especially when these categories have never been seen before.

One way to understand categories with no or few examples is transfer learning, which transfers knowledge across related domains, tasks, or distributions. In particular, lifelong learning, which aims at transferring information to tasks without any observed data, has recently become popular. In computer vision, transfer learning often takes the form of attribute learning. The key underpinning idea of attribute learning is to exploit transfer learning via an intermediate-level semantic representation: attributes. Semantic attributes are most commonly used as a semantically meaningful bridge between low-level feature data and higher-level class concepts, since they can be used both descriptively (e.g., 'has legs') and discriminatively (e.g., 'cats have it but dogs do not').

Previous work proposes many different attribute learning models for image and video understanding. However, several intrinsic limitations and problems exist in this prior work. The limitations discussed in this thesis include the shortcomings of user-defined attributes, projection domain-shift problems, prototype sparsity problems, the inability to combine multiple semantic representations, and noisy annotations of relative attributes. To tackle these limitations, this thesis explores attribute learning for image and video understanding from three aspects. First, to break the limitations of user-defined attributes, a framework for learning latent attributes is presented for the automatic classification and annotation of unstructured group social activity in videos, enabling attribute learning for understanding complex multimedia data with sparse and incomplete labels. We investigate the learning of latent attributes for content-based understanding, which aims to model and predict classes and tags relevant to objects, sounds, and events: anything likely to be used by humans to describe or search for media. Second, we propose a transductive multi-view embedding hypergraph label propagation framework that addresses three inherent limitations of most previous attribute learning work: the projection domain-shift problem, the prototype sparsity problem, and the inability to combine multiple semantic representations. We explore the manifold structure of the data distributions of different views projected onto the same embedding space via label propagation on a graph. Third, a novel framework for robust learning is presented to effectively learn relative attributes from extremely noisy and sparse annotations.
Relative attributes are increasingly learned from pairwise comparisons collected via crowdsourcing tools, which are more economical and scalable than conventional laboratory-based data annotation. However, a major challenge in adopting a crowdsourcing strategy is the detection and pruning of outliers. We thus propose a principled way to identify annotation outliers by formulating the relative attribute prediction task as a unified robust learning-to-rank problem, tackling the outlier detection and relative attribute prediction tasks jointly. In summary, this thesis studies and solves the key challenges and limitations of attribute learning in image/video understanding. We show the benefits of addressing these challenges and limitations in our approach, which achieves better performance than previous methods.
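As a rough illustration of the ranking setup only (not the unified robust formulation of the thesis, which infers outliers jointly), the sketch below fits a linear relative-attribute ranker on crowdsourced pairs with a hinge loss and then flags the highest-loss pairs as suspected annotation outliers.

```python
# Illustrative sketch: learn a linear relative-attribute ranker w from
# crowdsourced pairs, then treat the highest-loss pairs as likely outliers.
import numpy as np

def fit_relative_attribute(X_i, X_j, lr=0.01, epochs=50, outlier_frac=0.1):
    """Each row pair (X_i[k], X_j[k]) encodes 'image i has more of the attribute'."""
    d = X_i.shape[1]
    w = np.zeros(d)
    for _ in range(epochs):
        margins = (X_i - X_j) @ w                       # want margins >= 1
        grad = -((margins < 1)[:, None] * (X_i - X_j)).mean(axis=0)
        w -= lr * grad                                  # hinge-loss subgradient step
    loss = np.maximum(0, 1 - (X_i - X_j) @ w)           # per-pair hinge loss
    keep = loss <= np.quantile(loss, 1 - outlier_frac)  # prune suspected outliers
    return w, keep
```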
|
4 |
Exploring Deep Learning for Video Understanding / January 2020 (has links)
abstract: Video analysis and understanding have attracted increasing attention in recent years. The research community has devoted considerable effort and made progress in many related visual tasks, such as video action/event recognition, thumbnail frame or video index retrieval, and zero-shot learning. Finding good representative features of videos is an important objective across these visual tasks.
Thanks to the success of deep neural networks in recent vision tasks, it is natural to consider deep learning methods for better extraction of a global representation of images and videos. In general, a Convolutional Neural Network (CNN) is utilized to obtain spatial information, and a Recurrent Neural Network (RNN) is leveraged to capture temporal information.
This dissertation offers a perspective on the challenging problems posed by different kinds of videos, which may require different solutions. Several novel deep learning-based approaches for obtaining representative features are therefore outlined for visual tasks such as zero-shot learning, video retrieval, and video event recognition. To better capture both spatial and temporal information, Convolutional and Recurrent Neural Networks are jointly utilized in most of these approaches. Experiments are conducted to demonstrate the importance and effectiveness of good representative features for better understanding video clips. The dissertation concludes with a discussion of possible future work on obtaining better representative features of more challenging video clips. / Dissertation/Thesis / Doctoral Dissertation Electrical Engineering 2020
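The CNN-for-space, RNN-for-time pattern described above can be made concrete with a toy classifier; this is an illustrative architecture with assumed dimensions, not any of the dissertation's actual models.

```python
# A minimal CNN+LSTM video classifier in the spirit described above.
import torch
import torch.nn as nn

class CnnLstmClassifier(nn.Module):
    def __init__(self, num_classes, feat_dim=256, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(                 # tiny per-frame encoder
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, clips):                     # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)  # per-frame spatial features
        _, (h, _) = self.rnn(feats)               # temporal aggregation
        return self.fc(h[-1])                     # logits per clip

logits = CnnLstmClassifier(num_classes=10)(torch.randn(2, 16, 3, 64, 64))
```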
|
5 |
Efficient and Robust Video Understanding for Human-robot Interaction and Detection / Li, Ying, 09 October 2018 (has links)
No description available.
|
6 |
Facial Motion Augmented Identity Verification with Deep Neural Networks / Sun, Zheng, 06 October 2023 (has links) (PDF)
Identity verification is ubiquitous in our daily life. By verifying the user's identity, the authorization process grants the privilege to access resources or facilities or to perform certain tasks. The traditional and most prevalent authentication method is the personal identification number (PIN) or password. While these knowledge-based credentials can be lost or stolen, human biometric-based verification technologies have become popular alternatives in recent years. Nowadays, more people are used to unlocking their smartphones with a fingerprint or their face instead of a conventional passcode. However, these biometric approaches have their weaknesses. For example, fingerprints can be easily fabricated, and a photo or image can spoof a face recognition system. In addition, these existing biometric-based identity verification methods can proceed even if the user is unaware, sleeping, or unconscious. Therefore, an additional level of security is needed. In this dissertation, we demonstrate a novel identity verification approach that makes the biometric authentication process more secure. Our approach requires only one regular camera to acquire a short video for computing the face and facial motion representations, taking advantage of advancements in computer vision and deep learning techniques. Our new deep neural network model, or facial motion encoder, generates a representation vector for the facial motion in the video. A decision algorithm then compares this vector to the enrolled facial motion vector to determine their similarity for identity verification. We first proved the approach's feasibility through a keypoint-based method. After that, we built a curated dataset and proposed a novel representation learning framework for facial motions. The experimental results show that this facial motion verification approach reaches an average precision of 98.8%, which is more than adequate for everyday use. We also tested the algorithm on complex facial motions and proposed a new self-supervised pretraining approach to boost the encoder's performance. Finally, we evaluated two other potential upstream tasks that could help improve the efficiency of facial motion encoding. Through these efforts, we have built a solid benchmark for facial motion representation learning, and the techniques developed here can inspire other face analysis and video understanding research.
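The verification decision itself reduces to an embedding comparison; the sketch below assumes a hypothetical `motion_encoder` and an acceptance threshold that would be tuned on validation data, neither of which is specified in the abstract.

```python
# Sketch of the verification decision described above (hypothetical encoder).
import torch
import torch.nn.functional as F

def verify(video_clip, enrolled_vec, motion_encoder, threshold=0.8):
    """Accept if the clip's facial-motion embedding matches the enrolled one."""
    with torch.no_grad():
        vec = motion_encoder(video_clip)          # (D,) motion representation
    similarity = F.cosine_similarity(vec, enrolled_vec, dim=0).item()
    return similarity >= threshold, similarity
```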
|
7 |
Self-supervised learning of predictive segmentation models from video / Luc, Pauline, 25 June 2019 (has links)
Predictive models of the environment hold promise for allowing the transfer of recent reinforcement learning successes to many real-world contexts by decreasing the number of interactions needed with the real world. Video prediction has been studied in recent years as a particular case of such predictive models, with broad applications in robotics and navigation systems. While RGB frames are easy to acquire and hold a lot of information, they are extremely challenging to predict and cannot be directly interpreted by downstream applications. Here we introduce the novel tasks of predicting the semantic and instance segmentation of future frames. The abstract feature spaces we consider are better suited for recursive prediction and allow us to develop models which convincingly predict segmentations up to half a second into the future. Predictions are more easily interpretable by downstream algorithms and remain rich, spatially detailed, and easy to obtain, relying on state-of-the-art segmentation methods. We first focus on the task of semantic segmentation, for which we propose a discriminative approach based on adversarial training. Then, we introduce the novel task of predicting future semantic segmentation and develop an autoregressive convolutional neural network to address it. Finally, we extend our method to the more challenging problem of predicting future instance segmentation, which additionally segments out individual objects. To deal with a varying number of output labels per image, we develop a predictive model in the space of high-level convolutional image features of the Mask R-CNN instance segmentation model. We are able to produce visually pleasing segmentations at high resolution for complex scenes involving a large number of instances, with convincing accuracy up to half a second ahead.
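A hedged sketch of the feature-level prediction idea, with assumed shapes and a deliberately simplified convolutional predictor (the thesis model is autoregressive and more elaborate): past Mask R-CNN feature maps are stacked and regressed to the next frame's feature map, which a frozen segmentation head would then decode.

```python
# Illustrative feature-level future prediction (assumed shapes, not the
# exact architecture): stack past feature maps, predict the next one.
import torch
import torch.nn as nn

class FeaturePredictor(nn.Module):
    def __init__(self, channels=256, context=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels * context, 512, 3, padding=1), nn.ReLU(),
            nn.Conv2d(512, channels, 3, padding=1),
        )

    def forward(self, past_feats):                # list of (B, C, H, W), oldest first
        x = torch.cat(past_feats, dim=1)          # stack the temporal context
        return self.net(x)                        # predicted (B, C, H, W) at t+1

predictor = FeaturePredictor()
past = [torch.randn(1, 256, 32, 64) for _ in range(4)]
future_feat = predictor(past)                     # feed to a frozen segmentation head
```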
|
8 |
AI-Enhanced Methods in Autonomous Systems: Large Language Models, DL Techniques, and Optimization Algorithms / de Zarzà i Cubero, Irene, 23 January 2024 (has links)
Thesis by compendium / [EN] The proliferation of autonomous systems, and their increasing integration with day-to-day human life, have opened new frontiers of research and development. Within this scope, the current thesis dives into the multifaceted applications of Large Language Models (LLMs), Deep Learning (DL) techniques, and optimization algorithms within the realm of these autonomous systems. Drawing from the principles of AI-enhanced methods, the studies encapsulated within this work converge on the exploration and enhancement of different autonomous systems, ranging from B5G truck platooning systems, Multi-Agent Systems (MASs), and Unmanned Aerial Vehicles (UAVs) to forest fire area estimation and the early detection of diseases like glaucoma.
A key research focus, pursued in this work, revolves around the innovative deployment of adaptive PID controllers in vehicle platooning, facilitated through the integration of LLMs. These PID controllers, when infused with AI capabilities, offer new possibilities in terms of efficiency, reliability, and security of platooning systems. We developed a DL model that emulates an adaptive PID controller, thereby showcasing its potential in AI-enabled radio and networks. Simultaneously, our exploration extends to multi-agent systems, proposing an Extended Coevolutionary (EC) Theory that amalgamates elements of coevolutionary dynamics, adaptive learning, and LLM-based strategy recommendations. This allows for a more nuanced and dynamic understanding of the strategic interactions among heterogeneous agents in MASs.
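For context, the textbook (non-adaptive) PID gap controller that such a learned model emulates can be sketched in a few lines; the gains and gap values below are illustrative assumptions, not values from the thesis.

```python
# Textbook PID gap controller for platooning (illustrative baseline; the
# thesis emulates an adaptive variant with a DL model, not shown here).
class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_error = 0.0, 0.0

    def step(self, target_gap, measured_gap):
        error = measured_gap - target_gap         # positive when gap too large: accelerate
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

controller = PID(kp=0.8, ki=0.1, kd=0.3, dt=0.05)
accel_cmd = controller.step(target_gap=10.0, measured_gap=12.4)  # close the gap
```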
Moreover, we delve into the realm of Unmanned Aerial Vehicles (UAVs), proposing a system for video understanding that employs a language-based world-state history of events and objects present in a scene captured by a UAV. The use of LLMs here enables open-ended reasoning such as event forecasting with minimal human intervention. Furthermore, an alternative DL methodology is applied for the estimation of the affected area during forest fires. This approach leverages a novel architecture called TabNet, integrated with Transformers, thus providing accurate and efficient area estimation.
In the field of healthcare, our research outlines a successful early detection methodology for glaucoma. Using a three-stage training approach with EfficientNet on retinal images, we achieved high accuracy in detecting early signs of this disease.
Across these diverse applications, the core focus remains the exploration of advanced AI methodologies within autonomous systems. The studies within this thesis seek to demonstrate the power and potential of AI-enhanced techniques in tackling complex problems within these systems. These in-depth investigations, experimental analyses, and developed solutions shed light on the transformative potential of AI methodologies in improving the efficiency, reliability, and security of autonomous systems, ultimately contributing to future research and development in this expansive field. / De Zarzà I Cubero, I. (2023). AI-Enhanced Methods in Autonomous Systems: Large Language Models, DL Techniques, and Optimization Algorithms [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/202201 / Compendio
|
9 |
Crime Detection From Pre-crime Video Analysis / Sedat Kilic (18363729), 03 June 2024
<p dir="ltr">his research investigates the detection of pre-crime events, specifically targeting behaviors indicative of shoplifting, through the advanced analysis of CCTV video data. The study introduces an innovative approach that leverages augmented human pose and emotion information within individual frames, combined with the extraction of activity information across subsequent frames, to enhance the identification of potential shoplifting actions before they occur. Utilizing a diverse set of models including 3D Convolutional Neural Networks (CNNs), Graph Neural Networks (GNNs), Recurrent Neural Networks (RNNs), and a specially developed transformer architecture, the research systematically explores the impact of integrating additional contextual information into video analysis.</p><p dir="ltr">By augmenting frame-level video data with detailed pose and emotion insights, and focusing on the temporal dynamics between frames, our methodology aims to capture the nuanced behavioral patterns that precede shoplifting events. The comprehensive experimental evaluation of our models across different configurations reveals a significant improvement in the accuracy of pre-crime detection. The findings underscore the crucial role of combining visual features with augmented data and the importance of analyzing activity patterns over time for a deeper understanding of pre-shoplifting behaviors.</p><p dir="ltr">The study’s contributions are multifaceted, including a detailed examination of pre-crime frames, strategic augmentation of video data with added contextual information, the creation of a novel transformer architecture customized for pre-crime analysis, and an extensive evaluation of various computational models to improve predictive accuracy.</p>
|