41

Acoustic Space Mapping : A Machine Learning Approach to Sound Source Separation and Localization / Projection d'espaces acoustiques : Une approche par apprentissage automatisé de la séparation et de la localisation de sources sonores

Deleforge, Antoine 26 November 2013 (has links)
In this thesis, we address the long-studied problem of binaural (two-microphone) sound source separation and localization through supervised learning. To achieve this, we develop a new paradigm referred to as acoustic space mapping, at the crossroads of binaural perception, robot hearing, audio signal processing and machine learning. The proposed approach consists of learning a link between the auditory cues perceived by the system and the position of the emitting sound source in another modality of the system, such as the visual space or the motor space. We propose new experimental protocols to automatically gather large training sets that associate such data. The obtained datasets are then used to reveal some fundamental intrinsic properties of acoustic spaces, and lead to the development of a general family of probabilistic models for locally linear high- to low-dimensional space mapping. We show that these models unify several existing regression and dimensionality reduction techniques, while encompassing a large number of new models that generalize previous ones. The properties and inference of these models are thoroughly detailed, and the prominent advantage of the proposed methods over state-of-the-art techniques is established on different space mapping applications, beyond the scope of auditory scene analysis.
We then show how the proposed methods can be probabilistically extended to tackle the long-known cocktail party problem, i.e., accurately localizing one or several sound sources emitting at the same time in a real-world environment, and separating the mixed signals. We show that the resulting techniques perform these tasks with unequaled accuracy. This demonstrates the important role of learning and puts forward the acoustic space mapping paradigm as a promising tool for robustly addressing the most challenging problems in computational binaural audition.
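The locally linear high- to low-dimensional mapping idea can be illustrated with a short sketch. The following is a minimal, hard-assignment variant (cluster the high-dimensional cues, fit one affine regressor per cluster); the synthetic data, the dimensions and the clustering-plus-regression structure are illustrative assumptions, not the thesis's actual probabilistic model:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic training data: D=20 "auditory cues" generated from L=2 source positions.
L, D, N = 2, 20, 2000
positions = rng.uniform(-1.0, 1.0, size=(N, L))                    # low-dimensional targets
A = rng.normal(size=(D, L))
cues = np.tanh(positions @ A.T) + 0.05 * rng.normal(size=(N, D))   # nonlinear map + noise

# Partition the cue space and fit one local affine map per region.
K = 8
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(cues)
local_maps = [LinearRegression().fit(cues[km.labels_ == k], positions[km.labels_ == k])
              for k in range(K)]

def predict(x):
    """Hard-assignment locally linear prediction; the thesis's probabilistic
    models instead blend all local maps by posterior responsibilities."""
    k = km.predict(x.reshape(1, -1))[0]
    return local_maps[k].predict(x.reshape(1, -1))[0]

true_pos = rng.uniform(-1.0, 1.0, size=(1, L))
test_cue = np.tanh(true_pos @ A.T)
print("true position:", true_pos[0], " estimate:", predict(test_cue[0]))
```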
42

Toward sequential segregation of speech sounds based on spatial cues / Vers la ségrégation séquentielle de signaux de parole sur la base d'indices de position

David, Marion 13 November 2014 (has links)
In a context of competing sound sources, auditory scene analysis aims to build an accurate and useful representation of the perceived sounds. Solving such a scene consists of grouping the sound events that come from the same source and segregating them from the other sounds. This PhD work intended to further our understanding of how the human auditory system processes these complex acoustic environments, with a particular emphasis on the potential influence of spatial cues on perceptual stream segregation. All the studies conducted during this PhD endeavoured to rely on realistic configurations.
In a real environment, the diffraction and reflection properties of the room and the head lead to distortions of the sounds depending on the source and receiver positions. This phenomenon is named colouration. Speech-shaped noises, as a first approximation of speech sounds, were used to evaluate the effect of this colouration on stream segregation. The results showed that the slight monaural spectral differences induced by head and room colouration can induce segregation. Moreover, this segregation was enhanced by adding the binaural cues associated with a given position (ITD, ILD). In particular, a second study suggested that the monaural intensity variations over time at each ear were more relevant for stream segregation than the interaural level differences. The results also indicated that the percept of lateralization associated with a given ITD helped segregation when the lateralization was salient enough. Besides, the ITD per se could also favour segregation.
The natural ability to perceptually solve an auditory scene is relevant for speech intelligibility. The main idea was to replicate the first experiments with speech items instead of frozen noises. A characteristic of running speech is the high degree of acoustical variability used to convey information. Thus, as a first step, we investigated the robustness of stream segregation based on a frequency difference to variability in that same acoustical cue (i.e., frequency). The second step was to evaluate the fundamental frequency difference needed to separate speech items. Indeed, given the limited effects measured in the first two experiments, it was assumed that spatial cues might be relevant for stream segregation only in interaction with another, stronger cue such as an F0 difference, owing to their stability over time in real situations.
The results of these preliminary experiments showed, first, that introducing a large spectral variability within pure-tone streams can lead to a complicated percept, presumably consisting of multiple streams. Second, the results suggested that a fundamental frequency difference of between 3 and 5 semitones makes it possible to separate speech items. These results will be used to design the next experiment, which investigates how an ambiguous percept can be biased toward segregation by introducing spatial cues.
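As a quick reference for the F0 separations reported above, the semitone-to-frequency-ratio conversion is ratio = 2^(n/12). A minimal sketch (the 120 Hz reference F0 is an assumed value for illustration, not taken from the thesis):

```python
# Semitone-to-frequency-ratio conversion: ratio = 2 ** (n / 12).
base_f0 = 120.0  # Hz -- assumed reference F0, for illustration only
for semitones in (3, 4, 5):
    ratio = 2 ** (semitones / 12)
    print(f"{semitones} semitones -> ratio {ratio:.3f} "
          f"({base_f0:.0f} Hz vs {base_f0 * ratio:.1f} Hz)")
```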
43

Analyse de scène sonore multi-capteurs : un front-end temps-réel pour la manipulation de scène / Multi-sensor sound scene analysis : a real-time front-end for scene manipulation

Baque, Mathieu 09 June 2017 (has links)
This thesis is set in the context of the rise of spatial audio (5.1 content, Dolby Atmos...) and particularly of 3D audio. Among the existing 3D audio formats, Ambisonics and Higher Order Ambisonics (HOA) allow a homogeneous spatial representation of the sound field and lend themselves naturally to manipulations such as rotations or sound-field distortions. The aim of this thesis is to provide efficient tools for analysing and manipulating (mostly speech) audio content in the Ambisonic and HOA formats. Real-time operation and robustness to real acoustic conditions, in particular reverberation, are the main constraints to satisfy. The implemented algorithm is based on a frame-by-frame Independent Component Analysis (ICA), which decomposes the sound field into a set of acoustic contributions corresponding either to sources (direct field) or to reverberation. A Bayesian classification step, applied to the extracted components, then identifies and counts the sound sources contained in the mixture. Directions of arrival of the identified sources are extracted from the mixing matrix estimated by ICA, according to the Ambisonic formalism, providing a real-time map of the sound scene. Performance was evaluated exhaustively on real content as a function of several parameters: the number of sources, the acoustic environment, the frame length and the Ambisonic order. Accurate results in terms of source localization and source counting were obtained for frame lengths of a few hundred milliseconds. The algorithm, used as a pre-processing step in a domestic voice-assistant prototype, significantly improves recognition performance, notably for far-field capture and in the presence of interfering sources.
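A minimal sketch of the pipeline's core idea on first-order Ambisonic (B-format) frames: run ICA on one frame and read a direction of arrival off each column of the estimated mixing matrix, using the B-format encoding W = 1, X = cos(az)cos(el), Y = sin(az)cos(el), Z = sin(el). The synthetic sources, frame length, source directions and the omission of the Bayesian classification step are all simplifying assumptions, not the thesis's implementation:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(1)
fs, frame = 16000, 8000                      # one ~500 ms analysis frame

def encode(az, el):
    """First-order B-format encoding gains for a plane wave from (az, el)."""
    return np.array([1.0,
                     np.cos(az) * np.cos(el),
                     np.sin(az) * np.cos(el),
                     np.sin(el)])

true_dirs = [(np.deg2rad(40.0), 0.0), (np.deg2rad(-70.0), 0.0)]
sources = rng.laplace(size=(2, frame))       # speech-like, super-Gaussian signals
bformat = sum(np.outer(encode(*d), s) for d, s in zip(true_dirs, sources))

ica = FastICA(n_components=2, random_state=0)
ica.fit(bformat.T)                           # frame-wise ICA on the 4 B-format channels
for col in ica.mixing_.T:                    # one mixing-matrix column per component
    if col[0] < 0:                           # resolve ICA's sign ambiguity via W
        col = -col
    az = np.degrees(np.arctan2(col[2], col[1]))
    el = np.degrees(np.arctan2(col[3], np.hypot(col[1], col[2])))
    print(f"estimated DOA: azimuth {az:+6.1f} deg, elevation {el:+5.1f} deg")
```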
44

Harmonic Sound Source Separation in Monaural Music Signals

Goel, Priyank January 2013 (has links) (PDF)
Sound source separation refers to separating sound signals according to their sources from a given observed mixture. Sounds from individual sources are more efficient to code, and far easier to analyze and manipulate, separately than in a mixture. This thesis deals with the problem of source separation in monaural recordings of harmonic musical instruments. A substantial body of literature is surveyed and presented, since sound source separation has been attempted by many researchers over several decades through various approaches.
A prediction-driven approach is presented first, inspired by the old-plus-new heuristic that humans use for auditory scene analysis. In this approach, the signals from different sources are predicted using a general model, and these predictions are then reconciled with the observed sound to obtain the separated signal. This approach failed on real-world recordings, in which the spectra of the source signals change very dynamically.
Given this dynamic nature of the spectra, an approach is proposed that uses the covariance matrix of harmonic amplitudes. The overlapping and non-overlapping harmonics of the notes are first identified using the pitch of each note. Notes are matched on the basis of their covariance profiles, and the second-order properties of a note's overlapping harmonics are estimated from the covariance matrix of a matching note. Each overlapped harmonic is then reconstructed using these second-order characteristics. The technique performed well on sound samples taken from the RWC Musical Instrument database.
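The covariance-based reconstruction step can be sketched as linear (Gaussian) conditioning: learn the mean and covariance of the harmonic-amplitude vector from a clean matching note, then infer the overlapped harmonics from the observed ones. The toy amplitude model, the number of harmonics and the choice of overlapped indices below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
H, frames = 8, 400                            # 8 harmonics, 400 analysis frames

# "Matching note": correlated harmonic-amplitude trajectories (toy model).
envelope = np.sin(np.linspace(0.01, np.pi - 0.01, frames))
clean = envelope * (1.0 / np.arange(1, H + 1))[:, None] \
        + 0.02 * rng.normal(size=(H, frames))
mu, cov = clean.mean(axis=1), np.cov(clean)

overlapped = [2, 5]                           # harmonics masked by a colliding note
observed = [h for h in range(H) if h not in overlapped]

# Gaussian conditioning: E[a_m | a_o] = mu_m + C_mo C_oo^{-1} (a_o - mu_o)
C_oo = cov[np.ix_(observed, observed)]
C_mo = cov[np.ix_(overlapped, observed)]
gain = C_mo @ np.linalg.inv(C_oo)

a = clean[:, frames // 2]                     # pretend this frame's overlapped bins are lost
a_est = mu[overlapped] + gain @ (a[observed] - mu[observed])
print("true overlapped amplitudes:", np.round(a[overlapped], 3))
print("reconstructed             :", np.round(a_est, 3))
```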
45

Analyse de scène temps réel pour l'interaction 3D / Real-time scene analysis for 3D interaction

Kaiser, Adrien 01 July 2019 (has links)
This PhD thesis focuses on the visual analysis of indoor scenes captured by commodity depth sensors, with the goal of converting their data into high-level understanding of the scene. It explores the application of 3D geometry analysis tools to visual depth data in terms of enhancement, registration and consolidation. In particular, it aims to show how shape abstraction can generate lightweight representations of the data for fast analysis with low hardware requirements. This last property matters because one of our goals is to design algorithms suitable for live embedded operation on wearable devices, smartphones or mobile robots.
The context of this thesis is the live operation of 3D interaction on a mobile device, which raises numerous issues, including placing 3D interaction zones in relation to real surrounding objects, tracking those zones in space as the sensor moves, and providing a meaningful and understandable experience to non-expert users. Towards solving these problems, we make contributions in which scene abstraction leads to fast and robust sensor localization as well as efficient frame-data representation, enhancement and consolidation. While simple geometric shapes are not as faithful as dense point sets or volumes for representing observed scenes, we show that they are an acceptable approximation, and their light weight strikes a good balance between accuracy and performance.
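A minimal sketch of why shape abstraction is lightweight: a noisy wall patch from a depth camera, a thousand points, is replaced by a single RANSAC-fitted plane (four numbers). The synthetic point cloud, inlier threshold and iteration count are illustrative assumptions; the thesis's abstraction pipeline handles many primitives and full frames:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy depth-camera data: a noisy wall patch at z = 1.5 m plus 10% clutter.
n = 1000
pts = np.column_stack([rng.uniform(0, 2, n), rng.uniform(0, 2, n),
                       1.5 + 0.01 * rng.normal(size=n)])
pts[:100] = rng.uniform(0, 2, (100, 3))

best_inliers, best_model = 0, None
for _ in range(200):                                    # RANSAC iterations
    p0, p1, p2 = pts[rng.choice(n, 3, replace=False)]
    normal = np.cross(p1 - p0, p2 - p0)
    if np.linalg.norm(normal) < 1e-9:                   # degenerate sample
        continue
    normal /= np.linalg.norm(normal)
    d = -normal @ p0
    inliers = int((np.abs(pts @ normal + d) < 0.02).sum())
    if inliers > best_inliers:
        best_inliers, best_model = inliers, (normal, d)

normal, d = best_model
print(f"plane normal {np.round(normal, 2)}, offset {d:.2f}: "
      f"{best_inliers}/{n} points abstracted by 4 numbers")
```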
46

Auditory foreground and background decomposition: New perspectives gained through methodological diversification

Thomaßen, Sabine 11 April 2022 (has links)
A natural auditory scene contains many sound sources, each of which produces complex sounds. These sounds overlap and reach our ears at the same time, but they also change constantly. To still be able to follow the sound source of interest, the auditory system must decide where each individual tone belongs and integrate this information over time. For well-controlled investigations of the mechanisms behind this challenging task, sound sources need to be simulated in the lab. This is mostly done with sine tones arranged in certain spectrotemporal patterns. The vast majority of studies simply interleave two sub-sequences of sine tones; participants report how they perceive these sequences, or they perform a task whose performance measure hints at how the scene was perceived. While many important insights have been gained with this procedure, the questions that can be addressed with it are limited, and the commonly used response methods are either partly susceptible to distortions or only indirect measures.
The present thesis enlarged both the complexity of the tone sequences and the diversity of perceptual measures used for investigations of auditory scene analysis. These changes are intended to open up new questions and give new perspectives on our knowledge of auditory scene analysis. Specifically, the thesis established three-tone sequences as a tool for targeted investigations of perceptual foreground and background processing in complex auditory scenes. In addition, it modified an already established approach for indirect measurement of auditory perception in a way that enables detailed and unambiguous investigations of background processing. Finally, a new response method was developed: a no-report method for auditory perception, based on eye movements, that might also serve to validate subjective report measures.
With the aid of all these methodological improvements, the current thesis shows that auditory foreground formation is more complex than previously assumed, since listeners hold more than one auditory source in the foreground without being forced to do so. In addition, it shows that the auditory system prefers a limited number of specific source configurations, probably to avoid combinatorial explosion. Finally, the thesis indicates that the formation of the perceptual background is also quite complex, since the auditory system holds in parallel perceptual organization alternatives that had been assumed to be mutually exclusive. Thus, both the foreground and the background follow different rules than expected on the basis of two-tone sequences. One finding, however, seems to hold for both kinds of sequences: the impact of the tone pattern on subjective perception is marginal, be it in two- or three-tone sequences.
Regarding the no-report method for auditory perception, the thesis shows that eye movements and the reported auditory foreground formations were in good agreement, and this approach indeed seems to have the potential to become a first no-report measure for auditory perception.
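A minimal sketch of the three-tone stimulus family the thesis builds on: three sine tones at different frequencies interleaved into one repeating sequence. The specific frequencies, the 100 ms tone duration, the silent fourth slot and the ramp length are illustrative assumptions, not the parameters used in the studies:

```python
import numpy as np

fs = 44100
tone_dur = 0.1                                 # 100 ms per slot (assumed)
freqs = {"A": 400.0, "B": 504.0, "C": 635.0}   # roughly 4 semitones apart (assumed)
cycle = ["A", "B", "C", "_"]                   # silent fourth slot (assumed)

def tone(f, dur):
    t = np.arange(int(fs * dur)) / fs
    y = np.sin(2 * np.pi * f * t)
    ramp = np.linspace(0.0, 1.0, int(0.01 * fs))  # 10 ms on/off ramps to avoid clicks
    y[:ramp.size] *= ramp
    y[-ramp.size:] *= ramp[::-1]
    return y

sequence = np.concatenate([tone(freqs[s], tone_dur) if s in freqs
                           else np.zeros(int(fs * tone_dur))
                           for s in cycle * 20])   # 20 repetitions of ABC_
print(f"generated {sequence.size / fs:.1f} s of ABC_ stimulus")
```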
47

Transparent Object Reconstruction and Registration Confidence Measures for 3D Point Clouds based on Data Inconsistency and Viewpoint Analysis

Albrecht, Sven 28 February 2018 (has links)
A large number of current mobile robots use 3D sensors as part of their sensor setup. Common 3D sensors, e.g., laser scanners or RGB-D cameras, emit a signal (laser light or infrared light, for instance) and record its reflection in order to estimate the depth to a surface. The resulting set of measurement points is commonly referred to as a 'point cloud'. The first part of this dissertation addresses an inherent problem of sensors that emit a light signal: these signals can be reflected and/or refracted by transparent or highly specular surfaces, causing erroneous or missing measurements. A novel heuristic approach is introduced that shows how such objects may nevertheless be identified, and their size and shape reconstructed, by fusing information from several viewpoints of the scene. In contrast to existing approaches, no prior knowledge about the objects is required, nor is the shape of the reconstructed objects restricted to a limited set of geometric primitives. The thesis proceeds to illustrate problems caused by sensor noise and registration errors, and introduces mechanisms to address them. Finally, a quantitative comparison between equivalent directly measured objects, the reconstructions and "ground truth" is provided.
The second part of the thesis addresses the problem of automatically determining the quality of the registration of a pair of point clouds. Although a different topic, the two problems are closely related when modeled in the fashion of this thesis. After illustrating why the output parameters of a popular registration algorithm (ICP) are not suitable for deducing registration quality, several heuristic measures are developed that provide better insight. Experiments on different datasets showcase the applicability of the proposed measures in different scenarios.
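One family of heuristic quality measures in the spirit described above looks at the distribution of nearest-neighbor residuals after alignment rather than a single mean error. A minimal sketch; the synthetic clouds, the 3-degree misalignment and the inlier threshold are illustrative assumptions, not the dissertation's actual measures:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(4)
reference = rng.uniform(0, 1, (5000, 3))
tree = cKDTree(reference)

def registration_confidence(aligned, inlier_thresh=0.01):
    """Residual statistics of an aligned cloud against the reference."""
    d, _ = tree.query(aligned)
    return d.mean(), np.median(d), float((d < inlier_thresh).mean())

# A well-registered copy vs. one with a residual 3-degree misalignment.
theta = np.deg2rad(3.0)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
good = reference + 0.002 * rng.normal(size=reference.shape)
bad = reference @ R.T

for name, cloud in [("good", good), ("bad", bad)]:
    mean, median, inlier = registration_confidence(cloud)
    print(f"{name}: mean {mean:.4f}, median {median:.4f}, inlier ratio {inlier:.2f}")
```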
48

Deep CASA for Robust Pitch Tracking and Speaker Separation

Liu, Yuzhou January 2019 (has links)
No description available.
49

Deep learning methods for reverberant and noisy speech enhancement

Zhao, Yan 15 September 2020 (has links)
No description available.
50

Holistic Representations For Activities And Crowd Behaviors

Solmaz, Berkan 01 January 2013 (has links)
In this dissertation, we address the problem of analyzing the activities of people in a variety of scenarios, a problem commonly encountered in vision applications. The overarching goal is to devise new representations for activities in settings where individuals, or a number of people, may take part in specific activities. Different types of activities can be performed either by an individual at the fine level or by several people constituting a crowd at the coarse level. We take domain-specific information into account when modeling these activities. The proposed solutions are summarized in the following.
The holistic description of videos is appealing for visual detection and classification tasks for several reasons, including capturing the spatial relations between scene components, simplicity, and performance [1, 2, 3]. First, we present a holistic (global) frequency-spectrum-based descriptor for representing atomic actions performed by individuals, such as bench pressing, diving, hand waving, boxing, playing guitar, mixing, jumping, horse riding, or hula hooping. We model and learn these individual actions to classify complex user-uploaded videos. Our method bypasses the detection of interest points, the extraction of local video descriptors and the quantization of local descriptors into a codebook; it represents each video sequence as a single feature vector. This holistic feature vector is computed by applying a bank of 3D spatio-temporal filters to the frequency spectrum of a video sequence; hence it integrates information about both motion and scene structure. We tested our approach on two of the most challenging datasets, UCF50 [4] and HMDB51 [5], and obtained promising results, which demonstrates the robustness and discriminative power of our holistic video descriptor for classifying videos of various realistic actions.
In the above approach, the holistic feature vector of a video clip is obtained by dividing the video into spatio-temporal blocks and concatenating the features of the individual blocks. However, such a holistic representation blindly incorporates all video regions, regardless of their contribution to classification. Next, we present an approach that improves the performance of holistic descriptors for activity recognition by discovering the discriminative video blocks. We measure the discriminativity of a block by examining its response to a pre-learned support vector machine model: a block is considered discriminative if it responds positively for positive training samples and negatively for negative training samples. We pose the problem of finding the optimal blocks as one of selecting a sparse set of blocks that maximizes the total classifier discriminativity. Through a detailed set of experiments on benchmark datasets [6, 7, 8, 9, 5, 10], we show that our method discovers the useful regions in the videos and eliminates those that are confusing for classification, resulting in significant performance improvement over the state of the art.
In contrast to scenes where an individual performs a primitive action, there may be scenes with several people in which crowd behaviors take place. For such scenes, traditional recognition approaches do not work, owing to severe occlusion and computational requirements. The number of available videos is limited and the scenes are complicated, so learning these behaviors is not feasible. For this problem, we present a novel approach, based on the optical flow in a video sequence, for identifying five specific and common crowd behaviors in visual scenes. In the algorithm, the scene is overlaid with a grid of particles, initializing a dynamical system derived from the optical flow. Numerical integration of the optical flow provides particle trajectories that represent the motion in the scene, and linearization of the dynamical system allows a simple, practical analysis and classification of the behavior through the Jacobian matrix. Essentially, the eigenvalues of this matrix are used to determine the dynamic stability of points in the flow, and each type of stability corresponds to one of five crowd behaviors: (1) bottlenecks, where many pedestrians/vehicles from various points in the scene enter through one narrow passage; (2) fountainheads, where many pedestrians/vehicles emerge from a narrow passage only to separate in many directions; (3) lanes, where many pedestrians/vehicles move at the same speed in the same direction; (4) arches or rings, where the collective motion is curved or circular; and (5) blocking, where there is opposing motion and the desired movement of groups of pedestrians is prohibited. The implementation requires identifying a region of interest in the scene and checking the eigenvalues of the Jacobian matrix in that region to determine the type of flow, which corresponds to one of these well-defined crowd behaviors. The eigenvalues are considered only in these regions of interest, consistent with the linear approximation and the implied behaviors. Since changes in the eigenvalues can mean changes in stability, corresponding to changes in behavior, we can repeat the algorithm over clips of long video sequences to locate behavior changes. This method was tested on real videos representing crowd and traffic scenes.
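A minimal sketch of the eigenvalue test described above: fit a linear dynamical system w = J p to (position, velocity) samples in a region of interest and classify the behavior from the eigenvalues of the 2x2 Jacobian J. A synthetic sink field stands in for real optical flow, and the decision thresholds are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

def classify(J):
    """Map the Jacobian's eigenvalues to one of the crowd behaviors."""
    ev = np.linalg.eigvals(J)
    if np.all(np.abs(ev.imag) > 0.1 * np.abs(ev.real)):  # dominant rotation
        return "arch/ring (circular flow)"
    if np.all(ev.real < 0):
        return "bottleneck (flow converging on a passage)"
    if np.all(ev.real > 0):
        return "fountainhead (flow diverging from a passage)"
    return "lane or blocking (mixed/shear flow)"

# Particle positions in a region of interest and their flow vectors for a
# synthetic sink field w = J p with J = -I (everything converges on the exit).
pts = rng.uniform(-1.0, 1.0, (500, 2))
flow = pts @ np.array([[-1.0, 0.0], [0.0, -1.0]]).T + 0.02 * rng.normal(size=pts.shape)

# Least-squares estimate of the Jacobian from (position, velocity) samples.
X, *_ = np.linalg.lstsq(pts, flow, rcond=None)           # flow ~ pts @ X, so J = X.T
print(classify(X.T))
```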
