  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
51

Supervised Speech Separation Using Deep Neural Networks

Wang, Yuxuan 21 May 2015 (has links)
No description available.
52

On Generalization of Supervised Speech Separation

Chen, Jitong 30 August 2017 (has links)
No description available.
53

Auditory-based algorithms for sound segregation in multisource and reverberant environments

Roman, Nicoleta 24 August 2005 (has links)
No description available.
54

Integrating computational auditory scene analysis and automatic speech recognition

Srinivasan, Soundararajan 25 September 2006 (has links)
No description available.
55

Integrating Monaural and Binaural Cues for Sound Localization and Segregation in Reverberant Environments

Woodruff, John F. 20 June 2012 (has links)
No description available.
56

An intuitive motion-based input model for mobile devices

Richards, Mark Andrew January 2006 (has links)
Traditional methods of input on mobile devices are cumbersome and difficult to use. Devices have become smaller, while their operating systems have become more complex, to the extent that they are approaching the level of functionality found on desktop computer operating systems. The buttons and toggle-sticks currently employed by mobile devices are a relatively poor replacement for the keyboard and mouse style user interfaces used on their desktop computer counterparts. This research investigates a new input model based on the natural hand motions and reactions of users. For example, when looking at a screen image on a device, moving the device to the left should indicate that the image is to be panned in the same direction. The model developed by this work uses the generic embedded video cameras available on almost all current-generation mobile devices to determine how the device is being moved and maps this movement to an appropriate action. Surveys using mobile devices were undertaken to determine both the appropriateness and efficacy of such a model, as well as to collect the foundational data with which to build the model. Direct mappings between motions and inputs were achieved by analysing users' motions and reactions in response to different tasks. Once the framework was completed, a proof of concept was created on the Windows Mobile platform. This proof of concept leverages DirectShow and Direct3D to track objects in the video stream, maps these objects to a three-dimensional plane, and determines device movements from this data. This input model holds the promise of being a simpler and more intuitive method for users to interact with their mobile devices, and has the added advantage that no hardware additions or modifications to existing mobile devices are required.
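The core idea, estimating the device's own motion from its camera and mapping the dominant motion to a user-interface action, can be illustrated with a short sketch. The sketch below is not the thesis's DirectShow/Direct3D implementation; it uses OpenCV dense optical flow as a stand-in, and the motion threshold and pan-action names are illustrative assumptions.

    import cv2
    import numpy as np

    def pan_gesture(prev_gray, curr_gray, threshold=2.0):
        # Dense optical flow between consecutive grayscale camera frames
        # (positional args: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags).
        flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        # Median flow is a cheap, robust estimate of global scene motion.
        dx, dy = np.median(flow[..., 0]), np.median(flow[..., 1])
        if abs(dx) < threshold and abs(dy) < threshold:
            return None  # motion too small to count as an intentional gesture
        # Moving the device left makes the scene appear to drift right in the image,
        # so positive horizontal flow is interpreted as a pan to the left.
        if abs(dx) >= abs(dy):
            return "pan_left" if dx > 0 else "pan_right"
        return "pan_up" if dy > 0 else "pan_down"

    # Usage (hypothetical): feed consecutive grayscale frames from the device camera
    # and trigger the returned action in the user interface.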
57

Who Spoke What And Where? A Latent Variable Framework For Acoustic Scene Analysis

Sundar, Harshavardhan 26 March 2016 (has links) (PDF)
Speech is by far the most natural form of communication between human beings. It is intuitive, expressive and contains information at several cognitive levels. We as humans are perceptive to several of these cognitive levels of information: we can gather information pertaining to the identity of the speaker, the speaker's gender, emotion, location, language, and so on, in addition to the content of what is being spoken. This makes speech-based human machine interaction (HMI) both desirable and challenging for the same set of reasons. For HMI to be natural for humans, it is imperative that a machine understands the information present in speech, at least at the level of speaker identity, language, location in space, and the summary of what is being spoken. Although one can draw parallels between human-human interaction and HMI, the two differ in their purpose. We, as humans, interact with a machine mostly in the context of getting a task done more efficiently than is possible without the machine. Thus, typically in HMI, controlling the machine in a specific manner is the primary goal. In this context, it can be argued that HMI with a limited vocabulary containing specific commands would suffice for a more efficient use of the machine. In this thesis, we address the problem of "Who spoke what and where?", in the context of a machine understanding the information pertaining to the identities of the speakers, their locations in space and the keywords they spoke, thus considering three levels of information: speaker identity (who), location (where) and keywords (what). This could be addressed with the help of multiple sensors such as microphones, video cameras, proximity sensors and motion detectors, combining all these modalities. However, we explore the use of only microphones to address this issue. In practical scenarios, there are often times when multiple people are talking at the same time. Thus, the goal of this thesis is to detect all the speakers, their keywords, and their locations in mixture signals containing speech from simultaneous speakers. Addressing this problem of "Who spoke what and where?" using only microphone signals forms a part of acoustic scene analysis (ASA) of speech-based acoustic events. We divide the problem into two sub-problems: "Who spoke what?" and "Who spoke where?". Each of these problems is cast in a generic latent variable (LV) framework to capture information in speech at different levels. We associate an LV with each of these levels and model the relationship between the levels using conditional dependencies. The sub-problem of "who spoke what" is addressed using a single-channel microphone signal, by modeling the mixture signal in terms of the LV mass function of speaker identity, the conditional mass function of the keyword spoken given the speaker identity, and a speaker-specific keyword model. The LV mass functions are estimated in a maximum likelihood (ML) framework using the expectation-maximization (EM) algorithm, with Student's-t mixture models (tMMs) as the speaker-specific keyword models. Motivated by HMI in a home environment, we created our own database. On mixture signals containing two speakers uttering keywords simultaneously, the proposed framework achieves an accuracy of 82% for detecting both the speakers and their respective keywords. The other sub-problem, "who spoke where?", is addressed in two stages. In the first stage, the enclosure is discretized into sectors.
The speakers and the sectors in which they are located are detected in an approach similar to the one employed for "who spoke what", using signals collected from a uniform circular array (UCA). However, in place of speaker-specific keyword models, we use tMM-based speaker models trained on clean speech, along with a simple delay-and-sum beamformer (DSB). In the second stage, the speakers are localized within the active sectors using a novel region-constrained localization technique based on time difference of arrival (TDOA). Since the problem being addressed is a multi-label classification task, we use the average Hamming score (accuracy) as the performance metric. Although the proposed approach yields an accuracy of 100% in an anechoic setting for detecting both the speakers and their corresponding sectors in two-speaker mixture signals, the performance degrades to an accuracy of 67% in a reverberant setting with a reverberation time (RT60) of 300 ms. To improve the performance under reverberation, prior knowledge of the location of multiple sources is derived using a novel technique based on geometrical insights into TDOA estimation. With this prior knowledge, the accuracy of the proposed approach improves to 91%. It is worthwhile to note that the accuracies are computed for mixture signals containing more than 90% overlap of competing speakers. The proposed LV framework offers a convenient methodology to represent information at broad levels. In this thesis, we have shown its use with three different levels. It can be extended to several such levels to be applicable to a generic analysis of an acoustic scene consisting of broad levels of events. It may turn out that not all levels depend on each other, and hence the LV dependencies can be reduced by independence assumptions, which leads to solving several smaller sub-problems, as shown above. The LV framework is also attractive for incorporating prior knowledge about the acoustic setting, which is combined with the evidence from the data to derive information about the presence of an acoustic event. The performance of the framework depends on the choice of stochastic models, which model the likelihood function of the data given the presence of acoustic events. However, it provides a means to compare and contrast different stochastic models for representing the likelihood function.
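As a concrete illustration of the TDOA estimation that the localization stage builds on, the sketch below implements the classical GCC-PHAT estimator for a single microphone pair in NumPy. This is a generic textbook formulation, assumed here for illustration only; the thesis's region-constrained technique, UCA geometry and tMM speaker models are not reproduced.

    import numpy as np

    def gcc_phat_tdoa(sig, ref, fs, max_tau=None, interp=16):
        # Time difference of arrival between two microphone signals via
        # generalized cross-correlation with phase transform (GCC-PHAT).
        n = sig.shape[0] + ref.shape[0]
        SIG = np.fft.rfft(sig, n=n)
        REF = np.fft.rfft(ref, n=n)
        R = SIG * np.conj(REF)
        R /= np.abs(R) + 1e-12              # PHAT weighting: keep phase, discard magnitude
        cc = np.fft.irfft(R, n=interp * n)  # upsampled cross-correlation
        max_shift = interp * n // 2
        if max_tau is not None:
            max_shift = min(int(interp * fs * max_tau), max_shift)
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        shift = np.argmax(np.abs(cc)) - max_shift
        return shift / float(interp * fs)   # TDOA in seconds

    # Example: a source delayed by 10 samples at 16 kHz gives roughly 0.625 ms.
    fs = 16000
    x = np.random.default_rng(0).standard_normal(fs)
    y = np.roll(x, 10)
    print(gcc_phat_tdoa(y, x, fs, max_tau=0.002))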
58

Detekce elipsy v obraze / Ellipse Detection

Hříbek, Petr January 2008 (has links)
The thesis introduces methods used for ellipse detection. Each method is theoretically described in its respective subsection, covering the Hough transform, the randomized Hough transform, RANSAC, genetic algorithms, and improvements through optimization. The thesis further describes modifications of these existing procedures that achieve better results. The penultimate chapter presents tests of the speed, quality and accuracy of the implemented algorithms. A summary of the tests and a discussion of the results conclude the thesis.
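As an illustration of one of the surveyed approaches, the sketch below shows a bare-bones RANSAC ellipse fit in NumPy: five edge points define a general conic, the conic is kept only if its discriminant indicates an ellipse, and inliers are counted with a cheap algebraic distance. This is a generic formulation assumed for illustration; the thesis's specific modifications and optimizations are not reproduced.

    import numpy as np

    def fit_conic(pts):
        # Conic a*x^2 + b*x*y + c*y^2 + d*x + e*y + f = 0 through five points,
        # taken as the null-space vector of the 5x6 design matrix.
        x, y = pts[:, 0], pts[:, 1]
        A = np.column_stack([x * x, x * y, y * y, x, y, np.ones_like(x)])
        return np.linalg.svd(A)[2][-1]

    def is_ellipse(c):
        # The conic is an ellipse when the discriminant b^2 - 4ac is negative.
        return c[1] * c[1] - 4.0 * c[0] * c[2] < 0

    def ransac_ellipse(points, iters=500, tol=1e-2, seed=0):
        # points: (N, 2) array of candidate edge points, N >= 5.
        # tol must be chosen relative to the coordinate scale of the points.
        rng = np.random.default_rng(seed)
        x, y = points[:, 0], points[:, 1]
        D = np.column_stack([x * x, x * y, y * y, x, y, np.ones_like(x)])
        best, best_inliers = None, 0
        for _ in range(iters):
            c = fit_conic(points[rng.choice(len(points), 5, replace=False)])
            if not is_ellipse(c):
                continue
            # Normalized algebraic distance as a cheap inlier test
            # (a geometric distance would be more accurate but costlier).
            resid = np.abs(D @ c) / (np.linalg.norm(c[:5]) + 1e-12)
            inliers = int(np.count_nonzero(resid < tol))
            if inliers > best_inliers:
                best, best_inliers = c, inliers
        return best, best_inliers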
59

Computational auditory scene analysis and robust automatic speech recognition

Narayanan, Arun 14 November 2014 (has links)
No description available.
60

Ψηφιακή επεξεργασία και αυτόματη κατηγοριοποίηση περιβαλλοντικών ήχων / Digital Processing and Automatic Classification of Environmental Sounds

Νταλαμπίρας, Σταύρος 20 September 2010 (has links)
The dissertation is outlined as follows. In chapter 1 we present a general overview of the task of automatic recognition of sound events. Additionally, we discuss the applications of generalized audio signal recognition technology and give a brief description of the state of the art. Finally, we summarize the contributions of the thesis. In chapter 2 we introduce the reader to the area of non-speech audio processing. We present the current trends in feature extraction methodologies as well as pattern recognition techniques. In chapter 3 we analyze a novel sound recognition system especially designed for the domain of urban environmental sound events, and describe the design of the corresponding database. A hierarchical probabilistic structure was constructed, along with a combined set of sound parameters, leading to high recognition accuracy. Chapter 4 is divided into two parts: (a) we explore the use of multiresolution analysis for the speech/music discrimination problem, and (b) the acquired knowledge is used to build a system which combines features from different domains for efficient analysis of online radio signals. In chapter 5 we experiment extensively with a new application of sound recognition technology: space monitoring based on the acoustic modality. We propose a system which detects atypical situations in a metro station environment, to assist the authorized personnel in the space monitoring task. In chapter 6 we propose an adaptive framework for acoustic surveillance of potentially hazardous situations in environments with different acoustic properties. We show that the system achieves high performance and can adapt to heterogeneous environments in an unsupervised way. In chapter 7 we investigate the use of novelty detection for acoustic monitoring of indoor and outdoor spaces. A database of real-world data was recorded, and three probabilistic techniques are proposed. In chapter 8 we present a novel methodology for generalized sound recognition that leads to high recognition accuracy. The merits of temporal feature integration and multi-domain descriptors are exploited in combination with a state-of-the-art generative classification technique.
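The generic pipeline described above, frame-level features integrated over time and scored against per-class generative models, can be sketched as follows. This is an assumed illustrative setup (librosa MFCCs, mean and standard deviation integration, one Gaussian mixture model per class), not the descriptors, integration scheme or classifier actually used in the thesis.

    import numpy as np
    import librosa
    from sklearn.mixture import GaussianMixture

    def integrated_features(signal, sr, win=50):
        # Frame-level MFCCs integrated over non-overlapping texture windows of
        # 'win' frames (mean + std). Assumes clips long enough for >= 1 window.
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)   # shape (13, n_frames)
        segs = [mfcc[:, i:i + win] for i in range(0, mfcc.shape[1] - win + 1, win)]
        return np.array([np.concatenate([s.mean(axis=1), s.std(axis=1)]) for s in segs])

    def train_models(train_data, n_components=8):
        # One generative model per sound class; GMMs stand in for the thesis's classifier.
        # train_data: {label: [(signal, sr), ...]} with enough clips per class.
        models = {}
        for label, clips in train_data.items():
            feats = np.vstack([integrated_features(x, sr) for x, sr in clips])
            models[label] = GaussianMixture(n_components=n_components,
                                            covariance_type="diag").fit(feats)
        return models

    def classify(signal, sr, models):
        # Assign the class whose model gives the highest average log-likelihood.
        feats = integrated_features(signal, sr)
        return max(models, key=lambda label: models[label].score(feats))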
