241.
Discovering and characterizing emerging and re-emerging viruses in communities encroaching tropical hotspots. Nicholas Di Paola, 21 March 2018.
Fragmentation and encroachment of tropical rainforests and the growing concentration of human settlements have exponentially increased the chances of exposure to emerging and re-emerging viruses. Given the large potential for pathogens to spill over and spread in a population, identifying and characterizing circulating human pathogens could improve the readiness and diagnostic capabilities for a future emergence.
Molecular and metagenomic approaches using next-generation sequencing technologies have led to the discovery and characterization of many emerging viruses over the last decade. As a complement, in silico approaches can also help identify emerging viruses using only publicly available sequencing data. Moreover, estimating phylogenetic ancestry and even analyzing changes in codon usage are additional tools that can improve our understanding of an emerging or re-emerging virus. This project aimed to apply these tools to two viruses that could be circulating in Brazil: Parvovirus B19 and Yellow Fever virus. We also explored the applications of hidden Markov models and the codon adaptation index using publicly available data. We expect this work to provide a proof of concept for future metagenomic projects and to demonstrate the utility of several molecular and bioinformatics techniques in the study of emerging viruses.
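The codon adaptation index mentioned above has a compact standard form: the geometric mean of the relative adaptiveness values w of the codons in a coding sequence. A minimal sketch (the w table below is invented, purely for illustration):

```python
import math

def cai(codons, w):
    """Codon Adaptation Index: geometric mean of the relative
    adaptiveness values w of the codons in a coding sequence."""
    logs = [math.log(w[c]) for c in codons]
    return math.exp(sum(logs) / len(logs))

# Hypothetical adaptiveness values for two lysine codons
w = {"AAA": 1.0, "AAG": 0.25}
print(cai(["AAA", "AAG"], w))  # geometric mean of 1.0 and 0.25 -> 0.5
```

In practice the w values are derived from codon frequencies in a reference set of highly expressed genes; the geometric mean makes the index robust to sequence length.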
242.
Workflow and Activity Modeling for Monitoring Surgical Procedures. Nicolas Padoy, 14 April 2010.
The department of surgery is the core unit of the patient care system within a hospital. Due to continuous technical and medical developments, such departments are equipped with increasingly high-tech surgery rooms.
This benefits patient treatment, but it also increases the complexity of the procedures' workflow and brings numerous electronic systems providing rich and varied information about the surgical processes. The focus of this work is the development of statistical methods for modeling and monitoring surgical processes, based on signals available in the surgery room. We introduce and formalize the problem of recognizing phases within a workflow, using a representation of interventions as multidimensional time series formed by synchronized signals acquired over time. We then propose methods for the modeling, offline segmentation and online recognition of surgical phases. The main method, a variant of hidden Markov models augmented by phase probability variables, is demonstrated on two medical applications. The first is the monitoring of endoscopic interventions, using cholecystectomy as the illustrative surgery. Phases are recognized using signals indicating tool usage, recorded from real procedures. The second application is the monitoring of a generic surgery-room workflow. In this case, phase recognition uses 4D information from surgeries performed in a mock-up operating room in the presence of a multi-view reconstruction system.
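The phase-recognition idea can be illustrated with a plain Viterbi decoder for a discrete-observation HMM. The thesis's actual model (an HMM variant augmented with phase probability variables) is richer; this is only a generic sketch, and the two "phases" and instrument-usage signals below are made up:

```python
import math

def viterbi(obs, init, trans, emit):
    """Most likely hidden phase sequence for a discrete-observation HMM,
    computed in the log domain (all probabilities assumed nonzero)."""
    n = len(init)
    delta = [math.log(init[s]) + math.log(emit[s][obs[0]]) for s in range(n)]
    backptrs = []
    for o in obs[1:]:
        new_delta, ptrs = [], []
        for j in range(n):
            best = max(range(n), key=lambda i: delta[i] + math.log(trans[i][j]))
            ptrs.append(best)
            new_delta.append(delta[best] + math.log(trans[best][j]) + math.log(emit[j][o]))
        delta = new_delta
        backptrs.append(ptrs)
    # backtrack from the best final state
    path = [max(range(n), key=lambda s: delta[s])]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    return path[::-1]

# Two made-up phases; observation 0 ~ "instrument A in use", 1 ~ "instrument B in use"
phases = viterbi([0, 0, 1, 1],
                 init=[0.6, 0.4],
                 trans=[[0.7, 0.3], [0.2, 0.8]],
                 emit=[[0.8, 0.2], [0.1, 0.9]])
print(phases)  # [0, 0, 1, 1]
```

Offline segmentation corresponds to running such a decoder over a complete recording; online recognition uses the filtered (forward) distribution instead of the full backtrack.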
243.
Modeling and Recognizing Network Scanning Activities with Finite Mixture Models and Hidden Markov Models. Giulia De Santis, 20 December 2018.
The work accomplished in this PhD consisted in building stochastic models of ZMap and Shodan, two Internet-wide scanners. Packets originated by each of the two scanners were collected by the High Security Lab (LHS) hosted at Inria Nancy Grand Est, and were used to learn hidden Markov models (HMMs). The first part of the work consisted in modeling the intensity of the two scanners. We investigated whether the intensity of ZMap varies with the targeted service, and whether the intensities of the two scanners are comparable. Results showed that the answer to the first question is positive (i.e., the intensity of ZMap varies with the targeted ports), whereas the answer to the second is negative. In other words, we obtained a model for each set of logs. The following part of the work investigated two further features of the same scanners: their spatial and temporal movements. We created datasets each containing the logs of a single execution of ZMap or Shodan. We then computed the differences between IP addresses consecutively targeted by the same scanner (i.e., within each sample), and between the corresponding timestamps. The former were used to model spatial movements, the latter temporal ones. Once the hidden Markov models were available, they were applied to identify the scanners behind other sets of logs. In both cases, our models cannot detect the targeted service, but they correctly detect the scanner that originated new logs, with an accuracy of 95% when exploiting spatial movements and 98% when using temporal movements.
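The spatial and temporal "movements" described above are simply first differences of the targeted IP addresses (viewed as integers) and of the packet timestamps. A minimal sketch of that feature extraction (the addresses and times below are invented):

```python
import ipaddress

def movements(targets, timestamps):
    """Spatial movements: differences between consecutively targeted IPv4
    addresses, viewed as 32-bit integers. Temporal movements: timestamp deltas."""
    ips = [int(ipaddress.IPv4Address(a)) for a in targets]
    spatial = [b - a for a, b in zip(ips, ips[1:])]
    temporal = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return spatial, temporal

spatial, temporal = movements(["10.0.0.1", "10.0.0.5", "10.0.1.1"],
                              [0.00, 0.02, 0.05])
print(spatial)  # [4, 252]
```

These delta sequences are then the observation streams on which the HMMs are trained and against which new logs are scored.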
244.
Sequential Monte Carlo methods and likelihoods for inference of context-dependent evolutionary models. Alexis Huet, 27 June 2014.
This thesis is devoted to the inference of context-dependent evolutionary models of DNA sequences, and is specifically focused on the RN95+YpR class of stochastic models. This class of models is based on the reinforcement of some substitution rates depending on the local context, which introduces dependence between sites in the evolution of the DNA sequence. Because of these dependencies, the direct computation of the likelihood of the observed sequences involves high-dimensional matrices and is usually infeasible.
Through encodings specific to the RN95+YpR class, we highlight new spatial dependence structures for these models, related to the evolution of DNA sequences throughout their evolutionary history. This enables the use of particle filter algorithms, developed in the context of hidden Markov models, to obtain consistent approximations of the likelihood. Another type of likelihood approximation, based on composite likelihoods, is also introduced. These approximation methods are implemented in a C++ program. They are applied to simulated data, to investigate some of their properties empirically, and to genomic data, notably for the comparison of evolutionary models.
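The particle methods referred to above can be illustrated, in miniature, by a bootstrap particle filter estimating the likelihood of a discrete hidden Markov model; the RN95+YpR machinery itself is far more involved, so this is only a generic sketch with made-up parameters:

```python
import math
import random

def particle_loglik(obs, n_states, trans, emit, init, n_particles=500, seed=0):
    """Bootstrap particle filter estimate of log p(obs) for a discrete HMM.

    trans[i][j] : transition probability from state i to state j
    emit[i][o]  : probability of emitting symbol o from state i
    init[i]     : initial state distribution (all probabilities assumed > 0)
    """
    rng = random.Random(seed)
    states = list(range(n_states))
    particles = rng.choices(states, weights=init, k=n_particles)
    loglik = 0.0
    for o in obs:
        # weight each particle by how well it explains the observation
        weights = [emit[s][o] for s in particles]
        loglik += math.log(sum(weights) / n_particles)
        # resample proportionally to the weights, then propagate
        particles = rng.choices(particles, weights=weights, k=n_particles)
        particles = [rng.choices(states, weights=trans[s])[0] for s in particles]
    return loglik
```

Averaging the weights at each step and summing the logs gives a consistent estimate of the log-likelihood as the number of particles grows, which is the property the thesis exploits for the RN95+YpR likelihood.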
245.
A behavioral ecology of fishermen: hidden stories from trajectory data in the Northern Humboldt Current System. Rocío Joo Arakawa, 19 December 2013.
This work proposes an original contribution to the understanding of fishermen's spatial behavior, based on the behavioral ecology and movement ecology paradigms. Through the analysis of Vessel Monitoring System (VMS) data, we characterized the spatial behavior of Peruvian anchovy fishermen at different scales: (1) the behavioral modes within fishing trips (i.e., searching, fishing and cruising); (2) the behavioral patterns among fishing trips; (3) the behavioral patterns by fishing season, conditioned by ecosystem scenarios; and (4) the computation of maps of an anchovy-presence proxy from the spatial patterns of behavioral mode positions. At the first scale, we compared several Markovian models (hidden Markov and hidden semi-Markov models) and discriminative models (random forests, support vector machines and artificial neural networks) for inferring the behavioral modes associated with VMS tracks.
The models were trained in a supervised setting and validated on tracks for which the behavioral modes were known (from on-board observer records). Hidden semi-Markov models performed best and were retained for inferring the behavioral modes over the entire VMS dataset. At the second scale, each fishing trip was characterized by several features, including the time spent in each behavioral mode. Using a clustering analysis, fishing-trip patterns were classified into groups associated with management zones, fleet segments and skippers' personalities. At the third scale, we analyzed how ecological conditions shape fishermen's behavior. By means of co-inertia analyses, we found significant associations between fishermen, anchovy and environmental spatial dynamics, and characterized fishermen's behavioral responses under contrasting environmental scenarios. At the fourth scale, we investigated whether the spatial behavior of fishermen reflects, to some extent, the spatial distribution of anchovy. Finally, this work offers a wider view of fishermen's behavior: fishermen are not only economic agents, they are also foragers, constrained by ecosystem variability. To conclude, we discuss how these findings may matter for fisheries management, collective behavior analyses and end-to-end models.
246.
Spatio-temporal dynamics in land use and habitat fragmentation in Sandveld, South Africa. James Takawira Magidi, January 2010.
This research assessed land-use changes and trends in vegetation cover in the Sandveld, using remote sensing images. Landsat TM satellite images from 1990, 2004 and 2007 were classified with the maximum likelihood classifier into seven land-use classes: water, agriculture, fire patches, natural vegetation, wetlands, disturbed veld, and open sands. Change detection with remote sensing algorithms and landscape metrics was performed on these multi-temporal land-use maps using the Land Change Modeller and Patch Analyst, respectively. Markov stochastic modelling was used to predict future land-use change scenarios based on the classified images and their transition probabilities. MODIS NDVI multi-temporal datasets with a 16-day temporal resolution were used to assess seasonal and annual trends in vegetation cover using time series analysis (PCA and time profiling). Results indicated that natural vegetation decreased from 46% to 31% of the total landscape between 1990 and 2007, and these biodiversity losses were attributed to an increasing agricultural footprint. The predicted future scenario based on transition probabilities revealed a continued loss of natural habitat and growth of the agricultural footprint. Time series results (principal components and temporal profiles) suggested that the landscape has a high degree of overall dynamic change, with pronounced inter- and intra-annual variation, and an overall increase in greenness associated with growing agricultural activity. The study concluded that without future conservation interventions natural habitats would continue to disappear, a condition that will weigh heavily on biodiversity and on significant water-dependent ecosystems such as wetlands. This has significant implications for the long-term provision of water from groundwater reserves and for the overall sustainability of current agricultural practices.
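The Markov projection behind the future scenario works on the vector of class proportions and the transition probabilities estimated from the classified images. A sketch with an invented three-class transition matrix (the real probabilities come from the change-detection step):

```python
def project(state, P, steps):
    """Project land-use class proportions forward with a first-order
    Markov chain: state[j] at t+1 = sum_i state[i] * P[i][j]."""
    for _ in range(steps):
        state = [sum(state[i] * P[i][j] for i in range(len(state)))
                 for j in range(len(state))]
    return state

# classes: natural vegetation, agriculture, other (hypothetical probabilities)
P = [[0.90, 0.08, 0.02],
     [0.01, 0.97, 0.02],
     [0.05, 0.05, 0.90]]
state_1990 = [0.46, 0.30, 0.24]   # natural vegetation at 46% of the landscape
state_future = project(state_1990, P, 3)
```

With a matrix of this shape, the natural-vegetation share shrinks and the agricultural share grows at every step, which is the qualitative trend the study predicts.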
247.
Cascaded All-Optical Shared-Memory Architecture Packet Switches Using Channel Grouping Under Bursty Traffic. Michael David Shell, 01 December 2004.
This work develops an exact logical operation model to predict the performance of the all-optical shared-memory architecture (OSMA) class of packet switches and provides a means to obtain a reasonable approximation of OSMA switch performance within certain types of networks, including the Banyan family.
All-optical packet switches have the potential to far exceed the bandwidth capability of their current electronic counterparts. However, all-optical switching technology is not yet mature. Consequently, all-optical switch fabrics and buffers are more constrained in size and can cost several orders of magnitude more than those of electronic switches. The use of shared-memory buffers and/or links with multiple parallel channels (channel grouping) has been suggested as a way to maximize switch performance with buffers of limited size. However, analysis of shared-memory switches is far more difficult than for other commonly used buffering strategies. Obtaining packet loss performance by simulation is often not a viable alternative to modeling when low loss rates or large networks are involved. Published models of electronic shared-memory packet switches (ESMP) have primarily been approximate models, to allow analysis of switches with a large number of ports and/or buffer cells. Because most ESMP models become inaccurate for small switches, and because OSMA switches, unlike ESMP switches, do not buffer packets unless contention occurs, existing ESMP models cannot be applied to OSMA switches. Previous models of OSMA switches were confined to isolated (non-networked), symmetric OSMA switches using channel grouping under random traffic. This work is far more general in that it also encompasses OSMA switches that (1) are subjected to bursty traffic and/or have input links with arbitrary occupancy probability distributions, (2) are interconnected to form a network, and (3) are asymmetric.
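As a point of comparison for why simulation is impractical at low loss rates, here is a deliberately naive Monte Carlo sketch of a shared-memory output-queued switch (all parameters invented, and much simpler than the OSMA model): estimating a loss rate of, say, 1e-9 this way would require observing on the order of billions of packets.

```python
import random

def simulate_loss(n_ports=4, buffer_size=8, load=0.9, slots=20000, seed=1):
    """Naive slot-by-slot Monte Carlo of a shared-memory output-queued switch.

    Each slot: Bernoulli(load) arrival on each input, destination chosen
    uniformly; all output queues share one buffer pool of buffer_size cells;
    each output transmits at most one packet per slot. Returns the loss ratio."""
    rng = random.Random(seed)
    queues = [0] * n_ports            # per-output queue lengths
    arrived = dropped = 0
    for _ in range(slots):
        for _ in range(n_ports):      # arrivals this slot
            if rng.random() < load:
                arrived += 1
                if sum(queues) < buffer_size:
                    queues[rng.randrange(n_ports)] += 1
                else:
                    dropped += 1      # shared pool full: packet lost
        queues = [max(q - 1, 0) for q in queues]  # one departure per output
    return dropped / arrived

print(simulate_loss())  # loss ratio in [0, 1]; exact value depends on the seed
```

An exact analytical model, like the one developed in this work, evaluates such loss probabilities directly instead of waiting for rare drop events to occur.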
248.
Discovering Frequent Episodes: Fast Algorithms, Connections with HMMs and Generalizations. Srivatsan Laxman, 03 1900.
Temporal data mining is concerned with the exploration of large sequential (or temporally ordered) data sets to discover some nontrivial information that was previously unknown to the data owner. Sequential data sets come up naturally in a wide range of application domains, ranging from bioinformatics to manufacturing processes. Pattern discovery refers to a broad class of data mining techniques in which the objective is to unearth hidden patterns or unexpected trends in the data. In general, pattern discovery is about finding all patterns of 'interest' in the data and one popular measure of interestingness for a pattern is its frequency in the data. The problem of frequent pattern discovery is to find all patterns in the data whose frequency exceeds some user-defined threshold. Discovery of temporal patterns that occur frequently in sequential data has received a lot of attention in recent times. Different approaches consider different classes of temporal patterns and propose different algorithms for their efficient discovery from the data. This thesis is concerned with a specific class of temporal patterns called episodes and their discovery in large sequential data sets.
In the framework of frequent episode discovery, data (referred to as an event sequence or an event stream) is available as a single long sequence of events. The ith event in the sequence is an ordered pair, (Ei, ti), where Ei takes values from a finite alphabet (of event types), and ti is the time of occurrence of the event. The events in the sequence are ordered according to these times of occurrence. An episode (the temporal pattern considered in this framework) is a (typically) short partially ordered sequence of event types. Formally, an episode is a triple, (V, <, g), where V is a collection of nodes, < is a partial order on V and g is a map that assigns an event type to each node of the episode. When < is total, the episode is referred to as a serial episode, and when < is trivial (or empty), the episode is referred to as a parallel episode. An episode is said to occur in an event sequence if there are events in the sequence, with event types the same as those constituting the episode, and with times of occurrence respecting the partial order in the episode. The frequency of an episode is some measure of how often it occurs in the event sequence. Given a frequency definition for episodes, the task is to discover all episodes whose frequencies exceed some threshold. This is done using a level-wise procedure. In each level, a candidate generation step combines frequent episodes from the previous level to build candidates of the next larger size, and then a frequency counting step makes one pass over the event stream to determine the frequencies of all the candidates and thus identify the frequent episodes.
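The two occurrence notions can be made concrete for an event sequence of (event-type, time) pairs: a serial episode must occur as a subsequence in time order, while a parallel episode only needs its event types present in sufficient multiplicity. A small illustrative sketch:

```python
from collections import Counter

def occurs_serial(events, episode):
    """Serial episode: event types must appear in the given order."""
    it = iter(etype for etype, _ in events)
    return all(etype in it for etype in episode)  # consuming-iterator subsequence test

def occurs_parallel(events, episode):
    """Parallel episode: event types must appear, in any order."""
    have = Counter(etype for etype, _ in events)
    need = Counter(episode)
    return all(have[t] >= n for t, n in need.items())

stream = [("B", 1), ("A", 2), ("C", 3), ("B", 4)]
print(occurs_serial(stream, ("A", "B")))    # True: A at t=2 precedes B at t=4
print(occurs_serial(stream, ("C", "A")))    # False: no A after the C
print(occurs_parallel(stream, ("C", "A")))  # True: order does not matter
```

Serial episodes with a general (neither total nor trivial) partial order would need a more elaborate matching routine; the two cases above are the ones the counting algorithms in this thesis target.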
Frequency counting is the main computationally intensive step in frequent episode discovery, and the choice of frequency definition for episodes has a direct bearing on the efficiency of the counting procedure. In the original framework of frequent episode discovery, episode frequency is defined as the number of fixed-width sliding windows over the data in which the episode occurs at least once. Under this frequency definition, frequency counting of a set of |C| candidate serial episodes of size N has space complexity O(N|C|) and time complexity O(ΔT N|C|) (where ΔT is the difference between the times of occurrence of the last and the first event in the data stream). The other main frequency definition available in the literature defines episode frequency as the number of minimal occurrences of the episode (where a minimal occurrence is a window on the time axis containing an occurrence of the episode, such that no proper sub-window of it contains another occurrence of the episode). The algorithm for obtaining frequencies for a set of |C| episodes needs O(n|C|) time (where n denotes the number of events in the data stream). While this is time-wise better than the windows-based algorithm, the space needed to locate minimal occurrences of an episode can be very high (it is, in fact, of the order of the length, n, of the event stream).
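A direct transcription of the windows-based definition makes its ΔT dependence visible: the naive count below walks every window position on the time axis, hence the dependence on the time span rather than the number of events (the real algorithms use per-episode automata, but the count is the same):

```python
def subsequence(seq, episode):
    """True if the episode's event types appear in order in seq."""
    it = iter(seq)
    return all(e in it for e in episode)

def windows_frequency(events, episode, width):
    """Count width-w sliding windows on the time axis containing at least
    one occurrence of the serial episode.

    Windows slide one time tick at a time; by convention the first window
    contains only the first event and the last window only the last event."""
    times = [t for _, t in events]
    count = 0
    for start in range(times[0] - width + 1, times[-1] + 1):
        window_types = [e for e, t in events if start <= t < start + width]
        if subsequence(window_types, episode):
            count += 1
    return count

stream = [("A", 1), ("B", 2), ("A", 5), ("B", 6)]
print(windows_frequency(stream, ("A", "B"), 2))  # -> 2: windows starting at t=1 and t=5
```

The loop over window start positions is exactly the O(ΔT) factor in the complexity quoted above.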
This thesis proposes a new definition for episode frequency, based on the notion of what are called non-overlapped occurrences of episodes in the event stream. Two occurrences are said to be non-overlapped if no event corresponding to one occurrence appears in between events corresponding to the other. The frequency of an episode is defined as the maximum possible number of non-overlapped occurrences of the episode in the data. The thesis also presents algorithms for efficient frequent episode discovery under this frequency definition. The space and time complexities for frequency counting of serial episodes are O(|C|) and O(n|C|) respectively (where n denotes the total number of events in the given event sequence and |C| denotes the number of candidate episodes). These are arguably the best possible space and time complexities achievable for the frequency counting step. Also, the fact that the time needed by the non-overlapped occurrences-based algorithm is linear in the number of events, n, in the event sequence (rather than in the difference, ΔT, between the occurrence times of the first and last events in the data stream, as is the case with the windows-based algorithm) can yield a considerable time advantage when the number of time ticks far exceeds the number of events in the event stream. The thesis also presents efficient algorithms for frequent episode discovery under expiry time constraints (according to which an occurrence of an episode can be counted towards its frequency only if the total time span of the occurrence is less than a user-defined threshold). Simulation experiments show that, in terms of actual run-times, frequent episode discovery under the non-overlapped occurrences-based frequency (using the algorithms developed here) is much faster than existing methods.
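Under this definition, the count for a single serial episode can be obtained in one left-to-right pass: track a single partial occurrence and restart it each time it completes. A minimal sketch (the thesis develops the full automaton-based algorithms that handle whole candidate sets at once):

```python
def count_non_overlapped(events, episode):
    """Greedy one-pass count of non-overlapped occurrences of a serial
    episode; restarting on each completion yields the maximum possible count."""
    count, pos = 0, 0
    for etype, _ in events:
        if etype == episode[pos]:
            pos += 1
            if pos == len(episode):   # one full occurrence finished
                count += 1
                pos = 0               # restart: next occurrence starts after it
    return count

stream = [("A", 1), ("B", 2), ("C", 3), ("A", 4), ("B", 5), ("C", 6)]
print(count_non_overlapped(stream, ("A", "B", "C")))  # -> 2
```

The per-candidate state is just the counter and the position, matching the O(|C|) space claim, and the single pass over the n events gives the O(n|C|) time.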
A second frequency measure proposed in this thesis is based on what are termed non-interleaved occurrences of episodes in the data. This definition counts certain kinds of overlapping occurrences of the episode. The time needed is linear in the number of events, n, in the data sequence, the size, N, of the episodes, and the number of candidates, |C|. Simulation experiments show that run-time performance under this frequency definition is slightly inferior to that of the non-overlapped occurrences-based frequency, but still better than the run-times under the windows-based frequency. The thesis also establishes the following interesting property connecting the non-overlapped, non-interleaved and minimal occurrences-based frequencies of an episode in the data: the number of minimal occurrences of an episode is bounded below by the maximum number of non-overlapped occurrences of the episode, and bounded above by the maximum number of non-interleaved occurrences of the episode in the data. Hence, the non-interleaved occurrences-based frequency is an efficient alternative to the one based on minimal occurrences.
In addition to being superior in terms of both time and space complexity to all other existing algorithms for frequent episode discovery, the non-overlapped occurrences-based frequency has another very important property. It facilitates a formal connection between discovering frequent serial episodes in data streams and learning or estimating a model for the data generation process in terms of certain kinds of hidden Markov models (HMMs). In order to establish this connection, a special class of HMMs, called Episode Generating HMMs (EGHs), is defined. The symbol set for the HMM is chosen to be the alphabet of event types, so that the output of EGHs can be regarded as event streams in the frequent episode discovery framework.
Given a serial episode, α, that occurs in the event stream, a method is proposed to uniquely associate it with an EGH, Λα. Consider two N-node serial episodes, α and β, whose (non-overlapped occurrences-based) frequencies in the given event stream, o, are fα and fβ respectively, and let Λα and Λβ be the EGHs associated with α and β. The main result connecting episodes and EGHs states that the joint probability of o and the most likely state sequence for Λα is greater than the corresponding probability for Λβ if and only if fα is greater than fβ. This theoretical connection has some interesting consequences. First of all, since the most frequent serial episode is associated with the EGH having the highest data likelihood, frequent episode discovery can now be interpreted as a generative model learning exercise. More importantly, it is now possible to derive a formal test of significance for serial episodes in the data, which prescribes, for a given size of the test, the minimum frequency needed in order to declare an episode statistically significant. Note that this significance test for serial episodes does not require any separate model estimation (or training); the only quantity required to assess the significance of an episode is its non-overlapped occurrences-based frequency, obtained through the usual counting procedure. The significance test also helps to automatically fix the frequency threshold for the frequent episode discovery process, which can lead to what may be termed parameterless data mining.
In the framework considered so far, the input to the frequent episode discovery process is a sequence of instantaneous events. In many applications, however, events persist for different periods of time, and these durations may carry important information from a data mining perspective. This thesis extends the frequent episodes framework to incorporate such duration information directly into the definition of episodes, so that the discovered patterns carry this information as well. Each event in the generalized framework is a triple, (Ei, ti, τi), where Ei, as earlier, is the event type (from some finite alphabet) of the ith event, and ti and τi denote its start and end times. The new temporal pattern, called a generalized episode, is a quadruple, (V, <, g, d), where V, < and g, as earlier, respectively denote a collection of nodes, a partial order over this collection and a map assigning event types to nodes. The new feature is d, a map from V to 2^I, where I denotes a user-defined collection of time-interval possibilities for event durations. An occurrence of a generalized episode in the event sequence consists of events with both 'correct' event types and 'correct' durations, appearing in the sequence in the 'correct' time order. All frequency definitions for episodes over instantaneous event streams apply to generalized episodes as well, and the algorithms for frequent episode discovery also extend easily to this case. The extra design choice available to the user in the generalized framework is the set I of time-interval possibilities, which can be used to focus the discovery process on temporal correlations involving only durations of interest. Extensive simulations demonstrate the utility and effectiveness of the generalized framework.
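Restricted to serial episodes, the generalized occurrence condition can be sketched as a small membership check (illustrative only: the discovery algorithms count occurrences with automata rather than testing one episode at a time, and here interval sets stand in for the map d):

```python
def occurs(events, ep_types, ep_durations):
    """Greedy check for one occurrence of a generalized serial episode.

    `events` is a time-ordered list of (type, start, end) triples; node i
    of the episode requires event type `ep_types[i]` with a duration
    (end - start) falling inside one of the intervals listed in
    `ep_durations[i]` (the map d of the text, for a serial episode).
    """
    pos = 0
    for etype, start, end in events:
        dur = end - start
        if etype == ep_types[pos] and any(lo <= dur <= hi
                                          for lo, hi in ep_durations[pos]):
            pos += 1
            if pos == len(ep_types):
                return True   # all nodes matched in time order
    return False
```

For example, the episode "A lasting 1-3 units followed by B lasting 1-2 units" occurs in a stream containing a short A and, later, a short B, but not if every A in the stream falls outside the allowed duration set.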
The new algorithms for frequent episode discovery presented in this thesis are used to develop an application for temporal data mining of data from car engine manufacturing plants. Engine manufacturing is a heavily automated and complex distributed controlled process, with large amounts of fault data logged each day. The goal of temporal data mining here is to unearth strong time-ordered correlations in the data that can facilitate quick diagnosis of root causes for persistent problems and predict major breakdowns well in advance. This thesis presents an application of the algorithms developed here to such analysis of fault data. The data consist of time-stamped faults logged in car engine manufacturing plants of General Motors. Each fault is logged using an extensive list of codes, which constitutes the alphabet of event types for frequent episode discovery. Frequent episodes in fault logs represent temporal correlations among faults, and these can be used for fault diagnosis in the plant. This thesis describes how the outputs of the frequent episode discovery framework can help plant engineers interpret the large volumes of logged faults in an efficient and convenient manner. Such a system, based on the algorithms developed in this thesis, is currently in use at one of the engine manufacturing plants of General Motors. Some examples of results that the plant engineers regarded as useful are also presented.
|
249 |
Joint Evaluation Of Multiple Speech Patterns For Speech Recognition And Training. Nair, Nishanth Ulhas 01 1900 (has links)
Improving speech recognition performance in the presence of noise and interference continues to be a challenging problem. Automatic Speech Recognition (ASR) systems work well when the test and training conditions match; in real-world environments there is often a mismatch between the two. Various factors like additive noise, acoustic echo and speaker accent affect speech recognition performance. Since ASR is a statistical pattern recognition problem, if the test patterns are unlike anything used to train the models, errors are bound to occur due to feature-vector mismatch. Various approaches to robustness have been proposed in the ASR literature, falling mainly into two categories: (i) reducing the variability in the feature vectors, or (ii) modifying the statistical model parameters to suit the noisy condition. While some of those techniques are quite effective, we would like to examine robustness from a different perspective. Consider the analogy of human communication over telephones: it is quite common to ask the person speaking to us to repeat certain portions of their speech because we did not understand them. This happens more often in the presence of background noise, where the intelligibility of speech is affected significantly. Although the exact way humans decode multiple repetitions of speech is not known, it is quite possible that we combine the knowledge from the multiple utterances to decode the unclear parts of the speech. The majority of ASR algorithms do not address this issue, except in very specific settings such as pronunciation modeling. We recognize that under very high noise conditions, or over bursty error channels such as packet communication where packets get dropped, it would be beneficial to take this approach of repeated utterances for robust ASR.
In this thesis, we formulate a set of algorithms for joint evaluation/decoding to recognize noisy test utterances, and we use the same formulation for selective training of Hidden Markov Models (HMMs), again for robust performance.
We first address the joint recognition of multiple speech patterns given that they belong to the same class, formulating the problem with the patterns treated as isolated words. If there are K test patterns (K ≥ 2) of a word by a speaker, we show that it is possible to improve speech recognition accuracy over independent single-pattern evaluation of the test speech, for both clean and noisy speech. We also find the state sequence which best represents the K patterns. The formulation can be extended to connected word recognition or continuous speech recognition as well.
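A much-simplified sketch of the joint decoding idea follows: a Viterbi pass over K discrete observation sequences that are assumed to be already frame-aligned and of equal length, so the per-frame emission score is the product of the K individual scores and a symbol corrupted in one pattern can be "outvoted" by the clean copies. The thesis's actual formulation also handles patterns of unequal length by warping them jointly; that part is omitted here.

```python
def joint_viterbi(patterns, pi, A, B):
    """Joint Viterbi over K aligned discrete observation sequences.

    patterns: K equal-length lists of symbol indices (an assumption of
    this sketch).  pi, A, B are initial, transition and emission
    LOG-probabilities of a discrete HMM.  Returns the single state
    sequence best representing all K patterns, and its log-score.
    """
    T, S = len(patterns[0]), len(pi)
    # per-frame joint emission: sum of log-probs over the K patterns
    joint_emit = [[sum(B[s][p[t]] for p in patterns) for s in range(S)]
                  for t in range(T)]
    delta = [pi[s] + joint_emit[0][s] for s in range(S)]
    back = []
    for t in range(1, T):
        ptr = [max(range(S), key=lambda r, s=s: delta[r] + A[r][s])
               for s in range(S)]
        back.append(ptr)
        delta = [delta[ptr[s]] + A[ptr[s]][s] + joint_emit[t][s]
                 for s in range(S)]
    state = max(range(S), key=lambda s: delta[s])
    path = [state]
    for ptr in reversed(back):   # backtrack to recover the state path
        state = ptr[state]
        path.append(state)
    return path[::-1], max(delta)
```

With a 2-state left-to-right style model, three repetitions of a word where one copy has a corrupted frame still decode to the clean state sequence, which a single-pattern pass over the corrupted copy alone need not do.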
Next, we consider the benefits of joint multi-pattern likelihood for HMM training. In the usual HMM training, all the training data is used to arrive at the best possible parametric model. However, the training data may not all be genuine: it can contain labeling errors, noise corruptions, or plain outlier exemplars. Such outliers result in poorer models and degrade speech recognition performance, so it is important to train selectively, giving the outliers less weight. Down-weighting an entire outlier pattern has been addressed before in the speech recognition literature. However, it is possible that only some portions of a training pattern are corrupted, so only those corrupted portions, and not the entire pattern, should receive less weight during HMM training. Since HMM training uses multiple patterns of speech from each class, we show that joint evaluation methods can be used to selectively train HMMs such that only the corrupted portions of speech are down-weighted rather than the whole pattern.
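One way to picture frame-level down-weighting is a weighted re-estimation step for discrete emission probabilities. This is a sketch under stated assumptions: `frame_w` is a per-frame reliability weight (how it is obtained, e.g. from joint multi-pattern agreement, is assumed rather than shown), and only the emission update of one pattern is written out, not the full Baum-Welch recursion.

```python
def weighted_emission_update(obs, gamma, frame_w, n_states, n_symbols):
    """One weighted re-estimation step for discrete emission probabilities.

    gamma[t][s] is the usual state-occupancy posterior for frame t;
    frame_w[t] is a hypothetical per-frame reliability weight
    (1.0 = trust the frame, near 0.0 = suspected corruption).  The
    weight multiplies the occupancy counts, so down-weighting acts on
    individual frames rather than on the whole training pattern.
    """
    num = [[0.0] * n_symbols for _ in range(n_states)]
    den = [0.0] * n_states
    for t, o in enumerate(obs):
        for s in range(n_states):
            c = frame_w[t] * gamma[t][s]
            num[s][o] += c
            den[s] += c
    # fall back to a uniform distribution for states with no weighted mass
    return [[num[s][o] / den[s] if den[s] else 1.0 / n_symbols
             for o in range(n_symbols)] for s in range(n_states)]
```

Setting a frame's weight to zero removes its (possibly corrupted) symbol from the estimate while every other frame of the same pattern still contributes normally.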
Thus, we have addressed all three main tasks of an HMM so as to jointly exploit the availability of multiple patterns belonging to the same class. We evaluated the new algorithms on Isolated Word Recognition for both clean and noisy speech. Significant improvement in speech recognition performance is obtained, especially for speech affected by transient/burst noise.
|
250 |
Αυτόματος τεμαχισμός ψηφιακών σημάτων ομιλίας και εφαρμογή στη σύνθεση ομιλίας, αναγνώριση ομιλίας και αναγνώριση γλώσσας / Automatic segmentation of digital speech signals and application to speech synthesis, speech recognition and language recognition. Μπόρας, Ιωσήφ 19 October 2009 (has links)
Η παρούσα διατριβή εισάγει μεθόδους για τον αυτόματο τεμαχισμό σημάτων ομιλίας. Συγκεκριμένα παρουσιάζονται τέσσερις νέες μέθοδοι για τον αυτόματο τεμαχισμό σημάτων ομιλίας, τόσο για γλωσσολογικά περιορισμένα όσο και μη προβλήματα. Η πρώτη μέθοδος κάνει χρήση των σημείων του σήματος που αντιστοιχούν στα ανοίγματα των φωνητικών χορδών κατά την διάρκεια της ομιλίας για να εξάγει όρια ψευδό-φωνημάτων με χρήση του αλγορίθμου δυναμικής παραμόρφωσης χρόνου. Η δεύτερη τεχνική εισάγει μια καινοτόμα υβριδική μέθοδο εκπαίδευσης κρυμμένων μοντέλων Μαρκώφ, η οποία τα καθιστά πιο αποτελεσματικά στον τεμαχισμό της ομιλίας. Η τρίτη μέθοδος χρησιμοποιεί αλγορίθμους μαθηματικής παλινδρόμησης για τον συνδυασμό ανεξαρτήτων μηχανών τεμαχισμού ομιλίας. Η τέταρτη μέθοδος εισάγει μια επέκταση του αλγορίθμου Βιτέρμπι με χρήση πολλαπλών παραμετρικών τεχνικών για τον τεμαχισμό της ομιλίας. Τέλος, οι προτεινόμενες μέθοδοι τεμαχισμού χρησιμοποιούνται για την βελτίωση συστημάτων στο πρόβλημα της σύνθεσης ομιλίας, αναγνώρισης ομιλίας και αναγνώρισης γλώσσας. / The present dissertation introduces methods for the automatic segmentation of speech signals. In particular, four new segmentation methods are presented, covering both linguistically constrained and unconstrained segmentation problems. The first method uses pitchmark points to extract pseudo-phonetic boundaries with the dynamic time warping algorithm. The second technique introduces a new hybrid method for the training of hidden Markov models, which makes them more effective in the speech segmentation task. The third method uses regression algorithms for the fusion of independent segmentation engines. The fourth method extends the Viterbi algorithm with multiple speech parameterization techniques for segmentation. Finally, the proposed methods are used to improve systems on the tasks of speech synthesis, speech recognition and language recognition.
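The first method's core step, projecting known boundaries through a dynamic time warping alignment, can be sketched as follows. This is a toy version under explicit assumptions: features are 1-D numbers with a squared-error local cost (real systems align multi-dimensional feature vectors, and the pitchmark-based anchoring described above is omitted), and the reference's boundary frames are given.

```python
def dtw_path(ref, test):
    """Plain DTW with squared-error local cost; returns the warp path."""
    n, m = len(ref), len(test)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (ref[i - 1] - test[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # backtrack the cheapest predecessor at each cell
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((D[i - 1][j - 1], i - 1, j - 1),
                      (D[i - 1][j], i - 1, j),
                      (D[i][j - 1], i, j - 1))
    return path[::-1]

def map_boundaries(ref, test, ref_boundaries):
    """Project known reference boundary frames onto the test signal."""
    path = dtw_path(ref, test)
    return [min(j for i, j in path if i == b) for b in ref_boundaries]
```

Given a reference whose second segment starts at frame 2 and a slower test rendition, the boundary lands on the first test frame that the warp path pairs with it.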
|