21

Découverte de règles d'association multi-relationnelles à partir de bases de connaissances ontologiques pour l'enrichissement d'ontologies / Discovering multi-relational association rules from ontological knowledge bases to enrich ontologies

Tran, Duc Minh 23 July 2018 (has links)
In the Semantic Web context, OWL ontologies represent explicit domain knowledge based on a conceptualization of the domains of interest, while the corresponding assertional knowledge is given by the RDF data referring to them. In this thesis, building on ideas derived from ILP, we aim at discovering hidden knowledge patterns in the form of multi-relational association rules by exploiting the evidence coming from the assertional data of ontological knowledge bases. Specifically, discovered rules are encoded in SWRL so that they can be easily integrated within the ontology, thus enriching its expressive power and augmenting the assertional knowledge that can be derived from it. Two algorithms applied to populated ontological knowledge bases are proposed for finding rules with high inductive power: (i) a level-wise generate-and-test algorithm and (ii) an evolutionary algorithm. We performed experiments on publicly available ontologies, validating the performance of our approach and comparing it with the main state-of-the-art systems. In addition, we carried out a comparison of popular asymmetric metrics, originally proposed for scoring association rules, as building blocks of a fitness function for the evolutionary algorithm, in order to select the metrics best suited to the data semantics. To improve system performance, we also proposed an algorithm that computes these metrics directly instead of querying via SPARQL-DL.
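To make the level-wise generate-and-test idea concrete, the sketch below scores a candidate multi-relational rule against a toy set of assertions by counting the variable bindings that satisfy both the body and the head (support) and the fraction of body matches that do (confidence). The triple representation, the scoring names and the example rule are illustrative assumptions; the thesis itself works on OWL knowledge bases and encodes rules in SWRL.

```python
# Toy assertional knowledge base: (subject, predicate, object) triples.
# These facts and the rule below are invented for illustration only.
KB = {
    ("alice", "worksFor", "acme"),
    ("bob", "worksFor", "acme"),
    ("alice", "hasBoss", "carol"),
    ("bob", "hasBoss", "carol"),
    ("carol", "manages", "acme"),
}

def bindings(atom, binding, kb):
    """Yield variable bindings extending `binding` that match one atom (variables start with '?')."""
    for s, p, o in kb:
        if p != atom[1]:
            continue
        new, ok = dict(binding), True
        for var, val in ((atom[0], s), (atom[2], o)):
            if var.startswith("?"):
                if new.get(var, val) != val:   # clashes with an earlier binding
                    ok = False
                    break
                new[var] = val
            elif var != val:                   # constant in the atom must match exactly
                ok = False
                break
        if ok:
            yield new

def match(atoms, kb, binding=None):
    """All bindings satisfying a conjunction of atoms."""
    binding = binding or {}
    if not atoms:
        yield binding
        return
    for b in bindings(atoms[0], binding, kb):
        yield from match(atoms[1:], kb, b)

def score(body, head, kb):
    """Support = body bindings that also satisfy the head; confidence = support / body matches."""
    body_matches = list(match(body, kb))
    support = sum(1 for b in body_matches if next(match([head], kb, b), None) is not None)
    return support, (support / len(body_matches) if body_matches else 0.0)

# Candidate rule: worksFor(?x, ?c) AND manages(?y, ?c) -> hasBoss(?x, ?y)
body = [("?x", "worksFor", "?c"), ("?y", "manages", "?c")]
head = ("?x", "hasBoss", "?y")
print(score(body, head, KB))   # -> (2, 1.0)
```

In a level-wise search, candidates whose score clears the threshold would then be extended with one more body atom and re-scored at the next level.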
22

Discovering Frequent Episodes: Fast Algorithms, Connections With HMMs And Generalizations

Laxman, Srivatsan 03 1900 (has links)
Temporal data mining is concerned with the exploration of large sequential (or temporally ordered) data sets to discover some nontrivial information that was previously unknown to the data owner. Sequential data sets come up naturally in a wide range of application domains, ranging from bioinformatics to manufacturing processes. Pattern discovery refers to a broad class of data mining techniques in which the objective is to unearth hidden patterns or unexpected trends in the data. In general, pattern discovery is about finding all patterns of 'interest' in the data and one popular measure of interestingness for a pattern is its frequency in the data. The problem of frequent pattern discovery is to find all patterns in the data whose frequency exceeds some user-defined threshold. Discovery of temporal patterns that occur frequently in sequential data has received a lot of attention in recent times. Different approaches consider different classes of temporal patterns and propose different algorithms for their efficient discovery from the data. This thesis is concerned with a specific class of temporal patterns called episodes and their discovery in large sequential data sets. In the framework of frequent episode discovery, data (referred to as an event sequence or an event stream) is available as a single long sequence of events. The ith event in the sequence is an ordered pair, (E_i, t_i), where E_i takes values from a finite alphabet (of event types), and t_i is the time of occurrence of the event. The events in the sequence are ordered according to these times of occurrence. An episode (which is the temporal pattern considered in this framework) is a (typically) short partially ordered sequence of event types. Formally, an episode is a triple, (V, <, g), where V is a collection of nodes, < is a partial order on V and g is a map that assigns an event type to each node of the episode. When < is total, the episode is referred to as a serial episode, and when < is trivial (or empty), the episode is referred to as a parallel episode. An episode is said to occur in an event sequence if there are events in the sequence, with event types same as those constituting the episode, and with times of occurrence respecting the partial order in the episode. The frequency of an episode is some measure of how often it occurs in the event sequence. Given a frequency definition for episodes, the task is to discover all episodes whose frequencies exceed some threshold. This is done using a level-wise procedure. In each level, a candidate generation step is used to combine frequent episodes from the previous level to build candidates of the next larger size, and then a frequency counting step makes one pass over the event stream to determine frequencies of all the candidates and thus identify the frequent episodes. Frequency counting is the main computationally intensive step in frequent episode discovery. Choice of frequency definition for episodes has a direct bearing on the efficiency of the counting procedure. In the original framework of frequent episode discovery, episode frequency is defined as the number of fixed-width sliding windows over the data in which the episode occurs at least once. Under this frequency definition, frequency counting of a set of |C| candidate serial episodes of size N has space complexity O(N|C|) and time complexity O(ΔT·N|C|) (where ΔT is the difference between the times of occurrence of the last and the first event in the data stream).
The other main frequency definition available in the literature defines episode frequency as the number of minimal occurrences of the episode (where a minimal occurrence is a window on the time axis containing an occurrence of the episode, such that no proper sub-window of it contains another occurrence of the episode). The algorithm for obtaining frequencies for a set of |C| episodes needs O(n|C|) time (where n denotes the number of events in the data stream). While this is time-wise better than the windows-based algorithm, the space needed to locate minimal occurrences of an episode can be very high (and is in fact of the order of the length, n, of the event stream). This thesis proposes a new definition for episode frequency, based on the notion of what are called non-overlapped occurrences of episodes in the event stream. Two occurrences are said to be non-overlapped if no event corresponding to one occurrence appears in between events corresponding to the other. Frequency of an episode is defined as the maximum possible number of non-overlapped occurrences of the episode in the data. The thesis also presents algorithms for efficient frequent episode discovery under this frequency definition. The space and time complexities for frequency counting of serial episodes are O(|C|) and O(n|C|) respectively (where n denotes the total number of events in the given event sequence and |C| denotes the number of candidate episodes). These are arguably the best possible space and time complexities for the frequency counting step that can be achieved. Also, the fact that the time needed by the non-overlapped occurrences-based algorithm is linear in the number of events, n, in the event sequence (rather than the difference, ΔT, between occurrence times of the first and last events in the data stream, as is the case with the windows-based algorithm), can result in considerable time advantage when the number of time ticks far exceeds the number of events in the event stream. The thesis also presents efficient algorithms for frequent episode discovery under expiry time constraints (according to which an occurrence of an episode can be counted for its frequency only if the total time span of the occurrence is less than a user-defined threshold). It is shown through simulation experiments that, in terms of actual run-times, frequent episode discovery under the non-overlapped occurrences-based frequency (using the algorithms developed here) is much faster than existing methods. There is also a second frequency measure proposed in this thesis, based on what are termed non-interleaved occurrences of episodes in the data. This definition counts certain kinds of overlapping occurrences of the episode. The time needed is linear in the number of events, n, in the data sequence, the size, N, of episodes and the number of candidates, |C|. Simulation experiments show that run-time performance under this frequency definition is slightly inferior compared to the non-overlapped occurrences-based frequency, but is still better than the run-times under the windows-based frequency.
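Under the non-overlapped occurrences definition, the frequency of a serial episode can be counted in a single left-to-right pass with constant state per candidate, which is where the O(|C|) space and O(n|C|) time figures quoted above come from. The sketch below is a simplified single-episode version under that reading of the definition; the event-stream format and function name are assumptions for illustration.

```python
def count_non_overlapped(events, episode):
    """Greedy count of non-overlapped occurrences of a serial episode.

    events  : list of (event_type, time) pairs, ordered by time
    episode : tuple of event types, e.g. ("A", "B", "C") for A -> B -> C

    Scanning left to right and completing each occurrence as early as
    possible yields the maximum number of non-overlapped occurrences.
    """
    state = 0          # index of the next episode node waiting to be matched
    frequency = 0
    for event_type, _time in events:
        if event_type == episode[state]:
            state += 1
            if state == len(episode):   # one full occurrence completed
                frequency += 1
                state = 0               # reset; later events start a fresh occurrence
    return frequency

# A -> B occurs twice without overlap in this toy stream.
stream = [("A", 1), ("B", 2), ("C", 3), ("A", 4), ("A", 5), ("B", 6)]
print(count_non_overlapped(stream, ("A", "B")))   # -> 2
```

Running one such automaton per candidate episode during the counting pass gives the per-level frequency counts used by the level-wise procedure.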
This thesis also establishes the following interesting property that connects the non-overlapped, the non-interleaved and the minimal occurrences-based frequencies of an episode in the data: the number of minimal occurrences of an episode is bounded below by the maximum number of non-overlapped occurrences of the episode, and is bounded above by the maximum number of non-interleaved occurrences of the episode in the data. Hence, non-interleaved occurrences-based frequency is an efficient alternative to that based on minimal occurrences. In addition to being superior in terms of both time and space complexities compared to all other existing algorithms for frequent episode discovery, the non-overlapped occurrences-based frequency has another very important property. It facilitates a formal connection between discovering frequent serial episodes in data streams and learning or estimating a model for the data generation process in terms of certain kinds of Hidden Markov Models (HMMs). In order to establish this connection, a special class of HMMs, called Episode Generating HMMs (EGHs), is defined. The symbol set for the HMM is chosen to be the alphabet of event types, so that the output of EGHs can be regarded as event streams in the frequent episode discovery framework. Given a serial episode, α, that occurs in the event stream, a method is proposed to uniquely associate it with an EGH, Λα. Consider two N-node serial episodes, α and β, whose (non-overlapped occurrences-based) frequencies in the given event stream, o, are fα and fβ respectively. Let Λα and Λβ be the EGHs associated with α and β. The main result connecting episodes and EGHs states that the joint probability of o and the most likely state sequence for Λα is more than the corresponding probability for Λβ if and only if fα is greater than fβ. This theoretical connection has some interesting consequences. First of all, since the most frequent serial episode is associated with the EGH having the highest data likelihood, frequent episode discovery can now be interpreted as a generative model learning exercise. More importantly, it is now possible to derive a formal test of significance for serial episodes in the data, which prescribes, for a given size of the test, a minimum frequency for the episode needed in order to declare it statistically significant. Note that this significance test for serial episodes does not require any separate model estimation (or training). The only quantity required to assess the significance of an episode is its non-overlapped occurrences-based frequency (and this is obtained through the usual counting procedure). The significance test also helps to automatically fix the frequency threshold for the frequent episode discovery process, so that it can lead to what may be termed parameterless data mining. In the framework considered so far, the input to the frequent episode discovery process is a sequence of instantaneous events. However, in many applications events tend to persist for different periods of time and the durations may carry important information from a data mining perspective. This thesis extends the framework of frequent episodes to incorporate such duration information directly into the definition of episodes, so that the patterns discovered will now carry this duration information as well.
Each event in this generalized framework looks like a triple, (E_i, t_i, τ_i), where E_i, as earlier, is the event type (from some finite alphabet) corresponding to the ith event, and t_i and τ_i denote the start and end times of this event. The new temporal pattern, called the generalized episode, is a quadruple, (V, <, g, d), where V, < and g, as earlier, respectively denote a collection of nodes, a partial order over this collection and a map assigning event types to nodes. The new feature in the generalized episode is d, which is a map from V to 2^I (the power set of I), where I denotes a collection of time-interval possibilities for event durations, which is defined by the user. An occurrence of a generalized episode in the event sequence consists of events with both 'correct' event types and 'correct' time durations, appearing in the event sequence in 'correct' time order. All frequency definitions for episodes over instantaneous event streams are applicable for generalized episodes as well. The algorithms for frequent episode discovery also easily extend to the case of generalized episodes. The extra design choice that the user has in this generalized framework is the set, I, of time-interval possibilities. This can be used to orient and focus the frequent episode discovery process to come up with temporal correlations involving only time durations that are of interest. Through extensive simulations the utility and effectiveness of the generalized framework are demonstrated. The new algorithms for frequent episode discovery presented in this thesis are used to develop an application for temporal data mining of some data from car engine manufacturing plants. Engine manufacturing is a heavily automated and complex distributed controlled process with large amounts of faults data logged each day. The goal of temporal data mining here is to unearth some strong time-ordered correlations in the data which can facilitate quick diagnosis of root causes for persistent problems and predict major breakdowns well in advance. This thesis presents an application of the algorithms developed here for such analysis of the faults data. The data consists of time-stamped faults logged in car engine manufacturing plants of General Motors. Each fault is logged using an extensive list of codes (which constitutes the alphabet of event types for frequent episode discovery). Frequent episodes in fault logs represent temporal correlations among faults and these can be used for fault diagnosis in the plant. This thesis describes how the outputs from the frequent episode discovery framework can be used to help plant engineers interpret the large volumes of faults logged, in an efficient and convenient manner. Such a system, based on the algorithms developed in this thesis, is currently being used in one of the engine manufacturing plants of General Motors. Some examples of the results obtained that were regarded as useful by the plant engineers are also presented.
23

Discovery and Analysis of Aligned Pattern Clusters from Protein Family Sequences

Lee, En-Shiun Annie 28 April 2015 (has links)
Protein sequences are essential for encoding molecular structures and functions. Consequently, biologists invest substantial resources and time discovering functional patterns in proteins. Using high-throughput technologies, biologists are generating an increasing amount of data. Thus, the major challenge in biosequencing today is the ability to conduct data analysis in an efficient and productive manner. Conserved amino acids in proteins reveal important functional domains within protein families. Conversely, less conserved amino acid variations within these protein sequence patterns reveal areas of evolutionary and functional divergence. Exploring protein families using existing methods such as multiple sequence alignment is computationally expensive, thus pattern search is used. However, at present, combinatorial methods of pattern search generate a large set of solutions, and probabilistic methods require richer representations. They require biological ground truth of the input sequences, such as gene name or taxonomic species, as class labels based on traditional classification practice to train a model for predicting unknown sequences. However, these algorithms are inherently biased by mislabelling and may not be able to reveal class characteristics in a detailed and succinct manner. A novel pattern representation called an Aligned Pattern Cluster (AP Cluster), as developed in this dissertation, is compact yet rich. It captures the conservations and variations of amino acids, covers more sequences with lower entropy, and greatly reduces the number of patterns. AP Clusters contain statistically significant patterns with variations; their importance has been confirmed by the following biological evidence: 1) most of the discovered AP Clusters correspond to binding segments, while their aligned columns correspond to binding sites, as verified by Pfam, PROSITE, and the three-dimensional structure; 2) by compacting strongly correlated functional information together, AP Clusters are able to reveal class characteristics for taxonomical classes, gene classes and other functional classes, or incorrect class labelling; 3) AP Clusters that co-occur on the same homologous protein sequences are spatially close in the protein's three-dimensional structure. These results demonstrate the power and usefulness of AP Clusters. They bring together similar, statistically significant patterns with variations and align them to reveal protein regional functionality, class characteristics, and binding and interacting sites for the study of protein-protein and protein-drug interactions, for differentiation of cancer tumour types, for targeted gene therapy, as well as for drug target discovery.
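One way to see how the aligned columns of such a pattern cluster separate conserved sites from variable ones is to compute a per-column Shannon entropy over the aligned segments: conserved (binding-site-like) columns have low entropy, variable columns have higher entropy. This is only a sketch of that idea, not the dissertation's AP Cluster discovery algorithm, and the example segments are invented.

```python
import math
from collections import Counter

def column_entropies(aligned_segments):
    """Shannon entropy (bits) of each column of equal-length aligned segments."""
    entropies = []
    for column in zip(*aligned_segments):
        counts = Counter(column)
        total = len(column)
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies

# Hypothetical aligned pattern cluster: columns 1, 4 and 5 are fully conserved.
segments = ["GKSTL", "GRSTL", "GKATL", "GRSTL"]
print([round(h, 2) for h in column_entropies(segments)])
# -> [0.0, 1.0, 0.81, 0.0, 0.0]
```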
24

Etude bioinformatique de l'évolution de la régulation transcriptionnelle chez les bactéries / Bioinformatic study of the evolution of the transcriptional regulation in bacteria

Janky, Rekin's 17 December 2007 (has links)
The purpose of my thesis is to study the evolution of regulation within bacterial genomes by using a cross-genomic comparative approach. Nowadays, numerous bacterial genomes have been sequenced, facilitating in silico analyses that detect groups of functionally related genes and predict the mechanisms of their regulation. In this project, we combined prediction of operons and regulons in order to reconstruct the transcriptional regulatory network for a bacterial genome.
We have implemented three methods to predict operons from a bacterial genome and evaluated them on hundreds of annotated operons of Escherichia coli and Bacillus subtilis. It turns out that a simple distance-based threshold method gives good results, with about 80% accuracy. The principle of this method is to classify pairs of adjacent genes as "within operon" or "transcription unit border" by using a threshold on their intergenic distance: two adjacent genes are predicted to be within an operon if their intergenic distance is smaller than 55 bp. In the second part of my thesis, I evaluated the performance of a phylogenetic footprinting approach based on the detection of over-represented spaced motifs. This method is particularly suitable for (but not restricted to) Bacteria, since such motifs are typically bound by factors containing a Helix-Turn-Helix domain. We evaluated footprint discovery in 368 E. coli K12 genes with annotated sites, under 40 different combinations of parameters (taxonomical level, background model, organism-specific filtering, operon inference, significance threshold). Motifs are assessed in terms of both correctness and significance. The footprint discovery method proposed here shows excellent results with E. coli and can readily be extended to predict cis-acting regulatory signals and propose testable hypotheses in bacterial genomes for which nothing is known about regulation. Moreover, the predictive power of the strategy and its capability to track the evolutionary divergence of cis-regulatory motifs were illustrated with the example of LexA auto-regulation, for which our predictions are remarkably consistent with the binding sites characterized in different taxonomical groups. A next challenge was to identify groups of co-regulated genes (regulons), by regrouping genes with similar motifs, in order to address the challenging domain of the evolution of transcriptional regulatory networks. We tested different metrics to detect putative pairs of co-regulated genes. The comparison between predicted and annotated co-regulation networks shows a high positive predictive value, since a good fraction of the predicted associations correspond to annotated co-regulations, and a low sensitivity, which may be a consequence of highly connected transcription factors (global regulators). A regulon-per-regulon analysis indeed shows that the sensitivity is very weak for these transcription factors, but can be quite good for specific transcription factors. The originality of this global strategy is its ability to infer a potential network from genome sequences alone, without any prior knowledge about regulation in the organism considered.
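The distance-threshold operon predictor evaluated above is simple enough to sketch directly: adjacent genes are placed in the same operon when the gap between them is below 55 bp. The gene-record format is an assumption for illustration (annotations would normally come from a GFF or GenBank parser), and the same-strand check is an added common-sense filter rather than part of the evaluated threshold rule itself.

```python
def predict_operons(genes, threshold=55):
    """Group genes into predicted operons using an intergenic-distance threshold.

    genes: list of (name, start, end, strand) sorted by start coordinate.
    Two adjacent genes on the same strand are placed in the same operon
    when the gap between them is smaller than `threshold` base pairs.
    """
    operons = [[genes[0]]]
    for prev, curr in zip(genes, genes[1:]):
        gap = curr[1] - prev[2]          # intergenic distance in bp
        same_strand = curr[3] == prev[3]
        if same_strand and gap < threshold:
            operons[-1].append(curr)     # extend the current transcription unit
        else:
            operons.append([curr])       # start a new transcription unit
    return [[g[0] for g in op] for op in operons]

# Toy annotation (all coordinates invented).
annotation = [
    ("geneA", 100, 1100, "+"),
    ("geneB", 1130, 2000, "+"),   # gap 30 bp  -> same operon
    ("geneC", 2400, 3000, "+"),   # gap 400 bp -> new operon
    ("geneD", 3010, 3500, "-"),   # strand change -> new operon
]
print(predict_operons(annotation))   # -> [['geneA', 'geneB'], ['geneC'], ['geneD']]
```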
25

Dolování periodických vzorů / Periodic Patterns Mining

Stríž, Rostislav January 2012 (has links)
Data collection and analysis are commonly used techniques in many sectors of today's business and science. The process called Knowledge Discovery in Databases is a powerful tool for finding new and interesting information that can be used in future development. This thesis covers the basic principles of data mining and temporal data mining, as well as the specifics of implementing selected algorithms for mining periodic patterns in time series. These algorithms have been developed in the form of managed plug-ins for Microsoft Analysis Services, the component that provides data mining features for Microsoft SQL Server. Finally, we discuss the results of experiments focused on the time complexity of the implemented algorithms.
26

Získávání frekventovaných vzorů z proudu dat / Frequent Pattern Discovery in a Data Stream

Dvořák, Michal January 2012 (has links)
Frequent-pattern mining from databases has been widely studied and is frequently applied in practice. Unfortunately, the classic algorithms are not suitable for data stream processing. In frequent-pattern mining from data streams it is important to manage not only sets of items but also their history: not just the history of frequent itemsets, but also the history of potentially frequent itemsets that can become frequent later. This requires more memory and computational power. This thesis describes two algorithms designed for these constraints: Lossy Counting and FP-stream. An effective implementation of these algorithms in C# is an integral part of this thesis. In addition, the two algorithms have been compared.
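Of the two algorithms, Lossy Counting is the simpler to outline: the stream is split into buckets of width ⌈1/ε⌉, approximate counts are kept together with a per-entry error bound, and entries whose count plus error no longer exceeds the current bucket index are pruned at each bucket boundary. The sketch below counts single items rather than itemsets, so it is a simplification of what the thesis implements, and FP-stream is not shown.

```python
import math

class LossyCounter:
    """Approximate frequency counting of single items over a stream (Lossy Counting).

    Guarantees: every item whose true frequency exceeds support * N is reported,
    and reported counts underestimate true counts by at most epsilon * N.
    """
    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)   # bucket width
        self.n = 0                            # items seen so far
        self.entries = {}                     # item -> (count, max_error)

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        count, error = self.entries.get(item, (0, bucket - 1))
        self.entries[item] = (count + 1, error)
        if self.n % self.width == 0:          # bucket boundary: prune weak entries
            self.entries = {i: (c, e) for i, (c, e) in self.entries.items()
                            if c + e > bucket}

    def frequent(self, support):
        """Items estimated to exceed the support threshold (fraction of the stream)."""
        cutoff = (support - self.epsilon) * self.n
        return {i: c for i, (c, e) in self.entries.items() if c >= cutoff}

lc = LossyCounter(epsilon=0.1)
for ch in "aababcabadaaebaacada":
    lc.add(ch)
print(lc.frequent(support=0.3))
# -> {'a': 11, 'b': 4}; items above (support - epsilon) * N may also appear
```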
27

Advanced natural language processing and temporal mining for clinical discovery

Mehrabi, Saeed 17 August 2015 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / There has been a vast and growing amount of healthcare data, especially with the rapid adoption of electronic health records (EHRs) as a result of the HITECH Act of 2009. It is estimated that around 80% of the clinical information resides in the unstructured narrative of an EHR. Recently, natural language processing (NLP) techniques have offered opportunities to extract information from unstructured clinical texts needed for various clinical applications. A popular method for enabling secondary uses of EHRs is information or concept extraction, a subtask of NLP that seeks to locate and classify elements within text based on the context. Extraction of clinical concepts without considering the context has many complications, including inaccurate diagnosis of patients and contamination of study cohorts. Identifying the negation status of a concept, and whether a clinical concept belongs to the patient or to a family member, are two of the challenges faced in context detection. A negation algorithm called Dependency Parser Negation (DEEPEN) has been developed in this research study by taking into account the dependency relationship between negation words and concepts within a sentence, using the Stanford Dependency Parser. The study results demonstrate that DEEPEN can reduce the number of incorrect negation assignments for patients with positive findings, and therefore improve the identification of patients with the target clinical findings in EHRs. Additionally, an NLP system consisting of section segmentation and relation discovery was developed to identify patients' family history. To assess the generalizability of the negation and family history algorithms, data from a different clinical institution was used in both algorithm evaluations.
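DEEPEN's exact rule set is not reproduced here, but the core idea of inspecting dependency relations between a negation cue and a clinical concept can be sketched with an off-the-shelf dependency parser. The snippet below uses spaCy instead of the Stanford parser used in the study, the model name is just an example, and the single rule shown (a 'neg' dependency attached to the concept or its syntactic head) is a deliberately simplified stand-in for the algorithm's rules.

```python
import spacy

# Any English model with a dependency parser should work; the name is an example.
nlp = spacy.load("en_core_web_sm")

def concept_is_negated(sentence, concept):
    """Crude dependency-based negation check for a single-word clinical concept.

    Returns True if a 'neg' dependency is attached to the concept token or to
    its syntactic head. This is one simplified rule, not DEEPEN's full rule set.
    """
    doc = nlp(sentence)
    for token in doc:
        if token.text.lower() != concept.lower():
            continue
        for t in {token, token.head}:
            if any(child.dep_ == "neg" for child in t.children):
                return True
    return False

# Expected True / False with a typical English model, though parses can vary.
print(concept_is_negated("The patient does not have a fever.", "fever"))
print(concept_is_negated("The patient reports fever and chills.", "fever"))
```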
28

Probabilistic Sequence Models with Speech and Language Applications

Henter, Gustav Eje January 2013 (has links)
Series data, sequences of measured values, are ubiquitous. Whenever observations are made along a path in space or time, a data sequence results. To comprehend nature and shape it to our will, or to make informed decisions based on what we know, we need methods to make sense of such data. Of particular interest are probabilistic descriptions, which enable us to represent uncertainty and random variation inherent to the world around us. This thesis presents and expands upon some tools for creating probabilistic models of sequences, with an eye towards applications involving speech and language. Modelling speech and language is not only of use for creating listening, reading, talking, and writing machines---for instance allowing human-friendly interfaces to future computational intelligences and smart devices of today---but probabilistic models may also ultimately tell us something about ourselves and the world we occupy. The central theme of the thesis is the creation of new or improved models more appropriate for our intended applications, by weakening limiting and questionable assumptions made by standard modelling techniques. One contribution of this thesis examines causal-state splitting reconstruction (CSSR), an algorithm for learning discrete-valued sequence models whose states are minimal sufficient statistics for prediction. Unlike many traditional techniques, CSSR does not require the number of process states to be specified a priori, but builds a pattern vocabulary from data alone, making it applicable for language acquisition and the identification of stochastic grammars. A paper in the thesis shows that CSSR handles noise and errors expected in natural data poorly, but that the learner can be extended in a simple manner to yield more robust and stable results also in the presence of corruptions. Even when the complexities of language are put aside, challenges remain. The seemingly simple task of accurately describing human speech signals, so that natural synthetic speech can be generated, has proved difficult, as humans are highly attuned to what speech should sound like. Two papers in the thesis therefore study nonparametric techniques suitable for improved acoustic modelling of speech for synthesis applications. Each of the two papers targets a known-incorrect assumption of established methods, based on the hypothesis that nonparametric techniques can better represent and recreate essential characteristics of natural speech. In the first paper of the pair, Gaussian process dynamical models (GPDMs), nonlinear, continuous state-space dynamical models based on Gaussian processes, are shown to better replicate voiced speech, without traditional dynamical features or assumptions that cepstral parameters follow linear autoregressive processes. Additional dimensions of the state-space are able to represent other salient signal aspects such as prosodic variation. The second paper, meanwhile, introduces KDE-HMMs, asymptotically-consistent Markov models for continuous-valued data based on kernel density estimation, that additionally have been extended with a fixed-cardinality discrete hidden state. This construction is shown to provide improved probabilistic descriptions of nonlinear time series, compared to reference models from different paradigms. The hidden state can be used to control process output, making KDE-HMMs compelling as a probabilistic alternative to hybrid speech-synthesis approaches. 
A final paper of the thesis discusses how models can be improved even when one is restricted to a fundamentally imperfect model class. Minimum entropy rate simplification (MERS), an information-theoretic scheme for postprocessing models for generative applications involving both speech and text, is introduced. MERS reduces the entropy rate of a model while remaining as close as possible to the starting model. This is shown to produce simplified models that concentrate on the most common and characteristic behaviours, and provides a continuum of simplifications between the original model and zero-entropy, completely predictable output. As the tails of fitted distributions may be inflated by noise or empirical variability that a model has failed to capture, MERS's ability to concentrate on high-probability output is also demonstrated to be useful for denoising models trained on disturbed data. / ACORNS: Acquisition of Communication and Recognition Skills / LISTA – The Listening Talker
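The kernel-density ingredient behind the KDE-HMMs mentioned above can be illustrated in isolation: estimate the density of the next sample conditioned on the previous one from observed pairs, using Gaussian kernels. This sketch leaves out the hidden state and everything else that makes a KDE-HMM, and the bandwidth and the toy data-generating process are arbitrary assumptions.

```python
import numpy as np

def conditional_kde(pairs, bandwidth=0.3):
    """Return p(x_next | x_prev) estimated from (x_prev, x_next) training pairs.

    A Nadaraya-Watson style estimator with Gaussian kernels: each training pair
    contributes a kernel in x_next, weighted by how close its x_prev is to the
    conditioning value. Only the kernel-density part, with no hidden state.
    """
    prev = np.array([p for p, _ in pairs])
    nxt = np.array([n for _, n in pairs])

    def kernel(u):
        return np.exp(-0.5 * (u / bandwidth) ** 2)

    def density(x_next, x_prev):
        weights = kernel(prev - x_prev)
        comps = kernel(nxt - x_next) / (bandwidth * np.sqrt(2 * np.pi))
        return float(np.sum(weights * comps) / np.sum(weights))

    return density

# Train on a noisy nonlinear toy series x_t = sin(x_{t-1}) + noise.
rng = np.random.default_rng(0)
x = np.zeros(500)
for t in range(1, 500):
    x[t] = np.sin(x[t - 1]) + 0.1 * rng.standard_normal()

p = conditional_kde(list(zip(x[:-1], x[1:])))
print(p(np.sin(0.2), 0.2), p(1.5, 0.2))   # the first density should be clearly larger
```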
