181

Efficient prediction of relational structure and its application to natural language processing

Riedel, Sebastian January 2009 (has links)
Many tasks in Natural Language Processing (NLP) require us to predict a relational structure over entities. For example, in Semantic Role Labelling we try to predict the 'semantic role' relation between a predicate verb and its argument constituents. Often NLP tasks not only involve related entities but also relations that are stochastically correlated. For instance, in Semantic Role Labelling the roles of different constituents are correlated: we cannot assign the agent role to one constituent if we have already assigned this role to another. Statistical Relational Learning (also known as First Order Probabilistic Logic) allows us to capture this nature of NLP tasks because it is based on the notions of entities, relations and stochastic correlations between relations. It is therefore often straightforward to formulate an NLP task using a First Order probabilistic language such as Markov Logic. However, the generality of this approach comes at a price: the process of finding the relational structure with the highest probability, also known as maximum a posteriori (MAP) inference, is often inefficient, if not intractable. In this work we seek to improve the efficiency of MAP inference for Statistical Relational Learning. We propose a meta-algorithm, Cutting Plane Inference (CPI), that iteratively solves small subproblems of the original problem using any existing MAP technique and inspects parts of the problem that are not yet included in the current subproblem but could potentially lead to an improved solution. Our hypothesis is that this algorithm can dramatically improve the efficiency of existing methods while remaining at least as accurate. We frame the algorithm in Markov Logic, a language that combines First Order Logic and Markov Networks. The hypothesis is evaluated on two tasks: Semantic Role Labelling and Entity Resolution. We show that the proposed algorithm improves the efficiency of two existing methods by two orders of magnitude and leads an approximate method to more probable solutions. We also show that CPI, at convergence, is guaranteed to be at least as accurate as the method used within its inner loop. Another core contribution of this work is a theoretical and empirical analysis of the boundary conditions of Cutting Plane Inference: we describe cases in which CPI will definitely be difficult (because it instantiates large networks or needs many iterations) and cases in which it will be easy (because it instantiates small networks and needs only a few iterations).
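
To make the cutting-plane loop concrete, here is a minimal propositional sketch under simplifying assumptions (positive clause weights, a brute-force inner solver); the function and atom names are hypothetical, not Riedel's implementation.

```python
import itertools

# A ground clause is a pair (weight, literals), where literals is a
# frozenset of (atom, polarity) pairs. An assignment maps atom -> bool;
# unassigned atoms default to False.

def satisfied(lits, assign):
    return any(assign.get(atom, False) == pol for atom, pol in lits)

def brute_force_map(clauses):
    """Exhaustive MAP solver standing in for the 'inner' method; any
    existing MAP technique (ILP, MaxWalkSAT, ...) could be plugged in."""
    atoms = sorted({atom for _, lits in clauses for atom, _ in lits})
    best, best_score = {}, float("-inf")
    for bits in itertools.product([False, True], repeat=len(atoms)):
        assign = dict(zip(atoms, bits))
        score = sum(w for w, lits in clauses if satisfied(lits, assign))
        if score > best_score:
            best, best_score = assign, score
    return best

def cutting_plane_inference(ground_clauses, map_solver, max_iters=50):
    """Solve growing subproblems until the current solution violates
    no ground clause outside the active set (positive weights assumed)."""
    active, solution = set(), {}
    for _ in range(max_iters):
        new = {(w, lits) for w, lits in ground_clauses
               if not satisfied(lits, solution)} - active
        if not new:                     # nothing left to add: converged
            return solution
        active |= new
        solution = map_solver(list(active))
    return solution

# Toy SRL-style problem: prefer making each constituent the agent, but
# softly forbid two agents at once ("at most one agent" constraint).
clauses = [
    (1.0, frozenset({("agent_a", True)})),
    (0.8, frozenset({("agent_b", True)})),
    (2.0, frozenset({("agent_a", False), ("agent_b", False)})),
]
print(cutting_plane_inference(clauses, brute_force_map))
# -> {'agent_a': True, 'agent_b': False}
```

In the toy run, CPI starts from the all-false solution, pulls in only the clauses that solution violates, and re-solves until no new violated clause appears; the same separation idea is what keeps the instantiated networks small in practice.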
182

Strojové učení redukční analýzy / Machine learning of analysis by reduction

Hoffmann, Petr January 2013 (has links)
We study the inference of models of the analysis by reduction, an important tool for parsing natural language sentences. We prove that inferring such models from positive and negative samples is NP-hard when a small model is required. On the other hand, if only positive samples are considered, the problem is efficiently solvable. We propose a new model of the analysis by reduction (the so-called single k-reversible restarting automaton) together with a method for inferring it from positive samples of analyses by reduction. The power of the model lies between the growing context-sensitive languages and the context-sensitive languages. Benchmarks whose targets are based on grammars have several drawbacks, so we propose a benchmark whose targets are based on random automata and which can be used to evaluate inference algorithms. This benchmark is then used to evaluate our inference method.
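
As a toy illustration of analysis by reduction itself (not of the single k-reversible restarting automaton proposed in the thesis), the sketch below accepts a sentence when hypothetical length-reducing rules can rewrite it down to a single core symbol:

```python
# Minimal illustration of analysis by reduction: a sentence is accepted
# if repeated application of (hypothetical) reduction rules, each one
# deleting or shortening a span, reduces it to the core symbol 'S'.

RULES = [
    (("very",), ()),                           # drop an intensifier
    (("the", "old", "dog"), ("the", "dog")),   # drop an adjective
    (("the", "dog", "barked"), ("S",)),        # reduce a core clause to S
]

def reduce_once(words):
    """All sentences reachable by one reduction (leftmost match per rule)."""
    results = []
    for lhs, rhs in RULES:
        for i in range(len(words) - len(lhs) + 1):
            if tuple(words[i:i + len(lhs)]) == lhs:
                results.append(words[:i] + list(rhs) + words[i + len(lhs):])
                break
    return results

def accepts(sentence):
    """Depth-first search over reduction sequences down to the core 'S'."""
    stack, seen = [sentence.split()], set()
    while stack:
        words = stack.pop()
        if words == ["S"]:
            return True
        key = tuple(words)
        if key in seen:
            continue
        seen.add(key)
        stack.extend(reduce_once(words))
    return False

print(accepts("the very old dog barked"))  # True: drop 'very', 'old', reduce to S
```

Each rule application shortens the sentence, mirroring the length-reducing property that restarting automata rely on.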
183

Inférence de la structure tri-dimensionnelle du génome / Inferring the 3D architecture of the genome

Varoquaux, Nelle 03 December 2015 (has links)
The structure of DNA, chromosomes and genome organization is a topic that has fascinated the field of biology for many years. Most research has focused on the one-dimensional structure of the genome, studying the linear organization of genes and genomes and its link with gene expression and regulation, splicing, DNA methylation… Yet the spatial and temporal three-dimensional architecture of the genome is also thought to play an important role in many genomic functions. Chromosome conformation capture (3C) based methods, coupled with next generation sequencing (NGS), allow the measurement, in a single experiment, of genome-wide physical interactions between pairs of loci, thus enabling researchers to unravel the secrets behind the 3D organization of genomes. These new technologies have paved the way towards a systematic, genome-wide analysis of how DNA folds into the nucleus and opened new avenues to understanding many biological processes, such as gene regulation, DNA replication and repair, somatic copy number alterations and epigenetic changes. Yet 3C technologies, like any new biotechnology, pose important computational and theoretical challenges for which mathematically well-grounded methods need to be developed.

The first chapter is dedicated to developing a robust and accurate method to infer a 3D model of the genome from Hi-C data. Previous methods often formulated the inference as an optimization problem akin to multidimensional scaling (MDS), based on an ad hoc conversion of contact counts into Euclidean "wish" distances. Chromosomes are modeled as beads on a string, and the methods attempt to place the beads in a 3D Euclidean space such that the pairwise distances between beads are as close as possible to the corresponding wish distances, subject to a number of often non-convex constraints. These approaches rely on dubious hypotheses to convert contact counts into wish distances, challenging the accuracy of the final 3D model; a further limitation is that the MDS formulation is only intuitively motivated rather than grounded in a clear statistical model. To alleviate these problems, our method models contact counts as Poisson-distributed, with an intensity that is a decreasing function of the spatial distance between the interacting elements, and we formulate 3D structure inference as a maximum-likelihood problem. We demonstrate that our method infers robust and stable models across resolutions and datasets.

The second chapter focuses on the genome architecture of P. falciparum, a small parasite responsible for the deadliest and most virulent form of human malaria. This project was biologically driven and aimed at understanding whether and how the 3D structure of the genome relates to gene expression and regulation at different time points in the complex life cycle of the parasite. In collaboration with the Le Roch lab and the Noble lab, we built 3D models of the genome at three time points, which revealed a complex genome architecture indicative of a strong association between the spatial organization of the genome and gene expression.

The last chapter tackles a very different question, also based on 3C data. Initially developed to probe the 3D architecture of chromosomes, Hi-C and related techniques have recently been re-purposed for diverse applications: de novo genome assembly, deconvolution of metagenomic samples and genome annotation. We describe in this chapter a novel method, Centurion, that jointly infers the locations of all centromeres in a single yeast genome from Hi-C data, using the centromeres' tendency to colocalize strongly in the nucleus. Centromeres are essential for proper chromosome segregation, yet despite extensive research their locations remain unknown for many yeast species. We demonstrate the robustness of our approach on datasets with low and high coverage for well-annotated organisms, and we then predict centromere coordinates for six yeast species that currently lack those annotations.
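
The following is a minimal sketch of the Poisson maximum-likelihood idea described above, with the count-versus-distance exponent fixed at -3 and all normalization omitted; it simplifies the method developed in the first chapter considerably.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: contact counts c_ij ~ Poisson(beta * d_ij**alpha), with
# alpha < 0 so that spatially close loci interact more often.
ALPHA, BETA = -3.0, 1.0

def neg_log_likelihood(x_flat, counts):
    n = counts.shape[0]
    X = x_flat.reshape(n, 3)
    i, j = np.triu_indices(n, k=1)
    d = np.linalg.norm(X[i] - X[j], axis=1) + 1e-9  # pairwise distances
    rate = BETA * d ** ALPHA
    # Poisson log-likelihood up to the constant log(c_ij!)
    return -np.sum(counts[i, j] * np.log(rate) - rate)

def infer_structure(counts, seed=0):
    """Maximum-likelihood 3D coordinates from a symmetric count matrix."""
    n = counts.shape[0]
    rng = np.random.default_rng(seed)
    x0 = rng.normal(size=3 * n)                 # random initial coordinates
    res = minimize(neg_log_likelihood, x0, args=(counts,), method="L-BFGS-B")
    return res.x.reshape(n, 3)

# Toy run on a symmetric random contact matrix.
rng = np.random.default_rng(1)
C = rng.poisson(5, size=(10, 10))
C = np.triu(C, 1) + np.triu(C, 1).T
print(infer_structure(C).shape)                 # (10, 3)
```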
184

Approximate inference in graphical models

Hennig, Philipp January 2011 (has links)
Probability theory provides a mathematically rigorous yet conceptually flexible calculus of uncertainty, allowing the construction of complex hierarchical models for real-world inference tasks. Unfortunately, exact inference in probabilistic models is often computationally expensive or even intractable. A close inspection in such situations often reveals that computational bottlenecks are confined to certain aspects of the model, which can be circumvented by approximations without having to sacrifice the model's interesting aspects. The conceptual framework of graphical models provides an elegant means of representing probabilistic models and deriving both exact and approximate inference algorithms in terms of local computations. This makes graphical models an ideal aid in the development of generalizable approximations. This thesis contains a brief introduction to approximate inference in graphical models (Chapter 2), followed by three extensive case studies in which approximate inference algorithms are developed for challenging applied inference problems. Chapter 3 derives the first probabilistic game tree search algorithm. Chapter 4 provides a novel expressive model for inference in psychometric questionnaires. Chapter 5 develops a model for the topics of large corpora of text documents, conditional on document metadata, with a focus on computational speed. In each case, graphical models help in two important ways: They first provide important structural insight into the problem; and then suggest practical approximations to the exact probabilistic solution.
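
As an illustration of the local computations that graphical models expose, here is a standard sum-product pass on a chain-structured model (textbook material, not one of the thesis's case studies):

```python
import numpy as np

# Sum-product on a chain MRF: p(x) is proportional to the product of
# pairwise potentials psi(x_t, x_{t+1}) and unary potentials phi_t(x_t).
# Exact marginals come from local forward/backward messages instead of
# a sum over all K**T joint configurations.

def chain_marginals(unaries, pairwise):
    """unaries: (T, K) local evidence; pairwise: (K, K) potential."""
    T, K = unaries.shape
    fwd, bwd = np.zeros((T, K)), np.zeros((T, K))
    fwd[0] = unaries[0]
    for t in range(1, T):                       # forward messages
        fwd[t] = unaries[t] * (pairwise.T @ fwd[t - 1])
        fwd[t] /= fwd[t].sum()                  # normalise for stability
    bwd[-1] = 1.0
    for t in range(T - 2, -1, -1):              # backward messages
        bwd[t] = pairwise @ (unaries[t + 1] * bwd[t + 1])
        bwd[t] /= bwd[t].sum()
    marg = fwd * bwd
    return marg / marg.sum(axis=1, keepdims=True)

# Two states, sticky transitions, noisy evidence favouring state 0 then 1.
psi = np.array([[0.9, 0.1], [0.1, 0.9]])
phi = np.array([[0.8, 0.2], [0.7, 0.3], [0.2, 0.8]])
print(chain_marginals(phi, psi))
```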
185

Studium vývoje lymfocytů pomocí hmotnostní cytometrie / Studying lymphocyte development using mass cytometry

Novák, David January 2020 (has links)
Development of mature lymphocytes, a white blood cell subtype, is crucial for the correct function of the human immune system. Currently, developmental pathways of lymphocytes can be studied using high-throughput single-cell measurements. In particular, mass cytometry enables the study of immunologically relevant phenotypic and functional markers on a vast scale. In this work I present my individual contribution to tviblindi, a powerful software tool for the analysis of cytometric data aimed at uncovering developmental trajectories. tviblindi is a package written in R, Python and C++. It provides a means to integrate prior knowledge with data analyses grounded in graph theory and algebraic topology, and it is accessible to biological researchers without a background in computer science or mathematics. It is an addition to the expanding field of trajectory inference in single-cell data. Furthermore, I review current knowledge of T-cell development, conduct a tviblindi analysis thereof using human thymus and peripheral blood datasets, and evaluate the results.
187

Distributed Bootstrap for Massive Data

Yang Yu (12466911) 27 April 2022 (has links)
Modern massive data, with enormous sample size and tremendous dimensionality, are usually stored and processed using a cluster of nodes in a master-worker architecture. A shortcoming of this architecture is that inter-node communication can be over a thousand times slower than intra-node computation, which makes communication efficiency a desirable feature when developing distributed learning algorithms. In this dissertation, we tackle this challenge and propose communication-efficient bootstrap methods for simultaneous inference in the distributed computational framework.

First, we propose two generic distributed bootstrap methods, k-grad and n+k-1-grad, which apply the multiplier bootstrap at the master node to the gradients communicated across nodes. Based on them, we develop a communication-efficient method of producing an ℓ∞-norm confidence region using distributed data with dimensionality not exceeding the local sample size. Our theory establishes the communication efficiency by providing a lower bound on the number of communication rounds τ_min that warrants the statistical accuracy and efficiency, and by showing that τ_min increases only logarithmically with the number of workers and the dimensionality. Our simulation studies validate our theory.

Then, we extend k-grad and n+k-1-grad to the high-dimensional regime and propose a distributed bootstrap method for simultaneous inference on high-dimensional distributed data. The method produces an ℓ∞-norm confidence region based on a communication-efficient de-biased lasso, and we propose an efficient cross-validation approach to tune the method at every iteration. We theoretically prove a lower bound on the number of communication rounds τ_min that warrants the statistical accuracy and efficiency. Furthermore, τ_min increases only logarithmically with the number of workers and the intrinsic dimensionality, while remaining nearly invariant to the nominal dimensionality. We test our theory with extensive simulation studies and a variable screening task on a semi-synthetic dataset based on the US Airline On-Time Performance dataset.
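
A simplified sketch of the multiplier-bootstrap step behind k-grad, assuming the master already holds the k worker gradients; the de-biasing, the exact scaling constants and the iterated communication rounds of the actual method are omitted.

```python
import numpy as np

def k_grad_quantile(worker_grads, alpha=0.05, B=2000, seed=0):
    """worker_grads: (k, d) array of per-worker average gradients.
    Returns the (1 - alpha) bootstrap quantile of the sup-norm statistic
    used to calibrate a simultaneous (max-norm) confidence region."""
    k, _ = worker_grads.shape
    centred = worker_grads - worker_grads.mean(axis=0)
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((B, k))            # Gaussian multipliers
    stats = np.abs(eps @ centred).max(axis=1) / np.sqrt(k)
    return np.quantile(stats, 1 - alpha)

# Toy run: k = 20 workers, d = 5 parameters.
rng = np.random.default_rng(1)
grads = rng.normal(scale=0.03, size=(20, 5))
print(k_grad_quantile(grads))
```

The key point is that only the k gradient vectors cross the network; all B bootstrap replicates are generated at the master, which is what makes the scheme communication-efficient.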
188

Generating Predictive Inferences when Multiple Alternatives are Available

Cranford, Edward Andrew 09 December 2016 (has links)
The generation of predictive inferences may be difficult when a story leads to multiple possible consequences. Prior research has shown that readers generate predictive inferences automatically, under normal reading conditions, only when the story is based on familiar events for which the reader has readily available knowledge about what may happen next, when there is enough constraining information in the text that the inference is highly predictable, and when there are few or no alternative inferences available (McKoon & Ratcliff, 1992). However, some evidence shows that predictive inferences were generated even when the likelihood of the targeted inference was reduced and the story implied that an alternative consequence could occur (Klin, Murray, Levine, & Guzmán, 1999). It is possible, though, that the alternative was not a likely enough consequence to affect processing of the targeted inference, and prior research did not examine whether the alternative inference was drawn or whether multiple inferences could be entertained simultaneously. The experiments in this dissertation were designed to further assess the nature of interference when multiple consequences are possible, by increasing the likelihood of the alternative so that the two inferences were more nearly equal in likelihood. The first two experiments used a word-naming task and showed that neither inference was activated when probed at 500 ms after the story (Experiment 1A) or at 1000 ms (Experiment 1B), suggesting that the alternative inference interferes with activation of the targeted inference. Experiments 2 and 3 used a contradictory reading paradigm to assess whether the inferences were activated but only at a level too minimal to be detected in a word-naming task. Reading time was slower when a sentence contradicted both inferences but not when it contradicted only one, even after a lengthy filler text; reading time was also slower in Experiment 3, where the filler text was removed. These results imply that both inferences were generated at a minimal level of activation that does not strengthen over time. The results are discussed in light of comprehension theories that could account for the representation of minimally encoded inferences (Kintsch, 1998; Myers & O'Brien, 1998).
189

Copula Modelling of High-Dimensional Longitudinal Binary Response Data / Copula-modellering av högdimensionell longitudinell binärresponsdata

Henningsson, Nils January 2022 (has links)
This thesis treats the modelling of a high-dimensional data set of longitudinal binary responses. The data consist of default indicators for nations around the world, together with explanatory variables such as exposure to underlying assets. The quantity modelled is an aggregate that combines several of the default indicators in the data set into one. The modelling sets out from a portfolio perspective: it seeks underlying correlations between the nations in the data set and examines the extreme losses produced by a portfolio with assets in those nations. We take a Gaussian-copula approach, first formulating several models mathematically and then optimizing their parameters to best fit the data set. Models A and B are optimized by standard stochastic gradient ascent on the likelihood function, while model C is optimized by variational inference, applying stochastic gradient ascent to the evidence lower bound. Using the fitted Gaussian copulas, a portfolio simulation is then performed to examine the extreme values. The results show low correlations in models A and B, while model C, with its additional regional correlations, shows slightly higher correlations in three of the subgroups. The portfolio simulations show similar tail behaviour in all three models; however, model C produces more extreme risk-measure outcomes in the form of higher VaR and ES.
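
For illustration, here is a minimal Gaussian-copula default simulation of the kind such a portfolio study rests on, with hypothetical correlations, default probabilities and exposures rather than the thesis's fitted parameters.

```python
import numpy as np
from scipy.stats import norm

def simulate_losses(R, default_probs, exposures, n_sims=100_000, seed=0):
    """Correlated defaults via a Gaussian copula: latent normals with
    correlation matrix R thresholded at each nation's default probability."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(R)
    Z = rng.standard_normal((n_sims, len(default_probs))) @ L.T
    defaults = Z < norm.ppf(default_probs)      # (n_sims, n_nations) booleans
    return defaults @ exposures                 # portfolio loss per scenario

def var_es(losses, level=0.99):
    var = np.quantile(losses, level)
    es = losses[losses >= var].mean()           # expected shortfall beyond VaR
    return var, es

# Three nations with hypothetical correlations, PDs and exposures.
R = np.array([[1.0, 0.3, 0.2],
              [0.3, 1.0, 0.1],
              [0.2, 0.1, 1.0]])
losses = simulate_losses(R, np.array([0.01, 0.02, 0.05]),
                         np.array([1.0, 2.0, 1.5]))
print(var_es(losses))
```

Higher off-diagonal entries in R make joint defaults more likely and fatten the loss tail, which is exactly the effect the VaR and ES comparisons across models A, B and C are probing.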
190

SpecTackle: Inferring Partial Specifications Through Constraint-Based Dynamic Analysis

Wedig, Sean A. 30 August 2012 (has links)
No description available.
