Global ETD Search

101	Classification of phylogenetic data via Bayesian mixture modelling
Loza Reyes, Elisa January 2010
Conventional probabilistic models for phylogenetic inference assume that an evolutionary tree, and a single set of branch lengths and stochastic process of DNA evolution, are sufficient to characterise the generating process across an entire DNA alignment. Unfortunately, such a simplistic, homogeneous formulation may be a poor description of reality when the data arise from heterogeneous processes. A well-known example is when sites evolve at heterogeneous rates. This thesis is a contribution to the modelling and understanding of heterogeneity in phylogenetic data. We propose a method for the classification of DNA sites based on Bayesian mixture modelling. Our method not only accounts for heterogeneous data but also identifies the underlying classes and enables their interpretation. We also introduce novel MCMC methodology whose estimation performance equals or exceeds that of existing algorithms, at lower computational cost. We find that our mixture model can successfully detect evolutionary heterogeneity and demonstrate its direct relevance by applying it to real DNA data. One of these applications is the analysis of sixteen strains of one of the bacterial species that cause Lyme disease. Results from that analysis have helped in understanding the evolutionary paths of these bacterial strains and, therefore, the dynamics of the spread of Lyme disease. Our method is discussed in the context of DNA but it may be extended to other types of molecular data. Moreover, the classification scheme that we propose is evidence of the breadth of application of mixture modelling and a step forwards in the search for more realistic models of the processes that underlie phylogenetic data.
519
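As an illustration of how a mixture model can classify sites, the sketch below applies Bayes' rule to a single site: given mixture weights and the log-likelihood of the site under each class, it returns the posterior probability of each class. The function name and the toy inputs are hypothetical; in the thesis the weights and class parameters are estimated by MCMC, whereas here they are simply assumed known.

```python
import math

def class_posteriors(weights, log_likelihoods):
    """Posterior probability of each mixture class for one site (Bayes' rule).

    weights: prior class probabilities (positive, summing to 1).
    log_likelihoods: log-likelihood of the site under each class.
    Both inputs are assumed known here; this is an illustration only.
    """
    log_terms = [math.log(w) + ll for w, ll in zip(weights, log_likelihoods)]
    shift = max(log_terms)  # subtract the maximum before exponentiating, for stability
    unnormalized = [math.exp(t - shift) for t in log_terms]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

A site would then be assigned to the class with the largest posterior probability, or the probabilities could be averaged over MCMC draws.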
102	Contributions to Collective Dynamical Clustering-Modeling of Discrete Time Series
Wang, Chiying 27 April 2016
The analysis of sequential data is important in business, science, and engineering, for tasks such as signal processing, user behavior mining, and commercial transactions analysis. In this dissertation, we build upon the Collective Dynamical Modeling and Clustering (CDMC) framework for discrete time series modeling, by making contributions to clustering initialization, dynamical modeling, and scaling. We first propose a modified Dynamic Time Warping (DTW) approach for clustering initialization within CDMC. The proposed approach provides DTW metrics that penalize deviations of the warping path from the path of constant slope. This reduces over-warping, while retaining the efficiency advantages of global constraint approaches, and without relying on domain-dependent constraints. Second, we investigate the use of semi-Markov chains as dynamical models of temporal sequences in which state changes occur infrequently. Semi-Markov chains allow explicitly specifying the distribution of state visit durations. This makes them superior to traditional Markov chains, which implicitly assume an exponential state duration distribution. Third, we consider convergence properties of the CDMC framework. We establish convergence by viewing CDMC from an Expectation Maximization (EM) perspective. We investigate the effect on the time to convergence of our efficient DTW-based initialization technique and selected dynamical models. We also explore the convergence implications of various stopping criteria. Fourth, we consider scaling up CDMC to process big data, using Storm, an open source distributed real-time computation system that supports batch and distributed data processing. We perform an experimental evaluation on human sleep data and on user web navigation data. Our results demonstrate the superiority of the strategies introduced in this dissertation over state-of-the-art techniques in terms of modeling quality and efficiency.
Keywords: discrete time series; deviated dynamic time warping; semi-Markov chain; distributed data processing system
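The idea of penalizing warping paths that stray from the constant-slope path can be sketched as follows. This is a minimal illustration, not the dissertation's metric: the additive penalty `penalty * |i/n - j/m|` and the function name are assumptions chosen for simplicity.

```python
import math

def penalized_dtw(a, b, penalty=1.0):
    """DTW cost with an additive penalty for leaving the constant-slope path.

    The penalty form is an illustrative assumption, not the thesis's formula.
    With penalty=0 this reduces to classic DTW with absolute-difference cost.
    """
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Distance of cell (i, j) from the constant-slope path, in [0, 1).
            deviation = abs(i / n - j / m)
            step = local + penalty * deviation
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Because the penalty is non-negative on every cell, the penalized cost is never smaller than the classic DTW cost, and paths that warp heavily become less attractive as `penalty` grows.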
103	Bayesian Logistic Regression Model with Integrated Multivariate Normal Approximation for Big Data
Fu, Shuting 28 April 2016
The analysis of big data is of great interest today, and this comes with challenges of improving precision and efficiency in estimation and prediction. We study binary data with covariates from numerous small areas, where direct estimation is not reliable, and there is a need to borrow strength from the ensemble. This is generally done using Bayesian logistic regression, but because there are numerous small areas, the exact computation for the logistic regression model becomes challenging. Therefore, we develop an integrated multivariate normal approximation (IMNA) method for binary data with covariates within the Bayesian paradigm, and this procedure is assisted by the empirical logistic transform. Our main goal is to provide the theory of IMNA and to show that it is many times faster than the exact logistic regression method with almost the same accuracy. We apply the IMNA method to the health status binary data (excellent health or otherwise) from the Nepal Living Standards Survey with more than 60,000 households (small areas). We estimate the proportion of Nepalese in excellent health condition for each household. For these data, IMNA gives estimates of the household proportions as precise as those from the logistic regression model and is more than fifty times faster (20 seconds versus 1,066 seconds); this gain is clearly transferable to bigger data problems.
Keywords: Parallel computing; Multivariate normal distribution; Metropolis-Hastings sampler; Empirical logistic transform; Markov chain Monte Carlo
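For readers unfamiliar with the empirical logistic transform mentioned in this abstract, a minimal sketch follows. It uses the textbook form of the transform, with 0.5 added to both counts, and a commonly quoted variance approximation. How IMNA itself uses the transform is not stated in the abstract, so the function name and its use per household are assumptions.

```python
import math

def empirical_logit(successes, trials):
    """Empirical logistic transform of a binomial count and its approximate variance.

    Adding 0.5 to both counts keeps the transform finite when successes is 0
    or equal to trials. The variance expression is a standard approximation.
    """
    z = math.log((successes + 0.5) / (trials - successes + 0.5))
    variance = (trials + 1) * (trials + 2) / (
        trials * (successes + 1) * (trials - successes + 1)
    )
    return z, variance
```

For example, a hypothetical household with 5 of 10 members in excellent health maps to a transform value of 0, the logit of one half.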
104	Stochastic heat equations with Markovian switching
Fan, Qianzhu January 2017
This thesis consists of three parts. In the first part, we recall some background theory that will be used throughout the thesis. In the second part, we study the existence and uniqueness of solutions of stochastic heat equations with Markovian switching. In the third part, we investigate properties of the solutions, such as the Feller property, the strong Feller property and stability.
510
105	Modelling operational risk using skew t-copulas and Bayesian inference
Garzon Rozo, Betty Johanna January 2016
Operational risk losses are heavy tailed and are likely to be asymmetric and extremely dependent among business lines/event types. The analysis of dependence via copula models has focussed mainly on the bivariate case. In the vast majority of instances, symmetric elliptical copulas are employed to model dependence for severities. This thesis proposes a new methodology to assess, in a multivariate way, the asymmetry and extreme dependence between severities, and to calculate the capital for operational risk. This methodology simultaneously uses (i) several parametric distributions and an alternative mixture distribution (the lognormal for the body of losses and the generalised Pareto distribution for the tail), using a technique from extreme value theory; (ii) the multivariate skew t-copula, applied for the first time across severities; and (iii) Bayesian theory. For the first, to model severities, I simultaneously test several parametric distributions and the mixture distribution for each business line. This procedure enables me to achieve multiple combinations of the severity distribution and to find which fits most closely. The second serves to model asymmetry and extreme dependence effectively in high dimensions. The third is used to estimate the copula model: given the high multivariate component (i.e. eight business lines and seven event types) and the incorporation of mixture distributions, maximum likelihood is highly difficult to implement. Therefore, I use a Bayesian inference framework and Markov chain Monte Carlo simulation to evaluate the posterior distribution and to estimate and make inferences about the parameters of the skew t-copula model. The research analyses an updated operational loss data set, SAS® Operational Risk Global Data (SAS OpRisk Global Data), to model operational risk at international financial institutions. I then evaluate the impact of this multivariate, asymmetric and extreme dependence on the estimate of total regulatory capital, in comparison with other established multivariate copulas. My empirical findings are consistent with other studies reporting thin and medium-tailed loss distributions. My approach substantially outperforms symmetric elliptical copulas, demonstrating that modelling dependence via the skew t-copula provides a more efficient allocation of capital, with charges up to 56% smaller than those indicated by the standard Basel model.
106	Contribuições para o controle on-line de processos por atributos. / Contributions to monitoring process for attributes.
Trindade, Anderson Laécio Galindo 02 April 2008
The on-line quality control procedure for attributes, proposed by Taguchi et al. (1989), consists of inspecting a single item out of every m items produced and, based on the result of each inspection, deciding whether the non-conforming fraction has increased. If an inspected item is declared non-conforming, the process is stopped and adjusted, on the assumption that it has shifted to the out-of-control condition. Because (i) the inspection system may be subject to classification errors and the inspected item may be classified repeatedly; (ii) the non-conforming fraction in the out-of-control state may vary with the number of items produced (x) according to a function y(x); and (iii) the decision to stop the process may be based on the results of the last h inspections, a model that covers these points is developed. Using properties of ergodic Markov chains, an expression for the average cost of the control system is obtained and minimized over parameters beyond the inspection interval m: the number of repeated classifications (r); the minimum number of conforming classifications needed to declare an item conforming (s); the length of the inspection history considered (h); and the stopping criterion for adjustment (u). The results show that repeated classification of the inspected item can be an economical alternative when only one item is considered in the adjustment decision; that a finite Markov chain can represent the control system in the presence of a non-constant function y(x); and that basing the adjustment decision on a sequence of inspected items is the alternative with the greatest impact on the cost of controlling the process.
Keywords: Control procedure for attributes; Process control; Markov chain; Quality deterioration; Repetitive classifications
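The role of the parameters r and s can be illustrated with a short calculation. Assuming, purely for illustration, that the r repeated classifications of an item are independent and that each one reports "conforming" with the same probability, the chance that the item is declared conforming is a binomial tail probability. The function name and the numerical values are hypothetical.

```python
from math import comb

def prob_declared_conforming(r, s, p_single):
    """Probability that at least s of r independent classifications say 'conforming'.

    p_single is the probability that one classification reports 'conforming'
    (for a truly conforming item, one minus the misclassification rate).
    Independence between repeated classifications is an assumption.
    """
    return sum(
        comb(r, k) * p_single**k * (1 - p_single) ** (r - k)
        for k in range(s, r + 1)
    )
```

With a single classification that is right 90% of the time, a conforming item is accepted with probability 0.9; requiring at least 2 of 3 conforming classifications raises this to 0.972, at the cost of two extra classifications per inspected item.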
107	Comparação de algoritmos usados na construção de mapas genéticos / Comparison of algorithms used in the construction of genetic linkage maps
Mollinari, Marcelo 23 January 2008
Genetic linkage maps are linear arrangements showing the order of, and distance between, loci on the chromosomes of a particular species. Recently, the wide availability of molecular markers has made such maps increasingly saturated, so efficient methods are needed for their construction. One of the steps that deserves most attention in the construction of linkage maps is the ordering of genetic markers within each linkage group. This ordering is considered a special case of the classic traveling salesman problem (TSP), which consists in choosing the best order among all possible ones. However, exhaustive search becomes unfeasible when the number of markers is large. In such cases, a viable alternative is to use algorithms that provide approximate solutions. The aim of this work was to evaluate the efficiency of the algorithms Try (TRY), Seriation (SER), Rapid Chain Delineation (RCD), Recombination Counting and Ordering (RECORD) and Unidirectional Growth (UG), as well as the criteria PARF (minimum product of adjacent recombination fractions), SARF (minimum sum of adjacent recombination fractions), SALOD (maximum sum of adjacent LOD scores) and LHMC (likelihood via hidden Markov chains), used together with the RIPPLE error-checking algorithm, in the construction of genetic linkage maps. To this end, a linkage map of a hypothetical diploid, monoecious plant species was simulated, containing 21 markers with a fixed distance of 3 centimorgans between them. Using Monte Carlo methods, 550 F2 populations were randomly simulated with 100 and 400 individuals, together with different combinations of dominant and codominant markers. Loss of 10% and 20% of the data was also simulated. The results showed that the TRY and SER algorithms performed well in all simulated situations, even with a large amount of missing data and dominant markers linked in repulsion phase, and can therefore be recommended for analyzing real data. The RECORD and UG algorithms performed well in the absence of dominant markers linked in repulsion phase and can be recommended in situations with few dominant markers. Among all algorithms, RCD was the least efficient. The LHMC criterion, applied with the RIPPLE algorithm, gave the best results when the goal is to check for ordering errors.
Keywords: Algorithms; Markov chains; Hidden Markov chain; Genetic mapping; Molecular marker; Monte Carlo; Multipoint estimates
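To make the ordering problem concrete, the sketch below scores a marker order with the SARF criterion and finds the best order by exhaustive search on a toy map. The recombination fractions are noise-free values computed with the Haldane mapping function for markers 3 centimorgans apart; the function names and the five-marker example are assumptions for illustration, and none of the algorithms compared in the thesis is reproduced here.

```python
import math
from itertools import permutations

def haldane_rf(distance_morgans):
    """Recombination fraction for a map distance under the Haldane mapping function."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_morgans))

def sarf(order, rf):
    """Sum of adjacent recombination fractions for a given marker order."""
    return sum(rf[a][b] for a, b in zip(order, order[1:]))

def best_order_exhaustive(rf):
    """Try every order and keep the one with minimum SARF (small maps only)."""
    n = len(rf)
    return min(permutations(range(n)), key=lambda order: sarf(order, rf))

# Toy map: 5 markers spaced 3 cM (0.03 Morgans) apart, without sampling noise.
n_markers = 5
rf = [[haldane_rf(0.03 * abs(i - j)) for j in range(n_markers)] for i in range(n_markers)]
best = best_order_exhaustive(rf)
```

Exhaustive search visits n! orders, which is 120 for five markers but on the order of 10^19 for the 21 markers simulated in the thesis, which is why approximate ordering algorithms are needed.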
108	Grafos aleatórios exponenciais / Exponential Random Graphs
Santos, Tássio Naia dos 09 December 2013
We study the behavior of the edge-triangle family of exponential random graphs (ERG) using Markov chain Monte Carlo methods. We compare ERG subgraph counts and edge correlations to those of the classic Binomial Random Graph (BRG, also called the Erdős-Rényi model). It is a known theoretical result that, for some parameterizations, the limits of ERG subgraph counts converge to those of BRGs as the number of vertices grows [BBS11, CD11]. We observe this phenomenon on graphs with few (20) vertices in our simulations.
Keywords: Markov chain; Combinatorics; Random graphs; Markov chain Monte Carlo; Monte Carlo
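A Metropolis sampler for an edge-triangle exponential random graph can be sketched in a few lines. The sketch assumes graph weights proportional to exp(beta_edge * edges + beta_tri * triangles), without the vertex-count scaling that some parameterizations apply to the triangle term; the function names and this parameterization are assumptions, not the thesis's implementation.

```python
import math
import random

def count_common_neighbors(adj, i, j):
    """Number of vertices adjacent to both i and j."""
    return sum(1 for k in range(len(adj)) if adj[i][k] and adj[j][k])

def toggle_log_ratio(adj, i, j, beta_edge, beta_tri):
    """Log of (new weight / old weight) if edge {i, j} is toggled.

    Adding the edge creates one edge and one triangle per common neighbor;
    removing it undoes the same amounts.
    """
    change = beta_edge + beta_tri * count_common_neighbors(adj, i, j)
    return -change if adj[i][j] else change

def sample_erg(n, beta_edge, beta_tri, steps, seed=0):
    """Run single-edge-toggle Metropolis updates starting from the empty graph."""
    rng = random.Random(seed)
    adj = [[False] * n for _ in range(n)]
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)
        log_ratio = toggle_log_ratio(adj, i, j, beta_edge, beta_tri)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            adj[i][j] = adj[j][i] = not adj[i][j]
    return adj
```

With beta_tri set to 0 the weights factor over edges, so the target distribution is a binomial random graph with edge probability exp(beta_edge) / (1 + exp(beta_edge)), which gives a simple sanity check for the sampler.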
109	Bayesian Inference Frameworks for Fluorescence Microscopy Data Analysis
January 2019
In this work, I present a Bayesian inference computational framework for the analysis of widefield microscopy data that addresses three challenges: (1) counting and localizing stationary fluorescent molecules; (2) inferring a spatially-dependent effective fluorescence profile that describes the spatially-varying rate at which fluorescent molecules emit subsequently-detected photons (due to different illumination intensities or different local environments); and (3) inferring the camera gain. My general theoretical framework utilizes the Bayesian nonparametric Gaussian and beta-Bernoulli processes with a Markov chain Monte Carlo sampling scheme, which I further specify and implement for Total Internal Reflection Fluorescence (TIRF) microscopy data, benchmarking the method on synthetic data. These three frameworks are self-contained, and can be used concurrently so that the fluorescence profile and emitter locations are both considered unknown and, under some conditions, learned simultaneously. The framework I present is flexible and may be adapted to accommodate the inference of other parameters, such as emission photophysical kinetics and the trajectories of moving molecules. My TIRF-specific implementation may find use in the study of structures on cell membranes, or in studying local sample properties that affect fluorescent molecule photon emission rates.
Masters Thesis, Applied Mathematics, 2019
Keywords: Mathematics; Statistics; Biophysics; Bayesian; beta-Bernoulli process; Gaussian process; Markov chain Monte Carlo; microscopy; superresolution
110	Enhancing Multi-model Inference with Natural Selection
Ching-Wei Cheng (7582487) 30 October 2019
Multi-model inference covers a wide range of modern statistical applications such as variable selection, model confidence sets, model averaging and variable importance. The performance of multi-model inference depends on the availability of candidate models, whose quality has rarely been studied in the literature. In this dissertation, we study the genetic algorithm (GA) as a means of obtaining high-quality candidate models. Inspired by the process of natural selection, GA performs genetic operations such as selection, crossover and mutation iteratively to update a collection of potential solutions (models) until convergence. The convergence properties are studied based on Markov chain theory and used to design an adaptive termination criterion that vastly reduces the computational cost. In addition, a new schema theory is established to characterize how the current model set is improved through the evolutionary process. Extensive numerical experiments are carried out to verify our theory and demonstrate the empirical power of GA, and new findings are obtained for two real data examples.
Keywords: Statistics; Convergence analysis; Evolvability; Genetic algorithm; Markov chain analysis; Multi-model inference; Schema theory
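The selection, crossover and mutation loop described in this abstract can be sketched generically. Each candidate model is encoded as a tuple of booleans (variable included or not) and scored by a user-supplied fitness function. The function name, the elitism step, the fixed number of generations and all default values are assumptions for illustration; the dissertation's adaptive termination criterion is not reproduced.

```python
import random

def genetic_search(fitness, n_bits, pop_size=20, generations=30,
                   mutation_rate=0.05, seed=0):
    """Generic GA over bit-tuples; returns the best model and the per-generation best fitness.

    Requires n_bits >= 2 and pop_size >= 4. Stops after a fixed number of
    generations (an assumption; no convergence test is performed).
    """
    rng = random.Random(seed)
    population = [tuple(rng.random() < 0.5 for _ in range(n_bits))
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        parents = ranked[: pop_size // 2]   # selection: keep the fitter half
        children = [ranked[0]]              # elitism: the best model survives unchanged
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)  # single-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit independently with probability mutation_rate.
            child = tuple(bit != (rng.random() < mutation_rate) for bit in child)
            children.append(child)
        population = children
    best = max(population, key=fitness)
    return best, history
```

In a variable-selection setting the fitness could be, for example, the negative of an information criterion for the model that includes the flagged variables. Because the best model is always carried over, the recorded best fitness never decreases from one generation to the next.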
