21

Statistical Methods for Characterizing Genomic Heterogeneity in Mixed Samples

Zhang, Fan 12 December 2016 (has links)
"Recently, sequencing technologies have generated massive and heterogeneous data sets. However, interpretation of these data sets is a major barrier to understand genomic heterogeneity in complex diseases. In this dissertation, we develop a Bayesian statistical method for single nucleotide level analysis and a global optimization method for gene expression level analysis to characterize genomic heterogeneity in mixed samples. The detection of rare single nucleotide variants (SNVs) is important for understanding genetic heterogeneity using next-generation sequencing (NGS) data. Various computational algorithms have been proposed to detect variants at the single nucleotide level in mixed samples. Yet, the noise inherent in the biological processes involved in NGS technology necessitates the development of statistically accurate methods to identify true rare variants. At the single nucleotide level, we propose a Bayesian probabilistic model and a variational expectation maximization (EM) algorithm to estimate non-reference allele frequency (NRAF) and identify SNVs in heterogeneous cell populations. We demonstrate that our variational EM algorithm has comparable sensitivity and specificity compared with a Markov Chain Monte Carlo (MCMC) sampling inference algorithm, and is more computationally efficient on tests of relatively low coverage (27x and 298x) data. Furthermore, we show that our model with a variational EM inference algorithm has higher specificity than many state-of-the-art algorithms. In an analysis of a directed evolution longitudinal yeast data set, we are able to identify a time-series trend in non-reference allele frequency and detect novel variants that have not yet been reported. Our model also detects the emergence of a beneficial variant earlier than was previously shown, and a pair of concomitant variants. Characterization of heterogeneity in gene expression data is a critical challenge for personalized treatment and drug resistance due to intra-tumor heterogeneity. Mixed membership factorization has become popular for analyzing data sets that have within-sample heterogeneity. In recent years, several algorithms have been developed for mixed membership matrix factorization, but they only guarantee estimates from a local optimum. At the gene expression level, we derive a global optimization (GOP) algorithm that provides a guaranteed epsilon-global optimum for a sparse mixed membership matrix factorization problem for molecular subtype classification. We test the algorithm on simulated data and find the algorithm always bounds the global optimum across random initializations and explores multiple modes efficiently. The GOP algorithm is well-suited for parallel computations in the key optimization steps. "
22

A wikification prediction model based on the combination of latent, dyadic and monadic features

Raoni Simões Ferreira 25 April 2016 (has links)
Most reference information nowadays is found in repositories of semantically linked documents, created collaboratively and freely available on the web. Among the many problems faced by content providers in these repositories, one of the most important is Wikification, that is, the placement of links in articles. These links must support user navigation and should provide a deeper semantic interpretation of the content. Wikification is a hard task, since the continuous growth of such repositories makes it increasingly demanding for editors; as a consequence, their focus is shifted away from content creation, which should be their main objective. This has motivated the design of automatic Wikification tools, which traditionally address two distinct problems: (a) how to identify which words (or phrases) in an article should be selected as anchors, and (b) how to determine to which article the link associated with each anchor should point. Most methods in the literature that address these problems are based on machine learning approaches that attempt to capture, through statistical features, characteristics of the concepts and their associations. Although these strategies treat the repository as a graph of concepts, they normally take limited advantage of the topological structure of this graph, since they describe it by means of human-engineered link statistics. Despite the effectiveness of these machine learning methods, better models could take full advantage of the topology by describing it through data-oriented approaches such as matrix factorization, as has been done successfully in other domains such as movie recommendation. In this work, we fill this gap, proposing a wikification prediction model that combines the strengths of traditional predictors based on statistical features with a latent component that models the concept graph topology by means of matrix factorization. Comparing our model with a state-of-the-art wikification method on a sample of Wikipedia articles, we obtained a gain of up to 13% in the F1 metric. We also provide a comprehensive analysis of model performance, showing the importance of the latent predictor component and of the attributes derived from associations between concepts. The study also analyzes the impact of ambiguous concepts, which allows us to conclude that the model is resilient to ambiguity even though it does not include an explicit disambiguation phase. Finally, we study the impact of selecting training samples from specific content-quality classes, information that is available in some repositories such as Wikipedia. We show empirically that the quality of the training samples affects precision and overlinking when training on random-quality samples is compared with training on high-quality samples.
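To make the combination of predictors concrete, here is a toy sketch in which a link score adds a latent dot product, obtained from factorized concept factors, to a weighted sum of hand-engineered link statistics. The feature names, weights, and sigmoid combination are illustrative assumptions, not the thesis's exact model.

```python
# Toy sketch: dyadic/monadic statistical features plus a latent term from
# matrix factorization, combined into a single link-probability score.
import numpy as np

rng = np.random.default_rng(0)
n_concepts, k = 100, 8

# Latent factors for the concept graph (in the real model these would be
# learned by factorizing the observed link matrix).
U = rng.normal(scale=0.1, size=(n_concepts, k))
V = rng.normal(scale=0.1, size=(n_concepts, k))

def link_features(i, j):
    """Hypothetical hand-engineered statistics for a candidate link i -> j."""
    return np.array([
        0.3,   # anchor commonness of the phrase
        0.7,   # textual relatedness of articles i and j
        0.1,   # in-link popularity of target j
    ])

w = np.array([1.2, 2.0, 0.5])  # weights on the statistical features

def link_score(i, j):
    """Combined score: statistical predictor + latent dot product."""
    z = w @ link_features(i, j) + U[i] @ V[j]
    return 1.0 / (1.0 + np.exp(-z))   # squash to a link probability

print(f"score(3 -> 42) = {link_score(3, 42):.3f}")
```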
23

Assessment of source-receptor relationships of aerosols: an integrated forward and backward modeling approach

Kulkarni, Sarika 01 December 2009 (has links)
This dissertation presents a scientific framework that facilitates an enhanced understanding of aerosol source-receptor (S/R) relationships and their impact on local, regional, and global air quality by employing a complementary suite of modeling methods. The receptor-oriented Positive Matrix Factorization (PMF) technique is combined with the Potential Source Contribution Function (PSCF), a trajectory ensemble model, to characterize sources influencing the aerosols measured at Gosan, Korea during spring 2001. It is found that episodic dust events originating from desert regions in East Asia (EA), which mix with pollution along the transit path, have a significant and pervasive impact on the air quality of Gosan. The intercontinental and hemispheric transport of aerosols is analyzed through a series of emission perturbation simulations with the Sulfur Transport and dEposition Model (STEM), a regional-scale chemical transport model (CTM), evaluated against observations from the 2008 NASA ARCTAS field campaign. This modeling study shows that pollution transported from regions outside North America (NA) contributed roughly 30% and 20% of NA surface sulfate and black carbon (BC) concentrations, respectively. The study also identifies aerosols transported from Europe, NA, and EA as significant contributors to springtime Arctic sulfate and BC. Trajectory ensemble models are combined with source-region-tagged tracer model output to identify the source regions and possible instances of quasi-Lagrangian sampled air masses during the 2006 NASA INTEX-B field campaign. The impact of specific Asian emission sectors during the INTEX-B period is studied with the STEM model, identifying the residential sector as a potential target for emission reductions to combat global warming. Finally, output from the STEM model, constrained with satellite-derived aerosol optical depth and ground-based measurements of single scattering albedo via an optimal interpolation assimilation scheme, is combined with the PMF technique to characterize the seasonality and regional distribution of aerosols in Asia. This analysis framework, which combines the output of source-oriented chemical transport models with receptor models, is shown to reduce the uncertainty in aerosol distributions, which in turn leads to better estimates of source-receptor relationships and improved impact assessments of aerosol radiative forcing and of the health effects of air pollution.
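Receptor-oriented PMF factorizes a samples-by-species concentration matrix into nonnegative source contributions and source profiles, weighting residuals by measurement uncertainty. The sketch below is a bare-bones numpy version using multiplicative updates; production tools such as EPA PMF add rotational controls and robust fitting, and the uncertainty model and synthetic data here are assumptions.

```python
# Bare-bones Positive Matrix Factorization: X (samples x species) is
# approximated by G (source contributions) @ F (source profiles), with
# residuals weighted by 1/uncertainty^2 via multiplicative updates.
import numpy as np

def pmf(X, S, n_factors=3, n_iter=500, eps=1e-9):
    """X: concentrations; S: uncertainties (same shape, positive)."""
    rng = np.random.default_rng(42)
    n, m = X.shape
    W = 1.0 / (S ** 2)                  # uncertainty-based weights
    G = rng.random((n, n_factors))
    F = rng.random((n_factors, m))
    for _ in range(n_iter):
        # Weighted Lee-Seung style updates keep G and F nonnegative.
        G *= ((W * X) @ F.T) / (((W * (G @ F)) @ F.T) + eps)
        F *= (G.T @ (W * X)) / ((G.T @ (W * (G @ F))) + eps)
    return G, F

X = np.abs(np.random.default_rng(1).normal(5, 2, size=(50, 10)))
S = 0.1 * X + 0.5                       # a common simple uncertainty model
G, F = pmf(X, S)
print("contributions:", G.shape, "profiles:", F.shape)
```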
24

Generalized Maximum Entropy, Convexity and Machine Learning

Sears, Timothy Dean, tim.sears@biogreenoil.com January 2008 (has links)
This thesis identifies and extends techniques that can be linked to the principle of maximum entropy (maxent) and applied to parameter estimation in machine learning and statistics. Entropy functions based on deformed logarithms are used to construct Bregman divergences, and together these represent a generalization of relative entropy. The framework is analyzed using convex analysis to characterize generalized forms of exponential family distributions. Various connections to the existing machine learning literature are discussed, and the techniques are applied to the problem of non-negative matrix factorization (NMF).
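For readers unfamiliar with the construction, the two central objects fit in a few lines: a deformed logarithm that generalizes the natural log, and the Bregman divergence generated by a convex function F. The Tsallis-style q-log family shown here is one common choice and is not necessarily the thesis's exact parameterization.

```latex
% Sketch of the construction: a deformed (Tsallis) logarithm and the
% Bregman divergence of a convex generator F; relative entropy is the
% special case F(p) = sum_i p_i log p_i.
\[
  \log_q(x) \;=\; \frac{x^{1-q} - 1}{1 - q}, \quad q \neq 1,
  \qquad \lim_{q \to 1} \log_q(x) = \log x,
\]
\[
  D_F(p \,\|\, r) \;=\; F(p) - F(r) - \langle \nabla F(r),\, p - r \rangle .
\]
```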
25

Application Of Two Receptor Models For The Investigation Of Sites Contaminated With Polychlorinated Biphenyls: Positive Matrix Factorization And Chemical Mass Balance

Demircioglu, Filiz 01 June 2010 (has links) (PDF)
This study examines the application of two receptor models, Positive Matrix Factorization (PMF) and Chemical Mass Balance (CMB), to the investigation of sites contaminated with PCBs. Both models are typically used for the apportionment of pollution sources in atmospheric pollution studies, but they have gained popularity in the last decade for the investigation of PCBs in soils and sediments. The aim of the study is four-fold: (i) to identify the status of PCB pollution in the Lake Eymir area via sampling and analysis of PCBs in collected soil/sediment samples; (ii) to modify the CMB model software in terms of efficiency and user-friendliness; (iii) to apply the CMB model to the Lake Eymir area PCB data to apportion the sources and to gather preliminary information on the degradation of PCBs, considering the history of pollution in the area; and (iv) to explore the use of PMF for both source apportionment and investigation of the fate of PCBs in the environment via Monte Carlo-simulated artificial data sets. Total PCB concentrations (Aroclor based) were found to range from below the detection limit to 76.3 ng/g dw, with a median of 1.7 ng/g dw, for samples collected from the channel between Lake Mogan and Lake Eymir. Application of the CMB model yielded contributions from highly chlorinated PCB mixtures (Aroclor 1254 and Aroclor 1260, typically used in transformers) as sources. The modified CMB software provides a more efficient and user-friendly working environment. Two uncertainty equations, one newly developed and one from the literature, were found to be effective for better resolution of sources by the PMF model.
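At its core, CMB fits a measured congener profile as a nonnegative combination of known source profiles, which can be posed as nonnegative least squares. The sketch below uses made-up stand-in profiles; a real application would use measured Aroclor congener patterns and propagate the uncertainty equations the study discusses.

```python
# Chemical Mass Balance sketch: apportion a measured PCB congener profile
# to known source profiles via nonnegative least squares. The profiles
# below are hypothetical stand-ins, not real Aroclor congener data.
import numpy as np
from scipy.optimize import nnls

# Columns: fraction of each congener in two hypothetical source profiles
# (stand-ins for Aroclor 1254 and Aroclor 1260).
source_profiles = np.array([
    [0.30, 0.05],
    [0.25, 0.15],
    [0.20, 0.30],
    [0.15, 0.25],
    [0.10, 0.25],
])
measured = np.array([0.18, 0.20, 0.25, 0.20, 0.17])  # receptor sample

contributions, residual = nnls(source_profiles, measured)
print("source contributions:", contributions, "residual norm:", residual)
```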
26

Probabilistic Matrix Factorization Based Collaborative Filtering With Implicit Trust Derived From Review Ratings Information

Ercan, Eda 01 September 2010 (has links) (PDF)
Recommender systems aim to suggest relevant items that are likely to be of interest to users, drawing on a variety of information resources such as user profiles.
27

Examination of Initialization Techniques for Nonnegative Matrix Factorization

Frederic, John 21 November 2008 (has links)
While much research has been done on different Nonnegative Matrix Factorization (NMF) algorithms, less time has been spent examining initialization techniques. In this thesis, four different initializations are considered. After a brief discussion of NMF, the four initializations are described and each is examined independently, followed by a comparison of the techniques. Next, each initialization's performance is investigated with respect to changes in the size of the data set. Finally, a method by which smaller data sets may be used to determine how to treat larger data sets is examined.
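As a quick hands-on illustration of why initialization matters, the sketch below compares random initialization with NNDSVD, the two options readily available in scikit-learn, on the same synthetic matrix; the thesis considers four initializations, and the data and hyperparameters here are arbitrary assumptions.

```python
# Compare two NMF initializations on the same nonnegative matrix and
# report the final reconstruction error each one reaches.
import numpy as np
from sklearn.decomposition import NMF

X = np.abs(np.random.default_rng(0).normal(size=(100, 40)))

for init in ("random", "nndsvd"):
    model = NMF(n_components=5, init=init, max_iter=500, random_state=0)
    model.fit(X)
    print(f"{init:>7}: reconstruction error = {model.reconstruction_err_:.4f}")
```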
28

Data Privacy Preservation in Collaborative Filtering Based Recommender Systems

Wang, Xiwei 01 January 2015 (has links)
This dissertation studies data privacy preservation in collaborative filtering based recommender systems and proposes several collaborative filtering models that aim at preserving user privacy from different perspectives. An empirical study of multiple classical recommendation algorithms presents the basic idea of the models and explores their performance on real-world datasets. The algorithms investigated include a popularity based model, an item similarity based model, a singular value decomposition based model, and a bipartite graph model; top-N recommendations are evaluated to examine prediction accuracy. It is apparent that with more customer preference data, recommender systems can better profile customers' shopping patterns, which in turn produces product recommendations with higher accuracy, but precautions must be taken to address the privacy issues that arise when data is shared between two vendors. The study shows that matrix factorization techniques are, by their nature, well suited to data privacy preservation. In this dissertation, singular value decomposition (SVD) and nonnegative matrix factorization (NMF) are adopted as the fundamental collaborative filtering techniques for making privacy-preserving recommendations. The proposed SVD based model uses missing value imputation, a randomization technique, and truncated SVD to perturb the raw rating data. The NMF based models, iAux-NMF and iCluster-NMF, take into account auxiliary information about users and items to aid missing value imputation and privacy preservation; these models also support efficient incremental data updates. Many online vendors allow people to leave feedback on products, which constitutes users' public preferences. However, because of the connections between users' public and private preferences, if a recommender system fails to distinguish real customers from attackers, the private preferences of real customers can be exposed. This dissertation addresses an attack model in which an attacker holds real customers' partial ratings and tries to obtain their private preferences by deceiving the recommender system. To resolve this problem, trustworthiness information is incorporated into NMF based collaborative filtering to detect attackers and to make suitably different recommendations to normal users and to attackers. By doing so, users' private preferences can be effectively protected.
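The SVD-based perturbation pipeline described above, imputation, randomization, then rank truncation, can be sketched in a few lines of numpy. The noise scale, imputation rule, and rank below are illustrative assumptions, not the dissertation's tuned settings.

```python
# Sketch of privacy-preserving rating perturbation: impute missing
# ratings, add random noise, then keep only a truncated SVD before
# sharing the matrix with another party.
import numpy as np

rng = np.random.default_rng(0)
R = rng.integers(0, 6, size=(200, 50)).astype(float)  # 0 = missing rating
mask = R > 0

# 1) Impute missing entries with each item's observed mean (fallback 3.0).
item_means = np.where(mask.any(0), R.sum(0) / np.maximum(mask.sum(0), 1), 3.0)
R_filled = np.where(mask, R, item_means)

# 2) Randomization: additive Gaussian noise masks individual ratings.
R_noisy = R_filled + rng.normal(scale=0.5, size=R.shape)

# 3) Truncated SVD: retain rank-k structure, discard the rest.
U, s, Vt = np.linalg.svd(R_noisy, full_matrices=False)
k = 10
R_shared = U[:, :k] * s[:k] @ Vt[:k]   # the perturbed matrix to share

print("shared matrix rank:", np.linalg.matrix_rank(R_shared))
```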
29

Nonnegative matrix factorization for clustering

Kuang, Da 27 August 2014 (has links)
This dissertation shows that nonnegative matrix factorization (NMF) can be extended into a general and efficient clustering method. Clustering is one of the fundamental tasks in machine learning and is useful for unsupervised knowledge discovery in a variety of applications such as text mining and genomic analysis. NMF is a dimension reduction method that approximates a nonnegative matrix by the product of two lower-rank nonnegative matrices, and it has shown great promise as a clustering method when a data set is represented as a nonnegative data matrix. However, challenges to the widespread use of NMF as a clustering method lie in its correctness and efficiency: first, we need to know why and when NMF can detect the true clusters and be guaranteed to deliver good clustering quality; second, existing algorithms for computing NMF are expensive and often take longer than other clustering methods. We show that the original NMF can be improved in both respects in the context of clustering. Our new NMF-based clustering methods achieve better clustering quality and run orders of magnitude faster than the original NMF and other clustering methods. Like other clustering methods, NMF places an implicit assumption on the cluster structure; thus, the success of NMF as a clustering method depends on whether the representation of the data in a vector space satisfies that assumption. Our approach to extending the original NMF to a general clustering method is to switch from a vector space representation of the data points to a graph representation. The new formulation, called Symmetric NMF, takes a pairwise similarity matrix as input and can be viewed as a graph clustering method. We evaluate this method on document clustering and image segmentation problems and find that it achieves better clustering accuracy. In addition, for the original NMF, choosing the right number of clusters is difficult but important. We show that the consensus NMF widely used in genomic analysis for choosing the number of clusters has critical flaws and can produce misleading results. We propose a variation of the prediction strength measure, arising from statistical inference, to evaluate the stability of clusters and select the right number of clusters; our measure shows promising performance in simulation experiments. Large-scale applications bring substantial efficiency challenges to existing algorithms for computing NMF. An important example is topic modeling, where users want to uncover the major themes in a large text collection. Our strategy for accelerating NMF-based clustering is to design algorithms that better suit the computer architecture and exploit the computing power of parallel platforms such as graphics processing units (GPUs). A key observation is that recursively applying rank-2 NMF, which partitions a data set into two clusters, is much faster than applying the original NMF to obtain a flat clustering. We take advantage of a special property of rank-2 NMF and design an algorithm that runs faster than existing algorithms thanks to contiguous memory access. Combined with a criterion for stopping the recursion, our hierarchical clustering algorithm runs significantly faster and achieves even better clustering quality than existing methods. Another bottleneck of NMF algorithms, common to many other machine learning applications, is multiplying a large sparse data matrix by a tall-and-skinny dense matrix. We use GPUs to accelerate this routine for sparse matrices with irregular sparsity structure. Overall, our algorithm shows significant improvement over popular topic modeling methods such as latent Dirichlet allocation, and runs more than 100 times faster on data sets with millions of documents.
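The recursive rank-2 idea can be illustrated compactly: split a data set in two with a rank-2 NMF, then recurse on each side. The sketch below uses scikit-learn's NMF and a simple depth limit in place of the dissertation's model-based stopping criterion and optimized rank-2 solver, so treat it as a structural outline only.

```python
# Hierarchical clustering by recursive rank-2 NMF: each split assigns
# every row to its dominant factor, then recurses on the two halves.
import numpy as np
from sklearn.decomposition import NMF

def rank2_split(X, rows):
    W = NMF(n_components=2, init="nndsvd", max_iter=300,
            random_state=0).fit_transform(X[rows])
    return W.argmax(axis=1)          # side 0 or 1 for each row

def hier_cluster(X, rows, depth, leaves):
    if depth == 0 or len(rows) < 4:  # placeholder stopping rule
        leaves.append(rows)
        return
    side = rank2_split(X, rows)
    for c in (0, 1):
        sub = rows[side == c]
        if len(sub):                 # skip empty splits
            hier_cluster(X, sub, depth - 1, leaves)

X = np.abs(np.random.default_rng(0).normal(size=(64, 30)))
leaves = []
hier_cluster(X, np.arange(64), depth=3, leaves=leaves)
print("leaf cluster sizes:", [len(l) for l in leaves])
```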
30

Extending low-rank matrix factorizations for emerging applications

Zhou, Ke 13 January 2014 (has links)
Low-rank matrix factorizations have become increasingly popular for projecting high dimensional data into latent spaces of small dimension in order to obtain a better understanding of the data and thus more accurate predictions. In particular, they have been widely applied to important applications such as collaborative filtering and social network analysis. In this thesis, I investigate applications and extensions of the ideas of low-rank matrix factorization to solve several practically important problems arising from collaborative filtering and social network analysis. A key challenge in recommender system research is how to effectively profile new users, a problem generally known as cold-start recommendation. In the first part of this work, we extend low-rank matrix factorization by allowing the latent factors to have more complex structures, namely decision trees, to solve the problem of cold-start recommendation. In particular, we present functional matrix factorization (fMF), a novel cold-start recommendation method that solves the problem of adaptive interview construction on the basis of low-rank matrix factorizations. The second part of this work considers the efficiency of making recommendations over large user and item spaces. Specifically, we address the problem by learning binary codes for collaborative filtering, which can be viewed as restricting the latent factors in low-rank matrix factorizations to be binary vectors representing the binary codes for both users and items. In the third part of this work, we investigate applications of low-rank matrix factorizations in the context of social network analysis. Specifically, we propose a convex optimization approach to discovering the hidden network of social influence, with low-rank and sparse structure, by modeling the recurrent events at different individuals as multi-dimensional Hawkes processes, emphasizing the mutually exciting nature of the dynamics of event occurrences. The proposed framework combines the estimation of mutually exciting processes and low-rank matrix factorization in a principled manner. In the fourth part of this work, we estimate the triggering kernels of the Hawkes process. In particular, we focus on estimating the triggering kernels from an infinite-dimensional functional space via the Euler-Lagrange equation, which can be viewed as applying the idea of low-rank factorization in function space.
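The social-influence part rests on the multivariate Hawkes intensity, where past events by one user raise the event rate of others through an influence matrix constrained to be low rank. The sketch below evaluates that intensity with an exponential kernel; the kernel shape, parameter values, and the factorized form A = B @ C are illustrative assumptions rather than the thesis's fitted model.

```python
# Multivariate Hawkes intensity with a low-rank influence matrix A = B @ C:
# lambda_u(t) = mu_u + sum over past events (s, v) of A[v, u] * exp(-beta*(t - s)).
import numpy as np

rng = np.random.default_rng(0)
n_users, k = 20, 3
B = rng.random((n_users, k)) * 0.1
C = rng.random((k, n_users)) * 0.1
A = B @ C                       # low-rank mutual-excitation matrix
mu = np.full(n_users, 0.05)     # baseline intensities
beta = 1.0                      # exponential kernel decay rate

def intensity(t, events):
    """Intensity of every user at time t given past (time, user) events."""
    lam = mu.copy()
    for s, v in events:
        if s < t:
            lam += A[v] * np.exp(-beta * (t - s))  # row v: influence of user v
    return lam

events = [(0.5, 2), (1.1, 7), (1.8, 2)]
print("intensity at t=2.0 (first 5 users):", intensity(2.0, events)[:5])
```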
