11

A Study of Machine Learning Approaches for Biomedical Signal Processing

Shen, Minjie 10 June 2021 (has links)
The introduction of high-throughput molecular profiling technologies provides the capability of studying diverse biological systems at the molecular level. However, due to various limitations of measurement instruments, data preprocessing is often required in biomedical research. Improper preprocessing will have a negative impact on downstream analytics tasks. This thesis studies two important preprocessing topics: missing value imputation and between-sample normalization. Missing data is a major issue in quantitative proteomics data analysis. While many methods have been developed for imputing missing values in high-throughput proteomics data, comparative assessment of the accuracy of existing methods remains inconclusive, mainly because the true missing mechanisms are complex and the existing evaluation methodologies are imperfect. Moreover, few studies have provided an outlook of current and future development. We first report an assessment of eight representative methods collectively targeting three typical missing mechanisms. The selected methods are compared on both realistic simulation and real proteomics datasets, and the performance is evaluated using three quantitative measures. We then discuss fused regularization matrix factorization, a popular low-rank matrix factorization framework with similarity and/or biological regularization, which is extendable to integrating multi-omics data such as gene expressions or clinical variables. We further explore the potential application of convex analysis of mixtures, a biologically inspired latent variable modeling strategy, to missing value imputation. The preliminary results on proteomics data are provided together with an outlook into future development directions. While a few winners emerged from our comparative assessment, data-driven evaluation of imputation methods is imperfect because performance is evaluated indirectly on artificial missing or masked values, not authentic missing values. Imputation accuracy may vary with signal intensity. Fused regularization matrix factorization provides a possibility of incorporating external information. Convex analysis of mixtures presents a biologically plausible new approach. Data normalization is essential to ensure accurate inference and comparability of gene expressions across samples or conditions. Ideally, gene expressions should be rescaled based on consistently expressed reference genes. However, for normalizing biologically diverse samples, the most commonly used reference genes have exhibited striking expression variability, and distribution-based approaches can be problematic when differentially expressed genes are significantly asymmetric. We introduce a Cosine score based iterative normalization (Cosbin) strategy to normalize biologically diverse samples. The between-sample normalization is based on iteratively identified consistently expressed genes, where differentially expressed genes are sequentially eliminated according to scale-invariant Cosine scores. We evaluate the performance of Cosbin and four other representative normalization methods (Total count, TMM/edgeR, DESeq2, DEGES/TCC) on both idealistic and realistic simulation data sets. Cosbin consistently outperforms the other methods across various performance criteria. Implemented in open-source R scripts and applicable to grouped or individual samples, the Cosbin tool will allow biologists to detect subtle yet important molecular signals across known or novel phenotypic groups.
/ Master of Science / Data preprocessing is often required in biomedical research due to various limitations of measurement instruments. This thesis studies two important preprocessing topics: missing value imputation and between-sample normalization. Missing data is a major issue in quantitative proteomics data analysis. Imputation is the process of substituting estimates for missing values. We propose a more realistic assessment workflow that preserves the original data distribution, and then assess eight representative general-purpose imputation strategies. We explore two biologically inspired imputation approaches: fused regularization matrix factorization (FRMF) and convex analysis of mixtures (CAM) imputation. FRMF integrates external information such as clinical variables and multi-omics data into imputation, while CAM imputation incorporates biological assumptions. We show that the integration of biological information improves imputation performance. Data normalization is required to ensure correct comparison. For gene expression data, between-sample normalization is needed. We propose a Cosine score based iterative normalization (Cosbin) strategy to normalize biologically diverse samples. We show that Cosbin significantly outperforms other methods in both ideal and realistic simulations. Implemented in open-source R scripts and applicable to grouped or individual samples, the Cosbin tool will allow biologists to detect subtle yet important molecular signals across known or novel cell types.
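The abstract above does not reproduce the Cosbin algorithm itself (the authors distribute it as open-source R scripts). The following Python sketch only illustrates the general mechanism it describes, iteratively eliminating differentially expressed genes by a scale-invariant cosine score and rescaling samples on the remaining consistently expressed genes; the function name, drop fraction, and iteration count are illustrative assumptions, not the published method.

```python
import numpy as np

def cosbin_like_normalize(counts, groups, n_iter=10, drop_frac=0.1):
    """Sketch of cosine-score-based iterative normalization.

    counts : (genes x samples) nonnegative expression matrix
    groups : array of group labels, one per sample
    """
    X = counts.astype(float).copy()
    keep = np.arange(X.shape[0])          # candidate consistently expressed genes
    for _ in range(n_iter):
        # per-group mean expression for the candidate genes
        means = np.column_stack([X[keep][:, groups == g].mean(axis=1)
                                 for g in np.unique(groups)])
        # scale-invariant cosine score: similarity to the all-ones direction,
        # i.e. how uniformly a gene is expressed across groups
        ones = np.ones(means.shape[1])
        score = means @ ones / (np.linalg.norm(means, axis=1) * np.linalg.norm(ones) + 1e-12)
        # drop the genes that look most differentially expressed
        n_drop = max(1, int(drop_frac * keep.size))
        keep = keep[np.argsort(score)[n_drop:]]
        # rescale each sample so the candidate genes carry equal total signal
        ref = X[keep].sum(axis=0)
        X = X * (ref.mean() / ref)
    return X, keep

# toy usage: 100 genes, 6 samples in two groups
rng = np.random.default_rng(0)
counts = rng.poisson(20, size=(100, 6)).astype(float)
normalized, consistent_genes = cosbin_like_normalize(counts, np.array([0, 0, 0, 1, 1, 1]))
```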
12

BINARY MATRIX FACTORIZATION POST-PROCESSING AND APPLICATIONS

GEORGES MIRANDA SPYRIDES 06 February 2024 (has links)
Novel methods for matrix factorization introduce constraints on the decomposed matrices, allowing for unique kinds of analysis. One significant modification is binary matrix factorization for binary matrices. This technique can reveal common subsets and mixtures of subsets, making it useful in a variety of applications, such as market basket analysis, topic modeling, and recommendation systems. Despite the advantages, current approaches face a trade-off between accuracy, scalability, and explainability. While gradient descent-based methods are scalable, they yield high reconstruction errors when thresholded to binary matrices. Conversely, heuristic methods are not scalable. To overcome this, this thesis proposes a post-processing procedure for discretizing matrices obtained by gradient descent. This novel approach recovers the reconstruction error after thresholding and successfully processes larger matrices within a reasonable timeframe. We apply this technique to several applications, including a novel pipeline for discovering and visualizing patterns in petrochemical batch processes.
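The thesis's actual post-processing procedure is not detailed in the abstract. The sketch below only reproduces the baseline problem it targets: factorize a binary matrix with a continuous method (plain scikit-learn NMF stands in for the gradient-descent factorizer) and then threshold the factors, which typically inflates the reconstruction error that such post-processing is meant to recover. The 0.5 threshold and all names are assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)
# synthetic binary data generated from k latent binary patterns
B_true = (rng.random((200, 8)) < 0.2).astype(int)
C_true = (rng.random((8, 50)) < 0.3).astype(int)
A = ((B_true @ C_true) > 0).astype(float)        # Boolean product -> binary matrix

# continuous factorization (stand-in for a gradient-descent factorizer)
model = NMF(n_components=8, init="nndsvda", max_iter=500, random_state=0)
B = model.fit_transform(A)
C = model.components_

def binary_error(A, B, C):
    """Fraction of entries that disagree after Boolean reconstruction."""
    return np.mean(A != ((B @ C) > 0.5))

continuous_err = np.linalg.norm(A - B @ C) / np.linalg.norm(A)
thresholded_err = binary_error(A, (B > 0.5).astype(float), (C > 0.5).astype(float))
print(f"relative continuous error: {continuous_err:.3f}")
print(f"binary disagreement after naive 0.5 thresholding: {thresholded_err:.3f}")
```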
13

Candidate-job recommendation system: Building a prototype of a machine-learning-based recommendation system for an online recruitment company

Hafizovic, Nedzad January 2019 (has links)
Recommendation systems are gaining popularity because of the complexity of the problems they solve. They have many applications all around us. Implementations of these systems differ, and two approaches stand out: systems that do not use Machine Learning and systems that do. The second approach, used in this project, is based on Machine Learning collaborative filtering techniques. These techniques include numerous algorithms and data processing methods. This document describes the process of building a job recommendation system for the recruitment industry, from data acquisition to the final result. The data used in the project were collected from Pitchler AB, a company that provides an online recruitment platform. The result of this project is a machine-learning-based recommendation system used as the engine for the Pitchler AB IT recruitment platform.
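Pitchler AB's actual engine is not described in the abstract. As a generic illustration of matrix-factorization-based collaborative filtering of the kind mentioned above, the following sketch builds latent factors from an implicit candidate-job interaction matrix and ranks unseen jobs for one candidate; the data and names are purely illustrative.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix

# toy implicit-feedback matrix: rows are candidates, columns are job ads,
# entries are 1 when a candidate applied to / liked a job
rng = np.random.default_rng(2)
interactions = csr_matrix((rng.random((300, 80)) < 0.05).astype(float))

# latent-factor model: project candidates and jobs into a shared k-dim space
svd = TruncatedSVD(n_components=16, random_state=0)
user_factors = svd.fit_transform(interactions)            # (candidates x k)
item_factors = svd.components_.T                          # (jobs x k)

def recommend(user_idx, top_n=5):
    """Rank unseen jobs for one candidate by predicted affinity."""
    scores = user_factors[user_idx] @ item_factors.T
    seen = interactions[user_idx].toarray().ravel() > 0
    scores[seen] = -np.inf                                 # do not re-recommend seen jobs
    return np.argsort(scores)[::-1][:top_n]

print(recommend(0))
```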
14

CONTEXT AWARE PRIVACY PRESERVING CLUSTERING AND CLASSIFICATION

Thapa, Nirmal 01 January 2013 (has links)
Data are valuable assets to any organization or individual. Data are a source of useful information, which is a big part of decision making. All sectors have the potential to benefit from information; commerce, health, and research are some of the fields that have benefited from data. On the other hand, the availability of data makes it easy for anyone to exploit it, and in many cases the data are private and confidential. It is therefore necessary to preserve the confidentiality of the data. We study two categories of privacy: Data Value Hiding and Data Pattern Hiding. Privacy is a huge concern, but equally important is the concern of data utility. Data should avoid privacy breaches yet remain usable. Although these two objectives are contradictory and achieving both at the same time is challenging, knowledge of the purpose and the manner in which the data will be utilized helps. In this research, we focus on particular situations for clustering and classification problems and strive to balance the utility and privacy of the data. In the first part of this dissertation, we propose Nonnegative Matrix Factorization (NMF) based techniques that accommodate constraints defined explicitly in the update rules. These constraints determine how the factorization takes place, leading to favorable results. The methods are designed to alter the matrices such that user-specified cluster properties are introduced, and they can be used to preserve data value as well as data pattern. As NMF and K-means are proven to be equivalent, NMF is an ideal choice for pattern hiding in clustering problems. In addition to the NMF based methods, we propose methods that take into account the data structures and the attribute properties for classification problems. We separate this work into two parts, linear classifiers and nonlinear classifiers, and propose a different solution for each. We also study the effect of distortion on the utility of data and propose three distortion measurement metrics which demonstrate better characteristics than the traditional metrics. The effectiveness of the measures is examined on different benchmark datasets. The results show that the methods have desirable properties such as invariance to translation, rotation, and scaling.
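The dissertation's specific constraint terms are not given in the abstract. To show where "constraints defined explicitly in the update rules" would act, the sketch below implements only the standard NMF multiplicative update rules (Lee and Seung); the constraint-aware modifications are deliberately omitted, and all names are illustrative.

```python
import numpy as np

def nmf_multiplicative(A, k, n_iter=200, eps=1e-9, seed=None):
    """Plain NMF via multiplicative updates minimizing ||A - WH||_F^2.

    Constraint-aware variants (as in the dissertation) would add extra
    terms to these two update rules; they are omitted here.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)   # multiplicative update for H
        W *= (A @ H.T) / (W @ H @ H.T + eps)   # multiplicative update for W
    return W, H

A = np.abs(np.random.default_rng(3).normal(size=(60, 40)))
W, H = nmf_multiplicative(A, k=5, seed=0)
print(np.linalg.norm(A - W @ H) / np.linalg.norm(A))   # relative reconstruction error
```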
15

Provable Methods for Non-negative Matrix Factorization

Pani, Jagdeep January 2016 (has links) (PDF)
Nonnegative matrix factorization (NMF) is an important data-analysis problem which concerns factoring a given d × n matrix A with nonnegative entries into matrices B and C, where B and C are d × k and k × n with nonnegative entries. It has numerous applications including object recognition, topic modelling, hyperspectral imaging, music transcription, etc. In general, NMF is intractable and several heuristics exist to solve the problem. Recently there has been interest in investigating conditions under which NMF can be tractably recovered. We note that existing attempts make unrealistic assumptions and often the associated algorithms tend not to be scalable. In this thesis, we make three major contributions. First, we formulate a model of NMF with assumptions which are natural and constitute a substantial weakening of separability. Unlike requiring a bound on the error in each column of (A − BC), as was done in much of the previous work, our assumptions are about aggregate errors, namely that the spectral norm of (A − BC), i.e. ||A − BC||_2, should be low. This is a much weaker error assumption, and the associated B, C would be much more resilient than in existing models. Second, we describe a robust polynomial time SVD-based algorithm, UTSVD, with realistic provable error guarantees that can handle higher levels of noise than previous algorithms. Indeed, we show experimentally that existing NMF models, which are based on separability assumptions, degrade much faster than UTSVD in the presence of noise. Furthermore, when the data has dominant features, UTSVD significantly outperforms existing models. On real-life datasets we again see a similar outperformance of UTSVD on clustering tasks. Finally, under a weaker model, we prove a robust version of uniqueness of NMF, where again the word "robust" refers to realistic error bounds.
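As a concrete reading of the error assumption, the spectral norm ||A − BC||_2 bounds the aggregate error rather than the error in every column. The short sketch below (not the thesis's UTSVD algorithm) simply contrasts the two quantities on a noisy synthetic factorization; dimensions and noise level are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, n = 100, 5, 200
B = np.abs(rng.normal(size=(d, k)))
C = np.abs(rng.normal(size=(k, n)))
noise = 0.05 * rng.normal(size=(d, n))
A = B @ C + noise

residual = A - B @ C
spectral_err = np.linalg.norm(residual, 2)                 # aggregate (spectral norm) error
per_column_err = np.linalg.norm(residual, axis=0).max()    # worst single-column error
print(f"||A - BC||_2      = {spectral_err:.3f}")
print(f"max column error  = {per_column_err:.3f}")
```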
16

Iterative Matrix Factorization Method for Social Media Data Location Prediction

Suaysom, Natchanon 01 January 2018 (has links)
The locations at which users posted their tweets, as collected by social media companies, have varying accuracy, and some are missing. We want to use the tweets with the highest accuracy to help fill in the data for the tweets with incomplete information. To test our algorithm, we used social media data from a city and separated it into training sets, where all the information is known, and testing sets, where we intentionally pretend not to know the location. One prediction method, used in (Dukler, Han and Wang, 2016), appends a one-hot encoding of the location to the bag-of-words matrix and performs Location Oriented Nonnegative Matrix Factorization (LONMF). We improve on this algorithm by introducing iterative LONMF. We found that when the threshold and number of iterations are chosen correctly, we can predict tweet locations with higher accuracy than using LONMF.
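A minimal sketch of the LONMF idea described above: append a one-hot encoding of the (partially hidden) locations to the bag-of-words matrix, factorize, and read each predicted location off the reconstruction. The iterative refinement proposed in the thesis is not reproduced; on this random toy data the accuracy will hover around chance, and all names are illustrative.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(5)
n_tweets, n_words, n_locations = 200, 50, 4

words = rng.poisson(1.0, size=(n_tweets, n_words)).astype(float)   # bag-of-words counts
locations = rng.integers(0, n_locations, size=n_tweets)
one_hot = np.eye(n_locations)[locations]

# hide the location of the last 50 tweets (the "test" tweets)
masked = one_hot.copy()
masked[-50:] = 0.0

# LONMF-style: factorize [bag-of-words | one-hot location] jointly
X = np.hstack([words, masked])
model = NMF(n_components=8, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(X)
H = model.components_

# predicted location = largest reconstructed entry among the location columns
reconstructed = W @ H
predicted = reconstructed[:, n_words:].argmax(axis=1)
accuracy = (predicted[-50:] == locations[-50:]).mean()
print(f"held-out location accuracy: {accuracy:.2f}")
```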
17

Topic Analysis of Hidden Trends in Patented Features Using Nonnegative Matrix Factorization

Lin, Yicong 01 January 2016 (has links)
Intellectual property has gained more attention in recent decades because innovations have become one of the most important resources. This paper implements a probabilistic topic model using nonnegative matrix factorization (NMF) to discover some of the key elements in computer patents as the industry grew from 1990 to 2009. This paper proposes a new "shrinking model" based on NMF and also performs a close examination of some variations of the base model. Note that rather than studying strategies to pick the optimal number of topics (the "rank"), this paper is particularly interested in which factorization methods (including different kinds of initialization) are able to construct "topics" of the best quality given a predetermined rank. Applying NMF to the description text of patent features, we observe key topics such as "platform" and "display" emerge with a strong presence across all years, but we also see other short-lived yet significant topics, such as "power" and "heat", which signify the saturation of the industry.
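A generic sketch of the base pipeline (not the paper's "shrinking model"): TF-IDF over short feature descriptions followed by NMF, printing the top words per topic. The corpus and parameter choices below are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# stand-in corpus; in the paper this would be patent feature descriptions
docs = [
    "display panel backlight pixel resolution",
    "platform operating system software interface",
    "power supply battery heat dissipation",
    "display touch screen pixel brightness",
    "platform network interface protocol software",
    "heat sink fan power thermal management",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

nmf = NMF(n_components=3, init="nndsvd", random_state=0)
doc_topic = nmf.fit_transform(X)          # document-topic weights
topic_word = nmf.components_              # topic-word weights

vocab = np.array(tfidf.get_feature_names_out())
for t, weights in enumerate(topic_word):
    top = vocab[np.argsort(weights)[::-1][:4]]
    print(f"topic {t}: {' '.join(top)}")
```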
18

Recommender system for recipes

Goda, Sai Bharath January 1900 (has links)
Master of Science / Department of Computing and Information Sciences / Daniel A. Anderson / Most e-commerce websites, such as Amazon, eBay, hotel booking sites, and TripAdvisor, use recommender systems to recommend products to their users. Some use the history of all users to predict what kind of products the current user may like (collaborative filtering), and some use knowledge of the products the user is interested in to make recommendations (content-based filtering). Amazon, for example, uses both kinds of techniques. These recommendation systems can be represented in the form of a graph where the nodes are users and products and the edges connect users to products. The aim of this project is to build a recommender system for recipes using data from allrecipes.com. Allrecipes.com is a popular website used throughout the world to post, review, and rate recipes. To understand the data set one needs to know how recipes are posted and rated on allrecipes.com; the details are given in this paper. The network of allrecipes.com consists of users, recipes, and ingredients. The aim of this research project is to study in depth two algorithms, adsorption and matrix factorization, which are usually evaluated on homogeneous networks, and to try them on heterogeneous networks and analyze the results. This project also studies an algorithm used to propagate influence from one network to another. To learn from one network and propagate the same information to another, we compute flow (the influence of one network on another) as described in [7]. The paper introduces a variant of adsorption that takes the flow values into account and tries to make recommendations in the user-recipe and user-ingredient networks. The results of this variant are analyzed in depth in this paper.
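The flow-based adsorption variant introduced in the report is not reproduced here. The sketch below only shows a basic adsorption-style label propagation on a tiny user-recipe graph, where every node repeatedly mixes its injected labels with the weighted average of its neighbours' label distributions; the graph, weights, and the 0.5 mixing constant are illustrative.

```python
import numpy as np

# tiny bipartite user-recipe graph: rows are 3 users, columns are 4 recipes,
# entries are ratings (0 = not rated)
ratings = np.array([
    [5, 0, 3, 0],
    [0, 4, 0, 2],
    [4, 0, 0, 5],
], dtype=float)

n_users, n_recipes = ratings.shape
n = n_users + n_recipes

# symmetric adjacency over all nodes (users 0-2, recipes 3-6)
adj = np.zeros((n, n))
adj[:n_users, n_users:] = ratings
adj[n_users:, :n_users] = ratings.T

# injected label distributions: each recipe node injects its own identity label
injected = np.zeros((n, n_recipes))
injected[n_users:] = np.eye(n_recipes)

propagated = injected.copy()
for _ in range(20):
    # adsorption-style step: mix each node's injected labels with the
    # weighted average of its neighbours' current label distributions
    neighbour_avg = adj @ propagated / (adj.sum(axis=1, keepdims=True) + 1e-12)
    propagated = 0.5 * injected + 0.5 * neighbour_avg

# a user's row now scores recipes, including ones the user never rated
print(propagated[:n_users].round(3))
```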
19

A wikification prediction model based on the combination of latent, dyadic and monadic features

Ferreira, Raoni Simões 25 April 2016 (has links)
Most reference information nowadays is found in repositories of semantically linked documents, created in a collaborative fashion and freely available on the web. Among the many problems faced by content providers in these repositories, one of the most important is Wikification, that is, the placement of links in the articles. These links have to support user navigation and should provide a deeper semantic interpretation of the content. Wikification is a hard task since the continuous growth of such repositories makes it increasingly demanding for editors. As a consequence, their focus is shifted away from content creation, which should be their main objective. This has motivated the design of automatic Wikification tools which, traditionally, address two distinct problems: (a) how to identify which words (or phrases) in an article should be selected as anchors and (b) how to determine to which article the link associated with each anchor should point. Most of the methods in the literature that address these problems are based on machine learning approaches which attempt to capture, through statistical features, characteristics of the concepts and their associations. Although these strategies handle the repository as a graph of concepts, they normally take limited advantage of the topological structure of this graph, as they describe it by means of human-engineered link statistics. Despite the effectiveness of these machine learning methods, better models could take full advantage of the information topology by describing it with data-oriented approaches such as matrix factorization. This has indeed been done successfully in other domains, such as movie recommendation. In this work, we fill this gap, proposing a wikification prediction model that combines the strengths of traditional predictors based on statistical features with a latent component which models the concept graph topology by means of matrix factorization. By comparing our model with a state-of-the-art wikification method on a sample of Wikipedia articles, we obtained a gain of up to 13% in the F1 metric. We also provide a comprehensive analysis of the model performance, showing the importance of the latent predictor component and of the attributes derived from the associations between the concepts. The study also includes an analysis of the impact of ambiguous concepts, which allows us to conclude that the model is resilient to ambiguity even though it does not include any explicit disambiguation phase. We finally study the impact of selecting training samples from specific content quality classes, information that is available in some repositories, such as Wikipedia. We empirically show that the quality of the training samples impacts precision and overlinking when comparing training performed on random-quality samples versus high-quality samples.
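A toy sketch of the combination strategy described above (not the thesis's actual model): a latent score obtained by factorizing a concept co-link matrix is combined with a hand-crafted dyadic statistic, and the two are fed into a logistic regression that predicts whether a candidate link is correct. All data, features, and names are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from scipy.sparse import csr_matrix

rng = np.random.default_rng(7)
n_concepts = 300

# latent component: factorize a synthetic concept-concept co-link matrix
colinks = csr_matrix((rng.random((n_concepts, n_concepts)) < 0.02).astype(float))
svd = TruncatedSVD(n_components=16, random_state=0)
embeddings = svd.fit_transform(colinks)

# candidate (anchor concept, target concept) pairs with a hand-crafted
# "commonness" feature and a binary label saying whether the link is correct
pairs = rng.integers(0, n_concepts, size=(2000, 2))
commonness = rng.random(2000)                              # stand-in dyadic statistic
latent_score = np.einsum("ij,ij->i", embeddings[pairs[:, 0]], embeddings[pairs[:, 1]])
labels = (commonness + 0.5 * latent_score + rng.normal(0, 0.3, 2000)) > 0.8

# combined predictor: statistical feature + latent feature
X = np.column_stack([commonness, latent_score])
clf = LogisticRegression().fit(X, labels)
print(f"training accuracy: {clf.score(X, labels):.3f}")
```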
20

Matrix Factorization and Contrast Analysis Techniques for Recommendation

Aleksandrova, Marharyta 07 July 2017 (has links)
In many application areas, data elements can be high-dimensional. This raises the problem of dimensionality reduction. Dimensionality reduction techniques can be classified based on their aim, dimensionality reduction for optimal data representation and dimensionality reduction for classification, as well as based on the adopted strategy, feature selection and feature extraction. The set of features resulting from feature extraction methods is usually uninterpretable. The first research question of the thesis is therefore: how can interpretable latent features be extracted? Dimensionality reduction for classification aims to enhance the classification power of the selected subset of features. We view the classification task as one of trigger factor identification, that is, identifying those factors that can influence the transfer of data elements from one class to another. The second research question of this thesis is: how can these trigger factors be identified automatically? We address both questions within the recommender systems application domain. We propose to interpret the latent features of matrix factorization-based recommender systems as real users. We design an algorithm for the automatic identification of trigger factors based on the concepts of contrast analysis. Through experimental results, we show that the defined patterns can indeed be considered trigger factors.
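A small sketch of the first idea above, interpreting the latent features of a matrix factorization as real users: after factorizing the rating matrix with NMF, each latent dimension is mapped to the existing user whose rating profile is most similar (by cosine similarity) to that dimension's item profile. The data, similarity measure, and names are assumptions rather than the thesis's exact procedure.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(8)
ratings = rng.integers(0, 6, size=(50, 30)).astype(float)   # users x items, 0 = unrated

model = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
user_factors = model.fit_transform(ratings)   # (users x k)
item_factors = model.components_              # (k x items)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# map each latent feature to the real user whose rating profile it resembles most
for k, profile in enumerate(item_factors):
    sims = np.array([cosine(profile, ratings[u]) for u in range(ratings.shape[0])])
    print(f"latent feature {k} ~ user {sims.argmax()} (cosine {sims.max():.2f})")
```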
