Global ETD Search

1	Comparison of Classification Effects of Principal Component and Sparse Principal Component Analysis for Cardiology Ultrasound in Left Ventricle Yang, Hsiao-ying 05 July 2012 Because heart disease is associated with the patterns of diastole and systole in the left ventricle, we analyze and classify cardiac ultrasound images gathered from Kaohsiung Veterans General Hospital. We use the differences between the gray-scale values of diastole and systole in the left ventricle to evaluate heart function. Following Chen (2011) and Kao (2011), we modify the reduction and alignment of the image data, and we add more subjects to the study. We treat the images in two manners, retaining the parts of concern. Since an ultrasound image, once converted to data, is expressed as a high-dimensional matrix, principal component analysis is adopted to retain the important factors and reduce the dimension. In this work, we compare the loadings calculated by ordinary and sparse principal component analysis; the factor scores are then used to carry out discriminant analysis, and the accuracy of classification is discussed. With the statistical methods in this work, the accuracy, sensitivity and specificity of the original classifications exceed 80%, and those of the cross-validations exceed 60%. Keywords: discriminant analysis; gray-scale value; principal component analysis; sparse principal component analysis; factor score
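Illustrative note (not part of the record): the accuracy, sensitivity and specificity reported above are standard summaries of a binary confusion matrix. The sketch below shows how they are computed. The function name and the labels are hypothetical, not data from the thesis, and both classes are assumed to be present.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity and specificity for binary labels.

    Assumes both classes occur in y_true, so no denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }


# Hypothetical labels (1 = abnormal, 0 = normal), for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In a cross-validated setting the same function would be applied to the held-out predictions rather than to the training predictions.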
2	GRAPH-BASED ANALYSIS OF NON-RANDOM MISSING DATA PROBLEMS WITH LOW-RANK NATURE: STRUCTURED PREDICTION, MATRIX COMPLETION AND SPARSE PCA Hanbyul Lee (17586345) 09 December 2023 In most theoretical studies on missing data analysis, data are assumed to be missing according to a specific probabilistic model. However, such an assumption may not accurately reflect real-world situations, in which missingness is sometimes not purely random. In this thesis, our focus is on analyzing incomplete data matrices without relying on any probabilistic model assumptions for the missing schemes. To characterize a missing scheme deterministically, we employ a graph whose adjacency matrix is a binary matrix that indicates whether each matrix entry is observed or not. Leveraging its graph properties, we mathematically represent the missing pattern of an incomplete data matrix and conduct a theoretical analysis of how this non-random missing pattern affects the solvability of specific problems related to incomplete data. This dissertation primarily focuses on three types of incomplete data problems characterized by their low-rank nature: structured prediction, matrix completion, and sparse PCA. First, we investigate a basic structured prediction problem, which involves recovering binary node labels on a fixed undirected graph, where noisy binary observations corresponding to edges are given. Essentially, this setting parallels a simple binary rank-1 symmetric matrix completion problem, where missing entries are determined by a fixed undirected graph. Our aim is to establish bounds on the fundamental limits of this problem, revealing a close association between these limits and graph properties such as connectivity. Second, we move on to the general low-rank matrix completion problem. In this study, we establish provable guarantees for exact and approximate low-rank matrix completion problems that can be applied to any non-random missing pattern, by utilizing the observation graph corresponding to the missing scheme. We show theoretically and experimentally that the standard constrained nuclear norm minimization algorithm can successfully recover the true matrix when the observation graph is well-connected and has similar node degrees. We also verify that matrix completion is achievable with a near-optimal sample complexity rate when the observation graph has uniform node degrees and its adjacency matrix has a large spectral gap. Finally, we address the sparse PCA problem, featuring an approximate low-rank attribute. Missing data are common in situations where sparse PCA is useful, such as single-cell RNA sequence data analysis. We propose a semidefinite relaxation of the non-convex $\ell_1$-regularized PCA problem to solve sparse PCA on incomplete data. We demonstrate that the method is particularly effective when the observation pattern has favorable properties. Our theory is substantiated through synthetic and real data analysis, showcasing the superior performance of our algorithm compared to other sparse PCA approaches, especially when the observed data pattern has specific characteristics. Keywords: Statistical data science; Missing data handling; structured prediction problems; Matrix completion approach; Sparse principal component analysis
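Illustrative note (not part of the record): the abstract ties recoverability to properties of the observation graph, such as connectivity and node degrees. A minimal sketch of how those two properties could be read off a binary observation mask follows. It assumes the mask of a rectangular matrix is treated as the biadjacency matrix of a bipartite row/column graph. The function name and example masks are hypothetical, and the thesis may define the graph differently.

```python
from collections import deque


def observation_graph_summary(mask):
    """mask[i][j] is 1 if entry (i, j) is observed, else 0.

    Returns (row_degrees, column_degrees, connected) for the bipartite
    graph whose nodes are the rows and columns of the matrix.
    """
    n_rows, n_cols = len(mask), len(mask[0])
    row_deg = [sum(row) for row in mask]
    col_deg = [sum(mask[i][j] for i in range(n_rows)) for j in range(n_cols)]

    # Breadth-first search: nodes 0..n_rows-1 are rows,
    # nodes n_rows..n_rows+n_cols-1 are columns.
    seen = {0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        if u < n_rows:
            neighbours = [n_rows + j for j in range(n_cols) if mask[u][j]]
        else:
            j = u - n_rows
            neighbours = [i for i in range(n_rows) if mask[i][j]]
        for v in neighbours:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    connected = len(seen) == n_rows + n_cols
    return row_deg, col_deg, connected


# Hypothetical masks: the first is connected with uniform degrees,
# the second splits into two components.
ring = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
diagonal = [[1, 0], [0, 1]]
ring_summary = observation_graph_summary(ring)
diagonal_summary = observation_graph_summary(diagonal)
print(ring_summary, diagonal_summary)
```

The spectral gap mentioned in the abstract would need an eigenvalue computation and is left out of this sketch.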
3	Data Mining Algorithms for Decentralized Fault Detection and Diagnostic in Industrial Systems Grbovic, Mihajlo January 2012 Timely Fault Detection and Diagnosis in complex manufacturing systems is critical to ensure safe and effective operation of plant equipment. A process fault is defined as a deviation from normal process behavior, defined within the limits of safe production. The quantifiable objectives of Fault Detection include achieving low detection delay time, low false positive rate, and high detection rate. Once a fault has been detected, pinpointing the type of fault is needed for fault mitigation and a return to normal process operation. This is known as Fault Diagnosis. Data-driven Fault Detection and Diagnosis methods emerged as an attractive alternative to traditional mathematical model-based methods, especially for complex systems, owing to the difficulty of describing the underlying process. A distinct feature of data-driven methods is that no a priori information about the process is necessary. Instead, it is assumed that historical data, containing process features measured at regular time intervals (e.g., power plant sensor measurements), are available for developing a fault detection/diagnosis model through generalization of data. The goal of my research was to address the shortcomings of the existing data-driven methods and contribute to solving open problems, such as: 1) decentralized fault detection and diagnosis; 2) fault detection in the cold start setting; 3) optimizing the detection delay and dealing with noisy data annotations; and 4) developing models that can adapt to concept changes in power plant dynamics. For small-scale sensor networks, it is reasonable to assume that all measurements are available at a central location (sink) where fault predictions are made. This is known as a centralized fault detection approach. For large-scale networks, a decentralized approach is often used, where the network is decomposed into potentially overlapping blocks and each block provides local decisions that are fused at the sink. The appealing properties of the decentralized approach include fault tolerance, scalability, and reusability. When one or more blocks go offline due to maintenance of their sensors, predictions can still be made using the remaining blocks. In addition, when the physical facility is reconfigured, either by changing its components or sensors, it can be easier to modify the part of the decentralized system impacted by the changes than to overhaul the whole centralized system. The scalability comes from reduced costs of system setup, update, communication, and decision making. The main challenges in decentralized monitoring include process decomposition and decision fusion. We proposed a decentralized model where the sensors are partitioned into small, potentially overlapping blocks based on the Sparse Principal Component Analysis (PCA) algorithm, which preserves strong correlations among sensors, followed by training local models at each block and fusing the decisions with the proposed Maximum Entropy algorithm. Moreover, we introduced a novel framework for adding constraints to the Sparse PCA problem. The constraints limit the set of possible solutions by imposing additional goals to be reached through optimization along with the existing Sparse PCA goals. The experimental results on benchmark fault detection data show that Sparse PCA can utilize prior knowledge, which is not directly available in data, in order to produce desirable network partitions, with a pre-defined limit on communication cost and/or robustness. / Computer and Information Science Keywords: Computer Science; Data Mining; Decentralized Learning; Fault Detection; Fault Diagnosis; Machine Learning; Sparse Principal Component Analysis
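Illustrative note (not part of the record): one plausible way to turn sparse loadings into sensor blocks is to let each sparse component define a block made of the sensors with non-zero loading on it. The sketch below assumes exactly that rule. The function name, tolerance and loading values are hypothetical, and the thesis may build its partition differently.

```python
def blocks_from_loadings(loadings, tol=1e-8):
    """Group sensors by the support of each sparse component.

    loadings[k][j] is the loading of sensor j on component k.
    A sensor may land in several blocks, so blocks can overlap.
    """
    blocks = []
    for component in loadings:
        block = [j for j, weight in enumerate(component) if abs(weight) > tol]
        if block:  # skip components that were shrunk entirely to zero
            blocks.append(block)
    return blocks


# Hypothetical sparse loadings for five sensors and three components.
loadings = [
    [0.7, 0.7, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.8, 0.0],
    [0.0, 0.0, 0.0, 0.6, 0.8],
]
blocks = blocks_from_loadings(loadings)
print(blocks)
```

Here sensors 2 and 3 each belong to two blocks, which is the kind of overlap the abstract allows for.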
4	Sparse Principal Component Analysis for High-Dimensional Data: A Comparative Study Bonner, Ashley J. 10 1900 Background: Through unprecedented advances in technology, high-dimensional datasets have exploded into many fields of observational research. For example, it is now common to expect thousands or millions of genetic variables (p) with only a limited number of study participants (n). Determining the important features proves statistically difficult, as multivariate analysis techniques become flooded and mathematically insufficient when n < p. Principal Component Analysis (PCA) is a commonly used multivariate method for dimension reduction and data visualization but suffers from these issues. A collection of Sparse PCA methods has been proposed to counter these flaws, but the methods have not been tested in comparative detail. Methods: The performance of three Sparse PCA methods was evaluated through simulations. Data were generated for 56 different data structures, varying p, the number of underlying groups, and the variance structure within them. Estimation and interpretability of the principal components (PCs) were rigorously tested. Sparse PCA methods were also applied to a real gene expression dataset. Results: All Sparse PCA methods showed improvements upon classical PCA. Some methods were best at obtaining an accurate leading PC only, whereas others were better for subsequent PCs. The optimal choice of Sparse PCA method differs as within-group correlation and across-group variances vary; however, one method repeatedly worked well under the most difficult scenarios. When the methods were applied to real data, concise groups of gene expressions were detected with the sparsest methods. Conclusions: Sparse PCA methods provide a new, insightful way to detect important features amidst complex high-dimensional data. / Master of Science (MSc) Keywords: Principal Component Analysis (PCA); High Dimensional Data; Simulations; Loading Vectors; Tuning Parameters; Applied Statistics; Biostatistics; Multivariate Analysis; Statistical Methodology
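Illustrative note (not part of the record): to show what "sparse" means for a loading vector, the sketch below contrasts an ordinary leading loading (plain power iteration) with a sparse one obtained by soft-thresholding inside the same iteration. This is a simple heuristic in the spirit of sparse PCA and is not claimed to be any of the three methods the thesis compares. The covariance matrix, penalty `lam` and function names are hypothetical.

```python
import math


def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values within [-lam, lam] become 0."""
    return math.copysign(max(abs(x) - lam, 0.0), x)


def leading_loading(cov, lam=0.0, iters=200):
    """Leading loading vector of a covariance matrix by power iteration.

    With lam = 0 this is ordinary power iteration (classical PCA).
    With lam > 0 each step soft-thresholds the vector before
    normalising, which drives small loadings to exactly zero.
    Assumes the uniform start vector is not orthogonal to the
    leading eigenvector.
    """
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        w = [soft_threshold(x, lam) for x in w]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return [0.0] * p  # penalty too large: everything was shrunk away
        v = [x / norm for x in w]
    return v


# Hypothetical covariance: variables 0 and 1 are strongly correlated,
# variable 2 is only weakly related to them.
cov = [
    [4.0, 3.0, 0.2],
    [3.0, 4.0, 0.1],
    [0.2, 0.1, 1.0],
]
dense = leading_loading(cov, lam=0.0)
sparse = leading_loading(cov, lam=0.5)
print(dense, sparse)
```

With the penalty switched on, the weakly related third variable drops out of the loading entirely, which is what makes the component easier to interpret.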
5	Emprego de técnicas de análise exploratória de dados utilizados em Química Medicinal / Use of different techniques for exploratory data analysis in Medicinal Chemistry Gertrudes, Jadson Castro 10 September 2013 Research in Medicinal Chemistry has focused on the search for methods that accelerate the process of drug discovery. Among the several steps in the discovery of bioactive substances is the analysis of the relationships between the chemical structure and the biological activity of compounds. In this process, medicinal chemistry researchers analyze data sets that are characterized by high dimensionality and a small number of observations. Within this context, this work presents a computational approach that aims to contribute to the analysis of chemical data and, consequently, to the discovery of new drugs for the treatment of chronic diseases. The exploratory data analysis approaches used in this work combine dimensionality reduction and clustering techniques to detect natural structures that reflect the biological activity of the analyzed compounds. Among the several existing techniques for dimensionality reduction, we discuss the Fisher score, principal component analysis and sparse principal component analysis. For the clustering procedure, this study evaluates k-means, fuzzy c-means and an enhanced ICA mixture model. The experiments used four data sets containing information on bioactive substances. Two sets are related to the treatment of diabetes mellitus and metabolic syndrome, the third set is related to cardiovascular disease, and the last set contains substances that can be used in cancer treatment. In the experiments, the results suggest using dimensionality reduction techniques together with unsupervised algorithms for the task of clustering chemical data, since in these experiments it was possible to describe different levels of biological activity of the studied compounds. Therefore, we conclude that dimensionality reduction and clustering techniques may be used as guides in the process of discovery and development of new compounds in the field of Medicinal Chemistry. Keywords: Clustering; Dimensionality reduction; Principal component analysis; Sparse principal component analysis; Structure activity relationship; Variable selection
