151

High dimensional data clustering; A comparative study on gene expressions : Experiment on clustering algorithms on RNA-sequence from tumors with evaluation on internal validation

Henriksson, William January 2019 (has links)
In cancer research, class discovery is the first step in investigating a new dataset to find the hidden groups formed by similar attributes. However, gene expression datasets, whether from RNA microarrays or RNA-sequencing, are high-dimensional, which makes it hard to perform cluster analysis and to obtain clusters that are well separated. Well-separated clusters are desirable because they indicate that objects are most likely not placed in the wrong clusters. This report investigates experimentally whether K-Means and hierarchical clustering are suitable for clustering gene expressions in RNA-sequence data from various tumors. Dimensionality reduction methods are also applied to see whether they help create well-separated clusters. The results show that well-separated clusters are only achieved by using PCA for dimensionality reduction together with K-Means on correlation distance. The main contribution of this paper is determining that applying K-Means or hierarchical clustering to the full natural dimensionality of RNA-sequence data yields an undesirably low average silhouette width, below 0.4.
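A minimal sketch of this evaluation pipeline in Python with scikit-learn (not the thesis's code; the data, the number of principal components and the number of clusters are placeholders chosen for illustration):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # Placeholder samples-by-genes matrix standing in for the RNA-sequence tumor data
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5000))

    # Reduce dimensionality with PCA before clustering
    X_reduced = PCA(n_components=10).fit_transform(X)

    # Cluster with K-Means and assess separation by the average silhouette width;
    # the thesis treats an average width below 0.4 as poorly separated
    labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_reduced)
    print("average silhouette width:", silhouette_score(X_reduced, labels))
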
152

Dimensionality Reduction in High-Dimensional Profile Analysis Using Scores

Vikbladh, Jonathan January 2022 (has links)
Profile analysis is a multivariate statistical method for comparing the mean vectors of different groups. It consists of three tests: the tests for parallelism, level and flatness. The results of each test give information about the behaviour of the groups and of the variables within the groups. The test statistics used when there are more than two groups are likelihood-ratio tests. However, issues in the form of indeterminate test statistics occur in the high-dimensional setting, that is, when there are more variables than observations. This thesis investigates a method to approach this problem by reducing the dimensionality of the data using scores, that is, linear combinations of the variables. Three different ways of choosing this score are compared: the eigendecomposition and two variations of the non-negative matrix factorization. The methods are compared using simulations for five different types of mean parameter settings. The results show that the eigendecomposition is the best technique for choosing the score, and that using more scores only slightly improves the results. Moreover, the results for the parallelism and flatness tests are shown to be very good, but the results for the level hypothesis deviate from expectation.
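As a rough illustration of the score idea (my own simplification, not the thesis's procedure), the sketch below builds scores from the leading eigenvectors of the pooled covariance matrix and projects each group onto them, so that the parallelism, level and flatness tests can be run in the reduced dimension:

    import numpy as np

    rng = np.random.default_rng(1)
    # Three groups with p = 100 variables but only n = 15 observations each
    groups = [rng.normal(size=(15, 100)) for _ in range(3)]

    # Pooled within-group covariance estimate
    centered = [g - g.mean(axis=0) for g in groups]
    pooled = sum(c.T @ c for c in centered) / (sum(g.shape[0] for g in groups) - len(groups))

    # Scores are linear combinations defined by the leading eigenvectors
    eigvals, eigvecs = np.linalg.eigh(pooled)
    B = eigvecs[:, ::-1][:, :3]          # keep the 3 leading eigenvectors

    # Project each group; the likelihood-ratio tests of profile analysis
    # would then be carried out on these low-dimensional scores
    scores = [g @ B for g in groups]
    print([s.shape for s in scores])     # [(15, 3), (15, 3), (15, 3)]
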
153

High-dimensional single cell cytometry approach for immune system analysis

Koladiya, Abhishek January 2021 (has links)
Technological advances have allowed for the advent of single-cell technologies capable of measuring a large number of cellular features simultaneously. These technologies have subsequently been used to shed light on the heterogeneity of cellular systems previously considered homogeneous, identifying the distinctive features of individual cells within cellular niches. Today, single-cell technologies represent an essential tool for studying the underlying immunological mechanisms that correlate with disease. In this context, cytometry is one of the diverse high-throughput methods capable of examining more than 50 features per cell. However, using cytometry to its full potential requires the development of optimized assays. Additionally, the resulting high-dimensional data represent a challenge for existing computational techniques. This thesis attempts to address these challenges. The first part of the thesis is focused on developing EmbedSOM, a non-linear embedding algorithm for rapid analysis of cytometry datasets. A comparison of EmbedSOM with other state-of-the-art algorithms suggested that EmbedSOM is superior, with a faster runtime; this is critical for the analysis of large datasets with millions of cells. Furthermore, EmbedSOM has additional functionality such as landmark guided...
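EmbedSOM itself is distributed as an R package; purely to convey the landmark idea, the following Python sketch (my own approximation with made-up parameters, using the MiniSom package) trains a self-organizing map and places each cell in 2D by a distance-weighted average of the grid coordinates of its nearest SOM landmarks:

    import numpy as np
    from minisom import MiniSom

    rng = np.random.default_rng(2)
    cells = rng.normal(size=(1000, 20))      # placeholder: 1000 cells, 20 markers

    # Train a 10x10 SOM; its codebook vectors act as landmarks
    som = MiniSom(10, 10, cells.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
    som.train_random(cells, 2000)
    codebook = som.get_weights().reshape(-1, cells.shape[1])
    grid = np.array([(i, j) for i in range(10) for j in range(10)], dtype=float)

    # Crude landmark-guided embedding: inverse-distance-weighted average of the
    # grid positions of the k nearest landmarks (EmbedSOM uses a smoother scheme)
    def embed(cell, k=8):
        d = np.linalg.norm(codebook - cell, axis=1)
        idx = np.argsort(d)[:k]
        w = 1.0 / (d[idx] + 1e-9)
        return (grid[idx] * w[:, None]).sum(axis=0) / w.sum()

    embedding = np.array([embed(c) for c in cells])   # (1000, 2) coordinates
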
154

Methods for causal mediation analysis with applications in HIV and cardiorespiratory fitness

Chernofsky, Ariel 16 June 2023 (has links)
The cause and effect paradigm underlying medical research has led to an enhanced etiological understanding of many diseases and the development of many lifesaving drugs, but the paradigm does not always include an understanding of the pathways involved. Causal mediation analysis extends the cause and effect relationship to the cause and effect through a mediator, an intermediate variable on the causal pathway. The total effect of an exposure on an outcome is decomposed into two parts: 1) the indirect effect of the exposure on the outcome through the mediator and 2) the direct effect of the exposure on the outcome through all other pathways. In this dissertation, I describe various counterfactual causal mediation frameworks with identifiability assumptions that all lead to the Mediation Formula. The indirect and direct effects can be estimated from observable data using a semi-parametric algorithm derived from the Mediation Formula that I generalize to different types of mediators and outcomes. With an increased interest in causal mediation analysis, thoughtful consideration is necessary in the application of the Mediation Formula to real-world data challenges. Here, I consider three motivating causal mediation questions in the areas of HIV curative research and cardiorespiratory fitness. HIV curative treatments typically target the viral reservoir, cells infected with latent HIV. Quantifying the effect of an HIV curative treatment on viral rebound over a set time horizon mediated by reductions in the viral reservoir can inform future directions for improving curative treatments. In cardiorespiratory fitness research, metabolites, molecules involved with cellular respiration, are believed to mediate the effect of physical activity on cardiorespiratory fitness. I propose three novel adaptations to the semi-parametric estimation algorithm to address three data challenges: 1) Numerical integration and optimization of the observed data likelihood for mediators with an assay lower limit (left-censored mediators); 2) Pseudo-value approach for time-to-event outcomes on a restricted mean survival time scale; 3) Elastic net regression for high-dimensional mediators. My novel approaches provide estimation frameworks that can be applied to a broad spectrum of research questions. I provide simulation studies to assess the properties of the estimators and applications of the methodologies to the motivating data.
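For intuition, here is a small self-contained sketch (my own, not code from the dissertation) of the mediation formula in the special case of linear models without exposure-mediator interaction, where the indirect effect is the product of the exposure-to-mediator and mediator-to-outcome coefficients and the direct effect is the remaining exposure coefficient:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 5000
    a = rng.binomial(1, 0.5, n).astype(float)      # exposure
    m = 0.8 * a + rng.normal(size=n)               # mediator
    y = 1.5 * a + 2.0 * m + rng.normal(size=n)     # outcome

    def ols(columns, response):
        X = np.column_stack([np.ones(n), *columns])
        return np.linalg.lstsq(X, response, rcond=None)[0]

    alpha = ols([a], m)[1]       # effect of exposure on mediator
    gamma = ols([a, m], y)       # outcome model: intercept, direct, mediator coefficients
    nde = gamma[1]               # natural direct effect (approx. 1.5)
    nie = alpha * gamma[2]       # natural indirect effect (approx. 0.8 * 2.0)
    print("NDE:", nde, "NIE:", nie, "total effect:", nde + nie)
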
155

PLASMA-HD: Probing the LAttice Structure and MAkeup of High-dimensional Data

Fuhry, David P. January 2015 (has links)
No description available.
156

Variable Selection in High-Dimensional Data

Reichhuber, Sarah, Hallberg, Johan January 2021 (has links)
Estimating the variables of importance in inferential modelling is of significant interest in many fields of science, engineering, biology, medicine, finance and marketing. However, variable selection in high-dimensional data, where the number of variables is relatively large compared to the number of observed data points, is a major challenge and requires more research in order to enhance reliability and accuracy. In this bachelor thesis project, several known methods of variable selection, namely orthogonal matching pursuit (OMP), ridge regression, lasso, adaptive lasso, elastic net, adaptive elastic net and multivariate adaptive regression splines (MARS), were implemented on a high-dimensional dataset. The aim of this bachelor thesis project was to analyze and compare these variable selection methods. Furthermore, their performance on the same dataset, extended so that the number of variables and observations were of similar size, was analyzed and compared as well. This was done by generating models for the different variable selection methods using built-in packages in R and code written in MATLAB. The models were then used to predict the observations, and these estimates were compared to the real observations. The performance of the different variable selection methods was analyzed using several evaluation methods. It could be concluded that some of the variable selection methods provided more accurate models for the given high-dimensional dataset than others. Elastic net, for example, was one of the methods that performed better. Additionally, combining the final models could provide further insight into which variables are crucial for the observations in the given dataset, where, for example, variables 112 and 23 appeared to be of importance. / Bachelor's degree project in electrical engineering 2021, KTH, Stockholm
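The thesis itself used packages in R and code in MATLAB; as a rough Python analogue (my own sketch with arbitrary penalty settings, and with the relevant variable indices 23 and 112 reused purely for illustration), the snippet below fits a few of the listed selectors with scikit-learn and reports which variables each one keeps:

    import numpy as np
    from sklearn.linear_model import Lasso, ElasticNet, OrthogonalMatchingPursuit

    rng = np.random.default_rng(4)
    n, p = 60, 300                      # far more variables than observations
    X = rng.normal(size=(n, p))
    beta = np.zeros(p)
    beta[[23, 112]] = [3.0, -2.5]       # only two truly relevant variables
    y = X @ beta + rng.normal(size=n)

    models = {
        "lasso": Lasso(alpha=0.1),
        "elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5),
        "OMP": OrthogonalMatchingPursuit(n_nonzero_coefs=5),
    }
    for name, model in models.items():
        selected = np.flatnonzero(model.fit(X, y).coef_)
        print(name, "selected variables:", selected)
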
157

Data-Driven Supervised Classifiers in High-Dimensional Spaces: Application on Gene Expression Data

Efrem, Nabiel H. January 2024 (has links)
Several ready-to-use supervised classifiers perform well predictively in large-sample cases, but in general the same cannot be expected when transitioning to high-dimensional settings. This can be explained by the fact that classical supervised learning theory was not developed for high-dimensional spaces, leaving several classifiers struggling against the curse of dimensionality. A rise in parsimonious classification procedures, particularly techniques incorporating feature selectors, can be observed. Such procedures can be interpreted as two-step: an arbitrary selector first obtains a feature subset independently of a ready-to-use model, and unlabelled instances are subsequently classified within the selected subset. Modeling the two-step procedure is often heavy in motivation, while theoretical and algorithmic descriptions are frequently overlooked. In this thesis, we aim to describe the theoretical and algorithmic framework for employing a feature selector as a pre-processing step for the Support Vector Machine and to assess its validity in high-dimensional settings. The validity of the proposed classifier is evaluated based on predictive performance through a comparative study with a state-of-the-art algorithm designed for advanced learning tasks. The chosen algorithm effectively employs feature relevance during training, making it suitable for high-dimensional settings. The results suggest that the proposed classifier performs predictively better than the Support Vector Machine in lower input dimensions; however, its performance tends to converge rapidly towards that of the Support Vector Machine once the input dimension exceeds a certain threshold. Additionally, the thesis could not establish any strictly superior performance of either the chosen state-of-the-art algorithm or the proposed classifier; nonetheless, the state-of-the-art algorithm delivers a more balanced performance across both labels.
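A minimal sketch of the two-step procedure in Python with scikit-learn (my illustration; the univariate selector, the kernel and the value of k are placeholder choices, and the comparison algorithm used in the thesis is not shown):

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    # Placeholder high-dimensional classification data
    X, y = make_classification(n_samples=100, n_features=2000, n_informative=20, random_state=0)

    # Step 1: feature selection; step 2: SVM on the selected subset.
    # Wrapping both steps in one pipeline keeps the selection inside each CV fold.
    clf = make_pipeline(SelectKBest(f_classif, k=50), SVC(kernel="linear"))
    print("mean CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
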
158

Visual Analysis of High-Dimensional Point Clouds using Topological Abstraction

Oesterling, Patrick 17 May 2016 (has links) (PDF)
This thesis is about visualizing a kind of data that is trivial for computers to process but difficult for humans to imagine, because nature does not give us intuition for this type of information: high-dimensional data. Such data often result from representing observations of objects under various aspects or with different properties. In many applications, a typical, laborious task is to find related objects or to group those that are similar to each other. One classic solution for this task is to imagine the data as vectors in a Euclidean space with object variables as dimensions. Utilizing Euclidean distance as a measure of similarity, objects with similar properties and values accumulate into groups, so-called clusters, that are exposed by cluster analysis on the high-dimensional point cloud. Because similar vectors can be thought of as objects that are alike in terms of their attributes, the point cloud's structure and individual cluster properties, like their size or compactness, summarize data categories and their relative importance. The contribution of this thesis is a novel analysis approach for visual exploration of high-dimensional point clouds that does not suffer from structural occlusion. The work is based on two key concepts: The first idea is to discard those geometric properties that cannot be preserved and thus lead to the typical artifacts, and to use topological concepts instead to shift the focus from a point-centered view of the data to a more structure-centered perspective. The advantage is that topology-driven clustering information can be extracted in the data's original domain and preserved without loss in low dimensions. The second idea is to split the analysis into a topology-based global overview and a subsequent geometric local refinement. The occlusion-free overview enables the analyst to identify features and to link them to other visualizations that permit analysis of those properties not captured by the topological abstraction, e.g. cluster shape or value distributions in particular dimensions or subspaces. The advantage of separating structure analysis from data point analysis is that restricting local analysis to data subsets significantly reduces artifacts and the visual complexity of standard techniques. That is, the additional topological layer enables the analyst to identify structure that was hidden before and to focus on particular features by suppressing irrelevant points during local feature analysis. This thesis addresses the topology-based visual analysis of high-dimensional point clouds for both the time-invariant and the time-varying case. Time-invariant means that the points do not change in number or position; the analyst explores the clustering of a fixed and constant set of points. The extension to the time-varying case implies the analysis of a varying clustering, where clusters appear, merge, split, or vanish. Especially for high-dimensional data, both tracking, i.e. relating features over time, and visualizing the changing structure are difficult problems to solve.
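To hint at what a topology-driven overview can look like, here is a deliberately simplified Python sketch (my own, far cruder than the topological abstraction developed in the thesis): it assigns every point a density estimate, then sweeps a density threshold and counts connected components of the neighbourhood graph restricted to points above that threshold, giving a rough picture of how high-density clusters appear and merge:

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.neighbors import NearestNeighbors, kneighbors_graph
    from scipy.sparse.csgraph import connected_components

    X, _ = make_blobs(n_samples=600, centers=4, n_features=10, random_state=0)

    # Density proxy: inverse of the mean distance to the k nearest neighbours
    k = 10
    dist, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
    density = 1.0 / dist[:, 1:].mean(axis=1)

    # Symmetric k-nearest-neighbour graph over all points
    graph = kneighbors_graph(X, n_neighbors=k, mode="connectivity")
    graph = graph.maximum(graph.T).tocsr()

    # Sweep density thresholds: the component count at each level sketches
    # how clusters of dense points emerge and merge as the level drops
    for q in (0.9, 0.7, 0.5, 0.3, 0.1):
        keep = np.flatnonzero(density >= np.quantile(density, q))
        n_comp, _ = connected_components(graph[keep][:, keep], directed=False)
        print(f"top {100 * (1 - q):.0f}% densest points -> {n_comp} component(s)")
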
159

Vital sign monitoring and data fusion in haemodialysis

Borhani, Yasmina January 2013 (has links)
Intra-dialytic hypotension (IDH) is the most common complication in haemodialysis (HD) treatment and has been linked with increased mortality in HD patients. Despite various approaches towards understanding the underlying physiological mechanisms giving rise to IDH, the causes of IDH are poorly understood. Heart Rate Variability (HRV) has previously been suggested as a predictive measure of IDH. In contrast to conventional spectral HRV measures in which the frequency bands are defined by fixed limits, a new spectral measure of HRV is introduced in which the breathing rate is used to identify and measure the physiologically-relevant peaks of the frequency spectrum. The ratio of peaks leading up to the IDH event was assessed as a possible measure for IDH prediction. Changes in the proposed measure correlate well with the magnitude of abrupt changes in blood pressure in patients with autonomic dysfunction, but there is no such correlation in patients without autonomic dysfunction. At present, routine clinical vital sign monitoring beyond simple weight and blood pressure measurements at the start and end of each session has not established itself in clinical practice. To investigate the benefits of continuous vital sign monitoring in HD patients with regard to detecting and predicting IDH, different population-based and patient-specific models of normality were devised and tested on data from an observational study at the Oxford Renal Unit in which vital signs were recorded during HD sessions. Patient-specific models of normality performed better in distinguishing between IDH and non-IDH data, primarily due to the wide range of vital sign data included as part of the training data in the population-based models. Further, a patient-specific data fusion model was constructed using Parzen windows to estimate a probability density function from the training data consisting of vital signs from IDH-free sessions. Although the model was constructed using four vital sign inputs, novelty detection was found to be primarily driven by blood pressure decreases.
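A compact sketch of the Parzen-window novelty detection idea (my own illustration with simulated numbers, not the study's data, channels or bandwidth): fit a kernel density estimate to vital signs from IDH-free training sessions and flag test samples whose estimated log-likelihood falls below a low quantile of the training scores:

    import numpy as np
    from sklearn.neighbors import KernelDensity

    rng = np.random.default_rng(5)
    # Simulated training data: four vital-sign channels from IDH-free sessions
    train = rng.normal(loc=[70, 120, 98, 16], scale=[5, 8, 1, 2], size=(500, 4))

    # Parzen-window (Gaussian kernel) model of "normal" physiology
    kde = KernelDensity(kernel="gaussian", bandwidth=2.0).fit(train)

    # Flag test samples with unusually low likelihood as potential novelty
    threshold = np.quantile(kde.score_samples(train), 0.01)
    test = np.array([[68, 118, 98, 15],    # plausible vital signs
                     [75, 80, 95, 22]])    # abrupt blood-pressure drop
    print(kde.score_samples(test) < threshold)   # e.g. [False  True]
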
160

Maximum-likelihood kernel density estimation in high-dimensional feature spaces / C.M. van der Walt

Van der Walt, Christiaan Maarten January 2014 (has links)
With the advent of the internet and advances in computing power, the collection of very large high-dimensional datasets has become feasible; understanding and modelling high-dimensional data has thus become a crucial activity, especially in the field of pattern recognition. Since non-parametric density estimators are data-driven and do not require or impose a pre-defined probability density function on data, they are very powerful tools for probabilistic data modelling and analysis. Conventional non-parametric density estimation methods, however, originated from the field of statistics and were not originally intended to perform density estimation in high-dimensional feature spaces, as is often encountered in real-world pattern recognition tasks. Therefore we address the fundamental problem of non-parametric density estimation in high-dimensional feature spaces in this study. Recent advances in maximum-likelihood (ML) kernel density estimation have shown that kernel density estimators hold much promise for estimating non-parametric probability density functions in high-dimensional feature spaces. We therefore derive two new iterative kernel bandwidth estimators from the ML leave-one-out objective function and also introduce a new non-iterative kernel bandwidth estimator (based on the theoretical bounds of the ML bandwidths) for the purpose of bandwidth initialisation. We name the iterative kernel bandwidth estimators the minimum leave-one-out entropy (MLE) and global MLE estimators, and name the non-iterative kernel bandwidth estimator the MLE rule-of-thumb estimator. We compare the performance of the MLE rule-of-thumb estimator and conventional kernel density estimators on artificial data with data properties that are varied in a controlled fashion and on a number of representative real-world pattern recognition tasks, to gain a better understanding of the behaviour of these estimators in high-dimensional spaces and to determine whether these estimators are suitable for initialising the bandwidths of iterative ML bandwidth estimators in high dimensions. We find that there are several regularities in the relative performance of conventional kernel density estimators across different tasks and dimensionalities, and that the Silverman rule-of-thumb bandwidth estimator performs reliably across most tasks and dimensionalities of the pattern recognition datasets considered, even in high-dimensional feature spaces. Based on this empirical evidence and the intuitive theoretical motivation that the Silverman estimator optimises the asymptotic mean integrated squared error (assuming a Gaussian reference distribution), we select this estimator to initialise the bandwidths of the iterative ML kernel bandwidth estimators compared in our simulation studies. We then perform a comparative simulation study of the newly introduced iterative MLE estimators and other state-of-the-art iterative ML estimators on a number of artificial and real-world high-dimensional pattern recognition tasks. We illustrate with artificial data (guided by theoretical motivations) under what conditions certain estimators should be preferred, and we empirically confirm on real-world data that no estimator performs optimally on all tasks and that the optimal estimator depends on the properties of the underlying density function being estimated.
We also observe an interesting case of the bias-variance trade-off where ML estimators with fewer parameters than the MLE estimator perform exceptionally well on a wide variety of tasks; however, for the cases where these estimators do not perform well, the MLE estimator generally performs well. The newly introduced MLE kernel bandwidth estimators prove to be a useful contribution to the field of pattern recognition, since they perform optimally on a number of real-world pattern recognition tasks investigated and provide researchers and practitioners with two alternative estimators to employ for the task of kernel density estimation. / PhD (Information Technology), North-West University, Vaal Triangle Campus, 2014
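To make the leave-one-out ML bandwidth idea concrete, here is a small sketch (my own simplified version with a single shared Gaussian bandwidth, not the MLE estimators derived in the thesis): it computes Silverman's rule-of-thumb bandwidth as the starting point and then maximizes the leave-one-out log-likelihood over the bandwidth numerically:

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.spatial.distance import pdist, squareform

    rng = np.random.default_rng(6)
    X = rng.normal(size=(200, 5))              # placeholder data, n = 200, d = 5
    n, d = X.shape
    sq = squareform(pdist(X, "sqeuclidean"))   # pairwise squared distances

    def loo_log_likelihood(h):
        # Gaussian-kernel density at each point, estimated from the other n - 1 points
        K = np.exp(-sq / (2 * h * h))
        np.fill_diagonal(K, 0.0)
        dens = K.sum(axis=1) / ((n - 1) * (2 * np.pi) ** (d / 2) * h ** d)
        return np.log(dens).sum()

    # Silverman's rule-of-thumb bandwidth (Gaussian reference) as the initial guess
    sigma = X.std(axis=0, ddof=1).mean()
    h_silverman = sigma * (4.0 / (d + 2)) ** (1.0 / (d + 4)) * n ** (-1.0 / (d + 4))

    res = minimize_scalar(lambda h: -loo_log_likelihood(h),
                          bounds=(0.1 * h_silverman, 10 * h_silverman), method="bounded")
    print("Silverman bandwidth:", h_silverman, "ML leave-one-out bandwidth:", res.x)
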
