21

A non-asymptotic study of low-rank estimation of smooth kernels on graphs

Rangel Walteros, Pedro Andres 12 January 2015 (has links)
This dissertation investigates the problem of estimating a kernel over a large graph based on a sample of noisy observations of linear measurements of the kernel. We are interested in solving this estimation problem in the case when the sample size is much smaller than the ambient dimension of the kernel. As is typical in high-dimensional statistics, we are able to design a suitable estimator based on a small number of samples only when the target kernel belongs to a subset of restricted complexity. In our study, we restrict the complexity by considering scenarios where the target kernel is both low-rank and smooth over a graph. Using standard tools of non-parametric estimation, we derive a minimax lower bound on the least squares error in terms of the rank and the degree of smoothness of the target kernel. To prove the optimality of our lower bound, we proceed to develop upper bounds on the error for a least-squares estimator based on a non-convex penalty. The proof of these upper bounds depends on bounds for estimators over uniformly bounded function classes in terms of Rademacher complexities. We also propose a computationally tractable estimator based on least squares with a convex penalty. We derive an upper bound for the computationally tractable estimator in terms of a coherence function introduced in this work. Finally, we present some scenarios wherein this upper bound achieves a near-optimal rate. The motivations for studying such problems come from various real-world applications like recommender systems and social network analysis.
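As a point of reference, the sketch below shows one generic way to fit a low-rank matrix from noisy linear measurements: proximal gradient descent with singular-value thresholding for a nuclear-norm (convex) penalty. It is only an illustration of the convex-penalized least-squares idea, it omits the graph-smoothness part of the dissertation's estimator, and all dimensions and the penalty level are made up.

```python
# Illustrative sketch only: nuclear-norm penalized least squares for a
# low-rank kernel W from noisy linear measurements y_k = <A_k, W> + noise,
# solved by proximal gradient descent with singular-value thresholding.
import numpy as np

def svt(M, tau):
    """Singular-value thresholding: prox of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def lowrank_ls(A, y, lam, n_iter=300):
    """A: (n, d, d) measurement matrices, y: (n,) noisy responses."""
    n, d, _ = A.shape
    M = A.reshape(n, -1)                      # vectorized measurements
    step = n / np.linalg.norm(M, 2) ** 2      # 1 / Lipschitz constant of the gradient
    W = np.zeros((d, d))
    for _ in range(n_iter):
        resid = M @ W.ravel() - y             # <A_k, W> - y_k
        grad = (M.T @ resid).reshape(d, d) / n
        W = svt(W - step * grad, step * lam)  # proximal (soft-threshold singular values)
    return W

# toy usage: rank-2 ground truth, random Gaussian measurements
rng = np.random.default_rng(0)
d, n, r = 20, 400, 2
B = rng.normal(size=(d, r))
W_true = B @ B.T
A = rng.normal(size=(n, d, d))
y = A.reshape(n, -1) @ W_true.ravel() + 0.1 * rng.normal(size=n)
W_hat = lowrank_ls(A, y, lam=0.5)             # lam is an arbitrary illustrative value
print(np.linalg.norm(W_hat - W_true) / np.linalg.norm(W_true))
```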
22

Improved shrunken centroid method for better variable selection in cancer classification with high throughput molecular data

Xukun, Li January 1900 (has links)
Master of Science / Department of Statistics / Haiyan Wang / Cancer type classification with high-throughput molecular data has received much attention. Many methods have been published in this area. One of them is PAM (the nearest shrunken centroid algorithm), which is simple and efficient and can give very good prediction accuracy. A problem with PAM is that this method selects too many genes, some of which may have no influence on cancer type. A reason for this phenomenon is that PAM assumes that all genes have an identical distribution and applies a common threshold parameter for gene selection. This may not hold in reality, since expressions from different genes could have very different distributions due to complicated biological processes. We propose a new method aimed at improving the ability of PAM to select informative genes. Keeping informative genes while reducing false positive variables can lead to more accurate classification results and help to pinpoint target genes for further studies. To achieve this goal, we introduce a variable-specific test based on Edgeworth expansion to select informative genes. We apply this test to each gene and select genes based on the result of the test, so that a large number of genes are excluded. Afterward, soft thresholding with cross-validation can be further applied to determine a common threshold value. Simulations and a real application show that our method can reduce the irrelevant information and select the informative genes more precisely. The simulation results give us more insight about where the newly proposed procedure could improve accuracy, especially when the data set is skewed or unbalanced. The method can be applied to broad molecular data, including, for example, lipidomic data from mass spectrometry, copy number data from genomics, eQTL analysis with GWAS data, etc. We expect the proposed method will help life scientists to accelerate discoveries with high-throughput data.
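For context, the PAM-style baseline that the proposed test refines can be run directly in scikit-learn: NearestCentroid with a shrink_threshold performs nearest-shrunken-centroid classification with one common threshold shared by all genes. The data below are simulated and the threshold value is illustrative, not tuned.

```python
# Baseline sketch: nearest shrunken centroid (PAM-style) classification with a
# single common shrinkage threshold, on simulated two-class expression data.
import numpy as np
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n, p = 100, 500                      # samples x genes (p >> n is typical)
y = rng.integers(0, 2, size=n)       # two cancer types
X = rng.normal(size=(n, p))
X[y == 1, :20] += 1.0                # only the first 20 genes are informative

clf = NearestCentroid(shrink_threshold=0.5)   # common threshold for all genes
print(cross_val_score(clf, X, y, cv=5).mean())
```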
23

Machine Learning Methods for High-Dimensional Imbalanced Biomedical Data

January 2013 (has links)
abstract: Learning from high-dimensional biomedical data has attracted much attention recently. High-dimensional biomedical data often suffer from the curse of dimensionality and have imbalanced class distributions. Both of these features, high dimensionality and imbalanced class distributions, are challenging for traditional machine learning methods and may degrade model performance. In this thesis, I focus on developing learning methods for high-dimensional imbalanced biomedical data. In the first part, a sparse canonical correlation analysis (CCA) method is presented. Penalty terms are used to control the sparsity of the projection matrices of CCA. The sparse CCA method is then applied to find patterns among biomedical data sets and labels, or to find patterns among different data sources. In the second part, I discuss several learning problems for imbalanced biomedical data. Traditional learning systems are often biased when the biomedical data are imbalanced, so traditional evaluation measures such as accuracy may be inappropriate for such cases. I therefore discuss several alternative evaluation criteria to assess learning performance. For imbalanced binary classification problems, I use the undersampling-based classifier ensemble (UEM) strategy to obtain accurate models for both classes of samples. A small sphere and large margin (SSLM) approach is also presented to detect rare abnormal samples from a large number of subjects. In addition, I apply multiple feature selection and clustering methods to deal with high-dimensional data and data with highly correlated features. Experiments on high-dimensional imbalanced biomedical data are presented which illustrate the effectiveness and efficiency of my methods. / Dissertation/Thesis / M.S. Computer Science 2013
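A minimal sketch of the undersampling-based ensemble idea (my simplified reading, not the thesis code) is given below: each ensemble member is trained on all minority samples plus an equal-sized random draw from the majority class, and the members' predicted probabilities are averaged; performance is then judged with an imbalance-aware criterion such as AUC. The data, member count, and base classifier are illustrative.

```python
# Sketch of an undersampling-based classifier ensemble for imbalanced binary data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def undersample_ensemble(X, y, n_members=20, seed=0):
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    members = []
    for _ in range(n_members):
        maj = rng.choice(majority, size=minority.size, replace=False)
        idx = np.concatenate([minority, maj])        # balanced training subset
        members.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))
    return members

def ensemble_proba(members, X):
    return np.mean([m.predict_proba(X)[:, 1] for m in members], axis=0)

# toy imbalanced data: roughly 5% positives, 200 mostly noisy features
rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 200))
y = (rng.random(1000) < 0.05).astype(int)
X[y == 1, :5] += 1.5                                  # weak signal in 5 features
members = undersample_ensemble(X, y)
print(roc_auc_score(y, ensemble_proba(members, X)))   # in-sample AUC, toy only
```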
24

Penalized Regression Methods in the Study of Serum Biomarkers for Overweight and Obesity

Vasquez, Monica M., Vasquez, Monica M. January 2017 (has links)
The study of circulating biomarkers and their association with disease outcomes has become progressively complex due to advances in the measurement of these biomarkers through multiplex technologies. Although the availability of numerous serum biomarkers is highly promising, multiplex assays present statistical challenges due to the high dimensionality of these data. In this dissertation, three studies are presented that address these challenges using L1 penalized regression methods. In the first part of the dissertation, an extensive simulation study is performed for the logistic regression model that compares the Least Absolute Shrinkage and Selection Operator (LASSO) method with five LASSO-type methods given scenarios that are present in serum biomarker research, such as high correlation between biomarkers, weak associations with the outcome, and sparse number of true signals. Results show that choice of optimal LASSO-type method is dependent on data structure and should be guided by the research objective. Methods are then applied to the Tucson Epidemiological Study of Airway Obstructive Disease (TESAOD) study for the identification of serum biomarkers of overweight and obesity. Measurement of serum biomarkers using multiplex technologies may be more variable as compared to traditional single biomarker methods. Measurement error may induce bias in parameter estimation and complicate the variable selection process. In the second part of the dissertation, an existing measurement error correction method for penalized linear regression with L1 penalty has been adapted to accommodate validation data on a randomly selected subset of the study sample. A simulation study and analysis of TESAOD data demonstrate that the proposed approach improves variable selection and reduces bias in parameter estimation for validation data as small as 10 percent of the study sample. In the third part of the dissertation, a measurement error correction method that utilizes validation data is proposed for the penalized logistic regression model with the L1 penalty. A simulation study and analysis of TESAOD data are used to evaluate the proposed method. Results show an improvement in variable selection.
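As a reference point, the baseline L1-penalized (LASSO) logistic regression for biomarker selection can be sketched with scikit-learn as below; the simulated data, penalty grid, and cross-validation setup are illustrative, and the measurement-error correction developed in the dissertation is not implemented here.

```python
# Sketch: L1-penalized logistic regression selecting serum biomarkers
# associated with a binary outcome, with the penalty chosen by cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n, p = 300, 50                                  # subjects x biomarkers
X = rng.normal(size=(n, p))
logit = X[:, :5] @ np.array([1.0, -0.8, 0.6, 0.5, -0.5])   # 5 true signals
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

Xs = StandardScaler().fit_transform(X)
model = LogisticRegressionCV(penalty='l1', solver='liblinear',
                             Cs=20, cv=5).fit(Xs, y)
selected = np.flatnonzero(model.coef_.ravel() != 0)
print("selected biomarkers:", selected)
```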
25

Visualizing large-scale and high-dimensional time series data

Yeqiang, Lin January 2017 (has links)
Time series are one of the main research objects in the field of data mining. Visualization is an important mechanism for presenting processed time series for further analysis by users. In recent years researchers have designed a number of sophisticated visualization techniques for time series. However, most of these techniques focus on static formats, trying to encode the maximal amount of information in one image or plot. We propose the pixel video technique, a visualization technique that displays data in video format. In the pixel video technique, a hierarchical dimension cluster tree is first constructed to generate the similarity order of the dimensions; each frame image is then generated according to pixel-oriented techniques, displaying the data in the form of a video.
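A rough sketch of this pipeline, under my own simplifying assumptions rather than the author's implementation, might look as follows: dimensions are ordered via hierarchical clustering of their pairwise similarity, and each time window is rendered as one frame whose rows follow that order; the frames can then be assembled into a video.

```python
# Sketch of a pixel-video-style rendering: cluster dimensions for a similarity
# order, then write one image per time window. Window size and colormap are arbitrary.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
T, d = 1000, 64                                   # time steps x dimensions
data = np.cumsum(rng.normal(size=(T, d)), axis=0)

# order dimensions by similarity (correlation distance, average linkage)
order = leaves_list(linkage(pdist(data.T, metric='correlation'),
                            method='average'))

window = 100
for f, start in enumerate(range(0, T - window + 1, window)):
    frame = data[start:start + window, order].T   # dims x time, reordered rows
    plt.imsave(f"frame_{f:03d}.png", frame, cmap='viridis')
# the saved frames can then be assembled into a video, e.g. with ffmpeg
```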
26

High-dimensional statistical data integration

January 2019 (has links)
archives@tulane.edu / Modern biomedical studies often collect multiple types of high-dimensional data on a common set of objects. A representative model for the integrative analysis of multiple data types is to decompose each data matrix into a low-rank common-source matrix generated by latent factors shared across all data types, a low-rank distinctive-source matrix corresponding to each data type, and an additive noise matrix. We propose a novel decomposition method, called decomposition-based generalized canonical correlation analysis, which appropriately defines those matrices by imposing a desirable orthogonality constraint on the distinctive latent factors so that the common latent factors are sufficiently captured. To further delineate the common and distinctive patterns between two data types, we propose another new decomposition method, called common and distinctive pattern analysis. This method takes into account the common and distinctive information between the coefficient matrices of the common latent factors. We develop consistent estimation approaches for both proposed decompositions under high-dimensional settings, and demonstrate their finite-sample performance via extensive simulations. We illustrate the superiority of the proposed methods over the state of the art with real-world data examples obtained from The Cancer Genome Atlas and the Human Connectome Project. / 1 / Zhe Qu
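The target decomposition can be illustrated schematically as below. This sketch is not the proposed D-GCCA or CDPA estimators; it simply extracts shared latent factors from an SVD of the concatenated matrices, takes each data set's projection onto them as the common-source part, and a truncated SVD of the remainder as the distinctive-source part, with the ranks assumed known.

```python
# Schematic common/distinctive/noise split of two data matrices on shared objects.
import numpy as np

def common_distinctive(X1, X2, r_c=2, r_d=2):
    """X1, X2: (p_k, n) data matrices on the same n objects (column-centered)."""
    stacked = np.vstack([X1, X2])
    _, _, Vt = np.linalg.svd(stacked, full_matrices=False)
    F = Vt[:r_c]                                  # (r_c, n) shared latent factors
    parts = []
    for X in (X1, X2):
        common = X @ F.T @ F                      # projection onto shared factors
        U, s, Wt = np.linalg.svd(X - common, full_matrices=False)
        distinct = U[:, :r_d] @ np.diag(s[:r_d]) @ Wt[:r_d]
        parts.append((common, distinct, X - common - distinct))   # last term = noise
    return parts

rng = np.random.default_rng(5)
n = 200
F = rng.normal(size=(2, n))                       # true shared factors
X1 = rng.normal(size=(50, 2)) @ F + 0.5 * rng.normal(size=(50, n))
X2 = rng.normal(size=(80, 2)) @ F + 0.5 * rng.normal(size=(80, n))
parts1, parts2 = common_distinctive(X1, X2)
print([np.linalg.matrix_rank(p) for p in parts1[:2]])   # low-rank parts
```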
27

Random Matrix Theory: Selected Applications from Statistical Signal Processing and Machine Learning

Elkhalil, Khalil 06 1900 (has links)
Random matrix theory is an outstanding mathematical tool that has demonstrated its usefulness in many areas, ranging from wireless communication to finance and economics. The main motivation behind its use comes from the fundamental role that random matrices play in modeling unknown and unpredictable physical quantities. In many situations, meaningful metrics expressed as scalar functionals of these random matrices arise naturally. Along this line, the present work leverages tools from random matrix theory in an attempt to answer fundamental questions related to applications from statistical signal processing and machine learning. In the first part, this thesis addresses the development of analytical tools for the computation of the inverse moments of random Gram matrices with one-sided correlation. Such a question is mainly driven by applications in signal processing and wireless communications wherein such matrices naturally arise. In particular, we derive closed-form expressions for the inverse moments and show that the obtained results can help approximate several performance metrics of common estimation techniques. Then, we carry out a large-dimensional study of discriminant analysis classifiers. Under mild assumptions, we show that the asymptotic classification error approaches a deterministic quantity that depends only on the means and covariances associated with each class as well as the problem dimensions. Such a result permits a better understanding of the underlying classifiers in practical large but finite dimensions and can be used to optimize their performance. Finally, we revisit kernel ridge regression and study a centered version of it that we call centered kernel ridge regression, or CKRR in short. Relying on recent advances in the asymptotic properties of random kernel matrices, we carry out a large-dimensional analysis of CKRR under the assumption that both the data dimension and the training size grow simultaneously large at the same rate. We particularly show that both the empirical and prediction risks converge to a limiting risk that relates the performance to the data statistics and the parameters involved. Such a result is important as it permits a better understanding of kernel ridge regression and allows us to efficiently optimize its performance.
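A plain numpy sketch of centered kernel ridge regression, as I read the description above, is given below: the kernel matrix and the responses are centered before the ridge solve, and the test kernel is centered consistently with the training centering. The Gaussian kernel, its bandwidth, and the ridge level are illustrative choices; the asymptotic analysis itself is of course not reproduced here.

```python
# Sketch of centered kernel ridge regression (CKRR) with a Gaussian kernel.
import numpy as np

def gauss_kernel(A, B, gamma=0.1):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def ckrr_fit_predict(X, y, X_test, lam=1e-2, gamma=0.1):
    n = X.shape[0]
    K = gauss_kernel(X, X, gamma)
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H                                   # centered training kernel
    yc = y - y.mean()
    alpha = np.linalg.solve(Kc + n * lam * np.eye(n), yc)
    Kt = gauss_kernel(X_test, X, gamma)
    # center the test kernel consistently with the training centering
    Ktc = (Kt - Kt.mean(axis=1, keepdims=True)
              - K.mean(axis=0, keepdims=True) + K.mean())
    return Ktc @ alpha + y.mean()

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 10))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
X_test = rng.normal(size=(50, 10))
print(ckrr_fit_predict(X, y, X_test)[:5])
```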
28

Simultaneous Inference for High Dimensional and Correlated Data

Polin, Afroza 22 August 2019 (has links)
No description available.
29

Hierarchické shlukování s Mahalanobis-average metrikou akcelerované na GPU / GPU-accelerated Mahalanobis-average hierarchical clustering

Šmelko, Adam January 2020 (has links)
Hierarchical clustering algorithms are common tools for simplifying, exploring and analyzing datasets in many areas of research. For flow cytometry, a specific variant of agglomerative clustering has been proposed that uses cluster linkage based on the Mahalanobis distance to produce results better suited for the domain. The applicability of this clustering algorithm is currently limited by its relatively high computational complexity, which does not allow it to scale to common cytometry datasets. This thesis describes a specialized, GPU-accelerated version of the Mahalanobis-average linked hierarchical clustering, which improves the algorithm's performance by several orders of magnitude, thus allowing it to scale to much larger datasets. The thesis provides an overview of current hierarchical clustering algorithms and details the construction of the variant used on GPU. The result is benchmarked on publicly available high-dimensional data from mass cytometry.
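For orientation only, a simplified CPU baseline can be put together from scipy: points compared under a single global Mahalanobis metric and clustered with ordinary average linkage. This is neither the thesis's cluster-specific Mahalanobis-average linkage nor its GPU implementation, just the nearest off-the-shelf approximation.

```python
# Simplified baseline: global-Mahalanobis distances + average-linkage clustering.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(7)
X = np.vstack([rng.multivariate_normal(m, np.diag([1.0, 0.1]), size=100)
               for m in ([0, 0], [4, 0], [0, 4])])

VI = np.linalg.inv(np.cov(X, rowvar=False))      # global inverse covariance
D = pdist(X, metric='mahalanobis', VI=VI)        # condensed distance matrix
Z = linkage(D, method='average')                 # average-linkage dendrogram
labels = fcluster(Z, t=3, criterion='maxclust')
print(np.bincount(labels))
```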
30

A-Optimal Subsampling For Big Data General Estimating Equations

Cheung, Chung Ching 08 1900 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / A significant hurdle for analyzing big data is the lack of effective technology and statistical inference methods. A popular approach for analyzing data with a large sample size is subsampling. Many subsampling probabilities have been introduced in the literature for the linear model (Ma et al., 2015). In this dissertation, we focus on generalized estimating equations (GEE) with big data and derive the asymptotic normality of the estimators with and without resampling. We also give the asymptotic representation of the bias of the estimators with and without resampling, and show that the bias becomes significant when the data are high-dimensional. We also present a novel subsampling method, called A-optimal subsampling, which is derived by minimizing the trace of certain dispersion matrices (Peng and Tan, 2018). We derive the asymptotic normality of the estimator based on A-optimal subsampling. We conduct extensive simulations on large samples with high dimension to evaluate the performance of our proposed methods, using MSE as a criterion. High-dimensional data are further investigated, and we show through simulations that minimizing the asymptotic variance does not imply minimizing the MSE, as the bias is not negligible. We apply our proposed subsampling method to analyze a real data set, gas sensor data with more than four million data points. In both the simulations and the real data analysis, our A-optimal method outperforms the traditional uniform subsampling method.
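A toy sketch of the general flavor of A-optimality-motivated subsampling, in an ordinary linear model rather than the dissertation's GEE setting, is shown below: a pilot fit supplies residuals, subsampling probabilities are taken proportional to |residual| times ||(X'X)^{-1} x_i||, and the subsample is refit with inverse-probability weights. The exact probabilities used in the dissertation may differ.

```python
# Toy informative-subsampling sketch for OLS (not the dissertation's GEE method).
import numpy as np

rng = np.random.default_rng(8)
n, p, r = 100_000, 20, 2_000                     # full size, dimension, subsample size
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.normal(size=n)

# pilot fit on a small uniform subsample to get residuals
pilot = rng.choice(n, size=2_000, replace=False)
b0, *_ = np.linalg.lstsq(X[pilot], y[pilot], rcond=None)
resid = np.abs(y - X @ b0)

A_inv = np.linalg.inv(X.T @ X)
lev = np.linalg.norm(X @ A_inv, axis=1)          # ||(X'X)^{-1} x_i||
pi = resid * lev
pi /= pi.sum()                                   # A-optimality-style probabilities

idx = rng.choice(n, size=r, replace=True, p=pi)
w = 1.0 / (r * pi[idx])                          # inverse-probability weights
Xw = X[idx] * np.sqrt(w)[:, None]
yw = y[idx] * np.sqrt(w)
b_sub, *_ = np.linalg.lstsq(Xw, yw, rcond=None)  # weighted refit on the subsample
print(np.linalg.norm(b_sub - beta))
```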
