
Statistical methods for transcriptomics: From microarrays to RNA-seq

Tarazona Campos, Sonia 30 March 2015 (has links)
Transcriptomics studies gene expression levels under different experimental conditions in order to identify the genes associated with a given phenotype, as well as the regulatory relationships among genes. Omics data are characterized by containing information on thousands of variables measured on samples with few observations. The most common high-throughput technologies for measuring the expression of thousands of genes simultaneously are microarrays and, more recently, RNA sequencing (RNA-seq). This thesis addresses the evaluation, adaptation, and development of statistical models for the analysis of gene expression data, whether estimated with microarrays or with RNA-seq, using both univariate and multivariate methods. / Tarazona Campos, S. (2014). Statistical methods for transcriptomics: From microarrays to RNA-seq [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/48485 / Premios Extraordinarios de tesis doctorales

Does Process Data Add Value to the Analysis of International Large-Scale Assessment Data?

Leng, Dihao January 2024 (has links)
Thesis advisor: Matthias von Davier / The transition of major international large-scale assessments (ILSAs) from paper- to computer-based assessments has made process data increasingly available. While process data is potentially valuable for analyzing students’ test-taking behaviors, it also raises ethical concerns and involves considerable costs. This prompts the question: “Does process data add value to the analysis of ILSA data?” In response, this dissertation explores the utility of process data through three studies. Study 1 proposes a multiple-group hierarchical speed-accuracy-revisits model to examine the gender differences in mathematics ability, response speed, revisit propensity, and the relationships among them. The model’s flexibility allows it to be applied in diverse contexts to investigate group differences in test-taking behaviors and achievement beyond gender. Study 2 addresses the overparameterization challenge in ILSA scaling by proposing a new approach: adding process variables to the usual contextual variables and replacing principal component analysis with variable selection for latent regression modeling. The findings show that process variables consistently improved measurement precision; using Lasso, random forests, and ultimately gradient boosting for variable selection achieved or surpassed the measurement precision of the conventional approach but with considerably fewer covariates. Integrating variable selection and process data yielded the highest measurement precision while achieving parsimony, demonstrating the effectiveness of the proposed method. Study 3 investigates students’ test-taking behaviors in the context of girls consistently outperforming boys on average across countries and assessments. Three types of test-taking behaviors were identified through latent class analysis: “Rapid”, “Challenged”, and “Engaged”. Using Bolck-Croon-Hagenaars and three-step methods reveals that girls in the “Rapid” class outperformed boys on average in all countries, while there were no significant gender differences in the “Engaged” class in three of the four countries. The gender gap in reading achievement may diminish to a mild to moderate extent if boys were to behave like girls, highlighting the importance of addressing disengagement issues in ILSAs. Collectively, these three papers advance the use of process data and demonstrate its value for analyzing and reporting results of ILSA data. / Thesis (PhD) — Boston College, 2024. / Submitted to: Boston College. Lynch School of Education. / Discipline: Education.
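To make Study 2's conditioning step concrete, here is a rough, hypothetical sketch (not the dissertation's implementation): contextual and process variables are ranked with gradient-boosted trees against a proxy ability score, and only the strongest covariates are kept for the latent regression step, in place of principal components. All variable names, dimensions, and the proxy score are invented for illustration; a real analysis would work with plausible values and the full latent regression machinery.

```python
# Hypothetical sketch: rank contextual + process variables by gradient-boosting
# importance and keep the top-k as conditioning covariates.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n, p_context, p_process = 500, 40, 20
X_context = rng.normal(size=(n, p_context))   # questionnaire/background variables (synthetic)
X_process = rng.normal(size=(n, p_process))   # e.g., log-derived timing/revisit counts (synthetic)
X = np.hstack([X_context, X_process])
# Proxy for the ability estimate; invented for illustration only
theta_hat = X[:, 0] + 0.5 * X[:, p_context] + rng.normal(size=n)

gbm = GradientBoostingRegressor(n_estimators=300, max_depth=3, learning_rate=0.05)
gbm.fit(X, theta_hat)

# Keep the top-k covariates by importance for the latent regression step
k = 15
selected = np.argsort(gbm.feature_importances_)[::-1][:k]
print("selected covariate indices:", sorted(selected))
```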

High-dimensional Multimodal Bayesian Learning

Salem, Mohamed Mahmoud 12 December 2024 (has links)
High-dimensional datasets are fast becoming a cornerstone across diverse domains, fueled by advancements in data-capturing technology like DNA sequencing, medical imaging techniques, and social media. This dissertation delves into the inherent opportunities and challenges posed by these types of datasets. We develop three Bayesian methods: (1) Multilevel Network Recovery for Genomics, (2) Network Recovery for Functional data, and (3) Bayesian Inference in Transformer-based Models. Chapter 2 in our work examines a two-tiered data structure; to simultaneously explore variable selection and identify dependency structures among both higher and lower-level variables, we propose a multi-level nonparametric kernel machine approach, utilizing variational inference to jointly identify multi-level variables as well as build the network. Chapter 3 addresses the development of a simultaneous selection of functional domain subsets, selection of functional graphical nodes, and continuous response modeling given both scalar and functional covariates under semiparametric, nonadditive models, which allow us to capture unknown, possibly nonlinear, interaction terms among high dimensional functional variables. In Chapter 4, we extend our investigation of leveraging structure in high dimensional datasets to the relatively new transformer architecture; we introduce a new penalty structure to the Bayesian classification transformer, leveraging the multi-tiered structure of the transformer-based model. This allows for increased likelihood-based regularization, which is needed given the high dimensional nature of our motivating dataset. This new regularization approach allows us to integrate Bayesian inference via variational approximations into our transformer-based model and improves the calibration of probability estimates. / Doctor of Philosophy / In today's data-driven landscape, high-dimensional datasets have emerged as a cornerstone across diverse domains, fueled by advancements in technology like sensor networks, genomics, and social media platforms. This dissertation delves into the inherent opportunities and challenges posed by these datasets, emphasizing their potential for uncovering hidden patterns and correlations amidst their complexity. As high-dimensional datasets proliferate, researchers face significant challenges in effectively analyzing and interpreting them. This research focuses on leveraging Bayesian methods as a robust approach to address these challenges. Bayesian approaches offer unique advantages, particularly in handling small sample sizes and complex models. By providing robust uncertainty quantification and regularization techniques, Bayesian methods ensure reliable inference and model generalization, even in the face of sparse or noisy data. Furthermore, this work examines the strategic integration of structured information as a regularization technique. By exploiting patterns and dependencies within the data, structured regularization enhances the interpretability and resilience of statistical models across various domains. Whether the structure arises from spatial correlations, temporal dependencies, or coordinated actions among covariates, incorporating this information enriches the modeling process and improves the reliability of the results. By exploring these themes, this research contributes to advancing the understanding and application of high-dimensional data analysis. 
Through a thorough examination of Bayesian methods and structured regularization techniques, this dissertation aims to support researchers in effectively navigating and extracting meaningful insights from the complex landscape of high-dimensional datasets.

Application and feasibility of visible-NIR-MIR spectroscopy and classification techniques for wetland soil identification

Whatley, Caleb 10 May 2024 (has links) (PDF)
Wetland determinations require the visual identification of anaerobic soil indicators by an expert, which is a complex and subjective task. To eliminate bias, an objective method is needed to identify wetland soil. Currently, no such method exists that is rapid and easily interpretable. This study proposes a method for wetland soil identification using visible through mid-infrared (MIR) spectroscopy and classification algorithms. Wetland and non-wetland soils (n = 440) were collected across Mississippi. Spectra were measured from fresh and dried soil. Support Vector Classification and Random Forest modeling techniques were used to classify spectra with a 75%/25% calibration/validation split. POWERSHAP Shapley feature selection and Gini importance were used to locate the highest-contributing spectral features. Average classification accuracy was ~91%, with a maximum accuracy of 99.6% on MIR spectra. The most important features were related to iron compounds, nitrates, and soil texture. By providing an objective and rapid method for wetland soil identification, this study improves the reliability of wetland determinations and reduces reliance on expert judgment.
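As a hedged illustration of the classification workflow described above (not the study's code), the following sketch fits Support Vector Classification and Random Forest models to synthetic "spectra" with a 75%/25% calibration/validation split and inspects Gini importances; real absorbance data and the PowerShap feature-selection step would replace the synthetic pieces.

```python
# Illustrative pipeline: wetland vs. non-wetland classification of soil spectra.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
X = rng.normal(size=(440, 1800))        # synthetic absorbance values across 1800 bands
y = rng.integers(0, 2, size=440)        # 1 = wetland soil, 0 = non-wetland (synthetic labels)

X_cal, X_val, y_cal, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)

svc = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
rf = RandomForestClassifier(n_estimators=500, random_state=1)

for name, model in [("SVC", svc), ("Random Forest", rf)]:
    model.fit(X_cal, y_cal)
    print(name, "validation accuracy:", accuracy_score(y_val, model.predict(X_val)))

# Gini importance from the forest points to the most informative spectral regions
top_bands = np.argsort(rf.feature_importances_)[::-1][:10]
print("top spectral feature indices:", top_bands)
```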

High-Dimensional Functional Graphs and Inference for Unknown Heterogeneous Populations

Chen, Han 21 November 2024 (has links)
In this dissertation, we develop innovative methods for analyzing high-dimensional, heterogeneous functional data, focusing specifically on uncovering hidden patterns and network structures within such complex data. We utilize functional graphical models (FGMs) to explore the conditional dependence structure among random elements. We mainly focus on the following three research projects. The first project combines the strengths of FGMs with finite mixture of regression models (FMR) to overcome the challenges of estimating conditional dependence structures from heterogeneous functional data. This novel approach facilitates the discovery of latent patterns, proving particularly advantageous for analyzing complex datasets, such as brain imaging studies of autism spectrum disorder (ASD). Through numerical analysis of both simulated data and real-world ASD brain imaging, we demonstrate the effectiveness of our methodology in uncovering complex dependencies that traditional methods may miss due to their homogeneous data assumptions. Secondly, we address the challenge of variable selection within FMR in high-dimensional settings by proposing a joint variable selection technique. This technique employs a penalized expectation-maximization (EM) algorithm that leverages shared structures across regression components, thereby enhancing the efficiency of identifying relevant predictors and improving the predictive ability. We further expand this concept to mixtures of functional regressions, employing a group lasso penalty for variable selection in heterogeneous functional data. Lastly, we recognize the limitations of existing methods in testing the equality of multiple functional graphs and develop a novel, permutation-based testing procedure. This method provides a robust, distribution-free approach to comparing network structures across different functional variables, as illustrated through simulation studies and functional magnetic resonance imaging (fMRI) analysis for ASD. Hence, these research works provide a comprehensive framework for functional data analysis, significantly advancing the estimation of network structures, functional variable selection, and testing of functional graph equality. This methodology holds great promise for enhancing our understanding of heterogeneous functional data and its practical applications. / Doctor of Philosophy / This study introduces innovative techniques for analyzing complex, high-dimensional functional data, such as functional magnetic resonance imaging (fMRI) data from the brain. Our goal is to reveal underlying patterns and network connections, particularly in the context of autism spectrum disorder (ASD). In functional data, we treat each signal curve from various locations as a single data point. These datasets are characterized by high dimensionality, with the number of model parameters exceeding the sample size. We employ functional graphical models (FGMs) to investigate the conditional dependencies among data elements. Our approach combines FGMs with finite mixture of regression models (FMR), allowing us to uncover hidden patterns that traditional methods assuming homogeneity might miss. Additionally, we introduce a new method for selecting relevant variables in high-dimensional regression contexts. This method enhances prediction accuracy by utilizing shared information among regression components. 
Furthermore, we develop a robust testing framework that allows network structures to be compared between groups without relying on distributional assumptions, enabling precise evaluation of functional graphs. Together, these contributions deepen our understanding of complex, heterogeneous functional data and pave the way for novel insights across various fields.
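A much-simplified scalar analogue of the graph-estimation idea: if each brain region is reduced to a single summary score per subject (an assumption made purely for illustration; the thesis works with full functional graphical models and heterogeneous mixtures), a sparse conditional-dependence graph can be estimated with the graphical lasso.

```python
# Simplified stand-in for conditional-dependence graph recovery.
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(2)
n, p = 200, 30                         # subjects x regions (synthetic summary scores)
Z = rng.normal(size=(n, p))
X = Z.copy()
for j in range(1, p):
    X[:, j] = 0.6 * X[:, j - 1] + Z[:, j]   # chain-structured dependence among "regions"

gl = GraphicalLassoCV().fit(X)
precision = gl.precision_              # zero off-diagonal entries ~ conditional independence
edges = np.argwhere(np.triu(np.abs(precision) > 1e-6, k=1))
print("estimated number of edges:", len(edges))
```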

Topics in Modern Bayesian Computation

Qamar, Shaan January 2015 (has links)
<p>Collections of large volumes of rich and complex data has become ubiquitous in recent years, posing new challenges in methodological and theoretical statistics alike. Today, statisticians are tasked with developing flexible methods capable of adapting to the degree of complexity and noise in increasingly rich data gathered across a variety of disciplines and settings. This has spurred the need for novel multivariate regression techniques that can efficiently capture a wide range of naturally occurring predictor-response relations, identify important predictors and their interactions and do so even when the number of predictors is large but the sample size remains limited. </p><p>Meanwhile, efficient model fitting tools must evolve quickly to keep pace with the rapidly growing dimension and complexity of data they are applied to. Aided by the tremendous success of modern computing, Bayesian methods have gained tremendous popularity in recent years. These methods provide a natural probabilistic characterization of uncertainty in the parameters and in predictions. In addition, they provide a practical way of encoding model structure that can lead to large gains in statistical estimation and more interpretable results. However, this flexibility is often hindered in applications to modern data which are increasingly high dimensional, both in the number of observations $n$ and the number of predictors $p$. Here, computational complexity and the curse of dimensionality typically render posterior computation inefficient. In particular, Markov chain Monte Carlo (MCMC) methods which remain the workhorse for Bayesian computation (owing to their generality and asymptotic accuracy guarantee), typically suffer data processing and computational bottlenecks as a consequence of (i) the need to hold the entire dataset (or available sufficient statistics) in memory at once; and (ii) having to evaluate of the (often expensive to compute) data likelihood at each sampling iteration. </p><p>This thesis divides into two parts. The first part concerns itself with developing efficient MCMC methods for posterior computation in the high dimensional {\em large-n large-p} setting. In particular, we develop an efficient and widely applicable approximate inference algorithm that extends MCMC to the online data setting, and separately propose a novel stochastic search sampling scheme for variable selection in high dimensional predictor settings. The second part of this thesis develops novel methods for structured sparsity in the high-dimensional {\em large-p small-n} regression setting. Here, statistical methods should scale well with the predictor dimension and be able to efficiently identify low dimensional structure so as to facilitate optimal statistical estimation in the presence of limited data. Importantly, these methods must be flexible to accommodate potentially complex relationships between the response and its associated explanatory variables. The first work proposes a nonparametric additive Gaussian process model to learn predictor-response relations that may be highly nonlinear and include numerous lower order interaction effects, possibly in different parts of the predictor space. A second work proposes a novel class of Bayesian shrinkage priors for multivariate regression with a tensor valued predictor. 
Dimension reduction is achieved using a low-rank additive decomposition for the latter, enabling a highly flexible and rich structure within which excellent cell-estimation and region selection may be obtained through state-of-the-art shrinkage methods. In addition, the methods developed in these works come with strong theoretical guarantees.</p> / Dissertation

Novel variable influence on projection (VIP) methods in OPLS, O2PLS, and OnPLS models for single- and multi-block variable selection: VIPOPLS, VIPO2PLS, and MB-VIOP methods

Galindo-Prieto, Beatriz January 2017 (has links)
Multivariate and multiblock data analysis involves useful methodologies for analyzing large data sets in chemistry, biology, psychology, economics, sensory science, and industrial processes; among these methodologies, partial least squares (PLS) and orthogonal projections to latent structures (OPLS®) have become popular. Owing to increasingly computerized instrumentation, a data set can consist of thousands of input variables which contain latent information valuable for research and industrial purposes. When analyzing a large number of data sets (blocks) simultaneously, the number of variables and the underlying connections between them grow considerably; at this point, reducing the number of variables while keeping high interpretability becomes a much-needed strategy. The main direction of research in this thesis is the development of a variable selection method, based on variable influence on projection (VIP), in order to improve the model interpretability of OnPLS models in multiblock data analysis. This new method is called multiblock variable influence on orthogonal projections (MB-VIOP), and its novelty lies in the fact that it is the first multiblock variable selection method for OnPLS models. Several milestones needed to be reached in order to successfully create MB-VIOP. The first milestone was the development of a single-block variable selection method able to handle orthogonal latent variables in OPLS models, i.e. VIP for OPLS (denoted as VIPOPLS or OPLS-VIP in Paper I), which proved to increase the interpretability of PLS and OPLS models and was afterwards successfully extended to multivariate time series analysis (MTSA) aiming at process control (Paper II). The second milestone was to develop the first multiblock VIP approach for enhancement of O2PLS® models, i.e. VIPO2PLS for two-block multivariate data analysis (Paper III). Finally, the third milestone and main goal of this thesis was the development of the MB-VIOP algorithm for the improvement of OnPLS model interpretability when analyzing a large number of data sets simultaneously (Paper IV). The results of this thesis and its enclosed papers show that the VIPOPLS, VIPO2PLS, and MB-VIOP methods successfully assess the most relevant variables for model interpretation in PLS, OPLS, O2PLS, and OnPLS models. In addition, predictability, robustness, dimensionality reduction, and other variable selection goals can potentially be improved or achieved by using these methods.
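For orientation, the sketch below computes classical single-block VIP scores from an ordinary PLS model, the starting point that VIPOPLS, VIPO2PLS, and MB-VIOP extend; it uses scikit-learn's PLSRegression rather than OPLS/OnPLS and is not the thesis's code.

```python
# Classical PLS VIP scores (single-block), for illustration only.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def vip_scores(pls, X):
    T = pls.transform(X)       # (n, A) latent scores for the training data
    W = pls.x_weights_         # (p, A) X-weights
    Q = pls.y_loadings_        # (m, A) y-loadings
    p, A = W.shape
    # Variance in y explained by each latent component
    ss = np.array([(Q[:, a] ** 2).sum() * (T[:, a] ** 2).sum() for a in range(A)])
    Wnorm = W / np.linalg.norm(W, axis=0, keepdims=True)
    return np.sqrt(p * (Wnorm ** 2 @ ss) / ss.sum())

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 100))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=60)   # only the first 5 variables matter

pls = PLSRegression(n_components=3).fit(X, y)
vip = vip_scores(pls, X)
print("variables with VIP > 1:", np.where(vip > 1)[0])      # common rule of thumb for relevance
```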

Um procedimento para seleção de variáveis em modelos lineares generalizados duplos / A procedure for variable selection in double generalized linear models

Cavalaro, Lucas Leite 01 April 2019 (has links)
Double generalized linear models (DGLM), unlike generalized linear models (GLM), allow the dispersion parameter of the response variable to be modeled as a function of predictor variables, improving the way phenomena are modeled. They are thus a natural option when the assumption of a constant dispersion parameter is unreasonable and the response variable has a distribution belonging to the exponential family. Motivated by variable selection in this class of models, we studied the two-step variable selection scheme proposed by Bayer and Cribari-Neto (2015) and, building on this method, developed a scheme that selects variables in up to k steps. To assess the performance of our procedure, we carried out Monte Carlo simulation studies for DGLM. The results indicate that our variable selection procedure generally performs as well as or better than the other methods studied, without requiring a large computational cost. We also evaluated the up-to-k-step selection scheme on a real data set and compared it with different regression methods. The results show that our procedure can also be a good alternative when the goal is prediction.
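A hedged sketch of the two-step idea on a Gaussian heteroscedastic toy example (a simplified stand-in, not the authors' k-step procedure or a full DGLM fit): the mean submodel is chosen by AIC first, then, holding it fixed, the dispersion submodel is chosen from the log squared residuals. Candidate sets and data are invented for illustration.

```python
# Simplified two-step selection: mean submodel first, then dispersion submodel.
import itertools
import numpy as np
import statsmodels.api as sm

def best_subset_aic(y, X):
    p = X.shape[1]
    best_cols, best_aic = (), np.inf
    for k in range(1, p + 1):
        for cols in itertools.combinations(range(p), k):
            fit = sm.OLS(y, sm.add_constant(X[:, list(cols)])).fit()
            if fit.aic < best_aic:
                best_cols, best_aic = cols, fit.aic
    return list(best_cols)

rng = np.random.default_rng(4)
n = 300
X = rng.normal(size=(n, 5))                 # candidates for the mean submodel
Z = rng.normal(size=(n, 4))                 # candidates for the dispersion submodel
sigma = np.exp(0.5 * Z[:, 0])               # dispersion truly depends on Z[:, 0]
y = 1.0 + 2.0 * X[:, 0] - X[:, 1] + sigma * rng.normal(size=n)

mean_cols = best_subset_aic(y, X)                                   # step 1: mean submodel
resid = y - sm.OLS(y, sm.add_constant(X[:, mean_cols])).fit().fittedvalues
disp_cols = best_subset_aic(np.log(resid ** 2 + 1e-8), Z)           # step 2: dispersion submodel
print("mean covariates:", mean_cols, "| dispersion covariates:", disp_cols)
```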

Penalized regression models for compositional data / Métodos de regressão penalizados para dados composicionais

Shimizu, Taciana Kisaki Oliveira 10 December 2018 (has links)
Compositional data consist of vectors known as compositions, whose components are positive, lie in the interval (0,1), and represent proportions or fractions of a whole, with the components summing to one. Compositional data arise in many areas, such as geology, ecology, economics, and medicine. There is therefore great interest in new modeling approaches for compositional data, particularly when covariates influence this type of data. In this context, the main objective of this thesis is to propose a new regression modeling approach for compositional data. The central idea is to develop a method based on penalized regression, in particular the Lasso (least absolute shrinkage and selection operator), the elastic net, and the Spike-and-Slab Lasso (SSL), for estimating the model parameters. In particular, we develop this modeling for compositional data when the number of explanatory variables exceeds the number of observations, for large databases, and when there are constraints on the response variable and the covariates.
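As a minimal sketch of one common formulation (log-ratio-transformed compositional covariates with a lasso penalty), the example below applies a centered log-ratio transform and LassoCV; it is an illustration only and does not reproduce the thesis's elastic net or Spike-and-Slab Lasso machinery.

```python
# Illustration: sparse regression with compositional covariates via clr + lasso.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(5)
n, p = 80, 200
W = rng.dirichlet(alpha=np.ones(p), size=n)      # compositional covariates: rows sum to 1
X = np.log(W + 1e-6)
X = X - X.mean(axis=1, keepdims=True)            # centered log-ratio (clr) transform
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=n)

lasso = LassoCV(cv=5).fit(X, y)                  # p >> n: the penalty does the selection
print("non-zero coefficients:", np.flatnonzero(lasso.coef_))
```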

Seleção bayesiana de variáveis em modelos multiníveis da teoria de resposta ao item com aplicações em genômica / Bayesian variable selection for multilevel item response theory models with applications in genomics

Fragoso, Tiago de Miranda 12 September 2014 (has links)
Recent investigations of the genetic architecture of complex diseases use different sources of information. Different symptoms are measured to obtain a diagnosis, individuals may not be independent due to kinship or a shared environment, and their genetic makeup may be measured through a large number of genetic markers. In the present work, a multilevel item response theory (IRT) model is proposed that unifies all these different sources of information through a latent variable. Furthermore, the large number of molecular markers induces a variable selection problem, for which procedures based on stochastic search variable selection and the Bayesian LASSO are considered. Parameter estimation and variable selection are conducted under a Bayesian framework, in which a Markov chain Monte Carlo algorithm is derived and implemented to obtain samples from the posterior distribution. The estimation procedure is validated through a series of simulation studies in which parameter recovery, variable selection, and estimation error are evaluated in scenarios similar to the real dataset. The estimation procedure showed adequate recovery of the structural parameters and the ability to correctly find a large number of the relevant covariates even in high-dimensional settings, although it also produced biased estimates for the incidental latent variables. 
The proposed methods were then applied to the real dataset collected in the 'Corações de Baependi' family-based association study and were able to appropriately model the metabolic syndrome, a cluster of symptoms associated with elevated heart failure and diabetes risk. The multilevel model produced a latent trait that could be identified with the syndrome, and an associated molecular marker was found.
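As a deliberately simplified illustration of the marker-selection component (not the multilevel IRT model or the MCMC sampler developed in the thesis), the sketch below fits a Bayesian Lasso, i.e. a Laplace prior on regression coefficients, to toy genotype-like markers with PyMC; all names, data, and thresholds are hypothetical.

```python
# Toy Bayesian Lasso: Laplace-prior regression on synthetic genotype-like markers.
import numpy as np
import pymc as pm

rng = np.random.default_rng(6)
n, p = 120, 50
X = rng.integers(0, 3, size=(n, p)).astype(float)    # synthetic marker dosages (0/1/2)
y = 1.5 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(size=n)

with pm.Model():
    lam = pm.HalfCauchy("lam", beta=1.0)              # global shrinkage scale
    beta = pm.Laplace("beta", mu=0.0, b=1.0 / lam, shape=p)
    sigma = pm.HalfNormal("sigma", sigma=1.0)
    pm.Normal("y_obs", mu=pm.math.dot(X, beta), sigma=sigma, observed=y)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=6)

post_mean = idata.posterior["beta"].mean(dim=("chain", "draw")).values
print("markers with |posterior mean| > 0.5:", np.flatnonzero(np.abs(post_mean) > 0.5))
```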
