201

Statistical Methods for Small Sample Cognitive Diagnosis

David B Arthur (10165121) 19 April 2024 (has links)
<p dir="ltr">It has been shown that formative assessments can lead to improvements in the learning process. Cognitive Diagnostic Models (CDMs) are a powerful formative assessment tool that can be used to provide individuals with valuable information regarding skill mastery in educational settings. These models provide each student with a "skill mastery profile" that shows the level of mastery they have obtained with regard to a specific set of skills. These profiles can be used to help both students and educators make more informed decisions regarding the educational process, which can in turn accelerate learning for students. However, despite their utility, these models are rarely used with small sample sizes. One reason for this is that these models are often complex, containing many parameters that can be difficult to estimate accurately when working with a small number of observations. This work aims to contribute to and expand upon previous work to make CDMs more accessible for a wider range of educators and students.</p><p dir="ltr">There are three main small sample statistical problems that we address in this work: 1) accurate estimation of the population distribution of skill mastery profiles, 2) accurate estimation of additional model parameters for CDMs as well as improved classification of individual skill mastery profiles, and 3) improved selection of an appropriate CDM for each item on the assessment. Each of these problems deals with a different aspect of educational measurement, and the solutions provided to these problems can ultimately lead to improvements in the educational process for both students and teachers.
By finding solutions to these problems that work well when using small sample sizes, we make it possible to improve learning in everyday classroom settings and not just in large scale assessment settings.</p><p dir="ltr">In the first part of this work, we propose novel algorithms for estimating the population distribution of skill mastery profiles for a popular CDM, the Deterministic Inputs, Noisy "And" Gate (DINA) model. These algorithms borrow inspiration from the concepts behind popular machine learning algorithms. However, in contrast to these methods, which are often used solely for prediction, we illustrate how the ideas behind these methods can be adapted to obtain estimates of specific model parameters. Through studies involving simulated and real-life data, we illustrate how the proposed algorithms can be used to gain a better picture of the distribution of skill mastery profiles for an entire population of students while only using a small sample of students from that population.</p><p dir="ltr">In the second part of this work, we introduce a new method for regularizing high-dimensional CDMs using a class of Bayesian shrinkage priors known as catalytic priors. We show how a simpler model can first be fit to the observed data and then be used to generate additional pseudo-observations that, when combined with the original observations, make it easier to accurately estimate the parameters in a complex model of interest. We propose an alternative, simpler model that can be used instead of the DINA model and show how the information from this model can be used to formulate an intuitive shrinkage prior that effectively regularizes model parameters. This makes it possible to improve the accuracy of parameter estimates for the more complex model, which in turn leads to better classification of skill mastery.
We demonstrate the utility of this method in studies involving simulated and real-life data and show how the proposed approach is superior to other common approaches for small sample estimation of CDMs.</p><p dir="ltr">Finally, we discuss the important problem of selecting the most appropriate model for each item on an assessment. In practice, it is common to use the same CDM for every item on an assessment; however, this can lead to suboptimal results in terms of parameter estimation and overall model fit. Current methods for item-level model selection rely on large sample asymptotic theory and are thus inappropriate when the sample size is small. We propose a Bayesian approach for performing item-level model selection using reversible jump Markov chain Monte Carlo. This approach allows for the simultaneous estimation of posterior probabilities and model parameters for each candidate model and does not require a large sample size to be valid. We again demonstrate through studies involving simulated and real-life data that the proposed approach leads to a much higher chance of selecting the best model for each item. This in turn leads to better estimates of item and other model parameters, which ultimately leads to more accurate information regarding skill mastery.</p>
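The catalytic-prior idea in the second part — fit a simple model, generate pseudo-observations from it, and down-weight them against the real data — can be sketched for a toy Bernoulli parameter. The function name and the equal-weighting scheme below are illustrative assumptions, not the dissertation's actual estimator:

```python
import numpy as np

def catalytic_estimate(y, pseudo, tau):
    """Weighted MLE for a Bernoulli rate: each real observation in y
    gets weight 1, while the M pseudo-observations (drawn from a
    simpler fitted model) share a total prior weight of tau."""
    y = np.asarray(y, dtype=float)
    pseudo = np.asarray(pseudo, dtype=float)
    M = len(pseudo)
    # weighted success count over weighted sample size
    return (y.sum() + (tau / M) * pseudo.sum()) / (len(y) + tau)

# three real responses plus four pseudo-responses with total weight 2
est = catalytic_estimate([1, 1, 0], [1, 0, 1, 0], tau=2.0)
```

As tau shrinks to zero the estimate reduces to the ordinary sample proportion; larger tau shrinks it toward the simple model's prediction, which is the regularizing effect exploited for small samples.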
202

Robust Representation Learning for Out-of-Distribution Extrapolation in Relational Data

Yangze Zhou (18369795) 17 April 2024 (has links)
<p dir="ltr">Recent advancements in representation learning have significantly enhanced the analysis of relational data across various domains, including social networks, bioinformatics, and recommendation systems. In general, these methods assume that the training and test datasets come from the same distribution, an assumption that often fails in real-world scenarios due to evolving data, privacy constraints, and limited resources. The task of out-of-distribution (OOD) extrapolation emerges when the distribution of test data differs from that of the training data, presenting a significant, yet unresolved challenge within the field. This dissertation focuses on developing robust representations for effective OOD extrapolation, specifically targeting relational data types like graphs and sets. For successful OOD extrapolation, it is essential to first acquire a representation that is adequately expressive for tasks within the distribution. In the first work, we introduce Set Twister, a permutation-invariant set representation that generalizes and enhances the theoretical expressiveness of DeepSets, a simple and widely used permutation-invariant representation for set data, allowing it to capture higher-order dependencies. We showcase its implementation simplicity and computational efficiency, as well as its competitive performance against more complex state-of-the-art graph representations in several graph node classification tasks. Secondly, we address OOD scenarios in graph classification and link prediction tasks, particularly when faced with varying graph sizes. Under causal model assumptions, we derive approximately invariant graph representations that improve extrapolation in the OOD graph classification task.
Furthermore, we provide the first theoretical study of the capability of graph neural networks for inductive OOD link prediction and present a novel representation model that produces structural pairwise embeddings, maintaining predictive accuracy for OOD link prediction as the test graph size increases. Finally, we investigate the impact of environmental data as a confounder between input and target variables, proposing a novel approach utilizing an auxiliary dataset to mitigate distribution shifts. This comprehensive study not only advances our understanding of representation learning in OOD contexts but also highlights potential pathways for future research in enhancing model robustness across diverse applications.</p>
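For context, the DeepSets construction that Set Twister generalizes — encode each element, sum-pool, then read out — can be sketched in a few lines; the weights here are random placeholders rather than trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
W_phi = rng.normal(size=(3, 8))   # per-element encoder weights
W_rho = rng.normal(size=(8, 2))   # readout weights

def deepsets(X):
    """Permutation-invariant set representation: apply the encoder to
    each element, sum-pool across the set, then apply the readout."""
    H = np.tanh(X @ W_phi)        # encode every element independently
    pooled = H.sum(axis=0)        # summation is order-independent
    return np.tanh(pooled @ W_rho)

X = rng.normal(size=(5, 3))       # a set of 5 elements in R^3
```

Because the pooling is a sum, any reordering of the rows of X yields the same output — the invariance that Set Twister preserves while adding higher-order interaction terms.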
203

MULTI-STATE MODELS WITH MISSING COVARIATES

Lou, Wenjie 01 January 2016 (has links)
Multi-state models have been widely used to analyze longitudinal event history data obtained in medical studies. The tools and methods developed recently in this area require completely observed datasets. However, in many applications, measurements on certain components of the covariate vector are missing for some study subjects. In this dissertation, several likelihood-based methodologies are proposed to deal efficiently with different types of missing covariates when applying multi-state models. First, a maximum observed-data likelihood method is proposed for data with a univariate missing pattern where the missing covariate is a categorical variable. The construction of the observed-data likelihood function is based on a model for the joint distribution of the longitudinal event history response and the discrete covariate with missing values. Second, we propose a maximum simulated likelihood method to deal with a missing continuous covariate when applying multi-state models. The observed-data likelihood function is approximated using Monte Carlo simulation. Finally, an EM algorithm is used to deal with multiple missing covariates when estimating the parameters of a multi-state model. The EM algorithm can efficiently handle multiple missing discrete covariates under a general missing pattern. All the proposed methods are justified by simulation studies and applications to datasets from the SMART project, a consortium of 11 different high-quality longitudinal studies of aging and cognition.
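The first method's observed-data likelihood — marginalizing a missing categorical covariate over its distribution — can be sketched for a toy binary response; the factorization shown is a generic illustration, not the dissertation's multi-state likelihood:

```python
import numpy as np

def obs_loglik(y, z, pz, p_y_given_z):
    """Observed-data log-likelihood with a categorical covariate z that
    is missing (None) for some subjects. Complete cases contribute
    f(y | z); incomplete cases contribute sum_c P(z = c) f(y | z = c)."""
    ll = 0.0
    for yi, zi in zip(y, z):
        if zi is None:
            # marginalize over the possible covariate categories
            mix = sum(pz[c] * (p_y_given_z[c] if yi else 1 - p_y_given_z[c])
                      for c in range(len(pz)))
            ll += np.log(mix)
        else:
            p = p_y_given_z[zi]
            ll += np.log(p if yi else 1 - p)
    return ll
```

Maximizing this function over the covariate distribution and the response model yields the maximum observed-data likelihood estimates for the toy setting.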
204

META-ANALYSIS OF GENE EXPRESSION STUDIES

Siangphoe, Umaporn 01 January 2015 (has links)
Combining effect sizes from individual studies using random-effects models is a common approach in the analysis of high-dimensional gene expression data. However, unknown study heterogeneity can arise from inconsistent sample quality and experimental conditions, and high heterogeneity of effect sizes can reduce the statistical power of the models. We propose two new methods for random-effects estimation, along with measures of model variation and of the strength of study heterogeneity. We then develop a statistical technique to test for the significance of random effects and to identify heterogeneous genes. We also propose another meta-analytic approach that incorporates informative weights into random-effects meta-analysis models. We compare the proposed methods with standard and existing meta-analytic techniques in the classical and Bayesian frameworks, and demonstrate our results through a series of simulations and an application to gene expression data from neurodegenerative diseases.
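As a baseline for the models being compared, the standard method-of-moments random-effects estimate (DerSimonian–Laird) for one gene's effect sizes looks like this; it is shown only as the conventional starting point, not as one of the proposed estimators:

```python
import numpy as np

def dersimonian_laird(effects, variances):
    """Standard random-effects meta-analysis: method-of-moments
    estimate of the between-study variance tau2, then the pooled
    effect with weights 1 / (v_i + tau2)."""
    y = np.asarray(effects, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / v
    mu_fe = np.sum(w * y) / np.sum(w)            # fixed-effect mean
    Q = np.sum(w * (y - mu_fe) ** 2)             # heterogeneity statistic
    df = len(y) - 1
    C = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (Q - df) / C)                # truncated at zero
    w_re = 1.0 / (v + tau2)
    return np.sum(w_re * y) / np.sum(w_re), tau2
```

When the heterogeneity statistic Q does not exceed its degrees of freedom, tau2 truncates to zero and the estimate reduces to the fixed-effect mean — the regime where tests for the significance of random effects become relevant.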
205

Dimension Reduction and Variable Selection

Moradi Rekabdarkolaee, Hossein 01 January 2016 (has links)
High-dimensional data are becoming increasingly available as data collection technology advances. Over the last decade, significant developments have taken place in high-dimensional data analysis, driven primarily by a wide range of applications in fields such as genomics, signal processing, and environmental studies. Statistical techniques such as dimension reduction and variable selection play important roles in high-dimensional data analysis. Sufficient dimension reduction provides a way to find the reduced space of the original space without a parametric model, and has been widely applied in recent years in scientific fields such as genetics, brain imaging analysis, econometrics, and environmental sciences. In this dissertation, we work on three projects. The first combines local modal regression and Minimum Average Variance Estimation (MAVE) to introduce a robust dimension reduction approach. In addition to being robust to outliers and heavy-tailed distributions, our proposed method has the same convergence rate as the original MAVE. Furthermore, we combine local-modal-based MAVE with an $L_1$ penalty to select informative covariates in a regression setting. This new approach can exhaustively estimate directions in the regression mean function and select informative covariates simultaneously, while being robust to possible outliers in the dependent variable. The second project develops sparse adaptive MAVE (saMAVE). SaMAVE has advantages over adaptive LASSO because it extends adaptive LASSO to multi-dimensional and nonlinear settings without any model assumption, and advantages over sparse inverse dimension reduction methods in that it does not require any particular probability distribution on \textbf{X}. In addition, saMAVE can exhaustively estimate the dimensions in the conditional mean function. The third project extends the envelope method to multivariate spatial data.
The envelope technique is a recent extension of the classical multivariate linear model, and the envelope estimator asymptotically has less variation compared to the maximum likelihood estimator (MLE). The current envelope methodology assumes independent observations. While the assumption of independence is convenient, it does not address the additional complications associated with spatial correlation. This work extends the envelope method to cases where independence is an unreasonable assumption, specifically to multivariate data from spatially correlated processes. This novel approach provides estimates for the parameters of interest with smaller variance than the maximum likelihood estimator while still capturing the spatial structure in the data.
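The robustness mechanism in the first project — local modal regression's kernel reweighting of residuals — can be sketched in its simplest linear form; the bandwidth, iteration count, and tiny ridge term are illustrative choices, not the dissertation's tuned procedure:

```python
import numpy as np

def modal_linear_regression(X, y, h=1.0, iters=50):
    """Modal linear regression via iterative kernel reweighting (an
    EM-type scheme): observations with large residuals get small
    Gaussian-kernel weights, so the fit tracks the conditional mode
    rather than the mean and is insensitive to gross outliers."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta = np.linalg.lstsq(X1, y, rcond=None)[0]   # OLS starting value
    for _ in range(iters):
        r = y - X1 @ beta
        w = np.exp(-0.5 * (r / h) ** 2)            # downweight outliers
        # weighted normal equations, with a tiny ridge for stability
        A = X1.T @ (w[:, None] * X1) + 1e-8 * np.eye(X1.shape[1])
        beta = np.linalg.solve(A, X1.T @ (w * y))
    return beta

# four points on y = 2x plus one gross outlier
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 2.0, 4.0, 6.0, 100.0])
beta = modal_linear_regression(X, y, h=10.0)
```

Ordinary least squares would be pulled strongly toward the outlier; the reweighted fit recovers the line through the clean points.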
206

Security Analysis on Network Systems Based on Some Stochastic Models

Li, Xiaohu 01 December 2014 (has links)
Thanks to great effort from mathematicians, physicists, and computer scientists, network science has developed rapidly during the past decades. However, because of the complexity involved, most research in this area is based only on experiments and simulations; it is therefore critical to pursue theoretical results in order to gain more insight into how the structure of a network affects its security. This dissertation introduces stochastic and statistical models for certain networks and uses a k-out-of-n tolerant structure to characterize, both logically and physically, the behavior of nodes. Based upon these models, we draw several illuminating results in the following two aspects, consistent with what computer scientists have observed in practical situations and experimental studies. Suppose that a node in a P2P network loses its designed function or service when some of its neighbors are disconnected. By studying the isolation probability and the durable time of a single user, we prove that a network whose user lifetimes have more NWUE-ness is more resilient, in the sense of having a smaller probability of being isolated by neighbors and a longer time online without interruption. Meanwhile, some preservation properties are also studied for the durable time of a network. Additionally, in order to apply the model in practice, both graphical and nonparametric statistical methods are developed and applied to a real data set. On the other hand, a stochastic model is introduced to investigate the security of network systems based on their vulnerability graph abstractions. A node loses its designed function when a certain number of its neighbors are compromised, in the sense of being taken over by malicious code or a hacker. The attack compromises some nodes, and the victimized nodes become accomplices. We derive an equation for the probability that a node in the network is compromised.
Since this equation has no explicit solution, we also establish new lower and upper bounds for the probability. The two models proposed here generalize existing models in the literature, and the corresponding theoretical results effectively improve upon known results, carrying insight for designing more secure systems and for enhancing the security of existing systems.
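The k-out-of-n failure mechanism in the second model has a simple closed form when neighbors are compromised independently with a common probability; the independence and common-p assumptions are simplifications of the model in the text:

```python
from math import comb

def p_node_compromised(n, k, p):
    """Probability that a node with n neighbors loses its designed
    function, assuming it fails once at least k neighbors are
    compromised, each independently with probability p (a k-out-of-n
    tolerant structure)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))
```

In the dissertation's setting the neighbors' compromise events are dependent (victimized nodes become accomplices), which is why the exact probability there has no explicit solution and only bounds are available.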
207

A Hierarchical Bayesian Model for the Unmixing Analysis of Compositional Data subject to Unit-sum Constraints

Yu, Shiyong 15 May 2015 (has links)
Modeling of compositional data is emerging as an active area in statistics. Compositional data are assumed to represent the convex linear mixing of a definite number of independent sources, usually referred to as end members. A generic problem in practice is to appropriately separate the end members and quantify their fractions from compositional data subject to nonnegativity and unit-sum constraints. A number of methods, essentially related to polytope expansion, have been proposed; however, these deterministic methods have some potential problems. In this study, a hierarchical Bayesian model was formulated, and the algorithms were coded in MATLAB. Test runs using both synthetic and real-world datasets yield scientifically sound and mathematically optimal outputs broadly consistent with other non-Bayesian methods. The sensitivity of the model to the choice of priors and to the structure of the error covariance matrix is also discussed.
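A crude deterministic baseline for the fraction-quantification step — least squares with a heavily weighted row enforcing the unit-sum constraint, then clipping to nonnegativity — can be sketched as follows. The dissertation's hierarchical Bayesian model handles these constraints properly; this is only an illustrative baseline, and the function name is an assumption:

```python
import numpy as np

def unmix(E, y, w=1e3):
    """Estimate mixing fractions f with E @ f ≈ y, softly enforcing
    sum(f) = 1 via an appended heavily weighted row, then clipping to
    nonnegativity and renormalizing."""
    m = E.shape[1]
    A = np.vstack([E, w * np.ones((1, m))])   # append the sum-to-one row
    b = np.concatenate([y, [w]])
    f, *_ = np.linalg.lstsq(A, b, rcond=None)
    f = np.clip(f, 0.0, None)                 # crude nonnegativity fix
    return f / f.sum()

# two end members observed in a three-component composition
E = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
f = unmix(E, np.array([0.3, 0.7, 0.0]))
```

Unlike this point estimate, the Bayesian formulation also yields uncertainty in the recovered end members and fractions.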
208

Statistical analysis of battery signals in real electric vehicle use, for the understanding, estimation, and management of ageing phenomena

Barré, Anthony 17 October 2014 (has links)
The electric vehicle market is currently undergoing significant growth, driven by a variety of factors. However, limits on battery performance remain a major obstacle to stronger sales growth, and the performance and lifetime of the batteries are central concerns for users. Batteries lose performance over time through complex phenomena involving interactions among their various operating conditions. To improve the understanding and estimation of battery ageing, this work studies data from batteries in real-world use on electric vehicles. Specifically, the study consists of adapting statistical approaches to the measured data in order to highlight interactions between variables, and of developing methods for estimating the battery's performance level based solely on the measurements obtained. The results of these methodologies illustrate the value of a statistical approach, for example by demonstrating that the signals coming from the battery contain information useful for estimating its state of health.
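One minimal instance of such a purely data-driven estimator is a ridge regression from measured signal features to a health indicator; the feature and target names here are hypothetical placeholders, not the variables used in the thesis:

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Ridge regression mapping battery signal features (rows of X,
    e.g. mean current, temperature, depth of discharge — illustrative
    assumptions) to a health indicator y such as remaining capacity."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def predict(X, beta):
    """State-of-health prediction from new signal measurements."""
    return X @ beta
```

The regularization parameter lam trades bias for variance, which matters when the measured signals are noisy and correlated, as they are in real vehicle use.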
209

Linear Mixed Models for Longitudinal Data Using the ASReml-R Package

Alcarde, Renata 10 April 2012 (has links)
Most experiments installed today are designed so that observations are taken over time, or at different depths; that is, such experiments usually contain a longitudinal factor. One way to analyze this type of data set is with mixed models, through the inclusion of random-effect factors; using the restricted maximum likelihood (REML) method, the variance components associated with those factors can be estimated with reduced bias. The ASReml-R statistical package is very efficient for fitting linear mixed models because it has a wide variety of structures for the variance-covariance matrices already implemented, but it has the inconvenience of not exposing as objects the design matrices X and Z, nor the variance-covariance matrices D and Σ, which are of great importance for checking the assumptions of the model. This work gathers tools that facilitate and provide steps for building models based on randomization, such as the Hasse diagram, the randomization diagram, and the formulation of mixed models including longitudinal factors. Since the vector of conditional residuals and the vector of random-effect parameters are confounded, that is, not independent, residuals known in the literature as least confounded residuals were obtained and, as a contribution of this work, the least confounded EBLUP was calculated. To this end, functions were implemented that, using the objects of a model fitted with the ASReml-R package, make the matrices of interest available and compute the least confounded residuals and the least confounded EBLUP. To illustrate the techniques presented here and to highlight the importance of checking the assumptions of the adopted model, two examples containing longitudinal factors were considered: the first a simple experiment comparing the efficiency of different roofing materials in poultry facilities, and the second an experiment carried out in three phases, containing completely confounded factors, aimed at evaluating characteristics of the paper produced by different eucalyptus species at different ages.
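The design matrices X and Z and the covariance matrices that the text wants exposed are exactly the ingredients of Henderson's mixed model equations, which yield the fixed-effect estimates and the EBLUP jointly; this sketch assumes the covariance matrices are known rather than estimated by REML:

```python
import numpy as np

def mixed_model_solve(X, Z, y, D, Sigma):
    """Henderson's mixed model equations for y = X b + Z u + e, with
    Var(u) = D and Var(e) = Sigma (both assumed known here): returns
    the fixed-effect estimates b_hat and the EBLUP u_hat."""
    Si = np.linalg.inv(Sigma)
    Di = np.linalg.inv(D)
    C = np.block([[X.T @ Si @ X, X.T @ Si @ Z],
                  [Z.T @ Si @ X, Z.T @ Si @ Z + Di]])
    rhs = np.concatenate([X.T @ Si @ y, Z.T @ Si @ y])
    sol = np.linalg.solve(C, rhs)
    p = X.shape[1]
    return sol[:p], sol[p:]

# balanced one-way random-intercept toy data: two groups of two
X = np.ones((4, 1))
Z = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
y = np.array([1.0, 1.0, 3.0, 3.0])
b, u = mixed_model_solve(X, Z, y, D=1e4 * np.eye(2), Sigma=np.eye(4))
```

With a large random-effect variance the EBLUP shrinks the group deviations only slightly toward zero; having X, Z, D, and Σ in hand like this is what makes residual and confounding diagnostics possible.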
210

Mapping QTLs: Classical and Bayesian Approaches

Toledo, Elisabeth Regina de 02 October 2006 (has links)
Grain yield and other traits of economic importance in maize, such as plant height, ear length, and ear diameter, exhibit polygenic inheritance, which makes it difficult to obtain information about the genetic bases underlying the variation of these traits. Associations between markers and QTLs were analyzed using composite interval mapping (CIM) and Bayesian interval mapping (BIM). Based on a grain yield data set from the evaluation of 256 maize progenies genotyped for 139 codominant molecular markers, the presented methodologies allowed markers associated with QTLs to be identified. Under the CIM procedure, associations between markers and QTLs were considered significant when the likelihood ratio (LR) statistic along the chromosome reached its maximum among the values exceeding the critical threshold LR = 11.5 in the interval considered. Ten QTLs were mapped, distributed across three chromosomes; together they explained 19.86% of the genetic variance. The predominant types of allelic interaction were partial dominance (four QTLs) and complete dominance (three QTLs). The average degree of dominance was 1.12, indicating complete dominance on average. Most of the alleles favorable to the trait came from the parental line L02-02D, which had the highest grain yield. Under the Bayesian approach, Markov chain Monte Carlo (MCMC) sampling methods were implemented to obtain a sample from the posterior distribution of the parameters of interest, incorporating prior beliefs and uncertainty. Summaries of the QTL locations and their additive and dominance effects were obtained. Reversible jump MCMC (RJMCMC) was used for the Bayesian analysis, and the Bayes factor was computed to estimate the number of QTLs. With the BIM method, significant associations between markers and QTLs were found on four chromosomes, with a total of five QTLs mapped; together these QTLs explained 13.06% of the genetic variance. Most of the favorable alleles also came from the parental line L02-02D.
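A stripped-down version of the LR scan underlying CIM — comparing a genotype-means model against a single-mean null at one marker — can be written directly from residual sums of squares; real CIM also conditions on cofactor markers, which this sketch omits:

```python
import numpy as np

def marker_lr(genotypes, phenotypes):
    """Single-marker likelihood-ratio statistic (a simplified stand-in
    for interval mapping): under normal errors, LR = n * log(RSS0 /
    RSS1), where RSS0 is the single-mean null fit and RSS1 the
    genotype-means fit."""
    g = np.asarray(genotypes)
    y = np.asarray(phenotypes, dtype=float)
    n = len(y)
    rss0 = np.sum((y - y.mean()) ** 2)
    rss1 = sum(np.sum((y[g == c] - y[g == c].mean()) ** 2)
               for c in np.unique(g))
    return n * np.log(rss0 / rss1)
```

Scanning a statistic of this kind along the chromosome and keeping the peaks that exceed the critical value (LR = 11.5 in the study above) gives the mapped QTL positions.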
