141

Partition clustering of High Dimensional Low Sample Size data based on P-Values

Von Borries, George Freitas January 1900 (has links)
Doctor of Philosophy / Department of Statistics / Haiyan Wang / This thesis introduces a new partitioning algorithm to cluster variables in high dimensional low sample size (HDLSS) data and high dimensional longitudinal low sample size (HDLLSS) data. HDLSS data contain a large number of variables with a small number of replications per variable, and HDLLSS data refer to HDLSS data observed over time. Clustering techniques play an important role in analyzing such data, which arise commonly in microarray experiments, mass spectrometry, and pattern recognition. Most current clustering algorithms for HDLSS and HDLLSS data are adaptations from traditional multivariate analysis, where the number of variables is not high and sample sizes are relatively large. These algorithms perform poorly when applied to high dimensional data, especially with small sample sizes, and often exhibit poor clustering accuracy and stability for non-normal data. Simulations show that traditional clustering algorithms used on high dimensional data are not robust to monotone transformations. The proposed clustering algorithm, PPCLUST, is a powerful tool for clustering HDLSS data; it uses p-values from nonparametric rank tests of homogeneous distribution as a measure of similarity between groups of variables. Inheriting the robustness of rank procedures, the new algorithm is robust to outliers and invariant to monotone transformations of the data. PPCLUSTEL is an extension of PPCLUST for clustering HDLLSS data. A nonparametric test of no simple effect of group is developed, and the p-value from the test is used as a measure of similarity between groups of variables. PPCLUST and PPCLUSTEL are able to cluster a large number of variables in the presence of very few replications, and PPCLUSTEL requires neither a large number of time points nor equally spaced ones. PPCLUST and PPCLUSTEL do not suffer from loss of power due to distributional assumptions, general multiple comparison problems, or difficulty in controlling heteroscedastic variances. Applications to data from previous microarray studies show promising results, and simulation studies reveal that the algorithms outperform a series of benchmark algorithms applied to HDLSS data, exhibiting high clustering accuracy and stability.
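As a rough illustration of the p-value-as-similarity idea (this is not the published PPCLUST algorithm; the test, threshold, and greedy merging rule below are all simplifying assumptions):

```python
# Minimal sketch (not the published PPCLUST algorithm): using the p-value of a
# nonparametric rank test as a similarity between two groups of variables.
# Replication counts, the alpha threshold, and the merging rule are illustrative.
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)
# 20 variables, 4 replications each; variables 0-9 and 10-19 follow two distributions
X = np.vstack([rng.normal(0, 1, (10, 4)), rng.normal(3, 1, (10, 4))])

def similarity(group_a, group_b):
    """p-value of a Kruskal-Wallis test that two groups of variables share a
    common distribution; a large p-value means the groups look homogeneous."""
    return kruskal(X[group_a].ravel(), X[group_b].ravel()).pvalue

# Greedy merging: start with singleton clusters and merge while the most similar
# pair shows no significant heterogeneity.
clusters, alpha = [[i] for i in range(X.shape[0])], 0.05
while len(clusters) > 1:
    pairs = [(similarity(a, b), i, j) for i, a in enumerate(clusters)
             for j, b in enumerate(clusters) if i < j]
    p, i, j = max(pairs)          # most similar pair = largest p-value
    if p < alpha:
        break                     # remaining groups differ in distribution
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]
print(clusters)
```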
142

R² statistics with application to association mapping

Sun, Guannan January 1900 (has links)
Master of Science / Department of Statistics / Shie-Shien Yang / In fitting linear models, the R² statistic is widely used as a measure of the goodness-of-fit and predictive power of the model. Unlike fixed-effects linear models, there is currently no single universally accepted measure for assessing the goodness-of-fit and predictive power of a linear mixed model. In this report, we review seven approaches that have been proposed to define a measure analogous to the usual R² statistic for mixed models. One of the seven statistics, Rc, has both conditional and marginal versions. Association mapping is an efficient way to link genotype data with phenotypic diversity. When the R² statistic is applied to association mapping, it can quantify how well genetic polymorphisms, the explanatory variables in the mixed models, explain the phenotypic variation, the dependent variable. A linear mixed model method has recently been developed to control spurious associations due to population structure and relative kinship among individuals in an association mapping sample. We assess the seven definitions of the R² statistic for the linear mixed model using data from two empirical association mapping samples analyzed with the new method: a sample of 277 diverse maize inbred lines and a global sample of 95 Arabidopsis thaliana accessions. The R²_LR statistic, derived from the log-likelihood principle, satisfies all the criteria for an R² statistic and can be used to understand the overlap between population structure and relative kinship in controlling for sample relatedness. From our results, R²_LR is an appropriate R² statistic for comparing models with different fixed and random variables. Therefore, we recommend using the R²_LR statistic for linear mixed models in association mapping.
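The report's definition of R²_LR is not reproduced in this abstract; a standard likelihood-ratio-based form (the Cox-Snell/Magee construction, which is assumed here to be what R²_LR denotes) is

```latex
R^2_{LR} \;=\; 1 - \exp\!\left\{-\frac{2}{n}\Bigl[\ell\bigl(\widehat{M}\bigr) - \ell\bigl(\widehat{M}_{0}\bigr)\Bigr]\right\},
```

where ℓ(M̂) and ℓ(M̂₀) are the maximized log-likelihoods of the fitted mixed model and of a baseline (intercept-only) model and n is the number of observations; the exact choice of baseline model in the mixed-model case is an assumption here.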
143

Inference of nonparametric hypothesis testing on high dimensional longitudinal data and its application in DNA copy number variation and microarray data analysis

Zhang, Ke January 1900 (has links)
Doctor of Philosophy / Department of Statistics / Haiyan Wang / High-throughput screening technologies have generated a huge amount of biological data in the last ten years. With the easy availability of array technology, researchers have started to investigate biological mechanisms using experiments with more sophisticated designs that pose novel challenges to statistical analysis. We provide theory for robust statistical tests in three flexible models. In the first model, we consider hypothesis testing problems when a large number of variables are observed repeatedly over time. A potential application is in tumor genomics, where an array comparative genomic hybridization (aCGH) study can be used to detect progressive DNA copy number changes in tumor development. In the second model, we consider hypothesis testing theory in a longitudinal microarray study with multiple treatments or experimental conditions. The tests developed can be used to detect treatment effects for a large group of genes and to discover genes that respond to treatment over time. In the third model, we address a hypothesis testing problem that arises when array data from different sources are to be integrated, and we perform statistical tests under a nested design. In all models, robust test statistics were constructed based on moment methods, allowing unbalanced designs and arbitrary heteroscedasticity. The limiting distributions were derived in the nonclassical setting where the number of probes is large. The test statistics are not targeted at a single probe; instead, we are interested in testing a selected set of probes simultaneously. Simulation studies were carried out to compare the proposed methods with traditional tests based on linear mixed-effects models and generalized estimating equations. Interesting results obtained with the proposed theory in two cancer genomic studies suggest that the new methods are promising for a wide range of biological applications with longitudinal arrays.
144

Individual treatment effect heterogeneity in multiple time points trials

Ndum, Edwin Andong January 1900 (has links)
Doctor of Philosophy / Department of Statistics / Gary Gadbury / In biomedical studies, the treatment main effect is often expressed in terms of an "average difference." A treatment that appears superior based on the average effect may not be superior for all subjects in a population if there is substantial "subject-treatment interaction." A parameter quantifying subject-treatment interaction is inestimable in two-sample completely randomized designs. Crossover designs have been suggested as a way to estimate the variability in individual treatment effects, since an "individual treatment effect" can be measured. However, variability in these observed individual effects may include variability due to the treatment plus inherent variability of a response over time. We use the "Neyman-Rubin Model of Causal Inference" (Neyman, 1923; Rubin, 1974) for the analyses. This dissertation consists of two parts: a quantitative and a qualitative response analysis. The quantitative part focuses on disentangling the variability due to treatment effects from the variability due to time effects using suitable crossover designs. We propose a parameter that defines the variance of a true individual treatment effect in two crossover designs and show that it is not directly estimable, although the mean effect is estimable. Furthermore, we show that the estimated variance of individual treatment effects is biased under both designs, with the bias depending on time effects. Under certain design considerations, linear combinations of time effects can be estimated, making it possible to separate the variability due to time from that due to treatment. The qualitative part involves a binary response and centers on estimating the average treatment effect and bounding a probability of a negative effect, a parameter related to the variability of individual treatment effects. Using a stated joint probability distribution of potential outcomes, we express the probability of the observed outcomes under a two-treatment, two-period crossover design. Maximum likelihood estimates of these probabilities are found using an iterative numerical method, and from these we propose bounds for the inestimable probability of a negative effect. Tighter bounds are obtained with information from subjects that receive the same treatment in both periods. Finally, we simulate an example of observed count data to illustrate estimation of the bounds.
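A compact way to see why the individual-effect variance cannot be identified in a parallel-group design, in potential-outcome notation (ours, not necessarily the dissertation's): write Y_i(T) and Y_i(C) for subject i's responses under treatment and control, and D_i = Y_i(T) - Y_i(C) for the individual treatment effect. Then

```latex
\operatorname{Var}(D) \;=\; \sigma_T^{2} + \sigma_C^{2} - 2\,\sigma_{TC},
```

where σ_TC is the covariance between the two potential outcomes. Each subject reveals only one potential outcome in a completely randomized design, so σ_TC is unobservable; a crossover design observes both responses per subject, but the second measurement is entangled with period (time) effects, which is the confounding the quantitative part above works to remove.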
145

Inference for the intrinsic separation among distributions which may differ in location and scale

Ling, Yan January 1900 (has links)
Doctor of Philosophy / Department of Statistics / Paul I. Nelson / The null hypothesis of equal distributions, H0: F1 = F2 = ... = FK, is commonly used to compare two or more treatments based on data consisting of independent random samples. Under this approach, evidence of a difference among the treatments may be reported even though, from a practical standpoint, their effects are indistinguishable, a longstanding problem in hypothesis testing. The concept of effect size is widely used in the social sciences to deal with this issue by computing a unit-free estimate of the magnitude of the departure from H0 in terms of a change in location. I extend this approach by replacing H0 with hypotheses H0* stating that the distributions {Fi} are possibly different in location and/or scale, but close, so that rejection provides evidence that at least one treatment has an important practical effect. Assessing statistical significance under H0* is difficult and typically requires inference in the presence of nuisance parameters. I use frequentist, Bayesian, and fiducial modes of inference to obtain approximate tests and carry out simulation studies of their behavior in terms of size and power. In some cases a bootstrap is employed. I focus on tests based on independent random samples arising from K ≥ 3 normal distributions not required to have the same variances, generalizing the K = 2 sample parameter P(X1 > X2) and the noncentrality-type parameters that arise in testing for the equality of means.
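For reference, the K = 2 parameter mentioned above has a closed form when X1 and X2 are independent normal random variables with possibly unequal variances (a standard result; the dissertation's K ≥ 3 generalization is not reproduced here):

```latex
P(X_1 > X_2) \;=\; \Phi\!\left(\frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^{2} + \sigma_2^{2}}}\right),
```

where Φ is the standard normal distribution function.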
146

Generalized and multiple-trait extensions to Quantitative-Trait Locus mapping

Joehanes, Roby January 1900 (has links)
Doctor of Philosophy / Genetics Interdepartmental Program / James C. Nelson / QTL (quantitative-trait locus) analysis aims to locate and estimate the effects of genes that are responsible for quantitative traits, by means of statistical methods that evaluate the association of genetic variation with trait (phenotypic) variation. Quantitative traits are typically controlled by multiple genes with varying degrees of influence on the phenotype. I describe a new QTL analysis method based on shrinkage and a unifying framework based on the generalized linear model for non-normal data. I develop their extensions to multiple-trait QTL analysis. Expression QTL, or eQTL, analysis is QTL analysis applied to gene expression data to reveal the eQTLs controlling transcript-abundance variation, with the goal of elucidating gene regulatory networks. For exploiting eQTL data, I develop a novel extension of the graphical Gaussian model that produces an undirected graph of a gene regulatory network. To reduce the dimensionality, the extension constructs networks one cluster at a time. However, because Fuzzy-K, the clustering method of choice, relies on subjective visual cutoffs for cluster membership, I develop a bootstrap method to overcome this disadvantage. Finally, I describe QGene, an extensible QTL- and eQTL-analysis software platform written in Java and used for implementation of all analyses.
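For context, the standard graphical Gaussian model machinery that such a network extension builds on (the extension itself is not reproduced here) links conditional independence between transcripts to the precision matrix Ω = Σ⁻¹ of their expression profiles:

```latex
\rho_{ij\,\cdot\,\text{rest}} \;=\; -\,\frac{\omega_{ij}}{\sqrt{\omega_{ii}\,\omega_{jj}}},
\qquad
\rho_{ij\,\cdot\,\text{rest}} = 0 \;\Longleftrightarrow\; X_i \perp X_j \mid X_{\text{rest}},
```

so an undirected edge is drawn between genes i and j exactly when the estimated partial correlation is judged nonzero.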
147

A simulation comparison of parametric and nonparametric estimators of quantiles from right censored data

Serasinghe, Shyamalee Kumary January 1900 (has links)
Master of Science / Department of Statistics / Paul I. Nelson / Quantiles are useful in describing distributions of component lifetimes. Data consisting of the lifetimes of sample units, used to estimate quantiles, are often censored. Right censoring, the setting investigated here, occurs, for example, when some test units are still functioning when the experiment is terminated. This study investigated and compared the performance of parametric and nonparametric estimators of quantiles from right-censored data generated from Weibull and lognormal distributions, models that are commonly used in analyzing lifetime data. Parametric quantile estimators based on these assumed models were compared via simulation to each other and to quantile estimators obtained from the nonparametric Kaplan-Meier estimator of the survival function. Various combinations of quantiles, censoring proportions, sample sizes, and distributions were considered. Our simulations show that the larger the sample size and the lower the censoring rate, the better the performance of the estimates of the 5th percentile of Weibull data. The lognormal data are very sensitive to the censoring rate, and we observed that for higher censoring rates the incorrect parametric estimates perform best. If the underlying distribution of the data is unknown, it is risky to use parametric estimates of quantiles close to one. A limitation of the nonparametric estimator of large quantiles is its instability when the censoring rate is high and the largest observations are censored. Key words: quantiles, right censoring, Kaplan-Meier estimator
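As a sketch of the nonparametric route (assumed setup and toy parameters, not the thesis simulation code), the Kaplan-Meier estimator can be inverted to estimate a quantile from right-censored Weibull data and compared with the true quantile:

```python
# Minimal sketch: Kaplan-Meier quantile estimation from right-censored Weibull
# lifetimes, compared with the true (parametric) quantile.
import numpy as np

rng = np.random.default_rng(1)
n, shape, scale = 200, 1.5, 10.0
lifetimes = scale * rng.weibull(shape, n)          # true failure times
censor = rng.uniform(0, 30, n)                     # independent censoring times
time = np.minimum(lifetimes, censor)               # observed times
event = (lifetimes <= censor).astype(int)          # 1 = failure observed, 0 = censored

# Kaplan-Meier (product-limit) estimate of the survival function S(t)
order = np.argsort(time)
t_sorted, d_sorted = time[order], event[order]
at_risk = np.arange(n, 0, -1)                      # number still at risk at each ordered time
surv = np.cumprod(1.0 - d_sorted / at_risk)

def km_quantile(p):
    """Smallest observed time at which the KM estimate of F(t) = 1 - S(t) reaches p."""
    idx = np.nonzero(1.0 - surv >= p)[0]
    return t_sorted[idx[0]] if idx.size else np.nan  # undefined if KM never reaches p

p = 0.5
true_q = scale * (-np.log(1 - p)) ** (1 / shape)   # Weibull quantile function
print(f"KM median: {km_quantile(p):.2f}   true median: {true_q:.2f}")
```

With heavy censoring of the largest observations, `km_quantile` for p near one returns NaN or becomes unstable, which is the limitation noted above.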
148

Statistical Methods for Dating Collections of Historical Documents

Tilahun, Gelila 31 August 2011 (has links)
The problem in this thesis was originally motivated by problems presented by the Documents of Early England Data Set (DEEDS). The central problem with these medieval documents is the lack of methods to assign accurate dates to those documents that bear no date. With the problems of the DEEDS documents in mind, we present two methods to impute missing features of texts. In the first method, we suggest a new class of metrics for measuring distances between texts and then show how to combine the distances between texts using statistical smoothing. This method can be adapted to settings where the features of the texts are ordered or unordered categorical variables (as in the case of, for example, authorship assignment problems). In the second method, we estimate the probability of occurrence of words in texts using the nonparametric regression technique of local polynomial fitting with kernel weights applied to generalized linear models. We combine the estimated probabilities of occurrence of the words of a text to estimate the probability of occurrence of the text as a function of its feature, which in this case is the date on which the text was written. The application and results of our methods on the DEEDS documents are presented.
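A toy sketch of the second method's ingredients (illustrative only; local-constant Nadaraya-Watson smoothing stands in here for the local polynomial GLM fit described above, and the word-independence assumption is ours):

```python
# Minimal sketch: kernel-smoothed estimate of a word's occurrence probability as a
# function of document date, combined across words to score candidate dates.
import numpy as np

rng = np.random.default_rng(2)
dates = rng.uniform(1150, 1300, 400)                         # known document dates
# toy indicator data: does word w appear in document i? (2 words here)
p_true = np.stack([1 / (1 + np.exp(-(dates - 1200) / 20)),   # word 0 becomes common over time
                   1 / (1 + np.exp((dates - 1220) / 15))])    # word 1 falls out of use
occurs = (rng.uniform(size=p_true.shape) < p_true).astype(float)

def occ_prob(word, t, h=15.0):
    """Nadaraya-Watson estimate of P(word appears | date = t) with a Gaussian kernel."""
    w = np.exp(-0.5 * ((dates - t) / h) ** 2)
    return np.clip(np.sum(w * occurs[word]) / np.sum(w), 1e-6, 1 - 1e-6)

def log_score(presence, t):
    """Log-probability of an undated document's word-presence pattern at candidate
    date t, naively treating words as independent."""
    return sum(np.log(occ_prob(w, t)) if present else np.log(1 - occ_prob(w, t))
               for w, present in enumerate(presence))

grid = np.arange(1150, 1301, 5)
scores = [log_score([1, 0], t) for t in grid]                # undated doc: has word 0, lacks word 1
print("estimated date:", grid[int(np.argmax(scores))])
```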
149

Neighborhood-oriented feature selection and classification of Duke's stages of colorectal cancer using high-density genomic data

Peng, Liang January 1900 (has links)
Master of Science / Department of Statistics / Haiyan Wang / The selection of relevant genes for classifying disease phenotypes from gene expression data has been extensively studied. Previously, most relevant gene selection was conducted on individual genes with limited sample sizes. Modern technology makes it possible to obtain microarray data with higher resolution of the chromosomes. Considering gene sets on an entire block of a chromosome, rather than individual genes, could help reveal important connections between relevant genes and the disease phenotypes. In this report, we consider feature selection and classification that take into account the spatial location of probe sets in classifying Duke's stages B and C using DNA copy number data or gene expression data from colorectal cancers. A novel method for feature selection is presented in this report. A chromosome was first partitioned into blocks after the probe sets were aligned along their chromosome locations. A test of interaction between Duke's stage and probe sets was then conducted on each block of probe sets to select significant blocks. For each significant block, a new multiple comparison procedure was carried out to identify truly relevant probe sets while preserving the neighborhood location information of the probe sets. Support Vector Machine (SVM) and K-Nearest Neighbor (KNN) classification using the selected final probe sets was conducted for all samples, and the Leave-One-Out Cross-Validation (LOOCV) estimate of accuracy is reported as an evaluation of the selected features. We applied the method to two large data sets, each containing more than 50,000 features. Excellent classification accuracy was achieved by the proposed procedure with SVM or KNN for both data sets, even though classification of prognosis stages (Duke's stages B and C) is much more difficult than classification of normal versus tumor samples.
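A minimal sketch of the classification step (assumed data shapes and tuning choices, not the report's code), using LOOCV to estimate SVM and KNN accuracy on an already-selected feature subset:

```python
# Minimal sketch: LOOCV accuracy of SVM and KNN classifiers on a subset of
# already-selected probe-set features.
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(3)
n_samples, n_selected = 60, 40                    # sizes are illustrative only
X = rng.normal(size=(n_samples, n_selected))      # stand-in for selected probe sets
y = rng.integers(0, 2, n_samples)                 # stand-in for Duke's stage B vs. C

loo = LeaveOneOut()
for name, clf in [("SVM", SVC(kernel="linear")),
                  ("KNN", KNeighborsClassifier(n_neighbors=5))]:
    acc = cross_val_score(clf, X, y, cv=loo).mean()   # LOOCV estimate of accuracy
    print(f"{name} LOOCV accuracy: {acc:.3f}")
```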
150

Statistical identification of metabolic reactions catalyzed by gene products of unknown function

Zheng, Lianqing January 1900 (has links)
Doctor of Philosophy / Department of Statistics / Gary L. Gadbury / High-throughput metabolite analysis is an approach used by biologists seeking to identify the functions of genes. A mutation in a gene encoding an enzyme is expected to alter the level of the metabolites that serve as the enzyme's reactant(s) (also known as substrates) and product(s). To find the function of a mutated gene, metabolite data from a wild-type organism and a mutant are compared and candidate reactants and products are identified. The screening principle is that the concentration of reactants will be higher and the concentration of products will be lower in the mutant than in the wild type, because the mutation reduces the reaction between the reactant and the product in the mutant organism. Based upon this principle, we suggest a method to screen the possible lipid reactant and product pairs related to a mutation affecting an unknown reaction. Some numerical facts are established for the treatment means of the lipid pairs in each treatment group, and relations between the means are found for the paired lipids. A set of statistics based on the relations between the means of the lipid pairs is derived. Reactant and product lipid pairs associated with specific mutations are used to assess the results. We explore four methods that use the test statistics to obtain a list of potential reactant-product pairs affected by the mutation. The first method uses the parametric bootstrap to obtain an empirical null distribution of the test statistic, together with a technique to identify a family of distributions and corresponding parameter estimates for modeling the null distribution. The second method uses a mixture of normal distributions to model the empirical bootstrap null. The third method uses a normal mixture model with multiple components to model the entire distribution of test statistics from all pairs of lipids; the argument is made that, in some cases, one of the model components corresponds to lipid pairs affected by the mutation while the other components model the null distribution. The fourth method uses a two-way ANOVA model with an interaction term to find the relations between the mean concentrations and the role of a lipid as a reactant or product in a specific lipid pair. The goal of all methods is to identify a list of findings using false discovery rate techniques. Finally, a simulation technique is proposed to evaluate properties of statistical methods for identifying candidate reactant-product pairs.
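A minimal sketch in the spirit of the first method combined with the final false-discovery step (the test statistic, the fitted null family, and all sizes here are illustrative assumptions, not the dissertation's choices):

```python
# Minimal sketch: parametric bootstrap null distribution for a per-pair test
# statistic, with Benjamini-Hochberg control of the false discovery rate.
import numpy as np

rng = np.random.default_rng(4)
n_pairs, n_rep, n_boot = 500, 6, 2000

def stat(wild, mutant):
    """Per-pair statistic: mutant-minus-wild-type mean difference (a stand-in
    for the statistics derived from relations between treatment means)."""
    return mutant.mean() - wild.mean()

# toy data: most pairs unaffected; the first 25 pairs shifted by the "mutation"
wild = rng.normal(0, 1, (n_pairs, n_rep))
mutant = rng.normal(0, 1, (n_pairs, n_rep))
mutant[:25] += 1.5

obs = np.array([stat(wild[i], mutant[i]) for i in range(n_pairs)])

# parametric bootstrap null: resample both groups from one fitted normal per pair
null = np.empty((n_pairs, n_boot))
for i in range(n_pairs):
    pooled = np.concatenate([wild[i], mutant[i]])
    mu, sd = pooled.mean(), pooled.std(ddof=1)
    sims_w = rng.normal(mu, sd, (n_boot, n_rep))
    sims_m = rng.normal(mu, sd, (n_boot, n_rep))
    null[i] = sims_m.mean(axis=1) - sims_w.mean(axis=1)

pvals = (np.abs(null) >= np.abs(obs)[:, None]).mean(axis=1)   # two-sided bootstrap p-values

# Benjamini-Hochberg step-up procedure at FDR level q
q = 0.05
order = np.argsort(pvals)
thresh = q * np.arange(1, n_pairs + 1) / n_pairs
passed = np.nonzero(pvals[order] <= thresh)[0]
discoveries = order[: passed.max() + 1] if passed.size else np.array([], dtype=int)
print(f"{discoveries.size} candidate reactant-product pairs flagged")
```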
