91.
On the Model Selection in a Frailty Setting. Lundell, Jill F., 01 May 1998.
When analyzing data in a survival setting, whether of people or objects, one of the assumptions made is that the population is homogeneous. This is not true in reality, and adjustments can be made in the model to account for heterogeneity. Frailty is one method of dealing with some of this heterogeneity. Frailty cannot be measured directly, so it can be very difficult to determine which frailty model is appropriate for the data of interest. This thesis investigates the effectiveness of three model selection methods in determining which frailty distribution best describes a given set of data: the Bayes factor, neural networks, and classification trees. Results favored classification trees; neural networks performed very poorly.
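A minimal R sketch of the kind of comparison involved, using the survival package's frailty terms on its bundled kidney data; it only illustrates fitting competing frailty specifications, not the Bayes factor, neural network, or classification tree selection procedures studied in the thesis.

```r
# Fit the same Cox model under two frailty distributions and inspect the fits.
# This shows the model-choice problem; it is not the thesis's selection method.
library(survival)

data(kidney, package = "survival")

fit_gamma <- coxph(Surv(time, status) ~ age + sex + frailty.gamma(id),
                   data = kidney)
fit_gauss <- coxph(Surv(time, status) ~ age + sex + frailty.gaussian(id),
                   data = kidney)

fit_gamma
fit_gauss   # which frailty distribution describes the data better?
```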
92.
Adaptive Density Estimation Based on the Mode Existence Test. Jawhar, Nizar Sami, 01 May 1996.
The kernel estimator persists as the most useful tool for density estimation. Although fixed kernel estimates have generally proven superior to those of available variable kernel estimators, Minnotte's mode tree and mode existence test offer new hope of producing a useful adaptive kernel estimator that succeeds where fixed kernel methods fail. Such an estimator improves on the fixed kernel for multimodal distributions in which the modes differ in size and in degree of separation, conditions that present a serious challenge to even the best fixed kernel density estimators. Capitalizing on Minnotte's work on detecting multimodality adaptively, we found it possible to determine the bandwidth h adaptively in an original fashion and to estimate normal mixtures adaptively using the normal kernel, with encouraging results.
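For readers unfamiliar with the fixed/adaptive distinction, the R sketch below contrasts a fixed-bandwidth estimate with a crude Abramson-style adaptive one on a mixture whose modes have very unequal spread. It only illustrates why per-point bandwidths help; it does not implement Minnotte's mode tree or mode existence test.

```r
# Fixed versus (crudely) adaptive kernel density estimation on a two-component
# normal mixture with unequal mode sizes.
set.seed(1)
x <- c(rnorm(400, mean = 0, sd = 1), rnorm(100, mean = 6, sd = 0.25))

# Fixed-bandwidth estimate
fixed <- density(x, bw = "nrd0")

# Abramson-style adaptive bandwidths: h_i proportional to 1/sqrt(pilot density)
pilot <- approx(fixed$x, fixed$y, xout = x)$y
h_i   <- fixed$bw * sqrt(exp(mean(log(pilot))) / pilot)
grid  <- seq(min(x) - 1, max(x) + 1, length.out = 512)
adapt <- sapply(grid, function(g) mean(dnorm(g, mean = x, sd = h_i)))

plot(fixed, main = "Fixed (line) vs adaptive (points) kernel estimate")
points(grid, adapt, pch = ".", col = "red")
```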
93.
Tuning Hyperparameters in Supervised Learning Models and Applications of Statistical Learning in Genome-Wide Association Studies with Emphasis on Heritability. Lundell, Jill F., 01 August 2019.
Machine learning is a buzzword that has inundated popular culture in the last few years. The term refers to computer methods that learn and improve from data automatically instead of being explicitly programmed at every step. Investigations into the best ways to create and use these methods are prevalent in research. Machine learning models can be difficult to create because they must be tuned. This dissertation explores the characteristics of tuning three popular machine learning models and develops a way to automatically select a set of tuning parameters. This work was used to create an R software package called EZtune that can automatically tune three widely used machine learning algorithms: support vector machines, gradient boosting machines, and AdaBoost.
The second portion of this dissertation investigates the use of machine learning methods for finding locations along a genome that are associated with a trait. The performance of methods commonly used for these types of studies, and of some that are not, is assessed using simulated data. The effect of the strength of the relationship between the genetic code and the trait is of particular interest; this strength turned out to be the most important factor in the efficacy of each method.
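As a hedged usage sketch, the call below shows roughly how the EZtune package described above might be invoked on simulated binary-response data; the eztune() function name and its method/fast arguments are quoted from memory and should be checked against the package documentation.

```r
# Sketch of automatic tuning with EZtune on simulated data.
# Argument names are assumptions; consult the package manual for the real interface.
library(EZtune)

set.seed(42)
n <- 300
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- as.numeric(x$x1 + 0.5 * x$x2 + rnorm(n) > 0)

fit <- eztune(x, y, method = "svm", fast = TRUE)   # tuned support vector machine
str(fit)   # tuned parameters and resampled accuracy (assumed contents)
```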
94.
The "Fair" Triathlon: Equating Standard Deviations Using Non-Linear Bayesian ModelsCurtis, Steven McKay 14 May 2004 (has links) (PDF)
The Ironman triathlon was created in 1978 by combining the longest race distances then contested in Hawaii in swimming, cycling, and running. The Half Ironman triathlon uses half the distance of each Ironman event. The Olympic-distance triathlon was created by combining the longest race distances sanctioned by the major federations for swimming, cycling, and running. The relative importance of each event to the overall race outcome was not considered when the distances of modern triathlons were set, and there is a general belief among triathletes that the swimming portion of the standard-distance triathlons is underweighted. We present a nonlinear Bayesian model for triathlon finishing times that models the mean and standard deviation of time as a function of distance. We use this model to create "fair" triathlons by equating the standard deviations of the times taken to complete the swimming, cycling, and running events. In these "fair" triathlons, a one-standard-deviation improvement in any event has an equivalent impact on overall race time.
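A toy R sketch of the equal-standard-deviation idea: if the spread of split times grows with distance roughly as sd(d) = a * d^b for each sport, leg distances can be chosen so the three standard deviations match. The coefficients below are made-up placeholders, not estimates from the thesis, whose actual model is a Bayesian nonlinear model for finishing times.

```r
# Solve for swim and bike distances whose time sd matches that of a 10 km run,
# under a placeholder power-law sd(d) = a * d^b for each sport.
sd_fun <- function(d, a, b) a * d^b

pars <- list(swim = c(a = 8.0, b = 0.9),   # hypothetical coefficients (minutes, km)
             bike = c(a = 1.2, b = 1.0),
             run  = c(a = 3.0, b = 1.0))

target <- sd_fun(10, pars$run["a"], pars$run["b"])          # sd of the 10 km run

equate <- function(p, target) unname((target / p["a"])^(1 / p["b"]))
c(swim_km = equate(pars$swim, target),
  bike_km = equate(pars$bike, target))   # distances giving the same sd, under these toy numbers
```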
95.
Determining the Optimum Number of Increments in Composite Sampling. Hathaway, John Ellis, 20 May 2005.
Composite sampling can be more cost-effective than simple random sampling. This paper considers how to determine the optimum number of increments to use in composite sampling. Composite sampling terminology and theory are outlined, and a model is developed that accounts for the different sources of variation in compositing and data analysis. This model is used to define and understand the process of determining the optimum number of increments to use in forming a composite. The blending variance is shown to have a smaller range of possible values than previously reported when estimating the number of increments in a composite sample, and accounting for differing levels of the blending variance significantly affects the estimated number of increments.
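A small R sketch of one common variance formulation for composites, in which between-increment variance averages down as 1/n while blending and analytical variance do not; it shows how an "optimum" number of increments arises, but it is not necessarily the exact model developed in this paper, and the variances are illustrative.

```r
# Total variance of a composite formed from n increments under a simple,
# assumed model (not necessarily the paper's exact formulation).
total_var <- function(n, var_increment, var_blend, var_analysis) {
  var_increment / n + var_blend + var_analysis
}

n <- 1:50
v <- total_var(n, var_increment = 1, var_blend = 0.3, var_analysis = 0.2)

# Smallest n whose variance is within 10% of the large-n floor:
floor_var <- total_var(Inf, 1, 0.3, 0.2)
min(n[v <= 1.10 * floor_var])
```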
96.
Computation of Weights for Probabilistic Record Linkage Using the EM Algorithm. Bauman, G. John, 29 June 2006.
Record linkage is the process of combining information about a single individual from two or more records. Probabilistic record linkage assigns a weight to each field that is compared; whether two records should be linked is then determined by the sum of the weights, or “score,” over all compared fields. Using methods similar to the simple-versus-simple most powerful test, an optimal record linkage decision rule can be established that minimizes the number of unlinked records when the probabilities of false positive and false negative errors are specified. Computing the weights needed for probabilistic record linkage ordinarily requires hand-linking a “training” subset of records, which is not practical in many settings because hand matching requires a considerable time investment. In 1989, Matthew A. Jaro demonstrated how the Expectation-Maximization (EM) algorithm can be used to compute the needed weights when fields have binomial matching possibilities. This project applies the EM approach to calculate weights for head-of-household records from the 1910 and 1920 censuses for Ascension Parish, Louisiana, and for church and county records from Perquimans County, North Carolina. It also extends Jaro's EM algorithm to a multinomial framework. The performance of the EM algorithm is assessed by comparing its computed weights to weights obtained by clerical matching, and simulations investigate the algorithm's sensitivity to the total number of record pairs, the number of fields with missing entries, the starting values of the estimated probabilities, and the convergence epsilon value.
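The E- and M-steps involved are compact enough to sketch. The R code below runs a Fellegi-Sunter style EM on simulated binary field comparisons and converts the resulting m- and u-probabilities into log2 agreement weights; the field count, match rate, and starting values are illustrative only.

```r
# EM for probabilistic record linkage with binary field comparisons (a sketch).
set.seed(7)
K <- 4                                   # compared fields
N <- 5000                                # candidate record pairs
true_match <- rbinom(N, 1, 0.1)
m_true <- c(0.95, 0.90, 0.85, 0.80)      # P(field agrees | true match)
u_true <- c(0.10, 0.05, 0.20, 0.02)      # P(field agrees | non-match)
agree <- sapply(1:K, function(k)
  rbinom(N, 1, ifelse(true_match == 1, m_true[k], u_true[k])))

p <- 0.5; m <- rep(0.8, K); u <- rep(0.3, K)   # starting values
for (iter in 1:100) {
  # E-step: posterior probability that each pair is a true match
  ll_m <- drop(agree %*% log(m) + (1 - agree) %*% log(1 - m))
  ll_u <- drop(agree %*% log(u) + (1 - agree) %*% log(1 - u))
  g    <- p * exp(ll_m) / (p * exp(ll_m) + (1 - p) * exp(ll_u))
  # M-step: update the match rate and the m- and u-probabilities
  p <- mean(g)
  m <- colSums(g * agree) / sum(g)
  u <- colSums((1 - g) * agree) / sum(1 - g)
}
weights <- log2(m / u)                   # per-field agreement weights
round(weights, 2)
```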
97.
Sources of Variability in a Proteomic Experiment. Crawford, Scott Daniel, 11 August 2006.
Proteomics holds the hope of detecting serious diseases earlier than is currently possible by analyzing blood samples in a mass spectrometer. Unfortunately, the statistics involved in comparing a control group to a diseased group are not trivial, and these difficulties have led others to incorrect decisions in the past. This paper considers a nested design that was used to quantify and identify the sources of variation in the mass spectrometer at BYU, so that correct conclusions can be drawn from blood samples analyzed in proteomics. Algorithms were developed that detect, align, correct, and cluster the peaks in this experiment. The variation in the m/z values and in the intensities was studied, and the nested nature of the design allowed us to estimate the sources of that variation. The variation due to the machine components, including the mass spectrometer itself, was much greater than the variation due to the preprocessing steps. This conclusion motivates future studies to investigate which of the machine steps causes the most variation.
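A sketch of how variance components for a nested design can be estimated with a random-effects model in R using lme4; the grouping names and simulated effect sizes are hypothetical stand-ins, not the BYU study's actual factors or results.

```r
# Estimate variance components for a toy nested design: fractions within
# samples within days.  Effect sizes are arbitrary illustration values.
library(lme4)

set.seed(3)
d <- expand.grid(day = 1:5, sample = 1:4, fraction = 1:3, rep = 1:2)
d$day      <- factor(d$day)
d$sample   <- interaction(d$day, d$sample)      # make the nesting explicit
d$fraction <- interaction(d$sample, d$fraction)
d$y <- rnorm(nrow(d), sd = 0.2) +                          # residual noise
       rnorm(nlevels(d$fraction), sd = 0.5)[d$fraction] +  # fraction effects
       rnorm(nlevels(d$sample),   sd = 1.0)[d$sample]   +  # sample effects
       rnorm(nlevels(d$day),      sd = 2.0)[d$day]         # day effects

fit <- lmer(y ~ 1 + (1 | day) + (1 | sample) + (1 | fraction), data = d)
VarCorr(fit)   # estimated variance component at each level of the design
```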
98.
A Simulation-Based Approach for Evaluating Gene Expression Analyses. Pendleton, Carly Ruth, 17 March 2007.
Microarrays enable biologists to measure differences in gene expression across thousands of genes simultaneously. The data produced by microarrays present a statistical challenge, one that has been met both by modifications of existing methods and by completely new approaches. One difficulty with any new approach to microarray analysis is validating the method's power and sensitivity. A simulation study could provide such validation by simulating gene expression data and investigating the method's response to changes in the data; however, because of the complex dependencies and interactions found in gene expression data, such a simulation would be complicated and time-consuming. This thesis proposes a way to simulate gene expression data and validate a method by borrowing information from existing data. Analogous to the spike-in technique used to validate expression levels on an array, this simulation-based approach adds a simulated gene with known features to an existing data set. Analysis of the appended data set then reveals aspects of the method's sensitivity and power. The method and data on which this technique is illustrated come from Storey et al. (2005).
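A condensed R sketch of this spike-in style check: append a simulated gene with a known group difference to an existing expression matrix, run the analysis, and see where the spiked gene ranks. Here the "analysis" is a plain per-gene t-test and the matrix is random noise standing in for real data; the thesis applies the idea to the method and data of Storey et al. (2005).

```r
# Spike a simulated gene with a known effect into an expression matrix and
# check whether a simple per-gene test recovers it.
set.seed(11)
genes <- matrix(rnorm(1000 * 12), nrow = 1000)             # stand-in for real data
group <- rep(c("A", "B"), each = 6)

spike <- rnorm(12, mean = ifelse(group == "B", 1.5, 0))    # known 1.5 shift in group B
dat   <- rbind(genes, spike)

pvals <- apply(dat, 1, function(g) t.test(g ~ group)$p.value)
rank(pvals)[nrow(dat)]   # where the spiked gene lands among all genes (smaller is better)
```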
99.
Development of Informative Priors in Microarray Studies. Fronczyk, Kassandra M., 19 July 2007.
Microarrays measure the abundance of DNA transcripts for thousands of gene sequences simultaneously, facilitating genomic comparisons across tissue types or disease statuses. These experiments are used to understand fundamental aspects of growth and development and to explore the underlying genetic causes of many diseases. The data from most microarray studies are available in open-access online databases. Bayesian models are ideal for the analysis of microarray data because of their ability to integrate prior information; however, most current Bayesian analyses use empirical or flat priors. We present a Perl script that builds an informative prior by mining online databases for similar microarray experiments. Four prior distributions are investigated: a power prior incorporating information from multiple previous experiments, an informative prior using information from one previous experiment, an empirically estimated prior, and a flat prior. The method is illustrated with a two-sample experiment to determine the preferential regulation of genes by tamoxifen in breast cancer cells.
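A conjugate-normal sketch of the power prior mentioned above, for a single gene's log expression ratio with known variance: the historical likelihood is raised to a power a0 in [0, 1] before being combined with the current data. The numbers are simulated placeholders; the thesis builds its priors by mining public microarray databases.

```r
# Power prior in a normal model with known variance: historical data
# contribute a0 * n_hist effective observations (assumed simple setting).
power_prior_posterior <- function(y_new, y_hist, a0, sigma2 = 1) {
  n_new  <- length(y_new)
  n_hist <- length(y_hist)
  post_prec <- (n_new + a0 * n_hist) / sigma2
  post_mean <- (sum(y_new) + a0 * sum(y_hist)) / (n_new + a0 * n_hist)
  c(mean = post_mean, sd = sqrt(1 / post_prec))
}

set.seed(9)
y_hist <- rnorm(20, mean = 0.8)   # earlier microarray experiment (simulated)
y_new  <- rnorm(10, mean = 0.5)   # current experiment (simulated)
rbind(full_borrow = power_prior_posterior(y_new, y_hist, a0 = 1),
      half_borrow = power_prior_posterior(y_new, y_hist, a0 = 0.5),
      no_borrow   = power_prior_posterior(y_new, y_hist, a0 = 0))
```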
100.
Statistical Considerations in Designing for Biomarker Detection. Pulsipher, Trenton C., 16 July 2007.
The purpose of this project is to develop a statistical method for the rapid detection of biological agents using portable gas chromatography mass spectrometry (GC/MS) devices. Of particular interest is 2,6-pyridinedicarboxylic acid (dipicolinic acid, or DPA), a molecule present at high concentrations in spores of Clostridium and Bacillus, the latter of which includes the threat organism Bacillus anthracis, or anthrax. Dipicolinic acid may be useful as a first-step discriminator of the biological warfare agent B. anthracis. The results of experiments with B. anthracis Sterne strain and Bacillus thuringiensis spores lead to a conceptual model for the chemical phenomena believed to occur among calcium, DPA and its esters, water, acid, and alkali during treatment of spores by a novel analytical procedure. The hypothesized chemical model is tested with a compound study in the form of a mixture experiment.
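A brief R sketch of a Scheffé quadratic mixture model of the kind used in mixture experiments like the one described; the component names, design points, and simulated response are placeholders, not the actual DPA chemistry study.

```r
# Fit a Scheffé quadratic mixture model to a simplex-centroid style design
# in three components that sum to one (toy illustration).
set.seed(5)
design <- rbind(c(1, 0, 0), c(0, 1, 0), c(0, 0, 1),
                c(.5, .5, 0), c(.5, 0, .5), c(0, .5, .5),
                c(1/3, 1/3, 1/3))
colnames(design) <- c("acid", "water", "alkali")
d <- as.data.frame(design[rep(1:7, each = 2), ])           # duplicate runs
d$y <- with(d, 3 * acid + 1 * water + 2 * alkali + 4 * acid * water) +
       rnorm(nrow(d), sd = 0.1)

# Scheffé quadratic model: no intercept, linear blending terms plus pairwise synergies
fit <- lm(y ~ 0 + acid + water + alkali + acid:water + acid:alkali + water:alkali,
          data = d)
summary(fit)
```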