321. Inequalities and equalities associated with ordinary least squares and generalized least squares in partitioned linear models. Chu, Ka Lok (January 2004).
The motivation for this thesis is the paper by Paul L. Canner [The American Statistician, vol. 23, no. 5, pp. 39-40 (1969)], which noted that in simple linear regression the generalized least squares regression line can lie entirely above or entirely below all of the observed data points.
Chapter I builds on the observation that in Canner's model the ordinary least squares and generalized least squares regression lines are parallel. This observation led us to introduce a new measure of efficiency of ordinary least squares and to find conditions under which the total Watson efficiency of ordinary least squares in a partitioned linear model exceeds or falls below the product of the two subset Watson efficiencies, i.e., the product of the Watson efficiencies associated with the two subsets of parameters in the underlying partitioned linear model.
We introduce the notions of generalized efficiency function, efficiency factorization multiplier, and determinantal covariance ratio, and obtain several inequalities and equalities. We give special attention to those partitioned linear models for which the total Watson efficiency of ordinary least squares equals the product of the two subset Watson efficiencies. A key characterization involves the equality between the squares of a certain partial correlation coefficient and its associated ordinary correlation coefficient.
In Chapters II and IV we suppose that the underlying partitioned linear model is weakly singular, in that the column space of the model matrix is contained in the column space of the covariance matrix of the errors. In Chapter III our results are specialized to partitioned linear models in which the partitioning is orthogonal and the covariance matrix of the errors is positive definite.
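As a rough numerical illustration of these quantities, the Python sketch below computes a total Watson efficiency and two subset efficiencies for a simulated partitioned model. It assumes the Watson efficiency is the ratio of the generalized variances (determinants of the covariance matrices) of the GLS and OLS estimators, and it takes the subset efficiencies from the corresponding diagonal blocks; the design matrix and the AR(1)-type error covariance are invented for illustration and are not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative partitioned design X = [X1 | X2] and an AR(1)-type error
# covariance V (these particular choices are not from the thesis).
n = 50
X1 = np.column_stack([np.ones(n), np.linspace(0.0, 1.0, n)])
X2 = rng.normal(size=(n, 2))
X = np.hstack([X1, X2])
rho = 0.6
V = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

# Covariance matrices (up to sigma^2) of the OLS and GLS estimators of beta.
XtX_inv = np.linalg.inv(X.T @ X)
cov_ols = XtX_inv @ X.T @ V @ X @ XtX_inv
cov_gls = np.linalg.inv(X.T @ np.linalg.solve(V, X))

def watson_eff(cov_gls_block, cov_ols_block):
    """Watson efficiency: ratio of generalized variances, a number in (0, 1]."""
    return np.linalg.det(cov_gls_block) / np.linalg.det(cov_ols_block)

p1 = X1.shape[1]
total = watson_eff(cov_gls, cov_ols)
subset1 = watson_eff(cov_gls[:p1, :p1], cov_ols[:p1, :p1])
subset2 = watson_eff(cov_gls[p1:, p1:], cov_ols[p1:, p1:])
print(total, subset1 * subset2)   # equal only in special partitioned models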
322. Essays on Causal Inference in Randomized Experiments. Lin, Winston (11 October 2013).
This dissertation explores methodological topics in the analysis of randomized experiments, with a focus on weakening the assumptions of conventional models.
Chapter 1 gives an overview of the dissertation, emphasizing connections with other areas of statistics (such as survey sampling) and other fields (such as econometrics and psychometrics).
Chapter 2 reexamines Freedman's critique of ordinary least squares regression adjustment in randomized experiments. Using Neyman's model for randomization inference, Freedman argued that adjustment can lead to worsened asymptotic precision, invalid measures of precision, and small-sample bias. This chapter shows that in sufficiently large samples, those problems are minor or easily fixed. OLS adjustment cannot hurt asymptotic precision when a full set of treatment-covariate interactions is included. Asymptotically valid confidence intervals can be constructed with the Huber-White sandwich standard error estimator. Checks on the asymptotic approximations are illustrated with data from a randomized evaluation of strategies to improve college students' achievement. The strongest reasons to support Freedman's preference for unadjusted estimates are transparency and the dangers of specification search.
Chapter 3 extends the discussion and analysis of the small-sample bias of OLS adjustment. The leading term in the bias of adjustment for multiple covariates is derived and can be estimated empirically, as was done in Chapter 2 for the single-covariate case. Possible implications for choosing a regression specification are discussed.
Chapter 4 explores and modifies an approach suggested by Rosenbaum for analysis of treatment effects when the outcome is censored by death. The chapter is motivated by a randomized trial that studied the effects of an intensive care unit staffing intervention on length of stay in the ICU. The proposed approach estimates effects on the distribution of a composite outcome measure based on ICU mortality and survivors' length of stay, addressing concerns about selection bias by comparing the entire treatment group with the entire control group. Strengths and weaknesses of possible primary significance tests (including the Wilcoxon-Mann-Whitney rank sum test and a heteroskedasticity-robust variant due to Brunner and Munzel) are discussed and illustrated.
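As a small illustration of the Chapter 2 recipe (OLS adjustment with a full set of treatment-covariate interactions, paired with a Huber-White sandwich variance estimator), here is a hedged Python sketch using statsmodels; the simulated outcome, covariate, and effect sizes are hypothetical stand-ins for the real evaluation data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Simulated experiment (hypothetical data, not the college-achievement study).
n = 500
x = rng.normal(size=n)                      # baseline covariate
z = rng.binomial(1, 0.5, size=n)            # randomized treatment indicator
y = 1.0 + 0.5 * z + 0.8 * x + 0.3 * z * x + rng.normal(size=n)

df = pd.DataFrame({"y": y, "z": z, "xc": x - x.mean()})  # center the covariate

# OLS adjustment with treatment-covariate interactions and a Huber-White
# (sandwich) variance estimator; with a centered covariate, the coefficient
# on z estimates the average treatment effect.
fit = smf.ols("y ~ z * xc", data=df).fit(cov_type="HC2")
print(fit.params["z"], fit.bse["z"])   # ATE estimate and robust standard error
```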
323. Error detection and data smoothing based on local procedures. Ortiz, Victor Manuel Guerra (January 1974).
This thesis presents an algorithm which is able to locate isolated bad points and correct them without contaminating the rest of the good data. This work has been greatly influenced and motivated by what is currently done in the manual loft. It is not within the scope of this work to handle small random errors characteristic of a noisy system, and it is therefore assumed that the bad points are isolated and relatively few when compared with the total number of points.
Motivated by the desire to imitate the loftsman, we conducted a visual experiment to determine what is considered smooth data by most people. This criterion is used to determine how much the data should be smoothed and to prove that our method produces such data. The method ultimately converges to a set of points that lies on the polynomial interpolating the first and last points; however, convergence to such a set is definitely not the purpose of our algorithm. The proof of convergence is necessary to demonstrate that oscillation does not take place and that in a finite number of steps the method produces a set as smooth as desired.
The amount of work for the method described here is of order n. The one dimensional and two dimensional cases are treated in detail; the theory can be readily extended to higher dimensions.
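The abstract does not reproduce the algorithm itself, so the Python sketch below is only a generic illustration of a local procedure in the same spirit: each interior point is compared with a low-degree polynomial fitted through its neighbours alone, and is replaced only when it disagrees strongly, so good data are left untouched. The window size, polynomial degree, and threshold are arbitrary illustrative choices, not the values used in the thesis.

```python
import numpy as np

def correct_isolated_errors(x, y, window=3, degree=3, thresh=5.0):
    """Flag and correct isolated bad points by comparing each value with a
    local polynomial fitted through its neighbours (the point itself excluded).
    A rough illustration of a local procedure, not the thesis algorithm."""
    y = y.astype(float)
    n = len(y)
    for i in range(window, n - window):
        idx = np.r_[i - window:i, i + 1:i + window + 1]   # neighbours only
        coef = np.polyfit(x[idx], y[idx], degree)
        pred = np.polyval(coef, x[i])
        resid_scale = np.std(y[idx] - np.polyval(coef, x[idx])) + 1e-12
        if abs(y[i] - pred) > thresh * resid_scale:        # isolated outlier?
            y[i] = pred                                     # replace it
    return y

x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x)
y[17] += 1.5                                # one isolated bad point
corrected = correct_isolated_errors(x, y)
print(np.argmax(np.abs(corrected - y)))     # the corrected point is index 17
```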
324. A study of projection pursuit methods (multivariate statistics, dimension reduction, density estimation, graphics, entropy). Jee, James Rodney (January 1985).
A standard method for analyzing high-dimensional multivariate data is to view scatterplots of 2-dimensional projections of the data. Since not all projections are equally informative and the number of substantially different 2-dimensional projections of a high-dimensional space can be large, there is a need for computer algorithms that automatically determine the most informative projections for viewing.
When the data are assumed to be a sample from a population density, it is natural to measure the information content of a projection by evaluating the Shannon entropy or the Fisher information of the marginal density corresponding to the projection. Because the population density is unknown, the techniques of nonparametric probability density estimation can be employed to estimate it, thereby providing a means of extracting a well-known measure of information from a projection of a sample.
A theoretical study of algorithms based on these ideas suggests that Fisher information is a slightly better measure of information for use in projection pursuit. Calculation of both Shannon entropy and Fisher information measures in data-based algorithms is based on computationally efficient oversmoothed histograms. Application of the algorithms to real data sets reveals that these methods are very promising.
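As a toy version of the idea, the Python sketch below scores candidate one-dimensional projections by the Shannon entropy of a histogram estimate of the standardized projected data, using an oversmoothed bin-count rule, and treats the least-Gaussian (lowest-entropy) standardized projection as the most informative. The simulated data, the random direction search, and this particular entropy convention are illustrative assumptions, not the thesis algorithms.

```python
import numpy as np

rng = np.random.default_rng(2)

def projection_entropy(data, direction, bins):
    """Shannon entropy (nats) of the standardized 1-D projection, computed
    from a histogram density estimate."""
    z = data @ direction
    z = (z - z.mean()) / z.std()
    counts, edges = np.histogram(z, bins=bins)
    p = counts / counts.sum()
    widths = np.diff(edges)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz] / widths[nz]))

# Hypothetical 3-dimensional sample with a two-cluster structure along axis 0.
n, d = 1000, 3
data = rng.normal(size=(n, d))
data[: n // 2, 0] += 4.0
bins = int(np.ceil((2 * n) ** (1 / 3)))        # oversmoothed bin-count rule

# Crude projection pursuit: among many random directions, keep the one whose
# standardized projection has the lowest entropy (looks least Gaussian).
candidates = [v / np.linalg.norm(v) for v in rng.normal(size=(500, d))]
best = min(candidates, key=lambda v: projection_entropy(data, v, bins))
print(np.round(best, 2))    # should load mainly on axis 0, the cluster direction
```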
325. Locally-adaptive polynomial-smoothed histograms with application to massive and pre-binned data sets. Papkov, Galen I. (January 2008).
Data-driven research is often hampered by privacy restrictions in the form of limited datasets or graphical representations without the benefit of actual data. This dissertation has developed a variety of nonparametric techniques that circumvent these issues by using local moment information. Scott and Sagae's local moment non-parametric density estimation method, namely the smoothed polynomial histogram, provides a solid foundation for this research. This method utilizes binned data and their sample moments, such as the bin areas, means, and variances, in order to estimate the underlying distribution of the data via polynomial splines. The optimization problem does not account for the differing amounts of data across bins. More emphasis or trust should be placed on lower-order moments as they tend to be more accurate than higher-order moments. Hence, to ensure fidelity to the data and its local sample moments, this research has incorporated a weight matrix into the optimization problem.
An alternative to the weighted smoothed polynomial histogram is the penalized smoothed polynomial histogram, which is similar to the smoothed polynomial histogram but with a difference penalty on the coefficients. This type of penalty is simple to implement, yields equivalent if not better results than the smoothed polynomial histogram, and can also benefit from the inclusion of a weight matrix. Advancement has also been achieved by extending the smoothed polynomial histogram to higher dimensions via tensor products of B-splines. In addition to density estimation, these nonparametric techniques can be used to conduct bump hunting and change-point analysis. Future work will explore the effects of adaptive meshes, automatic knot selection, and higher-order derivatives in the penalty on the quality of these local-moment density estimators.
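A minimal sketch of the weighted, difference-penalized idea is given below in Python (NumPy plus SciPy 1.8+ for the B-spline design matrix). It fits a cubic spline to bin-average density values by weighted least squares, with weights proportional to the bin counts and a second-order difference penalty on the coefficients; the simulated binned data, knot layout, penalty order, and smoothing parameter are illustrative assumptions rather than the estimators developed in the dissertation.

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(3)

# Pre-binned data: suppose only bin edges and counts are released, not raw data.
raw = rng.gamma(shape=3.0, scale=1.0, size=2000)
edges = np.linspace(0.0, 12.0, 25)
counts, _ = np.histogram(raw, bins=edges)
mids = 0.5 * (edges[:-1] + edges[1:])
width = edges[1] - edges[0]
dens = counts / (counts.sum() * width)          # bin-average density values

# Cubic B-spline basis evaluated at the bin midpoints.
k = 3
knots = np.r_[[edges[0]] * k, np.linspace(edges[0], edges[-1], 12), [edges[-1]] * k]
B = BSpline.design_matrix(mids, knots, k).toarray()

# Weighted, difference-penalized least squares: bins holding more data carry
# more weight, and a second-order difference penalty keeps the fit smooth.
W = np.diag(counts / counts.sum())
D = np.diff(np.eye(B.shape[1]), n=2, axis=0)
lam = 1.0
coef = np.linalg.solve(B.T @ W @ B + lam * D.T @ D, B.T @ W @ dens)

fhat = B @ coef                                 # smoothed density at bin midpoints
print(np.round(fhat, 3))
```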
326. Investigation of the tau-leap method for stochastic simulation. Noyola-Martinez, Josue C. (January 2008).
The use of the relatively new tau-leap algorithm to model the kinetics of regulatory systems and other chemical processes inside the cell is of great interest; however, the accuracy of the tau-leap algorithm is not known. We introduce a new method that enables us to establish the accuracy of the tau-leap method effectively. Our approach takes advantage of the fact that the stochastic simulation algorithm (SSA) and the tau-leap method can be represented as a special type of counting process that can essentially "couple," or tie together, a single realization of the SSA process with one of the tau-leap method. Because the SSA is exact, we can evaluate the accuracy of the tau-leap method by comparing it to the SSA. Our approach gives error estimates that are unrivaled by any method currently available. Moreover, our coupling algorithm allows us to propose an adaptive parameter selection algorithm that finds the parameter values needed to achieve a predetermined error threshold in the tau-leap algorithm. This error-controlled adaptive parameter selection method could not have been proposed before the introduction of our coupling algorithm, and it is a novel approach to the use of the tau-leap algorithm.
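To make the comparison concrete, here is a hedged Python sketch of the two simulators for a toy birth-death system; it shows the exact SSA and a basic fixed-step tau-leap side by side, but it does not implement the coupling construction or the adaptive parameter selection developed in the thesis, and the rate constants are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy birth-death system (illustrative, not from the thesis):
# reaction 1: 0 -> X with rate c1;  reaction 2: X -> 0 with rate c2 * x.
c1, c2 = 10.0, 0.1
propensities = lambda x: np.array([c1, c2 * x])
stoich = np.array([+1, -1])

def ssa(x0, t_end):
    """Gillespie's exact stochastic simulation algorithm (SSA)."""
    x, t = x0, 0.0
    while True:
        a = propensities(x)
        a0 = a.sum()
        t += rng.exponential(1.0 / a0)             # waiting time to next event
        if t > t_end:
            return x
        x += stoich[rng.choice(len(a), p=a / a0)]  # which reaction fires

def tau_leap(x0, t_end, tau):
    """Basic fixed-step tau-leap: Poisson numbers of firings in each leap."""
    x, t = x0, 0.0
    while t < t_end:
        k = rng.poisson(propensities(x) * tau)     # firings per channel
        x = max(x + int(k @ stoich), 0)            # crude guard against x < 0
        t += tau
    return x

exact = [ssa(0, 10.0) for _ in range(1000)]
leap = [tau_leap(0, 10.0, tau=0.1) for _ in range(1000)]
print(np.mean(exact), np.mean(leap))   # both near (c1/c2)*(1 - exp(-c2*10)), about 63
```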
327. Multilevel classification: Classification of populations from measurements on members. Yamal, Jose-Miguel (January 2007).
Multilevel classification is a problem in statistics that has gained increasing importance in many real-world applications, but it has not yet received the same statistical understanding as the general problem of classification. An example we consider here is the development of a method to detect cervical neoplasia (pre-cancer) using quantitative cytology, which involves measurements on the cells obtained in a Papanicolaou smear. The multilevel structure comes from the cells embedded within a patient: we have quantitative measurements on the cells, yet we want to classify the patients, not the cells. An additional challenge comes from the fact that we have a high-dimensional feature vector of measurements on each cell. The problem has historically been approached in two ways: (a) ignore the multilevel structure of the data, perform classification at the microscopic (cellular) level, and then use ad hoc methods to classify at the macroscopic (patient) level, or (b) summarize the microscopic-level data using a few statistics and then use these to compare the subjects at the macroscopic level. We consider a more rigorous statistical approach, the Cumulative Log-Odds (CLO) method, which models the posterior log-odds of disease for a patient given the cell-level measured feature vectors for that patient. Combining the CLO method with a latent variable model (the Latent-Class CLO method) helps to account for between-patient heterogeneity. We apply many different approaches and evaluate their performance using out-of-sample prediction. We find that our best methods classify with substantially greater accuracy than subjective Papanicolaou smear interpretation by a clinical pathologist.
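The CLO model itself is not spelled out in the abstract, so the Python sketch below is only one plausible reading of the general idea: fit a cell-level classifier and accumulate each cell's log-odds contribution beyond the prior into a patient-level score (a conditional-independence-style approximation). The simulated data, the pooling rule, and the classifier choice are illustrative assumptions, not the thesis method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)

# Hypothetical data: each patient contributes many cells, each cell a feature
# vector; disease labels live at the patient level.
def simulate_patient(diseased, n_cells=60, d=8):
    X = rng.normal(size=(n_cells, d))
    if diseased:                       # a few abnormal cells shift some features
        k = rng.integers(3, 10)
        X[:k, :3] += 1.5
    return X

patients = [(simulate_patient(y), y) for y in rng.binomial(1, 0.4, size=200)]

# Cell-level model: every cell inherits its patient's label.
Xc = np.vstack([X for X, _ in patients])
yc = np.concatenate([[y] * len(X) for X, y in patients])
cell_model = LogisticRegression(max_iter=1000).fit(Xc, yc)

def patient_log_odds(X, prior=0.4):
    """Pool cell-level log-odds into a patient-level score by accumulating each
    cell's contribution beyond the prior (an independence-style approximation,
    not necessarily the thesis' exact CLO formulation)."""
    prior_lo = np.log(prior / (1 - prior))
    cell_lo = cell_model.decision_function(X)      # log-odds per cell
    return prior_lo + np.sum(cell_lo - prior_lo)

scores = np.array([patient_log_odds(X) for X, _ in patients])
labels = np.array([y for _, y in patients])
print(np.mean((scores > 0) == labels))   # in-sample accuracy of the pooled score
```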
328. Some approaches to Bayesian design of experiments and microarray data analysis. Rossell, David (January 2007).
This thesis consists of three projects. The first project introduces methodology for designing drug development studies in an optimal fashion. Optimality is defined in a decision-theoretic framework where the goal is expected utility maximization. We show how our approach outperforms some other conventional designs. The second project generalizes the hierarchical Gamma/Gamma model for microarray data analysis. We illustrate how our generalization improves the fit without increasing the model complexity, and how one can use it to find differentially expressed genes and to build a classifier. When the sample size is small, our method finds more genes and classifies samples better than several standard methods; only as the number of microarrays grows large do competing methods detect more genes. The last project explores the use of L2E partial density estimation as an exploratory technique in the context of microarray data analysis. We propose a heuristic that combines frequentist and Bayesian ideas. Our approach outperforms other competing methods when its assumptions hold, but it exhibits increased false positive rates when the assumptions do not hold.
329. Adaptive kernel density estimation. Sain, Stephan R. (January 1994).
The need for improvements over the fixed kernel density estimator in certain situations has been discussed extensively in the literature, particularly in the application of density estimation to mode hunting. Problem densities often exhibit skewness or multimodality with differences in scale for each mode. By varying the bandwidth in some fashion, it is possible to achieve significant improvements over the fixed bandwidth approach. In general, variable bandwidth kernel density estimators can be divided into two categories: those that vary the bandwidth with the estimation point (balloon estimators) and those that vary the bandwidth with each data point (sample point estimators).
For univariate balloon estimators, it can be shown that in regions where f is convex (e.g., the tails) there exists a bandwidth such that the bias is exactly zero. Such a bandwidth leads to an MSE of $O(n^{-1})$ at points in these regions. A global implementation strategy using a local cross-validation algorithm to estimate such bandwidths is developed.
The theoretical behavior of the sample point estimator is difficult to examine as the form of the bandwidth function is unknown. An approximation based on binning the data is used to study the behavior of the MISE and the optimal bandwidth function. A practical data-based procedure for determining bandwidths for the sample point estimator is developed using a spline function to estimate the unknown bandwidth function.
Finally, the multivariate problem is briefly addressed by examining the shape and size of the optimal bivariate kernels suggested by Terrell and Scott (1992). Extensions of the binning and spline estimation ideas are also discussed.
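To make the two families of variable-bandwidth estimators concrete, here is a small Python sketch of a balloon estimator (bandwidth varying with the estimation point) and a sample-point estimator (each observation carries its own bandwidth), both driven by a simple nearest-neighbour local scale. The bandwidth rule, the neighbour count, and the simulated two-scale density are illustrative choices, not the cross-validation or spline-based procedures developed in the thesis.

```python
import numpy as np

rng = np.random.default_rng(6)

# Two modes on very different scales: the situation where one fixed bandwidth
# cannot serve both the broad and the sharp mode.
data = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 0.1, 100)])
gauss = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

h0 = 1.06 * data.std() * len(data) ** (-1 / 5)   # fixed reference bandwidth
k = 30                                           # neighbours used for local scale

def knn_dist(x):
    """A simple local scale for x: an order statistic of |x - data|."""
    return np.sort(np.abs(x - data))[k]

def fixed_kde(x):
    return np.mean(gauss((x - data) / h0)) / h0

def balloon_kde(x):
    """Balloon estimator: the bandwidth depends on the estimation point x."""
    h = knn_dist(x)
    return np.mean(gauss((x - data) / h)) / h

h_i = np.array([knn_dist(xi) for xi in data])    # one bandwidth per data point

def sample_point_kde(x):
    """Sample-point estimator: each observation carries its own bandwidth."""
    return np.mean(gauss((x - data) / h_i) / h_i)

# At the sharp mode (true height about 1.0) the variable-bandwidth estimators
# adapt, while the fixed-bandwidth estimate is badly oversmoothed.
print(fixed_kde(5.0), balloon_kde(5.0), sample_point_kde(5.0))
```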
330. Tests for harmonic components in the spectra of categorical time series. McGee, Monnie (January 1995).
The main purpose of spectral analysis in time series is to determine what patterns exist in a particular set of data. To accomplish this, one often calculates the Fourier periodogram of the data and inspects it for peaks. However, since the periodogram is not a consistent estimate of the true spectral density, apparent peaks may be artifacts of noise and genuine peaks can be obscured. Therefore, it is necessary to test the peaks for significance. One of the most widely used tests for significance of periodicities in the periodogram is Fisher's test. The test statistic for Fisher's test is the quotient of the maximum periodogram ordinate and the sum of all the ordinates. If the test statistic is too large, we reject the null hypothesis that the data are white noise.
In this thesis, I develop a test for categorical time series which is an analog of Fisher's test for continuous parameter time series. The test involves finding the Walsh-Fourier periodogram of the data and then calculating the test statistic for Fisher's test. I explain the theory behind Walsh-Fourier analysis and compare it to that of Fourier analysis. Asymptotic results for the distribution of Fisher's test for Walsh-Fourier spectra are presented and compared with a simulated distribution. I also perform power studies in order to assess the detection capability of the test. In the presence of multiple peaks, this test tends to lose power. Therefore, I also explore several alternatives to Fisher's test for Walsh-Fourier spectra and apply all of the alternative methods to several real data sets.
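For reference, the Python sketch below implements the classical Fourier-periodogram version of Fisher's test described above, with the exact null distribution of the test statistic under Gaussian white noise; the Walsh-Fourier analog developed in the thesis replaces the Fourier periodogram with the Walsh-Fourier one. The simulated series and the signal frequency are illustrative.

```python
import numpy as np
from math import comb, floor

rng = np.random.default_rng(7)

def fisher_g_test(x):
    """Fisher's test for a hidden periodicity: g is the largest periodogram
    ordinate divided by the sum of all ordinates; the p-value uses the exact
    null distribution under Gaussian white noise."""
    n = len(x)
    I = (np.abs(np.fft.rfft(x - np.mean(x))) ** 2 / n)[1:]   # periodogram
    if n % 2 == 0:
        I = I[:-1]                                           # drop Nyquist term
    m = len(I)
    g = I.max() / I.sum()
    terms = [(-1) ** (j - 1) * comb(m, j) * (1 - j * g) ** (m - 1)
             for j in range(1, floor(1 / g) + 1)]
    return g, float(sum(terms))

t = np.arange(200)
noise = rng.normal(size=200)
signal = noise + 0.8 * np.cos(2 * np.pi * 0.1 * t)
# The periodic series gives a tiny p-value; the pure-noise series usually does not.
print(fisher_g_test(noise)[1], fisher_g_test(signal)[1])
```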