311 |
Using ATS to Turn Time Series Estimation and Model Diagnostics into Fast Regression Estimation and Model Diagnostics. Jeremy M. Troisi (5930336). 15 May 2019.
The Average Transform Smooth (ATS) statistical methods [McRae, Mallows, and Cleveland] are applied to measurements of a non-gaussian random variable to make them close to gaussian. This gaussianization makes use of the well-known concept of the variance-stabilizing transformation, but takes it further by first averaging blocks of r measurements, then transforming, and then smoothing. The smoothing can be nonparametric, or can be the fitting of a parametric model. The gaussianization makes analysis simpler and more effective.

In this work ATS is applied to the periodogram of a stationary parametric time series, making use of the large-sample properties of the periodogram given the true power spectrum [Brillinger], to develop a new approach to parametric time series model estimation and model diagnostics. The ATS results and the theory are reformulated as a regression model, PPS-REG, involving the true power spectrum and the periodogram. PPS-REG has attractive properties: iid gaussian error terms with mean 0 and a known variance; accurate estimation; much faster estimation than classical maximum likelihood when the time series is large; access to the very powerful classical regression model diagnostics; and diagnostics based on the power spectrum, adding substantially to the standard use of the autocovariance function for diagnosing the fits of models specified in the time domain.
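The core ATS recipe described above (average blocks of r periodogram ordinates, then apply a variance-stabilizing log transform) can be sketched as follows. This is a minimal illustration on a simulated AR(1) series; the function and parameter names are ours, not the thesis's.

```python
import numpy as np

def ats_log_periodogram(x, r=8):
    """Average blocks of r periodogram ordinates, then log-transform.

    Averaging the (roughly chi-square distributed) ordinates before
    taking logs is the variance-stabilizing step that moves the
    result toward gaussianity.
    """
    n = len(x)
    fft = np.fft.rfft(x - x.mean())
    pgram = (np.abs(fft[1:]) ** 2) / n   # drop the zero frequency
    m = (len(pgram) // r) * r            # trim to a multiple of r
    blocks = pgram[:m].reshape(-1, r)
    return np.log(blocks.mean(axis=1))   # average first, then transform

# Simulated AR(1) series: x_t = 0.6 x_{t-1} + e_t
rng = np.random.default_rng(0)
e = rng.standard_normal(4096)
x = np.empty_like(e)
x[0] = e[0]
for t in range(1, len(e)):
    x[t] = 0.6 * x[t - 1] + e[t]

y = ats_log_periodogram(x, r=8)
print(len(y))   # number of averaged, log-transformed ordinates
```

The averaged, logged ordinates are approximately gaussian with a known variance, which is what makes the regression reformulation (PPS-REG) attractive.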
|
312 |
Permutation procedures for ANOVA, regression and PCA. Storm, Christine. 24 May 2013.
Parametric methods are effective and appropriate when data sets are obtained by well-defined random sampling procedures, the population distribution for responses is well defined, the null sampling distributions of suitable test statistics do not depend on any unknown entity, and well-defined likelihood models are available for handling nuisance parameters. Permutation testing methods, on the other hand, are appropriate and unavoidable when distribution models for responses are not well specified, are nonparametric, or depend on too many nuisance parameters; when ancillary statistics in well-specified distributional models have a strong influence on inferential results or are confounded with other nuisance entities; when the sample sizes are less than the number of parameters; and when data sets are obtained by ill-specified selection-bias procedures. In addition, permutation tests are useful not only when parametric tests are not possible, but also when more importance needs to be given to the observed data set than to the population model, as is typical, for example, in biostatistics. The different types of permutation methods for analysis of variance, multiple linear regression and principal component analysis are explored. More specifically, one-way, two-way and three-way ANOVA permutation strategies are discussed. Approximate and exact permutation tests for the significance of one or more regression coefficients in a multiple linear regression model are explained next, and lastly, the use of permutation tests as a means to validate and confirm the results obtained from an exploratory PCA is described. / Dissertation (MSc), Statistics, University of Pretoria, 2012.
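An approximate permutation test for a regression coefficient, of the kind discussed above, can be sketched as follows. This is a minimal illustration of the general idea for a single slope, not the dissertation's specific procedures; all names and data are illustrative.

```python
import numpy as np

def permutation_test_slope(x, y, n_perm=2000, seed=0):
    """Approximate permutation p-value for the slope in simple regression.

    The response is permuted under the null of no association, and the
    observed slope is compared against the permutation distribution.
    """
    rng = np.random.default_rng(seed)

    def slope(xv, yv):
        xc = xv - xv.mean()
        return xc @ (yv - yv.mean()) / (xc @ xc)

    obs = slope(x, y)
    count = 0
    for _ in range(n_perm):
        if abs(slope(x, rng.permutation(y))) >= abs(obs):
            count += 1
    # Add-one correction keeps the p-value strictly positive
    return (count + 1) / (n_perm + 1)

rng = np.random.default_rng(1)
x = rng.standard_normal(60)
y = 0.8 * x + rng.standard_normal(60)   # a genuine linear effect
p = permutation_test_slope(x, y)
print(round(p, 4))
```

Exact tests enumerate all permutations instead of sampling them, which is feasible only for small samples; the approximate version above is the practical default.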
|
313 |
A detailed investigation of the linear model and some of its underlying assumptions. Coutsourides, Dimitris. January 1977.
Bibliography: p. 178-182. / The purpose of this thesis is to provide a study of the linear model. The work is split into six chapters. In Chapter 1 we define and examine the two linear models, i.e. the regression and the correlation model; more specifically, we show that the regression model is the conditional version of the correlation model. In Chapter 2 we deal with the problem of multicollinearity: we investigate the sources of near singularities, give some methods of detecting multicollinearity, and state briefly methods for overcoming the problem. In Chapter 3 we consider the least squares method with restrictions and present tests for the linear restrictions; the theory concerning the sign of least squares estimates is discussed, and we then deal with a method for augmenting existing data. Chapter 4 is mainly devoted to ridge regression; we state methods for selecting the best estimate of the ridge parameter k, with some extensions dealing with shrinkage estimators and linear transforms of the least squares estimator. In Chapter 5 we deal with principal components and give methods for selecting the best subset of principal components, with particular attention to the fractional rank method and to latent root regression analysis. In Chapter 6 comparisons are performed between the estimators previously mentioned, and the conclusions are stated.
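Two of the chapter topics, detecting multicollinearity and ridge regression, admit a short sketch. Variance inflation factors are one standard detection device; the data, names and parameter values below are illustrative, not from the thesis.

```python
import numpy as np

def vif(X):
    """Variance inflation factors: diagonal of the inverse correlation matrix."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    corr = (Xs.T @ Xs) / len(X)
    return np.diag(np.linalg.inv(corr))

def ridge(X, y, k):
    """Ridge estimate (X'X + k I)^(-1) X'y; k = 0 gives ordinary least squares."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

rng = np.random.default_rng(2)
n = 200
z = rng.standard_normal(n)
# First two predictors are nearly collinear by construction
X = np.column_stack([z + 0.05 * rng.standard_normal(n),
                     z + 0.05 * rng.standard_normal(n),
                     rng.standard_normal(n)])
y = X @ np.array([1.0, 1.0, 0.5]) + rng.standard_normal(n)

vifs = vif(X)                      # the collinear pair shows inflated VIFs
beta_ols = ridge(X, y, k=0.0)      # ordinary least squares
beta_ridge = ridge(X, y, k=5.0)    # coefficients shrunk toward zero
print(np.round(vifs, 1))
```

A VIF well above 10 is the conventional warning sign of near singularity; ridge regression then trades a little bias for a large reduction in coefficient variance.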
|
314 |
Regression analysis of big count data via A-optimal subsampling. Zhao, Xiaofeng. 19 July 2018.
Indiana University-Purdue University Indianapolis (IUPUI) / There are two computational bottlenecks for Big Data analysis: (1) the data is too large for a desktop to store, and (2) the computing task takes too long to finish. While the Divide-and-Conquer approach easily breaks the first bottleneck, the Subsampling approach simultaneously breaks both of them. Uniform sampling and nonuniform sampling, such as Leverage Scores sampling, are frequently used in the recent development of fast randomized algorithms. However, both approaches, as Peng and Tan (2018) have demonstrated, are not effective in extracting important information from data. In this thesis, we conduct regression analysis for big count data via A-optimal subsampling. We derive A-optimal sampling distributions by minimizing the trace of certain dispersion matrices in general estimating equations (GEE). We point out that computing the A-optimal distributions exactly takes the same running time as the full-data M-estimator. To compute the distributions fast, we propose the A-optimal Scoring Algorithm, which is implementable by parallel computing, is sequentially updatable for streaming data, and has a faster running time than that of the full-data M-estimator. We present asymptotic normality for the estimates in GEEs and in generalized count regression. A data truncation method is introduced. We conduct extensive simulations to evaluate the numerical performance of the proposed sampling distributions, and we apply the proposed A-optimal subsampling method to two real count data sets, the Bike Sharing data and the Blog Feedback data. Our results in both simulations and real data sets indicate that the A-optimal distributions substantially outperform the uniform distribution and have faster running times than the full-data M-estimators.
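The general shape of nonuniform subsampling for regression can be sketched as below. Note that this uses simple row-norm scores as a stand-in: the thesis's A-optimal probabilities come from minimizing the trace of a GEE dispersion matrix, which is not reproduced here, and all names are illustrative.

```python
import numpy as np

def subsample_ls(X, y, probs, r, seed=0):
    """Inverse-probability-weighted least squares on a size-r subsample."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=r, replace=True, p=probs)
    w = 1.0 / probs[idx]                    # weights undo the sampling bias
    Xs, ys = X[idx], y[idx]
    Xw = Xs * w[:, None]
    # Weighted normal equations: (sum w_i x_i x_i') beta = sum w_i x_i y_i
    return np.linalg.solve(Xw.T @ Xs, Xw.T @ ys)

rng = np.random.default_rng(3)
n, p = 100_000, 4
X = rng.standard_normal((n, p))
beta = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ beta + rng.standard_normal(n)

# Illustrative nonuniform scores: row norms. This stand-in only
# shows the pipeline's shape, not the A-optimal criterion itself.
scores = np.linalg.norm(X, axis=1)
probs = scores / scores.sum()

beta_hat = subsample_ls(X, y, probs, r=2000)   # fits on 2% of the rows
print(np.round(beta_hat, 2))
```

The point of a well-chosen sampling distribution is that a small, cheap subsample like this can recover the full-data estimate to useful accuracy.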
|
315 |
Machine learning and template based modeling for improving and expanding the functionality of rigid body docking. Desta, Israel Tilahun. 28 September 2021.
Proteins govern practically every process in living organisms through inhibiting, activating, or acting on other proteins in different ways. With the large and growing number of interactions known through high-throughput screening technologies, experimental determination of atomic-level details of these interactions is nigh impossible. Computational methods such as docking can speed up efforts to understand these interactions. However, several issues ought to be addressed before docking can replace experimental methods. This thesis describes work on assessment of the state of the art in docking methods, implementation of a machine learning algorithm to improve model ranking, and integration of docking with template-based modeling to expand its usage, with a special focus on antibody-antigen interactions.
Firstly, the performance of docking methods was rigorously assessed using a diverse set of protein complexes, with a special focus on ClusPro, one of the leading rigid-body docking servers. Different strengths and potential areas of improvement for ClusPro, and for rigid-body docking methods in general, were highlighted. Secondly, one of the major shortcomings of docking noted in the first project, poor ranking of good models, was addressed: a regression-based machine learning algorithm was introduced to improve the ranking. Finally, a server was developed to tackle the challenge of epitope mapping by integrating template-based modeling with docking. An intuitive ensemble approach to scoring residue likelihood using docking poses and different homologues is shown to yield great success. In addition to shifting docking's purpose from conformational search to interface identification, this server also allows users to start with protein sequence inputs.
|
316 |
Linear Regression Analysis of the Suspended Sediment Load in Rivers and Streams Using Data of Similar Precipitation Values. Jamison, Jonathan A. 21 November 2018.
No description available.
|
317 |
Comparison of Regression Methods with Non-Convex Penalties. Pipher, Brandon. 7 November 2019.
No description available.
|
318 |
A multi-gene symbolic regression approach for predicting LGD: A benchmark comparative study. Tuoremaa, Hanna. January 2023.
Under the Basel accords for measuring regulatory capital requirements, the credit risk parameters probability of default (PD), exposure at default (EAD) and loss given default (LGD) are measured with a bank's own estimates under the internal ratings-based approach. The estimated parameters are also the foundation for understanding the actual risk in a bank's credit portfolio, so the predictive performance of such models is interesting to examine. Predictive models for the credit risk parameter LGD have shown low performance, and LGD values are generally hard to estimate. The main purpose of this thesis is to analyse the predictive performance of a multi-gene genetic programming approach to symbolic regression compared to three benchmark regression models. The goal of multi-gene symbolic regression is to estimate the underlying relationship in the data through a linear combination of a set of generated mathematical expressions. The benchmark models are Logit Transformed Regression, Beta Regression and Regression Tree, all frequently used in the area. The data used to compare the models is a set of randomly selected, de-identified loans from the portfolios of underlying U.S. residential mortgage-backed securities retrieved from International Finance Research. The conclusion from implementing and comparing the models is that the credit risk parameter LGD remains difficult to estimate: the symbolic regression approach neither yielded better predictive ability than the benchmark models nor appeared to find the underlying relationship in the data. The benchmark models are more user-friendly, are easier to implement, and all require less computation than symbolic regression.
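One of the benchmarks named above, logit-transformed regression, admits a compact sketch: fit ordinary least squares to logit(LGD) and map predictions back through the sigmoid. The data below are simulated stand-ins, not the thesis's mortgage data, and all names are illustrative.

```python
import numpy as np

def fit_logit_regression(X, y, eps=1e-6):
    """OLS on logit(y), with predictions mapped back through the sigmoid.

    y must lie in (0, 1); eps clips the boundary values that are
    common in observed LGD data.
    """
    yc = np.clip(y, eps, 1 - eps)
    z = np.log(yc / (1 - yc))                       # logit transform
    Xd = np.column_stack([np.ones(len(X)), X])      # add intercept
    beta = np.linalg.lstsq(Xd, z, rcond=None)[0]

    def predict(Xnew):
        zn = np.column_stack([np.ones(len(Xnew)), Xnew]) @ beta
        return 1.0 / (1.0 + np.exp(-zn))            # back to (0, 1)

    return predict

# Simulated stand-in for an LGD data set: responses in (0, 1)
rng = np.random.default_rng(4)
X = rng.standard_normal((500, 3))
true_z = 0.5 * X[:, 0] - 1.0 * X[:, 1] + 0.3
y = 1.0 / (1.0 + np.exp(-(true_z + 0.2 * rng.standard_normal(500))))

predict = fit_logit_regression(X, y)
mse = float(np.mean((y - predict(X)) ** 2))
print(round(mse, 4))
```

The transform keeps predictions inside the unit interval, which is the main reason logit-transformed regression and beta regression are standard baselines for LGD.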
|
319 |
An Empirical Study on Correlation Patterns of Disruptions by Flooding Hazards. Wang, Jin. 17 May 2014.
Flooding is one of the fatal natural hazards that frequently cause serious impacts to infrastructure, and its characteristics are expected to change with the changing global climate. This paper identifies the spatio-temporal correlation patterns of disruptions by flooding hazards at the county level for the Deep South in the United States, particularly the state of Arkansas. The frequency of each county's flooding disruptions, calculated as a time series, is generated from flooding records within the research period 1998-2013. A set of quality control procedures, including a duplicated-data check, a spatial-outliers check, and a homogeneity test, is applied prior to the regression analysis. The spatial characteristic of the disruptions is identified by mapping them, while their temporal characteristic is assessed using a correlation coefficient defined in this paper. Accordingly, for most pairs of locations throughout the study period, greater correlation of disruptions by flooding is found as the distance between locations decreases.
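The distance-correlation pattern described above can be illustrated on synthetic data with an exponential spatial covariance. The county locations and every parameter below are hypothetical, chosen only to show how pairwise series correlation is compared against pairwise distance.

```python
import numpy as np

rng = np.random.default_rng(5)
n_sites, n_months, scale = 12, 192, 30.0           # 192 months spans 1998-2013
coords = rng.uniform(0, 100, size=(n_sites, 2))    # hypothetical county centroids

# Exponential spatial covariance: correlation decays with distance
D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
Sigma = np.exp(-D / scale)
L = np.linalg.cholesky(Sigma + 1e-9 * np.eye(n_sites))
series = L @ rng.standard_normal((n_sites, n_months))   # correlated series proxy

# Pairwise empirical correlation versus pairwise distance
pairs = [(i, j) for i in range(n_sites) for j in range(i + 1, n_sites)]
dists = np.array([D[i, j] for i, j in pairs])
corrs = np.array([np.corrcoef(series[i], series[j])[0, 1] for i, j in pairs])

# A negative value reproduces the paper's qualitative finding:
# closer counties show more strongly correlated disruption series
trend = np.corrcoef(dists, corrs)[0, 1]
print(round(float(trend), 2))
```

Plotting `corrs` against `dists` makes the decay visible directly; the single summary coefficient is just the quickest check of the direction.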
|
320 |
Evaluation of Herbicide Formulation and Spray Nozzle Selection on Physical Spray Drift. Cobb, Jasper Lewis. 13 December 2014.
New transgenic crops are currently being developed which will be tolerant to dicamba and 2,4-D herbicides. This technology could greatly benefit producers who are impacted by weed species that have developed resistance to other herbicides, like glyphosate-resistant Palmer amaranth. Adoption of this new technology is likely to be rapid and widespread, which will lead to an increase in the amount of dicamba and 2,4-D applied each season. It is well documented that these herbicides are very injurious to soybeans, cotton, tomatoes, and most other broadleaf crops, and their increased use brings increased chances of physical spray drift onto susceptible crops. Because of these risks, research is being conducted on new herbicide formulation and spray nozzle combinations to determine management options that may minimize physical spray drift.
|