31

Algorithms for a Partially Regularized Least Squares Problem

Skoglund, Ingegerd January 2007 (has links)
When analysing water samples taken from, for example, a watercourse, the concentrations of various substances are determined. These concentrations often depend on the water flow. It is of interest to find out whether observed changes in the concentrations are due to natural variation or are caused by other factors. To investigate this, a statistical time series model containing unknown parameters has been proposed. Fitting the model to measured data leads to an underdetermined system of equations. The thesis studies, among other things, different ways of ensuring a unique and reasonable solution. The basic idea is to impose additional conditions on the sought parameters. In the model studied one can, for example, require that certain parameters do not vary strongly over time while still allowing seasonal variation. This is done by regularizing those parameters, which gives rise to a least squares problem with one or two regularization parameters. Since not all of the parameters are regularized, the result is a partially regularized least squares problem. In general the values of the regularization parameters are not known, and the problem may have to be solved for several different values in order to obtain a reasonable solution. The thesis studies how this problem can be solved numerically with essentially two different methods, one iterative and one direct. Some ways of determining suitable values of the regularization parameters are also studied. In an iterative solution method a given initial approximation is improved step by step until a suitably chosen stopping criterion is satisfied. Here we use the conjugate gradient method with specially constructed preconditioners. The number of iterations required to solve the problem with and without preconditioning is compared both theoretically and in practice. The method is investigated only for the case where the two regularization parameters have the same value. The direct method uses QR factorization to solve the least squares problem. The idea is to first carry out the computations that can be done independently of the regularization parameters, while taking the special structure of the problem into account. To determine values of the regularization parameters, Reinsch's method is generalized to the case with two parameters. Generalized cross-validation and a computationally cheaper Monte Carlo method are also investigated. / Statistical analysis of data from rivers deals with time series which are dependent, e.g., on climatic and seasonal factors. For example, it is a well-known fact that the load of substances in rivers can be strongly dependent on the runoff. It is of interest to find out whether observed changes in riverine loads are due only to natural variation or caused by other factors. Semi-parametric models have been proposed for estimating time-varying linear relationships between runoff and riverine loads of substances. The aim of this work is to study some numerical methods for solving the linear least squares problem which arises. The model gives a linear system of the form A1x1 + A2x2 + n = b1. The vector n consists of identically distributed random variables, all with mean zero. The unknowns, x, are split into two groups, x1 and x2. In this model there are usually more unknowns than observations, and the resulting linear system is most often consistent, having an infinite number of solutions. Hence some constraint on the parameter vector x is needed. One possibility is to avoid rapid variation in, e.g., the parameters x2. This can be accomplished by regularizing using a matrix A3, which is a discretization of some norm.
The problem is formulated as a partially regularized least squares problem with one or two regularization parameters. The parameter x2 has a two-dimensional structure, and by using two different regularization parameters it is possible to regularize separately in each dimension. We first study (for the case of one parameter only) the conjugate gradient method for solving the problem. To improve the rate of convergence, block preconditioners of Schur complement type are suggested, analyzed and tested. A direct solution method based on QR decomposition is also studied. The idea is to first perform the operations that are independent of the values of the regularization parameters, utilizing the special block structure of the problem. We further discuss the choice of regularization parameters and, in particular, generalize Reinsch's method to the case with two parameters. Finally, the cross-validation technique is treated. Here a Monte Carlo method is also used, by which an approximation to the generalized cross-validation function can be computed efficiently.
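As a rough illustration of the problem structure (not the thesis's own code), the sketch below sets up a partially regularized least squares problem in Python/NumPy: only the x2 block is penalized, through a second-difference matrix standing in for A3, and the stacked system is re-solved for a few trial values of the regularization parameter. Sizes and data are made up.

```python
import numpy as np

# A minimal sketch of the partially regularized least squares problem
#   min_{x1, x2} ||A1 x1 + A2 x2 - b1||^2 + lam^2 ||A3 x2||^2
# where only x2 is regularized. Sizes, data and the choice of A3
# (a second-difference operator) are illustrative assumptions.
rng = np.random.default_rng(0)
m, n1, n2 = 40, 3, 60                  # more unknowns than observations
A1 = rng.standard_normal((m, n1))
A2 = rng.standard_normal((m, n2))
b1 = rng.standard_normal(m)

# Discretized smoothing norm for x2: second differences
A3 = np.diff(np.eye(n2), n=2, axis=0)

def solve_partially_regularized(lam):
    # Stack the data equations with the scaled regularization rows;
    # the regularization rows touch only the x2 block.
    top = np.hstack([A1, A2])
    bottom = np.hstack([np.zeros((A3.shape[0], n1)), lam * A3])
    A = np.vstack([top, bottom])
    rhs = np.concatenate([b1, np.zeros(A3.shape[0])])
    x, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return x[:n1], x[n1:]

# Solve for a few trial values of the regularization parameter
for lam in (0.1, 1.0, 10.0):
    x1, x2 = solve_partially_regularized(lam)
    resid = np.linalg.norm(A1 @ x1 + A2 @ x2 - b1)
    print(f"lambda={lam:5.1f}  residual={resid:.3f}  ||A3 x2||={np.linalg.norm(A3 @ x2):.3f}")
```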
32

Data Mining in Tree-Based Models and Large-Scale Contingency Tables

Kim, Seoung Bum 11 January 2005 (has links)
This thesis is composed of two parts. The first part pertains to tree-based models; the second deals with multiple testing in large-scale contingency tables. Tree-based models have gained enormous popularity in statistical modeling and data mining. We propose a novel tree-pruning algorithm called the frontier-based tree-pruning algorithm (FBP). The new method has a computational complexity comparable to cost-complexity pruning (CCP), yet for tree pruning it provides a full spectrum of information. A numerical study on real data sets reveals a surprise: in the complexity-penalization approach, most of the tree sizes are inadmissible. FBP facilitates a more faithful implementation of cross-validation, which is favored by simulations. One of the most common test procedures for two-way contingency tables is the test of independence between the two categorizations. Standard procedures such as the chi-square or likelihood ratio tests assess overall independence but provide limited information about the nature of the association within the table. We propose an approach for testing independence of categories in individual cells of a contingency table, based on a multiple testing framework. We then employ the proposed method to identify patterns of pair-wise associations between amino acids involved in beta-sheet bridges of proteins. We identify a number of amino acid pairs that exhibit either strong or weak association. These patterns provide useful information for algorithms that predict secondary and tertiary structures of proteins.
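As a hedged illustration of cell-wise independence testing (the thesis's exact procedure may differ), the sketch below computes adjusted Pearson residuals for every cell of a small, made-up two-way table, converts them to p-values, and applies a Benjamini-Hochberg correction across all cells.

```python
import numpy as np
from scipy import stats

# Illustrative only: test each cell of a two-way contingency table for departure
# from independence using adjusted Pearson residuals, then control the false
# discovery rate across all cells. The counts below are invented.
table = np.array([[30, 10,  5],
                  [12, 25,  8],
                  [ 6,  9, 40]], dtype=float)

n = table.sum()
row = table.sum(axis=1, keepdims=True)
col = table.sum(axis=0, keepdims=True)
expected = row @ col / n

# Adjusted (standardized) residuals are approximately N(0, 1) under independence
adj = (table - expected) / np.sqrt(expected * (1 - row / n) * (1 - col / n))
pvals = 2 * stats.norm.sf(np.abs(adj))

# Benjamini-Hochberg step-up procedure over all cells
flat = pvals.ravel()
order = np.argsort(flat)
m = flat.size
thresh = 0.05 * np.arange(1, m + 1) / m
passed = flat[order] <= thresh
k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
rejected = np.zeros(m, dtype=bool)
rejected[order[:k]] = True
print(rejected.reshape(table.shape))   # cells flagged as showing association
```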
33

Choosing a Kernel for Cross-Validation

Savchuk, Olga 14 January 2010 (has links)
The statistical properties of cross-validation bandwidths can be improved by choosing an appropriate kernel, different from the kernels traditionally used for cross-validation purposes. In light of this idea, we developed two new methods of bandwidth selection, termed Indirect cross-validation and Robust one-sided cross-validation. The kernels used in the Indirect cross-validation method yield an improvement in the relative bandwidth rate to n^(-1/4), which is substantially better than the n^(-1/10) rate of the least squares cross-validation method. The robust kernels used in the Robust one-sided cross-validation method eliminate the bandwidth bias for the case of regression functions with discontinuous derivatives.
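For context, the sketch below implements the classical least squares cross-validation (LSCV) criterion for a Gaussian-kernel density estimate, the baseline whose n^(-1/10) rate the thesis improves on; the indirect and one-sided variants change the kernel used inside such a criterion. The data and bandwidth search range are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def lscv_score(h, x):
    """LSCV(h) = int fhat^2 - (2/n) * sum_i fhat_{-i}(x_i), Gaussian kernel."""
    n = x.size
    d = (x[:, None] - x[None, :]) / h
    # int fhat^2 has a closed form: the Gaussian kernel convolved with itself is N(0, 2)
    phi2 = np.exp(-d**2 / 4) / np.sqrt(4 * np.pi)
    term1 = phi2.sum() / (n**2 * h)
    # leave-one-out density estimate at each sample point
    phi = np.exp(-d**2 / 2) / np.sqrt(2 * np.pi)
    np.fill_diagonal(phi, 0.0)
    loo = phi.sum(axis=1) / ((n - 1) * h)
    return term1 - 2 * loo.mean()

rng = np.random.default_rng(1)
x = rng.standard_normal(200)
res = minimize_scalar(lscv_score, bounds=(0.05, 2.0), args=(x,), method="bounded")
print("LSCV bandwidth:", res.x)
```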
34

Applying Data Mining Techniques to the Prediction of Marine Smuggling Behaviors

Lee, Chang-mou 26 July 2008 (has links)
none
35

Applying Classification and Regression Trees to manage financial risk

Martin, Stephen Fredrick 16 August 2012 (has links)
The goal of this project is to develop a set of business rules to mitigate risk related to a specific financial decision within the prepaid debit card industry. Under certain circumstances, issuers of prepaid debit cards may need to decide whether funds on hold can be released early for use by card holders prior to the final transaction settlement. After a brief introduction to the prepaid card industry and the financial risk associated with the early release of funds on hold, the paper presents the motivation to apply the CART (Classification and Regression Trees) method. The paper provides a tutorial on the CART algorithms formally developed by Breiman, Friedman, Olshen and Stone in the monograph Classification and Regression Trees (1984), as well as a detailed explanation of the R programming code to implement the RPART function (Therneau 2010). Special attention is given to parameter selection and to the process of finding an optimal solution that balances complexity against predictive classification accuracy, measured against an independent data set through a cross-validation process. Lastly, the paper presents an analysis of the financial risk mitigation based on the resulting business rules.
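The project itself works with R's RPART implementation; as a loose analogue in Python/scikit-learn, the sketch below traces the cost-complexity pruning path and selects the complexity penalty that maximizes cross-validated accuracy. The data set and CV settings are illustrative, not those used in the project.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative stand-in data set; the project's prepaid-card data are not public.
X, y = load_breast_cancer(return_X_y=True)

# Candidate complexity penalties along the cost-complexity pruning path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Score each pruned tree by 10-fold cross-validated accuracy
scores = []
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    scores.append(cross_val_score(tree, X, y, cv=10).mean())

best = path.ccp_alphas[int(np.argmax(scores))]
print(f"best alpha={best:.5f}, CV accuracy={max(scores):.3f}")
```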
36

Factors that Influence Cross-validation of Hierarchical Linear Models

Widman, Tracy 07 May 2011 (has links)
While use of hierarchical linear modeling (HLM) to predict an outcome is reasonable and desirable, employing the model for prediction without first establishing its predictive validity is ill-advised. Estimating the predictive validity of a regression model by cross-validation has been thoroughly researched, but there is a dearth of research investigating the cross-validation of hierarchical linear models. One of the major obstacles in cross-validating HLM is the lack of a measure of explained variance similar to the squared multiple correlation coefficient in regression analysis. The purpose of this Monte Carlo simulation study is to explore the impact of sample size, centering, and predictor-criterion correlation magnitudes on potential cross-validation measurements for hierarchical linear modeling. The study considered the impact of 64 simulated conditions across three explained-variance approaches: Raudenbush and Bryk's (2002) proportional reduction in error variance, Snijders and Bosker's (1994) modeled variance, and a measure of explained variance proposed by Gagné and Furlow (2009). For each approach, a cross-validation measurement, shrinkage, was obtained. The results indicate that sample size, predictor-criterion correlations, and centering all affect the cross-validation measurement, and that the degree and direction of the impact differ with the explained-variance approach employed. Under some approaches shrinkage decreased with larger level-2 sample sizes, while under others it increased. Likewise, grand-mean centering resulted in higher shrinkage estimates than group-mean centering under some approaches and in smaller estimates under others. Larger total sample sizes yielded smaller shrinkage estimates, as did the predictor-criterion correlation combination in which the group-level predictor had the stronger correlation. The approaches to explained variance differed substantially in their usability for cross-validation. The Snijders and Bosker approach provided relatively large shrinkage estimates, and, depending on the predictor-criterion correlation, shrinkage under both Raudenbush and Bryk approaches could be sizable to the degree that the estimate begins to lack meaning. Researchers seeking to cross-validate HLM need to be mindful of the interplay between the explained-variance approach employed and the impact of sample size, centering, and predictor-criterion correlations on shrinkage estimates when making research design decisions.
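The shrinkage quantity itself is simple to state; the toy sketch below computes it for ordinary regression (fit on a calibration sample, score explained variance on an independent validation sample, take the drop). The thesis studies the analogous, and subtler, quantities for hierarchical linear models, which this example does not attempt; the data and model here are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Shrinkage = calibration R^2 minus validation R^2 for the same fitted model.
rng = np.random.default_rng(0)
n, d = 200, 4
X = rng.standard_normal((2 * n, d))
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.standard_normal(2 * n)

X_cal, y_cal = X[:n], y[:n]          # calibration sample
X_val, y_val = X[n:], y[n:]          # independent validation sample

model = LinearRegression().fit(X_cal, y_cal)
r2_cal = r2_score(y_cal, model.predict(X_cal))
r2_val = r2_score(y_val, model.predict(X_val))
print(f"calibration R2={r2_cal:.3f}, validation R2={r2_val:.3f}, shrinkage={r2_cal - r2_val:.3f}")
```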
37

Stability Selection of the Number of Clusters

Reizer, Gabriella v 18 April 2011 (has links)
Selecting the number of clusters is one of the greatest challenges in clustering analysis. In this thesis, we propose a variety of stability selection criteria based on cross-validation for determining the number of clusters. Clustering stability measures the agreement of clusterings obtained by applying the same clustering algorithm on multiple independent and identically distributed samples. We propose to measure the clustering stability by the correlation between two clustering functions. These criteria are motivated by the concept of clustering instability proposed by Wang (2010), which is based on a form of clustering distance. In addition, the effectiveness and robustness of the proposed methods are numerically demonstrated on a variety of simulated and real-world samples.
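A hedged sketch of the general idea follows, using the adjusted Rand index as a stand-in for the thesis's correlation-based agreement measure: for each candidate number of clusters, cluster two independent portions of the data, extend both clusterings to a common hold-out portion, and score their agreement.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Illustrative data with three well-separated groups
X, _ = make_blobs(n_samples=600, centers=3, random_state=0)
rng = np.random.default_rng(0)

def stability(X, k, n_rep=20):
    """Average agreement between clusterings fitted on two disjoint subsamples."""
    scores = []
    for _ in range(n_rep):
        idx = rng.permutation(len(X))
        a, b, test = np.array_split(idx, 3)
        km_a = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[a])
        km_b = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[b])
        # Extend both clusterings to the same hold-out points and compare
        scores.append(adjusted_rand_score(km_a.predict(X[test]),
                                          km_b.predict(X[test])))
    return np.mean(scores)

for k in range(2, 7):
    print(k, round(stability(X, k), 3))   # the most stable k is preferred
```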
38

Spatio-temporal prediction modeling of clusters of influenza cases

Qiu, Weiyu Unknown Date
No description available.
39

Bayesian Analysis of Spatial Point Patterns

Leininger, Thomas Jeffrey January 2014 (has links)
We explore the posterior inference available for Bayesian spatial point process models. In the literature, discussion of such models is usually focused on model fitting and rejecting complete spatial randomness, with model diagnostics and posterior inference often left as an afterthought. Posterior predictive point patterns are shown to be useful in performing model diagnostics and model selection, as well as providing a wide array of posterior model summaries. We prescribe Bayesian residuals and methods for cross-validation and model selection for Poisson processes, log-Gaussian Cox processes, Gibbs processes, and cluster processes. These novel approaches are demonstrated using existing datasets and simulation studies.
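As a minimal illustration of a posterior predictive check for the simplest case treated here, a homogeneous Poisson process on the unit square, the sketch below compares a summary statistic of an (artificial) observed pattern with the same statistic on posterior predictive simulations. The prior, the stand-in data, and the choice of statistic are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

def nn_mean(pts):
    """Mean nearest-neighbour distance of a point pattern."""
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1).mean()

observed = rng.uniform(size=(80, 2))        # stand-in for an observed pattern
area = 1.0

# Conjugate Gamma posterior for the intensity: lambda | data ~ Gamma(a + n, b + |D|)
a, b = 1.0, 1.0
post_lambda = rng.gamma(a + len(observed), 1.0 / (b + area), size=500)

# Posterior predictive patterns and their summary statistics
pred_stats = []
for lam in post_lambda:
    n = rng.poisson(lam * area)
    pts = rng.uniform(size=(max(n, 2), 2))
    pred_stats.append(nn_mean(pts))

obs_stat = nn_mean(observed)
p = np.mean(np.array(pred_stats) >= obs_stat)   # posterior predictive p-value
print(f"observed NN distance {obs_stat:.4f}, predictive p-value {p:.2f}")
```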
40

Exploiting diversity for efficient machine learning

Geras, Krzysztof Jerzy January 2018 (has links)
A common practice for solving machine learning problems is currently to consider each problem in isolation, starting from scratch every time a new learning problem is encountered or a new model is proposed. This is a perfectly feasible approach when the problems are sufficiently easy or, if a problem is hard, when a large amount of resources, in terms of both training data and computation, is available. Although this naive approach has been the main focus of machine learning research for a few decades and has had a lot of success, it becomes infeasible when the problem is too hard in proportion to the available resources. Using a complex model in this naive approach requires collecting large data sets (if that is possible at all) to avoid overfitting, and hence also large computational resources to handle the increased amount of data, first during training to process a large data set and then at test time to execute a complex model. An alternative to treating each learning problem independently is to leverage related data sets and the computation encapsulated in previously trained models. By doing so we can decrease the amount of data necessary to reach a satisfactory level of performance and, consequently, improve the achievable accuracy and decrease training time. Our attack on this problem is to exploit diversity - in the structure of the data set, in the features learnt and in the inductive biases of different neural network architectures. In the setting of learning from multiple sources we introduce multiple-source cross-validation, which gives an unbiased estimator of the test error when the data set is composed of data coming from multiple sources and the data at test time come from a new, unseen source. We also propose new estimators of the variance of standard k-fold cross-validation and of multiple-source cross-validation, which have lower bias than previously known ones. To improve unsupervised learning we introduce scheduled denoising autoencoders, which learn a more diverse set of features than the standard denoising autoencoder. This is thanks to their training procedure, which starts with a high level of noise while the network learns coarse features, and then lowers the noise gradually, allowing the network to learn more local features. A connection between this training procedure and curriculum learning is also drawn. We develop the idea of learning a diverse representation further by explicitly incorporating the goal of obtaining a diverse representation into the training objective. The proposed model, the composite denoising autoencoder, learns multiple subsets of features focused on modelling variations in the data set at different levels of granularity. Finally, we introduce the idea of model blending, a variant of model compression in which the two models, the teacher and the student, are both strong models that differ in their inductive biases. As an example, we train convolutional networks using the guidance of bidirectional long short-term memory (LSTM) networks, which allows the convolutional network to be trained to be more accurate than the LSTM network at no extra cost at test time.
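A minimal sketch of the multiple-source cross-validation fold structure, assuming a simulated data set with a mild per-source distribution shift: each fold holds out all data from one source, mimicking evaluation on an unseen source. scikit-learn's LeaveOneGroupOut provides exactly this split; the data, model and source labels here are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
n, d, n_sources = 300, 5, 6
sources = rng.integers(0, n_sources, size=n)     # which source each example came from

# Simulate a mild per-source shift so that sources differ in distribution
shift = rng.standard_normal((n_sources, d))
X = rng.standard_normal((n, d)) + shift[sources]
y = (X[:, 0] + 0.5 * rng.standard_normal(n) > 0).astype(int)

# Each fold trains on all but one source and tests on the held-out source
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         groups=sources, cv=LeaveOneGroupOut())
print("per-source held-out accuracy:", np.round(scores, 3))
```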
