
PAC-Bayesian aggregation and multi-armed bandits

Audibert, Jean-Yves, 14 October 2010
This habilitation thesis presents several contributions to (1) the PAC-Bayesian analysis of statistical learning, (2) the three aggregation problems: given d functions, how to predict as well as (i) the best of these d functions (model selection type aggregation), (ii) the best convex combination of these d functions, and (iii) the best linear combination of these d functions, and (3) multi-armed bandit problems.
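As a concrete illustration of model selection type aggregation (my own sketch, not drawn from the thesis), the following implements exponentially weighted aggregation over d fixed predictors, a standard construction for predicting nearly as well as the best of the d functions; the squared loss and the learning rate eta are assumptions made for the example.

```python
import numpy as np

def exponentially_weighted_aggregate(predictions, targets, eta=1.0):
    """Online aggregation over d fixed predictors.

    predictions : array of shape (T, d), prediction of each of the d functions per round
    targets     : array of shape (T,), observed outcomes
    Returns the aggregated predictions and the final normalized weights.
    """
    T, d = predictions.shape
    log_w = np.zeros(d)                                  # log-weights; uniform prior over the d functions
    aggregated = np.empty(T)
    for t in range(T):
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        aggregated[t] = w @ predictions[t]               # predict with the current convex combination
        log_w -= eta * (predictions[t] - targets[t]) ** 2  # exponentially downweight poor predictors
    w = np.exp(log_w - log_w.max())
    return aggregated, w / w.sum()

# Toy usage: three constant predictors; the weights concentrate on the one closest to the truth.
rng = np.random.default_rng(0)
y = 0.5 + 0.1 * rng.standard_normal(200)
preds = np.column_stack([np.full(200, 0.0), np.full(200, 0.5), np.full(200, 1.0)])
agg, weights = exponentially_weighted_aggregate(preds, y, eta=2.0)
print(weights)
```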

Stochastic Stepwise Ensembles for Variable Selection

Xin, Lu, 30 April 2009
Ensemble methods such as AdaBoost, Bagging and Random Forest have attracted much attention in the statistical learning community in the last 15 years. Zhu and Chipman (2006) proposed the idea of using ensembles for variable selection. Their implementation used a parallel genetic algorithm (PGA). In this thesis, I propose a stochastic stepwise ensemble for variable selection, which improves upon PGA. Traditional stepwise regression (Efroymson 1960) combines forward and backward selection: one step of forward selection is followed by one step of backward selection. In the forward step, each variable not already included is added to the current model, one at a time, and the one that best improves the objective function is retained. In the backward step, each variable already included is deleted from the current model, one at a time, and the one that best improves the objective function is discarded. The algorithm continues until no improvement can be made by either the forward or the backward step. Instead of adding or deleting one variable at a time, the Stochastic Stepwise Algorithm (STST) adds or deletes a group of variables at a time, where the group size is decided at random. In traditional stepwise regression the group size is one and every candidate variable is assessed; when the group size is larger than one, as is often the case for STST, the total number of variable groups can be quite large. Instead of evaluating all possible groups, only a few randomly selected groups are assessed and the best one is chosen. From a methodological point of view, the improvement of the STST ensemble over PGA comes from a more structured way of constructing the ensemble, which allows better control over the strength-diversity tradeoff established by Breiman (2001); PGA has no mechanism to control this fundamental tradeoff. Empirically, the improvement is most prominent when a true variable in the model has a relatively small coefficient (relative to the other true variables): I show empirically that PGA has a much higher probability of missing that variable.
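The algorithmic description above lends itself to a short sketch. The following is a minimal, assumed implementation of one stochastic stepwise run, not the thesis code: the objective is taken to be the BIC of an ordinary least-squares fit, and the number of random candidate groups per step (n_groups) is an illustrative choice; an ensemble would repeat such runs and rank variables by how often they are selected.

```python
import numpy as np

def bic(X, y, subset):
    """BIC of an OLS fit on the given column subset (with intercept)."""
    n, k = len(y), len(subset)
    cols = [np.ones(n)] + ([X[:, sorted(subset)]] if k else [])
    design = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return n * np.log(resid @ resid / n) + np.log(n) * (k + 1)

def stochastic_stepwise(X, y, n_groups=20, max_iter=50, seed=None):
    """One stochastic stepwise run: at each step, add (forward) or delete (backward)
    a randomly sized group of variables, assessing only a few random candidate groups
    and keeping the best one if it improves the objective."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    current, best = set(), bic(X, y, set())
    for _ in range(max_iter):
        improved = False
        for move in ("forward", "backward"):
            pool = list(set(range(p)) - current) if move == "forward" else list(current)
            if not pool:
                continue
            size = int(rng.integers(1, len(pool) + 1))           # random group size
            groups = [set(rng.choice(pool, size=size, replace=False).tolist())
                      for _ in range(n_groups)]                   # a few random candidate groups
            trials = [(current | g) if move == "forward" else (current - g) for g in groups]
            scores = [bic(X, y, s) for s in trials]
            if min(scores) < best - 1e-10:                        # keep the best group only if it helps
                best = min(scores)
                current = trials[int(np.argmin(scores))]
                improved = True
        if not improved:
            break
    return sorted(current)

# Ensemble usage sketch: run many times and rank variables by selection frequency, e.g.
#   counts = np.zeros(p)
#   for r in range(B): counts[stochastic_stepwise(X, y, seed=r)] += 1
```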

Fundamental Limitations of Semi-Supervised Learning

Lu, Tyler (Tian), 30 April 2009
The emergence of a new paradigm in machine learning known as semi-supervised learning (SSL) has brought benefits to many applications where labeled data is expensive to obtain. However, unlike supervised learning (SL), which enjoys a rich and deep theoretical foundation, semi-supervised learning, which uses additional unlabeled data for training, remains a theoretical mystery lacking a sound fundamental understanding. The purpose of this thesis is to take a first step towards bridging this theory-practice gap. We focus on investigating the inherent limitations of the benefits SSL can provide over SL. We develop a framework under which one can analyze the potential benefits, as measured by the sample complexity of SSL. Our framework is utopian in the sense that an SSL algorithm trains on a labeled sample and an unlabeled distribution, as opposed to an unlabeled sample as in the usual SSL model. Thus, any lower bound on the sample complexity of SSL in this model implies lower bounds in the usual model. Roughly, our conclusion is that unless the learner is absolutely certain there is some non-trivial relationship between labels and the unlabeled distribution (an "SSL type assumption"), SSL cannot provide significant advantages over SL. Technically speaking, we show that the sample complexity of SSL is no more than a constant factor better than that of SL for any unlabeled distribution, under a no-prior-knowledge setting (i.e., without SSL type assumptions). We prove that for the class of thresholds in the realizable setting the sample complexity of SL is at most twice that of SSL. We also prove that in the agnostic setting, for the classes of thresholds and unions of intervals, the sample complexity of SL is at most a constant factor larger than that of SSL. We conjecture this to be a general phenomenon applying to any hypothesis class. We also discuss issues regarding SSL type assumptions, and in particular the popular cluster assumption. We give examples showing that even in the most accommodating circumstances, learning under the cluster assumption can be hazardous and lead to prediction performance much worse than simply ignoring the unlabeled data and doing supervised learning. We conclude with a look into future research directions that build on our investigation.
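To make the threshold-class claim concrete, here is a small simulation of my own (not from the thesis, and only loosely analogous to its formal setting): in the realizable case with a uniform unlabeled distribution on [0, 1], an SL-style learner that places its threshold at an edge of the version space is compared with a distribution-aware learner that bisects the version space in probability mass; their expected errors differ by roughly a small constant factor, in the spirit of the quoted result.

```python
import numpy as np

rng = np.random.default_rng(1)

def expected_errors(n, t_star=0.37, reps=5000):
    """Realizable threshold learning under Uniform[0, 1]; the error equals the
    probability mass of the disagreement region between the learned and true threshold."""
    err_sl = err_ssl = 0.0
    for _ in range(reps):
        x = rng.random(n)
        y = x >= t_star
        left = x[~y].max(initial=0.0)      # rightmost negative example
        right = x[y].min(initial=1.0)      # leftmost positive example
        t_sl = left                        # consistent learner that ignores the distribution
        t_ssl = 0.5 * (left + right)       # learner that bisects the version space by mass
        err_sl += abs(t_sl - t_star)
        err_ssl += abs(t_ssl - t_star)
    return err_sl / reps, err_ssl / reps

for n in (10, 30, 100):
    e_sl, e_ssl = expected_errors(n)
    print(n, round(e_sl, 4), round(e_ssl, 4), round(e_sl / e_ssl, 2))
```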

Reliability and Risk Assessment of Networked Urban Infrastructure Systems under Natural Hazards

Rokneddin, Keivan, 16 September 2013
Modern societies increasingly depend on the reliable functioning of urban infrastructure systems in the aftermath of natural disasters such as hurricane and earthquake events. Apart from sizable capital for maintenance and expansion, the reliable performance of infrastructure systems under extreme hazards also requires strategic planning and effective resource assignment. Hence, efficient system reliability and risk assessment methods are needed to provide insights to system stakeholders, so that they understand infrastructure performance under different hazard scenarios and make informed decisions in response to them. Moreover, efficient assignment of limited financial and human resources for maintenance and retrofit actions requires new methods to identify critical system components under extreme events. Infrastructure systems such as highway bridge networks are spatially distributed systems with many linked components; network models describing them as mathematical graphs with nodes and links therefore naturally apply to studying their performance. Owing to their complex topology, general system reliability methods are ineffective for evaluating the reliability of large infrastructure systems. This research develops computationally efficient methods, such as a modified Markov Chain Monte Carlo simulation algorithm for network reliability, and proposes a network reliability framework (BRAN: Bridge Reliability Assessment in Networks) that is applicable to large and complex highway bridge systems. Since the responses of system components to hazard scenario events are often correlated, the BRAN framework accounts for correlated component failure probabilities stemming from different correlation sources. Failure correlations from non-hazard sources are particularly emphasized, as they can have a significant impact on network reliability estimates, yet they have often been ignored or only partially considered in the literature on infrastructure system reliability. The developed network reliability framework is also used for probabilistic risk assessment, with network reliability as the network performance metric. Risk analysis studies may require a prohibitively large number of simulations for large and complex infrastructure systems, as they involve evaluating network reliability for multiple hazard scenarios. This thesis addresses this challenge by developing network surrogate models with statistical learning tools such as random forests. The surrogate models can replace network reliability simulations in a risk analysis framework and significantly reduce computation times. The proposed approach therefore provides an alternative to established methods for enhancing the computational efficiency of risk assessments: rather than reducing the number of analyzed hazard scenarios through hazard-consistent scenario generation or importance sampling, it builds a surrogate model of the complex system at hand. Nevertheless, the application of surrogate models can be combined with scenario reduction methods to improve analysis efficiency even further. To address the problem of prioritizing system components for maintenance and retrofit actions, two advanced metrics are developed in this research to rank the criticality of system components.
Both developed metrics combine system component fragilities with the topological characteristics of the network, and provide rankings which are either conditioned on specific hazard scenarios or probabilistic, based on the preference of infrastructure system stakeholders. Nevertheless, they both offer enhanced efficiency and practical applicability compared to the existing methods. The developed frameworks for network reliability evaluation, risk assessment, and component prioritization are intended to address important gaps in the state-of-the-art management and planning for infrastructure systems under natural hazards. Their application can enhance public safety by informing the decision making process for expansion, maintenance, and retrofit actions for infrastructure systems.
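As a rough illustration of the surrogate idea (an assumed setup of my own, not the thesis's BRAN implementation): fit a random forest to map hazard scenario features to the network reliability returned by a limited number of expensive simulations, then query the forest in place of the simulator inside the risk loop. The feature set and the stand-in reliability function below are invented for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Placeholder for the expensive network reliability simulation (e.g., Monte Carlo over
# correlated bridge failure states); a smooth synthetic function of scenario features stands in.
def network_reliability(scenario):
    magnitude, distance, correlation = scenario
    return float(np.clip(1.0 - 0.15 * magnitude + 0.02 * distance - 0.1 * correlation, 0.0, 1.0))

# A modest training set of hazard scenarios: (magnitude, source-to-site distance, correlation level).
scenarios = np.column_stack([rng.uniform(5.0, 8.0, 300),
                             rng.uniform(0.0, 50.0, 300),
                             rng.uniform(0.0, 1.0, 300)])
reliability = np.array([network_reliability(s) for s in scenarios])

surrogate = RandomForestRegressor(n_estimators=200, random_state=0)
surrogate.fit(scenarios, reliability)

# Risk assessment over many scenarios now queries the cheap surrogate instead of the simulator.
many_scenarios = np.column_stack([rng.uniform(5.0, 8.0, 10000),
                                  rng.uniform(0.0, 50.0, 10000),
                                  rng.uniform(0.0, 1.0, 10000)])
expected_reliability = surrogate.predict(many_scenarios).mean()
print(round(expected_reliability, 3))
```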

Secession and Survival: Nations, States and Violent Conflict

Siroky, David S., January 2009
Secession is a watershed event not only for the new state that is created and the old state that is dissolved, but also for neighboring states, proximate ethno-political groups and major powers. This project examines the problem of violent secessionist conflict and addresses an important debate at the intersection of comparative and international politics about the conditions under which secession is a peaceful solution to ethnic conflict. It demonstrates that secession is rarely a solution to ethnic conflict, does not assure the protection of remaining minorities and produces new forms of violence. To explain why some secessions produce peace, while others generate violence, the project develops a theoretical model of the conditions that produce internally coherent, stable and peaceful post-secessionist states rather than recursive secession (i.e., secession from a new secessionist state) or interstate disputes between the rump and secessionist state. Theoretically, the analysis reveals a curvilinear relationship between ethno-territorial heterogeneity and conflict, explains disparate findings in the literature on ethnic conflict and conclusively links ethnic structure and violence. The project also contributes to the literature on secessionist violence, and civil war more generally, by linking intrastate and interstate causes, showing that what is frequently thought of as a domestic phenomenon is in fact mostly a phenomenon of international politics. Drawing upon original data, methodological advances at the interface of statistics, computer science and probability theory, and qualitative methods such as elite interviews and archival research, the project offers a comprehensive, comparative and contextual treatment of secession and violence.

Model-based Learning: t-Families, Variable Selection, and Parameter Estimation

Andrews, Jeffrey Lambert, 27 August 2012
The phrase model-based learning describes the use of mixture models in machine learning problems. This thesis focuses on a number of issues surrounding the use of mixture models in statistical learning tasks, including clustering, classification, discriminant analysis, variable selection, and parameter estimation. After motivating the importance of statistical learning via mixture models, five papers are presented. For ease of consumption, the papers are organized into three parts: mixtures of multivariate t-families, variable selection, and parameter estimation. This work was supported by the Natural Sciences and Engineering Research Council of Canada through a doctoral postgraduate scholarship.
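For readers unfamiliar with the term, here is a minimal sketch of model-based clustering using a Gaussian mixture (the thesis develops the more robust multivariate t-families, which this illustration does not implement; the synthetic data and settings are assumptions made for the example).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Two synthetic groups; in model-based learning each group corresponds to one mixture component.
data = np.vstack([rng.normal([0, 0], 1.0, size=(150, 2)),
                  rng.normal([4, 4], 1.0, size=(150, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
labels = gmm.fit_predict(data)   # clustering = maximum a posteriori component membership
print(np.bincount(labels))       # roughly 150 observations per component
```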

Inference Of Switching Networks By Using A Piecewise Linear Formulation

Akcay, Didem, 01 December 2005
Inference of regulatory networks has received attention from researchers in many fields. The challenge this problem poses is that it is a typical modeling problem under insufficient information about the process; hence, we need to derive the a priori unavailable information from the empirical observations. Modeling by inference consists of selecting or defining the most appropriate model structure and inferring the parameters. An appropriate model structure should have the following properties. The model parameters should be inferable: given the observations and the model class, all parameters used in the model should have a unique solution (restriction of the solution space). The forward model should be accurately computable (restriction of the solution space). The model should be capable of exhibiting the essential qualitative features of the system (limit of the restriction). The model should be relevant to the process (limit of the restriction). A piecewise linear formulation, described by a switching state transition matrix and a switching state transition vector, with a Boolean function indicating the switching conditions, is proposed for the inference of gene regulatory networks. This thesis mainly concerns using a formulation of switching networks obeying all the above requirements and developing an inference algorithm for estimating the parameters of the formulation. The methodologies used or developed during this study are applicable to various fields of science and engineering.
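The proposed formulation is easiest to see as a state update rule. Below is a minimal sketch in my own notation (assumed for illustration, not taken from the thesis): the next state is x_{t+1} = A_s x_t + b_s, where the active regime s is selected by a Boolean function of the current state.

```python
import numpy as np

def simulate_switching_network(A, b, switch, x0, steps):
    """Simulate a piecewise linear switching system:
    x_{t+1} = A[s] @ x_t + b[s], with regime s = switch(x_t) given by a Boolean condition."""
    x = np.asarray(x0, dtype=float)
    trajectory = [x]
    for _ in range(steps):
        s = switch(x)            # Boolean switching condition picks the active regime
        x = A[s] @ x + b[s]      # regime-specific state transition matrix and vector
        trajectory.append(x)
    return np.array(trajectory)

# Two-gene toy example: regime 1 activates gene 2 once gene 1 exceeds a threshold.
A = {0: np.array([[0.9, 0.0], [0.0, 0.5]]),
     1: np.array([[0.9, 0.0], [0.4, 0.5]])}
b = {0: np.array([0.1, 0.0]),
     1: np.array([0.1, 0.2])}
switch = lambda x: int(x[0] > 0.5)

traj = simulate_switching_network(A, b, switch, x0=[0.0, 0.0], steps=20)
print(traj[-1])
```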

Mathematical Theories of Interaction with Oracles

Yang, Liu, 01 October 2013
No description available.
