581 |
Stochastic Stepwise Ensembles for Variable Selection. Xin, Lu. 30 April 2009 (has links)
Ensemble methods such as AdaBoost, Bagging, and Random Forests have attracted much attention in the statistical learning community over the last 15 years. Zhu and Chipman (2006) proposed the idea of using ensembles for variable selection; their implementation used a parallel genetic algorithm (PGA). In this thesis, I propose a stochastic stepwise ensemble for variable selection, which improves upon PGA.
Traditional stepwise regression (Efroymson 1960) combines forward and backward selection: one step of forward selection is followed by one step of backward selection. In the forward step, each variable not already included is added to the current model, one at a time, and the one that most improves the objective function is retained. In the backward step, each variable already included is deleted from the current model, one at a time, and the one whose removal most improves the objective function is discarded. The algorithm continues until no improvement can be made by either the forward or the backward step.
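Since this procedure is the building block for what follows, a minimal sketch may help; the choice of BIC as the objective and the use of statsmodels are illustrative assumptions, not details from the thesis.

```python
import numpy as np
import statsmodels.api as sm

def bic(X, y, cols):
    """BIC of an OLS fit with intercept on the chosen feature columns
    (lower is better). BIC is an assumed stand-in for the objective."""
    design = np.ones((len(y), 1))
    if cols:
        design = sm.add_constant(X[:, sorted(cols)], has_constant="add")
    return sm.OLS(y, design).fit().bic

def stepwise(X, y):
    """Efroymson-style stepwise selection: alternate single-variable
    forward and backward steps until neither improves the objective."""
    selected = set()
    best = bic(X, y, selected)
    improved = True
    while improved:
        improved = False
        # Forward step: try adding each excluded variable; keep the best.
        add = None
        for j in set(range(X.shape[1])) - selected:
            score = bic(X, y, selected | {j})
            if score < best:
                best, add = score, j
        if add is not None:
            selected.add(add)
            improved = True
        # Backward step: try deleting each included variable; keep the best.
        drop = None
        for j in selected:
            score = bic(X, y, selected - {j})
            if score < best:
                best, drop = score, j
        if drop is not None:
            selected.discard(drop)
            improved = True
    return sorted(selected)
```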
Instead of adding or deleting one variable at a time, the Stochastic Stepwise Algorithm (STST) adds or deletes a group of variables at a time, where the group size is randomly decided. In traditional stepwise regression, the group size is one and every candidate variable is assessed. When the group size is larger than one, as is often the case for STST, the total number of possible variable groups can be quite large; instead of evaluating all of them, only a few randomly selected groups are assessed and the best one is chosen.
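The abstract does not specify how the group size or the number of candidate groups is drawn, so both are placeholders in this sketch of one stochastic forward step (the bic helper above is reused):

```python
import random

def stst_forward(X, y, selected, best, rng, n_groups=20):
    """One stochastic forward step of STST: draw a random group size,
    score a few randomly sampled groups of that size, and add the best
    group if it improves the objective."""
    excluded = list(set(range(X.shape[1])) - selected)
    if not excluded:
        return selected, best
    size = rng.randint(1, len(excluded))   # placeholder size distribution
    best_group = None
    for _ in range(n_groups):              # assess only a few random groups
        group = set(rng.sample(excluded, size))
        score = bic(X, y, selected | group)
        if score < best:
            best, best_group = score, group
    if best_group is not None:
        selected = selected | best_group
    return selected, best

# The backward step mirrors this with groups drawn from `selected`; an
# ensemble reruns the whole algorithm with different seeds, e.g.
# rng = random.Random(seed), and aggregates selection frequencies.
```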
From a methodological point of view, the improvement of the STST ensemble over PGA comes from a more structured way of constructing the ensemble, which gives us better control over the strength-diversity tradeoff established by Breiman (2001); PGA has no mechanism to control this fundamental tradeoff. Empirically, the improvement is most prominent when a true variable in the model has a relatively small coefficient (relative to the other true variables); I show empirically that PGA has a much higher probability of missing such a variable.
|
582 |
Framework for Calibration of a Traffic State Space Model. Sandin, Mats; Fransson, Magnus. January 2012 (has links)
To evaluate the traffic state over time and space, several models can be used. A typical model for estimating the state of the traffic on a stretch of road or a road network is the cell transmission model, a form of state space model. This kind of model typically needs to be calibrated, since different roads have different properties. This thesis presents a calibration framework for the velocity-based cell transmission model, the CTM-v.

The cell transmission model for velocity is a discrete-time dynamical system that models the evolution of the velocity field on highways. Such a model can be fused with an ensemble Kalman filter update algorithm for the purpose of velocity data assimilation; indeed, enabling velocity data assimilation was the purpose for developing the model in the first place, and it is an essential part of the Mobile Millennium research project. The output produced by this system is highly dependent on the values of its characterising parameters, which must be calibrated so as to make the model a valid representation of reality. Model calibration and validation is a process of its own, most often tailored to the researcher's models and purposes, so a systematic methodology for calibrating the cell transmission model is needed.

The framework consists of two separate methods. One is a statistical approach to calibration of the fundamental diagram. The other is a black-box optimization method, a simplification of the complex method, which can solve inequality-constrained optimization problems with non-differentiable objective functions. Both methods are integrated with the existing system, yielding a calibration framework for highways where stationary detectors are part of the infrastructure. The combination of the two methods is tested in a suite of experiments on two separate highway models, of Interstates 880 and 15, CA, which are evaluated against travel time and space mean speed estimates given by Bluetooth detectors, with errors between 7.4 and 13.4% for the validation time periods, depending on the parameter set and model.
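The abstract names a simplification of the complex method without detailing it; to fix ideas, here is a generic Box-style sketch handling only simple bound constraints, not the thesis's variant:

```python
import numpy as np

def complex_method(f, lo, hi, alpha=1.3, iters=200, seed=0):
    """Box-style complex method for box-constrained minimization of a
    possibly non-differentiable objective: repeatedly reflect the worst
    point of a random 'complex' of points through the centroid of the rest."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    n_pts = 2 * lo.size                                  # common size heuristic
    pts = lo + rng.random((n_pts, lo.size)) * (hi - lo)  # random initial complex
    vals = np.array([f(p) for p in pts])
    for _ in range(iters):
        w = int(np.argmax(vals))                         # worst point
        centroid = (pts.sum(axis=0) - pts[w]) / (n_pts - 1)
        trial = np.clip(centroid + alpha * (centroid - pts[w]), lo, hi)
        while f(trial) >= vals[w] and not np.allclose(trial, centroid):
            trial = (trial + centroid) / 2               # retreat halfway
        pts[w], vals[w] = trial, f(trial)
    return pts[int(np.argmin(vals))]
```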
|
583 |
Cooperative Training in Multiple Classifier Systems. Dara, Rozita Alaleh. January 2007 (has links)
Multiple classifier systems have been shown to be an effective technique for classification. The success of multiple classifiers does not depend entirely on the base classifiers and/or the aggregation technique. Other parameters, such as training data, feature attributes, and correlation among the base classifiers, may also contribute to their success, and the interaction of these parameters with each other may have an impact on multiple classifier performance. In the present study, we examine some of these interactions and investigate their effects on the performance of classifier ensembles.

The proposed research introduces a different direction in the field of multiple classifier systems: we attempt to understand and compare ensemble methods from the cooperation perspective. In this thesis, we narrow our focus to cooperation at the training level. We first develop measures to estimate the degree and type of cooperation among training data partitions; these evaluation measures enable us to evaluate the diversity and correlation among a set of disjoint and overlapped partitions. With the aid of properly selected measures and training information, we propose two new data partitioning approaches: Cluster, De-cluster, and Selection (CDS) and Cooperative Cluster, De-cluster, and Selection (CO-CDS). Finally, a comprehensive comparative study is conducted in which we compare the proposed training approaches with several others in terms of robustness of usage, resultant classification accuracy, and classification stability.

Experimental assessment of the CDS and CO-CDS training approaches validates their robustness compared to other training approaches. In addition, this study suggests that: 1) cooperation is generally beneficial, and 2) classifier ensembles that cooperate through sharing information have higher generalization ability than those that do not share training information.
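The measures themselves are not defined in the abstract; as a purely illustrative stand-in, the sketch below scores a set of training partitions by their average pairwise Jaccard overlap, where low overlap indicates diversity and high overlap indicates shared (cooperative) information.

```python
from itertools import combinations

def mean_pairwise_jaccard(partitions):
    """Average Jaccard overlap between all pairs of index partitions:
    0 means fully disjoint partitions, 1 means identical partitions."""
    sets = [set(p) for p in partitions]
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# Example: three overlapped training partitions of a 10-sample dataset.
print(mean_pairwise_jaccard([range(0, 6), range(3, 9), range(5, 10)]))
```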
|
585 |
Musik für Holzinstrumente. Drude, Matthias. 19 November 2012 (has links) (PDF)
Score of a chamber music work by Matthias Drude. The work was composed in 2010 for oboe, clarinet, bassoon, marimba, and string quintet.
|
586 |
Develop Microchip with Gold Nanoelectrode Ensemble Electrodes for Electrochemical Detection of Verapamil. Chuang, Jui-Fen. 11 August 2011 (has links)
Verapamil is a commonly used medicine for the treatment of supraventricular arrhythmias, angina, and hypertension. Recently, some newly developed applications of Verapamil, such as treating hypomania and use in cancer chemotherapy, have been reported. Thus, accurately monitoring the concentration of Verapamil is very important. The major clinical methods for determining Verapamil concentration are high performance liquid chromatography (HPLC) with UV or fluorescence detection. However, these methods have disadvantages such as expensive instrumentation, complex operation, and time-consuming procedures.
The chemical structure and properties of Verapamil are very stable, and preliminary electrochemical analysis doesn't show any electrochemical activity. In this study, we developed an innovative ozone pre-treatment method to oxidize Verapamil into smaller molecules and change its structure; after ozone pre-treatment, Verapamil shows excellent electrochemical activity. Spectroscopy and mass spectrometry confirm the changes in Verapamil's structure, and the products of the ozone treatment are also predicted by mass spectrometry.
Gold nanoelectrode ensemble electrodes (GNEE) are used as the working electrode for their good catalytic activity in electrochemical reactions, high sensitivity, and high selectivity. The overall experimental framework of this study is a microchip with a GNEE working electrode, operated with cyclic voltammetry as the electrochemical analytical method. Compared with traditional analytical methods, the system has advantages such as small size, micro sample volume, easy operation, rapid detection, and low cost.
The lowest Verapamil concentration the system can detect stably is 10 ng/mL, and a linear dynamic range with a high correlation factor from 10 ng/mL to 100 μg/mL was obtained. For serum samples, Verapamil shows excellent electrochemical activity at 1 ng/mL, and a linear dynamic range with a high correlation factor from 1 ng/mL to 100 μg/mL was obtained. These results indicate that the system is feasible for practical clinical analysis of Verapamil concentration.
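For illustration only, with hypothetical peak-current data, a linear dynamic range spanning several orders of magnitude is typically characterized by regressing the response on log-transformed concentration:

```python
import numpy as np

# Hypothetical calibration data: concentrations in ng/mL and measured
# peak currents (arbitrary units); all values are illustrative only.
conc = np.array([10, 100, 1e3, 1e4, 1e5])        # 10 ng/mL .. 100 ug/mL
current = np.array([0.8, 1.9, 3.1, 4.0, 5.2])

slope, intercept = np.polyfit(np.log10(conc), current, 1)
r = np.corrcoef(np.log10(conc), current)[0, 1]   # correlation factor
print(f"i = {slope:.2f} log10(c) + {intercept:.2f}, r = {r:.3f}")
```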
|
587 |
Ensemble Statistics and Error Covariance of a Rapidly Intensifying Hurricane. Rigney, Matthew C. 16 January 2010 (has links)
This thesis presents an investigation of ensemble Gaussianity, the effect of non-Gaussianity on covariance structures, storm-centered data assimilation techniques, and the relationship between commonly used data assimilation variables and the underlying dynamics for the case of Hurricane Humberto. Using an Ensemble Kalman Filter (EnKF), a comparison of data assimilation results in storm-centered and Eulerian coordinate systems is made. In addition, the extent of the non-Gaussianity of the model ensemble is investigated and quantified, and the effect of this non-Gaussianity on covariance structures, which play an integral role in the EnKF data assimilation scheme, is explored. Finally, the correlation structures calculated from a Weather Research and Forecasting (WRF) ensemble forecast of several state variables are investigated in order to better understand the dynamics of this rapidly intensifying cyclone.
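For reference, a textbook stochastic EnKF analysis step is sketched below; localization, inflation, and the WRF state vector used in the thesis are omitted, and the linear observation operator is an assumption.

```python
import numpy as np

def enkf_update(X, y, H, R, rng):
    """Stochastic EnKF analysis step. X: forecast ensemble, shape
    (n_state, n_ens); y: observation vector; H: linear observation
    operator; R: observation-error covariance."""
    n_ens = X.shape[1]
    A = X - X.mean(axis=1, keepdims=True)            # ensemble anomalies
    P = A @ A.T / (n_ens - 1)                        # sample forecast covariance
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)     # Kalman gain
    Y = y[:, None] + rng.multivariate_normal(        # perturbed observations
        np.zeros(len(y)), R, size=n_ens).T
    return X + K @ (Y - H @ X)                       # analysis ensemble
```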
Hurricane Humberto rapidly intensified in the northwestern Gulf of Mexico from a tropical disturbance to a strong category one hurricane with 90 mph winds in 24 hours. Numerical models did not capture the intensification of Humberto well, likely due in large part to initial condition error, which data assimilation schemes can address. Because the EnKF is a linear theory developed under the assumption that the ensemble distribution is normal, non-Gaussianity in the ensemble distribution could affect the EnKF update. It is shown, through an inspection of statistical moments, that multiple state variables do indeed exhibit significant non-Gaussianity.
In addition, storm-centered data assimilation schemes present an alternative to traditional Eulerian schemes by emphasizing the centrality of the cyclone to the assimilation window. This allows for an update that is most effective in the vicinity of the storm center, which is of most concern in mesoscale events such as Humberto.
Finally, the effect of non-Gaussian distributions on covariance structures is examined through transformations of normal distributions. Various standard transformations of two Gaussian distributions are made, and the skewness, kurtosis, and correlation between the two distributions are measured before and after each transformation. A relationship is observed between changes in skewness and kurtosis and the correlation between the distributions. These effects are then taken into consideration as the dynamics contributing to the rapid intensification of Humberto are explored through correlation structures.
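As an illustrative recreation (the thesis's exact transformations are not listed in the abstract), one can draw a correlated Gaussian pair, transform one margin, and compare moments:

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
z = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=5000)
x, y = z[:, 0], z[:, 1]

# Apply standard nonlinear transformations to one variable and compare
# skewness, kurtosis, and correlation before and after.
for name, t in [("identity", y), ("exp", np.exp(y)), ("square", y**2)]:
    print(f"{name:8s} skew={skew(t):+.2f} kurt={kurtosis(t):+.2f} "
          f"corr={np.corrcoef(x, t)[0, 1]:+.2f}")
```

Note how the squaring transformation, for instance, nearly destroys the linear correlation of a mean-zero Gaussian pair while sharply increasing skewness and kurtosis.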
|
588 |
Upscaling methods for multi-phase flow and transport in heterogeneous porous media. Li, Yan. December 2009 (has links)
In this dissertation we discuss upscaling methods for flow and transport in heterogeneous reservoirs. We study realization-based multi-phase flow and transport upscaling and ensemble-level flow upscaling. Multi-phase upscaling is more accurate than single-phase upscaling and is often required for high levels of coarsening. In multi-phase upscaling, the upscaled transport parameters are time-dependent functions and are challenging to compute: due to the hyperbolic nature of the saturation equation, the nonlocal effects evolve in both space and time, and standard local two-phase upscaling gives significantly biased results relative to fine-scale solutions. In this work, we propose two types of multi-phase upscaling methods: time-of-flight (TOF)-based two-phase upscaling and local-global two-phase upscaling. Both incorporate global flow information into local two-phase upscaling calculations, using, respectively, a linear function of time and time-of-flight and a (time-dependent) global coarse-scale two-phase solution. The local boundary condition therefore captures the global flow effects both spatially and temporally. The two methods are applied to permeability distributions with various correlation lengths; numerical results show that they consistently improve on existing two-phase upscaling methods and provide accurate coarse-scale solutions for both flow and transport.
We also study ensemble-level flow upscaling, i.e., upscaling over multiple geological realizations, which is often required for uncertainty quantification. Solving the flow problem for every realization is time-consuming; in recent years, stochastic procedures have been combined with upscaling methods to efficiently compute the upscaled coefficients for a large set of realizations. We propose a fast perturbation approach to ensemble-level upscaling: based on the Karhunen-Loève expansion (KLE), we introduce a correction scheme that rapidly computes the upscaled permeability for each realization, and we couple sparse grid collocation and adaptive clustering with this correction scheme. When solving the local problem, the solution can be represented as a product of a Green's function and a source term. Using the collocation and clustering techniques, one can avoid computing the Green's function for every realization: we compute it at the interpolation nodes only, and for any other realization the Green's function is obtained by interpolation. These techniques allow us to compute the upscaled permeability rapidly for all realizations in the stochastic space.
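For readers unfamiliar with the KLE, the standard truncated expansion reads as follows; the notation is generic and assumed rather than quoted from the dissertation.

```latex
% Truncated Karhunen-Loève expansion of the log-permeability field, with
% (\lambda_k, \phi_k) the eigenpairs of the covariance kernel C on domain D
% and \theta_k i.i.d. standard normal random variables; sparse grid
% collocation then places its interpolation nodes in the \theta-space.
\log K(\mathbf{x},\omega)
  = \overline{Y}(\mathbf{x})
  + \sum_{k=1}^{N} \sqrt{\lambda_k}\,\theta_k(\omega)\,\phi_k(\mathbf{x}),
\qquad
\int_D C(\mathbf{x},\mathbf{y})\,\phi_k(\mathbf{y})\,d\mathbf{y}
  = \lambda_k\,\phi_k(\mathbf{x}).
```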
|
589 |
The Bootstrap in Supervised Learning and its Applications in Genomics/Proteomics. Vu, Thang. May 2011 (has links)
The small-sample size issue is a prevalent problem in Genomics and Proteomics today. The bootstrap, a resampling method that aims at increasing the efficiency of data usage, is one effort to overcome the problem of limited sample size. This dissertation studies the application of the bootstrap to two problems of supervised learning with small-sample data: estimation of the misclassification error of Gaussian discriminant analysis, and the bagging ensemble classification method.
Estimating the misclassification error of discriminant analysis is a classical problem in pattern recognition with many important applications in biomedical research, and bootstrap error estimation has been shown empirically to be among the best estimation methods in terms of root mean squared (RMS) error. In the first part of this work, we conduct a detailed analytical study of bootstrap error estimation for the Linear Discriminant Analysis (LDA) classification rule under Gaussian populations. We derive exact formulas for the first and second moments of the zero bootstrap and convex bootstrap estimators, as well as their cross moments with the resubstitution estimator and the true error. Based on these results, we obtain exact formulas for the bias, the variance, and the RMS of the deviation from the true error of these bootstrap estimators, including the moments of the popular .632 bootstrap estimator. Moreover, we obtain the optimal weights for unbiased and minimum-RMS convex bootstrap estimators. In the univariate case, all the expressions involve Gaussian distributions, whereas in the multivariate case, the results are written in terms of bivariate doubly non-central F distributions.
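As background, here is a sketch of the estimators under study, using Efron's standard .632 weights and LDA via scikit-learn; the optimal convex weights derived in the dissertation itself are not reproduced here.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def bootstrap_632(X, y, B=100, seed=0):
    """Resubstitution, zero-bootstrap, and .632 error estimates for LDA.
    The zero bootstrap averages the error on left-out points over B
    bootstrap resamples; .632 is the fixed-weight convex combination."""
    rng = np.random.default_rng(seed)
    n = len(y)
    clf = LinearDiscriminantAnalysis().fit(X, y)
    resub = np.mean(clf.predict(X) != y)
    errs = []
    for _ in range(B):
        idx = rng.integers(0, n, n)                 # bootstrap sample
        out = np.setdiff1d(np.arange(n), idx)       # points left out
        if len(out) == 0 or len(np.unique(y[idx])) < 2:
            continue                                # skip degenerate resamples
        model = LinearDiscriminantAnalysis().fit(X[idx], y[idx])
        errs.append(np.mean(model.predict(X[out]) != y[out]))
    bs_zero = np.mean(errs)
    return resub, bs_zero, 0.368 * resub + 0.632 * bs_zero
```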
In the second part of this work, we conduct an extensive empirical investigation of bagging, an application of the bootstrap to ensemble classification. We investigate the performance of bagging in the classification of small-sample gene-expression data and protein-abundance mass spectrometry data, as well as the accuracy of small-sample error estimation with this ensemble classification rule. We observe that, under t-test and RELIEF filter-based feature selection, bagging generally does a good job of improving the performance of unstable, overfitting classifiers, such as CART decision trees and neural networks, but that the improvement is not sufficient to beat the performance of single stable, non-overfitting classifiers, such as diagonal and plain linear discriminant analysis, or 3-nearest neighbors (3NN); moreover, the ensemble method does not significantly improve the performance of these stable classifiers. We give an explicit definition of the out-of-bag estimator that is intended to remove estimator bias, by formulating carefully how the error count is normalized, and investigate the performance of error estimation for bagging of common classification rules, including LDA, 3NN, and CART, applied to both synthetic and real patient data, using common error estimators: resubstitution, leave-one-out, cross-validation, basic bootstrap, bootstrap 632, bootstrap 632 plus, bolstering, and semi-bolstering, in addition to the out-of-bag estimator. The numerical experiments indicate that the performance of the out-of-bag estimator is very similar to that of leave-one-out; in particular, the out-of-bag estimator is slightly pessimistically biased. The performance of the other estimators is consistent with their performance with the corresponding single classifiers, as reported in other studies. The results of this work are expected to provide helpful guidance to practitioners interested in applying the bootstrap in supervised learning applications.
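A sketch of the normalization point follows: the out-of-bag error below is computed only over samples that received at least one out-of-bag vote, with bagged CART trees via scikit-learn as an illustrative base classifier (the dissertation's exact formulation is not reproduced).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_error(X, y, B=100, seed=0):
    """Out-of-bag error for bagged trees, normalized only over samples
    that received at least one out-of-bag vote. Assumes integer class
    labels 0..K-1."""
    rng = np.random.default_rng(seed)
    n = len(y)
    votes = np.zeros((n, len(np.unique(y))))
    for _ in range(B):
        idx = rng.integers(0, n, n)                  # bootstrap sample
        oob = np.setdiff1d(np.arange(n), idx)        # out-of-bag samples
        tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
        for i, c in zip(oob, tree.predict(X[oob])):
            votes[i, int(c)] += 1
    covered = votes.sum(axis=1) > 0                  # samples with OOB votes
    pred = votes[covered].argmax(axis=1)             # OOB majority vote
    return np.mean(pred != y[covered])
```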
|
590 |
An Ensemble Approach for Text Categorization with Positive and Unlabeled Examples. Chen, Hsueh-Ching. 29 July 2005 (has links)
Text categorization is the process of assigning new documents to predefined categories on the basis of a classification model induced from a set of pre-categorized training documents. In a typical dichotomous classification scenario, the set of training documents includes both positive and negative examples; that is, each of the two categories is associated with training documents. However, in many real-world text categorization applications, positive and unlabeled documents are readily available, whereas the acquisition of negative documents is extremely expensive or even impossible. In this study, we propose and develop an ensemble approach, referred to as E2, to address the limitations of existing algorithms for learning from positive and unlabeled training documents. Using spam email filtering as the evaluation application, our empirical results suggest that the proposed E2 technique exhibits more stable and reliable performance than PNB and PEBL.
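The abstract does not describe E2's internals, so the sketch below is a generic bagging-style baseline for positive-unlabeled learning in the same spirit, not the proposed method: each round treats a random subsample of the unlabeled pool as provisional negatives, and each unlabeled document is scored only by classifiers that did not train on it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pu_bagging_scores(X_pos, X_unl, B=50, seed=0):
    """Generic PU-learning baseline (not the thesis's E2): average the
    positive-class scores each unlabeled point receives from rounds in
    which it was held out of the provisional-negative sample."""
    rng = np.random.default_rng(seed)
    n_u = len(X_unl)
    score_sum = np.zeros(n_u)
    score_cnt = np.zeros(n_u)
    for _ in range(B):
        idx = rng.choice(n_u, size=min(len(X_pos), n_u), replace=False)
        X = np.vstack([X_pos, X_unl[idx]])           # positives vs. pseudo-negatives
        y = np.r_[np.ones(len(X_pos)), np.zeros(len(idx))]
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        out = np.setdiff1d(np.arange(n_u), idx)      # held-out unlabeled points
        score_sum[out] += clf.predict_proba(X_unl[out])[:, 1]
        score_cnt[out] += 1
    return score_sum / np.maximum(score_cnt, 1)      # mean positive-class score
```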
|