• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • No language data
  • Tagged with
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Ensemble Learning With Imbalanced Data

Shoemaker, Larry 20 September 2010 (has links)
We describe an ensemble approach to learning salient spatial regions from arbitrarily partitioned simulation data. Ensemble approaches for anomaly detection are also explored. The partitioning comes from the distributed processing requirements of large-scale simulations. The volume of the data is such that classifiers can train only on data local to a given partition. Since the data partition reflects the needs of the simulation, the class statistics can vary from partition to partition. Some classes will likely be missing from some or even most partitions. We combine a fast ensemble learning algorithm with scaled probabilistic majority voting in order to learn an accurate classifier from such data. Since some simulations are difficult to model without a considerable number of false positive errors, and since we are essentially building a search engine for simulation data, we order predicted regions to increase the likelihood that most of the top-ranked predictions are correct (salient). Results from simulation runs of a canister being torn and from a casing being dropped show that regions of interest are successfully identified in spite of the class imbalance in the individual training sets. Lift curve analysis shows that the use of data driven ordering methods provides a statistically significant improvement over the use of the default, natural time step ordering. Significant time is saved for the end user by allowing an improved focus on areas of interest without the need to conventionally search all of the data. We have also found that using random forests weighted and distance-based outlier ensemble methods for supervised learning of anomaly detection provide significant accuracy improvements when compared to existing methods on the same dataset. Further, distance-based outlier and local outlier factor ensemble methods for unsupervised learning of anomaly detection also compare favorably to existing methods.

Page generated in 0.0985 seconds