21

Improving Hoeffding Trees

Kirkby, Richard Brendon January 2008 (has links)
Modern information technology allows information to be collected at a far greater rate than ever before. So fast, in fact, that the main problem is making sense of it all. Machine learning offers promise of a solution, but the field mainly focusses on achieving high accuracy when data supply is limited. While this has created sophisticated classification algorithms, many do not cope with increasing data set sizes. When the data set sizes get to a point where they could be considered to represent a continuous supply, or data stream, then incremental classification algorithms are required. In this setting, the effectiveness of an algorithm cannot simply be assessed by accuracy alone. Consideration needs to be given to the memory available to the algorithm and the speed at which data is processed in terms of both the time taken to predict the class of a new data sample and the time taken to include this sample in an incrementally updated classification model. The Hoeffding tree algorithm is a state-of-the-art method for inducing decision trees from data streams. The aim of this thesis is to improve this algorithm. To measure improvement, a comprehensive framework for evaluating the performance of data stream algorithms is developed. Within the framework memory size is fixed in order to simulate realistic application scenarios. In order to simulate continuous operation, classes of synthetic data are generated providing an evaluation on a large scale. Improvements to many aspects of the Hoeffding tree algorithm are demonstrated. First, a number of methods for handling continuous numeric features are compared. Second, tree prediction strategy is investigated to evaluate the utility of various methods. Finally, the possibility of improving accuracy using ensemble methods is explored. The experimental results provide meaningful comparisons of accuracy and processing speeds between different modifications of the Hoeffding tree algorithm under various memory limits. The study on numeric attributes demonstrates that sacrificing accuracy for space at the local level often results in improved global accuracy. The prediction strategy shown to perform best adaptively chooses between standard majority class and Naive Bayes prediction in the leaves. The ensemble method investigation shows that combining trees can be worthwhile, but only when sufficient memory is available, and improvement is less likely than in traditional machine learning. In particular, issues are encountered when applying the popular boosting method to streams.
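The adaptive prediction strategy described above can be illustrated with a small sketch: each leaf keeps a running count of how often a majority-class answer and a Naive Bayes answer would have been correct on the examples it has seen, and predicts with whichever has done better so far. The code below is an illustrative reconstruction for discrete features (with Laplace smoothing), not the thesis's actual implementation.

```python
# Hypothetical sketch of an adaptive Hoeffding-tree leaf: track how often the
# majority-class and Naive Bayes predictions would have been correct, and
# answer with the strategy that has performed better at this leaf.
from collections import defaultdict

class AdaptiveLeaf:
    def __init__(self):
        self.class_counts = defaultdict(int)                          # for majority-class prediction
        self.feature_counts = defaultdict(lambda: defaultdict(int))   # per-class (feature, value) counts
        self.mc_correct = 0                                           # running score: majority class
        self.nb_correct = 0                                           # running score: Naive Bayes

    def majority_class(self):
        return max(self.class_counts, key=self.class_counts.get) if self.class_counts else None

    def naive_bayes(self, x):
        best, best_score = None, float("-inf")
        total = sum(self.class_counts.values())
        for c, cc in self.class_counts.items():
            score = cc / total
            for i, v in enumerate(x):
                score *= (self.feature_counts[c][(i, v)] + 1) / (cc + 1)  # Laplace smoothing
            if score > best_score:
                best, best_score = c, score
        return best

    def learn(self, x, y):
        # Score both strategies *before* updating the statistics with (x, y).
        if self.class_counts:
            if self.majority_class() == y:
                self.mc_correct += 1
            if self.naive_bayes(x) == y:
                self.nb_correct += 1
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.feature_counts[y][(i, v)] += 1

    def predict(self, x):
        # Adaptive choice: use whichever strategy has been more accurate so far.
        return self.naive_bayes(x) if self.nb_correct > self.mc_correct else self.majority_class()
```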
22

Probabilistic Inequalities for the Cross-Validation Estimator in Statistical Learning, and Statistical Models Applied to Economics and Finance

Cornec, Matthieu 04 June 2009 (has links) (PDF)
The initial objective of the first part of this thesis is to shed theoretical light on a practice widely used by practitioners for the audit (or risk assessment) of predictive methods (predictors): cross-validation. The second part falls mainly within the theory of stochastic processes, and its contribution chiefly concerns applications to economic and financial data. Chapter 1 addresses the classical case of predictors of finite Vapnik-Chervonenkis dimension (VC dimension hereafter) obtained by empirical risk minimization. Chapter 2 then turns to a class of predictors broader than that of Chapter 1: stable estimators. In this setting, we show that cross-validation methods remain consistent. In Chapter 3, we exhibit an important special case, subagging, in which cross-validation yields narrower confidence intervals than the traditional methodology based on empirical risk minimization under the finite VC-dimension assumption. Chapter 4 proposes a monthly proxy for the growth rate of French Gross Domestic Product, which is officially available only at quarterly frequency. Chapter 5 describes the methodology for building a monthly composite indicator from the business surveys of the services sector in France; the resulting indicator is published monthly by Insee in the Informations Rapides. Chapter 6 describes a semi-parametric model of spot electricity prices on wholesale markets, with applications to risk management in electricity generation.
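For reference, the cross-validation estimator studied in the first part is the average held-out loss over folds. The sketch below is a generic K-fold version under assumed interfaces (NumPy arrays, a `fit` function returning a predictor, and a pointwise `loss`); it illustrates the estimator itself, not the concentration inequalities proved in the thesis.

```python
# Minimal sketch of the K-fold cross-validation risk estimate; `fit` and `loss`
# are hypothetical stand-ins for a learning rule and a loss function.
import numpy as np

def cross_validation_risk(X, y, fit, loss, k=10, seed=0):
    """Average held-out loss over k folds - an estimate of the predictor's true risk."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    fold_risks = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        predictor = fit(X[train], y[train])   # train on the other k-1 folds
        fold_risks.append(np.mean([loss(predictor(x), t) for x, t in zip(X[test], y[test])]))
    return float(np.mean(fold_risks))
```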
23

The Bootstrap in Supervised Learning and its Applications in Genomics/Proteomics

Vu, Thang 2011 May 1900 (has links)
The small-sample size issue is a prevalent problem in Genomics and Proteomics today. Bootstrap, a resampling method which aims at increasing the efficiency of data usage, is considered to be an effort to overcome the problem of limited sample size. This dissertation studies the application of bootstrap to two problems of supervised learning with small sample data: estimation of the misclassification error of Gaussian discriminant analysis, and the bagging ensemble classification method. Estimating the misclassification error of discriminant analysis is a classical problem in pattern recognition and has many important applications in biomedical research. Bootstrap error estimation has been shown empirically to be one of the best estimation methods in terms of root mean squared error. In the first part of this work, we conduct a detailed analytical study of bootstrap error estimation for the Linear Discriminant Analysis (LDA) classification rule under Gaussian populations. We derive the exact formulas of the first and the second moment of the zero bootstrap and the convex bootstrap estimators, as well as their cross moments with the resubstitution estimator and the true error. Based on these results, we obtain the exact formulas of the bias, the variance, and the root mean squared error of the deviation from the true error of these bootstrap estimators. This includes the moments of the popular .632 bootstrap estimator. Moreover, we obtain the optimal weight for unbiased and minimum-RMS convex bootstrap estimators. In the univariate case, all the expressions involve Gaussian distributions, whereas in the multivariate case, the results are written in terms of bivariate doubly non-central F distributions. In the second part of this work, we conduct an extensive empirical investigation of bagging, which is an application of bootstrap to ensemble classification. We investigate the performance of bagging in the classification of small-sample gene-expression data and protein-abundance mass spectrometry data, as well as the accuracy of small-sample error estimation with this ensemble classification rule. We observed that, under t-test and RELIEF filter-based feature selection, bagging generally does a good job of improving the performance of unstable, overfitting classifiers, such as CART decision trees and neural networks, but that improvement was not sufficient to beat the performance of single stable, non-overfitting classifiers, such as diagonal and plain linear discriminant analysis, or 3-nearest neighbors. Furthermore, the ensemble method did not improve the performance of these stable classifiers significantly. We give an explicit definition of the out-of-bag estimator that is intended to remove estimator bias, by formulating carefully how the error count is normalized, and investigate the performance of error estimation for bagging of common classification rules, including LDA, 3NN, and CART, applied on both synthetic and real patient data, corresponding to the use of common error estimators such as resubstitution, leave-one-out, cross-validation, basic bootstrap, bootstrap 632, bootstrap 632 plus, bolstering, semi-bolstering, in addition to the out-of-bag estimator. The results from the numerical experiments indicated that the performance of the out-of-bag estimator is very similar to that of leave-one-out; in particular, the out-of-bag estimator is slightly pessimistically biased.
The performance of the other estimators is consistent with their performance with the corresponding single classifiers, as reported in other studies. The results of this work are expected to provide helpful guidance to practitioners who are interested in applying the bootstrap in supervised learning applications.
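For reference, the .632 estimator discussed above is the convex combination 0.368 · (resubstitution error) + 0.632 · (zero bootstrap error), where the zero bootstrap error is measured on samples left out of each bootstrap resample. A minimal sketch, assuming a scikit-learn-style classifier interface:

```python
# Sketch of the convex (.632) bootstrap error estimator: a weighted combination
# of the optimistic resubstitution error and the pessimistic "zero" bootstrap
# error computed on out-of-resample points. The `fit`/`predict` interface is an
# assumption for illustration.
import numpy as np

def bootstrap_632_error(clf, X, y, n_boot=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    clf.fit(X, y)
    err_resub = np.mean(clf.predict(X) != y)       # resubstitution error
    oob_errors = []
    for _ in range(n_boot):
        boot = rng.integers(0, n, size=n)          # bootstrap resample with replacement
        oob = np.setdiff1d(np.arange(n), boot)     # samples left out of this resample
        if len(oob) == 0:
            continue
        clf.fit(X[boot], y[boot])
        oob_errors.append(np.mean(clf.predict(X[oob]) != y[oob]))
    err_zero = np.mean(oob_errors)                 # zero bootstrap error
    return 0.368 * err_resub + 0.632 * err_zero    # convex combination, weight w = .632
```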
24

Applications of Data Mining on Drug Safety: Predicting Proper Dosage of Vancomycin for Patients with Renal Insufficiency and Impairment

Yon, Chuen-huei 24 August 2004 (has links)
Drug misuse results in wasted medical resources and significant societal costs. Due to the narrow therapeutic range of vancomycin, an appropriate vancomycin dosage is difficult to determine. When an inappropriate dosage is used, side effects such as poisoning reactions or drug resistance may occur. Clinically, medical professionals adjust vancomycin drug protocols based on Therapeutic Drug Monitoring (TDM) results. TDM is usually defined as the clinical use of drug blood concentration measurements as an aid in dosage finding and adjustment. However, TDM cannot be applied to first-time treatments, and in such cases dosage decisions need to rely on medical professionals' clinical experience and judgment. Data mining has been applied in various medical and healthcare applications. In this study, we employ decision-tree induction (specifically, C4.5) and a backpropagation neural network technique for predicting the appropriateness of vancomycin usage for patients with renal insufficiency and impairment. In addition, we evaluate whether the use of the boosting and bagging algorithms improves predictive accuracy. Our empirical evaluation results suggest that use of the boosting and bagging algorithms could improve predictive accuracy. Specifically, C4.5 in conjunction with the AdaBoost algorithm achieves an overall accuracy of 79.65%, which significantly improves on the existing practice's accuracy rate of 41.38%. With respect to the appropriateness category ("Y") and the inappropriateness category ("N"), C4.5 in conjunction with the AdaBoost algorithm achieves recall rates of 78.75% and 80.25%, respectively. Hence, incorporating data mining techniques into decision support would enhance drug safety, which in turn would improve patient safety and reduce subsequent medical resource waste.
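A hedged sketch of the boosted-tree setup described above: scikit-learn has no C4.5 implementation, so a CART-style decision tree stands in as the AdaBoost base learner, and a synthetic dataset replaces the patient records.

```python
# Illustrative only: CART-style trees stand in for C4.5, and synthetic data
# replaces the vancomycin TDM patient records.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=12, random_state=0)  # placeholder data

boosted_tree = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),  # shallow tree as the weak learner
    n_estimators=50,                                # ("estimator" in recent scikit-learn;
    random_state=0,                                 #  older releases call it "base_estimator")
)
scores = cross_val_score(boosted_tree, X, y, cv=10, scoring="accuracy")
print(f"mean cross-validated accuracy: {scores.mean():.4f}")
```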
25

Manufacturing and Structural Analysis of a Lightweight Sandwich Composite UAV Wing

Turgut, Tahir 01 September 2007 (has links) (PDF)
This thesis work deals with manufacturing a lightweight composite unmanned aerial vehicle (UAV) wing, material characterization of the composites used in the UAV wing, and preliminary structural analysis of the UAV wing. Manufacturing is performed at the composite laboratory founded in the Department of Aerospace Engineering, and the wing is produced by the hand lay-up and vacuum bagging method at room temperature. This study covers the detailed manufacturing process of the UAV wing, from mold manufacturing up to the final wing configuration, supported with sketches and pictures. The structural analysis of the composite wing performed in this study is based on material properties determined by coupon tests and micromechanics approaches. Contrary to metallic materials, the actual material properties of composites are generally not available in material handbooks, because the elastic properties of composite materials depend on the manufacturing process. In this study, the mechanical properties, i.e. Young's Modulus, are determined using three different methods. Firstly, longitudinal tensile testing of the coupon specimens is performed to obtain the elastic properties. Secondly, a mechanics of materials approach is used to determine the elastic properties. Additionally, an approximate method that can be used in a preliminary study is employed. The elastic properties determined by the tests and the other approaches are compared to each other. One of the aims of this study is to establish an equivalent material model based on the tests and the micromechanics approach, and to use the equivalent model in the structural analysis by the finite element method. To achieve this, the composite structure of the wing is modeled in detail with full composite material descriptions of the surfaces of the wing structure, and comparisons are made with the results obtained by utilizing equivalent elastic constants. The analyses revealed that all three approaches give consistent and close results, especially in terms of deflections and natural frequencies. The stress values obtained are also comparable. For a case study on level flight conditions, the spanwise wing loading distribution is obtained using an ESDU program, and the wing is analyzed with the distributed loading. Reasonable results are obtained, and the results are compared with the tip loading case. Another issue dealt with in this study is analyzing the front spar of the wing separately. The analysis of the front spar is performed using the transformed section method and finite element analysis. It is found that both methods calculate deflections that are very close to each other. Close stress results are found when solid elements are used in the finite element analysis, whereas the results deviate when shell elements are used.
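The mechanics of materials approach mentioned above is commonly the rule of mixtures for a unidirectional lamina; the standard relations are sketched below for orientation (fiber/matrix moduli E_f, E_m and volume fractions V_f, V_m = 1 − V_f), and are not necessarily the exact formulation used in the thesis.

```latex
% Standard rule-of-mixtures estimates for a unidirectional lamina (illustrative).
\begin{align}
  E_1 &= E_f V_f + E_m V_m
      && \text{longitudinal modulus (Voigt bound)} \\
  \frac{1}{E_2} &= \frac{V_f}{E_f} + \frac{V_m}{E_m}
      && \text{transverse modulus (Reuss bound)} \\
  \nu_{12} &= \nu_f V_f + \nu_m V_m
      && \text{major Poisson's ratio}
\end{align}
```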
26

An Analysis of Wind Power Plant Site Prospecting in the Central United States

Carlos, Mark E. 01 December 2010 (has links)
Rapid deployment of terrestrial wind power plants (WPPs) is a function of accurate identification of areas suitable for WPPs. Efficient WPP site prospecting not only decreases installation lead time, but also reduces site selection expenses and provides faster reductions of greenhouse gas emissions. Combining conventional predictor variables, such as wind strength and proximity to transmission lines, with nonconventional socioeconomic and demographic predictor variables, will result in improved identification of suitable counties for WPPs and therefore accelerate the site prospecting phase of wind power plant deployment. Existing and under-construction American terrestrial WPPs located in the top 12 windiest states (230 as of June 2009) plus 178 potential county-level predictor variables are introduced to logistic regression with stepwise selection and a random sampling validation methodology to identify influential predictor variables. In addition to the wind resource and proximity to electricity transmission lines, existence of a Renewable Portfolio Standard, the population density within a 200-mile radius of the county center, median home values, and farm land area in the county are the four strongest nonconventional predictors (Hosmer and Lemeshow Chi-Square = 9.1250, N = 1009, df = 8, p = 0.3319, -2 Log-Likelihood = 619.521). Evaluation of the final model using multiple statistics, including the Heidke skill score (0.2647), confirms overall model predictive skill. The model identifies the existence of 238 suitable counties in the twelve-state region that do not possess WPPs (~73% validated overall accuracy) and eliminates 654 counties that are not classified as suitable for WPPs. The 238 counties identified by the model represent ideal counties for further exploration of WPP development and possible transmission line construction. The results of this study will therefore allow faster integration of renewable energy sources and limit climate change impacts from increasing atmospheric greenhouse gas concentrations.
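The Heidke skill score reported above is computed from the 2x2 contingency table of predicted versus observed suitable counties; a minimal sketch (the counts below are illustrative, not the study's):

```python
# Sketch of the Heidke skill score from a 2x2 contingency table
# (hits a, false alarms b, misses c, correct negatives d).
def heidke_skill_score(hits, false_alarms, misses, correct_negatives):
    a, b, c, d = hits, false_alarms, misses, correct_negatives
    numerator = 2.0 * (a * d - b * c)
    denominator = (a + c) * (c + d) + (a + b) * (b + d)
    return numerator / denominator

# Example with made-up counts; a perfect forecast scores 1, a no-skill forecast 0.
print(heidke_skill_score(hits=120, false_alarms=60, misses=70, correct_negatives=759))
```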
27

Intelligent Adaptation of Ensemble Size in Data Streams Using Online Bagging

Olorunnimbe, Muhammed January 2015 (has links)
In this era of the Internet of Things and Big Data, a proliferation of connected devices continuously produce massive amounts of fast evolving streaming data. There is a need to study the relationships in such streams for analytic applications, such as network intrusion detection, fraud detection and financial forecasting, amongst others. In this setting, it is crucial to create data mining algorithms that are able to seamlessly adapt to temporal changes in data characteristics that occur in data streams. These changes are called concept drifts. The resultant models produced by such algorithms should not only be highly accurate and able to swiftly adapt to changes; the data mining techniques should also be fast, scalable, and efficient in terms of resource allocation. It then becomes important to consider issues such as storage space needs and memory utilization. This is especially relevant when we aim to build personalized, near-instant models in a Big Data setting. This research work focuses on mining in a data stream with concept drift, using an online bagging method, with consideration of memory utilization. Our aim is to take an adaptive approach to resource allocation during the mining process. Specifically, we consider metalearning, in which the models of multiple classifiers are combined into an ensemble; this approach has been very successful in building accurate models against data streams. However, little work has been done to explore the interplay between accuracy, efficiency and utility. This research focuses on this issue. We introduce an adaptive metalearning algorithm that takes advantage of the memory utilization cost of concept drift, in order to vary the ensemble size during the data mining process. We aim to minimize the memory usage, while maintaining highly accurate models with a high utility. We evaluated our method against a number of benchmarking datasets and compared our results against the state of the art. Return on Investment (ROI) was used to evaluate the gain in performance in terms of accuracy, in contrast to the time and memory invested. We aimed to achieve high ROI without compromising on the accuracy of the result. Our experimental results indicate that we achieved this goal.
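The online bagging scheme this work builds on (in the style of Oza and Russell) presents each arriving example to each base learner k ~ Poisson(1) times, approximating bootstrap resampling on a stream. The sketch below shows that update plus a deliberately simplified resizing hook; the thesis's adaptive, ROI-driven sizing policy is not reproduced here.

```python
# Minimal sketch of online bagging: each incoming example is given to each base
# learner k ~ Poisson(1) times. Works with scikit-learn learners that support
# partial_fit (e.g. GaussianNB). The resize() rule is a simplified stand-in.
import numpy as np

class OnlineBagging:
    def __init__(self, make_learner, n_estimators=10, seed=0):
        self.make_learner = make_learner
        self.rng = np.random.default_rng(seed)
        self.ensemble = [make_learner() for _ in range(n_estimators)]

    def partial_fit(self, x, y, classes):
        for learner in self.ensemble:
            k = self.rng.poisson(1.0)                  # weight of this example for this learner
            if not hasattr(learner, "classes_"):
                k = max(k, 1)                          # ensure every learner sees a first example
            for _ in range(k):
                learner.partial_fit([x], [y], classes=classes)

    def predict(self, x):
        votes = [learner.predict([x])[0] for learner in self.ensemble]
        return max(set(votes), key=votes.count)        # majority vote

    def resize(self, target_size):
        # Simplified adaptation: grow or shrink the ensemble to meet a memory target.
        while len(self.ensemble) < target_size:
            self.ensemble.append(self.make_learner())
        del self.ensemble[target_size:]
```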
28

Credit Card Approval Prediction: A comparative analysis between logistic regression classifier, random forest classifier, support vector classifier with ensemble bagging classifier

Janapareddy, Dhanush, Yenduri, Narendra Chowdary January 2023 (has links)
Background. Due to an increasing number of credit card defaulters, companies are now taking greater precautions when approving credit applications. When a customer meets certain requirements, credit card firms typically use their experience to decide whether to grant them a credit card. Additionally, a few machine learning methods have been applied to support the final decision. Objectives. The aim of this thesis is to compare the accuracy of the logistic regression classifier, random forest classifier, and support vector classifier with the ensemble bagging classifier for predicting credit card approval. Methods. This thesis follows a method called general experimentation to determine the most accurate classification technique for predicting credit card approval. The dataset is taken from Kaggle and contains information about credit card applications. The selected algorithms are trained with training data, validated using validation data, and then evaluated on the testing data using metrics such as accuracy, precision, recall, F1 score, and ROC curve. An ensemble learning bagging technique is then applied to combine the predictions of these multiple models using majority voting to create an ensemble model. Finally, the performance of the ensemble model was evaluated on the testing data and its accuracy compared to that of the individual models to identify the most accurate classification technique for predicting credit card approval. Results. Among the four selected machine learning algorithms, the random forest classifier performed best, with an accuracy of 88.41% on the testing dataset. The second-best algorithm is the ensemble bagging classifier, with an accuracy of 84.78%. Hence, the random forest classifier is the most accurate algorithm for predicting credit card approval. Conclusions. After evaluating various classifiers, including the logistic regression classifier, random forest classifier, support vector classifier, and ensemble bagging, it was observed that the random forest classifier outperformed the other models in terms of prediction accuracy. This indicates that the random forest classifier was better at predicting credit card approval.
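A hedged sketch of the comparison described above, with a synthetic dataset standing in for the Kaggle data; the ensemble that combines the three base models by majority voting is expressed here with scikit-learn's VotingClassifier.

```python
# Illustrative only: synthetic data replaces the Kaggle credit card dataset, and
# the ensemble is read here as a majority vote over the three base models.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=15, random_state=42)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

base_models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(random_state=42)),
    ("svc", SVC()),
]
ensemble = VotingClassifier(estimators=base_models, voting="hard")  # majority voting

for name, model in base_models + [("majority-vote ensemble", ensemble)]:
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {accuracy_score(y_test, model.predict(X_test)):.4f}")
```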
29

Machine Learning for Inverse Design

Thomas, Evan 08 February 2023 (has links)
"Inverse design" formulates the design process as an inverse problem; optimal values of a parameterized design space are sought so to best reproduce quantitative outcomes from the forwards dynamics of the design's intended environment. Arguably, two subtasks are necessary to iteratively solve such a design problem; the generation and evaluation of designs. This thesis work documents two experiments leveraging machine learning (ML) to facilitate each subtask. Included first is a review of relevant physics and machine learning theory. Then, analysis on the theoretical foundations of ensemble methods realizes a novel equation describing the effect of Bagging and Random Forests on the expected mean squared error of a base model. Complex models of design evaluation may capture environmental dynamics beyond those that are useful for a design optimization. These constitute unnecessary time and computational costs. The first experiment attempts to avoid these by replacing EGSnrc, a Monte Carlo simulation of coupled electron-photon transport, with an efficient ML "surrogate model". To investigate the benefits of surrogate models, a simulated annealing design optimization is twice conducted to reproduce an arbitrary target design, once using EGSnrc and once using a random forest regressor as a surrogate model. It is found that using the surrogate model produced approximately an 100x speed-up, and converged upon an effective design in fewer iterations. In conclusion, using a surrogate model is faster and (in this case) also more effective per-iteration. The second experiment of this thesis work leveraged machine learning for design generation. As a proof-of-concept design objective, the work seeks to efficiently sample 2D Ising spin model configurations from an optimized design space with a uniform distribution of internal energies. Randomly sampling configurations yields a narrow Gaussian distribution of internal energies. Convolutional neural networks (CNN) trained with NeuroEvolution, a mutation-only genetic algorithm, were used to statistically shape the design space. Networks contribute to sampling by processing random inputs, their outputs are then regularized into acceptable configurations. Samples produced with CNNs had more uniform distribution of internal energies, and ranged across the entire space of possible values. In combination with conventional sampling methods, these CNNs can facilitate the sampling of configurations with uniformly distributed internal energies.
30

Large scale support vector machines algorithms for visual classification

Doan, Thanh-Nghi 07 November 2013 (has links)
We present two major contributions: 1) a combination of multiple different image descriptors for large-scale classification, and 2) parallel SVM algorithms for large-scale image classification. For large-scale learning, we have developed parallel versions of both state-of-the-art linear and nonlinear SVMs, and we propose a novel algorithm that extends stochastic gradient descent SVM training to large-scale learning. A class of large-scale incremental SVM classifiers has also been developed to perform classification tasks on large datasets with a very large number of classes when the training data cannot fit into memory.
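As a generic illustration only (not the specific algorithms developed in this thesis), one common way to parallelize a linear SVM is to train hinge-loss SGD models on disjoint data shards in separate processes and average their weights:

```python
# Generic illustration: parallel linear SVM via sharded SGD training (hinge loss)
# followed by weight averaging. Synthetic data stands in for an image dataset.
import numpy as np
from multiprocessing import Pool
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

def train_shard(shard):
    X_shard, y_shard = shard
    clf = SGDClassifier(loss="hinge", alpha=1e-4, max_iter=20, tol=None, random_state=0)
    clf.fit(X_shard, y_shard)
    return clf.coef_.copy(), clf.intercept_.copy()

if __name__ == "__main__":
    X, y = make_classification(n_samples=20000, n_features=50, random_state=0)  # placeholder data
    shards = [(X[i::4], y[i::4]) for i in range(4)]    # split the data across 4 workers
    with Pool(4) as pool:
        results = pool.map(train_shard, shards)
    coef = np.mean([c for c, _ in results], axis=0)    # average the shard models
    intercept = np.mean([b for _, b in results], axis=0)
    predictions = np.sign(X @ coef.ravel() + intercept)
    print("training accuracy of averaged model:", np.mean(predictions == 2 * y - 1))
```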
