111 |
Rozhodovací stromy / Decision trees
Patera, Jan. January 2008
This diploma thesis describes several algorithms for decision tree induction and the RapidMiner software. The first part of the thesis covers the classification and terminology of decision trees and describes all the decision tree construction algorithms available in RapidMiner. The second part deals with the implementation and comparison of selected algorithms. The application was developed in C++. The algorithms were compared on real datasets using RapidMiner 4.0.
|
112 |
Real-time prediction of projectile penetration to laminates by training machine learning models with finite element solver as the trainer
Wadagbalkar, Pushkar. 15 June 2020
No description available.
|
113 |
Detection of IoT Botnets using Decision Trees
Meghana Raghavendra (10723905). 29 April 2021
International Data Corporation [3] (IDC) estimates that 152,200 Internet of Things (IoT) devices will be connected to the Internet every minute by the year 2025. This rapid expansion in the use of IoT devices in everyday life increases the attack surface available to cybercriminals. IoT devices are frequently compromised and used to create botnets. However, traditional methods are difficult to apply against IoT botnets, which calls for effective and efficient methods to mitigate such threats. In this work, network snapshots of IoT traffic infected with two botnets, Mirai and Bashlite, are studied. Specifically, the collected datasets include network traffic from 9 different IoT devices such as baby monitors, doorbells, thermostats, web cameras, and security cameras. Each dataset consists of 115 stream aggregation feature statistics such as weight, mean, covariance, correlation coefficient, standard deviation, radius, and magnitude with a timeframe decay factor, along with a class label defining the traffic as benign or anomalous.
The goal of the research is to identify a suitable machine learning method that can detect IoT botnet traffic accurately and in real time on IoT edge devices with low computation power, in order to form the first line of defense in an IoT network. The initial step is to identify the most important features that distinguish benign from anomalous traffic for IoT devices. Specifically, the Input Perturbation Ranking algorithm [12] with XGBoost [26] is applied to find the 9 most important features among the 115 features. These 9 features can be collected in real time and used as inputs to any detection method. Next, a supervised predictive machine learning method, Decision Trees, is proposed for faster and accurate detection of botnet traffic.
The advantage of decision trees over other machine learning methodologies is that they achieve accurate results with low computation time and power. Unlike deep learning methodologies, decision trees can provide a visual representation of the decision-making and detection process, which can be easily translated into explicit security policies in the IoT environment. In the experiments conducted, decision trees detected anomalous traffic with an accuracy of 99.997%, taking 59 seconds for training and 0.068 seconds for prediction, which is much faster than the state-of-the-art deep-learning-based detector, Kitsune [4]. Moreover, the results show that decision trees have an extremely low false positive rate of 0.019%. Using the 9 most important features, decision trees can further reduce the processing time while maintaining the accuracy. Hence, decision trees with important features are able to accurately and efficiently detect IoT botnets in real time on a low-performance edge device such as a Raspberry Pi [9].
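As a rough illustration of the feature-ranking step described in this abstract, the sketch below scores each feature by how much accuracy drops when that feature's column is shuffled. It is a minimal example under stated assumptions, not the thesis's implementation: the threshold rule `toy_model` is a hypothetical stand-in for the trained XGBoost model, and the function names, data, and seed are invented for illustration.

```python
import random
from typing import Callable, List, Sequence


def accuracy(model: Callable[[Sequence[float]], int],
             rows: List[List[float]], labels: List[int]) -> float:
    """Fraction of rows for which the model predicts the true label."""
    correct = sum(1 for row, y in zip(rows, labels) if model(row) == y)
    return correct / len(labels)


def perturbation_ranking(model: Callable[[Sequence[float]], int],
                         rows: List[List[float]], labels: List[int],
                         seed: int = 0) -> List[float]:
    """Return, for each feature, the accuracy drop after shuffling its column."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        rng.shuffle(column)  # break the link between feature j and the label
        perturbed = [row[:j] + [column[i]] + row[j + 1:]
                     for i, row in enumerate(rows)]
        drops.append(baseline - accuracy(model, perturbed, labels))
    return drops


def toy_model(row: Sequence[float]) -> int:
    """Hypothetical stand-in for a trained classifier: looks only at feature 0."""
    return 1 if row[0] > 0.5 else 0


# Toy data: feature 0 determines the label, feature 1 is irrelevant.
rows = [[float(i % 2), float(i % 5)] for i in range(20)]
labels = [int(row[0]) for row in rows]

drops = perturbation_ranking(toy_model, rows, labels)
# Features ordered from most to least important (largest accuracy drop first).
ranking = sorted(range(len(drops)), key=lambda j: drops[j], reverse=True)
print(drops, ranking)
```

Keeping only the first few entries of `ranking` would correspond to selecting a small feature subset, such as the 9 features mentioned above.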
|
114 |
Predictive model to reduce the dropout rate of university students in Perú: Bayesian Networks vs. Decision Trees
Medina, Erik Cevallos; Chunga, Claudio Barahona; Armas-Aguirre, Jimmy; Grandon, Elizabeth E. 01 June 2020
The full text of this work is not available in the Repositorio Académico UPC due to restrictions imposed by the publisher where it was published. / This research proposes a prediction model that might help reduce the dropout rate of university students in Peru. For this, a three-phase predictive analysis model was designed and combined with the stages proposed by the IBM SPSS Modeler methodology. Bayesian network techniques were compared with decision trees, chosen for their level of accuracy over other algorithms in an Educational Data Mining (EDM) scenario. Data were collected from 500 undergraduate students from a private university in Lima. The results indicate that Bayesian networks perform better than decision trees on the metrics of precision, accuracy, specificity, and error rate. In particular, the accuracy of Bayesian networks reaches 67.10%, while the accuracy of decision trees is 61.92%, in the training sample for the iteration with an 8:2 ratio. In addition, the variables athletic person (0.30%), own house (0.21%), and high school grades (0.13%) are the ones that contribute most to the prediction model for both Bayesian networks and decision trees.
|
115 |
Genetic Algorithms for Optimization of Machine-learning Models and their Applications in Bioinformatics
Magana-Mora, Arturo. 29 April 2017
Machine-learning (ML) techniques have been widely applied to solve different problems in biology. However, biological data are large and complex, which often results in extremely intricate ML models. Frequently, these models perform poorly or are computationally infeasible. This study presents a set of novel computational methods and focuses on the application of genetic algorithms (GAs) for the simplification and optimization of ML models and their applications to biological problems.
The dissertation addresses the following three challenges. The first is to develop a generalizable classification methodology able to systematically derive competitive models despite the complexity and nature of the data. Although several algorithms for the induction of classification models have been proposed, the algorithms are data dependent. Consequently, we developed OmniGA, a novel and generalizable framework that uses different classification models in a tree-like decision structure, along with a parallel GA for the optimization of the OmniGA structure. Results show that OmniGA consistently outperformed existing commonly used classification models. The second challenge is the prediction of translation initiation sites (TIS) in plant genomic DNA. We performed a statistical analysis of the genomic DNA and proposed a new set of discriminant features for this problem. We developed a wrapper method based on GAs for selecting an optimal feature subset, which, in conjunction with a classification model, produced the most accurate framework for the recognition of TIS in plants. Results also demonstrate that, despite the evolutionary distance between different plants, our approach successfully identified conserved genomic elements that may serve as the starting point for the development of a generic model for prediction of TIS in eukaryotic organisms.
Finally, the third challenge is the accurate prediction of polyadenylation signals in human genomic DNA. To achieve this, we analyzed genomic DNA sequences for the 12 most frequent polyadenylation signal variants and proposed a new set of features that may contribute to the understanding of the polyadenylation process. We derived Omni-PolyA, a model and tool based on OmniGA for the prediction of polyadenylation signals. Results show that Omni-PolyA significantly reduced the average classification error rate compared to the state-of-the-art results.
|
116 |
Credit risk modelling and prediction: Logistic regression versus machine learning boosting algorithms
Machado, Linnéa; Holmer, David. January 2022
The use of machine learning methods in credit risk modelling has been shown to yield good results in terms of increasing the accuracy of the risk score assigned to customers. In this thesis, the aim is to examine the performance of the machine learning boosting algorithms XGBoost and CatBoost, with logistic regression as a benchmark model, in terms of assessing credit risk. These methods were applied to two different data sets, where grid search was used for hyperparameter optimization of XGBoost and CatBoost. The evaluation metrics used to examine the classification accuracy of the methods were model accuracy, ROC curves, AUC and cross-validation. According to our results, the machine learning boosting methods outperformed logistic regression on the test data for both data sets, and CatBoost yielded the highest results in terms of both accuracy and AUC.
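To make the AUC metric mentioned in this abstract concrete, the following sketch computes it directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are assumptions made for illustration; they are not taken from the thesis.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical default labels (1 = default) and model risk scores.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

The quadratic loop is fine for a small illustration; a rank-based computation would be the usual choice for large data sets.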
|
117 |
Global Translation of Machine Learning Models to Interpretable Models
Almerri, Mohammad. 12 1900
Indiana University-Purdue University Indianapolis (IUPUI) / The widespread and growing use of machine learning models, especially in highly critical areas such as law, creates a need for interpretable models. Models that cannot be audited are vulnerable to inheriting biases from the dataset. Even locally interpretable models are vulnerable to adversarial attack. To address this issue, a new methodology is proposed to translate any existing machine learning model into a globally interpretable one.
This methodology, MTRE-PAN, is designed as a hybrid SVM-decision tree model and leverages the interpretability of linear hyperplanes. MTRE-PAN uses this hybrid model to create polygons that act as intermediates for the decision boundary. MTRE-PAN is compared to a previously proposed model, TRE-PAN, on three non-synthetic datasets: Abalone, Census and Diabetes data. TRE-PAN translates a machine learning model to a 2-3 decision tree in order to provide global interpretability for the target model. The datasets are each used to train a Neural Network that represents the non-interpretable model. For all target models, the results show that MTRE-PAN generates interpretable decision trees that have a lower number of leaves and higher parity compared to TRE-PAN.
|
118 |
MODELING INPUT VARIABLE AGE IN SEPSIS PREDICTION USING TREE-BASED MODELS
Wastesson, Oscar. January 2023
Last observation carried forward (LOCF) is a common imputation method, regularly used for clinical data. It is based on the principle that the most recent known observation is carried forward to replace missing values. In this thesis, we investigate the effect that variable age has on sepsis prediction when used as a conditional decision variable for imputation. In an iterative experiment, we combine the LOCF method with the more passive approach of letting tree-based models handle missing data through their built-in mechanisms. A measurement of variable age is created by measuring the distance in time between a missing observation and the most recent known value. Based on this measurement, different cutoff values based on variable-specific percentiles are evaluated during imputation. For missing values where the last known value is at least as recent as the chosen cutoff, imputation is made through LOCF. The remaining entries are retained as missing and handled by the model during prediction. Results based on out-of-sample prediction performance for increasing variable age percentile cutoffs suggest that overly restrictive constraints on the variable age decrease predictive performance for CART and random forest, whereas no such performance decrease is found for XGBoost. In addition, tendencies of a slight decrease in performance are seen for higher variable age percentiles as compared to the variable age interval that was found optimal in most cases. Finally, SHAP and LIME values show that there is a clear association between the variable age and prediction contributions for some variables. Further research is necessary to confirm and extend the results.
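The age-limited LOCF rule described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's code: the function name `locf_with_age_cutoff`, the use of `None` for missing values, and the fixed `max_age` argument (standing in for a variable-specific percentile cutoff) are all hypothetical.

```python
from typing import List, Optional, Sequence


def locf_with_age_cutoff(times: Sequence[float],
                         values: Sequence[Optional[float]],
                         max_age: float) -> List[Optional[float]]:
    """Carry the last observed value forward only while its age is <= max_age.

    Entries whose last observation is older than max_age (or that have no
    earlier observation at all) stay missing, so a model with built-in
    missing-value handling can deal with them at prediction time.
    """
    result: List[Optional[float]] = []
    last_value: Optional[float] = None
    last_time: Optional[float] = None
    for t, v in zip(times, values):
        if v is not None:
            # A real observation resets the variable age to zero.
            last_value, last_time = v, t
            result.append(v)
        elif last_time is not None and (t - last_time) <= max_age:
            # Missing, but the last known value is recent enough: impute.
            result.append(last_value)
        else:
            # Missing and too old (or never observed): keep as missing.
            result.append(None)
    return result


# Hypothetical measurements taken at hours 0..6 with gaps.
times = [0, 1, 2, 5, 6]
values = [1.0, None, None, None, 2.0]
imputed = locf_with_age_cutoff(times, values, max_age=2)
print(imputed)  # [1.0, 1.0, 1.0, None, 2.0]
```

Here the value observed at hour 0 is carried to hours 1 and 2 (ages 1 and 2), but not to hour 5, where its age of 5 exceeds the cutoff.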
|
119 |
Application of machine learning methods and airborne hyperspectral remote sensing for crop yield estimation
Uno, Yoji. January 2003
No description available.
|
120 |
A Knowledge Based Approach of Toxicity Prediction for Drug Formulation. Modelling Drug Vehicle Relationships Using Soft Computing Techniques
Mistry, Pritesh. January 2015
This multidisciplinary thesis is concerned with the prediction of drug formulations for the reduction of drug toxicity. Both scientific and computational approaches are utilised to make original contributions to the field of predictive toxicology.
The first part of this thesis provides a detailed scientific discussion of all aspects of drug formulation and toxicity. Discussions focus on the principal mechanisms of drug toxicity and how drug toxicity is studied and reported in the literature. Furthermore, a review of the current technologies available for formulating drugs for toxicity reduction is provided, together with examples of studies reported in the literature that have used these technologies to reduce drug toxicity. The thesis also provides an overview of the computational approaches currently employed in the field of in silico predictive toxicology. This overview focuses on the machine learning approaches used to build predictive QSAR classification models, with examples drawn from the literature.
Two methodologies have been developed as part of the main work of this thesis. The first focuses on the use of directed bipartite graphs and Venn diagrams for the visualisation and extraction of drug-vehicle relationships from large un-curated datasets which show changes in the patterns of toxicity. These relationships can be rapidly extracted and visualised using the methodology proposed in chapter 4.
The second methodology involves mining large datasets for the extraction of drug-vehicle toxicity data. The methodology uses an area-under-the-curve principle to make pairwise comparisons of vehicles, which are classified according to the toxicity protection they offer, and from which predictive classification models based on random forests and decision trees are built. The results of this methodology are reported in chapter 6.
|