31 |
Data-Driven Predictions of Heating Energy Savings in Residential Buildings
Lindblom, Ellen; Almquist, Isabelle, January 2019
Along with the increasing use of intermittent electricity sources, such as wind and sun, comes a growing demand for user flexibility. This has paved the way for a new market of services that provide electricity customers with energy saving solutions. These include a variety of techniques ranging from sophisticated control of the customers’ home equipment to information on how to adjust their consumption behavior in order to save energy. This master's thesis contributes further to this field by investigating an additional incentive: predictions of future energy savings related to indoor temperature. Five different machine learning models have been tuned and used to predict monthly heating energy consumption for a given set of homes. The model tuning process and performance evaluation were performed using 10-fold cross-validation. The best performing model was then used to predict how much heating energy each individual household could save by decreasing their indoor temperature by 1°C during the heating season. The highest prediction accuracy (about 78%) is achieved with support vector regression (SVR), closely followed by neural networks (NN). The simpler regression models that have been implemented are, however, not far behind. According to the SVR model, the average household is expected to lower its heating energy consumption by approximately 3% if the indoor temperature is decreased by 1°C.
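The 10-fold cross-validation mentioned above can be sketched as follows. This is a minimal illustration only: the helper names (`k_fold_indices`, `cross_validate`, `fit_line`, `predict_line`), the straight-line stand-in model and the data are all assumptions made for the example, not the thesis's SVR setup or its measurements.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, predict, k=10):
    """Return the mean absolute error averaged over k held-out folds."""
    fold_errors = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [abs(predict(model, xs[i]) - ys[i]) for i in fold]
        fold_errors.append(mean(errors))
    return mean(fold_errors)

# Hypothetical stand-in model: a least-squares line through one feature.
def fit_line(xs, ys):
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx

def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept

# Made-up data: heating energy falling linearly with outdoor temperature.
temps = [float(t) for t in range(-10, 20)]
energy = [100.0 - 3.0 * t for t in temps]
cv_error = cross_validate(temps, energy, fit_line, predict_line, k=10)
```

Every sample is held out exactly once, so the averaged error reflects performance on data the model did not see during fitting, which is why the same procedure can serve both tuning and evaluation.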
|
32 |
Sparse Discrete Wavelet Decomposition and Filter Bank Techniques for Speech Recognition
Jingzhao Dai (6642491), 11 June 2019
<p>Speech recognition is widely applied to speech-to-text translation, voice-driven commands, human-machine interfaces and so on [1]-[8]. It has become increasingly widespread in everyday life. To improve the accuracy of speech recognition, various algorithms such as artificial neural networks and hidden Markov models have been developed [1], [2].</p>
<p>In this thesis, speech recognition with various classifiers is investigated. The classifiers employed include the support vector machine (SVM), k-nearest neighbors (KNN), random forest (RF) and convolutional neural network (CNN). Two novel feature extraction methods, sparse discrete wavelet decomposition (SDWD) and bandpass filtering (BPF) based on the Mel filter banks [9], are developed and proposed. To suit the different classification algorithms, both one-dimensional (1D) and two-dimensional (2D) features are required. The 1D features are arrays of power coefficients in frequency bands and are used to train the SVM, KNN and RF classifiers, while the 2D features capture both the frequency domain and temporal variations: a 2D feature consists of the power values in the decomposed bands versus consecutive speech frames. Most importantly, the 2D features with geometric transformation are used to train the CNN.</p>
<p>The speech data, from both male and female speakers, come from a recorded data set as well as a standard data set. First, the proposed feature extraction methods are applied to the recordings with little noise and clear pronunciation. After many trials and experiments using this data set, a high recognition accuracy is achieved. Then, these feature extraction methods are applied to the standard recordings, which have random characteristics with ambient noise and unclear pronunciation. Many experimental results validate the effectiveness of the proposed feature extraction techniques.</p>
|
33 |
Data, learning and privacy in recommendation systems / Données, apprentissage et respect de la vie privée dans les systèmes de recommandation
Mittal, Nupur, 25 November 2016
Les systèmes de recommandation sont devenus une partie indispensable des services et des applications d’internet, en particulier dû à la surcharge de données provenant de nombreuses sources. Quel que soit le type, chaque système de recommandation a des défis fondamentaux à traiter. Dans ce travail, nous identifions trois défis communs, rencontrés par tous les types de systèmes de recommandation: les données, les modèles d'apprentissage et la protection de la vie privée. Nous élaborons différents problèmes qui peuvent être créés par des données inappropriées en mettant l'accent sur sa qualité et sa quantité. De plus, nous mettons en évidence l'importance des réseaux sociaux dans la mise à disposition publique de systèmes de recommandation contenant des données sur ses utilisateurs, afin d'améliorer la qualité des recommandations. Nous fournissons également les capacités d'inférence de données publiques liées à des données relatives aux utilisateurs. Dans notre travail, nous exploitons cette capacité à améliorer la qualité des recommandations, mais nous soutenons également qu'il en résulte des menaces d'atteinte à la vie privée des utilisateurs sur la base de leurs informations. Pour notre second défi, nous proposons une nouvelle version de la méthode des k plus proches voisins (knn, de l'anglais k-nearest neighbors), qui est une des méthodes d'apprentissage parmi les plus populaires pour les systèmes de recommandation. Notre solution, conçue pour exploiter la nature bipartie des ensembles de données utilisateur-élément, est évolutive, rapide et efficace pour la construction d'un graphe knn et tire sa motivation de la grande quantité de ressources utilisées par des calculs de similarité dans les calculs de knn. Notre algorithme KIFF utilise des expériences sur des jeux de données réelles provenant de divers domaines, pour démontrer sa rapidité et son efficacité lorsqu'il est comparé à des approches issues de l'état de l'art. 
Pour notre dernière contribution, nous fournissons un mécanisme permettant aux utilisateurs de dissimuler leur opinion sur des réseaux sociaux sans pour autant dissimuler leur identité. / Recommendation systems have gained tremendous popularity, both in academia and industry. They have evolved into many different varieties depending mostly on the techniques and ideas used in their implementation. This categorization also marks the boundary of their application domain. Regardless of their type, recommendation systems are complex and multi-disciplinary in nature, involving subjects like information retrieval, data cleansing and preprocessing, and data mining. In our work, we identify three challenges (among many possible) involved in the process of making recommendations and provide solutions to them. We elaborate on the challenges involved in obtaining user-demographic data and processing it to render it useful for making recommendations. The focus here is on using Online Social Networks to access publicly available user data to help recommendation systems. Using user-demographic data to improve personalized recommendations has many other advantages, such as dealing with the well-known cold-start problem. It is also one of the founding pillars of hybrid recommendation systems. With this work, we underline the importance of a user’s publicly available information, such as tweets, posts and votes, for inferring more private details about her. As the second challenge, we aim to improve the learning process of recommendation systems. Our goal is to provide a k-nearest neighbor method that handles very large datasets, surpassing billions of users. We propose a generic, fast and scalable k-NN graph construction algorithm that significantly improves performance compared to state-of-the-art approaches. 
Our idea is based on leveraging the bipartite nature of the underlying dataset and using a preprocessing phase to reduce the number of similarity computations in later iterations. As a result, we gain a speed-up of 14 compared to other significant approaches from the literature. Finally, we also consider the issue of privacy. Instead of directly viewing it under trivial recommendation systems, we analyze it on Online Social Networks (OSNs). First, we reason how OSNs can be seen as a form of recommendation system and how information dissemination is similar to broadcasting opinions/reviews in trivial recommendation systems. Following this parallelism, we identify a privacy threat in information diffusion in OSNs and provide a privacy-preserving algorithm for it. Our algorithm, Riposte, quantifies privacy in terms of differential privacy, and with the help of experimental datasets we demonstrate how Riposte maintains the desirable information diffusion properties of a network.
|
34 |
An IoT Solution for Urban Noise Identification in Smart Cities: Noise Measurement and Classification
Alsouda, Yasser, January 2019
Noise is defined as any undesired sound. Urban noise and its effect on citizens are a significant environmental problem, and the increasing level of noise has become a critical problem in some cities. Fortunately, noise pollution can be mitigated by better planning of urban areas or controlled by administrative regulations. However, the execution of such actions requires well-established systems for noise monitoring. In this thesis, we present a solution for noise measurement and classification using a low-power and inexpensive IoT unit. To measure the noise level, we implement an algorithm for calculating the sound pressure level in dB. We achieve a measurement error of less than 1 dB. Our machine learning-based method for noise classification uses Mel-frequency cepstral coefficients for audio feature extraction and four supervised classification algorithms (that is, support vector machine, k-nearest neighbors, bootstrap aggregating, and random forest). We evaluate our approach experimentally with a dataset of about 3000 sound samples grouped in eight sound classes (such as car horn, jackhammer, or street music). We explore the parameter space of the four algorithms to estimate the optimal parameter values for the classification of sound samples in the dataset under study. We achieve noise classification accuracy in the range of 88% – 94%.
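The sound pressure level calculation can be illustrated with the standard definition, SPL = 20 log10(p_rms / p_ref), where p_ref is 20 µPa in air. The sketch below is an assumption about how such a computation might look, not the thesis's algorithm: it presumes the microphone samples have already been calibrated to pascals, it applies no frequency weighting, and the names `sound_pressure_level` and `P_REF` are invented for the example.

```python
import math

P_REF = 20e-6  # reference pressure in pascals (20 µPa), the usual convention in air

def sound_pressure_level(samples_pa):
    """Return the SPL in dB for a block of pressure samples given in pascals."""
    if not samples_pa:
        raise ValueError("need at least one sample")
    rms = math.sqrt(sum(p * p for p in samples_pa) / len(samples_pa))
    if rms == 0:
        return float("-inf")  # digital silence has no finite level
    return 20.0 * math.log10(rms / P_REF)

# Synthetic test tone: a 1 kHz sine with 1 Pa RMS amplitude, sampled at
# 48 kHz for 10 ms, which is exactly 10 full periods.
rate, freq, n = 48_000, 1_000, 480
tone = [math.sqrt(2) * math.sin(2 * math.pi * freq * i / rate) for i in range(n)]
level = sound_pressure_level(tone)  # close to 94 dB
```

A 1 Pa RMS tone corresponds to roughly 94 dB, which makes it a convenient sanity check for any implementation of the formula.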
|
35 |
Um estudo sobre a extração de características e a classificação de imagens invariantes à rotação extraídas de um sensor industrial 3D / A study on the extraction of characteristics and the classification of invariant images through the rotation of a 3D industrial sensor
Rodrigo Dalvit Carvalho da Silva, 08 May 2014
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / Neste trabalho, é discutido o problema de reconhecimento de objetos utilizando imagens extraídas de um sensor industrial 3D. Nós nos concentramos em 9 extratores de características, dos quais 7 são baseados nos momentos invariantes (Hu, Zernike, Legendre, Fourier-Mellin, Tchebichef, Bessel-Fourier e Gaussian-Hermite), um outro é baseado na Transformada de Hough e o último na análise de componentes independentes, e, 4 classificadores, Naive Bayes, k-Vizinhos mais Próximos, Máquina de Vetor de Suporte e Rede Neural Artificial-Perceptron Multi-Camadas. Para a escolha do melhor extrator de características, foram comparados os seus desempenhos de classificação em termos de taxa de acerto e de tempo de extração, através do classificador k-Vizinhos mais Próximos utilizando distância euclidiana. O extrator de características baseado nos momentos de Zernike obteve as melhores taxas de acerto, 98.00%, e tempo relativamente baixo de extração de características, 0.3910 segundos. Os dados gerados a partir deste, foram apresentados a diferentes heurísticas de classificação. Dentre os classificadores testados, o classificador k-Vizinhos mais Próximos, obteve a melhor taxa média de acerto, 98.00% e, tempo médio de classificação relativamente baixo, 0.0040 segundos, tornando-se o classificador mais adequado para a aplicação deste estudo. / In this work, the problem of recognizing objects using images extracted from a 3D industrial sensor is discussed. We focus on 9 feature extractors (seven based on invariant moments: Hu, Zernike, Legendre, Fourier-Mellin, Tchebichef, Bessel-Fourier and Gaussian-Hermite; one based on the Hough transform; and one on independent component analysis) and 4 classifiers (Naive Bayes, k-Nearest Neighbors, Support Vector Machine and Artificial Neural Network-Multi-Layer Perceptron). 
To choose the best feature extractor, their performance was compared in terms of classification accuracy and extraction time using the k-nearest neighbors classifier with Euclidean distance. The feature extractor based on Zernike moments achieved the best accuracy, 98.00%, with a relatively low feature extraction time of 0.3910 seconds. The features it generated were then presented to different classification heuristics. Among the tested classifiers, the k-nearest neighbors classifier achieved the highest average accuracy, 98.00%, with a relatively low average classification time of 0.0040 seconds, making it the most suitable classifier for the application in this study.
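A k-nearest neighbors classifier with Euclidean distance, as used above to compare the feature extractors, can be sketched in a few lines. The function name `knn_classify`, the two-dimensional feature vectors and the class labels are made up for illustration; real moment-based descriptors would have many more dimensions.

```python
import math
from collections import Counter

def knn_classify(train_features, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training vectors."""
    distances = sorted(
        (math.dist(features, query), label)
        for features, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Made-up 2D feature vectors standing in for moment-based descriptors.
features = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (1.0, 1.1), (0.9, 1.0), (1.2, 0.8)]
labels = ["bolt", "bolt", "bolt", "nut", "nut", "nut"]
predicted = knn_classify(features, labels, (0.95, 0.9), k=3)
```

The classifier needs no training step beyond storing the labelled vectors, so its cost is concentrated in the distance computations at query time.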
|
36 |
Kombination von terrestrischen Aufnahmen und Fernerkundungsdaten mit Hilfe der kNN-Methode zur Klassifizierung und Kartierung von Wäldern / Combination of field data and remote sensing data with the knn-method (k-nearest neighbors method) for classification and mapping of forests
Stümer, Wolfgang, 30 August 2004
Bezüglich des Waldes hat sich in den letzten Jahren seitens der Politik und Wirtschaft ein steigender Informationsbedarf entwickelt. Zur Bereitstellung dieses Bedarfes stellt die Fernerkundung ein wichtiges Hilfsmittel dar, mit dem sich flächendeckende Datengrundlagen erstellen lassen. Die k-nächsten-Nachbarn-Methode (kNN-Methode), die terrestrische Aufnahmen mit Fernerkundungsdaten kombiniert, stellt eine Möglichkeit dar, diese Datengrundlage mit Hilfe der Fernerkundung zu verwirklichen. Deshalb beschäftigt sich die vorliegende Dissertation eingehend mit der kNN-Methode. An Hand der zwei Merkmale Grundfläche (metrische Daten) und Totholz (kategoriale Daten) wurden umfangreiche Berechnungen durchgeführt, wobei verschiedenste Variationen der kNN-Methode berücksichtigt wurden. Diese Variationen umfassen verschiedenste Einstellungen der Distanzfunktion, der Wichtungsfunktion und der Anzahl k-nächsten Nachbarn. Als Fernerkundungsdatenquellen kamen Landsat- und Hyperspektraldaten zum Einsatz, die sich sowohl von ihrer spektralen wie auch ihrer räumlichen Auflösung unterscheiden. Mit Hilfe von Landsat-Szenen eines Gebietes von verschiedenen Zeitpunkten wurde außerdem der multitemporale Ansatz berücksichtigt. Die terrestrische Datengrundlage setzt sich aus Feldaufnahmen mit verschiedenen Aufnahmedesigns zusammen, wobei ein wichtiges Kriterium die gleichmäßige Verteilung von Merkmalswerten (z.B. Grundflächenwerten) über den Merkmalsraum darstellt. Für die Durchführung der Berechnungen wurde ein Programm mit Visual Basic programmiert, welches mit der Integrierung aller Funktionen auf der Programmoberfläche eine benutzerfreundliche Bedienung ermöglicht. Die pixelweise Ausgabe der Ergebnisse mündete in detaillierte Karten und die Verifizierung der Ergebnisse wurde mit Hilfe des prozentualen Root Mean Square Error und der Bootstrap-Methode durchgeführt. Die erzielten Genauigkeiten für das Merkmal Grundfläche liegen zwischen 35 % und 67 % (Landsat) bzw. 
zwischen 65 % und 67 % (HyMapTM). Für das Merkmal Totholz liegen die Übereinstimmungen zwischen den kNN-Schätzern und den Referenzwerten zwischen 60,0 % und 73,3 % (Landsat) und zwischen 60,0 % und 63,3 % (HyMapTM). Mit den erreichten Genauigkeiten bietet sich die kNN-Methode für die Klassifizierung von Beständen bzw. für die Integrierung in Klassifizierungsverfahren an. / Mapping forest variables and associated characteristics is fundamental for forest planning and management. This work describes the k-nearest neighbors (kNN) method for improving estimates and producing maps of the attributes basal area (metric data) and deadwood (categorical data). Several variations within the kNN method were tested, including the distance metric, the weighting function and the number of neighbors. Landsat TM satellite images and hyperspectral data, which differ in both spectral and spatial resolution, were used as remote sensing sources. Two Landsat scenes of the same area, acquired in September 1999 and 2000, were used to account for the multitemporal approach. The field data for the kNN method comprise three sets of field measurements collected at the Tharandter Wald test site (Germany), each characterized by a different sampling design. For the kNN calculations, a program integrating all kNN functions was developed. The relative root mean square error (RMSE) and the bootstrap method were evaluated in order to find optimal parameters. The estimation accuracy for the attribute basal area is between 35 % and 67 % (Landsat) and between 65 % and 67 % (HyMapTM). For the attribute deadwood, the accuracy is between 60 % and 73 % (Landsat) and between 60 % and 63 % (HyMapTM). Recommendations for applying the kNN method for mapping and regional estimation are provided.
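For a metric attribute such as basal area, a kNN estimate for one pixel is typically a weighted mean over the spectrally nearest field plots. The sketch below assumes Euclidean distance, inverse-distance weighting, and a relative RMSE defined as RMSE divided by the mean observed value; the abstract does not say which distance, weighting or error definitions were actually used, and the names `knn_estimate` and `relative_rmse` and all numbers are hypothetical.

```python
import math

def knn_estimate(ref_spectra, ref_values, pixel, k=3, power=1.0):
    """Estimate a pixel's attribute as the inverse-distance-weighted mean of
    the k spectrally nearest field plots."""
    nearest = sorted(
        (math.dist(spectrum, pixel), value)
        for spectrum, value in zip(ref_spectra, ref_values)
    )[:k]
    # An exact spectral match gives a zero distance; return its value directly.
    for distance, value in nearest:
        if distance == 0.0:
            return value
    weights = [1.0 / distance ** power for distance, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)

def relative_rmse(estimates, observed):
    """RMSE divided by the mean observed value, expressed in percent."""
    mse = sum((e - o) ** 2 for e, o in zip(estimates, observed)) / len(observed)
    return 100.0 * math.sqrt(mse) / (sum(observed) / len(observed))

# Made-up reference plots: two spectral bands and a basal area value each.
plots = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
basal_area = [10.0, 20.0, 40.0]
estimate = knn_estimate(plots, basal_area, (0.0, 1.0), k=2)
```

Varying `k`, `power`, or the distance function in a sketch like this corresponds to the kind of parameter variations the thesis compares.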
|
37 |
Video Recommendation Based on Object Detection
Nyberg, Selma, January 2018
In this thesis, various machine learning domains have been combined in order to build a video recommender system that is based on object detection. The work combines two extensively studied research fields, recommender systems and computer vision, that are also rapidly growing and popular techniques in commercial markets. To investigate the performance of the approach, three different content-based recommender systems have been implemented at Spotify, which are based on the following video features: object detections, titles and descriptions, and user preferences. These systems have then been evaluated and compared against each other together with their hybridized result. Two algorithms have been implemented, the prediction and the top-N algorithm, where the former is the more reliable source for evaluating the system's performance. The evaluation of the system shows that the overall performance scores for predicting values of the users' liked and disliked videos are in the range from about 40 % to 70 % for the prediction algorithm and from about 15 % to 70 % for the top-N algorithm. The approach based on object detection performs worse than the other approaches. Hence, there seems to be a low correlation between the user preferences and the video contents in terms of object detection data. Therefore, this data is not very suitable for describing the content of videos and using it in the recommender system. However, the results of this study cannot be generalized to other systems before the approach has been evaluated in other environments and on various data sets. Moreover, there is plenty of room for refinements and improvements to the system, and many interesting research areas remain for future work.
|
38 |
Entropic measures of connectivity with an application to intracerebral epileptic signals / Mesures entropiques de connectivité avec application à l'épilepsie
Zhu, Jie, 22 June 2016
Les travaux présentés dans cette thèse s'inscrivent dans la problématique de la connectivité cérébrale, connectivité tripartite puisqu'elle sous-tend les notions de connectivité structurelle, fonctionnelle et effective. Ces trois types de connectivité que l'on peut considérer à différentes échelles d'espace et de temps sont bien évidemment liés et leur analyse conjointe permet de mieux comprendre comment structures et fonctions cérébrales se contraignent mutuellement. Notre recherche relève plus particulièrement de la connectivité effective qui permet de définir des graphes de connectivité qui renseignent sur les liens causaux, directs ou indirects, unilatéraux ou bilatéraux via des chemins de propagation, représentés par des arcs, entre les nœuds, ces derniers correspondant aux régions cérébrales à l'échelle macroscopique. Identifier les interactions entre les aires cérébrales impliquées dans la génération et la propagation des crises épileptiques à partir d'enregistrements intracérébraux est un enjeu majeur dans la phase pré-chirurgicale et l'objectif principal de notre travail. L'exploration de la connectivité effective suit généralement deux approches, soit une approche basée sur les modèles, soit une approche conduite par les données comme nous l'envisageons dans le cadre de cette thèse où les outils développés relèvent de la théorie de l'information et plus spécifiquement de l'entropie de transfert, la question phare que nous adressons étant celle de la précision des estimateurs de cette grandeur dans le cas des méthodes développées basées sur les plus proches voisins. 
Les approches que nous proposons qui réduisent le biais au regard d'estimateurs issus de la littérature sont évaluées et comparées sur des signaux simulés de type bruits blancs, processus vectoriels autorégressifs linéaires et non linéaires, ainsi que sur des modèles physiologiques réalistes avant d'être appliquées sur des signaux électroencéphalographiques de profondeur enregistrés sur un patient épileptique et comparées à une approche assez classique basée sur la fonction de transfert dirigée. En simulation, dans les situations présentant des non-linéarités, les résultats obtenus permettent d'apprécier la réduction du biais d'estimation pour des variances comparables vis-à-vis des techniques connues. Si les informations recueillies sur les données réelles sont plus difficiles à analyser, elles montrent certaines cohérences entre les méthodes même si les résultats préliminaires obtenus s'avèrent davantage en accord avec les conclusions des experts cliniciens en appliquant la fonction de transfert dirigée. / The work presented in this thesis deals with brain connectivity, including structural connectivity, functional connectivity and effective connectivity. These three types of connectivities are obviously linked, and their joint analysis can give us a better understanding on how brain structures and functions constrain each other. Our research particularly focuses on effective connectivity that defines connectivity graphs with information on causal links that may be direct or indirect, unidirectional or bidirectional. The main purpose of our work is to identify interactions between different brain areas from intracerebral recordings during the generation and propagation of seizure onsets, a major issue in the pre-surgical phase of epilepsy surgery treatment. Exploring effective connectivity generally follows two kinds of approaches, model-based techniques and data-driven ones. 
In this work, we address the question of improving the estimation of information-theoretic quantities, mainly mutual information and transfer entropy, based on k-Nearest Neighbors techniques. The approaches we developed are first evaluated and compared with existing estimators on simulated signals including white noise processes, linear and nonlinear vector autoregressive processes, as well as realistic physiology-based models. Some of them are then applied to intracerebral electroencephalographic signals recorded from an epileptic patient, and compared with the well-known directed transfer function. The experimental results show that the proposed techniques improve the estimation of information-theoretic quantities for simulated signals, while the analysis is more difficult in real situations. Overall, the different estimators appear coherent and in accordance with the ground truth given by the clinical experts, with the directed transfer function giving interesting performance.
|
39 |
An Efficient Classification Model for Analyzing Skewed Data to Detect Frauds in the Financial Sector / Un modèle de classification efficace pour l'analyse des données déséquilibrées pour détecter les fraudes dans le secteur financier
Makki, Sara, 16 December 2019
Différents types de risques existent dans le domaine financier, tels que le financement du terrorisme, le blanchiment d’argent, la fraude de cartes de crédit, la fraude d’assurance, les risques de crédit, etc. Tout type de fraude peut entraîner des conséquences catastrophiques pour des entités telles que les banques ou les compagnies d’assurances. Ces risques financiers sont généralement détectés à l'aide des algorithmes de classification. Dans les problèmes de classification, la distribution asymétrique des classes, également connue sous le nom de déséquilibre de classe (class imbalance), est un défi très commun pour la détection des fraudes. Des approches spéciales d'exploration de données sont utilisées avec les algorithmes de classification traditionnels pour résoudre ce problème. Le problème de classes déséquilibrées se produit lorsque l'une des classes dans les données a beaucoup plus d'observations que l’autre classe. Ce problème est plus vulnérable lorsque l'on considère dans le contexte des données massives (Big Data). Les données qui sont utilisées pour construire les modèles contiennent une très petite partie de groupe minoritaire qu’on considère positifs par rapport à la classe majoritaire connue sous le nom de négatifs. Dans la plupart des cas, il est plus délicat et crucial de classer correctement le groupe minoritaire plutôt que l'autre groupe, comme la détection de la fraude, le diagnostic d’une maladie, etc. Dans ces exemples, la fraude et la maladie sont les groupes minoritaires et il est plus délicat de détecter un cas de fraude en raison de ses conséquences dangereuses qu'une situation normale. Ces proportions de classes dans les données rendent très difficile à l'algorithme d'apprentissage automatique d'apprendre les caractéristiques et les modèles du groupe minoritaire. 
Ces algorithmes seront biaisés vers le groupe majoritaire en raison de leurs nombreux exemples dans l'ensemble de données et apprendront à les classer beaucoup plus rapidement que l'autre groupe. Dans ce travail, nous avons développé deux approches : Une première approche ou classifieur unique basée sur les k plus proches voisins et utilise le cosinus comme mesure de similarité (Cost Sensitive Cosine Similarity K-Nearest Neighbors : CoSKNN) et une deuxième approche ou approche hybride qui combine plusieurs classifieurs uniques et fondu sur l'algorithme k-modes (K-modes Imbalanced Classification Hybrid Approach : K-MICHA). Dans l'algorithme CoSKNN, notre objectif était de résoudre le problème du déséquilibre en utilisant la mesure de cosinus et en introduisant un score sensible au coût pour la classification basée sur l'algorithme de KNN. Nous avons mené une expérience de validation comparative au cours de laquelle nous avons prouvé l'efficacité de CoSKNN en termes de taux de classification correcte et de détection des fraudes. D’autre part, K-MICHA a pour objectif de regrouper des points de données similaires en termes des résultats de classifieurs. Ensuite, calculez les probabilités de fraude dans les groupes obtenus afin de les utiliser pour détecter les fraudes de nouvelles observations. Cette approche peut être utilisée pour détecter tout type de fraude financière, lorsque des données étiquetées sont disponibles. La méthode K-MICHA est appliquée dans 3 cas : données concernant la fraude par carte de crédit, paiement mobile et assurance automobile. Dans les trois études de cas, nous comparons K-MICHA au stacking en utilisant le vote, le vote pondéré, la régression logistique et l’algorithme CART. Nous avons également comparé avec Adaboost et la forêt aléatoire. Nous prouvons l'efficacité de K-MICHA sur la base de ces expériences. Nous avons également appliqué K-MICHA dans un cadre Big Data en utilisant H2O et R. 
Nous avons pu traiter et analyser des ensembles de données plus volumineux en très peu de temps / There are different types of risks in the financial domain, such as terrorist financing, money laundering, credit card fraud and insurance fraud, that may result in catastrophic consequences for entities such as banks or insurance companies. These financial risks are usually detected using classification algorithms. In classification problems, the skewed distribution of classes, also known as class imbalance, is a very common challenge in financial fraud detection, where special data mining approaches are used along with traditional classification algorithms to tackle this issue. The class imbalance problem occurs when one of the classes has many more instances than another class. This problem is more severe in the big data context. The datasets used to build and train the models contain an extremely small portion of the minority group, known as positives, in comparison to the majority class, known as negatives. In most cases, such as fraud detection or disease diagnosis, it is more delicate and crucial to correctly classify the minority group than the other group. In these examples, the fraud and the disease are the minority groups, and it is more delicate to detect a fraud record than a normal one because of its dangerous consequences. These class proportions make it very difficult for a machine learning classifier to learn the characteristics and patterns of the minority group. These classifiers will be biased towards the majority group because of its many examples in the dataset and will learn to classify them much faster than the other group. After conducting a thorough study of the challenges faced in class imbalance cases, we found that we still cannot reach an acceptable sensitivity (i.e. good classification of the minority group) without a significant decrease in accuracy. 
This leads to another challenge, which is the choice of performance measures used to evaluate models. In these cases, this choice is not straightforward: accuracy or sensitivity alone is misleading. We use other measures, such as the precision-recall curve or the F1-score, to evaluate this trade-off between accuracy and sensitivity. Our objective is to build an imbalanced classification model that considers the extreme class imbalance and the false alarms, in a big data framework. We developed two approaches: a Cost-Sensitive Cosine Similarity K-Nearest Neighbor (CoSKNN) as a single classifier, and a K-modes Imbalance Classification Hybrid Approach (K-MICHA) as an ensemble learning methodology. In CoSKNN, our aim was to tackle the imbalance problem by using cosine similarity as a distance metric and by introducing a cost-sensitive score for classification using the KNN algorithm. We conducted a comparative validation experiment in which we demonstrate the effectiveness of CoSKNN in terms of accuracy and fraud detection. The aim of K-MICHA, on the other hand, is to cluster similar data points in terms of the classifiers' outputs, and then to calculate the fraud probabilities in the obtained clusters in order to use them for detecting fraud in new transactions. This approach can be used for the detection of any type of financial fraud where labelled data are available. Finally, we applied K-MICHA to credit card, mobile payment and auto insurance fraud data sets. In all three case studies, we compare K-MICHA with stacking using voting, weighted voting, logistic regression and CART. We also compare it with AdaBoost and random forest. We demonstrate the efficiency of K-MICHA based on these experiments.
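The point that accuracy alone is misleading under class imbalance can be made concrete with a small worked example. The data below are invented (990 legitimate records and 10 frauds), and the helper names `confusion_counts` and `scores` are assumptions for illustration; this is not the CoSKNN or K-MICHA method, only the standard precision, recall and F1 definitions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = len(y_true) - tp - fp - fn
    return tp, fp, fn, tn

def scores(y_true, y_pred):
    """Return accuracy, precision, recall (sensitivity) and F1 for the positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up data: 990 legitimate records (0) followed by 10 frauds (1).
y_true = [0] * 990 + [1] * 10
always_legit = [0] * 1000                            # never flags anything
detector = [0] * 985 + [1] * 5 + [1] * 8 + [0] * 2   # 5 false alarms, 8 frauds caught

baseline = scores(y_true, always_legit)
model = scores(y_true, detector)
```

Here the classifier that never flags fraud still reaches 99% accuracy while catching nothing, whereas the detector's accuracy is only marginally higher even though it finds 8 of the 10 frauds; the F1-score separates the two clearly.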
|
40 |
Using supervised learning methods to predict the stop duration of heavy vehicles.
Oldenkamp, Emiel, January 2020
In this thesis project, we attempt to predict the stop duration of heavy vehicles using data based on GPS positions collected in a previous project. All of the training and prediction is done in AWS SageMaker, and we explore possibilities with Linear Learner, K-Nearest Neighbors and XGBoost, all of which are explained in this paper. Although we were not able to construct a production-grade model within the time frame of the thesis, we were able to show that the potential for such a model does exist given more time, and propose some suggestions for the paths one can take to improve on the endpoint of this project.
|