1 |
Contribution à la commande et l'observation des systèmes dynamiques continus sous mesures clairsemées / Contribution to the observation and control of continuous systems under sparse measurements
Khaled, Yassine (13 June 2014)
This thesis deals with the stability analysis of impulsive dynamical systems and the design of observers for continuous dynamical systems with discrete measurements. The measurements are sparse and taken at random instants in order to avoid loss of observability. It is shown that an impulsive observer coupled with a classical continuous observer via a gain is an appropriate solution for reconstructing the continuous state of the system and for controlling and stabilizing it by state feedback based on these observers. In addition, this new scheme (impulsive observer coupled with a classical continuous observer) can reconstruct the output vector even if the available measurements do not satisfy the Nyquist-Shannon conditions. A further chapter is dedicated to detecting the active mode and estimating the associated continuous state for a class of linear hybrid systems under sparse measurements. The proposed solution combines an observability analysis of systems under random sampling with the design of impulsive observers. The observability analysis is based on compressive sensing, a concept well known in signal processing, and the impulsive observer design is presented for some special cases. Moreover, a novel observer design method for continuous nonlinear systems with discrete measurements is proposed. This method uses the Lipschitz condition and the mean value theorem to transform the nonlinear system into a linear parameter-varying system, for which impulsive observers are then designed. Finally, the proposed observers are tested on the synchronization of chaotic systems for secure communication.
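As a rough illustration of the impulsive-observer idea only, the sketch below propagates a copy of the model between measurements and corrects the estimate with a jump whenever a sample arrives at a random instant. The harmonic-oscillator plant, the gain `K`, the sampling-interval range and all function names are assumptions chosen for the example; the coupling with a continuous observer described in the abstract is not modelled.

```python
import numpy as np

def flow(tau):
    """Exact state transition over time tau for x' = A x with A = [[0, 1], [-1, 0]]."""
    c, s = np.cos(tau), np.sin(tau)
    return np.array([[c, s], [-s, c]])

def run_impulsive_observer(n_samples=30, seed=0):
    """Return the estimation-error norm after each impulsive correction."""
    rng = np.random.default_rng(seed)
    C = np.array([1.0, 0.0])      # only the first state is measured
    K = np.array([1.0, 1.0])      # assumed jump gain, not taken from the thesis
    x = np.array([1.0, 0.0])      # true state
    x_hat = np.array([0.0, 0.0])  # observer state
    errors = [np.linalg.norm(x - x_hat)]
    for _ in range(n_samples):
        tau = rng.uniform(0.2, 1.0)          # random time to the next measurement
        phi = flow(tau)
        x = phi @ x                          # plant evolves continuously
        x_hat = phi @ x_hat                  # observer runs open loop between samples
        y = C @ x                            # sparse measurement arrives
        x_hat = x_hat + K * (y - C @ x_hat)  # impulsive correction (jump)
        errors.append(np.linalg.norm(x - x_hat))
    return errors
```

For this particular plant and gain, each jump scales the remaining error by `cos(tau) - sin(tau)`, whose magnitude stays below 1 on the assumed interval range, so the error shrinks from one correction to the next.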
|
2 |
Empirical Evaluations of Different Strategies for Classification with Skewed Class Distribution
Ling, Shih-Shiung (09 August 2004)
Existing classification analysis techniques (e.g., decision tree induction) generally exhibit satisfactory classification effectiveness when dealing with data that has a non-skewed class distribution. However, real-world applications (e.g., churn prediction and fraud detection) often involve highly skewed decision outcomes. Such a highly skewed class distribution, if not properly addressed, imperils the resulting learning effectiveness.
In this study, we empirically evaluate three approaches to classification with highly skewed class distributions: under-sampling, over-sampling, and multi-classifier committees. Because of its popularity, C4.5 is selected as the underlying classification technique. Based on 10 datasets with highly skewed class distributions, our empirical evaluations suggest that the multi-classifier committee generally outperforms the under-sampling and over-sampling approaches when the recall rate, precision rate, and F1-measure are used as evaluation criteria. Furthermore, the over-sampling approach is suggested for applications aiming at a high recall rate. If, on the other hand, the precision rate is the primary concern, the classification model induced directly from the original datasets is recommended.
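The two resampling strategies and the three evaluation criteria named above can be sketched as follows. This is a minimal NumPy illustration assuming a binary problem in which the minority class is the smaller one; the function names are hypothetical and the code is not the study's implementation (which used C4.5).

```python
import numpy as np

def random_under_sample(X, y, minority_label, rng):
    """Drop majority rows at random until both classes have equal size."""
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    kept = rng.choice(majority_idx, size=minority_idx.size, replace=False)
    idx = np.concatenate([minority_idx, kept])
    return X[idx], y[idx]

def random_over_sample(X, y, minority_label, rng):
    """Duplicate minority rows at random until both classes have equal size."""
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    n_extra = majority_idx.size - minority_idx.size
    extra = rng.choice(minority_idx, size=n_extra, replace=True)
    idx = np.concatenate([majority_idx, minority_idx, extra])
    return X[idx], y[idx]

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class labelled `positive`."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Resampling would be applied to the training split only, so that the metrics are computed on data with the original class distribution.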
|
3 |
PATTERN RECOGNITION IN CLASS IMBALANCED DATASETS
Siddique, Nahian A (01 January 2016)
Class-imbalanced datasets make up a significant portion of the machine learning problems of interest, where recognizing the 'rare class' is the primary objective in most applications. Traditional linear machine learning algorithms are often ineffective at recognizing the rare class. In this work, a specifically optimized feed-forward artificial neural network (ANN) is proposed and developed to learn from moderately to highly imbalanced datasets.
The proposed methodology addresses the difficulty of the classification task in multiple stages: optimizing the training dataset, modifying the kernel function used to generate the Gram matrix, and optimizing the NN structure. First, the training dataset is extracted from the available sample set through an iterative process of selective under-sampling. Then, a kernel function optimizer enhances class boundaries for imbalanced datasets by conformally transforming the kernel functions. Finally, a single-hidden-layer weighted neural network structure is proposed to train models from the imbalanced dataset. With appropriate parameter tuning and a sufficient number of processing elements, the proposed NN architecture is designed to classify binary datasets effectively even at very high imbalance ratios.
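A conformal transformation of a kernel rescales it as k̃(x, x') = c(x) k(x, x') c(x') for some positive factor c. The sketch below applies that rescaling to an RBF Gram matrix. The choice of c (a sum of Gaussian bumps around a set of anchor points) and all names and parameters are assumptions made for illustration; the abstract does not specify the factor actually used.

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Gram matrix of the RBF kernel exp(-gamma * ||x - x'||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def conformal_factor(X, anchors, tau=1.0):
    """Hypothetical positive factor c(x), larger near the anchor points."""
    d2 = np.sum((X[:, None, :] - anchors[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / tau ** 2).sum(axis=1)

def conformal_gram(X, anchors, gamma=1.0, tau=1.0):
    """Conformally transformed Gram matrix: c(x_i) * k(x_i, x_j) * c(x_j)."""
    c = conformal_factor(X, anchors, tau)
    return c[:, None] * rbf_gram(X, gamma) * c[None, :]
```

Because c is strictly positive, the transformed matrix remains symmetric and positive semi-definite, so it can be used wherever the original Gram matrix was.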
The effectiveness of the proposed method is tested using accuracy-based performance metrics on several generic imbalanced datasets, achieving values close to or above 90%, and is compared with state-of-the-art methods. The proposed model is also used to classify a 25 GB computed tomographic colonography database to test its applicability to big data. The effects of under-sampling and kernel optimization on training the NN model from the modified kernel Gram matrix representing the imbalanced data distribution are also analyzed experimentally. Computation time analysis shows the feasibility of the system for practical purposes. The report concludes with a discussion of the prospects of the developed model and suggestions for further development in this direction.
|
4 |
On the role of correspondence noise in human visual motion perception: a systematic study on the role of correspondence noise affecting Dmax and Dmin, using random dot kinematograms: a psychophysical and modelling approach
Shafiullah, Syed Nadeemullah (January 2008)
One of the major goals of this thesis is to investigate the extent to which correspondence noise (i.e., the false pairing of dots in adjacent frames) limits motion detection performance in random dot kinematograms (RDKs). The performance measures of interest are Dmax and Dmin, i.e., the largest and smallest inter-frame dot displacements, respectively, for which motion can be reliably detected. Dmax and threshold coherence (i.e., the smallest proportion of dots that must be moved between frames for motion to be reliably detected) in RDKs are known to be affected by false pairing, or correspondence noise. Here the roles of correspondence noise and receptive field geometry in limiting performance are investigated. The range of Dmax observed in the literature is consistent with the current information-limit-based interpretation. Dmin is interpreted in the light of correspondence noise and under-sampling. Based on the psychophysical experiments performed in the early parts of the dissertation, a model for correspondence noise based on the principle of receptive field scaling is developed for Dmax. Model simulations provide a good account of psychophysically estimated Dmax over a range of stimulus parameters, showing that correspondence noise and receptive field geometry have a major influence on displacement thresholds.
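To make the notion of false pairing concrete, the following sketch builds a two-frame RDK in which every dot is shifted by the same displacement, then checks how often a nearest-neighbour matcher pairs each dot with its true partner. The nearest-neighbour rule, the unit-square display with wrap-around edges, and the function names are stand-in assumptions for illustration; this is not the receptive-field model developed in the thesis.

```python
import numpy as np

def toroidal_dist(a, b):
    """Pairwise distances on the unit square with wrap-around edges."""
    diff = np.abs(a[:, None, :] - b[None, :, :])
    diff = np.minimum(diff, 1.0 - diff)
    return np.sqrt((diff ** 2).sum(axis=2))

def correct_pairing_rate(n_dots, displacement, rng):
    """Fraction of dots whose nearest dot in frame 2 is their true partner."""
    frame1 = rng.random((n_dots, 2))
    frame2 = frame1.copy()
    # Shift every dot horizontally (100% coherence), wrapping at the edge.
    frame2[:, 0] = (frame2[:, 0] + displacement) % 1.0
    nearest = toroidal_dist(frame1, frame2).argmin(axis=1)
    return np.mean(nearest == np.arange(n_dots))
```

With displacements much smaller than the typical dot spacing, almost every pairing is correct; once the displacement exceeds that spacing, most nearest-neighbour pairings are false, which is the kind of correspondence noise the abstract refers to.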
|
5 |
Random forest em dados desbalanceados: uma aplicação na modelagem de churn em seguro saúde / Random forests on imbalanced data: an application to churn modeling in health insurance
Lento, Gabriel Carneiro (27 March 2017)
In this work we study churn in health insurance, that is, predicting which clients will cancel the product or service within a preset time frame. Traditionally, the probability that a client will cancel the service is modeled using logistic regression. Recently, modern machine learning techniques have become popular in churn modeling, having been applied in telecommunications, banking, and car insurance, among other areas. One of the big challenges in this problem is that only a small fraction of customers actually cancel the service, meaning that the dataset is highly imbalanced. Under-sampling and over-sampling techniques have been used to overcome this issue. We use random forests, which are ensembles of decision trees in which each tree is fitted to a subsample of the data constructed using either under-sampling or over-sampling. We compare the different random forest specifications using various metrics that are robust to imbalanced classes, both in-sample and out-of-sample. We observe that random forests using imbalanced random samples with fewer observations than the original dataset perform best overall among the random forests tested. Random forests also perform better than classical logistic regression, which is often used by health insurance companies to model churn.
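One way to picture the per-tree subsamples described above is to draw, for each tree, a fixed number of minority and majority rows with replacement. The sketch below does only that index-drawing step; the class sizes, the label convention (`1` for churn) and the function names are assumptions for illustration, and fitting the trees themselves is left to whatever decision-tree learner is in use.

```python
import numpy as np

def tree_subsample(y, n_minority, n_majority, rng, minority_label=1):
    """Row indices for one tree: a chosen number of rows drawn from each class."""
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    return np.concatenate([
        rng.choice(minority_idx, size=n_minority, replace=True),
        rng.choice(majority_idx, size=n_majority, replace=True),
    ])

def forest_subsamples(y, n_trees, n_minority, n_majority, seed=0):
    """One independent subsample of row indices per tree."""
    rng = np.random.default_rng(seed)
    return [tree_subsample(y, n_minority, n_majority, rng) for _ in range(n_trees)]
```

Choosing `n_minority` below `n_majority`, with their sum below the dataset size, gives subsamples that are still imbalanced but smaller than the original data; setting the two equal gives balanced subsamples.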
|
6 |
SCUT-DS: Methodologies for Learning in Imbalanced Data Streams
Olaitan, Olubukola (January 2018)
The automation of most of our activities has led to the continuous production of data that arrive in the form of fast-arriving streams. In a supervised learning setting, instances in these streams are labeled as belonging to a particular class. When the number of classes in the data stream is more than two, such a data stream is referred to as a multi-class data stream. Multi-class imbalanced data stream describes the situation where the instance distribution of the classes is skewed, such that instances of some classes occur more frequently than others. Classes with the frequently occurring instances are referred to as the majority classes, while the classes with instances that occur less frequently are denoted as the minority classes.
Classification algorithms, or supervised learning techniques, use historic instances to build models, which are then used to predict the classes of unseen instances. Multi-class imbalanced data stream classification poses a great challenge to classical classification algorithms. This is due to the fact that traditional algorithms are usually biased towards the majority classes, since they have more examples of the majority classes when building the model. These traditional algorithms yield low predictive accuracy rates for the minority instances and need to be augmented, often with some form of sampling, in order to improve their overall performances.
In the literature, in both static and streaming environments, most studies focus on the binary class imbalance problem. Furthermore, research in multi-class imbalance in the data stream environment is limited. A number of researchers have proceeded by transforming a multi-class imbalanced setting into multiple binary class problems. However, such a transformation does not allow the stream to be studied in the original form and may introduce bias. The research conducted in this thesis aims to address this research gap by proposing a novel online learning methodology that combines oversampling of the minority classes with cluster-based majority class under-sampling, without decomposing the data stream into multiple binary sets. Rather, sampling involves continuously selecting a balanced number of instances across all classes for model building. Our focus is on improving the rate of correctly predicting instances of the minority classes in multi-class imbalanced data streams, through the introduction of the Synthetic Minority Over-sampling Technique (SMOTE) and Cluster-based Under-sampling - Data Streams (SCUT-DS) methodologies. In this work, we dynamically balance the classes by utilizing a windowing mechanism during the incremental sampling process. Our SCUT-DS algorithms are evaluated using six different types of classification techniques, followed by comparing their results against a state-of-the-art algorithm. Our contributions are tested using both synthetic and real data sets. The experimental results show that the approaches developed in this thesis yield high prediction rates of minority instances as contained in the multiple minority classes within a non-evolving stream.
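The over-sampling half of the approach relies on SMOTE, which creates synthetic minority instances by interpolating between a minority point and one of its nearest minority neighbours. The sketch below shows that interpolation step for a single batch of minority instances (for example, the contents of one window). It is a simplified illustration with hypothetical names, it requires `k` to be smaller than the number of minority instances, and it does not include the cluster-based under-sampling or the windowing logic of SCUT-DS.

```python
import numpy as np

def smote_like(X_min, n_synthetic, k, rng):
    """Generate synthetic points on segments between minority points and neighbours."""
    n = X_min.shape[0]
    # Squared distances between all minority points; exclude self-matches.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)
    neighbours = np.argsort(d2, axis=1)[:, :k]   # k nearest neighbours per point
    base = rng.integers(0, n, size=n_synthetic)  # seed point for each synthetic row
    picked = neighbours[base, rng.integers(0, k, size=n_synthetic)]
    gap = rng.random((n_synthetic, 1))           # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])
```

Each synthetic row lies on the line segment between two existing minority instances, so it stays inside the region the minority class already occupies.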
|
7 |
Handling Imbalanced Data Classification With Variational Autoencoding And Random Under-Sampling Boosting
Ludvigsen, Jesper (January 2020)
This thesis compares three pre-processing methods for imbalanced classification data. A Variational Autoencoder (VAE), Random Under-Sampling Boosting (RUSBoost), and a hybrid of the two are applied to three imbalanced classification data sets with different degrees of class imbalance. A logistic regression (LR) model is fitted to each pre-processed data set, and the pre-processing methods are evaluated on its classification performance. All three methods show indications of different advantages when handling class imbalance. For each pre-processed data set, the LR model is better at correctly classifying minority-class observations than an LR model fitted to the original imbalanced data. In terms of overall classification performance, both VAE and RUSBoost improve the results, while the hybrid method performs worse on the moderately imbalanced data and best on the highly imbalanced data.
|
8 |
On the role of correspondence noise in human visual motion perception. A systematic study on the role of correspondence noise affecting Dmax and Dmin, using random dot kinematograms: A psychophysical and modelling approach.
Shafiullah, Syed N. (January 2008)
One of the major goals of this thesis is to investigate the extent to which correspondence noise (i.e., the false pairing of dots in adjacent frames) limits motion detection performance in random dot kinematograms (RDKs). The performance measures of interest are Dmax and Dmin, i.e., the largest and smallest inter-frame dot displacements, respectively, for which motion can be reliably detected. Dmax and threshold coherence (i.e., the smallest proportion of dots that must be moved between frames for motion to be reliably detected) in RDKs are known to be affected by false pairing, or correspondence noise. Here the roles of correspondence noise and receptive field geometry in limiting performance are investigated. The range of Dmax observed in the literature is consistent with the current information-limit-based interpretation. Dmin is interpreted in the light of correspondence noise and under-sampling. Based on the psychophysical experiments performed in the early parts of the dissertation, a model for correspondence noise based on the principle of receptive field scaling is developed for Dmax. Model simulations provide a good account of psychophysically estimated Dmax over a range of stimulus parameters, showing that correspondence noise and receptive field geometry have a major influence on displacement thresholds.
|
9 |
[en] MACHINE LEARNING METHODS APPLIED TO PREDICTIVE MODELS OF CHURN FOR LIFE INSURANCE / [pt] MÉTODOS DE MACHINE LEARNING APLICADOS À MODELAGEM PREDITIVA DE CANCELAMENTOS DE CLIENTES PARA SEGUROS DE VIDA
THAIS TUYANE DE AZEVEDO (26 September 2018)
The purpose of this study is to explore the churn problem in life insurance, that is, to predict whether a client will cancel the product within the next 6 months. Machine learning methods are becoming popular for this type of analysis, as an alternative to the traditional approach of modeling the probability of cancellation with logistic regression. One of the challenges in this type of modeling is that the proportion of clients who cancel the service is relatively small. To address this, the study used balancing techniques to treat the naturally imbalanced dataset: under-sampling, over-sampling, and different combinations of the two were applied and compared with each other. The resulting datasets were used to train Bagging, Random Forest, and Boosting models, and their results were compared with each other and with those of a logistic regression model. The modified SMOTE balancing technique, applied to the Bagging model, gave the best results among the combinations explored.
|