Global ETD Search

1	A Combined Approach to Handle Multi-class Imbalanced Data and to Adapt Concept Drifts using Machine Learning Tumati, Saini 05 October 2021 (has links) No description available. Computer Science Imbalanced datasets Multi-class imbalanced datasets Oversampling Concept drifts Machine learning ensemble learning
2	Multivariate non-parametric statistical tests to reuse classifiers in recurring concept drifting environments GONÇALVES JÚNIOR, Paulo Mauricio 23 April 2013 (has links) Data streams are a recent processing model where data arrive continuously, in large quantities, at high speeds, so that they must be processed on-line. Besides that, several private and public institutions store large amounts of data that also must be processed. Traditional batch classi ers are not well suited to handle huge amounts of data for basically two reasons. First, they usually read the available data several times until convergence, which is impractical in this scenario. Second, they imply that the context represented by data is stable in time, which may not be true. In fact, the context change is a common situation in data streams, and is named concept drift. This thesis presents rcd, a framework that o ers an alternative approach to handle data streams that su er from recurring concept drifts. It creates a new classi er to each context found and stores a sample of the data used to build it. When a new concept drift occurs, rcd compares the new context to old ones using a non-parametric multivariate statistical test to verify if both contexts come from the same distribution. If so, the corresponding classi er is reused. If not, a new classi er is generated and stored. Three kinds of tests were performed. One compares the rcd framework with several adaptive algorithms (among single and ensemble approaches) in arti cial and real data sets, among the most used in the concept drift research area, with abrupt and gradual concept drifts. It is observed the ability of the classi ers in representing each context, how they handle concept drift, and training and testing times needed to evaluate the data sets. Results indicate that rcd had similar or better statistical results compared to the other classi ers. In the real-world data sets, rcd presented accuracies close to the best classi er in each data set. Another test compares two statistical tests (knn and Cramer) in their capability in representing and identifying contexts. Tests were performed using adaptive and batch classi ers as base learners of rcd, in arti cial and real-world data sets, with several rates-of-change. Results indicate that, in average, knn had better results compared to the Cramer test, and was also faster. Independently of the test used, rcd had higher accuracy values compared to their respective base learners. It is also presented an improvement in the rcd framework where the statistical tests are performed in parallel through the use of a thread pool. Tests were performed in three processors with di erent numbers of cores. Better results were obtained when there was a high number of detected concept drifts, the bu er size used to represent each data distribution was large, and there was a high test frequency. Even if none of these conditions apply, parallel and sequential execution still have very similar performances. Finally, a comparison between six di erent drift detection methods was also performed, comparing the predictive accuracies, evaluation times, and drift handling, including false alarm and miss detection rates, as well as the average distance to the drift point and its standard deviation. / Submitted by João Arthur Martins (joao.arthur@ufpe.br) on 2015-03-12T18:02:08Z No. of bitstreams: 2 Tese Paulo Gonçalves Jr..pdf: 2957463 bytes, checksum: de163caadf10cbd5442e145778865224 (MD5) license_rdf: 1232 bytes, checksum: 66e71c371cc565284e70f40736c94386 (MD5) / Made available in DSpace on 2015-03-12T18:02:08Z (GMT). No. of bitstreams: 2 Tese Paulo Gonçalves Jr..pdf: 2957463 bytes, checksum: de163caadf10cbd5442e145778865224 (MD5) license_rdf: 1232 bytes, checksum: 66e71c371cc565284e70f40736c94386 (MD5) Previous issue date: 2013-04-23 / Fluxos de dados s~ao um modelo de processamento de dados recente, onde os dados chegam continuamente, em grandes quantidades, a altas velocidades, de modo que eles devem ser processados em tempo real. Al em disso, v arias institui c~oes p ublicas e privadas armazenam grandes quantidades de dados que tamb em devem ser processadas. Classi cadores tradicionais n~ao s~ao adequados para lidar com grandes quantidades de dados por basicamente duas raz~oes. Primeiro, eles costumam ler os dados dispon veis v arias vezes at e convergirem, o que e impratic avel neste cen ario. Em segundo lugar, eles assumem que o contexto representado por dados e est avel no tempo, o que pode n~ao ser verdadeiro. Na verdade, a mudan ca de contexto e uma situa c~ao comum em uxos de dados, e e chamado de mudan ca de conceito. Esta tese apresenta o rcd, uma estrutura que oferece uma abordagem alternativa para lidar com os uxos de dados que sofrem de mudan cas de conceito recorrentes. Ele cria um novo classi cador para cada contexto encontrado e armazena uma amostra dos dados usados para constru -lo. Quando uma nova mudan ca de conceito ocorre, rcd compara o novo contexto com os antigos, utilizando um teste estat stico n~ao param etrico multivariado para veri car se ambos os contextos prov^em da mesma distribui c~ao. Se assim for, o classi cador correspondente e reutilizado. Se n~ao, um novo classi cador e gerado e armazenado. Tr^es tipos de testes foram realizados. Um compara o rcd com v arios algoritmos adaptativos (entre as abordagens individuais e de agrupamento) em conjuntos de dados arti ciais e reais, entre os mais utilizados na area de pesquisa de mudan ca de conceito, com mudan cas bruscas e graduais. E observada a capacidade dos classi cadores em representar cada contexto, como eles lidam com as mudan cas de conceito e os tempos de treinamento e teste necess arios para avaliar os conjuntos de dados. Os resultados indicam que rcd teve resultados estat sticos semelhantes ou melhores, em compara c~ao com os outros classi cadores. Nos conjuntos de dados do mundo real, rcd apresentou precis~oes pr oximas do melhor classi cador em cada conjunto de dados. Outro teste compara dois testes estat sticos (knn e Cramer) em suas capacidades de representar e identi car contextos. Os testes foram realizados utilizando classi cadores xi xii RESUMO tradicionais e adaptativos como base do rcd, em conjuntos de dados arti ciais e do mundo real, com v arias taxas de varia c~ao. Os resultados indicam que, em m edia, KNN obteve melhores resultados em compara c~ao com o teste de Cramer, al em de ser mais r apido. Independentemente do crit erio utilizado, rcd apresentou valores mais elevados de precis~ao em compara c~ao com seus respectivos classi cadores base. Tamb em e apresentada uma melhoria do rcd onde os testes estat sticos s~ao executadas em paralelo por meio do uso de um pool de threads. Os testes foram realizados em tr^es processadores com diferentes n umeros de n ucleos. Melhores resultados foram obtidos quando houve um elevado n umero de mudan cas de conceito detectadas, o tamanho das amostras utilizadas para representar cada distribui c~ao de dados era grande, e havia uma alta freq u^encia de testes. Mesmo que nenhuma destas condi c~oes se aplicam, a execu c~ao paralela e seq uencial ainda t^em performances muito semelhantes. Finalmente, uma compara c~ao entre seis diferentes m etodos de detec c~ao de mudan ca de conceito tamb em foi realizada, comparando a precis~ao, os tempos de avalia c~ao, manipula c~ao das mudan cas de conceito, incluindo as taxas de falsos positivos e negativos, bem como a m edia da dist^ancia ao ponto de mudan ca e o seu desvio padr~ao. Fluxos de dados Contextos recorrentes Aprendizado em tempo real Data streams Concept drifts Recurring contexts on-line learning
3	Multivariate non-parametric statistical tests to reuse classifiers in recurring concept drifting environments Gonçalves Júnior, Paulo Mauricio 23 April 2013 (has links) Data streams are a recent processing model where data arrive continuously, in large quantities, at high speeds, so that they must be processed on-line. Besides that, several private and public institutions store large amounts of data that also must be processed. Traditional batch classi ers are not well suited to handle huge amounts of data for basically two reasons. First, they usually read the available data several times until convergence, which is impractical in this scenario. Second, they imply that the context represented by data is stable in time, which may not be true. In fact, the context change is a common situation in data streams, and is named concept drift. This thesis presents rcd, a framework that o ers an alternative approach to handle data streams that su er from recurring concept drifts. It creates a new classi er to each context found and stores a sample of the data used to build it. When a new concept drift occurs, rcd compares the new context to old ones using a non-parametric multivariate statistical test to verify if both contexts come from the same distribution. If so, the corresponding classi er is reused. If not, a new classi er is generated and stored. Three kinds of tests were performed. One compares the rcd framework with several adaptive algorithms (among single and ensemble approaches) in arti cial and real data sets, among the most used in the concept drift research area, with abrupt and gradual concept drifts. It is observed the ability of the classi ers in representing each context, how they handle concept drift, and training and testing times needed to evaluate the data sets. Results indicate that rcd had similar or better statistical results compared to the other classi ers. In the real-world data sets, rcd presented accuracies close to the best classi er in each data set. Another test compares two statistical tests (knn and Cramer) in their capability in representing and identifying contexts. Tests were performed using adaptive and batch classi ers as base learners of rcd, in arti cial and real-world data sets, with several rates-of-change. Results indicate that, in average, knn had better results compared to the Cramer test, and was also faster. Independently of the test used, rcd had higher accuracy values compared to their respective base learners. It is also presented an improvement in the rcd framework where the statistical tests are performed in parallel through the use of a thread pool. Tests were performed in three processors with di erent numbers of cores. Better results were obtained when there was a high number of detected concept drifts, the bu er size used to represent each data distribution was large, and there was a high test frequency. Even if none of these conditions apply, parallel and sequential execution still have very similar performances. Finally, a comparison between six di erent drift detection methods was also performed, comparing the predictive accuracies, evaluation times, and drift handling, including false alarm and miss detection rates, as well as the average distance to the drift point and its standard deviation. / Submitted by João Arthur Martins (joao.arthur@ufpe.br) on 2015-03-12T19:25:11Z No. of bitstreams: 2 license_rdf: 1232 bytes, checksum: 66e71c371cc565284e70f40736c94386 (MD5) tese Paulo Mauricio Gonçalves Jr..pdf: 2957463 bytes, checksum: de163caadf10cbd5442e145778865224 (MD5) / Made available in DSpace on 2015-03-12T19:25:11Z (GMT). No. of bitstreams: 2 license_rdf: 1232 bytes, checksum: 66e71c371cc565284e70f40736c94386 (MD5) tese Paulo Mauricio Gonçalves Jr..pdf: 2957463 bytes, checksum: de163caadf10cbd5442e145778865224 (MD5) Previous issue date: 2013-04-23 / Fluxos de dados s~ao um modelo de processamento de dados recente, onde os dados chegam continuamente, em grandes quantidades, a altas velocidades, de modo que eles devem ser processados em tempo real. Al em disso, v arias institui c~oes p ublicas e privadas armazenam grandes quantidades de dados que tamb em devem ser processadas. Classi cadores tradicionais n~ao s~ao adequados para lidar com grandes quantidades de dados por basicamente duas raz~oes. Primeiro, eles costumam ler os dados dispon veis v arias vezes at e convergirem, o que e impratic avel neste cen ario. Em segundo lugar, eles assumem que o contexto representado por dados e est avel no tempo, o que pode n~ao ser verdadeiro. Na verdade, a mudan ca de contexto e uma situa c~ao comum em uxos de dados, e e chamado de mudan ca de conceito. Esta tese apresenta o rcd, uma estrutura que oferece uma abordagem alternativa para lidar com os uxos de dados que sofrem de mudan cas de conceito recorrentes. Ele cria um novo classi cador para cada contexto encontrado e armazena uma amostra dos dados usados para constru -lo. Quando uma nova mudan ca de conceito ocorre, rcd compara o novo contexto com os antigos, utilizando um teste estat stico n~ao param etrico multivariado para veri car se ambos os contextos prov^em da mesma distribui c~ao. Se assim for, o classi cador correspondente e reutilizado. Se n~ao, um novo classi cador e gerado e armazenado. Tr^es tipos de testes foram realizados. Um compara o rcd com v arios algoritmos adaptativos (entre as abordagens individuais e de agrupamento) em conjuntos de dados arti ciais e reais, entre os mais utilizados na area de pesquisa de mudan ca de conceito, com mudan cas bruscas e graduais. E observada a capacidade dos classi cadores em representar cada contexto, como eles lidam com as mudan cas de conceito e os tempos de treinamento e teste necess arios para avaliar os conjuntos de dados. Os resultados indicam que rcd teve resultados estat sticos semelhantes ou melhores, em compara c~ao com os outros classi cadores. Nos conjuntos de dados do mundo real, rcd apresentou precis~oes pr oximas do melhor classi cador em cada conjunto de dados. Outro teste compara dois testes estat sticos (knn e Cramer) em suas capacidades de representar e identi car contextos. Os testes foram realizados utilizando classi cadores tradicionais e adaptativos como base do rcd, em conjuntos de dados arti ciais e do mundo real, com v arias taxas de varia c~ao. Os resultados indicam que, em m edia, KNN obteve melhores resultados em compara c~ao com o teste de Cramer, al em de ser mais r apido. Independentemente do crit erio utilizado, rcd apresentou valores mais elevados de precis~ao em compara c~ao com seus respectivos classi cadores base. Tamb em e apresentada uma melhoria do rcd onde os testes estat sticos s~ao executadas em paralelo por meio do uso de um pool de threads. Os testes foram realizados em tr^es processadores com diferentes n umeros de n ucleos. Melhores resultados foram obtidos quando houve um elevado n umero de mudan cas de conceito detectadas, o tamanho das amostras utilizadas para representar cada distribui c~ao de dados era grande, e havia uma alta freq u^encia de testes. Mesmo que nenhuma destas condi c~oes se aplicam, a execu c~ao paralela e seq uencial ainda t^em performances muito semelhantes. Finalmente, uma compara c~ao entre seis diferentes m etodos de detec c~ao de mudan ca de conceito tamb em foi realizada, comparando a precis~ao, os tempos de avalia c~ao, manipula c~ao das mudan cas de conceito, incluindo as taxas de falsos positivos e negativos, bem como a m edia da dist^ancia ao ponto de mudan ca e o seu desvio padr~ao. Fluxos de dados Mudan ças de conceito Contextos recorrentes Aprendizado em tempo real Data streams Concept drifts Recurring contexts on-line learning
4	Towards on-line domain-independent big data learning : novel theories and applications Malik, Zeeshan January 2015 (has links) Feature extraction is an extremely important pre-processing step to pattern recognition, and machine learning problems. This thesis highlights how one can best extract features from the data in an exhaustively online and purely adaptive manner. The solution to this problem is given for both labeled and unlabeled datasets, by presenting a number of novel on-line learning approaches. Specifically, the differential equation method for solving the generalized eigenvalue problem is used to derive a number of novel machine learning and feature extraction algorithms. The incremental eigen-solution method is used to derive a novel incremental extension of linear discriminant analysis (LDA). Further the proposed incremental version is combined with extreme learning machine (ELM) in which the ELM is used as a preprocessor before learning. In this first key contribution, the dynamic random expansion characteristic of ELM is combined with the proposed incremental LDA technique, and shown to offer a significant improvement in maximizing the discrimination between points in two different classes, while minimizing the distance within each class, in comparison with other standard state-of-the-art incremental and batch techniques. In the second contribution, the differential equation method for solving the generalized eigenvalue problem is used to derive a novel state-of-the-art purely incremental version of slow feature analysis (SLA) algorithm, termed the generalized eigenvalue based slow feature analysis (GENEIGSFA) technique. Further the time series expansion of echo state network (ESN) and radial basis functions (EBF) are used as a pre-processor before learning. In addition, the higher order derivatives are used as a smoothing constraint in the output signal. Finally, an online extension of the generalized eigenvalue problem, derived from James Stone’s criterion, is tested, evaluated and compared with the standard batch version of the slow feature analysis technique, to demonstrate its comparative effectiveness. In the third contribution, light-weight extensions of the statistical technique known as canonical correlation analysis (CCA) for both twinned and multiple data streams, are derived by using the same existing method of solving the generalized eigenvalue problem. Further the proposed method is enhanced by maximizing the covariance between data streams while simultaneously maximizing the rate of change of variances within each data stream. A recurrent set of connections used by ESN are used as a pre-processor between the inputs and the canonical projections in order to capture shared temporal information in two or more data streams. A solution to the problem of identifying a low dimensional manifold on a high dimensional dataspace is then presented in an incremental and adaptive manner. Finally, an online locally optimized extension of Laplacian Eigenmaps is derived termed the generalized incremental laplacian eigenmaps technique (GENILE). Apart from exploiting the benefit of the incremental nature of the proposed manifold based dimensionality reduction technique, most of the time the projections produced by this method are shown to produce a better classification accuracy in comparison with standard batch versions of these techniques - on both artificial and real datasets. 006.3

1

Page generated in 0.0408 seconds