  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
11

Ensembles in relational classification

Murrugarra Llerena, Nils Ever 08 September 2011
In many domains, besides information about the objects or entities that compose them, there is also information about the relationships between those objects; co-authorship networks and Web pages are two examples. It is therefore natural to look for classification techniques that take this relational information into account. Among them are the so-called graph-based classifiers, which classify examples by considering the relationships between them. This work develops methods to improve the performance of graph-based classifiers using ensemble strategies. An ensemble classifier combines the individual predictions of a set of classifiers in some way, and usually performs better than its individual members. Three techniques were developed: the first for data originally in propositional format and transformed into a graph-based relational format, and the second and third for data already in graph format. The first technique, inspired by boosting, originated the Adaptive Graph-Based K-Nearest Neighbor algorithm (A-KNN). The second, inspired by bagging, originated three approaches of Graph-Based Bagging (BG). Finally, the third, inspired by Cross-Validated Committees, originated Graph-Based Cross-Validated Committees (CVCG). Experiments were performed on 38 datasets, 22 propositional and 16 relational, evaluated with 10-fold stratified cross-validation; statistical differences between classifiers were assessed with the method proposed by Demsar (2006). All three techniques improved or maintained the performance of the base classifiers.
In conclusion, ensembles applied to graph-based classifiers yield good performance results.
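The combination step described above, individual predictions merged "in some way", can be sketched as a simple majority vote. This is an illustrative sketch only, not the thesis's A-KNN, BG or CVCG algorithms, and the classifiers and labels are made up:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier predictions for one example by majority vote."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical base classifiers predict labels for four examples.
base_predictions = [
    ["A", "B", "A", "C"],  # classifier 1
    ["A", "B", "B", "C"],  # classifier 2
    ["B", "B", "A", "A"],  # classifier 3
]

# Transpose so each inner tuple holds all votes for one example.
ensemble = [majority_vote(votes) for votes in zip(*base_predictions)]
print(ensemble)  # ['A', 'B', 'A', 'C']
```

On the third example the classifiers disagree, and the vote resolves the conflict in favor of the majority, which is the basic intuition behind why an ensemble often beats its individual members.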
12

A credit scoring model for microcredit: an innovation in the Brazilian market

Siqueira, Vânia Rosatti de 10 February 2011
The Grameen Bank's experiences with microcredit operations have been reproduced in various countries, especially the two great innovations in this market: the credit agent's role and the solidary group mechanism. Massification of the operations and reduction of their costs are vital for economies of scale to be achieved, and for MFIs to have a greater appetite to expand their activity in the microcredit market. In this context, the next great innovation in the microcredit market will be the introduction of credit scoring models into such operations, which will speed up the process, reduce risks and, consequently, reduce costs. A credit model was built from historical information about microcredit operations, making it possible to identify key variables that help distinguish good borrowers from bad ones. The results show that coupling the machine learning techniques bagging and boosting to the traditional methods of credit analysis (discriminant analysis and logistic regression) improves the performance of credit scoring models for microcredit.
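The coupling the abstract describes, bagging wrapped around a logistic-regression base learner, can be sketched with scikit-learn. The data below is synthetic, not the thesis's microcredit portfolio, and the accuracy gap will differ from the study's findings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a borrower dataset (features, good/bad label).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

base = LogisticRegression(max_iter=1000)
bagged = BaggingClassifier(LogisticRegression(max_iter=1000),
                           n_estimators=50, random_state=0)

for name, model in [("logistic", base), ("bagged logistic", bagged)]:
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {acc:.3f}")
```

Each of the 50 bagged members is fit on a bootstrap resample of the training folds, and their predictions are combined by voting.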
14

COMBINING TO SUCCEED: A NOVEL STRATEGY TO IMPROVE FORECASTS FROM EXPONENTIAL SMOOTHING MODELS

TIAGO MENDES DANTAS 04 February 2019
This thesis is set in the context of time series forecasting. Although many approaches have been developed, simple methods such as exponential smoothing usually produce extremely competitive results, often surpassing approaches of greater complexity. Seminal papers in the field showed that combining forecasts can dramatically reduce forecast error, and the combination of forecasts generated by exponential smoothing has been explored in recent papers. Although this combination can be done in many ways, a recently proposed method called Bagged.BLD.MBB.ETS uses Bootstrap Aggregating (bagging) together with exponential smoothing methods to generate forecasts, producing more accurate monthly forecasts than all the analyzed benchmarks; it was considered the state of the art in combining bagging and exponential smoothing until the results obtained in this thesis. The thesis first validates Bagged.BLD.MBB.ETS on a data set relevant from the point of view of a real application, thus expanding the methodology's fields of application. Subsequently, relevant reasons for the error reduction are identified, and a new methodology combining bagging, exponential smoothing and clustering is proposed to treat the covariance effect, not previously identified in the literature on the method. The proposed approach was tested on different types of time series from the M3, CIF 2016 and M4 competitions, as well as on simulated data. The empirical results point to a substantial reduction in variance and forecast error.
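The core bagging idea behind Bagged.BLD.MBB.ETS can be sketched as follows: resample the series with a moving block bootstrap (MBB), forecast each replicate with exponential smoothing, and average the forecasts. This is a deliberately minimal sketch; the actual method also applies a Box-Cox transform and an STL-style decomposition before the MBB, which are omitted here, and the series is simulated:

```python
import numpy as np

rng = np.random.default_rng(0)

def ses_forecast(series, alpha=0.3):
    """One-step-ahead simple exponential smoothing forecast."""
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

def mbb_resample(series, block_size, rng):
    """Moving block bootstrap: concatenate randomly chosen blocks."""
    n = len(series)
    starts = rng.integers(0, n - block_size + 1, size=n // block_size + 1)
    blocks = [series[s:s + block_size] for s in starts]
    return np.concatenate(blocks)[:n]

# Simulated noisy seasonal-looking series.
series = np.sin(np.arange(60) / 6.0) + rng.normal(0, 0.1, 60)
forecasts = [ses_forecast(mbb_resample(series, 8, rng)) for _ in range(100)]
print(f"bagged forecast: {np.mean(forecasts):.3f}")
```

Block resampling preserves short-range dependence inside each block, which is why MBB is preferred over i.i.d. resampling for time series.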
15

Credit risk classification models for real estate financing: logistic regression, discriminant analysis, decision trees, bagging and boosting

Lopes, Neilson Soares 08 August 2011
Fundo Mackenzie de Pesquisa
This study applied the traditional parametric techniques of discriminant analysis and logistic regression to the credit analysis of real estate financing operations in which borrowers may or may not also hold a payroll loan. The hit rates of these methods were compared with those of non-parametric techniques based on classification trees, and with the meta-learning methods bagging and boosting, which combine classifiers to improve accuracy. In a context of a large housing deficit, especially in Brazil, real estate financing can still be greatly expanded, and sustainable growth in mortgage credit brings social as well as economic benefits: housing is, for most individuals, the largest source of expenditure and the most valuable asset they will own during their lifetime. The study concluded that decision trees are the most effective technique for predicting bad payers (94.2% correct), followed by bagging (80.7%) and boosting (or arcing, 75.2%); logistic regression and discriminant analysis showed the worst results (74.6% and 70.7%, respectively). For good payers, the decision tree again showed the best predictive power (75.8%), followed by discriminant analysis (75.3%) and boosting (72.9%); bagging and logistic regression showed the worst results (72.1% and 71.7%, respectively). The logistic regression also shows that, for a borrower with a payroll loan, the odds of being a bad payer are 2.19 times higher than for a borrower without one. The presence of payroll loans among the operations of mortgage borrowers is also relevant in the discriminant analysis.
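The 2.19 figure above is the standard odds-ratio reading of a logistic-regression coefficient on a binary predictor: the odds ratio is exp(beta). A sketch of that reading on synthetic data (the predictor, prevalence and sample are invented, only the 2.19 target is taken from the abstract):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
payroll = rng.integers(0, 2, n)           # hypothetical binary predictor
logit = -1.0 + np.log(2.19) * payroll     # planted true odds ratio of 2.19
default = rng.random(n) < 1 / (1 + np.exp(-logit))

model = LogisticRegression().fit(payroll.reshape(-1, 1), default)
print(f"estimated odds ratio: {np.exp(model.coef_[0][0]):.2f}")
```

The fitted coefficient recovers roughly log(2.19), so exponentiating it gives back an odds ratio close to the planted value, up to sampling noise and the mild shrinkage of scikit-learn's default L2 penalty.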
16

Investigation of training data issues in ensemble classification based on margin concept: application to land cover mapping

Feng, Wei 19 July 2017
Classification has been widely studied in machine learning. Ensemble methods, which build a classification model by integrating multiple component learners, achieve higher performance than a single classifier. The classification accuracy of an ensemble is directly influenced by the quality of the training data used, yet real-world data often suffer from class noise and class imbalance. The ensemble margin is a key concept in ensemble learning: it has been applied both to theoretical analysis and to the design of machine learning algorithms, and several studies have shown that the generalization performance of an ensemble classifier is related to the distribution of its margins on the training examples.
This work exploits the margin concept to improve the quality of the training set, and thereby the classification accuracy of noise-sensitive classifiers, and to design effective ensemble classifiers that can handle imbalanced datasets. A novel ensemble margin definition is proposed: an unsupervised version of a popular ensemble margin that does not involve the class labels. Mislabeled training data is a challenge for building a robust classifier, ensemble or not. To handle the mislabeling problem, an ensemble margin-based class noise identification and elimination method is proposed, built on an existing margin-based ordering of noisy instances. It can achieve a high mislabeled-instance detection rate while keeping the false detection rate as low as possible; it relies on the margin values of misclassified data, considering four different ensemble margins, including the newly proposed one. The method is then extended to class noise correction, a more challenging issue. Instances with low margins matter more than safe, high-margin samples when building a reliable classifier, so a novel bagging algorithm based on a data importance evaluation function, again relying on the ensemble margin, is proposed to deal with the class imbalance problem; the emphasis is placed on the lowest-margin samples. This method is also evaluated with the four different ensemble margins, especially on multi-class imbalanced data. In remote sensing, where training data typically come from field measurements, mislabeling is inevitable, and imbalanced training data is another frequently encountered problem. Both proposed ensemble methods, each with the margin definition best suited to its training data issue, are applied to land cover mapping.
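An unsupervised ensemble margin of the kind described, one computed from the members' votes alone, with no reference to the true label, can be sketched as the normalized gap between the votes for the most-voted and second-most-voted classes. The exact formula is an assumption for illustration, not necessarily the thesis's definition:

```python
import numpy as np

def unsupervised_margin(votes):
    """votes: 1-D int array of class predictions from the ensemble members."""
    counts = np.sort(np.bincount(votes))[::-1]
    second = counts[1] if len(counts) > 1 else 0
    return (counts[0] - second) / votes.size

# Votes from 10 hypothetical ensemble members on three examples.
votes_per_example = [
    np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1]),  # high margin: near consensus
    np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1]),  # zero margin: full conflict
    np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2]),  # low margin: three-way split
]
margins = [float(unsupervised_margin(v)) for v in votes_per_example]
print(margins)  # [0.8, 0.0, 0.1]
```

A noise-filtering scheme in the spirit of the abstract would rank training instances by such margins and inspect or drop the lowest-margin ones, since heavy disagreement among members often flags mislabeled or otherwise problematic examples.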
17

GETTING THE MOST OUT OF THE WISDOM OF THE CROWDS: IMPROVING FORECASTING PERFORMANCE THROUGH ENSEMBLE METHODS AND VARIABLE SELECTION TECHNIQUES

ERICK MEIRA DE OLIVEIRA 03 June 2020
This research focuses on the development of hybrid approaches that combine ensemble-based supervised machine learning techniques and time series methods to obtain accurate forecasts for a wide range of variables and processes.
It also includes the development of smart selection heuristics, i.e., procedures that can select, from the pool of forecasts originated via ensemble methods, those with the greatest potential of delivering accurate forecasts after aggregation. Such combinatorial approaches allow the forecasting practitioner to deal with the different stylized facts that may be present in time series, such as nonlinearities, stochastic components, heteroscedasticity and structural breaks, and to deliver satisfactory forecasting results, outperforming benchmarks on many occasions. The thesis is divided into a series of essays. The first proposed an alternative method for generating ensemble forecasts, which delivered satisfactory results for certain types of electricity consumption time series. In a second effort, a novel forecasting approach combining Bootstrap Aggregating (bagging) algorithms, time series methods and regularization techniques was introduced to obtain accurate forecasts of natural gas consumption and energy supplied series across different countries. A new variant of bagging, in which the set of classifiers is built by means of a Maximum Entropy Bootstrap routine, was also put forth. The third contribution brought a series of innovations to model selection and model combination in forecasting routines. Gains in accuracy, for both point forecasts and prediction intervals, were demonstrated through an extensive empirical experiment on a wide range of series from the M-Competitions.
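A selection-then-combination heuristic in the spirit described, rank the pool of forecasts on a validation window, keep the best few, average them, can be sketched as follows. The pool, the noise levels and the keep-3 threshold are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
truth = np.sin(np.arange(30) / 4.0)

# Hypothetical pool of 8 forecasters: truth plus increasing noise.
pool = [truth + rng.normal(0, 0.05 * (i + 1), truth.size) for i in range(8)]

val, hold = slice(0, 20), slice(20, 30)
errors = [np.mean((f[val] - truth[val]) ** 2) for f in pool]
top_k = np.argsort(errors)[:3]                 # keep the 3 best on validation

combined = np.mean([pool[i][hold] for i in top_k], axis=0)
full_pool = np.mean([f[hold] for f in pool], axis=0)
print(f"selected-pool MSE: {np.mean((combined - truth[hold]) ** 2):.4f}")
print(f"full-pool MSE:     {np.mean((full_pool - truth[hold]) ** 2):.4f}")
```

Because the noisiest members drag down a plain average, pruning them before aggregation typically lowers the hold-out error, which is the intuition behind selecting before combining.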
18

Bias reduction studies in nonparametric regression with applications: an empirical approach / Marike Krugell

Krugell, Marike January 2014
The purpose of this study is to determine the effect of three improvement methods on nonparametric kernel regression estimators. The improvement methods are applied to the Nadaraya-Watson estimator with cross-validation bandwidth selection, the Nadaraya-Watson estimator with plug-in bandwidth selection, the local linear estimator with plug-in bandwidth selection, and a bias-corrected nonparametric estimator proposed by Yao (2012). The different resulting regression estimates are evaluated by minimising a global discrepancy measure, the mean integrated squared error (MISE). In the machine learning context, various methods exist for improving the precision and accuracy of an estimator. The first two improvement methods introduced in this study are bootstrap-based. Bagging is an acronym for bootstrap aggregating and was introduced by Breiman (1996a) from a machine learning viewpoint and by Swanepoel (1988, 1990) in a functional context. Bagging is primarily a variance reduction tool: it is implemented to reduce the variance of an estimator and in this way improve the precision of the estimation process. Bagging is performed by drawing repeated bootstrap samples from the original sample and generating multiple versions of an estimator; these replicates are then used to obtain an aggregated estimator. Bragging stands for bootstrap robust aggregating: a robust estimator is obtained by taking the sample median over the B bootstrap estimates instead of the sample mean as in bagging. The third improvement method aims to reduce the bias component of the estimator and is referred to as boosting. Boosting is a general method for improving the accuracy of any given learning algorithm: it starts with a sensible estimator and improves it iteratively, based on its performance on a training dataset. Results and conclusions verifying the existing literature are provided, as well as new results for the new methods.
MSc (Statistics), North-West University, Potchefstroom Campus, 2015
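The bagging/bragging contrast in the abstract above maps directly onto code: build bootstrap replicates of a Nadaraya-Watson estimate, then aggregate with the mean (bagging) or the median (bragging). A minimal sketch on simulated data, with a Gaussian kernel and an arbitrarily fixed bandwidth rather than the cross-validation or plug-in selectors the study uses:

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_eval, h):
    """Gaussian-kernel Nadaraya-Watson regression estimate at x_eval."""
    w = np.exp(-0.5 * ((x_eval[:, None] - x_train[None, :]) / h) ** 2)
    return (w @ y_train) / w.sum(axis=1)

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 200)
grid = np.linspace(0.05, 0.95, 50)

B = 100
reps = np.empty((B, grid.size))
for b in range(B):
    idx = rng.integers(0, x.size, x.size)       # bootstrap resample
    reps[b] = nadaraya_watson(x[idx], y[idx], grid, h=0.05)

bagged = reps.mean(axis=0)        # bagging: mean over bootstrap estimates
bragged = np.median(reps, axis=0)  # bragging: robust median aggregation
```

The median aggregation is less sensitive to the occasional wild bootstrap replicate, which is exactly the robustness argument made for bragging.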
19

Bias reduction studies in nonparametric regression with applications : an empirical approach / Marike Krugell

Krugell, Marike January 2014 (has links)
The purpose of this study is to determine the effect of three improvement methods on nonparametric kernel regression estimators. The improvement methods are applied to the Nadaraya-Watson estimator with crossvalidation bandwidth selection, the Nadaraya-Watson estimator with plug-in bandwidth selection, the local linear estimator with plug-in bandwidth selection and a bias corrected nonparametric estimator proposed by Yao (2012). The di erent resulting regression estimates are evaluated by minimising a global discrepancy measure, i.e. the mean integrated squared error (MISE). In the machine learning context various improvement methods, in terms of the precision and accuracy of an estimator, exist. The rst two improvement methods introduced in this study are bootstrapped based. Bagging is an acronym for bootstrap aggregating and was introduced by Breiman (1996a) from a machine learning viewpoint and by Swanepoel (1988, 1990) in a functional context. Bagging is primarily a variance reduction tool, i.e. bagging is implemented to reduce the variance of an estimator and in this way improve the precision of the estimation process. Bagging is performed by drawing repetitive bootstrap samples from the original sample and generating multiple versions of an estimator. These replicates of the estimator are then used to obtain an aggregated estimator. Bragging stands for bootstrap robust aggregating. A robust estimator is obtained by using the sample median over the B bootstrap estimates instead of the sample mean as in bagging. The third improvement method aims to reduce the bias component of the estimator and is referred to as boosting. Boosting is a general method for improving the accuracy of any given learning algorithm. The method starts of with a sensible estimator and improves iteratively, based on its performance on a training dataset. Results and conclusions verifying existing literature are provided, as well as new results for the new methods. 
/ MSc (Statistics), North-West University, Potchefstroom Campus, 2015
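The bagging and bragging schemes described in the abstract — draw B bootstrap samples, recompute the Nadaraya-Watson estimate on each, then aggregate by the sample mean (bagging) or sample median (bragging) — can be sketched as follows. This is a minimal single-bandwidth illustration with a Gaussian kernel; the function names and fixed bandwidth are illustrative choices, not the estimators or bandwidth selectors studied in the thesis.

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_eval, h):
    """Nadaraya-Watson regression estimate with a Gaussian kernel and bandwidth h."""
    # Kernel weights between every evaluation point and every training point.
    w = np.exp(-0.5 * ((x_eval[:, None] - x_train[None, :]) / h) ** 2)
    return (w @ y_train) / w.sum(axis=1)

def bagged_nw(x, y, x_eval, h, B=200, robust=False, rng=None):
    """Aggregate B bootstrap replicates of the Nadaraya-Watson estimator.

    robust=False -> bagging  (sample mean over the bootstrap estimates)
    robust=True  -> bragging (sample median over the bootstrap estimates)
    """
    rng = np.random.default_rng(rng)
    n = len(x)
    reps = np.empty((B, len(x_eval)))
    for b in range(B):
        idx = rng.integers(0, n, size=n)  # bootstrap resample with replacement
        reps[b] = nadaraya_watson(x[idx], y[idx], x_eval, h)
    return np.median(reps, axis=0) if robust else reps.mean(axis=0)
```

Because each replicate is a full re-estimation on a resampled dataset, averaging them smooths out the estimator's sampling variability (the variance-reduction effect the abstract describes), while the median variant trades a little efficiency for robustness to outlying replicates.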
20

Big Data : le nouvel enjeu de l'apprentissage à partir des données massives / Big Data: the new challenge of learning from massive data

Adjout Rehab, Moufida 01 April 2016 (has links)
The combination of globalization and the continuous development of information technology has led to an explosion in the volume of available data. Capacities for producing, storing, and processing data have crossed such a threshold that a new term has emerged: Big Data. The growth in the quantities of data to be considered requires new processing tools. Classical learning tools are poorly suited to this change in scale, both in computational complexity and in processing time; processing is usually centralized and sequential, which makes learning methods dependent on the capacity of the machine used. Consequently, the difficulties in analysing a large dataset are manifold. In this thesis, we focus on the problems encountered by supervised learning on large volumes of data. To meet these new challenges, new processes and methods must be developed to make the best use of all available data. The objective of this thesis is to explore the design of scalable versions of these classical methods, distributing both computation and data to increase the capacity of the approaches without harming their accuracy. Our contribution consists of two parts, each proposing a new learning approach for massive data processing. Both contributions fall within supervised predictive learning from large data, namely Multiple Linear Regression and ensemble methods such as Bagging. The first contribution, named MLR-MR, addresses the scalability of Multiple Linear Regression by distributing the computation over a cluster of machines. The aim is to optimize the processing and the induced computational load without changing the underlying computation principle (QR factorization), so that the same coefficients as the classical method are obtained. The second contribution, called "Bagging MR_PR_D" (Bagging based Map Reduce with Distributed PRuning), implements a scalable version of Bagging with distributed processing at two levels: model training and model pruning. Its goal is to design an algorithm that is efficient and scalable across all processing phases (training and pruning) and thus covers a wide range of applications. Both approaches were tested on a variety of regression datasets with several million observations. Our experimental results demonstrate the efficiency and speed of our approaches based on distributed processing in cloud computing. / In recent years we have witnessed a tremendous growth in the volume of data generated, partly due to the continuous development of information technologies. Managing these amounts of data requires fundamental changes in the architecture of data management systems in order to adapt to large and complex data. Single machines do not have the capacity to process such massive data, which motivates the need for scalable solutions. This thesis focuses on building scalable data management systems for treating large amounts of data. Our objective is to study the scalability of supervised machine learning methods in large-scale scenarios. In fact, in most existing algorithms and data structures there is a trade-off between efficiency, complexity, and scalability. To address these issues, we explore recent techniques for distributed learning in order to overcome the limitations of current learning algorithms. Our contribution consists of two new machine learning approaches for large-scale data. The first contribution tackles the scalability of Multiple Linear Regression in distributed environments; it learns quickly from massive volumes of existing data using parallel computing and a divide-and-conquer approach, yielding the same coefficients as the classical approach. The second contribution introduces a new scalable approach for ensembles of models which allows both learning and pruning to be deployed in a distributed environment. Both approaches have been evaluated on a variety of regression datasets ranging from a few thousand to several million examples. The experimental results show that the proposed approaches are competitive in terms of predictive performance while significantly reducing training and prediction time.
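The divide-and-conquer QR idea behind MLR-MR — factorize each data chunk locally, then combine only the small triangular factors — can be sketched on a single machine as follows. The chunking stands in for the distribution over a cluster; the function name and chunk layout are illustrative assumptions, not the thesis implementation, but the final triangular solve reproduces the classical OLS coefficients exactly.

```python
import numpy as np

def mlr_divide_and_conquer(X, y, n_chunks=4):
    """Least-squares coefficients via chunked QR, mimicking a map-reduce layout.

    Map: each chunk (X_i, y_i) is reduced to its small factors (R_i, Q_i^T y_i).
    Reduce: the stacked R_i are factorised once more, and the triangular
    solve yields exactly the classical OLS coefficients.
    """
    Rs, qtys = [], []
    for Xi, yi in zip(np.array_split(X, n_chunks), np.array_split(y, n_chunks)):
        Q, R = np.linalg.qr(Xi)           # local (thin) QR on each chunk
        Rs.append(R)
        qtys.append(Q.T @ yi)
    Q2, R2 = np.linalg.qr(np.vstack(Rs))  # combine the per-chunk R factors
    return np.linalg.solve(R2, Q2.T @ np.concatenate(qtys))
```

Only the p-by-p factors R_i and the p-vectors Q_i^T y_i travel to the combining step, so communication cost depends on the number of features rather than the number of observations, which is what makes the scheme attractive for millions of rows.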
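The two distributed phases of the Bagging contribution — train many bootstrap models, then prune the ensemble — can likewise be sketched sequentially. The linear base learner, the out-of-bag scoring, and all names here are illustrative assumptions standing in for the thesis's distributed MapReduce implementation.

```python
import numpy as np

def bagging_with_pruning(X, y, B=50, keep=10, rng=None):
    """Train B bootstrap linear models, then prune the ensemble to the `keep` best.

    The training loop plays the role of the map phase (one model per bootstrap
    sample); the error-based selection plays the role of the pruning phase.
    """
    rng = np.random.default_rng(rng)
    n = len(y)
    models, errors = [], []
    for _ in range(B):                            # "map": fit one model per bootstrap sample
        idx = rng.integers(0, n, size=n)
        beta = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
        oob = np.setdiff1d(np.arange(n), idx)     # out-of-bag rows score the model
        err = np.mean((X[oob] @ beta - y[oob]) ** 2) if len(oob) else np.inf
        models.append(beta)
        errors.append(err)
    best = np.argsort(errors)[:keep]              # "prune": keep the lowest-error models
    pruned = np.stack([models[i] for i in best])
    return lambda Xnew: Xnew @ pruned.mean(axis=0)  # average the surviving models
```

Pruning before aggregation keeps the prediction-time cost proportional to the surviving ensemble size, which is the motivation for distributing that phase as well as the training phase.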
