  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Systematic ensemble learning and extensions for regression / Méthodes d'ensemble systématiques et extensions en apprentissage automatique pour la régression

Aldave, Roberto January 2015 (has links)
Abstract: The objective is to provide methods that improve the performance, or prediction accuracy, of the standard stacking approach, an ensemble method composed of simple, heterogeneous base models, by integrating the diversity generation, combination, and/or selection stages for regression problems. In Chapter 1, we propose to combine a set of level-1 learners into a level-2 learner, or ensemble. We also propose to inject a diversity-generation mechanism into the initial cross-validation partition, from which new cross-validation partitions are generated and subsequent ensembles are trained, and then propose an algorithm to select the best partition, or the corresponding ensemble. In Chapter 2, we formulate partition selection as a Pareto-based multi-criteria optimization problem and give an algorithm that applies the selection iteratively, with the aim of further improving the ensemble's prediction accuracy. In Chapter 3, we propose to generate multiple populations, or partitions, by injecting a diversity mechanism into the original dataset; an algorithm is then proposed to select the best partition among all those generated by the multiple populations. All methods designed and implemented in this thesis achieve encouraging and favorable results across different datasets against both state-of-the-art models and ensembles for regression.
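
The Chapter 1 recipe lends itself to a short sketch. Below is a minimal illustration, assuming scikit-learn, of the partition-selection loop the abstract outlines: reshuffled cross-validation partitions act as the diversity mechanism, one stacked ensemble is trained per partition, and the partition whose ensemble scores best on held-out data is kept. The base models and the selection criterion are illustrative assumptions, not the thesis's exact algorithm.

```python
# Sketch: select among diversity-generated CV partitions for stacked regression.
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

base_models = [  # simple, heterogeneous level-1 learners
    ("ridge", Ridge()),
    ("knn", KNeighborsRegressor()),
    ("tree", DecisionTreeRegressor(max_depth=5, random_state=0)),
]

best_score, best_seed = -float("inf"), None
for seed in range(10):  # each seed induces a different CV partition
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    ens = StackingRegressor(estimators=base_models, final_estimator=Ridge(), cv=cv)
    score = ens.fit(X_train, y_train).score(X_val, y_val)  # R^2 on held-out data
    if score > best_score:
        best_score, best_seed = score, seed

print(f"best partition seed: {best_seed}, validation R^2: {best_score:.3f}")
```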
2

Toward The Frontiers Of Stacked Generalization Architecture For Learning

Mertayak, Cuneyt 01 September 2007 (has links) (PDF)
In pattern recognition, the “bias-variance” trade-off is a challenging issue that scientists have worked on over the last decades to obtain better generalization performance. Among many learning methods, two-layered homogeneous stacked generalization has been reported to be successful in the literature in different problem domains, such as object recognition and image annotation. The aim of this work is twofold. First, the problems of stacked generalization are attacked with a proposed novel architecture. Then, a set of success criteria for stacked generalization is studied. A serious drawback of the stacked generalization architecture is its sensitivity to the curse of dimensionality. To solve this problem, a new architecture named “unanimous decision” is designed. The performance of this architecture is shown to be comparable to that of the two-layered homogeneous stacked generalization architecture for small numbers of classes, while it performs better for larger numbers of classes. Additionally, a new success criterion for the two-layered homogeneous stacked generalization architecture is proposed, based on the individual properties of the descriptors used, and it is verified on synthetic datasets.
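
For readers who want the two-layer setup in code, here is a minimal sketch, assuming scikit-learn, of homogeneous stacked generalization in the abstract's sense: base classifiers from a single family (k-NN with different neighborhood sizes, an assumption made for illustration) whose out-of-fold class probabilities feed a second-layer combiner.

```python
# Sketch: two-layered homogeneous stacked generalization for classification.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_classes=3, n_informative=6, random_state=0)

# layer 1: same algorithm, varied hyperparameters (homogeneous base layer)
layer1 = [(f"knn{k}", KNeighborsClassifier(n_neighbors=k)) for k in (1, 5, 15)]
sg = StackingClassifier(estimators=layer1,
                        final_estimator=LogisticRegression(max_iter=1000),
                        stack_method="predict_proba", cv=5)

print("5-fold accuracy:", cross_val_score(sg, X, y, cv=5).mean())
```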
3

Performance Analysis Of Stacked Generalization

Ozay, Mete 01 September 2008 (has links) (PDF)
Stacked Generalization (SG) is an ensemble learning technique which aims to increase the performance of individual classifiers by combining them under a hierarchical architecture. This study consists of two major parts. In the first part, the performance of the Stacked Generalization technique is analyzed with respect to the performance of the individual classifiers and the content of the training data. In the second part, based on those findings, a new class of algorithms, called Meta-Fuzzified Yield Value (Meta-FYV), is introduced. The first part introduces and verifies two hypotheses through a set of controlled experiments to assure the performance gain of SG. The learning mechanisms by which SG achieves high performance are explored, and the relationship between the performance of the individual classifiers and that of SG is investigated. It is shown that if the samples in the training set are correctly classified by at least one base-layer classifier, then the generalization performance of SG increases compared to the performance of the individual classifiers. The second hypothesis investigates the effect of spurious samples, which are not correctly labeled by any of the base-layer classifiers. In the second part of the thesis, six theorems are constructed based on the analysis of the feature spaces and the stacked generalization architecture. Based on the theorems and hypotheses, a new class of SG algorithms is proposed. The experiments are performed on both Corel data and synthetically generated data, using parallel programming techniques on a high-performance cluster.
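
The first hypothesis suggests a simple diagnostic, sketched below under the assumption of a scikit-learn setup: count the training samples that at least one base-layer classifier labels correctly out-of-fold, and, by complement, the "spurious" samples of the second hypothesis. This illustrates the quantity the hypotheses refer to; it is not the thesis's experimental protocol.

```python
# Sketch: how many samples does at least one base-layer classifier get right?
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=1)
base = [GaussianNB(), KNeighborsClassifier(), DecisionTreeClassifier(random_state=1)]

# out-of-fold predictions, as a base layer in stacking would produce them
oof = np.stack([cross_val_predict(clf, X, y, cv=5) for clf in base])
covered = (oof == y).any(axis=0)   # correct under at least one base classifier
spurious = ~covered                # the "spurious samples" of the second hypothesis

print(f"covered by at least one base classifier: {covered.mean():.1%}")
print(f"spurious samples: {spurious.sum()}")
```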
4

Um método para deduplicação de metadados bibliográficos baseado no empilhamento de classificadores / A method for bibliographic metadata deduplication based on stacked generalization

Borges, Eduardo Nunes January 2013 (has links)
Duplicated bibliographic metadata are semantically equivalent records, i.e., references that describe the same publication. Identifying duplicated bibliographic metadata in one or more digital libraries is an essential task to ensure the quality of services such as search, navigation, and content recommendation. Although many metadata standards have been proposed, they do not completely solve interoperability problems because, even if there is a mapping between different metadata schemas, there may be variations in the content representation. Most of the work proposed to identify duplicated records applies one or more functions to certain fields in order to capture the similarity between records. However, a threshold must then be chosen that defines whether two records are sufficiently similar to be considered semantically equivalent or duplicated. Recent studies treat record deduplication as a data classification problem, in which a predictive model is trained to estimate the real-world object to which a record refers. The main goal of this thesis is the development of an effective and automatic method to identify duplicated bibliographic metadata by combining multiple supervised classifiers, without any human intervention in the setting of similarity thresholds. Computationally cheap similarity functions, designed specifically for the context of digital libraries, are applied to the training set. The scores returned by these functions are used to train multiple heterogeneous classification models, i.e., learning algorithms based on trees, rules, artificial neural networks, and probabilistic models. The learned classifiers are combined by a stacked generalization strategy to improve the deduplication result through the heterogeneous knowledge acquired by each learning algorithm. The final model is applied to the candidate pairs for matching returned by an efficient two-level blocking strategy. The proposed solution is based on the hypothesis that stacking supervised classifiers can improve the quality of deduplication compared to other combination strategies. The experimental evaluation confirms this hypothesis by comparing the proposed method with selecting the best classifier and with majority voting. We also analyze the impact of classifier diversity on the stacking results and the cases in which the proposed method fails.
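
The pipeline this abstract describes can be sketched compactly. In the illustration below, which assumes scikit-learn and Python's standard difflib, cheap similarity functions over bibliographic fields turn each candidate pair into a score vector, heterogeneous classifiers (tree, probabilistic, neural) are trained on those scores, and a stacked combiner merges them. The field names, similarity choice, and simulated training scores are assumptions for illustration, not the thesis's actual functions or data.

```python
# Sketch: bibliographic deduplication via stacked heterogeneous classifiers.
from difflib import SequenceMatcher

import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

def sim(a: str, b: str) -> float:
    # a cheap string similarity in [0, 1]
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pair_features(r1: dict, r2: dict) -> list:
    # one similarity score per bibliographic field (illustrative fields)
    return [sim(r1[f], r2[f]) for f in ("title", "authors", "venue")]

# simulate labeled score vectors: duplicate pairs skew high, distinct pairs low
rng = np.random.default_rng(0)
X = np.vstack([rng.uniform(0.7, 1.0, (40, 3)), rng.uniform(0.0, 0.6, (40, 3))])
y = np.array([1] * 40 + [0] * 40)

# heterogeneous level-0 classifiers combined by a stacked meta-classifier
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                ("nb", GaussianNB()),
                ("mlp", MLPClassifier(max_iter=2000, random_state=0))],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X, y)  # in practice: pairs produced by the two-level blocking step

rec_a = {"title": "Stacked generalization", "authors": "Wolpert, D.", "venue": "Neural Networks"}
rec_b = {"title": "Stacked Generalization", "authors": "D. Wolpert", "venue": "Neural Netw."}
print(stack.predict([pair_features(rec_a, rec_b)]))  # 1 = predicted duplicate
```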
5

MetaStackVis: Visually-Assisted Performance Evaluation of Metamodels in Stacking Ensemble Learning

Ploshchik, Ilya January 2023 (has links)
Stacking, also known as stacked generalization, is a method of ensemble learning in which multiple base models are trained on the same dataset and their predictions are used as input for one or more metamodels in an extra layer. This technique can lead to improved performance compared to single-layer ensembles, but it often requires a time-consuming trial-and-error process. The previously developed Visual Analytics system StackGenVis was therefore designed to help users select the set of the most effective and diverse models and measure their predictive performance. However, StackGenVis was developed with only one metamodel: Logistic Regression. The focus of this Bachelor's thesis is to examine how alternative metamodels affect the performance of stacked ensembles, through the use of a visualization tool called MetaStackVis. Our interactive tool facilitates visual examination of individual metamodels and pairs of metamodels based on their predictive probabilities (or confidence), various supported validation metrics, and their accuracy in predicting specific problematic data instances. The efficiency and effectiveness of MetaStackVis are demonstrated with an example based on a real healthcare dataset. The tool has also been evaluated through semi-structured interview sessions with Machine Learning and Visual Analytics experts. In addition to this thesis, we have written a short research paper explaining the design and implementation of MetaStackVis. This thesis, however, provides further insights into the topic explored in the paper, offering additional findings and in-depth analysis; it can thus be considered a supplementary source for readers interested in diving deeper into the subject.
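
The experiment MetaStackVis supports visually can be approximated in a few lines: fix a pool of base models and swap the metamodel (the stack's final estimator) instead of defaulting to Logistic Regression. The sketch below assumes scikit-learn and an illustrative set of metamodel candidates; it is not the tool itself, which also inspects validation metrics per metamodel and problematic data instances.

```python
# Sketch: compare alternative metamodels for a fixed base-model pool.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
base = [("knn", KNeighborsClassifier()), ("nb", GaussianNB()),
        ("tree", DecisionTreeClassifier(random_state=0))]

metamodels = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
}
for name, meta in metamodels.items():
    stack = StackingClassifier(estimators=base, final_estimator=meta, cv=5)
    acc = cross_val_score(stack, X, y, cv=5).mean()
    print(f"metamodel={name}: accuracy={acc:.3f}")
```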
6

Improving Artist Content Matching with Stacking: A comparison of meta-level learners for stacked generalization

Magnússon, Fannar January 2018 (has links)
Using automatic methods to assign incoming tracks and albums from multiple sources to artist entities in a digital rights management company, where no universal artist identifier is available and artist names can be ambiguous, is a challenging problem. In this work we propose to use stacked generalization to combine the predictions of heterogeneous classifiers for improved artist content matching on two datasets from a digital rights management company. We compare the performance of a non-linear meta-level learner with that of a linear meta-level learner for stacked generalization on the two datasets, as well as on eight additional datasets, to see how well our results generalize. We conduct experiments and evaluate how the different meta-level learners perform, using either the base learners' class probabilities or a combination of the base learners' class probabilities and the original input features as meta-features. Our results indicate that stacking with a non-linear meta-level learner can improve predictions on the artist chooser problem. Furthermore, our results indicate that with a linear meta-level learner, using the base learners' class probabilities as meta-features works best, while with a non-linear meta-level learner, a combination of the base learners' class probabilities and the original input features works best. Among all the evaluated stacking approaches, stacking with a non-linear meta-level learner, using a combination of the base learners' class probabilities and the original input features as meta-features, performs best in our experiments over the ten evaluation datasets.
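
The two meta-feature regimes compared in this thesis map directly onto scikit-learn's stacking interface, where the `passthrough` flag appends the original input features to the base learners' predicted class probabilities. The sketch below, with illustrative data and models rather than the thesis's, crosses a linear and a non-linear meta-level learner with both regimes.

```python
# Sketch: probabilities-only vs. probabilities + original inputs as meta-features.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, random_state=2)
base = [("knn", KNeighborsClassifier()), ("nb", GaussianNB())]

for meta_name, meta in [("linear", LogisticRegression(max_iter=1000)),
                        ("non-linear", MLPClassifier(max_iter=2000, random_state=2))]:
    for passthrough in (False, True):
        stack = StackingClassifier(estimators=base, final_estimator=meta,
                                   stack_method="predict_proba",
                                   passthrough=passthrough, cv=5)
        feats = "probs + inputs" if passthrough else "probs only"
        acc = cross_val_score(stack, X, y, cv=5).mean()
        print(f"{meta_name} meta-learner, {feats}: {acc:.3f}")
```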
