1

Computer-Enhanced Knowledge Discovery in Environmental Science

Fukuda, Kyoko January 2009
Encouraging the use of computer algorithms, by developing new algorithms and introducing little-known ones for environmental science problems, is a significant contribution: it provides knowledge discovery tools that extract new aspects of results and draw new insights beyond those available from general statistical methods. Conducting analysis with appropriately chosen methods, in terms of quality of performance and results, computation time, flexibility and applicability to data of various natures, helps decision making in the policy development and management process for environmental studies. This thesis has three fundamental aims. Firstly, to develop a flexibly applicable attribute selection method, Tree Node Selection (TNS), and a decision tree assessment tool, Tree Node Selection for assessing decision tree structure (TNS-A), both of which use decision trees pre-generated by the widely used C4.5 algorithm as their information source to identify important attributes in data. TNS supports cost-effective and efficient data collection and policy making by selecting fewer, but important, attributes, and TNS-A provides a means of assessing decision tree structure to extract information on the relationships between attributes and decisions. Secondly, to introduce new, theoretical or little-known computer algorithms, such as the K-Maximum Subarray Algorithm (K-MSA) and Ant-Miner, adjusting and maximising their applicability and practicality for assessing environmental science problems and bringing new insights. Additionally, an advanced statistical and mathematical method, Singular Spectrum Analysis (SSA), is demonstrated as a data pre-processing step that helps improve C4.5 results on noisy measurements. Thirdly, to promote, encourage and motivate environmental scientists to use the ideas and methods developed in this thesis. The methods were tested on benchmark data and on various real environmental science problems: sea container contamination, the Weed Risk Assessment model and weed spatial analysis for New Zealand Biosecurity, air pollution, climate and health, and defoliation imagery. The outcome of this thesis is to introduce the concepts and techniques of data mining, the process of knowledge discovery from databases, to environmental science researchers in New Zealand and overseas, through collaboration on future research and on the policy and management needed to maintain and sustain a healthy environment.
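As a rough illustration of the Tree Node Selection idea described above (ranking attributes by the split nodes they occupy in a pre-generated decision tree), the following Python sketch scores attributes by how often, and how close to the root, they appear as splits. The 1/(depth+1) weighting and the use of scikit-learn's CART tree in place of C4.5 are assumptions for illustration only; the thesis defines TNS in its own terms.

```python
# Rough illustration of a Tree Node Selection (TNS)-style attribute ranking.
# Assumption: attributes are scored by how often, and how close to the root,
# they appear as split nodes in a pre-generated decision tree; scikit-learn's
# CART tree stands in for C4.5, and the 1/(depth+1) weighting is illustrative.
from collections import defaultdict

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier


def tns_like_scores(fitted_tree, n_features):
    """Score each attribute by summing 1 / (depth + 1) over its split nodes."""
    t = fitted_tree.tree_
    scores = defaultdict(float)

    def walk(node, depth):
        if t.children_left[node] == -1:          # leaf: no split attribute
            return
        scores[t.feature[node]] += 1.0 / (depth + 1)
        walk(t.children_left[node], depth + 1)
        walk(t.children_right[node], depth + 1)

    walk(0, 0)
    return [scores.get(f, 0.0) for f in range(n_features)]


X, y = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
ranked = sorted(enumerate(tns_like_scores(tree, X.shape[1])),
                key=lambda item: item[1], reverse=True)
print("Top attributes by TNS-like score:", ranked[:5])
```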
2

Combining data mining techniques with multicriteria decision aid in classification problems with the Composition of Probabilistic Preferences in Trichotomic procedure (CPP-TRI)

Silva, Glauco Barbosa da 27 July 2017
Problem: Multicriteria Decision Aid (MCDA), which starts from modeling decision maker preferences, is a field dedicated to the study of real-world decision-making problems that are usually too complex and too ill-structured to be considered through a single point of view (criterion). This feature of MCDA implies that a comprehensive model of a decision situation cannot simply be created; instead, the model should be developed to meet the requirements of the Decision Maker (DM). In general, such a model can only be developed through an iterative and interactive process, continued until the preferences of the decision maker are consistently represented in the model. However, an interactive method is a procedure consisting of alternating calculation and discussion stages, which presumes that the decision maker is willing to answer a large number of relatively difficult questions. For instance, one of the main difficulties faced when interacting with a decision maker to build a decision aid procedure is the elicitation of the preference model's various parameters. Methodology: In this thesis, as an alternative to the interactive process, a Preference Disaggregation Analysis method, one of the main streams of MCDA, was used; such methods assess or infer preference models from given preferential structures and address decision-aiding activities by eliciting preferential information and constructing decision models from decision examples. Combining the Composition of Probabilistic Preferences with data mining techniques, a three-step process is proposed: attribute selection, clustering and classification. The first two are data mining tasks and the last is a multicriteria task. Purpose: This thesis presents a new approach with a data mining layer (attribute selection and/or clustering) in the Composition of Probabilistic Preferences in Trichotomic procedure (CPP-TRI), which combines data mining techniques with a Multicriteria Decision Aid method in classification (sorting) problems. Findings: The volume of stored data has exceeded the decision maker's ability to comprehend it without powerful tools; as a result, important decisions are often made not on the basis of the information-rich data stored in repositories but on the decision maker's intuition. Because they address similar problems, the connections between disaggregation methods and data mining (identifying patterns, extracting knowledge from data, eliciting preferential information and constructing decision models from decision examples) are explored to improve the CPP-TRI method with attribute selection techniques.
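A minimal sketch of the proposed three-step process follows, with stand-in components: attribute selection and clustering form the data mining layer, followed by a classification step. The logistic regression is only a placeholder for CPP-TRI sorting, and the dataset, selector and number of clusters are illustrative assumptions, not choices made in the thesis.

```python
# Sketch of the three-step process with stand-in components:
# (1) attribute selection and (2) clustering form the data mining layer,
# (3) classification is the multicriteria task. The logistic regression is
# only a placeholder for CPP-TRI sorting; dataset, selector and number of
# clusters are illustrative assumptions.
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Step 1: attribute selection (data mining task).
X_reduced = SelectKBest(f_classif, k=6).fit_transform(X_scaled, y)

# Step 2: clustering (data mining task), e.g. to group similar alternatives.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)

# Step 3: classification / sorting (the multicriteria task in the thesis;
# logistic regression is only a placeholder for CPP-TRI).
accuracy = cross_val_score(LogisticRegression(max_iter=1000),
                           X_reduced, y, cv=5).mean()

print("Cluster sizes:", [int((clusters == c).sum()) for c in range(3)])
print("CV accuracy on reduced attributes:", round(accuracy, 3))
```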
3

Um filtro iterativo utilizando árvores de decisão / An Iterative Decision Tree Threshold Filter

Picchi Netto, Oscar 24 September 2013
Using Machine Learning algorithms is an efficient way to extract information from large biological databases. In some cases, however, the amount of data is so large that an efficient feature subset selection technique is essential, not only to reduce the training time of the Machine Learning algorithm applied afterwards but also to reduce the data to a size that can be tested, for example, on a laboratory workbench in specific situations. The objective of this study is to propose an approach that uses decision trees in an iterative filter. The filter helps information extraction from large biological databases, since with a database of lower dimensionality a human specialist can understand it better or apply Machine Learning algorithms more effectively. The proposed filter can use any classifier with embedded feature subset selection, and any performance metric can be used to determine which attributes should be chosen. In this study, the algorithm used within the filter was fixed as J48 and the area under the ROC curve (AUC) was used as the performance metric. In experiments on several biomedical databases, the proposed filter was analysed and its compression capacity and performance were evaluated across five different Machine Learning paradigms, using two different thresholds for the chosen metric. The best threshold was able to reduce around 50% of the data overall and 99.4% on low-density (usually high-dimensional) databases. The AUC values obtained by the filter, compared with algorithms from five different learning paradigms, showed better performance in four of the five situations evaluated. The proposed filter was then compared with other feature subset selectors from the literature and with the inducer alone. In running time, the filter was on the same level as 3 of the 4 selectors tested. In AUC, the proposed filter was robust across the five inducers analysed, showing no significant difference in any of the tested scenarios. Compared with the inducers alone, the filter showed better, though not significant, performance for 4 of the 5 inducers.
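A minimal sketch of how such an iterative decision-tree threshold filter could look is given below, under the assumption that each round builds a tree on the remaining attributes, keeps the attributes the tree actually splits on whenever its AUC clears a threshold, and stops once AUC falls below it. The loop structure, the threshold value and the use of scikit-learn's tree in place of J48 are assumptions, not the thesis's exact procedure.

```python
# Minimal sketch of an iterative decision-tree threshold filter (assumed loop:
# build a tree on the remaining attributes, keep the attributes it splits on
# when its cross-validated AUC clears a threshold, remove them from the pool,
# repeat until AUC drops below the threshold). scikit-learn's tree stands in
# for J48; all specifics here are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier


def iterative_tree_filter(X, y, auc_threshold=0.6):
    remaining = list(range(X.shape[1]))
    selected = []
    while remaining:
        tree = DecisionTreeClassifier(random_state=0)
        probs = cross_val_predict(tree, X[:, remaining], y, cv=5,
                                  method="predict_proba")[:, 1]
        if roc_auc_score(y, probs) < auc_threshold:
            break
        tree.fit(X[:, remaining], y)
        used = {remaining[f] for f in tree.tree_.feature if f >= 0}
        if not used:
            break
        selected.extend(sorted(used))
        remaining = [f for f in remaining if f not in used]
    return selected


X, y = load_breast_cancer(return_X_y=True)
kept = iterative_tree_filter(X, y)
print(f"kept {len(kept)} of {X.shape[1]} attributes:", kept)
```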
4

Machine learning for automatic classification of remotely sensed data

Milne, Linda, Computer Science & Engineering, Faculty of Engineering, UNSW January 2008
As more and more remotely sensed data becomes available, it is becoming increasingly hard to analyse with the traditional labour-intensive, manual methods. The commonly used techniques, which involve expert evaluation, are widely acknowledged as providing inconsistent results at best. We need more general techniques that can adapt to a given situation and that incorporate the strengths of the traditional methods, human operators and new technologies. The difficulty in interpreting remotely sensed data is that often only a small amount of data is available for classification, and it can be noisy, incomplete or contain irrelevant information. Given that the training data may be limited, we demonstrate a variety of techniques for highlighting information in the available data and for selecting the most relevant information for a given classification task. We show that more consistent results between the training data and an entire image can be obtained, and how misclassification errors can be reduced. Specifically, a new technique for attribute selection in neural networks is demonstrated. Machine learning techniques, in particular, provide us with a means of automating classification using training data from a variety of sources, including remotely sensed data and expert knowledge. A classification framework is presented in this thesis that can be used with any classifier and any available data. While it was developed in the context of vegetation mapping from remotely sensed data using machine learning classifiers, it is a general technique that can be applied to any domain, with the emphasis on domains that have inadequate training data available.
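The abstract does not spell out the neural-network attribute selection technique; purely as a generic illustration of one common proxy, the sketch below ranks the inputs of a trained multilayer perceptron by the summed magnitude of their first-layer weights. The dataset and network size are arbitrary assumptions and not taken from the thesis.

```python
# Generic proxy for neural-network attribute relevance (assumption, not the
# thesis's technique): rank each input by the summed magnitude of its
# first-layer weights in a trained multilayer perceptron.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(X, y)

# coefs_[0] has shape (n_inputs, n_hidden): sum |weight| per input attribute.
relevance = np.abs(mlp.coefs_[0]).sum(axis=1)
print("Inputs ranked by first-layer weight magnitude:",
      np.argsort(relevance)[::-1][:5])
```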
5

Effective Linear-Time Feature Selection

Pradhananga, Nripendra January 2007
The classification learning task requires selection of a subset of features to represent the patterns to be classified, because the performance of the classifier and the cost of classification are sensitive to the choice of features used to construct it. Exhaustive search is impractical since it evaluates every possible combination of features. The runtimes of heuristic and random searches are better, but the problem still persists when dealing with high-dimensional datasets. We investigate a heuristic, forward, wrapper-based approach, called Linear Sequential Selection, which limits the search space at each iteration of the feature selection process. We then introduce randomization into the search space; the resulting algorithm is called Randomized Linear Sequential Selection. Our experiments demonstrate that both methods are faster, find smaller subsets and can even increase the classification accuracy. We also explore the idea of ensemble learning and propose two ensemble creation methods, Feature Selection Ensemble and Random Feature Ensemble, both of which apply a feature selection algorithm to create the individual classifiers of the ensemble. Our experiments show that both methods work well with high-dimensional data.
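A hedged sketch of a forward, wrapper-based selection that limits the search space at each iteration follows. Here only a small random sample of the remaining features is scored per round, which captures the spirit of the randomized variant; the thesis's exact rule for limiting the search space may differ, and the classifier, sample size and stopping rule are assumptions.

```python
# Sketch of a randomized forward wrapper selection. Assumption: each round
# evaluates only a small random sample of the remaining features rather than
# all of them, in the spirit of (Randomized) Linear Sequential Selection.
import random

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def randomized_forward_selection(X, y, candidates_per_round=5, seed=0):
    rng = random.Random(seed)
    remaining = list(range(X.shape[1]))
    selected, best_score = [], 0.0
    clf = KNeighborsClassifier()
    while remaining:
        sample = rng.sample(remaining, min(candidates_per_round, len(remaining)))
        scored = [(cross_val_score(clf, X[:, selected + [f]], y, cv=5).mean(), f)
                  for f in sample]
        score, feat = max(scored)
        if score <= best_score:          # stop when no sampled feature helps
            break
        best_score = score
        selected.append(feat)
        remaining.remove(feat)
    return selected, best_score


X, y = load_breast_cancer(return_X_y=True)
feats, score = randomized_forward_selection(X, y)
print(f"selected {feats} with CV accuracy {score:.3f}")
```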
6

Uma estratégia para seleção de atributos relevantes no processo de resolução de entidades (A strategy for selecting relevant attributes in the entity resolution process)

CANALLE, Gabrielle Karine 22 August 2016
Data integration is an essential task for achieving a unified view of data stored in autonomous, heterogeneous and distributed sources. A key step in this process is Entity Resolution, which consists of identifying instances that refer to the same real-world entity. Entity Resolution can be subdivided into several stages, including a comparison step between pairs of instances. In this step, functions that check the similarity between attribute values are used to discover equivalent instances. It is important to note that the quality of the result of the Entity Resolution process is directly affected by the set of attributes selected for comparing instances. However, selecting such attributes can be challenging, owing either to the large number of attributes that describe an instance or to the low relevance of some attributes to the Entity Resolution process. In the literature, there are some approaches that investigate this problem. Most of them employ machine learning techniques for selecting relevant attributes. These techniques are usually computationally costly and also require the definition of a training set, which is non-trivial, especially in large-volume data scenarios. In this context, this work proposes a strategy for selecting relevant attributes to be considered in the instance comparison phase of the Entity Resolution process. The proposed strategy considers criteria related to the data, such as the density and repetition of values of each attribute, and criteria related to the sources, such as reliability, to evaluate the relevance of the attributes. An attribute is considered relevant if it contributes positively to the identification of true matches, and irrelevant if it contributes to the identification of incorrect matches (false positives and false negatives). In our experiments, the proposed strategy achieved good results for the Entity Resolution process: the attributes classified as relevant were the ones that contributed to finding the greatest number of true matches with the fewest incorrect matches.
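As an illustration of the data-related criteria mentioned above, the sketch below scores each attribute by its density (fraction of non-missing values) and by how repetitive its values are, rewarding attributes with more distinct values. The equal weighting and the toy records are assumptions, and the source-reliability criterion used in the thesis is omitted here.

```python
# Illustrative scoring of attributes by density (fraction of non-missing
# values) and value repetition. The equal weighting is an assumption; the
# thesis also weighs source reliability, which is omitted in this sketch.
import pandas as pd

records = pd.DataFrame({
    "name":    ["Ana Silva", "A. Silva", "João Souza", None, "Maria Lima"],
    "country": ["BR", "BR", "BR", "BR", "BR"],          # highly repeated value
    "email":   ["ana@x.com", None, None, None, None],   # mostly missing
})


def attribute_scores(df):
    scores = {}
    for col in df.columns:
        values = df[col].dropna()
        density = len(values) / len(df)
        # Heavily repeated values discriminate poorly between instances, so
        # score the share of distinct values among the observed ones.
        distinctiveness = values.nunique() / len(values) if len(values) else 0.0
        scores[col] = 0.5 * density + 0.5 * distinctiveness
    return scores


print(attribute_scores(records))
```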
7

Determining Attribute Importance Using an Ensemble of Genetic Programs and Permutation Tests: Relevansbestämning av attribut med hjälp av genetiska program och permutationstester

Annica, Ivert January 2015
When classifying high-dimensional data, a lot can be gained, in terms of both computational time and precision, by considering only the most important features. Many feature selection methods are based on the assumption that important features are highly correlated with their corresponding classes, but mainly uncorrelated with each other. Often, this assumption helps eliminate redundancies and produces good predictors using only a small subset of features. However, when predictability depends on interactions between the features, such methods fail to produce satisfactory results. Also, since the suitability of the selected features depends on the learning algorithm in which they will be used, correlation-based filter methods might not be optimal when genetic programs are the final classifiers, as they fail to capture the possibly complex relationships expressible by genetic programming rules. In this thesis a method is introduced that can find important features, both independently and dependently discriminative. It works by performing two different types of permutation tests that classify each feature as irrelevant, independently predictive or dependently predictive. The proposed method directly evaluates the suitability of the features with respect to the learning algorithm in question. Also, in contrast to computationally expensive wrapper methods that require several subsets of features to be evaluated, a feature classification is obtained after only a single pass, although the time required equals the training time of the classifier. The evaluation shows that the attributes chosen by the permutation tests always yield a classifier at least as good as the one obtained when all attributes are used during training, and often better. The proposed method also fares well when compared with other attribute selection methods such as ReliefF and CFS.
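A minimal sketch of the underlying single-attribute permutation mechanism follows: permute one attribute's values across examples and measure the drop in held-out accuracy. The thesis separates independently from dependently predictive attributes using two kinds of permutation tests over an ensemble of genetic programs; here a random forest stands in for that ensemble, purely as an assumption for illustration.

```python
# Basic permutation mechanism: permute one attribute's values and measure the
# drop in held-out accuracy. A random forest stands in for the thesis's
# ensemble of genetic programs (an assumption for illustration only).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
baseline = model.score(X_test, y_test)

rng = np.random.default_rng(0)
drops = []
for j in range(X.shape[1]):
    X_perm = X_test.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])   # break attribute-class link
    drops.append(baseline - model.score(X_perm, y_test))

print("Attributes ranked by accuracy drop when permuted:",
      np.argsort(drops)[::-1][:5])
```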
