11

Analogy-based software project effort estimation : contributions to projects similarity measurement, attribute selection and attribute weighting algorithms for analogy-based effort estimation

Azzeh, Mohammad Y. A. January 2010 (has links)
Software effort estimation by analogy is a viable alternative to other estimation techniques, and in many cases researchers have found that it outperforms other estimation methods in terms of accuracy and practitioners' acceptance. However, the overall performance of analogy-based estimation depends on two major factors: the similarity measure and attribute selection and weighting. Current similarity measures, such as nearest-neighborhood techniques, have been criticized for inadequacies related to attribute relevancy, noise and uncertainty, in addition to the problem of handling categorical attributes. This research focuses on improving the efficiency and flexibility of analogy-based estimation to overcome these inadequacies. In particular, this thesis proposes two new approaches to model and handle uncertainty in the similarity measurement and, most importantly, to reflect the structure of the dataset in the similarity measurement using Fuzzy modeling based on the Fuzzy C-means algorithm. The first proposed approach, the Fuzzy Grey Relational Analysis method, combines Fuzzy set theory and Grey Relational Analysis to improve local and global similarity measurement and to tolerate the imprecision associated with using different data types (continuous and categorical). The second proposed approach uses Fuzzy numbers and their concepts to develop a practical yet efficient approach that supports analogy-based systems, especially at the early phases of software development. Specifically, we propose a new similarity measure and adaptation technique based on Fuzzy numbers. We also propose a new attribute subset selection algorithm and attribute weighting technique based on the hypothesis of analogy-based estimation that projects that are similar in terms of attribute values are also similar in terms of effort values, using row-wise Kendall rank correlation between the similarity matrix based on project effort values and the similarity matrix based on project attribute values. A literature review of related software engineering studies revealed that existing attribute selection techniques (such as brute-force and heuristic algorithms) are restricted by the choice of performance indicator (such as the Mean Magnitude of Relative Error and the Prediction Performance Indicator) and are computationally far more intensive. The proposed algorithms provide a sound statistical basis and justification for their procedures. The performance of the proposed approaches has been evaluated using real industrial datasets. Results and conclusions from a series of comparative studies against the conventional estimation-by-analogy approach on the available datasets are presented. The studies also statistically investigated the significance of differences between predictions generated by our approaches and those generated by the most popular techniques: conventional analogy estimation, neural networks and stepwise regression. The results indicate that the two proposed approaches have the potential to deliver comparable, if not better, accuracy than the compared techniques, and that Grey Relational Analysis tolerates the uncertainty associated with using different data types. As well as the original contributions within the thesis, a number of directions for further research are presented. Most chapters of this thesis have been disseminated in international journals and highly refereed conference proceedings.
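One way to picture the Kendall-based weighting idea described in this abstract is the sketch below: it scores an attribute by the average row-wise Kendall correlation between a similarity matrix computed from effort values and one computed from that attribute's values. The similarity function, the toy data and the equal treatment of rows are illustrative assumptions, not the thesis's actual algorithm.

    # Illustrative sketch only: Kendall-based attribute weighting for analogy estimation.
    import numpy as np
    from scipy.stats import kendalltau

    def similarity_matrix(values):
        """Pairwise similarity as negative absolute difference (a simple stand-in)."""
        v = np.asarray(values, dtype=float).reshape(-1, 1)
        return -np.abs(v - v.T)

    def kendall_attribute_weight(attribute_values, effort_values):
        """Average row-wise Kendall tau between the two similarity matrices."""
        sim_attr = similarity_matrix(attribute_values)
        sim_eff = similarity_matrix(effort_values)
        taus = []
        for i in range(sim_attr.shape[0]):
            tau, _ = kendalltau(sim_attr[i], sim_eff[i])
            if not np.isnan(tau):
                taus.append(tau)
        return float(np.mean(taus)) if taus else 0.0

    # Example: attributes whose similarity ordering mirrors the effort ordering get high weights.
    efforts = [12, 30, 25, 60, 55]
    attributes = {"size_kloc": [5, 14, 11, 30, 27], "team_size": [3, 3, 9, 2, 8]}
    weights = {name: kendall_attribute_weight(col, efforts) for name, col in attributes.items()}
    print(weights)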
12

A Benchmark of Prevalent Feature Selection Algorithms on a Diverse Set of Classification Problems

Kniberg, Anette, Nokto, David January 2018 (has links)
Feature selection is the process of automatically selecting important features from data. It is an essential part of machine learning, artificial intelligence, data mining, and modelling in general. There are many feature selection algorithms available and the appropriate choice can be difficult. The aim of this thesis was to compare feature selection algorithms in order to provide an experimental basis for which algorithm to choose. The first phase involved assessing which algorithms are most common in the scientific community, through a systematic literature study in the two largest reference databases: Scopus and Web of Science. The second phase involved constructing and implementing a benchmark pipeline to compare the performance of 31 algorithms on 50 data sets. The selected features were used to construct classification models and their predictive performances were compared, as well as the runtime of the selection process. The results show a small overall superiority of embedded-type algorithms, especially types that involve decision trees. However, no algorithm is significantly superior in every case. The pipeline and data from the experiments can be used by practitioners in determining which algorithms to apply to their respective problems.
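As a rough illustration of such a benchmark pipeline, the sketch below compares one filter, one wrapper and one embedded selector on a public dataset, recording downstream accuracy and selection runtime. The chosen selectors, dataset and k=10 are assumptions rather than the thesis's configuration, and the selectors are fit on all data for brevity, which leaks into the cross-validation estimate.

    # Minimal feature-selection benchmark sketch with scikit-learn (illustrative only).
    import time
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, mutual_info_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    X = StandardScaler().fit_transform(X)  # scale once so the wrapper's logistic regression behaves
    selectors = {
        "filter (mutual information)": SelectKBest(mutual_info_classif, k=10),
        "wrapper (RFE)": RFE(LogisticRegression(max_iter=1000), n_features_to_select=10),
        "embedded (random forest)": SelectFromModel(
            RandomForestClassifier(n_estimators=200, random_state=0), max_features=10),
    }
    for name, selector in selectors.items():
        start = time.time()
        X_sel = selector.fit_transform(X, y)          # runtime of the selection step
        elapsed = time.time() - start
        acc = cross_val_score(LogisticRegression(max_iter=1000), X_sel, y, cv=5).mean()
        print(f"{name:28s} kept={X_sel.shape[1]:2d} acc={acc:.3f} time={elapsed:.2f}s")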
14

Classificação visual de mudas de plantas ornamentais: análise da eficácia de técnicas de seleção de atributos. / Visual classification of ornamental plants seedlings: analysis of attribute selection efficacy.

Silva, Luiz Otávio Lamardo Alves 03 December 2013 (has links)
The automation of visual classification of products is gaining importance in agricultural production processes. One of the main difficulties encountered by producers of ornamental plants and flowers is ensuring homogeneous growth of their plants. In this scenario, the seedlings used to grow the plants are very important, since their growth potential can be estimated by visual inspection, and a computer vision system can be used to automate this task. Unlike traditional industries, however, the agricultural industry shows great variability among the products inspected. Supervised machine learning techniques can evaluate a set of attributes representing the inspected object in order to classify it correctly, making it possible to deal both with the variability of the inspected products and with the incorporation of experts' knowledge into the system. The definition of the attribute set to be extracted from the images of the products is of utmost importance, as it provides all the information used by the system. A set with many attributes ensures that all necessary information is captured; however, irrelevant or redundant attributes can hurt the performance of the classifiers. Attribute selection techniques can be used to balance these needs. The aim of this study was to evaluate the effectiveness of these techniques for the classification of African violet seedlings. Twenty-six parameters were extracted from six hundred images labeled into four quality groups. The performances of six classifiers were then compared over the universe of subsets generated by four attribute selection techniques. The results showed that these techniques are indeed advantageous, generating gains of up to 8.8% in accuracy while reducing the average number of attributes used from 26 to 11. The Logistic Regression classifier associated with the subset generated by the chi-squared filter showed the best overall performance, achieving 80% accuracy. Random Forest came second but was less sensitive to attribute selection.
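A minimal sketch of the reported winning combination, a chi-squared filter feeding a logistic-regression classifier, might look as follows; the dataset, the scaling step and k=11 are illustrative assumptions rather than the study's actual setup.

    # Chi-squared attribute selection followed by logistic regression (illustrative sketch).
    from sklearn.datasets import load_wine
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MinMaxScaler

    X, y = load_wine(return_X_y=True)
    pipeline = make_pipeline(
        MinMaxScaler(),               # chi2 requires non-negative inputs
        SelectKBest(chi2, k=11),      # keep 11 attributes, mirroring the reduction reported above
        LogisticRegression(max_iter=1000),
    )
    print("mean CV accuracy:", cross_val_score(pipeline, X, y, cv=5).mean())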
15

Contribuições para a construção de taxonomias de tópicos em domínios restritos utilizando aprendizado estatístico / Contributions to topic taxonomy construction in a specific domain using statistical learning

Moura, Maria Fernanda 26 October 2009 (has links)
Text mining provides powerful techniques to help with the current need to understand and organize huge amounts of textual documents. One way to do this is to build topic taxonomies from the documents. Topic taxonomies can be used to organize the documents, preferably in hierarchies, and to identify groups of related documents and their descriptors. Constructing high-quality topic taxonomies, whether manually, automatically or semi-automatically, is not a trivial task. This work aims to use text mining techniques to build topic taxonomies for well-defined knowledge domains, helping the domain expert to understand and organize document collections. Because the knowledge domains are well defined, only unsupervised statistical methods are used, with a bag-of-words representation for the textual documents. These representations are independent of the context of the words in the documents, and consequently of the domain; thus, restricting the domain is expected to reduce misinterpretation of the results. The proposed methodology for topic taxonomy construction is an instantiation of the text mining process. At each step of the process, solutions are proposed and adapted to the specific needs of topic taxonomy construction, some of which are innovative contributions to the state of the art. In particular, this work contributes to the state of the art in three ways: the selection of n-gram attributes in text mining tasks, two models for labelling hierarchical document clusters, and a model for validating the labelling of hierarchical document clusters. Additional contributions include adaptations and methodologies for choosing attribute selection processes, attribute generation, taxonomy visualization and reduction of the obtained taxonomies. Finally, the developed methodology was applied to real problems, with good results.
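The general pipeline the abstract outlines, bag-of-words vectors, unsupervised hierarchical clustering and clusters labelled by their strongest terms, can be sketched as below; the toy corpus, the cut into two clusters and the naive top-term labelling rule are assumptions, not the thesis's models.

    # Sketch of topic grouping and labelling over a bag-of-words representation (illustrative).
    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["grain yield and crop rotation", "soil nutrients for crop growth",
            "dairy cattle feed supplements", "cattle grazing and pasture management"]
    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(docs).toarray()
    terms = np.array(tfidf.get_feature_names_out())

    Z = linkage(X, method="average", metric="cosine")   # hierarchical clustering of documents
    labels = fcluster(Z, t=2, criterion="maxclust")     # cut the hierarchy into two topic groups

    for cluster_id in np.unique(labels):
        centroid = X[labels == cluster_id].mean(axis=0)
        top_terms = terms[np.argsort(centroid)[::-1][:3]]   # naive label: top-weighted terms
        print(f"topic {cluster_id}: {', '.join(top_terms)}")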
17

A simulation and machine learning approach to critical infrastructure resilience appraisal : Case study on payment disruptions

Samstad, Anna January 2018 (has links)
This study uses a simulation to gather data regarding a payment disruption. The simulation is part of a project called CCRAAAFFFTING, which examines what happens to a society when a payment disruption occurs. The purpose of this study is to develop a measure of resilience in the simulation and to use machine learning to analyse how the attributes in the simulation affect the resilience of the society. Resilience is defined as "the ability to bounce back to a previous state", and the resilience measure is developed according to this definition. Two resilience measurements are defined: one relates the simulated value to the best-case and worst-case scenarios, and the other takes the pace of change in values into consideration. These two measurements are then combined into one measure of total resilience. The three machine learning algorithms compared are Neural Network, Support Vector Machine and Random Forest, and their performance is measured by error rate. The results show that Random Forest performs significantly better than the other two algorithms, and that the most important attributes in the simulation are those concerning the customers' ability to make purchases. The developed resilience measure responds logically to how the situation unfolds, and some suggestions for further improving the measure are provided for future research.
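The two components are not given as formulas in the abstract, so the sketch below is only a loose numerical reading of them: one term locates the simulated trajectory between the worst-case and best-case scenarios, the other penalises rapid drops, and the two are averaged. Every formula, weight and data value in it is an assumption.

    # Loose illustrative sketch of a two-component resilience score (not the thesis's measure).
    import numpy as np

    def resilience(simulated, best_case, worst_case):
        sim, best, worst = map(np.asarray, (simulated, best_case, worst_case))
        # Component 1: where the simulated value sits between worst (0) and best (1), on average.
        level = np.clip((sim - worst) / (best - worst), 0.0, 1.0).mean()
        # Component 2: penalise rapid deterioration (large negative step-to-step changes).
        drops = np.diff(sim) / (best - worst)[1:]
        pace = 1.0 - np.clip(-drops, 0.0, 1.0).mean()
        return 0.5 * level + 0.5 * pace     # equal-weight combination, purely illustrative

    # Example: purchases completed per tick during a simulated payment disruption.
    print(resilience(simulated=[100, 60, 40, 55, 80],
                     best_case=[100, 100, 100, 100, 100],
                     worst_case=[0, 0, 0, 0, 0]))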
19

Feature selection in short-term load forecasting / Val av attribut vid kortvarig lastprognos för energiförbrukning

Söderberg, Max Joel, Meurling, Axel January 2019 (has links)
This paper investigates the correlation between energy consumption 24 hours ahead and the features used for predicting energy consumption. The features originate from three categories: weather, time and previous energy consumption. The correlations are calculated using Pearson correlation and mutual information. The most highly correlated features were those representing previous energy consumption, followed by temperature and month. Two identical feature sets containing all attributes (in this report, the words "attribute" and "feature" are used interchangeably) were obtained by ranking the features according to correlation. Three feature sets were created manually. The first set contained seven attributes representing previous energy consumption over the seven days prior to the day of prediction. The second set consisted of weather and time attributes. The third set consisted of all attributes from the first and second sets. These sets were then compared using different machine learning models. The set containing all attributes and the set containing previous energy attributes yielded the best performance for each machine learning model.
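The correlation-ranking step described above can be sketched with pandas and scikit-learn as follows; the synthetic data frame, the column names and the target construction are assumptions used only to make the example runnable.

    # Ranking load-forecasting features by Pearson correlation and mutual information (sketch).
    import numpy as np
    import pandas as pd
    from sklearn.feature_selection import mutual_info_regression

    rng = np.random.default_rng(0)
    n = 500
    df = pd.DataFrame({
        "energy_prev_24h": rng.normal(50, 10, n),
        "temperature": rng.normal(8, 6, n),
        "month": rng.integers(1, 13, n),
    })
    # Target built from the features so the ranking has structure to recover.
    df["energy_next_24h"] = 0.8 * df["energy_prev_24h"] - 0.5 * df["temperature"] + rng.normal(0, 3, n)

    features = df.drop(columns="energy_next_24h")
    target = df["energy_next_24h"]

    pearson = features.corrwith(target).abs().sort_values(ascending=False)
    mi = pd.Series(mutual_info_regression(features, target, random_state=0),
                   index=features.columns).sort_values(ascending=False)
    print("Pearson |r| ranking:\n", pearson, "\nMutual information ranking:\n", mi, sep="")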
