91 |
Abordagens de seleção de variáveis para classificação e regressão em química analítica / Feature selection approaches for classification and regression in analytical chemistry
Soares, Felipe, January 2017
The use of analytical techniques for product classification or chemical property prediction is of great interest in both industry and academia. Elemental concentration analysis and spectroscopy techniques provide a large amount of information about the samples being analyzed. However, the large number of available variables (e.g., wavelengths or chemical elements) may harm the accuracy of the resulting models, which calls for feature selection techniques that identify the most relevant variables and make the models more robust. This dissertation proposes feature selection methods for analytical chemistry, aimed at product classification and at predicting chemical properties via regression. First, a method is proposed for selecting non-equidistant wavelength intervals in spectroscopy for fuel classification, based on the distance between the average spectra of two distinct classes; the selected intervals are then used as input for classification techniques. When applied to two spectroscopy datasets, the method reduced the number of variables to just 23.19% and 4.95% of the original ones, while lowering the misclassification error from 13.90% to 11.63% and from 4.71% to 1.21%, respectively. Next, a method is presented for selecting the most relevant elements for classifying wines from four South American countries, based on the parameters of linear discriminant analysis. The method achieved an average accuracy of 99.9% while retaining 6.82 chemical elements on average; the best average accuracy using all 45 available elements was 91.2%. Finally, the support vector regression – recursive feature elimination (SVR-RFE) algorithm is used to select the most important wavelengths for support vector regression. When applied to 12 datasets together with other selection and regression methods, SVR and SVR-RFE obtained the best results in 8 of them, and SVR-RFE performed significantly better than the other feature selection algorithms. The proposed feature selection methods made classification and regression more robust and reduced the number of variables retained in the models.
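The interval-selection idea in this abstract (comparing the average spectra of two classes) can be sketched as follows. This is only an illustration under stated assumptions: the fixed `threshold` rule, the function names and the toy spectra are hypothetical and are not taken from the dissertation, which may define the distance and the interval boundaries differently.

```python
from statistics import mean

def mean_spectrum(spectra):
    """Average a list of equal-length spectra wavelength by wavelength."""
    return [mean(column) for column in zip(*spectra)]

def select_intervals(class_a, class_b, threshold):
    """Return (start, end) index pairs, end inclusive, of contiguous wavelength
    runs where the two mean spectra differ by more than `threshold`.
    The thresholding rule is an assumption made for this sketch."""
    diff = [abs(a - b)
            for a, b in zip(mean_spectrum(class_a), mean_spectrum(class_b))]
    intervals, start = [], None
    for i, d in enumerate(diff):
        if d > threshold and start is None:
            start = i                       # a new interval opens here
        elif d <= threshold and start is not None:
            intervals.append((start, i - 1))  # the open interval closes
            start = None
    if start is not None:                   # interval running to the last index
        intervals.append((start, len(diff) - 1))
    return intervals

# Toy data: two spectra per class, six wavelengths each (hypothetical values).
class_a = [[1.0, 1.0, 5.0, 5.2, 1.0, 3.0], [1.2, 0.8, 5.2, 5.0, 1.0, 3.2]]
class_b = [[1.0, 1.1, 2.0, 2.1, 1.1, 1.0], [1.0, 0.9, 2.2, 1.9, 0.9, 1.0]]
intervals = select_intervals(class_a, class_b, threshold=1.0)
print(intervals)
```

The resulting intervals are of unequal width, which mirrors the "non-equidistant" property mentioned in the abstract; only the wavelengths inside them would be passed on to a classifier.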
|
92 |
Minimização de funções decomponíveis em curvas em U definidas sobre cadeias de posets -- algoritmos e aplicações / Minimization of decomposable in U-shaped curves functions defined on poset chains -- algorithms and applications
Reis, Marcelo da Silva, 28 November 2012
The feature selection problem, in the context of Pattern Recognition, consists of choosing a subset X of a set S of features such that X is "optimal" under some criterion. Given an appropriate cost function c, the feature selection problem reduces to a search problem that uses c to evaluate the subsets of S and thereby find an optimal feature subset. However, the feature selection problem is NP-hard. Although the literature offers many algorithms and heuristics for this problem, almost none of them exploits the fact that some cost functions, whose values are estimated from a sample, describe a "U-shaped curve" along the chains of the Boolean lattice (P(S), <=). This is a well-known phenomenon in Pattern Recognition: as the number of features considered grows, the cost of the evaluated subset decreases, up to the point where the limited number of samples makes adding further features increase the cost, because the estimation error increases. In 2010, Ris et al. proposed a new algorithm to solve this particular case of the feature selection problem; it exploits the fact that the search space can be organized as a Boolean lattice, together with the U-shaped structure of the chains of that lattice, to find an optimal feature subset. In this work, we studied the structure of the problem of minimizing cost functions whose chains are decomposable in U-shaped curves (the U-curve problem) and proved that it is NP-hard. We showed that the algorithm of Ris et al. has an error that makes it in fact suboptimal, and proposed a corrected and improved version, the U-Curve-Search (UCS) algorithm. We also presented two variations of the UCS algorithm that manage the search space more systematically. We introduced two new branch-and-bound algorithms for the problem, U-Curve-Branch-and-Bound (UBB) and Poset-Forest-Search (PFS). For every algorithm presented in this thesis we provided a time complexity analysis and, for some of them, a proof of correctness. We implemented all the algorithms using the featsel framework, which was also developed in this work; we performed optimal and suboptimal experiments on instances of real and simulated data and analyzed the results. Finally, we proposed a generalization of the U-curve problem that models some kinds of classifier design, and proved that the UCS, UBB, and PFS algorithms solve this generalized version of the problem.
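The pruning property that the U-curve assumption provides can be illustrated on a single chain of the Boolean lattice. The sketch below is not UCS, UBB or PFS; it only shows why, once the cost starts rising along a chain, the rest of that chain need not be evaluated. The cost function and the feature names are hypothetical.

```python
def chain_minimum(chain, cost):
    """Walk a chain X1 < X2 < ... (ordered by inclusion) and stop at the
    first cost increase. If the cost is U-shaped along the chain, the subset
    returned is a minimum of the chain, so later subsets can be skipped."""
    best = chain[0]
    best_cost = cost(best)
    for subset in chain[1:]:
        c = cost(subset)
        if c > best_cost:   # the curve has turned upward: prune the rest
            break
        best, best_cost = subset, c
    return best, best_cost

# Hypothetical cost: each useful feature lowers the error term, while every
# added feature raises a penalty that stands in for the estimation error.
USEFUL = frozenset("ab")

def toy_cost(subset):
    return 10 - 4 * len(subset & USEFUL) + 1.5 * len(subset)

# One maximal chain from the empty set to S = {a, b, c, d}.
chain = [frozenset(), frozenset("a"), frozenset("ab"),
         frozenset("abc"), frozenset("abcd")]
best, best_cost = chain_minimum(chain, toy_cost)
print(sorted(best), best_cost)
```

On this chain the costs are 10.0, 7.5, 5.0, 6.5 and 8.0, so the walk stops after evaluating {a, b, c} and never evaluates the full set.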
|
93 |
Feature Selection for Factored Phrase-Based Machine Translation
Tamchyna, Aleš, January 2012
In this work we investigate factored models for machine translation. We provide a thorough theoretical description of this machine translation paradigm. We describe a method for evaluating the complexity of factored models and verify its usefulness in practice. We present a software tool that automatically creates machine translation experiments and searches the space of possible configurations. In the experimental part of the work we verify our analyses and give some insight into the potential of factored systems. We indicate some possible directions that lead to improved translation quality; however, we conclude that these options cannot be explored in a fully automatic way.
|
94 |
Metodologia de fusão de vídeos e sons para monitoração de comportamento de insetos / Merging methodology videos and sounds for monitoring insect behavior
Jorge, Lúcio André de Castro, 02 September 2011
This work presents a new approach for fusing video and sound directly in the attribute (feature) space, using feature subset selection, in order to optimize the identification of insect behavior. The Harris detector was used to track the insects, and an innovative Wavelet-Multifractal technique was used to analyze their sounds. For the Wavelet-Multifractal analysis, several mother wavelets were tested, and the Morlet wavelet proved the best choice for insect sounds. The wavelet modulus maxima method was proposed to extract multifractal attributes from the sounds for use in recognizing insect behavior patterns. The wrapper data mining approach was used to select the relevant attributes. The wavelet-multifractal approach was found to identify the sounds better, particularly under noise-induced distortion. The images were responsible for identifying mating, and the sounds for identifying the other behaviors. A new triangle method was also proposed as a simplified representation of the multifractal spectrum, in order to simplify processing.
|
95 |
Random neural networks for dimensionality reduction and regularized supervised learning
Hu, Renjie, 01 August 2019
This dissertation explores several aspects of Random Neural Networks (RNNs) and their applications. First, novel RNNs are proposed for dimensionality reduction and visualization. Based on Extreme Learning Machines (ELMs) and Self-Organizing Maps (SOMs), a new method is created to identify the important variables and visualize the data. This technique mitigates the curse of dimensionality, further improves the interpretability of the visualization, and is tested on real nursing survey datasets. ELM-SOM+ is an autoencoder created to preserve the intrinsic quality of SOM while bringing continuity to the projection using two ELMs. This new methodology shows considerable improvement over SOM on real datasets. Second, as a supervised learning method, ELMs are applied to the hierarchical multiscale method to bridge molecular dynamics to continua. The method is tested on simulation data and shown to be efficient at passing information from one scale to another. Lastly, the regularization of ELMs is studied, and a new regularization algorithm for ELMs is created using a modified Lanczos algorithm. On average, the Lanczos ELM divides computational time by 20 and reduces the normalized MSE by 14% compared with regular ELMs.
|
96 |
[pt] CLASSIFICAÇÃO DE SENTIMENTO PARA NOTÍCIAS SOBRE A PETROBRAS NO MERCADO FINANCEIRO / [en] SENTIMENT ANALYSIS FOR FINANCIAL NEWS ABOUT PETROBRAS COMPANY
PAULA DE CASTRO SONNENFELD VILELA, 21 December 2011
A huge amount of information is available online, in particular financial news. Current research indicates that stock news correlates strongly with market variables such as trading volume, volatility, stock prices and firm earnings. Here, we investigate a Sentiment Analysis problem for financial news. Our goal is to classify financial news as favorable or unfavorable to Petrobras, an oil and gas company listed on the stock exchange. We explore Natural Language Processing techniques to improve the sentiment classification accuracy of a classical bag-of-words approach. We filter on-topic phrases from each Petrobras-related news item and build syntactic and stylistic input features. For sentiment classification, the Support Vector Machine learning algorithm is used. Moreover, we apply four feature selection methods and build a committee of the best SVM models. Additionally, we introduce Petronews, a manually annotated corpus of Portuguese-language financial news about Petrobras. It comprises one thousand and fifty online news items published from 2 June 2006 to 29 January 2010. Our experiments indicate that our method is 5.29 per cent better than a standard bag-of-words approach, reaching an accuracy of 87.14 per cent for this domain.
|
97 |
Use of Random Subspace Ensembles on Gene Expression Profiles in Survival Prediction for Colon Cancer Patients
Kamath, Vidya, 04 November 2005
Cancer is a disease process that emerges out of a series of genetic mutations that cause seemingly uncontrolled multiplication of cells. The molecular genetics of cells indicates that different combinations of genetic events or alternative pathways in cells may lead to cancer. A study of the gene expressions of cancer cells, in combination with the external influential factors, can greatly aid in cancer management such as understanding the initiation and etiology of cancer, as well as detection, assessment and prediction of the progression of cancer.
Gene expression analysis of cells yields a very large number of features that can be used to describe the condition of the cell. Feature selection methods are explored to choose the best of these features that are most relevant to the problem at hand. Random subspace ensembles created using these selected features perform poorly in predicting the 36-month survival for colon cancer patients. A modification to the random subspace scheme is proposed to enhance the accuracy of prediction. The method first applies random subspace ensembles with decision trees to select predictive features. Then, support vector machines are used to analyze the selected gene expression profiles in cancer tissue to predict the survival outcome for a patient.
The proposed method is shown to achieve a weighted accuracy of 58.96%, with 40.54% sensitivity and 77.38% specificity in predicting 36-month survival for new and unknown colon cancer patients. The prediction accuracy of the method is comparable to the baseline classifiers and significantly better than random subspace ensembles on gene expression profiles of colon cancer.
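The abstract does not define "weighted accuracy". The reported figures are, however, consistent with the equal-weight mean of sensitivity and specificity, as the short check below shows; that reading is an assumption, not a definition taken from the thesis.

```python
# Figures reported in the abstract, in percent.
sensitivity = 40.54
specificity = 77.38

# Assumed definition: equal-weight mean of sensitivity and specificity.
weighted_accuracy = round((sensitivity + specificity) / 2, 2)
print(weighted_accuracy)
```

Under this assumption the computed value matches the reported 58.96%.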
|
98 |
Application Of Support Vector Machines And Neural Networks In Digital Mammography: A Comparative Study
Candade, Nivedita V, 28 October 2004
Microcalcification (MC) detection is an important component of breast cancer diagnosis. However, visual analysis of mammograms is a difficult task for radiologists. Computer Aided Diagnosis (CAD) technology helps in identifying lesions and assists the radiologist in making his final decision.
This work is part of a CAD project carried out at the Imaging Science Research Division (ISRD), Digital Medical Imaging Program, Moffitt Cancer Research Center, Tampa, FL. A CAD system had previously been developed to perform the following tasks: (a) pre-processing, (b) segmentation and (c) feature extraction of mammogram images. Ten features covering the spatial and morphological domains were extracted from the mammograms, and the samples were classified as Microcalcification (MC) or False Alarm (false-positive microcalcification, FP) based on a binary truth file obtained from a radiologist's initial investigation.
The main focus of this work was two-fold: (a) to analyze these features, select the most significant features among them and study their impact on classification accuracy and (b) to implement and compare two machine-learning algorithms, Neural Networks (NNs) and Support Vector Machines (SVMs) and evaluate their performances with these features.
The NN was based on the Standard Back Propagation (SBP) algorithm. The SVM was implemented using polynomial, linear and Radial Basis Function (RBF) kernels. A detailed statistical analysis of the input features was performed. Feature selection was done using Stepwise Forward Selection (SFS) method. Training and testing of the classifiers was carried out using various training methods. Classifier evaluation was first performed with all the ten features in the model. Subsequently, only the features from SFS were used in the model to study their effect on classifier performance. Accuracy assessment was done to evaluate classifier performance.
Detailed statistical analysis showed that the given dataset discriminated poorly between classes and posed a very difficult pattern recognition problem. The SVM performed better than the NN in most cases, especially on unseen data. No significant improvement in classifier performance was noted with feature selection. However, with SFS, the NN showed improved performance on unseen data. The training time taken by the SVM was several orders of magnitude less than that of the NN. Classifiers were compared on the basis of their accuracy and parameters such as sensitivity and specificity. Free-response Receiver Operating Characteristic (FROC) curves were used to evaluate classifier performance.
The highest accuracy observed was about 93% on training data and 76% for testing data with the SVM using Leave One Out (LOO) Cross Validation (CV) training. Sensitivity was 81% and 46% on training and testing data respectively for a threshold of 0.7. The NN trained using the 'single test' method showed the highest accuracy of 86% on training data and 70% on testing data with respective sensitivity of 84% and 50%. Threshold in this case was -0.2. However, FROC analyses showed overall superiority of SVM especially on unseen data.
Both spatial and morphological domain features were significant in our model. Features were selected based on their significance in the model. However, when tested with the NN and SVM, this feature selection procedure did not show significant improvement in classifier performance. It was interesting to note that the model with interactions between these selected variables showed excellent testing sensitivity with the NN classifier (about 81%).
Recent research has shown SVMs outperform NNs in classification tasks. SVMs show distinct advantages such as better generalization, increased speed of learning, ability to find a global optimum and ability to deal with linearly non-separable data. Thus, though NNs are more widely known and used, SVMs are expected to gain popularity in practical applications. Our findings show that the SVM outperforms the NN. However, its performance depends largely on the nature of data used.
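Stepwise Forward Selection, as used in this study, follows a greedy pattern that can be sketched as below. This is a generic illustration, not the thesis's procedure: the stopping rule, the feature names and the toy `score` function are hypothetical, and the original work may have used a statistical significance criterion instead of a score gain.

```python
def forward_selection(features, score, min_gain=0.0):
    """Greedy forward selection: repeatedly add the feature that raises
    `score` the most, and stop when no candidate improves the current
    score by more than `min_gain`."""
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining:
        # Score every candidate subset obtained by adding one more feature.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates)
        if top_score - best <= min_gain:
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best

# Hypothetical stand-in for classifier accuracy: each feature contributes a
# fixed amount, and every feature in the model costs a small penalty.
WEIGHTS = {"area": 0.20, "contrast": 0.10, "compactness": 0.05, "noise": 0.0}

def toy_score(subset):
    return 0.5 + sum(WEIGHTS[f] for f in subset) - 0.02 * len(subset)

selected, best = forward_selection(WEIGHTS, toy_score)
print(selected, round(best, 2))
```

In practice `score` would be a cross-validated accuracy of the NN or SVM on the candidate feature subset, which is far more expensive than this toy function.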
|
99 |
Machine learning for automatic classification of remotely sensed data
Milne, Linda, Computer Science & Engineering, Faculty of Engineering, UNSW, January 2008
As more and more remotely sensed data becomes available, it is becoming increasingly hard to analyse it with the more traditional, labour-intensive manual methods. The commonly used techniques, which involve expert evaluation, are widely acknowledged as providing inconsistent results at best. We need more general techniques that can adapt to a given situation and that incorporate the strengths of the traditional methods, human operators and new technologies. The difficulty in interpreting remotely sensed data is that often only a small amount of data is available for classification. It can be noisy, incomplete or contain irrelevant information. Given that the training data may be limited, we demonstrate a variety of techniques for highlighting information in the available data and show how to select the most relevant information for a given classification task. We show that more consistent results between the training data and an entire image can be obtained, and how misclassification errors can be reduced. Specifically, a new technique for attribute selection in neural networks is demonstrated. Machine learning techniques, in particular, provide us with a means of automating classification using training data from a variety of data sources, including remotely sensed data and expert knowledge. A classification framework is presented in this thesis that can be used with any classifier and any available data. While this was developed in the context of vegetation mapping from remotely sensed data using machine learning classifiers, it is a general technique that can be applied to any domain. The framework is aimed in particular at domains with inadequate training data.
|
100 |
Improving Feature Selection Techniques for Machine Learning
Tan, Feng, 27 November 2007
As a commonly used technique in data preprocessing for machine learning, feature selection identifies important features and removes irrelevant, redundant or noisy features to reduce the dimensionality of the feature space. It improves the efficiency, accuracy and comprehensibility of the models built by learning algorithms. Feature selection techniques have been widely employed in a variety of applications, such as genomic analysis, information retrieval, and text categorization. Researchers have introduced many feature selection algorithms with different selection criteria. However, it has been discovered that no single criterion is best for all applications. We proposed a hybrid feature selection framework based on genetic algorithms (GAs), called the hybrid genetic feature selection (HGFS) framework, which employs a target learning algorithm to evaluate features, i.e., a wrapper method. The advantages of this approach include the ability to accommodate multiple feature selection criteria and to find small subsets of features that perform well for the target algorithm. The experiments on genomic data demonstrate that ours is a robust and effective approach that can find subsets of features with higher classification accuracy and/or smaller size compared to each individual feature selection algorithm. A common characteristic of text categorization tasks is multi-label classification with a great number of features, which makes wrapper methods time-consuming and impractical. We proposed a simple filter (non-wrapper) approach called the Relation Strength and Frequency Variance (RSFV) measure. The basic idea is that informative features are those that are highly correlated with the class and are distributed most differently among all classes. The approach is compared with two well-known feature selection methods in experiments on two standard text corpora. The experiments show that RSFV generates equal or better performance than the others in many cases.
|