
Algoritmos de seleção de características personalizados por classe para categorização de texto / Class-customized feature selection algorithms for text categorization

FRAGOSO, Rogério César Peixoto 26 August 2016 (has links)
A categorização de textos é uma importante ferramenta para organização e recuperação de informações em documentos digitais. Uma abordagem comum é representar cada palavra como uma característica. Entretanto, a maior parte das características em um documento textual são irrelevantes para sua categorização. Assim, a redução de dimensionalidade é um passo fundamental para melhorar o desempenho de classificação e reduzir o alto custo computacional inerente a problemas de alta dimensionalidade, como é o caso da categorização de textos. A estratégia mais utilizada para redução de dimensionalidade em categorização de textos passa por métodos de seleção de características baseados em filtragem. Métodos deste tipo exigem um esforço para configurar o tamanho do vetor final de características. Este trabalho propõe métodos de filtragem com o intuito de melhorar o desempenho de classificação em comparação com os métodos atuais e de tornar possível a automatização da escolha do tamanho do vetor final de características. O primeiro método proposto, chamado Category-dependent Maximum f Features per Document-Reduced (cMFDR), define um limiar para cada categoria para determinar quais documentos serão considerados no processo de seleção de características. O método utiliza um parâmetro para definir quantas características são selecionadas por documento. Esta abordagem apresenta algumas vantagens, como a simplificação do processo de escolha do subconjunto mais efetivo através de uma drástica redução da quantidade de possíveis configurações. O segundo método proposto, Automatic Feature Subsets Analyzer (AFSA), introduz um procedimento para determinar, de maneira guiada por dados, o melhor subconjunto de características dentre um número de subconjuntos gerados. Este método utiliza o mesmo parâmetro usado por cMFDR para definir a quantidade de características no vetor final. Isto permite que a busca pelo melhor subconjunto tenha um baixo custo computacional. O desempenho dos métodos propostos foi avaliado nas bases de dados WebKB, Reuters, 20 Newsgroup e TDT2, utilizando as funções de avaliação de características Bi-Normal Separation, Class Discriminating Measure e Chi-Squared Statistics. Os resultados dos experimentos demonstraram uma maior efetividade dos métodos propostos em relação aos métodos do estado da arte. / Text categorization is an important technique to organize and retrieve information from digital documents. A common approach is to represent each word as a feature. However, most of the features in a textual document are irrelevant to its categorization. Thus, dimensionality reduction is a fundamental step to improve classification performance and diminish the high computational cost inherent in high-dimensional problems, such as text categorization. The most commonly adopted strategy for dimensionality reduction in text categorization relies on feature selection methods based on filtering. This kind of method requires an effort to configure the size of the final feature vector.
This work proposes filtering methods that aim to improve categorization performance compared with state-of-the-art methods and to make it possible to determine the size of the final feature set automatically. The first proposed method, namely Category-dependent Maximum f Features per Document-Reduced (cMFDR), sets a threshold for each category that determines which documents are considered in the feature selection process. The method uses a parameter to define how many features are selected per document. This approach presents some advantages, such as simplifying the process of choosing the most effective subset through a strong reduction of the number of possible configurations. The second proposed method, Automatic Feature Subsets Analyzer (AFSA), introduces a procedure to determine, in a data-driven way, the most effective subset among a number of generated subsets. This method uses the same parameter used by cMFDR to define the size of the final feature vector, which keeps the computational cost of finding the most effective set low. The performance of the proposed methods was assessed on the WebKB, Reuters, 20 Newsgroup and TDT2 datasets, using the Bi-Normal Separation, Class Discriminating Measure and Chi-Squared Statistics feature evaluation functions. The experimental results demonstrate that the proposed methods are more effective than state-of-the-art methods.
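
The per-category thresholds that define cMFDR are the thesis's own contribution and are not detailed in the abstract; the Python sketch below (hypothetical names, chi-squared standing in as the feature evaluation function) only illustrates the general "keep the top f features of each document" filtering idea that both proposed methods build on.

import numpy as np
from sklearn.feature_selection import chi2
from sklearn.feature_extraction.text import CountVectorizer

def mfd_feature_subset(texts, labels, f=2):
    """Keep, for each document, its f highest-scoring terms (chi-squared score);
    the final vocabulary is the union of those per-document selections."""
    vec = CountVectorizer()
    X = vec.fit_transform(texts)               # documents x terms count matrix
    scores, _ = chi2(X, labels)                # global relevance score per term
    selected = set()
    for row in X:                              # iterate over document rows
        terms = row.nonzero()[1]               # indices of terms present in this doc
        if terms.size == 0:
            continue
        top = terms[np.argsort(scores[terms])[::-1][:f]]
        selected.update(top.tolist())
    keep = sorted(selected)
    vocab = np.array(vec.get_feature_names_out())
    return keep, vocab[keep]

# toy usage
docs = ["cheap loans offer", "meeting agenda attached",
        "offer expires today", "project meeting notes"]
y = [1, 0, 1, 0]
idx, terms = mfd_feature_subset(docs, y, f=1)
print(terms)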

Definição automática da quantidade de atributos selecionados em tarefas de agrupamento de dados / Automatic feature quantification in data clustering tasks

José Augusto Andrade Filho 17 September 2013 (has links)
Conjuntos de dados reais muitas vezes apresentam um grande número de atributos preditivos ou de entrada, o que leva a uma grande quantidade de informação. Entretanto, essa quantidade de informação nem sempre significa uma melhoria em termos de desempenho de técnicas de agrupamento. Além disso, alguns atributos podem estar correlacionados ou adicionar ruído, reduzindo a qualidade do agrupamento de dados. Esse problema motivou o desenvolvimento de técnicas de seleção de atributos, que tentam encontrar um subconjunto com os atributos mais relevantes para agrupar os dados. Neste trabalho, o foco está no problema de seleção de atributos não supervisionados. Esse é um problema difícil, pois não existe informação sobre rótulos das classes. Portanto, não existe um guia para medir a qualidade do subconjunto de atributos. O principal objetivo deste trabalho é definir um método para identificar quantos atributos devem ser selecionados (após ordená-los com base em algum critério). Essa tarefa é realizada por meio da técnica de Falsos Vizinhos Mais Próximos, que tem sua origem na teoria do caos. Resultados experimentais mostram que essa técnica informa um bom número aproximado de atributos a serem selecionados. Quando comparado a outras técnicas, na maioria dos casos analisados, enquanto menos atributos são selecionados, a qualidade da partição dos dados é mantida. / Real-world datasets commonly present high-dimensional data, which leads to an increased amount of information. However, this does not always imply an improvement in the performance of clustering techniques. Furthermore, some features may be correlated or add unexpected noise, reducing the data clustering performance. This problem motivated the development of feature selection techniques, which attempt to find the most relevant subset of features to cluster data. In this work, we focus on the problem of unsupervised feature selection. This is a difficult problem, since there is no class label information. Therefore, there is no guide to measure the quality of the feature subset. The main goal of this work is to define a method to identify the number of features to select (after sorting them based on some criterion). This task is carried out by means of the False Nearest Neighbors technique, which has its roots in chaos theory. Experimental results show that this technique gives a good approximation of the number of features to select. When compared to other techniques, in most of the analyzed cases, it maintains the quality of the data partition while selecting fewer features.
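
The False Nearest Neighbors criterion comes from chaos theory, where it estimates embedding dimensions; a rough Python sketch of how it might be adapted to decide how many ranked features to keep is given below. The tolerance values and the exact adaptation used in the thesis are assumptions here, and all names are illustrative.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def false_neighbor_fraction(X_d, X_d1, tol=10.0):
    """Fraction of points whose nearest neighbor in d dimensions moves far away
    once the (d+1)-th feature is appended -- a 'false' neighbor."""
    nn = NearestNeighbors(n_neighbors=2).fit(X_d)
    dist, idx = nn.kneighbors(X_d)              # column 1 is the nearest neighbor
    neighbor = idx[:, 1]
    d_old = np.maximum(dist[:, 1], 1e-12)
    extra = np.abs(X_d1[:, -1] - X_d1[neighbor, -1])   # change along the new axis
    return np.mean(extra / d_old > tol)

def suggest_n_features(X_ranked, tol=10.0, stop=0.05):
    """X_ranked: samples x features, columns already sorted by some relevance
    criterion. Returns the smallest d where the false-neighbor fraction is low."""
    n_features = X_ranked.shape[1]
    for d in range(1, n_features):
        frac = false_neighbor_fraction(X_ranked[:, :d], X_ranked[:, :d + 1], tol)
        if frac < stop:
            return d
    return n_features

# toy usage on a matrix whose columns are assumed to be ranked already
X = np.random.default_rng(0).normal(size=(200, 8))
print(suggest_n_features(X))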

Metodologia de fusão de vídeos e sons para monitoração de comportamento de insetos / Merging methodology videos and sounds for monitoring insect behavior

Lúcio André de Castro Jorge 02 September 2011 (has links)
Este trabalho apresenta uma nova abordagem para fusão de vídeo e som diretamente no espaço de atributos visando otimizar a identificação do comportamento de insetos. Foi utilizado o detector de Harris para rastreamento dos insetos, assim como a técnica inovadora Wavelet-Multifractal para análise de som. No caso da Wavelet-Multifractal, foram testadas várias Wavelet-mães, sendo a Morlet a melhor escolha para sons de insetos. Foi proposta a Wavelet Módulo Máximo para extrair atributos multifractais dos sons para serem utilizados no reconhecimento de padrões de comportamento de insetos. A abordagem Wrapper de mineração de dados foi usada para selecionar os atributos relevantes. Foi constatado que a abordagem Wavelet-multifractal identifica melhor os sons, particularmente no caso de distorções provocadas por ruídos. As imagens foram responsáveis pela identificação de acasalamento e os sons pelos outros comportamentos. Foi também proposto um novo método do triângulo como representação simplificada do espectro multifractal visando simplificação do processamento. / This work presents an innovative video and sound fusion approach, based on feature subset selection directly in the attribute space, to optimize the identification of insect behavior. The Harris detector was used for insect movement tracking, and an innovative Wavelet-Multifractal technique was used to analyze the insect sounds. In the case of the Wavelet-Multifractal technique, more than one mother wavelet was tested, with the Morlet wavelet being the best choice for insect sounds. The wavelet modulus maxima method was proposed to extract multifractal sound attributes to be used in the pattern recognition of insect behavior. The wrapper data mining approach was used to select relevant attributes. It was found that, in general, wavelet-multifractal based schemes perform better for sound, particularly in terms of minimizing the influence of noise distortion. The image features identified mating, while the sound features identified the other behaviors. A new triangle representation of the multifractal spectrum was also proposed as a processing simplification.
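
As a hedged illustration of the sound-analysis stage only (the multifractal spectrum and wavelet-modulus-maxima steps of the thesis are not reproduced), a Morlet continuous wavelet transform of a synthetic signal can be computed with PyWavelets as follows; the sampling rate, scale range and the synthetic stand-in signal are arbitrary choices.

import numpy as np
import pywt

fs = 8000                                      # assumed sampling rate (Hz)
t = np.arange(0, 1.0, 1.0 / fs)
# synthetic stand-in for an insect sound: a rising tone plus noise
signal = np.sin(2 * np.pi * (200 + 300 * t) * t) + 0.1 * np.random.randn(t.size)

scales = np.arange(1, 128)
coeffs, freqs = pywt.cwt(signal, scales, 'morl', sampling_period=1.0 / fs)

# simple scale-wise energy features, usable as inputs to a behavior classifier
energy_per_scale = (np.abs(coeffs) ** 2).mean(axis=1)
print(energy_per_scale[:5], freqs[:5])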

The effect of colour use on the quality of websites

Grijseels, Dorieke January 2016 (has links)
The design of a website is important for the success of a company. Colours play an important part in websites. The goal of this thesis is to find out how the use of colour in websites relates to the quality of websites. Different aspects are studied. First it was found that the harmony of a colour palette only weakly correlates with the quality of a website. This correlation increases when only darker colour palettes are used. Next a method was proposed to extract the colour palette from a website. This novel method takes the saliency of the pixels in a website into account. Lastly, the palettes extracted using this method were utilized to propose a model to explain the relation between colour use and quality of websites. Sixty-one different features were tested using three different methods of feature selection. The accuracy achieved in the best model was low. Future work is suggested to improve on this, which should focus on identifying more relevant features and training the model using a better database.
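
The thesis's saliency model and palette-extraction procedure are not specified in the abstract; the following Python sketch merely illustrates the general idea of saliency-weighted palette extraction, using weighted k-means and a placeholder centre-bias saliency map. All names and parameter values are assumptions.

import numpy as np
from sklearn.cluster import KMeans

def extract_palette(image, n_colours=5, saliency=None):
    """image: HxWx3 uint8 array; returns n_colours RGB palette entries."""
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3).astype(float)
    if saliency is None:
        # placeholder saliency: pixels near the image centre count more
        yy, xx = np.mgrid[0:h, 0:w]
        d = np.hypot(yy - h / 2, xx - w / 2)
        saliency = 1.0 - d / d.max()
    weights = saliency.reshape(-1)
    km = KMeans(n_clusters=n_colours, n_init=10, random_state=0)
    km.fit(pixels, sample_weight=weights)       # salient pixels dominate the clusters
    return km.cluster_centers_.astype(np.uint8)

# toy usage on a random "screenshot"
img = np.random.randint(0, 256, size=(60, 80, 3), dtype=np.uint8)
print(extract_palette(img, n_colours=5))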

Sélection de variables à partir de données d'expression : signatures moléculaires pour le pronostic du cancer du sein et inférence de réseaux de régulation génique / Feature selection from gene expression data : molecular signatures for breast cancer prognosis and gene regulation network inference

Haury, Anne-Claire 14 December 2012 (has links)
De considérables développements dans le domaine des biotechnologies ont modifié notre approche de l'analyse de l'expression génique. En particulier, les puces à ADN permettent de mesurer l'expression des gènes à l'échelle du génome, dont l'analyse est confiée au statisticien. À partir de ces données dites en grande dimension, nous contribuons, dans cette thèse, à l'étude de deux problèmes biologiques. Nous traitons ces questions comme des problèmes d'apprentissage statistique supervisé et, en particulier, de sélection de variables, où il s'agit d'extraire, parmi toutes les variables - gènes - à disposition, celles qui sont nécessaires et suffisantes pour prédire la réponse à une question donnée. D'une part, nous travaillons à repérer des listes de gènes, connues sous le nom de signatures moléculaires et supposées contenir l'information nécessaire à la prédiction de l'issue du cancer du sein. La prédiction des événements métastatiques est en effet cruciale afin d'évaluer, dès l'apparition de la tumeur primaire, la nécessité d'un traitement par chimio-thérapie adjuvante, connue pour son agressivité. Nous présentons dans cette thèse trois contributions à ce problème. Dans la première, nous proposons une comparaison systématique des méthodes de sélection de variables, en termes de performance prédictive, de stabilité et d'interprétabilité biologique de la solution. Les deux autres contributions portent sur l'application de méthodes dites de parcimonie structurée (graph Lasso et k-support norm) au problème de sélection de signatures. Ces trois travaux discutent également l'impact de l'utilisation de méthodes d'ensemble (bootstrap et ré-échantillonnage). D'autre part, nous nous intéressons au problème d'inférence de réseau génique, consistant à déterminer la structure des interactions entre facteurs de transcription et gènes cibles. Les premiers sont des protéines ayant la faculté de réguler la transcription des gènes cibles, c'est-à-dire de l'activer ou de la réprimer. Ces régulations peuvent être représentées sous la forme d'un graphe dirigé, où les noeuds symbolisent les gènes et les arêtes leurs interactions. Nous proposons un nouvel algorithme, TIGRESS, classé troisième lors du challenge d'inférence de réseaux DREAM5 en 2010. Basé sur l'algorithme LARS couplé à une stratégie de ré-échantillonnage, TIGRESS traite chaque gène cible séparément, en sélectionnant ses régulateurs, puis assemble ces sous-problèmes pour prédire l'ensemble du réseau. Enfin, nous consacrons le dernier chapitre à une discussion ayant pour objectif de replacer les travaux de cette thèse dans un contexte bibliographique et épistémologique plus large. / Important developments in biotechnologies have moved the paradigm of gene expression analysis from a hypothesis-driven to a data-driven approach. In particular, DNA microarrays make it possible to measure gene expression on a genome-wide scale, leaving its analysis to statisticians. From these high-dimensional data, we contribute, in this thesis, to two biological problems. Both questions are considered from the supervised learning point of view. In particular, we see them as feature selection problems. Feature selection consists in extracting variables - here, genes - that contain relevant and sufficient information to predict the answer to a given question. First, we are concerned with selecting lists of genes, otherwise known as molecular signatures and assumed to contain the necessary amount of information to predict the outcome of breast cancer.
It is indeed crucial to be able to estimate the chances for future metastatic events from the primary tumor, in order to evaluate the relevance of having the patient undergo an aggressive adjuvant chemotherapy. In this thesis, we present three contributions to this problem. First, we propose a systematic comparison of feature selection methods in terms of predictive performance, stability and biological interpretability of the solution they output. The second and third contributions focus on applying so-called structured sparsity methods (here graph Lasso and k-overlap norm) to the signature selection problem. In all three studies, we discuss the impact of using so-called Ensemble methods (bootstrap, resampling). Second, we are interested in the gene regulatory network inference problem that consists in determining patterns of interaction between transcription factors and target genes. The former are proteins that regulate the transcription of target genes in that they can either activate or repress it. These regulations can be represented as a directed graph, where nodes symbolize genes and edges depict their interactions. We introduce a new algorithm named TIGRESS, which granted us the third place at the DREAM5 network inference challenge in 2010. Based on the LARS algorithm and a resampling procedure, TIGRESS considers each target gene independently by inferring its regulators and finally assembles the individual predictions to provide an estimate of the entire network. Finally, in the last chapter, we provide a discussion that attempts to place the contributions of this thesis in a broader bibliographical and epistemological context.
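
A minimal sketch of a TIGRESS-style scoring step for a single target gene is shown below: LARS fitted on repeated random subsamples, with each transcription factor scored by how often it is selected among the first few steps. The scoring variants, weight perturbation and normalisation of the published method are omitted, and all names and parameter values are illustrative.

import numpy as np
from sklearn.linear_model import Lars

def regulator_scores(tf_expr, target_expr, n_resample=200, n_steps=5, seed=0):
    """tf_expr: samples x TFs matrix, target_expr: samples vector.
    Returns, per TF, the frequency with which LARS selects it within the
    first n_steps steps over random half-subsamples of the experiments."""
    rng = np.random.default_rng(seed)
    n, p = tf_expr.shape
    counts = np.zeros(p)
    for _ in range(n_resample):
        idx = rng.choice(n, size=n // 2, replace=False)
        lars = Lars(n_nonzero_coefs=n_steps)
        lars.fit(tf_expr[idx], target_expr[idx])
        counts[lars.active_] += 1              # indices of the selected TFs
    return counts / n_resample

# toy usage: the 4th candidate TF truly drives the target
rng = np.random.default_rng(1)
tfs = rng.normal(size=(100, 10))
target = 2.0 * tfs[:, 3] + 0.5 * rng.normal(size=100)
print(regulator_scores(tfs, target).round(2))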

Static Code Features for a Machine Learning based Inspection: An approach for C

Tribus, Hannes January 2010 (has links)
Delivering fault-free code is the clear goal of each developer; however, the best method to achieve this aim is still an open question. Although several approaches have been proposed in the literature, there exists no overall best way. One possible solution proposed recently is to combine static source code analysis with the discipline of machine learning. An approach in this direction has been defined within this work, implemented as a prototype and validated subsequently. It shows a possible translation of a piece of source code into a machine learning algorithm's input and, furthermore, its suitability for the task of fault detection. In the context of the present work, two prototypes have been developed to show the feasibility of the presented idea. The output they generated on open source projects has been collected and used to train and rank various machine learning classifiers in terms of accuracy, false positive and false negative rates. The best among them have subsequently been validated again on an open source project. The first study identified at least 6 convincing classifiers, including "MultiLayerPerceptron", "Ibk" and "ADABoost" on a "BFTree". All except the latter, which failed completely, could be validated in the second study. Although it is only a prototype, it shows the suitability of some machine learning algorithms for static source code analysis.
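
A compressed sketch of the study's final stage, under assumed inputs, could look like the following: a table of static metrics per C function (the metric names are hypothetical, and the data here is synthetic) with a fault label, used to train a few classifiers and compare them on accuracy, false-positive and false-negative rate. The scikit-learn classifiers only stand in for the Weka ones named above.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier   # rough analogue of Weka's IBk
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

rng = np.random.default_rng(0)
# synthetic stand-in: [LOC, cyclomatic complexity, nesting depth, n_params] per function
X = rng.normal(size=(400, 4))
y = (X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=400) > 0.8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
models = {
    "MLP": MLPClassifier(max_iter=2000, random_state=0),
    "kNN": KNeighborsClassifier(),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
    print(name, accuracy_score(y_te, pred), fp / (fp + tn), fn / (fn + tp))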

Optimization methods for inventive design / Méthodes d’optimisation pour la conception inventive

Lin, Lei 01 April 2016 (has links)
La thèse traite des problèmes d'invention où les solutions des méthodes d'optimisation ne satisfont pas aux objectifs des problèmes à résoudre. Les problèmes ainsi définis exploitent, pour leur résolution, un modèle de problème étendant le modèle de la TRIZ classique sous une forme canonique appelée "système de contradictions généralisées". Cette recherche instrumente un processus de résolution basé sur la boucle simulation-optimisation-invention permettant d'utiliser à la fois des méthodes d'optimisation et d'invention. Plus précisément, elle modélise l'extraction des contradictions généralisées à partir des données de simulation sous forme de problèmes d'optimisation combinatoire et propose des algorithmes donnant toutes les solutions à ces problèmes. / The thesis deals with problems of invention in which the solutions produced by optimization methods do not meet the objectives of the problems to be solved. The problems thus defined exploit, for their resolution, a problem model that extends classical TRIZ into a canonical form called a "generalized system of contradictions". This research builds a resolution process based on the simulation-optimization-invention loop, allowing both optimization and invention methods to be used. More precisely, it models the extraction of generalized contradictions from simulation data as combinatorial optimization problems and proposes algorithms that yield all the solutions to these problems.

Knowledge discovery method for deriving conditional probabilities from large datasets

Elsilä, U. (Ulla) 04 December 2007 (has links)
In today's world, enormous amounts of data are being collected every day. Thus, the problems of storing, handling, and utilizing the data are faced constantly. As the human mind itself can no longer interpret the vast datasets, methods for extracting useful and novel information from the data are needed and developed. These methods are collectively called knowledge discovery methods. In this thesis, a novel combination of feature selection and data modeling methods is presented in order to help with this task. This combination includes the methods of basic statistical analysis, linear correlation, self-organizing map, parallel coordinates, and k-means clustering. The presented method can be used, first, to select the most relevant features from even hundreds of them and, then, to model the complex inter-correlations within the selected ones. The capability to handle hundreds of features opens up the possibility to study more extensive processes instead of just looking at smaller parts of them. The results of a k-nearest-neighbors study show that the presented feature selection procedure is valid and appropriate. A second advantage of the presented method is the possibility to use thousands of samples. Whereas the current rules for selecting appropriate limits for utilizing the methods are theoretically proved only for small sample sizes, especially in the case of linear correlation, this thesis gives guidelines for feature selection with thousands of samples. A third positive aspect is the nature of the results: given that the outcome of the method is a set of conditional probabilities, the derived model is highly unrestrictive and rather easy to interpret. In order to test the presented method in practice, it was applied to study two different cases of steel manufacturing with hot strip rolling. In the first case, the conditional probabilities for different types of retentions were derived and, in the second case, the rolling conditions for the occurrence of wedge were revealed. The results of both of these studies show that steel manufacturing processes are indeed very complex and highly dependent on the various stages of the manufacturing. This was further confirmed by the fact that with studies of k-nearest-neighbors and C4.5, it was impossible to derive useful models concerning the datasets as a whole. It is believed that the reason for this lies in the nature of these two methods, meaning that they are unable to grasp such manifold inter-correlations in the data. On the contrary, the presented method of conditional probabilities allowed new knowledge to be gained of the studied processes, which will help to better understand these processes and to enhance them.
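
A simplified sketch of the kind of output the method produces is given below: after feature selection and clustering (plain k-means stands in for the SOM / parallel coordinates / k-means combination of the thesis), the model is read off as conditional probabilities of an outcome given cluster membership. The data, feature names and outcome are invented for illustration.

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# hypothetical process data: a few already-selected features plus a binary outcome
df = pd.DataFrame(rng.normal(size=(500, 3)), columns=["temp", "speed", "tension"])
df["retention"] = (df["temp"] + 0.8 * df["tension"]
                   + rng.normal(scale=0.7, size=500) > 1.0).astype(int)

features = ["temp", "speed", "tension"]
Z = StandardScaler().fit_transform(df[features])
df["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(Z)

# P(retention = 1 | cluster): the kind of conditional-probability model the
# abstract refers to, easy to read off and interpret
print(df.groupby("cluster")["retention"].mean())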

Time Dependent Kernel Density Estimation: A New Parameter Estimation Algorithm, Applications in Time Series Classification and Clustering

Wang, Xing 23 May 2016 (has links)
The Time Dependent Kernel Density Estimation (TDKDE) developed by Harvey & Oryshchenko (2012) is a kernel density estimation adjusted by the Exponentially Weighted Moving Average (EWMA) weighting scheme. The Maximum Likelihood Estimation (MLE) procedure for estimating the parameters proposed by Harvey & Oryshchenko (2012) is easy to apply but has two inherent problems. In this study, we evaluate the performances of the probability density estimation in terms of the uniformity of Probability Integral Transforms (PITs) on various kernel functions combined with different preset numbers. Furthermore, we develop a new estimation algorithm which can be conducted using Artificial Neural Networks to eliminate the inherent problems with the MLE method and to improve the estimation performance as well. Based on the new estimation algorithm, we develop the TDKDE-based Random Forests time series classification algorithm which is significantly superior to the commonly used statistical feature-based Random Forests method as well as the Kernel Density Estimation (KDE)-based Random Forests approach. Furthermore, the proposed TDKDE-based Self-organizing Map (SOM) clustering algorithm is demonstrated to be superior to the widely used Discrete-Wavelet-Transform (DWT)-based SOM method in terms of the Adjusted Rand Index (ARI).
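
The core of a time-dependent KDE with EWMA weights can be sketched in a few lines of numpy, as below. The decay parameter and bandwidth are fixed by hand here, whereas estimating them well is precisely what the thesis's new algorithm addresses; the function name and values are assumptions.

import numpy as np

def ewma_kde(x_grid, samples, omega=0.97, h=0.3):
    """Density on x_grid from time-ordered samples, using an EWMA-weighted
    Gaussian kernel density estimate (recent observations dominate)."""
    T = len(samples)
    ages = T - 1 - np.arange(T)                 # 0 for the newest observation
    w = (1 - omega) * omega ** ages
    w = w / w.sum()                             # normalise the weights
    diffs = (x_grid[:, None] - samples[None, :]) / h
    kernel = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)
    return (kernel * w[None, :]).sum(axis=1) / h

# toy usage: the data-generating mean drifts over time and the estimate tracks it
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1, 300)])
grid = np.linspace(-6, 6, 200)
density = ewma_kde(grid, samples)
print(grid[np.argmax(density)])                 # peak sits near the recent mean (~2)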

A Multi-label Text Classification Framework: Using Supervised and Unsupervised Feature Selection Strategy

Ma, Long 08 August 2017 (has links)
Text classification, the task of assigning metadata to documents, requires significant time and effort when performed by humans. Moreover, with online-generated content growing explosively, manually annotating large-scale, unstructured data becomes a challenge. Currently, many state-of-the-art text mining methods have been applied to the classification process, many of them based on keyword extraction. However, when these keywords are used as features in a classification task, the feature dimension is commonly huge. In addition, how to select keywords from large collections of documents as classification features is also a challenge, and when traditional machine learning algorithms are applied to large data sets the computational cost is high. Furthermore, almost 80% of real data is unstructured and unlabeled, so advanced supervised feature selection methods cannot be used directly to select entities from such massive data. Statistical strategies are usually utilized to discover key features in unlabeled data; here, we propose a novel method to extract important features effectively before feeding them into the classification assignment. Another challenge in text classification is the multi-label problem, the assignment of multiple non-exclusive labels to documents, which makes text classification more complicated than single-label classification. Considering the above issues, we develop a framework for extracting features and reducing data dimensionality while solving the multi-label problem on labeled and unlabeled data sets. To reduce the data dimension, we provide 1) a hybrid feature selection method that extracts meaningful features according to the importance of each feature; 2) a Word2Vec-based representation of each document with a lower feature dimension for document categorization on big data sets; and 3) an unsupervised approach to extract features from real online-generated data for text classification and prediction. To solve the multi-label classification task, we design a new Multi-Instance Multi-Label (MIML) algorithm within the proposed framework.
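
As a small, assumed illustration of the supervised half of such a framework (the Word2Vec representation, the unsupervised selection strategy and the MIML algorithm are the thesis's own contributions and are not reproduced), a TF-IDF + chi-squared + one-vs-rest pipeline handles the multi-label case as follows; the documents, labels and parameter values are toy examples.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

docs = [
    "stock markets fell as oil prices rose",
    "the team won the championship final",
    "new vaccine trial shows promising results",
    "oil company profits beat market expectations",
    "injury forces star player out of the final",
]
labels = [["finance", "energy"], ["sports"], ["health"],
          ["finance", "energy"], ["sports"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)                  # documents x labels indicator matrix

# chi-squared selection runs inside each one-vs-rest binary subproblem
per_label = Pipeline([
    ("select", SelectKBest(chi2, k=15)),
    ("lr", LogisticRegression(max_iter=1000)),
])
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("ovr", OneVsRestClassifier(per_label)),
])
clf.fit(docs, Y)
pred = clf.predict(["oil prices and the stock market"])
print(mlb.inverse_transform(pred))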
