111

Automatiserad matchning vid rekrytering / Automated matching in recruitment

Strand, Henrik January 2018 (has links)
För små företag utan rekryteringsansvarig person kan det vara svårt att hitta rätt personal. Brist på sådana resurser är en påfrestning som leder till stress och mindre lyckade rekryteringar. Målet med arbetet var att hitta en lösning för att automatisera matchning i en rekryteringsprocess genom att ge förslag på relevanta personer som tidigare sökt jobb hos företag via Cheffle:s tjänst. Det finns flera olika sätt att matcha uppsättningar av data. I det här fallet användes maskininlärning som lösningsmetod. Detta implementerades tillsammans med en prototyp som hämtade in data om jobbet och den arbetssökande. Maskininlärningsmodellerna Supportvektormaskin och Artificiella Neurala Nätverk använde denna data för att betygsätta de arbetssökande utifrån hur väl de matchade jobbannonsen. Arbetets slutsats är att ingen modell hade tillräckligt hög precision i sina klassificeringar för att användas i en verklig implementation, detta då endast små mängder testdata fanns tillgängliga. Resultatet visade att Supportvektormaskin och Artificiella Neurala Nätverk har potential att användas vid denna typ av matchning, men för att fastställa detta krävs mer träningsdata. / It can be hard for a small company with no dedicated HR role to find suitable recruits. A lack of such resources takes a toll on the existing employees and increases stress, which further harms recruiting. The goal of this work was to find a solution for automating matching in a recruitment process by suggesting relevant applicants who have previously applied for jobs via Cheffle's service. There are multiple ways of matching data; in this study, machine learning was used. A prototype was developed that collected data about a job and its related applicants. The data was then used by the machine learning models Support Vector Machine and Artificial Neural Network to score the applicants by how closely they matched the job position. The conclusion of this work is that no model reached a classification precision high enough to be used in a real implementation, which is likely a result of the small amount of test data available. The results showed that the Support Vector Machine and Artificial Neural Network models have potential for this type of matching, but more training data is needed to firmly determine this.
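As a rough illustration of the scoring approach described above (not the thesis code), the sketch below trains an SVM and a small neural network on numeric job/applicant match features and outputs a match probability for a new applicant. The feature set and data are invented for illustration.

```python
# Minimal sketch, assuming hypothetical match features such as skill overlap,
# experience gap, and education match; label 1 = good match, 0 = poor match.
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X = np.array([[0.8, 1, 1], [0.2, 6, 0], [0.9, 0, 1], [0.1, 4, 0],
              [0.7, 2, 1], [0.3, 5, 0], [0.6, 1, 1], [0.2, 3, 0]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
ann = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0))
svm.fit(X, y)
ann.fit(X, y)

candidate = np.array([[0.75, 1, 1]])           # a new applicant for the same job ad
print("SVM match score:", svm.predict_proba(candidate)[0, 1])
print("ANN match score:", ann.predict_proba(candidate)[0, 1])
```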
112

Klasifikace dokumentů podle tématu / Document Classification

Marek, Tomáš January 2013 (has links)
This thesis deals with document classification, in particular with text classification methods. The main goal of the thesis is to analyze two chosen document classification algorithms, describe them, and implement them. The chosen algorithms are the Bayes classifier and a classifier based on support vector machines (SVM), which were analyzed and implemented in the practical part of this thesis. Another main goal is to create and choose optimal text features that describe the input text best and thus lead to the best classification results. The thesis ends with a set of tests comparing the efficiency of the chosen classifiers under various conditions.
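A minimal sketch of the kind of comparison described above (not the thesis implementation): both classifiers are trained on TF-IDF text features and evaluated on held-out documents. The dataset and feature settings here are arbitrary stand-ins.

```python
# Compare a Bayes classifier and a linear SVM on TF-IDF features (scikit-learn).
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

categories = ["sci.space", "rec.sport.hockey", "talk.politics.mideast"]
train = fetch_20newsgroups(subset="train", categories=categories)
test = fetch_20newsgroups(subset="test", categories=categories)

for name, clf in [("Naive Bayes", MultinomialNB()), ("Linear SVM", LinearSVC())]:
    model = make_pipeline(TfidfVectorizer(stop_words="english", max_features=20000), clf)
    model.fit(train.data, train.target)
    acc = accuracy_score(test.target, model.predict(test.data))
    print(f"{name}: accuracy = {acc:.3f}")
```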
113

Découverte de biomarqueurs prédictifs en cancer du sein par intégration transcriptome-interactome / Biomarkers discovery in breast cancer by Interactome-Transcriptome Integration

Garcia, Maxime 20 December 2013 (has links)
L’arrivée des technologies à haut débit pour mesurer l’expression des gènes a permis l’utilisation de signatures génomiques pour prédire des conditions cliniques ou la survie du patient. Cependant de telles signatures ont des limitations, comme la dépendance au jeu de données d’entraînement et le manque de généralisation. Nous proposons un nouvel algorithme, Intégration Transcriptome-Interactome (ITI) (Garcia et al.), pour extraire une signature généralisable prédisant la rechute métastatique dans le cancer du sein par superposition d’un très large jeu de données d’interactions protéine-protéine sur de multiples jeux de données d’expression des gènes. Cette méthode ré-implémente l’algorithme de Chuang et al., avec la capacité supplémentaire d’extraire une signature génomique à partir de plusieurs jeux de données d’expression des gènes simultanément. Une analyse non supervisée et une analyse supervisée ont été réalisées sur un compendium de jeux de données issus de puces à ADN en cancer du sein. Les performances des signatures trouvées par ITI ont été comparées aux performances des signatures préalablement publiées (Wang et al., Van De Vijver et al., Sotiriou et al.). Nos résultats montrent que les signatures ITI sont plus stables et plus généralisables, et sont plus performantes pour classifier un jeu de données indépendant. Nous avons trouvé des sous-réseaux formant des complexes précédemment reliés à des fonctions biologiques impliquées dans la métastase et le cancer du sein. Plusieurs gènes directeurs ont été détectés, dont CDK1, NCK1 et PDGFB, certains n’étant pas déjà reliés à la rechute métastatique dans le cancer du sein. / High-throughput gene-expression profiling technologies yield genomic signatures to predict clinical conditions or patient outcome. However, such signatures have limitations, such as dependency on the training set and lack of generalization. We propose a novel algorithm, Interactome-Transcriptome Integration (ITI) (Garcia et al.), to extract a generalizable signature predicting metastatic relapse in breast cancer by superimposing large-scale protein-protein interaction data over several gene-expression data sets. This method re-implements the Chuang et al. algorithm, with the added capability to extract a genomic signature from several gene-expression data sets simultaneously. An unsupervised and a supervised analysis were performed on a breast cancer compendium of DNA microarray data sets. The performances of the signatures found with ITI were compared with previously published signatures (Wang et al., Van De Vijver et al., Sotiriou et al.). Our results show that ITI's signatures are more stable and more generalizable, and perform better when classifying an independent data set. We found subnetworks forming complexes previously linked to biological functions involved in metastasis and breast cancer. Several driver genes were detected, including CDK1, NCK1 and PDGFB, some of which had not previously been linked to metastatic relapse in breast cancer.
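To make the Chuang-style idea of network-guided signatures concrete, here is a heavily simplified sketch (not the published ITI code): a subnetwork of a protein-protein interaction graph is grown greedily, scored by how well the averaged expression of its genes separates relapse from non-relapse samples. The graph, expression values, and the t-statistic score are all stand-ins.

```python
import numpy as np
import networkx as nx
from scipy import stats

rng = np.random.default_rng(0)
genes = [f"g{i}" for i in range(30)]
ppi = nx.gnm_random_graph(30, 60, seed=0)                    # stand-in PPI network
ppi = nx.relabel_nodes(ppi, dict(enumerate(genes)))
expr = {g: rng.normal(size=40) for g in genes}               # expression, 40 samples
labels = np.array([0] * 20 + [1] * 20)                       # 0 = no relapse, 1 = relapse

def score(subnet):
    avg = np.mean([expr[g] for g in subnet], axis=0)         # subnetwork activity
    t, _ = stats.ttest_ind(avg[labels == 1], avg[labels == 0])
    return abs(t)

def greedy_subnetwork(seed_gene):
    subnet, best = {seed_gene}, score({seed_gene})
    improved = True
    while improved:
        improved = False
        neighbors = set().union(*(ppi.neighbors(g) for g in subnet)) - subnet
        for cand in neighbors:
            s = score(subnet | {cand})
            if s > best:                                      # extend only if the score improves
                subnet, best, improved = subnet | {cand}, s, True
    return subnet, best

print(greedy_subnetwork("g0"))
```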
114

Extraction de relations en domaine de spécialité / Relation extraction in specialized domains

Minard, Anne-Lyse 07 December 2012 (has links)
La quantité d'information disponible dans le domaine biomédical ne cesse d'augmenter. Pour que cette information soit facilement utilisable par les experts d'un domaine, il est nécessaire de l'extraire et de la structurer. Pour avoir des données structurées, il convient de détecter les relations existant entre les entités dans les textes. Nos recherches se sont focalisées sur la question de l'extraction de relations complexes représentant des résultats expérimentaux, et sur la détection et la catégorisation de relations binaires entre des entités biomédicales. Nous nous sommes intéressée aux résultats expérimentaux présentés dans les articles scientifiques. Nous appelons résultat expérimental un résultat quantitatif obtenu suite à une expérience et mis en relation avec les informations permettant de décrire cette expérience. Ces résultats sont importants pour les experts en biologie, par exemple pour faire de la modélisation. Dans le domaine de la physiologie rénale, une base de données a été créée pour centraliser ces résultats d'expérimentation, mais l'alimentation de la base est manuelle et de ce fait longue. Nous proposons une solution pour extraire automatiquement des articles scientifiques les connaissances pertinentes pour la base de données, c'est-à-dire des résultats expérimentaux que nous représentons par une relation n-aire. La méthode procède en deux étapes : extraction automatique des connaissances à partir des documents, puis proposition de celles-ci à l'expert pour validation ou modification via une interface. Nous avons également proposé une méthode à base d'apprentissage automatique pour l'extraction et la classification de relations binaires en domaine de spécialité. Nous nous sommes intéressée aux caractéristiques et aux variétés d'expression des relations, et à la prise en compte de ces caractéristiques dans un système à base d'apprentissage. Nous avons étudié la prise en compte de la structure syntaxique de la phrase et la simplification de phrases dirigée par la tâche d'extraction de relations. Nous avons en particulier développé une méthode de simplification à base d'apprentissage automatique, qui utilise en cascade plusieurs classifieurs. / The amount of available scientific literature is constantly growing. For the experts of a domain to access this information easily, it must be extracted and structured. To obtain structured data, both the entities and the relations between them must be detected in the texts. Our research addresses the problem of extracting complex relations representing experimental results, and of detecting and classifying binary relations between biomedical entities. We are interested in the experimental results presented in scientific papers. An experimental result is a quantitative result obtained from an experiment and linked with the information describing that experiment. These results are important for biology experts, for example for modelling. In the domain of renal physiology, a database was created to centralize these experimental results, but it is populated manually, which takes a long time. We propose a solution to automatically extract from scientific papers the knowledge relevant for the database, that is, experimental results, which we represent as an n-ary relation. The method proceeds in two steps: automatic extraction from the documents, followed by the proposal of the extracted information to the expert for validation or modification via an interface. We also proposed a machine learning-based method for the extraction and classification of binary relations in specialized domains. We focused on the characteristics and the variety of expression of relations, and on how to take these characteristics into account in a learning-based system. We studied how to exploit the syntactic structure of the sentence and how to guide sentence simplification by the relation extraction task. In particular, we developed a simplification method based on machine learning, which uses several classifiers in cascade.
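As an illustration of binary relation classification between two entities (a sketch only, not the thesis system), the example below derives simple surface features from a sentence and trains a linear SVM. The feature names, example sentences, and relation labels are invented.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

def features(sentence, e1, e2):
    i1, i2 = sentence.index(e1), sentence.index(e2)
    (start, left), (end, _) = sorted([(i1, e1), (i2, e2)])
    between = sentence[start + len(left):end].strip().lower()
    return {"between": between,                     # one-hot encoded by DictVectorizer
            "n_words_between": len(between.split()),
            "e1_before_e2": i1 < i2}

train = [  # (sentence, entity 1, entity 2, relation label) -- toy examples
    ("Furosemide inhibits NKCC2 in the thick ascending limb", "Furosemide", "NKCC2", "inhibits"),
    ("Aldosterone increases ENaC activity in the collecting duct", "Aldosterone", "ENaC", "activates"),
    ("Vasopressin stimulates AQP2 trafficking", "Vasopressin", "AQP2", "activates"),
    ("Amiloride blocks ENaC channels", "Amiloride", "ENaC", "inhibits"),
]
X = [features(s, a, b) for s, a, b, _ in train]
y = [label for *_, label in train]

model = make_pipeline(DictVectorizer(), LinearSVC())
model.fit(X, y)
print(model.predict([features("Spironolactone blocks the mineralocorticoid receptor",
                              "Spironolactone", "mineralocorticoid receptor")]))
```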
115

Classificação automática de gênero musical baseada em entropia e fractais / Automatic music genre classification based on entropy and fractals

Goulart, Antonio José Homsi 16 February 2012 (has links)
A classificação automática de gênero musical tem como finalidade o conforto de ouvintes de músicas, auxiliando no gerenciamento das coleções de músicas digitais. Existem sistemas que se baseiam em cabeçalhos de metadados (tais como nome de artista, gênero cadastrado, etc.) e também os que extraem parâmetros dos arquivos de música para a realização da tarefa. Enquanto a maioria dos trabalhos do segundo tipo utiliza-se do conteúdo rítmico e tímbrico, este utiliza-se apenas de conceitos da teoria da informação e da geometria de fractais. Entropia, lacunaridade e dimensão fractal são os parâmetros que treinam os classificadores. Os testes foram realizados com duas coleções criadas para este trabalho e os resultados foram proeminentes. / The goal of automatic music genre classification is to give music listeners ease and comfort when managing digital music databases. Some systems are based on metadata tags (such as artist name, labeled genre, etc.), while others extract characteristics from the music files to complete the task. While the majority of works of the second type analyse rhythmic, timbral and pitch content, this one explores only information-theoretic and fractal geometry concepts. Entropy, fractal dimension and lacunarity are the parameters adopted to train the classifiers. Tests were carried out on two databases assembled by the author, and the results were prominent.
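A minimal sketch of the kind of features mentioned above (not the author's code): the Shannon entropy of the amplitude distribution and the Katz fractal dimension of a waveform are computed and fed to an SVM. The signals are synthetic stand-ins for audio frames, and lacunarity is omitted for brevity.

```python
import numpy as np
from sklearn.svm import SVC

def shannon_entropy(x, bins=64):
    hist, _ = np.histogram(x, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def katz_fd(x):
    n = len(x) - 1
    L = np.sqrt(1 + np.diff(x) ** 2).sum()                     # total path length
    d = np.max(np.sqrt(np.arange(1, len(x)) ** 2 + (x[1:] - x[0]) ** 2))
    return np.log10(n) / (np.log10(n) + np.log10(d / L))

rng = np.random.default_rng(1)
def make_signal(genre):                                        # toy stand-ins for two "genres"
    t = np.linspace(0, 1, 2048)
    if genre == 0:
        return np.sin(2 * np.pi * 220 * t) + 0.05 * rng.normal(size=t.size)
    return rng.normal(size=t.size)                             # noisier, more "complex" signal

X = np.array([[shannon_entropy(s), katz_fd(s)]
              for s in (make_signal(g) for g in [0, 1] * 20)])
y = np.array([0, 1] * 20)
clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict([[shannon_entropy(make_signal(0)), katz_fd(make_signal(0))]]))
```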
116

  • Approches parcimonieuses pour la sélection de variables et la classification : application à la spectroscopie IR de déchets de bois / Sparse approaches for variable selection and classification : application to infrared spectroscopy of wood wastes

Belmerhnia, Leïla 02 May 2017 (has links)
Le présent travail de thèse se propose de développer des techniques innovantes pour l'automatisation du tri de déchets de bois. L'idée est de combiner les techniques de spectrométrie proche infrarouge à des méthodes robustes de traitement de données pour la classification. Après avoir exposé le contexte du travail dans le premier chapitre, un état de l'art sur la classification de données spectrales est présenté dans le chapitre 2. Le troisième chapitre traite du problème de sélection de variables par des approches parcimonieuses. En particulier, nous proposons d'étendre quelques méthodes gloutonnes à l'approximation parcimonieuse simultanée. Les simulations réalisées pour l'approximation d'une matrice d'observations montrent l'intérêt des approches proposées. Dans le quatrième chapitre, nous développons des méthodes de sélection de variables basées sur la représentation parcimonieuse simultanée et régularisée, afin d'augmenter les performances du classifieur SVM pour la classification des spectres IR ainsi que des images hyperspectrales de déchets de bois. Enfin, nous présentons dans le dernier chapitre les améliorations apportées aux systèmes de tri de bois existants. Les résultats des tests réalisés avec le logiciel de traitement mis en place montrent qu'un gain considérable peut être atteint en termes de quantités de bois recyclées. / In this thesis, innovative techniques for sorting wood wastes are developed. The idea is to combine near-infrared spectrometry techniques with robust data processing methods for the classification task. After exposing the context of the work in the first chapter, a state of the art on spectral data classification is presented in Chapter 2. The third chapter deals with the variable selection problem using sparse approaches. In particular, we propose to extend some greedy methods to simultaneous sparse approximation. The simulations performed for the approximation of an observation matrix validate the advantages of the proposed approaches. In the fourth chapter, we develop variable selection methods based on simultaneous sparse and regularized representations, to increase the performance of the SVM classifier for the classification of NIR spectra and hyperspectral images of wood wastes. In the final chapter, we present the improvements made to the existing sorting systems. The results of the tests conducted with the developed processing software confirm that significant benefits can be achieved in terms of recycled wood quantities.
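For readers unfamiliar with greedy simultaneous sparse approximation, here is a hedged textbook-style sketch of Simultaneous Orthogonal Matching Pursuit, one common greedy scheme of the family the chapter generalizes (not the thesis implementation): a single common support is selected for several signals at once.

```python
import numpy as np

def somp(D, Y, n_atoms):
    """Select n_atoms columns of dictionary D (m x p) jointly explaining Y (m x k)."""
    residual, support = Y.copy(), []
    for _ in range(n_atoms):
        # Pick the atom most correlated with the residuals of *all* signals at once.
        correlations = np.linalg.norm(D.T @ residual, axis=1)
        correlations[support] = -np.inf               # do not reselect an atom
        support.append(int(np.argmax(correlations)))
        # Re-fit all signals on the current support and update the residual.
        coeffs, *_ = np.linalg.lstsq(D[:, support], Y, rcond=None)
        residual = Y - D[:, support] @ coeffs
    return support, coeffs

rng = np.random.default_rng(0)
m, p, k = 50, 200, 8                                   # samples, variables, signals
D = rng.normal(size=(m, p))
true_support = [3, 77, 150]
X_true = np.zeros((p, k))
X_true[true_support] = rng.normal(size=(3, k))
Y = D @ X_true + 0.01 * rng.normal(size=(m, k))
support, _ = somp(D, Y, n_atoms=3)
print(sorted(support), "vs true", true_support)        # recovered common support
```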
117

[en] FUZZY RULES EXTRACTION FROM SUPPORT VECTOR MACHINES (SVM) FOR MULTI-CLASS CLASSIFICATION / [pt] EXTRAÇÃO DE REGRAS FUZZY PARA MÁQUINAS DE VETOR SUPORTE (SVM) PARA CLASSIFICAÇÃO EM MÚLTIPLAS CLASSES

ADRIANA DA COSTA FERREIRA CHAVES 25 October 2006 (has links)
[pt] Este trabalho apresenta a proposta de um novo método para a extração de regras fuzzy de máquinas de vetor suporte (SVMs) treinadas para problemas de classificação. SVMs são sistemas de aprendizado baseados na teoria estatística do aprendizado e apresentam boa habilidade de generalização em conjuntos de dados reais. Estes sistemas obtiveram sucesso em vários tipos de problemas. Entretanto, as SVMs, da mesma forma que redes neurais (RN), geram um modelo caixa preta, isto é, um modelo que não explica o processo pelo qual sua saída é obtida. Alguns métodos propostos para reduzir ou eliminar essa limitação já foram desenvolvidos para o caso de classificação binária, embora sejam restritos à extração de regras simbólicas, isto é, que contêm funções ou intervalos nos antecedentes das regras. No entanto, a interpretabilidade de regras simbólicas ainda é reduzida. Deste modo, propõe-se, neste trabalho, uma técnica para a extração de regras fuzzy de SVMs treinadas, com o objetivo de aumentar a interpretabilidade do conhecimento gerado. Além disso, o modelo proposto foi desenvolvido para classificação em múltiplas classes, o que ainda não havia sido abordado até agora. As regras fuzzy obtidas são do tipo se x1 pertence ao conjunto fuzzy C1, x2 pertence ao conjunto fuzzy C2, ..., xn pertence ao conjunto fuzzy Cn, então o ponto x = (x1, ..., xn) é da classe A. Para testar o modelo foram realizados estudos de caso detalhados com quatro bancos de dados: Íris, Wine, Bupa Liver Disorders e Wisconsin Breast Cancer. A cobertura das regras resultantes da aplicação desse modelo nos testes realizados mostrou-se muito boa, atingindo 100% no caso da Íris. Após a geração das regras, foi feita uma avaliação das mesmas, usando dois critérios: a abrangência e a acurácia fuzzy. Além dos testes acima mencionados, foi comparado o desempenho dos métodos de classificação em múltiplas classes usados no trabalho. / [en] This work proposes a new method for fuzzy rule extraction from support vector machines (SVMs) trained to solve classification problems. SVMs are learning systems based on statistical learning theory and show good generalization ability on real data sets. These systems have been successfully applied to a wide variety of applications. However, SVMs, like neural networks, generate a black-box model, i.e., a model which does not explain the process by which its output is obtained. Some methods to reduce or eliminate this limitation have already been proposed for the binary classification case, although they are restricted to the extraction of symbolic rules, i.e., rules with functions or intervals in their antecedents. However, the interpretability of the generated symbolic rules is still limited. Hence, to increase the linguistic interpretability of the generated rules, we propose a new technique for extracting fuzzy rules from a trained SVM. Moreover, the proposed model was developed for classification in multiple classes, which had not been addressed until now. The fuzzy rules obtained are of the form if x1 belongs to the fuzzy set C1, x2 belongs to the fuzzy set C2, ..., xn belongs to the fuzzy set Cn, then the point x = (x1, x2, ..., xn) belongs to class A. To test the model, detailed case studies were carried out on four databases: Iris, Wine, Bupa Liver Disorders and Wisconsin Breast Cancer. The coverage of the rules resulting from the application of this method was very good, reaching 100% in the Iris case. After the rule generation, the rules were evaluated using two criteria: coverage and fuzzy accuracy. Besides the tests above, the performance of the multi-class classification methods used in this work was also compared.
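To give a feel for rules of the form "if x1 is C1 and ... then class A", here is a heavily simplified illustration (not the thesis algorithm): a multi-class SVM is trained on Iris, and one triangular fuzzy set per feature and class is built from that class's support vectors; the rule extraction strategy shown is an assumption made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target
svm = SVC(kernel="rbf", decision_function_shape="ovr").fit(X, y)

def triangular(a, b, c):
    """Membership function of a triangular fuzzy set with support [a, c] and peak b."""
    return lambda v: max(0.0, min((v - a) / (b - a + 1e-9), (c - v) / (c - b + 1e-9)))

# One fuzzy set per (class, feature), estimated from the class's support vectors.
rules = {}
for cls in np.unique(y):
    sv = svm.support_vectors_[y[svm.support_] == cls]
    rules[cls] = [triangular(f.min(), f.mean(), f.max()) for f in sv.T]

def fuzzy_classify(x):
    # Rule activation = minimum membership over the features (a standard fuzzy "and").
    scores = {cls: min(mu(v) for mu, v in zip(sets, x)) for cls, sets in rules.items()}
    return max(scores, key=scores.get)

sample = X[100]
print("fuzzy rule class:", fuzzy_classify(sample), "| SVM class:", svm.predict([sample])[0])
```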
118

Commande prédictive hybride et apprentissage pour la synthèse de contrôleurs logiques dans un bâtiment. / Hybrid Model Predictive Control and Machine Learning for development of logical controllers in buildings

Le, Duc Minh Khang 09 February 2016 (has links)
Une utilisation efficace et coordonnée des systèmes installés dans le bâtiment doit permettre d’améliorer le confort des occupants tout en consommant moins d’énergie. Ces objectifs à optimiser sont pourtant antagonistes. Le problème résultant peut alors être vu comme un problème d’optimisation multicritères. Par ailleurs, pour répondre aux enjeux industriels, il devra être résolu non seulement dans une optique d’implémentation simple et peu coûteuse, avec notamment un nombre réduit de capteurs, mais aussi dans un souci de portabilité pour que le contrôleur résultant puisse être implanté dans des bâtiments d’orientation différente et situés dans des lieux géographiques variés. L’approche choisie est de type commande prédictive (MPC, Model Predictive Control), dont l’efficacité pour le contrôle du bâtiment a déjà été illustrée dans de nombreux travaux ; elle requiert cependant des efforts de calcul trop importants. Cette thèse propose une méthodologie pour la synthèse des contrôleurs, qui doivent apporter une performance satisfaisante en imitant les comportements du MPC, tout en répondant à des contraintes industrielles. Elle est divisée en deux grandes étapes : 1. La première étape consiste à développer un contrôleur MPC. De nombreux défis doivent être relevés tels que la modélisation, le réglage des paramètres et la résolution du problème d’optimisation. 2. La deuxième étape applique différents algorithmes d’apprentissage automatique (l’arbre de décision, AdaBoost et SVM) sur une base de données obtenue à partir de simulations utilisant le contrôleur prédictif développé. Les grands points soulevés sont la construction de la base de données, le choix de l’algorithme d’apprentissage et le développement du contrôleur logique. La méthodologie est appliquée dans un premier temps à un cas simple pour piloter un volet, puis validée dans un cas plus complexe : le contrôle coordonné du volet, de l’ouvrant et du système de ventilation. / An efficient and coordinated control of the systems installed in a building should improve occupant comfort while consuming less energy. However, these objectives are antagonistic. The resulting problem can then be formulated as a multi-criteria optimization problem. Moreover, it should be solved not only with a simple and cheap implementation in mind, in particular with a reduced number of sensors, but also for the sake of portability, so that the resulting controller can be installed in buildings with different orientations and different geographic locations. The MPC (Model Predictive Control) approach has been shown in the state of the art to be well suited for building control, but it requires a large computing effort. This thesis presents a methodology to develop logical controllers for equipment in buildings. It helps to obtain satisfactory performance by mimicking the MPC behaviour while dealing with industrial constraints. Two key steps are required: 1. In the first step, an optimal controller is developed with a hybrid MPC technique. There are challenges in modeling, parameter tuning and solving the optimization problem. 2. In the second step, different machine learning algorithms (decision tree, AdaBoost, SVM) are tested on a database obtained from simulations with the MPC controller. The main points are the construction of the database, the choice of the learning algorithm and the development of the logical controller. First, our methodology is tested on a simple case study to control a blind. Then, it is validated with a more complex case: the development of a coordinated controller for a blind, natural ventilation and mechanical ventilation.
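A hedged sketch of the second step (learning a logical controller by imitating a predictive one), not the thesis model: a decision tree is trained on logged (state, action) pairs. The "MPC" here is a hand-made stand-in rule, used only to generate example data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def mpc_stand_in(indoor_temp, solar_irradiance, occupancy):
    """Toy decision standing in for the MPC output: 1 = close the blind, 0 = open it."""
    return int(solar_irradiance > 500 and indoor_temp > 24 and occupancy == 1)

# Simulated "logs": states sampled over many time steps, actions given by the controller.
states = np.column_stack([
    rng.uniform(18, 30, 5000),        # indoor temperature [degC]
    rng.uniform(0, 1000, 5000),       # solar irradiance [W/m2]
    rng.integers(0, 2, 5000),         # occupancy flag
])
actions = np.array([mpc_stand_in(*s) for s in states])

X_train, X_test, y_train, y_test = train_test_split(states, actions, random_state=0)
tree = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)   # small, readable tree
print("agreement with the reference controller:", tree.score(X_test, y_test))
```

A shallow tree like this can be read directly as nested if-then rules, which is one way a learned model can be turned into a simple logical controller.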
119

Monitoramento da cobertura do solo no entorno de hidrelétricas utilizando o classificador SVM (Support Vector Machines). / Land cover monitoring in hydroelectric domain area using Support Vector Machines (SVM) classifier.

Albuquerque, Rafael Walter de 07 December 2011 (has links)
A classificação de imagens de satélite é muito utilizada para elaborar mapas de cobertura do solo. O objetivo principal deste trabalho consistiu no mapeamento automático da cobertura do solo no entorno da Usina de Lajeado (TO) utilizando-se o classificador SVM. Buscou-se avaliar a dimensão de áreas antropizadas presentes no entorno da represa e a acurácia da classificação gerada pelo algoritmo, que foi comparada com a acurácia da classificação obtida pelo tradicional classificador MAXVER. Esta dissertação apresentou sugestões de calibração do algoritmo SVM para a otimização do seu resultado. Verificou-se uma alta acurácia na classificação SVM, que mostrou o entorno da represa hidrelétrica em uma situação ambientalmente favorável. Os resultados obtidos pela classificação SVM foram similares aos obtidos pelo MAXVER, porém este último contextualizou espacialmente as classes de cobertura do solo com uma acurácia considerada um pouco menor. Apesar do bom estado de preservação ambiental apresentado, a represa deve ter seu entorno devidamente monitorado, pois foi diagnosticada uma grande quantidade de incêndios gerados pela população local, sendo que as ferramentas discutidas nesta dissertação auxiliam esta atividade de monitoramento. / Satellite image classification is widely used for producing land cover maps. The main objective of this study was the automatic land cover mapping of the area surrounding the Lajeado dam, in Tocantins state, using the SVM classifier. The aim was to evaluate the extent of anthropized areas around the dam and to verify the classification accuracy of the algorithm, which was compared to that of the standard ML (Maximum Likelihood) classifier. This work presents calibration suggestions for the SVM algorithm to optimize its results. The SVM classification showed high accuracy, suggesting an environmentally favourable situation around the hydroelectric dam. The classification results of SVM and ML were quite similar, although ML spatially contextualized the land cover classes with slightly lower accuracy. Despite the good environmental situation of the study area, the surroundings of the dam should be properly monitored, because a significant number of fires caused by local communities was detected; the tools discussed in this work support this monitoring activity.
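As a hedged sketch of the comparison described above (not the dissertation's workflow): pixels are classified from their spectral band values with an SVM and with a Gaussian maximum-likelihood classifier (approximated here by QuadraticDiscriminantAnalysis, which fits one Gaussian per class). The band values and class means below are synthetic stand-ins for real imagery.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic training pixels: 4 spectral bands, 3 classes (e.g. water, forest, bare soil).
means = {0: [0.05, 0.04, 0.03, 0.02], 1: [0.03, 0.06, 0.04, 0.35], 2: [0.12, 0.15, 0.20, 0.30]}
X = np.vstack([rng.normal(means[c], 0.02, size=(400, 4)) for c in means])
y = np.repeat(list(means), 400)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
for name, clf in [("SVM (RBF)", SVC(kernel="rbf", C=10, gamma="scale")),
                  ("Gaussian ML", QuadraticDiscriminantAnalysis())]:
    clf.fit(X_train, y_train)
    print(name, "accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```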
120

Uma comparação da aplicação de métodos computacionais de classificação de dados aplicados ao consumo de cinema no Brasil / A comparison of the application of data classification computational methods to the consumption of film at theaters in Brazil

Nieuwenhoff, Nathalia 13 April 2017 (has links)
As técnicas computacionais de aprendizagem de máquina para classificação ou categorização de dados estão sendo cada vez mais utilizadas no contexto de extração de informações ou padrões em bases de dados volumosas em variadas áreas de aplicação. Em paralelo, a aplicação destes métodos computacionais para identificação de padrões, bem como a classificação de dados relacionados ao consumo dos bens de informação, é considerada uma tarefa complexa, visto que tais padrões de decisão do consumo estão relacionados com as preferências dos indivíduos e dependem de uma composição de características individuais e de variáveis culturais, econômicas e sociais segregadas e agrupadas, além de ser um tópico pouco explorado no mercado brasileiro. Neste contexto, este trabalho realizou um estudo experimental a partir da aplicação do processo de Descoberta do Conhecimento (KDD), o que inclui as etapas de seleção e Mineração de Dados, para um problema de classificação binária: indivíduos brasileiros que consomem e que não consomem um bem de informação, filmes em salas de cinema, a partir dos dados obtidos na Pesquisa de Orçamentos Familiares (POF) 2008-2009, do Instituto Brasileiro de Geografia e Estatística (IBGE). O estudo experimental resultou em uma análise comparativa da aplicação de duas técnicas de aprendizagem de máquina para classificação de dados, baseadas em aprendizado supervisionado, sendo estas Naïve Bayes (NB) e Support Vector Machine (SVM). Inicialmente, a revisão sistemática realizada com o objetivo de identificar estudos relacionados à aplicação de técnicas computacionais de aprendizado de máquina para classificação e identificação de padrões de consumo indica que a utilização destas técnicas neste contexto não é um tópico de pesquisa maduro e desenvolvido, visto que não foi abordado em nenhum dos trabalhos estudados. Os resultados obtidos a partir da análise comparativa realizada entre os algoritmos sugerem que a escolha dos algoritmos de aprendizagem de máquina para classificação de dados está diretamente relacionada a fatores como: (i) importância das classes para o problema a ser estudado; (ii) balanceamento entre as classes; (iii) universo de atributos a serem considerados em relação à quantidade e ao grau de importância destes para o classificador. Adicionalmente, os atributos selecionados pelo algoritmo de seleção de variáveis Information Gain sugerem que a decisão de consumo de cultura, mais especificamente do bem de informação filmes em cinema, está fortemente relacionada a aspectos dos indivíduos ligados à renda e ao nível de educação, bem como às suas preferências por bens culturais. / Machine learning techniques for data classification or categorization are increasingly being used to extract information or patterns from large databases in various application areas. At the same time, applying these computational methods to identify patterns and to classify data related to the consumption of information goods is considered a complex task, since such consumption decision patterns are related to the preferences of individuals and depend on a combination of individual characteristics and cultural, economic and social variables, segregated and grouped, besides being a topic little explored in the Brazilian market. In this context, this work carried out an experimental study applying the Knowledge Discovery (KDD) process, which includes the data selection and data mining steps, to a binary classification problem: Brazilian individuals who do and who do not consume an information good, namely films at movie theaters, based on the microdata of the Brazilian Household Budget Survey (POF) 2008-2009, conducted by the Brazilian Institute of Geography and Statistics (IBGE). The experimental study resulted in a comparative analysis of two machine learning techniques for data classification based on supervised learning, namely Naïve Bayes (NB) and Support Vector Machine (SVM). Initially, a systematic review carried out to identify studies on the application of machine learning techniques to the classification and identification of consumption patterns indicated that the use of these techniques in this context is not a mature and developed research topic, since it was not addressed in any of the papers analyzed. The results obtained from the comparative analysis of the algorithms suggest that the choice of machine learning algorithms for data classification is directly related to factors such as: (i) the importance of the classes for the problem under study; (ii) the balance between the classes; (iii) the universe of attributes to be considered, in terms of their number and their degree of importance for the classifier. In addition, the attributes selected by the Information Gain feature selection algorithm suggest that the decision to consume culture, more specifically the information good films at movie theaters, is strongly related to aspects of the individuals such as income and educational level, as well as to their preferences for cultural goods.
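An illustrative sketch of the kind of pipeline described above (not the thesis pipeline, and not POF/IBGE microdata): Information Gain-style feature selection via mutual information, followed by a comparison of Naïve Bayes and SVM. The survey-like attributes below are synthetic stand-ins.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000
income = rng.lognormal(mean=8, sigma=0.6, size=n)
education = rng.integers(0, 5, size=n)              # 0 = none ... 4 = higher education
age = rng.integers(18, 80, size=n)
noise = rng.normal(size=n)                          # an irrelevant attribute
# Synthetic rule: consumption probability grows with income and education.
p = 1 / (1 + np.exp(-(0.0003 * income + 0.8 * education - 2.5)))
y = rng.binomial(1, p)
X = np.column_stack([income, education, age, noise])

selector = SelectKBest(mutual_info_classif, k=2).fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))

for name, clf in [("Naive Bayes", GaussianNB()), ("SVM", SVC(kernel="rbf"))]:
    model = make_pipeline(StandardScaler(), SelectKBest(mutual_info_classif, k=2), clf)
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```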
