Global ETD Search

1	Construção de redes baseadas em vizinhança para o aprendizado semissupervisionado / Graph construction based on neighborhood for semisupervised Berton, Lilian 25 January 2016 (has links) Com o aumento da capacidade de armazenamento, as bases de dados são cada vez maiores e, em muitas situações, apenas um pequeno subconjunto de itens de dados pode ser rotulado. Isto acontece devido ao processo de rotulagem ser frequentemente caro, demorado e necessitar do envolvimento de especialistas humanos. Com isso, diversos algoritmos semissupervisionados foram propostos, mostrando que é possível obter bons resultados empregando conhecimento prévio, relativo à pequena fração de dados rotulados. Dentre esses algoritmos, os que têm ganhado bastante destaque na área têm sido aqueles baseados em redes. Tal interesse, justifica-se pelas vantagens oferecidas pela representação via redes, tais como, a possibilidade de capturar a estrutura topológica dos dados, representar estruturas hierárquicas, bem como modelar manifolds no espaço multi-dimensional. No entanto, existe uma grande quantidade de dados representados em tabelas atributo-valor, nos quais não se poderia aplicar os algoritmos baseados em redes sem antes construir uma rede a partir desses dados. Como a geração das redes, assim como sua relação com o desempenho dos algoritmos têm sido pouco estudadas, esta tese investigou esses aspectos e propôs novos métodos para construção de redes, considerando características ainda não exploradas na literatura. 
Foram propostos três métodos para construção de redes com diferentes topologias: 1) S-kNN (Sequential k Nearest Neighbors), que gera redes regulares; 2) GBILI (Graph Based on the Informativeness of Labeled Instances) e RGCLI (Robust Graph that Considers Labeled Instances), que exploram os rótulos disponíveis gerando redes com distribuição de grau lei de potência; 3) GBLP (Graph Based on Link Prediction), que se baseia em medidas de predição de links gerando redes com propriedades mundo-pequeno. As estratégias de construção de redes propostas foram analisadas por meio de medidas de teoria dos grafos e redes complexas e validadas por meio da classificação semissupervisionada. Os métodos foram aplicados em benchmarks da área e também na classificação de gêneros musicais e segmentação de imagens. Os resultados mostram que a topologia da rede influencia diretamente os algoritmos de classificação e as estratégias propostas alcançam boa acurácia. / With the increase in storage capacity, databases are getting larger and, in many situations, only a small subset of data items can be labeled. This happens because the labeling process is often expensive and time-consuming and requires the involvement of human experts. Hence, several semi-supervised algorithms have been proposed, showing that it is possible to achieve good results by using prior knowledge about the small fraction of labeled data. Among these algorithms, those based on graphs have gained prominence in the area. Such interest is justified by the benefits of representing data as graphs, such as the ability to capture the topological structure of the data, represent hierarchical structures, and model manifolds in high-dimensional spaces. Nevertheless, most available data is represented by attribute-value tables, which makes it necessary to study graph construction techniques that convert these tabular data into graphs before such algorithms can be applied. 
As the generation of the weight matrix and the sparse graph, and their relation to the performance of the algorithms, have been little studied, this thesis investigated these aspects and proposed new methods for graph construction with characteristics not yet explored in the literature. We proposed three methods for graph construction with different topologies: 1) S-kNN (Sequential k Nearest Neighbors), which generates regular graphs; 2) GBILI (Graph Based on the Informativeness of Labeled Instances) and RGCLI (Robust Graph that Considers Labeled Instances), which exploit the available labels to generate graphs with a power-law degree distribution; 3) GBLP (Graph Based on Link Prediction), which is based on link prediction measures and generates small-world graphs. The proposed strategies were analyzed using measures from graph theory and complex networks and validated in semi-supervised classification tasks. The methods were applied to benchmarks of the area and also to music genre classification and image segmentation. The results show that the topology of the graph directly affects the classification algorithms and that the proposed strategies achieve good accuracy. Aprendizado semissupervisionado; Construção de redes; Graph construction; Graph-based methods for classification; Neighborhood graphs; Redes baseadas em vizinhança; Semi-supervised learning
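To make the idea of converting an attribute-value table into a graph concrete, here is a minimal sketch of a plain symmetric k-nearest-neighbor graph. It is not the S-kNN, GBILI, RGCLI or GBLP method from the thesis, whose details the abstract does not give; the function name `knn_graph` and the sample points are assumptions made for illustration only.

```python
import math

def knn_graph(points, k):
    """Return a symmetric kNN graph as {node index: set of neighbour indices}.

    Illustrative baseline only; not one of the methods proposed in the thesis.
    """
    n = len(points)
    graph = {i: set() for i in range(n)}
    for i in range(n):
        # Sort the other points by Euclidean distance to point i.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            # Add the edge in both directions so the graph is undirected.
            graph[i].add(j)
            graph[j].add(i)
    return graph

# Hypothetical attribute-value rows: two well-separated groups of points.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
graph = knn_graph(points, k=1)
```

With `k=1` the two groups end up as separate connected components, which is the kind of topological structure that graph-based semi-supervised classifiers rely on when propagating labels.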
2	Método para processamento e análise computacinal de imagens histopatológicas visando apoiar o diagnóstico de câncer de colo de útero / A Method for Processing and Computational Analysis of histopathological images to support the diagnosis of Cervical Cancer Miranda, Gisele Helena Barboni 24 November 2011 (has links) A histopatologia é considerada um dos recursos diagnósticos mais importantes na prática médica e caracteriza-se pelo estudo das alterações estruturais e morfológicas das células e dos tecidos causadas por doenças. Atualmente, o principal método utilizado no diagnóstico histopatológico de imagens microscópicas, obtidas por meio de amostras em exames convencionais, é a avaliação visual do patologista, a qual se baseia na experiência do mesmo. O uso de técnicas de processamento computacional de imagens possibilita a identificação de elementos estruturais e a determinação de características inerentes, subsidiando o estudo da organização estrutural das células e de suas variações patológicas. A utilização de métodos computacionais no auxílio ao diagnóstico visa diminuir a subjetividade do processo de avaliação e classificação realizado pelo médico. Diferentes características dos tecidos podem ser mapeadas por meio de métricas específicas que poderão ser utilizadas em sistemas de reconhecimento de padrões. Dentro desta perspectiva, o objetivo geral deste trabalho inclui a proposta, a implementação e a avaliação de um método para a identificação e a análise de estruturas histológicas, a ser utilizado para a análise de lesões neoplásicas do colo do útero (NICs) a partir de amostras histopatológicas. Este trabalho foi desenvolvido em colaboração com uma equipe de patologistas, especialistas do domínio. As imagens microscópicas digitalizadas foram adquiridas a partir de lâminas previamente fixadas, contendo amostras de biópsias. Para segmentação dos núcleos celulares, foi implementado um pipeline de operadores morfológicos. 
Métodos de segmentação baseados em cor também foram testados e comparados à abordagem morfológica. Foi proposta e implementada uma abordagem baseada em camadas para representação do tecido, adotando-se a Triangulação de Delaunay (TD) como modelo de grafo de vizinhança. A TD apresenta algumas propriedades particulares que permitem a extração de métricas específicas. Foram utilizados algoritmos de agrupamento e morfologia de grafos, adotando-se critérios de semelhança e relações de adjacência entre os triângulos da rede, a fim de se obter a fronteira entre as camadas histológicas do tecido epitelial de forma automática. As seguintes métricas foram extraídas dos agrupamentos resultantes: grau médio, entropia e taxa de ocupação dos triângulos da rede. Finalmente, foi projetado um classificador estatístico levando-se em consideração os diferentes agrupamentos que poderiam ser obtidos a partir das imagens de treinamento. Valores de acurácia, sensitividade e especificidade foram utilizadas para avaliação dos resultados obtidos. Foi implementada validação cruzada em todos os experimentos realizados e foi utilizado um total de 116 imagens. Primeiro, foi avaliado a acurácia da metodologia proposta na determinação correta da presença de anomalia no tecido, para isto, todas as imagens que apresentavam NICs foram agrupadas em uma mesma classe. A maior taxa de acurácia obtida neste experimento foi de 88%. Em uma segunda etapa, foram realizadas avaliações entre as seguintes classes: Normal e NIC-I; NIC-I e NIC-II, e, NIC-II e NIC-III, obtendo-se taxas de acurácia máximas de 73%, 77% e 86%, respectivamente. Além disso, foi verificada também, a acurácia na discriminação entre os três tipos de NICs e regiões normais, obtendo-se acurácia de 64%. As taxas de ocupação relativas aos agrupamentos representativos das camadas basais e superficiais, foram os atributos que levaram às maiores taxas de acurácia. 
Os resultados obtidos permitem verificar a adequação do método proposto na representação e análise do processo de evolução das NICs no tecido epitelial do colo uterino. / Histopathology is considered one of the most important diagnostic tools in medical practice and is characterized by the study of the structural and morphological changes that diseases cause in cells and tissues. Currently, the main method used in the histopathological diagnosis of microscopic images obtained from biopsy samples is the visual assessment of the pathologist, which is usually based on the pathologist's experience. The use of computational techniques in the processing of these images allows the identification of structural elements and the determination of inherent characteristics, supporting the study of the structural organization of tissues and their pathological changes. The use of computational methods to support diagnosis also aims to reduce the subjectivity of the evaluation made by the physician. In addition, different tissue characteristics can be mapped through specific metrics that can be used in pattern recognition systems. Within this perspective, the overall objective of this work includes the proposal, implementation and evaluation of a methodology for the identification and analysis of histological structures. This methodology includes the specification of a method for the analysis of cervical intraepithelial neoplasias (CINs) from histopathological samples. This work was developed in collaboration with a team of pathologists. Microscopic images were acquired from previously stained slides containing biopsy samples. For the segmentation of cell nuclei, a pipeline of morphological operators was implemented. Segmentation techniques based on color were also tested and compared to the morphological approach. 
For the representation of the tissue architecture, an approach based on the tissue layers was proposed and implemented, adopting the Delaunay Triangulation (DT) as the neighborhood graph. The DT has some special properties that allow the extraction of specific metrics. Clustering algorithms and graph morphology were used to automatically obtain the boundary between the histological layers of the epithelial tissue. For this purpose, similarity criteria and adjacency relations between the triangles of the network were explored. The following metrics were extracted from the resulting clusters: mean degree, entropy and the occupation rate of the clusters. Finally, a statistical classifier was designed, taking into account the different combinations of clusters that could be obtained from the training process. Accuracy, sensitivity and specificity were used to evaluate the results. All experiments used 5-fold cross-validation, and a total of 116 images were used. First, the accuracy in determining the presence of abnormalities in the tissue was evaluated. For this, all images presenting CINs were grouped in the same class. The highest accuracy obtained in this evaluation was 88%. In a second step, the discrimination between the following pairs of classes, which represent the histological grading of the CINs, was analyzed: Normal/CIN 1, CIN 1/CIN 2, and CIN 2/CIN 3. The highest accuracy rates obtained were 73%, 77% and 86%, respectively. In addition, the accuracy in discriminating between the four classes analyzed in this work (the three types of CINs and the normal region) was also calculated, yielding a rate of 64%. The occupation rates for the basal and superficial layers were the attributes that led to the highest accuracy rates. 
The results obtained show the adequacy of the proposed method for the representation and classification of the evolution of CINs in the cervical epithelial tissue. Cervical Intraepithelial Neoplasia (CIN); Computer-Aided Diagnosis; Diagnóstico Auxiliado por Computador; Grafos de Vizinhança; Medical Image Processing; Neighborhood Graphs; Neoplasia Intraepitelial Cervical (NIC); Processamento de Imagens Médicas
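The abstract above reports accuracy, sensitivity and specificity. As a reminder of how these three measures relate to one another, here is a minimal sketch for a binary task such as normal versus CIN. The function name `binary_metrics` and the label vectors are hypothetical and are not data from the thesis; the sketch also assumes both classes occur in `y_true`, otherwise a division by zero would occur.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, sensitivity and specificity for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1  # abnormal tissue correctly flagged
        elif truth == positive:
            fn += 1  # abnormal tissue missed
        elif pred == positive:
            fp += 1  # normal tissue wrongly flagged
        else:
            tn += 1  # normal tissue correctly passed
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

# Hypothetical labels: 1 = CIN present, 0 = normal tissue.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

In a cross-validation setting such as the 5-fold scheme mentioned above, these values would typically be computed on each held-out fold and then averaged.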
5	A contribution to topological learning and its application in Social Networks / Une contribution à l'apprentissage topologique et son application dans les réseaux sociaux Ezzeddine, Diala 01 October 2014 (has links) L'Apprentissage Supervisé est un domaine populaire de l'Apprentissage Automatique en progrès constant depuis plusieurs années. De nombreuses techniques ont été développées pour résoudre le problème de classification, mais, dans la plupart des cas, ces méthodes se basent sur la présence et le nombre de points d'une classe donnée dans des zones de l'espace que doit définir le classifieur. Á cause de cela la construction de ce classifieur est dépendante de la densité du nuage de points des données de départ. Dans cette thèse, nous montrons qu'utiliser la topologie des données peut être une bonne alternative lors de la construction des classifieurs. Pour cela, nous proposons d'utiliser les graphes topologiques comme le Graphe de Gabriel (GG) ou le Graphes des Voisins Relatifs (RNG). Ces dernier représentent la topologie de données car ils sont basées sur la notion de voisinages et ne sont pas dépendant de la densité. Pour appliquer ce concept, nous créons une nouvelle méthode appelée Classification aléatoire par Voisinages (Random Neighborhood Classification (RNC)). Cette méthode utilise des graphes topologiques pour construire des classifieurs. De plus, comme une Méthodes Ensemble (EM), elle utilise plusieurs classifieurs pour extraire toutes les informations pertinentes des données. Les EM sont bien connues dans l'Apprentissage Automatique. Elles génèrent de nombreux classifieurs à partir des données, puis agrègent ces classifieurs en un seul. Le classifieur global obtenu est reconnu pour être très eficace, ce qui a été montré dans de nombreuses études. Cela est possible car il s'appuie sur des informations obtenues auprès de chaque classifieur qui le compose. 
Nous avons comparé RNC à d'autres méthodes de classification supervisées connues sur des données issues du référentiel UCI Irvine. Nous constatons que RNC fonctionne bien par rapport aux meilleurs d'entre elles, telles que les Forêts Aléatoires (RF) et Support Vector Machines (SVM). La plupart du temps, RNC se classe parmi les trois premières méthodes en terme d'eficacité. Ce résultat nous a encouragé à étudier RNC sur des données réelles comme les tweets. Twitter est un réseau social de micro-blogging. Il est particulièrement utile pour étudier l'opinion à propos de l'actualité et sur tout sujet, en particulier la politique. Cependant, l'extraction de l'opinion politique depuis Twitter pose des défis particuliers. En effet, la taille des messages, le niveau de langage utilisé et ambiguïté des messages rend très diffcile d'utiliser les outils classiques d'analyse de texte basés sur des calculs de fréquence de mots ou des analyses en profondeur de phrases. C'est cela qui a motivé cette étude. Nous proposons d'étudier les couples auteur/sujet pour classer le tweet en fonction de l'opinion de son auteur à propos d'un politicien (un sujet du tweet). Nous proposons une procédure qui porte sur l'identification de ces opinions. Nous pensons que les tweets expriment rarement une opinion objective sur telle ou telle action d'un homme politique mais plus souvent une conviction profonde de son auteur à propos d'un mouvement politique. Détecter l'opinion de quelques auteurs nous permet ensuite d'utiliser la similitude dans les termes employés par les autres pour retrouver ces convictions à plus grande échelle. Cette procédure à 2 étapes, tout d'abord identifier l'opinion de quelques couples de manière semi-automatique afin de constituer un référentiel, puis ensuite d'utiliser l'ensemble des tweets d'un couple (tous les tweets d'un auteur mentionnant un politicien) pour les comparer avec ceux du référentiel. 
L'Apprentissage Topologique semble être un domaine très intéressant à étudier, en particulier pour résoudre les problèmes de classification... / Supervised Learning is a popular field of Machine Learning that has made steady progress in recent years. In particular, many methods and procedures have been developed to solve the classification problem. Most classical methods in Supervised Learning rely on density estimation of the data to construct their classifiers. In this dissertation, we show that the topology of the data can be a good alternative for constructing classifiers. We propose using topological graphs such as Gabriel Graphs (GG) and Relative Neighborhood Graphs (RNG), which capture the topology of the data through its neighborhood structure. To apply this concept, we create a new method called Random Neighborhood Classification (RNC). In this method, we use topological graphs to construct classifiers and then apply Ensemble Methods (EM) to extract all relevant information from the data. EM, which are well known in Machine Learning, generate many classifiers from the data and then aggregate them into one. Aggregate classifiers have been shown to be very efficient in many studies, because they leverage relevant information from each generated classifier. We first compare RNC to other known classification methods using data from the UCI Irvine repository. We find that RNC works very well compared to very efficient methods such as Random Forests and Support Vector Machines. Most of the time, it ranks among the top three methods in efficiency. This result encouraged us to study the efficiency of RNC on real data such as tweets. Twitter, a microblogging social network, is especially useful for mining opinion on current affairs and on topics that span the range of human interest, including politics. Mining political opinion from Twitter poses particular challenges, such as the versatility of authors when they express their political views, which motivated this study. 
We define a new attribute, called a couple, that is very helpful in studying the opinion expressed in tweets. A couple is an author paired with a politician that the author talks about. We propose a new procedure that focuses on identifying the opinion of tweets using couples. We think that focusing on a couple's opinion, as expressed across several tweets, can overcome the problems of analysing each tweet individually. This approach helps to cope with versatility, language ambiguity and many other artifacts that are easy for a human being to understand but not for a machine. We use classical Machine Learning techniques such as KNN and Random Forests (RF), as well as our method RNC. We proceed in two steps. First, we build a reference set of classified couples using Naive Bayes. We also apply an alternative to Naive Bayes, a sampling plan procedure, to compare and evaluate the Naive Bayes results. Second, we evaluate the performance of this approach using proximity measures so that RNC, RF and KNN can be applied. The experiments are based on real tweets from the French presidential election in 2012. The results show that this approach works well and that RNC performs very well in classifying opinion in tweets. Topological Learning seems to be a very interesting field to study, in particular for addressing the classification problem. Many concepts for extracting information from topological graphs still need to be analysed, such as those described by Aupetit, M. in his work (2005). Our work shows that Topological Learning can be an effective way to address the classification problem. Apprentissage Automatique; Classification; Graphes de voisinages; Graphe de Gabriel; Graphe des voisins relatifs; Méthodes d'ensemble; Naive Bayes; Décision Templates; Twitter; Opinion Mining; Machine Learning; Classification; Neighborhood Graphs; Gabriel Graph; Relative Neighborhood Graph; Ensemble Methods; Naive Bayes; Decision Templates; Twitter; Opinion Mining
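The Gabriel Graph mentioned in the abstract above has a simple geometric definition: two points are connected when no other point lies strictly inside the disk whose diameter is the segment joining them. The brute-force sketch below illustrates only that standard definition, not the RNC method itself; the function name `gabriel_graph` and the sample points are assumptions made for illustration, and points lying exactly on the disk boundary are treated here as not blocking an edge.

```python
import math
from itertools import combinations

def gabriel_graph(points):
    """Return the Gabriel graph edges as a set of index pairs (i, j), i < j."""
    edges = set()
    for i, j in combinations(range(len(points)), 2):
        p, q = points[i], points[j]
        # Disk whose diameter is the segment pq.
        center = tuple((a + b) / 2 for a, b in zip(p, q))
        radius = math.dist(p, q) / 2
        # Keep the edge only if no third point falls strictly inside the disk.
        if all(math.dist(points[k], center) >= radius
               for k in range(len(points)) if k not in (i, j)):
            edges.add((i, j))
    return edges

# Hypothetical points: point 2 sits between points 0 and 1 and blocks their edge.
points = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.1)]
edges = gabriel_graph(points)
```

Because the test depends only on which points fall between a pair, not on how many points are nearby, the resulting graph reflects neighborhood structure rather than local density. The cubic running time of this sketch is acceptable only for small samples.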
6	Contribution à la sélection de variables par les machines à vecteurs support pour la discrimination multi-classes / Contribution to Variables Selection by Support Vector Machines for Multiclass Discrimination Aazi, Fatima Zahra 20 December 2016 (has links) Les avancées technologiques ont permis le stockage de grandes masses de données en termes de taille (nombre d’observations) et de dimensions (nombre de variables). Ces données nécessitent de nouvelles méthodes, notamment en modélisation prédictive (data science ou science des données), de traitement statistique adaptées à leurs caractéristiques. Dans le cadre de cette thèse, nous nous intéressons plus particulièrement aux données dont le nombre de variables est élevé comparé au nombre d’observations. Pour ces données, une réduction du nombre de variables initiales, donc de dimensions, par la sélection d’un sous-ensemble optimal, s’avère nécessaire, voire indispensable. Elle permet de réduire la complexité, de comprendre la structure des données et d’améliorer l’interprétation des résultats et les performances du modèle de prédiction ou de classement en éliminant les variables bruit et/ou redondantes. Nous nous intéressons plus précisément à la sélection de variables dans le cadre de l’apprentissage supervisé et plus spécifiquement de la discrimination à catégories multiples dite multi-classes. L’objectif est de proposer de nouvelles méthodes de sélection de variables pour les modèles de discrimination multi-classes appelés Machines à Vecteurs Support Multiclasses (MSVM). Deux approches sont proposées dans ce travail. La première, présentée dans un contexte classique, consiste à sélectionner le sous-ensemble optimal de variables en utilisant le critère de "la borne rayon marge" majorante du risque de généralisation des MSVM.
Quant à la deuxième approche, elle s’inscrit dans un contexte topologique et utilise la notion de graphes de voisinage et le critère de degré d’équivalence topologique en discrimination pour identifier les variables pertinentes qui constituent le sous-ensemble optimal du modèle MSVM. L’évaluation de ces deux approches sur des données simulées et d’autres réelles montre qu’elles permettent de sélectionner, à partir d’un grand nombre de variables initiales, un nombre réduit de variables explicatives avec des performances similaires ou encore meilleures que celles obtenues par des méthodes concurrentes. / Technological progress has enabled the storage of large amounts of data in terms of size (number of observations) and dimensions (number of variables). These data require new statistical processing methods adapted to their characteristics, especially for predictive modeling (data science). In this thesis, we are particularly interested in data with a large number of variables compared to the number of observations. For these data, reducing the number of initial variables, and hence of dimensions, by selecting an optimal subset is necessary, even imperative. It reduces complexity, helps to understand the structure of the data, improves the interpretation of the results and, above all, enhances the performance of the prediction or classification model by eliminating redundant and/or noise variables. More precisely, we are interested in the selection of variables in the context of supervised learning, specifically of multiclass discrimination. The objective is to propose new variable selection methods for the multiclass discriminant models called Multiclass Support Vector Machines (MSVM). Two approaches are proposed in this work. The first one, presented in a classical context, consists in selecting the optimal subset of variables using the radius-margin upper bound on the generalization error of MSVM.
The second one, proposed in a topological context, uses the concepts of neighborhood graphs and the degree of topological equivalence in discrimination to identify the relevant variables and to select the optimal subset for an MSVM model. The evaluation of these two approaches on simulated and real data shows that, from a large number of initial variables, they can select a reduced number of explanatory variables whose performance is equal to or better than that obtained by competing methods. Apprentissage supervisé Sélection de variables Machines à vecteurs support Discrimination multi-classes Borne rayon marge multi-classes Mesures de proximité Graphes de voisinage Équivalence topologique Supervised learning Variables selection Support vector machines Multiclass discrimination Multiclass radius margin bound Proximity measures Neighborhood graphs Topological equivalence 004
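For orientation, the radius-margin criterion used by the first approach can be recalled in its classical binary, hard-margin form. This is only the well-known two-class quantity, stated up to a constant factor that depends on the exact formulation. The multiclass bound developed in the thesis is not reproduced here, and the symbols n, R, w and \gamma are notation chosen for this sketch.

```latex
% Binary hard-margin SVM trained on n points:
%   R      = radius of the smallest ball enclosing the training points in feature space
%   w      = normal vector of the separating hyperplane
%   \gamma = 1 / \lVert w \rVert, the geometric margin
\[
  \text{leave-one-out error estimate}
  \;\lesssim\;
  \frac{1}{n}\,\frac{R^{2}}{\gamma^{2}}
  \;=\;
  \frac{1}{n}\,R^{2}\,\lVert w \rVert^{2}
\]
```

Under this reading, a variable subset is preferred when retraining on it yields a smaller value of the bound: either the data fit in a smaller ball or the classes separate with a larger margin.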
