91 |
Paralelização de inferência em redes credais utilizando computação distribuída para fatoração de matrizes esparsas / Parallelization of credal network inference using distributed computing for sparse matrix factorization
Pereira, Ramon Fortes, 25 April 2017
Este estudo tem como objetivo melhorar o desempenho computacional dos algoritmos de inferência em redes credais, aplicando técnicas de computação paralela e sistemas distribuídos em algoritmos de fatoração de matrizes esparsas. Grosso modo, técnicas de computação paralela são técnicas para transformar um sistema em um sistema com algoritmos que possam ser executados concorrentemente. E a fatoração de matrizes são técnicas da matemática para decompor uma matriz em um produto de duas ou mais matrizes. As matrizes esparsas são matrizes que possuem a maioria de seus valores iguais a zero. E as redes credais são semelhantes às redes bayesianas, que são grafos acíclicos que representam uma probabilidade conjunta através de probabilidades condicionais e suas relações de independência. As redes credais podem ser consideradas como uma extensão das redes bayesianas para lidar com incertezas ou a má qualidade dos dados. Para aplicar a técnica de paralelização de fatoração de matrizes esparsas na inferência de redes credais, a inferência utiliza-se da técnica de eliminação de variáveis onde o grafo acíclico da rede credal é associado a uma matriz esparsa e cada variável eliminada é análoga à eliminação de uma coluna. / This study aims to improve the computational performance of credal network inference algorithms by applying parallel computing and distributed systems techniques to sparse matrix factorization algorithms. Roughly speaking, parallel computing techniques transform a system into one whose algorithms can be executed concurrently. Matrix factorization is a group of mathematical techniques for decomposing a matrix into a product of two or more matrices. Sparse matrices are matrices in which most values are equal to zero. Credal networks are similar to Bayesian networks, which are acyclic graphs representing a joint probability through conditional probabilities and their independence relations.
Credal networks can be considered an extension of Bayesian networks for dealing with uncertainty or poor data quality. To apply parallel sparse matrix factorization techniques to credal network inference, the variable elimination method was used, where the acyclic graph of the credal network is associated with a sparse matrix and each eliminated variable is analogous to an eliminated column.
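The analogy between eliminating a variable and eliminating a column can be illustrated with the classic "elimination game" on an undirected graph. The sketch below is not the thesis's implementation: it assumes the network has already been turned into an undirected interaction graph, and the names `fill_in_edges` and `star` are hypothetical. It only records the extra edges (fill-in) that an elimination order creates, which is the quantity that sparse factorization orderings try to keep small.

```python
def fill_in_edges(adjacency, order):
    """Eliminate vertices in the given order and return the fill-in edges.

    adjacency: dict mapping each vertex to the set of its neighbours
               (undirected graph, assumed symmetric).
    order:     list of all vertices, in elimination order.
    """
    # Work on a copy so the caller's graph is left untouched.
    adj = {v: set(ns) for v, ns in adjacency.items()}
    fill = []
    for v in order:
        neighbours = sorted(adj[v])
        # Eliminating v connects all of its remaining neighbours pairwise,
        # just as eliminating a column couples the rows that share it.
        for i, a in enumerate(neighbours):
            for b in neighbours[i + 1:]:
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill.append((a, b))
        # Remove v from the graph.
        for n in neighbours:
            adj[n].discard(v)
        del adj[v]
    return fill


# Hypothetical star-shaped graph: A is linked to B, C and D.
star = {"A": {"B", "C", "D"}, "B": {"A"}, "C": {"A"}, "D": {"A"}}

hub_first = fill_in_edges(star, ["A", "B", "C", "D"])
leaves_first = fill_in_edges(star, ["B", "C", "D", "A"])
print(len(hub_first), len(leaves_first))
```

Eliminating the hub first connects all three leaves to each other, while eliminating the leaves first creates no new edges, so the order of elimination alone determines how much sparsity survives.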
92 |
Biagrupamento heurístico e coagrupamento baseado em fatoração de matrizes: um estudo em dados textuais / Heuristic biclustering and coclustering based on matrix factorization: a study on textual data
Ramos Diaz, Alexandra Katiuska, 16 October 2018
Biagrupamento e coagrupamento são tarefas de mineração de dados que permitem a extração de informação relevante sobre dados e têm sido aplicadas com sucesso em uma ampla variedade de domínios, incluindo aqueles que envolvem dados textuais -- foco de interesse desta pesquisa. Nas tarefas de biagrupamento e coagrupamento, os critérios de similaridade são aplicados simultaneamente às linhas e às colunas das matrizes de dados, agrupando simultaneamente os objetos e os atributos e possibilitando a criação de bigrupos/cogrupos. Contudo suas definições variam segundo suas naturezas e objetivos, sendo que a tarefa de coagrupamento pode ser vista como uma generalização da tarefa de biagrupamento. Estas tarefas, quando aplicadas nos dados textuais, demandam uma representação em um modelo de espaço vetorial que, comumente, leva à geração de espaços caracterizados pela alta dimensionalidade e esparsidade, afetando o desempenho de muitos dos algoritmos. Este trabalho apresenta uma análise do comportamento do algoritmo para biagrupamento Cheng e Church e do algoritmo para coagrupamento de decomposição de valores em blocos não negativos (\\textit{Non-Negative Block Value Decomposition} - NBVD), aplicado ao contexto de dados textuais. Resultados experimentais quantitativos e qualitativos são apresentados a partir das experimentações destes algoritmos em conjuntos de dados sintéticos criados com diferentes níveis de esparsidade e em um conjunto de dados real. Os resultados são avaliados em termos de medidas próprias de biagrupamento, medidas internas de agrupamento a partir das projeções nas linhas dos bigrupos/cogrupos e em termos de geração de informação. As análises dos resultados esclarecem questões referentes às dificuldades encontradas por estes algoritmos nos ambiente de experimentação, assim como se são capazes de fornecer informações diferenciadas e úteis na área de mineração de texto. 
De forma geral, as análises realizadas mostraram que o algoritmo NBVD é mais adequado para trabalhar com conjuntos de dados em altas dimensões e com alta esparsidade. O algoritmo de Cheng e Church, embora tenha obtido resultados bons de acordo com os objetivos do algoritmo, no contexto de dados textuais, propiciou resultados com baixa relevância. / Biclustering and coclustering are data mining tasks that allow the extraction of relevant information about data and have been applied successfully in a wide variety of domains, including those involving textual data, the focus of interest of this research. In biclustering and coclustering tasks, similarity criteria are applied simultaneously to the rows and columns of the data matrices, grouping the objects and the attributes at the same time and enabling the discovery of biclusters/coclusters. However, their definitions vary according to their nature and objectives, and the coclustering task can be seen as a generalization of the biclustering task. When applied to textual data, these tasks require a vector space model representation, which commonly leads to spaces characterized by high dimensionality and sparsity and affects the performance of many algorithms. This work provides an analysis of the behavior of the Cheng and Church biclustering algorithm and of the Non-Negative Block Value Decomposition (NBVD) coclustering algorithm, applied to the context of textual data. Quantitative and qualitative experimental results are presented, from experiments on synthetic datasets created with different sparsity levels and on a real dataset. The results are evaluated in terms of biclustering-specific measures, internal clustering measures applied to the row projections of the biclusters/coclusters, and the information generated.
The analysis of the results clarifies questions related to the difficulties faced by these algorithms in the experimental environment, as well as whether they are able to provide differentiated and useful information for the field of text mining. In general, the analyses showed that the NBVD algorithm is better suited to datasets with high dimensionality and high sparsity. The Cheng and Church algorithm, although it obtained good results according to its own objectives, produced results of low relevance in the context of textual data.
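For context on what "good results according to its own objectives" can mean, the score that Cheng and Church's published algorithm seeks to keep low is the mean squared residue of a submatrix. The abstract does not spell it out, so the following is a minimal sketch of that standard definition rather than the thesis's code; the function name and the two small matrices are hypothetical.

```python
def mean_squared_residue(rows):
    """Mean squared residue of a bicluster given as a list of equal-length rows.

    residue(i, j) = a_ij - row_mean_i - col_mean_j + overall_mean
    """
    n_rows, n_cols = len(rows), len(rows[0])
    row_means = [sum(r) / n_cols for r in rows]
    col_means = [sum(r[j] for r in rows) / n_rows for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i, r in enumerate(rows):
        for j, value in enumerate(r):
            residue = value - row_means[i] - col_means[j] + overall
            total += residue * residue
    return total / (n_rows * n_cols)


# Rows differ only by a constant shift: a coherent (additive) bicluster.
additive = [[1, 2, 3], [2, 3, 4], [4, 5, 6]]
# No additive row/column pattern explains this block.
checkerboard = [[1, 0], [0, 1]]

print(mean_squared_residue(additive), mean_squared_residue(checkerboard))
```

A score near zero only says the block follows an additive row-plus-column pattern. On very sparse term-document matrices a block of zeros also satisfies that pattern, which is one plausible reading of why low residue need not imply relevant text clusters.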
93 |
Apprentissage avec la parcimonie et sur des données incertaines par la programmation DC et DCA / Learning with sparsity and uncertainty by Difference of Convex functions optimization
Vo, Xuan Thanh, 15 October 2015
Dans cette thèse, nous nous concentrons sur le développement des méthodes d'optimisation pour résoudre certaines classes de problèmes d'apprentissage avec la parcimonie et/ou avec l'incertitude des données. Nos méthodes sont basées sur la programmation DC (Difference of Convex functions) et DCA (DC Algorithms) étant reconnues comme des outils puissants d'optimisation. La thèse se compose de deux parties : La première partie concerne la parcimonie tandis que la deuxième partie traite l'incertitude des données. Dans la première partie, une étude approfondie pour la minimisation de la norme zéro a été réalisée tant sur le plan théorique qu'algorithmique. Nous considérons une approximation DC commune de la norme zéro et développons quatre algorithmes basées sur la programmation DC et DCA pour résoudre le problème approché. Nous prouvons que nos algorithmes couvrent tous les algorithmes standards existants dans le domaine. Ensuite, nous étudions le problème de la factorisation en matrices non-négatives (NMF) et fournissons des algorithmes appropriés basés sur la programmation DC et DCA. Nous étudions également le problème de NMF parcimonieuse. Poursuivant cette étude, nous étudions le problème d'apprentissage de dictionnaire où la représentation parcimonieuse joue un rôle crucial. Dans la deuxième partie, nous exploitons la technique d'optimisation robuste pour traiter l'incertitude des données pour les deux problèmes importants dans l'apprentissage : la sélection de variables dans SVM (Support Vector Machines) et le clustering. Différents modèles d'incertitude sont étudiés. Les algorithmes basés sur DCA sont développés pour résoudre ces problèmes. / In this thesis, we focus on developing optimization approaches for solving some classes of optimization problems in sparsity and robust optimization for data uncertainty. Our methods are based on DC (Difference of Convex functions) programming and DCA (DC Algorithms) which are well-known as powerful tools in optimization. 
This thesis is composed of two parts: the first part concerns sparsity while the second part deals with uncertainty. In the first part, a unified DC approximation approach to optimization problems involving the zero-norm in the objective is thoroughly studied from both theoretical and computational perspectives. We consider a common DC approximation of the zero-norm that includes all standard sparsity-inducing penalty functions, and develop general DCA schemes that cover all standard algorithms in the field. Next, the thesis turns to the nonnegative matrix factorization (NMF) problem. We investigate the structure of the problem and provide appropriate DCA-based algorithms. To enhance the performance of NMF, sparse NMF formulations are proposed. Continuing this topic, we study the dictionary learning problem, where sparse representation plays a crucial role. In the second part, we exploit robust optimization techniques to deal with data uncertainty in two important machine learning problems: feature selection in linear Support Vector Machines and clustering. In this context, each individual data point is uncertain but varies within a bounded uncertainty set. Different models (box/spherical/ellipsoidal) of uncertain data are studied. DCA-based algorithms are developed to solve the robust problems.
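To make the idea of a DC approximation of the zero-norm concrete, here is a small sketch using the capped-l1 penalty. The abstract does not say which approximation the thesis adopts, so the choice of capped-l1, the parameter name `theta` and the sample vector are assumptions made purely for illustration. The penalty min(1, theta*|x|) is written as g(x) - h(x) with g and h both convex, which is the form DCA works on.

```python
def zero_norm(x):
    """Number of non-zero entries (the 'zero-norm')."""
    return sum(1 for v in x if v != 0)


def capped_l1(x, theta):
    """Capped-l1 approximation: sum of min(1, theta * |x_i|)."""
    return sum(min(1.0, theta * abs(v)) for v in x)


def dc_parts(x, theta):
    """Split capped-l1 into two convex functions g and h with penalty = g - h."""
    g = sum(theta * abs(v) for v in x)                    # convex: scaled l1 norm
    h = sum(max(0.0, theta * abs(v) - 1.0) for v in x)    # convex: max of affine pieces of |v|
    return g, h


x = [0.0, 0.5, -2.0, 0.01]   # hypothetical vector with three non-zero entries
g, h = dc_parts(x, theta=4.0)

print(zero_norm(x), capped_l1(x, 4.0), g - h, capped_l1(x, 1000.0))
```

As `theta` grows, the capped-l1 value approaches the exact count of non-zeros, while the decomposition g - h stays valid for every `theta`.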
94 |
Décomposition booléenne des tableaux multi-dimensionnels de données binaires : une approche par modèle de mélange post non-linéaire / Boolean decomposition of binary multidimensional arrays using a post nonlinear mixture model
Diop, Mamadou, 14 December 2018
Cette thèse aborde le problème de la décomposition booléenne des tableaux multidimensionnels de données binaires par modèle de mélange post non-linéaire. Dans la première partie, nous introduisons une nouvelle approche pour la factorisation booléenne en matrices binaires (FBMB) fondée sur un modèle de mélange post non-linéaire. Contrairement aux autres méthodes de factorisation de matrices binaires existantes, fondées sur le produit matriciel classique, le modèle proposé est équivalent au modèle booléen de factorisation matricielle lorsque les entrées des facteurs sont exactement binaires et donne des résultats plus interprétables dans le cas de sources binaires corrélées, et des rangs d'approximation matricielle plus faibles. Une condition nécessaire et suffisante d'unicité pour la FBMB est également fournie. Deux algorithmes s'appuyant sur une mise à jour multiplicative sont proposés et illustrés dans des simulations numériques ainsi que sur un jeu de données réelles. La généralisation de cette approche au cas de tableaux multidimensionnels (tenseurs) binaires conduit à la factorisation booléenne de tenseurs binaires (FBTB). La démonstration de la condition nécessaire et suffisante d’unicité de la décomposition booléenne de tenseurs binaires repose sur la notion d'indépendance booléenne d'une famille de vecteurs. L'algorithme multiplicatif fondé sur le modèle de mélange post non-linéaire est étendu au cas multidimensionnel. Nous proposons également un nouvel algorithme, plus efficace, s'appuyant sur une stratégie de type AO-ADMM (Alternating Optimization -ADMM). Ces algorithmes sont comparés à ceux de l'état de l'art sur des données simulées et sur un jeu de données réelles / This work is dedicated to the study of boolean decompositions of binary multidimensional arrays using a post nonlinear mixture model. In the first part, we introduce a new approach for the boolean factorization of binary matrices (BFBM) based on a post nonlinear mixture model. 
Unlike existing binary matrix factorization methods, which are based on the classical matrix product, the proposed model is equivalent to the Boolean factorization model when the entries of the factors are exactly binary; it therefore gives more interpretable results in the case of correlated binary sources, as well as lower-rank matrix approximations than other state-of-the-art algorithms. A necessary and sufficient condition for the uniqueness of the BFBM is also provided. Two algorithms based on multiplicative update rules are proposed and tested in numerical simulations, as well as on a real dataset. The generalization of this approach to the case of binary multidimensional arrays (tensors) leads to the Boolean factorization of binary tensors (BFBT). The proof of the necessary and sufficient condition for the uniqueness of the Boolean decomposition of binary tensors is based on a notion of Boolean independence of a family of vectors. The multiplicative algorithm based on the post nonlinear mixture model is extended to the multidimensional case. We also propose a new, more efficient algorithm based on an AO-ADMM (Alternating Optimization-ADMM) strategy. These algorithms are compared to state-of-the-art algorithms on simulated and on real data.
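The difference between the classical matrix product and the Boolean product is easy to see on a tiny example. The sketch below only contrasts the two products. It does not reproduce the post nonlinear mixture model or the update rules of the thesis, which the abstract does not specify, and the matrices `W` and `H` are hypothetical.

```python
def classical_product(w, h):
    """Ordinary matrix product: entries are sums of products."""
    inner, cols = len(h), len(h[0])
    return [[sum(row[k] * h[k][j] for k in range(inner)) for j in range(cols)]
            for row in w]


def boolean_product(w, h):
    """Boolean matrix product: OR over k of (w_ik AND h_kj)."""
    inner, cols = len(h), len(h[0])
    return [[int(any(row[k] and h[k][j] for k in range(inner))) for j in range(cols)]
            for row in w]


# Two overlapping binary sources: the middle row of W uses both of them.
W = [[1, 0],
     [1, 1],
     [0, 1]]
H = [[1, 1, 0],
     [0, 1, 1]]

print(classical_product(W, H))
print(boolean_product(W, H))
```

Where the two sources overlap, the classical product yields 2, which is no longer a binary value, whereas the Boolean product stays at 1. This is why factorizations built on the classical product can struggle with correlated binary sources.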
95 |
Evaluation de l'adhérence au contact roue-rail par analyse d'images spectrales / Wheel-track adhesion evaluation using spectral imaging
Nicodeme, Claire, 04 July 2018
L’avantage du train depuis sa création est sa faible résistance à l’avancement du fait du contact fer-fer de la roue sur le rail conduisant à une adhérence réduite. Cependant cette adhérence faible est aussi un inconvénient majeur : étant dépendante des conditions environnementales, elle est facilement altérée lors d’une pollution du rail (végétaux, corps gras, eau, etc.). Aujourd’hui, les mesures prises face à des situations d'adhérence dégradée impactent directement les performances du système et conduisent notamment à une perte de capacité de transport. L’objectif du projet est d’utiliser les nouvelles technologies d’imagerie spectrale pour identifier sur les rails les zones à adhérence réduite et leur cause afin d’alerter et d’adapter rapidement les comportements. La stratégie d’étude a pris en compte les trois points suivants : • Le système de détection, installé à bord de trains commerciaux, doit être indépendant du train. • La détection et l’identification ne doivent pas interagir avec la pollution pour ne pas rendre la mesure obsolète. Pour ce faire le principe d’un Contrôle Non Destructif est retenu. • La technologie d’imagerie spectrale permet de travailler à la fois dans le domaine spatial (mesure de distance, détection d’objet) et dans le domaine fréquentiel (détection et reconnaissance de matériaux par analyse de signatures spectrales). Dans le temps imparti des trois ans de thèse, nous nous sommes focalisés sur la validation du concept par des études et analyses en laboratoire, réalisables dans les locaux de SNCF Ingénierie & Projets. Les étapes clés ont été la réalisation d’un banc d’évaluation et le choix du système de vision, la création d'une bibliothèque de signatures spectrales de référence et le développement d'algorithmes classification supervisées et non supervisées des pixels. Ces travaux ont été valorisés par le dépôt d'un brevet et la publication d'articles dans des conférences IEEE. 
/ The advantage of the train since its creation has been its low resistance to motion, due to the iron-on-iron contact of the wheel on the rail, which leads to low adhesion. However, this low adhesion is also a major drawback: being dependent on environmental conditions, it is easily degraded when the rail is polluted (vegetation, grease, water, etc.). Today, the measures taken in response to degraded adhesion directly impact the performance of the system and lead, in particular, to a loss of transport capacity. The objective of the project is to use new spectral imaging technologies to identify areas of reduced adhesion on the rails, and their cause, in order to raise alerts and adapt behaviour quickly. The study strategy took into account the following three points: (1) The detection system, installed on board commercial trains, must be independent of the train. (2) Detection and identification must not interact with the pollution, so that the measurement is not invalidated; to this end, a non-destructive testing approach was chosen. (3) Spectral imaging technology makes it possible to work with both spatial information (distance measurement, object detection) and spectral information (material detection and recognition by analysis of spectral signatures). In the three years allotted to the thesis, we focused on validating the concept through laboratory studies and analyses that could be carried out on the premises of SNCF Ingénierie & Projets. The key steps were the construction of an evaluation bench and the choice of a vision system, the creation of a library of reference spectral signatures, and the development of supervised and unsupervised pixel classification algorithms. A patent describing the method and process has been filed and published.
96 |
Non-negative matrix factorization for integrative clustering / Алгоритми интегративног кластеровања података применом ненегативне факторизације матрице / Algoritmi integrativnog klasterovanja podataka primenom nenegativne faktorizacije matrice
Brdar Sanja, 15 December 2016
Integrative approaches are motivated by the desired improvement of robustness, stability and accuracy. Clustering, the prevailing technique for preliminary and exploratory analysis of experimental data, may benefit from integration across multiple partitions. In this thesis we propose integration methods based on non-negative matrix factorization that can fuse clusterings stemming from different data sets, different data preprocessing steps or different sub-samples of objects or features. The proposed methods are evaluated from several points of view on typical machine learning data sets, synthetic data and, above all, data from the bioinformatics realm, whose rise is fuelled by technological revolutions in molecular biology. The vast amounts of 'omics' data available nowadays call for sophisticated computational methods. We evaluated the methods on problems from cancer genomics, functional genomics and metagenomics. / Предмет истраживања докторске дисертације су алгоритми кластеровања, односно груписања података, и могућности њиховог унапређења интегративним приступом у циљу повећања поузданости, робустности на присуство шума и екстремних вредности у подацима, омогућавања фузије података. У дисертацији су предложене методе засноване на ненегативној факторизацији матрице. Методе су успешно имплементиране и детаљно анализиране на разноврсним подацима са UCI репозиторијума и синтетичким подацима које се типично користе за евалуацију нових алгоритама и поређење са већ постојећим методама. Већи део дисертације посвећен је примени у домену биоинформатике која обилује хетерогеним подацима и бројним изазовним задацима. Евалуација је извршена на подацима из домена функционалне геномике, геномике рака и метагеномике. / Predmet istraživanja doktorske disertacije su algoritmi klasterovanja, odnosno grupisanja podataka, i mogućnosti njihovog unapređenja integrativnim pristupom u cilju povećanja pouzdanosti, robustnosti na prisustvo šuma i ekstremnih vrednosti u podacima, omogućavanja fuzije podataka. U disertaciji su predložene metode zasnovane na nenegativnoj faktorizaciji matrice. Metode su uspešno implementirane i detaljno analizirane na raznovrsnim podacima sa UCI repozitorijuma i sintetičkim podacima koje se tipično koriste za evaluaciju novih algoritama i poređenje sa već postojećim metodama. Veći deo disertacije posvećen je primeni u domenu bioinformatike koja obiluje heterogenim podacima i brojnim izazovnim zadacima. Evaluacija je izvršena na podacima iz domena funkcionalne genomike, genomike raka i metagenomike.
97 |
Méthodes informées de factorisation matricielle pour l'étalonnage de réseaux de capteurs mobiles et la cartographie de champs de pollution / Informed method of matrix factorization for calibration of mobile sensor networks and pollution fields mapping
Dorffer, Clément, 13 December 2017
Le mobile crowdsensing consiste à acquérir des données géolocalisées et datées d'une foule de capteurs mobiles (issus de ou connectés à des smartphones). Dans cette thèse, nous nous intéressons au traitement des données issues du mobile crowdsensing environnemental. En particulier, nous proposons de revisiter le problème d'étalonnage aveugle de capteurs comme un problème informé de factorisation matricielle à données manquantes, où les facteurs contiennent respectivement le modèle d'étalonnage fonction du phénomène physique observé (nous proposons des approches pour des modèles affines et non linéaires) et les paramètres d'étalonnage de chaque capteur. Par ailleurs, dans l'application de surveillance de la qualité de l'air que nous considérons, nous supposons avoir à notre disposition des mesures très précises mais distribuées de manière très parcimonieuse dans le temps et l'espace, que nous couplons aux multiples mesures issues de capteurs mobiles. Nos approches sont dites informées car (i) les facteurs matriciels sont structurés par la nature du problème, (ii) le phénomène observé peut être décomposé sous forme parcimonieuse dans un dictionnaire connu ou approché par un modèle physique/géostatistique, et (iii) nous connaissons la fonction d'étalonnage moyenne des capteurs à étalonner. Les approches proposées sont plus performantes que des méthodes basées sur la complétion de la matrice de données observées ou les techniques multi-sauts de la littérature, basées sur des régressions robustes. Enfin, le formalisme informé de factorisation matricielle nous permet aussi de reconstruire une carte fine du phénomène physique observé. / Mobile crowdsensing aims to acquire geolocated and timestamped data from a crowd of sensors (from or connected to smartphones). In this thesis, we focus on processing data from environmental mobile crowdsensing. 
In particular, we propose to revisit blind sensor calibration as an informed matrix factorization problem with missing entries, where the factor matrices respectively contain the calibration model, which is a function of the observed physical phenomenon (we propose approaches for affine and nonlinear sensor responses), and the calibration parameters of each sensor. Moreover, in the air quality monitoring application considered, we assume that we have access to a few very precise measurements, sparsely distributed in space and time, which we couple with the many measurements from the mobile sensors. Our approaches are "informed" because (i) the factor matrices are structured by the nature of the problem, (ii) the physical phenomenon can be sparsely decomposed over a known dictionary or approximated by a physical or geostatistical model, and (iii) we know the mean calibration function of the sensors to be calibrated. The proposed approaches perform better than methods based on completion of the observed data matrix or the multi-hop calibration techniques from the literature, which are based on robust regression. Finally, the informed matrix factorization formalism also allows us to reconstruct a fine-grained map of the observed physical phenomenon.
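The factorization view of affine calibration can be sketched with a toy example. The block below assumes each sensor reading equals offset + gain × (true value), so that the data matrix is the product of a "calibration model" factor with rows [1, y_i] and a "sensor parameters" factor with columns [offset_j, gain_j]. All names and numbers are hypothetical, and the thesis's actual estimation algorithm is not reproduced here; only the structure of the problem is shown.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


# Hypothetical true pollutant levels at three locations.
phenomenon = [10.0, 20.0, 35.0]
# Hypothetical (offset, gain) pairs for two uncalibrated sensors.
sensors = [(0.5, 1.25), (-1.0, 0.75)]

# First factor: affine calibration model, one row [1, y_i] per location.
G = [[1.0, y] for y in phenomenon]
# Second factor: one column [offset_j, gain_j] per sensor.
F = [[offset for offset, _ in sensors],
     [gain for _, gain in sensors]]

X = matmul(G, F)  # full matrix of readings: locations x sensors

# Mobile sensors visit only some locations, so most entries are missing.
visited = [[True, False],
           [True, True],
           [False, True]]
X_observed = [[x if seen else None for x, seen in zip(row, mask)]
              for row, mask in zip(X, visited)]

print(X)
print(X_observed)
```

Calibration then amounts to recovering `G` and `F` from `X_observed` alone, which is why side information such as a few precise reference measurements matters: without it, the two factors are only determined up to an invertible 2 × 2 transformation.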
100 |
Méthodes avancées de séparation de sources applicables aux mélanges linéaires-quadratiques / Advanced methods of source separation applicable to linear-quadratic mixtures
Jarboui, Lina, 18 November 2017
Dans cette thèse, nous nous sommes intéressés à proposer de nouvelles méthodes de Séparation Aveugle de Sources (SAS) adaptées aux modèles de mélange non-linéaires. La SAS consiste à estimer les signaux sources inconnus à partir de leurs mélanges observés lorsqu'il existe très peu d'informations disponibles sur le modèle de mélange. La contribution méthodologique de cette thèse consiste à prendre en considération les interactions non-linéaires qui peuvent se produire entre les sources en utilisant le modèle linéaire-quadratique (LQ). A cet effet, nous avons développé trois nouvelles méthodes de SAS. La première méthode vise à résoudre le problème du démélange hyperspectral en utilisant un modèle linéaire-quadratique. Celle-ci se repose sur la méthode d'Analyse en Composantes Parcimonieuses (ACPa) et nécessite l'existence des pixels purs dans la scène observée. Dans le même but, nous proposons une deuxième méthode du démélange hyperspectral adaptée au modèle linéaire-quadratique. Elle correspond à une méthode de Factorisation en Matrices Non-négatives (FMN) se basant sur l'estimateur du Maximum A Posteriori (MAP) qui permet de prendre en compte les informations a priori sur les distributions des inconnus du problème afin de mieux les estimer. Enfin, nous proposons une troisième méthode de SAS basée sur l'analyse en composantes indépendantes (ACI) en exploitant les Statistiques de Second Ordre (SSO) pour traiter un cas particulier du mélange linéaire-quadratique qui correspond au mélange bilinéaire. / In this thesis, we were interested to propose new Blind Source Separation (BSS) methods adapted to the nonlinear mixing models. BSS consists in estimating the unknown source signals from their observed mixtures when there is little information available on the mixing model. The methodological contribution of this thesis consists in considering the non-linear interactions that can occur between sources by using the linear-quadratic (LQ) model. 
To this end, we developed three new BSS methods. The first method aims to solve the hyperspectral unmixing problem using a linear-quadratic model. It is based on Sparse Component Analysis (SCA) and requires the existence of pure pixels in the observed scene. For the same purpose, we propose a second hyperspectral unmixing method adapted to the linear-quadratic model. It is a Non-negative Matrix Factorization (NMF) method based on the Maximum A Posteriori (MAP) estimator, which makes it possible to take into account prior information about the distributions of the unknowns in order to estimate them better. Finally, we propose a third BSS method based on Independent Component Analysis (ICA) that exploits Second Order Statistics (SOS) to handle a particular case of the linear-quadratic mixture, namely the bilinear mixture.
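For readers unfamiliar with the mixing models named above, the following sketch computes one observed sample under the usual linear-quadratic model: a linear combination of the sources plus weighted pairwise products. The bilinear case keeps only products of distinct sources. The function names and coefficient values are hypothetical, and none of the three separation methods from the thesis is implemented here.

```python
def linear_quadratic_mix(sources, linear, quadratic):
    """One observed sample: sum_j a_j*s_j + sum_{(j,k)} b_jk*s_j*s_k.

    sources:   list of source values s_j
    linear:    list of linear coefficients a_j
    quadratic: dict mapping index pairs (j, k), j <= k, to coefficients b_jk
    """
    value = sum(a * s for a, s in zip(linear, sources))
    for (j, k), b in quadratic.items():
        value += b * sources[j] * sources[k]
    return value


def is_bilinear(quadratic):
    """Bilinear mixtures contain cross terms only (no squared sources)."""
    return all(j != k for (j, k) in quadratic)


sources = [2.0, 3.0]
linear = [1.0, 0.5]
lq_terms = {(0, 1): 0.25, (0, 0): 0.5}   # cross term plus a squared term
bilinear_terms = {(0, 1): 0.25}          # cross term only

print(linear_quadratic_mix(sources, linear, lq_terms))
print(linear_quadratic_mix(sources, linear, bilinear_terms))
```

Setting every quadratic coefficient to zero recovers the ordinary linear mixture, so the linear-quadratic model strictly extends the setting handled by classical BSS methods.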