1

Using co-clustering models for exploratory data analysis / Utilisation des modèles de co-clustering pour l'analyse exploratoire des données

Guigourès, Romain 04 December 2013 (has links)
Co-clustering is a clustering technique that simultaneously partitions the rows and the columns of a data matrix. Among existing approaches, MODL is suitable for processing large data sets with several continuous or categorical variables. We use it as the baseline approach throughout this thesis and show the diversity of data mining problems it can address, such as graph partitioning, temporal graph segmentation, and curve clustering. MODL tracks very fine patterns in large data sets, which makes the results difficult to interpret; exploratory analysis tools are therefore needed to exploit them. To guide the user in interpreting such results, we define several tools that simplify fine-grained results to allow an overall interpretation, detect the most remarkable clusters, determine the most representative values of the clusters, and visualize the results. We investigate the asymptotic behavior of these exploratory analysis tools in order to connect them with existing approaches. Finally, an application to call detail records from the telecom operator Orange, collected in Ivory Coast, demonstrates the value of MODL and the exploratory analysis tools in an industrial context.
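To make the row-and-column partitioning idea concrete, here is a minimal, hedged sketch using scikit-learn's SpectralCoclustering on synthetic data. It is only a stand-in: MODL itself is a different method and is not implemented here, and the matrix shape, cluster count, and noise level below are illustrative assumptions.

```python
# Illustrative co-clustering: simultaneous partitioning of rows and columns.
# SpectralCoclustering is a stand-in for MODL, which is a different method;
# matrix shape, cluster count and noise level are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_biclusters
from sklearn.cluster import SpectralCoclustering

# Synthetic matrix with 3 planted co-clusters (think callers x callees).
data, _, _ = make_biclusters(shape=(300, 200), n_clusters=3,
                             noise=5, random_state=0)

model = SpectralCoclustering(n_clusters=3, random_state=0)
model.fit(data)

# Every row and every column gets a cluster label; a co-cluster is the
# submatrix formed by one row cluster crossed with one column cluster.
print("row cluster sizes:   ", np.bincount(model.row_labels_))
print("column cluster sizes:", np.bincount(model.column_labels_))

# Reordering rows and columns by label exposes the block structure,
# which is how co-clustering results are typically rendered.
reordered = data[np.argsort(model.row_labels_)][:, np.argsort(model.column_labels_)]
```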
2

Toward Scalable Hierarchical Clustering and Co-clustering Methods: application to the Cluster Hypothesis in Information Retrieval / Méthodes de regroupement hiérarchique agglomératif et co-clustering, leurs applications aux tests d’hypothèse de cluster et implémentations distribuées

Wang, Xinyu 29 November 2017 (has links)
As a major type of unsupervised machine learning, clustering is widely applied in various tasks, and different clustering methods have different characteristics. Hierarchical clustering, for example, can output a binary tree-like structure, called a dendrogram, which explicitly illustrates the interconnections among data instances. Co-clustering, on the other hand, generates co-clusters, each containing a subset of data instances and a subset of data attributes. Applying clustering to textual data organizes the input documents and reveals connections among them, which is helpful in many cases, for example in cluster-based Information Retrieval tasks.

As the size of available data increases, so does the demand for computing power. In response, many distributed computing platforms have been developed; they use the collective computing power of commodity machines to partition data, assign computing tasks, and perform computation concurrently. In this thesis, we work on textual data: given a corpus of documents, we adopt the bag-of-words hypothesis and apply the vector space model. We first address text clustering tasks by proposing two methods, Sim_AHC and SHCoClust, which respectively represent a similarity-based hierarchical clustering framework and a similarity-based hierarchical co-clustering method. We examine their properties and computational performance through mathematical deduction, experimental verification, and evaluation. We then apply these methods to test the cluster hypothesis, the fundamental assumption in cluster-based Information Retrieval; in these tests we use the optimal cluster search to evaluate the retrieval effectiveness of the hierarchical methods unified by Sim_AHC and SHCoClust, examine their computing efficiency, and compare the results. To run the proposed methods on larger datasets, we select the Apache Spark platform and provide distributed implementations of Sim_AHC and SHCoClust. For distributed Sim_AHC, we present the designed computing procedure, illustrate the difficulties encountered, and provide possible solutions. For SHCoClust, we provide a distributed implementation of its core, spectral embedding, and use several datasets of varying size to examine its scalability on a cluster of nodes.
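As a rough illustration of what a similarity-based hierarchical clustering of documents computes — in the spirit of the methods Sim_AHC unifies, not the thesis's exact formulation — the sketch below builds TF-IDF vectors, computes cosine distances, and cuts an average-linkage dendrogram. The toy documents and the cut level are assumptions.

```python
# Hedged sketch of similarity-based agglomerative clustering on documents:
# TF-IDF vectors, cosine distance, average linkage, then a flat cut.
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

docs = [
    "co-clustering partitions rows and columns of a matrix",
    "hierarchical clustering outputs a dendrogram",
    "spark distributes computation over commodity machines",
    "information retrieval benefits from cluster-based search",
]

X = TfidfVectorizer().fit_transform(docs).toarray()

# Condensed pairwise cosine distance matrix (1 - cosine similarity).
dist = pdist(X, metric="cosine")

# Average-linkage dendrogram; cutting it yields flat clusters.
Z = linkage(dist, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # flat cluster ids per document for a 2-cluster cut
```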
3

A visual analytics approach for multi-resolution and multi-model analysis of text corpora: application to investigative journalism / Une approche de visualisation analytique pour une analyse multi-résolution de corpus textuels : application au journalisme d’investigation

Médoc, Nicolas 16 October 2017 (has links)
As the production of digital texts grows exponentially, a greater need to analyze text corpora arises in various domains of application, insofar as they constitute inexhaustible sources of shared information and knowledge. We therefore propose in this thesis a novel visual analytics approach for the analysis of text corpora, implemented for the real and concrete needs of investigative journalism. Motivated by the problems and tasks identified with a professional investigative journalist, the visualizations and interactions were designed through a user-centered methodology involving the user during the whole development process. Specifically, investigative journalists formulate hypotheses and explore the field under investigation exhaustively, in search of multiple sources supporting their working hypotheses. Carrying out such tasks in a large corpus is a daunting endeavor and requires visual analytics software addressing several challenging research issues covered in this thesis.

First, the difficulty of making sense of a large text corpus lies in its unstructured nature. We resort to the Vector Space Model (VSM) and its strong relationship with the distributional hypothesis, leveraged by multiple text mining algorithms, to discover the latent semantic structure of the corpus. Topic models and biclustering methods are well suited to the extraction of coarse-grained topics, i.e. groups of documents concerning similar topics, each represented by a set of terms extracted from the textual contents. Such a topical structure summarizes a corpus and facilitates its exploration. We propose a new Weighted Topic Map visualization that conveys a broad overview of the coarse-grained topics: it allows quick interpretation of contents through multiple tag clouds, while depicting properties of the topical structure such as the relative importance of topics and their semantic similarity.

Although exploring the coarse-grained topics helps locate topics of interest and their neighborhood, identifying specific facts, viewpoints or angles related to events or stories requires a finer level of structure to represent topic variants. This nested structure, revealed by Bimax, a pattern-based overlapping biclustering algorithm, captures in biclusters the co-occurrences of terms shared by subsets of documents and can disclose facts, viewpoints or angles related to common events or stories. The thesis tackles the visualization of large numbers of overlapping biclusters by organizing term-document biclusters into a hierarchy that limits term redundancy and conveys their commonalities and specificities. We evaluated the utility of our software through a usage scenario combined with a qualitative evaluation with an investigative journalist.

In addition, the co-occurrence patterns of the topic variants revealed by Bimax are determined by the enclosing topical structure supplied by the coarse-grained topic extraction method that is run beforehand. Nonetheless, little guidance exists regarding the choice of this method and its impact on the exploration and comprehension of topics and topic variants. We therefore conducted both a numerical experiment and a controlled user experiment to compare two topic extraction methods: Coclus, a disjoint biclustering method, and hierarchical Latent Dirichlet Allocation (hLDA), an overlapping probabilistic topic model whose probability distributions form an overlapping bicluster structure. The theoretical foundations of both methods are analyzed by relating them to the distributional hypothesis. The numerical experiment provides statistical evidence of the difference between the topical structures produced by the two methods, and the controlled experiment shows their impact on the comprehension of topics and topic variants from the analyst's perspective. (...)
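The following hedged sketch shows the kind of coarse-grained topic structure such a map summarizes: topics extracted from a toy corpus, each reduced to its top-weighted terms as would feed a per-topic tag cloud. It uses plain LDA from scikit-learn rather than the Coclus or hLDA methods studied in the thesis; the corpus, topic count, and term count are illustrative assumptions.

```python
# Coarse-grained topic extraction with top terms per topic — the structure a
# Weighted Topic Map would summarize. Plain LDA, not the thesis's pipeline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the minister denied allegations of offshore accounts",
    "leaked documents show offshore shell companies",
    "the football club signed a new striker this summer",
    "the striker scored twice in the league match",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# The top-weighted terms per topic are what feed a per-topic tag cloud.
terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:5]
    print(f"topic {k}:", [terms[i] for i in top])
```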
4

Simultaneous partitioning and modeling : a framework for learning from complex data

Deodhar, Meghana 11 October 2010 (has links)
While a single learned model is adequate for simple prediction problems, it may not be sufficient to represent the heterogeneous populations that difficult classification or regression problems often involve. In such scenarios, practitioners often adopt a "divide and conquer" strategy that segments the data into relatively homogeneous groups and then builds a model for each group. This two-step procedure usually results in simpler, more interpretable and actionable models without any loss in accuracy. We consider prediction problems on bi-modal or dyadic data with covariates, e.g., predicting customer behavior across products, where the independent variables can be naturally partitioned along the modes. A pivoting operation can then place the target variable in the cells of a "customer by product" data matrix. We present a model-based co-clustering framework that interleaves partitioning (clustering) along each mode with the construction of prediction models, iteratively improving both the cluster assignments and the fit of the models. This Simultaneous CO-clustering And Learning (SCOAL) framework generalizes co-clustering and collaborative filtering to model-based co-clustering, and is shown to outperform independently clustering the data first and then building models. Our framework applies to a wide range of bi-modal and multi-modal data, and can easily be specialized to classification and regression problems in domains like recommender systems, fraud detection and marketing. Further, we note that in several datasets not all the data is useful for the learning problem, and ignoring outliers and non-informative values may lead to better models; we explore extensions of SCOAL that automatically identify and discard irrelevant data points and features while modeling, in order to improve prediction accuracy. Next, we leverage the multiple models provided by the SCOAL technique to address two prediction problems on dyadic data: (i) ranking predictions based on their reliability, and (ii) active learning. We also extend SCOAL to predictive modeling of multi-modal data where one of the modes is implicitly ordered, e.g., time series data. Finally, we illustrate our implementation of a parallel version of SCOAL based on the Google MapReduce framework, developed on the open-source Hadoop platform. We demonstrate the effectiveness of specific instances of the SCOAL framework on prediction problems through experimentation on real and synthetic data.
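A minimal sketch of the SCOAL-style alternation follows, under simplifying assumptions not stated in the abstract: a fully observed target matrix, per-cell covariates, one linear least-squares model per co-cluster, and squared-error reassignment. The actual framework also handles missing entries and other model families; all names and data below are illustrative.

```python
# Minimal SCOAL-style alternation: fit one linear model per co-cluster, then
# reassign rows and columns to the clusters whose models explain them best.
import numpy as np

def fit_models(Y, X, r, c, k_rows, k_cols):
    """Fit one linear model W[a, b] per co-cluster by least squares."""
    d = X.shape[2]
    W = np.zeros((k_rows, k_cols, d))
    for a in range(k_rows):
        for b in range(k_cols):
            rows, cols = np.where(r == a)[0], np.where(c == b)[0]
            if rows.size and cols.size:
                Xi = X[np.ix_(rows, cols)].reshape(-1, d)
                yi = Y[np.ix_(rows, cols)].ravel()
                W[a, b] = np.linalg.lstsq(Xi, yi, rcond=None)[0]
    return W

def scoal(Y, X, k_rows=2, k_cols=2, iters=10, seed=0):
    """Alternate model fitting and row/column reassignment."""
    rng = np.random.default_rng(seed)
    m, n, _ = X.shape
    r = rng.integers(k_rows, size=m)  # row cluster assignments
    c = rng.integers(k_cols, size=n)  # column cluster assignments
    for _ in range(iters):
        W = fit_models(Y, X, r, c, k_rows, k_cols)
        # Move each row to the row cluster minimizing its squared error,
        # given current column assignments; then do the same for columns.
        for i in range(m):
            errs = [sum((Y[i, j] - X[i, j] @ W[a, c[j]]) ** 2 for j in range(n))
                    for a in range(k_rows)]
            r[i] = int(np.argmin(errs))
        for j in range(n):
            errs = [sum((Y[i, j] - X[i, j] @ W[r[i], b]) ** 2 for i in range(m))
                    for b in range(k_cols)]
            c[j] = int(np.argmin(errs))
    return r, c, fit_models(Y, X, r, c, k_rows, k_cols)

# Toy usage: planted 2x2 co-cluster structure with distinct linear models.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 16, 3))
true_r, true_c = (np.arange(20) < 10) * 1, (np.arange(16) < 8) * 1
B = rng.normal(size=(2, 2, 3))
Y = np.einsum('ijd,ijd->ij', X, B[true_r][:, true_c])
row_clusters, col_clusters, models = scoal(Y, X)
```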
5

Using social network information in recommender systems

Sudan, Nikita Maple 30 September 2011 (has links)
Recommender systems are used to select online information relevant to a given user. Traditional (memory-based) recommenders explore the user-item rating matrix and make recommendations based on users who have rated similarly or items that have been rated similarly. With the growing popularity of social networks, recommender systems can benefit from combining the history of user preferences with information from the social/trust network of users. This thesis explores two techniques for combining user-item rating history with trust network information to make better user-item rating predictions. The first approach (SCOAL [5]) simultaneously co-clusters the data and learns a separate model for each co-cluster. The co-clustering is based on the user features as well as the rating history, capturing the intuition that certain groups of users have similar preferences for certain groups of items; the grouping of users is affected by both the similarity in rating behavior and the trust network. The second, graph-based label propagation approach (MAD [27]) works in a transductive setting and propagates ratings of user-item pairs directly on the user social graph. We evaluate both approaches on two large public datasets from Epinions.com and Flixster.com. The thesis is among the first to explore the role of distrust in rating prediction. Since distrust is not as transitive as trust (an enemy's enemy need not be an enemy or a friend), distrust cannot directly replace trust in trust propagation approaches; by using a low-dimensional representation of the original trust network in SCOAL, we use distrust as-is and do not propagate it. Using SCOAL, we can pinpoint the groups of users and the groups of items that share the same preference model. Both SCOAL and MAD are able to seamlessly integrate side information, such as item-subject and item-author information, into the trust-based rating prediction model.
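As a simplified stand-in for the label propagation idea (MAD additionally uses per-node injection, continuation, and abandonment weights), the sketch below clamps seed users to their known ratings for a single item and lets the other users repeatedly average their trusted neighbors; the toy trust graph and ratings are assumptions.

```python
# Simplified rating propagation over a trust graph — a stand-in for MAD,
# which additionally weights injection/continuation/abandonment per node.
import numpy as np

A = np.array([[0, 1, 1, 0],   # user 0 trusts users 1 and 2, etc.
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)  # row-stochastic transition matrix

ratings = np.array([5.0, 0.0, 0.0, 1.0])      # known ratings for one item
seeded = np.array([True, False, False, True])

scores = ratings.copy()
for _ in range(50):
    scores = P @ scores               # average the trusted neighbors
    scores[seeded] = ratings[seeded]  # clamp seeds to their known ratings

print(scores.round(2))  # users 1 and 2 settle between the two seed ratings
```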
6

Information Retrieval by Identification of Signature Terms in Clusters

Muppalla, Sesha Sai Krishna Koundinya 24 May 2022 (has links)
No description available.
7

Retrieval and Labeling of Documents Using Ontologies: Aided by a Collaborative Filtering

Alshammari, Asma 06 June 2023 (has links)
No description available.
8

Graph Coloring and Clustering Algorithms for Science and Engineering Applications

Bozdag, Doruk January 2008 (has links)
No description available.
9

Parallelizing Applications With a Reduction Based Framework on Multi-Core Clusters

Ramanathan, Venkatram 01 September 2010 (has links)
No description available.
10

Mining Complex High-Order Datasets

Barnathan, Michael January 2010 (has links)
Selection of an appropriate structure for storage and analysis of complex datasets is a vital but often overlooked decision in the design of data mining and machine learning experiments. Most present techniques impose a matrix structure on the dataset, with rows representing observations and columns representing features. While this assumption is reasonable when features are scalar and do not exhibit co-dependence, the matrix data model becomes inappropriate when dependencies between non-target features must be modeled in parallel, or when features naturally take the form of higher-order multilinear structures. Such datasets particularly abound in functional medical imaging modalities, such as fMRI, where accurate integration of both spatial and temporal information is critical. Although necessary to take full advantage of the high-order structure of these datasets and built on well-studied mathematical tools, tensor analysis methodologies have only recently entered widespread use in the data mining community and remain relatively absent from the literature within the biomedical domain. Furthermore, naive tensor approaches suffer from fundamental efficiency problems which limit their practical use in large-scale high-order mining and do not capture local neighborhoods necessary for accurate spatiotemporal analysis. To address these issues, a comprehensive framework based on wavelet analysis, tensor decomposition, and the WaveCluster algorithm is proposed for addressing the problems of preprocessing, classification, clustering, compression, feature extraction, and latent concept discovery on large-scale high-order datasets, with a particular emphasis on applications in computer-assisted diagnosis. Our framework is evaluated on a 9.3 GB fMRI motor task dataset of both high dimensionality and high order, performing favorably against traditional voxelwise and spectral methods of analysis, discovering latent concepts suggestive of subject handedness, and reducing space and time complexities by up to two orders of magnitude. Novel wavelet and tensor tools are derived in the course of this work, including a novel formulation of an r-dimensional wavelet transform in terms of elementary tensor operations and an enhanced WaveCluster algorithm capable of clustering real-valued as well as binary data. Sparseness-exploiting properties are demonstrated and variations of core algorithms for specialized tasks such as image segmentation are presented.
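To illustrate the separable, mode-wise pattern behind an r-dimensional wavelet transform, here is a hedged sketch of a single-level orthonormal Haar step applied along every axis of an n-dimensional array. It is not the thesis's tensor formulation, and the volume shape is an arbitrary assumption.

```python
# Separable r-dimensional Haar step: the 1-D transform is applied along each
# mode in turn — the pattern the thesis expresses with tensor operations.
import numpy as np

def haar_1d(x, axis):
    """One level of the orthonormal Haar transform along `axis` (even length)."""
    x = np.moveaxis(x, axis, 0)
    avg = (x[0::2] + x[1::2]) / np.sqrt(2)   # low-pass (approximation)
    diff = (x[0::2] - x[1::2]) / np.sqrt(2)  # high-pass (detail)
    return np.moveaxis(np.concatenate([avg, diff]), 0, axis)

def haar_nd(x):
    """Apply the 1-D step along every mode of an r-dimensional array."""
    for axis in range(x.ndim):
        x = haar_1d(x, axis)
    return x

# e.g., a small (time, x, y) block from a functional imaging volume
vol = np.random.default_rng(0).normal(size=(8, 8, 8))
coeffs = haar_nd(vol)
# The orthonormal transform preserves energy:
print(np.allclose((vol ** 2).sum(), (coeffs ** 2).sum()))  # True
```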
