About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
21

A Graph Theoretic Clustering Algorithm based on the Regularity Lemma and Strategies to Exploit Clustering for Prediction

Trivedi, Shubhendu 30 April 2012 (has links)
That clustering is perhaps the most widely used technique for exploratory data analysis is itself a signal of its fundamental importance. The general problem statement that broadly describes clustering as the identification and classification of patterns into coherent groups also implicitly indicates its utility in other tasks such as supervised learning. In the past decade and a half, two developments have altered the landscape of research in clustering: one is the improved results obtained through the increased use of graph-theoretic techniques such as spectral clustering, and the other is the study of clustering with respect to its relevance in semi-supervised learning, i.e., using unlabeled data to improve prediction accuracy. This work attempts to contribute to both aspects. Our contributions are thus two-fold: first, we identify some general issues with the spectral clustering framework and, while working towards a solution, introduce a new algorithm we call "Regularity Clustering", which attempts to harness the power of the Szemerédi Regularity Lemma, a remarkable result from extremal graph theory, for the task of clustering. Secondly, we investigate some practical and useful strategies for using the clustering of unlabeled data to boost prediction accuracy. For all of these contributions we evaluate our methods against existing ones and also apply these ideas in a number of settings.
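For readers unfamiliar with the spectral clustering framework this abstract builds on, here is a minimal sketch of the standard pipeline (normalized Laplacian, eigenvector embedding, k-means). It illustrates the baseline only, not the Regularity Clustering algorithm; the toy graph and library choices are this note's assumptions.

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.cluster import KMeans

def spectral_clustering(A, k):
    """Cluster the nodes of a graph via normalized spectral clustering.

    A: (n, n) symmetric adjacency/affinity matrix; k: number of clusters."""
    # Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}
    L = laplacian(A, normed=True)
    # Eigenvectors of the k smallest eigenvalues carry the cluster structure
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]
    # Row-normalize the spectral embedding before running k-means
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)

# Toy example: two triangles joined by a single bridge edge
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
print(spectral_clustering(A, 2))  # e.g. [0 0 0 1 1 1] (label order may swap)
```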
22

Methods for pre-processing and mining large volumes of multidimensional data and complex networks

Appel, Ana Paula 27 May 2010 (has links)
Data mining is a computationally expensive process whose efficiency depends on data preprocessing. Data reduction techniques, notably data sampling, stand out in the preprocessing stage. Real data are characterized by non-uniform distributions, a large number of attributes, and the presence of noisy elements. For this kind of data, uniform sampling, in which every element has the same probability of being chosen, is inefficient. In recent years data have been undergoing transformations: not only has their volume increased significantly, but also the way they are represented. Data are usually divided into traditional data (numbers and short character strings) and complex data (images, DNA sequences, videos, etc.). However, a richer representation, in which not only the elements of a dataset but also the connections among them are represented, has come into wide use. This new kind of data, called a complex network, gave rise to a new research area called complex network mining or graph mining, since graphs are used to represent complex networks. This new area requires techniques for mining large complex networks, that is, networks with hundreds of thousands of elements (nodes) and connections (edges). This thesis explores the reduction of elements in so-called unbalanced datasets, i.e., datasets whose clusters or classes have very different sizes and which also have a large number of attributes and the presence of noise. It also explores complex network mining, with two main contributions: the extraction of new patterns and properties, together with efficient algorithms for classifying networks as real or synthetic; and the mining of cliques of sizes 4 and 5 using database management systems, uncovering interesting power laws and presenting an extension of the clustering coefficient.
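The clustering coefficient this thesis extends is a standard graph statistic; the sketch below computes the classical local version from an adjacency matrix so the baseline quantity is concrete. It is an illustration added by this edit, not the thesis's extended formula or its database-backed clique miner.

```python
import numpy as np

def local_clustering(A):
    """Classical local clustering coefficient C_i = 2*T_i / (d_i*(d_i - 1)),
    where T_i counts triangles through node i and d_i is its degree.
    A: (n, n) symmetric 0/1 adjacency matrix with zero diagonal."""
    deg = A.sum(axis=1)
    # diag(A^3)[i] counts closed walks of length 3 through i, i.e. 2 * T_i
    tri = np.diag(A @ A @ A) / 2.0
    with np.errstate(divide="ignore", invalid="ignore"):
        C = 2.0 * tri / (deg * (deg - 1))
    return np.nan_to_num(C)  # nodes of degree < 2 get C_i = 0

# Triangle plus a pendant node: nodes 0-2 form a triangle, node 3 hangs off 2
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(local_clustering(A))  # [1.0, 1.0, 0.333..., 0.0]
```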
23

Analysis of Current Flows in Electrical Networks for Error-Tolerant Graph Matching

Gutierrez Munoz, Alejandro 10 November 2008 (has links)
Chemical compounds, fingerprint databases, social networks, and interactions between websites all have one thing in common: they can be represented as graphs. The need to analyze, compare, and classify graph datasets has become more evident over the last decade. The graph isomorphism problem is known to belong to the class NP, and the subgraph isomorphism problem is known to be NP-complete. Several error-tolerant graph matching techniques have been developed during the last two decades to overcome the computational complexity associated with these problems. Some of these techniques rely on similarity measures based on the topology of the graphs; random walks and edit distance kernels are examples of such methods. In conjunction with learning algorithms like back-propagation neural networks, k-nearest neighbor, and support vector machines (SVM), these methods provide a way of classifying graphs based on a training set of labeled instances. This thesis presents a novel approach to error-tolerant graph matching based on current flow analysis. Analysis of current flow in electrical networks is a technique that uses the voltages and currents obtained through nodal analysis of a graph representing an electrical circuit, and it shares some interesting connections with the number of random walks along the graph. We propose an algorithm to calculate a similarity measure between two graphs based on the current flows along geodesics of the same degree. This similarity measure can be applied to large graph datasets, allowing them to be compared in a reasonable amount of time. This thesis investigates the classification potential of several data mining algorithms based on the information extracted from a graph dataset and represented as current flow vectors. We describe our operational prototype and evaluate its effectiveness on the NCI-HIV dataset.
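Nodal analysis of a graph treated as a resistor network reduces to solving a linear system in the graph Laplacian; the sketch below computes per-edge current flows for a unit current between two nodes. It shows only this underlying computation; the thesis's geodesic-based similarity measure and feature vectors are not reproduced here, and the unit-resistance assumption is this note's.

```python
import numpy as np

def edge_current_flows(A, s, t):
    """Current through each edge when 1 A is injected at node s and extracted
    at node t, treating every edge as a 1-ohm resistor.
    Solves the nodal-analysis system L v = b with the graph Laplacian L."""
    n = A.shape[0]
    L = np.diag(A.sum(axis=1)) - A
    b = np.zeros(n)
    b[s], b[t] = 1.0, -1.0
    # L is singular (constant nullspace); ground node t by removing its row/col
    keep = [i for i in range(n) if i != t]
    v = np.zeros(n)
    v[keep] = np.linalg.solve(L[np.ix_(keep, keep)], b[keep])
    # Ohm's law: current on edge (i, j) is (v_i - v_j) / 1 ohm
    return {(i, j): v[i] - v[j]
            for i in range(n) for j in range(i + 1, n) if A[i, j]}

# 4-cycle: two parallel paths from node 0 to node 2
A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], float)
print(edge_current_flows(A, 0, 2))  # 0.5 A on each path (sign = direction)
```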
24

Frequent subgraph mining with arc consistency

Douar, Brahim 27 November 2012 (has links)
With the important growth of the need to analyze large amounts of structured data such as chemical compounds, protein structures, and social networks, to cite but a few, frequent subgraph mining has become an attractive track and a real challenge in the data mining field. This is closely tied to the exponential number of such subgraphs as well as to the NP-completeness of the general subgraph isomorphism test, which makes frequent subgraph miners exponential in runtime and/or memory use. To alleviate this complexity and manage the huge search space, existing graph miners have explored heuristics based on the minimum support threshold, the description language of the examples (restricting patterns to paths, trees, etc.), or hypotheses (searching for shared subtrees, common paths, etc.). In this thesis, we build on a graph matching operator from the constraint programming field, named AC-projection, which has the merit of polynomial complexity, in contrast to the subgraph isomorphism operator, and we introduce graph mining approaches that improve on the existing ones for this problem.
We propose two frequent subgraph mining algorithms based on this operator. The first, named FGMAC, follows a breadth-first traversal of the search space and takes advantage of the well-known Apriori levelwise strategy. The second, AC-miner, is a pattern-growth approach that explores the search space depth-first and uses powerful pruning techniques to considerably reduce it, ensuring better scalability to large graphs. Both approaches extract a particular kind of pattern, the AC-reduced frequent subgraphs. As a first step, we prove theoretically that the search space for these subgraphs is smaller than that of frequent subgraphs up to isomorphism. We then carry out a series of experiments showing that FGMAC and AC-miner are more efficient than the state-of-the-art algorithms. At the same time, we show that AC-reduced frequent subgraphs, despite being considerably fewer, have the same discriminative power as frequent subgraphs up to isomorphism; this study is based on an experimental evaluation of the quality of AC-reduced frequent subgraphs in a supervised graph classification process.
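The Apriori levelwise strategy that FGMAC adopts is easiest to see on itemsets, where the frequency test is a plain subset check; a minimal sketch follows. The transaction data are invented for illustration, and FGMAC itself replaces the subset test with the polynomial AC-projection over graphs, which is not implemented here.

```python
from itertools import combinations

def levelwise_frequent(transactions, min_support):
    """Generic Apriori-style levelwise search, shown here over itemsets.
    FGMAC applies the same schema to subgraph patterns, substituting the
    subset test below with the polynomial AC-projection test."""
    def support(pattern):
        return sum(1 for t in transactions if pattern <= t)

    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in sorted(items)]
    frequent = []
    while level:
        current = [p for p in level if support(p) >= min_support]
        frequent.extend(current)
        freq_set = set(current)
        # Join step: size-(k+1) candidates from pairs of frequent size-k sets
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == len(a) + 1}
        # Prune step: keep candidates whose size-k subsets are all frequent
        level = [c for c in candidates
                 if all(frozenset(s) in freq_set
                        for s in combinations(c, len(c) - 1))]
    return frequent

db = [frozenset("abc"), frozenset("abd"), frozenset("abe"), frozenset("cd")]
print([set(p) for p in levelwise_frequent(db, min_support=2)])
# -> singletons a, b, c, d plus the pair {a, b}
```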
25

Automated Inclusive Design Heuristics Generation with Graph Mining

Sangelkar, Shraddha Chandrakant 16 December 2013 (has links)
Inclusive design is a concept intended to promote the development of products and environments equally usable by all users, irrespective of their age or ability. This research focuses on developing a method to derive heuristics for inclusive design. It applies the action-function diagram to model the interaction between a user and a product, a design difference classification to compare a typical product with its inclusive counterpart, graph theory to mathematically represent the comparison relations, and graph data mining to extract the design heuristics. The goal of this research is to formalize and automate the inclusive-design heuristics generation process. The rule generation allows statistical mining of design guidelines from existing inclusive products. Formalization results show that the rate of rule generation decreases as more products are added to the dataset. The automated method is particularly helpful in the developmental stages of graph mining applications for product design. The graph mining technique has the capability for graph grammar induction, which is extended here to automate the generation of engineering grammars. In general, graph mining can be applied to extract design heuristics from any discrete, relational design data that can be represented as graphs. Concept generation studies are conducted to validate the heuristics derived in this research for inclusive product design. In addition, an inclusivity rating is created and verified to evaluate the inclusiveness of the conceptual ideas. Finally, because appreciation and awareness of inclusive design are important in an engineering design course, a module is compiled to teach inclusive design methods in a capstone design course. The results of the exploratory study and validation show that the application of the representation scheme is problem-dependent. It cannot be stated with certainty at this point whether the representation scheme is helpful for designing consumer products where only activities related to the upper body are involved. However, self-reported feedback indicates that the teaching module is effective in increasing awareness of and confidence about inclusive design.
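To make the design-difference idea concrete, here is a heavily simplified sketch: each product's action-function diagram is reduced to a set of (action, function) edges, and comparing the typical and inclusive variants by set difference yields candidate replacement rules. Every name and edge below is hypothetical; the actual classification scheme and graph grammar induction in the dissertation are far richer.

```python
# Hypothetical action-function edges: (user_action, product_function) pairs
typical_product = {
    ("grasp_handle", "open_door"),
    ("twist_knob", "open_door"),
    ("pull_door", "move_door"),
}
inclusive_product = {
    ("grasp_handle", "open_door"),
    ("push_lever", "open_door"),  # lever replaces knob for limited dexterity
    ("pull_door", "move_door"),
}

# Design-difference classification as a set difference over graph edges:
# edges only in the typical product were removed; edges only in the
# inclusive product were added. Add/remove pairs that recur across many
# product pairs become candidate inclusive-design heuristics.
removed = typical_product - inclusive_product
added = inclusive_product - typical_product
for old, new in zip(sorted(removed), sorted(added)):
    print(f"heuristic candidate: replace {old} with {new}")
```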
26

Heavyweight Pattern Mining in Attributed Flow Graphs

Simoes Gomes, Carolina Unknown Date
No description available.
27

TiCTak: Target-Specific Centrality Manipulation on Large Networks

January 2016 (has links)
Measuring node centrality is a critical common denominator behind many important graph mining tasks. While the existing literature offers a wealth of different node centrality measures, it remains a daunting task to intervene on node centrality in a desired way. In this thesis, we study the problem of minimizing the centrality of one or more target nodes by edge operations. The heart of the proposed method is an accurate and efficient algorithm for estimating the impact of edge deletion on the spectrum of the underlying network, based on the observation that edge deletion is essentially a local, sparse perturbation of the original network. Extensive experiments are conducted on a diverse set of real networks to demonstrate the effectiveness, efficiency, and scalability of our approach. In particular, it is on average 260.95% better, in terms of minimizing eigen-centrality, than the standard matrix-perturbation-based algorithm, with lower time complexity. / Dissertation/Thesis / Masters Thesis Computer Science 2016
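The "local, sparse perturbation" observation has a textbook first-order form: for a symmetric adjacency matrix with unit leading eigenvector u, deleting edge (i, j) shifts the leading eigenvalue by roughly -2·u[i]·u[j]. The sketch below ranks a target's incident edges by this estimate; it is a generic perturbation baseline under that assumption, not the thesis's refined algorithm.

```python
import numpy as np

def best_edge_to_delete(A, target):
    """Rank edges incident to `target` by the estimated drop in the leading
    eigenvalue (and hence in eigen-centrality) if the edge were deleted.

    First-order perturbation for symmetric A with unit leading eigenvector u:
    deleting edge (i, j) shifts the leading eigenvalue by about -2*u[i]*u[j]."""
    _, eigvecs = np.linalg.eigh(A)
    u = np.abs(eigvecs[:, -1])  # Perron vector of a connected graph is positive
    scores = {}
    for j in range(A.shape[0]):
        if A[target, j]:
            scores[(target, j)] = 2.0 * u[target] * u[j]  # estimated drop
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Star with an extra chord: hub 0 connected to 1..3, plus edge (1, 2)
A = np.zeros((4, 4))
for i, j in [(0, 1), (0, 2), (0, 3), (1, 2)]:
    A[i, j] = A[j, i] = 1.0
print(best_edge_to_delete(A, 0))  # edges to better-connected neighbors rank first
```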
29

Novel computational methods to predict drug–target interactions using graph mining and machine learning approaches

Olayan, Rawan S. 12 1900 (has links)
Computational drug repurposing aims at finding new medical uses for existing drugs, and the identification of novel drug-target interactions (DTIs) can be a useful part of such a task. Computational determination of DTIs is a convenient strategy for the systematic screening of a large number of drugs, in an attempt to identify new DTIs at low cost and with reasonable accuracy. This necessitates accurate computational methods that can help focus follow-up experimental validation on a smaller number of highly likely targets for a drug. Although many methods have been proposed for computational DTI prediction, they either suffer from high false-positive prediction rates or do not predict the effect that drugs exert on targets. In this report, we first present a comprehensive review of recent progress in the field of DTI prediction from data-centric and algorithm-centric perspectives, with the aim of helping to construct more reliable methods. Then, we present DDR, an efficient method to predict the existence of DTIs. DDR achieves significantly more accurate results compared to the other state-of-the-art methods. Supported by independent evidence, we verified 22 out of the top 25 DDR DTI predictions as correct, a validation that demonstrates the practical utility of DDR as an efficient method for identifying correct DTIs. Finally, we present DDR-FE, a method that predicts the effect types of a drug on its target. On different representative datasets, under various test setups, and using different performance measures, we show that DDR-FE achieves extremely good performance. Using blind test data, we verified 2,300 out of 3,076 DTI effects predicted by DDR-FE as correct. This suggests that DDR-FE can be used as an efficient method to identify the correct effects of a drug on its target.
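As context for the DTI prediction task, here is a minimal similarity-propagation baseline over a bipartite interaction matrix: a candidate (drug, target) pair is scored by a similarity-weighted vote of drugs already known to hit that target. All data below are toy values, and this is not the DDR method, which combines many heterogeneous graph-derived features in a learned model.

```python
import numpy as np

def score_dti(Y, S):
    """Similarity-propagation baseline for drug-target interaction prediction.

    Y: (n_drugs, n_targets) 0/1 known-interaction matrix.
    S: (n_drugs, n_drugs) drug-drug similarity matrix (e.g. from fingerprints).
    Score of (d, t) = similarity-weighted vote of drugs known to hit t."""
    row_norm = np.maximum(S.sum(axis=1, keepdims=True), 1e-12)
    return (S / row_norm) @ Y

# Toy data: 3 drugs, 2 targets; drug 2 is chemically similar to drug 0
Y = np.array([[1, 0],
              [0, 1],
              [0, 0]], dtype=float)
S = np.array([[1.0, 0.1, 0.9],
              [0.1, 1.0, 0.2],
              [0.9, 0.2, 1.0]])
print(score_dti(Y, S)[2])  # drug 2 scores higher on target 0 than on target 1
```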
30

Using Machine Learning and Graph Mining Approaches to Improve Software Requirements Quality: An Empirical Investigation

Singh, Maninder January 2019 (has links)
Software development is prone to software faults due to the involvement of multiple stakeholders, especially during the fuzzy phases (requirements and design). Software inspections are commonly used in industry to detect and fix problems in requirements and design artifacts, thereby mitigating fault propagation to later phases, where the same faults are harder to find and fix. The output of an inspection process is a list of faults present in the software requirements specification document (SRS). The artifact author must manually read through the reviews and differentiate between true faults and false positives before fixing the faults. The first goal of this research is to automate the detection of useful vs. non-useful reviews. Next, post-inspection, the requirements author has to manually extract key problematic topics from useful reviews and map them to individual requirements in the SRS to identify fault-prone requirements. The second goal of this research is to automate this mapping by employing key phrase extraction (KPE) algorithms and semantic analysis (SA) approaches to identify fault-prone requirements. During fault fixation, the author has to manually verify the requirements that could have been impacted by a fix. The third goal of this research is to assist authors post-inspection in handling change impact analysis (CIA) during fault fixation, using natural language processing with semantic analysis and mining solutions from graph theory. The selection of quality inspectors is pertinent to carrying out post-inspection tasks accurately, so the fourth goal of this research is to identify skilled inspectors using various classification and feature selection approaches. The dissertation has led to the development of an automated solution that can identify useful reviews, help identify skilled inspectors, extract the most prominent topics/key phrases from fault logs, and help the requirements author with fault fixation post-inspection.
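The first goal, separating useful from non-useful reviews, is a standard short-text classification task; a minimal sketch under that framing follows. The training snippets and the TF-IDF-plus-logistic-regression pipeline are illustrative assumptions, not the dissertation's actual feature set or model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set; a real study would use labeled inspection logs
reviews = [
    "Requirement 4 contradicts requirement 7 on the timeout value",
    "The precondition for login is never specified",
    "Looks good to me",
    "Nice document formatting",
]
labels = [1, 1, 0, 0]  # 1 = useful (points at a true fault), 0 = non-useful

# Bag-of-words TF-IDF features feeding a linear classifier
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(reviews, labels)

print(clf.predict(["The error handling requirement is ambiguous"]))
# likely [1]: flagged as a useful review on this toy training set
```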
