Global ETD Search

11	Multi-layered HITS on Multi-sourced Networks January 2018 (has links) abstract: Network mining has been attracting a lot of research attention because of the prevalence of networks. As the world is becoming increasingly connected and correlated, networks arising from inter-dependent application domains are often collected from different sources, forming the so-called multi-sourced networks. Examples of such multi-sourced networks include critical infrastructure networks, multi-platform social networks, cross-domain collaboration networks, and many more. Compared with single-sourced network, multi-sourced networks bear more complex structures and therefore could potentially contain more valuable information. This thesis proposes a multi-layered HITS (Hyperlink-Induced Topic Search) algorithm to perform the ranking task on multi-sourced networks. Specifically, each node in the network receives an authority score and a hub score for evaluating the value of the node itself and the value of its outgoing links respectively. Based on a recent multi-layered network model, which allows more flexible dependency structure across different sources (i.e., layers), the proposed algorithm leverages both within-layer smoothness and cross-layer consistency. This essentially allows nodes from different layers to be ranked accordingly. The multi-layered HITS is formulated as a regularized optimization problem with non-negative constraint and solved by an iterative update process. Extensive experimental evaluations demonstrate the effectiveness and explainability of the proposed algorithm. / Dissertation/Thesis / Masters Thesis Computer Science 2018 Computer science Graph Mining Multi-layered Networks Ranking
12	Fast and scalable triangle counting in graph streams: the hybrid approach Singh, Paramvir 14 December 2020 (has links) Triangle counting is a major graph problem with several applications in social network analysis, anomaly detection, etc. A considerable amount of work has contributed to approximately computing the global triangle counts using several computational models. One of the most popular streaming models considered is Edge Streaming in which the edges arrive in the form of a graph stream. We categorize the existing literature into two categories: Fixed Memory (FM) approach, and Fixed Probability (FP) approach. As the size of the graphs grows, several challenges arise such as memory space limitations, and prohibitively long running time. Therefore, both FM and FP categories exhibit some limitations. FP algorithms fail to scale for massive graphs. We identified a limitation of FM category $i.e.$ FM algorithms have higher computational time than their FP variants. In this work, we present a new category called the Hybrid approach that overcomes the limitations of both FM and FP approaches. We present two new algorithms that belong to the hybrid category: Neighbourhood Hybrid Multisampling (NHMS) and Triest/ThinkD Hybrid Sampling (THS) for estimating the number of global triangles in graphs. These algorithms are highly scalable and have better running time than FM and FP variants. We experimentally show that both NHMS and THS outperform state-of-the-art algorithms in space-efficient environments. / Graduate Graph mining Triangle counting approximation algorithms Edge Streaming
13	Enumerating k-cliques in a large network using Apache Spark Dheekonda, Raja Sekhar Rao January 2017 (has links) Indiana University-Purdue University Indianapolis (IUPUI) / Network analysis is an important research task which explains the relationships among various entities in a given domain. Most of the existing approaches of network analysis compute global properties of a network, such as transitivity, diameter, and all-pair shortest paths. They also study various non-random properties of a network, such as graph densifi cation with shrinking diameter, small diameter, and scale-freeness. Such approaches enable us to understand real-life networks with global properties. However, the discovery of the local topological building blocks within a network is an important task, and examples include clique enumeration, graphlet counting, and motif counting. In this paper, my focus is to fi nd an efficient solution of k-clique enumeration problem. A clique is a small, connected, and complete induced subgraph over a large network. However, enumerating cliques using sequential technologies is very time-consuming. Another promising direction that is being adopted is a solution that runs on distributed clusters of machines using the Hadoop mapreduce framework. However, the solution suffers from a general limitation of the framework, as Hadoop's mapreduce performs substantial amounts of reading and writing to disk. Thus, the running times of Hadoop-based approaches suffer enormously. To avoid these problems, we propose an e cient, scalable, and distributed solution, kc-spark , for enumerating cliques in real-life networks using the Apache Spark in-memory cluster computing framework. Experiment results show that kc-spark can enumerate k-cliques from very large real-life networks, whereas a single commodity machine cannot produce the same desired result in a feasible amount of time. We also compared kc-spark with Hadoop mapreduce solutions and found the algorithm to be 80-100 percent faster in terms of running times. On the other hand, we compared with the triangle enumeration with Hadoop mapreduce and results shown that kc-spark is 8-10 times faster than mapreduce implementation with the same cluster setup. Furthermore, the overall performance of kc-spark is improved by using Spark's inbuilt caching and broadcast transformations. k-clique Apache Spark Enumeration Distributed Graph Mining
14	Towards large-scale network analytics Yang, Xintian 27 August 2012 (has links) No description available. Computer Science Data summarization Large-scale graph mining Visualization
15	Graph Mining Algorithms for Memory Leak Diagnosis and Biological Database Clustering Maxwell, Evan Kyle 29 July 2010 (has links) Large graph-based datasets are common to many applications because of the additional structure provided to data by graphs. Patterns extracted from graphs must adhere to these structural properties, making them a more complex class of patterns to identify. The role of graph mining is to efficiently extract these patterns and quantify their significance. In this thesis, we focus on two application domains and demonstrate the design of graph mining algorithms in these domains. First, we investigate the use of graph grammar mining as a tool for diagnosing potential memory leaks from Java heap dumps. Memory leaks occur when memory that is no longer in use fails to be reclaimed, resulting in significant slowdowns, exhaustion of available storage, and eventually application crashes. Analyzing the heap dump of a program is a common strategy used in memory leak diagnosis, but our work is the first to employ a graph mining approach to the problem. Memory leaks accumulate in the heap as classes of subgraphs and the allocation paths from which they emanate can be explored to contextualize the leak source. We show that it suffices to mine the dominator tree of the heap dump, which is significantly smaller than the underlying graph. We demonstrate several synthetic as well as real-world examples of heap dumps for which our approach provides more insight into the problem than state-of-the-art tools such as Eclipse's MAT. Second, we study the problem of multipartite graph clustering as an approach to database summarization on an integrated biological database. Construction of such databases has become a common theme in biological research, where heterogeneous data is consolidated into a single, centralized repository that provides a structured forum for data analysis. We present an efficient approximation algorithm for identifying clusters that form multipartite cliques spanning multiple database tables. We show that our algorithm computes a lossless compression of the database by summarizing it into a reduced set of biologically meaningful clusters. Our algorithm is applied to data from C. elegans, but we note its applicability to general relational databases. / Master of Science graph mining graph clustering multipartite cliques memory leak detection bioinformatics
16	Unstable Communities in Network Ensembles Rahman, Md Ahsanur 07 January 2016 (has links) Ensembles of graphs arise naturally in many applications, for example, the temporal evolution of social contacts or computer communications, tissue-specific protein interaction networks, annual citation or co-authorship networks in a field, or a family of high-likelihood Bayesian networks inferred from systems biology data. Several techniques have been developed to analyze such ensembles. A canonical problem is that of computing communities that are persistent across the ensemble. This problem is usually formulated as one of computing dense subgraphs (communities) that are frequent, i.e., appear in many graphs in the ensemble. In this thesis, we seek to find "unstable communities" which are the antithesis of frequent, dense subgraphs. Informally, an unstable community is a set of nodes that induces highly-varying subgraphs in the ensemble. In other words, the graphs in the ensemble disagree about the precise pairwise connections among these nodes. The primary contribution of this dissertation is to introduce the concept of unstable communities as a novel problem in the field of graph mining. Specifically, it presents three approaches to mathematically formulate the concept of unstable communities, devises algorithms for computing such communities in a given ensemble of networks, and shows the usefulness of this concept in a variety of settings. Our first definition of unstable community relies on two parameters: the first ensures that a node set induces several different subgraphs in the ensemble and the second guarantees that each of these subgraphs occurs in a large number of graphs in the ensemble. We present two algorithms to enumerate unstable communities that match this definition. The first approach, ClustMiner, is a heuristic that transforms the problem into one of computing dense subgraphs in a single graph that summarizes the ensemble. The second approach, UCMiner, is guaranteed to enumerate all maximal unstable communities correctly. We apply both approaches to systems biology datasets to demonstrate that UCMiner is superior to ClustMiner in the sense that ClustMiner's output contains node sets that are not unstable while also missing several communities computed by UCMiner. We find several node sets that capture the uncertain connectivity of genes in relevant protein complexes, suggesting that further experiments may be required to precisely discern their interaction patterns. Our second and third definitions of unstable community rely on a novel concept of (scaled) subgraph divergence, a formulation that uses the concept of relative entropy to measure the instability of a community. We propose another algorithm, SDMiner, that can exactly enumerate all maximal unstable communities with small (scaled) subgraph divergence. We perform extensive experiments on social network datasets to show that we can discover UCs that capture the main structural variations of the given set of networks and also provide us with interesting and relevant insights about these datasets. / Ph. D. Computational Biology Graph Mining Network Theory Unstable Community Hypergraph Hyperedge
17	Identifying Splicing Regulatory Elements with de Bruijn Graphs Badr, Eman 12 May 2015 (has links) Splicing regulatory elements (SREs) are short, degenerate sequences on pre-mRNA molecules that enhance or inhibit the splicing process via the binding of splicing factors, proteins that regulate the functioning of the spliceosome. Existing methods for identifying SREs in a genome are either experimental or computational. This work tackles the limitations in the current approaches for identifying SREs. It addresses two major computational problems, identifying variable length SREs utilizing a graph-based model with de Bruijn graphs and discovering co-occurring sets of SREs (combinatorial SREs) utilizing graph mining techniques. In addition, I studied and analyzed the effect of alternative splicing on tissue specificity in human. First, I have used a formalism based on de Bruijn graphs that combines genomic structure, word count enrichment analysis, and experimental evidence to identify SREs found in exons. In my approach, SREs are not restricted to a fixed length (i.e., k-mers, for a fixed k). Consequently, the predicted SREs are of different lengths. I identified 2001 putative exonic enhancers and 3080 putative exonic silencers for human genes, with lengths varying from 6 to 15 nucleotides. Many of the predicted SREs overlap with experimentally verified binding sites. My model provides a novel method to predict variable length putative regulatory elements computationally for further experimental investigation. Second, I developed CoSREM (Combinatorial SRE Miner), a graph mining algorithm for discovering combinatorial SREs. The goal is to identify sets of exonic splicing regulatory elements whether they are enhancers or silencers. Experimental evidence is incorporated through my graph-based model to increase the accuracy of the results. The identified SREs do not have a predefined length, and the algorithm is not limited to identifying only SRE pairs as are current approaches. I identified 37 SRE sets that include both enhancer and silencer elements in human genes. These results intersect with previous results, including some that are experimental. I also show that the SRE set GGGAGG and GAGGAC identified by CoSREM may play a role in exon skipping events in several tumor samples. Further, I report a genome-wide analysis to study alternative splicing on multiple human tissues, including brain, heart, liver, and muscle. I developed a pipeline to identify tissue-specific exons and hence tissue-specific SREs. Utilizing the publicly available RNA-Seq data set from the Human BodyMap project, I identified 28,100 tissue-specific exons across the four tissues. I identified 1929 exonic splicing enhancers with 99% overlap with previously published experimental and computational databases. A complicated enhancer regulatory network was revealed, where multiple enhancers were found across multiple tissues while some were found only in specific tissues. Putative combinatorial exonic enhancers and silencers were discovered as well, which may be responsible for exon inclusion or exclusion across tissues. Some of the enhancers are found to be co-occurring with multiple silencers and vice versa, which demonstrates a complicated relationship between tissue-specific enhancers and silencers. / Ph. D. Alternative splicing de Bruijn graphs algorithms graph mining splicing regulatory elements
18	Effective Methods of Semantic Analysis in Spatial Contexts Dos Santos, Raimundo Fonseca Jr. 01 August 2014 (has links) With the growing spread of spatial data, exploratory analysis has gained a considerable amount of attention. Particularly in the fields of Information Retrieval and Data Mining, the integration of data points helps uncover interesting patterns not always visible to the naked eye. Social networks often link entities that share places and activities; marketing tools target users based on behavior and preferences; and medical technology combines symptoms to categorize diseases. Many of the current approaches in this field of research depend on semantic analysis, which is good for inferencing and decision making. From a functional point of view, objects can be investigated from a spatial and temporal perspectives. The former attempts to verify how proximity makes the objects related; the latter adds a measure of coherence by enforcing time ordering. This type of spatio-temporal reasoning examines several aspects of semantic analysis and their characteristics: shared relationships among objects, matches versus mismatches of values, distances among parents and children, and bruteforce comparison of attributes. Most of these approaches suffer from the pitfalls of disparate data, often missing true relationships, failing to deal with inexact vocabularies, ignoring missing values, and poorly handling multiple attributes. In addition, the vast majority does not consider the spatio-temporal aspects of the data. This research studies semantic techniques of data analysis in spatial contexts. The proposed solutions represent different methods on how to relate spatial entities or sequences of entities. They are able to identify relationships that are not explicitly written down. Major contributions of this research include (1) a framework that computes a numerical entity similarity, denoted a semantic footprint, composed of spatial, dimensional, and ontological facets; (2) a semantic approach that translates categorical data into a numerical score, which permits ranking and ordering; (3) an extensive study of GML as a representative spatial structure of how semantic analysis methods are influenced by its approaches to storage, querying, and parsing; (4) a method to find spatial regions of high entity density based on a clustering coefficient; (5) a ranking strategy based on connectivity strength which differentiates important relationships from less relevant ones; (6) a distance measure between entity sequences that quantifies the most related streams of information; (7) three distance-based measures (one probabilistic, one based on spatial influence, and one that is spatiological) that quantifies the interactions among entities and events; (8) a spatio-temporal method to compute the coherence of a data sequence. / Ph. D. spatial data social networks graph mining semantic analysis
19	Proximity based association rules for spatial data mining in genomes Saha, Surya 08 August 2009 (has links) Our knowledge discovery algorithm employs a combination of association rule mining and graph mining to identify frequent spatial proximity relationships in genomic data where the data is viewed as a one-dimensional space. We apply mining techniques and metrics from association rule mining to identify frequently co-occurring features in genomes followed by graph mining to extract sets of co-occurring features. Using a case study of ab initio repeat finding, we have shown that our algorithm, ProxMiner, can be successfully applied to identify weakly conserved patterns among features in genomic data. The application of pairwise spatial relationships increases the sensitivity of our algorithm while the use of a confidence threshold based on false discovery rate reduces the noise in our results. Unlike available defragmentation algorithms, ProxMiner discovers associations among ab initio repeat families to identify larger more complete repeat families. ProxMiner will increase the effectiveness of repeat discovery techniques for newly sequenced genomes where ab initio repeat finders are only able to identify partial repeat families. In this dissertation, we provide two detailed examples of ProxMiner-discovered novel repeat families and one example of a known rice repeat family that has been extended by ProxMiner. These examples encompass some of the different types of repeat families that can be discovered by our algorithm. We have also discovered many other potentially interesting novel repeat families that can be further studied by biologists. association rule mining spatial rules repeat defragmentation graph mining association rule mining spatial rules repeat defragmentation graph mining novel repeat regions DNA
20	Mining Metabolic Networks and Biomedical Literature Cakmak, Ali January 2009 (has links) No description available. Bioinformatics Computer Science pathway inference text mining gene ontology metabolic networks metabolomics graph mining taxonomy-superimposed graph mining data mining pathway functionality template

Search results