61 |
Probabilistic inference in ecological networks : graph discovery, community detection and modelling dynamic sociality. Psorakis, Ioannis. January 2013.
This thesis proposes a collection of analytical and computational methods for inferring an underlying social structure of a given population, observed only via timestamped occurrences of its members across a range of locations. It shows that such data streams have a modular and temporally-focused structure, neither fully ordered nor completely random, with individuals appearing in "gathering events". By exploiting such structure, the thesis proposes an appropriate mapping of those spatio-temporal data streams to a social network, based on the co-occurrences of agents across gathering events, while capturing the uncertainty over social ties via the use of probability distributions. Given the extracted graphs mentioned above, an approach is proposed for studying their community organisation. The method considers communities as explanatory variables for the observed interactions, producing overlapping partitions and node membership scores for groups. The aforementioned models are motivated by a large ongoing experiment at Wytham Woods, Oxford, where a population of Parus major wild birds is tagged with RFID devices and a grid of feeding locations generates thousands of spatio-temporal records each year. The proposed methods are applied to this data set to demonstrate how they can be used to explore wild bird sociality, reveal its internal organisation across a variety of different scales and provide insights into important biological processes relating to mating pair formation.
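The core mapping, from a stream of timestamped detections to a weighted co-occurrence network, can be illustrated with a minimal sketch. It uses a hypothetical record format and a fixed time-gap rule to split detections into gathering events, rather than the probabilistic machinery developed in the thesis:

```python
# Minimal sketch: timestamped detections -> gathering events -> co-occurrence graph.
# The record format, the fixed MAX_GAP rule and the integer edge weights are
# simplifying assumptions; the thesis infers events and tie strengths probabilistically.
from collections import defaultdict
from itertools import combinations
import networkx as nx

# (bird_id, location_id, timestamp_in_minutes) -- illustrative records only
records = [
    ("b1", "feeder_A", 10), ("b2", "feeder_A", 12), ("b3", "feeder_A", 14),
    ("b1", "feeder_A", 95), ("b4", "feeder_A", 97),
    ("b2", "feeder_B", 40), ("b3", "feeder_B", 43),
]

MAX_GAP = 15  # minutes of inactivity that closes a gathering event (assumed)

# Group detections by location, keeping each location's stream time-ordered.
by_location = defaultdict(list)
for bird, loc, t in sorted(records, key=lambda r: (r[1], r[2])):
    by_location[loc].append((bird, t))

# Split each location's stream into gathering events at long silences.
events = []
for loc, stream in by_location.items():
    current, last_t = {stream[0][0]}, stream[0][1]
    for bird, t in stream[1:]:
        if t - last_t > MAX_GAP:      # long silence: the previous gathering ended
            events.append(current)
            current = set()
        current.add(bird)
        last_t = t
    events.append(current)

# Co-occurrence graph: edge weight = number of shared gathering events.
G = nx.Graph()
for event in events:
    for u, v in combinations(sorted(event), 2):
        if G.has_edge(u, v):
            G[u][v]["weight"] += 1
        else:
            G.add_edge(u, v, weight=1)

print(sorted(G.edges(data="weight")))
```

Overlapping communities and membership scores would then be inferred on top of this weighted graph.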
|
62 |
Data Mining Meets HCI: Making Sense of Large Graphs. Chau, Duen Horng. 01 July 2012.
We have entered the age of big data. Massive datasets are now common in science, government and enterprises. Yet, making sense of these data remains a fundamental challenge. Where do we start our analysis? Where to go next? How to visualize our findings?
We answer these questions by bridging Data Mining and Human-Computer Interaction (HCI) to create tools for making sense of graphs with billions of nodes and edges, focusing on:
(1) Attention Routing: we introduce this idea, based on anomaly detection, which automatically draws people’s attention to interesting areas of the graph to start their analyses. Examples include Polonium, which unearths malware from 37 billion machine-file relationships, and NetProbe, which fingers bad guys who commit auction fraud (a toy sketch of this idea follows the list below).
(2) Mixed-Initiative Sensemaking: we present two examples that combine machine inference and visualization to help users locate next areas of interest: Apolo guides users in exploring large graphs by learning from a few examples of user interest; Graphite finds interesting subgraphs, based only on fuzzy descriptions drawn graphically.
(3) Scaling Up: we show how to enable interactive analytics of large graphs by leveraging Hadoop, staging of operations, and approximate computation.
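As a rough illustration of the attention-routing idea in (1), a toy version can rank nodes by a simple anomaly heuristic and surface the top few as suggested starting points for analysis. This is a hedged sketch of the concept only, not the Polonium or NetProbe algorithms, which rely on belief propagation over very large graphs:

```python
# Toy "attention routing": score nodes so an analyst knows where to start looking.
# Heuristic only -- real systems such as Polonium use belief propagation at scale.
import networkx as nx

G = nx.karate_club_graph()  # stand-in for a large real-world graph

clustering = nx.clustering(G)
scores = {}
for node in G:
    degree = G.degree(node)
    # High degree combined with low clustering is a crude "suspicious hub" signal:
    # the node touches many neighbours that do not know each other.
    scores[node] = degree * (1.0 - clustering[node])

top = sorted(scores, key=scores.get, reverse=True)[:5]
print("Suggested starting points:", top)
```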
This thesis contributes to data mining, HCI, and importantly their intersection, including: interactive systems and algorithms that scale; theories that unify graph mining approaches; and paradigms that overcome fundamental challenges in visual analytics.
Our work is making an impact on academia and society: Polonium protects 120 million people worldwide from malware; NetProbe made headlines on CNN, WSJ and USA Today; Pegasus won an open-source software award; Apolo helps DARPA detect insider threats and prevent exfiltration.
We hope our Big Data Mantra “Machine for Attention Routing, Human for Interaction” will inspire more innovations at the crossroads of data mining and HCI.
|
63 |
Graph mining for object tracking in videos / Fouille de graphes pour le suivi d’objets dans les vidéos. Diot, Fabien. 03 June 2014.
Detecting and following the main objects of a video is a necessary step towards describing its content, for example to allow relevant indexing of multimedia data by search engines. Current object tracking approaches either require the user to select the targets to follow, or rely on pre-trained classifiers to detect particular classes of objects such as pedestrians or cars. Since those methods rely on user intervention or prior knowledge of the content to process, they cannot be applied automatically to amateur videos such as the ones found on YouTube.
To solve this problem, we build upon the hypothesis that, in videos with a moving background, the main objects should appear more frequently than the background. Moreover, in a video, the topology of the visual elements composing an object is assumed to be consistent from one frame to another. We represent each image of a video with a plane graph modeling its topology. Then, we search for substructures appearing frequently in the database of plane graphs thus created to represent each video. This approach lets us detect and track the main objects of a video in an unsupervised manner, based only on the frequency of the patterns. Our contributions cover both the fields of graph mining and object tracking. In the first field, our first contribution is an efficient plane graph mining algorithm, named PLAGRAM. This algorithm exploits the planarity of the graphs and a new strategy for extending the patterns. The next contributions introduce spatio-temporal constraints into the mining process to exploit the fact that, in a video, objects move little from one frame to the next. Thus, we constrain the occurrences of a same pattern to be close in space and time by limiting the number of frames and the spatial distance separating them. We present two new algorithms: DYPLAGRAM, which uses the temporal constraint to limit the number of extracted patterns, and DYPLAGRAM_ST, which efficiently mines frequent spatio-temporal patterns from the datasets representing the videos. In the field of object tracking, our contributions consist of two approaches using the spatio-temporal patterns to track the main objects in videos. The first is based on a search for the shortest path in a graph connecting the spatio-temporal patterns, while the second uses a clustering approach to group them in order to follow the objects for a longer period of time. We also present two industrial applications of our method.
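The spatio-temporal constraint can be illustrated with a small sketch. The occurrence format and the two thresholds are assumptions for illustration; this is not the actual DYPLAGRAM_ST implementation:

```python
# Sketch of the spatio-temporal constraint: two occurrences of the same pattern
# are linked only if they are close both in time (frame gap) and in space
# (centroid distance). Illustrative values only.
import math

# (frame_index, centroid_x, centroid_y) for occurrences of one pattern (assumed format)
occurrences = [(0, 10.0, 12.0), (1, 11.5, 12.5), (2, 30.0, 40.0), (3, 12.0, 13.0)]

MAX_FRAME_GAP = 2    # assumed temporal threshold (frames)
MAX_DISTANCE = 5.0   # assumed spatial threshold (pixels)

def compatible(a, b):
    """True if two occurrences satisfy the spatio-temporal constraint."""
    frame_gap = abs(a[0] - b[0])
    distance = math.hypot(a[1] - b[1], a[2] - b[2])
    return 0 < frame_gap <= MAX_FRAME_GAP and distance <= MAX_DISTANCE

# Links that a tracker (shortest path or clustering) could then follow.
links = [(i, j) for i, a in enumerate(occurrences)
                for j, b in enumerate(occurrences)
                if i < j and compatible(a, b)]
print(links)  # [(0, 1), (1, 3)]
```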
|
64 |
Go with the flow : A study exploring public transit performance using a flow network model. Boman, Axel; Nilsson, Erik. January 2020.
In contrast to public transit agencies' well-developed data generation capabilities, their utilization of that data is often overlooked. This study taps into the potential of using the GTFS data format from an agency stakeholder perspective to assess transit performance. This format holds data for scheduled transit services, including real-time updates and network organization. The broad adoption of GTFS by transit agencies (1240 transit networks in 672 locations worldwide) has made it a de facto standard, making products built on top of it inherently scalable and potentially deployable in networks all over the world. The purpose of this thesis is two-fold: firstly, to explore how specific vulnerability features of nodes in a public transit network can be assessed using graph mining algorithms; secondly, to develop a pipeline for aggregating GTFS data and fitting it into a flow network model. The results include a data-driven framework for vulnerability characterization, a method for fitting GTFS data into a flow network model, and lastly, a definition of reduced flow capacity in a public transit context. Additionally, the results are presented in the setting of Uppsala's network (UL) and visualized with a web-based tool.
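A minimal sketch of the aggregation step follows. It reads a standard GTFS stop_times.txt file (trip_id, stop_id and stop_sequence are fields defined by the GTFS specification); everything beyond that, such as treating scheduled trip counts as edge capacities and using betweenness as a vulnerability proxy, is an illustrative assumption rather than the thesis's pipeline:

```python
# Sketch: aggregate a GTFS stop_times.txt into a directed, weighted stop graph.
# The "capacity" semantics and the centrality-based vulnerability proxy are
# assumptions for illustration.
import csv
from collections import defaultdict
import networkx as nx

trips = defaultdict(list)
with open("stop_times.txt", newline="") as f:
    for row in csv.DictReader(f):
        trips[row["trip_id"]].append((int(row["stop_sequence"]), row["stop_id"]))

edge_counts = defaultdict(int)
for trip_id, stops in trips.items():
    stops.sort()  # order stops by stop_sequence within the trip
    for (_, a), (_, b) in zip(stops, stops[1:]):
        edge_counts[(a, b)] += 1  # one more scheduled trip over this segment

G = nx.DiGraph()
for (a, b), n_trips in edge_counts.items():
    G.add_edge(a, b, capacity=n_trips)  # capacity = scheduled trips per feed period

# A simple vulnerability proxy: stops that lie on many shortest paths.
centrality = nx.betweenness_centrality(G)
print(sorted(centrality, key=centrality.get, reverse=True)[:10])
```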
|
65 |
Efficient and Scalable Subgraph Statistics using Regenerative Markov Chain Monte Carlo. Mayank Kakodkar (12463929). 26 April 2022.
In recent years there has been a growing interest in data mining and graph machine learning for techniques that can obtain frequencies of k-node Connected Induced Subgraphs (k-CIS) contained in large real-world graphs. While recent work has shown that 5-CISs can be counted exactly, no exact polynomial-time algorithms are known that solve this task for k > 5. In the past, sampling-based algorithms that work well in moderately-sized graphs for k ≤ 8 have been proposed. In this thesis I push this boundary up to k ≤ 16 for graphs containing up to 120M edges, and to k ≤ 25 for smaller graphs containing between a million and 20M edges. I do so by re-imagining two older, but elegant and memory-efficient algorithms -- FANMOD and PSRW -- which have large estimation errors by modern standards. This is because FANMOD produces highly correlated k-CIS samples, and the cost of sampling the PSRW Markov chain becomes prohibitively expensive for k > 8.
In this thesis, I introduce:
(a) RTS: a novel regenerative Markov chain Monte Carlo (MCMC) sampling procedure on the tree generated on-the-fly by the FANMOD algorithm. RTS is able to run on multiple cores and multiple machines (embarrassingly parallel) and to compute confidence intervals of estimates, all while preserving the memory-efficient nature of FANMOD. RTS is thus able to estimate subgraph statistics for k ≤ 16 for larger graphs containing up to 120M edges, and for k ≤ 25 for smaller graphs containing between a million and 20M edges.
(b) R-PSRW: which scales the PSRW algorithm to larger CIS sizes using a rejection sampling procedure to efficiently sample transitions from the PSRW Markov chain. R-PSRW matches RTS in terms of scaling to larger CIS sizes.
(c) Ripple: which achieves unprecedented scalability by stratifying the R-PSRW Markov chain state space into ordered strata via a new technique that I call sequential stratified regeneration. I show that the Ripple estimator is consistent, highly parallelizable, and scales well. Ripple is able to count CISs of size up to k ≤ 12 in real-world graphs containing up to 120M edges.
My empirical results show that the proposed methods offer a considerable improvement over the state of the art. Moreover, my methods are able to run at a scale that has been considered unreachable until now, not only by prior MCMC-based methods but also by other sampling approaches.
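The regenerative idea underlying RTS and Ripple can be shown in miniature: each time a Markov chain returns to a designated regeneration state, the trajectory splits into independent, identically distributed tours, so a ratio of per-tour sums gives a consistent estimate together with a tour-based confidence interval. The sketch below applies this to a plain random walk on a small graph; it is a generic illustration of regenerative estimation under these assumptions, not the RTS, R-PSRW or Ripple estimators:

```python
# Generic regenerative-MCMC illustration: a random walk on an undirected graph
# regenerates whenever it returns to a fixed node. Tours between returns are
# i.i.d., so E_pi[f] = E[tour sum of f] / E[tour length] is estimated by a ratio
# of tour totals, with a standard error computed across tours.
import math
import random
import networkx as nx

random.seed(0)
G = nx.karate_club_graph()           # small stand-in graph

def f(v):
    # Quantity whose stationary (degree-weighted) mean we estimate.
    return 1.0 if G.degree(v) >= 10 else 0.0

start = 0                            # regeneration state
n_tours = 2000
tour_sums, tour_lens = [], []

v = start
for _ in range(n_tours):
    total, length = 0.0, 0
    while True:
        v = random.choice(list(G.neighbors(v)))  # simple random walk step
        total += f(v)
        length += 1
        if v == start:                           # regeneration: tour ends
            break
    tour_sums.append(total)
    tour_lens.append(length)

est = sum(tour_sums) / sum(tour_lens)            # ratio estimator of E_pi[f]

# Regenerative standard error via the delta method on per-tour quantities.
mean_len = sum(tour_lens) / n_tours
resid = [y - est * l for y, l in zip(tour_sums, tour_lens)]
var = sum(r * r for r in resid) / (n_tours - 1)
stderr = math.sqrt(var / n_tours) / mean_len
print(f"estimate = {est:.4f} +/- {1.96 * stderr:.4f}")

# Exact degree-weighted value for comparison on this toy graph.
exact = sum(G.degree(u) * f(u) for u in G) / (2 * G.number_of_edges())
print(f"exact    = {exact:.4f}")
```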
Optimization of Restricted Boltzmann Machines. In addition, I also propose a regenerative transformation of MCMC samplers of Restricted Boltzmann Machines (RBMs). My approach, Markov Chain Las Vegas (MCLV), gives statistical guarantees in exchange for random running times. MCLV uses a stopping set built from the training data and a maximum Markov chain step count K (referred to as MCLV-K). I present an MCLV-K gradient estimator (LVS-K) for RBMs and explore the correspondence and differences between LVS-K and Contrastive Divergence (CD-K). LVS-K significantly outperforms CD-K in the task of training RBMs over the MNIST dataset, indicating MCLV to be a promising direction in learning generative models.
|
66 |
Optimizing Bike Sharing System Flows using Graph Mining, Convolutional and Recurrent Neural Networks. Ljubenkov, Davor. January 2019.
A Bicycle-sharing system (BSS) is a popular service scheme deployed in cities of different sizes around the world. Although the docked bike system is its most popular model, it still suffers from a number of weaknesses that could be addressed by investigating bike sharing network properties and the evolution of the observed patterns. Keeping a bicycle-sharing system as balanced as possible is the main problem, and thus predicting or minimizing the manual transportation of bikes across the city is the prime objective, in order to save logistic costs for the operating companies. The purpose of this thesis is two-fold. Firstly, to visualize bike flow using data exploration methods and statistical analysis to better understand mobility characteristics with respect to distance, duration, time of the day, spatial distribution, weather circumstances, and other attributes. Secondly, with these flow visualizations it is possible to focus on specific directed sub-graphs containing only those pairs of stations whose mutual flow difference is the most asymmetric. By doing so, we are able to apply graph mining and machine learning techniques to these unbalanced stations. Identification of spatial structures and their structural change can be captured using a Convolutional neural network (CNN) that takes adjacency matrix snapshots of the unbalanced sub-graphs. A structure generated by the previous method is then used in a Long short-term memory recurrent neural network (RNN LSTM) in order to find and predict its dynamic patterns. As a result, we predict bike flows for each node in the possible future sub-graph configuration, which in turn informs bicycle-sharing system owners in advance so they can plan accordingly. This combination of methods notifies them which prospective areas they should focus on more and how many bike relocation phases are to be expected. Methods are evaluated using Cross validation (CV), Root mean square error (RMSE) and Mean average error (MAE) metrics. Benefits are identified both for urban city planning and for bike sharing companies, by saving time and minimizing their costs.
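The selection of the most asymmetric station pairs can be sketched as follows; the trip-record format and the imbalance threshold are assumptions for illustration, and the CNN/LSTM stages of the thesis are not shown:

```python
# Sketch: from raw trips, compute pairwise flow imbalance and keep only the most
# asymmetric station pairs as a directed sub-graph. Data and threshold are toy values.
from collections import Counter
import networkx as nx

# (start_station, end_station) per trip -- illustrative data only
trips = [("A", "B"), ("A", "B"), ("A", "B"), ("B", "A"),
         ("B", "C"), ("C", "B"), ("C", "A"), ("C", "A")]

flow = Counter(trips)
IMBALANCE_THRESHOLD = 2  # minimum net flow that marks a pair as unbalanced (assumed)

unbalanced = nx.DiGraph()
for a, b in {tuple(sorted(pair)) for pair in flow}:   # each unordered pair once
    net = flow.get((a, b), 0) - flow.get((b, a), 0)
    if abs(net) >= IMBALANCE_THRESHOLD:
        src, dst = (a, b) if net > 0 else (b, a)      # orient towards the surplus
        unbalanced.add_edge(src, dst, net_flow=abs(net))

print(list(unbalanced.edges(data=True)))
# e.g. [('A', 'B', {'net_flow': 2}), ('C', 'A', {'net_flow': 2})]
```

The adjacency matrices of such unbalanced sub-graphs, taken as snapshots over time, are what the CNN and LSTM stages then consume.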
|
67 |
Result Diversification on Spatial, Multidimensional, Opinion, and Bibliographic Data. Kucuktunc, Onur. 01 October 2013.
No description available.
|
68 |
Malicious Entity Categorization using Graph modelling / Skadlig Entity Kategorisering med användning graf modellering. Srinivaasan, Gayathri. January 2016.
Today, malware authors not only write malicious software but also employ obfuscation, polymorphism, packing and countless other evasive techniques to escape detection by Anti-Virus Products (AVP). Besides the individual behavior of malware, the relations that exist among them play an important role in improving malware detection. This work aims to enable malware analysts at F-Secure Labs to explore various such relationships between malicious URLs and file samples, in addition to their individual behavior and activity. The current detection methods at F-Secure Labs analyze unknown URLs and file samples independently, without taking into account the correlations that might exist between them. Such traditional classification methods perform well but are not efficient at identifying complex multi-stage malware that hides its activity. The interactions between malware may include any type of network activity, dropping, downloading, etc. For instance, an unknown downloader that connects to a malicious website which in turn drops a malicious payload should indeed be blacklisted. Such analysis can help block the malware infection at its source and also comprehend the whole infection chain. The outcome of this proof-of-concept study is a system that detects new malware using graph modelling to infer their relationship to known malware, as part of the malware classification services at F-Secure.
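As a rough illustration of the general idea, and only a hedged sketch rather than F-Secure's actual system, known-bad verdicts can be propagated over a graph whose nodes are URLs and file samples and whose edges record observed interactions such as downloads or drops:

```python
# Sketch: propagate suspicion from known-malicious nodes over a relationship graph
# of URLs and file samples. The entities, edge types and the simple averaging rule
# are illustrative assumptions, not a production classifier.
import networkx as nx

G = nx.Graph()
G.add_edge("url:evil.example", "file:dropper.exe", relation="downloads")
G.add_edge("file:dropper.exe", "file:payload.dll", relation="drops")
G.add_edge("url:benign.example", "file:reader.exe", relation="downloads")

known_malicious = {"url:evil.example"}
known_clean = {"url:benign.example"}

# Initial scores: 1.0 = malicious, 0.0 = clean, 0.5 = unknown.
score = {n: 1.0 if n in known_malicious else 0.0 if n in known_clean else 0.5
         for n in G}

# Iteratively pull unknown nodes towards the average score of their neighbours,
# keeping the labelled seeds fixed (a crude form of label propagation).
for _ in range(20):
    updated = dict(score)
    for n in G:
        if n in known_malicious or n in known_clean:
            continue
        neighbours = list(G.neighbors(n))
        updated[n] = sum(score[m] for m in neighbours) / len(neighbours)
    score = updated

for node, s in sorted(score.items(), key=lambda kv: -kv[1]):
    print(f"{s:.2f}  {node}")
```

In this toy example the unknown dropper and its payload drift towards the malicious end of the scale because of their connection to the known-bad URL, which mirrors the infection-chain reasoning described above.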
|
69 |
Fouille de graphes et classification de graphes : application à l’analyse de plans cadastraux / Graph Mining and Graph Classification : application to cadastral map analysis. Raveaux, Romain. 25 November 2010.
The work presented in this thesis addresses, from several angles, a broad and ambitious subject: the interpretation of colour cadastral maps. In this context, our approach lies at the confluence of several research areas, such as signal and image processing, pattern recognition, artificial intelligence and knowledge engineering. Although these scientific fields differ in their foundations, they are complementary, and their respective contributions are indispensable for designing an interpretation system. The core of the work is the automatic processing of 19th-century cadastral documents. The problem is treated within a project bringing together historians, geomaticians and computer scientists. We considered the problem from a systemic angle, addressing every stage of the processing chain while also aiming to develop methodologies applicable in other contexts. Cadastral documents have been the subject of numerous studies, but our work shows a clear originality, putting the emphasis on the interpretation of the documents and basing our study on graph-based models. Appropriate processing steps and methodologies are proposed, and the concern of bridging the semantic gap between the image and its interpretation is, in the case of the cadastral maps studied, given an answer. / This thesis tackles the problem of technical document interpretation applied to ancient and colored cadastral maps. This subject is at the crossroads of different fields such as signal and image processing, pattern recognition, artificial intelligence, man-machine interaction and knowledge engineering. Indeed, each of these different fields can contribute to building a reliable and efficient document interpretation device. This thesis points out the necessity and importance of dedicated services oriented to historical documents, and a related project named ALPAGE. Subsequently, the main focus of this work, Content-Based Map Retrieval within an ancient collection of color cadastral maps, is introduced.
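A graph-based model of a segmented map can be sketched as follows: given a label image in which each region of the map carries an integer label, a region adjacency graph connects regions that share a border. This is a generic illustration of the kind of graph such models build on, using an assumed toy segmentation, not the processing chain developed in the thesis:

```python
# Sketch: build a region adjacency graph (RAG) from a labelled segmentation of a
# map image. The toy label array stands in for a real colour segmentation.
import numpy as np
import networkx as nx

# Toy segmentation: three regions (1, 2, 3) on a small grid.
labels = np.array([
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 2, 2],
    [3, 3, 3, 3],
])

rag = nx.Graph()
rag.add_nodes_from(int(v) for v in np.unique(labels))

# Two regions are adjacent if any horizontally or vertically neighbouring pixels
# carry different labels.
for side_a, side_b in ((labels[:, 1:], labels[:, :-1]), (labels[1:, :], labels[:-1, :])):
    boundary = side_a != side_b
    for a, b in zip(side_a[boundary], side_b[boundary]):
        rag.add_edge(int(a), int(b))

print(sorted(rag.edges()))  # [(1, 2), (1, 3), (2, 3)]
```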
|