11 |
Parallelization of backward deleted distance calculation in graph based features using Hadoop / Pillamari, Jayachandran (January 1900)
Master of Science / Department of Computing & Information Sciences / Daniel Andresen / The current project presents an approach to parallelizing the calculation of Backward Deleted Distance (BDD) in Graph Based Features (GBF) computation using Hadoop. The project identifies the issues in BDD calculation and applies parallel computing technologies such as Hadoop to solve them, introducing a new algorithm that parallelizes the All Pairs Shortest Path (APSP) problem in BDD calculation using the Hadoop MapReduce framework. The project is implemented in Java and Hadoop; its aim is to parallelize the calculation of BDD and thereby reduce GBF computation time. The process of BDD calculation is examined to identify the key places where it can be parallelized. Since BDD calculation involves computing the shortest paths between all pairs of given users, it can be viewed as an instance of the APSP problem, and the internal structure and implementation of the Hadoop MapReduce framework is studied and applied to it. GBF features are one of the feature sets used in ontology classifiers. In the current project, GBF features are used to predict the friendship relationship between users whose direct link is deleted. The computation involves calculating the BDD between all pairs of users. The BDD for a user pair is the length of the shortest path between them when their direct link is deleted; in real terms, it is the shortest distance between them other than the direct path. The project uses train and test data sets consisting of positive and negative instances: positive instances are user pairs connected by a friendship link, whereas negative instances have no direct link between them. Apache Hadoop is an emerging technology for scalable, distributed computing across clusters of computers; its MapReduce framework is used to develop applications that process large amounts of data in parallel on large clusters.
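Since the abstract does not spell out the data structures, the per-pair computation can be illustrated in isolation: a minimal Java sketch of BDD for one user pair, that is, Dijkstra's algorithm with the pair's direct edge skipped. The graph layout and names are assumptions for illustration; the thesis distributes this work across map tasks via Hadoop rather than running it on one machine.

```java
import java.util.*;

/**
 * Minimal sketch of Backward Deleted Distance (BDD) for one user pair:
 * the shortest path between source and target with their direct edge removed.
 * Graph layout and class names are illustrative, not from the thesis.
 */
public class BddSketch {

    // adjacency list: user -> (neighbor -> edge weight)
    static int bdd(Map<Integer, Map<Integer, Integer>> graph, int source, int target) {
        Map<Integer, Integer> dist = new HashMap<>();
        PriorityQueue<int[]> queue =
                new PriorityQueue<>(Comparator.comparingInt(e -> e[1]));
        dist.put(source, 0);
        queue.add(new int[] {source, 0});

        while (!queue.isEmpty()) {
            int[] cur = queue.poll();
            int node = cur[0], d = cur[1];
            if (d > dist.getOrDefault(node, Integer.MAX_VALUE)) continue; // stale entry
            if (node == target) return d;
            for (Map.Entry<Integer, Integer> e : graph.getOrDefault(node, Map.of()).entrySet()) {
                // "backward deleted": ignore the direct source-target edge
                if (node == source && e.getKey() == target) continue;
                int nd = d + e.getValue();
                if (nd < dist.getOrDefault(e.getKey(), Integer.MAX_VALUE)) {
                    dist.put(e.getKey(), nd);
                    queue.add(new int[] {e.getKey(), nd});
                }
            }
        }
        return -1; // no alternative path exists
    }
}
```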
The project was developed and implemented successfully. The system was tested for reliability and performance on different data sets chosen to cover various factors and typical graph representations, and the test results were analyzed to characterize its behavior. The results show excellent speedup, reducing processing time from 10 hours to 20 minutes.
|
12 |
Parallelization Of Functional Flow To Predict Protein Functions / Akkoyun, Emrah (01 February 2011)
Protein-protein interaction networks provide important clues about the possible biological functions of proteins whose roles in the cell are unknown. Such interaction networks have been analyzed with a variety of approaches, run on a single computer, in which the known roles of identified proteins are used to predict the functions of unidentified ones. Functional flow is an approach that takes into account network connectivity, distance effects, and the topology of the network, with both local and global views. Thanks to these advantages, previously conducted research has shown that functional flow produces more accurate predictions of protein function. However, existing implementations of this approach could not practically be applied to the large and complex networks of complex species because of memory limitations. The purpose of this thesis is to provide a new high-performance implementation that scales to large data sets. To that end, Hadoop, an open-source map/reduce environment, was installed on 18 hosts, each with eight cores.
Method: the first map/reduce job distributes the protein interaction network in a format that allows parallel distributed computation across all worker nodes; the second map/reduce job generates flows for each known protein function; and the roles of unidentified proteins are predicted by accumulating all of the generated flows. Our experiments show that this application, which requires high-performance computing, can be decomposed efficiently across worker nodes and that performance improves as resources increase.
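As a rough illustration of the flow-generation step that the second job performs, here is a simplified, single-machine Java sketch of one functional-flow iteration for one protein function. The downhill, weight-proportional, capacity-capped flow rule follows the general functional-flow idea; all names and simplifications are assumptions, not the thesis's Hadoop code.

```java
import java.util.*;

/**
 * Simplified single-machine sketch of one functional-flow iteration for one
 * protein function. Reservoir levels start high at annotated proteins; flow
 * moves "downhill" along weighted edges, proportional to edge weight and
 * capped by it. Names and simplifications are illustrative.
 */
public class FunctionalFlowSketch {

    static Map<String, Double> step(Map<String, Map<String, Double>> graph,
                                    Map<String, Double> reservoir,
                                    Map<String, Double> accumulated) {
        Map<String, Double> next = new HashMap<>(reservoir);
        for (Map.Entry<String, Map<String, Double>> node : graph.entrySet()) {
            String u = node.getKey();
            double level = reservoir.getOrDefault(u, 0.0);
            double totalWeight = node.getValue().values().stream()
                    .mapToDouble(Double::doubleValue).sum();
            for (Map.Entry<String, Double> edge : node.getValue().entrySet()) {
                String v = edge.getKey();
                // flow only downhill, proportional to edge weight, capped by it
                if (level > reservoir.getOrDefault(v, 0.0)) {
                    double flow = Math.min(edge.getValue(),
                            level * edge.getValue() / totalWeight);
                    next.merge(u, -flow, Double::sum);
                    next.merge(v, flow, Double::sum);
                    accumulated.merge(v, flow, Double::sum); // score used for prediction
                }
            }
        }
        return next;
    }
}
```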
|
13 |
Enriching the Web of Data with topics and links / Böhm, Christoph (January 2013)
This thesis presents novel ideas and research findings for the Web of Data, a global data space spanning many so-called Linked Open Data sources. Linked Open Data adheres to a set of simple principles that allow easy access to, and reuse of, data published on the Web. Linked Open Data is by now an established concept, and many (mostly academic) publishers have adopted its principles, building a powerful web of structured knowledge available to everybody. However, Linked Open Data does not yet play a significant role among the common web technologies that facilitate today's high-standard Web experience.
In this work, we thoroughly discuss the state of the art for Linked Open Data and highlight several shortcomings, some of which we tackle in the main part of this work.
First, we propose a novel type of data source meta-information, namely the topics of a dataset. This information could be published with dataset descriptions and support a variety of use cases, such as data source exploration and selection. For the topic retrieval, we present an approach coined Annotated Pattern Percolation (APP), which we evaluate with respect to topics extracted from Wikipedia portals.
Second, we contribute to entity linking research by presenting an optimization model for joint entity linking, showing its hardness, and proposing three heuristics implemented in the LINked Data Alignment (LINDA) system. Our first solution can exploit multi-core machines, whereas the second and third approaches are designed to run in a distributed, shared-nothing environment. We discuss and evaluate the properties of our approaches, leading to recommendations on which algorithm to use in a specific scenario. The distributed algorithms are among the first of their kind, i.e., approaches for joint entity linking in a distributed fashion. We also illustrate that we can tackle the entity linking problem at very large scale, with data comprising more than 100 million entity representations from very many sources.
Finally, we approach a sub-problem of entity linking, namely the alignment of concepts. We again target a method that looks at the data in its entirety and does not neglect existing relations. This concept alignment method must also execute very quickly, so that it can serve as preprocessing for further computations. Our approach, called Holistic Concept Matching (HCM), achieves the required speed by grouping the input through comparison of so-called knowledge representations. Within the groups, we perform complex similarity computations, draw conclusions about relations, and detect semantic contradictions. The quality of our results is again evaluated on a large and heterogeneous dataset from the real Web.
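The speed-through-grouping step is, in spirit, a blocking strategy: expensive comparisons run only inside groups that share a cheap key. The sketch below illustrates that general pattern with a placeholder key function; it is not HCM's actual knowledge-representation comparison, which the abstract does not detail.

```java
import java.util.*;

/**
 * Generic blocking sketch: group concepts by a cheap key so that expensive
 * pairwise similarity runs only within groups. The keying here is a
 * placeholder, not HCM's actual knowledge representations.
 */
public class BlockingSketch {

    // cheap key: lower-cased first token of the concept label
    static String blockKey(String label) {
        return label.toLowerCase(Locale.ROOT).split("\\s+")[0];
    }

    static List<String[]> candidatePairs(List<String> labels) {
        Map<String, List<String>> blocks = new HashMap<>();
        for (String label : labels) {
            blocks.computeIfAbsent(blockKey(label), k -> new ArrayList<>()).add(label);
        }
        List<String[]> pairs = new ArrayList<>();
        for (List<String> block : blocks.values()) {
            for (int i = 0; i < block.size(); i++) {         // expensive comparisons
                for (int j = i + 1; j < block.size(); j++) {  // stay inside the block
                    pairs.add(new String[] {block.get(i), block.get(j)});
                }
            }
        }
        return pairs;
    }
}
```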
In summary, this work contributes a set of techniques for enhancing the current state of the Web of Data. All approaches have been tested on large and heterogeneous real-world input.
|
14 |
Mining frequent highly-correlated item-pairs at very low support levels / Sandler, Ian (20 December 2011)
The ability to extract frequent pairs from a set of transactions is one of the fundamental building blocks of data mining. When the number of items in a given transaction is relatively small, the problem is trivial. Even when dealing with millions of transactions, it is still trivial if the number of unique items in the transaction set is small. The problem becomes much more challenging when we deal with millions of transactions, each containing hundreds of items drawn from a set of millions of potential items, especially when we are looking for highly correlated results at extremely low support levels.
For 25 years the Direct Hashing and Pruning algorithm of Park, Chen, and Yu (PCY) has been the principal technique used when there are billions of potential pairs that need to be counted. In this paper we propose a new approach that takes full advantage of both multi-core and multi-CPU availability, works in cases where PCY fails, and scales excellently, with performance continuing to improve even when the number of processors, unique items, and items per transaction are at their highest.
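For context, the PCY idea can be sketched concretely: a first pass counts single items and hashes every pair into a small array of bucket counters, and the second pass counts a pair exactly only when both items are frequent and its bucket met the support threshold. A minimal Java sketch, with names, sizes, and thresholds chosen for illustration:

```java
import java.util.*;

/**
 * Minimal PCY sketch: pass 1 counts items and hashes every pair into a
 * bucket array; pass 2 counts a pair exactly only if both items are
 * frequent AND its bucket reached the support threshold.
 * Names, sizes, and thresholds are illustrative.
 */
public class PcySketch {

    static Map<Long, Integer> frequentPairs(List<int[]> transactions,
                                            int support, int numBuckets) {
        Map<Integer, Integer> itemCount = new HashMap<>();
        int[] bucket = new int[numBuckets];

        for (int[] t : transactions) {                 // pass 1
            for (int item : t) itemCount.merge(item, 1, Integer::sum);
            for (int i = 0; i < t.length; i++)
                for (int j = i + 1; j < t.length; j++)
                    bucket[bucketOf(t[i], t[j], numBuckets)]++;
        }

        Map<Long, Integer> pairCount = new HashMap<>();
        for (int[] t : transactions) {                 // pass 2
            for (int i = 0; i < t.length; i++)
                for (int j = i + 1; j < t.length; j++) {
                    if (itemCount.get(t[i]) < support) continue;
                    if (itemCount.get(t[j]) < support) continue;
                    if (bucket[bucketOf(t[i], t[j], numBuckets)] < support) continue;
                    pairCount.merge(pairKey(t[i], t[j]), 1, Integer::sum);
                }
        }
        pairCount.values().removeIf(c -> c < support); // keep frequent pairs only
        return pairCount;
    }

    // order-independent hash of the pair into a bucket index
    static int bucketOf(int a, int b, int numBuckets) {
        return Math.floorMod(Long.hashCode(pairKey(a, b)), numBuckets);
    }

    static long pairKey(int a, int b) {                // order-independent key
        int lo = Math.min(a, b), hi = Math.max(a, b);
        return ((long) lo << 32) | (hi & 0xffffffffL);
    }
}
```

The bucket array is the whole trick: it fits in memory when exact pair counts would not, and any pair whose bucket falls below the support threshold can be pruned without ever being counted.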
We believe that our approach has much broader applicability in the field of co-occurrence counting and can be used to generate much more interesting results when mining very large data sets.
|
15 |
GoldBI: uma solução de Business Intelligence como serviço / GoldBI: a Business Intelligence as a Service solution / Silva Neto, Arlindo Rodrigues da (26 August 2016)
This work creates a BI (Business Intelligence) tool available in the cloud, delivered as SaaS (Software as a Service), using ETL (Extract, Transform, Load) techniques and Big Data technologies, with the intention of facilitating decentralized extraction and processing of data in large quantities. Currently, it is practically impossible to conduct a consistent analysis without the aid of software for reporting and statistics; obtaining concrete results for decision making requires data analysis strategies and consolidated variables. From this viewpoint, this study emphasizes Business Intelligence (BI) as a way to simplify the analysis of management information and statistics, providing indicators through graphs or dynamic listings of management data. With the exponential growth of data it becomes increasingly difficult to obtain results quickly and consistently, which makes new techniques and tools for large-scale data processing necessary. This work is technical in nature, creating a Software Engineering product grounded in a study of the state of the art in the area and in a comparison with the main tools on the market, highlighting the advantages and disadvantages of the created solution.
|
16 |
Extending the Growing Hierarchical Self Organizing Maps for a Large Mixed-Attribute Dataset Using Spark MapReduce / Malondkar, Ameya Mohan (January 2015)
In this thesis work, we propose a Map-Reduce variant of the Growing Hierarchical Self Organizing Map (GHSOM) called MR-GHSOM, which is capable of handling mixed-attribute datasets of massive size. The Self Organizing Map (SOM) has proved to be a useful unsupervised data analysis algorithm. It projects high-dimensional data onto a lower-dimensional grid of neurons. However, the SOM has some limitations owing to its static structure and its inability to mirror the hierarchical relations in the data. The GHSOM overcomes these shortcomings of the SOM by providing a dynamic structure that adapts its shape according to the input data. It is capable of growing dynamically, both in the size of the individual neuron layers, to represent data at the desired granularity, and in depth, to model the hierarchical relations in the data.
However, training the GHSOM requires multiple passes over the input dataset, which makes it difficult to apply to massive datasets. The MR-GHSOM is implemented on the Apache Spark cluster-computing engine and leverages the popular Map-Reduce programming model. This lets us exploit the usefulness and dynamic capabilities of the GHSOM even for a large dataset.
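The abstract does not describe how training maps onto Spark; one plausible shape for a single batch training step of one SOM layer, offered here as an assumption, is a map-side best-matching-unit search followed by a reduce-side averaged weight update:

```java
import java.util.*;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

/**
 * Hypothetical sketch of one batch training step for a single SOM layer on
 * Spark: each input vector finds its best matching unit (BMU) in the map
 * phase; the reduce phase averages the vectors assigned to each neuron.
 * Neighborhood updates and GHSOM's growth checks are omitted.
 */
public class SomBatchStepSketch {

    static double[][] step(JavaRDD<double[]> data, double[][] weights) {
        JavaPairRDD<Integer, Tuple2<double[], Long>> assigned = data.mapToPair(x -> {
            int bmu = 0;                               // best matching unit
            double best = Double.MAX_VALUE;
            for (int n = 0; n < weights.length; n++) {
                double d = 0;
                for (int i = 0; i < x.length; i++) {
                    double diff = x[i] - weights[n][i];
                    d += diff * diff;                  // squared Euclidean distance
                }
                if (d < best) { best = d; bmu = n; }
            }
            return new Tuple2<>(bmu, new Tuple2<>(x.clone(), 1L));
        });

        // sum vectors and counts per neuron, then average into new weights
        Map<Integer, Tuple2<double[], Long>> sums = assigned.reduceByKey((a, b) -> {
            double[] s = a._1().clone();
            for (int i = 0; i < s.length; i++) s[i] += b._1()[i];
            return new Tuple2<>(s, a._2() + b._2());
        }).collectAsMap();

        double[][] updated = new double[weights.length][];
        for (int n = 0; n < weights.length; n++) {
            Tuple2<double[], Long> s = sums.get(n);
            if (s == null) { updated[n] = weights[n].clone(); continue; }
            updated[n] = s._1().clone();
            for (int i = 0; i < updated[n].length; i++) updated[n][i] /= s._2();
        }
        return updated;
    }
}
```

Each such step is one pass over the data; the multiple passes the GHSOM needs then become repeated Spark jobs over a cached RDD rather than repeated disk scans.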
Moreover, the conventional GHSOM algorithm can handle datasets with numeric attributes only, since it relies heavily on Euclidean-space dissimilarity measures between attribute vectors. The MR-GHSOM further extends the GHSOM to handle mixed-attribute (numeric and categorical) datasets. It accomplishes this by adopting the distance hierarchy approach to managing mixed-attribute data.
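The distance hierarchy idea can be illustrated with a small assumed concept hierarchy: each categorical value is a leaf, the distance between two values is the path length through their least common ancestor, and numeric attributes contribute an ordinary normalized difference. The hierarchy contents and weighting below are invented for illustration and simplify the actual method.

```java
import java.util.*;

/**
 * Simplified sketch of the distance-hierarchy idea for mixed attributes:
 * each categorical value is a leaf in a concept hierarchy, and the distance
 * between two values is the (unit-weight) path length through their least
 * common ancestor; numeric attributes contribute |a - b| on normalized
 * values. Hierarchy contents are illustrative.
 */
public class DistanceHierarchySketch {

    // path from each categorical value up to the root, e.g.
    // "cola" -> ["cola", "soft_drink", "drink", "ROOT"]
    static final Map<String, List<String>> PATH = Map.of(
            "cola",   List.of("cola", "soft_drink", "drink", "ROOT"),
            "soda",   List.of("soda", "soft_drink", "drink", "ROOT"),
            "merlot", List.of("merlot", "wine", "drink", "ROOT"));

    // unit edge weights: distance = edges from a to LCA + edges from b to LCA
    static double categoricalDistance(String a, String b) {
        List<String> pa = PATH.get(a), pb = PATH.get(b);
        for (int i = 0; i < pa.size(); i++) {
            int j = pb.indexOf(pa.get(i));
            if (j >= 0) return i + j;   // pa.get(i) is the least common ancestor
        }
        return pa.size() + pb.size();   // nothing shared below ROOT
    }

    static double mixedDistance(double[] nums1, double[] nums2,
                                String[] cats1, String[] cats2) {
        double d = 0;
        for (int i = 0; i < nums1.length; i++)
            d += Math.abs(nums1[i] - nums2[i]);        // normalized numeric part
        for (int i = 0; i < cats1.length; i++)
            d += categoricalDistance(cats1[i], cats2[i]);
        return d;
    }
}
```

Under this scheme "cola" and "soda" are close (both soft drinks) while "cola" and "merlot" are farther apart, which is exactly the graded similarity a plain equals/not-equals comparison of categorical values cannot express.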
The proposed MR-GHSOM is thus capable of handling massive datasets containing mixed attributes. To demonstrate its effectiveness at clustering mixed-attribute datasets, we present the results produced by the MR-GHSOM on some popular datasets. We further train the MR-GHSOM on a Census dataset containing mixed attributes and provide an analysis of the results.
|
17 |
LIMES M/R: Parallelization of the LInk discovery framework for MEtric Spaces using the Map/Reduce paradigm / Hillner, Stanley (26 February 2018)
The World Wide Web is the most important information space in the world. With the evolution of the web over the last decade, today's Web 2.0 offers everybody the possibility to publish information on the web easily. For instance, everyone can have his own blog, write Wikipedia articles, publish photos on Flickr, or post status messages via Twitter. All these services offer users around the world the opportunity to exchange information and connect with other users. However, the information, as it is usually published today, does not offer enough semantics to be machine-processable. As an example, Wikipedia articles are created using the lightweight Wiki markup language and then published as HyperText Markup Language (HTML) files, whose semantics can easily be captured by humans, but not machines.
|
18 |
Scalable Map-Reduce Algorithms for Mining Formal Concepts and Graph Substructures / Kumar, Lalit (January 2018)
No description available.
|
19 |
Supporting Data-Intensive Scientific Computing on Bandwidth and Space Constrained Environments / Bicer, Tekin (18 August 2014)
No description available.
|
20 |
Apprentissage supervisé de données symboliques et l'adaptation aux données massives et distribuées / Supervised learning of Symbolic Data and adaptation to Big Data / Haddad, Raja (23 November 2016)
This thesis proposes new supervised methods for Symbolic Data Analysis (SDA) and extends the domain to Big Data. We first create a supervised method called HistSyr, which automatically converts continuous variables into the histograms most discriminant for the classes of individuals. We also propose a new symbolic decision tree method called SyrTree, which accepts many types of explanatory and target variables and can use all symbolic variables describing the target to construct the decision tree. Finally, we extend HistSyr to Big Data by creating a distributed method called CloudHistSyr. Using the Map/Reduce framework, CloudHistSyr creates the most discriminant histograms for data too large for HistSyr. We tested CloudHistSyr on Amazon Web Services (AWS) and show the efficiency of our method on simulated data and on actual car traffic data in Nantes. We conclude on the overall utility of CloudHistSyr, whose results allow the study of massive data using existing symbolic analysis methods.
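The abstract does not give HistSyr's discriminance criterion, but its building block, class-conditional histograms of a continuous variable, is easy to sketch; a Map/Reduce version such as CloudHistSyr would plausibly distribute the counting as sums keyed by (bin, class). All names below are illustrative assumptions.

```java
import java.util.*;

/**
 * Hypothetical building block for HistSyr-style preprocessing: turn a
 * continuous variable into per-class histograms over equal-width bins.
 * HistSyr's actual criterion for choosing the most discriminant bins is
 * not given in the abstract; this only shows the class-conditional
 * counting that a Map/Reduce version would distribute as
 * (binIndex, classLabel) -> count.
 */
public class ClassHistogramSketch {

    static Map<String, int[]> classHistograms(double[] values, String[] labels,
                                              int bins, double min, double max) {
        Map<String, int[]> hist = new HashMap<>();
        double width = (max - min) / bins;
        for (int i = 0; i < values.length; i++) {
            int bin = (int) Math.min(bins - 1, (values[i] - min) / width);
            hist.computeIfAbsent(labels[i], k -> new int[bins])[bin]++;
        }
        return hist; // one histogram per class; compare them to pick cut points
    }
}
```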
|