31 |
Distributed Document Clustering and Cluster Summarization in Peer-to-Peer Environments
Hammouda, Khaled M. January 2007
This thesis addresses difficult challenges in distributed document clustering and cluster summarization. Mining large document collections poses many challenges, one of which is the extraction of topics or summaries from documents for the purpose of interpretation of clustering results. Another important challenge, which is caused by new trends in distributed repositories and peer-to-peer computing, is that document data is becoming more distributed.
We introduce a solution for interpreting document clusters using keyphrase extraction from multiple documents simultaneously. We also introduce two solutions for the problem of distributed document clustering in peer-to-peer environments, each satisfying a different goal: maximizing local clustering quality through collaboration, and maximizing global clustering quality through cooperation.
The keyphrase extraction algorithm efficiently extracts and scores candidate keyphrases from a document cluster. The algorithm is called CorePhrase and is based on modeling document collections as a graph, on which graph mining is used to extract frequent and significant phrases that label the clusters. Results show that CorePhrase can extract keyphrases relevant to documents in a cluster with very high accuracy. Although this algorithm can be used to summarize centralized clusters, it is specifically employed within distributed clustering both to boost clustering accuracy and to provide summaries for distributed clusters.
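The idea of labeling a cluster with phrases shared across its documents can be illustrated with a minimal sketch. This is not the CorePhrase algorithm itself: the graph model is replaced here by plain sets of word n-grams, and the scoring (document frequency, then phrase length) is an assumption made only for illustration. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def ngrams(text, max_len=3):
    """Return the set of word n-grams (1..max_len words) in a text."""
    words = text.lower().split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    }


def candidate_keyphrases(docs, max_len=3):
    """For each phrase shared by at least two documents,
    count how many documents contain it."""
    phrase_sets = [ngrams(d, max_len) for d in docs]
    shared = set()
    for a, b in combinations(phrase_sets, 2):
        shared |= a & b
    return Counter({p: sum(p in s for s in phrase_sets) for p in shared})


def top_keyphrases(docs, k=3, max_len=3):
    """Rank shared phrases by document frequency, then by length in words
    (an assumed scoring, not the one used in the thesis)."""
    counts = candidate_keyphrases(docs, max_len)
    ranked = sorted(
        counts,
        key=lambda p: (counts[p], len(p.split()), p),
        reverse=True,
    )
    return ranked[:k]


docs = [
    "distributed document clustering in peer networks",
    "a survey of distributed document clustering",
    "document clustering with keyphrase labels",
]
print(top_keyphrases(docs, k=1))
```

In this toy cluster the phrase found in all three documents wins over longer phrases found in only two, which is the kind of trade-off a real scoring function has to settle.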
The first method for distributed document clustering is called collaborative peer-to-peer document clustering, which models nodes in a peer-to-peer network as collaborative nodes with the goal of improving the quality of individual local clustering solutions. This is achieved through the exchange of local cluster summaries between peers, followed by recommendation of documents to be merged into remote clusters. Results on large sets of distributed document collections show that: (i) this collaboration technique achieves significant improvement in the final clustering of individual nodes; (ii) networks with a larger number of nodes generally achieve greater improvements in clustering after collaboration relative to the initial clustering before collaboration, although they tend to achieve lower absolute clustering quality than networks with fewer nodes; and (iii) as more overlap of the data is introduced across the nodes, collaboration tends to have little effect on improving clustering quality.
The second method for distributed document clustering is called hierarchically-distributed document clustering. Unlike the collaborative model, this model aims at producing one clustering solution across the whole network. It specifically addresses scalability of network size, and consequently the distributed clustering complexity, by modeling the distributed clustering problem as a hierarchy of node neighborhoods. Summarization of the global distributed clusters is achieved through a distributed version of the CorePhrase algorithm. Results on large document sets show that: (i) distributed clustering accuracy is not affected by increasing the number of nodes in single-level networks; (ii) we can achieve decent speedup by making the hierarchy taller, but at the expense of clustering quality, which degrades as we go up the hierarchy; (iii) in networks that grow arbitrarily, data becomes more fragmented across neighborhoods, causing poor centroid generation, which suggests that we should not increase the number of nodes in the network beyond a certain level without increasing the data set size; and (iv) distributed cluster summarization can produce accurate summaries similar to those produced by centralized summarization.
The proposed algorithms offer a high degree of flexibility, scalability, and interpretability for large distributed document collections. Achieving the same results using current methodologies requires centralization of the data first, which is sometimes not feasible.
|
32 |
Temporal streams: programming abstractions for distributed live stream analysis applications
Hilley, David B 20 October 2009
Continuous live stream analysis applications are increasingly common. Video-based surveillance, emergency response, disaster recovery, and critical infrastructure monitoring are all examples of such applications. These applications are distributed and typically require significant computing resources (like a cluster of workstations) for analysis. In addition to live data, many such applications also require access to historical data that was streamed in the past and is now archived. While distributed programming support for traditional high-performance computing applications is fairly mature, existing solutions for live stream analysis applications are still in their early stages and, in our view, inadequate.
We explore the system-level value of recognizing temporal properties -- a critical aspect of the application domain. We present "temporal streams", a programming model supporting a higher-level, domain-targeted programming abstraction for such applications. It provides a simple but expressive stream abstraction encompassing transport, manipulation and storage of streaming data. The semantics of the programming model are tailored to the application domain by explicitly recognizing the temporal aspects of continuous streams, providing a common interface for both time-based retrieval of current streaming data and data persistence. The unifying trait of time enables access to both current streaming data and archived historical data using the same interface; the communication and storage abstractions are the same -- a unified stream data abstraction, uniformly modeling stream data interactions.
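A minimal sketch of what a single time-based interface over current and archived data could look like is given below. It is an illustration under stated assumptions, not the thesis's API: the class name `TemporalStream` and its methods `put`, `get` and `get_range` are hypothetical, and everything is kept in memory on one machine, whereas the thesis describes a distributed runtime.

```python
import bisect


class TemporalStream:
    """Hypothetical time-indexed stream: items are stored with timestamps
    and retrieved by time, whether they are recent or long past."""

    def __init__(self):
        self._times = []   # sorted timestamps
        self._items = []   # items aligned with _times

    def put(self, timestamp, item):
        """Insert an item, keeping the stream ordered by timestamp."""
        pos = bisect.bisect_right(self._times, timestamp)
        self._times.insert(pos, timestamp)
        self._items.insert(pos, item)

    def get(self, timestamp):
        """Return the most recent item at or before `timestamp`, or None."""
        pos = bisect.bisect_right(self._times, timestamp)
        return self._items[pos - 1] if pos else None

    def get_range(self, start, end):
        """Return all items with start <= timestamp < end, in time order."""
        lo = bisect.bisect_left(self._times, start)
        hi = bisect.bisect_left(self._times, end)
        return self._items[lo:hi]


stream = TemporalStream()
for t, frame in [(10, "frame-a"), (20, "frame-b"), (30, "frame-c")]:
    stream.put(t, frame)

latest = stream.get(35)              # "current" data
archived = stream.get_range(0, 25)   # "historical" data, same interface
```

The point of the sketch is only that the caller names a time or a time range and never has to say whether the data is live or archived.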
"Temporal streams" defines how distributed threads of computation interact implicitly via streams, but does not impose a particular model of computation constraining the interactions between distributed actors, targeting loosely coupled distributed systems with no centralized control. In particular, it targets stream analysis scenarios requiring significant signal processing on heavyweight streams such as audio and video. These unstructured streams are data rich but are not directly interpretable until meaningful features are extracted; consequently, feature detection and subsequent analysis are the major computational requirements.
We also use the programming model as a vehicle for exploring systems software design issues, realizing "temporal streams" as a distributed runtime in the tradition of loosely coupled distributed systems with strong communication boundaries. We thoroughly examine the concrete software architecture and elements of implementation. We also describe two generations of system implementations, including the broad development philosophy, specific design principles and salient low-level details. The runtime is designed to be relatively lightweight and suitable as a substrate for higher-level, more domain-specific middleware or application functionality. Even with a relatively simple programming model, a carefully designed system architecture can provide a surprisingly rich and flexible substrate for upper software layers.
We also evaluate our system implementation in two ways; first, we present a series of quantitative experimental results designed to assess the performance of key primitives in our architecture in isolation. We also use motivating applications to evaluate "temporal streams" in the context of realistic application scenarios. We develop three motivating applications and provide quantitative and qualitative analyses of these applications in the context of "temporal streams." We show that, although it provides needed higher-level functionality to enable live stream analysis applications, our runtime does not add significant overhead to the stream computation at the core of each application.
Finally, we also review the relationship of "temporal streams" (both the programming model and architecture) to other approaches, including database-oriented Stream Data Management Systems (SDMS), various stream processing engines, stream programming languages and parallel batch processing systems, as well as traditional distributed programming systems and communication frameworks.
|
33 |
Self-describing objects with tangible data structures
Sinha, Arnab 28 May 2014
Pervasive computing, or ambient computing, aims to integrate information systems into the environment in a manner as transparent as possible to users. It allows information systems to be tightly coupled with the physical activities within the environment. Everyday objects, along with their environment, are made smarter through embedded computing, sensors and similar technologies, and can also communicate among themselves. In pervasive computing, it is necessary to sense the real physical world and to perceive its "context": a high-level representation of the physical situation. There are various ways to derive the context. Typically, the approach is a multi-step process that begins with sensing. Various sensing technologies capture low-level information about physical activities, which is then aggregated, analyzed and processed elsewhere in the information systems to become aware of the context. Deployed applications then react depending on the context. Among sensors, RFID is an important emerging technology that allows a direct digital link between information systems and physical objects. Besides storing identification data, RFID also provides general-purpose storage space on objects, enabling new architectures for pervasive computing. In this thesis, we defend an original approach that adopts the latter use of RFID, i.e. a digital memory integrated into real objects. The approach relies on the principle that the objects themselves support the information system. This form of integration reduces the need to communicate for remote processing. The principle is realized in two ways. First, objects carry semantic information about themselves, making them self-describing objects. Hence, relevant information associated with the physical entities is readily available locally for processing. Second, groups of related objects are digitally linked using a dedicated or ad hoc data structure distributed over the objects. This allows direct data processing, such as validating a property involving objects in proximity. The physical relation among objects can thus be interpreted digitally from the data structure, which justifies the name "Tangible Data Structures". Compared with the conventional method of using identifiers, our approach has benefits in terms of privacy, scalability, autonomy and reduced dependency on infrastructure. Its main challenge is expressivity, given the limited memory space available in the tags. The principles are validated by prototyping in two different application domains. The first application, developed for the waste management domain, helps with efficient sorting and better recycling. The second provides added services, such as assistance during assembly and verification of composite objects, using the data structure distributed across the individual pieces.
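Verifying a composite object from data stored on its own pieces can be sketched as follows. The tag layout is an assumption made for illustration (the abstract does not specify one): each tag holds its own piece id, a short description, and the id of the next piece, so that a complete object forms a closed ring. The names `Tag` and `is_complete` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    """Hypothetical tag memory: a piece id, a short description,
    and the id of the next piece in the same composite object."""
    piece_id: str
    description: str
    next_id: str


def is_complete(tags):
    """Check that the tags in range form exactly one closed ring.

    Follows next_id links from an arbitrary tag; the set is complete if
    the walk visits every tag once and returns to its starting point."""
    if not tags:
        return False
    by_id = {t.piece_id: t for t in tags}
    if len(by_id) != len(tags):
        return False  # duplicate ids
    start = tags[0].piece_id
    seen = set()
    current = start
    while current not in seen:
        if current not in by_id:
            return False  # a linked piece is missing
        seen.add(current)
        current = by_id[current].next_id
    return current == start and len(seen) == len(tags)


seat = Tag("A", "chair seat", "B")
back = Tag("B", "chair back", "C")
legs = Tag("C", "chair legs", "A")
stray = Tag("D", "unrelated piece", "D")

print(is_complete([seat, back, legs]))   # all pieces present
print(is_complete([seat, back]))         # one piece missing
```

The check reads only the tags that are physically present and needs no lookup in a remote database, which is the property the abstract emphasizes.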
|
35 |
Active Data - Enabling Smart Data Life Cycle Management for Large Distributed Scientific Data Sets / Active Data − Gestion Intelligente du Cycle de Vie des Grands Jeux de Données Scientifiques Distribués
Simonet, Anthony 08 July 2015
In all domains, scientific progress relies more and more on our ability to exploit ever-growing volumes of data. However, as data volumes increase, their management becomes more difficult. A key point is to deal with the complexity of data life cycle management, i.e. all the operations that happen to data between their creation and their deletion: transfer, archiving, replication, disposal, etc. These formerly straightforward operations become intractable when data volume grows dramatically, because of the heterogeneity of data management software on the one hand, and the complexity of the infrastructures involved on the other.
In this thesis, we introduce Active Data, a meta-model, an implementation and a programming model that make it possible to represent, formally and graphically, the life cycle of data distributed across an assemblage of heterogeneous systems and infrastructures, naturally exposing replication, distribution and different data identifiers. Once connected to existing applications, Active Data exposes the progress of data through their life cycle at runtime to users and programs, while keeping track of data as they pass from one system to another.
The Active Data programming model allows code to be executed at each step of the data life cycle. Programs developed with Active Data have access at any time to the complete state of the data in every system and infrastructure across which they are distributed. We present micro-benchmarks and usage scenarios that demonstrate the expressivity of the programming model and the quality of the implementation.
Finally, we describe the implementation of a Data Surveillance framework based on Active Data for the Advanced Photon Source experiment that allows scientists to monitor the progress of their data, automate most manual tasks, get relevant notifications from a huge number of events, and detect and recover from errors without human intervention.
This work opens interesting perspectives, in particular in data provenance and open data, while facilitating collaboration between scientists from different communities.
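Running code at each step of a data life cycle can be illustrated with a small sketch. It is a simplified stand-in, not the Active Data API: the life cycle is reduced to a set of allowed transitions between named states on a single machine, and the class `LifeCycle` with its methods `on`, `create`, `move` and `state` is hypothetical.

```python
from collections import defaultdict


class LifeCycle:
    """Hypothetical life cycle: a set of allowed transitions between named
    states, with callbacks run each time a data item makes a transition."""

    def __init__(self, transitions):
        self._allowed = set(transitions)      # {(from_state, to_state), ...}
        self._handlers = defaultdict(list)    # transition -> callbacks
        self._state = {}                      # data id -> current state

    def on(self, src, dst, handler):
        """Register code to run whenever an item moves from src to dst."""
        self._handlers[(src, dst)].append(handler)

    def create(self, data_id, state):
        """Start tracking a data item in the given initial state."""
        self._state[data_id] = state

    def move(self, data_id, dst):
        """Apply a transition and run its handlers; reject illegal moves."""
        src = self._state[data_id]
        if (src, dst) not in self._allowed:
            raise ValueError(f"illegal transition {src} -> {dst}")
        self._state[data_id] = dst
        for handler in self._handlers[(src, dst)]:
            handler(data_id, src, dst)

    def state(self, data_id):
        """Return the current state of a tracked data item."""
        return self._state[data_id]


log = []
cycle = LifeCycle([("created", "transferred"), ("transferred", "archived")])
cycle.on("created", "transferred",
         lambda d, s, t: log.append(f"{d}: {s} -> {t}"))
cycle.create("scan-001", "created")
cycle.move("scan-001", "transferred")
```

A handler registered this way could send a notification or trigger a follow-up task, in the spirit of the surveillance framework the abstract mentions.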
|
36 |
Approximate Clustering Algorithms for High Dimensional Streaming and Distributed Data
Carraher, Lee A. 22 May 2018
No description available.
|
37 |
A WAVELET APPROACH FOR DEVELOPMENT AND APPLICATION OF A STOCHASTIC PARAMETER SIMULATION SYSTEM
MIRON, ADRIAN 11 October 2001
No description available.
|
38 |
Energy Modeling and Management for Data Services in Multi-Tier Mobile Cloud Architectures
Xu, Zichen 21 November 2016
No description available.
|
39 |
Advanced middleware support for distributed data-intensive applications
Du, Wei 12 September 2005
No description available.
|
40 |
"Armazenamento distribuído de dados e checkpointing de aplicações paralelas em grades oportunistas" / Distributed data storage and checkpointing of parallel applications in opportunistic grids
Camargo, Raphael Yokoingawa de 04 May 2007
Opportunistic computational grids use idle resources from shared machines to execute applications that need large amounts of computational power and/or deal with large amounts of data. But executing computationally intensive parallel applications in dynamic and heterogeneous environments, such as opportunistic grids, is a daunting task. Machines may fail, become inaccessible, or change from idle to occupied unexpectedly, compromising application execution. A fault tolerance mechanism that supports heterogeneous architectures is an important requirement for such systems. In this work, we analyze, implement and evaluate a checkpointing-based fault tolerance mechanism for parallel applications running on opportunistic grids. The mechanism monitors application execution and allows the migration of applications between heterogeneous nodes of the grid. Besides application execution, it is also necessary to manage the data generated and used by those applications. We want a low-cost data storage infrastructure that uses the free disk space of shared grid machines. The system should use these machines to store and recover data only during their idle periods, which requires it to be redundant and fault-tolerant. To solve the data storage problem in opportunistic grids, we designed, implemented and evaluated the OppStore middleware. This middleware provides reliable distributed storage for application data, which can be accessed from any machine in the grid. The machines are organized in clusters, connected by a self-organizing and fault-tolerant peer-to-peer network. Before storage, data is encoded into redundant fragments, so that the original file can be reconstructed using only a subset of those fragments. Finally, to deal with resource heterogeneity, we developed an extension to the Pastry peer-to-peer routing substrate that adds load balancing and support for machine heterogeneity to message routing.
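Reconstructing a file from a subset of redundant fragments can be shown with the simplest possible scheme. The abstract does not say which coding OppStore uses, so the sketch below assumes single XOR parity: `k` data fragments plus one parity fragment, from which any one lost fragment can be rebuilt. The functions `encode` and `decode` and the 4-byte length prefix are illustrative choices, not part of OppStore.

```python
def encode(data: bytes, k: int) -> list:
    """Split data into k equal-size fragments plus one XOR parity fragment.

    The original length is stored in a 4-byte prefix so that padding can
    be removed on decoding."""
    payload = len(data).to_bytes(4, "big") + data
    size = -(-len(payload) // k)                 # ceiling division
    payload = payload.ljust(size * k, b"\0")
    fragments = [payload[i * size:(i + 1) * size] for i in range(k)]
    parity = bytearray(size)
    for frag in fragments:
        for i, byte in enumerate(frag):
            parity[i] ^= byte
    return fragments + [bytes(parity)]


def decode(fragments: list) -> bytes:
    """Rebuild the data from k+1 fragments, at most one of which is None."""
    missing = [i for i, f in enumerate(fragments) if f is None]
    if len(missing) > 1:
        raise ValueError("single parity can recover only one lost fragment")
    fragments = list(fragments)
    if missing:
        # The XOR of all k+1 fragments is zero, so the lost fragment
        # equals the XOR of the surviving ones.
        size = len(next(f for f in fragments if f is not None))
        rebuilt = bytearray(size)
        for frag in fragments:
            if frag is not None:
                for i, byte in enumerate(frag):
                    rebuilt[i] ^= byte
        fragments[missing[0]] = bytes(rebuilt)
    payload = b"".join(fragments[:-1])
    length = int.from_bytes(payload[:4], "big")
    return payload[4:4 + length]


data = b"opportunistic grid checkpoint"
stored = encode(data, k=4)
stored[2] = None                  # simulate a machine that became unavailable
recovered = decode(stored)
```

Tolerating several simultaneous losses, as a grid of unreliable shared machines would require, needs a stronger erasure code than this single-parity example.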
|