191

Towards effective analysis of big graphs : from scalability to quality

Tian, Chao January 2017
This thesis investigates the central issues underlying graph analysis, namely, scalability and quality. We first study the incremental problems for graph queries, which aim to compute the changes to the old query answer in response to updates to the input graph. The incremental problem is called bounded if its cost is decided by the sizes of the query and the changes only. No matter how desirable, however, our first results are negative: for common graph queries such as graph traversal, connectivity, keyword search and pattern matching, their incremental problems are unbounded. In light of the negative results, we propose two new characterizations for the effectiveness of incremental computation, and show that the incremental computations above can still be effectively conducted, by either reducing the computations on big graphs to small data, or incrementalizing batch algorithms by minimizing unnecessary recomputation. We next study the problems with regard to improving the quality of graphs. To uniquely identify entities represented by vertices in a graph, we propose a class of keys that are recursively defined in terms of graph patterns and interpreted with subgraph isomorphism. As an application, we study the entity matching problem, which is to find all pairs of entities in a graph that are identified by a given set of keys. Although the problem is proved to be intractable and cannot be parallelized in logarithmic rounds, we provide two parallel scalable algorithms for it. In addition, to catch numeric inconsistencies in real-life graphs, we extend graph functional dependencies with linear arithmetic expressions and comparison predicates, referred to as NGDs. NGDs strike a balance between expressivity and complexity: if we allow non-linear arithmetic expressions, even of degree at most 2, the satisfiability and implication problems become undecidable. A localizable incremental algorithm is developed to detect errors with NGDs, where the cost is determined by the small neighborhoods of the nodes in the updates rather than by the entire graph. Finally, a rule-based method to clean graphs is proposed. We extend graph entity dependencies (GEDs) as data quality rules. Given a graph, a set of GEDs and a block of ground truth, we fix violations of the GEDs in the graph by combining data repairing and object identification. The method finds certain fixes to the errors detected by the GEDs, i.e., as long as the GEDs and the ground truth are correct, the fixes are guaranteed correct as their logical consequences. Several fundamental results underlying the method are established, and an algorithm is developed to implement the method. We also parallelize the method and guarantee that its running time decreases as more processors are used.
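To make the flavour of such a dependency concrete, here is a minimal Python sketch; the property-graph encoding, the "order" pattern and the linear constraint are invented for illustration and are not the thesis's NGD syntax or detection algorithms:

```python
# Illustrative sketch only: a toy check in the spirit of an NGD (a graph
# functional dependency extended with linear arithmetic and comparisons).
# The graph encoding, the "order" pattern and the constraint are invented.

nodes = {
    "o1": {"type": "order", "subtotal": 100.0, "shipping": 9.99, "total": 109.99},
    "o2": {"type": "order", "subtotal": 80.0,  "shipping": 5.0,  "total": 90.0},
    "c1": {"type": "customer", "name": "Ann"},
}

def ngd_violations(nodes):
    """Nodes matching the 'order' pattern must satisfy the linear constraint
    total = subtotal + shipping; return the nodes that violate it."""
    violations = []
    for node_id, attrs in nodes.items():
        if attrs.get("type") != "order":
            continue                      # pattern does not match this node
        expected = attrs["subtotal"] + attrs["shipping"]
        if abs(attrs["total"] - expected) > 1e-9:
            violations.append((node_id, attrs["total"], expected))
    return violations

def recheck(nodes, updated_ids):
    """Localized re-check after an update: only the touched nodes are examined,
    mirroring the idea that detection cost should not depend on the whole graph."""
    return ngd_violations({i: nodes[i] for i in updated_ids})

print(ngd_violations(nodes))   # [('o2', 90.0, 85.0)] -- o2 is inconsistent
print(recheck(nodes, {"o1"}))  # [] -- o1 alone satisfies the constraint
```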
192

Active Data - Enabling Smart Data Life Cycle Management for Large Distributed Scientific Data Sets / Active Data − Gestion Intelligente du Cycle de Vie des Grands Jeux de Données Scientifiques Distribués

Simonet, Anthony 08 July 2015
In all domains, scientific progress relies more and more on our ability to exploit ever-growing volumes of data. However, as data volumes increase, their management becomes more difficult. A key point is to deal with the complexity of data life cycle management, i.e. all the operations that happen to data between their creation and their deletion: transfer, archiving, replication, disposal, etc. These formerly straightforward operations become intractable when data volumes grow dramatically, because of the heterogeneity of data management software on the one hand and the complexity of the infrastructures involved on the other. In this thesis, we introduce Active Data, a meta-model, an implementation and a programming model that formally and graphically represent the life cycle of data distributed across an assemblage of heterogeneous systems and infrastructures, naturally exposing replication, distribution and the different identifiers of the data.
Once connected to existing applications, Active Data exposes to users and programs, at runtime, the progress of data through their life cycle, and keeps track of them as they pass from one system to another. The Active Data programming model allows code to be executed at each step of the data life cycle. Programs developed with Active Data have access at any time to the complete state of the data, across all systems and infrastructures they are distributed over. We present micro-benchmarks and usage scenarios that demonstrate the expressivity of the programming model and the quality of the implementation. Finally, we describe the implementation of a Data Surveillance framework based on Active Data for the Advanced Photon Source experiment that allows scientists to monitor the progress of their data, automate most manual tasks, receive relevant notifications out of a huge stream of events, and detect and recover from errors without human intervention. This work opens interesting perspectives, in particular in data provenance and open data, while facilitating collaboration between scientists from different communities.
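The heart of the programming model — user code attached to life-cycle transitions — can be conveyed with a small sketch; the class and method names below are invented and are not the actual Active Data API:

```python
# Minimal sketch of life-cycle programming: user code is attached to
# life-cycle transitions and runs whenever a data item crosses them.
# Names are invented for illustration; the real model is richer
# (replication, cross-system identifiers, distributed execution).

from collections import defaultdict

class LifeCycle:
    def __init__(self, transitions):
        self.transitions = set(transitions)   # allowed (src, dst) pairs
        self.handlers = defaultdict(list)     # (src, dst) -> callbacks
        self.state = {}                       # data id -> current state

    def on(self, src, dst, handler):
        self.handlers[(src, dst)].append(handler)

    def publish(self, data_id, dst):
        src = self.state.get(data_id, "created")
        if (src, dst) not in self.transitions:
            raise ValueError(f"illegal transition {src} -> {dst}")
        self.state[data_id] = dst
        for handler in self.handlers[(src, dst)]:
            handler(data_id, src, dst)        # user code runs here

lc = LifeCycle({("created", "transferred"), ("transferred", "archived")})
lc.on("created", "transferred",
      lambda d, s, t: print(f"{d}: checksum verified after transfer"))
lc.on("transferred", "archived",
      lambda d, s, t: print(f"{d}: catalogue updated, replica registered"))

lc.publish("scan-042", "transferred")
lc.publish("scan-042", "archived")
```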
193

Abordagem para Qualidade de Serviço em Banco de Dados Multi-Inquilinos em Nuvem / Approach for Quality of Service to Multi-Tenant Databases in the Cloud

Leonardo Oliveira Moreira 25 July 2014
Fundação de Amparo à Pesquisa do Estado do Ceará
Cloud computing is a well-established paradigm of computing resource usage, whereby hardware infrastructure, software and platforms for the development of new applications are offered as services available remotely and globally. Cloud computing users give up their own infrastructure and instead consume the services offered by cloud providers, to which they delegate aspects of Quality of Service (QoS), assuming costs proportional to the amount of resources they use in a pay-per-use model. These QoS guarantees are established between the service provider and the user, and are expressed through Service Level Agreements (SLAs): contracts that specify a level of quality that must be met, and penalties in case of failure. The majority of cloud applications are data-driven, and thus Database Management Systems (DBMSs) are potential candidates for cloud deployment. A cloud DBMS must serve a large number of applications, or tenants. Multi-tenant models have been used to consolidate multiple tenants within a single DBMS, favoring the efficient sharing of resources and making it possible to manage a large number of tenants with irregular workload patterns. On the other hand, cloud providers must be able to reduce operational costs while keeping quality levels as agreed. For many applications, most of the time spent processing a request is spent inside the DBMS. It therefore becomes important to apply a quality model to the DBMS to obtain the required performance. Dynamic provisioning techniques are geared to handling irregular workloads so that SLA violations are avoided. Accordingly, a strategy is needed to adjust the cloud at the moment a behavior that may violate the SLA of a given tenant (database) is predicted. Allocation techniques are applied in order to exploit the resources already available in the environment before resorting to provisioning. Based on monitoring systems and optimization models, the allocation techniques decide the best place to assign a given tenant to. To transfer a tenant efficiently and with minimal service interruption, live migration techniques are adopted. It is believed that the combination of these three techniques can contribute to a robust QoS solution for cloud databases that minimizes SLA violations. Faced with these challenges, this thesis proposes an approach, called PMDB, to improve the QoS of multi-tenant DBMSs in the cloud. The approach aims to reduce the number of SLA violations and to take advantage of the available resources, using techniques that perform workload prediction and tenant allocation and migration when resources of greater capacity are needed. An architecture was proposed and a prototype implementing these techniques was developed, together with monitoring and QoS strategies oriented towards database applications in the cloud. Some performance-oriented experiments were then specified to show the effectiveness of the approach.
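As a rough illustration of how prediction, allocation and migration interact, consider the following toy sketch; the predictor, thresholds, node-selection rule and all numbers are invented and do not correspond to the PMDB algorithms:

```python
# Toy sketch: predict each tenant's response time from recent observations
# and move a tenant that is about to violate its SLA to the least-loaded
# node. All values, the predictor and the policy are invented.

sla_ms = {"tenant_a": 200, "tenant_b": 300}            # agreed response times
recent_latency_ms = {"tenant_a": [150, 180, 210],      # sliding windows
                     "tenant_b": [120, 130, 125]}
node_load = {"node1": 0.85, "node2": 0.40}             # fraction of capacity
placement = {"tenant_a": "node1", "tenant_b": "node1"}

def predicted_latency(window):
    # naive predictor: weighted average biased toward the latest observation
    weights = range(1, len(window) + 1)
    return sum(w * x for w, x in zip(weights, window)) / sum(weights)

for tenant, window in recent_latency_ms.items():
    forecast = predicted_latency(window)
    if forecast > 0.9 * sla_ms[tenant]:                # act before violating
        target = min(node_load, key=node_load.get)
        if target != placement[tenant]:
            print(f"live-migrate {tenant}: forecast {forecast:.0f} ms, "
                  f"SLA {sla_ms[tenant]} ms, move {placement[tenant]} -> {target}")
            placement[tenant] = target
```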
194

Object serialization vs relational data modelling in Apache Cassandra: a performance evaluation

Johansen, Valdemar January 2015
Context. In newer database solutions designed for large-scale, cloud-based services, database performance is of particular concern as these services face scalability challenges due to I/O bottlenecks. These issues can be alleviated through various data model optimizations that reduce I/O loads. Object serialization is one such approach. Objectives. This study investigates the performance of serialization using the Apache Avro library in the Cassandra database. Two different serialized data models are compared with a traditional relational database model. Methods. This study uses an experimental approach that compares read and write latency using Twitter data in JSON format. Results. Avro serialization is found to improve performance. However, the extent of the performance benefit is found to be highly dependent on the serialization granularity defined by the data model. Conclusions. The study concludes that developers seeking to improve database throughput in Cassandra through serialization should prioritize data model optimization, as serialization by itself will not outperform relational modelling in all use cases. The study also recommends that further work be done to investigate additional use cases, as there are potential performance issues with serialization that are not covered in this study.
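The granularity trade-off can be sketched as two alternative data models for the same tweet; the table layouts and record are invented, the CQL statements are shown as strings only, and json.dumps stands in for an Avro encoder so the sketch stays dependency-free (with Avro the payload would be compact, schema-driven binary):

```python
# Sketch of two data models: (1) a relational-style row with one column per
# tweet field, and (2) a coarse-grained row storing the whole tweet as a
# single serialized blob. Details are invented for illustration.

import json

tweet = {"id": 42, "user": "alice", "text": "hello", "retweets": 3}

# (1) fine-grained, relational-style row: one bound value per column
relational_cql = ("INSERT INTO tweets_rel (id, user, text, retweets) "
                  "VALUES (?, ?, ?, ?)")
relational_row = (tweet["id"], tweet["user"], tweet["text"], tweet["retweets"])

# (2) coarse-grained row: primary key plus one opaque serialized payload
serialized_cql = "INSERT INTO tweets_blob (id, payload) VALUES (?, ?)"
serialized_row = (tweet["id"], json.dumps(tweet).encode("utf-8"))

# Reading side: model (2) must deserialize the whole payload even when a
# query needs only one field, which is where serialization granularity
# starts to matter for latency.
payload = serialized_row[1]
print(json.loads(payload)["text"])   # hello
```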
195

Distributed data management with access control : social Networks and Data of the Web / Gestion de Données Distribuées avec Contrôle d’Accès : réseaux sociaux et données du Web

Galland, Alban 28 September 2011
The amount of information on the Web is growing very rapidly. Users as well as companies bring data to the network and are willing to share it with others. They quickly reach a situation where their information is hosted on many machines they own and on a large number of autonomous systems where they have accounts. Management of all this information is rapidly becoming beyond human expertise. We introduce WebdamExchange, a novel distributed knowledge-base model that includes logical statements for specifying information, access control, secrets, distribution, and knowledge about other peers. These statements can be communicated, replicated, queried, and updated, while keeping track of time and provenance. The resulting knowledge guides distributed data management. The WebdamExchange model is based on WebdamLog, a new rule-based language for distributed data management that combines, in a formal setting, deductive rules as in Datalog with negation (to specify intensional data) and active rules as in Datalog:: (for updates and communications). The model provides a novel setting with a strong emphasis on dynamicity and interactions (in a Web 2.0 style). Because the model is powerful, it provides a clean basis for the specification of complex distributed applications. Because it is simple, it provides a formal framework for studying many facets of the problem, such as distribution, concurrency, and expressivity in the context of distributed autonomous peers. We also discuss an implementation of a proof-of-concept system that handles all the components of the knowledge base, and experiments with a lighter system designed for smartphones. We believe that these contributions are a good foundation to overcome the problems of Web data management, in particular with respect to access control.
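To give a feel for rules over location-annotated facts, here is a toy naive-fixpoint sketch in Python; the relations, the rule and the evaluation strategy are invented for illustration and do not reproduce WebdamLog's syntax or semantics:

```python
# Toy flavour of location-annotated facts (relation@peer): a naive fixpoint
# that derives which peer should receive a photo from 'friend' and 'photo'
# facts held by a peer. Not WebdamLog's actual syntax or semantics.

# facts: (relation, peer_holding_the_fact, tuple)
facts = {
    ("friend", "alice", ("alice", "bob")),
    ("friend", "alice", ("alice", "carol")),
    ("photo",  "alice", ("p1", "alice")),
}

# rule (informally):  show@Y(P) :- photo@alice(P, alice), friend@alice(alice, Y)
def apply_rule(facts):
    derived = set()
    for _, peer1, (p, owner) in [f for f in facts if f[0] == "photo"]:
        for _, peer2, (x, y) in [f for f in facts if f[0] == "friend"]:
            if peer1 == peer2 == "alice" and owner == x == "alice":
                derived.add(("show", y, (p,)))   # fact delegated to peer y
    return derived

# naive fixpoint: keep applying the rule until nothing new is derived
while True:
    new = apply_rule(facts) - facts
    if not new:
        break
    facts |= new

print(sorted(f for f in facts if f[0] == "show"))
# [('show', 'bob', ('p1',)), ('show', 'carol', ('p1',))]
```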
196

Product Information Management

Antonov, Anton January 2012
Product Information Management (PIM) is a field that deals with product master data management and brings together the experience and principles of data integration and data quality. PIM merges the specific attributes of products across all channels in the supply chain. By unifying, centralizing and standardizing product information on one platform, timely, high-quality information with added value can be achieved. The goal of the theoretical part of the thesis is to construct a picture of PIM, place it in a broader context, define and describe the various parts of a PIM solution, describe the main differences in characteristics between product data and customer data, and summarize the available information on the administration and management of PIM data-quality knowledge bases relevant to solving practical problems. The practical part of the thesis focuses on designing the structure, the content and the method of populating the knowledge base of a Product Information Management solution in the environment of the DataFlux software tools from SAS Institute. The practical part further includes the analysis of real product data, the design of the definitions and objects of the knowledge base, the creation of a reference database, and the testing of the knowledge base with the help of specially designed web services.
197

MDM of Product Data / MDM produktovych dat (MDM of Product Data)

Čvančarová, Lenka January 2012
This thesis is focused on Master Data Management of product data. At present, most publications on the topic of MDM concentrate on customer data, and a very limited number of sources focus solely on product data. Some resources do attempt to cover MDM in full depth, yet even those publications are typically very customer-oriented. The lack of literature oriented towards Product MDM became one of the motivations for this thesis. Another motivation was to outline and analyze the specifics of Product MDM in the context of its implementation and the software requirements for a vendor of MDM application software. For this I chose to create and describe a methodology for implementing MDM of product data. The methodology was derived from personal experience on projects focused on MDM of customer data, applied to the findings from the theoretical part of this thesis. By analyzing product data characteristics and their impact on MDM implementation, as well as their requirements for application software, this thesis helps vendors of Customer MDM understand the challenges of Product MDM and therefore embark on the product data MDM domain. Moreover, this thesis can also serve as an information resource for enterprises considering adopting MDM of product data into their infrastructure.
198

Faculty Attitudes Towards Institutional Repositories

Hall, Nathan F. 12 1900
The purpose of the study was to explore faculty attitudes towards institutional repositories in order to better understand their research habits and preferences. A better understanding of faculty needs and attitudes will enable academic libraries to improve institutional repository services and policies. A phenomenological approach was used to interview fourteen participants and conduct eight observations to determine how tenure-track faculty want to disseminate their research, as well as their attitudes towards sharing research data. Interviews were transcribed and coded into emerging themes. Participants reported that they want their research to be read, used, and to have an impact. While almost all faculty see institutional repositories as something that would be useful for increasing the impact and accessibility of their research, they would consider publishers' rights before depositing work in a repository. Researchers with quantitative data and researchers in the humanities are more likely to share their data than researchers with qualitative or mixed data, which are more open to interpretation and inference. Senior faculty members are more likely than junior faculty members to be concerned about the context of their research data. Junior faculty members' perceptions of the requirements for tenure inhibit their inclination to publish in open access journals or to share data. The study used a novel approach to provide an understanding of faculty attitudes and the structural functionalism of scholarly communication.
199

Allocation Strategies for Data-Oriented Architectures

Kiefer, Tim 09 October 2015
Data orientation is a common design principle in distributed data management systems. In contrast to process-oriented or transaction-oriented system designs, data-oriented architectures are based on data locality and function shipping. The tight coupling of data and processing thereon is implemented in different systems in a variety of application scenarios such as data analysis, database-as-a-service, and data management on multiprocessor systems. Data-oriented systems, i.e., systems that implement a data-oriented architecture, bundle data and operations together in tasks which are processed locally on the nodes of the distributed system. Allocation strategies, i.e., methods that decide the mapping from tasks to nodes, are core components in data-oriented systems. Good allocation strategies can lead to balanced systems, while bad allocation strategies cause skew in the load and therefore suboptimal application performance and infrastructure utilization. Optimal allocation strategies are hard to find given the complexity of the systems, the complicated interactions of tasks, and the huge solution space. To ensure the scalability of data-oriented systems and to keep them manageable with hundreds of thousands of tasks, thousands of nodes, and dynamic workloads, fast and reliable allocation strategies are mandatory. In this thesis, we develop novel allocation strategies for data-oriented systems based on graph partitioning algorithms. To that end, we show that systems from different application scenarios with different abstraction levels can be generalized to generic infrastructure and workload descriptions. We use weighted graph representations to model infrastructures with bounded and unbounded, i.e., overcommitted, resources and possibly non-linear performance characteristics. Based on our generalized infrastructure and workload model, we formalize the allocation problem, which seeks valid and balanced allocations that minimize communication. Our allocation strategies partition the workload graph using solution heuristics that work with single and multiple vertex weights. Novel extensions to these solution heuristics can be used to balance penalized and secondary graph partition weights. These extensions enable the allocation strategies to handle infrastructures with non-linear performance behavior. On top of the basic algorithms, we propose methods to incorporate heterogeneous infrastructures and to react to changing workloads and infrastructures by incrementally updating the partitioning. We evaluate all components of our allocation strategy algorithms and show their applicability and scalability with synthetic workload graphs. In end-to-end performance experiments in two actual data-oriented systems, a database-as-a-service system and a database management system for multiprocessor systems, we prove that our allocation strategies outperform alternative state-of-the-art methods.
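The allocation problem can be conveyed with a deliberately simple greedy heuristic over a weighted workload graph; the tasks, weights and balance factor are invented, and the heuristic is far weaker than the multi-constraint partitioning strategies developed in the thesis:

```python
# Greedy sketch of the allocation idea: place weighted tasks on nodes so
# that load stays balanced while few edges (communication) cross nodes.
# Illustration only; values and policy are invented.

task_weight = {"t1": 4, "t2": 3, "t3": 3, "t4": 2, "t5": 2}
edges = {("t1", "t2"): 5, ("t2", "t3"): 1, ("t3", "t4"): 4, ("t4", "t5"): 2}
nodes = ["node_a", "node_b"]

assignment, load = {}, {n: 0 for n in nodes}
avg_load = sum(task_weight.values()) / len(nodes)

# heaviest tasks first; prefer the node holding the neighbours we talk to
# most, unless that node would become clearly overloaded
for task in sorted(task_weight, key=task_weight.get, reverse=True):
    affinity = {n: 0 for n in nodes}
    for (u, v), w in edges.items():
        if task == u and v in assignment:
            affinity[assignment[v]] += w
        if task == v and u in assignment:
            affinity[assignment[u]] += w
    best = max(nodes, key=lambda n: (affinity[n], -load[n]))
    if load[best] + task_weight[task] > 1.2 * avg_load:  # balance constraint
        best = min(nodes, key=lambda n: load[n])
    assignment[task] = best
    load[best] += task_weight[task]

print(assignment)   # inspect the resulting placement
print(load)         # and the per-node load
```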
200

GeoS: A Service for the Management of Geo-Social Information in a Distributed System

Anderson, Paul 18 May 2010
Applications and services that take advantage of social data usually infer social relationships using information produced only within their own context, using a greatly simplified representation of users' social data. We propose to combine social information from multiple sources into a directed and weighted social multigraph in order to enable novel socially-aware applications and services. We present GeoS, a geo-social data management service which implements a representative set of social inferences and can run on a decentralized system. We demonstrate GeoS' potential for social applications on a collection of social data that combines collocation information and Facebook friendship declarations from 100 students. We demonstrate its performance by testing it both on PlanetLab and a LAN with a realistic workload for a 1000 node graph.
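As a rough sketch of the data model, the snippet below combines two social signals into a directed, weighted multigraph and ranks one user's contacts; the edge types, weights and scoring formula are invented and are not GeoS's actual inference set:

```python
# Sketch: combine several social signals into one directed, weighted
# multigraph and run a simple inference on it. Edge types, weights and the
# "relationship strength" formula are invented for illustration.

from collections import defaultdict

# (source, target) -> list of (edge_type, weight); a multigraph keeps
# parallel edges from different sources instead of merging them
multigraph = defaultdict(list)

def add_edge(src, dst, edge_type, weight):
    multigraph[(src, dst)].append((edge_type, weight))

add_edge("ann", "bob", "facebook_friend", 1.0)
add_edge("ann", "bob", "collocation", 0.7)      # often seen together
add_edge("ann", "eve", "collocation", 0.3)
add_edge("bob", "ann", "facebook_friend", 1.0)

type_importance = {"facebook_friend": 0.6, "collocation": 0.4}

def strength(src, dst):
    # combine parallel edges of different types into one score
    return sum(type_importance[t] * w for t, w in multigraph[(src, dst)])

def closest(src, k=5):
    scores = {dst: strength(s, dst) for (s, dst) in multigraph if s == src}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

print(closest("ann"))   # bob ranks highest (~0.88), then eve (~0.12)
```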
