71. Integração de bancos de dados heterogêneos utilizando grades computacionais / Heterogeneous databases integration using grid computing. Kakugawa, Fernando Ryoji, 18 November 2010.
Databases are usually designed to support a specific application domain, which makes data access and data sharing arduous when database integration is required. Many research efforts have aimed at integrating data, ranging from software built for one specific application to more radical solutions such as redesigning all the databases involved, which shows that open questions remain and that the field is far from definitive solutions. This work presents concepts and strategies for integrating heterogeneous databases and implements them in DIGE, a tool for building database systems that integrate different heterogeneous relational databases using grid computing. The resulting system shares access while leaving the data stored at its site of origin, so users access data held at other institutions as if it were stored locally. Application programmers can query and manipulate the data conventionally in SQL, without concern for the location or schema of each database, and system administrators can easily add or remove databases without requesting changes to the final application.
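DIGE's internals are not given in the abstract. Purely as an illustration of the location-transparency idea it describes, the following Python sketch fans one SQL statement out over several databases and merges the rows; SQLite stands in for the grid-hosted relational sites, the class name FederatedDB is invented, and grid concerns such as security, schema mapping and distributed query planning are deliberately omitted.

```python
import sqlite3

class FederatedDB:
    """Minimal mediator: presents several databases behind one SQL call,
    hiding where each row lives. A real system such as DIGE also handles
    schema mapping, grid security and distributed query planning; none of
    that is modelled here."""

    def __init__(self, locations):
        # one connection per participating site
        self.conns = [sqlite3.connect(loc) for loc in locations]

    def query(self, sql, params=()):
        # fan the same statement out to every site and merge the rows,
        # so the caller never sees where each row is actually stored
        rows = []
        for conn in self.conns:
            rows.extend(conn.execute(sql, params).fetchall())
        return rows

# usage: two in-memory databases stand in for two remote institutions
db = FederatedDB([":memory:", ":memory:"])
for i, conn in enumerate(db.conns):
    conn.execute("CREATE TABLE records (id INTEGER, origin TEXT)")
    conn.execute("INSERT INTO records VALUES (?, ?)", (i, f"site-{i}"))
print(db.query("SELECT id, origin FROM records"))
```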
72. Modelo para análise de desempenho do processo de replicação de dados em portais de biodiversidade / Model for performance analysis of the replication process of biodiversity portal data. Salvanha, Pablo, 8 December 2009.
Many institutions maintain collections of biological specimens and use computational tools to digitize their data and make it available through biodiversity data portals. An example of such a tool is the specimen portal used by GBIF (Global Biodiversity Information Facility), which centralizes in its databases millions of records coming from institutions in different locations. Replication of the local databases into the portals is performed through protocols (DiGIR/TAPIR) and data schemas (DarwinCore). However, this solution demands a large amount of time, covering both the transfer of data fragments and their processing inside the portal. As digitization grows within the institutions, the scenario tends to worsen, making it ever harder to keep the portal data up to date. This research proposes an analysis of the data replication process in order to evaluate its performance, using the IABIN pollinator biodiversity portal as a case study; besides conventional data replication, this portal also supports interaction data. With the results of this research it is possible to simulate situations before carrying them out, predicting their performance. These results can also contribute to future improvements of the process, aiming to reduce the time needed to make data available in biodiversity portals.
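The abstract does not state the model's equations. A minimal sketch of this kind of cost model might split total replication time into page-wise transfer plus in-portal processing, as below; every constant is an assumed placeholder to be calibrated from measurements, not a value taken from the thesis.

```python
def replication_time(n_records, records_per_page, page_overhead_s,
                     transfer_s_per_record, process_s_per_record):
    """Coarse model of DiGIR/TAPIR-style harvesting: records are pulled
    in pages, then processed inside the portal. All parameters are
    illustrative constants to be fitted against real measurements."""
    n_pages = -(-n_records // records_per_page)  # ceiling division
    transfer = n_pages * page_overhead_s + n_records * transfer_s_per_record
    processing = n_records * process_s_per_record
    return transfer + processing

# e.g. 2 million records harvested in 1000-record pages
print(replication_time(2_000_000, 1000, 0.5, 0.002, 0.004) / 3600, "hours")
```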
75. Distribution design for complex value databases: a dissertation presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Information Systems at Massey University. Ma, Hui, January 2007.
Distribution design for databases usually addresses the problems of fragmentation, allocation and replication, but the main purposes of distribution are to improve performance and to increase system reliability. The former aspect is particularly relevant where the desire to distribute data originates from the distributed nature of an organization, with many data needs arising only locally, i.e., some data are retrieved and processed at only one or at most very few locations. Query optimization should therefore be treated as an intrinsic part of distribution design. Due to the interdependencies between fragmentation, allocation and distributed query optimization, it is not efficient to study each problem in isolation to obtain an overall optimal design. However, the combined problem of fragmentation, allocation and distributed query optimization is NP-hard, and thus requires heuristics to generate efficient solutions. In this thesis the effects of fragmentation and allocation on query processing are investigated using a query cost model. The databases considered are defined on complex value data models, which capture complex value, object-oriented and XML-based databases. The emphasis on complex value databases enables a large variety of schema fragmentations, while at the same time imposing restrictions on the way schemata can be fragmented. It is shown that the allocation of locations to the nodes of an optimized query tree is only marginally affected by the allocation of fragments. This implies that optimization of query processing and optimization of fragment allocation are largely orthogonal, leading to several scenarios for fragment allocation. It is therefore reasonable to assume that optimized queries are given, with subqueries having selection and projection operations applied to their leaves. Under this assumption, heuristic procedures can be developed to find an "optimal" fragmentation and allocation; in particular, cost-based algorithms for primary horizontal, derived horizontal and vertical fragmentation are presented.
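As a toy illustration of the two design steps the abstract names, the sketch below fragments a relation by disjoint predicates and then allocates each fragment to the site that accesses it most. The thesis's actual algorithms are cost-based and operate on complex value schemata; the predicate map, the frequency table and the "most frequent site" rule here are simplifying assumptions.

```python
from collections import defaultdict

def horizontal_fragments(rows, predicates):
    """Primary horizontal fragmentation: each minterm predicate defines
    one fragment. `predicates` maps a fragment name to a boolean function
    over a row; a real design derives disjoint minterms from the query
    workload rather than taking them as given."""
    frags = defaultdict(list)
    for row in rows:
        for name, pred in predicates.items():
            if pred(row):
                frags[name].append(row)
                break  # minterms are disjoint: first match wins
    return dict(frags)

def allocate(frags, access_freq):
    """Place each fragment at the site that accesses it most often,
    the simplest 'best fit' allocation heuristic."""
    return {f: max(access_freq[f], key=access_freq[f].get) for f in frags}

rows = [{"id": 1, "region": "north"}, {"id": 2, "region": "south"}]
frags = horizontal_fragments(rows, {
    "F_north": lambda r: r["region"] == "north",
    "F_south": lambda r: r["region"] == "south",
})
print(allocate(frags, {"F_north": {"s1": 9, "s2": 1},
                       "F_south": {"s1": 2, "s2": 7}}))
```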
76. Ερωτήματα συνένωσης και βαθμολογημένης συνένωσης σε κατανεμημένα συστήματα / Join and ranked-join queries in distributed systems. Πατλάκας, Ιωάννης, 28 February 2013.
The advent of peer-to-peer databases and the recent rise of cloudstores as key large-scale data management paradigms have led researchers to look into the problem of supporting complex queries in a fully decentralized manner. Among the classes of queries considered in related centralized work, one stands out as largely overlooked in widely distributed settings, albeit very common in real-world workloads: top-k joins. With this work we tackle such queries over data distributed across an internet-scale network. Our contributions include: (a) a novel distributed indexing scheme, allowing access to tuples in both a random and an ordered manner; (b) a set of query processing algorithms based on a novel adaptation of rank-join and threshold algorithms, appropriate for use in a distributed environment; (c) a novel use of Bloom filters and histograms to further reduce the bandwidth consumption of the above algorithms, together with a proof that the algorithms based on Bloom filters and histograms produce the correct top-k results; and (d) an in-depth discussion of the design space and related performance trade-offs. We further investigate the efficiency and quality of the proposed solutions through an elaborate experimental evaluation, showcasing their appropriateness for widely distributed and massively decentralized environments and highlighting the related trade-offs.
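A minimal, centralized sketch of the rank-join/threshold family of algorithms the abstract builds on is given below: it consumes two score-sorted inputs and stops as soon as the k-th best join result beats an upper bound on anything not yet seen. It assumes non-negative scores and a summed combined score, and none of the distributed machinery (the index, Bloom filters, histograms) is modelled.

```python
import heapq

def topk_join(r, s, k):
    """Toy rank-join in the HRJN/threshold style: r and s are (key, score)
    lists sorted by descending score; tuples join on key and the combined
    score is the sum. The thesis adapts this style of algorithm to a
    distributed setting, where Bloom filters and histograms prune what
    each peer must ship; none of that machinery appears here."""
    seen_r, seen_s = {}, {}
    results = []  # heap of (combined_score, key)
    i = j = 0
    while i < len(r) or j < len(s):
        # alternate between the two inputs, or drain whichever remains
        if i < len(r) and (i <= j or j >= len(s)):
            key, score = r[i]; i += 1
            seen_r[key] = score
            if key in seen_s:
                heapq.heappush(results, (score + seen_s[key], key))
        else:
            key, score = s[j]; j += 1
            seen_s[key] = score
            if key in seen_r:
                heapq.heappush(results, (seen_r[key] + score, key))
        # upper bound on any join result not yet formed; 0.0 stands in for
        # "input exhausted" and, with non-negative scores, only makes the
        # bound more conservative, never incorrect
        top_r = r[0][1] if r else 0.0
        top_s = s[0][1] if s else 0.0
        next_r = r[i][1] if i < len(r) else 0.0
        next_s = s[j][1] if j < len(s) else 0.0
        threshold = max(top_r + next_s, next_r + top_s)
        best = heapq.nlargest(k, results)
        if len(best) == k and best[-1][0] >= threshold:
            return best  # the k-th result already beats the bound
    return heapq.nlargest(k, results)

r = [("c", 0.9), ("a", 0.8), ("b", 0.3)]
s = [("a", 0.95), ("b", 0.7), ("c", 0.1)]
print(topk_join(r, s, 2))  # the two join pairs with highest summed score
```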
77. Principles for Distributed Databases in Telecom Environment / Principer för distribuerade databaser inom Telecom Miljö. Ashraf, Imran; Khokhar, Amir Shahzed, January 2010.
Centralized databases become a bottleneck for organizations that are physically distributed and access data remotely. Data management is easy in centralized databases, but it carries high communication costs and, most importantly, high response times. The concept of distributing data over various locations is very attractive for such organizations: the database is split into fragments that are distributed to the locations where they are needed. This kind of distribution provides local control of the data, and data access is also very fast. However, concurrency control, query optimization and data allocation are factors that affect the response time and must be investigated before implementing distributed databases. This thesis uses a mixed-method approach to meet its objective. In the quantitative part, we performed an experiment at Ericsson to compare the response times of two databases, one centralized and one fragmented/distributed. A literature review was also done to examine other response-time-related issues such as query optimization, concurrency control and data allocation; it revealed that these factors can further improve response time in a distributed environment. The results of the experiment showed a substantial decrease in response time due to fragmentation and distribution.
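The experiment itself ran on Ericsson's systems and is not reproduced in the abstract. As a desk-side analogue of the comparison, the sketch below times the same query against a full table and against one horizontal fragment holding only the relevant rows; SQLite, the table sizes and the region-based fragmentation are all assumptions, and the network cost a real distributed setup adds is ignored.

```python
import sqlite3, time

def timed(conn, sql):
    # wall-clock time for executing the query and fetching all rows
    t0 = time.perf_counter()
    conn.execute(sql).fetchall()
    return time.perf_counter() - t0

# centralized: all regions in one table
central = sqlite3.connect(":memory:")
central.execute("CREATE TABLE calls (id INTEGER, region TEXT)")
central.executemany("INSERT INTO calls VALUES (?, ?)",
                    [(i, f"r{i % 4}") for i in range(200_000)])

# distributed: this site's fragment holds only region 'r0'
fragment = sqlite3.connect(":memory:")
fragment.execute("CREATE TABLE calls (id INTEGER, region TEXT)")
fragment.executemany("INSERT INTO calls VALUES (?, ?)",
                     [(i, "r0") for i in range(0, 200_000, 4)])

q = "SELECT COUNT(*) FROM calls WHERE region = 'r0'"
print("central :", timed(central, q))
print("fragment:", timed(fragment, q))
```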
78. Création d'un environnement de gestion de base de données "en grille" : application à l'échange de données médicales / Creating a "grid" database management environment: application to medical data exchange. De Vlieger, Paul, 12 July 2011.
Nominative medical data exchange is a growing challenge involving numerous technical, legislative and relationship barriers. New technologies, in particular grid computing, offer a new approach to medical data exchange. The development of the gLite grid middleware within the EGEE project opened new perspectives in distributed data access and database federation. The main requirements of a medical data exchange system, beyond a high level of security, come from the way data are collected and provided. The original client-server model of computing has many drawbacks regarding data ownership, updates, control, availability and scalability. The method described in this dissertation takes another approach to accessing medical data: using the grid security layer and a robust user authentication and access control system, we build a dedicated grid network able to federate distributed medical databases. In this way, data owners keep control over the data they produce. This approach is not totally straightforward, especially for patient identification and medical data linkage, which is an open problem even in centralized medical systems. A new method is therefore proposed to handle these issues in a highly distributed environment. The Sentinelle project (RSCA - Réseau Sentinelle Cancer Auvergne, www.e-sentinelle.org) constitutes the applicative framework of this work, in the field of breast and colon cancer screening in the French Auvergne region. The first objective is to allow anatomic pathology reports to be exchanged between laboratories and screening structures in a way that complies with pathologists' requirements and legal constraints; the second is to provide a framework for epidemiologists to access high-quality medical data for statistical studies and global epidemiology.
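The dissertation's own linkage method is not described in the abstract. One classic approach to the patient-identification problem it raises, shown here only as a hedged illustration and not as the thesis's technique, is to derive keyed pseudonyms from normalized identity traits so that sites can match records without exchanging nominative data; the shared secret and the trait normalization below are assumptions.

```python
import hashlib, hmac

def pseudonym(secret: bytes, *traits: str) -> str:
    """Keyed hash over normalized identity traits (name, birth date, ...).
    Two sites holding the same secret derive the same pseudonym for the
    same patient without exchanging nominative data. This is one classic
    linkage idea; the dissertation develops its own method, not shown."""
    msg = "|".join(t.strip().lower() for t in traits).encode()
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()

key = b"shared-secret-distributed-out-of-band"
site_a = pseudonym(key, "Dupont", "Marie", "1964-03-02")
site_b = pseudonym(key, " DUPONT", "Marie", "1964-03-02")
print(site_a == site_b)  # True: records can be linked across sites
```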
79. Dynamic First Match: Reducing Resource Consumption of First Match Queries in MySQL NDB Cluster. Kumar, Hara, January 2020.
Dynamic First Match is a learned heuristic that reduces the resource consumption of first-match queries in a multi-threaded, distributed relational database while having a minimal effect on latency. Traditional first-match range scans run in parallel across all data fragments simultaneously, which can return many redundant results. Dynamic First Match reduces this redundancy by learning to scan only a portion of the data fragments first, before scanning the remaining fragments with a pruned data set. Benchmark tests show that Dynamic First Match can reduce the resource consumption of queries containing first-match range scans by over 40% while having a minimal effect on latency.
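The NDB implementation details are not given in the abstract. A rough sequential sketch of the two-phase idea, with learned_order and probe_count standing in for whatever statistics the real heuristic maintains, could look like this; in the real system the second phase is a parallel scan over the remaining fragments with a pruned data set, which the plain loop below only approximates.

```python
def first_match(fragments, predicate, learned_order, probe_count=1):
    """Two-phase sketch of the Dynamic-First-Match idea: probe the
    fragments most likely to contain the first match (per a learned
    ranking) before falling back to the rest. All names and the
    probe_count parameter are illustrative stand-ins."""
    for frag_id in learned_order[:probe_count]:   # phase 1: likely fragments
        for row in fragments[frag_id]:
            if predicate(row):
                return row                        # done: no broadcast needed
    for frag_id in learned_order[probe_count:]:   # phase 2: remaining fragments
        for row in fragments[frag_id]:            # (parallel in the real system)
            if predicate(row):
                return row
    return None

fragments = {0: [1, 4], 1: [7, 9], 2: [2, 8]}
print(first_match(fragments, lambda x: x > 6, learned_order=[1, 0, 2]))
```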
80. Parallel and Distributed Databases, Data Mining and Knowledge Discovery. Valduriez, Patrick; Lehner, Wolfgang; Talia, Domenico; Watson, Paul, 17 July 2023.
Managing and efficiently analysing the vast amounts of data produced by a huge variety of data sources is one of the big challenges in computer science. The development and implementation of algorithms and applications that can extract information diamonds from these ultra-large, and often distributed, databases is a key challenge for the design of future data management infrastructures. Today’s data-intensive applications often suffer from performance problems and an inability to scale to high numbers of distributed data sources. Therefore, distributed and parallel databases have a key part to play in overcoming resource bottlenecks, achieving guaranteed quality of service and providing system scalability. The increased availability of distributed architectures, clusters, Grids and P2P systems, supported by high performance networks and intelligent middleware provides parallel and distributed databases and digital repositories with a great opportunity to cost-effectively support key everyday applications. Further, there is the prospect of data mining and knowledge discovery tools adding value to these vast new data resources by automatically extracting useful information from them.