Global ETD Search

1	Enhancing Query Support in HBase via an extended Coprocessor Framework Vashishtha, Himanshu Unknown Date No description available. HBase, Coprocessors, Endpoints, Hadoop
2	Performance Analysis of Cluster Databases Base on YCSB System Huang, Syun 07 August 2012 Databases are an important part of modern applications. Since the era of SQL and RDBMSs, databases have shifted toward frequently transmitting and processing large amounts of data. ACID focuses on consistency, which does not always suit current needs. In this work, we use YCSB to run several workloads against Cassandra, MongoDB, HBase, and MySQL Cluster, taking the particular features of each into account, to identify the differences between SQL and NoSQL systems. We also analyze the performance of four operations (insert, update, scan, and read) in Cassandra, MongoDB, and HBase, and simulate several conditions. These tests provide a reference for users selecting a database. YCSB Cassandra HBase MongoDB NoSQL
3	Avaliação do Star Schema Benchmark aplicado a bancos de dados NoSQL distribuídos e orientados a colunas / Evaluation of the Star Schema Benchmark applied to NoSQL column-oriented distributed databases systems Scabora, Lucas de Carvalho 06 May 2016 With the explosive growth in data volume, centralized data warehousing solutions become very costly and struggle to scale. These applications need both to store huge volumes of data and to run analytical (OLAP) queries over them efficiently. One solution is to use NoSQL databases managed in parallel and distributed environments. Among the challenges in these scenarios is the need to analyze the performance of data warehousing applications that store the data warehouse (DW) in column-oriented NoSQL databases. Benchmarks are the tools widely used for standardized experimental analysis of different systems. However, most DW benchmarks target relational database systems and centralized environments. This master's research investigates how to extend the Star Schema Benchmark (SSB), which was proposed for centralized DWs, to the distributed, column-oriented NoSQL database HBase. The proposals and analyses are based mainly on experimental performance tests and cover each of the four steps of a benchmark: schema and workload, data generation, parameters and metrics, and validation. The main results are: (i) the FactDate schema, which optimizes queries that access few dimensions of the DW; (ii) an investigation of the applicability of different schemas to different business scenarios; (iii) two additional queries for the SSB workload; (iv) an analysis of the distribution of the data generated by the SSB, verifying whether the data aggregated by OLAP queries are balanced across the nodes of a cluster; (v) an investigation of the influence of three important Hadoop MapReduce parameters on OLAP query processing; (vi) an evaluation of the relationship between OLAP query performance and the number of nodes in a cluster; and (vii) the use of hierarchical materialized views, built with the Spark framework, to speed up consecutive OLAP queries that require progressively more or less aggregated data. These results are important findings intended to enable the future proposal of a benchmark for DWs stored in NoSQL databases and managed in parallel and distributed environments. NoSQL database Data warehouse Hadoop MapReduce HBase NoSQL Star Schema Benchmark
5	Optimalizace čtení dat z distribuované databáze / Optimization of data reading from a distributed database Kozlovský, Jiří January 2019 This thesis focuses on optimizing data reads from the distributed NoSQL database Apache HBase with regard to the desired data granularity. The assignment originated as a product request from the Sklik.cz cost center of the Reklama division of Seznam.cz, a.s., to improve the user experience by letting users of the advertiser web application filter aggregated statistical data when viewing entity performance history.
6	Modélisation NoSQL des entrepôts de données multidimensionnelles massives / Modeling Multidimensional Data Warehouses into NoSQL El Malki, Mohammed 08 December 2016 Decision support systems play a major role in companies and large organizations, enabling analyses dedicated to decision making. With the advent of big data, the volume of data to be analyzed reaches critical sizes, challenging conventional approaches to data warehousing, whose current solutions are mainly based on R-OLAP databases. With the emergence of major Web platforms such as Google, Facebook, Twitter, and Amazon, solutions for managing big data have been developed and are called "Not Only SQL" (NoSQL). These new approaches are a promising path for building multidimensional data warehouses capable of handling large volumes of data. Questioning the R-OLAP approach requires revisiting the principles of modeling multidimensional data warehouses. In this manuscript, we propose processes for implementing multidimensional data warehouses with NoSQL models. We define four processes for each of the two NoSQL models considered, column-oriented and document-oriented. Each of these processes favors a specific kind of processing. Moreover, the NoSQL context makes it more complex to efficiently compute the pre-aggregates that are typically set up in the R-OLAP context (the lattice). We have extended our implementation processes to take the construction of the lattice into account in both selected models. As it is difficult to choose a single NoSQL implementation that efficiently supports all applicable processing, we propose two translation processes. The first concerns intra-model processes, i.e., rules for moving from one implementation to another implementation of the same NoSQL logical model; the second defines the rules for transforming an implementation of one logical model into an implementation of another logical model. NoSQL Document-oriented system Column oriented Big data warehouse OLAP cuboid HBase MongoDB
7	Enhancing Data Processing on Clouds with Hadoop/HBase Zhang, Chen January 2011 In the current information age, large amounts of data are being generated and accumulated rapidly in various industrial and scientific domains. This imposes important demands on data processing capabilities that can extract sensible and valuable information from the large amount of data in a timely manner. Hadoop, the open source implementation of Google's data processing framework (MapReduce, Google File System and BigTable), is becoming increasingly popular and being used to solve data processing problems in various application scenarios. However, being originally designed for handling very large data sets that can be divided easily into parts to be processed independently with limited inter-task communication, Hadoop lacks applicability to a wider range of use cases. As a result, many projects are under way to enhance Hadoop for different application needs, such as data warehouse applications, machine learning and data mining applications, etc. This thesis is one such research effort. The goal of the thesis research is to design novel tools and techniques to extend and enhance the large-scale data processing capability of Hadoop/HBase on clouds, and to evaluate their effectiveness in performance tests on prototype implementations. Two main research contributions are described. The first contribution is a light-weight computational workflow system called "CloudWF" for Hadoop. The second contribution is a client library called "HBaseSI" supporting transactional snapshot isolation (SI) in HBase, Hadoop's database component. CloudWF addresses the problem of automating the execution of scientific workflows composed of both MapReduce and legacy applications on clouds with Hadoop/HBase. CloudWF is the first computational workflow system built directly using Hadoop/HBase. 
It uses novel methods for handling workflow directed acyclic graph decomposition, storing and querying dependencies in HBase sparse tables, transparent file staging, and decentralized workflow execution management relying on the MapReduce framework for task scheduling and fault tolerance. HBaseSI addresses the problem of maintaining strong transactional data consistency in HBase tables. This is the first SI mechanism developed for HBase. HBaseSI uses novel methods for handling distributed transactional management autonomously by individual clients. These methods greatly simplify the design of HBaseSI and can be generalized to other column-oriented stores with an architecture similar to HBase's. As a result of the simplicity in design, HBaseSI adds low overhead to HBase performance and directly inherits many desirable properties of HBase. HBaseSI is non-intrusive to existing HBase installations and user data, and is designed to work with a large cloud in terms of data size and the number of nodes in the cloud. Cloud Hadoop HBase Snapshot Isolation Distributed Transaction Workflow Data Processing Computer Science
9	Consequences of converting a data warehouse based on a STAR-schema to a column-oriented-NoSQL-database Bodegård Gustafsson, Rebecca January 2018 Data warehouses based on the relational model have been a popular technology for many years because their ACID properties (Atomicity, Consistency, Isolation, and Durability) make them very reliable. However, today's new demands on databases, driven by growing amounts of data and changing data structures, mean that the relational model might not always be the optimal choice. NoSQL is the name of a group of databases that are less bound by schemas and are therefore more scalable and easier to change. They are also adapted for massively parallel processing and are therefore suited to handling large amounts of data. Of all the NoSQL databases, column-oriented databases are the most similar to the relational model, since they also consist of tables. This study therefore converted a relational data warehouse based on a star schema to a column-oriented NoSQL database and evaluated the implementation by comparing query times between the two. Scrambled economic data from a business in Sweden was used to perform the conversion and to test it with a few common queries. The results show that the mapping works but that query time in the NoSQL database is significantly longer. column-database datawarehouse nosql apache hbase sql Computer Sciences Datavetenskap (datalogi)
10	Performance comparison between PostgreSQL, MongoDB, ArangoDB and HBase / Prestandajämförelse mellan PostgreSQL, MongoDB, ArangoDB och Hbase Dalström, Isak, Ericsson, Philip January 2022 A large amount of data needs to be stored today. Handling so much data efficiently is important, as minor performance differences can have significant effects on large systems. Knowing how a given database management system performs helps companies and organizations decide which one to use. There is currently a gap in the research regarding performance differences between database management systems. We conducted a study that compares the average query response time of PostgreSQL, MongoDB, ArangoDB and HBase. We also compared performance when using a single thread and when using multiple threads. We compared how the systems perform with dataset sizes and operation counts of 10 000, 100 000, and 1 000 000 for insert, update and read queries. The results show that PostgreSQL has the lowest average query response time for read queries and that MongoDB has the lowest average query response time for insert and update queries. The results also show a significant performance gain from using multiple threads instead of a single thread. Database PostgreSQL MongoDB ArangoDB HBase performance Information Systems
