Global ETD Search

1	Compression Selection for Columnar Data using Machine-Learning and Feature Engineering Persson, Douglas, Juelsson Larsen, Ludvig January 2023 (has links) There is a continuously growing demand for improved solutions that provide both efficient storage and efficient retrieval of big data for analytical purposes. This thesis researches the use of machine-learning together with feature engineering to recommend the most cost-effective compression algorithm and encoding combination for columns in a columnar database management system (DBMS). The framework consists of a cost function calculated using compression time, decompression time, and compression ratio. An XGBoost machine-learning model is trained on labels provided by the cost function to recommend the most cost-effective combination for columnar data within a column or vector-oriented DBMS. While the methods are applied on ClickHouse, one of the most popular open-source column-oriented DBMS on the market, the results are broadly applicable to column-oriented data which share data type and characteristics with IoT telemetry data. Using billions of available rows of numeric real business data obtained at Axis Communications in Lund, Sweden, a set of features are engineered to accurately describe the characteristics of a given column. The proposed framework allows for weighting the business interests (compression time, decompression time, and compression ratio) to determine the individually optimal cost-effective solution. The model reaches an accuracy of 99% on the test dataset and an accuracy of 90.1% on unseen data by leveraging data features that are predictive of compression algorithms and encodings performances. Following ClickHouse strategies and the most suitable practices in the field, combinations of general-purpose compression algorithms and data encodings are analysed that together yield the best results in efficiently compressing the data of certain columns. Applying the unweighted recommended combinations on all columns, the framework’s performance impact was measured to increase the average compression speed by 95.46%. Reducing the time to compress the columns from 31.17 seconds to compress the data to 13.17 seconds. Additionally, the decompression speed was increased by 59.87%, reducing the time to decompress the columns from 2.63 seconds to 2.02 seconds, at the cost of decreasing the compression ratio by 66.05%. Increasing the storage requirements by 94.9 MB. In column and vector databases, chunks of data belonging to a certain column are often stored together on a disk. Therefore, choosing the right compression algorithm can lower the storage requirements and boost database throughput. Machine Learning XGBoost Classification Feature Engineering Compression Algorithms Data Encodings Database Management System (DBMS) Column- Oriented DBMS Computer Sciences Datavetenskap (datalogi)
2	Density-Aware Linear Algebra in a Column-Oriented In-Memory Database System Kernert, David 20 September 2016 (has links) (PDF) Linear algebra operations appear in nearly every application in advanced analytics, machine learning, and of various science domains. Until today, many data analysts and scientists tend to use statistics software packages or hand-crafted solutions for their analysis. In the era of data deluge, however, the external statistics packages and custom analysis programs that often run on single-workstations are incapable to keep up with the vast increase in data volume and size. In particular, there is an increasing demand of scientists for large scale data manipulation, orchestration, and advanced data management capabilities. These are among the key features of a mature relational database management system (DBMS). With the rise of main memory database systems, it now has become feasible to also consider applications that built up on linear algebra. This thesis presents a deep integration of linear algebra functionality into an in-memory column-oriented database system. In particular, this work shows that it has become feasible to execute linear algebra queries on large data sets directly in a DBMS-integrated engine (LAPEG), without the need of transferring data and being restricted by hard disc latencies. From various application examples that are cited in this work, we deduce a number of requirements that are relevant for a database system that includes linear algebra functionality. Beside the deep integration of matrices and numerical algorithms, these include optimization of expressions, transparent matrix handling, scalability and data-parallelism, and data manipulation capabilities. These requirements are addressed by our linear algebra engine. In particular, the core contributions of this thesis are: firstly, we show that the columnar storage layer of an in-memory DBMS yields an easy adoption of efficient sparse matrix data types and algorithms. Furthermore, we show that the execution of linear algebra expressions significantly benefits from different techniques that are inspired from database technology. In a novel way, we implemented several of these optimization strategies in LAPEG’s optimizer (SpMachO), which uses an advanced density estimation method (SpProdest) to predict the matrix density of intermediate results. Moreover, we present an adaptive matrix data type AT Matrix to obviate the need of scientists for selecting appropriate matrix representations. The tiled substructure of AT Matrix is exploited by our matrix multiplication to saturate the different sockets of a multicore main-memory platform, reaching up to a speed-up of 6x compared to alternative approaches. Finally, a major part of this thesis is devoted to the topic of data manipulation; where we propose a matrix manipulation API and present different mutable matrix types to enable fast insertions and deletes. We finally conclude that our linear algebra engine is well-suited to process dynamic, large matrix workloads in an optimized way. In particular, the DBMS-integrated LAPEG is filling the linear algebra gap, and makes columnar in-memory DBMS attractive as efficient, scalable ad-hoc analysis platform for scientists. Main-Memory Datenbanksysteme Spaltenorientierte DBMS Lineare Algebra Implementierung Matrixdatenstrukturen Ausdrucksoptimierung in-memory database management systems column-oriented DBMS linear algebra implementation matrix data structures expression optimization ddc:004 rvk:ST 270
3	Density-Aware Linear Algebra in a Column-Oriented In-Memory Database System Kernert, David 20 September 2016 (has links) Linear algebra operations appear in nearly every application in advanced analytics, machine learning, and of various science domains. Until today, many data analysts and scientists tend to use statistics software packages or hand-crafted solutions for their analysis. In the era of data deluge, however, the external statistics packages and custom analysis programs that often run on single-workstations are incapable to keep up with the vast increase in data volume and size. In particular, there is an increasing demand of scientists for large scale data manipulation, orchestration, and advanced data management capabilities. These are among the key features of a mature relational database management system (DBMS). With the rise of main memory database systems, it now has become feasible to also consider applications that built up on linear algebra. This thesis presents a deep integration of linear algebra functionality into an in-memory column-oriented database system. In particular, this work shows that it has become feasible to execute linear algebra queries on large data sets directly in a DBMS-integrated engine (LAPEG), without the need of transferring data and being restricted by hard disc latencies. From various application examples that are cited in this work, we deduce a number of requirements that are relevant for a database system that includes linear algebra functionality. Beside the deep integration of matrices and numerical algorithms, these include optimization of expressions, transparent matrix handling, scalability and data-parallelism, and data manipulation capabilities. These requirements are addressed by our linear algebra engine. In particular, the core contributions of this thesis are: firstly, we show that the columnar storage layer of an in-memory DBMS yields an easy adoption of efficient sparse matrix data types and algorithms. Furthermore, we show that the execution of linear algebra expressions significantly benefits from different techniques that are inspired from database technology. In a novel way, we implemented several of these optimization strategies in LAPEG’s optimizer (SpMachO), which uses an advanced density estimation method (SpProdest) to predict the matrix density of intermediate results. Moreover, we present an adaptive matrix data type AT Matrix to obviate the need of scientists for selecting appropriate matrix representations. The tiled substructure of AT Matrix is exploited by our matrix multiplication to saturate the different sockets of a multicore main-memory platform, reaching up to a speed-up of 6x compared to alternative approaches. Finally, a major part of this thesis is devoted to the topic of data manipulation; where we propose a matrix manipulation API and present different mutable matrix types to enable fast insertions and deletes. We finally conclude that our linear algebra engine is well-suited to process dynamic, large matrix workloads in an optimized way. In particular, the DBMS-integrated LAPEG is filling the linear algebra gap, and makes columnar in-memory DBMS attractive as efficient, scalable ad-hoc analysis platform for scientists. info:eu-repo/classification/ddc/004 ddc:004

1

Page generated in 0.0417 seconds