1 |
Density-Aware Linear Algebra in a Column-Oriented In-Memory Database System
Kernert, David 20 September 2016
Linear algebra operations appear in nearly every application in advanced analytics, machine learning, and various scientific domains. To this day, many data analysts and scientists use statistics software packages or hand-crafted solutions for their analysis. In the era of data deluge, however, external statistics packages and custom analysis programs, which often run on single workstations, cannot keep up with the vast increase in data volume and size. In particular, scientists increasingly demand large-scale data manipulation, orchestration, and advanced data management capabilities. These are among the key features of a mature relational database management system (DBMS). With the rise of main-memory database systems, it has now become feasible to also consider applications that build on linear algebra.
This thesis presents a deep integration of linear algebra functionality into an in-memory column-oriented database system. In particular, this work shows that it has become feasible to execute linear algebra queries on large data sets directly in a DBMS-integrated engine (LAPEG), without the need to transfer data and without being restricted by hard disk latencies. From the application examples cited in this work, we derive a number of requirements for a database system that includes linear algebra functionality. Besides the deep integration of matrices and numerical algorithms, these include optimization of expressions, transparent matrix handling, scalability and data parallelism, and data manipulation capabilities. Our linear algebra engine addresses these requirements. The core contributions of this thesis are as follows. First, we show that the columnar storage layer of an in-memory DBMS makes it easy to adopt efficient sparse matrix data types and algorithms. Furthermore, we show that the execution of linear algebra expressions benefits significantly from techniques inspired by database technology. We implemented several of these optimization strategies in LAPEG’s optimizer (SpMachO), which uses an advanced density estimation method (SpProdest) to predict the matrix density of intermediate results. Moreover, we present an adaptive matrix data type, AT Matrix, which relieves scientists of having to select appropriate matrix representations. Our matrix multiplication exploits the tiled substructure of AT Matrix to saturate the different sockets of a multicore main-memory platform, reaching a speed-up of up to 6x compared to alternative approaches. Finally, a major part of this thesis is devoted to data manipulation: we propose a matrix manipulation API and present different mutable matrix types that enable fast insertions and deletions.
We conclude that our linear algebra engine is well suited to processing dynamic, large matrix workloads in an optimized way. In particular, the DBMS-integrated LAPEG fills the linear algebra gap and makes columnar in-memory DBMSs attractive as an efficient, scalable ad-hoc analysis platform for scientists.
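To illustrate why predicting the density of intermediate results matters when optimizing linear algebra expressions, the following is a minimal sketch of a textbook average-case estimator. It is not SpProdest, whose details the abstract does not give. It assumes nonzeros are distributed uniformly and independently, and the function names and the `threshold` value are hypothetical.

```python
def estimate_product_density(rho_a: float, rho_b: float, inner_dim: int) -> float:
    """Expected density of C = A @ B under a uniform, independent nonzero model.

    rho_a, rho_b: fraction of nonzero entries in A and B (each in [0, 1]).
    inner_dim:    number of columns of A, which equals the number of rows of B.
    """
    if not (0.0 <= rho_a <= 1.0 and 0.0 <= rho_b <= 1.0):
        raise ValueError("densities must lie in [0, 1]")
    if inner_dim < 0:
        raise ValueError("inner_dim must be non-negative")
    # Entry c_ij is a sum of inner_dim products a_il * b_lj. Each product is
    # nonzero with probability rho_a * rho_b, so c_ij stays zero only if all
    # inner_dim products are zero (numerical cancellation is ignored).
    return 1.0 - (1.0 - rho_a * rho_b) ** inner_dim


def choose_representation(density: float, threshold: float = 0.25) -> str:
    """Pick a storage format from an estimated density.

    The threshold is an illustrative assumption, not a value from the thesis.
    """
    return "dense" if density >= threshold else "sparse"


if __name__ == "__main__":
    # Two matrices with 1% nonzeros and an inner dimension of 1000.
    rho_c = estimate_product_density(0.01, 0.01, 1000)
    print(f"estimated density: {rho_c:.4f} -> {choose_representation(rho_c)}")
```

Even with very sparse inputs, the estimated density of the product grows quickly with the inner dimension, which is why an optimizer may want to switch representations for intermediate results rather than fix one format up front.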
|
3 |
Flavonoid glucodiversification with engineered sucrose-active enzymes / Glucodiversification des flavonoïdes par ingénierie d’enzymes actives sur saccharose
Malbert, Yannick 10 July 2014
Flavonoid glycosides are natural plant secondary metabolites that exhibit many physicochemical and biological properties of interest for industrial applications. Glycosylation usually improves flavonoid solubility, but access to flavonoid glycosides is limited by their low production levels in plants. This thesis therefore focuses on developing new glucodiversification routes for natural flavonoids by taking advantage of protein engineering. Two biochemically and structurally characterized recombinant transglucosylases, the amylosucrase from Neisseria polysaccharea and the α-(1→2) branching sucrase, a truncated form of the dextransucrase from L. mesenteroides NRRL B-1299, were selected to glucosylate different flavonoids, synthesize new α-glucoside derivatives with original glucosylation patterns, and potentially improve their water solubility.
First, a small library of amylosucrase variants with mutations in the acceptor binding site was screened in the presence of sucrose (glucosyl donor) and luteolin (acceptor). A screening procedure developed for this purpose made it possible to isolate several mutants improved for luteolin glucosylation and to synthesize novel luteolin glucosides, which were up to 17,000 times more soluble in water than the luteolin aglycone. To glucosylate other types of flavonoids, the α-(1→2) branching sucrase, naturally designed for acceptor reactions, was preferred. Experimental design and Response Surface Methodology were first used to optimize the production of soluble enzyme and avoid inclusion body formation. Five parameters were included in the design: culture duration, temperature, and the concentrations of glycerol, lactose (inducer), and glucose (repressor). Using the predicted optimal conditions, 5740 U·L⁻¹ of culture of soluble enzyme were obtained in microtiter plates, whereas no activity was found in the soluble fraction with the previously reported production method.
A structure-guided molecular modeling approach, based on docking monoglucosylated flavonoids in the enzyme active site, was then applied to identify mutagenesis targets and generate libraries of several thousand variants. They were screened on solid medium with a rapid assay, developed for this purpose, based on colorimetric visualization of a pH change. This sorted out the mutants still active on sucrose, which were subsequently assayed for quercetin and diosmetin glucosylation. A small set of 23 variants, constituting a platform of enzymes improved for the glucosylation of these two flavonoids, was retained and evaluated for the glucosylation of six distinct flavonoids. Remarkably, the promiscuity generated in this platform made it possible to isolate several variants much more efficient than the wild-type enzyme. They produced different glucosylation patterns and provided valuable information for the further design and improvement of enzymatic tools for flavonoid glucosylation.
|