Global ETD Search

81	Investigating Persistence Layers for Notifications Ghourchian, Isabel January 2019 This work was carried out for Cisco at the Tail-f department. Cisco's main focus is on networking and telecommunication. The Tail-f department develops a network service automation product that lets customers automate adding, removing and managing the devices and services in their network. In this network service, notifications are raised to inform operators when something has happened. The notifications are currently stored in the configuration database that holds all the data in the network service. Customers have asked for more flexibility and functionality when working with the notification data, since they are limited in the queries they can run to analyze it. The purpose of this project is to find an alternative way of storing the notifications with better functionality and efficiency. A number of databases were investigated with respect to the functionality and performance requirements of the system, and ElasticSearch was chosen as the final storage system. ElasticSearch provides flexible schema handling and complex queries, which makes it a suitable choice that fulfills the customers' needs. A generator program and a subscriber program were built to test the insertion of notification data into ElasticSearch. The generator creates notifications manually; the subscriber receives them, parses them and inserts them into the storage. The queries and performance of ElasticSearch were measured. The query results show that the new system can perform much more complex queries than before, such as range queries, filtering and full-text searches. The performance results show that the system can handle around 1000 notifications every other millisecond before it slows down. This number is sufficient to satisfy the customers' needs.
Databases Notifications Storage Events NoSQL SQL Schema JSON Computer and Information Sciences
82	Datamigration av Content Management Systems (CMS) för Multi-siteapplikationer : En studie på SQL-till-NoSQL migration / Data migration of Content Management Systems (CMS) for Multi-site applications : A study on SQL-to-NoSQL migration Brown, Elin January 2018 This thesis investigates whether existing multi-site applications in the WordPress CMS can achieve better performance by moving from WordPress to the newer CMS Keystone JS through a data migration. The migration process is evaluated with a scientific experiment, to examine whether the migration itself can introduce performance problems, and also when a migration is relevant and ultimately worth carrying out. The experiment measures response times for various database operations in the original WordPress application and in the migrated Keystone JS application. The measurements showed that the migrated application can achieve up to 59% better response times for subdomain rendering, which confirms that multi-site applications can benefit from a migration to Keystone JS. The migration process itself was also not found to have any negative effect on performance. Multi-site CMS WordPress Keystone JS SQL-to-NoSQL data migration Computer Sciences
83	Avaliação do consumo de energia em sistemas de gerenciamento de banco de dados NoSQL ARAÚJO, Carlos Gomes 08 August 2016 NoSQL is an emergent database management system (DBMS) technology with flexible models focused on performance and scalability, proposed for manipulating massive amounts of data. NoSQL is not intended to replace relational database management systems, but to overcome constraints related to massive data manipulation. The technology is already applied in well-known systems around the world, such as e-commerce and middleware services. Its importance has motivated many works, mainly relating to performance. Few works characterize and compare energy consumption in NoSQL database management systems, despite its importance. In fact, energy consumption cannot be neglected, given its impact on financial cost and the environment. To address this issue, this work evaluates not only the performance but also the energy consumption of NoSQL database management systems, specifically Cassandra (column), MongoDB (document-oriented) and Redis (key-value), selected as representative examples of this technology. The methodology is based on Design of Experiments: the workloads are generated by the Yahoo! Cloud Serving Benchmark (YCSB), producing reads, writes and updates in cycles of 1,000, 10,000 and 100,000 operations. As a result, twenty-seven treatments are evaluated. Energy consumption is measured with a specific framework named Emeter, which captures metrics such as execution time and energy consumption for the treatments under analysis. In addition to the individual evaluation, performance and energy consumption are analyzed across relevant scenarios, as well as the trends due to increases in the workload. The results demonstrate that energy consumption can differ significantly between DBMSs according to command and workload. Additionally, the results make it possible to infer that, despite the well-known positive correlation between execution time and energy consumption, the fastest DBMS is not necessarily the best at saving energy. NoSQL DBMS Column Oriented Document Oriented Key-Value Energy Consumption Performance Evaluation
84	Databaser i molnet : En prestanda utvärdering Persson, Peter, Sjölin, Johan, Dahlberg, Thomas January 2012 As a developer of database-driven applications you face difficult choices of database, server and programming language. For the result to be satisfactory, the different technologies must interact well with each other and also meet performance expectations. This is even more important when the application is deployed to the cloud, where response time plays a major role. This paper evaluates the response times of different databases. To test them, a test application was written in the server-side scripting language PHP (Hypertext Preprocessor) and deployed to the Windows Azure cloud platform. The test application's task is to call and load the databases with controlled requests that create, read, update and delete data on a relatively large scale. The databases tested and evaluated in practice are Azure Table, Azure SQL, CouchDB, IrisCouch and Database.com. The results show that local databases, or databases in the same data center as the server, give the fastest responses. The difference between NoSQL and SQL is practically negligible for simple requests. The type of data and the type of usage are the major factors in the choice between the two kinds of database. This paper is intended as guidance in the choice of database when developing applications for the cloud. Keywords: SQL, NoSQL, database, cloud, Azure, CouchDB, IrisCouch, database.com, database performance Computer Sciences
85	A Framework for Property-preserving Encryption in Wide Column Store Databases Waage, Tim 05 May 2017 No description available. 510 NoSQL Wide Column Stores Property-preserving Encryption Database Security Informatik (PPN619939052)
86	Módulo de consultas distribuídas do Infinispan / Module that supports distributed queries in Infinispan Israel Danilo Lacerra 26 November 2012 With the large amount of data available to computer applications nowadays, there is an increasing need for mechanisms that facilitate the retrieval of such data and improve data access performance. In this context we see the emergence of so-called NOSQL databases: typically non-relational databases that give up some requirements previously seen as fundamental in order to achieve better availability and performance in big data environments. In this work we address this scenario by implementing a module that supports distributed queries in JBoss Infinispan, a distributed cache system that also works as an in-memory NOSQL database. Besides presenting the implementation of that module, we discuss the emergence of the NOSQL movement, the characteristics of NOSQL databases, and where Infinispan fits in this context. cache data grid data grid systems distributed cache distributed queries Infinispan Lucene NOSQL
87	Einsatz des Intelligent Cluster Index in verteilten, dezentralen NoSQL-Systemen Morgenstern, Johannes 07 February 2019 Both human-generated data and machine-generated communication create a desire to extract information from these data from a variety of perspectives. In addition, the amount of data to be evaluated grows steadily. As a technical basis for collecting and processing this data volume, scalable system concepts are used that meet data growth with inherent scalability. From an analytical point of view, these are Big Data system concepts whose technical basis is often formed by non-relational NoSQL systems. Based on Growing Neural Gas, an artificial neural network, this work considers two distributed algorithms for learning content features for data organization with a content-oriented index. Furthermore, the content-oriented index ICIx is adapted for column family stores, to enable information retrieval in distributed, decentralized systems by content similarity as well. The experiments show that the distributed variants of Growing Neural Gas can represent data without loss of quality. In addition, applying the data organized by this artificial neural network shows that the index structure under consideration speeds up data access compared with comparable indices, even in distributed, decentralized systems. info:eu-repo/classification/ddc/004 ddc:004
88	Supporting multiple data stores based applications in cloud environments / Soutenir les applications utilisant des bases de données multiples dans un environnement Cloud Computing Sellami, Rami 05 February 2016 The production of huge amounts of data and the emergence of Cloud computing have introduced new requirements for data management, and new database management systems have appeared, generally known as NoSQL systems. Compared with relational systems, they are distinguished by the absence of a schema, specialization for particular data types (documents, graphs, key/value and column) and the absence of declarative query languages; there is currently no standard comparable to SQL for relational systems. Many applications need to interact with several heterogeneous data stores depending on the type of data they manage: traditional data types, documents, graph data from social networks, simple key-value data, etc. Interacting with heterogeneous data models through different APIs imposes challenging tasks on the developers of applications based on multiple data stores. Programmers have to be familiar with different APIs. In addition, complex queries over heterogeneous data models cannot currently be expressed declaratively, as they can in single-data-store applications, and therefore require extra implementation effort. Moreover, developers have to manage the whole application life cycle explicitly: (1) coding the application, (2) discovering the database services deployed in each Cloud environment and choosing a deployment environment, (3) deploying the application, (4) executing multi-source queries by programming them explicitly in the application, and, where needed, (5) migrating the application from one Cloud environment to another. In this manuscript, we propose an integrated set of models, algorithms and tools that aim to ease the developer's task of developing, deploying and migrating multiple-data-store applications in cloud environments. Our approach focuses mainly on three points. First, we provide a unified data model that application developers use to interact with heterogeneous relational and NoSQL data stores. This model is enriched by a set of refinement rules, and on top of it we define our query algebra. Developers express queries using the OPEN-PaaS-DataBase API (ODBAPI), a single REST API that lets programmers write application code independently of the target data stores. Second, we propose virtual data stores, which act as mediators and interact with the integrated data stores wrapped by ODBAPI. This run-time component supports the execution of simple and complex queries over heterogeneous data stores. It implements a cost model to execute queries optimally and a dynamic-programming-based algorithm to generate an optimal query execution plan. Finally, we present a declarative approach that lightens the burden of the tedious and non-standard tasks of (1) discovering relevant Cloud environments and (2) deploying applications on them, letting developers focus on specifying their storage and computing requirements. Developers describe their data store requirements in manifests, a matching algorithm selects the most suitable environment, and the application is deployed through a generic deployment API called COAPS, which we extended to deploy applications that use several data sources. A prototype of the proposed solution has been developed and applied to use cases from the OpenPaaS project. We also performed various experiments to test the efficiency and accuracy of our proposals. Cloud computing Big data Polyglot persistence NoSQL RDBMS Join queries
89	Prevention of Privilege Abuse on NoSQL Databases : Analysis on MongoDB access control / Förebyggande av Privilegier Missbruk på NoSQL-databaser : Analys på MongoDB-åtkomstkontroll Ishak, Marwah January 2021 Database security is vital to preserve the confidentiality and integrity of data and to prevent security threats such as privilege abuse. The most common form of privilege abuse is excessive privilege abuse, in which users are assigned privileges beyond what their job function requires, and those privileges can then be abused deliberately or inadvertently. The thesis's objective is to determine how to prevent privilege abuse in the NoSQL database MongoDB. Prior studies have noted the importance of access control in securing databases against privilege abuse. Access control is essential to manage and protect access to the stored data and to restrict unauthorised access. The study therefore analyses MongoDB's embedded access control through experimental testing of various built-in and advanced privilege roles and their ability to prevent privilege abuse. The results indicate that privilege abuse can be prevented if users are granted roles composed of the least privileges. Additionally, the results indicate that assigning users excessive privileges exposes the system to privilege abuse. The study also underlines that an inaccurate allocation of privileges or permissions to database users may have profound consequences for the system and the organisation, such as data breaches and data manipulation. Hence, organisations that use information technology should be obliged to protect their interests and databases, both from outsiders and from their own members, through access control policies. NoSQL databases MongoDB Access control Privilege abuse Role-based access control Computer Engineering
90	A comparison of Data Stores for the Online Feature Store Component : A comparison between NDB and Aerospike / En jämförelse av datalagringssystem för andvänding som Online Feature Store : En jämförelse mellan NDB och Aerospike Volminger, Alexander January 2021 This thesis investigated which data stores are suitable for implementing an Online Feature Store, a component of the machine learning infrastructure that must handle low-latency reads at high throughput with high availability. The data stores were evaluated with real feature workloads from Spotify's Search system. First, an investigation was made to find suitable storage systems. NDB and Aerospike were selected because of their state-of-the-art performance and suitable functionality. They were then implemented as the Online Feature Store by batch reading the feature data through a Java program and by using Google Dataflow to load data into the data stores. With 1 client, NDB achieved about 35% higher batch read throughput, with around 30% lower P99 latency, than Aerospike. With 8 clients, NDB achieved 20% higher batch read throughput, with a P99 latency difference relative to Aerospike that varied. In an 8-node setup, however, NDB achieved on average 35% lower latency. Aerospike was 50% faster when writing feature data to the data stores. The read performance of both data stores suffered when writes happened at the same time as reads, with the P99 read latency increasing by around 30% for both. It was concluded that both data stores would work as an Online Feature Store, but NDB achieved better read performance, which is one of the most important factors for this type of feature store. Feature Stores Data Stores NDB Aerospike NoSQL Online Feature Stores Computer and Information Sciences
