Global ETD Search

1	Réplication de données dans les systèmes de gestion de données à grande échelle / Data replication in large-scale data management systems
Tos, Uras
27 June 2017
In recent years, the growing popularity of large-scale applications, e.g. scientific experiments, the Internet of Things and social networking, has led to the generation of large volumes of data. 
The management of this data presents a significant challenge, as the data is heterogeneous and distributed on a large scale. In traditional systems, including distributed and parallel systems, peer-to-peer systems and grid systems, achieving acceptable performance while ensuring good data availability is a major challenge for service providers, especially when the data is distributed around the world. In this context, data replication, a well-known technique, allows: (i) increased data availability, (ii) reduced data access costs, and (iii) improved fault tolerance. However, replicating data on all nodes is unrealistic, as it consumes significant bandwidth and exhausts limited storage space. Defining good replication strategies is a solution to these problems. The data replication strategies proposed for the traditional systems mentioned above aim to improve performance for the user. They are difficult to adapt to cloud systems, because cloud providers aim to generate a profit in addition to meeting tenant requirements. Meeting the performance expectations of tenants without sacrificing the provider's profit, together with elastic resource management under a pay-as-you-go pricing model, is fundamental to cloud systems. In this thesis, we propose a data replication strategy that satisfies the requirements of the tenant, such as performance, while guaranteeing the economic profit of the provider. Based on a cost model, we estimate the response time required to execute a distributed database query. Data replication is considered only if, for a given query, the estimated response time exceeds a threshold previously set in the contract between the provider and the tenant. The planned replication must then also be economically beneficial to the provider. 
In this context, we propose an economic model that takes into account both the expenditures and the revenues of the provider during the execution of a given database query. Once replication has been decided, a heuristic placement approach is used to place the new replicas so as to reduce access time. In addition, the number of replicas is adjusted dynamically to allow elastic management of resources. The proposed strategy is validated in an experimental evaluation carried out in a simulation environment. A comparison with another data replication strategy proposed for cloud systems shows that both strategies meet the performance objective for the tenant. With our strategy, however, a data replica is created only if the replication is profitable for the provider.
Keywords: Cloud Computing; Database Queries; Data Replication; Performance Evaluation; Economic Benefit
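The two-step replication decision described in this abstract (a response-time threshold check followed by a profitability check) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's actual model: the class `QueryEstimate`, its field names, the function `should_replicate`, and the simple "revenue minus expenditure" profit test are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueryEstimate:
    """Hypothetical per-query estimates; every field name is an assumption."""
    response_time: float             # estimated response time from a cost model (seconds)
    revenue: float                   # provider revenue for executing the query
    expenditure_with_replica: float  # provider expenditure if a new replica is created


def should_replicate(estimate: QueryEstimate, sla_threshold: float) -> bool:
    """Return True only if replication is both needed and profitable."""
    # Step 1: replication is considered only when the estimated response
    # time exceeds the threshold agreed between provider and tenant.
    if estimate.response_time <= sla_threshold:
        return False
    # Step 2 (assumed profit test): the provider's revenue must exceed
    # its expenditure once the replica is taken into account.
    return estimate.revenue - estimate.expenditure_with_replica > 0
```

For example, under these assumptions a query estimated at 2.0 s against a 1.5 s threshold would trigger replication only when the revenue still exceeds the expenditure that includes the replica.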
2	Semantische Transformation von natürlichsprachigen Anfragen in Datenbankabfragesprachen: Design und Implementierung einer sprachgesteuerten Schnittstelle für die semantische Transformation von natürlichsprachigen Anfragen in Datenbankabfragesprachen am Beispiel von OntoChem's SciWalker / Semantic transformation of natural-language queries into database query languages: design and implementation of a language-driven interface for the semantic transformation of natural-language queries into database query languages, using OntoChem's SciWalker as an example
Horstkorte, Garlef
17 December 2024
This bachelor's thesis deals with the development of a software solution for the semantic and syntactic conversion of natural language into database query languages. The aim is to create a user-friendly interface that enables non-experts, too, to run complex database queries. During an internship at OntoChem GmbH, a rule-based prototype was first developed that transforms natural-language queries into machine-readable database queries. This approach was then compared with an approach based on Large Language Models (LLMs), such as ChatGPT. The comparison examined, among other things, the efficiency, accuracy, reliability and economic cost of both approaches. The thesis begins with an introduction to the fundamentals of natural language processing (NLP), rule-based systems and LLMs. This is followed by a detailed description of the internship project, including the technologies and tools used. The subsequent chapters present, implement and test the rule-based approach and the LLM approach to converting natural language into database queries. The comparative analysis shows that the rule-based approach stands out for its high speed and data control, but is limited in flexibility and accuracy. The LLM approach, by contrast, offers higher accuracy and flexibility in interpreting natural language, but has longer response times and higher operating costs. Finally, practical recommendations are given and directions for future research are identified, such as combining the two approaches or training a custom model. The results of this thesis contribute to improving the interaction between natural language and database systems and offer practical solutions for the semantic transformation of user queries.
Contents:
1 Introduction
  1.1 Motivation
  1.2 Objectives of the thesis
  1.3 Structure of the thesis
2 Background and theoretical foundations
  2.1 Natural language processing (NLP)
    2.1.1 Fundamentals of NLP
    2.1.2 Models and algorithms
    2.1.3 Areas of application
  2.2 Rule-based systems
    2.2.1 Definition and operation
    2.2.2 Examples and applications
  2.3 Large Language Models (LLMs)
    2.3.1 Operation and architecture
    2.3.2 Development and technologies
    2.3.3 Training and data basis
    2.3.4 Areas of application
    2.3.5 Limitations of GPT models
3 Internship project at OntoChem GmbH
  3.1 Company profile
    3.1.1 Overview and history
    3.1.2 Products and technologies
  3.2 Project description
    3.2.1 Project goal
    3.2.2 Task definition
  3.3 Technology stack and tools
    3.3.1 Programming language and environment
    3.3.2 Libraries
4 Rule-based approach to converting natural language into database queries
  4.1 API design
    4.1.1 Methodology and concept
    4.1.2 structFromNaturalSearch
    4.1.3 queryFromSearchStructure
  4.2 Implementation
    4.2.1 Function: SearchStructureFromString
    4.2.2 Integration of OC technologies
    4.2.3 Algorithms and rules
    4.2.4 Challenges
5 LLM approach to converting natural language into database queries
  5.1 Introduction to the LLM approach
    5.1.1 Fundamentals
    5.1.2 Comparison with rule-based systems
  5.2 Prompting in LLMs (e.g. ChatGPT)
    5.2.1 Principles of prompting
    5.2.2 Designing effective prompts
  5.3 Tests and evaluation
    5.3.1 Description of the tests
    5.3.2 Results and analysis
6 Comparison of the approaches
  6.1 Methodology
  6.2 Results
  6.3 Discussion
7 Evaluation and outlook
  7.1 Critical review
  7.2 Limitations and sources of error
  7.3 Conclusion and implications
  7.4 Future research
Bibliography
List of figures
List of data and code
3	En jämförelse mellan databashanterare med prestandatester och stora datamängder / A comparison between database management systems with performance testing and large data sets
Brander, Thomas; Dakermandji, Christian
January 2016
The company Nordicstation handles large amounts of data for Swedbank, where data is stored in the relational database Microsoft SQL Server 2012 (SQL Server). Because other databases are designed for handling large amounts of data, it is unclear whether SQL Server is the best solution for this situation. 
This degree project describes a comparison between databases using performance testing, with regard to the execution time of database queries. The chosen databases were SQL Server, Cassandra and NuoDB. Cassandra is a column-oriented database designed for handling large amounts of data; NuoDB is a database that uses main memory for data storage and is designed for scalability. The performance tests were executed in a virtual server environment running Windows Server 2012 R2, using an application written in Java. SQL Server was the database best suited to grouping, sorting and arithmetic operations. Cassandra had the shortest execution time for write operations, while NuoDB performed best in read operations. This degree project concludes that minimizing disk operations leads to shorter execution times, but the scalable solution, NuoDB, suffers severe performance losses when configured as a single node. Nordicstation is recommended to upgrade to Microsoft SQL Server 2014 or later, which makes it possible to store tables in main memory.
Keywords: Database management system; Performance test; Execution time; Large data sets; Microsoft SQL Server; Cassandra; NuoDB; Database queries; Test environment; Software Engineering
