11

Delivering Business Intelligence Performance by Data Warehouse and ETL Tuning

Tashakor, Ghazal January 2013 (has links)
The aim of this thesis is to show how organizations such as CGI attempt to introduce BI solutions through IT and other operational methods in order to serve large companies that want to strengthen their competitive market position. This aim is achieved by gap analysis of the BI roadmap and the available data warehouses, based on one of the company projects handed over to CGI from Lithuania. The fundamentals for achieving BI solutions through IT, which form the thesis methodology, are data warehousing, content analytics and performance management, data movement (Extract, Transform and Load), the CGI BI methodology, business process management, the TeliaSonera Maintenance Management Model (TSM3) and, at a high level, CGI's AM model. The practical part of the thesis covers research and hands-on work with Informatica PowerCenter and Microsoft SQL Server Management Studio, as well as low-level details such as database tuning, DBMS tuning implementation and ETL workflow optimization.   Keywords: BI, ETL, DW, DBMS, TSM3, AM, Gap Analysis
12

Second-tier Cache Management to Support DBMS Workloads

Li, Xuhui 16 September 2011 (has links)
Enterprise Database Management Systems (DBMS) often run on computers with dedicated storage systems. Their data access requests need to go through two tiers of cache, i.e., a database bufferpool and a storage server cache, before reaching the storage media, e.g., disk platters. A tremendous amount of work has been done to improve the performance of the first-tier cache, i.e., the database bufferpool. However, the amount of work focusing on second-tier cache management to support DBMS workloads is comparably small. In this thesis we propose several novel techniques for managing second-tier caches to boost DBMS performance in terms of query throughput and query response time. The main purpose of second-tier cache management is to reduce the I/O latency endured by database query executions. This goal can be achieved by minimizing the number of reads and writes issued from second-tier caches to storage devices. The first part of our research focuses on reducing the number of read I/Os issued by second-tier caches. We observe that DBMSs issue I/O requests for various reasons. The rationales behind these I/O requests provide useful information to second-tier caches because they can be used to estimate the temporal locality of the data blocks being requested. A second-tier cache can exploit this information when making replacement decisions. In this thesis we propose a technique to pass this information from DBMSs to second-tier caches and to use it in guiding cache replacements. The second part of this thesis focuses on reducing the number of writes issued by second-tier caches. Our work is twofold. First, we observe that although there are second-tier caches within computer systems, today's DBMSs cannot take full advantage of them. For example, most commercial DBMSs use forced writes to propagate bufferpool updates to permanent storage for data durability reasons. We notice that enforcing such a practice is more conservative than necessary. Some of the writes can be issued as unforced requests and can be cached in the second-tier cache without immediate synchronization. This will give the second-tier cache opportunities to cache and consolidate multiple writes into one request. Unfortunately, the current POSIX-compliant file system interfaces provided by mainstream operating systems (e.g., Unix and Windows) are not flexible enough to support such dynamic synchronization. We propose to extend such interfaces to let DBMSs take advantage of using unforced writes whenever possible. Additionally, we observe that the existing cache replacement algorithms are designed solely to maximize read cache hits (i.e., to minimize read I/Os). The purpose is to minimize the read latency, which is on the critical path of query executions. We argue that minimizing read requests is not the only objective of cache replacement. When I/O bandwidth becomes a bottleneck the objective should be to minimize the total number of I/Os, including both reads and writes, to achieve the best performance. We propose to associate a new type of replacement cost, i.e., the total number of I/Os caused by the replacement, with each cache page; and we also present a partial characterization of an optimal algorithm which minimizes the total number of I/Os generated by caches. Based on this knowledge, we extend several existing replacement algorithms, which are write-oblivious (focus only on reducing reads), to be write-aware and observe promising performance gains in the evaluations.
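The write-aware replacement idea above can be illustrated with a small sketch. This is not the thesis's algorithm; the class name, the locality hint passed down from the DBMS, and the candidate-scan heuristic are assumptions made for the example, which simply charges an extra write I/O whenever a dirty page is evicted.

```python
from collections import OrderedDict

class WriteAwareCache:
    """Toy second-tier cache: LRU recency, but eviction prefers the
    candidate with the lowest estimated I/O cost (reads + write-backs)."""

    def __init__(self, capacity, candidates=4):
        self.capacity = capacity
        self.candidates = candidates          # how many LRU-end pages to consider
        self.pages = OrderedDict()            # page_id -> {'dirty': bool, 'hint': int}
        self.io_count = 0

    def access(self, page_id, is_write=False, locality_hint=1):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)   # hit: refresh recency
        else:
            self.io_count += 1                # read miss costs one read I/O
            if len(self.pages) >= self.capacity:
                self._evict()
            self.pages[page_id] = {'dirty': False, 'hint': locality_hint}
        if is_write:
            self.pages[page_id]['dirty'] = True   # unforced write, kept in cache

    def _evict(self):
        # Inspect a few least-recently-used pages and evict the cheapest one:
        # clean, low-locality pages cost no extra I/O; a dirty page costs one write-back.
        lru_candidates = list(self.pages.items())[:self.candidates]
        victim_id, victim = min(lru_candidates,
                                key=lambda kv: (kv[1]['dirty'], kv[1]['hint']))
        if victim['dirty']:
            self.io_count += 1                # flushing the dirty page is a write I/O
        del self.pages[victim_id]
```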
13

Linked data performance in different databases : Comparison between SQL and NoSQL databases / Prestanda med länkad data i olika databaser : Jämförelse mellan SQL och NoSQL databaser

Chavez Alcarraz, Erick, Moraga, Manuel January 2014 (has links)
Meepo AB was investigating the possibility of developing a social rating and recommendation service. In a recommendation service, the user ratings are collected in a database; this data is then used in recommendation algorithms to create individual user recommendations. The purpose of this study was to find out which demands are put on a DBMS (database management system) powering a recommendation service, what impact NoSQL databases have on the performance of recommendation services compared to traditional relational databases, and which DBMS is most suited for storing the data needed to host a recommendation service. Five distinct NoSQL and relational DBMSs were examined; from these, three candidates were chosen for a closer comparison. Following a study of recommendation algorithms and services, a test suite was created to compare DBMS performance in different areas using a data set of 100 million ratings. The results show that MongoDB had the best performance in most use cases, while Neo4j and MySQL struggled with queries spanning the whole data set. This paper, however, never compared performance for real production code. To get a better comparison, more research is needed. We recommend new performance tests for MongoDB and Neo4j using implementations of recommendation algorithms, a larger data set, and more powerful hardware. / Meepo AB investigated the possibility of developing a social rating and recommendation service. In a recommendation service, user ratings are collected in a database and then used in a recommendation algorithm to create individual recommendations for the users. The purpose of the study was to find out which demands are put on a DBMS (database system) powering a recommendation service, what impact NoSQL databases have on the performance of recommendation services compared to traditional relational databases, and which DBMS is most suited for use in a recommendation service. Five different NoSQL and relational databases were examined; from these, three candidates were selected for a closer comparison. Following a study of recommendation algorithms and recommendation services, a test suite was created to compare the performance of the databases in different areas, using a data set of 100 million ratings. The results show that MongoDB had the best performance in most use cases, while Neo4j and MySQL had problems with queries spanning the whole data set. This paper, however, does not compare performance with real production code. For a better comparison, more research is needed. We recommend new performance measurements for MongoDB and Neo4j with implementations of recommendation algorithms, a larger data set and more powerful hardware.
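To make the workload concrete, the sketch below sets up a miniature ratings table and runs the kind of whole-data-set aggregation that the comparison found problematic for some systems at 100 million rows. It uses SQLite purely to stay self-contained; the schema and query are assumptions, not the thesis's actual test suite.

```python
import sqlite3

# Hypothetical miniature version of the ratings store used in the benchmarks:
# one row per (user, item, rating) pair.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ratings (
        user_id INTEGER NOT NULL,
        item_id INTEGER NOT NULL,
        rating  REAL    NOT NULL,
        PRIMARY KEY (user_id, item_id)
    )
""")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [(1, 10, 4.0), (1, 11, 5.0), (2, 10, 3.5), (2, 12, 4.5), (3, 11, 2.0)],
)

# A query spanning the whole data set: average rating and popularity per item.
for item_id, avg_rating, n in conn.execute("""
    SELECT item_id, AVG(rating), COUNT(*) FROM ratings
    GROUP BY item_id ORDER BY COUNT(*) DESC
"""):
    print(item_id, round(avg_rating, 2), n)
```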
14

A quantitative study on the popularity and performance of SQL and NoSQL DBMS.

Tatsis, Konstantinos January 2022 (has links)
Context: This study compares the popularity and the performance of two classes of DBMS: SQL and NoSQL. The objective of the study is to determine which DBMS junior developers should learn first, in order to provide a head start to their future career. Methods: To determine the most popular DBMS, surveys are collected from the Internet and meta-analyzed. To determine the best-performing DBMS, an SLR (systematic literature review) leading to a meta-analysis is conducted, testing the execution time of the read operation. Results: The research findings suggest that SQL is a more popular DBMS than the NoSQL system. This is verified statistically through the Fisher-Freeman-Halton test, p < .001. As far as performance goes, the SQL DBMS performs a bit better than the NoSQL system when descriptive statistics are considered for 100 records (M=12.4, SD=19.11 vs. M=174.4, SD=284.6) and 1000 records (M=50.77, SD=113.5 vs. M=228.8, SD=276.6). However, the t-test reveals that the difference is not statistically significant. Thus, the statistical test suggests that both DBMSs perform equally well for both 100 and 1000 records, t(8) = 1.27, p = .24 and t(8) = 1.11, p = .3, with a small effect size (Cohen's d: d1 = 0.27 and d2 = 0.28, respectively). Conclusion: Based on our research results and accounting for the date at which this study was conducted (2021), we recommend that junior developers focus on learning an SQL DBMS first as their primary backend skill set for the foreseeable future.
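For readers unfamiliar with the statistics, the sketch below shows how a two-sample t-test and Cohen's d with the reported shape (df = 8, i.e. five observations per group) are computed. The execution times are made up for illustration and are not the study's data.

```python
from statistics import mean, stdev
from scipy.stats import ttest_ind

# Hypothetical execution times (ms) for the read operation, five values per group,
# matching the reported degrees of freedom: df = 5 + 5 - 2 = 8.
sql_times   = [3.0, 5.2, 8.1, 12.4, 33.3]        # made-up values, not the thesis data
nosql_times = [20.0, 45.5, 150.2, 310.0, 346.3]

t_stat, p_value = ttest_ind(nosql_times, sql_times)   # independent two-sample t-test

# Cohen's d with a pooled standard deviation (equal group sizes).
pooled_sd = ((stdev(sql_times) ** 2 + stdev(nosql_times) ** 2) / 2) ** 0.5
cohens_d = (mean(nosql_times) - mean(sql_times)) / pooled_sd

print(f"t(8) = {t_stat:.2f}, p = {p_value:.3f}, d = {cohens_d:.2f}")
```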
15

Indexing forecast models for matching and maintenance

Fischer, Ulrike, Rosenthal, Frank, Böhm, Matthias, Lehner, Wolfgang 01 September 2022 (has links)
Forecasts are important to decision-making and risk assessment in many domains. There has been recent interest in integrating forecast queries inside a DBMS. Answering a forecast query requires the creation of forecast models. Creating a forecast model is an expensive process and may require several scans over the base data as well as expensive operations to estimate model parameters. However, if forecast queries are issued repeatedly, answer times can be reduced significantly if forecast models are reused. Due to the possibly high number of forecast queries, existing models need to be found quickly. Therefore, we propose a model index that efficiently stores forecast models and allows for the efficient reuse of existing ones. Our experiments illustrate that the model index shows a negligible overhead for update transactions, but it yields significant improvements during query execution.
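A minimal sketch of the reuse idea follows. It is not the paper's index structure; the key layout, the naive drift model standing in for real parameter estimation, and the refit-on-new-data check are assumptions for illustration only.

```python
import statistics

class ForecastModelIndex:
    """Cache fitted forecast models by what they cover, so a repeated forecast
    query can skip the expensive re-estimation of model parameters."""

    def __init__(self):
        self._models = {}                       # (relation, attribute, horizon) -> model

    def get_or_create(self, relation, attribute, horizon, series):
        key = (relation, attribute, horizon)
        model = self._models.get(key)
        if model is None or model["trained_on"] < len(series):
            model = self._fit(series)           # the expensive step we want to amortise
            self._models[key] = model
        return model

    @staticmethod
    def _fit(series):
        # Stand-in for real estimation: a drift model y[t+1] = y[t] + mean step
        # (assumes the series has at least two points).
        steps = [b - a for a, b in zip(series, series[1:])]
        return {"last": series[-1], "drift": statistics.mean(steps), "trained_on": len(series)}

    @staticmethod
    def forecast(model, horizon):
        return [model["last"] + model["drift"] * h for h in range(1, horizon + 1)]

idx = ForecastModelIndex()
load = [10.0, 11.0, 12.5, 13.0, 14.2]
model = idx.get_or_create("energy_load", "consumption_mwh", 3, load)
print(idx.forecast(model, 3))                   # second identical query would reuse the model
```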
16

Towards Integrated Data Analytics: Time Series Forecasting in DBMS

Fischer, Ulrike, Dannecker, Lars, Siksnys, Laurynas, Rosenthal, Frank, Boehm, Matthias, Lehner, Wolfgang 27 January 2023 (has links)
Integrating sophisticated statistical methods into database management systems is gaining more and more attention in research and industry in order to be able to cope with increasing data volume and increasing complexity of the analytical algorithms. One important statistical method is time series forecasting, which is crucial for decision making processes in many domains. The deep integration of time series forecasting offers additional advanced functionalities within a DBMS. More importantly, however, it allows for optimizations that improve the efficiency, consistency, and transparency of the overall forecasting process. To enable efficient integrated forecasting, we propose to enhance the traditional 3-layer ANSI/SPARC architecture of a DBMS with forecasting functionalities. This article gives a general overview of our proposed enhancements and presents how forecast queries can be processed using an example from the energy data management domain. We conclude with open research topics and challenges that arise in this area.
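As a toy example of the kind of forecasting functionality being pushed into the DBMS, the sketch below answers a forecast request over made-up hourly energy loads with simple exponential smoothing; the function and the data are assumptions and do not reflect the article's architecture or syntax.

```python
def exponential_smoothing_forecast(history, horizon, alpha=0.3):
    # Simple exponential smoothing: keep a running level and forecast it flat.
    level = history[0]
    for y in history[1:]:
        level = alpha * y + (1 - alpha) * level
    return [level] * horizon

hourly_load_mwh = [42.0, 40.5, 39.8, 41.2, 45.0, 48.7, 52.3, 55.1]   # made-up energy data
print(exponential_smoothing_forecast(hourly_load_mwh, horizon=4))
```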
17

Time Series databaser för sensorsystem : En experimentell studie av prestanda för Time Series databaser för sensorsystem som grundas på: NoSQL eller RDBMS. / Time Series databases for sensor systems

Warrén, Linus, Tallkvist, Daniel January 2019 (has links)
Purpose – The purpose of this study is to recommend a database and its associated database model optimized for a sensor system. There is a lack of comparisons of databases and data models for bigger sensor systems. The study also provides scientific support for those who wish to build a sensor system like the one described in this paper. Method – This paper starts with a literature study, whose purpose is to choose the databases and the database models to be included in the comparison. To achieve the purpose of the study, a quantitative approach has been chosen. The study follows the steps that define an experimental study within software development according to Shari Lawrence Pfleeger. Four predefined cases are used to compare the databases and the different database models obtained in the literature study. Findings – The literature study shows that a Time Series DBMS is the recommended database model to use for implementing sensor systems. The findings of the study also show that TimescaleDB is preferable to InfluxDB in four of four predefined cases. The null hypothesis is rejected and the alternative hypothesis is accepted at the 1% significance level. Implications – The implications of the paper are to enhance the knowledge about Time Series DBMSs, specifically TimescaleDB and InfluxDB for sensor systems. The result can be implemented and used when similar sensor systems are created. According to the results of the experiment, TimescaleDB is better than InfluxDB for sensor systems with a similar data structure. Limitations – Two Time Series DBMSs (TimescaleDB and InfluxDB) were used in the experiments in this paper. The experiments were carried out in Azure and were limited to the 10 vCPUs that a standard account has access to. There were not many beacons available for creating test data, so files with data corresponding to what a beacon sends out were created to simulate beacons. Keywords – Time Series DBMS, NoSQL, RDBMS, TimescaleDB, InfluxDB, Sensor systems / Purpose – The problem description shows that there is a lack of scientific support regarding which kind of database is optimal to use for a sensor system. There is a lack of performance comparisons between different databases and data models in larger sensor systems. The purpose of the study is: "To recommend a database and associated database model optimized for a sensor system". Method – The study begins with a literature study in order to choose, based on theory, the database and database models to be included in the study. To achieve the purpose, a quantitative approach was chosen. The study follows the steps that Shari Lawrence Pfleeger defines as an experimental study in software development. Four predefined cases are used to compare the databases with the different database models obtained in the literature study. Findings – The literature study shows that a Time Series DBMS is the database model recommended for use in a sensor system. The results of the study show that TimescaleDB performs better than InfluxDB in four of four predefined cases. The null hypothesis that was set up is rejected and the alternative hypothesis is accepted at the 1% significance level. Implications – The implications of the study are to increase knowledge and fill certain knowledge gaps around Time Series DBMSs, specifically TimescaleDB and InfluxDB for sensor systems. The result can be applied and used when similar sensor systems are to be implemented. According to the results of the experiment, TimescaleDB is better than InfluxDB for sensor systems with a similar structure.
Limitations – Two Time Series DBMSs (TimescaleDB and InfluxDB) are included in this study, on which the experiments were performed. The experiments were carried out in Azure and were limited by the 10 vCPUs a standard account has access to. A large number of beacons was not available for generating data for the experiments, so files with corresponding data were created to simulate beacons. Keywords – Time Series DBMS, NoSQL, RDBMS, TimescaleDB, InfluxDB, Sensor systems
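The sketch below shows the style of sensor workload being compared: a TimescaleDB hypertable for beacon readings and a time-bucketed aggregate query. It assumes a local PostgreSQL instance with the TimescaleDB extension and a hypothetical readings schema; it is not the thesis's test suite.

```python
import psycopg2

# Assumes a local PostgreSQL database "sensors" with the TimescaleDB extension installed.
conn = psycopg2.connect("dbname=sensors user=postgres")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        time      TIMESTAMPTZ      NOT NULL,
        beacon_id INTEGER          NOT NULL,
        rssi      DOUBLE PRECISION NOT NULL
    );
""")
cur.execute("SELECT create_hypertable('readings', 'time', if_not_exists => TRUE);")

# A single simulated beacon reading.
cur.execute("INSERT INTO readings (time, beacon_id, rssi) VALUES (now(), %s, %s)", (42, -71.5))

# Average signal strength per beacon in 5-minute buckets over the last hour.
cur.execute("""
    SELECT time_bucket('5 minutes', time) AS bucket, beacon_id, avg(rssi)
    FROM readings
    WHERE time > now() - interval '1 hour'
    GROUP BY bucket, beacon_id
    ORDER BY bucket;
""")
print(cur.fetchall())
conn.commit()
```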
18

Comparing database optimisation techniques in PostgreSQL : Indexes, query writing and the query optimiser

Inersjö, Elizabeth January 2021 (has links)
Databases are all around us, and ensuring their efficiency is of great importance. Database optimisation has many parts and many methods; two of these parts are database tuning and query optimisation. These can then further be split into methods such as indexing. These indexing techniques have been studied and compared between Database Management Systems (DBMSs) to see how much they can improve the execution time for queries, and many guides have been written on how to implement query optimisation and indexes. In this thesis, the question "How does indexing and query optimisation affect response time in PostgreSQL?" is posed, and it was answered by investigating these previous studies and theory to find different optimisation techniques and compare them to each other. The purpose of this research was to provide more information about how optimisation techniques can be implemented and to map out when each method should be used. This was partly done to provide learning material for students, but also for people who are starting to learn PostgreSQL. It was done through a literature study and an experiment performed on a database with different table sizes to see how the optimisation scales to larger systems. What was found was that there are many use cases for optimisation that mainly depend on the query performed and the type of data. From both the literature study and the experiment, the main take-away points are that indexes can vastly improve performance, but if used incorrectly can also slow it down. The main use cases for indexes are short queries and queries using spatio-temporal data, although spatio-temporal data should be researched more. Using the DBMS optimiser did not show any difference in execution time for queries, while correctly implemented query tuning techniques vastly improved execution time. The main use cases for query tuning are long queries and nested queries. However, most systems benefit from some sort of query tuning, as it does not have to cost much in terms of memory or CPU cycles, whereas indexes add additional overhead and need some memory. Implementing proper optimisation techniques could both reduce operating costs and help with environmental sustainability by utilising resources more effectively. / Databases are all around us, and having efficient databases is very important. Database optimisation has many different parts, two of which are database tuning and SQL optimisation. These two parts can in turn be divided into several methods, such as indexing. Indexing methods have been studied before, and also compared between DBMSs (Database Management Systems), to see how much an index can improve performance. Many books have also been written on how to implement indexes and SQL optimisation. In this bachelor's thesis the question "How do indexing and SQL optimisation affect performance in PostgreSQL?" is posed. It is answered by examining previous experiments and books in order to find different optimisation techniques and compare them with each other. The purpose of this work was to implement these methods and map out where and when they can be used, in order to help students and people who want to learn about PostgreSQL. This was done by carrying out a literature study and an experiment on a database with different table sizes, in order to see how these methods scale to larger systems. The results show that there are many different use cases for optimisation, depending on the SQL queries and the type of data in the database.
From both the literature study and the experiment, the results show that indexing can improve performance to varying degrees, in some cases very much, but if implemented incorrectly the performance can become worse. The main use cases for indexing are short SQL queries and databases using spatio-temporal data, although spatio-temporal data should be studied further. Using the database system's optimiser showed neither improvement nor deterioration, while a correct rewriting of an SQL query could improve performance considerably. The main use case for rewriting SQL queries is long SQL queries and nested SQL queries. However, many systems can benefit from rewriting SQL queries for performance, since it can cost very little in terms of memory and CPU, unlike indexing, which needs more memory and creates so-called overhead. Implementing optimisation techniques can improve both operating costs and help with sustainable development by using resources more efficiently.
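The two techniques from the experiment can be illustrated with a short sketch: adding an index for a short selective query, and rewriting a nested query as a join, with EXPLAIN ANALYZE showing the effect. The orders/customers tables are hypothetical and the snippet assumes a local PostgreSQL database; it is not the thesis's benchmark.

```python
import psycopg2

# Assumes a local PostgreSQL database "shop" with hypothetical orders/customers tables.
conn = psycopg2.connect("dbname=shop user=postgres")
cur = conn.cursor()

# Index for a short, selective lookup; EXPLAIN ANALYZE shows whether the planner uses it.
cur.execute("CREATE INDEX IF NOT EXISTS idx_orders_customer ON orders (customer_id);")
cur.execute("EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;")
print("\n".join(row[0] for row in cur.fetchall()))

# Query rewriting: the subquery form below can often be expressed as a plain join.
nested = """
    SELECT o.id FROM orders o
    WHERE o.customer_id IN (SELECT c.id FROM customers c WHERE c.country = 'SE');
"""
rewritten = """
    SELECT o.id FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE c.country = 'SE';
"""
for query in (nested, rewritten):
    cur.execute("EXPLAIN ANALYZE " + query)
    print("\n".join(row[0] for row in cur.fetchall()))
```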
19

Compression Selection for Columnar Data using Machine-Learning and Feature Engineering

Persson, Douglas, Juelsson Larsen, Ludvig January 2023 (has links)
There is a continuously growing demand for improved solutions that provide both efficient storage and efficient retrieval of big data for analytical purposes. This thesis researches the use of machine learning together with feature engineering to recommend the most cost-effective compression algorithm and encoding combination for columns in a columnar database management system (DBMS). The framework consists of a cost function calculated using compression time, decompression time, and compression ratio. An XGBoost machine-learning model is trained on labels provided by the cost function to recommend the most cost-effective combination for columnar data within a column- or vector-oriented DBMS. While the methods are applied on ClickHouse, one of the most popular open-source column-oriented DBMSs on the market, the results are broadly applicable to column-oriented data that shares data types and characteristics with IoT telemetry data. Using billions of available rows of numeric real business data obtained at Axis Communications in Lund, Sweden, a set of features is engineered to accurately describe the characteristics of a given column. The proposed framework allows for weighting the business interests (compression time, decompression time, and compression ratio) to determine the individually optimal cost-effective solution. The model reaches an accuracy of 99% on the test dataset and an accuracy of 90.1% on unseen data by leveraging data features that are predictive of compression algorithm and encoding performance. Following ClickHouse strategies and the most suitable practices in the field, combinations of general-purpose compression algorithms and data encodings are analysed that together yield the best results in efficiently compressing the data of certain columns. Applying the unweighted recommended combinations on all columns, the framework's performance impact was measured to increase the average compression speed by 95.46%, reducing the time to compress the columns from 31.17 seconds to 13.17 seconds. Additionally, the decompression speed was increased by 59.87%, reducing the time to decompress the columns from 2.63 seconds to 2.02 seconds, at the cost of decreasing the compression ratio by 66.05%, which increased the storage requirements by 94.9 MB. In column and vector databases, chunks of data belonging to a certain column are often stored together on disk. Therefore, choosing the right compression algorithm can lower the storage requirements and boost database throughput.
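A minimal sketch of the framework's idea follows: score each algorithm-and-encoding combination with a weighted cost over compression time, decompression time and compression ratio, then train an XGBoost classifier on the cheapest label per column. The feature set, the random stand-in measurements and the model settings are assumptions, not the thesis implementation.

```python
import numpy as np
from xgboost import XGBClassifier

def cost(comp_time, decomp_time, ratio, w_ct=1.0, w_dt=1.0, w_ratio=1.0):
    # Lower is better: times are penalties, a higher compression ratio is a reward.
    return w_ct * comp_time + w_dt * decomp_time - w_ratio * ratio

# Hypothetical engineered column features (e.g. cardinality, monotonicity, value range)
# and, per column, stand-in measured costs for three candidate combinations.
rng = np.random.default_rng(0)
features = rng.random((500, 6))
measured_costs = rng.random((500, 3))            # one cost per candidate combination
labels = measured_costs.argmin(axis=1)           # cheapest combination becomes the label

model = XGBClassifier(n_estimators=100, max_depth=4)
model.fit(features, labels)

new_column_features = rng.random((1, 6))
print("recommended combination:", model.predict(new_column_features)[0])
```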
20

Model-Driven Development of Complex and Data-Intensive Integration Processes

Boehm, Matthias, Habich, Dirk, Lehner, Wolfgang, Wloka, Uwe 12 January 2023 (has links)
Due to the changing scope of data management from centrally stored data towards the management of distributed and heterogeneous systems, the integration takes place on different levels. The lack of standards for information integration as well as application integration resulted in a large number of different integration models and proprietary solutions. With the aim of a high degree of portability and the reduction of development efforts, the model-driven development—following the Model-Driven Architecture (MDA)—is advantageous in this context as well. Hence, in the GCIP project (Generation of Complex Integration Processes), we focus on the model-driven generation and optimization of integration tasks using a process-based approach. In this paper, we contribute detailed generation aspects and finally discuss open issues and further challenges.
