1 |
Replacing batch-based data extraction with event streaming with Apache Kafka: A comparative study
Axelsson, Richard. January 2022
For growing organisations that have built their data flow around a monolithic database server, an ever-increasing number of applications and an ever-increasing demand for data freshness will eventually push the existing system to its limits, prompting either hardware upgrades or an updated data architecture. Switching from full extractions of data at regular intervals to an approach where only changes are extracted could potentially decrease resource consumption while simultaneously increasing data freshness.
The objective of this thesis is to provide insights into how implementing an event streaming setup with Apache Kafka, connected to SQL Server through the Debezium source connector, affects resource consumption on the database server. Related studies have often focused on steps further downstream in the data pipeline, so this thesis contributes to an area where more knowledge is needed.
Through an empirical study using two different setups in the same system, traditional data extraction in batches and extraction through event streaming are measured and compared. The point of measurement is the SQL Server database from which data is extracted. Both memory utilisation and CPU utilisation are measured, using SQL Server Profiler. Different parameters for table sizes, volumes of data and intervals between changes are used to simulate different scenarios.
One of the takeaways of the results is that, at the same total number of changes, the size of the individual transactions has a large impact on the resource consumption caused by event streaming. The study shows that each transaction carries an overhead cost, and that the regular polling performed by the source connector consumes resources even when the system is idle. The thesis concludes that event streaming can offer reduced resource consumption on the database server. However, when the source table is small and the number of changes is large, extraction in batches is less resource-intensive.
|
2 |
Построение потокового захвата изменения данных для аналитических хранилищ данных : магистерская диссертация / Building a streaming data change capture system for analytical data warehouses: master's thesis
Golikov, A. A. January 2024
The object of the thesis is streaming data change capture. The purpose of the work is to analyze methods for building a streaming data change capture system and to implement the best method identified in the analysis. Research methods: theoretical analysis, testing, programming. The result of the work is the successful implementation and testing of a streaming data change capture system based on Kafka Connect, combining the Debezium connector for MySQL with the ClickHouse Kafka Connect Sink for ClickHouse. The work also resolves the sink's limitation in processing records deleted from the data source and obtains the current state of the data from the source. The results are applicable to data engineering and artificial intelligence engineering. The significance of the work lies in the possibility of putting it into practice at the author's place of work, and in its flexible approach to solving the problem under the constraints of the available tools.
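As an illustration only of what handling deleted records and obtaining the current state of the data involve in change data capture, the following minimal Python sketch folds an ordered stream of change events into the latest row per key. It is not the thesis's implementation, which the abstract does not describe. The event shape (an `op` code of `c`, `u`, `d` or `r` with `before` and `after` row images) is a simplified stand-in modelled on Debezium's change event envelope; the function name `apply_change_events`, the `id` key column and the sample rows are hypothetical.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def apply_change_events(events: Iterable[Dict[str, Any]], key: str = "id") -> Dict[Any, Row]:
    """Fold an ordered stream of change events into the latest row per key.

    Assumes a simplified envelope: "op" is one of "c" (create), "u" (update),
    "r" (snapshot read) or "d" (delete); "after" holds the new row image and
    "before" holds the old one. Other op codes are ignored in this sketch.
    """
    state: Dict[Any, Row] = {}
    for event in events:
        op = event["op"]
        if op in ("c", "u", "r"):
            # Inserts, updates and snapshot reads all overwrite the stored row.
            row = event["after"]
            state[row[key]] = row
        elif op == "d":
            # A delete carries only the old image; drop the row if present.
            row = event["before"]
            state.pop(row[key], None)
    return state


# Hypothetical sample stream: two inserts, one update, one delete.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "alpha"}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "beta"}},
    {"op": "u", "before": {"id": 1, "name": "alpha"}, "after": {"id": 1, "name": "alpha-2"}},
    {"op": "d", "before": {"id": 2, "name": "beta"}, "after": None},
]
current = apply_change_events(events)
print(current)  # {1: {'id': 1, 'name': 'alpha-2'}}
```

In an analytical store the same effect is usually achieved inside the database rather than in application code; the sketch only shows why delete events need explicit treatment when reconstructing current state from a change log.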
|