Global ETD Search

Return to search

Parallelization of Dataset Transformation with Processing Order Constraints in Python / Parallelisering av datamängdstransformation med ordningsbegränsningar i Python

Financial data is often represented with rows of values, contained in a dataset. This data needs to be transformed into a common format in order for comparison and matching to be made, which can take a long time for larger datasets. The main goal of this master’s thesis is speeding up these transformations through parallelization using Python multiprocessing. The datasets in question consist of several rows representing trades, and are transformed into a common format using rules known as filters. In order to devise a parallelization strategy, the filters were analyzed in order to find ordering constraints, and the Python profiler cProfile was used to find bottlenecks and potential parallelization points. This analysis resulted in the use of a task-based approach for the implementation, in which the transformation was divided into an initial sequential pre-processing step, a parallel step where chunks of several trade rows were distributed among workers, and a sequential post processing step. The implementation was tested by transforming four datasets of differing sizes using up to 16 workers, and execution time and memory consumption was measured. The results for the tiny, small, medium, and large datasets showed a speedup of 0.5, 2.1, 3.8, and 4.81. They also showed linearly increasing memory consumption for all datasets. The test transformations were also profiled in order to understand the parallel program’s behaviour for the different datasets. The experiments gave way to the conclusion that dataset size heavily influences the speedup, partly because of the fact that the sequential parts become less significant. In addition, the large memory increase for larger amount of workers is noted as a major downside of multiprocessing when using caching mechanisms, as data is duplicated instead of shared. This thesis shows that it is possible to speed up the dataset transformations using chunks of rows as tasks, though the speedup is relatively low. / Finansiell data representeras ofta med rader av värden, samlade i en datamängd. Denna data måste transformeras till ett standardformat för att möjliggöra jämförelser och matchning. Detta kan ta lång tid för stora datamängder. Huvudmålet för detta examensarbete är att snabba upp dessa transformationer genom parallellisering med hjälp av Python-modulen multiprocessing. Datamängderna omvandlas med hjälp av regler, kallade filter. Dessa filter analyserades för att identifiera begränsningar på ordningen i vilken datamängden kan behandlas, och därigenom finna en parallelliseringsstrategi. Python-profileraren cProfile an- vändes även för att hitta potentiella parallelliseringspunkter i koden. Denna analys resulterade i användandet av ett “task”-baserat tillvägagångssätt, där transformationen delades in i ett sekventiellt pre-processingsteg, ett parallelt steg där grupper av rader distribuerades ut bland arbetar-processer, och ett sekventiellt post-processingsteg. Implementationen testades genom transformation av fyra datamängder av olika storlekar, med upp till 16 arbetarprocesser. Resultaten för de fyra datamängderna var en speedup på 0.5, 2.1, 3.8 respektive 4.81. En linjär ökning i minnesanvändning uppvisades även. Experimenten resulterade i slutsatsen att datamängdens storlek var en betydande faktor i hur mycket speedup som uppvisades, delvis på grund av faktumet att de sekventiella delarna tar upp en mindre del av programmet. Den stora minnesåtgången noterades som en nackdel med att använda multiprocessing i kombination med cachning, på grund av duplicerad data. Detta examensarbete visar att det är möjligt att snabba upp datamängdstransformation genom att använda radgrupper som tasks, även om en relativt låg speedup uppvisades.

http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-189574

Datavetenskap (datalogi)

Identifer	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:kth-189574
Date	January 2016
Creators	Gramfors, Dexter
Publisher	KTH, Skolan för datavetenskap och kommunikation (CSC)
Source Sets	DiVA Archive at Upsalla University
Language	English
Detected Language	English
Type	Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess

Page generated in 0.0027 seconds

Parallelization of Dataset Transformation with Processing Order Constraints in Python / Parallelisering av datamängdstransformation med ordningsbegränsningar i Python

Description

Links & Downloads

Tags

Additional Fields