Global ETD Search

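11	Policy-Aware Parallel Execution of Composite Services / 複合サービスのポリシーアウェアな並列実行 Mai, Xuan Trang 23 March 2016 (has links) Kyoto University / 0048 / New system, course-based doctorate / Doctor of Informatics / Ko No. 19855 / Johaku No. 606 / 新制||情||105 (University Library) / 32891 / Department of Social Informatics, Graduate School of Informatics, Kyoto University / (Chief examiner) Professor Toru Ishida, Professor Masatoshi Yoshikawa, Professor Yasuo Okabe / Qualified under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Informatics / Kyoto University / DFAM Service Composition Parallel Execution Policies Data Parallelism Degree of Parallelism Performance prediction 007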
12	Analysis and Comparison of Distributed Training Techniques for Deep Neural Networks in a Dynamic Environment / Analys och jämförelse av distribuerade tränings tekniker för djupa neurala nätverk i en dynamisk miljö Gebremeskel, Ermias January 2018 (has links) Deep learning models' prediction accuracy tends to improve with the size of the model. The implication is that the amount of computational power needed to train models is continuously increasing. Distributed deep learning training tries to address this issue by spreading the computational load onto several devices. In theory, distributing computation onto N devices should give an N-fold performance improvement. Yet in reality the improvement is rarely N-fold, due to communication and other overheads. This thesis studies the communication overhead incurred when distributing deep learning training. Hopsworks is a platform designed for data science. The purpose of this work is to explore a feasible way of deploying distributed deep learning training on a shared cluster and to analyze the performance of different distributed deep learning algorithms to be used on this platform. The findings of this study show that bandwidth-optimal communication algorithms like ring all-reduce scale better than many-to-one communication algorithms like parameter server, but are less fault tolerant. Furthermore, the system usage statistics collected revealed a network bottleneck when training is distributed across multiple machines. This work also shows that it is possible to run MPI on a Hadoop cluster, by building a prototype that orchestrates resource allocation, deployment, and monitoring of MPI-based training jobs. Even though the experiments did not cover different cluster configurations, the results are still relevant in showing what considerations need to be made when distributing deep learning training. / The accuracy of deep learning models tends to improve with the size of the model. 
The implication is that the amount of computing power required to train models is continuously increasing. Distributed deep learning tries to solve this problem by distributing the computational load across several devices. Distributing the computation across N devices would in theory give linear scalability (N-fold). In reality this rarely holds, because of overhead from network communication or I/O. Hopsworks is a data analysis and machine learning platform. The purpose of this work is to explore a feasible way of running distributed deep learning training on a shared computer cluster, and to analyze the performance of different distributed deep learning algorithms for use on the platform. The results of this study show that network-optimal algorithms such as ring all-reduce scale better for distributed deep learning than many-to-one communication algorithms such as parameter server, but are not as fault tolerant. Data collected from the experiments revealed a network bottleneck when training on multiple machines. This work also shows that it is possible to run MPI programs on a Hadoop cluster, by building a prototype that orchestrates resource allocation, deployment, and monitoring of execution. Although the experiments do not cover different cluster configurations, the results show which factors should be taken into account when distributing the training of deep learning models. deep learning large scale distributed deep learning data parallelism Computer Sciences Datavetenskap (datalogi)
13	Locality Conscious Scheduling Strategies for High Performance Data Analysis Applications Vydyanathan, Nagavijayalakshmi 20 August 2008 (has links) No description available. Computer Science scheduling resource management task-parallelism data-parallelism pipelined-parallelism
14	Data Parallelism For Ray Casting Large Scenes On A CPU-GPU Cluster Topcu, Tumer 01 June 2008 (has links) (PDF) In the last decade, the computational power, memory bandwidth and programmability of graphics processing units (GPUs) have evolved rapidly. As a result, much research has been conducted on using GPUs for advanced graphics rendering. Because of its high degree of parallelism, ray tracing has been one of the first algorithms studied on GPUs. However, the rendering of large scenes with ray tracing can easily exceed the GPU's memory capacity. The algorithm proposed in this work uses a data parallel approach where the scene is partitioned and assigned to CPU-GPU couples in a cluster to overcome this problem. Our algorithm focuses on ray casting, which is a special case of ray tracing mainly used in the visualization of volumetric data. CPUs are quite efficient at flow control and branching, while GPUs are very fast at intensive floating point operations. Using these facts, the GPUs in the cluster are assigned the task of performing ray casting while the CPUs are responsible for traversing the rays. In the end, we were able to visualize large scenes successfully by utilizing CPU-GPU couples effectively and observed that the performance is highly dependent on the viewing angle as a result of load imbalance.
15	Evaluation of Machine Learning Primitives on a Digital Signal Processor Engström, Vilhelm January 2020 (has links) Modern handheld devices rely on specialized hardware for evaluating machine learning algorithms. This thesis investigates the feasibility of using the digital signal processor, a part of the modem of the device, as an alternative to this specialized hardware. Memory management techniques and implementations for evaluating the machine learning primitives convolutional, max-pooling and fully connected layers are proposed. The implementations are evaluated by the degree to which they utilize the available hardware units. New instructions for packing data and facilitating instruction pipelining are suggested and evaluated. The results show that convolutional and fully connected layers are well-suited to the processor used. The aptness of the convolutional layer is subject to the kernel being applied with a stride of 1, as larger strides cause the hardware usage to plummet. Max-pooling layers, while not ill-suited, are the most limited in terms of hardware usage. The proposed instructions are shown to have positive effects on the throughput of the implementations. digital signal processor DSP SIMD data parallelism machine learning deep learning convolutional neural network Computer Engineering Datorteknik
16	Comparison of Shared memory based parallel programming models Ravela, Srikar Chowdary January 2010 (has links) Parallel programming models are a challenging and emerging topic in the parallel computing era. These models allow a developer to port a sequential application onto a platform with a larger number of processors so that the problem or application can be solved more easily. Adapting applications in this manner is often influenced by the type of application, the type of platform and many other factors. Several parallel programming models have been developed, and they fall into two main classes: shared memory based and distributed memory based models. The recognition of applications with immense computing requirements has led to an obstacle: developing efficient programming models that bridge the gap between the hardware's ability to perform the computations and the software's ability to support that performance for those applications [25][9]. A better programming model is therefore needed, one that facilitates easy development and at the same time delivers high performance. To answer this challenge, this thesis confines itself to four different shared memory based parallel programming models and compares them, weighing the development time of an application under each model against the performance achieved by that application in the same model. The programming models are evaluated using data parallel applications, to verify their ability to support data parallelism with respect to the development time of those applications. The data parallel applications are borrowed from the Dense Matrix dwarfs; the dwarfs used are Matrix-Matrix multiplication, Jacobi Iteration and Laplace Heat Distribution. 
The experimental method consists of selecting three data parallel benchmarks and developing them under the four shared memory based parallel programming models considered for the evaluation. The performance of those applications under each programming model is also recorded, and finally the results are used to analytically compare the parallel programming models. The results show that, at the cost of development time, better performance is achieved for the chosen data parallel applications when developed in Pthreads. On the other hand, at the cost of a little performance, data parallel applications are extremely easy to develop in task based parallel programming models. The directive models are moderate from both perspectives and are rated between the tasking models and the threading models. / From this study it is clear that the threading model Pthreads is a dominant programming model, supporting high speedups for two of the three dwarfs. The tasking models, on the other hand, are dominant in development time and in reducing the number of errors; they support high growth in speedup for applications without any communication and lower growth in self-relative speedup for applications involving communication. The performance of the tasking models degrades for communication-based problems because task based models are designed and bound to execute tasks in parallel without any interruptions or preemptions during their computations. Introducing communication violates this purpose and thereby results in lower performance. The directive model OpenMP is moderate in both aspects and stands in between these models. In general, the directive models and tasking models offer better speedup than any other models for task based problems that follow the divide and conquer strategy. For data parallelism, however, the speedup growth they achieve is low (i.e. they are less scalable for data parallel applications), although their execution times are comparable with those of the threading models. 
Their development times are also considerably lower for data parallel applications, because these models require fewer functional routines to parallelize the applications. This thesis is concerned with comparing the shared memory based parallel programming models in terms of speedup. This type of work serves as a handy guide that programmers can consult when developing applications under shared memory based parallel programming models. We suggest that this work can be extended in two different ways: one is from the developer's perspective and the other is a cross-referential study of the parallel programming models. The former can be done by having a different programmer carry out a similar study and comparing it with this one. The latter can be done by including multiple data points in the same programming model or by using a different set of parallel programming models for the study. Parallel Programming models Distributed memory Shared memory Dwarfs Development time Speedup Data parallelism Dense Matrix dwarfs threading models Tasking models Directive models. Computer Sciences Datavetenskap (datalogi)
17	Concevoir et partager des workflows d’analyse de données : application aux traitements intensifs en bioinformatique / Design and share data analysis workflows : application to bioinformatics intensive treatments Moreews, François 11 December 2015 (has links) As part of an Open Science approach, we are interested in scientific workflow management systems (WfMS) and their applications to intensive data analysis in bioinformatics. We start from the hypothesis that WfMS can evolve into pivotal platforms capable of accelerating the development and dissemination of innovative analysis methods. They could attract and unite around a disciplinary theme not only the current audience of service consumers but also that of service producers. To this end, we consider that these environments must both be adapted to the practices of the scientists who design methods and provide a productivity gain during design and processing. These constraints lead us to study the rapid capture of workflows, the simplified integration of technical tasks, such as the parallelism required for high throughput, and the customization of deployment. First, we defined an expressive graphical DataFlow language suited to the rapid capture of workflows. It can be interpreted by a workflow engine based on a new model of computation with high performance, obtained by exploiting multiple levels of parallelism. We then present a model-driven design approach that facilitates the generation of data parallelism and the production of implementations suited to different execution contexts. In particular, we describe the integration of a metamodel of components and platforms, used to automate the configuration of workflow dependencies. 
Finally, for the Container as a Service (CaaS) model, we developed a workflow specification that is intrinsically shareable and re-executable. Adopting this type of model could accelerate the exchange and availability of data analysis pipelines. / As part of an Open Science initiative, we are particularly interested in scientific Workflow Management Systems (WfMS) and their applications for intensive data analysis in bioinformatics. We start from the assumption that WfMS can evolve to become efficient hubs able to speed up the development and the dissemination of innovative analysis methods. These software platforms could rally and unite not only the current stakeholders, who are service consumers, but also the service producers, around a disciplinary theme. We therefore consider that these environments must both be adapted to the practices of the scientists who are method designers and provide increased productivity during design and processing. These constraints lead us to study the rapid capture of workflows, the simplification of the integration of technical tasks such as parallelisation, and deployment customization. First, we define an expressive graphic workflow language, adapted to the quick capture of workflows. This is interpreted by a workflow engine based on a new model of computation with high performance, obtained by the use of multiple levels of parallelism. Then, we present a Model-Driven design approach that facilitates the generation of data parallelism and the production of suitable implementations for different execution contexts. We describe in particular the integration of a components and platforms meta-model used to automate the configuration of workflows' dependencies. Finally, in the case of the cloud model Container as a Service (CaaS), we develop a workflow specification that is intrinsically re-executable and readily disseminable. 
The adoption of this kind of model could lead to an acceleration of exchanges and a better availability of data analysis workflows. Calcul intensif Modèle de calcul Dataflow MDE Méta-Modèle Bioinformatique Workflow Parallélisme de données Intensive computation Model of computation Dataflow MDE Meta-Model Bioinformatics Workflow Data parallelism
18	Storage Format Selection and Optimization for Materialized Intermediate Results in Data-Intensive Flows Munir, Rana Faisal 01 February 2021 (has links) Modern organizations produce and collect large volumes of data that need to be processed repeatedly and quickly to gain business insights. For such processing, Data-intensive Flows (DIFs) are typically deployed on distributed processing frameworks. The DIFs of different users have many computation overlaps (i.e., parts of the processing are duplicated), thus wasting computational resources and increasing the overall cost. The output of these computation overlaps (known as intermediate results) can be materialized for reuse, which, if done properly, reduces the cost and saves computational resources. Furthermore, the way such outputs are materialized must be considered, as different storage layouts (i.e., horizontal, vertical, and hybrid) can be used to reduce the I/O cost. In this PhD work, we first propose a novel approach for automatically materializing the intermediate results of DIFs through a multi-objective optimization method, which can tackle multiple and conflicting quality metrics. Next, we study the behavior of the different DIF operators that are the first to process the loaded materialized results. Based on this study, we devise a rule-based approach that decides the storage layout for materialized results based on the subsequent operation types. Despite improving the cost in general, the heuristic rules do not consider the amount of data read while making the choice, which could lead to a wrong decision. Thus, we design a cost model that is capable of finding the right storage layout for every scenario. The cost model uses data and workload characteristics to estimate the I/O cost of a materialized intermediate result with different storage layouts and chooses the one with the minimum cost. 
The results show that storage layouts help to reduce the loading time of materialized results and, overall, improve the performance of DIFs. The thesis also focuses on the optimization of the configurable parameters of hybrid layouts. We propose ATUN-HL (Auto TUNing Hybrid Layouts), which, based on the same cost model and given the workload and data characteristics, finds the optimal values for configurable parameters in hybrid layouts (i.e., Parquet). Finally, the thesis also studies the impact of parallelism in DIFs and hybrid layouts. Our proposed cost model helps to devise an approach for fine-tuning the parallelism by deciding the number of tasks and machines to process the data. Thus, the cost model proposed in this thesis enables choosing the best possible storage layout for materialized intermediate results, tuning the configurable parameters of hybrid layouts, and estimating the number of tasks and machines for the execution of DIFs. / Modern companies produce and collect large volumes of data that must be processed repeatedly and quickly to gain business insights. To process this data, data-intensive flows (DIFs) are typically deployed on distributed systems such as MapReduce. The DIFs of different users overlap to a large extent, so that much work is done multiple times, resources are wasted, and the overall cost increases. To counteract this effect, the intermediate results of DIFs can be materialized for later reuse. In doing so, the different storage layouts (horizontal, vertical, and hybrid) in particular must be taken into account. This doctoral thesis proposes a novel approach for automatically materializing the intermediate results of DIFs through a multi-objective optimization method that can handle conflicting quality metrics. 
Furthermore, the interaction between different operator types and different storage layouts is studied. Based on this study, a rule-based approach is proposed that determines the storage layout for materialized results from the subsequent operation types. Although the overall cost of executing DIFs generally improves, the heuristic approach cannot take the amount of data read into account when selecting the storage layout. In some cases this can lead to wrong decisions. For this reason, a cost model is developed that can find the right storage layout for every scenario. Using data and workload characteristics, the cost model estimates the I/O cost of a materialized intermediate result under different storage layouts and selects the one with the minimum cost. The results show that storage layouts shorten the loading time of materialized results and improve the overall performance of DIFs. The thesis also addresses the optimization of the configurable parameters of hybrid layouts. Specifically, the ATUN-HL approach (Auto TUNing Hybrid Layouts) is developed, which, based on the same cost model and taking the workload and data characteristics into account, finds the optimal values for configurable parameters in Parquet, an implementation of hybrid layouts. Finally, this thesis also studies the effects of parallelism in DIFs and hybrid layouts. To this end, an approach is developed that can automatically determine the number of tasks and the machines they require. 
In summary, the cost model proposed in this thesis makes it possible to determine the best possible storage layout for materialized intermediate results, to set the configurable parameters of hybrid layouts, and to estimate the number of tasks and machines for executing DIFs. info:eu-repo/classification/ddc/004 ddc:004
19	Co-designing Communication Middleware and Deep Learning Frameworks for High-Performance DNN Training on HPC Systems Awan, Ammar Ahmad 10 September 2020 (has links) No description available. Computer Science Artificial Intelligence Data Parallelism Model Parallelism Hybrid Parallelism Keras Caffe TensorFlow PyTorch, MPI, Eager Execution Deep Learning Scalable DNN Training MVAPICH2-GDR CUDA-Aware MPI
20	Stream Computing on FPGAs Plavec, Franjo 01 September 2010 (has links) Field Programmable Gate Arrays (FPGAs) are programmable logic devices used for the implementation of a wide range of digital systems. In recent years, there has been an increasing interest in design methodologies that allow high-level design descriptions to be automatically implemented in FPGAs. This thesis describes the design and implementation of a novel compilation flow that implements circuits in FPGAs from a streaming programming language. The streaming language supported is called FPGA Brook, and is based on the existing Brook and GPU Brook languages, which target streaming multiprocessors and graphics processing units (GPUs), respectively. A streaming language is suitable for targeting FPGAs because it allows system designers to express applications in a way that exposes parallelism, which can then be exploited through parallel hardware implementation. FPGA Brook supports replication, which allows the system designer to trade-off area for performance, by specifying the parts of an application that should be implemented as multiple hardware units operating in parallel, to achieve desired application throughput. Hardware units are interconnected through FIFO buffers, which effectively utilize the small memory modules available in FPGAs. The FPGA Brook design flow uses a source-to-source compiler, and combines it with a commercial behavioural synthesis tool to generate hardware. The source-to-source compiler was developed as a part of this thesis and includes novel algorithms for implementation of complex reductions in FPGAs. The design flow is fully automated and presents a user-interface similar to traditional software compilers. A suite of benchmark applications was developed in FPGA Brook and implemented using our design flow. 
Experimental results show that applications implemented using our flow achieve much higher throughput than the Nios II soft processor implemented in the same FPGA device. Comparison to the commercial C2H compiler from Altera shows that while simple applications can be effectively implemented using the C2H compiler, complex applications achieve significantly better throughput when implemented by our system. Performance of many applications implemented using our design flow would scale further if a larger FPGA device were used. The thesis demonstrates that using an automated design flow to implement streaming applications in FPGAs is a promising methodology. Field Programmable Gate Array FPGA Streaming Behavioral synthesis Behavioural synthesis Data parallelism Task parallelism Replication Throughput Scalability C2H High-level synthesis Reduction Parallel reduction FIFO Strength reduction 0544 0984

Search results