1

Improving Performance and Quality-of-Service through the Task-Parallel Model: Optimizations and Future Directions for OpenMP

Podobas, Artur January 2015 (has links)
With the failure of Dennard scaling -- which held that transistors would become more power-efficient as they shrank -- computer hardware has become highly divergent. Initially the change only concerned the number of processors on a chip (multicores), but it has since escalated into complex heterogeneous systems with non-intuitive properties -- properties that can improve performance and power consumption, but that also strain the programmers expected to develop on them. Answering these challenges is the OpenMP task-parallel model -- a programming model that simplifies writing parallel software. Our focus in this thesis has been to explore performance and quality-of-service directions of the OpenMP task-parallel model, particularly by taking architectural features into account. The first question tackled is: what capabilities do existing state-of-the-art runtime systems have, and how do they perform? We empirically evaluated the performance of several modern task-parallel runtime systems. Performance and power consumption were measured using benchmarks, and we show that the two primary causes of bottlenecks in modern runtime systems lie either in task-management overheads or in how tasks are distributed across processors. Next, we consider quality-of-service improvements in task-parallel runtime systems. Striving to improve execution performance, current state-of-the-art runtime systems seldom take dynamic architectural features such as temperature into account when deciding how work should be distributed across the processors, which can lead to overheating. We developed and evaluated two strategies for thermal awareness in task-parallel runtime systems. The first improves performance when the computer system is constrained by temperature, while the second strives to reduce temperature while meeting soft real-time objectives. We end the thesis by focusing on performance. Here we introduce our original contribution called BLYSK -- a prototype OpenMP framework created exclusively for performance research. We found that overheads in current runtime systems can be expensive, which often leads to performance degradation. We introduce a novel way of preserving task graphs across application runs: task graphs are recorded, identified, and optimized the first time an OpenMP application is executed, and are reused in subsequent executions, removing unnecessary overheads. Our proposed solution can nearly double the performance compared with other state-of-the-art runtime systems. Performance can also be improved through heterogeneity. Today, manufacturers are placing processors with different capabilities on the same chip. Because the processors differ, so do their power-consumption characteristics and performance. Heterogeneity adds another dimension to the multiprocessing problem: how should work be distributed across the heterogeneous processors? We evaluated the performance of existing, homogeneous scheduling algorithms and found them to be an ill match for heterogeneous systems. We propose a novel scheduling algorithm that dynamically adjusts itself to the heterogeneous system in order to improve performance. The thesis ends with a high-level synthesis approach to improving performance in task-parallel applications. Rather than limiting ourselves to off-the-shelf processors -- which often contain a large amount of unused logic -- our approach is to automatically generate the processors ourselves. Our method generates application-specific hardware from OpenMP task-parallel source code. Evaluated on FPGAs, our systems-on-chip outperformed soft cores such as the Nios II processor and were comparable in performance to modern state-of-the-art processors such as the Xeon Phi and the AMD Opteron. / QC 20151016
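The task-parallel model this abstract refers to can be made concrete with a short OpenMP fragment. The sketch below is a generic illustration of OpenMP tasking (fork-join via task and taskwait), not code from BLYSK or the thesis; the recursive Fibonacci kernel is chosen only because it produces a simple task graph.

```cpp
#include <cstdio>
#include <omp.h>

// Classic recursive Fibonacci expressed with OpenMP tasks.
// Each call spawns two child tasks; taskwait joins them, forming the
// task graph that task-parallel runtime systems schedule across cores.
long fib(int n) {
    if (n < 2) return n;
    long x, y;
    #pragma omp task shared(x) firstprivate(n)
    x = fib(n - 1);
    #pragma omp task shared(y) firstprivate(n)
    y = fib(n - 2);
    #pragma omp taskwait   // join: wait for both child tasks
    return x + y;
}

int main() {
    long result = 0;
    #pragma omp parallel
    #pragma omp single     // one thread seeds the task graph
    result = fib(30);
    std::printf("fib(30) = %ld\n", result);
    return 0;
}
```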
2

Research on Parallel Hierarchical Matrix Construction / 階層型行列生成の並列化に関する研究

Bai, Zhengyang 23 March 2023 (has links)
Kyoto University / New-system course doctorate / Doctor of Informatics / 甲第24744号 / 情博第832号 / 新制||情||139(附属図書館) / Department of Systems Science, Graduate School of Informatics, Kyoto University / (Chief examiner) Associate Professor Keiichiro Fukazawa, Professor Toshiyuki Tanaka, Professor Shin Ishii / Qualified under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Informatics / Kyoto University / DFAM
3

Parallelization of Graph Mining using Backtrack Search Algorithm / バックトラック探索アルゴリズムを用いるグラフマイニングの並列化

Okuno, Shingo 23 March 2017 (has links)
Kyoto University / 0048 / New-system course doctorate / Doctor of Informatics / 甲第20518号 / 情博第646号 / 新制||情||112(附属図書館) / Department of Systems Science, Graduate School of Informatics, Kyoto University / (Chief examiner) Professor Hiroshi Nakashima, Professor Hiroshi Nagamochi, Professor Toshiyuki Tanaka / Qualified under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Informatics / Kyoto University / DFAM
4

Runtime Systems for Load Balancing and Fault Tolerance on Distributed Systems

Arafat, Md Humayun January 2014 (has links)
No description available.
5

ZipperOTF: Automatic, Precise, and Simple Data Race Detection for Task Parallel Programs with Mutual Exclusion

Powell, S. Jacob 31 July 2020 (has links)
Data races in parallel programs can be difficult to detect precisely, and doing so manually often proves unsuccessful. Task-parallel programming models can help reduce defects introduced by the programmer by restricting concurrency to fork-join operations. Typical data race detection algorithms compute the happens-before relation either by tracking the order in which shared accesses occur using a vector clock, or by grouping events into sets that classify which heap locations are accessed sequentially or in parallel. Access sets are simple and efficient to compute and have been shown to have the potential to outperform vector-clock approaches in certain use cases. However, they do not support arbitrary thread synchronization, are limited to fork-join or similar structures, and do not support mutual exclusion. Vector-clock approaches do not scale as well to many threads with many shared interactions, rendering them inefficient in many cases. This work combines the simplicity of access sets with the generality of vector clocks by grouping heap accesses into access sets and attaching a vector clock to each grouping. By combining the two approaches, access sets can be used more generally to support programs that contain mutual exclusion. Additionally, entire blocks can be ordered with respect to each other rather than individual accesses, producing a much more efficient algorithm for data race detection. This novel algorithm, ZipperOTF, is compared with the Computation Graph algorithm (an access-set algorithm) and with FastTrack (a vector-clock algorithm), both in empirical results and in time and space complexity.
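The happens-before reasoning referred to above can be illustrated with a minimal vector-clock sketch: two accesses race when neither clock is ordered before the other. This is a generic illustration of the vector-clock comparison used by detectors such as FastTrack, not the ZipperOTF implementation; the struct and function names are invented for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Minimal vector clock: one logical-time entry per thread.
// Two events are ordered (happens-before) iff one clock is
// component-wise <= the other; otherwise they are concurrent.
struct VectorClock {
    std::vector<unsigned> t;

    explicit VectorClock(std::size_t nthreads) : t(nthreads, 0) {}

    void tick(std::size_t tid) { ++t[tid]; }        // local step on thread tid

    void join(const VectorClock& other) {           // e.g. at a task join or lock acquire
        for (std::size_t i = 0; i < t.size(); ++i)
            t[i] = std::max(t[i], other.t[i]);
    }

    bool happensBefore(const VectorClock& other) const {
        for (std::size_t i = 0; i < t.size(); ++i)
            if (t[i] > other.t[i]) return false;
        return true;
    }
};

// Two accesses to the same location (at least one a write) race
// if neither clock is ordered before the other.
bool isRace(const VectorClock& a, const VectorClock& b) {
    return !a.happensBefore(b) && !b.happensBefore(a);
}
```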
6

Master/worker parallel discrete event simulation

Park, Alfred John 16 December 2008 (has links)
The execution of parallel discrete event simulation across metacomputing infrastructures is examined. A master/worker architecture for parallel discrete event simulation is proposed that provides robust execution under a dynamic set of services, with system-level support for fault tolerance, semi-automated client-directed load balancing, portability across heterogeneous machines, and the ability to run codes on idle or time-sharing clients without significant user interaction. Research questions and challenges associated with the limitations of the work-distribution paradigm, the targeted computational domain, performance metrics, and the intended class of applications are analyzed and discussed. A portable web-services approach to master/worker parallel discrete event simulation is proposed and evaluated, with subsequent optimizations that increase the efficiency of large-scale simulation execution through distributed master-service design and intrinsic overhead reduction. New techniques are proposed and examined for addressing the challenges of optimistic parallel discrete event simulation across metacomputing infrastructures, such as rollbacks and message un-sending, using an inherently different computation paradigm built on master services and time windows. Results indicate that a master/worker approach utilizing loosely coupled resources is a viable means of high-throughput parallel discrete event simulation, either by enhancing existing computational capacity or by providing alternate execution capability for less time-critical codes.
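As a rough picture of the master/worker organization described above, the sketch below shows workers pulling time-window-bounded work units from a master and reporting results back. It is only a schematic under invented names (WorkUnit, fetch_work, submit_result), not the author's web-services implementation.

```cpp
#include <cstdio>
#include <optional>
#include <queue>
#include <vector>

// Hypothetical work unit: a block of events for one logical process (LP),
// bounded by a simulation time window so optimistic rollbacks stay local.
struct WorkUnit {
    int lp_id;
    double window_start;
    double window_end;
};

// The master keeps a queue of pending work units and collects results.
class Master {
public:
    explicit Master(std::vector<WorkUnit> work) {
        for (const auto& w : work) pending_.push(w);
    }
    std::optional<WorkUnit> fetch_work() {              // worker pulls work
        if (pending_.empty()) return std::nullopt;
        WorkUnit w = pending_.front();
        pending_.pop();
        return w;
    }
    void submit_result(int lp_id, double advanced_to) { // worker pushes result
        std::printf("LP %d advanced to t=%.1f\n", lp_id, advanced_to);
    }
private:
    std::queue<WorkUnit> pending_;
};

// A worker repeatedly pulls a unit, simulates its time window, reports back.
void worker_loop(Master& master) {
    while (auto w = master.fetch_work()) {
        double t = w->window_start;
        // ... process events for LP w->lp_id until the window end ...
        t = w->window_end;
        master.submit_result(w->lp_id, t);
    }
}

int main() {
    Master master({{0, 0.0, 10.0}, {1, 0.0, 10.0}, {2, 0.0, 10.0}});
    worker_loop(master);   // a single worker here; real deployments run many clients
    return 0;
}
```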
7

Improving message logging protocols towards extreme-scale HPC systems / Amélioration des protocoles de journalisation des messages vers des systèmes HPC extrême-échelle

Martsinkevich, Tatiana V. 22 September 2015 (has links)
Existing petascale machines have a mean time between failures (MTBF) on the order of several hours, and the MTBF is predicted to decrease in future systems. Therefore, applications that will run on these systems need to be able to tolerate frequent failures. Currently, the most common way to do this is to use a global application checkpoint/restart scheme: if a process fails, the whole application rolls back to its last checkpointed state and re-executes from that point. This solution will become infeasible at large scale due to its energy costs and inefficient resource usage, so fine-grained failure containment is a strongly required feature for fault tolerance techniques that target large-scale executions. In the context of message-passing MPI applications, message logging fault tolerance protocols provide good failure containment, as they require the restart of only one process or, in some cases, a bounded number of processes. However, existing logging protocols suffer from a number of issues that prevent their use at large scale. In particular, they tend to have high failure-free overhead, because they usually need to reliably store every nondeterministic event that happens during the execution of a process in order to correctly restore its state during recovery. Furthermore, as message logs are usually stored in volatile memory, logging may incur a large memory footprint, especially in communication-intensive applications. This is particularly important because future exascale systems are expected to have less memory available per core. Another important trend in HPC is the switch from MPI-only applications to hybrid programming models such as MPI+threads and MPI+tasks, in response to the increasing number of cores per node. This opens opportunities for fault tolerance solutions that handle faults at the level of threads or tasks, which provides even better failure containment than message logging protocols that handle failures at the level of processes. The work in this dissertation therefore consists of three parts. First, we present a hierarchical log-based fault tolerance solution, called Scalable Pattern-Based Checkpointing (SPBC), for mitigating fail-stop process failures. The protocol leverages a new deterministic model called channel-determinism and a new always-happens-before relation for partially ordering events in the application. The protocol is scalable, has low overhead in failure-free execution, does not require logging any events, provides perfect failure containment, and has a fully distributed recovery. Second, to address the memory limitation problem on compute nodes, we propose to use additional dedicated resources, called logger nodes. All logs that do not fit in the memory of the compute nodes are sent to the logger nodes and kept in their memory. In a series of experiments we show that not only is this approach feasible, but, combined with a hierarchical logging scheme like SPBC, logger nodes can be an ultimate solution to the problem of memory limitation for logging protocols.
Third, we present a log-based fault tolerance protocol for hybrid applications adopting the MPI+tasks programming model. The protocol is used to tolerate detected uncorrected errors (DUEs) that occur during the execution of a task. Normally, a DUE causes the system to raise an exception, which leads to an application crash, and the application then has to restart from a checkpoint. In the proposed solution, we combine task checkpointing with message logging in order to support task re-execution. Such task-level failure containment can be beneficial in large-scale executions because it avoids the more expensive process-level restart. We demonstrate the advantages of this protocol using hybrid MPI+OmpSs applications as an example.
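The sender-side message logging idea underlying such protocols can be sketched as a thin wrapper around the MPI send path that copies each outgoing payload into a local, in-memory log before the real send. This is a generic illustration under assumed names (logged_send, g_log), not the SPBC protocol or the logger-node design described in the thesis.

```cpp
#include <cstring>
#include <vector>
#include <mpi.h>

// One logged message: a copy of the payload plus enough envelope
// information to replay the send if the receiver is restarted.
struct LoggedMessage {
    std::vector<char> payload;
    int dest;
    int tag;
};

static std::vector<LoggedMessage> g_log;   // volatile, in-memory sender log

// Hypothetical wrapper (assumes MPI_Init has been called): copy the outgoing
// buffer into the local log, then perform the real send. On a receiver
// failure, the entries in g_log can be replayed instead of rolling back
// the sender as a whole-application checkpoint/restart would.
int logged_send(const void* buf, int count, MPI_Datatype type,
                int dest, int tag, MPI_Comm comm) {
    int type_size = 0;
    MPI_Type_size(type, &type_size);

    LoggedMessage m;
    m.payload.resize(static_cast<std::size_t>(count) * type_size);
    std::memcpy(m.payload.data(), buf, m.payload.size());
    m.dest = dest;
    m.tag = tag;
    g_log.push_back(std::move(m));

    return MPI_Send(buf, count, type, dest, tag, comm);
}
```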
