101 |
Proceedings of the 4th Many-core Applications Research Community (MARC) Symposium
January 2012
In continuation of a successful series of events, the 4th Many-core Applications Research Community (MARC) symposium took place at the HPI in Potsdam on December 8th and 9th 2011. Over 60 researchers from different fields presented their work on many-core hardware architectures, their programming models, and the resulting research questions for the upcoming generation of heterogeneous parallel systems.
|
102 |
Secure and high-performance big-data systems in the cloud
Tang, Yuzhe 21 September 2015
Cloud computing and big data technology continue to revolutionize how computing and data analysis are delivered today and in the future. To store and process fast-changing big data, various scalable systems (e.g. key-value stores and MapReduce) have recently emerged in industry. However, there is a huge gap between what these open-source software systems can offer and what real-world applications demand. First, scalable key-value stores are designed for simple data access methods, which limits their use in advanced database applications. Second, existing systems in the cloud need automatic performance optimization for better resource management with minimized operational overhead. Third, the demand continues to grow for privacy-preserving search and information sharing between autonomous data providers, as exemplified by healthcare information networks.
My Ph.D. research aims at bridging these gaps.
First, I proposed HINDEX, for secondary index support on top of write-optimized key-value stores (e.g. HBase and Cassandra). To update the index structure efficiently in the face of an intensive write stream, HINDEX synchronously executes append-only operations and defers the so-called index-repair operations which are expensive. The core contribution of HINDEX is a scheduling framework for deferred and lightweight execution of index repairs. HINDEX has been implemented and is currently being transferred to an IBM big data product.
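The append-only write path and the deferred repair described above can be illustrated with a toy sketch. It is not HINDEX's implementation: the class `DeferredIndexStore`, its in-memory dictionaries, and the explicit `repair()` call are all hypothetical stand-ins, and the scheduling framework that decides when repairs run is not modeled.

```python
class DeferredIndexStore:
    """Toy key-value store with an append-only secondary index (value -> keys)."""

    def __init__(self):
        self.base = {}       # primary data: key -> current value
        self.index_log = []  # append-only index entries: (value, key)

    def put(self, key, value):
        # Write path: append only. The old value is never read, so an
        # index entry for an overwritten value is left behind as stale.
        self.base[key] = value
        self.index_log.append((value, key))

    def lookup(self, value):
        # Read path: ignore stale entries by checking the base table.
        return sorted({k for v, k in self.index_log
                       if v == value and self.base.get(k) == value})

    def repair(self):
        # Deferred repair: drop entries that no longer match the base
        # table and return how many were removed.
        live = [(v, k) for v, k in self.index_log if self.base.get(k) == v]
        removed = len(self.index_log) - len(live)
        self.index_log = live
        return removed
```

In this sketch a write never pays for a read of the old value; the cost of cleaning up is moved to `repair()`, which can be run when the system is less busy.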
Second, I proposed Auto-pipelining for automatic performance optimization of streaming applications on multi-core machines. The goal is to prevent the bottleneck scenario in which the streaming system is blocked by a single core while all other cores are idling, which wastes resources. To partition the streaming workload evenly across all the cores and to search for the best partitioning among many possibilities, I proposed a heuristic-based search strategy that achieves locally optimal partitioning with lightweight search overhead. The key idea is to use a white-box approach to search for the theoretically best partitioning and then use a black-box approach to verify the effectiveness of such partitioning. The proposed technique, called Auto-pipelining, is implemented on IBM Stream S.
Third, I proposed ε-PPI, a suite of privacy-preserving index algorithms that allow data sharing among unknown parties while maintaining a desired level of data privacy. To differentiate the privacy concerns of different persons, I proposed a personalized privacy definition and substantiated this new privacy requirement by injecting false positives into the published ε-PPI data. To construct the ε-PPI securely and efficiently, I proposed to optimize the performance of multi-party computations, which are otherwise expensive; the key idea is to use an additively homomorphic secret-sharing mechanism, which is inexpensive, and to perform the distributed computation in a scalable P2P overlay.
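The additive homomorphism mentioned above can be shown with a generic textbook sketch of additive secret sharing. This is not the thesis's protocol: the modulus `MODULUS` and the function names are assumptions chosen for illustration, and the P2P overlay is not modeled.

```python
import secrets

MODULUS = 2**61 - 1  # assumed public modulus, chosen only for illustration


def share(secret, n_parties, modulus=MODULUS):
    """Split `secret` into n_parties random shares that sum to it mod `modulus`."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % modulus)
    return shares


def reconstruct(shares, modulus=MODULUS):
    """Recover the secret by summing all shares."""
    return sum(shares) % modulus


def add_shared(shares_a, shares_b, modulus=MODULUS):
    """Each party adds its two local shares; no party sees either secret."""
    return [(a + b) % modulus for a, b in zip(shares_a, shares_b)]
```

Because addition works share by share, each party only performs one local modular addition, which is why this kind of scheme is cheap compared with general-purpose multi-party computation.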
|
103 |
Electromagnetic Modeling of Multi-Dimensional Scale Problems: Nanoscale Solar Materials, RF Electronics, Wearable Antennas
Yoo, Sungjong January 2014
Full-wave electromagnetic modeling and simulation tools allow accurate performance predictions for unique RF structures that exhibit multi-dimensional scales. As RF technology develops, full-wave simulation tools need to cover a broad frequency range that includes the RF and terahertz bands. In this dissertation, three structures with multi-dimensional scales and different operating frequency ranges are modeled and simulated. The first structure involves nanostructured solar cells. Silicon solar cell design, which covers the terahertz frequency range, is of research interest for economic and environmental reasons. Two unique solar cell surfaces, nanowire and branched nanowire, are modeled and simulated. The nanowire surface is modeled with two full-wave simulators, and the results match the reference results well. This dissertation compares and contrasts the simulators and their suitability for extensive simulation studies. Nanostructured Si cells have both large and small dimensional scales, and the material characteristics of Si change rapidly over the solar spectrum. The second structure is a reconfigurable four-element antenna array operating at 60 GHz for wireless communications between computing cores in high-performance computing systems. The array is reconfigurable, provides improved transmission gain between cores, and can be used to create a more failure-resilient computing system. The on-chip antenna array involves modeling a specially designed ground plane that acts as an artificial magnetic conductor. The work involves modeling antennas in a complex computing environment. The third structure is a unique collar-integrated zig-zag antenna that operates at 154.5 MHz for use as a ground link in a GPS-based location system for wildlife tracking. In this problem, an intricate antenna is modeled in the proximity of an animal. Besides placing a low-frequency antenna in a constricted area (the collar), the antenna performance near the large animal body must also be considered. Each of these applications requires special modeling details to take into account the various dimensional scales of the structures and their interaction with complex media. An analysis of the challenges and limits of each specific problem is presented.
|
104 |
Extension of the SkePU Skeleton Programming Framework for Multi-core CPU and Multi-GPU Systems for MPI-based Clusters
Mangaraj, Swadhin K January 2013
SkePU (Skeleton Programming Framework for Multi-core CPU and Multi-GPU Systems) is a parallel computing framework developed by Johan Enmyren and Christoph Kessler at Linköpings Universitet. This C++ template library provides a simple and unified interface for specifying data-parallel computations with the help of skeletons and targets multiple backends, e.g. a sequential CPU, parallel CPUs using MPI and OpenMP, or GPUs using CUDA and OpenCL. SkePU comprises seven data-parallel skeletons and one task-parallel skeleton, and these skeletons use two types of containers, vector and matrix, to model real-life parallel applications. In this thesis, we extend the matrix container (which stores 2-D data values) of the SkePU framework so that it can efficiently use the existing skeletons to develop parallel scientific applications on large-scale clusters using MPI. The work focuses on distributing the matrix among the participating processes, which, after receiving their share of the data, can execute the application in parallel. This work covers all seven data-parallel skeletons. Each skeleton has been tested with a small application program. In addition to measuring the performance improvement from the application program's execution time, we have also carried out a communication cost analysis for all skeletons with MPI using the LogGP model. To evaluate and test the operational efficiency of the extension, we have considered a PDE solver application. Through this application, we have demonstrated the performance gain and scalability of the extended framework. The performance improvement was greater when the computational load dominated the memory I/O operations. The results show that the extension can serve as a viable approach for implementing real-life parallel applications on large-scale clusters.
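As a rough illustration of distributing a matrix among processes and of a LogGP-style cost estimate, the sketch below splits matrix rows into contiguous blocks and evaluates the standard LogGP time for one point-to-point message. The block-row scheme, the function names, and any parameter values are assumptions for illustration; the abstract does not say which distribution the extension actually uses, and no MPI calls are made here.

```python
def block_row_ranges(n_rows, n_procs):
    """Split n_rows into n_procs contiguous [start, end) ranges
    whose sizes differ by at most one row."""
    base, extra = divmod(n_rows, n_procs)
    ranges = []
    start = 0
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def loggp_message_time(k_bytes, L, o, G):
    """LogGP estimate for one k-byte message: sender overhead o,
    (k - 1) * G for the remaining bytes, latency L, receiver overhead o."""
    return o + (k_bytes - 1) * G + L + o
```

For example, `block_row_ranges(10, 4)` gives the first two ranks three rows each and the last two ranks two rows each, and the per-rank message size fed to `loggp_message_time` would follow from those row counts.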
|
105 |
Models for Parallel Computation in Multi-Core, Heterogeneous, and Ultra Wide-Word Architectures
Salinger, Alejandro January 2013
Multi-core processors have become the dominant processor architecture, with 2, 4, and 8 cores on a chip being widely available and an increasing number of cores predicted for the future. In addition, the decreasing costs and increasing programmability of Graphics Processing Units (GPUs) have made these an accessible source of parallel processing power in general-purpose computing. Among the many research challenges that this scenario has raised are the fundamental problems related to theoretical modeling of computation in these architectures. In this thesis we study several aspects of computation in modern parallel architectures, from modeling of computation in multi-cores and heterogeneous platforms, to multi-core cache management strategies, through the proposal of an architecture that exploits bit-parallelism on thousands of bits.
Observing that in practice multi-cores have a small number of cores, we propose a model for low-degree parallelism for these architectures. We argue that assuming a small number of processors (logarithmic in a problem's input size) simplifies the design of parallel algorithms. We show that in this model a large class of divide-and-conquer and dynamic programming algorithms can be parallelized with simple modifications to sequential programs, while achieving optimal parallel speedups. We further explore low-degree-parallelism in computation, providing evidence of fundamental differences in practice and theory between systems with a sublinear and linear number of processors, and suggesting a sharp theoretical gap between the classes of problems that are efficiently parallelizable in each case.
Efficient strategies to manage shared caches play a crucial role in multi-core performance. We propose a model for paging in multi-core shared caches, which extends classical paging to a setting in which several threads share the cache. We show that in this setting traditional cache management policies perform poorly, and that any effective strategy must partition the cache among threads, with a partition that adapts dynamically to the demands of each thread. Inspired by the shared cache setting, we introduce the minimum cache usage problem, an extension to classical sequential paging in which algorithms must account for the amount of cache they use. This cache-aware model seeks algorithms with good performance in terms of faults and the amount of cache used, and has applications in energy-efficient caching and in shared cache scenarios.
The wide availability of GPUs has added to the parallel power of multi-cores; however, most applications underutilize the available resources. We propose a model for hybrid computation in heterogeneous systems with multi-cores and GPUs, and describe strategies for generic parallelization and efficient scheduling of a large class of divide-and-conquer algorithms.
Lastly, we introduce the Ultra-Wide Word architecture and model, an extension of the word-RAM model, that allows for constant time operations on thousands of bits in parallel. We show that a large class of existing algorithms can be implemented in the Ultra-Wide Word model, achieving speedups comparable to those of multi-threaded computations, while avoiding the more difficult aspects of parallel programming.
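The idea of operating on thousands of bits at once can be emulated with Python's arbitrary-precision integers. The sketch below packs many small fields into one wide integer and adds them all with a single addition. It only illustrates bit-parallelism in general: the field width `FIELD_BITS` and the function names are assumptions, the model's actual instruction set is not reproduced, and a Python big-integer addition is not a constant-time operation.

```python
FIELD_BITS = 16  # assumed width of each packed field


def pack(values, field_bits=FIELD_BITS):
    """Pack non-negative integers into one wide word, one field each."""
    word = 0
    for i, v in enumerate(values):
        if not 0 <= v < (1 << field_bits):
            raise ValueError("value does not fit in a field")
        word |= v << (i * field_bits)
    return word


def unpack(word, count, field_bits=FIELD_BITS):
    """Extract `count` fields from a wide word."""
    mask = (1 << field_bits) - 1
    return [(word >> (i * field_bits)) & mask for i in range(count)]


def parallel_add(word_a, word_b):
    """Add all fields with one wide addition. Correct only while every
    field-wise sum still fits in FIELD_BITS, so no carry crosses fields."""
    return word_a + word_b
```

With 256 fields of 16 bits each, a single call to `parallel_add` operates on a 4096-bit word, which is the scale of word the abstract refers to.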
|
106 |
Support matériel pour la communication inter-processus dans un système multi-coeur / Hardware support for inter-process communication in multiprocessor system
France Pillois, Maxime 27 September 2018
The high degree of parallelism in MPSoC applications increases the need to optimize synchronization mechanisms, which are essential for safe data exchange between processes. The delays these mechanisms introduce affect overall system performance. This thesis analyzes and then reduces the delays of synchronization mechanisms on MPSoC architectures.
The growing complexity of MPSoCs requires that the targeted mechanisms be studied precisely in a realistic environment that exposes hardware and software specifics. Because the available measurement tools do not combine the required accuracy with sufficient analysis speed, we designed our own non-intrusive measurement tool-chain based on an emulation platform.
Applied to the GNU implementation of the synchronization barrier provided by the OpenMP library, our tool-chain revealed two implementation weaknesses, which led to hardware and software optimizations that significantly reduce the delays of this mechanism.
The tool-chain also allowed us to confirm a hypothesis that is fundamental to optimizing the lock mechanism: although a lock may be used by several cores belonging to different clusters during an application run, it is very often re-acquired by the last core that released it. Based on this observation, we propose an innovative, fully decentralized solution that dynamically re-homes locks in the memory close to the core that was last granted access, thus reducing access latency and network traffic when a lock is reused by the same cluster.
|
107 |
Design and Implementation of Multi-core Support for an Embedded Real-time Operating System for Space Applications
Zhang, Wei January 2015
Nowadays, multi-core processors are widely used in embedded applications because they offer higher performance and lower power consumption. However, the complexity of multi-core architectures makes it considerably challenging to extend a single-core real-time operating system to support a multi-core platform. This thesis documents the design and implementation of a multi-core version of RODOS, an embedded real-time operating system developed by the German Aerospace Center and the University of Würzburg, on a dual-core platform. Two possible models are proposed: Symmetric Multiprocessing and Asymmetric Multiprocessing. To prevent collisions during the initialization of global components, a new multi-core boot loader is created so that each core boots up in a proper manner. A working version of multi-core RODOS is implemented that can run tasks on a multi-core platform. Several test cases are applied, and they verify that the multi-core version of RODOS shows a performance improvement of around 180% compared with the same tasks running on the original RODOS. Deadlock-free communication and synchronization APIs are provided to let parallel applications share data and messages in a safe manner.
|
108 |
Fast Viterbi Decoder Algorithms for Multi-Core System
Ju, Zilong January 2012
In this thesis, fast Viterbi decoder algorithms for a multi-core system are studied. New parallel Viterbi algorithms for decoding convolutional codes are proposed based on tail-biting trellises. The performance of the new algorithms is first evaluated in MATLAB and then in Eagle (E-UTRA algorithms for LTE) link-level simulations, where the optimal parameter settings are obtained from various simulations. One of the algorithms is proposed for implementation in the product because of its good BLER performance and low implementation complexity. The new parallel algorithm is then implemented on target DSPs for an Ericsson internal multi-core system to decode the PUSCH (Physical Uplink Shared Channel) CQI (Channel Quality Indicator) in LTE (Long Term Evolution). The performance of the new algorithm in the real multi-core system is compared against the current implementation in terms of both cycle and memory consumption. As a fast decoder, the proposed parallel Viterbi decoder is computationally efficient: it significantly reduces the decoding latency and solves memory limitation problems on the DSP.
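For readers unfamiliar with the baseline, the following is a minimal serial, hard-decision Viterbi decoder. It is a textbook sketch, not one of the thesis's parallel tail-biting algorithms: the rate-1/2, constraint-length-3 code with generators (7, 5) octal is an illustrative choice, the trellis is assumed to start in state 0, and all names are hypothetical.

```python
G = (0b111, 0b101)  # assumed generator polynomials (7, 5 octal)
N_STATES = 4        # 2 memory bits -> 4 trellis states


def parity(x):
    return bin(x).count("1") & 1


def step(state, bit):
    """Shift `bit` into the register; return (next_state, output pair)."""
    reg = (bit << 2) | state
    return reg >> 1, tuple(parity(reg & g) for g in G)


def encode(bits):
    state, out = 0, []
    for b in bits:
        state, sym = step(state, b)
        out.extend(sym)
    return out


def viterbi_decode(received):
    """Return the input bits whose codeword is closest (Hamming) to `received`."""
    inf = float("inf")
    metrics = [0] + [inf] * (N_STATES - 1)  # encoder starts in state 0
    paths = [[] for _ in range(N_STATES)]
    for i in range(0, len(received), 2):
        sym = received[i:i + 2]
        new_metrics = [inf] * N_STATES
        new_paths = [None] * N_STATES
        for state in range(N_STATES):
            if metrics[state] == inf:
                continue  # state not reachable yet
            for bit in (0, 1):
                nxt, out = step(state, bit)
                m = metrics[state] + sum(a != b for a, b in zip(out, sym))
                if m < new_metrics[nxt]:  # keep the survivor path per state
                    new_metrics[nxt] = m
                    new_paths[nxt] = paths[state] + [bit]
        metrics, paths = new_metrics, new_paths
    best = min(range(N_STATES), key=lambda s: metrics[s])
    return paths[best]
```

A tail-biting decoder differs in that the start and end states are equal but unknown, and a parallel decoder splits the trellis into segments; neither refinement is shown here.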
|
109 |
Multicore Optimized Real-Time Protocol for Power Control Networks
Naveed, Muhammad January 2012
Technology today is changing at a fast pace. The growth of computers and telecommunications over the past three decades has been extraordinary. We are now at the point where all technologies related to communication and data transfer are merging onto a common platform. A number of different methods are available for data communication or data transfer. The important factor in all communication setups is to satisfy user demands at low cost and with high reliability. The area of interest for this thesis is future energy substations and wind mills. To make things more straightforward and to explore the available options and capabilities, the focus is on designing and implementing a new energy protocol called Energy Real Time Protocol (eRTP), based on Iyad Real Time Protocol (iRTP) [2]. The protocol is designed to meet the requirements of power and energy networks in terms of sending energy parameters, optionally together with VoIP data, among power stations at different locations. Keeping in mind the importance of transferring energy parameters in real time, the presented protocol is built upon small individual algorithms/modules designed for multi-core architectures. Each module is intended to be processed in parallel by an individual core/processor.
|
110 |
Maîtrise de la couche hyperviseur sur les architectures multi-coeurs COTS dans un contexte avionique / Hypervisor control of COTS multi-cores processors in order to enforce determinism for future avionics equipment
Jean, Xavier 18 June 2015
This thesis focuses on mastering COTS multi-core processors so that they can be used in future avionics equipment, which has hard real-time requirements. We aim to enable existing Worst Case Execution Time (WCET) evaluation methods to be applied to a set of tasks representative of avionics applications. At runtime, tasks executing on different cores are likely to access shared hardware resources, in particular the main memory, at the same time. This contention can delay some accesses; such delays are called "interferences". Interferences can significantly slow down embedded software. Moreover, on a COTS processor, which is bought off the shelf and targets a larger market than avionics, their impact on WCET is not bounded. We seek to guarantee the absence of interferences through software means, since COTS processors do not provide adequate mechanisms at the hardware level. We extend deterministic software concepts from the state of the art to make them compatible with the reuse of legacy software. To this end, we introduce the concept of "control software", a functionally neutral element that is replicated on all cores and actively controls the times at which cores access shared resources, so that concurrent accesses are temporally isolated. We formalize and study the feasibility of control software on COTS processors and its efficiency with regard to legacy avionics software.
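One simple way to picture temporal isolation of accesses is a fixed time-slot schedule in which each core may touch the shared resource only during its own slot. The sketch below is an assumption made for illustration: the abstract does not say that the control software uses fixed slots, and the function name and parameters are hypothetical. It also ignores accesses that would run past the end of a slot.

```python
def next_access_time(core, now, n_cores, slot_len):
    """Earliest time >= now at which `core` is inside its own slot,
    assuming core i owns [i * slot_len, (i + 1) * slot_len) in each period."""
    period = n_cores * slot_len
    offset = now % period
    slot_start = core * slot_len
    if slot_start <= offset < slot_start + slot_len:
        return now  # already inside this core's slot
    # Otherwise wait until the slot starts, wrapping into the next period.
    return now + (slot_start - offset) % period
```

Under this assumed schedule the wait is never longer than `(n_cores - 1) * slot_len`, which is the kind of bound that WCET analysis needs and that uncontrolled contention does not give.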
|