91.
Transforming and Optimizing Irregular Applications for Parallel Architectures. Zhang, Jing. 12 February 2018.
Parallel architectures, including multi-core processors, many-core processors, and multi-node systems, have become commonplace, as it is no longer feasible to improve single-core performance by increasing its operating clock frequency. Furthermore, to keep up with exponentially growing demand for computational power, the number of cores/nodes in parallel architectures has continued to increase dramatically. On the other hand, many applications in well-established and emerging fields, such as bioinformatics, social network analysis, and graph processing, exhibit increasing irregularities in memory access, control flow, and communication patterns. While multiple techniques have been introduced into modern parallel architectures to tolerate these irregularities, many irregular applications still execute poorly on current parallel architectures, as their irregularities exceed the capabilities of these techniques. It is therefore critical to resolve irregularities in applications for parallel architectures. However, this is a very challenging task, as the irregularities are dynamic, and hence unknown until runtime.
To optimize irregular applications, many approaches have been proposed to improve data locality and reduce irregularities through computational and data transformations. However, there are two major drawbacks in these existing approaches that prevent them from achieving optimal performance. First, these approaches use local optimizations that exploit data locality and regularity locally within a loop or kernel. However, in many applications, there is hidden locality across loops or kernels. Second, these approaches use "one-size-fits-all" methods that treat all irregular patterns equally and resolve them with a single method. However, many irregular applications have complex irregularities, which are mixtures of different types of irregularities and need differentiated optimizations. To overcome these two drawbacks, we propose a general methodology that includes a taxonomy of irregularities to help us analyze the irregular patterns in an application, and a set of adaptive transformations to reorder data and computation based on the characteristics of the application and architecture.
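As a concrete, heavily simplified illustration of the data-reordering idea, consider a sparse-matrix workload in CSR form: sorting row indices by row length groups similarly sized work units together, which is one of the simplest reorderings an adaptive scheme might select. The CSR layout and names below are assumptions for illustration, not the thesis's code.

```c
/* Minimal sketch, assuming a CSR sparse matrix: reorder rows by length so
 * that adjacent threads receive similar amounts of work. This illustrates
 * only one data reordering; an adaptive scheme would choose among several
 * based on the application's and architecture's characteristics. */
#include <stdlib.h>

static const int *row_ptr_g; /* CSR row pointers, shared with the comparator */

static int by_row_len(const void *a, const void *b) {
    int ra = *(const int *)a, rb = *(const int *)b;
    int la = row_ptr_g[ra + 1] - row_ptr_g[ra];
    int lb = row_ptr_g[rb + 1] - row_ptr_g[rb];
    return la - lb;
}

/* Fills perm[0..nrows-1] with row indices ordered by ascending row length. */
void reorder_rows_by_length(const int *row_ptr, int nrows, int *perm) {
    for (int i = 0; i < nrows; i++) perm[i] = i;
    row_ptr_g = row_ptr;
    qsort(perm, nrows, sizeof(int), by_row_len);
}
```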
By extending our adaptive data-reordering transformation on a single node, we propose a data-partitioning framework to resolve the load-imbalance problem of irregular applications on multi-node systems. Unlike existing frameworks, which use "one-size-fits-all" methods to partition the input data by a single property, our framework provides a set of operations to transform the input data by multiple properties and generates the desired data-partitioning codes by composing these operations into a workflow.

Ph. D.

Irregular applications, which present unpredictable and irregular patterns of data accesses and computation, are increasingly important in well-established and emerging fields, such as biological data analysis, social network analysis, and machine learning, for dealing with large datasets. At the same time, current parallel processors, such as multi-core CPUs (central processing units), GPUs (graphics processing units), and computer clusters (i.e., groups of connected computers), are designed for regular applications and execute irregular applications poorly. It is therefore critical to optimize irregular applications for parallel processors. This is a very challenging task, however, as the irregular patterns are dynamic, and hence unknown until application execution. To overcome this challenge, we propose a general methodology that includes a taxonomy of irregularities to help analyze the irregular patterns in an application, and a set of adaptive transformations to reorder data and computation, exposing hidden regularities based on the characteristics of the application and processor. We apply our methodology to several important and complex irregular applications as case studies to demonstrate that it is effective and efficient.
92.
Multi-Core Fiber and Optical Supersymmetry: Theory and Applications. Macho Ortiz, Andrés. 02 September 2019.
To date, communication networks based on optical fibers are rapidly approaching their capacity limit, a direct consequence of the growth in data traffic over the last decade driven by the ubiquity of smartphones, tablets, social networks, cloud computing, streaming services including video and gaming, and machine-to-machine communications. In this scenario, a new class of optical fiber, the multi-core fiber (MCF), which integrates the capacity of several classical optical fibers in approximately the same transverse section as a single one, has recently been proposed to overcome the capacity limits of current optical networks.
However, exploiting the full potential of an MCF requires a deep understanding of the electromagnetic phenomena that arise when guiding light in this optical medium. In this vein, in the first part of this thesis we analyze these phenomena theoretically and then study the suitability of MCF technology in optical transport networks using radio-over-fiber transmissions. These findings could be of great utility for 5G and Beyond-5G cellular technology in the coming decades.
In addition, the close connection between the mathematical frameworks of quantum mechanics and electromagnetism offers a great opportunity to explore ground-breaking design strategies for these new fibers that expand their basic functionalities. Revolving around this idea, in the second part of this thesis we propose to design a new class of MCFs using the mathematics of supersymmetry (SUSY), which emerged within the context of string and quantum field theory as a means to unify the basic interactions of nature (the strong, electroweak, and gravitational interactions). Interestingly, a supersymmetric MCF allows us not only to propagate light, but also to process the information of users during propagation, reducing the complexity of user data processing at reception.
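For readers unfamiliar with the underlying mathematics, the standard factorization from non-relativistic SUSY quantum mechanics (textbook background, not a result of this thesis) shows how a superpartner potential arises; in photonics the potential corresponds to the transverse refractive-index profile of the guiding structure.

```latex
% Textbook SUSY-QM factorization, quoted as context (not a thesis result).
% With superpotential W(x), define A = d/dx + W(x) and A^\dagger = -d/dx + W(x).
% The partner Hamiltonians are then
H_{1} = A^{\dagger}A = -\frac{d^{2}}{dx^{2}} + W^{2}(x) - W'(x), \qquad
H_{2} = A A^{\dagger} = -\frac{d^{2}}{dx^{2}} + W^{2}(x) + W'(x).
% H_1 and H_2 share all eigenvalues except (at most) the ground state of H_1,
% so a waveguide whose index profile realizes the partner potential guides
% modes with the same propagation constants, minus the fundamental one.
```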
Finally, we conclude this thesis by introducing a paradigm shift that allows us to design disruptive optical devices. We demonstrate that the basic ideas of SUSY in non-relativistic quantum mechanics, originally proposed as transformations restricted to the space domain to clarify unsolved questions about SUSY in string and quantum field theory, can also be extended to the time domain, at least within the framework of photonics. In this way, we show that temporal supersymmetry may serve as a key tool for judiciously designing versatile and ultra-compact optical devices, enabling a promising new platform for integrated photonics.
For the sake of completeness, we indicate how to extrapolate the main results of this thesis to other fields of physics, such as acoustics and quantum mechanics.

Macho Ortiz, A. (2019). Multi-Core Fiber and Optical Supersymmetry: Theory and Applications [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/124964
93.
Multi-Core Memory System Design: Developing and using Analytical Models for Performance Evaluation and Enhancements. Dwarakanath, Nagendra Gulur. January 2015.
Memory system design is increasingly influencing modern multi-core architectures from both performance and power perspectives. Both main memory latency and bandwidth have improved at a rate that is slower than the increase in processor core count and speed. Off-chip memory, primarily built from DRAM, has received significant attention in terms of architecture and design for higher performance. These performance improvement techniques include sophisticated memory access scheduling, use of multiple memory controllers, mitigating the impact of DRAM refresh cycles, and so on. At the same time, new non-volatile memory technologies have become increasingly viable in terms of performance and energy. These alternative technologies offer different performance characteristics as compared to traditional DRAM.
With the advent of 3D stacking, on-chip memory in the form of 3D stacked DRAM has opened up avenues for addressing the bandwidth and latency limitations of off-chip memory. Stacked DRAM is expected to offer abundant capacity — 100s of MBs to a few GBs — at higher bandwidth and lower latency. Researchers have proposed to use this capacity as an extension to main memory, or as a large last-level DRAM cache. When leveraged as a cache, stacked DRAM provides opportunities and challenges for improving cache hit rate, access latency, and off-chip bandwidth.
Thus, designing off-chip and on-chip memory systems for multi-core architectures is complex, compounded by the myriad architectural, design and technological choices, combined with the characteristics of application workloads. Applications have inherent spatial locality and access parallelism that influence the memory system response in terms of latency and bandwidth.
In this thesis, we construct an analytical model of the off-chip main memory system to comprehend this diverse space and to study the impact of memory system parameters and workload characteristics on latency and bandwidth. Our model, called ANATOMY, uses a queuing network formulation of the memory system, parameterized with workload characteristics, to obtain a closed form solution for the average miss penalty experienced by the last-level cache. We validate the model across a wide variety of memory configurations on four-core, eight-core and sixteen-core architectures. ANATOMY is able to predict memory latency with average errors of 8.1%, 4.1% and 9.7% over quad-core, eight-core and sixteen-core configurations respectively. Further, ANATOMY identifies better-performing design points accurately, thereby allowing architects and designers to explore the more promising design points in greater detail. We demonstrate the extensibility and applicability of our model by exploring a variety of memory design choices, such as the impact of clock speed, the benefit of multiple memory controllers, and the role of banks and channel width. We also demonstrate ANATOMY's ability to capture architectural elements such as memory scheduling mechanisms and the impact of DRAM refresh cycles. In all of these studies, ANATOMY provides insight into sources of memory performance bottlenecks and is able to quantitatively predict the benefit of redressing them.
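The abstract does not reproduce ANATOMY's closed form, but the flavor of a queuing formulation can be sketched with a plain M/M/1 stand-in, where workload characteristics set the arrival rate and the memory configuration sets the service rate (all numbers hypothetical):

```c
/* Illustrative only: ANATOMY's actual closed form is not given in the
 * abstract. This sketches how a queuing formulation turns workload and
 * device parameters into an average memory latency estimate, using a
 * simple M/M/1 queue as a stand-in. */
#include <stdio.h>

/* arrival_rate: average memory requests per ns (workload-dependent)
 * service_rate: requests the DRAM system can complete per ns (design-dependent) */
static double avg_miss_penalty_ns(double arrival_rate, double service_rate) {
    if (arrival_rate >= service_rate) return -1.0;  /* queue is unstable */
    /* M/M/1 mean response time: W = 1 / (mu - lambda) */
    return 1.0 / (service_rate - arrival_rate);
}

int main(void) {
    /* hypothetical numbers: 0.02 req/ns offered load, 0.025 req/ns capacity */
    printf("avg miss penalty: %.1f ns\n", avg_miss_penalty_ns(0.02, 0.025));
    return 0;
}
```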
An insight from the model suggests that provisioning multiple small row-buffers in each DRAM bank achieves better performance than the traditional design of one large row-buffer per bank. Multiple row-buffers also enable new performance improvement opportunities, such as intra-bank parallelism between data transfers and row activations, and smart row-buffer allocation schemes based on workload demand. Our evaluation (using both the analytical model and detailed cycle-accurate simulation) shows that the proposed DRAM reorganization achieves significant speedup as well as energy reduction.
Next we examine the role of on-chip stacked DRAM caches in improving performance by reducing the load on off-chip main memory. We extend ANATOMY to cover DRAM caches. ANATOMY-Cache takes into account all the key parameters and design issues governing DRAM cache organization, namely where the cache metadata is stored and accessed, the role of cache block size and set associativity, and the impact of block size on row-buffer hit rate and off-chip bandwidth. Yet the model is kept simple and provides a closed form solution for the average miss penalty experienced by the last-level SRAM cache. ANATOMY-Cache is validated against detailed architecture simulations and shown to have latency estimation errors of 10.7% and 8.8% on average in quad-core and eight-core configurations respectively. An interesting insight from the model suggests that under high load, it is better to bypass the congested DRAM cache and leverage the available idle main memory bandwidth. We use this insight to propose a refresh reduction mechanism that virtually eliminates refresh overhead in DRAM caches. We implement a low-overhead hardware mechanism to record accesses to recent DRAM cache pages and refresh only these pages. Older cache pages are considered invalid and serviced from the (idle) main memory. This technique achieves an average refresh reduction of 90%, with resulting memory energy savings of 9% and an overall performance improvement of 3.7%.
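A minimal sketch of the refresh-reduction idea, with all sizes and interfaces assumed for illustration: a bitmap records which DRAM-cache pages were touched in the current window, and only those are refreshed.

```c
/* Assumption-laden sketch: track recently accessed DRAM-cache pages in a
 * bitmap and refresh only those; pages not touched within the window are
 * treated as invalid and served from main memory, as the abstract describes. */
#include <stdbool.h>
#include <string.h>

#define NUM_PAGES 4096

static unsigned char touched[NUM_PAGES / 8];  /* 1 bit per cache page */

static void record_access(unsigned page) { touched[page >> 3] |= 1u << (page & 7); }
static bool is_live(unsigned page)       { return touched[page >> 3] & (1u << (page & 7)); }

/* Called once per refresh window: refresh live pages, then reset the bitmap
 * so that pages must be re-touched to stay valid in the next window. */
static void refresh_window(void (*refresh_page)(unsigned)) {
    for (unsigned p = 0; p < NUM_PAGES; p++)
        if (is_live(p)) refresh_page(p);
    memset(touched, 0, sizeof touched);
}
```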
Finally, we propose a new DRAM cache organization that achieves higher cache hit rate, lower latency and lower off-chip bandwidth demand. Called the Bi-Modal Cache, our cache organization brings three independent improvements together: (i) it enables parallel tag and data accesses, (ii) it eliminates a large fraction of tag accesses entirely by use of a novel way locator and (iii) it improves cache space utilization by organizing the cache sets as a combination of some big blocks (512B) and some small blocks (64B). The Bi-Modal Cache reduces hit latency by use of the way locator and parallel tag and data accesses. It improves hit rate by leveraging the cache capacity efficiently – blocks with low spatial reuse are allocated in the cache at 64B granularity thereby reducing both wasted off-chip bandwidth as well as cache internal fragmentation. Increased cache hit rate leads to reduction in off-chip bandwidth demand. Through detailed simulations, we demonstrate that the Bi-Modal Cache achieves overall performance improvement of 10.8%, 13.8% and 14.0% in quad-core, eight-core and sixteen-core workloads respectively over an aggressive baseline.
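One possible way to picture a Bi-Modal set; the field names and way counts below are illustrative guesses, not the thesis's design:

```c
/* Hypothetical layout sketch of one Bi-Modal Cache set: a mix of big (512B)
 * and small (64B) blocks plus a way-locator structure that lets the
 * controller resolve the hit way without a serialized DRAM tag lookup. */
#include <stdint.h>
#include <stdbool.h>

#define WAYS 10  /* e.g. 2 big (512B) ways + 8 small (64B) ways per set */

struct way_info {
    uint64_t tag;
    bool     valid;
    bool     big;   /* true: 512B allocation, false: 64B allocation */
};

struct bimodal_set {
    struct way_info locator[WAYS];  /* small structure consulted first */
};

/* Returns the way holding `tag`, or -1 on a miss. Because the locator is
 * read before DRAM is touched, tag resolution and the DRAM data access
 * can proceed in parallel instead of back to back. */
int locate(const struct bimodal_set *set, uint64_t tag) {
    for (int w = 0; w < WAYS; w++)
        if (set->locator[w].valid && set->locator[w].tag == tag)
            return w;
    return -1;
}
```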
94.
Evaluation of EDF scheduling for Ericsson LTE system: A comparison between EDF, FIFO and RR. Nyberg, Angelica; Hartman, Jonas. January 2016.
Scheduling is extremely important for modern real-time systems. It enables several programs to run in parallel and still complete their tasks successfully. Many systems today are real-time systems, which makes good scheduling essential. This thesis aims to evaluate the real-time scheduling algorithm earliest deadline first, newly introduced into the Linux kernel, and compare it to the already established real-time scheduling algorithms first in, first out and round robin in the context of firm tasks. By creating a test program that can create pthreads and set their scheduling characteristics, the performance of earliest deadline first can be evaluated and compared to the others.
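For context, placing a thread under EDF in Linux uses the SCHED_DEADLINE policy via the sched_setattr(2) system call. The minimal sketch below follows the pattern documented in the man pages; the parameter values are hypothetical, and this is not the thesis's test program.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6              /* policy number from linux/sched.h */
#endif

struct sched_attr {                   /* layout per sched_setattr(2) */
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;           /* ns of CPU time per period */
    uint64_t sched_deadline;          /* ns, relative deadline */
    uint64_t sched_period;            /* ns between job releases */
};

static int sched_setattr(pid_t pid, const struct sched_attr *attr,
                         unsigned int flags) {
    return syscall(SYS_sched_setattr, pid, attr, flags);
}

int main(void) {
    struct sched_attr attr = {
        .size           = sizeof(attr),
        .sched_policy   = SCHED_DEADLINE,
        .sched_runtime  = 10 * 1000 * 1000,   /* 10 ms budget per period */
        .sched_deadline = 30 * 1000 * 1000,   /* due 30 ms after release */
        .sched_period   = 100 * 1000 * 1000,  /* job released every 100 ms */
    };
    if (sched_setattr(0, &attr, 0) != 0) {    /* pid 0 = calling thread */
        perror("sched_setattr (needs root or CAP_SYS_NICE)");
        return 1;
    }
    /* the thread's periodic real-time work would run here */
    return 0;
}
```

FIFO and RR threads would instead be configured with sched_setscheduler() using SCHED_FIFO or SCHED_RR and a static priority, which is what makes the three policies directly comparable in one test program.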
95.
Performance improvements using dynamic performance stubs. Trapp, Peter. January 2011.
This thesis proposes a new methodology to extend the software performance engineering process. Common performance measurement and tuning principles mainly aim to improve the software function itself: the application source code is studied and improved independently of the overall system performance behavior. Moreover, the optimization of the software function has to be done without an estimate of the expected optimization gain. This often leads to under- or over-optimization, and hence does not utilize the system sufficiently.

The proposed performance improvement methodology and framework, called dynamic performance stubs, addresses these shortcomings by evaluating the overall system performance improvement. This is achieved by simulating the performance behavior of the original software functionality, depending on an adjustable optimization level, prior to the real optimization. It thus enables the software performance analyst to determine the system's overall performance behavior considering possible outcomes of different improvement approaches. Moreover, using the dynamic performance stubs methodology, a cost-benefit analysis of different optimizations with respect to performance behavior can be carried out.

The approach of dynamic performance stubs is to replace the software bottleneck by a stub. This stub combines the simulation of the software functionality with the possibility to adjust the performance behavior depending on one or more performance aspects of the replaced software function. A general methodology for using dynamic performance stubs, as well as several methodologies for simulating different performance aspects, is discussed. Finally, several case studies show the application and usability of the dynamic performance stubs approach.
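A minimal sketch of what such a stub might look like, assuming the bottleneck's original cost has been measured beforehand; the names, cost value, and busy-wait mechanism are illustrative, not the thesis's framework:

```c
/* Sketch of a dynamic performance stub: the bottleneck function is replaced
 * by a stub that reproduces its functional result while its execution time
 * is scaled by an adjustable optimization level, letting the analyst measure
 * system-wide impact before doing the real optimization. */
#include <time.h>

static const long ORIGINAL_COST_NS = 500000;  /* hypothetical: measured 0.5 ms */

static void busy_wait_ns(long ns) {
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        clock_gettime(CLOCK_MONOTONIC, &now);
    } while ((now.tv_sec - start.tv_sec) * 1000000000L +
             (now.tv_nsec - start.tv_nsec) < ns);
}

/* level = 0.0 simulates a perfect optimization (no cost at all);
 * level = 1.0 reproduces the original cost unchanged. */
int bottleneck_stub(double level, int input) {
    busy_wait_ns((long)(level * ORIGINAL_COST_NS));
    return input * 2;  /* stand-in for the original function's result */
}
```

Sweeping `level` from 1.0 down to 0.0 then yields the system-level speedup curve that bounds how much a real optimization of this function could ever be worth.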
96.
Capteur de courants innovant pour des systèmes polyphasés : application aux câbles multiconducteurs / Innovative current sensor for multiphase systems: application to multi-core cables. Bourkeb, Menad. 17 April 2014.
This thesis presents the study and realization of an innovative current sensor prototype for multi-core cables. The sensor has two main advantages over existing devices on the electrical equipment market. First, it is non-intrusive: the measurement is contactless, so it is no longer necessary to interrupt the system's electrical power supply to install the sensor. Second, it can measure the currents in a polyphase system whose conductor positions are unknown. The approach is based on solving an inverse problem: from a measurement of the magnetic field signature around the cable, appropriate reconstruction algorithms first retrieve the conductor positions and then reconstruct the currents flowing in the cable. To meet the specifications, simulation results, a purpose-built test bench, and an experimental validation of the concept are presented, including a structure with a ferromagnetic shield to attenuate external disturbances.
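To make the inverse-problem step concrete, here is a toy reconstruction under strong simplifying assumptions: conductor positions known, the field treated as a scalar magnitude, and as many sensors as conductors. The thesis's algorithms additionally recover the unknown positions and handle realistic field geometry.

```c
/* Illustrative sketch (not the thesis's algorithm): with conductor positions
 * given, current reconstruction reduces to a linear system B = A*I, where A
 * follows from the Biot-Savart law for long straight conductors,
 * B(r) = mu0 * I / (2*pi*r). Vector field directions are ignored for brevity. */
#include <stdio.h>

#define PI  3.14159265358979323846
#define MU0 (4e-7 * PI)

int main(void) {
    /* hypothetical geometry: distance (m) from sensor s to conductor c */
    double r[2][2] = { {0.01, 0.03}, {0.03, 0.01} };
    double I_true[2] = { 10.0, -10.0 };          /* amperes */

    /* forward model: A[s][c] = mu0 / (2*pi*r), B = simulated measurements */
    double A[2][2], B[2];
    for (int s = 0; s < 2; s++) {
        B[s] = 0.0;
        for (int c = 0; c < 2; c++) {
            A[s][c] = MU0 / (2.0 * PI * r[s][c]);
            B[s] += A[s][c] * I_true[c];
        }
    }

    /* inverse step: solve the 2x2 system A*I = B by Cramer's rule */
    double det = A[0][0] * A[1][1] - A[0][1] * A[1][0];
    double I0  = (B[0] * A[1][1] - A[0][1] * B[1]) / det;
    double I1  = (A[0][0] * B[1] - B[0] * A[1][0]) / det;
    printf("reconstructed currents: %.2f A, %.2f A\n", I0, I1);
    return 0;
}
```

With more sensors than conductors, the same step becomes an overdetermined least-squares problem, which is also how measurement noise is averaged out in practice.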
97.
Un algorithme de fouille de données générique et parallèle pour architecture multi-coeurs / A generic and parallel pattern mining algorithm for multi-core architectures. Negrevergne, Benjamin. 29 November 2011.
In the pattern mining field, there exists a large number of algorithms that solve a large variety of distinct but similar pattern mining problems. This variety prevents broad adoption of data analysis with pattern mining algorithms. In this thesis we propose a formal framework that is able to capture a broad range of pattern mining problems. We illustrate the generality of our framework by formalizing three different pattern mining problems: closed frequent itemset mining, closed relational graph mining, and closed gradual itemset mining. Building on this framework, we have designed ParaMiner, a generic and parallel algorithm for pattern mining. ParaMiner is able to solve any pattern mining problem that can be formalized within our framework. In order to achieve practical efficiency, we have generalized important optimizations from state-of-the-art algorithms and made ParaMiner able to exploit parallel computing platforms. Our thorough experiments demonstrate that, despite being a generic algorithm, ParaMiner can compete with the fastest ad-hoc algorithms, even though those benefit from numerous optimizations specific to the single pattern mining subproblem they solve.
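The spirit of the formalism, a fixed enumeration engine parameterized by a problem-specific predicate, can be sketched as follows. This is an illustration only, not ParaMiner's actual interface, and it omits closure operators and parallel execution.

```c
/* Hypothetical sketch of generic pattern mining: the depth-first
 * set-enumeration engine is fixed, while "interestingness" (e.g. frequency
 * in a dataset) is a user-supplied, anti-monotone predicate used to prune. */
#include <stdio.h>
#include <stdbool.h>

#define N_ITEMS 4

/* stand-in predicate: "itemset has at most 2 items" plays the role of a
 * real frequency test over a dataset */
static bool interesting(const int *items, int k) { (void)items; return k <= 2; }

/* extend the current pattern with items larger than the last one;
 * anti-monotonicity guarantees pruned branches contain no solutions */
static void enumerate(int *items, int k, int next) {
    for (int i = next; i < N_ITEMS; i++) {
        items[k] = i;
        if (!interesting(items, k + 1)) continue;  /* prune this branch */
        for (int j = 0; j <= k; j++) printf("%d ", items[j]);
        printf("\n");
        enumerate(items, k + 1, i + 1);
    }
}

int main(void) {
    int items[N_ITEMS];
    enumerate(items, 0, 0);  /* prints every interesting itemset */
    return 0;
}
```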
98.
Dynamic program analysis algorithms to assist parallelization. Kim, Minjang. 24 August 2012.
All market-leading processor vendors have started to pursue multicore processors as an alternative to high-frequency single-core processors for better energy and power efficiency. This transition to multicore processors means programmers no longer get the free performance gain once provided by increased clock frequency. Parallelization of existing serial programs has become the most powerful approach to improving application performance. Not surprisingly, parallel programming is still extremely difficult for many programmers, mainly because thinking in parallel does not come naturally to humans. However, we believe that software tools based on advanced analyses can significantly reduce this parallelization burden.
Much active research and many tools exist for already parallelized programs, for example for finding concurrency bugs. Instead, we focus on program analysis algorithms that assist the actual parallelization steps: (1) finding parallelization candidates, (2) understanding the parallelizability and profitability of the candidates, and (3) writing parallel code. A few commercial tools exist for these steps, and a number of researchers have proposed various methodologies and techniques to assist parallelization. However, many weaknesses and limitations remain.
In order to assist the parallelization steps more effectively and efficiently, this dissertation proposes Prospector, which consists of several new and enhanced program analysis algorithms.
First, an efficient loop profiling algorithm is implemented. Frequently executed loops are candidates for profitable parallelization targets. The detailed execution profiling of loops provides a guide for selecting initial parallelization targets.
Second, an efficient and rich data-dependence profiling algorithm is presented. Data dependence is the most essential factor that determines parallelizability. Prospector exploits dynamic data-dependence profiling, an alternative and complementary approach to traditional static-only analyses. However, even state-of-the-art dynamic dependence analysis algorithms can only successfully profile a program with a small memory footprint. Prospector introduces an efficient data-dependence profiling algorithm that supports large programs and inputs while providing highly detailed profiling information.
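In its simplest form, dynamic data-dependence profiling can be pictured as a shadow table mapping addresses to the iteration that last wrote them; the assumption-laden sketch below illustrates the idea (real profilers, including the one described here, must compress this state to handle large footprints, and must handle aliasing that this direct-mapped table ignores):

```c
/* Toy dynamic dependence profiler: a read served by an earlier iteration's
 * write reveals a loop-carried read-after-write (RAW) dependence. */
#include <stdio.h>
#include <stdint.h>

#define TABLE_SIZE 65536
static int64_t last_writer[TABLE_SIZE];  /* iteration id, 0 = never written */

static size_t slot(const void *addr) { return ((uintptr_t)addr >> 3) % TABLE_SIZE; }

static void on_write(const void *addr, int64_t iter) { last_writer[slot(addr)] = iter; }

static void on_read(const void *addr, int64_t iter) {
    int64_t w = last_writer[slot(addr)];
    if (w != 0 && w < iter)
        printf("loop-carried RAW: iter %lld reads data written in iter %lld\n",
               (long long)iter, (long long)w);
}

int main(void) {
    int a[8] = {0};
    for (int64_t i = 1; i < 8; i++) {  /* instrumented loop: a[i] = a[i-1] + 1 */
        on_read(&a[i - 1], i);         /* iterations numbered from 1 */
        on_write(&a[i], i);
        a[i] = a[i - 1] + 1;
    }
    return 0;  /* reported RAWs show this loop is not trivially parallel */
}
```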
Third, a new speedup prediction algorithm is proposed. Although the loop profiling can give a qualitative estimate of the expected profit, obtaining accurate speedup estimates needs more sophisticated analysis. Prospector introduces a new dynamic emulation method to predict parallel speedups from annotated serial code. Prospector also provides a memory performance model to predict speedup saturation due to increased memory traffic. Compared to the latest related work, Prospector significantly improves both prediction accuracy and coverage.
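As a generic illustration of why memory traffic must appear in a speedup model, consider an Amdahl's-law estimate capped by a simple bandwidth ceiling. This is a stand-in for intuition only, not Prospector's emulation-based predictor or its memory performance model.

```c
/* Generic speedup sketch: Amdahl's law for the parallel fraction, capped by
 * a naive memory-bandwidth ceiling. All parameters are hypothetical inputs
 * that a profiler would estimate. */
#include <stdio.h>

/* p: parallelizable fraction of runtime (0..1), n: core count,
 * bw_one_core: fraction of peak memory bandwidth one core consumes */
static double predicted_speedup(double p, int n, double bw_one_core) {
    double amdahl = 1.0 / ((1.0 - p) + p / n);
    double bw_cap = 1.0 / bw_one_core;  /* speedup at which bandwidth saturates */
    return amdahl < bw_cap ? amdahl : bw_cap;
}

int main(void) {
    /* e.g. a 95% parallel loop on 8 cores, each core using 20% of peak
     * bandwidth: Amdahl predicts ~5.9x, but traffic saturates at 5x */
    printf("predicted speedup: %.2fx\n", predicted_speedup(0.95, 8, 0.20));
    return 0;
}
```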
Finally, Prospector provides algorithms that extract hidden parallelism and give advice on writing parallel code. We present a number of case studies showing how Prospector assists manual parallelization in particular cases, including privatization, reduction, mutex, and pipelining.
99.
Speeding up matrix computation kernels by sharing vector coprocessor among multiple cores on chip. Dahlberg, Christopher. January 2012.
Today's computer systems are developing towards lower energy consumption while keeping high performance. These are contradictory requirements and pose a great challenge. A good example of an application where this matters is the smartphone, which is constrained to long battery life while delivering the high performance required by future 2D/3D applications. A solution is heterogeneous systems, which have components that are specialized for different tasks and can execute them fast with low energy consumption; such components may specialize in, for example, encoding/decoding, encryption/decryption, image processing, or communication. At the Computer Architecture and Parallel Processing Laboratory (CAPPL) at the New Jersey Institute of Technology (NJIT), a vector co-processor has been developed. This vector co-processor has the unusual feature of being able to receive instructions from multiple hosts (scalar cores). In addition, a test system with a couple of scalar processors using the vector processor has been developed. This thesis describes the processor and its test system. It also presents the development of math applications involving matrix operations. The results lead to the conclusion that the vector co-processor saves a substantial amount of energy while speeding up the execution of the applications. In addition, the thesis describes an extension of the vector co-processor design that makes it possible to monitor the throughput of instructions and data in the processor.
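The kind of kernel involved can be sketched with a dense matrix-vector multiply, whose inner loop maps naturally onto vector loads and multiply-accumulates issued to a shared vector unit (a generic sketch; the co-processor's actual instruction set is not shown here):

```c
/* Sketch of the kind of kernel offloaded to a vector co-processor: a dense
 * matrix-vector multiply. On the shared vector unit, the inner loop becomes
 * a handful of vector loads and multiply-accumulates issued by one scalar
 * core, while other cores can issue their own vector work concurrently. */
#include <stddef.h>

void matvec(size_t n, const float *A, const float *x, float *y) {
    for (size_t i = 0; i < n; i++) {
        float acc = 0.0f;
        for (size_t j = 0; j < n; j++)  /* the vectorizable inner loop */
            acc += A[i * n + j] * x[j];
        y[i] = acc;
    }
}
```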
100.
Models for Parallel Computation in Multi-Core, Heterogeneous, and Ultra Wide-Word Architectures. Salinger, Alejandro. January 2013.
Multi-core processors have become the dominant processor architecture, with 2, 4, and 8 cores on a chip being widely available and an increasing number of cores predicted for the future. In addition, the decreasing costs and increasing programmability of Graphics Processing Units (GPUs) have made these an accessible source of parallel processing power in general purpose computing. Among the many research challenges that this scenario has raised are the fundamental problems related to theoretical modeling of computation in these architectures. In this thesis we study several aspects of computation in modern parallel architectures, from modeling of computation in multi-cores and heterogeneous platforms, to multi-core cache management strategies, through the proposal of an architecture that exploits bit-parallelism on thousands of bits.
Observing that in practice multi-cores have a small number of cores, we propose a model for low-degree parallelism for these architectures. We argue that assuming a small number of processors (logarithmic in a problem's input size) simplifies the design of parallel algorithms. We show that in this model a large class of divide-and-conquer and dynamic programming algorithms can be parallelized with simple modifications to sequential programs, while achieving optimal parallel speedups. We further explore low-degree parallelism in computation, providing evidence of fundamental differences in practice and theory between systems with a sublinear and linear number of processors, and suggesting a sharp theoretical gap between the classes of problems that are efficiently parallelizable in each case.
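The "simple modifications to sequential programs" can be illustrated with a divide-and-conquer sum that spawns threads only to a fixed small depth, roughly log p for p cores, and then falls back to sequential code (a sketch of the general idea, not the thesis's construction):

```c
/* Low-degree parallelism sketch: recursive splitting spawns a new thread per
 * left half only until `depth` reaches 0, bounding the thread count at
 * 2^depth; below that, the original sequential code runs unchanged. */
#include <pthread.h>
#include <stdio.h>

struct job { const long *a; long n; int depth; long sum; };

static void *dc_sum(void *arg) {
    struct job *j = arg;
    if (j->depth == 0 || j->n < 2) {          /* sequential base case */
        j->sum = 0;
        for (long i = 0; i < j->n; i++) j->sum += j->a[i];
        return NULL;
    }
    struct job left  = { j->a,            j->n / 2,        j->depth - 1, 0 };
    struct job right = { j->a + j->n / 2, j->n - j->n / 2, j->depth - 1, 0 };
    pthread_t t;
    pthread_create(&t, NULL, dc_sum, &left);  /* one half in a new thread */
    dc_sum(&right);                           /* other half in this thread */
    pthread_join(t, NULL);
    j->sum = left.sum + right.sum;
    return NULL;
}

int main(void) {
    long a[1000];
    for (long i = 0; i < 1000; i++) a[i] = i;
    struct job root = { a, 1000, 2, 0 };      /* depth 2: at most 4 threads */
    dc_sum(&root);
    printf("sum = %ld\n", root.sum);          /* prints 499500 */
    return 0;
}
```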
Efficient strategies to manage shared caches play a crucial role in multi-core performance. We propose a model for paging in multi-core shared caches, which extends classical paging to a setting in which several threads share the cache. We show that in this setting traditional cache management policies perform poorly, and that any effective strategy must partition the cache among threads, with a partition that adapts dynamically to the demands of each thread. Inspired by the shared cache setting, we introduce the minimum cache usage problem, an extension to classical sequential paging in which algorithms must account for the amount of cache they use. This cache-aware model seeks algorithms with good performance in terms of faults and the amount of cache used, and has applications in energy-efficient caching and in shared cache scenarios.
The wide availability of GPUs has added to the parallel power of multi-cores, however, most applications underutilize the available resources. We propose a model for hybrid computation in heterogeneous systems with multi-cores and GPU, and describe strategies for generic parallelization and efficient scheduling of a large class of divide-and-conquer algorithms.
Lastly, we introduce the Ultra-Wide Word architecture and model, an extension of the word-RAM model that allows for constant-time operations on thousands of bits in parallel. We show that a large class of existing algorithms can be implemented in the Ultra-Wide Word model, achieving speedups comparable to those of multi-threaded computations, while avoiding the more difficult aspects of parallel programming.
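Ordinary 64-bit words already exhibit, in miniature, the word-level parallelism that the Ultra-Wide Word model scales up to thousands of bits. A classic example is the SWAR test for a zero byte, which examines eight bytes in constant time:

```c
/* Word-level bit-parallelism on a 64-bit word: the classic "haszero" SWAR
 * trick tests all eight bytes for zero with three word operations. The
 * Ultra-Wide Word model applies the same style of reasoning to words that
 * are thousands of bits wide. */
#include <stdint.h>
#include <stdio.h>

/* a byte of v is zero iff the corresponding high bit is set in the result */
static uint64_t haszero(uint64_t v) {
    return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}

int main(void) {
    uint64_t w = 0x1122330044556677ULL;  /* fifth byte from the LSB is zero */
    printf("zero byte present: %s\n", haszero(w) ? "yes" : "no");
    return 0;
}
```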