91

Accelerating Hardware Simulation on Multi-cores

Nanjundappa, Mahesh 04 June 2010 (has links)
Electronic design automation (EDA) tools play a central role in bridging the productivity gap for designing complex hardware systems. However, with the increase in the size and complexity of today's designs, current methodologies and EDA tools are unable to keep the productivity gap from widening further. It is estimated that testing and verification take two-thirds of the total development time of complex hardware systems. Functional simulation forms the mainstay of the testing and verification process and is its most widely used technique. Most simulation algorithms and their implementations are designed for uniprocessor systems and cannot easily leverage the parallelism in multi-core and GPU platforms. For example, logic simulation often uses levelized sequential algorithms, whereas the discrete-event simulation frameworks for Verilog, VHDL, and SystemC employ concurrency in the form of multi-threading to give an illusion of the inherent parallelism present in circuits. However, the discrete-event model of computation requires a global notion of an event queue, which makes improving simulation performance via parallelization even more challenging. This work investigates automatic parallelization of the simulation algorithms used to simulate hardware models. In particular, we focus on parallelizing the simulation of hardware designs described at the RTL in SystemC/HDL, with examples that clearly illustrate the parallelization. Even though multi-cores and GPUs offer parallelism, exploiting it efficiently through their programming models is not straightforward. To overcome this, we also focus our research on building intelligent translators that map simulation applications onto multi-cores and GPUs so that the complexity of the low-level programming models is hidden from the designers. / Master of Science
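
The parallelization challenge above stems from that single global event queue. As a minimal sketch (illustrative only, not the dissertation's framework), a discrete-event kernel looks like the following: every scheduled event funnels through one time-ordered queue, so the main loop is inherently sequential.

```c
/* Minimal discrete-event simulation kernel (illustrative sketch).
   The global min-heap of events is the serialization point: the main
   loop must pop and execute events strictly in timestamp order. */
#include <stdio.h>

typedef struct { double time; void (*action)(double); } Event;

static Event heap[1024];            /* binary min-heap keyed on event time */
static int n_events = 0;

static void schedule(double t, void (*action)(double)) {
    int i = n_events++;
    heap[i] = (Event){ t, action };
    while (i > 0 && heap[(i - 1) / 2].time > heap[i].time) {   /* sift up */
        Event tmp = heap[i];
        heap[i] = heap[(i - 1) / 2];
        heap[(i - 1) / 2] = tmp;
        i = (i - 1) / 2;
    }
}

static Event pop_earliest(void) {
    Event top = heap[0];
    heap[0] = heap[--n_events];
    for (int i = 0;;) {                                        /* sift down */
        int l = 2 * i + 1, r = l + 1, m = i;
        if (l < n_events && heap[l].time < heap[m].time) m = l;
        if (r < n_events && heap[r].time < heap[m].time) m = r;
        if (m == i) break;
        Event tmp = heap[i]; heap[i] = heap[m]; heap[m] = tmp;
        i = m;
    }
    return top;
}

static void clock_edge(double now) {
    printf("clock edge at t=%.1f\n", now);
    if (now < 50.0) schedule(now + 10.0, clock_edge);          /* re-arm */
}

int main(void) {
    schedule(0.0, clock_edge);
    while (n_events > 0) {               /* inherently sequential loop */
        Event e = pop_earliest();
        e.action(e.time);                /* may schedule further events */
    }
    return 0;
}
```

Parallelizing this loop safely requires partitioning the queue or proving that same-timestamp events are independent, which is exactly what makes discrete-event simulation harder to parallelize than levelized logic simulation.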
92

Multi-Core Memory System Design: Developing and using Analytical Models for Performance Evaluation and Enhancements

Dwarakanath, Nagendra Gulur January 2015 (has links) (PDF)
Memory system design is increasingly influencing modern multi-core architectures from both performance and power perspectives. Both main memory latency and bandwidth have improved at a rate that is slower than the increase in processor core count and speed. Off-chip memory, primarily built from DRAM, has received significant attention in terms of architecture and design for higher performance. These performance improvement techniques include sophisticated memory access scheduling, use of multiple memory controllers, mitigating the impact of DRAM refresh cycles, and so on. At the same time, new non-volatile memory technologies have become increasingly viable in terms of performance and energy. These alternative technologies offer different performance characteristics compared to traditional DRAM. With the advent of 3D stacking, on-chip memory in the form of 3D-stacked DRAM has opened up avenues for addressing the bandwidth and latency limitations of off-chip memory. Stacked DRAM is expected to offer abundant capacity (100s of MBs to a few GBs) at higher bandwidth and lower latency. Researchers have proposed to use this capacity as an extension to main memory, or as a large last-level DRAM cache. When leveraged as a cache, stacked DRAM provides opportunities and challenges for improving cache hit rate, access latency, and off-chip bandwidth. Thus, designing off-chip and on-chip memory systems for multi-core architectures is complex, compounded by the myriad architectural, design, and technological choices, combined with the characteristics of application workloads. Applications have inherent spatial locality and access parallelism that influence the memory system's response in terms of latency and bandwidth. In this thesis, we construct an analytical model of the off-chip main memory system to comprehend this diverse space and to study the impact of memory system parameters and workload characteristics from latency and bandwidth perspectives. Our model, called ANATOMY, uses a queuing network formulation of the memory system, parameterized with workload characteristics, to obtain a closed-form solution for the average miss penalty experienced by the last-level cache. We validate the model across a wide variety of memory configurations on four-core, eight-core, and sixteen-core architectures. ANATOMY is able to predict memory latency with average errors of 8.1%, 4.1%, and 9.7% over quad-core, eight-core, and sixteen-core configurations respectively. Further, ANATOMY identifies better-performing design points accurately, thereby allowing architects and designers to explore the more promising design points in greater detail. We demonstrate the extensibility and applicability of our model by exploring a variety of memory design choices, such as the impact of clock speed, the benefit of multiple memory controllers, the role of banks and channel width, and so on. We also demonstrate ANATOMY's ability to capture architectural elements such as memory scheduling mechanisms and the impact of DRAM refresh cycles. In all of these studies, ANATOMY provides insight into sources of memory performance bottlenecks and is able to quantitatively predict the benefit of redressing them. An insight from the model suggests that provisioning multiple small row-buffers in each DRAM bank achieves better performance than the traditional one (large) row-buffer per bank design.
Multiple row-buffers also enable new performance improvement opportunities, such as intra-bank parallelism between data transfers and row activations, and smart row-buffer allocation schemes based on workload demand. Our evaluation (both using the analytical model and detailed cycle-accurate simulation) shows that the proposed DRAM re-organization achieves significant speed-up as well as energy reduction. Next, we examine the role of on-chip stacked DRAM caches in improving performance by reducing the load on off-chip main memory. We extend ANATOMY to cover DRAM caches. ANATOMY-Cache takes into account all the key parameters and design issues governing DRAM cache organization, namely where the cache metadata is stored and accessed, the role of cache block size and set associativity, and the impact of block size on row-buffer hit rate and off-chip bandwidth. Yet the model is kept simple and provides a closed-form solution for the average miss penalty experienced by the last-level SRAM cache. ANATOMY-Cache is validated against detailed architecture simulations and shown to have latency estimation errors of 10.7% and 8.8% on average in quad-core and eight-core configurations respectively. An interesting insight from the model suggests that under high load, it is better to bypass the congested DRAM cache and leverage the available idle main memory bandwidth. We use this insight to propose a refresh reduction mechanism that virtually eliminates refresh overhead in DRAM caches. We implement a low-overhead hardware mechanism to record accesses to recent DRAM cache pages and refresh only these pages. Older cache pages are considered invalid and serviced from the (idle) main memory. This technique achieves an average refresh reduction of 90%, with resulting memory energy savings of 9% and an overall performance improvement of 3.7%. Finally, we propose a new DRAM cache organization that achieves higher cache hit rate, lower latency, and lower off-chip bandwidth demand. Called the Bi-Modal Cache, our cache organization brings three independent improvements together: (i) it enables parallel tag and data accesses, (ii) it eliminates a large fraction of tag accesses entirely through a novel way locator, and (iii) it improves cache space utilization by organizing the cache sets as a combination of some big blocks (512B) and some small blocks (64B). The Bi-Modal Cache reduces hit latency through the way locator and parallel tag and data accesses. It improves hit rate by using the cache capacity efficiently: blocks with low spatial reuse are allocated in the cache at 64B granularity, reducing both wasted off-chip bandwidth and cache internal fragmentation. The increased cache hit rate leads to a reduction in off-chip bandwidth demand. Through detailed simulations, we demonstrate that the Bi-Modal Cache achieves overall performance improvements of 10.8%, 13.8%, and 14.0% in quad-core, eight-core, and sixteen-core workloads respectively over an aggressive baseline.
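
The abstract does not reproduce ANATOMY's closed form. As a hedged illustration of the queuing-network idea (a single-channel M/M/1 approximation of our own, not the model from the thesis), the average time a miss spends at a memory channel can be written as:

```latex
% Illustration only: single-queue M/M/1 approximation, not ANATOMY's model.
% \lambda : arrival rate of miss requests at the channel
% \mu     : channel service rate; \rho = \lambda/\mu is the utilization
\[
  W \;=\; \underbrace{\frac{1}{\mu}}_{\text{service}}
    \;+\; \underbrace{\frac{\rho}{\mu\,(1-\rho)}}_{\text{queueing delay}}
    \;=\; \frac{1}{\mu-\lambda},
  \qquad \rho \;=\; \frac{\lambda}{\mu} \;<\; 1 .
\]
```

The queueing term blows up as utilization approaches 1, which matches the model's insight that, under high load, it can be better to service requests from idle main memory than to wait behind a congested DRAM cache.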
93

Evaluation of EDF scheduling for Ericsson LTE system: A comparison between EDF, FIFO and RR

Nyberg, Angelica, Hartman, Jonas January 2016 (has links)
Scheduling is extremely important for modern real-time systems: it enables several programs to run in parallel and still complete their tasks on time. Many systems today are real-time systems, so good scheduling is essential. This thesis evaluates the real-time scheduling algorithm earliest deadline first, newly introduced into the Linux kernel, and compares it to the two real-time scheduling algorithms already well established there, first in, first out and round robin, in the context of tasks classified as firm. By creating a test program that can spawn pthreads and set their scheduling characteristics, the performance of earliest deadline first can be evaluated and compared against the other algorithms.
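
As a minimal sketch of such a test program (assumptions: Linux 3.14 or later, where SCHED_DEADLINE was introduced; root or CAP_SYS_NICE privileges; a glibc without a sched_setattr wrapper, so the raw syscall is used; the 2 ms/10 ms parameters are illustrative, not the thesis' test values):

```c
/* Sketch: create a pthread and switch it to SCHED_DEADLINE (EDF).
   struct kernel_sched_attr mirrors the kernel ABI for sched_setattr(2);
   it is defined here because older glibc versions ship no wrapper. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

struct kernel_sched_attr {
    uint32_t size, sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;                       /* for FIFO/RR */
    uint64_t sched_runtime, sched_deadline, sched_period;  /* ns, for EDF */
};

static void *worker(void *unused) {
    (void)unused;
    struct kernel_sched_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.size           = sizeof attr;
    attr.sched_policy   = SCHED_DEADLINE;
    attr.sched_runtime  =  2 * 1000 * 1000;        /*  2 ms budget per period */
    attr.sched_deadline = 10 * 1000 * 1000;        /* 10 ms relative deadline */
    attr.sched_period   = 10 * 1000 * 1000;        /* 10 ms period */

    if (syscall(SYS_sched_setattr, 0 /* self */, &attr, 0) != 0) {
        perror("sched_setattr (needs root/CAP_SYS_NICE)");
        return NULL;
    }
    for (int i = 0; i < 5; i++) {
        /* ... firm task body: do up to 2 ms of work ... */
        sched_yield();   /* under SCHED_DEADLINE: sleep until next period */
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    return 0;
}
```

Compile with `gcc -pthread`; the same program can switch the attr.sched_policy field to SCHED_FIFO or SCHED_RR (with sched_priority set) to compare the three policies.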
94

Performance improvements using dynamic performance stubs

Trapp, Peter January 2011 (has links)
This thesis proposes a new methodology to extend the software performance engineering process. Common performance measurement and tuning principles mainly aim to improve the software function itself: the application source code is studied and improved independently of the overall system's performance behavior. Moreover, the optimization of the software function has to be done without an estimate of the expected optimization gain. This often leads to under- or over-optimization and hence does not utilize the system sufficiently. The proposed performance improvement methodology and framework, called dynamic performance stubs, addresses these shortcomings by evaluating the overall system performance improvement. This is achieved by simulating the performance behavior of the original software functionality at an adjustable optimization level prior to the real optimization. It thus enables the software performance analyst to determine the system's overall performance behavior under the possible outcomes of different improvement approaches. Moreover, using the dynamic performance stubs methodology, a cost-benefit analysis of different optimizations with respect to performance behavior can be carried out. The approach of dynamic performance stubs is to replace the software bottleneck with a stub. This stub combines a simulation of the software functionality with the ability to adjust the performance behavior of one or more performance aspects of the replaced software function. A general methodology for using dynamic performance stubs, as well as several methodologies for simulating different performance aspects, are discussed. Finally, several case studies are presented to show the application and usability of the dynamic performance stubs approach.
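
As a minimal sketch of the stub idea (the names bottleneck_stub, burn_cpu_ns, and g_opt_level are ours, not the thesis framework's): the bottleneck function is replaced by a stub that preserves the functional result but scales its CPU cost by an adjustable level, so the system-wide effect of a planned optimization can be measured before any real optimization work is done.

```c
/* Dynamic-performance-stub sketch: simulate an optimization of a
   bottleneck function before implementing it. */
#include <stdio.h>
#include <time.h>

/* optimization level: 1.0 = current behaviour, 0.5 = "twice as fast", ... */
static double g_opt_level = 1.0;

static void burn_cpu_ns(long ns) {
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        clock_gettime(CLOCK_MONOTONIC, &now);
    } while ((now.tv_sec - start.tv_sec) * 1000000000L +
             (now.tv_nsec - start.tv_nsec) < ns);
}

/* Stub standing in for the real bottleneck: returns the value the original
   function would, but its runtime is scaled by g_opt_level. */
static int bottleneck_stub(int input) {
    long original_cost_ns = 5000000;  /* measured cost of the real function */
    burn_cpu_ns((long)(original_cost_ns * g_opt_level));
    return input * 2;                 /* simulated functional behaviour */
}

int main(void) {
    for (double lvl = 1.0; lvl >= 0.25; lvl /= 2) {
        g_opt_level = lvl;
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        int r = bottleneck_stub(21);
        clock_gettime(CLOCK_MONOTONIC, &b);
        printf("opt level %.2f -> result %d, %.2f ms\n", lvl, r,
               ((b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec)) / 1e6);
    }
    return 0;
}
```

Sweeping the level while measuring end-to-end system performance yields the gain curve on which the cost-benefit analysis is based.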
95

Innovative current sensor for multiphase systems: application to multi-core cables

Bourkeb, Menad 17 April 2014 (has links)
This thesis presents the study and realization of an innovative current sensor prototype for multi-core cables. The sensor has two main advantages over existing devices on the electrical equipment market. First, it is no longer necessary to interrupt the system's electrical power supply to install the sensor, because the measurement is contactless (the sensor is non-intrusive). Second, the device can measure the currents in a polyphase system whose conductor positions are unknown. The approach is based on solving an inverse problem: from a measurement of the magnetic field signature around the cable, appropriate reconstruction algorithms first recover the conductor positions and then reconstruct the currents flowing in the cable. In addition to simulation results, a test bench was designed, and an experimental validation of the concept is presented against a set of specifications, notably for a structure with a shield of ferromagnetic material to attenuate external disturbances.
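
In a simplified form (our notation, assuming long straight parallel conductors in a 2-D cross-section, not the thesis' exact formulation), the inverse problem looks as follows: each sensor reading is linear in the unknown currents once the conductor positions are known, so estimating the positions is the hard nonlinear step, after which current reconstruction reduces to least squares.

```latex
% B_i : field measured at sensor position r_i;  p_j, I_j : position and
% current of conductor j.  Sketch under simplifying assumptions.
\[
  B_i \;=\; \sum_{j=1}^{N} k(\mathbf{r}_i,\mathbf{p}_j)\, I_j ,
  \qquad
  k(\mathbf{r},\mathbf{p}) \;\propto\; \frac{\mu_0}{2\pi\,\lVert \mathbf{r}-\mathbf{p}\rVert},
\]
\[
  \mathbf{B} \;=\; K(\mathbf{p})\,\mathbf{I}
  \quad\Longrightarrow\quad
  \hat{\mathbf{I}} \;=\; \arg\min_{\mathbf{I}}
      \bigl\lVert K(\hat{\mathbf{p}})\,\mathbf{I} - \mathbf{B}_{\mathrm{meas}} \bigr\rVert_2^2 .
\]
```

Here the conductor positions p are estimated first from the field signature; the ferromagnetic shield attenuates external fields that would otherwise corrupt the measured B.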
96

A generic and parallel pattern mining algorithm for multi-core architectures

Negrevergne, Benjamin 29 November 2011 (has links)
In the pattern mining field, there exists a large number of algorithms for solving a large variety of distinct but similar pattern mining problems. This variety hinders the broad adoption of pattern mining techniques for data analysis. In this thesis, we propose a formal framework that can capture a broad range of pattern mining problems. We illustrate the generality of our framework by formalizing three different pattern mining problems: closed frequent itemset mining, closed relational graph mining, and closed gradual itemset mining. Building on this framework, we have designed ParaMiner, a generic and parallel algorithm for pattern mining. ParaMiner can solve any pattern mining problem that can be formalized within our framework. To achieve practical efficiency, we have generalized important optimizations from state-of-the-art algorithms and made ParaMiner able to exploit parallel computing platforms. Our thorough experiments demonstrate that despite being a generic algorithm, ParaMiner can compete with the fastest ad hoc algorithms, even though those algorithms benefit from numerous optimizations specific to the particular pattern mining problem they solve.
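
As a hedged illustration of the generic enumeration these problems share (a sequential toy of ours; ParaMiner additionally handles pattern closure and parallelizes the exploration), the skeleton grows a pattern one element at a time and prunes an entire subtree as soon as the interestingness predicate, here minimum support, fails:

```c
/* Generic set-enumeration skeleton for pattern mining (toy sketch). */
#include <stdio.h>
#include <stdint.h>

#define N_ITEMS 8

/* toy dataset: each transaction is a bitmask over N_ITEMS items */
static const uint32_t db[] = { 0x0B, 0x0A, 0x1A, 0x03 };
static const int n_tx = sizeof db / sizeof db[0];
static const int min_support = 2;

static int support(uint32_t pattern) {
    int s = 0;
    for (int t = 0; t < n_tx; t++)
        if ((db[t] & pattern) == pattern) s++;  /* pattern subset of tx */
    return s;
}

static void print_pattern(uint32_t p, int sup) {
    printf("{");
    for (int i = 0; i < N_ITEMS; i++)
        if (p & (1u << i)) printf(" %d", i);
    printf(" } support=%d\n", sup);
}

/* Enumerate frequent patterns; extending only with items > last ensures
   each set is generated exactly once. */
static void expand(uint32_t pattern, int last) {
    for (int i = last + 1; i < N_ITEMS; i++) {
        uint32_t ext = pattern | (1u << i);
        int s = support(ext);
        if (s >= min_support) {          /* anti-monotone pruning */
            print_pattern(ext, s);
            expand(ext, i);              /* recurse into the subtree */
        }
    }
}

int main(void) { expand(0, -1); return 0; }
```

In a generic framework, support() would be one pluggable predicate among others; the independent subtrees of expand() are what a parallel algorithm can distribute across cores.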
97

Dynamic program analysis algorithms to assist parallelization

Kim, Minjang 24 August 2012 (has links)
All market-leading processor vendors have started to pursue multicore processors as an alternative to high-frequency single-core processors for better energy and power efficiency. With this transition to multicore processors, programmers no longer get the free performance gains once provided by increasing clock frequencies. Parallelization of existing serial programs has become the most powerful approach to improving application performance. Not surprisingly, parallel programming is still extremely difficult for many programmers, mainly because thinking in parallel does not come naturally. However, we believe that software tools based on advanced analyses can significantly reduce this parallelization burden. Much active research and many tools exist for already-parallelized programs, for example for finding concurrency bugs. Instead, we focus on program analysis algorithms that assist the actual parallelization steps: (1) finding parallelization candidates, (2) understanding the parallelizability and expected profit of the candidates, and (3) writing parallel code. A few commercial tools have been introduced for these steps, and a number of researchers have proposed various methodologies and techniques to assist parallelization, but many weaknesses and limitations remain. In order to assist the parallelization steps more effectively and efficiently, this dissertation proposes Prospector, which consists of several new and enhanced program analysis algorithms. First, an efficient loop profiling algorithm is implemented. Frequently executed loops can be candidates for profitable parallelization targets, and detailed execution profiles of loops provide a guide for selecting initial parallelization targets. Second, an efficient and rich data-dependence profiling algorithm is presented. Data dependence is the most essential factor that determines parallelizability. Prospector exploits dynamic data-dependence profiling, which is an alternative and complementary approach to traditional static-only analyses. However, even state-of-the-art dynamic dependence analysis algorithms can only successfully profile programs with small memory footprints. Prospector introduces an efficient data-dependence profiling algorithm that supports large programs and inputs while providing highly detailed profiling information. Third, a new speedup prediction algorithm is proposed. Although loop profiling can give a qualitative estimate of the expected profit, obtaining accurate speedup estimates requires more sophisticated analysis. Prospector introduces a new dynamic emulation method to predict parallel speedups from annotated serial code. Prospector also provides a memory performance model to predict speedup saturation due to increased memory traffic. Compared to the latest related work, Prospector significantly improves both prediction accuracy and coverage. Finally, Prospector provides algorithms that extract hidden parallelism and give advice on writing parallel code. We present a number of case studies showing how Prospector assists manual parallelization in particular cases, including privatization, reduction, mutex, and pipelining.
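
A minimal sketch of the dynamic data-dependence profiling idea (the hooks on_load/on_store are hypothetical; real profilers such as the one described insert them via compiler or binary instrumentation): a shadow table remembers the last iteration that wrote each address, so a read of that address in a later iteration reveals a loop-carried read-after-write dependence, the kind of fact that determines whether a loop can be parallelized.

```c
/* Toy dynamic data-dependence profiler: detect loop-carried RAW
   dependences by shadowing memory writes. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 4096
typedef struct { uintptr_t addr; int writer_iter; } ShadowEntry;
static ShadowEntry shadow[TABLE_SIZE];

static ShadowEntry *lookup(uintptr_t a) {
    return &shadow[(a >> 2) % TABLE_SIZE];  /* direct-mapped; collisions ignored */
}

static void on_store(void *p, int iter) {   /* instrumentation hook: write */
    ShadowEntry *e = lookup((uintptr_t)p);
    e->addr = (uintptr_t)p;
    e->writer_iter = iter;
}

static void on_load(void *p, int iter) {    /* instrumentation hook: read */
    ShadowEntry *e = lookup((uintptr_t)p);
    if (e->addr == (uintptr_t)p && e->writer_iter < iter)
        printf("loop-carried RAW: iter %d reads value written in iter %d\n",
               iter, e->writer_iter);
}

int main(void) {
    int a[8] = {0};
    memset(shadow, 0xff, sizeof shadow);     /* mark all entries empty */
    for (int i = 1; i < 8; i++) {            /* a[i] = a[i-1] + 1 */
        on_load(&a[i - 1], i);
        int v = a[i - 1] + 1;
        on_store(&a[i], i);
        a[i] = v;
    }
    return 0;
}
```

The reported dependences (each iteration reading the previous iteration's write) show this loop is not parallelizable as written; an absence of such reports on profiled inputs is the evidence a tool like Prospector uses to suggest a loop as a parallelization candidate.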
98

Speeding up matrix computation kernels by sharing vector coprocessor among multiple cores on chip

Dahlberg, Christopher January 2012 (has links)
Today's computer systems are developing towards lower energy consumption while maintaining high performance. These are contradictory requirements and pose a great challenge. A good example of an application where this matters is the smartphone: the constraints are long battery life combined with the high performance required by future 2D/3D applications. A solution is heterogeneous systems with components that are specialized for different tasks and can execute them fast with low energy consumption, for example encoding/decoding, encryption/decryption, image processing, or communication. At the Computer Architecture and Parallel Processing Laboratory (CAPPL) at the New Jersey Institute of Technology (NJIT), a vector co-processor has been developed. The vector co-processor has the unusual feature of being able to receive instructions from multiple hosts (scalar cores). In addition, a test system in which a couple of scalar processors share the vector processor has been developed. This thesis describes this processor and its test system. It also presents the development of math applications involving matrix operations. The results lead to the conclusion that vector co-processing saves a substantial amount of energy while speeding up the execution of the applications. Additionally, the thesis describes an extension of the vector co-processor design that makes it possible to monitor the throughput of instructions and data in the processor.
99

Models for Parallel Computation in Multi-Core, Heterogeneous, and Ultra Wide-Word Architectures

Salinger, Alejandro January 2013 (has links)
Multi-core processors have become the dominant processor architecture, with 2, 4, and 8 cores on a chip widely available and an increasing number of cores predicted for the future. In addition, the decreasing costs and increasing programmability of Graphics Processing Units (GPUs) have made them an accessible source of parallel processing power in general-purpose computing. Among the many research challenges that this scenario has raised are the fundamental problems of theoretically modeling computation in these architectures. In this thesis we study several aspects of computation in modern parallel architectures, from modeling computation in multi-cores and heterogeneous platforms, to multi-core cache management strategies, to the proposal of an architecture that exploits bit-parallelism on thousands of bits. Observing that in practice multi-cores have a small number of cores, we propose a model of low-degree parallelism for these architectures. We argue that assuming a small number of processors (logarithmic in a problem's input size) simplifies the design of parallel algorithms. We show that in this model a large class of divide-and-conquer and dynamic programming algorithms can be parallelized with simple modifications to sequential programs, while achieving optimal parallel speedups. We further explore low-degree parallelism in computation, providing evidence of fundamental differences in practice and theory between systems with a sublinear and a linear number of processors, and suggesting a sharp theoretical gap between the classes of problems that are efficiently parallelizable in each case. Efficient strategies to manage shared caches play a crucial role in multi-core performance. We propose a model for paging in multi-core shared caches, which extends classical paging to a setting in which several threads share the cache. We show that in this setting traditional cache management policies perform poorly, and that any effective strategy must partition the cache among threads, with a partition that adapts dynamically to the demands of each thread. Inspired by the shared cache setting, we introduce the minimum cache usage problem, an extension to classical sequential paging in which algorithms must account for the amount of cache they use. This cache-aware model seeks algorithms with good performance in terms of both faults and the amount of cache used, and has applications in energy-efficient caching and in shared cache scenarios. The wide availability of GPUs has added to the parallel power of multi-cores; however, most applications underutilize the available resources. We propose a model for hybrid computation in heterogeneous systems with multi-cores and GPUs, and describe strategies for generic parallelization and efficient scheduling of a large class of divide-and-conquer algorithms. Lastly, we introduce the Ultra-Wide Word architecture and model, an extension of the word-RAM model that allows constant-time operations on thousands of bits in parallel. We show that a large class of existing algorithms can be implemented in the Ultra-Wide Word model, achieving speedups comparable to those of multi-threaded computations, while avoiding the more difficult aspects of parallel programming.
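
A minimal sketch (ours, not the thesis') of the low-degree-parallelism recipe for divide-and-conquer: spawn threads only down to a small recursion depth, roughly log2 of the core count, and run sequentially below that. The change to a sequential mergesort is a single depth test.

```c
/* Depth-limited parallel mergesort: threads are spawned only near the top
   of the recursion tree, matching a small (low-degree) processor count. */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct { int *a, *tmp; int lo, hi, depth; } Job;

static void merge(int *a, int *tmp, int lo, int mid, int hi) {
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (size_t)(hi - lo) * sizeof(int));
}

static void *msort(void *arg) {
    Job *job = arg;
    int lo = job->lo, hi = job->hi;
    if (hi - lo < 2) return NULL;
    int mid = lo + (hi - lo) / 2;
    Job left  = { job->a, job->tmp, lo,  mid, job->depth - 1 };
    Job right = { job->a, job->tmp, mid, hi,  job->depth - 1 };
    if (job->depth > 0) {             /* spawn only near the top */
        pthread_t t;
        pthread_create(&t, NULL, msort, &left);
        msort(&right);                /* recurse inline on the other half */
        pthread_join(t, NULL);
    } else {                          /* plain sequential recursion below */
        msort(&left);
        msort(&right);
    }
    merge(job->a, job->tmp, lo, mid, hi);
    return NULL;
}

int main(void) {
    enum { N = 1 << 16 };
    int *a = malloc(N * sizeof(int)), *tmp = malloc(N * sizeof(int));
    for (int i = 0; i < N; i++) a[i] = rand();
    Job root = { a, tmp, 0, N, 2 };   /* depth 2 => at most 4 threads */
    msort(&root);
    int ok = 1;
    for (int i = 1; i < N; i++) if (a[i - 1] > a[i]) { ok = 0; break; }
    printf("sorted: %s\n", ok ? "yes" : "no");
    free(a); free(tmp);
    return 0;
}
```

Sibling calls touch disjoint ranges of both arrays, so no locking is needed; the depth cap is the "simple modification to a sequential program" that the model advocates.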
100

Proceedings of the 4th Many-core Applications Research Community (MARC) Symposium

January 2012 (has links)
In continuation of a successful series of events, the 4th Many-core Applications Research Community (MARC) symposium took place at the HPI in Potsdam on December 8th and 9th 2011. Over 60 researchers from different fields presented their work on many-core hardware architectures, their programming models, and the resulting research questions for the upcoming generation of heterogeneous parallel systems.
