Global ETD Search

1	Scaling RDMA RPCs with FLOCK Monga, Sumit Kumar 30 November 2021 RDMA-capable networks are gaining traction in datacenter deployments due to their high throughput, low latency, CPU efficiency, and advanced features such as remote memory operations. However, efficiently utilizing RDMA capability in the common setting of a high fan-in, fan-out asymmetric network topology is challenging. For instance, using RDMA programming features comes at the cost of connection scalability as cluster size increases. To address that, several works forgo some RDMA features and focus only on conventional RPC APIs. In this work, we strive to exploit the full capability of RDMA while scaling the number of connections regardless of the cluster size. We present FLOCK, a communication framework for RDMA networks that uses hardware-provided reliable connections. Using a partially shared model, FLOCK departs from the conventional RDMA design by enabling connection sharing among threads, which provides significant performance improvements, contrary to the widely held belief that connection sharing deteriorates performance. At its core, FLOCK uses a connection handle abstraction for connection multiplexing; a new coalescing-based synchronization approach for efficient network utilization; and a load-control mechanism for connections with symbiotic send-recv scheduling, which reduces the synchronization overheads associated with connection sharing and ensures fair utilization of network connections. / M.S. / The Internet is one of the great discoveries of our time. It provides access to enormous sources of knowledge and makes it easy to communicate seamlessly across the globe, among countless other advantages. Over the years, the latency of services such as web search and file downloads has dropped sharply. A download that took minutes during the 2000s can now complete within seconds.
Network speeds have been improving, facilitating a faster and smoother user experience. Another factor contributing to the improved internet experience is that service providers such as Google and Amazon can process user requests in a fraction of the time they used to take. Web services such as search and e-commerce are implemented using a multi-layer architecture, with each layer containing hundreds to thousands of servers. Each server runs one or more components of the web service application. In this architecture, user requests are received in the upper layer and processed by the lower layers. Servers in different layers communicate over an ultrafast network technology such as Remote Direct Memory Access (RDMA). The implication of the multi-layer architecture is that a server has to communicate with many other servers in the upper and lower layers. Unfortunately, due to its inherent limitations, RDMA does not perform well when a server communicates with a large number of other servers. In this thesis, a new communication framework for RDMA networks, FLOCK, is proposed to overcome the scalability limitations of RDMA hardware. FLOCK maintains scalability when communicating with many servers and consistently provides better performance than the state of the art. Additionally, FLOCK utilizes the network bandwidth efficiently and reduces the CPU overheads incurred by network communication. Datacenter networking Remote Direct Memory Access (RDMA) Scalability
2	Support for Accessible Bitsliced Software Conroy, Thomas Joseph 05 March 2021 The expectations placed on embedded systems have grown incredibly in recent years. Not only are there more applications for them than ever, but the applications are increasingly complex, and their security is essential. To meet such demanding goals, designers and programmers are always looking for more efficient methods of computation. One technique that has gained attention over the past couple of decades is bitsliced software. In addition to offering high efficiency in certain situations, including block cipher computation, it has been used in designs that resist hardware attacks. However, this technique requires both program and data to be in a specific format. This requirement makes writing bitsliced software by hand laborious and adds computational overhead to transpose the data before and after computation. This work describes a code generation tool that produces bitsliced software from a higher-level description in Verilog. By supporting the synthesis of sequential circuits, this tool extends bitsliced software to parallel synchronous software. This tool is then used to implement a method for accelerating software neural network processing with reduced-precision computation on highly constrained devices. To address the data transposition overhead and to support a hardware attack-resistant architecture, a custom DMA controller is introduced that efficiently transposes the data as it is transferred, along with dedicated hardware for masking and redundancy generation. In combination, these tools make bitsliced software and its benefits more accessible to system designers and programmers. / Master of Science / Small computers embedded in devices, such as cars, smart devices, and other electronics, face many challenges. Often, they are pushed to their limits by designers and programmers to reach acceptable levels of performance.
The increasing complexity of the applications they run compounds with the need for these applications to be secure. The programmers are always looking for better, more efficient methods of doing computations. Over the past two decades bitsliced software has gained attention as a technique that can, in certain situations, be more efficient than standard software. It also has properties that make it useful for designs implementing secure software. However, writing bitsliced software by hand is a laborious task, and the data input to the software needs to be in a specific format. To make writing the software easier, a tool that generates it from the well-known Verilog hardware description language is discussed in this work. This tool is then used to implement a method to accelerate artificial intelligence calculations on highly constrained computers. A custom hardware module is also introduced to speed up the formatting of data for bitsliced processing. In combination, these tools make bitsliced software and its benefits more accessible. Bitsliced Software Code Generation Direct Memory Access Neural Network Acceleration
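As a rough illustration of the "specific format" mentioned in the abstract above, the following Python sketch transposes a list of small integers into bit-planes, adds them with a ripple-carry adder built only from bitwise operations, and transposes the result back. It is not taken from the thesis; the function names and the choice of an adder as the example circuit are assumptions made for illustration.

```python
# Illustrative sketch only: names and the adder example are hypothetical,
# not the tool or format described in the thesis.

def to_bitslices(values, width):
    """Transpose values into bit-planes: plane i holds bit i of every value,
    and bit position `lane` of each plane belongs to values[lane]."""
    planes = [0] * width
    for lane, v in enumerate(values):
        for i in range(width):
            planes[i] |= ((v >> i) & 1) << lane
    return planes

def from_bitslices(planes, lanes):
    """Inverse transposition: rebuild one integer per lane."""
    values = [0] * lanes
    for i, plane in enumerate(planes):
        for lane in range(lanes):
            values[lane] |= ((plane >> lane) & 1) << i
    return values

def bitsliced_add(a_planes, b_planes):
    """Ripple-carry addition modulo 2**width, applied to all lanes at once.
    Each XOR/AND/OR acts on every lane in parallel."""
    out, carry = [], 0
    for x, y in zip(a_planes, b_planes):
        out.append(x ^ y ^ carry)
        carry = (x & y) | (carry & (x ^ y))
    return out

xs = [3, 7, 12, 15]
ys = [1, 9, 5, 15]
width = 4
sums = from_bitslices(
    bitsliced_add(to_bitslices(xs, width), to_bitslices(ys, width)),
    len(xs),
)
print(sums)  # per-lane sums modulo 16
```

The two transposition helpers correspond to the overhead the abstract attributes to formatting data before and after computation; the adder itself uses only bitwise operations, so a wider machine word would process more lanes at no extra instruction cost.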
3	Design and Implementation of a DMA Controller for Digital Signal Processor Jiang, Guoyou January 2010 The thesis work was conducted in the division of computer engineering at the department of electrical engineering at Linköping University. During the thesis work, a configurable Direct Memory Access (DMA) controller was designed and implemented. The DMA controller runs at 200 MHz in 65 nm digital CMOS technology. The estimated gate count is 26595. The DMA controller has two address generators and can provide two clock sources. It can thus handle data reads and writes simultaneously. The DMA controller has 16 built-in channels, and the data width can be 16, 32, or 64 bits. The DMA controller supports 2D data access by configuring its intelligent linking table. The DMA is designed for advanced DSP applications and it is not dedicated to a cache, which has a fixed priority. DMA direct memory access digital signal processing DSP linking table processor peripherals scalability Computer engineering Datorteknik
4	Design and Implementation of a DMA Controller for Digital Signal Processor Jiang, Guoyou January 2010 The thesis work was conducted in the division of computer engineering at the department of electrical engineering at Linköping University. During the thesis work, a configurable Direct Memory Access (DMA) controller was designed and implemented. The DMA controller runs at 200 MHz in 65 nm digital CMOS technology. The estimated gate count is 26595. The DMA controller has two address generators and can provide two clock sources. It can thus handle data reads and writes simultaneously. The DMA controller has 16 built-in channels, and the data width can be 16, 32, or 64 bits. The DMA controller supports 2D data access by configuring its intelligent linking table. The DMA is designed for advanced DSP applications and it is not dedicated to a cache, which has a fixed priority. DMA direct memory access digital signal processing DSP linking table processor peripherals scalability Computer Engineering Datorteknik
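The abstract does not describe the layout of the linking table, so the following Python sketch only illustrates what 2D access generally means for a DMA engine: walking a rectangular block inside a larger buffer by combining an element step with a row stride. The descriptor fields (`base`, `elem_bytes`, `row_elems`, `rows`, `row_stride`) are hypothetical and are not the thesis's register or table format.

```python
# Illustrative sketch only: the descriptor fields are assumptions,
# not the linking-table format of the thesis.
from dataclasses import dataclass

@dataclass
class Descriptor2D:
    base: int        # byte address of the first element
    elem_bytes: int  # 2, 4, or 8 for 16-, 32-, or 64-bit data
    row_elems: int   # elements transferred per row
    rows: int        # number of rows in the block
    row_stride: int  # bytes between the starts of consecutive rows

def addresses(d):
    """Yield the byte address of every element in row-major order."""
    for r in range(d.rows):
        row_start = d.base + r * d.row_stride
        for c in range(d.row_elems):
            yield row_start + c * d.elem_bytes

# A 2-row by 3-element block of 32-bit words inside a buffer
# whose rows are 0x40 bytes apart.
block = Descriptor2D(base=0x1000, elem_bytes=4, row_elems=3, rows=2, row_stride=0x40)
addrs = list(addresses(block))
print([hex(a) for a in addrs])
```

With two address generators, as the abstract mentions, one such sequence could drive the read side while a second drives the write side; that pairing is an interpretation, not something the abstract spells out.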
5	Podpora DMA pro rodinu mikrokontrolerů HCS08 / DMA Support for HCS08 Microcontrollers Family Novosád, Adrián January 2013 Embedded systems are dedicated to performing specific tasks, so design engineers can optimize them to reduce the size and cost of the product and increase its reliability and performance. However, one result of these optimizations is that some architectures may lack commonly used technologies such as direct memory access (DMA). This is the case for the HCS08 family of microcontrollers. The main theme of this work is the design of a DMA controller that can be added to the HCS08 family of microcontrollers.
6	Optimisation des transferts de données sur systèmes multiprocesseurs sur puce / Optimizing Data Transfers for Multiprocessor Systems on Chips Saidi, Selma 24 October 2012 Multiprocessor systems on chip, such as the CELL processor or, more recently, Platform 2012, are heterogeneous multi-core architectures made up of a host processor and a computation fabric consisting of several small cores whose role is to act as a programmable accelerator. The parallelizable, computation-intensive parts of an application, initially meant to run on the host, are sent to the multi-core fabric for execution. These applications generally manipulate very large data arrays stored in a remote off-chip memory whose access is 100 times slower than a core's access to its local memory. Accessing this data in off-chip memory therefore becomes a major performance problem. A main characteristic of these platforms is a software-managed local memory, instead of a cache mechanism, so that data movements in the memory hierarchy are explicitly managed by software. In this thesis, the objective is to optimize these data transfers in order to reduce or hide the latency of the off-chip memory. / Multiprocessor systems on chip (MPSoCs) such as the CELL processor or the more recent Platform2012 are heterogeneous multi-core architectures, with a powerful host processor and a computation fabric, consisting of several smaller cores, whose intended role is to act as a general-purpose programmable accelerator. Therefore, computation-intensive (and parallelizable) parts of the application initially intended to be executed by the host processor are offloaded to the multi-cores for execution.
These parts of the application are often data intensive, operating on large arrays of data initially stored in a remote off-chip memory whose access time is about 100 times slower than that of the cores' local memory. Accessing data in the off-chip memory then becomes a main bottleneck for performance. A major characteristic of these platforms is a software-controlled local memory rather than a hidden cache mechanism, so that data movements in the memory hierarchy, typically performed using a DMA (Direct Memory Access) engine, are explicitly managed by the software. In this thesis, we attempt to optimize such data transfers in order to reduce or hide the off-chip memory latency. Application data parallèles DMA Systemes multiprocesseurs sur puce Data parallel applications Direct Memory Access (DMA) Multiprocessor architecture
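One common way to hide transfer latency on platforms with software-managed local memory is double buffering, where the DMA transfer of the next block overlaps with computation on the current one. The abstract does not say which techniques the thesis uses, so the Python sketch below is only a simple timing model under stated assumptions: every block has the same transfer time `t_xfer` and compute time `t_comp`, and exactly one transfer can overlap with one computation. The function names are hypothetical.

```python
# Illustrative timing model only: constant per-block costs and perfect
# overlap are assumptions, not results from the thesis.

def serial_time(n_blocks, t_xfer, t_comp):
    """Fetch a block, then compute on it, with no overlap."""
    return n_blocks * (t_xfer + t_comp)

def double_buffered_time(n_blocks, t_xfer, t_comp):
    """Fetch block i+1 while computing on block i.
    Only the first transfer and the last computation are fully exposed;
    each step in between costs the slower of the two activities."""
    if n_blocks == 0:
        return 0
    return t_xfer + (n_blocks - 1) * max(t_xfer, t_comp) + t_comp

n, t_xfer, t_comp = 4, 3, 5
t_serial = serial_time(n, t_xfer, t_comp)
t_overlap = double_buffered_time(n, t_xfer, t_comp)
print(t_serial, t_overlap)
```

In this model the benefit is largest when transfer and compute times are balanced, and the block size becomes a tuning parameter because it changes both times at once.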
7	Efficient LU Factorization for Texas Instruments Keystone Architecture Digital Signal Processors / Effektiv LU-faktorisering för Texas Instruments digitala signalprocessorer med Keystone-arkitektur Netzer, Gilbert January 2015 The energy consumption of large-scale high-performance computer (HPC) systems has become one of the foremost concerns of both data-center operators and computer manufacturers. This has renewed interest in alternative computer architectures that could offer substantially better energy efficiency. Yet the well-optimized implementations of typical HPC benchmarks needed to evaluate the potential of these architectures are often not available for architectures that are novel to the HPC industry. The LU factorization benchmark implementation presented in this work aims to provide such a high-quality tool for the HPC industry-standard high-performance LINPACK benchmark (HPL) on the eight-core Texas Instruments TMS320C6678 digital signal processor (DSP). The presented implementation performs the LU factorization at up to 30.9 GF/s at a 1.25 GHz core clock frequency by using all eight DSP cores of the System-on-Chip (SoC). This is 77% of the attainable peak double-precision floating-point performance of the DSP, a level of efficiency that is comparable to the efficiency expected on traditional x86-based processor architectures. A detailed performance analysis shows that this is largely due to the optimized implementation of the embedded generalized matrix-matrix multiplication (GEMM). For this operation, the on-chip direct memory access (DMA) engines were used to transfer the necessary data from the external DDR3 memory to the core-private and shared scratchpad memory. This allowed the data transfers to overlap with computations on the DSP cores. The computations were in turn optimized using software pipelining techniques and were partly implemented in assembly language.
With these optimizations, the performance of the matrix multiplication reached up to 95% of attainable peak performance. A detailed description of these two key optimization techniques and their application to the LU factorization is included. A specially instrumented Advantech TMDXEVM6678L evaluation module, described in detail in related work, made it possible to measure an SoC energy efficiency of up to 2.92 GF/J while executing the presented benchmark. Results from the verification of the benchmark execution using standard HPL correctness checks and an uncertainty analysis of the experimentally gathered data are also presented. / The energy consumption of large-scale high-performance computing (HPC) systems has become one of the foremost problems for both the owners of these systems and computer manufacturers. This has led to renewed interest in alternative computer architectures that can be considerably more efficient in terms of energy consumption. Detailed analyses of the performance and energy consumption of these architectures, which are new to the HPC industry, require well-optimized implementations of standard HPC benchmark problems. The aim of this thesis is to provide such a high-quality tool in the form of an implementation of an LU factorization benchmark program for the eight-core TMS320C6678 digital signal processor (DSP) from Texas Instruments. The benchmark problem is the same as that of the "high-performance LINPACK" (HPL) benchmark, which is well known in the HPC industry. The implementation presented here reached a performance of up to 30.9 GF/s at a clock frequency of 1.25 GHz by using all eight cores of the DSP simultaneously. This corresponds to 77% of the theoretically attainable performance, which is comparable to expectations for the efficiency of more traditional x86-based systems. A detailed performance analysis shows that this is largely achieved through the highly optimized implementation of the constituent matrix-matrix multiplication.
Using specialized direct memory access (DMA) hardware units to copy data between the external DDR3 memory and the internal core-private and shared working memory made it possible to overlap these operations with computations. Optimized software implementations of these computations, partly written in machine language, made it possible to perform the matrix multiplication at up to 95% of the theoretically attainable performance. The report gives a detailed description of these two key techniques. With the help of an Advantech TMDXEVM6678L evaluation module adapted for the purpose, the energy efficiency while executing the implemented benchmark was measured at up to 2.92 GF/J. Results from the verification of the benchmark implementation and an estimate of the measurement uncertainty of the experimental measurements are also presented. LU factorization digital signal processors Texas Instruments Keystone architecture high-performance LINPACK benchmark performance energy efficiency software-pipelined loops direct memory access optimization Computer Sciences Datavetenskap (datalogi)
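The figures quoted in this abstract can be cross-checked with a little arithmetic. The Python sketch below derives the peak performance implied by "30.9 GF/s is 77% of peak" and the corresponding per-core, per-cycle rate. The power estimate at the end additionally assumes that the 30.9 GF/s and 2.92 GF/J figures come from the same run, which the abstract does not state, and the 77% value is rounded, so all results are approximate.

```python
# Back-of-the-envelope check using only numbers quoted in the abstract.
achieved_gflops = 30.9   # GF/s, LU factorization on all eight cores
efficiency = 0.77        # fraction of attainable double-precision peak (rounded)
clock_ghz = 1.25         # core clock frequency
cores = 8
energy_eff = 2.92        # GF/J, best measured energy efficiency

implied_peak = achieved_gflops / efficiency     # about 40.1 GF/s
per_core_peak = implied_peak / cores            # about 5.0 GF/s per core
flops_per_cycle = per_core_peak / clock_ghz     # about 4.0 flops per cycle per core

# GF/J equals (GF/s) / W, so power = (GF/s) / (GF/J).
# ASSUMPTION: both "up to" figures were reached in the same run.
implied_power_w = achieved_gflops / energy_eff  # about 10.6 W

print(round(implied_peak, 1), round(flops_per_cycle, 2), round(implied_power_w, 1))
```

The result of roughly four double-precision operations per cycle per core is an inference from the quoted numbers, not a figure stated in the abstract.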
8	Software-defined Buffer Management and Robust Congestion Control for Modern Datacenter Networks Danushka N Menikkumbura (12208121) 20 April 2022 Modern datacenter network applications continue to demand ultra-low latencies and very high throughputs. At the same time, network infrastructure keeps achieving higher speeds and larger bandwidths. We still need better network management solutions to keep these demand and supply fronts in step. Key metrics that define network performance, such as flow completion time (the lower the better), throughput (the higher the better), and end-to-end latency (the lower the better), are mainly governed by how effectively network applications get their fair share of network resources. We observe that buffer utilization on network switches gives a very accurate indication of network performance. Therefore, network buffer management is important in modern datacenter networks, and other network management solutions can be efficiently built around buffer utilization. This dissertation presents three solutions based on buffer use on network switches. This dissertation consists of three main sections. The first section is on a specification language for buffer management in modern programmable switches. The second section is on a congestion control solution for Remote Direct Memory Access (RDMA) networks. The third section is on a solution to head-of-the-line blocking in modern datacenter networks. Computer System Architecture Networking and Communications Switch Buffering Architectures Network Programmability Datacenter Networks Congestion Control Remote Direct Memory Access (RDMA) Head-of-line Blocking Routing Deadlocks
9	DMA řadič a ovladač síťové karty pro platformu COMBO2 / DMA Controller and Network Interface Card Driver for COMBO2 Platform Kaštovský, Petr January 2009 A family of COMBO cards for accelerating network monitoring is being developed in the Liberouter project, a research activity of CESNET. These cards are equipped with a Xilinx field-programmable gate array (FPGA). To enable the use of classic tools for network monitoring and management, not only application-specific tools, it is necessary to implement on the platform a network interface card that handles packet reception and transmission through the standard Linux kernel interface. This thesis describes the design and implementation of the network interface card's key components, namely the DMA controller and the Linux device driver.
10	A Study of Disk Performance Optimization. Gray, Richard Scott 01 May 2000 Response time is one of the most important performance measures associated with a typical multi-user system. Response time, in turn, is bounded by the performance of the input/output (I/O) subsystem. Other than the end user and some external peripherals, the slowest component of the I/O subsystem is the disk drive. One standard strategy for improving I/O subsystem performance uses high-performance hardware like Small Computer Systems Interface (SCSI) drives to improve overall response time. SCSI hardware, unfortunately, is often too expensive to use in low-end multi-user systems. Low-end multi-user systems commonly use inexpensive Integrated Drive Electronics (IDE) disk drives to keep overall costs low. On such IDE-based multi-user systems, reducing the Central Processing Unit (CPU) overhead associated with disk I/O is critical to system responsiveness. This thesis explores the impact of PCI bus mastering Direct Memory Access (DMA) on the performance of systems with IDE drives. DMA is a data transfer protocol that allows data to be sent directly from an attached device to a computer system’s main memory, thereby reducing CPU overhead. PCI bus mastering allows modern IDE disk controllers to manipulate main memory without utilizing motherboard-resident DMA controllers. Using a series of experiments, this thesis examines the impact of PCI bus mastering DMA on IDE performance for synchronous I/O, relative to Programmed Input/Output (PIO) and SCSI performance. Experimental results show that PCI bus mastering DMA, when used properly, improves the responsiveness and throughput of IDE drives by as much as a factor of seven. The magnitude of this improvement shows the importance of operating system support for DMA in low-end multi-user systems.
Additionally, experimental results demonstrate that performance gains associated with SCSI are dependent on system usage and operating system support for advanced SCSI capabilities. Therefore, under many circumstances, high-performance SCSI drives are not cost effective when compared with IDE bus mastering DMA capable drives. direct memory access adaptive disk rearrangement disk-head scheduling disk drive performance I/O subsystem performance integrated drive electronics small computer systems interface Computer Sciences Physical Sciences and Mathematics
