Global ETD Search

1	Scaling RDMA RPCs with FLOCK Monga, Sumit Kumar 30 November 2021 RDMA-capable networks are gaining traction in datacenter deployments due to their high throughput, low latency, CPU efficiency, and advanced features such as remote memory operations. However, efficiently utilizing RDMA capability in the common setting of a high fan-in, fan-out asymmetric network topology is challenging. For instance, using RDMA programming features comes at the cost of connection scalability, which degrades as cluster size increases. To address this, several works forgo some RDMA features and focus only on conventional RPC APIs. In this work, we strive to exploit the full capability of RDMA while scaling the number of connections regardless of the cluster size. We present FLOCK, a communication framework for RDMA networks that uses hardware-provided reliable connections. Using a partially shared model, FLOCK departs from the conventional RDMA design by enabling connection sharing among threads, which provides significant performance improvements, contrary to the widely held belief that connection sharing deteriorates performance. At its core, FLOCK uses a connection handle abstraction for connection multiplexing; a new coalescing-based synchronization approach for efficient network utilization; and a load-control mechanism for connections with symbiotic send-recv scheduling, which reduces the synchronization overheads associated with connection sharing while ensuring fair utilization of network connections. / M.S. / The Internet is one of the great discoveries of our time. It provides access to enormous knowledge sources and makes it easier to communicate seamlessly across the globe, among countless other advantages. Over the years, the latency of services such as web searches and file downloads has gone down sharply. A download that used to take minutes during the 2000s can now complete within seconds. 
Network speeds have been improving, facilitating a faster and smoother user experience. Another factor contributing to the improved internet experience is that service providers such as Google and Amazon can process user requests in a fraction of the time it used to take. Web services such as search and e-commerce are implemented using a multi-layer architecture, with each layer containing hundreds to thousands of servers. Each server runs one or more components of the web service application. In this architecture, user requests are received in the upper layer and processed by the lower layers. Servers in different layers communicate over an ultrafast network technology such as Remote Direct Memory Access (RDMA). The implication of the multi-layer architecture is that a server has to communicate with multiple other servers in the upper and lower layers. Unfortunately, due to its inherent limitations, RDMA does not perform well when network communication takes place with a large number of servers. In this thesis, a new communication framework for RDMA networks, FLOCK, is proposed to overcome the scalability limitations of RDMA hardware. FLOCK maintains scalability when communicating with many servers, and it consistently provides better performance than the state of the art. Additionally, FLOCK utilizes the network bandwidth efficiently and reduces the CPU overheads incurred due to network communication. Datacenter networking Remote Direct Memory Access (RDMA) Scalability
2	Support for Accessible Bitsliced Software Conroy, Thomas Joseph 05 March 2021 The expectations placed on embedded systems have grown enormously in recent years. Not only are there more applications for them than ever, but the applications are also increasingly complex, and their security is essential. To meet such demanding goals, designers and programmers are always looking for more efficient methods of computation. One technique that has gained attention over the past couple of decades is bitsliced software. In addition to high efficiency in certain situations, including block cipher computation, it has been used in designs to resist hardware attacks. However, this technique requires both program and data to be in a specific format. This requirement makes writing bitsliced software by hand laborious and adds computational overhead to transpose the data before and after computation. This work describes a code generation tool that produces bitsliced software from a higher-level description in Verilog. By supporting the synthesis of sequential circuits, this tool extends bitsliced software to parallel synchronous software. This tool is then used to implement a method for accelerating software neural network processing with reduced-precision computation on highly constrained devices. To address the data transposition overhead and to support a hardware attack-resistant architecture, a custom DMA controller is introduced that efficiently transposes the data as it transfers, along with dedicated hardware for masking and redundancy generation. In combination, these tools make bitsliced software and its benefits more accessible to system designers and programmers. / Master of Science / Small computers embedded in devices, such as cars, smart devices, and other electronics, face many challenges. Often, they are pushed to their limits by designers and programmers to reach acceptable levels of performance. 
The increasing complexity of the applications they run compounds with the need for these applications to be secure. The programmers are always looking for better, more efficient methods of doing computations. Over the past two decades bitsliced software has gained attention as a technique that can, in certain situations, be more efficient than standard software. It also has properties that make it useful for designs implementing secure software. However, writing bitsliced software by hand is a laborious task, and the data input to the software needs to be in a specific format. To make writing the software easier, a tool that generates it from the well-known Verilog hardware description language is discussed in this work. This tool is then used to implement a method to accelerate artificial intelligence calculations on highly constrained computers. A custom hardware module is also introduced to speed up the formatting of data for bitsliced processing. In combination, these tools make bitsliced software and its benefits more accessible. Bitsliced Software Code Generation Direct Memory Access Neural Network Acceleration
3	Direct memory access interface of MC6800 with the TDC1010J LSI multiplier and the application as a digital filter Hsueh, Hsiao-Chen January 1983 No description available. TDC1010J LSI Multiplier Digital Filter Memory Access Interface
4	Minimising Memory Access Conflicts for FFT on a DSP Jonsson, Sofia January 2019 The FFT support in an Ericsson proprietary DSP is to be improved in order to achieve high performance without disrupting the current DSP architecture too much. The FFTs and inverse FFTs in question should support FFT sizes ranging from 12 to 2048, where the size is a multiple of the prime factors 2, 3 and 5. Memory access conflicts in particular could cause low performance in terms of speed compared with the existing hardware accelerator. The problem addressed in this thesis is how to minimise these memory access conflicts. The studied FFT is a mixed-radix DIT FFT where the butterfly results are written back to addresses in a certain order. Furthermore, different buffer structures and sizes are studied, as well as different orders in which to perform the operations within each FFT butterfly stage, and different orders in which to shuffle the samples in the initial stage. The study shows that for both studied buffer structures there are buffer sizes that give good performance for the majority of the FFT sizes without substantially changing the current architecture. By using certain orders for performing the operations and shuffling within the FFT stages for the remaining FFT sizes, it is possible to reach good performance in these cases as well. FFT DSP memory access conflict Electrical Engineering and Electronics
5	Design and Implementation of a DMA Controller for Digital Signal Processor Jiang, Guoyou January 2010 The thesis work was conducted in the division of computer engineering at the department of electrical engineering at Linköping University. During the thesis work, a configurable Direct Memory Access (DMA) controller was designed and implemented. The DMA controller runs at 200 MHz in 65 nm digital CMOS technology. The estimated gate count is 26595. The DMA controller has two address generators and can provide two clock sources. It can thus handle data reads and writes simultaneously. There are 16 channels built into the DMA controller, and the data width can be 16, 32 or 64 bits. The DMA controller supports 2D data access by configuring its intelligent linking table. The DMA is designed for advanced DSP applications and it is not dedicated for cache which has a fixed priority. DMA direct memory access digital signal processing DSP linking table processor peripherals scalability Computer engineering
6	Exposure of Patterns in Parallel Memory Acces Lundgren, Björn, Ödlund, Anders January 2007 The concept and advantages of a Parallel Memory Architecture (PMA) in computer systems have long been known, but only recently has it become interesting to implement modular parallel memories even in handheld embedded systems. This thesis presents a method for analysing source code to expose possible parallel memory accesses. Memory access patterns may be found and categorized, and the corresponding code marked for optimization. As a result, a PMA compatible with the found pattern(s) and code optimization may be specified. ASIP Parallel Memory Memory Access Pattern Static Code Analysis Computer engineering
7	A Study on the Generation of Local Memory Access Sequences and Communication Sets for Data-Parallel Programs Shiu, Liang-Cheng 13 February 2003 Distributed-memory multiprocessors offer the very high levels of performance required to solve scientific applications. A traditional programming language cannot be expected to yield good performance when used to program such machines. Data-parallel languages provide programmers with a global memory and relieve them of the burden of inserting time-consuming, error-prone inter-processor communication. The compilers of these languages perform this task. Data-parallel languages also enable programmers to specify alignment and distribution directives, which define the type of data parallelism and the data mapping to the underlying parallel architecture. Parallelizing compilers distribute data and generate code according to the owner-computes rule when compiling an array statement. The array elements that a processor owns are only a fraction of all the array elements. Not all of the array elements in the processor are active elements, so determining the local memory access sequence is important. However, generating local memory access sequences becomes rather complicated when the array references involve complex subscripts. This study considers two types of complex subscripts: coupled subscripts and multiple induction variables. A processor may refer to right-hand side (rhs) array elements owned by other processors, so the movement of data is inevitable. The overhead of accessing non-local data through inter-processor communication may be around 10 to 100 times the cost of accessing local data, so efficiently generating communication sets is important. 
This thesis introduces the concept of block compression/decompression, using smaller iteration tables, course distance and local block distance to solve the problems of local memory access sequences, coupled subscripts, MIV subscripts and communication set generation. Related work on these problems is reviewed, and experimental results demonstrate the benefit of the proposed methods. coupled subscript local memory access sequence communication set multiple induction variable
8	Programmation efficace et sécurisé d'applications à mémoire partagée / Towards efficient and secure shared memory applications Sifakis, Emmanuel 06 May 2013 The invasion of multi-core and multi-processor platforms into all aspects of computing makes shared memory parallel programming mainstream. Yet the fundamental problems of exploiting parallelism efficiently and correctly have not been fully addressed. Moreover, the execution model of these platforms (notably the relaxed memory models they implement) introduces new challenges for static and dynamic program analysis. In this work we address 1) the optimization of pessimistic implementations of critical sections and 2) dynamic information flow analysis for parallel executions of multi-threaded programs. Critical sections are excerpts of code that must appear to execute atomically. Their pessimistic implementation relies on synchronization mechanisms, such as mutexes, and consists of acquiring and releasing them at the beginning and end of the critical section, respectively. We present a general algorithm for the acquisition/release of synchronization mechanisms and define on top of it several policies aiming to reduce contention by minimizing the possession time of synchronization mechanisms. We demonstrate the correctness of these policies (i.e. they preserve atomicity and guarantee deadlock freedom) and evaluate them experimentally. The second issue tackled is dynamic information flow analysis of parallel executions. Precisely tracking the information flow of a parallel execution is infeasible due to non-deterministic accesses to shared memory. Most existing solutions that address this problem enforce a serial execution of the target application. This yields an explicit serialization of memory accesses but both incurs an execution-time overhead and eliminates the effects of relaxed memory models. In contrast, the technique we propose predicts, from a single parallel execution, the plausible serializations with respect to the memory model ("runtime prediction"). We applied this approach in the context of taint analysis, a dynamic information flow analysis widely used in vulnerability detection. To improve the precision of taint analysis we further take into account the semantics of synchronization mechanisms such as mutexes, which restricts the predicted serializations accordingly. The proposed solutions have been implemented in proof-of-concept tools, which allowed their evaluation on some hand-crafted examples. Concurrency Atomicity Vulnerability Weak atomicity Memory access Static analysis Runtime analysis

Search results