21

Securing Multiprocessor Systems-on-Chip

Biswas, Arnab Kumar 16 August 2016 (has links) (PDF)
MHRD PhD scholarship / With Multiprocessor Systems-on-Chips (MPSoCs) pervading our lives, security issues are emerging as a serious problem, and attacks against these systems are becoming more critical and sophisticated. We have designed and implemented different hardware-based solutions to ensure the security of an MPSoC. Security-assisting modules can be implemented at different abstraction levels of an MPSoC design; we propose solutions at both the circuit level and the system level of abstraction. At the VLSI circuit level, we consider the problem of noise voltage on input signals coming from the outside world. This noise disturbs normal circuit operation inside a chip, causing false logic reception. If the disturbance is caused intentionally, the security of the chip may be compromised, resulting in a glitch/transient attack. We propose an input receiver with a hysteresis characteristic that can operate at voltage levels between 0.9 V and 5 V and can protect the MPSoC from glitch/transient attacks. At the system level, we propose solutions targeting the Network-on-Chip (NoC) as the on-chip communication medium. We survey possible attack scenarios on present-day MPSoCs and investigate a new one: a router attack targeting a NoC-enabled MPSoC. We propose different monitoring-based countermeasures against routing-table-based router attacks in an MPSoC with multiple Trusted Execution Environments (TEEs). Software attacks, the most common type of attack, mainly exploit vulnerabilities such as buffer overflows, which are possible when proper memory access control is absent from the system. We propose four hardware-based mechanisms to implement the Role-Based Access Control (RBAC) model in a NoC-based MPSoC.
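The RBAC mechanisms above are hardware modules, but the access-control decision they enforce can be sketched in software. The C fragment below is a minimal illustration of an RBAC permission check as a NoC network interface might evaluate it; the table layout, operation bits, and names are hypothetical and not taken from the thesis.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Minimal model of an RBAC permission check as a hardware table in a
 * NoC network interface might evaluate it. Sizes and names are
 * illustrative only. */

#define NUM_ROLES    8
#define NUM_REGIONS 16

enum op { OP_READ = 1u << 0, OP_WRITE = 1u << 1, OP_EXEC = 1u << 2 };

/* permission[role][region]: bitmask of operations the role may perform */
static uint8_t permission[NUM_ROLES][NUM_REGIONS];

/* In hardware this is a single table lookup on the memory-access path. */
static bool rbac_allows(unsigned role, unsigned region, uint8_t requested)
{
    if (role >= NUM_ROLES || region >= NUM_REGIONS)
        return false;                 /* deny unknown subjects/objects */
    return (permission[role][region] & requested) == requested;
}

int main(void)
{
    permission[2][5] = OP_READ;       /* role 2 may only read region 5 */
    printf("read:  %d\n", rbac_allows(2, 5, OP_READ));   /* 1: allowed */
    printf("write: %d\n", rbac_allows(2, 5, OP_WRITE));  /* 0: denied  */
    return 0;
}
```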
22

Efficient LU Factorization for Texas Instruments Keystone Architecture Digital Signal Processors / Effektiv LU-faktorisering för Texas Instruments digitala signalprocessorer med Keystone-arkitektur

Netzer, Gilbert January 2015 (has links)
The energy consumption of large-scale high-performance computing (HPC) systems has become one of the foremost concerns of both data-center operators and computer manufacturers. This has renewed interest in alternative computer architectures that could offer substantially better energy efficiency. Yet the well-optimized implementations of typical HPC benchmarks needed to evaluate the potential of these architectures are often unavailable for architectures that are new to the HPC industry. The LU factorization benchmark implementation presented in this work aims to provide such a high-quality tool for the HPC industry-standard high-performance LINPACK benchmark (HPL) on the eight-core Texas Instruments TMS320C6678 digital signal processor (DSP). The implementation performs the LU factorization at up to 30.9 GF/s at a 1.25 GHz core clock frequency using all eight DSP cores of the System-on-Chip (SoC). This is 77% of the attainable peak double-precision floating-point performance of the DSP, a level of efficiency comparable to that expected of traditional x86-based processor architectures. A detailed performance analysis shows that this is largely due to the optimized implementation of the embedded generalized matrix-matrix multiplication (GEMM). For this operation, the on-chip direct memory access (DMA) engines were used to transfer the necessary data from the external DDR3 memory to the core-private and shared scratchpad memories, which allowed the data transfers to overlap with computations on the DSP cores. The computations were in turn optimized using software-pipelining techniques and were partly implemented in assembly language. With these optimizations, the matrix multiplication reached up to 95% of attainable peak performance. A detailed description of these two key optimization techniques and their application to the LU factorization is included. Using a specially instrumented Advantech TMDXEVM6678L evaluation module, described in detail in related work, the SoC's energy efficiency was measured at up to 2.92 GF/J while executing the presented benchmark. Results from the verification of the benchmark execution using standard HPL correctness checks, together with an uncertainty analysis of the experimentally gathered data, are also presented. / The energy consumption of large-scale high-performance computing (HPC) systems has become one of the foremost concerns of both the owners of such systems and computer manufacturers. This has led to renewed interest in alternative computer architectures that may be considerably more efficient in terms of energy consumption. Detailed analyses of the performance and energy consumption of these architectures, which are new to the HPC industry, require well-optimized implementations of standard HPC benchmark problems. The purpose of this thesis is to provide such a high-quality tool in the form of an LU factorization benchmark implementation for the eight-core digital signal processor (DSP) TMS320C6678 from Texas Instruments. The benchmark problem is the same as in the benchmark well known in the HPC industry as "high-performance LINPACK" (HPL). The implementation presented here reached a performance of up to 30.9 GF/s at a 1.25 GHz clock frequency by using all eight cores of the DSP simultaneously. This corresponds to 77% of the theoretically attainable performance, which is comparable to the efficiency expected of more traditional x86-based systems.
A detailed performance analysis shows that this is achieved largely through the highly optimized implementation of the embedded matrix-matrix multiplication. The use of dedicated direct memory access (DMA) hardware units to copy data between the external DDR3 memory and the internal core-private and shared working memories made it possible to overlap these operations with computations. Optimized software implementations of these computations, partly written in assembly language, made it possible to perform the matrix multiplication at up to 95% of the theoretically attainable performance. The report gives a detailed description of these two key techniques. Using an Advantech TMDXEVM6678L evaluation module adapted for the purpose, the energy efficiency while executing the implemented benchmark was determined to be up to 2.92 GF/J. Results from the verification of the benchmark implementation and an estimate of the measurement uncertainty of the experiments are also presented.
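As a reference for the structure being optimized, the following C sketch shows a right-looking blocked LU factorization without pivoting. It is purely illustrative: the actual HPL benchmark uses partial pivoting, and the thesis's DSP implementation adds DMA double-buffering and software-pipelined assembly kernels, none of which appear here. The point the sketch makes is that step 3, the trailing-matrix GEMM update, dominates the work, which is why optimizing GEMM determines overall efficiency.

```c
#include <stddef.h>
#include <stdio.h>

/* Minimal right-looking blocked LU factorization without pivoting.
 * A is n x n, row-major with leading dimension n; nb is the block size.
 * An illustrative reference, not the DSP implementation. */
static void lu_blocked(double *A, size_t n, size_t nb)
{
    for (size_t k = 0; k < n; k += nb) {
        size_t kb = (k + nb <= n) ? nb : n - k;

        /* 1. Unblocked LU of the diagonal block and the panel below it. */
        for (size_t j = k; j < k + kb; j++) {
            for (size_t i = j + 1; i < n; i++) {
                A[i*n + j] /= A[j*n + j];             /* L multiplier */
                for (size_t c = j + 1; c < k + kb; c++)
                    A[i*n + c] -= A[i*n + j] * A[j*n + c];
            }
        }

        /* 2. TRSM: forward-eliminate the block row of U (U12). */
        for (size_t j = k; j < k + kb; j++)
            for (size_t i = j + 1; i < k + kb; i++)
                for (size_t c = k + kb; c < n; c++)
                    A[i*n + c] -= A[i*n + j] * A[j*n + c];

        /* 3. GEMM: trailing update A22 -= L21 * U12 (the hot loop). */
        for (size_t i = k + kb; i < n; i++)
            for (size_t j = k; j < k + kb; j++)
                for (size_t c = k + kb; c < n; c++)
                    A[i*n + c] -= A[i*n + j] * A[j*n + c];
    }
}

int main(void)
{
    double A[4*4] = { 4, 3, 2, 1,
                      3, 4, 3, 2,
                      2, 3, 4, 3,
                      1, 2, 3, 4 };
    lu_blocked(A, 4, 2);
    printf("U[0][0] = %.3f, L[1][0] = %.3f\n", A[0], A[4]);
    return 0;
}
```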
23

Big Data causing Big (TLB) Problems: Taming Random Memory Accesses on the GPU

Karnagel, Tomas, Ben-Nun, Tal, Werner, Matthias, Habich, Dirk, Lehner, Wolfgang 13 June 2022 (has links)
GPUs are increasingly adopted for large-scale database processing, where data accesses represent the major part of the computation. If the data accesses are irregular, like hash-table accesses or random sampling, GPU performance can suffer. Especially when scaling such accesses beyond 2 GB of data, a performance decrease of an order of magnitude is encountered. This paper analyzes the source of the slowdown through extensive micro-benchmarking, attributing the root cause to the Translation Lookaside Buffer (TLB). Using the micro-benchmarks, the TLB hierarchy and structure are fully analyzed on two different GPU architectures, identifying never-before-published TLB sizes that can be used for efficient large-scale application tuning. Based on the gained knowledge, we propose a TLB-conscious approach to mitigate the slowdown for algorithms with irregular memory access. The proposed approach is applied to two fundamental database operations, random sampling and hash-based grouping, showing that the slowdown can be dramatically reduced, resulting in a performance increase of up to 13×.
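The TLB effect the paper measures can be reproduced in miniature with a pointer-chasing micro-benchmark. The CPU-side C sketch below is only an analogue of the paper's GPU micro-benchmarks: it chases a random cyclic permutation through growing working sets, so per-access latency jumps once the page working set exceeds TLB reach. All sizes and step counts are illustrative.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Chase pointers through a random cyclic permutation so that neither
 * prefetching nor contiguous page access can hide address-translation
 * cost. Latency per access rises sharply once the working set outgrows
 * the TLB reach. */

static uint64_t rng_state = 88172645463325252ull;

static uint64_t xorshift64(void)
{
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}

int main(void)
{
    for (size_t mb = 64; mb <= 4096; mb *= 2) {
        size_t n = mb * 1024ull * 1024ull / sizeof(size_t);
        size_t *buf = malloc(n * sizeof(size_t));
        if (!buf) { perror("malloc"); break; }

        /* Sattolo's algorithm: a random single-cycle permutation,
         * so the chase visits every slot. */
        for (size_t i = 0; i < n; i++) buf[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)(xorshift64() % i);
            size_t t = buf[i]; buf[i] = buf[j]; buf[j] = t;
        }

        size_t idx = 0;
        const size_t steps = 10u * 1000 * 1000;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t s = 0; s < steps; s++)
            idx = buf[idx];                      /* dependent loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        printf("%5zu MB: %6.1f ns/access (sink=%zu)\n",
               mb, ns / (double)steps, idx);
        free(buf);
    }
    return 0;
}
```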
24

Software-defined Buffer Management and Robust Congestion Control for Modern Datacenter Networks

Danushka N Menikkumbura (12208121) 20 April 2022 (has links)
Modern datacenter network applications continue to demand ultra-low latencies and very high throughputs. At the same time, network infrastructure keeps achieving higher speeds and larger bandwidths. We still need better network-management solutions to keep these two fronts, demand and supply, moving hand in hand. Key metrics that define network performance, such as flow completion time (the lower the better), throughput (the higher the better), and end-to-end latency (the lower the better), are mainly governed by how effectively network applications get their fair share of network resources. We observe that buffer utilization on network switches gives a very accurate indication of network performance. Therefore, network buffer management is important in modern datacenter networks, and other network-management solutions can be efficiently built around buffer utilization. This dissertation presents three solutions based on buffer use on network switches. It consists of three main sections. The first section is on a specification language for buffer management in modern programmable switches. The second section is on a congestion-control solution for Remote Direct Memory Access (RDMA) networks. The third section is on a solution to head-of-line blocking in modern datacenter networks.
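One concrete, well-known example of a mechanism built around switch buffer utilization is threshold-based ECN marking, as used in DCTCP-style datacenters. The C model below sketches only that generic decision; it is not the dissertation's specification language or any of its actual solutions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Threshold-based ECN marking: a packet is marked when the egress
 * queue depth exceeds a threshold K, signaling congestion to the
 * sender before the buffer overflows. Names are illustrative. */
struct egress_queue {
    uint32_t depth_bytes;   /* current occupancy                */
    uint32_t mark_thresh;   /* ECN marking threshold K, in bytes */
};

static bool should_mark_ecn(const struct egress_queue *q, uint32_t pkt_len)
{
    /* Mark if admitting this packet pushes occupancy above K. */
    return q->depth_bytes + pkt_len > q->mark_thresh;
}
```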
25

DMA řadič a ovladač síťové karty pro platformu COMBO2 / DMA Controller and Network Interface Card Driver for COMBO2 Platform

Kaštovský, Petr January 2009 (has links)
A family of COMBO cards for accelerating network monitoring is being developed in the Liberouter project, a research activity of CESNET. These cards are equipped with a Xilinx field-programmable gate array (FPGA). To enable the use of standard tools for network monitoring and management, not only application-specific ones, it is necessary to implement a network interface card on the platform that realizes packet reception and transmission through the standard Linux kernel interface. This thesis describes the design and implementation of the network interface card's key components: the DMA controller and the Linux device driver.
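A structure common to DMA controllers and the drivers that feed them is a descriptor ring in host memory, with an ownership bit arbitrating between hardware and software. The C sketch below illustrates that general pattern; the field layout and flag names are hypothetical and do not describe the actual COMBO2 descriptors.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative DMA descriptor ring as shared between a NIC's DMA
 * controller and its driver. The device owns a descriptor while OWN
 * is set; the driver reclaims it once the device clears the bit. */
#define RING_SIZE 256
#define DESC_OWN  (1u << 31)   /* set: owned by hardware */
#define DESC_EOP  (1u << 30)   /* end of packet          */

struct dma_desc {
    uint64_t buf_addr;   /* physical address of the packet buffer */
    uint32_t length;     /* buffer length in bytes                */
    uint32_t flags;      /* OWN, EOP, status bits                 */
};

struct tx_ring {
    struct dma_desc desc[RING_SIZE];
    uint32_t head;       /* next descriptor the driver fills    */
    uint32_t tail;       /* next descriptor the driver reclaims */
};

/* Post one packet for transmission; returns false if the ring is full. */
static bool tx_post(struct tx_ring *r, uint64_t addr, uint32_t len)
{
    struct dma_desc *d = &r->desc[r->head];
    if (d->flags & DESC_OWN)
        return false;                     /* hardware still owns the slot */
    d->buf_addr = addr;
    d->length = len;
    d->flags = DESC_OWN | DESC_EOP;       /* hand ownership to the device */
    r->head = (r->head + 1) % RING_SIZE;
    /* A real driver would now write the ring's doorbell register. */
    return true;
}
```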
26

A Study of Disk Performance Optimization.

Gray, Richard Scott 01 May 2000 (has links) (PDF)
Response time is one of the most important performance measures associated with a typical multi-user system. Response time, in turn, is bounded by the performance of the input/output (I/O) subsystem. Other than the end user and some external peripherals, the slowest component of the I/O subsystem is the disk drive. One standard strategy for improving I/O subsystem performance uses high-performance hardware such as Small Computer Systems Interface (SCSI) drives to improve overall response time. SCSI hardware, unfortunately, is often too expensive for low-end multi-user systems, which commonly use inexpensive Integrated Drive Electronics (IDE) disk drives to keep overall costs low. On such IDE-based multi-user systems, reducing the Central Processing Unit (CPU) overhead associated with disk I/O is critical to system responsiveness. This thesis explores the impact of PCI bus-mastering Direct Memory Access (DMA) on the performance of systems with IDE drives. DMA is a data-transfer protocol that allows data to be sent directly from an attached device to a computer system's main memory, thereby reducing CPU overhead. PCI bus mastering allows modern IDE disk controllers to manipulate main memory without using motherboard-resident DMA controllers. Through a series of experiments, this thesis examines the impact of PCI bus-mastering DMA on IDE performance for synchronous I/O, relative to Programmed Input/Output (PIO) and SCSI performance. The results show that PCI bus-mastering DMA, when used properly, improves the responsiveness and throughput of IDE drives by as much as a factor of seven. The magnitude of this improvement shows the importance of operating-system support for DMA in low-end multi-user systems. The results also demonstrate that the performance gains associated with SCSI depend on system usage and on operating-system support for advanced SCSI capabilities. Under many circumstances, therefore, high-performance SCSI drives are not cost-effective compared with IDE drives capable of bus-mastering DMA.
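Experiments of this kind compare transfer modes by timing raw sequential reads. Below is a minimal C sketch of such a measurement, assuming a Linux-like environment; on Linux systems of that era, IDE DMA could be toggled with `hdparm -d1`, making exactly this kind of before/after comparison possible. Page-cache effects are ignored here, so the numbers are illustrative.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Time synchronous sequential reads from a device or large file and
 * report throughput; run once with DMA disabled and once enabled to
 * compare modes. */
int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <path>\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    enum { BUF = 1 << 20 };          /* 1 MiB per read() */
    char *buf = malloc(BUF);
    struct timespec t0, t1;
    long long total = 0;
    ssize_t n;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    while ((n = read(fd, buf, BUF)) > 0)   /* synchronous reads */
        total += n;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%lld bytes in %.2f s = %.1f MB/s\n", total, secs,
           total / secs / 1e6);
    free(buf);
    close(fd);
    return 0;
}
```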
27

High Data Rate Signal Processing Architectures and Compilation Strategies for Scalable, Multi-Gigabit Digital Systems

Nybo, Daniel Alexander 12 April 2024 (has links) (PDF)
In this study we present a high-performance computing architecture and hardware-acceleration strategy for a heterogeneous multi-gigabit computing system. The system architecture integrates a BeeGFS distributed file system, capable of achieving 80 Gbps of sustained write throughput across five nodes, essential for managing the high data volumes generated by a 25-node high-performance computing (HPC) cluster. To ensure operational efficiency and scalability, the tasks performed on the Linux compute cluster, consisting of 30 nodes, are automated using Ansible, facilitating seamless deployment, management, and updates. We present compilation strategies for a hardware-accelerated polyphase filter bank (PFB) channelization routine optimized for Xilinx UltraScale+ FPGAs, capable of simultaneously processing 2048 channels for each of 12 input streams. This setup shows the efficiency of high-level synthesis of FPGA-based signal processing in handling demanding data-analysis tasks. We also present the implementation and verification of a 1.6 Gsps direct memory access (DMA) transfer from DDR4 memory to a modern Radio Frequency System-on-Chip (RFSoC) digital-to-analog converter. The combination of a high-throughput file system, streamlined automation, and advanced signal-processing capabilities shows the system's ability to meet the needs of complex, real-time data-analysis and processing applications, advancing the field of computational research.
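A polyphase filter bank channelizer splits a prototype FIR filter into M branches and takes an M-point FFT across the branch outputs. The C sketch below shows that structure for a single output frame, with a naive DFT standing in for the FFT; it is an illustrative model, not the thesis's HLS implementation, and the tap/input alignment shown is one of several valid conventions.

```c
#include <complex.h>
#include <math.h>
#include <stddef.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* One output frame of an M-channel critically sampled PFB channelizer:
 * the prototype FIR h[] of length M*T is split into M polyphase
 * branches of T taps; the channel outputs are the M-point DFT of the
 * branch sums. A streaming implementation would also manage a sliding
 * input window, omitted here. */
void pfb_frame(const float *x,       /* M*T most recent input samples */
               const float *h,       /* prototype filter, M*T taps    */
               float complex *y,     /* out: one sample per channel   */
               size_t M, size_t T)
{
    for (size_t k = 0; k < M; k++) {
        float complex acc = 0.0f;
        for (size_t b = 0; b < M; b++) {
            float s = 0.0f;
            for (size_t t = 0; t < T; t++)       /* branch-b FIR */
                s += h[b + t * M] * x[b + t * M];
            /* DFT twiddle for channel k across branches
             * (an FFT in the real design) */
            acc += s * cexpf(-2.0f * (float)M_PI * I
                             * (float)(k * b) / (float)M);
        }
        y[k] = acc;
    }
}
```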
28

Partitioning Strategy Selection for In-Memory Graph Pattern Matching on Multiprocessor Systems

Krause, Alexander, Kissinger, Thomas, Habich, Dirk, Voigt, Hannes, Lehner, Wolfgang 19 July 2023 (has links)
Pattern matching on large graphs is the foundation for a variety of application domains. The continuously increasing size of the underlying graphs requires highly parallel in-memory graph processing engines that need to consider non-uniform memory access (NUMA) and concurrency issues to scale up on modern multiprocessor systems. To tackle these aspects, fine-grained graph partitioning becomes increasingly important. Hence, in this paper we present a classification of graph partitioning strategies and evaluate representative algorithms on medium- and large-scale NUMA systems. As a scalable pattern-matching processing infrastructure, we leverage a data-oriented architecture that preserves data locality and minimizes concurrency-related bottlenecks on NUMA systems. Our in-depth evaluation reveals that the optimal partitioning strategy depends on a variety of factors; consequently, we derive a set of indicators for selecting the partitioning strategy best suited to a given graph and workload.
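To make the classification concrete, the C sketch below shows two representative vertex-partitioning strategies, hash-based and range-based, together with a toy selection heuristic. The indicator names and thresholds are invented for illustration and are not the paper's derived indicator set.

```c
#include <stdint.h>

/* Map a vertex ID to one of P NUMA-local partitions. */

static unsigned part_hash(uint64_t v, unsigned P)
{
    /* Hash partitioning: balances load under skewed IDs,
     * at the cost of neighborhood locality. */
    v ^= v >> 33; v *= 0xff51afd7ed558ccdull; v ^= v >> 33;
    return (unsigned)(v % P);
}

static unsigned part_range(uint64_t v, uint64_t max_v, unsigned P)
{
    /* Range partitioning: preserves locality when neighboring
     * vertices have nearby IDs. */
    return (unsigned)(v / ((max_v / P) + 1));
}

/* Toy selector standing in for the paper's indicators: prefer range
 * partitioning for neighborhood-heavy workloads on densely clustered
 * ID spaces. Thresholds are illustrative. */
static int prefer_range(double locality_score, double skew_score)
{
    return locality_score > 0.5 && skew_score < 0.2;
}
```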
29

Selection in Accessing Mental Objects in Working Memory / The Task Decides the Details

Schwager, Sabine 08 May 2006 (has links)
This dissertation addresses the question of the process components of flexible access to "mental objects" held in verbal working memory, and of their properties. A current working-memory model assumes that the object currently being worked on stands in the focus of attention and is available to arbitrary mental operations, while the remaining objects are maintained in a "region of direct access". A switch of the mental object incurs time costs, because a new object selection must take place among the candidates (Oberauer, 2002). Four experiments showed that this view must be extended: mental objects are made available not for arbitrary operations but specifically for the current task, while objects no longer selected, whose feature information decays, are kept available through subvocal rehearsal and probably first underlie a re-selection as phonological codes. Access to a mental object therefore requires, in addition to object selection, (demand-dependent) processes of feature retrieval and of feature selection within the mental object, which make the currently relevant object information available. Sequences of comparisons with changing access to elements of a memory set of one-digit numbers or monosyllabic words yielded higher object-switch costs when the mental objects were phonologically similar (object selection) and when the comparison required more semantic object information (feature retrieval), as well as costs for a switch of the relevant features within an object (feature selection), but not upon an object switch, which in any case includes the selection of new features. The results support the postulated demand dependence of selection in working memory. / The dissertation aims at identifying component processes of access to "mental objects" in verbal working memory and at characterizing the memory codes involved. One current working-memory model assumes that the object actually selected for processing is in the focus of attention and can be subjected to any upcoming mental operation, while the remaining candidates are maintained within the "region of direct access". When the focus is moved to a new object, this results in time costs, since it requires the selection of a new object from the set (Oberauer, 2002). This task-independent view of working-memory access has to be extended: the mental object in focus is usually selected for a certain (not just any) operation, while the feature information of objects outside the focus of attention is subject to decay. Maintenance of objects currently not selected is probably realized by subvocal rehearsal, which provides phonological codes of the objects as the basis for a new object selection. Consequently, switching between mental objects involves not only object selection but also feature-retrieval and feature-selection processes within the object that provide the task-relevant object information. Four experiments were conducted, consisting of sequences of comparisons using randomly changing elements from a memory set of one-digit numbers or monosyllabic German nouns. Object-switch costs were higher when the memory set contained phonologically similar elements (object selection) and when the task required semantic rather than superficial information (feature retrieval).
There were also costs for changing the relevant features within an object (feature selection), but not with an object switch, which always includes the selection of new object features. The results strongly support the view of task-dependent selection processes in working memory.
30

Toward Highly-efficient GPU-centric Networking / Mot Högeffektiva GPU-centrerade Nätverk

Girondi, Massimo January 2024 (has links)
Graphics Processing Units (GPUs) are emerging as the most popular accelerator for many applications, powering the core of Machine Learning applications and many computing-intensive workloads. GPUs have typically been considered as accelerators, with Central Processing Units (CPUs) in charge of the main application logic, data movement, and network connectivity. In these architectures, the input and output data of network-based GPU-accelerated applications typically traverse the CPU and the Operating System network stack multiple times, getting copied across the system's main memory. These traversals increase application latency and require expensive CPU cycles, reducing the power efficiency of systems and increasing overall response times. These inefficiencies become more important in latency-bounded deployments or at high throughput, where copy times can easily inflate the response time of modern GPUs. The main contribution of this dissertation is a step toward a GPU-centric network architecture that allows GPUs to initiate network transfers without the intervention of CPUs. We focus on commodity hardware, using NVIDIA GPUs and Remote Direct Memory Access over Converged Ethernet (RoCE) to realize this architecture, removing the need for highly homogeneous clusters and ad-hoc network designs that many other similar approaches require. By porting some rdma-core posting routines to the GPU runtime, we can saturate a 100-Gbps link without spending any CPU cycles, reducing overall system response time while increasing power efficiency and improving application throughput. The second contribution concerns the analysis of Clockwork, a state-of-the-art inference-serving system, showing the limitations imposed by controller-centric, CPU-mediated architectures. We then propose an alternative architecture for this system based on an RDMA transport, and we study the performance gains such a system would introduce. An integral component of an inference system is to account for and track user flows and distribute them across multiple worker nodes. Our third contribution aims to understand the challenges of connection-tracking applications running at 100 Gbps, in the context of a stateful load balancer running on commodity hardware.
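For context, the host-side rdma-core posting path whose equivalent the dissertation moves into GPU code looks roughly like the C sketch below: building a one-sided RDMA WRITE work request and posting it with ibv_post_send. Queue-pair setup, memory registration, and the RoCE connection are assumed to exist elsewhere; only the posting step is shown, and it is a generic verbs example rather than the dissertation's GPU-resident code.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post a one-sided RDMA WRITE on an already-connected queue pair.
 * On the GPU-centric path, the doorbell write that ibv_post_send
 * performs internally is what gets issued from device code instead
 * of from the CPU. */
int post_rdma_write(struct ibv_qp *qp,
                    void *local_buf, uint32_t len, uint32_t lkey,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,  /* registered local buffer */
        .length = len,
        .lkey   = lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = 1;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_RDMA_WRITE;
    wr.send_flags = IBV_SEND_SIGNALED;   /* request a completion entry */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```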
