1

Open-Source Parameterized Low-Latency Aggressive Hardware Compressor and Decompressor for Memory Compression

Jearls, James Chandler 16 June 2021 (has links)
In recent years, memory has proven to be a constraining factor in many workloads. Memory is an expensive necessity in many situations, from embedded devices with a few kilobytes of SRAM to warehouse-scale computers with thousands of terabytes of DRAM. Memory compression has existed in all major operating systems for many years. However, while faster than swapping to a disk, memory decompression adds latency to data read operations. Companies and research groups have investigated hardware compression to mitigate these problems. Still, open-source low-latency hardware compressors and decompressors do not exist; as such, every group that studies hardware compression must re-implement one. Importantly, because the devices that can benefit from memory compression vary so widely, there is no single solution that addresses the area, latency, power, and bandwidth requirements of all devices. This work intends to address these issues. It implements hardware accelerators for three popular compression algorithms: LZ77, LZW, and Huffman encoding. Each implementation includes a compressor and a decompressor, and all designs are fully parameterized, with a total of 22 parameters across the designs. All of the designs are open-source under a permissive license. Finally, configurations of this work achieve decompression latencies under 500 nanoseconds, far closer than existing works to the 255 nanoseconds required to read an uncompressed 4 KB page, while still achieving compression ratios comparable to software compression algorithms. / Master of Science / Computer memory, the fast, temporary storage where programs and data are held, is expensive and limited. Compression allows data and programs to be held in memory in a smaller format so they take up less space.
This work implements a hardware design for compression and decompression accelerators to make it faster for the programs using the compressed data to access it. This work includes three hardware compressor and decompressor designs that can be easily modified and are free for anyone to use however they would like. The included designs are orders of magnitude smaller and less expensive than the existing state of the art, and they reduce the decompression time by up to 6x. These smaller areas and latencies result in a relatively small reduction in compression ratios: only 13% on average across the tested benchmarks.
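To make the first of the three algorithms concrete, here is a minimal software sketch of LZ77 compression and decompression. This greedy triple-emitting variant, and the window/match-length parameters, are illustrative assumptions only; they are not taken from the thesis's hardware design.

```python
def lz77_compress(data: bytes, window: int = 255, max_len: int = 15) -> list:
    """Greedy LZ77: emit (offset, length, next_byte) triples."""
    out, i = [], 0
    while i < len(data):
        best_off, best_len = 0, 0
        # Search the sliding window for the longest match at position i.
        for j in range(max(0, i - window), i):
            length = 0
            while (length < max_len and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        if i + best_len == len(data) and best_len > 0:
            best_len -= 1  # keep the final byte as an explicit literal
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

def lz77_decompress(triples) -> bytes:
    """Replay each (offset, length, literal) triple, allowing overlapping copies."""
    buf = bytearray()
    for off, length, nxt in triples:
        for _ in range(length):
            buf.append(buf[-off])
        buf.append(nxt)
    return bytes(buf)
```

The hardware versions described above parallelize the match search; the round-trip property is the same.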
2

IMPROVING REAL-TIME LATENCY PERFORMANCE ON COTS ARCHITECTURES

Bono, John, Hauck, Preston 10 1900 (has links)
International Telemetering Conference Proceedings / October 20-23, 2003 / Riviera Hotel and Convention Center, Las Vegas, Nevada / Telemetry systems designed to support the current needs of mission-critical applications often have stringent real-time requirements. These systems must guarantee a maximum worst-case processing and response time when incoming data is received, and these real-time tolerances continue to tighten as data rates increase. At the same time, end-user requirements for COTS pricing efficiencies have forced many telemetry systems to run on desktop operating systems like Windows or Unix. While these desktop operating systems offer advanced user interface capabilities, they cannot meet the real-time requirements of many mission-critical telemetry applications, and attempts to enhance desktop operating systems to support real-time constraints have met with only limited success. This paper presents a telemetry system architecture that offers real-time guarantees while extensively leveraging inexpensive COTS hardware and software components. This is accomplished by partitioning the telemetry system across two processors. The first processor is a NetAcquire subsystem running a real-time operating system (RTOS); the second runs a desktop operating system that hosts the user interface. The two processors are connected by a high-speed Ethernet IP internetwork. This architecture affords an improvement of two orders of magnitude over the real-time performance of a standalone desktop operating system.
3

The quest for low latency at both network edges: design, analysis, simulation and experiments

Gong, Yixi 04 March 2016 (has links)
In recent years, Internet services have grown considerably and rapidly, raising many new challenges in varied scenarios. Overall service performance depends in turn on the performance of multiple network segments. We investigate two representative design challenges in different segments; the two most important sit at the opposite edges of the end-to-end Internet path: the end-user access network and the service provider data center network.
4

LEVERAGING INTERNET PROTOCOL (IP) NETWORKS TO TRANSPORT MULTI-RATE SERIAL DATA STREAMS

Heath, Doug, Polluconi, Marty, Samad, Flora 10 1900 (has links)
ITC/USA 2006 Conference Proceedings / The Forty-Second Annual International Telemetering Conference and Technical Exhibition / October 23-26, 2006 / Town and Country Resort & Convention Center, San Diego, California / As the rates and numbers of serial telemetry data streams increase, the cost of timely, efficient and robust distribution of those streams increases faster. Without alternatives to traditional point-to-point serial distribution, the complexity of the infrastructure will soon overwhelm the potential benefits of the increased stream counts and rates. Multiplexing algorithms in Field-Programmable Gate Arrays (FPGAs), coupled with universally available Internet Protocol (IP) switching technology, provide a low-latency, time-data-correlated multi-stream distribution solution. This implementation has yielded zero-error IP transport and regeneration of multiple serial streams, maintaining stream-to-stream skew of less than 10 nsec with an end-to-end latency contribution of less than 15 msec. Adoption of this technique as a drop-in solution can greatly reduce the cost and complexity of keeping pace with the changing serial telemetry community.
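One contributor to the sub-15 msec end-to-end latency quoted above is packetization delay: a packet cannot be sent until its payload has been filled from the serial stream. A back-of-the-envelope sketch (the payload size and stream rate below are illustrative assumptions, not figures from the paper):

```python
def packetization_delay_s(payload_bytes: int, stream_rate_bps: float) -> float:
    """Time to accumulate one packet's payload from a serial stream."""
    return payload_bytes * 8 / stream_rate_bps

# e.g. a 1472-byte UDP payload filled from a 10 Mb/s serial stream:
delay = packetization_delay_s(1472, 10e6)   # about 1.18 ms
```

Faster streams or smaller payloads shrink this term, at the cost of more packets per second through the IP switch.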
5

On the Coordinated Use of a Sleep Mode in Wireless Sensor Networks: Ripple Rendezvous

van Coppenhagen, Robert Lindenberg, robert.vancoppenhagen@dsto.defence.gov.au January 2006 (has links)
It is widely accepted that low energy consumption is the most important requirement when designing components and systems for a wireless sensor network (WSN). The greatest energy consumer in each node of a WSN is the radio transceiver, and as such it is important that this component be used in an extremely energy-efficient manner. One method of reducing the amount of energy consumed by the radio transceiver is to turn it off and allow nodes to enter a sleep mode. The algorithms that directly control the radio transceiver are traditionally grouped into the Medium Access Control (MAC) layer of a communication protocol stack. This thesis introduces the emerging field of wireless sensor networks and outlines the requirements of a MAC protocol for such a network. Current MAC protocols are reviewed in detail, with a focus on how they utilize this energy-saving sleep mode and on the performance problems they suffer from. A proposed new method of coordinating the use of this sleep mode between nodes in the network is specified and described. The proposed protocol is analytically compared with existing protocols as well as with some fundamental performance limits. The thesis concludes with an analysis of the results and some recommendations for future work.
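The energy argument for a coordinated sleep mode can be sketched numerically. The transceiver and battery figures below are generic illustrative assumptions, not values from the thesis:

```python
def avg_radio_power_mw(p_active_mw: float, p_sleep_mw: float,
                       t_active_ms: float, period_ms: float) -> float:
    """Time-weighted average power of a duty-cycled radio transceiver."""
    duty = t_active_ms / period_ms
    return duty * p_active_mw + (1.0 - duty) * p_sleep_mw

def lifetime_hours(battery_mah: float, voltage_v: float,
                   avg_power_mw: float) -> float:
    """Idealized node lifetime: stored battery energy / average draw."""
    return battery_mah * voltage_v / avg_power_mw

# 60 mW active, 3 uW asleep, radio awake 10 ms out of every 1000 ms:
p = avg_radio_power_mw(60.0, 0.003, 10.0, 1000.0)   # about 0.6 mW average
```

Even a 1% duty cycle cuts average radio power by roughly two orders of magnitude, which is why MAC-layer sleep coordination dominates WSN protocol design.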
6

Efficient high-speed on-chip global interconnects

Caputa, Peter January 2006 (has links)
The continuous miniaturization of integrated circuits has opened the path towards System-on-Chip realizations. Process shrinking into the nanometer regime improves transistor performance, while the delay of global interconnects, which connect circuit blocks separated by long distances, increases significantly. In fact, global interconnects extending across a full chip can have a delay corresponding to multiple clock cycles. At the same time, global clock skew constraints, not only between blocks but also along the pipelined interconnects, become even tighter. On-chip interconnects have always been considered RC-like, that is, exhibiting long RC delays. This has motivated large efforts on alternatives such as on-chip optical interconnects, which have not yet been demonstrated, or complex schemes utilizing on-chip RF transmission or pulsed current-mode signaling.

In this thesis, we show that well-designed electrical global interconnects, behaving as transmission lines, have the capacity for very high data rates (higher than can be delivered by the actual process) and support near velocity-of-light delay for single-ended voltage-mode signaling, thus mitigating the RC problem. We critically explore key interconnect performance measures such as data delay, maximum data rate, crosstalk, edge rates and power dissipation. To experimentally demonstrate the feasibility and superior properties of on-chip transmission line interconnects, we have designed and fabricated a test chip carrying a 5 mm long global communication link. Measurements show that we can achieve 3 Gb/s/wire over the 5 mm long, repeaterless on-chip bus implemented in a standard 0.18 μm CMOS process, achieving a signal velocity of 1/3 of the velocity of light in vacuum.

To manage the problems due to global wire delays, we describe and implement a Synchronous Latency Insensitive Design (SLID) scheme, based on source-synchronous data transfer between blocks and data re-timing at the receiving block. The SLID technique not only mitigates unknown global wire delays, but also removes the need for controlling global clock skew. The high performance and robustness of the SLID method are demonstrated through a successful implementation of a SLID-based, 5.4 mm long, on-chip global bus, supporting 3 Gb/s/wire and dynamically accepting ±2 clock cycles of data-clock skew, in a standard 0.18 μm CMOS process.

In the context of technology scaling, interconnects tend to dominate chip power dissipation due to their large total capacitance. In this thesis we address the problem of interconnect power dissipation by proposing and analyzing a transition-energy cost model aimed at efficient power estimation of performance-critical buses. The model, which includes properties that closely capture effects present in high-performance VLSI buses, can be used to more accurately determine the energy benefits of e.g. transition coding of bus topologies. We further show a power optimization scheme based on an appropriate choice of reduced interconnect voltage swing and scaling of the receiver amplifier. Finally, the power-saving impact of swing reduction in combination with a sense-amplifying flip-flop receiver is shown on a microprocessor cache bus architecture used in industry.
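The RC-versus-transmission-line contrast in this abstract can be made concrete with a rough delay comparison. The wire parasitics below are illustrative assumptions for a global wire, not values from the thesis:

```python
def elmore_delay_s(r_per_m: float, c_per_m: float, length_m: float) -> float:
    """Distributed-RC wire delay, ~0.38 * R * C * L**2 (Elmore approximation)."""
    return 0.38 * r_per_m * c_per_m * length_m ** 2

def time_of_flight_s(length_m: float, velocity_fraction: float = 1 / 3) -> float:
    """Propagation delay of a transmission-line wire at a fraction of c."""
    return length_m / (velocity_fraction * 3.0e8)

# 5 mm wire with assumed 0.1 ohm/um resistance and 0.2 fF/um capacitance:
rc = elmore_delay_s(1.0e5, 2.0e-10, 5.0e-3)   # about 190 ps, and growing as L**2
tl = time_of_flight_s(5.0e-3)                 # 50 ps at c/3, growing only as L
```

The quadratic length dependence of the RC term is what makes transmission-line behavior attractive for long global wires.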
8

Scalability and robustness of artificial neural networks

Stromatias, Evangelos January 2016 (has links)
Artificial Neural Networks (ANNs) continue to gain popularity, as they are used in many diverse research fields and contexts, ranging from biological simulations and experiments on artificial neuronal models to machine learning models intended for industrial and engineering applications. One example is the recent success of Deep Learning architectures, such as Deep Belief Networks (DBNs), which deliver state-of-the-art results in many domains. While the performance of such ANN architectures is greatly affected by their scale, their scalability, both during training and execution, is limited by increased power consumption and communication overheads, which implicitly limit their real-time performance. Ongoing work on the design and construction of spike-based neuromorphic platforms offers an alternative for running large-scale neural networks, such as DBNs, with significantly lower power consumption and lower latencies, but it has to overcome the hardware limitations and model specialisations imposed by these types of circuits. SpiNNaker is a novel massively parallel, fully programmable and scalable architecture designed to enable real-time spiking neural network (SNN) simulations. These properties render SpiNNaker an attractive neuromorphic exploration platform for running large-scale ANNs; however, it is necessary to investigate thoroughly both its power requirements and its communication latencies. This research focuses on two main aspects. First, it characterises the power requirements and communication latencies of the SpiNNaker platform while running large-scale SNN simulations. 
The results of this investigation lead to a power estimation model for the SpiNNaker system, a reduction of the overall power requirements, and a characterisation of the intra- and inter-chip spike latencies. Second, it provides a full characterisation of spiking DBNs through a set of case studies that determine the impact of (a) the hardware bit precision; (b) the input noise; (c) weight variation; and (d) combinations of these on the classification performance of spiking DBNs for the problem of handwritten digit recognition. The results demonstrate that spiking DBNs can be realised on limited-precision hardware platforms without drastic performance loss, and thus offer an excellent compromise between accuracy and low-power, low-latency execution. These studies are intended to provide guidelines for current and future efforts in developing custom large-scale digital and mixed-signal spiking neural network platforms.
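The bit-precision study in (a) above boils down to fixed-point quantization of trained weights. A minimal sketch, under the assumption of a signed symmetric format with a clipped range (the format details are illustrative, not the thesis's):

```python
def quantize_weights(weights, bits: int, w_max: float = 1.0):
    """Snap each weight to a signed fixed-point grid with 2**(bits-1)-1
    positive levels, clipping to [-w_max, w_max]."""
    levels = 2 ** (bits - 1) - 1
    step = w_max / levels
    return [max(-w_max, min(w_max, round(w / step) * step)) for w in weights]
```

At 8 bits the worst-case rounding error for in-range weights is half a step (about 0.004 here), which is one reason limited-precision platforms can preserve classification performance.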
9

Behavioral Modeling and FPGA Synthesis of IEEE 802.11n Orthogonal Frequency Division Multiplexing (OFDM) Scheme

Sharma, Raghav 04 November 2016 (has links)
In the field of communications, a high data rate and low multi-path fading are required for efficient information exchange. Orthogonal Frequency Division Multiplexing (OFDM) is a widely accepted scheme in the IEEE 802.11n standard (among many others) for communication systems operating in fading, dispersive channels. In this thesis, we modeled the OFDM algorithm at the behavioral level in VHDL/Verilog and successfully synthesized and verified it on an FPGA. Due to rapid technology scaling, FPGAs have become popular as low-cost, high-performance alternatives to (semi-)custom ASICs; their reprogramming flexibility also makes them useful for rapid prototyping. As per the IEEE standard, we implemented both the transmitter and the receiver with four modulation schemes (BPSK, QPSK, QAM16, and QAM64). We extensively verified the design in simulation as well as on an Altera Stratix IV EP4SGX230KF40C2 FPGA (Terasic DE4 development board). The synthesized design ran at a 100 MHz clock frequency, incurring a 54 µs end-to-end latency and 8% logic utilization.
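The BPSK branch of the OFDM chain described above can be sketched in a few lines: map bits to ±1, take an inverse DFT across the subcarriers, and prepend a cyclic prefix. The 64-subcarrier/16-sample prefix sizes are illustrative, and the naive O(N²) DFT stands in for the FFT core used in hardware:

```python
import cmath

def idft(X):
    """Naive inverse DFT (stand-in for the hardware IFFT)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def dft(x):
    """Naive forward DFT."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def ofdm_tx(bits, cp=16):
    syms = [1.0 if b else -1.0 for b in bits]   # BPSK mapping
    time = idft(syms)                           # one OFDM symbol
    return time[-cp:] + time                    # prepend cyclic prefix

def ofdm_rx(samples, n_sub=64, cp=16):
    time = samples[cp:cp + n_sub]               # strip cyclic prefix
    return [1 if s.real > 0 else 0 for s in dft(time)]  # hard BPSK decision
```

Over an ideal channel the receiver recovers the transmitted bits exactly; the cyclic prefix exists to absorb multi-path spread in a real channel.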
10

Attelage de systèmes de transcription automatique de la parole / Harnessing automatic speech transcription systems

Bougares, Fethi 23 November 2012 (has links)
This thesis presents work in the area of Large Vocabulary Continuous Speech Recognition (LVCSR) system combination. It focuses on methods for harnessing heterogeneous transcription systems in order to improve recognition quality under latency constraints. Automatic Speech Recognition (ASR) is affected by the many variabilities present in the speech signal, and a single system is usually unable to model all of them. Combination of different transcription systems rests on the idea of exploiting the strengths of each to obtain an improved final transcription. The combination methods proposed in the literature are mostly applied a posteriori, within a multi-pass transcription architecture: the outputs of two or more ASR systems are combined to estimate the most likely hypothesis among conflicting word pairs or differing hypotheses for the same part of an utterance. This induces considerable latency, since the combination cannot be applied until all systems have finished. Recently, an integrated combination method was proposed, based on the Driven Decoding Algorithm (DDA) paradigm, which combines different systems during decoding by integrating information from several so-called auxiliary systems into the decoding process of a primary system.

The contribution of this thesis is twofold. First, we study the robustness of combination by driven decoding, and we propose an efficient, generalizable improvement based on driven decoding with a bag of n-grams of auxiliary hypotheses, called BONG. Second, we propose a framework for harnessing several single-pass systems for the collaborative, reduced-latency construction of the final recognition hypothesis. We present several theoretical models of the harnessing architecture and give an example implementation based on a distributed client/server architecture. Having defined the collaboration architecture, we focus on combination methods suited to reduced-latency automatic transcription. We propose an adaptation of the BONG combination that allows reduced-latency collaboration of several single-pass systems running in parallel, and an adaptation of the ROVER combination that is applied during the decoding process, via a local alignment procedure followed by a vote based on word frequencies. Both proposed combination methods reduce the latency of combining several single-pass systems while yielding a significant WER gain.
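The frequency-based vote used in the ROVER adaptation can be illustrated on pre-aligned hypotheses. Real ROVER also performs dynamic-programming alignment with NULL transitions; this sketch assumes the alignment step has already produced equal-length word sequences:

```python
from collections import Counter

def rover_vote(hypotheses):
    """Pick the most frequent word in each aligned slot across system outputs."""
    result = []
    for slot in zip(*hypotheses):                 # one slot per aligned position
        word, _count = Counter(slot).most_common(1)[0]
        result.append(word)
    return result

# Three (hypothetical) recognizer outputs for the same utterance:
hyps = [["the", "cat", "sat", "down"],
        ["the", "bat", "sat", "down"],
        ["a",   "cat", "sat", "down"]]
# rover_vote(hyps) -> ["the", "cat", "sat", "down"]
```

Each recognizer's isolated error is outvoted by the other two, which is the intuition behind the WER gains reported above.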
