Global ETD Search

1	Hardware-Aware Distributed Pipelined Neural Network Models Inference Alshams, Mojtaba 07 1900 (has links) Neural network models have attracted the attention of the scientific community for their increasing prediction accuracy and their good emulation of some human tasks. This has led to extensive enhancements in their architecture, resulting in models with fast-growing memory and computation requirements. Because of hardware constraints such as memory and computing capabilities, the inference of a large neural network model can be distributed across multiple devices by a partitioning algorithm. The proposed framework finds the optimal model splits and chooses which device computes each split so as to minimize inference time and energy. The framework is based on the PipeEdge algorithm and extends it by not only increasing inference throughput but also simultaneously minimizing inference energy consumption. Another contribution of the thesis is the addition of devices based on the emerging compute-in-memory (CIM) technology to the system. To the best of my knowledge, no one has studied the effect of including CIM devices, specifically those modeled with the DNN+NeuroSim simulator, in distributed inference. The proposed framework could partition VGG8 and ResNet152 on ImageNet and achieve a comparable trade-off between the increase in the slowest inference stage time and the energy reduction, both when it tried to decrease inference energy (e.g. 19% energy reduction with 34% time increase) and when CIM devices augmented the system (e.g. 34% energy reduction with 45% time increase). Distributed Inference Neural Networks Machine Learning Compute-in-memory throughput energy trade-off
2	Compute-in-Memory Primitives for Energy-Efficient Machine Learning Amogh Agrawal (10506350) 26 July 2021 (has links) Machine Learning (ML) workloads, being memory and compute-intensive, consume large amounts of power running on conventional computing systems, restricting their implementations to large-scale data centers. Thus, there is a need for building domain-specific hardware primitives for energy-efficient ML processing at the edge. One such approach is in-memory computing, which eliminates frequent and unnecessary data transfers between the memory and the compute units by directly computing the data where it is stored. Most of the chip area is consumed by on-chip SRAMs in both conventional von-Neumann systems (e.g. CPU/GPU) and application-specific ICs (e.g. TPU). Thus, we propose various circuit techniques to enable a range of computations such as bitwise Boolean and arithmetic computations, binary convolution operations, non-Boolean dot-product operations, lookup-table based computations, and spiking neural network implementation, all within standard SRAM memory arrays.
First, we propose X-SRAM, where, by using skewed sense amplifiers, bitwise Boolean operations such as NAND/NOR/XOR/IMP can be enabled within 6T and 8T SRAM arrays. Moreover, exploiting the decoupled read/write ports in 8T SRAMs, we propose a read-compute-store scheme where the computed data can be written back directly into the array simultaneously.
Second, we propose Xcel-RAM, where we show how binary convolutions can be enabled in 10T SRAM arrays for accelerating binary neural networks. We present a charge-sharing approach for performing XNOR operations followed by a population count (popcount) using both analog and digital techniques, highlighting the accuracy-energy tradeoff.
Third, we take this concept further and propose CASH-RAM to accelerate non-Boolean operations, such as dot-products, within standard 8T-SRAM arrays by utilizing the parasitic capacitances of bitlines and sourcelines. We analyze the non-idealities that arise due to analog computations and propose a self-compensation technique which reduces the effects of non-idealities, thereby reducing the errors.
Fourth, we propose ROM-embedded caches, RECache, using standard 8T SRAMs, useful for lookup-table (LUT) based computations. We show that just by adding an extra word-line (WL) or a source-line (SL), the same bit-cell can store a ROM bit as well as the usual RAM bit while maintaining the performance and area-efficiency, thereby doubling the memory density. Further, we propose SPARE, an in-memory, distributed processing architecture built on RECache, for accelerating spiking neural networks (SNNs), which often require high-order polynomials and transcendental functions for solving complex neuro-synaptic models.
Finally, we propose IMPULSE, a 10T-SRAM compute-in-memory (CIM) macro, specifically designed for state-of-the-art SNN inference. The inherent dynamics of the neuron membrane potential in SNNs allow processing of sequential learning tasks, avoiding the complexity of recurrent neural networks. The highly sparse spike-based computations in such spatio-temporal data can be leveraged for energy-efficiency. However, the membrane potential incurs additional memory access bottlenecks in current SNN hardware. IMPULSE tries to tackle the above challenges. It consists of a fused weight (WMEM) and membrane potential (VMEM) memory and inherently exploits sparsity in input spikes. We propose staggered data mapping and re-configurable peripherals for handling different bit-precision requirements of WMEM and VMEM, while supporting multiple neuron functionalities. The proposed macro was fabricated in 65nm CMOS technology. We demonstrate a sentiment classification task from the IMDB dataset of movie reviews and show that the SNN achieves competitive accuracy with only a fraction of trainable parameters and effective operations compared to an LSTM network.
These circuit explorations to embed computations in standard memory structures show that on-chip SRAMs can do much more than just store data and can be re-purposed as on-demand accelerators for a variety of applications. Computer Engineering SRAM in-memory computing compute-in-memory Neuromorphic Computing Spiking Neural Network
3	A 2TnC ferroelectric memory gain cell suitable for compute-in-memory and neuromorphic application Slesazeck, Stefan, Ravsher, Taras, Havel, Viktor, Breyer, Evelyn T., Mulaosmanovic, Halid, Mikolajick, Thomas 20 June 2022 (has links) A 2TnC ferroelectric memory gain cell consisting of two transistors and two or more ferroelectric capacitors (FeCAP) is proposed. While a pre-charge transistor allows access to a single cell in an array, the read transistor amplifies the small read signals from small-scaled FeCAPs, which can be operated either in FeRAM mode by sensing the polarization reversal current, or in ferroelectric tunnel junction (FTJ) mode by sensing the polarization-dependent leakage current. The simultaneous read or write operation of multiple FeCAPs is used to realize compute-in-memory (CiM) algorithms that enable processing of data represented by both non-volatile, internally stored data and externally applied data. The internal gain of the cell mitigates the need for 3D integration of the FeCAPs, thus making the concept very attractive, especially for embedded memories. Here we discuss design constraints of the 2TnC cell and present the proof-of-concept for realizing versatile CiM approaches by means of electrical characterization results. info:eu-repo/classification/ddc/621.3 ddc:621.3
4	Toward Energy-Efficient Machine Learning: Algorithms and Analog Compute-In-Memory Hardware Indranil Chakraborty (11180610) 26 July 2021 (has links) The ‘Internet of Things’ has increased the demand for artificial intelligence (AI)-based edge computing in applications ranging from healthcare monitoring systems to autonomous vehicles. However, the growing complexity of machine learning workloads requires rethinking to make AI amenable to resource-constrained environments such as edge devices. To that effect, the entire stack of machine learning, from algorithms to hardware primitives, has been explored to enable energy-efficient intelligence at the edge.
From the algorithmic aspect, model compression techniques such as quantization are powerful tools to address the growing computational cost of ML workloads. However, quantization in particular can result in substantial loss of performance for complex image classification tasks. To address this, a principal component analysis (PCA)-driven methodology is proposed to identify the important layers of a binary network and design mixed-precision networks. The proposed Hybrid-Net achieves a significant improvement in classification accuracy over binary networks such as XNOR-Net for ResNet and VGG architectures on CIFAR-100 and ImageNet datasets, while still achieving remarkable energy efficiency.
Having explored compressed neural networks, there is a need to investigate suitable computing systems to further the energy efficiency. Memristive crossbars have been extensively explored as an alternative to traditional CMOS-based systems for deep learning accelerators due to their high on-chip storage density and efficient Matrix Vector Multiplication (MVM) compared to digital CMOS. However, the analog nature of computing poses significant issues due to various non-idealities such as parasitic resistances and non-linear I-V characteristics of the memristor device. To address this, a simplified equation-based modelling of the non-ideal behavior of crossbars is performed and, correspondingly, a modified technology-aware training algorithm is proposed. Building on the drawbacks of equation-based modeling, a Generalized Approach to Emulating Non-Ideality in Memristive Crossbars using Neural Networks (GENIEx) is proposed, where a neural network is trained on HSPICE simulation data to learn the transfer characteristics of the non-ideal crossbar. Next, a functional simulator was developed which includes key architectural facets such as tiling and bit-slicing to analyze the impact of non-idealities on the classification accuracy of large-scale neural networks.
To truly realize the benefits of hardware primitives and the algorithms on top of the stack, it is necessary to build efficient devices that mimic the behavior of the fundamental units of a neural network, namely neurons and synapses. However, efforts have largely been invested in implementations in the electrical domain, with potential limitations of switching speed, functional errors due to analog computing, etc. As an alternative, a purely photonic operation of an Integrate-and-Fire Spiking neuron is proposed, based on the phase change dynamics of Ge2Sb2Te5 (GST) embedded on top of a microring resonator, which alleviates the energy constraints of PCMs in the electrical domain. Further, the inherent parallelism of wavelength-division multiplexing (WDM) was leveraged to propose a photonic dot-product engine. The proposed computing platform was used to emulate an SNN inferencing engine for image-classification tasks. These explorations at different levels of the stack can enable energy-efficient machine learning for edge intelligence.
Having explored various domains to design efficient DNN models and studied various hardware primitives based on emerging technologies, we focus on silicon implementation of compute-in-memory (CIM) primitives for machine learning acceleration based on the more available CMOS technology. CIM primitives enable efficient matrix-vector multiplications (MVM) through parallelized multiply-and-accumulate operations inside the memory array itself. As CIM primitives deploy bit-serial computing, the computations are exposed to bit-level sparsity of inputs and weights in an ML model. To that effect, we present an energy-efficient sparsity-aware reconfigurable-precision compute-in-memory (CIM) 8T-SRAM macro for machine learning (ML) applications. Standard 8T-SRAM arrays are re-purposed to enable MAC operations using selective current flow through the read-port transistors. The proposed macro dynamically leverages workload sparsity by reconfiguring the output precision in the peripheral circuitry without degrading application accuracy. Specifically, we propose a new energy-efficient reconfigurable-precision SAR ADC design with the ability to form (n+m)-bit precision using n-bit and m-bit ADCs. Additionally, the transimpedance amplifier (TIA), required to convert the summed current into voltage before conversion, is reconfigured based on sparsity to improve sense margin at lower output precision. The proposed macro, fabricated in 65 nm technology, provides 35.5-127.2 TOPS/W as the ADC precision varies from 6-bit to 2-bit, respectively. Building on top of the fabricated macro, we next design a hierarchical CIM core micro-architecture that addresses the existing CIM scaling challenges. The proposed CIM core micro-architecture consists of 32 proposed sparsity-aware CIM macros. The 32 macros are divided into 4 matrix-vector multiplication units (MVMUs) consisting of 8 macros each. The core has three unique features: i) it can adaptively reconfigure ADC precision to achieve energy-efficiency and lower latency based on input and weight sparsity, determined by a sparsity controller, ii) it deploys a row-gating feature to maintain SNR requirements for accurate DNN computations, and iii) it provides hardware support for load balancing to balance latency mismatches occurring due to different ADC precisions in different compute units. Besides the CIM macros, the core micro-architecture consists of input, weight, and output memories, along with instruction memory and control circuits. The instruction set architecture allows for flexible dataflows and mapping in the proposed core micro-architecture. The sparsity-aware processing core is scheduled to be taped out next month. The proposed CIM demonstrations, complemented by our previous analysis on analog CIM systems, have advanced our understanding of this emerging paradigm as it pertains to ML acceleration. Computer Engineering Machine Learning Hardware Artificial Intelligence Compute-in-Memory Accelerator Machine Learning Accelerator Photonic Neural Networks Neural Networks Phase Change Materials
5	Adoption of 2T2C ferroelectric memory cells for logic operation Ravsher, Taras, Mulaosmanovic, Halid, Breyer, Evelyn T., Havel, Viktor, Mikolajick, Thomas, Slesazeck, Stefan 17 December 2021 (has links) A 2T2C ferroelectric memory cell consisting of a select transistor, a read transistor and two ferroelectric capacitors that can be operated either in FeRAM mode or in memristive ferroelectric tunnel junction mode is proposed. The two memory devices can be programmed individually. By performing a combined readout operation, the two stored bits of the memory cell can be combined to perform an in-memory logic operation. Moreover, additional input logic signals that are applied as external readout voltage pulses can be used to perform logic operations together with the stored logic states of the ferroelectric capacitors. Electrical characterization results of the logic-in-memory (LiM) functionality are presented. info:eu-repo/classification/ddc/621.3 ddc:621.3
