1

Memristive Probabilistic Computing

Alahmadi, Hamzah 10 1900 (has links)
In the era of the Internet of Things and Big Data, unconventional techniques are emerging to accommodate large data volumes and tight resource constraints. New computing structures are advancing, based on non-volatile memory technologies and different processing paradigms. Additionally, the intrinsic resiliency of many current applications encourages creative computation techniques: for such applications, approximate computing is a natural fit, optimizing energy efficiency at the cost of some accuracy. In this work, we build probabilistic adders based on stochastic memristors. The probabilistic adders are analyzed with respect to the stochastic behavior of the underlying memristors. Multiple adder implementations are investigated and compared. The memristive probabilistic adder provides a different approach from typical approximate CMOS adders. Furthermore, it allows for large area savings and design flexibility in trading performance for power savings. At a performance level similar to that of approximate CMOS adders, the memristive adder achieves 60% power savings. An image-compression application is investigated using the memristive probabilistic adders, illustrating the trade-off between performance and energy.
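
To make the idea concrete, here is a minimal sketch, not the thesis's device model, of a ripple-carry adder whose computed bits flip with a fixed probability, standing in for stochastic memristor switching; the error probability `p_error` and the uniform bit-flip placement are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def probabilistic_ripple_adder(a, b, n_bits=8, p_error=0.02):
    """Ripple-carry addition where each computed sum bit may flip with
    probability p_error, a stand-in for stochastic memristor switching.
    Hypothetical model: real device statistics are more complex."""
    carry = 0
    result = 0
    for i in range(n_bits):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        s = ai ^ bi ^ carry
        carry = (ai & bi) | (carry & (ai ^ bi))
        if rng.random() < p_error:   # stochastic switching error
            s ^= 1
        result |= s << i
    return result

# Estimate the mean absolute error over random 8-bit operand pairs.
pairs = rng.integers(0, 256, size=(10000, 2))
errs = [abs(probabilistic_ripple_adder(a, b) - ((a + b) & 0xFF))
        for a, b in pairs]
print("mean |error|:", np.mean(errs))
```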
2

Design by transformation: from domain knowledge to optimized program generation

Marker, Bryan Andrew 20 June 2014 (has links)
Expert design knowledge is essential to develop a library of high-performance software. This includes how to implement and parallelize domain operations, how to optimize implementations, and estimates of which implementation choices are best. An expert repeatedly applies this knowledge, often in a rote and tedious way, to develop all of the related functionality expected from a domain-specific library. Expert knowledge is hard to gain and is easily lost over time when an expert forgets or when a new engineer starts developing code. The domain of dense linear algebra (DLA) is a prime example, with software so well designed that much of an expert's important work has become rote and tedious. In this dissertation, we demonstrate how one can encode design knowledge for DLA so it can be automatically applied to generate code as an expert would, or to generate better code. Further, the knowledge is encoded for perpetuity, so it can be reused to ease implementing functionality on new hardware, or used to teach a non-expert how software is designed. We call this approach to software engineering (encoding expert knowledge and automatically applying it) Design by Transformation (DxT). We present our vision, the methodology, a prototype code generation system, and possibilities when applying DxT to the domain of dense linear algebra.
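
The core mechanism, encoding design knowledge as mechanically applicable transformations, can be sketched as below. The `Op` representation, the rule set, and the fixed-point rewrite strategy are all invented for illustration; they are not the prototype DxT system described in the dissertation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    """A node in a toy program representation."""
    name: str
    args: tuple

# Hypothetical knowledge base: each rule encodes one piece of expert
# knowledge as a (pattern, rewrite) pair over the representation.
RULES = [
    # refine an abstract GEMM into a blocked, parallel implementation
    (lambda op: op.name == "gemm",
     lambda op: Op("blocked_parallel_gemm", op.args)),
    # fuse an explicit transpose into the GEMM call
    (lambda op: op.name == "transpose_then_gemm",
     lambda op: Op("gemm_with_transposed_A", op.args)),
]

def apply_rules(op: Op) -> Op:
    """Exhaustively apply rewrite rules until a fixed point is reached."""
    changed = True
    while changed:
        changed = False
        for matches, rewrite in RULES:
            if matches(op):
                op = rewrite(op)
                changed = True
    return op

print(apply_rules(Op("gemm", ("A", "B", "C"))))
```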
3

Energy-Efficient Circuit and Architecture Designs for Intelligent Systems

January 2020 (has links)
In the era of artificial intelligence (AI), deep neural networks (DNNs) have achieved accuracy on par with humans on a variety of recognition tasks. However, the high computation and storage requirements of DNN training and inference pose challenges to deploying or locally training DNNs on mobile and wearable devices. Energy-efficient hardware innovation, from the circuit to the architecture level, is required. In this dissertation, a smart electrocardiogram (ECG) processor is first presented for ECG-based authentication as well as cardiac monitoring. The 65nm test chip consumes 1.06 μW at 0.55 V for real-time ECG authentication, achieving an equal error rate of 1.7% on an in-house 645-subject database. Next, several SRAM-based in-memory computing (IMC) accelerators for deep learning algorithms are presented. Two single-array macros, XNOR-SRAM and C3SRAM, are based on resistive and capacitive networks for XNOR-ACcumulation (XAC) operations, respectively. The XNOR-SRAM and C3SRAM macros in 65nm CMOS achieve energy efficiencies of 403 TOPS/W and 672 TOPS/W, respectively. Built on top of these two single-array macro designs, two multi-array architectures are presented. The XNOR-SRAM-based architecture, "Vesti", is designed to seamlessly support configurable multibit activations and large-scale DNNs. Vesti employs double-buffering with two groups of in-memory computing SRAMs, effectively hiding the write latency of the IMC SRAMs. The Vesti accelerator in 65nm CMOS achieves energy consumption of <20 nJ for MNIST classification and <40 μJ for CIFAR-10 classification at a 1.0 V supply. More recently, a programmable IMC accelerator (PIMCA) integrating 108 C3SRAM macros with a total size of 3.4 Mb is proposed. The 28nm prototype chip achieves system-level energy efficiency of 437/62 TOPS/W at 40 MHz and a 1 V supply for DNNs with 1b/2b precision. In addition to the IMC works, this dissertation also presents a convolutional neural network (CNN) learning processor, which accelerates training based on stochastic gradient descent (SGD) with momentum in 16-bit fixed-point precision. The 65nm CNN learning processor achieves a peak energy efficiency of 2.6 TOPS/W for 16-bit fixed-point operations, consuming 10.45 mW at 0.55 V. In summary, this dissertation presents several hardware innovations from the circuit to the architecture level, exploiting reduced algorithm complexity through pruning and low-precision quantization. In particular, the macro-level and system-level SRAM-based IMC works presented here show that SRAM-based IMC is one of the most promising solutions for energy-efficient intelligent systems.
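
As background for the XAC operation at the heart of macros like XNOR-SRAM and C3SRAM, here is a small functional sketch of XNOR-and-accumulate on binarized values. The bit encoding is the standard one for binarized networks; the popcount done here in software is what such macros compute in the analog domain, and the example vectors are invented.

```python
import numpy as np

def xnor_accumulate(weights_bits, activations_bits):
    """XNOR-and-ACcumulate (XAC) on binarized values.
    Bits encode +/-1: bit 1 -> +1, bit 0 -> -1. Two +/-1 values multiply
    to +1 exactly when their bits agree (XNOR = 1), so the signed dot
    product equals 2 * popcount(XNOR) - N."""
    xnor = 1 - (weights_bits ^ activations_bits)   # 1 where bits agree
    popcount = int(xnor.sum())                     # analog step in hardware
    return 2 * popcount - len(weights_bits)

w = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # encodes +1,-1,+1,+1,-1,-1,+1,-1
a = np.array([1, 1, 0, 1, 0, 1, 1, 0])
# Cross-check against the explicit +/-1 dot product.
assert xnor_accumulate(w, a) == int(((2 * w - 1) * (2 * a - 1)).sum())
print(xnor_accumulate(w, a))
```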
4

IN-MEMORY COMPUTING WITH CMOS AND EMERGING MEMORY TECHNOLOGIES

Shubham Jain (7464389) 17 October 2019 (has links)
Modern computing workloads such as machine learning and data analytics perform simple computations on large amounts of data. Traditional von Neumann computing systems, which consist of separate processor and memory subsystems, are inefficient at realizing modern computing workloads due to frequent data transfers between these subsystems, which incur significant time and energy costs. In-memory computing embeds computational capabilities within the memory subsystem to alleviate the fundamental processor-memory bottleneck, thereby achieving substantial system-level performance and energy benefits. In this dissertation, we explore a new generation of in-memory computing architectures enabled by emerging memory technologies and new CMOS-based memory cells. The proposed designs realize Boolean and non-Boolean computations natively within memory arrays.

For Boolean computing, we leverage the unique characteristics of emerging memories that allow multiple word lines within an array to be simultaneously enabled, opening up the possibility of directly sensing functions of the values stored in multiple rows using a single access. We propose Spin-Transfer Torque Compute-in-Memory (STT-CiM), a design for in-memory computing with modifications to peripheral circuits that leverage this principle to perform logic, arithmetic, and complex vector operations. We address the challenge of reliable in-memory computing under process variations by utilizing error-detecting and error-correcting codes to control errors during CiM operations. We demonstrate how STT-CiM can be integrated within a general-purpose computing system and propose architectural enhancements to processor instruction sets and on-chip buses for in-memory computing.

For non-Boolean computing, we explore crossbar arrays of resistive memory elements, which are known to compactly and efficiently realize a key primitive operation of machine learning algorithms: vector-matrix multiplication. We highlight a key challenge of this approach: the function actually computed by a resistive crossbar can deviate substantially from the desired vector-matrix multiplication due to a range of device- and circuit-level non-idealities. It is essential to evaluate the impact of the errors introduced by these non-idealities at the application level. There has been no study of the impact of non-idealities on the accuracy of large-scale workloads (e.g., deep neural networks [DNNs] with millions of neurons and billions of synaptic connections), in part because existing device and circuit models are too slow to use in application-level evaluation. We propose a Fast Crossbar Model (FCM) that accurately captures the errors arising from crossbar non-idealities while being four to five orders of magnitude faster than circuit simulation. We also develop RxNN, a software framework to evaluate DNN inference on resistive crossbar systems. Using RxNN, we evaluate a suite of large-scale DNNs developed for the ImageNet Challenge (ILSVRC). Our evaluations reveal that the errors due to resistive crossbar non-idealities can degrade the overall accuracy of DNNs considerably, motivating the need for compensation techniques. Subsequently, we propose CxDNN, a hardware-software methodology that enables the realization of large-scale DNNs on crossbar systems with minimal loss of accuracy by compensating for the errors due to non-idealities. CxDNN comprises (i) an optimized mapping technique to convert floating-point weights and activations to crossbar conductances and input voltages, (ii) a fast retraining method to recover the accuracy lost in this conversion, and (iii) low-overhead compensation hardware to mitigate dynamic and hardware-instance-specific errors. Unlike previous efforts, which are limited to small networks and require training and deploying hardware-instance-specific models, CxDNN offers a scalable compensation methodology that can handle large DNNs (e.g., ResNet-50 on ImageNet) and enables a common model to be trained and deployed on many devices.

Also for non-Boolean computing, we propose TiM-DNN, a programmable hardware accelerator specifically designed to execute ternary DNNs. TiM-DNN supports various ternary representations, including unweighted (-1,0,1), symmetric weighted (-a,0,a), and asymmetric weighted (-a,0,b) ternary systems. TiM-DNN is an in-memory accelerator built from TiM tiles: specialized memory arrays that perform massively parallel signed vector-matrix multiplications on ternary values per access. TiM tiles are in turn composed of Ternary Processing Cells (TPCs), new CMOS-based memory cells that function as both ternary storage units and signed scalar multiplication units. We evaluate an implementation of TiM-DNN in 32nm technology using an architectural simulator calibrated with SPICE simulation and RTL synthesis. TiM-DNN achieves a peak performance of 114 TOPS, consumes 0.9 W, and occupies 1.96 mm² of chip area, representing a 300X improvement in TOPS/W compared to a state-of-the-art NVIDIA Tesla V100 GPU. In comparison to popular quantized-DNN accelerators, TiM-DNN achieves 55.2X-240X and 160X-291X improvements in TOPS/W and TOPS/mm², respectively.

In summary, the dissertation proposes new in-memory computing architectures and addresses the need for scalable modeling frameworks and compensation techniques for resistive-crossbar-based in-memory computing fabrics. Our evaluations show that in-memory computing architectures are promising for realizing modern machine learning and data analytics workloads, and can attain orders-of-magnitude improvements in system-level energy and performance over traditional von Neumann computing systems.
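
To illustrate the kind of non-ideality modeling at issue, here is a toy crossbar vector-matrix multiply with two invented perturbations, device-conductance variation and a crude first-order IR-drop term. These are placeholders chosen for illustration, not the actual FCM equations, which the abstract does not give.

```python
import numpy as np

rng = np.random.default_rng(1)

def crossbar_vmm(V, G, wire_r=0.5, sigma=0.05):
    """Vector-matrix multiply on a resistive crossbar: ideally I = V @ G,
    with input voltages V and a conductance matrix G. Two illustrative
    non-idealities (assumed, not the FCM's actual model):
    - device variation: each conductance deviates log-normally,
    - IR drop: column outputs droop with total column conductance."""
    G_real = G * rng.lognormal(mean=0.0, sigma=sigma, size=G.shape)
    I_ideal = V @ G_real
    droop = 1.0 / (1.0 + wire_r * G_real.sum(axis=0))  # per-column factor
    return I_ideal * droop

V = rng.uniform(0, 1, size=8)           # input voltages (activations)
G = rng.uniform(0, 1e-1, size=(8, 4))   # conductances (weights)
print("ideal:    ", V @ G)
print("non-ideal:", crossbar_vmm(V, G))
```

Even this toy model shows why application-level evaluation matters: the deviation compounds across millions of dot products in a large DNN.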
5

Compute-in-Memory Primitives for Energy-Efficient Machine Learning

Amogh Agrawal (10506350) 26 July 2021 (has links)
Machine Learning (ML) workloads, being memory- and compute-intensive, consume large amounts of power on conventional computing systems, restricting their implementation to large-scale data centers. There is thus a need for domain-specific hardware primitives for energy-efficient ML processing at the edge. One such approach is in-memory computing, which eliminates frequent and unnecessary data transfers between the memory and the compute units by computing directly where the data is stored. Most of the chip area is consumed by on-chip SRAMs, in conventional von Neumann systems (e.g., CPUs/GPUs) as well as in application-specific ICs (e.g., TPUs). We therefore propose various circuit techniques that enable a range of computations, including bitwise Boolean and arithmetic operations, binary convolutions, non-Boolean dot products, lookup-table-based computations, and spiking neural network implementations, all within standard SRAM memory arrays.

First, we propose X-SRAM, where, by using skewed sense amplifiers, bitwise Boolean operations such as NAND/NOR/XOR/IMP can be performed within 6T and 8T SRAM arrays. Moreover, exploiting the decoupled read/write ports of 8T SRAMs, we propose a read-compute-store scheme in which the computed data can simultaneously be written back into the array.

Second, we propose Xcel-RAM, showing how binary convolutions can be enabled in 10T SRAM arrays to accelerate binary neural networks. We present a charge-sharing approach for performing XNOR operations followed by a population count (popcount), using both analog and digital techniques, and highlight the accuracy-energy trade-off.

Third, we take this concept further with CASH-RAM, which accelerates non-Boolean operations, such as dot products, within standard 8T SRAM arrays by utilizing the parasitic capacitances of bitlines and sourcelines. We analyze the non-idealities that arise from the analog computation and propose a self-compensation technique that reduces their effects, thereby reducing the errors.

Fourth, we propose ROM-embedded caches, RECache, using standard 8T SRAMs, useful for lookup-table (LUT) based computations. We show that just by adding an extra word line (WL) or source line (SL), the same bit-cell can store a ROM bit as well as the usual RAM bit while maintaining performance and area efficiency, thereby doubling the memory density. We further propose SPARE, an in-memory distributed processing architecture built on RECache for accelerating spiking neural networks (SNNs), which often require high-order polynomials and transcendental functions to solve complex neuro-synaptic models.

Finally, we propose IMPULSE, a 10T-SRAM compute-in-memory (CIM) macro specifically designed for state-of-the-art SNN inference. The inherent dynamics of the neuron membrane potential in SNNs allow sequential learning tasks to be processed without the complexity of recurrent neural networks, and the highly sparse spike-based computation over such spatio-temporal data can be leveraged for energy efficiency. However, the membrane potential incurs additional memory-access bottlenecks in current SNN hardware. IMPULSE tries to tackle these challenges. It consists of a fused weight (WMEM) and membrane-potential (VMEM) memory and inherently exploits sparsity in the input spikes. We propose staggered data mapping and reconfigurable peripherals to handle the different bit-precision requirements of WMEM and VMEM while supporting multiple neuron functionalities. The proposed macro was fabricated in 65nm CMOS technology. We demonstrate a sentiment-classification task on the IMDB movie-review dataset and show that the SNN achieves competitive accuracy with only a fraction of the trainable parameters and effective operations of an LSTM network.

These explorations of embedding computation in standard memory structures show that on-chip SRAMs can do much more than store data: they can be re-purposed as on-demand accelerators for a variety of applications.
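
For readers unfamiliar with the membrane-potential dynamics that IMPULSE targets, below is a minimal functional sketch of a leaky-integrate-and-fire (LIF) update that skips non-spiking inputs, the kind of input sparsity a CIM macro can exploit. The leak factor, threshold, dimensions, and weights are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def lif_step(v_mem, spikes, weights, leak=0.9, v_th=1.0):
    """One timestep of leaky-integrate-and-fire neurons. Only weight
    columns whose input spiked contribute, so sparse spike vectors let
    most memory accesses be skipped. Illustrative software model only."""
    active = np.flatnonzero(spikes)              # skip non-spiking inputs
    v_mem = leak * v_mem + weights[:, active].sum(axis=1)
    out = (v_mem >= v_th).astype(np.int8)        # fire at threshold
    v_mem = np.where(out == 1, 0.0, v_mem)       # reset fired neurons
    return v_mem, out

n_in, n_out = 64, 16
W = rng.normal(0, 0.1, size=(n_out, n_in))
v = np.zeros(n_out)
for t in range(5):
    s = (rng.random(n_in) < 0.05).astype(np.int8)  # ~5% of inputs spike
    v, out = lif_step(v, s, W)
    print(f"t={t} fired:", np.flatnonzero(out))
```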
6

A Research Into In-memory Computing In Multidimensional, Multi-granularity Data Mining ― With Healthcare Services Innovation

Chu, Chia Chi Unknown Date (has links)
Facing global population aging and continuing population growth, demand for healthcare services keeps rising. In healthcare, association-rule data mining is commonly used to uncover knowledge hidden in large medical databases in order to support clinical decisions or innovative healthcare services. With new healthcare services and applications (e.g., electronic health records and mobile healthcare) constantly emerging, and with medical institutions required by government policy to preserve large volumes of patient data for long periods, the healthcare domain faces the question of how to process huge amounts of data effectively. Traditional association-rule algorithms, however, are severely limited in performance. Many studies have therefore implemented association-rule algorithms in distributed environments on the Hadoop MapReduce framework to process huge data volumes in parallel, which indeed yields large speedups over single-node computation. In practice, however, MapReduce is not well suited to association-rule algorithms, which require intensive iterative computation. This study uses the Spark in-memory computing framework to mine multidimensional, multi-granularity association rules in parallel on a distributed cluster. The experimental results can be summarized in three points. First, when the data is small, parallelization brings little benefit, since the dataflow is split into Map and Reduce stages. Second, when the data is large, the parallel strategy differs dramatically from the single-machine version, with overall running times differing by a factor of 100; moreover, once the number of items exceeds 10,000, the single-machine version can no longer run due to insufficient memory, while the parallel strategy still can. Third, although Spark is slightly slower than the single-machine version on small-scale processing, its running time is still less than one quarter of Hadoop's, and on large-scale processing Spark remains faster than the Hadoop version. For large-scale data, therefore, Spark is the best solution in terms of both computational performance and scaling flexibility.
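
For flavor, here is a minimal PySpark example of parallel association-rule mining. It uses Spark's built-in FP-Growth rather than the thesis's own multidimensional, multi-granularity algorithm, and the transaction data (stand-in diagnosis codes) and thresholds are invented.

```python
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("assoc-rules").getOrCreate()

# Hypothetical transactions: each row is one patient visit's set of codes.
df = spark.createDataFrame([
    (0, ["E11", "I10", "Z79"]),
    (1, ["E11", "I10"]),
    (2, ["I10", "Z79"]),
    (3, ["E11", "Z79"]),
], ["id", "items"])

fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp.fit(df)               # frequent-itemset mining runs in parallel
model.freqItemsets.show()        # frequent code combinations
model.associationRules.show()    # rules such as E11 -> I10 with confidence
```

Because Spark keeps the conditional FP-trees and intermediate results in cluster memory, the iterative passes that cripple MapReduce implementations avoid repeated disk I/O.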
7

A high-throughput in-memory index, durable on flash-based SSD

Kissinger, Thomas, Schlegel, Benjamin, Böhm, Matthias, Habich, Dirk, Lehner, Wolfgang 14 February 2013 (has links)
Growing memory capacities and the increasing number of cores on modern hardware enforce the design of new in-memory indexing structures that reduce the number of memory transfers and minimize the need for locking, to allow massively parallel access. However, most applications have hard durability constraints requiring a persistent medium such as SSDs, which shorten the latency and throughput gap between main memory and hard disks. In this paper, we present our winning solution of the SIGMOD Programming Contest 2011. It consists of an in-memory indexing structure that provides balanced read/write performance as well as non-blocking reads and single-lock writes. Complementary to this index, we describe an SSD-optimized logging approach that meets hard durability requirements at a high throughput rate.
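
As a toy analogue of the concurrency and durability contract described here (the contest solution itself is a native index structure, and this sketch only mirrors the non-blocking-read, single-lock-write, log-before-apply pattern with invented names):

```python
import os
import threading

class DurableIndex:
    """Toy index: reads take no lock; writes serialize on a single lock
    and append to a log before updating the in-memory structure, so
    committed updates can be replayed from the log after a crash. A real
    implementation would batch fsyncs to match SSD throughput."""

    def __init__(self, log_path):
        self._data = {}                      # in-memory index
        self._write_lock = threading.Lock()  # single-lock writes
        self._log = open(log_path, "ab")

    def get(self, key):
        return self._data.get(key)           # non-blocking read

    def put(self, key, value):
        with self._write_lock:
            self._log.write(f"{key}\t{value}\n".encode())
            self._log.flush()
            os.fsync(self._log.fileno())      # durable on the SSD
            self._data[key] = value           # publish after logging

idx = DurableIndex("/tmp/index.log")
idx.put("k1", "v1")
print(idx.get("k1"))
```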
8

A Research into In-Memory Computing Techniques for Intelligent Check of Health-Insurance Fraud

Tang, Jia Jhe Unknown Date (has links)
Taiwan's National Health Insurance (NHI) has been in poor financial condition in recent years, with a deficit of NT$58.2 billion in 2009. According to data from the National Health Insurance Administration (NHIA), contracted medical institutions have to date violated NHI regulations a cumulative 13,722 times, and most of the major violations involve fraud. The NHI review mechanism draws random samples by computer and then investigates them manually. However, this sampling approach cannot effectively capture violating medical institutions, so the reviews are ineffective. Benford's Law, also known as the first-digit law, holds that smaller first digits occur more frequently and larger ones less frequently; it has been applied in accounting, finance, auditing, and economics. Yang (2012) applied Benford's Law indicators to Taiwan's NHI data and combined them with machine-learning algorithms for anomaly detection. Zaharia et al. (2012) proposed Apache Spark, a fault-tolerant in-memory cluster computing model that, on the same computing nodes and resources, can process data more than 20 times faster than Hadoop MapReduce. To address the ineffectiveness of NHI anomaly checks, this study adopts Benford's Law, computing Benford's Law indicators and practical indicators from the NHI data published by the National Health Research Institutes, and then builds anomaly-check models using support vector machines and logistic regression. Because the volume of NHI data is large, this study uses Apache Spark as the computing environment to reduce computation time, with Hadoop MapReduce as a benchmark for comparison. The results show that the Spark program runs twice as fast as MapReduce. For the classification models, both the support vector machine and logistic regression achieve sensitivity above 80% on inpatient data; on outpatient data the two models are less accurate than on inpatient data, but logistic regression remains reasonably accurate, with 75% sensitivity and 73% overall accuracy. In summary, this study uses Apache Spark to cut the computation time needed to process large volumes of NHI data, and the intelligent anomaly-check model it builds can indeed identify medical institutions that breach their contracts. Institutions the model flags for possible fraud or abuse of the NHI can proceed to the manual-investigation stage, ultimately improving the effectiveness of NHI reviews.
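
A minimal sketch of one kind of Benford's Law indicator, the chi-square deviation of observed first-digit frequencies from the Benford distribution, which could feed a fraud classifier as a feature. The synthetic claim amounts and this particular indicator choice are illustrative assumptions, not the thesis's exact feature set.

```python
import numpy as np

def benford_deviation(amounts):
    """Chi-square deviation of observed first-digit frequencies from
    Benford's Law: P(d) = log10(1 + 1/d) for leading digit d in 1..9."""
    digits = np.array([int(str(abs(x)).lstrip("0.")[0])
                       for x in amounts if float(x) != 0])
    observed = np.bincount(digits, minlength=10)[1:10] / len(digits)
    expected = np.log10(1 + 1 / np.arange(1, 10))  # Benford probabilities
    return ((observed - expected) ** 2 / expected).sum() * len(digits)

# Synthetic claim amounts; naturally occurring amounts tend to follow
# Benford, so a large deviation is a flag worth investigating.
claims = np.random.default_rng(3).lognormal(mean=8, sigma=1.2, size=5000)
print("chi-square vs Benford:", round(benford_deviation(claims), 2))
```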
9

A high-throughput in-memory index, durable on flash-based SSD: Insights into the winning solution of the SIGMOD programming contest 2011

Kissinger, Thomas, Schlegel, Benjamin, Böhm, Matthias, Habich, Dirk, Lehner, Wolfgang January 2012 (has links)
Growing memory capacities and the increasing number of cores on modern hardware enforce the design of new in-memory indexing structures that reduce the number of memory transfers and minimize the need for locking, to allow massively parallel access. However, most applications have hard durability constraints requiring a persistent medium such as SSDs, which shorten the latency and throughput gap between main memory and hard disks. In this paper, we present our winning solution of the SIGMOD Programming Contest 2011. It consists of an in-memory indexing structure that provides balanced read/write performance as well as non-blocking reads and single-lock writes. Complementary to this index, we describe an SSD-optimized logging approach that meets hard durability requirements at a high throughput rate.
