Global ETD Search

11	Enabling Full-Fledged Parallelism on Intermittently Powered Computing Akhunov, Khakim 24 June 2024 (has links) Energy-harvesting batteryless devices exploit power from various sources, such as radio waves, sunlight, and vibration. However, the sporadic availability of ambient energy causes frequent power failures, forcing the systems to operate intermittently. The computation interruptions violate forward progress and memory consistency. State-of-the-art solutions have proposed multiple mature approaches for intermittent computing to provide both application termination guarantees and consistent and idempotent results. Some solutions propose so-called just-in-time (JIT) checkpoints, where dedicated hardware is used to constantly monitor available energy and warn the system when the energy level in the energy buffer reaches critical points. These points indicate potential power failures before which the system must back up its architectural state. Other solutions propose placing checkpoints in the program code at compile time based on the energy consumption of code execution between checkpoints. A power failure can occur at any time during execution, but the computation recovers from the recent checkpoint. Instead of explicitly placing checkpoints, another set of solutions assumes the software developers split the application into failure-atomic tasks directly manipulating non-volatile memory. The common condition in task-based intermittent programming is to keep the energy consumption of each task within the capacity of the energy buffer. While efficient, the proposed solutions target off-the-shelf single-core ultra-low-power microcontrollers (MCUs) with limited flexibility and performance capability. These MCUs are energy-efficient and ideal for performing low-cost tasks. On the other hand, contemporary compute- and data-intensive, parallelizable applications demand the execution of high-cost tasks on edge devices. The reason is that sending large amounts of raw sensor data wirelessly to offload the intensive tasks to the cloud is too energy-inefficient, especially for energy-harvesting devices. Four critical limitations prevent the use of advanced multicore devices and emerging technologies for the efficient execution of modern applications on ultra-low-power batteryless edges. First, in existing systems, programmers need to exploit underlying parallelism manually by interacting directly with low-power accelerators, which is cumbersome. Programmable general-purpose multicore platforms provide the highest degree of flexibility, but the intermittent computing community has overlooked them so far. Existing intermittent computing runtimes do not support parallelism or provide language constructs to express parallelizable code blocks. Second, the availability of energy and the strength of incoming power affect an intermittent system's charging and discharging cyclical nature. When incoming power is strong enough, the device charges rapidly and spends more time on computation. Similarly, low input power forces the system to spend more time collecting energy than computing. To respond to ambient power dynamics and increase throughput, existing works have proposed workload, accuracy, voltage, frequency, and computational unit scaling techniques. However, the solutions work on a fixed hardware configuration, and target systems are limited by the performance of a single-core processor without employing available degrees of application parallelism. Third, existing low-power multicore platforms are not designed for intermittent computing. Their internal non-volatile flash memories are not suitable for intermittent computing because they have high energy requirements, low speed, and limited write endurance. The only way to exploit current low-power multicore platforms for intermittent computing is to introduce an external non-volatile memory, such as FRAM. However, this architectural configuration is very inefficient as compared to embedded FRAM due to its significant energy overhead, making backup and recovery operations energy-expensive. Finally, using emerging memories, e.g., MRAM, as an external non-volatile memory allows for in-memory processing (PIM) of data-intensive computations, eliminating unnecessary data movement and enabling data-level parallelism. While inherently idempotent, such in-memory computation is hard to integrate into traditional MCU-based intermittent systems. Successful integration lacks the effective maintenance of data flow and computation in a power failure-resilient manner. In this thesis, we tackle the limitations. In Chapter 3, we introduce AdaMICA, an intermittent computing runtime that supports parallel intermittent multicore computing and provides the highest degree of flexibility of programmable general-purpose multiple cores. AdaMICA adaptively switches to the best multicore configuration considering the dynamic input power. Therefore, it allows an intermittent system to benefit from workload parallelization, thereby increasing systems throughput and decreasing end-to-end delay while considering the energy availability. Chapter 4 presents PEARL, a power- and energy-aware multicore intermittent computing that enables, for the first time, the efficient adaptation of the common off-the-shelf low-power multicore microcontroller platforms to the intermittent computing paradigm. PEARL features a novel backup policy that significantly reduces the number of accesses to non-volatile memory on multicore platforms. PEARL benefits from multicore power-aware adaptation to adjust the underlying hardware architecture and exploits energy awareness to transition an intermittent system to ultra-low-power mode, retaining memory content. In Chapter 6, we address emerging non-volatile memory, CRAM (Computational RAM), presenting PiMCo and LUTIC, novel programmable CRAM-based in-memory coprocessors that facilitate the power-failure resilient execution of parallelizable computational loads. The coprocessors are pluggable into and controlled by a general-purpose MCU via a standard communication protocol. In Chapter 7, we propose Viadotto, a novel adaptive intermittent computing system that bridges the gap between existing MCU-based intermittent systems and the emerging compute-in-memory paradigm. Viadotto introduces a high-level programming model supported by its compiler, software library, and power failure-resilient memory controller, hiding detailed low-level logic operations and data flow management in CRAM from programmers. Viadotto exploits adaptation by controlling data-level parallelism with respect to the ambient power level. In essence, this thesis addresses several pivotal challenges to enabling full-fledged parallelism on ultra-low-power batteryless devices. Hence, we have made a significant step towards the efficient deployment of modern complex applications on energy-harvesting systems. Intermittent computing Batteryless system Multicore system Parallelism Processing in-memory Computing in-memory
12	<b>PROCESSING IN MEMORY DESIGN AND OPTIMIZATIONS FOR MACHINE LEARNING INFERENCE</b> Mingxuan He (19759866) 22 October 2024 (has links) <p dir="ltr">Advances in machine learning (ML) have ignited hardware innovations for efficient execution of the ML models many of which are memory-bound (e.g., long short-term memories, multi-level perceptrons, and recurrent neural networks). Specifically, inference using these ML models with small batches, as would be the case at the Cloud edge, has little reuse of the large filters and is deeply memory-bound. Simultaneously, processing-in or -near memory (PIM or PNM) is promising unprecedented highbandwidth connection between compute and memory. Fortunately, the memory-bound ML models are a good fit for PIM. We focus on digital PIM which provides higher bandwidth than PNM and does not incur the reliability issues of analog PIM. Previous PIM and PNM approaches advocate full processor cores which do not conform to PIM’s severe area and power constraints. This thesis is composed of three major projects: Newton, activation folding (AcF) and ESPIM. Newton is Sk Hynix’s first accelerator-inmemory (AiMX) product for machine learning, AcF improves the performance of Newton by achieving more compute-row access overlap and ESPIM incorporate sparse neural network models to PIM</p> computer architecture research Processing in Memory machine learning accelerates
13	Stavové zpracování síťových toků / Stateful Processing of Network Flows Košek, Martin Unknown Date (has links) Modern network traffic processing became a challenging task as there are increasing demands on network security devices. Packet-level processing is not sufficient for advanced network traffic analysis and it is necessary to design processing over entire network flows. Stateful processing in software does not offer enough performance for high-speed networks over 10 Gbps and therefore acceleration in hardware should be utilized. Currently there exists no universal platform for stateful processing in hardware and this task has to be implemented individually. Utilization of such platform significantly speed-up development of stateful network applications. This master thesis analyzes all aspects of stateful network processing platform design. Component based architecture increases platform flexibility and ability to optimize for chosen network applications.
14	IN-MEMORY COMPUTING WITH CMOS AND EMERGING MEMORY TECHNOLOGIES Shubham Jain (7464389) 17 October 2019 (has links) Modern computing workloads such as machine learning and data analytics perform simple computations on large amounts of data. Traditional von Neumann computing systems, which consist of separate processor and memory subsystems, are inefficient in realizing modern computing workloads due to frequent data transfers between these subsystems that incur significant time and energy costs. In-memory computing embeds computational capabilities within the memory subsystem to alleviate the fundamental processor-memory bottleneck, thereby achieving substantial system-level performance and energy benefits. In this dissertation, we explore a new generation of in-memory computing architectures that are enabled by emerging memory technologies and new CMOS-based memory cells. The proposed designs realize Boolean and non-Boolean computations natively within memory arrays.<br><div><br></div><div>For Boolean computing, we leverage the unique characteristics of emerging memories that allow multiple word lines within an array to be simultaneously enabled, opening up the possibility of directly sensing functions of the values stored in multiple rows using single access. We propose Spin-Transfer Torque Compute-in-Memory (STT-CiM), a design for in-memory computing with modifications to peripheral circuits that leverage this principle to perform logic, arithmetic, and complex vector operations. We address the challenge of reliable in-memory computing under process variations utilizing error detecting and correcting codes to control errors during CiM operations. We demonstrate how STT-CiM can be integrated within a general-purpose computing system and propose architectural enhancements to processor instruction sets and on-chip buses for in-memory computing. <br></div><div><br></div><div>For non-Boolean computing, we explore crossbar arrays of resistive memory elements, which are known to compactly and efficiently realize a key primitive operation involved in machine learning algorithms, i.e., vector-matrix multiplication. We highlight a key challenge involved in this approach - the actual function computed by a resistive crossbar can deviate substantially from the desired vector-matrix multiplication operation due to a range of device and circuit level non-idealities. It is essential to evaluate the impact of the errors introduced by these non-idealities at the application level. There has been no study of the impact of non-idealities on the accuracy of large-scale workloads (e.g., Deep Neural Networks [DNNs] with millions of neurons and billions of synaptic connections), in part because existing device and circuit models are too slow to use in application-level evaluation. We propose a Fast Crossbar Model (FCM) to accurately capture the errors arising due to crossbar non-idealities while being four-to-five orders of magnitude faster than circuit simulation. We also develop RxNN, a software framework to evaluate DNN inference on resistive crossbar systems. Using RxNN, we evaluate a suite of large-scale DNNs developed for the ImageNet Challenge (ILSVRC). Our evaluations reveal that the errors due to resistive crossbar non-idealities can degrade the overall accuracy of DNNs considerably, motivating the need for compensation techniques. Subsequently, we propose CxDNN, a hardware-software methodology that enables the realization of large-scale DNNs on crossbar systems with minimal degradation in accuracy by compensating for errors due to non-idealities. CxDNN comprises of (i) an optimized mapping technique to convert floating-point weights and activations to crossbar conductances and input voltages, (ii) a fast re-training method to recover accuracy loss due to this conversion, and (iii) low-overhead compensation hardware to mitigate dynamic and hardware-instance-specific errors. Unlike previous efforts that are limited to small networks and require the training and deployment of hardware-instance-specific models, CxDNN presents a scalable compensation methodology that can address large DNNs (e.g., ResNet-50 on ImageNet), and enables a common model to be trained and deployed on many devices. <br></div><div><br></div><div>For non-Boolean computing, we also propose TiM-DNN, a programmable hardware accelerator that is specifically designed to execute ternary DNNs. TiM-DNN supports various ternary representations including unweighted (-1,0,1), symmetric weighted (-a,0,a), and asymmetric weighted (-a,0,b) ternary systems. TiM-DNN is an in-memory accelerator designed using TiM tiles --- specialized memory arrays that perform massively parallel signed vector-matrix multiplications on ternary values per access. TiM tiles are in turn composed of Ternary Processing Cells (TPCs), new CMOS-based memory cells that function as both ternary storage units and signed scalar multiplication units. We evaluate an implementation of TiM-DNN in 32nm technology using an architectural simulator calibrated with SPICE simulation and RTL synthesis. TiM-DNN achieves a peak performance of 114 TOPs/s, consumes 0.9W power, and occupies 1.96mm2 chip area, representing a 300X improvement in TOPS/W compared to a state-of-the-art NVIDIA Tesla V100 GPU. In comparison to popular quantized DNN accelerators, TiM-DNN achieves 55.2X-240X and 160X-291X improvement in TOPS/W and TOPS/mm2, respectively.<br></div><div><br></div><div>In summary, the dissertation proposes new in-memory computing architectures as well as addresses the need for scalable modeling frameworks and compensation techniques for resistive crossbar based in-memory computing fabrics. Our evaluations show that in-memory computing architectures are promising for realizing modern machine learning and data analytics workloads, and can attain orders-of-magnitude improvement in system-level energy and performance over traditional von Neumann computing systems. <br></div> Computer Engineering In-memory Computing Processing-in-Memory Resistive Crossbar Deep Neural Network Emerging memories STT-MRAM Spintronics
15	Dataflow Processing in Memory Achieves Significant Energy Efficiency Shelor, Charles F. 08 1900 (has links) The large difference between processor CPU cycle time and memory access time, often referred to as the memory wall, severely limits the performance of streaming applications. Some data centers have shown servers being idle three out of four clocks. High performance instruction sequenced systems are not energy efficient. The execute stage of even simple pipeline processors only use 9% of the pipeline's total energy. A hybrid dataflow system within a memory module is shown to have 7.2 times the performance with 368 times better energy efficiency than an Intel Xeon server processor on the analyzed benchmarks. The dataflow implementation exploits the inherent parallelism and pipelining of the application to improve performance without the overhead functions of caching, instruction fetch, instruction decode, instruction scheduling, reorder buffers, and speculative execution used by high performance out-of-order processors. Coarse grain reconfigurable logic in an energy efficient silicon process provides flexibility to implement multiple algorithms in a low energy solution. Integrating the logic within a 3D stacked memory module provides lower latency and higher bandwidth access to memory while operating independently from the host system processor. Computer architecture dataflow processing in memory coarse grain reconfigurable logic energy efficient 3D memory Computer Science Data flow computing. Memory management (Computer science) Computer systems -- Energy consumption.
16	AL: Unified Analytics in Domain Specific Terms Luong, Johannes, Habich, Dirk, Lehner, Wolfgang 13 June 2022 (has links) Data driven organizations gather information on various aspects of their endeavours and analyze that information to gain valuable insights or to increase automatization. Today, these organizations can choose from a wealth of specialized analytical libraries and platforms to meet their functional and non-functional requirements. Indeed, many common application scenarios involve the combination of multiple such libraries and platforms in order to provide a holistic perspective. Due to the scattered landscape of specialized analytical tools, this integration can result in complex and hard to evolve applications. In addition, the necessary movement of data between tools and formats can introduce a serious performance penalty. In this article we present a unified programming environment for analytical applications. The environment includes AL, a programming language that combines concepts of various common analytical domains. Further, the environment also includes a flexible compilation system that uses a language-, domain-, and platform independent program intermediate representation to separate high level application logic and physical organisation. We provide a detailed introduction of AL, establish our program intermediate representation as a generally useful abstraction, and give a detailed explanation of the translation of AL programs into workloads for our experimental shared-memory processing engine. info:eu-repo/classification/ddc/004 ddc:004
17	Real-time Business Intelligence through Compact and Efficient Query Processing Under Updates Idris, Muhammad 10 April 2019 (has links) Responsive analytics are rapidly taking over the traditional data analytics dominated by the post-fact approaches in traditional data warehousing. Recent advancements in analytics demand placing analytical engines at the forefront of the system to react to updates occurring at high speed and detect patterns, trends and anomalies. These kinds of solutions find applications in Financial Systems, Industrial Control Systems, Business Intelligence and on-line Machine Learning among others. These applications are usually associated with Big Data and require the ability to react to constantly changing data in order to obtain timely insights and take proactive measures. Generally, these systems specify the analytical results or their basic elements in a query language, where the main task then is to maintain these results under frequent updates efficiently. The task of reacting to updates and analyzing changing data has been addressed in two ways in the literature: traditional business intelligence (BI) solutions focus on historical data analysis where the data is refreshed periodically and in batches, and stream processing solutions process streams of data from transient sources as flow (or set of flows) of data items. Both kinds of systems share the niche of reacting to updates (known as dynamic evaluation); however, they differ in architecture, query languages, and processing mechanisms. In this thesis, we investigate the possibility of a reactive and unified framework to model queries that appear in both kinds of systems. In traditional BI solutions, evaluating queries under updates has been studied under the umbrella of incremental evaluation of updates that is based on relational incremental view maintenance model and mostly focus on queries that feature equi-joins. Streaming systems, in contrast, generally follow the automaton based models to evaluate queries under updates, and they generally process queries that mostly feature comparisons of temporal attributes (e.g., timestamp attributes) along-with comparisons of non-temporal attributes over streams of bounded sizes. Temporal comparisons constitute inequality constraints, while non-temporal comparisons can either be equality or inequality constraints, hence these systems mostly process inequality joins. As starting point, we postulate the thesis that queries in streaming systems can also be evaluated efficiently based on the paradigm of incremental evaluation just like in BI systems in a main-memory model. The efficiency of such a model is measured in terms of runtime memory footprint and the update processing cost. To this end, the existing approaches of dynamic evaluation in both kind of systems present a trade-off between memory footprint and the update processing cost. More specifically, systems that avoid materialization of query (sub) results incur high update latency and systems that materialize (sub) results incur high memory footprint. We are interested in investigating the possibility to build a model that can address this trade-off. In particular, we overcome this trade-off by investigating the possibility of practical dynamic evaluation algorithm for queries that appear in both kinds of systems, and present a main-memory data representation that allows to enumerate query (sub) results without materialization and can be maintained efficiently under updates. We call this representation the Dynamic Constant Delay Linear Representation (DCLR). We devise DCLRs with the following properties: 1) they allow, without materialization, enumeration of query results with bounded-delay (and with constant delay for a sub-class of queries); 2) they allow tuple lookup in query results with logarithmic delay (and with constant delay for conjunctive queries with equi-joins only); 3) they take space linear in the size of the database; 4) they can be maintained efficiently under updates. We first study the DCLRs with the above-described properties for the class of acyclic conjunctive queries featuring equi-joins with projections and present the dynamic evaluation algorithm. Then, we present the generalization of thiw algorithm to the class of acyclic queries featuring multi-way theta-joins with projections. We devise DCLRs with the above properties for acyclic conjunctive queries, and the working of dynamic algorithms over DCLRs is based on a particular variant of join trees, called the Generalized Join Trees (GJTs) that guarantee the above-described properties of DCLRs. We define GJTs and present the algorithms to test a conjunctive query featuring theta-joins for acyclicity and to generate GJTs for such queries. To do this, we extend the classical GYO algorithm from testing a conjunctive query with equalities for acyclicity to test a conjunctive query featuring multi-way theta-joins with projections for acyclicity. We further extend the GYO algorithm to generate GJTs for queries that are acyclic. We implemented our algorithms in a query compiler that takes as input the SQL queries and generates Scala executable code – a trigger program to process queries and maintain under updates. We tested our approach against state of the art main-memory BI and CEP systems. Our evaluation results have shown that our DCLRs based approach is over an order of magnitude efficient than existing systems for both memory footprint and update processing cost. We have also shown that the enumeration of query results without materialization in DCLRs is comparable (and in some cases efficient) as compared to enumerating from materialized query results. info:eu-repo/classification/ddc/004 ddc:004

Page generated in 0.0588 seconds