1 |
Modelling of communication protocols. Wei, K-K. January 1986.
No description available.
|
2 |
Techniques for Managing Irregular Control Flow on GPUs. Jad Hbeika (5929730), 25 June 2020.
The GPGPU is a highly multithreaded throughput architecture that can deliver high speed-up for regular applications while remaining energy efficient. In recent years, much effort has gone into tuning irregular applications and/or the GPU architecture to achieve similar benefits for irregular applications, as well as into extracting data parallelism from task-parallel applications. In this work we tackle both problems.

The first part of this work tackles the problem of control divergence in GPUs. The GPGPU's SIMT execution model is ineffective for workloads with irregular control flow because GPGPUs serialize the execution of divergent paths, which leads to a loss of thread-level parallelism (TLP). Prior work has focused on regrouping threads into new warps based on the control path they follow, creating different warps for the different paths, or running multiple narrower warps in parallel. While all previous solutions showed speedup for irregular workloads, they imposed some performance loss on regular workloads. In this work we propose a more fine-grained approach to exploit intra-warp convergence: rather than threads executing the same code path, opcode-convergent threads execute the same instruction, but with potentially different operands. Based on this new definition, we find that divergent control blocks within a warp exhibit substantial opcode convergence. We build a compiler that analyzes divergent blocks and identifies the common streams of opcodes, and we modify the GPU architecture so that these common instructions are executed as convergent instructions. Using software simulation, we achieve a 17% speedup over the baseline GPGPU for irregular workloads and do not incur any performance loss on regular workloads.

In the second part we suggest techniques for extracting data parallelism from irregular, task-parallel applications in order to take advantage of the massive parallelism provided by the GPU. Our technique divides each task into multiple sub-tasks, each performing less work and touching a smaller memory footprint. Our framework performs locality-aware scheduling that works to minimize the memory footprint of each warp (a set of threads executing in lock-step). We evaluate our framework on three task-parallel benchmarks and show that we can achieve significant speedups over optimized GPU code.
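As a rough illustration of the compiler analysis described above (not the thesis's actual implementation), the following sketch aligns the opcode streams of two divergent blocks with a longest-common-subsequence pass to find instructions that could be issued as opcode-convergent; the instruction representation and block format are invented for the example.

```python
# Illustrative sketch: find opcode-convergent instructions across two
# divergent blocks by aligning their opcode streams with an LCS pass.
# The (opcode, operands) tuples below are a hypothetical IR, not the
# thesis's actual compiler representation.

def common_opcode_stream(block_a, block_b):
    """Return the longest common subsequence of opcodes of two blocks.

    block_a, block_b: lists of (opcode, operands) tuples.
    """
    ops_a = [op for op, _ in block_a]
    ops_b = [op for op, _ in block_b]
    n, m = len(ops_a), len(ops_b)
    # Classic LCS dynamic program over the two opcode sequences.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if ops_a[i] == ops_b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Recover the aligned opcodes: these could execute as convergent
    # instructions with per-thread operands.
    i = j = 0
    convergent = []
    while i < n and j < m:
        if ops_a[i] == ops_b[j]:
            convergent.append(ops_a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return convergent

if __name__ == "__main__":
    then_block = [("ld", "r1"), ("mul", "r1,r2"), ("add", "r3,r1"), ("st", "r3")]
    else_block = [("ld", "r4"), ("add", "r5,r4"), ("st", "r5")]
    print(common_opcode_stream(then_block, else_block))  # ['ld', 'add', 'st']
```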
|
3 |
Inference Engine: A high efficiency accelerator for Deep Neural Networks. Aliasger Tayeb Zaidy (7043234), 12 October 2021.
Deep Neural Networks are state-of-the-art algorithms for various image and natural language processing tasks. These networks are composed of billions of operations working on an input to produce the desired result. Along with this computational complexity, these workloads are also massively parallel in nature. These inherent properties make deep neural networks an excellent target for custom acceleration. The main challenge faced by such accelerators is achieving a compromise between power consumption, software programmability, and resource utilization for the varied compute and data access patterns presented by DNN workloads. In this work, I present Inference Engine, a scalable and efficient DNN accelerator designed to be agnostic to the type of DNN workload. Inference Engine was designed to provide near-peak hardware resource utilization, minimize data transfer, and offer a programmer-friendly instruction set. Inference Engine scales at the level of individually programmable clusters, each of which contains several hundred compute resources. It provides an instruction set designed to exploit parallelism within the workload while also allowing freedom for compiler-based exploration of data access patterns.
|
4 |
TREE-BASED UNIDIRECTIONAL NEURAL NETWORKS FOR LOW-POWER COMPUTER VISION ON EMBEDDED DEVICES. Abhinav Goel (12468279), 27 April 2022.
Deep Neural Networks (DNNs) are a class of machine learning algorithms that are widely successful in various computer vision tasks. DNNs filter input images and videos with many convolution operations in each layer to extract high-quality features and achieve high accuracy. Although highly accurate, state-of-the-art DNNs usually require server-grade GPUs, and are too energy-, computation- and memory-intensive to be deployed on most devices. This is a significant problem because billions of mobile and embedded devices that do not contain GPUs are now equipped with high-definition cameras. Running DNNs locally on these devices enables applications such as emergency response and safety monitoring, because data cannot always be offloaded to the Cloud due to latency, privacy, or network bandwidth constraints.
Prior research has shown that a considerable number of a DNN's memory accesses and computations are redundant when performing computer vision tasks. Eliminating these redundancies will enable faster and more efficient DNN inference on low-power embedded devices. To reduce these redundancies, and thereby reduce the energy consumption of DNNs, this thesis proposes a novel Tree-based Unidirectional Neural Network (TRUNK) architecture. Instead of a single large DNN, multiple small DNNs in the form of a tree work together to perform computer vision tasks. The TRUNK architecture first finds the similarity between different object categories. Similar object categories are grouped into clusters. Similar clusters are then grouped into a hierarchy, creating a tree. The small DNNs at every node of TRUNK classify between different clusters. During inference, for an input image, once a DNN selects a cluster, another DNN further classifies among the children of the cluster (sub-clusters). The DNNs associated with other clusters are not used during the inference of that image. By doing so, only a small subset of the DNNs is used during inference, thus reducing redundant operations, memory accesses, and energy consumption. Since each intermediate classification reduces the search space of possible object categories in the image, the small, efficient DNNs still achieve high accuracy.
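The hierarchical inference described above can be illustrated with a short sketch: each tree node holds a small classifier that routes the image to one child cluster until a leaf with concrete categories is reached. The node structure and classifier interface below are assumptions for illustration, not the thesis's implementation.

```python
# Illustrative sketch of TRUNK-style hierarchical inference.
# TrunkNode and its `classifier` callable are hypothetical stand-ins for
# the small per-node DNNs described in the abstract.

class TrunkNode:
    def __init__(self, classifier, children=None, categories=None):
        self.classifier = classifier      # small DNN: image -> child index
        self.children = children or []    # sub-cluster nodes
        self.categories = categories      # object labels at a leaf

def trunk_infer(node, image):
    """Walk down the tree, invoking only the DNNs on the chosen path."""
    while node.children:
        child_index = node.classifier(image)   # pick one sub-cluster
        node = node.children[child_index]      # all other subtrees are skipped
    return node.categories[node.classifier(image)]

# Example with trivial stand-in "classifiers":
leaf_animals = TrunkNode(lambda img: 0, categories=["cat", "dog"])
leaf_vehicles = TrunkNode(lambda img: 1, categories=["car", "truck"])
root = TrunkNode(lambda img: 0, children=[leaf_animals, leaf_vehicles])
print(trunk_infer(root, image=None))  # "cat"
```

Because only the classifiers on the selected root-to-leaf path run, the DNNs for all other clusters are never loaded or executed for that image, which is where the memory-access and energy savings come from.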
In this thesis, we identify the computer vision applications and scenarios that are well suited for the TRUNK architecture. We develop methods to use TRUNK to improve the efficiency of the image classification, object counting, and object re-identification problems. We also present methods to adapt the TRUNK structure for different embedded/edge application contexts with different system architectures, accuracy requirements, and hardware constraints.
Experiments with TRUNK using several image datasets reveal the effectiveness of the proposed solution: it reduces the memory requirement by ∼50%, inference time by ∼65%, energy consumption by ∼65%, and the number of operations by ∼45% when compared with existing DNN architectures. These experiments are conducted on consumer-grade embedded systems: NVIDIA Jetson Nano, Raspberry Pi 3, and Raspberry Pi Zero. The TRUNK architecture has only marginal losses in accuracy when compared with the state-of-the-art DNNs.
|
5 |
MULTILINGUAL CYBERBULLYING DETECTION SYSTEM. Rohit Sidram Pawar (6613247), 11 June 2019.
As the use of social media has grown, so has the ability of its users to bully others. One prevalent form is cyberbullying, which occurs on social media sites such as Facebook©, WhatsApp©, and Twitter©. The past decade has witnessed a growth in cyberbullying, a form of bullying that occurs virtually through electronic means such as messaging, e-mail, online gaming, social media, or images sent to a mobile phone. This bullying is not limited to the English language; it occurs in other languages as well. Hence, it is of the utmost importance to detect cyberbullying in multiple languages. Since current approaches to identifying cyberbullying are mostly focused on English-language texts, this thesis proposes a new approach (called the Multilingual Cyberbullying Detection System) for the detection of cyberbullying in multiple languages (English, Hindi, and Marathi). It uses two techniques, namely Machine Learning-based and Lexicon-based, to classify the input data as bullying or non-bullying. The aim of this research is not only to detect cyberbullying but also to provide a distributed infrastructure for detecting bullying. We have developed multiple prototypes (standalone, collaborative, and cloud-based) and carried out experiments with them to detect cyberbullying on different datasets from multiple languages. The outcomes of our experiments show that the machine-learning model outperforms the lexicon-based model in all the languages. In addition, the results of our experiments show that collaboration techniques can help to improve the accuracy of a poor-performing node in the system. Finally, we show that the cloud-based configurations performed better than the local configurations.
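As a rough illustration of the lexicon-based technique (the word lists and threshold below are invented for the example, not taken from the thesis), a classifier can flag a message as bullying when the fraction of its tokens that appear in a per-language offensive-term lexicon exceeds a threshold:

```python
# Illustrative lexicon-based bullying classifier. The lexicons and the
# 0.15 threshold are hypothetical placeholders, not the thesis's data.

LEXICONS = {
    "english": {"idiot", "loser", "stupid"},
    "hindi":   {"bewakoof", "pagal"},
    "marathi": {"murkha", "veda"},
}

def is_bullying(message, language, threshold=0.15):
    """Classify a message as bullying if enough tokens hit the lexicon."""
    lexicon = LEXICONS.get(language, set())
    tokens = message.lower().split()
    if not tokens:
        return False
    hits = sum(1 for tok in tokens if tok.strip(".,!?") in lexicon)
    return hits / len(tokens) >= threshold

print(is_bullying("you are such a loser", "english"))          # True
print(is_bullying("see you at the game tonight", "english"))   # False
```

A machine-learning model replaces the fixed word lists with features learned from labeled data, which is why it tends to generalize better across languages and phrasings.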
|
6 |
Efficient and Robust Deep Learning through Approximate Computing. Sanchari Sen (9178400), 28 July 2020.
Deep Neural Networks (DNNs) have greatly advanced the state-of-the-art in a wide range of machine learning tasks involving image, video, speech and text analytics, and are deployed in numerous widely-used products and services. Improvements in the capabilities of hardware platforms such as Graphics Processing Units (GPUs) and specialized accelerators have been instrumental in enabling these advances, as they have allowed more complex and accurate networks to be trained and deployed. However, the enormous computational and memory demands of DNNs continue to increase with growing data size and network complexity, posing a continuing challenge to computing system designers. For instance, state-of-the-art image recognition DNNs require hundreds of millions of parameters and hundreds of billions of multiply-accumulate operations, while state-of-the-art language models require hundreds of billions of parameters and several trillion operations to process a single input instance. Another major obstacle in the adoption of DNNs, despite their impressive accuracies on a range of datasets, has been their lack of robustness. Specifically, recent efforts have demonstrated that small, carefully-introduced input perturbations can force a DNN to behave in unexpected and erroneous ways, which can have severe consequences in safety-critical DNN applications like healthcare and autonomous vehicles. In this dissertation, we explore approximate computing as an avenue to improve the speed and energy efficiency of DNNs, as well as their robustness to input perturbations.
Approximate computing involves executing selected computations of an application in an approximate manner, while generating favorable trade-offs between computational efficiency and output quality. The intrinsic error resilience of machine learning applications makes them excellent candidates for approximate computing, allowing us to achieve execution time and energy reductions with minimal effect on the quality of outputs. This dissertation performs a comprehensive analysis of different approximate computing techniques for improving the execution efficiency of DNNs. Complementary to generic approximation techniques like quantization, it identifies approximation opportunities based on the specific characteristics of three popular classes of networks - Feed-forward Neural Networks (FFNNs), Recurrent Neural Networks (RNNs) and Spiking Neural Networks (SNNs) - which vary considerably in their network structure and computational patterns.
First, in the context of feed-forward neural networks, we identify sparsity, or the presence of zero values in the data structures (activations, weights, gradients and errors), to be a major source of redundancy and, therefore, an easy target for approximations. We develop lightweight micro-architectural and instruction set extensions to a general-purpose processor core that enable it to dynamically detect zero values when they are loaded and skip future instructions that are rendered redundant by them. Next, we explore LSTMs (the most widely used class of RNNs), which map sequences from an input space to an output space. We propose hardware-agnostic approximations that dynamically skip redundant symbols in the input sequence and discard redundant elements in the state vector to achieve execution time benefits. Following that, we consider SNNs, which are an emerging class of neural networks that represent and process information in the form of sequences of binary spikes. Observing that spike-triggered updates along synaptic connections are the dominant operation in SNNs, we propose hardware and software techniques to identify connections that minimally impact the output quality and deactivate them dynamically, skipping any associated updates.
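To make the zero-skipping idea concrete, a minimal software analogue (not the proposed micro-architectural extension, which detects zeros at load time in hardware) skips the multiply-accumulate work associated with zero activations:

```python
import numpy as np

# Minimal software analogue of sparsity-based skipping: when an activation
# is zero, the entire column of multiply-accumulates it would feed is
# redundant and is skipped. The hardware extensions in the dissertation
# detect such zeros at load time; this loop is only an illustration.

def sparse_aware_matvec(weights, activations):
    out = np.zeros(weights.shape[0])
    for j, a in enumerate(activations):
        if a == 0.0:
            continue                 # skip all work rendered redundant by a zero
        out += weights[:, j] * a     # accumulate only for non-zero activations
    return out

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))
x = rng.standard_normal(8)
x[rng.random(8) < 0.5] = 0.0         # induce roughly 50% activation sparsity
assert np.allclose(sparse_aware_matvec(W, x), W @ x)
```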
The dissertation also delves into the efficacy of combining multiple approximate computing techniques to improve the execution efficiency of DNNs. In particular, we focus on the combination of quantization, which reduces the precision of DNN data structures, and pruning, which introduces sparsity in them. We observe that the ability of pruning to reduce the memory demands of quantized DNNs decreases with precision, as the overhead of storing non-zero locations alongside the values starts to dominate in different sparse encoding schemes. We analyze this overhead and the overall compression of three different sparse formats across a range of sparsity and precision values, and propose a hybrid compression scheme that identifies the optimal sparse format for a pruned low-precision DNN.
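The trade-off can be illustrated with back-of-the-envelope storage costs (the formats and index widths below are generic choices for illustration; the dissertation's three formats and hybrid scheme may differ): as value precision shrinks, the bits spent on non-zero indices or masks become the dominant term.

```python
# Back-of-the-envelope storage costs (in bits) of a pruned tensor under two
# generic sparse encodings versus a dense layout. Index widths and formats
# are illustrative assumptions, not the dissertation's exact schemes.

def storage_bits(num_elems, density, value_bits, index_bits=16):
    nnz = int(num_elems * density)
    return {
        "dense":  num_elems * value_bits,            # store every value
        "coo":    nnz * (value_bits + index_bits),   # value + explicit index
        "bitmap": num_elems + nnz * value_bits,      # 1-bit mask + packed values
    }

for value_bits in (16, 8, 4, 2):
    costs = storage_bits(num_elems=1_000_000, density=0.2, value_bits=value_bits)
    ratios = {fmt: costs["dense"] / bits for fmt, bits in costs.items() if fmt != "dense"}
    print(f"{value_bits}-bit values, compression vs dense: "
          + ", ".join(f"{fmt}={r:.2f}x" for fmt, r in ratios.items()))

# The compression delivered by pruning shrinks as precision drops, because the
# per-non-zero index/mask overhead stays fixed while the values get smaller,
# which is what motivates choosing the sparse format per precision/sparsity point.
```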
Along with improved execution efficiency of DNNs, the dissertation explores an additional advantage of approximate computing in the form of improved robustness. We propose ensembles of quantized DNN models with different numerical precisions as a new approach to increase robustness against adversarial attacks. The approach is based on the observation that quantized neural networks often demonstrate much higher robustness to adversarial attacks than full-precision networks, but at the cost of a substantial loss in accuracy on the original (unperturbed) inputs. We overcome this limitation to achieve the best of both worlds, i.e., the higher unperturbed accuracies of the full-precision models combined with the higher robustness of the low-precision models, by composing them in an ensemble.
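A minimal sketch of such an ensemble (the averaging rule, quantizer, and single-layer "model" are assumptions for illustration, not the dissertation's exact method) simply combines the class probabilities produced by member models quantized to different precisions:

```python
import numpy as np

# Illustrative ensemble of models quantized to different precisions.
# Averaging softmax outputs is one simple combination rule; the
# dissertation's exact ensembling strategy may differ.

def quantize(weights, bits):
    """Uniform symmetric quantization of a weight matrix (illustrative)."""
    scale = np.abs(weights).max() / (2 ** (bits - 1) - 1)
    return np.round(weights / scale) * scale

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ensemble_predict(weight_matrix, x, precisions=(32, 8, 4)):
    probs = []
    for bits in precisions:
        w = weight_matrix if bits == 32 else quantize(weight_matrix, bits)
        probs.append(softmax(w @ x))         # each member votes with its probabilities
    return np.mean(probs, axis=0).argmax()   # average the votes, pick the top class

rng = np.random.default_rng(1)
W = rng.standard_normal((10, 64))            # stand-in single-layer "model"
x = rng.standard_normal(64)
print(ensemble_predict(W, x))
```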
In summary, this dissertation establishes approximate computing as a promising direction to improve the performance, energy efficiency and robustness of neural networks.
|
7 |
Software-defined Buffer Management and Robust Congestion Control for Modern Datacenter Networks. Danushka N Menikkumbura (12208121), 20 April 2022.
Modern datacenter network applications continue to demand ultra-low latencies and very high throughputs. At the same time, network infrastructure keeps achieving higher speeds and larger bandwidths. We still need better network management solutions to keep these two fronts, demand and supply, moving hand in hand. Key metrics that define network performance, such as flow completion time (the lower the better), throughput (the higher the better), and end-to-end latency (the lower the better), are mainly governed by how effectively network applications get their fair share of network resources. We observe that buffer utilization on network switches gives a very accurate indication of network performance. Therefore, network buffer management is important in modern datacenter networks, and other network management solutions can be efficiently built around buffer utilization. This dissertation presents three solutions based on buffer use on network switches.
This dissertation consists of three main sections. The first presents a specification language for buffer management in modern programmable switches. The second presents a congestion control solution for Remote Direct Memory Access (RDMA) networks. The third presents a solution to head-of-the-line blocking in modern datacenter networks.
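As a toy illustration of building a control signal around switch buffer occupancy (the thresholds and the rate-update rule below are invented for the sketch and do not reproduce the congestion control scheme in the dissertation), a sender could adjust its rate based on the queue depth reported by the switch:

```python
# Toy rate controller driven by switch buffer occupancy. The thresholds
# and multiplicative/additive constants are illustrative assumptions and
# do not reproduce the congestion control scheme in the dissertation.

def update_rate(current_rate_gbps, queue_depth_kb,
                low_kb=50, high_kb=200,
                additive_step=0.5, decrease_factor=0.8,
                line_rate_gbps=100.0):
    if queue_depth_kb > high_kb:
        # Buffer filling up: back off multiplicatively.
        return max(current_rate_gbps * decrease_factor, 0.1)
    if queue_depth_kb < low_kb:
        # Buffer nearly empty: probe for more bandwidth.
        return min(current_rate_gbps + additive_step, line_rate_gbps)
    return current_rate_gbps  # occupancy in the target band: hold steady

rate = 40.0
for depth in (10, 30, 120, 260, 300, 80):   # sample queue-depth readings (KB)
    rate = update_rate(rate, depth)
    print(f"queue={depth:>3} KB -> rate={rate:.1f} Gbps")
```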
|
8 |
High-performant, Replicated, Queue-oriented Transaction Processing Systems on Modern Computing Infrastructures. Thamir Qadah (11132985), 27 July 2021.
With the shifting landscape of computing hardware architectures and the emergence of new computing environments (e.g., large main-memory systems, hundreds of CPUs, distributed and virtualized cloud-based resources), state-of-the-art designs of transaction processing systems that rely on conventional wisdom suffer from lost performance optimization opportunities. This dissertation challenges conventional wisdom to rethink the design and implementation of transaction processing systems for modern computing environments.

We start by tackling the vertical hardware scaling challenge, and propose a deterministic approach to transaction processing on emerging multi-socket, many-core, shared-memory architectures to harness their unprecedented available parallelism. Our proposed priority-based, queue-oriented transaction processing architecture eliminates the transaction contention footprint and uses speculative execution to improve the throughput of centralized deterministic transaction processing systems. We build QueCC and demonstrate up to two orders of magnitude better performance over the state-of-the-art.

We further tackle the horizontal scaling challenge and propose a distributed queue-oriented transaction processing engine that relies on queue-oriented communication to eliminate the traditional overhead of commitment protocols for multi-partition transactions. We build Q-Store, and demonstrate up to 22x improvement in system throughput over state-of-the-art deterministic transaction processing systems.

Finally, we propose a generalized framework for designing distributed and replicated deterministic transaction processing systems. We introduce the concept of speculative replication to hide the latency overhead of replication. We prototype the speculative replication protocol in QR-Store and perform an extensive experimental evaluation using standard benchmarks. We show that QR-Store can achieve a throughput of 1.9 million replicated transactions per second in under 200 milliseconds, with a replication overhead of 8%-25% compared to non-replicated configurations.
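A highly simplified sketch of the queue-oriented, deterministic execution idea (the queue layout, priority rule, and operation format below are assumptions for illustration, not QueCC's actual design) pre-assigns each transaction's operations to per-partition queues by priority and then drains each queue independently:

```python
from collections import defaultdict

# Highly simplified illustration of priority-based, queue-oriented execution:
# a planning phase assigns each transaction's operations to per-partition
# queues ordered by transaction priority; an execution phase drains each
# queue independently, so no locking is needed within a batch. The data
# layout and operation format are invented for this sketch.

def plan_batch(transactions):
    """transactions: list of (priority, [(partition, key, delta), ...])"""
    queues = defaultdict(list)
    for priority, ops in transactions:
        for partition, key, delta in ops:
            queues[partition].append((priority, key, delta))
    for q in queues.values():
        q.sort(key=lambda op: op[0])     # deterministic order: by priority
    return queues

def execute_batch(queues, store):
    for partition, ops in queues.items():   # each queue could run on its own core
        for _, key, delta in ops:
            store[partition][key] = store[partition].get(key, 0) + delta

store = {0: {}, 1: {}}
batch = [
    (1, [(0, "a", +5), (1, "b", +3)]),   # transaction with priority 1
    (0, [(0, "a", -2)]),                 # higher-priority transaction runs first
]
execute_batch(plan_batch(batch), store)
print(store)  # {0: {'a': 3}, 1: {'b': 3}}
```

Because every replica or core that plans the same batch with the same priorities produces the same queues, execution order is deterministic without a distributed commit protocol for the batch.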
|