81 |
A Runtime Framework for Regular and Irregular Message-Driven Parallel Applications on GPU Systems. Rengasamy, Vasudevan, January 2014 (has links) (PDF)
The effective use of GPUs for accelerating applications depends on a number of factors, including effective asynchronous use of heterogeneous resources, reducing data transfer between CPU and GPU, increasing the occupancy of GPU kernels, overlapping data transfers with computations, reducing GPU idling, and kernel optimizations. Overcoming these challenges requires considerable effort on the part of application developers. Moreover, most optimization strategies are proposed and tuned for individual applications.
Message-driven execution with over-decomposition of tasks constitutes an important model for parallel programming and provides multiple benefits, including communication-computation overlap and reduced idling of resources. Charm++ is one such message-driven language; it employs over-decomposition of tasks, computation-communication overlap, and a measurement-based load balancer to achieve high CPU utilization. This research has developed an adaptive runtime framework for efficient execution of Charm++ message-driven parallel applications on GPU systems.
In the first part of our research, we developed a runtime framework, G-Charm, focused primarily on optimizing regular applications. At runtime, G-Charm automatically combines multiple small GPU tasks into a single larger kernel, which reduces the number of kernel invocations while improving CUDA occupancy. G-Charm also enables reuse of data already present in GPU global memory, performs GPU memory management, and dynamically schedules tasks across the CPU and GPU in order to reduce idle time. To combine the partial results obtained from computations performed on the CPU and GPU, G-Charm allows the user to specify an operator with which the partial results are combined at runtime. We also perform compile-time code generation to reduce programming overhead. For Cholesky factorization, a regular parallel application, G-Charm provides a 14% improvement over a highly tuned implementation.
In the second part of our research, we extended our runtime to overcome the challenges presented by irregular applications, such as aperiodic generation of tasks, irregular memory access patterns, and workloads that vary during application execution. We developed models for deciding the number of tasks that can be combined into a kernel based on the rate of task generation and the GPU occupancy of the tasks. For irregular applications, data reuse results in uncoalesced GPU memory access, so we evaluated the effect of altering the global memory access pattern to improve coalescing. We also developed adaptive methods for hybrid execution on CPU and GPU that take the varying workloads into account when scheduling tasks across the two devices. We demonstrate that our dynamic strategies result in 8-38% reduction in execution time for an N-body simulation application and a molecular dynamics application over the corresponding static strategies, which are suited to regular applications.
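The kernel-combining idea generalizes beyond G-Charm and can be illustrated with a small CUDA sketch. The code below is not G-Charm's implementation; the task structure, sizes, and kernel are illustrative assumptions. Instead of launching one kernel per small task, task descriptors are packed into an array and a single kernel is launched in which each thread block processes one task, cutting launch overhead and raising occupancy.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative task descriptor: each "task" scales one segment of a vector.
struct Task {
    int offset;   // start index of this task's data
    int size;     // number of elements
    float alpha;  // per-task scale factor
};

// One combined kernel: each block handles one task, avoiding
// per-task kernel-launch overhead and improving occupancy.
__global__ void batchedScale(const Task* tasks, float* data) {
    const Task t = tasks[blockIdx.x];
    for (int i = threadIdx.x; i < t.size; i += blockDim.x)
        data[t.offset + i] *= t.alpha;
}

int main() {
    const int numTasks = 64, taskSize = 256;
    const int n = numTasks * taskSize;

    float* dData;  Task* dTasks;
    cudaMalloc(&dData, n * sizeof(float));
    cudaMalloc(&dTasks, numTasks * sizeof(Task));

    // Build the host-side task list and data, then copy to the GPU.
    Task hTasks[numTasks];
    float* hData = new float[n];
    for (int i = 0; i < n; ++i) hData[i] = 1.0f;
    for (int t = 0; t < numTasks; ++t)
        hTasks[t] = Task{t * taskSize, taskSize, float(t)};
    cudaMemcpy(dData, hData, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dTasks, hTasks, numTasks * sizeof(Task), cudaMemcpyHostToDevice);

    // One launch covers all 64 small tasks.
    batchedScale<<<numTasks, 128>>>(dTasks, dData);
    cudaDeviceSynchronize();

    cudaMemcpy(hData, dData, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("task 3, element 0: %f\n", hData[3 * taskSize]);  // expect 3.0

    delete[] hData;
    cudaFree(dData); cudaFree(dTasks);
    return 0;
}
```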
|
82 |
應用機器學習預測利差交易的收益 / Application of machine learning to predicting the returns of carry trade. 吳佳真, Unknown Date (has links)
This research derives an artificial neural network (ANN) mechanism for timely and effectively predicting the returns of carry trade. To achieve timeliness, the ANN mechanism is implemented on the infrastructure of TensorFlow and graphics processing units (GPUs). Furthermore, the ANN mechanism needs to cope with time series data that may exhibit concept drift and contain outliers. An experiment is also designed to verify the timeliness and effectiveness of the proposed mechanism.
During the experiment, we found that the parameter settings of the algorithm affect the performance of the neural network, and this research discusses the results obtained under different settings. The experimental results show that the proposed ANN mechanism can predict the movement of carry trade returns well. We hope this research will contribute to both the machine learning and finance fields.
|
83 |
Performance Analysis of Non Local Means Algorithm using Hardware Accelerators. Antony, Daniel Sanju, January 2016 (has links) (PDF)
Image de-noising forms an integral part of image processing. It is used as a standalone algorithm for improving the quality of images obtained from a camera, as well as a preliminary stage for image processing applications such as face recognition and super-resolution. Non Local Means (NL-Means) and the Bilateral Filter are two computationally complex de-noising algorithms that can provide good de-noising results. Due to their computational complexity, the real-time applications of these filters are limited.
In this thesis, we propose the use of hardware accelerators such as GPUs (Graphics Processing Units) and FPGAs (Field Programmable Gate Arrays) to speed up filter execution and to implement these filters efficiently. The GPU-based implementations are written in the Open Computing Language (OpenCL). The basic objective of this research is to perform high-speed de-noising without compromising quality. We implement a basic NL-Means filter, a fast NL-Means filter, and a Bilateral filter using Gauss polynomial decomposition on the GPU. We also propose a modification to the existing NL-Means algorithm and the Gauss polynomial Bilateral filter: instead of the Gaussian spatial kernel used in the standard algorithms, a box spatial kernel is introduced to improve execution speed. This research work is a step towards making real-time implementation of these algorithms possible. Results show that the NL-Means implementation on GPU using OpenCL is about 25x faster than a conventional CPU-based implementation for larger images (1024x1024), and the GPU-based fast NL-Means is about 90x faster than its CPU implementation. Even with the improved execution time, embedded applications of NL-Means are limited by the power and thermal restrictions of the GPU device. In order to create a faster, lower-power implementation, we implemented the algorithm on an FPGA. FPGAs are reconfigurable devices and enable us to create a custom architecture for parallel execution of the algorithm. The FPGA execution time for smaller images (256x256) is about 200x faster than the CPU implementation and about 25x faster than the GPU execution. Moreover, the power requirement of the FPGA design (0.53 W) is far below that of the CPU (30 W) and the GPU (200 W).
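The box-spatial-kernel modification can be sketched in a few lines of CUDA (shown here instead of the thesis's OpenCL, and with assumed parameter names; this is an illustration, not the thesis code). The patch-similarity weight keeps the exponential range kernel, while every pixel of a patch contributes equally to the patch distance, removing the per-pixel Gaussian spatial weights:

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// NL-Means with a box spatial kernel: every pixel of a patch contributes
// equally to the patch distance. One thread de-noises one pixel.
__global__ void nlMeansBox(const float* in, float* out, int W, int H,
                           int S /*search radius*/, int P /*patch radius*/,
                           float h /*filtering strength*/) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;

    float num = 0.0f, den = 0.0f;
    for (int dy = -S; dy <= S; ++dy) {
        for (int dx = -S; dx <= S; ++dx) {
            int cx = min(max(x + dx, 0), W - 1);
            int cy = min(max(y + dy, 0), H - 1);
            // Box-weighted squared distance between the two patches.
            float d2 = 0.0f;
            for (int py = -P; py <= P; ++py)
                for (int px = -P; px <= P; ++px) {
                    int ax = min(max(x + px, 0), W - 1);
                    int ay = min(max(y + py, 0), H - 1);
                    int bx = min(max(cx + px, 0), W - 1);
                    int by = min(max(cy + py, 0), H - 1);
                    float diff = in[ay * W + ax] - in[by * W + bx];
                    d2 += diff * diff;              // uniform (box) weight
                }
            int area = (2 * P + 1) * (2 * P + 1);
            float w = expf(-d2 / (area * h * h));   // range kernel unchanged
            num += w * in[cy * W + cx];
            den += w;
        }
    }
    out[y * W + x] = num / den;
}

int main() {
    const int W = 64, H = 64, n = W * H;
    float* hImg = new float[n];
    for (int i = 0; i < n; ++i) hImg[i] = 0.5f;  // flat test image

    float *dIn, *dOut;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, hImg, n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16), grid((W + 15) / 16, (H + 15) / 16);
    nlMeansBox<<<grid, block>>>(dIn, dOut, W, H, 10, 3, 0.1f);
    cudaMemcpy(hImg, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("de-noised flat image, pixel(0,0) = %f\n", hImg[0]);  // stays 0.5

    delete[] hImg;
    cudaFree(dIn); cudaFree(dOut);
    return 0;
}
```

The triple loop makes the cost per pixel O(S²P²), which is exactly why these filters are moved off the CPU in the first place.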
|
84 |
Benchmark pro zařízení s podporou OpenGL ES 3.0 / Benchmark for OpenGL ES 3.0 Devices. Kimer, Tomáš, January 2014 (has links)
This thesis deals with the development of a benchmark application for OpenGL ES 3.0 devices using realistic real-time rendering of 3D scenes. The first part covers the history and new features of the OpenGL ES 3.0 graphics library. The next part briefly describes selected algorithms for realistic real-time rendering of 3D scenes that can be implemented using the new features of the discussed library. The design of the benchmark application is covered next, including the design of an online result database containing detailed device specifications. The last part covers the implementation on the Android and Windows platforms and the testing on mobile devices after publishing the application on Google Play. Finally, the results and possibilities of further development are discussed.
|
85 |
On GPU Assisted Polar Decoding: Evaluating the Parallelization of the Successive Cancellation Algorithm using Graphics Processing Units / Polärkodning med hjälp av GPU:er: En utvärdering av parallelliseringsmöjligheterna av Successive Cancellation-algoritmen med hjälp av grafikprocessorer. Nordqvist, Siri, January 2023 (has links)
In telecommunication, messages sent through a wireless medium often experience noise that interferes with the signal and corrupts the messages. As the demand for high throughput in mobile networks increases, algorithms that can detect and correct these corrupted messages quickly and accurately are of interest to the industry. Polar codes have been chosen by the Third Generation Partnership Project (3GPP) as the error correction code for 5G New Radio control channels. This thesis investigates whether the Successive Cancellation (SC) polar decoding algorithm can be parallelized and whether a graphics processing unit (GPU) can be used to optimize the algorithm's execution time. The SC decoder was enhanced with tree pruning and with GPU support to leverage parallelization, and the difference in execution time between the concurrent and sequential versions of the SC algorithm, with and without tree pruning, was evaluated. The tree-pruning SC algorithm almost always offered shorter execution times than the SC algorithm without tree pruning. However, GPU support did not reduce the execution time in these tests; based on these results, it is therefore not certain that a GPU can improve this type of enhanced SC algorithm.
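The two butterfly operations at the heart of SC decoding are what a GPU implementation parallelizes across LLR pairs. Below is a minimal CUDA sketch of these operations (illustrative, not the thesis implementation); the min-sum approximation f(a, b) = sign(a) sign(b) min(|a|, |b|) and the partial-sum update g(a, b, u) = b + (1 - 2u) a are the standard formulations.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// One SC decoder stage: each thread processes one LLR pair.
// f (min-sum): combines two LLRs on the way down the left branch.
__global__ void scF(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float s = (a[i] < 0.0f) != (b[i] < 0.0f) ? -1.0f : 1.0f;
    out[i] = s * fminf(fabsf(a[i]), fabsf(b[i]));
}

// g: combines two LLRs with the already-decoded partial sum u.
__global__ void scG(const float* a, const float* b, const int* u,
                    float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    out[i] = b[i] + (1 - 2 * u[i]) * a[i];
}

int main() {
    const int n = 8;
    float ha[n] = {0.5f, -1.2f, 2.0f, -0.1f, 0.9f, 1.1f, -2.2f, 0.3f};
    float hb[n] = {1.0f, 0.4f, -0.7f, 0.8f, -1.5f, 0.2f, 0.6f, -0.9f};
    int   hu[n] = {0, 1, 0, 1, 0, 1, 0, 1};

    float *da, *db, *dout; int *du;
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&dout, n * sizeof(float));
    cudaMalloc(&du, n * sizeof(int));
    cudaMemcpy(da, ha, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(du, hu, n * sizeof(int), cudaMemcpyHostToDevice);

    scF<<<1, n>>>(da, db, dout, n);       // left-branch LLRs
    scG<<<1, n>>>(da, db, du, dout, n);   // right-branch LLRs (reuses dout)
    cudaDeviceSynchronize();

    float hout[n];
    cudaMemcpy(hout, dout, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("g(a,b,u)[0] = %f\n", hout[0]);  // expect b[0] + a[0] = 1.5
    cudaFree(da); cudaFree(db); cudaFree(dout); cudaFree(du);
    return 0;
}
```

Because each stage of the decoder depends on the previous one, the available parallelism per launch shrinks from N/2 down to 1 along the decoding path; this serial bottleneck is what tree pruning attacks, and it also explains why GPU gains can fail to materialize for modest block lengths.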
|
86 |
Implementation and optimization of LDPC decoding algorithms tailored for Nvidia GPUs in 5G / Implementering och optimering av LDPC avkodningsalgoritmer anpassat för Nvidia GPU:er i 5G. Salomonsson, Benjamin, January 2022 (has links)
Low-Density Parity-Check (LDPC) codes are linear error-correcting codes used to establish reliable communication between units on a noisy transmission channel in mobile telecommunications. LDPC algorithms detect and recover altered or corrupted message bits using sparse parity-check matrices in order to decipher messages correctly. According to the Third Generation Partnership Project (3GPP), LDPC codes are a fitting coding scheme for fifth generation (5G) New Radio (NR). TietoEvry, a consultancy in telecom, has found that LDPC decoding algorithms can be optimized with Compute Unified Device Architecture (CUDA), a parallel computing platform developed by NVIDIA. This platform utilizes the capabilities of a graphics processing unit (GPU) rather than a central processing unit (CPU), which in turn provides parallel computing. An optimized version of an LDPC decoding algorithm, the Min-Sum Algorithm (MSA), is implemented in CUDA and in C++ to compare execution times and explore the capabilities that CUDA offers. The testing is done with a set of 12 sparse parity-check matrices and input channel messages of different sizes. As a result, the CUDA implementation executes approximately 55% faster than a standard, unoptimized C++ implementation.
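The check-node update dominates MSA runtime and maps naturally onto one CUDA thread per check node. The sketch below is an illustration under an assumed fixed-degree, row-wise message layout, not TietoEvry's implementation: a first pass finds the sign product and the two smallest magnitudes, and a second pass emits the extrinsic message for each edge (the edge holding the minimum receives the second minimum).

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Min-Sum check-node update: one thread per check node. Messages are
// stored row-wise, deg entries per check; vToC holds incoming
// variable-to-check messages, cToV receives extrinsic outputs.
// (A companion variable-node kernel would use the column-index array
// of the sparse parity-check matrix; it is omitted here.)
__global__ void checkNodeMinSum(const float* vToC, float* cToV,
                                int numChecks, int deg) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= numChecks) return;

    // Pass 1: overall sign product and the two smallest magnitudes.
    float min1 = INFINITY, min2 = INFINITY;
    int minPos = -1, signProd = 1;
    for (int k = 0; k < deg; ++k) {
        float m = vToC[c * deg + k];
        if (m < 0.0f) signProd = -signProd;
        float mag = fabsf(m);
        if (mag < min1) { min2 = min1; min1 = mag; minPos = k; }
        else if (mag < min2) { min2 = mag; }
    }

    // Pass 2: the extrinsic message excludes the edge's own contribution,
    // so the edge holding min1 receives min2 instead.
    for (int k = 0; k < deg; ++k) {
        float m = vToC[c * deg + k];
        int s = signProd * (m < 0.0f ? -1 : 1);     // strip own sign
        cToV[c * deg + k] = s * ((k == minPos) ? min2 : min1);
    }
}

int main() {
    // Toy layout: two degree-3 check nodes.
    const int numChecks = 2, deg = 3, e = numChecks * deg;
    float hIn[e] = {0.5f, -1.2f, 2.0f, 0.8f, -0.3f, -1.0f};

    float *dIn, *dOut;
    cudaMalloc(&dIn, e * sizeof(float));
    cudaMalloc(&dOut, e * sizeof(float));
    cudaMemcpy(dIn, hIn, e * sizeof(float), cudaMemcpyHostToDevice);

    checkNodeMinSum<<<1, numChecks>>>(dIn, dOut, numChecks, deg);

    float hOut[e];
    cudaMemcpy(hOut, dOut, e * sizeof(float), cudaMemcpyDeviceToHost);
    printf("check 0 extrinsic: %.2f %.2f %.2f\n", hOut[0], hOut[1], hOut[2]);
    // Expected: -1.20 +0.50 -0.50
    cudaFree(dIn); cudaFree(dOut);
    return 0;
}
```

A full decoder alternates this kernel with a variable-node update and a hard-decision/syndrome check; in practice a normalization factor (e.g., 0.75 on the magnitudes) is often added to close the gap to sum-product decoding.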
|
87 |
Methods for 3D Structured Light Sensor Calibration and GPU Accelerated Colormap. Kurella, Venu, January 2018 (has links)
In manufacturing, metrological inspection is a time-consuming process. The higher the required precision in inspection, the longer the inspection time. This is due to both slow devices that collect measurement data and slow computational methods that process the data. The goal of this work is to propose methods to speed up some of these processes. Conventional measurement devices like Coordinate Measuring Machines (CMMs) have high precision but low measurement speed, while new digitizer technologies have high speed but low precision. Using these devices in synergy gives a significant improvement in measurement speed without loss of precision. The method of synergistic integration of an advanced digitizer with a CMM is discussed. Computational aspects of the inspection process are addressed next. Once a part is measured, measurement data is compared against its model to check for tolerances. This comparison is a time-consuming process on conventional CPUs. We developed and benchmarked some GPU accelerations. Finally, naive data fitting methods can produce misleading results in cases with non-uniform data. Weighted total least-squares methods can compensate for non-uniformity. We show how they can be accelerated with GPUs, using plane fitting as an example. / Thesis / Doctor of Philosophy (PhD)
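To make the weighted-fitting example concrete, here is a minimal CUDA sketch (an illustration under assumed data layout, not the thesis code). It accumulates the weighted moment sums for the plane z = ax + by + c on the GPU and solves the tiny 3x3 normal-equation system on the host. For brevity it uses weighted ordinary least squares; the weighted total least-squares variant discussed in the thesis would replace the 3x3 solve with an eigen-decomposition of the weighted covariance, while the parallel accumulation stage stays essentially the same.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Accumulate weighted moment sums for fitting z = a*x + b*y + c.
// sums[9] = {Sxx, Sxy, Sx, Syy, Sy, Sw, Sxz, Syz, Sz}, pre-zeroed.
// Note: double-precision atomicAdd needs -arch=sm_60 or newer.
__global__ void weightedMoments(const float* x, const float* y,
                                const float* z, const float* w,
                                int n, double* sums) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double wi = w[i], xi = x[i], yi = y[i], zi = z[i];
    atomicAdd(&sums[0], wi * xi * xi);
    atomicAdd(&sums[1], wi * xi * yi);
    atomicAdd(&sums[2], wi * xi);
    atomicAdd(&sums[3], wi * yi * yi);
    atomicAdd(&sums[4], wi * yi);
    atomicAdd(&sums[5], wi);
    atomicAdd(&sums[6], wi * xi * zi);
    atomicAdd(&sums[7], wi * yi * zi);
    atomicAdd(&sums[8], wi * zi);
}

static double det3(const double M[3][3]) {
    return M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
         - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
         + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]);
}

int main() {
    const int n = 1024;
    static float hx[n], hy[n], hz[n], hw[n];
    for (int i = 0; i < n; ++i) {            // exact plane z = 2x - y + 3
        hx[i] = (i % 32) * 0.1f;
        hy[i] = (i / 32) * 0.1f;
        hz[i] = 2.0f * hx[i] - hy[i] + 3.0f;
        hw[i] = 1.0f;                        // uniform weights for the demo
    }
    float *dx, *dy, *dz, *dw; double* ds;
    cudaMalloc(&dx, n * sizeof(float)); cudaMalloc(&dy, n * sizeof(float));
    cudaMalloc(&dz, n * sizeof(float)); cudaMalloc(&dw, n * sizeof(float));
    cudaMalloc(&ds, 9 * sizeof(double));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dz, hz, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dw, hw, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(ds, 0, 9 * sizeof(double));

    weightedMoments<<<(n + 255) / 256, 256>>>(dx, dy, dz, dw, n, ds);

    double s[9];
    cudaMemcpy(s, ds, 9 * sizeof(double), cudaMemcpyDeviceToHost);
    double A[3][3] = {{s[0], s[1], s[2]}, {s[1], s[3], s[4]}, {s[2], s[4], s[5]}};
    double r[3] = {s[6], s[7], s[8]}, d = det3(A), coef[3];
    for (int k = 0; k < 3; ++k) {            // Cramer's rule, column k
        double Ak[3][3];
        memcpy(Ak, A, sizeof(A));
        for (int row = 0; row < 3; ++row) Ak[row][k] = r[row];
        coef[k] = det3(Ak) / d;
    }
    printf("a=%.3f b=%.3f c=%.3f\n", coef[0], coef[1], coef[2]); // 2 -1 3
    cudaFree(dx); cudaFree(dy); cudaFree(dz); cudaFree(dw); cudaFree(ds);
    return 0;
}
```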
|
88 |
Research and Design of Neural Processing Architectures Optimized for Embedded Applications. Wu, Binyi, 28 May 2024
Deploying neural networks on edge devices and bringing them into our daily lives is attracting more and more attention. However, their expensive computational cost makes many embedded applications daunting. The primary objective of my doctoral studies is to make contributions towards resolving this predicament: optimizing neural networks and designing corresponding efficient neural processing units for edge devices. This work took algorithmic research, specifically the optimization of deep neural networks, as a starting point and then applied its findings to steer the architecture design of Neural Processing Units (NPUs). The optimization of neural network models started with single-precision neural network quantization and progressed to mixed precision. The NPU architecture development followed the algorithmic research findings to achieve hardware/software co-design. Furthermore, a new approach to hardware and software co-development was introduced, aimed at expediting the prototyping and performance assessment of NPUs. This approach targets early-stage development. It helps developers to focus on the design and optimization of NPUs and significantly shortens the development cycle. In the final project, a machine learning-based approach was applied to explore and optimize the computational and memory resources of the NPU. The entire work covers several different areas, from algorithmic research to hardware design, but all of it works toward improving the inference efficiency of neural networks. Specifically, the algorithm optimization aims to reduce the memory footprint and computational cost of neural networks, while the NPU design focuses on improving the utilization of hardware resources. The proposed software and hardware co-development approach shortens the design cycle and speeds up design iterations. The order presented above corresponds to the structure of this dissertation. Each chapter is devoted to a topic and covers the relevant research, methodology, and experimental results.

1 Introduction
2 Convolutional Neural Networks
2.1 Convolutional layer
2.1.1 Padding
2.1.2 Convolution
2.1.3 Batch Normalization
2.1.4 Nonlinearity
2.2 Pooling Layer
2.3 Fully Connected Layer
2.4 Characterization
2.4.1 Composition of Operations and Parameters
2.4.2 Arithmetic Intensity
2.5 Optimization
3 Quantization with Double-Stage Squeeze-and-Threshold 19
3.1 Overview
3.1.1 Binarization
3.1.2 Multi-bit Quantization
3.2 Quantization of Convolutional Neural Networks
3.2.1 Quantization Scheme
3.2.2 Operator fusion of Conv2D
3.3 Activation Quantization with Squeeze-and-Threshold
3.3.1 Double-Stage Squeeze-and-Threshold
3.3.2 Inference Optimization
3.4 Experiment
3.4.1 Ablation Study of Squeeze-and-Threshold
3.4.2 Comparison with State-of-the-art Methods
3.5 Summary
4 Low-Precision Neural Architecture Search 39
4.1 Overview
4.2 Differentiable Architecture Search
4.2.1 Gumbel Softmax
4.2.2 Disadvantage and Solution
4.3 Low-Precision Differentiable Architecture Search
4.3.1 Convolution Sharing
4.3.2 Forward-and-Backward Scaling
4.3.3 Power Estimation
4.3.4 Architecture of Supernet
4.4 Experiment
4.4.1 Effectiveness of solutions to the dominance problem
4.4.2 Softmax and Gumbel Softmax
4.4.3 Optimizer and Inverted Learning Rate Scheduler
4.4.4 NAS Method Evaluation
4.4.5 Searched Model Analysis
4.4.6 NAS Cost Analysis
4.4.7 NAS Training Analysis
4.5 Summary
5 Configurable Sparse Neural Processing Unit 65
5.1 Overview
5.2 NPU Architecture
5.2.1 Buffer
5.2.2 Reshapeable Mixed-Precision MAC Array
5.2.3 Sparsity
5.2.4 Post Process Unit
5.3 Mapping
5.3.1 Mixed-Precision MAC
5.3.2 MAC Array
5.3.3 Support of Other Operation
5.3.4 Configurability
5.4 Experiment
5.4.1 Performance Analysis of Runtime Configuration
5.4.2 Roofline Performance Analysis
5.4.3 Mixed-Precision
5.4.4 Comparison with Cortex-M7
5.5 Summary
6 Agile Development and Rapid Design Space Exploration 91
6.1 Overview
6.1.1 Agile Development
6.1.2 Design Space Exploration
6.2 Agile Development Infrastructure
6.2.1 Chisel Backend
6.2.2 NPU Software Stack
6.3 Modeling and Exploration
6.3.1 Area Modeling
6.3.2 Performance Modeling
6.3.3 Layered Exploration Framework
6.4 Experiment
6.4.1 Efficiency of Agile Development Infrastructure
6.4.2 Effectiveness of Agile Development Infrastructure
6.4.3 Area Modeling
6.4.4 Performance Modeling
6.4.5 Rapid Exploration and Pareto Front
6.5 Summary
7 Summary and Outlook 123
7.1 Summary
7.2 Outlook
A Appendix of Double-Stage ST Quantization 127
A.1 Training setting of ResNet-18 in Table 3.3
A.2 Training setting of ReActNet in Table 3.4
A.3 Training setting of ResNet-18 in Table 3.4
A.4 Pseudocode Implementation of Double-Stage ST
B Appendix of Low-Precision Neural Architecture Search 131
B.1 Low-Precision NAS on CIFAR-10
B.2 Low-Precision NAS on Tiny-ImageNet
B.3 Low-Precision NAS on ImageNet
Bibliography 137
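To ground the quantization theme in code, here is a minimal CUDA sketch of baseline symmetric per-tensor 8-bit quantization, the uniform scheme that methods like the double-stage squeeze-and-threshold build upon (an illustration with assumed names, not the dissertation's method):

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Symmetric per-tensor 8-bit quantization: q = clamp(round(x / s), -127, 127).
__global__ void quantizeInt8(const float* x, signed char* q, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float r = roundf(x[i] / s);
    r = fminf(fmaxf(r, -127.0f), 127.0f);   // clamp to the int8 symmetric range
    q[i] = (signed char)r;
}

int main() {
    const int n = 8;
    float hx[n] = {-1.0f, -0.5f, -0.1f, 0.0f, 0.1f, 0.25f, 0.5f, 1.0f};
    float maxAbs = 1.0f;        // known range of this toy tensor
    float s = maxAbs / 127.0f;  // scale mapping |x| <= maxAbs into [-127, 127]

    float* dx; signed char* dq;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dq, n);
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    quantizeInt8<<<1, n>>>(dx, dq, s, n);

    signed char hq[n];
    cudaMemcpy(hq, dq, n, cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)   // original -> int8 -> dequantized
        printf("%+.2f -> %4d -> %+.4f\n", hx[i], hq[i], hq[i] * s);
    cudaFree(dx); cudaFree(dq);
    return 0;
}
```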
|
89 |
Modeling and Analysis of Large-Scale On-Chip Interconnects. Feng, Zhuo, December 2009 (has links)
As IC technologies scale to the nanometer regime, efficient and accurate modeling and analysis of VLSI systems with billions of transistors and interconnects becomes increasingly critical and difficult. VLSI systems impacted by increasingly high-dimensional process-voltage-temperature (PVT) variations demand much more modeling and analysis effort than ever before, while the analysis of large-scale on-chip interconnects, which requires solving tens of millions of unknowns, imposes great challenges in computer-aided design. This dissertation presents new methodologies for addressing these two important challenges in large-scale on-chip interconnect modeling and analysis:

In the past, standard statistical circuit modeling techniques usually employed principal component analysis (PCA) and its variants to reduce parameter dimensionality. Although widely adopted, these techniques can be very limited, since parameter dimension reduction is achieved by merely considering the statistical distributions of the controlling parameters while neglecting the important correspondence between these parameters and the circuit performances (responses) under modeling. This dissertation presents a variety of performance-oriented parameter dimension reduction methods that can lead to more than one order of magnitude parameter reduction for a variety of VLSI circuit modeling and analysis problems.

The sheer size of present-day power/ground distribution networks makes their analysis and verification tasks extremely runtime- and memory-inefficient and, at the same time, limits the extent to which these networks can be optimized. Given today's commodity graphics processing units (GPUs), which can deliver more than 500 GFLOPS (billions of floating-point operations per second) of computing power and 100 GB/s of memory bandwidth, more than 10X greater than that offered by modern general-purpose quad-core microprocessors, it is very desirable to convert this impressive GPU computing power into usable design automation tools for VLSI verification. In this dissertation, for the first time, we show how to exploit massively parallel single-instruction multiple-thread (SIMT) GPU platforms to tackle power grid analysis with very promising performance. Our GPU-based network analyzer is capable of solving tens of millions of power grid nodes in just a few seconds. Additionally, with the above GPU-based simulation framework, more challenging three-dimensional full-chip thermal analysis can be solved much more efficiently than ever before.
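The SIMT pattern described above can be illustrated with a Jacobi-style relaxation on a regular resistive mesh (a simplified sketch with assumed uniform conductances, not the dissertation's actual solver, which targets irregular grids and stronger iterative methods): one lightweight thread updates one node voltage from its neighbors and its injected current, and millions of nodes are swept per launch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One Jacobi sweep over a W x H resistive mesh with uniform branch
// conductance g: each node's new voltage balances neighbor currents
// against its injected current cur[idx] (KCL). The boundary ring is
// held fixed as a Dirichlet condition standing in for supply pads.
__global__ void jacobiSweep(const float* vOld, float* vNew,
                            const float* cur, int W, int H, float g) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;
    int idx = y * W + x;
    if (x == 0 || y == 0 || x == W - 1 || y == H - 1) {
        vNew[idx] = vOld[idx];                  // pinned boundary node
        return;
    }
    float nb = vOld[idx - 1] + vOld[idx + 1]
             + vOld[idx - W] + vOld[idx + W];
    vNew[idx] = (g * nb + cur[idx]) / (4.0f * g);
}

int main() {
    const int W = 1024, H = 1024, n = W * H;
    float *dA, *dB, *dI;
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, n * sizeof(float));
    cudaMalloc(&dI, n * sizeof(float));

    // Start from a flat 1.0 V grid with zero injections (illustrative).
    float* h = new float[n];
    for (int i = 0; i < n; ++i) h[i] = 1.0f;
    cudaMemcpy(dA, h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dI, 0, n * sizeof(float));

    dim3 block(16, 16), grid((W + 15) / 16, (H + 15) / 16);
    for (int it = 0; it < 100; ++it) {          // fixed iteration count
        jacobiSweep<<<grid, block>>>(dA, dB, dI, W, H, 1.0f);
        float* t = dA; dA = dB; dB = t;         // ping-pong buffers
    }
    cudaMemcpy(h, dA, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("center node voltage: %f\n", h[(H / 2) * W + W / 2]);
    delete[] h;
    cudaFree(dA); cudaFree(dB); cudaFree(dI);
    return 0;
}
```

A production solver would add convergence checks, irregular stencils, and preconditioning, but the sweep already shows why power grids suit SIMT hardware: each thread does identical, tiny work, and the grid supplies millions of independent threads.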
|
90 |
FairCPU: Uma Arquitetura para Provisionamento de Máquinas Virtuais Utilizando Características de Processamento / FairCPU: An Architecture for Provisioning Virtual Machines Using Processing Features. Paulo Antonio Leal Rego, 02 March 2012 (has links)
Fundação Cearense de Apoio ao Desenvolvimento Científico e Tecnológico / Resource scheduling is a key process for a cloud computing platform, which generally uses virtual machines (VMs) as scheduling units. The use of virtualization techniques provides great flexibility, with the ability to instantiate multiple VMs on one physical machine (PM), migrate them between PMs, and dynamically scale a VM's resources. The techniques of consolidation and dynamic allocation of VMs have treated the impact of their use as a location-independent measure. It is generally accepted that the performance of a VM will be the same regardless of which PM it is allocated to. This assumption is reasonable for a homogeneous environment, where the PMs are identical and the VMs are running the same operating system and applications. Nevertheless, in a cloud computing environment, we expect a set of heterogeneous resources to be shared, where PMs vary both in their resource capacities and in their data affinities. The main objective of this work is to propose an architecture that standardizes the representation of processing power by means of processing units (PUs), relying on CPU usage limiting to provide performance isolation and to keep a VM's processing power at the same level regardless of the underlying PM. The proposed solution considers the heterogeneity of the PMs present in the cloud infrastructure and provides scheduling policies based on PUs. The proposed architecture, called FairCPU, was implemented to work with the KVM and Xen hypervisors. As a case study, it was incorporated into a private cloud, built with the OpenNebula middleware, where several experiments were conducted. The results prove the efficiency of the FairCPU architecture in using PUs to reduce variability in VM performance, as well as in providing a new way to represent and manage the processing power of the infrastructure's physical and virtual machines.
|