1. Método para execução de redes neurais convolucionais em FPGA / A method for execution of convolutional neural networks in FPGA

Sousa, Mark Cappello Ferreira de (26 April 2019)
Convolutional Neural Networks have been used successfully for pattern recognition in images. However, their high computational cost and the large number of parameters involved make it difficult to execute this type of artificial neural network in real time in embedded applications, where processing power and data storage capacity are restricted. This work studied and developed a method for real-time execution in FPGAs of a trained Convolutional Neural Network, taking advantage of the parallel processing power of this type of device. The focus of this work was the execution of the convolutional layers, since these layers can contribute up to 99% of the computational load of the entire network. In the experiments, an FPGA device was used in conjunction with a dual-core ARM processor on the same silicon substrate; only the FPGA was used to execute the convolutional layers of the AlexNet Convolutional Neural Network. The method studied in this work focuses on the efficient distribution of FPGA resources through balancing of the pipeline formed by the convolutional layers, the use of buffers to reduce and reuse the memory that stores intermediate data (generated and consumed by the convolutional layers), and the use of 8-bit numeric precision to store the kernels and increase their read throughput. With the developed method, it was possible to execute all five AlexNet convolutional layers in 3.9 ms at a maximum operating frequency of 76.9 MHz. It was also possible to store all the parameters of the convolutional layers in the internal memory of the FPGA, eliminating potential bottlenecks in access to external memory.
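
The abstract specifies 8-bit storage for the kernels but does not spell out the quantization scheme. The sketch below shows one plausible approach, symmetric linear quantization to int8; the function names and the choice of per-tensor scaling are assumptions for illustration, not the thesis's actual implementation.

import numpy as np

def quantize_kernels_int8(kernels):
    # Symmetric per-tensor quantization: map the largest magnitude to 127.
    scale = float(np.abs(kernels).max()) / 127.0
    q = np.clip(np.round(kernels / scale), -127, 127).astype(np.int8)
    return q, scale  # approximate floats are recovered as q * scale

# Example with a kernel tensor shaped like AlexNet's first layer (96 x 3 x 11 x 11)
kernels = np.random.randn(96, 3, 11, 11).astype(np.float32)
q, scale = quantize_kernels_int8(kernels)
print("max abs error:", np.abs(kernels - q.astype(np.float32) * scale).max())

Storing int8 kernels rather than 32-bit floats quarters the on-chip memory footprint and quadruples the number of kernel words delivered per memory read, which matches the abstract's stated goal of increasing kernel read throughput.
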
2. Representation and Efficient Computation of Sparse Matrix for Neural Networks in Customized Hardware

Yan, Lihao (January 2022)
Deep Neural Networks are widely applied in many fields today. However, the hundreds of thousands of neurons in each layer result in intensive memory storage requirements and a massive number of operations, making it difficult to deploy deep neural networks on mobile devices, where hardware resources are limited. One common technique to address the memory limitation is to prune and quantize the neural networks. In addition, due to the frequent use of the Rectified Linear Unit (ReLU) function or network pruning, the majority of the data in the weight matrices will be zeros, which not only take up a large amount of memory space but also cause unnecessary computation. In this thesis, a new value-based compression method is put forward to represent sparse matrices more efficiently by eliminating these zero elements, and customized hardware is implemented to realize the decompression and computation operations. The value-based compression method replaces the nonzero data in each column of the weight matrix with a reference value (the arithmetic mean) and the relative differences between each nonzero element and that reference. Intuitively, the data stored in each column is likely to contain similar values; the differences therefore span a narrow range, and fewer bits than the full form suffice to represent them. In this way, the weight matrix can be further compressed to save memory space. The proposed value-based compression method reduces the memory storage requirement for the fully-connected layers of AlexNet to 37%, 41%, 47% and 68% of the compressed model, e.g., the Compressed Sparse Column (CSC) format, when the data size is set to 8 bits and the sparsity is 20%, 40%, 60% and 80% respectively. Meanwhile, compression rates of 41%, 53% and 63% for the fully-connected layers of the compressed AlexNet model are achieved with respect to 8-bit, 16-bit and 32-bit data when the sparsity is 40%. Similar results are obtained for the VGG16 experiment.
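
As a reading aid, here is a minimal NumPy sketch of the column-wise value-based compression described above: each column keeps the row indices of its nonzeros, the arithmetic mean of the nonzero values as a reference, and the differences from that mean. In the thesis the differences are stored in a reduced bit width; they are kept as floats here for clarity, and all names are hypothetical.

import numpy as np

def compress_column(col):
    # Keep only the nonzeros: their row indices, a reference value
    # (the arithmetic mean), and the narrow-range differences from it.
    rows = np.nonzero(col)[0]
    values = col[rows]
    reference = values.mean() if values.size else 0.0
    deltas = values - reference
    return rows, reference, deltas

def decompress_column(rows, reference, deltas, n_rows):
    # Rebuild the dense column from the compressed representation.
    col = np.zeros(n_rows)
    col[rows] = reference + deltas
    return col

# Round-trip check on a random column with roughly 80% sparsity
col = np.random.randn(16) * (np.random.rand(16) > 0.8)
rows, ref, deltas = compress_column(col)
assert np.allclose(decompress_column(rows, ref, deltas, col.size), col)

Because weights in the same column tend to cluster around the mean, each delta fits in fewer bits than a full 8-, 16- or 32-bit value, which is where the memory savings over plain CSC come from.
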
3. An Investigation of Low-Rank Decomposition for Increasing Inference Speed in Deep Neural Networks With Limited Training Data

Wikén, Victor (January 2018)
In this study, to increase the inference speed of convolutional neural networks, the optimization technique low-rank tensor decomposition was implemented and applied to AlexNet, which had been trained to classify dog breeds. Due to the small training set, transfer learning was used to make the classification task feasible. The purpose of the study is to investigate how effective low-rank tensor decomposition is when the training set is limited. The results, compared with those of a previous study, indicate a strong relationship between the effect of the tensor decomposition and how much training data is available. A significant speed-up can be obtained in the different convolutional layers using tensor decomposition. However, since the network must be retrained after the decomposition and the dataset is limited, there is a slight decrease in accuracy.
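
The abstract does not specify the exact decomposition used; common choices for convolutional layers are CP or Tucker decompositions of the 4D kernel tensor. The sketch below takes the simpler route of a truncated SVD on the kernel flattened to a matrix, which illustrates the same trade-off between speed and approximation error; the names and shapes are illustrative, not taken from the thesis.

import numpy as np

def low_rank_factorize(W, rank):
    # Truncated SVD: W (m x n) is approximated as A @ B with
    # A (m x rank) and B (rank x n). Applying A @ (B @ x) costs
    # rank * (m + n) multiply-adds instead of m * n.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]
    B = Vt[:rank, :]
    return A, B

# A conv layer flattened to 2D: 256 filters over 3x3x192 input patches
W = np.random.randn(256, 3 * 3 * 192).astype(np.float32)
A, B = low_rank_factorize(W, rank=64)
x = np.random.randn(3 * 3 * 192).astype(np.float32)
print("relative error:", np.linalg.norm(W @ x - A @ (B @ x)) / np.linalg.norm(W @ x))

The lower the rank, the larger the speed-up but also the larger the approximation error, which is why the network is retrained (fine-tuned) after decomposition; with limited training data that retraining recovers less of the lost accuracy, consistent with the study's findings.
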
4. Implementace neuronové sítě bez operace násobení / Neural Network Implementation without Multiplication

Slouka, Lukáš (January 2018)
The subject of this thesis is neural network acceleration with the goal of reducing the number of floating-point multiplications. The theoretical part of the thesis surveys current trends and methods used in the field of neural network acceleration, with a focus on binarization techniques that allow multiplications to be replaced with logical operators. The theoretical base is put into practice in two ways. The first is a GPU implementation of the crucial binary operators in the TensorFlow framework, with a performance benchmark. The second is an application of these operators in a simple image classifier. The results are encouraging: the implemented operators achieve a speed-up by a factor of 2.5 compared to highly optimized cuBLAS operators. The last chapter compares the accuracies achieved by binarized models and their full-precision counterparts on various architectures.
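
To make the "multiplications replaced with logical operators" idea concrete, here is a minimal Python sketch of the XNOR-popcount trick used by binarized networks: with weights and activations constrained to +1/-1 and packed into machine words, a dot product reduces to an XNOR followed by a population count. This is a generic illustration of the technique, not code from the thesis.

import random

def pack_sign_bits(v):
    # Pack a sequence of +/-1 values into an int, one bit per sign (1 means +1).
    word = 0
    for i, x in enumerate(v):
        if x > 0:
            word |= 1 << i
    return word

def binary_dot(a, b, n):
    # XNOR marks the positions where the signs agree; with m agreements
    # out of n, the +/-1 dot product is m - (n - m) = 2*m - n.
    xnor = ~(a ^ b) & ((1 << n) - 1)   # mask keeps only the n valid bits
    m = bin(xnor).count("1")           # popcount
    return 2 * m - n

# Check against the ordinary multiply-accumulate dot product
a = [random.choice([-1, 1]) for _ in range(64)]
b = [random.choice([-1, 1]) for _ in range(64)]
assert binary_dot(pack_sign_bits(a), pack_sign_bits(b), 64) == sum(x * y for x, y in zip(a, b))

In hardware, one XNOR plus a popcount over a 64-bit word replaces 64 floating-point multiplications, which is the source of the speed-ups reported in the abstract.
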