• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 50
  • 6
  • 5
  • 2
  • 2
  • 1
  • Tagged with
  • 71
  • 19
  • 18
  • 15
  • 14
  • 14
  • 12
  • 12
  • 12
  • 11
  • 10
  • 10
  • 9
  • 9
  • 8
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
61

Spill Code Minimization And Buffer And Code Size Aware Instruction Scheduling Techniques

Nagarakatte, Santosh G 08 1900 (has links)
Instruction scheduling and Software pipelining are important compilation techniques which reorder instructions in a program to exploit instruction level parallelism. They are essential for enhancing instruction level parallelism in architectures such as very Long Instruction Word and tiled processors. This thesis addresses two important problems in the context of these instruction reordering techniques. The first problem is for general purpose applications and architectures, while the second is for media and graphics applications for tiled and multi-core architectures. The first problem deals with software pipelining which is an instruction scheduling technique that overlaps instructions from multiple iterations. Software pipelining increases the register pressure and hence it may be required to introduce spill instructions. In this thesis, we model the problem of register allocation with optimal spill code generation and scheduling in software pipelined loops as a 0-1 integer linear program. By minimizing the amount of spill code produced, the formulation ensures that the initiation interval (II) between successive iterations of the loop is not increased unnecessarily. Experimental results show that our formulation performs better than the existing heuristics by preventing an increase in the II and also generating less spill code on average among loops extracted from Perfect Club and SPEC benchmarks. The second major contribution of the thesis deals with the code size aware scheduling of stream programs. Large scale synchronous dataflow graphs (SDF’s) and StreamIt have emerged as powerful programming models for high performance streaming applications. In these models, a program is represented as a dataflow graph where each node represents an autonomous filter and the edges represent the channels through which the nodes communicate. In constructing static schedules for programs in these models, it is important to optimize the execution time buffer requirements of the data channel and the space required to store the encoded schedule. Earlier approaches have either given priority to one of the requirements or proposed ad-hoc methods for generating schedules with good trade-offs. In this thesis, we propose a genetic algorithm framework based on non-dominated sorting for generating serial schedules which have good trade-off between code size and buffer requirement. We extend the framework to generate software pipelined schedules for tiled architectures. From our experiments, we observe that the genetic algorithm framework generates schedules with good trade-off and performs better than the earlier approaches.
62

Energy Efficient Techniques For Algorithmic Analog-To-Digital Converters

Hai, Noman January 2011 (has links)
Analog-to-digital converters (ADCs) are key design blocks in state-of-art image, capacitive, and biomedical sensing applications. In these sensing applications, algorithmic ADCs are the preferred choice due to their high resolution and low area advantages. Algorithmic ADCs are based on the same operating principle as that of pipelined ADCs. Unlike pipelined ADCs where the residue is transferred to the next stage, an N-bit algorithmic ADC utilizes the same hardware N-times for each bit of resolution. Due to the cyclic nature of algorithmic ADCs, many of the low power techniques applicable to pipelined ADCs cannot be directly applied to algorithmic ADCs. Consequently, compared to those of pipelined ADCs, the traditional implementations of algorithmic ADCs are power inefficient. This thesis presents two novel energy efficient techniques for algorithmic ADCs. The first technique modifies the capacitors' arrangement of a conventional flip-around configuration and amplifier sharing technique, resulting in a low power and low area design solution. The other technique is based on the unit multiplying-digital-to-analog-converter approach. The proposed approach exploits the power saving advantages of capacitor-shared technique and capacitor-scaled technique. It is shown that, compared to conventional techniques, the proposed techniques reduce the power consumption of algorithmic ADCs by more than 85\%. To verify the effectiveness of such approaches, two prototype chips, a 10-bit 5 MS/s and a 12-bit 10 MS/s ADCs, are implemented in a 130-nm CMOS process. Detailed design considerations are discussed as well as the simulation and measurement results. According to the simulation results, both designs achieve figures-of-merit of approximately 60 fJ/step, making them some of the most power efficient ADCs to date.
63

Návrh převodníku AD s nízkým napájecím napětím v technologii CMOS / Design of AD converter with low supply voltage in CMOS technology

Holas, Jiří January 2016 (has links)
Tato diplomová práce se zabývá návrhem 12 bitového řetězového A/D převodníku. Součástí návrhu bylo vytvořit referenční model převodníku v prostředí Matlab a determinovat faktory, které negativně ovlivňují výsledek konverze. S využitím nabytých poznatků navrhnout řetězový převodník na transistorové úrovni v prostředí Cadence. V teoretické části jsou shrnuty základy A/D převodu a dále jsou představeny nejčastěji používané architektury A/D převodníků. V dalších částech je popsán a diskutován vliv neidealit na vlastnosti řetězových převodníků. Praktická část se již věnuje popisu základních charakteristik řetězových převodníků a dokazuje funkci modelu. Z výsledků modelové struktury byly stanoveny reálné parametry, které byly dále využity v procesu tvorby návrhu v CMOS technologii TSMC 0,18m s nízkým napájecím napětím.
64

A Soft-core processor architecture optimised for radar signal processing applications

Broich, René January 2013 (has links)
Current radar signal processor architectures lack either performance or flexibility in terms of ease of modification and large design time overheads. Combinations of processors and FPGAs are typically hard-wired together into a precisely timed and pipelined solution to achieve a desired level of functionality and performance. Such a fixed processing solution is clearly not feasible for new algorithm evaluation or quick changes during field tests. A more flexible solution based on a high-performance soft-core processing architecture is proposed. To develop such a processing architecture, data and signal-flow characteristics of common radar signal processing algorithms are analysed. Each algorithm is broken down into signal processing and mathematical operations. The computational requirements are then evaluated using an abstract model of computation to determine the relative importance of each mathematical operation. Critical portions of the radar applications are identified for architecture selection and optimisation purposes. Built around these dominant operations, a soft-core architecture model that is better matched to the core computational requirements of a radar signal processor is proposed. The processor model is iteratively refined based on the previous synthesis as well as code profiling results. To automate this iterative process, a software development environment was designed. The software development environment enables rapid architectural design space exploration through the automatic generation of development tools (assembler, linker, code editor, cycle accurate emulator / simulator, programmer, and debugger) as well as platform independent VHDL code from an architecture description file. Together with the board specific HDL-based HAL files, the design files are synthesised using the vendor specific FPGA tools and practically verified on a custom high performance development board. Timing results, functional accuracy, resource usage, profiling and performance data are analysed and fed back into the architecture description file for further refinement. The results from this iterative design process yielded a unique transport-based pipelined architecture. The proposed architecture achieves high data throughput while providing the flexibility that a software-programmable device offers. The end user can thus write custom radar algorithms in software rather than going through a long and complex HDL-based design. The simplicity of this architecture enables high clock frequencies, deterministic response times, and makes it easy to understand. Furthermore, the architecture is scalable in performance and functionality for a variety of different streaming and burst-processing related applications. A comparison to the Texas Instruments C66x DSP core showed a decrease in clock cycles by a factor between 10.8 and 20.9 for the identical radar application on the proposed architecture over a range of typical operating parameters. Even with the limited clock speeds achievable on the FPGA technology, the proposed architecture exceeds the performance of the commercial high-end DSP processor. Further research is required on ASIC, SIMD and multi-core implementations as well as compiler technology for the proposed architecture. A custom ASIC implementation is expected to further improve the processing performance by factors between 10 and 27. / Dissertation (MEng)--University of Pretoria, 2013. / gm2014 / Electrical, Electronic and Computer Engineering / unrestricted
65

Efficient LU Factorization for Texas Instruments Keystone Architecture Digital Signal Processors / Effektiv LU-faktorisering för Texas Instruments digitala signalprocessorer med Keystone-arkitektur

Netzer, Gilbert January 2015 (has links)
The energy consumption of large-scale high-performance computer (HPC) systems has become one of the foremost concerns of both data-center operators and computer manufacturers. This has renewed interest in alternative computer architectures that could offer substantially better energy-efficiency.Yet, the for the evaluation of the potential of these architectures necessary well-optimized implementations of typical HPC benchmarks are often not available for these for the HPC industry novel architectures. The in this work presented LU factorization benchmark implementation aims to provide such a high-quality tool for the HPC industry standard high-performance LINPACK benchmark (HPL) for the eight-core Texas Instruments TMS320C6678 digitalsignal processor (DSP). The presented implementation could perform the LU factorization at up to 30.9 GF/s at 1.25 GHz core clock frequency by using all the eight DSP cores of the System-on-Chip (SoC). This is 77% of the attainable peak double-precision floating-point performance of the DSP, a level of efficiency that is comparable to the efficiency expected on traditional x86-based processor architectures. A presented detailed performance analysis shows that this is largely due to the optimized implementation of the embedded generalized matrix-matrix multiplication (GEMM). For this operation, the on-chip direct memory access (DMA) engines were used to transfer the necessary data from the external DDR3 memory to the core-private and shared scratchpad memory. This allowed to overlap the data transfer with computations on the DSP cores. The computations were in turn optimized by using software pipeline techniques and were partly implemented in assembly language. With these optimization the performance of the matrix multiplication reached up to 95% of attainable peak performance. A detailed description of these two key optimization techniques and their application to the LU factorization is included. Using a specially instrumented Advantech TMDXEVM6678L evaluation module, described in detail in related work, allowed to measure the SoC’s energy efficiency of up to 2.92 GF/J while executing the presented benchmark. Results from the verification of the benchmark execution using standard HPL correctness checks and an uncertainty analysis of the experimentally gathered data are also presented. / Energiförbrukningen av storskaliga högpresterande datorsystem (HPC) har blivit ett av de främsta problemen för såväl ägare av dessa system som datortillverkare. Det har lett till ett förnyat intresse för alternativa datorarkitekturer som kan vara betydligt mer effektiva ur energiförbrukningssynpunkt. För detaljerade analyser av prestanda och energiförbrukning av dessa för HPC-industrin nya arkitekturer krävs väloptimerade implementationer av standard HPC-bänkmärkningsproblem. Syftet med detta examensarbete är att tillhandhålla ett sådant högkvalitativt verktyg i form av en implementation av ett bänkmärkesprogram för LU-faktorisering för den åttakärniga digitala signalprocessorn (DSP) TMS320C6678 från Texas Instruments. Bänkmärkningsproblemet är samma som för det inom HPC-industrin välkända bänkmärket “high-performance LINPACK” (HPL). Den här presenterade implementationen nådde upp till en prestanda av 30,9 GF/s vid 1,25 GHz klockfrekvens genom att samtidigt använda alla åtta kärnor i DSP:n. Detta motsvarar 77% av den teoretiskt uppnåbara prestandan, vilket är jämförbart med förväntningar på effektivteten av mer traditionella x86-baserade system. En detaljerad prestandaanalys visar att detta tillstor del uppnås genom den högoptimerade implementationen av den ingående matris-matris-multiplikationen. Användandet av specialiserade “direct memory access” (DMA) hårdvaruenheter för kopieringen av data mellan det externa DDR3 minnet och det interna kärn-privata och delade arbetsminnet tillät att överlappa dessa operationer med beräkningar. Optimerade mjukvaruimplementationer av dessa beräkningar, delvis utförda i maskinspåk, tillät att utföra matris-multiplikationen med upp till 95% av den teoretiskt nåbara prestandan. I rapporten ges en detaljerad beskrivning av dessa två nyckeltekniker. Energiförbrukningen vid exekvering av det implementerade bänkmärket kunde med hjälp av en för ändamålet anpassad Advantech TMDXEVM6678L evalueringsmodul bestämmas till maximalt 2,92 GF/J. Resultat från verifikationen av bänkmärkesimplementationen och en uppskattning av mätosäkerheten vid de experimentella mätningarna presenteras också.
66

Calcul flottant haute performance sur circuits reconfigurables / High-performance floating-point computing on reconfigurable circuits

Pasca, Bogdan Mihai 21 September 2011 (has links)
De plus en plus de constructeurs proposent des accélérateurs de calculs à base de circuits reconfigurables FPGA, cette technologie présentant bien plus de souplesse que le microprocesseur. Valoriser cette flexibilité dans le domaine de l'accélération de calcul flottant en utilisant les langages de description de circuits classiques (VHDL ou Verilog) reste toutefois très difficile, voire impossible parfois. Cette thèse a contribué au développement du logiciel FloPoCo, qui offre aux utilisateurs familiers avec VHDL un cadre C++ de description d'opérateurs arithmétiques génériques adapté au calcul reconfigurable. Ce cadre distingue explicitement la fonctionnalité combinatoire d'un opérateur, et la problématique de son pipeline pour une précision, une fréquence et un FPGA cible donnés. Afin de pouvoir utiliser FloPoCo pour concevoir des opérateurs haute performance en virgule flottante, il a fallu d'abord concevoir des blocs de bases optimisés. Nous avons d'abord développé des additionneurs pipelinés autour des lignes de propagation de retenue rapides, puis, à l'aide de techniques de pavages, nous avons conçu de gros multiplieurs, possiblement tronqués, utilisant des petits multiplieurs. L'évaluation de fonctions élémentaires en flottant implique souvent l'évaluation en virgule fixe d'une fonction. Nous présentons un opérateur générique de FloPoCo qui prend en entrée l'expression de la fonction à évaluer, avec ses précisions d'entrée et de sortie, et construit un évaluateur polynomial optimisé de cette fonction. Ce bloc de base a permis de développer des opérateurs en virgule flottante pour la racine carrée et l'exponentielle qui améliorent considérablement l'état de l'art. Nous avons aussi travaillé sur des techniques de compilation avancée pour adapter l'exécution d'un code C aux pipelines flexibles de nos opérateurs. FloPoCo a pu ainsi être utilisé pour implanter sur FPGA des applications complètes. / Due to their potential performance and unmatched flexibility, FPGA-based accelerators are part of more and more high-performance computing systems. However, exploiting this flexibility for accelerating floating-point computations by manually using classical circuit description languages (VHDL or Verilog) is very difficult, and sometimes impossible. This thesis has contributed to the development of the FloPoCo software, a C++ framework for describing flexible FPGA-specific arithmetic operators. This framework explicitly separates the description of the combinatorial functionality of an arithmetic operator, and its pipelining for a given precision, operating frequency and target FPGA.In order to be able to use FloPoCo for designing high performance floating-point operators, we first had to design the optimized basic blocks. We first developed pipelined addition architectures exploiting the fast-carry lines present in modern FPGAs. Next, we focused on multiplication architectures. Using tiling techniques, we proposed novel architectures for large multipliers, but also truncated multipliers, based on the multipliers found in modern FPGA DSP blocks. We also present a generic FloPoCo operator which inputs the expression of a function, its input and output precisions, and builds an optimized polynomial evaluator for the fixed-point evaluation of this function. Using this building block we have designed floating-point operators for the square-root and exponential functions which significantly outperform existing operators. Finally, we also made use of advanced compilation techniques for adapting the execution of a C program to the flexible pipelines of our operators.
67

Conception en vue de test de convertisseurs de signal analogique-numérique de type pipeline. / Design for test of pipelined analog to digital converters.

Laraba, Asma 20 September 2013 (has links)
La Non-Linéarité-Différentielle (NLD) et la Non-Linéarité-Intégrale (NLI) sont les performances statiques les plus importantes des Convertisseurs Analogique-Numérique (CAN) qui sont mesurées lors d’un test de production. Ces deux performances indiquent la déviation de la fonction de transfert du CAN par rapport au cas idéal. Elles sont obtenues en appliquant une rampe ou une sinusoïde lente au CAN et en calculant le nombre d’occurrences de chacun des codes du CAN.Ceci permet la construction de l’histogramme qui permet l’extraction de la NLD et la NLI. Cette approche requiert lacollection d’une quantité importante de données puisque chacun des codes doit être traversé plusieurs fois afin de moyenner le bruit et la quantité de données nécessaire augmente exponentiellement avec la résolution du CAN sous test. En effet,malgré que les circuits analogiques et mixtes occupent une surface qui n’excède pas généralement 5% de la surface globald’un System-on-Chip (SoC), leur temps de test représente souvent plus que 30% du temps de test global. Pour cette raison, la réduction du temps de test des CANs est un domaine de recherche qui attire de plus en plus d’attention et qui est en train deprendre de l’ampleur. Les CAN de type pipeline offrent un bon compromis entre la vitesse, la résolution et la consommation.Ils sont convenables pour une variété d’applications et sont typiquement utilisés dans les SoCs destinés à des applicationsvidéo. En raison de leur façon particulière du traitement du signal d’entrée, les CAN de type pipeline ont des codes de sortiequi ont la même largeur. Par conséquent, au lieu de considérer tous les codes lors du test, il est possible de se limiter à un sous-ensemble, ce qui permet de réduire considérablement le temps de test. Dans ce travail, une technique pour l’applicationdu test à code réduit pour les CANs de type pipeline est proposée. Elle exploite principalement deux propriétés de ce type deCAN et permet d’obtenir une très bonne estimation des performances statiques. La technique est validée expérimentalementsur un CAN 11-bit, 55nm de STMicroelectronics, obtenant une estimation de la NLD et de la NLI pratiquement identiques àla NLD et la NLI obtenues par la méthode classique d’histogramme, en utilisant la mesure de seulement 6% des codes. / Differential Non Linearity (DNL) and Integral Non Linearity (INL) are the two main static performances ofAnalog to-Digital Converters (ADCs) typically measured during production testing. These two performances reflect thedeviation of the transfer curve of the ADC from its ideal form. In a classic testing scheme, a saturated sine-wave or ramp isapplied to the ADC and the number of occurrences of each code is obtained to construct the histogram from which DNL andINL can be readily calculated. This standard approach requires the collection of a large volume of data because each codeneeds to be traversed many times to average noise. Furthermore, the volume of data increases exponentially with theresolution of the ADC under test. According to recently published data, testing the mixed-signal functions (e.g. dataconverters and phase locked loops) of a System-on-Chip (SoC) contributes to more than 30% of the total test time, althoughmixed-signal circuits occupy a small fraction of the SoC area that typically does not exceed 5%. Thus, reducing test time forADCs is an area of industry focus and innovation. Pipeline ADCs offer a good compromise between speed, resolution, andpower consumption. They are well-suited for a variety of applications and are typically present in SoCs intended for videoapplications. By virtue of their operation, pipeline ADCs have groups of output codes which have the same width. Thus,instead of considering all the codes in the testing procedure, we can consider measuring only one code out of each group,thus reducing significantly the static test time. In this work, a technique for efficiently applying reduced code testing onpipeline ADCs is proposed. It exploits two main properties of the pipeline ADC architecture and allows obtaining an accurateestimation of the static performances. The technique is validated on an experimental 11-bit, 55nm pipeline ADC fromSTMicroelectronics, resulting in estimated DNL and INL that are practically indistinguishable from DNL and INL that areobtained with the standard histogram technique, while measuring only 6% of the codes.
68

Protocol design and performance evaluation for wireless ad hoc networks

Tong, Fei 10 November 2016 (has links)
Benefiting from the constant and significant advancement of wireless communication technologies and networking protocols, Wireless Ad hoc NETwork (WANET) has played a more and more important role in modern communication networks without relying much on existing infrastructures. The past decades have seen numerous applications adopting ad hoc networks for service provisioning. For example, Wireless Sensor Network (WSN) can be widely deployed for environment monitoring and object tracking by utilizing low-cost, low-power and multi-function sensor nodes. To realize such applications, the design and evaluation of communication protocols are of significant importance. Meanwhile, the network performance analysis based on mathematical models is also in great need of development and improvement. This dissertation investigates the above topics from three important and fundamental aspects, including data collection protocol design, protocol modeling and analysis, and physical interference modeling and analysis. The contributions of this dissertation are four-fold. First, this dissertation investigates the synchronization issue in the duty-cycled, pipelined-scheduling data collection of a WSN, based on which a pipelined data collection protocol, called PDC, is proposed. PDC takes into account both the pipelined data collection and the underlying schedule synchronization over duty-cycled radios practically and comprehensively. It integrates all its components in a natural and seamless way to simplify the protocol implementation and to achieve a high energy efficiency and low packet delivery latency. Based on PDC, an Adaptive Data Collection (ADC) protocol is further proposed to achieve dynamic duty-cycling and free addressing, which can improve network heterogeneity, load adaptivity, and energy efficiency. Both PDC and ADC have been implemented in a pioneer open-source operating system for the Internet of Things, and evaluated through a testbed built based on two hardware platforms, as well as through emulations. Second, Linear Sensor Network (LSN) has attracted increasing attention due to the vast requirements on the monitoring and surveillance of a structure or area with a linear topology. Being aware that, for LSN, there is few work on the network modeling and analysis based on a duty-cycled MAC protocol, this dissertation proposes a framework for modeling and analyzing a class of duty-cycled, multi-hop data collection protocols for LSNs. With the model, the dissertation thoroughly investigates the PDC performance in an LSN, considering both saturated and unsaturated scenarios, with and without retransmission. Extensive OPNET simulations have been carried out to validate the accuracy of the model. Third, in the design and modeling of PDC above, the transmission and interference ranges are defined for successful communications between a pair of nodes. It does not consider the cumulative interference from the transmitters which are out of the contention range of a receiver. Since most performance metrics in wireless networks, such as outage probability, link capacity, etc., are nonlinear functions of the distances among communicating, relaying, and interfering nodes, a physical interference model based on distance is definitely needed in quantifying these metrics. Such quantifications eventually involve the Nodal Distance Distribution (NDD) intrinsically depending on network coverage and nodal spatial distribution. By extending a tool in integral geometry and using decomposition and recursion, this dissertation proposes a systematic and algorithmic approach to obtaining the NDD between two nodes which are uniformly distributed at random in an arbitrarily-shaped network. Fourth, with the proposed approach to NDDs, the dissertation presents a physical interference model framework to analyze the cumulative interference and link outage probability for an LSN running the PDC protocol. The framework is further applied to analyze 2D networks, i.e., ad hoc Device-to-Device (D2D) communications underlaying cellular networks, where the cumulative interference and link outage probabilities for both cellular and D2D communications are thoroughly investigated. / Graduate / 0984 / 0544 / tong1987fei@163.com
69

Ring amplification for switched capacitor circuits

Hershberg, Benjamin Poris 19 July 2013 (has links)
A comprehensive and scalable solution for high-performance switched capacitor amplification is presented. Central to this discussion is the concept of ring amplification. A ring amplifier is a small modular amplifier derived from a ring oscillator that naturally embodies all the essential elements of scalability. It can amplify with accurate rail-to-rail output swing, drive large capacitive loads with extreme efficiency using slew-based charging, naturally scale in performance according to process trends, and is simple enough to be quickly constructed from only a handful of inverters, capacitors, and switches. In addition, the gain-enhancement technique of Split-CLS is introduced, and used to extend the efficacy of ring amplifiers in specific and other amplifiers in general. Four different pipelined ADC designs are presented which explore the practical implementation options and design considerations relevant to ring amplification and Split-CLS, and are used to establish ring amplification as a new paradigm for scalable amplification. / Graduation date: 2012 / Access restricted to the OSU Community, at author's request, from July 19, 2012 - July 19, 2013
70

Algorithm And Architecture Design for Real-time Face Recognition

Mahale, Gopinath Vasanth January 2016 (has links) (PDF)
Face recognition is a field of biometrics that deals with identification of subjects based on features present in the images of their faces. The factors that make face recognition popular and favorite as compared to other biometric methods are easier operation and ability to identify subjects without their knowledge. With these features, face recognition has become an integral part of the present day security systems, targeting a smart and secure world. There are various factors that de ne the performance of a face recognition system. The most important among them are recognition accuracy of algorithm used and time taken for recognition. Recognition accuracy of the face recognition algorithm gets affected by changes in pose, facial expression and illumination along with occlusions in the images. There have been a number of algorithms proposed to enable recognition under these ambient changes. However, it has been hard to and a single algorithm that can efficiently recognize faces in all the above mentioned conditions. Moreover, achieving real time performance for most of the complex face recognition algorithms on embedded platforms has been a challenge. Real-time performance is highly preferred in critical applications such as identification of crime suspects in public. As available software solutions for FR have significantly large latency in recognizing individuals, they are not suitable for such critical real-time applications. This thesis focuses on real-time aspect of FR, where acceleration of the algorithms is achieved by means of parallel hardware architectures. The major contributions of this work are as follows. We target to design a face recognition system that can identify at most 30 faces in each frame of video at 15 frames per second, which amounts to 450 recognitions per second. In addition, we target to achieve good recognition accuracy along with scalability in terms of database size and input image resolutions. To design a system with these specifications, as a first step, we explore algorithms in literature and come up with a hybrid face recognition algorithm. This hybrid algorithm shows good recognition accuracy on face images with changes in illumination, pose and expressions, and also with occlusions. In addition the computations in the algorithm are modular in nature which are suitable for real-time realizations through parallel processing. The face recognition system consists of a face detection module to detect faces in the input image, which is followed by a face recognition module to identify the detected faces. There are well established algorithms and architectures for face detection in literature which can perform detection at 15 frames per second on video frames. Detected faces of different sizes need to be scaled to the size specified by the face recognition module. To meet the real-time constraints, we propose a hardware architecture for real-time bi-cubic convolution interpolation with dynamic scaling factors. To recognize the resized faces in real-time, a scalable parallel pipelined architecture is designed for the hybrid algorithm which can perform 450 recognitions per second on a database containing grayscale images of at most 450 classes on Virtex 6 FPGA. To provide flexibility and programmability, we extend this design to REDEFINE, a multi-core massively parallel reconfigurable architecture. In this design, we come up with FR specific programmable cores termed Scalable Unit for Region Evaluation (SURE) capable of performing modular computations in the hybrid face recognition algorithm. We replicate SUREs in each tile of REDEFINE to construct a face recognition module termed REDEFINE for Face Recognition using SURE Homogeneous Cores (REFRESH). There is a need to learn new unseen faces on-line in practical face recognition systems. Considering this, for real-time on-line learning of unseen face images, we design tiny processors termed VOP, Processor for Vector Operations. VOPs function as coprocessors to process elements under each tile of REDEFINE to accelerate micro vector operations appearing in the synaptic weight computations. We also explore deep neural networks which operate similar to the processing in human brain and capable of working on very large face databases. We explore the field of Random matrix theory to come up with a solution for synaptic weight initialization in deep neural networks for better classification . In addition, we perform design space exploration of hardware architecture for deep convolution networks and conclude with directions for future work.

Page generated in 0.0607 seconds