Global ETD Search

1	Implementing CAL Actor Component on Massively Parallel Processor Array Khanfar, Husni January 2010 (has links) No description available. MPPA dataflow CAL Massively Parallel Processor Array static scheduling
2	Pixel-parallel image processing techniques and algorithms Wang, Bin January 2014 (has links) The motivation of the research presented in this thesis is to investigate image processing algorithms utilising various SIMD parallel devices, especially massively parallel Cellular Processor Arrays (CPAs), to accelerate their processing speed. Various SIMD processors with different architectures are reviewed, and their features are analysed. The different types of parallelisms contained in image processing tasks are also analysed, and the methodologies to exploit date-level parallelisms are discussed. The efficiency of the pixel-per-processor architecture used in computer vision scenarios are discussed, as well as its limitations. Aiming to solve the problem that CPA array dimensions are usually smaller than the resolution of the images needed to be processed, a “coarse grain mapping method” is proposed. It provides the CPAs with the ability of processing images with higher resolution than the arrays themselves by allowing CPAs to process multiple pixels per processing element. It is completely software based, easy to implement, and easy to program. To demonstrate the efficiency of pixel-level parallel approach, two image processing algorithms specially designed for pixel-per-processor arrays are proposed: a parallel skeletonization algorithm based on two-layer trigger-wave propagation, and a parallel background detection algorithm. Implementations of the proposed algorithms using different platforms (i.e. CPU, GPU and CPA) are proposed and evaluated. Evaluation results indicate that the proposed algorithms have advantages both in term of processing speed and result quality. This thesis concludes that pixel-per-processor architecture can be used in image processing (or computer vision) algorithms which emphasize analysing pixel-level information, to significantly boost the processing speed of these algorithms. 621.36
3	Design and implementation of massively parallel fine-grained processor arrays Walsh, Declan January 2015 (has links) This thesis investigates the use of massively parallel fine-grained processor arrays to increase computational performance. As processors move towards multi-core processing, more energy-efficient processors can be designed by increasing the number of processor cores on a single chip rather than increasing the clock frequency of a single processor. This can be done by making processor cores less complex, but increasing the number of processor cores on a chip. Using this philosophy, a processor core can be reduced in complexity, area, and speed to form a very small processor which can still perform basic arithmetic operations. Due to the small area occupation this can be multiplied and scaled to form a large scale parallel processor array to offer a significant performance. Following this design methodology, two fine-grained parallel processor arrays are designed which aim to achieve a small area occupation with each individual processor so that a larger array can be implemented over a given area. To demonstrate scalability and performance, SIMD parallel processor array is designed for implementation on an FPGA where each processor can be implemented using four ‘slices’ of a Xilinx FPGA. With such small area utilization, a large fine-grained processor can be implemented on these FPGAs. A 32 × 32 processor array is implemented and fast processing demonstrated using image processing tasks. An event-driven MIMD parallel processor array is also designed which occupies a small amount of area and can be scaled up to form much larger arrays. The event-driven approach allows the processor to enter an idle mode when no events are occurring local to the processor, reducing power consumption. The processor can switch to operational mode when events are detected. The processor core is designed with a multi-bit data path and ALU and contains its own instruction memory making the array a multi-core processor array. With area occupation of primary concern, the processor is relatively simple and connects with its four nearest direct neighbours. A small 8 × 8 prototype chip is implemented in a 65 nm CMOS technology process which can operate at a clock frequency of 80 MHz and offer a peak performance of 5.12 GOPS which can be scaled up to larger arrays. An application of the event-driven processor array is demonstrated using a simulation model of the processor. An event-driven algorithm is demonstrated to perform distributed control of distributed manipulator simulator by separating objects based on their physical properties. 621.3
4	A Scalable Approach to Multi-core Prototyping Newcomb, Jamie David 22 April 2008 (has links) In recent years, multi-core processors and multi-processor networks have grown in popularity as a solution to the limits on increasing clock speed, rising power consumption, and the nanometer manufacturing processes. Multi-core processors and multi-processor networks are seen as the next step in the advancement of computational capabilities by way of concurrent processing. However, parallel software design is difficult due to the immaturity of scalable architectures and software development environments for multi-core hardware. How should processors effectively and quickly pass information, with as little overhead as possible? What kind of communication architecture is best suited for parallelism? How can large-scale architectures be quickly produced, verified and properly utilized by software? Using commercially available FPGA development boards, Xilinx tools and components, this thesis offers a light-weight solution to these questions for effective, low-overhead, low-latency multi-core communication and fast prototyping of multi-processor networks for scalable processor arrays. / Master of Science multi-gigabit Aurora Xilinx Field programmable gate arrays multi-processor array multi-core
5	Linear Algebra for Array Signal Processing on a Massively Parallel Dataflow Architecture Savaş, Süleyman January 2009 (has links) This thesis provides the deliberations about the implementation of Gentleman-Kung systolic array for QR decomposition using Givens Rotations within the context of radar signal processing. The systolic array of Givens Rotations is implemented and analysed using a massively parallel processor array (MPPA), Ambric Am2045. The tools that are dedicated to the MPPA are tested in terms of engineering efficiency. aDesigner, which is built on eclipse environment, is used for programming, simulating and performance analysing. aDesigner has been produced for Ambric chip family. 2 parallel matrix multiplications have been implemented to get familiar with the architecture and tools. Moreover different sized systolic arrays are implemented and compared with each other. For programming, ajava and astruct languages are provided. However floating point numbers are not supported by the provided languages. Thus fixed point arithmetic is used in systolic array implementation of Givens Rotations. Stable and precise numerical results are obtained as outputs of the algorithms. However the analysis results are not reliable because of the performance analysis tools. Am2000 family Ambric register aDesigner radar antenna arrays beamforming QRD Gentleman-Kung systolic array Givens Rotations MPPA massively parallel processor array fixed point
6	Linear Algebra for Array Signal Processing on a Massively Parallel Dataflow Architecture Savaş, Süleyman January 2008 (has links) <p>This thesis provides the deliberations about the implementation of Gentleman-Kung systolic array for QR decomposition using Givens Rotations within the context of radar signal </p><p>processing. The systolic array of Givens Rotations is implemented and analysed using a massively parallel processor array (MPPA), Ambric Am2045. The tools that are dedicated to the MPPA are tested in terms of engineering efficiency. aDesigner, which is built on eclipse environment, is used for programming, simulating and performance analysing. aDesigner has been produced for Ambric chip family. 2 parallel matrix multiplications have been implemented </p><p>to get familiar with the architecture and tools. Moreover different sized systolic arrays are implemented and compared with each other. For programming, ajava and astruct languages are provided. However floating point numbers are not supported by the provided languages. </p><p>Thus fixed point arithmetic is used in systolic array implementation of Givens Rotations. Stable and precise numerical results are obtained as outputs of the algorithms. However the analysis </p><p>results are not reliable because of the performance analysis tools.</p>
7	Linear Algebra for Array Signal Processing on a Massively Parallel Dataflow Architecture Savaş, Süleyman January 2009 (has links) <p>This thesis provides the deliberations about the implementation of Gentleman-Kung systolic array for QR decomposition using Givens Rotations within the context of radar signal processing. The systolic array of Givens Rotations is implemented and analysed using a massively parallel processor array (MPPA), Ambric Am2045. The tools that are dedicated to the MPPA are tested in terms of engineering efficiency. aDesigner, which is built on eclipse environment, is used for programming, simulating and performance analysing. aDesigner has been produced for Ambric chip family. 2 parallel matrix multiplications have been implemented to get familiar with the architecture and tools. Moreover different sized systolic arrays are implemented and compared with each other. For programming, ajava and astruct languages are provided. However floating point numbers are not supported by the provided languages. Thus fixed point arithmetic is used in systolic array implementation of Givens Rotations. Stable </p><p>and precise numerical results are obtained as outputs of the algorithms. However the analysis results are not reliable because of the performance analysis tools.</p>
8	Parallelisierung von Algorithmen zur Nutzung auf Architekturen mit Teilwortparallelität / Parallelization of Algorithms for using on Architectures with Subword Parallelism Schaffer, Rainer 12 October 2010 (has links) (PDF) Der technologische Fortschritt gestattet die Implementierung zunehmend komplexerer Prozessorarchitekturen auf einem Schaltkreis. Ein Trend der letzten Jahre ist die Implementierung von mehr und mehr Verarbeitungseinheiten auf einem Chip. Daraus ergeben sich neue Herausforderungen für die Abbildung von Algorithmen auf solche Architekturen, denn alle Verarbeitungseinheiten sollen effizient bei der Ausführung des Algorithmus genutzt werden. Der Schwerpunkt der eingereichten Dissertation ist die Ausnutzung der Parallelität von Rechenfeldern mit Teilwortparallelität. Solche Architekturen erlauben Parallelverarbeitung auf mehreren Ebenen. Daher wurde eine Abbildungsstrategie, mit besonderem Schwerpunkt auf Teilwortparallelität entwickelt. Diese Abbildungsstrategie basiert auf den Methoden des Rechenfeldentwurfs. Rechenfelder sind regelmäßig angeordnete Prozessorelemente, die nur mit ihren Nachbarelementen kommunizieren. Die Datenein- und -ausgabe wird durch die Prozessorelemente am Rand des Rechenfeldes realisiert. Jedes Prozessorelement kann mehrere Funktionseinheiten besitzen, welche die Rechenoperationen des Algorithmus ausführen. Die Teilwortparallelität bezeichnet die Fähigkeit zur Teilung des Datenpfads der Funktionseinheit in mehrere schmale Datenpfade für die parallele Ausführung von Daten mit geringer Wortbreite. Die entwickelte Abbildungsstrategie unterteilt sich in zwei Schritte, die \"Vorverarbeitung\" und die \"Mehrstufige Modifizierte Copartitionierung\" (kurz: MMC). Die \"Vorverarbeitung\" verändert den Algorithmus in einer solchen Art, dass der veränderte Algorithmus schnell und effizient auf die Zielarchitektur abgebildet werden kann. Hierfür wurde ein Optimierungsproblem entwickelt, welches schrittweise die Parameter für die Transformation des Algorithmus bestimmt. Die \"Mehrstufige Modifizierte Copartitionierung\" wird für die schrittweise Anpassung des Algorithmus an die Zielarchitektur eingesetzt. Darüber hinaus ermöglicht die Abbildungsmethode die Ausnutzung der lokalen Register in den Prozessorelementen und die Anpassung des Algorithmus an die Speicherarchitektur, an die das Rechenfeld angebunden ist. Die erste Stufe der MMC dient der Transformation eines Algorithmus mit Einzeldatenoperationen in einen Algorithmus mit teilwortparallelen Operationen. Mit der zweiten Copartitionierungsstufe wird der Algorithmus an die lokalen Register und an das Rechenfeld angepasst. Weitere Copartitionierungsstufen können zur Anpassung des Algorithmus an die Speicherarchitektur verwendet werden. / The technological progress allows the implementation of complex processor architectures on a chip. One trend of the last years is the implemenation of more and more execution units on one chip. That implies new challenges for the mapping of algorithms on such architectures, because the execution units should be used efficiently during the execution of the algorithm. The focus of the submitted dissertation thesis is the utilization of the parallelism of processor arrays with subword parallelism. Such architectures allow parallel executions on different levels. Therefore an algorithm mapping strategy was developed, where the exploitation of the subword parallelism was in the focus. This algorithm mapping strategy is based on the methods of the processor array design. Processor arrays are regular arranged processor elements, which communicate with their neighbors elements only. The data in- and output will be realized by the processor elements on the border of the array. Each processor element can have several functional units, which execute the computational operations. Subword parallelism means the capability for splitting the data path of the functional units in several smaller chunks for the parallel execution of data with lower word width. The developed mapping strategy is subdivided in two steps, the \"Preprocessing\" and the \"Multi-Level Modified Copartitioning\" (kurz: MMC), whereat the MMC means the method of the step simultaneously. The \"Preprocessing\" alter the algorithm in such a kind, that the altered algorithm can be fast and efficient mapped on the target architecture. Therefore an optimization problem was developed, which determines gradual the parameter for the transformation of the algorithm. The \"Multi-Level Modified Copartitioning\" is used for mapping the algorithm gradual on the target architecture. Furthermore the mapping methodology allows the exploitation of the local registers in the processing elements and the adaptation of the algorithm on the memory architecture, where the processing array is connected on. The first level of the MMC is used for the transformation of an algorithm with operation based on single data to an algorithm with subword parallel operations. With the second level, the algorithm will be adapted to the local registers in the processing elements and to the processor array. Further copartition levels can be used for matching the algorithm to the memory architecture. Teilwortparallelität Rechenfeld Abbildung Copartitionierung Speicheroptimierung Ablaufplanung Allokation Mehrstufenparallelität subword parallelism processor array mapping copartitioning memory optimization scheduling allocation multi level parallelism ddc:620 ddc:004 rvk:ST 125
9	Stratégie de fiabilisation au niveau système des architectures MPSoC / Dependable Reconfigurable Processor Array (RPA) Hebert, Nicolas 06 July 2011 (has links) Cette thèse s'inscrit dans un contexte où chaque saut technologique, voit apparaitre des circuits intégrés produits de plus en plus tôt dans la phase de qualification et où la technologie de ces circuits intégrés se rapproche de plus en plus des limitations physiques de la matière. Malgré des contre-mesures technologiques, on se retrouve devant un taux de défaillance grandissant ce qui crée des conditions favorables au retour des techniques de tolérance aux fautes sur les circuits intégrés non critiques.La densité d'intégration atteinte aujourd'hui nous permet de considérer les réseaux reconfigurables de processeur comme des architectures SoC d'avenir. En effet, l'homogénéité de ces architectures laisse entrevoir des reconfigurations possibles de la plateforme qui permettraient d'assurer une qualité de service et donc une fiabilité minimum en présence de défauts. Ainsi, de nouvelles solutions de protection doivent être proposées pour garantir le bon fonctionnement des circuits non plus uniquement au niveau de quelques sous-fonctionnalités critiques mais au niveau architecture système lui-même.En s'appuyant sur ces prérogatives, nous présentons une méthode de protection distribuée et dynamique innovatrice, D-Scale. La méthode consiste à détecter, isoler et recouvrir les systèmes en présence d'erreurs de type « crash ». La détection des erreurs qui ont pour conséquence un « crash » de la plateforme est basée sur un mécanisme de messages de diagnostique échangés entre les unités de traitement. La phase de recouvrement est quant à elle basée sur un mécanisme permettant la reconfiguration de la plateforme de manière autonome. Une implémentation de cette protection matérielle et logicielle est proposée. Le coût de protection est réduit afin d'être intégré dans de futures architectures multiprocesseurs. Finalement, un outil d'évaluation d'impacte des fautes sur la plateforme est aussi étudié afin de valider l'efficacité de la protection. / This thesis is placed in a context where, for each technology node, integrated circuits are design at an earlier stage in the qualification process and where the CMOS technology appears to be closer to the silicon physical limitations. Despite technological countermeasure, we face an increase in the failure rate which creates conditions in favor of the return of fault-tolerant techniques for non-critical integrated circuits.Nowadays, we have reached such an integration density that we can consider the reconfigurable processor array as future SoC architectures. Indeed, these homogenous architectures suggest possible platform reconfigurations that would ensure quality of service and consequently a minimum reliability in presence of defects. Thus, new protection solutions must be proposed to ensure circuit smooth operations not only for sub-critical functionalities but at the system architecture level itself.Based on these prerogatives, we present an innovative dynamical and distributed protection method, named D-Scale. This method consists in detecting, isolating and recovering the systems in the presence of error which lead to a "crash" of the platform. The crash error detection is based on heartbeat specific messages exchanged between PEs. The recovery phase is based on an autonomous mechanism which reconfigures the platform.A hardware/software implementation was proposed and evaluated. The protection cost is reduced in order to be integrated within future multi-processor SoC architectures. Finally, a fault effect analysis tool is studied in order to validate the fault-tolerant method robustness. MP-SoC Système tolerant aux fautes MPSoC Fault tolerant system
10	Parallelisierung von Algorithmen zur Nutzung auf Architekturen mit Teilwortparallelität Schaffer, Rainer 09 March 2010 (has links) Der technologische Fortschritt gestattet die Implementierung zunehmend komplexerer Prozessorarchitekturen auf einem Schaltkreis. Ein Trend der letzten Jahre ist die Implementierung von mehr und mehr Verarbeitungseinheiten auf einem Chip. Daraus ergeben sich neue Herausforderungen für die Abbildung von Algorithmen auf solche Architekturen, denn alle Verarbeitungseinheiten sollen effizient bei der Ausführung des Algorithmus genutzt werden. Der Schwerpunkt der eingereichten Dissertation ist die Ausnutzung der Parallelität von Rechenfeldern mit Teilwortparallelität. Solche Architekturen erlauben Parallelverarbeitung auf mehreren Ebenen. Daher wurde eine Abbildungsstrategie, mit besonderem Schwerpunkt auf Teilwortparallelität entwickelt. Diese Abbildungsstrategie basiert auf den Methoden des Rechenfeldentwurfs. Rechenfelder sind regelmäßig angeordnete Prozessorelemente, die nur mit ihren Nachbarelementen kommunizieren. Die Datenein- und -ausgabe wird durch die Prozessorelemente am Rand des Rechenfeldes realisiert. Jedes Prozessorelement kann mehrere Funktionseinheiten besitzen, welche die Rechenoperationen des Algorithmus ausführen. Die Teilwortparallelität bezeichnet die Fähigkeit zur Teilung des Datenpfads der Funktionseinheit in mehrere schmale Datenpfade für die parallele Ausführung von Daten mit geringer Wortbreite. Die entwickelte Abbildungsstrategie unterteilt sich in zwei Schritte, die \"Vorverarbeitung\" und die \"Mehrstufige Modifizierte Copartitionierung\" (kurz: MMC). Die \"Vorverarbeitung\" verändert den Algorithmus in einer solchen Art, dass der veränderte Algorithmus schnell und effizient auf die Zielarchitektur abgebildet werden kann. Hierfür wurde ein Optimierungsproblem entwickelt, welches schrittweise die Parameter für die Transformation des Algorithmus bestimmt. Die \"Mehrstufige Modifizierte Copartitionierung\" wird für die schrittweise Anpassung des Algorithmus an die Zielarchitektur eingesetzt. Darüber hinaus ermöglicht die Abbildungsmethode die Ausnutzung der lokalen Register in den Prozessorelementen und die Anpassung des Algorithmus an die Speicherarchitektur, an die das Rechenfeld angebunden ist. Die erste Stufe der MMC dient der Transformation eines Algorithmus mit Einzeldatenoperationen in einen Algorithmus mit teilwortparallelen Operationen. Mit der zweiten Copartitionierungsstufe wird der Algorithmus an die lokalen Register und an das Rechenfeld angepasst. Weitere Copartitionierungsstufen können zur Anpassung des Algorithmus an die Speicherarchitektur verwendet werden. / The technological progress allows the implementation of complex processor architectures on a chip. One trend of the last years is the implemenation of more and more execution units on one chip. That implies new challenges for the mapping of algorithms on such architectures, because the execution units should be used efficiently during the execution of the algorithm. The focus of the submitted dissertation thesis is the utilization of the parallelism of processor arrays with subword parallelism. Such architectures allow parallel executions on different levels. Therefore an algorithm mapping strategy was developed, where the exploitation of the subword parallelism was in the focus. This algorithm mapping strategy is based on the methods of the processor array design. Processor arrays are regular arranged processor elements, which communicate with their neighbors elements only. The data in- and output will be realized by the processor elements on the border of the array. Each processor element can have several functional units, which execute the computational operations. Subword parallelism means the capability for splitting the data path of the functional units in several smaller chunks for the parallel execution of data with lower word width. The developed mapping strategy is subdivided in two steps, the \"Preprocessing\" and the \"Multi-Level Modified Copartitioning\" (kurz: MMC), whereat the MMC means the method of the step simultaneously. The \"Preprocessing\" alter the algorithm in such a kind, that the altered algorithm can be fast and efficient mapped on the target architecture. Therefore an optimization problem was developed, which determines gradual the parameter for the transformation of the algorithm. The \"Multi-Level Modified Copartitioning\" is used for mapping the algorithm gradual on the target architecture. Furthermore the mapping methodology allows the exploitation of the local registers in the processing elements and the adaptation of the algorithm on the memory architecture, where the processing array is connected on. The first level of the MMC is used for the transformation of an algorithm with operation based on single data to an algorithm with subword parallel operations. With the second level, the algorithm will be adapted to the local registers in the processing elements and to the processor array. Further copartition levels can be used for matching the algorithm to the memory architecture. info:eu-repo/classification/ddc/620 ddc:620 info:eu-repo/classification/ddc/004 ddc:004

Search results