Global ETD Search

101	Algorithmes de la morphologie mathématique pour les architectures orientées flux Brambor, Jaromír 11 July 2006 (has links) (PDF) This thesis is devoted to mathematical morphology algorithms that can treat the pixels of an image as a data stream. We show that a large number of mathematical morphology algorithms can be described as a data stream passing through execution units, and that this approach also works on general-purpose processors with a multimedia instruction set and on graphics cards. To describe the algorithms as data streams, we propose using the functional language Haskell, which lets us describe the basic building blocks of mathematical morphology algorithms. We apply these building blocks to describe the most commonly used algorithms (dilation/erosion, geodesic operations, distance function, and levelings), which eases porting these algorithms to several platforms. For constructing morphological algorithms we propose an original macro-block execution mode, and we study in depth how this idea carries over to SIMD architectures. We show that macro blocks are worthwhile on multimedia architectures and that the morphological algorithms proposed in this thesis outperform standard implementations. This opens a new field for algorithms developed in real-time image processing applications. The thesis also explores graphics processors and demonstrates with experimental results that they are already fast enough to compete with general-purpose processors. [MATH] Mathematics Mathematical morphology Fast algorithms Data streams Macro blocks SIMD Graphics processors Haskell Formal description Lambda calculus
102	H.264 Baseline Real-time High Definition Encoder on CELL Wei, Zhengzhe January 2010 (has links) In this thesis, an H.264 baseline high-definition encoder is implemented on the CELL processor. The target video sequence for our encoder is YUV420 1080p at 30 frames per second. To meet real-time requirements, a system architecture that reduces DMA requests is designed for large memory accesses. Several key computing kernels (intra-frame encoding, motion estimation search and entropy coding) are designed and ported to the CELL processor units. A main challenge is to find a good tradeoff between DMA latency and processing time. The limited 256 KB of on-chip memory of each SPE has to be organized efficiently in a SIMD way. CAVLC is performed in non-real-time on the PPE. The experimental results show that our encoder is able to encode I frames in high quality and to encode common 1080p video sequences in real time. Using five SPEs and 63 KB of executable code, 20.72M cycles are needed to encode one P-frame partition on one SPE. The average PSNR of P frames increases by at most 1.52%. For fast-moving video sequences, a 64x64 search range gives better frame quality than a 16x16 search range while requiring less than twice the computing cycles of 16x16. Our results also demonstrate that more of the potential power of the CELL processor can be exploited in multimedia computing. The H.264 main profile will be implemented in future phases of this encoder project. Since the platform we use is the IBM Full-System Simulator, DMA performance on a real CELL processor is an interesting issue. Real-time entropy coding is another challenge on CELL. Video coding H.264 CELL processor Real-time coding Intra prediction Parallel programming SIMD Computer engineering Datorteknik
103	Parallel video decoding Álvarez Mesa, Mauricio 08 September 2011 (has links) Digital video is a popular technology used in many different applications. The quality of video, expressed in spatial and temporal resolution, has been increasing continuously in recent years. To reduce the bitrate required for its storage and transmission, a new generation of video encoders and decoders (codecs) has been developed. The latest video codec standard, known as H.264/AVC, includes sophisticated compression tools that require more computing resources than any previous video codec. The combination of high-quality video and the advanced compression tools found in H.264/AVC has resulted in a significant increase in the computational requirements of video decoding applications. The main objective of this thesis is to provide the performance required for real-time decoding of high-quality video using programmable architectures. Our solution has been the simultaneous exploitation of multiple levels of parallelism. On the one hand, video decoders have been modified to extract as much parallelism as possible. On the other hand, general-purpose architectures have been enhanced to exploit the type of parallelism that is present in video codec applications.
We first analyzed the scalability of two Single Instruction, Multiple Data (SIMD) extensions: a one-dimensional (1D) extension and a two-dimensional (2D) matrix extension. Scaling the 2D extension was shown to deliver higher performance with lower complexity than scaling the 1D extension. We then characterized H.264/AVC decoding for high-definition (HD) applications and identified the main kernels. Because no suitable benchmark for HD video decoding existed, we developed our own, HD-VideoBench, which includes complete video encoding and decoding applications together with a set of HD video sequences. Next, we optimized the most important kernels of the H.264/AVC decoder using SIMD instructions. However, the results fell short of the maximum possible performance because of the negative effect of unaligned data in memory. As a solution, we evaluated the hardware and software support needed for unaligned accesses, which yielded significant performance improvements in the application.
Separately, we investigated how to extract task-level parallelism. None of the existing mechanisms was found to scale to massively parallel systems. As an alternative, we developed a new algorithm that was able to find thousands of independent tasks by exploiting macroblock-level parallelism. We then implemented a parallel version of the H.264 decoder on a distributed shared memory (DSM) machine. However, this implementation did not reach the maximum possible performance because of the negative impact of synchronization operations and the effect of the entropy decoding kernel. To remove these bottlenecks, we evaluated frame-level parallelization of the entropy decoding stage combined with macroblock-level parallelization of the remaining kernels. The overhead of synchronization operations was almost completely eliminated by using hardware-accelerated operations. Together, these improvements enabled real-time decoding of high-definition, high-frame-rate video. The overall result is a scalable solution able to use the growing number of processors in multicore architectures. H.264 High definition Video decoding Parallel SIMD MPEG-2 Vector processors Heterogeneous computing Multicore Computer architecture Parallel programming 004
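As an illustration of the macroblock-level parallelism mentioned in record 103, the following Python sketch shows the commonly described wavefront ordering for H.264 macroblocks. It assumes each macroblock depends on its left, upper-left, upper and upper-right neighbours; the function names and the 8x4 frame size are hypothetical, and this is not the algorithm developed in the thesis.

```python
from collections import defaultdict


def dependencies(x, y, width):
    """Neighbours a macroblock is ASSUMED to depend on:
    left, upper-left, upper and upper-right (clipped to the frame)."""
    candidates = [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]
    return [(cx, cy) for cx, cy in candidates if 0 <= cx < width and cy >= 0]


def wavefront_schedule(width, height):
    """Group macroblocks into steps. Block (x, y) is placed in step x + 2*y,
    so every dependency of a block lands in a strictly earlier step and all
    blocks within one step can be processed in parallel."""
    steps = defaultdict(list)
    for y in range(height):
        for x in range(width):
            steps[x + 2 * y].append((x, y))
    return [steps[t] for t in sorted(steps)]


# Hypothetical frame of 8x4 macroblocks.
schedule = wavefront_schedule(width=8, height=4)
widest_step = max(len(step) for step in schedule)
```

In this sketch the number of blocks per step grows and shrinks along the wave, which is one reason a simple wavefront alone offers limited parallelism on small frames.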
104	Harmony: an execution model for heterogeneous systems Diamos, Gregory Frederick 10 November 2011 (has links) The emergence of heterogeneous and many-core architectures presents a unique opportunity to deliver order of magnitude performance increases to high performance applications by matching certain classes of algorithms to specifically tailored architectures. However, their ubiquitous adoption has been limited by a lack of programming models and management frameworks designed to reduce the high degree of complexity of software development inherent to heterogeneous architectures. This dissertation introduces Harmony, an execution model for heterogeneous systems that draws heavily from concepts and optimizations used in processor micro-architecture to provide: (1) semantics for simplifying heterogeneity management, (2) dynamic scheduling of compute intensive kernels to heterogeneous processor resources, and (3) online monitoring driven performance optimization for heterogeneous many core systems. This work focuses on simplifying development and ensuring binary portability and scalability across system configurations and sizes. Heterogeneous Many-core Compiler Runtime GPU Processor SIMD Scheduling Execution model Modeling Computing model Computer architecture Algorithms Heterogeneous computing
105	Compilation optimisante pour processeurs extensibles Floc'h, Antoine 08 June 2012 (has links) (PDF) Application-specific instruction-set processors (ASIPs) are a compromise between the performance of a dedicated hardware circuit and the flexibility of a programmable processor. These specialized processors can consist of a general-purpose processor whose instruction set is extended with instructions specific to one or more applications, executed on a hardware extension. These are called extensible processors. While the design and verification cost of such architectures is considerably lower than that of a full custom design, the complexity is partly shifted to the compilation step: the instruction set of an extensible processor is both an input and an output of the compilation process. This thesis proposes several contributions to guide the design of such architectures through optimization techniques suited to extensible processors. The first contribution selects and schedules the specialized VLIW instructions by solving a single constraint programming (CP) optimization problem. In addition, we propose an original technique that addresses the interaction between code optimization and instruction-set extension. The principle is to automatically transform the original code of a program's loop nests (using the polyhedral model) in order to select specialized instructions that are vectorizable and whose temporary data, produced during a loop iteration, are stored on the processor's hardware extension. Compilation ASIP SIMD Constraint programming Polyhedral model IDM
106	Optimized SIMD scheduling and architecture implementation for ultra-low energy bioimaging processor / Βελτιστοποιημένος χρονοπρογραμματισμός εντολών για παράλληλη επεξεργασία (SIMD) και υλοποίηση αρχιτεκτονικής για επεξεργαστή χαμηλής κατανάλωσης για αλγόριθμους βιοαπεικόνισης Ψύχου, Γεωργία 03 August 2010 (has links) On-line poultry monitoring can significantly improve the living conditions of hens in industrial farms. To be economically feasible, however, such a solution must be very low-cost and low-energy. ASIPs can be an ideal solution when low-energy concepts are used for their realization, since their flexibility lets them cover many submarkets. Aiming at high energy efficiency, this work implements data parallelization using a recently introduced software-controlled SIMD realization in an innovative way. Manually mapping and scheduling the most critical part of the application leads to a highly optimized result in terms of cycles, area and energy. This manual schedule must also be supported by a commercial compiler tool so that design time is minimized. Moreover, energy-efficient mapping must be explored for the remaining parts of the code. Because those parts of the code execute very infrequently, more attention is given there to minimizing the area overhead. Increasing the energy efficiency of the data path in such ways can be very important, since the data path can dominate the total energy budget once the instruction/data memory overhead is minimized by other complementary approaches, as is the case in the proposed ASIP. SIMD Scheduling Ultra-low energy Architecture implementation 629.895 43
107	Optimized SIMD architecture exploration and implementation for ultra-low energy processors / Εξερεύνηση και υλοποίηση βελτιστοποιημένης SIMD αρχιτεκτονικής για επεξεργαστές πολύ χαμηλής κατανάλωσης Δακουρού, Στεφανία 19 July 2012 (has links) On-line monitoring is an important challenge in future biotechnology applications, for instance in precision livestock farming, where there is a strong need for low-cost intelligent sensors to monitor animal welfare. On-line poultry monitoring can significantly improve the living conditions of hens in industrial farms. Because of stringent battery limitations, however, a very low-cost, low-energy solution is needed. Domain-specific ASIPs can be an ideal solution when they cover enough submarkets to increase the production volume (reducing the price) and ultra-low energy concepts are used for their realization. This work is part of a larger project aiming at high energy efficiency. The current study implements data parallelization, using a recently introduced software-controlled SIMD realization in an innovative way. It thoroughly explores the approaches used to determine the final instruction set of the architecture created for mapping the critical Gauss loop of the detection application. Redesigning the data-parallel data path, also referred to as the Soft-SIMD architecture, was necessary to optimize the instruction encoding. Furthermore, we explore the capabilities that a commercial retargetable compiler tool, such as Target's, can offer for our design, and we suggest some potential modifications that would make the tool more efficient and useful for a designer's needs on such an architecture. The study thereby also demonstrates the promising results obtained by experimenting with detours around the current limitations of the Target tool. 
Finding the right balance between efficiency and flexibility requires the ability to quickly evaluate alternative architectures through simulation and testing. The methods developed for exactly this purpose, with the help of Target's IP Designer retargetable tool suite, are discussed in detail. By exploiting the profiling information produced by the ISS and reading the assembly code produced by the C compiler, it is possible to identify the instructions in the critical loop and optimize them using a number of the techniques discussed. The main purpose of this optimization is to reduce the cycle count of the application and thereby the overall power consumption. VHDL files of the optimized and unoptimized processors are generated automatically using the HDL generation tool. However, a bio-imaging application instantiated from the ULP-ASIP architectural template [FEENECS book] raises many other issues as well. In particular, the way such implementations are tested must be taken into consideration. Testing should preferably be not only sufficient and efficient but also reusable, in the sense that test patterns can be generated not just for a specific application or group of applications but for the entire architectural template. Therefore, this study also illustrates a systematic test vector generation process for the ULP-ASIP template. Our goal is to formulate generalized principles, because such principles are reusable and can be applied to any instance, such as our present processor for the Gauss filter. Finally, the study presents realistic power numbers, based on layout back-annotation, for the data path components of the processor. 
Based on all the advanced optimizations and broad search-space explorations presented in this thesis, a heavily optimized ASIP architecture has been fully implemented, resulting in a low-cost, ultra-low-energy platform that still meets all the performance requirements. / The automatic method for monitoring live organisms, as researched and published by the Biosystems Department (BIOSYST) of K.U. Leuven [1], consists of a computer vision application that classifies the behavior of the organisms based on their responses. This biotechnology application develops a fully automated computer vision system for individual, confined hens. The application is divided into two algorithms: the first detects the monitored object (detection algorithm) and the second tracks it (tracking algorithm). The present study is part of a larger project and continues the previous work developed in this field. Its purpose is to explore the architecture created for mapping the critical Gauss loop of the detection algorithm, in order to determine the final instruction set of the ULP-ASIP SIMD processor. The techniques and approaches used to support the optimization of the instruction-set encoding are presented extensively in Chapter 2. In addition, during the architecture exploration, the defined instruction set and the mapping techniques are re-examined in order to reduce the total execution cost. Finding the right balance between efficiency and flexibility requires the ability to quickly evaluate alternative architectures through simulation and testing. 
Chapter 3 explains the methods developed for exactly this purpose, with the help of the IP Designer environment of Target Compiler Technologies, which offers a complete retargetable tool. However, a more systematic test vector generation process for the entire ULP-ASIP platform turned out to be a very important asset for validating the operation of the ULP-ASIP processor. Such a method is therefore analyzed and presented extensively in Chapter 4. Finally, Chapter 5 presents the energy estimation of the processor's data path. Based on all the advanced optimizations and broad search-space explorations presented in the preceding chapters, a heavily optimized synthesizable ASIP architecture is fully implemented, resulting in a low-cost, very low energy platform that meets all the performance requirements. SIMD Ultra-low energy processors ASIP Instruction set processors 621.395
108	H.264 Baseline Real-time High Definition Encoder on CELL Wei, Zhengzhe January 2010 (has links) In this thesis, an H.264 baseline high-definition encoder is implemented on the CELL processor. The target video sequence for our encoder is YUV420 1080p at 30 frames per second. To meet real-time requirements, a system architecture that reduces DMA requests is designed for large memory accesses. Several key computing kernels (intra-frame encoding, motion estimation search and entropy coding) are designed and ported to the CELL processor units. A main challenge is to find a good tradeoff between DMA latency and processing time. The limited 256 KB of on-chip memory of each SPE has to be organized efficiently in a SIMD way. CAVLC is performed in non-real-time on the PPE. The experimental results show that our encoder is able to encode I frames in high quality and to encode common 1080p video sequences in real time. Using five SPEs and 63 KB of executable code, 20.72M cycles are needed to encode one P-frame partition on one SPE. The average PSNR of P frames increases by at most 1.52%. For fast-moving video sequences, a 64x64 search range gives better frame quality than a 16x16 search range while requiring less than twice the computing cycles of 16x16. Our results also demonstrate that more of the potential power of the CELL processor can be exploited in multimedia computing. The H.264 main profile will be implemented in future phases of this encoder project. Since the platform we use is the IBM Full-System Simulator, DMA performance on a real CELL processor is an interesting issue. Real-time entropy coding is another challenge on CELL. Video coding H.264 CELL processor Real-time coding Intra prediction Parallel programming SIMD Computer Engineering Datorteknik
109	Idiom-driven innermost loop vectorization in the presence of cross-iteration data dependencies in the HotSpot C2 compiler / Idiomdriven vektorisering av inre loopar med databeroenden i HotSpots C2 kompilator Sjöblom, William January 2020 (has links) This thesis presents a technique for automatic vectorization of innermost single statement loops with a cross-iteration data dependence by analyzing data-flow to recognize frequently recurring program idioms. Recognition is carried out by matching the circular SSA data-flow found around the loop body’s φ-function against several primitive patterns, forming a tree representation of the relevant data-flow that is then pruned down to a single parameterized node, providing a high-level specification of the data-flow idiom at hand used to guide algorithmic replacement applied to the intermediate representation. The versatility of the technique is shown by presenting an implementation supporting vectorization of both a limited class of linear recurrences as well as prefix sums, where the latter shows how the technique generalizes to intermediate representations with memory state in SSA-form. Finally, a thorough performance evaluation is presented, showing the effectiveness of the vectorization technique. compiler vectorization SIMD Java HotSpot code optimization reductions prefix sums parallel programming data-level parallelism Computer Sciences Datavetenskap (datalogi)
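To make the prefix-sum idiom named in record 109 concrete, here is a minimal Python sketch. It contrasts the scalar loop, whose iteration i reads the result of iteration i - 1, with a log-step scan in which each step only reads values from the previous step and could therefore map to one vector addition. The function names are hypothetical, and this illustrates the general idiom only, not the replacement code generated in the HotSpot C2 compiler.

```python
def prefix_sum_scalar(values):
    """Reference loop with a cross-iteration dependence:
    out[i] needs out[i - 1] from the previous iteration."""
    out = list(values)
    for i in range(1, len(out)):
        out[i] = out[i - 1] + out[i]
    return out


def prefix_sum_log_steps(values):
    """Log-step (Hillis-Steele style) inclusive scan. Within one step every
    element is computed only from the previous step's list, so the whole
    step is data-parallel; the shift doubles after each step."""
    out = list(values)
    shift = 1
    while shift < len(out):
        out = [
            out[i] + (out[i - shift] if i >= shift else 0)
            for i in range(len(out))
        ]
        shift *= 2
    return out


sample = [3, 1, 4, 1, 5, 9, 2, 6]
scalar_result = prefix_sum_scalar(sample)
scan_result = prefix_sum_log_steps(sample)
```

Both functions return the same running totals; the second trades extra additions for steps without a dependence between neighbouring elements.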
110	Optimalizace rozpoznávání řeči pro mobilní zařízení / Optimization of Voice Recognition for Mobile Devices Tomec, Martin January 2010 (has links) This work deals with the optimization of keyword spotting algorithms for the ARM Cortex-A8 processor architecture. It first describes this architecture, in particular the NEON unit for vector computing. It then briefly describes keyword spotting algorithms and proposes optimizations of these algorithms for the described architecture. The main part of the work is the implementation of these optimizations and an analysis of their impact on performance.
