Global ETD Search

111	Evaluation of Machine Learning Primitives on a Digital Signal Processor Engström, Vilhelm January 2020 (has links) Modern handheld devices rely on specialized hardware for evaluating machine learning algorithms. This thesis investigates the feasibility of using the digital signal processor, a part of the modem of the device, as an alternative to this specialized hardware. Memory management techniques and implementations for evaluating the machine learning primitives convolutional, max-pooling and fully connected layers are proposed. The implementations are evaluated based on to what degree they utilize available hardware units. New instructions for packing data and facilitating instruction pipelining are suggested and evaluated. The results show that convolutional and fully connected layers are well-suited to the processor used. The aptness of the convolutional layer is subject to the kernel being applied with a stride of 1 as larger strides cause the hardware usage to plummet. Max-pooling layers, while not ill-suited, are the most limited in terms of hardware usage. The proposed instructions are shown to have positive effects on the throughput of the implementations. digital signal processor DSP SIMD data parallelism machine learning deep learning convolutional neural network Computer Engineering Datorteknik
112	Optimierung der Energie-Effizienz für Algorithmen der Linearen Algebra durch SIMD-Programmierung und AVX-Vektorisierung Jakobs, Thomas 10 January 2022 (has links) Neben einer kurzen Ausführungszeit rückt bei der Optimierung von Anwendungen und Algorithmen ein geringer Energieverbrauch der genutzten Rechenressourcen in den Fokus der aktuellen Forschung. Eine hohe Energie-Effizienz von Programmen wird dabei erreicht, indem der Energieverbrauch von Programmen und Technologien reduziert wird, ohne dafür die Ausführungszeit übermäßig zu erhöhen. Im parallelen wissenschaftlichen Rechnen ist der Bedarf an energie-effizienten Programmausführungen vor allem für Algorithmen der linearen Algebra gegeben, die als Unterfunktionen in einer Vielzahl von Anwendungen eingesetzt werden. Die Vektorisierung von Programmen durch die Prozessor- und Instruktionssatzerweiterung AVX zeigt Potenzial zur energie-effizienten Ausführung von Algorithmen der linearen Algebra, wobei die erzielte Energie-Effizienz von der Umsetzung der Implementierung abhängt. Für die gezeigten Untersuchungen werden drei repräsentativ ausgewählte Algorithmen der linearen Algebra für die Ausführung auf AVX-Vektoreinheiten genutzt. Bei der AVX-Vektorisierung der Algorithmen werden verschiedene Programmvarianten erstellt, mit denen Ausführungszeit und Energieverbrauch bei der Ausführung ermittelt werden. Die Programmvarianten unterscheiden sich dabei unter anderem in der Anwendung von Programmtransformationen, wie Loop Tiling oder einer veränderten Speicherzugriffsstruktur. Zusätzlich wird gezeigt, wie die Umsetzung verschiedener Programmieransätze, wie Autovektorisierung oder unterschiedlicher Instruktionssätze, sowie Implementierungsvarianten durch die Auswahl der verwendeten Instruktionen, die Ausführungszeit und den Energieverbrauch der Programmausführung beeinflussen. Die so erstellten Programmvarianten werden auf modernen Prozessoren verschiedener Architekturfamilien mit unterschiedlichen Ausführungsparametern, wie der eingestellten Prozessorfrequenz, ausgeführt. Die Untersuchungen zeigen, dass sich Ausführungszeit und Energieverbrauch von Programmen durch die Vektorisierung reduzieren lassen. Die Auswahl der Programmtransformationen, des Programmieransatzes und der Ausführungsparameter für die energie-effiziente Ausführung von vektorisierten Programmen kann dabei anwendungsspezifisch aufgrund der Eigenschaften des ausgewählten Algorithmus getroffen werden. info:eu-repo/classification/ddc/004.3 ddc:004.3
113	Using Associative Processing to Simplify Current Air Traffic Control Mohammed Amin, Rasti Jameel January 2015 (has links) No description available. Computer Science Aerospace Materials Aerospace Engineering air traffic control ATC air traffic management air traffic collision detection SIMD associative processing
114	Frequent itemset mining on multiprocessor systems Schlegel, Benjamin 08 May 2014 (has links) (PDF) Frequent itemset mining is an important building block in many data mining applications like market basket analysis, recommendation, web-mining, fraud detection, and gene expression analysis. In many of them, the datasets being mined can easily grow up to hundreds of gigabytes or even terabytes of data. Hence, efficient algorithms are required to process such large amounts of data. In recent years, there have been many frequent-itemset mining algorithms proposed, which however (1) often have high memory requirements and (2) do not exploit the large degrees of parallelism provided by modern multiprocessor systems. The high memory requirements arise mainly from inefficient data structures that have only been shown to be sufficient for small datasets. For large datasets, however, the use of these data structures force the algorithms to go out-of-core, i.e., they have to access secondary memory, which leads to serious performance degradations. Exploiting available parallelism is further required to mine large datasets because the serial performance of processors almost stopped increasing. Algorithms should therefore exploit the large number of available threads and also the other kinds of parallelism (e.g., vector instruction sets) besides thread-level parallelism. In this work, we tackle the high memory requirements of frequent itemset mining twofold: we (1) compress the datasets being mined because they must be kept in main memory during several mining invocations and (2) improve existing mining algorithms with memory-efficient data structures. For compressing the datasets, we employ efficient encodings that show a good compression performance on a wide variety of realistic datasets, i.e., the size of the datasets is reduced by up to 6.4x. The encodings can further be applied directly while loading the dataset from disk or network. Since encoding and decoding is repeatedly required for loading and mining the datasets, we reduce its costs by providing parallel encodings that achieve high throughputs for both tasks. For a memory-efficient representation of the mining algorithms’ intermediate data, we propose compact data structures and even employ explicit compression. Both methods together reduce the intermediate data’s size by up to 25x. The smaller memory requirements avoid or delay expensive out-of-core computation when large datasets are mined. For coping with the high parallelism provided by current multiprocessor systems, we identify the performance hot spots and scalability issues of existing frequent-itemset mining algorithms. The hot spots, which form basic building blocks of these algorithms, cover (1) counting the frequency of fixed-length strings, (2) building prefix trees, (3) compressing integer values, and (4) intersecting lists of sorted integer values or bitmaps. For all of them, we discuss how to exploit available parallelism and provide scalable solutions. Furthermore, almost all components of the mining algorithms must be parallelized to keep the sequential fraction of the algorithms as small as possible. We integrate the parallelized building blocks and components into three well-known mining algorithms and further analyze the impact of certain existing optimizations. Our algorithms are already single-threaded often up an order of magnitude faster than existing highly optimized algorithms and further scale almost linear on a large 32-core multiprocessor system. Although our optimizations are intended for frequent-itemset mining algorithms, they can be applied with only minor changes to algorithms that are used for mining of other types of itemsets. Data Mining Assoziationsanalyse Mehrprozessorsysteme Paralleles Data Mining SIMD Apriori Eclat FP-growth Data mining Association rule mining Multiprocessor Systems Parallel mining SIMD Compression Apriori Eclat FP-growth ddc:004 rvk:ST 530 Datenverarbeitung Informatik Computerprogrammierung Programme Daten Spezielle Computerverfahren Data Mining Algorithmen Multithreading SIMD Datenkompression
115	Frequent itemset mining on multiprocessor systems Schlegel, Benjamin 30 May 2013 (has links) Frequent itemset mining is an important building block in many data mining applications like market basket analysis, recommendation, web-mining, fraud detection, and gene expression analysis. In many of them, the datasets being mined can easily grow up to hundreds of gigabytes or even terabytes of data. Hence, efficient algorithms are required to process such large amounts of data. In recent years, there have been many frequent-itemset mining algorithms proposed, which however (1) often have high memory requirements and (2) do not exploit the large degrees of parallelism provided by modern multiprocessor systems. The high memory requirements arise mainly from inefficient data structures that have only been shown to be sufficient for small datasets. For large datasets, however, the use of these data structures force the algorithms to go out-of-core, i.e., they have to access secondary memory, which leads to serious performance degradations. Exploiting available parallelism is further required to mine large datasets because the serial performance of processors almost stopped increasing. Algorithms should therefore exploit the large number of available threads and also the other kinds of parallelism (e.g., vector instruction sets) besides thread-level parallelism. In this work, we tackle the high memory requirements of frequent itemset mining twofold: we (1) compress the datasets being mined because they must be kept in main memory during several mining invocations and (2) improve existing mining algorithms with memory-efficient data structures. For compressing the datasets, we employ efficient encodings that show a good compression performance on a wide variety of realistic datasets, i.e., the size of the datasets is reduced by up to 6.4x. The encodings can further be applied directly while loading the dataset from disk or network. Since encoding and decoding is repeatedly required for loading and mining the datasets, we reduce its costs by providing parallel encodings that achieve high throughputs for both tasks. For a memory-efficient representation of the mining algorithms’ intermediate data, we propose compact data structures and even employ explicit compression. Both methods together reduce the intermediate data’s size by up to 25x. The smaller memory requirements avoid or delay expensive out-of-core computation when large datasets are mined. For coping with the high parallelism provided by current multiprocessor systems, we identify the performance hot spots and scalability issues of existing frequent-itemset mining algorithms. The hot spots, which form basic building blocks of these algorithms, cover (1) counting the frequency of fixed-length strings, (2) building prefix trees, (3) compressing integer values, and (4) intersecting lists of sorted integer values or bitmaps. For all of them, we discuss how to exploit available parallelism and provide scalable solutions. Furthermore, almost all components of the mining algorithms must be parallelized to keep the sequential fraction of the algorithms as small as possible. We integrate the parallelized building blocks and components into three well-known mining algorithms and further analyze the impact of certain existing optimizations. Our algorithms are already single-threaded often up an order of magnitude faster than existing highly optimized algorithms and further scale almost linear on a large 32-core multiprocessor system. Although our optimizations are intended for frequent-itemset mining algorithms, they can be applied with only minor changes to algorithms that are used for mining of other types of itemsets. info:eu-repo/classification/ddc/004 ddc:004
116	Runtime specialization for heterogeneous CPU-GPU platforms Farooqui, Naila 27 May 2016 (has links) Heterogeneous parallel architectures like those comprised of CPUs and GPUs are a tantalizing compute fabric for performance-hungry developers. While these platforms enable order-of-magnitude performance increases for many data-parallel application domains, there remain several open challenges: (i) the distinct execution models inherent in the heterogeneous devices present on such platforms drives the need to dynamically match workload characteristics to the underlying resources, (ii) the complex architecture and programming models of such systems require substantial application knowledge and effort-intensive program tuning to achieve high performance, and (iii) as such platforms become prevalent, there is a need to extend their utility from running known regular data-parallel applications to the broader set of input-dependent, irregular applications common in enterprise settings. The key contribution of our research is to enable runtime specialization on such hybrid CPU-GPU platforms by matching application characteristics to the underlying heterogeneous resources for both regular and irregular workloads. Our approach enables profile-driven resource management and optimizations for such platforms, providing high application performance and system throughput. Towards this end, this research: (a) enables dynamic instrumentation for GPU-based parallel architectures, specifically targeting the complex Single-Instruction Multiple-Data (SIMD) execution model, to gain real-time introspection into application behavior; (b) leverages such dynamic performance data to support novel online resource management methods that improve application performance and system throughput, particularly for irregular, input-dependent applications; (c) automates some of the programmer effort required to exercise specialized architectural features of such platforms via instrumentation-driven dynamic code optimizations; and (d) proposes a specialized, affinity-aware work-stealing scheduling runtime for integrated CPU-GPU processors that efficiently distributes work across all CPU and GPU cores for improved load balance, taking into account both application characteristics and architectural differences of the underlying devices. Dynamic instrumentation Dynamic compilation GPU computing Heterogeneous computing Profile-guided optimizations Program analysis Workload characterization Compiler Runtime Multicore CUDA OpenCL SIMD
117	Conception d'architectures intégrées de traitement d'image de bas niveau Boubekeur, Ahmed 04 March 1992 (has links) (PDF) . WSI reconfiguration rendement tolérance aux défauts réseau d'interconnexion 2D array SIMD traitement d'image synthèse de haut niveau
118	Les optimisations d'algorithmes de traitement de signal sur les architectures modernes parallèles et embarquées Perez-Seva, Jean-Paul 24 August 2009 (has links) (PDF) Cette thèse s'intéresse aux méthodologies d'optimisation d'algorithmes de traitement de signal sur les architectures parallèles de processeurs embarqués. L'état de l'art des différentes architectures destinées au milieu embarqué permet de mettre en évidence les différents outils d'optimisation mis à disposition par les concepteurs de processeurs. L'accent est particulièrement mis sur les solutions bénéfiques aux calculs flottants intensifs, tout en notifiant les points communs et les divergences entre les différents processeurs. Le choix de l'algorithme de transformée de Fourier, comme algorithme représentatif des applications de traitement de signal, permet de détailler étape par étape les différents choix d'optimisation dans le cas d'une implémentation sur un PowerPC 970FX. Nous montrons comment à partir d'un algorithme radix-2, il est possible de réduire au plus prés du minimum la complexité de calcul grâce à l'usage de l'instruction de multiplication addition fusionnée. Nous proposons enfin une méthodologie de programmation multi-architectures utilisant le retour d'expérience précédent afin d'optimiser l'ordonnancement des instructions constituant l'algorithme. Transformée de Fourier Rapide Processeurs Embarqués Multiplication addition fusionnée Instructions SIMD Programmation haute performance GFLOPS Génération de code Multi architecture
119	Méthode de conception rapide d'architecture massivement parallèle sur puce : de la modélisation à l'expérimentation sur FPGA Baklouti, Mouna 18 December 2010 (has links) (PDF) Les travaux présentés dans cette thèse s'inscrivent dans le cadre des recherches menés sur la concep- tion et implémentation des systèmes sur puce à hautes performances afin d'accélérer et faciliter la conception ainsi que la mise en œuvre des applications de traitement systématique à parallélisme de données massif. Nous définissons dans ce travail un système SIMD massivement parallèle sur puce nommé mppSoC : massively paral- lel processing System on Chip. Ce système est générique et paramétrique pour s'adapter à l'application. Nous proposons une démarche de conception rapide et modulaire pour mppSoC. Cette conception se base sur un assemblage de composants ou IPs. À cette fin, une bibliothèque mppSoCLib est mise en place. Le concepteur pourra directement choisir les composants nécessaires et définir les paramètres du système afin de construire une configuration SIMD répondant à ses besoins. Une chaîne de génération automatisée a été développée. Cette chaîne permet la génération automatique du code VHDL d'une configuration mppSoC modélisée à haut niveau d'abstraction (UML). Le code VHDL produit est directement simulable et synthétisable sur FPGA. Cette chaîne autorise la définition à un haut niveau d'abstraction d'une configuration adéquate à une application donnée. À partir de la simulation du code généré automatiquement, nous pouvons modifier la configuration dans une démarche d'exploration pour le moment semi-automatique. Nous validons mppSoC dans un contexte applicatif réel de traitement vidéo à base de FPGA. Dans ce même contexte, une comparaison entre mppSoC et d'autres systèmes montre les performances suffisantes et l'efficacité de mppSoC. [INFO] Computer Science Assemblage d'IPs FPGA IDM ingénierie dirigée par les modèles mppSoC parallélisme SIMD systèmes- sur-puce traitement intensif de données
120	A model of dynamic compilation for heterogeneous compute platforms Kerr, Andrew 10 December 2012 (has links) Trends in computer engineering place renewed emphasis on increasing parallelism and heterogeneity. The rise of parallelism adds an additional dimension to the challenge of portability, as different processors support different notions of parallelism, whether vector parallelism executing in a few threads on multicore CPUs or large-scale thread hierarchies on GPUs. Thus, software experiences obstacles to portability and efficient execution beyond differences in instruction sets; rather, the underlying execution models of radically different architectures may not be compatible. Dynamic compilation applied to data-parallel heterogeneous architectures presents an abstraction layer decoupling program representations from optimized binaries, thus enabling portability without encumbering performance. This dissertation proposes several techniques that extend dynamic compilation to data-parallel execution models. These contributions include: - characterization of data-parallel workloads - machine-independent application metrics - framework for performance modeling and prediction - execution model translation for vector processors - region-based compilation and scheduling We evaluate these claims via the development of a novel dynamic compilation framework, GPU Ocelot, with which we execute real-world workloads from GPU computing. This enables the execution of GPU computing workloads to run efficiently on multicore CPUs, GPUs, and a functional simulator. We show data-parallel workloads exhibit performance scaling, take advantage of vector instruction set extensions, and effectively exploit data locality via scheduling which attempts to maximize control locality. Dynamic compilation GPU computing Cuda Opencl SIMD Vector Multicore Parallel computing Parallel computers Parallel programs (Computer programs) Heterogeneous computing High performance computing

Search results