  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
111

Evaluation of Machine Learning Primitives on a Digital Signal Processor

Engström, Vilhelm January 2020
Modern handheld devices rely on specialized hardware for evaluating machine learning algorithms. This thesis investigates the feasibility of using the digital signal processor, a part of the device's modem, as an alternative to this specialized hardware. Memory management techniques are proposed, along with implementations for evaluating three machine learning primitives: convolutional, max-pooling and fully connected layers. The implementations are evaluated by the degree to which they utilize the available hardware units. New instructions for packing data and facilitating instruction pipelining are suggested and evaluated. The results show that convolutional and fully connected layers are well suited to the processor used. The convolutional layer is well suited only when the kernel is applied with a stride of 1, as larger strides cause hardware usage to plummet. Max-pooling layers, while not ill-suited, are the most limited in terms of hardware usage. The proposed instructions are shown to have positive effects on the throughput of the implementations.
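As a point of reference for the primitives named in this abstract, the following is a minimal sketch of a strided 2D convolution and a max-pooling layer in plain Python. It is not the thesis's DSP implementation: the function names `conv2d` and `max_pool2d`, the nested-list data layout, and the absence of padding are all assumptions made for illustration.

```python
def conv2d(image, kernel, stride=1):
    """Valid (unpadded) 2D convolution in the cross-correlation form used by CNNs."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            # Multiply-accumulate over the kernel window anchored at (i*stride, j*stride).
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


def max_pool2d(image, size=2, stride=2):
    """Max-pooling: each output is the maximum over a size x size window."""
    out_h = (len(image) - size) // stride + 1
    out_w = (len(image[0]) - size) // stride + 1
    return [
        [
            max(
                image[i * stride + di][j * stride + dj]
                for di in range(size)
                for dj in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


image = [[4 * r + c for c in range(4)] for r in range(4)]  # values 0..15
kernel = [[1, 0], [0, 1]]
print(conv2d(image, kernel, stride=1))  # 3x3 output
print(conv2d(image, kernel, stride=2))  # 2x2 output
print(max_pool2d(image))                # 2x2 output
```

With a stride of 1, neighbouring outputs read overlapping input windows; with a stride of 2, the windows in this example no longer overlap. Whether that is the reason for the reduced hardware usage reported in the thesis is not stated in the abstract.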
112

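Optimierung der Energie-Effizienz für Algorithmen der Linearen Algebra durch SIMD-Programmierung und AVX-Vektorisierung [Optimizing the Energy Efficiency of Linear Algebra Algorithms through SIMD Programming and AVX Vectorization]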

Jakobs, Thomas 10 January 2022
Alongside short execution time, low energy consumption of the computing resources used is moving into the focus of current research on optimizing applications and algorithms. High energy efficiency is achieved by reducing the energy consumption of programs and technologies without excessively increasing execution time. In parallel scientific computing, the need for energy-efficient program execution arises above all for linear algebra algorithms, which are used as subroutines in a large number of applications. Vectorizing programs with the AVX processor and instruction set extension shows potential for energy-efficient execution of linear algebra algorithms, although the energy efficiency achieved depends on how the implementation is realized. For the investigations presented, three representatively selected linear algebra algorithms are used for execution on AVX vector units. During AVX vectorization of the algorithms, different program variants are created, with which execution time and energy consumption are measured. The program variants differ, among other things, in the program transformations applied, such as loop tiling or a modified memory access structure. In addition, it is shown how different programming approaches, such as autovectorization or different instruction sets, and implementation variants resulting from the choice of instructions used influence the execution time and energy consumption of program execution. The resulting program variants are executed on modern processors from different architecture families with different execution parameters, such as the configured processor frequency.
The investigations show that vectorization can reduce both the execution time and the energy consumption of programs. The choice of program transformations, programming approach, and execution parameters for energy-efficient execution of vectorized programs can be made application-specifically, based on the properties of the selected algorithm.
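Loop tiling, one of the program transformations named in this abstract, can be illustrated with a small sketch. The example below applies it to matrix multiplication in plain Python; the choice of matrix multiplication, the function names and the tile size are assumptions for illustration, and the code says nothing about the thesis's AVX implementations or measured energy figures.

```python
def matmul_naive(a, b):
    """Reference triple loop: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            for j in range(p):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same arithmetic, but iterated tile by tile so each block of a, b and c
    is reused while it is still likely to be in cache."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for ii in range(0, n, tile):          # tile origin along rows of a / c
        for kk in range(0, m, tile):      # tile origin along the shared dimension
            for jj in range(0, p, tile):  # tile origin along columns of b / c
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + tile, p)):
                            c[i][j] += aik * b[k][j]
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_tiled(a, b))  # same result as matmul_naive(a, b)
```

The tiled version changes only the order in which memory is visited, which is why such variants can be compared directly on execution time and energy consumption. In a vectorized implementation the innermost loop over `j` would typically be the one mapped to vector instructions.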
113

Using Associative Processing to Simplify Current Air Traffic Control

Mohammed Amin, Rasti Jameel January 2015
No description available.
114

Frequent itemset mining on multiprocessor systems

Schlegel, Benjamin 08 May 2014
Frequent itemset mining is an important building block in many data mining applications, such as market basket analysis, recommendation, web mining, fraud detection, and gene expression analysis. In many of them, the datasets being mined can easily grow to hundreds of gigabytes or even terabytes. Hence, efficient algorithms are required to process such large amounts of data. In recent years, many frequent-itemset mining algorithms have been proposed; however, they (1) often have high memory requirements and (2) do not exploit the large degrees of parallelism provided by modern multiprocessor systems. The high memory requirements arise mainly from inefficient data structures that have only been shown to be sufficient for small datasets. For large datasets, however, the use of these data structures forces the algorithms to go out-of-core, i.e., they have to access secondary memory, which leads to serious performance degradation. Exploiting available parallelism is further required to mine large datasets because the serial performance of processors has almost stopped increasing. Algorithms should therefore exploit the large number of available threads and also other kinds of parallelism (e.g., vector instruction sets) besides thread-level parallelism. In this work, we tackle the high memory requirements of frequent itemset mining in two ways: we (1) compress the datasets being mined, because they must be kept in main memory during several mining invocations, and (2) improve existing mining algorithms with memory-efficient data structures. For compressing the datasets, we employ efficient encodings that show good compression performance on a wide variety of realistic datasets, i.e., the size of the datasets is reduced by up to 6.4x. The encodings can further be applied directly while loading the dataset from disk or network.
Since encoding and decoding are repeatedly required for loading and mining the datasets, we reduce their cost by providing parallel encodings that achieve high throughput for both tasks. For a memory-efficient representation of the mining algorithms’ intermediate data, we propose compact data structures and even employ explicit compression. Both methods together reduce the intermediate data’s size by up to 25x. The smaller memory requirements avoid or delay expensive out-of-core computation when large datasets are mined. To cope with the high parallelism provided by current multiprocessor systems, we identify the performance hot spots and scalability issues of existing frequent-itemset mining algorithms. The hot spots, which form basic building blocks of these algorithms, cover (1) counting the frequency of fixed-length strings, (2) building prefix trees, (3) compressing integer values, and (4) intersecting lists of sorted integer values or bitmaps. For all of them, we discuss how to exploit available parallelism and provide scalable solutions. Furthermore, almost all components of the mining algorithms must be parallelized to keep the sequential fraction of the algorithms as small as possible. We integrate the parallelized building blocks and components into three well-known mining algorithms and further analyze the impact of certain existing optimizations. Even single-threaded, our algorithms are often up to an order of magnitude faster than existing highly optimized algorithms, and they scale almost linearly on a large 32-core multiprocessor system. Although our optimizations are intended for frequent-itemset mining algorithms, they can be applied with only minor changes to algorithms used for mining other types of itemsets.
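One of the hot spots listed in this abstract, intersecting lists of sorted integers, can be sketched as follows. The example counts the support of an itemset by intersecting per-item lists of transaction ids. It is a scalar, single-threaded illustration only; the names `intersect_sorted`, `tid_lists` and `support` are hypothetical, and the parallel and compressed variants described in the thesis are not reproduced here.

```python
def intersect_sorted(xs, ys):
    """Merge-style intersection of two ascending integer lists."""
    i = j = 0
    out = []
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            out.append(xs[i])
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return out


def tid_lists(transactions):
    """Map each item to the ascending list of transaction ids that contain it."""
    lists = {}
    for tid, items in enumerate(transactions):
        for item in set(items):
            lists.setdefault(item, []).append(tid)
    return lists


def support(itemset, lists):
    """Number of transactions containing every item of a non-empty itemset."""
    items = list(itemset)
    tids = lists.get(items[0], [])
    for item in items[1:]:
        tids = intersect_sorted(tids, lists.get(item, []))
    return len(tids)


transactions = [
    ["bread", "milk"],
    ["bread", "beer"],
    ["milk", "beer", "bread"],
    ["milk"],
]
lists = tid_lists(transactions)
print(support(["bread", "milk"], lists))          # 2
print(support(["bread", "milk", "beer"], lists))  # 1
```

An itemset is frequent when its support reaches a chosen minimum threshold. Because the per-item lists are sorted, each intersection is a single linear pass, which is the kind of regular access pattern that lends itself to the vector and thread-level parallelism discussed above.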
115

Frequent itemset mining on multiprocessor systems

Schlegel, Benjamin 30 May 2013
116

Runtime specialization for heterogeneous CPU-GPU platforms

Farooqui, Naila 27 May 2016
Heterogeneous parallel architectures, such as those composed of CPUs and GPUs, are a tantalizing compute fabric for performance-hungry developers. While these platforms enable order-of-magnitude performance increases for many data-parallel application domains, several open challenges remain: (i) the distinct execution models inherent in the heterogeneous devices present on such platforms drive the need to dynamically match workload characteristics to the underlying resources, (ii) the complex architecture and programming models of such systems require substantial application knowledge and effort-intensive program tuning to achieve high performance, and (iii) as such platforms become prevalent, there is a need to extend their utility from running known regular data-parallel applications to the broader set of input-dependent, irregular applications common in enterprise settings. The key contribution of our research is to enable runtime specialization on such hybrid CPU-GPU platforms by matching application characteristics to the underlying heterogeneous resources for both regular and irregular workloads. Our approach enables profile-driven resource management and optimizations for such platforms, providing high application performance and system throughput.
Towards this end, this research: (a) enables dynamic instrumentation for GPU-based parallel architectures, specifically targeting the complex Single-Instruction Multiple-Data (SIMD) execution model, to gain real-time introspection into application behavior; (b) leverages such dynamic performance data to support novel online resource management methods that improve application performance and system throughput, particularly for irregular, input-dependent applications; (c) automates some of the programmer effort required to exercise specialized architectural features of such platforms via instrumentation-driven dynamic code optimizations; and (d) proposes a specialized, affinity-aware work-stealing scheduling runtime for integrated CPU-GPU processors that efficiently distributes work across all CPU and GPU cores for improved load balance, taking into account both application characteristics and architectural differences of the underlying devices.
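The work-stealing idea behind contribution (d) can be sketched with a small deterministic simulation. In the sketch below each worker takes tasks from the tail of its own queue, and an idle worker steals from the head of the longest queue. This shows generic work stealing only: the affinity-aware policy and the CPU-GPU specifics of the dissertation are not modeled, and `run_work_stealing` is a hypothetical name.

```python
from collections import deque


def run_work_stealing(task_lists):
    """Simulate workers in lockstep rounds; return the tasks each worker ran."""
    queues = [deque(tasks) for tasks in task_lists]
    done = [[] for _ in queues]
    while any(queues):
        for worker, queue in enumerate(queues):
            if queue:
                # The owner takes the newest task from the tail of its own queue.
                done[worker].append(queue.pop())
            else:
                # An idle worker picks the longest queue as its victim ...
                victim = max(range(len(queues)), key=lambda v: len(queues[v]))
                if queues[victim]:
                    # ... and steals the oldest task from the victim's head.
                    done[worker].append(queues[victim].popleft())
    return done


# All six tasks start on worker 0; worker 1 starts idle and steals.
print(run_work_stealing([[1, 2, 3, 4, 5, 6], []]))
```

Taking from opposite ends of the queue keeps the owner and the thief from contending for the same task. A real runtime would use concurrent deques and, in an affinity-aware design, would also weigh which device is better suited to each task before stealing it.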
117

Conception d'architectures intégrées de traitement d'image de bas niveau [Design of integrated architectures for low-level image processing]

Boubekeur, Ahmed 04 March 1992
No description available.
118

Les optimisations d'algorithmes de traitement de signal sur les architectures modernes parallèles et embarquées [Optimizing signal processing algorithms on modern parallel and embedded architectures]

Perez-Seva, Jean-Paul 24 August 2009
This thesis addresses methodologies for optimizing signal processing algorithms on parallel architectures of embedded processors. A survey of the state of the art of architectures intended for embedded use highlights the optimization tools that processor designers make available. Particular emphasis is placed on solutions that benefit intensive floating-point computation, while noting the commonalities and differences between processors. The Fourier transform is chosen as a representative algorithm for signal processing applications, which allows the optimization choices to be detailed step by step for an implementation on a PowerPC 970FX. We show how, starting from a radix-2 algorithm, the computational complexity can be brought as close as possible to the minimum by using the fused multiply-add instruction. Finally, we propose a multi-architecture programming methodology that draws on this experience to optimize the scheduling of the instructions that make up the algorithm.
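The radix-2 starting point mentioned in this abstract can be sketched as a textbook recursive decimation-in-time FFT. The code below is a plain Python illustration, not the optimized PowerPC 970FX implementation from the thesis; the name `fft_radix2` is hypothetical, and the fused multiply-add restructuring described in the thesis is only pointed to in a comment.

```python
import cmath


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor
        # Butterfly: each output is an addend plus or minus a product,
        # the multiply-then-add shape that a fused multiply-add targets.
        t = w * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


print(fft_radix2([1, 2, 3, 4]))  # approximately [10, -2+2j, -2, -2-2j]
```

Each butterfly performs one complex multiplication followed by an addition and a subtraction. Written out in real arithmetic, these become chains of multiplies feeding adds, which is where a fused multiply-add instruction can reduce the operation count.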
119

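Méthode de conception rapide d'architecture massivement parallèle sur puce : de la modélisation à l'expérimentation sur FPGA [A rapid design method for massively parallel architectures on chip: from modeling to experimentation on FPGA]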

Baklouti, Mouna 18 December 2010
The work presented in this thesis is part of research on the design and implementation of high-performance systems on chip, with the aim of accelerating and simplifying the design and implementation of systematic processing applications with massive data parallelism. We define a massively parallel SIMD system on chip named mppSoC (massively parallel processing System on Chip). The system is generic and parametric so that it can be adapted to the application. We propose a rapid, modular design approach for mppSoC, based on assembling components, or IPs. To this end, a library, mppSoCLib, has been set up. The designer can directly choose the required components and define the system parameters to build a SIMD configuration that meets their needs. An automated generation chain has been developed, which automatically generates the VHDL code of an mppSoC configuration modeled at a high level of abstraction (UML). The VHDL code produced can be directly simulated and synthesized on FPGA. The chain makes it possible to define, at a high level of abstraction, a configuration suited to a given application. Based on simulation of the automatically generated code, the configuration can be modified in an exploration process that is, for now, semi-automatic. We validate mppSoC in a real FPGA-based video processing application context. In the same context, a comparison between mppSoC and other systems shows that mppSoC delivers sufficient performance and is efficient.
120

A model of dynamic compilation for heterogeneous compute platforms

Kerr, Andrew 10 December 2012
Trends in computer engineering place renewed emphasis on increasing parallelism and heterogeneity. The rise of parallelism adds a further dimension to the challenge of portability, as different processors support different notions of parallelism, whether vector parallelism executing in a few threads on multicore CPUs or large-scale thread hierarchies on GPUs. Software therefore faces obstacles to portability and efficient execution that go beyond differences in instruction sets: the underlying execution models of radically different architectures may not be compatible. Dynamic compilation applied to data-parallel heterogeneous architectures presents an abstraction layer decoupling program representations from optimized binaries, thus enabling portability without encumbering performance. This dissertation proposes several techniques that extend dynamic compilation to data-parallel execution models. These contributions include:
- characterization of data-parallel workloads
- machine-independent application metrics
- framework for performance modeling and prediction
- execution model translation for vector processors
- region-based compilation and scheduling
We evaluate these claims via the development of a novel dynamic compilation framework, GPU Ocelot, with which we execute real-world workloads from GPU computing. The framework enables GPU computing workloads to run efficiently on multicore CPUs, GPUs, and a functional simulator. We show that data-parallel workloads exhibit performance scaling, take advantage of vector instruction set extensions, and effectively exploit data locality via scheduling that attempts to maximize control locality.
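The mismatch between execution models described in this abstract can be illustrated with a minimal sketch of running a GPU-style kernel on a CPU by serializing its thread hierarchy into nested loops. This is not GPU Ocelot's API or its translation technique: the names `launch_on_cpu` and `saxpy_kernel` are hypothetical, and the sketch assumes a kernel with no barriers or communication between threads.

```python
def launch_on_cpu(kernel, grid_dim, block_dim, *args):
    """Run a kernel written for a grid of thread blocks as two nested loops.

    Valid only for kernels whose threads are independent (no barriers,
    no shared state), which is an assumption of this sketch.
    """
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


def saxpy_kernel(block_idx, thread_idx, block_dim, a, x, y, out):
    """Per-thread body: each logical thread computes one output element."""
    i = block_idx * block_dim + thread_idx
    if i < len(x):  # guard for the partially filled last block
        out[i] = a * x[i] + y[i]


x = [1, 2, 3, 4, 5]
y = [10, 10, 10, 10, 10]
out = [0] * len(x)
launch_on_cpu(saxpy_kernel, 2, 4, 2, x, y, out)  # 2 blocks of 4 threads
print(out)  # [12, 14, 16, 18, 20]
```

On a CPU with vector extensions, the inner loop over `thread_idx` is the natural candidate for vectorization, since adjacent logical threads execute the same instruction on adjacent data. Kernels that synchronize threads within a block need a more involved translation than this loop nest.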
