Global ETD Search

21	Stream Computing on FPGAs Plavec, Franjo 01 September 2010 (has links) Field Programmable Gate Arrays (FPGAs) are programmable logic devices used for the implementation of a wide range of digital systems. In recent years, there has been an increasing interest in design methodologies that allow high-level design descriptions to be automatically implemented in FPGAs. This thesis describes the design and implementation of a novel compilation flow that implements circuits in FPGAs from a streaming programming language. The streaming language supported is called FPGA Brook, and is based on the existing Brook and GPU Brook languages, which target streaming multiprocessors and graphics processing units (GPUs), respectively. A streaming language is suitable for targeting FPGAs because it allows system designers to express applications in a way that exposes parallelism, which can then be exploited through parallel hardware implementation. FPGA Brook supports replication, which allows the system designer to trade-off area for performance, by specifying the parts of an application that should be implemented as multiple hardware units operating in parallel, to achieve desired application throughput. Hardware units are interconnected through FIFO buffers, which effectively utilize the small memory modules available in FPGAs. The FPGA Brook design flow uses a source-to-source compiler, and combines it with a commercial behavioural synthesis tool to generate hardware. The source-to-source compiler was developed as a part of this thesis and includes novel algorithms for implementation of complex reductions in FPGAs. The design flow is fully automated and presents a user-interface similar to traditional software compilers. A suite of benchmark applications was developed in FPGA Brook and implemented using our design flow. Experimental results show that applications implemented using our flow achieve much higher throughput than the Nios II soft processor implemented in the same FPGA device. Comparison to the commercial C2H compiler from Altera shows that while simple applications can be effectively implemented using the C2H compiler, complex applications achieve significantly better throughput when implemented by our system. Performance of many applications implemented using our design flow would scale further if a larger FPGA device were used. The thesis demonstrates that using an automated design flow to implement streaming applications in FPGAs is a promising methodology. Field Programmable Gate Array FPGA Streaming Behavioral synthesis Behavioural synthesis Data parallelism Task parallelism Replication Throughput Scalability C2H High-level synthesis Reduction Parallel reduction FIFO Strength reduction 0544 0984
22	Modèles de programmation des applications de traitement du signal et de l'image sur cluster parallèle et hétérogène / Programming models for signal and image processing on parallel and heterogeneous architectures Mansouri, Farouk 14 October 2015 (has links) Depuis une dizaine d'année, l'évolution des machines de calcul tend vers des architectures parallèles et hétérogènes. Composées de plusieurs nœuds connectés via un réseau incluant chacun des unités de traitement hétérogènes, ces grilles offrent de grandes performances. Pour programmer ces architectures, l'utilisateur doit s'appuyer sur des modèles de programmation comme MPI, OpenMP, CUDA. Toutefois, il est toujours difficile d'obtenir à la fois une bonne productivité du programmeur, qui passe par une abstraction des spécificités de l'architecture et performances. Dans cette thèse, nous proposons d'exploiter l'idée qu'un modèle de programmation spécifique à un domaine applicatif particulier permet de concilier ces deux objectifs antagonistes. En effet, en caractérisant une famille d'applications, il est possible d'identifier des abstractions de haut niveau permettant de les modéliser. Nous proposons deux modèles spécifiques au traitement du signal et de l'image sur cluster hétérogène. Le premier modèle est statique. Nous lui apportons une fonctionnalité de migration de tâches. Le second est dynamique, basé sur le support exécutif StarPU. Les deux modèles offrent d'une part un haut niveau d'abstraction en modélisant les applications de traitement du signal et de l'image sous forme de graphe de flot de données et d'autre part, ils permettent d'exploiter efficacement les différents niveaux de parallélisme tâche, données, graphe. Ces deux modèles sont validés par plusieurs implémentations et comparaisons incluant deux applications de traitement de l'image du monde réel sur cluster CPU-GPU. / Since a decade, computing systems evolved to parallel and heterogeneous architectures. Composed of several nodes connected via a network and including heterogeneous processing units, clusters achieve high performances. To program these architectures, the user must rely on programming models such as MPI, OpenMP or CUDA. However, it is still difficult to conciliate productivity provided by abstracting the architectural specificities, and performances. In this thesis, we exploit the idea that a programming model specific to a particular domain of application can achieve these antagonist goals. In fact, by characterizing a family of application, it is possible to identify high level abstractions to efficiently model them. We propose two models specific to the implementation of signal and image processing applications on heterogeneous clusters. The first model is static. We enrich it with a task migration feature. The second model is dynamic, based on the StarPU runtime. Both models offer firstly a high level of abstraction by modeling image and signal applications as a data flow graph and secondly they efficiently exploit task, data and graph parallelisms. We validate these models with different implementations and comparisons including two real-world applications of images processing on a CPU-GPU cluster. Modèle de programmation Modèle de calcul DFG Traitement de signal et d'images Parallel programming models Parallel and heterogeneous cluster DFG model of computation Task parallelism data parallelism Digital signal processing 620
23	Scalable Parallel Machine Learning on High Performance Computing Systems–Clustering and Reinforcement Learning Weijian Zheng (14226626) 08 December 2022 (has links) <p>High-performance computing (HPC) and machine learning (ML) have been widely adopted by both academia and industries to address enormous data problems at extreme scales. While research has reported on the interactions of HPC and ML, achieving high performance and scalability for parallel and distributed ML algorithms is still a challenging task. This dissertation first summarizes the major challenges for applying HPC to ML applications: 1) poor performance and scalability, 2) loss of the convergence rate, 3) lower quality of the trained model, and 4) a lack of performance optimization techniques designed for specific applications. Researchers can address the four challenges in new ML applications. This dissertation shows how to solve them for two specific applications: 1) a clustering algorithm and 2) graph optimization algorithms that use reinforcement learning (RL).</p> <p>As to the clustering algorithm, we first propose an algorithm called the simulated-annealing clustering algorithm. By combining a blocked data layout and asynchronous local optimization within each thread, the simulated-annealing enhanced clustering algorithm has a convergence rate that is comparable to the K-means algorithm but with much higher performance. Experiments with synthetic and real-world datasets show that the simulated-annealing enhanced clustering algorithm is significantly faster than the MPI K-means library using up to 1024 cores. However, the optimization costs (Sum of Square Error (SSE)) of the simulated-annealing enhanced clustering algorithm became higher than the original costs. To tackle this problem, we devise a new algorithm called the full-step feel-the-way clustering algorithm. In the full-step feel-the-way algorithm, there are L local steps within each block of data points. We use the first local step’s results to compute accurate global optimization costs. Our results show that the full-step algorithm can significantly reduce the global number of iterations needed to converge while obtaining low SSE costs. However, the time spent on the local steps is greater than the benefits of the saved iterations. To improve this performance, we next optimize the local step time by incorporating a sampling-based method called reassignment-history-aware sampling. Extensive experiments with various synthetic and real world datasets (e.g., MNIST, CIFAR-10, ENRON, and PLACES-2) show that our parallel algorithms can outperform the fastest open-source MPI K-means implementation by up to 110% on 4,096 CPU cores with comparable SSE costs.</p> <p>Our evaluations of the sampling-based feel-the-way algorithm establish the effectiveness of the local optimization strategy, the blocked data layout, and the sampling methods for addressing the challenges of applying HPC to ML applications. To explore more parallel strategies and optimization techniques, we focus on a more complex application: graph optimization problems using reinforcement learning (RL). RL has proved successful for automatically learning good heuristics to solve graph optimization problems. However, the existing RL systems either do not support graph RL environments or do not support multiple or many GPUs in a distributed setting. This has compromised RL’s ability to solve large scale graph optimization problems due to the lack of parallelization and high scalability. To address the challenges of parallelization and scalability, we develop OpenGraphGym-MG, a high performance distributed-GPU RL framework for solving graph optimization problems. OpenGraphGym-MG focuses on a class of computationally demanding RL problems in which both the RL environment and the policy model are highly computation intensive. In this work, we distribute large-scale graphs across distributed GPUs and use spatial parallelism and data parallelism to achieve scalable performance. We compare and analyze the performance of spatial and data parallelism and highlight their differences. To support graph neural network (GNN) layers that take data samples partitioned across distributed GPUs as input, we design new parallel mathematical kernels to perform operations on distributed 3D sparse and 3D dense tensors. To handle costly RL environments, we design new parallel graph environments to scale up all RL-environment-related operations. By combining the scalable GNN layers with the scalable RL environment, we are able to develop high performance OpenGraphGym-MG training and inference algorithms in parallel.</p> <p>To summarize, after proposing the major challenges for applying HPC to ML applications, this thesis explores several parallel strategies and performance optimization techniques using two ML applications. Specifically, we propose a local optimization strategy, a blocked data layout, and sampling methods for accelerating the clustering algorithm, and we create a spatial parallelism strategy, a parallel graph environment, agent, and policy model, and an optimized replay buffer, and multi-node selection strategy for solving large optimization problems over graphs. Our evaluations prove the effectiveness of these strategies and demonstrate that our accelerations can significantly outperform the state-of-the-art ML libraries and frameworks without loss of quality in trained models.</p> Graph, social and multimedia data Distributed systems and algorithms High performance computing Reinforcement learning High Performance Computing (HPC) Clustering Algorithm Reinforcement Learning combinatorial optimization problems graph problems Travelling salesperson problem Minimum Vertex Cover Problem Distributed processing of data NP-Hard optimization problems model parallelism data parallelism

Page generated in 0.0583 seconds