41

A framework for efficient execution on GPU and CPU+GPU systems / Framework pour une exécution efficace sur systèmes GPU et CPU+GPU

Dollinger, Jean-François 01 July 2015
Technological limitations faced by semiconductor manufacturers in the early 2000s curbed the increase in performance of sequential computation units. Nowadays, the trend is to increase the number of processor cores per socket and to progressively use GPU cards for highly parallel computations. The complexity of recent architectures makes it difficult to statically predict the performance of a program.
We describe a reliable and accurate method for predicting the execution time of parallel loop nests on GPUs, based on three stages: static code generation, offline profiling, and online prediction. In addition, we present two techniques to fully exploit the computing resources available on a system. The first technique consists of jointly using the CPU and GPU to execute a code. To achieve higher performance, load balance must be considered, in particular by predicting execution times. The runtime uses the profiling results, and a scheduler computes the execution times and adjusts the load distributed to the processors. The second technique puts the CPU and GPU in competition: instances of the target code are executed simultaneously on the CPU and GPU. The winner of the competition notifies the other instance of its completion, causing the latter to terminate.
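The load-balancing idea described in this abstract can be illustrated with a short sketch. It is not the thesis's scheduler: the function name `split_iterations` and the per-iteration time estimates are hypothetical, and the sketch assumes that execution time grows linearly with the number of iterations assigned to a device.

```python
def split_iterations(total_iters, predicted_time_per_iter):
    """Split a parallel loop so all devices are predicted to finish together.

    predicted_time_per_iter maps a device name to its predicted seconds per
    iteration (hypothetical values standing in for offline profiling data).
    """
    # A device's share is proportional to its predicted throughput (1 / time).
    throughput = {dev: 1.0 / t for dev, t in predicted_time_per_iter.items()}
    total_throughput = sum(throughput.values())

    shares = {}
    assigned = 0
    devices = list(throughput)
    for dev in devices[:-1]:
        shares[dev] = round(total_iters * throughput[dev] / total_throughput)
        assigned += shares[dev]
    # The last device takes the remainder so every iteration is assigned.
    shares[devices[-1]] = total_iters - assigned
    return shares


if __name__ == "__main__":
    # Hypothetical prediction: the GPU is three times faster per iteration.
    print(split_iterations(1000, {"cpu": 3.0, "gpu": 1.0}))
```

A real runtime would presumably refresh the predictions as it runs and re-balance; the sketch only shows the proportional split.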
42

A Multi-core Testbed on Desktop Computer for Research on Power/Thermal Aware Resource Management

Dierivot, Ashley 06 June 2014
Our goal is to develop a flexible, customizable, and practical multi-core testbed based on an Intel desktop computer that can be used to assist theoretical research on power/thermal-aware resource management in the design of computer systems. By integrating different modules, i.e., thread mapping/scheduling, processor/core frequency and voltage variation, temperature/power measurement, and run-time performance collection, into a systematic and unified framework, our testbed can bridge the gap between theoretical study and practical implementation. The effectiveness of our system was validated using appropriately selected benchmarks. The importance of this research is that it complements current theoretical research by validating theoretical results in practical scenarios that are closer to real-world conditions. In addition, by studying the discrepancies between theoretical results and their real-world applications, the research also helps identify new research problems and directions.
43

General Resource Management for Computationally Demanding Scientific Software

Xinchen Guo (13965024) 17 October 2022
Many scientific problems contain nonlinear systems of equations that require multiple iterations to reach converged results. This software pattern follows the bulk synchronous parallel model. In that sense, an iteration is a superstep, which includes computation on local data, global communication to update data for the next iteration, and synchronization between iterations. In modern HPC environments, MPI is used to distribute data and OpenMP is used to accelerate the computation on each portion of the data. More MPI processes increase the cost of communication and synchronization, whereas more OpenMP threads increase the overhead of multithreading. A proper combination of MPI and OpenMP is critical to accelerate each superstep. Proper orchestration of MPI processes and OpenMP threads is also needed to use the underlying hardware resources efficiently.
Purdue’s multi-purpose nanodevice simulation tool NEMO5 distributes the computation of independent spectral points by MPI. The computation of each spectral point is accelerated with OpenMP threads. A few examples of resource utilization optimizations are presented. One type of simulation applies the non-equilibrium Green’s function method to accurately predict drug molecules. Our profiling results suggest that the optimum combination has more MPI processes and fewer OpenMP threads. However, NEMO5's memory usage has large spikes for each spectral point. This behavior limits the concurrency of spectral point calculations, because HPC nodes lack the swap space that would prevent out-of-memory failures.
A distributed resource management framework is proposed and developed to automatically and dynamically manage memory and CPU usage. The concurrent calculation of spectral points is pipelined to avoid simultaneous peak memory usage. This allows more MPI processes and fewer OpenMP threads for higher parallel efficiency. Automatic CPU usage adjustment also reduces the time needed to fill and drain the calculation pipeline. The resource management framework requires minimal code intrusion and successfully speeds up the calculation. It can also be generalized to other simulation software.
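The idea of pipelining concurrent calculations so that their memory peaks do not coincide can be sketched as follows. This is only an illustration under stated assumptions: it uses Python threads and a semaphore as a stand-in for the MPI processes of the actual framework, and the names `run_pipelined`, `solve_point`, and `max_in_peak` are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_pipelined(num_points, workers, max_in_peak):
    """Process points concurrently, but let at most `max_in_peak` of them
    be in their memory-hungry phase at the same time."""
    peak_gate = threading.Semaphore(max_in_peak)
    lock = threading.Lock()
    state = {"in_peak": 0, "max_seen": 0}

    def solve_point(index):
        # Low-memory phase: may overlap freely with other points.
        partial = index * index
        with peak_gate:  # High-memory phase: admission is limited.
            with lock:
                state["in_peak"] += 1
                state["max_seen"] = max(state["max_seen"], state["in_peak"])
            result = partial + 1  # placeholder for the memory-heavy step
            with lock:
                state["in_peak"] -= 1
        return result

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(solve_point, range(num_points)))
    return results, state["max_seen"]


if __name__ == "__main__":
    results, max_seen = run_pipelined(num_points=20, workers=8, max_in_peak=2)
    print(len(results), "points done; peak-phase concurrency never exceeded", max_seen)
```

Gating only the memory-heavy phase lets the number of workers exceed what peak memory alone would allow, which is the trade-off the abstract describes.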
44

HEURISTICS AND EXPERIMENTAL DESIGN FOR FPGA ROUTING ALGORITHMS

GAO, LI 03 December 2001
No description available.
45

Implementation and Evaluation of Proportional Share Scheduler on Linux Kernel 2.6

Srinivasan, Pradeep Kumar 25 April 2008
No description available.
46

Architecture-Aware Mapping and Optimization on Heterogeneous Computing Systems

Daga, Mayank 06 June 2011
The emergence of scientific applications embedded with multiple modes of parallelism has made heterogeneous computing systems indispensable in high-performance computing. The popularity of such systems is evident from the fact that three of the top five fastest supercomputers in the world employ heterogeneous computing, i.e., they use dissimilar computational units. A closer look at the performance of these supercomputers reveals that they achieve only around 50% of their theoretical peak performance. This suggests that applications tuned for erstwhile homogeneous computing may not be efficient on today's heterogeneous systems, and hence novel optimization strategies are required. However, optimizing an application for heterogeneous computing systems is extremely challenging, primarily because of the architectural differences between the computational units in such systems. This thesis is intended as a cookbook for optimizing applications on heterogeneous computing systems that employ graphics processing units (GPUs) as accelerators. We discuss optimization strategies for multicore CPUs as well as for the two popular GPU platforms, i.e., GPUs from AMD and NVIDIA. Optimization strategies for NVIDIA GPUs have been well studied, but when applied to AMD GPUs they fail to measurably improve performance because of differences in the underlying architecture. To the best of our knowledge, this research is the first to propose optimization strategies for AMD GPUs. Even on NVIDIA GPUs, there exists a lesser-known but extremely severe performance pitfall called partition camping, which can degrade application performance by up to seven-fold. To facilitate the detection of this phenomenon, we have developed a performance prediction model that analyzes and characterizes the effect of partition camping in GPU applications.
We used a large-scale molecular modeling application to validate and verify all the optimization strategies. Our results show that, if appropriately optimized, AMD and NVIDIA GPUs can provide 371-fold and 328-fold improvements, respectively, over a hand-tuned, SSE-optimized serial implementation. / Master of Science
47

Resource limiting and accounting facility for FreeBSD

Tomori, Rudolf January 2013
This thesis analyses the implementation of the Linux cgroups subsystems responsible for limiting CPU time and disk I/O throughput. Apart from the Linux cgroups approach, it presents an overview and short analysis of other possible approaches to limiting CPU time and disk I/O throughput. Based on this analysis, the thesis proposes an extension to the resource limiting and accounting framework racct/rctl in the FreeBSD kernel. Our prototype implementation of this extension enables administrators and privileged users to define disk I/O throughput limits and relative CPU time limits for a particular process, user, or FreeBSD jail.
48

RTOS med 1.5K RAM? / RTOS with 1.5K RAM?

Chahine, Sandy, Chowdhury, Selma January 2018
Internet of Things (IoT) is becoming more common in today's society. More and more everyday devices are connected to the wireless network. This requires cost-effective computing power, which means that it can be beneficial to investigate microcontrollers and how they would cope with this task. These can be seen as small, compact computers which, despite their size, offer a lot of performance. This study investigates whether any existing operating system can work with the PIC18F452 microcontroller and how many processes can run in parallel given the MCU's limited memory.
Different methodological choices were investigated and discussed to determine which would generate the best results. A survey and several experiments were conducted to answer these questions. The experiments required installing a special development environment and porting the generic FreeRTOS distribution to both the correct processor and the experiment board. The port succeeded, and the experiments showed that the research question could be answered with a yes: a real-time operating system can run on an MCU with only 1.5 kB of RAM. During the work, the project also found that Amazon had built its IoT initiative on FreeRTOS, although on a more powerful MCU. The initiative thus emphasizes this as a more future-proof direction.
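The question of how many tasks fit in such a small memory comes down to simple budget arithmetic, sketched below. The function `max_tasks` and all byte counts other than the 1536 bytes of RAM are hypothetical placeholders, not measurements from the thesis.

```python
def max_tasks(total_ram, reserved, stack_per_task, tcb_per_task):
    """Return how many tasks fit in RAM after reserving `reserved` bytes.

    Each task is assumed to need its own stack plus a control block; the
    reserved region stands for kernel data and application globals.
    """
    available = total_ram - reserved
    if available <= 0:
        return 0
    return available // (stack_per_task + tcb_per_task)


if __name__ == "__main__":
    # 1536 bytes is 1.5 kB; the other three numbers are made-up examples.
    print(max_tasks(total_ram=1536, reserved=400, stack_per_task=100, tcb_per_task=40))
```

With these example numbers, 1136 bytes remain after the reservation and each task costs 140 bytes, so eight tasks fit.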
49

PROGRAMAÇÃO PARALELA HÍBRIDA PARA CPU E GPU: UMA AVALIAÇÃO DO OPENACC FRENTE A OPENMP E CUDA / HYBRID PARALLEL PROGRAMMING FOR CPU AND GPU: AN EVALUATION OF OPENACC AS RELATED TO OPENMP AND CUDA

Sulzbach, Maurício 22 August 2014
As a consequence of advances in CPU and GPU architectures, recent years have seen a rise in the number of parallel programming APIs for both devices. While OpenMP is used to write parallel programs for the CPU, CUDA and OpenACC are employed for parallel processing on the GPU. For GPU programming, CUDA presents a function-based model that makes the source code long and error-prone, in addition to leading to low development productivity. OpenACC emerged to solve these problems and to be an alternative to CUDA. Similar to OpenMP, this API provides directives that ease the development of parallel applications, but for execution on the GPU. To further increase performance and take advantage of the parallelism of both CPU and GPU, it is possible to develop hybrid algorithms that split the processing between the two devices. In that sense, the main objective of this work is to verify whether the advantages that OpenACC introduces are also reflected positively in hybrid programming with OpenMP, compared to the OpenMP + CUDA model. A second objective is to identify aspects of the two programming models that could limit performance or hinder application development. To accomplish these goals, this work presents the development of three hybrid parallel algorithms based on algorithms from the Rodinia benchmark, namely RNG, Hotspot, and SRAD, using the hybrid models OpenMP + CUDA and OpenMP + OpenACC. In these algorithms, the CPU part of the code is programmed using OpenMP, while CUDA and OpenACC handle the parallel processing on the GPU. After the execution of the hybrid algorithms, the performance, the efficiency, and the division of processing between the devices were analyzed.
The runs of the hybrid algorithms showed that, in both proposed programming models, it was possible to exceed the performance of a parallel application that uses a single API and runs on only one of the devices. In addition, in the hybrid algorithms RNG and Hotspot, CUDA's performance was superior to that of OpenACC, while in the SRAD algorithm OpenACC was faster than CUDA.
50

Jádra schématu lifting pro vlnkovou transformaci / Lifting Scheme Cores for Wavelet Transform

Bařina, David Unknown Date
This work focuses on the efficient computation of the two-dimensional discrete wavelet transform. Current methods are extended in several directions so that they compute this transform in a single pass, possibly at multiple levels, using a compact core. This core can further be suitably reorganized to minimize the use of certain resources. The presented approach fits well with commonly used SIMD extensions, exploits the cache hierarchy of modern processors, and is suitable for parallel computation. Finally, the presented approach is integrated into the JPEG 2000 compression chain, where it proved to be substantially faster than widely used implementations.
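For readers unfamiliar with the lifting scheme named in the title, the sketch below shows one level of the well-known reversible 5/3 (LeGall) wavelet transform in one dimension, written as a predict step followed by an update step. It is the conventional textbook form, not the single-pass two-dimensional cores that the thesis develops; the function names and the restriction to even-length integer signals are assumptions made for brevity.

```python
def forward_53(x):
    """One level of the reversible 5/3 lifting transform (even-length input)."""
    if len(x) < 2 or len(x) % 2:
        raise ValueError("expected an even-length signal with at least 2 samples")
    even, odd = x[0::2], x[1::2]
    half = len(odd)
    # Predict: odd sample minus the floored mean of its two even neighbours.
    # The right neighbour is mirrored at the signal boundary.
    detail = [
        odd[i] - ((even[i] + even[min(i + 1, half - 1)]) >> 1)
        for i in range(half)
    ]
    # Update: even sample plus a rounded quarter of the neighbouring details.
    # The left neighbour is mirrored at the signal boundary.
    approx = [
        even[i] + ((detail[max(i - 1, 0)] + detail[i] + 2) >> 2)
        for i in range(half)
    ]
    return approx, detail


def inverse_53(approx, detail):
    """Undo forward_53 by running the lifting steps in reverse order."""
    half = len(approx)
    even = [
        approx[i] - ((detail[max(i - 1, 0)] + detail[i] + 2) >> 2)
        for i in range(half)
    ]
    odd = [
        detail[i] + ((even[i] + even[min(i + 1, half - 1)]) >> 1)
        for i in range(half)
    ]
    x = [0] * (2 * half)
    x[0::2], x[1::2] = even, odd
    return x


if __name__ == "__main__":
    signal = [3, 7, 1, 8, 2, 9, 4, 6]
    approx, detail = forward_53(signal)
    print(approx, detail, inverse_53(approx, detail) == signal)
```

Because each lifting step only adds a function of the other half of the samples, the inverse subtracts the same quantity, so reconstruction is exact in integer arithmetic.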
