211

Soluções aproximadas para algoritmos escaláveis de mineração de dados em domínios de dados complexos usando GPGPU / On approximate solutions to scalable data mining algorithms for complex data problems using GPGPU

Mamani, Alexander Victor Ocsa 22 September 2011 (has links)
The increasing availability of data in diverse domains has created the need to develop techniques and methods to discover knowledge in huge volumes of complex data, motivating much research in the databases, data mining, and information retrieval communities. Recent studies suggest that searching in complex data is an important research field, since many data mining tasks, such as classification, clustering, and motif discovery, depend on nearest-neighbor search algorithms. Many deterministic approaches have thus been proposed to solve the nearest-neighbor search problem in complex domains, aiming to reduce the effects of the well-known curse of dimensionality; probabilistic algorithms, on the other hand, have been little explored. Recent techniques relax the quality of query results in order to reduce the computational cost of the search. Moreover, in large-scale problems, an approximate solution with a solid theoretical analysis can be more appropriate than an exact solution with a weak theoretical model. At the same time, even though several exact and approximate search and mining solutions have been proposed, single-CPU architectures impose performance limits on such solutions. One approach to improving the runtime of data mining and information retrieval techniques by orders of magnitude is to employ emerging many-core parallel architectures such as CUDA-enabled GPUs. In this context, this work presents a high-performance kNN query algorithm based on hashing and parallel CUDA implementations. The proposed technique builds on the LSH scheme, i.e., on projections onto subspaces; LSH is an approximate solution with the advantage of sub-linear query cost on high-dimensional data. Using massively parallel implementations, we improve data mining tasks: specifically, we developed high-performance solutions for (soft) real-time time-series motif discovery built on top of the parallel kNN queries. The massively parallel CUDA implementations enabled experimental studies on large real and synthetic datasets. Our performance evaluation on a GeForce GTX 470 GPU achieved average speedups of up to 7x over the state of the art in similarity search and motif discovery.
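As a rough illustration of the hashing scheme this abstract describes, the sketch below shows a sign-random-projection LSH kernel in CUDA: one thread hashes one point by concatenating the sign bits of k random projections, and points that land in the same bucket become candidates for a kNN query. It is a minimal sketch under assumed names and sizes, not the thesis implementation.

```cuda
// Minimal sketch (not the thesis code) of sign-random-projection LSH:
// each thread hashes one point; points sharing a bucket are kNN candidates.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void lshHash(const float* points,   // n x dim, row-major
                        const float* projs,    // k x dim random directions
                        unsigned int* buckets, // one bucket id per point
                        int n, int dim, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned int h = 0;
    for (int j = 0; j < k; ++j) {
        float dot = 0.0f;
        for (int d = 0; d < dim; ++d)            // project point i onto direction j
            dot += points[i * dim + d] * projs[j * dim + d];
        h = (h << 1) | (dot >= 0.0f ? 1u : 0u);  // keep only the sign bit
    }
    buckets[i] = h;                              // equal h means a hash collision
}

int main() {
    const int n = 1 << 16, dim = 64, k = 16;     // illustrative sizes
    float *pts, *prj;
    unsigned int *bkt;
    cudaMallocManaged(&pts, n * dim * sizeof(float));
    cudaMallocManaged(&prj, k * dim * sizeof(float));
    cudaMallocManaged(&bkt, n * sizeof(unsigned int));
    for (int i = 0; i < n * dim; ++i) pts[i] = rand() / (float)RAND_MAX - 0.5f;
    for (int i = 0; i < k * dim; ++i) prj[i] = rand() / (float)RAND_MAX - 0.5f;
    lshHash<<<(n + 255) / 256, 256>>>(pts, prj, bkt, n, dim, k);
    cudaDeviceSynchronize();
    printf("bucket of point 0: %u\n", bkt[0]);
    return 0;
}
```

A query is hashed the same way and ranks only the points in its own bucket, which is what gives the sub-linear query cost the abstract mentions.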
212

Implementações paralelas para os problemas do fecho transitivo e caminho mínimo APSP na GPU / Parallel implementations for transitive closure and minimum path APSP problems in GPU

Gaioso, Roussian Di Ramos Alves 08 August 2014 (has links)
Conselho Nacional de Pesquisa e Desenvolvimento Científico e Tecnológico - CNPq / This work presents Graphics Processing Unit (GPU)-based parallel implementations of the all-pairs shortest paths (APSP) and transitive closure problems on graphs. The implementations are based on the main sequential algorithms and take full advantage of the highly multithreaded architecture of current many-core GPUs. Our solutions reduce communication between the CPU and the GPU, improve the utilization of the Streaming Multiprocessors (SMs), and make intensive use of coalesced memory accesses to optimize access to the graph data. The advantages of the proposed implementations are demonstrated on several graphs randomly generated with the widely known graph library GTgraph. Graphs containing thousands of vertices and different edge densities, varying from sparse to complete graphs, were generated and used in the experiments. Our results confirm that GPU implementations can be competitive even for graph algorithms whose memory accesses and work distribution are both irregular and data-dependent.
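For context, the simplest GPU formulation of the APSP problem discussed above is a Floyd-Warshall sweep with one kernel launch per pivot k, where consecutive threads touch consecutive matrix entries so global-memory accesses coalesce. The minimal sketch below (not the dissertation's code; sizes and edges are toy assumptions) shows that pattern.

```cuda
// Toy GPU Floyd-Warshall: one launch per pivot k; consecutive j gives
// coalesced row accesses. Illustrative only, not the dissertation's code.
#include <cstdio>
#include <cuda_runtime.h>

#define INF 1e9f

__global__ void fwStep(float* d, int n, int k) {
    int i = blockIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // consecutive j: coalesced
    if (j >= n) return;
    float via = d[i * n + k] + d[k * n + j];        // path i -> k -> j
    if (via < d[i * n + j]) d[i * n + j] = via;
}

int main() {
    const int n = 512;
    float* d;
    cudaMallocManaged(&d, n * n * sizeof(float));
    for (int i = 0; i < n * n; ++i) d[i] = INF;
    for (int i = 0; i < n; ++i) d[i * n + i] = 0.0f;
    d[0 * n + 1] = 1.0f; d[1 * n + 2] = 2.0f;       // toy edges 0->1->2
    dim3 grid((n + 255) / 256, n);
    for (int k = 0; k < n; ++k)                     // pivot loop on the host
        fwStep<<<grid, 256>>>(d, n, k);
    cudaDeviceSynchronize();
    printf("dist(0,2) = %g\n", d[0 * n + 2]);       // expect 3
    return 0;
}
```

Transitive closure follows the same sweep with booleans (reach[i][j] |= reach[i][k] && reach[k][j]); blocked variants used in practice improve locality but keep this structure.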
213

A framework for efficient execution on GPU and CPU+GPU systems / Framework pour une exécution efficace sur systèmes GPU et CPU+GPU

Dollinger, Jean-François 01 July 2015 (has links)
Technological limitations faced by semiconductor manufacturers in the early 2000s halted the rapid performance growth of sequential computation units. The current trend is to increase the number of processor cores per socket and to progressively use GPU cards for highly parallel computations. The complexity of recent architectures makes it difficult to statically estimate a program's performance. We describe a reliable and accurate method for predicting the execution time of parallel loop nests on GPUs, based on three stages: static code generation, offline profiling, and online prediction. In addition, we present two techniques to fully exploit all the computing resources available in a system. The first consists in jointly using the CPUs and the GPU to execute a code; to preserve performance, load balance must be considered, notably by predicting execution times. The runtime uses the profiling results, and a scheduler computes execution times and adjusts the load distributed to the processors. The second technique puts the CPU and the GPU in competition: instances of the target code are executed simultaneously on the CPU and the GPU, and the winner of the competition notifies the other instance of its completion, triggering its termination.
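The joint CPU+GPU execution described here hinges on a simple load-balancing rule: if offline profiling predicts a per-iteration throughput for each device, split the iteration range so that both devices are predicted to finish at the same time. The toy sketch below illustrates that rule; the rates and function names are assumptions, not the framework's API.

```cuda
// Toy illustration of equal-finish-time load splitting between CPU and GPU.
// Rates would come from offline profiling; everything here is invented.
#include <cstdio>

long splitForGpu(long total, double cpuRate, double gpuRate) {
    // Give the GPU a share proportional to its predicted rate so that
    // gpuPart / gpuRate == cpuPart / cpuRate (equal predicted finish times).
    return (long)(total * gpuRate / (cpuRate + gpuRate));
}

int main() {
    long n = 1000000;                               // loop iterations to split
    double cpuRate = 2.0e6, gpuRate = 1.4e7;        // assumed iterations/second
    long gpuPart = splitForGpu(n, cpuRate, gpuRate);
    printf("GPU: %ld iterations, CPU: %ld iterations\n", gpuPart, n - gpuPart);
    printf("predicted times: %.3fs vs %.3fs\n",
           gpuPart / gpuRate, (n - gpuPart) / cpuRate);
    return 0;
}
```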
214

Valorisation d’options américaines et Value At Risk de portefeuille sur cluster de GPUs/CPUs hétérogène / American option pricing and computation of the portfolio Value at risk on heterogeneous GPU-CPU cluster

Benguigui, Michaël 27 August 2015 (has links)
The research work described in this thesis aims at speeding up the pricing of complex financial instruments, such as an American option on a basket of realistic size (e.g., 40 underlying assets), by leveraging the parallel processing power of Graphics Processing Units (GPUs). To this end, we start from a previous work that distributed J. Picazo's pricing algorithm, based on Monte Carlo simulation and machine learning. We propose an adaptation of this distributed algorithm to take advantage of a single GPU, which halves the computation time of the previous version distributed over a 64-core CPU cluster when pricing an American option on 40 assets. Still, pricing this realistic-size option requires a few hours of computation. We therefore extend this first contribution to target a heterogeneous cluster of devices, mixing GPUs and CPUs programmed with OpenCL. This strongly accelerates the pricing time, even though the training of the various classification methods we experimented with (AdaBoost, SVM) remains centralized and thus constitutes a bottleneck. To remedy this, we evaluate a distributed classification method based on random forests, making our approach scalable. The last part reuses these two contributions to compute the Value at Risk of a portfolio of options on a heterogeneous hybrid cluster of GPUs and CPUs.
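The thesis parallelizes Picazo's classification-based algorithm for American options; the sketch below shows only the generic Monte Carlo core on a GPU, here pricing a plain European call under Black-Scholes with invented parameters, to illustrate the kind of embarrassingly parallel simulation being accelerated. It is not the thesis algorithm.

```cuda
// Minimal GPU Monte Carlo sketch: European call under Black-Scholes.
// All parameters are assumptions; the thesis prices American options.
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void mcCall(float* sum, int pathsPerThread, float s0, float k,
                       float r, float sigma, float t, unsigned long long seed) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState st;
    curand_init(seed, tid, 0, &st);                // independent stream per thread
    float acc = 0.0f;
    for (int p = 0; p < pathsPerThread; ++p) {
        float z = curand_normal(&st);              // one Gaussian draw per path
        float sT = s0 * expf((r - 0.5f * sigma * sigma) * t
                             + sigma * sqrtf(t) * z);
        acc += fmaxf(sT - k, 0.0f);                // call payoff
    }
    atomicAdd(sum, acc);                           // crude reduction, fine for a sketch
}

int main() {
    const int threads = 256, blocks = 64, ppt = 100;
    float* sum;
    cudaMallocManaged(&sum, sizeof(float));
    *sum = 0.0f;
    mcCall<<<blocks, threads>>>(sum, ppt, 100.f, 100.f, 0.05f, 0.2f, 1.f, 42ULL);
    cudaDeviceSynchronize();
    long n = (long)threads * blocks * ppt;
    printf("price ~ %.4f\n", expf(-0.05f * 1.f) * (*sum) / n);  // ~10.45
    return 0;
}
```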
216

Scalable critical-path analysis and optimization guidance for hybrid MPI-CUDA applications

Schmitt, Felix, Dietrich, Robert, Juckeland, Guido 29 October 2019 (has links)
The use of accelerators in heterogeneous systems is an established approach to designing petascale applications. Today, the Compute Unified Device Architecture (CUDA) offers a rich programming interface for GPU accelerators but requires developers to incorporate several layers of parallelism on both the CPU and the GPU. From this increasing program complexity emerges the need for sophisticated performance tools. This work contributes by analyzing hybrid MPI-CUDA programs for wait-state-based properties such as the critical path, a metric proven to identify application bottlenecks effectively. We developed a tool that constructs a dependency graph from an execution trace and the inherent dependencies of the CUDA and Message Passing Interface (MPI) programming models. It then detects wait states and attributes blame to the responsible activities. Combined with the property of being on the critical path, this identifies the activities that are the most promising optimization targets. To evaluate the global impact of optimizing critical activities, we predict the program execution using a graph-based performance projection. The developed approach has been demonstrated on suitable examples to be both scalable and correct. Furthermore, we establish a new categorization of CUDA inefficiency patterns arising from the dependencies between CUDA activities.
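On a trace-derived dependency graph such as the one the tool constructs, the critical path is the longest path through activity durations, computed in topological order; activities on it are the ones worth optimizing. The sketch below works a toy example of that computation; the graph and durations are invented, not from the paper.

```cuda
// Toy critical-path computation on a dependency DAG (host-only code).
// Nodes are given in topological order for brevity; all values are invented.
#include <cstdio>
#include <vector>
#include <algorithm>

int main() {
    int n = 5;
    std::vector<double> dur = {2.0, 4.0, 1.0, 3.0, 2.0};        // activity durations
    std::vector<std::vector<int>> succ = {{1, 2}, {3}, {3}, {4}, {}};  // v waits on u
    std::vector<double> finish(n, 0.0);                         // earliest finish times
    std::vector<int> pred(n, -1);                               // critical predecessor
    for (int u = 0; u < n; ++u) {
        finish[u] += dur[u];                                    // start + own duration
        for (int v : succ[u])                                   // push finish to successors
            if (finish[u] > finish[v]) { finish[v] = finish[u]; pred[v] = u; }
    }
    int end = (int)(std::max_element(finish.begin(), finish.end())
                    - finish.begin());
    printf("makespan = %.1f, critical path (reversed):", finish[end]);
    for (int v = end; v != -1; v = pred[v]) printf(" %d", v);   // 4 3 1 0
    printf("\n");
    return 0;
}
```

Blame attribution then charges the wait time of each off-path activity to the on-path activity it waited for, which is what makes the metric actionable.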
217

Architectures massivement parallèles et vision artificielle bas-niveau / Massively parallel architectures and low-level computer vision

Plyer, Aurélien 20 February 2013 (has links) (PDF)
This thesis studies what massively parallel computing architectures bring to low-level vision. We review the recent evolution of computer architecture, highlighting the massively parallel solutions that have recently become dominant: GPUs. Exploiting the potential of these architectures requires a change in programming methods. We show that a small number of computational patterns can be used to solve a large number of low-level vision problems, and we then present a new model for estimating the complexity of these solutions. The rest of the work applies these programming models to low-level vision problems. We first address optical flow computation, i.e., the displacement field from one image to the next, whose estimation is a basic building block of many video processing applications. We present a GPU code, named FOLKI, that achieves very good result quality on real sequences at a much lower computation time than current competing solutions. An important application of this work is particle image velocimetry in experimental fluid mechanics. The second problem addressed is super-resolution (SR). We first propose a very fast SR algorithm that uses the FOLKI optical flow to register the images. Then, several solutions of increasing computational cost are developed, improving accuracy and robustness. We present highly original SR results on sequences affected by complex motion, such as pedestrian sequences or aerial sequences of moving vehicles. Finally, the last chapter briefly discusses ongoing extensions of our work to 3D measurement contexts, in fields such as experimental physics and robotics.
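FOLKI belongs to the family of iterative window-based Lucas-Kanade methods, which map naturally to one GPU thread per pixel. The heavily simplified single-iteration sketch below (fixed image size, no pyramid, no iteration loop, synthetic input) only illustrates the data-parallel pattern; it is not the FOLKI code.

```cuda
// Simplified one-shot Lucas-Kanade flow: one thread per pixel solves a 2x2
// least-squares system over a small window. Not FOLKI; sizes are assumptions.
#include <cstdio>
#include <cuda_runtime.h>

#define W 64
#define R 2   // window radius

__global__ void lkFlow(const float* i0, const float* i1, float* u, float* v) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < R + 1 || y < R + 1 || x >= W - R - 1 || y >= W - R - 1) return;
    float a = 0, b = 0, c = 0, p = 0, q = 0;         // normal-equation terms
    for (int dy = -R; dy <= R; ++dy)
        for (int dx = -R; dx <= R; ++dx) {
            int i = (y + dy) * W + (x + dx);
            float ix = 0.5f * (i0[i + 1] - i0[i - 1]);   // central differences
            float iy = 0.5f * (i0[i + W] - i0[i - W]);
            float it = i1[i] - i0[i];                    // temporal difference
            a += ix * ix; b += ix * iy; c += iy * iy;
            p += ix * it; q += iy * it;
        }
    float det = a * c - b * b;
    if (fabsf(det) < 1e-6f) return;                  // ill-conditioned window
    u[y * W + x] = -(c * p - b * q) / det;           // solve the 2x2 system
    v[y * W + x] = -(a * q - b * p) / det;
}

int main() {
    float *i0, *i1, *u, *v;
    cudaMallocManaged(&i0, W * W * sizeof(float));
    cudaMallocManaged(&i1, W * W * sizeof(float));
    cudaMallocManaged(&u, W * W * sizeof(float));
    cudaMallocManaged(&v, W * W * sizeof(float));
    for (int y = 0; y < W; ++y)
        for (int x = 0; x < W; ++x) {                // paraboloid shifted 1 px in x
            i0[y * W + x] = 0.5f * (x * x + y * y);
            i1[y * W + x] = 0.5f * ((x - 1) * (x - 1) + y * y);
        }
    dim3 b(16, 16), g(W / 16, W / 16);
    lkFlow<<<g, b>>>(i0, i1, u, v);
    cudaDeviceSynchronize();
    printf("flow at center: u=%.2f v=%.2f (expect ~1, ~0)\n",
           u[32 * W + 32], v[32 * W + 32]);
    return 0;
}
```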
218

Throughput-oriented analytical models for performance estimation on programmable hardware accelerators

Lai, Junjie 15 February 2013 (has links) (PDF)
In this thesis, we have worked mainly on two topics in GPU performance analysis. First, we developed an analytical method and a timing estimation tool (TEG) to predict the performance of CUDA applications on GT200-generation GPUs; TEG predicts a GPU application's performance at a cycle-approximate level. Second, we developed an approach to estimate a GPU application's performance upper bound, based on application analysis and assembly-code-level benchmarking. With the performance upper bound of an application, we know how much optimization headroom remains and can decide how much optimization effort is worthwhile; the analysis also shows which parameters are critical to performance.
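A performance upper bound of this kind can be pictured with a back-of-the-envelope roofline argument: a kernel can run no faster than the slower of its compute demand and its memory-traffic demand at the hardware's peak rates. The numbers below are assumptions for illustration, not TEG's model.

```cuda
// Roofline-style bound: best-case kernel time is max(compute, memory) time.
// All figures are invented; real bounds come from assembly-level benchmarks.
#include <cstdio>
#include <algorithm>

int main() {
    double flops = 2.0 * (1 << 20) * 64;       // assumed arithmetic work (FLOPs)
    double bytes = 3.0 * (1 << 20) * 4;        // assumed DRAM traffic (bytes)
    double peakFlops = 1.0e12;                 // assumed peak compute (FLOP/s)
    double peakBw = 150.0e9;                   // assumed peak bandwidth (B/s)
    double tCompute = flops / peakFlops;
    double tMemory = bytes / peakBw;
    double bestTime = std::max(tCompute, tMemory);  // time lower bound
    printf("performance upper bound reached at %.3f ms (%s-bound)\n",
           bestTime * 1e3, tCompute > tMemory ? "compute" : "memory");
    return 0;
}
```

The gap between a kernel's measured time and this bound is the "optimization space left" the abstract refers to.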
219

Out-of-Core Multi-Resolution Volume Rendering of Large Data Sets

Lundell, Fredrik January 2011 (has links)
Modality devices can today capture high-resolution volumetric data sets, and as data resolutions increase, so do the challenges of processing volumetric data through a visualization pipeline. Standard volume rendering pipelines often use a graphics processing unit (GPU) to accelerate rendering by taking advantage of the parallel architecture of such devices. Unfortunately, graphics cards have limited amounts of video memory (VRAM), causing a bottleneck in a standard pipeline. Multi-resolution techniques can efficiently modify the rendering pipeline, allowing each sub-domain within the volume to be represented at a different resolution. The active resolution distribution is temporarily stored in VRAM for rendering, while the inactive parts are stored in secondary memory layers such as system RAM or on disk. The active resolution set can be optimized to produce high-quality renders while minimizing the amount of storage required; this is done with a dynamic compression scheme that optimizes visual quality by evaluating user-input data. The optimized resolution of each sub-domain is then streamed on demand to VRAM from the secondary memory layers. Rendering a multi-resolution data set requires extra care at the boundaries between sub-domains: to avoid artifacts, an intrablock interpolation (II) sampling scheme capable of creating smooth transitions between sub-domains at arbitrary resolutions can be used. The result is a highly optimized rendering pipeline, complemented by a preprocessing pipeline, together capable of rendering large volumetric data sets in real time.
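One way to picture the resolution-selection step is as a budgeted greedy choice: upgrade the bricks whose predicted visual-quality gain per byte is highest until the VRAM budget is spent. The sketch below invents the quality scores and sizes purely for illustration; the thesis derives them from its user-driven dynamic compression scheme.

```cuda
// Invented illustration of budgeted brick-resolution selection (host-only).
#include <cstdio>
#include <vector>
#include <algorithm>

struct Brick { int id; double gain; size_t bytes; };  // gain: quality score

int main() {
    size_t budget = 24;                        // toy VRAM budget (bytes)
    std::vector<Brick> bricks = {
        {0, 9.0, 8}, {1, 4.0, 8}, {2, 6.0, 4}, {3, 1.0, 8}};
    std::sort(bricks.begin(), bricks.end(), [](const Brick& a, const Brick& b) {
        return a.gain / a.bytes > b.gain / b.bytes;   // best gain density first
    });
    size_t used = 0;
    for (const Brick& b : bricks)
        if (used + b.bytes <= budget) {        // stream this brick to VRAM
            used += b.bytes;
            printf("load brick %d at high resolution (%zu/%zu bytes)\n",
                   b.id, used, budget);
        }                                      // the rest stay at low resolution
    return 0;
}
```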
220

GPU-accelerated Model Checking of Periodic Self-Suspending Real-Time Tasks

Liberg, Tim, Måhl, Per-Erik January 2012 (has links)
Efficient model checking is important for making this type of software verification useful for structurally complex systems. If a system is too large or complex, model checking simply does not scale, i.e., verifying the system can take too much time. This is one strong argument for making model checking faster. Another interesting aim is to make model checking so fast that it can be used to predict scheduling decisions for real-time schedulers at runtime; this, of course, requires model checking to complete within milliseconds or even microseconds. The aim is set very high, but the results of this thesis will at least give a hint as to whether this seems possible. The magic card for (maybe) making this possible is the Graphics Processing Unit (GPU). This thesis investigates whether and how a model checking algorithm can be ported to and executed on a GPU. Modern GPU architectures offer a high degree of processing power, since they are equipped with up to 1000 (NVIDIA GTX 590) or 3000 (NVIDIA Tesla K10) processor cores. The drawback is that, compared to CPUs, they offer poor thread-communication facilities and memory caches, which makes it very difficult to port CPU programs to GPUs. The example model (system) used in this thesis represents a real-time task scheduler that can schedule up to three periodic self-suspending tasks. The aim is to verify these tasks, i.e., find a feasible schedule for them, and to do so as fast as possible with the help of the GPU.
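At its core, explicit-state model checking reduces to reachability over a state graph, which on a GPU is commonly expressed as a level-synchronous breadth-first search. The minimal CSR-based sketch below shows that exploration pattern on a toy graph; it does not model the scheduler's state space from the thesis.

```cuda
// Level-synchronous BFS on a CSR graph: the core exploration pattern of
// explicit-state reachability on GPU. Graph and sizes are toy assumptions.
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

__global__ void bfsLevel(const int* rowPtr, const int* col, int* dist,
                         int n, int level, int* changed) {
    int u = blockIdx.x * blockDim.x + threadIdx.x;
    if (u >= n || dist[u] != level) return;        // u is on the current frontier
    for (int e = rowPtr[u]; e < rowPtr[u + 1]; ++e) {
        int v = col[e];
        if (dist[v] == -1) { dist[v] = level + 1; *changed = 1; }  // benign race
    }
}

int main() {
    // Tiny CSR graph: 0->1, 0->2, 1->3, 2->3, 3->4
    int h_row[] = {0, 2, 3, 4, 5, 5}, h_col[] = {1, 2, 3, 3, 4};
    int n = 5, *row, *col, *dist, *chg;
    cudaMallocManaged(&row, sizeof(h_row));
    cudaMallocManaged(&col, sizeof(h_col));
    cudaMallocManaged(&dist, n * sizeof(int));
    cudaMallocManaged(&chg, sizeof(int));
    memcpy(row, h_row, sizeof(h_row));
    memcpy(col, h_col, sizeof(h_col));
    for (int i = 0; i < n; ++i) dist[i] = -1;      // -1: not yet reached
    dist[0] = 0;                                   // initial state
    for (int level = 0;; ++level) {
        *chg = 0;
        bfsLevel<<<(n + 255) / 256, 256>>>(row, col, dist, n, level, chg);
        cudaDeviceSynchronize();
        if (!*chg) break;                          // fixed point: all reachable
    }
    printf("distance to state 4: %d\n", dist[4]);  // expect 3
    return 0;
}
```

In a model checker, "states" would be scheduler configurations and edges their transitions; the hard parts the thesis tackles, state encoding and work distribution, sit on top of this pattern.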
