301 |
Application of GPU Computing to Some Urban Traffic Problems
Jradi, Walid Abdala Rfaei, 30 November 2016
The present work studies and proposes GPU-based parallel algorithms and implementations for the problem of macroscopic assignment of urban traffic on large-scale networks, promoting an in-depth investigation of each sub-problem that must be solved efficiently during the traffic assignment process. The main contributions of this work are: 1) the first GPU-based algorithm for the enumeration of chordless cycles; 2) a new parallel GPU-based shortest-path algorithm that takes advantage of common properties of urban traffic networks; 3) a refinement of the parallel reduction implementation proposed by one of the leading GPU vendors, which resulted in a 2.8x speedup over the original version; and 4) a parallel algorithm for the macroscopic traffic assignment problem that is 39x faster than the equivalent sequential approach when applied to large-scale networks.
The main goal of this thesis is to contribute to the extension of the PET-Gyn software, proposing efficient GPU data structures and parallel algorithms for a faster resolution of two well-known problems in the literature: the Traffic Assignment Problem (TAP) and the Enumeration of Chordless Cycles. When applied to difficult input sets, the experiments showed a clear advantage of the parallel algorithms over their sequential versions.
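To make the traffic assignment problem concrete, the sketch below shows the classical Frank-Wolfe equilibrium-assignment loop on a toy two-link network, using an assumed BPR link cost function and illustrative free-flow times and capacities. The thesis's own algorithms are GPU-based and operate on large-scale networks, so this is only a CPU illustration of the underlying problem, not the author's implementation.

```python
import numpy as np

# Toy network: one origin-destination pair connected by two parallel links.
t0 = np.array([10.0, 20.0])    # free-flow travel times (assumed)
cap = np.array([40.0, 60.0])   # link capacities (assumed)
demand = 80.0                  # total trips to assign

def travel_time(x):
    # assumed BPR link cost function t(x) = t0 * (1 + 0.15 * (x / cap)^4)
    return t0 * (1.0 + 0.15 * (x / cap) ** 4)

def beckmann(x):
    # integral of the BPR cost, the objective minimized at user equilibrium
    return np.sum(t0 * (x + 0.15 * x ** 5 / (5.0 * cap ** 4)))

def all_or_nothing(times):
    # load all demand onto the currently cheapest link (the "shortest path")
    y = np.zeros_like(times)
    y[np.argmin(times)] = demand
    return y

x = all_or_nothing(travel_time(np.zeros(2)))        # initial assignment
for it in range(100):
    y = all_or_nothing(travel_time(x))              # direction-finding step
    d = y - x
    alphas = np.linspace(0.0, 1.0, 201)             # crude line search on the step size
    alpha = alphas[np.argmin([beckmann(x + a * d) for a in alphas])]
    x = x + alpha * d
    if alpha * np.abs(d).max() < 1e-6:
        break

print("equilibrium link flows:", x)
print("equilibrium travel times:", travel_time(x))
```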
|
302 |
Développement et implémentation parallèle de méthodes d'interaction de configurations sélectionnées / Development and parallel implementation of selected configuration interaction methods
Garniron, Yann, 03 December 2018
This thesis, whose topic is quantum chemistry algorithms, is set in the context of the paradigm shift of the last decade, in which the usual sequential algorithms are progressively being replaced by parallel equivalents. Indeed, since the increase in processor frequency has run into physical barriers that are difficult to overcome, further growth in computational power is achieved by increasing the number of cores. However, where an increase in frequency mechanically led to faster execution of a code, an increase in the number of cores may run into algorithmic barriers, which may require adapting or even changing the algorithm. Among the methods developed to circumvent this issue are in particular Monte-Carlo (stochastic) methods, which are intrinsically "embarrassingly parallel", meaning they are by construction composed of a large number of independent tasks and are thus particularly well suited to massively parallel architectures. In addition, they are often able to yield an approximate result for a fraction of the cost of the equivalent exact, deterministic computation. During this thesis, massively parallel implementations of several deterministic quantum chemistry algorithms were developed.
These methods are: CIPSI, Davidson diagonalization, computation of the second-order perturbation, shifted-Bk, and Multi-Reference Coupled Cluster. For some of these, a stochastic component was introduced in order to improve their efficiency. All of them were implemented on a distributed task model over TCP, in which a central process distributes tasks over the network and collects the results; as a consequence, slave nodes can be added during the computation from any machine reachable through the Internet. The parallel efficiency of the implemented algorithms has been studied, and the program has enabled numerous applications, in particular obtaining reference energies for difficult molecular systems.
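As an illustration of one of the kernels named above, the following is a minimal, purely sequential numpy sketch of Davidson diagonalization for the lowest eigenpair of a symmetric matrix, using the standard diagonal preconditioner. The matrix, tolerances, and starting guess are illustrative assumptions; this is not the massively parallel implementation described in the thesis.

```python
import numpy as np

def davidson_lowest(A, n_iter=60, tol=1e-8, k_start=2):
    """Davidson iteration for the lowest eigenpair of a symmetric matrix A."""
    n = A.shape[0]
    diag = np.diag(A)
    V = np.zeros((n, k_start))
    for j, idx in enumerate(np.argsort(diag)[:k_start]):
        V[idx, j] = 1.0                      # unit-vector guesses on the smallest diagonals
    theta0, x = diag.min(), V[:, 0]
    for _ in range(n_iter):
        V, _ = np.linalg.qr(V)               # re-orthonormalize the search subspace
        AV = A @ V
        H = V.T @ AV                         # projected (small) matrix
        theta, s = np.linalg.eigh(H)
        theta0, s0 = theta[0], s[:, 0]
        x = V @ s0                           # Ritz vector
        resid = AV @ s0 - theta0 * x         # residual
        if np.linalg.norm(resid) < tol:
            break
        denom = theta0 - diag                # diagonal (Davidson) preconditioner
        denom[np.abs(denom) < 1e-12] = 1e-12
        V = np.hstack([V, (resid / denom)[:, None]])
    return theta0, x

# toy example: a diagonally dominant symmetric matrix
rng = np.random.default_rng(0)
n = 200
A = np.diag(np.arange(1.0, n + 1)) + 1e-2 * rng.standard_normal((n, n))
A = 0.5 * (A + A.T)
lam, vec = davidson_lowest(A)
print(lam, "reference:", np.linalg.eigvalsh(A)[0])
```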
|
303 |
Parallelizing Tabu Search Based Optimization Algorithm on GPUs
Malleypally, Vinaya, 14 March 2018
Many combinatorial optimization problems, such as the traveling salesman problem, the quadratic assignment problem, and flow-shop scheduling, are computationally intractable. Tabu search based simulated annealing is a stochastic search algorithm that is widely used to solve combinatorial optimization problems. Due to excessive run times, there is a strong demand for a parallel version that can be applied to any problem with minimal modifications. Existing advanced and/or parallel versions of tabu search algorithms are specific to the problem at hand, which limits each such implementation to that particular problem. In this work, we propose a parallel version of tabu search based simulated annealing (SA) on the Graphics Processing Unit (GPU) platform. We propose two variants of the algorithm based on where the tabu list is stored (global vs. local). In the first version, the list is stored in global shared memory so that all threads can access it. Multiple random walks in the solution space are carried out, and each walk avoids the moves made in the rest of the walks thanks to its access to the global tabu list, at the expense of more time. In the second version, the list is stored at the block level and is shared only by the threads of that block. Groups of random walks are performed in parallel, and a walk in a group avoids the moves made by the rest of the walks within that group thanks to their shared local tabu list. This version is better than the first in terms of execution time; on the other hand, the first version finds the global optimum more often. We present experimental results for six difficult optimization functions with known global optima. Compared to a CPU implementation with a similar workload, the proposed GPU versions are faster by approximately three orders of magnitude and often find the global optima.
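The sketch below illustrates a single tabu-guided simulated-annealing walk on a toy binary problem, where the tabu list forbids re-flipping recently changed bits. The objective function, tenure, and cooling schedule are illustrative assumptions; in the proposed GPU algorithm many such walks run in parallel and the tabu list is shared either globally or per thread block.

```python
import math, random

random.seed(0)
n = 64
target = [random.randint(0, 1) for _ in range(n)]       # hidden optimum

def cost(x):
    # toy objective: number of bits disagreeing with the hidden target
    return sum(xi != ti for xi, ti in zip(x, target))

def tabu_sa(iters=5000, tenure=8, t_start=2.0, alpha=0.999):
    x = [random.randint(0, 1) for _ in range(n)]
    best, best_cost = x[:], cost(x)
    tabu = {}                      # bit index -> iteration until which it is tabu
    temp = t_start
    for it in range(iters):
        temp *= alpha
        i = random.randrange(n)    # candidate move: flip one bit
        if tabu.get(i, -1) > it:
            continue               # move is tabu, skip it
        cand = x[:]
        cand[i] ^= 1
        delta = cost(cand) - cost(x)
        if delta <= 0 or random.random() < math.exp(-delta / temp):
            x = cand
            tabu[i] = it + tenure  # forbid undoing this move for `tenure` steps
            if cost(x) < best_cost:
                best, best_cost = x[:], cost(x)
    return best, best_cost

print("best cost found:", tabu_sa()[1])
```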
|
304 |
Models for Parallel Computation in Multi-Core, Heterogeneous, and Ultra Wide-Word Architectures
Salinger, Alejandro, January 2013
Multi-core processors have become the dominant processor architecture, with 2, 4, and 8 cores on a chip widely available and an increasing number of cores predicted for the future. In addition, the decreasing costs and increasing programmability of Graphics Processing Units (GPUs) have made these an accessible source of parallel processing power in general purpose computing. Among the many research challenges that this scenario has raised are the fundamental problems related to the theoretical modeling of computation in these architectures. In this thesis we study several aspects of computation in modern parallel architectures, from modeling of computation in multi-cores and heterogeneous platforms, to multi-core cache management strategies, through the proposal of an architecture that exploits bit-parallelism on thousands of bits.
Observing that in practice multi-cores have a small number of cores, we propose a model of low-degree parallelism for these architectures. We argue that assuming a small number of processors (logarithmic in a problem's input size) simplifies the design of parallel algorithms. We show that in this model a large class of divide-and-conquer and dynamic programming algorithms can be parallelized with simple modifications to sequential programs, while achieving optimal parallel speedups. We further explore low-degree parallelism in computation, providing evidence of fundamental differences in practice and theory between systems with a sublinear and a linear number of processors, and suggesting a sharp theoretical gap between the classes of problems that are efficiently parallelizable in each case.
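A minimal sketch of the kind of low-degree parallelization of divide-and-conquer algorithms discussed above: the input is split into a small, fixed number of subproblems that are solved in parallel and combined sequentially. The use of mergesort and of Python's process pool are illustrative choices, not the model's formal construction.

```python
import heapq
from concurrent.futures import ProcessPoolExecutor

def parallel_sort(data, p=4):
    # split into p roughly equal subproblems, solve each sequentially in parallel,
    # then combine with a sequential k-way merge (the "conquer" step)
    chunks = [data[i::p] for i in range(p)]
    with ProcessPoolExecutor(max_workers=p) as ex:
        sorted_chunks = list(ex.map(sorted, chunks))
    return list(heapq.merge(*sorted_chunks))

if __name__ == "__main__":
    import random
    xs = [random.random() for _ in range(100000)]
    assert parallel_sort(xs) == sorted(xs)
    print("parallel mergesort matches the sequential result")
```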
Efficient strategies to manage shared caches play a crucial role in multi-core performance. We propose a model for paging in multi-core shared caches, which extends classical paging to a setting in which several threads share the cache. We show that in this setting traditional cache management policies perform poorly, and that any effective strategy must partition the cache among threads, with a partition that adapts dynamically to the demands of each thread. Inspired by the shared cache setting,
we introduce the minimum cache usage problem, an extension to classical sequential paging in which algorithms must account for the amount of cache they use.
This cache-aware model seeks algorithms with good performance in terms of faults and the amount of cache used, and has applications in energy efficient caching and in shared cache scenarios.
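The toy simulation below illustrates why a shared cache can need partitioning: a thread with a small cyclic working set shares an LRU cache with a scanning thread, and the scan evicts the working set, whereas a demand-aware static split already avoids this. The request streams and cache sizes are illustrative assumptions, and the thesis studies partitions that adapt dynamically rather than this fixed split.

```python
from collections import OrderedDict

def lru_faults(requests, k):
    cache, faults = OrderedDict(), 0
    for p in requests:
        if p in cache:
            cache.move_to_end(p)
        else:
            faults += 1
            if len(cache) >= k:
                cache.popitem(last=False)   # evict the least recently used page
            cache[p] = True
    return faults

k = 6
a = [("A", i % 4) for i in range(200)]      # thread A: small cyclic working set
b = [("B", i) for i in range(200)]          # thread B: a long scan with no reuse
interleaved = [x for pair in zip(a, b) for x in pair]

shared = lru_faults(interleaved, k)                   # one shared LRU cache
partitioned = lru_faults(a, 4) + lru_faults(b, 2)     # demand-aware static split
print("shared cache faults:", shared, " partitioned faults:", partitioned)
```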
The wide availability of GPUs has added to the parallel power of multi-cores; however, most applications underutilize the available resources. We propose a model for hybrid computation in heterogeneous systems with multi-cores and GPUs, and describe strategies for generic parallelization and efficient scheduling of a large class of divide-and-conquer algorithms.
Lastly, we introduce the Ultra-Wide Word architecture and model, an extension of the word-RAM model that allows for constant-time operations on thousands of bits in parallel. We show that a large class of existing algorithms can be
implemented in the Ultra-Wide Word model, achieving speedups comparable to those of multi-threaded computations, while avoiding the more difficult aspects of parallel programming.
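To give a flavor of wide-word bit-parallelism, the sketch below uses Python's arbitrary-precision integers as a stand-in for a very wide word and runs the classic shift-and string-matching algorithm, which advances by one bitwise update per text character regardless of pattern length (up to the word width). This is only an analogy for the Ultra-Wide Word model, not an implementation of it.

```python
def shift_and(text, pattern):
    m = len(pattern)
    # one bitmask per alphabet symbol: bit j is set if pattern[j] == symbol
    masks = {}
    for j, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << j)
    state, hits = 0, []
    for i, c in enumerate(text):
        # single wide-word update per character
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & (1 << (m - 1)):
            hits.append(i - m + 1)          # match ending at position i
    return hits

print(shift_and("abracadabra", "abra"))     # [0, 7]
```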
|
305 |
Pricing a Multi-Asset American Option in a Parallel Environment by a Finite Element Method Approach
Kaya, Deniz, January 2011
There is a need to apply numerical methods to problems that cannot be solved analytically, and as the spatial dimension of the problem is increased, the need for computational resources increases exponentially, a phenomenon known as the “curse of dimensionality”. In the Black-Scholes-Merton framework the American option pricing problem has no closed form solution and a numerical procedure has to be employed for solving a PDE. The multi-asset American option introduces challenging computational problems, since every added asset increases the dimension of the PDE by one. One way to deal with the curse of dimensionality is through parallelism. Here the finite element method-of-lines is used for pricing a multi-asset American option dependent on up to four assets in a parallel environment. The problem is also solved with the PSOR method, giving an accurate benchmark used for comparison. In finance the put option is one of the most fundamental derivatives, since it is basically asset-value insurance, and much research in quantitative finance is devoted to accurate and fast pricing techniques for the multi-dimensional case. “What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.” Norbert Wiener “As soon as an Analytical Engine exists, it will necessarily guide the future course of the science. Whenever any result is sought by its aid, the question will then arise – by what course of calculation can these results be arrived at by the machine in the shortest time?” Charles Babbage
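For reference, here is a one-dimensional sketch of the PSOR benchmark mentioned above: an American put under Black-Scholes, discretized with fully implicit finite differences and solved at each time step as a linear complementarity problem by projected SOR. The strike, volatility, rate, and grid sizes are illustrative assumptions; the thesis itself prices multi-asset options with a finite element method-of-lines in parallel.

```python
import numpy as np

K, r, sigma, T = 100.0, 0.05, 0.3, 1.0     # illustrative option parameters
S_max, M, N = 300.0, 150, 100              # M space intervals, N time steps
dS, dt = S_max / M, T / N
S = np.linspace(0.0, S_max, M + 1)
payoff = np.maximum(K - S, 0.0)            # American put payoff

i = np.arange(1, M)                        # interior node indices
a = 0.5 * dt * (sigma**2 * i**2 - r * i)   # sub-diagonal coefficient
b = 1.0 + dt * (sigma**2 * i**2 + r)       # diagonal coefficient
c = 0.5 * dt * (sigma**2 * i**2 + r * i)   # super-diagonal coefficient

V = payoff.copy()
omega, tol = 1.2, 1e-8                     # SOR relaxation factor and tolerance
for step in range(N):
    rhs = V[1:M].copy()                    # right-hand side from the previous step
    V[0], V[M] = K, 0.0                    # boundary values for the put
    for sweep in range(2000):              # projected SOR iterations
        err = 0.0
        for k in range(M - 1):
            y = (rhs[k] + a[k] * V[k] + c[k] * V[k + 2]) / b[k]
            # SOR relaxation followed by projection onto the payoff constraint
            y = max(payoff[k + 1], V[k + 1] + omega * (y - V[k + 1]))
            err = max(err, abs(y - V[k + 1]))
            V[k + 1] = y
        if err < tol:
            break

print("American put value at S = 100:", np.interp(100.0, S, V))
```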
|
306 |
Gas-Kinetic Methods for 3-D Inviscid and Viscous Flow Solutions on Unstructured/Hybrid Grids
Ilgaz, Murat, 01 February 2007
In this thesis, gas-kinetic methods for inviscid and viscous flow simulations are developed. Initially, the finite volume gas-kinetic methods are investigated for 1-D flows as a preliminary study and are discussed in detail from theoretical and numerical points of view. The preliminary results show that the gas-kinetic methods do not produce any unphysical flow phenomena. Especially the Gas-Kinetic BGK method, which takes into account the particle collisions, predicts compressible flows accurately. The Gas-Kinetic BGK method is then extended for the solution of 2-D and 3-D inviscid and viscous flows on unstructured/hybrid grids. The computations are performed in parallel. Various inviscid and viscous test cases are considered and it is shown that the Gas-Kinetic BGK method predicts both inviscid and viscous flow fields accurately.
The implementation of hybrid grids for viscous flows reduces the overall number
of grid cells while enabling the resolution of boundary layers. The parallel
computations significantly reduce the computation time of the Gas-Kinetic BGK method, which in turn makes the method applicable to practical aerodynamic flow problems.
|
307 |
Accuracy and Efficiency Improvements in Finite Difference Sensitivity Calculations
Ozhamam, Murat, 01 December 2007
The accuracy of finite difference sensitivity calculations is improved by calculating the optimum finite difference interval sizes. In an aerodynamic inverse design algorithm, a compressor cascade geometry is perturbed by shape functions and finite difference sensitivity derivatives of the flow variables are calculated with respect to the base geometry flow variables. The sensitivity derivatives are used in an optimization code and a new airfoil is designed that satisfies the given design characteristics. Accurate sensitivities are needed for the optimization process. A method is investigated to find the optimum finite difference interval size; convergence error estimation techniques for iterative solutions and second derivative estimations are investigated to facilitate this method. For validation of the method, analytical sensitivity calculations of the Euler equations are used and several applications are performed.
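The idea of the optimum interval can be sketched as follows: for a forward difference the total error behaves roughly like h|f''|/2 from truncation plus 2ε/h from the error in evaluating f, and is minimized near h* = 2√(ε/|f''|). In the snippet below f is an analytic stand-in and ε an assumed evaluation error; in the thesis ε comes from the convergence error of the iterative flow solution and f'' from a numerical estimate.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x, eps=1e-6):
    # analytic stand-in for an expensive iterative solution, with an assumed
    # evaluation error of magnitude up to eps (e.g. solver convergence error)
    return np.sin(x) + eps * rng.uniform(-1.0, 1.0)

x0, eps = 1.0, 1e-6
f2 = abs(-np.sin(x0))                     # |f''(x0)|, estimated numerically in practice
h_opt = 2.0 * np.sqrt(eps / f2)           # optimum forward-difference interval
exact = np.cos(x0)

for h in (1e-8, h_opt, 1e-1):
    approx = (f(x0 + h) - f(x0)) / h      # forward-difference sensitivity
    print(f"h = {h:9.2e}   error = {abs(approx - exact):.2e}")
```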
The efficiency of the finite difference sensitivity calculations is improved by parallel computing. The finite difference sensitivity calculations are independent tasks in the inverse aerodynamic design algorithm and can be computed separately. The sensitivity calculations are therefore performed on parallel processors, which decreases the computing time.
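Because each perturbed-geometry evaluation is independent, the parallelization can be sketched as simply farming the evaluations out to a process pool, as below; flow_solve is a hypothetical stand-in for the actual iterative flow solver and the design vector is illustrative.

```python
from concurrent.futures import ProcessPoolExecutor

def flow_solve(design):
    # placeholder objective; in practice this runs the flow solver to convergence
    return sum((x - 0.5) ** 2 for x in design)

def sensitivities(design, h=1e-4):
    base = flow_solve(design)
    perturbed = []
    for i in range(len(design)):
        d = list(design)
        d[i] += h                         # perturb one shape parameter at a time
        perturbed.append(d)
    with ProcessPoolExecutor() as ex:     # independent tasks evaluated in parallel
        values = list(ex.map(flow_solve, perturbed))
    return [(v - base) / h for v in values]

if __name__ == "__main__":
    print(sensitivities([0.1, 0.2, 0.3, 0.4]))
```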
|
308 |
Vessel Segmentation Using Shallow Water Equations
Nar, Fatih, 01 May 2011
This thesis investigates the feasibility of using fluid flow as a deformable model for
segmenting vessels in 2D and 3D medical images. Exploiting fluid flow in vessel
segmentation is biologically plausible since vessels naturally provide the medium for
blood transportation. Fluid flow can be used as a basis for powerful vessel
segmentation because streaming fluid regions can merge and split providing
topological adaptivity. In addition, the fluid can also flow through small gaps formed
by imaging artifacts building connections between disconnected areas. In our study,
due to their simplicity, parallelism, and low computational cost compared to other
fluid simulation methods, linearized shallow water equations (LSWE) are used. The
method developed herein is validated using synthetic data sets, two clinical datasets,
and publicly available simulated datasets which contain Magnetic Resonance
Angiography (MRA) images, Magnetic Resonance Venography (MRV) images and
retinal angiography images. Depending on image size, speedups of one to two orders of magnitude are obtained with the developed parallel implementation using the Nvidia Compute Unified Device Architecture (CUDA), compared to single-core and multi-core CPU implementations.
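A minimal numpy sketch of the linearized shallow water equations that drive the fluid front is given below, integrated with a simple semi-implicit (Euler-Cromer style) update on a periodic grid; the grid size, time step, and initial seed are illustrative assumptions, and the coupling of the flow to image intensities that performs the actual vessel segmentation is the thesis's contribution and is not reproduced here.

```python
import numpy as np

# Linearized shallow water equations:
#   dh/dt = -H * (du/dx + dv/dy),  du/dt = -g * dh/dx,  dv/dt = -g * dh/dy
g, H, dx, dt = 9.81, 1.0, 1.0, 0.05
n = 64
h = np.zeros((n, n))
h[n // 2, n // 2] = 1.0            # initial "drop" seeding the flow
u = np.zeros((n, n))
v = np.zeros((n, n))

def ddx(a):
    return (np.roll(a, -1, axis=1) - np.roll(a, 1, axis=1)) / (2 * dx)

def ddy(a):
    return (np.roll(a, -1, axis=0) - np.roll(a, 1, axis=0)) / (2 * dx)

for step in range(200):
    u -= dt * g * ddx(h)            # update velocities from the current height field
    v -= dt * g * ddy(h)
    h -= dt * H * (ddx(u) + ddy(v)) # then update height from the new velocities

print("height field range:", h.min(), h.max())
```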
|
309 |
Proceedings of the 4th Many-core Applications Research Community (MARC) Symposium
January 2012
In continuation of a successful series of events, the 4th Many-core Applications Research Community (MARC) symposium took place at the HPI in Potsdam on December 8th and 9th 2011. Over 60 researchers from different fields presented their work on many-core hardware architectures, their programming models, and the resulting research questions for the upcoming generation of heterogeneous parallel systems.
|
310 |
Optimistic semantic synchronization
Sreeram, Jaswanth, 06 October 2011
Within the last decade multi-core processors have become increasingly commonplace, with the power and performance demands of modern real-world programs acting to accelerate this trend. The rapid advances in the design and adoption of such architectures mean that there is a serious need for programming models that allow the development of correct parallel programs that execute efficiently on these processors. A principal problem in this regard is that of efficiently synchronizing concurrent accesses to shared memory. Traditional solutions to this problem are either inefficient but programmable (coarse-grained locks) or efficient but not composable and very hard to program and verify (fine-grained locks). Optimistic Transactional Memory systems provide many of the composability and programmability advantages of coarse-grained locks and good theoretical scaling, but several studies have found that their performance in practice for many programs remains quite poor, primarily because of the high overheads of providing safe optimism. Moreover, current transactional memory models remain rigid: they are not suited for expressing some of the complex thread interactions that are prevalent in modern parallel programs. In addition, the synchronization achieved by these transactional memory systems is at the physical or memory level.
This thesis advocates the position that the memory synchronization problem for threads should be modeled and solved in terms of the synchronization of the underlying program values, which have semantics associated with them. It presents optimistic synchronization techniques that address the semantic synchronization requirements of a parallel program instead.
These techniques include methods to 1) enable optimistic transactions to recover from expensive sharing conflicts without discarding all the work made possible by the optimism; 2) enable a hybrid pessimistic-optimistic form of concurrency control that lowers overheads; 3) make synchronization value-aware and semantics-aware; and 4) enable finer-grained consistency rules (than allowed by traditional optimistic TM models), thereby avoiding conflicts that do not enforce any semantic property required by the program. In addition to improving the expressiveness of specific synchronization idioms, all these techniques are also effective in improving parallel performance. This thesis formulates these techniques in terms of their purpose, the extensions to the language, the compiler, and the concurrency control runtime necessary to implement them. It also briefly presents an experimental evaluation of each of them on a variety of modern parallel workloads. These experiments show that these techniques significantly improve parallel performance and scalability over programs using state-of-the-art optimistic synchronization methods.
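The following toy sketch illustrates the flavor of value-aware optimistic synchronization: a transaction records the values it reads, computes its writes without holding a lock, and validates at commit time against the recorded values rather than memory versions, retrying only if a value it depends on actually changed. The API names and the single commit lock are illustrative assumptions, not the runtime described in the thesis.

```python
import threading

shared = {"balance": 100, "bonus": 0}
commit_lock = threading.Lock()

def atomically(transaction, retries=100):
    for _ in range(retries):
        reads = {}                      # snapshot of the values read optimistically
        def read(key):
            reads[key] = shared[key]
            return reads[key]
        writes = transaction(read)      # compute the write set without locking
        with commit_lock:
            # value-based validation: commit only if the values we depended on
            # are still the ones we read
            if all(shared[k] == v for k, v in reads.items()):
                shared.update(writes)
                return True
        # a value we depended on changed; retry the transaction
    return False

def deposit(amount):
    def tx(read):
        return {"balance": read("balance") + amount}
    return tx

threads = [threading.Thread(target=atomically, args=(deposit(1),)) for _ in range(50)]
for t in threads: t.start()
for t in threads: t.join()
print(shared["balance"])                # 150 if all commits succeeded
```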
|