301 |
Application of GPU Computing to Some Urban Traffic Problems
Jradi, Walid Abdala Rfaei, 30 November 2016
The present work studies and proposes GPU-based parallel algorithms and implementations for the problem of macroscopic assignment of urban traffic on large-scale networks, promoting an in-depth investigation of each sub-problem that must be solved efficiently during the traffic assignment process. The main contributions of this work are: 1) the first GPU-based algorithm for the enumeration of chordless cycles; 2) a new parallel GPU-based shortest-path algorithm that takes advantage of common properties of urban traffic networks; 3) a refinement of the parallel reduction implementation proposed by one of the leading GPU vendors, which resulted in a 2.8x speedup over the original version; and 4) a parallel algorithm for the macroscopic traffic assignment problem that is 39x faster than the equivalent sequential approach when applied to large-scale networks.
The main goal of this thesis is to contribute to the extension of the PET-Gyn software, proposing efficient GPU data structures and parallel algorithms for a faster resolution of two well-known problems in the literature: the Traffic Assignment Problem (TAP) and the Enumeration of Chordless Cycles. When applied to difficult input sets, the experiments showed a clear advantage of the parallel algorithms over their sequential versions.
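To make the traffic assignment problem concrete, the sketch below shows the classical Frank-Wolfe equilibrium-assignment loop on a toy two-link network, using an assumed BPR link cost function and illustrative free-flow times and capacities. The thesis's own algorithms are GPU-based and operate on large-scale networks, so this is only a CPU illustration of the underlying problem, not the author's implementation.

```python
import numpy as np

# Toy network: one origin-destination pair connected by two parallel links.
t0 = np.array([10.0, 20.0])    # free-flow travel times (assumed)
cap = np.array([40.0, 60.0])   # link capacities (assumed)
demand = 80.0                  # total trips to assign

def travel_time(x):
    # assumed BPR link cost function t(x) = t0 * (1 + 0.15 * (x / cap)^4)
    return t0 * (1.0 + 0.15 * (x / cap) ** 4)

def beckmann(x):
    # integral of the BPR cost, the objective minimized at user equilibrium
    return np.sum(t0 * (x + 0.15 * x ** 5 / (5.0 * cap ** 4)))

def all_or_nothing(times):
    # load all demand onto the currently cheapest link (the "shortest path")
    y = np.zeros_like(times)
    y[np.argmin(times)] = demand
    return y

x = all_or_nothing(travel_time(np.zeros(2)))        # initial assignment
for it in range(100):
    y = all_or_nothing(travel_time(x))              # direction-finding step
    d = y - x
    alphas = np.linspace(0.0, 1.0, 201)             # crude line search on the step size
    alpha = alphas[np.argmin([beckmann(x + a * d) for a in alphas])]
    x = x + alpha * d
    if alpha * np.abs(d).max() < 1e-6:
        break

print("equilibrium link flows:", x)
print("equilibrium travel times:", travel_time(x))
```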
|
302 |
Développement et implémentation parallèle de méthodes d'interaction de configurations sélectionnées / Development and parallel implementation of selected configuration interaction methods
Garniron, Yann, 03 December 2018
This thesis, whose topic is quantum chemistry algorithms, is set in the context of the paradigm shift of the last decade, in which the usual sequential algorithms are progressively being replaced by parallel equivalents. Indeed, since the increase in processor frequency has run into physical barriers that are difficult to overcome, further growth in computational power is achieved by increasing the number of cores. However, where an increase in frequency mechanically led to faster execution of a code, an increase in the number of cores may run into algorithmic barriers, which may require adapting or even changing the algorithm. Among the methods developed to circumvent this issue are in particular Monte-Carlo (stochastic) methods, which are intrinsically "embarrassingly parallel", meaning they are by construction composed of a large number of independent tasks and are thus particularly well suited to massively parallel architectures. In addition, they are often able to yield an approximate result for a fraction of the cost of the equivalent exact, deterministic computation. During this thesis, massively parallel implementations of several deterministic quantum chemistry algorithms were developed.
These methods are: CIPSI, Davidson diagonalization, computation of the second-order perturbation, shifted-Bk, and Multi-Reference Coupled Cluster. For some of these, a stochastic component was introduced in order to improve their efficiency. All of them were implemented on a distributed task model over TCP, in which a central process distributes tasks over the network and collects the results; as a consequence, slave nodes can be added during the computation from any machine reachable through the Internet. The parallel efficiency of the implemented algorithms has been studied, and the program has enabled numerous applications, in particular obtaining reference energies for difficult molecular systems.
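As an illustration of one of the kernels named above, the following is a minimal, purely sequential numpy sketch of Davidson diagonalization for the lowest eigenpair of a symmetric matrix, using the standard diagonal preconditioner. The matrix, tolerances, and starting guess are illustrative assumptions; this is not the massively parallel implementation described in the thesis.

```python
import numpy as np

def davidson_lowest(A, n_iter=60, tol=1e-8, k_start=2):
    """Davidson iteration for the lowest eigenpair of a symmetric matrix A."""
    n = A.shape[0]
    diag = np.diag(A)
    V = np.zeros((n, k_start))
    for j, idx in enumerate(np.argsort(diag)[:k_start]):
        V[idx, j] = 1.0                      # unit-vector guesses on the smallest diagonals
    theta0, x = diag.min(), V[:, 0]
    for _ in range(n_iter):
        V, _ = np.linalg.qr(V)               # re-orthonormalize the search subspace
        AV = A @ V
        H = V.T @ AV                         # projected (small) matrix
        theta, s = np.linalg.eigh(H)
        theta0, s0 = theta[0], s[:, 0]
        x = V @ s0                           # Ritz vector
        resid = AV @ s0 - theta0 * x         # residual
        if np.linalg.norm(resid) < tol:
            break
        denom = theta0 - diag                # diagonal (Davidson) preconditioner
        denom[np.abs(denom) < 1e-12] = 1e-12
        V = np.hstack([V, (resid / denom)[:, None]])
    return theta0, x

# toy example: a diagonally dominant symmetric matrix
rng = np.random.default_rng(0)
n = 200
A = np.diag(np.arange(1.0, n + 1)) + 1e-2 * rng.standard_normal((n, n))
A = 0.5 * (A + A.T)
lam, vec = davidson_lowest(A)
print(lam, "reference:", np.linalg.eigvalsh(A)[0])
```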
|
303 |
Parallelizing Tabu Search Based Optimization Algorithm on GPUs
Malleypally, Vinaya, 14 March 2018
Many combinatorial optimization problems, such as the traveling salesman problem, the quadratic assignment problem, and flow-shop scheduling, are computationally intractable. Tabu search based simulated annealing is a stochastic search algorithm that is widely used to solve combinatorial optimization problems. Due to excessive run times, there is a strong demand for a parallel version that can be applied to any problem with minimal modifications. Existing advanced and/or parallel versions of tabu search algorithms are specific to the problem at hand, which limits each such implementation to that particular problem. In this work, we propose a parallel version of tabu search based simulated annealing (SA) on the Graphics Processing Unit (GPU) platform. We propose two variants of the algorithm based on where the tabu list is stored (global vs. local). In the first version, the list is stored in global shared memory so that all threads can access it. Multiple random walks in the solution space are carried out, and each walk avoids the moves made in the rest of the walks thanks to its access to the global tabu list, at the expense of more time. In the second version, the list is stored at the block level and is shared only by the threads of that block. Groups of random walks are performed in parallel, and a walk in a group avoids the moves made by the rest of the walks within that group thanks to their shared local tabu list. This version is better than the first in terms of execution time; on the other hand, the first version finds the global optimum more often. We present experimental results for six difficult optimization functions with known global optima. Compared to a CPU implementation with a similar workload, the proposed GPU versions are faster by approximately three orders of magnitude and often find the global optima.
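The sketch below illustrates a single tabu-guided simulated-annealing walk on a toy binary problem, where the tabu list forbids re-flipping recently changed bits. The objective function, tenure, and cooling schedule are illustrative assumptions; in the proposed GPU algorithm many such walks run in parallel and the tabu list is shared either globally or per thread block.

```python
import math, random

random.seed(0)
n = 64
target = [random.randint(0, 1) for _ in range(n)]       # hidden optimum

def cost(x):
    # toy objective: number of bits disagreeing with the hidden target
    return sum(xi != ti for xi, ti in zip(x, target))

def tabu_sa(iters=5000, tenure=8, t_start=2.0, alpha=0.999):
    x = [random.randint(0, 1) for _ in range(n)]
    best, best_cost = x[:], cost(x)
    tabu = {}                      # bit index -> iteration until which it is tabu
    temp = t_start
    for it in range(iters):
        temp *= alpha
        i = random.randrange(n)    # candidate move: flip one bit
        if tabu.get(i, -1) > it:
            continue               # move is tabu, skip it
        cand = x[:]
        cand[i] ^= 1
        delta = cost(cand) - cost(x)
        if delta <= 0 or random.random() < math.exp(-delta / temp):
            x = cand
            tabu[i] = it + tenure  # forbid undoing this move for `tenure` steps
            if cost(x) < best_cost:
                best, best_cost = x[:], cost(x)
    return best, best_cost

print("best cost found:", tabu_sa()[1])
```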
|
304 |
Models for Parallel Computation in Multi-Core, Heterogeneous, and Ultra Wide-Word Architectures
Salinger, Alejandro, January 2013
Multi-core processors have become the dominant processor architecture, with 2, 4, and 8 cores on a chip widely available and an increasing number of cores predicted for the future. In addition, the decreasing costs and increasing programmability of Graphics Processing Units (GPUs) have made these an accessible source of parallel processing power in general purpose computing. Among the many research challenges that this scenario has raised are the fundamental problems related to the theoretical modeling of computation in these architectures. In this thesis we study several aspects of computation in modern parallel architectures, from modeling of computation in multi-cores and heterogeneous platforms, to multi-core cache management strategies, through the proposal of an architecture that exploits bit-parallelism on thousands of bits.
Observing that in practice multi-cores have a small number of cores, we propose a model of low-degree parallelism for these architectures. We argue that assuming a small number of processors (logarithmic in a problem's input size) simplifies the design of parallel algorithms. We show that in this model a large class of divide-and-conquer and dynamic programming algorithms can be parallelized with simple modifications to sequential programs, while achieving optimal parallel speedups. We further explore low-degree parallelism in computation, providing evidence of fundamental differences in practice and theory between systems with a sublinear and a linear number of processors, and suggesting a sharp theoretical gap between the classes of problems that are efficiently parallelizable in each case.
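A minimal sketch of the kind of low-degree parallelization of divide-and-conquer algorithms discussed above: the input is split into a small, fixed number of subproblems that are solved in parallel and combined sequentially. The use of mergesort and of Python's process pool are illustrative choices, not the model's formal construction.

```python
import heapq
from concurrent.futures import ProcessPoolExecutor

def parallel_sort(data, p=4):
    # split into p roughly equal subproblems, solve each sequentially in parallel,
    # then combine with a sequential k-way merge (the "conquer" step)
    chunks = [data[i::p] for i in range(p)]
    with ProcessPoolExecutor(max_workers=p) as ex:
        sorted_chunks = list(ex.map(sorted, chunks))
    return list(heapq.merge(*sorted_chunks))

if __name__ == "__main__":
    import random
    xs = [random.random() for _ in range(100000)]
    assert parallel_sort(xs) == sorted(xs)
    print("parallel mergesort matches the sequential result")
```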
Efficient strategies to manage shared caches play a crucial role in multi-core performance. We propose a model for paging in multi-core shared caches, which extends classical paging to a setting in which several threads share the cache. We show that in this setting traditional cache management policies perform poorly, and that any effective strategy must partition the cache among threads, with a partition that adapts dynamically to the demands of each thread. Inspired by the shared cache setting,
we introduce the minimum cache usage problem, an extension to classical sequential paging in which algorithms must account for the amount of cache they use.
This cache-aware model seeks algorithms with good performance in terms of faults and the amount of cache used, and has applications in energy efficient caching and in shared cache scenarios.
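The toy simulation below illustrates why a shared cache can need partitioning: a thread with a small cyclic working set shares an LRU cache with a scanning thread, and the scan evicts the working set, whereas a demand-aware static split already avoids this. The request streams and cache sizes are illustrative assumptions, and the thesis studies partitions that adapt dynamically rather than this fixed split.

```python
from collections import OrderedDict

def lru_faults(requests, k):
    cache, faults = OrderedDict(), 0
    for p in requests:
        if p in cache:
            cache.move_to_end(p)
        else:
            faults += 1
            if len(cache) >= k:
                cache.popitem(last=False)   # evict the least recently used page
            cache[p] = True
    return faults

k = 6
a = [("A", i % 4) for i in range(200)]      # thread A: small cyclic working set
b = [("B", i) for i in range(200)]          # thread B: a long scan with no reuse
interleaved = [x for pair in zip(a, b) for x in pair]

shared = lru_faults(interleaved, k)                   # one shared LRU cache
partitioned = lru_faults(a, 4) + lru_faults(b, 2)     # demand-aware static split
print("shared cache faults:", shared, " partitioned faults:", partitioned)
```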
The wide availability of GPUs has added to the parallel power of multi-cores; however, most applications underutilize the available resources. We propose a model for hybrid computation in heterogeneous systems with multi-cores and GPUs, and describe strategies for generic parallelization and efficient scheduling of a large class of divide-and-conquer algorithms.
Lastly, we introduce the Ultra-Wide Word architecture and model, an extension of the word-RAM model that allows for constant-time operations on thousands of bits in parallel. We show that a large class of existing algorithms can be
implemented in the Ultra-Wide Word model, achieving speedups comparable to those of multi-threaded computations, while avoiding the more difficult aspects of parallel programming.
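To give a flavor of wide-word bit-parallelism, the sketch below uses Python's arbitrary-precision integers as a stand-in for a very wide word and runs the classic shift-and string-matching algorithm, which advances by one bitwise update per text character regardless of pattern length (up to the word width). This is only an analogy for the Ultra-Wide Word model, not an implementation of it.

```python
def shift_and(text, pattern):
    m = len(pattern)
    # one bitmask per alphabet symbol: bit j is set if pattern[j] == symbol
    masks = {}
    for j, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << j)
    state, hits = 0, []
    for i, c in enumerate(text):
        # single wide-word update per character
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & (1 << (m - 1)):
            hits.append(i - m + 1)          # match ending at position i
    return hits

print(shift_and("abracadabra", "abra"))     # [0, 7]
```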
|
305 |
Pricing a Multi-Asset American Option in a Parallel Environment by a Finite Element Method Approach
Kaya, Deniz, January 2011
There is a need to apply numerical methods to problems that cannot be solved analytically, and as the spatial dimension of the problem is increased, the need for computational resources increases exponentially, a phenomenon known as the “curse of dimensionality”. In the Black-Scholes-Merton framework the American option pricing problem has no closed form solution and a numerical procedure has to be employed for solving a PDE. The multi-asset American option introduces challenging computational problems, since every added asset increases the dimension of the PDE by one. One way to deal with the curse of dimensionality is through parallelism. Here the finite element method-of-lines is used for pricing a multi-asset American option dependent on up to four assets in a parallel environment. The problem is also solved with the PSOR method, giving an accurate benchmark used for comparison. In finance the put option is one of the most fundamental derivatives, since it is basically asset-value insurance, and much research in quantitative finance is devoted to accurate and fast pricing techniques for the multi-dimensional case. “What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.” Norbert Wiener “As soon as an Analytical Engine exists, it will necessarily guide the future course of the science. Whenever any result is sought by its aid, the question will then arise – by what course of calculation can these results be arrived at by the machine in the shortest time?” Charles Babbage
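For reference, here is a one-dimensional sketch of the PSOR benchmark mentioned above: an American put under Black-Scholes, discretized with fully implicit finite differences and solved at each time step as a linear complementarity problem by projected SOR. The strike, volatility, rate, and grid sizes are illustrative assumptions; the thesis itself prices multi-asset options with a finite element method-of-lines in parallel.

```python
import numpy as np

K, r, sigma, T = 100.0, 0.05, 0.3, 1.0     # illustrative option parameters
S_max, M, N = 300.0, 150, 100              # M space intervals, N time steps
dS, dt = S_max / M, T / N
S = np.linspace(0.0, S_max, M + 1)
payoff = np.maximum(K - S, 0.0)            # American put payoff

i = np.arange(1, M)                        # interior node indices
a = 0.5 * dt * (sigma**2 * i**2 - r * i)   # sub-diagonal coefficient
b = 1.0 + dt * (sigma**2 * i**2 + r)       # diagonal coefficient
c = 0.5 * dt * (sigma**2 * i**2 + r * i)   # super-diagonal coefficient

V = payoff.copy()
omega, tol = 1.2, 1e-8                     # SOR relaxation factor and tolerance
for step in range(N):
    rhs = V[1:M].copy()                    # right-hand side from the previous step
    V[0], V[M] = K, 0.0                    # boundary values for the put
    for sweep in range(2000):              # projected SOR iterations
        err = 0.0
        for k in range(M - 1):
            y = (rhs[k] + a[k] * V[k] + c[k] * V[k + 2]) / b[k]
            # SOR relaxation followed by projection onto the payoff constraint
            y = max(payoff[k + 1], V[k + 1] + omega * (y - V[k + 1]))
            err = max(err, abs(y - V[k + 1]))
            V[k + 1] = y
        if err < tol:
            break

print("American put value at S = 100:", np.interp(100.0, S, V))
```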
|
306 |
Gas-Kinetic Methods for 3-D Inviscid and Viscous Flow Solutions on Unstructured/Hybrid Grids
Ilgaz, Murat, 01 February 2007
In this thesis, gas-kinetic methods for inviscid and viscous flow simulations are developed. Initially, the finite volume gas-kinetic methods are investigated for 1-D flows as a preliminary study and are discussed in detail from theoretical and numerical points of view. The preliminary results show that the gas-kinetic methods do not produce any unphysical flow phenomena. Especially the Gas-Kinetic BGK method, which takes into account the particle collisions, predicts compressible flows accurately. The Gas-Kinetic BGK method is then extended for the solution of 2-D and 3-D inviscid and viscous flows on unstructured/hybrid grids. The computations are performed in parallel. Various inviscid and viscous test cases are considered and it is shown that the Gas-Kinetic BGK method predicts both inviscid and viscous flow fields accurately.
The implementation of hybrid grids for viscous flows reduces the overall number
of grid cells while enabling the resolution of boundary layers. The parallel
computations significantly reduce the computation time of the Gas-Kinetic BGK method, which in turn makes the method applicable to practical aerodynamic flow problems.
|
307 |
Accuracy and Efficiency Improvements in Finite Difference Sensitivity Calculations
Ozhamam, Murat, 01 December 2007
The accuracy of finite difference sensitivity calculations is improved by calculating the optimum finite difference interval sizes. In an aerodynamic inverse design algorithm, a compressor cascade geometry is perturbed by shape functions and finite difference sensitivity derivatives of the flow variables are calculated with respect to the base geometry flow variables. The sensitivity derivatives are used in an optimization code and a new airfoil is designed that satisfies the given design characteristics. Accurate sensitivities are needed for the optimization process. A method is investigated to find the optimum finite difference interval size; convergence error estimation techniques for iterative solutions and second derivative estimations are investigated to facilitate this method. For validation of the method, analytical sensitivity calculations of the Euler equations are used and several applications are performed.
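The idea of the optimum interval can be sketched as follows: for a forward difference the total error behaves roughly like h|f''|/2 from truncation plus 2ε/h from the error in evaluating f, and is minimized near h* = 2√(ε/|f''|). In the snippet below f is an analytic stand-in and ε an assumed evaluation error; in the thesis ε comes from the convergence error of the iterative flow solution and f'' from a numerical estimate.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x, eps=1e-6):
    # analytic stand-in for an expensive iterative solution, with an assumed
    # evaluation error of magnitude up to eps (e.g. solver convergence error)
    return np.sin(x) + eps * rng.uniform(-1.0, 1.0)

x0, eps = 1.0, 1e-6
f2 = abs(-np.sin(x0))                     # |f''(x0)|, estimated numerically in practice
h_opt = 2.0 * np.sqrt(eps / f2)           # optimum forward-difference interval
exact = np.cos(x0)

for h in (1e-8, h_opt, 1e-1):
    approx = (f(x0 + h) - f(x0)) / h      # forward-difference sensitivity
    print(f"h = {h:9.2e}   error = {abs(approx - exact):.2e}")
```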
The efficiency of the finite difference sensitivity calculations is improved by parallel computing. The finite difference sensitivity calculations are independent tasks in the inverse aerodynamic design algorithm and can be computed separately. The sensitivity calculations are therefore performed on parallel processors, which decreases the computing time.
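Because each perturbed-geometry evaluation is independent, the parallelization can be sketched as simply farming the evaluations out to a process pool, as below; flow_solve is a hypothetical stand-in for the actual iterative flow solver and the design vector is illustrative.

```python
from concurrent.futures import ProcessPoolExecutor

def flow_solve(design):
    # placeholder objective; in practice this runs the flow solver to convergence
    return sum((x - 0.5) ** 2 for x in design)

def sensitivities(design, h=1e-4):
    base = flow_solve(design)
    perturbed = []
    for i in range(len(design)):
        d = list(design)
        d[i] += h                         # perturb one shape parameter at a time
        perturbed.append(d)
    with ProcessPoolExecutor() as ex:     # independent tasks evaluated in parallel
        values = list(ex.map(flow_solve, perturbed))
    return [(v - base) / h for v in values]

if __name__ == "__main__":
    print(sensitivities([0.1, 0.2, 0.3, 0.4]))
```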
|
308 |
Vessel Segmentation Using Shallow Water Equations
Nar, Fatih, 01 May 2011
This thesis investigates the feasibility of using fluid flow as a deformable model for
segmenting vessels in 2D and 3D medical images. Exploiting fluid flow in vessel
segmentation is biologically plausible since vessels naturally provide the medium for
blood transportation. Fluid flow can be used as a basis for powerful vessel
segmentation because streaming fluid regions can merge and split providing
topological adaptivity. In addition, the fluid can also flow through small gaps formed
by imaging artifacts building connections between disconnected areas. In our study,
due to their simplicity, parallelism, and low computational cost compared to other
fluid simulation methods, linearized shallow water equations (LSWE) are used. The
method developed herein is validated using synthetic data sets, two clinical datasets,
and publicly available simulated datasets which contain Magnetic Resonance
Angiography (MRA) images, Magnetic Resonance Venography (MRV) images and
retinal angiography images. Depending on image size, speedups of one to two orders of magnitude are obtained with the developed parallel implementation using the Nvidia Compute Unified Device Architecture (CUDA), compared to single-core and multi-core CPU implementations.
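A minimal numpy sketch of the linearized shallow water equations that drive the fluid front is given below, integrated with a simple semi-implicit (Euler-Cromer style) update on a periodic grid; the grid size, time step, and initial seed are illustrative assumptions, and the coupling of the flow to image intensities that performs the actual vessel segmentation is the thesis's contribution and is not reproduced here.

```python
import numpy as np

# Linearized shallow water equations:
#   dh/dt = -H * (du/dx + dv/dy),  du/dt = -g * dh/dx,  dv/dt = -g * dh/dy
g, H, dx, dt = 9.81, 1.0, 1.0, 0.05
n = 64
h = np.zeros((n, n))
h[n // 2, n // 2] = 1.0            # initial "drop" seeding the flow
u = np.zeros((n, n))
v = np.zeros((n, n))

def ddx(a):
    return (np.roll(a, -1, axis=1) - np.roll(a, 1, axis=1)) / (2 * dx)

def ddy(a):
    return (np.roll(a, -1, axis=0) - np.roll(a, 1, axis=0)) / (2 * dx)

for step in range(200):
    u -= dt * g * ddx(h)            # update velocities from the current height field
    v -= dt * g * ddy(h)
    h -= dt * H * (ddx(u) + ddy(v)) # then update height from the new velocities

print("height field range:", h.min(), h.max())
```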
|
309 |
Proceedings of the 4th Many-core Applications Research Community (MARC) Symposium
January 2012
In continuation of a successful series of events, the 4th Many-core Applications Research Community (MARC) symposium took place at the HPI in Potsdam on December 8th and 9th 2011. Over 60 researchers from different fields presented their work on many-core hardware architectures, their programming models, and the resulting research questions for the upcoming generation of heterogeneous parallel systems.
|
310 |
Optimistic semantic synchronization
Sreeram, Jaswanth, 06 October 2011
Within the last decade multi-core processors have become increasingly commonplace, with the power and performance demands of modern real-world programs acting to accelerate this trend. The rapid advances in the design and adoption of such architectures mean that there is a serious need for programming models that allow the development of correct parallel programs that execute efficiently on these processors. A principal problem in this regard is that of efficiently synchronizing concurrent accesses to shared memory. Traditional solutions to this problem are either inefficient but programmable (coarse-grained locks) or efficient but not composable and very hard to program and verify (fine-grained locks). Optimistic Transactional Memory systems provide many of the composability and programmability advantages of coarse-grained locks and good theoretical scaling, but several studies have found that their performance in practice for many programs remains quite poor, primarily because of the high overheads of providing safe optimism. Moreover, current transactional memory models remain rigid: they are not suited for expressing some of the complex thread interactions that are prevalent in modern parallel programs. In addition, the synchronization achieved by these transactional memory systems is at the physical or memory level.
This thesis advocates the position that the memory synchronization problem for threads should be modeled and solved in terms of the synchronization of the underlying program values, which have semantics associated with them. It presents optimistic synchronization techniques that address the semantic synchronization requirements of a parallel program instead.
These techniques include methods to 1) enable optimistic transactions to recover from expensive sharing conflicts without discarding all the work made possible by the optimism; 2) enable a hybrid pessimistic-optimistic form of concurrency control that lowers overheads; 3) make synchronization value-aware and semantics-aware; and 4) enable finer-grained consistency rules (than allowed by traditional optimistic TM models), thereby avoiding conflicts that do not enforce any semantic property required by the program. In addition to improving the expressiveness of specific synchronization idioms, all these techniques are also effective in improving parallel performance. This thesis formulates these techniques in terms of their purpose, the extensions to the language, the compiler, and the concurrency control runtime necessary to implement them. It also briefly presents an experimental evaluation of each of them on a variety of modern parallel workloads. These experiments show that these techniques significantly improve parallel performance and scalability over programs using state-of-the-art optimistic synchronization methods.
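The following toy sketch illustrates the flavor of value-aware optimistic synchronization: a transaction records the values it reads, computes its writes without holding a lock, and validates at commit time against the recorded values rather than memory versions, retrying only if a value it depends on actually changed. The API names and the single commit lock are illustrative assumptions, not the runtime described in the thesis.

```python
import threading

shared = {"balance": 100, "bonus": 0}
commit_lock = threading.Lock()

def atomically(transaction, retries=100):
    for _ in range(retries):
        reads = {}                      # snapshot of the values read optimistically
        def read(key):
            reads[key] = shared[key]
            return reads[key]
        writes = transaction(read)      # compute the write set without locking
        with commit_lock:
            # value-based validation: commit only if the values we depended on
            # are still the ones we read
            if all(shared[k] == v for k, v in reads.items()):
                shared.update(writes)
                return True
        # a value we depended on changed; retry the transaction
    return False

def deposit(amount):
    def tx(read):
        return {"balance": read("balance") + amount}
    return tx

threads = [threading.Thread(target=atomically, args=(deposit(1),)) for _ in range(50)]
for t in threads: t.start()
for t in threads: t.join()
print(shared["balance"])                # 150 if all commits succeeded
```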
|