541

Parallel Evaluation of Numerical Models for Algorithmic Trading

Ligr, David January 2016 (has links)
This thesis will address the problem of the parallel evaluation of algorithmic trading models based on multiple-kernel support vector regression. Various approaches to parallelising the evaluation of these models will be proposed, and their suitability for highly parallel architectures, namely the Intel Xeon Phi coprocessor, will be analysed with regard to the specifics of this coprocessor and of its programming model. Based on this analysis, a prototype will be implemented and its performance compared to serial and multi-core baselines in a series of experiments.
542

Collective behaviour of model microswimmers

Putz, Victor B. January 2010 (has links)
At small length scales, low velocities, and high viscosity, the effects of inertia on motion through a fluid become insignificant and viscous forces dominate. Microswimmer propulsion is therefore, of necessity, achieved by different means than those employed by macroscopic organisms. We describe in detail the hydrodynamics of microswimmers consisting of colloidal particles and their interactions. In particular we focus on two-bead swimmers and the effects of asymmetry on collective motion, calculating analytical formulae for time-averaged pair interactions and verifying them with microscopic time-resolved numerical simulation, finding good agreement. We then examine the long-term effects of a swimmer's passing on a passive tracer particle, finding that the force-free nature of these microswimmers leads to loop-shaped tracer trajectories. Even in the presence of Brownian motion, the loop-shaped structures of these trajectories can be recovered by averaging over a large enough sample size. Finally, we explore the phenomenon of synchronisation between microswimmers through hydrodynamic interactions, using the method of constraint forces on a force-based swimmer. We find that the hydrodynamic interactions between swimmers can alter the relative phase between them such that phase-locking can occur over the long term, altering their collective motion.
543

Performance Analysis of kNN Query Processing on large datasets using CUDA & Pthreads : comparing between CPU & GPU

Kalakuntla, Preetham January 2017 (has links)
Telecom companies perform extensive analytics to offer consumers a better service and to stay competitive. These companies accumulate big data with the potential to provide inputs for business decisions. Query processing is one of the major tools for running analytics on their data, but traditional in-memory query processing techniques cannot cope with the large data volumes of telecom operators. The k-nearest-neighbour (kNN) technique is well suited to classification and regression on large datasets. Our research focuses on implementing kNN as a query processing algorithm and evaluating its performance on large datasets on a single core, on multiple cores, and on a GPU. This thesis presents an experimental implementation of kNN query processing on a single-core CPU, a multi-core CPU, and a GPU using Python, Pthreads, and CUDA respectively. We considered different data sizes, dimensionalities, and values of k as inputs to evaluate performance. The experiments show that the GPU outperforms the single-core CPU by a factor of 1.4 to 3 and the multi-core CPU by a factor of 5.8 to 16 across the different input levels.
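The abstract does not include the thesis's code, but the core of brute-force kNN query processing (the single-core baseline) can be sketched in a few lines of Python; all names here are illustrative, not taken from the thesis:

```python
import math
from collections import Counter

def knn_classify(train, labels, query, k):
    """Brute-force kNN: rank all training points by Euclidean distance
    to the query, then take a majority vote among the k nearest.  This
    is the single-core baseline; a GPU version parallelises the
    distance computations across threads."""
    ranked = sorted((math.dist(p, query), lbl) for p, lbl in zip(train, labels))
    votes = Counter(lbl for _, lbl in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data: two clusters in 2-D.
train = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_classify(train, labels, (0.15, 0.15), k=3))  # -> a
```

The quadratic cost of the distance ranking is exactly what makes large datasets prohibitive on one core and attractive for GPU offload.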
544

Pencil beam dose calculation for proton therapy on graphics processing units

da Silva, Joakim January 2016 (has links)
Radiotherapy delivered using scanned beams of protons enables greater conformity between the dose distribution and the tumour than conventional radiotherapy using X rays. However, the dose distributions are more sensitive to changes in patient anatomy, and tend to deteriorate in the presence of motion. Online dose calculation during treatment delivery offers a way of monitoring the delivered dose in real time, and could be used as a basis for mitigating the effects of motion. The aim of this work has therefore been to investigate how the computational power offered by graphics processing units can be harnessed to enable fast analytical dose calculation for online monitoring in proton therapy. The first part of the work consisted of a systematic investigation of various approaches to implementing the most computationally expensive step of the pencil beam algorithm to run on graphics processing units. As a result, it was demonstrated how the kernel superposition operation, or convolution with a spatially varying kernel, can be efficiently implemented using a novel scatter-based approach. For the intended application, this outperformed the conventional gather-based approach suggested in the literature, permitting faster pencil beam dose calculation and potential speedups of related algorithms in other fields. In the second part, a parallelised proton therapy dose calculation engine employing the scatter-based kernel superposition implementation was developed. Such a dose calculation engine, running all of the principal steps of the pencil beam algorithm on a graphics processing unit, had not previously been presented in the literature. The accuracy of the calculation in the high- and medium-dose regions matched that of a clinical treatment planning system whilst the calculation was an order of magnitude faster than previously reported. 
Importantly, the calculation times were short, both compared to the dead time available during treatment delivery and to the typical motion period, making the implementation suitable for online calculation. In the final part, the beam model of the dose calculation engine was extended to account for the low-dose halo caused by particles travelling at large angles with the beam, making the algorithm comparable to those in current clinical use. By reusing the workflow of the initial calculation but employing a lower resolution for the halo calculation, it was demonstrated how the improved beam model could be included without prohibitively prolonging the calculation time. Since the implementation was based on a widely used algorithm, it was further predicted that by careful tuning, the dose calculation engine would be able to reproduce the dose from a general beamline with sufficient accuracy. Based on the presented results, it was concluded that, by using a single graphics processing unit, dose calculation using the pencil beam algorithm could be made sufficiently fast for online dose monitoring, whilst maintaining the accuracy of current clinical systems.
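As a rough illustration of the kernel superposition step discussed above (convolution with a spatially varying kernel), the following 1-D Python sketch contrasts the gather formulation, where each output point pulls from every source, with the scatter formulation, where each source pushes onto the grid. This is a serial toy with hypothetical values, not the thesis's GPU implementation:

```python
import math

def gather(weights, sigmas, n):
    """Gather: each output point i pulls a contribution from every
    pencil beam j, using a beam-dependent Gaussian kernel width."""
    out = [0.0] * n
    for i in range(n):
        for j in range(len(weights)):
            out[i] += weights[j] * math.exp(-((i - j) ** 2) / (2.0 * sigmas[j] ** 2))
    return out

def scatter(weights, sigmas, n):
    """Scatter: each pencil beam j pushes its kernel onto the whole grid.
    On a GPU this needs care (atomics or privatisation), but each beam's
    spatially varying kernel parameters are read only once."""
    out = [0.0] * n
    for j in range(len(weights)):
        for i in range(n):
            out[i] += weights[j] * math.exp(-((i - j) ** 2) / (2.0 * sigmas[j] ** 2))
    return out

dose_g = gather([1.0, 0.5, 2.0], [0.8, 1.2, 1.0], 8)
dose_s = scatter([1.0, 0.5, 2.0], [0.8, 1.2, 1.0], 8)
print(max(abs(a - b) for a, b in zip(dose_g, dose_s)))  # -> 0.0
```

Both formulations compute the same dose; the thesis's finding is that, for this application, the scatter layout maps better onto GPU hardware than the conventional gather layout.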
545

Efficient Dynamic Automatic Memory Management And Concurrent Kernel Execution For General-Purpose Programs On Graphics Processing Units

Pai, Sreepathi 11 1900 (has links) (PDF)
Modern supercomputers now use accelerators to achieve their performance, the most widely used accelerator being the Graphics Processing Unit (GPU). However, achieving the performance potential of systems that combine a GPU and a CPU is an arduous task that could be made easier with the assistance of the compiler or runtime. In particular, exploiting two features of GPU architectures -- distributed memory and concurrent kernel execution -- is critical to achieving good performance, but in current GPU programming systems programmers must exploit them manually, which can lead to poor performance. In this thesis, we propose automatic techniques that: i) perform data transfers between the CPU and GPU, ii) allocate resources for concurrent kernels, and iii) schedule concurrent kernels efficiently, all without programmer intervention.

Most GPU programs access data in GPU memory for performance. Manually inserting the data transfers that move data to and from this GPU memory is an error-prone and tedious task. In this work, we develop a software coherence mechanism to fully automate all data transfers between the CPU and GPU without any assistance from the programmer. Our mechanism uses compiler analysis to identify potential stale-data accesses and uses a runtime to initiate transfers as necessary. This avoids the redundant transfers exhibited by all other existing automatic memory management proposals for general-purpose programs. We integrate our automatic memory manager into the X10 compiler and runtime, and find that it not only results in smaller and simpler programs, but also eliminates redundant memory transfers. Tested on eight programs ported from the Rodinia benchmark suite, it achieves (i) a 1.06x speedup over hand-tuned manual memory management, and (ii) a 1.29x speedup over another recently proposed compiler-runtime automatic memory management system. Compared to other existing runtime-only (ADSM) and compiler-only (OpenMPC) proposals, it also transfers 2.2x to 13.3x less data on average.

Each new generation of GPUs vastly increases the resources available to GPGPU programs. GPU programming models (like CUDA) were designed to scale to use these resources. However, we find that CUDA programs actually do not scale to utilize all available resources, with over 30% of resources going unused on average for programs of the Parboil2 suite. Current GPUs therefore allow concurrent execution of kernels to improve utilization. We study concurrent execution of GPU kernels using multiprogrammed workloads on current NVIDIA Fermi GPUs. On two-program workloads from Parboil2 we find that concurrent execution is often no better than serialized execution. We identify lack of control over resource allocation to kernels as a major serialization bottleneck. We propose transformations that convert CUDA kernels into elastic kernels which permit fine-grained control over their resource usage. We then propose several elastic-kernel-aware runtime concurrency policies that offer significantly better performance and concurrency than the current CUDA policy. We evaluate our proposals on real hardware using multiprogrammed workloads constructed from benchmarks in the Parboil2 suite. On average, our proposals increase system throughput (STP) by 1.21x and improve the average normalized turnaround time (ANTT) by 3.73x for two-program workloads over the current CUDA concurrency implementation.

Recent NVIDIA GPUs use a FIFO policy in their thread block scheduler (TBS) to schedule the thread blocks of concurrent kernels. We show that FIFO leaves performance to chance, resulting in significant loss of performance and fairness. To improve performance and fairness, we propose use of the Shortest Remaining Time First (SRTF) policy instead. Since SRTF requires an estimate of runtime (i.e. execution time), we introduce Structural Runtime Prediction, which uses the grid structure of GPU programs to predict runtimes. Using a novel Staircase model of GPU kernel execution, we show that kernel runtime can be predicted by profiling only the first few thread blocks. We evaluate an online predictor based on this model on benchmarks from ERCBench and find that predictions made after the execution of a single thread block are between 0.48x and 1.08x of the actual runtime. We implement the SRTF policy for concurrent kernels using this predictor and evaluate it on two-program workloads from ERCBench. SRTF improves STP by 1.18x and ANTT by 2.25x over FIFO. Compared to MPMax, a state-of-the-art resource allocation policy for concurrent kernels, SRTF improves STP by 1.16x and ANTT by 1.3x. To improve fairness, we also propose SRTF/Adaptive, which controls the resource usage of concurrently executing kernels to maximize fairness. SRTF/Adaptive improves STP by 1.12x, ANTT by 2.23x and fairness by 2.95x compared to FIFO. Overall, our implementation of SRTF achieves STP within 12.64% of Shortest Job First (SJF, an oracle optimal scheduling policy), bridging 49% of the gap between FIFO and SJF.
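The scheduling argument above can be illustrated with a toy simulation. When all kernels arrive together and run back to back, SRTF reduces to shortest-job-first ordering; the sketch below (hypothetical names and runtimes, not the thesis's scheduler) shows how ordering alone changes turnaround times:

```python
def schedule(jobs, policy):
    """Run (name, runtime) jobs back to back, all arriving at t = 0, and
    return each job's turnaround time.  With simultaneous arrivals and
    no preemption, SRTF reduces to shortest-job-first ordering."""
    order = sorted(jobs, key=lambda j: j[1]) if policy == "srtf" else list(jobs)
    t, turnaround = 0.0, {}
    for name, runtime in order:
        t += runtime
        turnaround[name] = t
    return turnaround

jobs = [("long_kernel", 10.0), ("short_kernel", 1.0)]
print(schedule(jobs, "fifo"))  # short kernel waits behind the long one
print(schedule(jobs, "srtf"))  # short kernel finishes first
```

Under FIFO the short kernel's turnaround is inflated from 1 to 11 time units by the long kernel ahead of it, which is exactly the "performance left to chance" that the SRTF policy avoids.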
546

Finite Element Computations on Multicore and Graphics Processors

Ljungkvist, Karl January 2017 (has links)
In this thesis, techniques for efficient utilization of modern computer hardware for numerical simulation are considered. In particular, we study techniques for improving the performance of computations using the finite element method. One of the main difficulties in finite-element computations is how to perform the assembly of the system matrix efficiently in parallel, due to its complicated memory access pattern. The challenge lies in the fact that many entries of the matrix are updated concurrently by several parallel threads. We consider transactional memory, an exotic hardware feature for concurrent update of shared variables, and conduct benchmarks on a prototype multicore processor supporting it. Our experiments show that transactions can both simplify programming and provide good performance for concurrent updates of floating-point data. Secondly, we study a matrix-free approach to finite-element computation which avoids the matrix assembly. In addition to removing the need to store the system matrix, matrix-free methods are attractive due to their low memory footprint, which better matches the architecture of modern processors where memory bandwidth is scarce and compute power is abundant. Motivated by this, we consider matrix-free implementations of high-order finite-element methods for execution on graphics processors, which have seen a revolutionary increase in usage for numerical computations during recent years due to their more efficient architecture. In the implementation, we exploit sum-factorization techniques for efficient evaluation of matrix-vector products, mesh coloring and atomic updates for concurrent updates, and a geometric multigrid algorithm for efficient preconditioning of iterative solvers. Our performance studies show that on the GPU, a matrix-free approach is the method of choice for elements of order two and higher, yielding both significantly faster execution and the ability to solve considerably larger problems.
Compared to corresponding CPU implementations executed on comparable multicore processors, the GPU implementation is about twice as fast, suggesting that graphics processors are about twice as power efficient as multicores for computations of this kind.
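The matrix-free idea described above can be illustrated on the simplest possible case, the 1-D Laplacian with linear elements: the operator is applied row by row without ever storing a matrix, and the result agrees with an explicitly assembled matrix-vector product. This is a serial toy sketch under those assumptions, not the thesis's high-order GPU implementation:

```python
def apply_stiffness(u, h):
    """Matrix-free application of the 1-D linear-element stiffness matrix
    with homogeneous Dirichlet boundaries: each row
    (2*u[i] - u[i-1] - u[i+1]) / h is evaluated on the fly, so no matrix
    is ever stored."""
    n = len(u)
    out = [0.0] * n
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        out[i] = (2.0 * u[i] - left - right) / h
    return out

def apply_assembled(u, h):
    """Reference: assemble the full stiffness matrix, then multiply."""
    n = len(u)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 2.0 / h
        if i > 0:
            A[i][i - 1] = -1.0 / h
        if i < n - 1:
            A[i][i + 1] = -1.0 / h
    return [sum(A[i][j] * u[j] for j in range(n)) for i in range(n)]

u = [0.1 * i * (9 - i) for i in range(10)]
diff = max(abs(a - b) for a, b in zip(apply_stiffness(u, 0.1), apply_assembled(u, 0.1)))
print(diff < 1e-12)  # -> True
```

The matrix-free version touches only the vector `u`, which is the memory-footprint advantage the abstract describes; for high-order elements the per-row evaluation becomes a sum-factorized tensor contraction instead of a stencil.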
547

Domain decomposition method for Schrödinger equation

Xing, Feng 28 November 2014 (has links)
This thesis focuses on the development and implementation of domain decomposition (DD) methods for the linear and nonlinear Schrödinger equations in one and two space dimensions. In the first part, we focus on the Schwarz waveform relaxation (SWR) method for the one-dimensional Schrödinger equation. In the case where the potential is linear and time-independent, we propose a new algorithm that is scalable and allows a significant reduction in computation time compared with the classical algorithm. For a general potential, we use a previously defined linear operator as a preconditioner, which ensures high scalability. We also generalise the work of Halpern and Szeftel on transmission conditions, using the absorbing boundary conditions recently constructed by Antoine, Besse and Klein as the transmission condition. In addition, we port the codes originally developed for CPUs to GPUs. The second part concerns DD methods for the Schrödinger equation in two dimensions. We generalise the new algorithm and the preconditioned algorithm proposed in the first part to the two-dimensional case. Furthermore, in Chapter 6, we generalise the work of Loisel on the optimised Schwarz method with cross points for the Laplace equation, which leads to an SWR method with cross points. In the last part, we apply the DD methods we have studied to the simulation of Bose-Einstein condensates, which not only reduces the total computation time but also makes larger simulations possible.
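As a minimal illustration of the domain decomposition principle behind SWR (though for a steady toy problem rather than the time-dependent Schrödinger equation), the following sketch runs the classical alternating Schwarz iteration for u'' = 0 on two overlapping subdomains; the interface values converge to the exact solution u(x) = x:

```python
def schwarz_laplace(n_iters=30):
    """Alternating Schwarz iteration for u'' = 0 on [0, 1] with
    u(0) = 0, u(1) = 1, split into overlapping subdomains [0, 0.6]
    and [0.4, 1].  Each subdomain solve of the 1-D Laplace equation
    is just linear interpolation between its current boundary values."""
    a, b = 0.4, 0.6        # edges of the overlap region
    u_a, u_b = 0.0, 0.0    # iterated interface values at x = a and x = b
    for _ in range(n_iters):
        u_a = u_b * a / b                              # solve on [0, b]
        u_b = u_a + (1.0 - u_a) * (b - a) / (1.0 - a)  # solve on [a, 1]
    return u_a, u_b

# The exact solution is u(x) = x, so the interface values converge to (0.4, 0.6).
print(schwarz_laplace())
```

The contraction rate of this iteration depends on the overlap width, which is why transmission conditions (such as the absorbing conditions mentioned above) matter so much for the convergence of SWR on the Schrödinger equation.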
548

Flow control using optical sensors

Gautier, Nicolas 08 October 2014 (has links)
Flow control using optical sensors is experimentally investigated. Real-time computation of flow velocity fields is implemented. This novel approach, featuring a camera for acquisition and a graphics processing unit (GPU) for processing, is presented and detailed, and its validity with regard to speed and precision is investigated. A comprehensive guide to software and hardware optimisation is given. We demonstrate that online computation of velocity fields is not only achievable but offers advantages over traditional particle image velocimetry (PIV) setups; it shows great promise not only for flow control but also for parametric studies and prototyping.

A hydrodynamic channel featuring a backward-facing step is used in all experiments for separated flow control, with jets providing actuation. A comprehensive parametric study is carried out to determine the effects of upstream jet injection, traditionally performed at the step edge; it is shown that upstream injection can be very effective at reducing recirculation, corroborating results from the literature. Both open- and closed-loop control methods are investigated using this setup.

Basic PID-type control is first introduced to ascertain the viability of closed-loop flow control with optical sensors. The recirculation region created behind the backward-facing step is computed in real time in the vertical symmetry plane and the horizontal plane, and we show that its size can be successfully manipulated through set-point adaptive control and gradient-based methods.

A physically driven control approach is then introduced. Previous results in the literature show that recirculation can be successfully reduced by periodic actuation at the natural vortex-shedding frequency associated with the Kelvin-Helmholtz instability of the shear layer created by the step. A method based on vortex detection is introduced to compute this frequency, which is then used in a closed loop to ensure the flow is always actuated at the right frequency, showing how recirculation reduction can be achieved by simple means using optical sensors.

Next, a feed-forward approach based on ARMAX models is implemented, whose effectiveness had previously been demonstrated in simulations. It aims to prevent the amplification of upstream disturbances by the shear layer, and we show how such an approach can be implemented successfully in an experimental setting.

Finally, since higher-Reynolds-number flows exhibit nonlinear behaviour that can be difficult to model satisfactorily, a radically different approach, dubbed machine learning control and based on genetic programming, is also implemented. Random control laws are generated and rated according to a cost function; the best-performing laws are bred, mutated, or copied to yield the next generation, and the process iterates until the cost is minimised. Although slow to converge, this approach gives encouraging results through an original control law.
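The genetic-programming loop described at the end (generate, rate, select, mutate) can be caricatured with a scalar toy: searching for a control gain that minimises a hypothetical cost. This is only the selection/mutation skeleton under assumed parameters; real machine learning control evolves whole control laws and adds crossover:

```python
import random

def evolve(cost, n_gen=40, pop_size=20, seed=0):
    """Toy genetic search over a scalar control gain: rate the population
    with the cost function, keep the best half, and refill the population
    with Gaussian-mutated copies of the survivors."""
    rng = random.Random(seed)
    pop = [rng.uniform(-5.0, 5.0) for _ in range(pop_size)]
    for _ in range(n_gen):
        pop.sort(key=cost)
        survivors = pop[: pop_size // 2]
        pop = survivors + [g + rng.gauss(0.0, 0.3) for g in survivors]
    return min(pop, key=cost)

# Hypothetical quadratic cost whose optimal gain is 2.0.
best = evolve(lambda k: (k - 2.0) ** 2)
print(round(best, 2))  # converges near 2.0
```

In the experiments the cost function is a measured flow quantity (e.g. recirculation length), which is why each evaluation is expensive and convergence is slow.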
549

Multi-phase modelling of violent hydrodynamics using Smoothed Particle Hydrodynamics (SPH) on Graphics Processing Units (GPUs)

Mokos, Athanasios Dorotheos January 2014 (has links)
This thesis investigates violent air-water flows in two and three dimensions using a smoothed particle hydrodynamics (SPH) model accelerated using the parallel architecture of graphics processing units (GPUs). SPH is a meshless Lagrangian technique for CFD simulations, whose major advantage for multi-phase flows is that the highly nonlinear behaviour of the motion of the interface can be implicitly captured with a sharp interface. However, prior to this thesis, performing multi-phase simulations of large-scale air-water flows had been prohibitive due to the inherent high computational cost. The open-source code DualSPHysics, a hybrid central processing unit (CPU) and GPU code, is heavily modified in order to handle flows with multiple fluids by implementing a weakly compressible multi-phase model that is simple to implement on GPUs. The computational runtime shows a clear improvement over a conventional serial code for both two- and three-dimensional cases, enabling simulations with millions of particles. An investigation into different GPU algorithms focuses on optimising the multi-phase SPH implementation for the first time, leading to speedups of up to two orders of magnitude compared to a CPU-only simulation. Detailed comparison of different GPU algorithms reveals a further 12% improvement in the computational runtime. Enabling the modelling of cases with millions of fluid particles exposes some previously unreported problems regarding the simulation of the air phase. A new particle shifting algorithm is proposed for multi-phase flows, enabling the air, initially simulated as a highly compressible liquid, to expand rapidly as a gas and preventing the formation of unphysical voids. The new shifting algorithm is validated using dam break flows over a dry bed, where good agreement is obtained with experimental data and reference solutions published in the literature. An improvement over a corresponding single-phase SPH simulation is also shown.
Results for dam break flows over a wet bed are shown for different resolutions performing simulations that were unfeasible prior to the GPU multi-phase SPH code. Good agreement with the experimental results and a clear improvement over the single-phase model are obtained with the higher resolution showing closer agreement with the experimental results. Sloshing inside a rolling tank was also examined and was found to be heavily dependent on the viscosity model and the speed of sound of the phases. A sensitivity analysis was performed for a range of different values comparing the results to experimental data with the emphasis on the pressure impact on the wall. Finally, a 3-D gravity-driven flow where water is impacting an obstacle was studied comparing results with published experimental data. The height of the water at different points in the domain and the pressure on the side of the obstacle are compared to a state-of-the-art single-phase GPU SPH simulation. The results obtained were generally in good agreement with the experiment with closer results obtained for higher resolutions and showing an improvement on the single-phase model.
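At the heart of any SPH code such as the one described above is the smoothed density summation over neighbouring particles. The sketch below (1-D, cubic-spline kernel, serial Python, illustrative only, not the DualSPHysics implementation) shows the operation that is parallelised on the GPU; on a uniform particle distribution the interior density recovers the reference value:

```python
import math

def w_cubic(r, h):
    """1-D cubic-spline smoothing kernel with normalisation 2/(3h)
    and compact support of radius 2h."""
    q = abs(r) / h
    sigma = 2.0 / (3.0 * h)
    if q < 1.0:
        return sigma * (1.0 - 1.5 * q * q * (1.0 - 0.5 * q))
    if q < 2.0:
        return sigma * 0.25 * (2.0 - q) ** 3
    return 0.0

def density(positions, mass, h):
    """SPH density summation: rho_i = sum_j m_j * W(|x_i - x_j|, h)."""
    return [sum(mass * w_cubic(xi - xj, h) for xj in positions)
            for xi in positions]

# Uniformly spaced unit-mass particles with spacing h = 1: interior density -> 1.
rho = density([float(i) for i in range(11)], mass=1.0, h=1.0)
print(abs(rho[5] - 1.0) < 1e-9)  # -> True
```

The summation is embarrassingly parallel over particles, which is why neighbour search and kernel evaluation dominate the GPU optimisation work described in the abstract; the density deficiency visible at the domain ends is the usual boundary effect.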
550

Design and Implementation of Efficient Algorithms for Wireless MIMO Communication Systems

Roger Varea, Sandra 16 July 2012 (has links)
Over the last decade, one of the most important technological advances underpinning the new generation of wireless broadband is communication using multiple-input multiple-output (MIMO) systems. MIMO technologies have been adopted by many wireless standards such as LTE, WiMAX and WLAN, mainly because of their ability to increase the maximum transmission rate, together with the reliability and coverage of current wireless communications, without requiring extra bandwidth or additional transmission power. However, the advantages provided by MIMO systems come at the expense of a substantial increase in the implementation cost of multiple antennas and in receiver complexity, which has a large impact on power consumption. For this reason, the design of low-complexity receivers is an important topic addressed throughout this thesis. First, we investigate the use of preprocessing techniques for the MIMO channel matrix, either to decrease the computational cost of optimal decoders or to improve the performance of suboptimal linear, SIC, or tree-search detectors. A detailed description of two widely used preprocessing techniques is presented: the Lenstra-Lenstra-Lovasz (LLL) algorithm for lattice reduction (LR) and the VBLAST ZF-DFE algorithm. The complexity and performance of both methods are evaluated and compared, and a low-cost implementation of the VBLAST ZF-DFE algorithm is proposed and included in the evaluation. Second, a low-complexity tree-search MIMO detector, termed the variable-breadth K-Best (VB K-Best) detector, is developed. The main idea of this method is to exploit the impact of the condition number of the channel matrix on data detection in order to decrease the complexity of the systems / Roger Varea, S. (2012). Design and Implementation of Efficient Algorithms for Wireless MIMO Communication Systems [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16562
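The K-Best detector mentioned in the abstract is a breadth-first tree search: level by level, only the K partial symbol vectors with the lowest accumulated cost are kept. The following Python sketch (a real-valued toy system with hypothetical values, not the thesis's VB K-Best) recovers the transmitted symbols of a noiseless 2x2 system:

```python
def k_best_detect(R, y, symbols, K):
    """K-Best detection for y = R s with R upper triangular: sweep the
    tree from the last antenna upward, expanding every kept candidate
    with every constellation symbol and retaining the K partial symbol
    vectors with the smallest accumulated Euclidean cost."""
    n = len(y)
    candidates = [([], 0.0)]  # (symbols for levels i..n-1, accumulated cost)
    for i in range(n - 1, -1, -1):
        expanded = []
        for path, cost in candidates:
            interference = sum(R[i][j] * path[j - i - 1] for j in range(i + 1, n))
            for s in symbols:
                e = y[i] - R[i][i] * s - interference
                expanded.append(([s] + path, cost + e * e))
        candidates = sorted(expanded, key=lambda c: c[1])[:K]
    return candidates[0][0]

# Hypothetical noiseless 2x2 real system with BPSK symbols {-1, +1}.
R = [[2.0, 0.5], [0.0, 1.5]]
s_true = [1.0, -1.0]
y = [R[0][0] * s_true[0] + R[0][1] * s_true[1], R[1][1] * s_true[1]]
print(k_best_detect(R, y, [-1.0, 1.0], K=2))  # -> [1.0, -1.0]
```

A fixed K bounds the complexity regardless of the channel; the variable-breadth idea of the thesis, as described, adapts the search effort to the condition number of the channel matrix.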
