161

An Analysis of Real-Time Ray Tracing Techniques Using the Vulkan® Explicit API

Souza, Elleis C 01 June 2021 (has links) (PDF)
In computer graphics applications, the choice and implementation of a rendering technique is crucial when targeting real-time performance. Traditionally, rasterization-based approaches have dominated the real-time sector; other algorithms were simply too slow to compete on consumer graphics hardware. With the addition of hardware support for ray-intersection calculations on modern GPUs, hybrid ray tracing/rasterization and purely ray-traced approaches have become possible in real time as well. Industry real-time graphics applications, namely games, have been exploring these different rendering techniques with great success. The addition of ray tracing to the graphics developer's toolkit has without a doubt raised the level of graphical fidelity achievable in real time. In this thesis, three rendering techniques are implemented in a custom rendering engine built on the Vulkan® Explicit API, each representing a different family of modern real-time rendering algorithms: a largely rasterization-based method, a hybrid ray tracing/rasterization method, and a method using ray tracing exclusively. Both the hybrid and the ray-tracing-exclusive approaches rely on the ReSTIR algorithm for lighting calculations. Analysis of the performance and render quality of these approaches reveals the trade-offs incurred by each, alongside the performance viability of each in a real-time setting.
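For readers unfamiliar with ReSTIR, its core primitive is a single-sample weighted reservoir that streams light candidates. A minimal sketch of that update rule (illustrative Python with hypothetical names, not the thesis's Vulkan implementation):

```python
import random

class Reservoir:
    """Single-sample weighted reservoir: the streaming primitive at the
    heart of ReSTIR-style resampled importance sampling."""
    def __init__(self):
        self.sample = None   # currently selected light candidate
        self.w_sum = 0.0     # running sum of resampling weights
        self.m = 0           # number of candidates seen so far

    def update(self, candidate, weight):
        # Replace the kept sample with probability weight / w_sum, so the
        # survivor is distributed proportionally to the weights seen.
        self.w_sum += weight
        self.m += 1
        if self.w_sum > 0.0 and random.random() < weight / self.w_sum:
            self.sample = candidate

# Stream 32 initial light candidates, as in ReSTIR's per-pixel candidate pass.
r = Reservoir()
for _ in range(32):
    light = random.randrange(1000)   # hypothetical light index
    w = random.random()              # stand-in for target pdf / source pdf
    r.update(light, w)
print(r.sample, r.w_sum, r.m)
```

ReSTIR's spatial and temporal reuse then amounts to merging such reservoirs between neighbouring pixels and frames, which is what makes it attractive for real-time lighting.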
162

Automatic Parallelization of Simulation Code from Equation Based Simulation Languages

Aronsson, Peter January 2002 (has links)
Modern state-of-the-art equation-based object-oriented modeling languages such as Modelica have enabled easy modeling of large and complex physical systems. When such complex models are to be simulated, simulation tools typically perform a number of optimizations on the underlying set of equations in the modeled system, with the goal of gaining better simulation performance by decreasing the equation system's size and complexity. The tools then typically generate efficient code to obtain fast execution of the simulations. However, with the increasing complexity of modeled systems, the number of equations and variables is growing; to simulate these large, complex systems efficiently, parallel computing can be exploited. This thesis presents the work of building an automatic parallelization tool that produces an efficient parallel version of the simulation code by building a data-dependency graph (task graph) from the simulation code and applying efficient scheduling and clustering algorithms to the task graph. Various scheduling and clustering algorithms, adapted to the requirements of this type of simulation code, have been implemented and evaluated. The scheduling and clustering algorithms presented and evaluated can also be used for functional dataflow languages in general, since the algorithms work on a task graph with dataflow edges between nodes. Results are given in the form of speedup measurements and task-graph statistics produced by the tool. The conclusion drawn is that some of the algorithms investigated and adapted in this work give reasonable measured speedups for some specific Modelica models, e.g. a model of a thermofluid pipe gave a speedup of about 2.5 on 8 processors in a PC cluster. However, future work lies in finding a good algorithm that works well in general. / Report code: LiU-Tek-Lic-2002:06.
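The scheduling step the abstract describes can be illustrated with a classic greedy list scheduler over a task graph. A simplified sketch, omitting the communication costs and clustering the thesis handles (all names hypothetical):

```python
from collections import defaultdict

def list_schedule(tasks, deps, cost, n_procs):
    """Greedy list scheduling of a task graph onto n_procs processors:
    repeatedly place a ready task on the processor where it can start
    earliest. Communication costs and clustering are omitted."""
    pred, succ = defaultdict(list), defaultdict(list)
    for a, b in deps:                     # edge (a, b): b depends on a
        pred[b].append(a)
        succ[a].append(b)
    indeg = {t: len(pred[t]) for t in tasks}
    finish, proc_free = {}, [0.0] * n_procs
    ready = [t for t in tasks if indeg[t] == 0]
    while ready:
        # earliest start time of each ready task: all predecessors done
        est = {t: max((finish[p] for p in pred[t]), default=0.0) for t in ready}
        t = min(ready, key=est.get)
        ready.remove(t)
        p = min(range(n_procs), key=lambda i: max(proc_free[i], est[t]))
        finish[t] = max(proc_free[p], est[t]) + cost[t]
        proc_free[p] = finish[t]
        for s in succ[t]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return finish                          # task -> completion time

# Toy task graph: A feeds B and C, which both feed D.
deps = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
cost = {"A": 1, "B": 2, "C": 2, "D": 1}
print(list_schedule("ABCD", deps, cost, n_procs=2))   # D finishes at t = 4
```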
163

Developing models and algorithms to design a robust inland waterway transportation network under uncertainty

Nur, Farjana 07 August 2020 (has links)
This dissertation develops mathematical models to efficiently manage inland waterway port operations while minimizing the overall supply chain cost. In the first part, a capacitated, multi-commodity, multi-period mixed-integer linear programming model is proposed, capturing diverse properties of inland waterway transportation networks. We developed an accelerated Benders decomposition algorithm to solve this challenging NP-hard problem. The next study develops a two-stage stochastic mixed-integer nonlinear programming model to manage congestion in an inland waterway transportation network under stochastic commodity supply and water-level fluctuation scenarios. The model also jointly optimizes trip-wise towboat and barge assignment decisions and different supply chain decisions (e.g., inventory management, transportation decisions) in such a way that the overall system cost is minimized. We develop a parallelized hybrid decomposition algorithm, combining a constraint generation algorithm, Sample Average Approximation (SAA), and an enhanced variant of the L-shaped algorithm, to solve our proposed optimization model effectively and in a timely fashion. While the first two parts develop models from the supply chain network design viewpoint, the next two parts propose mathematical models that emphasize port and waterway transportation operations. Two two-stage stochastic mixed-integer linear programming (MILP) models are proposed under stochastic commodity supply and water-level fluctuation scenarios; the last one focuses specifically on modeling perishable inventories. To solve the third model, we develop a highly customized parallelized hybrid decomposition algorithm that combines SAA with an enhanced Progressive Hedging and Nested Decomposition algorithm. Similarly, to solve the last model, we propose a hybrid decomposition algorithm combining the enhanced Benders decomposition algorithm and SAA to solve large test instances of this complex, NP-hard problem. Both approaches are highly efficient in solving real-life test instances of the models to the desired quality within a reasonable time frame. All four developed models are validated on a real-life case study focusing on the inland waterway transportation network along the Mississippi River. A number of managerial insights are drawn for different key input parameters that impact port operations. These insights will help decision makers effectively and efficiently manage an inland waterway-based transportation network.
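The SAA component mentioned above replaces an expectation with an average over sampled scenarios. A toy single-variable sketch of that idea, far simpler than the dissertation's two-stage models (all names and parameters hypothetical):

```python
import random

def saa_capacity(n_scenarios, unit_cost=1.0, penalty=4.0, seed=0):
    """Choose a capacity x minimizing unit_cost*x plus the *sampled* average
    of penalty * unmet demand: the expectation in the true stochastic
    program is replaced by an average over drawn scenarios (SAA)."""
    rng = random.Random(seed)
    demands = [rng.gauss(100, 20) for _ in range(n_scenarios)]  # hypothetical scenarios
    def avg_cost(x):
        recourse = sum(penalty * max(d - x, 0.0) for d in demands) / n_scenarios
        return unit_cost * x + recourse
    return min(range(0, 201), key=avg_cost)   # crude 1-D search

print(saa_capacity(1000))   # approaches the 75th percentile of sampled demand
```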
164

Design and Analysis of Modular Architectures for an RNS to Mixed Radix Conversion Multi-processor

Shivashankar, Nithin 27 October 2014 (has links)
No description available.
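No abstract is available for this record, but the title refers to the classical residue number system (RNS) to mixed-radix conversion. A minimal sketch of the standard sequential algorithm that such a multi-processor architecture would parallelize (illustrative only; the thesis's architectures are not described here):

```python
def mixed_radix_digits(residues, moduli):
    """Sequential RNS -> mixed-radix conversion: peel off one digit per
    modulus by subtraction and multiplication with a modular inverse.
    Moduli must be pairwise coprime. (Requires Python 3.8+ for pow(b,-1,m).)"""
    digits = []
    for i, (x, mi) in enumerate(zip(residues, moduli)):
        a = x % mi
        for j in range(i):
            a = (a - digits[j]) * pow(moduli[j], -1, mi) % mi
        digits.append(a)
    return digits

def mixed_radix_value(digits, moduli):
    # X = a0 + a1*m0 + a2*m0*m1 + ...
    value, weight = 0, 1
    for d, m in zip(digits, moduli):
        value += d * weight
        weight *= m
    return value

m = [3, 5, 7]
x = 52
r = [x % mi for mi in m]                 # RNS representation of 52
d = mixed_radix_digits(r, m)
assert mixed_radix_value(d, m) == x      # digits [1, 2, 3] reconstruct 52
print(d)
```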
165

Symbolic Generation of Parallel Solvers for Unconstrained Optimization

Pavlin, Jessica L. 10 1900 (has links)
In this thesis we consider the need to generate efficient solvers for inverse imaging problems in a way that supports both quality and performance in software, as well as flexibility in the underlying mathematical models. Many problem domains involve large data sizes and rates, and changes in mathematical modelling are limited only by researcher ingenuity and driven by the value of the application. We use a problem in Magnetic Resonance Imaging to illustrate this situation, motivate the need for better software tools, and test the tools we develop. The problem is the determination of velocity profiles (think blood-flow patterns) using Phase Contrast Angiography. Despite the name, this method is completely noninvasive, not requiring the injection of contrast agents, but it is too time-consuming with present imaging and computing technology. Our approach is to separate the specification, the mathematical model, from the implementation details required for performance, using a custom language. The Domain Specific Language (DSL) provided to scientists allows for a complete abstraction from the highly optimized generated code. The mathematical DSL is converted to an internal representation we refer to as the Coconut Expression Library. Our expression library uses directed acyclic graphs as the underlying data structure, which lends itself nicely to our automatic simplification, differentiation and subexpression elimination. We show how parallelization and other optimizations are encoded as rules which are applied automatically, rather than as schemes that need to be implemented by the programmer in a low-level implementation. Finally, we present results, both numerical and in terms of computational performance. / Master of Science (MSc)
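The subexpression elimination noted above falls out naturally when an expression DAG is hash-consed. A minimal sketch of that data-structure idea (illustrative, not the Coconut Expression Library's actual API):

```python
class ExprDAG:
    """Tiny hash-consed expression DAG: structurally identical
    subexpressions are stored once, so common subexpression elimination
    comes for free at construction time."""
    def __init__(self):
        self.nodes = {}          # (op, children) -> node id
        self.defs = []           # node id -> (op, children)

    def make(self, op, *children):
        key = (op, children)
        if key not in self.nodes:        # hash-consing: reuse existing node
            self.nodes[key] = len(self.defs)
            self.defs.append(key)
        return self.nodes[key]

d = ExprDAG()
x = d.make("var", "x")
# (x*x) is written twice but built only once in the DAG
e = d.make("+", d.make("*", x, x), d.make("*", x, x))
print(len(d.defs))   # 3 nodes: x, x*x, and the sum -- not 4
```

Rewrite rules for simplification and parallelization can then be applied to this shared structure rather than to duplicated subtrees.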
166

On the Parallelization of a Search for Counterexamples to a Conjecture of Erdős

Shen, ShengWei 10 1900 (has links)
Denote by $k_t(G)$ the number of cliques of order $t$ in a graph $G$ having $n$ vertices. Let $k_t(n) = \min\{k_t(G)+k_t(\overline{G})\}$ where $\overline{G}$ denotes the complement of $G$. Let $c_t(n) = {k_t(n)}/{\tbinom{n}{t}}$ and let $c_t$ be the limit of $c_t(n)$ as $n$ goes to infinity. A 1962 conjecture of Erdős stating that $c_t = 2^{1-\tbinom{t}{2}}$ was disproved by Thomason in 1989 for all $t\geq 4$. Tighter counterexamples have been constructed by Jagger, Šťovíček and Thomason in 1996, by Thomason for $t\leq 6$ in 1997, and by Franek for $t=6$ in 2002. Further tightenings for $t=6,7$ and $8$ were recently obtained by Deza, Franek, and Liu. We investigate the computational framework used by Deza, Franek, and Liu. In particular, we present the benefits and limitations of different parallel computer memory architectures and parallel programming models. We propose a functional decomposition approach which is implemented in C++ with POSIX thread (Pthread) libraries for multi-threading. Computational benchmarking on the parallelized framework and a performance analysis, including a comparison with the original computational framework, are presented. / Master of Science (MSc)
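The quantity under study can be checked by brute force on small graphs, which makes clear why a parallelized search is needed for larger ones. A sketch (hypothetical helper names):

```python
from itertools import combinations
from math import comb

def clique_density(edges, n, t):
    """Brute-force (k_t(G) + k_t(complement of G)) / C(n, t): the density
    whose minimum over graphs tends, as n grows, to the c_t of the
    conjecture. Exponential in n, hence the parallelized search."""
    e = {frozenset(p) for p in edges}
    hits = 0
    for nodes in combinations(range(n), t):
        pairs = [frozenset(p) for p in combinations(nodes, 2)]
        if all(p in e for p in pairs):          # t-clique in G
            hits += 1
        elif all(p not in e for p in pairs):    # t-clique in the complement
            hits += 1
    return hits / comb(n, t)

# 5-cycle: neither it nor its complement (also a 5-cycle) has a triangle.
print(clique_density([(0,1),(1,2),(2,3),(3,4),(4,0)], n=5, t=3))   # 0.0
```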
167

GPU Parallelization of Astronomical Image Subtraction

Arneving, Gustav, Wilhelmsson, Hugo January 2024 (has links)
Astronomical image subtraction is a method for generating a difference image from two images which cover the same area but were taken at different times, in order to see changes over time. Because the images are taken at different times, one of them has to be convolved to match the atmospheric conditions of the other. HOTPANTS is open-source software used for astronomical image subtraction. The problem is that HOTPANTS is written in serial C and therefore does not scale well with growing image sizes. There have been previous efforts to parallelize HOTPANTS, including P-HOTPANTS and GBAISP; however, these projects are outdated or unavailable, respectively. The latest effort, BACH, is a reimplementation of HOTPANTS in C++ in which the convolution and subtraction steps are parallelized on a GPU using OpenCL. This thesis project is a continuation of BACH, called X-BACH, which aims to parallelize the remaining parts of the HOTPANTS algorithm using OpenCL. The results show that some parts of the HOTPANTS algorithm, excluding convolution and subtraction, are highly suitable for the GPU while other parts are not. It is believed that some parts which are not suitable for the GPU are highly suitable for CPU parallelization. Overall, running on an external GPU, X-BACH achieves a relative speed of 1 to 2 compared to BACH, and a relative speed of 0.8 to 2.5 compared to HOTPANTS. When running on an integrated GPU, X-BACH achieves a relative speed of 0.5 to 1.2 compared to BACH, and a relative speed of 0.3 to 2 compared to HOTPANTS. Some parts of the algorithm achieve a speedup of up to 10 times when parallelized on a GPU. In terms of accuracy, X-BACH generally obtains a maximum relative error of an order of magnitude ranging from 10⁻⁷ to 10⁻¹. However, on certain test images the algorithm has been observed to be unstable.
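The convolve-then-subtract core that BACH and X-BACH accelerate can be sketched in a few lines (illustrative NumPy/SciPy, not the OpenCL implementation; the kernel and image sizes are hypothetical):

```python
import numpy as np
from scipy.signal import fftconvolve

def difference_image(science, reference, kernel):
    """Core of astronomical image subtraction: convolve the reference to
    match the science image's atmospheric blur, then subtract."""
    matched = fftconvolve(reference, kernel, mode="same")
    return science - matched

rng = np.random.default_rng(0)
ref = rng.random((512, 512))
# hypothetical 5x5 Gaussian-like blur kernel, normalized to preserve flux
k = np.outer([1, 4, 6, 4, 1], [1, 4, 6, 4, 1]).astype(float)
k /= k.sum()
sci = fftconvolve(ref, k, mode="same") + 0.01 * rng.standard_normal((512, 512))
diff = difference_image(sci, ref, k)
print(float(np.abs(diff).max()))   # residual is just the injected noise
```

In HOTPANTS the kernel itself is fitted per image region, which is one of the remaining stages X-BACH moves to the GPU.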
168

Lattice QCD Optimization and Polytopic Representations of Distributed Memory

Kruse, Michael 26 September 2014 (has links)
Motivated by modern-day physics, which in addition to experiments also tries to verify and deduce laws of nature by simulating state-of-the-art physical models on massive computers, this thesis explores means of accelerating such simulations by improving the simulation programs they run. The primary focus is Lattice Quantum Chromodynamics (Lattice QCD), a branch of quantum field theory, running on IBM's newest supercomputer, the Blue Gene/Q. In a first approach, the source code of tmLQCD, a Lattice QCD program, is improved to run faster on the Blue Gene machine. Its most performance-relevant operation is an 8-point stencil in 4-dimensional space. Two different optimization strategies are pursued: one prioritizing spatial and temporal locality, and a second making use of the hardware's data-stream prefetcher. On Blue Gene/Q the first strategy reaches up to 20% of the machine's theoretical peak floating-point performance. The second strategy, with up to 54% of peak, is much faster at the cost of using 4 times more memory: it stores the data in the order it will be used by the next stencil operation, duplicating data where necessary. Other techniques exploited are direct programming of the messaging hardware (called MUSPI by IBM), a low-overhead work-distribution mechanism for threads, explicit prefetching of data (using the dcbt instruction) and manual vectorization (using QPX, IBM's width-4 SIMD instructions). Hardware-based list prefetching and transactional memory, both distinct and novel features of the Blue Gene/Q system, did not improve the program's performance. The second approach is the newly written LLVM compiler extension called Molly, which optimizes the program itself, specifically the distribution of data and work between the nodes of a cluster machine such as Blue Gene/Q. Molly represents arrays using integer polyhedra and builds on another existing compiler extension, Polly, which represents statements and loops using polyhedra. Once Molly knows how data is distributed among the nodes and where statements are executed, it adds code that manages the data flow between the nodes. Molly can also permute the order of data in memory. Molly's main task is to cluster data into sets that are sent to the same target in the same buffer, because small single transfers involve a massive overhead. We present an algorithm that minimizes the number of transfers for unparametrized loops using anti-chains of data flows. In addition, we implement a heuristic that takes into account how the programmer wrote the code. Asynchronous communication primitives are inserted right after the data becomes available and just before it is used, respectively. A runtime library implements these primitives using MPI. Molly manages to distribute any code that is representable by the polyhedral model, but does so best for stencil codes such as Lattice QCD. Compiled using Molly, the Lattice QCD stencil reaches 2.5% of theoretical peak performance. The performance gap is mostly because all the other optimizations, such as vectorization, are missing. Future versions of Molly may also handle non-stencil codes effectively and make use of all the optimizations that make the manually optimized Lattice QCD stencil so fast.
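The access pattern of the 8-point stencil central to both approaches can be sketched as follows (a scalar toy model; real Lattice QCD attaches spinor and gauge-link factors to every hop):

```python
import numpy as np

def stencil_8pt(phi):
    """Sum of the 8 nearest neighbours (+/-1 along each of 4 dimensions)
    on a periodic 4-D lattice: the memory-access pattern of the Lattice
    QCD hopping term."""
    out = np.zeros_like(phi)
    for axis in range(4):
        out += np.roll(phi, +1, axis=axis) + np.roll(phi, -1, axis=axis)
    return out

lattice = np.random.rand(8, 8, 8, 16)   # hypothetical 8^3 x 16 lattice
print(stencil_8pt(lattice).shape)       # (8, 8, 8, 16)
```

Each output site touches 8 neighbours scattered across 4 dimensions, which is exactly what makes locality optimization and node-to-node data distribution the performance-critical problems both halves of the thesis attack.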
169

Game Decomposition for General Game Playing

Hufschmitt, Aline 04 October 2018 (has links)
This PhD thesis presents a robust and general approach for the decomposition of games described in the Game Description Language (GDL). In the General Game Playing (GGP) framework, players can drastically decrease game-search cost if they hold a decomposed version of the game. Previous works on decomposition rely on syntactic structures and writing habits of GDL, or on the disjunctive normal form of the rules, which is very costly to compute. We offer a more general approach to decompose single-player or multi-player games. First, we consider an approach based on logical rule analysis. This requires the use of heuristics, which limits its robustness, and the costly calculation of the disjunctive normal form of the rules. A second, more efficient approach is based on information gathered during simulations of the game (playouts). The latter allows the detection of causal links between the actions and the fluents of a game. It can handle the different classes of compound games and can process some difficult cases such as synchronous parallel games with compound moves and serial games. We tested our program on 597 GGP games. Given 5k playouts, 70% of the games are decomposed in less than one minute. We demonstrate that for 87% of the games, 1k playouts are sufficient to obtain a correct decomposition. We also sketch an original approach to playing these decomposed games; preliminary tests on some one-player games are promising. Another contribution of this thesis is the evaluation of the MPPA architecture for the parallelization of a GGP player (LeJoueur by Jean-Noël Vittaut).
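The playout-based detection described above can be caricatured as: run random playouts, record which fluents each action changes, and group actions that share fluents. A simplified sketch (the thesis detects finer causal links; all names hypothetical):

```python
import random
from itertools import combinations

def decompose(actions, fluents, apply_action, n_playouts=200, length=20):
    """Group actions into candidate subgames: actions that ever change the
    same fluent during random playouts end up in the same component."""
    touches = {a: set() for a in actions}
    for _ in range(n_playouts):
        state = {f: False for f in fluents}
        for _ in range(length):
            a = random.choice(actions)
            new_state = apply_action(state, a)
            touches[a] |= {f for f in fluents if new_state[f] != state[f]}
            state = new_state
    parent = {a: a for a in actions}          # union-find over actions
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for a, b in combinations(actions, 2):
        if touches[a] & touches[b]:
            parent[find(a)] = find(b)
    groups = {}
    for a in actions:
        groups.setdefault(find(a), []).append(a)
    return list(groups.values())

# Toy: two independent light-switch games; each action toggles its own fluent.
def toggle(state, a):
    s = dict(state)
    s[a] = not s[a]
    return s

print(decompose(["left", "right"], ["left", "right"], toggle))
# -> [['left'], ['right']]: two independent subgames
```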
170

Node-Depth Representation in FPGA for Evolutionary Algorithms Applied to Large-Scale Network Design Problems

Gois, Marcilyanne Moreira 26 October 2011 (has links)
Many real-world problems are related to network design, such as the design of electric power circuits, vehicle routing, telecommunication network planning and phylogeny reconstruction. In general, these problems can be modelled using graphs that handle thousands or millions of nodes (the input variables), making it hard to obtain solutions in real time. Network design is the combinatorial problem of finding the most suitable network subject to an evaluation criterion such as lowest cost, shortest path or traversal time. The solution of such problems is in general computationally complex, and metaheuristics such as evolutionary algorithms have been widely investigated for them. Several studies have shown that the performance of evolutionary algorithms on network design problems can be significantly increased through more appropriate dynamic data structures (encodings). This work investigates the parallelization of the Node-Depth Encoding (NDE) in hardware in order to find better solutions for network design problems. To implement the proposed hardware architecture, called HP-NDE (Hardware-Parallelized NDE), FPGA technology was used to exploit the high degree of parallelism that such a platform can provide. The experimental results show that the HP-NDE can generate and evaluate new networks in average time bounded by a constant (O(1)).
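The node-depth encoding stores a tree as its preorder list of (node, depth) pairs, which turns subtree operations into contiguous-slice operations, the property a parallel hardware implementation can exploit. A software sketch (illustrative, not the FPGA design):

```python
def preorder_node_depth(tree, root):
    """Build the node-depth encoding: the tree's nodes in preorder, each
    paired with its depth."""
    encoding, stack = [], [(root, 0)]
    while stack:
        node, depth = stack.pop()
        encoding.append((node, depth))
        for child in reversed(tree.get(node, [])):
            stack.append((child, depth + 1))
    return encoding

def subtree(encoding, i):
    """All nodes under encoding[i]: the run of following entries whose
    depth stays greater than encoding[i]'s depth."""
    d = encoding[i][1]
    j = i + 1
    while j < len(encoding) and encoding[j][1] > d:
        j += 1
    return encoding[i:j]

t = {"a": ["b", "e"], "b": ["c", "d"]}
enc = preorder_node_depth(t, "a")
print(enc)              # [('a', 0), ('b', 1), ('c', 2), ('d', 2), ('e', 1)]
print(subtree(enc, 1))  # [('b', 1), ('c', 2), ('d', 2)]
```

Because a subtree is one contiguous run, operators that prune and reattach subtrees (the workhorses of evolutionary network design) map naturally onto parallel hardware scans.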
