181

Evaluation of the Configurable Architecture REPLICA with Emulated Shared Memory

Alnervik, Erik January 2014 (has links)
REPLICA is a family of novel scalable chip multiprocessors with a configurable emulated shared memory architecture, whose computation model is based on the PRAM (Parallel Random Access Machine) model. The purpose of this thesis is to evaluate how REPLICA is positioned among existing architectures, both in performance and in programming effort, by benchmarking different types of computation problems on REPLICA, on similar parallel architectures (SB-PRAM and XMT), and on more diverse ones (Xeon X5660 and Tesla M2050). It also examines whether REPLICA is especially suited to particular kinds of computational problems. By using some of the well-known Berkeley dwarfs, and input from unbiased sources such as The University of Florida Sparse Matrix Collection and the Rodinia benchmark suite, we make sure that the benchmarks measure relevant computation problems. We show that today's parallel architectures have performance issues for applications with irregular memory access patterns, which the REPLICA architecture can solve. For example, REPLICA only needs to be clocked at a few MHz to match both the Xeon X5660 and the Tesla M2050 on the irregular-memory-access benchmark breadth-first search. By comparing the efficiency of REPLICA to a CPU (Xeon X5660), we show that it is easier to program REPLICA efficiently than today's multiprocessors.
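The breadth-first search benchmark mentioned above is a good illustration of irregular memory access. The following minimal Python sketch (a generic textbook BFS, not the thesis's benchmark code) shows the data-dependent, pointer-chasing reads that make such workloads hard for cache-based memory hierarchies and that an emulated shared memory machine like REPLICA is intended to handle better.

```python
from collections import deque

def bfs_levels(adjacency, source):
    """Level-synchronous BFS over an adjacency list.

    The neighbour lookups adjacency[v] depend on values discovered at
    run time, so the memory access pattern is irregular: the kind of
    behaviour the BFS benchmark in the abstract stresses.
    """
    level = {source: 0}
    frontier = deque([source])
    while frontier:
        v = frontier.popleft()
        for w in adjacency[v]:          # data-dependent, cache-unfriendly reads
            if w not in level:
                level[w] = level[v] + 1
                frontier.append(w)
    return level

# Tiny undirected example graph, purely illustrative.
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(bfs_levels(graph, 0))   # {0: 0, 1: 1, 2: 1, 3: 2}
```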
182

Development of a Parallel Computational Framework to Solve Flow and Transport in Integrated Surface-Subsurface Hydrologic Systems

Hwang, Hyoun-Tae January 2012 (has links)
HydroGeoSphere (HGS) is a 3D control-volume finite element hydrologic model describing fully-integrated surface-subsurface water flow and solute and thermal energy transport. Because the model solves tightly-coupled, highly-nonlinear partial differential equations, often applied at regional and continental scales (for example, to analyze the impact of climate change on water resources), high performance computing (HPC) is essential. The parallelization targets the assembly of the Jacobian matrix for the iterative linearization method and the sparse-matrix solver, a preconditioned BiCGSTAB. The Jacobian matrix assembly is parallelized using a static scheduling scheme that takes into account the data race conditions that may occur during matrix construction. The parallelization of the solver is achieved by partitioning the domain into equal-size sub-domains, together with an efficient reordering scheme. The computational flow of the BiCGSTAB solver is also modified to reduce the parallelization overhead and to suit parallel architectures. The parallelized model is tested on several benchmark cases that include linear and nonlinear problems involving various domain sizes and degrees of hydrologic complexity. The performance is evaluated in terms of computational robustness and efficiency, using standard scaling performance measures. Simulation profiling results indicate that the efficiency becomes higher in three situations: 1) with an increasing number of nodes/elements in the mesh, because the work per processor grows with the number of nodes, which reduces the relative portion of parallel overhead in the total computing time; 2) for increasingly nonlinear transient simulations, because the nonlinearity makes the coefficient matrix more diagonally dominant; and 3) with domains of irregular geometry, which increase the condition number. These characteristics are promising for the large-scale analysis of water resource problems that involve integrated surface-subsurface flow regimes. Large-scale real-world simulations illustrate the importance of node reordering, which is associated with the process of domain partitioning. With node reordering, super-linear parallel speedup was obtained when compared to a serial simulation performed with natural node ordering. The results indicate that the number of iterations increases as the number of threads increases, due to the increased number of elements in the off-diagonal blocks of the coefficient matrix. In terms of the privatization scheme, the parallel efficiency with privatization was higher than that with the shared scheme for most of the simulations performed.
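For context on the solver named above, the sketch below shows a textbook, unpreconditioned BiCGSTAB iteration in Python/NumPy. It only illustrates the structure of the algorithm, not the HGS implementation, which additionally applies preconditioning, domain partitioning and node reordering; the matrix-vector products marked in the comments are the operations that dominate the cost and are the natural targets for parallelization.

```python
import numpy as np

def bicgstab(A, b, x0=None, tol=1e-10, max_iter=1000):
    """Unpreconditioned BiCGSTAB (van der Vorst) for A x = b."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x
    r_hat = r.copy()                      # shadow residual
    rho_prev = alpha = omega = 1.0
    v = p = np.zeros_like(b)
    for k in range(max_iter):
        rho = r_hat @ r
        beta = (rho / rho_prev) * (alpha / omega)
        p = r + beta * (p - omega * v)
        v = A @ p                         # parallelizable mat-vec
        alpha = rho / (r_hat @ v)
        s = r - alpha * v
        if np.linalg.norm(s) < tol:
            return x + alpha * p, k
        t = A @ s                         # parallelizable mat-vec
        omega = (t @ s) / (t @ t)
        x = x + alpha * p + omega * s
        r = s - omega * t
        if np.linalg.norm(r) < tol:
            return x, k
        rho_prev = rho
    return x, max_iter

# Small symmetric test system, purely illustrative.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x, iters = bicgstab(A, b)
print(x, iters)
```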
183

Numerical Simulation Of Laminar Reacting Flows

Tarhan, Tanil 01 September 2004 (has links) (PDF)
Novel sequential and parallel computational fluid dynamics (CFD) codes based on the method of lines (MOL) approach were developed for the numerical simulation of multi-component reacting flows using detailed transport and thermodynamic models. Both codes were applied to the prediction of a confined axisymmetric laminar co-flowing methane-air diffusion flame for which experimental data were available in the literature. A flame-sheet model for infinite-rate chemistry and one-, two-, five- and ten-step reduced finite-rate reaction mechanisms were employed for the methane-air combustion sub-model. A second-order high-resolution total variation diminishing (TVD) scheme based on Lagrange interpolation polynomials was proposed in order to alleviate the spurious oscillations encountered in the time evolution of flame propagation. Steady-state velocity, temperature and species profiles obtained by using infinite- and finite-rate chemistry models were validated against experimental data and other numerical solutions, and were found to be in reasonably good agreement with measurements and numerical results. The proposed difference scheme produced accurate results without the spurious oscillations and numerical diffusion encountered in classical schemes, and hence was found to be a successful scheme applicable to strongly convective flow problems with non-uniform grid resolution. The code was also found to be an efficient tool for the prediction and understanding of transient combustion systems. This study constitutes the initial steps in the development of an efficient numerical scheme for direct numerical simulation (DNS) of unsteady, turbulent, multi-dimensional combustion with complex chemistry.
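As a hedged illustration of the method-of-lines/TVD idea referenced above, the Python sketch below advects a square pulse with a minmod-limited MUSCL reconstruction and explicit Euler time stepping. The thesis's scheme is different (second-order, based on Lagrange interpolation polynomials, applied to reacting flow), so this only conveys how a limited reconstruction suppresses spurious oscillations.

```python
import numpy as np

def minmod(a, b):
    """Minmod slope limiter: keeps the reconstruction total-variation diminishing."""
    return np.where(a * b > 0.0, np.sign(a) * np.minimum(np.abs(a), np.abs(b)), 0.0)

def advect_mol(u0, c, dx, dt, steps):
    """Method of lines for u_t + c u_x = 0 (c > 0): MUSCL reconstruction,
    upwind fluxes, explicit Euler in time, periodic boundaries."""
    u = u0.copy()
    for _ in range(steps):
        slope = minmod(np.roll(u, -1) - u, u - np.roll(u, 1)) / dx
        u_face = u + 0.5 * dx * slope           # value at the right face of each cell
        flux = c * u_face                       # upwind flux through right faces
        dudt = -(flux - np.roll(flux, 1)) / dx  # semi-discrete (MOL) right-hand side
        u = u + dt * dudt
    return u

# Advect a square pulse once around a periodic domain, purely illustrative.
n, c = 200, 1.0
x = np.linspace(0.0, 1.0, n, endpoint=False)
dx = x[1] - x[0]
u0 = np.where((x > 0.4) & (x < 0.6), 1.0, 0.0)
dt = 0.4 * dx / c
u = advect_mol(u0, c, dx, dt, steps=int(1.0 / (c * dt)))
print(float(u.max()), float(u.min()))  # remains within [0, 1]: no spurious over/undershoots
```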
184

Numerical Simulation Of Radiating Flows

Karaismail, Ertan 01 August 2005 (has links) (PDF)
Predictive accuracy of the previously developed coupled code for the solution of the time-dependent Navier-Stokes equations in conjunction with the radiative transfer equation was first assessed by applying it to the prediction of thermally radiating, hydrodynamically developed laminar pipe flow for which a numerical solution had been reported in the literature. The effect of radiation on the flow and temperature fields was demonstrated for different values of the conduction-to-radiation ratio. It was found that the steady-state temperature predictions of the code agree well with the benchmark solution. In an attempt to test the predictive accuracy of the coupled code for turbulent radiating flows, it was applied to fully developed turbulent flow of a hot gas through a relatively cold pipe, and the results were compared with the numerical solution available in the literature. The code was found to reproduce the reported steady-state temperature profiles well. Having validated the predictive accuracy of the coupled code for steady, laminar/turbulent, radiating pipe flows, the performance of the code for transient radiating flows was tested by applying it to a test problem involving laminar/turbulent flow of carbon dioxide through a circular pipe for the simulation of simultaneous hydrodynamic and thermal development. The transient solutions for the temperature, velocity and radiative energy source term fields were found to demonstrate the physically expected trends. In order to improve the performance of the code, a parallel version of the code was developed and tested against the sequential code for speed-up and efficiency. It was found that the same results are obtained with reasonably high speed-up and efficiency.
185

Capsules: expressing composable computations in a parallel programming model

Mandviwala, Hasnain A. 01 October 2008 (has links)
A well-known problem in designing high-level parallel programming models and languages is the "granularity problem": the execution of parallel tasks that are too fine-grain incurs large overheads in the parallel runtime and adversely affects the speed-up that can be achieved by parallel execution. On the other hand, tasks that are too coarse-grain create load imbalance and do not adequately utilize the parallel machine. In this work we address the issue of granularity with a concept of expressing "composable computations" within a parallel programming model called "Capsules". In Capsules, we provide a unifying framework that allows composition and adjustment of granularity for both data and computation over iteration space and computation space. The Capsules model not only allows the user to express the decision on granularity of execution, but also the decision on the granularity of garbage collection (and therefore the aggressiveness of the GC optimization), and other features that may be supported by the programming model. We argue that this adaptability of execution granularity leads to efficient parallel execution by matching the available application concurrency to the available hardware concurrency, thereby reducing parallelization overhead. By matching, we refer to creating coarse-grain Computation Capsules that encompass multiple instances of fine-grain computation. In effect, creating coarse-grain computations reduces overhead simply by reducing the number of parallel computations. Reducing the number of parallel computation instances in turn leads to: (1) reduced synchronization cost, such as that required to access and search in shared data structures; (2) reduced distribution and scheduling cost for parallel computation instances; and (3) reduced book-keeping costs for maintaining data structures such as blocked lists for unfulfilled data requests. Capsules builds on our prior work, TStreams, a data-flow oriented parallel programming framework. Our results on a CMP/SMP machine, using real vision applications such as the Cascade Face Detector and Stereo Vision Depth applications as well as other synthetic applications, show benefits in application performance. We use profiling to help determine the optimal coarse-grain serial execution granularity, and provide empirical evidence that adjusting execution granularity reduces parallelization overhead to yield maximum application performance.
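To make the granularity trade-off concrete, here is a small hypothetical Python sketch (not the Capsules or TStreams API) that batches many fine-grain work items into coarser tasks before handing them to a thread pool. This is the same basic mechanism for amortizing scheduling and synchronization overhead that the abstract describes: fewer, larger tasks mean fewer scheduler interactions, at the risk of load imbalance if the chunks grow too large.

```python
from concurrent.futures import ThreadPoolExecutor

def fine_grain_work(item):
    """A single fine-grain computation instance (illustrative)."""
    return item * item

def run_with_granularity(items, chunk_size, workers=4):
    """Group fine-grain items into coarse-grain tasks of `chunk_size` each.

    Larger chunks mean fewer scheduled tasks and less per-task overhead,
    but chunks that are too large can leave workers idle (load imbalance).
    """
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # one scheduled task per chunk, not per item
        results = pool.map(lambda chunk: [fine_grain_work(x) for x in chunk], chunks)
    return [y for chunk_result in results for y in chunk_result]

if __name__ == "__main__":
    data = list(range(10_000))
    out = run_with_granularity(data, chunk_size=500)
    print(len(out), out[:3])   # 10000 [0, 1, 4]
```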
186

Multiprocessor scheduling in the presence of link contention delays

Macey, Benjamin January 2004 (has links)
[Truncated abstract] Parallel computing is recognised today as an important tool in the solution of a wide variety of computationally intensive problems, problems which were previously considered intractable. While it offers the promise of vastly increased performance, parallel computing introduces additional complexities which are not encountered with sequential processing. One of these is the scheduling problem, in which the individual tasks comprising a parallel program are scheduled onto the processors comprising the parallel architecture. The objective is to minimise execution time while still preserving the precedence relations between the tasks. Scheduling is of vital importance since a poor task schedule can undo any potential gains from the parallelism present in the application. Inappropriate scheduling can result in the hardware being used inefficiently, or worse, the program could run slower in parallel than on a single processor. The scheduling problem is one of the more difficult problems facing the parallel programmer. In fact, it is NP-complete in the general case. As a result, a large number of heuristic methods with sub-optimal performance but polynomial, rather than exponential, time complexity have been proposed. In order to simplify their algorithms, researchers have restricted the problem: by making assumptions concerning the parallel architecture or imposing limitations on the task graph representing the parallel program. The evolution of the task scheduling problem has involved the gradual relaxation of these restrictions. A major change occurred when the assumption of zero inter-processor communication costs was removed. This was driven by the increasing popularity of distributed-memory message-passing multiprocessors.
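Many of the heuristics alluded to above are list-scheduling algorithms. The following Python sketch is a generic greedy list scheduler with link communication delays (an illustration only, not one of the algorithms studied in the thesis): each task, taken in topological order, is placed on the processor that allows the earliest start, charging a communication delay whenever a predecessor ran on a different processor.

```python
def list_schedule(tasks, deps, cost, comm, n_procs):
    """Greedy list scheduling of a task DAG with communication delays.

    tasks  -- task ids in a valid topological order
    deps   -- dict: task -> list of predecessor tasks
    cost   -- dict: task -> computation time
    comm   -- dict: (pred, task) -> delay if the two run on different processors
    """
    proc_free = [0.0] * n_procs          # time each processor becomes free
    placed = {}                          # task -> (processor, start, finish)
    for t in tasks:
        best = None
        for p in range(n_procs):
            ready = proc_free[p]
            for d in deps.get(t, []):
                dp, _, dfinish = placed[d]
                arrival = dfinish if dp == p else dfinish + comm.get((d, t), 0.0)
                ready = max(ready, arrival)
            if best is None or ready < best[1]:
                best = (p, ready)
        p, start = best
        placed[t] = (p, start, start + cost[t])
        proc_free[p] = start + cost[t]
    makespan = max(finish for _, _, finish in placed.values())
    return placed, makespan

# Tiny fork-join graph, purely illustrative.
tasks = ["a", "b", "c", "d"]
deps = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
cost = {"a": 2, "b": 3, "c": 3, "d": 1}
comm = {("a", "b"): 1, ("a", "c"): 1, ("b", "d"): 2, ("c", "d"): 2}
print(list_schedule(tasks, deps, cost, comm, n_procs=2))
```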
187

Parallelization of the Aho-Corasick algorithm using CUDA technology

Δημόπουλος, Παναγιώτης 24 October 2012 (has links)
This thesis presents a study of the performance of pattern matching algorithms when they are suitably modified to exploit the hardware architecture of graphics cards. To that end, the thesis first presents the search problem, in order to make clear why improving the performance of existing algorithms is imperative. The main pattern matching algorithms in use today are then presented, and the choice of one of them, the Aho-Corasick algorithm, to be modified to exploit the particular architecture of a graphics card is motivated. Finally, conclusions are drawn about the performance offered by the new software implementation of the algorithm, compared to the plain implementation, for different input sizes. In summary: the Aho-Corasick algorithm is converted to execute on an Nvidia graphics card using CUDA technology, and the speed of the parallel and the classic versions of the algorithm is compared.
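For reference, a compact sequential Aho-Corasick automaton in Python is sketched below (the textbook construction, not the thesis's CUDA code). A typical GPU parallelization keeps the automaton read-only in device memory and lets many threads scan overlapping chunks of the input text concurrently.

```python
from collections import deque

def build_automaton(patterns):
    """Build Aho-Corasick goto/fail/output tables for a set of patterns."""
    goto = [{}]              # goto[state][char] -> next state
    fail = [0]
    output = [set()]
    for pat in patterns:     # phase 1: trie of all patterns
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(pat)
    queue = deque(goto[0].values())   # phase 2: failure links, BFS over the trie
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] |= output[fail[nxt]]
    return goto, fail, output

def search(text, patterns):
    """Yield (end_index, pattern) for every pattern occurrence in text."""
    goto, fail, output = build_automaton(patterns)
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in output[state]:
            yield (i, pat)

print(sorted(search("ushers", ["he", "she", "his", "hers"])))
# [(3, 'he'), (3, 'she'), (5, 'hers')]
```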
188

Analyzing the memory behavior of parallel scientific applications

Beniamine, David 05 December 2016 (has links)
For a few decades now, processor vendors have built increasingly parallel computers in order to reduce energy consumption. At the same time, the gap between processor and memory frequencies has increased significantly. To mitigate this gap, processors embed a complex hierarchy of caches. Writing efficient code for such computers is a complex task; therefore, performance analysis has become an important step in the development of applications seeking performance. Most existing performance analysis tools focus on the point of view of the processor. These tools see the main memory as a monolithic entity and thus are not able to understand how it is accessed. However, memory is a common bottleneck in high-performance computing, and the pattern of memory accesses can impact performance significantly. A few tools exist to analyze memory performance, but they are based on coarse-grain sampling. Consequently, they focus on a small part of the execution and miss the global memory behavior. Furthermore, coarse-grain sampling cannot collect memory access patterns. In this thesis we propose two different tools to analyze the memory behavior of an application. The first tool is designed specifically for Non-Uniform Memory Access (NUMA) machines and provides visualizations of the global sharing pattern of each data structure among the threads. The second one collects fine-grain memory traces with temporal information. These traces can be visualized either with a generic trace management framework or through programmatic exploration using R. Furthermore, we evaluate both tools, comparing them with state-of-the-art memory analysis tools in terms of performance, precision and completeness.
189

Uma plataforma para manuseio, análise e avaliação de dados de observação e controle de simulações

Sidou, Pedro Niederhagebock January 2017 (has links)
Since parallel computing technologies are improving very fast, computer clusters are increasingly employed for highly demanding computational tasks. Nevertheless, for a program to run on a cluster it needs to be written in a parallel way, which is not necessarily a simple task. The aim of this work is therefore to develop a computational platform for manipulating, analysing and evaluating climate observational data and controlling simulations in a parallel manner. Features can be added to the platform through plugins, making it flexible and extensible to a very wide range of tasks; a public release under the GNU General Public License (GNU GPL), with open source code, is planned for the near future. A simple application is proposed to demonstrate the platform's potential uses: an exploratory study of data from the Twentieth Century Reanalysis version 2 (20CRv2) project concerning the South Atlantic Convergence Zone (SACZ), a very important phenomenon in the southern hemisphere.
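To illustrate the plugin-based extensibility described above, here is a minimal hypothetical Python sketch (names and API invented for illustration, not the platform's actual interface) in which analysis features register themselves with the platform and are dispatched by name.

```python
class Platform:
    """Tiny plugin registry: features register a name and a callable."""

    def __init__(self):
        self._plugins = {}

    def register(self, name):
        def decorator(func):
            self._plugins[name] = func
            return func
        return decorator

    def run(self, name, *args, **kwargs):
        if name not in self._plugins:
            raise KeyError(f"no plugin named {name!r}")
        return self._plugins[name](*args, **kwargs)

platform = Platform()

@platform.register("seasonal_mean")
def seasonal_mean(values, season_length=90):
    """Example analysis plugin: mean over fixed-length seasons."""
    return [sum(values[i:i + season_length]) / len(values[i:i + season_length])
            for i in range(0, len(values), season_length)]

print(platform.run("seasonal_mean", list(range(360)), season_length=90))
# [44.5, 134.5, 224.5, 314.5]
```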
190

Trajectory Sensitivity Based Power System Dynamic Security Assessment

January 2012 (has links)
Contemporary methods for dynamic security assessment (DSA) mainly rely on time domain simulations to explore the influence of large disturbances in a power system. These methods are computationally intensive especially when the system operating point changes continually. The trajectory sensitivity method, when implemented and utilized as a complement to the existing DSA time domain simulation routine, can provide valuable insights into the system variation in response to system parameter changes. The implementation of the trajectory sensitivity analysis is based on an open source power system analysis toolbox called PSAT. Eight categories of sensitivity elements have been implemented and tested. The accuracy assessment of the implementation demonstrates the validity of both the theory and the implementation. The computational burden introduced by the additional sensitivity equations is relieved by two innovative methods: one is by employing a cluster to perform the sensitivity calculations in parallel; the other one is by developing a modified very dishonest Newton method in conjunction with the latest sparse matrix processing technology. The relation between the linear approximation accuracy and the perturbation size is also studied numerically. It is found that there is a fixed connection between the linear approximation accuracy and the perturbation size. Therefore this finding can serve as a general application guide to evaluate the accuracy of the linear approximation. The applicability of the trajectory sensitivity approach to a large realistic network has been demonstrated in detail. This research work applies the trajectory sensitivity analysis method to the Western Electricity Coordinating Council (WECC) system. Several typical power system dynamic security problems, including the transient angle stability problem, the voltage stability problem considering load modeling uncertainty and the transient stability constrained interface real power flow limit calculation, have been addressed. Besides, a method based on the trajectory sensitivity approach and the model predictive control has been developed for determination of under frequency load shedding strategy for real time stability assessment. These applications have shown the great efficacy and accuracy of the trajectory sensitivity method in handling these traditional power system stability problems. / Dissertation/Thesis / Ph.D. Electrical Engineering 2012
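For readers unfamiliar with the technique named above, trajectory sensitivities are obtained by differentiating the dynamic model with respect to a parameter and integrating the resulting linear system along the nominal trajectory. A simplified ODE-only form is sketched below (the thesis works with the full differential-algebraic equations of the power system).

```latex
\[
\dot{x} = f(x,\lambda), \qquad x(t_0) = x_0(\lambda),
\]
\[
\frac{d}{dt}\, x_\lambda
  = \frac{\partial f}{\partial x}\, x_\lambda + \frac{\partial f}{\partial \lambda},
\qquad
x_\lambda(t_0) = \frac{\partial x_0}{\partial \lambda},
\qquad
x_\lambda \equiv \frac{\partial x}{\partial \lambda},
\]
\[
x(t;\,\lambda_0 + \Delta\lambda) \;\approx\; x(t;\,\lambda_0) + x_\lambda(t)\,\Delta\lambda .
\]
```

The last relation is the first-order (linear) approximation whose accuracy, as a function of the perturbation size, the abstract reports studying numerically.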
