151

Software for parallel symbolic computing

Užpalis, Evaldas 15 July 2009 (has links)
There are two approaches to solving mathematical problems: numerical and symbolic. Symbolic methods manipulate symbolic objects such as logical or algebraic formulas, rules, or programs. In contrast to numerical methods, the main purpose of symbolic computation is the simplification of mathematical expressions. In most cases the final answer is a rational number or a formula, so symbolic computation can be used to (1) find the exact solution of a mathematical problem and (2) simplify a mathematical model. A single computer is sufficient to simplify small mathematical expressions, but there are expressions whose simplification exceeds the memory or processing power of one machine; in such cases the best solution is parallel computation on a computer cluster. The main problem in parallel computation is the efficiency of the data-distribution algorithm. This work presents experimental studies of one distribution algorithm and of several of its modifications.
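The distinction between exact symbolic results and numerical ones, and the idea of distributing simplification work across workers, can be illustrated with a minimal sketch. It assumes Python with SymPy and a local process pool standing in for a cluster; this is not the software studied in the thesis, and the naive static split below is far simpler than the distribution algorithm it evaluates.

```python
# Illustrative only: SymPy stands in for a symbolic engine, a process pool
# stands in for cluster-level work distribution.
from multiprocessing import Pool

import sympy as sp

x = sp.symbols("x")

# Symbolic computation returns exact results (rationals, formulas),
# unlike floating-point numerical solvers.
exact_roots = sp.solve(x**2 - sp.Rational(1, 4), x)  # [-1/2, 1/2]

# A batch of expressions to simplify; in practice such sub-expressions would
# come from splitting one large expression among the cluster nodes.
exprs = [sp.sin(x)**2 + sp.cos(x)**2, (x**2 - 1) / (x - 1), sp.expand((x + 1)**3)]

def simplify_one(expr):
    return sp.simplify(expr)

if __name__ == "__main__":
    with Pool(processes=3) as pool:          # naive static distribution of sub-expressions
        simplified = pool.map(simplify_one, exprs)
    print(exact_roots, simplified)
```

How the sub-expressions are assigned to workers is exactly the distribution question the thesis studies experimentally; a static split like the one above is only the simplest possible baseline.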
152

Scalable data-management systems for Big Data

Tran, Viet-Trung 21 January 2013 (has links) (PDF)
Big Data can be characterized by three V's: Big Volume refers to the unprecedented growth in the amount of data; Big Velocity refers to the growth in the speed at which data moves in and out of management systems; Big Variety refers to the growth in the number of different data formats. Managing Big Data requires fundamental changes in the architecture of data-management systems. Data storage must keep evolving to adapt to the growth of data: it needs to be scalable while maintaining high performance for data accesses. This thesis focuses on building scalable data-management systems for Big Data. Our first and second contributions address the challenge of providing efficient support for Big Volume in data-intensive high-performance computing (HPC) environments. In particular, we address the shortcoming of existing approaches in handling atomic, non-contiguous I/O operations in a scalable fashion. We propose and implement a versioning-based mechanism that can be leveraged to offer isolation for non-contiguous I/O without the need for expensive synchronizations. In the context of parallel array processing in HPC, we introduce Pyramid, a large-scale, array-oriented storage system. It revisits the physical organization of data in distributed storage systems for scalable performance: Pyramid favors multidimensional-aware data chunking that closely matches the access patterns generated by applications, together with distributed metadata management and versioning-based concurrency control to eliminate synchronization among concurrent accesses. Our third contribution addresses Big Volume at the scale of geographically distributed environments. We consider BlobSeer, a distributed versioning-oriented data-management service, and propose BlobSeer-WAN, an extension of BlobSeer optimized for such environments. BlobSeer-WAN takes the latency hierarchy into account by favoring local metadata accesses, and features asynchronous metadata replication and a vector-clock implementation for collision resolution. To cope with the Big Velocity characteristic of Big Data, our last contribution features DStore, an in-memory document-oriented store that scales vertically by leveraging the large memory capacity of multicore machines. DStore demonstrates fast, atomic processing of complex write transactions while maintaining high-throughput read access. DStore follows a single-threaded execution model, executing update transactions sequentially while relying on versioning-based concurrency control to support a large number of simultaneous readers.
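The versioning idea that recurs in these contributions (non-contiguous I/O isolation, Pyramid, DStore) can be sketched in a few lines: writers never modify data in place but publish new immutable versions, so readers proceed without locks against a consistent snapshot. The class below is a toy illustration of that concurrency-control pattern only, not the BlobSeer or DStore API.

```python
import threading

class VersionedStore:
    """Toy versioned store: writes publish immutable snapshots, reads never block.

    Illustrative of versioning-based concurrency control; not the thesis code.
    """

    def __init__(self):
        self._versions = [{}]                 # list of immutable snapshots
        self._write_lock = threading.Lock()   # single writer, as in DStore's model

    def read(self, key, version=None):
        snap = self._versions[-1 if version is None else version]
        return snap.get(key)                  # lock-free read of a consistent snapshot

    def write(self, key, value):
        with self._write_lock:                # updates are serialized ...
            new = dict(self._versions[-1])
            new[key] = value
            self._versions.append(new)        # ... and published as a new version
        return len(self._versions) - 1        # version number now visible to readers
```

Readers that started before a write simply keep using the older snapshot, which is the property the thesis exploits to avoid expensive synchronization.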
153

Design, development and implementation of a parallel algorithm for computed tomography using algebraic reconstruction technique

Melvin, Cameron 05 October 2007 (has links)
This project implements a parallel algorithm for Computed Tomography based on the Algebraic Reconstruction Technique (ART) algorithm. This technique for reconstructing pictures from projections is useful for applications such as Computed Tomography (CT or CAT). The algorithm requires fewer views, and hence less radiation, to produce an image of comparable or better quality. However, the approach is not widely used because of its computationally intensive nature in comparison with rival technologies. A faster ART algorithm could reduce the amount of radiation needed for CT imaging by producing a better image with fewer projections. A reconstruction from projections version of the ART algorithm for two dimensions was implemented in parallel using the Message Passing Interface (MPI) and OpenMP extensions for C. The message passing implementation did not result in faster reconstructions due to prohibitively long and variant communication latency. The shared memory implementation produced positive results, showing a clear computational advantage for multiple processors and measured efficiency ranging from 60-95%. Consistent with the literature, image quality proved to be significantly better compared to the industry standard Filtered Backprojection algorithm especially when reconstructing from fewer projection angles.
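The update at the heart of ART is the Kaczmarz projection step: each measured ray pulls the current image estimate toward the hyperplane defined by that ray's row of the system matrix. The following serial NumPy sketch shows that step; it is illustrative only and does not reproduce the thesis's MPI/OpenMP parallelization.

```python
import numpy as np

def art_reconstruct(A, b, n_iters=10, lam=0.5):
    """Serial ART (Kaczmarz) sketch.

    A: projection matrix, one row per ray (ray/pixel intersection weights).
    b: measured projection values.
    lam: relaxation factor controlling the step size.
    """
    x = np.zeros(A.shape[1])
    row_norms = np.einsum("ij,ij->i", A, A)           # squared norm of each row
    for _ in range(n_iters):
        for i in range(A.shape[0]):                   # sweep over all rays
            if row_norms[i] == 0:
                continue
            residual = b[i] - A[i] @ x
            x += lam * residual / row_norms[i] * A[i] # project onto hyperplane of ray i
    return x
```

Because each update depends on the previous one, parallelizing these sweeps (by rows, blocks, or image regions) is where the MPI and OpenMP designs discussed in the abstract diverge.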
155

Evaluation of the Configurable Architecture REPLICA with Emulated Shared Memory

Alnervik, Erik January 2014 (has links)
REPLICA is a family of novel scalable chip multiprocessors with a configurable emulated shared-memory architecture, whose computation model is based on the PRAM (Parallel Random Access Machine) model. The purpose of this thesis is to evaluate how REPLICA is positioned among existing architectures, both in performance and in programming effort, by benchmarking different types of computational problems on REPLICA, on similar parallel architectures (SB-PRAM and XMT), and on more conventional ones (Xeon X5660 and Tesla M2050). It also examines whether REPLICA is particularly well suited to any special kinds of computational problems. By using some of the well-known Berkeley dwarfs, together with input from independent sources such as the University of Florida Sparse Matrix Collection and the Rodinia benchmark suite, we make sure that the benchmarks measure relevant computational problems. We show that today's parallel architectures have performance issues for applications with irregular memory access patterns, which the REPLICA architecture can solve. For example, REPLICA only needs to be clocked at a few MHz to match both the Xeon X5660 and the Tesla M2050 on the irregular-memory-access benchmark breadth-first search. By comparing the efficiency of REPLICA with a CPU (Xeon X5660), we show that it is easier to program REPLICA efficiently than today's multiprocessors.
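Breadth-first search is the irregular-memory benchmark cited above. A minimal level-synchronous sketch (Python, illustrative; not the REPLICA or baseline benchmark code) shows where the irregularity comes from: each frontier expansion reads neighbor lists scattered across memory in a data-dependent order.

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS over an adjacency-list graph.

    adj: dict node -> list of neighbors. The neighbor lists are visited in a
    data-dependent order, which makes the memory access pattern irregular and
    cache-unfriendly on conventional architectures.
    """
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:                # a parallel version splits the frontier per thread
            for v in adj.get(u, []):      # scattered, irregular reads
                if v not in level:
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level
```

On an emulated shared-memory machine such as REPLICA these scattered reads are hidden by the memory system, which is why the benchmark favors it relative to cache-based CPUs and GPUs.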
156

Numerical Simulation Of Laminar Reacting Flows

Tarhan, Tanil 01 September 2004 (has links) (PDF)
Novel sequential and parallel computational fluid dynamics (CFD) codes based on the method of lines (MOL) approach were developed for the numerical simulation of multi-component reacting flows using detailed transport and thermodynamic models. Both codes were applied to the prediction of a confined axisymmetric laminar co-flowing methane-air diffusion flame for which experimental data were available in the literature. A flame-sheet model for infinite-rate chemistry and one-, two-, five-, and ten-step reduced finite-rate reaction mechanisms were employed for the methane-air combustion sub-model. A second-order high-resolution total variation diminishing (TVD) scheme based on Lagrange interpolation polynomials was proposed in order to alleviate the spurious oscillations encountered in the time evolution of flame propagation. Steady-state velocity, temperature and species profiles obtained using the infinite- and finite-rate chemistry models were validated against experimental data and other numerical solutions, and were found to be in reasonably good agreement with measurements and numerical results. The proposed difference scheme produced accurate results without the spurious oscillations and numerical diffusion encountered in classical schemes, and was therefore found to be a successful scheme applicable to strongly convective flow problems with non-uniform grid resolution. The code was also found to be an efficient tool for the prediction and understanding of transient combustion systems. This study constitutes the initial steps in the development of an efficient numerical scheme for direct numerical simulation (DNS) of unsteady, turbulent, multi-dimensional combustion with complex chemistry.
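The method-of-lines idea (discretize in space, then integrate the resulting ODE system in time) combined with a TVD limiter can be sketched for a scalar advection equation. This is a textbook minmod/MUSCL example in NumPy, written only to illustrate the structure; the thesis uses a second-order TVD scheme based on Lagrange interpolation applied to the reacting-flow equations, which is not what is shown here.

```python
import numpy as np

def minmod(a, b):
    # classic TVD slope limiter: zero at extrema, smallest slope elsewhere
    return 0.5 * (np.sign(a) + np.sign(b)) * np.minimum(np.abs(a), np.abs(b))

def rhs(u, a, dx):
    """Method of lines: spatial discretization of u_t + a u_x = 0 (a > 0, periodic)."""
    back = u - np.roll(u, 1)                # u_i - u_{i-1}
    fwd = np.roll(u, -1) - u                # u_{i+1} - u_i
    slope = minmod(back, fwd)               # limited slope keeps the scheme TVD
    u_face = u + 0.5 * slope                # reconstructed left state at face i+1/2
    flux = a * u_face                       # upwind flux for a > 0
    return -(flux - np.roll(flux, 1)) / dx  # -(F_{i+1/2} - F_{i-1/2}) / dx

def advance(u0, a, dx, dt, steps):
    """Time integration of the semi-discrete ODE system (Heun / RK2 here)."""
    u = u0.copy()
    for _ in range(steps):
        k1 = rhs(u, a, dx)
        k2 = rhs(u + dt * k1, a, dx)
        u = u + 0.5 * dt * (k1 + k2)
    return u
```

Without the limiter (i.e., with an unlimited second-order reconstruction) the solution develops exactly the spurious oscillations near steep fronts that the abstract refers to; the limiter suppresses them at the cost of some local smearing.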
157

Numerical Simulation Of Radiating Flows

Karaismail, Ertan 01 August 2005 (has links) (PDF)
The predictive accuracy of the previously developed coupled code for the solution of the time-dependent Navier-Stokes equations in conjunction with the radiative transfer equation was first assessed by applying it to the prediction of a thermally radiating, hydrodynamically developed laminar pipe flow for which a numerical solution had been reported in the literature. The effect of radiation on the flow and temperature fields was demonstrated for different values of the conduction-to-radiation ratio. The steady-state temperature predictions of the code were found to agree well with the benchmark solution. In an attempt to test the predictive accuracy of the coupled code for turbulent radiating flows, it was applied to the fully developed turbulent flow of a hot gas through a relatively cold pipe, and the results were compared with the numerical solution available in the literature. The code was found to reproduce the reported steady-state temperature profiles well. Having validated the predictive accuracy of the coupled code for steady, laminar/turbulent radiating pipe flows, its performance for transient radiating flows was tested by applying it to a test problem involving laminar/turbulent flow of carbon dioxide through a circular pipe, simulating simultaneous hydrodynamic and thermal development. The transient solutions for the temperature, velocity and radiative energy source-term fields were found to demonstrate the physically expected trends. In order to improve the performance of the code, a parallel version was developed and tested against the sequential code for speed-up and efficiency. It was found that the same results are obtained with reasonably high speed-up and efficiency.
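The structure of such a coupled solution, and the speed-up and efficiency metrics used to compare the parallel and sequential versions, can be sketched as follows. The callables solve_flow and solve_rte are hypothetical placeholders for the two solvers, named here only for illustration; the actual coupling in the thesis code may differ.

```python
def simulate_radiating_flow(state, dt, n_steps, solve_flow, solve_rte):
    """Sketch of a coupled flow/radiation time-stepping loop (not the thesis code).

    solve_flow: hypothetical callable advancing the Navier-Stokes/energy equations
                one step, given the current radiative source term (-div q_r).
    solve_rte:  hypothetical callable solving the radiative transfer equation for
                the updated temperature field and returning the radiative source.
    """
    radiative_source = 0.0
    for _ in range(n_steps):
        state = solve_flow(state, dt, radiative_source)  # flow + energy with radiation source
        radiative_source = solve_rte(state)              # radiation update from the new field
    return state

def speedup_and_efficiency(t_sequential, t_parallel, n_processors):
    """Standard metrics for evaluating the parallel version against the sequential one."""
    speedup = t_sequential / t_parallel
    return speedup, speedup / n_processors
```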
158

Multiprocessor scheduling in the presence of link contention delays

Macey, Benjamin January 2004 (has links)
[Truncated abstract] Parallel computing is recognised today as an important tool in the solution of a wide variety of computationally intensive problems, problems which were previously considered intractable. While it offers the promise of vastly increased performance, parallel computing introduces additional complexities which are not encountered with sequential processing. One of these is the scheduling problem, in which the individual tasks comprising a parallel program are scheduled onto the processors comprising the parallel architecture. The objective is to minimise execution time while still preserving the precedence relations between the tasks. Scheduling is of vital importance since a poor task schedule can undo any potential gains from the parallelism present in the application. Inappropriate scheduling can result in the hardware being used inefficiently, or worse, the program could run slower in parallel than on a single processor. The scheduling problem is one of the more difficult problems facing the parallel programmer. In fact, it is NP-complete in the general case. As a result, a large number of heuristic methods with sub-optimal performance but polynomial, rather than exponential, time complexity have been proposed. In order to simplify their algorithms, researchers have restricted the problem: by making assumptions concerning the parallel architecture or imposing limitations on the task graph representing the parallel program. The evolution of the task scheduling problem has involved the gradual relaxation of these restrictions. A major change occurred when the assumption of zero inter-processor communication costs was removed. This was driven by the increasing popularity of distributed-memory message-passing multiprocessors.
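As a rough illustration of the heuristic class discussed here (and not an algorithm from the thesis), the sketch below performs greedy list scheduling of a task DAG onto processors, charging a communication delay whenever a predecessor's result must cross processors; link contention itself, the subject of the thesis, is deliberately not modelled.

```python
def list_schedule(tasks, deps, weights, comm, n_procs):
    """Greedy list scheduling with inter-processor communication delays.

    tasks:   task ids in topological order
    deps:    dict task -> list of predecessor tasks
    weights: dict task -> computation cost
    comm:    dict (pred, task) -> communication cost if placed on different processors
    Returns dict task -> (processor, start_time, finish_time).
    """
    proc_free = [0.0] * n_procs            # time at which each processor becomes free
    placed = {}
    for t in tasks:
        best = None
        for p in range(n_procs):
            # earliest time all predecessor data is available on processor p
            ready = max(
                (placed[d][2] + (0.0 if placed[d][0] == p else comm.get((d, t), 0.0))
                 for d in deps.get(t, [])),
                default=0.0,
            )
            start = max(ready, proc_free[p])
            if best is None or start < best[1]:
                best = (p, start)
        p, start = best
        finish = start + weights[t]
        placed[t] = (p, start, finish)
        proc_free[p] = finish
    return placed

# Example: a small fork-join graph of four tasks on two processors.
tasks = ["a", "b", "c", "d"]
deps = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
weights = {"a": 2, "b": 3, "c": 3, "d": 1}
comm = {("a", "b"): 1, ("a", "c"): 1, ("b", "d"): 2, ("c", "d"): 2}
print(list_schedule(tasks, deps, weights, comm, n_procs=2))
```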
159

Parallelization of the Aho-Corasick algorithm using CUDA technology

Δημόπουλος, Παναγιώτης 24 October 2012 (has links)
This thesis studies the performance of pattern-matching algorithms when they are suitably modified to exploit the hardware architecture of graphics cards. To that end, it first presents the search problem, making clear why improving the performance of existing algorithms is imperative. It then surveys the main pattern-matching algorithms in use today and explains why one of them is selected for modification so that it can exploit the particular architecture of a graphics card. Finally, conclusions are drawn about the performance of this new implementation of the algorithm compared with the plain implementation, for different input sizes. In short, the Aho-Corasick algorithm is adapted to run on an Nvidia graphics card using CUDA technology, and the speed of the parallel version is compared with that of the classic version.
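A compact reference implementation of the Aho-Corasick automaton is shown below (Python, for illustration; the thesis ports the algorithm to CUDA rather than using this code). GPU versions typically build the same goto/fail/output tables on the host, copy them to device memory, and let each thread scan its own chunk of the text, with each chunk extended by (longest pattern length - 1) characters of overlap so that matches spanning chunk boundaries are not lost.

```python
from collections import deque

def build_automaton(patterns):
    """Build the goto, fail, and output tables of the Aho-Corasick automaton."""
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({}); fail.append(0); out.append(set())
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].add(pat)
    queue = deque(goto[0].values())           # depth-1 nodes fail to the root
    while queue:
        node = queue.popleft()
        for ch, nxt in goto[node].items():
            queue.append(nxt)
            f = fail[node]
            while f and ch not in goto[f]:    # follow fail links until a match or root
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]        # inherit outputs reachable via fail links
    return goto, fail, out

def search(text, goto, fail, out):
    """Scan text once, reporting (start_index, pattern) for every match."""
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))
    return hits

tables = build_automaton(["he", "she", "his", "hers"])
print(search("ushers", *tables))   # (1, 'she'), (2, 'he'), (2, 'hers'), in some order
```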
160

Analyzing the memory behavior of parallel scientific applications

Beniamine, David 05 December 2016 (has links)
For several decades, in order to reduce energy consumption, processor vendors have built increasingly parallel computers. At the same time, the frequency gap between processors and memory has grown significantly. To mitigate this gap, modern processors embed a complex hierarchy of caches. Writing efficient code for such machines is a complex task, and performance analysis has therefore become an important step in the development of performance-critical applications. Most existing performance-analysis tools focus on the processor's point of view: they see main memory as a monolithic entity and thus cannot understand how it is accessed. Yet memory is a common bottleneck in high-performance computing, and memory access patterns can significantly impact performance. A few tools analyze memory performance, but they rely on coarse-grained sampling; consequently they capture only a small part of the execution and miss the global memory behavior, and coarse-grained sampling cannot collect memory access patterns. In this thesis we propose two tools to analyze the memory behavior of an application. The first is designed specifically for Non-Uniform Memory Access (NUMA) machines and provides visualizations of the global sharing pattern of each data structure among the threads. The second collects fine-grained memory traces with temporal information; these traces can be visualized either with a generic trace-management framework or through programmatic exploration in R. We evaluate both tools by comparing them with state-of-the-art memory-analysis tools in terms of performance, precision, and completeness.
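The kind of aggregation such a tool performs can be sketched: given a fine-grained trace of (thread, address) accesses, group addresses by page and count per-thread accesses, which already exposes whether a data structure is private to one thread or shared between several. This is illustrative only and is not the thesis tools, which instrument real applications rather than consuming a pre-collected list.

```python
from collections import defaultdict

PAGE_SIZE = 4096  # a common page size; real tools work at configurable granularity

def sharing_pattern(trace):
    """Aggregate a fine-grained memory trace into per-page, per-thread access counts.

    trace: iterable of (thread_id, virtual_address) tuples. The result maps each
    page to {thread_id: access_count}, enough to see which threads touch which
    pages -- the kind of sharing view a NUMA-oriented memory profiler provides.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for thread_id, addr in trace:
        counts[addr // PAGE_SIZE][thread_id] += 1
    return counts

# Example: two threads touching the same page hints at sharing / remote accesses.
example = [(0, 0x1000), (0, 0x1008), (1, 0x1010), (1, 0x2000)]
print({hex(page * PAGE_SIZE): dict(t) for page, t in sharing_pattern(example).items()})
```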
