Global ETD Search

431	Two Optimization Problems in Genetics : Multi-dimensional QTL Analysis and Haplotype Inference Nettelblad, Carl January 2012 (has links) The existence of new technologies, implemented in efficient platforms and workflows has made massive genotyping available to all fields of biology and medicine. Genetic analyses are no longer dominated by experimental work in laboratories, but rather the interpretation of the resulting data. When billions of data points representing thousands of individuals are available, efficient computational tools are required. The focus of this thesis is on developing models, methods and implementations for such tools. The first theme of the thesis is multi-dimensional scans for quantitative trait loci (QTL) in experimental crosses. By mating individuals from different lines, it is possible to gather data that can be used to pinpoint the genetic variation that influences specific traits to specific genome loci. However, it is natural to expect multiple genes influencing a single trait to interact. The thesis discusses model structure and model selection, giving new insight regarding under what conditions orthogonal models can be devised. The thesis also presents a new optimization method for efficiently and accurately locating QTL, and performing the permuted data searches needed for significance testing. This method has been implemented in a software package that can seamlessly perform the searches on grid computing infrastructures. The other theme in the thesis is the development of adapted optimization schemes for using hidden Markov models in tracing allele inheritance pathways, and specifically inferring haplotypes. The advances presented form the basis for more accurate and non-biased line origin probabilities in experimental crosses, especially multi-generational ones. 
We show that the new tools are able to reconstruct haplotypes and even genotypes in founder individuals and offspring alike, based on only unordered offspring genotypes. The tools can also handle larger populations than competing methods, resolving inheritance pathways and phase in much larger and more complex populations. Finally, the methods presented are also applicable to datasets where individual relationships are not known, which is frequently the case in human genetics studies. One immediate application for this would be improved accuracy for imputation of SNP markers within genome-wide association studies (GWAS). / eSSENCE quantitative trait loci genome-wide association studies hidden Markov models numerical optimization linkage analysis haplotype inference genotype imputation high performance computing
432	Desenvolvimento de um simulador para espectrometria por fluorescência de raios X usando computação distribuída / Development of a X-ray fluorescence spectrometry simulator using distributed computing Marcio Henrique dos Santos 30 March 2012 (has links) Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro / A Física das Radiações é um ramo da Física que está presente em diversas áreas de estudo e se relaciona ao conceito de espectrometria. Dentre as inúmeras técnicas espectrométricas existentes, destaca-se a espectrometria por fluorescência de raios X. Esta também possui uma gama de variações da qual pode-se dar ênfase a um determinado subconjunto de técnicas. A produção de fluorescência de raios X permite (em certos casos) a análise das propriedades físico-químicas de uma amostra específica, possibilitando a determinação de sua constituiçõa química e abrindo um leque de aplicações. Porém, o estudo experimental pode exigir uma grande carga de trabalho, tanto em termos do aparato físico quanto em relação conhecimento técnico. Assim, a técnica de simulação entra em cena como um caminho viável, entre a teoria e a experimentação. Através do método de Monte Carlo, que se utiliza da manipulação de números aleatórios, a simulação se mostra como uma espécie de alternativa ao trabalho experimental.Ela desenvolve este papel por meio de um processo de modelagem, dentro de um ambiente seguro e livre de riscos. E ainda pode contar com a computação de alto desempenho, de forma a otimizar todo o trabalho por meio da arquitetura distribuída. O objetivo central deste trabalho é a elaboração de um simulador computacional para análise e estudo de sistemas de fluorescência de raios X desenvolvido numa plataforma de computação distribuída de forma nativa com o intuito de gerar dados otimizados. 
Como resultados deste trabalho, mostra-se a viabilidade da construção do simulador através da linguagem CHARM++, uma linguagem baseada em C++ que incorpora rotinas para processamento distribuído, o valor da metodologia para a modelagem de sistemas e a aplicação desta na construção de um simulador para espectrometria por fluorescência de raios X. O simulador foi construído com a capacidade de reproduzir uma fonte de radiação eletromagnética, amostras complexas e um conjunto de detectores. A modelagem dos detectores incorpora a capacidade de geração de imagens baseadas nas contagens registradas. Para validação do simulador, comparou-se os resultados espectrométricos com os resultados gerados por outro simulador já validado: o MCNP. / Radiation Physics is a branch of Physics that is present in various studying areas and relates to the concept of spectrometry. Among the numerous existing spectrometry techniques, there is the X-ray fluorescence spectrometry. It also has a range of variations which can emphasize a particular subset of techniques. The production of X-ray fluorescence enables (in some cases) the analysis of physical and chemical properties of a given sample, allowing the determination of its chemical constitution and also a range of applications. However, the experimental analysis may require a large workload, both in terms of physical apparatus and in relation to technical knowledge. Thus, the simulation comes into play as a viable path between theory and experiment. Through the Monte Carlo method, which uses the manipulation of random numbers, the simulation is a kind of alternative to the experimental analysis. It develops this role by a modeling process, within a secure environment and risk free. And it can count on high performance computing in order to optimize all the work through the distributed architecture. 
The aim of this work is to develop a computational simulator for the analysis and study of X-ray fluorescence systems, built natively on a distributed computing platform in order to generate optimized data. The results demonstrate the viability of implementing the simulator in CHARM++, a C++-based language that incorporates routines for distributed processing, the value of the methodology for system modelling, and its application to building a simulator for X-ray fluorescence spectrometry. The simulator can reproduce an electromagnetic radiation source, complex samples and a set of detectors. The detector model can generate images based on the recorded counts. To validate the simulator, its spectrometric results were compared with those of another, previously validated simulator: MCNP. Simulação de Monte Carlo Computação de alto desempenho X-ray fluorescence spectrometry FISICA DA MATERIA CONDENSADA
433	Ordonnancement de E/S transversal : des applications à des dispositifs / Transversal I/O Scheduling : from Applications to Devices / Escalonamento de E/S Transversal para Sistemas de Arquivos Paralelos : das Aplicações aos Dispositivos Zanon Boito, Francieli 30 March 2015 (has links) Ordonnancement d’E/S Transversal pour les Systèmes de Fichiers Parallèles : des Applications aux Dispositifs. Cette thèse porte sur l’utilisation de l’ordonnancement d’Entrées/Sorties (E/S) pour atténuer les effets d’interférence et améliorer la performance d’E/S des systèmes de fichiers parallèles. Il est commun pour les plates-formes de calcul haute performance (HPC) de fournir une infrastructure de stockage partagée pour les applications qui y sont hébergées. Dans cette situation, où plusieurs applications accèdent simultanément au système de fichiers parallèle partagé, leurs accès vont souffrir de l’interférence, ce qui compromet l’efficacité des stratégies d’optimisation d’E/S. Nous avons évalué la performance de cinq algorithmes d’ordonnancement dans les serveurs de données d’un système de fichiers parallèle. Ces tests ont été exécutés sur différentes plates-formes et sous différents modèles d’accès. Les résultats indiquent que la performance des ordonnanceurs est affectée par les modèles d’accès des applications, car il est important pour améliorer la performance obtenue grâce à un algorithme d’ordonnancement de surpasser ses surcoûts. En même temps, les résultats des ordonnanceurs sont affectés par les caractéristiques du système d’E/S sous-jacent - en particulier par des dispositifs de stockage. Différents dispositifs présentent des niveaux de sensibilité à la séquentialité et la taille des accès distincts, ce qui peut influencer sur le niveau d’amélioration obtenue grâce à l’ordonnancement d’E/S. Pour ces raisons, l’objectif principal de cette thèse est de proposer un modèle d’ordonnancement d’E/S avec une double adaptabilité : aux applications et aux dispositifs. 
Nous avons extrait des informations sur les modèles d’accès des applications en utilisant des fichiers de trace, obtenus à partir de leurs exécutions précédentes. Ensuite, nous avons utilisé de l’apprentissage automatique pour construire un classificateur capable d’identifier la spatialité et la taille des accès à partir du flux de demandes antérieures. En outre, nous avons proposé une approche pour obtenir efficacement le ratio de débit séquentiel et aléatoire pour les dispositifs de stockage en exécutant des benchmarks pour un sous-ensemble des paramètres et en estimant les restants avec des régressions linéaires. Nous avons utilisé les informations sur les caractéristiques des applications et des dispositifs de stockage pour décider automatiquement l’algorithme d’ordonnancement le plus approprié en utilisant des arbres de décision. Notre approche améliore les performances jusqu’à 75% par rapport à une approche qui utilise le même algorithme d’ordonnancement dans toutes les situations, sans capacité d’adaptation. De plus, notre approche améliore la performance dans 64% de scénarios en plus, et diminue les performances dans 89% moins de situations. Nos résultats montrent que les deux aspects - des applications et des dispositifs - sont essentiels pour faire des bons choix d’ordonnancement. En outre, malgré le fait qu’il n’y a pas d’algorithme d’ordonnancement qui fournit des gains de performance pour toutes les situations, nous montrons que avec la double adaptabilité il est possible d’appliquer des techniques d’ordonnancement d’E/S pour améliorer la performance, tout en évitant les situations où cela conduirait à une diminution de performance. / This thesis focuses on I/O scheduling as a tool to improve I/O performance on parallel file systems by alleviating interference effects. It is usual for High Performance Computing (HPC) systems to provide a shared storage infrastructure for applications. 
In this situation, when multiple applications are concurrently accessing the shared parallel file system, their accesses will affect each other, compromising the efficacy of I/O optimization techniques. We have conducted an extensive performance evaluation of five scheduling algorithms at a parallel file system’s data servers. Experiments were executed on different platforms and under different access patterns. Results indicate that schedulers’ results are affected by applications’ access patterns, since the performance improvement obtained through a scheduling algorithm must surpass its overhead. At the same time, schedulers’ results are affected by the characteristics of the underlying I/O system, especially by storage devices. Different devices present different levels of sensitivity to the sequentiality and size of accesses, which affects how much performance is improved through I/O scheduling. For these reasons, the main objective of this thesis is to provide I/O scheduling with double adaptivity: to applications and to devices. We obtain information about applications’ access patterns through trace files from previous executions. We have applied machine learning to build a classifier capable of identifying the spatiality and request size aspects of access patterns from streams of previous requests. Furthermore, we propose an approach to efficiently obtain the sequential-to-random throughput ratio for storage devices by running benchmarks for a subset of the parameters and estimating the remaining ones through linear regressions. We use this information on the characteristics of applications and storage devices to decide the best-fitting scheduling algorithm through a decision tree. Our approach improves performance by up to 75% over an approach that uses the same scheduling algorithm in all situations, without adaptability. Moreover, our approach improves performance in up to 64% more situations, and decreases performance in up to 89% fewer situations. 
Our results show that both aspects - applications and storage devices - are essential for making good scheduling choices. Moreover, despite the fact that there is no scheduling algorithm able to provide performance gains for all situations, we show that through double adaptivity it is possible to apply I/O scheduling techniques to improve performance, avoiding situations where it would lead to performance impairment. / Esta tese se concentra no escalonamento de operações de entrada e saída (E/S) como uma solução para melhorar o desempenho de sistemas de arquivos paralelos, aliviando os efeitos da interferência. É usual que sistemas de computação de alto desempenho (HPC) ofereçam uma infraestrutura compartilhada de armazenamento para as aplicações. Nessa situação, em que múltiplas aplicações acessam o sistema de arquivos compartilhado de forma concorrente, os acessos das aplicações causarão interferência uns nos outros, comprometendo a eficácia de técnicas para otimização de E/S. Uma avaliação extensiva de desempenho foi conduzida, abordando cinco algoritmos de escalonamento trabalhando nos servidores de dados de um sistema de arquivos paralelo. Foram executados experimentos em diferentes plataformas e sob diferentes padrões de acesso. Os resultados indicam que os resultados obtidos pelos escalonadores são afetados pelo padrão de acesso das aplicações, já que é importante que o ganho de desempenho provido por um algoritmo de escalonamento ultrapasse o seu sobrecusto. Ao mesmo tempo, os resultados do escalonamento são afetados pelas características do subsistema local de E/S - especialmente pelos dispositivos de armazenamento. Dispositivos diferentes apresentam variados níveis de sensibilidade à sequencialidade dos acessos e ao seu tamanho, afetando o quanto técnicas de escalonamento de E/S são capazes de aumentar o desempenho. Por esses motivos, o principal objetivo desta tese é prover escalonamento de E/S com dupla adaptabilidade: às aplicações e aos dispositivos. 
Informações sobre o padrão de acesso das aplicações são obtidas através de arquivos de rastro, vindos de execuções anteriores. Aprendizado de máquina foi aplicado para construir um classificador capaz de identificar os aspectos espacialidade e tamanho de requisição dos padrões de acesso através de fluxos de requisições anteriores. Além disso, foi proposta uma técnica para obter eficientemente a razão entre acessos sequenciais e aleatórios para dispositivos de armazenamento, executando testes para apenas um subconjunto dos parâmetros e estimando os demais através de regressões lineares. Essas informações sobre características de aplicações e dispositivos de armazenamento são usadas para decidir a melhor escolha em algoritmo de escalonamento através de uma árvore de decisão. A abordagem proposta aumenta o desempenho em até 75% sobre uma abordagem que usa o mesmo algoritmo para todas as situações, sem adaptabilidade. Além disso, essa técnica melhora o desempenho para até 64% mais situações, e causa perdas de desempenho em até 89% menos situações. Os resultados obtidos evidenciam que ambos aspectos - aplicações e dispositivos de armazenamento - são essenciais para boas decisões de escalonamento. Adicionalmente, apesar do fato de não haver algoritmo de escalonamento capaz de prover ganhos de desempenho para todas as situações, esse trabalho mostra que através da dupla adaptabilidade é possível aplicar técnicas de escalonamento de E/S para melhorar o desempenho, evitando situações em que essas técnicas prejudicariam o desempenho. Ordonnancement d’E/S Systèmes de Fichiers Parallèles Calcul Haute Performance I/O Scheduling Parallel File Systems High Performance Computing Escalonamento de E/S Sistemas de Arquivos Paralelos Computação de Alto Desempenho. 004
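The abstract of record 433 describes benchmarking a storage device for only a subset of parameters and estimating the rest through linear regression. A minimal sketch of that idea follows, assuming a single parameter (request size) and a ratio that varies linearly with it. The function names and the three measurements are hypothetical illustration values, not data from the thesis.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical benchmark results (made up for this sketch):
# request size in KB -> sequential-to-random throughput ratio.
measured = {32: 9.0, 64: 8.0, 256: 2.0}

slope, intercept = fit_line(list(measured), list(measured.values()))


def estimate_ratio(request_size_kb):
    """Estimate the ratio for a request size that was not benchmarked."""
    return slope * request_size_kb + intercept


estimated_128 = estimate_ratio(128)
```

Only three request sizes are "benchmarked" here; the value for 128 KB comes from the fitted line instead of a fourth benchmark run, which is the time saving the abstract refers to.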
434	Programmation des architectures hétérogènes à l'aide de tâches divisibles ou modulables / Programmation of heterogeneous architectures using moldable tasks Cojean, Terry 26 March 2018 (has links) Les ordinateurs équipés d'accélérateurs sont omniprésents parmi les machines de calcul haute performance. Cette évolution a entraîné des efforts de recherche pour concevoir des outils permettant de programmer facilement des applications capables d'utiliser toutes les unités de calcul de ces machines. Le support d'exécution StarPU développé dans l'équipe STORM de INRIA Bordeaux, a été conçu pour servir de cible à des compilateurs de langages parallèles et des bibliothèques spécialisées (algèbre linéaire, développements de Fourier, etc.). Pour proposer la portabilité des codes et des performances aux applications, StarPU ordonnance des graphes dynamiques de tâches de manière efficace sur l’ensemble des ressources hétérogènes de la machine. L’un des aspects les plus difficiles, lors du découpage d’une application en graphe de tâches, est de choisir la granularité de ce découpage, qui va typiquement de pair avec la taille des blocs utilisés pour partitionner les données du problème. Les granularités trop petites ne permettent pas d’exploiter efficacement les accélérateurs de type GPU, qui ont besoin de peu de tâches possédant un parallélisme interne de données massif pour « tourner à plein régime ». À l’inverse, les processeurs traditionnels exhibent souvent des performances optimales à des granularités beaucoup plus fines. Le choix du grain d’un tâche dépend non seulement du type de l'unité de calcul sur lequel elle s’exécutera, mais il a en outre une influence sur la quantité de parallélisme disponible dans le système : trop de petites tâches risque d’inonder le système en introduisant un surcoût inutile, alors que peu de grosses tâches risque d’aboutir à un déficit de parallélisme. 
Actuellement, la plupart des approches pour solutionner ce problème dépendent de l'utilisation d'une granularité des tâches intermédiaire qui ne permet pas un usage optimal des ressources aussi bien du processeur que des accélérateurs. L'objectif de cette thèse est d'appréhender ce problème de granularité en agrégeant des ressources afin de ne plus considérer de nombreuses ressources séparées mais quelques grosses ressources collaborant à l'exécution de la même tâche. Un modèle théorique existe depuis plusieurs dizaines d'années pour représenter ce procédé : les tâches parallèles. Le travail de cette thèse consiste alors en l'utilisation pratique de ce modèle via l'implantation de mécanismes de gestion de tâches parallèles dans StarPU et l'implantation ainsi que l'évaluation d'ordonnanceurs de tâches parallèles de la littérature. La validation du modèle se fait dans le cadre de l'amélioration de la programmation et de l'optimisation de l'exécution d'applications numériques au dessus de machines de calcul modernes. / Hybrid computing platforms equipped with accelerators are now commonplace in high performance computing. Because of this evolution, researchers have concentrated their efforts on designing tools that ease the programming of applications able to use all the computing units of such machines. The StarPU runtime system developed in the STORM team at INRIA Bordeaux was designed to be a target for parallel language compilers and specialized libraries (linear algebra, Fourier transforms, ...). To provide portability of code and performance to applications, StarPU schedules dynamic task graphs efficiently on all the heterogeneous computing units of the machine. One of the most difficult aspects of expressing an application as a task graph is choosing the granularity of the tasks, which typically goes hand in hand with the size of the blocks used to partition the problem's data. 
Granularities that are too small do not allow efficient use of accelerators such as GPUs, which need a small number of tasks with massive inner data parallelism in order to reach peak performance. Conversely, conventional processors typically perform best with a large number of tasks of much finer granularity. The choice of task granularity not only depends on the type of computing unit on which the task will run, but also influences the amount of parallelism available in the system: too many small tasks may flood the runtime system with needless overhead, whereas too few large tasks may create a parallelism deficiency. Currently, most approaches rely on a compromise task granularity, which makes optimal use of neither the CPU nor the accelerator resources. The objective of this thesis is to address this granularity problem by aggregating resources, so that they are viewed not as many small resources but as fewer, larger ones collaborating on the execution of the same task. A theoretical machine and scheduling model representing this process has existed for several decades: parallel tasks. The main contributions of this thesis are to make practical use of this model by implementing a parallel task mechanism inside StarPU, and to implement and evaluate parallel task schedulers from the literature. The model is validated by improving the programming and optimizing the execution of numerical applications on top of modern computing machines. Calcul Haute Performance Supports d'exécution Algèbre linéaire appliquée High Performance Computing Runtime systems Parallel tasks programming Applied linear algebra
435	Simulation 3D d'une décharge couronne pointe-plan, dans l'air : calcul haute performance, algorithmes de résolution de l'équation de Poisson et analyses physiques / 3D simulation of a pine to plane corona discharge in dry air : High performance computing, Poisson equation solvers and Physics Plewa, Joseph-Marie 13 October 2017 (has links) Cette thèse porte sur la simulation tridimensionnelle (3D) des décharges couronnes à l'aide du calcul haute performance. Lorsqu'on applique une impulsion de haute tension entre une pointe et un plan, les lignes de champ électrique fortement resserrées autour de la pointe induisent la propagation simultanée de plusieurs streamers et la formation d'une décharge couronne de structure arborescente. Dans ces conditions, seule une simulation électro-hydrodynamique 3D est apte à reproduire cette structure et fournir les ordres de grandeur de l'énergie déposée et de la concentration des espèces créées durant la phase de décharge. Cependant, cette simulation 3D est très consommatrice en temps et mémoire de calcul et n'est désormais accessible que grâce à l'accroissement permanent de la puissance des ordinateurs dédié au calcul haute performance. Dans le cadre d'une simulation électro-hydrodynamique 3D, une attention particulière doit être prise concernant l'efficacité des solveurs à résoudre les équations elliptiques 3D car leur contribution en termes de temps de calcul peut dépasser 80% du temps global de la simulation. Ainsi, une partie de manuscrit est consacrée aux tests de performances de méthodes de résolution d'équations elliptiques directes ou itératives telle que SOR R&B, BiCGSTAB et MUMPS, en utilisant le calcul massivement parallèle et les librairies MPI. Les calculs sont réalisés sur le supercalculateur EOS du réseau CALMIP, avec un nombre de cœurs de calcul allant jusqu'à 1800, et un nombre de mailles atteignant 8003 (soit plus 1/2 Milliard de mailles). 
Les tests de performances sont réalisés en statique sur le calcul du potentiel géométrique et en dynamique en propageant une densité de charge d'espace analytique caractéristique des streamers. Pour réaliser une simulation complète 3D de la décharge il faut également intégrer au programme un algorithme capable de résoudre les équations de transport de particule chargée à fort gradients de densité caractéristiques aux streamers. Dans ce manuscrit, l'algorithme MUSCL est testé dans différentes conditions de propagation d'un cube de densité (à vitesse homogène ou non homogène spatialement) afin d'optimiser le transport des densités d'espèces chargées impliquées. Le code 3D, conçu pour résoudre le modèle électro- hydrodynamique complet de la décharge (couplant les équations de transport, de Poisson et de cinétique réactionnelle) est ensuite validé par la confrontation des résultats 3D et 2D dans une condition de simulation présentant une symétrie de révolution autour de l'axe de propagation d'un streamer. Enfin, les premiers résultats des simulations 3D de la phase décharge avec la propagation d'un ou plusieurs streamers asymétriques sont présentés et analysés. Ces simulations permettent de suivre la structure arborescente de la décharge lorsqu'on applique une tension pulsée entre une pointe et un plan. L'initiation de la structure arborescente est étudiée en fonction de la position de spots plasmas et de leur influence sur l'amorçage des streamers. / This work is devoted to the three dimensional (3D) simulation of streamer corona discharges in air at atmospheric pressure using high-performance parallel computing. When a pulsed high-voltage is applied between a tip and a plane in air, the strong electric field lines constricted around the tip induce the simultaneous propagation of several streamers leading to a corona discharge with a tree structure. 
Only a true 3D electro-hydrodynamic simulation is able to reproduce this branching and to provide the orders of magnitude of the locally deposited energy and the concentration of the species created during the discharge phase. However, such a 3D simulation requires a large amount of memory and computation time, and is now feasible only with massively parallel computation. In 3D electro-hydrodynamic simulations, special attention must be paid to the efficiency of the solvers for the 3D elliptic equations, because their contribution can exceed 80% of the total computation time. Therefore, a specific chapter is devoted to testing the performance of iterative and direct methods (such as SOR R&B, BiCGSTAB and MUMPS) in solving elliptic equations, using massively parallel computation and the MPI library. The calculations are performed on the EOS supercomputer of the CALMIP network, with up to 1800 computing cores and up to 800³ mesh cells (i.e. more than half a billion cells). The performance is compared for the calculation of the geometric potential and under dynamic simulation conditions consisting of the propagation of an analytical space charge density characteristic of streamers. A complete 3D simulation of the streamer discharge must also involve a robust algorithm able to solve the coupled conservation equations of the charged particle densities with the very sharp gradients characteristic of streamers. In this manuscript, the MUSCL algorithm is tested under different propagation conditions of a cubic density (with uniform or non-uniform velocity field). 
The 3D code, designed to solve the complete electro-hydrodynamic model of the discharge (coupling the conservation equations, the Poisson equation and the chemical kinetics), is validated by comparing the 3D and 2D results under simulation conditions with rotational symmetry around the propagation axis of a mono-filamentary streamer. Finally, the first results of 3D simulations of the discharge phase, with the propagation of one or several asymmetric streamers, are presented and analyzed. These simulations make it possible to follow the tree structure of a corona discharge when a pulsed voltage is applied between a tip and a plane. The ignition of the tree structure is studied as a function of the initial position of the plasma spots. Simulation 3D Décharges couronnes Calcul haute performance MPI Streamer Equation de Poisson 3D simulation Corona discharges High performance computing MPI Streamer Poisson equation
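Record 435 names red-black SOR ("SOR R&B") among the elliptic solvers it benchmarks. The sketch below shows that iteration on a small 2D grid in serial Python, purely as an illustration; the thesis works in 3D with MPI, and the function name, grid size and relaxation factor here are assumptions chosen for the example.

```python
def sor_red_black(u, f, h, omega=1.5, tol=1e-10, max_sweeps=10_000):
    """Solve -laplacian(u) = f on a square grid with red-black SOR.

    u holds fixed boundary values on its edges and an initial guess inside;
    it is updated in place. Returns the number of sweeps performed.
    """
    n = len(u)
    for sweep in range(1, max_sweeps + 1):
        max_change = 0.0
        # Points of one colour depend only on points of the other colour,
        # so each half-sweep could be updated in parallel.
        for colour in (0, 1):
            for i in range(1, n - 1):
                for j in range(1, n - 1):
                    if (i + j) % 2 != colour:
                        continue
                    gauss_seidel = 0.25 * (
                        u[i - 1][j] + u[i + 1][j]
                        + u[i][j - 1] + u[i][j + 1]
                        + h * h * f[i][j]
                    )
                    change = omega * (gauss_seidel - u[i][j])
                    u[i][j] += change
                    max_change = max(max_change, abs(change))
        if max_change < tol:
            return sweep
    return max_sweeps


# Laplace problem (f = 0) whose boundary values follow u = x,
# so the exact discrete solution is u[i][j] = i * h everywhere.
n = 17
h = 1.0 / (n - 1)
grid = [
    [i * h if (i in (0, n - 1) or j in (0, n - 1)) else 0.0 for j in range(n)]
    for i in range(n)
]
source = [[0.0] * n for _ in range(n)]
sweeps = sor_red_black(grid, source, h)
```

The two-colour ordering is what makes the method attractive on parallel machines: within one colour no point reads a value written in the same half-sweep.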
436	Contributions to parallel stochastic simulation : application of good software engineering practices to the distribution of pseudorandom streams in hybrid Monte Carlo simulations / Contributions à la simulation stochastique parallèle : architectures logicielles pour la distribution de flux pseudo-aléatoires dans les simulations Monte Carlo sur CPU/GPU Passerat-Palmbach, Jonathan 11 October 2013 (has links) Résumé non disponible / The race for computing power intensifies every day in the simulation community. A few years ago, scientists started to harness the computing power of Graphics Processing Units (GPUs) to parallelize their simulations. As with any parallel architecture, not only does the implementation of the simulation model have to be ported to the new parallel platform, but all the tools must be reimplemented as well. In the particular case of stochastic simulations, one of the major elements of the implementation is the source of pseudorandom numbers. Employing pseudorandom numbers in parallel applications is not a straightforward task, and it has to be done with caution in order not to introduce biases in the results of the simulation. This problem, called pseudorandom stream distribution, has been studied ever since parallel architectures became available. While the literature is full of solutions for handling pseudorandom stream distribution on CPU-based parallel platforms, the young GPU programming community cannot yet claim the same experience. In this thesis, we study how to correctly distribute pseudorandom streams on GPU. From the existing solutions, we identified a need for good software engineering solutions, coupled with sound theoretical choices in the implementation. We propose a set of guidelines to follow when a PRNG has to be ported to GPU, and put this advice into practice in a software library called ShoveRand. This library is used in a stochastic Polymer Folding model that we have implemented in C++/CUDA. 
Pseudorandom stream distribution on manycore architectures is also one of our concerns. It resulted in a contribution named TaskLocalRandom, which targets parallel Java applications using pseudorandom numbers and task frameworks. Finally, we reflect on methods for choosing the right parallel platform for a given application. To this end, we propose to automatically build prototypes of the parallel application running on a wide set of architectures. This approach relies on existing software engineering tools from the Java and Scala community, most of them generating OpenCL source code from a high-level abstraction layer. Pseudorandom Number Generation (PRNG) High Performance Computing (HPC) Software Engineering Stochastic Simulation Graphics Processing Units (GPUs) GPU Programming Automatic Parallelization
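The pseudorandom stream distribution problem in record 436 can be illustrated with a deliberately simple scheme: each task derives its own generator from a master seed and its task index. This is only a sketch using the Python standard library. It is not ShoveRand or TaskLocalRandom, the function name is hypothetical, and hashing seeds makes the streams reproducible and distinct but does not by itself guarantee the statistical independence that the thesis is concerned with.

```python
import hashlib
import random


def stream_for_task(master_seed: int, task_id: int) -> random.Random:
    """Return a generator dedicated to one task (illustrative scheme only).

    The per-task seed is a SHA-256 hash of the master seed and the task
    index, so the stream does not depend on the order in which tasks run.
    """
    digest = hashlib.sha256(f"{master_seed}:{task_id}".encode()).digest()
    return random.Random(int.from_bytes(digest, "big"))


# Each task draws from its own stream instead of sharing one global generator.
draws_per_task = {
    task_id: [stream_for_task(42, task_id).random()]
    for task_id in range(4)
}
```

The anti-pattern this avoids is seeding every task with the same value, which makes all tasks replay an identical sequence and silently biases a Monte Carlo estimate.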
437	PaVo un tri parallèle adaptatif / PaVo. An Adaptative Parallel Sorting Algorithm. Durand, Marie 25 October 2013
Demanding gamers buy the latest graphics cards as soon as they can, to play immersive games whose precision, realism and interactivity keep increasing over time. Since the advent of general-purpose computing on graphics processing units, they are no longer the only customers for these cards. We first examine what these specific parallel architectures bring to large-scale physics simulations. This study highlights one bottleneck in particular that limits simulation performance. Consider a typical case: cracks in a complex structure such as a reinforced concrete dam can be modeled by a set of particles, and the interactions between particles provide the cohesion of the simulated matter. In memory, each particle is represented by a set of physical parameters that are read for every force computation between two particles. For computations to be fast, the data of particles that are close in space should therefore be close in memory. Otherwise, the number of cache misses increases and the memory bandwidth limit may be reached, especially in parallel, which bounds performance. The challenge is to maintain this data organization in memory throughout the simulation despite particle movements. Standard sorting algorithms are ill-suited because they systematically sort all the elements. In addition, they work on dense structures, which implies many data movements in memory. We propose PaVo, an adaptive sorting algorithm, meaning that it takes advantage of the order already present in a sequence. Moreover, PaVo maintains gaps in the data structure, distributed so as to reduce the number of memory moves required. We present an extensive experimental study and compare the results with several well-known sorting algorithms. Reducing memory accesses matters even more for large-scale simulations on parallel architectures. We detail a parallel version of PaVo and evaluate its benefits. To cope with application irregularity, the workload is balanced dynamically by work stealing. We propose to distribute data in memory automatically so as to take advantage of hierarchical architectures. Tasks are pre-assigned to cores according to this distribution, and we adapt the work-stealing scheduler to favor steals of tasks that work on data close in memory.
Physics Simulation; Parallel Computing; High Performance Computing; NUMA Architectures; Adaptive Sorting Algorithms; Data Structure with Gaps
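Illustrative note on adaptivity: the abstract describes an adaptive sort as one that benefits from the order already present in a sequence. The sketch below is not PaVo; it uses plain insertion sort, a classic adaptive algorithm, and counts element shifts to show how little work is done when the input is almost sorted. The function name `insertion_sort_counting_shifts` and the sample inputs are assumptions made for this example.

```python
def insertion_sort_counting_shifts(values):
    """Sort a copy of `values`; return (sorted_list, number_of_element_shifts)."""
    data = list(values)
    shifts = 0
    for i in range(1, len(data)):
        key = data[i]
        j = i - 1
        # Shift larger elements one slot to the right. The loop exits at once
        # when `key` is already in place, which is what makes the sort adaptive.
        while j >= 0 and data[j] > key:
            data[j + 1] = data[j]
            j -= 1
            shifts += 1
        data[j + 1] = key
    return data, shifts


if __name__ == "__main__":
    nearly_sorted = [1, 2, 4, 3, 5, 6, 8, 7]   # two adjacent pairs out of order
    reversed_input = [8, 7, 6, 5, 4, 3, 2, 1]  # worst case for insertion sort
    print(insertion_sort_counting_shifts(nearly_sorted))
    print(insertion_sort_counting_shifts(reversed_input))
```

In a particle simulation where positions change only slightly between time steps, the data tends to resemble the first input far more than the second, which is the situation an adaptive sort is meant to exploit.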
438	Protection obligatoire répartie : usage pour le calcul intensif et les postes de travail / Distributed mandatory protection Gros, Damien 30 June 2014
This thesis addresses two major issues in computer security. The first is improving the security of the Linux systems used in high performance computing; the second is protecting Windows workstations. It proposes a common method for observing system calls and distributing the observers, in order to strengthen security and measure the resulting performance. The observers are reference monitors, which guarantee confidentiality and integrity. A solution based on high performance computing technologies is implemented to reduce the communication overhead between the two reference monitors, SELinux and PIGA. The performance evaluation shows the overhead introduced by the distributed monitors and analyzes their feasibility on the different nodes of a high performance computing environment. Regarding workstation security, a new reference monitor is proposed for Windows. It builds on the best mandatory protection models from Linux systems and simplifies administration. We show how to use this monitor to analyze the behavior of malware. The analysis enables an advanced protection that controls the whole attack scenario in an optimistic manner. Security is thus strengthened without hindering legitimate activities.
Security; Mandatory access control; Operating systems; Malware; Workstation; High performance computing
439	MPI sobre MOM para suportar log de mensagens pessimista remoto / MPI over MOM to support remote pessimistic message logging Machado, Caciano dos Santos January 2010
The growing number of processors in the parallel architectures at the top of the performance rankings provides higher processing capacity, but it also brings a failure rate that increases in direct proportion to the number of processors. Today, rollback-recovery fault tolerance techniques are the most widely used in MPI applications, coordinated checkpointing in particular. However, projections indicate that coordinated checkpointing will be inadequate for emerging architectures. Message logging techniques, in contrast, have characteristics that make them better suited to this new scenario. This work proposes remote pessimistic message logging with uncoordinated checkpointing and evaluates the performance of MPI communication over the publish/subscribe channels on which the message logging is based. The work comprises: a study of the fault tolerance techniques most used in high performance environments and the motivation for choosing this variant of message logging; the message logging proposal; and an implementation of Open MPI communication over OpenAMQ, with a performance evaluation against traditional TCP/IP communication and against the local (sender-based) pessimistic message logging from the Open MPI distribution. The benchmarks used were NetPIPE, the NAS Parallel Benchmarks and the Virginia Hydrodynamics (VH-1) application.
Parallel processing; MPI; Parallel programming; High performance computing; Cluster based computing; Fault tolerance; Pessimistic message logging; Message-oriented middleware
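Illustrative note on pessimistic message logging: the general idea is that each message is recorded in stable storage before it is delivered, so a process that fails can restart from its own (uncoordinated) checkpoint and replay the logged messages. The sketch below is a toy, in-memory model of that idea only; it does not use MPI, OpenAMQ or any publish/subscribe middleware, and the names `RemoteLog`, `Process` and `send` are hypothetical.

```python
class RemoteLog:
    """Stand-in for a remote stable store (hypothetical; kept in memory here)."""

    def __init__(self):
        self.entries = []  # list of (receiver_name, sequence_number, payload)

    def append(self, receiver_name, seq, payload):
        self.entries.append((receiver_name, seq, payload))

    def messages_for(self, receiver_name, after_seq):
        """Return logged (seq, payload) pairs newer than `after_seq`, in order."""
        return [(s, p) for r, s, p in self.entries
                if r == receiver_name and s > after_seq]


class Process:
    """Toy process whose state is the running sum of the payloads it received."""

    def __init__(self, name):
        self.name = name
        self.state = 0
        self.last_seq = 0
        self.checkpoint = (0, 0)  # (state, last_seq) at the last checkpoint

    def deliver(self, seq, payload):
        self.state += payload
        self.last_seq = seq

    def take_checkpoint(self):
        # Uncoordinated: the process checkpoints on its own, without peers.
        self.checkpoint = (self.state, self.last_seq)

    def crash_and_recover(self, log):
        # Roll back to the checkpoint, then replay what was logged after it.
        self.state, self.last_seq = self.checkpoint
        for seq, payload in log.messages_for(self.name, self.last_seq):
            self.deliver(seq, payload)


def send(log, receiver, seq, payload):
    log.append(receiver.name, seq, payload)  # pessimistic: log before delivery
    receiver.deliver(seq, payload)


if __name__ == "__main__":
    log, p0 = RemoteLog(), Process("p0")
    send(log, p0, 1, 10)
    send(log, p0, 2, 20)
    p0.take_checkpoint()
    send(log, p0, 3, 30)
    state_before_crash = p0.state
    p0.crash_and_recover(log)
    print(p0.state == state_before_crash)
```

Because the log entry is written before delivery, no delivered message can be missing from the log after a failure; the price is the logging latency on every message, which is the overhead the thesis evaluates for its own design.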
440	A dynamic scheduling runtime and tuning system for heterogeneous multi and many-core desktop platforms / Um sistema de escalonamento dinâmico e tuning em tempo de execução para plataformas desktop heterogêneas de múltiplos núcleos Binotto, Alécio Pedro Delazari January 2011
A modern personal computer (PC) can now be regarded as a one-node heterogeneous cluster that simultaneously processes many tasks from several applications. It can be composed of asymmetric Processing Units (PUs), such as the multi-core Central Processing Unit (CPU), the many-core Graphics Processing Unit (GPU), which has become one of the main co-processors contributing to high performance computing on PCs, and other PUs. A powerful heterogeneous execution platform is thus available on a desktop for data-intensive calculations. From the perspective of this thesis, distributing an application's workload over the PUs plays a key role in improving application performance and exploiting this heterogeneity. This is challenging because the execution cost of a high-level task on a PU is non-deterministic and can be affected by a number of parameters not known a priori, such as the size of the problem domain and the precision of the solution. Within this scope, this doctoral research introduces a context-aware runtime and performance tuning system based on a trade-off between reducing the execution time of the applications, through appropriate dynamic scheduling of high-level tasks, and the cost of computing that schedule, applied on a platform composed of CPU and GPU. The approach combines a model for a first scheduling, based on task performance profiles acquired off-line in a pre-processing step, with a runtime model that keeps track of the tasks' real execution times and efficiently schedules new instances of the high-level tasks dynamically over the CPU/GPU execution platform. To this end, the thesis proposes a set of heuristics to schedule tasks over one CPU and one GPU, and a generic and efficient scheduling strategy that considers several processing units. The proposed approach is applied in a case study using a CPU-GPU execution platform to compute iterative solvers for systems of linear equations, using a stencil code specially designed to exploit the characteristics of modern GPUs. The solution uses the number of unknowns as the main parameter for the assignment decision. By scheduling tasks to both the CPU and the GPU, a performance gain of 21.77% is achieved in comparison to the static assignment of all tasks to the GPU (which is what current programming models such as OpenCL and CUDA for Nvidia do), with a scheduling error of only 0.25% compared to exhaustive search.
Parallel processing; Microelectronics; Image processing; High-performance computing; Scheduling; Dynamic load-balancing; Heterogeneous systems; Graphics processors; Solvers for systems of linear equations
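Illustrative note on profile-based dynamic scheduling: the abstract combines an off-line performance profile with run-time measurements and uses the number of unknowns as the main decision parameter. The sketch below shows one simple way such a loop could look; it is not the thesis's heuristics. The class `AdaptiveScheduler`, the linear cost model (seconds per unknown), the earliest-finish rule and the exponential smoothing are all assumptions made for this example.

```python
class AdaptiveScheduler:
    """Toy scheduler: pick the processing unit with the earliest estimated finish."""

    def __init__(self, profile, smoothing=0.5):
        # profile: {pu_name: estimated seconds per unknown}, e.g. from an
        # off-line benchmark (hypothetical values in the demo below).
        self.cost = dict(profile)
        self.smoothing = smoothing
        self.busy_until = {pu: 0.0 for pu in profile}

    def _estimated_finish(self, pu, unknowns):
        return self.busy_until[pu] + self.cost[pu] * unknowns

    def assign(self, unknowns):
        """Choose a PU for a task of the given size and reserve it."""
        pu = min(self.cost, key=lambda name: self._estimated_finish(name, unknowns))
        self.busy_until[pu] = self._estimated_finish(pu, unknowns)
        return pu

    def report(self, pu, unknowns, measured_seconds):
        """Blend a measured run time into the per-unknown cost estimate."""
        observed = measured_seconds / unknowns
        a = self.smoothing
        self.cost[pu] = (1 - a) * self.cost[pu] + a * observed


if __name__ == "__main__":
    scheduler = AdaptiveScheduler({"cpu": 2.5, "gpu": 1.0})
    # The GPU is faster, but once its queue is long enough the CPU gets work too.
    print([scheduler.assign(1) for _ in range(4)])
    # A slower-than-expected GPU run raises the GPU cost estimate.
    scheduler.report("gpu", 1, 3.0)
    print(scheduler.cost["gpu"])
```

The point of the sketch is the contrast with static assignment: sending every task to the GPU ignores both the idle CPU and any drift between the profile and the measured run times.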
