211

A Systematic Approach for Obtaining Performance on Matrix-Like Operations

Veras, Richard Michael 01 August 2017 (has links)
Scientific computation plays a critical role in the scientific process because it allows us to ask complex queries and test predictions that would otherwise be infeasible to evaluate experimentally. Because of its power, scientific computing has helped drive advances in many fields, ranging from engineering and physics to biology and sociology, to economics and drug development, and even to machine learning and artificial intelligence. Common among these domains is the desire for timely computational results, so a considerable amount of human expert effort is spent on obtaining performance for these scientific codes. However, this is no easy task, because each of these domains presents its own unique set of challenges to software developers, such as domain-specific operations, structurally complex data and ever-growing datasets. Compounding these problems is the myriad of constantly changing, complex and unique hardware platforms that an expert must target. Unfortunately, an expert is typically forced to reproduce their effort across multiple problem domains and hardware platforms. In this thesis, we demonstrate the automatic generation of expert-level high-performance scientific codes for Dense Linear Algebra (DLA), Structured Mesh (Stencil), Sparse Linear Algebra and Graph Analytics. In particular, this thesis seeks to address the issue of obtaining performance on many complex platforms for a certain class of matrix-like operations that span many scientific, engineering and social fields. We do this by automating a method used for obtaining high performance in DLA and extending it to the structured, sparse and scale-free domains. We argue that it is the underlying structure found in the data from these domains that enables this process. Thus, obtaining performance for most operations does not occur in isolation from the data being operated on, but instead depends significantly on the structure of the data.
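
The DLA approach that the thesis automates hinges on exploiting known structure, such as the reuse of cache-resident blocks, rather than treating the operation as a black box. The following sketch is an illustrative cache-blocked matrix multiply in C++ (not code from the thesis); the block size is an arbitrary placeholder.

```cpp
#include <vector>
#include <cstddef>
#include <algorithm>
#include <iostream>

// Illustrative cache-blocked matrix multiply: C += A * B, all N x N, row-major.
// Blocking over i/k/j keeps small tiles of A, B and C resident in cache, the
// kind of structural knowledge high-performance DLA kernels exploit.
// The block size NB = 64 is an arbitrary placeholder.
void blocked_gemm(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, std::size_t N, std::size_t NB = 64) {
    for (std::size_t ii = 0; ii < N; ii += NB)
        for (std::size_t kk = 0; kk < N; kk += NB)
            for (std::size_t jj = 0; jj < N; jj += NB)
                for (std::size_t i = ii; i < std::min(ii + NB, N); ++i)
                    for (std::size_t k = kk; k < std::min(kk + NB, N); ++k) {
                        const double a = A[i * N + k];
                        for (std::size_t j = jj; j < std::min(jj + NB, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}

int main() {
    const std::size_t N = 256;
    std::vector<double> A(N * N, 1.0), B(N * N, 1.0), C(N * N, 0.0);
    blocked_gemm(A, B, C, N);
    std::cout << "C[0] = " << C[0] << "\n";  // expect 256 for all-ones inputs
    return 0;
}
```
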
212

KernTune: self-tuning Linux kernel performance using support vector machines

Yi, Long January 2006 (has links)
Magister Scientiae - MSc / Self-tuning has been an elusive goal for operating systems and is becoming a pressing issue for modern operating systems. Well-trained system administrators are able to tune an operating system to achieve better performance for a specific system class. Unfortunately, the system class can change when the running applications change. The model for a self-tuning operating system is based on a monitor-classify-adjust loop. The idea of this loop is to continuously monitor certain performance metrics and, whenever these change, have the system determine the new system class and dynamically adjust the tuning parameters for that class. This thesis describes KernTune, a prototype tool that identifies the system class and improves system performance automatically. A key aspect of KernTune is its notion of Artificial Intelligence-oriented performance tuning: it uses a support vector machine to identify the system class, and tunes the operating system for that specific system class. The thesis presents design and implementation details for KernTune, and shows how it identifies a system class and tunes the operating system for improved performance. / South Africa
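
The monitor-classify-adjust loop described above can be pictured with the sketch below. It is a minimal, hypothetical illustration, not KernTune itself: the metric sampling, the classifier (standing in for the trained support vector machine) and the tuning step are all stubs.

```cpp
#include <chrono>
#include <thread>
#include <iostream>

// Hypothetical performance metrics sampled from the kernel (e.g. via /proc).
struct Metrics { double cpu_util, io_wait, net_throughput; };

Metrics sample_metrics() {
    // Placeholder: a real monitor would read /proc/stat, /proc/diskstats, etc.
    return {0.42, 0.10, 120.0};
}

// Stub classifier standing in for a trained SVM: maps metrics to a class id.
int classify(const Metrics& m) {
    return (m.io_wait > m.cpu_util) ? 1 /* I/O-bound */ : 0 /* CPU-bound */;
}

// Placeholder tuner: a real tool would write sysctl/kernel parameters.
void apply_tuning(int system_class) {
    std::cout << "applying parameter set for class " << system_class << "\n";
}

int main() {
    int current_class = -1;
    for (int iter = 0; iter < 3; ++iter) {          // monitor-classify-adjust loop
        Metrics m = sample_metrics();               // monitor
        int cls = classify(m);                      // classify
        if (cls != current_class) {                 // adjust only when the class changes
            apply_tuning(cls);
            current_class = cls;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
    return 0;
}
```
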
213

Virtualisation en contexte HPC / Virtualisation in HPC context

Capra, Antoine 17 December 2015 (has links)
To meet the growing needs of numerical simulation and remain at the cutting edge of technology, supercomputers must be constantly improved. These improvements may concern the hardware or the software, which forces applications to adapt to a new programming environment over the course of their development. It then becomes necessary to ask how applications can be sustained over time and kept portable from one machine to another. The use of virtual machines can be a first answer to this need, by stabilizing programming environments: with virtualization, an application can be developed within a fixed environment, without being directly affected by the environment present on a given physical machine. However, the additional abstraction introduced by virtual machines leads in practice to a loss of performance. In this thesis we propose a set of tools and techniques that enable the use of virtual machines in an HPC context. First, we show that it is possible to optimize the operation of a hypervisor so as to meet the key HPC constraints as faithfully as possible, namely thread placement and memory data locality. Building on this result, we then propose a service for partitioning the resources of a compute node by means of virtual machines. Finally, to extend our work to MPI applications, we study the networking solutions available to a virtual machine and their performance.
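
The two HPC constraints mentioned, thread placement and memory data locality, can be illustrated with the Linux-specific sketch below. It pins each worker thread to its own core with sched_setaffinity and relies on first-touch allocation for locality; it illustrates the constraints themselves, not the hypervisor modifications developed in the thesis.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <cstdio>
#include <thread>
#include <vector>

// Pin the calling thread to one core; a hypervisor would apply the same idea
// to the threads that back each vCPU.
void pin_current_thread(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)  // 0 = calling thread
        std::perror("sched_setaffinity");
}

int main() {
    unsigned n = std::thread::hardware_concurrency();
    std::vector<std::thread> workers;
    for (unsigned c = 0; c < n; ++c) {
        workers.emplace_back([c] {
            pin_current_thread(static_cast<int>(c));
            // Touching memory after pinning keeps first-touch pages on the
            // NUMA node that owns this core.
            std::vector<double> local(1 << 20, 1.0);
            std::printf("worker %u pinned to core %u, first element %.1f\n",
                        c, c, local[0]);
        });
    }
    for (auto& t : workers) t.join();
    return 0;
}
```
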
214

Dynamic Load Balancing Schemes for Large-scale HLA-based Simulations

De Grande, Robson E. January 2012 (has links)
Dynamic balancing of computation and communication load is vital for the execution stability and performance of distributed, parallel simulations deployed on the shared, unreliable resources of large-scale environments. High Level Architecture (HLA) based simulations can experience a decrease in performance due to imbalances that are produced initially and/or during run-time. These imbalances are generated by the dynamic load changes of distributed simulations or by unknown, non-managed background processes resulting from the non-dedication of shared resources. Due to the dynamic execution characteristics of the elements that compose distributed simulation applications, the computational load and interaction dependencies of each simulation entity change during run-time. These dynamic changes lead to an irregular load and communication distribution, which increases resource overhead and execution delays. A static partitioning of load is limited to deterministic applications and is incapable of predicting the dynamic changes caused by distributed applications or by external background processes. Given the importance of dynamically balancing load in distributed simulations, many balancing approaches have been proposed, but they offer only sub-optimal solutions: they are limited to certain simulation aspects, specific to particular applications, or unaware of HLA-based simulation characteristics. Therefore, schemes for balancing the communication and computational load during the execution of distributed simulations are devised here, adopting a hierarchical architecture. First, to enable the development of such balancing schemes, a migration technique is employed to perform reliable, low-latency simulation load transfers. Then, a centralized balancing scheme is designed; this scheme employs local and cluster monitoring mechanisms to observe distributed load changes and identify imbalances, and it uses load reallocation policies to determine a distribution of load that minimizes imbalances (see the sketch after this abstract). To overcome the drawbacks of this scheme, such as bottlenecks, overhead, global synchronization, and a single point of failure, a distributed redistribution algorithm is designed. Extensions of the distributed balancing scheme are also developed to improve the detection of and the reaction to load imbalances. These extensions introduce communication delay detection, migration latency awareness, self-adaptation, and load oscillation prediction in the load redistribution algorithm. The developed balancing systems successfully improved the use of shared resources and increased the performance of distributed simulations.
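
As a simple illustration of a load-reallocation policy of the kind such schemes rely on (not the thesis's algorithm), the sketch below greedily migrates the smallest entity from the most-loaded node to the least-loaded one, as long as doing so still reduces the imbalance.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>
#include <iostream>

// Illustrative greedy redistribution policy: while the most-loaded node
// exceeds the average load by a threshold, migrate its lightest entity
// (the cheapest migration) to the least-loaded node.
struct Node { std::vector<double> entity_loads; };

double load(const Node& n) {
    return std::accumulate(n.entity_loads.begin(), n.entity_loads.end(), 0.0);
}

void rebalance(std::vector<Node>& nodes, double threshold = 0.1) {
    for (int step = 0; step < 100; ++step) {             // hard cap for safety
        auto by_load = [](const Node& a, const Node& b) { return load(a) < load(b); };
        auto max_it = std::max_element(nodes.begin(), nodes.end(), by_load);
        auto min_it = std::min_element(nodes.begin(), nodes.end(), by_load);
        double total = 0.0;
        for (const auto& n : nodes) total += load(n);
        double avg = total / nodes.size();
        if (max_it->entity_loads.empty() || load(*max_it) - avg <= threshold * avg)
            break;                                        // considered balanced
        double gap = load(*max_it) - load(*min_it);
        auto e = std::min_element(max_it->entity_loads.begin(),
                                  max_it->entity_loads.end());
        if (*e >= gap) break;          // migration would not reduce the imbalance
        min_it->entity_loads.push_back(*e);               // migrate the entity
        max_it->entity_loads.erase(e);
    }
}

int main() {
    std::vector<Node> nodes = {{{5.0, 3.0, 2.0}}, {{1.0}}, {{0.5}}};
    rebalance(nodes);
    for (const auto& n : nodes) std::cout << load(n) << " ";   // e.g. 5 4 2.5
    std::cout << "\n";
    return 0;
}
```
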
215

Code profiling and optimization in transactional memory systems / Profiling e otimização de código em sistemas de memória transacional

Cordeiro, Silvio Ricardo January 2014 (has links)
Transactional Memory has shown itself to be a promising paradigm for the implementation of shared-memory concurrent applications that eschew a lock-based model of data synchronization. Rather than conditioning exclusive access on the value of a lock that is shared across concurrent threads, Transactional Memory attempts to execute critical sections optimistically, rolling back the modifications in the event of a data access conflict. However, while the lock-based approach has acquired a significant body of debugging, profiling and automated optimization tools (as one of the oldest and most researched synchronization techniques), the field of Transactional Memory is still comparatively recent, and programmers are usually tasked with an unguided manual tuning of their transactional applications when facing efficiency problems. We propose a system in which code profiling in a simulated hardware implementation of Transactional Memory is used to characterize a transactional application, which forms the basis for the automated tuning of the underlying speculative system for the efficient execution of that particular application. We also propose a profile-guided approach to the scheduling of threads in a software-based implementation of Transactional Memory, using collected data to predict the likelihood of conflicts and to determine which thread to schedule based on this prediction. We present the results achieved under both designs.
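
A minimal sketch of the profile-guided scheduling idea follows: given profiled pairwise conflict probabilities, the scheduler picks the ready thread with the lowest expected conflict against the transactions currently running. The data structures and numbers are made up for illustration; this is not the thesis's implementation.

```cpp
#include <vector>
#include <iostream>

// Per-thread profile: conflict_prob[j] is the profiled probability of a
// conflict between this thread's transactions and those of thread j.
struct ThreadProfile {
    int id;
    std::vector<double> conflict_prob;
};

// Choose the ready thread whose expected conflict with the currently running
// transactions is smallest.
int pick_next(const std::vector<ThreadProfile>& ready,
              const std::vector<int>& running) {
    int best = -1;
    double best_score = 2.0;            // probabilities are in [0, 1]
    for (const auto& t : ready) {
        double score = 0.0;             // rough expected number of conflicts
        for (int r : running) score += t.conflict_prob[r];
        if (score < best_score) { best_score = score; best = t.id; }
    }
    return best;
}

int main() {
    // Made-up profiled conflict probabilities for threads 0..2.
    std::vector<ThreadProfile> ready = {
        {1, {0.9, 0.0, 0.2}},
        {2, {0.1, 0.2, 0.0}},
    };
    std::vector<int> running = {0};     // thread 0 currently runs a transaction
    std::cout << "schedule thread " << pick_next(ready, running) << "\n";  // -> 2
    return 0;
}
```
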
216

Avaliação do impacto da comunicação intra e entre-nós em nuvens computacionais para aplicações de alto desempenho / Evaluation of impact from inter and intra-node communication in cloud computing for HPC applications

Thiago Kenji Okada 07 November 2016 (has links)
With the advent of cloud computing, it is no longer necessary for users to invest large amounts of money in computing equipment. Instead, it is possible to obtain processing or storage resources, and even complete systems, on demand, using one of the several services available from cloud providers such as Amazon, Google, Microsoft, and USP itself. Cloud computing allows greater control of operating expenses, reducing costs in many cases. For example, high-performance computing users can benefit from this model by using a large number of resources for short periods of time, instead of acquiring a computer cluster with a high initial cost. Our work examines the feasibility of running high-performance applications in the cloud, comparing their performance on infrastructures with known behavior against the public cloud offered by Google. In particular, we focus on different parallel configurations involving internal communication between processes on the same node, called intra-node communication, and external communication between processes on different nodes, called inter-node communication. Our case study was the NAS Parallel Benchmarks, a popular benchmark suite for the performance analysis of parallel and high-performance systems. We used applications with pure MPI implementations (for both intra- and inter-node communication) and mixed implementations in which internal communication was done with OpenMP (intra-node) and external communication with MPI (inter-node).
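
The two configurations compared, pure MPI versus mixed MPI/OpenMP, can be pictured with the minimal hybrid program below: OpenMP threads handle intra-node parallelism while MPI handles inter-node communication. It is illustrative only and not part of the NAS Parallel Benchmarks.

```cpp
#include <mpi.h>
#include <omp.h>
#include <cstdio>

// Minimal hybrid MPI + OpenMP sketch.
// Build e.g. with: mpicxx -fopenmp hybrid.cpp && mpirun -np 2 ./a.out
int main(int argc, char** argv) {
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local_sum = 0.0;
    #pragma omp parallel reduction(+ : local_sum)   // intra-node parallelism
    {
        local_sum += omp_get_thread_num() + 1;
    }

    double global_sum = 0.0;                         // inter-node communication
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("ranks=%d, global sum=%f\n", size, global_sum);

    MPI_Finalize();
    return 0;
}
```
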
217

Designing a Modern Skeleton Programming Framework for Parallel and Heterogeneous Systems

Ernstsson, August January 2020 (has links)
Today's society is increasingly software-driven and dependent on powerful computer technology. It is therefore important that advancements in low-level processor hardware are made available for exploitation by a growing number of programmers of differing skill levels. However, as we approach the end of Moore's law, hardware designers are finding new and increasingly complex ways to increase the accessible processor performance. It is getting more and more difficult to effectively target these processing resources without expert knowledge in parallelization, heterogeneous computation, communication, synchronization, and so on. To ensure that the software side can keep up, advanced programming environments and frameworks are needed to bridge the widening gap between hardware and software. One such example is the pattern-centric skeleton programming model, and in particular the SkePU project. The work presented in this thesis first redesigns the SkePU framework based on modern C++ variadic template metaprogramming and state-of-the-art compiler technology. It then explores new ways to improve performance: by providing new patterns, improving the data access locality of existing ones, and using both static and dynamic knowledge about program flow. The work combines novel ideas with practical evaluation of the approach on several applications. The advancements also include the first skeleton API that allows variadic skeletons, new data containers, and finally an approach to make skeleton programming more customizable without compromising universal portability. / Additional research funders: EU H2020 project EXA2PRO (801015); SeRC.
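
The sketch below illustrates the variadic-skeleton idea with a bare-bones Map that accepts any number of input containers via C++ variadic templates. It is not the SkePU API; a real framework such as SkePU adds backend selection (CPU/OpenMP/GPU), smart containers and tuning on top of this pattern.

```cpp
#include <vector>
#include <cstddef>
#include <iostream>

// Minimal variadic "Map" skeleton: apply a user function elementwise over any
// number of input vectors, writing the result to an output vector.
template <typename Func, typename Out, typename... In>
void map_skeleton(Func f, std::vector<Out>& out, const std::vector<In>&... in) {
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = f(in[i]...);   // pack expansion yields one element per input
}

int main() {
    std::vector<float> a = {1, 2, 3}, b = {10, 20, 30};
    std::vector<float> c(3);
    map_skeleton([](float x, float y) { return x * y + 1; }, c, a, b);
    for (float v : c) std::cout << v << " ";   // 11 41 91
    std::cout << "\n";
    return 0;
}
```
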
218

Design of an Optimized Supervisor Module for Tomographic Adaptive Optics Systems of Extremely Large Telescopes

Doucet, Nicolas 08 January 2020 (has links)
The recent advent of next-generation ground-based telescopes, code-named Extremely Large Telescopes (ELT), marks the beginning of a forced march toward an era of deploying instruments capable of exploiting starlight captured by mirrors at an unprecedented scale. This confronts the astronomy community with both a daunting challenge and a unique opportunity. The challenge arises from the mismatch between the complexity of current instruments and their expected scaling with the square of future telescope diameters, on which astronomy applications have relied to produce better science. To deliver on the promise of tomorrow's ELT, astronomers must design new technologies that can effectively enhance the performance of the instrument at scale, while compensating for atmospheric turbulence in real time. This is an unsolved problem. It also presents an opportunity, because the astronomy community is now compelled to rethink essential components of the optical systems and their traditional hardware/software ecosystems in order to achieve high optical performance with a near real-time computational response. To realize the full potential of such instruments, we investigate a technique supporting Adaptive Optics (AO), i.e., a dedicated concept relying on turbulence tomography. In particular, a critical part of AO systems is the supervisor module, which is responsible for providing the system with a Tomographic Reconstructor (ToR) at a regular pace as the atmospheric turbulence evolves over an observation window. In this thesis, we implement an optimized supervisor module and assess it under real configurations of the future European ELT (E-ELT) with its 39 m diameter, the largest and most complex optical telescope ever conceived. This necessitates manipulating large matrices (up to 100k × 100k) that contain measurements captured by multiple wavefront sensors. To address this complexity bottleneck, we employ high-performance computing software solutions based on cutting-edge numerical algorithms using asynchronous, fine-grained computations as well as approximation techniques that leverage the resulting matrix data structure. Furthermore, GPU-based hardware accelerators are used in conjunction with the software solutions to ensure a time-to-solution that keeps up with rapidly evolving atmospheric turbulence. The proposed software/hardware solution makes it possible to reconstruct an image with high accuracy. We demonstrate the validity of the AO systems with a third-party testbed simulating at the E-ELT scale, which is intended to pave the way for a first prototype installed on-site.
219

Instalace a konfigurace Octave výpočetního clusteru / Installation and configuration of Octave computation cluster

Mikulka, Zdeněk January 2014 (has links)
This diploma thesis contains a detailed design of a high-performance cluster, primarily intended for parallel computing with the Octave application. Each component of this cluster is described, along with instructions for its installation and configuration. The cluster is based on the GNU/Linux operating system and the Message Passing Interface (MPI). The design allows the cluster to be deployed on classroom computers that are also in use for active lessons.
220

Low-rank Approximations in Quantum Transport Simulations

Daniel A. Lemus (5929940) 07 May 2020 (has links)
Quantum-mechanical effects play a major role in the performance of modern electronic devices. In order to predict the behavior of novel devices, quantum effects are often included using Non-Equilibrium Green's Function (NEGF) methods in atomistic device representations. These quantum effects may include realistic inelastic scattering caused by device impurities and phonons. With the inclusion of realistic physical phenomena, the computational load of predictive simulations increases greatly, and a manageable basis obtained through low-rank approximations is desired. In this work, low-rank approximations are used to reduce the computational load of atomistic simulations, and the benefits of basis reductions on simulation time and peak memory are assessed. The low-rank approximation method is then extended to include more realistic physical effects than those modeled today, including exact calculations of scattering phenomena. The inclusion of these exact calculations is then contrasted with current methods and approximations.
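
The general idea behind such basis reductions can be illustrated with a truncated SVD: a dense matrix is replaced by a rank-k factorization so that storage and subsequent products scale with k rather than with the full dimension. The sketch below uses the Eigen library and is purely illustrative; it is not the NEGF code from the thesis.

```cpp
#include <Eigen/Dense>
#include <iostream>

// Truncated-SVD low-rank approximation: A ≈ U_k * S_k * V_k^T, keeping only
// the k largest singular values.
Eigen::MatrixXd low_rank(const Eigen::MatrixXd& A, int k) {
    Eigen::JacobiSVD<Eigen::MatrixXd> svd(A, Eigen::ComputeThinU | Eigen::ComputeThinV);
    return svd.matrixU().leftCols(k)
         * svd.singularValues().head(k).asDiagonal()
         * svd.matrixV().leftCols(k).transpose();
}

int main() {
    // A 100 x 100 matrix that is exactly rank 2 by construction.
    Eigen::MatrixXd B = Eigen::MatrixXd::Random(100, 2);
    Eigen::MatrixXd A = B * B.transpose();

    Eigen::MatrixXd A2 = low_rank(A, 2);
    std::cout << "relative error at rank 2: "
              << (A - A2).norm() / A.norm() << "\n";   // ~ machine precision
    return 0;
}
```
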
