Global ETD Search

11	Transforming TLP into DLP with the dynamic inter-thread vectorization architecture / Transformer le TLP en DLP avec l'architecture de vectorisation dynamique inter-thread Kalathingal, Sajith 13 December 2016 Many modern microprocessors implement Simultaneous Multi-Threading (SMT) to improve the overall efficiency of superscalar CPUs. SMT hides long-latency operations by executing instructions from multiple threads simultaneously. SMT may execute threads of different processes, threads of the same process, or any combination of them. When the threads belong to the same process, they often execute the same instructions on different data, especially in Single-Program Multiple-Data (SPMD) applications. Traditional SMT architectures exploit thread-level parallelism and, through SIMD execution units, also support explicit data-level parallelism. SIMD execution is power efficient because the total number of instructions required to execute a complete program is significantly reduced. This reduction is a function of the width of the SIMD execution units and of the vectorization efficiency. Static vectorization efficiency depends on the programmer's skill and on the compiler. Programs are often not optimized for vectorization, which results in inefficient static vectorization by the compiler. In this thesis, we propose the Dynamic Inter-Thread Vectorization Architecture (DITVA) to leverage the implicit data-level parallelism in SPMD applications by assembling dynamic vector instructions at runtime. DITVA extends a SIMD-enabled in-order SMT processor with an inter-thread vectorization execution mode. When threads run in lockstep, the same instructions across threads are dynamically vectorized to form a SIMD instruction. Threads on convergent paths share an instruction stream. When all the threads are on the convergent path, there is only a single stream of instructions. To preserve thread-level parallelism in such cases, DITVA statically groups threads into fixed-size, independently scheduled warps. DITVA leverages existing SIMD units and maintains binary compatibility with existing CPU architectures. Simultaneous Multi-Threading SPMD SIMD Dynamic inter-Thread vectorization Microprocessor architectures
12	Improving memory consumption and performance scalability of HPC applications with multi-threaded network communications / Amélioration de la consommation mémoire et de l'extensibilité des performances des applications HPC par le multi-threading des communications réseaux Didelot, Sylvain 12 June 2014 A recent trend in high performance computing is a rising number of cores per compute node while the total amount of memory per compute node remains constant. To scale parallel applications on such large machines, one of the major challenges is to keep memory consumption low. This thesis develops a multi-threaded communication layer over InfiniBand that provides both good communication performance and low memory consumption. We target scientific applications parallelized using the MPI standard, either in pure mode or combined with a shared-memory programming model. Starting from the observation that network endpoints and communication buffers are critical for the scalability of MPI runtimes, the first contribution proposes three approaches to control their usage. We introduce a scalable and fully connected virtual topology for connection-oriented high-speed networks. In the context of multirail configurations, we then detail a runtime technique that reduces the number of network connections. Finally, we present a protocol for dynamically resizing network buffers over RDMA. The second contribution proposes a runtime optimization that increases the overlap potential of MPI communications, showing a 2x improvement on communications. The third contribution evaluates the performance of several MPI runtimes running a seismic modeling application in a hybrid context. On large compute nodes with up to 128 cores, introducing OpenMP into the MPI application saves up to 17% of memory. Moreover, we show a performance improvement with our multi-threaded communication layer when the OpenMP threads concurrently participate in the MPI communications. High performance computing Multi-threading High-speed networks MPI NUMA
13	Difúzní rozptyl rentgenového záření na GaN epitaxních vrstvách / Diffuse x-ray scattering from GaN epitaxial layers Barchuk, Mykhailo January 2012 The real structure of heteroepitaxial GaN and AlGaN layers is studied by diffuse x-ray scattering. A newly developed method based on Monte Carlo simulation is presented; it makes it possible to determine the densities of threading dislocations in c-plane GaN and of stacking faults in a-plane GaN. The results of the Monte Carlo simulations are compared with those obtained by other, conventional techniques. The advantages and limitations of the new method are discussed in detail. The method's accuracy is estimated at about 15%. We have shown that our method is a reliable tool for determining the densities of threading dislocations and stacking faults.
14	On the automated compilation of UML notation to a VLIW chip multiprocessor Stevens, David January 2013 As more and more cores become available within architectures, extracting the implicit and explicit parallelism in applications needed to fully utilise these cores is becoming complex. Implicit parallelism is extracted by intelligent software and hardware sections of tool chains, although these reach their theoretical limit rather quickly. For this reason, a method that allows explicit parallelism to be performed as fast as possible has been investigated. This method enables application developers to create and synchronise parallel sections of an application at a finer-grained level than previously possible, so that smaller sections of code can be executed in parallel while still reducing overall execution time. Alongside explicit parallelism, the high-level design of applications destined for multicore systems was also investigated. As systems grow larger, it is becoming more difficult to design them and to track the full life-cycle of development. One method used to ease this process is a graphical design process that visualises the high-level designs of such systems. One drawback of graphical design is the explicit nature in which systems are required to be generated. This was investigated and, using concepts already in use in text-based programming languages, the generation of platform-independent models that can be specialised to multiple hardware architectures was developed. The explicit parallelism was implemented using hardware elements to perform thread management, which resulted in speed-ups of over 13 times compared to threading libraries executed in software on commercially available processors. This allowed applications with large data-dependent sections to be parallelised in small sections within the code, resulting in a decrease in overall execution time. The modelling concepts saved between 40% and 50% of the time and effort required to generate platform-specific models, while incurring an overhead of at most 15% of the execution cycles of models designed for specific architectures. 621.3
15	High Performance Portability with RAJA and Agency Obermiller, Dan 01 January 2017 High performance and scientific computing take advantage of high-end and high-spec computer architectures. As these architectures evolve, and new architectures are created, applications may be able to run at greater and greater speeds. These changes present challenges to implementors who wish to take advantage of the newest features and machines. Portability layers such as RAJA and Agency seek to abstract away machine-specific details and allow scientists to take advantage of new features as they become available. We enhance RAJA with a lower-level framework, Agency, to determine whether these layered abstractions provide performance or maintainability benefits. high-performance portability c++ gpu threading Programming Languages and Compilers
16	CAFM Studies of Epitaxial Lateral Overgrowth GaN Films Kasliwal, Vishal P. 01 January 2007 This thesis uses atomic force microscopy (AFM) and conductive AFM (CAFM) to study defect sites on GaN films. In particular, these defect sites exhibit current leakage under reverse-bias conditions, which is detrimental to device fabrication. Two growth techniques were used to improve this leakage behavior for the samples in this study: epitaxial lateral overgrowth (ELO) and nano-ELO using a Si3N4 film. Both techniques decrease defects such as threading dislocations by controlling the nucleation and growth behavior of the GaN films. The ELO technique uses a patterned dielectric film to laterally grow micron-wide regions (referred to as 'wings') that minimize dislocation defects. Our CAFM studies indicate that ELO films have no detectable leakage sites in these wing regions; between these regions, however, the films have the leakage site densities typically seen for standard films, on the order of 10^7 cm^-3. The nano-ELO technique uses a porous Si3N4 film to reduce defects over the entire film, and CAFM data indicate nearly a factor of ten reduction in leakage site densities. The nano-ELO technique is therefore optimal for an overall improvement in film quality, whereas the ELO technique is suitable for device fabrication in patterned regions with optimized film quality. threading dislocation semiconductor growth behavior leakage defect site gallium nitride Physical Sciences and Mathematics Physics
17	Design of a Multi-Core Multi-thread Floating-Point Processor and Its Application in Computer Graphics Yeh, Chia-Yu 06 September 2011 Graphics processing unit (GPU) designs usually adopt various computer architecture techniques to boost computation speed, including single-instruction multiple-data (SIMD), very-long-instruction-word (VLIW), multi-threading, and/or multi-core. In OpenGL ES 2.0, the user-programmable vertex shader (VS) hardware unit can be designed using a vector SIMD computation unit so that it can efficiently compute matrix-vector multiplication, one of the key operations in vertex transformation. Recently, high-performance GPUs, such as the Tesla series from NVIDIA, have been designed with many-core architectures in which each core is responsible for scalar operations. The intention is to allow efficient execution of general-purpose computations in addition to specialized graphics computations. In this thesis, we design a scalar-based multi-threaded GPU that is composed of four scalar processors and one special-function unit and can execute multi-threaded instructions. We use the example of vertex transformation to demonstrate the execution efficiency of the scalar-based multi-threaded GPU. We also compare it with a vector-based SIMD GPU. multi-threading graphics processing unit (GPU) vertex shader SIMD matrix-vector multiplication OpenGL ES 2.0
18	A Fold Recognition Approach to Modeling of Structurally Variable Regions Levefelt, Christer January 2004 A novel approach is proposed for modeling of structurally variable regions in proteins. In this approach, a prerequisite sequence-structure alignment is examined for regions where the target sequence is not covered by the structural template. These regions, extended with a number of residues from adjacent stem regions, are submitted to fold recognition. The alignments produced by fold recognition are integrated into the initial alignment to create a multiple alignment where gaps in the main structural template are covered by local structural templates. This multiple alignment is used to create a protein model by existing protein modeling techniques. Several alternative parameters are evaluated using a set of ten proteins. One set of parameters is selected and evaluated using another set of 31 proteins. The most promising result is for loop regions not located at the C- or N-terminal of a protein, where the method produces an average RMSD 12% lower than the loop modeling provided with the program MODELLER. This improvement is shown to be statistically significant. Computer science
19	Improving Performance Of Network Intrusion Detection Systems Through Concurrent Mechanisms Atakan, Mustafa 01 January 2004 As the bandwidth of present networks grows beyond that of the past, Network Intrusion Detection Systems (NIDS) that function in real time become a major requirement for high-speed networks. If these systems are not fast enough to process all the network traffic passing through them, malicious security violations may take place by exploiting this drawback. To make such applications schedulable, a concurrency mechanism is introduced into the general flow of their algorithm. The principal aim is to fully utilize each resource of the platform and to overlap the independent parts of the applications. In this context, a generic multi-threaded infrastructure is designed and proposed. The concurrency metrics of the new system are analyzed and compared with those of the original systems. QA Computer Software 76.75-76.765
20	Dynamic Analysis of Multithreaded Embedded Software to Expose Atomicity Violations January 2016 abstract: Concurrency bugs are among the most notorious software bugs and are very difficult to trigger. Significant work has been done on detecting atomicity-violation bugs in high performance systems, but little work addresses detecting these bugs in embedded systems. Although the criteria for claiming the existence of a bug remain the same, the approach differs somewhat for embedded systems. The main focus of this research is to develop a systematic methodology that addresses the issue from an embedded systems perspective. A framework is developed that uses the memory references of shared variables to predict access interleaving patterns that may violate atomicity, and that provides support to force and analyze these schedules for any output change, system fault, or change in execution path. / Dissertation/Thesis / Masters Thesis Computer Science 2016 Computer engineering atomicity violation Concurrency bugs dynamic analysis embedded software execution replay multi-threading
