Global ETD Search

1	Hardware and Software Optimizations for Deep Learning Workloads on Graphics Processing Units Garrigus, Justin Matthew 12 1900 (has links) This thesis investigates different hardware and software optimization techniques for accelerating deep-learning workloads using graphics processing units (GPUs). Deep learning is widely used across various engineering and scientific domains such as computer vision and natural-language processing. Such applications are used in environments spanning from resource-constrained embedded devices to exascale supercomputers. GPUs are a major hardware platform for accelerating the training and inference of deep-learning models. With the fast growth of deep-learning models, multi-GPU systems containing up to 1,000 nodes have been used for training models with billions of parameters. We characterize three different problems and techniques for improving the usage of GPUs in deep learning: deep-learning compilers, which iteratively converge to an optimized software representation of the model; a lowered L2 cache, which is aware of the duplication in image-recognition matrix-multiplication kernels; and a characterization of multi-GPU training environments across different large language models. Future research can take advantage of our results to enhance the GPU hardware and software designs and thus improve the efficiency in training and deploying deep learning models. Computer Science Hardware Optimizations Software Optimizations
2	Study on electricity characteristics of electro-magnetic vibration-induced micro-generators Chen, Ssu-ting 28 August 2007 (has links) With the flourishing development of MEMS, it is possible to combine micro-sensors with micro-actuator and apply to the organ transplant in medical fields or as an embedded sensor on buildings or bridges. Generally batteries is are used as the kinetic energy source, but it involves the issue of recycling. Therefore, development of a self-generator utilizing vibrational source from environment is another better choice. This study succeeds in building up the transform mode of electricity in an electro-magnetic vibration-induced micro-generator. The electricity characteristics of micro-generator are obtained by Mathematical software analysis. MEMs technology can be used to fabricate and assemble the microstructure , planar coils and magnetic films. The analytic results of maximum power and minimum volume by using a mathematics model are achieved. The validity of this model is verified by comparing the theoretical and experiment data from the literature. MEMS micro-generator optimizations
3	Improving OpenMP Productivity with Data Locality Optimizations and High-resolution Performance Analysis Muddukrishna, Ananya January 2016 (has links) The combination of high-performance parallel programming and multi-core processors is the dominant approach to meet the ever increasing demand for computing performance today. The thesis is centered around OpenMP, a popular parallel programming API standard that enables programmers to quickly get started with writing parallel programs. However, in contrast to the quickness of getting started, writing high-performance OpenMP programs requires high effort and saps productivity. Part of the reason for impeded productivity is OpenMP’s lack of abstractions and guidance to exploit the strong architectural locality exhibited in NUMA systems and manycore processors. The thesis contributes with data distribution abstractions that enable programmers to distribute data portably in NUMA systems and manycore processors without being aware of low-level system topology details. Data distribution abstractions are supported by the runtime system and leveraged by the second contribution of the thesis – an architecture-specific locality-aware scheduling policy that reduces data access latencies incurred by tasks, allowing programmers to obtain with minimal effort upto 69% improved performance for scientific programs compared to state-of-the-art work-stealing scheduling. Another reason for reduced programmer productivity is the poor support extended by OpenMP performance analysis tools to visualize, understand, and resolve problems at the level of grains– task and parallel for-loop chunk instances. The thesis contributes with a cost-effective and automatic method to extensively profile and visualize grains. Grain properties and hardware performance are profiled at event notifications from the runtime system with less than 2.5% overheads and visualized using a new method called theGrain Graph. The grain graph shows the program structure that unfolded during execution and highlights problems such as low parallelism, work inflation, and poor parallelization benefit directly at the grain level with precise links to problem areas in source code. The thesis demonstrates that grain graphs can quickly reveal performance problems that are difficult to detect and characterize in fine detail using existing tools in standard programs from SPEC OMP 2012, Parsec 3.0 and Barcelona OpenMP Tasks Suite (BOTS). Grain profiles are also applied to study the input sensitivity and similarity of BOTS programs. All thesis contributions are assembled together to create an iterative performance analysis and optimization work-flow that enables programmers to achieve desired performance systematically and more quickly than what is possible using existing tools. This reduces pressure on experts and removes the need for tedious trial-and-error tuning, simplifying OpenMP performance analysis. / <p>QC 20151221</p> OpenMP Performance Analysis Scheduling Locality Optimizations
4	Enabling Efficient Storage of Git Repositories in PAClab Brunner, Rebecca 10 August 2020 (has links) No description available. Computer Science PAClab Git ZFS Graph Optimizations
5	An Optimized R5RS Macro Expander Reque, Sean P. 05 February 2013 (has links) (PDF) Macro systems allow programmers abstractions over the syntax of a programming language. This gives the programmer some of the same power posessed by a programming language designer, namely, the ability to extend the programming language to meet the needs of the programmer. The value of such systems has been demonstrated by their continued adoption in more languages and platforms. However, several barriers to widespread adoption of macro systems still exist. The language Racket defines a small core of primitive language constructs, including a powerful macro system, upon which all other features are built. Because of this design, many features of other programming languages can be implemented through libraries, keeping the core language simple without sacrificing power or flexibility. However, slow macro expansion remains a lingering problem in the language's primary implementation, and in fact macro expansion currently dominates compile times for Racket modules and programs. Besides the typical problems associated with slow compile times, such as slower testing feedback, increased mental disruption during the programming process, and unscalable build times for large projects, slow macro expansion carries its own unique problems, such as poorer performance for IDEs and other software analysis tools. In order to improve macro expansion times for Racket, we implement an existing expansion algorithm for R5RS Scheme macros, which comprise a subset of Racket's macro system, and use that implementation to explore optimization opportunities. Our resulting expander appears to be the fastest implementation of a R5RS macro expander in a high-level language and performs several times faster than the existing C-based Racket implementation. Macros R5RS Optimizations Expander Computer Sciences
6	Accelerating Applications with Pattern-specific Optimizations on Accelerators and Coprocessors Chen, Linchuan 08 October 2015 (has links) No description available. Computer Science accelerators pattern-specific optimizations
7	Tools for Performance Optimizations and Tuning of Affine Loop Nests Hartono, Albert January 2009 (has links) No description available. Computer Science compilers loop optimizations parametric tiling loop parallelization wavefront parallelism empirical tuning annotation-based optimizations
8	Efficient Execution Paradigms for Parallel Heterogeneous Architectures Koukos, Konstantinos January 2016 (has links) This thesis proposes novel, efficient execution-paradigms for parallel heterogeneous architectures. The end of Dennard scaling is threatening the effectiveness of DVFS in future nodes; therefore, new execution paradigms are required to exploit the non-linear relationship between performance and energy efficiency of memory-bound application-regions. To attack this problem, we propose the decoupled access-execute (DAE) paradigm. DAE transforms regions of interest (at program-level) in two coarse-grain phases: the access-phase and the execute-phase, which we can independently DVFS. The access-phase is intended to prefetch the data in the cache, and is therefore expected to be predominantly memory-bound, while the execute-phase runs immediately after the access-phase (that has warmed-up the cache) and is therefore expected to be compute-bound. DAE, achieves good energy savings (on average 25% lower EDP) without performance degradation, as opposed to other DVFS techniques. Furthermore, DAE increases the memory level parallelism (MLP) of memory-bound regions, which results in performance improvements of memory-bound applications. To automatically transform application-regions to DAE, we propose compiler techniques to automatically generate and incorporate the access-phase(s) in the application. Our work targets affine, non-affine, and even complex, general-purpose codes. Furthermore, we explore the benefits of software multi-versioning to optimize DAE in dynamic environments, and handle codes with statically unknown access-phase overheads. In general, applications automatically-transformed to DAE by our compiler, maintain (or even exceed in some cases) the good performance and energy efficiency of manually-optimized DAE codes. Finally, to ease the programming environment of heterogeneous systems (with integrated GPUs), we propose a novel system-architecture that provides unified virtual memory with low overhead. The underlying insight behind our work is that existing data-parallel programming models are a good fit for relaxed memory consistency models (e.g., the heterogeneous race-free model). This allows us to simplify the coherency protocol between the CPU – GPU, as well as the GPU memory management unit. On average, we achieve 45% speedup and 45% lower EDP over the corresponding SC implementation. Decoupled Execution Performance Energy DVFS Compiler Optimizations Heterogeneous Coherence
9	Método de otimização topológica aplicado a projeto de moldes utilizados em processos de sinterização por plasma. / Topology optimization method applied to the dies design used in the spark plasma sintering. Vasconcelos, Flávio Marinho 07 February 2013 (has links) A técnica de sinterização por plasma, também conhecida como processo SPS (Spark Plasma Sintering), é um processo para consolidação e sinterização de pós, em que corrente elétrica alternada pulsada e pressão de compactação são aplicadas simultaneamente aos componentes ferramentais (molde, punções, etc.). O molde utilizado neste processo tradicionalmente é cilíndrico, composto por grafite e permite a fabricação de amostras com geometria circular. Esse processo também possibilita a sinterização de um grande número de materiais, em especial, Materiais com Gradação Funcional (MGF). Tendo em vista os aspectos de geometria e composição da amostra, um projeto de otimização de moldes pode ser desenvolvido visando a fabricação de amostras com geometrias e gradação complexas. Com isso, é possível adequar a geometria do molde ao formato e composição da amostra que se deseja sinterizar, visando uma sinterização uniforme. Portanto, o objetivo deste trabalho é o desenvolvimento de uma metodologia para projetos de moldes utilizados na sinterização por plasma. Esta metodologia consiste na implementação de um algoritmo de otimização baseado no Método de Otimização Topológica (MOT), considerando três tipos de abordagem: a primeira abordagem, a qual visa a geometria da amostra, busca obter um molde prismático considerando amostras com geometria arbitrária, como por exemplo quadrada, triangular ou em cruz, com o objetivo de uniformizar o campo de temperaturas na amostra: na segunda abordagem, que considera moldes para a fabricação de amostras (MGF, os moldes podem ser projetados de modo a produzirem um gradiente de temperatura, na direção axial, através da variação da espessura da parede do molde; a terceira abordagem considera um molde constituído por material compósito. Nesta última abordagem é proposto um novo conceito de molde, onde se busca trabalhar não apenas com a geometria, como também com a microestrutura do molde dada por um material anisotrópico. Para a implementação do algoritmo de otimização, um modelo computacional baseado no Método dos Elementos Finitos (MEF), é desenvolvido considerando o processo SPS como um problema de acoplamento eletrotérmico. Na implementação do MOT utiliza-se um modelo de material baseado no SIMP (Solid Isotropic Material with Penalization) e Programação Linear Sequencial (PLS) para resolver o problema de otimização do molde. Todo algoritmo de otimização é implementado na linguagem própria do ambiente Matlab® e o pós-processamento, para verificação e validação dos resultados, é executado no software comercial Comsol®. / The Spark Plasma Sintering (SPS) technique is a powder consolidating and sintering process, in which pulsed DC electric current and pressure loads are applied simultaneously in the tool system components (graphite die, punchers, etc.) in order to perform the sintering process. Generally, a cylindrical graphite die is used for circular samples manufacturing and through this process the sinterization of a large number of materials, including Functionally Graded Materials (FGM), is possible. Considering the geometry and sample material aspects, an optimization die design technique can be developed based on the manufacturing of samples with complex geometry and gradation. Thus, it is possible to adjust the die geometry to the sample geometry or gradation in order to achieve a uniform sinterization. Therefore, the aim of this work is the development of a methodology to be applied in the design of dies used in SPS sintering process. This methodology consists of implementing an optimization algorithm based on the Topology Optimization Method (TOM), considering three approaches: in the first one a prismatic die is designed to process a sample with arbitrary geometry, for example square, triangular and cross sample; in the second approach the change of the die wall thickness is considered to achieve a predefined temperature gradient in the gradation direction of MGF samples and the third approach the same previous objective is considered, however the focus is the optimization of thermal conductive fibers. In the latest approach, a new die concept is proposed, where the objective is to optimize not only the die geometry but he microstructure considering a die composed by an anisotropic material. T implement the optimization algorithm a computational model based on the Finite Element Method (FEM) is developed considering the SPS process as an electrothermal coupled problem. In the TOM implementation a material model based on SIMP (Solid Isotropic Material with Penalization) is adopted and the Sequential Linear Programming is used to solve the optimization problem. The optimization algorithm is implemented using the Matlab® environment and the pos-processing, for verification and validation of the obtained results is carried out by using Comsol®. Gabaritos e moldes Jigs and dies Sintering Sinterização Topologia (Otimização) Topology (Optimizations)
10	Predictor Virtualization: Teaching Old Caches New Tricks Burcea, Ioana Monica 20 August 2012 (has links) To improve application performance, current processors rely on prediction-based hardware optimizations, such as data prefetching and branch prediction. These hardware optimizations store application metadata in on-chip predictor tables and use the metadata to anticipate and optimize for future application behavior. As application footprints grow, the predictor tables need to scale for predictors to remain effective. One important challenge in processor design is to decide which hardware optimizations to implement and how much resources to dedicate to a specific optimization. Traditionally, processor architects employ a one-size-fits-all approach when designing predictor-based hardware optimizations: for each optimization, a fixed portion of the on-chip resources is allocated to the predictor storage. This approach often leads to sub-optimal designs where: 1) resources are wasted for applications that do not benefit from a particular predictor or require only small predictor tables, or 2) predictors under-perform for applications that need larger predictor tables that can not be built due to area-latency-power constraints. This thesis introduces Predictor Virtualization (PV), a framework that uses the traditional processor memory hierarchy to store application metadata used in speculative hardware optimizations. This allows to emulate large, more accurate predictor tables, which, in return, leads to higher application performance. PV exploits the current trend of unprecedentedly large on- chip secondary caches and allocates on demand a small portion of the cache capacity to store application metadata used in hardware optimizations, adjusting to the application’s need for predictor resources. As a consequence, PV is a pay-as-you-go technique that emulates large predictor tables without increasing the dedicated storage overhead. To demonstrate the benefits of virtualizing hardware predictors, we present virtualized designs for three different hardware optimizations: a state-of-the-art data prefetcher, conventional branch target buffers and an object-pointer prefetcher. While each of these hardware predictors exhibit different characteristics that lead to different virtualized designs, virtualization improves the cost-performance trade-off for all these optimizations. PV increases the utility of traditional processor caches: in addition to being accelerators for slow off-chip memories, on-chip caches are leveraged for increasing the effectiveness of predictor-based hardware optimizations. predictor virtualization hardware optimizations processor caches 0984 0544

Search results