11 |
Performance Optimization Techniques and Tools for Data-Intensive Computation Platforms : An Overview of Performance Limitations in Big Data Systems and Proposed Optimizations
Kalavri, Vasiliki, January 2014
Big data processing has recently gained a lot of attention from both academia and industry. The term refers to tools, methods, techniques and frameworks built to collect, store, process and analyze massive amounts of data. Big data can be structured, unstructured or semi-structured. Data is generated from many different sources and can arrive in the system at various rates. To process these large amounts of heterogeneous data inexpensively and efficiently, massive parallelism is often used. The common architecture of a big data processing system consists of a shared-nothing cluster of commodity machines. However, even in such a highly parallel setting, processing is often very time-consuming. Applications may take hours or even days to produce useful results, making interactive analysis and debugging cumbersome. One of the main problems is that good performance requires both good data locality and good resource utilization. A characteristic of big data analytics is that the amount of data processed is typically large in comparison with the amount of computation done on it. In this case, processing can benefit from data locality, which can be achieved by moving the computation close to the data, rather than vice versa. Good utilization of resources means that the data processing is done with maximal parallelization. Both locality and resource utilization are aspects of the programming framework’s runtime system. Requiring the programmer to work explicitly with parallel process creation and process placement is not desirable. Thus, specifying good optimizations that relieve the programmer from low-level, error-prone instrumentation to achieve good performance is essential. The main goal of this thesis is to study, design and implement performance optimizations for big data frameworks. This work contributes methods and techniques to build tools for easy and efficient processing of very large data sets.
It describes ways to make systems faster by shortening job completion times. Another major goal is to facilitate application development on distributed data-intensive computation platforms and make big-data analytics accessible to non-experts, so that users with limited programming experience can benefit from analyzing enormous datasets. The thesis provides results from a study of existing optimizations in MapReduce and Hadoop-related systems. The study presents a comparison and classification of existing systems, based on their main contribution. It then summarizes the current state of the research field and identifies trends and open issues, while also providing our vision on future directions. Next, this thesis presents a set of performance optimization techniques and corresponding tools for data-intensive computing platforms. The first is PonIC, a project that ports the high-level dataflow framework Pig on top of the data-parallel computing framework Stratosphere. The results of this work show that Pig can benefit greatly from using Stratosphere as the backend system and gain performance without any loss of expressiveness. The work also identifies the features of Pig that negatively impact execution time and presents a way of integrating Pig with different backends. The second is HOP-S, a system that uses in-memory random sampling to return approximate, yet accurate query answers. It uses a simple yet efficient implementation of random sampling, which significantly improves the accuracy of online aggregation. The third is an optimization that exploits computation redundancy in analysis programs, together with m2r2, a system that stores intermediate results and uses plan matching and rewriting to reuse results in future queries. Our prototype on top of the Pig framework demonstrates significantly reduced query response times. The last is an optimization framework for iterative fixed points, which exploits asymmetry in large-scale graph analysis.
The framework uses a mathematical model to explain several optimizations and to formally specify the conditions under which optimized iterative algorithms are equivalent to the general solution. / QC 20140605
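The in-memory random sampling mentioned for HOP-S can be illustrated with a small sketch. The abstract does not say which sampling scheme HOP-S uses, so the code below substitutes standard reservoir sampling (Algorithm R) as an assumption; the function names `reservoir_sample` and `approximate_mean` are hypothetical and only show how a fixed-size uniform sample can yield an approximate aggregate over a stream.

```python
import random


def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of up to k items from a stream (Algorithm R).

    Assumption: this is a generic stand-in, not the sampling used by HOP-S.
    """
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def approximate_mean(stream, k, seed=0):
    """Estimate the mean of a numeric stream from a k-item reservoir."""
    sample = reservoir_sample(stream, k, random.Random(seed))
    if not sample:
        raise ValueError("cannot estimate the mean of an empty stream")
    return sum(sample) / len(sample)
```

A larger `k` trades memory for a tighter estimate; the sample stays in memory, so an estimate can be reported at any point while the stream is still being consumed.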
|
12 |
ADAPT : architectural and design exploration for application specific instruction-set processor technologies
Shee, Seng Lin, Computer Science & Engineering, Faculty of Engineering, UNSW, January 2007
This thesis presents design automation methodologies for extensible processor platforms in application-specific domains. The work first presents a single-processor approach to customization: a methodology that can rapidly create different processor configurations by removing unused instructions from the architecture. A profile-directed approach is used to identify frequently used instructions and to eliminate unused opcodes from the available instruction pool. A coprocessor approach is explored next to create an SoC (System-on-Chip) that speeds up the application while reducing energy consumption. Loops in applications are identified and accelerated by tightly coupling a coprocessor to an ASIP (Application Specific Instruction-set Processor). Latency hiding is used to exploit the parallelism provided by this architecture. A case study has been performed on a JPEG encoding algorithm, comparing two different coprocessor approaches: a high-level synthesis approach and our custom coprocessor approach. The thesis concludes by introducing a heterogeneous multiprocessor system using ASIPs as processing entities in a pipeline configuration. The problem of mapping each algorithmic stage in the system to an ASIP configuration is formulated. We propose an estimation technique to calculate runtimes of the configured multiprocessor system without running cycle-accurate simulations, which could take a significant amount of time. We present two heuristics to efficiently search the design space of a pipeline-based multi-ASIP system and compare the results against an exhaustive approach. In our first approach, we show that, on average, processor size can be reduced by 30% and energy consumption by 24%, while performance is improved by 24%.
In the coprocessor approach, compared with the use of a main processor alone, a loop performance improvement of 2.57x is achieved using the custom coprocessor approach, as against 1.58x for the high-level synthesis method and 1.33x for the customized instruction approach. Energy savings are 57%, 28% and 19%, respectively. Our multiprocessor design provides a performance improvement of at least 4.03x for JPEG and 3.31x for MP3 relative to a single-processor design. The minimum cost obtained using our heuristic was within 0.43% and 0.29% of the optimum values for the JPEG and MP3 benchmarks, respectively.
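The profile-directed removal of unused opcodes described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions: the opcode names, the profile format (a mapping from opcode to execution count) and the function `trim_instruction_set` are hypothetical and are not taken from the thesis.

```python
def trim_instruction_set(instruction_set, profile, keep=frozenset()):
    """Split an instruction set into kept and removed opcodes.

    instruction_set: iterable of opcode names available in the base processor.
    profile: dict mapping opcode name to how often the application executed it.
    keep: opcodes that must stay regardless of the profile (hypothetical knob).
    """
    kept = sorted(
        op for op in instruction_set
        if profile.get(op, 0) > 0 or op in keep
    )
    removed = sorted(set(instruction_set) - set(kept))
    return kept, removed


# Example with made-up opcodes and counts.
isa = {"add", "sub", "mul", "div", "ld", "st"}
profile = {"add": 120, "ld": 80, "st": 40, "mul": 0}
kept, removed = trim_instruction_set(isa, profile)
```

In this toy run the opcodes that the profile never saw executed end up in `removed`; a real flow would also have to account for instructions needed by code paths the profile did not exercise.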
|
13 |
Reducing DRAM Row Activations with Eager Writeback
Jeon, Myeongjae, 06 September 2012
This thesis describes and evaluates a new approach to optimizing DRAM performance and energy consumption that is based on eagerly writing dirty cache lines to DRAM. Under this approach, dirty cache lines that have not been recently accessed are eagerly written to DRAM when the corresponding row has been activated by an ordinary access, such as a read. This approach enables clustering of reads and writes that target the same row, resulting in a significant reduction in row activations. Specifically, for 29 applications, it reduces the number of DRAM row activations by an average of 38% and a maximum of 81%. The results from a full system simulator show that for the 29 applications, 11 have performance improvements between 10% and 20%, and 9 have improvements in excess of 20%. Furthermore, 10 consume between 10% and 20% less DRAM energy, and 10 have energy consumption reductions in excess of 20%.
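The effect of eager writeback on row activations can be seen in a toy model. The sketch below assumes a single DRAM bank with an open-row policy and a hypothetical trace format of `("dirty", row)`, `("read", row)` and `("evict", row)` events; it ignores the "not recently accessed" condition from the abstract and is not the simulator used in the thesis.

```python
def count_activations(trace, eager):
    """Count row activations for a trace on one DRAM bank (open-row policy).

    trace events (hypothetical format):
      ("dirty", row): a cache line mapping to `row` becomes dirty
      ("read", row):  an ordinary read that must access `row`
      ("evict", row): a line of `row` is evicted; if still dirty it is written
    eager: if True, pending dirty lines of a row are written back as soon as
           that row is opened by a read, so later evictions cost nothing.
    """
    open_row = None
    activations = 0
    dirty = {}  # row -> number of dirty lines not yet written back

    def access(row):
        nonlocal open_row, activations
        if row != open_row:
            activations += 1
            open_row = row

    for op, row in trace:
        if op == "dirty":
            dirty[row] = dirty.get(row, 0) + 1
        elif op == "read":
            access(row)
            if eager:
                dirty.pop(row, None)  # write them while the row is open
        elif op == "evict":
            if dirty.get(row, 0) > 0:
                dirty[row] -= 1
                access(row)  # late writeback may reopen the row
    return activations


trace = [("dirty", 1), ("read", 1), ("read", 2), ("evict", 1), ("read", 2)]
baseline = count_activations(trace, eager=False)
with_eager = count_activations(trace, eager=True)
```

In this trace the late writeback to row 1 forces row 1 to be reopened and then row 2 to be reopened again, which the eager variant avoids by writing the dirty line during the first read of row 1.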
|
14 |
Maestro: Achieving scalability and coordination in centralized network control plane
January 2012
The modern network control plane, which supports versatile communication services (e.g., performance differentiation, access control, virtualization), is highly complex. Different control components, such as routing protocols, security policy enforcers, resource allocation planners and quality of service modules, interact with each other in the control plane to realize complicated control objectives. These control components need to coordinate their actions, and sometimes they have conflicting goals that require careful handling. Furthermore, many of these existing components are distributed protocols running on a large number of network devices. Because protocol state is distributed in the network, it is very difficult to tightly coordinate the actions of these distributed control components, so inconsistent control actions could create serious problems in the network. As a result, this complexity makes it very difficult to ensure optimality and consistency among all the different components. To address the complexity problem in the network control plane, researchers have proposed different approaches, and among these the centralized control plane architecture has become widely accepted as a key to solving the problem. By centralizing the control functionality into a single management station, we can minimize the state distributed in the network and thus have better control over the consistency of that state. However, the centralized architecture has fundamental limitations. First, the centralized architecture is more difficult to scale up to large network sizes or high request rates. In addition, it is equally important to service requests fairly and maintain low request-handling latency while at the same time providing highly scalable throughput. Second, centralized routing control is neither as responsive nor as robust to failures as distributed routing protocols.
To enhance responsiveness and robustness, one approach is to coordinate the centralized control plane with distributed routing protocols. In this thesis, we develop a centralized network control system, called Maestro, to address the fundamental limitations of the centralized network control plane. First, we use Maestro as the central controller for a flow-based routing network, in which a large number of requests are sent to the controller at a very high rate for processing. Such a network requires the central controller to be extremely scalable. Using Maestro, we systematically explore and study multiple design choices to optimally utilize modern multi-core processors, to fairly distribute computation resources, and to efficiently amortize unavoidable overhead. We show that a Maestro design based on the abstraction that each individual thread services switches in a round-robin manner can achieve excellent throughput scalability while maintaining far superior and near-optimal max-min fairness. At the same time, low latency even at high throughput is achieved by Maestro's workload-adaptive request batching. Second, we use Maestro to coordinate centralized controls and distributed routing protocols in a network, realizing a hybrid control plane framework that is more responsive and robust than a purely centralized control plane, and more globally optimized and consistent than a purely distributed control plane. Effectively, we get the advantages of both the centralized and the distributed solutions. Through experimental evaluations, we show that such coordination between the centralized controls and distributed routing protocols can improve the SLA compliance of the entire network.
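The round-robin servicing of switches with request batching can be sketched as follows. This is a single-threaded illustration under stated assumptions: the per-switch queue layout, the fixed `batch_size` and the function `round_robin_service` are hypothetical, whereas Maestro itself is described as multi-threaded with workload-adaptive batching, whose details the abstract does not give.

```python
from collections import deque


def round_robin_service(queues, batch_size):
    """Return the order in which requests are serviced.

    queues: dict mapping switch id to a list of pending requests.
    batch_size: maximum number of requests taken from one switch per visit
                (fixed here; an adaptive scheme would vary it with load).
    """
    pending = {switch: deque(reqs) for switch, reqs in queues.items()}
    order = []
    while any(pending.values()):
        for switch, queue in pending.items():
            # Take at most one batch, then move on to the next switch.
            for _ in range(min(batch_size, len(queue))):
                order.append((switch, queue.popleft()))
    return order


order = round_robin_service({"s1": [1, 2, 3, 4], "s2": [10]}, batch_size=2)
```

Because the busy switch `s1` gives up its turn after each batch, the single request from `s2` is handled after two requests rather than after four, which is the fairness property the abstract refers to; the batch amortizes per-visit overhead.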
|
15 |
Computational design, fabrication, and characterization of microarchitectured solid oxide fuel cells with improved energy efficiency
Yoon, Chan, 07 July 2010
Electrodes in a solid oxide fuel cell (SOFC) must possess both adequate porosity and electronic conductivity to perform their functions in the cell. They must be porous to permit rapid mass transport of reactant and product gases and sufficiently conductive to permit efficient electron transfer. However, it is nearly impossible to simultaneously control porosity and conductivity using conventional design and fabrication techniques. In this dissertation, computational design and performance optimization of microarchitectured SOFCs are first investigated in order to achieve higher power density and thus higher efficiency than currently attainable in state-of-the-art SOFCs. This involves a coupled multiphysics simulation of mass transport, electrochemical charge transfer reaction, and current balance as a function of SOFC microarchitecture. Next, the fabrication of microarchitectured SOFCs consistent with the computational designs is addressed, based on anode-supported SOFC button cells and using the laser ablation technique. Finally, the performance of a fabricated SOFC unit cell is characterized and compared against the performance predicted by the computational model. The results show that microarchitectured SOFCs outperform the baseline structure and that the measured experimental data match the simulation results well.
|
16 |
Control performance assessment of run-to-run control system used in high-mix semiconductor manufacturing
Jiang, Xiaojing, 04 October 2012
Control performance assessment (CPA) is an important tool for realizing high-performance control systems in manufacturing plants. CPA of both continuous and batch processes has attracted much attention from researchers, but only a few results for semiconductor processes have been proposed previously. This work provides methods for performance assessment and diagnosis of the run-to-run control system used in high-mix semiconductor manufacturing processes.
First, the output error source of processes with a run-to-run EWMA controller is analyzed, and a CPA method (namely CPA I) is proposed based on closed-loop parameter estimation. In CPA I, ARMAX regression is applied directly to the process output error, and the performance index is defined based on the variance of the regression results. The influence of plant model mismatch in the process gain and the disturbance model parameter on control performance is studied for cases with and without set point change. The CPA I method is applied to diagnose plant model mismatch in the case with set point change.
Second, an advanced CPA method (namely CPA II) is developed to assess control performance degradation in the case without set point change. An estimated disturbance is generated by a filter, and ARMAX regression is applied to the estimated disturbance to assess the control performance. The influence of plant model mismatch, improper controller tuning, metrology delay, and high-mix process parameters is studied, and the results show that the CPA II method can quickly identify, diagnose and correct control performance degradation.
The CPA II method is applied to industrial data from a high-mix photolithography process at Texas Instruments, and the influence of metrology delay and plant model mismatch is discussed. A control performance optimization (CPO) method based on analysis of the estimated disturbance is proposed, and an optimal EWMA controller tuning factor is suggested.
Finally, the CPA II method is applied to a non-threaded run-to-run controller that is based on state estimation and a Kalman filter. Overall process control performance and state estimation behavior are assessed. The influence of plant model mismatch and improper selection of different controller variables is studied.
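For readers unfamiliar with the run-to-run EWMA controller that this abstract assesses, the sketch below simulates the textbook form of that controller on a noise-free linear process. The process model `y = alpha + beta * u`, the parameter names and the function `simulate_ewma_r2r` are assumptions made for illustration; the thesis's own process models, disturbance models and CPA indices are not reproduced here.

```python
def simulate_ewma_r2r(alpha, beta, b_model, target, lam, runs, a0=0.0):
    """Simulate a run-to-run EWMA controller on a noise-free linear process.

    True process (assumed):  y = alpha + beta * u
    Controller model:        y = a_hat + b_model * u
    lam is the EWMA tuning factor (0 < lam <= 1); a0 is the initial
    intercept estimate. Returns the list of process outputs per run.
    """
    a_hat = a0
    outputs = []
    for _ in range(runs):
        # Choose the recipe that the current model says will hit the target.
        u = (target - a_hat) / b_model
        y = alpha + beta * u
        # EWMA update of the intercept estimate from the observed output.
        a_hat = lam * (y - b_model * u) + (1 - lam) * a_hat
        outputs.append(y)
    return outputs


matched = simulate_ewma_r2r(alpha=2.0, beta=1.0, b_model=1.0,
                            target=10.0, lam=0.3, runs=50)
mismatched = simulate_ewma_r2r(alpha=2.0, beta=1.0, b_model=1.25,
                               target=10.0, lam=0.3, runs=100)
```

With a matched gain the output error shrinks by a factor of `1 - lam` each run; with a mismatched gain (`b_model` different from `beta`) it shrinks by `1 - lam * beta / b_model`, which is one way to see why plant model mismatch and the tuning factor both affect control performance.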
|
17 |
The Case For Hardware Overprovisioned Supercomputers
Patki, Tapasya, January 2015
Power management is one of the most critical challenges on the path to exascale supercomputing. High Performance Computing (HPC) centers today are designed to be worst-case power provisioned, leading to two main problems: limited application performance and under-utilization of procured power. In this dissertation we introduce hardware overprovisioning: a novel, flexible design methodology for future HPC systems that addresses the aforementioned problems and leads to significant improvements in application and system performance under a power constraint. We first establish that choosing the right configuration based on application characteristics when using hardware overprovisioning can improve application performance under a power constraint by up to 62%. We conduct a detailed analysis of the infrastructure costs associated with hardware overprovisioning and show that it is an economically viable supercomputing design approach. We then develop RMAP (Resource MAnager for Power), a power-aware, low-overhead, scalable resource manager for future hardware overprovisioned HPC systems. RMAP addresses the issue of under-utilized power by using power-aware backfilling and improves job turnaround times by up to 31%. This dissertation opens up several new avenues for research in power-constrained supercomputing as we venture toward exascale, and we conclude by enumerating these.
|
18 |
Optimizing System Performance and Dependability Using Compiler Techniques
Rajagopalan, Mohan, January 2006
As systems become more complex, there are increasing demands for improvement with respect to attributes such as performance, dependability, and security. Optimization is defined as the process of making the most effective use of a set of resources with respect to some attribute. Existing optimization techniques, however, have two fundamental limitations. They target individual parts of a system without considering the potentially significant global picture, and they are designed to improve a single attribute at a time. These limitations impose significant restrictions on the kinds of optimization possible, the effectiveness of the techniques, and the ability to improve the optimization process itself. This dissertation presents holistic system optimization, a new approach to optimization based on taking a broad view of a system. Unlike current approaches, holistic optimizations consider different kinds of interactions at multiple levels in a system, and target a variety of metrics uniformly. A key component of this research has been the use of proven compiler techniques to ensure transparency, automation, and correctness. These techniques have been implemented in Cassyopia, a software prototype of a framework for performing holistic optimization. The core of this work is three new holistic optimizations, which are also presented. The first describes profile-directed static optimizations designed to improve the performance of event-based programs by spanning boundaries that separate code that raises events from handlers that field them. The second, system call clustering, improves the system call behavior of an entire program by grouping together calls that can be executed in a single boundary crossing. In this case, the optimization spans kernel and user address spaces. Finally, authenticated system calls optimize system security through a novel implementation of an efficient system call monitor.
This example demonstrates how the new approach can be used to create new optimizations that not only span address space boundaries but also target attributes such as dependability. All of these optimizations involve the application of standard compiler techniques in non-traditional contexts and demonstrate how systems can be improved beyond what is possible using existing techniques.
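The benefit of system call clustering can be shown with a toy count of boundary crossings. The sketch assumes a hypothetical trace in which each entry is either a system call name or the marker `"compute"` for intervening user-space work, and it treats any run of adjacent calls as clusterable; the compiler analysis that decides which calls may legally be grouped, and that moves code to make calls adjacent, is not modeled.

```python
from itertools import groupby


def count_crossings(trace, clustered):
    """Count user/kernel boundary crossings for a trace of operations.

    trace: list of strings; "compute" marks user-space work, anything else
           is treated as a system call (hypothetical format).
    clustered: if True, each run of adjacent system calls costs one crossing.
    """
    def is_call(op):
        return op != "compute"

    if not clustered:
        # One crossing per system call.
        return sum(1 for op in trace if is_call(op))
    # One crossing per maximal run of adjacent system calls.
    return sum(1 for call_run, _ in groupby(trace, key=is_call) if call_run)


trace = ["open", "read", "close", "compute", "write", "write"]
separate = count_crossings(trace, clustered=False)
grouped = count_crossings(trace, clustered=True)
```

Here five individual crossings collapse into two, one for the `open`/`read`/`close` run and one for the two `write` calls.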
|
19 |
Aurora : seamless optimization of openMP applications / Aurora: Otimização Transparente de Aplicações OpenMP
Lorenzon, Arthur Francisco, January 2018
Efficiently exploiting thread-level parallelism has been challenging for software developers. As many parallel applications do not scale with the number of cores, blindly increasing the number of threads may not produce the best results in performance or energy. However, correctly choosing the ideal number of threads is not straightforward: many variables are involved (e.g., off-chip bus saturation and data synchronization overhead), and they change according to different aspects of the system at hand (e.g., input set, micro-architecture) and even during execution. To address this complex scenario, this thesis presents Aurora. It is capable of automatically finding, at run time and with minimum overhead, the optimal number of threads for each parallel region of the application, and of re-adapting when the behavior of a region changes during execution. Aurora works with OpenMP and is completely transparent to both designer and end user: given an OpenMP application binary, Aurora optimizes it without any code transformation or recompilation. By executing fifteen well-known benchmarks on four multi-core processors, we show that Aurora improves the trade-off between performance and energy by up to 98% over the standard OpenMP execution, 86% over the built-in feature of OpenMP that dynamically adjusts the number of threads, and 91% over a feedback-driven threading emulation.
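The idea of searching for a good thread count per parallel region can be sketched with a simple hill climb. The abstract does not describe Aurora's actual search algorithm, so the doubling strategy, the function `find_best_threads` and the synthetic cost model below are all assumptions; `measure_cost` stands in for whatever metric is being minimized (for example an energy-delay product measured over one execution of the region).

```python
def find_best_threads(measure_cost, max_threads):
    """Pick a thread count by doubling while the measured cost keeps improving.

    measure_cost: callable taking a thread count and returning a cost to
                  minimize (hypothetical; lower is better).
    max_threads:  largest thread count allowed, e.g. the number of cores.
    """
    best_n, best_cost = 1, measure_cost(1)
    n = 2
    while n <= max_threads:
        cost = measure_cost(n)
        if cost >= best_cost:
            break  # adding threads stopped paying off
        best_n, best_cost = n, cost
        n *= 2
    return best_n


def synthetic_cost(n):
    """Made-up cost: parallel speedup term plus a per-thread overhead term."""
    return 100 / n + 2 * n


best = find_best_threads(synthetic_cost, max_threads=16)
```

For this made-up cost curve the search settles on 8 threads, since going to 16 raises the cost again; a run-time system would repeat such a search per region and rerun it when a region's behavior changes.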
|