81 |
A Runtime Framework for Regular and Irregular Message-Driven Parallel Applications on GPU Systems. Rengasamy, Vasudevan, January 2014 (has links) (PDF)
The effective use of GPUs for accelerating applications depends on a number of factors, including effective asynchronous use of heterogeneous resources, reducing data transfer between CPU and GPU, increasing the occupancy of GPU kernels, overlapping data transfers with computations, reducing GPU idling, and kernel optimizations. Overcoming these challenges requires considerable effort on the part of application developers. Moreover, most optimization strategies are proposed and tuned for individual applications.
Message-driven execution with over-decomposition of tasks constitutes an important model for parallel programming and provides multiple benefits, including communication-computation overlap and reduced idling of resources. Charm++ is one such message-driven language; it employs over-decomposition of tasks, computation-communication overlap, and a measurement-based load balancer to achieve high CPU utilization. This research has developed an adaptive runtime framework for efficient execution of Charm++ message-driven parallel applications on GPU systems.
In the first part of our research, we developed a runtime framework, G-Charm, focused primarily on optimizing regular applications. At runtime, G-Charm automatically combines multiple small GPU tasks into a single larger kernel, which reduces the number of kernel invocations while improving CUDA occupancy. G-Charm also enables reuse of data already present in GPU global memory, performs GPU memory management, and dynamically schedules tasks across the CPU and GPU in order to reduce idle time. To combine the partial results obtained from computations performed on the CPU and GPU, G-Charm allows the user to specify an operator with which the partial results are combined at runtime. We also perform compile-time code generation to reduce programming overhead. For Cholesky factorization, a regular parallel application, G-Charm provides a 14% improvement over a highly tuned implementation.
In the second part of our research, we extended our runtime to overcome the challenges presented by irregular applications, such as aperiodic generation of tasks, irregular memory access patterns, and workloads that vary during application execution. We developed models for deciding the number of tasks that can be combined into a kernel based on the rate of task generation and the GPU occupancy of the tasks. For irregular applications, data reuse results in uncoalesced GPU memory access, so we evaluated the effect of altering the global memory access pattern to improve coalescing. We also developed adaptive methods for hybrid execution on CPU and GPU that take the varying workloads into account when scheduling tasks across the two devices. We demonstrate that our dynamic strategies result in 8-38% reduction in execution time for an N-body simulation application and a molecular dynamics application over the corresponding static strategies, which are suited to regular applications.
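The kernel-combining idea generalizes beyond G-Charm and can be illustrated with a small CUDA sketch. The code below is not G-Charm's implementation; the task structure, sizes, and kernel are illustrative assumptions. Instead of launching one kernel per small task, task descriptors are packed into an array and a single kernel is launched in which each thread block processes one task, cutting launch overhead and raising occupancy.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative task descriptor: each "task" scales one segment of a vector.
struct Task {
    int offset;   // start index of this task's data
    int size;     // number of elements
    float alpha;  // per-task scale factor
};

// One combined kernel: each block handles one task, avoiding
// per-task kernel-launch overhead and improving occupancy.
__global__ void batchedScale(const Task* tasks, float* data) {
    const Task t = tasks[blockIdx.x];
    for (int i = threadIdx.x; i < t.size; i += blockDim.x)
        data[t.offset + i] *= t.alpha;
}

int main() {
    const int numTasks = 64, taskSize = 256;
    const int n = numTasks * taskSize;

    float* dData;  Task* dTasks;
    cudaMalloc(&dData, n * sizeof(float));
    cudaMalloc(&dTasks, numTasks * sizeof(Task));

    // Build the host-side task list and data, then copy to the GPU.
    Task hTasks[numTasks];
    float* hData = new float[n];
    for (int i = 0; i < n; ++i) hData[i] = 1.0f;
    for (int t = 0; t < numTasks; ++t)
        hTasks[t] = Task{t * taskSize, taskSize, float(t)};
    cudaMemcpy(dData, hData, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dTasks, hTasks, numTasks * sizeof(Task), cudaMemcpyHostToDevice);

    // One launch covers all 64 small tasks.
    batchedScale<<<numTasks, 128>>>(dTasks, dData);
    cudaDeviceSynchronize();

    cudaMemcpy(hData, dData, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("task 3, element 0: %f\n", hData[3 * taskSize]);  // expect 3.0

    delete[] hData;
    cudaFree(dData); cudaFree(dTasks);
    return 0;
}
```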
|
82 |
應用機器學習預測利差交易的收益 / Application of machine learning to predicting the returns of carry trade. 吳佳真, Unknown Date (has links)
This research derives an artificial neural network (ANN) mechanism for timely and effectively predicting the returns of carry trade. To achieve timeliness, the ANN mechanism is implemented on the infrastructure of TensorFlow and graphics processing units (GPUs). Furthermore, the ANN mechanism needs to cope with time series data that may exhibit concept drift and contain outliers. An experiment is also designed to verify the timeliness and effectiveness of the proposed mechanism.
During the experiment, we found that the parameter settings of the algorithm affect the performance of the neural network, and this research discusses the results obtained under different settings. The experimental results show that the proposed ANN mechanism can predict the movement of carry trade returns well. We hope this research will contribute to both the machine learning and finance fields.
|
83 |
Performance Analysis of Non Local Means Algorithm using Hardware Accelerators. Antony, Daniel Sanju, January 2016 (has links) (PDF)
Image de-noising forms an integral part of image processing. It is used as a standalone algorithm for improving the quality of images obtained from a camera, as well as a preliminary stage for image processing applications such as face recognition and super-resolution. Non Local Means (NL-Means) and the Bilateral Filter are two computationally complex de-noising algorithms that can provide good de-noising results. Due to their computational complexity, the real-time applications of these filters are limited.
In this thesis, we propose the use of hardware accelerators such as GPUs (Graphics Processing Units) and FPGAs (Field Programmable Gate Arrays) to speed up filter execution and to implement these filters efficiently. The GPU-based implementations are written in the Open Computing Language (OpenCL). The basic objective of this research is to perform high-speed de-noising without compromising quality. We implement a basic NL-Means filter, a fast NL-Means filter, and a Bilateral filter using Gauss polynomial decomposition on the GPU. We also propose a modification to the existing NL-Means algorithm and the Gauss polynomial Bilateral filter: instead of the Gaussian spatial kernel used in the standard algorithms, a box spatial kernel is introduced to improve execution speed. This research work is a step towards making real-time implementation of these algorithms possible. Results show that the NL-Means implementation on GPU using OpenCL is about 25x faster than a conventional CPU-based implementation for larger images (1024x1024), and the GPU-based fast NL-Means is about 90x faster than its CPU implementation. Even with the improved execution time, embedded applications of NL-Means are limited by the power and thermal restrictions of the GPU device. In order to create a faster, lower-power implementation, we implemented the algorithm on an FPGA. FPGAs are reconfigurable devices and enable us to create a custom architecture for parallel execution of the algorithm. The FPGA execution time for smaller images (256x256) is about 200x faster than the CPU implementation and about 25x faster than the GPU execution. Moreover, the power requirement of the FPGA design (0.53 W) is far below that of the CPU (30 W) and the GPU (200 W).
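The box-spatial-kernel modification can be sketched in a few lines of CUDA (shown here instead of the thesis's OpenCL, and with assumed parameter names; this is an illustration, not the thesis code). The patch-similarity weight keeps the exponential range kernel, while every pixel of a patch contributes equally to the patch distance, removing the per-pixel Gaussian spatial weights:

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// NL-Means with a box spatial kernel: every pixel of a patch contributes
// equally to the patch distance. One thread de-noises one pixel.
__global__ void nlMeansBox(const float* in, float* out, int W, int H,
                           int S /*search radius*/, int P /*patch radius*/,
                           float h /*filtering strength*/) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;

    float num = 0.0f, den = 0.0f;
    for (int dy = -S; dy <= S; ++dy) {
        for (int dx = -S; dx <= S; ++dx) {
            int cx = min(max(x + dx, 0), W - 1);
            int cy = min(max(y + dy, 0), H - 1);
            // Box-weighted squared distance between the two patches.
            float d2 = 0.0f;
            for (int py = -P; py <= P; ++py)
                for (int px = -P; px <= P; ++px) {
                    int ax = min(max(x + px, 0), W - 1);
                    int ay = min(max(y + py, 0), H - 1);
                    int bx = min(max(cx + px, 0), W - 1);
                    int by = min(max(cy + py, 0), H - 1);
                    float diff = in[ay * W + ax] - in[by * W + bx];
                    d2 += diff * diff;              // uniform (box) weight
                }
            int area = (2 * P + 1) * (2 * P + 1);
            float w = expf(-d2 / (area * h * h));   // range kernel unchanged
            num += w * in[cy * W + cx];
            den += w;
        }
    }
    out[y * W + x] = num / den;
}

int main() {
    const int W = 64, H = 64, n = W * H;
    float* hImg = new float[n];
    for (int i = 0; i < n; ++i) hImg[i] = 0.5f;  // flat test image

    float *dIn, *dOut;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, hImg, n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16), grid((W + 15) / 16, (H + 15) / 16);
    nlMeansBox<<<grid, block>>>(dIn, dOut, W, H, 10, 3, 0.1f);
    cudaMemcpy(hImg, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("de-noised flat image, pixel(0,0) = %f\n", hImg[0]);  // stays 0.5

    delete[] hImg;
    cudaFree(dIn); cudaFree(dOut);
    return 0;
}
```

The triple loop makes the cost per pixel O(S²P²), which is exactly why these filters are moved off the CPU in the first place.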
|
84 |
Benchmark pro zařízení s podporou OpenGL ES 3.0 / Benchmark for OpenGL ES 3.0 Devices. Kimer, Tomáš, January 2014 (has links)
This thesis deals with the development of a benchmark application for OpenGL ES 3.0 devices using realistic real-time rendering of 3D scenes. The first part covers the history and new features of the OpenGL ES 3.0 graphics library. The next part briefly describes selected algorithms for realistic real-time rendering of 3D scenes that can be implemented using the new features of the discussed library. The design of the benchmark application is covered next, including the design of an online result database containing detailed device specifications. The last part covers the implementation on the Android and Windows platforms and the testing on mobile devices after publishing the application on Google Play. Finally, the results and possibilities of further development are discussed.
|
85 |
On GPU Assisted Polar Decoding: Evaluating the Parallelization of the Successive Cancellation Algorithm using Graphics Processing Units / Polärkodning med hjälp av GPU:er: En utvärdering av parallelliseringsmöjligheterna av Successive Cancellation-algoritmen med hjälp av grafikprocessorer. Nordqvist, Siri, January 2023 (has links)
In telecommunication, messages sent through a wireless medium often experience noise that interferes with the signal and corrupts the messages. As the demand for high throughput in mobile networks increases, algorithms that can detect and correct these corrupted messages quickly and accurately are of interest to the industry. Polar codes have been chosen by the Third Generation Partnership Project (3GPP) as the error correction code for 5G New Radio control channels. This thesis investigates whether the Successive Cancellation (SC) polar decoding algorithm can be parallelized and whether a graphics processing unit (GPU) can be used to optimize the algorithm's execution time. The SC decoder was enhanced with tree pruning and with GPU support to leverage parallelization, and the difference in execution time between the concurrent and sequential versions of the SC algorithm, with and without tree pruning, was evaluated. The tree-pruning SC algorithm almost always offered shorter execution times than the SC algorithm without tree pruning. However, GPU support did not reduce the execution time in these tests; based on these results, it is therefore not certain that a GPU can improve this type of enhanced SC algorithm.
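The two butterfly operations at the heart of SC decoding are what a GPU implementation parallelizes across LLR pairs. Below is a minimal CUDA sketch of these operations (illustrative, not the thesis implementation); the min-sum approximation f(a, b) = sign(a) sign(b) min(|a|, |b|) and the partial-sum update g(a, b, u) = b + (1 - 2u) a are the standard formulations.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// One SC decoder stage: each thread processes one LLR pair.
// f (min-sum): combines two LLRs on the way down the left branch.
__global__ void scF(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float s = (a[i] < 0.0f) != (b[i] < 0.0f) ? -1.0f : 1.0f;
    out[i] = s * fminf(fabsf(a[i]), fabsf(b[i]));
}

// g: combines two LLRs with the already-decoded partial sum u.
__global__ void scG(const float* a, const float* b, const int* u,
                    float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    out[i] = b[i] + (1 - 2 * u[i]) * a[i];
}

int main() {
    const int n = 8;
    float ha[n] = {0.5f, -1.2f, 2.0f, -0.1f, 0.9f, 1.1f, -2.2f, 0.3f};
    float hb[n] = {1.0f, 0.4f, -0.7f, 0.8f, -1.5f, 0.2f, 0.6f, -0.9f};
    int   hu[n] = {0, 1, 0, 1, 0, 1, 0, 1};

    float *da, *db, *dout; int *du;
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&dout, n * sizeof(float));
    cudaMalloc(&du, n * sizeof(int));
    cudaMemcpy(da, ha, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(du, hu, n * sizeof(int), cudaMemcpyHostToDevice);

    scF<<<1, n>>>(da, db, dout, n);       // left-branch LLRs
    scG<<<1, n>>>(da, db, du, dout, n);   // right-branch LLRs (reuses dout)
    cudaDeviceSynchronize();

    float hout[n];
    cudaMemcpy(hout, dout, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("g(a,b,u)[0] = %f\n", hout[0]);  // expect b[0] + a[0] = 1.5
    cudaFree(da); cudaFree(db); cudaFree(dout); cudaFree(du);
    return 0;
}
```

Because each stage of the decoder depends on the previous one, the available parallelism per launch shrinks from N/2 down to 1 along the decoding path; this serial bottleneck is what tree pruning attacks, and it also explains why GPU gains can fail to materialize for modest block lengths.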
|
86 |
Implementation and optimization of LDPC decoding algorithms tailored for Nvidia GPUs in 5G / Implementering och optimering av LDPC avkodningsalgoritmer anpassat för Nvidia GPU:er i 5G. Salomonsson, Benjamin, January 2022 (has links)
Low-Density Parity-Check (LDPC) codes are linear error-correcting codes used to establish reliable communication between units on a noisy transmission channel in mobile telecommunications. LDPC algorithms detect and recover altered or corrupted message bits using sparse parity-check matrices in order to decipher messages correctly. According to the Third Generation Partnership Project (3GPP), LDPC codes are a fitting coding scheme for fifth generation (5G) New Radio (NR). TietoEvry, a consultancy in telecom, has found that LDPC decoding algorithms can be optimized with Compute Unified Device Architecture (CUDA), a parallel computing platform developed by NVIDIA. This platform utilizes the capabilities of a graphics processing unit (GPU) rather than a central processing unit (CPU), which in turn provides parallel computing. An optimized version of an LDPC decoding algorithm, the Min-Sum Algorithm (MSA), is implemented in CUDA and in C++ to compare execution times and explore the capabilities that CUDA offers. The testing is done with a set of 12 sparse parity-check matrices and input channel messages of different sizes. As a result, the CUDA implementation executes approximately 55% faster than a standard, unoptimized C++ implementation.
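The check-node update dominates MSA runtime and maps naturally onto one CUDA thread per check node. The sketch below is an illustration under an assumed fixed-degree, row-wise message layout, not TietoEvry's implementation: a first pass finds the sign product and the two smallest magnitudes, and a second pass emits the extrinsic message for each edge (the edge holding the minimum receives the second minimum).

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Min-Sum check-node update: one thread per check node. Messages are
// stored row-wise, deg entries per check; vToC holds incoming
// variable-to-check messages, cToV receives extrinsic outputs.
// (A companion variable-node kernel would use the column-index array
// of the sparse parity-check matrix; it is omitted here.)
__global__ void checkNodeMinSum(const float* vToC, float* cToV,
                                int numChecks, int deg) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= numChecks) return;

    // Pass 1: overall sign product and the two smallest magnitudes.
    float min1 = INFINITY, min2 = INFINITY;
    int minPos = -1, signProd = 1;
    for (int k = 0; k < deg; ++k) {
        float m = vToC[c * deg + k];
        if (m < 0.0f) signProd = -signProd;
        float mag = fabsf(m);
        if (mag < min1) { min2 = min1; min1 = mag; minPos = k; }
        else if (mag < min2) { min2 = mag; }
    }

    // Pass 2: the extrinsic message excludes the edge's own contribution,
    // so the edge holding min1 receives min2 instead.
    for (int k = 0; k < deg; ++k) {
        float m = vToC[c * deg + k];
        int s = signProd * (m < 0.0f ? -1 : 1);     // strip own sign
        cToV[c * deg + k] = s * ((k == minPos) ? min2 : min1);
    }
}

int main() {
    // Toy layout: two degree-3 check nodes.
    const int numChecks = 2, deg = 3, e = numChecks * deg;
    float hIn[e] = {0.5f, -1.2f, 2.0f, 0.8f, -0.3f, -1.0f};

    float *dIn, *dOut;
    cudaMalloc(&dIn, e * sizeof(float));
    cudaMalloc(&dOut, e * sizeof(float));
    cudaMemcpy(dIn, hIn, e * sizeof(float), cudaMemcpyHostToDevice);

    checkNodeMinSum<<<1, numChecks>>>(dIn, dOut, numChecks, deg);

    float hOut[e];
    cudaMemcpy(hOut, dOut, e * sizeof(float), cudaMemcpyDeviceToHost);
    printf("check 0 extrinsic: %.2f %.2f %.2f\n", hOut[0], hOut[1], hOut[2]);
    // Expected: -1.20 +0.50 -0.50
    cudaFree(dIn); cudaFree(dOut);
    return 0;
}
```

A full decoder alternates this kernel with a variable-node update and a hard-decision/syndrome check; in practice a normalization factor (e.g., 0.75 on the magnitudes) is often added to close the gap to sum-product decoding.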
|
87 |
Methods for 3D Structured Light Sensor Calibration and GPU Accelerated Colormap. Kurella, Venu, January 2018 (has links)
In manufacturing, metrological inspection is a time-consuming process. The higher the required precision in inspection, the longer the inspection time. This is due to both slow devices that collect measurement data and slow computational methods that process the data. The goal of this work is to propose methods to speed up some of these processes. Conventional measurement devices like Coordinate Measuring Machines (CMMs) have high precision but low measurement speed, while new digitizer technologies have high speed but low precision. Using these devices in synergy gives a significant improvement in measurement speed without loss of precision. The method of synergistic integration of an advanced digitizer with a CMM is discussed. Computational aspects of the inspection process are addressed next. Once a part is measured, measurement data is compared against its model to check for tolerances. This comparison is a time-consuming process on conventional CPUs. We developed and benchmarked some GPU accelerations. Finally, naive data fitting methods can produce misleading results in cases with non-uniform data. Weighted total least-squares methods can compensate for non-uniformity. We show how they can be accelerated with GPUs, using plane fitting as an example. / Thesis / Doctor of Philosophy (PhD)
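To make the weighted-fitting example concrete, here is a minimal CUDA sketch (an illustration under assumed data layout, not the thesis code). It accumulates the weighted moment sums for the plane z = ax + by + c on the GPU and solves the tiny 3x3 normal-equation system on the host. For brevity it uses weighted ordinary least squares; the weighted total least-squares variant discussed in the thesis would replace the 3x3 solve with an eigen-decomposition of the weighted covariance, while the parallel accumulation stage stays essentially the same.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Accumulate weighted moment sums for fitting z = a*x + b*y + c.
// sums[9] = {Sxx, Sxy, Sx, Syy, Sy, Sw, Sxz, Syz, Sz}, pre-zeroed.
// Note: double-precision atomicAdd needs -arch=sm_60 or newer.
__global__ void weightedMoments(const float* x, const float* y,
                                const float* z, const float* w,
                                int n, double* sums) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double wi = w[i], xi = x[i], yi = y[i], zi = z[i];
    atomicAdd(&sums[0], wi * xi * xi);
    atomicAdd(&sums[1], wi * xi * yi);
    atomicAdd(&sums[2], wi * xi);
    atomicAdd(&sums[3], wi * yi * yi);
    atomicAdd(&sums[4], wi * yi);
    atomicAdd(&sums[5], wi);
    atomicAdd(&sums[6], wi * xi * zi);
    atomicAdd(&sums[7], wi * yi * zi);
    atomicAdd(&sums[8], wi * zi);
}

static double det3(const double M[3][3]) {
    return M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
         - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
         + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]);
}

int main() {
    const int n = 1024;
    static float hx[n], hy[n], hz[n], hw[n];
    for (int i = 0; i < n; ++i) {            // exact plane z = 2x - y + 3
        hx[i] = (i % 32) * 0.1f;
        hy[i] = (i / 32) * 0.1f;
        hz[i] = 2.0f * hx[i] - hy[i] + 3.0f;
        hw[i] = 1.0f;                        // uniform weights for the demo
    }
    float *dx, *dy, *dz, *dw; double* ds;
    cudaMalloc(&dx, n * sizeof(float)); cudaMalloc(&dy, n * sizeof(float));
    cudaMalloc(&dz, n * sizeof(float)); cudaMalloc(&dw, n * sizeof(float));
    cudaMalloc(&ds, 9 * sizeof(double));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dz, hz, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dw, hw, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(ds, 0, 9 * sizeof(double));

    weightedMoments<<<(n + 255) / 256, 256>>>(dx, dy, dz, dw, n, ds);

    double s[9];
    cudaMemcpy(s, ds, 9 * sizeof(double), cudaMemcpyDeviceToHost);
    double A[3][3] = {{s[0], s[1], s[2]}, {s[1], s[3], s[4]}, {s[2], s[4], s[5]}};
    double r[3] = {s[6], s[7], s[8]}, d = det3(A), coef[3];
    for (int k = 0; k < 3; ++k) {            // Cramer's rule, column k
        double Ak[3][3];
        memcpy(Ak, A, sizeof(A));
        for (int row = 0; row < 3; ++row) Ak[row][k] = r[row];
        coef[k] = det3(Ak) / d;
    }
    printf("a=%.3f b=%.3f c=%.3f\n", coef[0], coef[1], coef[2]); // 2 -1 3
    cudaFree(dx); cudaFree(dy); cudaFree(dz); cudaFree(dw); cudaFree(ds);
    return 0;
}
```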
|
88 |
Research and Design of Neural Processing Architectures Optimized for Embedded Applications. Wu, Binyi, 28 May 2024
Deploying neural networks on edge devices and bringing them into our daily lives is attracting more and more attention. However, their expensive computational cost makes many embedded applications daunting. The primary objective of my doctoral studies is to make contributions towards resolving this predicament: optimizing neural networks and designing corresponding efficient neural processing units for edge devices. This work took algorithmic research, specifically the optimization of deep neural networks, as a starting point and then applied its findings to steer the architecture design of Neural Processing Units (NPUs). The optimization of neural network models started with single-precision neural network quantization and progressed to mixed precision. The NPU architecture development followed the algorithmic research findings to achieve hardware/software co-design. Furthermore, a new approach to hardware and software co-development was introduced, aimed at expediting the prototyping and performance assessment of NPUs. This approach targets early-stage development. It helps developers to focus on the design and optimization of NPUs and significantly shortens the development cycle. In the final project, a machine learning-based approach was applied to explore and optimize the computational and memory resources of the NPU. The entire work covers several different areas, from algorithmic research to hardware design, but all of it works toward improving the inference efficiency of neural networks. Specifically, the algorithm optimization aims to reduce the memory footprint and computational cost of neural networks, while the NPU design focuses on improving the utilization of hardware resources. The proposed software and hardware co-development approach shortens the design cycle and speeds up design iterations. The order presented above corresponds to the structure of this dissertation. Each chapter is devoted to a topic and covers the relevant research, methodology, and experimental results.

1 Introduction
2 Convolutional Neural Networks
2.1 Convolutional layer
2.1.1 Padding
2.1.2 Convolution
2.1.3 Batch Normalization
2.1.4 Nonlinearity
2.2 Pooling Layer
2.3 Fully Connected Layer
2.4 Characterization
2.4.1 Composition of Operations and Parameters
2.4.2 Arithmetic Intensity
2.5 Optimization
3 Quantization with Double-Stage Squeeze-and-Threshold 19
3.1 Overview
3.1.1 Binarization
3.1.2 Multi-bit Quantization
3.2 Quantization of Convolutional Neural Networks
3.2.1 Quantization Scheme
3.2.2 Operator fusion of Conv2D
3.3 Activation Quantization with Squeeze-and-Threshold
3.3.1 Double-Stage Squeeze-and-Threshold
3.3.2 Inference Optimization
3.4 Experiment
3.4.1 Ablation Study of Squeeze-and-Threshold
3.4.2 Comparison with State-of-the-art Methods
3.5 Summary
4 Low-Precision Neural Architecture Search 39
4.1 Overview
4.2 Differentiable Architecture Search
4.2.1 Gumbel Softmax
4.2.2 Disadvantage and Solution
4.3 Low-Precision Differentiable Architecture Search
4.3.1 Convolution Sharing
4.3.2 Forward-and-Backward Scaling
4.3.3 Power Estimation
4.3.4 Architecture of Supernet
4.4 Experiment
4.4.1 Effectiveness of solutions to the dominance problem
4.4.2 Softmax and Gumbel Softmax
4.4.3 Optimizer and Inverted Learning Rate Scheduler
4.4.4 NAS Method Evaluation
4.4.5 Searched Model Analysis
4.4.6 NAS Cost Analysis
4.4.7 NAS Training Analysis
4.5 Summary
5 Configurable Sparse Neural Processing Unit 65
5.1 Overview
5.2 NPU Architecture
5.2.1 Buffer
5.2.2 Reshapeable Mixed-Precision MAC Array
5.2.3 Sparsity
5.2.4 Post Process Unit
5.3 Mapping
5.3.1 Mixed-Precision MAC
5.3.2 MAC Array
5.3.3 Support of Other Operation
5.3.4 Configurability
5.4 Experiment
5.4.1 Performance Analysis of Runtime Configuration
5.4.2 Roofline Performance Analysis
5.4.3 Mixed-Precision
5.4.4 Comparison with Cortex-M7
5.5 Summary
6 Agile Development and Rapid Design Space Exploration 91
6.1 Overview
6.1.1 Agile Development
6.1.2 Design Space Exploration
6.2 Agile Development Infrastructure
6.2.1 Chisel Backend
6.2.2 NPU Software Stack
6.3 Modeling and Exploration
6.3.1 Area Modeling
6.3.2 Performance Modeling
6.3.3 Layered Exploration Framework
6.4 Experiment
6.4.1 Efficiency of Agile Development Infrastructure
6.4.2 Effectiveness of Agile Development Infrastructure
6.4.3 Area Modeling
6.4.4 Performance Modeling
6.4.5 Rapid Exploration and Pareto Front
6.5 Summary
7 Summary and Outlook 123
7.1 Summary
7.2 Outlook
A Appendix of Double-Stage ST Quantization 127
A.1 Training setting of ResNet-18 in Table 3.3
A.2 Training setting of ReActNet in Table 3.4
A.3 Training setting of ResNet-18 in Table 3.4
A.4 Pseudocode Implementation of Double-Stage ST
B Appendix of Low-Precision Neural Architecture Search 131
B.1 Low-Precision NAS on CIFAR-10
B.2 Low-Precision NAS on Tiny-ImageNet
B.3 Low-Precision NAS on ImageNet
Bibliography 137
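To ground the quantization theme in code, here is a minimal CUDA sketch of baseline symmetric per-tensor 8-bit quantization, the uniform scheme that methods like the double-stage squeeze-and-threshold build upon (an illustration with assumed names, not the dissertation's method):

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Symmetric per-tensor 8-bit quantization: q = clamp(round(x / s), -127, 127).
__global__ void quantizeInt8(const float* x, signed char* q, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float r = roundf(x[i] / s);
    r = fminf(fmaxf(r, -127.0f), 127.0f);   // clamp to the int8 symmetric range
    q[i] = (signed char)r;
}

int main() {
    const int n = 8;
    float hx[n] = {-1.0f, -0.5f, -0.1f, 0.0f, 0.1f, 0.25f, 0.5f, 1.0f};
    float maxAbs = 1.0f;        // known range of this toy tensor
    float s = maxAbs / 127.0f;  // scale mapping |x| <= maxAbs into [-127, 127]

    float* dx; signed char* dq;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dq, n);
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    quantizeInt8<<<1, n>>>(dx, dq, s, n);

    signed char hq[n];
    cudaMemcpy(hq, dq, n, cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)   // original -> int8 -> dequantized
        printf("%+.2f -> %4d -> %+.4f\n", hx[i], hq[i], hq[i] * s);
    cudaFree(dx); cudaFree(dq);
    return 0;
}
```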
|
89 |
Modeling and Analysis of Large-Scale On-Chip Interconnects. Feng, Zhuo, December 2009 (has links)
As IC technologies scale to the nanometer regime, efficient and accurate modeling and analysis of VLSI systems with billions of transistors and interconnects becomes increasingly critical and difficult. VLSI systems impacted by increasingly high-dimensional process-voltage-temperature (PVT) variations demand much more modeling and analysis effort than ever before, while the analysis of large-scale on-chip interconnects, which requires solving tens of millions of unknowns, imposes great challenges in computer-aided design. This dissertation presents new methodologies for addressing these two important challenges in large-scale on-chip interconnect modeling and analysis:

In the past, standard statistical circuit modeling techniques usually employed principal component analysis (PCA) and its variants to reduce parameter dimensionality. Although widely adopted, these techniques can be very limited, since parameter dimension reduction is achieved by merely considering the statistical distributions of the controlling parameters while neglecting the important correspondence between these parameters and the circuit performances (responses) under modeling. This dissertation presents a variety of performance-oriented parameter dimension reduction methods that can lead to more than one order of magnitude parameter reduction for a variety of VLSI circuit modeling and analysis problems.

The sheer size of present-day power/ground distribution networks makes their analysis and verification tasks extremely runtime- and memory-inefficient and, at the same time, limits the extent to which these networks can be optimized. Given today's commodity graphics processing units (GPUs), which can deliver more than 500 GFLOPS (billions of floating-point operations per second) of computing power and 100 GB/s of memory bandwidth, more than 10X greater than that offered by modern general-purpose quad-core microprocessors, it is very desirable to convert this impressive GPU computing power into usable design automation tools for VLSI verification. In this dissertation, for the first time, we show how to exploit massively parallel single-instruction multiple-thread (SIMT) GPU platforms to tackle power grid analysis with very promising performance. Our GPU-based network analyzer is capable of solving tens of millions of power grid nodes in just a few seconds. Additionally, with the above GPU-based simulation framework, more challenging three-dimensional full-chip thermal analysis can be solved much more efficiently than ever before.
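The SIMT pattern described above can be illustrated with a Jacobi-style relaxation on a regular resistive mesh (a simplified sketch with assumed uniform conductances, not the dissertation's actual solver, which targets irregular grids and stronger iterative methods): one lightweight thread updates one node voltage from its neighbors and its injected current, and millions of nodes are swept per launch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One Jacobi sweep over a W x H resistive mesh with uniform branch
// conductance g: each node's new voltage balances neighbor currents
// against its injected current cur[idx] (KCL). The boundary ring is
// held fixed as a Dirichlet condition standing in for supply pads.
__global__ void jacobiSweep(const float* vOld, float* vNew,
                            const float* cur, int W, int H, float g) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;
    int idx = y * W + x;
    if (x == 0 || y == 0 || x == W - 1 || y == H - 1) {
        vNew[idx] = vOld[idx];                  // pinned boundary node
        return;
    }
    float nb = vOld[idx - 1] + vOld[idx + 1]
             + vOld[idx - W] + vOld[idx + W];
    vNew[idx] = (g * nb + cur[idx]) / (4.0f * g);
}

int main() {
    const int W = 1024, H = 1024, n = W * H;
    float *dA, *dB, *dI;
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, n * sizeof(float));
    cudaMalloc(&dI, n * sizeof(float));

    // Start from a flat 1.0 V grid with zero injections (illustrative).
    float* h = new float[n];
    for (int i = 0; i < n; ++i) h[i] = 1.0f;
    cudaMemcpy(dA, h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dI, 0, n * sizeof(float));

    dim3 block(16, 16), grid((W + 15) / 16, (H + 15) / 16);
    for (int it = 0; it < 100; ++it) {          // fixed iteration count
        jacobiSweep<<<grid, block>>>(dA, dB, dI, W, H, 1.0f);
        float* t = dA; dA = dB; dB = t;         // ping-pong buffers
    }
    cudaMemcpy(h, dA, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("center node voltage: %f\n", h[(H / 2) * W + W / 2]);
    delete[] h;
    cudaFree(dA); cudaFree(dB); cudaFree(dI);
    return 0;
}
```

A production solver would add convergence checks, irregular stencils, and preconditioning, but the sweep already shows why power grids suit SIMT hardware: each thread does identical, tiny work, and the grid supplies millions of independent threads.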
|
90 |
FairCPU: Uma Arquitetura para Provisionamento de Máquinas Virtuais Utilizando Características de Processamento / FairCPU: An Architecture for Provisioning Virtual Machines Using Processing Features. Paulo Antonio Leal Rego, 02 March 2012 (has links)
Fundação Cearense de Apoio ao Desenvolvimento Científico e Tecnológico / Resource scheduling is a key process for a cloud computing platform, which generally uses virtual machines (VMs) as scheduling units. The use of virtualization techniques provides great flexibility, with the ability to instantiate multiple VMs on one physical machine (PM), migrate them between PMs, and dynamically scale a VM's resources. The techniques of consolidation and dynamic allocation of VMs have treated the impact of their use as a location-independent measure. It is generally accepted that the performance of a VM will be the same regardless of which PM it is allocated to. This assumption is reasonable for a homogeneous environment, where the PMs are identical and the VMs are running the same operating system and applications. Nevertheless, in a cloud computing environment, we expect a set of heterogeneous resources to be shared, where PMs vary both in their resource capacities and in their data affinities. The main objective of this work is to propose an architecture that standardizes the representation of processing power by means of processing units (PUs), relying on CPU usage limiting to provide performance isolation and to keep a VM's processing power at the same level regardless of the underlying PM. The proposed solution considers the heterogeneity of the PMs present in the cloud infrastructure and provides scheduling policies based on PUs. The proposed architecture, called FairCPU, was implemented to work with the KVM and Xen hypervisors. As a case study, it was incorporated into a private cloud, built with the OpenNebula middleware, where several experiments were conducted. The results prove the efficiency of the FairCPU architecture in using PUs to reduce variability in VM performance, as well as in providing a new way to represent and manage the processing power of the infrastructure's physical and virtual machines.
|