11

Runtime specialization for heterogeneous CPU-GPU platforms

Farooqui, Naila 27 May 2016 (has links)
Heterogeneous parallel architectures like those composed of CPUs and GPUs are a tantalizing compute fabric for performance-hungry developers. While these platforms enable order-of-magnitude performance increases for many data-parallel application domains, there remain several open challenges: (i) the distinct execution models inherent in the heterogeneous devices present on such platforms drive the need to dynamically match workload characteristics to the underlying resources, (ii) the complex architecture and programming models of such systems require substantial application knowledge and effort-intensive program tuning to achieve high performance, and (iii) as such platforms become prevalent, there is a need to extend their utility from running known regular data-parallel applications to the broader set of input-dependent, irregular applications common in enterprise settings. The key contribution of our research is to enable runtime specialization on such hybrid CPU-GPU platforms by matching application characteristics to the underlying heterogeneous resources for both regular and irregular workloads. Our approach enables profile-driven resource management and optimizations for such platforms, providing high application performance and system throughput. Towards this end, this research: (a) enables dynamic instrumentation for GPU-based parallel architectures, specifically targeting the complex Single-Instruction Multiple-Data (SIMD) execution model, to gain real-time introspection into application behavior; (b) leverages such dynamic performance data to support novel online resource management methods that improve application performance and system throughput, particularly for irregular, input-dependent applications; (c) automates some of the programmer effort required to exercise specialized architectural features of such platforms via instrumentation-driven dynamic code optimizations; and (d) proposes a specialized, affinity-aware work-stealing scheduling runtime for integrated CPU-GPU processors that efficiently distributes work across all CPU and GPU cores for improved load balance, taking into account both application characteristics and architectural differences of the underlying devices.
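Point (d) of the abstract centers on affinity-aware work stealing between CPU and GPU cores. The sketch below illustrates the general pattern only (own queue first, affinity-guided stealing when idle); the Task and Worker classes, affinity scores, and stealing rule are hypothetical stand-ins, not the runtime developed in the thesis.

```python
# Illustrative affinity-aware work stealing between a CPU worker and a GPU
# worker. All names and the affinity heuristic are hypothetical.
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    name: str
    gpu_affinity: float  # 0.0 = CPU-friendly, 1.0 = GPU-friendly
    def run(self, device: str) -> None:
        print(f"{self.name} -> {device}")

class Worker:
    def __init__(self, device: str):
        self.device = device
        self.local = deque()  # this worker's own task queue

    def next_task(self, victim: "Worker") -> Optional[Task]:
        if self.local:
            return self.local.pop()  # own work first
        if victim.local:
            # Steal the task whose affinity best matches this device, to
            # respect the architectural differences the abstract mentions.
            if self.device == "cpu":
                best = min(victim.local, key=lambda t: t.gpu_affinity)
            else:
                best = min(victim.local, key=lambda t: 1.0 - t.gpu_affinity)
            victim.local.remove(best)
            return best
        return None

cpu, gpu = Worker("cpu"), Worker("gpu")
gpu.local.extend(Task(f"kernel{i}", 0.9) for i in range(3))
gpu.local.append(Task("irregular_reduce", 0.2))  # better suited to the CPU

while True:
    ran = False
    for worker, victim in ((cpu, gpu), (gpu, cpu)):
        task = worker.next_task(victim)
        if task is not None:
            task.run(worker.device)
            ran = True
    if not ran:
        break
```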
12

Performance Characterization and Optimization of In-Memory Data Analytics on a Scale-up Server

Awan, Ahsan Javed January 2017 (has links)
The sheer increase in the volume of data over the last decade has triggered research in cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark defines the state of the art in big data analytics platforms for (i) exploiting data-flow and in-memory computing and (ii) exhibiting superior scale-out performance on commodity machines, little effort has been devoted to understanding the performance of in-memory data analytics with Spark on modern scale-up servers. This thesis characterizes the performance of in-memory data analytics with Spark on scale-up servers. Through empirical evaluation of representative benchmark workloads on a dual-socket server, we have found that in-memory data analytics with Spark exhibit poor multi-core scalability beyond 12 cores due to thread-level load imbalance and work-time inflation (the additional CPU time spent by threads in a multi-threaded computation beyond the CPU time required to perform the same work in a sequential computation). We have also found that workloads are bound by the latency of frequent data accesses to memory. As input data size is enlarged, application performance degrades significantly due to the substantial increase in wait time during I/O operations and garbage collection, despite a 10% better instruction retirement rate (due to lower L1 cache misses and higher core utilization). For data accesses, we have found that simultaneous multi-threading is effective in hiding the data latencies. We have also observed that (i) data locality on NUMA nodes can improve performance by 10% on average, and (ii) disabling next-line L1-D prefetchers can reduce the execution time by up to 14%. Regarding garbage collection impact, we match memory behavior with the garbage collector to improve the performance of applications by between 1.6x and 3x, and recommend using multiple small Spark executors, which can provide up to a 36% reduction in execution time over a single large executor. Based on the characteristics of the workloads, the thesis envisions near-memory and near-storage hardware acceleration to improve the single-node performance of scale-out frameworks like Apache Spark. Using modeling techniques, it estimates a speed-up of 4x for Apache Spark on scale-up servers augmented with near-data accelerators.
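The recommendation of multiple small executors over a single large one maps onto a few Spark configuration knobs. A minimal PySpark sketch follows; the executor counts and sizes are illustrative assumptions for a dual-socket node, not values taken from the thesis.

```python
# Sketch only: several small executors rather than one large executor, per the
# recommendation above. The concrete sizes here are assumptions, not the
# thesis's measured optimum.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("small-executors-sketch")
    # e.g. six 4-core / 8 GB executors instead of one 24-core / 48 GB executor;
    # smaller heaps tend to shorten garbage-collection pauses.
    .config("spark.executor.instances", "6")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)
```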
13

An investigation into parallel job scheduling using service level agreements

Ali, Syed Zeeshan January 2014 (has links)
A scheduler, as a central component of a computing site, aggregates computing resources and is responsible for distributing the incoming load (jobs) among the resources. In such an environment, optimum performance of the system against service level agreement (SLA) based workloads can be achieved by calculating the priority of SLA-bound jobs using an integrated heuristic. The SLA defines the service obligations and expectations for using the computational resources. The integrated heuristic is a combination of different SLA terms: it combines the SLA terms with a specific weight for each term. The weights are computed by applying a parameter-sweep technique in order to obtain the best schedule for the optimum performance of the system under the workload. The sweeping of parameters over the integrated heuristic is observed to be computationally expensive. The integrated heuristic becomes even more expensive if no value of the computed weights results in an improvement in performance with the resulting schedule; hence, in such situations it incurs computation cost instead of yielding optimum performance. There is therefore a need to detect situations in which the integrated heuristic can be exploited beneficially. For that reason, this thesis proposes a metric based on the concept of utilization to evaluate SLA-based parallel workloads of independent jobs and detect any impact of the integrated heuristic on the workload.
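Read literally, the integrated heuristic is a weighted combination of SLA terms whose weights are chosen by sweeping a grid and keeping the combination that yields the best schedule. The sketch below illustrates that loop; the SLA terms, the toy scoring function, and the weight grid are hypothetical placeholders, not the thesis's actual formulation.

```python
# Sketch of a weighted "integrated heuristic" plus a parameter sweep over its
# weights. SLA terms and evaluate_schedule() are illustrative placeholders.
import itertools

def priority(job, w_deadline, w_penalty, w_runtime):
    """Weighted combination of SLA terms -> job priority (higher runs earlier)."""
    return (w_deadline / job["time_to_deadline"]
            + w_penalty * job["violation_penalty"]
            - w_runtime * job["estimated_runtime"])

def evaluate_schedule(jobs, weights):
    """Toy evaluation: schedule by priority, penalize missed deadlines."""
    order = sorted(jobs, key=lambda j: priority(j, *weights), reverse=True)
    finish, score = 0.0, 0.0
    for j in order:
        finish += j["estimated_runtime"]
        score -= max(0.0, finish - j["time_to_deadline"]) * j["violation_penalty"]
    return score

jobs = [
    {"time_to_deadline": 10.0, "violation_penalty": 5.0, "estimated_runtime": 4.0},
    {"time_to_deadline": 6.0,  "violation_penalty": 1.0, "estimated_runtime": 3.0},
    {"time_to_deadline": 20.0, "violation_penalty": 2.0, "estimated_runtime": 8.0},
]

# The sweep itself: every weight combination is evaluated, which is exactly the
# computational cost the thesis argues should be avoided when it cannot help.
grid = [0.0, 0.5, 1.0]
best = max(itertools.product(grid, repeat=3),
           key=lambda w: evaluate_schedule(jobs, w))
print("best weights (deadline, penalty, runtime):", best)
```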
14

Scheduling Local and Remote Memory in Cluster Computers

Serrano Gómez, Mónica 02 September 2013 (has links)
Cluster computers represent a cost-effective alternative to supercomputers. In these systems, it is common to constrain the memory address space of a given processor to the local motherboard. Constraining the system in this way is much cheaper than using a full-fledged shared-memory implementation among motherboards. However, memory usage among motherboards may be unfairly balanced depending on the memory requirements of the applications running on each motherboard. This situation can lead to disk swapping, which severely degrades system performance, even though there may be unused memory on other motherboards. A straightforward solution is to increase the amount of available memory in each motherboard, but the cost of this solution may become prohibitive. On the other hand, remote memory access (RMA) hardware provides fast interconnects among the motherboards of a cluster computer. In recent works, this characteristic has been used to extend the addressable memory space of selected motherboards. In this work, the baseline machine uses this capability as a fast mechanism to allow the local OS to access the DRAM memory installed in a remote motherboard. In this context, efficient memory scheduling becomes a major concern, since main memory latencies have a strong impact on the overall execution time of the applications, and remote memory access latencies may be several orders of magnitude higher than those of local accesses. Additionally, changing the memory distribution is a slow process that may involve several motherboards, so the memory scheduler needs to make sure that the target distribution provides better performance than the current one. This dissertation aims to address the aforementioned issues by proposing several memory scheduling policies. First, an ideal algorithm and a heuristic strategy to assign main memory from the different memory regions are presented. Additionally, a Quality of Service control mechanism has been devised in order to prevent unacceptable performance degradation for the running applications. The ideal algorithm finds the optimal memory distribution, but its computational cost is prohibitive for a high number of applications. This drawback is handled by the heuristic strategy, which approximates the best local and remote memory distribution among applications at an acceptable computational cost. The previous algorithms are based on profiling. 
To deal with this potential shortcoming, we focus on analytical solutions. This dissertation proposes an analytical model that estimates the execution time of a given application for a given memory distribution. This technique is used as a performance predictor that provides the input to a memory scheduler. The estimates are used by the memory scheduler to dynamically choose the optimal target memory distribution for each application running in the system in order to achieve the best overall performance. Scheduling at a higher granularity allows simpler scheduling policies. This work studies the feasibility of scheduling at OS page granularity. A conventional hardware-based block interleaving and an OS-based page interleaving have been taken as the baseline schemes. From the comparison of the two baseline schemes, we have concluded that only the performance of some applications is significantly affected by page-based interleaving. The reasons behind this impact on performance have been studied and have provided the basis for the design of two OS-based memory allocation policies. The first one, namely on-demand (OD), is a simple strategy that places new pages in local memory until this region is full, benefiting from the premise that the most accessed pages are requested and allocated earlier than the least accessed ones. Nevertheless, when this premise does not hold for some benchmarks, OD performs worse. The second policy, namely Most-accessed in-local (Mail), is proposed to avoid this problem. / Serrano Gómez, M. (2013). Scheduling Local and Remote Memory in Cluster Computers [Doctoral thesis]. Editorial Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/31639
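The on-demand (OD) policy described above admits a very small sketch: fill the local region first, then spill to remote memory. Capacities and page identifiers below are made up for illustration; the dissertation's actual OS-level mechanism is not reproduced.

```python
# Minimal sketch of the on-demand (OD) placement policy: new pages go to local
# memory until it is full, then to the remote motherboard's memory.
class OnDemandAllocator:
    def __init__(self, local_pages: int, remote_pages: int):
        self.local_free = local_pages
        self.remote_free = remote_pages
        self.placement = {}  # page id -> "local" or "remote"

    def allocate(self, page_id: int) -> str:
        if self.local_free > 0:        # local region not yet full
            self.local_free -= 1
            region = "local"
        elif self.remote_free > 0:     # spill to remote memory
            self.remote_free -= 1
            region = "remote"
        else:
            raise MemoryError("no free pages in either region")
        self.placement[page_id] = region
        return region

alloc = OnDemandAllocator(local_pages=2, remote_pages=4)
print([alloc.allocate(p) for p in range(5)])
# -> ['local', 'local', 'remote', 'remote', 'remote']
```

The policy pays off when the most accessed pages are touched first; when that premise fails, the Most-accessed in-local (Mail) policy is the proposed remedy.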
15

Characterization of Dynamic Resource Consumption for Interference-Aware Consolidation

Hähnel, Markus 15 May 2023 (has links)
Nowadays, our daily lives increasingly depend on information technology. As a result, a huge amount of data has to be processed, and it is outsourced from local devices to data centers. Due to fluctuating demand, data centers are not fully utilized all the time and consume a significant amount of energy while idling. A common approach to avoid unnecessary idle times is to consolidate running services onto a subset of machines and switch off the remaining ones. Unfortunately, after consolidation the services on a single machine interfere with each other due to competition for shared resources such as caches, which leads to performance degradation. Hence, data centers have to trade off reducing energy consumption against the performance criteria defined in the Service Level Agreement. In order to make this trade-off in advance, it is necessary to characterize services and quantify their impact on each other after a potential consolidation. Our approach is to use random variables for this characterization, which captures the fluctuations of the resource consumption. Furthermore, we aim to model the interference between services in order to provide the probability of exceeding a given performance criterion.
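One way to read the final two sentences is: model each service's resource consumption as a random variable and estimate, before consolidating, the probability that the co-located services exceed a performance criterion. The Monte Carlo sketch below illustrates that reading; the distributions and the 90% CPU threshold are assumptions for illustration, not measurements from the dissertation.

```python
# Sketch: treat per-service CPU utilisation as random variables and estimate
# the probability that consolidation exceeds a criterion. Distributions and
# the threshold are assumed for illustration only.
import numpy as np

rng = np.random.default_rng(seed=0)
n = 100_000

# Fluctuating CPU utilisation of two candidate services for consolidation.
service_a = np.clip(rng.normal(loc=0.45, scale=0.10, size=n), 0.0, 1.0)
service_b = np.clip(rng.normal(loc=0.35, scale=0.15, size=n), 0.0, 1.0)

combined = service_a + service_b

# Probability that the consolidated machine violates the (assumed) criterion.
p_violation = float(np.mean(combined > 0.90))
print(f"P(combined CPU utilisation > 90%) ~= {p_violation:.3f}")
```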
16

A Coordination Framework for Deploying Hadoop MapReduce Jobs on Hadoop Cluster

Raja, Anitha January 2016 (has links)
Apache Hadoop is an open-source framework that delivers reliable, scalable, and distributed computing. Hadoop services are provided for distributed data storage, data processing, data access, and security. MapReduce is the heart of the Hadoop framework and was designed to process vast amounts of data distributed over a large number of nodes. MapReduce has been used extensively to process structured and unstructured data in diverse fields such as e-commerce, web search, social networks, and scientific computation. Understanding the characteristics of Hadoop MapReduce workloads is the key to achieving improved configurations and refining system throughput. Thus far, MapReduce workload characterization in a large-scale production environment has not been well studied. In this thesis project, the focus is mainly on composing a Hadoop cluster (as an execution environment for data processing) to analyze two types of Hadoop MapReduce (MR) jobs via a proposed coordination framework. This coordination framework is referred to as a workload translator. The outcome of this work includes: (1) a parametric workload model for the target MR jobs, (2) a cluster specification to develop an improved cluster deployment strategy using the model and coordination framework, and (3) better scheduling and hence better performance of jobs (i.e., shorter job completion time). We implemented a prototype of our solution using Apache Tomcat on (OpenStack) Ubuntu Trusty Tahr, which uses RESTful APIs to (1) create a Hadoop cluster (version 2.7.2) and (2) scale the number of workers in the cluster up and down. The experimental results showed that, with well-tuned parameters, MR jobs can achieve a reduction in job completion time and improved utilization of the hardware resources. The target audience for this thesis is developers. As future work, we suggest adding further parameters to develop a more refined workload model for MR and similar jobs.
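The prototype exposes cluster creation and scaling through RESTful APIs; a hypothetical client call is sketched below. The host, endpoint path, and JSON fields are invented for illustration, since the abstract only states that such APIs exist.

```python
# Hypothetical client of the coordination framework's REST API. The URL,
# route, and payload fields are assumptions; only the pattern of scaling
# workers up or down via a REST call comes from the text above.
import requests

BASE_URL = "http://coordinator.example.org:8080"  # assumed deployment address

def scale_workers(cluster_id: str, workers: int) -> dict:
    """Ask the coordinator to grow or shrink the Hadoop cluster's worker set."""
    resp = requests.post(
        f"{BASE_URL}/clusters/{cluster_id}/scale",
        json={"workers": workers},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

# Example: scale_workers("hadoop-2.7.2-test", workers=8)
```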
17

Towards Low-Complexity Scalable Shared-Memory Architectures

Zeffer, Håkan January 2006 (has links)
Plentiful research has addressed low-complexity software-based shared-memory systems since the idea was first introduced more than two decades ago. However, software-coherent systems have not been very successful in the commercial marketplace. We believe there are two main reasons for this: lack of performance and/or lack of binary compatibility.

This thesis studies multiple aspects of how to design future binary-compatible high-performance scalable shared-memory servers while keeping the hardware complexity at a minimum. It starts with a software-based distributed shared-memory system relying on no specific hardware support and gradually moves towards architectures with simple hardware support.

The evaluation is made in a modern chip-multiprocessor environment with both high-performance compute workloads and commercial applications. It shows that implementing the coherence-violation detection in hardware while solving the interchip coherence in software allows for high-performing binary-compatible systems with very low hardware complexity. Our second-generation hardware-software hybrid performs on par with, and often better than, traditional hardware-only designs.

Based on our results, we conclude that it is not only possible to design simple systems while maintaining performance and the binary-compatibility envelope, it is often possible to get better performance than in traditional and more complex designs.

We also explore two new techniques for evaluating a new shared-memory design throughout this work: adjustable simulation fidelity and statistical multiprocessor cache modeling.
