Global ETD Search

171	Programming Model and Protocols for Reconfigurable Distributed Systems Arad, Cosmin Ionel January 2013 (has links) Distributed systems are everywhere. From large datacenters to mobile devices, an ever richer assortment of applications and services relies on distributed systems, infrastructure, and protocols. Despite their ubiquity, testing and debugging distributed systems remains notoriously hard. Moreover, aside from inherent design challenges posed by partial failure, concurrency, or asynchrony, there remain significant challenges in the implementation of distributed systems. These programming challenges stem from the increasing complexity of the concurrent activities and reactive behaviors in a distributed system on the one hand, and the need to effectively leverage the parallelism offered by modern multi-core hardware, on the other hand. This thesis contributes Kompics, a programming model designed to alleviate some of these challenges. Kompics is a component model and programming framework for building distributed systems by composing message-passing concurrent components. Systems built with Kompics leverage multi-core machines out of the box, and they can be dynamically reconfigured to support hot software upgrades. A simulation framework enables deterministic execution replay for debugging, testing, and reproducible behavior evaluation for largescale Kompics distributed systems. The same system code is used for both simulation and production deployment, greatly simplifying the system development, testing, and debugging cycle. We highlight the architectural patterns and abstractions facilitated by Kompics through a case study of a non-trivial distributed key-value storage system. CATS is a scalable, fault-tolerant, elastic, and self-managing key-value store which trades off service availability for guarantees of atomic data consistency and tolerance to network partitions. We present the composition architecture for the numerous protocols employed by the CATS system, as well as our methodology for testing the correctness of key CATS algorithms using the Kompics simulation framework. Results from a comprehensive performance evaluation attest that CATS achieves its claimed properties and delivers a level of performance competitive with similar systems which provide only weaker consistency guarantees. More importantly, this testifies that Kompics admits efficient system implementations. Its use as a teaching framework as well as its use for rapid prototyping, development, and evaluation of a myriad of scalable distributed systems, both within and outside our research group, confirm the practicality of Kompics. / <p>QC 20130520</p> distributed systems programming model message-passing concurrency nested hierarchical composition reactive components software architecture dynamic reconfiguration multi-core discrete-event simulation peer-to-peer testing debugging distributed key-value stores data replication consistency linearizability network partition tolerance consistent hashing self-organization scalability elasticity fault tolerance consistent quorums
172	Modeling and Analysis of Large-Scale On-Chip Interconnects Feng, Zhuo 2009 December 1900 (has links) As IC technologies scale to the nanometer regime, efficient and accurate modeling and analysis of VLSI systems with billions of transistors and interconnects becomes increasingly critical and difficult. VLSI systems impacted by the increasingly high dimensional process-voltage-temperature (PVT) variations demand much more modeling and analysis efforts than ever before, while the analysis of large scale on-chip interconnects that requires solving tens of millions of unknowns imposes great challenges in computer aided design areas. This dissertation presents new methodologies for addressing the above two important challenging issues for large scale on-chip interconnect modeling and analysis: In the past, the standard statistical circuit modeling techniques usually employ principal component analysis (PCA) and its variants to reduce the parameter dimensionality. Although widely adopted, these techniques can be very limited since parameter dimension reduction is achieved by merely considering the statistical distributions of the controlling parameters but neglecting the important correspondence between these parameters and the circuit performances (responses) under modeling. This dissertation presents a variety of performance-oriented parameter dimension reduction methods that can lead to more than one order of magnitude parameter reduction for a variety of VLSI circuit modeling and analysis problems. The sheer size of present day power/ground distribution networks makes their analysis and verification tasks extremely runtime and memory inefficient, and at the same time, limits the extent to which these networks can be optimized. Given today?s commodity graphics processing units (GPUs) that can deliver more than 500 GFlops (Flops: floating point operations per second). computing power and 100GB/s memory bandwidth, which are more than 10X greater than offered by modern day general-purpose quad-core microprocessors, it is very desirable to convert the impressive GPU computing power to usable design automation tools for VLSI verification. In this dissertation, for the first time, we show how to exploit recent massively parallel single-instruction multiple-thread (SIMT) based graphics processing unit (GPU) platforms to tackle power grid analysis with very promising performance. Our GPU based network analyzer is capable of solving tens of millions of power grid nodes in just a few seconds. Additionally, with the above GPU based simulation framework, more challenging three-dimensional full-chip thermal analysis can be solved in a much more efficient way than ever before.
173	Finite element modeling of electromagnetic radiation and induced heat transfer in the human body Kim, Kyungjoo 24 September 2013 (has links) This dissertation develops adaptive hp-Finite Element (FE) technology and a parallel sparse direct solver enabling the accurate modeling of the absorption of Electro-Magnetic (EM) energy in the human head. With a large and growing number of cell phone users, the adverse health effects of EM fields have raised public concerns. Most research that attempts to explain the relationship between exposure to EM fields and its harmful effects on the human body identifies temperature changes due to the EM energy as the dominant source of possible harm. The research presented here focuses on determining the temperature distribution within the human body exposed to EM fields with an emphasis on the human head. Major challenges in accurately determining the temperature changes lie in the dependence of EM material properties on the temperature. This leads to a formulation that couples the BioHeat Transfer (BHT) and Maxwell equations. The mathematical model is formed by the time-harmonic Maxwell equations weakly coupled with the transient BHT equation. This choice of equations reflects the relevant time scales. With a mobile device operating at a single frequency, EM fields arrive at a steady-state in the micro-second range. The heat sources induced by EM fields produce a transient temperature field converging to a steady-state distribution on a time scale ranging from seconds to minutes; this necessitates the transient formulation. Since the EM material properties depend upon the temperature, the equations are fully coupled; however, the coupling is realized weakly due to the different time scales for Maxwell and BHT equations. The BHT equation is discretized in time with a time step reflecting the thermal scales. After multiple time steps, the temperature field is used to determine the EM material properties and the time-harmonic Maxwell equations are solved. The resulting heat sources are recalculated and the process continued. Due to the weak coupling of the problems, the corresponding numerical models are established separately. The BHT equation is discretized with H¹ conforming elements, and Maxwell equations are discretized with H(curl) conforming elements. The complexity of the human head geometry naturally leads to the use of tetrahedral elements, which are commonly employed by unstructured mesh generators. The EM domain, including the head and a radiating source, is terminated by a Perfectly Matched Layer (PML), which is discretized with prismatic elements. The use of high order elements of different shapes and discretization types has motivated the development of a general 3D hp-FE code. In this work, we present new generic data structures and algorithms to perform adaptive local refinements on a hybrid mesh composed of different shaped elements. A variety of isotropic and anisotropic refinements that preserve conformity of discretization are designed. The refinement algorithms support one- irregular meshes with the constrained approximation technique. The algorithms are experimentally proven to be deadlock free. A second contribution of this dissertation lies with a new parallel sparse direct solver that targets linear systems arising from hp-FE methods. The new solver interfaces to the hierarchy of a locally refined mesh to build an elimination ordering for the factorization that reflects the h-refinements. By following mesh refinements, not only the computation of element matrices but also their factorization is restricted to new elements and their ancestors. The solver is parallelized by exploiting two-level task parallelism: tasks are first generated from a parallel post-order tree traversal on the assembly tree; next, those tasks are further refined by using algorithms-by-blocks to gain fine-grained parallelism. The resulting fine-grained tasks are asynchronously executed after their dependencies are analyzed. This approach effectively reduces scheduling overhead and increases flexibility to handle irregular tasks. The solver outperforms the conventional general sparse direct solver for a class of problems formulated by high order FEs. Finally, numerical results for a 3D coupled BHT with Maxwell equations are presented. The solutions of this Maxwell code have been verified using the analytic Mie series solutions. Starting with simple spherical geometry, parametric studies are conducted on realistic head models for a typical frequency band (900 MHz) of mobile phones. / text hp-FEM Hybrid mesh Local mesh refinement algorithms Electromagnetics Specific Absorption Rate Dielectric heating Gaussian elimination Directed Acyclic Graph Direct method LU Multi-core Multi-frontal OpenMP Sparse matrix Supernodes Task parallelism Unassembled HyperMatrix GPU Dense linear algebra Algorithms-by-blocks Heterogeneous architectures BLAS
174	GPU-enhanced power flow analysis / Calcul de Flux de Puissance amélioré grâce aux Processeurs Graphiques Marin, Manuel 11 December 2015 (has links) Cette thèse propose un large éventail d'approches afin d'améliorer différents aspects de l'analyse des flux de puissance avec comme fils conducteur l'utilisation du processeurs graphiques (GPU). Si les GPU ont rapidement prouvés leurs efficacités sur des applications régulières pour lesquelles le parallélisme de données était facilement exploitable, il en est tout autrement pour les applications dites irrégulières. Ceci est précisément le cas de la plupart des algorithmes d'analyse de flux de puissance. Pour ce travail, nous nous inscrivons dans cette problématique d'optimisation de l'analyse de flux de puissance à l'aide de coprocesseur de type GPU. L'intérêt est double. Il étend le domaine d'application des GPU à une nouvelle classe de problème et/ou d'algorithme en proposant des solutions originales. Il permet aussi à l'analyse des flux de puissance de rester pertinent dans un contexte de changements continus dans les systèmes énergétiques, et ainsi d'en faciliter leur évolution. Nos principales contributions liées à la programmation sur GPU sont: (i) l'analyse des différentes méthodes de parcours d'arbre pour apporter une réponse au problème de la régularité par rapport à l'équilibrage de charge ; (ii) l'analyse de l'impact du format de représentation sur la performance des implémentations d'arithmétique floue. Nos contributions à l'analyse des flux de puissance sont les suivantes: (ii) une nouvelle méthode pour l'évaluation de l'incertitude dans l'analyse des flux de puissance ; (ii) une nouvelle méthode de point fixe pour l'analyse des flux de puissance, problème que l'on qualifie d'intrinsèquement parallèle. / This thesis addresses the utilization of Graphics Processing Units (GPUs) for improving the Power Flow (PF) analysis of modern power systems. Currently, GPUs are challenged by applications exhibiting an irregular computational pattern, as is the case of most known methods for PF analysis. At the same time, the PF analysis needs to be improved in order to cope with new requirements of efficiency and accuracy coming from the Smart Grid concept. The relevance of GPU-enhanced PF analysis is twofold. On one hand, it expands the application domain of GPU to a new class of problems. On the other hand, it consistently increases the computational capacity available for power system operation and design. The present work attempts to achieve that in two complementary ways: (i) by developing novel GPU programming strategies for available PF algorithms, and (ii) by proposing novel PF analysis methods that can exploit the numerous features present in GPU architectures. Specific contributions on GPU computing include: (i) a comparison of two programming paradigms, namely regularity and load-balancing, for implementing the so-called treefix operations; (ii) a study of the impact of the representation format over performance and accuracy, for fuzzy interval algebraic operations; and (iii) the utilization of architecture-specific design, as a novel strategy to improve performance scalability of applications. Contributions on PF analysis include: (i) the design and evaluation of a novel method for the uncertainty assessment, based on the fuzzy interval approach; and (ii) the development of an intrinsically parallel method for PF analysis, which is not affected by the Amdahl's law. Flux de Puissance Processeurs Graphiques Algorithmes parallèles Arithmétique des ordinateurs Architectures multicœurs Méthode de Newton Analyse d’incertitude Parcours d'arbre Itérations asynchrones Réseaux intelligents Intervalles floues Power Flow Graphic Processing Units Parallel algorithms Parallel algorithms Multi-core architectures Newton method Uncertainty analysis Tree traversals Asynchronous iterations Smart grids Fuzzy intervals 004 005
175	Algorithm And Architecture Design for Real-time Face Recognition Mahale, Gopinath Vasanth January 2016 (has links) (PDF) Face recognition is a field of biometrics that deals with identification of subjects based on features present in the images of their faces. The factors that make face recognition popular and favorite as compared to other biometric methods are easier operation and ability to identify subjects without their knowledge. With these features, face recognition has become an integral part of the present day security systems, targeting a smart and secure world. There are various factors that de ne the performance of a face recognition system. The most important among them are recognition accuracy of algorithm used and time taken for recognition. Recognition accuracy of the face recognition algorithm gets affected by changes in pose, facial expression and illumination along with occlusions in the images. There have been a number of algorithms proposed to enable recognition under these ambient changes. However, it has been hard to and a single algorithm that can efficiently recognize faces in all the above mentioned conditions. Moreover, achieving real time performance for most of the complex face recognition algorithms on embedded platforms has been a challenge. Real-time performance is highly preferred in critical applications such as identification of crime suspects in public. As available software solutions for FR have significantly large latency in recognizing individuals, they are not suitable for such critical real-time applications. This thesis focuses on real-time aspect of FR, where acceleration of the algorithms is achieved by means of parallel hardware architectures. The major contributions of this work are as follows. We target to design a face recognition system that can identify at most 30 faces in each frame of video at 15 frames per second, which amounts to 450 recognitions per second. In addition, we target to achieve good recognition accuracy along with scalability in terms of database size and input image resolutions. To design a system with these specifications, as a first step, we explore algorithms in literature and come up with a hybrid face recognition algorithm. This hybrid algorithm shows good recognition accuracy on face images with changes in illumination, pose and expressions, and also with occlusions. In addition the computations in the algorithm are modular in nature which are suitable for real-time realizations through parallel processing. The face recognition system consists of a face detection module to detect faces in the input image, which is followed by a face recognition module to identify the detected faces. There are well established algorithms and architectures for face detection in literature which can perform detection at 15 frames per second on video frames. Detected faces of different sizes need to be scaled to the size specified by the face recognition module. To meet the real-time constraints, we propose a hardware architecture for real-time bi-cubic convolution interpolation with dynamic scaling factors. To recognize the resized faces in real-time, a scalable parallel pipelined architecture is designed for the hybrid algorithm which can perform 450 recognitions per second on a database containing grayscale images of at most 450 classes on Virtex 6 FPGA. To provide flexibility and programmability, we extend this design to REDEFINE, a multi-core massively parallel reconfigurable architecture. In this design, we come up with FR specific programmable cores termed Scalable Unit for Region Evaluation (SURE) capable of performing modular computations in the hybrid face recognition algorithm. We replicate SUREs in each tile of REDEFINE to construct a face recognition module termed REDEFINE for Face Recognition using SURE Homogeneous Cores (REFRESH). There is a need to learn new unseen faces on-line in practical face recognition systems. Considering this, for real-time on-line learning of unseen face images, we design tiny processors termed VOP, Processor for Vector Operations. VOPs function as coprocessors to process elements under each tile of REDEFINE to accelerate micro vector operations appearing in the synaptic weight computations. We also explore deep neural networks which operate similar to the processing in human brain and capable of working on very large face databases. We explore the field of Random matrix theory to come up with a solution for synaptic weight initialization in deep neural networks for better classification . In addition, we perform design space exploration of hardware architecture for deep convolution networks and conclude with directions for future work. Real-time Face Recognition Biometrics Face Recognition Multi-Core Architecture Random Matrix Theory Deep Convolutional Neural Networks Neural Networks Image Scaling Computer Architecture Processor Architecture Online Learning, Neural Networks VOP Vector Processsors Real-time FR Electronic Engineering
176	Performance Optimisation of Discrete-Event Simulation Software on Multi-Core Computers / Prestandaoptimering av händelsestyrd simuleringsmjukvara på flerkärniga datorer Kaeslin, Alain E. January 2016 (has links) SIMLOX is a discrete-event simulation software developed by Systecon AB for analysing logistic support solution scenarios. To cope with ever larger problems, SIMLOX's simulation engine was recently enhanced with a parallel execution mechanism in order to take advantage of multi-core processors. However, this extension did not result in the desired reduction in runtime for all simulation scenarios even though the parallelisation strategy applied had promised linear speedup. Therefore, an in-depth analysis of the limiting scalability bottlenecks became necessary and has been carried out in this project. Through the use of a low-overhead profiler and microarchitecture analysis, the root causes were identified: atomic operations causing a high communication overhead, poor locality leading to translation lookaside buffer thrashing, and hot spots that consume significant amounts of CPU time. Subsequently, appropriate optimisations to overcome the limiting factors were implemented: eliminating the expensive operations, more efficient handling of heap memory through the use of a scalable memory allocator, and data structures that make better use of caches. Experimental evaluation using real world test cases demonstrated a speedup of at least 6.75x on an eight-core processor. Most cases even achieve a speedup of more than 7.2x. The various optimisations implemented further helped to lower run times for sequential execution by 1.5x or more. It can be concluded that achieving nearly linear speedup on a multi-core processor is possible in practice for discrete-event simulation. / SIMLOX är en kommersiell mjukvara utvecklad av Systecon AB, vars huvudsakliga funktion är en händelsestyrd simuleringskärna för analys av underhållslösningar för komplexa tekniska system. För hantering av stora problem så används parallellexekvering för simuleringen, vilket i teorin borde ge en nästan linjär skalning med antal trådar. Prestandaförbättringen som observerats i praktiken var dock ytterst begränsad, varför en ordentlig analys av skalbarheten har gjorts i detta projekt. Genom användandet av ett profileringsverktyg med liten overhead och mikroarkitektur-analys, så kunde orsakerna hittas: atomiska operationer som skapar mycket overhead för kommunikation, dålig lokalitet ger fragmentering vid översättning till fysiska adresser och dåligt utnyttjande av TLB-cachen, och vissa flaskhalsar som kräver mycket CPU-kraft. Därefter implementerades och testade optimeringar för att undvika de identifierade problem. Testade lösningar inkluderar eliminering av dyra operationer, ökad effektivitet i minneshantering genom skalbara minneshanteringsalgoritmer och implementation av datastrukturer som ger bättre lokalitet och därmed bättre användande av cache-strukturen. Verifiering på verkliga testfall visade på uppsnabbningar på åtminstone 6.75 gånger på en processor med 8 kärnor. De flesta fall visade på en uppsnabbning med en faktor större än 7.2. Optimeringarna gav även en uppsnabbning med en faktor på åtminstone 1.5 vid sekventiell exekvering i en tråd. Slutsatsen är därmed att det är möjligt att uppnå nästan linjär skalning med antalet kärnor för denna typ av händelsestyrd simulering. cache hierarchy caches communication overhead data structures discrete-event simulation heap memory linear speedup logistic support low-overhead profiler memory allocator microarchitecture microarchitecture analysis multi-core optimisation parallel execution profiler runtime scalability scalability bottlenecks scalable memory allocator simulation translation lookaside buffer translation lookaside buffer thrashing TLB Computer Sciences Datavetenskap (datalogi)
177	Interference Analysis and Resource Management in Server Processors: from HPC to Cloud Computing Pons Escat, Lucía 01 September 2023 (has links) [ES] Una de las principales preocupaciones de los centros de datos actuales es maximizar la utilización de los servidores. En cada servidor se ejecutan simultáneamente varias aplicaciones para aumentar la eficiencia de los recursos. Sin embargo, las prestaciones dependen en gran medida de la proporción de recursos que recibe cada aplicación. El mayor número de núcleos (y de aplicaciones ejecutándose) con cada nueva generación de procesadores hace que crezca la preocupación por la interferencia en los recursos compartidos. Esta tesis se centra en mitigar la interferencia cuando diferentes aplicaciones se consolidan en un mismo procesador desde dos perspectivas: computación de alto rendimiento (HPC) y computación en la nube. En el contexto de HPC, esta tesis propone políticas de gestión para dos de los recursos más críticos: la caché de último nivel (LLC) y los núcleos del procesador. La LLC desempeña un papel clave en las prestaciones de los procesadores actuales al reducir considerablemente el número de accesos de alta latencia a memoria principal. Se proponen estrategias de particionado de la LLC tanto para cachés inclusivas como no inclusivas, ambos diseños presentes en los procesadores para servidores actuales. Para los esquemas, se detectan nuevos comportamientos problemáticos y se asigna un mayor espacio de caché a las aplicaciones que hacen mejor uso de este. En cuanto a los núcleos del procesador, muchas aplicaciones paralelas (como aplicaciones de grafos) no escalan bien con un mayor número de núcleos. Además, el planificador de Linux aplica una estrategia de tiempo compartido que no ofrece buenas prestaciones cuando se ejecutan aplicaciones de grafo. Para maximizar la utilización del sistema, esta tesis propone ejecutar múltiples aplicaciones de grafo en el mismo procesador, asignando a cada una el número óptimo de núcleos (y adaptando el número de hilos creados) dinámicamente. En cuanto a la computación en la nube, esta tesis aborda tres grandes retos: la compleja infraestructura de estos sistemas, las características de sus aplicaciones y el impacto de la interferencia entre máquinas virtuales (MV). Primero, esta tesis presenta la plataforma experimental desarrollada con los principales componentes de un sistema en la nube. Luego, se presenta un amplio estudio de caracterización sobre un conjunto de aplicaciones de latencia crítica representativas con el fin de identificar los puntos que los proveedores de servicios en la nube deben tener en cuenta para mejorar el rendimiento y la utilización de los recursos. Por último, se realiza una propuesta que permite detectar y estimar dinámicamente la interferencia entre MV. El enfoque usa métricas que pueden monitorizarse fácilmente en la nube pública, ya que las MV deben tratarse como "cajas negras". Toda la investigación descrita se lleva a cabo respetando las restricciones y cumpliendo los requisitos para ser aplicable en entornos de producción de nube pública. En resumen, esta tesis aborda la contención en los principales recursos compartidos del sistema en el contexto de la consolidación de servidores. Los resultados experimentales muestran importantes ganancias sobre Linux. En los procesadores con LLC inclusiva, el tiempo de ejecución (TT) se reduce en más de un 40%, mientras que se mejora el IPC más de un 3%. Con una LLC no inclusiva, la equidad y el TT mejoran en un 44% y un 24%, respectivamente, al mismo tiempo que se mejora el rendimiento hasta un 3,5%. Al distribuir los núcleos del procesador de forma eficiente, se alcanza una equidad casi perfecta (94%), y el TT se reduce hasta un 80%. En entornos de computación en la nube, la degradación del rendimiento puede estimarse con un error de un 5% en la predicción global. Todas las propuestas presentadas han sido diseñadas para ser aplicadas en procesadores comerciales sin requerir ninguna información previa, tomando las decisiones dinámicamente con datos recogidos de los contadores de prestaciones. / [CAT] Una de les principals preocupacions dels centres de dades actuals és maximitzar la utilització dels servidors. A cada servidor s'executen simultàniament diverses aplicacions per augmentar l'eficiència dels recursos. Tot i això, el rendiment depèn en gran mesura de la proporció de recursos que rep cada aplicació. El nombre creixent de nuclis (i aplicacions executant-se) amb cada nova generació de processadors fa que creixca la preocupació per l'efecte causat per les interferències en els recursos compartits. Aquesta tesi se centra a mitigar la interferència en els recursos compartits quan diferents aplicacions es consoliden en un mateix processador des de dues perspectives: computació d'alt rendiment (HPC) i computació al núvol. En el context d'HPC, aquesta tesi proposa polítiques de gestió per a dos dels recursos més crítics: la memòria cau d'últim nivell (LLC) i els nuclis del processador. La LLC exerceix un paper clau a les prestacions del sistema en els processadors actuals reduint considerablement el nombre d'accessos d'alta latència a la memòria principal. Es proposen estratègies de particionament de la LLC tant per a caus inclusives com no inclusives, ambdós dissenys presents en els processadors actuals. Per als dos esquemes, se detecten nous comportaments problemàtics i s'assigna un major espai de memòria cau a les aplicacions que en fan un millor ús. Pel que fa als nuclis del processador, moltes aplicacions paral·leles (com les aplicacions de graf) no escalen bé a mesura que s'incrementa el nombre de nuclis. A més, el planificador de Linux aplica una estratègia de temps compartit que no ofereix bones prestacions quan s'executen aplicacions de graf. Per maximitzar la utilització del sistema, aquesta tesi proposa executar múltiples aplicacions de grafs al mateix processador, assignant a cadascuna el nombre òptim de nuclis (i adaptant el nombre de fils creats) dinàmicament. Pel que fa a la computació al núvol, aquesta tesi aborda tres grans reptes: la complexa infraestructura d'aquests sistemes, les característiques de les seues aplicacions i l'impacte de la interferència entre màquines virtuals (MV). En primer lloc, aquesta tesi presenta la plataforma experimental desenvolupada amb els principals components d'un sistema al núvol. Després, es presenta un ampli estudi de caracterització sobre un conjunt d'aplicacions de latència crítica representatives per identificar els punts que els proveïdors de serveis al núvol han de tenir en compte per millorar el rendiment i la utilització dels recursos. Finalment, es fa una proposta que de manera dinàmica permet detectar i estimar la interferència entre MV. L'enfocament es basa en mètriques que es poden monitoritzar fàcilment al núvol públic, ja que les MV han de tractar-se com a "caixes negres". Tota la investigació descrita es duu a terme respectant les restriccions i complint els requisits per ser aplicable en entorns de producció al núvol públic. En resum, aquesta tesi aborda la contenció en els principals recursos compartits del sistema en el context de la consolidació de servidors. Els resultats experimentals mostren que s'obtenen importants guanys sobre Linux. En els processadors amb una LLC inclusiva, el temps d'execució (TT) es redueix en més d'un 40%, mentres que es millora l'IPC en més d'un 3%. En una LLC no inclusiva, l'equitat i el TT es milloren en un 44% i un 24%, respectivament, al mateix temps que s'obté una millora del rendiment de fins a un 3,5%. Distribuint els nuclis del processador de manera eficient es pot obtindre una equitat quasi perfecta (94%), i el TT pot reduir-se fins a un 80%. En entorns de computació al núvol, la degradació del rendiment pot estimar-se amb un error de predicció global d'un 5%. Totes les propostes presentades en aquesta tesi han sigut dissenyades per a ser aplicades en processadors de servidors comercials sense requerir cap informació prèvia, prenent decisions dinàmicament amb dades recollides dels comptadors de prestacions. / [EN] One of the main concerns of today's data centers is to maximize server utilization. In each server processor, multiple applications are executed concurrently, increasing resource efficiency. However, performance and fairness highly depend on the share of resources that each application receives, leading to performance unpredictability. The rising number of cores (and running applications) with every new generation of processors is leading to a growing concern for interference at the shared resources. This thesis focuses on addressing resource interference when different applications are consolidated on the same server processor from two main perspectives: high-performance computing (HPC) and cloud computing. In the context of HPC, resource management approaches are proposed to reduce inter-application interference at two major critical resources: the last level cache (LLC) and the processor cores. The LLC plays a key role in the system performance of current multi-cores by reducing the number of long-latency main memory accesses. LLC partitioning approaches are proposed for both inclusive and non-inclusive LLCs, as both designs are present in current server processors. In both cases, newly problematic LLC behaviors are identified and efficiently detected, granting a larger cache share to those applications that use best the LLC space. As for processor cores, many parallel applications, like graph applications, do not scale well with an increasing number of cores. Moreover, the default Linux time-sharing scheduler performs poorly when running graph applications, which process vast amounts of data. To maximize system utilization, this thesis proposes to co-locate multiple graph applications on the same server processor by assigning the optimal number of cores to each one, dynamically adapting the number of threads spawned by the running applications. When studying the impact of system-shared resources on cloud computing, this thesis addresses three major challenges: the complex infrastructure of cloud systems, the nature of cloud applications, and the impact of inter-VM interference. Firstly, this thesis presents the experimental platform developed to perform representative cloud studies with the main cloud system components (hardware and software). Secondly, an extensive characterization study is presented on a set of representative latency-critical workloads which must meet strict quality of service (QoS) requirements. The aim of the studies is to outline issues cloud providers should consider to improve performance and resource utilization. Finally, we propose an online approach that detects and accurately estimates inter-VM interference when co-locating multiple latency-critical VMs. The approach relies on metrics that can be easily monitored in the public cloud as VMs are handled as ``black boxes''. The research described above is carried out following the restrictions and requirements to be applicable to public cloud production systems. In summary, this thesis addresses contention in the main system shared resources in the context of server consolidation, both in HPC and cloud computing. Experimental results show that important gains are obtained over the Linux OS scheduler by reducing interference. In inclusive LLCs, turnaround time (TT) is reduced by over 40% while improving IPC by more than 3%. In non-inclusive LLCs, fairness and TT are improved by 44% and 24%, respectively, while improving performance by up to 3.5%. By distributing core resources efficiently, almost perfect fairness can be obtained (94%), and TT can be reduced by up to 80%. In cloud computing, performance degradation due to resource contention can be estimated with an overall prediction error of 5%. All the approaches proposed in this thesis have been designed to be applied in commercial server processors without requiring any prior information, making decisions dynamically with data collected from hardware performance counters. / Pons Escat, L. (2023). Interference Analysis and Resource Management in Server Processors: from HPC to Cloud Computing [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/195840 Computación en la nube Compartición de recursos Multiprocesadores multinúcleo Virtualización Cargas de latencia crítica Retardo de cola Multi-core multiprocessors Interapplication interference Resource sharing Resource contention Performance High-performance computing Cache memories Cache partitioning Memory structures Memory hierarchy Graph processing Scheduling System utilization Runtime optimization Cloud computing Public cloud Virtualization Latency-critical workloads Tail latency Hyper-Threading Quality of Service (QoS)

Page generated in 0.0446 seconds