91 |
Clustering server properties and syntactic structures in state machines for hyperscale data center operations. Jatko, Johan. January 2021.
In hyperscale data center operations, automation is applied in many ways, as operations become very hard to scale otherwise. However, there are areas relating to understanding, grouping, and diagnosing error reports that are still handled manually at Facebook today. This master's thesis investigates solutions for applying unsupervised clustering methods to server error reports, server properties, and historical data to speed up and enhance the process of finding and root-causing systematic issues. By utilizing data representations that can embed both key-value data and historical event-log data, the thesis shows that clustering algorithms, together with data representations that capture syntactic and semantic structures in the data, can be applied with good results in a real-world scenario.
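As a hedged illustration of the kind of pipeline this abstract describes (not the thesis's actual implementation), the sketch below clusters key-value-style error reports with an off-the-shelf character n-gram embedding and a density-based algorithm. The report strings, features, and parameters are all invented for the example.

```python
# Hypothetical sketch: clustering server error reports by syntactic similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

reports = [
    "DIMM_ERROR slot=A1 vendor=X count=12",
    "DIMM_ERROR slot=B0 vendor=X count=9",
    "NIC_FLAP eth0 link_down link_up",
]

# Character n-grams capture syntactic structure in key-value strings
# without needing a fixed schema.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
X = vec.fit_transform(reports)

# DBSCAN groups dense regions and leaves outliers unclustered (-1),
# which suits error reports where many one-off failures occur.
labels = DBSCAN(eps=0.8, min_samples=2, metric="cosine").fit_predict(X)
print(labels)  # e.g. [0, 0, -1]: the two DIMM reports cluster together
```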
|
92 |
Dealing with Data. Lundberg, Agnes. January 2021.
Being an architect means dealing with data. All architectural thinking—whether it is done with pen and paper or the most advanced modeling software—starts with data conveying information about the world, and ultimately outputs data in the form of drawings or models. Reality is neither the input nor the output: all architectural work is an abstraction of reality, mediated by data. What if data, the abstractions of reality that are crucial for our work as architects, were to be used more literally? Could data actually be turned into architecture? Could data be turned into, for example, a volume, a texture, or an aperture? What qualities would such an architecture have? These questions form the basis of this thesis project. The topic was investigated first by developing a simple design method for generating architectural forms from data, through an iterative series of tests. Then, the design method was applied to create a speculative design proposal for a combined data center and museum, located in Södermalm, Stockholm.
|
93 |
Network-Layer Protocols for Data Center Scalability / Protocoles de couche réseau pour l’extensibilité des centres de données. Desmouceaux, Yoann. 10 April 2019.
With the growth in demand for computing resources, data center architectures are increasing in both scale and complexity. In this context, this thesis takes a step back from traditional network approaches and shows that providing generic primitives directly within the network layer improves resource-usage efficiency while decreasing network traffic and management overhead. Using two recently introduced network architectures, Segment Routing (SR) and Bit-Indexed Explicit Replication (BIER), network-layer protocols are designed and analyzed to provide three high-level functions: (1) task mobility, (2) reliable content distribution, and (3) load-balancing. First, task mobility is achieved by using SR to provide a zero-loss virtual machine migration service. This opens the opportunity to study how to orchestrate task placement and migration while aiming at (i) maximizing inter-task throughput, (ii) maximizing the number of newly placed tasks, and (iii) minimizing the number of tasks to be migrated. Second, reliable content distribution is achieved by using BIER to provide a reliable multicast protocol, in which retransmissions of lost packets are targeted towards the precise set of destinations having missed each packet, thus incurring minimal traffic overhead. To decrease the load on the source link, this is then extended to enable retransmissions by local peers from the same group, with SR as a helper to find a suitable retransmission candidate. Third, load-balancing is achieved by using SR to distribute queries across several candidate application instances, each of which takes a local decision on whether to accept a query, thus achieving better fairness than centralized approaches. The feasibility of a hardware implementation of this approach is investigated, and a solution using covert channels to transparently convey information to the load-balancer is implemented for a state-of-the-art programmable network card. Finally, the possibility of providing autoscaling as a network service is investigated: by letting queries traverse a fixed chain of applications using SR, autoscaling is triggered by the last instance, depending on its local state.
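To illustrate the targeted-retransmission idea, here is a minimal sketch of the BIER addressing principle: each destination owns one bit of a header bitstring, so a retransmission can address exactly the receivers that missed a packet. The receiver set, bit layout, and loss-reporting mechanism below are assumptions for illustration, not the thesis's protocol.

```python
# Illustrative sketch of BIER-style targeted retransmission (names are mine).
receivers = {"r1": 0, "r2": 1, "r3": 2}   # receiver -> bit position

def bitstring(dests):
    """Build the BIER-like bitmask addressing a set of destinations."""
    mask = 0
    for d in dests:
        mask |= 1 << receivers[d]
    return mask

# The initial multicast addresses everyone.
all_mask = bitstring(receivers)            # 0b111

# Suppose NACKs (or missing ACKs) show that r1 and r3 lost packet 42.
missing = bitstring(["r1", "r3"])          # 0b101

# The retransmission of packet 42 carries only this mask, so routers
# replicate it toward r1 and r3 only -- minimal traffic overhead.
print(bin(missing))
```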
|
94 |
Ein Beitrag zum energie- und kostenoptimierten Betrieb von Rechenzentren mit besonderer Berücksichtigung der Separation von Kalt- und Warmluft [A contribution to the energy- and cost-optimized operation of data centers, with particular attention to the separation of cold and hot air]. Hackenberg, Daniel. 07 February 2022.
This thesis presents a simulation-based methodology for optimizing the energy use and cost of data center cooling with cold/hot air separation. The specific characteristics of air separation are exploited to enable a much simpler and faster simulation approach than is possible with conventional computational fluid dynamics methods. In addition, the energy demand of air transport (including the IT systems' internal fans) is taken into account in the optimization. Component models developed as examples cover the IT systems and all cooling-relevant equipment in state-of-the-art designs; the particularly important aspects of free cooling and evaporative cooling are included. Using several configurations of a model data center, the minimization of annual consumption-related costs by adjusting temperature setpoints and other control parameters is demonstrated, and the available savings potential is quantified. Since cold/hot air separation in modern high-density installations also affects structural requirements, a building concept optimized for this use case is proposed and examined in practice; it offers particular advantages in energy efficiency, flexibility, and operational reliability.
Contents:
1 Introduction
1.1 Motivation
1.2 Categorization of data centers
1.3 Efficiency metrics for data centers
1.4 Scientific contribution and delimitation
2 Air-cooled IT systems: requirements and trends
2.1 Room climate requirements
2.1.1 Air temperature
2.1.2 Humidity
2.1.3 Air conditions in the hot aisle
2.1.4 Sound pressure level and corrosive gases
2.1.5 Operating procedures and personnel
2.2 Cooling loads
2.2.1 Power demand of the IT systems
2.2.2 Load profiles and partial-load operation of the IT systems
2.2.3 Area-specific cooling loads
2.3 Leakage airflows
2.4 Development trends
3 Data center cooling: common solutions and optimization concepts
3.1 System concepts for removing heat from data centers
3.1.1 Free cooling
3.1.2 Mechanical refrigeration
3.1.3 Recirculating air cooling of computer rooms
3.2 Recirculating air cooling with separation of cold and hot air
3.2.1 Concept
3.2.2 Implementation
3.2.3 Control of the recirculating air cooling units
3.2.4 Efficiency optimization by raising the air temperature
3.2.5 Operational reliability
3.3 Model-based studies in the literature
3.4 Interim conclusions
4 Modeling
4.1 Model structure and simulation workflow
4.2 Assumptions and boundary conditions
4.3 Modeling the IT systems
4.3.1 Test systems and software
4.3.2 Test setup and measurement of the relevant physical quantities
4.3.3 Speed of the internal fans
4.3.4 Power consumption of the internal fans
4.3.5 Air volume flow
4.3.6 Power consumption of the IT systems excluding fans
4.3.7 Exhaust air temperature
4.4 Modeling the cooling systems
4.4.1 Pumps, piping network, and fans
4.4.2 Heat exchangers
4.4.3 Recirculating air cooling units
4.4.4 Buffer storage
4.4.5 Chillers
4.4.6 Dry coolers
4.4.7 Free cooling
4.5 Control strategies, setpoints, and load profiles
4.5.1 Cold air
4.5.2 Chilled water
4.5.3 Cooling water
4.5.4 Refrigeration units
4.5.5 Load profile of the IT systems
4.5.6 Weather data
4.5.7 Site-specific costs for other utilities
4.6 Validation of the simulation environment
4.6.1 Spot-check experimental verification of the cooling-unit modeling
4.6.2 Spot-check experimental verification of the refrigeration modeling
4.6.3 Plausibility checks and model limits
4.7 Interim conclusions
5 Variant studies and derivation of recommendations
5.1 Configuration and selected operating points of the model data center
5.2 Optimizing annual energy demand with constant cooling-medium temperatures
5.2.1 Annual energy demand of the model DC and optimization according to best practices
5.2.2 Determining the optimal (constant) cooling-unit discharge air temperature
5.2.3 Influence of IT system load and type
5.2.4 Influence of site factors
5.2.5 Savings potential in pump energy
5.3 Optimization with variable cooling-medium temperatures, dry coolers operated dry
5.3.1 Dynamic setpoint shifting of the air and chilled water temperatures
5.3.2 Setpoint shifting of the cooling water temperatures during chiller operation
5.3.3 Combining the optimizations and transferring them to other sites
5.4 Optimization with variable cooling-medium temperatures, dry coolers wetted
5.4.1 Dynamic setpoint shifting of the air and chilled water temperatures
5.4.2 Optimization of a modified model without chillers
5.4.3 Operational reliability of the chiller-less configuration
5.4.4 Improving operational reliability with ice storage
5.5 Interim conclusions
6 Presentation and discussion of a new building concept for data centers
6.1 Building concepts and requirements for reliability, efficiency, and flexibility
6.1.1 Limitations of classic construction principles
6.1.2 Alternative concepts for recirculating air cooling in data centers
6.1.3 Data centers with a services floor instead of a raised floor
6.2 Plenum instead of raised floor: concept and implementation
6.2.1 Problem statement and conceptual requirements
6.2.2 Solution with the plenum concept
6.2.3 Requirements for the control of the recirculating air cooling units
6.3 Experimental performance determination and optimization
6.3.1 Test setup and measurement of the relevant physical quantities
6.3.2 Control of air volume flow and temperature under constant load
6.3.3 Optimizing the cascading of the recirculating air cooling units under load changes
6.3.4 Improving the operational reliability of recirculating air cooling during power outages
6.3.5 Determining the performance limits
6.4 Interim conclusions and further optimization potential
7 Summary and outlook
|
95 |
Passive Optical Top-of-Rack Interconnect for Data Center Networks. Cheng, Yuxin. January 2017.
Optical networks offering ultra-high capacity and low energy consumption per bit are considered a good option for handling the rapidly growing traffic volume inside data centers (DCs). However, most of the optical interconnect architectures proposed for DCs so far focus mainly on the aggregation/core tiers of the data center networks (DCNs), while relying on conventional top-of-rack (ToR) electronic packet switches (EPS) in the access tier. The large number of ToR switches in current DCNs brings serious scalability limitations due to high cost and power consumption. It is therefore important to investigate and evaluate new optical interconnects tailored for the access tier of the DCNs. We propose and evaluate a passive optical ToR interconnect (POTORI) architecture for the access tier. The data plane of the POTORI consists mainly of passive components that interconnect the servers within the rack as well as the interfaces toward the aggregation/core tiers. Using passive components makes it possible to significantly reduce power consumption while achieving high reliability in a cost-efficient way. Meanwhile, the proposed POTORI control plane is based on a centralized rack controller, which is responsible for coordinating the communications among the servers in the rack and can be reconfigured through software-defined networking (SDN) operation. A cycle-based medium access control (MAC) protocol and a dynamic bandwidth allocation (DBA) algorithm are designed for the POTORI to efficiently manage the exchange of control messages and the data transmission inside the rack. Simulation results show that under realistic DC traffic scenarios, the POTORI with the proposed DBA algorithm achieves an average packet delay below 10 μs with the use of fast tunable optical transceivers. Moreover, we further quantify the impact of different network configuration parameters on the average packet delay.
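As a rough illustration of what a cycle-based MAC with dynamic bandwidth allocation involves, the sketch below grants per-server transmission time each cycle in proportion to reported demand. The cycle length, report format, and proportional policy are assumptions for illustration; POTORI's actual MAC protocol and DBA algorithm are specified in the thesis.

```python
# Minimal sketch of a cycle-based DBA loop run by a rack controller.
CYCLE_US = 100  # assumed cycle length in microseconds

def allocate(requests, capacity_us=CYCLE_US):
    """Grant each server time proportional to its demand, capped at capacity."""
    total = sum(requests.values())
    if total == 0:
        return {s: 0 for s in requests}
    scale = min(1.0, capacity_us / total)
    return {s: int(r * scale) for s, r in requests.items()}

# Queue-occupancy reports (in microseconds of data) from servers in the rack:
grants = allocate({"srv1": 60, "srv2": 30, "srv3": 90})
print(grants)  # e.g. {'srv1': 33, 'srv2': 16, 'srv3': 50}
```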
|
96 |
Combinatorial Optimization for Data Center Operational Cost Reduction. Rostami, Somayye. January 2023.
This thesis considers two kinds of problems, motivated by practical applications in data center operations and maintenance. Data centers are the brain of the internet, each hosting as many as tens of thousands of IT devices, making them a considerable contributor to global energy consumption (more than 1 percent of global power consumption). There is a large body of work at different layers aimed at reducing the total power consumption of data centers. One of the key opportunities to save power is addressing the thermal heterogeneity in data centers through thermal-aware workload distribution. The corresponding optimization problem is challenging due to its combinatorial nature and the computational complexity of thermal models. In this thesis, a holistic theoretical approach is proposed for thermal-aware workload distribution, which uses linearization to make the problem model-independent and easier to study. Two general optimization problems are defined. In the first problem, several cooling parameters and heat-recirculation effects are considered, and two red-line temperatures are defined for idle and fully utilized servers to allow the cooling effort to be reduced. The resulting problem is a mixed-integer linear programming problem, which is solved approximately using a proposed heuristic. Numerical results confirm that the proposed approach outperforms commonly considered baseline algorithms and commercial solvers (MATLAB) and can reduce power consumption by more than 10 percent. In the second problem, additional operational costs related to the reliability of the servers are considered. The resulting problem is solved by a generalization of the proposed heuristics integrated with a Model Predictive Control (MPC) approach, where demand predictions are available. Finally, in the second type of problem, we address a problem in inventory management related to data center maintenance, where we develop an efficient dynamic programming algorithm to solve a lot-sizing problem. The algorithm is based on a key structural property that may be of more general interest: that of a just-in-time ordering policy. / Thesis / Doctor of Philosophy (PhD) / Data centers, each hosting as many as tens of thousands of IT devices, account for a considerable portion of energy usage worldwide (more than 1 percent of global power consumption). They also incur other operational costs, mostly related to the reliability of devices and maintenance. One of the key ways to reduce energy consumption is to address the thermal heterogeneity in data centers through thermal-aware workload distribution for the servers. This prevents hot-spot generation and addresses the trade-off between IT and cooling power consumption, the two main contributors to power consumption. The corresponding optimization problem is challenging due to its combinatorial nature and the complexity of thermal models. In this thesis, we present a holistic approach for thermal-aware workload distribution in data centers, using linearization to make the problem model-independent and simpler to study. Two quite general nonlinear optimization problems are defined. The results confirm that the proposed approach, completed by a proposed heuristic, solves the problems efficiently and with high precision. Finally, we address a problem in inventory management related to data center maintenance, where we develop an efficient algorithm to solve a lot-sizing problem with the goal of reducing data center operational costs.
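For readers unfamiliar with lot-sizing, the following sketch shows the classic dynamic program of the Wagner-Whitin type, whose optimal solutions exhibit exactly the just-in-time (zero-inventory-ordering) property mentioned above: each order arrives precisely when stock runs out. The thesis's algorithm exploits such structure more efficiently; the demands and costs below are made up for illustration.

```python
# Hedged sketch of a Wagner-Whitin-style lot-sizing dynamic program.
def lot_sizing(demand, K, h):
    """Min-cost plan: an order placed in period j covers demands j..i-1
    (just-in-time property). K = fixed ordering cost, h = per-unit
    per-period holding cost."""
    n = len(demand)
    best = [0.0] * (n + 1)            # best[i] = min cost for periods 0..i-1
    for i in range(1, n + 1):
        best[i] = float("inf")
        for j in range(i):            # last order placed in period j
            hold = sum(h * (t - j) * demand[t] for t in range(j, i))
            best[i] = min(best[i], best[j] + K + hold)
    return best[n]

print(lot_sizing([20, 30, 0, 40], K=50.0, h=1.0))  # 130.0: order in periods 0 and 3
```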
|
97 |
Performance Modeling and Optimization Techniques in the Presence of Random Process Variations to Improve Parametric Yield of VLSI Circuits. Basu, Shubhankar. 28 August 2008.
No description available.
|
98 |
Scalable and Energy-Efficient SIMT Systems for Deep Learning and Data Center Microservices. Mahmoud Khairy A. Abdallah (12894191). 04 July 2022.
Moore’s law is dead. The physical and economic principles that enabled an exponential rise in transistors per chip have reached their breaking point. As a result, the High-Performance Computing (HPC) domain and cloud data centers are encountering significant energy, cost, and environmental hurdles that have led them to embrace custom hardware/software solutions. Single Instruction Multiple Thread (SIMT) accelerators, like Graphics Processing Units (GPUs), are compelling solutions for achieving considerable energy efficiency while still preserving programmability in the twilight of Moore’s Law.
In the HPC and Deep Learning (DL) domains, the death of single-chip GPU performance scaling will usher in a renaissance in multi-chip Non-Uniform Memory Access (NUMA) scaling. Advances in silicon interposers and other inter-chip signaling technologies will enable single-package systems, composed of multiple chiplets, that continue to scale even as per-chip transistor counts do not. Given this evolving, massively parallel NUMA landscape, the placement of data on each chiplet, or discrete GPU card, and the scheduling of the threads that use that data are critical factors in system performance and power consumption.
Aside from the supercomputer space, general-purpose compute units are still the main driver of data centers' total cost of ownership (TCO). CPUs consume 60% of the total data center power budget, half of which comes from the CPU pipeline’s frontend. Coupled with the hardware-efficiency crisis is an increased desire for programmer productivity, flexible scalability, and nimble software updates, which together have led to the rise of software microservices. Consequently, single servers are now packed with many threads executing the same, relatively small task on different data.
In this dissertation, I discuss these new paradigm shifts, addressing the following concerns: (1) how do we overcome the non-uniform memory access overhead for next-generation multi-chiplet GPUs in the era of DL-driven workloads? (2) how can we improve the energy efficiency of data center CPUs in light of the evolution of microservices and request similarity? and (3) how can we study such rapidly evolving systems with accurate and extensible SIMT performance modeling?
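As a toy illustration of the data-placement/thread-scheduling coupling described above (not the dissertation's mechanism), the sketch below assigns each thread block to the chiplet that is home to most of its pages, assuming round-robin page interleaving. All names and policies here are hypothetical.

```python
# Illustrative locality-aware thread-block scheduling for a multi-chiplet GPU.
NUM_CHIPLETS = 4

def home_chiplet(page):
    # Assume pages are interleaved round-robin across chiplet memory stacks.
    return page % NUM_CHIPLETS

def schedule(block_pages):
    """Map each thread block to the chiplet owning most of its pages."""
    placement = {}
    for block, pages in block_pages.items():
        counts = [0] * NUM_CHIPLETS
        for p in pages:
            counts[home_chiplet(p)] += 1
        placement[block] = max(range(NUM_CHIPLETS), key=counts.__getitem__)
    return placement

print(schedule({"blk0": [0, 4, 8, 1], "blk1": [3, 7, 7, 2]}))
# blk0 -> chiplet 0 (three of four pages local); blk1 -> chiplet 3
```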
|
99 |
Latency Tradeoffs in Distributed Storage Access. Ray, Madhurima. January 2019.
The performance of storage systems is central to handling the huge amounts of data being generated by a variety of sources, including scientific experiments, social media, crowdsourcing, and an increasing variety of cyber-physical systems. Emerging high-speed storage technologies enable the efficient ingestion of and access to such large volumes of data. However, the high data-volume requirements of new applications, which largely generate unstructured and semi-structured streams of data, combined with these emerging storage technologies, pose a number of new challenges, including handling such data at low latency and ensuring that the network providing access to the data does not become the bottleneck.
The traditional relational model is not well suited to efficiently storing and retrieving unstructured and semi-structured data. An alternate mechanism, popularly known as the Key-Value Store (KVS), has been investigated over the last decade to handle such data. A KVS only needs a 'key' to uniquely identify a data record, which may be of variable length and may or may not have further structure in the form of predefined fields. Most existing KVSs were designed for hard-disk-based storage (before SSDs gained popularity), where avoiding random accesses is crucial for good performance. Unfortunately, as modern solid-state drives become the norm in data center storage, HDD-oriented KV structures result in high read, write, and space amplification, which is detrimental to both an SSD's performance and its endurance. Also note that regardless of how storage systems are deployed, access to large amounts of storage by many nodes must necessarily go over the network. Emerging storage technologies such as flash, 3D XPoint, and phase-change memory (PCM), coupled with highly efficient access protocols such as NVMe, can ingest and read data at rates that challenge even leading-edge networking technologies such as 100 Gb/s Ethernet. At the same time, some of the higher-end storage technologies (e.g., Intel Optane storage based on 3D XPoint technology, PCM, etc.), coupled with lean protocols like NVMe, can provide storage access latencies in the 10-20 μs range, which means that the additional latency due to network congestion can become significant.
The purpose of this thesis is to address some of the aforementioned issues. We propose a new hash-based, SSD-friendly key-value store (KVS) architecture called FlashKey, which is designed especially for SSDs to provide low access latencies, low read and write amplification, and the ability to easily trade off latency for any sequential access, for example, range queries. Through a detailed experimental evaluation of FlashKey against the two most popular KVSs, namely RocksDB and LevelDB, we demonstrate that even as an initial implementation it achieves substantially better write amplification, average latency, and tail latency at similar or better space amplification. Next, we deal with network congestion by dynamically replicating data items that are heavily used. The tradeoff here is between latency and the replication or migration overhead. It is important to reverse the replication or migration as the congestion fades away, since our observations indicate that placing data and the applications that access it together, in a consolidated fashion, significantly reduces propagation delay and increases network energy-saving opportunities; this matters because data center networks are nowadays equipped with high-speed, power-hungry network infrastructure. Finally, we design a tradeoff between network consolidation and congestion, trading off latency to save power. During quiet hours, we consolidate traffic onto fewer links and put the unused links into different sleep modes to save power. As traffic increases, we reactively start to spread traffic out again to avoid congestion from the upcoming surge. There are numerous studies in the area of network energy management that use similar approaches; however, most of them manage energy at a coarser time granularity (e.g., 24 hours or beyond). In contrast, our mechanism tries to exploit all the small-to-medium gaps in traffic and invoke network energy management without causing a significant increase in latency. / Computer and Information Science
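A minimal sketch of the replication/consolidation tradeoff described above might look like the following threshold policy; the thresholds and the single utilization signal are simplifications of my own, not the thesis's mechanism.

```python
# Hypothetical hot-item replication policy driven by link utilization.
REPLICATE_ABOVE = 0.8   # utilization that triggers replication
RETIRE_BELOW = 0.4      # utilization below which replicas are withdrawn

def adjust_replicas(utilization, is_hot, replicas):
    if is_hot and utilization > REPLICATE_ABOVE:
        return replicas + 1       # trade storage/migration cost for latency
    if replicas > 1 and utilization < RETIRE_BELOW:
        return replicas - 1       # reverse replication to allow consolidation
    return replicas

r = 1
for u in [0.9, 0.85, 0.5, 0.3, 0.2]:
    r = adjust_replicas(u, is_hot=True, replicas=r)
    print(u, r)   # replica count grows to 3, then shrinks back toward 1
```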
|
100 |
Distributed Cooling for Data Centers: Benefits, Performance Evaluation and Prediction Tools. Moazamigoodarzi, Hosein. January 2019.
Improving the efficiency of conventional air-cooled solutions for data centers (DCs) remains a major thermal management challenge. Improvements can be made in two ways: through better (1) architectural design and (2) operation. There are three conventional DC cooling architectures: (a) room-based, (b) row-based, and (c) rack-based. Architectures (b) and (c) allow a modular DC design, where the IT equipment (ITE) sits within an enclosure containing a cooling unit. Owing to their scalability, ease of implementation, operational cost, and low complexity, these modular systems have gained popularity for many computing applications. However, the still-limited insight into their thermal management leaves few strategies for scaling a DC facility for applications of growing importance, e.g., edge and hyperscale. We improve the body of knowledge by comparing the power consumption of the three cooling architectures.
Energy efficiency during DC operation can be improved in two ways: (1) utilizing energy-efficient control systems, and (2) optimizing the arrangement of the ITE. Both require a temperature prediction tool that can provide real-time information about the temperature distribution as a function of system parameters and the ITE arrangement. To construct such a prediction tool, we must develop a deeper understanding of the airflow, pressure, and temperature distributions around the ITE and how these parameters change dynamically with IT load. So far, only primitive tools have been developed, and only for architecture (a) listed above; these tools do not transfer to other architectures due to significant differences in thermal-fluid transport. We examine the airflow and thermal transport within confined racks with separated cold and hot chambers that employ rack- or row-based cooling units, and then propose a parameter-free transient zonal model to obtain real-time temperature distributions. / Thesis / Doctor of Philosophy (PhD)
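To give a sense of what a zonal model computes, here is a generic single-zone energy-balance step. It is the textbook lumped form, not the parameter-free transient model developed in the thesis, and all numbers are illustrative.

```python
# Generic sketch of one step of a zonal (lumped) thermal model.
RHO = 1.2      # air density, kg/m^3
CP = 1005.0    # specific heat of air, J/(kg K)

def zone_step(T, T_in, vdot, q_it, volume, dt):
    """Advance zone air temperature T by dt seconds.
    vdot: airflow into the zone (m^3/s); q_it: IT heat added (W)."""
    m = RHO * volume                     # air mass in the zone (kg)
    mdot = RHO * vdot                    # mass flow through the zone (kg/s)
    dTdt = (mdot * CP * (T_in - T) + q_it) / (m * CP)
    return T + dt * dTdt

# Cold-aisle air at 20 C entering a server zone dissipating 5 kW:
T = 20.0
for _ in range(60):
    T = zone_step(T, T_in=20.0, vdot=0.5, q_it=5000.0, volume=2.0, dt=1.0)
print(round(T, 1))  # approaches 20 + 5000/(mdot*CP), i.e. about 28.3 C
```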
|