521

Elastic, Interoperable and Container-based Cloud Infrastructures for High Performance Computing

López Huguet, Sergio 02 September 2021 (has links)
Tesis por compendio / [EN] Scientific applications generally imply a variable and unpredictable computational workload that institutions must address by dynamically adjusting the allocation of resources to their different computational needs. Scientific applications may require high capacity, e.g. the concurrent use of computational resources to process many independent jobs (High Throughput Computing, HTC), or high capability, using high-performance resources to solve a single complex problem (High Performance Computing, HPC). The computational resources required by this type of application usually carry a very high cost that may exceed the availability of the institution's resources, or those resources may not be well suited to the scientific applications, especially in the case of infrastructures prepared for the execution of HPC applications. Indeed, the different parts that compose an application may require different types of computational resources. Nowadays, cloud service platforms have become an efficient solution to meet the needs of HTC applications, as they provide a wide range of computing resources accessible on demand. For this reason, the number of hybrid computational infrastructures has increased in recent years; these combine infrastructures hosted on cloud platforms with computing resources hosted at the institutions themselves, known as on-premise infrastructures. As scientific applications can be processed on different infrastructures, application delivery has become a key issue. Containers are probably the most popular technology for application delivery today, as they ease reproducibility, traceability, versioning, isolation, and portability.
The main objective of this thesis is to provide an architecture and a set of services to build hybrid elastic processing infrastructures that fit the needs of different workloads; to that end, the thesis considers aspects such as elasticity and federation. The use of vertical and horizontal elasticity was explored by developing a proof of concept for vertical elasticity and by designing an elastic cloud architecture for data analytics. Afterwards, an elastic cloud architecture comprising heterogeneous computational resources was implemented for medical image processing, using multiple processing queues for jobs with different requirements; this work was framed in a collaboration with the company QUIBIM. In the last part of the thesis, the previous work was evolved to design and implement an elastic, multi-site and multi-tenant cloud architecture for medical image processing in the framework of the European project PRIMAGE. This architecture uses distributed storage and integrates external authentication and authorization services based on OpenID Connect (OIDC). The tool kube-authorizer was developed to provide access control to the resources of the processing infrastructure automatically, creating policies and roles from the information obtained during the authentication process. Finally, another tool, hpc-connector, was developed to enable the integration of HPC processing infrastructures into cloud infrastructures without requiring modifications to either infrastructure, cloud or HPC. It should be noted that, during this thesis, different contributions were made to open-source container and job management technologies by developing open-source tools and components and implementing recipes for the automated configuration of the different architectures from a DevOps perspective. The results obtained support the feasibility of combining vertical and horizontal elasticity to implement deadline-based QoS policies, as well as the feasibility of the federated authentication model to combine public and on-premise clouds. / López Huguet, S. (2021). Elastic, Interoperable and Container-based Cloud Infrastructures for High Performance Computing [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/172327 / Compendio
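The abstract gives no implementation details for kube-authorizer, but the core idea (turning identity information from the OIDC authentication step into Kubernetes access-control objects) can be sketched. The snippet below is a hypothetical illustration: the namespace convention, group-to-role mapping, and granted verbs are all assumptions, not the tool's actual behavior. It emits standard RBAC manifests that could then be applied with kubectl.

```python
# Hypothetical sketch: derive Kubernetes RBAC manifests from OIDC claims,
# in the spirit of kube-authorizer (not the actual tool's code).
import yaml  # pip install pyyaml

def rbac_from_oidc_claims(claims: dict) -> list[dict]:
    """Map an OIDC 'groups' claim to a namespaced Role and RoleBinding."""
    user = claims["preferred_username"]
    manifests = []
    for group in claims.get("groups", []):
        ns = f"proc-{group}"  # assumed: one processing namespace per group
        manifests.append({
            "apiVersion": "rbac.authorization.k8s.io/v1",
            "kind": "Role",
            "metadata": {"name": f"{group}-jobs", "namespace": ns},
            "rules": [{"apiGroups": ["batch"],
                       "resources": ["jobs"],
                       "verbs": ["create", "get", "list", "delete"]}],
        })
        manifests.append({
            "apiVersion": "rbac.authorization.k8s.io/v1",
            "kind": "RoleBinding",
            "metadata": {"name": f"{group}-{user}", "namespace": ns},
            "subjects": [{"kind": "User", "name": user,
                          "apiGroup": "rbac.authorization.k8s.io"}],
            "roleRef": {"kind": "Role", "name": f"{group}-jobs",
                        "apiGroup": "rbac.authorization.k8s.io"},
        })
    return manifests

claims = {"preferred_username": "alice", "groups": ["primage"]}
print(yaml.safe_dump_all(rbac_from_oidc_claims(claims)))
```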
522

Improving Network-on-Chip Performance in Multi-Core Systems

Gorgues Alonso, Miguel 10 September 2018 (has links)
Tesis por compendio / The Network-on-Chip (NoC) has become the key element for efficient communication between cores within a chip multiprocessor (CMP). The use of parallel applications in CMPs and the increase in the amount of memory needed by applications have made the communication network more important, as the NoC is in charge of transporting all the data needed by the processor cores. Moreover, the growing number of cores pushes NoCs to be designed in a scalable way, but without affecting network performance (latency and throughput). Thus, network-on-chip design becomes critical. This thesis presents different proposals that attack the problem of improving network performance in three different scenarios: 1) NoCs with an adaptive routing algorithm, 2) scenarios with low memory access time needs, and 3) high-assurance NoCs. The first proposals focus on increasing network throughput with adaptive routing algorithms, by improving the utilization of network resources (first proposal, SUR) and by avoiding congestion spreading when intense traffic to a single destination occurs (second proposal, EPC). The third and main contribution of this thesis focuses on the problem of reducing memory access latency. PROSA, a hybrid circuit-packet switching architecture, reduces network latency by exploiting the memory access latency slack to establish circuits during that delay. In this way, when the data arrives at the NoC, it is served without delay. Finally, the Token-Based TDM proposal focuses on high-assurance networks-on-chip. In this type of NoC, applications are divided into domains, and the network must guarantee that there are no interferences between the different domains, thereby preventing intrusion by possible malicious applications. Token-Based TDM allows domain isolation with no design impact on NoC routers. The results show how these proposals improve network performance in each scenario. The implementation and simulation of the proposals show that the efficient use of network resources in CMPs with adaptive routing algorithms increases the injected traffic supported by the network. In addition, using a filter to limit the adaptivity of the routing algorithm under congested situations prevents messages from spreading congestion along the network. The results also show that the combined use of circuit and packet switching significantly reduces memory access latency, contributing to a significant reduction in application execution time. Finally, Token-Based TDM increases the performance of TDM networks thanks to its high flexibility and efficient arbitration: it does not require any modification of the network to support a different number of domains, while improving latency and keeping strong traffic isolation between domains. / Gorgues Alonso, M. (2018). Improving Network-on-Chip Performance in Multi-Core Systems [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/107336 / Compendio
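As a rough illustration of how a token-based TDM scheme can isolate traffic domains, the following toy model rotates a single token among domains; a domain may inject a flit only while holding the token, so domains never compete for the same cycle. This is an assumption-laden simplification for intuition, not the dissertation's actual router arbiter.

```python
# Minimal, illustrative model of token-based TDM arbitration between
# isolation domains (a simplification, not the thesis's router design).
from collections import deque

def token_tdm(queues: dict, cycles: int) -> list:
    """Each cycle, only the domain holding the token may inject one flit;
    the token then passes on, so domains can never interfere."""
    domains = list(queues)
    log, token = [], 0
    for t in range(cycles):
        d = domains[token]
        if queues[d]:
            log.append((t, d, queues[d].popleft()))
        token = (token + 1) % len(domains)  # rotate token every cycle
    return log

queues = {"secure": deque(["s0", "s1"]), "best-effort": deque(["b0", "b1", "b2"])}
for cycle, domain, flit in token_tdm(queues, 6):
    print(f"cycle {cycle}: {domain} injects {flit}")
```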
523

Strategies For Recycling Krylov Subspace Methods and Bilinear Form Estimation

Swirydowicz, Katarzyna 10 August 2017 (has links)
The main theme of this work is the effectiveness and efficiency of Krylov subspace methods and Krylov subspace recycling. While solving long, slowly changing sequences of large linear systems, such as the ones that arise in engineering, there are many issues to consider if we want to make the process reliable (converging to a correct solution) and as fast as possible. This thesis is built on three main components. First, we target bilinear and quadratic form estimation. The bilinear form $c^T A^{-1} b$ is often associated with long sequences of linear systems, especially in optimization problems. Thus, we devise algorithms that adapt cheap bilinear and quadratic form estimates for Krylov subspace recycling. In the second part, we develop a hybrid recycling method that is inspired by a complex CFD application; we aim to make the method robust and cheap at the same time. In the third part of the thesis, we optimize the implementation of Krylov subspace methods on Graphics Processing Units (GPUs). Since preconditioners based on incomplete matrix factorization (ILU, Cholesky) are very slow on GPUs, we develop a preconditioner that is effective and well suited for GPU implementation. / Ph. D. / In many applications we encounter the repeated solution of a large number of slowly changing large linear systems. The cost of solving these systems typically dominates the computation. This is often the case in medical imaging, or more generally inverse problems, and in the optimization of designs. Because of the size of the matrices, Gaussian elimination is infeasible. Instead, we find a sufficiently accurate solution using iterative methods, so-called Krylov subspace methods, which improve the solution with every iteration by computing a sequence of approximations spanning a Krylov subspace. However, these methods often take many iterations to construct a good solution, and these iterations can be expensive. Hence, we consider methods to reduce the number of iterations while keeping the iterations cheap. One such approach is Krylov subspace recycling, in which we recycle judiciously selected subspaces from previous linear solves to improve the rate of convergence and obtain a good initial guess. In this thesis, we focus on improving the efficiency (runtimes) and effectiveness (number of iterations) of Krylov subspace methods. The thesis has three parts. In the first part, we focus on efficiently estimating sequences of bilinear forms, $c^T A^{-1} b$. We approximate the bilinear forms using the properties of Krylov subspaces and Krylov subspace solvers. We devise an algorithm that allows us to use Krylov subspace recycling methods to efficiently estimate bilinear forms, and we test our approach on three applications: topology optimization for the optimal design of structures, diffuse optical tomography, and error estimation and grid adaptation in computational fluid dynamics. In the second part, we focus on finding the best strategy for Krylov subspace recycling for two large computational fluid dynamics problems. We also present a new approach, which lets us reduce the computational cost of Krylov subspace recycling. In the third part, we investigate Krylov subspace methods on Graphics Processing Units. We use a lid-driven cavity problem from computational fluid dynamics to perform a thorough analysis of how the choice of the Krylov subspace solver and preconditioner influences runtimes. We propose a new preconditioner, which is designed to work well on Graphics Processing Units.
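For readers unfamiliar with the bilinear form $c^T A^{-1} b$, the baseline approach is to solve $Ax = b$ with a Krylov method and then take the inner product with $c$; the thesis's contribution is estimating this quantity cheaply across sequences of systems via recycling. Below is a minimal sketch of the baseline only, using SciPy's conjugate gradient solver on an assumed SPD test matrix.

```python
# Hedged sketch: estimating the bilinear form c^T A^{-1} b with a Krylov
# solver. This is the naive solve-then-dot approach, not the thesis's
# recycling-based estimator.
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg

n = 1000
# Symmetric positive definite test matrix (1-D Laplacian stencil).
A = diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")
rng = np.random.default_rng(0)
b, c = rng.standard_normal(n), rng.standard_normal(n)

x, info = cg(A, b)   # Krylov (conjugate gradient) solve of A x = b
assert info == 0     # info == 0 means the solver converged
print("c^T A^{-1} b approx =", c @ x)
```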
524

High-Performance Network-on-Chip Design for Many-Core Processors

Wang, Boqian January 2020 (has links)
With the development of on-chip manufacturing technologies and the requirements of high-performance computing, the core count is growing quickly in Chip Multi/Many-core Processors (CMPs) and Multiprocessor Systems-on-Chip (MPSoCs) to support larger-scale parallel execution. The Network-on-Chip (NoC) has become the de facto solution for CMPs and MPSoCs in addressing the communication challenge. In this thesis, we tackle a few key problems facing high-performance NoC designs. For general-purpose CMPs, we take a full-system perspective to design high-performance NoCs for multi-threaded programs. By exploring cache coherence under the whole-system scenario, we present a smart communication service called Advance Virtual Channel Reservation (AVCR) that provides a highway to target packets, greatly reducing their contention delay in the NoC. AVCR takes advantage of the fact that we can know or predict the destination of some packets ahead of their arrival at the Network Interface (NI). Exploiting the time interval before a packet is ready, AVCR establishes an end-to-end highway from the source NI to the destination NI. This highway is built up by reserving Virtual Channel (VC) resources ahead of the target packet transmission and offering priority service to flits in the reserved VC in the wormhole router, which avoids the target packets' VC allocation and switch arbitration delay. We also propose an admission control method for NoCs with a centralized Artificial Neural Network (ANN) admission controller, which improves system performance by predicting the most appropriate injection rate of each node from network performance information. In the online control process, a data preprocessing unit is applied to simplify the ANN architecture and make the prediction results more accurate. Based on the preprocessed information, the ANN predictor determines the control strategy and broadcasts it to each node, where the admission control is applied. For application-specific MPSoCs, we focus on developing a high-performance NoC and NI compatible with the common AMBA AXI4 interconnect protocol. To offer the possibility of utilizing AXI4-based processors and peripherals in the on-chip-network-based system, we propose a whole-system architecture solution to make the AXI4 protocol compatible with the NoC-based communication interconnect in the many-core system. Because possible out-of-order transmission in the NoC interconnect conflicts with the ordering requirements specified by the AXI4 protocol, we focus in the first place on the design of the transaction ordering units, realizing a high-performance and low-cost solution to the ordering requirements. The microarchitectures and functionalities of the transaction ordering units are also described and explained in detail for ease of implementation. Then, we focus on the NI and the Quality of Service (QoS) support in the NoC. In our design, the NI makes the NoC architecture independent from the AXI4 protocol via message format conversion between the AXI4 signal format and the packet format, offering high flexibility to the NoC design. The NoC-based communication architecture is designed to support multiple high-performance QoS schemes: the NoC system contains Time Division Multiplexing (TDM) and VC subnetworks to apply multiple QoS schemes to AXI4 signals with different QoS tags, and the NI is responsible for traffic distribution between the two subnetworks.
In addition, a QoS inheritance mechanism is applied in the slave-side NI to support QoS during packets' round-trip transfer in the NoC.
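As a hedged sketch of the ANN admission-control idea (predicting a suitable injection rate per node from observed network state), the toy model below trains a small neural network on synthetic load/latency data. The features, targets, and network size are illustrative assumptions, not the thesis's configuration.

```python
# Hedged sketch of ANN-based admission control: learn a mapping from
# observed network state to a recommended per-node injection rate.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
# Synthetic training set: (offered load, mean latency) -> best injection rate.
load = rng.uniform(0.0, 1.0, 500)
latency = 20 + 200 * load**3 + rng.normal(0, 5, 500)  # saturating latency
best_rate = np.clip(0.9 - 0.6 * (latency - 20) / 200, 0.1, 0.9)

X = np.column_stack([load, latency])
ann = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
ann.fit(X, best_rate)

# Centralized controller: predict and broadcast a rate for the current state.
state = np.array([[0.8, 140.0]])  # heavy load, high latency
print("recommended injection rate:", float(ann.predict(state)[0]))
```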
525

Scalable Parallel Machine Learning on High Performance Computing Systems–Clustering and Reinforcement Learning

Weijian Zheng (14226626) 08 December 2022 (has links)
High-performance computing (HPC) and machine learning (ML) have been widely adopted by both academia and industry to address enormous data problems at extreme scales. While research has reported on the interactions of HPC and ML, achieving high performance and scalability for parallel and distributed ML algorithms is still a challenging task. This dissertation first summarizes the major challenges of applying HPC to ML applications: 1) poor performance and scalability, 2) loss of the convergence rate, 3) lower quality of the trained model, and 4) a lack of performance optimization techniques designed for specific applications. Researchers can address the four challenges in new ML applications. This dissertation shows how to solve them for two specific applications: 1) a clustering algorithm and 2) graph optimization algorithms that use reinforcement learning (RL).

As to the clustering algorithm, we first propose the simulated-annealing enhanced clustering algorithm. By combining a blocked data layout and asynchronous local optimization within each thread, the simulated-annealing enhanced clustering algorithm has a convergence rate comparable to the K-means algorithm but with much higher performance. Experiments with synthetic and real-world datasets show that it is significantly faster than the MPI K-means library using up to 1024 cores. However, its optimization costs (Sum of Square Error, SSE) became higher than the original costs. To tackle this problem, we devise a new algorithm called the full-step feel-the-way clustering algorithm. In the full-step feel-the-way algorithm, there are L local steps within each block of data points; we use the first local step's results to compute accurate global optimization costs. Our results show that the full-step algorithm can significantly reduce the global number of iterations needed to converge while obtaining low SSE costs. However, the time spent on the local steps outweighs the benefit of the saved iterations. To improve this, we next optimize the local step time by incorporating a sampling-based method called reassignment-history-aware sampling. Extensive experiments with various synthetic and real-world datasets (e.g., MNIST, CIFAR-10, ENRON, and PLACES-2) show that our parallel algorithms can outperform the fastest open-source MPI K-means implementation by up to 110% on 4,096 CPU cores with comparable SSE costs.

Our evaluations of the sampling-based feel-the-way algorithm establish the effectiveness of the local optimization strategy, the blocked data layout, and the sampling methods for addressing the challenges of applying HPC to ML applications. To explore more parallel strategies and optimization techniques, we focus on a more complex application: graph optimization problems using reinforcement learning (RL). RL has proved successful for automatically learning good heuristics to solve graph optimization problems. However, existing RL systems either do not support graph RL environments or do not support multiple or many GPUs in a distributed setting. This has compromised RL's ability to solve large-scale graph optimization problems, due to the lack of parallelization and high scalability. To address these challenges, we develop OpenGraphGym-MG, a high-performance distributed-GPU RL framework for solving graph optimization problems. OpenGraphGym-MG focuses on a class of computationally demanding RL problems in which both the RL environment and the policy model are highly computation-intensive. In this work, we distribute large-scale graphs across distributed GPUs and use spatial parallelism and data parallelism to achieve scalable performance. We compare and analyze the performance of spatial and data parallelism and highlight their differences. To support graph neural network (GNN) layers that take data samples partitioned across distributed GPUs as input, we design new parallel mathematical kernels to perform operations on distributed 3D sparse and 3D dense tensors. To handle costly RL environments, we design new parallel graph environments to scale up all RL-environment-related operations. By combining the scalable GNN layers with the scalable RL environment, we are able to develop high-performance OpenGraphGym-MG training and inference algorithms in parallel.

To summarize, after proposing the major challenges of applying HPC to ML applications, this thesis explores several parallel strategies and performance optimization techniques using two ML applications. Specifically, we propose a local optimization strategy, a blocked data layout, and sampling methods for accelerating the clustering algorithm, and we create a spatial parallelism strategy; a parallel graph environment, agent, and policy model; an optimized replay buffer; and a multi-node selection strategy for solving large optimization problems over graphs. Our evaluations prove the effectiveness of these strategies and demonstrate that our accelerations can significantly outperform state-of-the-art ML libraries and frameworks without loss of quality in trained models.
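To make the simulated-annealing idea concrete, here is a toy, serial sketch (not the dissertation's blocked, asynchronous parallel algorithm): a random reassignment that increases the SSE is still accepted with probability exp(-delta/T), which lets the clustering escape local minima early on while the temperature T cools.

```python
# Illustrative sketch of a simulated-annealing-style K-means step; a toy
# serial version, with a linear cooling schedule chosen for simplicity.
import numpy as np

def sa_kmeans(X, k, iters=50, T0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    labels = rng.integers(0, k, len(X))
    for it in range(iters):
        T = T0 * (1 - it / iters) + 1e-9           # linear cooling schedule
        i = rng.integers(len(X))                   # pick a random point
        new = rng.integers(k)                      # propose a new cluster
        old = labels[i]
        delta = (np.sum((X[i] - centers[new])**2)
                 - np.sum((X[i] - centers[old])**2))
        if delta < 0 or rng.random() < np.exp(-delta / T):
            labels[i] = new                        # accept (maybe uphill) move
        for j in range(k):                         # recompute centroids
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    sse = sum(np.sum((X[labels == j] - centers[j])**2) for j in range(k))
    return labels, centers, sse

X = np.random.default_rng(42).normal(size=(200, 2))
labels, centers, sse = sa_kmeans(X, k=3)
print("final SSE:", round(float(sse), 2))
```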
526

The molecular structure of selected South African coal-chars to elucidate fundamental principles of coal gasification / Mokone Joseph Roberts

Roberts, Mokone Joseph January 2015 (has links)
Advances in the knowledge of the chemical structure of coal and the development of high-performance computational techniques led to more than 134 proposed molecular-level representations (models) of coal between 1942 and 2010. These models were based almost exclusively on Carboniferous coals from the northern hemisphere; there are only two molecular models based on the inertinite- and vitrinite-rich coals from the southern hemisphere. The current investigation is based on chars derived from Permian-aged coals in two major South African coalfields, the Witbank #4 seam and the Waterberg Upper Ecca. The two coals were upgraded to 85 and 93% inertinite- and vitrinite-rich concentrates, on a visible-mineral-matter-free basis. The coals were slowly heated in an inert atmosphere at 20 ℃ min-1 to 450, 700 and 1000 ℃ and held at that temperature for an hour. After HCl-HF treatment at ambient temperature, the characteristics of the coals and chars were examined with proximate, ultimate, helium density, porosity, surface area, petrographic, solid-state 13C NMR, XRD and HRTEM analytical techniques. The results largely showed that substantial transitions occurred at 700-1000 ℃, where the chars became physically different but chemically similar. Consequently, the chars at the highest temperature (1000 ℃) drew attention to the detailed study of the atomistic properties that may give rise to different reactivity behaviours with CO2 gas. The H/C atomic ratios for the inertinite- and vitrinite-rich chars were respectively 0.31 and 0.49 at 450 ℃ and 0.10 and 0.12 at 1000 ℃. The true densities were respectively 1.48 and 1.38 g.cm-3 at 450 ℃ and 1.87 and 1.81 g.cm-3 at 1000 ℃. The char form results from the petrographic analysis indicated that the 700-1000 ℃ inertinite-rich chars have lower proportions of thick-walled isotropic coke derived from pure vitrinites (5-8%) compared with the vitrinite-rich chars (91-95%). This property leads to the creation of pores and increases in volume and surface area as the softening walls expand. It was found that the average crystallite diameter, La, from XRD and the mean length of the aromatic carbon fringes from HRTEM were in good agreement and made a definite distinction between the 1000 ℃ inertinite- and vitrinite-rich chars. The crystallite diameter from peak (10) approximations, La(10), of 37.6 Å for the 1000 ℃ inertinite-rich chars fell within the HRTEM's minimum-maximum length boundary of 11x11 aromatic fringes (27-45 Å). The La(10) of 30.7 Å for the vitrinite-rich chars fell nearly on the minimum-maximum length range of 7x7 aromatic fringes (17-28 Å). The HRTEM results showed that the 1000 ℃ inertinite-rich chars comprised a higher distribution of larger aromatic fringes (11x11 parallelogram catenations), compared with a higher distribution of smaller aromatic fringes (7x7 parallelogram catenations) in the vitrinite-rich chars. The mechanism for the similarity between the 700-1000 ℃ inertinite- and vitrinite-rich chars was the greater transition occurring in the vitrinite-rich coal to match the more resistant inertinite-rich coal. This emphasised that the transitions in the properties of vitrinite-rich coals were more thermally accelerated than those of the inertinite-rich coals. The similarity between the inertinite- and vitrinite-rich chars was shown by the total maceral reflectance, proximate, ultimate, skeletal density and aromaticity results.
Evidence for this was the carbon content by mass for the inertinite- and vitrinite-rich chars of respectively 90.5 and 85.3% at 450 ℃ and 95.9 and 94.1% at 1000 ℃. The aromaticity from the XRD technique was respectively 87 and 77% at 450 ℃ and 98 and 96% at 1000 ℃. A similar pattern was found in the hydrogen and oxygen contents, the atomic O/C ratios and the aromaticity from the NMR technique. The subsequent construction of large-scale molecular structures for the 1000 ℃ chars yielded an inertinite-rich char model comprising 106 molecules built from a total of 42929 atoms, while the vitrinite-rich char model was made up of 185 molecules consisting of a total of 44315 atoms. The difference in the number of molecules was due to the inertinite-rich char model comprising a higher distribution of larger molecules than the vitrinite-rich char model, in agreement with the XRD and HRTEM results. These char structures were used to examine behaviour on the basis of gasification reactivity with CO2. Density functional theory (DFT) was used to evaluate the interactions between CO2 and the atomistic representations of coal char derived from the inertinite- and vitrinite-rich South African coals. The construction of the char models used the modal aromatic fringes (the fringes of highest frequency in the size distributions) from HRTEM for the inertinite- and vitrinite-rich chars, respectively 11x11 and 7x7 parallelogram-shaped aromatic carbon rings. The structures were geometrically optimized with DFT and used to measure reactivity with the Fukui function, f+(r), and to depict a representative reactive carbon edge for simulations of the coal gasification reaction mechanism with CO2 gas. The f+(r) reactivity indices of the reactive edge follow the sequence: zigzag C remote from the tip C (Czi = 0.266) > first armchair C (Cr1 = 0.087) > tip C (Ct = 0.075) > second armchair C (Cr2 = 0.029) > zigzag C proximate to the tip C (Cz = 0.027). The DFT-simulated mean activation energy, ΔEb, for the gasification reaction mechanism (formation of the second CO gas molecule) was 233 kJ mol-1; the formation of the second CO molecule essentially defines gasification. The experimental activation energies, determined with TGA and a random pore model to account for pore variation in addition to the gasification chemical reaction, were very similar (191 ± 25 kJ mol-1 and 210 ± 8 kJ mol-1) and in good agreement with the atomistic results. The investigation shows promise for the utility of molecular representations of coal char within the context of the fundamental coal gasification reaction mechanism with CO2. / PhD (Chemical Engineering), North-West University, Potchefstroom Campus, 2015
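As a quick, hedged check of how close the reported activation energies are in kinetic terms, one can compare their Arrhenius factors exp(-Ea/RT). The script below does so at an assumed gasification temperature of 1000 ℃ (reusing the char preparation temperature), which is an illustrative choice, not a value taken from the thesis.

```python
# Hedged illustration (not from the thesis): compare the DFT-simulated
# activation energy (233 kJ/mol) with the experimental values (191 and
# 210 kJ/mol) via the Arrhenius factor exp(-Ea / RT).
import math

R = 8.314      # J mol^-1 K^-1, gas constant
T = 1273.15    # K, assumed gasification temperature (1000 deg C)

for label, Ea_kJ in [("DFT", 233.0), ("TGA low", 191.0), ("TGA high", 210.0)]:
    k_rel = math.exp(-Ea_kJ * 1e3 / (R * T))   # relative rate factor
    print(f"{label:8s} Ea = {Ea_kJ:5.0f} kJ/mol  exp(-Ea/RT) = {k_rel:.2e}")
```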
528

A live imaging paradigm for studying Drosophila development and evolution

Schmied, Christopher 30 March 2016 (has links) (PDF)
Proper metazoan development requires that genes are expressed in a spatiotemporally controlled manner, with tightly regulated levels. Altering the expression of genes that govern development leads mostly to aberrations. However, alterations can also be beneficial, leading to the formation of new phenotypes, which contributes to the astounding diversity of animal forms. In the past, the expression of developmental genes has been studied mostly in fixed tissues, which cannot capture these highly dynamic processes. We combine genomic fosmid transgenes, expressing genes of interest close to endogenous conditions, with Selective Plane Illumination Microscopy (SPIM) to image the expression of genes live, with high temporal resolution and at single-cell level, in the entire embryo. In an effort to expand the toolkit for studying Drosophila development, we have characterized the global expression patterns of various developmentally important genes in the whole embryo. To process the large datasets generated by SPIM, we have developed an automated workflow for processing on a High Performance Computing (HPC) cluster. In a parallel project, we wanted to understand how spatiotemporally regulated gene expression patterns and levels lead to different morphologies across Drosophila species. To this end, we used SPIM to compare the expression of transcription factors (TFs) encoded by Drosophila melanogaster fosmids to that of their orthologous Drosophila pseudoobscura counterparts, expressing both fosmids in D. melanogaster. Here, we present an analysis of the divergence of expression of orthologous genes, compared A) directly, by expressing the fosmids, tagged with different fluorophores, in the same D. melanogaster embryo, or B) indirectly, by expressing the fosmids, tagged with the same fluorophore, in separate D. melanogaster embryos. Our workflow provides a powerful methodology for the study of gene expression patterns and levels during development; such knowledge is a basis for understanding both their evolutionary relevance and developmental function.
529

ZIH-Info

18 December 2015 (has links) (PDF)
- HRSK-II maintenance - Central firewall at TU Dresden - Deployment of Windows 10 at TU Dresden - Problems with Office 2016 and SharePoint 2013 - Overcoming I/O bottlenecks in HPC - Workshop on "Big Data in Business" - ZIH presents itself at SC15 in Austin, Texas - ZIH publications - Events
530

ZIH-Info

12 August 2016 (has links) (PDF)
- IT service catalogue of TU Dresden - Conference access for eduroam - Adobe ETLA desktop framework agreement for Saxony - Workshop "Videokonferenzen im Wissenschaftsnetz" (video conferencing in the research network) - The ZIH runs ("Das ZIH läuft") - Announcement from Directorate 8 - Opening of the front desk of the ServiceCenterStudium - ZIH publications - Events
