1 |
Register Caching for Energy Efficient GPGPU Tensor Core Computing / Registrera cachelagring för energieffektiv GPGPU Tensor Core ComputingQian, Qiran January 2023 (has links)
The General-Purpose GPU (GPGPU) has emerged as the predominant computing device for extensive parallel workloads in the fields of Artificial Intelligence (AI) and Scientific Computing, primarily owing to its adoption of the Single Instruction Multiple Thread architecture, which not only provides a wealth of thread context but also effectively hide the latencies exposed in the single threads executions. As computational demands have evolved, modern GPGPUs have incorporated specialized matrix engines, e.g., NVIDIA’s Tensor Core (TC), in order to deliver substantially higher throughput for dense matrix computations compared with traditional scalar or vector architectures. Beyond mere throughput, energy efficiency is a pivotal concern in GPGPU computing. The register file is the largest memory structure on the GPGPU die and typically accounts for over 20% of the dynamic power consumption. To enhance energy efficiency, GPGPUs incorporate a technique named register caching borrowed from the realm of CPUs. Register caching captures temporal locality among register operands to reduce energy consumption within a 2- level register file structure. The presence of TC raises new challenges for Register Cache (RC) design, as each matrix instruction applies intensive operand delivering traffic on the register file banks. In this study, we delve into the RC design trade-offs in GPGPUs. We undertake a comprehensive exploration of the design space, encompassing a range of workloads. Our experiments not only reveal the basic design considerations of RC but also clarify that conventional caching strategies underperform, particularly when dealing with TC computations, primarily due to poor temporal locality and the substantial register operand traffic involved. Based on these findings, we propose an enhanced caching strategy featuring a look-ahead allocation policy to minimize unnecessary cache allocations for the destination register operands. Furthermore, to leverage the energy efficiency of Tensor Core computing, we highlight an alternative instruction scheduling framework for Tensor Core instructions that collaborates with a specialized caching policy, resulting in a remarkable reduction of up to 50% in dynamic energy consumption within the register file during Tensor Core GEMM computations. / Den allmänna ändamålsgrafikprocessorn (GPGPU) har framträtt som den dominerande beräkningsenheten för omfattande parallella arbetsbelastningar inom områdena för artificiell intelligens (AI) och vetenskaplig beräkning, huvudsakligen tack vare dess antagande av arkitekturen för enkel instruktion, flera trådar (Single Instruction Multiple Thread), vilket inte bara ger en mängd trådcontext utan också effektivt döljer de latenser som exponeras vid enskilda trådars utförande. När beräkningskraven har utvecklats har moderna GPGPU:er inkorporerat specialiserade matrismotorer, t.ex., NVIDIAs Tensor Core (TC), för att leverera avsevärt högre genomströmning för täta matrisberäkningar jämfört med traditionella skalär- eller vektorarkitekturer. Bortom endast genomströmning är energieffektivitet en central oro inom GPGPUberäkning. Registerfilen är den största minnesstrukturen på GPGPU-dien och svarar vanligtvis för över 20% av den dynamiska effektförbrukningen För att förbättra energieffektiviteten inkorporerar GPGPU:er en teknik vid namn registercachning, lånad från CPU-världen. Registercachning fångar temporal lokalitet bland registeroperanderna för att minska energiförbrukningen inom en 2-nivåers registerfilstruktur. Närvaron av TC innebär nya utmaningar för Register Cache (RC)-design, eftersom varje matrisinstruktion genererar intensiv operandleverans på registerfilbankarna. I denna studie fördjupar vi oss i RC-designavvägandena i GPGPU:er. Vi genomför en omfattande utforskning av designutrymmet, som omfattar olika arbetsbelastningar. Våra experiment avslöjar inte bara de grundläggande designövervägandena för RC utan klargör också att konventionella cachestrategier underpresterar, särskilt vid hantering av TC-beräkningar, främst på grund av dålig temporal lokalitet och den betydande trafiken med registeroperand. Baserat på dessa resultat föreslår vi en förbättrad cachestrategi med en look-ahead-alloceringspolicy för att minimera onödiga cacheallokeringar för destinationens registeroperand. Dessutom, för att dra nytta av energieffektiviteten hos Tensor Core-beräkning, belyser vi en alternativ instruktionsplaneringsram för Tensor Core-instruktioner som samarbetar med en specialiserad cachelayout, vilket resulterar i en anmärkningsvärd minskning av upp till 50% i dynamisk energiförbrukning inom registerfilen under Tensor Core GEMM-beräkningar.
|
2 |
Evaluation of FPGA-based High Performance Computing PlatformsFrick-Lundgren, Martin January 2023 (has links)
High performance computing is a topic that has risen to the top in the era ofdigitalization, AI and automation. Therefore, the search for more cost and timeeffective ways to implement HPC work is always a subject extensively researched.One part of this is to have hardware that is capable to improve on these criteria. Different hardware usually have different code languages to implement theseworks though, cross-platform solution like Intel’s oneAPI framework is startingto gaining popularity.In this thesis, the capabilities of Intel’s oneAPI framework to implement andexecute HPC benchmarks on different hardware platforms will be discussed. Using the hardware available through Intel’s DevCloud services, Intel’s Xeon Gold6128, Intel’s UHD Graphics P630 and the Arria10 FPGA board were all chosento use for implementation. The benchmarks that were chosen to be used wereGEMM (General Matrix Multiplication) and BUDE (Bristol University DockingEngine). They were implemented using DPC++ (Data Parallel C++), Intel’s ownSYCL-based C++ extension. The benchmarks were also tried to be improved uponwith HPC speed-up methods like loop unrolling and some hardware manipulation.The performance for CPU and GPU were recorded and compared, as the FPGAimplementation could not be preformed because of technical difficulties. Theresults are good comparison to related work, but did not improve much uponthem. This because the hardware used is quite weak compared to industry standard. Though further research on the topic would be interesting, to compare aworking FPGA implementation to the other results and results from other studies. This implementation also probably has the biggest improvement potential,so to see how good one could make it would be interesting. Also, testing someother more complex benchmarks could be interesting.
|
3 |
Use of genetically engineered mouse models in preclinical drug developmentCreedon, Helen January 2015 (has links)
The paucity of well validated preclinical models is frequently cited as a contributing factor to the high attrition rates seen in clinical oncological trials. There remains a critical need to develop models which are accurately able to recapitulate the features of human disease. The aims of this study were to use genetically engineered mouse models (GEMMs) to explore the efficacy of novel treatment strategies in HER2 positive breast cancer and to further develop the model to facilitate the study of mechanisms underpinning drug resistance. Using the BLG--HER2KI-PTEN+/- model, we demonstrated that Src plays an important role in the early stages of tumour development. Chemopreventative treatment with dasatinib delayed tumour inititation (p= 0.046, Wilcoxon signed rank test) and prolonged overall survival (OS) (p=0.06, Wilcoxon signed rank test). Dasatinib treatment also induced squamous metaplasia in 66% of drug treated tumours. We used 2 cell lines derived from this model to further explore dasatinib’s mechanism of action and demonstrated reduced proliferation, migration and invasion following in vitro treatment. Due to the prolonged tumour latency and the low metastatic rate seen in this model, further studies were undertaken with the MMTV-NIC model. This model also allowed us to study the impact of PTEN loss on therapeutic response. We validated this model by treating a cohort of MMTV-NIC PTEN+/- mice with paclitaxel and demonstrated prolonged OS (p=0.035, Gehan Breslow Wilcoxon test). AZD8931 is an equipotent signalling inhibitor of HER2, HER3 and EGFR. We observed heterogeneity in tumour response but overall AZD8931 treatment prolonged OS in both MMTV-NIC PTEN FL/+ and MMTV-NIC PTEN+/- models. PTEN loss was associated with reduced sensitivity to AZD8931 and failure to suppress Src activity, suggesting these may be suitable predictive biomarkers of AZD8931 response. To facilitate further studies exploring resistance, we transplanted MMTV-NIC PTEN+/- fragments into syngeneic mice and generated 3 tumours with acquired resistance to AZD8931. These tumours displayed differing resistance strategies; 1 tumour continued to express HER2 whilst the remaining 2 underwent EMT and lost HER2 expression reflecting to a very limited degree some of the heterogeneity of resistance strategies seen in human disease. To further explore resistance to HER2 targeting tyrosine kinase inhibitors, we generated a panel of human cell lines with acquired resistance to AZD8931 and lapatinib. Western blotting demonstrated loss of HER2, HER3 and PTEN in all resistant lines. Acquisition of resistance was associated with a marked change in phenotype and western blotting confirmed all lines had undergone EMT. We used a combination of RPPA and mass spectrometry to further characterise the AZD8931 resistant lines and identified multiple potential novel proteins involved in the resistant phenotype, including several implicated in EMT. In conclusion, when coupled with appropriate in vitro techniques, the MMTV-NIC model is a valuable tool for selection of emerging drugs to carry forward into clinical trials of HER2 positive breast cancer.
|
4 |
Recursive Blocked Algorithms, Data Structures, and High-Performance Software for Solving Linear Systems and Matrix EquationsJonsson, Isak January 2003 (has links)
<p>This thesis deals with the development of efficient and reliable algorithms and library software for factorizing matrices and solving matrix equations on high-performance computer systems. The architectures of today's computers consist of multiple processors, each with multiple functional units. The memory systems are hierarchical with several levels, each having different speed and size. The practical peak performance of a system is reached only by considering all of these characteristics. One portable method for achieving good system utilization is to express a linear algebra problem in terms of level 3 BLAS (Basic Linear Algebra Subprogram) transformations. The most important operation is GEMM (GEneral Matrix Multiply), which typically defines the practical peak performance of a computer system. There are efficient GEMM implementations available for almost any platform, thus an algorithm using this operation is highly portable.</p><p>The dissertation focuses on how recursion can be applied to solve linear algebra problems. Recursive linear algebra algorithms have the potential to automatically match the size of subproblems to the different memory hierarchies, leading to much better utilization of the memory system. Furthermore, recursive algorithms expose level 3 BLAS operations, and reveal task parallelism. The first paper handles the Cholesky factorization for matrices stored in packed format. Our algorithm uses a recursive packed matrix data layout that enables the use of high-performance matrix--matrix multiplication, in contrast to the standard packed format. The resulting library routine requires half the memory of full storage, yet the performance is better than for full storage routines.</p><p>Paper two and tree introduce recursive blocked algorithms for solving triangular Sylvester-type matrix equations. For these problems, recursion together with superscalar kernels produce new algorithms that give 10-fold speedups compared to existing routines in the SLICOT and LAPACK libraries. We show that our recursive algorithms also have a significant impact on the execution time of solving unreduced problems and when used in condition estimation. By recursively splitting several problem dimensions simultaneously, parallel algorithms for shared memory systems are obtained. The fourth paper introduces a library---RECSY---consisting of a set of routines implemented in Fortran 90 using the ideas presented in paper two and three. Using performance monitoring tools, the last paper evaluates the possible gain in using different matrix blocking layouts and the impact of superscalar kernels in the RECSY library. </p>
|
5 |
Recursive Blocked Algorithms, Data Structures, and High-Performance Software for Solving Linear Systems and Matrix EquationsJonsson, Isak January 2003 (has links)
This thesis deals with the development of efficient and reliable algorithms and library software for factorizing matrices and solving matrix equations on high-performance computer systems. The architectures of today's computers consist of multiple processors, each with multiple functional units. The memory systems are hierarchical with several levels, each having different speed and size. The practical peak performance of a system is reached only by considering all of these characteristics. One portable method for achieving good system utilization is to express a linear algebra problem in terms of level 3 BLAS (Basic Linear Algebra Subprogram) transformations. The most important operation is GEMM (GEneral Matrix Multiply), which typically defines the practical peak performance of a computer system. There are efficient GEMM implementations available for almost any platform, thus an algorithm using this operation is highly portable. The dissertation focuses on how recursion can be applied to solve linear algebra problems. Recursive linear algebra algorithms have the potential to automatically match the size of subproblems to the different memory hierarchies, leading to much better utilization of the memory system. Furthermore, recursive algorithms expose level 3 BLAS operations, and reveal task parallelism. The first paper handles the Cholesky factorization for matrices stored in packed format. Our algorithm uses a recursive packed matrix data layout that enables the use of high-performance matrix--matrix multiplication, in contrast to the standard packed format. The resulting library routine requires half the memory of full storage, yet the performance is better than for full storage routines. Paper two and tree introduce recursive blocked algorithms for solving triangular Sylvester-type matrix equations. For these problems, recursion together with superscalar kernels produce new algorithms that give 10-fold speedups compared to existing routines in the SLICOT and LAPACK libraries. We show that our recursive algorithms also have a significant impact on the execution time of solving unreduced problems and when used in condition estimation. By recursively splitting several problem dimensions simultaneously, parallel algorithms for shared memory systems are obtained. The fourth paper introduces a library---RECSY---consisting of a set of routines implemented in Fortran 90 using the ideas presented in paper two and three. Using performance monitoring tools, the last paper evaluates the possible gain in using different matrix blocking layouts and the impact of superscalar kernels in the RECSY library.
|
6 |
Biological functions of microRNA-216 and microRNA-217 during the development of pancreatic cancerAzevedo-Pouly, Ana Clara P. 17 October 2013 (has links)
No description available.
|
7 |
Defining Mutation-Specific NRAS Functions that Drive MelanomagenesisMurphy, Brandon M. January 2021 (has links)
No description available.
|
Page generated in 0.539 seconds