Global ETD Search

251	Designing Scalable and High Performance One Sided Communication Middleware for Modern Interconnects Santhanaraman, Gopalakrishnan 02 September 2009 No description available. Computer Science one-sided Middleware InfiniBand MPI-2 Communication Programming-models overlap networks
252	Interior Penalty Discontinuous Galerkin Finite Element Method for the Time-Domain Maxwell's Equations Dosopoulos, Stylianos 22 June 2012 No description available. Electrical Engineering Electromagnetics Discontinuous Galerkin Time Domain Non-conformal MPI/GPU parallelization
253	Enhancement of LIMIC-Based Collectives for Multi-core Clusters Dhanraj, Vijay 28 August 2012 No description available. Computer Engineering Computer Science LiMIC MPI Collectives Multi-Leader Topology Hierarchical Framework
254	Constraint based network communications in a virtual environment of a proprietary hardware Bhonagiri, Saaish, Mudugonda, Soumith Kumar January 2022 Specialized hardware remains a key component of mobile networks, but at the same time, the telecom industry is adopting a vision of a fully programmable distributed end-to-end network with cloud-style management and Software-Defined Networking. In such a programmable network, it will be possible to place workloads across abstracted compute and networking infrastructure. However, whereas virtualization of standard compute resources is a mature technology and well supported in cloud management systems such as OpenStack and Kubernetes, this is not the case for specialized hardware with more complex constraints. There is a significant gap in terms of advanced constraints and service-level-aware schedulers. The main objective of this thesis is to adapt the specialised hardware to the features of edge computing. Edge computing provides the opportunity to explore how technologies can advance industrial processes. The goal is to achieve flexibility by choosing where on the board the workload should be processed, based on available resources. Utilising this technology, highly intensive applications can be handled at the network’s edge. There is a necessity to virtualize the proprietary hardware and run workloads in VMs and containers. In this thesis, we discuss kernel bypass, PCI passthrough and MPI communication technologies in a virtual environment by considering the hardware constraints and software requirements so that these technologies can be integrated into OpenStack and Kubernetes in future. Network communication SR-IOV OVS+DPDK MPI OpenStack Kubernetes Telecommunications
255	CPU/GPU Code Acceleration on Heterogeneous Systems and Code Verification for CFD Applications Xue, Weicheng 25 January 2021 Computational Fluid Dynamics (CFD) applications usually involve intensive computations, which can be accelerated through using open accelerators, especially GPUs due to their common use in the scientific computing community. In addition to code acceleration, it is important to ensure that the code and algorithm are implemented numerically correctly, which is called code verification. This dissertation focuses on accelerating research CFD codes on multi-CPUs/GPUs using MPI and OpenACC, as well as the code verification for turbulence model implementation using the method of manufactured solutions and code-to-code comparisons. First, a variety of performance optimizations both agnostic and specific to applications and platforms are developed in order to 1) improve the heterogeneous CPU/GPU compute utilization; 2) improve the memory bandwidth to the main memory; 3) reduce communication overhead between the CPU host and the GPU accelerator; and 4) reduce the tedious manual tuning work for GPU scheduling. Both finite difference and finite volume CFD codes and multiple platforms with different architectures are utilized to evaluate the performance optimizations used. A maximum speedup of over 70 is achieved on 16 V100 GPUs over 16 Xeon E5-2680v4 CPUs for multi-block test cases. In addition, systematic studies of code verification are performed for a second-order accurate finite volume research CFD code. Cross-term sinusoidal manufactured solutions are applied to verify the Spalart-Allmaras and k-omega SST model implementation, both in 2D and 3D. This dissertation shows that the spatial and temporal schemes are implemented numerically correctly. / Doctor of Philosophy / Computational Fluid Dynamics (CFD) is a numerical method to solve fluid problems, which usually requires a large amount of computation.
A large CFD problem can be decomposed into smaller sub-problems which are stored in discrete memory locations and accelerated by a large number of compute units. In addition to code acceleration, it is important to ensure that the code and algorithm are implemented correctly, which is called code verification. This dissertation focuses on the CFD code acceleration as well as the code verification for turbulence model implementation. In this dissertation, multiple Graphics Processing Units (GPUs) are utilized to accelerate two CFD codes, considering that the GPU has high computational power and high memory bandwidth. A variety of optimizations are developed and applied to improve the performance of CFD codes on different parallel computing systems. The program execution time can be reduced significantly, especially when multiple GPUs are used. In addition, code-to-code comparisons with some NASA CFD codes and the method of manufactured solutions are utilized to verify the correctness of a research CFD code. GPU OpenACC MPI Domain Decomposition Performance Optimization GPUDirect Code Verification OOA Discretization Error
256	Data and Processor Mapping Strategies for Dynamically Resizable Parallel Applications Chinnusamy, Malarvizhi 18 August 2004 Due to the unpredictability in job arrival times in clusters and widely varying resource requirements, dynamic scheduling of parallel computing resources is necessary to increase system throughput. Dynamically resizable applications provide the flexibility needed for dynamic scheduling. These applications can expand to take advantage of additional free processors, or to meet a Quality of Service (QoS) deadline, or can shrink to accommodate a high priority application, without getting suspended. This thesis is part of a larger effort to define a framework for dynamically resizable parallel applications. This framework includes a scheduler that supports resizing applications, an API to enable applications to interact with the scheduler, and libraries that make resizing viable. This thesis focuses on libraries for efficient resizing of parallel applications, where efficiency means minimizing the cost of data redistribution, choosing and allocating the right set of additional processors, and focusing on the performance of the application after resizing. We explore the tradeoffs between these goals on both homogeneous and heterogeneous clusters. We focus on structured applications that have 2D data arrays distributed across a 2D processor grid. Our library includes algorithms for processor selection and processor mapping. For homogeneous clusters, processor selection involves selecting the number of processors that need to be added, and processor mapping decides the placement of the new processors in the context of the given topology such that it minimizes the amount of data that is to be redistributed. For heterogeneous clusters, since the processing powers of the processors vary, there is also an additional problem of choosing the right set of processors that needs to be added.
We also present results that demonstrate the effectiveness of our approach. / Master of Science Grid Computing Remapping Heterogeneous resources MPI Processor allocation dynamic resizable applications ScaLAPACK
257	Adjusting Process Count on Demand for Petascale Global Optimization Radcliffe, Nicholas Ryan 16 January 2012 There are many challenges that need to be met before efficient and reliable computation at the petascale is possible. Many scientific and engineering codes running at the petascale are likely to be memory intensive, which makes thrashing a serious problem for many petascale applications. One way to overcome this challenge is to use a dynamic number of processes, so that the total amount of memory available for the computation can be increased on demand. This thesis describes modifications made to the massively parallel global optimization code pVTdirect in order to allow for a dynamic number of processes. In particular, the modified version of the code monitors memory use and spawns new processes if the amount of available memory is determined to be insufficient. The primary design challenges are discussed, and performance results are presented and analyzed. / Master of Science Petascale computing Global optimization Message Passing Interface (MPI) Dynamic process count
258	Programming High-Performance Clusters with Heterogeneous Computing Devices Aji, Ashwin M. 19 May 2015 Today's high-performance computing (HPC) clusters are seeing an increase in the adoption of accelerators like GPUs, FPGAs and co-processors, leading to heterogeneity in the computation and memory subsystems. To program such systems, application developers typically employ a hybrid programming model of MPI across the compute nodes in the cluster and an accelerator-specific library (e.g., CUDA, OpenCL, OpenMP, OpenACC) across the accelerator devices within each compute node. Such explicit management of disjoint computation and memory resources leads to reduced productivity and performance. This dissertation focuses on designing, implementing and evaluating a runtime system for HPC clusters with heterogeneous computing devices. This work also explores extending existing programming models to make use of our runtime system for easier code modernization of existing applications. Specifically, we present MPI-ACC, an extension to the popular MPI programming model and runtime system for efficient data movement and automatic task mapping across the CPUs and accelerators within a cluster, and discuss the lessons learned. MPI-ACC's task-mapping runtime subsystem performs fast and automatic device selection for a given task. MPI-ACC's data-movement subsystem includes careful optimizations for end-to-end communication among CPUs and accelerators, which are seamlessly leveraged by the application developers. MPI-ACC provides a familiar, flexible and natural interface for programmers to choose the right computation or communication targets, while its runtime system achieves efficient cluster utilization. / Ph. D. Runtime Systems Programming Models Message Passing Interface (MPI) CUDA OpenCL
259	Automated Runtime Analysis and Adaptation for Scalable Heterogeneous Computing Helal, Ahmed Elmohamadi Mohamed 29 January 2020 In the last decade, there have been tectonic shifts in computer hardware because sequential CPU performance has reached its physical limits. As a consequence, current high-performance computing (HPC) systems integrate a wide variety of compute resources with different capabilities and execution models, ranging from multi-core CPUs to many-core accelerators. While such heterogeneous systems can enable dramatic acceleration of user applications, extracting optimal performance via manual analysis and optimization is a complicated and time-consuming process. This dissertation presents graph-structured program representations to reason about the performance bottlenecks on modern HPC systems and to guide novel automation frameworks for performance analysis and modeling and runtime adaptation. The proposed program representations exploit domain knowledge and capture the inherent computation and communication patterns in user applications, at multiple levels of computational granularity, via compiler analysis and dynamic instrumentation. The empirical results demonstrate that the introduced modeling frameworks accurately estimate the realizable parallel performance and scalability of a given sequential code when ported to heterogeneous HPC systems. As a result, these frameworks enable efficient workload distribution schemes that utilize all the available compute resources in a performance-proportional way. In addition, the proposed runtime adaptation frameworks significantly improve the end-to-end performance of important real-world applications which suffer from limited parallelism and fine-grained data dependencies.
Specifically, compared to the state-of-the-art methods, such an adaptive parallel execution achieves up to an order-of-magnitude speedup on the target HPC systems while preserving the inherent data dependencies of user applications. / Doctor of Philosophy / Current supercomputers integrate a massive number of heterogeneous compute units with varying speed, computational throughput, memory bandwidth, and memory access latency. This trend represents a major challenge to end users, as their applications have been designed from the ground up to primarily exploit homogeneous CPUs. While heterogeneous systems can deliver several orders of magnitude speedup compared to traditional CPU-based systems, end users need extensive software and hardware expertise as well as significant time and effort to efficiently utilize all the available compute resources. To streamline such a daunting process, this dissertation presents automated frameworks for analyzing and modeling the performance on parallel architectures and for transforming the execution of user applications at runtime. The proposed frameworks incorporate domain knowledge and adapt to the input data and the underlying hardware using novel static and dynamic analyses. The experimental results show the efficacy of the introduced frameworks across many important application domains, such as computational fluid dynamics (CFD), and computer-aided design (CAD). In particular, the adaptive execution approach on heterogeneous systems achieves up to an order-of-magnitude speedup over the optimized parallel implementations. Parallel Architectures Accelerators Heterogeneous Computing Performance Modeling Runtime Adaptation Scheduling Performance Portability MPI GPU LLVM
260	Parallel implementation and application of particle scale heat transfer in the Discrete Element Method Amritkar, Amit Ravindra 25 July 2013 Dense fluid-particulate systems are widely encountered in the pharmaceutical, energy, environmental and chemical processing industries. Prediction of the heat transfer characteristics of these systems is challenging. Use of a high fidelity Discrete Element Method (DEM) for particle scale simulations coupled to Computational Fluid Dynamics (CFD) requires large simulation times and limits application to small particulate systems. The overall goal of this research is to develop and implement parallelization techniques which can be applied to large systems with O(10^5-10^6) particles to investigate particle scale heat transfer in rotary kiln and fluidized bed environments. The strongly coupled CFD and DEM calculations are parallelized using the OpenMP paradigm which provides the flexibility needed for the multimodal parallelism encountered in fluid-particulate systems. The fluid calculation is parallelized using domain decomposition, whereas N-body decomposition is used for DEM. It is shown that OpenMP-CFD with the first touch policy, appropriate thread affinity and careful tuning scales as well as MPI up to 256 processors on a shared memory SGI Altix. To implement DEM in the OpenMP framework, ghost particle transfers between grid blocks, which consume a substantial amount of time in DEM, are eliminated by a suitable global mapping of the multi-block data structure. The global mapping together with enforcing perfect particle load balance across OpenMP threads results in computational times 2 to 5 times faster than an equivalent MPI implementation. Heat transfer studies are conducted in a rotary kiln as well as in a fluidized bed equipped with a single horizontal tube heat exchanger.
Two cases, one with mono-disperse 2 mm particles rotating at 20 RPM and another with a poly-disperse distribution ranging from 1 to 2.8 mm and rotating at 1 RPM, are investigated. It is shown that heat transfer to the mono-disperse 2 mm particles is dominated by convective heat transfer from the thermal boundary layer that forms on the heated surface of the kiln. In the second case, during the first 24 seconds, the heat transfer to the particles is dominated by conduction to the larger particles that settle at the bottom of the kiln. The results compare reasonably well with experiments. In the fluidized bed, the highly energetic transitional flow and thermal field in the vicinity of the tube surface and the limits placed on the grid size by the volume-averaged nature of the governing equations result in gross underprediction of the heat transfer coefficient at the tube surface. It is shown that the inclusion of a subgrid stress model and the application of an LES wall function (WMLES) at the tube surface improves the prediction to within ±20% of the experimental measurements. / Ph. D. Computational fluid dynamics (CFD) Heat--Transmission OpenMP MPI Hybrid parallelization Performance tools Multiphase flows
