Global ETD Search

171	The fast multipole method at exascale Chandramowlishwaran, Aparna 13 January 2014 This thesis presents a top-to-bottom analysis of designing and implementing fast algorithms for current and future systems. We present new analysis, algorithmic techniques, and implementations of the Fast Multipole Method (FMM) for solving N-body problems. We target the FMM because it is broadly applicable to a variety of scientific particle simulations used to study electromagnetic, fluid, and gravitational phenomena, among others. Importantly, the FMM has asymptotically optimal time complexity with guaranteed approximation accuracy. As such, it is among the most attractive solutions for scalable particle simulation on future extreme-scale systems. We specifically address two key challenges. The first challenge is how to engineer fast code for today’s platforms. We present the first in-depth study of multicore optimizations and tuning for the FMM, along with a systematic approach for transforming a conventionally parallelized FMM into a highly tuned one. We introduce novel optimizations that significantly improve the within-node scalability of the FMM, thereby enabling high performance on multicore and manycore systems. The second challenge is how to understand scalability on future systems. We present a new algorithmic complexity analysis of the FMM that considers both intra- and inter-node communication costs. Using these models, we present results for choosing the optimal algorithmic tuning parameter. This analysis also yields the surprising prediction that although the FMM is largely compute-bound today, and therefore highly scalable on current systems, it could become communication-bound as early as 2015 if processor architecture designs continue on their current trajectory without significant changes. 
This prediction suggests the utility of our analysis approach, which directly relates algorithmic and architectural characteristics, for enabling a new kind of high-level algorithm-architecture co-design. To demonstrate the scientific significance of the FMM, we present two applications: direct simulation of blood (a multi-scale, multi-physics problem) and large-scale biomolecular electrostatics. MoBo (Moving Boundaries) is the infrastructure for the direct numerical simulation of blood. It comprises two key algorithmic components, one of which is the FMM. We were able to simulate blood flow using Stokesian dynamics on 200,000 cores of Jaguar, a petaflop system, and achieved a sustained performance of 0.7 Petaflop/s. The second application, which we propose as future work in this thesis, is biomolecular electrostatics, where we solve for the electrical potential using the boundary-integral formulation discretized with boundary element methods (BEM). The computational kernel in solving the resulting large linear system is a dense matrix-vector multiplication, which we propose to compute using our scalable FMM. We propose to begin with the two-dielectric problem, in which the electrostatic field is calculated using two continuum dielectric media: the solvent and the molecule. This is only a first step toward solving biologically challenging problems that involve more than two dielectric media, ion-exclusion layers, and solvent-filled cavities. Finally, given the difficulty of producing high-performance scalable code, productivity is a key concern. Numerical algorithms are now being redesigned to take advantage of the architectural features of emerging multicore processors. These new classes of algorithms express fine-grained asynchronous parallelism and hence reduce the cost of synchronization. We performed the first extensive performance study of a recently proposed parallel programming model called Concurrent Collections (CnC). 
In CnC, the programmer expresses her computation in terms of application-specific operations, partially ordered by semantic scheduling constraints. The CnC model is well suited to expressing asynchronous-parallel algorithms, so we evaluate CnC using two dense linear algebra algorithms written in this style and run on state-of-the-art multicore systems. Our CnC implementations were able to match, and in some cases exceed, competing vendor-tuned and domain-specific library codes. We combine these two distinct research efforts by expressing the FMM in CnC; this approach aims to marry performance with productivity, a combination that will be critical on future systems. Looking forward, we would like to extend this work to distributed-memory machines, specifically by implementing the FMM in the new distributed CnC (distCnC) to express fine-grained parallelism that would require significant effort in alternative models. Fast multipole method Performance analysis High performance computing Multi-body problem Algorithms Big data
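The abstract mentions choosing the optimal algorithmic tuning parameter. A common tuning parameter in FMM codes, assumed here, is the maximum number of particles per leaf box, q: larger leaves mean more direct near-field work per particle, smaller leaves mean more boxes and more far-field expansion work. The sketch below uses a deliberately simplified cost model with hypothetical constants a and b; it is not the thesis's model, which also accounts for intra- and inter-node communication costs.

```python
import math


def fmm_cost(n: int, q: int, a: float, b: float) -> float:
    """Toy FMM cost model (an assumption, not the thesis's model).

    n: number of particles; q: maximum particles per leaf box.
    Near field: each particle interacts directly with O(q) neighbours -> a * n * q.
    Far field: roughly n / q leaves, each paying a fixed expansion cost -> b * n / q.
    """
    return a * n * q + b * n / q


def optimal_leaf_size(a: float, b: float) -> float:
    """Minimiser of fmm_cost over q.

    d/dq (a*n*q + b*n/q) = a*n - b*n/q**2 = 0  =>  q* = sqrt(b / a),
    which is independent of n under this model.
    """
    return math.sqrt(b / a)


if __name__ == "__main__":
    a, b = 1.0, 400.0  # hypothetical per-interaction and per-leaf costs
    n = 10**6
    # Brute-force check of the closed form over integer leaf sizes.
    best_q = min(range(1, 201), key=lambda q: fmm_cost(n, q, a, b))
    print(optimal_leaf_size(a, b), best_q)
```

Under this toy model the best leaf size depends only on the ratio of far-field to near-field cost, so changes in machine characteristics (which shift a and b) move the optimum even when the problem size does not.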
172	Web-based front-end design and scientific computing for material stress simulation software Lin, Tien-Ju 12 January 2015 A precise simulation requires a large amount of input data, such as geometrical descriptions of the crystal structure, the external forces and loads, and quantitative properties of the material. Although some powerful applications already exist for research purposes, they are not widely used in education because of their complex structure and unintuitive operation. To serve a general user base, a front-end application for material simulation software is introduced. Its graphical interface provides a more efficient way to conduct simulations and to teach students who want to broaden their knowledge of the relevant fields. We first discuss how we explored solutions for the front-end application and how we developed it on top of the material simulation software developed by a mechanical engineering lab at Georgia Tech Lorraine. The user interface design, the functionality, and the overall user experience are the primary factors that determine the product's success or failure. This material simulation software helps researchers resolve the motion and interactions of a large ensemble of dislocations in single- or multi-layered 3D materials. However, the algorithm it uses is neither well optimized nor well parallelized, so its speedup does not scale as more CPUs in the cluster are used. This problem leads to the second topic, scientific computing: in this thesis we offer different approaches that attempt to improve the parallelization and the scalability. Front-end design Front-end development Scientific computing High performance computing Parallelization
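The abstract notes that the simulator's speedup stops scaling as more CPUs are added. One standard way to reason about this, not taken from the thesis, is Amdahl's law: if a fraction f of the single-processor run time parallelizes perfectly, the speedup on p processors is at most 1 / ((1 - f) + f / p). The sketch below evaluates that bound for a hypothetical f; the thesis does not report a parallel fraction for this software.

```python
def amdahl_speedup(parallel_fraction: float, p: int) -> float:
    """Amdahl's law: best-case speedup on p processors when only
    `parallel_fraction` of the serial run time can be parallelised."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if p < 1:
        raise ValueError("p must be at least 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / p)


if __name__ == "__main__":
    # Hypothetical 90%-parallel code: the bound saturates below 1 / 0.1 = 10.
    for p in (1, 8, 64, 512):
        print(p, round(amdahl_speedup(0.9, p), 2))
```

The serial fraction, not the processor count, sets the ceiling, which is why improving the parallelization of the algorithm itself matters more than adding CPUs.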
173	Multigigabit multimedia processor for 60GHz WPAN: a hardware software codesign implementation Dudebout, Nicolas 19 November 2008 The emergence of a multitude of bandwidth-hungry multimedia applications has exacerbated the need for multi-gigabit wireless solutions, a need beyond the reach of conventional WLAN technology (802.11a, b, and g). This thesis presents a system on chip that demonstrates the potential of 60GHz transceivers. The system is based on an FPGA board running a GNU/Linux kernel. This document gives insight into the design process as well as the finished product. Both the hardware and the software parts of the design are presented. This document is organized as follows. Chapter I presents an overview of the problem to be solved and the motivation for working at 60GHz. Chapter II gives a high-level view of the multimedia processor that has been designed and implemented. Chapters III and IV give more detail on the hardware parts and the software components of the project, respectively. Finally, Chapter V draws conclusions and presents the future of the work that has been started to enhance this multimedia processor. PowerPC Virtex2Pro FPGA Multimedia systems Computer architecture High performance computing
174	KernTune: self-tuning Linux kernel performance using support vector machines. Yi, Long. January 2006 Self-tuning has been an elusive goal for operating systems and is becoming a pressing issue for modern operating systems. Well-trained system administrators are able to tune an operating system to achieve better performance for a specific system class. Unfortunately, the system class can change when the running applications change. The model for a self-tuning operating system is based on a monitor-classify-adjust loop. The idea of this loop is to continuously monitor certain performance metrics and, whenever these change, to determine the new system class and dynamically adjust the tuning parameters for that class. This thesis describes KernTune, a prototype tool that identifies the system class and improves system performance automatically. A key aspect of KernTune is the notion of artificial-intelligence-oriented performance tuning. It uses a support vector machine to identify the system class and tunes the operating system for that specific class. This thesis presents design and implementation details for KernTune and shows how KernTune identifies a system class and tunes the operating system for improved performance. Linux Operating systems (Computers) High performance computing System analysis Data processing.
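The monitor-classify-adjust loop described above can be sketched in a few functions. This is a minimal, dependency-free illustration under stated assumptions, not KernTune's implementation: the two metrics (CPU utilization and I/O wait), the two system classes, the training samples, and the tuning profiles are all hypothetical, and a perceptron (a simpler linear classifier) stands in for KernTune's support vector machine so the example needs no machine-learning library.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical monitored metrics: (cpu_utilisation, io_wait), each in [0, 1].
Metrics = Tuple[float, float]

# Hypothetical tuning profiles; parameter names are illustrative, not real sysctls.
PROFILES: Dict[str, Dict[str, int]] = {
    "cpu_bound": {"timeslice_ms": 20, "readahead_kb": 128},
    "io_bound": {"timeslice_ms": 5, "readahead_kb": 1024},
}


def train_perceptron(xs: Sequence[Metrics], ys: Sequence[int],
                     max_epochs: int = 1000) -> List[float]:
    """Learn weights (w_cpu, w_io, bias) with the perceptron rule.

    Labels are +1 (cpu_bound) or -1 (io_bound); this stands in for the SVM.
    Training stops after the first epoch with no misclassified samples.
    """
    w = [0.0, 0.0, 0.0]
    for _ in range(max_epochs):
        mistakes = 0
        for (cpu, io), y in zip(xs, ys):
            x = (cpu, io, 1.0)  # constant 1.0 folds the bias into w
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:
            break
    return w


def classify(w: Sequence[float], m: Metrics) -> str:
    """Map one metrics sample to a system class."""
    score = w[0] * m[0] + w[1] * m[1] + w[2]
    return "cpu_bound" if score > 0 else "io_bound"


def tuning_loop(w: Sequence[float],
                samples: Sequence[Metrics]) -> List[Tuple[str, Dict[str, int]]]:
    """Monitor each sample, classify it, and adjust only when the class changes."""
    current: Optional[str] = None
    applied: List[Tuple[str, Dict[str, int]]] = []
    for m in samples:            # monitor
        cls = classify(w, m)     # classify
        if cls != current:       # adjust on a class change
            current = cls
            # A real tool would write kernel parameters here; we only record them.
            applied.append((cls, PROFILES[cls]))
    return applied


if __name__ == "__main__":
    train_x = [(0.90, 0.05), (0.85, 0.10), (0.95, 0.02),
               (0.20, 0.60), (0.30, 0.70), (0.25, 0.50)]
    train_y = [1, 1, 1, -1, -1, -1]
    w = train_perceptron(train_x, train_y)
    for cls, params in tuning_loop(w, [(0.90, 0.05), (0.95, 0.02),
                                       (0.20, 0.60), (0.85, 0.10)]):
        print(cls, params)
```

Adjusting only on a class change, rather than on every sample, keeps the loop from rewriting parameters while the workload is stable; a production tool would likely also require the new class to persist for several samples before switching.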
175	Indexing and partitioning schemes for distributed tensor computing with application to multiple sequence alignment Helal, Manal, Computer Science & Engineering, Faculty of Engineering, UNSW January 2009 This thesis investigates indexing and partitioning schemes for high-dimensional scientific computational problems. Building on the foundation offered by Mathematics of Arrays (MoA) for tensor-based computation, the ultimate contribution of the thesis is a unified partitioning scheme that works independently of the dataset's dimension and shape. Consequently, portability is ensured between different high-performance machines, cluster architectures, and potentially computational grids. The Multiple Sequence Alignment (MSA) problem in computational biology has an optimal dynamic-programming-based solution, but it becomes computationally infeasible as its dimensionality (the number of sequences) increases. Even sub-optimal approximations may be unmanageable for more than eight sequences. Furthermore, no existing MSA algorithm has been formulated in a manner invariant to the number of sequences. This thesis presents an optimal distributed MSA method based on MoA, which offers a set of constructs that represent multidimensional arrays in memory in a linear, concise, and efficient way. Using MoA allows the partitioning of the dynamic programming algorithm to be expressed independently of dimension. MSA is the highest-dimensional scientific problem considered for MoA-based partitioning to date. Two partitioning schemes are presented. The first is a master/slave approach based on both master/slave scheduling and slave/slave coupling. The second is a peer-to-peer design in which each process calculates the scheduling and dependency communication independently, with no need for a master scheduler. A search space reduction technique is introduced to cope with the exponential growth of the search space as the problem dimensionality increases. 
This technique relies on defining a hyper-diagonal through the tensor space and choosing a band of neighbouring partitions around the diagonal to score. In contrast, other sub-optimal methods in the literature consider only projections on the surface of the hyper-cube. The resulting massively parallel design produces a scalable solution that has been implemented on high-performance machines and cluster architectures. Experimental results for these implementations are presented for both simulated and real datasets, together with comparisons between this thesis's reduced-search-space technique and other sub-optimal methods for the MSA problem. Partitioning Tensor Computing High Performance Computing Dynamic Programming Algorithms Parallel Processing Bioinformatics
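Three ideas from the abstract (dimension-independent indexing, peer-to-peer dependency calculation, and a band of partitions around the hyper-diagonal) can each be sketched in a few lines that work unchanged for any number of sequences. This is a minimal illustration, not the thesis's implementation: the equal-partitions-per-dimension grid, the spread-based band criterion, the assumption that cell-level DP dependencies carry over to partitions, and all function names are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple


def linear_offset(index: Sequence[int], shape: Sequence[int]) -> int:
    """Row-major offset of `index` in a dense array of `shape`, for any dimension.

    This plays the role of MoA's row-major gamma (index-to-offset) function.
    """
    if len(index) != len(shape):
        raise ValueError("index and shape must have the same length")
    offset = 0
    for i, n in zip(index, shape):
        if not 0 <= i < n:
            raise IndexError(f"index component {i} outside extent {n}")
        offset = offset * n + i
    return offset


def predecessors(p: Tuple[int, ...]) -> List[Tuple[int, ...]]:
    """Partitions that must be scored before partition p in a DP wavefront.

    Assumes the alignment recurrence's dependencies (step back by 0 or 1 along
    each axis, not all zero) carry over from cells to partitions. Each process
    can evaluate this locally, with no master scheduler.
    """
    preds = []
    for step in product((0, 1), repeat=len(p)):
        if any(step):
            q = tuple(a - s for a, s in zip(p, step))
            if min(q) >= 0:
                preds.append(q)
    return preds


def diagonal_band(parts_per_dim: int, dims: int, width: int) -> List[Tuple[int, ...]]:
    """Partitions whose coordinate spread max(p) - min(p) is at most `width`.

    Spread 0 is the hyper-diagonal (p, p, ..., p); larger widths widen the band.
    """
    return [p for p in product(range(parts_per_dim), repeat=dims)
            if max(p) - min(p) <= width]


if __name__ == "__main__":
    print(linear_offset((1, 2, 3), (4, 5, 6)))
    print(len(predecessors((2, 2, 2))))
    # Band size versus the full partition grid for 5 sequences, 4 parts per axis.
    print(len(diagonal_band(4, 5, 1)), 4 ** 5)
```

None of the functions hard-codes the number of dimensions, which mirrors the abstract's point that the partitioning is expressed independently of the number of sequences; the band keeps a small fraction of the grid as dimensionality grows.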
176	Low-power high-performance register file design for chip multiprocessors Khasawneh, Shadi Turki. January 2006 Thesis (M.S.)--State University of New York at Binghamton, Department of Computer Science, Thomas J. Watson School of Engineering and Applied Science, 2006. / Includes bibliographical references.
177	Integrated compiler optimizations for tensor contractions Gao, Xiaoyang, January 2008 Thesis (Ph. D.)--Ohio State University, 2008. / Title from first page of PDF file. Includes bibliographical references (p. 140-144).
178	Parallel query processing on a cluster-based database system / Imasaki, Kenji, January 1900 Thesis (Ph. D.)--Carleton University, 2004. / Includes bibliographical references (p. 155-166). Also available in electronic format on the Internet.
179	Porting GCC to X32V architecture / Venkatachalapathy, Savithri H. January 1900 Thesis (M.S.)--Oregon State University, 2004. / Printout. Includes bibliographical references (leaf 54). Also available on the World Wide Web.
180	Large scale feature extraction and tracking Dhume, Pinakin. January 2007 Thesis (M.S.)--Rutgers University, 2007. / "Graduate Program in Electrical and Computer Engineering." Includes bibliographical references (p. 114-115).
