Global ETD Search

21	Modeling and Analysis of Large-Scale On-Chip Interconnects Feng, Zhuo 2009 December 1900 (has links) As IC technologies scale to the nanometer regime, efficient and accurate modeling and analysis of VLSI systems with billions of transistors and interconnects becomes increasingly critical and difficult. VLSI systems impacted by the increasingly high dimensional process-voltage-temperature (PVT) variations demand much more modeling and analysis efforts than ever before, while the analysis of large scale on-chip interconnects that requires solving tens of millions of unknowns imposes great challenges in computer aided design areas. This dissertation presents new methodologies for addressing the above two important challenging issues for large scale on-chip interconnect modeling and analysis: In the past, the standard statistical circuit modeling techniques usually employ principal component analysis (PCA) and its variants to reduce the parameter dimensionality. Although widely adopted, these techniques can be very limited since parameter dimension reduction is achieved by merely considering the statistical distributions of the controlling parameters but neglecting the important correspondence between these parameters and the circuit performances (responses) under modeling. This dissertation presents a variety of performance-oriented parameter dimension reduction methods that can lead to more than one order of magnitude parameter reduction for a variety of VLSI circuit modeling and analysis problems. The sheer size of present day power/ground distribution networks makes their analysis and verification tasks extremely runtime and memory inefficient, and at the same time, limits the extent to which these networks can be optimized. Given today?s commodity graphics processing units (GPUs) that can deliver more than 500 GFlops (Flops: floating point operations per second). computing power and 100GB/s memory bandwidth, which are more than 10X greater than offered by modern day general-purpose quad-core microprocessors, it is very desirable to convert the impressive GPU computing power to usable design automation tools for VLSI verification. In this dissertation, for the first time, we show how to exploit recent massively parallel single-instruction multiple-thread (SIMT) based graphics processing unit (GPU) platforms to tackle power grid analysis with very promising performance. Our GPU based network analyzer is capable of solving tens of millions of power grid nodes in just a few seconds. Additionally, with the above GPU based simulation framework, more challenging three-dimensional full-chip thermal analysis can be solved in a much more efficient way than ever before.
22	Efficient betweenness Centrality Computations on Hybrid CPU-GPU Systems Mishra, Ashirbad January 2016 (has links) (PDF) Analysis of networks is quite interesting, because they can be interpreted for several purposes. Various features require different metrics to measure and interpret them. Measuring the relative importance of each vertex in a network is one of the most fundamental building blocks in network analysis. Between’s Centrality (BC) is one such metric that plays a key role in many real world applications. BC is an important graph analytics application for large-scale graphs. However it is one of the most computationally intensive kernels to execute, and measuring centrality in billion-scale graphs is quite challenging. While there are several existing e orts towards parallelizing BC algorithms on multi-core CPUs and many-core GPUs, in this work, we propose a novel ne-grained CPU-GPU hybrid algorithm that partitions a graph into two partitions, one each for CPU and GPU. Our method performs BC computations for the graph on both the CPU and GPU resources simultaneously, resulting in a very small number of CPU-GPU synchronizations, hence taking less time for communications. The BC algorithm consists of two phases, the forward phase and the backward phase. In the forward phase, we initially and the paths that are needed by either partitions, after which each partition is executed on each processor in an asynchronous manner. We initially compute border matrices for each partition which stores the relative distances between each pair of border vertex in a partition. The matrices are used in the forward phase calculations of all the sources. In this way, our hybrid BC algorithm leverages the multi-source property inherent in the BC problem. We present proof of correctness and the bounds for the number of iterations for each source. We also perform a novel hybrid and asynchronous backward phase, in which each partition communicates with the other only when there is a path that crosses the partition, hence it performs minimal CPU-GPU synchronizations. We use a variety of implementations for our work, like node-based and edge based parallelism, which includes data-driven and topology based techniques. In the implementation we show that our method also works using variable partitioning technique. The technique partitions the graph into unequal parts accounting for the processing power of each processor. Our implementations achieve almost equal percentage of utilization on both the processors due to the technique. For large scale graphs, the size of the border matrix also becomes large, hence to accommodate the matrix we present various techniques. The techniques use the properties inherent in the shortest path problem for reduction. We mention the drawbacks of performing shortest path computations on a large scale and also provide various solutions to it. Evaluations using a large number of graphs with different characteristics show that our hybrid approach without variable partitioning and border matrix reduction gives 67% improvement in performance, and 64-98.5% less CPU-GPU communications than the state of art hybrid algorithm based on the popular Bulk Synchronous Paradigm (BSP) approach implemented in TOTEM. This shows our algorithm's strength which reduces the need for larger synchronizations. Implementing variable partitioning, border matrix reduction and backward phase optimizations on our hybrid algorithm provides up to 10x speedup. We compare our optimized implementation, with CPU and GPU standalone codes based on our forward phase and backward phase kernels, and show around 2-8x speedup over the CPU-only code and can accommodate large graphs that cannot be accommodated in the GPU-only code. We also show that our method`s performance is competitive to the state of art multi-core CPU and performs 40-52% better than GPU implementations, on large graphs. We show the drawbacks of CPU and GPU only implementations and try to motivate the reader about the challenges that graph algorithms face in large scale computing, suggesting that a hybrid or distributed way of approaching the problem is a better way of overcoming the hurdles. Network Analysis Betweenness Centrality CPU-GPU Hybrid Systems Space Complexity Border Matrix Reduction Distributed Computing Graphics Processing Unit (GPU) Graph Partitioning Bulk Synchronous Paradigm (BSP) Central Processing Unit (CPU) Parallel Processing High Performance Computing Network Theory System Analysis Graph Theory Hybrid Betweenness Centrality Algorithm Computer Science
23	DSA Image Registration And Respiratory Motion Tracking Using Probabilistic Graphical Models Sundarapandian, Manivannan January 2016 (has links) (PDF) This thesis addresses three problems related to image registration, prediction and tracking, applied to Angiography and Oncology. For image analysis, various probabilistic models have been employed to characterize the image deformations, target motions and state estimations. (i) In Digital Subtraction Angiography (DSA), having a high quality visualization of the blood motion in the vessels is essential both in diagnostic and interventional applications. In order to reduce the inherent movement artifacts in DSA, non-rigid image registration is used before subtracting the mask from the contrast image. DSA image registration is a challenging problem, as it requires non-rigid matching across spatially non-uniform control points, at high speed. We model the problem of sub-pixel matching, as a labeling problem on a non-uniform Markov Random Field (MRF). We use quad-trees in a novel way to generate the non uniform grid structure and optimize the registration cost using graph-cuts technique. The MRF formulation produces a smooth displacement field which results in better artifact reduction than with the conventional approach of independently registering the control points. The above approach is further improved using two models. First, we introduce the concept of pivotal and non-pivotal control points. `Pivotal control points' are nodes in the Markov network that are close to the edges in the mask image, while 'non-pivotal control points' are identified in soft tissue regions. This model leads to a novel MRF framework and energy formulation. Next, we propose a Gaussian MRF model and solve the energy minimization problem for sub-pixel DSA registration using Random Walker (RW). An incremental registration approach is developed using quad-tree based MRF structure and RW, wherein the density of control points is hierarchically increased at each level M depending of the features to be used and the required accuracy. A novel numbering scheme of the control points allows us to reuse the computations done at level M in M + 1. Both the models result in an accelerated performance without compromising on the artifact reduction. We have also provided a CUDA based design of the algorithm, and shown performance acceleration on a GPU. We have tested the approach using 25 clinical data sets, and have presented the results of quantitative analysis and clinical assessment. (ii) In External Beam Radiation Therapy (EBRT), in order to monitor the intra fraction motion of thoracic and abdominal tumors, the lung diaphragm apex can be used as an internal marker. However, tracking the position of the apex from image based observations is a challenging problem, as it undergoes both position and shape variation. We propose a novel approach for tracking the ipsilateral hemidiaphragm apex (IHDA) position on CBCT projection images. We model the diaphragm state as a spatiotemporal MRF, and obtain the trace of the apex by solving an energy minimization problem through graph-cuts. We have tested the approach using 15 clinical data sets and found that this approach outperforms the conventional full search method in terms of accuracy. We have provided a GPU based heterogeneous implementation of the algorithm using CUDA to increase the viability of the approach for clinical use. (iii) In an adaptive radiotherapy system, irrespective of the methods used for target observations there is an inherent latency in the beam control as they involve mechanical movement and processing delays. Hence predicting the target position during `beam on target' is essential to increase the control precision. We propose a novel prediction model (called o set sine model) for the breathing pattern. We use IHDA positions (from CBCT images) as measurements and an Unscented Kalman Filter (UKF) for state estimation. The results based on 15 clinical datasets show that, o set sine model outperforms the state of the art LCM model in terms of prediction accuracy. Digital Subtraction Angiography (DSA) Image Registration Respiratory Motion Analysis Biomedical Image Processing Radiation Oncology Computerized Medical Imaging Martkov Random Field (MRF) Random Walker Image Registration (RWIR) Medical Imaging External Beam Radiation Therapy (EBRT) Image Guided Radiation Therapy Cone Beam Computed Tomography (CBCT) Graphics Processing Unit (GPU) Unscented Kalman Filter (UKF) Electrical Engineering
24	Distributed Support Vector Machine With Graphics Processing Units Zhang, Hang 06 August 2009 (has links) Training a Support Vector Machine (SVM) requires the solution of a very large quadratic programming (QP) optimization problem. Sequential Minimal Optimization (SMO) is a decomposition-based algorithm which breaks this large QP problem into a series of smallest possible QP problems. However, it still costs O(n2) computation time. In our SVM implementation, we can do training with huge data sets in a distributed manner (by breaking the dataset into chunks, then using Message Passing Interface (MPI) to distribute each chunk to a different machine and processing SVM training within each chunk). In addition, we moved the kernel calculation part in SVM classification to a graphics processing unit (GPU) which has zero scheduling overhead to create concurrent threads. In this thesis, we will take advantage of this GPU architecture to improve the classification performance of SVM.

Page generated in 0.0907 seconds