21 |
Acceleration of Hardware Testing and Validation Algorithms using Graphics Processing Units. Li, Min, 16 November 2012
With the advances of very large scale integration (VLSI) technology, the feature size has been shrinking steadily together with the increase in the design complexity of logic circuits. As a result, the effort required for designing, testing, and debugging digital systems has increased tremendously. Although electronic design automation (EDA) algorithms have been studied extensively to accelerate such processes, some computationally intensive applications still require long execution times. This is especially the case for testing and validation. In order to meet time-to-market constraints and to arrive at a bug-free design or product, the work presented in this dissertation studies the acceleration of EDA algorithms on Graphics Processing Units (GPUs). This dissertation concentrates on a subset of EDA algorithms related to testing and validation. In particular, within the area of testing, fault simulation, diagnostic simulation and reliability analysis are explored. We also investigate approaches to parallelizing state justification on GPUs, one of the most difficult problems in the validation area.
Firstly, we present an efficient parallel fault simulator, FSimGP2, which exploits the high degree of parallelism supported by a state-of-the-art graphics processing unit (GPU) with the NVIDIA Compute Unified Device Architecture (CUDA). A novel three-dimensional parallel fault simulation technique is proposed to achieve extremely high computational efficiency on the GPU. The experimental results demonstrate a speedup of up to 4× compared to another GPU-based fault simulator.
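For illustration, the pattern-parallel core of such a simulator can be sketched in CUDA as follows. This is a minimal sketch under simplifying assumptions (an AND-only netlist in levelized order, a single primary output, and stuck-at-0 faults on primary inputs); it is not FSimGP2's actual code. Each 32-bit word carries 32 test patterns, and each thread block simulates one injected fault:

    struct Gate { int in0, in1, out; };   // two-input AND netlist, for brevity

    __global__ void faultSim(const Gate* gates, int numGates,
                             const unsigned* goodValues, // fault-free value per net
                             unsigned* workspace,        // numFaults x numNets words
                             const int* faultNet,        // faulty primary input of fault f
                             unsigned* detected,         // one word per fault
                             int numNets, int outNet) {
        int f = blockIdx.x;                              // one fault per thread block
        unsigned* vals = workspace + (size_t)f * numNets;
        for (int n = threadIdx.x; n < numNets; n += blockDim.x)
            vals[n] = goodValues[n];                     // copy fault-free state
        __syncthreads();
        if (threadIdx.x == 0) {
            vals[faultNet[f]] = 0u;                      // inject stuck-at-0
            for (int g = 0; g < numGates; ++g)           // levelized order: inputs ready
                vals[gates[g].out] = vals[gates[g].in0] & vals[gates[g].in1];
            // Bit i of detected[f] is set iff pattern i propagates a difference
            // to the (assumed single) primary output net.
            detected[f] = vals[outNet] ^ goodValues[outNet];
        }
    }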
Then, another GPU-based simulator is used to tackle an even more computation-intensive task: diagnostic fault simulation. The simulator is based on a two-stage framework which achieves high computational efficiency on the GPU. We introduce a fault-pair-based approach to alleviate the limited memory capacity on GPUs. Also, multi-fault-signature and dynamic load-balancing techniques are introduced to make the best use of the on-board computing resources.
With continuous feature-size scaling and the advent of innovative nano-scale devices, the reliability analysis of digital systems is becoming ever more important. However, the computational cost of accurately analyzing a large digital system is very high. We propose a high-performance reliability analysis tool on GPUs. To achieve high memory bandwidth on GPUs, two algorithms for simulation scheduling and memory arrangement are proposed. Experimental results demonstrate that the parallel analysis tool is efficient, reliable and scalable.
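The two algorithms are not detailed in the abstract, but the guiding principle behind high memory bandwidth on GPUs is coalescing: arranging data so that consecutive threads read consecutive words. A hypothetical structure-of-arrays sketch, with a toy signal-probability model of an AND gate standing in for the reliability computation:

    // Array-of-structures: threads t and t+1 read words three slots apart
    // (uncoalesced). Structure-of-arrays: threads t and t+1 read adjacent
    // words (coalesced), the usual arrangement for bandwidth-bound kernels.
    struct GateSoA {
        int* in0;   // in0[g] for all gates, contiguous
        int* in1;
        int* out;
    };

    __global__ void evalLevel(GateSoA g, const float* p, float* q,
                              int first, int count) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one gate per thread
        if (i < count) {
            int gate = first + i;
            // Signal probability of an AND gate with independent inputs,
            // a toy stand-in for the dissertation's reliability model.
            q[g.out[gate]] = p[g.in0[gate]] * p[g.in1[gate]];
        }
    }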
In the area of design validation, we investigate state justification. By employing swarm intelligence and the power of parallelism on GPUs, we are able to efficiently find a trace that helps reach corner cases during the validation of a digital system.
In summary, the work presented in this dissertation demonstrates that several applications in the area of digital design testing and validation can be successfully re-architected to achieve maximal performance on GPUs and obtain significant speedups. The proposed algorithms based on GPU parallelism collectively aim to improve the performance of EDA tools of the computer-aided design (CAD) community on GPUs and other many-core platforms. / Ph. D.
|
22 |
A parallel geometric multigrid method for finite elements on octree meshes applied to elastic image registration. Sampath, Rahul Srinivasan, 24 June 2009
The first component of this work is a parallel algorithm for constructing non-uniform octree meshes for finite element computations. Prior to octree meshing, the linear octree data structure must be constructed and a constraint known as "2:1 balancing" must be enforced; parallel algorithms for these two subproblems are also presented. The second component of this work is a parallel matrix-free geometric multigrid algorithm for solving elliptic partial differential equations (PDEs) using these octree meshes. The last component of this work is a parallel multiscale Gauss-Newton optimization algorithm for solving the elastic image registration problem. The registration problem is discretized using finite elements on octree meshes, and the parallel geometric multigrid algorithm is used as a preconditioner in the Conjugate Gradient (CG) algorithm to solve the linear system of equations formed in each Gauss-Newton iteration.
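Linear octrees of this kind are conventionally stored as a sorted array of space-filling-curve keys rather than as a pointer-based tree. A common choice is the Morton key, obtained by interleaving the bits of an octant's anchor coordinates; the sketch below shows the idea, though not necessarily Dendro's exact key layout:

    #include <cstdint>

    // Morton key for an octant anchored at (x, y, z) at a given depth:
    // interleave one bit from each coordinate per level. Sorting these keys
    // yields the linearized, pointer-free octree ordering on which the
    // construction and 2:1-balancing algorithms operate.
    uint64_t mortonKey(uint32_t x, uint32_t y, uint32_t z, int depth) {
        uint64_t key = 0;
        for (int b = depth - 1; b >= 0; --b) {
            key = (key << 3)
                | ((uint64_t)((x >> b) & 1u) << 2)
                | ((uint64_t)((y >> b) & 1u) << 1)
                |  (uint64_t)((z >> b) & 1u);
        }
        return key;
    }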
Several ideas were used to reduce the overhead for constructing the octree meshes. These include (a) a way to lower communication costs by reducing the number of synchronizations and reducing the communication message size, (b) a way to reduce the number of searches required to build element-to-vertex mappings, and (c) a compression scheme to reduce the memory footprint of the entire data structure. To our knowledge, the multigrid algorithm presented in this work is the only matrix-free multiplicative geometric multigrid implementation for solving finite element equations on octree meshes using thousands of processors. The proposed registration algorithm is also unique; it is a combination of many different ideas: adaptivity, parallelism, fast optimization algorithms, and fast linear solvers.
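For readers unfamiliar with the method, a multiplicative geometric multigrid iteration has the recursive V-cycle shape sketched below; in the matrix-free setting, every operator application loops over octree elements instead of touching an assembled matrix. All function names here are placeholders, not Dendro's API:

    #include <vector>

    // Placeholder element-by-element operations; none of them assembles a matrix.
    void coarseSolve(std::vector<double>& u, const std::vector<double>& f);
    void smooth(int level, std::vector<double>& u, const std::vector<double>& f);
    std::vector<double> residual(int level, const std::vector<double>& u,
                                 const std::vector<double>& f);   // f - A(u)
    std::vector<double> restrictToCoarse(int level, const std::vector<double>& r);
    void addProlongated(int level, std::vector<double>& u,
                        const std::vector<double>& ec);           // u += P * ec

    // One multiplicative (V-cycle) multigrid iteration over the octree hierarchy.
    void vCycle(int level, std::vector<double>& u, const std::vector<double>& f) {
        if (level == 0) { coarseSolve(u, f); return; }   // exact solve at the bottom
        smooth(level, u, f);                             // pre-smoothing
        std::vector<double> r  = residual(level, u, f);  // matrix-free residual
        std::vector<double> rc = restrictToCoarse(level, r);
        std::vector<double> ec(rc.size(), 0.0);
        vCycle(level - 1, ec, rc);                       // coarse-grid correction
        addProlongated(level, u, ec);
        smooth(level, u, f);                             // post-smoothing
    }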
All the algorithms were implemented in C++ using the Message Passing Interface (MPI) standard and were built on top of the PETSc library from Argonne National Laboratory. The multigrid implementation has been released as open-source software: Dendro. Several numerical experiments were performed to test the performance of the algorithms on a variety of NSF TeraGrid platforms. Our largest run was a highly nonuniform, 8-billion-unknown elasticity calculation on 32,000 processors.
|
23 |
Some Domain Decomposition and Convex Optimization Algorithms with Applications to Inverse Problems. Chen, Jixin, 15 June 2018
Domain decomposition and convex optimization play fundamental roles in current computation and analysis in many areas of science and engineering. These methods have been well developed and studied over the past thirty years, but they still require further study and improvement, not only in mathematics but also in actual engineering computation, as computational complexity and scale increase exponentially. The main goal of this thesis is to develop efficient and powerful algorithms based on domain decomposition methods and convex optimization. The topics studied in this thesis mainly include two classes of convex optimization problems: optimal control problems governed by time-dependent partial differential equations, and general structured convex optimization problems. These problems have a wide range of engineering applications and demand very high computational effort. The main contributions are as follows. In Chapter 2, the relevance of an adequate inner-loop starting point (as opposed to a sufficient inner-loop stopping rule) is discussed in the context of a numerical optimization algorithm consisting of nested primal-dual proximal-gradient iterations. To study the optimal control problem, in Chapter 3 we obtain second-order domain decomposition methods by combining the Crank-Nicolson scheme with an implicit Galerkin method in the sub-domains and explicit flux approximation along inner boundaries. Parallelism is easily achieved for these explicit/implicit methods, and the time-step constraints are proved to be less severe than those of the fully explicit Galerkin finite element method. Based on the domain decomposition method of Chapter 3, in Chapter 4 we propose an iterative algorithm to solve an optimal control problem associated with the corresponding partial differential equation with a pointwise constraint on the control variable. In Chapter 5, overlapping domain decomposition methods are designed for the wave equation based on a "prediction-correction" strategy. A family of partition-of-unity functions allows a reasonable distribution of residuals and corrections, and no iteration is needed within each time step. This dissertation also provides a mathematical convergence analysis for each algorithm presented. The main discretization strategy adopted is the finite element method. Moreover, numerical results are provided in each chapter to verify the theory. / Doctorat en Sciences
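As a concrete anchor for the Chapter 2 discussion, a proximal-gradient inner loop for min f(x) + lambda*||x||_1, with f smooth, repeats x <- prox(x - t*grad f(x)); the inner-loop starting-point question is whether each inner solve is warm-started from the previous outer iterate. A minimal sketch with the l1 proximal map (an illustration, not the thesis's actual problem):

    #include <vector>
    #include <cmath>

    // Proximal map of g(x) = lambda*||x||_1: componentwise soft-thresholding.
    static double softThreshold(double v, double t) {
        return (v > t) ? v - t : (v < -t) ? v + t : 0.0;
    }

    // Proximal-gradient iterations for min f(x) + lambda*||x||_1, where gradF
    // evaluates the gradient of the smooth part. Passing the previous solution
    // as x ("warm start") is the inner-loop starting point discussed above.
    template <class GradF>
    std::vector<double> proxGrad(GradF gradF, std::vector<double> x,
                                 double step, double lambda, int iters) {
        for (int k = 0; k < iters; ++k) {
            std::vector<double> g = gradF(x);
            for (size_t i = 0; i < x.size(); ++i)
                x[i] = softThreshold(x[i] - step * g[i], step * lambda);
        }
        return x;
    }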
|
24 |
Méthodes numériques pour la résolution accélérée des systèmes linéaires de grandes tailles sur architectures hybrides massivement parallèles / Numerical methods for the accelerated resolution of large-scale linear systems on massively parallel hybrid architectures. Cheik Ahamed, Abal-Kassim, 07 July 2015
Advances in computational power have led to many developments in science and its applications. Solving linear systems occurs frequently in scientific computing, for example in the finite element discretization of partial differential equations, and the running time of the overall resolution is a direct result of the performance of the algebraic operations involved. This thesis develops innovative parallel algorithms for solving large sparse linear systems. We study and propose ways to compute linear algebra operations efficiently on heterogeneous multi-core-GPU platforms, in order to make solvers such as iterative methods more robust and to reduce the computing time of these systems. We propose new acceleration techniques based on the automatic distribution (auto-tuning) of threads on the GPU grid according to the characteristics of the problem, the capability of the graphics card, and the available resources. Numerical experiments performed on a large set of sparse matrices arising from diverse scientific and engineering problems have clearly shown the benefit of GPU technology for solving large sparse systems of linear equations, and its robustness and accuracy compared to existing libraries such as Cusp.
The main priority of a GPU program is the computation time needed to obtain the solution in a parallel environment, i.e., "How much time is needed to solve the problem?". In this thesis, we also address another question regarding energy issues, i.e., "How much energy is consumed by the application?". To answer this second question, an experimental protocol is established to accurately measure the energy consumption of a GPU for fundamental linear algebra operations. This methodology fosters a "new vision of high-performance computing" and answers some of the questions raised in green computing when GPUs are used.
The remainder of this thesis is devoted to synchronous and asynchronous iterative algorithms for solving linear systems in a heterogeneous multi-core-GPU context. We have implemented and analyzed these algorithms using iterative methods based on sub-structuring techniques. Mathematical models and convergence results of the synchronous and asynchronous algorithms are presented, and the convergence of the asynchronous sub-structuring methods is proved. We then analyze these methods in a hybrid multi-core-GPU context, which should pave the way toward exascale hybrid methods.
Lastly, we modify the non-overlapping Schwarz method to accelerate it using GPUs. The implementation is based on GPU acceleration of the local solution of the linear sub-systems associated with each sub-domain. To improve the performance of the Schwarz method, optimized interface conditions obtained by a stochastic technique based on the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) are used. Numerical results illustrate the good performance, robustness and accuracy of the synchronous and asynchronous algorithms for solving large sparse linear systems in a heterogeneous multi-core-GPU environment.
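To make the auto-tuning idea concrete, consider a standard CSR sparse matrix-vector kernel in which the number of threads cooperating on each row is the tuned parameter, chosen, for example, from the average number of nonzeros per row. This is a generic sketch (T must be a power of two up to the warp size), not the thesis's implementation:

    // CSR y = A*x with T threads cooperating on each row. T is the tuning knob:
    // T = 1 suits very sparse rows, T = 32 (a full warp) suits denser ones.
    template <int T>
    __global__ void spmvCsr(int rows, const int* rowPtr, const int* colIdx,
                            const double* val, const double* x, double* y) {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int row  = tid / T;                 // T consecutive threads share one row
        int lane = tid % T;
        double sum = 0.0;
        if (row < rows)
            for (int j = rowPtr[row] + lane; j < rowPtr[row + 1]; j += T)
                sum += val[j] * x[colIdx[j]];
        // Tree reduction of the T partial sums within each T-lane segment.
        for (int off = T / 2; off > 0; off /= 2)
            sum += __shfl_down_sync(0xffffffffu, sum, off, T);
        if (row < rows && lane == 0) y[row] = sum;
    }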
|
25 |
Fast Viterbi Decoder Algorithms for Multi-Core System. Ju, Zilong, January 2012
In this thesis, fast Viterbi decoder algorithms for a multi-core system are studied. New parallel Viterbi algorithms for decoding convolutional codes are proposed based on tail-biting trellises. The performance of the new algorithms is first evaluated in MATLAB and then in Eagle (E-UTRA algorithms for LTE) link-level simulations, where optimal parameter settings are obtained from various simulations. One of the algorithms is proposed for implementation in the product due to its good BLER performance and low implementation complexity. The new parallel algorithm is then implemented on target DSPs of an Ericsson internal multi-core system to decode the PUSCH (Physical Uplink Shared Channel) CQI (Channel Quality Indicator) in LTE (Long Term Evolution), and its performance in the real multi-core system is compared against the current implementation in terms of both cycle and memory consumption. As a fast decoder, the proposed parallel Viterbi decoder is computationally efficient: it significantly reduces decoding latency and solves memory-limitation problems on the DSP.
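The thesis's algorithms are not spelled out in this record, but the computational core of any Viterbi variant is the add-compare-select (ACS) recursion sketched below. In a tail-biting decoder all states start with equal metrics, and a parallel variant can run one such recursion per core over overlapping arcs of the circular trellis; the branch-metric function here is a placeholder:

    #include <cstdint>
    #include <vector>

    // One ACS step over all S = 2^(K-1) trellis states, with the state taken
    // to be the last K-1 input bits (newest bit in the LSB). branchMetric(p, b, t)
    // is a placeholder for the metric of the branch leaving state p with input
    // bit b at time t; larger metrics are assumed better.
    void acsStep(std::vector<double>& pm, std::vector<uint8_t>& decision,
                 int S, int t, double (*branchMetric)(int, int, int)) {
        std::vector<double> next(S);
        for (int s = 0; s < S; ++s) {
            int b  = s & 1;                 // newest input bit encoded in state s
            int p0 = s >> 1;                // predecessor whose oldest bit was 0
            int p1 = p0 | (S >> 1);         // predecessor whose oldest bit was 1
            double m0 = pm[p0] + branchMetric(p0, b, t);
            double m1 = pm[p1] + branchMetric(p1, b, t);
            next[s] = (m0 >= m1) ? m0 : m1;                   // keep the best path
            decision[(size_t)t * S + s] = (m0 >= m1) ? 0 : 1; // winner, for traceback
        }
        pm.swap(next);
    }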
|
26 |
Characterization and Enhancement of Data Locality and Load Balancing for Irregular Applications. Niu, Qingpeng, 14 May 2015
No description available.
|
27 |
安全多方計算平行演算法之實證研究 / An Empirical Study on the Parallel Implementation of Secure Multi-Party Computation. 王啟典, Wang, Chi-Tien, Unknown Date
Loosely speaking, secure multi-party computation (SMC) involves computing functions with inputs from two or more parties in a distributed network, while ensuring that no additional information, other than what can be inferred from each participant's input and output, is revealed to parties not privy to that information. This thesis concerns the parallel implementation of SMC using a scalar-product (SP) based approach, in which SP serves as the basic building block for constructing more complex SMC protocols. The thesis first develops a concurrent architecture for two-party scalar-product computation and then implements several secure comparison algorithms on top of it, both sequential and parallel. Finally, a series of experiments is conducted to identify a suitable parallel infrastructure and the main factors affecting runtime, and to build time functions that predict the execution time of a comparison computation from that of the scalar product and other parameters, such as the number of CPU cores. The experimental results show that these time functions are very accurate; we therefore argue that they can help users obtain the best runtime performance for comparison protocols in their specific execution environments.
|
28 |
線性三對角方程組之平行解法 / Parallel Algorithm for Linear Tridiagonal System Solver. 林伯勳, Lin, Frank, Unknown Date
This thesis presents parallel algorithms for solving linear tridiagonal systems on a hypercube network, and these parallel algorithms achieve optimal cost O(N). The methods discussed are (1) the cyclic reduction method and (2) the Gaussian elimination method. The parallel algorithm based on (1) runs in O(log N) time using O(N/log N) processors; the parallel algorithm based on (2) runs in O((log N)^2) time using O(N/(log N)^2) processors. Here cost is defined as the number of processors multiplied by the execution time.
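For illustration, one forward step of cyclic reduction eliminates the neighbours of every second equation, halving the system; log2(N) such steps leave a single equation, after which back-substitution recovers the solution. A sketch of that step as a CUDA kernel (a modern many-core stand-in for the hypercube mapping studied in the thesis):

    // One forward step of cyclic reduction on the tridiagonal system
    // a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i], 0 <= i < n.
    // Each thread eliminates the two neighbours of one surviving equation;
    // after the step the surviving equations form a system of half the size
    // coupled at distance 2*stride.
    __global__ void crForwardStep(double* a, double* b, double* c, double* d,
                                  int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x + 1) * 2 * stride - 1;
        if (i >= n) return;
        int lo = i - stride, hi = i + stride;
        double alpha = -a[i] / b[lo];                     // eliminate x[lo]
        double beta  = (hi < n) ? -c[i] / b[hi] : 0.0;    // eliminate x[hi]
        b[i] += alpha * c[lo] + ((hi < n) ? beta * a[hi] : 0.0);
        d[i] += alpha * d[lo] + ((hi < n) ? beta * d[hi] : 0.0);
        a[i]  = alpha * a[lo];
        c[i]  = (hi < n) ? beta * c[hi] : 0.0;
    }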
|
29 |
Hierarchical Data Structures for Pattern Recognition. Choudhury, Sabyasachy, 05 1900
Pattern recognition is an important area with potential applications in computer vision, speech understanding, knowledge engineering, bio-medical data classification, earth sciences, life sciences, economics, psychology, linguistics, etc. Clustering is an unsupervised classification process coming under the area of pattern recognition. There are two types of clustering approaches:
1) non-hierarchical methods and 2) hierarchical methods. Non-hierarchical algorithms are iterative in nature and perform well in the context of isotropic clusters; their time complexity is O(n) or above. Hierarchical agglomerative algorithms, on the other hand, are effective when clusters are non-isotropic. The single-linkage method in the hierarchical category produces a dendrogram that corresponds to the minimal spanning tree, but conventional approaches to constructing it are time-consuming, requiring O(n^2) computation time.
In this thesis we propose an intelligent partitioning scheme for generating the minimal spanning tree in the coordinate space. This is computationally elegant, as it avoids computing the similarity between many pairs of samples. The minimal spanning tree so generated can be used to produce C disjoint clusters by breaking the (C-1) longest edges in the tree, as sketched below.
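The cluster-extraction step is straightforward once the tree is available. A sketch (in C++ rather than the thesis's Pascal) that labels the C components remaining after the C-1 longest MST edges are cut:

    #include <algorithm>
    #include <numeric>
    #include <vector>

    struct Edge { int u, v; double w; };

    // Given the n-1 edges of a minimal spanning tree over n samples, produce C
    // clusters by deleting the C-1 longest edges; union-find over the kept
    // edges yields a component label for every sample.
    std::vector<int> mstClusters(int n, std::vector<Edge> mst, int C) {
        std::sort(mst.begin(), mst.end(),
                  [](const Edge& a, const Edge& b) { return a.w < b.w; });
        std::vector<int> parent(n);
        std::iota(parent.begin(), parent.end(), 0);
        auto find = [&](int x) {            // path-halving find
            while (parent[x] != x) x = parent[x] = parent[parent[x]];
            return x;
        };
        // Keep only the n-C shortest edges; the C-1 longest ones are the cuts.
        for (int e = 0; e < (int)mst.size() - (C - 1); ++e)
            parent[find(mst[e].u)] = find(mst[e].v);
        std::vector<int> label(n);
        for (int i = 0; i < n; ++i) label[i] = find(i);
        return label;
    }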
A systolic architecture has been proposed to increase the speed of the algorithm further. A simulation study has been conducted and the corresponding results are reported; the simulation package has been developed on a DEC-1090 in Pascal. Based on this study, it is observed that the parallel implementation reduces the time enormously. The number of processors required for the parallel implementation is a constant, making the approach more attractive.
Texture analysis and synthesis have been extensively studied in the context of computer vision. Two important approaches studied extensively by earlier researchers are the statistical and structural approaches. Texture is understood to be a periodic pattern with primitive sub-patterns repeating in a particular fashion, and this has been used to characterize texture with the help of a hierarchical data structure, the tree. A tree is a convenient data structure here because operations such as merging, splitting, deleting a node, and adding a node are useful for handling a periodic pattern. Various functions used to characterize texture, such as the angular second moment and correlation, have been translated into the new language of this hierarchical data structure.
|
30 |
Otimização por enxame de partículas em arquiteturas paralelas de alto desempenho / Particle swarm optimization in high-performance parallel architectures. Rogério de Moraes Calazan, 21 February 2013
Particle Swarm Optimization (PSO) is an optimization technique used to solve many problems in different application areas; however, most implementations are sequential. The optimization process requires a large number of evaluations of the objective function, especially in complex problems involving a large number of particles and dimensions. As a result, the algorithm may become inefficient in terms of performance, execution time and even the quality of the expected result. To overcome these difficulties, high-performance computing and parallel algorithms can be used, taking into account the characteristics of the architecture; this should increase performance, minimize response time and may even improve the quality of the final result. In this dissertation, the PSO algorithm is parallelized using three strategies that address different granularities of the problem, as well as the division of the optimization work among several cooperative sub-swarms. One of the developed parallel algorithms, namely PPSO, is also implemented directly in hardware, using an FPGA. All the proposed strategies, namely PPSO (Parallel PSO), PDPSO (Parallel Dimension PSO) and CPPSO (Cooperative Parallel PSO), are implemented on multiprocessor-, multicomputer- and GPU-based parallel architectures. The assessments show that the GPU achieved the best results for problems with a high number of particles and dimensions when a strategy with finer granularity is used, namely PDPSO or CPPSO; in contrast, when a strategy with coarser granularity is used, namely PPSO, the multicomputer-based implementation achieved the best results.
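For illustration, the finest-granularity mapping (one GPU thread per dimension of each particle, as in PDPSO) leads to an update kernel of roughly the following shape. This is a generic global-best PSO sketch, not the dissertation's code; the cuRAND states are assumed to be initialized elsewhere with curand_init, and fitness evaluation plus pbest/gbest bookkeeping would run in separate kernels:

    #include <curand_kernel.h>

    // One PSO update step: thread i = p*dims + d updates dimension d of
    // particle p using the classic velocity rule
    //   v = w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x);  x += v;
    __global__ void psoUpdate(float* x, float* v, const float* pbest,
                              const float* gbest, curandState* rng,
                              int particles, int dims,
                              float w, float c1, float c2) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= particles * dims) return;
        int d = i % dims;                                // dimension index
        float r1 = curand_uniform(&rng[i]);
        float r2 = curand_uniform(&rng[i]);
        v[i] = w * v[i] + c1 * r1 * (pbest[i] - x[i])    // pull toward own best
             + c2 * r2 * (gbest[d] - x[i]);              // pull toward swarm best
        x[i] += v[i];
    }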
|