31

Estudo do desempenho de aplicações da mecânica dos sólidos em computação paralela / Study of the performance of solid mechanics applications in parallel computing

Pinho, Ronilson Rodrigues 06 October 2014 (has links)
The Boundary Element Method (BEM) is a computational method for solving systems of differential equations formulated in integral form. It is applied in fluid mechanics, acoustics, electromagnetics and the study of fractures. The BEM requires discretization only of the boundary of the problem geometry, not of its interior as a whole, which reduces the computational effort. To reduce that effort further, parallel computing offers an efficient form of information processing that emphasizes the exploitation of concurrent events during software execution. It has arisen mainly because of high computational performance requirements and the difficulty of increasing the speed of a single processor core. Although multiprocessor CPUs and multicore processors are easily found today, many algorithms are still not suited to run on parallel architectures. The present study continues research on parallelism by acting on a sequential program, written in Fortran 77 (VERA-TUDELA, 2003), that performs numerical analysis of specific 2D stress and strain problems in solid mechanics with the BEM, with the physical representation of a clamped bar under tension. The implementation of this application is intended to exploit the maximum available parallelism.
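The parallelization target here is a loop nest whose iterations are independent across boundary elements. The sketch below illustrates that pattern in C with OpenMP rather than the thesis's Fortran 77 code; the element count and the influence() routine are illustrative assumptions only.

```c
/* Hedged sketch: parallelizing the element-influence loop of a boundary
 * element solver with OpenMP. All names (N, influence, H) are
 * illustrative, not taken from the thesis's Fortran 77 program. */
#include <stdio.h>
#include <omp.h>

#define N 2048  /* number of boundary elements (illustrative) */

/* Toy "influence coefficient" between two boundary elements. */
static double influence(int i, int j) {
    double d = (double)(i - j);
    return 1.0 / (1.0 + d * d);
}

int main(void) {
    static double H[N][N];   /* influence matrix */
    double t0 = omp_get_wtime();

    /* Each row of the influence matrix is independent, so the outer loop
     * over collocation points can be distributed across cores. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            H[i][j] = influence(i, j);

    printf("assembled %dx%d matrix in %.3f s on %d threads\n",
           N, N, omp_get_wtime() - t0, omp_get_max_threads());
    return 0;
}
```

Compiled with, e.g., gcc -fopenmp, an embarrassingly parallel assembly step like this typically scales with the core count until memory bandwidth becomes the limit.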
32

Problematika přechodu od jednojádrové k vícejádrové implementaci operačního systému / Issue of Migrating from Single-Core to Multi-Core Implementation of Operating System

Skopal, Jakub January 2017 (has links)
This thesis deals with modifications to the hardware design and operating systems of the ZedBoard multi-core platform so that both ARM Cortex-A9 processor cores included in the Zynq-7000 SoC can be used. It analyses the general issues of a multi-core environment and the core functions of the kernel and the operating system, and describes the selected means of implementation, the ZedBoard and FreeRTOS. The implementation section demonstrates the specific steps needed to convert a single-core operating system into a multi-core one, as well as the steps required to run two different operating systems on the two processor cores. The last section summarizes the achieved results.
33

Static Execution Time Analysis of Parallel Systems

Gustavsson, Andreas January 2016 (has links)
The past trend of increasing processor throughput by increasing the clock frequency and the instruction-level parallelism is no longer feasible due to extensive power consumption and heat dissipation. Therefore, the current trend in computer hardware design is to expose explicit parallelism to the software level. This is most often done using multiple, relatively slow and simple, processing cores situated on a single processor chip. The cores usually share some resources on the chip, such as some level of cache memory (which means that they also share the interconnect, e.g., a bus, to that memory and also all higher levels of memory). To fully exploit this type of parallel processor chip, programs running on it will have to be concurrent. Since multi-core processors are the new standard, even embedded real-time systems will (and some already do) incorporate this kind of processor and concurrent code. A real-time system is any system whose correctness depends on both its functional and temporal behavior. For some real-time systems, a failure to meet the temporal requirements can have catastrophic consequences. Therefore, it is crucial that methods to derive safe estimations of the timing properties of parallel computer systems are developed, if at all possible. This thesis presents a method to derive safe (lower and upper) bounds on the execution time of a given parallel system, thus showing that such methods must exist. The interface to the method is a small concurrent programming language, based on communicating and synchronizing threads, that is formally (syntactically and semantically) defined in the thesis. The method is based on abstract execution, which is itself based on abstract interpretation techniques that have been commonly used within the field of timing analysis of single-core computer systems, to derive safe timing bounds in an efficient (although over-approximative) way. The thesis also proves the soundness of the presented method (i.e., that the estimated timing bounds are indeed safe) and evaluates a prototype implementation of it. / Worst-Case Execution Time Analysis of Parallel Systems / RALF3 - Software for Embedded High Performance Architectures
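The bound-derivation idea can be illustrated with a small interval calculus: every construct carries a safe [lower, upper] time bound, and composition rules preserve safety. The sketch below is a hedged illustration under assumed statement costs and composition rules; it is not the thesis's analysis or its concurrent language.

```c
/* Minimal sketch of safe timing-bound composition, in the spirit of
 * abstract execution: every construct maps to an interval [lo, hi] of
 * possible execution times, and composition rules keep the bounds safe
 * (never under-estimating hi, never over-estimating lo).
 * All costs and rules here are illustrative assumptions. */
#include <stdio.h>

typedef struct { long lo, hi; } Bound;   /* safe [lower, upper] time bound */

/* Sequential composition: times simply add up. */
static Bound seq(Bound a, Bound b) {
    return (Bound){ a.lo + b.lo, a.hi + b.hi };
}

/* Branch: either side may be taken, so take the union of the bounds. */
static Bound branch(Bound cond, Bound then_b, Bound else_b) {
    Bound body = {
        then_b.lo < else_b.lo ? then_b.lo : else_b.lo,
        then_b.hi > else_b.hi ? then_b.hi : else_b.hi
    };
    return seq(cond, body);
}

/* Loop with known iteration bounds [min_it, max_it]. */
static Bound loop(Bound body, long min_it, long max_it) {
    return (Bound){ body.lo * min_it, body.hi * max_it };
}

int main(void) {
    Bound assign = {1, 2};      /* assumed cost of an assignment */
    Bound guard  = {1, 1};      /* assumed cost of evaluating a guard */
    Bound load   = {1, 10};     /* e.g., cache hit vs. miss */

    /* if (guard) { load; assign; } else { assign; }  repeated 8..16 times */
    Bound then_b = seq(load, assign);
    Bound if_b   = branch(guard, then_b, assign);
    Bound prog   = loop(if_b, 8, 16);

    printf("safe execution-time bound: [%ld, %ld] cycles\n", prog.lo, prog.hi);
    return 0;
}
```

The over-approximation is visible in the branch rule: the derived bounds cover every path that could be taken, which is what makes them safe even when they are not tight.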
34

Parallelization of a software based intrusion detection system - Snort

Zhang, Huan January 2011 (has links)
Computer networks are already ubiquitous in people’s lives and work, and network security is becoming a critical concern. A simple firewall, which can only scan the bottom four OSI layers, cannot satisfy all security requirements. An intrusion detection system (IDS) with deep packet inspection, which can filter all seven OSI layers, is becoming necessary for more and more networks. However, the processing throughput of IDSs lags far behind current network speeds. People have begun to improve the performance of IDSs by implementing them on different hardware platforms, such as Field-Programmable Gate Arrays (FPGAs) or special-purpose network processors. Nevertheless, all of these options are either less flexible or more expensive to deploy. This research focuses on possibilities for implementing a parallelized IDS in a general computing environment based on Snort, currently the most popular open-source IDS. In this thesis, possible methods for parallelizing the pattern-matching engine on a multicore computer are analyzed. However, owing to the small granularity of network packets, the pattern-matching engine of Snort turns out to be unsuitable for parallelization. In addition, a pipelined structure of Snort has been implemented and analyzed. The universal packet capture API, LibPCAP, has been modified with a new feature that captures a packet directly into an external buffer. With this change, the performance of the pipelined Snort improves by up to 60% on an Intel i7 multicore computer for jumbo frames. A primary limitation is memory bandwidth; with higher bandwidth, the performance of the parallelization could be further improved.
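A minimal sketch of the pipelined structure, assuming a two-stage split between capture and inspection connected by a ring buffer, is shown below; the packet layout and the stand-in matching routine are illustrative assumptions, not Snort's or LibPCAP's actual interfaces.

```c
/* Hedged sketch of the pipelining idea: one thread captures packets into a
 * shared ring buffer, another thread runs the (here trivial) inspection
 * stage. Buffer size, packet struct and the fake capture/inspect routines
 * are illustrative assumptions. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define RING 256
#define NPKT 10000

typedef struct { int len; char data[64]; } Packet;

static Packet ring[RING];
static int head, tail, count, done;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

static void *capture_stage(void *arg) {           /* producer */
    (void)arg;
    for (int i = 0; i < NPKT; i++) {
        Packet p = { .len = 64 };
        snprintf(p.data, sizeof p.data, "packet-%d", i);
        pthread_mutex_lock(&m);
        while (count == RING) pthread_cond_wait(&not_full, &m);
        ring[head] = p; head = (head + 1) % RING; count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&m);
    }
    pthread_mutex_lock(&m); done = 1;
    pthread_cond_signal(&not_empty); pthread_mutex_unlock(&m);
    return NULL;
}

static void *inspect_stage(void *arg) {           /* consumer */
    long *alerts = arg;
    for (;;) {
        pthread_mutex_lock(&m);
        while (count == 0 && !done) pthread_cond_wait(&not_empty, &m);
        if (count == 0 && done) { pthread_mutex_unlock(&m); break; }
        Packet p = ring[tail]; tail = (tail + 1) % RING; count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&m);
        if (strstr(p.data, "666")) (*alerts)++;   /* stand-in for matching */
    }
    return NULL;
}

int main(void) {
    pthread_t cap, ins;
    long alerts = 0;
    pthread_create(&cap, NULL, capture_stage, NULL);
    pthread_create(&ins, NULL, inspect_stage, &alerts);
    pthread_join(cap, NULL);
    pthread_join(ins, NULL);
    printf("inspected %d packets, %ld alerts\n", NPKT, alerts);
    return 0;
}
```

The point of the split is that capture and inspection overlap in time; the shared buffer (and, in practice, memory bandwidth) then becomes the limiting resource, as the abstract observes.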
35

Automated Orchestra for Industrial Automation on Virtualized Multicore Environment / Extending Real-Time component-based Framework to Virtual Nodes : Demonstration: Automated Orchestra real-time Application

Mahmud, Nesredin January 2013 (has links)
Industrial control systems are applied in many areas, e.g., motion control for industrial robotics, process control of large plants such as in the oil and gas industry, and large national power grids. Over the last decade, with the advancement and adoption of virtualization and multicore technology (e.g., virtual machine monitors, cloud computing, server virtualization, application virtualization), IT systems and automation industries have benefited from lower investment, effective system management and high service availability. However, virtualization and multicore technologies pose a serious challenge to real-time systems, namely violating the timeliness and predictability of real-time applications running on control systems. To address this challenge, we have extended a real-time component-based framework with virtual nodes and evaluated the framework in the context of a virtualized multicore environment. The evaluation is demonstrated by modeling and implementing an orchestra application with QoS for CPU, memory and network bandwidth. The orchestra application is a real-time, distributed application deployed on virtualized multicore PCs connected to speakers. The result is an undistorted orchestra performance played through the speakers connected to the physical computer nodes. The contributions of the thesis are: 1) extending a real-time component-based framework, Future Automation Software Architecture (FASA), with virtual nodes using Virtual Computation Resources (VCR), and 2) the design and installation of a reusable test environment for the development, debugging and testing of real-time applications on a network of virtualized multicore nodes. / Vinnova project “AUTOSAR for Multi-Core in Automotive and Automation Industries“
36

Towards more scalable mutual exclusion for multicore architectures

Lozi, Jean-Pierre 16 July 2014 (has links) (PDF)
The scalability of multithreaded applications on current multicore systems is hampered by the performance of lock algorithms, due to the costs of access contention and cache misses. The main contribution presented in this thesis is a new lock algorithm, Remote Core Locking (RCL), that aims to improve the performance of critical sections in legacy applications on multicore architectures. The idea of RCL is to replace lock acquisitions by optimized remote procedure calls to a dedicated hardware thread, which is referred to as the server. RCL limits the performance collapse observed with other lock algorithms when many threads try to acquire a lock concurrently and removes the need to transfer lock-protected shared data to the hardware thread acquiring the lock because such data can typically remain in the server's cache. Other contributions presented in this thesis include a profiler that identifies the locks that are the bottlenecks in multithreaded applications and that can thus benefit from RCL, and a reengineering tool developed with Julia Lawall that transforms POSIX locks into RCL locks. Eighteen applications were used to evaluate RCL: the nine applications of the SPLASH-2 benchmark suite, the seven applications of the Phoenix 2 benchmark suite, Memcached, and Berkeley DB with a TPC-C client. Eight of these applications are unable to scale because of locks and benefit from RCL on an x86 machine with four AMD Opteron processors and 48 hardware threads. Using RCL locks, performance is improved by up to 2.5 times with respect to POSIX locks on Memcached, and up to 11.6 times with respect to Berkeley DB with the TPC-C client. On an SPARC machine with two Sun Ultrasparc T2+ processors and 128 hardware threads, three applications benefit from RCL. In particular, performance is improved by up to 1.3 times with respect to POSIX locks on Memcached, and up to 7.9 times with respect to Berkeley DB with the TPC-C client.
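A hedged sketch of the remote-execution idea described above, assuming one request slot per client thread and a busy-polling server, follows; the slot layout, polling strategy and all names are illustrative and do not reproduce the RCL runtime.

```c
/* Hedged sketch of the remote-core-locking idea: instead of acquiring a
 * lock, a client publishes a request (function + argument) in its slot and
 * spins until a dedicated server thread has executed it on its behalf. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NCLIENTS 4
#define REQS_PER_CLIENT 100000

typedef struct {
    _Atomic(void (*)(void *)) func;   /* NULL means "slot empty" */
    void *arg;
} Slot;

static Slot slots[NCLIENTS];
static atomic_int stop;
static long shared_counter;           /* protected: only the server touches it */

static void increment(void *arg) { shared_counter += *(long *)arg; }

/* Server: repeatedly scan the slots and execute pending critical sections. */
static void *server(void *unused) {
    (void)unused;
    while (!atomic_load(&stop)) {
        for (int i = 0; i < NCLIENTS; i++) {
            void (*f)(void *) = atomic_load(&slots[i].func);
            if (f) {
                f(slots[i].arg);                       /* run critical section */
                atomic_store(&slots[i].func, NULL);    /* signal completion */
            }
        }
    }
    return NULL;
}

/* Client: post a request in its slot and wait for the server to clear it. */
static void remote_execute(int id, void (*f)(void *), void *arg) {
    slots[id].arg = arg;
    atomic_store(&slots[id].func, f);
    while (atomic_load(&slots[id].func) != NULL) ;     /* spin */
}

static void *client(void *idp) {
    int id = *(int *)idp;
    long one = 1;
    for (int i = 0; i < REQS_PER_CLIENT; i++)
        remote_execute(id, increment, &one);
    return NULL;
}

int main(void) {
    pthread_t srv, cl[NCLIENTS];
    int ids[NCLIENTS];
    pthread_create(&srv, NULL, server, NULL);
    for (int i = 0; i < NCLIENTS; i++) {
        ids[i] = i;
        pthread_create(&cl[i], NULL, client, &ids[i]);
    }
    for (int i = 0; i < NCLIENTS; i++) pthread_join(cl[i], NULL);
    atomic_store(&stop, 1);
    pthread_join(srv, NULL);
    printf("counter = %ld (expected %d)\n",
           shared_counter, NCLIENTS * REQS_PER_CLIENT);
    return 0;
}
```

Because only the server thread ever touches the protected data, that data tends to stay in the server's cache, which is the locality argument made in the abstract.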
37

Algorithm Architecture Co-design for Dense and Sparse Matrix Computations

January 2018 (has links)
With the end of Dennard scaling and Moore's law, architects have moved towards heterogeneous designs consisting of specialized cores to achieve higher performance and energy efficiency for a target application domain. Applications of linear algebra are ubiquitous in scientific computing, machine learning, statistics, etc., with matrix computations being fundamental to these linear-algebra-based solutions. Designing multiple dense (or sparse) matrix computation routines on the same platform is quite challenging. Adding to the complexity, dense and sparse matrix computations differ greatly in their storage and access patterns and are difficult to optimize on the same architecture. This thesis addresses this challenge and introduces a reconfigurable accelerator that supports both dense and sparse matrix computations efficiently. The reconfigurable architecture has been optimized to execute the following linear algebra routines: GEMV (Dense General Matrix Vector Multiplication), GEMM (Dense General Matrix Matrix Multiplication), TRSM (Triangular Matrix Solver), LU Decomposition, Matrix Inverse, SpMV (Sparse Matrix Vector Multiplication), and SpMM (Sparse Matrix Matrix Multiplication). It is a multicore architecture where each core consists of a 2D array of processing elements (PEs). The 2D array of PEs is of size 4x4 and is scheduled to perform 4x4-sized matrix updates efficiently. A sequence of such updates is used to solve a larger problem inside a core. A novel partitioned block compressed sparse data structure (PBCSC/PBCSR) is used to perform sparse kernel updates. Scalable partitioning and mapping schemes are presented that map input matrices of any given size to the multicore architecture. Design trade-offs related to the PE array dimension, the size of local memory inside a core, and the bandwidth between on-chip memories and the cores are presented. An optimal core configuration is developed from this analysis. Synthesis results using a 7nm PDK show that the proposed accelerator can achieve a performance of up to 32 GOPS using a single core. / Dissertation/Thesis / Masters Thesis Computer Engineering 2018
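A hedged software analogue of the 4x4 scheduling is a blocked matrix multiply that advances through the problem as a sequence of 4x4 tile updates, loosely mirroring how a 4x4 PE array would be fed; the matrix size and tiling below are assumptions for illustration only.

```c
/* Hedged software analogue of scheduling a larger GEMM as a sequence of
 * 4x4 tile updates (C[4x4] += A[4x4] * B[4x4]). Matrix size and tiling
 * are illustrative, not the accelerator's actual dataflow. */
#include <stdio.h>

#define N 64           /* problem size, assumed to be a multiple of TILE */
#define TILE 4

static double A[N][N], B[N][N], C[N][N];

/* One tile update: C[bi..bi+3][bj..bj+3] += A[bi..][bk..] * B[bk..][bj..] */
static void tile_update(int bi, int bj, int bk) {
    for (int i = 0; i < TILE; i++)
        for (int j = 0; j < TILE; j++) {
            double acc = C[bi + i][bj + j];
            for (int k = 0; k < TILE; k++)
                acc += A[bi + i][bk + k] * B[bk + k][bj + j];
            C[bi + i][bj + j] = acc;
        }
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }

    /* Sequence of 4x4 updates covering the whole problem. */
    for (int bi = 0; bi < N; bi += TILE)
        for (int bj = 0; bj < N; bj += TILE)
            for (int bk = 0; bk < N; bk += TILE)
                tile_update(bi, bj, bk);

    printf("C[0][0] = %.1f (expected %.1f)\n", C[0][0], 2.0 * N);
    return 0;
}
```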
38

Accelerating Dense Linear Algebra for GPUs, Multicores and Hybrid Architectures: an Autotuned and Algorithmic Approach

Nath, Rajib Kumar 01 August 2010 (has links)
Dense linear algebra (DLA) is one of the seven most important kernels in high performance computing. The introduction of new machines from vendors provides opportunities to optimize DLA libraries for the new machines and thus exploit their power. Unfortunately, the optimization phase is not straightforward. The optimum code of a certain Basic Linear Algebra Subprograms (BLAS) kernel, which is the core of DLA algorithms, can differ between two machines with different semiconductor processes even if they share the same features in terms of instruction set architecture, memory hierarchy and clock speed. It has become a tradition to optimize BLAS for new machines, and vendors maintain highly optimized BLAS libraries targeting their CPUs. Unfortunately, the existing BLAS for GPUs is not highly optimized for DLA algorithms. In my research, I have provided new algorithms for several important BLAS kernels for different generations of GPUs and introduced a pointer-redirecting approach to make BLAS run faster for generic problem sizes. I have also presented an auto-tuning approach to parameterize the developed BLAS algorithms and select the best set of parameters for a given card. The hardware trends have also brought up the need for updates to existing legacy DLA software packages, such as the sequential LAPACK. To take advantage of the new computational environment, successors of LAPACK must incorporate algorithms with three main characteristics: high parallelism, reduced communication, and heterogeneity-awareness. On multicore architectures, Parallel Linear Algebra Software for Multicore Architectures (PLASMA) has been developed to meet the challenges of multicore. At the other extreme, the Matrix Algebra on GPU and Multicore Architectures (MAGMA) library demonstrated a hybridization approach that streamlined the development of high performance DLA for multicores with GPU accelerators. The performance of these two libraries depends upon the right choice of parameters for a given problem size and a given number of cores and/or GPUs. In this work, the issue of automatically tuning these two libraries is presented. A prune-based empirical auto-tuning method has been proposed for tuning PLASMA, and part of that tuning method was also considered for tuning the hybrid MAGMA library.
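The empirical auto-tuning idea reduces to timing a parameterized kernel over a pruned set of candidate parameters and keeping the fastest configuration for the target hardware. A hedged, generic sketch of that search loop is shown below; the kernel, the candidate list and the timing method are assumptions, not PLASMA's or MAGMA's actual tuners.

```c
/* Hedged sketch of empirical auto-tuning: time a blocked kernel for a few
 * candidate block sizes and keep the fastest. Kernel and candidates are
 * illustrative; real tuners prune the space and tune many more parameters. */
#include <stdio.h>
#include <time.h>

#define N 512
static double A[N][N], B[N][N], C[N][N];

/* Blocked matrix multiply parameterized by block size `nb`. */
static void blocked_gemm(int nb) {
    for (int ii = 0; ii < N; ii += nb)
        for (int kk = 0; kk < N; kk += nb)
            for (int jj = 0; jj < N; jj += nb)
                for (int i = ii; i < ii + nb && i < N; i++)
                    for (int k = kk; k < kk + nb && k < N; k++)
                        for (int j = jj; j < jj + nb && j < N; j++)
                            C[i][j] += A[i][k] * B[k][j];
}

static double time_once(int nb) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    blocked_gemm(nb);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void) {
    int candidates[] = {16, 32, 64, 128};       /* pruned search space */
    int best_nb = candidates[0];
    double best_t = 1e30;

    for (unsigned c = 0; c < sizeof candidates / sizeof *candidates; c++) {
        double t = time_once(candidates[c]);
        printf("nb = %3d  ->  %.3f s\n", candidates[c], t);
        if (t < best_t) { best_t = t; best_nb = candidates[c]; }
    }
    printf("selected block size: %d\n", best_nb);
    return 0;
}
```

The winning block size depends on the cache hierarchy of the machine the loop runs on, which is exactly why the selection has to be empirical rather than fixed at library-release time.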
39

A Benchmarking Platform For Network-On-Chip (NOC) Multiprocessor System-On-Chips

Malave-Bonet, Javier December 2010 (has links)
Network-on-Chip (NOC) based designs have garnered significant attention from both researchers and industry over the past several years. The analysis of these designs has focused on broad topics such as NOC component micro-architecture, fault-tolerant communication, and system memory architecture. Nonetheless, the design of low-latency, high-bandwidth, low-power and area-efficient NOCs is extremely complex due to the conflicting nature of these design objectives. Benchmarks are an indispensable tool in the design process, providing thorough measurement and fair comparison between designs in order to achieve optimal results (e.g., performance, cost, quality of service). This research proposes a benchmarking platform called NoCBench for evaluating the performance of Networks-on-Chip. Although previous research has proposed standard guidelines for developing Network-on-Chip benchmarks, this work moves forward and proposes a SystemC-based simulation platform for system-level design exploration. It provides an initial set of synthetic benchmarks for on-chip network interconnection validation along with an initial set of standardized processing cores, NOC components, and system-wide services. The benchmarks were constructed using synthetic applications described by Task Graphs For Free (TGFF) task graphs extracted from the E3S benchmark suite. Two benchmarks were used for characterization: Consumer and Networking. They are characterized based on throughput and latency. Case studies show how they can be used to evaluate metrics beyond throughput and latency (e.g., traffic distribution). The contribution of this work is two-fold: 1) this study provides a methodology for benchmark creation and characterization using NoCBench that evaluates important metrics in NOC design (e.g., end-to-end packet delay, throughput); 2) the developed full-system simulation platform provides a complete environment for further benchmark characterization of NOC-based MPSoCs as well as system-level design space exploration.
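Characterizing a benchmark by throughput and end-to-end latency amounts to recording an injection and an arrival timestamp for each packet and aggregating them; a hedged, simulator-agnostic sketch of that bookkeeping follows, with a purely synthetic delay model standing in for the NoC simulation (the packet record and cycle counts are assumptions, not NoCBench's data model).

```c
/* Hedged sketch of latency/throughput bookkeeping for a NoC benchmark run:
 * each packet records the cycle it was injected and the cycle it arrived;
 * average end-to-end latency and throughput fall out of those timestamps. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    long inject_cycle;   /* when the source put the packet on the network */
    long arrive_cycle;   /* when the sink received it */
} PacketRecord;

#define NPACKETS 10000

int main(void) {
    static PacketRecord pkt[NPACKETS];
    long last_arrival = 0;

    /* Stand-in for a simulation: inject one packet per cycle, with a
     * pseudo-random hop delay of 5..20 cycles. */
    srand(42);
    for (int i = 0; i < NPACKETS; i++) {
        pkt[i].inject_cycle = i;
        pkt[i].arrive_cycle = i + 5 + rand() % 16;
        if (pkt[i].arrive_cycle > last_arrival)
            last_arrival = pkt[i].arrive_cycle;
    }

    /* Aggregate the two headline metrics. */
    double total_latency = 0.0;
    for (int i = 0; i < NPACKETS; i++)
        total_latency += (double)(pkt[i].arrive_cycle - pkt[i].inject_cycle);

    printf("average end-to-end latency: %.2f cycles\n",
           total_latency / NPACKETS);
    printf("throughput: %.3f packets/cycle\n",
           (double)NPACKETS / (double)last_arrival);
    return 0;
}
```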
