Global ETD Search

41	Evaluations of the parallel extensions in .NET 4.0 Islam, Md. Rashedul, Islam, Md. Rofiqul, Mazumder, Tahidul Arafhin January 2011 (has links) Parallel programming or making parallel application is a great challenging part of computing research. The main goal of parallel programming research is to improve performance of computer applications. A well-structured parallel application can achieve better performance in terms of execution speed over sequential execution on existing and upcoming parallel computer architecture. This thesis named "Evaluations of the parallel extensions in .NET 4.0" describes the experimental evaluation of different parallel application performance with thread-safe data structure and parallel constructions in .NET Framework 4.0. Described different performance issues of this thesis help to make efficient parallel application for better performance. Before describing the experimental evaluation, this thesis describes some methodologies relevant to parallel programming like Parallel computer architecture, Memory architectures, Parallel programming models, decomposition, threading etc. It describes the different APIs in .NET Framework 4.0 and the way of coding for making an efficient parallel application in different situations. It also presents some implementations of different parallel constructs or APIs like Static Multithreading, Using ThreadPool, Task, Parallel.For, Parallel.ForEach, PLINQ etc. The evaluation of parallel application has been done by experimental result evaluation and performance measurements. In most of the cases, the result evaluation shows better performance of parallelism like less execution time and increase CPU uses over traditional sequential execution. In addition parallel loop doesn’t show better performance in case of improper partitioning, oversubscription, improper workloads etc. The discussion about proper partitioning, oversubscription and proper work load balancing will help to make more efficient parallel application. / Program: Magisterutbildning i informatik parallel application decomposition multithreading parallel frameworks data parallelism task parallelism parallel performance Engineering and Technology Teknik och teknologier
42	Adaptive degree of parallelism for the spar runtime Vogel, Adriano Jos? 28 March 2018 (has links) Submitted by PPG Ci?ncia da Computa??o (ppgcc@pucrs.br) on 2018-08-22T17:39:55Z No. of bitstreams: 1 ADRIANO_JOS?_VOGEL_DIS.pdf: 2344216 bytes, checksum: e0dd6229d4a98c039e793851290beab0 (MD5) / Approved for entry into archive by Sheila Dias (sheila.dias@pucrs.br) on 2018-08-24T14:50:33Z (GMT) No. of bitstreams: 1 ADRIANO_JOS?_VOGEL_DIS.pdf: 2344216 bytes, checksum: e0dd6229d4a98c039e793851290beab0 (MD5) / Made available in DSpace on 2018-08-24T17:31:58Z (GMT). No. of bitstreams: 1 ADRIANO_JOS?_VOGEL_DIS.pdf: 2344216 bytes, checksum: e0dd6229d4a98c039e793851290beab0 (MD5) Previous issue date: 2018-03-28 / As aplica??es de stream se tornaram cargas de trabalho representativas nos sistemas computacionais. Diversos dom?nios de aplica??es representam stream, como v?deo, ?udio, processamento de gr?ficos e imagens. Uma parcela significativa dessas aplica??es demanda paralelismo para aumentar o desempenho. Por?m, programadores frequentemente enfrentam desafios entre produtividade de c?digo e desempenho devido ?s complexidades decorrentes do paralelismo. Buscando facilitar a paraleliza??o, a DSL SPar foi criada usando atributos (stage, input, output, and replicate) do C++-11 para representar paralelismo. O compilador reconhece os atributos da SPar e gera c?digo paralelo automaticamente. Um desafio relevante que ocorre em est?gios paralelos da SPar ? a demanda pela defini??o manual do grau de paralelismo, o que consome tempo e pode induzir a erros. O grau de paralelismo ? definido atrav?s do n?mero de r?plicas em est?gios paralelos. Por?m, a execu??o de diversas aplica??es pode ser pouco eficiente se executadas com um n?mero de r?plicas inadequado ou usando um n?mero est?tico que ignora a natureza din?mica de algumas aplica??es. Para resolver esse problema, ? introduzido o conceito de um n?mero de r?plicas transparente e adaptativo para a SPar. Al?m disso, os mecanismos implementados e as regras de transforma??o s?o descritos para possibilitar a gera??o de c?digo paralelo na SPar com um n?mero adaptativo de r?plicas. Os mecanismos adaptativos foram avaliados demonstrando a sua efic?cia. Ainda, aplica??es reais foram usadas para demonstrar que os mecanismos adaptativos propostos podem oferecer abstra??es de alto n?vel sem significativas perdas de desempenho. / In recent years, stream processing applications have become a traditional workload in computing systems. They are traditionally found in video, audio, graphic and image processing. Many of these applications demand parallelism to increase performance. However, programmers must often face the trade-off between coding productivity and performance that introducing parallelism creates. SPar Domain-Specific Language (DSL) was created to achieve the optimal balance for programmers, with the C++-11 attribute annotation mechanism to ensure that essential properties of stream parallelism could be represented (stage, input, output, and replicate). The compiler recognizes the SPar attributes and generates parallel code automatically. The need to manually define parallelism is one crucial challenge for increasing SPAR?s abstraction level, because it is time consuming and error prone. Also, executing several applications can fail to be efficient when running a non suitable number of replicas. This occurs when the defined number of replicas in a parallel region is not optimal or when a static number is used, which ignores the dynamic nature of stream processing applications. In order to solve this problem, we introduced the concept of the abstracted and adaptive number of replicas for SPar. Moreover, we described our implemented strategy as well as transformation rules that enable SPar to generate parallel code with the adaptive degree of parallelism support. We experimentally evaluated the implemented adaptive strategies regarding their effectiveness. Thus, we used real-world applications to demonstrate that our adaptive strategy implementations can provide higher abstraction levels without significant performance degradation. Stream Parallelism Paralelismo de Stream Paralelismo Adaptativo e Abstrato
43	Unstructured Computations on Emerging Architectures Al Farhan, Mohammed 05 May 2019 (has links) This dissertation describes detailed performance engineering and optimization of an unstructured computational aerodynamics software system with irregular memory accesses on various multi- and many-core emerging high performance computing scalable architectures, which are expected to be the building blocks of energy-austere exascale systems, and on which algorithmic- and architecture-oriented optimizations are essential for achieving worthy performance. We investigate several state-of-the-practice shared-memory optimization techniques applied to key kernels for the important problem class of unstructured meshes. We illustrate for a broad spectrum of emerging microprocessor architectures as representatives of the compute units in contemporary leading supercomputers, identifying and addressing performance challenges without compromising the floating-point numerics of the original code. While the linear algebraic kernels are bottlenecked by memory bandwidth for even modest numbers of hardware cores sharing a common address space, the edge-based loop kernels, which arise in the control volume discretization of the conservation law residuals and in the formation of the preconditioner for the Jacobian by finite-differencing the conservation law residuals, are compute-intensive and effectively exploit contemporary multi- and many-core processing hardware. We therefore employ low- and high-level algorithmic- and architecture-specific code optimizations and tuning in light of thread- and data-level parallelism, with a focus on strong thread scaling at the node-level. Our approaches are based upon novel multi-level hierarchical workload distribution mechanisms of data across different compute units (from the address space down to the registers) within every hardware core. We analyze the demonstrated aerodynamics application on specific computing architectures to develop certain performance metrics and models to bespeak the upper and lower bounds of the performance. We present significant full application speedup relative to the baseline code, on a succession of many-core processor architectures, i.e., Intel Xeon Phi Knights Corner (5.0x) and Knights Landing (2.9x). In addition, the performance of Knights Landing outperforms, at significantly lower power consumption, Intel Xeon Skylake with nearly twofold speedup. These optimizations are expected to be of value for many other unstructured mesh partial differential equation-based scientific applications as multi- and many- core architecture evolves. Performance Optimizations Thread-level parallelism Data-level parallelism Unstructured Grids Computational Aerodynamics Intel Xeon Phi
44	Policy-Aware Parallel Execution of Composite Services / 複合サービスのポリシーアウェアな並列実行 Mai, Xuan Trang 23 March 2016 (has links) 京都大学 / 0048 / 新制・課程博士 / 博士(情報学) / 甲第19855号 / 情博第606号 / 新制\|\|情\|\|105(附属図書館) / 32891 / 京都大学大学院情報学研究科社会情報学専攻 / (主査)教授石田亨, 教授吉川正俊, 教授岡部寿男 / 学位規則第4条第1項該当 / Doctor of Informatics / Kyoto University / DFAM Service Composition Parallel Execution Policies Data Parallelism Degree of Parallelism Performance prediction 007
45	Performance and energy efficiency via an adaptive MorphCore architecture Khubaib 09 July 2014 (has links) The level of Thread-Level Parallelism (TLP), Instruction-Level Parallelism (ILP), and Memory-Level Parallelism (MLP) varies across programs and across program phases. Hence, every program requires different underlying core microarchitecture resources for high performance and/or energy efficiency. Current core microarchitectures are inefficient because they are fixed at design time and do not adapt to variable TLP, ILP, or MLP. I show that if a core microarchitecture can adapt to the variation in TLP, ILP, and MLP, significantly higher performance and/or energy efficiency can be achieved. I propose MorphCore, a low-overhead adaptive microarchitecture built from a traditional OOO core with small changes. MorphCore adapts to TLP by operating in two modes: (a) as a wide-width large-OOO-window core when TLP is low and ILP is high, and (b) as a high-performance low-energy highly-threaded in-order SMT core when TLP is high. MorphCore adapts to ILP and MLP by varying the superscalar width and the out-of-order (OOO) window size by operating in four modes: (1) as a wide-width large-OOO-window core, 2) as a wide-width medium-OOO-window core, 3) as a medium-width large-OOO-window core, and 4) as a medium-width medium-OOO-window core. My evaluation with single-thread and multi-thread benchmarks shows that when highest single-thread performance is desired, MorphCore achieves performance similar to a traditional out-of-order core. When energy efficiency is desired on single-thread programs, MorphCore reduces energy by up to 15% (on average 8%) over an out-of-order core. When high multi-thread performance is desired, MorphCore increases performance by 21% and reduces energy consumption by 20% over an out-of-order core. Thus, for multi-thread programs, MorphCore's energy efficiency is similar to highly-threaded throughput-optimized small and medium core architectures, and its performance is two-thirds of their potential. / text Computer architecture Microprocessor Thread-level parallelism Instruction-level parallelism Memory-level parallelism Adaptive microprocessor Energy efficiency High performance Power efficiency Chip-multiprocessor Heterogeneous computing
46	Co-designing Communication Middleware and Deep Learning Frameworks for High-Performance DNN Training on HPC Systems Awan, Ammar Ahmad 10 September 2020 (has links) No description available. Computer Science Artificial Intelligence Data Parallelism Model Parallelism Hybrid Parallelism Keras Caffe TensorFlow PyTorch, MPI, Eager Execution Deep Learning Scalable DNN Training MVAPICH2-GDR CUDA-Aware MPI
47	Machine learning based mapping of data and streaming parallelism to multi-cores Wang, Zheng January 2011 (has links) Multi-core processors are now ubiquitous and are widely seen as the most viable means of delivering performance with increasing transistor densities. However, this potential can only be realised if the application programs are suitably parallel. Applications can either be written in parallel from scratch or converted from existing sequential programs. Regardless of how applications are parallelised, the code must be efficiently mapped onto the underlying platform to fully exploit the hardware’s potential. This thesis addresses the problem of finding the best mappings of data and streaming parallelism—two types of parallelism that exist in broad and important domains such as scientific, signal processing and media applications. Despite significant progress having been made over the past few decades, state-of-the-art mapping approaches still largely rely upon hand-crafted, architecture-specific heuristics. Developing a heuristic by hand, however, often requiresmonths of development time. Asmulticore designs become increasingly diverse and complex, manually tuning a heuristic for a wide range of architectures is no longer feasible. What are needed are innovative techniques that can automatically scale with advances in multi-core technologies. In this thesis two distinct areas of computer science, namely parallel compiler design and machine learning, are brought together to develop new compiler-based mapping techniques. Using machine learning, it is possible to automatically build highquality mapping schemes, which adapt to evolving architectures, with little human involvement. First, two techniques are proposed to find the best mapping of data parallelism. The first technique predicts whether parallel execution of a data parallel candidate is profitable on the underlying architecture. On a typical multi-core platform, it achieves almost the same (and sometimes a better) level of performance when compared to the manually parallelised code developed by independent experts. For a profitable candidate, the second technique predicts how many threads should be used to execute the candidate across different program inputs. The second technique achieves, on average, over 96% of the maximum available performance on two different multi-core platforms. Next, a new approach is developed for partitioning stream applications. This approach predicts the ideal partitioning structure for a given stream application. Based on the prediction, a compiler can rapidly search the program space (without executing any code) to generate a good partition. It achieves, on average, a 1.90x speedup over the already tuned partitioning scheme of a state-of-the-art streaming compiler. 005.3
48	On the simulation and design of manycore CMPs Thompson, Christopher Callum January 2015 (has links) The progression of Moore’s Law has resulted in both embedded and performance computing systems which use an ever increasing number of processing cores integrated in a single chip. Commercial systems are now available which provide hundreds of cores, and academics have proposed architectures for up to 1024 cores. Embedded multicores are increasingly popular as it is easier to guarantee hard-realtime constraints using individual cores dedicated for tasks, than to use traditional time-multiplexed processing. However, finding the optimal hardware configuration to meet these requirements at minimum cost requires extensive trial and error approaches to investigate the design space. This thesis tackles the problems encountered in the design of these large scale multicore systems by first addressing the problem of fast, detailed micro-architectural simulation. Initially addressing embedded systems, this work exploits the lack of hardware cache-coherence support in many deeply embedded systems to increase the available parallelism in the simulation. Then, through partitioning the NoC and using packet counting and cycle skipping reduces the amount of computation required to accurately model the NoC interconnect. In combination, this enables simulation speeds significantly higher than the state of the art, while maintaining less error, when compared to real hardware, than any similar simulator. Simulation speeds reach up to 370MIPS (Million (target) Instructions Per Second), or 110MHz, which is better than typical FPGA prototypes, and approaching final ASIC production speeds. This is achieved while maintaining an error of only 2.1%, significantly lower than other similar simulators. The thesis continues by scaling the simulator past large embedded systems up to 64-1024 core processors, adding support for coherent architectures using the same packet counting techniques along with low overhead context switching to enable the simulation of such large systems with stricter synchronisation requirements. The new interconnect model was partitioned to enable parallel simulation to further improve simulation speeds in a manner which did not sacrifice any accuracy. These innovations were leveraged to investigate significant novel energy saving optimisations to the coherency protocol, processor ISA, and processor micro-architecture. By introducing a new instruction, with the name wait-on-address, the energy spent during spin-wait style synchronisation events can be significantly reduced. This functions by putting the core into a low-power idle state while the cache line of the indicated address is monitored for coherency action. Upon an update or invalidation (or traditional timer or external interrupts) the core will resume execution, but the active energy of running the core pipeline and repeatedly accessing the data and instruction caches is effectively reduced to static idle power. The thesis also shows that existing combined software-hardware schemes to track data regions which do not require coherency can adequately address the directory-associativity problem, and introduces a new coherency sharer encoding which reduces the energy consumed by sharer invalidations when sharers are grouped closely together, such as would be the case with a system running many tasks with a small degree of parallelism in each. The research concludes by using the extremely fast simulation speeds developed to produce a large set of training data, collecting various runtime and energy statistics for a wide range of embedded applications on a huge diverse range of potential MPSoC designs. This data was used to train a series of machine learning based models which were then evaluated on their capacity to predict performance characteristics of unseen workload combinations across the explored MPSoC design space, using only two sample simulations, with promising results from some of the machine learning techniques. The models were then used to produce a ranking of predicted performance across the design space, and on average Random Forest was able to predict the best design within 89% of the runtime performance of the actual best tested design, and better than 93% of the alternative design space. When predicting for a weighted metric of energy, delay and area, Random Forest on average produced results within 93% of the optimum result. In summary this thesis improves upon the state of the art for cycle accurate multicore simulation, introduces novel energy saving changes the the ISA and microarchitecture of future multicore processors, and demonstrates the viability of machine learning techniques to significantly accelerate the design space exploration required to bring a new manycore design to market. 004
49	Processamento paralelo aplicado em análise não linear de cascas / Parallel processing applied to nonlinear structural analysis Carrijo, Elias Calixto 20 June 2001 (has links) Este trabalho tem o intuito de fazer uso do processamento paralelo na análise não linear de cascas pelo método dos elementos finitos. O elemento finito de casca é obtido com o acoplamento de um elemento de placa e um de chapa. O elemento de placa utiliza formulação de Kirchhof (DKT) para placas delgadas e o elemento de chapa faz uso da formulação livre (FF), introduzindo um grau de liberdade rotacional nos vértices. A análise não-linear com plasticidade utiliza o modelo de plasticidade associada com algoritmo de integração explícito, modelo de escoamento de von Mises com integração em camadas (modelo estratificado), para materiais isotrópicos. A implementação em paralelo é realizada em um sistema com memória distribuída e biblioteca de troca de mensagens PVM (Parallel Virtual Machine). O procedimento não-linear é completamente paralelizado, excetuando a impressão final de resultados. As etapas que constituem o método dos elementos finitos, matriz de rigidez da estrutura e resolução do sistema de equações lineares são paralelizadas. Para o cálculo da matriz de rigidez utiliza-se um algoritmo com decomposição de domínio explícito. Para resolução do sistema de equações lineares utiliza-se o método dos gradientes conjugados com implementação em paralelo. É apresentada uma breve revisão bibliográfica sobre o paralelismo, com comentários sobre perspectivas em análise estrutural / This work aims at using parallel processing for nonlinear analysis of shells through finite element method. The shell finite element is obtained by coupling a plate element with a membrane one. The plate element uses Kirchhoffs formulation (DKT) for thin plates and the membrane element makes use of the free formulation (FF), introducing a rotational degree of freedom at the vertexes. Nonlinear plastic analysis uses associated plasticity model with explicit integration algorithm, von Mises yelding model with layer integration (stratified model) for isotropic materials. Parallel implementation is done on a distributed memory system and message exchange library PVM (Parallel Virtual Machine). Nonlinear procedure is completely parallelised, but final printing of results. The finite element method steps, structural stifness matrix and solution of the linear equation system are parallelised. An explicit domain decomposition algorithm is used for the stifness matrix evaluation. To solve the linear equation system, one uses conjugated gradients method, with parallel implementation. A brief bibliography about parallelism is presented, with comments on structural analysis perspectives casca elementos finitos finite elements paralelismo parallelism plasticidade plasticity shell
50	Exploring parallelism on pure functional languages with ACQuA / Explorando paralelismo em linguagens funcionais puras com ACQuA Tanus, Felipe de Oliveira January 2017 (has links) Moore’s law reaching its physical limitations has pushed the industry to produce multicore processors. However, programming those processors with an imperative language is not easy since it requires developers to create and synchronize threads. A pure functional language is an adequate tool for this task both from the architectural point of view and from the developer’s. We will show that an architecture can benefit from the implicit parallelism present on functional programs and from the lack of side effects making it easier to parallelize. The developer benefits from functional languages from the superior expressiveness of the language to avoid bugs. In this dissertation, we present the ACQuA architecture, a multicore accelerator created to explore parallelism available in function calls from a pure functional program. ACQuA uses hardware support and a specificallytailored memory organization to minimize the overheads of scheduling, communication, and synchronization. Function calls are placed into a queue and are scheduled to different processing units. The processing units are interconnected and exchange results from function applications. In this work we defined a high level model of the accelerator and how to compile a functional program to it. We also simulated the accelerator and evaluated results, such as speedup, memory usage, and communication overhead of the proposed architecture. We defined the necessary traits of a program to achieve a good speedup on the architecture. On the ideal use case, we can increase the speed up at the same rate we increase the number of processing units in the architecture. Linguagens funcionais Processamento paralelo Architecture Accelerator Functional programming Parallelism

Search results