Spelling suggestions: "subject:"arallel programming"" "subject:"arallel erogramming""
271 |
Numerical linear algebra problems in structural analysisKannan, Ramaseshan January 2014 (has links)
A range of numerical linear algebra problems that arise in finite element-based structural analysis are considered. These problems were encountered when implementing the finite element method in the software package Oasys GSA. We present novel solutions to these problems in the form of a new method for error detection, algorithms with superior numerical effeciency and algorithms with scalable performance on parallel computers. The solutions and their corresponding software implementations have been integrated into GSA's program code and we present results that demonstrate the use of these implementations by engineers to solve real-world structural analysis problems.
|
272 |
Analyzujte možnosti vývoja viacvláknových aplikácií na platforme Java / Analyse multithreaded applications development possibilities on the Java platformChamila, Sergius January 2013 (has links)
This thesis is an analysis of frameworks which support multithreading application devel-opment on platform Java. In particular this thesis analyses the overview of libraries which supports multithreading development in this language. Provided features are shown and described by the code examples. The thesis describes the principles of concurrent pro-gramming. Although multithreaded applications undoubtedly have many advantages, one of the goals of the thesis is to highlight the possible risks associated with their use. By the analysis of Java platform, I used method of description, analysis, synthesis and comparison. This thesis is composed of theoretical part, which defines the basic concepts and principles of concurrent applications. Theoretical parts also contain recherché of works, which deal with the topic of concurrent programming. In the analytical part are described and evaluated changes at various versions of Java language. Benefit of this work is clear documentation and evaluation of Java language concurrent utilities, from the oldest version of Java 1 to latest Java 8. So this work can help those, who need to learn how to develop concurrent applications in Java.
|
273 |
Resource-aware clustering design for NoC-based MPSoCs / Projeto de MPSoCs baseados em NoC utilizando clusterização e gerenciamento de recursosSilva, Gustavo Girão Barreto da January 2014 (has links)
Atualmente, o paradigma multicore é uma tendência fortemente estabelecida também na área de sistemas embarcados. O grau de paralelismo provido por tal arquitetura tem sido a principal causa de avanços de performance na área além de economia de energia e potência. Entretanto, para obter paralelismo eficiente desta arquitetura não é uma tarefa simples. Assim, desenvolvedores propuseram diversos modelos de ambientes de programação tentando prover o máximo de transparência possível. No nível do hardware, este crescente aumento no número de componentes dentro chip cria um problema de gerenciamento a ser tratado. No contexto deste cenário complexo, esta tese propõe o uso de abordagens de gerenciamento de recursos para aumentar a eficiência, levando em consideração tanto performance quanto consumo de energia, de ambientes MPSoC em diferentes níveis. Além disso, estas abordagens tem em comum a noção de clusterização, a qual tenta agregar recursos logicamente de acordo com as demandas da aplicação. Primeiramente no nível do processador/aplicação, é proposto um hardware dinamicamente adaptável para suportar modelos de programação paralelos distintos sem nenhum sobrecusto computacional uma vez que todo o processo é completamente transparente para o programador. Ainda neste ambiente, onde aplicações distintas podem ser executadas, é proposto um mecanismo de escalonamento visando gerenciamento de recursos para aumentar a performance chamado Processor Clustering. São propostas quatro diferentes políticas de mapeamento de recursos que tiram vantagem de aspectos distintos da natureza paralela das aplicações e das restrições arquiteturais do sistema. Entretanto, algumas aplicações tem demandas de memória mais altas do que demandas computacionais. Logo, uma abordagem similar pode ser utilizada no nível da hierarquia de memória. Neste caso, o objetivo é redistribuir recursos de memória de acordo com as demandas da aplicação. Redistribuição de memória é explorada tanto em tempo de projeto quanto em tempo de execução. Um mecanismo de mapeamento de distribuição é proposto baseado na quantidade de requisições de acesso à memória externa. Finalmente, é proposto um mecanismo de tolerância à falhas baseado em gerenciamento de recursos para memórias distribuídas dentro do chip em NoCs. É introduzido um modelo de Reliability Clustering que tira proveito da infraestrutura da NoC. Neste caso, os roteadores tem conhecimento dos blocos com falhas e blocos redundantes. Baseado neste conhecimento, o mecanismo é capaz evitar altas latências de acesso à memória. / The multicore paradigm is a solid trend nowadays, also in the field of embedded systems. The degree of parallelism provided by such architecture has been the foundation of performance advancements in the field as well as for power and energy savings. However, to obtain efficient parallelism of such architecture is not an easy task. Therefore, developers come up with several proposals of programming environments trying to provide as much transparency as possible. On the hardware side, this increasing number of on-chip components creates a management issue to be handled. In the context of this complex scenario this thesis proposes the use of resource management approaches to improve the efficiency, regarding both performance and energy consumption, of MPSoC environments at different levels. Also, these approaches have in common the notion of clustering, which tries to logically aggregate resources according to application demands. First, at the processor/application level, we propose a dynamically adaptable hardware to support distinct parallel programming models at no computational overhead, since the entire process is completely transparent to the programmer. Also, in this environment, where distinct applications can be executed, we propose a resource-aware scheduling mechanism to improve performance named Processor Clustering. We propose four different resource mapping policies that leverage on distinct aspects of the parallel nature of the applications and on architecture constraints. However, some applications have higher memory demands than computational demands. Therefore, a similar approach can be used at the memory level. In this case, we aim at redistributing memory resources according to application demands. We explore memory redistribution at both design time and runtime and propose a distribution mapping mechanism based on the amount of off-chip memory requests. Finally, we propose a resource-aware fault-tolerance mechanism for distributed on-chip memories in NoCs. We introduce a Reliability Clustering model that leverages on the NoC infrastructure. In this case, the routers have knowledge of faulty blocks and redundancy blocks and, based on that, they are able to avoid higher memory access latency.
|
274 |
Code profiling and optimization in transactional memory systems / Profiling e otimização de código em sistemas de memória transacionalCordeiro, Silvio Ricardo January 2014 (has links)
Memória Transacional tem se demonstrado um paradigma promissor na implementação de aplicações concorrentes sob memória compartilhada que busquem evitar um modelo de sincronização baseado em locks. Em vez de sujeitar a execução a um acesso exclusivo com base no valor de um lock que é compartilhado por threads concorrentes, uma aplicação sob Memória Transacional tenta executar seções críticas de modo otimista, desfazendo as modificações no caso de um conflito de acesso à memória. Entretanto, apesar de a abordagem baseada em locks ter adquirido um número significativo de ferramentas automatizadas para a depuração, profiling e otimização automatizados (por ser uma das técnicas de sincronização mais antigas e mais bem pesquisadas), o campo da Memória Transacional ainda é comparativamente recente, e programadores frequentemente precisam adaptar manualmente suas aplicações transacionais ao encontrar problemas de eficiência. Este trabalho propõe um sistema no qual o profiling de código em uma implementação de Memória Transacional simulada é utilizado para caracterizar uma aplicação transacional, formando a base para uma parametrização automatizada do respectivo sistema especulativo para uma execução eficiente do código em questão. Também é proposta uma abordagem de escalonamento de threads guiado por profiling em uma implementação de Memória Transacional baseada em software, usando dados coletados pelo profiler para prever a probabilidade de conflitos e determinar que thread escalonar com base nesta previsão. São apresentados os resultados de experimentos sob ambas as abordagens. / Transactional Memory has shown itself to be a promising paradigm for the implementation of shared-memory concurrent applications that eschew a lock-based model of data synchronization. Rather than conditioning exclusive access on the value of a lock that is shared across concurrent threads, Transactional Memory attempts to execute critical sections optimistically, rolling back the modifications in the event of a data access conflict. However, while the lock-based approach has acquired a significant body of debugging, profiling and automated optimization tools (as one of the oldest and most researched synchronization techniques), the field of Transactional Memory is still comparably recent, and programmers are usually tasked with an unguided manual tuning of their transactional applications when facing efficiency problems. We propose a system in which code profiling in a simulated hardware implementation of Transactional Memory is used to characterize a transactional application, which forms the basis for the automated tuning of the underlying speculative system for the efficient execution of that particular application. We also propose a profile-guided approach to the scheduling of threads in a software-based implementation of Transactional Memory, using collected data to predict the likelihood of conflicts and determine what thread to schedule based on this prediction. We present the results achieved under both designs.
|
275 |
A parallel algorithm to solve the mathematical problem "double coset enumeration of S₂₄ over M₂₄"Harris, Elena Yavorska 01 January 2003 (has links)
This thesis presents and evaluates a new parallel algorithm that computes all single cosets in the double coset M₂₄ P M₂₄, where P is a permutation on n points of a certain cycle structure, and M₂₄ is the Mathieu group related to a Steiner system S(5, 8, 24) as its automorphism group. The purpose of this work is not to replace the existing algorithms, but rather to explore a possibility to extend calculations of single cosets beyond the limits encountered when using currently available methods.
|
276 |
Designing a Modern Skeleton Programming Framework for Parallel and Heterogeneous SystemsErnstsson, August January 2020 (has links)
Today's society is increasingly software-driven and dependent on powerful computer technology. Therefore it is important that advancements in the low-level processor hardware are made available for exploitation by a growing number of programmers of differing skill level. However, as we are approaching the end of Moore's law, hardware designers are finding new and increasingly complex ways to increase the accessible processor performance. It is getting more and more difficult to effectively target these processing resources without expert knowledge in parallelization, heterogeneous computation, communication, synchronization, and so on. To ensure that the software side can keep up, advanced programming environments and frameworks are needed to bridge the widening gap between hardware and software. One such example is the pattern-centric skeleton programming model and in particular the SkePU project. The work presented in this thesis first redesigns the SkePU framework based on modern C++ variadic template metaprogramming and state-of-the-art compiler technology. It then explores new ways to improve performance: by providing new patterns, improving the data access locality of existing ones, and using both static and dynamic knowledge about program flow. The work combines novel ideas with practical evaluation of the approach on several applications. The advancements also include the first skeleton API that allows variadic skeletons, new data containers, and finally an approach to make skeleton programming more customizable without compromising universal portability. / <p>Ytterligare forskningsfinansiärer: EU H2020 project EXA2PRO (801015); SeRC.</p>
|
277 |
Návrhové vzory v paralelních a distribuovaných systémech / Design Patterns for Parallel and Distributed SystemsJurnečka, Peter Unknown Date (has links)
This Ph.D. thesis describes proposed notation and method for working with parallel design patterns, which allowes proposing of automatic corrections to existing parallel source code with help of refactoring. In order to define the proposed notation, this work must cover areas of static code analysis, formal description of parallel design patterns and refactoring. Static code analysis is used to analyse the existing parallel source code for definition of places where you want to insert specified design pattern. Formal description of design pattern allows you to automatically apply the pattern to the existing source code. Finally, refactoring allows you to edit an existing source code without changing its functionality. The first part is devoted to the description of the current status in these three areas e.g. code analysis, design patterns and refactoring. The second part is devoted to a description of the methodology and experimental verification of its deployment.
|
278 |
Knihovna operací nad konečnými automaty / Library for Operations over Finite AutomataBartůněk, Petr January 2010 (has links)
This work deals with two basic operations over finite automata. Determination of nondeterministic finite automata and minimization of deterministic finite automata. For these two operations I proposed sequential algorithms that are parallelizable. I deal mainly with finding the speedup of SSE instructions, or use the OpenMP library. The trend today is mainly in increasing the number of processors, so I propose parallel algorithms for multiple processors. When searching for the optimal solution, I will be to examine other ways to achieve speedup, for example efficient saving of the data structures in memory.
|
279 |
Approximate Computing: From Circuits to SoftwareYounghoon Kim (10184063) 01 March 2021 (has links)
<div>Many modern workloads such as multimedia, recognition, mining, search, vision, etc. possess the characteristic of intrinsic application resilience: The ability to produce acceptable-quality outputs despite their underlying computations being performed in an approximate manner. Approximate computing has emerged as a paradigm that exploits intrinsic application resilience to design systems that produce outputs of acceptable quality with significant performance/energy improvement. The research community has proposed a range of approximate computing techniques spanning across circuits, architecture, and software over the last decade. Nevertheless, approximate computing is yet to be incorporated into mainstream HW/SW design processes largely due to the deviation from the conventional design flow and the lack of runtime approximation controllability by the user.</div><div><br></div><div>The primary objective of this thesis is to provide approximate computing techniques across different layers of abstraction that possess the two following characteristics: (i) They can be applied with minimal change to the conventional design flow, and (ii) the approximation is controllable at runtime by the user with minimal overhead. To this end, this thesis proposes three novel approximate computing techniques: Clock overgating which targets HW design at the Register Transfer Level (RTL), value similarity extensions which enhance general-purpose processors with a set of microarchitectural and ISA extensions, and data subsetting which targets SW executing for commodity platforms.</div><div><br></div><div>The thesis first explores clock overgating, which extends the concept of clock gating: A conventional low-power technique that turns off the clock to a Flip-Flop (FF) when the value remains unchanged. In contrast to traditional clock gating, in clock overgating the clock signals to selected FFs in the circuit are gated even when the circuit functionality is sensitive to their state. This saves additional power in the clock tree, the gated FFs and in their downstream logic, while a quality loss occurs if the erroneous FF states propagate to the circuit outputs. This thesis develops a systematic methodology to identify an energy-efficient clock overgating configuration for any given circuit and quality constraint. Towards this end, three key strategies for efficiently pruning the large space of possible overgating configurations are proposed: Significance-based overgating, grouping FFs into overgating islands, and utilizing internal signals of the circuit as triggers for overgating. Across a suite of 6 machine learning accelerators, energy benefits of 1.36X on average are achieved at the cost of a very small (<0.5%) loss in classification accuracy.</div><div><br></div><div>The thesis also explores value similarity extensions, a set of lightweight micro-architectural and ISA extensions for general-purpose processors that provide performance improvements for computations on data structures with value similarity. The key idea is that programs often contain repeated instructions that are performed on very similar inputs (e.g., neighboring pixels within a homogeneous region of an image). In such cases, it may be possible to skip an instruction that operates on data similar to a previously executed instruction, and approximate the skipped instruction's result with the saved result of the previous one. The thesis provides three key strategies for realizing this approach: Identifying potentially skippable instructions from user annotations in SW, obtaining similarity information for future load values from the data cache line currently being accessed, and a mechanism for saving & reusing results of potentially skippable instructions. As a further optimization, the thesis proposes to replace multiple loop iterations that produce similar results with a specialized instruction sequence. The proposed extensions are modeled on the gem5 architectural simulator, achieving speedup of 1.81X on average across 6 machine-learning benchmarks running on a microcontroller-class in-order processor.</div><div><br></div><div>Finally, the thesis explores a data-centric approach to approximate computing called data subsetting that shifts the focus of approximation from computations to data. The key idea is to restrict the application's data accesses to a subset of its elements so that the overall memory footprint becomes smaller. Constraining the accesses to lie within a smaller memory footprint renders the memory accesses more cache-friendly, thereby improving performance. This thesis presents a C++ data structure template called SubsettableTensor, which embodies mechanisms to define an accessible subset of data and redirect accesses away from non-subset elements, for realizing data subsetting in SW. The proposed concept is evaluated on parallel SW implementations of 7 machine learning applications on a 48-core AMD Opteron server. Experimental results indicate that 1.33X-4.44X performance improvement can be achieved within a <0.5% loss in classification accuracy.</div><div><br></div><div>In summary, the proposed approximation techniques have shown significant efficiency improvements for various machine learning applications in circuits, architecture and SW, underscoring their promise as designer-friendly approaches to approximate computing.</div>
|
280 |
A Performance Evaluation of MPI Shared Memory Programming / En utvärdering av MPI shared memory - programmering med inriktning på prestandaKarlbom, David January 2016 (has links)
The thesis investigates the Message Passing Interface (MPI) support for shared memory programming on modern hardware architecture with multiple Non-Uniform Memory Access (NUMA) domains. We investigate its performance in two case studies: the matrix-matrix multiplication and Conway’s game of life. We compare MPI shared memory performance in terms of execution time and memory consumption with the performance of implementations using OpenMP and MPI point-to-point communication, also called "MPI two-sided". We perform strong scaling tests in both test cases. We observe that MPI two-sided implementation is 21% and 18% faster than the MPI shared and OpenMP implementations respectively in the matrix-matrix multiplication when using 32 processes. MPI shared uses less memory space: when compared to MPI two-sided, MPI shared uses 45% less memory. In the Conway’s game of life, we find that MPI two-sided implementation is 10% and 82% faster than the MPI shared and OpenMP implementations respectively when using 32 processes. We also observe that not mapping virtual memory to a specific NUMA domain can lead to an increment in execution time of 64% when using 32 processes. The use of MPI shared is viable for intranode communication on modern hardware architecture with multiple NUMA domains. / I detta examensarbete undersöker vi Message Passing Inferfaces (MPI) support för shared memory programmering på modern hårdvaruarkitektur med flera Non-Uniform Memory Access (NUMA) domäner. Vi undersöker prestanda med hjälp av två fallstudier: matris-matris multiplikation och Conway’s game of life. Vi jämför prestandan utav MPI shared med hjälp utav exekveringstid samt minneskonsumtion jämtemot OpenMP och MPI punkt-till-punkt kommunikation, även känd som MPI two-sided. Vi utför strong scaling tests för båda fallstudierna. Vi observerar att MPI-two sided är 21% snabbare än MPI shared och 18% snabbare än OpenMP för matris-matris multiplikation när 32 processorer användes. För samma testdata har MPI shared en 45% lägre minnesförburkning än MPI two-sided. För Conway’s game of life är MPI two-sided 10% snabbare än MPI shared samt 82% snabbare än OpenMP implementation vid användandet av 32 processorer. Vi kunde också utskilja att om ingen mappning av virtuella minnet till en specifik NUMA domän görs, leder det till en ökning av exekveringstiden med upp till 64% när 32 processorer används. Vi kom fram till att MPI shared är användbart för intranode kommunikation på modern hårdvaruarkitektur med flera NUMA domäner.
|
Page generated in 0.0832 seconds