51

Compilation of Stream Programs onto Embedded Multicore Architectures

January 2012
In recent years, we have observed the prevalence of stream applications in many embedded domains. Stream programs distinguish themselves from traditional sequential programming languages through well-defined independent actors, explicit data communication, and stable code/data access patterns. In order to achieve high performance and low power, scratchpad memory (SPM) has been introduced in today's embedded multicore processors. Current design frameworks for developing stream applications on SPM-enhanced embedded architectures typically do not include a compiler that can perform automatic partitioning, mapping and scheduling under limited on-chip SPM capacities and memory access delays. Consequently, many designs are implemented manually, which leads to lengthy development and inferior designs. In this work, optimization techniques that automatically compile stream programs onto embedded multicore architectures are proposed. As an initial case study, we implemented an automatic target recognition (ATR) algorithm on the IBM Cell Broadband Engine (BE). Then integer linear programming (ILP) and heuristic approaches were proposed to schedule stream programs on a single-core embedded processor that has an SPM with code overlay. Later, ILP and heuristic approaches for Compiling Stream programs on SPM-enhanced Multicore Processors (CSMP) were studied. The proposed CSMP ILP and heuristic approaches do not optimize for cycles in stream applications. Further, the number of software pipeline stages in the implementation depends on the actor to processing engine (PE) mapping and is uncontrollable. We next present a Retiming technique for Throughput optimization on Embedded Multi-core processors (RTEM). The RTEM approach inherently handles cycles and can accept an upper bound on the number of software pipeline stages to be generated. 
We further enhanced RTEM by incorporating unrolling (URSTEM), which preserves all the beneficial properties of the RTEM heuristic and also scales with the number of PEs through unrolling. / Dissertation/Thesis / Ph.D. Computer Science 2012
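The partitioning-and-mapping problem this abstract describes can be sketched with a simple greedy heuristic. This is only an illustration of the idea, not the dissertation's ILP or RTEM algorithms; the actor names, code sizes, and workloads below are invented.

```python
# Hypothetical sketch: greedy mapping of stream actors to processing
# engines (PEs) under per-PE scratchpad memory (SPM) capacity limits.
# Actor names, code sizes, and workloads are invented for illustration.

def map_actors(actors, num_pes, spm_capacity):
    """actors: list of (name, code_size, workload) tuples.
    Returns {pe_index: [actor names]}, or raises if an actor cannot fit."""
    pe_load = [0] * num_pes          # accumulated workload per PE
    pe_spm = [0] * num_pes           # accumulated code size per PE
    mapping = {pe: [] for pe in range(num_pes)}
    # Place the heaviest actors first so load balance degrades gracefully.
    for name, size, load in sorted(actors, key=lambda a: -a[2]):
        # Pick the least-loaded PE whose SPM still has room for the code.
        candidates = [pe for pe in range(num_pes)
                      if pe_spm[pe] + size <= spm_capacity]
        if not candidates:
            raise ValueError(f"actor {name} does not fit in any SPM")
        best = min(candidates, key=lambda pe: pe_load[pe])
        mapping[best].append(name)
        pe_load[best] += load
        pe_spm[best] += size
    return mapping

actors = [("src", 4, 10), ("fir", 8, 40), ("fft", 12, 70), ("sink", 4, 5)]
mapping = map_actors(actors, num_pes=2, spm_capacity=16)
```

An ILP formulation like the one the abstract mentions would search for a provably optimal mapping under the same capacity constraints; a greedy pass such as this trades optimality for speed.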
52

Parallelism in Node.js applications: Data flow analysis of concurrent scripts

Jansson, Linda January 2017
To fully utilize multicore processors in Node.js applications, the applications must be programmed as multiple processes. Parallel execution can increase the throughput of data and hence lower data buffering for inter-process communication. Node.js’s asynchronous programming model and interface to the operating system make for convenient tools that are well suited for multiprocess programming. However, the run-time behavior of asynchronous processes results in non-deterministic processor load and data flow. That means the performance gain from increasing concurrency depends on both the application’s run-time state and the hardware’s capacity for parallel execution. The objective of this thesis work is to explore the effects of increasing parallelism in Node.js applications by measuring the differences in the amount of data buffering when distributed processes run on a varying number of cores with a fixed rate of asynchronously arriving data. The goal is to simulate and examine the run-time behavior of three basic multiprocess Node.js application architectures in order to discuss and evaluate software parallelism techniques. The three architectures are: pipelined nodes for temporally dependent processing, a vector of nodes for data parallel processing, and a grid of nodes for one-to-many branched processing. To simulate and visualize the run-time behavior, a simulation environment using multiple Node.js processes is created. The simulation is agent-based, where the agent is an abstraction for a specific data flow within the application. The simulation models and visualizes all of the data flows within a distributed application where processes communicate asynchronously via messages through sockets. The results show that performance can increase when distributing Node.js applications across multiple processes running in parallel on multicore hardware. 
There are, however, diminishing returns as the number of active processes equals or exceeds the number of cores. A good rule of thumb seems to be to distribute the decoupled logic across as many processes as there are cores. The interaction between asynchronous processes is on the whole made very simple with Node.js. Although running multiple instances of Node.js requires more memory, the distributed architecture has the potential to increase performance by nearly as many times as the number of cores in the processor.
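The diminishing-returns effect described above can be illustrated with a small deterministic model. This is a toy sketch, not the thesis's agent-based Node.js simulator; the arrival gap and service time are invented parameters.

```python
# Toy sketch: data items arrive at a fixed rate and are distributed over
# N parallel workers; we measure the total time items spend buffered.
import heapq

def total_wait(num_workers, num_items, arrival_gap, service_time):
    """Items arrive every `arrival_gap` ticks; each needs `service_time`
    ticks of processing. Returns the summed time items spend buffered."""
    free_at = [0.0] * num_workers        # tick at which each worker frees up
    heapq.heapify(free_at)
    waited = 0.0
    for i in range(num_items):
        t = i * arrival_gap
        f = heapq.heappop(free_at)       # earliest-available worker
        start = max(t, f)                # item buffers until a worker is free
        waited += start - t
        heapq.heappush(free_at, start + service_time)
    return waited

# Buffering falls as workers are added, with diminishing returns once the
# pool matches the workload (service_time / arrival_gap = 4 workers here).
waits = {n: total_wait(n, num_items=100, arrival_gap=1, service_time=4)
         for n in (1, 2, 4, 8)}
```

With one or two workers the queue grows without bound; from four workers onward data is consumed as fast as it arrives, so adding processes beyond what the workload (or the core count) supports no longer reduces buffering.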
53

Uma metodologia de avaliação de desempenho para identificar as melhores regiões paralelas para reduzir o consumo de energia / A performance evaluation methodology to find the best parallel regions to reduce energy consumption

Millani, Luís Felipe Garlet January 2015
Devido às limitações de consumo energético impostas a supercomputadores, métricas de eficiência energética estão sendo usadas para analisar aplicações paralelas desenvolvidas para computadores de alto desempenho. O objetivo é a redução do custo energético dessas aplicações. Algumas estratégias de redução de consumo energético consideram a aplicação como um todo, outras ajustam a frequência dos núcleos apenas em certas regiões do código paralelo. Fases de balanceamento de carga ou de comunicação bloqueante podem ser oportunas para redução do consumo energético. A análise de eficiência dessas estratégias é geralmente realizada com metodologias tradicionais derivadas do domínio de análise de desempenho. Uma metodologia de grão mais fino, onde a redução de energia é avaliada para cada região de código e frequência, pode levar a um melhor entendimento de como o consumo energético pode ser minimizado para uma determinada implementação. Para tal, os principais desafios são: (a) a detecção de um número possivelmente grande de regiões paralelas; (b) qual frequência deve ser adotada para cada região de forma a limitar o impacto no tempo de execução; e (c) o custo do ajuste dinâmico da frequência dos núcleos. O trabalho descrito nesta dissertação apresenta uma metodologia de análise de desempenho para encontrar, dentre as regiões paralelas, os melhores candidatos à redução do consumo energético. Esta proposta consiste de: (a) um design inteligente de experimentos baseado em Plackett-Burman, especialmente importante quando um grande número de regiões paralelas é detectado na aplicação; (b) análise tradicional de energia e desempenho sobre as regiões consideradas candidatas à redução do consumo energético; e (c) análise baseada em eficiência de Pareto mostrando a dificuldade em otimizar o consumo energético. Em (c) também são mostrados os diferentes pontos de equilíbrio entre desempenho e eficiência energética que podem ser interessantes ao desenvolvedor. 
Nossa abordagem é validada por três aplicações: Graph500, busca em largura e refinamento de Delaunay. / Due to energy limitations imposed on supercomputers, parallel applications developed for High Performance Computing (HPC) are currently being investigated with energy efficiency metrics. The idea is to reduce the energy footprint of these applications. While some energy reduction strategies consider the application as a whole, others adjust the core frequency only for certain regions of the parallel code. Load balancing or blocking communication phases could be used as opportunities for energy reduction, for instance. The efficiency analysis of such strategies is usually carried out with traditional methodologies derived from the performance analysis domain. A finer-grain methodology, where the energy reduction is evaluated for each code region and frequency configuration, could potentially lead to a better understanding of how energy consumption can be reduced for a particular algorithm implementation. To achieve this, the main challenges are: (a) the detection of a possibly large number of parallel code regions; (b) which frequency should be adopted for each region (to reduce energy consumption without too much penalty for the runtime); and (c) the cost of dynamically adjusting core frequency. The work described in this dissertation presents a performance analysis methodology to find the best parallel region candidates for reducing energy consumption. The proposal is threefold: (a) a clever design of experiments based on screening, especially important when a large number of parallel regions is detected in the application; (b) a traditional energy and performance evaluation of the regions that were considered good candidates for energy reduction; and (c) a Pareto-based analysis showing how hard it is to obtain energy gains in optimized codes. 
In (c), we also show other trade-offs between performance loss and energy gain that might be of interest to the application developer. Our approach is validated against three HPC application codes: Graph500, Breadth-First Search, and Delaunay Refinement.
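The Pareto-based analysis mentioned in (c) can be sketched in a few lines. The (runtime, energy) measurements below are invented for illustration and are not data from the dissertation.

```python
# Hedged sketch: computing the Pareto-efficient set of (runtime, energy)
# measurements over candidate region/frequency configurations.
# Lower is better in both dimensions; the points are invented.

def pareto_front(points):
    """points: list of (runtime, energy) tuples.
    Returns the configurations not dominated by any other point."""
    front = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p
                        for q in points)
        if not dominated:
            front.append(p)
    return front

configs = [(10.0, 50.0), (12.0, 40.0), (11.0, 55.0),
           (15.0, 39.0), (16.0, 45.0)]
front = sorted(pareto_front(configs))
# (11.0, 55.0) is dominated by (10.0, 50.0); (16.0, 45.0) by (15.0, 39.0).
```

Each surviving point is a distinct performance/energy trade-off a developer could choose; a configuration missing from the front is strictly worse than some alternative in both time and energy.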
54

Multi-Core Pattern

Bendiuga, Volodymyr January 2012
No description available.
55

Implementation och utvärdering av trådintensiv simulator i Java, Jetlang och Erlang

Holmberg, Henrik January 2011
Trådning är svårt att få så effektivt som möjligt och det är komplicerat att skriva kod för det. Det här examensarbetet beskriver implementationer i Java, Jetlang och Erlang och jämför dem sedan. Utvärderingen av implementationerna avslöjar att Jetlang skalar bäst både när det gäller komplexitet och hårdvara. Erlang behöver minst extra kod för att hantera trådning. När det gäller många små körningar har även Erlang bäst prestanda men tappar försprånget snabbt när komplexiteten ökar. / Efficient threading is difficult to achieve and complicated to write code for. This thesis describes implementations in Java, Jetlang and Erlang and compares them. The evaluation of the implementations reveals that Jetlang scales best, with respect to both complexity and hardware. Erlang requires the least extra code for threading. For many small executions, Erlang also performs best, but loses the lead quickly as complexity increases.
56

High-level structured programming models for explicit and automatic parallelization on multicore architectures / Modèle de programmation de haut niveau pour la parallélisation explicite et automatique : application aux architectures multicoeurs

Khammassi, Nader 05 December 2014
La prolifération des architectures multi-coeurs est source d’une pression importante pour les développeurs, qui doivent chercher à paralléliser leurs applications de manière à profiter au mieux de ces plateformes. Malheureusement, les modèles de programmation de bas niveau amplifient les difficultés inhérentes à la conception d’applications complexes et parallèles. Il existe donc une attente pour des modèles de programmation de plus haut niveau, qui puissent simplifier la vie des programmeurs de manière significative, tout en proposant des abstractions suffisantes pour absorber l’hétérogénéité des architectures matérielles. Contrairement à une multitude de modèles de programmation parallèle qui introduisent de nouveaux langages, annotations ou étendent des langages existants et requièrent donc des compilateurs spécialisés, nous exploitons ici le potentiel du langage C++ standard et traditionnel. En particulier nous avons recours à ses capacités en termes de méta-programmation, afin de fournir au programmeur une interface de programmation parallèle simple et directe. Cette interface autorise le programmeur à exprimer le parallélisme de son application au prix d’une altération négligeable du code séquentiel initial. Un runtime intelligent se charge d’extraire toute information relative aux dépendances de données entre tâches, ainsi que celles relatives à l’ordonnancement. Nous montrons comment ce runtime est à même d’exploiter ces informations dans le but de détecter et protéger les données partagées, puis réaliser un ordonnancement prenant en compte les particularités des caches. L’implémentation initiale de notre modèle de programmation est une librairie C++ pure appelée XPU. 
XPU est conçue dans le but de faciliter l’explicitation, par le programmeur, du parallélisme applicatif. Une seconde réalisation appelée FATMA doit être considérée comme une extension d’XPU qui permet une détection automatique des dépendances dans une séquence de tâches : il s’agit donc de parallélisation automatique, sans recours à quelque outil que ce soit, excepté un compilateur C++ standard. Afin de démontrer le potentiel de notre approche, nous utilisons ces deux outils – XPU et FATMA – pour paralléliser des problèmes populaires, ainsi que des applications industrielles réelles. Nous montrons qu’en dépit de leur abstraction élevée, nos modèles de programmation présentent des performances comparables à des modèles de programmation de bas niveau, et offrent un meilleur compromis productivité-performance. / The continuous proliferation of multicore architectures has placed developers under great pressure to parallelize their applications to match what such platforms can offer. Unfortunately, traditional low-level programming models exacerbate the difficulties of building large and complex parallel applications. High-level parallel programming models are in high demand as they significantly reduce the burden on programmers and provide enough abstraction to accommodate hardware heterogeneity. 
In this thesis, we propose a flexible parallelization methodology, and we introduce a new task-based parallel programming model designed to provide high productivity and expressiveness without sacrificing performance. Our programming model aims to ease the expression of both sequential execution and several types of parallelism, including task, data and pipeline parallelism, at different granularity levels, to form a structured, homogeneous programming model. Contrary to many parallel programming models which introduce new languages or compiler annotations, or extend existing languages, and thus require specialized compilers, extra hardware or virtual machines, we exploit the potential of the traditional standard C++ language, and particularly its meta-programming capabilities, to provide a lightweight and smart parallel programming interface. This interface enables the programmer to express parallelism at the cost of a small amount of extra code while reusing legacy sequential code almost without alteration. An intelligent run-time system transparently extracts information on task-data dependencies and ordering. We show how the run-time system can exploit this valuable information to detect and protect shared data automatically and perform cache-aware scheduling. The initial implementation of our programming model is a pure C++ library named "XPU", designed for explicit parallelism specification. A second implementation named "FATMA" extends XPU and exploits the transparent task-dependency extraction feature to provide automatic parallelization of a given sequence of tasks without the need for any specific tool apart from a standard C++ compiler. In order to demonstrate the potential of our approach, we use both the explicit and automatic parallel programming models to parallelize popular problems as well as real industrial applications. 
We show that, despite their high abstraction, our programming models deliver performance comparable to lower-level programming models and offer a better productivity-performance tradeoff.
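The automatic dependency-extraction idea behind FATMA can be illustrated in miniature. XPU and FATMA are C++ libraries, so this Python sketch is only a conceptual model with invented task names and data sets: each task declares what it reads and writes, and a tiny run-time derives the ordering constraints.

```python
# Conceptual sketch of automatic task-dependency extraction: a task
# depends on the latest earlier task that wrote data it touches
# (read-after-write / write-after-write), and on earlier readers of
# data it overwrites (write-after-read). Names are invented.

def schedule(tasks):
    """tasks: list of (name, reads, writes) in submission order.
    Returns (execution order, dependency edges)."""
    last_writer, last_readers, deps = {}, {}, []
    for name, reads, writes in tasks:
        for d in reads | writes:
            if d in last_writer:
                deps.append((last_writer[d], name))      # RAW / WAW
        for d in writes:
            for r in last_readers.get(d, ()):            # WAR
                deps.append((r, name))
            last_writer[d] = name
            last_readers[d] = set()
        for d in reads:
            last_readers.setdefault(d, set()).add(name)
    order = [t[0] for t in tasks]   # submission order respects all deps
    return order, deps

tasks = [("load", set(), {"A"}), ("scale", {"A"}, {"B"}),
         ("stats", {"A"}, {"C"}), ("merge", {"B", "C"}, {"D"})]
order, deps = schedule(tasks)
```

Note that `scale` and `stats` share no dependency edge, so a run-time armed with this graph could execute them in parallel while still running `load` first and `merge` last.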
57

A Multi-core Testbed on Desktop Computer for Research on Power/Thermal Aware Resource Management

Dierivot, Ashley 06 June 2014
Our goal is to develop a flexible, customizable, and practical multi-core testbed based on an Intel desktop computer that can be utilized to assist theoretical research on power/thermal aware resource management in the design of computer systems. By integrating different modules, i.e. thread mapping/scheduling, processor/core frequency and voltage variation, temperature/power measurement, and run-time performance collection, into a systematic and unified framework, our testbed can bridge the gap between theoretical study and practical implementation. The effectiveness of our system was validated using appropriately selected benchmarks. The importance of this research is that it complements current theoretical research by validating theoretical results in practical scenarios that are closer to the real world. In addition, by studying the discrepancies between the results of theoretical study and their applications in the real world, the research also aids in identifying new research problems and directions.
58

Time-Predictable Fast Memories: Caches vs. Scratchpad Memories

Liu, Yu 01 August 2011
In modern processor architectures, caches are widely used to shorten the gap between processor speed and memory access time. However, caches are time-unpredictable, especially the shared L2 cache used by different cores on multicore processors. Thus, they can significantly increase the complexity of worst-case execution time (WCET) analysis, which is crucial for real-time systems. This dissertation designs several time-predictable scratchpad memory (SPM) based architectures for both VLIW (Very Long Instruction Word) based single-core and multicore processors. First, this dissertation proposes a time-predictable two-level SPM based architecture for VLIW based single-core processors, and an ILP (Integer Linear Programming) based static memory objects allocation algorithm is extended to support the multi-level SPMs without harming the time predictability of SPMs. Second, several SPM based architectures for VLIW based multicore processors are designed. To support these architectures, the dynamic memory objects allocation based partition, the static memory objects allocation based partition and the static memory objects allocation based priority L2 SPM strategy are proposed, which retain the characteristic of time predictability. Also, both the WCET and worst-case energy consumption (WCEC) of our SPM based single-core and multicore architectures are completely evaluated in this dissertation. Last, to exploit the load/store latencies that are statically known in this architecture, we study an SPM-aware scheduling method to improve performance. Our experimental results indicate the strengths and weaknesses of each proposed architecture and allocation method, which offers interesting memory design options to enable real-time computing. The strength of the two-level architecture is its superior performance compared to the one-level architecture, while the strength of the one-level architecture is its simple implementation. 
Also, the two-level architecture with a separate L1 SPM for each core is a better fit for data-intensive real-time applications, as it not only retains good performance but also achieves higher bandwidth by accessing instruction and data SPM at the same time. Compared to the static strategies, the dynamic allocation based partition L2 SPM strategy offers better performance on each core because SPM space is reused at run time, but it has much higher complexity. In addition, the experimental results show that the timing and energy performance of our proposed SPM based architectures are superior to similar cache based and hybrid architectures. Meanwhile, our architectures can ensure the time predictability that is desirable for real-time systems.
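The static memory-object allocation problem at the heart of this abstract resembles a knapsack: choose which objects to place in the SPM so the estimated WCET benefit is maximized under the capacity limit. The dissertation uses ILP for exact allocation; the sketch below is only a cheap greedy approximation with invented object names and numbers.

```python
# Hypothetical sketch: greedy static allocation of memory objects to an
# SPM by WCET-saving density (saving per byte), under a capacity limit.
# The ILP in the dissertation would solve this exactly; object names,
# sizes, and savings here are invented.

def allocate_spm(objects, capacity):
    """objects: list of (name, size_bytes, wcet_saving).
    Returns (chosen names, total WCET saving)."""
    chosen, used, saving = [], 0, 0
    # Try the most "profitable" bytes first.
    for name, size, gain in sorted(objects, key=lambda o: -o[2] / o[1]):
        if used + size <= capacity:
            chosen.append(name)
            used += size
            saving += gain
    return chosen, saving

objs = [("stack", 256, 900), ("lut", 512, 1200),
        ("buf", 1024, 1500), ("coeffs", 128, 500)]
chosen, saving = allocate_spm(objs, capacity=1024)
```

Greedy-by-density can miss the optimum that an ILP would find, but it illustrates why limited SPM capacity forces an allocation decision at all.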
59

Parallelization of Dataset Transformation with Processing Order Constraints in Python / Parallelisering av datamängdstransformation med ordningsbegränsningar i Python

Gramfors, Dexter January 2016
Financial data is often represented as rows of values contained in a dataset. This data needs to be transformed into a common format in order for comparison and matching to be made, which can take a long time for larger datasets. The main goal of this master’s thesis is speeding up these transformations through parallelization using Python multiprocessing. The datasets in question consist of several rows representing trades, and are transformed into a common format using rules known as filters. In order to devise a parallelization strategy, the filters were analyzed to find ordering constraints, and the Python profiler cProfile was used to find bottlenecks and potential parallelization points. This analysis resulted in the use of a task-based approach for the implementation, in which the transformation was divided into an initial sequential pre-processing step, a parallel step where chunks of several trade rows were distributed among workers, and a sequential post-processing step. The implementation was tested by transforming four datasets of differing sizes using up to 16 workers, and execution time and memory consumption were measured. The results for the tiny, small, medium, and large datasets showed speedups of 0.5, 2.1, 3.8, and 4.81, respectively. They also showed linearly increasing memory consumption for all datasets. The test transformations were also profiled in order to understand the parallel program’s behaviour for the different datasets. The experiments led to the conclusion that dataset size heavily influences the speedup, partly because the sequential parts become less significant. In addition, the large memory increase for larger numbers of workers is noted as a major downside of multiprocessing when using caching mechanisms, as data is duplicated instead of shared. This thesis shows that it is possible to speed up the dataset transformations using chunks of rows as tasks, though the speedup is relatively low. 
/ Finansiell data representeras ofta med rader av värden, samlade i en datamängd. Denna data måste transformeras till ett standardformat för att möjliggöra jämförelser och matchning. Detta kan ta lång tid för stora datamängder. Huvudmålet för detta examensarbete är att snabba upp dessa transformationer genom parallellisering med hjälp av Python-modulen multiprocessing. Datamängderna omvandlas med hjälp av regler, kallade filter. Dessa filter analyserades för att identifiera begränsningar på ordningen i vilken datamängden kan behandlas, och därigenom finna en parallelliseringsstrategi. Python-profileraren cProfile användes även för att hitta potentiella parallelliseringspunkter i koden. Denna analys resulterade i användandet av ett “task”-baserat tillvägagångssätt, där transformationen delades in i ett sekventiellt pre-processingsteg, ett parallellt steg där grupper av rader distribuerades ut bland arbetar-processer, och ett sekventiellt post-processingsteg. Implementationen testades genom transformation av fyra datamängder av olika storlekar, med upp till 16 arbetarprocesser. Resultaten för de fyra datamängderna var en speedup på 0.5, 2.1, 3.8 respektive 4.81. En linjär ökning i minnesanvändning uppvisades även. Experimenten resulterade i slutsatsen att datamängdens storlek var en betydande faktor i hur mycket speedup som uppvisades, delvis på grund av faktumet att de sekventiella delarna tar upp en mindre del av programmet. Den stora minnesåtgången noterades som en nackdel med att använda multiprocessing i kombination med cachning, på grund av duplicerad data. Detta examensarbete visar att det är möjligt att snabba upp datamängdstransformation genom att använda radgrupper som tasks, även om en relativt låg speedup uppvisades.
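The task-based structure the thesis describes — sequential pre-processing, a parallel step mapping chunks of trade rows over a worker pool, then sequential post-processing — can be sketched as below. The thesis uses Python multiprocessing; this sketch uses a thread pool so it stays self-contained (swapping in `ProcessPoolExecutor` would give true process parallelism), and the "filter" applied to each row is an invented placeholder.

```python
# Hedged sketch of chunked parallel dataset transformation:
# sequential pre-process -> parallel map over row chunks -> sequential
# post-process. The normalize_row "filter" is an invented placeholder.
from concurrent.futures import ThreadPoolExecutor

def normalize_row(row):
    # Placeholder filter: uppercase the currency field, round the amount.
    return {"ccy": row["ccy"].upper(), "amount": round(row["amount"], 2)}

def transform(rows, workers=4, chunk_size=2):
    rows = [r for r in rows if r["amount"] > 0]              # pre-process
    chunks = [rows[i:i + chunk_size]
              for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        done = pool.map(lambda c: [normalize_row(r) for r in c], chunks)
    out = [row for chunk in done for row in chunk]           # post-process
    return sorted(out, key=lambda r: r["ccy"])

trades = [{"ccy": "usd", "amount": 10.123},
          {"ccy": "sek", "amount": -1.0},
          {"ccy": "eur", "amount": 5.567}]
result = transform(trades)
```

The chunk size controls the task granularity the thesis discusses: larger chunks amortize scheduling overhead, smaller chunks balance load better across workers.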
60

Towards more scalable mutual exclusion for multicore architectures / Vers des mécanismes d'exclusion mutuelle plus efficaces pour les architectures multi-cœur

Lozi, Jean-Pierre 16 July 2014
Le passage à l'échelle des applications multi-fil sur les systèmes multi-cœur actuels est limité par la performance des algorithmes de verrou, à cause des coûts d'accès à la mémoire sous forte congestion et des défauts de cache. La contribution principale présentée dans cette thèse est un nouvel algorithme, Remote Core Locking (RCL), qui a pour objectif d'améliorer la vitesse d'exécution des sections critiques des applications patrimoniales sur les architectures multi-cœur. L'idée de RCL est de remplacer les acquisitions de verrou par des appels de fonction distants (RPC) optimisés vers un fil d'exécution matériel dédié appelé serveur. RCL réduit l'effondrement des performances observé avec d'autres algorithmes de verrou lorsque de nombreux fils d'exécution essaient d'obtenir un verrou de façon concurrente, et supprime le besoin de transférer les données partagées protégées par le verrou vers le fil d'exécution matériel qui l'acquiert car ces données peuvent souvent demeurer dans les caches du serveur. D'autres contributions sont présentées dans cette thèse, notamment un profiler permettant d'identifier les verrous qui sont des goulots d'étranglement dans les applications multi-fil et qui peuvent par conséquent être remplacés par RCL afin d'améliorer les performances, ainsi qu'un outil de réécriture de code développé avec l'aide de Julia Lawall. Cet outil transforme les acquisitions de verrou POSIX en acquisitions RCL. L'évaluation de RCL a porté sur dix-huit applications : les neuf applications des benchmarks SPLASH-2, les sept applications des benchmarks Phoenix 2, Memcached, ainsi que Berkeley DB avec un client TPC-C. Huit de ces applications sont incapables de passer à l'échelle à cause de leurs verrous et leur performance est améliorée par RCL sur une machine x86 avec quatre processeurs AMD Opteron et 48 fils d'exécution matériels. Utiliser RCL permet de multiplier les performances par 2.5 par rapport aux verrous POSIX sur Memcached, et par 11.6 sur Berkeley DB avec le client TPC-C. 
Sur une machine SPARC avec deux processeurs Sun UltraSPARC T2+ et 128 fils d'exécution matériels, les performances de trois applications sont améliorées par RCL : les performances sont multipliées par 1.3 par rapport aux verrous POSIX sur Memcached et par 7.9 sur Berkeley DB avec le client TPC-C. / The scalability of multithreaded applications on current multicore systems is hampered by the performance of lock algorithms, due to the costs of access contention and cache misses. The main contribution presented in this thesis is a new lock algorithm, Remote Core Locking (RCL), that aims to improve the performance of critical sections in legacy applications on multicore architectures. The idea of RCL is to replace lock acquisitions by optimized remote procedure calls to a dedicated hardware thread, which is referred to as the server. RCL limits the performance collapse observed with other lock algorithms when many threads try to acquire a lock concurrently and removes the need to transfer lock-protected shared data to the hardware thread acquiring the lock because such data can typically remain in the server's cache. Other contributions presented in this thesis include a profiler that identifies the locks that are the bottlenecks in multithreaded applications and that can thus benefit from RCL, and a reengineering tool developed with Julia Lawall that transforms POSIX locks into RCL locks. Eighteen applications were used to evaluate RCL: the nine applications of the SPLASH-2 benchmark suite, the seven applications of the Phoenix 2 benchmark suite, Memcached, and Berkeley DB with a TPC-C client. Eight of these applications are unable to scale because of locks and benefit from RCL on an x86 machine with four AMD Opteron processors and 48 hardware threads. Using RCL locks, performance is improved by up to 2.5 times with respect to POSIX locks on Memcached, and up to 11.6 times with respect to Berkeley DB with the TPC-C client. 
On a SPARC machine with two Sun UltraSPARC T2+ processors and 128 hardware threads, three applications benefit from RCL. In particular, performance is improved by up to 1.3 times with respect to POSIX locks on Memcached, and up to 7.9 times with respect to Berkeley DB with the TPC-C client.
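The control flow of Remote Core Locking can be mimicked in a few lines: instead of each thread acquiring a lock, critical sections are shipped to one dedicated server thread that executes them in order, so the protected data stays local to the server. The real RCL operates on dedicated hardware threads with optimized RPC in C; this Python sketch only mirrors the idea.

```python
# Conceptual sketch of the RCL idea: clients submit critical sections
# as callables to a single server thread, which runs them serially.
# Submitting and waiting replaces a lock acquire/release pair.
import threading, queue

class RclServer:
    def __init__(self):
        self._requests = queue.Queue()
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        while True:
            critical_section, done = self._requests.get()
            done.result = critical_section()   # executed serially
            done.set()

    def execute(self, critical_section):
        done = threading.Event()
        self._requests.put((critical_section, done))
        done.wait()                            # blocks like a lock acquire
        return done.result

server = RclServer()
counter = {"n": 0}

def increment():
    # Shared state is only ever touched by the server thread.
    counter["n"] += 1
    return counter["n"]

threads = [threading.Thread(
               target=lambda: [server.execute(increment) for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```

Because every critical section runs on the one server thread, the updates are serialized without any explicit lock, which is the property RCL exploits to keep lock-protected data hot in the server's cache.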
