Global ETD Search

41	Proposta e simulação de uma arquitetura a fluxo de dados de segunda geração. / Proposal and simulation of data flow architecture of second generation. Patrícia Magna 04 March 1997 (has links) Neste trabalho é apresentada a arquitetura SEED, proposta a partir das experiências adquiridas com as arquiteturas baseadas no modelo a fluxo de dados que foram estudadas até o presente. A arquitetura SEED utiliza o modelo a fluxo de dados para escalonar e executar blocos de instruções, visando aproveitar a principal qualidade apresentada pelo modelo, que consiste em expor o máximo de paralelismo existente nos programas. No entanto, a arquitetura explora paralelismo de granularidade mais grossa que as arquiteturas a fluxo de dados, a fim de reduzir o trafego de fichas de dados na arquitetura. Esta redução tenta resolver ou amenizar problemas como a excessiva ocupação de memória e a grande complexidade exigida do hardware. Além da especificação da funcionalidade de toda a arquitetura SEED, este trabalho apresenta uma proposta para o particionamento do código. A utilização desta proposta permite a geração de blocos de códigos que podem ser executados corretamente pela arquitetura SEED. Alguns benchmarks foram gerados utilizando essa proposta de particionamento de código. Estes benchmarks foram executados no simulador da arquitetura SEED, visando analisar e avaliar o comportamento da arquitetura com diversas configurações de hardware. / In this work is presented the SEED architecture. This architecture was proposed considering the experiences obtained with existing architectures based on dataflow model. The SEED architecture uses dataflow model to schedule and execute sets of instructions, called code blocks. This approach tries to make use of the main quality of the dataflow model that is to expose the maximum parallelism of the programs. However, this architecture explores coarser granularity than the one usually considered in dataflow architectures in order to reduce the data token traffic in the architecture. This type of reduction tries to solve problems like excessive occupation of memory and high complexity of the hardware. Besides the specification of all units that compose the SEED architecture, this work also proposes a way of partitioning programs, creating code blocks that may be executed by SEED architecture. Some benchmarks were generated using this proposal for partitioning programs. These benchmarks were executed in the SEED architecture simulator, in order to analyze the behavior of the proposed architecture under special configurations. Arquitetura dataflow Arquitetura não convencional Arquiteturas paralelas Processamento paralelo Dataflow architectures Non conventional architectures Parallel architectures Parallel processing
42	Access Path Based Dataflow Analysis For Sequential And Concurrent Programs Arnab De, * 12 1900 (has links) (PDF) In this thesis, we have developed a flow-sensitive data flow analysis framework for value set analyses for Java-like languages. Our analysis frame work is based on access paths—a variable followed by zero or more field accesses. We express our abstract states as maps from bounded access paths to abstract value sets. Using access paths instead of allocation sites enables us to perform strong updates on assignments to dynamically allocated memory locations. We also describe several optimizations to reduce the number of access paths that need to be tracked in our analysis. We have instantiated this frame work for flow-sensitive pointer and null-pointer analysis for Java. We have implemented our analysis inside the Chord frame work. A major part of our implementation is written declaratively using Datalog. We leverage the use of BDDs in Chord for keeping our memory usage low. We show that our analysis is much more precise and faster than traditional flow-sensitive and flow-insensitive pointer and null-pointer analysis for Java. We further extend our access path based analysis frame work to concurrent Java programs. We use the synchronization structure of the programs to transfer abstract states from one thread to another. Therefore, we do not need to make conservative assumptions about reads or writes to shared memory. We prove our analysis to be sound for the happens-before memory model, which is weaker than most common memory models, including sequential consistency and the Java Memory Model. We implement a null-pointer analysis for concurrent Java programs and show it to be more precise than the traditional analysis. Dataflow Analysis Java (Computer Program Language) Concurrent Programs - Dataflow Analysis Flow-Sensitive Pointer Analysis Null-Pointer Analysis Flow-Sensitive Dataflow Analysis Concurrent Java Programs Sequential Programs - Dataflow Analysis Acess Path Based Dataflow Analysis Relaxed Memory Model Java Memory Model Computer Science
43	A Coarse Grained Reconfigurable Architecture Framework Supporting Macro-Dataflow Execution Varadarajan, Keshavan 12 1900 (has links) (PDF) A Coarse-Grained Reconfigurable Architecture (CGRA) is a processing platform which constitutes an interconnection of coarse-grained computation units (viz. Function Units (FUs), Arithmetic Logic Units (ALUs)). These units communicate directly, viz. send-receive like primitives, as opposed to the shared memory based communication used in multi-core processors. CGRAs are a well-researched topic and the design space of a CGRA is quite large. The design space can be represented as a 7-tuple (C, N, T, P, O, M, H) where each of the terms have the following meaning: C -choice of computation unit, N -choice of interconnection network, T -Choice of number of context frame (single or multiple), P -presence of partial reconfiguration, O choice of orchestration mechanism, M -design of memory hierarchy and H host-CGRA coupling. In this thesis, we develop an architectural framework for a Macro-Dataflow based CGRA where we make the following choice for each of these parameters: C -ALU, N -Network-on-Chip (NoC), T -Multiple contexts, P -support for partial reconfiguration, O -Macro Dataflow based orchestration, M -data memory banks placed at the periphery of the reconfigurable fabric (reconfigurable fabric is the name given to the interconnection of computation units), H -loose coupling between host processor and CGRA, enabling our CGRA to execute an application independent of the host-processor’s intervention. The motivations for developing such a CGRA are: To execute applications efficiently through reduction in reconfiguration time (i.e. the time needed to transfer instructions and data to the reconfigurable fabric) and reduction in execution time through better exploitation of all forms of parallelism: Instruction Level Parallelism (ILP), Data Level Parallelism (DLP) and Thread/Task Level Parallelism (TLP). We choose a macro-dataflow based orchestration framework in combination with partial reconfiguration so as to ease exploitation of TLP and DLP. Macro-dataflow serves as a light weight synchronization mechanism. We experiment with two variants of the macro-dataflow orchestration units, namely: hardware controlled orchestration unit and the compiler controlled orchestration unit. We employ a NoC as it helps reduce the reconfiguration overhead. To permit customization of the CGRA for a particular domain through the use of domain-specific custom-Intellectual Property (IP) blocks. This aids in improving both application performance and makes it energy efficient. To develop a CGRA which is completely programmable and accepts any program written using the C89 standard. The compiler and the architecture were co-developed to ensure that every feature of the architecture could be automatically programmed through an application by a compiler. In this CGRA framework, the orchestration mechanism (O) and the host-CGRA coupling (H) are kept fixed and we permit design space exploration of the other terms in the 7-tuple design space. The mode of compilation and execution remains invariant of these changes, hence referred to as a framework. We now elucidate the compilation and execution flow for this CGRA framework. An application written in C language is compiled and is transformed into a set of temporal partitions, referred to as HyperOps in this thesis. The macro-dataflow orchestration unit selects a HyperOp for execution when all its inputs are available. The instructions and operands for a ready HyperOp are transferred to the reconfigurable fabric for execution. Each ALU (in the computation unit) is capable of waiting for the availability of the input data, prior to issuing instructions. We permit the launch and execution of a temporal partition to progress in parallel, which reduces the reconfiguration overhead. We further cut launch delays by keeping loops persistent on fabric and thus eliminating the need to launch the instructions. The CGRA framework has been implemented using Bluespec System Verilog. We evaluate the performance of two of these CGRA instances: one for cryptographic applications and another instance for linear algebra kernels. We also run other general purpose integer and floating point applications to demonstrate the generic nature of these optimizations. We explore various microarchitectural optimizations viz. pipeline optimizations (i.e. changing value of T ), different forms of macro dataflow orchestration such as hardware controlled orchestration unit and compiler-controlled orchestration unit, different execution modes including resident loops, pipeline parallelism, changes to the router etc. As a result of these optimizations we observe 2.5x improvement in performance as compared to the base version. The reconfiguration overhead was hidden through overlapping launching of instructions with execution making. The perceived reconfiguration overhead is reduced drastically to about 9-11 cycles for each HyperOp, invariant of the size of the HyperOp. This can be mainly attributed to the data dependent instruction execution and use of the NoC. The overhead of the macro-dataflow execution unit was reduced to a minimum with the compiler controlled orchestration unit. To benchmark the performance of these CGRA instances, we compare the performance of these with an Intel Core 2 Quad running at 2.66GHz. On the cryptographic CGRA instance, running at 700MHz, we observe one to two orders of improvement in performance for cryptographic applications and up to one order of magnitude performance degradation for linear algebra CGRA instance. This relatively poor performance of linear algebra kernels can be attributed to the inability in exploiting ILP across computation units interconnected by the NoC, long latency in accessing data memory placed at the periphery of the reconfigurable fabric and unavailability of pipelined floating point units (which is critical to the performance of linear algebra kernels). The superior performance of the cryptographic kernels can be attributed to higher computation to load instruction ratio, careful choice of custom IP block, ability to construct large HyperOps which allows greater portion of the communication to be performed directly (as against communication through a register file in a general purpose processor) and the use of resident loops execution mode. The power consumption of a computation unit employed on the cryptography CGRA instance, along with its router is about 76mW, as estimated by Synopsys Design Vision using the Faraday 90nm technology library for an activity factor of 0.5. The power of other instances would be dependent on specific instantiation of the domain specific units. This implies that for a reconfigurable fabric of size 5 x 6 the total power consumption is about 2.3W. The area and power ( 84mW) dissipated by the macro dataflow orchestration unit, which is common to both instances, is comparable to a single computation unit, making it an effective and low overhead technique to exploit TLP. Coarse Grained Computation Reconfigurable Architectures Macro Dataflow Execution Macro-Dataflow Orchestration Microarchitectural Optimizations Reconfigurable Fabric Macro Dataflow Execution Computer Science
44	Towards real-time image understanding with convolutional networks / Analyse sémantique des images en temps-réel avec des réseaux convolutifs Farabet, Clément 18 December 2013 (has links) One of the open questions of artificial computer vision is how to produce good internal representations of the visual world. What sort of internal representation would allow an artificial vision system to detect and classify objects into categories, independently of pose, scale, illumination, conformation, and clutter ? More interestingly, how could an artificial vision system {em learn} appropriate internal representations automatically, the way animals and humans seem to learn by simply looking at the world ? Another related question is that of computational tractability, and more precisely that of computational efficiency. Given a good visual representation, how efficiently can it be trained, and used to encode new sensorial data. Efficiency has several dimensions: power requirements, processing speed, and memory usage. In this thesis I present three new contributions to the field of computer vision:(1) a multiscale deep convolutional network architecture to easily capture long-distance relationships between input variables in image data, (2) a tree-based algorithm to efficiently explore multiple segmentation candidates, to produce maximally confident semantic segmentations of images,(3) a custom dataflow computer architecture optimized for the computation of convolutional networks, and similarly dense image processing models. All three contributions were produced with the common goal of getting us closer to real-time image understanding. Scene parsing consists in labeling each pixel in an image with the category of the object it belongs to. In the first part of this thesis, I propose a method that uses a multiscale convolutional network trained from raw pixels to extract dense feature vectors that encode regions of multiple sizes centered on each pixel. The method alleviates the need for engineered features. In parallel to feature extraction, a tree of segments is computed from a graph of pixel dissimilarities. The feature vectors associated with the segments covered by each node in the tree are aggregated and fed to a classifier which produces an estimate of the distribution of object categories contained in the segment. A subset of tree nodes that cover the image are then selected so as to maximize the average "purity" of the class distributions, hence maximizing the overall likelihood that each segment contains a single object (...) / One of the open questions of artificial computer vision is how to produce good internal representations of the visual world. What sort of internal representation would allow an artificial vision system to detect and classify objects into categories, independently of pose, scale, illumination, conformation, and clutter ? More interestingly, how could an artificial vision system {em learn} appropriate internal representations automatically, the way animals and humans seem to learn by simply looking at the world ? Another related question is that of computational tractability, and more precisely that of computational efficiency. Given a good visual representation, how efficiently can it be trained, and used to encode new sensorial data. Efficiency has several dimensions: power requirements, processing speed, and memory usage. In this thesis I present three new contributions to the field of computer vision:(1) a multiscale deep convolutional network architecture to easily capture long-distance relationships between input variables in image data, (2) a tree-based algorithm to efficiently explore multiple segmentation candidates, to produce maximally confident semantic segmentations of images,(3) a custom dataflow computer architecture optimized for the computation of convolutional networks, and similarly dense image processing models. All three contributions were produced with the common goal of getting us closer to real-time image understanding. Scene parsing consists in labeling each pixel in an image with the category of the object it belongs to. In the first part of this thesis, I propose a method that uses a multiscale convolutional network trained from raw pixels to extract dense feature vectors that encode regions of multiple sizes centered on each pixel. The method alleviates the need for engineered features. In parallel to feature extraction, a tree of segments is computed from a graph of pixel dissimilarities. The feature vectors associated with the segments covered by each node in the tree are aggregated and fed to a classifier which produces an estimate of the distribution of object categories contained in the segment. A subset of tree nodes that cover the image are then selected so as to maximize the average "purity" of the class distributions, hence maximizing the overall likelihood that each segment contains a single object. The system yields record accuracies on several public benchmarks. The computation of convolutional networks, and related models heavily relies on a set of basic operators that are particularly fit for dedicated hardware implementations. In the second part of this thesis I introduce a scalable dataflow hardware architecture optimized for the computation of general-purpose vision algorithms, neuFlow, and a dataflow compiler, luaFlow, that transforms high-level flow-graph representations of these algorithms into machine code for neuFlow. This system was designed with the goal of providing real-time detection, categorization and localization of objects in complex scenes, while consuming 10 Watts when implemented on a Xilinx Virtex 6 FPGA platform, or about ten times less than a laptop computer, and producing speedups of up to 100 times in real-world applications (results from 2011) Réseaux Convolutifs Deep Learning Vision Informatique Temps-Réel Fpga/asic Dataflow Convolutional Networks Apprentissage Profond Computer Vision Real-Time Fpga/asic Dataflow
45	From dataflow models to energy efficient application specific processors Hautala, I. (Ilkka) 11 October 2019 (has links) Abstract The development of wireless networks has provided the necessary conditions for several new applications. The emergence of the virtual and augmented reality and the Internet of things and during the era of social media and streaming services, various demands related to functionality and performance have been set for mobile and wearable devices. Meeting these demands is complicated due to minimal energy budgets, which are characteristic of embedded devices. Lately, the energy efficiency of devices has been addressed by increasing parallelism and the use of application-specific hardware resources. This has been hindered by hardware development as well as software development because the conventional development methods are based on the use of low-level abstractions and sequential programming paradigms. On the other hand, deployment of high-level design methods is slowed down because of final solutions that are too much compromised when energy efficiency and performance are considered. This doctoral thesis introduces a model-driven framework for the development of signal processing systems that facilitates hardware and software co-design. The design flow exploits an easily customizable, re-programmable and energy-efficient processor template. The proposed design flow enables tailoring of multiple heterogeneous processing elements and the connections between them to the demands of an application. Application software is described by using high-level dataflow models, which enable the automatic synthesis of parallel applications for different multicore hardware platforms and speed up design space exploration. Suitability of the proposed design flow is demonstrated by using three different applications from different signal processing domains. The experiments showed that raising the level of abstraction has only a minor impact on performance. Video processing algorithms are selected to be the main application area in this thesis. The thesis proposes tailored and reprogrammable energy-efficient processing elements for video coding algorithms. The solutions are based on the use of multiple processing elements by exploiting the pipeline parallelism of the application, which is characteristic of many signal processing algorithms. Performance, power and area metrics for the designed solutions have been obtained using post-layout simulation models. In terms of energy efficiency, the proposed programmable processors form a new compromise solution between fixed hardware accelerators and conventional embedded processors for video coding. / Tiivistelmä Langattomien verkkojen kehittyminen on luonut edellytykset useille uusille sovelluksille. Muiden muassa sosiaalisen media, suoratoistopalvelut, virtuaalitodellisuus ja esineiden internet asettavat kannettaville ja puettaville laitteille moninaisia toimintoihin, suorituskykyyn, energiankulutukseen ja fyysiseen muotoon liittyviä vaatimuksia. Yksi isoimmista haasteista on sulautettujen laitteiden energiankulutus. Laitteiden energiatehokkuutta on pyritty parantamaan rinnakkaislaskentaa ja räätälöityjä laskentaresursseja hyödyntämällä. Tämä puolestaan on vaikeuttanut niin laite- kuin sovelluskehitystä, koska laajassa käytössä olevat kehitystyökalut perustuvat matalan tason abstraktioihin ja hyödyntävät alun perin yksi ydinprosessoreille suunniteltuja ohjelmointikieliä. Korkean tason ja automatisoitujen kehitysmenetelmien käyttöönottoa on hidastanut aikaansaatujen järjestelmien puutteellinen suorituskyky ja laiteresurssien tehoton hyödyntäminen. Väitöskirja esittelee datavuopohjaiseen suunnitteluun perustuvan työkaluketjun, joka on tarkoitettu energiatehokkaiden signaalikäsittelyjärjestelmien toteuttamiseen. Työssä esiteltävä suunnitteluvuo pohjautuu laitteistoratkaisuissa räätälöitävään ja ohjelmoitavaan siirtoliipaistavaan prosessoritemplaattiin. Ehdotettu suunnitteluvuo mahdollistaa useiden heterogeenisten prosessoriytimien ja niiden välisten kytkentöjen räätälöimisen sovelluksien tarpeiden vaatimalla tavalla. Suunnitteluvuossa ohjelmistot kuvataan korkean tason datavuomallien avulla. Tämä mahdollistaa erityisesti rinnakkaista laskentaa sisältävän ohjelmiston automaattisen sovittamisen erilaisiin moniprosessorijärjestelmiin ja nopeuttaa erilaisten järjestelmätason ratkaisujen kartoittamista. Suunnitteluvuon käyttökelpoisuus osoitetaan käyttäen esimerkkinä kolmea eri signaalinkäsittelysovellusta. Tulokset osoittavat, että suunnittelumenetelmien abstraktiotasoa on mahdollista nostaa ilman merkittävää suorituskyvyn heikkenemistä. Väitöskirjan keskeinen sovellusalue on videonkoodaus. Työ esittelee videonkoodaukseen suunniteltuja energiatehokkaita ja uudelleenohjelmoitavia prosessoriytimiä. Ratkaisut perustuvat usean prosessoriytimen käyttämiseen hyödyntäen erityisesti videonkäsittelyalgoritmeille ominaista liukuhihnarinnakkaisuutta. Prosessorien virrankulutus, suorituskyky ja pinta-ala on analysoitu käyttämällä simulointimalleja, jotka huomioivat logiikkasolujen sijoittelun ja johdotuksen. Ehdotetut sovelluskohtaiset prosessoriratkaisut tarjoavat uuden energiatehokkaan kompromissiratkaisun tavanomaisten ohjelmoitavien prosessoreiden ja kiinteästi johdotettujen video-kiihdyttimien välille. application-specific processing dataflow modelling dataflow-based design framework energy-efficient computing video coding datavuomallinnus datavuopohjainen suunnittelu energiatehokas laskenta sovelluskohtainen laskenta videonkoodaus
46	Design and Implementation of an Audio Codec (AMR-WB) using Dataflow Programming Language CAL in the OpenDF Environment Ali, Hazem, Patoary, Mohammad Nazrul Ishlam January 2010 (has links) <p>Over the last three decades, computer architects have been able to achieve an increase in performance for single processors by, e.g., increasing clock speed, introducing cache memories and using instruction level parallelism. However, because of power consumption and heat dissipation constraints, this trend is going to cease. In recent times, hardware engineers have instead moved to new chip architectures with multiple processor cores on a single chip. With multi-core processors, applications can complete more total work than with one core alone. To take advantage of multi-core processors, we have to develop parallel applications that assign tasks to different cores. On each core, pipeline, data and task parallelization can be used to achieve higher performance. Dataflow programming languages are attractive for achieving parallelism because of their high-level, machine-independent, implicitly parallel notation and because of their fine-grain parallelism. These features are essential for obtaining effective, scalable utilization of multi-core processors.</p><p>In this thesis work we have parallelized an existing audio codec - Adaptive Multi-Rate Wide Band (AMR-WB) - written in the C language for single core processor. The target platform is a multi-core AMR11 MP developer board. The final result of the efforts is a working AMR-WB encoder implemented in CAL and running in the OpenDF simulator. The C specification of the AMR-WB encoder was analysed with respect to dataflow and parallelism. The final implementation was developed in the CAL Actor Language, with the goal of exposing available parallelism - different dataflows - as well as removing unwanted data dependencies. Our thesis work discusses mapping techniques and guidelines that we followed and which can be used in any future work regarding mapping C based applications to CAL. We also propose solutions for some specific dependencies that were revealed in the AMR-WB encoder analysis and suggest further investigation of possible modifications to the encoder to enable more efficient implementation on a multi-core target system.</p> ACTORS project CAL AMR-WB Audio Codecs Dataflow Multi-core CAL Actor Language Dataflow Programming Language OpenDF Computer engineering Datorteknik Computer science Datavetenskap
47	Optimisation de la localité des données sur architectures manycœurs / Data locality on manycore architectures Amstel, Duco van 18 July 2016 (has links) L'évolution continue des architectures des processeurs a été un moteur important de la recherche en compilation. Une tendance dans cette évolution qui existe depuis l'avènement des ordinateurs modernes est le rapport grandissant entre la puissance de calcul disponible (IPS, FLOPS, ...) et la bande-passante correspondante qui est disponible entre les différents niveaux de la hiérarchie mémoire (registres, cache, mémoire vive). En conséquence la réduction du nombre de communications mémoire requis par un code donnée a constitué un sujet de recherche important. Un principe de base en la matière est l'amélioration de la localité temporelle des données: regrouper dans le temps l'ensemble des accès à une donnée précise pour qu'elle ne soit requise que pendant peu de temps et pour qu'elle puisse ensuite être transféré vers de la mémoire lointaine (mémoire vive) sans communications supplémentaires.Une toute autre évolution architecturale a été l'arrivée de l'ère des multicoeurs et au cours des dernières années les premières générations de processeurs manycoeurs. Ces architectures ont considérablement accru la quantité de parallélisme à la disposition des programmes et algorithmes mais ceci est à nouveau limité par la bande-passante disponible pour les communications entres coeurs. Ceci a amené dans le monde de la compilation et des techniques d'optimisation des problèmes qui étaient jusqu'à là uniquement connus en calcul distribué.Dans ce texte nous présentons les premiers travaux sur une nouvelle technique d'optimisation, le pavage généralisé qui a l'avantage d'utiliser un modèle abstrait pour la réutilisation des données et d'être en même temps utilisable dans un grand nombre de contextes. Cette technique trouve son origine dans le pavage de boucles, une techniques déjà bien connue et qui a été utilisée avec succès pour l'amélioration de la localité des données dans les boucles imbriquées que ce soit pour les registres ou pour le cache. Cette nouvelle variante du pavage suit une vision beaucoup plus large et ne se limite pas au cas des boucles imbriquées. Elle se base sur une nouvelle représentation, le graphe d'utilisation mémoire, qui est étroitement lié à un nouveau modèle de besoins en termes de mémoire et de communications et qui s'applique à toute forme de code exécuté itérativement. Le pavage généralisé exprime la localité des données comme un problème d'optimisation pour lequel plusieurs solutions sont proposées. L'abstraction faite par le graphe d'utilisation mémoire permet la résolution du problème d'optimisation dans différents contextes. Pour l'évaluation expérimentale nous montrons comment utiliser cette nouvelle technique dans le cadre des boucles, imbriquées ou non, ainsi que dans le cas des programmes exprimés dans un langage à flot-de-données. En anticipant le fait d'utiliser le pavage généralisé pour la distribution des calculs entre les cœurs d'une architecture manycoeurs nous donnons aussi des éléments de réponse pour modéliser les communications et leurs caractéristiques sur ce genre d'architectures. En guise de point final, et pour montrer l'étendue de l'expressivité du graphe d'utilisation mémoire et le modèle de besoins en mémoire et communications sous-jacent, nous aborderons le sujet du débogage de performances et l'analyse des traces d'exécution. Notre but est de fournir un retour sur le potentiel d'amélioration en termes de localité des données du code évalué. Ce genre de traces peut contenir des informations au sujet des communications mémoire durant l'exécution et a de grandes similitudes avec le problème d'optimisation précédemment étudié. Ceci nous amène à une brève introduction dans le monde de l'algorithmique des graphes dirigés et la mise-au-point de quelques nouvelles heuristiques pour le problème connu de joignabilité mais aussi pour celui bien moins étudié du partitionnement convexe. / The continuous evolution of computer architectures has been an important driver of research in code optimization and compiler technologies. A trend in this evolution that can be traced back over decades is the growing ratio between the available computational power (IPS, FLOPS, ...) and the corresponding bandwidth between the various levels of the memory hierarchy (registers, cache, DRAM). As a result the reduction of the amount of memory communications that a given code requires has been an important topic in compiler research. A basic principle for such optimizations is the improvement of temporal data locality: grouping all references to a single data-point as close together as possible so that it is only required for a short duration and can be quickly moved to distant memory (DRAM) without any further memory communications.Yet another architectural evolution has been the advent of the multicore era and in the most recent years the first generation of manycore designs. These architectures have considerably raised the bar of the amount of parallelism that is available to programs and algorithms but this is again limited by the available bandwidth for communications between the cores. This brings some issues thatpreviously were the sole preoccupation of distributed computing to the world of compiling and code optimization techniques.In this document we present a first dive into a new optimization technique which has the promise of offering both a high-level model for data reuses and a large field of potential applications, a technique which we refer to as generalized tiling. It finds its source in the already well-known loop tiling technique which has been applied with success to improve data locality for both register and cache-memory in the case of nested loops. This new "flavor" of tiling has a much broader perspective and is not limited to the case of nested loops. It is build on a new representation, the memory-use graph, which is tightly linked to a new model for both memory usage and communication requirements and which can be used for all forms of iterate code.Generalized tiling expresses data locality as an optimization problem for which multiple solutions are proposed. With the abstraction introduced by the memory-use graph it is possible to solve this optimization problem in different environments. For experimental evaluations we show how this new technique can be applied in the contexts of loops, nested or not, as well as for computer programs expressed within a dataflow language. With the anticipation of using generalized tiling also to distributed computations over the cores of a manycore architecture we also provide some insight into the methods that can be used to model communications and their characteristics on such architectures.As a final point, and in order to show the full expressiveness of the memory-use graph and even more the underlying memory usage and communication model, we turn towards the topic of performance debugging and the analysis of execution traces. Our goal is to provide feedback on the evaluated code and its potential for further improvement of data locality. Such traces may contain information about memory communications during an execution and show strong similarities with the previously studied optimization problem. This brings us to a short introduction to the algorithmics of directed graphs and the formulation of some new heuristics for the well-studied topic of reachability and the much less known problem of convex partitioning. Optimisation de code Localité des données Ordonnancement Pavage de boucles Langages dataflow Architectures manycœurs Code optimisation Data locality Scheduling Loop tiling Dataflow languages Manycore architectures 004
48	Design and Implementation of an Audio Codec (AMR-WB) using Dataflow Programming Language CAL in the OpenDF Environment Ali, Hazem, Patoary, Mohammad Nazrul Ishlam January 2010 (has links) Over the last three decades, computer architects have been able to achieve an increase in performance for single processors by, e.g., increasing clock speed, introducing cache memories and using instruction level parallelism. However, because of power consumption and heat dissipation constraints, this trend is going to cease. In recent times, hardware engineers have instead moved to new chip architectures with multiple processor cores on a single chip. With multi-core processors, applications can complete more total work than with one core alone. To take advantage of multi-core processors, we have to develop parallel applications that assign tasks to different cores. On each core, pipeline, data and task parallelization can be used to achieve higher performance. Dataflow programming languages are attractive for achieving parallelism because of their high-level, machine-independent, implicitly parallel notation and because of their fine-grain parallelism. These features are essential for obtaining effective, scalable utilization of multi-core processors. In this thesis work we have parallelized an existing audio codec - Adaptive Multi-Rate Wide Band (AMR-WB) - written in the C language for single core processor. The target platform is a multi-core AMR11 MP developer board. The final result of the efforts is a working AMR-WB encoder implemented in CAL and running in the OpenDF simulator. The C specification of the AMR-WB encoder was analysed with respect to dataflow and parallelism. The final implementation was developed in the CAL Actor Language, with the goal of exposing available parallelism - different dataflows - as well as removing unwanted data dependencies. Our thesis work discusses mapping techniques and guidelines that we followed and which can be used in any future work regarding mapping C based applications to CAL. We also propose solutions for some specific dependencies that were revealed in the AMR-WB encoder analysis and suggest further investigation of possible modifications to the encoder to enable more efficient implementation on a multi-core target system. ACTORS project CAL AMR-WB Audio Codecs Dataflow Multi-core CAL Actor Language Dataflow Programming Language OpenDF Computer Engineering Datorteknik Computer Sciences Datavetenskap (datalogi)
49	Concevoir et partager des workflows d’analyse de données : application aux traitements intensifs en bioinformatique / Design and share data analysis workflows : application to bioinformatics intensive treatments Moreews, François 11 December 2015 (has links) Dans le cadre d'une démarche d'Open science, nous nous intéressons aux systèmes de gestion de workflows (WfMS) scientifiques et à leurs applications pour l'analyse de données intensive en bioinformatique. Nous partons de l'hypothèse que les WfMS peuvent évoluer pour devenir des plates-formes pivots capables d'accélérer la mise au point et la diffusion de méthodes d'analyses innovantes. Elles pourraient capter et fédérer autour d'une thématique disciplinaire non seulement le public actuel des consommateurs de services mais aussi celui des producteurs de services. Pour cela, nous considérons que ces environnements doivent à la fois être adaptés aux pratiques des scientifiques concepteurs de méthodes et fournir un gain de productivité durant la conception et le traitement. Ces contraintes nous amènent à étudier la capture rapide des workflows, la simplification de l'intégration des tâches techniques, comme le parallélisme nécessaire au haut-débit, et la personnalisation du déploiement. Tout d'abord, nous avons défini un langage graphique DataFlow expressif, adapté à la capture rapide des workflows. Celui-ci est interprétable par un moteur de workflows basé sur un nouveau modèle de calcul doté de performances élevées, obtenues par l'exploitation des multiples niveaux de parallélisme. Nous présentons ensuite une approche de conception orientée modèle qui facilite la génération du parallélisme de données et la production d'implémentations adaptées à différents contextes d'exécution. Nous décrivons notamment l'intégration d'un métamodèle des composants et des plates-formes, employé pour automatiser la configuration des dépendances des workflows. Enfin, dans le cas du modèle Container as a Service (CaaS), nous avons élaboré une spécification de workflows intrinsèquement diffusable et ré-exécutable. L'adoption de ce type de modèle pourrait déboucher sur une accélération des échanges et de la mise à disposition des chaînes de traitements d'analyse de données. / As part of an Open Science initiative, we are particularly interested in the scientific Workflow Management Systems (WfMS) and their applications for intensive data analysis in bioinformatics. We start from the assumption that WfMS can evolve to become efficient hubs able to speed up the development and the dissemination of innovative analysis methods. These software platforms could rally and unite not only the current stakeholders, who are service consumers, but also the service producers, around a disciplinary theme. We therefore consider that these environments must be both adapted to the practices of the scientists who are method designers and also enhanced with increased productivity during design and treatment. These constraints lead us to study the rapid capture of workflows, the simplification of technical tasks integration, like parallelisation and the deployment customization. First, we define an expressive graphic worfklow language, adapted to the quick capture of workflows. This is interpreted by a workflow engine based on a new model of computation with high performances obtained by the use of multiple levels of parallelism. Then, we present a Model-Driven design approach that facilitates the data parallelism generation and the production of suitable implementations for different execution contexts. We describe in particular the integration of a components and platforms meta-model used to automate the configuration of workflows’ dependencies. Finally, in the case of the cloud model Container as a Service (CaaS), we develop a workflow specification intrinsically re-executable and readily disseminatable. The adoption of this kind of model could lead to an acceleration of exchanges and a better availability of data analysis workflows. Calcul intensif Modèle de calcul Dataflow Mde Méta-Modèle Bioinformatique Workflow Paralléllisme de données Intensive computation Model of computation Dataflow Mde Meta-Model Bioinformatics Workflow Data parallelism
50	Towards a Gold Standard for Points-to Analysis Gutzmann, Tobias January 2010 (has links) Points-to analysis is a static program analysis that computes reference informationfor a given input program. It serves as input to many client applicationsin optimizing compilers and software engineering tools. Unfortunately, the Gold Standard – i.e., the exact reference information for a given program– is impossible to compute automatically for all but trivial cases, and thus, little can been said about the accuracy of points-to analysis. This thesis aims at paving the way towards a Gold Standard for points-to analysis. For this, we discuss theoretical implications and practical challenges that occur when comparing results obtained by different points-to analyses. We also show ways to improve points-to analysis by different means, e.g., combining different analysis implementations, and a novel approach to path sensitivity. We support our theories with a number of experiments. Points-to Analysis Dataflow Analysis Static Analysis Dynamic Analysis Gold Standard

Search results