Spelling suggestions: "subject:"instructionlevel"" "subject:"instructorlevel""
21 |
High Performance by Exploiting Information Locality through Reverse ComputingBahi, Mouad 21 December 2011 (has links) (PDF)
The main resources for computation are time, space and energy. Reducing them is the main challenge in the field of processor performance.In this thesis, we are interested in a fourth factor which is information. Information has an important and direct impact on these three resources. We show how it contributes to performance optimization. Landauer has suggested that independently on the hardware where computation is run information erasure generates dissipated energy. This is a fundamental result of thermodynamics in physics. Therefore, under this hypothesis, only reversible computations where no information is ever lost, are likely to be thermodynamically adiabatic and do not dissipate power. Reversibility means that data can always be retrieved from any point of the program. Information may be carried not only by the data but also by the process and input data that generate it. When a computation is reversible, information can also be retrieved from other already computed data and reverse computation. Hence reversible computing improves information locality.This thesis develops these ideas in two directions. In the first part, we address the issue of making a computation DAG (directed acyclic graph) reversible in terms of spatial complexity. We define energetic garbage as the additional number of registers needed for the reversible computation with respect to the original computation. We propose a reversible register allocator and we show empirically that the garbage size is never more than 50% of the DAG size. In the second part, we apply this approach to the trade-off between recomputing (direct or reverse) and storage in the context of supercomputers such as the recent vector and parallel coprocessors, graphical processing units (GPUs), IBM Cell processor, etc., where the gap between processor cycle time and memory access time is increasing. We show that recomputing in general and reverse computing in particular helps reduce register requirements and memory pressure. This approach of reverse rematerialization also contributes to the increase of instruction-level parallelism (Cell) and thread-level parallelism in multicore processors with shared register/memory file (GPU). On the latter architecture, the number of registers required by the kernel limits the number of running threads and affects performance. Reverse rematerialization generates additional instructions but their cost can be hidden by the parallelism gain. Experiments on the highly memory demanding Lattice QCD simulation code on Nvidia GPU show a performance gain up to 11%.
|
22 |
A study of CABAC hardware acceleration with configurability in multi-standard media processing / En studie i konfigurerbar hårdvaruaccelerering för CABAC i flerstandards mediabearbetningFlordal, Oskar January 2005 (has links)
To achieve greater compression ratios new video and image CODECs like H.264 and JPEG 2000 take advantage of Context adaptive binary arithmetic coding. As it contains computationally heavy algorithms, fast implementations have to be made when they are performed on large amount of data such as compressing high resolution formats like HDTV. This document describes how entropy coding works in general with a focus on arithmetic coding and CABAC. Furthermore the document dicusses the demands of the different CABACs and propose different options to hardware and instruction level optimisation. Testing and benchmarking of these implementations are done to ease evaluation. The main contribution of the thesis is parallelising and unifying the CABACs which is discussed and partly implemented. The result of the ILA is improved program flow through a specialised branching operations. The result of the DHA is a two bit parallel accelerator with hardware sharing between JPEG 2000 and H.264 encoder with limited decoding support.
|
23 |
Avalia??o da execu??o de aplica??es orientadas ? dados na arquitetura de redes em chip IPNoSysNobre, Christiane de Ara?jo 17 August 2012 (has links)
Made available in DSpace on 2014-12-17T15:48:05Z (GMT). No. of bitstreams: 1
ChristianeAN_DISSERT.pdf: 2651034 bytes, checksum: 1c708aec5eba3fd620f2944124931c55 (MD5)
Previous issue date: 2012-08-17 / Coordena??o de Aperfei?oamento de Pessoal de N?vel Superior / The increasing complexity of integrated circuits has boosted the development of communications architectures like Networks-on-Chip (NoCs), as an architecture; alternative for interconnection of Systems-on-Chip (SoC). Networks-on-Chip complain for component reuse, parallelism and scalability, enhancing reusability in projects of dedicated applications. In the literature, lots of proposals have been made, suggesting different configurations for networks-on-chip architectures. Among all networks-on-chip considered, the architecture of IPNoSys is a non conventional one, since it allows the execution of operations, while the communication process is performed. This study aims to evaluate the execution of data-flow based applications on IPNoSys, focusing on their adaptation against the design constraints. Data-flow based applications are characterized by the flowing of continuous stream of data, on which operations are executed. We expect that these type of applications can be improved when running on IPNoSys, because they have a programming model similar to the execution model of this network. By observing the behavior of these applications when running on IPNoSys, were performed changes in the execution model of the network IPNoSys, allowing the implementation of an instruction level parallelism. For these purposes, analysis of the implementations of dataflow applications were performed and compared / A crescente complexidade dos circuitos integrados impulsionou o surgimento de arquiteturas de comunica??o do tipo Redes em chip ou NoC (do ingl?s, Network-on-Chip), como alternativa de arquitetura de interconex?o para Sistemas-em-Chip (SoC; Systems-on-Chip). As redes em chip possuem capacidade de reuso de componentes, paralelismo e escalabilidade, permitindo a reutiliza??o em projetos diversos. Na literatura, t?m-se uma grande quantidade de propostas com diferentes configura??es de redes em chip. Dentre as redes em chip estudadas, a rede IPNoSys possui arquitetura diferenciada, pois permite a execu??o de opera??es, em conjunto com as atividades de comunica??o. Este trabalho visa avaliar a execu??o de aplica??es orientadas a dados na rede IPNoSys, focando na sua adequa??o frente ?s restri??es de projeto. As aplica??es orientadas a dados s?o caracterizadas pela comunica??o de um fluxo cont?nuo de dados sobre os quais, opera??es s?o executadas. Espera-se ent?o, que estas aplica??es possam ser beneficiadas quando de sua execu??o na rede IPNoSys, devido ao seu elevado grau de paralelismo e por possu?rem modelo de programa??o semelhante ao modelo de execu??o desta rede. Uma vez observadas a execu??o de aplica??es na rede IPNoSys, foram realizadas modifica??es no modelo de execu??o da rede IPNoSys, o que permitiu a explora??o do paralelismo em n?vel de instru??es. Para isso, an?lises das execu??es de aplica??es data flow foram realizadas e comparadas
|
24 |
Evaluation Of Register Allocation And Instruction Scheduling Methods In Multiple Issue ProcessorsValluri, Madhavi Gopal 01 1900 (has links) (PDF)
No description available.
|
25 |
High Performance by Exploiting Information Locality through Reverse Computing / Hautes Performances en Exploitant la Localité de l'Information via le Calcul Réversible.Bahi, Mouad 21 December 2011 (has links)
Les trois principales ressources du calcul sont le temps, l'espace et l'énergie, les minimiser constitue un des défis les plus importants de la recherche de la performance des processeurs.Dans cette thèse, nous nous intéressons à un quatrième facteur qui est l'information. L'information a un impact direct sur ces trois facteurs, et nous montrons comment elle contribue ainsi à l'optimisation des performances. Landauer a montré que c’est la destruction - logique - d’information qui coûte de l’énergie, ceci est un résultat fondamental de la thermodynamique en physique. Sous cette hypothèse, un calcul ne consommant pas d’énergie est donc un calcul qui ne détruit pas d’information. On peut toujours retrouver les valeurs d’origine et intermédiaires à tout moment du calcul, le calcul est réversible. L'information peut être portée non seulement par une donnée mais aussi par le processus et les données d’entrée qui la génèrent. Quand un calcul est réversible, on peut aussi retrouver une information au moyen de données déjà calculées et du calcul inverse. Donc, le calcul réversible améliore la localité de l'information. La thèse développe ces idées dans deux directions. Dans la première partie, partant d'un calcul, donné sous forme de DAG (graphe dirigé acyclique), nous définissons la notion de « garbage » comme étant la taille mémoire – le nombre de registres - supplémentaire nécessaire pour rendre ce calcul réversible. Nous proposons un allocateur réversible de registres, et nous montrons empiriquement que le garbage est au maximum la moitié du nombre de noeuds du graphe.La deuxième partie consiste à appliquer cette approche au compromis entre le recalcul (direct ou inverse) et le stockage dans le contexte des supercalculateurs que sont les récents coprocesseurs vectoriels et parallèles, cartes graphiques (GPU, Graphics Processing Unit), processeur Cell d’IBM, etc., où le fossé entre temps d’accès à la mémoire et temps de calcul ne fait que s'aggraver. Nous montons comment le recalcul en général, et le recalcul inverse en particulier, permettent de minimiser la demande en registres et par suite la pression sur la mémoire. Cette démarche conduit également à augmenter significativement le parallélisme d’instructions (Cell BE), et le parallélisme de threads sur un multicore avec mémoire et/ou banc de registres partagés (GPU), dans lequel le nombre de threads dépend de manière importante du nombre de registres utilisés par un thread. Ainsi, l’ajout d’instructions du fait du calcul inverse pour la rematérialisation de certaines variables est largement compensé par le gain en parallélisme. Nos expérimentations sur le code de Lattice QCD porté sur un GPU Nvidia montrent un gain de performances atteignant 11%. / The main resources for computation are time, space and energy. Reducing them is the main challenge in the field of processor performance.In this thesis, we are interested in a fourth factor which is information. Information has an important and direct impact on these three resources. We show how it contributes to performance optimization. Landauer has suggested that independently on the hardware where computation is run information erasure generates dissipated energy. This is a fundamental result of thermodynamics in physics. Therefore, under this hypothesis, only reversible computations where no information is ever lost, are likely to be thermodynamically adiabatic and do not dissipate power. Reversibility means that data can always be retrieved from any point of the program. Information may be carried not only by the data but also by the process and input data that generate it. When a computation is reversible, information can also be retrieved from other already computed data and reverse computation. Hence reversible computing improves information locality.This thesis develops these ideas in two directions. In the first part, we address the issue of making a computation DAG (directed acyclic graph) reversible in terms of spatial complexity. We define energetic garbage as the additional number of registers needed for the reversible computation with respect to the original computation. We propose a reversible register allocator and we show empirically that the garbage size is never more than 50% of the DAG size. In the second part, we apply this approach to the trade-off between recomputing (direct or reverse) and storage in the context of supercomputers such as the recent vector and parallel coprocessors, graphical processing units (GPUs), IBM Cell processor, etc., where the gap between processor cycle time and memory access time is increasing. We show that recomputing in general and reverse computing in particular helps reduce register requirements and memory pressure. This approach of reverse rematerialization also contributes to the increase of instruction-level parallelism (Cell) and thread-level parallelism in multicore processors with shared register/memory file (GPU). On the latter architecture, the number of registers required by the kernel limits the number of running threads and affects performance. Reverse rematerialization generates additional instructions but their cost can be hidden by the parallelism gain. Experiments on the highly memory demanding Lattice QCD simulation code on Nvidia GPU show a performance gain up to 11%.
|
Page generated in 0.0925 seconds