Global ETD Search

1	On the distribution of control in asynchronous processor architectures Rebello, Vinod January 1997 (has links) The effective performance of computer systems is to a large measure determined by the synergy between the processor architecture, the instruction set and the compiler. In the past, the sequencing of information within processor architectures has normally been synchronous: controlled centrally by a clock. However, this global signal may limit the future performance gains achievable through improvements in implementation technology. This thesis investigates the effects of relaxing this strict synchrony by distributing control within processor architectures through the use of a novel asynchronous design model known as a micronet. The impact of asynchronous control on the performance of a RISC-style processor is explored at different levels. Firstly, improvements in the performance of individual instructions by exploiting actual run-time behaviours are demonstrated. Secondly, it is shown that micronets are able to exploit further (both spatial and temporal) instruction-level parallelism (ILP) efficiently through the distribution of control to datapath resources. Finally, exposing fine-grain concurrency within a datapath can only benefit a computer system if the compiler can easily exploit it. Although compilers for micronet-based asynchronous processors may be considered more complex than their synchronous counterparts, it is shown that the variable execution time of an instruction does not adversely affect the compiler's ability to schedule code efficiently. In conclusion, modelling a processor's datapath as a micronet permits the exploitation of both fine-grain ILP and actual run-time delays, leading to the efficient utilisation of functional units and, in turn, to an improvement in overall system performance. 004 Asynchronous Processor Architectures
2	Cloud Computing for Digital Libraries Poulo, Lebeko Bearnard 01 May 2013 (has links) Information management systems (digital libraries/repositories, learning management systems, content management systems) provide key technologies for the storage, preservation and dissemination of knowledge in its various forms, such as research documents, theses and dissertations, cultural heritage documents and audio files. These systems can make use of cloud computing to achieve high levels of scalability, while making services accessible to all on demand and at reasonable infrastructure costs. This research aims to develop techniques for building scalable digital information management systems based on efficient and on-demand use of generic grid-based technologies such as cloud computing. In particular, this study explores the use of existing cloud computing resources offered by popular cloud computing vendors such as Amazon Web Services. This involves using Amazon Simple Storage Service (Amazon S3) to store large and increasing volumes of data, Amazon Elastic Compute Cloud (Amazon EC2) to provide the required computational power and Amazon SimpleDB for querying and indexing data on Amazon S3. A proof-of-concept application comprising typical digital library services was developed, deployed in the cloud environment and evaluated for scalability as the demand for data and services increases. The results of the evaluation show that it is possible to adopt cloud computing for digital libraries to address massive data handling and large numbers of concurrent requests. Existing digital library systems could be migrated and deployed into the cloud. H.3 INFORMATION STORAGE AND RETRIEVAL C.1 PROCESSOR ARCHITECTURES
3	Clock Distribution in a 3d Microprocessor Arunachalam, Venkatesh 01 January 2009 (has links) (PDF) As technology scales, the device delay decreases while the interconnect delay increases. As more devices are packed into a single chip, the cost of interconnecting these devices increases. Many three-dimensional (3D) schemes have been proposed to reduce interconnect length and so improve performance with lower power consumption. The impact of wire length reduction on global clock distribution networks is limited. The delay and skew of a clock grid are mostly dominated by the area of the chip it has to cover. Another challenge in distributing the clock to multiple layers in a vertical stack is achieving synchronization between the various layers. In this work the use of a clock layer exclusively for generating and distributing clocks is proposed. Vertical vias connect the clock grid in each layer to the clock layer, and hence provide synchronization between the various layers. In all synchronous systems the clock is the single most critical signal: it is routed throughout the chip and provides the synchronization between the various operations of the chip. Clock distribution networks are extremely critical from the performance and power standpoint. They account for about 30% of the total power dissipated in current generation microprocessors. As technology scales, chip sizes are also increasing due to increased functionality. This means larger clock distribution networks and hence more power lost in the clock network. Another critical parameter is skew in the clock network, which affects the performance of the synchronous system. As frequency scales with technology, the goal is to keep the skew at a fixed percentage of the clock period. This implies an aggressive clock network design which minimizes power dissipation but still provides the same performance.
A clock distribution methodology for a 3D multilayer single-core microprocessor using a single clock layer is proposed. The clock distribution network consists of a symmetric H-tree driving the global clock grids in each layer of the multilayer microprocessor. This arrangement of a 3D chip stack reduces the power lost in (a) long interconnects at block level and (b) the clock distribution. Using the proposed clock distribution scheme, a 15-20% saving in clock distribution power was achieved compared to a 2D structure with the same distribution scheme. By switching off the global clock grids in individual layers when all the underlying logic is turned off, an additional 5-10% saving in power is achieved. The 3D clock distribution network also provides better skew numbers than its 2D counterpart and hence achieves the goal of improving performance and reducing power. The 3D clock distribution network was also verified with an RLC model for the interconnect. The effect of a vertical temperature profile on the clock distribution network was also investigated. 3D ICs 3D processor architectures Clock distribution Electrical engineering
4	Improving processor utilization in multiple context processor architectures Killeen, Timothy F. January 1997 (has links) No description available. processor utilization context processor architectures multiple context processor interprocess communication
5	Reuso especulativo de traços com instruções de acesso à memória / Speculative trace reuse with memory access instructions Laurino, Luiz Sequeira January 2007 (has links) Even with growing efforts to detect and handle redundant instructions, true dependencies still cause long delays in program execution. Value reuse and value prediction techniques have been studied continually as alternatives for addressing these problems. Within this context, the RST (Reuse through Speculation on Traces) architecture combines the two techniques and achieves a significant performance increase for superscalar microprocessors. However, the original RST architecture does not consider memory-access instructions as reuse candidates. This work therefore introduces a new value reuse and value prediction mechanism named RSTm (Reuse through Speculation on Traces with Memory), which extends the original mechanism by adding memory-access instructions to the reuse domain of the architecture. Among the solutions analysed, a dedicated table (Memo_Table_L) was chosen to store the load/store instructions. This solution keeps hardware overhead low, does not limit the number of memory-access instructions per trace and stores both the address and its respective value. Experiments performed with SPEC2000 integer and floating-point benchmarks show a performance improvement (harmonic mean) of 2.97% for RSTm over the original mechanism and of 17.42% over the baseline architecture. The gain results from a combination of factors: larger traces (on average, 7.75 instructions per trace, against 3.17 for the original RST), although with a reuse rate of about 10.88% (lower than that of RST, which is 15.23%); however, the latency of the instructions in RSTm traces is higher, which compensates for the lower reuse rate. Superscalar architectures Performance : Computers Processor architectures Value reuse Value prediction
8	mustafa_ali_dissertation.pdf Mustafa Fayez Ahmed Ali (14171313) 30 November 2022 (has links) Energy efficient machine learning accelerator design Electrical circuits and systems Digital processor architectures Deep learning machine learning-based VLSI circuit
9	Implementace generického procesoru v FPGA / Implementation of Generic Processor in FPGA Mikušek, Petr Unknown Date (has links) This thesis studies processor architectures suitable for embedded processors, including Transport Triggered Architectures (TTA). A TTA is programmed by specifying data transports; operations are triggered as a side effect of those transports. In traditional Operation Triggered Architectures (OTA), the program specifies the operations to be performed, and data transports are handled internally by hardware, so the compiler cannot control or optimize data transfers. The TTA approach brings advantages in both hardware and software aspects. The aim of this thesis is to design and implement a sample TTA processor in VHDL, followed by its realization in an FPGA. The processor is designed in a generic manner, i.e. it is customized by a set of generic parameters such as data width, number of buses, etc.
10	EFFICIENT AND PRODUCTIVE GPU PROGRAMMING Mengchi Zhang (13109886) 28 July 2022 (has links) Productive programmable accelerators, like GPUs, have been developed for generations to support programming features. The ever-increasing performance improves the usability of programming features on GPUs, and these programming features further ease the porting of code and data structures from CPU to GPU. However, GPU programming features, such as function calls or runtime polymorphism, have not been well explored or optimized. I identify efficient and productive GPU programming as a potential area to exploit. Although many programming paradigms are well studied and efficiently supported on CPU architectures, their performance on novel accelerators, like GPUs, has never been studied, evaluated, and perfected. For instance, programming with functions is a commonplace programming paradigm that shapes software programs with modularity and simplifies code with reusability. A large amount of work has been proposed to alleviate function calling overhead on CPUs; however, few papers have discussed its deficiencies on GPUs. On the other hand, polymorphism amplifies an object’s behaviors at runtime. A body of work targets efficient polymorphism on CPUs, but no work has ever discussed this feature in GPU contexts. In this dissertation, I discussed those two programming features on GPU architectures. First, I performed the first study to identify the deficiency of GPU polymorphism. I created micro-benchmarks to evaluate virtual function overhead in controlled settings and the first GPU polymorphic benchmark suite, ParaPoly, to investigate real-world scenarios. The micro-benchmarks indicated that the virtual function overhead is usually negligible but can cause up to a 7x slowdown. Virtual functions in ParaPoly show a geometric mean of 77% overhead on GPUs compared to the functions’ inlined versions. Second, I proposed two novel techniques that determine an object’s type only by its address pointer to improve GPU polymorphism. The first technique, Coordinated Object Allocation and function Lookup (COAL), is a software-only technique that uses the object’s address to determine its type. The second technique, TypePointer, needs hardware modification to embed the object’s type information into its address pointer. COAL achieves 80% and 6% improvements, and TypePointer achieves 90% and 12%, over contemporary CUDA and our type-based SharedOA, respectively. Considering the growth of GPU programs, function calls become a pervasive paradigm to be consistently used on GPUs. I also identified the overhead of excessive register spilling with function calls on GPUs. To diminish this cost, I proposed a novel Massively Multithreaded Register Windowing technique with Variable Size Register Window and Register-Conscious Warp Scheduling. Our techniques improve the representative workloads by a geometric mean of 1.18x with only 1.8% hardware storage overhead. Digital processor architectures Distributed systems and algorithms Operating systems Programming languages GPU Programmability Function Virtual Function Polymorphism Object-Oriented Programming

Search results