1 |
Exploring the limitations of fine-grained parallelism for a superscalar architecture. Potter, Richard Daniel, January 1998 (has links)
No description available.
|
2 |
Improving ILP with the Vectorized Computing Mechanism in VLIW DSP Architecture. Yang, Te-Shin, 25 June 2003 (has links)
To improve performance for real-time applications, current digital signal processors use VLIW architectures to increase the degree of instruction-level parallelism (ILP). Two factors limit ILP: whether there are enough hardware resources for all parallel instructions, and the dependence relations between instructions. This thesis designs a VLIW processing core called DVBTDSP, modeled on the FFT algorithm, and uses software pipelining to schedule loops so as to reach the highest ILP degree when executing FFT butterfly operations. Furthermore, to provide a smooth data stream for the pipelined operations, we design a mechanism that improves on modulo addressing by collecting discrete vectors into one continuous vector. Simulation results show that the DVBTDSP doubles the FFT performance of the C6200 and also performs well on FIR, IIR and DCT computations.
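The gather-then-stream idea behind the improved modulo addressing can be pictured in plain C. The sketch below is only a host-side model and assumes nothing about the DVBTDSP itself, whose instruction set is not reproduced here: the two strided operands of a radix-2 FFT stage are first packed next to each other, and the butterfly loop then reads a unit-stride stream, which is the access pattern a software-pipelined VLIW kernel prefers.

    #include <stdio.h>
    #include <complex.h>

    #define N 8

    int main(void)
    {
        float complex x[N], buf[N], w[N / 2];

        for (int k = 0; k < N; k++)
            x[k] = (float)k;                      /* toy input signal */
        for (int k = 0; k < N / 2; k++)           /* twiddle factors */
            w[k] = cexpf(-2.0f * 3.14159265f * I * (float)k / (float)N);

        /* Gather step: pair x[k] with x[k + N/2] so each butterfly reads
         * two adjacent elements instead of two elements N/2 apart. */
        for (int k = 0; k < N / 2; k++) {
            buf[2 * k]     = x[k];
            buf[2 * k + 1] = x[k + N / 2];
        }

        /* Butterfly loop over the now-contiguous data stream. */
        for (int k = 0; k < N / 2; k++) {
            float complex a = buf[2 * k];
            float complex b = buf[2 * k + 1];
            buf[2 * k]     = a + w[k] * b;
            buf[2 * k + 1] = a - w[k] * b;
        }

        for (int k = 0; k < N; k++)
            printf("%6.2f %+6.2fi\n", crealf(buf[k]), cimagf(buf[k]));
        return 0;
    }

On a real core the gather would be done by the addressing hardware rather than by an explicit copy loop; the copy loop here only makes the intended access pattern visible.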
|
3 |
A high performance pseudo-multi-core elliptic curve cryptographic processor over GF(2^163). Zhang, Yu, 22 June 2010
The elliptic curve cryptosystem is a public-key system that can guarantee the same security level as Rivest, Shamir and Adleman (RSA) with a smaller key size. The key of elliptic curve cryptography (ECC) can therefore be more compact, which brings advantages in circuit area, memory requirement, power consumption, performance and bandwidth. However, compared with a private-key system such as the Advanced Encryption Standard (AES), ECC is still much more complicated and computationally intensive. In many real applications, private-key and public-key systems are combined to achieve high performance. The ultimate goal of this research is to architect a high-performance ECC processor for demanding applications such as network servers and cellular sites.
In this thesis, a high-performance processor for ECC over the Galois field GF(2^163), using polynomial representation, is proposed. It has three finite field (FF) reduced instruction set computer (RISC) cores and a main controller to achieve instruction-level parallelism (ILP) with pipelining, so that a largely parallelized algorithm for elliptic curve point multiplication (PM) maps well onto this platform. Instructions for combined FF operations are added to the instruction set to decrease clock cycles. The interconnection among the three FF cores and the main controller is obtained by analyzing the data dependencies in the parallelized algorithm. A five-stage pipeline is employed in this architecture. Finally, the u-code executed on the three FF cores is manually optimized to save clock cycles. The proposed design reaches 185 MHz with 20,807 slices when implemented on a Xilinx XC4VLX80 FPGA device and 263 MHz with 217,904 gates when synthesized with TSMC 0.18 um CMOS technology. The implementation completes one ECC PM in 1428 cycles; it is 1.3 times faster than the fastest implementation over GF(2^163) reported in the literature while consuming 14.6% less area on the same FPGA device.
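To make the finite-field side of this concrete, here is a minimal sketch of polynomial-basis arithmetic in GF(2^163) in plain C. It assumes the NIST pentanomial f(x) = x^163 + x^7 + x^6 + x^3 + 1 as the reduction polynomial (the abstract does not state which irreducible polynomial the processor uses), and it is bit-serial, so it only shows the mathematics; it does not reflect the three-core FF RISC organization or the combined FF instructions of the thesis.

    #include <stdint.h>
    #include <stdio.h>

    #define M     163
    #define WORDS 3                    /* 3 x 64 bits holds 163 bits */

    typedef struct { uint64_t w[WORDS]; } gf2_163;

    static int get_bit(const gf2_163 *a, int i)
    {
        return (int)((a->w[i / 64] >> (i % 64)) & 1u);
    }

    /* a(x) <- a(x) * x mod f(x), assuming f(x) = x^163 + x^7 + x^6 + x^3 + 1 */
    static void gf_mulx(gf2_163 *a)
    {
        int carry = get_bit(a, M - 1);            /* bit 162 leaves the field */
        a->w[2] = (a->w[2] << 1) | (a->w[1] >> 63);
        a->w[1] = (a->w[1] << 1) | (a->w[0] >> 63);
        a->w[0] <<= 1;
        a->w[2] &= (1ULL << (M - 128)) - 1;       /* keep only bits 0..162 */
        if (carry)                                /* x^163 = x^7 + x^6 + x^3 + 1 */
            a->w[0] ^= (1ULL << 7) | (1ULL << 6) | (1ULL << 3) | 1ULL;
    }

    /* r = a * b in GF(2^163): bit-serial shift-and-add, reducing on the fly */
    static void gf_mul(gf2_163 *r, const gf2_163 *a, const gf2_163 *b)
    {
        gf2_163 acc = {{0}}, t = *a;
        for (int i = 0; i < M; i++) {
            if (get_bit(b, i)) {
                acc.w[0] ^= t.w[0];
                acc.w[1] ^= t.w[1];
                acc.w[2] ^= t.w[2];
            }
            gf_mulx(&t);
        }
        *r = acc;
    }

    int main(void)
    {
        gf2_163 a = {{0}}, b = {{0}}, r;
        a.w[0] = 2;                  /* a(x) = x     */
        b.w[2] = 1ULL << 34;         /* b(x) = x^162 */
        gf_mul(&r, &a, &b);          /* x * x^162 = x^163 = x^7+x^6+x^3+1 = 0xc9 */
        printf("%016llx %016llx %016llx\n",
               (unsigned long long)r.w[2],
               (unsigned long long)r.w[1],
               (unsigned long long)r.w[0]);
        return 0;
    }

A hardware design like the one described would replace this bit-serial loop with wide combinational multipliers and dedicated reduction logic; the sketch only fixes the arithmetic being accelerated.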
|
4 |
Um mecanismo de busca especulativa de múltiplos fluxos de instruções / A multistreamed speculative instruction fetch mechanism. Santos, Rafael Ramos dos, January 1997 (has links)
This work presents a new model for speculative instruction fetch along multiple instruction streams in superscalar architectures. A performance evaluation of a superscalar architecture with this feature is also presented, both to validate the proposed model and to compare its performance against a real superscalar architecture. The model aims to eliminate the instruction fetch latency introduced by branch instructions in superscalar pipelines. The performance delivered by a superscalar architecture with dynamic instruction scheduling, branch prediction and speculative execution falls well short of the theoretical maximum, which should be at least proportional to the number of functional units. As related work has shown, this is caused by the constant stream breaks produced by branch instructions and the resulting draining of the instruction queue. The proposed technique chains instructions belonging to different logical streams as soon as a branch instruction is identified during fetch, making a larger number of instructions available to the dynamic scheduling mechanism and reducing the number of cycles with zero dispatch due to stream breaks. Since the scheduler needs a large instruction window to schedule efficiently, the scheme effectively enlarges that window by keeping it filled and avoiding interruptions when branches occur. Some considerations on the implementation of the described model are presented at the end of the work, together with suggestions for future work.
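A toy behavioral model helps to picture what multistreamed fetch does to the instruction queue. The C sketch below is not the hardware described in the dissertation; the stream count, fetch width and queue size are made-up parameters, and each fetched instruction is represented only by its PC. The point it illustrates is that when a branch is detected during fetch, a second logical stream is opened at the branch target, so the queue keeps filling instead of draining while the branch is outstanding.

    #include <stdio.h>
    #include <stdbool.h>

    #define MAX_STREAMS 4
    #define FETCH_WIDTH 4            /* instructions per stream per cycle */
    #define QUEUE_SIZE  32

    typedef struct { bool is_branch; int target; } Instr;
    typedef struct { int pc; bool active; } Stream;

    static Instr imem[64];           /* toy instruction memory */
    static int   queue[QUEUE_SIZE];  /* PCs of fetched instructions */
    static int   q_len = 0;

    static void open_stream(Stream s[], int pc)
    {
        for (int i = 0; i < MAX_STREAMS; i++)
            if (!s[i].active) { s[i].active = true; s[i].pc = pc; return; }
    }

    static void fetch_cycle(Stream s[])
    {
        for (int i = 0; i < MAX_STREAMS && q_len < QUEUE_SIZE; i++) {
            if (!s[i].active) continue;
            for (int k = 0; k < FETCH_WIDTH && q_len < QUEUE_SIZE; k++) {
                Instr in = imem[s[i].pc];
                queue[q_len++] = s[i].pc;
                if (in.is_branch)          /* chain a new logical stream */
                    open_stream(s, in.target);
                s[i].pc++;                 /* this stream keeps the fall-through path */
            }
        }
    }

    int main(void)
    {
        imem[3] = (Instr){ true, 20 };     /* a branch at PC 3 targeting PC 20 */
        Stream s[MAX_STREAMS] = { { 0, false } };
        open_stream(s, 0);
        for (int c = 0; c < 3; c++)
            fetch_cycle(s);
        for (int i = 0; i < q_len; i++)
            printf("%d ", queue[i]);
        printf("\n");
        return 0;
    }

After three cycles the queue holds instructions from both the fall-through path and the branch target, which is exactly the extra supply of instructions the dynamic scheduler is meant to exploit.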
|
5 |
SoMMA: a software managed memory architecture for multi-issue processors. Jost, Tiago Trevisan, January 2017 (has links)
Embedded processors rely on the efficient use of instruction-level parallelism to meet the performance and energy needs of modern applications. Though improving performance is the primary goal for processors in general, it can have a negative impact on energy consumption, a particularly critical constraint for current systems. In this dissertation we present SoMMA, a software-managed memory architecture for embedded multi-issue processors that reduces energy consumption and energy-delay product (EDP) while still increasing memory bandwidth. The approach combines software-managed memories (SMM) with the data cache and leverages the lower energy cost of SMM accesses to reduce the processor's energy consumption and EDP. SoMMA also improves overall performance, as memory accesses can be performed in parallel at no cost in extra memory ports on the data cache. Compiler-automated code transformations minimize the programmer's effort to benefit from the proposed architecture. Experimental results show that SoMMA is more energy- and performance-efficient not only for the processing cores but also at the full-system level. Comparisons were done using the VEX processor, a reconfigurable VLIW processor. The approach shows average speedups of 1.118x and 1.121x, while consuming up to 11% and 12.8% less energy, when comparing two modified processors with their baselines. SoMMA also reduces full-system EDP by up to 41.5% while keeping the same processor area as the baseline processors. Lastly, even with SoMMA halving the data cache size, the number of data cache misses is still lower than in the baselines.
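The pattern that the compiler transformations produce can be sketched in ordinary C. In the illustration below the software-managed memory is simply a static array standing in for a scratchpad address range, and the buffer size, function names and the scale-by-k workload are invented for the example; SoMMA's actual instructions and its compiler pass are not reproduced. The sketch only shows the copy-in, compute-in-SMM, copy-out structure that lets regular accesses bypass the data cache.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define SMM_WORDS 256
    static int32_t smm[SMM_WORDS];          /* stand-in for the scratchpad */

    /* Scale every element of a large array, staging it through the SMM in
     * fixed-size blocks: copy in, compute locally, copy back. */
    static void scale_block(int32_t *dram, size_t n, int32_t k)
    {
        for (size_t base = 0; base < n; base += SMM_WORDS) {
            size_t len = (n - base < SMM_WORDS) ? n - base : SMM_WORDS;
            memcpy(smm, dram + base, len * sizeof *smm);   /* copy-in        */
            for (size_t i = 0; i < len; i++)               /* compute in SMM */
                smm[i] *= k;
            memcpy(dram + base, smm, len * sizeof *smm);   /* copy-out       */
        }
    }

    int main(void)
    {
        int32_t data[1000];
        for (int i = 0; i < 1000; i++)
            data[i] = i;
        scale_block(data, 1000, 3);
        printf("%d %d\n", data[0], data[999]);   /* prints 0 2997 */
        return 0;
    }

On hardware of the kind described, the copy-in and copy-out would go to a separate scratchpad port, so they can overlap with cached accesses instead of competing with them.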
|
6 |
Scheduling Algorithms for Instruction Set Extended Symmetrical Homogeneous Multiprocessor Systems-on-Chip. Montcalm, Michael R., 10 June 2011 (has links)
Embedded system designers face multiple challenges in fulfilling the runtime requirements of programs. Effective scheduling of programs is required to extract as much parallelism as possible, and these scheduling algorithms must also improve speedup after instruction-set extensions have occurred. Scheduling of dynamic code at run time is made more difficult when the static components of the program are scheduled inefficiently. This research aims to optimize a program's static code at compile time. This is achieved with four algorithms designed to schedule code at the task and instruction levels. Additionally, the algorithms improve the scheduling of instruction-set-extended code on symmetrical homogeneous multiprocessor systems. Using these algorithms, we achieve speedups of up to 3.86X over sequential execution on a 4-issue, 2-processor system and show better performance than recent heuristic techniques for small programs. Finally, the algorithms achieve speedup values for a 64-point FFT similar to those of the test runs.
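As a rough illustration of static task-level scheduling of the kind discussed here, the following C sketch runs a simple list scheduler over a small task graph on two processors. The dependence matrix, task costs and the earliest-free-processor rule are all invented for the example; this is a generic textbook heuristic, not one of the four algorithms developed in the thesis.

    #include <stdio.h>

    #define NTASKS 6
    #define NPROCS 2

    /* dep[i][j] = 1 means task j must finish before task i can start */
    static const int dep[NTASKS][NTASKS] = {
        /* t0 */ {0,0,0,0,0,0},
        /* t1 */ {1,0,0,0,0,0},
        /* t2 */ {1,0,0,0,0,0},
        /* t3 */ {0,1,0,0,0,0},
        /* t4 */ {0,0,1,0,0,0},
        /* t5 */ {0,0,0,1,1,0},
    };
    static const int cost[NTASKS] = {2, 3, 2, 2, 4, 1};

    int main(void)
    {
        int finish[NTASKS]    = {0};
        int done[NTASKS]      = {0};
        int proc_free[NPROCS] = {0};

        for (int scheduled = 0; scheduled < NTASKS; ) {
            for (int t = 0; t < NTASKS; t++) {
                if (done[t]) continue;
                int ready = 1, est = 0;          /* earliest start from deps */
                for (int p = 0; p < NTASKS; p++) {
                    if (!dep[t][p]) continue;
                    if (!done[p]) { ready = 0; break; }
                    if (finish[p] > est) est = finish[p];
                }
                if (!ready) continue;
                int best = 0;                    /* pick earliest-free processor */
                for (int q = 1; q < NPROCS; q++)
                    if (proc_free[q] < proc_free[best]) best = q;
                int start = est > proc_free[best] ? est : proc_free[best];
                finish[t] = start + cost[t];
                proc_free[best] = finish[t];
                done[t] = 1;
                scheduled++;
                printf("task %d on P%d: start %d, finish %d\n",
                       t, best, start, finish[t]);
            }
        }
        printf("makespan = %d\n",
               proc_free[0] > proc_free[1] ? proc_free[0] : proc_free[1]);
        return 0;
    }

The thesis's algorithms additionally have to account for extended instructions changing task costs, which a fixed cost table like the one above cannot capture.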
|