Global ETD Search

21	Evaluating the Scalability of SDF Single-chip Multiprocessor Architecture Using Automatically Parallelizing Code Zhang, Yuhua 12 1900 Advances in integrated circuit technology continue to provide more and more transistors on a chip. Computer architects are faced with the challenge of finding the best way to translate these resources into high performance. The challenge in the design of the next-generation CPU (central processing unit) lies not in trying to use up the silicon area, but in finding smart ways to make use of the wealth of transistors now available. In addition, the next-generation architecture should offer high throughput, scalability, modularity, and low energy consumption, rather than being suitable for only one class of applications or users or emphasizing only a faster clock rate. A program exhibits different types of parallelism: instruction-level parallelism (ILP), thread-level parallelism (TLP), or data-level parallelism (DLP). Likewise, architectures can be designed to exploit one or more of these types of parallelism. It is generally not possible to design architectures that can take advantage of all three types of parallelism without using very complex hardware structures and complex compiler optimizations. We present the state-of-the-art SDF (scheduled dataflow) architecture, which exploits as much TLP as the application supplies. We implement an SDF single-chip multiprocessor constructed from simpler processors and execute automatically parallelized applications on it. SDF has many desirable features, such as high throughput, scalability, and low power consumption, which meet the requirements of next-generation CPU design. 
The experimental results show that, compared with superscalar, VLIW (very long instruction word), and SMT (simultaneous multithreading) architectures, SDF is comparable for applications with very little parallelism and outperforms them for applications with large amounts of parallelism. Multiprocessors. Computer architecture. SDF single-chip multiprocessor multithreading automatic parallelization scalability superscalar VLIW SMT
22	Návrh pokročilé architektury procesoru v jazyce VHDL / VHDL Design of Advanced CPU Slavík, Daniel January 2010 The goal of this project was to study pipelined processor architectures along with instruction and data caches. A chosen pipelined architecture was to be designed and implemented in VHDL. I first implemented a subscalar architecture and then three versions of a scalar architecture. These architectures were synthesized for an FPGA, and their performance was compared on a chosen algorithm. In the next part of the thesis I designed and implemented instruction and data cache logic for both architectures; however, I was not able to synthesize these caches. The last chapter of the thesis deals with the superscalar architecture, which is the architecture in common use today.
23	Superscalar Processor Models Using Statistical Learning Joseph, P J 04 1900 Processor architectures are becoming increasingly complex, and architects therefore have to evaluate a large design space consisting of several parameters, each with a number of potential settings. To help guide design decisions, we develop simple and accurate models of the superscalar processor design space using a detailed and validated superscalar processor simulator. First, we obtain precise estimates of all significant micro-architectural parameters and their interactions by building linear regression models from simulation-based experiments. We obtain good approximate models at low simulation cost using an iterative process in which Akaike's Information Criterion is used to extract a good linear model from a small set of simulations, and limited further simulation is guided by the model using D-optimal experimental designs. The iterative process is repeated until the desired error bounds are achieved. We use this procedure for model construction and show that it provides a cost-effective scheme for experimenting with all relevant parameters. We also obtain accurate predictors of the processor's performance response across the entire design space by constructing radial basis function networks from sampled simulation experiments. We construct these models by simulating at a limited number of design points selected by Latin hypercube sampling and then deriving the radial basis function networks from the results. We show that these predictors provide accurate approximations to the simulator's performance response and hence provide a cheap alternative to simulation while searching for optimal processor design points. 
Supercomputers Supercomputers - Statistical Methods MATLAB Linear Regression Models Superscalar Processor Architecture Superscalar Processors - Linear Models Radial Basis Function Networks Linear Models RBF Networks Processor Performance Analysis Predictive Performance Model Predictive Modeling Computer Science
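The Latin hypercube sampling step mentioned in entry 23 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the thesis's own code: the function names, the three design parameters, and their ranges are all hypothetical. Each dimension is cut into as many equal strata as there are samples, and every stratum is used exactly once per dimension.

```python
import random

def latin_hypercube(n_samples, n_dims, rng):
    """Return n_samples points in the unit cube, one per stratum in every dimension."""
    columns = []
    for _ in range(n_dims):
        strata = list(range(n_samples))
        rng.shuffle(strata)  # independent random permutation of strata per dimension
        # Place one point uniformly at random inside each stratum.
        columns.append([(s + rng.random()) / n_samples for s in strata])
    return [tuple(col[i] for col in columns) for i in range(n_samples)]

def scale(point, ranges):
    """Map a unit-cube point onto (low, high) parameter ranges."""
    return tuple(lo + u * (hi - lo) for u, (lo, hi) in zip(point, ranges))

if __name__ == "__main__":
    # Hypothetical design parameters: ROB entries, issue width, L2 size in KB.
    ranges = [(32, 256), (2, 8), (256, 4096)]
    for p in latin_hypercube(8, 3, random.Random(0)):
        print([round(v) for v in scale(p, ranges)])
```

Each printed row is one design point that would be handed to the simulator; a regression or radial basis function model would then be fitted to the simulated results.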
24	Προσαρμογή συχνότητας και τάσης λειτουργίας για τη βελτιστοποίηση κατανάλωσης ενέργειας επεξεργαστών / Frequency and Voltage Scaling for Optimizing Processor Energy Consumption Σπηλιόπουλος, Βασίλειος 19 April 2010 Modern research in computer architecture focuses on techniques that save energy without much loss in processor performance. Superscalar processors that allow out-of-order execution in particular are characterized by high energy consumption because of the complex structures they use to increase performance. Dynamic voltage and frequency scaling (DVFS) is a widely used technique for saving energy. By reducing the frequency of the processor's clock, it is possible to reduce the supply voltage as well. 
In this way the consumed energy is also reduced. The purpose of this diploma thesis is to create a real-time mechanism that scales the frequency and voltage of a superscalar, out-of-order processor so that the processor saves energy without much loss in performance. This can be achieved by reducing the frequency and the voltage during periods in which the processor executes many memory operations. Simulation of our mechanism on a variety of benchmarks shows that we can save a large amount of energy without a significant increase in the benchmarks' execution time. 621.395 Processors Energy saving Superscalar processors
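Why lowering frequency and voltage during memory-heavy periods costs little time can be seen with a first-order model. The sketch below is an assumption-laden illustration, not the mechanism from entry 24: it assumes memory stall time does not depend on core frequency, uses the standard dynamic-energy relation E = C·V²·(cycles), ignores leakage power, and uses made-up operating points and workload sizes.

```python
def phase_time(cpu_cycles, mem_seconds, freq_hz):
    """Run time of a phase: core cycles scale with frequency, memory stalls do not."""
    return cpu_cycles / freq_hz + mem_seconds

def dynamic_energy(cpu_cycles, switched_capacitance, volts):
    """Dynamic switching energy: C * V^2 per cycle, times the number of cycles."""
    return switched_capacitance * volts ** 2 * cpu_cycles

def compare(cpu_cycles, mem_seconds, high, low, switched_capacitance=1e-9):
    """Return (slowdown, energy_ratio) of the low operating point relative to the high one.

    high and low are hypothetical (frequency in Hz, supply voltage in V) pairs.
    """
    f_hi, v_hi = high
    f_lo, v_lo = low
    slowdown = (phase_time(cpu_cycles, mem_seconds, f_lo)
                / phase_time(cpu_cycles, mem_seconds, f_hi))
    energy_ratio = (dynamic_energy(cpu_cycles, switched_capacitance, v_lo)
                    / dynamic_energy(cpu_cycles, switched_capacitance, v_hi))
    return slowdown, energy_ratio

if __name__ == "__main__":
    high, low = (2e9, 1.2), (1e9, 0.9)  # hypothetical operating points
    print("memory-bound :", compare(2e8, 1.0, high, low))
    print("compute-bound:", compare(2e9, 0.0, high, low))
```

Under these assumptions the memory-bound phase slows down by roughly 9% while its dynamic energy falls to about 56% of the original, whereas the compute-bound phase takes twice as long for the same energy ratio.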
25	Recursive Blocked Algorithms, Data Structures, and High-Performance Software for Solving Linear Systems and Matrix Equations Jonsson, Isak January 2003 This thesis deals with the development of efficient and reliable algorithms and library software for factorizing matrices and solving matrix equations on high-performance computer systems. The architectures of today's computers consist of multiple processors, each with multiple functional units. The memory systems are hierarchical with several levels, each having different speed and size. The practical peak performance of a system is reached only by considering all of these characteristics. One portable method for achieving good system utilization is to express a linear algebra problem in terms of level 3 BLAS (Basic Linear Algebra Subprogram) transformations. The most important operation is GEMM (GEneral Matrix Multiply), which typically defines the practical peak performance of a computer system. There are efficient GEMM implementations available for almost any platform, thus an algorithm using this operation is highly portable. The dissertation focuses on how recursion can be applied to solve linear algebra problems. Recursive linear algebra algorithms have the potential to automatically match the size of subproblems to the different levels of the memory hierarchy, leading to much better utilization of the memory system. Furthermore, recursive algorithms expose level 3 BLAS operations and reveal task parallelism. The first paper handles the Cholesky factorization for matrices stored in packed format. Our algorithm uses a recursive packed matrix data layout that enables the use of high-performance matrix-matrix multiplication, in contrast to the standard packed format. 
The resulting library routine requires half the memory of full storage, yet the performance is better than for full storage routines. Papers two and three introduce recursive blocked algorithms for solving triangular Sylvester-type matrix equations. For these problems, recursion together with superscalar kernels produces new algorithms that give 10-fold speedups compared to existing routines in the SLICOT and LAPACK libraries. We show that our recursive algorithms also have a significant impact on the execution time of solving unreduced problems and when used in condition estimation. By recursively splitting several problem dimensions simultaneously, parallel algorithms for shared memory systems are obtained. The fourth paper introduces a library, RECSY, consisting of a set of routines implemented in Fortran 90 using the ideas presented in papers two and three. Using performance monitoring tools, the last paper evaluates the possible gain from using different matrix blocking layouts and the impact of superscalar kernels in the RECSY library. Recursive algorithm recursive blocked format Cholesky factorization Sylvester-type equations automatic blocking superscalar GEMM-based RECSY library Computer engineering Datorteknik
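The idea of a recursive Cholesky factorization described in entry 25 can be sketched in a few lines. This is only an illustration of the recursion, under loud assumptions: it uses plain nested lists in full storage rather than the thesis's recursive packed layout, and the two helper functions stand in for the level 3 BLAS calls (a GEMM-like update and a TRSM-like triangular solve) that a real implementation would use. All names are hypothetical.

```python
import math

def matmul_nt(a, b):
    """Return a @ b^T for nested-list matrices (stands in for a GEMM-like call)."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]

def solve_lower_transposed(l, b):
    """Solve X @ l^T = b for X, with l lower triangular (a TRSM-like step)."""
    n = len(l)
    x = []
    for row in b:
        out = [0.0] * n
        for j in range(n):
            s = row[j] - sum(out[k] * l[j][k] for k in range(j))
            out[j] = s / l[j][j]
        x.append(out)
    return x

def recursive_cholesky(a):
    """Return lower-triangular L with L @ L^T = a, for symmetric positive definite a."""
    n = len(a)
    if n == 1:
        return [[math.sqrt(a[0][0])]]
    h = n // 2  # split the problem roughly in half at every level
    a11 = [row[:h] for row in a[:h]]
    a21 = [row[:h] for row in a[h:]]
    a22 = [row[h:] for row in a[h:]]
    l11 = recursive_cholesky(a11)            # factor the leading block
    l21 = solve_lower_transposed(l11, a21)   # L21 = A21 * L11^{-T}
    update = matmul_nt(l21, l21)             # L21 * L21^T
    schur = [[a22[i][j] - update[i][j] for j in range(n - h)] for i in range(n - h)]
    l22 = recursive_cholesky(schur)          # factor the updated trailing block
    top = [row + [0.0] * (n - h) for row in l11]
    bottom = [l21[i] + l22[i] for i in range(n - h)]
    return top + bottom

if __name__ == "__main__":
    a = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
    for row in recursive_cholesky(a):
        print(row)
```

Because the split halves the problem at each level, the subproblems shrink until they fit each level of the memory hierarchy, and most of the arithmetic ends up in the matrix-matrix update.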
27	Šiuolaikinių procesorių architektūrų tyrimas, našumo lyginamoji analizė / Investigation on architectures of processors and comparative analysis of their efficiency Kislauskas, Nerijus 21 May 2005 The work deals with the main aspects of increasing computer efficiency. The object of investigation is a system consisting of a processor, memory, other components, and the connecting links called buses. Quite different systems are in use today, so it is of interest to compare architectures from several manufacturers. The systems can be compared because they share architectural features: processor clock speed, cache, memory, dual-channel technology, and others. The comparative analysis used software that reveals the efficiency gain contributed by individual computer components. The systems chosen for study are technologically recent: Intel Pentium 4, AMD Athlon XP, and AMD Sempron. The experiments showed that the efficiency of a system depends for the most part on the processor: an increase in its clock speed yields 9-13%, L1 cache has an effect of up to 1350% (theoretically), and L2 cache 30-38%. HyperThreading was observed to benefit mainly floating-point operations (by up to 68%), and branch prediction should theoretically increase efficiency by up to 47%. In terms of the efficiency of the whole system, the results show that the main role belongs to the processor; the influence of the other components is less noticeable. The operating characteristics of the memory type determine the rate at which data are fetched and transferred from memory. The study has shown that... Informatics Superpipelining Scalar processor Superscalar processor CISC RISC Scalarity Comparative analysis of processors Performance Superscalarity Processor superpipeline Processor pipeline
28	Βελτιστοποίηση και επαλήθευση μοντέλων πρόβλεψης της απόδοσης / Optimization and Verification of Performance Prediction Models Ρόκας, Παρασκευάς 21 October 2010 Microprocessor design is a complex and difficult process that becomes harder as technology advances. To study the performance of a microprocessor, designers tend to use full cycle simulation, which is extremely complex and time-consuming. This thesis presents an analytical model that models the performance of a processor as a function of the executable it runs and the processor's structural characteristics. The model is based on an out-of-order superscalar processor. The modelling relies on the fact that a balanced superscalar processor maintains a steady performance rate unless a disruptive miss event happens, such as a data cache miss or a branch misprediction. 
The data from the executable are gathered during execution using a binary rewriting tool called DIOTA. The steady-state model is presented, and the impact of each miss event is measured separately. 004.24 Computer architecture Processors Modelling Superscalar out of order Binary instrumentation
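The steady-state-plus-miss-events idea in entry 28 can be illustrated with a first-order additive model. This is a sketch under stated assumptions and not the thesis's actual model: it assumes each miss event adds a fixed penalty independently of the others, and the steady-state rate, event frequencies, and penalties below are invented for illustration.

```python
def estimate_cpi(steady_ipc, miss_events):
    """Estimate cycles per instruction as a steady-state term plus miss-event terms.

    steady_ipc  : instructions per cycle sustained when no miss event occurs
    miss_events : {name: (events per instruction, penalty in cycles)}
    Returns (total CPI, per-component breakdown).
    """
    breakdown = {"steady_state": 1.0 / steady_ipc}
    for name, (events_per_instr, penalty_cycles) in miss_events.items():
        breakdown[name] = events_per_instr * penalty_cycles
    return sum(breakdown.values()), breakdown

if __name__ == "__main__":
    # Hypothetical rates and penalties, chosen only to show the arithmetic.
    events = {
        "branch_mispredict": (0.005, 10),
        "l2_data_miss": (0.002, 100),
    }
    total, parts = estimate_cpi(4.0, events)
    print(f"CPI = {total:.3f}")
    for name, cpi in parts.items():
        print(f"  {name}: {cpi:.3f}")
```

The breakdown makes the contribution of each miss event visible on its own, which mirrors the abstract's point that the impact of each event is measured separately.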
29	Out-of-Order Retirement of Instructions in Superscalar, Multithreaded, and Multicore Processors Ubal Tena, Rafael 01 September 2010 Current superscalar processors use a reorder buffer (ROB) to keep track of in-flight instructions. The ROB is implemented as a first-in, first-out (FIFO) queue into which instructions are inserted in program order after being decoded, and from which they are also removed in program order at the commit stage. This structure provides simple support for speculation, precise exceptions, and register reclamation. However, retiring instructions in order can degrade performance if a long-latency operation is blocking the head of the ROB. Several proposals addressing this problem have been published. Most of them retire instructions out of order speculatively, which requires storing checkpoints in order to restore a valid processor state after a misspeculation. Checkpoints normally have to be implemented with costly hardware structures, and they also require other processor structures to grow, which in turn can affect the clock cycle time. This problem affects many kinds of current processors, regardless of the number of hardware threads and the number of cores they include. This thesis covers the study of non-speculative out-of-order retirement of instructions in superscalar, multithreaded, and multicore processors. / Ubal Tena, R. (2010). Out-of-Order Retirement of Instructions in Superscalar, Multithreaded, and Multicore Processors [Unpublished doctoral thesis]. Universitat Politècnica de València. 
https://doi.org/10.4995/Thesis/10251/8535 / Palancia Out-of-order retirement Reorder buffer Processor architecture Multithreading Multicore Superscalar Sequential consistency 120317 - Computer science 120326 - Simulation 330406 - Computer architecture
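The head-of-ROB blocking described in entry 29 can be shown with a toy calculation. The sketch below models only in-order commit from a FIFO ROB and is not the thesis's retirement mechanism; it assumes unlimited commit width, and the completion cycles are invented. An instruction can leave the ROB no earlier than its own completion and no earlier than every older instruction.

```python
from itertools import accumulate

def in_order_commit_cycles(completion_cycles):
    """Cycle at which each instruction can commit when the ROB retires in program order."""
    # Running maximum: an instruction waits for itself and for all older instructions.
    return list(accumulate(completion_cycles, max))

def rob_wait(completion_cycles):
    """Cycles each instruction occupies its ROB entry after it has finished executing."""
    commits = in_order_commit_cycles(completion_cycles)
    return [c - d for c, d in zip(commits, completion_cycles)]

if __name__ == "__main__":
    # Hypothetical trace: a long-latency load followed by three short operations.
    done = [100, 3, 4, 5]
    print(in_order_commit_cycles(done))  # every entry is held until cycle 100
    print(rob_wait(done))
```

The three short operations finish within a few cycles but keep their ROB entries until the load at the head completes, which is the stall that out-of-order retirement aims to remove.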
30	Scalable Low Power Issue Queue And Store Queue Design For Superscalar Processors Vivekanandham, Rajesh 12 1900 A large instruction window is a key requirement for exploiting greater instruction-level parallelism (ILP) in out-of-order superscalar processors. Along with the instruction window size, the sizes of various other structures, including the issue queue, store queue, and register file, need to increase as well. However, the cycle time and energy consumption of conventional large monolithic content-addressable memories (CAMs), the underlying structure of most conventional issue queue and store queue designs, worsen rapidly with an increase in size. This results in a three-way trade-off involving ILP, clock frequency, and energy consumption. In this thesis, we propose efficient designs for the issue queue and the store queue that improve circuit latency and energy consumption while minimizing the loss in instructions per cycle (IPC). We propose the Scalable Low power Issue Queue (SLIQ) design, which segments the issue queue structure to reduce latency. This is complemented by a fast wakeup index, kept for every instruction, that points to one of its consumers in the issue queue. Because this consumer instruction can be woken up directly, without any delay, the index mitigates the IPC loss faced by the pipelined issue queue. Also, because the scheme incorporates a pipelined broadcast, the indices are not required for correctness and can simply be gang-invalidated on branch mispredictions. The IPC loss of an 8-segment SLIQ is within 2.3% for the entire SPEC CPU2000 benchmark suite, while it achieves a 39.3% reduction in issue latency. Further, in the SLIQ design unnecessary broadcasts to the higher segments are avoided most of the time because, in a large majority of cases, an instruction has a single consumer. This consumer is woken up either by direct indexing or by broadcast in the first segment of the SLIQ. 
This enables the 8-segment SLIQ to reduce energy consumption and the energy-delay product by 48.3% and 67.4%, respectively, on average. SLIQ also allows architects to segment the issue queue carefully so that the latency of the issue logic is just within the per-pipeline-stage latency goals of the design. We also propose the Scalable Low power Store Queue (SLSQ) to address similar problems associated with the store queue data forwarding logic. We extend the state-of-the-art Store Vector based disambiguator to also predict the index of the store that will forward to a given load. SLSQ adds marginally to the hardware budget but predicts the store queue index of the forwarding store with an accuracy of 99.5% on average. SLSQ thus eliminates unnecessary address broadcasts and compares, and reduces the energy consumption of the store-to-load forwarding logic by 78.4% and 91.6% for the SPEC Int and FP suites, respectively. Another variant of SLSQ eliminates the need for a CAM in the forwarding logic and achieves a 49.9% reduction in store-to-load data forwarding latency while incurring a minimal IPC loss of less than 0.1% on average for the entire SPEC CPU2000 benchmark suite. Parallel Processing (Computer Science) Queuing Processes Queue Design Superscalar Processors Large Instruction Window Computer Science
