Global ETD Search

21	Implementation of coarse-grain coherence tracking support in ring-based multiprocessors
Coté, Edmond A. 25 October 2007
As the number of processors in multiprocessor system-on-chip devices continues to increase, the complexity required for full cache coherence support is often unwarranted for application-specific designs. Bus-based interconnects are no longer suitable for larger-scale systems, and the logic and storage overhead associated with the use of a complex packet-switched network and directory-based cache coherence may be undesirable in single-chip systems. Unidirectional rings are a suitable alternative because they offer many properties favorable to both on-chip implementation and to supporting cache coherence. Reducing the overhead of cache coherence traffic is, however, a concern for these systems. This thesis adapts two filter structures that are based on principles of coarse-grained coherence tracking, and applies them to a ring-based multiprocessor. The first structure tracks the total number of blocks of remote data cached by all processors in a node for a set of regions, where a region is a large area of memory referenced by the upper bits of an address. The second structure records regions of local data whose contents are not cached by any remote node. When used together to filter incoming or outgoing requests, these structures reduce the extent of coherence traffic and limit the transmission of coherent requests to the necessary parts of the system. A complete single-chip multiprocessor system that includes the proposed filters is designed and implemented in programmable logic for this thesis. The system is composed of nodes of bus-based multiprocessors, and each node includes a common memory, two or more pipelined 32-bit processors with coherent data caches, a split-transaction bus with separate lines for requests and responses, and an interface for the system-level ring interconnect.
Two coarse-grained filters are attached to each node to reduce the impact of coherence traffic on the system. Cache coherence within the node is enforced through bus snooping, while coherence across the interconnect is supported by a reduced-complexity ring snooping protocol. Main memory is globally shared and is physically distributed among the nodes. Results are presented to highlight the system's key implementation points. Synthesis results are presented in order to evaluate hardware overhead, and operational results are shown to demonstrate the functionality of the multiprocessor system and of the filter structures.
Thesis (Master, Electrical & Computer Engineering) -- Queen's University, 2007-10-24 10:16:47.81
Financial support for this work was provided by the Natural Sciences and Engineering Research Council of Canada, Communications and Information Technology Ontario, and Queen's University.
Cache coherence Ring-based multiprocessor Coarse-grain coherence tracking Prototype implementation Multiprocessor system-on-chip
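The first filter structure described in entry 21 (a per-region count of remote blocks cached in a node, with regions identified by the upper address bits) can be illustrated with a small software sketch. This is not the thesis's hardware design: the class name, method names, and the 4 KiB region size are assumptions made only for illustration.

```python
REGION_BITS = 12  # assumption: one region spans 4 KiB; the abstract gives no size


class RegionCountFilter:
    """Hypothetical per-node filter counting cached remote blocks per region."""

    def __init__(self, region_bits=REGION_BITS):
        self.region_bits = region_bits
        self.counts = {}  # region number -> number of cached blocks

    def region_of(self, address):
        # A region is referenced by the upper bits of the address.
        return address >> self.region_bits

    def block_cached(self, address):
        region = self.region_of(address)
        self.counts[region] = self.counts.get(region, 0) + 1

    def block_evicted(self, address):
        region = self.region_of(address)
        if self.counts.get(region, 0) > 0:
            self.counts[region] -= 1

    def must_snoop(self, address):
        # A request can be filtered out when no block of its region is cached here.
        return self.counts.get(self.region_of(address), 0) > 0
```

A request whose region has a zero count cannot hit in any cache of the node, so under these assumptions it can bypass the node's bus without a snoop.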
22	Location Cache Design and Performance Analysis for Chip Multiprocessors
NEMETH, JASON 19 September 2008
No description available.
Engineering Technology CMP multiprocessor location cache cache microprocessor low power chip multiprocessor
23	Bounds For Scheduling In Non-Identical Uniform Multiprocessor Systems
Darera, Vivek N 06 1900
With multiprocessors and multicore processors becoming ubiquitous, focus has shifted from research on uniprocessors to that on multiprocessors. Results derived for the uniprocessor case unfortunately do not always extend directly to the multiprocessor case. This necessitates a paradigm shift in the approach used to design and analyse the behaviour of such processors. In the case of real-time systems, that is, systems characterised by explicit timing constraints, analysis and performance guarantees are more important, as failure is unacceptable. Scheduling algorithms used in real-time systems have to be carefully designed, as the performance of the system depends critically on them. Efficient tests for determining whether a set of tasks can be feasibly scheduled on such a computing system using a particular scheduling algorithm thus assume importance. Traditionally, the ‘task utilization’ parameter has been used for devising such tests. Utilization-based tests for the Earliest Deadline First (EDF) and Rate Monotonic (RM) scheduling algorithms are known and well understood for uniprocessor systems. In our work, we derive limits on similar bounds for the multiprocessor case. Our work differs from previous literature in that we explore the case in which the individual processors constituting the multiprocessor need not be identical. Each processor in such a system is characterised by a capacity, or speed, and the time taken by a task to execute on a processor is inversely proportional to its speed. Such instances may arise during system upgrades, when faster processors may be added to the system, making it a non-identical multiprocessor, or during processor design, when the different cores on the chip may have different processing power to handle dynamic workloads.
We derive results for the partitioned paradigm of multiprocessor scheduling, that is, when tasks are partitioned among the processors, and interprocessor migration after a part of execution is completed is not allowed. Results are derived for both fixed-priority algorithms (RM) and dynamic-priority algorithms (EDF) used on individual processors. A maximum and a minimum limit on the bounds for a ‘greedy’ class of algorithms are established, since the actual value of the bound depends on the algorithm that allocates the tasks. We also derive the utilization bound of an algorithm whose bound is close to the upper limit in both cases. We find that an expression for the utilization bound can be obtained when EDF is used as the uniprocessor scheduling algorithm, but when RM is the uniprocessor scheduling algorithm, an O(mn) algorithm is required to find the utilization bound, where m is the number of tasks in the system and n is the number of processors. Knowledge of such bounds allows us to carry out very fast schedulability tests, although we have the limitation that the tests are sufficient but not necessary to ensure schedulability. We also compare the value of the bounds with those achievable in ‘equivalent’ identical multiprocessor systems and find that the performance guarantees provided by the non-identical multiprocessor system are far higher than those offered by the equivalent identical system.
Multiprocessing Dynamic-Priority Scheduling Fixed-Priority Scheduling Uniform Multiprocessor Model Multiprocessor Scheduling Allocation Decreasing Algorithm Earliest Deadline First (EDF) Rate Monotonic (RM) Computer Science
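To make the partitioned setting of entry 23 concrete, the sketch below allocates tasks to processors of different speeds with a greedy first-fit-decreasing rule and checks each processor with the standard uniprocessor EDF condition for implicit-deadline periodic tasks, scaled by speed (total utilization on a processor must not exceed its speed). It does not reproduce the utilization bounds derived in the thesis; the function name and the choice of first-fit decreasing are assumptions for illustration.

```python
def first_fit_decreasing_edf(utilizations, speeds):
    """Return {task index: processor index}, or None if the greedy allocation fails.

    Hypothetical helper: task utilizations are measured on a unit-speed
    processor, and a processor of speed s accepts tasks while their summed
    utilization stays at or below s (EDF on each processor).
    """
    tasks = sorted(range(len(utilizations)),
                   key=lambda i: utilizations[i], reverse=True)
    procs = sorted(range(len(speeds)),
                   key=lambda j: speeds[j], reverse=True)
    load = [0.0] * len(speeds)
    assignment = {}
    for i in tasks:
        for j in procs:
            # Small tolerance guards against floating-point rounding.
            if load[j] + utilizations[i] <= speeds[j] + 1e-12:
                load[j] += utilizations[i]
                assignment[i] = j
                break
        else:
            return None  # no processor can take this task
    return assignment
```

As in the abstract, a test of this kind is sufficient but not necessary: a `None` result means only that this particular greedy allocation failed, not that the task set is infeasible.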
24	A Run-Time Loop Parallelization Technique on Shared-Memory Multiprocessor Systems
Wu, Chi-Fan 06 July 2000
High-performance computing power is important for the advanced calculations of current scientific applications. A multiprocessor system obtains its high performance from the fact that some computations can proceed in parallel. A parallelizing compiler can take a sequential program as input and automatically translate it into parallel form for the target multiprocessor system. However, when loops contain arrays with irregular, nonlinear, or dynamic access patterns, no current parallelizing compiler can determine at compile time whether data dependences exist. A run-time parallel algorithm is therefore used to determine dependences and extract the potential parallelism of such loops. In this thesis, we propose an efficient run-time parallelization technique to compute a proper parallel execution schedule for those loops. This new method first detects the immediate predecessor iterations of each loop iteration and constructs an immediate predecessor table, then efficiently schedules all loop iterations into wavefronts for parallel execution. Both theoretical analysis and experimental results show that our new run-time parallelization technique achieves high speedup and low processing overhead. Furthermore, the technique is well suited to multiprocessor systems because of its high scalability.
Run-time parallelization Parallelizing compiler Multiprocessor system Wavefront scheduling
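The second step described in entry 24, turning an immediate predecessor table into wavefronts, can be sketched as follows. The sketch assumes the table has already been built and that every predecessor index is smaller than the iteration it precedes; the function name and table layout are hypothetical and are not taken from the thesis.

```python
def schedule_wavefronts(pred_table):
    """Group loop iterations into wavefronts.

    pred_table[i] lists the immediate predecessor iterations of iteration i
    (assumed to be indices smaller than i). Iterations in one wavefront have
    no dependences on each other and can run in parallel.
    """
    level = [0] * len(pred_table)
    for i, preds in enumerate(pred_table):
        # An iteration runs one wavefront after its latest predecessor;
        # with no predecessors it lands in wavefront 0.
        level[i] = 1 + max((level[p] for p in preds), default=-1)
    wavefronts = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lv in enumerate(level):
        wavefronts[lv].append(i)
    return wavefronts
```

Wavefronts are then executed one after another, with a barrier between them and the iterations inside each wavefront distributed across processors.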
25	A Broadcast Cube-Based Multiprocessor Architecture for Solving Partial Differential Equations
Murthy, Siva Ram C 01 1900
Indian Institute of Science
A large number of mathematical models in engineering and the physical sciences employ Partial Differential Equations (PDEs). The sheer number of operations required to integrate PDEs numerically in these applications has motivated the search for faster methods of computing. Conventional uniprocessor computers are often unable to fulfill the performance requirements of these computation-intensive problems. In this dissertation, a cost-effective message-based multiprocessor system, which we call the Broadcast Cube System (BCS), is proposed for solving important computation-intensive problems such as systems of linear algebraic equations and PDEs. A simulator for performance evaluation of parallel algorithms to be executed on the BCS has been implemented. A strategy (task assignment algorithm) for assigning program tasks with precedence and communication constraints to the Processing Elements (PEs) in the BCS has been developed and its effectiveness demonstrated. This task assignment algorithm has been shown to produce optimal assignments for PDE problems. Optimal partitionings of the problems (solving systems of linear algebraic equations and PDEs) into tasks, and the assignment of those tasks to the PEs in the BCS, have been given. Efficient parallel algorithms for solving these problems on the BCS have been designed. The performance of the parallel algorithms has been evaluated by both analytical and simulation methods. The results indicate that the BCS is highly effective in solving systems of linear algebraic equations and PDEs. The performance of these algorithms on the BCS has also been compared with that of their implementations on popular hypercube machines.
The results show that the performance of the BCS is better than that of the hypercubes for linear algebraic equations and compares very well for PDEs, with a modest number of PEs despite the constant PE connectivity of three in the BCS. Finally, the effectiveness of the BCS in solving non-linear PDEs occurring in two important practical problems, (i) heat transfer and fluid flow simulation and (ii) global weather modeling, has been demonstrated.
Computer and Information Science Cube-based multiprocessor architecture Partial Differential Equations
26	Energy and Reliability in Future NOC Interconnected CMPS
Kim, Hyungjun 16 December 2013
In this dissertation, I explore energy and reliability in future NoC (Network-on-Chip) interconnected CMPs (chip multiprocessors), as both have become first-order constraints in future CMP design. In the first part, we target the root cause of network energy consumption through techniques that reduce link and router-level switching activity. We specifically focus on memory subsystem traffic, as it comprises the bulk of NoC load in a CMP. Our scheme seeks to reduce network activity by transmitting only the flits that contain words predicted to be useful by a novel spatial locality predictor. We aim to further lower NoC energy consumption through microarchitectural mechanisms that inhibit datapath switching activity caused by unused words in individual flits. Using simulation-based performance studies and detailed energy models based on synthesized router designs and different link wire types, we show that (a) the prediction mechanism achieves very high accuracy, with an average rate of false-unused prediction of just 2.5%; (b) the combined NoC energy savings enabled by the predictor and microarchitectural support are 36% on average and up to 57% in the best case; and (c) there is no system performance penalty as a result of this technique. In the second part, we present a method for dynamic voltage/frequency scaling of networks-on-chip and last-level caches in CMP designs, where the shared resources form a single voltage/frequency domain. We develop a new technique for monitoring and control and validate it by running PARSEC benchmarks through full-system simulations. These techniques reduce energy-delay product by 46% compared to state-of-the-art prior work.
In the third part, we develop critical path models for HCI- and NBTI-induced wear, assuming stress caused under realistic workload conditions, and apply them to the interconnect microarchitecture. A key finding from this modeling is that, counter to prevailing wisdom, wearout in the CMP on-chip interconnect is correlated with a lack of load observed in the NoC routers, rather than high load. We then develop a novel wearout-decelerating scheme in which routers under low load have their wearout-sensitive components exercised without significantly impacting the router’s cycle time, pipeline depth, and area or power consumption. We subsequently show that the proposed design yields a 13.8∼65× increase in CMP lifetime.
Chip-Multiprocessor Network-on-Chip Low Power Design Reliability
27	Reliability and fault tolerance modelling of multiprocessor systems
Valdivia, Roberto Abraham January 1989
Reliability evaluation by analytic modelling constitutes an important issue in designing a reliable multiprocessor system. In this thesis, a model for reliability and fault tolerance analysis of the interconnection network is presented, based on graph theory. Reliability and fault tolerance are considered as deterministic and probabilistic measures of connectivity. Exact techniques for reliability evaluation fail for large multiprocessor systems because of the enormous computational resources required. Therefore, approximation techniques have to be used. Three approaches are proposed: the first simplifies the symbolic expression of reliability; the other two apply a hierarchical decomposition to the system. All these methods give results close to those obtained by exact techniques.
621.39
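The remark in entry 27 that exact reliability evaluation fails for large systems can be illustrated with a brute-force computation of one probabilistic connectivity measure. The sketch assumes perfect nodes, independent link failures, and a single link survival probability; these assumptions and the function names are illustrative and are not taken from the thesis, whose model and approximations are not reproduced here.

```python
from itertools import product


def is_connected(num_nodes, edges):
    """Depth-first search from node 0 over the surviving links."""
    adjacency = {n: [] for n in range(num_nodes)}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    seen, stack = {0}, [0]
    while stack:
        for nxt in adjacency[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == num_nodes


def all_terminal_reliability(num_nodes, edges, p_link):
    """Probability that all nodes stay connected (hypothetical exact method).

    Enumerates every up/down combination of links, so the cost grows as
    2 ** len(edges).
    """
    total = 0.0
    for state in product([True, False], repeat=len(edges)):
        working = [e for e, up in zip(edges, state) if up]
        if is_connected(num_nodes, working):
            ups = sum(state)
            total += p_link ** ups * (1 - p_link) ** (len(edges) - ups)
    return total
```

Each additional link doubles the number of states to enumerate, which is the kind of growth that motivates the approximation techniques the abstract mentions.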
28	DESIGN ENHANCEMENT AND INTEGRATION OF A PROCESSOR-MEMORY INTERCONNECT NETWORK INTO A SINGLE-CHIP MULTIPROCESSOR ARCHITECTURE
Bhide, Kanchan P. 01 January 2004
This thesis involves modeling, design, Hardware Description Language (HDL) design capture, synthesis, implementation, and HDL virtual prototype simulation validation of an interconnect network for a Hybrid Data/Command Driven Computer Architecture (HDCA) system. The HDCA is a single-chip shared-memory multiprocessor architecture. Various candidate processor-memory interconnect topologies that may meet the requirements of the HDCA system are studied and evaluated for use within the HDCA system. It is determined that the crossbar network topology best meets the HDCA system requirements, and it is therefore used as the processor-memory interconnect network of the HDCA system. The design capture, synthesis, implementation, and HDL simulation are done in VHDL using the XILINX ISE 6.2.3i and ModelSim 5.7g CAD software. The design is validated by testing it individually against a set of test cases; it is then integrated into the HDCA system and validated against two different applications. The inclusion of the crossbar switch in the HDCA architecture involved major modifications to the HDCA system and some minor changes in the design of the switch. Virtual prototype testing of the HDCA executing applications with the crossbar interconnect revealed proper functioning of the interconnect and the HDCA. Inclusion of the interconnect in the HDCA now allows it to implement dynamic node-level reconfigurability and multiple forking functionality.
29	[en] INTELLIGENT INPUT/OUTPUT CONTROLLER FOR CYGNUS COMPUTER / [pt] UM CONTROLADOR INTELIGENTE DE ENTRADA E SAÍDA PARA O SISTEMA CYGNUS
LUIS FERNANDO VIEIRA GOMES 20 June 2007
The Cygnus system is a shared-memory multiprocessor computer with a modular structure, developed at the Department of Electrical Engineering and Computer Science of PUC/RJ. The goal of this work is the introduction of a new controller for access to disks and a printer. The controller is based on the 68010 microprocessor and employs disk caching techniques in a multitasking environment. In this environment, processes cooperate via message passing to serve simultaneous requests issued by the various processors in the system.
[pt] MULTIPROCESSADOR [en] MULTIPROCESSOR [pt] MEMORIA COMPARTILHADA [en] SHARED MEMORY
30	[en] A METHODOLOGY TO ANALYZE THE PERFORMANCE OF SCIENTIFIC APPLICATIONS IN MULTIPROCESSOR SYSTEM / [pt] UMA METODOLOGIA PARA ANÁLISE DE DESEMPENHO DE APLICAÇÕES CIENTÍFICAS EM MULTIPROCESSADORES
LUIZ ANDRE BARROSO 14 September 2009
In the present work we address the issue of performance analysis of parallel scientific applications. We propose a methodology for building analytic performance models for parallel programs executing on a shared-memory multiprocessor system. The methodology is based on the building and integration of two submodels. The first submodel represents the program code and its execution flow, including the distribution of tasks among the processor elements. The second is basically a memory interference model that represents the contention of processors for shared memory resources. A performance model with low computational cost is built to illustrate the methodology and is validated by simulations.
[pt] MULTIPROCESSADOR [en] MULTIPROCESSOR [pt] MEMORIA COMPARTILHADA [en] SHARED MEMORY
