Global ETD Search

191	Resource-aware clustering design for NoC-based MPSoCs / Projeto de MPSoCs baseados em NoC utilizando clusterização e gerenciamento de recursos Silva, Gustavo Girão Barreto da January 2014 (has links) Atualmente, o paradigma multicore é uma tendência fortemente estabelecida também na área de sistemas embarcados. O grau de paralelismo provido por tal arquitetura tem sido a principal causa de avanços de performance na área além de economia de energia e potência. Entretanto, para obter paralelismo eficiente desta arquitetura não é uma tarefa simples. Assim, desenvolvedores propuseram diversos modelos de ambientes de programação tentando prover o máximo de transparência possível. No nível do hardware, este crescente aumento no número de componentes dentro chip cria um problema de gerenciamento a ser tratado. No contexto deste cenário complexo, esta tese propõe o uso de abordagens de gerenciamento de recursos para aumentar a eficiência, levando em consideração tanto performance quanto consumo de energia, de ambientes MPSoC em diferentes níveis. Além disso, estas abordagens tem em comum a noção de clusterização, a qual tenta agregar recursos logicamente de acordo com as demandas da aplicação. Primeiramente no nível do processador/aplicação, é proposto um hardware dinamicamente adaptável para suportar modelos de programação paralelos distintos sem nenhum sobrecusto computacional uma vez que todo o processo é completamente transparente para o programador. Ainda neste ambiente, onde aplicações distintas podem ser executadas, é proposto um mecanismo de escalonamento visando gerenciamento de recursos para aumentar a performance chamado Processor Clustering. São propostas quatro diferentes políticas de mapeamento de recursos que tiram vantagem de aspectos distintos da natureza paralela das aplicações e das restrições arquiteturais do sistema. Entretanto, algumas aplicações tem demandas de memória mais altas do que demandas computacionais. Logo, uma abordagem similar pode ser utilizada no nível da hierarquia de memória. Neste caso, o objetivo é redistribuir recursos de memória de acordo com as demandas da aplicação. Redistribuição de memória é explorada tanto em tempo de projeto quanto em tempo de execução. Um mecanismo de mapeamento de distribuição é proposto baseado na quantidade de requisições de acesso à memória externa. Finalmente, é proposto um mecanismo de tolerância à falhas baseado em gerenciamento de recursos para memórias distribuídas dentro do chip em NoCs. É introduzido um modelo de Reliability Clustering que tira proveito da infraestrutura da NoC. Neste caso, os roteadores tem conhecimento dos blocos com falhas e blocos redundantes. Baseado neste conhecimento, o mecanismo é capaz evitar altas latências de acesso à memória. / The multicore paradigm is a solid trend nowadays, also in the field of embedded systems. The degree of parallelism provided by such architecture has been the foundation of performance advancements in the field as well as for power and energy savings. However, to obtain efficient parallelism of such architecture is not an easy task. Therefore, developers come up with several proposals of programming environments trying to provide as much transparency as possible. On the hardware side, this increasing number of on-chip components creates a management issue to be handled. In the context of this complex scenario this thesis proposes the use of resource management approaches to improve the efficiency, regarding both performance and energy consumption, of MPSoC environments at different levels. Also, these approaches have in common the notion of clustering, which tries to logically aggregate resources according to application demands. First, at the processor/application level, we propose a dynamically adaptable hardware to support distinct parallel programming models at no computational overhead, since the entire process is completely transparent to the programmer. Also, in this environment, where distinct applications can be executed, we propose a resource-aware scheduling mechanism to improve performance named Processor Clustering. We propose four different resource mapping policies that leverage on distinct aspects of the parallel nature of the applications and on architecture constraints. However, some applications have higher memory demands than computational demands. Therefore, a similar approach can be used at the memory level. In this case, we aim at redistributing memory resources according to application demands. We explore memory redistribution at both design time and runtime and propose a distribution mapping mechanism based on the amount of off-chip memory requests. Finally, we propose a resource-aware fault-tolerance mechanism for distributed on-chip memories in NoCs. We introduce a Reliability Clustering model that leverages on the NoC infrastructure. In this case, the routers have knowledge of faulty blocks and redundancy blocks and, based on that, they are able to avoid higher memory access latency. Cluster Microeletrônica Multiprocessors Networks-on-chip Cluster Resource management Parallel programming Reliability
192	Torus routing in the presence of multicasts Ishibashi, Hiroki 01 January 1996 (has links) No description available. Multiprocessors Multiplexing Telecommunication -- Switching system Message processing Algorithms Computer Sciences
193	Compiling for a multithreaded dataflow architecture : algorithms, tools, and experience / Compilation pour une architecture multi-thread à flot de données : algorithmes, outils et retour d'expérience Li, Feng 20 May 2014 (has links) Quelque-soit le multiprocesseur et son architecture, la facilité de leur programmation demeure une difficulté majeure. Une croyance bien installée est que l’exploitation correcte et efficace du parallélisme dans une application est une question pour les concepteurs d’outils de développement logiciel. Selon cette vision, nous avons besoin de techniques de compilation plus sophistiqués pour partitionner une application en threads simultanés. Mais de nombreux experts revendiquent que l'architecture joue un rôle tout aussi important: il faut opérer un changement fondamental dans l'architecture de processeurs avant que l’on puisse espérer des progrès importants au niveau de leur programmabilité. Notre approche favorise la convergence de ces points de vue. La convergence entre le calcul parallèle “en flot de données” avec l'architecture de von Neumann est porteuse de nombreuses promesses. En particulier en termes de tolérance à la latence, en termes d’exploitation d'un haut degré de parallélisme, le tout pour un très faible coût de changement de contexte entre threads. Les architectures à flot de données multithread exigent un haut degré de parallélisme pour tolérer la latence. D'autre part, le partitionnement d’un programme en un grand nombre de threads à grain fin est une source d'erreurs commune pour les développeurs. Pour reconcilier ces faits, nous nous efforçons de faire progresser l'état de l'art dans le partitionnement automatique de threads, conjointement avec le support du langage de programmation pour l’exploitation de parallélisme à plus gros grain, tout en préservant un concurrence déterministe. Cette thèse présente un algorithme général de partitionnement de threads, pour transformer du code séquentiel en un programme exprimant du parallélisme en flot de données. Notre algorithme fonctionne sur le Program Dependence Graph (PDG) et la forme en assignation unique statique (Static Single Assignment, SSA), pour extraire du parallélisme de tâche, pipeline, et de données, en présence de flot de contrôle arbitraire. Nous avons conçu une nouvelle représentation intermédiaire pour faciliter la génération de code, et son exécution parallèle en flot de données. Nous avons également mis en œuvre ces algorithmes dans un prototype fondé sur GCC, et contribué au développement d’une plateforme de simulation permettant d’explorer la parallélisation en flot de données à grande échelle. Ces extensions et l'architecture simulée permettent l'exploration de modèles innovants de mémoire pour le parallélisme en flot de données. Ces outils et modèles ont également été évalués sur des applications réalistes. / Across the wide range of multiprocessor architectures, all seem to share one common problem: they are hard to program. It is a general belief that parallelism is a software problem, and that perhaps we need more sophisticated compilation techniques to partition the application into concurrent threads. Many experts also make the point that the underlining architecture plays an equally important architecture before one may expect significant progress in the programmability of multiprocessors. Our approach favors a convergence of these viewpoints. The convergence of dataflow and von Neumann architecture promises latency tolerance, the exploitation of a high degree of parallelism, and light thread switching cost. Multithreaded dataflow architectures require a high degree of parallelism to tolerate latency. On the other hand, it is error-prone for programmers to partition the program into large number of fine grain threads. To reconcile these facts, we aim to advance the state of the art in automatic thread partitioning, in combination with programming language support for coarse-grain, functionally deterministic concurrency. This thesis presents a general thread partitioning algorithm for transforming sequential code into a parallel data-flow program targeting a multithreaded dataflow architecture. Our algorithm operates on the program dependence graph and on the static single assignment form, extracting task, pipeline, and data parallelism from arbitrary control flow, and coarsening its granularity using a generalized form of typed fusion. We design a new intermediate representation to ease code generation for an explicit token match dataflow execution model. We also implement a GCC-based prototype. We also evaluate coarse-grain dataflow extensions of OpenMP in the context of a large-scale 1024-core, simulated multithreaded dataflow architecture. These extension and simulated architecture allow the exploration of innovative memory models for dataflow computing. We evaluate these tools and models on realistic applications. Flot de données Parallélisation Multiprocesseur Architecture Partitionnement d'un programme Dataflow Multiprocessors 004
194	Simulation and emulation of massively parallel processor for solving constraint satisfaction problems based on oracles Chaudhari, Gunavant Dinkar 01 January 2011 (has links) Most part of my thesis is devoted to efficient automated logic synthesis of oracle processors. These Oracle Processors are of interest to several modern technologies, including Scheduling and Allocation, Image Processing and Robot Vision, Computer Aided Design, Games and Puzzles, and Cellular Automata, but so far the most important practical application is to build logic circuits to solve various practical Constraint Satisfaction Problems in Intelligent Robotics. For instance, robot path planning can be reduced to Satisfiability. In short, an oracle is a circuit that has some proposition of solution on the inputs and answers yes/no to this proposition. In other language, it is a predicate or a concept-checking machine. Oracles have many applications in AI and theoretical computer science but so far they were not used much in hardware architectures. Systematic logic synthesis methodologies for oracle circuits were so far not a subject of a special research. It is not known how big advantages these processors will bring when compared to parallel processing with CUDA/GPU processors, or standard PC processing. My interest in this thesis is only in architectural and logic synthesis aspects and not in physical (technological) design aspects of these circuits. In future, these circuits will be realized using reversible, nano and some new technologies, but the interest in this thesis is not in the future realization technologies. We want just to answer the following question: Is there any speed advantage of the new oracle-based architectures, when compared with standard serial or parallel processors? Oracle processors Automated logic synthesis Processing speed Multiprocessors Logic circuits
195	REAL-TIME SCHEDULING ALGORITHMS FOR PRECEDENCE RELATED TASKS ON HETEROGENEOUS MULTIPROCESSORS AULUCK, NITIN 23 May 2005 (has links) No description available. Computer Science Real-Time Systems Task Scheduling Heterogeneous Multiprocessors Hard and Soft Deadlines Scalability Task Duplication
196	Capacity Metric for Chip Heterogeneous Multiprocessors Otoom, Mwaffaq Naif 05 March 2012 (has links) The primary contribution of this thesis is the development of a new performance metric, Capacity, which evaluates the performance of Chip Heterogeneous Multiprocessors (CHMs) that process multiple heterogeneous channels. Performance metrics are required in order to evaluate any system, including computer systems. A lack of appropriate metrics can lead to ambiguous or incorrect results, something discovered while developing the secondary contribution of this thesis, that of workload modes for CHMs — or Workload Specific Processors (WSPs). For many decades, computer architects and designers have focused on techniques that reduce latency and increase throughput. The change in modern computer systems built around CHMs that process multi-channel communications in the service of single users calls this focus into question. Modern computer systems are expected to integrate tens to hundreds of processor cores onto single chips, often used in the service of single users, potentially as a way to access the Internet. Here, the design goal is to integrate as much functionality as possible during a given time window. Without the ability to correctly identify optimal designs, not only will the best performing designs not be found, but resources will be wasted and there will be a lack of insight to what leads to better performing designs. To address performance evaluation challenges of the next generation of computer systems, such as multicore computers inside of cell phones, we found that a structurally different metric is needed and proceeded to develop such a metric. In contrast to single-valued metrics, Capacity is a surface with dimensionality related to the number of input streams, or channels, processed by the CHM. We develop some fundamental Capacity curves in two dimensions and show how Capacity shapes reveal interaction of not only programs and data, but the interaction of multiple data streams as they compete for access to resources on a CHM as well. For the analysis of Capacity surface shapes, we propose the development of a demand characterization method in which its output is in the form of a surface. By overlaying demand surfaces over Capacity surfaces, we are able to identify when a system meets its demands and by how much. Using the Capacity metric, computer performance optimization is evaluated against workloads in the service of individual users instead of individual applications, aggregate applications, or parallel applications. Because throughput was originally derived by drawing analogies between processor design and pipelines in the automobile industry, we introduce our Capacity metric for CHMs by drawing an analogy to automobile production, signifying that Capacity is the successor to throughput. By developing our Capacity metric, we illustrate how and why different processor organizations cannot be understood as being better performers without both magnitude and shape analysis in contrast to other metrics, such as throughput, that consider only magnitude. In this work, we make the following major contributions: • Definition and development of the Capacity metric as a surface with dimensionality related to the number of input streams, or channels, processed by the CHM. • Techniques for analysis of the Capacity metric. Since the Capacity metric was developed out of necessity, while pursuing the development of WSPs, this work also makes the following minor contributions: • Definition and development of three foundations in order to establish an experimental foundation — a CHM model, a multimedia cell phone example, and a Workload Specific Processor (WSP). • Definition of Workload Modes, which was the original objective of this thesis. • Definition and comparison of two approaches to workload mode identification at run time; The Workload Classification Model (WCM) and another model that is based on Hidden Markov Models (HMMs). • Development of a foundation for analysis of the Capacity metric, so that the impact of architectural features in a CHM may be better understood. In order to do this, we develop a Demand Characterization Method (DCM) that characterizes the demand of a specific usage pattern in the form of a curve (or a surface in general). By doing this, we will be able to overlay demand curves over Capacity curves of different architectures to compare their performance and thus identify optimal performing designs. / Ph. D. Multichannel Workloads Workload Modes Workload Specific Processors Chip Heterogeneous Multiprocessors Capacity Metric
197	Utility Accrual Real-Time Scheduling and Synchronization on Single and Multiprocessors: Models, Algorithms, and Tradeoffs Cho, Hyeonjoong 26 September 2006 (has links) This dissertation presents a class of utility accrual scheduling and synchronization algorithms for dynamic, single and multiprocessor real-time systems. Dynamic real-time systems operate in environments with run-time uncertainties including those on activity execution times and arrival behaviors. We consider the time/utility function (or TUF) timing model for specifying application time constraints, and the utility accrual (or UA) timeliness optimality criteria of satisfying lower bounds on accrued activity utility, and maximizing the total accrued utility. Efficient TUF/UA scheduling algorithms exist for single processors---e.g., the Resource-constrained Utility Accrual scheduling algorithm (RUA), and the Dependent Activity Scheduling Algorithm (DASA). However, they all use lock-based synchronization. To overcome shortcomings of lock-based (e.g., serialized object access, increased run-time overhead, deadlocks), we consider non-blocking synchronization including wait-free and lock-free synchronization. We present a buffer-optimal, scheduler-independent wait-free synchronization protocol (the first such), and develop wait-free versions of RUA and DASA. We also develop their lock-free versions, and upper bound their retries under the unimodal arbitrary arrival model. The tradeoff between wait-free, lock-free, and lock-based is fundamentally about their space and time costs. Wait-free sacrifices space efficiency in return for no additional time cost, as opposed to the blocking time of lock-based and the retry time of lock-free. We show that wait-free RUA/DASA outperform lock-based RUA/DASA when the object access times of both approaches are the same, e.g., when the shared data size is so large that the data copying process dominates the object access time of two approaches. We derive lower bounds on the maximum accrued utility that is possible with wait-free over lock-based. Further, we show that when maximum sojourn times under lock-free RUA/DASA is shorter than under lock-based, it is a necessary condition that the object access time of lock-free is shorter than that of lock-based. We also establish the maximum increase in activity utility that is possible under lock-free and lock-based. Multiprocessor TUF/UA scheduling has not been studied in the past. For step TUFs, periodic arrivals, and under-loads, we first present a non-quantum-based, optimal scheduling algorithm called Largest Local Remaining Execution time-tasks First (or LLREF) that yields the optimum total utility. We then develop another algorithm for non-step TUFs, arbitrary arrivals, and overloads, called the global Multiprocessor Utility Accrual scheduling algorithm (or gMUA). We show that gMUA lower bounds each activity's accrued utility, as well as the system-wide, total accrued utility. We consider lock-based, lock-free, and wait-free synchronization under LLREF and gMUA. We derive LLREF's and gMUA's minimum-required space cost for wait-free synchronization using our space-optimal wait-free algorithm, which also applies for multiprocessors. We also develop lock-free versions of LLREF and gMUA with bounded retries. While the tradeoff between wait-free LLREF/gMUA versus lock-based LLREF/gMUA is similar to that for the single processor case, that between lock-free LLREF/gMUA and lock-based LLREF/gMUA hinges on the cost of the lock-free retry, blocking time under lock-based, and the operating system overhead. / Ph. D. Real-Time Scheduling Time-Utility Functions Optimality Multiprocessors Utility Accrual Real-Time Scheduling Non-Blocking Synchronization
198	Processor and link assignment using simulated annealing Bollinger, S. Wayne 21 July 2010 (has links) Advances in VLSI technology have made possible a new generation of multicomputer systems that contain hundreds or thousands of processors and offer a combined processing power far beyond that possible in a single processor machine. Multicomputers can be used to solve a variety of computationally intensive applications, but they introduce the problem of handling communication between concurrent processes. In the design of multicomputer systems, the scheduling and mapping of a parallel algorithm onto a host architecture has a critical impact on overall system performance. The problem of how to best assign the resources of a host multicomputer system to the cooperating tasks of a parallel application program is known as the mapping problem. In this research we develop a graph-based solution to both aspects of the mapping problem using the simulated annealing optimization heuristic. A two phase mapping strategy is formulated: I) process annealing assigns parallel processes to processing nodes, and 2) connection annealing schedules traffic connections on network data links so that interprocess communication conflicts are minimized. To evaluate the quality of generated mappings, cost functions suitable for simulated annealing are derived that accurately quantify communication overhead. Communication efficiency is formulated to measure the quality of assignments when the optimal mapping is unknown. Application examples are presented using the hypercube as a host architecture, with host graphs containing up to 512 nodes. The influence of various parameters on the annealing algorithms is investigated, and results for several image graphs are presented. / Master of Science LD5655.V855 1988.B644 Digital mapping Multiprocessors
199	Interfacing the IBM PC with the STD bus for multiprocessing Datta, Diptish 14 November 2012 (has links) The advent of the Personal Computer into the technical world has, at an extremely reasonable expense and trouble, made available to us, considerable computational power. But, as it was with computers, the next logical step is to have multiple units running in concert, or, in other words, sharing the load. This leads to the concept of Multiprocessing in order to attain an enhancement in operation speed and superior efficiency. The IBM PC is a versatile and market proven personal computer with a very large volume of software support and the STD Bus is a standard that has been developed to cope with a variable support, i.e. different processors and different I/O capabilities. Together, they combine the user interface - the display and keyboard Â· of the PC, the processing capabilities of the PC, the I/O capabilities of the STD Bus and the support processing possible on the STD Bus. The resulting system is powerful, easy to use and it has a lot of scope for development. / Master of Science LD5655.V855 1985.D377 Computer interfaces IBM Personal Computer Microcomputers -- Buses Multiprocessors
200	Implementation of a Hardware-Optimized MPI Library for the SCMP Multiprocessor Poole, Jeffrey Hyatt 16 August 2004 (has links) As time progresses, computer architects continue to create faster and more complex microprocessors using techniques such as out-of-order execution, branch prediction, dynamic scheduling, and predication. While these techniques enable greater performance, they also increase the complexity and silicon area of the design. This creates larger development and testing times. The shrinking feature sizes associated with newer technology increase wire resistance and signal propagation delays, further complicating large designs. One potential solution is the Single-Chip Message-Passing (SCMP) Parallel Computer, developed at Virginia Tech. SCMP makes use of an architecture where a number of simple processors are tiled across a single chip and connected by a fast interconnection network. The system is designed to take advantage of thread-level parallelism and to keep wire traces short in preparation for even smaller integrated circuit feature sizes. This thesis presents the implementation of the MPI (Message-Passing Interface) communications library on top of SCMP's hardware communication support. Emphasis is placed on the specific needs of this system with regards to MPI. For example, MPI is designed to operate between heterogeneous systems; however, in the SCMP environment such support is unnecessary and wastes resources. The SCMP network is also designed such that messages can be sent with very low latency, but with cooperative multitasking it is difficult to assure a timely response to messages. Finally, the low-level network primitives have no support for send operations that occur before the receiver is prepared and that functionality is necessary for MPI support. / Master of Science Message-Passing Systems Single-Chip Systems Parallel Architecture Chip Multiprocessors Message Passing Interface MPI

Search results