Global ETD Search

101	Predictor Virtualization: Teaching Old Caches New Tricks Burcea, Ioana Monica 20 August 2012 To improve application performance, current processors rely on prediction-based hardware optimizations, such as data prefetching and branch prediction. These hardware optimizations store application metadata in on-chip predictor tables and use the metadata to anticipate and optimize for future application behavior. As application footprints grow, the predictor tables need to scale for predictors to remain effective. One important challenge in processor design is to decide which hardware optimizations to implement and how many resources to dedicate to a specific optimization. Traditionally, processor architects employ a one-size-fits-all approach when designing predictor-based hardware optimizations: for each optimization, a fixed portion of the on-chip resources is allocated to the predictor storage. This approach often leads to sub-optimal designs where: 1) resources are wasted for applications that do not benefit from a particular predictor or require only small predictor tables, or 2) predictors under-perform for applications that need larger predictor tables that cannot be built due to area-latency-power constraints. This thesis introduces Predictor Virtualization (PV), a framework that uses the traditional processor memory hierarchy to store application metadata used in speculative hardware optimizations. This makes it possible to emulate large, more accurate predictor tables, which, in turn, leads to higher application performance. PV exploits the current trend of unprecedentedly large on-chip secondary caches and allocates on demand a small portion of the cache capacity to store application metadata used in hardware optimizations, adjusting to the application’s need for predictor resources. As a consequence, PV is a pay-as-you-go technique that emulates large predictor tables without increasing the dedicated storage overhead. 
To demonstrate the benefits of virtualizing hardware predictors, we present virtualized designs for three different hardware optimizations: a state-of-the-art data prefetcher, conventional branch target buffers and an object-pointer prefetcher. While each of these hardware predictors exhibits different characteristics that lead to different virtualized designs, virtualization improves the cost-performance trade-off for all these optimizations. PV increases the utility of traditional processor caches: in addition to being accelerators for slow off-chip memories, on-chip caches are leveraged for increasing the effectiveness of predictor-based hardware optimizations. predictor virtualization hardware optimizations processor caches 0984 0544
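The idea in the abstract above, a small dedicated structure that emulates a much larger predictor table by keeping the bulk of the metadata in the memory hierarchy, can be illustrated with a toy software model. This is only a sketch under stated assumptions: the class name, the LRU replacement policy, the write-through update and the dictionary standing in for cache-resident metadata are all hypothetical and are not taken from the thesis.

```python
from collections import OrderedDict


class VirtualizedPredictorTable:
    """Toy model: a small dedicated table backed by a larger in-memory table.

    All names and policies here are illustrative assumptions, not the
    design described in the thesis.
    """

    def __init__(self, dedicated_entries):
        self.dedicated_entries = dedicated_entries
        self.dedicated = OrderedDict()  # small on-chip structure, kept in LRU order
        self.backing = {}               # stands in for metadata held in the L2 cache/memory
        self.backing_fetches = 0        # how often the memory hierarchy had to be consulted

    def _install(self, key, value):
        # Place the entry in the dedicated table, evicting the LRU entry if full.
        self.dedicated[key] = value
        self.dedicated.move_to_end(key)
        if len(self.dedicated) > self.dedicated_entries:
            self.dedicated.popitem(last=False)

    def update(self, key, value):
        # Write-through for simplicity: the backing copy is always current.
        self.backing[key] = value
        self._install(key, value)

    def lookup(self, key):
        if key in self.dedicated:
            self.dedicated.move_to_end(key)
            return self.dedicated[key]
        if key in self.backing:
            # Miss in the dedicated table: fetch the metadata from the hierarchy.
            self.backing_fetches += 1
            value = self.backing[key]
            self._install(key, value)
            return value
        return None  # no prediction available
```

With a dedicated capacity of two entries, three updates followed by a lookup of the first key still return its prediction, at the cost of one fetch from the backing store; the dedicated storage never grows beyond two entries.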
103	Modeling and Optimization of Delay and Power for Key Components of Modern High-performance Processors Safi, Elham 13 April 2010 In designing a new processor, computer architects consider a myriad of possible organizations and designs to decide which best meets the constraints on performance, power and cost for each particular processor. To identify practical designs, architects need to have insight into the physical-level characteristics (delay, power and area) of various components of modern processors implemented in recent fabrication technologies. During early stages of design exploration, however, developing physical-level implementations for various design options (often on the order of thousands) is impractical or undesirable due to time and/or cost constraints. In lieu of actual measurements, analytical and/or empirical models can offer reasonable estimates of these physical-level characteristics. However, existing models tend to be outdated for three reasons: (i) they were developed based on old circuits in old fabrication technologies; (ii) the high-level designs of the components have evolved and older designs may no longer be representative; and (iii) the overall architecture of processors has changed significantly, and new components for which no models exist have been introduced or are being considered. This thesis studies three key components of modern high-performance processors: Counting Bloom Filters (CBFs), Checkpointed Register Alias Tables (RATs), and Compacted Matrix Schedulers (CMSs). CBFs optimize membership tests (e.g., whether a block is cached). RAT and CMS increase the opportunities for exploiting instruction-level parallelism; RAT is the core of the renaming stage, and CMS is an implementation of the instruction scheduler. Physical-level studies or models for these components have been limited or non-existent. 
In addition to investigating these components at the physical level, this thesis (i) proposes a novel speed- and energy-efficient CBF implementation; (ii) studies how the number of RAT checkpoints affects its latency and energy, and overall processor performance; and, (iii) studies the CMS and its accompanying logic at the physical level. This thesis also develops empirical and analytical latency and energy models that can be adapted for newer fabrication technologies. Additionally, this thesis proposes physical-level latency and energy optimizations for these components motivated by design inefficiencies exposed during the physical-level study phase. Processor design Computer architecture Physical-level implementation 0544
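The abstract above describes CBFs as structures that optimize membership tests, such as whether a block is cached. A minimal software sketch of a generic counting Bloom filter follows. It is not the hardware implementation proposed in the thesis: the counter count, the number of hash functions and the use of SHA-256 for hashing are assumptions chosen only to keep the example self-contained.

```python
import hashlib


class CountingBloomFilter:
    """Generic counting Bloom filter (illustrative; parameters are assumptions)."""

    def __init__(self, num_counters=1024, num_hashes=3):
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes

    def _indexes(self, item):
        # Derive num_hashes counter indexes by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % len(self.counters)

    def insert(self, item):
        for i in self._indexes(item):
            self.counters[i] += 1

    def remove(self, item):
        # Only valid for items that were previously inserted.
        for i in self._indexes(item):
            self.counters[i] -= 1

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.counters[i] > 0 for i in self._indexes(item))
```

Using counters instead of single bits is what allows entries to be removed, for example when a block is evicted from a cache. A negative answer is always correct, while a positive answer may be a false positive.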
104	A Multi-core processor for hard real-time systems Paolieri, Marco 04 November 2011 The increasing demand for new functionality in current and future hard real-time embedded systems, like the ones deployed in the automotive and avionics industries, is driving up the performance required of embedded processors. Multi-core processors are a good design solution for meeting these higher performance requirements, thanks to their better performance-per-watt ratio, while keeping the core design simple. Moreover, multi-cores also allow executing mixed-criticality workloads composed of tasks with and without hard real-time requirements, maximizing the utilization of the hardware resources while guaranteeing low cost and low power consumption. Despite those benefits, current multi-core processors are less analyzable than single-core ones due to the interference between different tasks when accessing shared hardware resources. As a result, deriving a meaningful Worst-Case Execution Time (WCET) estimate, i.e. computing an upper bound on the application's execution time, becomes extremely difficult, if not impossible, because the execution time of a task may change depending on the other threads running at the same time. This makes the WCET of a task dependent on the set of inter-task interferences introduced by the co-running tasks. Providing a WCET estimate that is independent of the other tasks (the time composability property) is a key requirement in hard real-time systems. This thesis proposes a new multi-core processor design in which time composability is achieved, hence enabling the use of multi-cores in hard real-time systems. With our proposals, the WCET estimate of a hard real-time task (HRT) is independent of the other co-running tasks. 
To that end, we design a multi-core processor in which the maximum delay that a request from a Hard Real-time Task (HRT) accessing a shared hardware resource can suffer due to other tasks is bounded: our processor guarantees that a request to a shared resource cannot be delayed longer than a given Upper Bound Delay (UBD). In addition, the UBD makes it possible to identify the impact that different processor configurations may have on the WCET by determining the sensitivity of an HRT to different resource allocations. This thesis proposes an off-line task allocation algorithm, IA3 (Interference-Aware Allocation Algorithm), that allocates the tasks in a task set based on each HRT's sensitivity to different resource allocations. As a result, the shared hardware resources used by HRTs are minimized, leaving the remaining resources to Non Hard Real-time Tasks (NHRTs). Overall, our proposals provide analyzability for the HRTs while allowing NHRTs to execute on the same chip without any effect on the HRTs. The first two proposals of this thesis focus on supporting the execution of multi-programmed workloads with mixed criticality levels (composed of HRTs and NHRTs). Higher performance could be achieved by implementing multi-threaded applications. As a first step towards supporting hard real-time parallel applications, this thesis proposes a new hardware/software approach to guarantee predictable execution of software-pipelined parallel programs. This thesis also investigates a solution to verify the timing correctness of HRTs without requiring any modification to the core design: we design a hardware unit that is interfaced with the processor and integrated into a functional-safety-aware methodology. This unit monitors the execution time of a block of instructions and detects whether it exceeds the WCET. Concretely, we show how to handle timing faults on a real industrial automotive platform. 
/ The growing demand for new functionality in current and future real-time embedded systems, in industries such as automotive and aviation, is driving an increase in the performance required of current embedded processors. Multi-core processors are an efficient way to obtain higher performance because they improve performance per watt while keeping the core design simple. Moreover, multi-core processors also make it possible to run workloads with mixed real-time levels (made up of hard and soft real-time tasks as well as tasks without real-time requirements), thereby maximizing the utilization of processor resources and guaranteeing low energy consumption. However, despite these benefits, current multi-core processors are less analyzable than single-core ones because of the interference that arises when multiple tasks access the processor's shared resources simultaneously. As a result, estimating the worst-case execution time (WCET), that is, an upper bound on the application's execution time, becomes extremely difficult, if not impossible, because the execution time of a task may change depending on the other tasks running concurrently. Determining a WCET estimate that is independent of the other tasks is a key requirement in hard real-time embedded systems. This thesis proposes a new multi-core processor design in which task execution times are composable, which will enable the use of multi-core processors in hard real-time systems. 
To that end, we design a multi-core processor in which the maximum delay that a request from a hard real-time task (HRT) can suffer when accessing a shared hardware resource because of other tasks is bounded by an upper limit (UBD). In addition, the UBD makes it possible to identify the impact that the different possible processor configurations may have on the WCET, by determining how sensitive the execution time is to different allocations of processor resources. This thesis proposes a static resource allocation algorithm (called IA3) that assigns tasks to cores according to that sensitivity. As a result, the shared processor resources used by HRTs are reduced to a minimum, allowing tasks without real-time requirements (NHRTs) to benefit from the remaining resources. The proposals presented in this thesis therefore enable WCET analysis for HRTs while also allowing NHRTs to run on the same multi-core processor without having any effect on the HRTs. The proposals presented above focus on supporting the execution of multiple workloads with different real-time levels (HRTs and NHRTs). However, higher performance can be achieved by transforming a task into multiple parallel sub-tasks. This thesis proposes a new technique, with processor and operating system support, that guarantees analyzable execution of the software pipelining parallel execution model. This thesis also investigates a solution for verifying the correctness of the WCET of HRTs without requiring any modification to the core design: a new component external to the processor is connected to it without the need to modify it. This new unit monitors the execution time of a block of instructions and detects whether the WCET is exceeded. 
This unit makes it possible to detect timing faults in computing systems used in automobiles. Real-time Multi-core Processor Embedded systems Woet 004
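The abstract above states that no request to a shared resource is delayed by more than a given Upper Bound Delay (UBD), and that a monitoring unit detects when a block of instructions exceeds its WCET. The arithmetic below sketches one simple way such a per-request bound could yield an estimate that does not depend on co-running tasks. The formula (execution time in isolation plus UBD for every shared-resource request) and both function names are assumptions made for illustration; the abstract does not describe the analysis actually used in the thesis.

```python
def composable_wcet_bound(isolated_cycles, shared_requests, ubd_cycles):
    """Upper bound assuming every shared-resource request suffers the full UBD.

    Hypothetical model: isolated_cycles is the task's worst-case time when
    running alone, and each of its shared_requests is delayed by at most
    ubd_cycles, whatever the co-running tasks do.
    """
    return isolated_cycles + shared_requests * ubd_cycles


def exceeds_wcet(observed_cycles, wcet_cycles):
    """Software stand-in for a monitor that flags a timing fault."""
    return observed_cycles > wcet_cycles


# Example: 10,000 cycles in isolation, 200 shared requests, UBD of 12 cycles.
bound = composable_wcet_bound(10_000, 200, 12)
print(bound)                        # 12400
print(exceeds_wcet(12_500, bound))  # True: the block overran its budget
```

Because the bound depends only on the task's own request count and on the UBD, it stays the same regardless of which other tasks share the chip, which is the time composability property the abstract refers to.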
105	Solving Hyperbolic PDEs using Accelerator Architectures Rostrup, Scott 15 July 2009 Accelerator architectures are used to accelerate the simulation of nonlinear hyperbolic PDEs. Three different architectures are investigated: a multicore CPU using threading, IBM’s Cell Processor, and Nvidia’s Tesla GPUs. Speed-ups of 40-75× relative to a single CPU core in single precision are obtained using the Cell processor and the GPU. The three implementations are extended to parallel computing clusters using the Message Passing Interface (MPI). The resulting hybrid-parallel code is investigated for performance and scalability on both a GPU cluster and a Cell computing cluster. GPU Cell Processor Hyperbolic PDEs Hardware Optimization Applied Mathematics
106	Implementing CAL Actor Component on Massively Parallel Processor Array Khanfar, Husni January 2010 No description available. MPPA dataflow CAL Massively Parallel Processor Array static scheduling
107	Evaluation of the Turbo-decoder Coprocessor on a TMS320C64x Digital Signal Processor Ahlqvist, Johan January 2011 One technique used to reduce the errors introduced into signals transmitted over noisy channels is error control coding. One such type of coding that performs well is turbo coding. Some of the TMS320C64x™ digital signal processors have a built-in coprocessor that performs turbo decoding. This thesis was carried out on behalf of Communication Developments within Saab AB and presents an evaluation of this coprocessor. The evaluation covers both memory consumption and data rate. The results are also compared with an implementation of turbo coding that does not use the coprocessor. / One technique used to reduce the errors a signal is exposed to during transmission over a noisy channel is error-correcting coding. One example of such coding that gives very good results is turbo coding. Some digital signal processors of the TMS320C64x™ type have a built-in coprocessor that performs turbo decoding. This thesis was carried out for Communication Development within Saab AB and presents an evaluation of this coprocessor. The evaluation covers memory consumption as well as data rate, and also includes a comparison with an implementation of turbo coding that does not use the coprocessor. Turbo decoding Turbo-decoder Coprocessor Digital Signal Processor C64x
108	Konstruktion av radiokontrollerad klocka / Design of a radio controled watch Gustavsson, Anders January 2012 The task was to receive and decode a radio time signal, DCF77. The decoder was implemented in an FPGA from ALTERA. Development was carried out in the Quartus II environment using the VHDL language, along with an alternative solution in which a soft processor was used. Both the development environment and the languages were well suited to the task. A recurring problem, however, was that the radio receiver often delivered a signal too weak to be decoded correctly. Under good reception conditions, the described circuit nevertheless worked satisfactorily. VHDL Mjuk processor Altera Nios Quartus II DCF77
110	SHAP — Scalable Multi-Core Java Bytecode Processor Zabel, Martin, Spallek, Rainer G. 14 November 2012 This paper introduces a new embedded Java multi-core architecture that shows significantly better performance for a large number of cores than the related projects JopCMP and jamuth IP multi-core. The cores gain fast access to the shared heap through a full-duplex bus with pipelined transactions. Each core is equipped with local on-chip memory for the Java operand stack and the method cache to further reduce the memory bandwidth requirements. As opposed to the related projects, synchronization is supported on a per-object basis instead of through a single lock. Load balancing is implemented in Java and requires no additional hardware. The multi-port memory manager includes an exact and fully concurrent garbage collector for automatic memory management. The design can be synthesized for a variable number of parallel cores and shows a linear increase in chip space. Three different benchmarks demonstrate the very good scalability of our architecture. Due to limited chip space on our evaluation platform, the core count could not be increased beyond 8. However, we expect a smooth performance decrease. SHAP Java Programmierung Java Bytecode Processor ddc:004 rvk:SS 5514
