31

Caracterização energética da codificação de vídeo de alta eficiência (HEVC) em processador de propósito geral / Energy characterization of high efficiency video coding (HEVC) in general purpose processor

Monteiro, Eduarda Rodrigues, January 2017
The popularization of applications that manipulate high-resolution digital video brings several challenges to the development of new and efficient techniques for maintaining video compression efficiency. To meet this demand, the HEVC standard was proposed with the goal of doubling compression rates compared with its predecessors. However, to achieve this goal, HEVC imposes a high computational cost and, consequently, an increase in energy consumption. This scenario becomes even more concerning for battery-powered mobile devices, which have computational constraints for processing multimedia applications. Most related works typically concentrate their contributions on reducing and managing the computational effort of the encoding process; the literature therefore lacks information on the energy consumed by video encoding and, especially, on the energy impact of the cache hierarchy in this context. This thesis presents a methodology for the energy characterization of the HEVC video encoder on general-purpose processors. The main goal of this methodology is to provide quantitative data regarding HEVC energy consumption. The methodology is composed of two modules: one focused on HEVC encoding itself and the other on HEVC behavior with respect to cache memory. One of the main advantages of the second module is that it remains independent of application and processor architecture. Several analyses were performed to characterize the energy consumption of the HEVC encoder on a general-purpose processor, considering different video sequences, resolutions, and encoder parameters. In addition, an extensive and detailed analysis of different cache configurations was performed to evaluate their energy impact during encoding. The results obtained with the proposed characterization demonstrate that managing the video coding parameters jointly with the cache specification has a high potential for reducing the energy consumption of video coding while maintaining good visual quality in the encoded sequences.
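As a rough illustration of the kind of estimate a cache-energy module like the one described above can produce, the sketch below replays an address trace through a direct-mapped cache and combines hit/miss counts with per-access energy costs. The cache geometry and energy constants are illustrative assumptions, not values from the thesis.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Minimal sketch: estimate cache energy from an address trace.
// The energy constants are illustrative placeholders.
struct CacheModel {
    std::size_t sets;        // number of sets (direct-mapped here)
    std::size_t lineBytes;   // cache line size
    double eHitNJ;           // assumed energy per hit (nJ)
    double eMissNJ;          // assumed energy per miss, incl. refill (nJ)
    std::vector<uint64_t> tags;
    std::vector<bool> valid;
    uint64_t hits = 0, misses = 0;

    CacheModel(std::size_t s, std::size_t lb, double eh, double em)
        : sets(s), lineBytes(lb), eHitNJ(eh), eMissNJ(em),
          tags(s, 0), valid(s, false) {}

    void access(uint64_t addr) {
        uint64_t line = addr / lineBytes;
        std::size_t set = line % sets;
        uint64_t tag = line / sets;
        if (valid[set] && tags[set] == tag) { ++hits; }
        else { ++misses; valid[set] = true; tags[set] = tag; }
    }

    double energyNJ() const { return hits * eHitNJ + misses * eMissNJ; }
};

int main() {
    // 256 sets x 64-byte lines; 0.1 nJ per hit, 1.5 nJ per miss (assumed).
    CacheModel cache(256, 64, 0.1, 1.5);
    for (uint64_t a = 0; a < (1u << 20); a += 4) cache.access(a); // toy trace
    std::cout << "hits=" << cache.hits << " misses=" << cache.misses
              << " energy=" << cache.energyNJ() << " nJ\n";
}
```

Sweeping the set count, line size, and associativity of such a model over a fixed encoder trace is one way to compare the energy impact of different cache configurations.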
32

A smoothed particle hydrodynamic simulation utilizing the parallel processing capabilities of the GPUs

Lundqvist, Viktor, January 2009
Simulating fluid behavior has proven to be a demanding challenge which requires complex computational models and highly efficient data structures. Smoothed Particle Hydrodynamics (SPH) is a particle-based computational model used to simulate fluid behavior that has been found capable of producing convincing results. However, the SPH algorithm is computationally heavy, which makes it cumbersome to work with. This master's thesis describes how the SPH algorithm can be accelerated by utilizing the GPU's computational resources. It describes a model for how to distribute the workload on the GPU and presents a suitable data structure. In addition, it proposes a method to represent and handle moving objects in the fluid's surroundings. Finally, the performance gain due to the GPU is evaluated by comparing processing times with an identical implementation running solely on the CPU.
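For readers unfamiliar with SPH, the sketch below shows the density-estimation step that dominates such simulations: each particle sums a smoothing kernel over nearby particles. This is a generic CPU-side illustration using the standard poly6 kernel, not the thesis's GPU implementation.

```cpp
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };

// Generic SPH density estimation with the standard poly6 kernel:
// W(r, h) = 315 / (64*pi*h^9) * (h^2 - r^2)^3 for r <= h, else 0.
// Brute-force O(n^2) neighbor search for clarity; practical (and GPU)
// implementations use spatial data structures to find neighbors.
std::vector<float> computeDensities(const std::vector<Vec3>& pos,
                                    float h, float particleMass) {
    const float kPi = 3.14159265f;
    const float h2 = h * h;
    const float poly6 = 315.0f / (64.0f * kPi * std::pow(h, 9.0f));
    std::vector<float> density(pos.size(), 0.0f);
    for (std::size_t i = 0; i < pos.size(); ++i) {
        for (std::size_t j = 0; j < pos.size(); ++j) {
            float dx = pos[i].x - pos[j].x;
            float dy = pos[i].y - pos[j].y;
            float dz = pos[i].z - pos[j].z;
            float r2 = dx * dx + dy * dy + dz * dz;
            if (r2 <= h2) {
                float d = h2 - r2;
                density[i] += particleMass * poly6 * d * d * d;
            }
        }
    }
    return density;
}
```

Because each particle's sum is independent, this loop is embarrassingly parallel, which is exactly what makes SPH a good fit for the GPU.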
33

Multi-Resolution Modeling of Managed Lanes with Consideration of Autonomous/Connected Vehicles

Fakharian Qom, Somaye, 29 June 2016
Advanced modeling tools and methods are essential components for the analysis of congested conditions and advanced Intelligent Transportation Systems (ITS) strategies such as Managed Lanes (ML). A number of tools with different analysis resolution levels have been used to assess these strategies. These tools can be classified as sketch planning, macroscopic simulation, mesoscopic simulation, microscopic simulation, static traffic assignment, and dynamic traffic assignment tools. Due to the complexity of the managed-lane modeling process, this dissertation investigated a Multi-Resolution Modeling (MRM) approach that combines a number of these tools for more efficient and accurate assessment of ML deployments. This study clearly demonstrated the differences in the accuracy of the results produced by the traffic flow models incorporated into different tools when compared with real-world measurements. This difference in accuracy highlighted the importance of selecting the appropriate analysis levels and tools to better estimate ML and General Purpose Lanes (GPL) performance. The results also showed the importance of calibrating traffic flow model parameters, demand matrices, and assignment parameters based on real-world measurements to ensure accurate forecasts of real-world traffic conditions. In addition, the results indicated that the real-world utilization of ML by travelers is best predicted with dynamic traffic assignment modeling that incorporates the travel time, toll, and travel time reliability of alternative paths in the assignment objective function. Replicating the specific dynamic pricing algorithm used in the real world was also found to provide a better forecast of ML utilization. With regard to Connected Vehicle (CV) operations on ML, this study demonstrated the benefits of using results from tools with different modeling resolutions to support each other's analyses. In general, the results showed that providing toll incentives for Cooperative Adaptive Cruise Control (CACC)-equipped vehicles to use ML is not beneficial at lower market penetrations of CACC due to the small increase in capacity at these market penetrations. However, such incentives were found to be beneficial at higher market penetrations, particularly with higher demand levels.
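As a hedged illustration of the kind of assignment objective the abstract refers to, the sketch below computes a generalized path cost combining travel time, toll, and a reliability term. The coefficient names and values are assumptions for illustration, not parameters from the dissertation.

```cpp
#include <vector>

// Illustrative generalized cost for route choice in dynamic traffic
// assignment. Assumed placeholders: value of time (VOT, $/min)
// converts toll into equivalent minutes; beta weights reliability
// (here, the standard deviation of travel time).
struct PathOption {
    double travelTimeMin;  // expected travel time (minutes)
    double tollDollars;    // toll charged on this path
    double ttStdDevMin;    // travel time variability (minutes)
};

double generalizedCost(const PathOption& p, double votDollarsPerMin = 0.5,
                       double beta = 0.8) {
    return p.travelTimeMin + p.tollDollars / votDollarsPerMin
         + beta * p.ttStdDevMin;
}

// Choose the path with the lowest generalized cost (e.g., ML vs. GPL).
std::size_t choosePath(const std::vector<PathOption>& paths) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < paths.size(); ++i)
        if (generalizedCost(paths[i]) < generalizedCost(paths[best]))
            best = i;
    return best;
}
```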
34

Putting Queens in Carry Chains

Preußer, Thomas B., Nägel, Bernd, Spallek, Rainer G., 14 November 2012
This paper describes an FPGA implementation of a solution-counting solver for the N-Queens Puzzle. The proposed algorithmic mapping utilizes the fast carry-chain logic found on modern FPGA architectures in order to achieve a regular and efficient design. Starting from an initial full-chessboard mapping, several optimization strategies are explored. We also describe the infrastructure we constructed for computing the currently unknown solution count of the 26-Queens Puzzle. Finally, we compare the performance of the concrete FPGA device mappings we used against general-purpose CPUs.
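The carry-chain mapping builds on the classic bit-vector formulation of N-Queens solution counting, in which occupied columns and diagonals are tracked as bit masks shifted row by row; it is the additions and shifts over such masks that map naturally onto FPGA carry chains. A software sketch of that formulation (not the paper's FPGA design itself) looks like this:

```cpp
#include <cstdint>
#include <iostream>

// Classic bit-vector N-Queens solution counter. 'cols' marks occupied
// columns; 'diag1'/'diag2' mark occupied diagonals, shifted one step
// as the search advances to the next row.
static uint64_t countQueens(uint32_t cols, uint32_t diag1, uint32_t diag2,
                            uint32_t all) {
    if (cols == all) return 1;               // a queen in every row
    uint64_t count = 0;
    uint32_t free = all & ~(cols | diag1 | diag2);
    while (free) {
        uint32_t bit = free & (~free + 1);   // lowest free square
        free -= bit;
        count += countQueens(cols | bit,
                             (diag1 | bit) << 1,
                             (diag2 | bit) >> 1,
                             all);
    }
    return count;
}

int main() {
    int n = 8;
    uint32_t all = (1u << n) - 1;
    std::cout << "N=" << n << ": " << countQueens(0, 0, 0, all)
              << " solutions\n";             // prints 92 for N=8
}
```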
35

Predicting Critical Warps in Near-Threshold GPGPU Applications Using a Dynamic Choke Point Analysis

Sanyal, Sourav, 01 August 2019
General-purpose graphics processing units (GP-GPUs), owing to their enormous thread-level parallelism, can significantly reduce power consumption in the near-threshold (NTC) operating region while offering close to super-threshold performance. However, process variation (PV) can drastically reduce GPU performance at NTC. In this work, choke points, a unique device-level characteristic of PV at NTC that can exacerbate the warp criticality problem in GPUs, are explored. It is shown that modern warp schedulers cannot tackle choke-point-induced critical warps in an NTC GPU. Additionally, the Choke Point Aware Warp Speculator, a circuit-architectural solution, is proposed to dynamically predict the critical warps in GPUs and accelerate them in their respective execution units. The best scheme achieves an average improvement of ∼39% in performance and ∼31% in energy efficiency over a state-of-the-art warp scheduler, across 15 GPGPU applications, while incurring marginal hardware overheads.
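As a loose software analogy of what warp-criticality prediction entails (the actual speculator in this work is a circuit-level design using choke-point information), the sketch below flags the warp of a thread block that lags furthest behind its peers:

```cpp
#include <cstddef>
#include <vector>

// Toy illustration of warp criticality: the critical warp is the one
// whose progress (e.g., retired instruction count) trails the rest of
// the thread block, since the block cannot complete until it finishes.
// This is only a conceptual analogy, not the proposed predictor.
std::size_t criticalWarp(const std::vector<long>& retiredPerWarp) {
    std::size_t slowest = 0;
    for (std::size_t w = 1; w < retiredPerWarp.size(); ++w)
        if (retiredPerWarp[w] < retiredPerWarp[slowest])
            slowest = w;
    return slowest;  // candidate for acceleration or priority boost
}
```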
36

Phenotyping Rodent Models of Obesity Using Magnetic Resonance Imaging

Johnson, David Herbert, January 2010
No description available.
37

Proton Computed Tomography: Matrix Data Generation Through General Purpose Graphics Processing Unit Reconstruction

Witt, Micah, 01 March 2014
Proton computed tomography (pCT) is an imaging modality that will improve treatment planning for patients receiving proton radiation therapy compared with current techniques, which are based on X-ray CT. Images are reconstructed in pCT by solving a large and sparse system of linear equations. The size of the system necessitates matrix partitioning and parallel reconstruction algorithms implemented across some form of cluster computing architecture. The prototypical algorithm for solving the pCT system is the algebraic reconstruction technique (ART), which has been modified into parallel versions called block-iterative-projection (BIP) methods and string-averaging-projection (SAP) methods. General-purpose graphics processing units (GPGPUs) have hundreds of stream processors for massively parallel calculations. A GPGPU cluster is a set of nodes, with each node containing a set of GPGPUs. This thesis describes a proton simulator that was developed to generate realistic pCT data sets. Simulated data sets were used to compare the performance of a BIP implementation against a SAP implementation on a single GPGPU, with the data stored in a sparse matrix structure called the compressed sparse row (CSR) format. Both BIP and SAP algorithms allow for parallel computation by creating row partitions of the pCT linear system. The difference between these two general classes of algorithms is that BIP permits parallel computation within the row partitions but sequential computation between them, whereas SAP permits parallel computation between the row partitions but sequential computation within them. This thesis also introduces a general partitioning scheme to be applied to a GPGPU cluster to achieve a purely parallel ART algorithm, while providing a framework for column partitioning of the pCT system, and shows sparsity visualization patterns that emerge from specified orderings of the equations within the matrix.
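To make the reconstruction machinery concrete, here is a hedged sketch of one sweep of the algebraic reconstruction technique (Kaczmarz projections) over a CSR-stored system Ax = b; the relaxation parameter and structure are generic, not the thesis's tuned GPGPU kernels. BIP and SAP differ essentially in which of these row updates execute in parallel versus sequentially.

```cpp
#include <vector>

// Sparse system Ax = b in compressed sparse row (CSR) format.
struct CSRMatrix {
    std::vector<double> val;    // nonzero values
    std::vector<int>    col;    // column index of each nonzero
    std::vector<int>    rowPtr; // row start offsets (size rows + 1)
};

// One full ART (Kaczmarz) sweep: project the estimate x onto the
// hyperplane of each equation a_i . x = b_i in turn:
//   x <- x + lambda * (b_i - a_i . x) / ||a_i||^2 * a_i
void artSweep(const CSRMatrix& A, const std::vector<double>& b,
              std::vector<double>& x, double lambda = 1.0) {
    int rows = static_cast<int>(A.rowPtr.size()) - 1;
    for (int i = 0; i < rows; ++i) {
        double dot = 0.0, norm2 = 0.0;
        for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
            dot   += A.val[k] * x[A.col[k]];
            norm2 += A.val[k] * A.val[k];
        }
        if (norm2 == 0.0) continue;  // skip empty rows
        double c = lambda * (b[i] - dot) / norm2;
        for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
            x[A.col[k]] += c * A.val[k];
    }
}
```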
38

Efficient Cache Organization For Application Specific And General Purpose Processors

Rajan, Kaushik, 05 1900
The performance gap between processor and memory continues to be a major bottleneck in both application-specific and general-purpose processors. This thesis strives to ease this bottleneck by exploiting the characteristics of the application domain to improve the cache organization for two distinct processor architectures: (1) application-specific processors for packet forwarding, and (2) general-purpose processors. Packet forwarding algorithms make use of a trie data structure to determine the forwarding route. We observe that the locality characteristics of the nodes at various levels of such a trie are different. Nodes that are closer to the root node, especially those that are immediate children of the root node (level-one nodes), exhibit higher temporal locality than nodes lower down the trie. Based on this observation, we propose a novel Heterogeneously Segmented Cache Architecture (HSCA) that uses separate caches for level-one and lower-level nodes, each with carefully chosen sizes. We also propose a new replacement policy to enhance the performance of HSCA. Performance evaluation indicates that HSCA yields up to a 32% reduction in average memory access time over a unified cache that shares the same cache space among all levels of the trie. HSCA also outperforms a previously proposed results cache. The use of a large root branching factor in a forwarding trie forcefully introduces a large number of nodes at level one. Among these, only nodes that cover prefixes from the routing table are useful, while the rest are superfluous. We find that as many as 75% of the level-one nodes are superfluous. This leads to a skewed distribution of useful nodes among the cache sets of the level-one nodes cache. We propose a novel two-level mapping framework that achieves a better nodes-to-cache-sets mapping and hence incurs fewer conflict misses. Two-level mapping first aggregates nodes into Initial Partitions (IPs) using lower-order bits and then remaps them from IPs into Refined Partitions (RPs), which form sets, based on some higher-order bits. It provides flexibility in placement by allowing each IP to choose a different remap function. We propose three schemes conforming to the framework. A speedup in average memory access time of as much as 16% is gained over HSCA. In general-purpose processor architectures, the design objectives of caches at various levels of the hierarchy are different. To ensure low access latencies, L1 caches are small and have low associativities, making them more susceptible to conflict misses. The extent of conflict misses incurred is governed by the placement function and the memory access patterns exhibited by the program. We propose a mechanism to learn the access characteristics of the program at runtime by analyzing its repetitive phases, and we then make use of the two-level mapping framework to dynamically adapt the placement function. Further, we elegantly incorporate two-level mapping into the cache organization without increasing the cache access latency. Performance evaluation reveals that the proposed adaptive placement mechanism eliminates 32-36% of misses on average over a range of cache sizes. To prevent expensive off-chip accesses, L2 caches are larger and have higher associativities; hence, the replacement policy plays a significant role in determining L2 cache performance. Further, as the inherent temporal locality in memory accesses is filtered out by the L1 cache, an L2 cache using the widely prevalent LRU replacement policy incurs significantly more misses than the optimal replacement policy (OPT). We propose to bridge this gap through a novel replacement strategy that mimics the replacement decisions of OPT. The L2 cache is logically divided into two components: a Shepherd Cache (SC) with simple FIFO replacement and a Main Cache (MC) with an emulation of optimal replacement. The SC plays the dual role of caching lines and shepherding the replacement decisions of the MC close to optimal. Our proposed organization can cover 40% of the gap between LRU and OPT, resulting in a 7% overall speedup.
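A hedged sketch of the two-level mapping idea described above: blocks are first grouped into initial partitions by lower-order bits, and each initial partition then applies its own remap function on higher-order bits to pick the final set. The bit widths and the XOR-based remap below are illustrative assumptions, not the thesis's exact scheme.

```cpp
#include <cstdint>
#include <vector>

// Illustrative two-level placement: low-order address bits choose an
// Initial Partition (IP); an IP-specific remap of higher-order bits
// yields the Refined Partition (RP) that selects the cache set.
struct TwoLevelMapping {
    unsigned ipBits;                 // low-order bits selecting the IP
    unsigned rpBits;                 // width of the per-IP remap output
    std::vector<uint32_t> remapKey;  // one remap function (key) per IP;
                                     // must be sized to 1 << ipBits

    uint32_t setIndex(uint64_t blockAddr) const {
        uint32_t ip   = blockAddr & ((1u << ipBits) - 1);
        uint32_t high = (blockAddr >> ipBits) & ((1u << rpBits) - 1);
        // Per-IP remap: here a simple XOR with a per-IP key stands in
        // for the thesis's freely chosen remap functions, which can be
        // adapted at runtime from observed program phases.
        uint32_t rp = high ^ remapKey[ip];
        return (rp << ipBits) | ip;  // final set index
    }
};
```

Letting each IP pick its own remap key is what spreads a skewed distribution of hot blocks (such as the useful level-one trie nodes) more evenly across sets.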
39

SHAP – Secure Hardware Agent Platform

Zabel, Martin, Preußer, Thomas B., Reichel, Peter, Spallek, Rainer G., 11 June 2007
This paper presents a novel implementation of an embedded Java microarchitecture for secure, real-time, and multi-threaded applications. Together with support for modern features of object-oriented languages, such as exception handling, automatic garbage collection, and interface types, a general-purpose platform is established that is also well suited to the agent concept. In particular, to address real-time issues, new techniques have been implemented in our Java microarchitecture, such as integrated stack and thread management for fast context switching, concurrent garbage collection for real-time threads, and autonomous control flows through preemptive round-robin scheduling.
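As a minimal illustration of the preemptive round-robin policy mentioned above (a generic software sketch, not SHAP's hardware thread manager), a ready queue rotates on every time-slice expiry:

```cpp
#include <deque>

// Generic preemptive round-robin scheduling sketch: on each timer
// preemption the running thread moves to the back of the ready queue
// and the next thread runs. SHAP realizes this policy in hardware
// with integrated thread management; this only shows the policy.
template <typename ThreadId>
class RoundRobin {
    std::deque<ThreadId> ready;
public:
    void admit(ThreadId t) { ready.push_back(t); }
    // Called on time-slice expiry; requires a non-empty ready queue.
    ThreadId preempt() {
        ThreadId running = ready.front();
        ready.pop_front();
        ready.push_back(running);  // demote the preempted thread
        return ready.front();      // next thread to run
    }
    bool idle() const { return ready.empty(); }
};
```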
40

FPGA prototyping of custom GPGPUs

Nigania, Nimit, 08 January 2014
Prototyping new systems on hardware is a time-consuming task with limited scope for architectural exploration. The aim of this work was to perform fast prototyping of general-purpose graphics processing units (GPGPUs) on field-programmable gate arrays (FPGAs) using a novel tool chain. This hardware flow, combined with a higher-level simulation flow using the same source code, allowed us to create a whole tool chain for studying and building future architectures using new technologies. It also gave us enough flexibility at different granularities to make architectural decisions. We also discuss some example systems built using this tool chain, along with some results.
