About
The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations. Our metadata is collected from universities around the world. If you manage a university, consortium, or country archive and want to be added, details can be found on the NDLTD website.
1

Bounding the Worst-Case Response Times of Hard-Real-Time Tasks under the Priority Ceiling Protocol in Cache-Based Architectures

Poluri, Kaushik 01 August 2013
AN ABSTRACT OF THE THESIS OF KAUSHIK POLURI, for the Master of Science degree in Electrical and Computer Engineering, presented on 07/03/2013, at Southern Illinois University Carbondale.
TITLE: Bounding the Worst-Case Response Times of Hard-Real-Time Tasks under the Priority Ceiling Protocol in Cache-Based Architectures
MAJOR PROFESSOR: Dr. HARINI RAMAPRASAD
Schedulability analysis of hard-real-time systems requires a priori knowledge of the worst-case execution times (WCETs) of all tasks. Static timing analysis is a safe technique for calculating WCETs; it attempts to model program complexity, architectural complexity, and the complexity introduced by interference from other tasks. Modern architectural features such as caches make static timing analysis of a single task challenging, because cache behavior depends on the history of memory accesses, and they make the analysis of a set of tasks even more challenging because of cache-related interference among tasks. Researchers have proposed several static timing analysis techniques that explicitly consider cache-eviction delays for independent hard-real-time tasks executing on cache-based architectures. However, there is little research in this area for resource-sharing tasks. Recently, an analysis technique was proposed for systems using the Priority Inheritance Protocol (PIP) to manage resource arbitration among tasks. The Priority Ceiling Protocol (PCP) is a resource-arbitration protocol that offers distinct advantages over the PIP, including deadlock avoidance. However, to the best of our knowledge, there is currently no technique to bound the WCETs of resource-sharing tasks under PCP with explicit consideration of cache-eviction delays. This thesis presents a technique to bound the WCETs, and hence the Worst-Case Response Times (WCRTs), of resource-sharing hard-real-time tasks executing on cache-based uniprocessor systems, focusing specifically on data cache analysis.
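To show how per-task WCETs feed into WCRTs, the classic fixed-priority response-time recurrence R_i = C_i + B_i + Σ_{j ∈ hp(i)} ⌈R_i / T_j⌉ (C_j + γ_j) can be iterated to a fixed point. This is a minimal sketch, not the thesis's data-cache analysis: it assumes deadlines equal periods, a precomputed PCP blocking bound B_i, and a single hypothetical cache-related preemption delay `crpd[j]` charged for every release of a higher-priority task j.

```python
import math


def response_time(i, tasks, blocking, crpd, max_iter=1000):
    """Fixed-point response-time iteration for task i (textbook sketch).

    tasks:    list of (C, T) pairs (WCET, period), highest priority first.
    blocking: B_i, the worst-case PCP blocking time of task i
              (under PCP, at most one lower-priority critical section).
    crpd:     crpd[j] is a hypothetical cache-related preemption delay
              charged once per release of higher-priority task j.
    Returns the worst-case response time, or None if it exceeds T_i
    (deadlines are assumed equal to periods).
    """
    c_i, t_i = tasks[i]
    r = c_i + blocking
    for _ in range(max_iter):
        # Interference from every higher-priority release within the window r.
        interference = sum(
            math.ceil(r / t_j) * (c_j + crpd[j])
            for j, (c_j, t_j) in enumerate(tasks[:i])
        )
        r_next = c_i + blocking + interference
        if r_next == r:
            return r          # fixed point reached
        if r_next > t_i:
            return None       # deadline miss: unschedulable
        r = r_next
    return None


tasks = [(1, 4), (2, 6), (3, 12)]
print([response_time(i, tasks, 0, [0, 0, 0]) for i in range(3)])  # [1, 3, 10]
```

Adding a blocking term or a nonzero `crpd` entry lengthens each window, which can push further higher-priority releases into it; this is why cache-eviction delays must be bounded explicitly rather than ignored.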
2

Development of a New Client-Server Architecture for Context Aware Mobile Computing

Gui, Feng 25 March 2009
This dissertation studies the context-aware application and its proposed algorithms at the client side. The required context-aware infrastructure is discussed in depth to show that such an infrastructure collects the mobile user's context information, registers service providers, derives the mobile user's current context, distributes user context among context-aware applications, and provides tailored services. The proposed approach tries to strike a balance between the context server and mobile devices. Context acquisition is centralized at the server to ensure the usability of context information across mobile devices, while context reasoning remains at the application level. Hence, centralized context acquisition combined with distributed context reasoning is viewed as the better overall solution. The context-aware search application is designed and implemented at the server side. A new algorithm is proposed that takes user context profiles into consideration. To promote feedback on the dynamics of the system, prior user selections are saved for further analysis so that they can help improve the results of subsequent searches. Building on these server-side developments, various solutions are then provided at the client side. A software proxy component is set up for data collection. This research supports the view that the client-side proxy should contain the context reasoning component; implementing such a component lends credence to this view, in that the context applications are able to derive the user context profiles. Furthermore, a context cache scheme is implemented to manage the cache on the client device in order to minimize processing requirements and other resources (bandwidth, CPU cycles, power). Java and MySQL are used to implement the proposed architecture and to test scenarios derived from users' daily activities.
To meet the practical demands of a testing environment without the heavy cost of establishing such a comprehensive infrastructure, a software simulation using a free Yahoo search API is used to evaluate the effectiveness of the design approach as realistically as possible. Integrating the Yahoo search engine into the context-aware architecture demonstrates how a context-aware application can meet user demands for tailored services and products in and around the user's environment. The test results show that the overall design is highly effective, providing new features and enriching the mobile user's experience through a broad scope of potential applications.
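The client-side context cache described above can be illustrated with a least-recently-used (LRU) cache that contacts the context server only on a miss. This is a hypothetical sketch: the dissertation implemented its system in Java and MySQL and does not state its cache policy here, so the `ContextCache` class, its LRU eviction, and the `fetch` callback standing in for a server request are all assumptions.

```python
from collections import OrderedDict


class ContextCache:
    """Hypothetical client-side context cache with LRU eviction.

    fetch: callable that retrieves a context value from the server on a miss
    (a stand-in for the network request the cache is meant to avoid).
    """

    def __init__(self, capacity, fetch):
        self.capacity = capacity
        self.fetch = fetch
        self.entries = OrderedDict()   # ordered from least to most recently used
        self.server_requests = 0

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)     # hit: mark as most recently used
            return self.entries[key]
        value = self.fetch(key)               # miss: contact the context server
        self.server_requests += 1
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return value


cache = ContextCache(capacity=2, fetch=lambda key: "ctx:" + key)
for key in ["location", "activity", "location", "time", "activity"]:
    cache.get(key)
print(cache.server_requests)  # 4
```

Bounding the number of entries caps memory use on the device, while every hit saves one round trip to the server, which is where the bandwidth and power savings would come from.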
3

Hardware techniques to improve cache efficiency

Liu, Haiming 19 October 2009
Modern microprocessors devote a large portion of their chip area to caches in order to bridge the speed and bandwidth gap between the core and main memory. One known problem with caches is that they are usually used with low efficiency: only a small fraction of the cache stores data that will be used before getting evicted. As the focus of microprocessor design shifts towards achieving higher performance per watt, cache efficiency is becoming increasingly important. This dissertation proposes techniques to improve both data cache efficiency in general and instruction cache efficiency for Explicit Data Graph Execution (EDGE) architectures. To improve the efficiency of data caches and L2 caches, dead blocks (blocks that will not be referenced again before their eviction from the cache) should be identified and evicted early. Prior schemes predict the death of a block immediately after it is accessed, based on the individual reference history of the block. Such schemes result in lower prediction accuracy and coverage. We delay the prediction to achieve better prediction accuracy and coverage. For the L1 cache, we propose a new class of dead-block prediction schemes that predict dead blocks based on cache bursts. A cache burst begins when a block moves into the MRU position and ends when it moves out of the MRU position. Cache burst history is more predictable than individual reference history and results in better dead-block prediction accuracy and coverage. Experimental results show that predicting the death of a block at the end of a burst gives the best tradeoff between timeliness and prediction accuracy/coverage. We also propose mechanisms to improve counting-based dead-block predictors, which work best at the L2 cache. These mechanisms handle reference-count variations, which cause problems for existing counting-based dead-block predictors. The new schemes can identify the majority of the dead blocks with approximately 90% or higher accuracy.
For a 64KB, two-way L1 D-cache, 96% of the dead blocks can be identified with 96% accuracy, halfway into a block's dead time. For a 64KB, four-way L1 cache, the prediction accuracy and coverage are 92% and 91% respectively. At any moment, the average fraction of the dead blocks that have been correctly detected for a two-way or four-way L1 cache is approximately 49% or 67% respectively. For a 1MB, 16-way set-associative L2 cache, 66% of the dead blocks can be identified with 89% accuracy, one-sixteenth of the way into a block's dead time. At any moment, 63% of the dead blocks in such an L2 cache, on average, have been correctly identified by the dead-block predictor. The ability to accurately identify the majority of the dead blocks in the cache long before their eviction can lead not only to higher cache efficiency, but also to reduced power consumption or higher reliability. In this dissertation, we use the dead-block information to improve cache efficiency and performance through three techniques: replacement optimization, cache bypassing, and prefetching into dead blocks. Replacement optimization evicts blocks that become dead after several reuses, before they reach the LRU position. Cache bypassing identifies blocks that cause cache misses but will not be reused if they are written into the cache, and does not store these blocks in the cache. Prefetching into dead blocks replaces dead blocks with prefetched blocks that are likely to be referenced in the future. Simulation results show that replacement optimization or bypassing improves performance by 5%, and prefetching into dead blocks improves performance by 12% over the baseline prefetching scheme for the L1 cache and by 13% over the baseline prefetching scheme for the L2 cache. Each of these three techniques can turn part of the identified dead blocks into live blocks. As new techniques that can better utilize the space of the dead blocks are found, dead-block information is likely to become more valuable.
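The cache-burst idea and the replacement optimization described above can be sketched for a single LRU set. This is a deliberately simplified illustration, not the dissertation's predictor: the signature hash, the 2-bit counters, the threshold, and the training rule are assumptions. The class's own evictions use plain LRU; the separate `choose_victim` function shows how predicted-dead blocks could be preferred as victims.

```python
from collections import defaultdict


class BurstDeadBlockPredictor:
    """Toy burst-based dead-block predictor for a single LRU set.

    A burst starts when a block becomes MRU and ends when another block
    displaces it from the MRU position. Each block keeps a signature hashed
    from the PCs that started its bursts; a 2-bit counter per signature
    decides, at each burst end, whether the block is predicted dead.
    Training rule (an assumption for illustration): an eviction marks the
    victim's last signature as dead; a reuse marks it as live.
    """

    def __init__(self, ways, threshold=2):
        self.ways = ways
        self.threshold = threshold
        self.stack = []                    # LRU stack: index 0 is MRU
        self.sig = {}                      # block -> burst-history signature
        self.table = defaultdict(int)      # signature -> 2-bit counter (0..3)
        self.predicted_dead = set()

    def access(self, block, pc):
        if self.stack and self.stack[0] == block:
            return                         # still inside the current burst
        if self.stack:
            # The current MRU block's burst ends now: make the prediction.
            ending = self.stack[0]
            if self.table[self.sig[ending]] >= self.threshold:
                self.predicted_dead.add(ending)
        if block in self.stack:
            s = self.sig[block]            # reused: its last signature was live
            self.table[s] = max(0, self.table[s] - 1)
            self.predicted_dead.discard(block)
            self.stack.remove(block)
        elif len(self.stack) == self.ways:
            victim = self.stack.pop()      # plain LRU eviction
            s = self.sig.pop(victim)       # evicted: its last signature was dead
            self.table[s] = min(3, self.table[s] + 1)
            self.predicted_dead.discard(victim)
        # A new burst begins: fold the starting PC into the signature.
        self.sig[block] = hash((self.sig.get(block, 0), pc)) & 0xFFFF
        self.stack.insert(0, block)


def choose_victim(lru_stack, predicted_dead):
    """Replacement optimization: evict the predicted-dead block closest to LRU,
    falling back to the plain LRU block if none is predicted dead."""
    for block in reversed(lru_stack):
        if block in predicted_dead:
            return block
    return lru_stack[-1]


p = BurstDeadBlockPredictor(ways=2)
for b in range(10):          # streaming: every block is touched once
    p.access(b, pc=7)
print(p.predicted_dead, choose_victim(p.stack, p.predicted_dead))  # {8} 8
```

On a streaming trace the counter saturates after a few evictions, so each block is flagged dead as soon as its single burst ends, well before it would reach the LRU position in a larger set.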
Compared to RISC architectures, the instruction cache in EDGE architectures faces challenges such as a higher miss rate, because of the increase in code size, and a longer miss penalty, because of the large block size and the distributed microarchitecture. To improve instruction cache efficiency in EDGE architectures, we decouple next-block prediction from instruction fetch so that the next-block predictor can run ahead of instruction fetch and the predicted blocks can be prefetched into the instruction cache before they cause any I-cache misses. In particular, we discuss how to decouple next-block prediction from instruction fetch and how to control the run-ahead distance of the next-block predictor in a fully distributed microarchitecture. The performance benefit of such a look-ahead instruction prefetching scheme is then evaluated, and the run-ahead distance that gives the best performance improvement is identified. In addition to prefetching, we also estimate the performance benefit of storing variable-sized blocks in the instruction cache. Such schemes reduce the inefficiency caused by storing NOPs in the I-cache and enable the I-cache to store more blocks with the same capacity. Simulation results show that look-ahead instruction prefetching and storing variable-sized blocks can improve the performance of the benchmarks that have high I-cache miss rates by 17% and 18% respectively, out of an ideal 30% performance improvement achievable only with a perfect I-cache. Such techniques will narrow the gap in I-cache hit rates between EDGE architectures and RISC architectures, although the latter will still have higher I-cache hit rates because of their smaller code size.
4

Split array and scalar data cache: A comprehensive study of data cache organization.

Naz, Afrin 08 1900
Existing cache organizations cannot distinguish between different types of locality and non-selectively cache all data, rather than attempting to take special advantage of the locality type. This causes unnecessary movement of data among the levels of the memory hierarchy and increases the miss ratio. In this dissertation I propose a split data cache architecture that groups memory accesses into scalar or array references according to their inherent locality and then maps each group to a dedicated cache partition. In this system, because scalar and array references no longer negatively affect each other, cache interference is diminished, delivering better performance. Further improvement is achieved by introducing a victim cache, prefetching, data flattening, and reconfigurability to tune the array and scalar caches for specific applications. The most significant contribution of my work is the introduction of a novel cache architecture for embedded microprocessor platforms. My proposed cache architecture uses reconfigurability coupled with split data caches to reduce the area and power consumed by cache memories while retaining performance gains. My results show excellent reductions in both memory size and memory access times, translating into reduced power consumption. Since there was a huge reduction in miss rates at the L-1 caches, further power reduction is achieved by partially or completely shutting down the L-2 data or L-2 instruction caches. The savings in cache size resulting from these designs can be used for other processor activities, including instruction and data prefetching and branch-prediction buffers. The potential benefits of such techniques for embedded applications have been evaluated in my work. I also explore how my cache organization performs for non-numeric data structures.
I propose a novel idea called "Data flattening", a profile-based memory allocation technique that compresses sparsely scattered pointer data into regular contiguous memory locations, and I explore the potential of my proposed split cache organization for data treated with the data flattening method.
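The interference argument can be illustrated with two small direct-mapped partitions. In this hypothetical sketch, each reference arrives already labeled as array or scalar (the dissertation's classification mechanism is not shown), and the trace is synthetic: a streaming array that, in a unified cache of the same total size, periodically evicts a frequently used scalar.

```python
class DirectMappedCache:
    """Minimal direct-mapped cache that only tracks tags and counts misses."""

    def __init__(self, lines, block_size=32):
        self.lines = lines
        self.block_size = block_size
        self.tags = [None] * lines
        self.misses = 0

    def access(self, addr):
        block = addr // self.block_size
        index = block % self.lines
        tag = block // self.lines
        if self.tags[index] != tag:
            self.misses += 1
            self.tags[index] = tag


class SplitDataCache:
    """Routes each reference to an array or a scalar partition.

    Classification is supplied by the caller here (e.g. a compiler or
    profiler flag); this is an assumption for illustration.
    """

    def __init__(self, array_lines, scalar_lines, block_size=32):
        self.array = DirectMappedCache(array_lines, block_size)
        self.scalar = DirectMappedCache(scalar_lines, block_size)

    def access(self, addr, is_array):
        (self.array if is_array else self.scalar).access(addr)

    @property
    def misses(self):
        return self.array.misses + self.scalar.misses


def run_trace(n=64):
    """Same total capacity (8 lines) as one unified cache or as two 4-line partitions."""
    unified = DirectMappedCache(lines=8)
    split = SplitDataCache(array_lines=4, scalar_lines=4)
    for k in range(n):
        array_addr = 0x1000 + 32 * k      # streaming array element, never reused
        scalar_addr = 0x0                 # hot scalar, reused every iteration
        unified.access(array_addr)
        unified.access(scalar_addr)
        split.access(array_addr, is_array=True)
        split.access(scalar_addr, is_array=False)
    return unified.misses, split.misses
```

With the default `n=64`, `run_trace()` returns `(72, 65)`: both organizations take 64 compulsory array misses, but in the unified cache every eighth array element maps onto the scalar's line and evicts it, while the scalar partition keeps it resident after its first miss.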
5

A New N-way Reconfigurable Data Cache Architecture for Embedded Systems

Bani, Ruchi Rastogi 12 1900
Performance and power consumption are the most important issues in designing embedded systems. Several studies have shown that cache memory consumes about 50% of the total power in these systems. Thus, the architecture of the cache governs both the performance and the power usage of embedded systems. A new N-way reconfigurable data cache is proposed especially for embedded systems. This thesis explores the issues and design considerations involved in designing a reconfigurable cache. The proposed reconfigurable data cache architecture can be configured as direct-mapped, two-way, or four-way set associative using a mode selector. The module has been designed and simulated in Xilinx ISE 9.1i and ModelSim SE 6.3e using the Verilog hardware description language.
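Switching a fixed-capacity cache between direct-mapped, two-way, and four-way modes changes how an address is split into tag, index, and offset. The sketch below models that split in software; the 8 KB capacity, 32-byte blocks, and 32-bit addresses are illustrative assumptions, and the thesis's Verilog design is not reproduced here. Because the tag grows by one bit each time the associativity doubles, a reconfigurable design would need tag storage sized for its most associative mode.

```python
def address_fields(capacity_bytes, block_bytes, ways, addr_bits=32):
    """Return (tag_bits, index_bits, offset_bits) for one cache mode.

    Assumes power-of-two capacity and block size; `ways` models the mode selector.
    """
    if ways not in (1, 2, 4):
        raise ValueError("mode must be direct-mapped (1), 2-way, or 4-way")
    sets = capacity_bytes // (block_bytes * ways)
    offset_bits = block_bytes.bit_length() - 1    # log2(block size)
    index_bits = sets.bit_length() - 1            # log2(number of sets)
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


def decompose(addr, capacity_bytes, block_bytes, ways):
    """Split an address into its (tag, index, offset) values for the given mode."""
    _, index_bits, offset_bits = address_fields(capacity_bytes, block_bytes, ways)
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


for mode in (1, 2, 4):
    print(mode, address_fields(8192, 32, mode))
```

For the assumed 8 KB cache, this prints `(19, 8, 5)`, `(20, 7, 5)`, and `(21, 6, 5)` for the three modes: the offset never changes, and one index bit moves into the tag per doubling of associativity.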
6

A Branch Predictor Directed Data Cache Prefetcher for Out-of-order and Multicore Processors

Sharma, Prabal 16 December 2013
Modern superscalar pipelines have tremendous capacity to consume the instruction stream. This has been made possible by improvements in process technology, technology scaling, and microarchitectural design that allow programs to speculate past control and data dependencies in the superscalar architecture. However, the speed of the memory subsystem lags behind because of the physical constraints on bringing huge amounts of data to the processor core. Cache hierarchies have mitigated the impact of this speed gap; however, much can still be done to improve the microarchitecture. Data prefetching techniques bring in memory content well before the instruction stream actually encounters demand misses. However, most of the techniques proposed so far depend on an initial demand miss to initiate a stream of previously identified prefetches. In this thesis, we propose a novel prefetching algorithm that leverages branch prediction to facilitate deep memory system speculation. The branch-predictor-directed lookahead mechanism builds a speculative control flow path for the instruction stream about to be fetched by the main superscalar pipeline. Prefetches are generated along this speculative path from a condensed representation of the memory instructions, leveraging register-index-based correlation. The technique integrates elegantly with the main pipeline's branch predictor to filter out prefetches along invalid speculative paths. The impact of the prefetching scheme is analyzed using the out-of-order model of the Gem5 cycle-accurate simulator. Evaluation shows that on a set of 13 memory-intensive SPEC CPU2006 benchmarks, our prefetching technique improves performance by an average of 5.6% over the baseline out-of-order processor.
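The core idea of walking a predicted control-flow path and generating prefetch addresses along it can be sketched as follows. This is a simplified illustration, not the thesis's algorithm (which also uses register-index-based correlation and filters prefetches against the main pipeline's branch predictor): the `BasicBlock` type, the `predict_taken` callback standing in for the branch predictor, and the use of current register values to form addresses are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class BasicBlock:
    loads: List[Tuple[str, int]]      # condensed memory ops: (base register, offset)
    taken: Optional[int]              # successor if the terminating branch is taken
    fallthrough: Optional[int]        # successor if it is not taken


def lookahead_prefetches(cfg: Dict[int, BasicBlock], start: int,
                         predict_taken: Callable[[int], bool],
                         regs: Dict[str, int], depth: int) -> List[int]:
    """Follow the predicted path for up to `depth` blocks, emitting prefetch addresses.

    predict_taken stands in for the branch predictor. Addresses are formed from
    current register values, which ignores register updates along the path.
    """
    addresses = []
    block_id: Optional[int] = start
    for _ in range(depth):
        if block_id is None:
            break                          # path leaves the known CFG
        block = cfg[block_id]
        addresses.extend(regs[base] + offset for base, offset in block.loads)
        block_id = block.taken if predict_taken(block_id) else block.fallthrough
    return addresses
```

Because the walk needs only predictions and register values, it can run ahead of fetch without waiting for a demand miss, which is the property the abstract contrasts with miss-triggered prefetchers.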
7

An Energy Efficient Data Cache Implementing 2-way LRC Architecture

Musalappa, Saibhushan 09 December 2006
Conventional level-one data caches are widely used in high-performance microprocessors. Shrinking process parameters in chip fabrication technology allow a much larger number of devices on a chip with every new generation. This reduction in device size has increased the magnitude of a type of energy dissipation hitherto ignored: leakage energy. Transistor-level leakage energy research for sub-micron processes has shown that leakage can be as large as or larger than the dynamic energy for advanced circuit designs. Researchers have devised techniques to reduce leakage energy at the fabrication and circuit levels. Transitioning idle circuits from the operating voltage to a reduced voltage is one such circuit-level technique. The ELRU-SEQ replacement policy exploits this technique to control cache bank transitions. This thesis proposes a new cache architecture called the 2-way Leakage Reduction Cache (LRC) that uses this replacement policy. The architecture employs an XOR-mapping function to reduce conflict misses.
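The abstract does not give LRC's exact XOR-mapping function. A common form XORs the conventional index field with the address bits just above it, so that blocks whose addresses differ only in those higher bits, which all collide under modulo indexing, spread across different sets. A sketch under that assumption:

```python
def modulo_index(block_addr, index_bits):
    """Conventional indexing: the low index_bits bits of the block address."""
    return block_addr & ((1 << index_bits) - 1)


def xor_index(block_addr, index_bits):
    """XOR-mapped indexing (assumed form): low field XOR the next field up."""
    mask = (1 << index_bits) - 1
    return (block_addr ^ (block_addr >> index_bits)) & mask


strided = [0, 16, 32, 48]    # block addresses 16 apart
print([modulo_index(b, 4) for b in strided])  # [0, 0, 0, 0]
print([xor_index(b, 4) for b in strided])     # [0, 1, 2, 3]
```

The XOR adds only a row of two-input gates in front of the decoder, which is why such mappings are attractive for reducing conflict misses without raising associativity.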
8

Performance improvements using dynamic performance stubs

Trapp, Peter January 2011
This thesis proposes a new methodology to extend the software performance engineering process. Common performance measurement and tuning principles mainly aim to improve the software function itself. In doing so, the application source code is studied and improved independently of the overall system performance behavior. Moreover, the optimization of the software function has to be done without an estimate of the expected optimization gain. This often leads to under- or over-optimization and hence does not utilize the system sufficiently. The proposed performance improvement methodology and framework, called dynamic performance stubs, addresses these shortcomings by evaluating the overall system performance improvement. This is achieved by simulating the performance behavior of the original software functionality at an adjustable optimization level prior to the real optimization. This enables the software performance analyst to determine the system's overall performance behavior, considering the possible outcomes of different improvement approaches. Moreover, the dynamic performance stubs methodology allows a cost-benefit analysis of different optimizations with respect to performance behavior. The approach of dynamic performance stubs is to replace the software bottleneck with a stub. This stub combines the simulation of the software functionality with the ability to adjust the performance behavior with respect to one or more performance aspects of the replaced software function. A general methodology for using dynamic performance stubs, as well as several methodologies for simulating different performance aspects, is discussed. Finally, several case studies are presented to show the application and usability of the dynamic performance stubs approach.
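A stub that replaces a bottleneck, reproduces its results, and consumes an adjustable fraction of its original CPU time can be sketched as follows. This covers only the CPU-time aspect; the thesis also discusses simulating other performance aspects. The `record` and `make_stub` helpers are hypothetical, and replaying recorded outputs assumes a deterministic function with hashable inputs.

```python
import time


def record(func, inputs):
    """Profile the bottleneck once: map each input to (output, elapsed seconds)."""
    trace = {}
    for x in inputs:
        start = time.perf_counter()
        out = func(x)
        trace[x] = (out, time.perf_counter() - start)
    return trace


def make_stub(trace, optimization_level):
    """Build a stub that replays recorded outputs while simulating a speedup.

    optimization_level = 0.0 keeps the original CPU time; 0.5 simulates a
    function that is twice as fast; 1.0 removes its cost entirely.
    """
    if not 0.0 <= optimization_level <= 1.0:
        raise ValueError("optimization_level must be between 0 and 1")

    def stub(x):
        out, seconds = trace[x]
        deadline = time.perf_counter() + seconds * (1.0 - optimization_level)
        while time.perf_counter() < deadline:
            pass                           # busy-wait to keep the CPU occupied
        return out

    return stub
```

Swapping the real function for `make_stub(record(func, inputs), level)` and measuring the whole system at several levels gives an estimate of the end-to-end gain before any real optimization work is done.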
9

A Dual-Port Data Cache with Pseudo-Direct Mapping Function

Gade, Arul Sandeep 07 May 2005
Conventional on-chip (L1) data caches such as Direct-Mapped (DM) and 2-way Set-Associative Caches (SAC) have been widely used in high-performance uni- (or multi-)processors. Unfortunately, these schemes suffer from high conflict misses, since more than one address maps onto the same cache line. To reduce conflict misses, much research has been done on developing different cache architectures such as the 2-way Skewed-Associative cache (Skew cache). The 2-way Skew cache has hardware complexity equivalent to that of a 2-way SAC and a miss rate approaching that of a 4-way SAC. However, the reduction in miss rate using a Skew cache is limited by the confined space available to disperse conflicting accesses over small memory banks. This research proposes a dual-port data cache called the Pseudo-Direct Cache (PDC) to minimize conflict misses by dispersing addresses effectively over a single memory bank. Our simulation results show that PDC reduces these misses significantly compared to conventional L1 caches and also achieves 10-15% lower miss rates than a 2-way Skew cache. The SimpleScalar simulator was used for these simulations with the SPEC95FP benchmark programs; similar results were also seen with the SPEC2000FP benchmark programs. Simulations with CACTI 3.0 were performed to evaluate the hardware implications of PDC relative to the Skew cache. The results show that PDC has a simple hardware complexity similar to that of a 2-way SAC and 4-15% better AMAT than a 2-way Skew cache. PDC also reduces execution cycles significantly.
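Miss-rate and AMAT improvements are related through the standard formula AMAT = hit time + miss rate × miss penalty. The numbers below are illustrative assumptions, not results from the thesis; they show that when hit time is unchanged, a given relative cut in miss rate yields a smaller relative cut in AMAT.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: hit time plus the expected miss cost."""
    return hit_time + miss_rate * miss_penalty


def improvement(baseline, candidate):
    """Fractional reduction of candidate AMAT relative to baseline AMAT."""
    return (baseline - candidate) / baseline


# Illustrative numbers only (not from the thesis): 1-cycle hit, 20-cycle penalty.
base = amat(hit_time=1.0, miss_rate=0.05, miss_penalty=20.0)
new = amat(hit_time=1.0, miss_rate=0.04, miss_penalty=20.0)
print(f"{improvement(base, new):.1%}")  # 10.0%
```

Here a 20% lower miss rate shortens AMAT by only 10%, because half of the baseline AMAT is hit time. Any difference in hit time between two designs enters through `hit_time` and can offset part of a miss-rate gain.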
10

Designing Energy-Aware Optimization Techniques through Program Behaviour Analysis

Kommaraju, Ananda Varadhan January 2014
Green computing techniques aim to reduce the power footprint of modern embedded devices, with particular emphasis on processors, the power hot spots of these devices. In this thesis we propose compiler-driven and profile-driven optimizations that reduce power consumption in a modern embedded processor. We show that these optimizations reduce power consumption in functional units and memory subsystems with very low performance loss. We present three new techniques to reduce power consumption in processors, namely transition-aware scheduling, leakage reduction in data caches using criticality analysis, and dynamic power reduction in data caches using locality analysis of data regions. A novel instruction scheduling technique to address leakage power consumption in functional units is proposed. This technique, transition-aware scheduling, is motivated by the idle periods that arise in the utilization of functional units during program execution. A long continuous idle period in a functional unit can be exploited to place the unit in a low-power state. This scheduling algorithm increases the duration of idle periods without hampering performance and drives power gating during these periods. A power model with idle cycles as a parameter shows that this technique saves up to 25% of leakage power with very low performance impact. In modern embedded programs, data regions can be classified as critical or non-critical; critical data regions significantly impact performance. A new technique to identify such data regions through profiling is proposed. This technique, along with a new criticality-based cache policy, is used to control the power state of the data cache. This scheme allocates non-critical data regions to low-power cache regions, thereby reducing leakage power consumption by up to 40% without compromising performance. This profiling technique is extended to identify data regions that have low locality.
Some data regions have high data reuse. A locality-based cache policy based on cache parameters such as size and associativity is proposed. This scheme reduces dynamic as well as static power consumption in the cache subsystem. This optimization reduces the total power consumption in the data caches by 25% without hampering execution time. In this thesis, the problem of a program's power consumption is decoupled from the number of processor cores. The underlying architecture model is simplified to abstract away a variety of processor scenarios. This simplified model can be scaled up to various multi-core architecture models such as Chip Multi-Processors, Simultaneous Multi-Threaded Processors, and Chip Multi-Threaded Processors. The three techniques proposed in this thesis leverage underlying hardware features such as low-power functional units, drowsy caches, and split data caches. These techniques reduce power consumption across a wide range of benchmarks with low performance loss.
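Why lengthening idle periods matters can be shown with a simple count. If gating a functional unit pays off only for idle runs of at least some break-even length, scattered single idle cycles save nothing, while the same number of idle cycles clustered together can all be gated. The `gated_cycles` function and its `break_even` parameter are hypothetical and much simpler than the thesis's power model.

```python
def gated_cycles(busy, break_even):
    """Count idle cycles that fall in idle runs long enough to power-gate.

    busy: one boolean per cycle (True = the functional unit is in use).
    break_even: shortest idle run worth gating (models wake-up overhead).
    """
    gated = 0
    run = 0
    for in_use in list(busy) + [True]:     # sentinel closes a trailing idle run
        if in_use:
            if run >= break_even:
                gated += run
            run = 0
        else:
            run += 1
    return gated


scattered = [True, False] * 4              # four isolated idle cycles
clustered = [True] * 4 + [False] * 4       # same work, idle cycles grouped
print(gated_cycles(scattered, 3), gated_cycles(clustered, 3))  # 0 4
```

Both schedules keep the unit busy for four cycles, so a scheduler that reorders independent instructions to produce the clustered pattern gains gating opportunities without changing the amount of work.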
