71. Improving on-chip data cache using instruction register information. January 1996.
by Lau Siu Chung.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1996. Includes bibliographical references (leaves 71-74).
Contents:
Abstract --- p.i
Acknowledgment --- p.ii
List of Figures --- p.v
Chapter 1 Introduction --- p.1
1.1 Hiding memory latency --- p.1
1.2 Organization of dissertation --- p.4
Chapter 2 Related Work --- p.5
2.1 Hardware controlled cache prefetching --- p.5
2.2 Software assisted cache prefetching --- p.9
Chapter 3 Data Prefetching --- p.13
3.1 Data reference patterns --- p.14
3.2 Embedded hints for next data references --- p.19
3.3 Instruction Opcode and Addressing Mode Prefetching (IAP) scheme --- p.21
3.3.1 Basic IAP scheme --- p.21
3.3.2 Enhanced IAP scheme --- p.24
3.3.3 Combined IAP scheme --- p.27
3.4 Summary --- p.29
Chapter 4 Performance Evaluation --- p.31
4.1 Evaluation methodology --- p.31
4.1.1 Trace-driven simulation --- p.31
4.1.2 Caching models --- p.33
4.1.3 Benchmarks and metrics --- p.36
4.2 General results --- p.41
4.2.1 Varying cache size --- p.44
4.2.2 Varying cache block size --- p.46
4.2.3 Varying associativity --- p.49
4.3 Other performance metrics --- p.52
4.3.1 Accuracy of prefetch --- p.52
4.3.2 Partial hit delay --- p.55
4.3.3 Bus usage problem --- p.59
4.4 Zero time prefetch --- p.63
4.5 Summary --- p.67
Chapter 5 Conclusion --- p.68
5.1 Summary of our research --- p.68
5.2 Future work --- p.70
Bibliography --- p.71
72. Techniques of distributed caching and terminal tracking for mobile computing. January 1997.
by Chiu-Fai Fong.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1997. Includes bibliographical references (leaves 76-81).
Contents:
Abstract --- p.i
Acknowledgments --- p.iii
Chapter 1 Introduction --- p.1
1.1 Distributed Data Caching --- p.2
1.2 Mobile Terminal Tracking --- p.5
1.3 Thesis Overview --- p.10
Chapter 2 Personal Communication Network --- p.11
2.1 Network Architecture --- p.11
2.2 Resource Limitations --- p.13
2.3 Mobility --- p.14
Chapter 3 Distributed Data Caching --- p.17
3.1 System Model --- p.18
3.1.1 The Wireless Network Environment --- p.18
3.1.2 Caching Protocol --- p.19
3.2 Caching at Mobile Computers --- p.22
3.3 Broadcasting at the Server --- p.24
3.3.1 Passive Strategy --- p.27
3.3.2 Active Strategy --- p.27
3.4 Performance Analysis --- p.29
3.4.1 Bandwidth Requirements --- p.29
3.4.2 Lower Bound on the Optimal Bandwidth Consumption --- p.30
3.4.3 The Read Response Time --- p.32
3.5 Experiments --- p.35
3.6 Mobility Concerns --- p.42
Chapter 4 Mobile Terminal Tracking --- p.44
4.1 Movement Model --- p.45
4.2 Optimal Paging --- p.48
4.3 Transient Analysis --- p.52
4.3.1 The Time-Based Protocol --- p.55
4.3.2 Distance-Based Protocol --- p.59
4.4 The Reverse-Guessing Protocol --- p.64
4.5 Experiments --- p.66
Chapter 5 Conclusions & Future Work --- p.71
5.1 Distributed Data Caching --- p.72
5.2 Mobile Terminal Tracking --- p.73
Bibliography --- p.76
Appendix A Proof of NP-hardness of the Broadcast Set Assignment Problem --- p.82
73. Enhancements to the scalable coherent interface cache protocol. Safranek, Robert J., 01 January 1999.
As the number of NUMA systems whose cache coherency protocols are based on IEEE Std. 1596-1992, the Standard for Scalable Coherent Interface (SCI), increases, it is important to review this complex protocol to determine whether it can be enhanced. This research provides two realizable extensions to the standard SCI cache protocol, both of which lie within the basic confines of the SCI architecture.

The first extension simplifies the SCI protocol in the area of prepending to a sharing list, where the flow of events differs distinctly depending on whether the cache line is marked "Fresh" or "Gone". The guaranteed-forward-progress extension makes the act of prepending to an existing sharing list independent of whether the line is in the "Fresh" or "Gone" state. In addition, it eliminates the need for an SCI command and distributes the resource requirements of supplying data for a shared line equally among all nodes of the sharing list. The second extension addresses the time to purge (or invalidate) an SCI sharing list: it provides a realizable solution that allows the node being invalidated to acknowledge the request before the invalidation completes, while maintaining the memory consistency model of the system's processors.

The resulting cache protocol was developed and implemented for the Sequent Computer Systems Inc. NUMA-Q system. Run on systems ranging from eight to sixty-four processors, it provided a 7% to 20% reduction in the time to invalidate an SCI sharing list.
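The prepend simplification is easiest to picture as pointer surgery on the distributed sharing list. The C fragment below is a minimal sketch of that idea only, not the NUMA-Q implementation: the type and function names are invented here, and the directory and cache-line interactions of real SCI are reduced to a doubly linked list whose head insertion takes the same path whether the line is "Fresh" or "Gone".

    /* Illustrative sketch: an SCI-style sharing list modeled as a doubly
     * linked list of caching nodes. All names are invented for this example. */
    #include <stddef.h>

    typedef enum { LINE_FRESH, LINE_GONE } line_state_t; /* memory-directory state */

    typedef struct sci_node {
        int              node_id;
        struct sci_node *fwd;   /* toward older sharers (list tail) */
        struct sci_node *back;  /* toward the list head */
    } sci_node_t;

    typedef struct {
        line_state_t state;     /* Fresh: memory copy valid; Gone: head owns data */
        sci_node_t  *head;      /* most recently prepended sharer */
    } sci_line_t;

    /* Prepend a new sharer at the head of the sharing list. Under the
     * extension described above, this single code path serves both the
     * Fresh and Gone cases, with the previous head supplying the data. */
    void sci_prepend(sci_line_t *line, sci_node_t *newcomer)
    {
        newcomer->fwd  = line->head;       /* link to the previous head, if any */
        newcomer->back = NULL;
        if (line->head)
            line->head->back = newcomer;
        line->head = newcomer;             /* directory now points at the newcomer */
    }

In the standard protocol the newcomer's behavior would branch on the line state before this point; collapsing the two cases into one path is what allows the burden of supplying data to be spread across the nodes of the list.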
74. Optimization of instruction memory for embedded systems. Janapsatya, Andhi, Computer Science & Engineering, Faculty of Engineering, UNSW, January 2005.
This thesis presents methodologies for improving system performance and reducing energy consumption by optimizing the memory hierarchy. The processor-memory performance gap is a well-known problem that is predicted to get worse as the gap continues to widen. The author describes a method to estimate the best L1 cache configuration for a given application, and presents three methods to improve performance and reduce energy in embedded systems by optimizing the instruction memory.

Performance estimation is an important procedure for assessing the performance of a system and the effectiveness of any applied optimizations. A cache memory performance estimation methodology is presented that quickly and accurately estimates the performance of multiple cache configurations; experimental results showed it to be on average 45 times faster than a widely used tool (Dinero IV).

The first optimization method, code placement, is a software-only method that improves instruction cache performance. Code is carefully placed within memory to ensure a high hit rate when it is brought into the cache. Experimental results show that code placement reduces the cache miss rate by up to 71% and energy consumption by up to 63% compared to the same application without code placement.

The second method is a novel architecture that uses scratchpad memory as a replacement for the instruction cache. A hardware modification allows data to be written into the scratchpad during program execution, giving dynamic control of its contents. Because scratchpad memory has a faster access time and lower energy per access than cache memory, this improves performance and lowers energy consumption: experiments show an average energy reduction of 26.59% and an average performance improvement of 25.63% compared to a system with cache memory.

The third method is an application profiling technique that uses statistical information to identify an application's hot-spots: the sections where performance degradation might occur and/or where the largest gains can be obtained through optimization. Applied to the scratchpad-based system described in this thesis, it yields an average performance improvement of 23.6% and an average energy reduction of 27.1% over the previous method for utilizing the scratchpad memory based system.
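To fix ideas about what such a trace-driven cache estimate computes, here is a deliberately naive C sketch that counts misses for a single direct-mapped instruction cache configuration over a recorded address trace. It is neither the thesis's method nor Dinero IV; the names and the simplifications (direct-mapped, no prefetching) are this example's own.

    /* Hypothetical one-configuration miss counter for a direct-mapped cache. */
    #include <stdint.h>
    #include <stdlib.h>

    unsigned long count_misses(const uint32_t *trace, size_t n,
                               uint32_t cache_bytes, uint32_t block_bytes)
    {
        size_t nsets = cache_bytes / block_bytes;   /* direct-mapped: one line per set */
        uint32_t *tag   = calloc(nsets, sizeof *tag);
        uint8_t  *valid = calloc(nsets, sizeof *valid);
        unsigned long misses = 0;

        for (size_t i = 0; i < n; i++) {
            uint32_t blk = trace[i] / block_bytes;  /* block address of this fetch */
            size_t   set = blk % nsets;             /* index bits select the set */
            if (!valid[set] || tag[set] != blk) {   /* cold or conflict miss */
                misses++;
                valid[set] = 1;
                tag[set]   = blk;                   /* install the new block */
            }
        }
        free(tag);
        free(valid);
        return misses;
    }

Sweeping cache_bytes and block_bytes with one such pass per configuration is the slow baseline; estimating many configurations without a separate pass for each is the kind of saving that would explain the reported 45-fold speedup.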
75. MatRISC: a RISC multiprocessor for matrix applications. Beaumont-Smith, Andrew James, January 2001.
"November, 2001" / Errata on back page. / Includes bibliographical references (p. 179-183) / xxii, 193 p. : ill. (some col.), plates (col.) ; 30 cm. / Title page, contents and abstract only. The complete thesis in print form is available from the University Library. / This thesis proposes a highly integrated SOC (system on a chip) matrix-based parallel processor which can be used as a co-processor when integrated into the on-chip cache memory of a microprocessor in a workstation environment. / Thesis (Ph.D.)--University of Adelaide, Dept. of Electrical and Electronic Engineering, 2002
78. A pervasive information framework based on semantic routing and cooperative caching. Chen, Weisong, January 2004.
Thesis (M.Phil.)--University of Hong Kong, 2005. Title proper from title frame. Also available in printed format.
79. Asymmetric clustering using a register cache. Morrison, Roger Allen, 18 April 2006.
Graduation date: 2006.

Conventional register files spread porting resources uniformly across all registers. This thesis proposes a method called Asymmetric Clustering using a Register Cache (ACRC). ACRC uses a fast register cache that concentrates valuable register file ports on the most active registers, thereby reducing total register file area and power consumption. A cluster of functional units and a highly ported register cache execute the majority of instructions, while a second cluster, with a full register file having fewer read ports, processes instructions whose source registers are not found in the register cache. An ‘in-cache’ marking system tracks the contents of the register cache and routes instructions to the correct cluster; it uses logic similar to the ‘ready’ bit system found in wake-up and select logic, keeping the additional logic required to a minimum. When using a 256-entry register file, this design reduces the total register file area by an estimated 65% while exhibiting IPC performance similar to that of a non-clustered 8-way processor. As feature sizes shrink and processor clocks get faster, the number of clock cycles needed to access the register file will increase; the smaller register file area, and hence smaller register file delay, of ACRC will therefore lead to better IPC performance than conventional processors achieve.
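A minimal sketch of the ‘in-cache’ marking idea follows, under an assumption the abstract suggests but does not spell out: one marking bit per architectural register, with an instruction steered to the fast cluster only when all of its source registers are resident in the register cache. The names, and the omission of register cache eviction (which would have to clear marking bits), are invented for illustration.

    /* Hypothetical in-cache marking and cluster routing for ACRC-style steering. */
    #include <stdbool.h>
    #include <stdint.h>

    #define NREGS 256                      /* matches the 256-entry register file */
    static bool in_cache[NREGS];           /* one marking bit per register */

    typedef enum { FAST_CLUSTER, SLOW_CLUSTER } cluster_t;

    /* Route a two-source instruction; the result is written into the register
     * cache, so the destination becomes resident. Eviction is omitted here. */
    cluster_t route(uint8_t src1, uint8_t src2, uint8_t dst)
    {
        cluster_t c = (in_cache[src1] && in_cache[src2])
                          ? FAST_CLUSTER   /* both operands in the register cache */
                          : SLOW_CLUSTER;  /* read from the full register file */
        in_cache[dst] = true;
        return c;
    }

The check itself is a two-input AND per instruction, which is why the abstract can liken it to the ‘ready’ bits already maintained by wake-up and select logic.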
80. A token caching waiting-matching unit for tagged-token dataflow computers. Traylor, Roger L., 03 May 1991.
Computers using the tagged-token dataflow model are among the best candidates for delivering the extremely high levels of performance required in the future. Instruction scheduling in these computers is determined by associatively matching data-bearing tokens in a Waiting-Matching Unit (W-M unit). At the W-M unit, incoming tokens with matching contexts are forwarded to an instruction, while non-matching tokens are stored to await their matching partner.

The requirements of the W-M unit are exacting. The necessary token storage capacity at each processing element (PE) is presently estimated to be 100,000 tokens. Since the most often executed arithmetic instructions require two operands, the bandwidth of the W-M unit must be approximately twice that of the ALU. These contradictory requirements of high storage capacity and high memory bandwidth have compromised the W-M units of previous dataflow computers, limiting their speed.

However, tokens arriving at a PE exhibit strong temporal locality, which naturally suggests a caching technique. Using a recently developed CAM memory structure as a base, a token caching scheme is described which allows rapid, fully associative token matching together with a large token storage capacity. The key to the caching scheme is a fast and compact articulated first-in, first-out content addressable memory (AFCAM) which allows associative matching and garbage collection while maintaining temporal ordering. A new memory cell is developed as the basis for the AFCAM in an advanced CMOS (Complementary Metal Oxide Semiconductor) technology. The design of the cell is discussed, along with electrical simulation results verifying its operation and performance.

Finally, the estimated system performance of a dataflow computer using the caching scheme is presented.
Graduation date: 1991.
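The waiting-matching step itself, as distinct from the AFCAM circuit that implements it, can be sketched in a few lines of C: an incoming token either finds a waiting token with the same tag, in which case the pair fires its target instruction, or it is stored to wait. A sequential scan stands in here for the CAM's parallel compare; the names, the fixed capacity, and the missing overflow path are all simplifications of this example.

    /* Hypothetical software model of waiting-matching for a two-operand PE. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define WM_SLOTS 4096                    /* token store capacity (example only) */

    typedef struct { uint64_t tag; double value; bool used; } wm_slot_t;
    static wm_slot_t store[WM_SLOTS];

    /* Returns true (and the partner's value) if a match fired; returns false
     * if the token was stored to await its partner. */
    bool wm_match(uint64_t tag, double value, double *partner)
    {
        int free_slot = -1;
        /* Fully associative search: compare the tag against every stored
         * token, as the CAM hardware does in parallel. */
        for (size_t i = 0; i < WM_SLOTS; i++) {
            if (store[i].used && store[i].tag == tag) {
                *partner = store[i].value;   /* matching context found: fire */
                store[i].used = false;       /* the pair leaves the store */
                return true;
            }
            if (!store[i].used && free_slot < 0)
                free_slot = (int)i;          /* remember a place to wait */
        }
        if (free_slot >= 0)                  /* no partner yet: store and wait */
            store[free_slot] = (wm_slot_t){ tag, value, true };
        /* If the store is full, a real W-M unit must overflow to a larger memory. */
        return false;
    }

The strong temporal locality noted above is what a token cache exploits: most matches occur among recently arrived tokens, so a small, fast AFCAM in front of a large backing store can serve the common case at full bandwidth.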