• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 118
  • 11
  • 2
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • Tagged with
  • 145
  • 145
  • 145
  • 134
  • 39
  • 35
  • 33
  • 29
  • 27
  • 19
  • 13
  • 11
  • 11
  • 10
  • 10
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
91

Storage Management for Embedded SIMD Processors

Ryu, Soojung 17 December 2003 (has links)
SIMD parallelism offers a high performance and efficient execution approach for today's broad range of portable multimedia consumer products. However, new methods are needed to meet the complex demands of high performance, embedded systems. This research explores new storage management techniques for this focused but critical application. These techniques include memory design exploration based on the application retargeting technique, storage-based systolic instruction broadcast, and systolic virtual memory to improve both the performance and efficiency of embedded SIMD systems. For an efficient storage usage by memory design space exploration in embedded SIMD systems, an analysis method for assessing storage needs and costs of a given application automatically retargeted across a spectrum of storage configuration designs was developed. Using this technique, a SIMD processing element achieves optimal area and energy efficiency with a register file containing between 8 and 12 words for given workload. This configuration is between 15% and 25% more area and energy efficient than other memory configurations being considered. Systolic instruction broadcast is a high performance and area efficient instruction broadcasting scheme with short-wire interconnects by eliminating of wire latency bottleneck found in global instruction broadcast. Three implementation methods are defined and evaluated - software method, 2-write port register file method, and bypass method. In our evaluations, due to the system's short clock cycle time and scheduler, a speedup in system performance of up to 7.5 can be achieved by the year 2010. In addition, speedup of area efficiency also can be achieved up to 7.2 for a given workload. The ability of minimizing off-chip memory access latency while maximizing access frequency by scheduling techniques along with data prefetch techniques in systolic virtual memory mechanism was evaluated using our SIMD-systolic architecture simulator. Results show that, systolic virtual off-chip memory with shared address space can achieve over 50% higher area efficiency than that of an on-chip only system for a matrix multiplication application.
92

DLL-Conscious Instruction Fetch Optimization for SMT Processors

Mohamood, Fayez 12 April 2006 (has links)
Simultaneous multithreading (SMT) processors can issue multiple instructions from distinct processes or threads in the same cycle. This technique effectively increases the overall throughput by keeping the pipeline resources more occupied at the potential expense of reducing single thread performance due to resource sharing. In the software domain, an increasing number of Dynamically Linked Libraries (DLL) are used by applications and operating systems, providing better flexibility and modularity, and enabling code sharing. It is observed that a significant amount of execution time in software today is spent in executing standard DLL instructions, that are shared among multiple threads or processes. However, for an SMT processor with a virtually-indexed based cache implementation, existing instruction fetching mechanisms can induce unnecessary false cache misses caused by the DLL-based instructions, which were intended to be shared. This problem is more conspicuous when multiple independent threads are executing concurrently in an SMT processor. This work investigates an often-neglected form of contention between running threads in the I-TLB and I-cache caused by DLLs. To address these shortcomings, we propose a system level technique involving a light-weight modification in the microarchitecture and the OS. By exploiting the nature of the DLLs in our new architecture, we are able to reinstate physical sharing of the DLLs in an SMT machine. Using Microsoft Windows based applications, our simulation results show that the optimized instruction fetching mechanism can reduce the number of DLL misses up to 5.5 times and improve the instruction cache hit rates by up to 62%, resulting in upto 30% DLL IPC improvements and upto 15% overall IPC improvements.
93

An adaptive software transactional memory support for multi-core programming

Chan, Kinson. January 2009 (has links)
Thesis (M. Phil.)--University of Hong Kong, 2010. / Includes bibliographical references (leaves 94-98). Also available in print.
94

Exploding Java objects for performance /

Noth, Michael E., January 2003 (has links)
Thesis (Ph. D.)--University of Washington, 2003. / Vita. Includes bibliographical references (leaves 134-139).
95

Surface chemistry of FeHx with dielectric surfaces : towards directed nanocrystal growth

Winkenwerder, Wyatt August, 1981- 07 September 2012 (has links)
The surface chemistry of GeH[subscript x] with dielectric surfaces is relevant to the application of germanium (Ge) nanocrystals for nanocrystal flash memory devices. GeH[subscript x] surface chemistry was first explored for thermally-grown SiO₂ revealing that GeH[subscript x] undergoes two temperature dependent reactions that remove Ge from the SiO₂ surface as GeH₄ and Ge, respectively. Ge only accumulates due to reactions between GeH[subscript x] species that form stable Ge clusters on the SiO₂ surface. Next, a Si-etched SiO₂ surface is probed by GeH[subscript x] revealing that the Si-etching defect activates the surface toward Ge deposition. The activation involves two separate reactions involving, first, the capture of GeH[subscript x] by the defect and second, a reaction between the captured Ge and remaining GeH[subscript x] species leading to the formation of Ge clusters. Reacting the defect with diborane, deactivates it toward GeH[subscript x] and also deactivates intrinsic hydroxyl groups toward GeH[subscript x] adsorption. A structure is proposed for the Si-etching defect. The surface chemistry of GeHx with HfO₂ is studied showing that the hafnium germinate that forms beneath the Ge nanocrystals exists as islands and not a continuous film. Annealing the hafnium germinate under a silane atmosphere will reduce it to Ge while leading to the deposition of hafnium silicate (HfSiO[subscript x]) and silicon (Si). Treating the HfO₂ with silane prior to Ge nanocrystal growth yields a surface with hafnium silicate islands on which Si also deposits. Ge deposition on this surface leads to the suppression of hafnium germinate formation. Electrical testing of capacitors made from Ge nanocrystals and HfO₂ shows that Ge nanocrystals encapsulated in Si/HfSiO[subscript x] layers have greatly improved retention characteristics. / text
96

Scalable hardware memory disambiguation

Sethumadhavan, Lakshminarasimhan, 1978- 29 August 2008 (has links)
Not available
97

Memory region: a system abstraction for managing the complex memory structures of multicore platforms

Lee, Min 13 January 2014 (has links)
The performance of modern many-core systems depends on the effective use of their complex cache and memory structures, and this will likely become more pronounced with the impending arrival of on-chip 3D stacked and non-volatile off-chip byte-addressable memory. Yet to date, operating systems have not treated memory as a first class schedulable resource, embracing memory heterogeneity. This dissertation presents a new software abstraction, called ‘memory region’, which denotes the current set of physical memory pages actively used by workloads. Using this abstraction, memory resources can be scheduled for applications to fully exploit a platform's underlying cache and memory system, thereby gaining improved performance and predictability in execution, particularly for the consolidated workloads seen in virtualized and cloud computing infrastructures. The abstraction's implementation in the Xen hypervisor involves the run-time detection of memory regions, the scheduled mapping of these regions to caches to match performance goals, and maintaining region-to-cache mappings using per-cache page tables. This dissertation makes the following specific contributions. First, its region scheduling method proposes that the location of memory blocks rather than CPU utilization is the principal determinant where workloads are run. It proposes a new scheduling method, the region scheduling that the location of memory blocks determines where the workloads are run. Second, treating memory blocks as first-class resources, new methods for efficient cache management are shown to improve application performance as well as the performance of certain operating system functions. Third, explicit memory scheduling makes it possible to disaggregate operating systems, without the need to change OS sources and with only small markups of target guest OS functionality. With this method, OS functions can be mapped to specific desired platform components, such as file system confined to running on specific cores and using only certain memory resources designated for its use. This can improve performance for applications heavily dependent on certain OS functions, by dynamically providing those functions with the resources needed for their current use, and it can prevent performance-critical application functionality from being needlessly perturbed by OS functions used for other purposes or by other jobs. Fourth, extensions of region scheduling can also help applications deal with the heterogeneous memory resources present in future systems, including on-chip stacked DRAM and NUMA or even NVRAM memory modules. More generally, regions scheduling is shown to apply to memory structures with well-defined differences in memory access latencies.
98

Semantics-oriented low power architecture

Ballapuram, Chinnakrishnan S. 01 April 2008 (has links)
Innovations in the microarchitecture and prominent advances in the semiconductor process technology enable sophisticated and powerful microprocessors. However, they also lead to increased power consumption. The main contribution of the thesis is the demonstration of Semantics-Oriented Low Power Architecture techniques that use the semantics of memory references and variables used in an application program to reduce the power consumption in the memory sub-system of a microprocessor. The Semantic-Aware Multilateral Partitioning (SAM) technique reduces the cache and TLB power consumption by decoupling the data TLB lookups and the data cache accesses, based on the semantic regions defined by the programming languages and the software convention, into discrete reference sub-streams, namely, stack, global static, and heap. To reduce the power consumed by the snoops in Chip Multiprocessor, we propose a hardware technique called Selective Snoop Probe (SSP) and a compiler-based hardware supported technique called Essential Snoop Probe (ESP) that use the properties of the program variables. By selectively sending the snoop probes, the SSP and ESP techniques relax the conservative nature of the cache coherency protocol and its implementation to reduce power and improve performance.
99

Microarchitectural techniques to reduce energy consumption in the memory hierarchy

Ghosh, Mrinmoy 03 April 2009 (has links)
This thesis states that dynamic profiling of the memory reference stream can improve energy and performance in the memory hierarchy. The research presented in this theses provides multiple instances of using lightweight hardware structures to profile the memory reference stream. The objective of this research is to develop microarchitectural techniques to reduce energy consumption at different levels of the memory hierarchy. Several simple and implementable techniques were developed as a part of this research. One of the techniques identifies and eliminates redundant refresh operations in DRAM and reduces DRAM refresh power. Another, reduces leakage energy in L2 and higher level caches for multiprocessor systems. The emphasis of this research has been to develop several techniques of obtaining energy savings in caches using a simple hardware structure called the counting Bloom filter (CBF). CBFs have been used to predict L2 cache misses and obtain energy savings by not accessing the L2 cache on a predicted miss. A simple extension of this technique allows CBFs to do way-estimation of set associative caches to reduce energy in cache lookups. Another technique using CBFs track addresses in a Virtual Cache and reduce false synonym lookups. Finally this thesis presents a technique to reduce dynamic power consumption in level one caches using significance compression. The significant energy and performance improvements demonstrated by the techniques presented in this thesis suggest that this work will be of great value for designing memory hierarchies of future computing platforms.
100

Scratch-pad memory management for static data aggregates

Li, Lian, Computer Science & Engineering, Faculty of Engineering, UNSW January 2007 (has links)
Scratch-pad memory (SPM), a fast on-chip SRAM managed by software, is widely used in embedded systems. Compared to hardware-managed cache, SPM can be more efficient in performance, power and area cost, and has the added advantage of better time predictability. In this thesis, SPMs should be seen in a general context. For example, in stream processors, a software-managed stream register file is usually used to stage data to and from off-chip memory. In IBM's Cell architecture, each co-processor has a software-managed local store for keeping data and instructions. SPM management is critical for SPM-based embedded systems. In this thesis, we propose two novel methodologies, the memory colouring methodology and the perfect colouring methodology, to place the static data aggregates such as arrays and structs of a program in SPM. Our methodologies are dynamic in the sense that some data aggregates can be swapped into and out of SPM during program execution. To this end, a live range splitting heuristic is introduced in order to create potential data transfer statements between SPM and off-chip memory. The memory colouring methodology is a general-purpose compiler approach. The novelty of this approach lies in partitioning an SPM into a pseudo register file then generalising existing graph colouring algorithms for register allocation to colour data aggregates. In this thesis, a scheme for partitioning an SPM into a pseudo register file is introduced. This methodology is inter-procedural and therefore operates on the interference graph for the data aggregates in the whole program. Different graph colouring algorithms may give rise to different results due to live range splitting and spilling heuristics used. As a result, two representative graph colouring algorithms, George and Appel's iterative-coalescing and Park and Moon's optimistic-coalescing, are generalised and evaluated for SPM allocation. Like memory colouring, perfect colouring is also inter-procedural. The novelty of this second methodology lies in formulating the SPM allocation problem as an interval colouring problem. The interval colouring problem is an NP problem and no widely-accepted approximation algorithms exist. The key observation is that the interference graphs for data aggregates in many embedded applications form a special class of superperfect graphs. This has led to the development of two additional SPM allocation algorithms. While differing in whether live range splits and spills are done sequentially or together, both algorithms place data aggregates in SPM based on the cliques in an interference graph. In both cases, we guarantee optimally that all data aggregates in an interference graph can be placed in SPM if the given SPM size is no smaller than the chromatic number of the graph. We have developed two memory colouring algorithms and two perfect colouring algorithms for SPM allocation. We have evaluated them using a set of embedded applications. Our results show that both methodologies are efficient and effective in handling large-scale embedded applications. While neither methodology outperforms the other consistently, perfect colouring has yielded better overall results in the set of benchmarks used in our experiments. All these algorithms are expected to be valuable. For example, they can be made available as part of the same compiler framework to assist the embedded designer with exploring a large number of optimisation opportunities for a particular embedded application.

Page generated in 0.3327 seconds