471 |
Architectural Enhancements for Color Image and Video Processing on Embedded Systems / Kim, Jongmyon, 21 April 2005 (has links)
As emerging portable multimedia applications demand ever more computational throughput under tight energy budgets, providing high-efficiency, high-throughput embedded processing has become an important challenge in computer architecture.
In this regard, this dissertation addresses application-, architecture-, and technology-level issues in existing processing systems to provide efficient processing of multimedia in many, or ideally all, of its forms. In particular, this dissertation explores color imaging in multimedia while focusing on two architectural enhancements for memory- and performance-hungry embedded applications: (1) a pixel-truncation technique and (2) a color-aware instruction set (CAX) for embedded multimedia systems. The pixel-truncation technique differs from previous techniques (e.g., 4:2:2 and 4:2:0 subsampling) used in image and video compression applications (e.g., JPEG and MPEG) in that it reduces the information content in individual pixel word sizes rather than in each dimension. Thus, this technique drastically reduces the bandwidth and memory required to transport and store color images without perceivable distortion in color. At the same time, it maintains the pixel storage format of color image processing in which each pixel computation is performed simultaneously on 3-D YCbCr components, which are widely used in the image and video processing community. CAX supports parallel operations on two packed 16-bit (6:5:5) YCbCr values in a 32-bit datapath processor, providing greater concurrency and efficiency for processing color image sequences.
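As an illustration of the data layout implied by the pixel-truncation technique, the hedged C sketch below truncates 8-bit YCbCr components to a 6:5:5 format and packs two such pixels into one 32-bit word. The field widths and the 32-bit packing follow the abstract; the function names, bit ordering, and the choice of keeping the most significant bits are illustrative assumptions.

```c
#include <stdint.h>
#include <stdio.h>

/* Truncate one 8-bit-per-component YCbCr pixel to 16 bits (6:5:5).
 * Bit layout (an assumption): Y in bits 15..10, Cb in bits 9..5, Cr in bits 4..0.
 * The most significant bits of each component are kept. */
static uint16_t truncate_ycbcr_655(uint8_t y, uint8_t cb, uint8_t cr)
{
    uint16_t y6  = y  >> 2;   /* 8 -> 6 bits */
    uint16_t cb5 = cb >> 3;   /* 8 -> 5 bits */
    uint16_t cr5 = cr >> 3;   /* 8 -> 5 bits */
    return (uint16_t)((y6 << 10) | (cb5 << 5) | cr5);
}

/* Pack two truncated pixels into one 32-bit word, as used by a 32-bit datapath. */
static uint32_t pack_two_pixels(uint16_t p0, uint16_t p1)
{
    return ((uint32_t)p1 << 16) | p0;
}

int main(void)
{
    uint16_t p0 = truncate_ycbcr_655(200, 100, 50);
    uint16_t p1 = truncate_ycbcr_655( 64, 128, 90);
    printf("packed word: 0x%08X\n", (unsigned)pack_two_pixels(p0, p1));
    return 0;
}
```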
This dissertation presents the impact of CAX on processing performance and on both area and energy efficiency for color imaging applications in three major processor architectures: dynamically scheduled (superscalar), statically scheduled (very long instruction word, VLIW), and embedded single instruction multiple data (SIMD) array processors. Unlike typical multimedia extensions, CAX obtains substantial performance and code density improvements through direct support for color data processing rather than depending solely on generic subword parallelism. In addition, the ability to reduce data format size reduces system cost. The reduction in data bandwidth also simplifies system design.
In summary, CAX, coupled with the pixel-truncation technique, provides an efficient mechanism that meets the computational requirements and cost goals for future embedded multimedia products.
|
472 |
DLL-Conscious Instruction Fetch Optimization for SMT Processors / Mohamood, Fayez, 12 April 2006 (has links)
Simultaneous multithreading (SMT) processors can issue multiple instructions from distinct processes or threads in the same cycle. This technique effectively increases overall throughput by keeping the pipeline resources better utilized, at the potential expense of reduced single-thread performance due to resource sharing. In the software domain, an increasing number of Dynamically Linked Libraries (DLLs) are used by applications and operating systems, providing better flexibility and modularity and enabling code sharing. A significant amount of execution time in software today is spent executing standard DLL code that is shared among multiple threads or processes. However, for an SMT processor with a virtually indexed cache implementation, existing instruction fetch mechanisms can induce unnecessary false cache misses on DLL instructions that were intended to be shared. This problem is more conspicuous when multiple independent threads execute concurrently in an SMT processor.
This work investigates an often-neglected form of contention between running threads in the I-TLB and I-cache caused by DLLs. To address these shortcomings, we propose a system-level technique involving a lightweight modification to the microarchitecture and the OS. By exploiting the nature of the DLLs in our new architecture, we are able to reinstate physical sharing of the DLLs in an SMT machine. Using Microsoft Windows-based applications, our simulation results show that the optimized instruction fetch mechanism can reduce the number of DLL misses by up to 5.5 times and improve the instruction cache hit rate by up to 62%, resulting in up to 30% DLL IPC improvements and up to 15% overall IPC improvements.
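To make the false-miss problem concrete, the hedged C sketch below computes the set index and tag that a virtually indexed, virtually tagged instruction cache would derive for the same shared DLL code page mapped at different virtual bases in two processes. The cache geometry, the addresses, and this particular cause of duplication are illustrative assumptions; the exact fetch mechanism analyzed in the thesis may differ.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative cache geometry (an assumption): 64-byte lines, 128 sets. */
#define LINE_BITS 6u
#define NUM_SETS  128u

static uint64_t set_index(uint64_t vaddr) { return (vaddr >> LINE_BITS) % NUM_SETS; }
static uint64_t tag(uint64_t vaddr)       { return vaddr >> LINE_BITS; }

int main(void)
{
    /* The same physical DLL code page, mapped at different virtual bases
     * in two processes (hypothetical addresses). */
    uint64_t dll_in_proc_a = 0x7ff6a0001000ULL;
    uint64_t dll_in_proc_b = 0x7ff6b4006000ULL;
    uint64_t offset        = 0x240;   /* same instruction inside the page */

    uint64_t va = dll_in_proc_a + offset;
    uint64_t vb = dll_in_proc_b + offset;

    /* With virtual indexing/tagging, the two mappings occupy different
     * (set, tag) pairs, so the shared code is duplicated in the I-cache and
     * one thread cannot reuse the lines the other thread already fetched --
     * the kind of "false miss" the thesis targets. */
    printf("proc A: set=%3llu tag=0x%llx\n",
           (unsigned long long)set_index(va), (unsigned long long)tag(va));
    printf("proc B: set=%3llu tag=0x%llx\n",
           (unsigned long long)set_index(vb), (unsigned long long)tag(vb));
    return 0;
}
```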
|
473 |
Subverting Linux on-the-fly using hardware virtualization technology / Athreya, Manoj B., 13 May 2010 (has links)
In this thesis, we address the problem that modern operating systems face when attackers exploit hardware-assisted full-virtualization technology.
Virtualization technology has grown steadily in importance. With its help, multiple operating systems can be run on a single piece of hardware with little or no modification to the operating system. Both Intel and AMD have contributed to x86 full virtualization through their respective instruction set architectures, and hardware virtualization extensions can now be found in almost all x86 processors.
Hardware virtualization technology has also opened a new frontier for attack. A system hacker can abuse hardware virtualization technology to gain control over an operating system on-the-fly (i.e., without a system restart) by installing a thin Virtual Machine Monitor (VMM) below the native operating system. Such VMM-based malware is termed a Hardware-Assisted Virtual Machine (HVM) rootkit. We discuss the technique used by a rootkit named Blue Pill to subvert the Windows Vista operating system by exploiting the AMD-V (codenamed "Pacifica") virtualization extensions. HVM rootkits do not hook any operating system code or data regions; hence, detecting such malware using conventional techniques becomes extremely difficult. This thesis discusses existing methods to detect such rootkits and their inefficiencies.
In this work, we implement a proof-of-concept HVM rootkit using Intel VT hardware virtualization technology and also discuss how such an attack can be defended against by using an autonomic architecture called SHARK, which was proposed by Vikas et al. in MICRO 2008.
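One conventional detection approach of the kind the thesis critiques is timing analysis: instructions such as CPUID unconditionally trap to a resident hypervisor, inflating their latency. The hedged C sketch below (GCC/Clang on x86, using the compilers' cpuid.h and __rdtsc intrinsics) measures that latency; the threshold and interpretation are assumptions for illustration, and the thesis's actual detection discussion and SHARK-based defense are not reproduced here.

```c
#include <stdio.h>
#include <stdint.h>
#include <cpuid.h>        /* __get_cpuid (GCC/Clang, x86 only) */
#include <x86intrin.h>    /* __rdtsc */

/* Time one CPUID instruction in TSC cycles. CPUID is an unconditionally
 * intercepted instruction under VT-x/AMD-V, so a resident hypervisor adds
 * a VM exit/entry round trip to every execution. */
static uint64_t time_cpuid_once(void)
{
    unsigned a, b, c, d;
    uint64_t start = __rdtsc();
    __get_cpuid(0, &a, &b, &c, &d);
    return __rdtsc() - start;
}

int main(void)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < 1000; i++) {        /* take the minimum to reduce noise */
        uint64_t t = time_cpuid_once();
        if (t < best) best = t;
    }
    /* Illustrative threshold only: bare-metal CPUID typically costs a few
     * hundred cycles, while a VM exit round trip adds thousands. A stealthy
     * HVM rootkit can defeat this check (e.g., by manipulating the TSC),
     * which is one reason conventional detection is considered insufficient. */
    printf("min CPUID latency: %llu cycles%s\n",
           (unsigned long long)best,
           best > 2000 ? "  (suspiciously high)" : "");
    return 0;
}
```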
|
474 |
Micro-scheduling and its interaction with cache partitioning / Choudhary, Dhruv, 05 July 2011 (has links)
The thesis explores the sources of energy inefficiency in asymmetric multi-core architectures, where energy efficiency is measured by the energy-delay squared product. The insights gathered from this study drive the development of optimized thread scheduling and coordinated cache management strategies in an important class of asymmetric shared memory architectures. The proposed techniques are founded on well-known mathematical optimization techniques, yet are lightweight enough to be implemented in practical systems.
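As a toy illustration of the metric being optimized, the hedged C sketch below evaluates candidate thread placements by their energy-delay-squared product (ED²P = E·D²) and keeps the best one. The candidate list, the energy/delay numbers, and the exhaustive search are illustrative assumptions, not the thesis's scheduling or cache-partitioning algorithm.

```c
#include <stdio.h>

/* One candidate configuration: which core type a thread runs on and how much
 * shared cache it receives (values are made up for illustration). */
struct candidate {
    const char *label;
    double energy_j;   /* estimated energy for the workload, joules  */
    double delay_s;    /* estimated execution time, seconds          */
};

static double ed2p(const struct candidate *c)
{
    return c->energy_j * c->delay_s * c->delay_s;   /* E * D^2 */
}

int main(void)
{
    struct candidate cands[] = {
        { "big core, 8 cache ways",    4.0, 1.0 },
        { "big core, 4 cache ways",    3.6, 1.2 },
        { "little core, 8 cache ways", 2.0, 1.8 },
        { "little core, 4 cache ways", 1.8, 2.2 },
    };
    int best = 0;
    for (int i = 1; i < 4; i++)
        if (ed2p(&cands[i]) < ed2p(&cands[best])) best = i;

    printf("lowest ED^2P: %s (%.2f J*s^2)\n",
           cands[best].label, ed2p(&cands[best]));
    return 0;
}
```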
|
475 |
Performance Evaluation of Embedded Microcomputers for Avionics Applications / Bilen, Celal Can; Alcalde, John, January 2010 (has links)
Embedded microcomputers are used in a wide range of applications nowadays. Avionics is one of these areas and requires extra attention regarding reliability and determinism. Thus, these issues should be borne in mind in addition to performance when evaluating embedded microcomputers.
This master thesis suggests a framework for performance evaluation of two members of the PowerPC microprocessor family, namely the MPC5554 from Freescale and the PPC440EPx from AMCC, and analyzes the results within and between these processors. The framework can be generalized for use with any microprocessor family, if required.
Apart from performance evaluation, this thesis also suggests a new terminology by introducing the concept of determinism levels, so that determinism issues in avionics applications can be estimated more clearly, which is crucial given the requirements and operating conditions of this application domain. Such estimation does not include practical results as the performance evaluation does, but remains theoretical. Similar to Automark™ used by AutoBench™ in the EEMBC Benchmark Suite, we introduce a new performance metric score that we call "Aviomark", and we carry out a detailed comparison of Aviomark with the traditional Automark™ score to see how Aviomark differs from Automark™ in behavior.
Finally, we have developed a graphical user interface (GUI) that works in parallel with the Green Hills MULTI Integrated Development Environment (IDE) in order to simplify and automate the evaluation process. With the help of the GUI, users can easily evaluate their specific PowerPC processors by starting the debugging from the MULTI IDE.
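The Aviomark score, like EEMBC's AutoMark™, condenses many individual benchmark results into a single figure of merit. The hedged C sketch below shows one common way such composite scores are formed, namely a geometric mean of per-benchmark results normalized to a reference platform; the normalization, the geometric-mean aggregation, and the sample numbers are assumptions for illustration, not the actual Aviomark or AutoMark™ definitions.

```c
#include <stdio.h>
#include <math.h>

/* Per-benchmark runtimes (seconds) on a reference platform and on the device
 * under test; all numbers are made up for illustration. */
static const double ref_time[] = { 1.00, 2.50, 0.80, 4.20 };
static const double dut_time[] = { 0.40, 1.10, 0.50, 2.00 };
#define N (int)(sizeof(ref_time) / sizeof(ref_time[0]))

int main(void)
{
    /* Normalize each benchmark (speedup vs. reference), then take the
     * geometric mean so that no single benchmark dominates the composite. */
    double log_sum = 0.0;
    for (int i = 0; i < N; i++)
        log_sum += log(ref_time[i] / dut_time[i]);

    double composite = exp(log_sum / N);
    printf("composite score (geometric mean of speedups): %.3f\n", composite);
    return 0;
}
```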
|
476 |
FPGA-based hardware accelerator design for performance improvement of a system-on-a-chip application / Vyas, Dhaval N., January 2005 (has links)
Thesis (M.S.)--State University of New York at Binghamton, Department of Electrical and Computer Engineering. / Includes bibliographical references (p. 55-56).
|
477 |
Platforms for HPJava: Runtime support for scalable programming in Java / Lim, Sang Boem; Erlebacher, Gordon, January 2003 (has links)
Thesis (Ph. D.)--Florida State University, 2003. / Advisor: Dr. Gordon Erlebacher, Florida State University, College of Arts and Sciences, Dept. of Computer Science. Title and description from dissertation home page (viewed Apr. 8, 2004). Includes bibliographical references.
|
478 |
Simulation of distributed object oriented servers / Kwok, Chee Khan, January 2003 (has links) (PDF)
Thesis (M.S. in Computer Science)--Naval Postgraduate School, December 2003. / Thesis advisor(s): Bill Ray, Shing Man Tak. Includes bibliographical references (p. 215). Also available online.
|
479 |
Performance evaluation of high performance parallel I/O / Dhandapani, Mangayarkarasi, January 2003 (has links) (PDF)
Thesis (M.S.)--Mississippi State University. Department of Computer Science and Engineering. / Title from title screen. Includes bibliographical references.
|
480 |
Memory-subsystem resource management for the many-core era / Kaseridis, Dimitrios, 11 July 2012 (has links)
As semiconductor technology continues to scale deeper into the nanometer era, communication between the processor and main memory has become particularly challenging. The well-studied frequency, memory, and power "walls" have redirected architects towards Chip Multiprocessors (CMPs) as an attractive architecture for leveraging technology scaling. In order to achieve high efficiency and throughput, CMPs rely heavily on sharing resources among multiple cores, especially in the case of the memory hierarchy. Unfortunately, such sharing introduces resource contention and interference between the multiple executing threads.
The ever-increasing access latency gap between processor and memory, the growing bandwidth demands on main memory, and the shrinking cache capacity available to each core as more cores are integrated have made efficient memory-subsystem resource management more critical than ever before.
This dissertation focuses on managing the sharing of the Last-level Cache (LLC) capacity and the main memory bandwidth, as the two most important resources that significantly affect system performance and energy consumption. The presented schemes include efficient solutions to all three basic requirements for implementing a resource management scheme, that is: a) profiling mechanisms to capture applications' resource requirements, b) microarchitectural mechanisms to enforce a resource allocation scheme, and c) resource allocation algorithms/policies to manage the available memory resources throughout the whole memory hierarchy of a CMP system.
To achieve these targets, the dissertation first describes a set of low-overhead, non-invasive profiling mechanisms that are able to project applications' memory resource requirements and memory sharing behavior. Two memory resource partitioning schemes are then presented. The first, the Bank-aware dynamic partitioning scheme, provides a low-overhead solution for partitioning cache resources of large CMP architectures based on a Dynamic Non-Uniform Cache Architecture (DNUCA) last-level cache design, consistent with current industry trends. The second, the Bandwidth-aware dynamic scheme, presents a system-wide optimization of memory-subsystem resource allocation and job scheduling for large, multi-chip CMP systems. The scheme seeks optimizations both within and across individual CMP chips, aiming at improvements in overall system throughput and efficiency.
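As a rough illustration of how profiled resource requirements can drive a cache allocation policy, the hedged C sketch below performs a greedy, marginal-utility partitioning of last-level-cache ways among cores using per-core miss-rate curves. The miss curves, the greedy loop, and the way granularity are illustrative assumptions and not the Bank-aware or Bandwidth-aware schemes described in the dissertation.

```c
#include <stdio.h>

#define CORES 4
#define WAYS  16

/* Misses per kilo-instruction if a core is given w ways (w = 0..WAYS).
 * These curves would come from the profiling mechanisms; here they are made up. */
static double mpki(int core, int ways)
{
    static const double base[CORES] = { 40.0, 25.0, 10.0, 5.0 };
    return base[core] / (1.0 + ways);     /* diminishing returns with more ways */
}

int main(void)
{
    int alloc[CORES] = { 0 };

    /* Greedily hand out one way at a time to the core whose miss rate
     * drops the most from receiving it (largest marginal utility). */
    for (int w = 0; w < WAYS; w++) {
        int best_core = 0;
        double best_gain = -1.0;
        for (int c = 0; c < CORES; c++) {
            double gain = mpki(c, alloc[c]) - mpki(c, alloc[c] + 1);
            if (gain > best_gain) { best_gain = gain; best_core = c; }
        }
        alloc[best_core]++;
    }

    for (int c = 0; c < CORES; c++)
        printf("core %d: %2d ways (%.2f MPKI)\n", c, alloc[c], mpki(c, alloc[c]));
    return 0;
}
```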
Because cache partitioning schemes with isolated partitions impose restrictions on the use of the last-level cache that can severely affect the performance of large CMP designs, this dissertation also presents a Quasi-partitioning scheme that removes such restrictions while providing most of the benefits of cache partitioning. The presented solution scales efficiently to a significantly larger number of cores than previously described schemes based on isolated partitions can achieve.
Finally, as the memory controller is one of the fundamental components of the memory subsystem, a well-designed memory-subsystem resource management scheme needs to use the memory controller's resources carefully and coordinate its functionality with the operation of main memory and the last-level cache. To improve execution fairness and system throughput, this dissertation presents a criticality-based priority scheme for memory controller requests. The scheme ranks demand read and prefetch operations based on their latency sensitivity, while coordinating its operation with the DRAM page-mode policy and the memory data prefetcher.
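To show the flavor of such a request priority scheme, the hedged C sketch below orders pending memory requests by demand-vs-prefetch class and a criticality rank first, then prefers row-buffer (page-mode) hits among equally critical requests, and finally falls back to request age. The ranking heuristic and request fields are illustrative assumptions, not the dissertation's actual controller design.

```c
#include <stdio.h>
#include <stdlib.h>

struct mem_request {
    int is_prefetch;    /* prefetches rank below demand reads               */
    int criticality;    /* higher = more latency-sensitive (illustrative)   */
    int row_hit;        /* 1 if the request targets the currently open row  */
    int age;            /* cycles waiting in the controller queue           */
    const char *label;
};

/* Comparator for qsort: most urgent request first. */
static int cmp_requests(const void *pa, const void *pb)
{
    const struct mem_request *a = pa, *b = pb;
    if (a->is_prefetch != b->is_prefetch) return a->is_prefetch - b->is_prefetch;
    if (a->criticality != b->criticality) return b->criticality - a->criticality;
    if (a->row_hit     != b->row_hit)     return b->row_hit     - a->row_hit;
    return b->age - a->age;               /* oldest first as a tiebreaker */
}

int main(void)
{
    struct mem_request q[] = {
        { 0, 2, 0, 10, "demand read, critical, row miss" },
        { 1, 1, 1, 40, "prefetch, row hit"               },
        { 0, 2, 1,  5, "demand read, critical, row hit"  },
        { 0, 1, 0, 30, "demand read, less critical"      },
    };
    qsort(q, sizeof(q) / sizeof(q[0]), sizeof(q[0]), cmp_requests);

    for (unsigned i = 0; i < sizeof(q) / sizeof(q[0]); i++)
        printf("%u: %s\n", i, q[i].label);
    return 0;
}
```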
|