Spelling suggestions: "subject:"dataparallel architectures"" "subject:"data.parallel architectures""
1 |
Architectural Enhancements for Color Image and Video Processing on Embedded SystemsKim, Jongmyon 21 April 2005 (has links)
As emerging portable multimedia applications demand more and more computational throughput with limited energy consumption, the need for high-efficiency, high-throughput embedded processing is becoming an important challenge in computer architecture.
In this regard, this dissertation addresses application-, architecture-, and technology-level issues in existing processing systems to provide efficient processing of multimedia in many, or ideally all, of its form. In particular, this dissertation explores color imaging in multimedia while focusing on two architectural enhancements for memory- and performance-hungry embedded applications: (1) a pixel-truncation technique and (2) a color-aware instruction set (CAX) for embedded multimedia systems. The pixel-truncation technique differs from previous techniques (e.g., 4:2:2 and 4:2:0 subsampling) used in image and video compression applications (e.g., JPEG and MPEG) in that it reduces the information content in individual pixel word sizes rather than in each dimension. Thus, this technique drastically reduces the bandwidth and memory required to transport and store color images without perceivable distortion in color. At the same time, it maintains the pixel storage format of color image processing in which each pixel computation is performed simultaneously on 3-D YCbCr components, which are widely used in the image and video processing community. CAX supports parallel operations on two-packed 16-bit (6:5:5) YCbCr data in a 32-bit datapath processor, providing greater concurrency and efficiency for processing color image sequences.
This dissertation presents the impact of CAX on processing performance and on both area and energy efficiency for color imaging applications in three major processor architectures: dynamically scheduled (superscalar), statically scheduled (very long instruction word, VLIW), and embedded single instruction multiple data (SIMD) array processors. Unlike typical multimedia extensions, CAX obtains substantial performance and code density improvements through direct support for color data processing rather than depending solely on generic subword parallelism. In addition, the ability to reduce data format size reduces system cost. The reduction in data bandwidth also simplifies system design.
In summary, CAX, coupled with the pixel-truncation technique, provides an efficient mechanism that meets the computational requirements and cost goals for future embedded multimedia products.
|
2 |
Efficient graph algorithm execution on data-parallel architecturesBangalore Lakshminarayana, Nagesh 12 January 2015 (has links)
Mechanisms for improving the execution efficiency of graph algorithms on Data-Parallel Architectures were proposed and identified. Execution of graph algorithms on GPGPU architectures, the prevalent data-parallel architectures was considered. Irregular and data dependent accesses in graph algorithms were found to cause significant idle cycles in GPGPU cores. A prefetching mechanism that reduced the amount of idle cycles by prefetching a data-dependent access pattern found in graph algorithms was proposed. Storing prefetches in unused spare registers in addition to storing them in the cache was shown to be more effective by the prefetching mechanism. The design of the cache hierarchy for graph algorithms was explored. First, an exclusive cache hierarchy was shown to be beneficial at the cost of increased traffic; a region based exclusive cache hierarchy was shown to be similar in performance to an exclusive cache hierarchy while reducing on-chip traffic. Second, bypassing cache blocks at both the level one and level two caches was shown to be beneficial. Third, the use of fine-grained memory accesses (or cache sub-blocking) was shown to be beneficial. The combination of cache bypassing and fine-grained memory accesses was shown to be more beneficial than applying the two mechanisms individually. Finally, the impact of different implementation strategies on algorithm performance was evaluated for the breadth first search algorithm using different input graphs and heuristics to identify the best performing implementation for a given input graph were also discussed.
|
Page generated in 0.1044 seconds