301 |
Modeling and implementation of an integrated pixel processing tile for focal plane systems / Robinson, William Hugh, January 2003 (has links) (PDF)
Thesis (Ph. D.)--School of Electrical and Computer Engineering, Georgia Institute of Technology, 2004. Directed by D. Scott Wills. / Vita. Includes bibliographical references (leaves 105-111).
|
302 |
Fast performance evaluation and design space exploration of multiprocessor distributed computing platform / Zhang, Yukan. January 2009 (has links)
Thesis (M.S.)--State University of New York at Binghamton, Thomas J. Watson School of Engineering and Applied Science, Department of Electrical and Computer Engineering, 2009. / Includes bibliographical references.
|
303 |
Probing for a continual validation prototype / Gill, Peter W. January 2001 (has links)
Thesis (M.S.)--Worcester Polytechnic Institute. / Keywords: run-time monitoring; continual validation; software probes; probing. Includes bibliographical references (p. 98-101).
|
304 |
Performance enhancing software loop transformations for embedded VLIW/EPIC processors / Akturan, Cagdas, January 2001 (has links)
Thesis (Ph. D.)--University of Texas at Austin, 2001. / Vita. Includes bibliographical references (leaves 94-102). Available also in a digital version from Dissertation Abstracts.
|
305 |
Architectural techniques to accelerate multimedia applications on general-purpose processors / Talla, Deependra, January 2001 (has links)
Thesis (Ph. D.)--University of Texas at Austin, 2001. / Vita. Includes bibliographical references (leaves 139-149). Available also in a digital version from Dissertation Abstracts.
|
306 |
Facilitating hard active database applications / Warshaw, Lane Bradley, January 2001 (has links)
Thesis (Ph. D.)--University of Texas at Austin, 2001. / Vita. Includes bibliographical references (leaves 151-160). Available also in a digital version from Dissertation Abstracts.
|
307 |
Modular detection of feature interactions through theorem proving: a case study / Roberts, Brian Glenn. January 2003 (has links)
Thesis (M.S.)--Worcester Polytechnic Institute. / Keywords: theorem proving; modular verification; software verification; feature-oriented programming; feature interaction. Includes bibliographical references (p. 131-136).
|
308 |
Network-on-chip implementation and performance improvement through workload characterization and congestion awareness / Gratz, Paul V., 1970- 09 October 2012 (has links)
Off-chip interconnection networks provide for communication between processors and components within computer systems. Semiconductor process technology trends have led to the inclusion of multiple processors and components on a single chip, and research has recently focused on on-chip interconnection networks to connect them. On-chip networks provide a scalable, high-bandwidth interconnect, integrated tightly with the microarchitecture to achieve high performance. They also present several new challenges, different from off-chip networks, including tighter constraints on power, area, and end-to-end latency. In this dissertation, I propose interconnection network architectures that address the unique on-chip design challenges of power and end-to-end latency. My work on the design, implementation, and evaluation of the on-chip networks of the TRIPS project's prototype processor, a real hardware implementation, is the foundation for this research. Based on my analysis of the TRIPS on-chip networks and their workloads, I propose, design, and evaluate novel network architectures for congestion monitoring and adaptive routing that are matched to the design constraints of on-chip networks.

In the TRIPS system we designed, and implemented in silicon, a distributed processor microarchitecture in which traditional processor components are divided into a collection of self-contained tiles. One novel aspect of the TRIPS system is the set of control and data networks that the tiles use to communicate with one another. I worked on the design and implementation of one of these networks, the On-Chip Network (OCN). The OCN, a 4x10 mesh network, interconnects the tiles of the L2 cache, the two processor cores, and various I/O units. Another on-chip network, the Operand Network (OPN), interconnects the execution units and serves as a bypass network, integrated tightly with the processor core. In this document I evaluate these two on-chip networks and their workloads; the evaluations serve as case studies in how on-chip design constraints affect the design of on-chip networks.

In examining the TRIPS OCN and OPN, one insight we gained was that network resource imbalances can lead to congestion and poor performance. We found that these imbalances are transient, varying with time and task. Timely information about the status of the network can be used to balance resource utilization or reduce power. A challenge lies in providing the right information, conveyed in a timely fashion, because the metrics and methods used in off-chip networks do not map well to on-chip networks. In this document, I propose and evaluate several metrics of network congestion for their utility and feasibility in an on-chip environment.

We also found that minimizing end-to-end packet latency is critical to maintaining good system performance. Effective use of congestion information without impact on end-to-end latency is another challenge in on-chip networking. I explore novel adaptive routing techniques that address the challenge of managing end-to-end latency. A method that produces good results is aggregation of network status information, which reduces both the bandwidth and the latency required to transmit that information. In this dissertation I examine how well this technique and others compare with conventional oblivious and adaptive routing. / text
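The aggregation idea lends itself to a compact illustration. Below is a minimal, hypothetical sketch of congestion-aware minimal adaptive routing on a 2D mesh with the OCN's 4x10 dimensions: per-router buffer occupancy stands in for the congestion metric, and a square-region average stands in for the aggregated status information. Both choices are illustrative assumptions, not the metrics or aggregation scheme actually evaluated in the dissertation.

```python
# Hypothetical sketch: congestion-aware minimal adaptive routing on a 2D mesh.
# occupancy[i][j] = number of buffered flits at router (i, j); the metric and
# the square-region aggregate below are illustrative assumptions.

def regional_congestion(occupancy, x, y, radius=2):
    """Average occupancy of routers within `radius` hops of (x, y). Hardware
    would propagate a running aggregate between neighbors rather than scan."""
    rows, cols = len(occupancy), len(occupancy[0])
    region = [occupancy[i][j]
              for i in range(max(0, x - radius), min(rows, x + radius + 1))
              for j in range(max(0, y - radius), min(cols, y + radius + 1))]
    return sum(region) / len(region)

def next_hop(cur, dst, occupancy):
    """Among the productive directions (those reducing distance to dst),
    pick the neighbor whose region looks least congested."""
    (cx, cy), (dx, dy) = cur, dst
    candidates = []
    if dx != cx:
        candidates.append((cx + (1 if dx > cx else -1), cy))
    if dy != cy:
        candidates.append((cx, cy + (1 if dy > cy else -1)))
    if not candidates:                       # already at the destination
        return cur
    return min(candidates,
               key=lambda n: regional_congestion(occupancy, n[0], n[1]))

# Example on a 4x10 mesh: a hotspot at router (1, 3) steers a packet headed
# from (0, 2) to (3, 5) toward the less congested productive direction.
occupancy = [[0] * 10 for _ in range(4)]
occupancy[1][3] = 8
print(next_hop((0, 2), (3, 5), occupancy))
```

Because the aggregate is a regional summary rather than a per-router report, status information can be conveyed with far less bandwidth and latency, which is the property the dissertation identifies as critical on chip.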
|
309 |
Performance-efficient mechanisms for managing irregularity in throughput processors / Rhu, Minsoo 01 July 2014 (has links)
Recent graphics processing units (GPUs) have emerged as a promising platform for general-purpose computing and have been shown to be very efficient at executing parallel applications with regular control and memory-access behavior. Current GPU architectures primarily adopt the single-instruction multiple-thread (SIMT) programming model, which balances programmability and hardware efficiency. With SIMT, the programmer writes application code to be executed by scalar threads, and each thread supports conditional branches and fine-grained load/store instructions for ease of programming. At the same time, the hardware and software collaboratively group scalar threads for execution on a vectorized single-instruction multiple-data (SIMD) in-order pipeline, simplifying hardware design.

As GPUs gain momentum across application domains, these throughput processors will increasingly be expected to execute irregular applications efficiently. Current GPUs, however, suffer from reduced thread-level parallelism, underutilization of compute resources, inefficient on-chip caching, and wasted off-chip memory bandwidth when running highly irregular programs with divergent control flow and memory accesses.

In this dissertation, I develop techniques that enable simple, robust, and highly effective performance optimizations for SIMT-based throughput processor architectures so that they can better manage irregularity. I first identify that previously suggested optimizations for the divergent control-flow problem suffer from the following limitations: 1) serialized execution of diverging paths, 2) lack of robustness across regular and irregular codes, and 3) limited applicability. Based on these observations, I propose and evaluate three novel mechanisms that resolve these issues, providing significant performance improvements while minimizing implementation overhead.

In the second half of the dissertation, I observe that conventional coarse-grained memory hierarchy designs do not take into account the massively multi-threaded nature of GPUs, which leads to substantial waste of off-chip memory bandwidth. I design and evaluate a locality-aware memory hierarchy for throughput processors, which retains the advantages of coarse-grained accesses for spatially and temporally local programs while permitting selective fine-grained access to memory. By adaptively adjusting the access granularity, memory bandwidth and energy consumption are reduced for data with low spatial/temporal locality, without incurring additional control overhead or sacrificing prefetching potential for data with high spatial locality. / text
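As a concrete illustration of the adaptive-granularity idea, here is a small, hypothetical sketch: given the byte addresses one warp issues, a whole cache line is fetched only when enough of its sectors are actually touched, and individual sectors otherwise. The 128 B line, 32 B sector, and the threshold are assumptions for illustration, not parameters taken from the dissertation.

```python
# Hypothetical sketch: selecting coarse- vs. fine-grained fetches per warp.
from collections import defaultdict

LINE, SECTOR = 128, 32       # line/sector sizes in bytes (illustrative values)

def plan_fetches(addresses, coarse_threshold=3):
    """Return (line_base, sectors_to_fetch) requests for one warp's addresses:
    a densely touched line is fetched whole, a sparse one sector by sector."""
    touched = defaultdict(set)
    for addr in addresses:
        line = (addr // LINE) * LINE
        touched[line].add((addr % LINE) // SECTOR)
    plan = []
    for line, sectors in sorted(touched.items()):
        if len(sectors) >= coarse_threshold:        # spatially local: coarse
            plan.append((line, set(range(LINE // SECTOR))))
        else:                                       # divergent: fine-grained
            plan.append((line, sectors))
    return plan

# Coalesced access: 32 consecutive 4-byte words -> one full-line fetch.
print(plan_fetches([i * 4 for i in range(32)]))
# Scattered access: one word per 4 KiB stride -> 32 single-sector fetches.
print(len(plan_fetches([i * 4096 for i in range(32)])))
```

The point of the policy is the same trade-off the abstract describes: divergent warps stop dragging in whole lines they will never reuse, while well-behaved warps keep the bandwidth and prefetching benefits of coarse-grained access.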
|
310 |
Profile-guided loop parallelization and co-scheduling on GPU-based heterogeneous many-core architectures / Han, Guodong, 韩国栋 January 2013 (has links)
GPU-based heterogeneous architectures (e.g., Tianhe-1A, Nebulae), which combine multi-core CPUs with GPUs, have seen increasing adoption and are becoming the norm in supercomputing because they are cost-effective and power-efficient.
However, programming such heterogeneous architectures still requires significant effort from application developers, who must use sophisticated GPU programming languages such as CUDA and OpenCL. Although some automatic parallelization tools based on static analysis can ease the programming effort, the imprecision of static analysis limits them to loops that are provably free of inter-iteration dependencies (i.e., loops determined to be DO-ALL).
To exploit the abundant runtime parallelism and take full advantage of the computing resources of both the CPU and the GPU, this work proposes a new user-friendly compiler framework and runtime system that helps Java applications harness the full power of a heterogeneous system. It presents an all-round system design that unifies the programming style and language for transparent use of both CPUs and GPUs, automatically parallelizes all kinds of loops, and schedules workloads efficiently across CPU and GPU resources while ensuring data coherence during highly threaded execution.

By means of simple user annotations, sequential Java source code is analyzed, translated, and compiled into a dual executable consisting of CUDA kernels and multiple Java threads that run on GPU and CPU cores respectively. Annotated loops are automatically split into loop chunks (tasks) that are scheduled to execute on all available GPU and CPU cores.

To guide the runtime task scheduling, we develop a novel dynamic loop profiler that generates the program dependency graph (PDG) and computes the density of dependencies across iterations through a hybrid checking scheme combining intra-warp and inter-warp analyses. Implementing a GPU-tailored thread-level speculation (TLS) model, our system supports speculative execution of loops with moderate dependency densities, and privatization of loops that have only false dependencies, on the GPU side. Our scheduler also supports task-stealing and task-sharing algorithms that allow swift load redistribution across the GPU and CPU.
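To make the profiler-driven decision concrete, the following is a highly simplified, hypothetical sketch: it estimates cross-iteration dependency density from per-iteration read/write sets and picks an execution mode. The thesis's profiler instead builds a PDG on the GPU with combined intra-warp and inter-warp checking; the threshold and mode names below are illustrative assumptions.

```python
# Hypothetical sketch: choosing an execution mode from profiled dependences.

def dependency_density(reads, writes):
    """Fraction of ordered iteration pairs (i, j), i < j, in which iteration j
    touches an address that iteration i wrote (a cross-iteration dependence)."""
    n = len(writes)
    pairs = n * (n - 1) // 2
    deps = sum(1
               for i in range(n)
               for j in range(i + 1, n)
               if writes[i] & (reads[j] | writes[j]))
    return deps / pairs if pairs else 0.0

def pick_mode(density, tls_threshold=0.3):
    """Illustrative policy: fully parallel when no dependences are observed,
    speculative (TLS) at moderate density, sequential otherwise."""
    if density == 0.0:
        return "do-all"        # run all loop chunks in parallel on the GPU
    if density < tls_threshold:
        return "speculative"   # GPU TLS: execute, validate, roll back on conflict
    return "sequential"        # fall back to ordinary serialized execution

# Example: iteration i reads a[i-1] (written by the previous iteration) and a[i].
reads  = [{max(i - 1, 0), i} for i in range(8)]
writes = [{i} for i in range(8)]
d = dependency_density(reads, writes)   # 7 dependent pairs / 28 = 0.25
print(d, pick_mode(d))                  # moderate density -> "speculative"
```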
We have carried out several experiments to evaluate the profiling overhead and used up to 11 real-life applications to evaluate overall system performance. The results show that the profiling overhead is moderate compared with sequential execution, and that almost all of the applications benefit from our system. / published_or_final_version / Computer Science / Master / Master of Philosophy
|