About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.

Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Sense of Past ... Sense of Place

Boland, Katherine Ellen. January 2008 (has links) (PDF)
Thesis (M Arch)--Montana State University--Bozeman, 2008. / Typescript. Chairperson, Graduate Committee: Christopher Livingston. Includes bibliographical references (leaves 136-149).
2

Parzsweep: A Novel Parallel Algorithm for Volume Rendering of Regular Datasets

Ramswamy, Lakshmy 10 May 2003 (has links)
The sweep paradigm for volume rendering has previously been applied successfully to irregular grids. This thesis describes a parallel volume rendering algorithm called PARZSweep for regular grids that utilizes the sweep paradigm. The sweep paradigm is a concept where a plane sweeps the data volume parallel to the viewing direction. As the sweep proceeds in increasing order of z, the faces incident on the vertices are projected onto the viewing volume to contribute to the image. The sweeping ensures that all faces are projected in the correct order, and the image thus obtained is very accurate in its details. PARZSweep is an extension of a serial algorithm for regular grids called RZSweep. The hypothesis of this research is that a parallel version of RZSweep can be designed and implemented which will utilize multiple processors to reduce rendering times. PARZSweep follows an approach called image-based task scheduling, or tiling. This approach divides the image space into tiles and allocates each tile to a processor for individual rendering. The sub-images are then composited to form the complete final image. PARZSweep uses a shared-memory architecture in order to take advantage of inherent cache coherency for faster communication between processors. Experiments were conducted comparing RZSweep and PARZSweep with respect to prerendering times, rendering times, and image quality. RZSweep and PARZSweep have approximately the same prerendering costs and produce exactly the same images, while PARZSweep substantially reduces rendering times. PARZSweep was evaluated for scalability with respect to the number of tiles and the number of processors. Scalability results were disappointing due to uneven data distribution.
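As a rough illustration of the image-based task scheduling (tiling) approach described in this abstract, the sketch below partitions an image into tiles, renders each tile in a separate worker, and composites the sub-images into the final image. It is a minimal Python sketch under assumed parameters (image size, tile size, the placeholder render_tile); none of it is taken from the PARZSweep implementation.

```python
# Minimal tiling sketch: split the image plane into tiles, render each tile in
# a worker thread (shared memory, echoing the abstract), then composite the
# sub-images. All names and sizes here are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor
from itertools import product

WIDTH, HEIGHT = 512, 512
TILE = 128  # tile edge length in pixels

def render_tile(bounds):
    """Render one tile; a real renderer would sweep the volume here."""
    x0, y0, x1, y1 = bounds
    # Placeholder: fill the tile with a constant gray value.
    return bounds, [[128] * (x1 - x0) for _ in range(y1 - y0)]

def tile_bounds(width, height, tile):
    for y0, x0 in product(range(0, height, tile), range(0, width, tile)):
        yield (x0, y0, min(x0 + tile, width), min(y0 + tile, height))

def render_image(num_workers=4):
    image = [[0] * WIDTH for _ in range(HEIGHT)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for (x0, y0, x1, y1), pixels in pool.map(render_tile,
                                                 tile_bounds(WIDTH, HEIGHT, TILE)):
            for dy, row in enumerate(pixels):  # composite this sub-image
                image[y0 + dy][x0:x1] = row
    return image

if __name__ == "__main__":
    img = render_image()
    print(len(img), len(img[0]))  # 512 512
```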
3

Memory centric compilers for embedded streaming systems

Milford, Matthew Thomas Ian January 2014 (has links)
No description available.
4

Memory Architecture Template for Fast Block Matching Algorithms on Field Programmable Gate Arrays

Chandrakar, Shant 01 December 2009 (has links)
Fast Block Matching (FBM) algorithms for video compression are well suited for acceleration using parallel data-path architectures on Field Programmable Gate Arrays (FPGAs). However, designing an efficient on-chip memory subsystem that provides the required throughput to such a parallel data-path architecture is a complex problem. This thesis presents a memory architecture template that can be parameterized for a given FBM algorithm, number of parallel Processing Elements (PEs), and block size. The template can be combined with well-known exploration techniques to design efficient on-chip memory subsystems. Memory subsystems are derived for two existing FBM algorithms and are implemented on the Xilinx Virtex-4 family of FPGAs. Results show that the derived memory subsystem in the best case supports up to 27 more parallel PEs than the three existing subsystems and processes integer pixels in a 1080p video sequence at a rate of up to 73 frames per second. Speculative execution of an FBM algorithm with the same number of PEs increases the number of frames processed per second by 49%.
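For context, the sketch below shows, in software, the block-matching computation that such an on-chip memory subsystem must keep fed: evaluating the sum of absolute differences (SAD) for candidate motion vectors inside a search window. The block size, search range, and function names are illustrative assumptions, not values from the thesis.

```python
# Reference block-matching sketch: for each candidate displacement (dx, dy) in
# the search window, compute the SAD between the current block and the shifted
# reference block. Fast algorithms prune candidates but need the same kind of
# block/window accesses, which is what drives the memory template parameters.
def sad(cur, ref, bx, by, dx, dy, n):
    """SAD between an n x n block at (bx, by) in cur and the block shifted by
    (dx, dy) in ref."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total

def full_search(cur, ref, bx, by, n=16, search=8):
    """Exhaustive search over the window; returns the best motion vector."""
    h, w = len(cur), len(cur[0])
    best, best_cost = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            if 0 <= bx + dx and bx + dx + n <= w and 0 <= by + dy and by + dy + n <= h:
                cost = sad(cur, ref, bx, by, dx, dy, n)
                if cost < best_cost:
                    best_cost, best = cost, (dx, dy)
    return best, best_cost

if __name__ == "__main__":
    cur = [[(x + y) % 256 for x in range(64)] for y in range(64)]
    ref = [[(x + y + 1) % 256 for x in range(64)] for y in range(64)]
    print(full_search(cur, ref, bx=16, by=16, n=16, search=4))
```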
5

SAGE: An Automatic Analyzing and Parallelizing System to Improve Performance and Reduce Energy on a New High-Performance SoC Architecture - Processor-in-Memory

Chu, Slo-Li 04 October 2002 (has links)
Continuous improvements in semiconductor fabrication density are enabling new classes of System-on-a-Chip (SoC) architectures that combine extensive processing logic with high-density memory. Such architectures are generally called Processor-in-Memory or Intelligent Memory and can support high-performance computing by reducing the performance gap between the processor and the memory. This architecture combines various processors in a single system. These processors are characterized by their computational and memory-access capabilities, in both performance and energy consumption. The two main problems addressed here are how to improve the performance and reduce the energy consumption of applications running on Processor-in-Memory architectures. A novel strategy must therefore be developed to identify the capabilities of the different processors and to dispatch the most appropriate jobs to them in order to exploit them fully. Accordingly, this study proposes a novel automatic source-to-source parallelizing system, called SAGE, to exploit the advantages of Processor-in-Memory architectures. Unlike conventional iteration-based parallelizing systems, SAGE adopts statement-based analytical approaches. The strategy of the SAGE system, which decomposes the original program into blocks and produces a feasible execution schedule for the host and memory processors, is also investigated. Several techniques, including statement splitting, weight evaluation, performance scheduling, and energy-reduction scheduling, are designed and integrated into the SAGE system to automatically transform Fortran source programs so as to improve the performance of the program or reduce the energy it consumes when executed on a Processor-in-Memory architecture. This thesis provides detailed techniques and discusses experimental results for real benchmarks transformed by the SAGE system and targeted at the Processor-in-Memory architecture.
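As a toy illustration of the scheduling idea described in this abstract, the sketch below assigns program blocks to either the host or the memory processor based on estimated per-block costs. The greedy policy, the Block fields, and the example weights are assumptions for illustration; SAGE's actual weight evaluation and schedulers are more elaborate.

```python
# Toy block-to-processor scheduling sketch: each block carries an estimated
# cost on the host processor and on the memory (PIM) processor, and is placed
# on whichever processor would finish it earlier. Blocks are assumed to be in
# dependence order; all numbers are invented for illustration.
from dataclasses import dataclass

@dataclass
class Block:
    name: str
    host_cost: float    # estimated cycles on the host processor
    memory_cost: float  # estimated cycles on the memory processor

def schedule(blocks):
    host_time = memory_time = 0.0
    placement = {}
    for b in blocks:
        # Greedy: pick the processor with the earliest finish time for this block.
        if host_time + b.host_cost <= memory_time + b.memory_cost:
            host_time += b.host_cost
            placement[b.name] = "host"
        else:
            memory_time += b.memory_cost
            placement[b.name] = "memory"
    return placement, max(host_time, memory_time)  # placement and makespan

if __name__ == "__main__":
    blocks = [Block("init", 40, 25), Block("stencil", 300, 120),
              Block("reduce", 90, 140), Block("io", 60, 200)]
    print(schedule(blocks))
```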
6

Improving Memory Performance for Both High Performance Computing and Embedded/Edge Computing Systems

Adavally, Shashank 12 1900 (has links)
The CPU-memory bottleneck is a widely recognized problem. It is known that the majority of high performance computing (HPC) database systems are configured with large memories and dedicated to processing specific workloads such as weather prediction and molecular dynamics simulations. My research on optimal address mapping improves memory performance by increasing channel- and bank-level parallelism. In another research direction, I proposed and evaluated adaptive page migration techniques that obviate the need for offline analysis of an application to determine page migration strategies. Furthermore, I explored different migration strategies, such as reverse migration and sub-page migration, which I found to be beneficial depending on the application's behavior. Ideally, page migration strategies redirect demand memory traffic to the faster memory to improve memory performance. In my third contribution, I designed and evaluated a memory-side accelerator that assists the main computational core in locating the non-zero elements of sparse matrices, which are typical of scientific and machine learning workloads, on a low-power embedded system configuration. Thus my contributions narrow the speed gap by improving the latency and/or bandwidth between the CPU and memory.
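A highly simplified model of the adaptive page migration idea mentioned in this abstract is sketched below: per-page access counts are tracked online, and at the end of each epoch the hottest pages are promoted to the faster memory, with no offline analysis. The epoch boundary, capacity, and count-based policy are assumptions for illustration only, not the thesis' mechanisms.

```python
# Adaptive page migration sketch: count accesses per page during an epoch,
# then promote the hottest pages into the limited fast memory and demote the
# rest. Everything here (epoching, capacity, policy) is illustrative.
from collections import Counter

class PageMigrator:
    def __init__(self, fast_capacity_pages):
        self.capacity = fast_capacity_pages
        self.fast = set()        # pages currently resident in fast memory
        self.counts = Counter()  # per-epoch access counts

    def access(self, page):
        self.counts[page] += 1
        return "fast" if page in self.fast else "slow"

    def end_epoch(self):
        hottest = {p for p, _ in self.counts.most_common(self.capacity)}
        promoted = hottest - self.fast   # pages migrated in this epoch
        self.fast = hottest
        self.counts.clear()
        return promoted

if __name__ == "__main__":
    migrator = PageMigrator(fast_capacity_pages=2)
    for p in [1, 1, 2, 3, 3, 3, 4]:
        migrator.access(p)
    print(migrator.end_epoch())  # {1, 3} promoted after the first epoch
```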
7

Optimisation des allocations de données pour des applications du Calcul Haute Performance sur une architecture à mémoires hétérogènes / Data Allocation Optimisation for High Performance Computing Application on Heterogeneous Architecture

Brunie, Hugo 28 January 2019 (has links)
High Performance Computing, which brings together all the players responsible for improving the computing performance of scientific applications on supercomputers, aims to achieve exaflopic performance. This race for performance is today characterized by the manufacture of heterogeneous machines in which each component is specialized. Among these components, system memories specialize too, and the trend is towards an architecture composed of several memories with complementary characteristics. The question then arises of how to use these new machines, whose practical performance depends on the placement of the application's data across the different memories; trading code changes against performance is challenging. In this thesis, we have developed a formulation of the data allocation problem on a Heterogeneous Memory Architecture. This formulation shows the benefit of a temporal analysis of the problem, whereas many previous studies were based solely on a spatial approach. From this formulation, we developed an offline profiling tool to approximate the coefficients of the objective function, in order to solve the allocation problem and optimize the placement of data on an architecture composed of two main memories with complementary characteristics.
In order to reduce the amount of code changes needed to execute an application according to the allocation strategy recommended by our toolbox, we have developed a tool that can automatically redirect data allocations from minimal source-code instrumentation. The performance gains obtained on mini-applications representative of the scientific applications written by the community show that intelligent data allocation is necessary to benefit fully from heterogeneous memory resources. On some problem sizes, the gain between a naive data placement strategy and an educated data allocation can reach a ×3.75 speedup.
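As an illustration of the allocation decision described above, the sketch below places profiled allocation sites into a capacity-limited fast memory using a greedy benefit-per-byte heuristic. The heuristic, the site names, and the numbers are stand-ins for the thesis' full formulation and its profiling-derived coefficients.

```python
# Greedy heterogeneous-memory allocation sketch: each allocation site has a
# size and a profiled benefit (estimated time saved if placed in fast memory);
# sites are admitted to the fast memory by benefit per byte until capacity is
# exhausted. Sites, sizes, and benefits are invented for illustration.
def allocate(sites, fast_capacity):
    """sites: list of (name, size_bytes, benefit_seconds)."""
    placement = {}
    used = 0
    for name, size, benefit in sorted(sites, key=lambda s: s[2] / s[1], reverse=True):
        if benefit > 0 and used + size <= fast_capacity:
            placement[name] = "fast"
            used += size
        else:
            placement[name] = "slow"
    return placement

if __name__ == "__main__":
    sites = [("grid", 6 << 30, 4.0), ("halo", 1 << 30, 2.5), ("log", 2 << 30, 0.0)]
    print(allocate(sites, fast_capacity=4 << 30))
    # {'halo': 'fast', 'grid': 'slow', 'log': 'slow'}
```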
8

Software Techniques for Distributed Shared Memory

Radovic, Zoran January 2005 (has links)
In large multiprocessors, the access to shared memory is often nonuniform, and may vary as much as ten times for some distributed shared-memory architectures (DSMs). This dissertation identifies another important nonuniform property of DSM systems: nonuniform communication architecture, NUCA. High-end hardware-coherent machines built from large nodes, or from chip multiprocessors, are typical NUCA systems, since they have a lower penalty for reading recently written data from a neighbor's cache than from a remote cache. This dissertation identifies node affinity as an important property for scalable general-purpose locks. Several software-based hierarchical lock implementations exploiting NUCAs are presented and evaluated. NUCA-aware locks are shown to be almost twice as efficient for contended critical sections compared to traditional lock implementations.

The shared-memory “illusion” provided by some large DSM systems may be implemented using either hardware, software or a combination thereof. A software-based implementation can enable cheap cluster hardware to be used, but typically suffers from poor and unpredictable performance characteristics.

This dissertation advocates a new software-hardware trade-off design point based on a new combination of techniques. The two low-level techniques, fine-grain deterministic coherence and synchronous protocol execution, as well as profile-guided protocol flexibility, are evaluated in isolation as well as in a combined setting using all-software implementations. Finally, a minimum of hardware trap support is suggested to further improve the performance of coherence protocols across cluster nodes. It is shown that all these techniques combined could result in a fairly stable performance on par with hardware-based coherence.
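A structural sketch of a hierarchical, node-affinity lock of the kind evaluated in this dissertation is shown below: a thread first contends on a lock local to its node, and only the per-node winner competes for the global lock, so a contended lock tends to stay within one node. This Python version illustrates only the structure; it does not model NUCA latencies, and the dissertation's NUCA-aware locks use more refined backoff policies.

```python
# Two-level (node-local, then global) lock sketch. Mutual exclusion is
# guaranteed by the global lock; the node lock limits global contention to one
# thread per node, which is the hierarchical idea described above.
import threading

class HierarchicalLock:
    def __init__(self, num_nodes):
        self.global_lock = threading.Lock()
        self.node_locks = [threading.Lock() for _ in range(num_nodes)]

    def acquire(self, node_id):
        self.node_locks[node_id].acquire()  # contend locally first
        self.global_lock.acquire()          # one winner per node goes global

    def release(self, node_id):
        self.global_lock.release()
        self.node_locks[node_id].release()

lock = HierarchicalLock(num_nodes=2)
counter = 0

def worker(node_id):
    global counter
    for _ in range(10_000):
        lock.acquire(node_id)
        counter += 1               # contended critical section
        lock.release(node_id)

if __name__ == "__main__":
    threads = [threading.Thread(target=worker, args=(i % 2,)) for i in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(counter)  # 40000
```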
9

Power-Aware Compilation Techniques For Embedded Systems

Shyam, K 07 1900 (has links)
The demand for devices like Personal Digital Assistants (PDAs), laptops, and smart mobile phones is at an all-time high. As the demand for these devices increases, so does the push to provide sophisticated functionality in them. However, energy consumption has become a major constraint on providing increased functionality for these devices. A majority of the applications meant for these devices are rich in multimedia content. In this thesis, we propose two approaches for compiler-directed energy reduction, one targeting the memory subsystem and another the processor. The first technique is a compiler-directed optimization that reduces the energy consumption of the memory subsystem for an off-chip partitioned memory architecture having multiple memory banks and various low-power operating modes for each of these banks. We propose an efficient layout of the data segment to reduce the number of simultaneously active memory banks, so that the inactive banks can be put into low-power modes to reduce energy. We model this problem as a graph partitioning problem and use well-known heuristics to solve it. We also propose a simple Integer Linear Programming (ILP) formulation for the above problem. Performance results indicate that our approach achieves an energy reduction of 20% compared to the base scheme, and a reduction of 8%-10% over a previously suggested method. Also, our results are close to the optimal results obtained by the ILP method. The second approach proposed in this thesis reduces the dynamic energy consumed by the processor using dynamic voltage and frequency scaling. Earlier works on dynamic voltage scaling focused mainly on performing voltage scaling when the CPU is waiting for the memory subsystem, or concentrated chiefly on loop nests and/or subroutine calls with a sufficient number of dynamic instructions. We concentrate on coarser program regions and, for the first time, use program phase behavior to perform dynamic voltage scaling. We relate the Dynamic Voltage Scaling Problem to the Multiple Choice Knapsack Problem, and use well-known heuristics to solve it efficiently. We also develop a simple Integer Linear Programming (ILP) formulation for this problem. Experimental evaluation on a set of media applications reveals that our heuristic method obtains a 35-40% reduction in energy consumption on average, with negligible performance degradation. Further, the energy consumed by our heuristic solution is within 1% of the optimal solution obtained by the ILP approach.
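As an illustration of how the dynamic voltage scaling choice can be cast as a Multiple Choice Knapsack Problem, the sketch below picks exactly one (frequency, time, energy) operating point per program phase so that total time stays within a budget while energy is minimized. The small dynamic program and the numbers are illustrative assumptions; the thesis itself relies on heuristics and an ILP formulation.

```python
# MCKP view of per-phase DVFS: each phase is a class, each operating point of
# that phase is an item with an integer time cost and an energy cost; pick one
# item per class, keep total time within the budget, minimize total energy.
def mckp_dvfs(phases, time_budget):
    """phases: list of lists of (label, time, energy); times are integers."""
    INF = float("inf")
    best = [0.0] + [INF] * time_budget           # best[t] = min energy at time t
    choice = [[None] * (time_budget + 1) for _ in phases]
    for i, options in enumerate(phases):
        new = [INF] * (time_budget + 1)
        for t in range(time_budget + 1):
            if best[t] == INF:
                continue
            for label, dt, de in options:
                if t + dt <= time_budget and best[t] + de < new[t + dt]:
                    new[t + dt] = best[t] + de
                    choice[i][t + dt] = (label, t)
        best = new
    # Recover the cheapest feasible selection.
    t = min(range(time_budget + 1), key=lambda u: best[u])
    if best[t] == INF:
        return None
    picks, energy = [], best[t]
    for i in reversed(range(len(phases))):
        label, t = choice[i][t]
        picks.append(label)
    return list(reversed(picks)), energy

if __name__ == "__main__":
    phases = [
        [("phase0@1.0GHz", 4, 10.0), ("phase0@0.6GHz", 7, 6.0)],
        [("phase1@1.0GHz", 3, 8.0),  ("phase1@0.6GHz", 5, 5.0)],
    ]
    print(mckp_dvfs(phases, time_budget=10))
    # (['phase0@0.6GHz', 'phase1@1.0GHz'], 14.0)
```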
