Global ETD Search

1	Design and application of the RPA II O'Gorman, Russell John January 1989 No description available. 621.39 Parallel computer evaluation
2	A general purpose parallel computer Rushton, A. J. January 1987 No description available. 005 Parallel computer networks
3	A computation model for parallelism : self-adapting parallel servers Sahiner, Ali Vahit January 1991 No description available. 005 Parallel computer vision
4	Parallel discrete event simulation : its protocol development and applications Xu, Ming Qiang January 1991 No description available. 005 Parallel computer simulation
5	A VLSI-nMOS hardware implementation of a high speed parallel adder Taesopapong, Somboon. January 1986 Thesis (M.S.)--Ohio University, November, 1986. / Title from PDF t.p.
6	Simulations of Space Station Data Links and Ground Processing Horan, Stephen 11 1900 International Telemetering Conference Proceedings / October 30-November 02, 1989 / Town & Country Hotel & Convention Center, San Diego, California / The telemetry group has begun a new program in conjunction with Goddard Space Flight Center to investigate the possibilities of using parallel processing configurations for the real-time processing of Space Station data. To evaluate the potential configurations, a program based on discrete-event simulation models is being used. This modeling software allows generic configurations to be modeled and the relevant parameters to be modified to see the effects on performance. This paper presents a description of the work we will be undertaking over the next 18 months and the environment to be used in creating the simulation models at NMSU. Telemetry Systems Simulation and Modeling Parallel Computer Applications
7	Data centric and adaptive source changing transactional memory with exit functionality Herath, Herath Mudiyanselage Isuru Prasenajith January 2012 Multi-core computing is becoming ubiquitous due to the scaling limitations of single-core computing. It is inevitable that parallel programming will become the mainstream for such processors. In this paradigm shift, the concept of abstraction should not be compromised. A programming model serves as an abstraction of how programs are executed. Transactional Memory (TM) is a technique proposed to maintain lock-free synchronization. Due to the simplicity of the abstraction it provides, TM can also be used as a way of distributing parallel work and maintaining coherence and consistency. Motivated by this, at a higher level, the thesis makes three contributions, all centred around Hardware Transactional Memory (HTM). As the first contribution, a transaction-only architecture is coupled with a "data centric" approach, to address the scalability issues of the former whilst maintaining its simplicity. This is achieved by grouping together memory locations having similar access patterns and maintaining coherence and consistency according to the group each memory location belongs to. As the second contribution, a novel technique is proposed to reduce the number of false transaction aborts which occur in a signature-based HTM. The idea is to adaptively switch between cache lines and signatures to detect conflicts. That is, when a transaction fits in the L1 cache, cache line information is used to detect conflicts, and signatures are used otherwise. As the third contribution, the thesis makes a case for having an exit functionality in an HTM. The objective of the proposed functionality, TM_EXIT, is to terminate a transaction without restarting or committing. 004
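The adaptive switch described in the second contribution of entry 7 can be illustrated with a minimal sketch. It is not the thesis's hardware design: the line size, L1 capacity, signature width, single-hash signature, and the names `Transaction` and `conflicts_with` are all assumptions made for illustration, and read and write sets are not distinguished.

```python
LINE_BYTES = 64   # assumed cache-line size
L1_LINES = 512    # assumed L1 capacity in lines (32 KiB of 64-byte lines)
SIG_BITS = 64     # assumed signature width, kept small so aliasing is easy to see


def line_of(addr):
    """Map a byte address to its cache-line number."""
    return addr // LINE_BYTES


def signature(lines):
    """Fold a set of line numbers into a fixed-width bit vector (one hash only)."""
    sig = 0
    for line in lines:
        sig |= 1 << (line % SIG_BITS)
    return sig


class Transaction:
    """Hypothetical transaction footprint with two conflict-detection modes."""

    def __init__(self, addresses):
        # A real overflowing transaction would lose exact line information;
        # the set is kept here only to keep the sketch short.
        self.lines = {line_of(a) for a in addresses}
        self.fits_in_l1 = len(self.lines) <= L1_LINES
        self.sig = signature(self.lines)

    def conflicts_with(self, addr):
        line = line_of(addr)
        if self.fits_in_l1:
            # Exact check against cache-line information: no false conflicts.
            return line in self.lines
        # Signature check: cheap and bounded, but aliasing can report a
        # conflict for a line the transaction never touched.
        return bool((self.sig >> (line % SIG_BITS)) & 1)
```

In this sketch a two-line transaction rejects an aliasing address that a pure signature check would flag, while a 600-line transaction falls back to the signature and reports a false conflict for an untouched line.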
8	Efficient, scalable, and fair read-modify-writes Rajaram, Bharghava January 2015 Read-Modify-Write (RMW) operations, or atomics, have widespread application in (a) synchronization, where they are used as building blocks of various synchronization constructs like locks, barriers, and lock-free data structures, (b) supervised memory systems, where every memory operation is effectively an RMW that reads and modifies metadata associated with memory addresses, and (c) profiling, where RMW instructions are used to increment shared counters to convey meaningful statistics about a program. In each of these scenarios, the RMWs pose a bottleneck to performance and scalability. We observed that the cost of RMWs is dependent on two major factors – the memory ordering enforced by the RMW, and contention amongst processors performing RMWs to the same memory address. In the case of both synchronization and supervised memory systems, the RMWs are expensive because of the memory ordering enforced by the atomic RMW operation. Performance overhead due to contention is more prevalent in parallel programs which frequently make use of RMWs to update concurrent data structures in a non-blocking manner. Such programs also suffer from a degradation in fairness amongst concurrent processors. In this thesis, we study the cost of RMWs in the above applications, and present solutions to obtain better performance and scalability from RMW operations. Firstly, this thesis tackles the large overhead of RMW instructions when used for synchronization in the widely used x86 processor architectures, like in Intel, AMD, and Sun processors. The x86 processor architecture implements a variation of the Total-Store-Order (TSO) memory consistency model. RMW instructions in existing TSO architectures (we call them type-1 RMW) are ordered like memory fences, which makes them expensive. The strong fence-like ordering of type-1 RMWs is unnecessary for the memory ordering required by synchronization.
We propose weaker RMW instructions for TSO consistency; we consider two weaker definitions: type-2 and type-3, each causing subtle ordering differences. Type-2 and type-3 RMWs avoid the fence-like ordering of type-1 RMWs, thereby reducing their overhead. Recent work has shown that the new C/C++11 memory consistency model can be realized by generating type-1 RMWs for SC-atomic-writes and/or SC-atomic-reads. We formally prove that this is equally valid for the proposed type-2 RMWs, and partially for type-3 RMWs. We also propose efficient implementations for type-2 (type-3) RMWs. Simulation results show that our implementation reduces the cost of an RMW by up to 58.9% (64.3%), which translates into an overall performance improvement of up to 9.0% (9.2%) for the programs considered. Next, we argue the case for an efficient and correct supervised memory system for the TSO memory consistency model. Supervised memory systems make use of RMW-like supervised memory instructions (SMIs) to atomically update metadata associated with every memory address used by an application program. Such a system is used to help increase reliability, security and accuracy of parallel programs by offering debugging/monitoring features. Most existing supervised memory systems assume a sequentially consistent memory. For weaker consistency models, like TSO, correctness issues (like imprecise exceptions) arise if the ordering requirement of SMIs is neglected. In this thesis, we show that it is sufficient for supervised instructions to only read and process their metadata in order to ensure correctness. We propose SuperCoP, a supervised memory system for relaxed memory models in which SMIs read and process metadata before retirement, while allowing data and metadata writes to retire into the write-buffer. Our experimental results show that SuperCoP performs better than the existing state-of-the-art correct supervision system by 16.8%. 
Finally, we address the issue of contention and contention-based failure of RMWs in non-blocking synchronization mechanisms. We leverage the fact that most existing lock-free programs make use of compare-and-swap (CAS) loops to access the concurrent data structure. We propose DyFCoM (Dynamic Fairness and Contention Management), a holistic scheme which addresses both throughput and fairness under increased contention. DyFCoM monitors the number of successful and failed RMWs in each thread, and uses this information to implement a dynamic backoff scheme to optimize throughput. We also use this information to throttle faster threads and give slower threads a higher chance of performing their lock-free operations, to increase fairness among threads. Our experimental results show that our contention management scheme alone performs better than the existing state-of-the-art CAS contention management scheme by an average of 7.9%. When fairness management is included, our scheme provides an average of 3.4% performance improvement over the constant backoff scheme, while showing increased fairness values in all cases (up to 43.6%). 004
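The CAS-loop pattern and the counter-driven backoff described in entry 8 can be sketched as follows. This is an illustration only and not DyFCoM itself: Python has no hardware compare-and-swap, so the `AtomicInt` class emulates one with a lock, and the halving/doubling delay policy, its bounds, and all names are assumptions.

```python
import threading
import time


class AtomicInt:
    """Stand-in for a machine word with compare-and-swap (emulated with a lock)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the current value equals `expected`."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


class BackoffState:
    """Per-thread success/failure counters driving a delay (hypothetical policy)."""

    def __init__(self, base=1e-6, cap=1e-3):
        self.base, self.cap = base, cap
        self.delay = base
        self.successes = 0
        self.failures = 0

    def record(self, ok):
        if ok:
            self.successes += 1
            self.delay = max(self.base, self.delay / 2)  # contention easing
        else:
            self.failures += 1
            self.delay = min(self.cap, self.delay * 2)   # contention rising


def increment(counter, state):
    """Classic CAS loop: read, compute, attempt the swap, back off on failure."""
    while True:
        old = counter.load()
        ok = counter.compare_and_swap(old, old + 1)
        state.record(ok)
        if ok:
            return
        time.sleep(state.delay)


def run(threads=4, per_thread=200):
    """Run several incrementing threads; return the final count and per-thread stats."""
    counter = AtomicInt()
    states = [BackoffState() for _ in range(threads)]

    def worker(state):
        for _ in range(per_thread):
            increment(counter, state)

    pool = [threading.Thread(target=worker, args=(s,)) for s in states]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter.load(), states
```

The per-thread `successes` and `failures` counters are the kind of information a fairness policy could also consult, for example to slow down threads that succeed far more often than their peers; that part is left out of the sketch.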
9	ADAM: A Decentralized Parallel Computer Architecture Featuring Fast Thread and Data Migration and a Uniform Hardware Abstraction Huang, Andrew "bunnie" 01 June 2002 The furious pace of Moore's Law is driving computer architecture into a realm where the speed of light is the dominant factor in system latencies. The number of clock cycles to span a chip is increasing, while the number of bits that can be accessed within a clock cycle is decreasing. Hence, it is becoming more difficult to hide latency. One alternative solution is to reduce latency by migrating threads and data, but the overhead of existing implementations has so far made migration an unserviceable solution. I present an architecture, implementation, and mechanisms that reduce the overhead of migration to the point where migration is a viable supplement to other latency hiding mechanisms, such as multithreading. The architecture is abstract, and presents programmers with a simple, uniform fine-grained multithreaded parallel programming model with implicit memory management. In other words, the spatial nature and implementation details (such as the number of processors) of a parallel machine are entirely hidden from the programmer. Compiler writers are encouraged to devise programming languages for the machine that guide a programmer to express their ideas in terms of objects, since objects exhibit an inherent physical locality of data and code. The machine implementation can then leverage this locality to automatically distribute data and threads across the physical machine by using a set of high performance migration mechanisms. An implementation of this architecture could migrate a null thread in 66 cycles -- over a factor of 1000 improvement over previous work. Performance also scales well; the time required to move a typical thread is only 4 to 5 times that of a null thread. Data migration performance is similar, and scales linearly with data block size.
Since the performance of the migration mechanism is on par with that of an L2 cache, the implementation simulated in my work has no data caches and relies instead on multithreading and the migration mechanism to hide and reduce access latencies. AI
10	Efficient Simulation of Message-Passing in Distributed-Memory Architectures Demaine, Erik January 1996 In this thesis we propose a distributed-memory parallel-computer simulation system called PUPPET (Performance Under a Pseudo-Parallel EnvironmenT). It allows the evaluation of parallel programs run in a pseudo-parallel system, where a single processor is used to multitask the program's processes, as if they were run on the simulated system. This allows development of applications and teaching of parallel programming without the use of valuable supercomputing resources. We use a standard message-passing language, MPI, so that when desired (e.g., development is complete) the program can be run on a truly parallel system without any changes. There are several features in PUPPET that do not exist in any other simulation system. Support for all deterministic MPI features is available, including collective and non-blocking communication. Multitasking (more processes than processors) can be simulated, allowing the evaluation of load-balancing schemes. PUPPET is very loosely coupled with the program, so that a program can be run once and then evaluated on many simulated systems with multiple process-to-processor mappings. Finally, we propose a new model of direct networks that ignores network traffic, greatly improving simulation speed and often not significantly affecting accuracy. Computer Science distributed-memory parallel-computer simulation system Performance Pseudo-Parallel Environment
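The pseudo-parallel idea in entry 10, one processor multitasking several message-passing processes, can be sketched with Python generators. This is neither PUPPET nor the MPI API: the `("send", dest, payload)` and `("recv",)` request tuples, the round-robin scheduler, and every name below are hypothetical, and no timing model is included.

```python
from collections import deque


def run_pseudo_parallel(programs):
    """Run generator-based 'processes' round-robin on a single thread.

    Each program is called as program(rank, size) and yields requests:
    ("send", dest, payload) queues a message for rank `dest`;
    ("recv",) blocks until a message is available, then evaluates to it.
    Returns each process's return value, indexed by rank.
    """
    size = len(programs)
    procs = [prog(rank, size) for rank, prog in enumerate(programs)]
    mailboxes = [deque() for _ in range(size)]
    pending = [None] * size    # value delivered to the process on its next resume
    waiting = [False] * size   # True while a process is blocked in recv
    results = [None] * size
    alive = set(range(size))

    while alive:
        progressed = False
        for rank in sorted(alive):
            if waiting[rank]:
                if not mailboxes[rank]:
                    continue   # still blocked; give the turn to the next process
                pending[rank] = mailboxes[rank].popleft()
                waiting[rank] = False
            try:
                request = procs[rank].send(pending[rank])
            except StopIteration as stop:
                results[rank] = stop.value
                alive.discard(rank)
                progressed = True
                continue
            pending[rank] = None
            progressed = True
            if request[0] == "send":
                _, dest, payload = request
                mailboxes[dest].append(payload)
            elif request[0] == "recv":
                waiting[rank] = True
        if not progressed:
            raise RuntimeError("deadlock: every live process is blocked in recv")
    return results


def ping(rank, size):
    """Rank 0: send a message to rank 1 and wait for the reply."""
    yield ("send", 1, "ping")
    reply = yield ("recv",)
    return reply


def pong(rank, size):
    """Rank 1: wait for a message and echo it back with a suffix."""
    msg = yield ("recv",)
    yield ("send", 0, msg + "-pong")
    return msg
```

Calling `run_pseudo_parallel([ping, pong])` interleaves the two processes on one thread. Because the programs only describe communication requests, the same programs could in principle be replayed under a different scheduler or process-to-processor mapping.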
