  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
241

Parallel Discrete Event Simulation Techniques for Scientific Simulations

Dave, Jagrut Durdant 19 April 2005 (has links)
Exponential growth in computer technology over the past decades, both in individual CPUs and in parallel technologies, has triggered rapid progress in large-scale simulations. Despite these achievements, it has become clear that many conventional state-of-the-art techniques are ill-equipped to tackle problems that inherently involve multiple scales in configuration space. The difficulty is that conventional ("time driven" or "time stepped") techniques update all parts of simulation space (fields, particles) synchronously, i.e., at time intervals assumed to be the same throughout the global computation domain or, at best, varying on a sub-domain basis (as in adaptive mesh refinement algorithms). Using a serial electrostatic model, it was recently shown that discrete event techniques can yield more than two orders of magnitude speedup over the time-stepped approach. This research focuses on extending that technique to parallel architectures using parallel discrete event simulation; previous research on parallel discrete event simulations of scientific phenomena has been limited. This thesis outlines a technique for converting a time-stepped simulation in the scientific domain into an equivalent parallel discrete event model. As a candidate simulation, an electromagnetic hybrid plasma simulation is considered. The experiments and analysis show the performance trade-offs of varying the following factors: the simulation model's characteristics (e.g., lookahead), the application's load balancing, and the accuracy of simulation results. The experiments are performed on a high-performance cluster using a conservative synchronization mechanism. Initial performance results are encouraging, demonstrating very good parallel speedup for large-scale model configurations containing tens of thousands of cells. Overheads for inter-processor communication remain a challenge for smaller computations.
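The core contrast the abstract draws, between time-stepped updating and event-driven updating, can be illustrated with a minimal serial discrete event loop (a sketch only, not the thesis's plasma model; all names are hypothetical):

```python
import heapq

def run_des(events, horizon):
    """Minimal discrete event loop: process events in timestamp order,
    advancing the simulation clock only when something actually happens,
    instead of sweeping every cell at every fixed time step."""
    pq = list(events)          # (timestamp, cell_id) pairs
    heapq.heapify(pq)
    processed = []
    while pq:
        t, cell = heapq.heappop(pq)
        if t > horizon:
            break
        processed.append((t, cell))
        # A real model would update fields/particles for this cell here
        # and possibly schedule new events for neighbouring cells.
    return processed

log = run_des([(0.5, "A"), (0.1, "B"), (0.3, "A")], horizon=1.0)
print(log)  # events come out in timestamp order, not submission order
```

A parallel (conservative) version, as studied in the thesis, would partition cells across processors and use lookahead to decide how far each processor may safely advance.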
242

Performance Analysis of Graph Algorithms using Graphics Processing Units

Weng, Hui-Ze 02 September 2010 (has links)
The GPU has significantly improved computing power in recent years by increasing its number of cores. The GPU's design principle focuses on data-parallel processing, which limits the range of applications that can benefit from this computing power; for example, the processing of highly dependent data cannot be well parallelized and therefore cannot take advantage of the GPU. Most GPU research has focused on improving raw computing power. We instead study cost effectiveness by comparing the GPU with the multi-core CPU. Exploiting the different hardware architectures of the GPU and the multi-core CPU, we implement several typical algorithms on each and present the experimental results. An analysis of cost effectiveness, covering both time and monetary cost, is also discussed in this thesis.
243

Design of multi-channel radio-frequency front-end for 200 MHz parallel magnetic resonance imaging

Liu, Xiaoqun 15 May 2009 (has links)
The increasing demand for improving magnetic resonance imaging (MRI) quality, and especially for reducing imaging time, has been driving up the channel count of parallel magnetic resonance imaging (parallel MRI). When the channel count reaches 64 or even 128, the traditional method of stacking the same number of radio-frequency (RF) receivers, with its very low level of integration, becomes expensive and cumbersome. However, the cost, size, and power consumption of parallel MRI receivers can be dramatically reduced by designing a whole receiver front-end, or even multiple receiver front-ends, on a single chip using CMOS technology and multiplexing the output signals of the receiver front-ends into one channel, so that as much hardware as possible, especially the digitizer, is shared across channels. The main objective of this research is the analysis and design of a fully integrated multi-channel RF receiver and its multiplexing technology. First, different RF receiver architectures and multiplexing methods are analyzed. After comparing their advantages and disadvantages, a receiver front-end architecture best suited to a fully on-chip multi-channel design is proposed and a multiplexing method is selected. Following this architecture, a four-channel receiver front-end was designed and fabricated on a single chip in TSMC 0.18 μm technology, and methods for testing it in the MRI system, using a parallel planar coil array and a phased coil array as target coils, were presented. Each channel of the receiver front-end includes an ultra-low-noise amplifier (LNA), a quadrature image-rejection down-converter, a buffer, and a low-pass filter (LPF) that also acts as a variable gain amplifier (VGA). 
The quadrature image-rejection down-converter consists of a quadrature generator; a passive mixer with a transimpedance amplifier (TIA), which converts the mixer's output current into a voltage while also acting as an LPF; and a polyphase filter after the TIA. The receiver has an overall NF of 0.935 dB, variable gain from about 80 dB to 90 dB, power consumption of 30.8 mW, and a chip area of 6 mm². Next, a prototype 4-channel RF receiver with time-domain multiplexing (TDM) on a single printed circuit board (PCB) was designed and bench-tested. A parallel MRI experiment was then carried out, and images were acquired using this prototype. The test results verify the proposed concepts.
244

Generic implementations of parallel prefix sums and their applications

Huang, Tao 15 May 2009 (has links)
Parallel prefix sums algorithms are among the simplest and most useful building blocks for constructing parallel algorithms. A generic implementation is valuable because of the wide range of applications for this method. This thesis presents a generic C++ implementation of parallel prefix sums. The implementation applies two separate parallel prefix sums algorithms: a recursive doubling (RD) algorithm and a binary-tree-based (BT) algorithm. It shows how common communication patterns can be separated from the concrete parallel prefix sums algorithms, thereby simplifying the work of parallel programming. For each algorithm, the implementation offers two synchronization options: barrier synchronization and point-to-point synchronization. These options lead to different communication patterns in the algorithms, represented by dependency graphs between tasks. The performance results show that point-to-point synchronization outperforms barrier synchronization as the number of processors increases. As applications of parallel prefix sums, parallel radix sort and four parallel tree applications are built on top of the implementation. These applications are themselves fundamental parallel algorithms, and they represent typical uses of parallel prefix sums in numeric computation and graph applications. Building such applications becomes straightforward given this generic implementation of parallel prefix sums.
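The recursive doubling algorithm the abstract names can be sketched in a few lines (a serial simulation of the parallel rounds, not the thesis's C++ implementation):

```python
def prefix_sums_rd(a):
    """Recursive-doubling (Hillis/Steele-style) inclusive prefix sums.
    In round k, element i adds the element 2**(k-1) positions to its
    left; on a parallel machine each round's updates happen in
    parallel, and there are only O(log n) rounds."""
    x = list(a)
    k = 1
    while k < len(x):
        x = [x[i] + (x[i - k] if i >= k else 0) for i in range(len(x))]
        k *= 2
    return x

print(prefix_sums_rd([1, 2, 3, 4, 5]))  # inclusive scan: [1, 3, 6, 10, 15]
```

The binary-tree (work-efficient) variant mentioned alongside it does fewer additions at the cost of twice the rounds; which wins in practice depends on the synchronization cost the thesis measures.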
245

Data Dependence Analysis and Its Applications on Loop Transformation

Yang, Cheng-Ming 18 July 2000 (has links)
Over the past several decades, parallel processing has become an important research subject in computer science. Statistics show that, when executing a numerical program, most of the time is spent in loops. If a parallelizing compiler can apply loop restructuring so that a conventional sequential program exploits the characteristics of a vector or parallel machine, execution efficiency will be greatly improved. In a parallelizing compiler, data dependence analysis is crucial because it provides the information needed for loop restructuring. Data dependence analysis is necessary to determine whether a loop can be vectorized or parallelized: it analyzes whether the same array element or variable will be accessed more than once in a loop (i.e., whether the same memory location is accessed more than once during loop execution). In recent years there has been considerable research on parallelizing compilers, but data dependence analysis remains a bottleneck. Many data dependence tests, such as the Banerjee test, the Omega test, the I test, and the Power test, have been used in the design of parallelizing compilers. In this thesis, we propose a novel exact data dependence test called the Interval Reduced test (IR test). The method reduces the integer bounds of each constraint variable by repeated projection. When the effective region of a variable shrinks to empty, the constraint containing that variable has no integer solution, and the memory accesses under that constraint are therefore independent. The IR test is only suitable for loops whose bounds are rectangular, triangular, or unknown at compile time under some limited conditions. 
To enhance the data dependence analysis capability of the IR test, we propose the Extension-IR test, which extends dependence testing of one-dimensional array references to linear subscripts with variable bounds under any given direction vector; the Extension-IR test runs in polynomial time. When array subscripts are non-linear expressions, or too complex to analyze with existing data dependence testing schemes, we devise a new parallelization algorithm, the non-linear array subscripts test (NLA test), to handle them. Iterations subject to loop-carried dependence are scheduled into different wavefronts, while iterations with no loop-carried dependence are assigned to the same wavefront; based on this wavefront information, the original loop is transformed into parallel code for execution at run time. Loop interchange is an important restructuring technique for supporting vectorization and parallelization. In this thesis, we propose a technique that efficiently determines whether two non-adjacent loops in a perfectly nested loop, or in some imperfectly nested loops, can be interchanged. A method for determining whether two arbitrary levels of a perfectly nested loop containing IF and GOTO statements can be interchanged is also presented.
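To give a feel for what a dependence test decides, here is the classic GCD test, one of the simplest tests in the family the abstract surveys (the thesis's IR test is more precise because it also accounts for loop bounds; this sketch is illustrative only):

```python
from math import gcd

def gcd_test(a1, a0, b1, b0):
    """GCD dependence test for subscript pairs a1*i + a0 (write) and
    b1*j + b0 (read): an integer solution to a1*i - b1*j = b0 - a0
    exists only if gcd(a1, b1) divides b0 - a0.  Returning True means
    dependence is *possible*; loop bounds may still rule it out, which
    is what exact tests (such as the IR test) go on to check."""
    return (b0 - a0) % gcd(a1, b1) == 0

# a[2*i] written, a[2*i + 1] read: gcd(2,2)=2 does not divide 1,
# so the accesses never touch the same element -> independent.
print(gcd_test(2, 0, 2, 1))   # False
# a[4*i] written, a[2*i + 2] read: gcd(4,2)=2 divides 2 -> possible.
print(gcd_test(4, 0, 2, 2))   # True
```

If every reference pair in a loop is proven independent, the loop can be vectorized or parallelized outright; otherwise the compiler falls back to transformations such as the wavefront scheduling described above.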
246

Congestion control schemes for single and parallel TCP flows in high bandwidth-delay product networks

Cho, Soohyun 16 August 2006 (has links)
In this work, we focus on congestion control mechanisms in the Transmission Control Protocol (TCP) for emerging very-high bandwidth-delay product networks, and suggest several congestion control schemes for parallel and single-flow TCP. Recently, several high-speed TCP proposals have been suggested that overcome the limited throughput achievable by single-flow TCP by modifying its congestion control mechanisms. In the meantime, users overcome the throughput limitations in high bandwidth-delay product networks by using multiple parallel TCP flows, without modifying TCP itself. However, the evident lack of fairness between the high-speed TCP proposals (or parallel TCP) and existing standard TCP has increasingly become an issue. In many scenarios where flows require high throughput, such as grid computing or content distribution networks, multiple connections often go to the same or nearby destinations and tend to share long portions of their paths (and bottlenecks). In such cases, benefits can be gained by sharing congestion information. To take advantage of this additional information, we first propose a collaborative congestion control scheme for parallel TCP flows. Although the use of parallel TCP flows is an easy and effective way to achieve reliable high-speed data transfer, parallel TCP flows are inherently unfair with respect to single TCP flows. In this thesis we propose, implement, and evaluate a natural extension for aggregated aggressiveness control in parallel TCP flows. To improve the effectiveness of single TCP flows over high bandwidth-delay product networks without causing fairness problems, we suggest a new TCP congestion control scheme that effectively and fairly utilizes high bandwidth-delay product networks by adaptively controlling the flow's aggressiveness according to network conditions using a competition detection mechanism. We argue that competition detection is more appropriate than congestion detection or bandwidth estimation. 
We further extend the adaptive aggressiveness control mechanism and the competition detection mechanism from single flows to parallel flows. In this way we achieve adaptive aggregated aggressiveness control. Our evaluations show that the resulting implementation is effective and fair. As a result, we show that single or parallel TCP flows in end-hosts can achieve high performance over emerging high bandwidth-delay product networks without requiring special support from networks or modifications to receivers.
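The "aggressiveness" being controlled is the additive-increase/multiplicative-decrease (AIMD) behavior of TCP's congestion window. A minimal sketch of standard AIMD (not the thesis's adaptive scheme; the fixed parameters here model plain standard TCP):

```python
def aimd_trace(events, incr=1.0, decr=0.5, cwnd=1.0):
    """TCP congestion-avoidance sketch: additive increase of `incr`
    segments per ack round, multiplicative decrease by factor `decr`
    on a loss event.  An adaptive scheme like the one proposed in the
    thesis would tune incr/decr per flow based on detected competition,
    raising aggressiveness when the bottleneck is uncontested."""
    trace = [cwnd]
    for ev in events:
        cwnd = cwnd + incr if ev == "ack" else cwnd * decr
        trace.append(cwnd)
    return trace

print(aimd_trace(["ack", "ack", "ack", "loss", "ack"]))
# window grows linearly, halves on loss, then resumes growing
```

With n parallel flows, the aggregate behaves roughly like one flow with n-fold additive increase, which is exactly the unfairness toward single standard TCP flows that the aggregated aggressiveness control aims to bound.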
247

Instrumentation for parallel magnetic resonance imaging

Brown, David Gerald 25 April 2007 (has links)
Parallel magnetic resonance (MR) imaging may be used to increase either the throughput or the speed of the MR imaging experiment. As such, parallel imaging may be accomplished either through a "parallelization" of the MR experiment or through the use of arrays of sensors. In parallelization, multiple MR scanners (or multiple sensors) are used to collect images from different samples simultaneously. This allows for an increase in the throughput, not the inherent speed, of the MR experiment. Parallel imaging with arrays of sensor coils, on the other hand, makes use of the spatial localization properties of the sensors in an imaging array to allow a reduction in the number of phase encodes required to acquire an image. This reduced phase-encoding requirement permits an increase in overall imaging speed by a factor of up to the number of sensors in the imaging array. The focus of this dissertation has been the development of cost-effective instrumentation that would enable advances in the state of the art of parallel MR imaging. First, a low-cost (< $13,000) desktop MR scanner was developed for imaging small samples (2.54 cm fields of view) at low magnetic field strengths (< 0.25 T). The performance of the prototype was verified through bench-top measurements and phantom imaging. The prototype transceiver has demonstrated an SNR (signal-to-noise ratio) comparable to that of a commercial MR system. This scanner could make parallelization of the MR experiment a practical reality, at least in the areas of small-animal research and education. A 64-channel receiver for parallel MR imaging with arrays of sensors was also developed. The receiver prototype was characterized through both bench-top tests and phantom imaging. The parallel receiver is capable of simultaneous reception of up to sixty-four 1 MHz-bandwidth MR signals at imaging frequencies from 63 to 200 MHz, with an SNR performance (on each channel) comparable to that of a single-channel commercial MR receiver. 
The prototype should enable investigation into the speed increases obtainable from imaging with large arrays of sensors and has already been used to develop a new parallel imaging technique known as single echo acquisition (SEA) imaging.
248

Optimistic semantic synchronization

Sreeram, Jaswanth 06 October 2011 (has links)
Within the last decade, multi-core processors have become increasingly commonplace, with the power and performance demands of modern real-world programs accelerating this trend. The rapid advancement and adoption of such architectures mean that there is a serious need for programming models that allow the development of correct parallel programs that execute efficiently on these processors. A principal problem in this regard is efficiently synchronizing concurrent accesses to shared memory. Traditional solutions to this problem are either inefficient but programmable (coarse-grained locks) or efficient but non-composable and very hard to program and verify (fine-grained locks). Optimistic transactional memory systems provide many of the composability and programmability advantages of coarse-grained locks along with good theoretical scaling, but several studies have found that their performance in practice remains quite poor for many programs, primarily because of the high overheads of providing safe optimism. Moreover, current transactional memory models remain rigid: they are not suited to expressing some of the complex thread interactions prevalent in modern parallel programs, and the synchronization they achieve is at the physical or memory level. This thesis advocates the position that the memory synchronization problem for threads should be modeled and solved in terms of synchronization of the underlying program values, which have semantics associated with them. It presents optimistic synchronization techniques that instead address the semantic synchronization requirements of a parallel program. 
These techniques include methods to (1) enable optimistic transactions to recover from expensive sharing conflicts without discarding all the work made possible by the optimism, (2) enable a hybrid pessimistic-optimistic form of concurrency control that lowers overheads, (3) make synchronization value-aware and semantics-aware, and (4) enable finer-grained consistency rules than traditional optimistic TM models allow, thereby avoiding conflicts that do not enforce any semantic property required by the program. In addition to improving the expressibility of specific synchronization idioms, all of these techniques are also effective in improving parallel performance. This thesis formulates the techniques in terms of their purpose and the extensions to the language, the compiler, and the concurrency-control runtime necessary to implement them. It also briefly presents an experimental evaluation of each on a variety of modern parallel workloads. These experiments show that the techniques significantly improve parallel performance and scalability over programs using state-of-the-art optimistic synchronization methods.
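The optimistic read-validate-commit pattern that underlies such TM systems can be sketched as follows (a toy single-threaded illustration with hypothetical names, not the thesis's runtime; real systems track whole read/write sets and run validation atomically):

```python
class VersionedCell:
    """A shared location paired with a version counter, as optimistic
    concurrency control schemes commonly use for validation."""
    def __init__(self, value):
        self.value, self.version = value, 0

def transfer(src, dst, amount):
    """Optimistic transaction: read versions, compute speculatively,
    validate that nothing changed underneath, commit, else retry."""
    while True:
        v_src, v_dst = src.version, dst.version
        new_src, new_dst = src.value - amount, dst.value + amount   # speculative work
        if src.version == v_src and dst.version == v_dst:           # validate read set
            src.value, dst.value = new_src, new_dst                 # commit
            src.version += 1
            dst.version += 1
            return
        # conflict: discard the speculative work and retry

a, b = VersionedCell(100), VersionedCell(0)
transfer(a, b, 30)
print(a.value, b.value)  # 70 30
```

A *semantic* scheme in the spirit of the thesis would validate a program-level property (e.g., "the balance stayed non-negative") rather than raw memory versions, so that benign interleavings need not abort.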
249

Instruction history management for high-performance microprocessors

Bhargava, Ravindra Nath. January 2003 (has links) (PDF)
Thesis (Ph. D.)--University of Texas at Austin, 2003. / Vita. Includes bibliographical references. Available also from UMI Company.
250

On contention management for data accesses in parallel and distributed systems

Yu, Xiao 08 June 2015 (has links)
Data access is an essential part of any program and is especially critical to the performance of parallel computing systems. The objective of this work is to investigate the factors that affect data access parallelism in parallel computing systems, and to design and evaluate methods that improve such parallelism, thereby improving the performance of the corresponding parallel systems. We focus on data access contention and network resource contention in representative parallel and distributed systems: transactional memory systems, geo-replicated transactional systems, and MapReduce systems. These systems represent two widely adopted abstractions for parallel data access: transaction-based and distributed-system-based. In this thesis, we present methods to analyze and mitigate both contention issues. We first study the data contention problem in transactional memory systems. In particular, we present a queueing-based model to evaluate the impact of data contention under various system configurations and workload parameters. We further propose a profiling-based adaptive contention management approach that chooses an optimal policy across different benchmarks and system platforms, and we develop several analytical models to study the design of transactional systems when they are geo-replicated. For the network resource contention issue, we focus on data accesses in distributed systems and study opportunities to improve upon current state-of-the-art MapReduce systems. We extend the system to better support map-task locality for dual-map-input applications. We also study a strategy that groups input blocks within a few racks to balance the locality of map and reduce tasks. Experiments show that both mechanisms significantly reduce off-rack data communication, alleviating contention on the top-of-rack switch and reducing job execution time. 
In this thesis, we show that both data contention and network resource contention are key to the performance of the transactional and distributed data access abstractions, and that our mechanisms for estimating and mitigating these problems are effective. We expect our approaches to provide useful insight for future development and research on similar data access abstractions and distributed systems.
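A back-of-envelope feel for why data contention grows with concurrency (this is a simple birthday-problem bound, not the queueing model developed in the thesis):

```python
def conflict_probability(n_threads, n_objects):
    """If each of n_threads touches one uniformly random shared object
    out of n_objects in a round, the chance that at least two threads
    collide is 1 - prod_{k=0..n-1} (1 - k/n_objects).  Contention rises
    quickly once thread count is non-trivial relative to the data set."""
    p_distinct = 1.0
    for k in range(n_threads):
        p_distinct *= 1 - k / n_objects
    return 1 - p_distinct

# With 8 threads over 100 shared objects, some conflict occurs in
# roughly a quarter of rounds:
print(round(conflict_probability(8, 100), 3))
```

Models like this motivate adaptive contention management: when the estimated conflict rate is high, a pessimistic policy beats optimistic retry, and vice versa.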
