Global ETD Search

1	Self-tuned parallel runtimes: a case of study for OpenMP Durán González, Alejandro 22 October 2008 In recent years parallel computing has become ubiquitous. Driven by the spread of commodity multicore processors, parallel programming is no longer an obscure discipline mastered only by a few. Unfortunately, the number of able parallel programmers has not increased at the same speed, because it is not easy to write parallel code. Parallel programming is inherently different from sequential programming. Programmers must deal with a whole new set of problems: identification of parallelism, work and data distribution, load balancing, synchronization and communication. Parallel programmers have embraced several languages designed to allow the creation of parallel applications. In these languages, the programmer is responsible not only for identifying the parallelism but also for specifying low-level details of how the parallelism needs to be exploited (e.g. scheduling, thread distribution). This is a burden that hampers the productivity of programmers. We demonstrate that it is possible for the runtime component of a parallel environment to adapt itself to the application and the execution environment, thus reducing the burden placed on the programmer. 
For this purpose we study three parameters involved in the parallel exploitation of the OpenMP parallel language: parallel loop scheduling, thread allocation in multiple levels of parallelism, and task granularity control. In all cases, we propose a self-tuned algorithm that first performs on-line profiling of the application and, based on the information gathered, adapts the value of the parameter to the one that maximizes the performance of the application. Our goal is not to develop methods that outperform a hand-tuned application for a specific scenario, as this is probably just as difficult as compiler-generated code outperforming hand-tuned assembly code, but methods that get close to that performance with minimal effort from the programmer. In other words, what we want to achieve with our self-tuned algorithms is to maximize the ratio of performance to effort, so that the entry barrier to parallelism is lower. The evaluation of our algorithms with different applications shows that we achieve that goal. task parallelism self-tuned parallelism openMP 004
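The profile-then-adapt idea in the abstract above can be illustrated with a minimal sketch. It is not the thesis's algorithm: the names `time_once` and `self_tune`, the candidate list, and the "try each candidate once and keep the cheapest" policy are all assumptions made for illustration. A real runtime would profile during the first iterations of the actual loop and could re-tune later.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: time a single call of `work`, in seconds,
// using a monotonic clock so the result is never negative.
template <class Work>
double time_once(Work&& work) {
    auto start = std::chrono::steady_clock::now();
    work();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Hypothetical self-tuner: profile every candidate parameter value once
// via `measure` (which returns a cost, e.g. elapsed seconds) and return
// the candidate with the lowest measured cost.
template <class Measure>
std::size_t self_tune(const std::vector<std::size_t>& candidates,
                      Measure&& measure) {
    assert(!candidates.empty());
    std::size_t best = candidates.front();
    double best_cost = std::numeric_limits<double>::infinity();
    for (std::size_t c : candidates) {
        double cost = measure(c);  // on-line profiling step
        if (cost < best_cost) {    // keep the cheapest value seen so far
            best_cost = cost;
            best = c;
        }
    }
    return best;
}
```

In use, `measure` would typically wrap `time_once` around a short trial run of the loop with the candidate chunk size; passing the measurement in as a callable keeps the selection logic independent of how cost is obtained.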
2	An interprocedural framework for data redistributions in distributed memory machines Krishnamurthy, Sudha January 1996 No description available. Interprocedural Data Redistributions Data Parallelism Functional Parallelism
3	Exploring the limitations of fine-grained parallelism for a superscalar architecture Potter, Richard Daniel January 1998 No description available. 621.39 Instruction-level parallelism
4	Systematic construction and mapping of parallel programs Grant-Duff, Zulena Noemi January 1997 No description available. 005 Parallel architectures; Parallelism
5	A distributed model for dynamic optimisation of networks Azevedo Perdicoulis, Teresa-Paula C. January 1998 No description available. 005 Distributed computing; Parallelism
6	Parallelism in operating system design Hull, M. E. C. January 1980 No description available. 005 Computer program parallelism
7	A hardware scheduler for parallel processing in control Crummey, Thomas Paul January 1998 No description available. 621.39 Parallelism; Microprocessors
8	Lightweight speculative support for aggressive auto-parallelisation tools Powell, Daniel Christopher January 2015 With the recent move to multi-core architectures it has become important to create the means to exploit the performance made available to us by these architectures. Unfortunately parallel programming is often a difficult and time-intensive process, even for expert programmers. Auto-parallelisation tools have aimed to fill the performance gap this has created, but the static analysis commonly employed by such tools is unable to provide the performance improvements required, due to a lack of information at compile time. More recent aggressive parallelisation tools use profiled execution to discover new parallel opportunities, but these tools are inherently unsafe. They require either manual confirmation that their changes are safe, completely ruling out auto-parallelisation, or they rely upon speculative execution such as software thread-level speculation (SW-TLS) to confirm safe execution at runtime. SW-TLS schemes are currently very heavyweight and often fail to provide speedups for a program. Performance gains are dependent upon suitable parallel opportunities, correct selection and configuration, and appropriate execution platforms. Little research has been completed into the automated implementation of SW-TLS programs. This thesis presents an automated, machine-learning based technique to select and configure suitable speculation schemes when appropriate. This is performed by extracting metrics from potential parallel opportunities and using them to determine whether a loop is suitable for speculative execution and, if so, which speculation policy should be used. An extensive evaluation of this technique is presented, verifying that SW-TLS configuration can indeed be automated and provide reliable performance gains. 
This work has shown that on an 8-core machine, speedups of up to 7.75X, with a geometric mean of 1.64X, can be obtained through automatic configuration, providing on average 74% of the speedup obtainable through manual configuration. Beyond automated configuration, this thesis explores the idea that many SW-TLS schemes focus too heavily on recovery after detecting a dependence violation. Doing so often results in worse than sequential performance for many real-world applications. This work therefore hypothesises that many highly likely parallel candidates, discovered through aggressive parallelisation techniques, would benefit from a simple dependence check without the ability to roll back. Dependence violations become extremely expensive in this scenario, but they would be incredibly rare. With a thorough evaluation of the technique this thesis confirms the hypothesis whilst achieving speedups of up to 22.53X, with a geometric mean of 2.16X, on a 32-core machine. In a competitive scheduling scenario, performance can be kept no worse than sequential speed, even when a dependence has been detected. As a means to lower costs further, this thesis explores other platforms to aid in the execution of speculative error checking. It introduces the use of a GPU to which some of the costs are offloaded during execution, confirming that using an auxiliary device is a legitimate means to obtain further speedup. Evaluation demonstrates that doing so can achieve speedups of up to 14.74X, with a geometric mean of 1.99X, on a 12-core hyperthreaded machine. Compared to standard CPU-only techniques this performs slightly slower, with a geometric mean of 0.96X speedup; however, this is likely to improve with upcoming GPU designs. With the knowledge that GPUs can be used to reduce speculation costs, this thesis also investigates their use to speculatively improve execution times. 
Presented is a novel SW-TLS scheme that targets GPU-based execution for use with aggressive auto-parallelisers. This scheme is executed using a competitive scheduling model, ensuring performance is no lower than sequential execution, whilst being able to provide speedups of up to 99X and on average 3.2X over sequential. On average this technique outperformed static analysis alone by a factor of 7, achieved approximately 99% of the speedup obtained from manual parallel implementations, and outperformed the state of the art in GPU SW-TLS by a factor of 1.45. 005.2 parallelism ; speculative execution
9	Root parallelism in Invisalign® treatment Nemes, Jordan 22 April 2016 AIM: To assess root parallelism after Invisalign® treatment. MATERIALS AND METHODS: The sample consisted of 101 patients (mean age: 22.7 years, 29 males, 72 females) treated non-extraction with Invisalign® by one orthodontist. Root angulations were assessed using the 4-point angulation tool (Dolphin imaging©); the long axes of adjacent teeth were traced, yielding a convergence/divergence angle. Root parallelism was deemed acceptable if the root angulation did not converge/diverge by more than 7 degrees. Sites evaluated: between 1st molars and 2nd premolars, 2nd and 1st premolars, lateral and central incisors, and between central incisors in all four quadrants. The average change in mesio-distal root angulation was assessed between pre- and post-treatment panoramic radiographs. RESULTS: Paired t-tests were used to analyze the average change in mesio-distal root angulation. Statistically significant differences were obtained, indicating a reduction in the convergence/divergence angles between teeth #16-15, #15-14, #11-21, #24-25, #25-26, #45-44, #42-41, #41-31, #31-32, and #34-35 (p < 0.05). The average change in root angulation was not affected (p > 0.05) by age (Pearson correlation coefficient), gender, occlusion type (I, II, or III), or elastic use (unpaired, 2-sample t-test at p < 0.05). Intra- and inter-rater reliability for 20% of the studied sample was assessed using the intraclass correlation coefficient (ICC) 3 test. All measured areas except teeth #16-15, #26-25, and #36-35 yielded good ICC reliability scores above 0.7. CONCLUSION: Root parallelism was improved post-Invisalign® treatment in ten of the fourteen areas evaluated. Thus, Invisalign® may be an effective treatment modality in controlling root angulation in non-extraction cases. / May 2016 Invisalign Root Parallelism Orthodontic
10	Verifiable early-reply with C++ Cook, Stephen Wendell 17 September 2007 Concurrent programming can improve performance. However, it comes with two drawbacks. First, concurrent programs can be more difficult to design and reason about than their sequential counterparts. Second, error conditions that do not exist in sequential programs, such as data races and deadlock, can make concurrent programs less reliable. To make concurrent programming simpler and more reliable, while still providing sufficient performance gains, we present a concurrency framework based on an existing concurrency initiation mechanism called "Early-Reply". Early-Reply is based on the idea that some functions can produce final return values long before they terminate. Concurrent execution begins when the return value of a function is returned to the caller, allowing the rest of the work of the function to be done on an auxiliary thread. The simpler sequential programming model can be used by the caller, because the concurrency is initiated and hidden within the function body. Pike and Sridhar recognized Early-Reply as a way for sequential programs to get the benefits of concurrent execution. They also discussed using object-oriented programming to serialize access to data that needs synchronization. Our work expands on their approach and provides an actual C++ implementation of an Early-Reply based framework. Our framework simplifies concurrent programming for both users and implementers by allowing developers to use sequential reasoning, and by providing a minimal framework interface. Concurrent programming is made more reliable by combining concurrency synchronization and initiation into one mechanism within the framework, which isolates where race conditions and deadlock can occur. 
Furthermore, this isolation facilitates the development of a simple set of coding guidelines that can be used by developers (through inspection) or static analysis tools (through verification) to eliminate race conditions and deadlocks. As a motivating example, we parallelize an instructional compiler that processes multiple input source files. For each input file, the parsing and semantic analysis execute on the calling thread, while the code optimization and object code generation execute on an auxiliary thread. Speedups of 1.5 to 1.7 were observed on a dual-processor machine, confirming that sufficient performance gains are possible. Early-Reply concurrency parallelism
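The Early-Reply pattern described in the abstract above can be sketched in standard C++ roughly as follows. This is not the thesis's framework: the `Compiler` class, its `compile` and `finish` members, and the toy "front end" (counting whitespace-separated tokens) and "back end" (building a string) are all hypothetical stand-ins. The point is only that the caller gets its return value once the front-end work is done, while the remaining work continues on an auxiliary thread hidden inside the function.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <mutex>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical illustration of Early-Reply; not the framework from the thesis.
class Compiler {
public:
    // Runs the front end on the calling thread and returns its result (the
    // early reply); the back end continues on an auxiliary thread.
    std::size_t compile(std::string source) {
        std::size_t tokens = count_tokens(source);  // stand-in for parsing
        pending_.emplace_back(std::async(
            std::launch::async, [this, src = std::move(source)] {
                std::string object = "obj:" + src;  // stand-in for codegen
                std::lock_guard<std::mutex> lock(mutex_);  // serialize access
                objects_.push_back(std::move(object));
            }));
        return tokens;
    }

    // Waits for all outstanding back-end work; returns the number of objects.
    std::size_t finish() {
        for (auto& f : pending_) f.get();
        pending_.clear();
        std::lock_guard<std::mutex> lock(mutex_);
        return objects_.size();
    }

private:
    static std::size_t count_tokens(const std::string& text) {
        std::istringstream in(text);
        std::string token;
        std::size_t n = 0;
        while (in >> token) ++n;
        return n;
    }

    std::mutex mutex_;
    std::vector<std::string> objects_;
    // Declared last so it is destroyed first: futures from std::async block
    // in their destructors, so workers finish before objects_ goes away.
    std::vector<std::future<void>> pending_;
};

// Usage: the caller reasons sequentially; concurrency stays inside compile().
inline void demo() {
    Compiler c;
    std::size_t n = c.compile("int x = 1 ;");
    assert(n == 5);           // five whitespace-separated tokens
    assert(c.finish() == 1);  // one object produced by the auxiliary thread
}
```

Keeping both the thread launch and the lock inside `Compiler` mirrors the abstract's argument that combining initiation and synchronization in one place limits where races and deadlock can arise.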
