Global ETD Search

1	Bug Finding Methods for Multithreaded Student Programming Projects Naciri, William Malik 04 August 2017 (has links) The fork-join framework project is one of the more challenging programming assignments in the computer science curriculum at Virginia Tech. Students in Computer Systems must manage a pool of threads to facilitate the shared execution of dynamically created tasks. This project is difficult because students must overcome the challenges of concurrent programming and conform to the project's specific semantic requirements. When working on the project, many students received inconsistent test results and were left confused when debugging. The suggested debugging tool, Helgrind, is a general-purpose thread error detector. It is limited in its ability to help fix bugs because it lacks knowledge of the specific semantic requirements of the fork-join framework. Thus, there is a need for a special-purpose tool tailored for this project. We implemented Willgrind, a debugging tool that checks the behavior of fork-join frameworks implemented by students through dynamic program analysis. Using the Valgrind framework for instrumentation, checking statements are inserted into the code to detect deadlock, ordering violations, and semantic violations at run-time. Additionally, we extended Willgrind with happens-before based checking in WillgrindPlus. This tool checks for ordering violations that do not manifest themselves in a given execution but could in others. In a user study, we provided the tools to 85 students in the Spring 2017 semester and collected over 2,000 submissions. The results indicate that the tools are effective at identifying bugs and useful for fixing bugs. This research makes multithreaded programming easier for students and demonstrates that special-purpose debugging tools can be beneficial in computer science education. / Master of Science / The fork-join framework project is one of the more challenging programming assignments in the computer science curriculum at Virginia Tech. Students in Computer Systems must manage a pool of threads to facilitate the shared execution of dynamically created tasks. This project is difficult because students must overcome the challenges of concurrent programming and conform to the project’s specific semantic requirements. When working on the project, many students received inconsistent test results and were left confused when debugging. The suggested debugging tool, Helgrind, is a general-purpose thread error detector. It is limited in its ability to help fix bugs because it lacks knowledge of the specific semantic requirements of the fork-join framework. Thus, there is a need for a special-purpose tool tailored for this project. We implemented Willgrind, a debugging tool that checks the behavior of fork-join frameworks implemented by students through dynamic program analysis. Using the Valgrind framework for instrumentation, checking statements are inserted into the code to detect deadlock, ordering violations, and semantic violations at run-time. Additionally, we extended Willgrind with happens-before based checking in WillgrindPlus. This tool checks for ordering violations that do not manifest themselves in a given execution but could in others. In a user study, we provided the tools to 85 students in the Spring 2017 semester and collected over 2,000 submissions. The results indicate that the tools are effective at identifying bugs and useful for fixing bugs. This research makes multithreaded programming easier for students and demonstrates that special-purpose debugging tools can be beneficial in computer science education. Binary Instrumentation Multithreaded Programming Debugging
2	Replay Debugger For Multi Threaded Android Applications January 2011 (has links) abstract: Debugging is a hard task. Debugging multi-threaded applications with their inherit non-determinism is all the more difficult. Non-determinism of any kind adds to the difficulty of cyclic debugging. In Android applications which are written in Java, threads and concurrency constructs introduce non-determinism to the program execution. Even with the same input, consecutive runs may not be the same and reproducing the same bug is a challenging task. This makes it difficult to understand and analyze the execution behavior or to understand the source of a failing execution. This thesis introduces a replay mechanism for Android applications written in Java and is based on the Lamport Clock. This tool provides the user with a controlled debugging environment, where the program execution follows the identical partially ordered happened-before dependency among threads, as during the recorded execution. In this, certain significant events like thread creation, synchronization etc. are recorded during run-time. They can later be replayed off-line, as many times as needed to pinpoint and fix an error in the application. It is software based approach and has been implemented by modifying the Dalvik Virtual Machine in the Android platform. The method of replay described in this thesis is independent of the underlying operating system scheduler. / Dissertation/Thesis / M.S. Computer Science 2011 Computer Science Android Debugger Multithreaded Replay
3	A Simple Throttling Concept for Multithreaded Application Servers Stridh, Fredrik January 2009 (has links) Multithreading is today a very common technology to achieve concurrency within software. Today there exists three commonly used threading strategies for multithreaded application servers. These are thread per client, thread per request and thread pool. Earlier studies has shown that the choice of threading strategy is not that important. Our measurements show that the choice of threading architecture becomes more important when the application comes under high load. We will in this study present a throttling concept which can give thread per client almost as good qualities as the thread pool strategy when it comes to performance. No architecture change is required. This concept has been evaluated on three types of hardware, ranging from 1 to 64 CPUs, using 6 alternatives loads and both in C and Java. We have also identified that there is a high correlation between average response times and the length of the run time queue. This can be used to construct a self tuning throttling algorithm that makes the introduction of the throttle concept even simpler, since it does require any configuring. throttling multithreaded server Software Engineering Programvaruteknik
4	The Named-State Register File Nuth, Peter R. 01 August 1993 (has links) This thesis introduces the Named-State Register File, a fine-grain, fully-associative register file. The NSF allows fast context switching between concurrent threads as well as efficient sequential program performance. The NSF holds more live data than conventional register files, and requires less spill and reload traffic to switch between contexts. This thesis demonstrates an implementation of the Named-State Register File and estimates the access time and chip area required for different organizations. Architectural simulations of large sequential and parallel applications show that the NSF can reduce execution time by 9% to 17% compared to alternative register files. multithreaded context switch register fully-associative sthread parallel processor
5	A Hardware and Software Integrated Approach for Adaptive Thread Management in Multicore Multithreaded Microprocessors Weng, Lichen 23 April 2012 (has links) The Multicore Multithreaded Microprocessor maximizes parallelism on a chip for the optimal system performance, such that its popularity is growing rapidly in high-performance computing. It increases the complexity in resource distribution on a chip by leading it to two directions: isolation and unification. On one hand, multiple cores are implemented to deliver the computation and memory accessing resources to more than one thread at the same time. Nevertheless, it limits the threads’ access to resources in different cores, even if extensively demanded. On the other hand, simultaneous multithreaded architectures unify the domestic execu- tion resources together for concurrently running threads. In such an environment, threads are greatly affected by the inter-thread interference. Moreover, the impacts of the complicated distribution are enlarged by variation in workload behaviors. As a result, the microprocessor requires an adaptive management scheme to schedule threads throughout different cores and coordinate them within cores. In this study, an adaptive thread management scheme was proposed, integrating both hardware and software approaches. The instruction fetch policy at the hardware level took the responsibility by prioritizing domestic threads, while the Operating System scheduler at the software level was used to pair threads dynami- vi cally to multiple cores. The tie between them was the proposed online linear model, which was dynamically constructed for every thread based on data misses by the regression algorithm. Consequently, the hardware part of the proposed scheme proactively granted higher priority to the threads with less predicted long-latency loads, expecting they would better utilize the shared execution resources. Mean- while, the software part was invoked by such a model upon significant changes in the execution phases and paired threads with different demands to the same core to minimize competition on the chip. The proposed scheme was compared to its peer designs and overall 43% speedup was achieved by the integrated approach over the combination of two baseline policies in hardware and software, respectively. The overhead was examined carefully regarding power, area, storage and latency, as well as the relationship between the overhead and the performance. Computer architecture High-performance computing Simultaneous Multithreaded Adaptive resource management Multicore multithreaded microprocessor Ordinary least square regression
6	Parallelizing Set Similarity Joins Fier, Fabian 24 January 2022 (has links) Eine der größten Herausforderungen in Data Science ist heutzutage, Daten miteinander in Beziehung zu setzen und ähnliche Daten zu finden. Hierzu kann der aus relationalen Datenbanken bekannte Join-Operator eingesetzt werden. Das Konzept der Ähnlichkeit wird häufig durch mengenbasierte Ähnlichkeitsfunktionen gemessen. Um solche Funktionen als Join-Prädikat nutzen zu können, setzt diese Arbeit voraus, dass Records aus Mengen von Tokens bestehen. Die Arbeit fokussiert sich auf den mengenbasierten Ähnlichkeitsjoin, Set Similarity Join (SSJ). Die Datenmenge, die es heute zu verarbeiten gilt, ist groß und wächst weiter. Der SSJ hingegen ist eine rechenintensive Operation. Um ihn auf großen Daten ausführen zu können, sind neue Ansätze notwendig. Diese Arbeit fokussiert sich auf das Mittel der Parallelisierung. Sie leistet folgende drei Beiträge auf dem Gebiet der SSJs. Erstens beschreibt und untersucht die Arbeit den aktuellen Stand paralleler SSJ-Ansätze. Diese Arbeit vergleicht zehn Map-Reduce-basierte Ansätze aus der Literatur sowohl analytisch als auch experimentell. Der größte Schwachpunkt aller Ansätze ist überraschenderweise eine geringe Skalierbarkeit aufgrund zu hoher Datenreplikation und/ oder ungleich verteilter Daten. Keiner der Ansätze kann den SSJ auf großen Daten berechnen. Zweitens macht die Arbeit die verfügbare hohe CPU-Parallelität moderner Rechner für den SSJ nutzbar. Sie stellt einen neuen daten-parallelen multi-threaded SSJ-Ansatz vor. Der vorgestellte Ansatz ermöglicht erhebliche Laufzeit-Beschleunigungen gegenüber der Ausführung auf einem Thread. Drittens stellt die Arbeit einen neuen hoch skalierbaren verteilten SSJ-Ansatz vor. Mit einer kostenbasierten Heuristik und einem daten-unabhängigen Skalierungsmechanismus vermeidet er Daten-Replikation und wiederholte Berechnungen. Der Ansatz beschleunigt die Join-Ausführung signifikant und ermöglicht die Ausführung auf erheblich größeren Datenmengen als bisher betrachtete parallele Ansätze. / One of today's major challenges in data science is to compare and relate data of similar nature. Using the join operation known from relational databases could help solving this problem. Given a collection of records, the join operation finds all pairs of records, which fulfill a user-chosen predicate. Real-world problems could require complex predicates, such as similarity. A common way to measure similarity are set similarity functions. In order to use set similarity functions as predicates, we assume records to be represented by sets of tokens. In this thesis, we focus on the set similarity join (SSJ) operation. The amount of data to be processed today is typically large and grows continually. On the other hand, the SSJ is a compute-intensive operation. To cope with the increasing size of input data, additional means are needed to develop scalable implementations for SSJ. In this thesis, we focus on parallelization. We make the following three major contributions to SSJ. First, we elaborate on the state-of-the-art in parallelizing SSJ. We compare ten MapReduce-based approaches from the literature analytically and experimentally. Their main limit is surprisingly a low scalability due to too high and/or skewed data replication. None of the approaches could compute the join on large datasets. Second, we leverage the abundant CPU parallelism of modern commodity hardware, which has not yet been considered to scale SSJ. We propose a novel data-parallel multi-threaded SSJ. Our approach provides significant speedups compared to single-threaded executions. Third, we propose a novel highly scalable distributed SSJ approach. With a cost-based heuristic and a data-independent scaling mechanism we avoid data replication and recomputation. A heuristic assigns similar shares of compute costs to each node. Our approach significantly scales up the join execution and processes much larger datasets than all parallel approaches designed and implemented so far. Join Parallelisierung Verteilt Multithreaded Join Parallelization Distributed Multithreaded 004 Informatik ST 530 ST 134 ddc:005 ddc:004
7	Overlay Architectures for FPGA-Based Software Packet Processing Martin, Labrecque 16 June 2011 (has links) Packet processing is the enabling technology of networked information systems such as the Internet and is usually performed with fixed-function custom-made ASIC chips. As communication protocols evolve rapidly, there is increasing interest in adapting features of the processing over time and, since software is the preferred way of expressing complex computation, we are interested in finding a platform to execute packet processing software with the best possible throughput. Because FPGAs are widely used in network equipment and they can implement processors, we are motivated to investigate executing software directly on the FPGAs. Off-the-shelf soft processors on FPGA fabric are currently geared towards performing embedded sequential tasks and, in contrast, network processing is most often inherently parallel between packet flows, if not between each individual packet. Our goal is to allow multiple threads of execution in an FPGA to reach a higher aggregate throughput than commercially available shared-memory soft multi-processors via improvements to the underlying soft processor architecture. We study a number of processor pipeline organizations to identify which ones can scale to a larger number of execution threads and find that tuning multithreaded pipelines can provide compact cores with high throughput. We then perform a design space exploration of multicore soft systems, compare single-threaded and multithreaded designs to identify scalability limits and develop processor architectures allowing threads to execute with as little architectural stalls as possible: in particular with instruction replay and static hazard detection mechanisms. To further reduce the wait times, we allow threads to speculatively execute by leveraging transactional memory. Our multithreaded multiprocessor along with our compilation and simulation framework makes the FPGA easy to use for an average programmer who can write an application as a single thread of computation with coarse-grained synchronization around shared data structures. Comparing with multithreaded processors using lock-based synchronization, we measure up to 57\% additional throughput with the use of transactional-memory-based synchronization. Given our applications, gigabit interfaces and 125 MHz system clock rate, our results suggest that soft processors can process packets in software at high throughput and low latency, while capitalizing on the FPGAs already available in network equipment. computer architecture soft processors FPGA packet processing network processor multithreaded transactional memory 0544
8	Overlay Architectures for FPGA-Based Software Packet Processing Martin, Labrecque 16 June 2011 (has links) Packet processing is the enabling technology of networked information systems such as the Internet and is usually performed with fixed-function custom-made ASIC chips. As communication protocols evolve rapidly, there is increasing interest in adapting features of the processing over time and, since software is the preferred way of expressing complex computation, we are interested in finding a platform to execute packet processing software with the best possible throughput. Because FPGAs are widely used in network equipment and they can implement processors, we are motivated to investigate executing software directly on the FPGAs. Off-the-shelf soft processors on FPGA fabric are currently geared towards performing embedded sequential tasks and, in contrast, network processing is most often inherently parallel between packet flows, if not between each individual packet. Our goal is to allow multiple threads of execution in an FPGA to reach a higher aggregate throughput than commercially available shared-memory soft multi-processors via improvements to the underlying soft processor architecture. We study a number of processor pipeline organizations to identify which ones can scale to a larger number of execution threads and find that tuning multithreaded pipelines can provide compact cores with high throughput. We then perform a design space exploration of multicore soft systems, compare single-threaded and multithreaded designs to identify scalability limits and develop processor architectures allowing threads to execute with as little architectural stalls as possible: in particular with instruction replay and static hazard detection mechanisms. To further reduce the wait times, we allow threads to speculatively execute by leveraging transactional memory. Our multithreaded multiprocessor along with our compilation and simulation framework makes the FPGA easy to use for an average programmer who can write an application as a single thread of computation with coarse-grained synchronization around shared data structures. Comparing with multithreaded processors using lock-based synchronization, we measure up to 57\% additional throughput with the use of transactional-memory-based synchronization. Given our applications, gigabit interfaces and 125 MHz system clock rate, our results suggest that soft processors can process packets in software at high throughput and low latency, while capitalizing on the FPGAs already available in network equipment. computer architecture soft processors FPGA packet processing network processor multithreaded transactional memory 0544
9	Bottleneck identification and acceleration in multithreaded applications Joao, José Alberto 09 February 2015 (has links) When parallel applications do not fully utilize the cores that are available to them they are missing the opportunity to have better performance. Sometimes threads have to wait for other threads. I call the code segments that make other threads wait bottlenecks. Examples of these bottlenecks include contended critical sections, threads arriving late to barriers and the slowest stage of a pipelined program. Other times all threads are running but some of them, which I call lagging threads, are making less progress, setting the stage to become bottlenecks. My thesis proposes identifying the code segments that are more critical for performance and efficiently accelerating them using faster cores, by either migrating execution to large cores of an Asymmetric Chip Multi-Processor (ACMP) or executing locally on DVFS-accelerated cores. The key contribution of this dissertation is a Utility of Acceleration metric that combines a measure of the acceleration for each code segment with a measure of its criticality. This metric enables meaningful comparisons to decide which bottlenecks or lagging threads to accelerate with each of the available acceleration mechanisms. My evaluation shows significant performance improvement for single multithreaded applications and sets of multiple single- and multi-threaded applications, and also reduction in energy-delay product due to the efficient utilization of the available acceleration mechanisms. / text Multithreaded applications Bottlenecks Critical sections Barriers Multicore Asymmetric CMP Heterogeneous CMP DVFS
10	Study and design of a manycore architecture with multithreaded processors for dynamic embedded applications Bechara, Charly 08 December 2011 (has links) (PDF) Embedded systems are getting more complex and require more intensive processing capabilities. They must be able to adapt to the rapid evolution of the high-end embedded applications that are characterized by their high computation-intensive workloads (order of TOPS: Tera Operations Per Second), and their high level of parallelism. Moreover, since the dynamism of the applications is becoming more significant, powerful computing solutions should be designed accordingly. By exploiting efficiently the dynamism, the load will be balanced between the computing resources, which will improve greatly the overall performance. To tackle the challenges of these future high-end massively-parallel dynamic embedded applications, we have designed the AHDAM architecture, which stands for "Asymmetric Homogeneous with Dynamic Allocator Manycore architecture". Its architecture permits to process applications with large data sets by efficiently hiding the processors' stall time using multithreaded processors. Besides, it exploits the parallelism of the applications at multiple levels so that they would be accelerated efficiently on dedicated resources, hence improving efficiently the overall performance. AHDAM architecture tackles the dynamism of these applications by dynamically balancing the load between its computing resources using a central controller to increase their utilization rate.The AHDAM architecture has been evaluated using a relevant embedded application from the telecommunication domain called "spectrum radio-sensing". With 136 cores running at 500 MHz, AHDAM architecture reaches a peak performance of 196 GOPS and meets the computation requirements of the application. [INFO:INFO_OH] Computer Science/Other [INFO:INFO_OH] Informatique/Autre Multicore MPSoC Multithreaded processors Embedded systems Dynamic applications Simulation

Search results