1

Use of Parallel Computing Techniques to Search for Perfectly Orthogonal Complementary Codes

Chang, Chin-hsiang 02 September 2005 (has links)
Because using perfectly orthogonal codes in a CDMA system eliminates both multipath interference and multiple-access interference, this thesis focuses on such codes. We first present an algorithm for searching for a flock of perfectly orthogonal codes, i.e., a set of codes with ideal auto-correlation and ideal cross-correlation with respect to one another. We then compare the perfectly orthogonal codes we find with existing complementary codes, including complete complementary codes (CCC), super complementary codes (SCC), and two-dimensional orthogonal variable spreading codes (2D-OVSF codes). Because the search time becomes long at high processing gain (PG), we build a PC cluster running Linux or Windows and run the search as a parallel program to speed up the generation of perfectly orthogonal codes.
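The correlation property the abstract refers to can be stated as a small executable check. The sketch below is a generic illustration, not the thesis's search algorithm or its parallel implementation: it verifies that a candidate set of code flocks has ideal complementary auto- and cross-correlation sums. A brute-force search would enumerate candidate flocks and keep the sets that pass this test.

```python
def acorr_sum(flock, shift):
    """Sum of aperiodic auto-correlations of all element codes in a flock at one shift."""
    return sum(sum(c[i] * c[i + shift] for i in range(len(c) - shift)) for c in flock)

def xcorr_sum(flock_a, flock_b, shift):
    """Sum of aperiodic cross-correlations between corresponding element codes of two flocks."""
    return sum(sum(a[i] * b[i + shift] for i in range(len(a) - shift))
               for a, b in zip(flock_a, flock_b))

def is_perfectly_orthogonal(flocks):
    """True if every flock has zero auto-correlation sums at all nonzero shifts and
    every pair of distinct flocks has zero cross-correlation sums at all shifts
    (both shift directions are covered by checking (i, j) and (j, i))."""
    length = len(flocks[0][0])
    for f in flocks:
        if any(acorr_sum(f, s) != 0 for s in range(1, length)):
            return False
    for i, fi in enumerate(flocks):
        for j, fj in enumerate(flocks):
            if i != j and any(xcorr_sum(fi, fj, s) != 0 for s in range(length)):
                return False
    return True

# A classical complete complementary code: two flocks of two length-2 codes each.
flock_1 = [[1, 1], [1, -1]]
flock_2 = [[1, -1], [1, 1]]
print(is_perfectly_orthogonal([flock_1, flock_2]))   # True
```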
2

New MP-SoC profiling tools based on data mining techniques

Lagraa, Sofiane 13 June 2014 (has links)
The miniaturization of electronic components has led to complex electronic systems that integrate multiple processors on a single chip, so-called Multi-Processor Systems-on-Chip (MPSoC). Most recent embedded systems are based on massively parallel MPSoC architectures, hence the need to develop embedded parallel applications. Designing such applications is increasingly challenging: it amounts to parallel programming for non-trivial heterogeneous multiprocessors with diverse communication architectures and design constraints such as hardware cost, power, and timeliness. A challenge faced by many developers is profiling embedded parallel applications so that they can scale over more and more cores. This is especially critical for MPSoC-based embedded systems, where ever more demanding applications have to run smoothly on numerous cores, each with a modest power budget. Moreover, application performance does not necessarily improve as more cores are added; it can be limited by multiple bottlenecks, notably contention for shared resources such as caches and memory. It is time consuming for a developer to pinpoint in the source code the bottlenecks that degrade performance. To overcome these issues, this thesis proposes three fully automatic methods that detect the source-code instructions responsible for performance loss due to contention and poor scalability of processors on a chip. The methods are based on data mining techniques that exploit gigabytes of low-level execution traces produced by MPSoC platforms. Our profiling approaches automatically quantify and localize the bottlenecks in the source code in order to help developers optimize their embedded parallel applications. Experiments on several embedded parallel applications show the accuracy of the proposed techniques, which quantify and pinpoint the hotspots in the source code.
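To make the idea of trace-based bottleneck localization concrete, here is a minimal sketch that maps low-level events back to source lines and ranks the lines by cycles lost to contention. The event format is a hypothetical assumption for the example; it is not the frequent-pattern-mining machinery of the thesis, which works on raw MPSoC traces decoded via debug information.

```python
from collections import defaultdict

def rank_contention_hotspots(trace_events, top_n=10):
    """Aggregate low-level trace events by source-code location and rank the
    locations by time lost to contention on shared resources.

    Each event is assumed to look like:
        {"src": "fft.c:127", "cycles": 54, "stalled": 38}
    where "stalled" counts cycles spent waiting on a contended cache/memory access.
    """
    totals = defaultdict(lambda: {"cycles": 0, "stalled": 0, "events": 0})
    for ev in trace_events:
        loc = totals[ev["src"]]
        loc["cycles"] += ev["cycles"]
        loc["stalled"] += ev["stalled"]
        loc["events"] += 1
    ranked = sorted(totals.items(), key=lambda kv: kv[1]["stalled"], reverse=True)
    return ranked[:top_n]

# Toy usage with three decoded events.
events = [
    {"src": "fft.c:127", "cycles": 54, "stalled": 38},
    {"src": "fft.c:127", "cycles": 60, "stalled": 41},
    {"src": "fft.c:88",  "cycles": 12, "stalled": 1},
]
for src, stats in rank_contention_hotspots(events):
    print(src, stats)
```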
3

Extracting Data-Level Parallelism from Sequential Programs for SIMD Execution

Baumstark, Lewis Benton, Jr. 29 October 2004 (has links)
The goal of this research is to retarget multimedia programs written in sequential languages (e.g., C) to architectures with data-parallel execution capabilities. Image processing algorithms often have a high potential for data-level parallelism, but the artifacts imposed by the sequential programming language (e.g., loops, pointer variables) can obscure the parallelism and prohibit generation of efficient parallel code. This research presents a program representation and recognition approach for generating a data-parallel program specification from sequential source code and retargeting it to data-parallel execution mechanisms. The representation is based on an extension of the multi-dimensional synchronous dataflow model of computation. A partial recognition approach identifies and transforms only those program elements that hinder parallelization while leaving other computational elements intact. This permits flexibility in the types of programs that can be retargeted, while avoiding the complexity of complete program recognition. This representation and recognition process is implemented in the PARRET system, which is used to extract the high-level specification of a set of image-processing programs. From this specification, code is generated for Intel's SSE2 instruction set and for the SIMPil processor. The results demonstrate that PARRET can exploit, given sufficient parallel resources, the maximum available parallelism in the retargeted applications. Similarly, the results show PARRET can also exploit parallelism on architectures with hardware-limited parallel resources. It is desirable to estimate potential parallelism before undertaking the expensive process of reverse engineering and retargeting. The goal is to narrow down the search space to a select set of loops which have a high likelihood of being data-parallel. This work also presents a hybrid static/dynamic approach, called DLPEST, for estimating the data-level parallelism in sequential program loops. We demonstrate the correctness of DLPEST's estimates, show that estimates for programs of 25 to 5000 lines of code can be performed in under 10 minutes, and show that estimation time scales sub-linearly with input program size.
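The notion of estimating data-level parallelism in a loop can be illustrated with a crude dynamic check: record the addresses each iteration reads and writes, then test for loop-carried dependences. This is only a hand-rolled approximation of the idea, not the DLPEST tool or its hybrid static/dynamic analysis.

```python
def loop_is_data_parallel(iteration_accesses):
    """Rough dynamic test for data-level parallelism in a loop.

    iteration_accesses: one (reads, writes) pair of address sets per iteration,
    e.g. gathered from an instrumented run (a hypothetical input format).
    Returns False if any address written in one iteration is read or written by
    a different iteration (a loop-carried dependence), True otherwise.
    """
    for i, (_, writes_i) in enumerate(iteration_accesses):
        for j, (reads_j, writes_j) in enumerate(iteration_accesses):
            if i != j and writes_i & (reads_j | writes_j):
                return False
    return True

# Example: out[i] = a[i] + b[i] is data-parallel; a[i] = a[i-1] + 1 is not.
vectorizable = [({"a0", "b0"}, {"o0"}), ({"a1", "b1"}, {"o1"})]
recurrence   = [({"a0"}, {"a1"}), ({"a1"}, {"a2"})]
print(loop_is_data_parallel(vectorizable))  # True
print(loop_is_data_parallel(recurrence))    # False
```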
4

Modeling performance of serial and parallel sections of multi-threaded programs in many-core era

Khizakanchery Natarajan, Surya Narayanan 01 June 2015 (has links)
This thesis was carried out in the context of the ERC-funded Defying Amdahl's Law (DAL) project, which aims to explore the micro-architectural techniques that will enable high performance on future many-core processors. The project envisions that, despite large future investments in developing parallel applications and porting them to parallel architectures, most applications will still exhibit a significant amount of sequential code, so improving the performance of serial sections remains essential. The research in this thesis primarily focuses on studying the differences between the parallel and serial sections of existing multi-threaded (MT) programs, and on exploring the design space of future many-core processors with respect to the core requirements of serial and parallel sections, with the area-performance tradeoff as a primary goal.
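The area-performance tension between serial and parallel sections can be made concrete with the well-known Hill-and-Marty extension of Amdahl's Law. The sketch below is a generic back-of-the-envelope model assuming Pollack's rule (per-core performance proportional to the square root of core area), not the performance models developed in the thesis.

```python
import math

def perf(r):
    """Assumed sequential performance of a core built from r base-core
    equivalents (BCEs); sqrt(r) follows Pollack's rule."""
    return math.sqrt(r)

def speedup_symmetric(f, n, r):
    """Speedup of a program with parallel fraction f on a chip of n BCEs
    split into identical cores of r BCEs each."""
    return 1.0 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

def speedup_asymmetric(f, n, r):
    """One large core of r BCEs runs the serial section; it plus n - r
    single-BCE cores run the parallel section."""
    return 1.0 / ((1 - f) / perf(r) + f / (perf(r) + n - r))

n = 256  # fixed area budget in BCEs
for f in (0.90, 0.99):
    print(f,
          round(speedup_symmetric(f, n, 16), 1),    # 16 cores of 16 BCEs each
          round(speedup_asymmetric(f, n, 64), 1))   # one 64-BCE core plus 192 small cores
```

Even this toy model shows the pattern the abstract alludes to: the larger the serial fraction, the more a design benefits from spending area on one fast core for serial sections rather than on many small cores.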
5

Collecting and representing parallel programs with high performance instrumentation

Railing, Brian Paul 07 January 2016 (has links)
Computer architecture faces looming challenges: finding program parallelism, process technology limits, and a limited power budget. To navigate these challenges, a deeper understanding of parallel programs is required. I will discuss the task graph representation and how it enables programmers and compiler optimizations to understand and exploit dynamic aspects of a program. I will present Contech, a high-performance framework for generating dynamic task graphs from arbitrary parallel programs. The Contech framework supports a variety of languages and parallelization libraries, and has been tested on both x86 and ARM. I will demonstrate how this framework encompasses a diversity of program analyses, in particular by modeling a dynamically reconfigurable, heterogeneous multi-core processor.
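As a rough illustration of what a dynamic task graph looks like, the sketch below builds a toy graph from thread create/join events. The field names and task-id scheme are assumptions made for the example; Contech's actual representation is richer, recording basic blocks, memory accesses, and synchronization per task.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    tid: tuple                                   # (context id, sequence number) -- illustrative id scheme
    actions: list = field(default_factory=list)  # basic blocks / memory ops recorded for this task
    successors: list = field(default_factory=list)

class TaskGraph:
    """Minimal dynamic task-graph container (illustrative only)."""
    def __init__(self):
        self.tasks = {}

    def get(self, tid):
        return self.tasks.setdefault(tid, Task(tid))

    def add_action(self, tid, action):
        self.get(tid).actions.append(action)

    def add_edge(self, src, dst, kind):
        # kind names the event that ended task `src` and started task `dst`,
        # e.g. "create", "join", or "sync".
        self.get(src).successors.append((self.get(dst).tid, kind))

# Building a tiny graph: context 0 creates context 1 and later joins it.
g = TaskGraph()
g.add_action((0, 0), "bb:main_loop")
g.add_edge((0, 0), (1, 0), "create")   # pthread_create observed
g.add_edge((0, 0), (0, 1), "create")   # the parent continues as a new task
g.add_edge((1, 0), (0, 1), "join")     # pthread_join observed
print({tid: t.successors for tid, t in g.tasks.items()})
```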
6

Checkpointing Algorithms for Parallel Computers

Kalaiselvi, S 02 1900 (has links)
Checkpointing is a technique widely used in parallel/distributed computers for rollback error recovery. Checkpointing is defined as the coordinated saving of process state information at specified time instances. Checkpoints help in restoring the computation from the latest saved state in case of failure. In addition to fault recovery, checkpointing has applications in fault detection, distributed debugging, and process migration. Checkpointing in uniprocessor systems is easy because there is a single clock and events occur with respect to this clock; there is a clear demarcation between events that happen before a checkpoint and events that happen after it. In a parallel computer, a large number of processors cooperate to solve a single problem. Since there may be multiple streams of execution, checkpoints have to be introduced along all these streams simultaneously. The absence of a global clock necessitates explicit coordination to obtain a consistent global state. Events occurring in a distributed system can be partially ordered using Lamport's happens-before relation ->, which identifies dependent and concurrent events. It is defined as follows:
• If two events a and b happen in the same process, and a happens before b, then a -> b.
• If a is the sending event of a message and b is the receiving event of the same message, then a -> b.
• If neither a -> b nor b -> a, then a and b are said to be concurrent.
A consistent global state may consist of concurrent checkpoints. In the first chapter of the thesis we discuss issues regarding the ordering of events in a parallel computer, the need for coordination among checkpoints, and other aspects related to checkpointing. Checkpointing locations can be identified either statically or dynamically. The static approach assumes that a representation of the program to be checkpointed is available, with information that enables a programmer to specify the places where checkpoints are to be taken. The dynamic approach identifies the checkpointing locations at run time. In this thesis, we propose algorithms for both static and dynamic checkpointing. The main contributions of this thesis are as follows:
1. Parallel computers being built now have faster communication and hence more efficient clock synchronisation than those built a few years ago. Based on efficient clock synchronisation protocols, the clock drift in current machines can be kept within a few microseconds. We propose a dynamic checkpointing algorithm for parallel computers assuming bounded clock drift.
2. The shared memory paradigm is convenient for programming, while the message passing paradigm is easy to scale. Distributed Shared Memory (DSM) systems combine the advantages of both paradigms and can easily be realised on top of a network of workstations. IEEE has recently proposed an interconnect standard called Scalable Coherent Interface (SCI) to configure computers as a Distributed Shared Memory system. A periodic dynamic checkpointing algorithm is proposed in the thesis for a DSM system that uses the SCI standard.
3. When information about a parallel program is available, this knowledge can be used to perform efficient checkpointing. A static checkpointing approach based on task graphs is proposed for parallel programs. The proposed task-graph-based static checkpointing approach has been implemented on a Parallel Virtual Machine (PVM) platform.
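The consistency condition behind these algorithms can be checked mechanically once each checkpoint is tagged with causal ordering information. The sketch below uses vector clocks, a standard refinement of the happens-before relation (the thesis itself relies on synchronized clocks and vector counts rather than full vector clocks), to test whether a set of per-process checkpoints is pairwise concurrent and hence forms a consistent global state.

```python
def happened_before(vc_a, vc_b):
    """Lamport's happens-before lifted to vector clocks:
    a -> b iff vc_a <= vc_b componentwise and vc_a != vc_b."""
    return all(x <= y for x, y in zip(vc_a, vc_b)) and vc_a != vc_b

def concurrent(vc_a, vc_b):
    return not happened_before(vc_a, vc_b) and not happened_before(vc_b, vc_a)

def is_consistent_global_checkpoint(checkpoints):
    """A set of per-process checkpoints forms a consistent global state when
    every pair of checkpoints is concurrent (no checkpoint causally depends
    on another)."""
    for i in range(len(checkpoints)):
        for j in range(i + 1, len(checkpoints)):
            if not concurrent(checkpoints[i], checkpoints[j]):
                return False
    return True

# Three processes, one checkpoint each, tagged with their vector clocks.
print(is_consistent_global_checkpoint([(2, 1, 0), (1, 3, 0), (1, 1, 2)]))  # True: pairwise concurrent
print(is_consistent_global_checkpoint([(2, 1, 0), (3, 3, 0), (1, 1, 2)]))  # False: first -> second
```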
We now give a gist of the various chapters of the thesis. Chapter 2 gives a classification of existing checkpointing algorithms. The chapter surveys algorithms that have been reported in the literature for checkpointing parallel/distributed systems. A point to be noted is that most of the algorithms published for checkpointing message passing systems are based on the seminal article by Chandy & Lamport. A large number of checkpointing algorithms have been published by relaxing the assumptions made in that article and by extending its features to minimise the overheads of coordination and context saving. Checkpointing algorithms for shared memory systems primarily extend cache coherence protocols to maintain a consistent memory; all of them assume that the main memory is safe for storing the context. Recently, algorithms have been published for distributed shared memory systems, which extend the cache coherence protocols used in shared memory systems but also include methods for storing the status of distributed memory in stable storage. Chapter 2 concludes with brief comments on the desirable features of a checkpointing algorithm. In Chapter 3, we develop a dynamic checkpointing algorithm for message passing systems assuming that the clock drift of processors in the system is bounded. Efficient clock synchronisation protocols have been implemented on recent parallel computers because communication between processors is very fast; based on such protocols, clock skew can be limited to a few microseconds. The algorithm proposed in the thesis uses clocks for checkpoint coordination and vector counts for identifying messages to be logged. It is a periodic, distributed algorithm. We prove the correctness of the algorithm and compare it with similar clock-based algorithms. Distributed Shared Memory (DSM) systems provide the benefit of ease of programming in a scalable system. The recently proposed IEEE Scalable Coherent Interface (SCI) standard facilitates the construction of scalable coherent systems. In Chapter 4 we discuss a checkpointing algorithm for an SCI-based DSM system. SCI maintains cache coherence in hardware using a distributed cache directory which scales with the number of processors in the system, and recommends a two-phase transaction protocol for communication. Our algorithm is a two-phase centralised coordinated algorithm: phase one initiates checkpoints and the checkpointing activity is completed in phase two. The correctness of the algorithm is established theoretically. The chapter concludes with a discussion of the features of SCI exploited by the proposed checkpointing algorithm. In Chapter 5, a static checkpointing algorithm is developed assuming that the program to be executed on a parallel computer is given as a directed acyclic task graph and that estimates of the execution time of each task are available. Given the times at which checkpoints are to be taken, the algorithm identifies a set of edges where checkpointing tasks can be placed, ensuring that they form a consistent global checkpoint. The proposed algorithm eliminates coordination overhead at run time and significantly reduces the context-saving overhead by taking checkpoints along edges of the task graph. The algorithm is used as a preprocessing step before scheduling the tasks to processors.
The algorithm's complexity is O(km), where m is the number of edges in the graph and k is the maximum number of global checkpoints to be taken. The static algorithm is implemented on a parallel computer with a PVM environment, as PVM is widely available and portable. The task graph of a program can be constructed manually or through program development tools. Our implementation is a collection of preprocessing and run-time routines. The preprocessing routines operate on the task graph information to generate a set of edges to be checkpointed for each global checkpoint and write the information to disk. The run-time routines save the context along the marked edges. In case of recovery, the recovery routines read the information from stable storage and reconstruct the context. The limitation of our static checkpointing algorithm is that it can operate only on deterministic task graphs. To demonstrate the practical feasibility of the proposed approach, case studies of checkpointing some parallel programs are included in the thesis. We conclude the thesis with a summary of the proposed algorithms and possible directions for further research in the area of checkpointing.
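A simplified sketch of the static placement idea: given task duration estimates and a target checkpoint instant T, split the task graph into tasks finished by T and the rest, and checkpoint the edges crossing that cut. The sketch assumes unbounded processors and earliest-start timing purely for illustration; it is not the O(km) algorithm of the thesis.

```python
from collections import defaultdict, deque

def earliest_finish_times(tasks, edges):
    """tasks: {task: duration}; edges: list of (src, dst) in a DAG.
    Earliest finish time of each task assuming it starts as soon as all its
    predecessors finish (unbounded processors -- a simplification)."""
    succ, indeg = defaultdict(list), {t: 0 for t in tasks}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    start = {t: 0.0 for t in tasks}
    finish = {}
    ready = deque(t for t in tasks if indeg[t] == 0)
    while ready:
        t = ready.popleft()
        finish[t] = start[t] + tasks[t]
        for v in succ[t]:
            start[v] = max(start[v], finish[t])
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return finish

def checkpoint_edges(tasks, edges, T):
    """Split tasks into those estimated to finish by time T and the rest; the
    edges leading from the first group to the second form a consistent cut,
    and saving the data flowing along them gives one global checkpoint."""
    finish = earliest_finish_times(tasks, edges)
    done = {t for t in tasks if finish[t] <= T}
    return [(u, v) for u, v in edges if u in done and v not in done]

# Toy graph: A feeds B and C, which both feed D.
tasks = {"A": 2.0, "B": 3.0, "C": 1.0, "D": 2.0}
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
print(checkpoint_edges(tasks, edges, T=4.0))   # [('A', 'B'), ('C', 'D')]
```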
7

Schedules for Dynamic Bidirectional Simulations on Parallel Computers

Lehmann, Uwe 19 May 2003 (has links)
For adjoint calculations, parameter estimation, and similar purposes, one may need to reverse the execution of a computer program. The simplest option is to record a complete execution log and then read it backwards, but this requires massive amounts of storage. Instead, one may generate the execution log piecewise by restarting the "forward" calculation repeatedly from suitably placed checkpoints. This thesis extends the theoretical results on parallel reversal schedules. First, an algorithm is constructed that carries out the "forward" calculation and distributes checkpoints in such a way that the reversal calculation can be started at any time. This approach provides adaptive parallel reversal schedules for simulations where the number of time steps is not known a priori; the number of checkpoints and processors used is optimal at any time. Further, an algorithm is described that makes it possible to restart the original computer program during the program reversal. Again, this can be done at any time without additional computation. Hence, this thesis provides optimal parallel reversal schedules for bidirectional simulation.
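The basic trade-off the abstract describes, replaying the forward sweep from sparse checkpoints instead of logging every state, can be shown with a uniform, single-process checkpointing scheme. This is a deliberately simple sketch, not the adaptive parallel reversal schedules developed in the thesis.

```python
def reverse_with_checkpoints(x0, step, adjoint_step, n_steps, stride):
    """Reverse an n_steps-long forward simulation while storing only every
    `stride`-th state (uniform checkpointing on a single process).

    step(x)            -> next state
    adjoint_step(x, a) -> adjoint propagated backwards through one step,
                          given the forward state x at that step.
    """
    # Forward sweep: record sparse checkpoints.
    checkpoints = {0: x0}
    x = x0
    for i in range(n_steps):
        x = step(x)
        if (i + 1) % stride == 0:
            checkpoints[i + 1] = x

    # Reverse sweep: recompute each missing state from the nearest checkpoint.
    adj = 1.0  # seed adjoint of the final state
    for i in reversed(range(n_steps)):
        base = (i // stride) * stride
        x = checkpoints[base]
        for _ in range(i - base):
            x = step(x)          # replay forward from the checkpoint to state i
        adj = adjoint_step(x, adj)
    return adj

# Toy example: x_{k+1} = x_k^2, so the adjoint multiplies by d(step)/dx = 2*x.
grad = reverse_with_checkpoints(1.5, lambda x: x * x,
                                lambda x, a: a * 2 * x, n_steps=6, stride=3)
print(grad)   # derivative of the final state with respect to x0, evaluated at x0 = 1.5
```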
