191

Efektivní implementace genetického algoritmu s využitím vícejádrových CPU / The Efficient Implementation of the Genetic Algorithm Using Multicore Processors

Kouřil, Miroslav January 2010 (has links)
This diploma thesis deals with the acceleration of an advanced genetic algorithm. Discrete and continuous versions of the UMDA genetic algorithm were chosen for implementation. The core of the acceleration is the use of the SSE instruction set; in particular, the functions for fitness evaluation and for sampling the new population were vectorized with it. A pseudorandom number generator that also uses the SSE instruction set was then implemented. After these changes the discrete algorithm reached a speedup of up to 4.6. Finally, the algorithms were modified to use OpenMP, which allows blocks of code to run in multiple threads. The continuous version is not well suited to parallelization because its computational complexity is low, whereas the discrete versions are well suited to it. The two implemented versions reached total speedups of up to 4.9 and 7.2.
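
As a rough illustration of the kind of parallelism described above, the following sketch evaluates the fitness of a discrete UMDA population with an OpenMP parallel loop and then samples a new population from the estimated marginal probabilities. It is a minimal, generic UMDA skeleton under assumed parameters (OneMax fitness, population size, selection of the better half), not the SSE-vectorized code developed in the thesis.

```cpp
// Minimal discrete UMDA sketch: OpenMP-parallel fitness evaluation,
// marginal estimation from the better half, and sampling of a new population.
// Placeholder problem: maximize the number of ones (OneMax).
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>
#include <omp.h>

int main() {
    const int N = 512, D = 128, GENERATIONS = 100;      // assumed sizes
    std::vector<std::vector<int>> pop(N, std::vector<int>(D));
    std::vector<int> fitness(N);
    std::mt19937 rng(42);
    std::bernoulli_distribution half(0.5);
    for (auto& ind : pop) for (auto& g : ind) g = half(rng);

    for (int gen = 0; gen < GENERATIONS; ++gen) {
        // Fitness evaluation is independent per individual -> parallel for.
        #pragma omp parallel for
        for (int i = 0; i < N; ++i)
            fitness[i] = std::accumulate(pop[i].begin(), pop[i].end(), 0);

        // Select the better half of the population by fitness.
        std::vector<int> idx(N);
        std::iota(idx.begin(), idx.end(), 0);
        std::nth_element(idx.begin(), idx.begin() + N / 2, idx.end(),
                         [&](int a, int b) { return fitness[a] > fitness[b]; });

        // Estimate univariate marginal probabilities from the selected individuals.
        std::vector<double> p(D, 0.0);
        for (int s = 0; s < N / 2; ++s)
            for (int d = 0; d < D; ++d) p[d] += pop[idx[s]][d];
        for (auto& v : p) v /= (N / 2);

        // Sample the next population from the marginals, in parallel.
        #pragma omp parallel for
        for (int i = 0; i < N; ++i) {
            std::mt19937 local(1234u + static_cast<unsigned>(gen * N + i));
            for (int d = 0; d < D; ++d)
                pop[i][d] = std::bernoulli_distribution(p[d])(local);
        }
    }
    // Best fitness of the last evaluated population.
    std::printf("best fitness: %d\n", *std::max_element(fitness.begin(), fitness.end()));
}
```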
192

Akcelerace algoritmů Lattice-Boltzmann pro modelování toku krve v mozku / Acceleration of Lattice-Boltzmann Algorithms for Bloodflow Modeling

Kompová, Radmila January 2016 (has links)
This thesis explores possible implementations and optimizations of the lattice-Boltzmann method, which models fluid flow by simulating fictitious particles. It focuses on possible improvements to the existing tool HemeLB, which is designed and optimized for blood-flow modeling, and examines several vectorization and parallelization approaches that could be included in this tool. As part of the thesis, an application was implemented for comparing the chosen lattice-Boltzmann algorithms and their optimizations, and a set of tests compared these algorithms in terms of performance, cache usage, and overall memory usage. The best performance achieved was 150 million lattice site updates per second.
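
For orientation, the sketch below shows the core of a single-relaxation-time (BGK) lattice-Boltzmann collision step on a D2Q9 lattice, parallelized with OpenMP and laid out as a structure of arrays. It is a generic textbook kernel under assumed parameters (grid size, relaxation time), not code from HemeLB or from the comparison tool built in the thesis.

```cpp
// Generic D2Q9 BGK collision step (structure-of-arrays layout, OpenMP).
// Streaming, boundary conditions, and the MLUPS measurement are omitted.
#include <vector>
#include <omp.h>

constexpr int Q = 9;
constexpr int cx[Q] = { 0, 1, 0, -1,  0, 1, -1, -1,  1 };
constexpr int cy[Q] = { 0, 0, 1,  0, -1, 1,  1, -1, -1 };
constexpr double w[Q] = { 4.0/9, 1.0/9, 1.0/9, 1.0/9, 1.0/9,
                          1.0/36, 1.0/36, 1.0/36, 1.0/36 };

// f[q] is a contiguous array of size nx*ny holding distribution q for all nodes.
void collide(std::vector<std::vector<double>>& f, int nx, int ny, double tau) {
    const double omega = 1.0 / tau;
    #pragma omp parallel for
    for (int n = 0; n < nx * ny; ++n) {
        double rho = 0.0, ux = 0.0, uy = 0.0;
        for (int q = 0; q < Q; ++q) {          // macroscopic density and velocity
            rho += f[q][n];
            ux  += cx[q] * f[q][n];
            uy  += cy[q] * f[q][n];
        }
        ux /= rho; uy /= rho;
        const double usq = ux * ux + uy * uy;
        for (int q = 0; q < Q; ++q) {          // relax toward local equilibrium
            const double cu  = cx[q] * ux + cy[q] * uy;
            const double feq = w[q] * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * usq);
            f[q][n] += omega * (feq - f[q][n]);
        }
    }
}
```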
193

Jádra schématu lifting pro vlnkovou transformaci / Lifting Scheme Cores for Wavelet Transform

Bařina, David Unknown Date (has links)
This work focuses on the efficient computation of the two-dimensional discrete wavelet transform. Existing methods are extended in several directions so that the transform, possibly over several decomposition levels, is computed in a single pass using a compact core. This core can further be reorganized to minimize the use of certain resources. The presented approach maps naturally onto commonly used SIMD extensions, exploits the cache hierarchy of modern processors, and is well suited to parallel computation. Finally, the approach is integrated into the JPEG 2000 compression chain, where it proves to be substantially faster than widely used implementations.
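
As background on the lifting scheme itself, the sketch below performs one level of the integer CDF 5/3 wavelet transform on a 1-D signal using the standard predict and update lifting steps; the single-pass, vectorized 2-D cores developed in the thesis build on this basic structure but are considerably more involved. Signal length and boundary handling here are simplifying assumptions.

```cpp
// One level of the CDF 5/3 (LeGall) integer wavelet transform via lifting.
// x has even length; output: lowpass coefficients s, highpass coefficients d.
// Border handling is simplified to clamping indices.
#include <algorithm>
#include <vector>

void cdf53_forward(const std::vector<int>& x,
                   std::vector<int>& s, std::vector<int>& d) {
    const int n = static_cast<int>(x.size());   // assumed even and >= 4
    const int half = n / 2;
    s.resize(half);
    d.resize(half);

    // Predict: detail = odd sample minus the average of its even neighbours.
    for (int i = 0; i < half; ++i) {
        const int left  = x[2 * i];
        const int right = x[std::min(2 * i + 2, n - 2)];   // clamp at the border
        d[i] = x[2 * i + 1] - ((left + right) >> 1);
    }
    // Update: smooth = even sample plus a correction from neighbouring details.
    for (int i = 0; i < half; ++i) {
        const int dl = d[std::max(i - 1, 0)];
        const int dr = d[i];
        s[i] = x[2 * i] + ((dl + dr + 2) >> 2);
    }
}
```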
194

Méthodes numériques pour les plasmas sur architectures multicoeurs / Numerical methods for plasmas on massively parallel architectures

Massaro, Michel 16 December 2016 (has links)
This thesis deals with solving the Magneto-Hydro-Dynamic (MHD) system on massively parallel architectures. The MHD equations form a hyperbolic system of conservation laws. For reasons of cost in both time and memory, we use the finite volume method; these criteria matter particularly for MHD because the solutions may contain many shock waves and be highly turbulent. Resolving such a physical phenomenon therefore requires a fine mesh, which entails a large amount of computation. To reduce the execution time of the proposed algorithms, we present several optimizations for CPU execution, such as using OpenMP for automatic parallelization and an optimized grid traversal that benefits from cache effects. An implementation on GPU architectures using the OpenCL library is also provided. To preserve maximal coalescence of data in memory, we propose a method combining directional splitting with a transposition scheme optimized for parallel implementations. In the last part, we present the SCHNAPS library, a solver based on the Discontinuous Galerkin (DG) method that uses OpenCL and StarPU implementations to get the most out of hybrid programming.
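
To make the directional-splitting idea concrete, the sketch below advances a 2-D scalar advection equation with a first-order finite-volume (upwind) scheme: an x-sweep over contiguous rows, a transpose of the grid, the same sweep again (now effectively in y), and a transpose back, so every sweep reads memory contiguously. It is a scalar stand-in under assumed parameters, not the MHD solver or the optimized transposition of the thesis.

```cpp
// Directional splitting for 2-D scalar advection u_t + a u_x + b u_y = 0.
// Each sweep updates contiguous rows; the transpose turns the y-sweep into
// another x-sweep so memory accesses stay contiguous (coalesced on GPUs).
#include <vector>
#include <omp.h>

using Grid = std::vector<double>;   // row-major, size nx*ny

void sweep_x(Grid& u, int nx, int ny, double a, double dt, double dx) {
    const double c = a * dt / dx;   // assumes a >= 0 and c <= 1 (CFL)
    #pragma omp parallel for
    for (int j = 0; j < ny; ++j) {
        double left = u[j * nx + nx - 1];            // periodic boundary
        for (int i = 0; i < nx; ++i) {
            const double cur = u[j * nx + i];
            u[j * nx + i] = cur - c * (cur - left);  // first-order upwind
            left = cur;
        }
    }
}

// in has nx columns and ny rows; out gets ny columns and nx rows.
void transpose(const Grid& in, Grid& out, int nx, int ny) {
    #pragma omp parallel for
    for (int j = 0; j < ny; ++j)
        for (int i = 0; i < nx; ++i)
            out[i * ny + j] = in[j * nx + i];
}

void step(Grid& u, Grid& tmp, int nx, int ny,
          double a, double b, double dt, double dx, double dy) {
    sweep_x(u, nx, ny, a, dt, dx);       // x-direction
    transpose(u, tmp, nx, ny);
    sweep_x(tmp, ny, nx, b, dt, dy);     // y-direction on transposed data
    transpose(tmp, u, ny, nx);
}
```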
195

Formalisation et automatisation de YAO, générateur de code pour l’assimilation variationnelle de données / Formalisation and automation of YAO, code generator for variational data assimilation

Nardi, Luigi 08 March 2011 (has links)
Variational data assimilation (4D-Var) is a widely used technique in geophysics, in particular in meteorology and oceanography. It consists in estimating the control parameters of a direct numerical model by minimizing a cost function that measures the misfit between the model outputs and the observed measurements. The minimization, based on a gradient method, requires the adjoint model (the product of the transposed Jacobian matrix with the vector of derivatives of the cost function at the observation points). Implementing 4D-Var raises complex software issues, in particular concerning the adjoint model, code parallelization, and efficient memory management. To address these difficulties and ease the development of 4D-Var applications, LOCEAN is developing the YAO framework. YAO represents a direct model as a computation flow graph called a modular graph: modules are computation units and edges describe data transfers between them. YAO's description directives let a user describe a direct model and generate the corresponding modular graph. Two core algorithms operate on this graph: a forward propagation algorithm computes the outputs of the direct model, and a back-propagation algorithm computes its adjoint. The main advantage of YAO is that the code of the direct and adjoint models is generated automatically once the user has designed the modular graph; YAO also supports various scenarios for running data assimilation sessions. This thesis presents computer science research carried out on the YAO framework. We first formalized the existing YAO specifications in a more general way. We then proposed algorithms that automate important tasks, such as the automatic generation of an "optimal" computation ordering and the automatic shared-memory parallelization of the generated code using OpenMP directives. In the medium term, these results lay the foundations for evolving YAO into a general, operational platform for 4D-Var data assimilation capable of handling real, large-scale applications.
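
The forward/back-propagation pair described above is, in essence, reverse-mode differentiation on a computation graph. The sketch below shows a minimal version: modules expose a forward function and an adjoint (backward) function, the forward pass visits modules in topological order, and the adjoint pass visits them in reverse, accumulating gradients. This is a generic illustration of the principle, not code generated by YAO.

```cpp
// Minimal modular-graph sketch: a forward pass in topological order and an
// adjoint (back-propagation) pass in reverse order over the same graph.
#include <cstdio>
#include <functional>
#include <vector>

struct Module {
    std::vector<int> inputs;  // indices of predecessor modules
    // forward: input values -> output value
    std::function<double(const std::vector<double>&)> forward;
    // adjoint: (input values, gradient w.r.t. output) -> gradients w.r.t. inputs
    std::function<std::vector<double>(const std::vector<double>&, double)> adjoint;
};

int main() {
    double x = 2.0;
    // Tiny graph: m0 = x, m1 = m0 * m0, m2 = m0 + 3 * m1.
    std::vector<Module> g(3);
    g[0].forward = [&](const std::vector<double>&) { return x; };
    g[0].adjoint = [](const std::vector<double>&, double) {
        return std::vector<double>{};
    };
    g[1].inputs  = {0};
    g[1].forward = [](const std::vector<double>& in) { return in[0] * in[0]; };
    g[1].adjoint = [](const std::vector<double>& in, double gy) {
        return std::vector<double>{2.0 * in[0] * gy};
    };
    g[2].inputs  = {0, 1};
    g[2].forward = [](const std::vector<double>& in) { return in[0] + 3.0 * in[1]; };
    g[2].adjoint = [](const std::vector<double>&, double gy) {
        return std::vector<double>{gy, 3.0 * gy};
    };

    // Forward pass: modules are stored in topological order.
    std::vector<double> out(g.size());
    for (std::size_t m = 0; m < g.size(); ++m) {
        std::vector<double> in;
        for (int p : g[m].inputs) in.push_back(out[p]);
        out[m] = g[m].forward(in);
    }

    // Adjoint pass: seed the output module with d(cost)/d(output) = 1.
    std::vector<double> grad(g.size(), 0.0);
    grad.back() = 1.0;
    for (int m = static_cast<int>(g.size()) - 1; m >= 0; --m) {
        std::vector<double> in;
        for (int p : g[m].inputs) in.push_back(out[p]);
        const std::vector<double> gin = g[m].adjoint(in, grad[m]);
        for (std::size_t k = 0; k < gin.size(); ++k) grad[g[m].inputs[k]] += gin[k];
    }
    std::printf("value = %g, d(value)/dx = %g\n", out[2], grad[0]);  // 14 and 13
}
```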
196

Parallelizing Set Similarity Joins

Fier, Fabian 24 January 2022 (has links)
One of today's major challenges in data science is to compare and relate data of similar nature. The join operation known from relational databases can help solve this problem: given a collection of records, a join finds all pairs of records that fulfill a user-chosen predicate. Real-world problems may require complex predicates such as similarity, and a common way to measure similarity is with set similarity functions. In order to use set similarity functions as predicates, we assume records are represented as sets of tokens. This thesis focuses on the set similarity join (SSJ) operation. The amount of data to be processed today is typically large and keeps growing, while the SSJ is a compute-intensive operation, so additional means are needed to develop scalable SSJ implementations. This thesis focuses on parallelization and makes three major contributions to SSJ. First, we survey the state of the art in parallelizing SSJ, comparing ten MapReduce-based approaches from the literature both analytically and experimentally. Their main limitation is, surprisingly, low scalability due to excessive and/or skewed data replication; none of the approaches could compute the join on large datasets. Second, we leverage the abundant CPU parallelism of modern commodity hardware, which had not previously been used to scale SSJ, and propose a novel data-parallel multi-threaded SSJ that provides significant speedups over single-threaded execution. Third, we propose a novel, highly scalable distributed SSJ approach. With a cost-based heuristic and a data-independent scaling mechanism it avoids data replication and recomputation, and the heuristic assigns similar shares of the compute cost to each node. Our approach significantly scales up the join execution and processes much larger datasets than all parallel approaches designed and implemented so far.
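
As a baseline illustration of a data-parallel SSJ, the sketch below computes a Jaccard self-join over a collection of token sets by partitioning the outer loop across OpenMP threads. Real SSJ algorithms, including those studied in the thesis, add filtering (e.g., prefix or length filters) and careful load balancing; this brute-force version, with an assumed threshold and toy data, only shows the basic data-parallel structure.

```cpp
// Brute-force data-parallel set similarity self-join (Jaccard >= threshold).
// Each record is a sorted vector of integer tokens.
#include <cstdio>
#include <utility>
#include <vector>
#include <omp.h>

double jaccard(const std::vector<int>& a, const std::vector<int>& b) {
    std::size_t i = 0, j = 0, inter = 0;
    while (i < a.size() && j < b.size()) {        // merge-style intersection
        if (a[i] == b[j]) { ++inter; ++i; ++j; }
        else if (a[i] < b[j]) ++i;
        else ++j;
    }
    return static_cast<double>(inter) / (a.size() + b.size() - inter);
}

int main() {
    std::vector<std::vector<int>> records = {
        {1, 2, 3, 4}, {2, 3, 4, 5}, {10, 11, 12}, {1, 2, 3, 5}
    };
    const double threshold = 0.6;                  // assumed similarity threshold
    std::vector<std::pair<int, int>> result;

    #pragma omp parallel
    {
        std::vector<std::pair<int, int>> local;    // per-thread result buffer
        #pragma omp for schedule(dynamic)
        for (int i = 0; i < static_cast<int>(records.size()); ++i)
            for (int j = i + 1; j < static_cast<int>(records.size()); ++j)
                if (jaccard(records[i], records[j]) >= threshold)
                    local.emplace_back(i, j);
        #pragma omp critical                       // merge without data races
        result.insert(result.end(), local.begin(), local.end());
    }
    for (auto [i, j] : result) std::printf("(%d, %d)\n", i, j);
}
```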
197

A unifying mathematical definition enables the theoretical study of the algorithmic class of particle methods.

Pahlke, Johannes 05 June 2023 (has links)
Mathematical definitions provide a precise, unambiguous way to formulate concepts. They also provide a common language between disciplines. Thus, they are the basis for a well-founded scientific discussion. In addition, mathematical definitions allow for deeper insights into the defined subject based on mathematical theorems that are incontrovertible under the given definition. Besides their value in mathematics, mathematical definitions are indispensable in other sciences like physics, chemistry, and computer science. In computer science, they help to derive the expected behavior of a computer program and provide guidance for the design and testing of software. Therefore, mathematical definitions can be used to design and implement advanced algorithms. One class of widely used algorithms in computer science is the class of particle-based algorithms, also known as particle methods. Particle methods can solve complex problems in various fields, such as fluid dynamics, plasma physics, or granular flows, using diverse simulation methods, including Discrete Element Methods (DEM), Molecular Dynamics (MD), Reproducing Kernel Particle Methods (RKPM), Particle Strength Exchange (PSE), and Smoothed Particle Hydrodynamics (SPH). Despite the increasing use of particle methods driven by improved computing performance, the relation between these algorithms remains formally unclear. In particular, particle methods lack a unifying mathematical definition and precisely defined terminology. This prevents the determination of whether an algorithm belongs to the class and what distinguishes the class. Here we present a rigorous mathematical definition for determining particle methods and demonstrate its importance by applying it to several canonical algorithms and those not previously recognized as particle methods. Furthermore, we base proofs of theorems about parallelizability and computational power on it and use it to develop scientific computing software. Our definition unified, for the first time, the so far loosely connected notion of particle methods. 
Thus, it marks the necessary starting point for a broad range of joint formal investigations and applications across fields.

Table of contents:
1 Introduction: 1.1 The Role of Mathematical Definitions; 1.2 Particle Methods; 1.3 Scope and Contributions of this Thesis
2 Terminology and Notation
3 A Formal Definition of Particle Methods: 3.1 Introduction; 3.2 Definition of Particle Methods (3.2.1 Particle Method Algorithm, 3.2.2 Particle Method Instance, 3.2.3 Particle State Transition Function); 3.3 Explanation of the Definition of Particle Methods (3.3.1 Illustrative Example, 3.3.2 Explanation of the Particle Method Algorithm, 3.3.3 Explanation of the Particle Method Instance, 3.3.4 Explanation of the State Transition Function); 3.4 Conclusion
4 Algorithms as Particle Methods: 4.1 Introduction; 4.2 Perfectly Elastic Collision in Arbitrary Dimensions; 4.3 Particle Strength Exchange; 4.4 Smoothed Particle Hydrodynamics; 4.5 Lennard-Jones Molecular Dynamics; 4.6 Triangulation refinement; 4.7 Conway's Game of Life; 4.8 Gaussian Elimination; 4.9 Conclusion
5 Parallelizability of Particle Methods: 5.1 Introduction; 5.2 Particle Methods on Shared Memory Systems (5.2.1 Parallelization Scheme, 5.2.2 Lemmata, 5.2.3 Parallelizability, 5.2.4 Time Complexity, 5.2.5 Application); 5.3 Particle Methods on Distributed Memory Systems (5.3.1 Parallelization Scheme, 5.3.2 Lemmata, 5.3.3 Parallelizability, 5.3.4 Bounds on Time Complexity and Parallel Scalability); 5.4 Conclusion
6 Turing Powerfulness and Halting Decidability: 6.1 Introduction; 6.2 Turing Machine; 6.3 Turing Powerfulness of Particle Methods Under a First Set of Constraints; 6.4 Turing Powerfulness of Particle Methods Under a Second Set of Constraints; 6.5 Halting Decidability of Particle Methods; 6.6 Conclusion
7 Particle Methods as a Basis for Scientific Software Engineering: 7.1 Introduction; 7.2 Design of the Prototype; 7.3 Applications, Comparisons, Convergence Study, and Run-time Evaluations; 7.4 Conclusion
8 Results, Discussion, Outlook, and Conclusion: 8.1 Problem; 8.2 Results; 8.3 Discussion; 8.4 Outlook; 8.5 Conclusion
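
For intuition only, the sketch below shows the generic algorithmic skeleton that particle methods share: a state of particles, a pairwise interaction within a neighborhood, and a per-particle evolution step, repeated until a stopping condition holds. It is a loose illustration of the structure the thesis formalizes, not the formal definition itself (which specifies these components as a precise mathematical tuple); the 1-D toy force and all numbers are placeholders.

```cpp
// Generic particle-method skeleton: interact neighboring particles, then
// evolve each particle, until a stopping condition is met.
#include <cmath>
#include <cstdio>
#include <vector>

struct Particle { double x; double v; double f; };

int main() {
    std::vector<Particle> p = { {0.0, 0.0, 0.0}, {1.0, 0.0, 0.0}, {2.5, 0.0, 0.0} };
    const double cutoff = 2.0, dt = 0.01;
    for (int step = 0; step < 1000; ++step) {          // stopping condition
        for (auto& pi : p) pi.f = 0.0;
        for (std::size_t i = 0; i < p.size(); ++i)     // interact: pairwise, within cutoff
            for (std::size_t j = i + 1; j < p.size(); ++j) {
                const double r = p[j].x - p[i].x;
                if (std::fabs(r) < cutoff) {
                    const double f = -r;               // toy attractive force
                    p[i].f -= f;
                    p[j].f += f;
                }
            }
        for (auto& pi : p) {                           // evolve: per-particle update
            pi.v += dt * pi.f;
            pi.x += dt * pi.v;
        }
    }
    for (const auto& pi : p) std::printf("x = %g\n", pi.x);
}
```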
198

Deinterleaving of radar pulses with batch processing to utilize parallelism / Gruppering av radar pulser med batch-bearbetning för att utnyttja parallelism

Lind, Emma, Stahre, Mattias January 2020 (has links)
The threat level in an environment (specifically, in this thesis, for aircraft) can be determined by analyzing radar signals. This task is critical and has to be solved quickly and with high accuracy. The received electromagnetic pulses have to be identified in order to classify a radar emitter. Usually, several emitters transmit radar pulses at the same time in an environment, and these pulses need to be sorted into groups, where each group contains pulses from the same emitter. This thesis aims to find a fast and accurate way to sort the pulses in parallel. The selected approach analyzes batches of pulses in parallel to exploit the advantages of a multi-threaded Central Processing Unit (CPU) or a Graphics Processing Unit (GPU). Firstly, a suitable clustering algorithm had to be selected; secondly, an optimal batch size had to be determined that achieves high clustering performance while allowing the batches of pulses to be processed rapidly in parallel. A quantitative method based on experiments was used to measure clustering performance, execution time, system response, and parallelism as a function of batch size for the selected clustering algorithm. The algorithm selected for clustering the data was Density-Based Spatial Clustering of Applications with Noise (DBSCAN), because it does not require the number of clusters to be specified in advance, can find clusters of arbitrary shape in a data set, and has low time complexity. The evaluation showed that parallel batch processing is feasible while still achieving high clustering performance compared to a sequential implementation that used the maximum likelihood method. An optimal batch size in terms of data points and cutoff time is hard to determine, since it depends strongly on the input data; one batch size might therefore not be optimal, in terms of clustering performance and system response, for all data streams. A solution could be to determine optimal batch sizes in advance for different streams of data and then adapt the batch size to the stream at hand. However, a high level of parallelism introduces an additional delay that depends on the difference between the time it takes to collect data points into a batch and the time it takes to process the batch, so the system is slower to output the result for a given batch than a sequential system. For a time-critical system, a high level of parallelism might therefore be unsuitable, since it leads to slower response times.
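
The batch-parallel idea can be sketched as follows: incoming pulses are accumulated into fixed-size batches and each batch is clustered on its own worker thread. The clustering call here is a placeholder (cluster_batch is a hypothetical stand-in for a real DBSCAN implementation), and the batch size and pulse features are assumptions; the point is only the dispatch structure, not the thesis's evaluated pipeline.

```cpp
// Batch-parallel dispatch sketch: accumulate pulses into batches and cluster
// each batch on its own thread. cluster_batch is a hypothetical placeholder
// for a real clustering routine such as DBSCAN.
#include <cstdio>
#include <future>
#include <vector>

struct Pulse { double freq; double pulse_width; double toa; };   // assumed features

// Placeholder: returns one cluster label per pulse (here: everything in cluster 0).
std::vector<int> cluster_batch(const std::vector<Pulse>& batch) {
    return std::vector<int>(batch.size(), 0);
}

int main() {
    const std::size_t batch_size = 256;                           // assumed batch size
    std::vector<Pulse> stream(1000, Pulse{9.4e9, 1.0e-6, 0.0});   // stand-in input
    std::vector<std::future<std::vector<int>>> pending;

    std::vector<Pulse> batch;
    for (const Pulse& p : stream) {
        batch.push_back(p);
        if (batch.size() == batch_size) {              // batch full: dispatch it
            pending.push_back(std::async(std::launch::async, cluster_batch, batch));
            batch.clear();
        }
    }
    if (!batch.empty())                                // flush the last partial batch
        pending.push_back(std::async(std::launch::async, cluster_batch, batch));

    for (auto& f : pending) {
        std::vector<int> labels = f.get();             // results collected in batch order
        std::printf("batch clustered: %zu pulses\n", labels.size());
    }
}
```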
199

Cloud Based Dynamic Workflow with QOS For Mass Spectrometry Data Analysis

Nagavaram, Ashish 19 December 2011 (has links)
No description available.
200

Extracting Parallelism from Legacy Sequential Code Using Transactional Memory

Saad Ibrahim, Mohamed Mohamed 26 July 2016 (has links)
Increasing the number of processors has become the mainstream approach in modern chip design. However, most applications are designed or written for single-core processors, so they do not benefit from the numerous underlying computation resources. Moreover, there exists a large base of legacy software that would require immense effort and cost to rewrite and re-engineer for parallel execution. In the past decades there has been growing interest in automatic parallelization, both to relieve programmers of the painful and error-prone manual parallelization process and to cope with the architectural trend toward multi-core and many-core CPUs. Automatic parallelization techniques vary in properties such as the level of parallelism (e.g., instructions, loops, traces, tasks); the need for custom hardware support; the use of optimistic execution versus conservative decisions; online, offline, or combined operation; and the level of source code exposure. Transactional Memory (TM) has emerged as a powerful concurrency control abstraction that simplifies parallel programming to the level of coarse-grained locking while achieving fine-grained locking performance. This dissertation exploits TM as an optimistic execution approach for transforming sequential applications into parallel ones. It proposes the design and implementation of two frameworks that support automatic parallelization, Lerna and HydraVM, along with a number of algorithmic optimizations that make the parallelization effective. HydraVM is a virtual machine that automatically extracts parallelism from legacy sequential code (at the bytecode level) through a set of techniques including code profiling, data dependency analysis, and execution analysis. HydraVM is built by extending the Jikes RVM and modifying its baseline compiler; program correctness is preserved by using Software Transactional Memory (STM) to manage concurrent and out-of-order memory accesses. Our experiments show that HydraVM achieves speedups between 2× and 5× on a set of benchmark applications. Lerna is a compiler framework that automatically and transparently detects and extracts parallelism from sequential code through code profiling, instrumentation, and adaptive execution. Lerna is cross-platform and independent of the programming language. The parallel execution exploits memory transactions to manage concurrent and out-of-order memory accesses, which makes Lerna very effective for sequential applications with data sharing. This thesis introduces the general conditions for embedding any transactional memory algorithm into Lerna. In addition, ordered versions of four state-of-the-art algorithms have been integrated and evaluated using multiple benchmarks, including the RSTM micro-benchmarks, STAMP, and PARSEC. Lerna achieved an average 2.7× (and up to 18×) speedup over the original sequential code. While prior research shows that transactions must commit in order to preserve program semantics, enforcing this ordering constrains scalability at large numbers of cores. In this dissertation we eliminate the need to commit transactions sequentially without affecting program consistency. This is achieved by building a cooperation mechanism in which transactions can safely forward some of their changes, which eliminates some false conflicts and increases the concurrency level of the parallel application. The thesis proposes a set of commit-order algorithms that follow this approach. Using the proposed commit-order algorithms, the peak gain over the sequential non-instrumented execution is 10× on the RSTM micro-benchmarks and 16.5× on STAMP. Another main contribution is to enhance the concurrency and performance of TM in general, and its use for parallelization in particular, by extending the TM primitives. The extended primitives extract the embedded low-level application semantics without affecting the TM abstraction. Furthermore, since the proposed extensions capture common code patterns, they can be applied automatically during compilation; in this work this was done by modifying the GCC compiler to support our TM extensions. Results showed speedups of up to 4× on different applications, including micro-benchmarks and STAMP. Our final contribution is to support the commit order with Hardware Transactional Memory (HTM). The HTM contention manager cannot be modified because it is implemented in hardware. Given this constraint, we exploit HTM to reduce the transactional execution overhead by proposing two novel commit-order algorithms and a hybrid reduced-hardware algorithm. The use of HTM improves performance by up to 20%. / Ph. D.
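
The ordered-commit idea behind this kind of loop parallelization can be illustrated with a simplified scheme: iterations are executed speculatively in parallel, but their side effects are applied strictly in iteration order. In the sketch below, OpenMP's ordered construct plays the role of the commit-order mechanism; this is a didactic stand-in, not the Lerna or HydraVM implementation, which additionally detect conflicts and abort or retry transactions.

```cpp
// Simplified ordered-commit sketch: iterations are computed speculatively in
// parallel, and their side effects are committed strictly in iteration order.
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const int n = 16;
    std::vector<int> log;                        // shared state mutated only at commit time

    #pragma omp parallel for ordered schedule(dynamic, 1)
    for (int i = 0; i < n; ++i) {
        // "Transaction body": expensive, independent work done in parallel.
        long result = 0;
        for (int k = 0; k < 100000; ++k) result += (i + k) % 7;

        // "Commit": applied in iteration order, one iteration at a time.
        #pragma omp ordered
        {
            log.push_back(static_cast<int>(result % 100));
            std::printf("committed iteration %d on thread %d\n", i, omp_get_thread_num());
        }
    }
    std::printf("committed %zu iterations in order\n", log.size());
}
```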
