Global ETD Search

201	"Virtual malleability" applied to MPI jobs to improve their execution in a multiprogrammed environment" Utrera Iglesias, Gladys Miriam 10 December 2007 (has links) This work focuses on scheduling of MPI jobs when executing in shared-memory multiprocessors (SMPs). The objective was to obtain the best performance in response time in multiprogrammed multiprocessors systems using batch systems, assuming all the jobs have the same priority. To achieve that purpose, the benefits of supporting malleability on MPI jobs to reduce fragmentation and consequently improve the performance of the system were studied. The contributions made in this work can be summarized as follows:· Virtual malleability: A mechanism where a job is assigned a dynamic processor partition, where the number of processes is greater than the number of processors. The partition size is modified at runtime, according to external requirements such as the load of the system, by varying the multiprogramming level, making the job contend for resources with itself. In addition to this, a mechanism which decides at runtime if applying local or global process queues to an application depending on the load balancing between processes of it. · A job scheduling policy, that takes decisions such as how many processes to start with and the maximum multiprogramming degree based on the type and number of applications running and queued. Moreover, as soon as a job finishes execution and where there are queued jobs, this algorithm analyzes whether it is better to start execution of another job immediately or just wait until there are more resources available. · A new alternative to backfilling strategies for the problema of window execution time expiring. Virtual malleability is applied to the backfilled job, reducing its partition size but without aborting or suspending it as in traditional backfilling. The evaluation of this thesis has been done using a practical approach. All the proposals were implemented, modifying the three scheduling levels: queuing system, processor scheduler and runtime library. The impact of the contributions were studied under several types of workloads, varying machine utilization, communication and, balance degree of the applications, multiprogramming level, and job size. Results showed that it is possible to offer malleability over MPI jobs. An application obtained better performance when contending for the resources with itself than with other applications, especially in workloads with high machine utilization. Load imbalance was taken into account obtaining better performance if applying the right queue type to each application independently.The job scheduling policy proposed exploited virtual malleability by choosing at the beginning of execution some parameters like the number of processes and maximum multiprogramming level. It performed well under bursty workloads with low to medium machine utilizations. However as the load increases, virtual malleability was not enough. That is because, when the machine is heavily loaded, the jobs, once shrunk are not able to expand, so they must be executed all the time with a partition smaller than the job size, thus degrading performance. Thus, at this point the job scheduling policy concentrated just in moldability.Fragmentation was alleviated also by applying backfilling techniques to the job scheduling algorithm. Virtual malleability showed to be an interesting improvement in the window expiring problem. Backfilled jobs even on a smaller partition, can continue execution reducing memory swapping generated by aborts/suspensions In this way the queueing system is prevented from reinserting the backfilled job in the queue and re-executing it in the future. parallel applications malleability process scheduling processor scheduling job scheduling mpi parallel computing load balancing message passsing.
202	Large Scale Parallel Inference of Protein and Protein Domain families Rezvoy, Clément 28 September 2011 (has links) (PDF) Protein domains are recurring independent segment of proteins. The combinatorial arrangement of domains is at the root of the functional and structural diversity of proteins. Several methods have been developed to infer protein domain decomposition and domain family clustering from sequence information alone. MkDom2 is one of those methods. Mkdom2 infers domain families in a greedy fashion. Families are inferred one after the other in order to create a delineation of domains on proteins and a clustering of those domains in families. MkDom2 is instrumental in the building of the ProDom database. The exponential growth of the number of sequences to process as rendered MkDom2 obsolete, it would now take several years to compute a newrelease of ProDom. We present a nous algorithm, MPI_MkDom2, allowing computation of several families at once across a distributed computing platform. MPI_MkDom2 is an asynchronous distributed algorithm managing load balancing to ensure efficient platform usage; it ensures the creation of a non-overlapping partitioning of the whole protein set. A new proximity measure is defined to assess the effect of the parallel computation on the result. We also Propose a second algorithm, MPI_mkDom3, allowing the simultaneous computation of a clustering of protein domains as well as full protein sharing the same domain decomposition. [INFO:INFO_OH] Computer Science/Other Bioinformatics Protein Domain MPI Distributed computing Clustering
203	Lygiagretaus skaičiavimo technologijų naudojimas Kuršių marių ekologiniame modelyje / Parallel computing technology application to Curonian Lagoon ecological model Bliūdžiutė, Lina 14 June 2005 (has links) Modern computers are capable of completing most of the tasks in fairly short time, however there are areas in what calculations can last for months and even years. Parallel algorithms are one of the ways to accelerate long-lasting calculations. In the thesis we analyze parallel computing technologies OpenMP (shared memory) and MPI (distributed memory), cons and pros of their architecture. We identify potential parts of program code of Curonian lagoon model for parallelizing, in which chosen parallel computing technologies OpenMP and MPI is applied. Also we make the runtime speedup analysis. Informatics Amdahl'o dėsnis OpenMP Greitinimas Speedup SHYFEM Parallel computing MPI Lygiagretūs skaičiavimai Amdahl's low
204	CellPilot: An extension of the Pilot library for Cell Broadband Engine processors and heterogeneous clusters Girard, Natalie 13 January 2012 (has links) The CellPilot library provides a uniform communication programming model, based on Pilot's process/channel approach, for clusters of Cell Broadband Engine processors. Pilot, a thin layer on top of the Message Passing Interface (MPI) library, allows processes to read/write messages on channels defined between pairs of processes on the cluster, but Pilot alone does not help a Cell programmer cope with the considerable complexities of intra-Cell communication. With CellPilot, programmers still design software in terms of processes, but they can now be located on a Cell node's Power Processor Elements (PPEs), Synergistic Processing Elements (SPEs), or non-Cell node within a heterogeneous Cell cluster, and communication is accomplished via channels between process pairs. Programs are coded in terms of reading and writing on those channels, whereupon CellPilot transparently applies whichever communication mechanisms are required to transport the message, regardless of its endpoints. This gives the programmer a way to handle inter-process communication while avoiding low-level I/O operations and the use of multiple libraries. CellPilot Pilot Cell BE Cell Broadband Engine MPI heterogeneous cluster high performance computing multicore
205	Interprocess Communication Mechanisms With Inter-Virtual Machine Shared Memory Ke, Xiaodi Unknown Date No description available. Virtual Machine (VM) Nahanni Shared Memory Interprocess Communication (IPC) Message-Passing Interface (MPI) Minitransactions High Performance
206	McMPI : a managed-code message passing interface library for high performance communication in C# Holmes, Daniel John January 2012 (has links) This work endeavours to achieve technology transfer between established best-practice in academic high-performance computing and current techniques in commercial high-productivity computing. It shows that a credible high-performance message-passing communication library, with semantics and syntax following the Message-Passing Interface (MPI) Standard, can be built in pure C# (one of the .Net suite of computer languages). Message-passing has been the dominant paradigm in high-performance parallel programming of distributed-memory computer architectures for three decades. The MPI Standard originally distilled architecture-independent and language-agnostic ideas from existing specialised communication libraries and has since been enhanced and extended. Object-oriented languages can increase programmer productivity, for example by allowing complexity to be managed through encapsulation. Both the C# computer language and the .Net common language runtime (CLR) were originally developed by Microsoft Corporation but have since been standardised by the European Computer Manufacturers Association (ECMA) and the International Standards Organisation (ISO), which facilitates portability of source-code and compiled binary programs to a variety of operating systems and hardware. Combining these two open and mature technologies enables mainstream programmers to write tightly-coupled parallel programs in a popular standardised object-oriented language that is portable to most modern operating systems and hardware architectures. This work also establishes that a thread-to-thread delivery option increases shared-memory communication performance between MPI ranks on the same node. This suggests that the thread-as-rank threading model should be explicitly specified in future versions of the MPI Standard and then added to existing MPI libraries for use by thread-safe parallel codes. This work also ascertains that the C# socket object suffers from undesirable characteristics that are critical to communication performance and proposes ways of improving the implementation of this object. 005.7
207	MPI WITHIN A GPU Young, Bobby Dalton 01 January 2009 (has links) GPUs offer high-performance floating-point computation at commodity prices, but their usage is hindered by programming models which expose the user to irregularities in the current shared-memory environments and require learning new interfaces and semantics. This thesis will demonstrate that the message-passing paradigm can be conceptually cleaner than the current data-parallel models for programming GPUs because it can hide the quirks of current GPU shared-memory environments, as well as GPU-specific features, behind a well-established and well-understood interface. This will be shown by demonstrating a proof-of-concept MPI implementation which provides cleaner, simpler code with a reasonable performance cost. This thesis will also demonstrate that, although there is a virtualization constraint imposed by MPI, this constraint is harmless as long as the virtualization was already chosen to be optimal in terms of a strong execution model and nearly-optimal execution time. This will be demonstrated by examining execution times with varying virtualization using a computationally-expensive micro-kernel. message-passing virtualization data-parallel virtualization MPI GPU Electrical and Computer Engineering
208	Extended architectural enhancements for minimizing message delivery latency on cache-less architectures (e.g., Cell BE) Kroeker, Anthony 12 January 2012 (has links) This thesis proposes to reduce the latency of MPI receive operations on cacheless architectures, by removing the delay of copying messages when they are first received. This is achieved by copying the messages directly into buffers in the lowest level of the memory hierarchy (e.g., scratchpad memory). The previously proposed solution introduced an Indirection Cache which would map between the receive variables and the buffered message payload locations. This proved somewhat beneficial, but the lookup penalty of the Indirection Cache limited its effectiveness. Therefore this thesis proposes that a most recently used buffer (i.e., an Indirection Buffer) be placed in front of the Indirection Cache to eliminate this penalty and speed up access. The tests conducted demonstrated that this method was indeed effective and improved over the original method by at least an order of magnitude. Finally, examination of implementation feasibility showed that this could be implemented with a small Cache, and that even with access times 6x slower than initially assumed, the approach with the Indirection Buffer would still be effective. / Graduate computer engineering computer architecture cell processor mpi cache injection cacheless Indirection Cache Indirection Buffer
209	Multithreaded PDE Solvers on Non-Uniform Memory Architectures Nordén, Markus January 2006 (has links) A trend in parallel computer architecture is that systems with a large shared memory are becoming more and more popular. A shared memory system can be either a uniform memory architecture (UMA) or a cache coherent non-uniform memory architecture (cc-NUMA). In the present thesis, the performance of parallel PDE solvers on cc-NUMA computers is studied. In particular, we consider the shared namespace programming model, represented by OpenMP. Since the main memory is physically, or geographically distributed over several multi-processor nodes, the latency for local memory accesses is smaller than for remote accesses. Therefore, the geographical locality of the data becomes important. The focus of the present thesis is to study multithreaded PDE solvers on cc-NUMA systems, in particular their memory access pattern with respect to geographical locality. The questions posed are: (1) How large is the influence on performance of the non-uniformity of the memory system? (2) How should a program be written in order to reduce this influence? (3) Is it possible to introduce optimizations in the computer system for this purpose? The main conclusion is that geographical locality is important for performance on cc-NUMA systems. This is shown experimentally for a broad range of PDE solvers as well as theoretically using a model involving characteristics of computer systems and applications. Geographical locality can be achieved through migration directives that are inserted by the programmer or — possibly in the future — automatically by the compiler. On some systems, it can also be accomplished by means of transparent, hardware initiated migration and replication. However, a necessary condition that must be fulfilled if migration is to be effective is that the memory access pattern must not be "speckled", i.e. as few threads as possible shall make accesses to each memory page. We also conclude that OpenMP is competitive with MPI on cc-NUMA systems if care is taken to get a favourable data distribution. PDE solver high-performance NUMA UMA OpenMP MPI data migration data replication thread scheduling data affinity
210	Controle de granularidade com threads em programas MPI dinâmicos / Controlling granularity of dynamic mpi programs with threads Lima, João Vicente Ferreira January 2009 (has links) Nos últimos anos, a crescente demanda por alto desempenho tem favorecido o surgimento de arquiteturas e algoritmos cada vez mais eficientes. A popularidade das plataformas distribuídas levanta novas questões no desenvolvimento de algoritmos paralelos tais como comunicação, heterogeneidade e dinamismo de recursos. Estas questões podem resultar em aplicações com carga de trabalho conhecida somente em tempo de execução. A irregularidade do algoritmo ou da entrada de dados também pode influenciar na carga de trabalho da aplicação. Uma aplicação paralela pode solucionar estas questões por meio de algoritmos dinâmicos ao utilizar técnicas de programação que definam o trabalho de uma tarefa e possibilitem a utilização de recursos sob demanda. A granularidade, que é a razão entre processamento e comunicação, considera questões práticas de execução e é um fator importante no desempenho de algoritmos dinâmicos. A implementação de um controle de granularidade é complicada e depende do suporte dos ambientes de programação. Porém, os ambientes de programação possuem interfaces extensas e complicadas que dificultam sua utilização em PAD. Este trabalho propõe a implementação de uma biblioteca (libSpawn) que incorpora um controle de granularidade em aplicações MPI dinâmicas. A biblioteca controla a granularidade ao mapear tarefas entre processos ou threads de acordo com três parâmetros: cores da arquitetura, carga e recursos de sistema. Os tempos obtidos com processos e libSpawn demonstram ganhos significativos em benchmarks sintéticos utilizados por outros ambientes de programação. Não obstante, constata-se carências na implementação atual que produzem tempos anômalos, ainda que estes sejam insignificantes em relação aos tempos com processos. / In the last years, the demand for high performance enables the emergence of more efficient computing platforms and algorithms. The increase of distributed computing platforms rises new challenges for parallel algorithm development like communication, heterogeneity, and resource management. These factors can result in applications whose work load is unknown until runtime. An irregular behavior from algorithm or data can also affect the work load. A parallel application can solve these questions through a programming technique which predicts the work load of a task and offers resource on demand. The granularity, which is the ratio of computation to communication, considers more practical issues, and is an important factor in performance of dynamic algorithms. However, this control is difficult to be designed and the support of a programming tool is needed. Yet, the programming tools have extensive and complicated interfaces which difficult your usage in HPC. This work implements a library (libSpawn) which adds a granularity control on MPI dynamic programs. The library controls the granularity by mapping tasks between processes or threads with three parameters: cores of architecture, load and resources of the operating system. The results obtained between processes and libSpawn show significant gains on synthetic benchmarks from other programming tools. Processamento paralelo Mpi Parallel computing High performance computing Dynamic algorithms Granularity

Search results