111 |
Enhancing an InfiniBand driver by utilizing an efficient malloc/free library supporting multiple page sizesRex, Robert 23 October 2006 (has links) (PDF)
Despite the use of high-speed network interconnects such as InfiniBand, the
communication overhead for parallel applications, especially in the area of
High-Performance Computing (HPC), is still high. Using large page frames,
so-called hugepages in Linux, can speed up the crucial work of registering
communication buffers with the network adapter. To this end, an InfiniBand
driver was modified. Hugepages not only reduce communication costs but can
also improve computation time perceptibly, e.g. through fewer TLB misses. To
avoid the cost of rewriting applications, a preload library was implemented
that is able to use large page frames transparently. This work also shows
benchmark results with these components, with performance improvements of up
to 10%.
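The preload-library idea lends itself to a short sketch. The snippet below is a hypothetical illustration rather than the thesis code: it shows how a malloc-style replacement might back allocations with 2 MiB hugepages via mmap, falling back to normal pages when none are configured (the function names and the page size are assumptions).

```c
/* Hedged sketch of a hugepage-backed allocator for a preload library. */
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)  /* common x86-64 hugepage size */

/* Round a request up to a whole number of huge pages. */
size_t round_to_hugepages(size_t n) {
    return (n + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);
}

/* Allocate from hugepages; fall back to normal pages if none are available.
 * A real allocator would manage sub-allocations inside the mapping. */
void *huge_alloc(size_t n) {
    size_t len = round_to_hugepages(n);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED)  /* no hugepages configured: fall back */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

Compiled into a shared object that overrides malloc, such a library can be activated per application with LD_PRELOAD, which is what makes the technique transparent to unmodified programs.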
|
112 |
Execution of SPE code in an Opteron-Cell/B.E. hybrid systemHeinig, Andreas 03 July 2008 (has links) (PDF)
There is great research interest in integrating the Cell/B.E. processor into an AMD Opteron system. The result is a system that benefits from the advantages of both processors: the high computational power of the Cell/B.E. and the high I/O throughput of the Opteron.
The task of this diploma thesis is to make it possible for Cell-SPU code initially residing on the Opteron to be executed on the Cell under the GNU/Linux operating system. The SPUFS (Synergistic Processing Unit File System), provided by STI (Sony, Toshiba, IBM), does exactly this on the Cell itself. The Cell is a combination of a PowerPC core and Synergistic Processing Elements (SPE). The main work is to analyze SPUFS and migrate it to the Opteron system.
The result of the migration is a project called RSPUFS (Remote Synergistic Processing Unit File System), which provides nearly the same interface as SPUFS on the Cell side. The differences are caused by the TCP/IP link between Opteron and Cell, where no Remote Direct Memory Access (RDMA) is available, so it is not possible to write synchronously to the local store of the SPEs. Synchronization instead occurs implicitly before the Cell-SPU code is executed. But not only the semantics have changed: to access the XDR memory, RSPUFS extends SPUFS with a special XDR interface through which the application can map the XDR into its local address space. The application must take care of synchronization itself with an explicit call of the provided "xdr_sync" routine. Another difference is that RSPUFS does not support the gang principle of SPUFS, which is necessary to set the affinity between the SPEs.
This thesis deals not only with the operating-system part but also with a library called libspe, which provides a wrapper around the SPUFS system calls. Porting this library to the Opteron is essential because most Cell applications use it. Libspe is not merely a wrapper: it also saves the developer a lot of work, such as loading the Cell-SPU code and managing the context and the system calls initiated by the SPE. Thus it had to be ported, too.
The result of the work is that an application can link against the modified libspe on the Opteron and gain direct access to the Synergistic Processor Elements.
|
113 |
McMPI : a managed-code message passing interface library for high performance communication in C#Holmes, Daniel John January 2012 (has links)
This work endeavours to achieve technology transfer between established best-practice in academic high-performance computing and current techniques in commercial high-productivity computing. It shows that a credible high-performance message-passing communication library, with semantics and syntax following the Message-Passing Interface (MPI) Standard, can be built in pure C# (one of the .Net suite of computer languages). Message-passing has been the dominant paradigm in high-performance parallel programming of distributed-memory computer architectures for three decades. The MPI Standard originally distilled architecture-independent and language-agnostic ideas from existing specialised communication libraries and has since been enhanced and extended. Object-oriented languages can increase programmer productivity, for example by allowing complexity to be managed through encapsulation. Both the C# computer language and the .Net common language runtime (CLR) were originally developed by Microsoft Corporation but have since been standardised by the European Computer Manufacturers Association (ECMA) and the International Standards Organisation (ISO), which facilitates portability of source-code and compiled binary programs to a variety of operating systems and hardware. Combining these two open and mature technologies enables mainstream programmers to write tightly-coupled parallel programs in a popular standardised object-oriented language that is portable to most modern operating systems and hardware architectures. This work also establishes that a thread-to-thread delivery option increases shared-memory communication performance between MPI ranks on the same node. This suggests that the thread-as-rank threading model should be explicitly specified in future versions of the MPI Standard and then added to existing MPI libraries for use by thread-safe parallel codes. 
This work also ascertains that the C# socket object suffers from undesirable characteristics that are critical to communication performance and proposes ways of improving the implementation of this object.
|
114 |
Software para arquitecturas basadas en procesadores de múltiples núcleosFrati, Fernando Emmanuel January 2015 (has links)
All processors available on the market (including the processors used in mobile devices) have a typical multicore architecture. Consequently, the shared-memory programming model has displaced the sequential programming model as the model of choice for obtaining maximum performance from these architectures.
In this programming model, the assumptions of execution order between instructions and of atomicity in variable accesses, inherited from the sequential programming model, are no longer valid. The non-determinism implicit in the execution of concurrent programs forces the programmer to use some synchronization mechanism to ensure those properties.
Programmers frequently make mistakes when synchronizing processes, giving rise to new kinds of programming errors such as deadlocks, race conditions, order violations, single-variable atomicity violations and multi-variable atomicity violations. Traditional program-debugging methods are not applicable in the context of concurrent programs, so debugging tools are needed that can help the programmer detect this class of errors.
Of these errors, deadlocks and race conditions have enjoyed the most attention in the scientific community. However, only 29.5% of the errors are deadlocks; of the remaining 70.5%, atomicity violations account for more than 65% of the errors, 96% occur between two threads and 66% involve a single variable. For this reason, single-variable atomicity violations have in recent years been characterized as the most general case of concurrency error and have received great attention from numerous research groups.
In 2005 the first proposals appeared that use dynamic instrumentation methods to detect atomicity violations, notably improving detection capability over earlier approaches. Among these proposals, AVIO (Lu, Tucek, Qin, and Zhou, 2006) stands out for the best performance and detection capability. To detect an atomicity violation, AVIO's method consists of monitoring the memory accesses of the concurrent processes during execution, recording which processes access each variable, in search of unserializable interleavings. Although AVIO is superior to the previous proposals, the overhead it introduces (25× on average) is too high for use in production environments.
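AVIO's core test can be stated compactly. The sketch below is an illustration of the published idea, not the thesis implementation: it classifies the eight interleavings of a previous local access p, an interleaved remote access r and a current local access c, four of which cannot be serialized.

```c
/* Hedged sketch of AVIO-style serializability analysis (Lu et al., 2006). */
#include <stdbool.h>

typedef enum { RD = 0, WR = 1 } acc_t;

/* Returns true when the interleaving p-r-c is equivalent to some serial
 * order, i.e. the remote access r could be moved before p or after c.
 * The four unserializable cases are R-W-R, W-W-R, R-W-W and W-R-W. */
bool serializable(acc_t p, acc_t r, acc_t c) {
    if (r == WR)                    /* a remote write is only hidden...   */
        return p == WR && c == WR;  /* ...between two local writes        */
    return !(p == WR && c == WR);   /* a remote read must not observe an
                                       intermediate value between writes  */
}
```

Monitoring memory accesses and flagging any triple for which `serializable` returns false is, in essence, what the analysis routine does at every instrumented access.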
Many proposals reduce the overhead of the detection algorithms by implementing them directly in hardware through extensions (changes to the processor, cache memory, etc.), achieving excellent results. However, this approach requires processor manufacturers to incorporate those modifications into their designs (which has not happened so far), so such solutions can be expected to take a long time to reach the market, and even longer to replace the platforms currently in production.
Software implementations, on the other hand, rely on program instrumentation. Because they need to add a call to an analysis routine at every instruction that accesses memory, these error-detection methods use instruction-level instrumentation. Unfortunately, this instrumentation granularity is slow, penalizing execution time by more than an order of magnitude.
However, the possibility of an error only exists if at least two threads access shared data simultaneously. This means that if only a small percentage of the operations of the monitored application access shared data, much of the time spent instrumenting every memory access is wasted.
To reduce the overhead of instruction-level instrumentation by restricting it to shared-memory accesses only, the precise moment at which those accesses occur must be detected. The best opportunity to detect this moment is when a change occurs in the cache memory shared between the cores executing the processes.
A very useful tool for this task is hardware counters, a set of special registers available in all current processors. These registers can be programmed to count the number of times an event occurs inside the processor during the execution of an application. Events provide information about different aspects of a program's execution (for example, the number of instructions executed, the number of L1 cache misses or the number of floating-point operations executed).
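As a hedged illustration of programming such counters on Linux, the sketch below configures a perf_event counter for L1 data-cache read misses. The event choice is illustrative only; the coherence-protocol event actually used in the thesis is specific to its test architecture and is not reproduced here.

```c
/* Hedged sketch: opening a hardware counter with perf_event_open(2). */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fill a perf_event_attr that counts L1D read misses. */
void make_cache_miss_attr(struct perf_event_attr *pe) {
    memset(pe, 0, sizeof *pe);
    pe->type = PERF_TYPE_HW_CACHE;
    pe->size = sizeof *pe;
    pe->config = PERF_COUNT_HW_CACHE_L1D |
                 (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                 (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    pe->disabled = 1;        /* start stopped; enable around the region */
    pe->exclude_kernel = 1;  /* count user-space activity only          */
}

/* Open the counter on the calling thread; returns a file descriptor,
 * or -1 if perf events are unavailable (e.g. inside containers). */
long open_counter(struct perf_event_attr *pe) {
    return syscall(SYS_perf_event_open, pe, 0, -1, -1, 0);
}
```

The returned descriptor can then be read with read(2), or switched to sampling mode (by setting a sample period) so that the kernel delivers a signal after a chosen number of events, which is the mode exploited later in the abstract.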
The proposed strategy is to find an event that detects the occurrence of unserializable interleavings and to activate/deactivate AVIO accordingly. Unfortunately, no event exists that is capable of directly indicating the occurrence of the interleaving cases. However, it is possible to represent the cases through memory-access patterns.
A search for events associated with state changes in the cache-coherence protocol revealed that, on the test architecture, there is an event whose description indicates that it occurs with one of the access patterns present in the interleaving cases.
The pattern associated with the event is present in three of the four cases of unserializable interleavings that AVIO must detect. The experiments carried out to validate the event showed that it does indeed occur precisely with that access pattern, and that it can consequently detect the occurrence of unserializable interleavings.
After establishing the viability of the selected event, experiments were conducted with the counters in an operating mode called sampling, which allows the counters to be configured to send signals to a process when events occur. In this mode the programmer specifies how many events must occur before the signal is generated, allowing this feature to be tuned to the application's requirements.
This operating mode was used to decide when to activate the analysis routine of the detection tools, and thus to reduce the instrumentation of the code.
Deactivation, on the other hand, can be somewhat more complex. Since a counter cannot be configured to send a signal upon the non-occurrence of events, we propose configuring a timer that checks at regular intervals whether it is safe to deactivate the analysis routine (for example, because no atomicity violations were detected during the last interval).
The proposed model was used to implement a new version called AVIO-SA, which starts the execution of the monitored applications with the analysis routine deactivated. As soon as an event is detected, the routine is activated and works for a while like the original version of AVIO. Eventually AVIO stops detecting interleavings and the analysis routine is deactivated again.
Since the optimal value of the sampling interval cannot be estimated analytically, experiments were designed to find this value empirically. An interval of 5 ms was found to let AVIO-SA detect approximately the same number of interleavings as AVIO, but with a significantly shorter execution time.
To complete the performance tests, the experiments were extended with HELGRIND, a free race-condition detection tool, and the overhead of each tool was estimated for each application. On average, HELGRIND showed an overhead of 223×, AVIO an overhead of 32× and AVIO-SA of 9×.
Beyond performance, the error-detection capability of AVIO-SA was also evaluated. Three experiments were carried out:
- Detection test with kernels containing known bugs.
- Detection test on real applications (Apache).
- Comparison of the bugs reported by AVIO and AVIO-SA (based on SPLASH-2).
AVIO-SA passed all three tests satisfactorily. The results obtained show that the proposed model does not negatively affect the detection capability of the tool, while using less than 30% of the time required by AVIO. Because AVIO-SA perturbs the execution history of the monitored application less, it is a better option for use in production environments.
|
115 |
Benchmark-driven Approaches to Performance Modeling of Multi-Core ArchitecturesPutigny, Bertrand 27 March 2014 (has links) (PDF)
This manuscript belongs to the field of high-performance computing (HPC), where the growing need for performance pushes processor manufacturers to integrate ever more sophisticated mechanisms. This growing complexity makes the architectures difficult to use. Performance modeling of multi-core architectures feeds information back to users, that is, to programmers, so that they can better exploit the hardware. However, because of the lack of documentation and the complexity of modern processors, such modeling is often difficult. The goal of this manuscript is to use performance measurements of small code fragments to compensate for the lack of information about the hardware. These experiments, called micro-benchmarks, make it possible to understand the performance of modern architectures without depending on the availability of technical documentation. The first chapter presents the hardware architecture of modern processors and, in particular, the characteristics that make performance modeling complex. The second chapter presents an automatic methodology for measuring the performance of arithmetic instructions. The information obtained by this method is the basis for computational models that predict the execution time of arithmetic code fragments. This chapter also shows how such models can be used to optimize energy efficiency, taking the SCC processor as an example. The last part of this chapter motivates the construction of a memory model that takes cache coherence into account to predict data-access time. The third chapter presents the micro-benchmark development framework used to characterize cache-coherent memory hierarchies.
This chapter also provides a comparative study of the memory performance of different architectures and of the impact of the choice of coherence protocol on performance. Finally, the fourth chapter presents a memory model that predicts data-access time for regular OpenMP-style applications. The model is based on the state of the data in the coherence protocol. This state evolves during program execution according to the memory accesses. A cost function is associated with each transition. This function is derived directly from the results of the experiments presented in the third chapter, and makes it possible to predict memory-access time. A proof of concept of the reliability of this model is given, on the one hand on algebra and numerical-analysis applications, and on the other hand by using the model to predict the performance of shared-memory MPI communications.
|
116 |
Presence of potentially pathogenic heterotrophic plate count (HPC) bacteria occurring in a drinking water distribution system in the North-West Province, South Africa / by Leandra VenterVenter, Leandra January 2010 (has links)
There is currently growing concern about the presence of heterotrophic plate count (HPC)
bacteria in drinking water. These HPC may have potential pathogenic features, enabling
them to cause disease. It is especially alarming amongst individuals with a weakened
immune system. South Africa, the country with the highest incidence of HIV-positive
individuals in the world, mainly uses these counts to assess the quality of drinking water in
terms of the number of micro-organisms present in the water. These micro-organisms may
be present in the bulk water or as biofilms adhered to the surfaces of a drinking water
distribution system. The current study investigated the pathogenic potential of HPC bacteria
occurring as biofilms within a drinking water distribution system and determined the
possible presence of these micro-organisms within the bulk water. Biofilm samples were
taken from five sites within a drinking water distribution system. Fifty-six bacterial colonies
were selected based on morphotypes and isolated for the screening of potential pathogenic
features. Haemolysin production was tested for using sheep-blood agar plates. Of the 56,
31 isolates were β-haemolytic. Among the 31 β-haemolytic positive isolates, 87.1% were
positive for lecithinase, 41.9% for proteinase, 19.4% for chondroitinase, 9.7% for DNase
and 6.5% for hyaluronidase. All of the β-haemolytic isolates were resistant to
oxytetracycline 30 µg, trimethoprim 2.5 µg and penicillin G 10 units; 96.8% were resistant to
vancomycin 30 µg and ampicillin 10 µg, 93.5% to kanamycin 30 µg, 74.2% to
chloramphenicol 30 µg, 54.8% to ciprofloxacin 5 µg, 22.6% to streptomycin 300 µg and
16.1% to erythromycin 15 µg. Nineteen isolates producing two or more enzymes were
subjected to Gram staining. The nineteen isolates were all Gram-positive. These isolates
were then identified using the BD BBL CRYSTAL™ Gram-positive (GP) identification (ID)
system. Isolates were identified as Bacillus cereus, Bacillus licheniformis, Bacillus subtilis,
Bacillus megaterium, Bacillus pumilus and Kocuria rosea. 16S rRNA gene sequencing was
performed to confirm these results and to obtain identifications for the bacteria not identified
with the BD BBL CRYSTAL™ GP ID system. Additionally identified bacteria included
Bacillus thuringiensis, Arthrobacter oxydans and Exiguobacterium acetylicum.
Morphological properties of the different species were studied with transmission electron
microscopy (TEM) to confirm the sequencing results. All the isolates displayed rod-shaped cells,
with the exception of Arthrobacter oxydans, which is spherical in the stationary phase of its life cycle. Bulk water samples were taken at two sites in close proximity to the biofilm
sampling sites. The DNA was extracted directly from the water samples and the 16S rRNA
gene region was amplified. Denaturing gradient gel electrophoresis (DGGE) was performed
to confirm the presence of the isolates from the biofilm samples in the bulk water samples.
The presence of Bacillus pumilus and Arthrobacter oxydans could be confirmed with
DGGE. This study demonstrated the presence of potentially pathogenic HPC bacteria within
biofilms in a drinking water distribution system. It also confirmed the probable presence of
two of these biofilm-based bacteria in the bulk water. / Thesis (M.Sc. (Microbiology))--North-West University, Potchefstroom Campus, 2010.
|
118 |
Road to exascale : improving scheduling performances and reducing energy consumption with the help of end-users / Route vers l'exaflops : amélioration des performances d'ordonnancement et réduction de la consommation énergétique avec l'aide des utilisateurs finauxGlesser, David 18 October 2016 (has links)
The field of High Performance Computing (HPC) is characterized by the continuous evolution of computing architectures, the proliferation of computing resources and the increasing complexity of the applications users wish to solve. One of the most important pieces of software in the HPC stack is the Resource and Job Management System (RJMS), which stands between the user workloads and the platform, the applications and the resources. This specialized software provides functions for building, submitting, scheduling and monitoring jobs in a dynamic and complex computing environment.
In order to reach exaflops HPC systems, new constraints and objectives have been introduced. This thesis develops and tests the idea that the users of such systems can help reach the exaflopic scale. Specifically, we introduce new techniques that employ user behavior to improve energy consumption and overall cluster performance.
To test the proposed techniques, we need tools and methodologies that scale up to large HPC clusters. Thus, we designed adequate tools to assess new RJMS scheduling algorithms for such large systems. These tools are able to run on small clusters by emulating or simulating bigger platforms.
After evaluating different techniques to measure the energy consumption of HPC clusters, we propose a new heuristic, based on the popular Easy Backfilling algorithm, to control the power consumption of such huge systems. We also demonstrate, using the same idea, how to control the energy consumption during a time period. The proposed mechanism is able to limit the energy consumption while keeping satisfactory performance. If energy is a limited resource, it has to be shared fairly. We also present a mechanism which shares energy consumption among users. We argue that sharing the energy fairly among users should motivate them to reduce the energy consumption of their applications. Finally, we analyze past and present user behavior using learning algorithms in order to improve the performance of the parallel platforms. This approach not only outperforms state-of-the-art methods, it also shows promising insight on how such methods can improve other aspects of RJMS.
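The power-capped backfilling idea can be sketched as an admission test. The job model, the per-node power estimate and the budget below are invented for illustration; the heuristic in the thesis is considerably richer.

```c
/* Hedged sketch of an EASY-backfilling admission test with a power budget. */
#include <stdbool.h>

typedef struct {
    int    nodes;      /* nodes requested          */
    double watts;      /* estimated power per node */
    long   walltime;   /* requested runtime (s)    */
} job_t;

/* A job may be backfilled if it fits in the idle nodes, finishes before
 * the reservation of the first queued job (the "shadow time"), and keeps
 * total cluster power under the budget. */
bool can_backfill(const job_t *j, int idle_nodes, long shadow_time,
                  long now, double power_in_use, double power_budget) {
    if (j->nodes > idle_nodes)
        return false;                          /* not enough free nodes  */
    if (now + j->walltime > shadow_time)
        return false;                          /* would delay head job   */
    return power_in_use + j->nodes * j->watts <= power_budget;
}
```

Classic EASY backfilling uses only the first two conditions; adding the third is one simple way to express a cluster-wide power cap inside the scheduler's existing decision loop.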
|
119 |
Appréhender l'hétérogénéité à (très) grande échelle / Apprehending heterogeneity at (very) large scaleBleuse, Raphaël 11 October 2017 (has links)
The demand for computation power is steadily increasing, driven by the need to simulate more and more complex phenomena with an increasing amount of consumed and produced data. To meet this demand, High-Performance Computing platforms grow in both size and heterogeneity. Indeed, heterogeneity allows splitting problems for a more efficient resolution of sub-problems with ad hoc hardware or algorithms. This heterogeneity arises in the platforms' architecture and in the variety of processed applications. Consequently, performance becomes more sensitive to the execution context. This thesis studies how to bring context awareness, qualitatively and at a reasonable cost, into allocation and scheduling policies. The study is conducted from two standpoints: within single applications, and at the whole-platform scale from an inter-application perspective. We first study makespan minimization for sequential tasks on hybrid platforms composed of multiple CPUs and GPUs. We integrate context awareness into schedulers through an affinity mechanism that improves their local behavior. This mechanism has been implemented in a parallel run-time, and experiments show that it reduces memory transfers while maintaining a low makespan. We then extend the model to implicitly account for parallelism on the CPUs by treating tasks as moldable on CPU, and propose an efficient algorithm, based on integer linear programming, with a competitive ratio of 3/2 + ε. Second, we devise a new modeling framework in which constraints are a first-class tool. Rather than extending existing models to consider all possible interactions, we reduce the set of feasible schedules by further constraining existing models. We propose a set of reasonable constraints to model application spreading and I/O traffic. We then instantiate this framework for unidimensional topologies and propose a comprehensive case study of makespan minimization under convexity and locality constraints.
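The affinity mechanism summarized above can be illustrated with a minimal greedy-scheduler sketch. Everything below is a hypothetical illustration, not the thesis's actual run-time: the task/resource format, the fixed `TRANSFER_COST`, and the earliest-finish-time heuristic are all assumptions, chosen only to show the idea of biasing placement toward the resource that already holds a task's data.

```python
# Sketch of affinity-biased greedy scheduling on a hybrid CPU/GPU
# platform (illustrative only; not the thesis's algorithm).
# Each task carries its cost per resource kind and the location of its
# input data; placing a task where its data already resides avoids a
# (hypothetical) fixed transfer penalty.

TRANSFER_COST = 2.0  # assumed cost of moving a task's data between resources

def schedule(tasks, resources):
    """Greedy earliest-finish-time placement with an affinity bonus."""
    finish = {r["name"]: 0.0 for r in resources}  # per-resource ready time
    placement = {}
    for task in tasks:
        best = None
        for res in resources:
            run = task["cost"][res["kind"]]
            # Affinity bonus: no transfer if the data is already here.
            move = 0.0 if task["data_on"] == res["name"] else TRANSFER_COST
            end = finish[res["name"]] + move + run
            if best is None or end < best[0]:
                best = (end, res["name"])
        end, name = best
        finish[name] = end
        placement[task["id"]] = name
    return placement, max(finish.values())
```

With two tasks whose data sit on different devices, the affinity term steers each task to the device holding its data, cutting transfers without lengthening the makespan, which is the qualitative behavior the abstract describes.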
|
120 |
Source code optimizations to reduce multi core and many core performance bottlenecksSerpa, Matheus da Silva January 2018 (has links)
Nowadays, several different architectures are available not only to industry but also to final consumers. Traditional multi-core processors, GPUs, accelerators such as the Xeon Phi, or even energy-efficiency-driven processors such as the ARM family present very different architectural characteristics. This wide range of characteristics is a challenge for application developers, who must deal with different instruction sets, memory hierarchies, or even different programming paradigms when targeting these architectures. To optimize an application, it is important to have a deep understanding of how it behaves on each of them. Related work offers a wide variety of solutions: most of it focuses on improving memory performance alone, while other work addresses load balancing, vectorization, and thread and data mapping, but applies them separately, missing optimization opportunities. In this master's thesis, we propose several optimization techniques to improve the performance of a real-world seismic exploration application provided by Petrobras, a multinational corporation in the petroleum industry. Our experiments show that loop interchange is a useful technique for improving the performance of the different cache levels, speeding up the application by up to 5.3× and 3.9× on the Intel Broadwell and Intel Knights Landing architectures, respectively. By changing the code to enable vectorization, performance was increased by up to 1.4× and 6.5×. Load balancing improved performance by up to 1.1× on Knights Landing. Thread and data mapping techniques were also evaluated, with a performance improvement of up to 1.6× and 4.4×. Comparing the best version on each architecture, we improved the performance of Broadwell by 22.7× and of Knights Landing by 56.7× over a version without optimizations; in the end, Broadwell was 1.2× faster than Knights Landing.
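The loop interchange technique named above can be sketched as follows. This is a generic, hypothetical example (not taken from the thesis's seismic code): for a row-major array, keeping the row index in the outer loop walks memory contiguously, while the interchanged order strides across rows. Both orders compute the same result; the payoff is purely in memory locality.

```python
# Sketch of loop interchange (illustrative; not the thesis's seismic kernel).
# On row-major storage, the i-outer/j-inner order touches memory with unit
# stride; the j-outer/i-inner order jumps a whole row per step, which in
# compiled languages such as C translates into many more cache misses.

def sum_row_major(a):
    """Cache-friendly order for row-major data: rows in the outer loop."""
    total = 0.0
    for i in range(len(a)):          # row index outermost
        for j in range(len(a[0])):   # column index innermost -> unit stride
            total += a[i][j]
    return total

def sum_column_major(a):
    """Interchanged order: identical result, strided access on row-major data."""
    total = 0.0
    for j in range(len(a[0])):       # column index outermost
        for i in range(len(a)):      # each step skips ahead one full row
            total += a[i][j]
    return total
```

Since both orders are semantically equivalent, the transformation is always legal for this kind of reduction; its only effect is on the access pattern, which is why the thesis measures its benefit at the level of the cache hierarchy.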
|