231 |
Models for Parallel Computation in Multi-Core, Heterogeneous, and Ultra Wide-Word Architectures
Salinger, Alejandro, January 2013
Multi-core processors have become the dominant processor architecture, with 2, 4, and 8 cores on a chip widely available and an increasing number of cores predicted for the future. In addition, the decreasing costs and increasing programmability of Graphics Processing Units (GPUs) have made these an accessible source of parallel processing power in general-purpose computing. Among the many research challenges that this scenario has raised are the fundamental problems related to theoretical modeling of computation in these architectures. In this thesis we study several aspects of computation in modern parallel architectures, from modeling of computation in multi-cores and heterogeneous platforms, to multi-core cache management strategies, to the proposal of an architecture that exploits bit-parallelism on thousands of bits.
Observing that in practice multi-cores have a small number of cores, we propose a model of low-degree parallelism for these architectures. We argue that assuming a small number of processors (logarithmic in a problem's input size) simplifies the design of parallel algorithms. We show that in this model a large class of divide-and-conquer and dynamic programming algorithms can be parallelized with simple modifications to sequential programs, while achieving optimal parallel speedups. We further explore low-degree parallelism in computation, providing evidence of fundamental differences in practice and theory between systems with a sublinear and a linear number of processors, and suggesting a sharp theoretical gap between the classes of problems that are efficiently parallelizable in each case.
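As a concrete illustration of the divide-and-conquer case, the sketch below (an editor's example, not the thesis's construction) parallelizes mergesort by spawning new threads only in the top `depth` levels of the recursion, so at most about 2^depth cores are ever busy, matching a processor count that is small (e.g., logarithmic) in the input size:

```cpp
// Hedged sketch: divide-and-conquer mergesort with low-degree parallelism.
// Only the top `depth` recursion levels spawn threads; below that the
// algorithm runs as the ordinary sequential program.
#include <algorithm>
#include <thread>
#include <vector>

void merge_sort(std::vector<int>& a, size_t lo, size_t hi, int depth) {
    if (hi - lo < 2) return;
    size_t mid = lo + (hi - lo) / 2;
    if (depth > 0) {
        std::thread left([&] { merge_sort(a, lo, mid, depth - 1); });
        merge_sort(a, mid, hi, depth - 1);   // right half on this thread
        left.join();
    } else {
        merge_sort(a, lo, mid, 0);           // sequential below the cutoff
        merge_sort(a, mid, hi, 0);
    }
    std::inplace_merge(a.begin() + lo, a.begin() + mid, a.begin() + hi);
}

int main() {
    std::vector<int> v{5, 3, 8, 1, 9, 2, 7, 4};
    merge_sort(v, 0, v.size(), 2);           // depth 2 => up to 4 threads
}
```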
Efficient strategies to manage shared caches play a crucial role in multi-core performance. We propose a model for paging in multi-core shared caches, which extends classical paging to a setting in which several threads share the cache. We show that in this setting traditional cache management policies perform poorly, and that any effective strategy must partition the cache among threads, with a partition that adapts dynamically to the demands of each thread. Inspired by the shared cache setting, we introduce the minimum cache usage problem, an extension to classical sequential paging in which algorithms must account for the amount of cache they use. This cache-aware model seeks algorithms with good performance in terms of faults and the amount of cache used, and has applications in energy-efficient caching and in shared cache scenarios.
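To make the cache-aware setting concrete, here is a minimal sketch (an illustration only; the thesis studies this model analytically, and this is not its algorithm) of a policy that behaves like LRU on faults but also voluntarily releases pages left idle for more than `window` requests, trading a few extra faults for lower average cache usage:

```cpp
// Minimal sketch of a cache-aware paging policy: LRU eviction on capacity
// misses, plus voluntary release of pages idle for more than `window`
// requests, reducing average cache occupancy.
#include <cstdio>
#include <list>
#include <unordered_map>
#include <utility>

struct CacheAwareLRU {
    size_t capacity, window, now = 0, faults = 0;
    std::list<std::pair<int, size_t>> lru;  // front = most recent (page, last use)
    std::unordered_map<int, std::list<std::pair<int, size_t>>::iterator> pos;

    CacheAwareLRU(size_t cap, size_t win) : capacity(cap), window(win) {}

    void evict_back() { pos.erase(lru.back().first); lru.pop_back(); }

    void access(int page) {
        ++now;
        auto it = pos.find(page);
        if (it != pos.end()) lru.erase(it->second);   // hit: refresh position
        else {
            ++faults;
            if (lru.size() == capacity) evict_back(); // capacity miss
        }
        lru.emplace_front(page, now);
        pos[page] = lru.begin();
        // Voluntarily shrink the footprint: drop pages idle for > window.
        while (!lru.empty() && now - lru.back().second > window) evict_back();
    }
};

int main() {
    CacheAwareLRU cache(4, 8);
    for (int p : {1, 2, 3, 1, 2, 3, 4, 5, 1}) cache.access(p);
    std::printf("faults: %zu, pages held: %zu\n", cache.faults, cache.lru.size());
}
```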
The wide availability of GPUs has added to the parallel power of multi-cores; however, most applications underutilize the available resources. We propose a model for hybrid computation in heterogeneous systems with multi-cores and a GPU, and describe strategies for generic parallelization and efficient scheduling of a large class of divide-and-conquer algorithms.
Lastly, we introduce the Ultra-Wide Word architecture and model, an extension of the word-RAM model that allows for constant-time operations on thousands of bits in parallel. We show that a large class of existing algorithms can be implemented in the Ultra-Wide Word model, achieving speedups comparable to those of multi-threaded computations, while avoiding the more difficult aspects of parallel programming.
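The flavor of bit-parallelism involved can be previewed with ordinary 64-bit words: the SWAR ("SIMD within a register") sketch below tests eight packed 8-bit lanes against a key in a constant number of word operations, the style of trick the Ultra-Wide Word model scales up to words of thousands of bits (a generic illustration, not code from the thesis):

```cpp
// Generic SWAR illustration: flag, in constant time, every 8-bit lane of a
// 64-bit word that equals a key.
#include <cstdint>
#include <cstdio>

// Returns 0x80 in each 8-bit lane of x whose value equals k, 0 elsewhere.
uint64_t lanes_equal(uint64_t x, uint8_t k) {
    const uint64_t ones = 0x0101010101010101ULL;  // 0x01 in every lane
    const uint64_t low7 = 0x7F7F7F7F7F7F7F7FULL;  // low 7 bits of every lane
    uint64_t d = x ^ (ones * k);                  // matching lanes become 0x00
    return ~((((d & low7) + low7) | d) | low7);   // exact per-lane zero test
}

int main() {
    uint64_t packed = 0x1122334411223344ULL;      // lanes: 44 33 22 11 ...
    std::printf("%016llx\n",                      // flags the two 0x22 lanes
                static_cast<unsigned long long>(lanes_equal(packed, 0x22)));
}
```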
|
232 |
Adaptive transaction scheduling for transactional memory systems
Yoo, Richard M., 01 April 2008
Transactional memory systems are expected to enable parallel programming at lower programming complexity, while delivering improved performance over traditional lock-based systems. Nonetheless, there are situations where transactional memory systems can actually perform worse. Transactional memory systems outperform locks only when the executing workloads contain sufficient parallelism; when the workload lacks inherent parallelism, launching excessive transactions can degrade performance. These situations will become dominant in future workloads, where large-scale transactions are frequently executed.
In this thesis, we propose a new paradigm called adaptive transaction scheduling to address this issue. Based on the parallelism feedback from applications, our adaptive transaction scheduler dynamically dispatches and controls the number of concurrently executing transactions. In our case study, we show that our low-cost mechanism not only guarantees that hardware transactional memory systems perform no worse than a single global lock, but also significantly improves performance for both hardware and software transactional memory systems.
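A minimal sketch of the feedback idea follows (an illustration; the thesis's scheduler uses its own parallelism-feedback metric and policy): an admission controller that widens the concurrency budget while transactions commit and shrinks it toward one, i.e., toward a single-global-lock schedule, when aborts dominate. Callers wrap each transaction attempt in begin()/end(committed):

```cpp
// Hedged sketch of adaptive transaction admission control; the initial
// budget and the cap of 64 are assumed values.
#include <algorithm>
#include <condition_variable>
#include <mutex>

class TxScheduler {
    std::mutex m;
    std::condition_variable cv;
    int limit = 8;      // current concurrency budget
    int running = 0;
public:
    void begin() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return running < limit; });
        ++running;
    }
    void end(bool committed) {
        std::lock_guard<std::mutex> lk(m);
        --running;
        if (committed) limit = std::min(limit + 1, 64);  // gentle increase
        else           limit = std::max(limit / 2, 1);   // back off on abort
        cv.notify_all();
    }
};
```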
|
233 |
Σχεδιασμός - υλοποίηση ολοκληρωμένου γραφικού περιβάλλοντος gene expression programming και ανάπτυξη καινοτόμων τελεστών / Design and implementation of an integrated graphical environment for gene expression programming and development of innovative operators
Αντωνίου, Μαρία, 25 January 2012
Τo Gene Expression Programming - GEP (Προγραμματισμός Γονιδιακής Έκφρασης - ΠΓΕ) είναι μια μέθοδος αυτόματης παραγωγής προγραμμάτων η οποία ανήκει στη γενική κατηγορία των Εξελικτικών Αλγορίθμων, εκείνων των τεχνικών δηλαδή που εμπνέονται από τις φυσικές διεργασίες της βιολογικής εξέλιξης. Συγκεκριμένα ο ΠΓΕ χρησιμοποιεί πληθυσμούς από άτομα, επιλέγει τα άτομα σύμφωνα με την καταλληλότητά τους (fitness) και εισάγει νέα σημεία (άτομα, πιθανές λύσεις) στον πληθυσμό χρησιμοποιώντας έναν ή περισσότερους γενετικούς τελεστές.
Στόχος αυτής της Μεταπτυχιακής Διπλωματικής Εργασίας ήταν ο σχεδιασμός και η υλοποίηση ενός Ολοκληρωμένου Γραφικού Περιβάλλοντος για τον Προγραμματισμό Γονιδιακής Έκφρασης καθώς και η υλοποίηση ορισμένων καινοτομιών.
Στα πλαίσια της διπλωματικής εργασίας, σχεδιάσθηκε και αναπτύχθηκε ένας καινοτόμος τελεστής για την μέθοδο του ΠΓΕ. Ο συγκεκριμένος τελεστής πραγματοποιεί μια τοπική αναζήτηση στις μεταβλητές που χρησιμοποιούνται στη μοντελοποίηση του εκάστοτε προβλήματος και επιλέγει εκείνες τις μεταβλητές για τις οποίες η απόδοση του αλγορίθμου βελτιστοποιείται. Η απόδοση του καινούργιου τελεστή ελέγχθηκε και πειραματικά. Μια επιπλέον καινοτομία που εφαρμόστηκε είναι η αυξομείωση του αριθμού των μεταλλάξεων. Συγκεκριμένα, επιλέγουμε να μειώνουμε τον αριθμό των μεταλλάξεων καθώς ο πληθυσμός εξελίσσεται, ενώ τον αυξάνουμε όταν έχουμε μικρή διαφορά ανάμεσα στη βέλτιστη και τη μέση απόδοση του πληθυσμού. Ο μεταβλητός αριθμός μεταλλάξεων σε συνδυασμό με την ικανότητα της μεθοδολογίας του ΠΓΕ να αποφεύγει τα τοπικά ακρότατα βελτιώνει σημαντικά την προσαρμοστικότητα του αλγορίθμου. Επιπλέον, για την αντιμετώπιση της αυξημένης υπολογιστικής πολυπλοκότητας που παρουσιάζει η μέθοδος, εισήχθη η έννοια του παραλληλισμού.
Τέλος, η τροποποιημένη μέθοδος του ΠΓΕ εφαρμόστηκε σε πληθώρα προβλημάτων όπως η μοντελοποίηση συμπεριφοράς μιας χρονοσειράς μαγνητοεγκεφαλογραφήματος, η μοντελοποίηση της συμπεριφοράς κόπωσης υλικών, η πρόβλεψη ισοτιμίας δολαρίου – ευρώ, η πρόβλεψη πρωτεϊνικών αλληλεπιδράσεων και η πρόβλεψη του βαθμού υδατοκορεσμού ελαιοκαλλιεργειών. Τα αποτελέσματα που προέκυψαν είναι ιδιαίτερα ενθαρρυντικά. / Gene Expression Programming (GEP) is a method of automatic program generation that belongs to the wider class of Evolutionary Algorithms, techniques inspired by the biological mechanisms of evolution. Specifically, GEP uses populations of individuals, selects individuals according to their fitness, and introduces genetic variation using one or more genetic operators.
The purpose of this Master's Thesis was to design and implement an Integrated Graphical Environment for Gene Expression Programming, as well as to implement certain innovations.
In the context of this thesis an innovative operator was designed and developed for the GEP method. This operator conducts a local search on the variables used in modeling each problem and chooses those variables for which the performance of the algorithm is optimized. The performance of the new operator was experimentally tested. Another innovation implemented was the adaptive variation of the number of mutations. Specifically, we reduce the number of mutations as the population evolves, while we increase it when the performance of the best individual found is very close to the average performance of the population. The variable number of mutations, in combination with the ability of the GEP methodology to avoid local extrema, significantly improves the adaptability of the algorithm. Moreover, to cope with the increased computational complexity of the method, we introduce parallelism.
Finally, the modified GEP method was applied to many problems, such as modeling the behavior of a magnetoencephalogram (MEG) time series, modeling the fatigue behavior of materials, forecasting the Euro/US Dollar exchange rate, predicting protein interactions, and predicting the degree of water saturation of olive crops. The results are very encouraging.
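The mutation schedule described above can be sketched as follows (constants and names are illustrative assumptions, not the thesis's values): the mutation count decays as generations pass, and is boosted when best and average fitness converge, a symptom of a stagnating population:

```cpp
// Hedged sketch of an adaptive mutation schedule for GEP-style evolution.
#include <algorithm>

int mutations_for_generation(int gen, int max_gen, double best_fitness,
                             double avg_fitness, int base_mutations) {
    // Linear decay from base_mutations toward 1 as the run progresses.
    double decayed = base_mutations * (1.0 - double(gen) / max_gen);
    int n = std::max(1, int(decayed));
    // If best and average fitness have nearly converged, mutate more to
    // escape the local extremum (the 5% threshold is an assumed value).
    if (best_fitness > 0 && (best_fitness - avg_fitness) / best_fitness < 0.05)
        n *= 2;
    return n;
}
```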
|
234 |
Proposta e implementação de uma arquitetura reconfigurável híbrida para aplicações baseadas em fluxo de dados / Proposal and implementation of a hybrid reconfigurable architecture for data-flow-based applications
Pereira, Mônica Magalhães, 21 February 2008
The increase in application complexity has demanded hardware that is ever more flexible and able to achieve higher performance. Traditional hardware solutions have not been successful in meeting these applications' constraints. General-purpose processors have inherent flexibility, since they can perform several kinds of tasks; however, they cannot reach high performance when compared to application-specific devices. Application-specific devices, in turn, perform only a few tasks and therefore achieve high performance, although they have little flexibility. Reconfigurable architectures emerged as an alternative to the traditional approaches and have become an area of rising interest over the last decades. The purpose of this new paradigm is to modify the device's behavior according to the application. Thus, it is possible to balance flexibility and performance and to meet the applications' constraints. This work presents the design and implementation of a coarse-grained hybrid reconfigurable architecture for stream-based applications. The architecture, named RoSA, consists of reconfigurable logic attached to a processor. Its goal is to exploit the instruction-level parallelism of data-flow-intensive applications to accelerate their execution on the reconfigurable logic. The instruction-level parallelism extraction is done at compile time; thus, this work also presents an optimization phase for the RoSA architecture to be included in the GCC compiler. For the design of the architecture, this work also presents a methodology based on hardware reuse of datapaths, named RoSE. RoSE aims to view the reconfigurable units through reusability levels, which provides area savings and datapath simplification. The architecture was implemented in a hardware description language (VHDL) and was validated through simulations and prototyping. For performance analysis, several case studies were used, demonstrating a speedup of up to 11x on the execution of some applications. / O aumento na complexidade das aplicações vem exigindo dispositivos cada vez mais flexíveis e capazes de alcançar alto desempenho. As soluções de hardware tradicionais são ineficientes para atender às exigências dessas aplicações. Processadores de propósito geral, embora possuam flexibilidade inerente devido à capacidade de executar diversos tipos de tarefas, não alcançam alto desempenho quando comparados às arquiteturas de aplicação específica. Estas últimas, por serem especializadas em uma pequena quantidade de tarefas, alcançam alto desempenho, porém não possuem flexibilidade. Arquiteturas reconfiguráveis surgiram como uma alternativa às abordagens convencionais e vêm ganhando espaço nas últimas décadas. A proposta desse paradigma é alterar o comportamento do hardware de acordo com a aplicação a ser executada. Dessa forma, é possível equilibrar flexibilidade e desempenho e atender à demanda das aplicações atuais. Esse trabalho propõe o projeto e a implementação de uma arquitetura reconfigurável híbrida de granularidade grossa, voltada a aplicações baseadas em fluxo de dados. A arquitetura, denominada RoSA, consiste de um bloco reconfigurável anexado a um processador. Seu objetivo é explorar o paralelismo no nível de instrução de aplicações com intenso fluxo de dados e, com isso, acelerar a execução dessas aplicações no bloco reconfigurável. A exploração de paralelismo no nível de instrução é feita em tempo de compilação e, para tal, esse trabalho também propõe uma fase de otimização para a arquitetura RoSA a ser incluída no compilador GCC. Para o projeto da arquitetura, esse trabalho também apresenta uma metodologia baseada no reuso de hardware em caminho de dados, denominada RoSE. Sua proposta é visualizar as unidades reconfiguráveis através de níveis de reusabilidade, que permitem a economia de área e a simplificação do projeto do caminho de dados da arquitetura. A arquitetura proposta foi implementada em linguagem de descrição de hardware (VHDL). Sua validação deu-se através de simulações e da prototipação em FPGA. Para análise de desempenho foram utilizados alguns estudos de caso que demonstraram uma aceleração de até 11 vezes na execução de algumas aplicações.
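The kind of compile-time ILP extraction involved can be illustrated with a toy pass (an editor's sketch, not RoSA's actual GCC phase): levelize a straight-line sequence of three-address instructions with ASAP scheduling, so that instructions in the same level have no mutual dependences and could be issued together to a reconfigurable datapath:

```cpp
// Toy ASAP levelization over a straight-line three-address sequence.
#include <string>
#include <vector>

struct Instr { std::string dst, src1, src2; };

std::vector<int> asap_levels(const std::vector<Instr>& code) {
    std::vector<int> level(code.size(), 0);
    for (size_t i = 0; i < code.size(); ++i)
        for (size_t j = 0; j < i; ++j) {
            bool dep =
                code[j].dst == code[i].src1 || code[j].dst == code[i].src2 ||
                code[i].dst == code[j].src1 || code[i].dst == code[j].src2 ||
                code[i].dst == code[j].dst;   // flow, anti and output deps
            if (dep && level[j] + 1 > level[i]) level[i] = level[j] + 1;
        }
    return level;   // instructions sharing a level can run in parallel
}
```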
|
235 |
SoMMA: a software managed memory architecture for multi-issue processors
Jost, Tiago Trevisan, January 2017
Processadores embarcados utilizam eficientemente o paralelismo a nível de instrução para atender as necessidades de desempenho e energia em aplicações atuais. Embora a melhoria de performance seja um dos principais objetivos em processadores em geral, ela pode levar a um impacto negativo no consumo de energia, uma restrição crítica para sistemas atuais. Nesta dissertação, apresentamos o SoMMA, uma arquitetura de memória gerenciada por software para processadores embarcados, capaz de reduzir o consumo de energia e o energy-delay product (EDP), enquanto ainda aumenta a banda de memória. A solução combina o uso de memórias gerenciadas por software com a cache de dados, de modo a reduzir o consumo de energia e o EDP do sistema. SoMMA também melhora a performance do sistema, pois os acessos à memória podem ser realizados em paralelo, sem custo em portas de memória extras na cache de dados. Transformações de código do compilador auxiliam o programador a utilizar a arquitetura proposta. Resultados experimentais mostram que SoMMA é mais eficiente em termos de energia e desempenho tanto a nível de processador quanto a nível do sistema completo. A técnica apresenta speedups de 1.118x e 1.121x, consumindo 11% e 12.8% menos energia quando se comparam processadores que utilizam e que não utilizam SoMMA. Há ainda redução de até 41.5% em EDP do sistema, sempre mantendo a área dos processadores equivalente. Por fim, SoMMA também reduz o número de cache misses quando comparado ao processador baseline. / Embedded processors rely on the efficient use of instruction-level parallelism to meet the performance and energy needs of modern applications. Though improving performance is the primary goal for processors in general, it might lead to a negative impact on energy consumption, a particularly critical constraint for current systems. In this dissertation, we present SoMMA, a software-managed memory architecture for embedded multi-issue processors that can reduce energy consumption and energy-delay product (EDP), while still providing an increase in memory bandwidth. We combine the use of software-managed memories (SMM) with the data cache, and leverage the lower energy access cost of SMMs to provide a processor with reduced energy consumption and EDP. SoMMA also provides better overall performance, as memory accesses can be performed in parallel, with no cost in extra memory ports. Compiler-automated code transformations minimize the programmer's effort to benefit from the proposed architecture. Our experimental results show that SoMMA is more energy- and performance-efficient not only for the processing cores, but also at the full-system level. Comparisons were done using the VEX processor, a reconfigurable VLIW processor. The approach shows average speedups of 1.118x and 1.121x, while consuming up to 11% and 12.8% less energy when comparing two modified processors against their baselines. SoMMA also shows a reduction of up to 41.5% in full-system EDP, while maintaining the same processor area as the baseline processors. Lastly, even with SoMMA halving the data cache size, we still reduce the number of data cache misses in comparison to the baselines.
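The programming style that SoMMA's compiler transformations automate might look roughly like the following (a hedged illustration; every name below is hypothetical, not the dissertation's API): hot data is staged into a small software-managed memory whose accesses cost less energy than data-cache accesses, processed there, then written back.

```cpp
// Hypothetical SMM staging helpers; on real hardware the copies would be
// block transfers into a dedicated scratchpad.
#include <cstddef>
#include <cstring>

static char smm[4096];   // stand-in for the on-chip scratchpad

void* smm_stage_in(const void* src, std::size_t n) {  // hypothetical helper
    std::memcpy(smm, src, n);
    return smm;
}
void smm_stage_out(void* dst, std::size_t n) { std::memcpy(dst, smm, n); }

void scale(float* a, std::size_t n, float k) {
    float* s = static_cast<float*>(smm_stage_in(a, n * sizeof(float)));
    for (std::size_t i = 0; i < n; ++i) s[i] *= k;  // accesses hit the SMM
    smm_stage_out(a, n * sizeof(float));            // write results back
}
```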
|
236 |
Adapting the polytope model for dynamic and speculative parallelization / Adaptation du modèle polyédrique à la parallélisation dynamique et spéculative
Jimborean, Alexandra, 14 September 2012
Dans cette thèse, nous décrivons la conception et l'implémentation d'une plate-forme logicielle de spéculation de threads, ou fils d'exécution, appelée VMAD, pour "Virtual Machine for Advanced Dynamic analysis and transformation", et dont la fonction principale est d'être capable de paralléliser de manière spéculative un nid de boucles séquentiel de différentes façons, en ré-ordonnançant ses itérations. La transformation à appliquer est sélectionnée au cours de l'exécution avec pour objectifs de minimiser le nombre de retours arrières et de maximiser la performance. Nous effectuons des transformations de code en appliquant le modèle polyédrique que nous avons adapté à la parallélisation spéculative au cours de l'exécution. Pour cela, nous construisons au préalable un patron de code qui est "patché" par notre "runtime", ou support d'exécution logiciel, selon des informations de profilage collectées sur des échantillons du temps d'exécution. L'adaptabilité est assurée en considérant des tranches de code de tailles différentes, qui sont exécutées successivement, chacune étant parallélisée différemment, ou exécutée en séquentiel, selon le comportement des accès à la mémoire observé. Nous montrons, sur plusieurs programmes que notre plate-forme offre de bonnes performances, pour des codes qui n'auraient pas pu être traités efficacement par les systèmes spéculatifs de threads proposés précédemment. / In this thesis, we present a Thread-Level Speculation (TLS) framework whose main feature is to speculatively parallelize a sequential loop nest in various ways, to maximize performance. We perform code transformations by applying the polyhedral model that we adapted for speculative and runtime code parallelization. For this purpose, we designed a parallel code pattern which is patched by our runtime system according to the profiling information collected on some execution samples. We show on several benchmarks that our framework yields good performance on codes which could not be handled efficiently by previously proposed TLS systems.
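The execute/verify/rollback cycle of such a TLS runtime can be sketched in a much-simplified form (an illustration; VMAD actually patches polyhedrally transformed code skeletons at run time, and OpenMP serves here only as a stand-in for its runtime). A chunk of iterations runs in parallel under a speculation predicate; if the predicate fails, the chunk is rolled back and re-executed sequentially:

```cpp
// Hedged TLS sketch: speculate that idx[i] == i (no cross-iteration
// dependence), verify while executing, roll back on misspeculation.
#include <cstddef>
#include <vector>

bool run_chunk(std::vector<double>& a, std::size_t lo, std::size_t hi,
               const std::vector<std::size_t>& idx) {
    std::vector<double> backup(a);   // snapshot (real systems log writes)
    bool ok = true;
    #pragma omp parallel for reduction(&& : ok)
    for (std::size_t i = lo; i < hi; ++i) {
        if (idx[i] != i) { ok = false; continue; }  // predicate violated
        a[idx[i]] += 1.0;
    }
    if (!ok) {                       // misspeculation: rollback, run serially
        a.swap(backup);
        for (std::size_t i = lo; i < hi; ++i) a[idx[i]] += 1.0;
    }
    return ok;                       // caller can adapt the next chunk size
}
```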
|
237 |
Paralelização do algoritmo DIANA com OpenMP e MPI / Parallelization of the DIANA algorithm with OpenMP and MPI
Ribeiro, Hethini do Nascimento, 31 August 2018
Previous issue date: 2018-08-31 / Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) / No início desta década havia cerca de 5 bilhões de telefones em uso gerando dados. Essa produção global aumentou aproximadamente 40% ao ano no início da década passada. Esses grandes conjuntos de dados que podem ser capturados, comunicados, agregados, armazenados e analisados, também chamados de Big Data, estão colocando desafios inevitáveis em muitas áreas e, em particular, no campo Machine Learning. Algoritmos de Machine Learning são capazes de extrair informações úteis desses grandes repositórios de dados e por este motivo está se tornando cada vez mais importante o seu estudo. Os programas aptos a realizarem essa tarefa podem ser chamados de algoritmos de classificação e clusterização. Essas aplicações são dispendiosas computacionalmente. Para citar alguns exemplos desse custo, o algoritmo Quality Threshold Clustering tem, no pior caso, complexidade O(�������������5). Os algoritmos hierárquicos AGNES e DIANA, por sua vez, possuem O(n²) e O(2n) respectivamente. Sendo assim, existe um grande desafio, que consiste em processar grandes quantidades de dados em um período de tempo realista, encorajando o desenvolvimento de algoritmos paralelos que se adequam ao volume de dados. O objetivo deste trabalho é apresentar a paralelização do algoritmo de hierárquico divisivo DIANA. O desenvolvimento do algoritmo foi realizado em MPI e OpenMP, chegando a ser três vezes mais rápido que a versão monoprocessada, evidenciando que embora em ambientes de memória distribuídas necessite de sincronização e troca de mensagens, para um certo grau de paralelismo é vantajosa a aplicação desse tipo de otimização para esse algoritmo. / Earlier in this decade there were about 5 billion phones in use generating data. This global production increased approximately 40% per year at the beginning of the last decade. These large datasets that can be captured, communicated, aggregated, stored and analyzed, also called Big Data, are posing inevitable challenges in many areas, and in particular in the Machine Learning field. Machine Learning algorithms are able to extract useful information from these large data repositories and for this reason their study is becoming increasingly important. The programs that can perform this task can be called classification and clustering algorithms. These applications are computationally expensive. To cite some examples of this cost, the Quality Threshold Clustering algorithm has, in the worst case, complexity O (n5). The hierarchical algorithms AGNES and DIANA, in turn, have O (n²) and O (2n) respectively. Thus, there is a great challenge, which is to process large amounts of data in a realistic period of time, encouraging the development of parallel algorithms that fit the volume of data. The objective of this work is to present the parallelization of the DIANA divisive hierarchical algorithm. The development of the algorithm was performed in MPI and OpenMP, reaching three times faster than the monoprocessed version, evidencing that although in distributed memory environments need synchronization and exchange of messages, for a certain degree of parallelism it is advantageous to apply this type of optimization for this algorithm. / 1757857
|
238 |
Exploração de paralelismo OU em uma linguagem em lógica com restrições / OR parallelism exploitation in a constraint logic language
Vargas, Patricia Kayser, January 1998
Este trabalho é dedicado ao estudo da exploração de paralelismo OU na programação em lógica com restrições em ambientes distribuídos. A programação em lógica, cuja linguagem mais significativa é Prolog, tem como premissa a utilização da lógica de predicados como linguagem computacional. A programação em lógica com restrições (CLP) é uma extensão da programação em lógica, onde busca-se a eficiência e a possibilidade de executar novas classes de problemas. Variáveis em CLP podem pertencer a domínios específicos como, por exemplo, reais ou booleanos. O principal conceito introduzido é a restrição. Restrição é uma equação que representa uma certa informação sobre uma variável e a sua relação com outras variáveis. O uso de restrições foi proposto para diminuir o espaço de busca na execução dos programas. Apesar de mais eficientes que a programação em lógica clássica, para algumas aplicações reais o desempenho das linguagens CLP ainda é insatisfatório. Por isso, é necessário buscar alternativas novas como a execução em paralelo. A exploração de paralelismo implícito em programas em lógica já demonstrou resultados promissores. Vários modelos foram propostos e implementados utilizando as duas principais fontes de paralelismo (E e OU) de forma isolada ou combinada. O objetivo principal desse trabalho é apresentar o modelo pclp(FD) de exploração de paralelismo OU multi-sequencial para um ambiente com memória distribuída. O modelo pclp(FD) caracteriza-se pela existência de vários trabalhadores, cada um deles possuindo uma máquina abstrata completa. O escalonamento de tarefas é realizado por uma política dinâmica e distribuída. Uma tarefa em pclp(FD) equivale a um ponto de escolha e a um contexto de execução. O contexto de execução é formado por porções da pilha do exportador. Para que o importador tenha acesso ao contexto de execução utiliza-se a cópia incremental, que é uma das várias técnicas possíveis. Cada trabalhador possui a sua própria cópia privada das pilhas de execução. A cópia caracteriza-se pelo envio das pilhas de execução do exportador para uma área privada do importador. A cópia incremental é uma técnica mais otimizada que verifica a existência de partes comuns entre os trabalhadores, copiando apenas as partes novas. O algoritmo de cópia incremental proposto no modelo é feito sem nenhuma centralização de informação do estado das pilhas. O projeto e a implementação de um protótipo para esse modelo, utilizando a linguagem clp(FD), que implementa CLP sobre domínios finitos, permitirá uma análise das vantagens e desvantagens do modelo proposto. Os resultados obtidos com a análise servirão de base para trabalhos futuros, visando aprimorar a implementação e o modelo. / This work is dedicated to the study of the exploitation of OR parallelism in Constraint Logic Programming for distributed environments. Logic Programming, whose most significant language is Prolog, has as its premise the use of predicate logic as a computational language. Constraint Logic Programming (CLP) is an extension of logic programming that seeks efficiency and the possibility of executing new classes of problems. A variable in CLP can belong to specific domains such as, for example, reals or booleans. The main concept introduced is the constraint. A constraint is an equation that represents certain information about a variable and its relation with other variables. The use of constraints was proposed to reduce the search space during program execution. Although more efficient than classic logic programming, for some real applications the performance of CLP languages is still unsatisfactory. So, it is necessary to look for alternatives such as parallel execution. The exploitation of implicit parallelism in logic programs has already demonstrated promising results. Several models have been proposed and implemented using the two main sources of parallelism (AND and OR) in isolation or combined. The main objective of this work is to present the pclp(FD) model of multi-sequential OR parallelism exploitation for a distributed-memory environment. The pclp(FD) model is characterized by the existence of several workers, each one of them possessing a complete abstract machine. Task scheduling is performed by a dynamic and distributed policy. A task in pclp(FD) is equivalent to a choice point and an execution context. The execution context is formed by portions of the stack of the exporter. So that the importer has access to the execution context, incremental copying is used, which is one of several possible techniques. Each worker possesses its own private copy of the execution stacks. Copying consists of sending the execution stacks of the exporter to a private area of the importer. Incremental copying is an optimized technique that verifies the existence of common parts between workers, copying only the new ones. The incremental copy algorithm proposed in the model executes without any centralized information about the state of the stacks. The design and implementation of a prototype for this model, using the language clp(FD), which implements CLP over finite domains, will allow an analysis of the advantages and disadvantages of the proposed model. The results obtained with the analysis will serve as a basis for future work aiming to improve the implementation and the model.
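The core of incremental copying can be sketched in a few lines (an illustration only; the model's actual algorithm works over Prolog abstract-machine stacks and choice points): the importer already shares a prefix of the exporter's stacks up to the last common choice point, so only the suffix beyond that point is transferred.

```cpp
// Simplified incremental stack copy: send only the part beyond the bytes
// the two workers already have in common.
#include <cstddef>
#include <cstring>
#include <vector>

struct Stack { std::vector<char> data; };

void incremental_copy(const Stack& exporter, Stack& importer, std::size_t common) {
    // Assumes the first `common` bytes are already identical on both workers.
    importer.data.resize(exporter.data.size());
    std::memcpy(importer.data.data() + common,
                exporter.data.data() + common,
                exporter.data.size() - common);   // copy only the new part
}
```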
|
239 |
Granlog: um modelo para análise automática de granulosidade na programação em lógica / Granlog: a model for automatic granularity analysis in logic programming
Barbosa, Jorge Luis Victoria, January 1996
A exploração do paralelismo na programação em lógica é considerada uma alternativa para simplificação da programação de máquinas paralelas e para aumento do desempenho de programas em lógica. Desta forma, a integração da programação em lógica e sistemas paralelos tornou-se nos últimos anos um centro de atenções da comunidade científica. Dentre os problemas que devem ser solucionados para exploração adequada do paralelismo, encontra-se a análise de granulosidade. A análise de granulosidade determina o tamanho dos grãos, ou seja, a complexidade dos módulos que devem ser executados seqüencialmente num único processador. Basicamente, esta análise consiste de uma refinada identificação dos grãos, visando a máxima eficiência na exploração do paralelismo. Neste sentido, devem ser realizadas considerações sobre dependências, complexidade dos grãos e custos envolvidos na paralelização. Recentemente, a análise de granulosidade na programação em lógica tem recebido atenção especial por parte dos pesquisadores. Os grãos podem ser identificados pelo programador através de primitivas de programação ou podem ser detectados automaticamente pelo sistema paralelo. Na programação em lógica, a exploração automática do paralelismo é estimulada devido ao paralelismo implícito existente na avaliação das expressões lógicas. Além disso, a programação em lógica permite uma clara distinção entre a semântica e o controle da linguagem, proporcionando uma abordagem distinta entre a descrição do problema e o caminho para obtenção das soluções. A detecção automática do paralelismo permite o aproveitamento de programas já existentes, além de liberar o programador do encargo de paralelizar o problema. Este trabalho dedica-se ao estudo da análise automática de granulosidade na programação em lógica. O texto propõe um modelo para geração de informações de granulosidade, denominado GRANLOG (GRanularity ANalyzer for LOGic Programming). O GRANLOG realiza uma análise estática de um programa em lógica. Dessa análise resulta o programa granulado, ou seja, o programa original acrescido da anotação de granulosidade. Esta anotação contém diversas informações que contribuem de forma significativa com a exploração adequada do paralelismo na programação em lógica. Durante o desenvolvimento do GRANLOG foram exploradas diversas áreas de pesquisa da programação em lógica, dentre as quais destacam-se: análise de modos, análise de tipos, análise de medidas para mensuração do tamanho de termos, interpretação abstrata, análise de dependências e análise de complexidade. A integração destes tópicos torna o GRANLOG uma rica fonte de pesquisa. Além disso, a organização modular da proposta permite o aprimoramento independente de suas partes, tornando a estrutura do modelo uma base para o desenvolvimento de novos trabalhos. Além do modelo, o texto descreve a implementação de um protótipo e propõe duas aplicações para as informações de granulosidade, ou seja, auxílio a decisões de escalonamento e simulação da execução de programas. O texto apresenta ainda uma proposta para integração do GRANLOG a um modelo para execução paralela de programas em lógica, denominado OPERA. O OPERA dedica-se à exploração do paralelismo na programação em lógica e possui atualmente um protótipo para execução paralela de programas em lógica em redes de computadores. Os bons resultados obtidos com a integração OPERA-GRANLOG demonstram a relevância das informações geradas pelo modelo proposto neste trabalho. Encontra-se ainda neste texto uma proposta para inclusão do GRANLOG numa interface gráfica, denominada XOPERA. Esta interface permite a execução do protótipo OPERA e, a partir deste trabalho, gerencia também o protótipo GRANLOG. A inclusão da gerência do GRANLOG na interface XOPERA contribui de forma substancial para a integração OPERA-GRANLOG. / The exploitation of parallelism in logic programming is considered an alternative for simplifying the task of programming parallel machines. Also, it provides a way to increase the performance of logic programs. Because of this, integrating logic programming with parallel systems has been a topic of much interest in the scientific community in the last years. Among the problems that must be solved for the adequate exploitation of parallelism, there is the granularity analysis. Granularity analysis determines the size of the grains, that is, the complexity of the modules that must be sequentially executed on a single processor. Basically, this analysis consists of a refined identification of the grains, aiming at maximum efficiency in the parallelism exploitation. In this sense, considerations must be made about dependencies, grain complexity and the costs involved in the parallelizing process. Recently, many researchers have given special attention to the granularity analysis of logic programming. The grains may be identified by the programmer via programming primitives, or they may be automatically detected by the parallel system. In logic programming, the automatic exploitation of parallelism is stimulated because of the implicit parallelism that exists in the evaluation of logic expressions. Besides, logic programming allows a clear distinction between the semantics and the control of the language, providing a distinct approach between the problem description and the way to obtain the results. The automatic detection of parallelism permits the utilization of already written programs, also freeing the programmer from parallelizing the program by hand. This work is dedicated to the study of automatic granularity analysis in logic programming. The text proposes a model for generating granularity information, called GRANLOG (GRanularity ANalyzer for LOGic Programming). GRANLOG performs a static analysis of a logic program. From this analysis results the granulated program, that is, the original program augmented with the granularity annotation. This annotation contains several pieces of information that contribute in a significant way to the adequate exploitation of parallelism in logic programming. During the development of GRANLOG, several research areas of logic programming have been explored, namely mode analysis, type analysis, measure analysis for measuring the size of terms, abstract interpretation, dependency analysis and complexity analysis. The integration of these topics makes GRANLOG a rich source of research. Besides, the modular organization proposed permits the independent improvement of its parts, making the model's structure a basis for the development of new works. Besides the model, the text describes the implementation of a prototype and proposes two applications for the granularity information, namely, help in scheduling decisions and program execution simulation. It also presents a proposal for integrating GRANLOG into a model for parallel execution of logic programs, called OPERA. OPERA is dedicated to the exploitation of parallelism in logic programming and, at the present time, has a prototype for parallel execution of logic programs on computer networks. The good results obtained by integrating OPERA and GRANLOG show the importance of the information generated by the model proposed in this work. There is also in this work a proposal for including GRANLOG in a graphical interface, called XOPERA. This interface allows the execution of the OPERA prototype and, from now on, also manages the GRANLOG prototype. The inclusion of GRANLOG management in the XOPERA interface substantially contributes to the OPERA-GRANLOG integration.
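One of the two proposed applications, scheduling support, can be reduced to a toy decision rule (an editor's sketch; GRANLOG's annotations are far richer than a single cost number): a goal is exported to another worker only when its estimated grain size amortizes the cost of shipping the task.

```cpp
// Toy use of a granularity annotation in a scheduling decision.
struct GoalInfo { double estimated_cost; };  // from the granularity annotation

bool worth_exporting(const GoalInfo& g, double task_send_overhead) {
    const double k = 2.0;            // safety factor (assumed value)
    return g.estimated_cost > k * task_send_overhead;
}
```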
|
240 |
Communication Reducing Approaches and Shared-Memory Optimizations for the Hierarchical Fast Multipole Method on Distributed and Many-core Systems
Abduljabbar, Mustafa, 06 December 2018
We present algorithms and implementations that overcome obstacles in the migration of the Fast Multipole Method (FMM), one of the most important algorithms in computational science and engineering, to exascale computing. Emerging architectural approaches to exascale computing are all characterized by data movement rates that are slow relative to the demand of aggregate floating-point capability, resulting in performance that is bandwidth limited. Practical parallel applications of FMM are impeded in their scaling by the irregularity of domains and the dominance of collective tree communication, which is known not to scale well. We introduce novel ideas that improve the partitioning of the N-body problem with boundary distribution through a sampling-based mechanism that hybridizes two well-known partitioning techniques, Hashed Octree (HOT) and Orthogonal Recursive Bisection (ORB). To reduce communication cost, we employ two methodologies. First, we directly utilize features available in parallel runtime systems to enable asynchronous computing and overlap it with communication. Second, we present Hierarchical Sparse Data Exchange (HSDX), a new all-to-all algorithm that inherently relieves communication by relaying sparse data in a few steps of neighbor exchanges. HSDX exhibits superior scalability and improves relative performance compared to the default MPI_Alltoall and other relevant implementations from the literature. We test this algorithm alongside others on a Cray XC40 tightly coupled with the Aries network and on the Intel Many Integrated Core (MIC) architecture, represented by Intel Knights Corner (KNC) and Intel Knights Landing (KNL), as modern shared-memory CPU environments. Tests include comparisons of thoroughly tuned handwritten vectorization versus auto-vectorization of the FMM Particle-to-Particle (P2P) and Multipole-to-Local (M2L) kernels. The scalability of task-based parallelism is assessed with FMM's tree traversal kernel using different threading libraries. The MIC tests show large performance gains after adopting the prescribed techniques, which are inevitable in a world that is moving towards many-core parallelism.
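The relaying principle behind such a hierarchical exchange can be sketched as follows (a hedged illustration, not the HSDX implementation): instead of one all-to-all over all P ranks, data moves in two steps over a sqrt(P) x sqrt(P) process grid, first along the row and then along the column, so each rank exchanges with only 2*sqrt(P) neighbors. The block permutation needed between the two steps is omitted for brevity, and row/column communicators are assumed to come from, e.g., MPI_Cart_sub.

```cpp
// Two-step grid exchange sketch; `send` holds r*block ints grouped by
// destination, where the grid is r x r.
#include <mpi.h>
#include <vector>

void two_step_exchange(const std::vector<int>& send, std::vector<int>& recv,
                       MPI_Comm row, MPI_Comm col) {
    int r;
    MPI_Comm_size(row, &r);                       // ranks per row/column
    std::vector<int> staged(send.size());
    recv.resize(send.size());
    int block = static_cast<int>(send.size()) / r;
    // Step 1: exchange along the row (data grouped by destination column).
    MPI_Alltoall(send.data(), block, MPI_INT, staged.data(), block, MPI_INT, row);
    // Step 2: exchange along the column to reach the destination row.
    MPI_Alltoall(staged.data(), block, MPI_INT, recv.data(), block, MPI_INT, col);
}
```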
|