Global ETD Search

71	Machine learning based mapping of data and streaming parallelism to multi-cores Wang, Zheng January 2011 (has links) Multi-core processors are now ubiquitous and are widely seen as the most viable means of delivering performance with increasing transistor densities. However, this potential can only be realised if the application programs are suitably parallel. Applications can either be written in parallel from scratch or converted from existing sequential programs. Regardless of how applications are parallelised, the code must be efficiently mapped onto the underlying platform to fully exploit the hardware’s potential. This thesis addresses the problem of finding the best mappings of data and streaming parallelism—two types of parallelism that exist in broad and important domains such as scientific, signal processing and media applications. Despite significant progress having been made over the past few decades, state-of-the-art mapping approaches still largely rely upon hand-crafted, architecture-specific heuristics. Developing a heuristic by hand, however, often requiresmonths of development time. Asmulticore designs become increasingly diverse and complex, manually tuning a heuristic for a wide range of architectures is no longer feasible. What are needed are innovative techniques that can automatically scale with advances in multi-core technologies. In this thesis two distinct areas of computer science, namely parallel compiler design and machine learning, are brought together to develop new compiler-based mapping techniques. Using machine learning, it is possible to automatically build highquality mapping schemes, which adapt to evolving architectures, with little human involvement. First, two techniques are proposed to find the best mapping of data parallelism. The first technique predicts whether parallel execution of a data parallel candidate is profitable on the underlying architecture. On a typical multi-core platform, it achieves almost the same (and sometimes a better) level of performance when compared to the manually parallelised code developed by independent experts. For a profitable candidate, the second technique predicts how many threads should be used to execute the candidate across different program inputs. The second technique achieves, on average, over 96% of the maximum available performance on two different multi-core platforms. Next, a new approach is developed for partitioning stream applications. This approach predicts the ideal partitioning structure for a given stream application. Based on the prediction, a compiler can rapidly search the program space (without executing any code) to generate a good partition. It achieves, on average, a 1.90x speedup over the already tuned partitioning scheme of a state-of-the-art streaming compiler. 005.3
72	LALP: uma linguagem para exploração do paralelismo de loops em computação reconfigurável / LALP: a language for parallelism of loops exploitation in reconfigurable computing Menotti, Ricardo 23 June 2010 (has links) A computação reconfigurável tem se tornado cada vez mais importante em sistemas computacionais embarcados e de alto desempenho. Ela permite níveis de desempenho próximos aos obtidos com circuitos integrados de aplicação específica (ASIC), enquanto ainda mantém flexibilidade de projeto e implementação. No entanto, para programar eficientemente os dispositivos, é necessária experiência em desenvolvimento e domínio de linguagem de descrição de hardware (HDL), tais como VHDL ou Verilog. As técnicas empregadas na compilação em alto nível (por exemplo, a partir de programas em C) ainda possuem muitos pontos em aberto a serem resolvidos antes que se possa obter resultados eficientes. Muitos esforços em se obter um mapeamento direto de algoritmos em hardware se concentram em loops, uma vez que eles representam as regiões computacionalmente mais intensivas de muitos programas. Uma técnica particularmente útil para isto é a de loop pipelining, a qual geralmente é adaptada de técnicas de software pipelining. A aplicação dessas técnicas está fortemente relacionada ao escalonamento das instruções, o que frequentemente impede o uso otimizado dos recursos presentes nos FPGAs modernos. Esta tese descreve uma abordagem alternativa para o mapeamento direto de loops descritos em uma linguagem de alto nível para FPGAs. Diferentemente de outras abordagens, esta técnica não é proveniente das técnicas de software pipelining. Nas arquiteturas obtidas o controle das operações é distribuído, tornando desnecessária uma máquina de estados finitos para controlar a ordem das operações, o que permitiu a obtenção de implementações eficientes. A especificação de um bloco de hardware é feita por meio de uma linguagem de domínio específico (LALP), especialmente concebida para suportar a aplicação das técnicas. Embora a sintaxe da linguagem lembre C, ela contém certas construções que permitem intervenções do programador para garantir ou relaxar dependências de dados, conforme necessário, e assim otimizar o desempenho do hardware gerado / Reconfigurable computing is becoming increasingly important in embedded and high-performance computing systems. It allows performance levels close to the ones obtained with Application-Specific Integrated circuits (ASIC), while still keeping design and implementation flexibility. However, to efficiently program devices, one needs the expertise of hardware developers in order master hardware description languages (HDL) such as VHDL or Verilog. Attempts to furnish a high-level compilation flow (e.g., from C programs) still have to address open issues before broader efficient results can be obtained. Many efforts trying to achieve a direct of algorithms into hardware concentrate on loops since they represent the most computationally intensive regions of many application codes. A particularly useful technique for this purpose is loop pipelining, which is usually adapted from software pipelining techniques. The application of this technique is strongly related to instruction scheduling, whic often prevents an optimized use of the resources present in modern FPGAs. This thesis decribes an alternative approach to direct mapping loops described in high-level labguages onto FPGAs. Different from oyher approaches, this technique does not inherit from software pipelining techniques. The control is distributed over operations, thus a finite state machine is not necessary to control the order of operations, allowing efficient harware implementations. The specification of a hardware block is done by means of LALP, a domain specific language specially designed to help the application of the techniques. While the language syntax resembles C, it contains certain constructs that allow programmer interventions to enforce or relax data dependences as needed, and so optimize the performance of the generated hardware Compiladores Compilers Computação reconfigurável FPGA FPGA Reconfigurable computing
73	Paralelização de programas sisal para sistemas MPI / Parallelization of sisal programs for MPI systems Nakashima, Raul Junji 15 March 1996 (has links) Este trabalho teve como finalidade a implementação de um método para a paralelização parcial de programas, escritos na linguagem funcional, SISAL utilizando as bibliotecas do padrão MPI (Message Passing Interface). Para tal, propusemos a transformação dos programas SISAL através do particionamento do loop paralelo forall, através do método de particionamento slice e a utilização do modelo de implementação do paralelismo SPMD (Single Program Multiple Data) no estilo de programas mestre/escravo. A validação de nossa proposta foi obtida através da realização de testes onde foram comparados os resultados obtidos com os programas originais e os programas com as alterações propostas / This work describes a method for the partial parallelization of SISAL programs into programs with calls to MPI routines. We focused on the parallelization of the forall loop (through slicing of the index range). The generated code is a master/slave SPMD program. The work was validated through the compilation of some simple SISAL programs and comparison of the results with an unmodified version Compiladores Compilers Functional languages Linguagens funcionais MPI MPI Paralelismo Parallelism
74	Software-assisted data prefetching algorithms. January 1995 (has links) by Chi-sum, Ho. / Thesis (M.Phil.)--Chinese University of Hong Kong, 1995. / Includes bibliographical references (leaves 110-113). / Abstract --- p.i / Acknowledgement --- p.iii / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Overview --- p.1 / Chapter 1.2 --- Cache Memories --- p.1 / Chapter 1.3 --- Improving Cache Performance --- p.3 / Chapter 1.4 --- Improving System Performance --- p.4 / Chapter 1.5 --- Organization of the dissertation --- p.6 / Chapter 2 --- Related Work --- p.8 / Chapter 2.1 --- Cache Performance --- p.8 / Chapter 2.2 --- Non-Blocking Cache --- p.9 / Chapter 2.3 --- Cache Prefetching --- p.10 / Chapter 2.3.1 --- Hardware Prefetching --- p.10 / Chapter 2.3.2 --- Software-assisted Prefetching --- p.13 / Chapter 2.3.3 --- Improving Cache Effectiveness --- p.22 / Chapter 2.4 --- Other Techniques to Reduce and Hide Memory Latencies --- p.25 / Chapter 2.4.1 --- Register Preloading --- p.25 / Chapter 2.4.2 --- Write Policies --- p.26 / Chapter 2.4.3 --- Small Specialized Cache --- p.26 / Chapter 2.4.4 --- Program Transformation --- p.27 / Chapter 3 --- Stride CAM Prefetching --- p.30 / Chapter 3.1 --- Introduction --- p.30 / Chapter 3.2 --- Architectural Model --- p.32 / Chapter 3.2.1 --- Compiler Support --- p.33 / Chapter 3.2.2 --- Hardware Support --- p.35 / Chapter 3.2.3 --- Model Details --- p.39 / Chapter 3.3 --- Optimization Issues --- p.39 / Chapter 3.3.1 --- Eliminating Reductant Prefetching --- p.40 / Chapter 3.3.2 --- Code Motion --- p.40 / Chapter 3.3.3 --- Burst Mode --- p.44 / Chapter 3.3.4 --- Stride CAM Overflow --- p.45 / Chapter 3.3.5 --- Effects of Loop Optimizations --- p.46 / Chapter 3.4 --- Practicability --- p.50 / Chapter 3.4.1 --- Evaluation Methodology --- p.51 / Chapter 3.4.2 --- Prefetch Accuracy --- p.54 / Chapter 3.4.3 --- Stride CAM Size --- p.56 / Chapter 3.4.4 --- Software Overhead --- p.60 / Chapter 4 --- Stride Register Prefetching --- p.67 / Chapter 4.1 --- Motivation --- p.67 / Chapter 4.2 --- Architectural Model --- p.67 / Chapter 4.2.1 --- Stride Register --- p.69 / Chapter 4.2.2 --- Compiler Support --- p.70 / Chapter 4.2.3 --- Prefetch Bits --- p.72 / Chapter 4.2.4 --- Operation Details --- p.77 / Chapter 4.3 --- Practicability and Optimizations --- p.78 / Chapter 4.3.1 --- Practicability on NASA7 Benchmark Programs --- p.78 / Chapter 4.3.2 --- Optimization Issues --- p.81 / Chapter 4.4 --- Comparison Between Stride CAM and Stride Register Models --- p.84 / Chapter 5 --- Small Software-Driven Array Cache --- p.87 / Chapter 5.1 --- Introduction --- p.87 / Chapter 5.2 --- Cache Pollution in MXM --- p.88 / Chapter 5.3 --- Architectural Model --- p.89 / Chapter 5.3.1 --- Operation Details --- p.91 / Chapter 5.4 --- Effectiveness of Array Cache --- p.92 / Chapter 6 --- Conclusion --- p.96 / Chapter 6.1 --- Conclusion --- p.96 / Chapter 6.2 --- Future Research: An Extension of the Stride CAM Model --- p.97 / Chapter 6.2.1 --- Background --- p.97 / Chapter 6.2.2 --- Reference Address Series --- p.98 / Chapter 6.2.3 --- Extending the Stride CAM Model --- p.100 / Chapter 6.2.4 --- Prefetch Overhead --- p.109 / Bibliography --- p.110 / Appendix --- p.114 / Chapter A --- Simulation Results - Stride CAM Model --- p.114 / Chapter A.l --- Execution Time --- p.114 / Chapter A.1.1 --- BTRIX --- p.114 / Chapter A.1.2 --- CFFT2D --- p.115 / Chapter A.1.3 --- CHOLSKY --- p.116 / Chapter A.1.4 --- EMIT --- p.117 / Chapter A.1.5 --- GMTRY --- p.118 / Chapter A.1.6 --- MXM --- p.119 / Chapter A.1.7 --- VPENTA --- p.120 / Chapter A.2 --- Memory Delay --- p.122 / Chapter A.2.1 --- BTRIX --- p.122 / Chapter A.2.2 --- CFFT2D --- p.123 / Chapter A.2.3 --- CHOLSKY --- p.124 / Chapter A.2.4 --- EMIT --- p.125 / Chapter A.2.5 --- GMTRY --- p.126 / Chapter A.2.6 --- MXM --- p.127 / Chapter A.2.7 --- VPENTA --- p.128 / Chapter A.3 --- Overhead --- p.129 / Chapter A.3.1 --- BTRIX --- p.129 / Chapter A.3.2 --- CFFT2D --- p.130 / Chapter A.3.3 --- CHOLSKY --- p.131 / Chapter A.3.4 --- EMIT --- p.132 / Chapter A.3.5 --- GMTRY --- p.133 / Chapter A.3.6 --- MXM --- p.134 / Chapter A.3.7 --- VPENTA --- p.135 / Chapter A.4 --- Hit Ratio --- p.136 / Chapter A.4.1 --- BTRIX --- p.136 / Chapter A.4.2 --- CFFT2D --- p.137 / Chapter A.4.3 --- CHOLSKY --- p.137 / Chapter A.4.4 --- EMIT --- p.138 / Chapter A.4.5 --- GMTRY --- p.139 / Chapter A.4.6 --- MXM --- p.139 / Chapter A.4.7 --- VPENTA --- p.140 / Chapter B --- Simulation Results - Array Cache --- p.141 / Chapter C --- NASA7 Benchmark --- p.145 / Chapter C.1 --- BTRIX --- p.145 / Chapter C.2 --- CFFT2D --- p.161 / Chapter C.2.1 --- cfft2dl --- p.161 / Chapter C.2.2 --- cfft2d2 --- p.169 / Chapter C.3 --- CHOLSKY --- p.179 / Chapter C.4 --- EMIT --- p.192 / Chapter C.5 --- GMTRY --- p.205 / Chapter C.6 --- MXM --- p.217 / Chapter C.7 --- VPENTA --- p.220 Cache memory Memory management (Computer science) Compilers (Computer programs)
75	ML4JIT- um arcabouço para pesquisa com aprendizado de máquina em compiladores JIT. / ML4JIT - a framework for research on machine learning in JIT compilers. Mignon, Alexandre dos Santos 27 June 2017 (has links) Determinar o melhor conjunto de otimizações para serem aplicadas a um programa tem sido o foco de pesquisas em otimização de compilação por décadas. Em geral, o conjunto de otimizações é definido manualmente pelos desenvolvedores do compilador e aplicado a todos os programas. Técnicas de aprendizado de máquina supervisionado têm sido usadas para o desenvolvimento de heurísticas de otimização de código. Elas pretendem determinar o melhor conjunto de otimizações com o mínimo de interferência humana. Este trabalho apresenta o ML4JIT, um arcabouço para pesquisa com aprendizado de máquina em compiladores JIT para a linguagem Java. O arcabouço permite que sejam realizadas pesquisas para encontrar uma melhor sintonia das otimizações específica para cada método de um programa. Experimentos foram realizados para a validação do arcabouço com o objetivo de verificar se com seu uso houve uma redução no tempo de compilação dos métodos e também no tempo de execução do programa. / Determining the best set of optimizations to be applied in a program has been the focus of research on compile optimization for decades. In general, the set of optimization is manually defined by compiler developers and apply to all programs. Supervised machine learning techniques have been used for the development of code optimization heuristics. They intend to determine the best set of optimization with minimal human intervention. This work presents the ML4JIT, a framework for research with machine learning in JIT compilers for Java language. The framework allows research to be performed to better tune the optimizations specific to each method of a program. Experiments were performed for the validation of the framework with the objective of verifying if its use had a reduction in the compilation time of the methods and also in the execution time of the program. Aprendizado computacional Code optimization JIT compilers Machine learning Montadores e compiladores
76	Compiler-assisted Adaptive Software Testing Petsios, Theofilos January 2018 (has links) Modern software is becoming increasingly complex and is plagued with vulnerabilities that are constantly exploited by attackers. The vast numbers of bugs found in security-critical systems and the diversity of errors presented in commercial off-the-shelf software require effective, scalable testing frameworks. Unfortunately, the current testing ecosystem is heavily fragmented, with the majority of toolchains targeting limited classes of errors and applications without offering provably strong guarantees. With software codebases continuously becoming more diverse and complex, the large-scale deployment of monolithic, non-adaptive analysis engines is likely to increase the aforementioned fragmentation. Instead, modern software testing requires adaptive, hybrid techniques that target errors selectively. This dissertation argues that adopting context-aware analyses will enable us to set the foundations for retargetable testing frameworks while further increasing the accuracy and extensibility of existing toolchains. To this end, we initially examine how compiler analyses can become context-aware, prioritizing certain errors over others of the same type. As a use case of our proposed approach, we extend a state-of-the-art compiler's integer error detection pipeline to suppress reports of benign errors by up to 89% in real-world workloads, while allowing for reporting of serious errors. Subsequently, we demonstrate how compiler-based instrumentation can be utilized by feedback-driven evolutionary fuzzers to provide multifaceted analyses targeting broader classes of bugs. In this direction, we present differential diversity (δ-diversity), we propose a generic methodology for offering state-aware guidance in feedback-driven frameworks, and we demonstrate how to retrofit state-of-the-art fuzzers to target broader classes of errors. We provide two such prototype implementations: NEZHA, the first differential generic fuzzer capable of handling logic bugs, as well as SlowFuzz, the first generic fuzzer targeting complexity vulnerabilities. We applied both prototypes on production software, and demonstrate their effectiveness. We found that NEZHA discovered hundreds of logic discrepancies across a wide variety of applications (SSL/TLS libraries, parsers, etc.), while SlowFuzz successfully generated inputs triggering slowdowns in complex, real-world software, including zip parsers, regular expression libraries, and hash table implementations. Computer science Compilers (Computer programs) Computer software--Testing
77	ML4JIT- um arcabouço para pesquisa com aprendizado de máquina em compiladores JIT. / ML4JIT - a framework for research on machine learning in JIT compilers. Alexandre dos Santos Mignon 27 June 2017 (has links) Determinar o melhor conjunto de otimizações para serem aplicadas a um programa tem sido o foco de pesquisas em otimização de compilação por décadas. Em geral, o conjunto de otimizações é definido manualmente pelos desenvolvedores do compilador e aplicado a todos os programas. Técnicas de aprendizado de máquina supervisionado têm sido usadas para o desenvolvimento de heurísticas de otimização de código. Elas pretendem determinar o melhor conjunto de otimizações com o mínimo de interferência humana. Este trabalho apresenta o ML4JIT, um arcabouço para pesquisa com aprendizado de máquina em compiladores JIT para a linguagem Java. O arcabouço permite que sejam realizadas pesquisas para encontrar uma melhor sintonia das otimizações específica para cada método de um programa. Experimentos foram realizados para a validação do arcabouço com o objetivo de verificar se com seu uso houve uma redução no tempo de compilação dos métodos e também no tempo de execução do programa. / Determining the best set of optimizations to be applied in a program has been the focus of research on compile optimization for decades. In general, the set of optimization is manually defined by compiler developers and apply to all programs. Supervised machine learning techniques have been used for the development of code optimization heuristics. They intend to determine the best set of optimization with minimal human intervention. This work presents the ML4JIT, a framework for research with machine learning in JIT compilers for Java language. The framework allows research to be performed to better tune the optimizations specific to each method of a program. Experiments were performed for the validation of the framework with the objective of verifying if its use had a reduction in the compilation time of the methods and also in the execution time of the program. Aprendizado computacional Montadores e compiladores Code optimization JIT compilers Machine learning
78	Compiling Evaluable Functions in the Godel Programming Language Shapiro, David 30 January 1996 (has links) We present an extension of the Godel logic programming language code generator which compiles user-defined functions. These functions may be used as arguments in predicate or goal clauses. They are defined in extended Godel as rewrite rules. A translation scheme is introduced to convert function definitions into predicate clauses for compilation. This translation scheme and the compilation of functional arguments both employ leftmost-innermost narrowing. As function declarations are indistinguishable from constructor declarations, a function detection method is implemented. The ultimate goal of this research is the implementation of extended Godel using needed narrowing. The work presented here is an intermediate step in creating a functional-logic language which expands the expressiveness of logic programming and streamlines its execution. Gèodel (Computer program language) Computer Sciences Programming Languages and Compilers
79	A Parallelizing Compiler Based on Partial Evaluation Surati, Rajeev 01 July 1993 (has links) We constructed a parallelizing compiler that utilizes partial evaluation to achieve efficient parallel object code from very high-level data independent source programs. On several important scientific applications, the compiler attains parallel performance equivalent to or better than the best observed results from the manual restructuring of code. This is the first attempt to capitalize on partial evaluation's ability to expose low-level parallelism. New static scheduling techniques are used to utilize the fine-grained parallelism of the computations. The compiler maps the computation graph resulting from partial evaluation onto the Supercomputer Toolkit, an eight VLIW processor parallel computer. VLIW partial evaluation register allocation parallelsscheduling parallelizing compilers
80	Compiler Techniques For Code Size And Power Reduction For Embedded Processors Sarvani, V V N S 06 1900 (has links) (PDF) No description available. Microprocessors Embedded Systems Compilers Embedded Processors Computer Science

Search results