Global ETD Search

51	Elasticity in IaaS Cloud, Preserving Performance SLAs Dhingra, Mohit January 2014 (has links) (PDF) Infrastructure-as-a-Service(IaaS), one of the service models of cloud computing, provides resources in the form of Virtual Machines(VMs). Many applications hosted on the IaaS cloud have time varying workloads. These kind of applications benefit from the on-demand provision ing characteristic of cloud platforms. Applications with time varying workloads demand time varying resources in IaaS, which requires elastic resource provisioning in IaaS, such that their performance is intact. In current IaaS cloud systems, VMs are static in nature as their configurations do not change once they are instantiated. Therefore, fluctuation in resource demand is handled in two ways: allocating more VMs to the application(horizontal scaling) or migrating the application to another VM with a different configuration (vertical scaling). This forces the customers to characterize their workloads at a coarse grained level which potentially leads to under-utilized VM resources or under performing application. Furthermore, the current IaaS architecture does not provide performance guarantees to applications, because of two major factors: 1)Performance metrics of the application are not used for resource allocation mechanisms by the IaaS, 2) Current resource allocation mechanisms do not consider virtualization overheads, can significantly impact the application’s performance, especially for I/O workloads. In this work, we develop an Elastic Resource Framework for IaaS, which provides flexible resource provisioning mechanism and at the same time preserves performance of applications specified by the Service Level Agreement(SLA). For identification of workloads which needs elastic resource allocation, variability has been defined as a metric and is associated with the definition of elasticity of a resource allocation system. We introduce new components Forecasting Engine based on a Cost Model and Resource manager in Open Nebula IaaS cloud, which compute a n optimal resource requirement for the next scheduling cycle based on prediction. Scheduler takes this as an input and enables fine grained resource allocation by dynamically adjusting the size of the VM. Since the prediction may not always be entirely correct, there might be under-allocation or over-allocation of resources based on forecast errors. The design of the cost model accounts for both over-allocation of resources and SLA violations caused by under-allocation of resources. Also, proper resource allocation requires consideration of the virtualization overhead, which is not captured by current monitoring frameworks. We modify existing monitoring frameworks to monitor virtualization over head and provide fine-grained monitoring information in the Virtual Machine Monitor (VMM) as well as VMs. In our approach, the performance of the application is preserved by 1)binding the application level performance SLA store source allocation, and 2) accounting for virtualization over-head while allocating resources. The proposed framework is implemented using the forecasting strategies like Seasonal Auto Regressive and Moving Average model (Seasonal ARIMA), and Gaussian Process model. However, this framework is generic enough to use any other forecasting strategy as well. It is applied to the real workloads, namely web server and mail server workloads, obtained through Supercomputer Education and Research Centre, Indian Institute of Science. The results show that significant reduction in the resource requirements can be obtained while preserving the performance of application by restricting the SLA violations. We further show that more intelligent scaling decisions can be taken using the monitoring information derived by the modification in monitoring framework. Cloud Computing Virtual Machines Infrastructure-as-a-Service Cloud Architecture IaaS Clouds Gaussian Processs Forecasting Engine IaaS Architecture Compute Clouds ARIMA Model Computer Science
52	Faster upper body pose recognition and estimation using compute unified device architecture Brown, Dane January 2013 (has links) >Magister Scientiae - MSc / The SASL project is in the process of developing a machine translation system that can translate fully-fledged phrases between SASL and English in real-time. To-date, several systems have been developed by the project focusing on facial expression, hand shape, hand motion, hand orientation and hand location recognition and estimation. Achmed developed a highly accurate upper body pose recognition and estimation system. The system is capable of recognizing and estimating the location of the arms from a twodimensional video captured from a monocular view at an accuracy of 88%. The system operates at well below real-time speeds. This research aims to investigate the use of optimizations and parallel processing techniques using the CUDA framework on Achmed’s algorithm to achieve real-time upper body pose recognition and estimation. A detailed analysis of Achmed’s algorithm identified potential improvements to the algorithm. Are- implementation of Achmed’s algorithm on the CUDA framework, coupled with these improvements culminated in an enhanced upper body pose recognition and estimation system that operates in real-time with an increased accuracy. Pose recognition and estimation Graphics processing unit Compute unified device architecture Face detection Skin detection Background subtraction Morphological operations Haar features Support vector machine Blender
53	Compute-in-Memory Primitives for Energy-Efficient Machine Learning Amogh Agrawal (10506350) 26 July 2021 (has links) <div>Machine Learning (ML) workloads, being memory and compute-intensive, consume large amounts of power running on conventional computing systems, restricting their implementations to large-scale data centers. Thus, there is a need for building domain-specific hardware primitives for energy-efficient ML processing at the edge. One such approach is in-memory computing, which eliminates frequent and unnecessary data-transfers between the memory and the compute units, by directly computing the data where it is stored. Most of the chip area is consumed by on-chip SRAMs in both conventional von-Neumann systems (e.g. CPU/GPU) as well as application-specific ICs (e.g. TPU). Thus, we propose various circuit techniques to enable a range of computations such as bitwise Boolean and arithmetic computations, binary convolution operations, non-Boolean dot-product operations, lookup-table based computations, and spiking neural network implementation - all within standard SRAM memory arrays.</div><div><br></div><div>First, we propose X-SRAM, where, by using skewed sense amplifiers, bitwise Boolean operations such as NAND/NOR/XOR/IMP etc. can be enabled within 6T and 8T SRAM arrays. Moreover, exploiting the decoupled read/write ports in 8T SRAMs, we propose read-compute-store scheme where the computed data can directly be written back in the array simultaneously. </div><div><br></div><div>Second, we propose Xcel-RAM, where we show how binary convolutions can be enabled in 10T SRAM arrays for accelerating binary neural networks. We present charge sharing approach for performing XNOR operations followed by a population count (popcount) using both analog and digital techniques, highlighting the accuracy-energy tradeoff. </div><div><br></div><div>Third, we take this concept further and propose CASH-RAM, to accelerate non-Boolean operations, such as dot-products within standard 8T-SRAM arrays by utilizing the parasitic capacitances of bitlines and sourcelines. We analyze the non-idealities that arise due to analog computations and propose a self-compensation technique which reduces the effects of non-idealities, thereby reducing the errors. </div><div><br></div><div>Fourth, we propose ROM-embedded caches, RECache, using standard 8T SRAMs, useful for lookup-table (LUT) based computations. We show that just by adding an extra word-line (WL) or a source-line (SL), the same bit-cell can store a ROM bit, as well as the usual RAM bit, while maintaining the performance and area-efficiency, thereby doubling the memory density. Further we propose SPARE, an in-memory, distributed processing architecture built on RECache, for accelerating spiking neural networks (SNNs), which often require high-order polynomials and transcendental functions for solving complex neuro-synaptic models. </div><div><br></div><div>Finally, we propose IMPULSE, a 10T-SRAM compute-in-memory (CIM) macro, specifically designed for state-of-the-art SNN inference. The inherent dynamics of the neuron membrane potential in SNNs allows processing of sequential learning tasks, avoiding the complexity of recurrent neural networks. The highly-sparse spike-based computations in such spatio-temporal data can be leveraged for energy-efficiency. However, the membrane potential incurs additional memory access bottlenecks in current SNN hardware. IMPULSE triew to tackle the above challenges. It consists of a fused weight (WMEM) and membrane potential (VMEM) memory and inherently exploits sparsity in input spikes. We propose staggered data mapping and re-configurable peripherals for handling different bit-precision requirements of WMEM and VMEM, while supporting multiple neuron functionalities. The proposed macro was fabricated in 65nm CMOS technology. We demonstrate a sentiment classification task from the IMDB dataset of movie reviews and show that the SNN achieves competitive accuracy with only a fraction of trainable parameters and effective operations compared to an LSTM network.</div><div><br></div><div>These circuit explorations to embed computations in standard memory structures shows that on-chip SRAMs can do much more than just store data and can be re-purposed as on-demand accelerators for a variety of applications. </div> Computer Engineering SRAM in-memory computing compute-in-memory Neuromorphic Computing Spiking Neural Network
54	Získávání znalostí na webu - shlukování / Web Mining - Clustering Rychnovský, Martin January 2008 (has links) This work presents the topic of data mining on the web. It is focused on clustering. The aim of this project was to study the field of clustering and to implement clustering through the k-means algorithm. Then, the algorithm was tested on a dataset of text documents and on data extracted from web. This clustering method was implemented by means of Java technologies.
55	Interactive Mesostructures Nykl, Scott L. January 2013 (has links) No description available. Computer Science Interactive Mesostructures Inverse Displacement Mapping Mesostructures Quadric based Mesostructures Interactive Deformation GPU GPGPU CUDA Per-texel Collision Detection Image-based Rendering Surface Details Compute Shaders
56	A 2TnC ferroelectric memory gain cell suitable for compute-in-memory and neuromorphic application Slesazeck, Stefan, Ravsher, Taras, Havel, Viktor, Breyer, Evelyn T., Mulaosmanovic, Halid, Mikolajick, Thomas 20 June 2022 (has links) A 2TnC ferroelectric memory gain cell consisting of two transistors and two or more ferroelectric capacitors (FeCAP) is proposed. While a pre-charge transistor allows to access the single cell in an array, the read transistor amplifies the small read signals from small-scaled FeCAPs that can be operated either in FeRAM mode by sensing the polarization reversal current, or in ferroelectric tunnel junction (FTJ) mode by sensing the polarization dependent leakage current. The simultaneous read or write operation of multiple FeCAPs is used to realize compute-in-memory (CiM) algorithms that enable processing of data being represented by both, non-volatilely internally stored data and externally applied data. The internal gain of the cell mitigates the need for 3D integration of the FeCAPs, thus making the concept very attractive especially for embedded memories. Here we discuss design constraints of the 2TnC cell and present the proof-of-concept for realizing versatile (CiM) approaches by means of electrical characterization results. info:eu-repo/classification/ddc/621.3 ddc:621.3
57	Physical Layer Security vs. Network Layer Secrecy: Who Wins on the Untrusted Two-Way Relay Channel? Richter, Johannes, Franz, Elke, Engelmann, Sabrina, Pfennig, Stefan, Jorswieck, Eduard A. 07 July 2014 (has links) (PDF) We consider the problem of secure communications in a Gaussian two-way relay network where two nodes exchange confidential messages only via an untrusted relay. The relay is assumed to be honest but curious, i.e., an eavesdropper that conforms to the system rules and applies the intended relaying scheme. We analyze the achievable secrecy rates by applying network coding on the physical layer or the network layer and compare the results in terms of complexity, overhead, and efficiency. Further, we discuss the advantages and disadvantages of the respective approaches. Netzwerkkodierung Sicherheit Übertragung Relay-Netzwerk Sonderforschungsbereich 912 Hochadaptive Energieeffiziente Systeme Secure network coding compute-and-forward bidirectional untrusted relay channel Collaborative Research Centre 912 ddc:004 rvk:ST 200 rvk:ST 277
58	Applications of Lattices over Wireless Channels Najafi, Hossein January 2012 (has links) In wireless networks, reliable communication is a challenging issue due to many attenuation factors such as receiver noise, channel fading, interference and asynchronous delays. Lattice coding and decoding provide efficient solutions to many problems in wireless communications and multiuser information theory. The capability in achieving the fundamental limits, together with simple and efficient transmitter and receiver structures, make the lattice strategy a promising approach. This work deals with problems of lattice detection over fading channels and time asynchronism over the lattice-based compute-and-forward protocol. In multiple-input multiple-output (MIMO) systems, the use of lattice reduction significantly improves the performance of approximate detection techniques. In the first part of this thesis, by taking advantage of the temporal correlation of a Rayleigh fading channel, low complexity lattice reduction methods are investigated. We show that updating the reduced lattice basis adaptively with a careful use of previous channel realizations yields a significant saving in complexity with a minimal degradation in performance. Considering high data rate MIMO systems, we then investigate soft-output detection methods. Using the list sphere decoder (LSD) algorithm, an adaptive method is proposed to reduce the complexity of generating the list for evaluating the log-likelihood ratio (LLR) values. In the second part, by applying the lattice coding and decoding schemes over asynchronous networks, we study the impact of asynchronism on the compute-and-forward strategy. While the key idea in compute-and-forward is to decode a linear synchronous combination of transmitted codewords, the distributed relays receive random asynchronous versions of the combinations. Assuming different asynchronous models, we design the receiver structure prior to the decoder of compute-and-forward so that the achievable rates are maximized at any signal-to-noise-ratio (SNR). Finally, we consider symbol-asynchronous X networks with single antenna nodes over time-invariant channels. We exploit the asynchronism among the received signals in order to design the interference alignment scheme. It is shown that the asynchronism provides correlated channel variations which are proved to be sufficient to implement the vector interference alignment over the constant X network. Wireless Lattice Fading MIMO decoding Interference Alignment Compute-and-Forward Soft-output Lattice Decoding X network Degrees of Freedom Lattice Reduction Asynchronous Networks Asynchronous Communications Matched Filter Maximum Likelihood LLL DILLL X channel Adaptive Decoding Closest Point Search Electrical and Computer Engineering
59	Applications of Lattices over Wireless Channels Najafi, Hossein January 2012 (has links) In wireless networks, reliable communication is a challenging issue due to many attenuation factors such as receiver noise, channel fading, interference and asynchronous delays. Lattice coding and decoding provide efficient solutions to many problems in wireless communications and multiuser information theory. The capability in achieving the fundamental limits, together with simple and efficient transmitter and receiver structures, make the lattice strategy a promising approach. This work deals with problems of lattice detection over fading channels and time asynchronism over the lattice-based compute-and-forward protocol. In multiple-input multiple-output (MIMO) systems, the use of lattice reduction significantly improves the performance of approximate detection techniques. In the first part of this thesis, by taking advantage of the temporal correlation of a Rayleigh fading channel, low complexity lattice reduction methods are investigated. We show that updating the reduced lattice basis adaptively with a careful use of previous channel realizations yields a significant saving in complexity with a minimal degradation in performance. Considering high data rate MIMO systems, we then investigate soft-output detection methods. Using the list sphere decoder (LSD) algorithm, an adaptive method is proposed to reduce the complexity of generating the list for evaluating the log-likelihood ratio (LLR) values. In the second part, by applying the lattice coding and decoding schemes over asynchronous networks, we study the impact of asynchronism on the compute-and-forward strategy. While the key idea in compute-and-forward is to decode a linear synchronous combination of transmitted codewords, the distributed relays receive random asynchronous versions of the combinations. Assuming different asynchronous models, we design the receiver structure prior to the decoder of compute-and-forward so that the achievable rates are maximized at any signal-to-noise-ratio (SNR). Finally, we consider symbol-asynchronous X networks with single antenna nodes over time-invariant channels. We exploit the asynchronism among the received signals in order to design the interference alignment scheme. It is shown that the asynchronism provides correlated channel variations which are proved to be sufficient to implement the vector interference alignment over the constant X network. Wireless Lattice Fading MIMO decoding Interference Alignment Compute-and-Forward Soft-output Lattice Decoding X network Degrees of Freedom Lattice Reduction Asynchronous Networks Asynchronous Communications Matched Filter Maximum Likelihood LLL DILLL X channel Adaptive Decoding Closest Point Search Electrical and Computer Engineering
60	Compilation of Graph Algorithms for Hybrid, Cross-Platform and Distributed Architectures Patel, Parita January 2017 (has links) (PDF) 1. Main Contributions made by the supplicant: This thesis proposes an Open Computing Language (OpenCL) framework to address the challenges of implementation of graph algorithms on parallel architectures and large scale graph processing. The proposed framework uses the front-end of the existing Falcon DSL compiler, andso, programmers enjoy conventional, imperative and shared memory programming style. The back-end of the framework generates implementations of graph algorithms in OpenCL to target single device architectures. The generated OpenCL code is portable across various platforms, e.g., CPU and GPU, and also vendors, e.g., NVIDIA, Intel and AMD. The framework automatically generates code for thread management and memory management for the devices. It hides all the lower level programming details from the programmers. A few optimizations are applied to reduce the execution time. The large graph processing challenge is tackled through graph partitioning over multiple devices of a single node and multiple nodes of a distributed cluster. The programmer codes a graph algorithm in Falcon assuming that the graph fits into single machine memory and the framework handles graph partitioning without any intervention by the programmer. The framework analyses the Abstract Syntax Tree (AST) generated by Falcon to find all the necessary information about communication and synchronization. It automatically generates code for message passing to hide the complexity of programming in a distributed environment. The framework also applies a set of optimizations to minimize the communication latency. The thesis reports results of several experiments conducted on widely used graph algorithms: single source shortest path, pagerank and minimum spanning tree to name a few. Experimental evaluations show that the reported results are comparable to the state-of-art non-portable graph DSLs and frameworks on a single node. Experiments in a distributed environment to show the scalability and efficiency of the framework are also described. 2. Summary of the Referees' Written Comments: Extracts from the referees' reports are provided below. A copy of the written replies to the clarifications sought by the external examiner is appended to this report. Referee 1: This thesis extends the Falcon framework with OpenCL for parallel graph processing on multi-device and multi-node architectures. The thesis makes important contributions. Processing large graphs in short time is very important, and making use of multiple nodes and devices is perhaps the only way to achieve this. Towards this, the thesis makes good contributions for easy programming, compiler transformations and efficient runtime systems. One of the commendable aspects of the thesis that it demonstrates with graphs that cannot be accommodated In the memory of a single device. The thesis is generally written well. The related work coverage is very good. The magnitude of thesis excellent for a Masters work. The experimental setup is very comprehensive with good set of graphs, good experimental comparisons with state-of-art works and good platforms. Particularly. the demonstration with a GPU cluster with multiple GPU nodes (Chapter 5) is excellent. The attempt to demonstrate scalability with 2, 4 and 8 nodes is also noteworthy. However, the contributions on optimizations are weak. Most of the optimizations and compiler transformations are straight-forward. There should be summary observations on the results in Chapter 3, especially given that the results are mixed and don't quite clearly convey the clear advantages of their work. The same is the case with multi-device results in chapter 4, where the results are once again mixed. Similarly, the speedups and scalability achieved with multiple nodes are not great. The problem size justification in the multi-node results is not clear. (Referee 1 also indicates a couple of minor changes to the thesis). Referee 2: The thesis uses the OpenCL framework to address the problem of programming graph algorithms on distributed systems. The use of OpenCL ensures that the generated code is platform-agnoistic and vendor-agnoistic. Sufficient experimentation with large scale graphs and reasonable size clusters have been conducted to demonstrate the scalability and portability of the code generated by the framework. The automatically generated code is almost as efficient as manually written code. The thesis is well written and is of high quality. The related work section is well organized and displays a good knowledge of the subject matter under consideration. The author has made important contributions to a good publication as well. 3. An Account of the Open Oral Examination: The oral examination of Ms. Parita Patel took place during 10 AM and 11AM on 27th November 2017, in the Seminar Hall of the Department of Computer Science and Automation. The members of the Oral Examination Board present were, Prof. Sathish Vadhiyar, external examiner and Prof. Y. N. Srikant, research supervisor. The candidate presented the work in an open defense seminar highlighting the problem domain, the methodology used, the investigations carried out by her, and the resulting contributions documented in the thesis before an audience consisting of the examiners, some faculty members, and students. Some of the questions posed by the examiners and the members of the audience during the oral examination are listed below. 1. How much is the overlap between Falcon work and this thesis? Response: We have used the Falcon front end in our work. Further, the existing Falcon compiler was useful to us to test our own implementation of algorithms in Falcon. 2. Why are speedup and scalability not very high with multiple nodes? Response: For the multi-node architecture, we were not able to achieve linear scalability because, with the increase in number of nodes, communication cost increases significantly. Unless the computation cost in the nodes is significant and is much more than the communication cost, this is bound to happen. 3. Do you have plans of making the code available for use by the community? Response: The code includes some part of Falcon implementation (front-end parsing/grammar) also. After discussion with the author of Falcon, the code can be made available to the community. 4. How can a graph that does not fit into a single device fit into a single node in the case of multiple nodes? Response: Single node machine used in the experiments of “multi-device architecture” contains multiple devices while each node used in experiments of “multi-node architecture” contains only a single device. So, the graph which does not fit into single-node-single-device memory can fit into single-node-multi-device after partitioning. 5. Is there a way to permit morph algorithms to be coded in your framework? Response: Currently, our framework does not translate morph algorithms. Supporting morph algorithms will require some kind of runtime system to manage memory on GPU since morph algorithms add and remove the vertices and edges to the graph dynamically. This can be further explored in future work. 6. Is it possible to accommodate FPGA devices in your framework? Response: Yes, we can support FPGA devices (or any other device that is compatible for OpenCL) just by specifying the device type in the command line argument. We did not work with other devices because CPU and GPU are generally used to process graph algorithms. The candidate provided satisfactory answers to all the questions posed and the clarifications sought by the audience and the examiners during the presentation. The candidate's overall performance during the open defense and the oral examination was very satisfactory to the oral examination board. 4. Certificate of Corrections and Changes: All the necessary corrections and changes suggested by the examiners have been made in the thesis and these have been verified by the members of the oral examination board. The thesis has been recommended for acceptance in its revised form. 5. Final Recommendation: In view of the recommendations of the referees and the satisfactory performance of the candidate in the oral examination, the oral examination board recommends that the thesis of Ms. ParitaPatel be accepted for the award of the M.Sc(Engg.) Degree of the Institute. Response to the comments by the external examiner on the M.Sc(Engg.) thesis “Compilation of Graph Algorithms for Hybrid, Cross-Platform, and Distributed Architectures” by Parita Patel 1. Comment: The contributions on optimizations are weak. Response: The novelty of this thesis is to make the Falcon platform agnostic, and additionally process large scale graphs on multi-devices of a single node and multi-node clusters seamlessly. Our framework performs similar to the existing frameworks, but at the same time, it targets several types of architectures which are not possible in the existing works. Advanced optimizations are beyond the scope of this thesis. 2. Comment: The translation of Falcon to OpenCL is simple. While the translation of Falcon to OpenCL was not hard, figuring out the details of the translation for multi-device and multi-node architectures was not simple. For example, design of implementations for collection, set, global variables, concurrency, etc., were non-trivial. These designs have already been explained in the appropriate places in the thesis. Further, such large software introduced its own intricacies during development. 3. Comment: Lines between Falcon work and this work are not clear. Response: Appendix-A shows the falcon implementation of all the algorithms which we used to run the experiments. We compiled these falcon implementations through our framework and subsequently ran the generated code on different types of target architectures and compared the results with other framework's generated code. These falcon programs were written by us. We have also used the front-end of the Falcon compiler and this has already been stated in the thesis (page 16). 4. Comment: There should be a summary of observations in chapter 3. Response: Summary of observations have been added to chapter 3 (pages 35-36), chapter 4 (page 46), and chapter 5 (page 51) of the thesis. 5. Comment: Speedup and scalability achieved with multiple nodes are not great. Response: For the multi-node architecture, we were not able to achieve linear scalability because, with the increase in number of nodes, communication cost increases significantly. Unless the computation cost in the nodes is significant and is much more than the communication cost, this is bound to happen. 6. Comment: It will be good to separate the related work coverage into a separate chapter. Response: The related work is coherent with the flow in chapter 1. It consists of just 4.5 pages and separating it into a separate chapter would make both (rest of) chapter 1 and the new chapter very small. Therefore, we do not recommend it. 7. Comment: The code should be made available for use by the community. Response: The code includes some part of Falcon code (front-end parsing/grammar) also. After discussion with the author of Falcon, the code can be made available to the community. 8. Comment: Page 28: Shouldn’t the else part be inside the kernel? Response: There was some missing text and a few minor changes in Figure 3.14 (page 28) which have been incorporated in the corrected thesis. 9. Comment: Figure 4.1 needs to be explained better. Response: Explanation for Figure 4.1 (pages 38-39) has been added to the thesis. 10. Comment: The problem size justification in the multi-node results is not clear. Response: Single node machine used in the experiments of “multi-device architecture” contains multiple devices while each node used in experiments of “multi-node architecture” contains only a single device. So, the graph which does not fit into single-node-single-device memory can fit into single-node-multi-device after partitioning. Name of the Candidate: Parita Patel (S.R. No. 04-04-00-10-21-14-1-11610) Degree Registered: M.Sc(Engg.) Department: Computer Science & Automation Title of the Thesis: Compilation of Graph Algorithms for Hybrid, Cross-Platform and Graph algorithms are abundantly used in various disciplines. These algorithms perform poorly due to random memory access and negligible spatial locality. In order to improve performance, parallelism exhibited by these algorithms can be exploited by leveraging modern high performance parallel computing resources. Implementing graph algorithms for these parallel architectures requires manual thread management and memory management which becomes tedious for a programmer. Large scale graphs cannot fit into the memory of a single machine. One solution is to partition the graph either on multiple devices of a single node or on multiple nodes of a distributed network. All the available frameworks for such architectures demand unconventional programming which is difficult and error prone. To address these challenges, we propose a framework for compilation of graph algorithms written in an intuitive graph domain-specific language, Falcon. The framework targets shared memory parallel architectures, computational accelerators and distributed architectures (CPU and GPU cluster). First, it analyses the abstract syntax tree (generated by Falcon) and gathers essential information. Subsequently, it generates optimized code in OpenCL for shared-memory parallel architectures and computational accelerators, and OpenCL coupled with MPI code for distributed architectures. Motivation behind generating OpenCL code is its platform-agnostic and vendor-agnostic behavior, i.e., it is portable to all kinds of devices. Our framework makes memory management, thread management, message passing, etc., transparent to the user. None of the available domain-specific languages, frameworks or parallel libraries handle portable implementations of graph algorithms. Experimental evaluations demonstrate that the generated code performs comparably to the state-of-the-art non-portable implementations and hand-tuned implementations. The results also show portability and scalability of our framework. Large Scale Graph Processing Hybrid Architecture High Performance Compute Resources Distributed Architecture Portability Graph Algorithms Bulk Synchronous Parallel (BSP) Model Open Computing Language (OpenCL) Multi-Node Architectures Multi-Device Architectures Computer Science

Search results