151 |
Analysing Memory Performance when computing DFTs using FFTW / Analys av minneshantering vid beräkning av DFTs med FFTW
Heiskanen, Andreas; Johansson, Erik. January 2018
Discrete Fourier Transforms (DFTs) are used in a wide variety of scientific areas, and there is an ever-increasing demand for fast and effective ways of computing DFTs on large data sets. The FFTW library is one of the most commonly used libraries for computing DFTs. It adapts to the system architecture and tries to find the most effective way of solving the input problem. Previous studies have found the FFTW library to outperform other DFT libraries. However, few have specifically examined cache memory performance, which is a key factor for overall performance. In this study, we examined cache memory utilization when computing 1-D complex DFTs using the FFTW library. Testing was done using bench FFT, Linux Perf and testing scripts. The results show that the cache miss ratio increases with problem size as long as the input is smaller than the theoretical input size that matches the cache capacity. This is confirmed by the results for the L2 prefetcher miss ratio. Once the input exceeds the cache capacity, however, the cache miss ratio stabilizes. In conclusion, it is possible to use bench FFT and Linux Perf to measure cache memory utilization. The analysis also shows that cache memory performance is good when computing 1-D complex DFTs using the FFTW library, since the miss ratios stabilize at low values. We nevertheless suggest further examination of the memory behaviour of DFT computations using FFTW, with larger input sizes and a more in-depth testing method.
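The "theoretical input size matching the cache capacity" mentioned in the abstract can be estimated with simple arithmetic. The sketch below is only an illustration: it assumes double-precision complex samples (16 bytes each), and the cache sizes and function names are hypothetical rather than taken from the thesis.

```python
def max_points_fitting_cache(cache_bytes, bytes_per_point=16, buffers=1):
    """Largest 1-D complex DFT length whose data fits in a cache of cache_bytes.

    bytes_per_point=16 assumes double-precision complex values (two 8-byte floats).
    buffers=1 models an in-place transform; buffers=2 models separate
    input and output arrays.
    """
    return cache_bytes // (bytes_per_point * buffers)


def miss_ratio(cache_misses, cache_references):
    """Miss ratio computed from two raw event counts (0.0 if nothing was counted)."""
    return cache_misses / cache_references if cache_references else 0.0


# Hypothetical cache sizes in bytes; real values depend on the machine under test.
CACHE_SIZES = {"L1d": 32 * 1024, "L2": 256 * 1024, "L3": 8 * 1024 * 1024}

if __name__ == "__main__":
    for level, size in CACHE_SIZES.items():
        in_place = max_points_fitting_cache(size)
        out_of_place = max_points_fitting_cache(size, buffers=2)
        print(f"{level}: {in_place} points in place, {out_of_place} out of place")
```

Under these assumptions, input sizes below the computed threshold correspond to the regime where the abstract reports a rising miss ratio, and sizes above it to the regime where the ratio levels off.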
|
152 |
Comparison Of THM Formation During Disinfection: Ferrate Versus Free Chlorine For Different Source Waters
Mukattash, Adhem. 01 January 2007
The objective of the study was to compare the trihalomethanes (THMs) produced by ferrate with those produced by hypochlorite, and to determine how THM production differs for a given degree of disinfection (a 3-log reduction in Heterotrophic Plate Count (HPC)). Water samples were collected from Lake Claire, the Atlantic Ocean, and the secondary effluent of an advanced wastewater treatment plant. THM formation was determined using a standard assay over 7 days at room temperature. In addition, samples were tested for total coliform and Escherichia coli (TC/E. coli), and for heterotrophic bacteria using HPC by spread plating on R2A agar. Dissolved organic carbon (DOC) was measured as well. Dosages of 2, 5, and 10 ppm of hypochlorite and ferrate were used for Lake Claire and Atlantic Ocean water, while dosages of 1, 2, and 5 ppm were used for wastewater treatment effluent. Ferrate produced 48.3% ± 11.2% less THM for the same level of disinfection (i.e. approximately a 3-log reduction in HPC). Oxidation of DOC by ferrate was relatively small, with a 6.1 to 11.6% decrease in DOC observed for ferrate doses from 2 to 10 mg/L. Oxidation of DOC by free chlorine was negligible.
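The two quantities the abstract compares, log reduction in plate count and relative THM reduction, are both simple ratios. The sketch below shows the arithmetic only; the function names and all numeric inputs are made up for illustration and are not data from the study.

```python
import math


def log_reduction(count_before, count_after):
    """Base-10 log reduction between two plate counts (e.g. CFU/mL)."""
    if count_before <= 0 or count_after <= 0:
        raise ValueError("counts must be positive")
    return math.log10(count_before / count_after)


def percent_less(reference, value):
    """How much smaller value is than reference, as a percentage of reference."""
    if reference <= 0:
        raise ValueError("reference must be positive")
    return (reference - value) / reference * 100.0


if __name__ == "__main__":
    # Illustrative counts: 1,000,000 before and 1,000 after disinfection.
    print(f"log reduction: {log_reduction(1_000_000, 1_000):.1f}")
    # Illustrative THM concentrations for two disinfectants at equal disinfection.
    print(f"THM reduction: {percent_less(100.0, 51.7):.1f}%")
```

A 3-log reduction therefore means the surviving count is one thousandth of the initial count.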
|
153 |
Implementing Streaming Parallel Decision Trees on Graphic Processing Units / En implementering av Streaming Parallel Decision Trees på grafikkort
Svantesson, David. January 2018
Decision trees have long been a prevalent area within machine learning. With streaming data environments and large datasets becoming increasingly common, researchers have developed decision tree algorithms adapted to streaming data. One such algorithm is SPDT, which approaches the streaming data problem by combining workers on a network with a dynamic histogram approximation of the data. Several implementations of decision trees on GPU exist, but they are uncommon in a streaming data setting. In this research, conducted at RISE SICS, the possibility of accelerating the SPDT algorithm on GPU is investigated. An implementation is successfully created using the CUDA platform. The implementation uses a fixed number of data samples per layer to better fit the GPU platform. Experiments were conducted to investigate the impact on both accuracy and speed. The GPU implementation is found to perform as well as the CPU implementation in terms of accuracy, suggesting that using small subsets of the data in each layer is sufficient for making accurate split decisions. The GPU implementation is found to be up to 113 times faster than the reference Scala CPU implementation for one of the tested datasets, and 13 times faster on average over all the tested datasets. Weak parts of the implementation are identified, and further improvements are suggested to increase both accuracy and runtime performance.
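The "dynamic histogram approximation" that the workers maintain can be illustrated with a small fixed-size streaming histogram: each new value becomes a bin, and whenever the bin budget is exceeded the two closest bins are merged into their weighted mean. This is a minimal CPU sketch in that spirit, with a hypothetical class name and API; it is not the thesis's CUDA implementation.

```python
import bisect


class StreamingHistogram:
    """Fixed-budget histogram kept as a sorted list of [centroid, count] bins."""

    def __init__(self, max_bins):
        if max_bins < 1:
            raise ValueError("max_bins must be at least 1")
        self.max_bins = max_bins
        self.bins = []

    def update(self, value):
        """Add one sample, then shrink back to the bin budget if needed."""
        centroids = [b[0] for b in self.bins]
        i = bisect.bisect_left(centroids, value)
        if i < len(self.bins) and self.bins[i][0] == value:
            self.bins[i][1] += 1
        else:
            self.bins.insert(i, [value, 1])
        self._shrink()

    def merge(self, other):
        """Combine two histograms, e.g. from two workers, into a new one."""
        merged = StreamingHistogram(self.max_bins)
        merged.bins = sorted([list(b) for b in self.bins + other.bins])
        merged._shrink()
        return merged

    def _shrink(self):
        # Repeatedly merge the two adjacent bins with the smallest gap.
        while len(self.bins) > self.max_bins:
            gaps = [self.bins[i + 1][0] - self.bins[i][0]
                    for i in range(len(self.bins) - 1)]
            i = gaps.index(min(gaps))
            (p1, m1), (p2, m2) = self.bins[i], self.bins[i + 1]
            self.bins[i:i + 2] = [[(p1 * m1 + p2 * m2) / (m1 + m2), m1 + m2]]

    def total(self):
        """Number of samples summarized so far."""
        return sum(count for _, count in self.bins)


if __name__ == "__main__":
    hist = StreamingHistogram(max_bins=3)
    for v in [1, 2, 3, 4, 5, 6]:
        hist.update(v)
    print(hist.bins)
```

Because a merge keeps the count-weighted mean, the total count and the overall mean of the data survive compression, which is what makes such summaries cheap to exchange between workers.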
|
154 |
High-performance and Scalable Bayesian Group Testing and Real-time fMRI Data Analysis
Chen, Weicong. 27 January 2023
No description available.
|
155 |
AN FPGA IMPLEMENTATION OF FDTD CODES FOR RECONFIGURABLE HIGH PERFORMANCE COMPUTING
GANDHI, SACHIN. January 2004
No description available.
|
156 |
Designing and Modeling High-Performance MapReduce and DAG Execution Framework on Modern HPC Systems
Rahman, Md Wasi-ur-. January 2016
No description available.
|
157 |
Stateless Parallel Processing Architecture for Extreme Scale HPC and Auction-based Clouds
Taifi, Moussa. January 2013
Extreme scale HPC (high performance computing) applications require a massive number of nodes. At these scales, transient hardware and software failures, as well as network congestion and disconnections, increase linearly with the number of components. This volatility has contributed to a dramatic decrease in applications' MTBF (mean time between failures). The semantics of traditional point-to-point transmission APIs are ill-suited to applications of extreme scale. In this thesis, we investigate an application-dependent network design that focuses on the sustainability of extreme scale high performance computing applications, using packet-switching-inspired statistical multiplexing of semantic data tuples and decoupled computations. We report the design and implementation of a distributed tuple space using Cassandra and Zookeeper for tunable spatial and temporal redundancies without negative impact on application performance. We detail the various failure scenarios that can be handled seamlessly by our system and describe the advantages of Stateless Parallel Processing for HPC applications. We report our results on performance, reliability and overall application sustainability. In preliminary tests, for the most common HPC application categories, the prototype has demonstrated sustained performance while providing a reliable computing architecture that can withstand multiple failure types without manual checkpoint-restart (CPR). The feasibility of efficient non-stop HPC enables auction-based clouds for more cost-efficient HPC applications. For all HPC application categories, we first report a novel method for determining bid-aware checkpointing intervals using the fluctuating pricing histories of cloud providers. Subsequently, we explore the effects of bidding in the case of virtual HPC clusters composed of EC2 Spot Instances.
We expose the counter-intuitive effects of uniform versus non-uniform bidding, especially in terms of failure rate and failure model, and we propose a method for predicting the runtime of parallel applications under various bidding strategies. We then show that CPR-free HPC applications require a new optimization strategy. As extreme scale HPC and auction-based cloud computing offer the ultimate computational scale and resource efficiency, they challenge the very foundations of computer science research and development. This thesis answers some critical questions about these challenges, and we hope it paves the way for future improvements of the HPC field under increasingly harsh and volatile conditions. / Computer and Information Science
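The link between node count, MTBF and checkpointing can be made concrete with a back-of-the-envelope calculation. The sketch below assumes independent node failures at a constant rate, so that the system failure rate grows linearly with node count, and uses Young's classical first-order approximation for the checkpoint interval. It is a textbook baseline for orientation, not the bid-aware method proposed in the thesis, and all numbers are hypothetical.

```python
import math


def system_mtbf(node_mtbf_hours, node_count):
    """MTBF of the whole machine, assuming independent nodes with constant
    failure rates: rates add up, so the MTBF shrinks as 1 / node_count."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return node_mtbf_hours / node_count


def young_checkpoint_interval(checkpoint_cost_hours, mtbf_hours):
    """Young's first-order approximation: interval = sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


if __name__ == "__main__":
    # Hypothetical values: 100,000-hour node MTBF, 3-minute checkpoint cost.
    for nodes in (100, 10_000, 1_000_000):
        mtbf = system_mtbf(100_000.0, nodes)
        interval = young_checkpoint_interval(0.05, mtbf)
        print(f"{nodes:>9} nodes: MTBF {mtbf:10.3f} h, interval {interval:7.3f} h")
```

As the node count grows, the suggested interval shrinks toward the checkpoint cost itself, which illustrates why the abstract argues for architectures that avoid manual checkpoint-restart at extreme scale.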
|
158 |
Automated Adaptive Software Maintenance: A Methodology and Its Applications
Tansey, Wesley. 11 August 2008
In modern software development, maintenance accounts for the majority of the total cost and effort in a software project. Especially burdensome are those tasks which require applying a new technology in order to adapt an application to changed requirements or a different environment.
This research explores methodologies, techniques, and approaches for automating such adaptive maintenance tasks. By combining high-level specifications and generative techniques, a new methodology shapes the design of approaches to automating adaptive maintenance tasks in the application domains of high performance computing (HPC) and enterprise software. Despite the vast differences of these domains and their respective requirements, each approach is shown to be effective at alleviating their adaptive maintenance burden.
This thesis proves that it is possible to effectively automate tedious and error-prone adaptive maintenance tasks in a diverse set of domains by exploiting high-level specifications to synthesize specialized low-level code. The specific contributions of this thesis are as follows: (1) a common methodology for designing automated approaches to adaptive maintenance, (2) a novel approach to automating the generation of efficient marshaling logic for HPC applications from a high-level visual model, and (3) a novel approach to automatically upgrading legacy enterprise applications to use annotation-based frameworks.
The technical contributions of this thesis have been realized in two software tools for automated adaptive maintenance: MPI Serializer, a marshaling logic generator for MPI applications, and Rosemari, an inference and transformation engine for upgrading enterprise applications.
This thesis is based on research papers accepted to IPDPS '08 and OOPSLA '08. / Master of Science
|
159 |
Algorithms and Frameworks for Accelerating Security Applications on HPC Platforms
Yu, Xiaodong. 09 September 2019
Typical cybersecurity solutions emphasize achieving defense functionality. However, execution efficiency and scalability are equally important, especially for real-world deployment. Straightforward mappings of cybersecurity applications onto HPC platforms may significantly underutilize the capacity of the HPC devices. On the other hand, sophisticated implementations are quite difficult: they require an in-depth understanding of both the domain-specific characteristics of cybersecurity and the HPC architecture and system model.
In our work, we investigate three sub-areas of cybersecurity: mobile software security, network security, and system security. They have the following performance issues, respectively: 1) Flow- and context-sensitive static analysis of large and complex Android APKs is extremely time-consuming. Existing CPU-only frameworks and tools have to set a timeout threshold that halts the program analysis, trading precision for performance. 2) Network intrusion detection systems (NIDS) use automata processing as their search core and require line-speed processing. However, achieving high-speed automata processing is exceptionally difficult in both algorithm and implementation. 3) It is unclear how cache configurations affect the performance of time-driven cache side-channel attacks. This question remains open because it is difficult to conduct comparative measurements to study the impact.
In this dissertation, we demonstrate how application-specific characteristics can be leveraged to optimize implementations on various types of HPC platforms for faster and more scalable cybersecurity executions. For example, we present a new GPU-assisted framework and a collection of optimization strategies for fast Android static data-flow analysis that achieve up to 128X speedups over the plain GPU implementation. For network intrusion detection systems (IDS), we design and implement an algorithm capable of eliminating the state explosion in out-of-order packet situations, which reduces the memory overhead by up to 400X. We also present tools for improving the usability of Micron's Automata Processor. To study the impact of cache configurations on the performance of time-driven cache side-channel attacks, we design an approach to conducting comparative measurements. We propose a quantifiable success rate metric to measure the performance of time-driven cache attacks and use the GEM5 platform to emulate the configurable cache. / Doctor of Philosophy / Typical cybersecurity solutions emphasize achieving defense functionality. However, execution efficiency and scalability are equally important, especially for real-world deployment. Straightforward mappings of applications onto High-Performance Computing (HPC) platforms may significantly underutilize the capacity of the HPC devices. In this dissertation, we demonstrate how application-specific characteristics can be leveraged to optimize various types of HPC executions for cybersecurity. We investigate several sub-areas, including mobile software security, network security, and system security. For example, we present a new GPU-assisted framework and a collection of optimization strategies for fast Android static data-flow analysis that achieve up to 128X speedups over the unoptimized GPU implementation.
For network intrusion detection systems (IDS), we design and implement an algorithm capable of eliminating the state explosion in out-of-order packet situations, which reduces the memory overhead by up to 400X. We also present tools for improving the usability of HPC programming. To study the impact of cache configurations on the performance of time-driven cache side-channel attacks, we design an approach to conducting comparative measurements. We propose a quantifiable success rate metric to measure the performance of time-driven cache attacks and use the GEM5 platform to emulate the configurable cache.
|
160 |
Characterization of FPGA-based High Performance Computers
Pimenta Pereira, Karl Savio. 02 September 2011
As CPU clock frequencies plateau and the doubling of CPU cores per processor exacerbates the memory wall, hybrid core computing, which augments CPUs with FPGAs and/or GPUs, holds the promise of addressing high-performance computing demands, particularly with respect to performance, power and productivity. Traditional benchmarks for high-performance computers, such as SPEC, take an architecture-based approach and do not completely express the parallelism that exists in FPGA and GPU accelerators. This thesis follows an application-centric approach, comparing the sustained performance of two key computational idioms with respect to performance, power and productivity. Specifically, a complex, single-precision, floating-point, 1D Fast Fourier Transform (FFT) and a Molecular Dynamics modeling application are implemented on state-of-the-art FPGA and GPU accelerators. The results show that FPGA floating-point FFT performance is highly sensitive to the mix of dedicated FPGA resources, in particular DSP48E slices, block RAMs, and FPGA I/O banks. Estimated results show that for the floating-point FFT benchmark on FPGAs, these resources are the performance-limiting factor. Fixed-point FFTs are important in many high-performance embedded applications. For a fixed-point FFT, FPGAs exploit a flexible data path width to trade off circuit cost against speed of computation, improving performance and resource utilization. GPUs, having a fixed data-width architecture, cannot fully take advantage of this. For the molecular dynamics application, FPGAs benefit from the flexibility to create a custom, tightly pipelined datapath and from the highly optimized memory subsystem of the accelerator. This can provide a 250-fold improvement over an optimized CPU implementation and a 2-fold improvement over an optimized GPU implementation, along with massive power savings.
Finally, to extract the maximum performance from the FPGA, each implementation requires a balance between the formulation of the algorithm on the platform, the optimum use of available external memory bandwidth, and the availability of computational resources, at the expense of greater programming effort. / Master of Science
|