Spelling suggestions: "subject:"inter""
91 |
Návrh mobilní aplikace pro portál Hlidani.eu / Design of Mobile Application for Hlidani.euWeigel, Martin January 2016 (has links)
The master‘s thesis focuses on the design of mobile application for web portal Hlidani.eu on Android platform. The theoretical part of the thesis analyzes problems and terms concerning mobile applications. The thesis uses selected analytical methods to analyze the current state of web portal Hlidani.eu. Based on these results, the application itself is designed.
|
92 |
Vysoce náročné aplikace na svazku karet Intel Xeon Phi / High Performance Applications on Intel Xeon Phi ClusterKačurik, Tomáš January 2016 (has links)
The main topic of this thesis is the implementation and subsequent optimization of high performance applications on a cluster of Intel Xeon Phi coprocessors. Using two approaches to solve the N-Body problem, the possibilities of the program execution on a cluster of processors, coprocessors or both device types have been demonstrated. Two particular versions of the N-Body problem have been chosen - the naive and Barnes-hut. Both problems have been implemented and optimized. For better comparison of the achieved results, we only considered achieved acceleration against single node runs using processors only. In the case of the naive version a 15-fold increase has been achieved when using combination of processors and coprocessors on 8 computational nodes. The performance in this case was 9 TFLOP/s. Based on the obtained results we concluded the advantages and disadvantages of the program execution in the distributed environments using processors, coprocessors or both.
|
93 |
Efficient Implementation of 3D Finite Difference Schemes on Recent Processor Architectures / Effektiv implementering av finita differensmetoder i 3D på senaste processorarkitekturerCeder, Frederick January 2015 (has links)
Efficient Implementation of 3D Finite Difference Schemes on Recent Processors Abstract In this paper a solver is introduced that solves a problem set modelled by the Burgers equation using the finite difference method: forward in time and central in space. The solver is parallelized and optimized for Intel Xeon Phi 7120P as well as Intel Xeon E5-2699v3 processors to investigate differences in terms of performance between the two architectures. Optimized data access and layout have been implemented to ensure good cache utilization. Loop tiling strategies are used to adjust data access with respect to the L2 cache size. Compiler hints describing aligned memory access are used to support vectorization on both processors. Additionally, prefetching strategies and streaming stores have been evaluated for the Intel Xeon Phi. Parallelization was done using OpenMP and MPI. The parallelisation for native execution on Xeon Phi is based on OpenMP and yielded a raw performance of nearly 100 GFLOP/s, reaching a speedup of almost 50 at a 83\% parallel efficiency. An OpenMP implementation on the E5-2699v3 (Haswell) processors produced up to 292 GFLOP/s, reaching a speedup of almost 31 at a 85\% parallel efficiency. For comparison a mixed implementation using interleaved communications with computations reached 267 GFLOP/s at a speedup of 28 with a 87\% parallel efficiency. Running a pure MPI implementation on the PDC's Beskow supercomputer with 16 nodes yielded a total performance of 1450 GFLOP/s and for a larger problem set it yielded a total of 2325 GFLOP/s, reaching a speedup and parallel efficiency at resp. 170 and 33,3\% and 290 and 56\%. An analysis based on the roofline performance model shows that the computations were memory bound to the L2 cache bandwidth, suggesting good L2 cache utilization for both the Haswell and the Xeon Phi's architectures. Xeon Phi performance can probably be improved by also using MPI. Keeping technological progress for computational cores in the Haswell processor in mind for the comparison, both processors perform well. Improving the stencil computations to a more compiler friendly form might improve performance more, as the compiler can possibly optimize more for the target platform. The experiments on the Cray system Beskow showed an increased efficiency from 33,3\% to 56\% for the larger problem, illustrating good weak scaling. This suggests that problem sizes should increase accordingly for larger number of nodes in order to achieve high efficiency. Frederick Ceder / Effektiv implementering av finita differensmetoder i 3D på moderna processorarkitekturer Sammanfattning Denna uppsats diskuterar implementationen av ett program som kan lösa problem modellerade efter Burgers ekvation numeriskt. Programmet är byggt ifrån grunden och använder sig av finita differensmetoder och applicerar FTCS metoden (Forward in Time Central in Space). Implementationen paralleliseras och optimeras på Intel Xeon Phi 7120P Coprocessor och Intel Xeon E5-2699v3 processorn för att undersöka skillnader i prestanda mellan de två modellerna. Vi optimerade programmet med omtanke på dataåtkomst och minneslayout för att få bra cacheutnyttjande. Loopblockningsstrategier används också för att dela upp arbetsminnet i mindre delar för att begränsa delarna i L2 cacheminnet. För att utnyttja vektorisering till fullo så används kompilatordirektiv som beskriver minnesåtkomsten, vilket ska hjälpa kompilatorn att förstå vilka dataaccesser som är alignade. Vi implementerade också prefetching strategier och streaming stores på Xeon Phi och disskuterar deras värde. Paralleliseringen gjordes med OpenMP och MPI. Parallelliseringen för Xeon Phi:en är baserad på bara OpenMP och exekverades direkt på chipet. Detta gav en rå prestanda på nästan 100 GFLOP/s och nådde en speedup på 50 med en 83% effektivitet. En OpenMP implementation på E5-2699v3 (Haswell) processorn fick upp till 292 GFLOP/s och nådde en speedup på 31 med en effektivitet på 85%. I jämnförelse fick en hybrid implementation 267 GFLOP/s och nådde en speedup på 28 med en effektivitet på 87%. En ren MPI implementation på PDC's Beskow superdator med 16 noder gav en total prestanda på 1450 GFLOP/s och för en större problemställning gav det totalt 2325 GFLOP/s, med speedup och effektivitet på respektive 170 och 33% och 290 och 56%. En analys baserad på roofline modellen visade att beräkningarna var minnesbudna till L2 cache bandbredden, vilket tyder på bra L2-cache användning för både Haswell och Xeon Phi:s arkitekturer. Xeon Phis prestanda kan förmodligen förbättras genom att även använda MPI. Håller man i åtanke de tekniska framstegen när det gäller beräkningskärnor på de senaste åren, så preseterar både arkitekturer bra. Beräkningskärnan av implementationen kan förmodligen anpassas till en mer kompilatorvänlig variant, vilket eventuellt kan leda till mer optimeringar av kompilatorn för respektive plattform. Experimenten på Cray-systemet Beskow visade en ökad effektivitet från 33,3% till 56% för större problemställningar, vilket visar tecken på bra weak scaling. Detta tyder på att effektivitet kan uppehållas om problemställningen växer med fler antal beräkningsnoder. Frederick Ceder
|
94 |
Defeating Critical Threats to Cloud User Data in Trusted Execution EnvironmentsAdil Ahmad (13150140) 26 July 2022 (has links)
<p>In today’s world, cloud machines store an ever-increasing amount of sensitive user data, but it remains challenging to guarantee the security of our data. This is because a cloud machine’s system software—critical components like the operating system and hypervisor that can access and thus leak user data—is subject to attacks by numerous other tenants and cloud administrators. Trusted execution environments (TEEs) like Intel SGX promise to alter this landscape by leveraging a trusted CPU to create execution contexts (or enclaves) where data cannot be directly accessed by system software. Unfortunately, the protection provided by TEEs cannot guarantee complete data security. In particular, our data remains unprotected if a third-party service (e.g., Yelp) running inside an enclave is adversarial. Moreover, data can be indirectly leaked from the enclave using traditional memory side-channels.</p>
<p><br></p>
<p>This dissertation takes a significant stride towards strong user data protection in cloud machines using TEEs by defeating the critical threats of adversarial cloud services and memory side-channels. To defeat these threats, we systematically explore both software and hardware designs. In general, we designed software solutions to avoid costly hardware changes and present faster hardware alternatives.</p>
<p><br></p>
<p>We designed 4 solutions for this dissertation. Our Chancel system prevents data leaks from adversarial services by restricting data access capabilities through robust and efficient compiler-enforced software sandboxing. Moreover, our Obliviate and Obfuscuro systems leverage strong cryptographic randomization and prevent information leakage through memory side-channels. We also propose minimal CPU extensions to Intel SGX called Reparo that directly close the threat of memory side-channels efficiently. Importantly, each designed solution provides principled protection by addressing the underlying root-cause of a problem, instead of enabling partial mitigation.</p>
<p><br></p>
<p>Finally, in addition to the stride made by our work, future research thrust is required to make TEEs ubiquitous for cloud usage. We propose several such research directions to pursue the essential goal of strong user data protection in cloud machines.</p>
|
95 |
Solving dense linear systems on accelerated multicore architectures / Résoudre des systèmes linéaires denses sur des architectures composées de processeurs multicœurs et d’accélerateursRémy, Adrien 08 July 2015 (has links)
Dans cette thèse de doctorat, nous étudions des algorithmes et des implémentations pour accélérer la résolution de systèmes linéaires denses en utilisant des architectures composées de processeurs multicœurs et d'accélérateurs. Nous nous concentrons sur des méthodes basées sur la factorisation LU. Le développement de notre code s'est fait dans le contexte de la bibliothèque MAGMA. Tout d'abord nous étudions différents solveurs CPU/GPU hybrides basés sur la factorisation LU. Ceux-ci visent à réduire le surcoût de communication dû au pivotage. Le premier est basé sur une stratégie de pivotage dite "communication avoiding" (CALU) alors que le deuxième utilise un préconditionnement aléatoire du système original pour éviter de pivoter (RBT). Nous montrons que ces deux méthodes surpassent le solveur utilisant la factorisation LU avec pivotage partiel quand elles sont utilisées sur des architectures hybrides multicœurs/GPUs. Ensuite nous développons des solveurs utilisant des techniques de randomisation appliquées sur des architectures hybrides utilisant des GPU Nvidia ou des coprocesseurs Intel Xeon Phi. Avec cette méthode, nous pouvons éviter l'important surcoût du pivotage tout en restant stable numériquement dans la plupart des cas. L'architecture hautement parallèle de ces accélérateurs nous permet d'effectuer la randomisation de notre système linéaire à un coût de calcul très faible par rapport à la durée de la factorisation. Finalement, nous étudions l'impact d'accès mémoire non uniformes (NUMA) sur la résolution de systèmes linéaires denses en utilisant un algorithme de factorisation LU. En particulier, nous illustrons comment un placement approprié des processus légers et des données sur une architecture NUMA peut améliorer les performances pour la factorisation du panel et accélérer de manière conséquente la factorisation LU globale. Nous montrons comment ces placements peuvent améliorer les performances quand ils sont appliqués à des solveurs hybrides multicœurs/GPU. / In this PhD thesis, we study algorithms and implementations to accelerate the solution of dense linear systems by using hybrid architectures with multicore processors and accelerators. We focus on methods based on the LU factorization and our code development takes place in the context of the MAGMA library. We study different hybrid CPU/GPU solvers based on the LU factorization which aim at reducing the communication overhead due to pivoting. The first one is based on a communication avoiding strategy of pivoting (CALU) while the second uses a random preconditioning of the original system to avoid pivoting (RBT). We show that both of these methods outperform the solver using LU factorization with partial pivoting when implemented on hybrid multicore/GPUs architectures. We also present new solvers based on randomization for hybrid architectures for Nvidia GPU or Intel Xeon Phi coprocessor. With this method, we can avoid the high cost of pivoting while remaining numerically stable in most cases. The highly parallel architecture of these accelerators allow us to perform the randomization of our linear system at a very low computational cost compared to the time of the factorization. Finally we investigate the impact of non-uniform memory accesses (NUMA) on the solution of dense general linear systems using an LU factorization algorithm. In particular we illustrate how an appropriate placement of the threads and data on a NUMA architecture can improve the performance of the panel factorization and consequently accelerate the global LU factorization. We show how these placements can improve the performance when applied to hybrid multicore/GPU solvers.
|
96 |
An Evaluation of Intel Cache Allocation Technology for Data- Intensive Applications / En utvärdering av Intel Cache Allocation Technology för dataintensiva applikationerIhre Sherif, Alan January 2021 (has links)
On certain CPUs part of the Intel Xeon Scalable CPU family, the level three (L3) cache is shared among the CPU cores residing on the same CPU socket. This has benefits in that a larger and more scalable cache space is available to the CPU cores. However, when the L3 cache is shared between CPU cores and thereby by the applications running there, the applications can affect the performance of each other if some of them have high L3 cache usage. This can be particularly problematic if an application is over-utilizing the L3 cache and effectively evicting the data of other applications, which are more prioritized, from the L3 cache. Such applications are called L3 cache noisy neighbors. The experiments in this thesis study the effect L3 cache noisy neighbors have on other, more prioritized, applications and if Intel Cache Allocation Technology (CAT) can be used to limit the performance impact the noisy neighbors have. Intel CAT provides functionality to control the amount of L3 cache allocated to a CPU core and by allocating less L3 cache to a noisy neighbor it no longer shares as much L3 cache with the prioritized applications and thus the prioritized applications can again utilize more of the L3 cache and regain their performance. The research question of this thesis is to investigate in what cases Intel CAT can provide advantages and where it is a disadvantage to use it by studying its use for three commonly used applications; bzip2, Redis, and Graph500. All the three applications were significantly impacted when running simultaneously with a noisy neighbor and for the Redis application there was a decrease of 49.2% in the number of ’GET’ requests per second that the Redis server could handle and an 18.2% decrease for ’SET’ requests. For the bzip2 and Graph500 applications, there was a 14.7% and 28.1% increase in execution time respectively. Intel CAT was successfully used to limit the impact of the noisy neighbor on the three applications. For the Redis application, the number of requests per second increased by 8.6% for the ’GET’ operation and by 4.2% for the ’SET’ operation. For the bzip2 and Graph500 applications, there was a 5.8% and 12.0% decrease in execution time respectively. Moreover, the thesis studies the scenario when only prioritized applications are running and if their performance can be increased by isolating the L3 cache for each one of them so that they cannot cause L3 cache evictions for each other. The use case of Intel CAT in such a scenario is not as clear as when mitigating the impact of a noisy neighbor but some performance benefits can be observed when running multiple Redis instances on the same machine and isolating some of the L3 cache available to them. / För vissa processorer som tillhör familjen Intel Xeon Scalable är den tredje nivåns cache (L3-cache) delad mellan CPU-kärnorna som befinner sig på samma CPU-sockel. Detta har fördelen att ett större och mer skalbart cacheutrymme blir tillgängligt för CPU-kärnorna. Att L3-cache är delat mellan kärnorna innebär däremot att applikationerna som kör där kan påverka varandras prestanda om någon av dem överutnyttjar L3-cache. När en applikation överutnyttjar L3-cache leder det till att data från andra applikationer, som kan vara mer prioriterade, inte längre får plats i cachen. Sådana applikationer kallas för ”L3-cache noisy neigbors”. Experimenten i denna studie undersöker effekterna av L3-cache noisy neigbors på mer prioriterade applikationer och om Intel Cache Allocation Technology (CAT) kan användas för att begränsa den påverkan som L3-cache noisy neigbors har. Intel CAT har funktionalitet för att kontrollera mängden L3-cache som allokeras till en CPU-kärna och genom att allokera mindre L3-cache till en noisy neigbor så delar den inte lika mycket L3-cache med de prioriterade applikationerna och därmed kan de prioriterade applikationerna återfå sin prestanda. Frågeställningen för denna studie är att undersöka i vilka användningsområden Intel CAT har fördelar och när det är en nackdel att använda det genom att studera dess användning för tre välanvända applikationer, bzip2, Redis och Graph500. Prestandan för alla av dessa tre applikationer blev tydligt påverkad när de kördes samtidigt som en noisy neigbor och Intel CAT kunde användas för att minska den påverkan. För Redis ökade antalet frågor som hanterades av Redis med 8.6% för GET-operationer och 4.2% för SET-operationer. För bzip2 och Graph500 observerades en minskning i exekveringstid på 5.8% och 12.0% respektive. Denna uppsats undersöker även scenariot där bara prioriterade applikationer körs och om deras prestanda kan ökas genom att isolera L3-cache för var och en av dem så att de inte tar plats från varandra i L3-cachen. När Intel CAT användes i ett sådant scenario är fördelarna inte lika tydliga som när påverkan av en noisy neighbor begränsades men en viss förbättring i prestanda går att observera när flera Redisservrar körs på samma maskin och en del av L3-cachen isoleras till var och en av dem.
|
97 |
開放原始碼軟體平台與互補性資產建構—以Google與 Intel 為例 / Open Source Software Platform for Promoting Complementary Asset Developments–a Case Study of Google and Intel高士翔, Shih-Hsiang (Sean) Kao Unknown Date (has links)
開放原始碼軟體平台與互補性資產建構—以Google與 Intel 為例 / Open source software is Open Innovation only if it has a business model driving it (West and Gallagher 2006). Open Innovation is the paradigm describing the scenario in which firms use a broad range of external sources for innovation and seek a broad range of commercialization alternatives for internal innovation (Chesbrough 2003). The Platform
Leader builds the platform and concentrates its efforts on promoting and directing innovation of complementary products in favor of its R&D direction (Cusumano and Gawer 2002). The author has chosen leaders in two distinctive industry sectors— Google, the leader in search engine industry, and Intel, the leader in the microprocessor business for the personal computer industry—as the case study companies for this research. Both cases fit the definition of open innovation since both Google and Intel have specific business models for their open source
software platforms.
This research explores how industry leaders exploit open source software platforms to realize their specific strategic intents. The research problems are: (1) how companies can incorporate external creativity and innovation to maintain their own innovative momentum; (2)
what are the key factors and strategies for building a successful open source software platform and its ecosystem; (3) how can a company use an open source software platform as part of its strategy to enter new markets and promote development of complementary assets to build its competitive advantages.
The author proposes the following framework to analyze how leading firms design open source platform strategies: (1) analyze the firm’s core competencies; (2) analyze the firm’s strategic intent for their open source software platform; (3) analyze the firm’s strategies for designing the architecture of their open source software platform; (4) analyze the firm’s strategies for designing the ecosystem around the platform.
Based on the analysis of the two comparative cases, the author has been convinced of the following propositions:
1. Firms can use open source software platform to incorporate external creativity and innovations that promote the development of complementary assets and to build or at least maintain their competitive advantage against competitors.
2. Instead of a purely open or purely proprietary platform strategy, platform owners can utilize a hybrid strategy, which combines the advantages of open source and closed source to retain control and differentiation.
3. As opposed to a company-owned open source software platform, a community-owned open source software platform will attract more communities’ involvements and stimulate
more innovation.
4. When developing complementary assets, firms should adopt an open innovation approach to incorporate external creativity and innovations; however, when building their core competencies, firms should adopt a more closed innovation approach to maintain their distinctive competitive advantages.
5. One of the key determining factors of a successful open source platform strategy is the platform owner’s ability to create value and enable every partner within the ecosystem to share some portion of it.
|
98 |
Real-time Detection And Tracking Of Human Eyes In Video SequencesSavas, Zafer 01 September 2005 (has links) (PDF)
Robust, non-intrusive human eye detection problem has been a fundamental and challenging problem for computer vision area. Not only it is a problem of its own, it can be used to ease the problem of finding the locations of other facial features for recognition tasks and human-computer interaction purposes as well. Many previous works have the capability of determining the locations of the human eyes but the main task in this thesis is not only a vision system with eye detection capability / Our aim is to design a real-time, robust, scale-invariant eye tracker system with human eye movement indication property using the movements of eye pupil. Our eye tracker algorithm is implemented using the Continuously Adaptive Mean Shift (CAMSHIFT) algorithm proposed by Bradski and the EigenFace method proposed by Turk & / Pentland. Previous works for scale invariant object detection using Eigenface method are mostly dependent on limited number of user predefined scales which causes speed problems / so in order to avoid this problem an adaptive eigenface method using the information extracted from CAMSHIFT algorithm is implemented to have a fast and scale invariant eye tracking.
First of all / human face in the input image captured by the camera is detected using the CAMSHIFT algorithm which tracks the outline of an irregular shaped object that may change size and shape during the tracking process based on the color of the object. Face area is passed through a number of preprocessing steps such as color space conversion and thresholding to obtain better results during the eye search process. After these preprocessing steps, search areas for left and right eyes are determined using the geometrical properties of the human face and in order to locate each eye indivually the training images are resized by the width information supplied by the CAMSHIFT algortihm. Search regions for left and right eyes are individually passed to the eye detection algortihm to determine the exact locations of each eye. After the detection of eyes, eye areas are individually passed to the pupil detection and eye area detection algorithms which are based on the Active Contours method to indicate the pupil and eye area. Finally, by comparing the geometrical locations of pupil with the eye area, human gaze information is extracted.
As a result of this thesis a software named &ldquo / TrackEye&rdquo / with an user interface having indicators for the location of eye areas and pupils, various output screens for human computer interaction and controls for allowing to test the effects of color space conversions and thresholding types during object tracking has been built.
|
99 |
Design, Implementation and Cryptanalysis of Modern Symmetric CiphersHenricksen, Matthew January 2005 (has links)
The main objective of this thesis is to examine the trade-offs between security and efficiency within symmetric ciphers. This includes the influence that block ciphers have on the new generation of word-based stream ciphers. By incorporating block-cipher like components into their designs, word-based stream ciphers have experienced hundreds-fold improvement in speed over bit-based stream ciphers, without any observable security degradation. The thesis also emphasizes the importance of keying issues in block and stream ciphers, showing that by reusing components of the principal cipher algorithm in the keying algorithm, security can be enhanced without loss of key-agility or expanding footprint in software memory. Firstly, modern block ciphers from four recent cipher competitions are surveyed and categorized according to criteria that includes the high-level structure of the block cipher, the method in which non-linearity is instilled into each round, and the strength of the key schedule. In assessing the last criterion, a classification by Carter [45] is adopted and modified to improve its consistency. The classification is used to demonstrate that the key schedule of the Advanced Encryption Standard (AES) [62] is surprisingly flimsy for a national standard. The claim is supported with statistical evidence that shows the key schedule suffers from bit leakage and lacks sufficient diffusion. The thesis contains a replacement key schedule that reuses components from the cipher algorithm, leveraging existing analysis to improve security, and reducing the cipher's implementation footprint while maintaining key agility. The key schedule is analyzed from the perspective of an efficiency-security tradeoff, showing that the new schedule rectifies an imbalance towards e±ciency present in the original. The thesis contains a discussion of the evolution of stream ciphers, focusing on the migration from bit-based to word-based stream ciphers, from which follows a commensurate improvement in design flexibility and software performance. It examines the influence that block ciphers, and in particular the AES, have had upon the development of word-based stream ciphers. The thesis includes a concise literature review of recent styles of cryptanalytic attack upon stream ciphers. Also, claims are refuted that one prominent word-based stream cipher, RC4, suffers from a bias in the first byte of each keystream. The thesis presents a divide and conquer attack against Alpha1, an irregularly clocked bit-based stream cipher with a 128-bit state. The dominating aspect of the divide and conquer attack is a correlation attack on the longest register. The internal state of the remaining registers is determined by utilizing biases in the clocking taps and launching a guess and determine attack. The overall complexity of the attack is 261 operations with text requirements of 35,000 bits and memory requirements of 2 29.8 bits. MUGI is a 64-bit word-based cipher with a large Non-linear Feedback Shift Register (NLFSR) and an additional non-linear state. In standard benchmarks, MUGI appears to su®er from poor key agility because it is implemented on an architecture for which it is not designed, and because its NLFSR is too large relative to the size of its master key. An unusual feature of its key initialization algorithm is described. A variant of MUGI, entitled MUGI-M, is proposed to enhance key agility, ostensibly without any loss of security. The thesis presents a new word-based stream cipher called Dragon. This cipher uses a large internal NLFSR in conjunction with a non-linear filter to produce 64 bits of keystream in one round. The non-linear filter looks very much like the round function of a typical modern block cipher. Dragon has a native word size of 32 bits, and uses very simple operations, including addition, exclusive-or and s-boxes. Together these ensure high performance on modern day processors such as the Intel Pentium family. Finally, a set of guidelines is provided for designing and implementing symmetric ciphers on modern processors, using the Intel Pentium 4 as a case study. Particular attention is given to understanding the architecture of the processor, including features such as its register set and size, the throughput and latencies of its instruction set, and the memory layouts and speeds. General optimization rules are given, including how to choose fast primitives for use within the cipher. The thesis describes design decisions that were made for the Dragon cipher with respect to implementation on the Intel Pentium 4. Block Ciphers, Word-based Stream Ciphers, Cipher Design, Cipher Implementa- tion, -
|
100 |
Accelerated Deep Learning using Intel Xeon PhiViebke, André January 2015 (has links)
Deep learning, a sub-topic of machine learning inspired by biology, have achieved wide attention in the industry and research community recently. State-of-the-art applications in the area of computer vision and speech recognition (among others) are built using deep learning algorithms. In contrast to traditional algorithms, where the developer fully instructs the application what to do, deep learning algorithms instead learn from experience when performing a task. However, for the algorithm to learn require training, which is a high computational challenge. High Performance Computing can help ease the burden through parallelization, thereby reducing the training time; this is essential to fully utilize the algorithms in practice. Numerous work targeting GPUs have investigated ways to speed up the training, less attention have been paid to the Intel Xeon Phi coprocessor. In this thesis we present a parallelized implementation of a Convolutional Neural Network (CNN), a deep learning architecture, and our proposed parallelization scheme, CHAOS. Additionally a theoretical analysis and a performance model discuss the algorithm in detail and allow for predictions if even more threads are available in the future. The algorithm is evaluated on an Intel Xeon Phi 7120p, Xeon E5-2695v2 2.4 GHz and Core i5 661 3.33 GHz using various architectures and thread counts on the MNIST dataset. Findings show a 103.5x, 99.9x, 100.4x speed up for the large, medium, and small architecture respectively for 244 threads compared to 1 thread on the coprocessor. Moreover, a 10.9x - 14.1x (large to small) speed up compared to the sequential version running on Xeon E5. We managed to decrease training time from 7 days on the Core i5 and 31 hours on the Xeon E5, to 3 hours on the Intel Xeon Phi when training our large network for 15 epochs
|
Page generated in 0.0748 seconds