Global ETD Search

1	Ανάπτυξη διαδικτυακής εφαρμογής για την εξομοίωση της λειτουργίας ενός επεξεργαστή με διευρυμένο ρεπερτόριο εντολών / Development of a web application for simulating the operation of a processor with an extended instruction set Κάτσενος, Χρήστος 26 July 2012 The purpose of this study is to simulate, over the Internet, the operation of a processor with an extended instruction set. More specifically, an online tool was developed that accepts a sequence of instructions, checks them, assembles them, and stores the resulting code in the application's memory. Once the program has been checked and stored in memory, the graphical part of the application simulates the operation of the processor, displaying the values the registers take at each moment as well as the sequence of data transferred to and from them. Processors Simulators 006.76 Central Processing Unit (CPU) ERS CPU
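The check, assemble, store and simulate flow described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the three-instruction set (LOAD, ADD, HALT), the register names R0 and R1, and the function names `assemble` and `simulate` are hypothetical and are not taken from the thesis.

```python
# Hypothetical toy instruction set; not the one used in the thesis.
OPCODES = {"HALT": 0, "LOAD": 1, "ADD": 2}
REGISTERS = ("R0", "R1")


def assemble(source):
    """Check each instruction and translate it into an (opcode, register, value) tuple."""
    memory = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        mnemonic = parts[0].upper()
        if mnemonic not in OPCODES:
            raise ValueError(f"line {line_no}: unknown instruction {mnemonic!r}")
        if mnemonic == "HALT":
            memory.append((OPCODES["HALT"], None, 0))
            continue
        if len(parts) != 3 or parts[1] not in REGISTERS:
            raise ValueError(f"line {line_no}: expected '{mnemonic} <register> <integer>'")
        memory.append((OPCODES[mnemonic], parts[1], int(parts[2])))
    return memory


def simulate(memory):
    """Run the stored program and record the register values after every executed step."""
    registers = {name: 0 for name in REGISTERS}
    trace = []
    pc = 0  # program counter
    while pc < len(memory):
        opcode, reg, value = memory[pc]
        if opcode == OPCODES["HALT"]:
            break
        if opcode == OPCODES["LOAD"]:
            registers[reg] = value
        elif opcode == OPCODES["ADD"]:
            registers[reg] += value
        pc += 1
        trace.append(dict(registers))  # snapshot for display
    return trace


program = """
LOAD R0 5
ADD R0 7
LOAD R1 1
HALT
"""
for step, regs in enumerate(simulate(assemble(program)), start=1):
    print(step, regs)
```

Keeping the assembler and the simulator as separate functions mirrors the two stages named in the abstract: the program is validated and stored first, and only then stepped through while the register snapshots are shown.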
2	Σχεδίαση & υλοποίηση ενός μικροϋπολογιστικού συστήματος βασισμένου σε μια επαυξημένη σχετικά απλή CPU / Design and implementation of a microcomputer system based on an enhanced Relatively Simple CPU Γαλετάκης, Εμμανουήλ 26 July 2012 This research project was carried out within the Interdepartmental Postgraduate Programme of Specialization in "Electronics and Information Processing" at the Department of Physics of the University of Patras. Its objective is the design and development of a basic FPGA-based microcomputer system in VHDL. The system is based on an enhanced version of Carpinelli's Relatively Simple CPU and supports the parallel connection of a range of peripheral devices and subcircuits, with parallel input and output ports and interrupts. The first chapter presents the full design of such a system and studies the structure of the individual components that compose it. The second chapter presents the description of the microcomputer system in VHDL and its full simulation using ALTERA's Quartus v7.2 software suite. The last chapter presents the implementation of the system on an FPGA using ALTERA's DE2 development board. 004.16 Central Processing Unit (CPU) Field-programmable gate array (FPGA) VHDL DE2
3	Parallellisering av Sliding Extensive Cancellation Algorithm (ECA-S) för passiv radar med OpenMP / Parallelization of Sliding Extensive Cancellation Algorithm (ECA-S) for Passive Radar with OpenMP Johansson Hultberg, Andreas January 2021 Software parallelization has gained increasing interest since the manufacturing of ever-smaller transistors in integrated circuits has begun to stagnate. This has led to the development of new processing units with an increasing number of cores. Parallelization is an optimization technique that allows the user to utilize parallel processes in order to streamline algorithm flows. This study examines the performance benefits that a passive bistatic radar system can obtain through parallelization and code refactoring. The study focuses mainly on the use of parallel instructions within a shared memory model on a Central Processing Unit (CPU) through the OpenMP application programming interface. Quantitative data is collected to compare the runtime of the most central algorithm in the passive radar system, the Extensive Cancellation Algorithm (ECA). ECA can be used to suppress unwanted clutter in the surveillance signal, with the purpose of creating clear target detections of airborne objects. The algorithm is, however, computationally demanding, which has led to the development of faster versions such as the Sliding ECA (ECA-S). Despite the ongoing development, the algorithm is still relatively computationally demanding, which can lead to long execution times within the radar system. In this study, a MATLAB implementation of ECA-S is converted to C in order to take advantage of the fast execution time of the procedural programming language. Parallelism is introduced within the converted algorithm using Intel's threading methodology and is then applied on two different operating systems.
The study shows that a speedup by a factor of 24 can be obtained in C while still ensuring the correctness of the results. The results also show that refactoring a MATLAB algorithm can result in 73% faster code and that C-MEX implementations are twice as slow as a pure C implementation. Finally, the study indicates that real-time operation can be achieved for a passive bistatic radar system by using the C programming language with parallel instructions within a shared memory model on a CPU. Extensive Cancellation Algorithm (ECA) ECA-S Passive Radar Optimization OpenMP Parallel Processing Shared Memory Model Central Processing Unit (CPU) Signal Processing Computer Engineering Software Engineering
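The shared-memory pattern this abstract relies on, a loop over independent batches split across threads, can be sketched as follows. Several things here are assumptions made for illustration only: a Python thread pool stands in for the OpenMP parallel loop used in the thesis's C code, `process_batch` is a placeholder (mean removal) rather than the real ECA-S filter, and all function names are hypothetical. In CPython, pure-Python work like this gains little from threads, so the sketch shows the structure, not the speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(samples, batch_size):
    """Cut the signal into consecutive, non-overlapping batches."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


def process_batch(batch):
    """Placeholder per-batch work: subtract the batch mean (NOT the real ECA-S filter)."""
    mean = sum(batch) / len(batch)
    return [x - mean for x in batch]


def process_sequential(samples, batch_size):
    """Reference version: handle the batches one after another."""
    out = []
    for batch in split_into_batches(samples, batch_size):
        out.extend(process_batch(batch))
    return out


def process_parallel(samples, batch_size, workers=4):
    """Hand independent batches to a pool of threads that share the same memory."""
    batches = split_into_batches(samples, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process_batch, batches))  # map keeps batch order
    return [x for batch in results for x in batch]


signal = [float(i % 7) for i in range(1000)]
assert process_parallel(signal, 100) == process_sequential(signal, 100)
```

Comparing the parallel result with the sequential one reflects the abstract's emphasis on obtaining a speedup while still ensuring the correctness of the results.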
4	Cooperative Execution of OpenCL Programs on Multiple Heterogeneous Devices Pandit, Prasanna Vasant January 2013 Computing systems have become heterogeneous with the increasing prevalence of multi-core CPUs, Graphics Processing Units (GPUs) and other accelerators in them. OpenCL has emerged as an attractive programming framework for heterogeneous systems. However, utilizing multiple devices in OpenCL is a challenge as it requires the programmer to explicitly map data and computation to each device. Utilizing multiple devices simultaneously to speed up execution of a kernel is even more complex, as the relative execution time of the kernel on different devices can vary significantly. Also, after each kernel execution, a coherent version of the data needs to be established. This means that, in order to utilize all devices effectively, the programmer has to spend considerable time and effort to distribute work across all devices, keep track of modified data in these devices and correctly perform a merging step to put the data together. Further, the relative performance of a program may vary across different inputs, which means a statically determined work distribution may not work well. In this work, we present FluidiCL, an OpenCL runtime that takes a program written for a single device and uses multiple heterogeneous devices to execute each kernel. The runtime performs dynamic work distribution and cooperatively executes each kernel on all available devices. Since we consider a setup with devices having discrete address spaces, our solution ensures that execution of OpenCL work-groups on devices is adjusted by taking into account the overheads for data management. The data transfers and data merging needed to ensure coherence are handled transparently without requiring any effort from the programmer. FluidiCL also does not require prior training or profiling and is completely portable across different machines.
Because it is dynamic, the runtime is able to adapt to system load. We have developed several optimizations for improving the performance of FluidiCL. We evaluate the runtime across different sets of devices. On a machine with an Intel quad-core processor and an NVidia Fermi GPU, FluidiCL shows a geomean speedup of nearly 64% over the GPU, 88% over the CPU and 14% over the best of the two devices in each benchmark. In all benchmarks, performance of our runtime comes to within 13% of the best of the two devices. FluidiCL shows similar results on a machine with a quad-core CPU and an NVidia Kepler GPU, with up to 26% speedup over the best of the two. We also present results considering an Intel Xeon Phi accelerator and a CPU and find that FluidiCL performs up to 45% faster than the best of the two devices. We extend FluidiCL from a CPU–GPU scenario to a three-device setup having a quad-core CPU, an NVidia Kepler GPU and an Intel Xeon Phi accelerator and find that FluidiCL obtains a geomean improvement of 6% in kernel execution time over the best of the three devices considered in each case. Heterogeneous Computers Open Computing Language FluidiCL Fluidic Kernels OpenCL Application Programming Interface Graphics Processing Unit (GPU) Central Processing Unit (CPU) Computer Architecture FluidiCL Runtime Heterogeneous OpenCL Runtime OpenCL Programs CPU–GPU Systems Computer Engineering
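The idea of dynamic work distribution mentioned in this abstract can be illustrated with a small simulation in which devices of different speeds pull chunks of work-groups from a shared pool as they become idle. This is a simplified sketch, not FluidiCL's actual scheduling scheme: the device names, the per-work-group costs and the chunk size are hypothetical, and the data transfer and merging overheads that the abstract says FluidiCL accounts for are ignored.

```python
import heapq


def distribute(num_groups, chunk, device_cost):
    """Simulate devices pulling chunks of work-groups from a shared pool.

    device_cost maps a (hypothetical) device name to its time per work-group.
    Returns (finish time, number of work-groups executed per device).
    """
    # Priority queue of (time at which the device becomes idle, device name).
    idle = [(0.0, name) for name in sorted(device_cost)]
    heapq.heapify(idle)
    done = {name: 0 for name in device_cost}
    next_group = 0
    finish = 0.0
    while next_group < num_groups:
        now, name = heapq.heappop(idle)            # earliest idle device
        size = min(chunk, num_groups - next_group)  # last chunk may be short
        next_group += size
        done[name] += size
        end = now + size * device_cost[name]
        finish = max(finish, end)
        heapq.heappush(idle, (end, name))
    return finish, done


# Hypothetical costs: the GPU is three times faster per work-group than the CPU.
finish, done = distribute(120, 10, {"cpu": 3.0, "gpu": 1.0})
print(finish, done)

# For comparison: everything on the GPU alone, using the same cost model.
gpu_only, _ = distribute(120, 10, {"gpu": 1.0})
print(gpu_only)
```

Because each device simply takes more work when it finishes, the faster device ends up with the larger share without any prior profiling, which is the property the abstract contrasts with a statically determined split.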
5	Investigation of hierarchical deep neural network structure for facial expression recognition Motembe, Dodi 01 1900 Facial expression recognition (FER) remains a challenging task, and machines struggle to comprehend effectively the dynamic shifts in facial expressions of human emotions. The existing systems that have proven to be effective consist of deeper network structures that need powerful and expensive hardware. The deeper the network is, the longer the training and the testing take. Many systems use expensive GPUs to make the process faster. To remedy the above challenges while maintaining the main goal of improving the recognition accuracy, we create a generic hierarchical structure with variable settings. This generic structure has a hierarchy of three convolutional blocks, two dropout blocks and one fully connected block. From this generic structure we derived four different network structures (cases) to be investigated according to their performance. From each case, we again derived six network structures by varying the parameters under analysis: the size of the filters of the convolutional maps, the size of the max-pooling, and the number of convolutional maps. In total, we have 24 network structures to investigate, six per case. After many repeated experiments, case 1a emerged as the top performer in the group of case 1, and case 2a, case 3c and case 4c outperformed the others in their respective groups. The comparison of the winners of the four groups indicates that case 2a is the optimal structure with optimal parameters; it outperformed the other group winners. The best network structure was chosen by considering the minimum, average and maximum accuracy over 15 repeated training runs and by analysing the results.
All 24 proposed network structures were tested using two of the most widely used FER datasets, CK+ and JAFFE. After repeated simulations, the results demonstrate that our inexpensive optimal network architecture achieved 98.11% accuracy on the CK+ dataset. We also tested our optimal network architecture on the JAFFE dataset, where the experimental results show 84.38% accuracy, using just a standard CPU and simpler procedures. We also compared the four group winners with the performance of other existing FER models recorded recently in two studies. These FER models used the same two datasets, CK+ and JAFFE. Three of our four group winners (case 1a, case 2a and case 4c) scored only 1.22% below the accuracy of the top-performing model on the CK+ dataset, and two of our network structures, case 2a and case 3c, came in third, beating other models on the JAFFE dataset. / Electrical and Mining Engineering Facial Expression Recognition (FER) Deep Learning Convolutional Neural Network (CNN) Deep Convolutional Neural Network (DCNN) Artificial Intelligence Face Detection Facial Feature Extraction Central Processing Unit (CPU) Graphics Processing Unit (GPU)
6	Efficient Betweenness Centrality Computations on Hybrid CPU-GPU Systems Mishra, Ashirbad January 2016 Analysis of networks is quite interesting, because networks can be interpreted for several purposes. Various features require different metrics to measure and interpret them. Measuring the relative importance of each vertex in a network is one of the most fundamental building blocks in network analysis. Betweenness Centrality (BC) is one such metric that plays a key role in many real-world applications. BC is an important graph analytics application for large-scale graphs. However, it is one of the most computationally intensive kernels to execute, and measuring centrality in billion-scale graphs is quite challenging. While there are several existing efforts towards parallelizing BC algorithms on multi-core CPUs and many-core GPUs, in this work we propose a novel fine-grained CPU-GPU hybrid algorithm that partitions a graph into two partitions, one each for CPU and GPU. Our method performs BC computations for the graph on both the CPU and GPU resources simultaneously, resulting in a very small number of CPU-GPU synchronizations and hence less time spent on communication. The BC algorithm consists of two phases, the forward phase and the backward phase. In the forward phase, we initially find the paths that are needed by either partition, after which each partition is executed on its processor in an asynchronous manner. We initially compute border matrices for each partition, which store the relative distances between each pair of border vertices in a partition. The matrices are used in the forward phase calculations of all the sources. In this way, our hybrid BC algorithm leverages the multi-source property inherent in the BC problem. We present a proof of correctness and the bounds for the number of iterations for each source.
We also perform a novel hybrid and asynchronous backward phase, in which each partition communicates with the other only when there is a path that crosses the partition, so it performs minimal CPU-GPU synchronizations. We use a variety of implementations for our work, such as node-based and edge-based parallelism, which include data-driven and topology-based techniques. In the implementation we show that our method also works with a variable partitioning technique. The technique partitions the graph into unequal parts, accounting for the processing power of each processor. Thanks to this technique, our implementations achieve an almost equal percentage of utilization on both processors. For large-scale graphs, the size of the border matrix also becomes large, so we present various techniques to accommodate the matrix. The techniques use the properties inherent in the shortest path problem for reduction. We mention the drawbacks of performing shortest path computations on a large scale and also provide various solutions. Evaluations using a large number of graphs with different characteristics show that our hybrid approach, without variable partitioning and border matrix reduction, gives a 67% improvement in performance and 64-98.5% less CPU-GPU communication than the state-of-the-art hybrid algorithm based on the popular Bulk Synchronous Paradigm (BSP) approach implemented in TOTEM. This shows our algorithm's strength, which is that it reduces the need for larger synchronizations. Implementing variable partitioning, border matrix reduction and backward phase optimizations on our hybrid algorithm provides up to 10x speedup. We compare our optimized implementation with CPU and GPU standalone codes based on our forward phase and backward phase kernels, and show around 2-8x speedup over the CPU-only code; our implementation can also accommodate large graphs that cannot be accommodated in the GPU-only code.
We also show that our method's performance is competitive with the state-of-the-art multi-core CPU implementations and is 40-52% better than GPU implementations on large graphs. We show the drawbacks of CPU-only and GPU-only implementations and try to motivate the reader about the challenges that graph algorithms face in large-scale computing, suggesting that a hybrid or distributed approach is a better way of overcoming the hurdles. Network Analysis Betweenness Centrality CPU-GPU Hybrid Systems Space Complexity Border Matrix Reduction Distributed Computing Graphics Processing Unit (GPU) Graph Partitioning Bulk Synchronous Paradigm (BSP) Central Processing Unit (CPU) Parallel Processing High Performance Computing Network Theory System Analysis Graph Theory Hybrid Betweenness Centrality Algorithm Computer Science
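The forward and backward phases named in this abstract can be seen in the standard sequential formulation of betweenness centrality (Brandes' algorithm), sketched below for a small unweighted, undirected graph. This is the textbook single-processor version, not the hybrid CPU-GPU algorithm proposed in the thesis; the adjacency-list input format and the function name are choices made for this example.

```python
from collections import deque


def betweenness_centrality(graph):
    """Brandes' algorithm for an unweighted, undirected graph given as adjacency lists."""
    bc = {v: 0.0 for v in graph}
    for source in graph:
        # Forward phase: BFS from the source, counting shortest paths (sigma).
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        preds = {v: [] for v in graph}
        sigma[source] = 1
        dist[source] = 0
        order = []  # vertices in non-decreasing distance from the source
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:  # first visit
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Backward phase: accumulate dependencies in reverse BFS order.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                bc[w] += delta[w]
    # In an undirected graph every vertex pair is counted from both endpoints.
    return {v: score / 2.0 for v, score in bc.items()}


# A path a - b - c: only b lies between another pair of vertices.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(betweenness_centrality(path))
```

Each source vertex is processed independently of the others, which is the multi-source property that parallel and hybrid implementations such as the one described above build on.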
