Spelling suggestions: "subject:"multithreading"" "subject:"multihreading""
1 |
Lane-Based Front Vehicle Detection and Its AccelerationChen, Jie-Qi 02 January 2013 (has links)
Based on .Net Framework4.0 development platform and Visual C# language, this thesis presents various methods of performing lane detection and preceding vehicle detection/tracking with code optimization and acceleration to reduce the execution time. The thesis consists of two major parts: vehicle detection and tracking. In the part of detection, driving lanes are identified first and then the preceding vehicles between the left lane and right lane are detected using the shadow information beneath vehicles. In vehicle tracking, three-pass search method is used to find the matched vehicles based on the detection results in the previous frames. According to our experiments, the preprocessing (including color-intensity conversion) takes a significant portion of total execution time. We propose different methods to optimize the code and speed up the software execution using pure C # pointers, OPENCV, and OPENCL etc. Experimental results show that the fastest detection/tracking speed can reach more than 30 frames per second (fps) using PC with i7-2600 3.4Ghz CPU. Except for OPENCV with execution rate of 18 fps, the rest of methods have up to 28 fps processing rate of almost the real-time speed. We also add the auxiliary vehicle information, such as preceding vehicle distance and vehicle offset warning.
|
2 |
Energy-efficient mechanisms for managing on-chip storage in throughput processorsGebhart, Mark Alan 05 July 2012 (has links)
Modern computer systems are power or energy limited. While the number of transistors per chip continues to increase, classic Dennard voltage scaling has come to an end. Therefore, architects must improve a design's energy efficiency to continue to increase performance at historical rates, while staying within a system's power limit. Throughput processors, which use a large number of threads to tolerate
memory latency, have emerged as an energy-efficient platform for
achieving high performance on diverse workloads and are found in
systems ranging from cell phones to supercomputers. This work focuses
on graphics processing units (GPUs), which contain thousands of
threads per chip.
In this dissertation, I redesign the on-chip storage system of a
modern GPU to improve energy efficiency. Modern GPUs contain very large register files that consume between 15%-20% of the
processor's dynamic energy. Most values written into the register
file are only read a single time, often within a few instructions of
being produced. To optimize for these patterns, we explore various
designs for register file hierarchies. We study both a
hardware-managed register file cache and a software-managed operand register file. We evaluate the energy tradeoffs in varying the number of levels and the capacity of each level in the hierarchy. Our most efficient design reduces register file energy by 54%.
Beyond the register file, GPUs also contain on-chip scratchpad
memories and caches. Traditional systems have a fixed partitioning
between these three structures. Applications have diverse
requirements and often a single resource is most critical to
performance. We propose to unify the register file, primary data
cache, and scratchpad memory into a single structure that is
dynamically partitioned on a per-kernel basis to match the
application's needs.
The techniques proposed in this dissertation improve the utilization of on-chip memory, a scarce resource for systems with a large number of hardware threads. Making more efficient use of on-chip memory both improves performance and reduces energy. Future efficient systems will be achieved by the combination of several such techniques which
improve energy efficiency. / text
|
3 |
Design and Implementation of Video View Synthesis for the CloudPouladzadeh, Parvaneh January 2017 (has links)
In multi-view video applications, view synthesis is a computationally intensive task that needs to be done correctly and efficiently in order to deliver a seamless user experience. In order to provide fast and efficient view synthesis, in this thesis, we present a cloud-based implementation that will be especially beneficial to mobile users whose devices may not be powerful enough for high quality view synthesis. Our proposed implementation balances the view synthesis algorithm’s components across multiple threads and utilizes the computational capacity of modern CPUs for faster and higher quality view synthesis. For arbitrary view generation, we utilize the depth map of the scene from the cameras’ viewpoint and estimate the depth information conceived from the virtual camera. The estimated depth is then used in a backward direction to warp the cameras’ image onto the virtual view. Finally, we use a depth-aided inpainting strategy for the rendering step to reduce the effect of disocclusion regions (holes) and to paint the missing pixels. For our cloud implementation, we employed an automatic scaling feature to offer elasticity in
order to adapt the service load according to the fluctuating user demands. Our performance results using 4 multi-view videos over 2 different scenarios show that our proposed system achieves average improvement of 3x speedup, 87% efficiency, and 90% CPU utilization for the parallelizable parts of the algorithm.
|
4 |
Performance evaluation of Web Workers API and OpenMPHellberg, Linus, Bhamidipati, Bhargava January 2022 (has links)
Background - Web browsers and and the web programs on them are being used now more than ever in a manner similar to traditional software. But with the increase in the demand for performance on bigger and bigger web applications, there is a need for making the web applications perform faster and better. Introducing parallelism to a normally single threaded system is one popular way of introducing more performance. Objectives - We will implement proven and workable programs, created with OpenMP, that will be translated to JavaScript. These JavaScript applications will use the WebWorkers API to achieve similar levels of parallelism as the OpenMP applications. Methods - To implement and gather results from the all of the various programs, we will be using visual studio code and its live server extension that it hosts to run and compare the JavaScript implementations. The selected OpenMP applications to be measured and translated were primarily taken and selected from a benchmark suite that hosts programs that have already been written in a traditional parallel computing model, such as OpenMP. The performance of these OpenMP programs and Web Workers applications will be analyzed and compared in the second portion of the research where results will be gathered. Results - JavaScript was proven to perform worse than OpenMP in every situation tested. Though this was expected, there were also some situations where JavaScript applications were performing close to the OpenMP programs. Conclusions - Ultimately, using Web Workers is recommended for what they were designed to do. To help alleviate the main thread to keep the web program running smoothly. For the heavy computational tasks that we were experimenting on, JavaScript did not do a sufficient enough job compared against the OpenMP applications. When measuring the workers we did not get any results for any applicationsthat was very close to what OpenMP achieved. Thus Web Workers are really only suited for easy problems that needs to be done repeatedly. They lack the efficiency for any complicated algorithms to be worth implementing
|
5 |
Transforming TLP into DLP with the dynamic inter-thread vectorization architecture / Transformer le TLP en DLP avec l'architecture de vectorisation dynamique inter-threadKalathingal, Sajith 13 December 2016 (has links)
De nombreux microprocesseurs modernes mettent en œuvre le multi-threading simultané (SMT) pour améliorer l'efficacité globale des processeurs superscalaires. SMT masque les opérations à longue latence en exécutant les instructions de plusieurs threads simultanément. Lorsque les threads exécutent le même programme (cas des applications SPMD), les mêmes instructions sont souvent exécutées avec des entrées différentes. Les architectures SMT traditionnelles exploitent le parallélisme entre threads, ainsi que du parallélisme de données explicite au travers d'unités d'exécution SIMD. L'exécution SIMD est efficace en énergie car le nombre total d'instructions nécessaire pour exécuter un programme est significativement réduit. Cette réduction du nombre d'instructions est fonction de la largeur des unités SIMD et de l'efficacité de la vectorisation. L'efficacité de la vectorisation est cependant souvent limitée en pratique. Dans cette thèse, nous proposons l'architecture de vectorisation dynamique inter-thread (DITVA) pour tirer parti du parallélisme de données implicite des applications SPMD en assemblant dynamiquement des instructions vectorielles à l'exécution. DITVA augmente un processeur à exécution dans l'ordre doté d'unités SIMD en lui ajoutant un mode d'exécution vectorisant entre threads. Lorsque les threads exécutent les mêmes instructions simultanément, DITVA vectorise dynamiquement ces instructions pour assembler des instructions SIMD entre threads. Les threads synchronisés sur le même chemin d'exécution partagent le même flot d'instructions. Pour conserver du parallélisme de threads, DITVA groupe de manière statique les threads en warps ordonnancés indépendamment. DITVA tire parti des unités SIMD existantes et maintient la compatibilité binaire avec les architectures CPU existantes. / Many modern microprocessors implement Simultaneous Multi-Threading (SMT) to improve the overall efficiency of superscalar CPU. SMT hides long latency operations by executing instructions from multiple threads simultaneously. SMT may execute threads of different processes, threads of the same processes or any combination of them. When the threads are from the same process, they often execute the same instructions with different data most of the time, especially in the case of Single-Program Multiple Data (SPMD) applications.Traditional SMT architecture exploit thread-level parallelism and with the use of SIMD execution units, they also support explicit data-level parallelism. SIMD execution is power efficient as the total number of instructions required to execute a complete program is significantly reduced. This instruction reduction is a factor of the width of SIMD execution units and the vectorization efficiency. Static vectorization efficiency depends on the programmer skill and the compiler. Often, the programs are not optimized for vectorization and hence it results in inefficient static vectorization by the compiler.In this thesis, we propose the Dynamic Inter-Thread vectorization Architecture (DITVA) to leverage the implicit data-level parallelism in SPMD applications by assembling dynamic vector instructions at runtime. DITVA optimizes an SIMD-enabled in-order SMT processor with inter-thread vectorization execution mode. When the threads are running in lockstep, similar instructions across threads are dynamically vectorized to form a SIMD instruction. The threads in the convergent paths share an instruction stream. When all the threads are in the convergent path, there is only a single stream of instructions. To optimize the performance in such cases, DITVA statically groups threads into fixed-size independently scheduled warps. DITVA leverages existing SIMD units and maintains binary compatibility with existing CPU architectures.
|
6 |
Improving memory consumption and performance scalability of HPC applications with multi-threaded network communications / Amélioration de la consommation mémoire et de l'extensibilité des performances des applications HPC par le multi-threading des communications réseauxDidelot, Sylvain 12 June 2014 (has links)
La tendance en HPC est à l'accroissement du nombre de coeurs par noeud de calcul pour une quantité totale de mémoire par noeud constante. A large échelle, l'un des principaux défis pour les applications parallèles est de garder une faible consommation mémoire. Cette thèse présente une couche de communication multi-threadée sur Infiniband, laquelle fournie de bonnes performances et une faible consommation mémoire. Nous ciblons les applications scientifiques parallélisées grâce à la bibliothèque MPI ou bien combinées avec un modèle de programmation en mémoire partagée. En partant du constat que le nombre de connexions réseau et de buffers de communication est critique pour la mise à l'échelle des bibliothèques MPI, la première contribution propose trois approches afin de contrôler leur utilisation. Nous présentons une topologie virtuelle extensible et entièrement connectée pour réseaux rapides orientés connexion. Dans un contexte agrégeant plusieurs cartes permettant d'ajuster dynamiquement la configuration des buffers réseau utilisant la technologie RDMA. La seconde contribution propose une optimisation qui renforce le potentiel d'asynchronisme des applications MPI, laquelle montre une accélération de deux des communications. La troisième contribution évalue les performances de plusieurs bibliothèques MPI exécutant une application de modélisation sismique en contexte hybride. Les expériences sur des noeuds de calcul jusqu'à 128 coeurs montrent une économie de 17 % sur la mémoire. De plus, notre couche de communication multi-threadée réduit le temps d'exécution dans le cas où plusieurs threads OpenMP participent simultanément aux communications MPI. / A recent trend in high performance computing shows a rising number of cores per compute node, while the total amount of memory per compute node remains constant. To scale parallel applications on such large machines, one of the major challenges is to keep a low memory consumption. This thesis develops a multi-threaded communication layer over Infiniband which provides both good performance of communications and a low memory consumption. We target scientific applications parallelized using the MPI standard in pure mode or combined with a shared memory programming model. Starting with the observation that network endpoints and communication buffers are critical for the scalability of MPI runtimes, the first contribution proposes three approaches to control their usage. We introduce a scalable and fully-connected virtual topology for connection-oriented high-speed networks. In the context of multirail configurations, we then detail a runtime technique which reduces the number of network connections. We finally present a protocol for dynamically resizing network buffers over the RDMA technology. The second contribution proposes a runtime optimization to enforce the overlap potential of MPI communications, showing a 2x improvement factor on communications. The third contribution evaluates the performance of several MPI runtimes running a seismic modeling application in a hybrid context. On large compute nodes up to 128 cores, the introduction of OpenMP in the MPI application saves up to 17 % of memory. Moreover, we show a performance improvement with our multi-threaded communication layer where the OpenMP threads concurrently participate to the MPI communications
|
7 |
Design of a Multi-Core Multi-thread Floating-Point Processor and Its Application in Computer GraphicsYeh, Chia-Yu 06 September 2011 (has links)
Graphics processing unit (GPU) designs usually adopts various computer architecture techniques to boost the computation speed, including single-instruction multiple data (SIMD), very-long-instruction word (VLIW), multi-threading, and/or multi-core. In OpenGL ES 2.0, user programmable vertex shader (VS) hardware unit can be designed using vectored SIMD computation unit so that it can efficiently compute the matrix-vector multiplication, one of the key operations in vertex transformation. Recently, high-performance GPU, such as Telsa series from nVidia, is designed with many-core architectures with each core responsible for scalar operations. The intention is to allow for efficient execution of general-purpose computations in addition to the specialized graphics computations. In this thesis, we design a scalar-based multi-threaded GPU design that is composed of four scalar processors, one special-function unit, and can execute multi-threaded instructions. We use the example of vertex transformation to demonstrate execution efficiency of the scalar-based multi-threaded GPU. We also make comparison with the vector-based SIMD GPU.
|
8 |
Improving Performance Of Network Intrusion Detection Systems Through Concurrent MechanismsAtakan, Mustafa 01 January 2004 (has links) (PDF)
As the bandwidth of present networks gets larger than the past, the demand of Network Intrusion Detection Systems (NIDS) that function in real time becomes the major requirement for high-speed networks. If these systems are not fast enough to process all network traffic passing, some malicious security violations may take role using this drawback. In order to make that kind of applications schedulable, some concurrency mechanism is introduced to the general flowchart of their algorithm. The principal aim is to fully utilize each resource of the platform and overlap the independent parts of the applications. In the sense of this context, a generic multi-threaded infrastructure is designed and proposed. The concurrency metrics of the new system is analyzed and compared with the original ones.
|
9 |
Dynamic Analysis of Multithreaded Embedded Software to Expose Atomicity ViolationsJanuary 2016 (has links)
abstract: Concurrency bugs are one of the most notorious software bugs and are very difficult to manifest. Significant work has been done on detection of atomicity violations bugs for high performance systems but there is not much work related to detect these bugs for embedded systems. Although criteria to claim existence of bugs remains same, approach changes a bit for embedded systems. The main focus of this research is to develop a systemic methodology to address the issue from embedded systems perspective. A framework is developed which predicts the access interleaving patterns that may violate atomicity using memory references of shared variables and provides support to force and analyze these schedules for any output change, system fault or change in execution path. / Dissertation/Thesis / Masters Thesis Computer Science 2016
|
10 |
Tile Based Procedural Terrain Generation in Real-Time : A Study in PerformanceGrelsson, David January 2014 (has links)
Context. Procedural Terrain Generation refers to the algorithmical creation of terrains with limited or no user input. Terrains are an important piece of content in many video games and other forms of simulations. Objectives. In this study a tile-based approach to creating endless terrains is investigated. The aim is to find if real-time performance is possible using the proposed method and possible performance increases from utilization of the GPU. Methods. An application that allows the user to walk around on a seemingly endless terrain is created in two versions, one that exclusively utilizes the CPU and one that utilizes both CPU and GPU. An experiment is then conducted that measures performance of both versions of the application. Results. Results showed that real-time performance is indeed possible for smaller tile sizes on the CPU. They also showed that the application benefits significantly from utilizing the GPU. Conclusions. It is concluded that the tile-based approach works well and creates a functional terrain. However performance is too poor for the technique to be utilized in e.g. a video game.
|
Page generated in 0.044 seconds