1

Enabling high-performance, mixed-signal approximate computing

St Amant, Renee Marie 07 July 2014 (has links)
For decades, the semiconductor industry enjoyed exponential improvements in microprocessor power and performance with the device scaling of successive technology generations. Scaling limitations at sub-micron technologies, however, have ceased to provide these historical performance improvements within a limited power budget. While device scaling provides a larger number of transistors for the same chip area, a growing percentage of the chip will have to be powered off at any given time due to power constraints. As such, the architecture community has focused on energy-efficient designs and is looking to specialized hardware to provide gains in performance. A focus on energy efficiency, along with increasingly less reliable transistors due to device scaling, has led to research in the area of approximate computing, where accuracy is traded for energy efficiency when precise computation is not required. There is a growing body of approximation-tolerant applications that, for example, compute on noisy or incomplete data, such as real-world sensor inputs, or make approximations to decrease the computational load in the analysis of cumbersome data sets. These approximation-tolerant applications span domains such as machine learning, image processing, robotics, and financial analysis, among others. Since the advent of the modern processor, computing models have largely presumed the attribute of accuracy. A willingness to relax accuracy requirements, however, with the goal of gaining energy efficiency, warrants the re-investigation of the potential of analog computing. Analog hardware offers the opportunity for fast and low-power computation; however, it presents challenges in terms of accuracy. Where analog compute blocks have been applied to solve fixed-function problems, general-purpose computing has relied on digital hardware implementations that provide generality and programmability. The work presented in this thesis aims to answer the following questions: Can analog circuits be successfully integrated into general-purpose computing to provide performance and energy savings? And what is required to address the historical analog challenges of inaccuracy, programmability, and a lack of generality to enable such an approach? This thesis investigates a neural approach as a means to address those challenges and to enable the use of analog circuits in general-purpose, high-performance computing. The first piece of this work investigates the use of analog circuits at the microarchitecture level in the form of an analog neural branch predictor. The task of branch prediction can tolerate imprecision, as roll-back mechanisms correct branch mispredictions and application-level accuracy remains unaffected. We show that analog circuits enable the implementation of a highly accurate neural prediction algorithm that is infeasible to implement in the digital domain. The second piece presents a neural accelerator that targets approximation-tolerant code. Analog neural acceleration provides an application speedup of 3.3x and energy savings of 12.1x with a quality loss below 10% for all but one approximation-tolerant benchmark. These results show that, using a neural approach, analog circuits can be applied to provide performance and energy efficiency in high-performance, general-purpose computing. / text
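The neural prediction scheme referred to here is perceptron-style: each branch keeps a vector of signed weights, the prediction is the sign of the dot product of those weights with the recent branch-outcome history, and training nudges the weights on mispredictions or low confidence. A minimal digital sketch of that scheme (classic perceptron prediction in the style of Jiménez and Lin, not the thesis's analog circuit; the table size, history length, and threshold below are illustrative assumptions):

    #include <cstdint>
    #include <cstdlib>

    // Minimal perceptron branch predictor sketch (illustrative sizes).
    constexpr int HIST = 16;           // global history length
    constexpr int TABLE = 1024;        // number of perceptrons
    constexpr int THETA = int(1.93 * HIST + 14);  // training threshold

    struct PerceptronPredictor {
        int8_t w[TABLE][HIST + 1] = {};  // weights; w[.][0] is the bias
        int8_t h[HIST] = {};             // outcome history as +1/-1

        // Prediction output: y >= 0 predicts taken.
        int predict(uint32_t pc) const {
            const int8_t* row = w[pc % TABLE];
            int y = row[0];
            for (int i = 0; i < HIST; ++i) y += row[i + 1] * h[i];
            return y;
        }

        // Train on a misprediction or when |y| is below the threshold.
        // (Weight saturation is omitted for brevity.)
        void update(uint32_t pc, bool taken, int y) {
            int t = taken ? 1 : -1;
            int8_t* row = w[pc % TABLE];
            if ((y >= 0) != taken || std::abs(y) <= THETA) {
                row[0] += t;
                for (int i = 0; i < HIST; ++i) row[i + 1] += t * h[i];
            }
            for (int i = HIST - 1; i > 0; --i) h[i] = h[i - 1];  // shift history
            h[0] = t;
        }
    };

The expensive part is the wide signed summation on every prediction; the analog implementation described in the abstract performs this summation with analog circuits (for instance, by summing scaled currents), which is how it sidesteps the cost of a wide digital adder tree.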
2

GPGPU design space exploration using neural networks

Jooya, Ali 28 September 2018 (has links)
General-purpose computing on graphics processing units (GPGPU) gained attention in 2006 with NVIDIA's first Tesla graphics processing unit (GPU), which could perform high-performance computing. Ever since, researchers have been working on software and hardware techniques to improve the efficiency of running general-purpose applications on GPUs. Efficiency can be evaluated using metrics such as energy consumption and throughput, and is defined based on the requirements of the system; I define it as obtaining high throughput while consuming minimum energy. GPUs are equipped with a large number of processing units, high memory bandwidth, and different types of on-chip memory and caches. To run efficiently, an application should maximize the utilization of GPU resources. Therefore, a good correspondence between the computing and memory resources of the GPU and those of the application is critical. Since an application's requirements are fixed, the GPU's configuration should be tuned to these requirements. Having models to study and predict the power consumption and throughput of running a GPGPU application on a given GPU configuration can help achieve high efficiency. The main purpose of this dissertation is to find a GPU configuration that best matches the requirements of a given application. I propose three models that predict a GPU configuration that runs an application with maximum throughput while consuming minimum energy. The first model is a fast, low-cost and effective approach to optimizing resource allocation in future GPUs; it finds the optimal GPU configuration for different available chip real-estate budgets. The second model considers the power consumption and throughput of a GPGPU application as functions of the GPU configuration parameters, and accurately predicts both for the modeled application. I then propose to accelerate the process of building the model using optimization techniques and quantum annealing. I use the proposed model to explore the GPU configuration space of different applications, applying multiobjective optimization to find the configurations that offer minimum power consumption and maximum throughput. Finally, using clustering and classification techniques, I develop models that relate the power consumption and throughput of GPGPU applications to code attributes. Both models can accurately predict the optimum configuration for any given GPGPU application. To build these models I used different machine learning techniques and optimization methods, such as Pareto-front analysis and the knapsack optimization problem. I validated the models' predictions against simulation results and showed that the predictions are accurate. These models can be used by GPGPU programmers to identify the architectural parameters that most affect an application's power consumption and throughput; this information can be translated into software optimization opportunities. The models can also be implemented as part of a compiler to help it make the best optimization decisions. Moreover, GPU manufacturers could gain insight into which architectural parameters would profit GPGPU applications the most in terms of power and performance, and invest in these. / Graduate
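The multiobjective step described above reduces to extracting, from the set of predicted (power, throughput) points, the configurations that no other configuration dominates. A minimal sketch of that Pareto filter (types and names are illustrative, not taken from the dissertation):

    #include <vector>

    struct Config {
        int id;             // index into the GPU configuration space
        double power;       // predicted power (lower is better)
        double throughput;  // predicted throughput (higher is better)
    };

    // a dominates b if it is no worse in both objectives and better in one.
    static bool dominates(const Config& a, const Config& b) {
        return a.power <= b.power && a.throughput >= b.throughput &&
               (a.power < b.power || a.throughput > b.throughput);
    }

    // O(n^2) scan: keep every configuration that no other one dominates.
    std::vector<Config> paretoFront(const std::vector<Config>& all) {
        std::vector<Config> front;
        for (const Config& c : all) {
            bool dominated = false;
            for (const Config& o : all)
                if (dominates(o, c)) { dominated = true; break; }
            if (!dominated) front.push_back(c);
        }
        return front;
    }

The resulting front is exactly the set of power/throughput trade-off configurations from which a designer can pick according to the power budget at hand.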
3

Enhancing productivity and performance portability of OpenCL applications on heterogeneous systems using runtime optimizations

Lutz, Thibaut January 2015 (has links)
Initially driven by a strong need for increased computational performance in science and engineering, heterogeneous systems have become ubiquitous and are getting increasingly complex. The single-processor era has been replaced by multi-core processors, which have quickly been surrounded by satellite devices aiming to increase the throughput of the entire system. These auxiliary devices, such as Graphics Processing Units, Field Programmable Gate Arrays or other specialized processors, have very different architectures. This puts an enormous strain on programming models and software developers to take full advantage of the computing power at hand. Because of this diversity, and because the flexibility and portability needed to optimize for each target individually are unattainable in practice, heterogeneous systems typically remain vastly under-utilized. In this thesis, we explore two distinct ways to tackle this problem: providing automated, non-intrusive methods in the form of compiler tools, and implementing efficient abstractions that automatically tune parameters for a restricted domain. These are two complementary approaches to better utilizing the compute resources in heterogeneous systems. First, we explore a fully automated, compiler-based approach, in which a runtime system analyzes the computation flow of an OpenCL application and optimizes it across multiple compute kernels. This method can be deployed on any existing application transparently and replaces the significant software engineering effort otherwise spent tuning an application for a particular system. We show that this technique achieves speedups of up to 3x over unoptimized code and an average of 1.4x over manually optimized code for highly dynamic applications. Second, a library-based approach is designed to provide a high-level abstraction for complex problems in a specific domain: stencil computation. Using domain-specific techniques, the underlying framework optimizes the code aggressively. We show that even in a restricted domain, automatic tuning mechanisms and a robust architectural abstraction are necessary to improve performance. Using the abstraction layer, we demonstrate strong scaling of various applications to multiple GPUs, with a speedup of up to 1.9x on two GPUs and 3.6x on four.
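For readers unfamiliar with the restricted domain named above, a stencil updates every grid point from a fixed pattern of its neighbors. A five-point example of the kind of kernel such a framework generates and tunes might look like the following sketch (written as a CUDA kernel for concreteness, although the thesis itself targets OpenCL):

    // Five-point Jacobi-style stencil over an nx-by-ny grid.
    // A stencil framework would generate kernels of this shape and then
    // auto-tune tile sizes and the multi-GPU decomposition around them.
    __global__ void stencil5(const float* in, float* out, int nx, int ny) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x <= 0 || y <= 0 || x >= nx - 1 || y >= ny - 1) return;  // halo
        int i = y * nx + x;
        out[i] = 0.25f * (in[i - 1] + in[i + 1] + in[i - nx] + in[i + nx]);
    }

The per-point access pattern is fixed and known in advance, which is precisely what lets a domain-specific framework apply the aggressive tiling and device-partitioning optimizations described above.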
4

On continuous maximum flow image segmentation algorithm

Marak, Laszlo 28 March 2012 (has links) (PDF)
In recent years, with the advance of computing equipment and image acquisition techniques, the sizes, dimensions and content of acquired images have increased considerably. Unfortunately, a steadily widening gap has opened between classical and parallel programming paradigms and the performance they actually achieve on modern computer hardware. In this thesis we consider in depth one particular algorithm, the continuous maximum flow computation. We review in detail why this algorithm is useful and interesting, and we propose efficient and portable implementations on various architectures. We also examine how it performs, in terms of segmentation quality, on some recent problems of materials science and nano-scale biology.
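The continuous maximum flow computation referred to here is commonly formulated (in the style of Appleton and Talbot's coupled PDE system; the exact discretization used in the thesis may differ) as evolving a scalar potential P and a vector flow field F:

    \frac{\partial P}{\partial t} = -\nabla \cdot \mathbf{F}, \qquad
    \frac{\partial \mathbf{F}}{\partial t} = -\nabla P, \qquad
    \text{subject to } \lVert \mathbf{F} \rVert \le g(x),

where g(x) is a capacity map derived from the image (small across strong edges). The constraint is enforced by reprojecting F onto the ball of radius g(x) after each step, and at convergence P tends toward the indicator function of the segmented region. Because every update touches only a voxel and its nearest neighbors, the iteration maps naturally onto the parallel architectures considered in the thesis.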
5

Application specific programmable processors for reconfigurable self-powered devices

Nyländen, T. (Teemu) 27 April 2018 (has links)
Abstract The current Internet of Things solutions for simple measurement and monitoring tasks are evolving into ubiquitous sensor networks that constantly observe both our well-being and the conditions of our living environment. The coming omnipresent wireless infrastructure is expected to feature artificial intelligence capabilities that can interpret human actions, gestures and even needs. All of this will require processing power on a par with that of current mobile devices, and energy efficiency far beyond it. Current Internet of Things devices rely mostly on commercial low-power off-the-shelf micro-controllers. Optimized solely for low power while paying little attention to computing performance, the present solutions are far from achieving the energy efficiency, let alone the compute capability, required by future Internet of Things solutions. Since this domain is application-specific by nature, the use of general-purpose processors for signal processing tasks is counterintuitive; dedicated accelerator-based solutions are more likely to meet these strict demands. This thesis proposes one potential solution for achieving the necessary low energy, as well as the flexibility and performance requirements of the Internet of Things domain, in a cost-effective manner using reconfigurable heterogeneous processing solutions. A novel graphics-processing-unit-style accelerator for the Internet of Things application domain is presented. Since the accelerator can be reconfigured, it can be used for most applications of the Internet of Things domain, as well as for other application domains. The solution is assessed using two computer vision applications and is demonstrated to achieve an excellent combination of performance and energy efficiency. The accelerator is designed using an efficient and rapid co-design flow of software and hardware, featuring ease-of-development characteristics close to commercial off-the-shelf solutions, which also enables a cost-efficient design flow. / The Internet of Things will completely transform our living environment in the future. It will enable interactive environments in place of today's passive ones, and our surroundings will react to our actions and speech, and even to our emotions. This omnipresent wireless infrastructure will demand unprecedented computing performance combined with extreme energy efficiency. Current Internet of Things solutions rely almost entirely on commercial off-the-shelf general-purpose microcontrollers. These, however, are optimized solely from the standpoint of low power consumption, rather than for energy efficiency, let alone for the computing performance that the future Internet of Things will require. Yet the Internet of Things inherently demands application-specific computation, so using general-purpose processors for signal processing tasks is illogical; using application-specific accelerators for the computation would more likely make the targeted requirement level attainable. This thesis presents one possible solution for achieving low energy consumption, high performance and flexibility simultaneously and cost-effectively, using reconfigurable heterogeneous processor solutions. The work introduces a new reconfigurable GPU-style accelerator for the Internet of Things application domain that can be exploited by most computationally demanding applications. The properties of the proposed accelerator are evaluated using two machine vision applications as examples, and it is shown to achieve an excellent combination of energy efficiency and performance. The accelerator is designed using an efficient and rapid hardware/software co-design flow that achieves nearly the ease of development of commercial off-the-shelf processors, which in turn enables cost-effective development and design work.
6

Využití GPU pro akcelerované zpracování obrazu / Image Processing on GPUs

Bačík, Ladislav January 2008 (has links)
This master's thesis deals with modern graphics hardware technologies and their use for general-purpose computing. It focuses primarily on the unified processor architecture and on algorithm implementation via the CUDA programming interface. The basis of the thesis is to choose an algorithm suited to demonstrating the GPU's computational power. The main aim of this work is the implementation of a multiplatform library offering algorithms for the vectorization of discrete volumetric data. For this purpose, the Marching Cubes algorithm, which extracts the surface of the processed object, was chosen. The library contains both a variant of the algorithm runnable on the graphics device and one runnable on the CPU; finally, we compare the two variants and discuss their pros and cons.
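As a sketch of the GPU side of such a library (illustrative code, not taken from the thesis), the first Marching Cubes pass classifies every cell of the volume by building an 8-bit index of which corners lie inside the iso-surface; a second pass, omitted here, uses that index with the classic edge and triangle tables to emit the actual geometry:

    // Classify each cell against the isovalue: set bit k when corner k
    // of the cell lies below the iso-surface threshold.
    __global__ void classifyCells(const float* vol, unsigned char* cubeIndex,
                                  int nx, int ny, int nz, float iso) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        int z = blockIdx.z * blockDim.z + threadIdx.z;
        if (x >= nx - 1 || y >= ny - 1 || z >= nz - 1) return;

        auto at = [&](int i, int j, int k) {
            return vol[(k * ny + j) * nx + i];
        };
        unsigned char idx = 0;
        if (at(x,     y,     z    ) < iso) idx |= 1;
        if (at(x + 1, y,     z    ) < iso) idx |= 2;
        if (at(x + 1, y + 1, z    ) < iso) idx |= 4;
        if (at(x,     y + 1, z    ) < iso) idx |= 8;
        if (at(x,     y,     z + 1) < iso) idx |= 16;
        if (at(x + 1, y,     z + 1) < iso) idx |= 32;
        if (at(x + 1, y + 1, z + 1) < iso) idx |= 64;
        if (at(x,     y + 1, z + 1) < iso) idx |= 128;
        cubeIndex[(z * (ny - 1) + y) * (nx - 1) + x] = idx;
    }

The CPU variant is essentially the same loop nest without the thread indexing, which is what makes a side-by-side comparison of the two implementations natural.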
7

Efficient generation and rendering of tube geometry in Unreal Engine : Utilizing compute shaders for 3D line generation / Effektiv generering och rendering av tubgeometri i Unreal Engine : Generering av 3D-linjer med compute shaders

Woxler, Platon January 2021 (has links)
Massive graph visualization in an immersive environment, such as virtual reality (VR) or augmented reality (AR), can improve users' understanding when exploring data in new ways. Making the most of such a visualization requires interactive components that are fast enough to accommodate interactivity. By rendering the edges of the graph as shaded lines that imitate three-dimensional (3D) lines or tubes, one can circumvent technical limitations. This method works well enough on traditional two-dimensional (2D) monitors, but representing tubes as flat lines in a virtual environment (VE) makes for a less immersive user experience than visualizing true 3D geometry. To accommodate both requirements, i.e. speed and visual fidelity, we need a time-efficient way of producing tubular meshes. This thesis project explores how tubular geometry can be generated using compute shaders in the modern game engine Unreal Engine (UE). Exploiting the parallel computing power of the graphics processing unit (GPU), we use compute shaders to generate a tubular mesh following a predetermined path. The result of the project is an open-source plugin for UE, able to generate tubular geometry at rapid rates. While it gives no major advantage when generating smaller models compared to a sequential implementation, the compute-shader implementation creates and renders models more than 40× faster when generating 10⁶ tube segments. A secondary effect of generating most of the data on the GPU is that we avoid the bottleneck that can occur when the bandwidth of central processing unit (CPU) to GPU data transfer is exceeded. Using this tool, researchers can more easily explore information visualization in a VE. Furthermore, this thesis promotes extended development of mesh generation using compute shaders in UE. / Visualizing large graphs in an immersive environment, such as VR or AR, can improve a user's understanding when exploring data in new ways. Making the most of this type of visualization requires interactive components fast enough to accommodate interactivity. By displaying the lines that connect a graph's nodes as flat lines imitating 3D lines or tubes, one can avoid hitting the ceiling that technical limitations impose. This method is acceptable on traditional 2D monitors, but representing tubes as flat lines in a VE gives a less immersive user experience, in contrast to visualizing true 3D geometry. To satisfy these requirements, i.e. time efficiency and visual quality, we need an efficient way of producing 3D lines. This thesis investigates how tubular geometry can be generated with compute shaders in the modern game engine Unreal Engine (UE). Using compute shaders, we can exploit the parallel computing power of a GPU to generate a tubular mesh that follows a predetermined path. The result of the project is an open-source plugin for UE that can generate tubular geometry at high speed. Although it cannot be shown to give any major advantage when generating smaller models compared to a sequential implementation, the compute-shader implementation creates and renders models more than 40× faster when generating 10⁶ tube segments. Since most of the data is created on the GPU, we also avoid the bottleneck that can arise when the bandwidth between the CPU and the GPU is exceeded. With the help of the tool created in connection with this thesis, people can more easily explore information visualization in VEs. This thesis also promotes further development of mesh generation using compute shaders in UE.
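A sketch of the generation step (in CUDA here rather than the UE/HLSL compute shader the thesis uses; the frame computation is deliberately simplified and all names are illustrative): one thread places one ring vertex around its path point, and consecutive rings are then stitched into triangles.

    #include <cmath>

    struct Vec3 { float x, y, z; };  // minimal vector type for the sketch

    // Launch as genTubeVertices<<<numPoints, sides>>>(...). Each thread
    // places one of `sides` vertices on a circle of radius r lying in the
    // plane perpendicular to the local path tangent.
    __global__ void genTubeVertices(const Vec3* path, int numPoints,
                                    Vec3* verts, int sides, float r) {
        int p = blockIdx.x;   // path point
        int s = threadIdx.x;  // vertex around the ring
        if (p >= numPoints || s >= sides) return;

        // Tangent from neighbouring points (one-sided at the ends).
        int a = p > 0 ? p - 1 : p, b = p < numPoints - 1 ? p + 1 : p;
        float tx = path[b].x - path[a].x, ty = path[b].y - path[a].y,
              tz = path[b].z - path[a].z;
        float tl = sqrtf(tx*tx + ty*ty + tz*tz) + 1e-6f;
        tx /= tl; ty /= tl; tz /= tl;

        // Crude perpendicular frame; a real implementation would use
        // parallel transport so rings do not twist along the path.
        float nx = -ty, ny = tx, nz = 0.0f;
        float nl = sqrtf(nx*nx + ny*ny) + 1e-6f;
        nx /= nl; ny /= nl;
        float bx = ty*nz - tz*ny, by = tz*nx - tx*nz, bz = tx*ny - ty*nx;

        float ang = 2.0f * 3.14159265f * s / sides;
        float c = cosf(ang) * r, sn = sinf(ang) * r;
        verts[p * sides + s] = { path[p].x + c*nx + sn*bx,
                                 path[p].y + c*ny + sn*by,
                                 path[p].z + c*nz + sn*bz };
    }

Keeping both the vertex generation and the subsequent index-buffer construction on the GPU is what avoids the CPU-to-GPU transfer bottleneck mentioned above.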
8

On continuous maximum flow image segmentation algorithm / Segmentation d'images par l'algorithme des flot maximum continu

Marak, Laszlo 28 March 2012 (has links)
In recent years, with the advance of computing equipment and image acquisition techniques, the sizes, dimensions and content of acquired images have increased considerably. Likewise, the performance differential between classical single-processor architectures and parallel ones has shifted decisively in favour of the latter; yet programming practices have remained largely the same, resulting in a glaring performance shortfall even on these architectures. In this thesis, we explore in detail one particular algorithm, the continuous maximum flow computation. We explain why this algorithm is important and useful, and we propose several implementations on various architectures, from single-processor machines to SMP and NUMA architectures, as well as on massively parallel GPGPU architectures. We also explore its applications and evaluate its performance, in terms of segmentation quality, on large images from materials science and nano-scale biology.
9

Dynamický částicový systém jako účinný nástroj pro statistické vzorkování / A dynamical particle system as a driver for optimal statistical sampling

Mašek, Jan Unknown Date (has links)
The presented doctoral thesis aims at developing a new, efficient tool for optimizing the uniformity of point samples. One use case for these point sets is as optimized sets of integration points in statistical analyses of computer models using Monte Carlo-type integration. It is well known that pursuing uniformly distributed sets of integration points is the only possible way of decreasing the error of estimating an integral over an unknown function. The work surveys currently used criteria for evaluating and/or optimizing the uniformity of point sets, and presents a critical evaluation of their properties, leading to suggestions for improving the spatial and statistical uniformity of the resulting samples. A refined variant of the general formulation of the phi optimization criterion has been derived by incorporating a periodically repeated design domain along with scale-independent behavior of the criterion. Based on the notion of a physical analogy between a set of sampling points and a dynamical system of mutually repelling particles, a hyper-dimensional N-body system has been selected as the driver of the developed optimization tool. Because simulating such a dynamical system is known to be a computationally intensive task, an efficient solution using the massively parallel GPGPU platform Nvidia CUDA has been developed. An intensive study of the properties of this complex architecture turned out to be necessary to fully exploit the possible speedup.
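The heart of such a simulation is the force kernel. A minimal sketch (illustrative, not the thesis implementation): one CUDA thread accumulates the repulsion acting on one particle, measuring distances by the minimum-image convention so that the unit design domain behaves as periodically repeated, in line with the refined phi criterion above.

    // Repulsive forces between n particles in the unit hypercube [0,1)^DIM
    // with periodic boundaries; a magnitude ~ 1/d^(p+1) corresponds to a
    // phi_p-style distance criterion. DIM and p are illustrative choices.
    #define DIM 4
    __global__ void computeForces(const float* pos, float* force,
                                  int n, float p) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float f[DIM] = {0.0f};
        for (int j = 0; j < n; ++j) {
            if (j == i) continue;
            float d[DIM], d2 = 0.0f;
            for (int k = 0; k < DIM; ++k) {
                float dk = pos[i * DIM + k] - pos[j * DIM + k];
                dk -= rintf(dk);            // nearest periodic image
                d[k] = dk;
                d2 += dk * dk;
            }
            float inv = rsqrtf(d2 + 1e-12f);   // 1/d
            float mag = powf(inv, p + 1.0f);   // 1/d^(p+1)
            for (int k = 0; k < DIM; ++k)
                f[k] += mag * inv * d[k];      // unit direction * magnitude
        }
        for (int k = 0; k < DIM; ++k) force[i * DIM + k] = f[k];
    }

Integrating these forces with damping lets the particle system settle into a low-energy, i.e. highly uniform, configuration; the all-pairs loop is the O(n²) cost that motivates the GPU implementation.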
10

Evoluční návrh kolektivních komunikací akcelerovaný pomocí GPU / Evolutionary Design of Collective Communications Accelerated by GPUs

Tyrala, Radek January 2012 (has links)
This thesis provides an analysis of an application for evolutionary scheduling of collective communications, and proposes possible ways to accelerate the application using general-purpose computing on graphics processing units (GPGPU). The work offers a theoretical overview of systems on a chip and collective communications scheduling, and a more detailed description of evolutionary algorithms. Further, it describes the GPU architecture and its memory hierarchy in terms of the OpenCL memory model. Based on profiling, the work defines a concept for parallel execution of the fitness function, and presents an estimate of the achievable level of acceleration. The implementation process is described, with a closer look at the optimization process. Another important point is the comparison of the original CPU-based solution with the massively parallel GPU version. Finally, the thesis proposes distributing the computation among the different device types supported by the OpenCL standard. The conclusion discusses further advantages, constraints and possibilities of acceleration through distribution across heterogeneous computing systems.
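The parallel fitness-evaluation concept typically maps one work-item to one candidate schedule, so the whole population is scored in a single kernel launch. A generic sketch of that idea (written in CUDA for brevity, while the thesis targets OpenCL; the conflict count below is a stand-in for the real cost model of a collective-communication schedule):

    // Score one candidate schedule per thread. Each individual is `genes`
    // integers naming the time/link slot of each transfer; schedules with
    // fewer slot conflicts score higher. Illustrative placeholder fitness.
    __global__ void evalPopulation(const int* population, float* fitness,
                                   int popSize, int genes) {
        int ind = blockIdx.x * blockDim.x + threadIdx.x;
        if (ind >= popSize) return;
        const int* g = population + ind * genes;
        int conflicts = 0;
        for (int a = 0; a < genes; ++a)
            for (int b = a + 1; b < genes; ++b)
                if (g[a] == g[b]) ++conflicts;  // two transfers share a slot
        fitness[ind] = 1.0f / (1.0f + conflicts);  // higher is better
    }

Because fitness evaluation dominates the runtime of the evolutionary loop, this is the natural place to apply the acceleration estimated in the profiling step above.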
