1

PolyMage: Automatic Optimization for Image Processing Pipelines

Mullapudi, Ravi Teja January 2015 (has links) (PDF)
Image processing pipelines are ubiquitous. Every image captured by a camera and every image uploaded to social networks like Google+ or Facebook is processed by a pipeline. Applications in a wide range of domains like computational photography, computer vision and medical imaging use image processing pipelines. Many of these applications demand high performance, which requires effective utilization of modern architectures. Given the proliferation of camera-enabled devices and social networks, optimizing these emerging workloads has become important both at the data center and the embedded device scales. An image processing pipeline can be viewed as a graph of interconnected stages which process images successively. Each stage typically performs one of point-wise, stencil, sampling, reduction or data-dependent operations on image pixels. Individual stages in a pipeline typically exhibit abundant data parallelism that can be exploited with relative ease. However, the stages also require high memory bandwidth, preventing effective utilization of the parallelism available on modern architectures. The traditional options are to use optimized libraries like OpenCV or to optimize manually. While using libraries precludes optimization across library routines, manual optimization accounting for both parallelism and locality is very tedious. In this thesis, we present the design and implementation of PolyMage, a domain-specific language and compiler for image processing pipelines. The focus of the system is on automatically generating high-performance implementations of image processing pipelines expressed in a high-level declarative language. We achieve such automation with:
• tiling techniques to improve parallelism and locality by introducing redundant computation,
• a model-driven fusion heuristic which enables a trade-off between locality and redundant computations, and
• an autotuner which leverages the fusion heuristic to explore a small subset of pipeline implementations and find the best performing one.
Our optimization approach primarily relies on the transformation and code generation capabilities of the polyhedral compiler framework. To the best of our knowledge, this is the first model-driven compiler for image processing pipelines that performs complex fusion, tiling, and storage optimization fully automatically. We evaluate our framework on a modern multicore system using a set of seven benchmarks which vary widely in structure and complexity. Experimental results show that the performance of pipeline implementations generated by our approach is:
• up to 1.81× better than pipeline implementations manually tuned using Halide, a state-of-the-art language and compiler for image processing pipelines,
• on average 5.39× better than pipeline implementations automatically tuned using Halide and OpenTuner, and
• on average 3.3× better than naive pipeline implementations which only exploit parallelism without optimizing for locality.
We also demonstrate that the performance of PolyMage-generated code is better than or comparable to implementations using OpenCV, a state-of-the-art image processing and computer vision library.
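The kind of pipeline and memory-bandwidth problem the abstract describes can be made concrete with a small sketch. The code below is plain NumPy, not PolyMage syntax; the two-stage separable blur and the stage names `blur_x`/`blur_y` are hypothetical, chosen only to show a stencil pipeline executed one stage at a time, which is the naive schedule that fusion and tiling aim to improve.

```python
# Illustrative sketch only: a two-stage stencil pipeline written the "naive"
# stage-at-a-time way the abstract contrasts against. Plain NumPy, not PolyMage
# syntax; the stage names are hypothetical.
import numpy as np

def blur_x(img):
    # Averages each pixel with its horizontal neighbours (a 1D stencil stage).
    out = np.empty_like(img)
    out[:, 1:-1] = (img[:, :-2] + img[:, 1:-1] + img[:, 2:]) / 3.0
    out[:, 0], out[:, -1] = img[:, 0], img[:, -1]
    return out

def blur_y(img):
    # The same stencil applied vertically; together the stages form a separable blur.
    out = np.empty_like(img)
    out[1:-1, :] = (img[:-2, :] + img[1:-1, :] + img[2:, :]) / 3.0
    out[0, :], out[-1, :] = img[0, :], img[-1, :]
    return out

image = np.random.rand(2048, 2048).astype(np.float32)

# Naive execution: each stage reads and writes the full image, so the
# intermediate `tmp` is streamed through main memory between stages.
tmp = blur_x(image)
result = blur_y(tmp)
```

A fusing, tiling compiler of the kind described would instead compute both stages over small image tiles so the intermediate stays in cache, recomputing a few boundary rows per tile; this is the redundant computation that the model-driven heuristic trades off against locality.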
2

Analysis of the Scope of Dynamic Power Management in Emerging Server Architectures

Hähnel, Markus, Dargie, Waltenegus, Schill, Alexander 16 May 2023 (has links)
The architectures of large-scale Internet servers are becoming more complex each year in order to store and process a large amount of Internet data (Big Data) as efficiently as possible. One of the consequences of this continually growing complexity is that individual servers consume a significant amount of power even when they are idle. In this paper we experimentally investigate the scope and usefulness of existing and proposed dynamic power management strategies to manage power at the core, socket, and server levels. Our experiment involves four dynamic voltage and frequency scaling policies, three workloads with different resource consumption statistics, and the activation and deactivation of different sockets (packages) of a multicore, multi-socket server. Moreover, we establish a quantitative relationship between the workload (w) and the estimated power consumption (p) under different power management strategies in order to make a quantitative comparison of the different strategies and server configurations.
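The abstract does not state the functional form of the workload-to-power relationship, so the sketch below only illustrates one plausible way to compare strategies: fitting an affine model p(w) ≈ p_idle + α·w per DVFS policy. The governor names are real Linux cpufreq policies, but every measurement value is invented for illustration.

```python
# Sketch under an assumption: the paper's model form is not given here, so this
# fits a simple affine model p(w) ≈ p_idle + alpha * w per DVFS policy as one
# plausible way to compare strategies. All numbers below are made up.
import numpy as np

# Hypothetical measurements: CPU utilization (workload proxy, %) and measured power (W)
measurements = {
    "ondemand":    ([10, 30, 50, 70, 90], [62, 81, 104, 128, 150]),
    "performance": ([10, 30, 50, 70, 90], [85, 99, 118, 139, 158]),
}

for policy, (w, p) in measurements.items():
    # np.polyfit with degree 1 returns [alpha, p_idle] for p ≈ alpha*w + p_idle.
    alpha, p_idle = np.polyfit(w, p, 1)
    print(f"{policy:12s}: p(w) ≈ {p_idle:.1f} W + {alpha:.2f} W per % utilization")
```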
3

Predictable Multiprocessor Platform for Safety-Critical Real-Time Systems

Sigurðsson, Páll Axel January 2021 (has links)
Multicore systems excel at providing concurrent execution of applications, giving true parallelism where all cores can execute sequences of machine instructions at the same time. However, multicore systems come with their own set of problems, most notably when cores in a system (or core tiles) share hardware components such as memory modules or Input/Output (IO) peripherals. This increased level of complexity makes it especially difficult to design and verify safety-critical systems that require real-time operation, such as flight controllers in airplanes and airbag controllers in the automotive industry. Verifying that systems are predictable is therefore essential, requiring methods for measuring and finding the Worst-Case Execution Times (WCETs) and Best-Case Execution Times (BCETs). Additionally, the designer must ensure isolation between running applications (indicating that the platform is composable). This thesis work consists of designing a predictable Multiprocessor System-on-Chip (MPSoC) using Qsys and Quartus II, as well as providing methods and test benches that can support all claims made about the platform's reported behavior. A shared-memory, loosely coupled multicore design was implemented, which can be horizontally scaled from 2 to 8 core tiles. A high-level Hardware Abstraction Layer (HAL) is written for the platform to simplify its use. Using Nios II/e processors as the logical cores in the platform's core tiles gives predictable (mostly static) latencies when the platform is tested, showing no erratic or unexplained timing variations. However, due to the Round-Robin (RR) nature of the arbitration logic in the Avalon Switch Fabric (ASF), composability was not fully achieved in the platform. Groundwork for implementing Time-Division Multiplexing (TDM) arbitration logic is proposed and will ideally be fully implemented in future work.
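The abstract identifies round-robin arbitration as the reason composability was not achieved and TDM as the proposed remedy. The thesis's actual arbiter is not reproduced here; the sketch below is a minimal, hypothetical Python model of the two policies, showing why a TDM slot table bounds each core's waiting time independently of the other cores, while a round-robin grant depends on the current set of requesters.

```python
# Minimal sketch (not the thesis's arbiter): contrast round-robin (RR) arbitration,
# where a core's wait depends on what the other cores request, with time-division
# multiplexing (TDM), where each core owns fixed slots and its worst-case wait is
# a constant, which is the property composability needs.
NUM_CORES = 4
SLOT_LEN = 10  # cycles per TDM slot (illustrative value)

def tdm_wait(core_id: int, now: int) -> int:
    """Cycles until core `core_id` may access the shared resource under TDM."""
    frame = NUM_CORES * SLOT_LEN
    pos = now % frame
    slot_start = core_id * SLOT_LEN
    if slot_start <= pos < slot_start + SLOT_LEN:
        return 0  # inside our own slot: access is immediate
    return (slot_start - pos) % frame  # independent of other cores' behavior

def rr_wait(core_id: int, pending: list[int]) -> int:
    """Cycles until `core_id` is served under RR, assuming each grant takes SLOT_LEN.
    Depends on how many *other* cores are currently requesting, hence not composable."""
    ahead = sum(1 for c in pending if c != core_id)
    return ahead * SLOT_LEN

# Core 0's worst-case TDM wait is always (NUM_CORES - 1) * SLOT_LEN cycles,
# whereas its RR wait varies with the number of competing requesters.
print(max(tdm_wait(0, t) for t in range(NUM_CORES * SLOT_LEN)))  # 30
print(rr_wait(0, pending=[1, 2, 3, 0]))  # 30, but only because all other cores compete
```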
4

Прилог аутоматској паралелизацији секвенцијалног машинског кода / An Approach to Automatic Parallelization of Sequential Machine Code

Marinković Vladimir 24 September 2018 (has links)
This PhD thesis analyzes support for multicore and manycore systems with the aim of better utilizing their processing power. The purpose of the study is to find a solution that parallelizes existing sequential programs at the binary level, which execute on a single core (or processor), without programmer intervention (automatically). The result of the research is a solution and a set of tools for parallelizing sequential machine code, which autonomously create programs that execute in parallel on the cores of a multicore processor and thereby achieve balanced processor load. The primary goal is to obtain an execution speedup of the program running on the multicore processor in order to meet given real-time constraints. The obtained solution could also be used to reduce energy consumption by lowering the processor clock while keeping the original program execution time.
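The closing claim, that a parallelization speedup can be traded for a lower clock at unchanged runtime, can be illustrated with a short back-of-the-envelope calculation. The cubic dynamic-power approximation used below is a standard textbook simplification, not a model taken from the thesis, and the speedup and core count are hypothetical.

```python
# Illustrative calculation, not the thesis's model: if automatic parallelization
# yields speedup S, the clock can be lowered by roughly a factor of S while the
# program still meets its original runtime. With the common approximation that
# dynamic power scales ~ f^3 (P ∝ C*V^2*f, with V roughly proportional to f),
# this shows the kind of energy saving the abstract refers to.
def energy_ratio(speedup: float, cores: int) -> float:
    """Dynamic energy at the lowered clock relative to the original single-core run.

    Assumes the runtime is held at the original value, the work is spread over
    `cores` cores, and per-core dynamic power scales with the cube of the clock
    factor; static/idle power is ignored.
    """
    clock_factor = 1.0 / speedup          # lower the clock by the speedup factor
    per_core_power = clock_factor ** 3    # cubic dynamic-power approximation
    return cores * per_core_power         # same runtime, so energy tracks total power

# Hypothetical example: a 3x speedup on 4 cores.
print(f"relative dynamic energy: {energy_ratio(speedup=3.0, cores=4):.2f}")  # ~0.15
```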
