61

Neblokující vstup/výstup pro projekt k-Wave / Non-Blocking Input/Output for the k-Wave Toolbox

Kondula, Václav January 2020 (has links)
This thesis deals with the implementation of a non-blocking I/O interface for the k-Wave project, which is designed for time-domain simulation of ultrasound propagation. The main focus is on large-domain simulations that, due to their high computing-power requirements, must run on supercomputers and produce tens of GB of data in a single simulation step. In this thesis, I have designed and implemented a non-blocking interface that stores data using dedicated threads, which allows simulation computation to overlap with disk operations and thus speeds up the simulation. A speedup of up to 33% was achieved over the current k-Wave implementation, which, among other things, also reduces the cost of running a simulation.
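As a rough illustration of the dedicated-thread idea (a minimal sketch, not the actual k-Wave code), the following C++ fragment lets a background writer thread drain a queue of snapshots while the compute loop keeps advancing; the file name and the Snapshot layout are made up for the example.

```cpp
// Minimal sketch: a dedicated writer thread drains a queue of simulation
// snapshots so the compute loop never blocks on disk.
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

struct Snapshot { int step; std::vector<float> data; };

class AsyncWriter {
public:
    explicit AsyncWriter(const std::string& path)
        : out_(path, std::ios::binary), worker_([this] { run(); }) {}

    ~AsyncWriter() {                       // flush remaining snapshots, then join
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();
    }

    void push(Snapshot s) {                // called from the simulation thread
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(s)); }
        cv_.notify_one();
    }

private:
    void run() {                           // runs on the dedicated I/O thread
        for (;;) {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return done_ || !q_.empty(); });
            if (q_.empty() && done_) return;
            Snapshot s = std::move(q_.front());
            q_.pop();
            lk.unlock();                   // write without holding the lock
            out_.write(reinterpret_cast<const char*>(s.data.data()),
                       static_cast<std::streamsize>(s.data.size() * sizeof(float)));
        }
    }

    std::ofstream out_;
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Snapshot> q_;
    bool done_ = false;
    std::thread worker_;
};

int main() {
    AsyncWriter writer("pressure.bin");        // hypothetical output file
    std::vector<float> field(1 << 20, 0.0f);
    for (int step = 0; step < 10; ++step) {
        // ... advance the ultrasound field here (compute overlaps with I/O) ...
        writer.push({step, field});            // copies the field; the writer persists it
    }
}
```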
62

Effective and Accelerated Informative Frame Filtering in Colonoscopy Videos Using Graphic Processing Units

Karri, Venkata Praveen 08 1900 (has links)
Colonoscopy is an endoscopic technique that allows a physician to inspect the mucosa of the human colon. Previous methods and software solutions for detecting informative frames in a colonoscopy video (a process called informative frame filtering, or IFF) have been largely ineffective in (1) covering the proper definition of an informative frame in the broadest sense and (2) striking an optimal balance between accuracy and speed of classification in both real-time and non-real-time medical procedures. In my thesis, I propose a more effective method and faster software solutions for IFF. The method is more effective because it introduces a heuristic classification algorithm, derived from experimental analysis of typical colon features, which contributed a 5-10% boost in various IFF performance metrics. The software modules are faster because they incorporate parallel-processing-oriented coding techniques for modern microprocessors. Two IFF modules were created, one for post-procedure use and the other for real-time use. Code optimizations through NVIDIA CUDA for GPU processing and/or CPU multi-threading, reflecting the two major microprocessor design philosophies (multi-core and many-core design), resulted in a 5-fold acceleration for the post-procedure module and a 40-fold acceleration for the real-time module. Some innovative software modules, still in the testing phase, have recently been created to exploit the power of multiple GPUs together.
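The thesis's heuristic and CUDA kernels are not reproduced in the abstract, so the sketch below only illustrates the CPU multi-threading side: frames are split across worker threads and classified by a placeholder sharpness score that stands in for the actual colon-feature heuristic.

```cpp
// Sketch of the CPU multi-threaded flavour of informative frame filtering:
// frames are partitioned across worker threads and classified by a placeholder
// heuristic (the thesis's colon-feature heuristic is not reproduced here).
#include <cstdint>
#include <cstdlib>
#include <thread>
#include <vector>

struct Frame { int width, height; std::vector<uint8_t> gray; };

// Hypothetical stand-in heuristic: mean absolute horizontal gradient;
// low values suggest a blurry / non-informative frame.
static bool isInformative(const Frame& f, double threshold = 8.0) {
    double acc = 0.0;
    for (int y = 0; y < f.height; ++y)
        for (int x = 1; x < f.width; ++x)
            acc += std::abs(int(f.gray[y * f.width + x]) -
                            int(f.gray[y * f.width + x - 1]));
    return acc / (f.width * double(f.height)) > threshold;
}

std::vector<uint8_t> filterFrames(const std::vector<Frame>& frames, unsigned nThreads) {
    std::vector<uint8_t> keep(frames.size(), 0);   // plain bytes: safe to write concurrently
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nThreads; ++t)
        pool.emplace_back([&, t] {
            for (size_t i = t; i < frames.size(); i += nThreads)  // strided split
                keep[i] = isInformative(frames[i]);
        });
    for (auto& th : pool) th.join();
    return keep;
}

int main() {
    std::vector<Frame> video(100, Frame{640, 480, std::vector<uint8_t>(640 * 480, 128)});
    auto keep = filterFrames(video, 4);
    return keep.empty() ? 1 : 0;
}
```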
63

Teaching Concurrency in a Modern Manner, Flipped Classroom or Game-Based Learning

Murphie, Bobby, Hansen, Mattias January 2018 (has links)
Much research has been done to find ways to improve the teaching of concurrency, from visualization tools to game-based learning and the flipped classroom. However, research comparing these methods or models when teaching concurrency is lacking. This paper looks at the results of students studying concurrent programming by comparing two modern teaching approaches; it also examines which method/model students prefer and which keeps them more engaged. The authors compare a game-based learning approach with a flipped-classroom approach. The game-based learning approach used in this paper was developed by Dr. Robert Marmorstein and uses the game OpenTTD [1]. Students learn about race conditions, deadlock and starvation by using semaphores (railway signals) to prevent collisions. The flipped-classroom approach is taken from a concurrent programming course at Malmö University. After both approaches were completed, the students took a test and answered a survey to measure how much they learned, how engaged they were, and what they preferred. To obtain accurate results, each student in the study participated in only one of the approaches. The results of the survey favor the OpenTTD lab approach, as the students were more engaged during the exercise and preferred it slightly more. The students who participated in the OpenTTD lab also did better on the test when it came to explaining how to prevent or solve each synchronization problem, while the flipped-classroom students did slightly better when it came to describing the problem.
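For readers unfamiliar with the railway-signal analogy, a minimal C++20 sketch (not taken from the lab material) shows how a binary semaphore plays the role of the signal that keeps two trains off a single-track section at the same time.

```cpp
// Tiny illustration of the railway-signal analogy used in the OpenTTD lab:
// a binary semaphore acts as the signal guarding a single-track section, so two
// "trains" (threads) cannot occupy it at once (requires C++20 <semaphore>).
#include <iostream>
#include <semaphore>
#include <thread>

std::binary_semaphore signal{1};   // green for exactly one train at a time

void train(const char* name) {
    for (int pass = 0; pass < 3; ++pass) {
        signal.acquire();          // wait until the section is free (signal turns red)
        std::cout << name << " is crossing the single-track section\n";
        signal.release();          // leave the section (signal turns green again)
    }
}

int main() {
    std::thread a(train, "Train A"), b(train, "Train B");
    a.join();
    b.join();
}
```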
64

A Comparison of Parallel Design Patterns for Game Development

Andblom, Robin, Sjöberg, Carl January 2018 (has links)
As processor performance capabilities can only be increased through the use of a multicore architecture, software needs to be developed to utilize the parallelism offered by the additional cores. Game developers especially need to seize this opportunity to save cycles and decrease the general rendering time. One of the existing advances towards this potential has been the creation of multithreaded game engines that take advantage of the additional processing units. In such engines, different branches of the game loop are parallelized. However, the specifics of the parallel design patterns used are not outlined, nor are any ideas proposed for how to combine these patterns. These missing factors are addressed in this article, to provide a guideline for when to use which of two parallel design patterns: fork-join and pipeline parallelism. Through a collection of data and a comparison using the metrics speedup and efficiency, conclusions were derived that shed light on the ways in which a typical part of a game loop can most efficiently be organized for parallel execution through the use of different parallel design patterns. The pipeline and fork-join patterns were applied respectively in a variety of test cases for two branches of a game loop: a BOIDS system and an animation system.
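A minimal sketch of the fork-join pattern applied to a BOIDS-style branch of a game loop might look as follows; the Boid fields and the steering rules are simplifications for the example, not the article's test cases.

```cpp
// Hedged sketch of the fork-join pattern applied to one game-loop branch:
// the BOIDS update is forked across worker threads and joined before rendering.
// (A pipeline variant would instead stream boid chunks through fixed stages.)
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

struct Boid { float x = 0, y = 0, vx = 1, vy = 1; };

void updateRange(std::vector<Boid>& boids, std::size_t begin, std::size_t end, float dt) {
    for (std::size_t i = begin; i < end; ++i) {   // steering rules omitted for brevity
        boids[i].x += boids[i].vx * dt;
        boids[i].y += boids[i].vy * dt;
    }
}

void forkJoinUpdate(std::vector<Boid>& boids, float dt, unsigned workers) {
    std::vector<std::thread> pool;
    std::size_t chunk = (boids.size() + workers - 1) / workers;
    for (unsigned w = 0; w < workers; ++w) {                      // fork
        std::size_t b = w * chunk, e = std::min(boids.size(), b + chunk);
        if (b < e) pool.emplace_back(updateRange, std::ref(boids), b, e, dt);
    }
    for (auto& t : pool) t.join();                                // join before the next frame
}

int main() {
    std::vector<Boid> flock(10000);
    unsigned workers = std::thread::hardware_concurrency();
    if (workers == 0) workers = 4;
    for (int frame = 0; frame < 60; ++frame)
        forkJoinUpdate(flock, 1.0f / 60.0f, workers);
}
```

A pipeline organization would instead split the same work into fixed stages with chunks of boids streamed between them, trading the per-frame join for inter-stage queue traffic.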
65

Contention-Aware Scheduling for SMT Multicore Processors

Feliu Pérez, Josué 27 March 2017 (has links)
The recent multicore era and the incoming manycore/manythread era create many challenges for computer scientists, ranging from productive parallel programming, through network congestion avoidance and intelligent power management, to circuit design issues. The ultimate goal is to squeeze out as much performance as possible while limiting power and energy consumption and guaranteeing reliable execution. The increasing number of hardware contexts in current and future systems makes the scheduler an important component in achieving this goal, as there is often a combinatorial number of ways to schedule the distinct threads or applications, each with different performance due to inter-application interference. Picking an optimal schedule can result in substantial performance gains. This thesis deals with inter-application interference, covering the problems it causes for performance and fairness on actual machines. The study starts with single-threaded multicore processors (Intel Xeon X3320), continues with simultaneous multithreading (SMT) multicores supporting up to two threads per core (Intel Xeon E5645), and ends with the most highly threaded per-core processor built to date (IBM POWER8). The dissertation analyzes the main contention points of each experimental platform and proposes scheduling algorithms that tackle the interference arising at each contention point in order to improve system throughput and fairness. First, we analyze contention throughout the memory hierarchy of current multicore processors. The studies reveal high performance degradation due to contention on main memory and on any shared cache the processors implement. To mitigate such contention, we propose different bandwidth-aware scheduling algorithms whose key idea is to balance memory accesses across the workload's execution time and to balance cache requests among the different caches at each cache level. The high interference that applications suffer when running simultaneously on the same SMT core, however, does not only affect performance but can also compromise system fairness. In this dissertation, we therefore also analyze fairness in current SMT multicores. To improve system fairness, we design progress-aware scheduling algorithms that estimate, at runtime, how the processes progress, which improves system fairness by prioritizing the processes with lower accumulated progress. Finally, this dissertation tackles inter-application contention in the IBM POWER8 system with a symbiotic scheduler that addresses overall SMT interference. The symbiotic scheduler uses an SMT interference model, based on CPI stacks, that estimates the slowdown of any combination of applications scheduled on the same SMT core. The number of possible schedules, however, grows too fast with the number of applications and makes it infeasible to explore all combinations. To overcome this issue, the symbiotic scheduler models the scheduling problem as a graph problem, which allows the optimal schedule to be found in reasonable time. In summary, this thesis addresses contention in the shared resources of the memory hierarchy and the SMT cores of multicore processors. We identify the main contention points of three systems with different architectures and propose scheduling algorithms to tackle contention at these points. The evaluation on the real systems shows the benefits of the proposed algorithms: the symbiotic scheduler improves system throughput by 6.7% over Linux, and the progress-aware scheduler reduces Linux unfairness to a third. Moreover, since the proposed algorithms are completely software-based, they could be incorporated as scheduling policies in Linux and used in small-scale servers to achieve the benefits described. / Feliu Pérez, J. (2017). Contention-Aware Scheduling for SMT Multicore Processors [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/79081 / Extraordinary Doctoral Thesis Award
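To make the co-scheduling idea concrete, the sketch below pairs applications onto two-way SMT cores from a matrix of predicted combined slowdowns. In the dissertation the slowdowns come from the CPI-stack interference model and the optimal pairing is found with a graph algorithm; the greedy pass and the sample numbers here are only stand-ins.

```cpp
// Illustrative sketch only: pairing applications onto 2-way SMT cores using a
// pairwise predicted-slowdown matrix (a greedy pass stands in for the thesis's
// optimal graph-matching formulation).
#include <iostream>
#include <limits>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Greedily pick the remaining pair with the smallest combined predicted slowdown.
std::vector<std::pair<int, int>> pairApps(const Matrix& slowdown) {
    int n = static_cast<int>(slowdown.size());
    std::vector<bool> used(n, false);
    std::vector<std::pair<int, int>> schedule;
    for (int k = 0; k < n / 2; ++k) {
        int bi = -1, bj = -1;
        double best = std::numeric_limits<double>::infinity();
        for (int i = 0; i < n; ++i)
            for (int j = i + 1; j < n; ++j)
                if (!used[i] && !used[j] && slowdown[i][j] < best) {
                    best = slowdown[i][j]; bi = i; bj = j;
                }
        used[bi] = used[bj] = true;
        schedule.emplace_back(bi, bj);
    }
    return schedule;
}

int main() {
    // Hypothetical combined slowdowns for 4 applications (symmetric matrix).
    Matrix s = {{0.0, 1.8, 1.2, 1.5},
                {1.8, 0.0, 1.6, 1.1},
                {1.2, 1.6, 0.0, 1.9},
                {1.5, 1.1, 1.9, 0.0}};
    for (auto [a, b] : pairApps(s))
        std::cout << "core gets apps " << a << " and " << b << "\n";
}
```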
66

On testing concurrent systems through contexts of queues

Huo, Jiale. January 2006 (has links)
No description available.
67

Multitrådad Schemaläggning av xtUML Modeller : Utfört på Saab Dynamics / Multithreaded scheduling of xtUML models

Gripsborn, Carolina January 2022 (has links)
Executable and Translatable UML (xtUML) is a modeling methodology in which a system is constructed from a set of UML models and an action language, which can be translated to a target implementation and compiled into an executable program. It allows for good readability, an understanding of the workings of the system and the relations between its parts, easy testing, and reusability. With a subset of UML diagrams and finite state machines, the actors in the system and the execution progression can be defined. These models are then turned into an executable program by a model compiler. Saab Dynamics has developed its own model-to-C++ compiler, itself built with xtUML using the open-source tool Bridgepoint. In the current implementation of the compiler, events, which trigger a class instance to transition from one state in a state machine to another, are picked from a queue and processed one by one. In theory, program execution could be sped up if multiple events were run simultaneously. To enable parallel execution, additional functionality needs to be added to the compiler to map dependencies between classes and to schedule events on threads. To achieve this, a parser was implemented that iterates over every state machine and finds statements that access other classes and could result in a data race. Each such shared data access is mapped as an instance of a Dependency class if at least one of the accesses is a write. These instances are later used by the compiler to determine, for each class, which classes it depends on. During execution, when an event is picked from the queue, its target class is checked against the classes currently executing on other active threads to determine whether the event can be processed immediately or must be requeued. Threads are created at the start of the program in a thread pool and are awakened once an independent event is found and added to a thread's own queue. Results from test models compiled with the new version of the model compiler show that the parser finds all data accesses to other classes and accurately maps the dependencies between them. The end results of the programs are equal to those of the serial executions, and the principles of xtUML are maintained. While there are still improvements to be made to increase the parallelization of events, a significant speedup in execution time was observed for models containing time-consuming independent state machines.
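A condensed sketch of the dispatch rule described above might look as follows; the class names, the shape of the Dependency mapping, and the single-threaded dispatch loop are simplifications for illustration, not Saab's generated code.

```cpp
// Sketch of the dispatch rule: an event is handed to a worker only if its target
// class has no mapped dependency on any class currently executing elsewhere;
// otherwise it is placed back in the queue.
#include <cstddef>
#include <deque>
#include <functional>
#include <map>
#include <mutex>
#include <set>
#include <string>
#include <utility>

struct Event {
    std::string targetClass;
    std::function<void()> run;   // the state-machine transition to execute
};

class Dispatcher {
public:
    // deps[c] = classes whose shared data class c reads or writes
    explicit Dispatcher(std::map<std::string, std::set<std::string>> deps)
        : deps_(std::move(deps)) {}

    void post(Event e) {
        std::lock_guard<std::mutex> lk(m_);
        queue_.push_back(std::move(e));
    }

    // One scheduling step: dispatch the first runnable event, requeue blocked ones.
    bool dispatchOne() {
        std::unique_lock<std::mutex> lk(m_);
        const std::size_t n = queue_.size();
        for (std::size_t i = 0; i < n; ++i) {
            Event e = std::move(queue_.front());
            queue_.pop_front();
            if (conflicts(e.targetClass)) { queue_.push_back(std::move(e)); continue; }
            active_.insert(e.targetClass);
            lk.unlock();
            e.run();                                  // would run on a pool thread
            lk.lock();
            active_.erase(e.targetClass);
            return true;
        }
        return false;                                 // nothing runnable right now
    }

private:
    bool conflicts(const std::string& cls) const {
        for (const auto& running : active_)
            if (deps_.count(cls) && deps_.at(cls).count(running)) return true;
        return false;
    }

    std::map<std::string, std::set<std::string>> deps_;
    std::deque<Event> queue_;
    std::set<std::string> active_;
    std::mutex m_;
};

int main() {
    Dispatcher d({{"Train", {"Track"}}, {"Track", {}}});   // hypothetical xtUML classes
    d.post({"Train", [] { /* transition body */ }});
    d.post({"Track", [] { /* transition body */ }});
    while (d.dispatchOne()) {}
}
```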
68

Simplifying Embedded System Development through Whole-Program Compilers

McCartney, William P. 18 May 2011 (has links)
No description available.
69

Designing a Compiler for a Distributed Memory Parallel Computing System

Bennett, Sidney Page 22 January 2004 (has links)
The SCMP processor presents a unique approach to processor design: integrating multiple processors, a network, and memory onto a single chip. The benefits of this design include a reduction in the overhead incurred by synchronization, communication, and memory accesses. To properly determine its effectiveness, the SCMP architecture must be exercised under a wide variety of workloads, creating the need for a variety of applications. A compiler can reduce the time spent developing these applications by allowing the use of languages such as C and Fortran. However, compiler development is a research area in its own right, requiring extensive knowledge of the architecture to make good use of its resources. This thesis presents the design and implementation of a compiler for the SCMP architecture, including an in-depth analysis of SCMP and the design choices necessary for an effective compiler built with the SUIF and MachSUIF toolsets. Two optimization passes are included in the discussion: partial redundancy elimination and instruction scheduling. While these optimizations are not specific to parallel computing, architectural considerations must still be made to implement the algorithms properly within the SCMP compiler. These optimizations yield an overall reduction in execution time of 15-36%. / Master of Science
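Instruction scheduling itself can be illustrated with a small list-scheduling pass; the sketch below is generic (not the SUIF/MachSUIF implementation from the thesis) and simply issues ready instructions longest-latency-first while honoring data dependences.

```cpp
// Minimal list-scheduling sketch: instructions in a basic block are reordered
// greedily, longest-latency-first, while respecting data dependences, to hide
// latency on an in-order pipeline. Instruction texts and latencies are made up.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Instr {
    std::string text;
    int latency;
    std::vector<int> deps;   // indices of instructions that must issue earlier
};

std::vector<int> listSchedule(const std::vector<Instr>& block) {
    std::vector<int> order;
    std::vector<bool> done(block.size(), false);
    while (order.size() < block.size()) {
        int pick = -1;
        for (size_t i = 0; i < block.size(); ++i) {
            if (done[i]) continue;
            bool ready = std::all_of(block[i].deps.begin(), block[i].deps.end(),
                                     [&](int d) { return done[d]; });
            if (ready && (pick < 0 || block[i].latency > block[pick].latency))
                pick = int(i);               // prefer long-latency ready instructions
        }
        done[pick] = true;                   // assumes the dependence graph is acyclic
        order.push_back(pick);
    }
    return order;
}

int main() {
    std::vector<Instr> block = {
        {"load  r1, [r4]",   3, {}},
        {"add   r2, r1, r5", 1, {0}},
        {"load  r3, [r6]",   3, {}},
        {"mul   r7, r3, r2", 2, {1, 2}},
    };
    for (int i : listSchedule(block)) std::cout << block[i].text << "\n";
}
```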
70

Conception et évaluation de performance d'un Bus applicatif, massivement parallèle et orienté service / Design and Performance Evaluation of a Massively Parallel Service-Oriented Bus

Benosman, Ridha Mohammed 12 December 2013 (has links)
The Enterprise Service Bus (ESB) is currently the most promising approach for business application integration in distributed and heterogeneous environments: it allows a service-oriented architecture (SOA) to be deployed by integrating otherwise isolated applications on a centralized platform. Many ESB-based integration solutions have been proposed, either open source (such as Mule, Petals, or Fuse) or proprietary (such as Sonic ESB, IBM WebSphere Message Broker, or Oracle ESB). However, to the best of our knowledge, none of them handles both application integration and massively parallel processing. Integrating parallelism into message processing makes it possible to take advantage of multicore/multiprocessor technologies, which can greatly improve ESB performance. This integration is, however, a complex undertaking and raises problems at several levels: communication, synchronization, data sharing, and so on. In this thesis, we present the study of a new massively parallel ESB architecture that addresses this challenge.
