61

ENHANCING FAIRNESS AND PERFORMANCE ON CHIP MULTI-PROCESSOR PLATFORMS WITH CONTENTION-AWARE SCHEDULING POLICIES

Marinakis, Theodoros 01 December 2019 (has links) (PDF)
Chip Multi-Processor (CMP) platforms, well established in the server, desktop and embedded domains, succeeded in overcoming the power consumption and heat dissipation bottlenecks by integrating multiple cores, less complex and powerful than their single-core ancestors, into a single die. A major issue induced by the design of CMPs is contention for the shared resources of the platform, namely the Last Level Cache (LLC) and main memory bandwidth. Applications running concurrently on the cores compete with each other for these shared resources and are subject to performance degradation. The way applications are assigned to the CMP is crucial for the overall performance of the system. A scheduling policy that accounts for contention will bring high performance speed-ups, whereas an agnostic one will generate unpredictable contention conditions. For this reason the significance of the scheduler has been elevated, as it is the component that determines which applications utilize the resources in each time period. In this thesis, we address cross-core interference on CMP platforms by designing scheduling policies that improve performance and fairness. We deal with contention in three ways. In our first approach, we incorporate the notion of progress in order to balance unfairness among the applications of the workload. Performance degradation is not evenly distributed, and progress varies greatly among applications. In order to provide a fair execution environment, we monitor, at run time, the applications assigned to the CPU and prioritize them based on the extent to which they are affected by contention. In our second approach, we target performance by mitigating contention on shared resources. It is necessary to decide, out of all the possible application schedules, the one that generates the least resource interference. To achieve that, the first indispensable step is to extract an interference profile for the applications executed on the CMP. We accomplish this by applying pressure to all levels of the memory hierarchy and identifying the point at which performance is compromised. From our analysis, we understand that the shared resources can tolerate a certain amount of pressure; applications can be grouped together if the overall generated pressure does not reach the saturation point of the shared resources. Having extracted this information, we place the applications in such a way that overall resource requirements are balanced as evenly as possible across the execution. Finally, we design a policy that improves performance and fairness at the same time. Applications that rely heavily on the LLC are separated from those with high main memory bandwidth demands, in order to avoid the destructive effects caused by the LLC-thrashing behavior of the latter. The group executed on the CPU is determined based on the key observation that the overall requirements of the group should not exceed the saturation limits of the CMP. Additionally, during execution, the progress of each application is estimated and those with the least accumulated progress are prioritized. Our proposed policies are evaluated on an Intel Xeon E5-2620 v3 processor. A variety of benchmark suites were used to generate mixes of diverse characteristics. Our methodologies are implemented in user space and can be deployed on Linux-based systems. Experimental results show the benefits of tackling contention in shared resources: we achieve throughput gains of up to 16%, and unfairness is reduced by 2.37x on average compared to the Linux scheduler.
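The grouping and prioritization ideas above can be illustrated with a small sketch. The following Python fragment is not the thesis's scheduler; the saturation threshold, the pressure metric, and the application names are illustrative assumptions. It shows a greedy heuristic that keeps each co-scheduled group's combined bandwidth pressure below an assumed saturation point and then favors the application with the least accumulated progress.

```python
# Illustrative sketch, not the thesis's scheduler: group applications so that the
# combined memory-bandwidth pressure of each group stays under an assumed
# saturation threshold, then favor the group member with the least progress.

from dataclasses import dataclass

SATURATION = 1.0  # assumed normalized bandwidth capacity of the CMP


@dataclass
class App:
    name: str
    pressure: float        # normalized demand on LLC / memory bandwidth
    progress: float = 1.0  # estimated progress vs. running alone (1.0 = no slowdown)


def form_groups(apps, capacity=SATURATION):
    """Greedy first-fit-decreasing grouping: keep each group's pressure below capacity."""
    groups = []
    for app in sorted(apps, key=lambda a: a.pressure, reverse=True):
        for group in groups:
            if sum(a.pressure for a in group) + app.pressure <= capacity:
                group.append(app)
                break
        else:
            groups.append([app])
    return groups


def pick_next(group):
    """Within a co-scheduled group, prioritize the application with the least progress."""
    return min(group, key=lambda a: a.progress)


if __name__ == "__main__":
    # Hypothetical applications with made-up pressure/progress values.
    apps = [App("streaming", 0.6, 0.55), App("pointer-chasing", 0.5, 0.70),
            App("compute-bound", 0.1, 0.95), App("cache-friendly", 0.2, 0.90)]
    for g in form_groups(apps):
        print([a.name for a in g], "-> run next:", pick_next(g).name)
```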
62

A Raster-Scan Video Graphics Display System Implemented on a KIM Microcomputer

Drummond, Mark Douglas 09 1900 (has links)
<p> The "microelectronic revolution" and the accompanying decrease in the cost of semiconductor memory has increased the availability of raster-scan graphical displays, yet, as pointed out in a recent survey [BAE79], the implementation of graphics software for raster-scan systems has lagged behind that for random-scan systems. The aim of the work described in this report has been to apply random-scan display techniques to a system employing a relatively inexpensive raster-scan device. The system, incorporating a segmented display file processor, is implemented on a KIM^TM microcomputer. The display device is composed of a Micro Technology Unlimited^TM (MTU) 8K Visable Memory^TM (VM) video board and a standard TV monitor.</p> / Thesis / Master of Science (MSc)
63

Underwater source localization with a generalized likelihood ratio processor

Conn, Rebecca M. January 1994 (has links)
No description available.
64

Evaluation in which context a 32-bit, rather than an 8-bit processor may be appropriate to use, based on power consumption

Jönsson, Patricia January 2017 (has links)
The term Internet of Things is growing larger and larger, and the world is on track to have 50 billion connected devices by 2020. IoT devices depend on low power consumption, and therefore a low-power processor is important. This study performs tests on two power-saving processors to determine which processor is most suitable for which IoT product. The tests were based on three applications, which in turn are based on real IoT situations. The three applications have different intensity levels: in the first application the processors do not work very hard, in the second the processors work more, and in the third the processors work hardest. Power consumption is measured using the Atmel Power Debugger. The results show that IoT devices that are not very active have lower power consumption with an 8-bit processor, but an IoT device that is more active has lower power consumption with a Cortex-M0+ based 32-bit processor.
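As a rough illustration of the comparison method, the sketch below converts hypothetical voltage/current samples into average power and energy per task. The sample format and all numbers are assumptions, not measurements from the thesis; the point is that a more active workload can cost less energy on the faster 32-bit part even at a higher instantaneous power draw.

```python
# Rough illustration (assumed numbers): turn sampled (voltage, current) pairs into
# average power and energy for one task, so two processors can be compared.

def average_power(samples):
    """samples: list of (voltage_V, current_A) pairs taken at a fixed sample rate."""
    return sum(v * i for v, i in samples) / len(samples)


def energy_joules(samples, duration_s):
    """Energy spent over the task = average power x task duration."""
    return average_power(samples) * duration_s


if __name__ == "__main__":
    # Hypothetical traces for the same workload on an 8-bit MCU and a 32-bit MCU.
    mcu_8bit = [(3.3, 0.0021), (3.3, 0.0023), (3.3, 0.0022)]
    mcu_32bit = [(3.3, 0.0040), (3.3, 0.0041), (3.3, 0.0039)]
    # If the 32-bit part finishes the same work sooner, its energy can still be
    # lower despite the higher instantaneous power.
    print("8-bit  energy: %.4f J" % energy_joules(mcu_8bit, duration_s=2.0))
    print("32-bit energy: %.4f J" % energy_joules(mcu_32bit, duration_s=0.8))
```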
65

Performance Modeling of Single Processor and Multi-Processor Computer Architectures

Commissariat, Hormazd P. 11 March 2000 (has links)
Determining the optimum computer architecture configuration for a specific application or a generic algorithm is a difficult task. The complexity of today's computer architectures and systems makes it more difficult and expensive to easily and economically implement and test fully functional prototypes of computer architectures. High-level VHDL performance modeling of architectures is an efficient way to rapidly prototype and evaluate computer architectures. Once the architecture configuration is fixed, one would like to know the tolerance and expected performance of individual/critical components, and also the best way to map the software tasks onto the processor(s). Trade-offs and engineering compromises can be analyzed, and the effects of certain component failures and communication bottlenecks can be studied. Part of the research work done for the RASSP (Rapid Prototyping of Application Specific Signal Processors) project, funded by Department of Defense contracts, is documented in this thesis. The architectures modeled include a single-processor, single-global-bus system; a four-processor, single-global-bus system; a four-processor, multiple-local-bus, single-global-bus system; and finally, a four-processor, multiple-local-bus system interconnected by a crossbar interconnection switch. The hardware models used are mostly legacy models inherited from an earlier project; they were upgraded, modified and customized to suit the current research needs and requirements. The software tasks run on the processors are pieces of the signal and image processing algorithm used for Synthetic Aperture Radar (SAR). Communication between components/devices is achieved in the form of tokens, which are record structures. The output is a trace file that tracks the passage of the tokens through the various components of the architecture. The trace file is post-processed to obtain activity plots and latency plots for individual components of the architecture. / Master of Science
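The trace post-processing step can be sketched briefly. The CSV-like line format below (component, token id, start time, end time) is an assumption made for illustration; the thesis's VHDL token records and trace layout may differ.

```python
# Illustrative post-processing of a token trace into per-component busy time.
# The line format "component,token_id,start,end" is an assumption for this sketch.

from collections import defaultdict


def component_activity(trace_lines):
    """Sum the time each component spends handling tokens."""
    busy = defaultdict(float)
    for line in trace_lines:
        component, _token_id, start, end = line.strip().split(",")
        busy[component] += float(end) - float(start)
    return dict(busy)


if __name__ == "__main__":
    trace = [
        "proc0,tok1,0.0,4.5",
        "global_bus,tok1,4.5,5.0",
        "proc1,tok1,5.0,9.0",
        "proc0,tok2,4.5,8.0",
    ]
    for comp, t in sorted(component_activity(trace).items()):
        print(f"{comp}: busy for {t:.1f} time units")
```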
66

Simulation and implementation of fixed-point digital filter structures

Bailey, Daniel A. 11 July 2009 (has links)
The purpose of this research is to develop a fixed-point arithmetic model based on a common general purpose Digital Signal Processor (DSP). A detailed non-linear model is developed to emulate the convergent (un-biased) rounding process performed by the Motorola DSP56002 fixed-point DSP. This model is incorporated into several different filter structures and compared to the linear stochastic simulation and the actual hardware implementation. It turns out that the convergent rounding operation has an insignificant effect on the overall roundoff noise power. The Direct Form, Section Optimal and MA Lattice forms are studied. For these structures, Matlab routines are developed to automate the process of fixed-point scaling and DSP56002 code generation. Each structure's non-linear simulation is validated using two filter examples. The scaling and simulation routines allow the filter designer to investigate the finite word length performance of various structures, scaling norms, overflow safety factors, and word lengths to determine the best filter parameters prior to hardware implementation. / Master of Science
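For reference, convergent (round-half-to-even) rounding of a fixed-point value can be sketched in a few lines. The bit widths below are illustrative assumptions; the rule itself is the unbiased tie-to-even behavior the abstract attributes to the DSP56002.

```python
# Sketch of convergent (round-half-to-even) rounding when discarding the low
# frac_bits bits of a nonnegative fixed-point integer. Bit widths are illustrative.

def convergent_round(value, frac_bits):
    """Drop frac_bits low bits, rounding exact halves to the nearest even result."""
    half = 1 << (frac_bits - 1)
    low = value & ((1 << frac_bits) - 1)  # the bits being discarded
    q = value >> frac_bits                # truncated result
    if low > half or (low == half and (q & 1)):
        q += 1  # round up above one half, or on a tie when q is odd
    return q


if __name__ == "__main__":
    # With 4 fractional bits these represent 5.5, 6.5 and 7.5; ties go to the
    # even neighbor (6, 6, 8), which avoids the DC bias of round-half-up.
    for v in (0b01011000, 0b01101000, 0b01111000):
        print(f"{v / 16:.1f} -> {convergent_round(v, 4)}")
```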
67

On-Board Spacecraft Time-Keeping Mission System Design and Verification

Wickham, Mark E. 10 1900 (has links)
International Telemetering Conference Proceedings / October 17-20, 1994 / Town & Country Hotel and Conference Center, San Diego, California / Spacecraft on-board time keeping, to an accuracy better than 1 millisecond, is a requirement for many satellite missions. Scientific satellites must precisely "time tag" their data to allow it to be correlated with data produced by a network of ground- and space-based observatories. Multiple-vehicle satellite missions, and satellite networks, sometimes require several spacecraft to execute tasks in time-phased fashion with respect to absolute time. In all cases, mission systems designed to provide a high-accuracy on-board clock must necessarily include mechanisms for the determination and correction of spacecraft clock error. In addition, an approach to on-orbit verification of these mechanisms may be required. Achieving this accuracy, however, need not introduce significant mission cost if the task of maintaining it is appropriately distributed across both the space and ground mission segments. This paper presents the mission systems approaches taken by two spacecraft programs to provide high-accuracy on-board spacecraft clocks at minimum cost. The first, NASA Goddard Space Flight Center's (GSFC) Extreme Ultraviolet Explorer (EUVE) program, demonstrated the ability to use the NASA Tracking and Data Relay Satellite System (TDRSS) mission environment to maintain an on-board spacecraft clock to within 100 microseconds of Naval Observatory Standard (NOS) Time. The second approach utilizes an on-board Global Positioning System (GPS) receiver as a time reference for spacecraft clock tracking, facilitated through the use of Fairchild's Telemetry and Command Processor (TCP) spacecraft Command & Data Handling Subsystem Unit. This approach was designed for a future Shuttle mission requiring the precise coordination of events among multiple space vehicles.
68

Design automation methodologies for extensible processor platform

Cheung, Newton, Computer Science & Engineering, Faculty of Engineering, UNSW January 2005 (has links)
This thesis addresses two ubiquitous trends in the embedded system world: the increasing importance of design turnaround time as a design metric, and the move towards closing the design productivity gap. Adopting the right design approach has been recognised as an integral part of the design flow in order to meet desired characteristics such as increasing software content, satisfying the growing complexity of an application, reusing off-the-shelf components, and exploring design metric trade-offs, thereby closing the design productivity gap. The importance of design turnaround time is motivated by the intense competition between manufacturers, especially makers of mainstream electronic consumer products, who shrink the product life cycle and require faster time-to-market to maximise economic benefits. This thesis presents a suite of design automation methodologies to automatically design embedded systems for an application in the state-of-the-art design approach: the extensible processor platform. These design automation methodologies systematise the extensible processor platform's design flow, with particular emphasis on solving four challenging design problems: i) code segment identification; ii) instruction generation; iii) architectural customisation selection; and iv) processor evaluation. Our suite of design automation methodologies includes: i) a semi-automatic design system, to design an extensible processor that maximises the application performance while satisfying the area constraint; by specifying a fitting function to identify suitable code segments within an application, a two-level hierarchy selection algorithm is used to first select a predefined processor and then select the right instructions, and a performance estimator is used to estimate an application's performance; ii) an instruction matching tool, to automatically match the pre-designed instructions with computationally intensive code segments, reducing verification time and effort; iii) an instruction estimation model, to estimate the area overhead, latency and power consumption of extensible instructions, exploring a larger design space; and iv) an instruction generation tool, to generate new extensible instructions that maximise the speedup while minimising power dissipation. A number of techniques, such as system decomposition, combinational equivalence checking and regression analysis, have been relied upon heavily in the creation of the final design system. This thesis shows results at every stage to demonstrate the efficacy of our design methodologies in the creation of extensible processors. The methodologies and results presented in this thesis demonstrate that automating the design process for an extensible processor platform results in a significant performance increase: on average, an increase of 4.74x (up to 15.71x) compared to the original base processor. Our system achieves significant design turnaround time savings (2.5% of the full simulation time for the entire design space) with the majority of Pareto points obtained (91% on average), and can lead to fewer and faster design iterations. Our instruction matching tool is 7.3x faster on average than the best known approaches to the problem (partial simulations). Our estimation model has a mean absolute error as small as 3.4% (6.7% max.) for area overhead, 5.9% (9.4% max.) for latency, and 4.2% (7.2% max.) for power consumption, compared to estimation through the time-consuming synthesis and simulation steps using commercial tools.
Finally, the instruction generation tool reduces energy consumption by a further 5.8% on average (up to 17.7%) compared to extensible instructions generated by previous approaches.
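The regression-based estimation idea mentioned above can be sketched as follows. The instruction features, gate counts, and plain least-squares model form are illustrative assumptions, not the thesis's actual estimation model.

```python
# Illustrative regression-style estimator: fit a least-squares linear model from
# simple instruction features to observed area, then predict a new candidate.
# Features, gate counts and model form are assumptions made for illustration.

import numpy as np

# Hypothetical training data: [adders, multipliers, operand width in bits].
features = np.array([
    [1, 0, 16],
    [2, 0, 32],
    [1, 1, 16],
    [0, 2, 32],
], dtype=float)
area_gates = np.array([450.0, 1100.0, 2300.0, 4200.0])

# Append a bias column and solve the least-squares problem.
X = np.hstack([features, np.ones((features.shape[0], 1))])
coeffs, *_ = np.linalg.lstsq(X, area_gates, rcond=None)

# Estimate the area of a new candidate instruction: 2 adders, 1 multiplier, 32-bit.
candidate = np.array([2, 1, 32, 1], dtype=float)
print("estimated area: %.0f gates" % (candidate @ coeffs))
```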
69

The Optimal Design for Face Detection Algorithm on Cell Processor Architecture

Ku, Po-Yu 24 August 2011 (has links)
With the advance of facial recognition technology, many related applications have emerged, such as clearance checks at specific facilities, airport security, video camera surveillance, and personnel identification. To maximize working efficiency and reduce human resources, the platform used for facial recognition should combine low cost, multimedia performance, and ease of use. Among the available platforms, an IBM CELL multi-core based platform that features these advantages is used in this work. To meet the demand for recognition accuracy, recognition algorithms featuring low error rates and regular data patterns are adopted. These algorithms are carried out in two parts: the Modified Census Transform (MCT) and the calculation of face hypotheses. The multi-point average value required by the MCT is obtained through parallel processing, and further improvement in recognition efficiency is possible if wider data paths are used. A PlayStation 3 (PS3) platform equipped with the IBM CELL multi-core processor is used in this thesis. The IBM CELL multi-core processor consists of a PowerPC Processor Element (PPE) and eight Synergistic Processor Elements (SPEs), which form a heterogeneous multi-core system. This system is capable of exploiting thread-level and data-level parallelism, which meets the demand for high data bandwidth and data parallelization. Using this platform to accelerate facial recognition, simulation results suggest that execution efficiency is improved by 24 times compared with a single SPE core. The simulation also shows that parallelizing the processing of facial recognition data is feasible. In the future, improved algorithms can be applied to improve the accuracy of facial recognition.
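The MCT step itself is compact enough to sketch. The NumPy version below is a scalar reference implementation of the usual MCT definition (compare each 3x3 neighbor against the window mean and pack nine bits); the thesis maps this computation onto the Cell's SPEs, which this sketch does not attempt.

```python
# Minimal NumPy sketch of the Modified Census Transform (MCT): each pixel's 3x3
# neighborhood is compared against the neighborhood mean to form a 9-bit code.

import numpy as np


def mct(image):
    """Return a 9-bit Modified Census Transform code per interior pixel."""
    img = image.astype(np.float32)
    h, w = img.shape
    # Nine shifted views of the image, one per 3x3 neighbor: shape (9, h-2, w-2).
    neigh = np.stack([img[dy:dy + h - 2, dx:dx + w - 2]
                      for dy in range(3) for dx in range(3)])
    mean = neigh.mean(axis=0)            # the multi-point average per 3x3 window
    bits = neigh > mean                  # one comparison bit per neighbor
    weights = (1 << np.arange(9)).reshape(9, 1, 1)
    return (bits * weights).sum(axis=0)  # pack the nine bits into one code


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.integers(0, 256, size=(6, 8), dtype=np.uint8)
    print(mct(frame))
```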
70

FPGA-based Soft Vector Processors

Yiannacouras, Peter 23 February 2010 (has links)
FPGAs are increasingly used to implement embedded digital systems because of their low time-to-market and low costs compared to integrated circuit design, as well as their superior performance and area compared to a general-purpose microprocessor. However, the hardware design necessary to achieve this superior performance and area is very difficult to perform, causing long design times and preventing widespread adoption of FPGA technology. The amount of hardware design can be reduced by employing a microprocessor for less-critical computation in the system. Often this microprocessor is implemented using the FPGA's reprogrammable fabric as a soft processor, which can preserve the benefits of a single-chip FPGA solution without specializing the device with dedicated hard processors. Current soft processors have simple architectures that provide performance adequate for only the least-critical computations. Our goal is to improve soft processors by scaling their performance and expanding their suitability to more critical computation. To this end we focus on the data parallelism found in many embedded applications and propose that soft processors be augmented with vector extensions to exploit this parallelism. We support this proposal through experimentation with a parameterized soft vector processor called VESPA (Vector Extended Soft Processor Architecture), which is designed, implemented, and evaluated on real FPGA hardware. The scalability of VESPA, combined with several other architectural parameters, can be used to finely span a large design space and derive a custom architecture exactly matching the needs of an application. Such customization is a key advantage for soft processors, since their architectures can be easily reconfigured by the end-user. Specifically, customizations can be made to the pipeline, functional units, and memory system within VESPA. In addition, general-purpose overheads can be automatically eliminated from VESPA. Comparing VESPA to manual hardware design, we observe a 13x speed advantage for hardware over our fastest VESPA, though this is significantly less than the 500x speed advantage hardware has over scalar soft processors. The performance-per-area of VESPA is also observed to be significantly higher than that of a scalar soft processor, suggesting that the addition of vector extensions makes more efficient use of silicon area for data-parallel workloads.
