Global ETD Search

241	Design and Implementation of Single Issue DSP Processor Core Ravinath, Vinodh January 2007 (has links) <p>Micro processors built specifically for digital signal processing are DSP processors. DSP is one of the core technologies in rapidly growing applications like communications and audio processing. The estimated growth of DSP processors in the last 6 years is over 40%. The variety of DSP capable processors for various applications also increased with the rising popularity of DSP processors. The design flow and architecture of such processors are not commonly available to students for learning.</p><p>This report is a structured approach to design and implementation of an embedded DSP processor core for voice, audio and video codec. The report focuses on the design requirement specification, senior instruction set and assembly manual release, micro architecture design and implementation of the core. Details about the core verification are also included in this report. The instruction set of this processor supports running basic kernels of BDTI benchmarking.</p> DSP processor codec Instruction set Microarchitecture FSM RTL Computer engineering Datorteknik
242	Benchmarking a DSP processor / Benchmarking av en DSP processor Lennartsson, Per, Nordlander, Lars January 2002 (has links) <p>This Master thesis describes the benchmarking of a DSP processor. Benchmarking means measuring the performance in some way. In this report, we have focused on the number of instruction cycles needed to execute certain algorithms. The algorithms we have used in the benchmark are all very common in signal processing today. </p><p>The results we have reached in this thesis have been compared to benchmarks for other processors, performed by Berkeley Design Technology, Inc. </p><p>The algorithms were programmed in assembly code and then executed on the instruction set simulator. After that, we proposed changes to the instruction set, with the aim to reduce the execution time for the algorithms. </p><p>The results from the benchmark show that our processor is at the same level as the ones tested by BDTI. Probably would a more experienced programmer be able to reduce the cycle count even more, especially for some of the more complex benchmarks.</p> Datorteknik Benchmarking DSP processor digital signal processing algorithm kernels DSP Datorteknik Computer engineering Datorteknik
243	An investigation of the measurement accuracy and productivity of a Waratah HTH 625c Processor Head Saathof, David January 2014 (has links) Log processor heads have become increasingly used in New Zealand (NZ) forest harvesting operations to increase productivity and improve worker safety. Information regarding the measurement accuracy and productivity of new model processor heads is limited. As a result, log quality control (QC) is carried out on logs that have been merchandised by a processor head. This task can have a high risk for injury from man – machine interaction. A trend between studies was that older model Waratah’s did not have sufficient measurement accuracy to alleviate the requirement for log QC. In this study, a Waratah HTH 625c processor head operating in NZ was analysed for measurement accuracy and productivity. Measurement accuracy was considered by measuring logs for length, diameter and branch size. A comparison of two methods of processing was also considered to determine measurement accuracy, productivity and production efficiency for the way logs are delimbed and merchandised. Once gathered, the data was then analysed to identify significant effects, trends and relationships between variables. Length measurements were highly accurate but diameter measurements were under- estimated. It was also evident that although there was absolute accuracy, there was a high variability in measurements with underestimating and overestimating. Branch size was also found to have a significant impact in reducing length measurement accuracy and productivity. Single pass processing has significantly higher production efficiency than two pass processing, although single pass processing had a higher length error associated with it. The Waratah HTH 625c processor head has better measurement accuracy than older model Waratah’s. However, logs are still cut out-of-spec which will require a log QC to identify. As measurement technology is further improved in processor heads, and improvements to NZ’s plantation resource (improved form and smaller branching) are realised at harvest age, measurement accuracy and productivity of log processor heads will further improve. Processor Head Waratah HTH 625c Log Quality Control Measurement Accuracy Branch Size Productivity Production Time Tolerances
244	Memory-subsystem resource management for the many-core era Kaseridis, Dimitrios 11 July 2012 (has links) As semiconductor technology continues to scale lower in the nanometer era, the communication between processor and main memory has been particularly challenged. The well-studied frequency, memory and power ``walls'' have redirect architects towards utilizing Chip Multiprocessors (CMP) as an attractive architecture for leveraging technology scaling. In order to achieve high efficiency and throughput, CMPs rely heavily on sharing resources among multiple cores, especially in the case of the memory hierarchy. Unfortunately, such sharing introduces resource contention and interference between the multiple executing threads. The ever-increasing access latency difference between processor and memory, the gradually increasing memory bandwidth demands to main memory, and the decreasing cache capacity size available to each core due to multiple core integration, has made the need for an efficient memory subsystem resource management more critical than ever before. This dissertation focuses on managing the sharing of the Last-level Cache (LLC) capacity and the main memory bandwidth, as the two most important resources that significantly affect system performance and energy consumption. The presented schemes include efficient solutions to all of the three basic requirements for implementing a resource management schemes, that is: a) profiling mechanisms to capture applications' resource requirements, b) microarchitecture mechanisms to enforce a resource allocation scheme, and c) resource allocations algorithms/policies to manage the available memory resources throughput the whole memory hierarchy of a CMP system. To achieve these targets the dissertation first describes a set of low overhead, non-invasive profiling mechanisms that are able to project applications’ memory resource requirements and memory sharing behavior. Two memory resource partitioning schemes are presented. The first one, the Bank-aware dynamic partitioning scheme provides a low overhead solution for partitioning cache resources of large CMP architectures that are based on a Dynamic Non-Uniform Cache Architecture (DNUCA) last-level cache design, consistent with the current industry trends. In addition, the second scheme, the Bandwidth-aware dynamic scheme presents a system-wide optimization of memory-subsystem resource allocation and job scheduling for large, multi-chip CMP systems. The scheme is seeking for optimizations both within and outside single CMP chips, aiming at overall system throughput and efficiency improvements. As cache partitioning schemes with isolated partitions impose a set of restrictions in the use of the last-level cache, which can severely affect the performance of large CMP designs, this dissertation presents a Quasi-partitioning scheme that breaks such restrictions while providing most of the benefits of cache partitioning schemes. The presented solution is able to efficiently scale to a significant larger number of cores than what previously described schemes that are based on isolated partition can achieve. Finally, as the memory controller is one of the fundamental components of the memory-subsystem, a well-designed memory-subsystem resource management needs to carefully utilize the memory controller resources and coordinate its functionality with the operation of the main memory and the last-level cache. To improve execution fairness and system throughput, this dissertation presents a criticality-based, memory controller requests priority scheme. The scheme ranks demand read and prefetch operations based on their latency sensitivity, while it coordinates its operation with the DRAM page-mode policy and the memory data prefetcher. / text Chip-multiprocessors Many-core Cache Memory Resource-management Processor Computer architecture Memory controllers
245	Resource management for efficient single-ISA heterogeneous computing Chen, Jian, doctor of electrical and computer engineering 11 July 2012 (has links) Single-ISA heterogeneous multi-core processors (SHMP) have become increasingly important due to their potential to significantly improve the execution efficiency for diverse workloads and thereby alleviate the power density constraints in Chip Multiprocessors (CMP). The importance of SHMP is further underscored by the fact that manufacturing defects and process variation could also cause single-ISA heterogeneity in CMPs even though the CMP is originally designed as homogeneous. However, to fully exploit the execution efficiency that SHMP has to offer, programs have to be efficiently mapped/scheduled to the appropriate cores such that the hardware resources of the cores match the resource demands of the programs, which is challenging and remains an open problem. This dissertation presents a comprehensive set of off-line and on-line techniques that leverage analytical performance modeling to bridge the gap between the workload diversity and the hardware heterogeneity. For the off-line scenario, this dissertation presents an efficient resource demand analysis framework that can estimate the resource demands of a program based on the inherent characteristics of the program without using any detailed simulation. Based on the estimated resource demands, this dissertation further proposes a multi-dimensional program-core matching technique that projects program resource demands and core configurations to a unified multi-dimensional space, and uses the weighted Euclidean distance between these two to identify the matching program-core pair. This dissertation also presents a dynamic and predictive application scheduler for SHMPs. It uses a set of hardware-efficient online profilers and an analytical performance model to simultaneously predict the application’s performance on different cores. Based on the predicted performance, the scheduler identifies and enforces near-optimal application assignment for each scheduling interval without any trial runs or off-line profiling. Using only a few kilo-bytes of extra hardware, the proposed heterogeneity-aware scheduler improves the weighted speedup by 11.3% compared with the commodity OpenSolaris scheduler and by 6.8% compared with the best known research scheduler. Finally, this dissertation presents a predictive yet cost effective mechanism to manage intra-core and/or inter-core resources in dynamic SHMP. It also uses a set of hardware-efficient online profilers and an analytical performance model to predict the application’s performance with different resource allocations. Based on the predicted performance, the resource allocator identifies and enforces near optimum resource partitions for each epoch without any trial runs. The experimental results show that the proposed predictive resource management framework could improve the weighted speedup of the CMP system by an average of 11.6% compared with the equal partition scheme, and 9.3% compared with existing reactive resource management scheme. / text Performance modeling Heterogeneous multicore processor Hardware resource demand Application scheduling Resource management
246	Predictive power management for multi-core processors Bircher, William Lloyd 07 February 2011 (has links) Energy consumption by computing systems is rapidly increasing due to the growth of data centers and pervasive computing. In 2006 data center energy usage in the United States reached 61 billion kilowatt-hours (KWh) at an annual cost of 4.5 billion USD [Pl08]. It is projected to reach 100 billion KWh by 2011 at a cost of 7.4 billion USD. The nature of energy usage in these systems provides an opportunity to reduce consumption. Specifically, the power and performance demand of computing systems vary widely in time and across workloads. This has led to the design of dynamically adaptive or power managed systems. At runtime, these systems can be reconfigured to provide optimal performance and power capacity to match workload demand. This causes the system to frequently be over or under provisioned. Similarly, the power demand of the system is difficult to account for. The aggregate power consumption of a system is composed of many heterogeneous systems, each with a unique power consumption characteristic. This research addresses the problem of when to apply dynamic power management in multi-core processors by accounting for and predicting power and performance demand at the core-level. By tracking performance events at the processor core or thread-level, power consumption can be accounted for at each of the major components of the computing system through empirical, power models. This also provides accounting for individual components within a shared resource such as a power plane or top-level cache. This view of the system exposes the fundamental performance and power phase behavior, thus making prediction possible. This dissertation also presents an extensive analysis of complete system power accounting for systems and workloads ranging from servers to desktops and laptops. The analysis leads to the development of a simple, effective prediction scheme for controlling power adaptations. The proposed Periodic Power Phase Predictor (PPPP) identifies patterns of activity in multi-core systems and predicts transitions between activity levels. This predictor is shown to increase performance and reduce power consumption compared to reactive, commercial power management schemes by achieving higher average frequency in active phases and lower average frequency in idle phases. / text Power management CPU Processor Multi-core DVFS Power modeling Power adaptation Program phase
247	A hybrid MPI/OpenMP parallelization of the adaptive integral method for multi-core clusters Wei, Fangzhou 02 August 2011 (has links) A hybrid of message passing and shared memory techniques is presented for scalable parallelization of the adaptive integral method (AIM), an FFT based algorithm, on clusters of identical multi-core processors. The proposed hybrid MPI/OpenMP parallelization scheme is based on a nested one-dimensional (1-D) slab decomposition of the 3-D auxiliary uniform grid and the associated AIM calculations: If there are M processors and T cores per processor, the scheme (i) divides the uniform grid into M slabs and MT sub-slabs, (ii) assigns each slab/sub-slab and the associated operations to one of the processors/cores, and (iii) uses MPI for inter-processor data communication and OpenMP for intra-processor data exchange. The MPI/OpenMP parallel AIM is used to accelerate the MOM solution of combined-field integral equations pertinent to the analysis of scattering from perfectly conducting surfaces. The scalability and efficiency of the implementation are investigated theoretically and verified numerically by solving benchmark scattering problems on a (near) petaflop supercomputing cluster of quad-core processors. The timing and speedup results on up to 1024 processors show that the proposed hybrid MPI/OpenMP parallelization exhibits better strong scalability (fixed problem size speedup) compared to pure MPI parallelization when multiple cores are used on each processor. / text AIM CFIE MPI/OpenMP Multi-core processor cluster Multi-core processing Computer architecture
248	"Virtual malleability" applied to MPI jobs to improve their execution in a multiprogrammed environment" Utrera Iglesias, Gladys Miriam 10 December 2007 (has links) This work focuses on scheduling of MPI jobs when executing in shared-memory multiprocessors (SMPs). The objective was to obtain the best performance in response time in multiprogrammed multiprocessors systems using batch systems, assuming all the jobs have the same priority. To achieve that purpose, the benefits of supporting malleability on MPI jobs to reduce fragmentation and consequently improve the performance of the system were studied. The contributions made in this work can be summarized as follows:· Virtual malleability: A mechanism where a job is assigned a dynamic processor partition, where the number of processes is greater than the number of processors. The partition size is modified at runtime, according to external requirements such as the load of the system, by varying the multiprogramming level, making the job contend for resources with itself. In addition to this, a mechanism which decides at runtime if applying local or global process queues to an application depending on the load balancing between processes of it. · A job scheduling policy, that takes decisions such as how many processes to start with and the maximum multiprogramming degree based on the type and number of applications running and queued. Moreover, as soon as a job finishes execution and where there are queued jobs, this algorithm analyzes whether it is better to start execution of another job immediately or just wait until there are more resources available. · A new alternative to backfilling strategies for the problema of window execution time expiring. Virtual malleability is applied to the backfilled job, reducing its partition size but without aborting or suspending it as in traditional backfilling. The evaluation of this thesis has been done using a practical approach. All the proposals were implemented, modifying the three scheduling levels: queuing system, processor scheduler and runtime library. The impact of the contributions were studied under several types of workloads, varying machine utilization, communication and, balance degree of the applications, multiprogramming level, and job size. Results showed that it is possible to offer malleability over MPI jobs. An application obtained better performance when contending for the resources with itself than with other applications, especially in workloads with high machine utilization. Load imbalance was taken into account obtaining better performance if applying the right queue type to each application independently.The job scheduling policy proposed exploited virtual malleability by choosing at the beginning of execution some parameters like the number of processes and maximum multiprogramming level. It performed well under bursty workloads with low to medium machine utilizations. However as the load increases, virtual malleability was not enough. That is because, when the machine is heavily loaded, the jobs, once shrunk are not able to expand, so they must be executed all the time with a partition smaller than the job size, thus degrading performance. Thus, at this point the job scheduling policy concentrated just in moldability.Fragmentation was alleviated also by applying backfilling techniques to the job scheduling algorithm. Virtual malleability showed to be an interesting improvement in the window expiring problem. Backfilled jobs even on a smaller partition, can continue execution reducing memory swapping generated by aborts/suspensions In this way the queueing system is prevented from reinserting the backfilled job in the queue and re-executing it in the future. parallel applications malleability process scheduling processor scheduling job scheduling mpi parallel computing load balancing message passsing.
249	Efficient Multi-ported Memories for FPGAs LaForest, Charles Eric 15 February 2010 (has links) Multi-ported memories are challenging to implement on FPGAs since the provided block RAMs typically have only two ports. In this dissertation we present a thorough exploration of the design space of FPGA multi-ported memories by evaluating conventional solutions to this problem, and introduce a new design that efficiently combines block RAMs into multi-ported memories with arbitrary numbers of read and write ports and true random access to any memory location, while achieving significantly higher operating frequencies than conventional approaches. For example we build a 256-location, 32-bit, 12-ported (4-write, 8-read) memory that operates at 281 MHz on Altera Stratix III FPGAs while consuming an area equivalent to 3679 ALMs: a 43% speed improvement and 84% area reduction over a pure ALM implemen- tation, and a 61% speed improvement over a pure "multipumped" implementation, although the pure multipumped implementation is 7.2-fold smaller. FPGA computer architecture multi-ported memory multipumping soft processor parallelism FIFO buffers 0544
250	Efficient Multi-ported Memories for FPGAs LaForest, Charles Eric 15 February 2010 (has links) Multi-ported memories are challenging to implement on FPGAs since the provided block RAMs typically have only two ports. In this dissertation we present a thorough exploration of the design space of FPGA multi-ported memories by evaluating conventional solutions to this problem, and introduce a new design that efficiently combines block RAMs into multi-ported memories with arbitrary numbers of read and write ports and true random access to any memory location, while achieving significantly higher operating frequencies than conventional approaches. For example we build a 256-location, 32-bit, 12-ported (4-write, 8-read) memory that operates at 281 MHz on Altera Stratix III FPGAs while consuming an area equivalent to 3679 ALMs: a 43% speed improvement and 84% area reduction over a pure ALM implemen- tation, and a 61% speed improvement over a pure "multipumped" implementation, although the pure multipumped implementation is 7.2-fold smaller. FPGA computer architecture multi-ported memory multipumping soft processor parallelism FIFO buffers 0544

Search results