491 |
Efficient Handling of Narrow Width and Streaming Data in Embedded Applications. Li, Bengu (January 2006)
Embedded environments impose severe resource constraints on applications; performance, memory footprint, and power consumption are all critical factors. Meanwhile, the data in embedded applications exhibit distinctive properties. Narrow width data are representable in considerably fewer bits than one word, yet occupy an entire register or memory word; streaming data are input data processed sequentially by an application, staying in the system only briefly and thus exhibiting little data locality. Both affect the efficiency of registers, caches, and memory, and must be taken into account when optimizing for performance, memory footprint, and power consumption. This dissertation proposes methods to handle narrow width and streaming data efficiently in embedded applications. Quantitative measurements of narrow width and streaming data are performed to guide the optimizations, and novel architectural features with associated compiler algorithms are developed. To handle narrow width data efficiently in registers, two register allocation schemes are proposed for the ARM processor that allocate two narrow width variables to one register: a static scheme exploits maximum bitwidth, while a speculative scheme further exploits dynamic bitwidth; both reduce spill cost and improve performance. To handle narrow width data efficiently in memory, a memory layout method is proposed that coalesces multiple narrow width data into one memory location on a DSP processor, leading to fewer explicit address calculations; this improves performance and shrinks the memory footprint. To handle streaming data efficiently in a network processor, two cache mechanisms are proposed that enable the reuse of data and computation; the slack they create is further converted into energy savings through a fetch gating mechanism.
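To make the packing idea concrete, here is a minimal Python sketch (an illustration only, not the dissertation's ARM allocator or DSP layout algorithm) of how two 16-bit narrow width values can share one 32-bit register or memory word, with the sub-word extract and insert operations a compiler would have to emit:

```python
# Sketch: two narrow width (<=16-bit) variables coalesced into one 32-bit
# word, as a register allocator or memory-layout pass might arrange.
MASK16 = 0xFFFF

def pack(lo: int, hi: int) -> int:
    """Place two 16-bit values in the low and high halves of one word."""
    assert 0 <= lo <= MASK16 and 0 <= hi <= MASK16
    return (hi << 16) | lo

def extract_lo(word: int) -> int:
    return word & MASK16

def extract_hi(word: int) -> int:
    return (word >> 16) & MASK16

def insert_lo(word: int, value: int) -> int:
    """Overwrite the low half without disturbing the high half."""
    return (word & ~MASK16) | (value & MASK16)

reg = pack(42, 7)            # two variables now share one "register"
assert extract_lo(reg) == 42 and extract_hi(reg) == 7
reg = insert_lo(reg, 100)    # update one variable only
assert extract_lo(reg) == 100 and extract_hi(reg) == 7
```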
|
492 |
Efficient Multi-ported Memories for FPGAs. LaForest, Charles Eric (15 February 2010)
Multi-ported memories are challenging to implement on FPGAs because the provided block RAMs typically have only two ports. This dissertation presents a thorough exploration of the design space of FPGA multi-ported memories, evaluating conventional solutions and introducing a new design that efficiently combines block RAMs into multi-ported memories with arbitrary numbers of read and write ports and true random access to any memory location, while achieving significantly higher operating frequencies than conventional approaches. For example, we build a 256-location, 32-bit, 12-ported (4-write, 8-read) memory that operates at 281 MHz on Altera Stratix III FPGAs while consuming an area equivalent to 3679 ALMs: a 43% speed improvement and 84% area reduction over a pure ALM implementation, and a 61% speed improvement over a pure "multipumped" implementation, although the pure multipumped implementation is 7.2-fold smaller.
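One published way to combine two-port block RAMs into an m-write, n-read memory is a "live value table" (LVT) construction; assuming that style here, the behavioral Python sketch below models the key idea: each write port writes only its own bank of simple RAMs, and a small table records, per address, which bank holds the newest value, steering reads accordingly.

```python
# Behavioral sketch (assuming an LVT-style construction) of an m-write
# multi-ported memory built from simple single-write banks. Real designs
# also replicate each bank per read port so every read has its own RAM.
class MultiPortedMemory:
    def __init__(self, depth: int, n_write: int):
        self.banks = [[0] * depth for _ in range(n_write)]
        self.lvt = [0] * depth          # which write port wrote each address last

    def write(self, port: int, addr: int, data: int) -> None:
        self.banks[port][addr] = data   # each write port writes only its bank
        self.lvt[addr] = port           # record the live (most recent) writer

    def read(self, addr: int) -> int:
        # The LVT steers the read to the bank holding the newest value.
        return self.banks[self.lvt[addr]][addr]

mem = MultiPortedMemory(depth=256, n_write=4)
mem.write(0, 5, 111)
mem.write(3, 5, 222)                    # port 3 writes the same address later
assert mem.read(5) == 222               # read returns the live value
```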
|
494 |
Vienlusčių tinklo architektūrų tyrimas / Research in Network on Chip architectures. Čepaitis, Modestas (16 August 2007)
According to the International Technology Roadmap for Semiconductors (ITRS), before the end of this decade we will enter the era of a billion transistors on a single chip: a 50-100 nm chip comprising around 4 billion transistors operating at a frequency of 10 GHz. However, as systems grow, so does the complexity of integrating their various components on a chip. The major threat to achieving a billion-transistor chip is the poor scalability of the interconnect structure of today's SoCs, whose shared buses are indivisible and become an ever larger bottleneck. To cope with the growing interconnect infrastructure, the Network on Chip (NoC) concept was introduced. NoCs present a possible communication-infrastructure solution for dealing with increased design complexity and shrinking time-to-market, and can potentially become the preferred interconnection approach for SoCs developed in the near future. This thesis examines the reuse of NoC components to parameterize and test such systems: a preprocessing-style reuse method is used to create a generic router metaspecification for a 2D torus interconnect network.
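As an illustration of the kind of network such a router serves, the following Python sketch (added here for exposition; it is not the thesis's metaspecification output) implements minimal dimension-order routing on a k-by-k 2D torus, always stepping in the shorter wrap-around direction:

```python
# Minimal sketch of dimension-order (X then Y) routing on a k-by-k 2D torus.
def next_hop(cur, dst, k):
    """Return the next (x, y) node on a minimal X-then-Y torus route."""
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:
        step = 1 if (dx - cx) % k <= (cx - dx) % k else -1  # shorter wrap direction
        return ((cx + step) % k, cy)
    if cy != dy:
        step = 1 if (dy - cy) % k <= (cy - dy) % k else -1
        return (cx, (cy + step) % k)
    return cur                                              # already at destination

# Route a packet across a 4x4 torus:
hop, dst = (0, 0), (3, 2)
path = [hop]
while hop != dst:
    hop = next_hop(hop, dst, k=4)
    path.append(hop)
print(path)   # [(0, 0), (3, 0), (3, 1), (3, 2)] -- wraps around in X
```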
|
495 |
Compiling for a Multithreaded Horizontally-microcoded Soft Processor Family. Tili, Ilian (28 November 2013)
Soft processing engines make FPGA programming simpler for software programmers. TILT is a multithreaded soft processing engine containing multiple deeply pipelined functional units of varying latency. This thesis presents a compiler framework for compiling and scheduling for TILT. By using the compiler to generate schedules and manage the hardware, we create computationally dense designs (high throughput per unit of hardware area) that make compelling processing engines. High schedule density is achieved by mixing instructions from different threads and by prioritizing the longest path of the data-flow graphs. Averaged across benchmark kernels, we achieve 90% of the theoretical throughput, and we reduce the performance gap relative to custom hardware from 543x for a scalar processor to only 4.41x by replicating TILT cores up to a comparable area cost. We also present methods for quickly navigating the design space and predicting the area of hardware configurations.
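The scheduling strategy described, interleaving instructions from many threads while prioritizing the longest remaining path of each data-flow graph, is in essence critical-path list scheduling. The Python sketch below illustrates that general technique only (it is not the TILT compiler itself):

```python
# Sketch of critical-path list scheduling across threads: issue ready
# instructions in order of longest remaining dependence-path latency.
import heapq

def longest_path(node, succs, lat, memo):
    """Latency of the longest path from node to any DFG sink."""
    if node not in memo:
        memo[node] = lat[node] + max((longest_path(s, succs, lat, memo)
                                      for s in succs.get(node, [])), default=0)
    return memo[node]

def schedule(nodes, succs, lat):
    preds_left = {n: 0 for n in nodes}
    for ss in succs.values():
        for s in ss:
            preds_left[s] += 1
    memo = {}
    # Max-heap keyed by critical-path length (negated for heapq's min-heap).
    ready = [(-longest_path(n, succs, lat, memo), n)
             for n in nodes if preds_left[n] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, n = heapq.heappop(ready)
        order.append(n)
        for s in succs.get(n, []):
            preds_left[s] -= 1
            if preds_left[s] == 0:
                heapq.heappush(ready, (-longest_path(s, succs, lat, memo), s))
    return order

# Two single-thread DFGs merged into one ready pool ("t0_"/"t1_" prefixes):
succs = {"t0_a": ["t0_b"], "t0_b": ["t0_c"], "t1_x": ["t1_y"]}
lat = {"t0_a": 4, "t0_b": 4, "t0_c": 1, "t1_x": 1, "t1_y": 1}
print(schedule(list(lat), succs, lat))  # long-chain thread-0 ops issue first
```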
|
497 |
Scalability issues in distributed and parallel databases. Gottemukkala, Vibby (January 1996)
No description available.
|
498 |
Asynchronous events: tools for distributed programming on concurrent object-based systems. Menon, Sathis N. (12 1900)
No description available.
|
499 |
A trace-driven simulation study of cache memories. Xiong, Bo (January 1989)
The purpose of this study is to explore the relationship between the hit ratio of a cache memory and its design parameters. Cache memories are widely used in computer system architectures to match relatively slow memories to fast CPUs. A cache holds the active segments of a program that are currently in use. Since instructions and data in a cache can be referenced much faster than main memory can be accessed, caches allow the execution rate of the machine to be substantially increased. To function effectively, cache memories must be carefully designed and implemented. In this study, a trace-driven simulation of direct-mapped, associative, and set-associative cache memories is performed. The simulation investigates the cache fetch algorithm, placement policy, cache size, and other design parameters, together with their effect on system performance. The cache memories are simulated in the C language, and the results are analyzed to guide the design and implementation of cache memories. / Department of Physics and Astronomy
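A miniature version of such a simulator is easy to write. The following Python sketch (the original study used C; this is only illustrative) computes the hit ratio of an LRU set-associative cache over an address trace; direct-mapped is the ways=1 case, and a single set gives full associativity:

```python
# Trace-driven simulation sketch: hit ratio of an LRU set-associative cache.
from collections import OrderedDict

def simulate(trace, cache_bytes, block_bytes, ways):
    n_sets = cache_bytes // (block_bytes * ways)
    sets = [OrderedDict() for _ in range(n_sets)]   # tag -> None, in LRU order
    hits = 0
    for addr in trace:
        block = addr // block_bytes
        s = sets[block % n_sets]
        tag = block // n_sets
        if tag in s:
            hits += 1
            s.move_to_end(tag)                      # mark most recently used
        else:
            if len(s) == ways:
                s.popitem(last=False)               # evict least recently used
            s[tag] = None
    return hits / len(trace)

# A tiny synthetic trace: 100 loop iterations over a 4 KB working set.
trace = [a for _ in range(100) for a in range(0, 4096, 32)]
print(f"hit ratio: {simulate(trace, cache_bytes=8192, block_bytes=32, ways=2):.3f}")
```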
|
500 |
Understanding Multicore Performance: Efficient Memory System Modeling and Simulation. Sandberg, Andreas (January 2014)
To increase performance, modern processors employ complex techniques such as out-of-order pipelines and deep cache hierarchies. While the increasing complexity has paid off in performance, it has become harder to accurately predict the effects of hardware/software optimizations in such systems. Traditional microarchitectural simulators typically execute code 10 000×–100 000× slower than native execution, which leads to three problems: First, high simulation overhead makes it hard to use microarchitectural simulators for tasks such as software optimization, where rapid turn-around is required. Second, when multiple cores share the memory system, the resulting performance is sensitive to how memory accesses from the different cores interleave. This requires that applications be simulated multiple times with different interleavings to estimate their performance distribution, which is rarely feasible with today's simulators. Third, the high overhead limits the size of the applications that can be studied. This is usually solved by simulating only a relatively small number of instructions near the start of an application, at the risk of reporting unrepresentative results. In this thesis we demonstrate three strategies to accurately model multicore processors without the overhead of traditional simulation. First, we show how microarchitecture-independent memory access profiles can be used to drive automatic cache optimizations and to qualitatively classify an application's last-level cache behavior. Second, we demonstrate how high-level performance profiles, which can be measured on existing hardware, can be used to model the behavior of a shared cache. Unlike previous models, we predict the effective amount of cache available to each application and the resulting performance distribution due to different interleavings, without requiring a processor model. Third, in order to model future systems, we build an efficient sampling simulator. By using native execution to fast-forward between samples, we reach new samples much faster than a single sample can be simulated. This enables us to simulate multiple samples in parallel, resulting in almost linear scalability and a maximum simulation rate close to native execution. / CoDeR-MP / UPMARC
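The third strategy can be summarized in a few lines. The Python sketch below outlines sampled simulation with native fast-forwarding; fast_forward() and detailed_simulate() are hypothetical placeholder hooks standing in for the native-execution and timing-model back-ends, so only the sampling logic is shown:

```python
# Sketch of sampled simulation: fast-forward natively between sample points,
# simulate a short window in detail at each point, and aggregate per-sample
# CPI into a mean and spread (a distribution, not just a point estimate).
import random
import statistics

def sampled_simulation(total_insts, n_samples, window):
    points = sorted(random.sample(range(total_insts - window), n_samples))
    cpis, pos = [], 0
    for p in points:
        fast_forward(max(0, p - pos))           # near-native speed, no modeling
        cpis.append(detailed_simulate(window))  # slow, detailed window
        pos = p + window
    return statistics.mean(cpis), statistics.stdev(cpis)

# Placeholder back-ends so the sketch runs standalone:
def fast_forward(n_insts):
    pass                                        # real system: native execution

def detailed_simulate(n_insts):
    return random.gauss(1.5, 0.2)               # real system: full timing model

random.seed(1)
mean_cpi, cpi_sd = sampled_simulation(10**9, n_samples=50, window=100_000)
print(f"CPI ~ {mean_cpi:.2f} +/- {cpi_sd:.2f}")
```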
|