Global ETD Search

21	Towards Energy-Efficient Mobile Sensing: Architectures and Frameworks for Heterogeneous Sensing and Computing Fan, Songchun January 2016 (has links) <p>Modern sensing apps require continuous and intense computation on data streams. Unfortunately, mobile devices are failing to keep pace despite advances in hardware capability. In contrast to powerful system-on-chips that rapidly evolve, battery capacities merely grow. This hinders the potential of long-running, compute-intensive sensing services such as image/audio processing, motion tracking and health monitoring, especially on small, wearable devices. </p><p>In this thesis, we present three pieces of work that target at improving the energy efficiency for mobile sensing. (1) In the first work, we study heterogeneous mobile processors that dynamically switch between high-performance and low-power cores according to tasks' performance requirements. We benchmark interactive mobile workloads and quantify the energy improvement of different microarchitectures. (2) Realizing that today's users often carry more than one mobile devices, in the second work, we extend the resource boundary of individual devices by prototyping a distributed framework that coordinates multiple devices. When devices share common sensing goals, the framework schedules sensing and computing tasks according to devices' heterogeneity, improving the performance and latency for compute-intensive sensing apps. (3) In the third work, we study the power breakdown of motion sensing apps on wearable devices and show that traditional offloading schemes cannot mitigate sensing’s high energy costs. We design a framework that allows the phone to take over sensing and computation by predicting the wearable's sensory data, when motions of the two devices are highly correlated. This allows the wearable to offload without communicating raw sensing data, resulting in little performance loss but significant energy savings.</p> / Dissertation Computer science Distributed systems Energy efficiency Microarchitecture Mobile computing Mobile sensing
22	HW/SW mechanisms for instruction fusion, issue and commit in modern u-processors Deb, Abhishek 03 May 2012 (has links) In this thesis we have explored the co-designed paradigm to show alternative processor design points. Specifically, we have provided HW/SW mechanisms for instruction fusion, issue and commit for modern processors. We have implemented a co-designed virtual machine monitor that binary translates x86 instructions into RISC like micro-ops. Moreover, the translations are stored as superblocks, which are a trace of basic blocks. These superblocks are further optimized using speculative and non-speculative optimizations. Hardware mechanisms exists in-order to take corrective action in case of misspeculations. During the course of this PhD we have made following contributions. Firstly, we have provided a novel Programmable Functional unit, in-order to speed up general-purpose applications. The PFU consists of a grid of functional units, similar to CCA, and a distributed internal register file. The inputs of the macro-op are brought from the Physical Register File to the internal register file using a set of moves and a set of loads. A macro-op fusion algorithm fuses micro-ops at runtime. The fusion algorithm is based on a scheduling step that indicates whether the current fused instruction is beneficial or not. The micro-ops corresponding to the macro-ops are stored as control signals in a configuration. The macro-op consists of a configuration ID which helps in locating the configurations. A small configuration cache is present inside the Programmable Functional unit, that holds these configurations. In case of a miss in the configuration cache configurations are loaded from I-Cache. Moreover, in-order to support bulk commit of atomic superblocks that are larger than the ROB we have proposed a speculative commit mechanism. For this we have proposed a Speculative commit register map table that holds the mappings of the speculatively committed instructions. When all the instructions of the superblock have committed the speculative state is copied to Backend Register Rename Table. Secondly, we proposed a co-designed in-order processor with with two kinds of accelerators. These FU based accelerators run a pair of fused instructions. We have considered two kinds of instruction fusion. First, we fused a pair of independent loads together into vector loads and execute them on vector load units. For the second kind of instruction fusion we have fused a pair of dependent simple ALU instructions and execute them in Interlock Collapsing ALUs (ICALU). Moreover, we have evaluated performance of various code optimizations such as list-scheduling, load-store telescoping and load hoisting among others. We have compared our co-designed processor with small instruction window out-of-order processors. Thirdly, we have proposed a co-designed out-of-order processor. Specifically we have reduced complexity in two areas. First of all, we have co-designed the commit mechanism, that enable bulk commit of atomic superblocks. In this solution we got rid of the conventional ROB, instead we introduce the Superblock Ordering Buffer (SOB). SOB ensures program order is maintained at the granularity of the superblock, by bulk committing the program state. The program state consists of the register state and the memory state. The register state is held in a per superblock register map table, whereas the memory state is held in gated store buffer and updated in bulk. Furthermore, we have tackled the complexity of Out-of-Order issue logic by using FIFOs. We have proposed an enhanced steering heuristic that fixes the inefficiencies of the existing dependence-based heuristic. Moreover, a mechanism to release the FIFO entries earlier is also proposed that further improves the performance of the steering heuristic. / En aquesta tesis hem explorat el paradigma de les màquines issue i commit per processadors actuals. Hem implementat una màquina virtual que tradueix binaris x86 a micro-ops de tipus RISC. Aquestes traduccions es guarden com a superblocks, que en realitat no és més que una traça de virtuals co-dissenyades. En particular, hem proposat mecanismes hw/sw per a la fusió d’instruccions, blocs bàsics. Aquests superblocks s’optimitzen utilitzant optimizacions especualtives i d’altres no speculatives. En cas de les optimizations especulatives es consideren mecanismes per a la gestió de errades en l’especulació. Al llarg d’aquesta tesis s’han fet les següents contribucions: Primer, hem proposat una nova unitat functional programmable (PFU) per tal de millorar l’execució d’aplicacions de proposit general. La PFU està formada per un conjunt d’unitats funcionals, similar al CCA, amb un banc de registres intern a la PFU distribuït a les unitats funcionals que la composen. Les entrades de la macro-operació que s’executa en la PFU es mouen del banc de registres físic convencional al intern fent servir un conjunt de moves i loads. Un algorisme de fusió combina més micro-operacions en temps d’execució. Aquest algorisme es basa en un pas de planificació que mesura el benefici de les decisions de fusió. Les micro operacions corresponents a la macro operació s’emmagatzemen com a senyals de control en una configuració. Les macro-operacions tenen associat un identificador de configuració que ajuda a localitzar d’aquestes. Una petita cache de configuracions està present dintre de la PFU per tal de guardar-les. En cas de que la configuració no estigui a la cache, les configuracions es carreguen de la cache d’instruccions. Per altre banda, per tal de donar support al commit atòmic dels superblocks que sobrepassen el tamany del ROB s’ha proposat un mecanisme de commit especulatiu. Per aquest mecanisme hem proposat una taula de mapeig especulativa dels registres, que es copia a la taula no especulativa quan totes les instruccions del superblock han comitejat. Segon, hem proposat un processador en order co-dissenyat que combina dos tipus d’acceleradors. Aquests acceleradors executen un parell d’instruccions fusionades. S’han considerat dos tipus de fusió d’instructions. Primer, combinem un parell de loads independents formant loads vectorials i els executem en una unitat vectorial. Segon, fusionem parells d’instruccions simples d’alu que són dependents i que s’executaran en una Interlock Collapsing ALU (ICALU). Per altra aquestes tecniques les hem evaluat conjuntament amb diverses optimizacions com list scheduling, load-store telescoping i hoisting de loads, entre d’altres. Aquesta proposta ha estat comparada amb un processador fora d’ordre. Tercer, hem proposat un processador fora d’ordre co-dissenyat efficient reduint-ne la complexitat en dos areas principals. En primer lloc, hem co-disenyat el mecanisme de commit per tal de permetre un eficient commit atòmic del superblocks. En aquesta solució hem substituït el ROB convencional, i en lloc hem introduït el Superblock Ordering Buffer (SOB). El SOB manté l’odre de programa a granularitat de superblock. L’estat del programa consisteix en registres i memòria. L’estat dels registres es manté en una taula per superblock, mentre que l’estat de memòria es guarda en un buffer i s’actulitza atòmicament. La segona gran area de reducció de complexitat considerarada és l’ús de FIFOs a la lògica d’issue. En aquest últim àmbit hem proposat una heurística de distribució que solventa les ineficiències de l’heurística basada en dependències anteriorment proposada. Finalment, i junt amb les FIFOs, s’ha proposat un mecanisme per alliberar les entrades de la FIFO anticipadament. co-designed virtual machines complexity - effective processors power - efficient processors microarchitecture dynamic binary optimization 004
23	Characterization and Avoidance of Critical Pipeline Structures in Aggressive Superscalar Processors Sassone, Peter G. 20 July 2005 (has links) In recent years, with only small fractions of modern processors now accessible in a single cycle, computer architects constantly fight against propagation issues across the die. Unfortunately this trend continues to shift inward, and now the even most internal features of the pipeline are designed around communication, not computation. To address the inward creep of this constraint, this work focuses on the characterization of communication within the pipeline itself, architectural techniques to avoid it when possible, and layout co-design for early detection of problems. I present work in creating a novel detection tool for common case operand movement which can rapidly characterize an applications dataflow patterns. The results produced are suitable for exploitation as a small number of patterns can describe a significant portion of modern applications. Work on dynamic dependence collapsing takes the observations from the pattern results and shows how certain groups of operations can be dynamically grouped, avoiding unnecessary communication between individual instructions. This technique also amplifies the efficiency of pipeline data structures such as the reorder buffer, increasing both IPC and frequency. I also identify the same sets of collapsible instructions at compile time, producing the same benefits with minimal hardware complexity. This technique is also done in a backward compatible manner as the groups are exposed by simple reordering of the binarys instructions. I present aggressive pipelining approaches for these resources which avoids the critical timing often presumed necessary in aggressive superscalar processors. As these structures are designed for the worst case, pipelining them can produce greater frequency benefit than IPC loss. I also use the observation that the dynamic issue order for instructions in aggressive superscalar processors is predictable. Thus, a hardware mechanism is introduced for caching the wakeup order for groups of instructions efficiently. These wakeup vectors are then used to speculatively schedule instructions, avoiding the dynamic scheduling when it is not necessary. Finally, I present a novel approach to fast and high-quality chip layout. By allowing architects to quickly evaluate what if scenarios during early high-level design, chip designs are less likely to encounter implementation problems later in the process. Computer architecture Microarchitecture Dependency collapsing Pipeline Floorplanning Microprocessors Design and construction Computer organization
24	Design and Implementation of Single Issue DSP Processor Core Ravinath, Vinodh January 2007 (has links) <p>Micro processors built specifically for digital signal processing are DSP processors. DSP is one of the core technologies in rapidly growing applications like communications and audio processing. The estimated growth of DSP processors in the last 6 years is over 40%. The variety of DSP capable processors for various applications also increased with the rising popularity of DSP processors. The design flow and architecture of such processors are not commonly available to students for learning.</p><p>This report is a structured approach to design and implementation of an embedded DSP processor core for voice, audio and video codec. The report focuses on the design requirement specification, senior instruction set and assembly manual release, micro architecture design and implementation of the core. Details about the core verification are also included in this report. The instruction set of this processor supports running basic kernels of BDTI benchmarking.</p> DSP processor codec Instruction set Microarchitecture FSM RTL Computer engineering Datorteknik
25	Efficient execution of sequential applications on multicore systems Robatmili, Behnam 19 September 2011 (has links) Conventional CMOS scaling has been the engine of the technology revolution in most application domains. This trend has changed as in each technology generation, transistor densities continue to increase while due to the limits on threshold voltage scaling, per-transistor energy consumption decreases much more slowly than in the past. The power scaling issues will restrict the adaptability of designs to operate in different power and performance regimes. Consequently, future systems must employ more efficient architectures for optimizing every thread in the program across different power and performance regimes, rather than architectures that utilize more transistors. One solution is composable or dynamic multicore architectures that can span a wide range of energy/performance operating points by enabling multiple simple cores to compose to form a larger and more powerful core. Explicit Data Graph Execution (EDGE) architectures represent a highly scalable class of composable processors that exploit predicated dataflow block execution and distributed microarchitectures. However, prior EDGE architectures suffer from several energy and performance bottlenecks including expensive intra-block operand communication due to fine-grain instruction distribution among cores, the compiler-generated fanout trees built for high-fanout operand delivery, poor next-block prediction accuracy, and low speculation rates due to predicates and expensive refills after pipeline flushes. To design an energy-efficient and flexible dynamic multicore, this dissertation employs a systematic methodology that detects inefficiencies and then designs and evaluates solutions that maximize power and performance efficiency across different power and performance regimes. Some innovations and optimization techniques include: (a) Deep Block Mapping extracts more coarse-grained parallelism and reduces cross-core operand network traffic by mapping each block of instructions into the instruction queue of one core instead of distributing blocks across all composed cores as done in previous EDGE designs, (b) Iterative Path Predictor (IPP) reduces branch and predication overheads by unifying multi-exit block target prediction and predicate path prediction while providing improved accuracy for each, (c) Register Bypassing reduces cross-core register communication delays by bypassing register values predicted to be critical directly from producing to consuming cores, (d) Block Reissue reduces pipeline flush penalties by reissuing instructions in previously executed instances of blocks while they are still in the instruction queue, and (e) Exposed Operand Broadcasts (EOBs) reduce wide-fanout instruction overheads by extending the ISA to employ architecturally exposed low-overhead broadcasts combined with dataflow for efficient operand delivery for both high- and low-fanout instructions. These components form the basis for a third-generation EDGE microarchitecture called T3. T3 improves energy efficiency by about 2x and performance by 47% compared to previous EDGE architectures. T3 also performs in a highly power efficient manner across a wide spectrum of energy and performance operating points (low-power to high-performance), extending the domain of power/performance trade-offs beyond what dynamic voltage and frequency scaling offers on state-of-the-art conventional processors. This high level of flexibility and power efficiency makes T3 an attractive candidate for future systems which need to operate on a wide range of workloads under varying power and performance constraints. / text Microarchitecture EDGE Multicore Single-thread performance Dataflow Block-atomic execution Power efficiency Composable cores
26	Kompiuterių mikroarchitektūrinis lygmuo / Computer microarchitecture level Damušis, Eimantas 24 September 2008 (has links) Pagrindinė kompiuterio sudedamoji dalis yra duomenų traktas. Jį sudaro keletas registrų, dvi arba trys magistralės bei vienas ar keli funkciniai blokai, tokie kaip aritmetinis ir loginis įrenginys. Pagrindinis ciklas susideda iš keleto operandų iškvietimo, kurie yra registruose ir jų perdavimui magistralėmis į ALĮ ar kitą funkcinį bloką. Po šių operacijų įvykdymo rezultatai yra saugomi atgal į registrus. Duomenų traktas gali būti valdomas eiliškumo tvarka, kuri iškviečia mikroinstrukcijas iš atminties. Kiekvieną mikroinstrukciją sudaro bitai, kurie yra valdomi duomenų trakto vieno ciklo metu. Šie bitai nustato kuriuos operandus reikia išrinkti, kokią operaciją reikia atlikti ir ką reikia daryti su rezultatais. IJVM – mašina su stekine organizacija ir vieno baito operacijos kodu, kuris padeda žodžius į steką, taip pat juos paima iš steko ir atlieka įvairias operacijas su žodžiais iš steko. Kalbant apie kompiuterių produktyvumą galima paminėti, jog pagrindinis metodas yra dalinės atminties naudojimas. Tiesioginio atvaizdavimo ir asociatyvi tarpin÷ atmintis plačiai taikoma kreipimosi į atmintį pagreitinimui. Nagrinėti trys pavyzdžiai, Pentium II, UltraSPARC II ir picoJava II daugumoje atvejų skiriasi vieni nuo kitų, tačiau vykdant instrukcijas jie yra labai panašūs. Pentium II turi CISC tipo instrukcijas, kurias padalina į mikroinstrukcijas. Jos yra apdirbamos superskaliarinės architektūros instrukcijos su šakojimosi numatymu. UltraSPARC II yra modernus 64 bitų procesorius su... [toliau žr. visą tekstą] / The main component of the computer is a data path. It consists of several registers, two or three buses, one or more functional blocks, such as ALU. The primary cycle consists of a call of operands from registers. Operands are transferred to ALU by buses. After these operations, results are saved in again in the registers. The data path can be controlled by the sequence method, which calls the microinstruction from the memory. Every microinstruction consists of bits, which are controlled by data path in one cycle. These bits determine which operands they have to choose, what you need to execute and what to do with the results. IJVM – a machine with stack organization. It has one-byte operations, which places words on the stack, pushing them from the stack and carry out various operations with words from the stack. Our three examples, Pentium II, UltraSPARC II and picoJava II in many ways differs from each other, but surprisingly are similar in executing instructions. Pentium II takes CISC instructions and splits them into microinstructions. UltraSPARC II is a modern 64 – bit processor with instruction set of RISC. PicoJava II is a simple designed processor for lowcost devices. It has no such features like dynamic branch prediction, but it can execute JVM instructions rather quickly like its done with RISC instructions. All three machines contain similar functional blocks that can process three microinstructions when they pass through the conveyor. Informatics Kompiuteriai Mikroarchitektūra Lygmenys Procesoriai Registrai Computers Microarchitecture Processor Level Registers
27	RELATIONSHIPS OF LONG-TERM BISPHOSPHONATE TREATMENT WITH MEASURES OF BONE MICROARCHITECTURE AND MECHANICAL COMPETENCE Ward, Jonathan Joseph 01 January 2014 (has links) Oral bisphosphonate drug therapy is a common and effective treatment for osteoporosis. Little is known about the long-term effects of bisphosphonates on bone quality. This study examined the structural and mechanical properties of trabecular bone following 0-16 years of bisphosphonate treatment. Fifty-three iliac crest bone samples of Caucasian women diagnosed with low turnover osteoporosis were identified from the Kentucky Bone Registry. Forty-five were treated with oral bisphosphonates for 1 to 16 years while eight were treatment naive. A section of trabecular bone was chosen from a micro-computed tomography (Scanco µCT 40) scan of each sample for a uniaxial linearly elastic compression simulation using finite element analysis (ANSYS 14.0). Morphometric parameters (BV/TV, SMI, Tb.Sp., etc.) were computed using µCT. Apparent modulus, effective modulus and estimated failure stress were calculated. Biomechanical and morphometric parameters improved with treatment duration, peaked around 7 years, and then declined independently of age. The findings suggest a limit to the benefits associated with bisphosphonate treatment and that extended continuous bisphosphonate treatment does not continue to improve bone quality. Bone quality, and subsequently bone strength, may eventually regress to a state poorer than at the onset of treatment. Treatment duration limited to less than 7 years is recommended. bisphosphonates bone microarchitecture finite element analysis micro-computed tomography osteoporosis Biomechanics and biotransport
28	Computational and storage based power and performance optimizations for highly accurate branch predictors relying on neural networks Aasaraai, Kaveh 09 August 2007 (has links) In recent years, highly accurate branch predictors have been proposed primarily for high performance processors. Unfortunately such predictors are extremely energy consuming and in some cases not practical as they come with excessive prediction latency. Perceptron and O-GEHL are two examples of such predictors. To achieve high accuracy, these predictors rely on large tables and extensive computations and require high energy and long prediction delay. In this thesis we propose power optimization techniques that aim at reducing both computational complexity and storage size for these predictors. We show that by eliminating unnecessary data from computations, we can reduce both predictor's energy consumption and prediction latency. Moreover, we apply information theory findings to remove noneffective storage used by O-GEHL, without any significant accuracy penalty. We reduce the dynamic and static power dissipated in the computational parts of the predictors. Meantime we improve performance as we make faster prediction possible. Branch Prediction Perceptron O-GEHL Microarchitecture
29	Microarchitecture and FPGA Implementation of the Multi-level Computing Architecture Capalija, Davor 30 July 2008 (has links) We design the microarchitecture of the Multi-Level Computing Architecture (MLCA), focusing on its Control Processor (CP). The design of the microarchitecture of the CP faces us with both opportunities and challenges that stem from the coarse granularity of the tasks and the large number of inputs and outputs for each task instruction. Thus, we explore changes to standard superscalar microarchitectural techniques. We design the entire CP microarchitecture and implement it on an FPGA using SystemVerilog. We synthesize and evaluate the MLCA system based on a 4-processor shared-memory multiprocessor. The performance of realistic applications shows scalable speedups that are comparable to that of simulation. We believe that our implementation achieves low complexity in terms of FPGA resource usage and operating frequency. In addition, we argue that our design methodology allows the scalability of the CP as the entire system grows. Computer architecture FPGA applications Microarchitecture Parallelism Embedded systems Multi-core systems 0984
30	Microarchitecture and FPGA Implementation of the Multi-level Computing Architecture Capalija, Davor 30 July 2008 (has links) We design the microarchitecture of the Multi-Level Computing Architecture (MLCA), focusing on its Control Processor (CP). The design of the microarchitecture of the CP faces us with both opportunities and challenges that stem from the coarse granularity of the tasks and the large number of inputs and outputs for each task instruction. Thus, we explore changes to standard superscalar microarchitectural techniques. We design the entire CP microarchitecture and implement it on an FPGA using SystemVerilog. We synthesize and evaluate the MLCA system based on a 4-processor shared-memory multiprocessor. The performance of realistic applications shows scalable speedups that are comparable to that of simulation. We believe that our implementation achieves low complexity in terms of FPGA resource usage and operating frequency. In addition, we argue that our design methodology allows the scalability of the CP as the entire system grows. Computer architecture FPGA applications Microarchitecture Parallelism Embedded systems Multi-core systems 0984

Search results