Global ETD Search

41	Early evaluation of multicore systems soft error reliability using virtual platforms / Avaliação de sistema de larga escala sob à influência de falhas temporárias durante a exploração de inicial projetos através do uso de plataformas virtuais Rosa, Felipe Rocha da January 2018 (has links) A crescente capacidade de computação dos componentes multiprocessados como processadores e unidades de processamento gráfico oferecem novas oportunidades para os campos de pesquisa relacionados computação embarcada e de alto desempenho (do inglês, high-performance computing). A crescente capacidade de computação progressivamente dos sistemas baseados em multicores permite executar eficientemente aplicações complexas com menor consumo de energia em comparação com soluções tradicionais de núcleo único. Essa eficiência e a crescente complexidade das cargas de trabalho das aplicações incentivam a indústria a integrar mais e mais componentes de processamento no mesmo sistema. O número de componentes de processamento empregados em sistemas grande escala já ultrapassa um milhão de núcleos, enquanto as plataformas embarcadas de 1000 núcleos estão disponíveis comercialmente. Além do enorme número de núcleos, a crescente capacidade de processamento, bem como o número de elementos de memória interna (por exemplo, registradores, memória RAM) inerentes às arquiteturas de processadores emergentes, está tornando os sistemas em grande escala mais vulneráveis a erros transientes e permanentes. Além disso, para atender aos novos requisitos de desempenho e energia, os processadores geralmente executam com frequências de relógio agressivos e múltiplos domínios de tensão, aumentando sua susceptibilidade à erros transientes, como os causados por efeitos de radiação. A ocorrência de erros transientes pode causar falhas críticas no comportamento do sistema, o que pode acarretar em perdas de vidas financeiras ou humanas. Embora tenha sido observada uma taxa de 280 erros transientes por dia durante o voo de uma nave espacial, os sistemas de processamento que trabalham à nível do solo devem experimentar pelo menos um erro transiente por dia em um futuro próximo. A susceptibilidade crescente de sistemas multicore à erros transientes necessariamente exige novas ferramentas para avaliar a resiliência à erro transientes de componentes multiprocessados em conjunto com pilhas complexas de software (sistema operacional, drivers) durante o início da fase de projeto. O objetivo principal abordado por esta Tese é desenvolver um conjunto de técnicas de injeção de falhas, que formam uma ferramenta de injeção de falha. O segundo objetivo desta Tese é estabelecer as bases para novas disciplinas de gerenciamento de confiabilidade considerando erro transientes em sistemas emergentes multi/manycore utilizando aprendizado de máquina. Este trabalho identifica multiplicas técnicas que podem ser usadas para fornecer diferentes níveis de confiabilidade na carga de trabalho e na criticidade do aplicativo. / The increasing computing capacity of multicore components like processors and graphics processing unit (GPUs) offer new opportunities for embedded and high-performance computing (HPC) domains. The progressively growing computing capacity of multicore-based systems enables to efficiently perform complex application workloads at a lower power consumption compared to traditional single-core solutions. Such efficiency and the ever-increasing complexity of application workloads encourage industry to integrate more and more computing components into the same system. The number of computing components employed in large-scale HPC systems already exceeds a million cores, while 1000-cores on-chip platforms are available in the embedded community. Beyond the massive number of cores, the increasing computing capacity, as well as the number of internal memory cells (e.g., registers, internal memory) inherent to emerging processor architectures, is making large-scale systems more vulnerable to both hard and soft errors. Moreover, to meet emerging performance and power requirements, the underlying processors usually run in aggressive clock frequencies and multiple voltage domains, increasing their susceptibility to soft errors, such as the ones caused by radiation effects. The occurrence of soft errors or Single Event Effects (SEEs) may cause critical failures in system behavior, which may lead to financial or human life losses. While a rate of 280 soft errors per day has been observed during the flight of a spacecraft, electronic computing systems working at ground level are expected to experience at least one soft error per day in near future. The increased susceptibility of multicore systems to SEEs necessarily calls for novel cost-effective tools to assess the soft error resilience of underlying multicore components with complex software stacks (operating system-OS, drivers) early in the design phase. The primary goal addressed by this Thesis is to describe the proposal and development of a fault injection framework using state-of-the-art virtual platforms, propose set of novel fault injection techniques to direct the fault campaigns according to with the software stack characteristics, and an extensive framework validation with over a million of simulation hours. The second goal of this Thesis is to set the foundations for a new discipline in soft error reliability management for emerging multi/manycore systems using machine learning techniques. It will identify and propose techniques that can be used to provide different levels of reliability on the application workload and criticality. Microeletrônica Tolerancia : Falhas Aprendizado : máquina Multi/Manycore Systems Machine Learning Soft Errors ARM Simulation Virtual Platforms Reliability Fault Tolerance
42	Analysis of transistor sizing and folding effectiveness to mitigate soft errors / Análise da influência do dimensionamento e partição de transistores e na proteção de circuitos contra efeitos de radiação Assis, Thiago Rocha de January 2009 (has links) Este trabalho apresenta uma avaliação da eficiência do dimensionamento e particionamento (folding) de transistores para a eliminação ou redução de efeitos de radiação. Durante o trabalho foi construído um modelo de transistor tipo-n MOSFET para a tecnologia 90nm, utilizando modelos preditivos. O transistor 3D modelado foi comparado com o modelo de transistor elétrico PTM level 54 da Arizona State University e os resultados mostraram uma coerência entre os dispositivos. Este transistor modelado foi irradiado por uma série de partículas que caracterizam ambientes terrestres e espaciais. Foi descoberto que a técnica de redimensionamento de transistores tem sua eficiência relacionada ao tipo de partícula do ambiente e não é aplicável em ambientes com partículas com alta energia. Descobriu-se também que aplicando o particionamento de transistores é possível reduzir a amplitude e a duração de erros transientes. A combinação do dimensionamento e o particionamento de transistores pode ser utilizada para a redução de efeitos de radiação incluindo partículas leves e pesadas. Por fim um estudo de caso foi realizado com uma célula de memória estática de 6 transistores utilizando as técnicas mencionadas anteriormente. Os resultados da célula de memória indicaram que a combinação das duas técnicas pode de fato reduzir e até impedir a mudança do estado lógico armazenado na célula. / In this work the transistor sizing and folding techniques were evaluated for SET robustness in a 90nm MOSFET technology using a 3D device model. A n-type MOSFET transistor using a 90nm technology predictive profile was modeled and functional behavior compared with PTM level 54 model showing a fit of the device with the PTM. During simulations the modeled device was irradiated in a simulation environment using particles with the profile of sea and space level ions. The radiation effects simulation had indicated that the transistor sizing can be more or less efficient to reduce SET according to the collected charge. It was found that for environments with high energy particle, transistor sizing was not able to reduce soft errors intensity. The use of folding has shown significant reduction of the amplitude and duration of the transient pulse, making this technique very useful to reduce soft errors. For alpha particles and heavy ions the combination of transistor folding and sizing had shown to be an effective combination to enhance the reliability of the circuits. A 6T SRAM cell was modeled to evaluate transistor sizing and folding techniques and the results confirmed the efficiency of folding plus sizing to reduce the effects of radiation. Microeletrônica Cmos Deteccao : Erros Tolerancia : Falhas Radiation effects Single event effect Transistor sizing Folding Microelectronics Fault tolerance Soft errors
43	Early evaluation of multicore systems soft error reliability using virtual platforms / Avaliação de sistema de larga escala sob à influência de falhas temporárias durante a exploração de inicial projetos através do uso de plataformas virtuais Rosa, Felipe Rocha da January 2018 (has links) A crescente capacidade de computação dos componentes multiprocessados como processadores e unidades de processamento gráfico oferecem novas oportunidades para os campos de pesquisa relacionados computação embarcada e de alto desempenho (do inglês, high-performance computing). A crescente capacidade de computação progressivamente dos sistemas baseados em multicores permite executar eficientemente aplicações complexas com menor consumo de energia em comparação com soluções tradicionais de núcleo único. Essa eficiência e a crescente complexidade das cargas de trabalho das aplicações incentivam a indústria a integrar mais e mais componentes de processamento no mesmo sistema. O número de componentes de processamento empregados em sistemas grande escala já ultrapassa um milhão de núcleos, enquanto as plataformas embarcadas de 1000 núcleos estão disponíveis comercialmente. Além do enorme número de núcleos, a crescente capacidade de processamento, bem como o número de elementos de memória interna (por exemplo, registradores, memória RAM) inerentes às arquiteturas de processadores emergentes, está tornando os sistemas em grande escala mais vulneráveis a erros transientes e permanentes. Além disso, para atender aos novos requisitos de desempenho e energia, os processadores geralmente executam com frequências de relógio agressivos e múltiplos domínios de tensão, aumentando sua susceptibilidade à erros transientes, como os causados por efeitos de radiação. A ocorrência de erros transientes pode causar falhas críticas no comportamento do sistema, o que pode acarretar em perdas de vidas financeiras ou humanas. Embora tenha sido observada uma taxa de 280 erros transientes por dia durante o voo de uma nave espacial, os sistemas de processamento que trabalham à nível do solo devem experimentar pelo menos um erro transiente por dia em um futuro próximo. A susceptibilidade crescente de sistemas multicore à erros transientes necessariamente exige novas ferramentas para avaliar a resiliência à erro transientes de componentes multiprocessados em conjunto com pilhas complexas de software (sistema operacional, drivers) durante o início da fase de projeto. O objetivo principal abordado por esta Tese é desenvolver um conjunto de técnicas de injeção de falhas, que formam uma ferramenta de injeção de falha. O segundo objetivo desta Tese é estabelecer as bases para novas disciplinas de gerenciamento de confiabilidade considerando erro transientes em sistemas emergentes multi/manycore utilizando aprendizado de máquina. Este trabalho identifica multiplicas técnicas que podem ser usadas para fornecer diferentes níveis de confiabilidade na carga de trabalho e na criticidade do aplicativo. / The increasing computing capacity of multicore components like processors and graphics processing unit (GPUs) offer new opportunities for embedded and high-performance computing (HPC) domains. The progressively growing computing capacity of multicore-based systems enables to efficiently perform complex application workloads at a lower power consumption compared to traditional single-core solutions. Such efficiency and the ever-increasing complexity of application workloads encourage industry to integrate more and more computing components into the same system. The number of computing components employed in large-scale HPC systems already exceeds a million cores, while 1000-cores on-chip platforms are available in the embedded community. Beyond the massive number of cores, the increasing computing capacity, as well as the number of internal memory cells (e.g., registers, internal memory) inherent to emerging processor architectures, is making large-scale systems more vulnerable to both hard and soft errors. Moreover, to meet emerging performance and power requirements, the underlying processors usually run in aggressive clock frequencies and multiple voltage domains, increasing their susceptibility to soft errors, such as the ones caused by radiation effects. The occurrence of soft errors or Single Event Effects (SEEs) may cause critical failures in system behavior, which may lead to financial or human life losses. While a rate of 280 soft errors per day has been observed during the flight of a spacecraft, electronic computing systems working at ground level are expected to experience at least one soft error per day in near future. The increased susceptibility of multicore systems to SEEs necessarily calls for novel cost-effective tools to assess the soft error resilience of underlying multicore components with complex software stacks (operating system-OS, drivers) early in the design phase. The primary goal addressed by this Thesis is to describe the proposal and development of a fault injection framework using state-of-the-art virtual platforms, propose set of novel fault injection techniques to direct the fault campaigns according to with the software stack characteristics, and an extensive framework validation with over a million of simulation hours. The second goal of this Thesis is to set the foundations for a new discipline in soft error reliability management for emerging multi/manycore systems using machine learning techniques. It will identify and propose techniques that can be used to provide different levels of reliability on the application workload and criticality. Microeletrônica Tolerancia : Falhas Aprendizado : máquina Multi/Manycore Systems Machine Learning Soft Errors ARM Simulation Virtual Platforms Reliability Fault Tolerance
44	Analysis of transistor sizing and folding effectiveness to mitigate soft errors / Análise da influência do dimensionamento e partição de transistores e na proteção de circuitos contra efeitos de radiação Assis, Thiago Rocha de January 2009 (has links) Este trabalho apresenta uma avaliação da eficiência do dimensionamento e particionamento (folding) de transistores para a eliminação ou redução de efeitos de radiação. Durante o trabalho foi construído um modelo de transistor tipo-n MOSFET para a tecnologia 90nm, utilizando modelos preditivos. O transistor 3D modelado foi comparado com o modelo de transistor elétrico PTM level 54 da Arizona State University e os resultados mostraram uma coerência entre os dispositivos. Este transistor modelado foi irradiado por uma série de partículas que caracterizam ambientes terrestres e espaciais. Foi descoberto que a técnica de redimensionamento de transistores tem sua eficiência relacionada ao tipo de partícula do ambiente e não é aplicável em ambientes com partículas com alta energia. Descobriu-se também que aplicando o particionamento de transistores é possível reduzir a amplitude e a duração de erros transientes. A combinação do dimensionamento e o particionamento de transistores pode ser utilizada para a redução de efeitos de radiação incluindo partículas leves e pesadas. Por fim um estudo de caso foi realizado com uma célula de memória estática de 6 transistores utilizando as técnicas mencionadas anteriormente. Os resultados da célula de memória indicaram que a combinação das duas técnicas pode de fato reduzir e até impedir a mudança do estado lógico armazenado na célula. / In this work the transistor sizing and folding techniques were evaluated for SET robustness in a 90nm MOSFET technology using a 3D device model. A n-type MOSFET transistor using a 90nm technology predictive profile was modeled and functional behavior compared with PTM level 54 model showing a fit of the device with the PTM. During simulations the modeled device was irradiated in a simulation environment using particles with the profile of sea and space level ions. The radiation effects simulation had indicated that the transistor sizing can be more or less efficient to reduce SET according to the collected charge. It was found that for environments with high energy particle, transistor sizing was not able to reduce soft errors intensity. The use of folding has shown significant reduction of the amplitude and duration of the transient pulse, making this technique very useful to reduce soft errors. For alpha particles and heavy ions the combination of transistor folding and sizing had shown to be an effective combination to enhance the reliability of the circuits. A 6T SRAM cell was modeled to evaluate transistor sizing and folding techniques and the results confirmed the efficiency of folding plus sizing to reduce the effects of radiation. Microeletrônica Cmos Deteccao : Erros Tolerancia : Falhas Radiation effects Single event effect Transistor sizing Folding Microelectronics Fault tolerance Soft errors
45	Analysis of transistor sizing and folding effectiveness to mitigate soft errors / Análise da influência do dimensionamento e partição de transistores e na proteção de circuitos contra efeitos de radiação Assis, Thiago Rocha de January 2009 (has links) Este trabalho apresenta uma avaliação da eficiência do dimensionamento e particionamento (folding) de transistores para a eliminação ou redução de efeitos de radiação. Durante o trabalho foi construído um modelo de transistor tipo-n MOSFET para a tecnologia 90nm, utilizando modelos preditivos. O transistor 3D modelado foi comparado com o modelo de transistor elétrico PTM level 54 da Arizona State University e os resultados mostraram uma coerência entre os dispositivos. Este transistor modelado foi irradiado por uma série de partículas que caracterizam ambientes terrestres e espaciais. Foi descoberto que a técnica de redimensionamento de transistores tem sua eficiência relacionada ao tipo de partícula do ambiente e não é aplicável em ambientes com partículas com alta energia. Descobriu-se também que aplicando o particionamento de transistores é possível reduzir a amplitude e a duração de erros transientes. A combinação do dimensionamento e o particionamento de transistores pode ser utilizada para a redução de efeitos de radiação incluindo partículas leves e pesadas. Por fim um estudo de caso foi realizado com uma célula de memória estática de 6 transistores utilizando as técnicas mencionadas anteriormente. Os resultados da célula de memória indicaram que a combinação das duas técnicas pode de fato reduzir e até impedir a mudança do estado lógico armazenado na célula. / In this work the transistor sizing and folding techniques were evaluated for SET robustness in a 90nm MOSFET technology using a 3D device model. A n-type MOSFET transistor using a 90nm technology predictive profile was modeled and functional behavior compared with PTM level 54 model showing a fit of the device with the PTM. During simulations the modeled device was irradiated in a simulation environment using particles with the profile of sea and space level ions. The radiation effects simulation had indicated that the transistor sizing can be more or less efficient to reduce SET according to the collected charge. It was found that for environments with high energy particle, transistor sizing was not able to reduce soft errors intensity. The use of folding has shown significant reduction of the amplitude and duration of the transient pulse, making this technique very useful to reduce soft errors. For alpha particles and heavy ions the combination of transistor folding and sizing had shown to be an effective combination to enhance the reliability of the circuits. A 6T SRAM cell was modeled to evaluate transistor sizing and folding techniques and the results confirmed the efficiency of folding plus sizing to reduce the effects of radiation. Microeletrônica Cmos Deteccao : Erros Tolerancia : Falhas Radiation effects Single event effect Transistor sizing Folding Microelectronics Fault tolerance Soft errors
46	Conception de systèmes embarqués fiables et auto-réglables : applications sur les systèmes de transport ferroviaire / Design of self-tuning reliable embedded systems and its application in railway transportation systems Alouani, Ihsen 26 April 2016 (has links) Un énorme progrès dans les performances des semiconducteurs a été accompli ces dernières années. Avec l’´émergence d’applications complexes, les systèmes embarqués doivent être à la fois performants et fiables. Une multitude de travaux ont été proposés pour améliorer l’efficacité des systèmes embarqués en réduisant le décalage entre la flexibilité des solutions logicielles et la haute performance des solutions matérielles. En vertu de leur nature reconfigurable, les FPGAs (Field Programmable Gate Arrays) représentent un pas considérable pour réduire ce décalage performance/flexibilité. Cependant, la reconfiguration dynamique a toujours souffert d’une limitation liée à la latence de reconfiguration.Dans cette thèse, une nouvelle technique de reconfiguration dynamiqueau niveau ”grain-moyen” pour les circuits à base de blocks DSP48E1 est proposée. L’idée est de profiter de la reprogrammabilité des blocks DSP48E1 couplée avec un circuit d’interconnection reconfigurable afin de changer la fonction implémentée par le circuit en un cycle horloge. D’autre part, comme les nouvelles technologies s’appuient sur la réduction des dimensions des transistors ainsi que les tensions d’alimentation, les circuits électroniques sont devenus de plus en plus susceptibles aux fautes transitoires. L’impact de ces erreurs au niveau système peut être catastrophique et les SETs (Single Event Transients) sont devenus une menace tangible à la fiabilité des systèmes embarqués, en l’occurrence pour les applications critiques comme les systèmes de transport. Les techniques de fiabilité qui se basent sur des taux d’erreurs (SERs) surestimés peuvent conduire à un gaspillage de ressources et par conséquent un cout en consommation de puissance électrique. Il est primordial de prendre en compte le phénomène de masquage d’erreur pour une estimation précise des SERs.Cette thèse propose une nouvelle modélisation inter-couches de la vulnérabilité des circuits qui combine les mécanismes de masquage au niveau transistor (TLM) et le masquage au niveau Système (SLM). Ce modèle est ensuite utilisé afin de construire une architecture adaptative tolérante aux fautes qui évalue la vulnérabilité effective du circuit en runtime. La stratégie d’amélioration de fiabilité est adaptée pour ne protéger que les parties vulnérables du système, ce qui engendre un circuit fiable avec un cout optimisé. Les expérimentations effectuées sur un système de détection d’obstacles à base de radar pour le transport ferroviaire montre que l’approche proposée permet d’´établir un compromis fiabilité/ressources utilisées. / During the last few decades, a tremendous progress in the performance of semiconductor devices has been accomplished. In this emerging era of high performance applications, machines need not only to be efficient but also need to be dependable at circuit and system levels. Several works have been proposed to increase embedded systems efficiency by reducing the gap between software flexibility and hardware high-performance. Due to their reconfigurable aspect, Field Programmable Gate Arrays (FPGAs) represented a relevant step towards bridging this performance/flexibility gap. Nevertheless, Dynamic Reconfiguration (DR) has been continuously suffering from a bottleneck corresponding to a long reconfiguration time.In this thesis, we propose a novel medium-grained high-speed dynamic reconfiguration technique for DSP48E1-based circuits. The idea is to take advantage of the DSP48E1 slices runtime reprogrammability coupled with a re-routable interconnection block to change the overall circuit functionality in one clock cycle. In addition to the embedded systems efficiency, this thesis deals with the reliability chanllenges in new sub-micron electronic systems. In fact, as new technologies rely on reduced transistor size and lower supply voltages to improve performance, electronic circuits are becoming remarkably sensitive and increasingly susceptible to transient errors. The system-level impact of these errors can be far-reaching and Single Event Transients (SETs) have become a serious threat to embedded systems reliability, especially for especially for safety critical applications such as transportation systems. The reliability enhancement techniques that are based on overestimated soft error rates (SERs) can lead to unnecessary resource overheads as well as high power consumption. Considering error masking phenomena is a fundamental element for an accurate estimation of SERs.This thesis proposes a new cross-layer model of circuits vulnerability based on a combined modeling of Transistor Level (TLM) and System Level Masking (SLM) mechanisms. We then use this model to build a self adaptive fault tolerant architecture that evaluates the circuit’s effective vulnerability at runtime. Accordingly, the reliability enhancement strategy is adapted to protect only vulnerable parts of the system leading to a reliable circuit with optimized overheads. Experimentations performed on a radar-based obstacle detection system for railway transportation show that the proposed approach allows relevant reliability/resource utilization tradeoffs. Systèmes embarqués Fiabilité Erreurs transitoires Embedded systems Reliability Dependability Dynamically reconfigurable architectures Soft errors
47	Scheduling and Optimization of Fault-Tolerant Embedded Systems Izosimov, Viacheslav January 2006 (has links) Safety-critical applications have to function correctly even in presence of faults. This thesis deals with techniques for tolerating effects of transient and intermittent faults. Reexecution, software replication, and rollback recovery with checkpointing are used to provide the required level of fault tolerance. These techniques are considered in the context of distributed real-time systems with non-preemptive static cyclic scheduling. Safety-critical applications have strict time and cost constrains, which means that not only faults have to be tolerated but also the constraints should be satisfied. Hence, efficient system design approaches with consideration of fault tolerance are required. The thesis proposes several design optimization strategies and scheduling techniques that take fault tolerance into account. The design optimization tasks addressed include, among others, process mapping, fault tolerance policy assignment, and checkpoint distribution. Dedicated scheduling techniques and mapping optimization strategies are also proposed to handle customized transparency requirements associated with processes and messages. By providing fault containment, transparency can, potentially, improve testability and debugability of fault-tolerant applications. The efficiency of the proposed scheduling techniques and design optimization strategies is evaluated with extensive experiments conducted on a number of synthetic applications and a real-life example. The experimental results show that considering fault tolerance during system-level design optimization is essential when designing cost-effective fault-tolerant embedded systems. Embedded systems Real-Time systems Design optimization Fault tolerance Transient faults Soft errors Information Systems
48	Testing Methodologies and Results of Radiation Induced Soft Errors for a COTS SRAM, FRAM, and SoC Stirk, Wesley Raymond 19 April 2023 (has links) (PDF) Methods for testing commercial off-the-shelf (COTS) digital devices at varying levels of complexity is presented and discussed as well as the results for testing a COTS SRAM, FRAM, and SoC using these methodologies in a pulsed dose rate environment at Little Mountain Test Facility (LMTF) and neutron testing at Los Alamos Neutron Science Center (LANSCE). Investigations at LMTF revealed a dependence in all three devices on the integrated dose of a single pulse of radiation, implying that the duration of radiation plays a significant role in the response. The test infrastructure necessary to dynamically access an FRAM at LMTF and time the access with the pulse of radiation allowed for the discovery of a new FRAM failure mode where an entire word of the FRAM becomes corrupted as well as selecting between two different failure modes based on the timing of the pulse. A novel component-based testing methodology for testing complicated SoCs is presented and used to report on the cross-sections of several components on the Xilinx MPSoC, including its DMA which has not previously been reported. radiation testing soft errors COTS SRAM FRAM SoC MPSoC neutrons SEU dose rate testing photocurrent integrated dose gamma rays Engineering
49	Selective software-implemented hardware fault tolerance tecnhiques to detect soft errors in processors with reduced overhead / Técnicas seletivas de tolerência a falhas em software com custo reduzido para detectar erros causados por falhas transientes em processadores Chielle, Eduardo January 2016 (has links) A utilização de técnicas de tolerância a falhas em software é uma forma de baixo custo para proteger processadores contra soft errors. Contudo, elas causam aumento no tempo de execução e utilização de memória. Em consequência disso, o consumo de energia também aumenta. Sistemas que operam com restrição de tempo ou energia podem ficar impossibilitados de utilizar tais técnicas. Por esse motivo, este trabalho propoe técnicas de tolerância a falhas em software com custos no desempenho e memória reduzidos e cobertura de falhas similar a técnicas presentes na literatura. Como detecção é menos custoso que correção, este trabalho foca em técnicas de detecção. Primeiramente, um conjunto de técnicas de dados baseadas em regras de generalização, chamada VAR, é apresentada. As técnicas são baseadas nesse conjunto generalizado de regras para permitir uma investigação exaustiva, em termos de confiabilidade e custos, de diferentes variações de técnicas. As regras definem como a técnica duplica o código e insere verificadores. Cada técnica usa um diferente conjunto de regras. Então, uma técnica de controle, chamada SETA, é introduzida. Comparando SETA com uma técnica estado-da-arte, SETA é 11.0% mais rápida e ocupa 10.3% menos posições de memória. As técnicas de dados mais promissoras são combinadas com a técnica de controle com o objetivo de proteger tanto os dados quanto o fluxo de controle da aplicação alvo. Para reduzir ainda mais os custos, métodos para aplicar seletivamente as técnicas propostas foram desenvolvidos. Para técnica de dados, em vez de proteger todos os registradores, somente um conjunto de registradores selecionados é protegido. O conjunto é selecionado com base em uma métrica que analisa o código e classifica os registradores por sua criticalidade. Para técnicas de controle, há duas abordagens: (1) remover verificadores de blocos básicos, e (2) seletivamente proteger blocos básicos. As técnicas e suas versões seletivas são avaliadas em termos de tempo de execução, tamanho do código, cobertura de falhas, e o Mean Work to Failure (MWTF), o qual é uma métrica que mede o compromisso entre cobertura de falhas e tempo de execução. Resultados mostram redução dos custos sem diminuição da cobertura de falhas, e para uma pequena redução na cobertura de falhas foi possível significativamente reduzir os custos. Por fim, uma vez que a avaliação de todas as possíveis combinações utilizando métodos seletivos toma muito tempo, este trabalho utiliza um método para extrapolar os resultados obtidos por simulação com o objetivo de encontrar os melhores parâmetros para a proteção seletiva e combinada de técnicas de dados e de controle que melhorem o compromisso entre confiabilidade e custos. / Software-based fault tolerance techniques are a low-cost way to protect processors against soft errors. However, they introduce significant overheads to the execution time and code size, which consequently increases the energy consumption. System operation with time or energy restrictions may not be able to make use of these techniques. For this reason, this work proposes software-based fault tolerance techniques with lower overheads and similar fault coverage to state-of-the-art software techniques. Once detection is less costly than correction, the work focuses on software-based detection techniques. Firstly, a set of data-flow techniques called VAR is proposed. The techniques are based on general building rules to allow an exhaustive assessment, in terms of reliability and overheads, of different technique variations. The rules define how the technique duplicates the code and insert checkers. Each technique uses a different set of rules. Then, a control-flow technique called SETA (Software-only Error-detection Technique using Assertions) is introduced. Comparing SETA with a state-of-the-art technique, SETA is 11.0% faster and occupies 10.3% fewer memory positions. The most promising data-flow techniques are combined with the control-flow technique in order to protect both dataflow and control-flow of the target application. To go even further with the reduction of the overheads, methods to selective apply the proposed software techniques have been developed. For the data-flow techniques, instead of protecting all registers, only a set of selected registers is protected. The set is selected based on a metric that analyzes the code and rank the registers by their criticality. For the control-flow technique, two approaches are taken: (1) removing checkers from basic blocks: all the basic blocks are protected by SETA, but only selected basic blocks have checkers inserted, and (2) selectively protecting basic blocks: only a set of basic blocks is protected. The techniques and their selective versions are evaluated in terms of execution time, code size, fault coverage, and Mean Work To Failure (MWTF), which is a metric to measure the trade-off between fault coverage and execution time. Results show that was possible to reduce the overheads without affecting the fault coverage, and for a small reduction in the fault coverage it was possible to significantly reduce the overheads. Lastly, since the evaluation of all the possible combinations for selective hardening of every application takes too much time, this work uses a method to extrapolate the results obtained by simulation in order to find the parameters for the selective combination of data and control-flow techniques that are probably the best candidates to improve the trade-off between reliability and overheads. Microeletrônica Tolerancia : Falhas : Software Processadores SIHFT techniques Selective hardening Transient faults Soft errors Single event effects SEU SET Processor Reliability Execution time Code size Energy consumption Lower overheads
50	Robust Design of Variation-Sensitive Digital Circuits Moustafa, Hassan January 2011 (has links) The nano-age has already begun, where typical feature dimensions are smaller than 100nm. The operating frequency is expected to increase up to 12 GHz, and a single chip will contain over 12 billion transistors in 2020, as given by the International Technology Roadmap for Semiconductors (ITRS) initiative. ITRS also predicts that the scaling of CMOS devices and process technology, as it is known today, will become much more difficult as the industry advances towards the 16nm technology node and further. This aggressive scaling of CMOS technology has pushed the devices to their physical limits. Design goals are governed by several factors other than power, performance and area such as process variations, radiation induced soft errors, and aging degradation mechanisms. These new design challenges have a strong impact on the parametric yield of nanometer digital circuits and also result in functional yield losses in variation-sensitive digital circuits such as Static Random Access Memory (SRAM) and flip-flops. Moreover, sub-threshold SRAM and flip-flops circuits, which are aggravated by the strong demand for lower power consumption, show larger sensitivity to these challenges which reduces their robustness and yield. Accordingly, it is not surprising that the ITRS considers variability and reliability as the most challenging obstacles for nanometer digital circuits robust design. Soft errors are considered one of the main reliability and robustness concerns in SRAM arrays in sub-100nm technologies due to low operating voltage, small node capacitance, and high packing density. The SRAM arrays soft errors immunity is also affected by process variations. We develop statistical design-oriented soft errors immunity variations models for super-threshold and sub-threshold SRAM cells accounting for die-to-die variations and within-die variations. This work provides new design insights and highlights the important design knobs that can be used to reduce the SRAM cells soft errors immunity variations. The developed models are scalable, bias dependent, and only require the knowledge of easily measurable parameters. This makes them useful in early design exploration, circuit optimization as well as technology prediction. The derived models are verified using Monte Carlo SPICE simulations, referring to an industrial hardware-calibrated 65nm CMOS technology. The demand for higher performance leads to very deep pipelining which means that hundreds of thousands of flip-flops are required to control the data flow under strict timing constraints. A violation of the timing constraints at a flip-flop can result in latching incorrect data causing the overall system to malfunction. In addition, the flip-flops power dissipation represents a considerable fraction of the total power dissipation. Sub-threshold flip-flops are considered the most energy efficient solution for low power applications in which, performance is of secondary importance. Accordingly, statistical gate sizing is conducted to different flip-flops topologies for timing yield improvement of super-threshold flip-flops and power yield improvement of sub-threshold flip-flops. Following that, a comparative analysis between these flip-flops topologies considering the required overhead for yield improvement is performed. This comparative analysis provides useful recommendations that help flip-flops designers on selecting the best flip-flops topology that satisfies their system specifications while taking the process variations impact and robustness requirements into account. Adaptive Body Bias (ABB) allows the tuning of the transistor threshold voltage, Vt, by controlling the transistor body voltage. A forward body bias reduces Vt, increasing the device speed at the expense of increased leakage power. Alternatively, a reverse body bias increases Vt, reducing the leakage power but slowing the device. Therefore, the impact of process variations is mitigated by speeding up slow and less leaky devices or slowing down devices that are fast and highly leaky. Practically, the implementation of the ABB is desirable to bias each device in a design independently, to mitigate within-die variations. However, supplying so many separate voltages inside a die results in a large area overhead. On the other hand, using the same body bias for all devices on the same die limits its capability to compensate for within-die variations. Thus, the granularity level of the ABB scheme is a trade-off between the within-die variations compensation capability and the associated area overhead. This work introduces new ABB circuits that exhibit lower area overhead by a factor of 143X than that of previous ABB circuits. In addition, these ABB circuits are resolution free since no digital-to-analog converters or analog-to-digital converters are required on their implementations. These ABB circuits are adopted to high performance critical paths, emulating a real microprocessor architecture, for process variations compensation and also adopted to SRAM arrays, for Negative Bias Temperature Instability (NBTI) aging and process variations compensation. The effectiveness of the new ABB circuits is verified by post layout simulation results and test chip measurements using triple-well 65nm CMOS technology. The highly capacitive nodes of wide fan-in dynamic circuits and SRAM bitlines limit the performance of these circuits. In addition, process variations mitigation by statistical gate sizing increases this capacitance further and fails in achieving the target yield improvement. We propose new negative capacitance circuits that reduce the overall parasitic capacitance of these highly capacitive nodes. These negative capacitance circuits are adopted to wide fan-in dynamic circuits for timing yield improvement up to 99.87% and to SRAM arrays for read access yield improvement up to 100%. The area and power overheads of these new negative capacitance circuits are amortized over the large die area of the microprocessor and the SRAM array. The effectiveness of the new negative capacitance circuits is verified by post layout simulation results and test chip measurements using 65nm CMOS technology. robust design process variation soft errors NBTI SRAM Flip-flops high performance low power negative capacitance adaptive body bias nanometer regime CMOS technology Electrical and Computer Engineering

Search results