11

High performance and low cost hardware architectures for digital video motion estimation

Porto, Marcelo January 2008
The evolution of information and communication technologies has pushed the development of several communication media. These media, video in particular, need a large bandwidth to be transmitted, or a large digital storage capacity. Many multimedia signals show, however, a high information redundancy. By using compression techniques it is possible to reduce the amount of coded information by one or two orders of magnitude while keeping a satisfactory visual quality. One of these compression techniques searches for similarity between neighboring frames of a scene, identifying the temporal redundancy between them. This technique is called motion estimation, and it is a very efficient method for compression. However, the computational complexity of motion estimation requires high performance algorithms in hardware when used for real time compression of high resolution videos. This dissertation presents a comprehensive investigation of motion estimation algorithms, targeting a hardware implementation. All the investigated algorithms were first developed in C and submitted to many evaluation tests. The algorithms were applied to ten video samples used by the scientific community, for evaluation in real applications. The evaluation showed that fast algorithms can carry out the motion estimation process efficiently, producing good results in vector quality, computational effort and performance. Based on the analysis of these results, the Diamond Search algorithm was chosen for the hardware design, with two different levels of pixel subsampling, 2:1 and 4:1. The architectures for the Diamond Search algorithm, with pixel subsampling of 2:1 and 4:1, were described in VHDL and synthesized to Xilinx Virtex-4 FPGAs and also to standard cells in TSMC 0.18 μm technology. The developed architectures exceed the performance needed to process HDTV 1080p videos in real time at 30 frames per second, and demand few hardware resources after synthesis to FPGA and ASIC. Keywords: Video compression, motion estimation, VLSI design.
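The Diamond Search pattern named in the abstract is compact enough to sketch. The following C fragment is a minimal illustration, assuming a 16x16 block, a plain SAD cost, and row-major 8-bit frames; the 2:1 and 4:1 pixel subsampling used in the thesis would amount to stepping the SAD loops by 2 or 4. All function and parameter names are illustrative, not taken from the dissertation.

    #include <limits.h>

    #define BLK 16  /* block size; the thesis's subsampling would step these loops */

    /* Sum of absolute differences between the block at (bx,by) in the current
     * frame and the candidate displaced by (mx,my) in the reference frame.
     * Candidates falling outside the frame are rejected with UINT_MAX. */
    static unsigned sad(const unsigned char *cur, const unsigned char *ref,
                        int w, int h, int bx, int by, int mx, int my)
    {
        unsigned s = 0;
        for (int y = 0; y < BLK; y++)
            for (int x = 0; x < BLK; x++) {
                int rx = bx + x + mx, ry = by + y + my;
                if (rx < 0 || ry < 0 || rx >= w || ry >= h) return UINT_MAX;
                int d = cur[(by + y) * w + bx + x] - ref[ry * w + rx];
                s += d < 0 ? -d : d;
            }
        return s;
    }

    /* Diamond Search: walk the large diamond pattern (LDSP) until the center
     * wins, then refine once with the small diamond pattern (SDSP). */
    void diamond_search(const unsigned char *cur, const unsigned char *ref,
                        int w, int h, int bx, int by, int *mvx, int *mvy)
    {
        static const int ldsp[9][2] = {{0,0},{0,2},{0,-2},{2,0},{-2,0},
                                       {1,1},{1,-1},{-1,1},{-1,-1}};
        static const int sdsp[5][2] = {{0,0},{0,1},{0,-1},{1,0},{-1,0}};
        int cx = 0, cy = 0, bi;

        do {
            unsigned best = UINT_MAX;
            bi = 0;
            for (int i = 0; i < 9; i++) {
                unsigned c = sad(cur, ref, w, h, bx, by,
                                 cx + ldsp[i][0], cy + ldsp[i][1]);
                if (c < best) { best = c; bi = i; }
            }
            cx += ldsp[bi][0];
            cy += ldsp[bi][1];
        } while (bi != 0);             /* stop when the center is the best */

        unsigned best = UINT_MAX;
        bi = 0;
        for (int i = 0; i < 5; i++) {
            unsigned c = sad(cur, ref, w, h, bx, by,
                             cx + sdsp[i][0], cy + sdsp[i][1]);
            if (c < best) { best = c; bi = i; }
        }
        *mvx = cx + sdsp[bi][0];
        *mvy = cy + sdsp[bi][1];
    }

Because the center of the large diamond is evaluated first and only a strictly smaller SAD moves it, the search always terminates, and its candidate count stays far below that of a full search, which is what makes the algorithm attractive for hardware.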
12

TRAPP: a tool for partitioning/placement of cells for the TRANCA methodology

Schermer, Paulo Armando January 1995
This work proposes and evaluates a new algorithm for cell placement in TRANCA[REI 87] layouts. The proposed algorithm performs placement by partitioning, in multiple steps, based on the concept of net balancing, in order to perform a global pre-routing. Most placement-by-partitioning algorithms are based on the Kernighan-Lin[KER 70] and Fiduccia-Mattheyses[FID 82] heuristics with group migration. These algorithms use a mincut objective to decrease the number of nets crossing between the two blocks, which tends to produce saturated regions. Net balancing, in contrast, seeks a balance in connection lengths to avoid saturated regions, decreasing computation time and easing the routing stage. An overview of automatic synthesis is given: the main design styles are described, and the partitioning and placement problems are defined and analyzed. The main features of the TRANCA methodology are presented, and the TRANCA synthesis tools are summarized, emphasizing the partitioning and placement steps of each so that their strengths can be reused. To ground the concepts used in developing the algorithm, the most relevant placement methods are presented, with particular attention to those based on partitioning, and some existing heuristics are detailed. The algorithm itself distributes connections using a congestion map of the circuit, which constitutes a global pre-routing; the congestion map is built over the partitions generated in the circuit, and net paths are described over a model defined to control net crossings. The environment created for the algorithm is then presented. To validate the concepts, a prototype called TRAPP (TRAnsparent Placement by Partitioning) was implemented, together with a placement viewer called CIPPATO. Finally, some results of the prototype and an evaluation of its behavior are presented, along with alternative implementations and directions for future work.
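As a concrete reference point, the mincut objective that the abstract contrasts with net balancing can be stated in a few lines of C. This is a generic sketch, not code from TRAPP; the Net structure and all names are assumptions.

    #include <stddef.h>

    /* A net connects two or more cells; a bipartition assigns each cell to
     * side 0 or 1. The mincut objective of Kernighan-Lin / Fiduccia-Mattheyses
     * style heuristics is the number of nets with pins on both sides. */
    typedef struct {
        const int *cells;   /* ids of the cells this net connects */
        size_t     npins;   /* pin count, assumed >= 2 */
    } Net;

    size_t cut_size(const Net *nets, size_t nnets, const int *side)
    {
        size_t cut = 0;
        for (size_t i = 0; i < nnets; i++) {
            int s0 = side[nets[i].cells[0]];
            for (size_t j = 1; j < nets[i].npins; j++)
                if (side[nets[i].cells[j]] != s0) { cut++; break; }
        }
        return cut;
    }

Minimizing this count alone says nothing about where the surviving connections land, which is exactly the saturation problem the net-balancing approach above is meant to avoid.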
14

Gigahertz-Range Multiplier Architectures Using MOS Current Mode Logic (MCML)

Srinivasan, Venkataramanujam 18 December 2003
The tremendous advancement in VLSI technologies in the past decade has fueled the need for intricate tradeoffs among speed, power dissipation and area. With gigahertz-range microprocessors becoming commonplace, it is a typical design requirement to push speed to its extreme while minimizing power dissipation and die area. Multipliers are critical components of many computationally intensive circuits such as real-time signal processing and arithmetic systems. The increasing demand for speed in floating-point co-processors, graphics processing units, CDMA systems and DSP chips has shaped the need for high-speed multipliers. The focus of our research for modern digital systems is twofold. The first goal is to analyze a relatively unexplored logic style called MOS Current Mode Logic (MCML), a promising technique for the design of high-performance arithmetic circuits with minimal power dissipation. The second is to design high-speed arithmetic circuits, in particular gigahertz-range multipliers, that exploit the many attractive features of the MCML logic style. A small library of MCML gates that form the core components of the multipliers was designed and optimized for high-speed operation. The three 8-bit MCML multiplier architectures designed and simulated in TSMC 0.18 μm CMOS technology are: a 3-2-tree architecture with ripple carry adder (Architecture I), a 4-2-tree design with ripple carry adder (Architecture II) and a 4-2-tree architecture with carry look-ahead adders (Architecture III). Architecture I operates with a maximum throughput of 4.76 GHz (4.76 billion multiplications per second) and a latency of 3.78 ns. Architecture II has a maximum throughput of 3.3 GHz and a latency of 3 ns, and Architecture III has a maximum throughput of 2 GHz and a latency of 3 ns. Architecture I achieves the highest throughput among the three multipliers, but it incurs the largest area and latency, in terms of clock cycle count as well as absolute delay. Although it is difficult to compare the speed of our multipliers with existing ones, due to the use of different technologies and different optimization goals, we believe our multipliers are among the fastest found in contemporary literature.
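The 3-2 trees mentioned above are built from carry-save adders (3:2 compressors). The C fragment below is a word-level sketch of one reduction level, assuming 32-bit operand words; it models the arithmetic only, not the MCML gate library itself.

    #include <stdint.h>

    /* One carry-save (3:2 compressor) level of a multiplier reduction tree:
     * three partial-product words in, a sum word and a carry word out. The
     * carry word is shifted left one bit, as a full-adder column propagates. */
    void csa_3_2(uint32_t a, uint32_t b, uint32_t c,
                 uint32_t *sum, uint32_t *carry)
    {
        *sum   = a ^ b ^ c;                            /* bitwise full-adder sum */
        *carry = ((a & b) | (a & c) | (b & c)) << 1;   /* majority, shifted */
    }

Applying such levels repeatedly reduces the partial products to two words, which the final adder sums into the product; Architectures I-III above differ precisely in the compressor type (3-2 versus 4-2) and in the final adder (ripple carry versus carry look-ahead).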
15

Low Density Parity Check Encoder and Decoder on SiLago Coarse Grain Reconfigurable Architecture

Kong, Weijiang January 2019
Low-density parity-check (LDPC) code is an error correction code that has been widely adopted as an optional error correcting operation in most of today's communication protocols. Current ASIC- or FPGA-based LDPC accelerator designs can reach Gbit/s data rates. However, the hardware cost of ASIC-based methods and the related interfaces is too high for them to be integrated into coarse grain reconfigurable architectures (CGRAs). Moreover, platforms aiming at high level synthesis or system level synthesis do not provide flexibility under low-performance, low-cost design scenarios. In this degree project, we establish connectivity between the SiLago CGRA and a typical QC-LDPC code defined in the IEEE 802.11n standard. We design lightweight LDPC encoder and decoder blocks using the FSM+Datapath design pattern. The encoder provides sufficient throughput while consuming very little area and power. The decoder provides sufficient performance for low-speed modulations while consuming significantly fewer hardware resources. Both encoder and decoder can cooperate with the SiLago-based DRRA through DiMArch, a standard Network on Chip (NoC) based shared memory, so no extra interface hardware is necessary. We verified our design through RTL simulation and synthesis: the encoder went through logic and physical synthesis, while the decoder went through logic synthesis only. The results show that our design is closely coupled with the SiLago CGRA and provides a low-performance, low-cost solution.
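For orientation, the defining property of any LDPC code, including the IEEE 802.11n QC-LDPC code used here, is that a valid codeword c satisfies H * c^T = 0 over GF(2). The C sketch below assumes a sparse row representation of the parity check matrix H; it illustrates the check itself, not the SiLago encoder or decoder blocks.

    #include <stdint.h>
    #include <stddef.h>

    /* Syndrome check for a binary LDPC code. rows[i] lists the bit positions
     * covered by parity check i; a codeword is valid iff every check XORs
     * to zero. Returns 1 for a valid codeword, 0 otherwise. */
    int ldpc_check(const int *const *rows, const size_t *row_len,
                   size_t m, const uint8_t *codeword)
    {
        for (size_t i = 0; i < m; i++) {
            uint8_t parity = 0;
            for (size_t j = 0; j < row_len[i]; j++)
                parity ^= codeword[rows[i][j]] & 1;
            if (parity) return 0;   /* check i unsatisfied */
        }
        return 1;
    }

The quasi-cyclic structure of the 802.11n matrices is what makes lightweight hardware such as the blocks described above feasible: H is composed of shifted identity submatrices, so the check positions need not be stored explicitly.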
16

Improved Approximation Algorithms for Geometric Packing Problems With Experimental Evaluation

Song, Yongqiang 12 1900
Geometric packing problems are NP-complete problems that arise in VLSI design. In this thesis, we present two novel algorithms that use dynamic programming to compute exactly the maximum number of k x k squares that can be packed without overlap into a given n x m grid. The first algorithm was implemented and ran successfully on large inputs of up to 1,000,000 nodes for different parameter values. A heuristic based on the second algorithm was also implemented; it is fast in practice, though its running time is not guaranteed to be optimal in theory. Over a wide range of random data, however, this version gives very good solutions very fast and runs on problems of up to 100,000,000 nodes in a grid, for different ranges of the variables. This version is also shown to be clearly superior to the first algorithm and very efficient in practice.
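To make the problem concrete, the sketch below packs k x k squares greedily into a grid in which some cells may be blocked. This baseline heuristic is invented here for illustration only; it is not the thesis's exact dynamic program, and every name in it is an assumption.

    #include <stdlib.h>

    /* Greedy baseline: scan the n x m grid row-major (usable[i*m+j] != 0
     * means cell (i,j) may be covered) and place a k x k square whenever
     * one fits entirely on usable, unoccupied cells. Returns the number
     * of squares placed, a lower bound on the optimum. */
    int greedy_pack(int n, int m, int k, const unsigned char *usable)
    {
        unsigned char *used = calloc((size_t)n * m, 1);
        if (!used) return -1;
        int placed = 0;
        for (int i = 0; i + k <= n; i++)
            for (int j = 0; j + k <= m; j++) {
                int ok = 1;
                for (int a = 0; ok && a < k; a++)
                    for (int b = 0; ok && b < k; b++)
                        if (!usable[(i + a) * m + j + b] ||
                             used[(i + a) * m + j + b]) ok = 0;
                if (!ok) continue;
                for (int a = 0; a < k; a++)
                    for (int b = 0; b < k; b++)
                        used[(i + a) * m + j + b] = 1;
                placed++;
            }
        free(used);
        return placed;
    }

An exact method such as the thesis's dynamic programs must in general beat this greedy count, since a poorly chosen early placement can block several later ones.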
17

Test Generation Guided Design for Testability

Wu, Peng 01 July 1988
This thesis presents a new approach to building a design for testability (DFT) system. The system takes a digital circuit description, finds the problems in testing it, and suggests circuit modifications to correct those problems. The key contributions of the thesis research are (1) setting design for testability in the context of test generation (TG), (2) using failures during TG to focus on testability problems, and (3) relating circuit modifications directly to the failures. A natural-functionality set is used to represent the maximum functionality that a component can have. The current implementation has only primitive domain knowledge and needs further work. However, armed with the knowledge of TG, it has already demonstrated its ability and produced some interesting results on a simple microprocessor.
18

Updating the Vertex Separation of a Dynamically Changing Tree

Olsar, Peter January 2004
This thesis presents several algorithms that update the vertex separation of a tree after the tree is modified; the vertex separation of a graph measures the largest number of vertices to the left of and including a vertex that are adjacent to vertices to the right of that vertex, when the vertices of the graph are arranged in the best possible linear ordering. Vertex separation was introduced by Lipton and Tarjan and has since been applied mainly in VLSI design. The tree is modified by either attaching another tree or removing a subtree. The first algorithm handles the special case when another tree is attached to the root, and the second algorithm updates the vertex separation after a subtree of the root is removed. The last two algorithms solve the more general problem in which subtrees are attached to or removed from arbitrary vertices; they have good running time performance only in the amortized sense. The running time of all our algorithms is sublinear in the number of vertices in the tree, assuming certain information is precomputed for the tree. This improves upon the current algorithms by Skodinis and by Ellis, Sudborough, and Turner, both of which have linear running time for this problem. Lower and upper bounds on the vertex separation of a general graph are also derived. Furthermore, analogous bounds are presented for the cutwidth of a general graph, where the cutwidth of a graph equals the maximum number of edges that cross over a vertex when the vertices of the graph are arranged in the best possible linear ordering.
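Once an ordering is fixed, the definition above translates directly into code. The C sketch below computes the vertex separation of a given linear ordering; finding the ordering that minimizes this value is the hard part the thesis addresses. The adjacency-matrix representation and all names are assumptions.

    #include <stdlib.h>

    /* Vertex separation of a given linear ordering: for each cut between
     * positions p and p+1, count the vertices at positions <= p that have a
     * neighbor at a position > p, and take the maximum over all cuts.
     * adj is an n x n 0/1 adjacency matrix; order[p] is the vertex placed
     * at position p. Returns -1 on allocation failure. */
    int vertex_separation(int n, const unsigned char *adj, const int *order)
    {
        int *pos = malloc((size_t)n * sizeof *pos);  /* inverse permutation */
        if (!pos) return -1;
        for (int p = 0; p < n; p++) pos[order[p]] = p;

        int vs = 0;
        for (int p = 0; p < n - 1; p++) {
            int active = 0;
            for (int u = 0; u < n; u++) {
                if (pos[u] > p) continue;            /* u must be left of the cut */
                for (int v = 0; v < n; v++)
                    if (adj[u * n + v] && pos[v] > p) { active++; break; }
            }
            if (active > vs) vs = active;
        }
        free(pos);
        return vs;
    }

This brute-force evaluation costs O(n^3) per ordering; the point of the thesis's update algorithms is to avoid recomputing anything like it from scratch when the tree changes.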
20

Testability considerations for implementing an embedded memory subsystem

Seok, Geewhun 01 February 2012
There are a number of testability considerations for VLSI design, but test coverage, test time, accuracy of test patterns and correctness of design information for DFD (design for debug) are the most important ones in designs with embedded memories. The goal of DFT (design for test) is to achieve zero defects. When it comes to the memory subsystem in SOCs (systems on chips), many flavors of memory BIST (built-in self test) can reach high test coverage inside a memory, but often no proper attention is given to the memory interface logic (shadow logic). Functional testing and BIST are the most prevalent tests for this logic, but functional testing is impractical for complicated SOC designs. As a result, industry has widely adopted at-speed scan testing to detect delay-induced defects. Compared with functional testing, scan-based testing for delay faults reduces overall pattern generation complexity and cost by enhancing both the controllability and the observability of flip-flops. However, without proper modeling of the memories, Xs (unknown values) are generated from them, and when the design has on-chip compression logic, the number of ATPG patterns increases significantly because of these Xs. In this dissertation, a register-based testing method and X-prevention logic are presented to tackle these problems. An important design stage for scan-based testing with memory subsystems is the step that creates a gate-level model and verifies the design against it; the flow needs to provide a robust ATPG netlist model. Most industry-standard CAD tools used to analyze fault coverage and generate test vectors require gate-level models, but custom embedded memories are typically designed in a transistor-level flow, so an abstraction step is needed to generate gate models that are equivalent to the actual (transistor-level) design. A contribution of this research is a framework to verify that the gate-level representation of custom designs is equivalent to the transistor-level design. Compared to basic stuck-at fault testing, at-speed testing requires many more patterns, so reducing test time and data volume is important; a new scan reordering method is introduced to reduce test data with an optimal routing solution. Finally, building on the understanding of embedded memories and flows developed during the study of custom memory DFT, a custom embedded memory bit-mapping method using a symbolic simulator is presented in the last chapter to achieve high yield for memories.
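The role of the X-prevention logic can be seen in the three-valued (0/1/X) model that test simulators use. The C toy below is an illustration of the concept, not the dissertation's design: an unknown memory output propagates through downstream logic and can corrupt observation points unless a test-mode mux bounds it to a known value.

    #include <stdio.h>

    /* Three-valued logic: 0, 1, or X (unknown). */
    typedef enum { V0, V1, VX } val;

    static val and3(val a, val b) {
        if (a == V0 || b == V0) return V0;   /* 0 dominates AND */
        if (a == V1 && b == V1) return V1;
        return VX;                           /* any other mix stays unknown */
    }

    /* X-prevention (bounding) mux: force the memory output to a known
     * value in scan-test mode so no X reaches the scan chain. */
    static val bound(val mem_q, int test_mode) {
        return test_mode ? V0 : mem_q;
    }

    int main(void) {
        val q = VX;  /* unmodeled memory output during scan test */
        printf("unbounded: %d\n", and3(q, V1));            /* prints 2 (X) */
        printf("bounded:   %d\n", and3(bound(q, 1), V1));  /* prints 0     */
        return 0;
    }

Keeping Xs out of the capture path is what lets ATPG keep its pattern count down when on-chip compression is present, as discussed above.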
