Lindholm, Jeffery L.
(has links) (PDF)
Thesis (M.S. in Computer Science)--Naval Postgraduate School, Dec. 2004. / Thesis Advisor(s): Wen, Su ; Gibson, John. "December 2004." Includes bibliographical references (p. 63-64). Also available in print.
Low, Douglas Wai Kok.
Thesis (Ph. D.)--University of Washington, 2005. / Vita. Includes bibliographical references (p. 132-136).
Thesis (M.S.)--Mississippi State University. Department of Electrical and Computer Engineering. / Title from title screen. Includes bibliographical references.
<p>Multithreading is a processor technique that can effectively hide long latencies that can occur due to memory accesses, coprocessor operations and similar. While this looks promising, there is an additional hardware cost that will vary with for example the number of contexts to switch to and what technique is used for it and this might limit the possible gain of multithreading.</p><p>Network processors are, traditionally, multiprocessor systems that share a lot of common resources, such as memories and coprocessors, so the potential gain of multithreading could be high for these applications. On the other hand, the increased hardware required will be relatively high since the rest of the processor is fairly small. Instead of having a multithreaded processor, higher performance gains could be achieved by using more processors instead.</p><p>As a solution, a simulator was built where a system can effectively be modelled and where the simulation results can give hints of the optimal solution for a system in the early design phase of a network processor system. A theoretical background to multithreading, network processors and more is also provided in the thesis.</p>
Rafiq, A. N. M. Ehtesham
26 January 2010
A network processor unit (NPU) is a programmable device that consists of several hardware accelerators for wire-speed networking operations. One of the most important functional units in an NPU is packet classification unit (PCU) that classifies data packets based on single or multiple fields of packet header or contents in payload data. Large number of tasks in computer communication require packet classification. Network packet classification requires two types of matching techniques: (i) exact. and (ii) inexact match. There are two solutions for exact match: (i) sequential and (ii) parallel solutions. Inexact match can be of two types: (i) Longest prefix match and (ii) Best match. This dissertation talks about these four techniques required for the PCU. For the sequential solution. we propose a string search algorithm that requires reduced time complexity. It also requires a small amount of memory. arid shows better performance than any other related algorithms as proved by numerical analysis and extensive computer simulations. For parallel solution. we present a systematic technique for expressing the string search algorithm as a regular iterative expression to explore all possible processor arrays. The technique allows some of the algorithm variables to be pipelined while others are broadcast over system-wide buses. Nine possible processor array structures are obtained and analyzed in terms of speed, area. power. and I/O timing requirements. The proposed designs exhibit optimum speed and area complexities. The parallel solution requires an embedding technique that embeds a source processor array onto a target processor array having smaller number of processing elements (PE) to meet the hardware resource constraint. We propose a. novel embed-ding technique. Through numerical analysis and extensive computer simulation. it is proved that the performance of the target array shows the same performance as the source array. For Longest prefix match (LPM), we propose a novel variable-stride multi-bit trie data structure for IP-lookup table to assist. fast IP-lookup and fast lookup table update. In this dissertation. we first explicitly elaborate the solution of a problem in expanding IP (internet protocol) addresses. Through extensive computer simulation on several routing tables. it is proved that our proposed algorithm shows better performance (lookup and update time) than existing algorithms. However. our proposed technique requires larger memory than others. But the memory requirement is quite acceptable considering the current memory availability and price. We propose a novel Best Match technique required to detect best-matched English words of obfuscated spam words. We have used a non-deterministic finite automaton (NFA) to build the English dictionary. We have used dynamic programming with state pruning to detect the best-matched word of an obfuscated spam word in the NFA. We have done extensive numerical simulations to prove the accuracy of our proposed system. Our system can detect best-matched words of the words obfuscated by spammers using five different techniques: insertion, deletion. substitution. trans-pose. and word boundary. Upto our knowledge. no other system can deal with all these obfuscating techniques so quickly as ours.
Uma metodologia analítico-determinística para a avaliação de desempenho no tempo de processadores de rede implementados como sistemas-sobre-silício. / An analytical deterministic methodology for the performance evaluation of network processors deployed as systems-on-chip.Frederico de Faria 26 June 2007 (has links)
O grande aumento da capacidade de integração de transistores em um único circuito integrado tem exigido grande e constante evolução na metodologia de projeto e práticas de implementação de sistemas eletrônicos embarcados. Tal capacidade de integração resultou no surgimento de sistemas sobre silício (SoCs). O projeto de tais sistemas, mais complexos que seus predecessores, alteram significativamente os fluxos tradicionais de concepção de sistemas, fazendo surgir estratégias tais quais reuso, projetos orientados a plataformas, assim como modelagens e simulações em diferentes níveis de abstração. Um dos diferentes níveis de abstração estudados é o analítico, onde os sistemas são modelados através de representações abstratas. A adoção de modelos analíticos apresenta vantagens, como alta velocidade de execução (permitindo um grande número de análises de modelos diferentes) e facilidade de alteração. No entanto, por se tratarem de modelagens distantes, em termos de abstração, de implementações reais, podem oferecer prognósticos não exatos. Faz-se então necessária a investigação de metodologias que tenham como propósito o aperfeiçoamento de tais modelos em termos de acurácia e fidelidade. O presente trabalho apresenta uma metodologia de modelagem analítica para avaliação de desempenho de sistemas-sobre-silício orientada a aplicação de processadores de redes de pacotes. A metodologia de Network Calculus, a ser implementada nos estágios iniciais de projeto de sistemas-sobre-silício baseados em plataforma, contribui para reduzir o espaço de avaliação de projeto. Trata do equacionamento analítico de representações abstratas das cargas de entrada e também da capacidade de processamento de recursos, visando obter prognósticos mais pessimistas e mais otimistas de parâmetros como latência, requisição de buffer e utilização do sistema, descrito de modo abstrato através de grafos. / The great increase in terms of integration capacity of transistors on integrated circuits has demanded great and constant evolution in the design methodology and practical implementation of embedded electronic systems. Such capacity of integration resulted in the sprouting of systems-on-chips (SoCs). The design of such systems, more complex than their predecessors, significantly changes the traditional flow in the conception of systems, bringing up strategies such like reuse, platform based design, as well as modeling and simulation in different abstraction levels. One of the different abstraction levels under study is the analytical one, where the systems are shaped through abstract representations. The adoption of analytical models presents advantages, such as high speed of execution (allowing a great number of analyses of different models) and easiness for alteration. However, due to their distant representation models, in terms of abstraction, from real implementations, they cannot offer accurate prognostics on several design metrics. Therefore, it is necessary the investigation on methodologies aiming to the enhancement of such models in terms of accuracy and fidelity. The present work shows a methodology of analytical modeling for evaluation of system-on-chip performance guided to the application of network processors of packages. The methodology of Network Calculus, to be implemented in the initial steps of of system-on-chip´s design cycle, contributes to reduce the design space exploration. It deals with the building of analytical equations for abstract representations of workloads and also the processing capacity of resources, aiming at to get most pessimistic and most optimistic prognostics of parameters such like latency, buffer requirements and the system utilization, described in abstract way through graphs.
Improving The Communication Performance Of I/O Intensive And Communication Intensive Application In Cluster Computer SystemsKumar, V Santhosh 10 1900 (has links)
Cluster computer systems assembled from commodity off-the-shelf components have emerged as a viable and cost-effective alternative to high-end custom parallel computer systems.In this thesis, we investigate how scalable performance can be achieved for database systems on clusters. In this context we specﬁcally considered database query processing for evaluation of botlenecks and suggest optimization techniques for obtaining scalable application performance. First we systematically demonstrated that in a large cluster with high disk bandwidth, the processing capability and the I/O bus bandwidth are the two major performance bottlenecks in database systems. To identify and assess bottlenecks, we developed a Petri net model of parallel query execution on a cluster. Once identiﬁed and assessed,we address the above two performance bottlenecks by offoading certain application related tasks to the processor in the network interface card. Offoading application tasks to the processor in the network interface cards shifts the bottleneck from cluster processor to I/O bus. Further, we propose a hardware scheme,network attached disk ,and a software scheme to achieve a balanced utilization of re-sources like host processor, I/O bus, and processor in the network interface card. The proposed schemes result in a speedup of upto 1.47 compared to the base scheme, and ensures scalable performance upto 64 processors. Encouraged by the beneﬁts of ofﬂoading application tasks to network processors, we explore the possibilities of performing the bloom ﬁlter operations in network processors. We combine ofﬂoading bloom ﬁlter operations with the proposed hardware schemes to achieve upto 50% reduction in execution time. The later part of the thesis provides introductory experiments conducted in Community At-mospheric Model(CAM), a large scale parallel application used for global weather and climate prediction. CAM is a communication intensive application that involves collective communication of large messages. In our limited experiment, we identiﬁed CAM to see the effect of compression techniques and ofﬂoading techniques (as formulated for database) on the performance of communication intensive applications. Due to time constraint, we considered only the possibility of compression technique for improving the application performance. However, ofﬂoading technique could be taken as a full-ﬂedged research problem for further investigation In our experiment, we found compression of messages reduces the message latencies, and hence improves the execution time and scalability of the application. Without using compression techniques, performance measured on 64 processor cluster resulted in a speed up of only 15.6. While lossless compression retains the accuracy and correctness of the program, it does not result in high compression. We therefore propose lossy compression technique which can achieve a higher compression, yet retain the accuracy and numerical stability of the application while achieving a scalable performance. This leads to speedup of 31.7 on 64 processors compared to a speedup of 15.6 without message compression. We establish that the accuracy within prescribed limit of variation and numerical stability of CAM is retained under lossy compression.
07 July 2006
Network processors are new types of multithreaded multicore processors geared towards achieving both fast processing speed and flexibility of programming. The architecture of network processors considers many special properties for packet processing, including multiple threads, multiple processor cores on the same chip, special functional units, simplified ISA and simplified pipeline, etc. The architectural peculiarities of network processors raise new challenges for compiler design and optimization. Due to very high clocking speeds, the CPU memory gap on such processors is huge, making registers extremely precious. Moreover, the register file is split into two banks, and for any ALU instruction, the two source operands must come from different banks. We present and compare three different approaches to do register allocation and bank assignment. We also address the problem of sharing registers across threads in order to maximize the utilization of hardware resources. The context switches on the IXP network processor only happen when long latency operations are encountered. As a result, context switches are highly frequent. Therefore, the designer of the IXP network processor decided to make context switches extremely lightweight, i.e. only the program counter(PC) is stored together with the context. Since registers are not saved and restored during context switches, it becomes difficult to share registers across threads. For a conventional processor, each thread can assume that it can use the entire register file, because registers are always part of the context. However, with lightweight context switch, each thread must take a separate piece of the register file, making register usage inefficient. Programs executing on network processors typically have runtime constraints. Scheduling of multiple programs sharing a CPU must be orchestrated by the OS and the hardware using certain sharing policies. Real time applications demand a real time aware OS kernel to meet their specified deadlines. However, due to stringent performance requirements on network processors, neither OS nor hardware mechanisms is typically feasible. In this work, we demonstrate that a compiler approach could achieve some of the OS scheduling and real time scheduling functionalities without introducing a hefty overhead.
[en] GENERATION OF BUILT-IN OPTICAL INTELIGENCE ON ETHERNET / IP NETWORKS / [pt] GERAÇÃO DE INTELIGÊNCIA ÓPTICA EM REDES ETHERNET / IPHENRIQUE JOSE PINTO PORTELA DA SILVA 06 July 2005 (has links)
[pt] O principal objetivo desta dissertação consiste na geração de novas funcionalidades inteligentes em redes ópticas associadas aos protocolos IP e Gigabit Ethernet, através da utilização de circuitos integrados programáveis operando na taxa do Gigabit. A padronização Ethernet é apresentada através das camadas PHY e MAC, destacando suas funções, interfaces e os tipos de chips disponíveis no mercado. A camada PHY do padrão Ethernet para meios ópticos é detalhada. Algumas tecnologias de chips são discutidas, entre elas o crescimento dedicado, os ASICs, as NPUs e as tecnologias programáveis: FPGAs e CPLDs. O conceito de inteligência óptica e o perfil de camadas equivalentes associados a este conceito são introduzidos. Um novo elemento de rede dedicado à inserção de sinalização na camada óptica é apresentado, destacando-se sua estrutura, sua realização, seu detalhamento para utilização em redes. Diversas montagens experimentais com o elemento desenvolvido são utilizadas para demonstrar as características do sistema, entre elas a eficiência da utilização da tecnologia de FPGAs e a transparência da inteligência na camada óptica para o padrão Ethernet. / [en] The main objective of this work is the generation of new functionalities in optical networks, associated to the Ethernet and IP protocols, by the use of programmable integrated circuits operating in Gigabit rates. The Ethernet standard is presented through its PHY and MAC layers, highlighting its functions, interfaces and the types of commercially available ICs. The Ethernet standard PHY layer for optical media is described. Some IC technologies are discussed, such as dedicated growth, ASICs, NPUs and the programmable technologies: FPGAs e CPLDs. The concept of built-in optical intelligence and a new layers model associated to it are presented. A new network element, dedicated to the insertion of signaling in the optical layer is also presented, and special attention is dedicated to its structure, to its implementation and to the aspects of its use in networks. Several experimental setups using the developed element are shown, demonstrating the characteristics of the system, particularly the efficiency obtained by the use of FPGA technology and the transparency of the optical intelligence with respect to the Ethernet standard.
Page generated in 0.0882 seconds