Multithreading is a processor technique that can effectively hide the long latencies caused by memory accesses, coprocessor operations and similar events. While this looks promising, it carries an additional hardware cost that varies with, for example, the number of contexts to switch between and the switching technique used, and this cost may limit the possible gain of multithreading. Network processors are traditionally multiprocessor systems that share many common resources, such as memories and coprocessors, so the potential gain of multithreading could be high for these applications. On the other hand, the relative increase in hardware is high, since the rest of the processor is fairly small. Higher performance gains might therefore be achieved by adding more processors instead of multithreading each one. To help resolve this trade-off, a simulator was built in which a system can be modelled and whose results give hints of the optimal solution in the early design phase of a network processor system. The thesis also provides a theoretical background to multithreading, network processors and related topics.
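The trade-off between hidden latency and switching cost can be illustrated with a back-of-the-envelope utilization model. This is not the simulator from the thesis, only a minimal sketch under stated assumptions: each thread computes for `run` cycles, then stalls for `latency` cycles, and every context switch costs `switch` cycles. The function name, parameters and numbers are all illustrative.

```python
def utilization(threads: int, run: float, latency: float, switch: float = 0.0) -> float:
    """Approximate fraction of cycles spent on useful work (simplified model)."""
    if threads < 1 or run <= 0:
        raise ValueError("need at least one thread and a positive run length")
    # Below saturation: the other threads cannot cover the whole stall,
    # so useful work grows linearly with the number of threads.
    unsaturated = threads * run / (run + switch + latency)
    # At saturation: the stall is fully hidden and only the switch cost is lost.
    saturated = run / (run + switch)
    return min(unsaturated, saturated)
```

Under this model, adding threads helps only until the stall is fully covered; beyond that point the switch cost sets the ceiling, which is one way to see why more contexts do not always pay for their hardware.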
11 January 2010
The low processing speed of packets relative to their transmission speed in networks creates a need for innovations in networking systems that narrow this gap and make better use of high data transmission speeds. This problem is known as the "throughput preservation problem". Incorporating embedded processors into network systems has helped to address it. One architecture proposed for these protocol processors, as they are called, introduces an innovative register structure called Tripod. The idea is to replace the processor inside a network adapter with a processor core and three separate, identical register files. The whole system operates as a pipeline with three stages: loading, processing and unloading. The aim of this diploma thesis is to design a subsystem that manages these register files and allows the system to operate according to its specifications. More specifically, it implements the loading and unloading of data to and from the registers, as well as the connection of the appropriate register file to the processor core.
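The rotation of the three register files through the loading, processing and unloading stages can be sketched as follows. This is a minimal illustration, assuming a steady-state pipeline in which every stage takes one step; the names `ROLES` and `roles_at` are hypothetical and do not come from the thesis.

```python
ROLES = ("loading", "processing", "unloading")

def roles_at(step: int) -> dict[int, str]:
    """Map each of the three register files (0, 1, 2) to its role at a pipeline step."""
    # File f starts loading at step f, so at step s it is in stage (s - f) mod 3.
    return {f: ROLES[(step - f) % 3] for f in range(3)}
```

At every step exactly one register file is connected to the processor core (the one in the processing role), while the other two are being filled and drained.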
Kumarapillai Chandrikakutty, Harikrishnan
01 January 2013
Technological advancements have transformed the way people interact with the world. The Internet now forms a critical infrastructure that links different aspects of our lives, such as personal communication, business transactions, social networking, and advertising. To cater to this ever-increasing communication load, there has been a fundamental shift in the network infrastructure. Modern network routers often employ software-programmable network processors instead of ASIC-based technology for higher throughput and adaptability to changing resource requirements. This programmability makes the networking infrastructure vulnerable to a new class of network attacks that compromise the software on network processors. This issue has created a need for security systems that can monitor the behavior of network processors at run time. This thesis describes an FPGA-based security monitoring system for multi-core network processors. The implemented security monitor improves upon previous hardware monitoring schemes. We demonstrate a state-machine-based, hardware-programmable monitor that can track program execution flow at run time. Applications are analyzed offline and a hash of the instructions is generated to form a state machine sequence. If the state machine deviates from the expected behavior, an error flag is raised, forcing a network processor reset. For testing purposes, the monitoring logic and the multi-core network processor system are implemented in FPGA logic. In this research, we modify the network processor memory architecture to improve security monitor functionality. The efficiency of this approach is validated using a diverse set of network benchmarks. Experiments are performed on the prototype system using known network attacks to test the performance of the monitoring subsystem.
Experimental results demonstrate that our security monitor approach provides an efficient system for detecting and recovering from network attacks with minimum overhead while maintaining line-rate packet forwarding. Additionally, our monitor is capable of defending against attacks on processors with a Harvard architecture, the dominant contemporary network processor organization. We demonstrate that our monitor architecture causes no network slowdown in the absence of an attack and can drop packets without otherwise affecting regular network traffic when an attack occurs.
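The offline-hash and run-time-comparison idea can be sketched in software as follows. This is only an illustration, not the FPGA monitor from the thesis: it assumes a straight-line program without branches, and the hash function (the first 32 bits of SHA-256) and all function names are hypothetical choices made for the example.

```python
import hashlib

def instr_hash(instr: bytes) -> int:
    """Compress one instruction word to a 32-bit value (hash choice is an assumption)."""
    return int.from_bytes(hashlib.sha256(instr).digest()[:4], "big")

def build_expected(program: list[bytes]) -> list[int]:
    """Offline analysis: one expected hash per instruction, in execution order."""
    return [instr_hash(instr) for instr in program]

def error_flag(expected: list[int], executed: list[bytes]) -> bool:
    """Run-time check: True if execution deviates from the expected sequence."""
    if len(executed) > len(expected):
        return True  # the processor ran past the end of the known program
    for state, instr in enumerate(executed):
        if instr_hash(instr) != expected[state]:
            return True  # unexpected instruction in this state
    return False
```

In the system described above, a raised flag forces a network processor reset; here the flag is simply returned to the caller.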
01 January 2005
In this work, we present off-chip communication architectures for line cards that increase the throughput of the memory system currently in use. In recent years there has been a significant increase in memory bandwidth demand on line cards as a result of higher line rates, more deep packet inspection operations and ever-expanding lookup tables. As line rates and NPU processing power increase, memory access time becomes the main system bottleneck during data store and retrieve operations. The growing demand for memory bandwidth is at odds with indirect interconnect methodologies. Moreover, solutions to the memory bandwidth bottleneck are limited by physical constraints such as area and NPU I/O pins. We therefore replace indirect interconnects with direct, packet-based networks such as meshes, tori or k-ary n-cubes. We investigate multiple k-ary n-cube based interconnects and propose two variations of the 2-ary 3-cube interconnect, called the 3D-bus and the 3D-mesh. All of the k-ary n-cube interconnects include multiple, highly efficient techniques to route, switch, and control packet flows in order to minimize congestion spots and packet loss. We explore the tradeoffs between implementation constraints and performance. We also developed an event-driven interconnect simulation framework to evaluate the performance of packet-based off-chip k-ary n-cube interconnect architectures for line cards. The simulator uses state-of-the-art software design techniques to provide the user with a flexible yet robust tool that can emulate multiple interconnect architectures under non-uniform traffic patterns. Moreover, the simulator offers the user full control over network parameters, performance-enhancing features and simulation time frames, which makes the platform as close as possible to the physical and functional properties of a real line card.
Using our network simulator, we identify the processor-memory configuration that performs best among the configurations evaluated. Moreover, we explore how network enhancement techniques such as virtual channels and sub-channeling improve network latency and throughput. Our performance results show that k-ary n-cube topologies, and especially the 3D-mesh, our modified version of the 2-ary 3-cube interconnect, significantly outperform existing line card interconnects and are able to sustain higher traffic loads. The flow control mechanism proved to substantially reduce hot-spots, balance the load in areas of high traffic rate and achieve a low transmission failure rate. Moreover, it can scale to accommodate more memories and/or processors and thereby increase the line card's processing power.
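For readers unfamiliar with the topology family, a k-ary n-cube has k^n nodes, each addressed by n coordinates in the range 0..k-1, with wrap-around links in every dimension. The sketch below enumerates nodes and neighbors for the plain topology only; it does not model the 3D-bus or 3D-mesh variations proposed in the thesis, and the function names are illustrative.

```python
from itertools import product

def nodes(k: int, n: int) -> list[tuple[int, ...]]:
    """All k**n node addresses of a k-ary n-cube."""
    return list(product(range(k), repeat=n))

def neighbors(node: tuple[int, ...], k: int) -> list[tuple[int, ...]]:
    """Nodes one hop away: +1 or -1 (modulo k) in exactly one dimension."""
    result = set()
    for dim, coord in enumerate(node):
        for step in (-1, 1):
            other = list(node)
            other[dim] = (coord + step) % k
            if tuple(other) != node:
                result.add(tuple(other))  # for k = 2 both steps reach the same node
    return sorted(result)
```

For the 2-ary 3-cube this gives 8 nodes with 3 neighbors each, i.e. the corners of a cube.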
Rafiq, A. N. M. Ehtesham
26 January 2010
A network processor unit (NPU) is a programmable device that consists of several hardware accelerators for wire-speed networking operations. One of the most important functional units in an NPU is the packet classification unit (PCU), which classifies data packets based on single or multiple fields of the packet header or on the contents of the payload. A large number of tasks in computer communication require packet classification. Network packet classification requires two types of matching techniques: (i) exact and (ii) inexact match. There are two solutions for exact match: (i) sequential and (ii) parallel. Inexact match can be of two types: (i) longest prefix match and (ii) best match. This dissertation addresses these four techniques required for the PCU. For the sequential solution, we propose a string search algorithm with reduced time complexity. It also requires a small amount of memory and shows better performance than any other related algorithm, as shown by numerical analysis and extensive computer simulations. For the parallel solution, we present a systematic technique for expressing the string search algorithm as a regular iterative expression to explore all possible processor arrays. The technique allows some of the algorithm variables to be pipelined while others are broadcast over system-wide buses. Nine possible processor array structures are obtained and analyzed in terms of speed, area, power, and I/O timing requirements. The proposed designs exhibit optimum speed and area complexities. The parallel solution requires an embedding technique that embeds a source processor array onto a target processor array with a smaller number of processing elements (PEs) to meet the hardware resource constraint. We propose a novel embedding technique. Numerical analysis and extensive computer simulation show that the target array achieves the same performance as the source array.
For longest prefix match (LPM), we propose a novel variable-stride multi-bit trie data structure for the IP lookup table that supports fast IP lookup and fast lookup table updates. In this dissertation, we first explicitly elaborate the solution of a problem in expanding IP (Internet Protocol) addresses. Extensive computer simulation on several routing tables shows that our proposed algorithm performs better (in lookup and update time) than existing algorithms. However, our proposed technique requires more memory than the others, although the memory requirement is quite acceptable considering current memory availability and price. We also propose a novel best match technique for detecting the best-matched English words for obfuscated spam words. We use a non-deterministic finite automaton (NFA) to build the English dictionary, and dynamic programming with state pruning to detect the best-matched word for an obfuscated spam word in the NFA. We have carried out extensive numerical simulations to demonstrate the accuracy of our proposed system. Our system can detect the best-matched words for words obfuscated by spammers using five different techniques: insertion, deletion, substitution, transposition, and word boundary. To our knowledge, no other system can deal with all these obfuscation techniques as quickly as ours.
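Longest prefix match itself can be illustrated with a plain one-bit-per-level trie. Note that this is deliberately not the variable-stride multi-bit trie proposed in the dissertation, whose details are not given in the abstract; it is only a minimal IPv4 sketch of the lookup rule, and the class and method names are hypothetical.

```python
import ipaddress

class PrefixTrie:
    """Unibit trie: each level consumes one address bit (IPv4 only)."""

    def __init__(self):
        self.root = {}

    def insert(self, prefix: str, next_hop: str) -> None:
        net = ipaddress.ip_network(prefix)
        bits = format(int(net.network_address), "032b")[: net.prefixlen]
        node = self.root
        for b in bits:
            node = node.setdefault(b, {})
        node["hop"] = next_hop  # mark the end of a stored prefix

    def lookup(self, address: str):
        bits = format(int(ipaddress.ip_address(address)), "032b")
        node, best = self.root, self.root.get("hop")
        for b in bits:
            node = node.get(b)
            if node is None:
                break
            best = node.get("hop", best)  # remember the longest match so far
        return best
```

A multi-bit trie consumes several bits per level to reduce the number of memory accesses per lookup, at the cost of the extra memory that the abstract mentions.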
Improving The Communication Performance Of I/O Intensive And Communication Intensive Application In Cluster Computer Systems
Kumar, V Santhosh
10 1900
Cluster computer systems assembled from commodity off-the-shelf components have emerged as a viable and cost-effective alternative to high-end custom parallel computer systems. In this thesis, we investigate how scalable performance can be achieved for database systems on clusters. In this context we specifically consider database query processing, evaluate its bottlenecks and suggest optimization techniques for obtaining scalable application performance. First we systematically demonstrate that in a large cluster with high disk bandwidth, the processing capability and the I/O bus bandwidth are the two major performance bottlenecks in database systems. To identify and assess bottlenecks, we developed a Petri net model of parallel query execution on a cluster. Once they are identified and assessed, we address these two performance bottlenecks by offloading certain application-related tasks to the processor in the network interface card. Offloading application tasks to the processor in the network interface card shifts the bottleneck from the cluster processor to the I/O bus. Further, we propose a hardware scheme, the network attached disk, and a software scheme to achieve balanced utilization of resources such as the host processor, the I/O bus, and the processor in the network interface card. The proposed schemes result in a speedup of up to 1.47 compared to the base scheme and ensure scalable performance up to 64 processors. Encouraged by the benefits of offloading application tasks to network processors, we explore the possibility of performing Bloom filter operations in network processors. We combine offloading of Bloom filter operations with the proposed hardware schemes to achieve up to a 50% reduction in execution time. The latter part of the thesis describes introductory experiments conducted with the Community Atmospheric Model (CAM), a large-scale parallel application used for global weather and climate prediction.
CAM is a communication-intensive application that involves collective communication of large messages. In our limited experiment, we chose CAM to study the effect of compression techniques and offloading techniques (as formulated for databases) on the performance of communication-intensive applications. Due to time constraints, we considered only compression as a means of improving application performance. However, offloading could be taken up as a full-fledged research problem for further investigation. In our experiment, we found that compressing messages reduces message latencies and hence improves the execution time and scalability of the application. Without compression, performance measured on a 64-processor cluster resulted in a speedup of only 15.6. While lossless compression retains the accuracy and correctness of the program, it does not achieve high compression. We therefore propose a lossy compression technique that achieves higher compression yet retains the accuracy and numerical stability of the application while achieving scalable performance. This leads to a speedup of 31.7 on 64 processors, compared to a speedup of 15.6 without message compression. We establish that the accuracy, within the prescribed limit of variation, and the numerical stability of CAM are retained under lossy compression.
07 July 2006
Network processors are a new type of multithreaded multicore processor geared towards achieving both fast processing and flexible programming. Their architecture reflects many special properties of packet processing, including multiple threads, multiple processor cores on the same chip, special functional units, a simplified ISA and a simplified pipeline. These architectural peculiarities raise new challenges for compiler design and optimization. Due to very high clock speeds, the CPU-memory gap on such processors is huge, making registers extremely precious. Moreover, the register file is split into two banks, and for any ALU instruction, the two source operands must come from different banks. We present and compare three different approaches to register allocation and bank assignment. We also address the problem of sharing registers across threads in order to maximize the utilization of hardware resources. Context switches on the IXP network processor happen only when long-latency operations are encountered. As a result, context switches are highly frequent. The designers of the IXP network processor therefore made context switches extremely lightweight, i.e. only the program counter (PC) is stored with the context. Since registers are not saved and restored during context switches, it becomes difficult to share registers across threads. On a conventional processor, each thread can assume that it can use the entire register file, because registers are always part of the context. With lightweight context switches, however, each thread must take a separate piece of the register file, making register usage inefficient. Programs executing on network processors typically have runtime constraints. Scheduling of multiple programs sharing a CPU must be orchestrated by the OS and the hardware using certain sharing policies.
Real-time applications demand a real-time-aware OS kernel to meet their specified deadlines. However, due to the stringent performance requirements on network processors, neither OS nor hardware mechanisms are typically feasible. In this work, we demonstrate that a compiler approach can achieve some of the OS scheduling and real-time scheduling functionality without introducing a hefty overhead.
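The two-bank operand constraint described above can be viewed as a graph 2-coloring problem: registers are vertices, and two registers used as sources of the same ALU instruction are joined by an edge. The sketch below is not one of the three approaches compared in the thesis, which the abstract does not describe; it is a minimal illustration under that assumption, with hypothetical names, that either finds a conflict-free assignment or reports that none exists.

```python
from collections import deque

def assign_banks(instructions):
    """instructions: list of (src1, src2) register pairs.

    Returns {register: "A" or "B"} with both sources of every instruction
    in different banks, or None if no such assignment exists.
    """
    graph = {}
    for a, b in instructions:
        if a == b:
            return None  # one register cannot live in both banks
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    bank = {}
    for start in graph:
        if start in bank:
            continue
        bank[start] = "A"
        queue = deque([start])
        while queue:  # breadth-first 2-coloring of one connected component
            reg = queue.popleft()
            for other in graph[reg]:
                if other not in bank:
                    bank[other] = "B" if bank[reg] == "A" else "A"
                    queue.append(other)
                elif bank[other] == bank[reg]:
                    return None  # odd cycle: constraint cannot be met as-is
    return bank
```

When the function returns None, the conflict has to be resolved by other means, which this sketch does not attempt.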
Uma metodologia analítico-determinística para a avaliação de desempenho no tempo de processadores de rede implementados como sistemas-sobre-silício. / An analytical deterministic methodology for the performance evaluation of network processors deployed as systems-on-chip.
Frederico de Faria
26 June 2007
The great increase in the number of transistors that can be integrated on a single integrated circuit has demanded a large and constant evolution in the design methodology and implementation practices of embedded electronic systems. This integration capacity resulted in the emergence of systems-on-chip (SoCs). The design of such systems, which are more complex than their predecessors, significantly changes the traditional system design flows and gives rise to strategies such as reuse, platform-based design, and modeling and simulation at different abstraction levels. One of the abstraction levels studied is the analytical one, in which systems are modeled through abstract representations. Adopting analytical models has advantages, such as high execution speed (allowing a large number of different models to be analyzed) and ease of modification. However, because these models are distant, in terms of abstraction, from real implementations, they may yield inexact predictions. It is therefore necessary to investigate methodologies that aim to improve the accuracy and fidelity of such models. The present work presents an analytical modeling methodology for the performance evaluation of systems-on-chip, oriented to the application of packet network processors. The Network Calculus methodology, to be applied in the initial stages of platform-based system-on-chip design, helps to reduce the design space that has to be evaluated. It builds analytical equations for abstract representations of the input workloads and of the processing capacity of resources, aiming to obtain pessimistic and optimistic predictions of parameters such as latency, buffer requirements and utilization for a system described abstractly through graphs.
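As a small worked example of the kind of prediction Network Calculus gives, consider the textbook case of a token-bucket arrival curve α(t) = b + r·t served by a rate-latency service curve β(t) = R·max(0, t − T). For r ≤ R, the worst-case delay is T + b/R and the worst-case backlog (buffer requirement) is b + r·T. The sketch below covers only this pessimistic bound for a single resource; it is not the methodology of the thesis, and the function and parameter names are illustrative.

```python
def worst_case_bounds(burst: float, rate: float,
                      service_rate: float, service_latency: float):
    """Delay and backlog bounds for token-bucket arrivals and rate-latency service.

    burst (b) and rate (r) describe the arrival curve b + r*t;
    service_rate (R) and service_latency (T) describe R*max(0, t - T).
    """
    if rate > service_rate:
        raise ValueError("arrival rate exceeds service rate: bounds are infinite")
    delay = service_latency + burst / service_rate
    backlog = burst + rate * service_latency
    return delay, backlog
```

For instance, with b = 1000 bytes, r = 50 bytes per cycle, R = 100 bytes per cycle and T = 2 cycles, the bounds are 12 cycles of delay and 1100 bytes of buffer.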
[en] GENERATION OF BUILT-IN OPTICAL INTELIGENCE ON ETHERNET / IP NETWORKS / [pt] GERAÇÃO DE INTELIGÊNCIA ÓPTICA EM REDES ETHERNET / IP
HENRIQUE JOSE PINTO PORTELA DA SILVA
06 July 2005
The main objective of this work is the generation of new intelligent functionalities in optical networks associated with the IP and Gigabit Ethernet protocols, through the use of programmable integrated circuits operating at Gigabit rates. The Ethernet standard is presented through its PHY and MAC layers, highlighting their functions, interfaces and the types of commercially available ICs. The Ethernet PHY layer for optical media is described in detail. Several IC technologies are discussed, among them dedicated growth, ASICs, NPUs and the programmable technologies: FPGAs and CPLDs. The concept of built-in optical intelligence and a new layer model associated with it are introduced. A new network element, dedicated to the insertion of signaling in the optical layer, is also presented, with special attention to its structure, its implementation and the aspects of its use in networks. Several experimental setups using the developed element are shown, demonstrating the characteristics of the system, particularly the efficiency obtained by the use of FPGA technology and the transparency of the optical intelligence with respect to the Ethernet standard.