Global ETD Search

41	Paralelização em CUDA do algoritmo Aho-Corasick utilizando as hierarquias de memórias da GPU e nova compactação da Tabela de Transcrição de Estados [CUDA parallelization of the Aho-Corasick algorithm using the GPU memory hierarchies and a new compression of the State Transition Table]
Silva Júnior, José Bonifácio da
21 June 2017

An Intrusion Detection System (IDS) must compare the contents of every packet arriving at the network interface against a set of signatures that indicate possible attacks, a task that consumes considerable CPU processing time. To alleviate this problem, researchers have tried to parallelize the IDS comparison engine by moving its execution from the CPU to the GPU. This dissertation parallelizes the Brute Force and Aho-Corasick string matching algorithms and proposes a new compression of the State Transition Table of the Aho-Corasick algorithm so that the table can be used in shared memory and string comparison is accelerated. The two algorithms were parallelized with the NVIDIA CUDA platform and executed in the different GPU memories to allow a comparative analysis of the performance of these memories. Initially, the Aho-Corasick (AC) algorithm proved faster than the Brute Force algorithm, so it was selected for optimization. The AC algorithm was compressed and executed in parallel in shared memory, achieving a performance gain of 15% over the other GPU memories and running 48 times faster than its serial CPU version when tested with real network packets. When the tests used synthetic data (less random data), the gain reached 73% and the parallel algorithm was 56 times faster than its serial version. The use of compression in shared memory is therefore a suitable way to accelerate IDSs that need fast pattern search.

Keywords: Computer science; High-performance computing; Computer architecture; Information security; GPUs; CUDA; String matching algorithms; Aho-Corasick; IDS; GPU memory hierarchy; Compaction techniques
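As a rough illustration of the State Transition Table that record 41 refers to, the sketch below builds a dense, uncompressed Aho-Corasick table and scans a byte string with one lookup per input byte. It is a serial Python sketch, not the CUDA code or the compression scheme from the dissertation, which the abstract does not describe. The names `build_stt`, `search`, and `ALPHABET`, and the sample signatures, are assumptions made for this example.

```python
from collections import deque

ALPHABET = 256  # assumption: one table column per possible byte value


def build_stt(patterns):
    """Build a dense Aho-Corasick State Transition Table.

    Returns (stt, outputs): stt[state][byte] is the next state, and
    outputs[state] lists the patterns that end at that state.
    """
    # Phase 1: build the trie. -1 marks a transition that does not exist yet.
    stt = [[-1] * ALPHABET]
    outputs = [[]]
    for pattern in patterns:
        state = 0
        for byte in pattern:
            if stt[state][byte] == -1:
                stt.append([-1] * ALPHABET)
                outputs.append([])
                stt[state][byte] = len(stt) - 1
            state = stt[state][byte]
        outputs[state].append(pattern)

    # Phase 2: breadth-first pass that replaces every missing transition
    # with the one taken from the failure state, so that matching needs
    # exactly one table lookup per input byte.
    fail = [0] * len(stt)
    queue = deque()
    for byte in range(ALPHABET):
        child = stt[0][byte]
        if child == -1:
            stt[0][byte] = 0
        else:
            queue.append(child)
    while queue:
        state = queue.popleft()
        # Patterns ending at the failure state also end here.
        outputs[state] = outputs[state] + outputs[fail[state]]
        for byte in range(ALPHABET):
            child = stt[state][byte]
            if child == -1:
                stt[state][byte] = stt[fail[state]][byte]
            else:
                fail[child] = stt[fail[state]][byte]
                queue.append(child)
    return stt, outputs


def search(stt, outputs, data):
    """Yield (end_offset, pattern) for every pattern occurrence in data."""
    state = 0
    for offset, byte in enumerate(data):
        state = stt[state][byte]
        for pattern in outputs[state]:
            yield offset + 1, pattern


if __name__ == "__main__":
    signatures = [b"he", b"she", b"his", b"hers"]  # hypothetical signatures
    stt, outputs = build_stt(signatures)
    print(sorted(search(stt, outputs, b"ushers")))
    print(len(stt), "states x", ALPHABET, "columns")
```

A dense table of this kind stores 256 entries per state, so with 4-byte entries each state costs 1 KiB. Shared memory on many NVIDIA GPUs is on the order of tens of kilobytes per thread block, which suggests why an uncompressed table for a realistic signature set would not fit there.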
42	Development of methodologies for memory management and design space exploration of SW/HW computer architectures for designing embedded systems / Ανάπτυξη μεθοδολογιών διαχείρισης μνήμης και εξερεύνησης σχεδιασμών σε αρχιτεκτονικές υπολογιστών υλικού/λογισμικού για σχεδίαση ενσωματωμένων συστημάτων
Κρητικάκου, Αγγελική
16 May 2014

This PhD dissertation proposes innovative methodologies to support the design and mapping of embedded systems. Due to increasing requirements, embedded systems have become quite complex, as they consist of several partially dependent heterogeneous components. Systematic Design Space Exploration (DSE) methodologies are required to support the near-optimal design of embedded systems within the short time-to-market available. In this target domain, existing DSE approaches either require too much exploration time to find near-optimal designs, because of the high number of parameters and the correlations between them, or they settle for a less efficient trade-off in order to find a design within acceptable time. This dissertation presents an alternative DSE methodology based on the systematic creation of scalable and near-optimal DSE frameworks. The frameworks describe all the available options of the exploration space in a finite set of classes. A set of principles is presented, which the reusable DSE methodology uses to create a scalable and near-optimal framework and to use it efficiently to derive scalable and near-optimal design solutions within a Pareto trade-off space. The reusable DSE methodology is applied to several stages of the embedded system design flow to derive scalable and near-optimal methodologies.

The first part of the dissertation is dedicated to the development of mapping methodologies for storing large embedded system data arrays in the lower layers of the on-chip background data memory hierarchy. The second part is dedicated to DSE methodologies for the processing part of SW/HW architectures in embedded systems, including the foreground memory systems.

Existing mapping approaches for the background memory part are enumerative, symbolic/polyhedral, or worst-case (heuristic) approximations. The enumerative approaches require too much exploration time, the worst-case approximations lead to overestimation of the storage requirements, and the symbolic/polyhedral approaches are scalable and near-optimal only for solid and regular iteration spaces. By applying the new reusable DSE methodology, we have developed an intra-signal in-place optimization methodology which is scalable and near-optimal for highly irregular access schemes. Scalable and near-optimal solutions have been developed for both non-overlapping and overlapping store and load access schemes. To support the proposed methodology, a new representation of array access schemes is presented, which expresses irregular shapes in a scalable and near-optimal way. A general pattern formulation has been proposed which describes the access scheme in a compact and repetitive way. Pattern operations were developed to combine the patterns in a scalable and near-optimal way under all the pattern combination cases that may exist in the application under study.

In the processing-oriented part of the dissertation, a DSE methodology is developed for mapping instances of a predefined target application domain onto a partially fixed architecture platform template, which consists of one processor core and several custom hardware accelerators. The DSE methodology consists of uni-directional steps, which are implemented through parametric templates and are applied without costly design iterations. The proposed DSE methodology explores the space by instantiating the steps and propagating design constraints, which prune design options in the order of the steps. The result is a final Pareto trade-off curve with the most relevant near-optimal designs. As scheduling and assignment are the major tasks of both the foreground memory and the data path, near-optimal and scalable techniques are required to support the parametric templates of the proposed DSE methodology. A framework is developed which describes the scheduling and assignment of scalars to registers and the scheduling and assignment of operations to the function units of the data path. Based on the framework, a systematic methodology is developed to arrive at parametric templates for scheduling and assignment techniques that satisfy the target domain constraints. In this way, a scalable parametric template for scheduling and assignment tasks is created, which guarantees near-optimality for the domain under study. The developed template can be used in the Foreground Memory Management step and the Data-path mapping step of the overall design flow. For the DSE of the domain under study, near-optimal results are hence achieved through a truly scalable technique.

Keywords: Embedded systems; System design flow; Design space exploration; Memory hierarchy; Software/hardware mapping; Memory storage size; Scheduling and assignment; Top-down methodologies; 004.21
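The Pareto trade-off curve mentioned in record 42 can be illustrated with a small dominance filter. This is a minimal sketch, assuming every objective is minimized. The function name `pareto_front` and the (area, latency) design points are hypothetical and do not come from the dissertation.

```python
def pareto_front(designs):
    """Return the designs not dominated by any other design.

    Each design is a tuple of objective values, all of them to be minimized.
    """
    def dominates(a, b):
        # a dominates b if it is no worse everywhere and strictly better somewhere.
        no_worse = all(x <= y for x, y in zip(a, b))
        better = any(x < y for x, y in zip(a, b))
        return no_worse and better

    return [d for d in designs if not any(dominates(o, d) for o in designs)]


if __name__ == "__main__":
    # Hypothetical (area, latency) pairs for five candidate designs.
    candidates = [(10, 5.0), (8, 7.0), (12, 4.0), (9, 8.0), (10, 6.0)]
    print(pareto_front(candidates))
```

Here (9, 8.0) is dropped because (8, 7.0) is better on both objectives, and (10, 6.0) is dropped because (10, 5.0) matches its area with lower latency. The quadratic scan is only suitable for small candidate sets. The abstract describes pruning options step by step through propagated constraints instead of enumerating all candidates first.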
43	Interference Analysis and Resource Management in Server Processors: from HPC to Cloud Computing
Pons Escat, Lucía
01 September 2023

One of the main concerns of today's data centers is to maximize server utilization. In each server processor, multiple applications are executed concurrently to increase resource efficiency. However, performance and fairness depend heavily on the share of resources that each application receives, which makes performance unpredictable. The rising number of cores (and running applications) with every new generation of processors is leading to growing concern about interference in the shared resources. This thesis focuses on mitigating resource interference when different applications are consolidated on the same server processor, from two main perspectives: high-performance computing (HPC) and cloud computing.

In the context of HPC, resource management approaches are proposed to reduce inter-application interference at two critical resources: the last level cache (LLC) and the processor cores. The LLC plays a key role in the system performance of current multi-cores by reducing the number of long-latency main memory accesses. LLC partitioning approaches are proposed for both inclusive and non-inclusive LLCs, as both designs are present in current server processors. In both cases, new problematic LLC behaviors are identified and efficiently detected, and a larger cache share is granted to the applications that make the best use of the LLC space. As for processor cores, many parallel applications, such as graph applications, do not scale well with an increasing number of cores. Moreover, the default Linux time-sharing scheduler performs poorly when running graph applications, which process vast amounts of data. To maximize system utilization, this thesis proposes to co-locate multiple graph applications on the same server processor by assigning the optimal number of cores to each one, dynamically adapting the number of threads spawned by the running applications.

For cloud computing, this thesis addresses three major challenges: the complex infrastructure of cloud systems, the nature of cloud applications, and the impact of interference between virtual machines (VMs). Firstly, this thesis presents the experimental platform developed to perform representative cloud studies with the main cloud system components (hardware and software). Secondly, an extensive characterization study is presented on a set of representative latency-critical workloads, which must meet strict quality of service (QoS) requirements. The aim of the study is to outline the issues cloud providers should consider to improve performance and resource utilization. Finally, an online approach is proposed that detects and accurately estimates inter-VM interference when co-locating multiple latency-critical VMs. The approach relies on metrics that can be easily monitored in the public cloud, since VMs must be handled as "black boxes". All of this research follows the restrictions and requirements needed to be applicable to public cloud production systems.

In summary, this thesis addresses contention in the main shared system resources in the context of server consolidation, both in HPC and in cloud computing. Experimental results show that important gains are obtained over the Linux OS scheduler by reducing interference. In inclusive LLCs, turnaround time (TT) is reduced by over 40% while IPC improves by more than 3%. In non-inclusive LLCs, fairness and TT are improved by 44% and 24%, respectively, while performance improves by up to 3.5%. By distributing core resources efficiently, almost perfect fairness can be obtained (94%), and TT can be reduced by up to 80%. In cloud computing, performance degradation due to resource contention can be estimated with an overall prediction error of 5%. All the approaches proposed in this thesis have been designed to be applied in commercial server processors without requiring any prior information, making decisions dynamically with data collected from hardware performance counters.

Pons Escat, L. (2023). Interference Analysis and Resource Management in Server Processors: from HPC to Cloud Computing [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/195840

Keywords: Multi-core multiprocessors; Interapplication interference; Resource sharing; Resource contention; Performance; High-performance computing; Cache memories; Cache partitioning; Memory structures; Memory hierarchy; Graph processing; Scheduling; System utilization; Runtime optimization; Cloud computing; Public cloud; Virtualization; Latency-critical workloads; Tail latency; Hyper-Threading; Quality of Service (QoS)
