Global ETD Search

51	Analysis and Architectural Support for Parallel Stateful Packet Processing Verdú Mulà, Javier 09 July 2008 (has links) The evolution of network services is closely related to the network technology trend. Originally network nodes forwarded packets from a source to a destination in the network by executing lightweight packet processing, or even negligible workloads. As links provide more complex services, packet processing demands the execution of more computational intensive applications. Complex network applications deal with both packet header and payload (i.e. packet contents) to provide upper layer network services, such as enhanced security, system utilization policies, and video on demand management.Applications that provide complex network services arise two key capabilities that differ from the low layer network applications: a) deep packet inspection examines the packet payload tipically searching for a matching string or regular expression, and b) stateful processing keeps track information of previous packet processing, unlike other applications that don't keep any data about other packet processing. In most cases, deep packet inspection also integrates stateful processing.Computer architecture researches aim to maximize the system throughput to sustain the required network processing performance as well as other demands, such as memory and I/O bandwidth. In fact, there are different processor architectures depending on the sharing degree of hardware resources among streams (i.e. hardware context). Multicore architectures present multiple processing engines within a single chip that share cache levels of memory hierarchy and interconnection network. Multithreaded architectures integrates multiple streams in a single processing engine sharing functional units, register file, fecth unit, and inner levels of cache hierarchy. Scalable multicore multithreaded architectures emerge as a solution to overcome the requirements of high throughput systems. We call massively multithreaded architectures to the architectures that comprise tens to hundreds of streams distributed across multiple cores on a chip. Nevertheless, the efficient utilization of these architectures depends on the application characteristics. On one hand, emerging network applications show large computational workloads with significant variations in the packet processing behavior. Then, it is important to analyze the behavior of each packet processing to optimally assign packets to threads (i.e. software context) for reducing any negative interaction among them. On the other hand, network applications present Packet Level Parallelism (PLP) in which several packets can be processed in parallel. As in other paradigms, dependencies among packets limit the amount of PLP. Lower network layer applications show negligible packet dependencies. In contrast, complex upper network applications show dependencies among packets leading to reduce the amount of PLP.In this thesis, we address the limitations of parallelism in stateful network applications to maximize the throughput of advanced network devices. This dissertation comprises three complementary sets of contributions focused on: network analysis, workload characterization and architectural proposal.The network analysis evaluates the impact of network traffic on stateful network applications. We specially study the impact of network traffic aggregation on memory hierarchy performance. We categorize and characterize network applications according to their data management. The results point out that stateful processing presents reduced instruction level parallelism and high rate of long latency memory accesses. Our analysis reveal that stateful applications expose a variety of levels of parallelism related to stateful data categories. Thus, we propose the MultiLayer Processing (MLP) as an execution model to exploit multiple levels of parallelism. The MLP is a thread migration based mechanism that increases the sinergy among streams in the memory hierarchy and alleviates the contention in critical sections of parallel stateful workloads. 004 - Informàtica
52	Enhancing the Rendering of Natural Scenes: Rain, Sea and Terrain Puig Centelles, Anna 12 February 2010 (has links) Los escenarios de exteriores en aplicaciones gráficas incluyen generalmente fenómenos naturales que son complicados de simular en tiempo real. Así, en el campo de la visualización interactiva se han propuesto soluciones que son capaces de ofrecer realismo a un coste elevado. El principal objetivo de esta tesis es presentar un conjunto de técnicas para mejorar la visualización en tiempo real de estos fenómenos naturales. Proponemos una solución para facilitar la creación y el control de escenas de lluvia por medio de un nuevo método de nivel de detalle. Posteriormente, esta técnica se ha extendido para incluir el manejo de colisiones de las gotas con el entorno y la simulación de las salpicaduras. Como una mejora sobre las técnicas de simulación de mar, se ha desarrollado un nuevo algoritmo de teselación dependiente de la vista, explotando el hardware gráfico actual y permitiendo el mantenimiento de la coherencia entre las aproximaciones calculadas. Por último, se ha propuesto una aplicación de sketching para diseñar terreno, ofreciendo una alternativa a los generadores de terreno que ofrecen un control al usuario muy limitado. Estas distintas técnicas cubren la simulación de diferentes fenómenos naturales y ofrecen interesantes mejoras. 004 - Informàtica
53	On Collective Computation Delgado, Jordi 05 December 1997 (has links) No description available. 004 - Informàtica
54	Information theoretic refinement criteria for image synthesis Rigau Vilalta, Jaume 17 November 2006 (has links) Aquest treball està enmarcat en el context de gràfics per computador partint de la intersecció de tres camps: rendering, teoria de la informació, i complexitat.Inicialment, el concepte de complexitat d'una escena es analitzat considerant tres perspectives des d'un punt de vista de la visibilitat geomètrica: complexitat en un punt interior, complexitat d'una animació, i complexitat d'una regió. L'enfoc principal d'aquesta tesi és l'exploració i desenvolupament de nous criteris de refinament pel problema de la il·luminació global. Mesures de la teoria de la informació basades en la entropia de Shannon i en la entropia generalitzada de Harvda-Charvát-Tsallis, conjuntament amb les f-divergències, són analitzades com a nuclis del refinement. Mostrem com ens aporten una rica varietat d'eficients i altament discriminatòries mesures que són aplicables al rendering en els seus enfocs de pixel-driven (ray-tracing) i object-space (radiositat jeràrquica).Primerament, basat en la entropia de Shannon, es defineixen un conjunt de mesures de qualitat i contrast del pixel. S'apliquen al supersampling en ray-tracing com a criteris de refinement, obtenint un algorisme nou de sampleig adaptatiu basat en entropia, amb un alt rati de qualitat versus cost. En segon lloc, basat en la entropia generalitzada de Harvda-Charvát-Tsallis, i en la informació mutua generalitzada, es defineixen tres nous criteris de refinament per la radiositat jeràrquica. En correspondencia amb tres enfocs clàssics, es presenten els oracles basats en la informació transportada, el suavitzat de la informació, i la informació mutua, amb resultats molt significatius per aquest darrer. Finalment, tres membres de la familia de les f-divergències de Csiszár's (divergències de Kullback-Leibler, chi-square, and Hellinger) son analitzats com a criteris de refinament mostrant bons resultats tant pel ray-tracing com per la radiositat jeràrquica. / This work is framed within the context of computer graphics starting out from the intersection of three fields: rendering, information theory, and complexity.Initially, the concept of scene complexity is analysed considering three perspectives from a geometric visibility point of view: complexity at an interior point, complexity of an animation, and complexity of a region. The main focus of this dissertation is the exploration and development of new refinement criteria for the global illumination problem. Information-theoretic measures based on Shannon entropy and Harvda-Charvát-Tsallis generalised entropy, together with f-divergences, are analysed as kernels of refinement. We show how they give us a rich variety of efficient and highly discriminative measures which are applicable to rendering in its pixel-driven (ray-tracing) and object-space (hierarchical radiosity) approaches.Firstly, based on Shannon entropy, a set of pixel quality and pixel contrast measures are defined. They are applied to supersampling in ray-tracing as refinement criteria, obtaining a new entropy-based adaptive sampling algorithm with a high rate quality versus cost. Secondly, based on Harvda-Charvát-Tsallis generalised entropy, and generalised mutual information, three new refinement criteria are defined for hierarchical radiosity. In correspondence with three classic approaches, oracles based on transported information, information smoothness, and mutual information are presented, with very significant results for the latter. And finally, three members of the family of Csiszár's f-divergences (Kullback-Leibler, chi-square, and Hellinger divergences) are analysed as refinement criteria showing good results for both ray-tracing and hierarchical radiosity. 004 - Informàtica
55	Security in peer-to-peer communication systems Suárez Touceda, Diego 26 July 2011 (has links) P2PSIP (Peer-to-Peer Session Initiation Protocol) is a protocol developed by the IETF (Internet Engineering Task Force) for the establishment, completion and modi¿cation of communication sessions that emerges as a complement to SIP (Session Initiation Protocol) in environments where the original SIP protocol may fail for technical, ¿nancial, security, or social reasons. In order to do so, P2PSIP systems replace all the architecture of servers of the original SIP systems used for the registration and location of users, by a structured P2P network that distributes these functions among all the user agents that are part of the system. This new architecture, as with any emerging system, presents a completely new security problematic which analysis, subject of this thesis, is of crucial importance for its secure development and future standardization. Starting with a study of the state of the art in network security and continuing with more speci¿c systems such as SIP and P2P, we identify the most important security services within the architecture of a P2PSIP communication system: access control, bootstrap, routing, storage and communication. Once the security services have been identi¿ed, we conduct an analysis of the attacks that can a¿ect each of them, as well as a study of the existing countermeasures that can be used to prevent or mitigate these attacks. Based on the presented attacks and the weaknesses found in the existing measures to prevent them, we design speci¿c solutions to improve the security of P2PSIP communication systems. To this end, we focus on the service that stands as the cornerstone of P2PSIP communication systems¿ security: access control. Among the new designed solutions stand out: a certi¿cation model based on the segregation of the identity of users and nodes, a model for secure access control for on-the-¿y P2PSIP systems and an authorization framework for P2PSIP systems built on the recently published Internet Attribute Certi¿cate Pro¿le for Authorization. Finally, based on the existing measures and the new solutions designed, we de¿ne a set of security recommendations that should be considered for the design, implementation and maintenance of P2PSIP communication systems. 004 - Informàtica
56	Improving multithreading performance for clustered VLIW architectures. Gupta, Manoj 14 June 2013 (has links) Very Long Instruction Word (VLIW) processors are very popular in embedded and mobile computing domain. Use of VLIW processors range from Digital Signal Processors (DSPs) found in a plethora of communication and multimedia devices to Graphics Processing Units (GPUs) used in gaming and high performance computing devices. The advantage of VLIWs is their low complexity and low power design which enable high performance at a low cost. Scalability of VLIWs is limited by the scalability of register file ports. It is not viable to have a VLIW processor with a single large register file because of area and power consumption implications of the register file. Clustered VLIW solve the register file scalability issue by partitioning the register file into multiple clusters and a set of functional units that are attached to register file of that cluster. Using a clustered approach, higher issue width can be achieved while keeping the cost of register file within reasonable limits. Several commercial VLIW processors have been designed using the clustered VLIW model. VLIW processors can be used to run a larger set of applications. Many of these applications have a good Lnstruction Level Parallelism (ILP) which can be efficiently utilized. However, several applications, specially the ones that are control code dominated do not exibit good ILP and the processor is underutilized. Cache misses is another major source of resource underutiliztion. Multithreading is a popular technique to improve processor utilization. Interleaved MultiThreading (IMT) hides cache miss latencies by scheduling a different thread each cycle but cannot hide unused instructions slots. Simultaneous MultiThread (SMT) can also remove ILP under-utilization by issuing multiple threads to fill the empty instruction slots. However, SMT has a higher implementation cost than IMT. The thesis presents Cluster-level Simultaneous MultiThreading (CSMT) that supports a limited form of SMT where VLIW instructions from different threads are merged at a cluster-level granularity. This lowers the hardware implementation cost to a level comparable to the cheap IMT technique. The more complex SMT combines VLIW instructions at the individual operation-level granularity which is quite expensive especially in for a mobile solution. We refer to SMT at operation-level as OpSMT to reduce ambiguity. While previous studies restricted OpSMT on a VLIW to 2 threads, CSMT has a better scalability and upto 8 threads can be supported at a reasonable cost. The thesis proposes several other techniques to further improve CSMT performance. In particular, Cluster renaming remaps the clusters used by instructions of different threads to reduce resource conflicts. Cluster renaming is quite effective in reducing the issue-slots under-utilization and significantly improves CSMT performance.The thesis also proposes: a hybrid between IMT and CSMT which increases the number of supported threads, heterogeneous instruction merging where some instructions are combined using SMT and CSMT rest, and finally, split-issue, a technique that allows to launch partially an instruction making it easier to be combined with others. 004 - Informàtica
57	A hierarchical framework for peer-to peer systems: design and optimizations Sánchez Artigas, Marc 12 January 2009 (has links) En los últimos años, las redes peer-to-peer (P2P) ha experimentado una fuerte expansión. Estos sustratos se constituyen en forma de redes overlay o de recubrimiento que interconectan usuarios de manera lógica y desacoplada de la topología física, y que proporcionan un servicio descentralizado de búsqueda de recursos. Existen dos grandes familias de redes P2P descentralizadas: las redes P2P desestructuradas y las redes P2P estructuradas. Desde el punto de vista funcional, las redes estructuradas también se denominan Tablas de Hash Distribuidas (DHTs). Básicamente, las DHTs proporcionan la misma funcionalidad de las tablas de hash tradicional, esto es, la interficie estándar Put(clave, valor) y Get(clave), pero asociando los pares clave-valor con usuarios de la DHT.Debido a su excelente escalabilidad, las DHT han suscitado una gran expectación en los últimos años. Sin embargo, su adopción como herramienta generalizada de comunicación es aún lenta debido a un conjunto de inconvenientes. El primer inconveniente es que la estructura lógica de las DHTs no se corresponde con la topología física de Internet. En otras palabras, un usuario puede tener como vecinos a otros participantes que en realidad se encuentren muy alejados (en términos de latencia) de él. Para aplicaciones en que la latencia extremo-a-extremo ha de ser necesariamente baja, esta falta de correspondencia supone un gran obstáculo. Por otro lado, muchos diseños asumen que la comunicación es uniforme, mientras que en la práctica los usuarios se comunican de manera más frecuente con los usuarios que pertenecen al mismo dominio administrativo, comparten los mismos intereses etc.Para resolver estas deficiencias, tradicionalmente se ha recurrido a la organización de los usuarios en dominios jerárquicos. Ejemplos típicos de esta estrategia son el sistema DNS y los sistemas de distribución y gestión de contenido multimedia de alta calidad.El problema básico es que la mayoría de DHTs se han diseñado como estructuras llanas y por tanto, no pueden disfrutar de las ventajas de las jerarquías. En esta disertación, hemos intentado solucionar este problema de la forma siguiente:Seducidos por la escalabilidad de los diseños jerárquicos, en la primera parte de esta tesis, describimos un framework o marco de trabajo jerárquico para DHTs. El objetivo principal de este framework es proporcionar una metodología genérica para transformar una DHT cualquiera en una DHT jerárquica constituida por grupos o clusters telescópicos, esto es, clusters de clusters de ... de clusters de usuarios. La idea básica consiste en explotar, si es posible, su estructura recursiva. En caso afirmativo, la construcción jerárquica hereda la homogeneidad en carga y funcionalidad del diseño original, pero con las ventajas adicionales derivadas de una estructura jerárquica. Para ilustrar la utilidad de nuestro framework, proporcionamos la versión jerárquica de Chord y un conjunto de indicaciones para poder transformar seis DHTs de manera sencilla. Cerramos esta parte con el estudio de la mejora potencial en el rendimiento de nuestros diseños. En la segunda parte de esta tesis, respondemos a una cuestión que uno debería de tener en cuenta a fin de poder valorar objetivamente la utilidad de nuestro framework: En cuáles aspectos nuestras construcciones jerárquicas son superiores a las existentes? Para dar una respuesta satisfactoria a esta pregunta, introducimos un modelo genérico basado en costes. En general, nuestros diseños jerárquicos ofrecen un amplio abanico de posibilidades relacionadas con la explotación de un sustrato con múltiples dominios. Un ejemplo ilustrativo es la mejora del rendimiento. Si la comunicación es frecuente entre usuarios de un mismo dominio, la adaptación de los dominios a la red física permitirá reducir el tiempo de búsqueda medio del sistema. El problema básico es como organizar los usuarios en clusters de baja latencia, de manera descentralizada y escalable. Para solucionar este problema, la última parte de esta tesis introduce un nuevo algoritmo de clustering o de agrupamiento. La función de este algoritmo es organizar a los usuarios en múltiples clusters de manera que los usuarios dentro de un cluster estén mutuamente más cercanos (en términos de latencia) que los usuarios pertenecientes a clusters distintos. Para juzgar la calidad de nuestra solución, proponemos una nueva métrica denominada false clustering rate. Esta métrica mide la proporción de usuarios falsamente agrupados dentro del sistema. Por usuarios falsamente agrupados nos referimos a usuarios lejanos que han estado erróneamente agrupados dentro de un mismo cluster. Finalmente, demostramos por medio de diversos experimentos como nuestro algoritmo permite obtener mejores significativas con respecto a las técnicas existentes. 004 - Informàtica
58	Evaluating techniques for parallelization tuning in MPI, OmpSs and MPI/OmpSs Subotic, Vladimir 26 July 2013 (has links) Parallel programming is used to partition a computational problem among multiple processing units and to define how they interact (communicate and synchronize) in order to guarantee the correct result. The performance that is achieved when executing the parallel program on a parallel architecture is usually far from the optimal: computation unbalance and excessive interaction among processing units often cause lost cycles, reducing the efficiency of parallel computation. In this thesis we propose techniques oriented to better exploit parallelism in parallel applications, with emphasis in techniques that increase asynchronism. Theoretically, this type of parallelization tuning promises multiple benefits. First, it should mitigate communication and synchronization delays, thus increasing the overall performance. Furthermore, parallelization tuning should expose additional parallelism and therefore increase the scalability of execution. Finally, increased asynchronism would provide higher tolerance to slower networks and external noise. In the first part of this thesis, we study the potential for tuning MPI parallelism. More specifically, we explore automatic techniques to overlap communication and computation. We propose a speculative messaging technique that increases the overlap and requires no changes of the original MPI application. Our technique automatically identifies the application’s MPI activity and reinterprets that activity using optimally placed non-blocking MPI requests. We demonstrate that this overlapping technique increases the asynchronism of MPI messages, maximizing the overlap, and consequently leading to execution speedup and higher tolerance to bandwidth reduction. However, in the case of realistic scientific workloads, we show that the overlapping potential is significantly limited by the pattern by which each MPI process locally operates on MPI messages. In the second part of this thesis, we study the potential for tuning hybrid MPI/OmpSs parallelism. We try to gain a better understanding of the parallelism of hybrid MPI/OmpSs applications in order to evaluate how these applications would execute on future machines and to predict the execution bottlenecks that are likely to emerge. We explore how MPI/OmpSs applications could scale on the parallel machine with hundreds of cores per node. Furthermore, we investigate how this high parallelism within each node would reflect on the network constraints. We especially focus on identifying critical code sections in MPI/OmpSs. We devised a technique that quickly evaluates, for a given MPI/OmpSs application and the selected target machine, which code section should be optimized in order to gain the highest performance benefits. Also, this thesis studies techniques to quickly explore the potential OmpSs parallelism inherent in applications. We provide mechanisms to easily evaluate potential parallelism of any task decomposition. Furthermore, we describe an iterative trialand-error approach to search for a task decomposition that will expose sufficient parallelism for a given target machine. Finally, we explore potential of automating the iterative approach by capturing the programmers’ experience into an expert system that can autonomously lead the search process. Also, throughout the work on this thesis, we designed development tools that can be useful to other researchers in the field. The most advanced of these tools is Tareador – a tool to help porting MPI applications to MPI/OmpSs programming model. Tareador provides a simple interface to propose some decomposition of a code into OmpSs tasks. Tareador dynamically calculates data dependencies among the annotated tasks, and automatically estimates the potential OmpSs parallelization. Furthermore, Tareador gives additional hints on how to complete the process of porting the application to OmpSs. Tareador already proved itself useful, by being included in the academic classes on parallel programming at UPC. / La programación paralela consiste en dividir un problema de computación entre múltiples unidades de procesamiento y definir como interactúan (comunicación y sincronización) para garantizar un resultado correcto. El rendimiento de un programa paralelo normalmente está muy lejos de ser óptimo: el desequilibrio de la carga computacional y la excesiva interacción entre las unidades de procesamiento a menudo causa ciclos perdidos, reduciendo la eficiencia de la computación paralela. En esta tesis proponemos técnicas orientadas a explotar mejor el paralelismo en aplicaciones paralelas, poniendo énfasis en técnicas que incrementan el asincronismo. En teoría, estas técnicas prometen múltiples beneficios. Primero, tendrían que mitigar el retraso de la comunicación y la sincronización, y por lo tanto incrementar el rendimiento global. Además, la calibración de la paralelización tendría que exponer un paralelismo adicional, incrementando la escalabilidad de la ejecución. Finalmente, un incremente en el asincronismo proveería una tolerancia mayor a redes de comunicación lentas y ruido externo. En la primera parte de la tesis, estudiamos el potencial para la calibración del paralelismo a través de MPI. En concreto, exploramos técnicas automáticas para solapar la comunicación con la computación. Proponemos una técnica de mensajería especulativa que incrementa el solapamiento y no requiere cambios en la aplicación MPI original. Nuestra técnica identifica automáticamente la actividad MPI de la aplicación y la reinterpreta usando solicitudes MPI no bloqueantes situadas óptimamente. Demostramos que esta técnica maximiza el solapamiento y, en consecuencia, acelera la ejecución y permite una mayor tolerancia a las reducciones de ancho de banda. Aún así, en el caso de cargas de trabajo científico realistas, mostramos que el potencial de solapamiento está significativamente limitado por el patrón según el cual cada proceso MPI opera localmente en el paso de mensajes. En la segunda parte de esta tesis, exploramos el potencial para calibrar el paralelismo híbrido MPI/OmpSs. Intentamos obtener una comprensión mejor del paralelismo de aplicaciones híbridas MPI/OmpSs para evaluar de qué manera se ejecutarían en futuras máquinas. Exploramos como las aplicaciones MPI/OmpSs pueden escalar en una máquina paralela con centenares de núcleos por nodo. Además, investigamos cómo este paralelismo de cada nodo se reflejaría en las restricciones de la red de comunicación. En especia, nos concentramos en identificar secciones críticas de código en MPI/OmpSs. Hemos concebido una técnica que rápidamente evalúa, para una aplicación MPI/OmpSs dada y la máquina objetivo seleccionada, qué sección de código tendría que ser optimizada para obtener la mayor ganancia de rendimiento. También estudiamos técnicas para explorar rápidamente el paralelismo potencial de OmpSs inherente en las aplicaciones. Proporcionamos mecanismos para evaluar fácilmente el paralelismo potencial de cualquier descomposición en tareas. Además, describimos una aproximación iterativa para buscar una descomposición en tareas que mostrará el suficiente paralelismo en la máquina objetivo dada. Para finalizar, exploramos el potencial para automatizar la aproximación iterativa. En el trabajo expuesto en esta tesis hemos diseñado herramientas que pueden ser útiles para otros investigadores de este campo. La más avanzada es Tareador, una herramienta para ayudar a migrar aplicaciones al modelo de programación MPI/OmpSs. Tareador proporciona una interfaz simple para proponer una descomposición del código en tareas OmpSs. Tareador también calcula dinámicamente las dependencias de datos entre las tareas anotadas, y automáticamente estima el potencial de paralelización OmpSs. Por último, Tareador da indicaciones adicionales sobre como completar el proceso de migración a OmpSs. Tareador ya se ha mostrado útil al ser incluido en las clases de programación de la UPC. 004 - Informàtica
59	Performance and power optimizations in chip multiprocessors for throughput-aware computation Vega, Augusto J. 30 July 2013 (has links) The so-called "power (or power density) wall" has caused core frequency (and single-thread performance) to slow down, giving rise to the era of multi-core/multi-thread processors. For example, the IBM POWER4 processor, released in 2001, incorporated two single-thread cores into the same chip. In 2010, IBM released the POWER7 processor with eight 4-thread cores in the same chip, for a total capacity of 32 execution contexts. The ever increasing number of cores and threads gives rise to new opportunities and challenges for software and hardware architects. At software level, applications can benefit from the abundant number of execution contexts to boost throughput. But this challenges programmers to create highly-parallel applications and operating systems capable of scheduling them correctly. At hardware level, the increasing core and thread count puts pressure on the memory interface, because memory bandwidth grows at a slower pace ---phenomenon known as the "bandwidth (or memory) wall". In addition to memory bandwidth issues, chip power consumption rises due to manufacturers' difficulty to lower operating voltages sufficiently every processor generation. This thesis presents innovations to improve bandwidth and power consumption in chip multiprocessors (CMPs) for throughput-aware computation: a bandwidth-optimized last-level cache (LLC), a bandwidth-optimized vector register file, and a power/performance-aware thread placement heuristic. In contrast to state-of-the-art LLC designs, our organization avoids data replication and, hence, does not require keeping data coherent. Instead, the address space is statically distributed all over the LLC (in a fine-grained interleaving fashion). The absence of data replication increases the cache effective capacity, which results in better hit rates and higher bandwidth compared to a coherent LLC. We use double buffering to hide the extra access latency due to the lack of data replication. The proposed vector register file is composed of thousands of registers and organized as an aggregation of banks. We leverage such organization to attach small special-function "local computation elements" (LCEs) to each bank. This approach ---referred to as the "processor-in-regfile" (PIR) strategy--- overcomes the limited number of register file ports. Because each LCE is a SIMD computation element and all of them can proceed concurrently, the PIR strategy constitutes a highly-parallel super-wide-SIMD device (ideal for throughput-aware computation). Finally, we present a heuristic to reduce chip power consumption by dynamically placing software (application) threads across hardware (physical) threads. The heuristic gathers chip-level power and performance information at runtime to infer characteristics of the applications being executed. For example, if an application's threads share data, the heuristic may decide to place them in fewer cores to favor inter-thread data sharing and communication. In such case, the number of active cores decreases, which is a good opportunity to switch off the unused cores to save power. It is increasingly harder to find bulletproof (micro-)architectural solutions for the bandwidth and power scalability limitations in CMPs. Consequently, we think that architects should attack those problems from different flanks simultaneously, with complementary innovations. This thesis contributes with a battery of solutions to alleviate those problems in the context of throughput-aware computation: 1) proposing a bandwidth-optimized LLC; 2) proposing a bandwidth-optimized register file organization; and 3) proposing a simple technique to improve power-performance efficiency. / El excesivo consumo de potencia de los procesadores actuales ha desacelerado el incremento en la frecuencia operativa de los mismos para dar lugar a la era de los procesadores con múltiples núcleos y múltiples hilos de ejecución. Por ejemplo, el procesador POWER7 de IBM, lanzado al mercado en 2010, incorpora ocho núcleos en el mismo chip, con cuatro hilos de ejecución por núcleo. Esto da lugar a nuevas oportunidades y desafíos para los arquitectos de software y hardware. A nivel de software, las aplicaciones pueden beneficiarse del abundante número de núcleos e hilos de ejecución para aumentar el rendimiento. Pero esto obliga a los programadores a crear aplicaciones altamente paralelas y sistemas operativos capaces de planificar correctamente la ejecución de las mismas. A nivel de hardware, el creciente número de núcleos e hilos de ejecución ejerce presión sobre la interfaz de memoria, ya que el ancho de banda de memoria crece a un ritmo más lento. Además de los problemas de ancho de banda de memoria, el consumo de energía del chip se eleva debido a la dificultad de los fabricantes para reducir suficientemente los voltajes de operación entre generaciones de procesadores. Esta tesis presenta innovaciones para mejorar el ancho de banda y consumo de energía en procesadores multinúcleo en el ámbito de la computación orientada a rendimiento ("throughput-aware computation"): una memoria caché de último nivel ("last-level cache" o LLC) optimizada para ancho de banda, un banco de registros vectorial optimizado para ancho de banda, y una heurística para planificar la ejecución de aplicaciones paralelas orientada a mejorar la eficiencia del consumo de potencia y desempeño. En contraste con los diseños de LLC de última generación, nuestra organización evita la duplicación de datos y, por tanto, no requiere de técnicas de coherencia. El espacio de direcciones de memoria se distribuye estáticamente en la LLC con un entrelazado de grano fino. La ausencia de replicación de datos aumenta la capacidad efectiva de la memoria caché, lo que se traduce en mejores tasas de acierto y mayor ancho de banda en comparación con una LLC coherente. Utilizamos la técnica de "doble buffering" para ocultar la latencia adicional necesaria para acceder a datos remotos. El banco de registros vectorial propuesto se compone de miles de registros y se organiza como una agregación de bancos. Incorporamos a cada banco una pequeña unidad de cómputo de propósito especial ("local computation element" o LCE). Este enfoque ---que llamamos "computación en banco de registros"--- permite superar el número limitado de puertos en el banco de registros. Debido a que cada LCE es una unidad de cómputo con soporte SIMD ("single instruction, multiple data") y todas ellas pueden proceder de forma concurrente, la estrategia de "computación en banco de registros" constituye un dispositivo SIMD altamente paralelo. Por último, presentamos una heurística para planificar la ejecución de aplicaciones paralelas orientada a reducir el consumo de energía del chip, colocando dinámicamente los hilos de ejecución a nivel de software entre los hilos de ejecución a nivel de hardware. La heurística obtiene, en tiempo de ejecución, información de consumo de potencia y desempeño del chip para inferir las características de las aplicaciones. Por ejemplo, si los hilos de ejecución a nivel de software comparten datos significativamente, la heurística puede decidir colocarlos en un menor número de núcleos para favorecer el intercambio de datos entre ellos. En tal caso, los núcleos no utilizados se pueden apagar para ahorrar energía. Cada vez es más difícil encontrar soluciones de arquitectura "a prueba de balas" para resolver las limitaciones de escalabilidad de los procesadores actuales. En consecuencia, creemos que los arquitectos deben atacar dichos problemas desde diferentes flancos simultáneamente, con innovaciones complementarias. 004 - Informàtica
60	Multipath: un sistema para la programación lógica Tubella, Jordi 13 November 1996 (has links) No description available. 004 - Informàtica

Search results