  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
111

Optimizing SIMD execution in HW/SW co-designed processors

Kumar, Rakesh 24 July 2014
SIMD accelerators are ubiquitous in microprocessors across computing domains. Their high compute power and hardware simplicity improve overall performance in an energy-efficient manner. Moreover, their replicated functional units and simple control mechanism make them amenable to scaling to higher vector lengths. However, code generation for these accelerators has been a challenge since their inception. Compilers generate vector code conservatively to ensure correctness. As a result, they lose significant vectorization opportunities and fail to extract the maximum benefit from SIMD accelerators. This thesis proposes to vectorize the program binary at runtime in a speculative manner, in addition to the compile-time static vectorization. Several environments provide the runtime profiling and optimization support required for dynamic vectorization, the two most prominent being: 1) Dynamic Binary Translators and Optimizers (DBTO) and 2) Hardware/Software (HW/SW) Co-designed Processors. A HW/SW co-designed environment provides several advantages over DBTOs, such as transparent incorporation of new hardware features and binary compatibility. Therefore, we use a HW/SW co-designed environment to assess the potential of speculative dynamic vectorization. Furthermore, we analyze vector code generation for wider vector units and find that, even though SIMD accelerators are amenable to scaling from the hardware point of view, vector code generation at higher vector lengths is even more challenging. The two major factors impeding vectorization for wider SIMD units are: 1) reduced dynamic instruction stream coverage for vectorization and 2) a large number of permutation instructions. To solve the first problem, we propose Variable Length Vectorization, which iteratively vectorizes for multiple vector lengths to improve dynamic instruction stream coverage.
Secondly, to reduce the number of permutation instructions, we propose Selective Writing, which selectively writes to different parts of a vector register and avoids permutations. Finally, we tackle the problem of leakage energy in SIMD accelerators. Since SIMD accelerators occupy a significant amount of real estate on the chip, they become the principal source of leakage if not utilized judiciously. Power gating is one of the most widely used techniques to reduce the leakage energy of functional units. However, power gating has its own energy and performance overhead. We propose to selectively devectorize the vector code when higher SIMD lanes are used intermittently. This selective devectorization keeps the higher SIMD lanes idle and power gated for the maximum duration, resulting in an overall reduction in leakage energy.
112

Hardware design of task superscalar architecture

Yazdanpanah, Fahimeh 26 June 2014
Exploiting concurrency to achieve greater performance is a difficult and important challenge for current high-performance systems. Although the theory is plain, the complexity of traditional parallel programming models in most cases prevents the programmer from harvesting performance. Several partitioning granularities have been proposed to better exploit concurrency at task granularity. In this sense, different dynamic software task management systems, such as task-based dataflow programming models, apply dataflow principles to improve task-level parallelism and overcome the limitations of static task management systems. These models implicitly schedule computation and data and use tasks instead of instructions as the basic work unit, thereby relieving the programmer of explicitly managing parallelism. While these programming models share conceptual similarities with the well-known out-of-order superscalar pipelines (e.g., dynamic data dependency analysis and dataflow scheduling), they rely on software-based dependency analysis, which is inherently slow and limits their scalability when tasks are fine-grained and numerous. This problem grows with the number of available cores. To keep all the cores busy and accelerate overall application performance, the application must be partitioned into more and smaller tasks. Task scheduling (i.e., creation and management of the execution of tasks) in software introduces overheads, and so becomes increasingly inefficient as the number of cores grows. In contrast, a hardware scheduling solution can achieve greater speed-ups, as a hardware task scheduler requires fewer cycles than the software version to dispatch a task. The Task Superscalar is a hybrid dataflow/von-Neumann architecture that exploits the task-level parallelism of the program.
The Task Superscalar combines the effectiveness of out-of-order processors with the task abstraction, and thereby provides a unified management layer for CMPs that effectively employs processors as functional units. The Task Superscalar has previously been implemented in software, with limited parallelism and high memory consumption due to the nature of the software implementation. In this thesis, a Hardware Task Superscalar architecture is designed to be integrated in a future high-performance computer with the ability to exploit fine-grained task parallelism. The main contributions of this thesis are: (1) a design of the operational flow of the Task Superscalar architecture, adapted and improved for hardware implementation; (2) an HDL prototype for latency exploration; (3) a full cycle-accurate simulator of the Hardware Task Superscalar (based on the previously obtained latencies); (4) a full design space exploration of the Task Superscalar component configuration (number and size) for systems with different numbers of processing elements (cores); (5) a comparison with a software implementation of a real task-based programming model runtime using real benchmarks; and (6) a hardware resource usage exploration of the selected configurations. / Exploiting concurrency to achieve better performance is an important and difficult challenge for high-performance systems. Although the theory is simple, in many cases the complexity of traditional parallel programming models prevents the programmer from obtaining good performance. Different task partitioning granularities have been proposed to better exploit the concurrency implicit in applications. In this sense, different software systems for dynamic task management use the principles of dataflow execution to improve task-level parallelism and surpass the performance of static scheduling systems.
These models schedule execution dynamically and use tasks, rather than instructions, as the basic unit of work. In this way they relieve the programmer of having to synchronize tasks explicitly in the program. Although these programming models share many similarities with the well-known out-of-order processors (such as dynamic dependency analysis and dataflow execution), they depend on a dynamic software analysis of the dependencies. This analysis is inherently slow and limits scalability when there is a large number of small tasks. The aforementioned problems grow exponentially with the number of available cores. To keep all the cores busy and accelerate the overall performance of the application, it becomes necessary to partition it into many small tasks. Managing these tasks (that is, creating them and distributing them among the cores) in software introduces overheads, and therefore becomes inefficient as the number of cores increases. In contrast, a hardware task scheduling system can achieve better performance because it requires lower latency in task management. The Task Superscalar (TSS) is a hybrid dataflow/von-Neumann architecture that exploits the task-level parallelism of programs. The TSS combines the effectiveness of out-of-order processors with the task abstraction, and thus provides a unified management layer for CMPs that manages the cores as functional units. Prior to the work of this thesis, the Task Superscalar had been implemented in software with limited parallelism and high memory consumption due to the inherent limitations of a software implementation.
In this thesis, a hardware implementation of the Task Superscalar architecture has been designed that can handle many small tasks and can be integrated in a future high-performance computer. The main contributions of this thesis are therefore: (1) the design of an operational flow of the Task Superscalar architecture, adapted and improved for hardware implementation; (2) an HDL prototype of this flow for exploring the latencies associated with the hardware implementation; (3) a cycle-accurate simulator of the hardware design based on the results obtained from the hardware implementation; (4) a complete exploration of the design space of the hardware components (number and quantity of modules, memory sizes, etc.) for different computer sizes (that is, for different numbers of cores); (5) a comparison with the current software implementation of the same programming model using real applications; and (6) an exploration of the hardware resource usage of the selected configurations.
113

Proposal and development of a highly modular and scalable self-adaptive hardware architecture with parallel processing capability

Soto Vargas, Javier E. 02 July 2014
This dissertation describes a novel, unconventional self-adaptive hardware architecture with parallel processing capability. For scalability reasons, this bioinspired architecture is based on a regular array of homogeneous cells. The proposed programmable architecture implements self-adaptive capabilities in a distributed way, including self-placement and self-routing, which, due to its intrinsic design, enable the development of systems with runtime reconfiguration, self-repair and/or fault tolerance capabilities. The physical implementation of this architecture is composed of two layers: interconnected cells in the first layer, and interconnected switch and pin matrices in the second layer. The cell is the basic element of the proposed self-adaptive architecture. Any application scheduled to the system has to be organized in components, where each component is composed of one or more interconnected cells. The interconnection of cells inside a component is made at cell level (first layer), while the physical interconnections of components are made in the second layer. Additionally, two layers are defined as a conceptual organization for the implementation of general-purpose applications: the SANE and the SANE assembly. The Self-Adaptive Networked Entity (SANE) is composed of a group of components. This is the basic self-adaptive computing system. It has the ability to monitor its local environment and its internal computation process. The SANE-Assembly (SANE-ASM) is composed of a group of interconnected SANEs. The processing capabilities of the cell are included in its Functional Unit (FU), which can be described as a four-core configurable multicomputer. The FU includes twelve programmable configuration modes, i.e., each cell allows selecting from one to four processors working in parallel, with different sizes of program and data memories. The self-adaptive capabilities of the cell are executed mainly by the Cell Configuration Unit (CCU).
The self-placement algorithm is responsible for finding the most suitable position in the cell array to insert the new cell of a component. The self-routing algorithm interconnects the ports of the FUs of two cells through the cell ports. The self-placement and self-routing processes allow complex functionality changes to be performed in real time. These processes endow the system with enhanced functionality, enabling it to change itself, which allows run-time self-configuration to be implemented without the need for any configuration manager. The proposed architecture includes two fault tolerance mechanisms. One of these is the Dynamic Fault Tolerance Scaling Technique, which has the ability to create and eliminate redundant copies of the functional section of a specific application. The other is a dedicated, or static, Fault Tolerance System, which provides redundant processing capabilities that work continuously. When a failure in the execution of a program is detected, the processors of the cell are stopped and the self-elimination and self-replication processes start for the cell (or cells) involved in the failure. An FPGA-based prototype and a software tool have been built for demonstration purposes. The prototype includes all the self-adaptive capabilities described in this dissertation. To provide a complete development system, the software tool SANE Project Developer (SPD) has been implemented. The SPD is an Integrated Development Environment (IDE) that allows generating the memory initialization data for the control microprocessor inside the prototype. / This doctoral thesis describes a novel and unconventional self-adaptive hardware architecture with parallel processing capability. For scalability reasons, this bioinspired architecture is based on a regular array of homogeneous cells.
The proposed architecture is programmable and implements, in a distributed way, several self-adaptive capabilities including self-placement and self-routing, which, due to their intrinsic design, enable the development of systems that are reconfigurable at runtime, as well as self-repairing systems and/or systems with fault tolerance capabilities. The physical implementation of this architecture is composed of two layers, comprising interconnected cells in the first level and switch and pin matrices in the second level. The cell is the basic element of the proposed architecture. Any application to be programmed on the system must be organized in components, where each component is composed of one or more interconnected cells. The interconnection of cells within a component is made at the level of the cell array, while the interconnection of components is made in the second layer. Additionally, two conceptual layers are defined that are used for organizational purposes in general-purpose applications: the SANE and the SANE-assembly (or set of SANEs). The Self-Adaptive Networked Entity, or SANE, is composed of a group of components. This is the basic self-adaptive computing system, which has the ability to monitor its local environment and its internal computation process. The processing capabilities of the cell are included in its functional unit (FU). This can be defined as a configurable multicomputer with four cores, which are grouped or not depending on the configuration mode. The FU has twelve programmable configuration modes, so each cell allows selecting between one and four processors working in parallel, with different capacities of program and data memories. The self-adaptive capabilities of the cell are executed mainly by the cell configuration unit (CCU).
The self-placement algorithm is in charge of finding the most suitable position within the cell array to insert the new cell of a component. The self-routing algorithm interconnects the ports of the FUs of two cells. The self-placement and self-routing processes allow complex functional changes to be made in real time; these processes endow the system with greater functionality, allowing the system to change by itself, which enables self-configuration in real time without the need for any configuration manager. The proposed architecture includes two fault tolerance mechanisms. One of these is a dynamic fault tolerance scaling technique, which has the ability to create and eliminate redundant copies of the functional (or computing) unit of a specific application. The other fault tolerance mechanism is the dedicated, or static, Fault Tolerance System. It provides redundant processing capabilities that are in continuous operation. When a failure in the execution of a program is detected, the processors of the cell are stopped and the self-elimination and self-replication processes start for the cell (or cells) involved in the failure. An FPGA-based prototype and a software tool were developed to verify the functionality of the system. The prototype includes all the characteristics of self-adaptive systems described in this work. The SANE Project Developer (SPD) is an integrated development environment (IDE) that allows generating and downloading the data initialization memory for the Control Microprocessor inside the prototype.
114

Exploiting distributional semantics for content-based and context-aware recommendation

Codina Busquet, Victor 13 June 2014
During the last decade, the use of recommender systems has grown to the point that, nowadays, the success of many well-known services depends on these technologies. Recommender systems help people tackle the choice overload problem by effectively presenting new content adapted to the user's preferences. However, current recommendation algorithms commonly suffer from data sparsity, which refers to the inability to produce acceptable recommendations until a minimum number of users' ratings is available for training the prediction models. This thesis investigates how the distributional semantics of concepts describing the entities of the recommendation space can be exploited to mitigate the data-sparsity problem and improve the prediction accuracy with respect to state-of-the-art recommendation techniques. The fundamental idea behind distributional semantics is that concepts repeatedly co-occurring in the same context or usage tend to be related. In this thesis, we propose and evaluate two novel semantically enhanced prediction models that address the sparsity-related limitations: (1) a content-based approach, which exploits the distributional semantics of items' attributes during item and user-profile matching, and (2) a context-aware recommendation approach that exploits the distributional semantics of contextual conditions during context modeling. We demonstrate in an exhaustive experimental evaluation that the proposed algorithms outperform state-of-the-art ones, especially when data are sparse. Finally, this thesis presents a recommendation framework, which extends the widespread machine learning library Apache Mahout, including all the proposed and evaluated recommendation algorithms as well as a tool for offline evaluation and meta-parameter optimization.
The framework has been developed to allow other researchers to reproduce the described evaluation experiments and to make new progress in the Recommender Systems field easier. / During the last decade, the use of recommender systems has increased to the point that, nowadays, the success of many of the best-known web services depends on this technology. Recommender systems help users find the products or services that best fit their interests and preferences. A major limitation of current recommendation algorithms is the data-sparsity problem, which refers to the inability of these systems to generate accurate recommendations until a certain number of user ratings is available for training the prediction models. To mitigate this problem and thereby improve the prediction accuracy of state-of-the-art recommendation techniques, in this thesis we have investigated different ways of exploiting the distributional semantics of the concepts that describe the entities making up the recommendation problem space, mainly the items to be recommended and the contextual information. Distributional semantics assumes the following hypothesis: concepts that repeatedly co-occur in the same context or usage tend to be semantically related. Specifically, in this thesis we have proposed and evaluated two recommendation algorithms that use distributional semantics to mitigate the data-sparsity problem: (1) a content-based model that exploits the distributional similarities of the attributes representing the items to be recommended when computing the match between user and item profiles; and (2) a contextual recommendation model that uses the distributional similarities between contextual conditions during the representation of the context.
Through an exhaustive experimental evaluation of the proposed recommendation models, we have demonstrated their effectiveness in situations where data are lacking, confirming that they can improve the accuracy of state-of-the-art algorithms. Finally, this thesis presents a library for the development and evaluation of recommendation algorithms as an extension of the Apache Mahout machine learning library, which is widely used in the machine learning field. Our extension includes all the recommendation algorithms evaluated in this thesis, as well as a tool to facilitate the experimental evaluation of the algorithms. We have developed this library to make it easier for other researchers to reproduce the experiments carried out and, therefore, to advance the field of recommender systems.
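The distributional hypothesis stated in this abstract (concepts that repeatedly co-occur in the same context tend to be related) can be illustrated with a small sketch. This is not the thesis's model: the function names, the toy data and the choice of cosine similarity over raw co-occurrence counts are all assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt

def cooccurrence_vectors(contexts):
    """Map each concept to a sparse vector counting its occurrences per context."""
    vectors = defaultdict(lambda: defaultdict(int))
    for context_id, concepts in contexts.items():
        for concept in concepts:
            vectors[concept][context_id] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy data: item attributes seen in the items each user rated highly.
contexts = {
    "user_a": ["thriller", "crime", "thriller"],
    "user_b": ["thriller", "crime"],
    "user_c": ["romance", "comedy"],
    "user_d": ["comedy", "romance", "crime"],
}

vectors = cooccurrence_vectors(contexts)
sim_related = cosine(vectors["thriller"], vectors["crime"])
sim_unrelated = cosine(vectors["thriller"], vectors["romance"])
```

In this toy data, "thriller" and "crime" share two contexts and therefore score higher than "thriller" and "romance", which share none. A content-based matcher could use such similarities to relate a user profile to items whose attributes never literally overlap with it, which is one plausible way to soften data sparsity.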
115

Optimization techniques for fine-grained communication in PGAS environments

Alvanos, Michail 10 December 2013
Partitioned Global Address Space (PGAS) languages promise to deliver improved programmer productivity and good performance in large-scale parallel machines. However, adequate performance for applications that rely on fine-grained communication is difficult to achieve without compromising their programmability. Manual or compiler-assisted code optimization is required to avoid fine-grained accesses. The downside of manually applying code transformations is increased program complexity and reduced programmer productivity. On the other hand, compiler optimizations of fine-grained accesses require knowledge of the physical data mapping and the use of parallel loop constructs. This thesis presents optimizations for solving the three main challenges of fine-grained communication: (i) low network communication efficiency; (ii) a large number of runtime calls; and (iii) network hotspot creation caused by the non-uniform distribution of network communication. To solve these problems, the dissertation presents three approaches. First, it presents an improved inspector-executor transformation to improve network efficiency through runtime aggregation. Second, it presents incremental optimizations to the inspector-executor loop transformation to automatically remove the runtime calls. Finally, the thesis presents a loop scheduling transformation for avoiding network hotspots and the oversubscription of nodes. In contrast to previous work that uses static coalescing, prefetching, limited privatization, and caching, the solutions presented in this thesis cover all aspects of fine-grained communication, including reducing the number of calls generated by the compiler and minimizing the overhead of the inspector-executor optimization.
A performance evaluation with various microbenchmarks and benchmarks, aiming at predicting scaling and absolute performance numbers of a Power 775 machine, indicates that applications with regular accesses can achieve up to 180% of the performance of hand-optimized versions, while in applications with irregular accesses the transformations are expected to yield from 1.12X up to 6.3X speedup. The loop scheduling shows performance gains from 3-25% for NAS FT and bucket-sort benchmarks, and up to 3.4X speedup for the microbenchmarks.
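The general inspector-executor idea mentioned in this abstract can be sketched as follows. This is a simplified simulation, not the thesis's compiler transformation: the `SharedArray` class, the `BLOCK` ownership rule and the message counter are hypothetical stand-ins for a PGAS runtime.

```python
from collections import defaultdict

BLOCK = 4  # hypothetical layout: element i is owned by node i // BLOCK

class SharedArray:
    """Toy stand-in for a distributed shared array that counts network messages."""

    def __init__(self, data):
        self.data = list(data)
        self.messages = 0

    def get(self, i):
        self.messages += 1  # one message per fine-grained access
        return self.data[i]

    def get_bulk(self, node, indices):
        # `node` only documents the message destination in this simulation.
        self.messages += 1  # one aggregated message per owner node
        return {i: self.data[i] for i in indices}

def naive_sum(shared, index_list):
    """Baseline loop: every access is a separate fine-grained message."""
    return sum(shared.get(i) for i in index_list)

def inspector(index_list):
    """Inspector: record which elements the loop will touch, grouped by owner."""
    by_owner = defaultdict(set)
    for i in index_list:
        by_owner[i // BLOCK].add(i)
    return by_owner

def executor(shared, index_list, by_owner):
    """Executor: fetch each owner's elements in one message, then run the loop locally."""
    local = {}
    for node, indices in by_owner.items():
        local.update(shared.get_bulk(node, sorted(indices)))
    return sum(local[i] for i in index_list)

index_list = [0, 1, 5, 1, 14, 15, 4]

baseline = SharedArray(range(16))
naive_result = naive_sum(baseline, index_list)

aggregated = SharedArray(range(16))
ie_result = executor(aggregated, index_list, inspector(index_list))
```

Both loops compute the same sum, but the baseline issues one message per access (seven here) while the inspector-executor version issues one per owner node touched (three here). A real runtime would also have to weigh the cost of the inspector pass itself, which the abstract names as an overhead to minimize.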
116

Rights and services interoperability for multimedia content management

Maroñas Borras, Xavier 04 December 2013
The main goal of the work presented in this thesis is to describe the definition of interoperability mechanisms between rights expression languages and policy languages. Starting from languages interoperability, the intention is to go a step further and define how services for multimedia content management can interoperate by means of service-oriented generic and standardised architectures. In order to achieve this goal, several standards and existing initiatives will be analysed and taken into account. Regarding rights expression languages and policy languages, standards like MPEG-21 Rights Expression Language (REL), Open Digital Rights Language (ODRL) and eXtensible Access Control Markup Language (XACML) are considered. Regarding services for content management, the Multimedia Information Protection And Management System (MIPAMS), a standards-based implemented architecture, and the Multimedia Service Platform Technologies (MSPT), also known as MPEG-M standard, are considered. The contribution of this thesis is divided into two parts, one devoted to languages interoperability and the other one devoted to services interoperability, both addressed to multimedia content management. They are briefly described next. The first part of the contribution describes how MPEG-21 REL, ODRL and XACML can interoperate, defining the mapping mechanisms to translate expressions from language to language. The mappings provided have different levels of granularity, starting from a mapping based on a programmatic approach coming from high-level modelling diagrams done using Unified Modelling Language (UML) and Entity-Relationship (ER). The next level of mappings includes specific mappings between MPEG-21 REL and XACML and ODRL and XACML. Finally, a more general solution is proposed by using a broker. Part of this work was done in the context of the VISNET-II Network of Excellence and the AXMEDIS Integrated Project. 
The findings prove the validity of the interoperability methods described. The second part of the contribution describes how to define standards-based building blocks to provide interoperable services for multimedia content management. This definition is based on the analysis of existing content management use cases, from those involving less security over the multimedia content managed to those providing full-featured digital rights management (DRM) (including access control and ciphering techniques) to support secure content management. This part also presents the work done in the research projects AXMEDIS, Musiteca and Culturalive, as well as the standardisation work done for MPEG-M, particularly on elementary services and service aggregation. To demonstrate the usage of both technologies, a mobile application integrating both MPEG-M and MIPAMS is presented. Furthermore, conclusions and future work are presented in the corresponding section, together with the refereed publications, which are briefly described in the document. In summary, the work presented can follow different research lines. On the one hand, further study of rights expression languages and policy languages is required, as new versions of them have recently appeared. It is worth noting the standardisation of a contract expression language, MPEG-21 CEL, which also has to be analysed further in order to evaluate its interoperability with rights and policy languages. On the other hand, standards initiatives must be followed in order to complete the map of SB3s, considering MPEG standards and also other standards related not only to multimedia but also to other application scenarios, like e-health or e-government.
117

Reliability in the face of variability in nanometer embedded memories

Ganapathy, Shrikanth 28 April 2014
In this thesis, we have investigated the impact of parametric variations on the behaviour of one performance-critical processor structure: embedded memories. As variations manifest as a spread in power and performance, as a first step we propose a novel modeling methodology that helps evaluate the impact of circuit-level optimizations on architecture-level design choices. Choices made at the design stage ensure that conflicting requirements from higher levels are decoupled. We then complement such design-time optimizations with a runtime mechanism that takes advantage of adaptive body-biasing to lower power whilst improving performance in the presence of variability. Our proposal uses novel fully digital variation-tracking hardware based on embedded DRAM (eDRAM) cells to monitor run-time changes in cache latency and leakage. A special fine-grain body-bias generator uses the measurements to generate the optimal body bias needed to meet the required yield targets. A novel variation-tolerant and soft-error-hardened eDRAM cell is also proposed as an alternative candidate for replacing existing SRAM-based designs in latency-critical memory structures. In the ultra-low-power domain, where reliable operation is limited by the minimum voltage of operation (Vddmin), we analyse the impact of failures on cache functional margin and functional yield. Towards this end, we have developed a fully automated tool (INFORMER) capable of estimating memory-wide metrics such as power, performance and yield accurately and rapidly. Using the developed tool, we then evaluate the effectiveness of a new class of hybrid techniques in improving cache yield through failure prevention and correction. Having a holistic perspective of memory-wide metrics helps us arrive at design choices optimized simultaneously for the multiple metrics needed to maintain lifetime requirements.
118

Business-driven resource allocation and management for data centres in cloud computing markets

Macias Lloret, Mario 28 May 2014
Cloud Computing markets arise as an efficient way to allocate resources for the execution of tasks and services within a set of geographically dispersed providers from different organisations. Client applications and service providers meet in a market and negotiate for the sales of services by means of the signature of a Service Level Agreement that contains the Quality of Service terms that the Cloud provider has to guarantee by managing properly its resources. Current implementations of Cloud markets suffer from a lack of information flow between the negotiating agents, which sell the resources, and the resource managers that allocate the resources to fulfil the agreed Quality of Service. This thesis establishes an intermediate layer between the market agents and the resource managers. In consequence, agents can perform accurate negotiations by considering the status of the resources in their negotiation models, and providers can manage their resources considering both the performance and the business objectives. This thesis defines a set of policies for the negotiation and enforcement of Service Level Agreements. Such policies deal with different Business-Level Objectives: maximisation of the revenue, classification of clients, trust and reputation maximisation, and risk minimisation. This thesis demonstrates the effectiveness of such policies by means of fine-grained simulations. A pricing model may be influenced by many parameters. The weight of such parameters within the final model is not always known, or it can change as the market environment evolves. This thesis models and evaluates how the providers can self-adapt to changing environments by means of genetic algorithms. Providers that rapidly adapt to changes in the environment achieve higher revenues than providers that do not. Policies are usually conceived for the short term: they model the behaviour of the system by considering the current status and the expected immediate after their application. 
This thesis defines and evaluates a trust and reputation system that forces providers to consider the long-term impact of their decisions. The trust and reputation system expels providers and clients with dishonest behaviour, and providers that consider the impact of their reputation in their actions improve the achievement of their Business-Level Objectives. Finally, this thesis studies risk as the effect of uncertainty on the expected outcomes of cloud providers. The particularities of cloud appliances as a set of interconnected resources are studied, as well as how risk propagates through the linked nodes. Incorporating risk models helps providers differentiate Service Level Agreements according to their risk, take preventive actions at the focus of the risk, and price accordingly. Applying risk management raises the fulfilment rate of the Service Level Agreements and increases the profit of the provider.
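The abstract above mentions providers that self-adapt the weights of a pricing model by means of genetic algorithms. The following is only a minimal sketch of that general idea, not the thesis's method: the `revenue` function, its `demand` and `load` signals, and all parameter values are hypothetical and chosen purely for illustration.

```python
import random


def revenue(weights, demand=0.6, load=0.8):
    """Hypothetical revenue model (an assumption, not taken from the thesis).

    The price is a weighted sum of two market signals, and clients accept
    the offer less often as the price rises.
    """
    price = weights[0] * demand + weights[1] * load
    acceptance = max(0.0, 1.0 - price)
    return price * acceptance


def evolve(generations=30, pop_size=20, seed=0):
    """Evolve pricing weights with elitist selection, crossover and mutation."""
    rng = random.Random(seed)
    population = [[rng.random(), rng.random()] for _ in range(pop_size)]
    history = []  # best revenue seen at the start of each generation
    for _ in range(generations):
        population.sort(key=revenue, reverse=True)
        history.append(revenue(population[0]))
        parents = population[: pop_size // 2]  # keep the best half unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover: each weight comes from one of the parents.
            child = [rng.choice(pair) for pair in zip(a, b)]
            # Gaussian mutation, clamped to the [0, 1] weight range.
            child = [min(1.0, max(0.0, w + rng.gauss(0, 0.05))) for w in child]
            children.append(child)
        population = parents + children
    population.sort(key=revenue, reverse=True)
    history.append(revenue(population[0]))
    return population[0], history
```

Because the best half of each generation survives unchanged, the best revenue in `history` never decreases. A provider facing a changing market would rerun or continue the evolution with updated signals.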
119

Cache designs for reliable hybrid high and ultra-low voltage operation

Maric, Bojan 16 May 2014 (has links)
Increasing demand for highly miniaturized, battery-powered, ultra-low-cost systems (e.g., below 1 USD) in emerging applications such as body, urban life and environment monitoring has introduced many challenges in chip design. Such applications occasionally require high performance, but very low energy consumption most of the time in order to extend battery lifetime. In addition, they require real-time guarantees. The most suitable technological solution for those devices consists of using hybrid processors able to operate at: (i) high voltage to provide high performance and (ii) near-/sub-threshold (NST) voltage to provide ultra-low energy consumption. However, the most efficient SRAM memories for each voltage level differ, and it is mandatory to trade off different SRAM designs, especially in cache memories, which occupy most of the processor's area. In this Thesis, we analyze the performance/power tradeoffs involved in the design of SRAM L1 caches for reliable hybrid high and NST Vcc operation from a microarchitectural perspective. We develop new, simple, single-Vcc domain hybrid cache architectures and data management mechanisms that satisfy all stringent needs of our target market. Proposed solutions are shown to have high energy efficiency with negligible impact on average performance while maintaining strong performance guarantees as required for our target market.
120

Scalable system software for high performance large-scale applications

Morari, Alessandro 27 May 2014 (has links)
In the last decades, high-performance large-scale systems have been a fundamental tool for scientific discovery and engineering advances. The sustained growth of supercomputing performance and the concurrent reduction in cost have made this technology available to a large number of scientists and engineers working on many different problems. The design of next-generation supercomputers will include traditional HPC requirements as well as new requirements to handle data-intensive computations. Data-intensive applications will hence play an important role in a variety of fields, and are the current focus of several research trends in HPC. Due to the challenges of scalability and power efficiency, the next generation of supercomputers needs a redesign of the whole software stack. Being at the bottom of the software stack, system software is expected to change drastically to support the upcoming hardware and to meet new application requirements. This PhD thesis addresses the scalability of system software. The thesis starts at the Operating System level: first studying general-purpose OSs (e.g., Linux) and then studying lightweight kernels (e.g., CNK). Then, we focus on the runtime system: we implement a runtime system for distributed memory systems that includes many of the system services required by next-generation applications. Finally, we focus on hardware features that can be exploited at user level to improve application performance, and that could potentially be included in our advanced runtime system. The thesis contributions are the following: Operating System Scalability: We provide an accurate study of the scalability problems of modern Operating Systems for HPC. We design and implement a methodology whereby detailed quantitative information may be obtained for each OS noise event. We validate our approach by comparing it to other well-known standard techniques to analyze OS noise, such as FTQ (Fixed Time Quantum).
Evaluation of the address translation management for a lightweight kernel: we provide a performance evaluation of different TLB management approaches: dynamic memory mapping, static memory mapping with replaceable TLB entries, and static memory mapping with fixed TLB entries (no TLB misses) on an IBM BlueGene/P system. Runtime System Scalability: We show that a runtime system can efficiently incorporate system services and improve scalability for a specific class of applications. We design and implement a full-featured runtime system and programming model to execute irregular applications on a commodity cluster. The runtime library is called Global Memory and Threading library (GMT) and integrates a locality-aware Partitioned Global Address Space communication model with a fork/join program structure. It supports massive lightweight multi-threading, overlapping of communication and computation, and aggregation of small messages to tolerate network latencies. We compare GMT to other PGAS models, hand-optimized MPI code and custom architectures (Cray XMT) on a set of large-scale irregular applications: breadth-first search, random walk and concurrent hash map access. Our runtime system shows performance orders of magnitude higher than other solutions on commodity clusters and competitive with custom architectures. User-level Scalability Exploiting Hardware Features: We show the high complexity of low-level hardware optimizations for single applications, as a motivation to incorporate this logic into an adaptive runtime system. We evaluate the effects of a controllable hardware-thread priority mechanism that controls the rate at which each hardware thread decodes instructions on IBM POWER5 and POWER6 processors. Finally, we show how to effectively exploit cache locality and the network-on-chip on the Tilera many-core architecture to improve intra-core scalability.
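The abstract above names FTQ (Fixed Time Quantum) as a standard technique for analyzing OS noise. As a rough illustration of the idea only, and not the thesis's methodology or the reference FTQ benchmark, the sketch below counts how many trivial work units complete in each fixed time quantum; quanta with fewer completed units suggest that something else used the CPU. The function names and default parameters are assumptions made for this example.

```python
import time


def ftq_samples(num_quanta=50, quantum_s=0.001):
    """Count the unit work items completed in each fixed time quantum."""
    counts = []
    next_deadline = time.perf_counter() + quantum_s
    for _ in range(num_quanta):
        done = 0
        while time.perf_counter() < next_deadline:
            done += 1  # one unit of "work" (here just a loop iteration)
        counts.append(done)
        next_deadline += quantum_s  # quanta are fixed, back to back
    return counts


def noise_estimate(counts):
    """Fraction of work lost per quantum, relative to the best quantum seen."""
    best = max(counts)
    if best == 0:
        return [0.0 for _ in counts]
    return [1.0 - c / best for c in counts]
```

Calling `noise_estimate(ftq_samples())` yields one value per quantum between 0 and 1. In an interpreted language the measurement itself is noisy, so this only conveys the shape of the technique.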
