• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 147
  • 30
  • 21
  • 15
  • 7
  • 6
  • 3
  • 2
  • 1
  • 1
  • 1
  • 1
  • Tagged with
  • 268
  • 76
  • 50
  • 50
  • 49
  • 38
  • 35
  • 35
  • 33
  • 32
  • 32
  • 30
  • 30
  • 30
  • 28
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
91

Placement, ordonnancement et mécanismes de migration de tâches temps-réel pour des architectures distribuées multicoeurs / Real-time tasks assignment, scheduling and migration mechanisms for multicore distributed architectures

Mégel, Thomas 03 April 2012 (has links)
Les systèmes temps-réel embarqués critiques intègrent un nombre croissant de fonctionnalités comme le montrent les domaines de l'automobile ou de l'aéronautique. Ces systèmes doivent offrir un niveau maximal de sûreté de fonctionnement en disposant des mécanismes pour traiter les défaillances éventuelles et doivent être également performants, avec le respect de contraintes temps-réel strictes. Ces systèmes sont en outre contraints par leur nature embarquée : les ressources sont limitées, tels que par exemple leur espace mémoire et leur capacité de calcul. Dans cette thèse, nous traitons deux problématiques principales de ce type de systèmes. La première porte sur la manière d'apporter une meilleure tolérance aux fautes dans les systèmes temps-réel distribués subissant des défaillances matérielles multiples et permanentes. Ces systèmes sont souvent conçus avec une allocation statique des tâches. Une approche plus flexible effectuant des reconfigurations est utile si elle permet d'optimiser l'allocation à chaque défaillance rencontrée, pour les ressources restantes. Nous proposons une telle approche hors-ligne assurant un dimensionnement adapté pour prendre en compte les ressources nécessaires à l'exécution de ces actions. Ces reconfigurations peuvent demander une réallocation des tâches ou répliques si l'espace mémoire local est limité. Dans un contexte temps-réel strict, nous définissons notamment des mécanismes et des techniques de migration garantissant l'ordonnançabilité globale du système. La deuxième problématique se focalise sur l'optimisation de l'exécution des tâches au niveau local dans un contexte multicoeurs préemptif. Nous proposons une méthode d'ordonnancement optimal disposant d'une meilleure extensibilité que les approches existantes en minimisant les surcoûts : le nombre de changements de contexte préemptions et migrations locales) et la complexité de l'ordonnanceur / Critical real-time embedded systems are integrating an increasing number of functionalities, as shown in automotive domain or aeronautics. These systems require high dependability including mechanisms to handle possible failures and have to be effective, meeting hard real-time constraints. These systems are also constrained by their embedded nature : resources are limited, such as their memory and their computing capacities. In this thesis, we focus on two main problems for this type of systems. The first one is about a way to bring a better fault-tolerance in distributed real-time systems when multiple and permanent hardware failures can occur. In classical systems, the design is limited to a static task assignment. A more flexible approach exploiting reconfigurations is useful if it allows to optimize assignment at each failure for the remaining resources. We propose an off-line approach to obtain an adapted sizing taking into account necessary resources to execute these actions. These reconfigurations may require to reallocate tasks or replicas if memory capacities are limited. In a hard real-time context, we define mechanisms and migration techniques to guarantee global schedulability of the system. The second problem focus on optimizing performance to run tasks at a local level in a multicore preemptive context. We propose an optimal scheduling method allowing a better scalability than existing approaches by minimizing overheads : the number of context switches (local preemptions and migrations) and the scheduler complexity
92

Improving performance on NUMA systems / Amélioration de performance sur les architectures NUMA

Lepers, Baptiste 24 January 2014 (has links)
Les machines multicœurs actuelles utilisent une architecture à Accès Mémoire Non-Uniforme (Non-Uniform Memory Access - NUMA). Dans ces machines, les cœurs sont regroupés en nœuds. Chaque nœud possède son propre contrôleur mémoire et est relié aux autres nœuds via des liens d'interconnexion. Utiliser ces architectures à leur pleine capacité est difficile : il faut notamment veiller à éviter les accès distants (i.e., les accès d'un nœud vers un autre nœud) et la congestion sur les bus mémoire et les liens d'interconnexion. L'optimisation de performance sur une machine NUMA peut se faire de deux manières : en implantant des optimisations ad-hoc au sein des applications ou de manière automatique en utilisant des heuristiques. Cependant, les outils existants fournissent trop peu d'informations pour pouvoir implanter efficacement des optimisations et les heuristiques existantes ne permettent pas d'éviter les problèmes de congestion. Cette thèse résout ces deux problèmes. Dans un premier temps nous présentons MemProf, le premier outil d'analyse permettant d'implanter efficacement des optimisations NUMA au sein d'applications. Pour ce faire, MemProf construit des flots d'interactions entre threads et objets. Nous évaluons MemProf sur 3 machines NUMA et montrons que les optimisations trouvées grâce à MemProf permettent d'obtenir des gains de performance significatifs (jusqu'à 2.6x) et sont très simples à implanter (moins de 10 lignes de code). Dans un second temps, nous présentons Carrefour, un algorithme de gestion de la mémoire pour machines NUMA. Contrairement aux heuristiques existantes, Carrefour se concentre sur la réduction de la congestion sur les machines NUMA. Carrefour permet d'obtenir des gains de performance significatifs (jusqu'à 3.3x) et est toujours plus performant que les heuristiques existantes. / Modern multicore systems are based on a Non-Uniform Memory Access (NUMA) design. In a NUMA system, cores are grouped in a set of nodes. Each node has a memory controller and is interconnected with other nodes using high speed interconnect links. Efficiently exploiting such architectures is notoriously complex for programmers. Two key objectives on NUMA multicore machines are to limit as much as possible the number of remote memory accesses (i.e., accesses from a node to another node) and to avoid contention on memory controllers and interconnect links. These objectives can be achieved by implementing application-level optimizations or by implementing application-agnostic heuristics. However, in many cases, existing profilers do not provide enough information to help programmers implement application-level optimizations and existing application-agnostic heuristics fail to address contention issues. The contributions of this thesis are twofold. First we present MemProf, a profiler that allows programmers to choose and implement efficient application-level optimizations for NUMA systems. MemProf builds temporal flows of interactions between threads and objects, which help programmers understand why and which memory objects are accessed remotely. We evaluate MemProf on Linux on three different machines. We show how MemProf helps us choose and implement efficient optimizations, unlike existing profilers. These optimizations provide significant performance gains (up to 2.6x), while requiring very lightweight modifications (10 lines of code or less). Then we present Carrefour, an application-agnostic memory management algorithm. Contrarily to existing heuristics, Carrefour focuses on traffic contention on memory controllers and interconnect links. Carrefour provides significant performance gains (up to 3.3x) and always performs better than existing heuristics.
93

Co-projeto de hardware e software de um escalonador de processos para arquiteturas multicore heterogêneas baseadas em computação reconfigurável / Hardware and software co-design of a process scheduler for heterogeneous multicore architectures based on reconfigurable computing

Bueno, Maikon Adiles Fernandez 05 November 2013 (has links)
As arquiteturas multiprocessadas heterogêneas têm como objetivo principal a extração de maior desempenho da execução dos processos, por meio da utilização de núcleos apropriados às suas demandas. No entanto, a extração de maior desempenho é dependente de um mecanismo eficiente de escalonamento, capaz de identificar as demandas dos processos em tempo real e, a partir delas, designar o processador mais adequado, de acordo com seus recursos. Este trabalho tem como objetivo propor e implementar o modelo de um escalonador para arquiteturas multiprocessadas heterogêneas, baseado em software e hardware, aplicado ao sistema operacional Linux e ao processador SPARC Leon3, como prova de conceito. Nesse sentido, foram implementados monitores de desempenho dentro dos processadores, os quais identificam as demandas dos processos em tempo real. Para cada processo, sua demanda é projetada para os demais processadores da arquitetura e em seguida é realizado um balanceamento visando maximizar o desempenho total do sistema, distribuindo os processos entre processadores, de modo a diminuir o tempo total de processamento de todos os processos. O algoritmo de maximização Hungarian, utilizado no balanceamento do escalonador, foi desenvolvido em hardware, proporcionando paralelismo e maior desempenho na execução do algoritmo. O escalonador foi validado por meio da execução paralela de diversos benchmarks, resultando na diminuição dos tempos de execução em relação ao escalonador sem suporte à heterogeneidade / Heterogeneous multiprocessor architectures have as main objective the extraction of higher performance from processes through the use of appropriate cores to their demands. However, the extraction of higher performance is dependent on an efficient scheduling mechanism, able to identify in real-time the demands of processes and to designate the most appropriate processor according to their resources. This work aims at design and implementations of a model of a scheduler for heterogeneous multiprocessor architectures based on software and hardware, applied to the Linux operating system and the SPARC Leon3 processor as proof of concept. In this sense, performance monitors have been implemented within the processors, which in real-time identifies the demands of processes. For each process, its demand is projected for the other processors in the architecture and then it is performed a balancing to maximize the total system performance by distributing processes among processors. The Hungarian maximization algorithm, used in balancing scheduler was developed in hardware, providing greater parallelism and performance in the execution of the algorithm. The scheduler has been validated through the parallel execution of several benchmarks, resulting in decreased execution times compared to the scheduler without the heterogeneity support
94

Performance optimization of geophysics stencils on HPC architectures / Optimização de desempenho de estênceis geofísicos sobre arquiteturas HPC

Abaunza, Víctor Eduardo Martínez January 2018 (has links)
A simulação de propagação de onda é uma ferramenta crucial na pesquisa de geofísica (para análise eficiente dos terremotos, mitigação de riscos e a exploração de petróleo e gáz). Devido à sua simplicidade e sua eficiência numérica, o método de diferenças finitas é uma das técnicas implementadas para resolver as equações da propagação das ondas. Estas aplicações são conhecidas como estênceis porque consistem num padrão que replica a mesma computação num domínio multidimensional de dados. A Computação de Alto Desempenho é requerida para solucionar este tipo de problemas, como consequência do grande número de pontos envolvidos nas simulações tridimensionais do subsolo. A optimização do desempenho dos estênceis é um desafio e depende do arquitetura usada. Neste contexto, focamos nosso trabalho em duas partes. Primeiro, desenvolvemos nossa pesquisa nas arquiteturas multicore; analisamos a implementação padrão em OpenMP dos modelos numéricos da transferência de calor (um estêncil Jacobi de 7 pontos), e o aplicativo Ondes3D (um simulador sísmico desenvolvido pela Bureau de Recherches Géologiques et Minières); usamos dois algoritmos conhecidos (nativo, e bloqueio espacial) para encontrar correlações entre os parâmetros da configuração de entrada, na execução, e o desempenho computacional; depois, propusemos um modelo baseado no Aprendizado de Máquina para avaliar, predizer e melhorar o desempenho dos modelos estênceis na arquitetura usada; também usamos um modelo de propagação da onda acústica fornecido pela empresa Petrobras; e predizemos o desempenho com uma alta precisão (até 99%) nas arquiteturas multicore. Segundo, orientamos nossa pesquisa nas arquiteturas heterogêneas, analisamos uma implementação padrão do modelo de propagação de ondas em CUDA, para encontrar os fatores que afetam o desempenho quando o número de aceleradores é aumentado; então, propusemos uma implementação baseada em tarefas para amelhorar o desempenho, de acordo com um conjunto de configuração no tempo de execução (algoritmo de escalonamento, tamanho e número de tarefas), e comparamos o desempenho obtido com as versões de só CPU ou só GPU e o impacto no desempenho das arquiteturas heterogêneas; nossos resultados demostram um speedup significativo (até 25) em comparação com a melhor implementação disponível para arquiteturas multicore. / Wave modeling is a crucial tool in geophysics, for efficient strong motion analysis, risk mitigation and oil & gas exploration. Due to its simplicity and numerical efficiency, the finite-difference method is one of the standard techniques implemented to solve the wave propagation equations. This kind of applications is known as stencils because they consist in a pattern that replicates the same computation on a multi-dimensional domain. High Performance Computing is required to solve this class of problems, as a consequence of a large number of grid points involved in three-dimensional simulations of the underground. The performance optimization of stencil computations is a challenge and strongly depends on the underlying architecture. In this context, this work was directed toward a twofold aim. Firstly, we have led our research on multicore architectures and we have analyzed the standard OpenMP implementation of numerical kernels from the 3D heat transfer model (a 7-point Jacobi stencil) and the Ondes3D code (a full-fledged application developed by the French Geological Survey). We have considered two well-known implementations (naïve, and space blocking) to find correlations between parameters from the input configuration at runtime and the computing performance; thus, we have proposed a Machine Learning-based approach to evaluate, to predict, and to improve the performance of these stencil models on the underlying architecture. We have also used an acoustic wave propagation model provided by the Petrobras company and we have predicted the performance with high accuracy on multicore architectures. Secondly, we have oriented our research on heterogeneous architectures, we have analyzed the standard implementation for seismic wave propagation model in CUDA, to find which factors affect the performance; then, we have proposed a task-based implementation to improve the performance, according to the runtime configuration set (scheduling algorithm, size, and number of tasks), and we have compared the performance obtained with the classical CPU or GPU only versions with the results obtained on heterogeneous architectures.
95

FPGA-based programmable embedded platform for image processing applications

Siddiqui, Fahad Manzoor January 2018 (has links)
A vast majority of electronic systems including medical, surveillance and critical infrastructure employs image processing to provide intelligent analysis. They use onboard pre-processing to reduce data bandwidth and memory requirements before sending information to the central system. Field Programmable Gate Arrays (FPGAs) represent a strong platform as they permit reconfigurability and pipelining for streaming applications. However, rapid advances and changes in these application use cases crave adaptable hardware architectures that can process dynamic data workloads and be easily programmed to achieve ecient solutions in terms of area, time and power. FPGA-based development needs iterative design cycles, hardware synthesis and place-and-route times which are alien to the software developers. This work proposes an FPGA-based programmable hardware acceleration approach to reduce design effort and time. This allows developers to use FPGAs to profile, optimise and quickly prototype algorithms using a more familiar software-centric, edit-compile-run design flow that enables the programming of the platform by software rather than high-level synthesis (HLS) engineering principles. Central to the work has been the development of an optimised FPGA-based processor called Image Processing Processor (IPPro) which efficiently uses the underlying resources and presents a programmable environment to the programmer using a dataflow design principle. This gives superior performance when compared to competing alternatives. From this, a three-layered platform has been created which enables the realisation of parallel computing skeletons on FPGA which are used to eciently express designs in high-level programming languages. From bottom-up, these layers represent programming (actor, multiple actors and parallel skeletons) and hardware (IPPro core, multicore IPPro, system infrastructure) abstraction. The platform allows acceleration of parallel and non-parallel dataflow applications. A set of point and area image pre-processing functions are implemented on Avnet Zedboard platform which allows the evaluation of the performance. The point function achieved 2.53 times better performance than the area functions and point and area functions achieved performance improvements of 7.80 and 5.27 times over sin- gle core IPPro by exploiting data parallelism. The pipelined execution of multiple stages revealed that a dataflow graph can be decomposed into balanced actors to deliver maximum performance by hiding data transfer and processing time through exploiting task parallelism; otherwise, the maximum achievable performance is limited by the slowest actor due to the ripple effect caused by unbalanced actors. The platform delivered better performance in terms of fps/Watt/Area than Embedded Graphic Processing Unit (GPU) considering both technologies allows a software-centric design flow.
96

Enhancing the performance of decoupled software pipeline through backward slicing

Alwan, Esraa January 2014 (has links)
The rapidly increasing number of cores available in multicore processors does not necessarily lead directly to a commensurate increase in performance: programs written in conventional languages, such as C, need careful restructuring, preferably automatically, before the benefits can be observed in improved run-times. Even then, much depends upon the intrinsic capacity of the original program for concurrent execution. Using software techniques to parallelize the sequential application can raise the level of gain from multicore systems. Parallel programming is not an easy job for the user, who has to deal with many issues such as dependencies, synchronization, load balancing, and race conditions. For this reason the role of automatically parallelizing compilers and techniques for the extraction of several threads from single-threaded programs, without programmer intervention, is becoming more important and may help to deliver better utilization of modern hardware. One parallelizing technique that has been shown to be an effective for the parallelization of applications that have irregular control flow and complex memory access patterns is Decoupled Software Pipeline (DSWP). This transformation partitions the loop body into a set of stages, ensuring that critical path dependencies are kept local to a stage. Each stage becomes a thread and data is passed between threads using inter-core communication. The success of DSWP depends on being able to extract the relatively fine-grain parallelism that is present in many applications. Another technique which offers potential gains in parallelizing general purpose applications is slicing. Program slicing transforms large programs into several smaller ones that execute independently, each consisting of only statements relevant to the computation of certain, socalled, (program) points. This dissertation explores the possibility of performance benefits arising from a secondary transformation of DSWP stages by slicing. To that end a new combination method called DSWP/Slice is presented. Our observation is that individual DSWP stages can be parallelized by slicing, leading to an improvement in performance of the longest duration DSWP stages. In particular, this approach can be applicable in cases where DOALL is not. In consequence better load balancing can be achieved between the DSWP stages. Moreover, we introduce an automatic implementation of the combination method using Low Level Virtual Machine (LLVM) compiler framework. This combination is particularly effective when the whole long stage comprises a function body. More than one slice extracted from a function body can speed up its execution time and also increases the scalability of DSWP. An evaluation of this technique on six programs with a range of dependence patterns leads to considerable performance gains on a core-i7 870 machine with 4-cores/8-threads. The results are obtained from an automatic implementation that shows the proposed method can give a factor of up to 1.8 speed up compared with the original sequential code.
97

A reconfigurable heterogeneous multicore system with homogeneous ISA / Um sistema multinucleo, heterogeneo e reconfiguravel de ISA homogênea

Souza, Jeckson Dellagostin January 2016 (has links)
Dada a grande diversidade de aplicações embarcadas presentes nos atuais dispositivos portáveis, ambos os paralelismos em nível de threads e de instruções devem ser explorados para obter ganhos de desempenho e energia. Enquanto MPSoCs (sistemas em chip de múltiplos núcleos) são amplamente usados para esse propósito, estes falham quando consideramos produtividade de software, já que eles são compostos de chips com diferentes arquiteturas que precisam ser programados separadamente. Por outro lado, processadores multi núcleos de propósito geral implementam a mesma arquitetura, mas são compostos de núcleos homogêneos de processadores superescalares que consomem muita potência. Nesta dissertação, propõe-se um novo sistema, que tira proveito de circuitos reconfiguráveis para criar diferentes organizações que implementam a mesma arquitetura, capazes de apresentar alto desempenho com baixo custo energético. Para garantir a compatibilidade binária, usa-se um mecanismo de tradução binária que transforma o código a ser executado no circuito reconfigurável durante a execução. Usando aplicações representativas, mostra-se que uma versão do sistema heterogêneo pode ganhar da sua versão homogênea em média de 59% em desempenho e 10% em energia, com melhoras em EDP (Energy-Delay Product – Produto da energia pelo tempo de execução) em quase todos os cenários. Além disso, este trabalho também propõe e avalia seis escalonadores para este sistema heterogêneo: dois algoritmos estáticos, os quais alocam as threads no primeiro núcleo livre, onde elas permanecerão durante toda a execução; um escalonador direcionado por contagem de instruções, o qual realoca as threads durante pontos de sincronização de acordo com a sua contagem de instruções; um escalonador de Feedback, que usa dados de dentro da unidade reconfigurável para realocar threads; o PC-Feedback, que adiciona um mecanismo de reuso de dados ao último escalonador; e um escalonador Oráculo, que é capaz de decidir a melhor alocação de threads possível. Mostra-se que o algoritmo estático pode ter alto desempenho em aplicações com alto paralelismo, contudo para um desempenho mais uniforme em todas as aplicações os algoritmos de Feedback e PC-Feedback são mais indicados. / Given the large diversity of embedded applications one can find in current portable devices, for energy and performance reasons one must exploit both Thread- and Instruction Level Parallelism. While MPSoCs (Multiprocessor system-on-chip) are largely used for this purpose, they fail when one considers software productivity, since it comprises different ISAs (Instruction Set Architecture) that must be programmed separately. On the other hand, general purpose multicores implement the same ISA, but are composed of a homogeneous set of very power consuming superscalar processors. In this dissertation, we show how one can effectively use a reconfigurable unit to provide a number of different possible heterogeneous configurations while still sustaining the same ISA, capable of reaching high performance with low energy cost. To ensure ISA compatibility, we use a binary translation mechanism that transforms code to be executed on the fabric at run-time. Using representative benchmarks, we show that one version of the heterogeneous system can outperform its homogenous counterpart in average by 59% in performance and 10% in energy, with EDP (Energy-Delay Product) improvements in almost every scenario. Furthermore, this work also proposes and evaluates six schedulers for the heterogeneous system: two static algorithms, which allocate the threads on the first free core, where they will run during the entire execution; an Instruction Count (IC) Driven scheduler, which reallocates threads during synchronization points accordingly to their instruction count; a Feedback scheduler, which uses data from inside the reconfigurable unit to reallocate threads; the PCFeedback scheduler, that adds a reuse mechanism to the last one; and an Oracle scheduler, which is capable of deciding the best thread allocation possible. We show that the static algorithm can reach high performance in applications with high parallelism, however for uniform performance in all applications, the Feedback and PC-Feedback algorithms are better designated.
98

Ordonnancement de processus légers sur architectures multiprocesseurs hiérarchiques : BubbleSched, une approche exploitant la structure du parallélisme des applications

Thibault, Samuel 06 December 2007 (has links) (PDF)
La tendance des constructeurs pour le calcul scientifique est à l'imbrication de technologies permettant un degré de parallélisme toujours plus fort au sein d'une même machine : architecture NUMA, puces multicœurs, SMT. L'efficacité de l'exécution d'une application parallèle irrégulière sur de telles machines hiérarchiques repose alors sur la qualité de l'ordonnancement des tâches et du placement des données, pour éviter le plus possible les pénalités NUMA et les défauts de cache. Les systèmes d'exploitation actuels, pris au dépourvu car trop généralistes, laissent les concepteurs d'application contraints à « câbler » leurs programmes pour une machine donnée.<br /><br />Dans cette thèse, pour garantir une certaine portabilité des performances, nous définissons la notion de /bulle/ permettant d'exprimer la nature structurée du parallélisme du calcul, et nous modélisons l'architecture de la machine cible par une hiérarchie de listes de tâches. Une interface de programmation et des outils de débogage de haut niveau permettent alors de développer simplement des ordonnanceurs dédiés, efficaces et portables. Différents ordonnanceurs mettant en œuvre des approches variées ont été développés, en partie notamment par des stagiaires encadrés au sein de l'équipe, ce qui montre à la fois la puissance et la simplicité de l'interface. C'est ainsi une véritable plate-forme de développement et d'expérimentation d'ordonnanceurs à bulles qui a été intégrée au sein de la bibliothèque de threads utilisateur marcel. Le support OpenMP du compilateur GCC, GOMP, a été étendu pour utiliser cette bibliothèque et exprimer la nature structurée des sections parallèles imbriquées à l'aide de bulles. Avec la couche de compatibilité POSIX de marcel, ces supports ont permis de tester les différents ordonnanceurs à bulles développés, sur différentes applications. Les gains obtenus, de l'ordre de 20 à 40%, montrent l'intérêt de notre approche.
99

Effectiveness of Tracing in a Multicore Environment

Sivakumar, Narendran, Sundar Rajan, Sriram January 2010 (has links)
<p>Debugging in real time is imperative for telecommunication networks with their ever increasing size and complexity. In event of an error or an unexpected occurrence of event, debugging the complex systems that controls these networks becomes an insurmountable task. With the help of tracing, it is possible to capture the snapshot of a system at any given point of time. Tracing, in essence, captures the state of the system along with the programs currently running on the system. LTTng is one such tool developed to perform tracing in both kernel space and user space of an application. In this thesis, we evaluate the effectiveness of LTTng and its impact on the performance on the applications traced by it. As part of this thesis we have formulated a comprehensive load matrix to simulate varying load demands in a telecommunication network. We have also devised a detailed experimental methodology which encompasses a collection of test suites used to determine the efficiency of various LTTng trace primitives. We were also able to prove that, in our experiments, LTTng’s kernel tracing is more efficient than User Space Tracing and LTTng’s User Space Tracing has a performance impact of around three to five percent.</p>
100

Towards guidelines for development of energy conscious software / Mot riktlinjer för utveckling av enegisnål mjukvara

Carlstedt-Duke, Edward, Elfström, Erik January 2009 (has links)
<p>In recent years, the drive for ever increasing energy efficiency has intensified. The main driving forces behind this development are the increased innovation and adoption of mobile battery powered devices, increasing energy costs, environmental concerns, and strive for denser systems.</p><p>This work is meant to serve as a foundation for exploration of energy conscious software. We present an overview of previous work and a background to energy concerns from a software perspective. In addition, we describe and test a few methods for decreasing energy consumption with emphasis on using software parallelism. The experiments are conducted using both a simulation environment and real hardware. Finally, a method for measuring energy consumption on a hardware platform is described.</p><p>We conclude that energy conscious software is very dependent on what hardware energy saving features, such as frequency scaling and power management, are available. If the software has a lot of unnecessary, or overcomplicated, work, the energy consumption can be lowered to some extent by optimizing the software and reducing the overhead. If the hardware provides software-controllable energy features, the energy consumption can be lowered dramatically.</p><p>For suitable workloads, using parallelism and multi-core technologies seem very promising for producing low power software. Realizing this potential requires a very flexible hardware platform. Most important is to have fine grained control over power management, and voltage and frequency scaling, preferably on a per core basis.</p>

Page generated in 0.0558 seconds