  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
71

Belief Propagation and Algorithms for Mean-Field Combinatorial Optimisations

Khandwawala, Mustafa January 2014 (has links) (PDF)
We study combinatorial optimization problems on graphs in the mean-field model, which assigns independent and identically distributed random weights to the edges of the graph. Specifically, we focus on two generalizations of minimum weight matching on graphs. The first problem, minimum cost edge cover, finds application in a computational linguistics problem of semantic projection. The second problem, minimum cost many-to-one matching, appears as an intermediate optimization step in the restriction scaffold problem applied to shotgun sequencing of DNA. For the minimum cost edge cover on a complete graph on n vertices, where the edge weights are independent exponentially distributed random variables, we show that the expectation of the minimum cost converges to a constant as n → ∞. For the minimum cost many-to-one matching on an n × m complete bipartite graph, scaling m as [n/α] for some fixed α > 1, we find the limit of the expected minimum cost as a function of α. For both problems, we show that a belief propagation algorithm converges asymptotically to the optimal solution. The belief propagation algorithm yields a near-optimal solution with lower complexity than the best known algorithms designed for optimality in worst-case settings. Our proofs use the machinery of the objective method and local weak convergence, ideas developed by Aldous for proving the ζ(2) limit for minimum cost bipartite matching. We use belief propagation as a constructive proof technique to supplement the objective method. Recursive distributional equations (RDEs) arise naturally in the objective method approach. In a class of RDEs that arise as extensions of the minimum weight matching and travelling salesman problems, we prove existence and uniqueness of a fixed point distribution, and characterize its domain of attraction.
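As an illustration of the algorithmic idea, a min-sum belief propagation sketch for minimum-cost perfect matching on a complete bipartite graph (a generic textbook-style variant in the spirit of Bayati, Shah and Sharma, not the edge-cover or many-to-one algorithms analysed in the thesis) can be written as:

```python
import itertools
import random

def bp_min_matching(w, iters=2000):
    """Min-sum belief propagation for minimum-cost perfect matching on a
    complete bipartite graph with n x n weight matrix w.  The estimated
    matching converges to the optimum when the optimum is unique (almost
    surely the case for continuous random weights)."""
    n = len(w)
    A = [[0.0] * n for _ in range(n)]  # A[i][j]: message from left i to right j
    B = [[0.0] * n for _ in range(n)]  # B[j][i]: message from right j to left i
    for _ in range(iters):
        newA = [[w[i][j] - min(B[k][i] for k in range(n) if k != j)
                 for j in range(n)] for i in range(n)]
        newB = [[w[i][j] - min(A[k][j] for k in range(n) if k != i)
                 for i in range(n)] for j in range(n)]
        A, B = newA, newB
    # each left vertex matches the right vertex with the smallest incoming message
    return [min(range(n), key=lambda j: B[j][i]) for i in range(n)]

def brute_force(w):
    """Exact minimum-cost perfect matching by enumerating all permutations."""
    n = len(w)
    return list(min(itertools.permutations(range(n)),
                    key=lambda p: sum(w[i][p[i]] for i in range(n))))

w = [[1, 9, 9], [9, 1, 9], [9, 9, 1]]
print(bp_min_matching(w))  # [0, 1, 2]: the cheap diagonal matching
```

On small random instances the BP estimate agrees with brute-force enumeration; the point of the mean-field analysis is that this local message-passing scheme remains near-optimal as n grows.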
72

Designing Energy-Aware Optimization Techniques through Program Behaviour Analysis

Kommaraju, Ananda Varadhan January 2014 (has links) (PDF)
Green computing techniques aim to reduce the power footprint of modern embedded devices, with particular emphasis on processors, the power hot-spots of these devices. In this thesis we propose compiler-driven and profile-driven optimizations that reduce power consumption in a modern embedded processor. We show that these optimizations reduce power consumption in functional units and memory subsystems with very low performance loss. We present three new techniques to reduce power consumption in processors, namely transition-aware scheduling, leakage reduction in data caches using criticality analysis, and dynamic power reduction in data caches using locality analysis of data regions. A novel instruction scheduling technique to address leakage power consumption in functional units is proposed. This scheduling technique, transition-aware scheduling, is motivated by the idle periods that arise in the utilization of functional units during program execution. A sufficiently long idle period in a functional unit can be exploited to place the unit in a low power state. This scheduling algorithm increases the duration of idle periods without hampering performance, and drives power gating in these periods. A power model defined with idle cycles as a parameter shows that this technique saves up to 25% of leakage power with very low performance impact. In modern embedded programs, data regions can be classified as critical and non-critical. Critical data regions significantly impact performance. A new technique to identify such data regions through profiling is proposed. This technique, along with a new criticality-based cache policy, is used to control the power state of the data cache. The scheme allocates non-critical data regions to low-power cache regions, thereby reducing leakage power consumption by up to 40% without compromising performance. This profiling technique is extended to identify data regions that have low locality.
Some data regions, by contrast, have high data reuse. A locality-based cache policy, driven by cache parameters such as size and associativity, is proposed. This scheme reduces dynamic as well as static power consumption in the cache subsystem, cutting the total power consumption in the data caches by 25% without hampering execution time. In this thesis, the problem of power consumption of a program is decoupled from the number of processor cores. The underlying architecture model is simplified to abstract away a variety of processor scenarios. This simplified model can be scaled up for implementation in various multi-core architectures such as Chip Multi-Processors, Simultaneous Multi-Threaded processors and Chip Multi-Threaded processors, to name a few. The three techniques proposed in this thesis leverage underlying hardware features such as low power functional units, drowsy caches and split data caches. They reduce the power consumption of a wide range of benchmarks with low performance loss.
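The idle-period model behind transition-aware scheduling can be illustrated with a toy calculation (the thresholds and numbers below are purely hypothetical, not the thesis's calibrated power model):

```python
def leakage_saved(idle_periods, breakeven=10, wake_overhead=2):
    """Estimate the fraction of idle-time leakage energy saved by power-gating
    a functional unit during idle periods longer than a break-even threshold.

    idle_periods  : lengths of the unit's idle periods, in cycles
    breakeven     : minimum idle length (cycles) for gating to pay off
    wake_overhead : leakage-equivalent energy cost (cycles) of waking up
    All numbers here are illustrative, not taken from the thesis.
    """
    total_idle = sum(idle_periods)
    saved = sum(p - wake_overhead for p in idle_periods if p >= breakeven)
    return saved / total_idle if total_idle else 0.0

# A schedule that merges many short idle gaps into fewer long windows
# (the goal of transition-aware scheduling) saves more leakage energy.
fragmented = [4] * 25          # 100 idle cycles in short bursts: nothing gated
consolidated = [50, 50]        # the same 100 idle cycles in two long windows
print(leakage_saved(fragmented))    # 0.0
print(leakage_saved(consolidated))  # 0.96
```

The example shows why the scheduler lengthens idle periods rather than merely counting them: only periods above the break-even point contribute any savings.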
73

A Coarse Grained Reconfigurable Architecture Framework Supporting Macro-Dataflow Execution

Varadarajan, Keshavan 12 1900 (has links) (PDF)
A Coarse-Grained Reconfigurable Architecture (CGRA) is a processing platform consisting of an interconnection of coarse-grained computation units (viz. Function Units (FUs) or Arithmetic Logic Units (ALUs)). These units communicate directly through send-receive-like primitives, as opposed to the shared-memory based communication used in multi-core processors. CGRAs are a well-researched topic and the design space of a CGRA is quite large. The design space can be represented as a 7-tuple (C, N, T, P, O, M, H), where the terms have the following meaning: C - choice of computation unit, N - choice of interconnection network, T - choice of number of context frames (single or multiple), P - presence of partial reconfiguration, O - choice of orchestration mechanism, M - design of memory hierarchy, and H - host-CGRA coupling. In this thesis, we develop an architectural framework for a macro-dataflow based CGRA where we make the following choice for each of these parameters: C - ALU, N - Network-on-Chip (NoC), T - multiple contexts, P - support for partial reconfiguration, O - macro-dataflow based orchestration, M - data memory banks placed at the periphery of the reconfigurable fabric (the reconfigurable fabric being the interconnection of computation units), H - loose coupling between host processor and CGRA, enabling our CGRA to execute an application without the host processor's intervention. The motivations for developing such a CGRA are threefold. First, to execute applications efficiently through reduction in reconfiguration time (i.e. the time needed to transfer instructions and data to the reconfigurable fabric) and reduction in execution time through better exploitation of all forms of parallelism: Instruction Level Parallelism (ILP), Data Level Parallelism (DLP) and Thread/Task Level Parallelism (TLP). We choose a macro-dataflow based orchestration framework in combination with partial reconfiguration so as to ease the exploitation of TLP and DLP.
Macro-dataflow serves as a lightweight synchronization mechanism. We experiment with two variants of the macro-dataflow orchestration unit, namely a hardware-controlled orchestration unit and a compiler-controlled orchestration unit. We employ a NoC as it helps reduce the reconfiguration overhead. Second, to permit customization of the CGRA for a particular domain through the use of domain-specific custom Intellectual Property (IP) blocks, which improves both application performance and energy efficiency. Third, to develop a CGRA which is completely programmable and accepts any program written using the C89 standard. The compiler and the architecture were co-developed to ensure that every feature of the architecture could be automatically targeted by the compiler. In this CGRA framework, the orchestration mechanism (O) and the host-CGRA coupling (H) are kept fixed, and we permit design space exploration of the other terms in the 7-tuple design space. The mode of compilation and execution remains invariant under these changes, hence the term framework. We now elucidate the compilation and execution flow for this CGRA framework. An application written in the C language is compiled and transformed into a set of temporal partitions, referred to as HyperOps in this thesis. The macro-dataflow orchestration unit selects a HyperOp for execution when all its inputs are available. The instructions and operands for a ready HyperOp are transferred to the reconfigurable fabric for execution. Each ALU (in the computation unit) is capable of waiting for the availability of its input data prior to issuing instructions. We permit the launch and execution of a temporal partition to progress in parallel, which reduces the reconfiguration overhead. We further cut launch delays by keeping loops persistent on the fabric, thus eliminating the need to relaunch their instructions. The CGRA framework has been implemented using Bluespec System Verilog.
We evaluate the performance of two of these CGRA instances: one for cryptographic applications and another for linear algebra kernels. We also run other general purpose integer and floating point applications to demonstrate the generic nature of these optimizations. We explore various microarchitectural optimizations, viz. pipeline optimizations (i.e. changing the value of T), different forms of macro-dataflow orchestration (the hardware-controlled and compiler-controlled orchestration units), and different execution modes including resident loops, pipeline parallelism, changes to the router, etc. As a result of these optimizations we observe a 2.5x improvement in performance as compared to the base version. The reconfiguration overhead was hidden by overlapping the launch of instructions with execution. The perceived reconfiguration overhead is reduced drastically, to about 9-11 cycles for each HyperOp, independent of the size of the HyperOp. This can be mainly attributed to the data-dependent instruction execution and the use of the NoC. The overhead of the macro-dataflow execution unit was reduced to a minimum with the compiler-controlled orchestration unit. To benchmark the performance of these CGRA instances, we compare them with an Intel Core 2 Quad running at 2.66 GHz. On the cryptographic CGRA instance, running at 700 MHz, we observe one to two orders of magnitude improvement in performance for cryptographic applications, and up to one order of magnitude performance degradation for the linear algebra CGRA instance. The relatively poor performance on linear algebra kernels can be attributed to the inability to exploit ILP across computation units interconnected by the NoC, the long latency in accessing data memory placed at the periphery of the reconfigurable fabric, and the unavailability of pipelined floating point units (which are critical to the performance of linear algebra kernels).
The superior performance on the cryptographic kernels can be attributed to a higher computation-to-load-instruction ratio, careful choice of custom IP block, the ability to construct large HyperOps, which allows a greater portion of the communication to be performed directly (as opposed to communication through a register file in a general purpose processor), and the use of the resident loops execution mode. The power consumption of a computation unit employed in the cryptography CGRA instance, along with its router, is about 76 mW, as estimated by Synopsys Design Vision using the Faraday 90nm technology library for an activity factor of 0.5. The power of other instances would depend on the specific instantiation of the domain-specific units. This implies that for a reconfigurable fabric of size 5 × 6 the total power consumption is about 2.3 W. The area and power (about 84 mW) dissipated by the macro-dataflow orchestration unit, which is common to both instances, are comparable to those of a single computation unit, making it an effective and low overhead technique to exploit TLP.
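The firing rule of the macro-dataflow orchestration unit (a HyperOp is selected for launch once all of its inputs are available) can be sketched as a toy scheduler; the graph, names and structure below are illustrative, not the Bluespec implementation described in the thesis:

```python
from collections import defaultdict, deque

def run_hyperops(deps, initial_ready):
    """Toy macro-dataflow orchestrator: 'launch' each HyperOp once all of the
    HyperOps it consumes outputs from have completed.  deps maps a HyperOp
    name to the set of HyperOps it depends on."""
    waiting = {h: set(d) for h, d in deps.items() if d}
    consumers = defaultdict(set)
    for h, d in deps.items():
        for producer in d:
            consumers[producer].add(h)
    ready = deque(initial_ready)
    launch_order = []
    while ready:
        h = ready.popleft()
        launch_order.append(h)             # transfer instructions/operands to the fabric
        for c in sorted(consumers[h]):     # completion may make consumers ready
            waiting[c].discard(h)
            if not waiting[c]:
                ready.append(c)
    return launch_order

# A diamond-shaped HyperOp graph: A feeds B and C, which both feed D.
deps = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
print(run_hyperops(deps, ["A"]))  # ['A', 'B', 'C', 'D']
```

D is launched only after both B and C complete, which is the lightweight synchronization the abstract attributes to macro-dataflow.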
74

Návrh sanace stokové sítě vybrané části urbanizovaného celku / Design of sewer network rehabilitation of choice parts of urbanized areas

Horák, Ondřej January 2013 (has links)
The thesis addresses the design of sewer network rehabilitation for an urbanized area named Kamenná čtvrť. The thesis is divided into several parts. The first part contains the accompanying report, which describes the characteristics of the area as a whole. The second part describes each part of the network, classifies defects according to ČSN EN 13508-2 and the forthcoming TNV 75 6905, and includes photographs from the camera survey of the network. The third part contains a chart of all defects located on the network, together with an evaluation of each part of the network according to the draft TNV 75 6905. The fourth part evaluates the network as a whole. In the fifth part, possible rehabilitation options for the network are designed. The sixth part covers the economic aspects of the possible alternatives and their comparison. The last part contains the hydrotechnical calculation, i.e. a calculation of the current situation and of the possible alternatives, with an individual evaluation of each calculation and, where necessary, optimization of the network.
75

Efektivní metoda čtení adresářových položek v souborovém systému Ext4 / An Efficient Way to Allocate and Read Directory Entries in the Ext4 File System

Pazdera, Radek January 2013 (has links)
The aim of this thesis is to improve the performance of sequential directory traversal in the ext4 file system. The HTree data structure, currently used to implement directories in ext4, handles random accesses to a directory very well, but it is not optimized for sequential traversal. This thesis provides an analysis of the problem. It first studies the implementation of the ext4 file system and the related subsystems of the Linux kernel. A set of tests was created to evaluate the performance of the current directory index implementation. Based on the results of these tests, a solution was designed and subsequently implemented in the Linux kernel. The thesis concludes with an evaluation of the benefits of the new implementation and a comparison of its performance with other Linux file systems.
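For context, the mismatch this thesis targets (HTree returns directory entries in hash order, so processing them in readdir order touches the inode table quasi-randomly) is often mitigated in user space by sorting entries by inode number before stat-ing them. A minimal sketch of that generic workaround (not the kernel-side solution developed in the thesis):

```python
import os
import tempfile

def stat_dir_sequentially(path):
    """List a directory and stat each entry in ascending inode order rather
    than the hash order returned by readdir, turning scattered inode-table
    reads into a mostly sequential access pattern on ext4."""
    entries = sorted(os.scandir(path), key=lambda e: e.inode())
    results = []
    for e in entries:
        st = os.stat(e.path, follow_symlinks=False)
        results.append((e.name, st.st_size))
    return results

# demo on a throwaway directory
d = tempfile.mkdtemp()
for name in ("a", "b", "c"):
    open(os.path.join(d, name), "w").close()
print(sorted(n for n, _ in stat_dir_sequentially(d)))  # ['a', 'b', 'c']
```

Tools such as backup utilities use the same trick; a kernel-side fix, as pursued in the thesis, removes the need for every application to do this itself.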
76

Analyses and Scalable Algorithms for Byzantine-Resilient Distributed Optimization

Kananart Kuwaranancharoen (16480956) 03 July 2023 (has links)
<p>The advent of advanced communication technologies has given rise to large-scale networks comprised of numerous interconnected agents, which need to cooperate to accomplish various tasks, such as distributed message routing, formation control, robust statistical inference, and spectrum access coordination. These tasks can be formulated as distributed optimization problems, which require agents to agree on a parameter minimizing the average of their local cost functions by communicating only with their neighbors. However, distributed optimization algorithms are typically susceptible to malicious (or "Byzantine") agents that do not follow the algorithm. This thesis offers analysis and algorithms for such scenarios. As the malicious agent's function can be modeled as an unknown function with some fundamental properties, we begin in the first two parts by analyzing the region containing the potential minimizers of a sum of functions. Specifically, we explicitly characterize the boundary of this region for the sum of two unknown functions with certain properties. In the third part, we develop resilient algorithms that allow correctly functioning agents to converge to a region containing the true minimizer under the assumption of convex functions of each regular agent. Finally, we present a general algorithmic framework that includes most state-of-the-art resilient algorithms. Under the strongly convex assumption, we derive a geometric rate of convergence of all regular agents to a ball around the optimal solution (whose size we characterize) for some algorithms within the framework.</p>
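A common primitive in such resilient algorithmic frameworks is a trimmed-mean filter that discards extreme neighbor values before averaging; the sketch below illustrates this generic idea only, not the specific algorithms or guarantees derived in the thesis:

```python
def trimmed_mean(values, f):
    """Discard the f largest and f smallest values, then average the rest.
    With at most f Byzantine neighbors, the result is guaranteed to lie
    within the range of the correct agents' values."""
    if len(values) <= 2 * f:
        raise ValueError("need more than 2f values to tolerate f faults")
    s = sorted(values)
    kept = s[f:len(s) - f]
    return sum(kept) / len(kept)

# Four honest agents report values 1..4; one Byzantine agent reports 1e6.
reports = [1.0, 2.0, 3.0, 4.0, 1e6]
print(trimmed_mean(reports, f=1))  # 3.0, unaffected by the outlier
```

A plain average of the same reports would be dragged to roughly 2e5 by the single faulty value, which is why resilient algorithms replace the average with a filtered aggregate of this kind.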
77

Locality Optimizations for Regular and Irregular Applications

Rajbhandari, Samyam 28 December 2016 (has links)
No description available.
78

Adapting the polytope model for dynamic and speculative parallelization

Jimborean, Alexandra 14 September 2012 (has links) (PDF)
In this thesis, we present a Thread-Level Speculation (TLS) framework whose main feature is to speculatively parallelize a sequential loop nest in various ways, to maximize performance. We perform code transformations by applying the polyhedral model that we adapted for speculative and runtime code parallelization. For this purpose, we designed a parallel code pattern which is patched by our runtime system according to the profiling information collected on some execution samples. We show on several benchmarks that our framework yields good performance on codes which could not be handled efficiently by previously proposed TLS systems.
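A core runtime test behind such speculative polyhedral parallelization is whether profiled memory addresses behave as affine functions of the loop index; a minimal sketch of that check (illustrative only, since the thesis's profiling and code-patching infrastructure is far more involved):

```python
def is_affine(samples):
    """Check whether sampled (iteration, address) pairs fit an affine
    function addr = base + stride * i, as required for polyhedral
    (polytope-model) reasoning about a speculatively parallelized loop."""
    if len(samples) < 2:
        return True
    (i0, a0), (i1, a1) = samples[0], samples[1]
    if i1 == i0:
        return False
    stride = (a1 - a0) / (i1 - i0)
    base = a0 - stride * i0
    return all(abs(base + stride * i - a) < 1e-9 for i, a in samples)

# A strided access pattern (addr = 0x1000 + 8*i) passes; an irregular
# pointer-chasing pattern fails, so speculation with rollback is required.
regular = [(i, 0x1000 + 8 * i) for i in range(10)]
irregular = [(0, 0x1000), (1, 0x1040), (2, 0x1008), (3, 0x2000)]
print(is_affine(regular), is_affine(irregular))  # True False
```

When the sampled accesses pass such a test, the runtime can instantiate a parallel code pattern under the speculated affine model and fall back to sequential execution if a later access violates it.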
79

The Organic Permeable Base Transistor:

Kaschura, Felix 23 October 2017 (has links) (PDF)
Organic transistors are a core component for basically all relevant types of fully organic circuits and consumer electronics. The Organic Permeable Base Transistor (OPBT) is a transistor with a sandwich geometry like that of Organic Light Emitting Diodes (OLEDs) and a vertical current transport. It therefore combines simple fabrication with high performance due to its short transit paths, and has a fairly good chance of being used in new organic electronics applications that until now have had to fall back on silicon transistors. A detailed understanding of the operation mechanism, allowing targeted engineering without trial and error, is required, as are universal optimization techniques that demand as little effort as possible. Several mechanisms explaining certain aspects of the operation have been proposed in the literature, but a comprehensive study covering all transistor regimes in detail is lacking. High performance has been reported for organic transistors, but usually only for certain materials: n-type C60 OPBTs with excellent performance have been presented, for example, while an adequate p-type OPBT is missing. In this thesis, the OPBT is investigated under two aspects. Firstly, drift-diffusion simulations of the OPBT are evaluated. By comparing the results obtained for different geometry parameters, conclusions about the detailed operation mechanism can be drawn. It is discussed where charge carriers flow in the device and which parameters affect the performance. In particular, the charge carrier transmission through the permeable base layer relies on small openings. Contrary to an intuitive view, however, the size of these openings does not limit the device performance. Secondly, p-type OPBTs using pentacene as the organic semiconductor are fabricated and characterized, with the aim of catching up with the performance of the n-type OPBTs.
It is shown how an additional seed layer can improve the performance by changing the morphology, how leakage currents can be suppressed, and how parameters such as the layer thickness should be chosen. With the combination of all presented optimization strategies, pentacene OPBTs are built that show a current density above 1000 mA/cm^2 and a current gain of 100. This makes the OPBT useful for a variety of applications, and complementary logic circuits are now possible as well. The discussed optimization strategies can be extended and used as a starting point for further enhancements. Together with the deep understanding obtained from the simulations, purposeful modifications with great potential can be studied.
80

Rekonstrukce blízkého pole antén / Reconstruction of the Antenna Near-Field

Puskely, Jan January 2011 (has links)
The aim of the doctoral thesis is to design an efficiently working algorithm that, from phaseless (amplitude-only) measurements in the antenna near field, can reconstruct the complex near field of the antenna and, consequently, its far-field radiation pattern. To this end, the properties of the minimization algorithm were investigated: the minimization approach, the optimization method and, last but not least, the objective functional were analyzed and suitably chosen. Initial estimates were considered to speed up the whole minimization process, and finally the idea of representing the sought electric field by a small number of coefficients was incorporated into the minimization algorithm. Based on these analyses, a phaseless method for characterizing the radiation properties of antennas was designed. The method combines global optimization with an image-compression method and a local method, in conjunction with conventional amplitude measurements over two surfaces. Here, the global optimization is used to find the global minimum of the minimized functional, the compression method reduces the number of unknown variables on the antenna aperture, and the local method refines the location of the minimum. The proposed method is very robust and much faster than other available minimization algorithms. Further research focused on the possibility of using amplitudes measured on a single surface only to reconstruct the radiation characteristics of antennas, and on a new algorithm for phase reconstruction on a cylindrical geometry.
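The two-surface amplitude-only setting can be illustrated with a classic alternating-projection (Gerchberg-Saxton style) toy in one dimension, where a unitary DFT stands in for the field propagation between the two measurement surfaces; this is a generic illustration, not the hybrid global/compression/local method proposed in the thesis:

```python
import cmath
import random

def dft(x, inverse=False):
    """Unitary discrete Fourier transform (toy stand-in for the propagation
    between the two measurement surfaces)."""
    n = len(x)
    s = 1 if inverse else -1
    return [sum(xk * cmath.exp(s * 2j * cmath.pi * j * k / n)
                for k, xk in enumerate(x)) / n ** 0.5 for j in range(n)]

def gerchberg_saxton(mag1, mag2, iters=200, seed=0):
    """Alternating projections: enforce the measured amplitudes mag1 in the
    first domain and mag2 in the second (Fourier) domain, keeping phases."""
    rng = random.Random(seed)
    x = [m * cmath.exp(2j * cmath.pi * rng.random()) for m in mag1]
    errors = []  # squared amplitude mismatch in the Fourier domain
    for _ in range(iters):
        X = dft(x)
        errors.append(sum((abs(v) - m) ** 2 for v, m in zip(X, mag2)))
        X = [m * v / abs(v) if abs(v) > 1e-12 else complex(m)
             for m, v in zip(mag2, X)]
        x = dft(X, inverse=True)
        x = [m * v / abs(v) if abs(v) > 1e-12 else complex(m)
             for m, v in zip(mag1, x)]
    return x, errors

# Synthetic ground truth and its two amplitude-only "measurements"
rng = random.Random(42)
truth = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(8)]
mag1 = [abs(v) for v in truth]
mag2 = [abs(v) for v in dft(truth)]
x, errors = gerchberg_saxton(mag1, mag2)
print(errors[0] >= errors[-1])  # True: the amplitude mismatch never increases
```

This local scheme is exactly the kind of iteration that can stagnate in a poor minimum, which motivates the thesis's combination of a global optimizer (to escape local minima) with a compression step (to shrink the number of unknowns) before the local refinement.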
