641

Directive-Based Data Partitioning and Pipelining and Auto-Tuning for High-Performance GPU Computing

Cui, Xuewen 15 December 2020 (has links)
The computer science community needs simpler mechanisms to achieve the performance potential of accelerators, such as graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and co-processors (e.g., Intel Xeon Phi), due to their increasing use in state-of-the-art supercomputers. Over the past 10 years, we have seen a significant improvement in both computing power and memory connection bandwidth for accelerators. However, computation power has grown significantly faster than the interconnect bandwidth between the central processing unit (CPU) and the accelerator. Given that accelerators generally have their own discrete memory space, data must be copied from CPU host memory to accelerator (device) memory before computation starts on the accelerator. Programming models like CUDA, OpenMP, OpenACC, and OpenCL can efficiently offload compute-intensive workloads to these accelerators, but overlapping data transfers with kernel computation in these models is neither simple nor straightforward. Instead, codes either copy data to or from the device without any overlap, or require explicit user design and refactoring. Achieving performance can require extensive refactoring and hand-tuning to apply data-transfer optimizations, and users must manually partition their dataset whenever it is larger than device memory, which is especially difficult when the device memory size is not exposed to the user. As systems become increasingly heterogeneous, CPUs are responsible for many tasks related to the accelerators: computation and data-movement tasks, task dependency checking, and task callbacks. Leaving all control logic to the CPU not only costs extra communication delay over the PCIe bus but also consumes CPU resources, which may affect the performance of other CPU tasks. This thesis provides efficient directive-based data-pipelining approaches for GPUs that tackle these issues and improve performance, programmability, and memory management. / Doctor of Philosophy / Over the past decade, parallel accelerators have become increasingly prominent in this emerging era of "big data, big compute, and artificial intelligence." In recent supercomputers and datacenter clusters, we find multi-core central processing units (CPUs), many-core graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and co-processors (e.g., Intel Xeon Phi) being used to accelerate many kinds of computation tasks. While many new programming models have been proposed to support these accelerators, scientists and developers without accelerator expertise usually find existing programming models insufficient for porting their code efficiently. Because of the limited accelerator on-chip memory, data arrays are often too large to fit in device memory, especially in deep learning tasks. The data need to be partitioned and managed properly, which requires extra hand-tuning effort. Moreover, without domain knowledge it is difficult for developers to tune performance for specific applications. To handle these problems, this dissertation proposes a general approach that provides better programmability, performance, and data management for accelerators. Accelerator users often prefer to keep their existing, verified C, C++, or Fortran code rather than grapple with unfamiliar code.
Since 2013, OpenMP has provided a straightforward way to adapt existing programs to accelerated systems. We propose multiple associated clauses to help developers easily partition and pipeline accelerated code. Specifically, the proposed extension can efficiently overlap kernel computation with data transfer between host and device, as sketched below. The extension supports memory over-subscription, meaning the memory required by the tasks can be larger than the GPU memory, and the internal scheduler guarantees that data is swapped out correctly and efficiently. Machine learning methods are also leveraged to auto-tune accelerator performance.
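To make the pipelining idea concrete, here is a minimal hand-written sketch using standard OpenMP 4.5+ target tasks. The function name, chunk size, and the trivial kernel are illustrative assumptions; this shows the pattern the proposed clauses are meant to automate, not the thesis's actual extension syntax.

```cpp
#include <algorithm>

// Minimal sketch of chunked, asynchronous offloading with standard
// OpenMP 4.5+ target tasks (assumed example kernel: y[i] = 2*x[i]).
// Each chunk becomes a deferred target task via 'nowait', so the
// runtime is free to overlap the host-device transfer of one chunk
// with the computation of another.
void pipelined_scale(const float* x, float* y, long n, long chunk) {
  #pragma omp parallel
  #pragma omp single
  {
    for (long off = 0; off < n; off += chunk) {
      long len = std::min(chunk, n - off);
      // Map only this chunk's array sections; 'nowait' defers the task.
      #pragma omp target teams distribute parallel for nowait \
              map(to: x[off:len]) map(from: y[off:len])
      for (long i = off; i < off + len; ++i)
        y[i] = 2.0f * x[i];
    }
    #pragma omp taskwait  // wait for all outstanding chunk tasks
  }
}
```

With the proposed directives, a single annotated loop would replace this manual chunking, and the runtime scheduler would also handle over-subscription when the whole arrays exceed device memory.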
642

Real-Time Resource Optimization for Wireless Networks

Huang, Yan 11 January 2021 (has links)
Resource allocation in modern wireless networks is constrained by increasingly stringent real-time requirements. Such requirements typically come from, among others, the short coherence time of a wireless channel, the small time resolution for resource allocation in OFDM-based radio frame structures, and the low-latency requirements of delay-sensitive applications. An optimal resource allocation solution is useful only if it can be determined and applied to the network entities within the expected time. For today's wireless networks such as 5G NR, this expected time (or real-time requirement) can be as low as 1 ms or even 100 μs. Most existing resource optimization solutions for wireless networks do not explicitly treat the real-time requirement as a constraint when developing solutions. In fact, the mainstream of research relies on asymptotic complexity analysis when designing solution algorithms. Asymptotic complexity analysis is concerned only with how computational complexity grows as the input size increases (as in big-O notation); it cannot capture a real-time requirement that is measured in wall-clock time. As a result, existing approaches such as exact or approximate optimization techniques from operations research are usually not usable in wireless networks in the field. Many problem-specific heuristics with polynomial-time asymptotic complexity may suffer a similar fate if their complexity is not tested in actual wall-clock time. To address the limitations of existing approaches, this dissertation presents novel real-time solution designs for two types of optimization problems in wireless networks: i) problems that have closed-form mathematical models, and ii) problems that cannot be modeled in closed form. For the first type, we propose a novel approach that consists of (i) problem decomposition, which breaks the original optimization problem into a large number of small and independent sub-problems, (ii) search intensification, which identifies the most promising problem sub-space and selects a small set of sub-problems to match the available GPU processing cores, and (iii) GPU-based large-scale parallel processing, which solves the selected sub-problems in parallel and finds a near-optimal solution to the original problem (a schematic sketch of this decompose, intensify, and reduce pattern follows this entry). The efficacy of this approach is illustrated by our solutions to the following two problems.
• Real-Time Scheduling to Achieve Fair LTE/Wi-Fi Coexistence: We investigate a resource optimization problem for fair coexistence between LTE and Wi-Fi in the unlicensed spectrum. The real-time requirement for finding the optimal channel division and LTE resource allocation solution is on a 1 ms time scale. The problem involves the optimal division of transmission time between LTE and Wi-Fi across multiple unlicensed bands, and the resource allocation among LTE users within LTE's "ON" periods. We formulate this optimization problem as a mixed-integer linear program and prove its NP-hardness. Then, by exploiting the unique problem structure, we propose a real-time solution design based on problem decomposition and GPU-based parallel processing. Results from an implementation on the NVIDIA GPU/CUDA platform demonstrate that the proposed solution achieves a near-optimal objective and meets the 1 ms timing requirement of 4G LTE.
• An Ultrafast GPU-based Proportional Fair Scheduler for 5G NR: We study the popular proportional-fair (PF) scheduling problem in a 5G NR environment. The real-time requirement for determining the optimal (with respect to the PF objective) resource allocation and MCS selection is 125 μs (under 5G numerology 3). In this problem, we need to allocate frequency-time resource blocks on an operating channel and assign a modulation and coding scheme (MCS) to each active user in the cell. We present GPF+, a GPU-based real-time PF scheduler. With GPF+, the original PF optimization problem is decomposed into a large number of small and independent sub-problems. We then employ a cross-entropy-based search intensification technique to identify the most promising problem sub-space and select a small set of sub-problems to fit the GPU. After solving the selected sub-problems in parallel on the GPU cores, we take the best sub-problem solution as the final scheduling solution. Evaluation results show that GPF+ provides near-optimal PF performance in a 5G cell while meeting the 125 μs real-time requirement.
For the second type of problems, where there is no closed-form mathematical formulation, we propose to employ model-free deep learning (DL) or deep reinforcement learning (DRL) techniques, with judicious consideration of the timing requirement throughout the design. Under DL/DRL, we employ deep function approximators (neural networks) to learn the unknown objective function of an optimization problem, approximate an optimal algorithm for finding resource allocation solutions, or discover important mapping functions related to the resource optimization. To meet the real-time requirement, we augment the DL or DRL methods with optimization techniques at the input or output of the deep function approximators to reduce their complexity and computation time. Under this approach, we study the following two problems.
• A DRL-based Approach to Dynamic eMBB/URLLC Multiplexing in 5G NR: We study the problem of dynamically multiplexing eMBB and URLLC on the same channel through preemptive resource puncturing. The real-time requirement for determining the optimal URLLC puncturing solution is 1 ms (under 5G numerology 0). A major challenge is that the problem cannot be modeled with closed-form mathematical expressions. To address this issue, we develop a model-free DRL approach that employs a deep neural network to learn an optimal algorithm for allocating the URLLC puncturing over the operating channel, with the objective of minimizing the adverse impact of URLLC traffic on eMBB. Our contributions include a novel learning method that exploits the intrinsic properties of the URLLC puncturing optimization problem to achieve fast and stable learning convergence, and a mechanism to ensure the feasibility of the deep neural network's output puncturing solution. Experimental results demonstrate that our DRL-based solution significantly outperforms state-of-the-art algorithms proposed in the literature and meets the 1 ms real-time requirement for dynamic multiplexing.
• A DL-based Link Adaptation for eMBB/URLLC Multiplexing in 5G NR: We investigate MCS selection for eMBB traffic under the impact of URLLC preemptive puncturing. The real-time requirement for determining the optimal MCSs for all eMBB transmissions scheduled in a transmission interval is 125 μs (under 5G numerology 3).
The objective is to have eMBB meet a given block-error rate (BLER) target under the adverse impact of URLLC puncturing. Since this problem cannot be mathematically modeled in closed form, we propose a DL-based solution design that uses a deep neural network to learn and predict the BLER of a transmission at each MCS level. Based on the BLER predictions, an optimal MCS that achieves the BLER target can then be found for each transmission. To meet the 5G real-time requirement, we implement this design on a hybrid CPU-GPU architecture to minimize execution time. Extensive experimental results show that our design selects optimal MCSs under the impact of preemptive puncturing and meets the 125 μs timing requirement. / Doctor of Philosophy / In modern wireless networks such as 4G LTE and 5G NR, the optimal allocation of radio resources must be performed within a real-time requirement on a 1 ms or even 100 μs time scale. Such a requirement comes from the physical properties of wireless channels, the short time resolution for resource allocation defined in the wireless communication standards, and the low-latency requirements of delay-sensitive applications. The real-time requirement, although necessary for wireless networks in the field, has hardly been considered a key design constraint in the research community. Existing solutions in the literature mostly consider theoretical computational complexity rather than actual computation time as measured by wall clock. To address these limitations, this dissertation presents real-time solution designs for two types of optimization problems in wireless networks: i) problems that have mathematical models, and ii) problems that cannot be modeled mathematically. For the first type, we propose a novel approach that consists of (i) problem decomposition, (ii) search intensification, and (iii) GPU-based large-scale parallel processing. The efficacy of this approach is illustrated by our solutions to the following two problems.
• Real-Time Scheduling to Achieve Fair LTE/Wi-Fi Coexistence: We investigate a resource optimization problem for fair coexistence between LTE and Wi-Fi users in the same (unlicensed) spectrum. The real-time requirement for finding the optimal LTE resource allocation solution is on a 1 ms time scale.
• An Ultrafast GPU-based Proportional Fair Scheduler for 5G NR: We study the popular proportional-fair (PF) scheduling problem in a 5G NR environment. The real-time requirement for determining the optimal resource allocation and modulation and coding scheme (MCS) for each user is 125 μs.
For the second type of problems, where there is no mathematical formulation, we propose to employ model-free deep learning (DL) or deep reinforcement learning (DRL) techniques, with judicious consideration of the timing requirement throughout the design. Under this approach, we study the following two problems:
• A DRL-based Approach to Dynamic eMBB/URLLC Multiplexing in 5G NR: We study the problem of dynamically multiplexing eMBB and URLLC on the same channel through preemptive resource puncturing. The real-time requirement for determining the optimal URLLC puncturing solution is 1 ms.
• A DL-based Link Adaptation for eMBB/URLLC Multiplexing in 5G NR: We investigate MCS selection for eMBB traffic under the impact of URLLC preemptive puncturing.
The real-time requirement for determining the optimal MCSs for all eMBB transmissions scheduled in a transmission interval is 125 μs.
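The decompose, intensify, and reduce pattern referenced above can be sketched generically. All problem-specific pieces below are hypothetical stand-ins for the dissertation's MILP decomposition, cross-entropy intensification, and per-core sub-problem solver; the dissertation runs the parallel loop on GPU cores, while this sketch uses an OpenMP CPU loop as a placeholder.

```cpp
#include <cstddef>
#include <vector>

// Skeleton of the decompose / intensify / solve-in-parallel pattern used
// for problems with closed-form models. 'Sub' is one small independent
// sub-problem (e.g., a fixed integer part of the original problem) and
// 'solve_sub' is its exact small solver; both are assumed, not the
// dissertation's actual code.
struct Solution { double objective = -1e300; /* allocation, MCS, ... */ };

template <class Sub, class Solver>
Solution parallel_search(const std::vector<Sub>& subproblems, Solver solve_sub) {
  // Solve one sub-problem per processing core (GPU in the dissertation).
  std::vector<Solution> sols(subproblems.size());
  #pragma omp parallel for
  for (long i = 0; i < (long)subproblems.size(); ++i)
    sols[i] = solve_sub(subproblems[i]);
  // Reduce: keep the best objective (maximize, e.g., the PF metric).
  Solution best;
  for (const Solution& s : sols)
    if (s.objective > best.objective) best = s;
  return best;
}
// Usage sketch: enumerate candidate sub-problems (decomposition), keep the
// most promising ones via search intensification, then:
//   Solution s = parallel_search(selected, [](const Cand& c){ /*...*/
//                                            return Solution{/*...*/}; });
```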
643

Automated Runtime Analysis and Adaptation for Scalable Heterogeneous Computing

Helal, Ahmed Elmohamadi Mohamed 29 January 2020 (has links)
In the last decade, there have been tectonic shifts in computer hardware as sequential CPU performance reached its physical limits. As a consequence, current high-performance computing (HPC) systems integrate a wide variety of compute resources with different capabilities and execution models, ranging from multi-core CPUs to many-core accelerators. While such heterogeneous systems can enable dramatic acceleration of user applications, extracting optimal performance via manual analysis and optimization is a complicated and time-consuming process. This dissertation presents graph-structured program representations to reason about performance bottlenecks on modern HPC systems and to guide novel automation frameworks for performance analysis, modeling, and runtime adaptation. The proposed representations exploit domain knowledge and capture the inherent computation and communication patterns in user applications, at multiple levels of computational granularity, via compiler analysis and dynamic instrumentation. The empirical results demonstrate that the introduced modeling frameworks accurately estimate the realizable parallel performance and scalability of a given sequential code when ported to heterogeneous HPC systems. As a result, these frameworks enable efficient workload distribution schemes that utilize all the available compute resources in a performance-proportional way (see the sketch after this abstract). In addition, the proposed runtime adaptation frameworks significantly improve the end-to-end performance of important real-world applications that suffer from limited parallelism and fine-grained data dependencies. Specifically, compared with state-of-the-art methods, such adaptive parallel execution achieves up to an order-of-magnitude speedup on the target HPC systems while preserving the inherent data dependencies of user applications. / Doctor of Philosophy / Current supercomputers integrate a massive number of heterogeneous compute units with varying speed, computational throughput, memory bandwidth, and memory access latency. This trend represents a major challenge to end users, as their applications have been designed from the ground up to exploit primarily homogeneous CPUs. While heterogeneous systems can deliver several orders of magnitude speedup compared with traditional CPU-based systems, end users need extensive software and hardware expertise, as well as significant time and effort, to efficiently utilize all the available compute resources. To streamline this daunting process, this dissertation presents automated frameworks for analyzing and modeling performance on parallel architectures and for transforming the execution of user applications at runtime. The proposed frameworks incorporate domain knowledge and adapt to the input data and the underlying hardware using novel static and dynamic analyses. The experimental results show the efficacy of the introduced frameworks across many important application domains, such as computational fluid dynamics (CFD) and computer-aided design (CAD). In particular, the adaptive execution approach on heterogeneous systems achieves up to an order-of-magnitude speedup over optimized parallel implementations.
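One concrete building block of a performance-proportional distribution scheme is splitting work according to each device's measured or modeled throughput. The function below is a hypothetical sketch of that idea; the dissertation's actual distribution scheme is more elaborate.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical sketch of performance-proportional work partitioning:
// given the measured/modeled throughput of each compute resource, assign
// each device a share of the n_items work items proportional to it.
// Requires at least one device.
std::vector<long> proportional_split(const std::vector<double>& throughput,
                                     long n_items) {
  double total = std::accumulate(throughput.begin(), throughput.end(), 0.0);
  std::vector<long> share(throughput.size());
  long assigned = 0;
  for (std::size_t d = 0; d + 1 < throughput.size(); ++d) {
    share[d] = static_cast<long>(n_items * throughput[d] / total);
    assigned += share[d];
  }
  share.back() = n_items - assigned;  // remainder goes to the last device
  return share;
}
// e.g., throughputs {100, 900} (CPU, GPU) give roughly a 10% / 90% split.
```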
644

Characterization of FPGA-based High Performance Computers

Pimenta Pereira, Karl Savio 02 September 2011 (has links)
As CPU clock frequencies plateau and the doubling of CPU cores per processor exacerbates the memory wall, hybrid-core computing, which augments CPUs with FPGAs and/or GPUs, holds the promise of addressing high-performance computing demands with respect to performance, power, and productivity. Traditional approaches to benchmarking high-performance computers, such as SPEC, take an architecture-based approach and do not fully express the parallelism that exists in FPGA and GPU accelerators. This thesis follows an application-centric approach, comparing the sustained performance of two key computational idioms with respect to performance, power, and productivity. Specifically, a complex, single-precision, floating-point, 1D Fast Fourier Transform (FFT) and a molecular dynamics modeling application are implemented on state-of-the-art FPGA and GPU accelerators. As the results show, FPGA floating-point FFT performance is highly sensitive to a mix of dedicated FPGA resources: DSP48E slices, block RAMs, and FPGA I/O banks in particular. Estimated results show that these resources are the performance-limiting factor for the floating-point FFT benchmark on FPGAs. Fixed-point FFTs are important in many high-performance embedded applications. For a fixed-point FFT, FPGAs can exploit a flexible data-path width to trade off circuit cost against computation speed, improving performance and resource utilization; GPUs, with their fixed data-width architecture, cannot fully take advantage of this. For the molecular dynamics application, FPGAs benefit from the flexibility of creating a custom, tightly pipelined datapath and a highly optimized memory subsystem on the accelerator. This can provide a 250-fold improvement over an optimized CPU implementation and a 2-fold improvement over an optimized GPU implementation, along with massive power savings. Finally, extracting maximum performance from the FPGA requires balancing the formulation of the algorithm on the platform, the optimal use of available external memory bandwidth, and the availability of computational resources, at the expense of greater programming effort. / Master of Science
645

Improving Bio-Inspired Frameworks

Varadarajan, Aravind Krishnan 05 October 2018 (has links)
In this thesis, we provide solutions for two different bio-inspired algorithms. The first is enhancing the performance of bio-inspired test generation for circuits described in RTL Verilog, specifically for branch coverage. We seek to improve the performance of an existing framework, BEACON, an Ant Colony Optimization (ACO) based test generation framework. Like other ACO frameworks, BEACON offers good scope for performance improvement through parallel computing. We exploit the available parallelism using both multi-core Central Processing Units (CPUs) and Graphics Processing Units (GPUs). Using our new multithreaded approach, we can reduce test generation time by a factor of 25 compared with the original implementation for a wide variety of circuits. We also provide a 2-dimensional factoring method for BEACON that improves the available parallelism to yield additional speedup. The second bio-inspired algorithm we address concerns Deep Neural Networks (DNNs). With the increasing prevalence of neural nets in artificial intelligence and mission-critical applications such as self-driving cars, questions arise about their reliability and robustness. We have developed a test-generation-based technique and metric to evaluate the robustness of a neural net's outputs based on their sensitivity to its inputs. This is done by generating inputs that the neural net finds difficult to classify but that remain relatively apparent to human perception. We measure the degree of difficulty of generating such inputs to calculate our metric. / MS / High-level Hardware Design Languages (HDLs) have allowed designers to implement complicated hardware designs with considerably less effort. Unfortunately, design verification for the same circuits has failed to scale gracefully in terms of time and effort. Not only has it become more difficult for formal methods, due to the exponential complexity of increasing path explosion, but concrete test generation frameworks also face new issues, such as an increased volume of required simulations. The advent of parallel computing using General Purpose Graphics Processing Units (GPGPUs) has led to improved performance for various applications. We propose to leverage both the multi-core CPU and the GPGPU for RTL test generation. This is achieved by implementing a test generation framework that can utilize the SIMD-type parallelism available on GPGPUs and the task-level parallelism available on CPUs. The speedup is extracted both from the test generation framework itself and from refactoring the hardware model for multi-threaded test generation. For this purpose, we translate the RTL Verilog into a C++ and a CUDA compilable program. Experimental results show that considerable speedup can be achieved for test generation without loss of coverage. In recent years, machine learning and artificial intelligence have taken a substantial leap forward with the discovery of Deep Neural Networks (DNNs). Unfortunately, apart from accuracy and F-test numbers, very few metrics exist to qualify a DNN. This becomes a reliability issue, as DNNs are quite frequently used in safety-critical applications. It is difficult to interpret how the parameters of a trained DNN store the knowledge from the training inputs, and therefore also difficult to infer whether a DNN has learned parameters that might cause an output neuron to misfire wrongly, i.e., a bug. An exhaustive search of the input space of a DNN is not only infeasible but also misleading.
Thus, in our work, we apply test generation techniques to generate new test inputs, based on the existing training and testing sets, to qualify the underlying robustness. The generation of these inputs is guided only by the prediction probability values at the final output layer (a gradient-free search sketch follows this abstract). We observe that, depending on the amount of perturbation and the time needed to generate these inputs, we can differentiate between DNNs of varying quality.
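A minimal sketch of such probability-guided input generation is shown below. The hill-climbing rule, parameter names, and the `predict` callable standing in for the DNN under test are all assumptions for illustration, not the thesis's exact method.

```cpp
#include <functional>
#include <random>
#include <vector>

// Gradient-free robustness probe: perturb a correctly classified input,
// guided only by output-layer probabilities, until the predicted class
// flips. The effort (iterations) needed serves as a robustness score:
// fewer iterations means a more fragile network around this input.
using Predict = std::function<std::vector<float>(const std::vector<float>&)>;

int robustness_effort(Predict predict, std::vector<float> x,
                      int true_class, float step, int max_iters) {
  std::mt19937 rng(42);
  std::normal_distribution<float> noise(0.0f, step);
  float conf = predict(x)[true_class];
  for (int it = 1; it <= max_iters; ++it) {
    std::vector<float> cand = x;
    for (float& v : cand) v += noise(rng);     // small random perturbation
    std::vector<float> p = predict(cand);
    int argmax = 0;
    for (std::size_t c = 1; c < p.size(); ++c)
      if (p[c] > p[argmax]) argmax = (int)c;
    if (argmax != true_class) return it;       // class flipped after 'it' tries
    if (p[true_class] < conf) {                // hill-climb on falling confidence
      x = cand;
      conf = p[true_class];
    }
  }
  return max_iters;  // budget exhausted: input treated as robust
}
```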
646

Isolating Drone Frequencies in a Real-Time Drone Detection System

Teglund, Jonas January 2024 (has links)
The problems caused by commercial drones around air traffic, airports, and vital and military installations have increased the demand for drone detection and tracking systems. An acoustic beamforming system that tracks audio sources in real time using 256 microphones was extended to detect and track drones. This thesis studied software-defined, multi-channel, real-time filtering solutions to improve the system's drone detection and tracking capabilities. The frequency content of drone sound and disturbance noise was analyzed to create a suitable filter, and methods for applying this filter on all channels while still operating in real time were studied. SIMD intrinsics were used to create a few candidate algorithms, and a GPU algorithm was created as well. The algorithms were compared on execution time, and the system was also analyzed for performance degradation and for the placement of the filtering stage. Measured in isolation, the best SIMD algorithm completed in 0.41 milliseconds and the GPU algorithm in 0.12 milliseconds when filtering 256 samples from all 256 channels; with a real-time budget of roughly 5.2 milliseconds, both solutions operate well below the limit. When placed outdoors in a windy environment, the system clearly found the drone 48% of the time without filtering and 89% of the time with filtering, and the filter improved the signal-to-noise ratio by 21 dB. The results show that a software-defined, multi-channel, real-time filter operating on a large data stream is a viable solution for real-time DSP applications, and that filtering is a good way to specialize a beamforming application toward a desired frequency band. A sketch of the SIMD filtering idea follows this abstract.
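The thesis abstract does not specify its filter structure, so the sketch below assumes a simple FIR filter and AVX2+FMA intrinsics (compile with `-mavx2 -mfma`). The data layout, tap count, and channel count are illustrative; the vectorization idea, running one filter instance across eight adjacent channels per SIMD register, applies to IIR designs as well.

```cpp
#include <immintrin.h>

// Multi-channel FIR filtering with AVX2+FMA intrinsics.
// Assumed layout: frame-major samples with C channels per frame, so
// x[t*C + c] is channel c at time t; C must be a multiple of 8.
// Eight adjacent channels share each 256-bit register, so one K-tap
// filter instance runs on 8 channels per inner-loop iteration.
// Output frames t < K-1 are left unwritten (filter warm-up).
void fir_multichannel(const float* x, float* y, int frames,
                      const float* taps, int K, int C /* e.g. 256 */) {
  for (int t = K - 1; t < frames; ++t) {
    for (int c = 0; c < C; c += 8) {
      __m256 acc = _mm256_setzero_ps();
      for (int k = 0; k < K; ++k) {
        __m256 h = _mm256_set1_ps(taps[k]);               // broadcast tap k
        __m256 s = _mm256_loadu_ps(&x[(t - k) * C + c]);  // 8 channels at t-k
        acc = _mm256_fmadd_ps(h, s, acc);                 // acc += h * s
      }
      _mm256_storeu_ps(&y[t * C + c], acc);
    }
  }
}
```

Because the channels are independent, the same loop nest maps directly onto GPU threads, which is presumably why the GPU variant scales so well.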
647

Etude en vue de la multirésolution de l’apparence

Hadim, Julien 11 May 2009 (has links)
In recent years, the Bidirectional Texture Function (BTF) has emerged as a flexible solution for realistic, real-time rendering of materials with complex appearance at low computational cost. One drawback of this approach, however, is the resulting huge amount of data, and several methods have been proposed to compress and manage it. In this document, we propose a new BTF representation that improves data coherency and thus allows better compression. In the first part, we study acquisition and digital generation methods for BTFs, and in particular compression methods suitable for GPU rendering. We then carry out a study with our software BTFInspect to determine which of the visual phenomena present in BTFs mainly drive the per-texel data coherence. In the second part, we propose a new BTF representation, named Flat Bidirectional Texture Function (Flat-BTF), which improves data coherency and therefore compression. The analysis of the results shows, statistically and visually, the gain in coherency as well as the absence of a noticeable loss of quality compared to the original representation. In the third and last part, we demonstrate how our new representation can be used in real-time rendering applications on GPUs. Finally, we introduce a compression of the appearance data via a GPU quantization method, presented in the context of streaming 3D data from a server hosting 3D models to a client that wishes to visualize them.
648

Méthodes de reconstruction et de quantification pour la microscopie de super-résolution par localisation de molécules individuelles / Reconstruction and quantification methods for single-molecule based super-resolution microscopy

Kechkar, Mohamed Adel 20 December 2013 (has links)
The field of fluorescence microscopy has witnessed a real revolution these last few years, reaching nanometric spatial resolutions well below the diffraction limit predicted by Abbe more than a century ago. Single-molecule-based super-resolution techniques such as PALM (PhotoActivated Localization Microscopy) or (d)STORM (direct Stochastic Optical Reconstruction Microscopy) allow the reconstruction of images of biological samples in 2 and 3 dimensions with close-to-molecular resolution. However, while they require quite straightforward instrumentation, they need heavy computation, limiting their routine use. In practice, a few tens of thousands of raw images containing more than a million molecules must be acquired and analyzed to reconstruct a single super-resolution image. Most of the available tools require post-acquisition processing, making the acquisition protocol much heavier. In addition, quantifying the organization, dynamics, and stoichiometry of biomolecular complexes at nanometer scales can be a key determinant in elucidating the origin of certain diseases. Novel localization microscopy techniques offer such capabilities, but dedicated analysis methods still have to be developed.
In order to democratize this new generation of localization microscopy techniques and make them usable routinely by non-experts, it is essential to develop simple, easy-to-use localization and quantitative analysis methods. During this PhD thesis, we first developed a new technique for real-time localization and reconstruction, based on wavelet decomposition and the use of GPU cards, for super-resolution microscopy in 2 and 3 dimensions (a simplified sketch of the wavelet detection step follows this abstract). Second, we proposed a quantitative method, based on the visualization and photophysics of organic fluorophores, for measuring the stoichiometry of AMPA receptors in synapses at the molecular scale.
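The sketch below illustrates the common wavelet-based detection step in localization microscopy: an "à trous" B3-spline decomposition yields a band-pass plane in which diffraction-limited spots stand out, and thresholding that plane gives candidate molecule pixels. The kernel and threshold rule follow the standard B3-spline choice; the thesis's exact parameters and its GPU implementation are not reproduced here.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Image = std::vector<float>;  // row-major, w*h pixels

// One separable B3-spline smoothing pass of the "a trous" transform;
// 'step' dilates the 5-tap kernel (step 1 = level 1, step 2 = level 2).
// Borders are handled by clamping indices.
static Image smooth(const Image& in, int w, int h, int step) {
  const float k[5] = {1/16.f, 4/16.f, 6/16.f, 4/16.f, 1/16.f};
  auto clampi = [](int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
  };
  Image tmp(in.size()), out(in.size());
  for (int y = 0; y < h; ++y)        // horizontal pass
    for (int x = 0; x < w; ++x) {
      float s = 0.f;
      for (int i = -2; i <= 2; ++i)
        s += k[i + 2] * in[y * w + clampi(x + i * step, 0, w - 1)];
      tmp[y * w + x] = s;
    }
  for (int y = 0; y < h; ++y)        // vertical pass
    for (int x = 0; x < w; ++x) {
      float s = 0.f;
      for (int i = -2; i <= 2; ++i)
        s += k[i + 2] * tmp[clampi(y + i * step, 0, h - 1) * w + x];
      out[y * w + x] = s;
    }
  return out;
}

// Candidate molecule pixels: the band-pass plane W2 = S1 - S2 enhances
// spots; pixels above mean + factor*std are kept. Every loop here is
// per-pixel independent, which is what makes a GPU mapping effective.
// Sub-pixel centroid or Gaussian fitting would follow on the hits.
std::vector<int> detect_spots(const Image& img, int w, int h, float factor) {
  Image s1 = smooth(img, w, h, 1), s2 = smooth(s1, w, h, 2);
  Image w2(img.size());
  double sum = 0.0, sq = 0.0;
  for (std::size_t i = 0; i < img.size(); ++i) {
    w2[i] = s1[i] - s2[i];
    sum += w2[i]; sq += double(w2[i]) * w2[i];
  }
  const double mean = sum / w2.size();
  const double sd = std::sqrt(sq / w2.size() - mean * mean);
  std::vector<int> hits;             // indices of above-threshold pixels
  for (std::size_t i = 0; i < w2.size(); ++i)
    if (w2[i] > mean + factor * sd) hits.push_back((int)i);
  return hits;
}
```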
649

Toward GPU-based Ground Structures for Large Scale Topology Optimization

ARTURO ELI CUBAS RODRIGUEZ 14 May 2019 (has links)
Topology optimization aims to find the most efficient material distribution in a specified domain without violating user-defined design constraints. When applied to continuum structures, topology optimization is usually performed by means of the well-known density methods. In this work, we focus on the application of its discrete formulation, where a given domain is discretized into a ground structure, i.e., a finite spatial distribution of nodes connected by truss members. The ground structure method approximates optimal Michell-type structures, which are composed of an infinite number of members, using a reduced number of truss members. The optimal least-weight truss for a single load case, under linear elastic conditions and subject to stress constraints, can be posed as a linear programming problem (a standard formulation is sketched below). The aim of this work is to provide a scalable implementation for the optimization of least-weight trusses embedded in any domain geometry. The method removes unnecessary members from a truss with a user-defined degree of connectivity while keeping the nodal locations fixed. We discuss in detail a scalable implementation of the ground structure method using an efficient and robust interior-point algorithm within a parallel computing environment involving Graphics Processing Units (GPUs). The capabilities of the proposed implementation are illustrated by means of large-scale applications to practical problems with millions of members in both 2D and 3D structures.
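For reference, the least-weight truss problem mentioned above has a well-known linear programming form, the plastic-design formulation. The notation below is assumed for illustration, not copied from the thesis: ℓ collects member lengths, B is the nodal equilibrium matrix, f the load vector, σ the stress limit, and member forces are split into tension and compression parts s⁺ and s⁻.

```latex
\begin{aligned}
\min_{s^{+},\,s^{-}}\;\; & \tfrac{1}{\sigma}\,\ell^{\top}\!\bigl(s^{+}+s^{-}\bigr)
   &&\text{(total volume, with areas } a_i=(s_i^{+}+s_i^{-})/\sigma\text{)}\\
\text{s.t.}\;\; & B\bigl(s^{+}-s^{-}\bigr)=f
   &&\text{(nodal equilibrium)}\\
 & s^{+}\geq 0,\;\; s^{-}\geq 0
   &&\text{(tension/compression splits)}
\end{aligned}
```

Members whose optimal area is numerically zero are pruned, which is how the method reduces the dense initial ground structure to the final layout.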
650

Approche bayésienne pour la localisation de sources en imagerie acoustique / Bayesian approach in acoustic source localization and imaging

Chu, Ning 22 November 2013 (has links)
Acoustic imaging is an advanced technique for acoustic source localization and power reconstruction using limited measurements from a microphone sensor array. The technique can provide meaningful insights into the performance, properties, and mechanisms of acoustic sources, and it has been widely used for evaluating acoustic influence in the automobile and aircraft industries. Acoustic imaging methods often involve two aspects: a forward model of acoustic signal (power) propagation, and its inverse solution. However, the inversion usually causes a very ill-posed inverse problem, whose solution is not unique and is quite sensitive to measurement errors.
Therefore, classical methods can neither easily obtain high spatial resolution between two close sources nor achieve a wide dynamic range of acoustic source powers. In this thesis, we first build a discrete forward model of acoustic signal propagation. This signal model is a linear but under-determined system of equations linking the measured data and the unknown source signals. Based on this signal model, we set up a discrete forward model of acoustic power propagation, which is both linear and determined for the source powers. In the forward models, we consider the measurement errors to be composed mainly of background noise at the sensor array, model uncertainty caused by multi-path propagation, and model approximation errors. For the inverse problem of the acoustic power model, we first propose a robust super-resolution approach with a sparsity constraint, so that we can obtain very high spatial resolution even under strong measurement errors; however, the sparsity parameter must be carefully estimated for effective performance. Then, for acoustic imaging with a wide dynamic range and robustness, we propose a robust Bayesian inference approach with a sparsity-enforcing prior, the double exponential law. This sparse prior can better embody the sparsity of the source distribution than a sparsity constraint. All the unknown variables and parameters can be alternately estimated via Joint Maximum A Posteriori (JMAP) estimation. However, JMAP requires a non-quadratic optimization with a huge computational cost, so we improve the following two aspects. First, to accelerate the JMAP estimation, we investigate an invariant 2D convolution operator to approximate the acoustic power propagation model; owing to this invariant convolution model, our approaches can be implemented in parallel on a Graphics Processing Unit (GPU). Second, we consider that measurement errors are spatially variant (non-stationary) across sensors. In this more practical case, the distribution of measurement errors can be more accurately modeled by a Student's-t law, which expresses the varying variances through hidden parameters. Moreover, the sparsity-enforcing distribution can also be conveniently described by the Student's-t law, which decomposes into multivariate Gaussian and Gamma laws. However, JMAP estimation then involves many unknown variables and hidden parameters, so we apply the Variational Bayesian Approximation (VBA) to overcome the JMAP drawbacks. One notable advantage of VBA is that it not only achieves the parameter estimations but also offers confidence intervals for the parameters of interest, thanks to the hidden parameters used in the Student's-t priors. To conclude, the proposed approaches are validated by simulations, by real data from wind tunnel experiments at Renault S2A, and by hybrid data. Compared with typical state-of-the-art methods, the main advantages of the proposed approaches are robustness to measurement errors, super spatial resolution, a wide dynamic range, and no need to know the source number or the signal-to-noise ratio (SNR) beforehand. A sketch of the forward model and a sparsity-regularized estimate follows this abstract.
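As a rough illustration of the models discussed above (notation assumed here, not taken from the thesis): with p the measured beamforming powers, C the known power propagation matrix, x ≥ 0 the unknown source powers, and ε the aggregated measurement errors, the forward model and a sparsity-regularized JMAP-style estimate read:

```latex
\begin{aligned}
\mathbf{p} &= \mathbf{C}\,\mathbf{x} + \boldsymbol{\varepsilon}
   &&\text{(linear, determined power propagation model)}\\[2pt]
\hat{\mathbf{x}} &= \arg\min_{\mathbf{x}\,\ge\,0}\;
   \frac{1}{2\sigma_{\varepsilon}^{2}}\,
   \bigl\lVert \mathbf{p}-\mathbf{C}\mathbf{x}\bigr\rVert_{2}^{2}
   \;+\;\lambda\,\lVert \mathbf{x}\rVert_{1}
   &&\text{(double-exponential prior yields the } \ell_{1}\text{ term)}
\end{aligned}
```

Replacing the product Cx by an invariant 2D convolution turns the costly matrix operation into a GPU-friendly one, which is the acceleration described above; the Student's-t refinement replaces the fixed noise variance and the ℓ1 penalty with hierarchical priors estimated by VBA.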
