About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations. Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Distributed Execution of Recursive Irregular Applications

Nikhil Hegde (7043171) 13 August 2019
Massive computing power, and the applications running on it, primarily confined to expensive supercomputers a decade ago, have now become mainstream through the availability of clusters of commodity computers and high-speed interconnects running big-data-era applications. The challenges of programming such systems to use their computing power effectively have led to the creation of intuitive abstractions and implementations targeting average users, domain experts, and savvy (parallel) programmers. There is often a trade-off between ease of programming and performance when using these abstractions. This thesis develops tools to bridge the gap between ease of programming and performance of irregular programs (programs that involve one or more of irregular data structures, control structures, and communication patterns) on distributed-memory systems.

Irregular programs feature heavily in domains ranging from data mining to bioinformatics to scientific computing. In contrast to regular applications such as stencil codes and dense matrix-matrix multiplication, which have predictable patterns of data access and control flow, typical irregular applications operate over graphs, trees, and sparse matrices and exhibit input-dependent data access patterns and control flow. This makes it difficult to apply optimizations such as those targeting locality and parallelism to programs implementing irregular applications. Moreover, irregular programs are often used with large data sets that prohibit single-node execution due to memory limitations on the node. Hence, distributed solutions are necessary in order to process all the data.

In this thesis, we introduce SPIRIT, a framework consisting of an abstraction and a space-adaptive runtime system for simplifying the creation of distributed implementations of recursive irregular programs based on spatial acceleration structures. SPIRIT addresses the insufficiency of traditional data-parallel approaches and existing systems in effectively parallelizing computations involving repeated tree traversals. SPIRIT employs locality optimizations applied in a shared-memory context, introduces a novel pipeline-parallel approach to execute distributed traversals, and trades off performance with memory usage to create a space-adaptive system that achieves scalable performance and outperforms implementations in contemporary distributed graph-processing frameworks.

We next introduce Treelogy to understand the connection between optimizations and tree algorithms. Treelogy provides an ontology and a benchmark suite covering a broader class of tree algorithms to help answer: (i) is there an existing optimization that is applicable or effective for a new tree algorithm? (ii) can a new optimization developed for one tree algorithm be applied to existing tree algorithms from other domains? We show that a categorization (ontology) based on the structural properties of tree algorithms is useful both for developers of new optimizations and for creators of new tree algorithms. With the help of a suite of tree-traversal kernels spanning the ontology, we show that the GPU, shared-memory, and distributed-memory implementations are scalable and that the two-point correlation algorithm with a vantage-point tree (vptree) performs better than the standard kd-tree implementation.

In the final part of the thesis, we explore the possibility of automatically generating efficient distributed-memory implementations of irregular programs. As manually creating distributed-memory implementations is challenging, due to the explicit need for managing tasks, parallelism, communication, and load balancing, we introduce a framework, D2P, to automatically generate efficient distributed implementations of recursive divide-and-conquer algorithms. D2P generates a distributed implementation of a recursive divide-and-conquer algorithm from its specification, which is a high-level outline of a recursive formulation. We evaluate D2P with recursive dynamic programming (DP) algorithms. The computation in DP algorithms is not irregular per se; however, when distributed, the computation in efficient recursive formulations of DP algorithms requires irregular communication. User-configurable knobs in D2P allow tuning the amount of available parallelism. Results show that D2P programs scale well, perform significantly better than those produced using a state-of-the-art framework for parallelizing iterative DP algorithms, and outperform even hand-written distributed-memory implementations in most cases.
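The repeated-traversal pattern that SPIRIT and Treelogy target can be illustrated with a simple two-point correlation kernel: every input point walks the same spatial tree but prunes a different, input-dependent set of subtrees. The sketch below is a minimal, generic Python illustration, not SPIRIT's API or code from the thesis; the node layout, bounding-box pruning test, and function names are assumptions made for the example.

```python
# Minimal two-point correlation over a kd-tree: count, for each point,
# how many points lie within radius r. Illustrative sketch only.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Node:
    lo: Tuple[float, ...]                        # bounding-box corners of this subtree
    hi: Tuple[float, ...]
    point: Optional[Tuple[float, ...]] = None    # set only at leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def min_dist2(p, lo, hi):
    """Squared distance from point p to the node's bounding box."""
    return sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(p, lo, hi))

def correlate(node, p, r2):
    """Count points within sqrt(r2) of p; prune boxes that are too far."""
    if node is None or min_dist2(p, node.lo, node.hi) > r2:
        return 0                                 # truncate: subtree is too far away
    if node.point is not None:                   # leaf node holding one point
        d2 = sum((a - b) ** 2 for a, b in zip(p, node.point))
        return 1 if d2 <= r2 else 0
    return correlate(node.left, p, r2) + correlate(node.right, p, r2)

def two_point_correlation(root, points, r):
    # Each point repeats the same traversal but visits a different,
    # input-dependent subset of nodes, which is the source of irregularity.
    return sum(correlate(root, p, r * r) for p in points)
```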
2

OPTIMIZATIONS FOR N-BODY PROBLEMS ON HETEROGENEOUS SYSTEMS

Jianqiao Liu (6636020) 14 May 2019
N-body problems, such as simulating the motion of stars in a galaxy or evaluating spatial statistics through the n-point correlation function, are widely studied. Naive approaches to n-body problems are typically O(n^2) algorithms. Tree codes take advantage of the fact that a group of bodies can be skipped, or approximated as a single aggregate, if it is sufficiently far from the body under consideration. This reduces the complexity from O(n^2) to O(n log n). However, tree codes rely on pointer chasing and contain many branch instructions. They are highly irregular, which prevents tree codes from being easily parallelized.

GPUs offer the promise of massive, power-efficient parallelism. However, exploiting this parallelism requires code to be carefully structured to deal with the limitations of the SIMT execution model. This dissertation focuses on optimizations for n-body problems on heterogeneous systems. A general inspector-executor-based framework is proposed to automatically schedule GPU threads for high performance. Essentially, the framework lets the GPU execute part of the tree code and profile thread behavior; the CPU then reorganizes the threads to minimize divergence before the remaining portion of the traversals executes on the GPU. We apply this framework to six tree-traversal algorithms, achieving significant speedups over optimized GPU code that does not perform application-specific scheduling. Further, we show that in many cases our hybrid approach delivers better performance even than GPU code that uses hand-tuned, application-specific scheduling.

For large-scale inputs, ChaNGa is the best-of-breed n-body platform. It uses an asymptotically efficient tree-traversal strategy known as a dual-tree walk to quickly produce an accurate simulation result. On GPUs, ChaNGa uses a hybrid strategy in which the CPU performs the tree walk to determine which bodies interact while the GPU performs the force computation. In this dissertation, we show that a highly optimized single-tree-walk approach achieves better GPU performance by significantly accelerating the tree walk and reducing CPU/GPU communication. Our experiments show that this new design achieves an 8.25x speedup over baseline ChaNGa in a one-node, one-process-per-node configuration. We also point out that ChaNGa's implementation does not satisfy the inclusion condition, so a GPU-centric remote tree walk does not perform well.
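The tree-code idea summarized above, skipping or approximating a distant group of bodies, is the classic Barnes-Hut scheme. The sketch below is a generic Python illustration using the standard s/d < theta opening criterion; it is not code from the dissertation or from ChaNGa, and the theta value, softening term, and force expression are illustrative assumptions.

```python
# Barnes-Hut style force evaluation: approximate distant cells by their
# centre of mass, reducing O(n^2) pair interactions to roughly O(n log n).
import math
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Cell:
    size: float                     # side length of the cell
    mass: float                     # total mass of bodies in the cell
    com: tuple                      # centre of mass (x, y, z)
    children: List["Cell"] = field(default_factory=list)
    body: Optional[tuple] = None    # (x, y, z, m) if the cell is a single body

def accel_on(pos, cell, theta=0.5, eps=1e-9):
    """Acceleration at `pos` due to `cell`, opening cells only when needed."""
    dx = [c - p for c, p in zip(cell.com, pos)]
    d = math.sqrt(sum(x * x for x in dx)) + eps
    if cell.body is not None or cell.size / d < theta:
        # Far enough away (or a single body): treat the whole cell as one mass.
        f = cell.mass / (d * d * d)
        return [f * x for x in dx]
    # Too close: open the cell and recurse into its children (pointer chasing
    # and data-dependent branches, i.e. the irregularity noted above).
    acc = [0.0, 0.0, 0.0]
    for child in cell.children:
        a = accel_on(pos, child, theta, eps)
        acc = [s + x for s, x in zip(acc, a)]
    return acc

# Toy usage: two bodies grouped in one parent cell, evaluated from far away.
b1 = Cell(size=0.0, mass=1.0, com=(0.0, 0.0, 0.0), body=(0.0, 0.0, 0.0, 1.0))
b2 = Cell(size=0.0, mass=1.0, com=(1.0, 0.0, 0.0), body=(1.0, 0.0, 0.0, 1.0))
root = Cell(size=1.0, mass=2.0, com=(0.5, 0.0, 0.0), children=[b1, b2])
print(accel_on((5.0, 0.0, 0.0), root))   # far away: the cell is not opened
```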
3

Accelerated, Collaborative & Extended BlobTree Modelling

Grasberger, Herbert 23 April 2015
BlobTree modelling has been used in several solid-modelling packages to rapidly prototype models by making use of Boolean and sketch-based modelling. Using these two techniques, a user can quickly create complex models as combinations of simple primitives and sketched objects. Because the BlobTree is based on continuous field values, it offers many possibilities to create and control smooth transitions between surfaces, something that is more complicated in other modelling approaches. In addition, the data required to describe a BlobTree is very compact. Despite these advantages, the BlobTree has not yet been integrated into state-of-the-art industrial modelling workflows. This thesis identifies some shortcomings of the BlobTree, presents potential solutions to those problems, and demonstrates an application that makes use of the BlobTree's compact representation.

A main criticism is that evaluating a large BlobTree can be quite expensive, and therefore many applications are limited in the complexity of models that can be created interactively. This work presents an alternative way of traversing a BlobTree that lowers the time to calculate field values by at least an order of magnitude. As a result, the limit on model complexity is raised for interactive modelling applications.

In some domains, certain models need more than one designer or engineer to create. Often, several iterations of a model are shared between multiple participants until it is finalized. Because the description of a BlobTree is very compact, it can be synchronized efficiently in a collaborative modelling environment. This work presents CollabBlob, an approach to collaborative modelling based on the BlobTree. CollabBlob is lock-free and provides interactive feedback to all participants, which helps with fast iteration in the modelling process.

In order to extend the range of models that can be created within CollabBlob, two areas of BlobTree modelling are improved in this thesis. CAD modelling often makes use of a feature called filleting to add surface features that could be caused by a manufacturing process. Filleting in general creates smooth transitions between surfaces, something the BlobTree can do with less mathematical complexity than the approaches needed in Constructive Solid Geometry (CSG), in the case of fillets between primitives. However, little research has been done on the construction of fillets between surfaces of a single BlobTree primitive. This work outlines Angle-Based Filleting and the Surface Fillet Curve, two solutions that improve the specification of fillets in the BlobTree.

Sketch-based implicit modelling generates 3D shapes from 2D sketches by sampling the drawn shape and using the samples to create the implicit field via variational interpolation. Additional samples inside and outside the sketched shape are needed to generate a field compatible with BlobTree modelling, and state-of-the-art approaches use offset curves of the sketch to generate these samples. The approach presented in this work reduces the number of sample points, thus accelerating the interpolation time and improving the resulting implicit field.
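The continuous field values and operator nodes at the heart of the BlobTree can be illustrated with a minimal field-evaluation sketch. This is not the thesis implementation: the polynomial falloff and the sum/max/min choices for blend, union, and intersection are common BlobTree conventions used here as assumptions, written in plain Python.

```python
# Minimal BlobTree field evaluation: leaves are skeletal primitives with a
# smooth falloff, interior nodes combine child fields (blend, union, ...).
def point_primitive(center, radius):
    """Field of a point skeleton: 1 at the centre, falling smoothly to 0."""
    def field(p):
        d2 = sum((a - b) ** 2 for a, b in zip(p, center))
        t = d2 / (radius * radius)
        return max(0.0, 1.0 - t) ** 2        # simple polynomial falloff
    return field

def blend(*children):                        # smooth blend: sum of child fields
    return lambda p: sum(c(p) for c in children)

def union(*children):                        # CSG-like union: max of child fields
    return lambda p: max(c(p) for c in children)

def intersection(*children):                 # CSG-like intersection: min of fields
    return lambda p: min(c(p) for c in children)

# The surface is an iso-contour of the field (e.g. field(p) == 0.5); every
# evaluation traverses the tree, which is why faster traversal matters.
model = blend(point_primitive((0.0, 0.0, 0.0), 1.0),
              point_primitive((0.8, 0.0, 0.0), 1.0))
print(model((0.4, 0.0, 0.0)))                # field value between the two blobs
```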
4

JavaFX Scene Graph Object Serialization

Khodabandehloo, Elmira January 2013
Data visualization is used to analyze and perceive patterns in data. One use case of visualization is to graphically represent and compare simulation results. At Ericsson Research, a visualization platform based on JavaFX 2 is used to visualize simulation results. Three configuration files are required to create an application based on the visualization tool: XML, FXML, and CSS. The current problem is that, in order to set up a visualization application, the three configuration files must be written by hand, which is a very tedious task. The purpose of this study is to reduce the amount of work required to construct a visualization application by providing a serialization function that makes it possible to save the layout (FXML) of the application at run-time, based solely on the scene graph. In this master's thesis, possible frameworks that might ease the implementation of generic FXML serialization have been investigated, and the most promising alternative according to a number of evaluation metrics has been identified. Then, using a design-science research method, an algorithm is proposed that is capable of generic object/bean serialization to FXML based on a number of features or requirements. Finally, the implementation results are evaluated through a set of test cases. The evaluation consists of an analysis of the serialization results and tests, and a comparison of the expected and actual results using unit testing and test-coverage measurements. The evaluation results for each serialization function show that the serialization output is similar to the original files; hence, the proposed algorithm provides the desired serialization functionality for the specific features of FXML needed for this platform, provided that the tests covered every aspect of the serialization functionality.
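The core of the proposed serialization, recovering a declarative layout file from a live scene graph, can be sketched generically: walk the object graph, emit one element per node, map simple readable properties to attributes, and recurse into children. The sketch below uses plain Python and xml.etree.ElementTree rather than JavaFX/FXML; the SceneNode, VBox, and Label classes and the property model are hypothetical stand-ins for the real scene-graph API.

```python
# Generic sketch of scene-graph-to-markup serialization: each node becomes
# an element, simple readable properties become attributes, children recurse.
import xml.etree.ElementTree as ET

class SceneNode:
    def __init__(self, **props):
        self.__dict__.update(props)    # hypothetical property bag
        self.children = []

def serialize(node):
    elem = ET.Element(type(node).__name__)
    for name, value in vars(node).items():
        if name == "children":
            continue
        if isinstance(value, (str, int, float, bool)):
            elem.set(name, str(value))           # simple property -> attribute
    for child in node.children:
        elem.append(serialize(child))            # nested node -> child element
    return elem

class VBox(SceneNode): pass
class Label(SceneNode): pass

root = VBox(spacing=10)
root.children.append(Label(text="Hello", wrapText=True))
print(ET.tostring(serialize(root), encoding="unicode"))
# prints something like: <VBox spacing="10"><Label text="Hello" wrapText="True" /></VBox>
```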
5

Efficient multi-class object detection with a hierarchy of classes

Odabai Fard, Seyed Hamidreza 20 November 2015
In this work, we present a new multi-class detection approach based on a hierarchical traversal of classifiers learned simultaneously. For greater robustness and speed, we propose using a tree of object classes. Our detection model is learned by combining ranking and classification constraints in a single optimization problem. Our convex formulation allows a search algorithm to accelerate execution time. We evaluated our algorithm on the PASCAL VOC benchmarks (2007 and 2010). Compared to the one-versus-all approach, our method improves performance over 20 classes and is 10x faster.

Recent years have witnessed a competition in autonomous navigation for vehicles boosted by the advances in computer vision. On-board cameras are capable of understanding the semantic content of the environment. A core component of such a system is to localize and classify objects in urban scenes, so there is a need for multi-class object detection systems. Designing such an efficient system is a challenging and active research area, with applications in autonomous driving, object search in images, and video surveillance. The scale of object classes varies depending on the task. Datasets for object detection started out containing one class only, e.g. the popular INRIA Person dataset. Nowadays, we witness an expansion of datasets consisting of more training data and more object classes. This thesis proposes a solution to efficiently learn a multi-class object detector. The task of such a system is to localize all instances of the target object classes in an input image. We distinguish between three major efficiency criteria. First, the detection performance measures the accuracy of detection. Second, we strive for low execution times at run-time. Third, we address the scalability of our novel detection framework: the two previous criteria should scale suitably with the number of input classes, and the training algorithm has to take a reasonable amount of time when learning from these larger datasets. Although single-class object detection has seen considerable improvement over the years, it remains a challenge to create algorithms that work well with any number of classes. Most works on this subject extend single-class detectors to work with multiple classes but remain hardly flexible with respect to new object descriptors; moreover, they do not consider all three criteria at the same time. Others use a more traditional approach, iteratively executing a single-class detector for each target class, which scales linearly in training time and run-time. To tackle these challenges, we present a novel framework in which, for an input patch during detection, the closest class is ranked highest and background labels are rejected as negative samples. The detection goal is to find the highest-scoring class. To this end, we derive a convex problem formulation that combines ranking and classification constraints. The accuracy of the system is improved by hierarchically arranging the classes into a tree of classifiers. The leaf nodes represent the individual classes, and the intermediate nodes, called super-classes, recursively group these classes together. The super-classes benefit from the shared knowledge of their descendant classes. All these classifiers are learned in a joint optimization problem along with the previously mentioned constraints.

The increased number of classifiers works against rapid execution times. The formulation of the detection goal naturally allows the use of an adapted tree-traversal algorithm that progressively searches for the best class while rejecting background samples early in the detection process, consequently reducing the system's run-time. Our system balances detection performance and speed-up. We further experimented with feature reduction to decrease the overhead of applying the high-level classifiers in the tree. The framework is transparent to the object descriptor used; we implemented the histogram of oriented gradients and the deformable part model of [Felzenszwalb et al., 2010a]. The capabilities of our system are demonstrated on two challenging datasets containing different object categories that are not necessarily semantically related. We evaluate both the detection performance with different numbers of classes and the scalability with respect to run-time. Our experiments show that this framework fulfills the requirements of a multi-class object detector and highlights the advantages of structuring class-level knowledge.
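The traversal that rejects background early and returns the highest-scoring class can be pictured as a greedy descent of the classifier tree. The sketch below is a generic Python illustration rather than the thesis formulation: the linear scoring functions, rejection threshold, and toy class hierarchy are assumptions made for the example.

```python
# Greedy traversal of a hierarchy of classifiers: score the children of the
# current node, reject the patch early if no super-class clears a threshold,
# otherwise descend into the best-scoring branch until a leaf class is reached.
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence

@dataclass
class ClassifierNode:
    label: str                                   # class or super-class name
    score: Callable[[Sequence[float]], float]    # w . x style scoring function
    children: List["ClassifierNode"] = field(default_factory=list)

def detect(root: ClassifierNode, x: Sequence[float],
           reject_threshold: float = 0.0) -> Optional[str]:
    node = root
    while node.children:
        best = max(node.children, key=lambda c: c.score(x))
        if best.score(x) < reject_threshold:
            return None          # early rejection: treat the patch as background
        node = best              # descend into the highest-ranked super-class
    return node.label            # leaf reached: the predicted object class

# Toy usage with linear scorers over a 2-D feature vector.
dot = lambda w: (lambda x: sum(a * b for a, b in zip(w, x)))
tree = ClassifierNode("root", dot((0.0, 0.0)), [
    ClassifierNode("vehicle", dot((1.0, -0.2)), [
        ClassifierNode("car", dot((1.2, -0.5))),
        ClassifierNode("bus", dot((0.9, 0.1))),
    ]),
    ClassifierNode("animal", dot((-0.3, 1.0))),
])
print(detect(tree, (0.8, 0.1)))   # 'car' for this toy feature vector
```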
6

Locality Optimizations for Regular and Irregular Applications

Rajbhandari, Samyam 28 December 2016
No description available.
