21

Scalable and Efficient Analysis of Large High-Dimensional Data Sets in the Context of Recurrence Analysis

Rawald, Tobias 13 February 2018 (has links)
Recurrence quantification analysis (RQA) is a method from nonlinear time series analysis. It relies on the identification of line structures within so-called recurrence matrices and comprises a set of scalar measures. Existing computing approaches to RQA are either not capable of processing recurrence matrices exceeding a certain size or suffer from long runtimes for time series that contain hundreds of thousands of data points. This thesis introduces scalable recurrence analysis (SRA), an alternative computing approach that subdivides a recurrence matrix into multiple submatrices. Each submatrix is processed individually in a massively parallel manner by a single compute device. This is implemented, by way of example, using the OpenCL framework. It is shown that this approach delivers considerable performance improvements in comparison to state-of-the-art RQA software by exploiting the computing capabilities of many-core hardware architectures, in particular graphics cards. The use of OpenCL makes it possible to execute identical SRA implementations on a variety of hardware platforms with different architectural properties. An extensive evaluation analyses the impact of applying concepts from database technology, such as memory storage layouts, to the RQA processing pipeline. It is investigated how different realisations of these concepts affect the performance of the computations on different types of compute devices. Finally, an approach based on automatic performance tuning is introduced that automatically selects well-performing RQA implementations for a given analytical scenario on specific computing hardware. Among other results, it is demonstrated that the customised auto-tuning approach considerably increases the efficiency of the processing by adapting the implementation selection.
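To make the submatrix idea concrete, the following minimal Python sketch computes a single RQA measure, the recurrence rate, by tiling the implicit recurrence matrix and processing one tile at a time. It is an illustration only, assuming an unembedded one-dimensional series and a plain serial loop instead of OpenCL kernels; the function name, tile size and threshold are chosen for this example rather than taken from the thesis.

```python
import numpy as np

def recurrence_rate_tiled(series, radius, tile=1024):
    """Recurrence rate of a 1-d series, computed tile by tile so that
    only one submatrix of the recurrence matrix is held in memory at a time
    (serial sketch of the submatrix decomposition idea)."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    recurrent = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            # Pairwise distances for this submatrix only: memory stays
            # O(tile^2) instead of O(n^2) for the full recurrence matrix.
            block = np.abs(x[i0:i0 + tile, None] - x[None, j0:j0 + tile])
            recurrent += int(np.count_nonzero(block <= radius))
    return recurrent / (n * n)

# Example: recurrence rate of a noisy sine wave.
t = np.linspace(0, 20 * np.pi, 4000)
signal = np.sin(t) + 0.1 * np.random.default_rng(0).normal(size=t.size)
print(recurrence_rate_tiled(signal, radius=0.2))
```

A full RQA implementation would additionally work on time-delay embedded state vectors and quantify the diagonal and vertical line structures mentioned above; the tiling pattern is the part that carries over to the massively parallel setting.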
22

Designing scientific workflows following a structure and provenance-aware strategy

Chen, Jiuqiang 11 October 2013 (has links) (PDF)
Scientific workflow systems provide provenance modules that collect information about executions (the data consumed and produced), making it possible to ensure the reproducibility of an experiment. For several reasons, the structural complexity of workflows and of their executions keeps increasing, which makes workflow reuse more difficult. The overall goal of this thesis is to improve workflow reuse by providing strategies that reduce the complexity of workflow structures while preserving provenance. Two strategies are introduced. First, we introduce SPFlow, a provenance-preserving rewriting algorithm for scientific workflows that transforms any directed acyclic graph (DAG) into a simpler series-parallel (SP) structure. These structures allow the design of polynomial-time algorithms for complex operations on workflows (for example, comparing them), whereas the same operations are NP-hard for general DAG structures. Second, we propose a technique that reduces the redundancy present in workflows by detecting and removing the patterns responsible for it, called "anti-patterns". We designed the DistillFlow algorithm, which transforms a workflow into a semantically equivalent "distilled" workflow with a more concise structure from which as many anti-patterns as possible have been removed. Our solutions (SPFlow and DistillFlow) have been tested systematically on large collections of real workflows, in particular with the Taverna system. Our tools are available at: https://www.lri.fr/~chenj/.
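As an illustration of why series-parallel structures are attractive, the sketch below tests whether a two-terminal DAG is series-parallel by repeatedly applying the classical series and parallel reductions. This is not the SPFlow algorithm itself, which additionally rewrites non-SP workflows while preserving provenance; the edge-list representation and the function name are assumptions made for this example.

```python
from collections import Counter

def is_series_parallel(edges, source, sink):
    """Check whether a two-terminal DAG is series-parallel by repeatedly
    applying series and parallel reductions until no rule applies."""
    multi = Counter(edges)                 # multiset of directed edges
    changed = True
    while changed:
        changed = False
        # Parallel reduction: collapse duplicate edges between the same pair.
        for e in list(multi):
            if multi[e] > 1:
                multi[e] = 1
                changed = True
        # Series reduction: bypass an internal vertex with in- and out-degree 1.
        indeg, outdeg = Counter(), Counter()
        for (u, v), c in multi.items():
            outdeg[u] += c
            indeg[v] += c
        for v in list(indeg):
            if v in (source, sink):
                continue
            if indeg[v] == 1 and outdeg[v] == 1:
                (u, _), = [e for e in multi if e[1] == v]
                (_, w), = [e for e in multi if e[0] == v]
                del multi[(u, v)], multi[(v, w)]
                multi[(u, w)] += 1
                changed = True
                break                      # degrees changed; recompute them
    return set(multi) == {(source, sink)}

diamond = [("s", "a"), ("s", "b"), ("a", "t"), ("b", "t")]
n_graph = diamond + [("a", "b")]           # the forbidden "N" pattern
print(is_series_parallel(diamond, "s", "t"))  # True
print(is_series_parallel(n_graph, "s", "t"))  # False
```

A graph that reduces to the single edge (source, sink) is series-parallel; the "N" pattern above is the smallest structure that blocks the reductions, which is the kind of shape a rewriting algorithm like SPFlow has to untangle.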
23

Frequent itemset mining on multiprocessor systems

Schlegel, Benjamin 08 May 2014 (has links) (PDF)
Frequent itemset mining is an important building block in many data mining applications like market basket analysis, recommendation, web mining, fraud detection, and gene expression analysis. In many of them, the datasets being mined can easily grow up to hundreds of gigabytes or even terabytes of data. Hence, efficient algorithms are required to process such large amounts of data. In recent years, many frequent-itemset mining algorithms have been proposed, which, however, (1) often have high memory requirements and (2) do not exploit the large degrees of parallelism provided by modern multiprocessor systems. The high memory requirements arise mainly from inefficient data structures that have only been shown to be sufficient for small datasets. For large datasets, however, the use of these data structures forces the algorithms to go out-of-core, i.e., they have to access secondary memory, which leads to serious performance degradation. Exploiting the available parallelism is further required to mine large datasets because the serial performance of processors has almost stopped increasing. Algorithms should therefore exploit the large number of available threads as well as other kinds of parallelism (e.g., vector instruction sets) besides thread-level parallelism.
In this work, we tackle the high memory requirements of frequent itemset mining in two ways: we (1) compress the datasets being mined, because they must be kept in main memory during several mining invocations, and (2) improve existing mining algorithms with memory-efficient data structures. For compressing the datasets, we employ efficient encodings that show good compression performance on a wide variety of realistic datasets, i.e., the size of the datasets is reduced by up to 6.4x. The encodings can further be applied directly while loading the dataset from disk or network. Since encoding and decoding are repeatedly required for loading and mining the datasets, we reduce their costs by providing parallel encodings that achieve high throughput for both tasks. For a memory-efficient representation of the mining algorithms' intermediate data, we propose compact data structures and even employ explicit compression. Both methods together reduce the size of the intermediate data by up to 25x. The smaller memory requirements avoid or delay expensive out-of-core computation when large datasets are mined.
To cope with the high parallelism provided by current multiprocessor systems, we identify the performance hot spots and scalability issues of existing frequent-itemset mining algorithms. The hot spots, which form basic building blocks of these algorithms, cover (1) counting the frequency of fixed-length strings, (2) building prefix trees, (3) compressing integer values, and (4) intersecting lists of sorted integer values or bitmaps. For all of them, we discuss how to exploit the available parallelism and provide scalable solutions. Furthermore, almost all components of the mining algorithms must be parallelized to keep the sequential fraction of the algorithms as small as possible. We integrate the parallelized building blocks and components into three well-known mining algorithms and further analyze the impact of certain existing optimizations. Even single-threaded, our algorithms are often up to an order of magnitude faster than existing highly optimized algorithms, and they scale almost linearly on a large 32-core multiprocessor system.
Although our optimizations are intended for frequent-itemset mining algorithms, they can be applied with only minor changes to algorithms used for mining other types of itemsets.
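The sorted-list and bitmap intersections named as a building block above are the core of Eclat-style miners. The following toy Python sketch mines frequent itemsets by intersecting transaction-id sets; it is a serial, unoptimized illustration of that building block, not the compressed, parallel implementations developed in the thesis, and the dataset and parameter names are invented for the example.

```python
from itertools import combinations

def eclat(transactions, min_support):
    """Toy Eclat-style miner: represent each item by the set of transaction
    ids containing it and grow itemsets by intersecting those tid-sets."""
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    # Frequent 1-itemsets.
    frequent = {(item,): tids for item, tids in tidsets.items()
                if len(tids) >= min_support}
    result = dict(frequent)
    # Grow itemsets level by level by intersecting tid-sets of prefix-sharing sets.
    while frequent:
        next_level = {}
        for (a, ta), (b, tb) in combinations(sorted(frequent.items()), 2):
            if a[:-1] == b[:-1]:            # same (k-1)-prefix
                candidate = a + b[-1:]
                tids = ta & tb
                if len(tids) >= min_support:
                    next_level[candidate] = tids
        result.update(next_level)
        frequent = next_level
    return {itemset: len(tids) for itemset, tids in result.items()}

# Example: support counts for a tiny market-basket dataset.
baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
print(eclat(baskets, min_support=2))
```

In an optimized miner the tid-sets would be stored as compressed integer lists or bitmaps, and the intersections would be vectorized and parallelized, which is exactly where the hot spots discussed in the abstract appear.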
25

Towards Next Generation Sequential and Parallel SAT Solvers

Manthey, Norbert 01 December 2014 (has links)
This thesis focuses on improving SAT solving technology in two major areas: sequential SAT solving and parallel SAT solving. To better understand sequential SAT algorithms, the abstract reduction system Generic CDCL is introduced. With Generic CDCL, the soundness of solving techniques can be modeled. Next, the conflict-driven clause learning algorithm is extended with three techniques, namely local look-ahead, local probing and all-UIP learning, that allow more global reasoning during search. These techniques improve the performance of the sequential SAT solver Riss. Then, the formula simplification techniques bounded variable addition, covered literal elimination and an advanced cardinality constraint extraction are introduced. By using these techniques, the reasoning of the overall SAT solving tool chain becomes stronger than plain resolution. Applying these three techniques in the formula simplification tool Coprocessor before solving a formula with Riss improves the performance further. Due to the increasing number of cores in CPUs, the scalable parallel SAT solving approach iterative partitioning has been implemented in Pcasso for multi-core architectures. Related work on parallel SAT solving has been studied to extract the main ideas that can improve Pcasso. Besides parallel formula simplification with bounded variable elimination, the major extension is extended clause sharing with level-based clause tagging, which forms the basis for conflict-driven node killing. The latter makes it possible to better identify unsatisfiable search space partitions. Another improvement is to combine scattering and look-ahead into a superior search space partitioning function. In combination with Coprocessor, the introduced extensions increase the performance of the parallel solver Pcasso. The implemented system turns out to be scalable for multi-core architectures; hence, iterative partitioning is interesting for future parallel SAT solvers. The implemented solvers participated in international SAT competitions. In 2013 and 2014, Pcasso showed good performance. Riss in combination with Coprocessor won several first, second and third prizes, including two Kurt Gödel medals. Hence, the introduced algorithms improve modern SAT solving technology.
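For readers unfamiliar with the baseline that CDCL extends, the sketch below is a tiny DPLL solver with unit propagation over DIMACS-style integer literals. It is purely illustrative and assumes nothing from the thesis: solvers such as Riss add conflict-driven clause learning, watched literals, restarts and the preprocessing techniques described above, none of which appear in this toy version.

```python
def dpll(clauses, assignment=None):
    """Tiny DPLL solver over clauses given as lists of signed integers
    (1 means x1, -1 means NOT x1). Returns a satisfying assignment as
    {variable: bool}, or None if the formula is unsatisfiable."""
    assignment = dict(assignment or {})

    def simplify(cls, lit):
        # Drop satisfied clauses and remove the falsified literal -lit;
        # return None if an empty (unsatisfiable) clause appears.
        out = []
        for clause in cls:
            if lit in clause:
                continue
            reduced = [l for l in clause if l != -lit]
            if not reduced:
                return None
            out.append(reduced)
        return out

    # Unit propagation: assign literals forced by single-literal clauses.
    while True:
        units = [c[0] for c in clauses if len(c) == 1]
        if not units:
            break
        for lit in units:
            assignment[abs(lit)] = lit > 0
            clauses = simplify(clauses, lit)
            if clauses is None:
                return None

    if not clauses:
        return assignment

    # Branch on the first literal of the first remaining clause.
    lit = clauses[0][0]
    for choice in (lit, -lit):
        reduced = simplify(clauses, choice)
        if reduced is None:
            continue
        result = dpll(reduced, {**assignment, abs(choice): choice > 0})
        if result is not None:
            return result
    return None

# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
print(dpll([[1, 2], [-1, 3], [-2, -3]]))
```

Techniques like all-UIP learning or bounded variable addition operate on top of this search skeleton, either by learning new clauses from conflicts or by rewriting the clause set before the search starts.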
26

Efficient Parallel Monte-Carlo Simulations for Large-Scale Studies of Surface Growth Processes

Kelling, Jeffrey 21 August 2018 (has links)
Lattice Monte Carlo methods are used to investigate far-from-equilibrium and out-of-equilibrium systems, including surface growth, spin systems and solid mixtures. Applications range from the determination of universal growth or aging behaviors to concrete systems, where the coarsening of nanocomposites or the self-organization of functional nanostructures are of interest. Such studies require observations of large systems over long time scales, to allow structures to grow over orders of magnitude, which necessitates massively parallel simulations. This work addresses the problem that parallel processing introduces correlations into Monte Carlo updates and proposes a virtually correlation-free domain decomposition scheme to solve it. The effect of correlations on scaling and dynamical properties of surface growth systems and related lattice gases is investigated further by comparing results obtained from correlation-free and intrinsically correlated but highly efficient simulations using a stochastic cellular automaton (SCA). Efficient massively parallel implementations on graphics processing units (GPUs) were developed, which enable large-scale simulations leading to unprecedented precision in the final results. The primary subject of study is Kardar–Parisi–Zhang (KPZ) surface growth in (2 + 1) dimensions, which is simulated using a dimer lattice gas and the restricted solid-on-solid (RSOS) model. Using extensive simulations, conjectures regarding growth, autocorrelation and autoresponse properties are tested and new, precise numerical predictions for several universal parameters are made.
Table of contents:
1. Introduction 1.1. Motivations and Goals 1.2. Overview
2. Methods and Models 2.1. Estimation of Scaling Exponents and Error Margins 2.2. From Continuum to Atomistic Models 2.3. Models for Phase Ordering and Nanostructure Evolution 2.3.1. The Kinetic Metropolis Lattice Monte-Carlo Method 2.3.2. The Potts Model 2.4. The Kardar–Parisi–Zhang and Edwards–Wilkinson Universality Classes 2.4.0.1. Physical Aging 2.4.1. The Octahedron Model 2.4.2. The Restricted Solid on Solid Model
3. Parallel Implementation: Towards Large-Scale Simulations 3.1. Parallel Architectures and Programming Models 3.1.1. CPU 3.1.2. GPU 3.1.3. Heterogeneous Parallelism and MPI 3.1.4. Bit-Coding of Lattice Sites 3.2. Domain Decomposition for Stochastic Lattice Models 3.2.1. DD for Asynchronous Updates 3.2.1.1. Dead border (DB) 3.2.1.2. Double tiling (DT) 3.2.1.3. DT DD with random origin (DTr) 3.2.1.4. Implementation 3.2.2. Second DD Layer on GPUs 3.2.2.1. Single-Hit DT 3.2.2.2. Single-Hit dead border (DB) 3.2.2.3. DD Parameters for the Octahedron Model 3.2.3. Performance 3.3. Lattice Level DD: Stochastic Cellular Automaton 3.3.1. Local Approach for the Octahedron Model 3.3.2. Non-Local Approach for the Octahedron Model 3.3.2.1. Bit-Vectorized GPU Implementation 3.3.3. Performance of SCA Implementations 3.4. The Multi-Surface Coding Approach 3.4.0.1. Vectorization 3.4.0.2. Scalar Updates 3.4.0.3. Domain Decomposition 3.4.1. Implementation: SkyMC 3.4.1.1. 2d Restricted Solid on Solid Model 3.4.1.2. 2d and 3d Potts Model 3.4.1.3. Sequential CPU Reference 3.4.2. SkyMC Benchmarks 3.5. Measurements 3.5.0.1. Measurement Intervals 3.5.0.2. Measuring using Heterogeneous Resources
4. Monte-Carlo Investigation of the Kardar–Parisi–Zhang Universality Class 4.1. Evolution of Surface Roughness 4.1.1. Comparison of Parallel Implementations of the Octahedron Model 4.1.1.1. The Growth Regime 4.1.1.2. Distribution of Interface Heights in the Growth Regime 4.1.1.3. KPZ Ansatz for the Growth Regime 4.1.1.4. The Steady State 4.1.2. Investigations using RSOS 4.1.2.1. The Growth Regime 4.1.2.2. The Steady State 4.1.2.3. Consistency of Finite-Size Scaling with Respect to DD 4.1.3. Results for Growth Phase and Steady State 4.2. Autocorrelation Functions 4.2.1. Comparison of DD Methods for RS Dynamics 4.2.1.1. Device-Layer DD 4.2.1.2. Block-Layer DD 4.2.2. Autocorrelation Properties under RS Dynamics 4.2.3. Autocorrelation Properties under SCA Dynamics 4.2.3.1. Autocorrelation of Heights 4.2.3.2. Autocorrelation of Slopes 4.2.4. Autocorrelation in the SCA Steady State 4.2.5. Autocorrelation in the EW Case under SCA 4.2.5.1. Autocorrelation of Heights 4.2.5.2. Autocorrelations of Slopes 4.3. Autoresponse Functions 4.3.1. Autoresponse Properties 4.3.1.1. Autoresponse of Heights 4.3.1.2. Autoresponse of Slopes 4.3.1.3. Self-Averaging 4.4. Summary
5. Further Topics 5.1. Investigations of the Potts Model 5.1.1. Testing Results from the Parallel Implementations 5.1.2. Domain Growth in Disordered Potts Models 5.2. Local Scale Invariance in KPZ Surface Growth
6. Conclusions and Outlook
Acknowledgements
A. Coding Details A.1. Bit-Coding A.2. Packing and Unpacking Signed Integers A.3. Random Number Generation
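As a small, serial illustration of the kind of lattice model studied here, the sketch below grows a (1+1)-dimensional restricted solid-on-solid surface and records the interface width over time; in the KPZ class this width grows roughly as t^(1/3). The thesis treats the (2+1)-dimensional case with GPU domain decomposition, so this code is only a conceptual toy with parameter values chosen for the example.

```python
import random

def rsos_growth(length, sweeps, seed=0):
    """Grow a (1+1)-dimensional restricted solid-on-solid (RSOS) surface:
    a deposition at a random column is accepted only if all nearest-neighbour
    height differences stay within +/-1. Returns the interface width after
    each sweep (one sweep = `length` deposition attempts)."""
    rng = random.Random(seed)
    h = [0] * length                       # periodic height profile
    width = []
    for _ in range(sweeps):
        for _ in range(length):
            i = rng.randrange(length)
            left, right = h[(i - 1) % length], h[(i + 1) % length]
            if abs(h[i] + 1 - left) <= 1 and abs(h[i] + 1 - right) <= 1:
                h[i] += 1                  # deposition respects the RSOS rule
        mean = sum(h) / length
        width.append((sum((x - mean) ** 2 for x in h) / length) ** 0.5)
    return width

# The width should grow roughly as t^beta with beta = 1/3 for KPZ in 1+1 d.
print(rsos_growth(length=256, sweeps=100)[::20])
```

Parallelizing such updates is where the correlation problem discussed in the abstract arises: if two workers update neighbouring columns at the same time, their acceptance decisions are no longer independent, which is what the dead-border and double-tiling domain decompositions are designed to control.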
27

Dynamic Clustering and Visualization of Smart Data via D3-3D-LSA / with Applications for QuantNet 2.0 and GitHub

Borke, Lukas 08 September 2017 (has links)
With the growing popularity of GitHub, the largest host of source code and the largest collaboration platform in the world, it has evolved into a Big Data resource offering a variety of Open Source repositories (OSR). At present, there are more than one million organizations on GitHub, among them Google, Facebook, Twitter, Yahoo, CRAN, RStudio, D3, Plotly and many more. GitHub provides an extensive REST API, which enables scientists to retrieve valuable information about the software and research development life cycles. Our research pursues two main objectives: (I) provide an automatic OSR categorization system for data science teams and software developers, promoting discoverability, technology transfer and coexistence; (II) establish visual data exploration and topic-driven navigation of GitHub organizations for collaborative reproducible research and web deployment. To transform Big Data into value, in other words into Smart Data, storing and processing the data semantics and metadata is essential. Furthermore, the choice of an adequate text mining (TM) model is important. The dynamic calibration of metadata configurations, TM models (VSM, GVSM, LSA), clustering methods and clustering quality indices will be shortened as "smart clusterization". Data-Driven Documents (D3) and Three.js (3D) are JavaScript libraries for producing dynamic, interactive data visualizations, featuring hardware acceleration for rendering complex 2D or 3D computer animations of large data sets. Both techniques enable visual data mining (VDM) in web browsers and will be abbreviated as D3-3D. Latent Semantic Analysis (LSA) measures semantic information through co-occurrence analysis in the text corpus. Its properties and applicability for Big Data analytics will be demonstrated. "Smart clusterization" combined with the dynamic VDM capabilities of D3-3D will be summarized under the term "Dynamic Clustering and Visualization of Smart Data via D3-3D-LSA".
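A minimal sketch of the LSA-plus-clustering part of such a pipeline, using standard scikit-learn components (a TF-IDF vector space model, truncated SVD for LSA, and k-means). The repository descriptions, component count and cluster number below are placeholders; the QuantNet/GitHub pipeline described above additionally calibrates metadata configurations, TM models and clustering quality indices, and renders the result with D3-3D rather than printing labels.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Hypothetical README/description texts standing in for GitHub repositories.
docs = [
    "interactive data visualization with d3 and javascript",
    "3d rendering of large point clouds in the browser with three.js",
    "quantile regression and econometric analysis in R",
    "portfolio risk estimation with copulae and value at risk",
]

tfidf = TfidfVectorizer(stop_words="english")        # vector space model (VSM)
X = tfidf.fit_transform(docs)

lsa = TruncatedSVD(n_components=2, random_state=0)   # LSA: low-rank projection
X_lsa = lsa.fit_transform(X)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_lsa)
print(labels)   # e.g. visualization repos vs. quantitative-finance repos
```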
