21

Skalierbare Ausführung von Prozessanwendungen in dienstorientierten Umgebungen

Preißler, Steffen 19 November 2012 (has links) (PDF)
The structuring and use of company-internal IT infrastructures based on service-oriented architectures (SOA) and established XML technologies has grown steadily in recent years. While early SOA implementations focused on the flexible execution of classic, business-critical processes, timely data analysis and the monitoring of business-relevant events have since become further important application classes, serving both to identify problems in business operations at short notice and to detect medium- and long-term market changes and adapt the company's business processes to them flexibly. Because the three application classes historically evolved independently of one another, their processes are currently modeled and executed in separate systems. This, however, entails a number of drawbacks, which this thesis identifies and discusses in detail. Against this background, the present work derives a consolidated execution platform that allows processes of all three application classes to be modeled together and executed efficiently in an SOA-based infrastructure. The thesis addresses the problems of such a consolidated execution platform on three levels: service communication, process execution, and the optimal distribution of SOA components across an infrastructure.
22

QPPT: Query Processing on Prefix Trees

Kissinger, Thomas, Schlegel, Benjamin, Habich, Dirk, Lehner, Wolfgang 28 May 2013 (has links) (PDF)
Modern database systems have to process huge amounts of data and at the same time should provide results with low latency. To achieve this, data is nowadays typically held completely in main memory to benefit from its high bandwidth and low access latency, which could never be reached with disks. Current in-memory databases are usually column stores that exchange columns or vectors between operators and suffer from a high tuple-reconstruction overhead. In this paper, we present the indexed table-at-a-time processing model that makes indexes first-class citizens of the database system. The processing model comprises the concepts of intermediate indexed tables and cooperative operators, which make indexes the common data exchange format between plan operators. To keep the intermediate index materialization costs low, we employ optimized prefix trees that offer a balanced read/write performance. The indexed table-at-a-time processing model allows the efficient construction of composed operators like the multi-way-select-join-group. Such operators speed up the processing of complex OLAP queries, so that our approach outperforms state-of-the-art in-memory databases.
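The core idea, materializing an operator's output as a prefix tree that the next operator consumes directly instead of tuple vectors, can be illustrated with a small sketch. The 4-bit chunking, the selection predicate, and all names below are illustrative assumptions; the paper's actual prefix trees are optimized far beyond this.

```python
# Minimal sketch of indexed table-at-a-time processing: a selection operator
# builds a prefix tree keyed on the grouping column while scanning, and the
# aggregation operator consumes that tree directly.

class PrefixTree:
    """Trie over 4-bit chunks of a 32-bit integer key."""
    BITS, LEVELS = 4, 8

    def __init__(self):
        self.root = {}

    def insert(self, key, payload):
        node = self.root
        for level in range(self.LEVELS):
            chunk = (key >> (32 - (level + 1) * self.BITS)) & 0xF
            last = level == self.LEVELS - 1
            node = node.setdefault(chunk, [] if last else {})
        node.append(payload)                  # leaf: payload list per key

    def scan(self, node=None, depth=0):
        """Yield the payload list of every key, in ascending key order."""
        node = self.root if node is None else node
        for chunk in sorted(node):
            if depth == self.LEVELS - 1:
                yield node[chunk]
            else:
                yield from self.scan(node[chunk], depth + 1)

rows = [(10, 3.0), (42, 1.5), (10, 2.0), (7, 9.9)]   # (group key, value)
index = PrefixTree()
for key, value in rows:
    if value > 1.0:                  # selection while building the index
        index.insert(key, value)

print([sum(group) for group in index.scan()])   # per-group sums: [9.9, 5.0, 1.5]
```

Because the tree keeps keys in order, the consuming operator gets grouping (and merge-join-style processing) essentially for free from the shared exchange format.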
23

Design von Stichproben in analytischen Datenbanken

Rösch, Philipp 28 July 2009 (has links) (PDF)
Recent studies have shown the fast and multi-dimensional growth in analytical databases: over the last four years, the data volume has risen by a factor of 10; the number of users has increased by an average of 25% per year; and the number of queries has been doubling every year since 2004. These queries have increasingly become complex join queries with aggregations; they are often of an explorative nature and are submitted interactively to the system. One option to address the need for interactivity in the context of this strong, multi-dimensional growth is the use of samples and an approximate query processing approach based on those samples. Such a solution offers significantly shorter response times as well as estimates with probabilistic error bounds. Given that joins, groupings, and aggregations are the main components of analytical queries, the following requirements for the design of samples in analytical databases arise: 1) the foreign-key integrity between the samples of foreign-key-related tables has to be preserved; 2) any existing groups have to be represented appropriately; 3) aggregation attributes have to be checked for extreme values. For each of these sub-problems, this dissertation presents a sampling technique characterized by memory-bounded samples and low estimation errors. In the first of these approaches, a correlated sampling process guarantees referential integrity while using only a minimum of additional memory. The second sampling technique considers the data distribution and, as a result, supports arbitrary groupings; all groups are appropriately represented. In the third approach, multi-column outlier handling leads to low estimation errors for any number of aggregation attributes. For all three approaches, the quality of the resulting samples is discussed and taken into account when computing memory-bounded samples. In order to keep the computation effort, and thus the system load, low, heuristics are provided for each algorithm; these are marked by high efficiency and minimal effects on sampling quality. Furthermore, the dissertation examines all possible combinations of the presented sampling techniques; such combinations further reduce estimation errors while at the same time widening the range of applicability of the resulting samples. The combination of all three techniques yields a sampling technique that meets all requirements for the design of samples in analytical databases and merges the advantages of the individual techniques, making it possible to answer a wide range of queries approximately and with high accuracy.
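The first requirement, referential integrity between samples, can be made concrete with a small sketch. One way to correlate the inclusion decisions of foreign-key-related tuples (an assumption for illustration, not necessarily the dissertation's exact scheme) is to drive them from a hash of the join key, so that a sampled fact can never reference a missing dimension tuple:

```python
# Hedged sketch: both tables flip the *same* pseudo-random coin for a given
# join key, so every sampled fact row finds its dimension row by
# construction. The 1% rate, the hash choice, and the table layout are
# illustrative assumptions.
import hashlib

def keep(join_key, rate):
    """Deterministic, uniformly distributed inclusion test for a key."""
    digest = hashlib.sha256(str(join_key).encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2.0**64   # uniform in [0, 1)
    return u < rate

dimension = [{"id": i, "region": f"R{i % 3}"} for i in range(1000)]
fact = [{"dim_id": i % 1000, "amount": float(i)} for i in range(10000)]

RATE = 0.01
dim_sample  = [r for r in dimension if keep(r["id"], RATE)]
fact_sample = [r for r in fact if keep(r["dim_id"], RATE)]

dim_ids = {r["id"] for r in dim_sample}
assert all(r["dim_id"] in dim_ids for r in fact_sample)  # no dangling references
```

No extra key storage is needed because the correlated coin flip is recomputable from the key itself, which matches the abstract's claim of minimal additional memory.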
24

Datenqualität in Sensordatenströmen / Data Quality in Sensor Data Streams

Klein, Anja 23 March 2010 (has links) (PDF)
The steady advance of intelligent sensor systems enables the automation and improvement of complex process and business decisions in a wide range of application scenarios. Sensors can be used, for example, to determine optimal maintenance dates or to control production lines. A fundamental problem here is sensor data quality, which is limited by environmental influences and sensor failures. The goal of this thesis is the development of a data quality model that provides applications and data consumers with quality information for a comprehensive assessment of uncertain sensor data. In addition to data structures for the efficient management of data quality in data streams and databases, a comprehensive data quality algebra for computing the quality of data processing results is presented. Furthermore, methods for data quality improvement are developed that are specifically adapted to the requirements of sensor data processing. The thesis is completed by approaches for user-friendly data quality querying and visualization.
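As a rough illustration of what such a data quality algebra does, the sketch below annotates each reading with quality dimensions and lets an operator derive the quality of its output from the quality of its inputs. The two dimensions and the simple averaging rule are simplifying assumptions for illustration, not the algebra defined in the thesis.

```python
from dataclasses import dataclass

@dataclass
class Reading:
    value: float
    completeness: float   # fraction of expected raw values that arrived
    accuracy: float       # confidence in the measured value

def q_average(readings):
    """Average operator that also propagates quality annotations."""
    n = len(readings)
    return Reading(
        value=sum(r.value for r in readings) / n,
        completeness=sum(r.completeness for r in readings) / n,
        accuracy=sum(r.accuracy for r in readings) / n,
    )

window = [Reading(20.1, 1.0, 0.9), Reading(19.8, 0.5, 0.9), Reading(35.0, 1.0, 0.2)]
print(q_average(window))   # value ~25.0, with reduced completeness/accuracy
```

The point is that a consumer sees not just the aggregate (skewed here by one low-accuracy outlier) but a quality annotation that warns how trustworthy that aggregate is.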
25

Generation and Implementation of Virtual Landscapes for an Augmented Reality HMI-Laboratory / Generierung und Implementierung von virtuellen Landschaften in ein HMI- Labor mit erweiterter Realität

Milius, Jeannette 05 February 2014 (has links) (PDF)
Three-dimensional visualisation achieves tremendous savings in time and cost during the design process and is therefore gaining in importance, for example to improve the performance and safety of a product or to enable the operational optimization of a production sequence. Virtual testing makes it possible to validate a product throughout the entire development process and product lifecycle. The helicopter flight simulator ATILa at Airbus Defence and Space in Friedrichshafen uses these advantages for the company's own products: ATILa is used to test intelligent assistance systems for helicopter pilots. Here the graphical implementation of the virtual earth plays a key role in rehearsing realistic scenarios. This is realized with a so-called Common Database (CDB), which is defined by specifications and standards. Different commercial software packages by Presagis are used to create the database: Terra Vista performs the database generation, including the compilation, and Vega Prime implements the CDB in the simulator by providing the data through the RTP. Because Vega Prime is, due to a software error, unable to display 3D models with multiple levels of detail (LODs), a third software product named Creator is used to modify them. The 3D models are available in the OpenFlight format, which has a complex hierarchical structure of different kinds of nodes; other software solutions, such as Autodesk or Blender, cannot access this specific structure. The edited models can then be integrated into the virtual environment and have to be identified by unambiguous indices. Various settings are used to include the objects automatically. The compilation of the area of interest proceeds via the definition of a geotile of a specific size that depends on the latitude. The CDB output is transferred into the simulator by Vega Prime with the help of the RTP. In addition, several CDB databases can be rendered simultaneously in the simulator to enable a visualisation of the complete earth. Finally, the errors that occurred are described and methods of resolution explained. This thesis demonstrates the complexity of the generation process of a CDB database. However, the overall workflow for visualising the earth is still in its initial stages, not least because of errors in the software. In summary, the potential of the CDB can be rated as above average.
26

Query-Time Data Integration

Eberius, Julian 16 December 2015 (has links) (PDF)
Today, data is collected at ever-increasing scale and variety, opening up enormous potential for new insights and data-centric products. However, in many cases the volume and heterogeneity of new data sources preclude up-front integration using traditional ETL processes and data warehouses. In some cases, it is even unclear if and in what context the collected data will be utilized. Therefore, there is a need for agile methods that defer the effort of integration until the usage context is established. This thesis introduces Query-Time Data Integration as an alternative concept to traditional up-front integration. It aims at enabling users to issue ad-hoc queries on their own data as if all potential other data sources were already integrated, without declaring specific sources and mappings to use. Automated data search and integration methods are then coupled directly with query processing on the available data. The ambiguity and uncertainty introduced by fully automated retrieval and mapping methods are compensated for by answering those queries with ranked lists of alternative results, each based on different data sources or query interpretations, allowing users to pick the result most suitable to their information need. To this end, this thesis makes three main contributions. Firstly, we introduce a novel method for Top-k Entity Augmentation, which is able to construct a top-k list of consistent integration results from a large corpus of heterogeneous data sources. It improves on the state of the art by producing individually consistent but mutually diverse alternative solutions while minimizing the number of data sources used. Secondly, based on this novel augmentation method, we introduce the DrillBeyond system, which is able to process Open World SQL queries, i.e., queries referencing arbitrary attributes not defined in the queried database. The original database is then augmented at query time with Web data sources providing those attributes. Its hybrid augmentation/relational query processing enables the use of ad-hoc data search and integration in data analysis queries and improves both performance and quality compared to using separate systems for the two tasks. Finally, we study the management of large-scale dataset corpora such as data lakes or Open Data platforms, which serve as data sources for our augmentation methods. We introduce Publish-time Data Integration as a new technique for data curation systems managing such corpora; it aims at improving the individual reusability of datasets without requiring up-front global integration. This is achieved by automatically generating metadata and format recommendations, allowing publishers to enhance their datasets with minimal effort. Collectively, these three contributions form the foundation of a Query-time Data Integration architecture that enables ad-hoc data search and integration queries over large heterogeneous dataset collections.
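The top-k entity augmentation idea, several alternative answers that are each internally consistent, mutually diverse, and assembled from few sources, can be sketched as a greedy, set-cover-style loop. The scoring function and diversity penalty below are illustrative assumptions, not the thesis's actual algorithm.

```python
def augment_top_k(entities, sources, k=3, diversity_penalty=0.5):
    """sources: dict source_name -> (covered_entities, relevance_score)."""
    results, used_before = [], set()
    for _ in range(k):
        uncovered, chosen = set(entities), []
        while uncovered:
            def gain(item):
                name, (covered, score) = item
                penalty = diversity_penalty if name in used_before else 0.0
                return len(covered & uncovered) * (score - penalty)
            name, (covered, _) = max(sources.items(), key=gain)
            if not covered & uncovered:
                break                 # no source adds coverage; stop early
            chosen.append(name)
            uncovered -= covered
        results.append(chosen)        # one consistent alternative answer
        used_before.update(chosen)    # steer later answers to other sources
    return results

sources = {
    "web_table_1": ({"DE", "FR", "UK"}, 0.9),
    "web_table_2": ({"DE", "FR"}, 0.8),
    "web_table_3": ({"UK"}, 0.7),
}
print(augment_top_k({"DE", "FR", "UK"}, sources, k=2))
# [['web_table_1'], ['web_table_2', 'web_table_3']]
```

Penalizing already-used sources is one simple way to trade a little per-answer score for diversity across the k alternatives, so the ranked list hedges against wrong source or query interpretations.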
27

Community-Based Intrusion Detection

Weigert, Stefan 06 February 2017 (has links) (PDF)
Today, virtually every company worldwide is connected to the Internet. This widespread connectivity has given rise to sophisticated, targeted, Internet-based attacks. For example, between 2012 and 2013 security researchers counted an average of about 74 targeted attacks per day. These attacks are motivated by economic, financial, or political interests and are commonly referred to as "Advanced Persistent Threat (APT)" attacks. Unfortunately, many of these attacks are successful, and the adversaries manage to steal important data or disrupt vital services. Victims are typically companies from vital industries, such as banks, defense contractors, or power plants. Given that these industries are well protected, often employing a team of security specialists, the question is: how can these attacks be so successful? Researchers have identified several properties of APT attacks which make them so efficient. First, they are adaptable: they can change the way they attack and the tools they use at any given moment in time. Second, they conceal their actions and communication, for example by using encryption; this renders many defense systems useless, as those assume complete access to the actual communication content. Third, their actions are stealthy, either by keeping communication to the bare minimum or by mimicking legitimate users, which makes them "fly below the radar" of defense systems that check for anomalous communication. And finally, to increase their impact or monetisation prospects, the attacks are targeted against several companies from the same industry. Since months can pass between the first attack, its detection, and its comprehensive analysis, it is often too late to deploy appropriate counter-measures at business peers; it is much more likely that they have already been attacked successfully. This thesis tries to answer the question of whether the last property (industry-wide attacks) can be used to detect such attacks. It presents the design, implementation, and evaluation of a community-based intrusion detection system capable of protecting businesses at industry scale. The contributions of this thesis are as follows. First, it presents a novel algorithm for community detection which can detect an industry (e.g., the energy, financial, or defense industries) in Internet communication. Second, it demonstrates the design, implementation, and evaluation of a distributed graph-mining engine that is able to scale with the throughput of the input data while maintaining an end-to-end latency for updates in the range of a few milliseconds. Third, it illustrates the usage of this engine to detect APT attacks against industries by analyzing IP flow information from an Internet service provider. Finally, it introduces a detection-algorithm- and input-agnostic intrusion detection engine that supports not only intrusion detection on IP flows but any other intrusion detection algorithm and data source as well.
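A toy version of the community-based signal, the same external host touching several members of one industry within a window, looks like the sketch below. The flow format, the gateway-IP community, and the threshold are illustrative assumptions; the thesis's actual detection operates on provider-scale IP flow graphs.

```python
from collections import defaultdict

community = {"10.1.0.1", "10.2.0.1", "10.3.0.1"}   # one industry's gateways
flows = [                                           # (source IP, destination IP)
    ("203.0.113.7", "10.1.0.1"), ("203.0.113.7", "10.2.0.1"),
    ("203.0.113.7", "10.3.0.1"), ("198.51.100.9", "10.1.0.1"),
]

# Count how many distinct community members each external source contacts.
contacted = defaultdict(set)
for src, dst in flows:
    if dst in community:
        contacted[src].add(dst)

ALERT_THRESHOLD = 3   # flag sources touching >= 3 members of the community
alerts = [src for src, dsts in contacted.items() if len(dsts) >= ALERT_THRESHOLD]
print(alerts)         # ['203.0.113.7']
```

Each individual flow may look benign in isolation; the cross-company correlation is exactly what no single victim's defense system can see on its own.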
28

Role-based Data Management

Jäkel, Tobias 29 May 2017 (has links) (PDF)
Database systems form an integral component of today's software systems, and as such they are the central point for storing and sharing a software system's data while at the same time ensuring global data consistency. Introducing the primitives of roles and the accompanying metatype distinction into modeling and programming languages results in a novel paradigm for designing, extending, and programming modern software systems. In detail, roles as a modeling concept enable a separation of concerns within an entity. Along with its rigid core, an entity may acquire various roles in different contexts during its lifetime and thus adapt its behavior and structure dynamically at runtime. Unfortunately, database systems, as an important component and the global consistency provider of such systems, do not keep pace with this trend. The absence of a metatype distinction, in terms of an entity's separation of concerns, in the database system results in various problems for the software system in general, for application developers, and finally for the database system itself. In the case of relational database systems, these problems are summarized under the term role-relational impedance mismatch. In particular, the whole software system is designed using different semantics on its various layers; for role-based software systems combined with relational database systems, this gap in semantics between the applications and the database system widens dramatically. Consequently, the database system cannot directly represent the richer semantics of roles or the accompanying consistency constraints. These constraints have to be ensured by the applications, and the database system loses its single-point-of-truth characteristic in the software system. As the applications are in charge of guaranteeing global consistency, their development requires more data-management effort. Moreover, the software system's data management is distributed over several layers, which results in an unstructured software system architecture. To overcome the role-relational impedance mismatch and bring the database system back into its rightful position as the single point of truth in a software system, this thesis introduces the novel, tripartite RSQL approach. It combines a novel database model that represents the metatype distinction as a first-class citizen in the database system, a query language adapted to this database model, and a proper result representation. Precisely, RSQL's logical database model introduces Dynamic Data Types to directly represent the separation of concerns within an entity type on the schema level. On the instance level, the database model defines the notion of a Dynamic Tuple, which combines an entity with the notion of roles and thus allows for dynamic structure adaptations at runtime without changing an entity's overall type. These definitions build the main data structures on which the database system operates. Moreover, formal operators connecting the query language statements with the database model's data structures complete the database model. The query language, as the external database system interface, features an individual data definition, data manipulation, and data query language, whose statements directly represent the metatype distinction to address Dynamic Data Types and Dynamic Tuples, respectively. As a consequence of the novel data structures, the query processing of Dynamic Tuples is completely redesigned. As the last piece of a complete database integration of the role-based notion and its accompanying metatype distinction, we specify the RSQL Result Net as the result representation. It provides a novel result structure and features functionalities to navigate through query results. Finally, we evaluate all three RSQL components in comparison to a relational database system. This assessment clearly demonstrates the benefits of the full database integration of the roles concept.
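To make the role notion behind Dynamic Tuples concrete, the sketch below models an entity with a rigid core that acquires and drops roles at runtime. It illustrates the concept only; the class and attribute names are assumptions, not RSQL's implementation.

```python
class DynamicTuple:
    """An entity with a rigid core plus roles acquired in changing contexts."""

    def __init__(self, core_type, **core_attrs):
        self.core_type = core_type   # the entity's overall type never changes
        self.core = core_attrs
        self.roles = {}              # role name -> role-specific attributes

    def play(self, role, **attrs):
        self.roles[role] = attrs     # acquire a role: structure grows at runtime

    def drop(self, role):
        self.roles.pop(role, None)   # shed a role: structure shrinks, type stays

person = DynamicTuple("Person", name="Ada", born=1815)
person.play("Employee", company="Analytical Engines Ltd", salary=42000)
person.play("Customer", customer_id=7)
person.drop("Customer")
print(person.core_type, person.core, person.roles)
# Person {'name': 'Ada', 'born': 1815} {'Employee': {...}}
```

In an application-side object model this is easy; the point of RSQL is that the database schema, query language, and result representation understand the core/role distinction natively, so consistency constraints over roles do not have to live in application code.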
29

Neue Indexingverfahren für die Ähnlichkeitssuche in metrischen Räumen über großen Datenmengen / New indexing techniques for similarity search in metric spaces

Guhlemann, Steffen 06 July 2016 (has links) (PDF)
A topic of growing importance in computer science is the handling of similarity in a large number of heterogeneous domains. Currently there is no universally usable infrastructure for similarity search in general metric spaces. The goal of this work is to lay the foundation for such an infrastructure, which could be integrated into classical database management systems. An analysis of the state of the art identifies the M-tree as the most suitable base structure; it is subsequently extended in multiple ways to the EM-tree while retaining structural compatibility with the M-tree. The query algorithms are optimized to minimize the number of necessary distance calculations. Building on a mathematical analysis of the relationship between tree structure and query cost, degrees of freedom in the tree edit algorithms are exploited to construct trees that answer similarity queries with a minimal number of distance calculations.
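The central lever, answering queries with as few distance calculations as possible, rests on the triangle inequality: once d(q, p) to a routing object p is known and each stored entry's distance to p is precomputed, |d(q, p) - d(p, x)| lower-bounds d(q, x), so entries can be pruned without computing any new distance. The node layout below is an illustrative simplification of an M-tree/EM-tree node, not the EM-tree's actual structure.

```python
def range_search(routing, entries, q, radius, d):
    """entries: list of (point, dist_to_routing) pairs stored in the node."""
    d_qr = d(q, routing)                  # the only mandatory distance computation
    hits = []
    for point, d_pr in entries:
        if abs(d_qr - d_pr) > radius:     # triangle-inequality lower bound
            continue                      # pruned: d(q, point) never computed
        if d(q, point) <= radius:
            hits.append(point)
    return hits

euclid = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
routing = (0.0, 0.0)
entries = [((1.0, 0.0), 1.0), ((5.0, 5.0), euclid((5.0, 5.0), routing))]
print(range_search(routing, entries, (0.5, 0.0), 0.6, euclid))   # [(1.0, 0.0)]
```

Because this filter needs nothing beyond the metric axioms, it works for any domain with a distance function, which is precisely what makes a general metric-space infrastructure feasible.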
30

Density-Aware Linear Algebra in a Column-Oriented In-Memory Database System

Kernert, David 20 September 2016 (has links) (PDF)
Linear algebra operations appear in nearly every application in advanced analytics, machine learning, and various science domains. To this day, many data analysts and scientists use statistics software packages or hand-crafted solutions for their analyses. In the era of data deluge, however, external statistics packages and custom analysis programs that often run on single workstations are incapable of keeping up with the vast increase in data volume. In particular, there is an increasing demand from scientists for large-scale data manipulation, orchestration, and advanced data management capabilities, which are among the key features of a mature relational database management system (DBMS). With the rise of main-memory database systems, it has now become feasible to also consider applications that build on linear algebra. This thesis presents a deep integration of linear algebra functionality into an in-memory column-oriented database system. In particular, this work shows that it has become feasible to execute linear algebra queries on large data sets directly in a DBMS-integrated engine (LAPEG), without the need to transfer data and without being restricted by hard-disk latencies. From the various application examples cited in this work, we deduce a number of requirements that are relevant for a database system that includes linear algebra functionality. Besides the deep integration of matrices and numerical algorithms, these include the optimization of expressions, transparent matrix handling, scalability and data-parallelism, and data manipulation capabilities. These requirements are addressed by our linear algebra engine. In particular, the core contributions of this thesis are as follows. Firstly, we show that the columnar storage layer of an in-memory DBMS enables the easy adoption of efficient sparse matrix data types and algorithms. Furthermore, we show that the execution of linear algebra expressions significantly benefits from different techniques that are inspired by database technology. In a novel way, we implemented several of these optimization strategies in LAPEG's optimizer (SpMachO), which uses an advanced density estimation method (SpProdest) to predict the matrix density of intermediate results. Moreover, we present an adaptive matrix data type, AT Matrix, which obviates the need for scientists to select appropriate matrix representations. The tiled substructure of AT Matrix is exploited by our matrix multiplication to saturate the different sockets of a multicore main-memory platform, reaching a speed-up of up to 6x compared to alternative approaches. Finally, a major part of this thesis is devoted to data manipulation: we propose a matrix manipulation API and present different mutable matrix types to enable fast inserts and deletes. We conclude that our linear algebra engine is well suited to process dynamic, large matrix workloads in an optimized way. In particular, the DBMS-integrated LAPEG fills the linear algebra gap and makes columnar in-memory DBMSs attractive as an efficient, scalable ad-hoc analysis platform for scientists.
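To see why density estimation for intermediate results matters, consider the uniform baseline: for C = A·B with inner dimension n and independently, uniformly placed non-zeros, each output cell is non-zero with probability 1 - (1 - ρ_A ρ_B)^n. The function below computes only this naive estimate; SpProdest's actual method is more sophisticated (it models non-uniform density), so treat this as an assumption-laden back-of-the-envelope check.

```python
def product_density(rho_a, rho_b, inner_dim):
    """Expected density of A @ B under uniform independence assumptions."""
    return 1.0 - (1.0 - rho_a * rho_b) ** inner_dim

# Two 1%-dense 10,000 x 10,000 matrices: their product is already ~63% dense,
# which is why an optimizer must predict when an intermediate should switch
# from a sparse to a dense representation.
print(product_density(0.01, 0.01, 10_000))   # ~0.63
```

An expression optimizer that ignores this density explosion will pick sparse kernels for nearly dense intermediates, which is exactly the mis-prediction SpMachO's density-aware planning is designed to avoid.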
