Global ETD Search

31	Understanding cryptic schemata in large extract-transform-load systems Albrecht, Alexander; Naumann, Felix January 2012 Extract-Transform-Load (ETL) tools are used for the creation, maintenance, and evolution of data warehouses, data marts, and operational data stores. ETL workflows populate those systems with data from various data sources by specifying and executing a DAG of transformations. Over time, hundreds of individual workflows evolve as new sources and new requirements are integrated into the system. The maintenance and evolution of large-scale ETL systems require much time and manual effort. A key problem is understanding the meaning of unfamiliar attribute labels in source and target databases and in ETL transformations. Hard-to-understand attribute labels lead to frustration and to time lost when developing and understanding ETL workflows. We present a schema decryption technique to support ETL developers in understanding cryptic schemata of sources, targets, and ETL transformations. For a given ETL system, our recommender-like approach leverages the large number of mapped attribute labels in existing ETL workflows to produce good and meaningful decryptions. In this way we are able to decrypt attribute labels consisting of several unfamiliar few-letter abbreviations, such as UNP_PEN_INT, which we can decrypt to UNPAID_PENALTY_INTEREST. We evaluate our schema decryption approach on three real-world repositories of ETL workflows and show that it is able to suggest high-quality decryptions for cryptic attribute labels in a given schema.
Extract-Transform-Load (ETL); Data Warehouse; Data Integration; Data processing; Computer science
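To make the idea behind entry 31 concrete, the following is a minimal sketch of expanding abbreviated attribute labels such as UNP_PEN_INT. It is not the authors' technique: it assumes a simple prefix heuristic over a vocabulary of words collected from already-known labels, and the function names (`build_vocabulary`, `expand_token`, `expand_label`) and the sample labels are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def build_vocabulary(labels: Iterable[str]) -> Counter:
    """Count how often each underscore-separated word occurs in known labels."""
    vocab: Counter = Counter()
    for label in labels:
        for token in label.upper().split("_"):
            if token:
                vocab[token] += 1
    return vocab


def expand_token(token: str, vocab: Counter, top_k: int = 3) -> List[str]:
    """Return up to top_k known words that start with the abbreviation.

    Assumption: an abbreviation is a strict prefix of its expansion.
    Candidates are ranked by frequency, then alphabetically.
    """
    token = token.upper()
    candidates = [w for w in vocab if w.startswith(token) and len(w) > len(token)]
    candidates.sort(key=lambda w: (-vocab[w], w))
    return candidates[:top_k]


def expand_label(label: str, vocab: Counter) -> str:
    """Expand each part of a label; parts without a candidate stay unchanged."""
    parts = []
    for token in label.split("_"):
        best = expand_token(token, vocab, top_k=1)
        parts.append(best[0] if best else token.upper())
    return "_".join(parts)


# Hypothetical labels standing in for those found in existing ETL workflows.
known_labels = [
    "UNPAID_AMOUNT",
    "UNPAID_PENALTY",
    "PENALTY_RATE",
    "INTEREST_RATE",
    "INTEREST_DUE",
    "INTERNAL_ID",
]
vocabulary = build_vocabulary(known_labels)
decrypted = expand_label("UNP_PEN_INT", vocabulary)
```

With the sample labels above, `decrypted` is `UNPAID_PENALTY_INTEREST`, because INTEREST occurs more often than INTERNAL. A real system would need a richer matching model than strict prefixes and, as the abstract describes, would draw its evidence from mapped attribute labels rather than from a flat word list.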
32	Efficient and exact computation of inclusion dependencies for data integration Bauckmann, Jana; Leser, Ulf; Naumann, Felix January 2010 Data obtained from foreign data sources often come with only superficial structural information, such as relation names and attribute names. Other types of metadata that are important for effective integration and meaningful querying of such data sets are missing. In particular, relationships among attributes, such as foreign keys, are crucial metadata for understanding the structure of an unknown database. The discovery of such relationships is difficult, because in principle each pair of data values must be compared for each pair of attributes in the database. A precondition for a foreign key is an inclusion dependency (IND) between the key and the foreign key attributes. We present Spider, an algorithm that efficiently finds all INDs in a given relational database. It leverages the sorting facilities of the DBMS but performs the actual comparisons outside of the database to save computation. Spider analyzes very large databases up to an order of magnitude faster than previous approaches. We also evaluate in detail the effectiveness of several heuristics to reduce the number of necessary comparisons. Furthermore, we generalize Spider to find composite INDs covering multiple attributes, and partial INDs, which are true INDs for all but a certain number of values. This last type is particularly relevant when integrating dirty data, as is often the case in the life sciences domain, our driving motivation.
metadata discovery; metadata quality; schema discovery; data profiling; data integration; Data processing; Computer science
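As an illustration of the inclusion dependency (IND) notion used in entry 32, here is a minimal sketch that tests unary INDs between columns by brute force. It is not Spider: it simply compares the distinct values of every ordered pair of columns, which is the quadratic baseline that the abstract describes as expensive. The function name `unary_inds`, the `max_missing` parameter (a stand-in for partial INDs), and the sample columns are assumptions made for this example.

```python
from typing import Dict, Hashable, List, Optional, Set, Tuple


def unary_inds(
    columns: Dict[str, List[Optional[Hashable]]],
    max_missing: int = 0,
) -> List[Tuple[str, str]]:
    """Return (dependent, referenced) pairs where dependent is included in referenced.

    An exact IND holds when every distinct non-null value of the dependent
    column also occurs in the referenced column. With max_missing > 0 the
    check tolerates that many distinct values without a partner, which is
    one simple reading of a partial IND.
    """
    distinct: Dict[str, Set[Hashable]] = {
        name: {v for v in values if v is not None}
        for name, values in columns.items()
    }
    found: List[Tuple[str, str]] = []
    for dep, dep_values in distinct.items():
        for ref, ref_values in distinct.items():
            if dep == ref:
                continue
            missing = len(dep_values - ref_values)
            if missing <= max_missing:
                found.append((dep, ref))
    return found


# Hypothetical sample data: orders.customer_id behaves like a foreign key
# referencing customers.id.
sample = {
    "orders.customer_id": [1, 2, 2, 3],
    "customers.id": [1, 2, 3, 4],
    "customers.age": [30, 41, 3, 2],
}
exact = unary_inds(sample)
partial = unary_inds(sample, max_missing=1)
```

Here `exact` contains only the pair (`orders.customer_id`, `customers.id`), while allowing one missing value also admits pairs that are coincidental overlaps, which shows why an IND is only a precondition for a foreign key and not proof of one. Note that an empty column is trivially included in every other column under this definition.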
33	Geometrische und stochastische Modelle für die integrierte Auswertung terrestrischer Laserscannerdaten und photogrammetrischer Bilddaten [Geometric and stochastic models for the integrated processing of terrestrial laser scanner data and photogrammetric image data] Schneider, Danilo 07 September 2009 The use of terrestrial laser scanning has grown in popularity in recent years; it replaces or complements previous measuring methods and opens up new fields of application. If data from terrestrial laser scanners are combined with photogrammetric image data, this yields promising possibilities, as the properties of both types of data can be considered largely complementary: terrestrial laser scanners produce fast and reliable three-dimensional representations of object surfaces from only one position, while two-dimensional photogrammetric image data are characterised by high visual quality, ease of interpretation, and high lateral accuracy. Consequently, numerous approaches, both hardware- and software-based, already realise this combination. However, in most approaches the image data play only a supplementary role, such as colouring point clouds or texturing surface models generated from laser scanner data. A thorough exploitation of the complementary characteristics of both types of sensors offers much greater potential. For this reason a calculation method, the integrated bundle adjustment, was developed within this thesis, in which the observations of discrete object points derived from terrestrial laser scanner data and photogrammetric image data are utilised on an equal footing. This approach has several advantages: by using the individual characteristics of both types of data, they mutually strengthen each other in the determination of 3D object coordinates, so that a higher accuracy can be achieved; all involved data sets are optimally co-registered; and the instruments used can be calibrated simultaneously. Because most terrestrial laser scanners have a (spherical) field of view of 360° in the horizontal direction and up to 180° in the vertical direction, combining them with rotating line panoramic cameras or cameras with fisheye lenses is particularly appropriate, as these have a much wider field of view than central perspective cameras. The basis for the combined processing of terrestrial laser scanner and photogrammetric image data is the strict geometric modelling of the recording instruments. Therefore geometric models, each consisting of a basic model and additional parameters for the compensation of residual systematic errors, were developed and verified for terrestrial laser scanners and different types of cameras. For the geometric laser scanner model, different approaches described in the literature were considered, and correction models known from theodolites and total stations were applied. Of particular importance within the combined processing is the definition of the stochastic model. Since different types of observations with different underlying geometric models and different stochastic properties are adjusted together, adequate weights have to be assigned to the measurements. An unfavourable weighting can have a negative influence on the adjustment results. Therefore the integrated bundle adjustment was extended by a variance component estimation procedure, which allows optimal observation weights to be determined automatically. Only then does it become possible to fully exploit the potential of combining terrestrial laser scanner and photogrammetric image data. For the calculation of the integrated bundle adjustment, software was developed that allows various algorithmic configurations of the different data types to be applied. Numerous laser scanner, panoramic image, fisheye image and central perspective image data sets were recorded in several test fields and processed using the developed software. Several calculation variants were analysed in detail, demonstrating the advantages and limitations of the presented method. An application example from the field of geology illustrates the potential of the algorithm in practice.
photogrammetry; terrestrial laser scanning; bundle adjustment; data integration; ddc:550; rvk:ZI 9510
34	Semantic Enrichment of Ontology Mappings Arnold, Patrick 04 January 2016 Schema and ontology matching play an important part in the fields of data integration and the semantic web. Given two heterogeneous data sources, metadata matching usually constitutes the first step in the data integration workflow; it refers to the analysis and comparison of two input resources such as schemas or ontologies. The result is a list of correspondences between the two schemas or ontologies, which is often called a mapping or alignment. Many tools and research approaches have been proposed to automatically determine those correspondences. However, most match tools do not provide any information about the relation type that holds between matching concepts, for the simple but important reason that most common match strategies are too simple and heuristic to allow any sophisticated relation type determination. Knowing the specific type holding between two concepts, e.g., whether they are in an equality, subsumption (is-a) or part-of relation, is very important for advanced data integration tasks, such as ontology merging or ontology evolution. It is also very important for mappings in the biological or biomedical domain, where is-a and part-of relations may exceed the number of equality correspondences by far. Such more expressive mappings allow much better integration results and have scarcely been the focus of research so far. In this doctoral thesis, the focus is on determining the correspondence types in a given mapping, which is referred to as semantic mapping enrichment. We introduce and present the mapping enrichment tool STROMA, which takes a pre-calculated schema or ontology mapping and determines a semantic relation type for each correspondence. In contrast to previous approaches, we focus strongly on linguistic laws and linguistic insights. 
By and large, linguistics is the key for precise matching and for the determination of relation types. We will introduce various strategies that make use of these linguistic laws and are able to calculate the semantic type between two matching concepts. The observations and insights gained from this research go far beyond the field of mapping enrichment and can also be applied to schema and ontology matching in general. Since generic strategies have certain limits and may not be able to determine the relation type between more complex concepts, like a laptop and a personal computer, background knowledge plays an important role in this research as well. For example, a thesaurus can help to recognize that these two concepts are in an is-a relation. We will show how background knowledge can be effectively used in this instance, how it is possible to draw conclusions even if a concept is not contained in it, how the relation types in complex paths can be resolved and how time complexity can be reduced by a so-called bidirectional search. The developed techniques go far beyond the background knowledge exploitation of previous approaches, and are now part of the semantic repository SemRep, a flexible and extendable system that combines different lexicographic resources. Furthermore, we will show how additional lexicographic resources can be developed automatically by parsing Wikipedia articles. The proposed Wikipedia relation extraction approach yields several million additional relations, which constitute significant additional knowledge for mapping enrichment. The extracted relations were also added to SemRep, which thus became a comprehensive background knowledge resource. To improve the quality of the repository, different techniques were used to discover and delete irrelevant semantic relations. We were able to show in several experiments that STROMA obtains very good results w.r.t. relation type detection. 
In a comparative evaluation, it was able to achieve considerably better results than related applications. This corroborates the overall usefulness and strengths of the implemented strategies, which were developed with particular emphasis on the principles and laws of linguistics.
ontology mapping; ontology matching; semantic enrichment; data integration; background knowledge; ddc:500
35	A Common Programming Interface for Managed Heterogeneous Data Analysis Luong, Johannes 28 July 2021 The widespread success of data analysis in a growing number of application domains has led to the development of a variety of purpose-built data processing systems. Today, many organizations operate whole fleets of different data-related systems. Although there are good reasons for this differentiation, there is also a growing need to create holistic perspectives that cut across the borders of individual systems. Application experts who want to create such perspectives are confronted with a variety of programming interfaces and data formats, and with the task of combining the available systems in an efficient manner. These issues are generally unrelated to the application domain and require a specialized set of skills. As a consequence, development is slowed down and made more expensive, which stifles exploration and innovation. In addition, the direct use of specialized system interfaces can couple application code to specific processing systems. In this dissertation, we propose the data processing platform DataCalc, which presents users with a unified application-oriented programming interface and automatically executes programs written against this interface in an efficient manner on a variety of processing systems. DataCalc offers a managed environment for data analyses that enables domain experts to concentrate on their application logic and decouples code from specific processing technology. This managed processing environment is based on the high-level, domain-oriented program representation DCIL and on a flexible and extensible cost-based optimization component. In addition to traditional up-front optimization, the optimizer also supports dynamic re-optimization of partially executed DCIL programs. This enables the system to benefit from dynamic information that only becomes available during the execution of queries. 
DataCalc assigns workloads to available processing systems using a fine-grained task scheduling model to enable efficient exploitation of available resources. In the second part of the dissertation we present a prototypical implementation of the DataCalc platform, which includes connectors for the relational DBMS PostgreSQL, the document store MongoDB, the graph database Neo4j, and the custom-built PyProc processing system. For the evaluation of this prototype we have implemented an extended application scenario. Our experiments demonstrate that DataCalc is able to find and execute efficient execution strategies that minimize cross-system data movement. The system achieves much better results than a naive implementation and comes close to the performance of a hand-optimized solution. Based on these findings we are confident in concluding that the DataCalc platform architecture provides an excellent environment for cross-domain data analysis on a heterogeneous federated processing architecture. info:eu-repo/classification/ddc/004 ddc:004
36	Open(Geo-)Data - ein Katalysator für die Digitalisierung in der Landwirtschaft? [Open (Geo-)Data: A Catalyst for Digitalization in Agriculture?] Nölle, Olaf 15 November 2016 Integrating, analyzing, and visualizing (geo)data, unlocking knowledge and feeding it into decision-making processes: this is what Disy has stood for for almost 20 years! info:eu-repo/classification/ddc/630 ddc:630
37	Geometrische und stochastische Modelle für die integrierte Auswertung terrestrischer Laserscannerdaten und photogrammetrischer Bilddaten [Geometric and stochastic models for the integrated processing of terrestrial laser scanner data and photogrammetric image data] Schneider, Danilo 13 November 2008 The use of terrestrial laser scanning has grown in popularity in recent years; it replaces or complements previous measuring methods and opens up new fields of application. If data from terrestrial laser scanners are combined with photogrammetric image data, this yields promising possibilities, as the properties of both types of data can be considered largely complementary: terrestrial laser scanners produce fast and reliable three-dimensional representations of object surfaces from only one position, while two-dimensional photogrammetric image data are characterised by high visual quality, ease of interpretation, and high lateral accuracy. Consequently, numerous approaches, both hardware- and software-based, already realise this combination. However, in most approaches the image data play only a supplementary role, such as colouring point clouds or texturing surface models generated from laser scanner data. A thorough exploitation of the complementary characteristics of both types of sensors offers much greater potential. For this reason a calculation method, the integrated bundle adjustment, was developed within this thesis, in which the observations of discrete object points derived from terrestrial laser scanner data and photogrammetric image data are utilised on an equal footing. This approach has several advantages: by using the individual characteristics of both types of data, they mutually strengthen each other in the determination of 3D object coordinates, so that a higher accuracy can be achieved; all involved data sets are optimally co-registered; and the instruments used can be calibrated simultaneously. Because most terrestrial laser scanners have a (spherical) field of view of 360° in the horizontal direction and up to 180° in the vertical direction, combining them with rotating line panoramic cameras or cameras with fisheye lenses is particularly appropriate, as these have a much wider field of view than central perspective cameras. The basis for the combined processing of terrestrial laser scanner and photogrammetric image data is the strict geometric modelling of the recording instruments. Therefore geometric models, each consisting of a basic model and additional parameters for the compensation of residual systematic errors, were developed and verified for terrestrial laser scanners and different types of cameras. For the geometric laser scanner model, different approaches described in the literature were considered, and correction models known from theodolites and total stations were applied. Of particular importance within the combined processing is the definition of the stochastic model. Since different types of observations with different underlying geometric models and different stochastic properties are adjusted together, adequate weights have to be assigned to the measurements. An unfavourable weighting can have a negative influence on the adjustment results. Therefore the integrated bundle adjustment was extended by a variance component estimation procedure, which allows optimal observation weights to be determined automatically. Only then does it become possible to fully exploit the potential of combining terrestrial laser scanner and photogrammetric image data. For the calculation of the integrated bundle adjustment, software was developed that allows various algorithmic configurations of the different data types to be applied. Numerous laser scanner, panoramic image, fisheye image and central perspective image data sets were recorded in several test fields and processed using the developed software. Several calculation variants were analysed in detail, demonstrating the advantages and limitations of the presented method. An application example from the field of geology illustrates the potential of the algorithm in practice. info:eu-repo/classification/ddc/550 ddc:550
38	Efficient use of a protein structure annotation database Rother, Kristian 14 August 2007 (has links) Im Rahmen dieser Arbeit wird eine Vielzahl von Daten zur Struktur und Funktion von Proteinen gesammelt. Anschließend wird in strukturellen Daten die atomare Packungsdichte untersucht. Untersuchungen an Strukturen benötigen oftmals maßgeschneiderte Datensätze von Proteinen. Kriterien für die Auswahl einzelner Proteine sind z.B. Eigenschaften der Sequenzen, die Faltung oder die Auflösung einer Struktur. Solche Datensätze mit den im Netz verfügbaren Mitteln herzustellen ist mühselig, da die notwendigen Daten über viele Datenbanken verteilt liegen. Um diese Aufgabe zu vereinfachen, wurde Columba, eine integrierte Datenbank zur Annotation von Proteinstrukturen, geschaffen. Columba integriert insgesamt sechzehn Datenbanken, darunter u.a. die PDB, KEGG, Swiss-Prot, CATH, SCOP, die Gene Ontology und ENZYME. Von den in Columba enthaltenen Strukturen der PDB sind zwei Drittel durch viele andere Datenbanken annotiert. Zum verbliebenen Drittel gibt es nur wenige zusätzliche Angaben, teils da die entsprechenden Strukturen erst seit kurzem in der PDB sind, teils da es gar keine richtigen Proteine sind. Die Datenbank kann über eine Web-Oberfläche unter www.columba-db.de spezifisch für einzelne Quelldatenbanken durchsucht werden. Ein Benutzer kann sich auf diese Weise schnell einen Datensatz von Strukturen aus der PDB zusammenstellen, welche den gewählten Anforderungen entsprechen. Es wurden Regeln aufgestellt, mit denen Datensätze effizient erstellt werden können. Diese Regeln wurden angewandt, um Datensätze zur Analyse der Packungsdichte von Proteinen zu erstellen. Die Packungsanalyse quantifiziert den Raum zwischen Atomen, und kann Regionen finden, in welchen eine hohe lokale Beweglichkeit vorliegt oder welche Fehler in der Struktur beinhalten. In einem Referenzdatensatz wurde so eine große Zahl von atomgroßen Höhlungen dicht unterhalb der Proteinoberfläche gefunden. 
In Transmembrandomänen treten diese Höhlungen besonders häufig in Kanal- und Transportproteinen auf, welche Konformationsänderungen vollführen. In proteingebundenen Liganden und Coenzymen wurde eine zu den Referenzdaten ähnliche Packungsdichte beobachtet. Mit diesen Ergebnissen konnten mehrere Widersprüche in der Fachliteratur ausgeräumt werden. / In this work, a large body of data on the structure and function of proteins is compiled and subsequently applied to the analysis of atomic packing. Structural analyses often require specific protein datasets, selected by properties of the proteins such as sequence features, protein folds, or resolution. Compiling such sets using current web resources is tedious because the necessary data are spread over many different databases. To facilitate this task, Columba, an integrated database of protein structure annotation, was created. Columba integrates sixteen databases, including PDB, KEGG, Swiss-Prot, CATH, SCOP, the Gene Ontology, and ENZYME. The data in Columba revealed that two thirds of the structures in the PDB database are annotated by many other databases. The remaining third is poorly annotated, partly because the corresponding structures have only recently been published, and partly because they are non-protein structures. The Columba database can be searched through a data source-specific web interface at www.columba-db.de. Users can thus quickly select PDB entries of proteins that match the desired criteria. Rules for creating datasets of proteins efficiently have been derived. These rules were applied to create datasets for analyzing the packing of proteins. Packing analysis measures how much space there is between atoms. It can indicate regions of high local mobility as well as regions that contain errors in the structure. In a reference dataset, a large number of atom-sized cavities were found just below the protein surface. 
In a transmembrane protein dataset, these cavities are frequently located in channels and transporters that undergo conformational changes. A dataset of ligands and coenzymes bound to proteins was packed at least as tightly as the reference data. These results resolve several contradictions in the literature. Datenqualität Proteinstruktu Datenbanken Datenintegration Annotation Packungsdichte data quality protein structure databases data integration annotation protein packing 570 Biowissenschaften, Biologie 32 Biologie WD 5100 ddc:570
39	Dependency discovery for data integration Bauckmann, Jana January 2013 Data integration aims to combine data from different sources and to provide users with a unified view of these data. This task is as challenging as it is valuable. In this thesis we propose algorithms for dependency discovery to provide necessary information for data integration. We focus on inclusion dependencies (INDs) in general and a special form named conditional inclusion dependencies (CINDs): (i) INDs enable the discovery of structure in a given schema. (ii) INDs and CINDs support the discovery of cross-references or links between schemas. An IND “A in B” simply states that all values of attribute A are included in the set of values of attribute B. We propose an algorithm that discovers all inclusion dependencies in a relational data source. The challenge of this task is the complexity of testing all attribute pairs and, further, of comparing all values of each attribute pair. The complexity of existing approaches depends on the number of attribute pairs, while ours depends only on the number of attributes. Thus, our algorithm makes it possible to profile entirely unknown data sources with large schemas by discovering all INDs. Further, we provide an approach to extract foreign keys from the identified INDs. We extend our IND discovery algorithm to also find three special types of INDs: (i) composite INDs, such as “AB in CD”, (ii) approximate INDs that allow a certain amount of the values of A not to be included in B, and (iii) prefix and suffix INDs that represent special cross-references between schemas. Conditional inclusion dependencies are inclusion dependencies with a limited scope defined by conditions over several attributes. Only the matching part of the instance must adhere to the dependency. We generalize the definition of CINDs by distinguishing covering and completeness conditions, and we define quality measures for conditions. 
We propose efficient algorithms that identify covering and completeness conditions conforming to given quality thresholds. The challenge for this task is twofold: (i) Which (and how many) attributes should be used for the conditions? (ii) Which attribute values should be chosen for the conditions? Previous approaches rely on pre-selected condition attributes or can only discover conditions applying to quality thresholds of 100%. Our approaches were motivated by two application domains: data integration in the life sciences and link discovery for linked open data. We show the efficiency and the benefits of our approaches for use cases in these domains. / Datenintegration hat das Ziel, Daten aus unterschiedlichen Quellen zu kombinieren und Nutzern eine einheitliche Sicht auf diese Daten zur Verfügung zu stellen. Diese Aufgabe ist gleichermaßen anspruchsvoll wie wertvoll. In dieser Dissertation werden Algorithmen zum Erkennen von Datenabhängigkeiten vorgestellt, die notwendige Informationen zur Datenintegration liefern. Der Schwerpunkt dieser Arbeit liegt auf Inklusionsabhängigkeiten (inclusion dependency, IND) im Allgemeinen und auf der speziellen Form der Bedingten Inklusionsabhängigkeiten (conditional inclusion dependency, CIND): (i) INDs ermöglichen das Finden von Strukturen in einem gegebenen Schema. (ii) INDs und CINDs unterstützen das Finden von Referenzen zwischen Datenquellen. Eine IND „A in B“ besagt, dass alle Werte des Attributs A in der Menge der Werte des Attributs B enthalten sind. Diese Arbeit liefert einen Algorithmus, der alle INDs in einer relationalen Datenquelle erkennt. Die Herausforderung dieser Aufgabe liegt in der Komplexität alle Attributpaare zu testen und dabei alle Werte dieser Attributpaare zu vergleichen. Die Komplexität bestehender Ansätze ist abhängig von der Anzahl der Attributpaare während der hier vorgestellte Ansatz lediglich von der Anzahl der Attribute abhängt. 
Damit ermöglicht der vorgestellte Algorithmus unbekannte Datenquellen mit großen Schemata zu untersuchen. Darüber hinaus wird der Algorithmus erweitert, um drei spezielle Formen von INDs zu finden, und ein Ansatz vorgestellt, der Fremdschlüssel aus den erkannten INDs filtert. Bedingte Inklusionsabhängigkeiten (CINDs) sind Inklusionsabhängigkeiten deren Geltungsbereich durch Bedingungen über bestimmten Attributen beschränkt ist. Nur der zutreffende Teil der Instanz muss der Inklusionsabhängigkeit genügen. Die Definition für CINDs wird in der vorliegenden Arbeit generalisiert durch die Unterscheidung von überdeckenden und vollständigen Bedingungen. Ferner werden Qualitätsmaße für Bedingungen definiert. Es werden effiziente Algorithmen vorgestellt, die überdeckende und vollständige Bedingungen mit gegebenen Qualitätsmaßen auffinden. Dabei erfolgt die Auswahl der verwendeten Attribute und Attributkombinationen sowie der Attributwerte automatisch. Bestehende Ansätze beruhen auf einer Vorauswahl von Attributen für die Bedingungen oder erkennen nur Bedingungen mit Schwellwerten von 100% für die Qualitätsmaße. Die Ansätze der vorliegenden Arbeit wurden durch zwei Anwendungsbereiche motiviert: Datenintegration in den Life Sciences und das Erkennen von Links in Linked Open Data. Die Effizienz und der Nutzen der vorgestellten Ansätze werden anhand von Anwendungsfällen in diesen Bereichen aufgezeigt. Datenabhängigkeiten-Entdeckung Datenintegration Schema-Entdeckung Link-Entdeckung Inklusionsabhängigkeit dependency discovery data integration schema discovery link discovery inclusion dependency Data processing Computer science
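The IND definition quoted in the abstract above ("A in B" holds when every value of A also occurs in B) can be illustrated with a naive pairwise check. This is only a sketch for illustration and not the thesis's algorithm: it tests every attribute pair, which is exactly the cost the abstract says the proposed approach avoids. The function name `find_unary_inds`, the column names, and the `max_missing_fraction` parameter (one possible reading of "approximate INDs", measured over distinct values) are assumptions introduced here.

```python
from typing import Dict, List, Tuple


def find_unary_inds(
    columns: Dict[str, List[object]],
    max_missing_fraction: float = 0.0,
) -> List[Tuple[str, str]]:
    """Return all pairs (A, B), A != B, for which the IND "A in B" holds.

    Naive baseline: compares every ordered attribute pair.
    max_missing_fraction = 0.0 gives exact INDs; a larger value tolerates
    that fraction of A's distinct values being absent from B (an assumed
    interpretation of approximate INDs).
    """
    # Work on distinct, non-null values only.
    value_sets = {
        name: {v for v in values if v is not None}
        for name, values in columns.items()
    }
    result = []
    for a, a_vals in value_sets.items():
        if not a_vals:
            continue  # skip empty attributes, which would be trivially included
        for b, b_vals in value_sets.items():
            if a == b:
                continue
            missing = len(a_vals - b_vals)
            if missing / len(a_vals) <= max_missing_fraction:
                result.append((a, b))
    return result


# Hypothetical example data: a foreign-key-like column and two other columns.
example = {
    "order.customer_id": [1, 2, 2, 3],
    "customer.id": [1, 2, 3, 4],
    "customer.zip": [10115, 14482],
}
exact = find_unary_inds(example)
approximate = find_unary_inds(example, max_missing_fraction=0.25)
```

With the exact setting, only "order.customer_id in customer.id" is reported, which is the kind of candidate from which a foreign key could be derived. With the tolerance of 0.25, the reverse direction is reported as well, because only one of the four distinct values of customer.id is missing from order.customer_id.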
40	Ein Integrations- und Darstellungsmodell für verteilte und heterogene kontextbezogene Informationen / An Integration and Representation Model for Distributed and Heterogeneous Contextual Information Goslar, Kevin 07 February 2007 Die "Kontextsensitivität" genannte systematische Berücksichtigung von Umweltinformationen durch Anwendungssysteme kann als Querschnittsfunktion im betrieblichen Umfeld in vielen Bereichen einen Nutzen stiften. Wirklich praxistaugliche kontextsensitive Anwendungssysteme, die sich analog zu einem mitdenkenden menschlichen Assistenten harmonisch in die ablaufenden Vorgänge in der Realwelt einbringen, haben einen enormen Bedarf nach umfassenden, d.h. diverse Aspekte der Realwelt beschreibenden Kontextinformationen, die jedoch prinzipbedingt verteilt in verschiedenen Datenquellen, etwa Kontexterfassungssystemen, Endgeräten sowie prinzipiell auch in beliebigen anderen, z.T. bereits existierenden Anwendungen entstehen. Ziel dieser Arbeit ist die Verringerung der Komplexität des Beschaffungsvorganges von verteilten und heterogenen Kontextinformationen durch Bereitstellung einer einfach verwendbaren Methode zur Darstellung eines umfassenden, aus verteilten und heterogenen Datenquellen zusammengetragenen Kontextmodells. Im Besonderen werden durch diese Arbeit zwei Probleme addressiert, zum einen daß ein Konsument von umfassenden Kontextinformationen mehrere Datenquellen sowohl kennen und zugreifen können und zum anderen über die zwischen den einzelnen Kontextinformationen in verschiedenen Datenquellen existierenden, zunächst nicht modellierten semantischen Verbindungen Bescheid wissen muß. 
Das dazu entwickelte Kontextinformationsintegrations- und -darstellungsverfahren kombiniert daher ein die Beschaffung und Integration von Kontextinformationen aus diversen Datenquellen modellierendes Informationsintegrationsmodell mit einem Kontextdarstellungsmodell, welches die abzubildende Realweltdomäne basierend auf ontologischen Informationen durch in problemspezifischer Weise erweiterte Verfahren des Semantic Web in einer möglichst intuitiven, wiederverwendbaren und modularen Weise modelliert. Nach einer fundierten Anforderungsanalyse des entwickelten Prinzips wird dessen Verwendung und Nutzen basierend auf der Skizzierung der wichtigsten allgemeinen Verwendungsmöglichkeiten von Kontextinformationen im betrieblichen Umfeld anhand eines komplexen betrieblichen Anwendungsszenarios demonstriert. Dieses beinhaltet ein Nutzerprofil, das von diversen Anwendungen, u.a. einem kontextsensitiven KFZ-Navigationssystem, einer Restaurantsuchanwendung sowie einem Touristenführer verwendet wird. Probleme hinsichtlich des Datenschutzes, der Integration in existierende Umgebungen und Abläufe sowie der Skalierbarkeit und Leistungsfähigkeit des Verfahrens werden ebenfalls diskutiert. / Context-awareness, the systematic consideration of information from the environment of applications, can provide significant benefits in the area of business and technology. To be really useful, i.e. to support real-world processes as harmoniously as a human assistant would, practical applications need a comprehensive and detailed contextual information base that describes all relevant aspects of the real world. As a matter of principle, comprehensive contextual information arises in many places and data sources, e.g. in context-aware infrastructures as well as in "normal" applications, which may have knowledge about the context because their function is to support a certain process in the real world. 
This thesis facilitates the use of contextual information by reducing the complexity of procuring distributed and heterogeneous contextual information. In particular, it addresses two problems: a consumer of comprehensive contextual information needs to be aware of and able to access several different data sources, and must know how to combine the contextual information taken from different and isolated data sources into a meaningful representation of the context. The latter information in particular cannot be modelled using the current state of the art. These problems are addressed by the development of an integration and representation model for contextual information that allows comprehensive context models to be composed from information in distributed and heterogeneous data sources. This model combines an information integration model for distributed and heterogeneous information (which consists of an access model for heterogeneous data sources, an integration model and an information relation model) with a representation model for context that formalizes the representation of the respective real-world domain, i.e. of the real-world objects and their semantic relations, in an intuitive, reusable and modular way based on ontologies. The resulting model consists of five layers that represent different aspects of the information integration solution. The achievement of the objectives is rated based on a requirement analysis of the problem domain. The technical feasibility and usefulness of the model are demonstrated by the implementation of an engine to support the approach as well as a complex application scenario consisting of a user profile that integrates information from several data sources and a number of context-aware applications that use the user profile, e.g. a context-aware navigation system, a restaurant finder application and an enhanced tourist guide. 
Problems regarding security and social effects, the integration of this solution into existing environments and infrastructures, and technical issues such as the scalability and performance of this model are also discussed. Kontextsensitivität Datenintegration Kontextmodell Informationsbeschaffung Informationsintegration Ontologie Semantic Web context-awareness data integration context model information procurement information integration ontology semantic web ddc:330 rvk:ST 515 Wissensmanagement Information Engineering

Search results