181

Understanding cryptic schemata in large extract-transform-load systems

Albrecht, Alexander, Naumann, Felix January 2012 (has links)
Extract-Transform-Load (ETL) tools are used for the creation, maintenance, and evolution of data warehouses, data marts, and operational data stores. ETL workflows populate those systems with data from various data sources by specifying and executing a DAG of transformations. Over time, hundreds of individual workflows evolve as new sources and new requirements are integrated into the system. The maintenance and evolution of large-scale ETL systems requires much time and manual effort. A key problem is to understand the meaning of unfamiliar attribute labels in source and target databases and in ETL transformations. Hard-to-understand attribute labels cause frustration and cost time when developing and maintaining ETL workflows. We present a schema decryption technique to support ETL developers in understanding cryptic schemata of sources, targets, and ETL transformations. For a given ETL system, our recommender-like approach leverages the large number of mapped attribute labels in existing ETL workflows to produce good and meaningful decryptions. In this way we are able to decrypt attribute labels that consist of several unfamiliar few-letter abbreviations, such as UNP_PEN_INT, which we expand to UNPAID_PENALTY_INTEREST. We evaluate our schema decryption approach on three real-world repositories of ETL workflows and show that it suggests high-quality decryptions for cryptic attribute labels in a given schema.
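As a hedged illustration only (this is not the authors' algorithm, and the mapping data and function names are invented for the example), the following Python sketch shows the basic recommender-style idea: harvest aligned label pairs from existing workflow mappings and use them to expand the tokens of a cryptic label.

```python
# Minimal sketch of recommender-style schema decryption (not the authors'
# algorithm): mine expansions from attribute pairs already mapped in existing
# ETL workflows, then expand unseen cryptic labels token by token.
from collections import Counter, defaultdict

# Hypothetical training data: (cryptic label, readable label) pairs harvested
# from mappings in an existing workflow repository.
mapped_pairs = [
    ("CUST_ADDR", "CUSTOMER_ADDRESS"),
    ("PEN_AMT", "PENALTY_AMOUNT"),
    ("UNP_INV", "UNPAID_INVOICE"),
    ("INT_RATE", "INTEREST_RATE"),
]

def learn_expansions(pairs):
    """Count how often each abbreviation token aligns with a full word."""
    votes = defaultdict(Counter)
    for cryptic, readable in pairs:
        for abbr, word in zip(cryptic.split("_"), readable.split("_")):
            if word.startswith(abbr):          # crude alignment heuristic
                votes[abbr][word] += 1
    return votes

def decrypt(label, votes):
    """Expand each token to its most frequently seen full word, if any."""
    return "_".join(votes[t].most_common(1)[0][0] if votes[t] else t
                    for t in label.split("_"))

votes = learn_expansions(mapped_pairs)
print(decrypt("UNP_PEN_INT", votes))   # -> UNPAID_PENALTY_INTEREST
```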
182

Environmental Health and Safety data integration using Geographical Information Systems

George, David Paul January 2008 (has links)
Environmental Health and Safety (EHS) departments in many organizations face two interrelated problems which limit their ability to make accurate decisions based on quality data. First, many EHS departments follow a reactive business management model and need to move towards a proactive continuous-improvement model to better manage EHS. Second, there is a lack of data integration and interoperability between the numerous different EHS data sources and systems. EHS departments are challenged with managing large quantities of data generated through tracking and monitoring programs to continuously improve EHS performance. EHS data can take many forms: paper, digital files, spreadsheets, images, relational databases, and proprietary software applications. EHS data have strong spatial relationships, which makes the use of Geographical Information Systems (GIS) a very cost-effective and feasible solution for integrating and managing EHS data. This thesis outlines how GIS brings to EHS the advantages of traditional IT methods with the added benefit of spatial analytical operations, such as map overlay, spatial relationships, and querying, and informative visual presentation through maps, floor plans, and imagery, through the implementation of a GIS database for EHS called GeoSpatial Environmental Health and Safety (GEO-EHS).
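As a minimal, hypothetical illustration of the kind of spatial operation referred to above (not the GEO-EHS implementation; all layer names, records, and coordinates are invented), a point-in-polygon overlay can link an EHS incident record to a floor-plan area:

```python
# Illustrative sketch (not the GEO-EHS system): a point-in-polygon overlay,
# the kind of spatial query that links an EHS incident record to the building
# area it occurred in. All names and coordinates are hypothetical.
from shapely.geometry import Point, Polygon

# Hypothetical floor-plan areas stored as polygons in a GIS layer.
areas = {
    "chemical_storage": Polygon([(0, 0), (10, 0), (10, 10), (0, 10)]),
    "loading_dock":     Polygon([(10, 0), (25, 0), (25, 8), (10, 8)]),
}

# Hypothetical incident records imported from a spreadsheet or database.
incidents = [
    {"id": 1, "type": "spill",  "location": Point(3, 4)},
    {"id": 2, "type": "injury", "location": Point(18, 5)},
]

# Map overlay: attach the containing area to each incident record.
for rec in incidents:
    rec["area"] = next((name for name, poly in areas.items()
                        if poly.contains(rec["location"])), None)
    print(rec["id"], rec["type"], "->", rec["area"])
```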
183

Distributed knowledge sharing and production through collaborative e-Science platforms

Gaignard, Alban 15 March 2013 (has links) (PDF)
This thesis addresses the issues of coherent distributed knowledge production and sharing in the life sciences. In spite of the continuously increasing computing and storage capabilities of computing infrastructures, managing massive scientific data through centralized approaches has become inappropriate, for several reasons: (i) they do not guarantee the autonomy of data providers, which are constrained, for ethical or legal reasons, to keep control over the data they host, and (ii) they do not scale and adapt to the massive scientific data produced through e-Science platforms. In the context of the NeuroLOG and VIP life-science collaborative platforms, we address, on the one hand, the distribution and heterogeneity issues underlying the sharing of possibly sensitive resources and, on the other hand, automated knowledge production through the usage of these e-Science platforms, to ease the exploitation of the massively produced scientific data. We rely on an ontological approach for knowledge modelling and propose, based on Semantic Web technologies, (i) to extend these platforms with efficient, static and dynamic, transparent federated semantic querying strategies, and (ii) to extend their data processing environment, based on both provenance information captured at run time and domain-specific inference rules, to automate the semantic annotation of in silico experiment results. The results of this thesis have been evaluated on the Grid'5000 distributed and controlled infrastructure. They contribute to addressing three of the main challenges faced by computational science platforms: (i) a model for secured collaborations and a distributed access-control strategy allowing multi-centric studies to be set up while still considering competitive activities, (ii) semantic experiment summaries, meaningful from the end-user perspective, aimed at easing navigation into the massive scientific data resulting from large-scale experimental campaigns, and (iii) efficient distributed querying and reasoning strategies, relying on Semantic Web standards, aimed at sharing capitalized knowledge and providing connectivity towards the Web of Linked Data.
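As an illustrative sketch only (this is not the NeuroLOG/VIP query engine; the endpoint URLs and vocabulary are hypothetical), a federated semantic query can be expressed with the SPARQL 1.1 SERVICE clause and submitted from Python using the third-party SPARQLWrapper package:

```python
# Minimal illustration of federated semantic querying (not the NeuroLOG/VIP
# engine): the SPARQL SERVICE clause pulls in data from a second, remote
# endpoint inside one query. Endpoint URLs and the vocabulary are hypothetical.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX ex: <http://example.org/neuro#>
SELECT ?subject ?acquisitionDate ?volume
WHERE {
  ?subject a ex:Patient ;
           ex:hasScan ?scan .
  ?scan ex:acquiredOn ?acquisitionDate .
  # Federated part: fetch derived results hosted by another site.
  SERVICE <http://site-b.example.org/sparql> {
    ?scan ex:greyMatterVolume ?volume .
  }
}
"""

endpoint = SPARQLWrapper("http://site-a.example.org/sparql")  # hypothetical
endpoint.setQuery(QUERY)
endpoint.setReturnFormat(JSON)
results = endpoint.query().convert()

for row in results["results"]["bindings"]:
    print(row["subject"]["value"], row["acquisitionDate"]["value"],
          row["volume"]["value"])
```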
185

Efficient and exact computation of inclusion dependencies for data integration

Bauckmann, Jana, Leser, Ulf, Naumann, Felix January 2010 (has links)
Data obtained from foreign data sources often come with only superficial structural information, such as relation names and attribute names. Other types of metadata that are important for effective integration and meaningful querying of such data sets are missing. In particular, relationships among attributes, such as foreign keys, are crucial metadata for understanding the structure of an unknown database. The discovery of such relationships is difficult because, in principle, for each pair of attributes in the database each pair of data values must be compared. A precondition for a foreign key is an inclusion dependency (IND) between the key and the foreign-key attributes. We present Spider, an algorithm that efficiently finds all INDs in a given relational database. It leverages the sorting facilities of the DBMS but performs the actual comparisons outside of the database to save computation. Spider analyzes very large databases up to an order of magnitude faster than previous approaches. We also evaluate in detail the effectiveness of several heuristics to reduce the number of necessary comparisons. Furthermore, we generalize Spider to find composite INDs covering multiple attributes, and partial INDs, which are true INDs for all but a certain number of values. This last type is particularly relevant when integrating dirty data, as is often the case in the life sciences domain - our driving motivation.
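For concreteness, the sketch below spells out the unary IND definition with a brute-force check over extracted value sets; it is not the Spider algorithm (which merges sorted value streams exported from the DBMS), and the column data are hypothetical.

```python
# Brute-force sketch of unary inclusion dependency (IND) discovery. This is
# NOT the Spider algorithm; it only makes the IND definition concrete.
from itertools import permutations

# Hypothetical extracted columns: attribute name -> set of distinct values.
columns = {
    "orders.customer_id": {1, 2, 3, 5},
    "customers.id":       {1, 2, 3, 4, 5, 6},
    "customers.country":  {"DE", "FR"},
    "invoices.country":   {"DE", "FR", "US"},
}

def unary_inds(cols):
    """Return all pairs (A, B) with A != B such that values(A) is a subset of values(B)."""
    return [(a, b) for a, b in permutations(cols, 2)
            if cols[a] <= cols[b]]          # set inclusion test

for dependent, referenced in unary_inds(columns):
    print(dependent, "is included in", referenced)
# Expected: orders.customer_id is included in customers.id
#           customers.country  is included in invoices.country
```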
186

Geometrische und stochastische Modelle für die integrierte Auswertung terrestrischer Laserscannerdaten und photogrammetrischer Bilddaten (Geometric and stochastic models for the integrated analysis of terrestrial laser scanner data and photogrammetric image data)

Schneider, Danilo 07 September 2009 (has links) (PDF)
The use of terrestrial laser scanning has grown in popularity in recent years, and it replaces and complements previous measuring methods, as well as opening new fields of application. If data from terrestrial laser scanners are combined with photogrammetric image data, this yields promising possibilities, as the properties of both types of data can be considered largely complementary: terrestrial laser scanners produce fast and reliable three-dimensional representations of object surfaces from only one position, while two-dimensional photogrammetric image data are characterised by a high visual quality, ease of interpretation, and high lateral accuracy. Consequently, numerous hardware- and software-based approaches exist in which this combination is realised. However, in most approaches the image data are only used to add additional characteristics, such as colouring point clouds or texturing object surfaces generated from laser scanner data. A thorough exploitation of the complementary characteristics of both types of sensors provides much more potential. For this reason a calculation method – the integrated bundle adjustment – was developed within this thesis, in which the observations of discrete object points derived from terrestrial laser scanner data and photogrammetric image data are utilised equally. This approach has several advantages: using the individual characteristics of both types of data, they mutually strengthen each other in terms of 3D object coordinate determination, so that a higher accuracy can be achieved; all involved data sets are optimally co-registered; and each instrument is simultaneously calibrated. Due to the (spherical) field of view of most terrestrial laser scanners of 360° in the horizontal direction and up to 180° in the vertical direction, the integration with rotating line panoramic cameras or cameras with fisheye lenses is very appropriate, as they have a wider field of view compared to central perspective cameras. The basis for the combined processing of terrestrial laser scanner and photogrammetric image data is the strict geometric modelling of the recording instruments. Therefore geometric models, consisting of a basic model and additional parameters for the compensation of systematic errors, were developed and verified for terrestrial laser scanners and different types of cameras. Regarding the geometric laser scanner model, different approaches described in the literature were considered, as well as correction models known from theodolites and total stations. A particular consideration within the combined processing is the definition of the stochastic model. Since different types of observations with different underlying geometric models and different stochastic properties have to be adjusted simultaneously, adequate weights have to be assigned to the measurements. An unfavourable weighting can have a negative influence on the adjustment results. Therefore a variance component estimation procedure was implemented in the integrated bundle adjustment, which allows for an automatic determination of optimal observation weights. Hence, it becomes possible to fully exploit the potential of the combination of terrestrial laser scanner and photogrammetric image data. For the calculation of the integrated bundle adjustment, software was developed that allows various algorithmic configurations of the different data types to be applied. Numerous laser scanner, panoramic image, fisheye image, and central perspective image data were recorded in several test fields and processed using the developed software. Several calculation alternatives were analysed, demonstrating the advantages and limitations of the presented method. An application example from the field of geology illustrates the potential of the algorithm in practice.
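The following toy sketch (not the thesis software; data are synthetic) illustrates the principle of iterative variance component estimation for two observation groups adjusted in one least-squares model, so that groups with different noise levels receive appropriate relative weights:

```python
# Toy sketch of iterative (Helmert-type) variance component estimation for two
# observation groups in one least-squares adjustment. Illustrative only; all
# data are synthetic.
import numpy as np

rng = np.random.default_rng(0)

# Common model: a straight line y = a + b*x observed by two "sensors"
# with different (unknown) noise levels.
a_true, b_true = 2.0, 0.5
x1, x2 = rng.uniform(0, 10, 40), rng.uniform(0, 10, 40)
y1 = a_true + b_true * x1 + rng.normal(0, 0.05, x1.size)   # precise group
y2 = a_true + b_true * x2 + rng.normal(0, 0.50, x2.size)   # noisy group

A = [np.column_stack([np.ones_like(x1), x1]),
     np.column_stack([np.ones_like(x2), x2])]
y = [y1, y2]
var = [1.0, 1.0]                       # initial variance components sigma_k^2

for _ in range(20):
    # Weighted least squares with the current group variances as weights.
    N = sum(Ak.T @ Ak / vk for Ak, vk in zip(A, var))
    rhs = sum(Ak.T @ yk / vk for Ak, yk, vk in zip(A, y, var))
    xhat = np.linalg.solve(N, rhs)
    Ninv = np.linalg.inv(N)

    # Re-estimate each group's variance from its residuals and redundancy.
    new_var = []
    for Ak, yk, vk in zip(A, y, var):
        v = Ak @ xhat - yk
        redundancy = yk.size - np.trace(Ak @ Ninv @ Ak.T) / vk
        new_var.append(float(v @ v / redundancy))
    if np.allclose(new_var, var, rtol=1e-6):
        break
    var = new_var

print("estimated line parameters:", xhat)
print("estimated std. dev. per group:", np.sqrt(var))
```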
187

Constraining 3D Petroleum Reservoir Models to Petrophysical Data, Local Temperature Observations, and Gridded Seismic Attributes with the Ensemble Kalman Filter (EnKF)

Zagayevskiy, Yevgeniy Unknown Date
No description available.
188

Capability-based Description and Discovery of Services

Devereux, Drew Unknown Date (has links)
Whenever autonomous entities work together to meet each other's needs, there arises the problem of how an entity with a need can find and use entities with the capability to meet that need. This problem is seen in Web service architectures, agent systems, and data integration systems, among others. Solutions have been proposed in each of these fields, but they are all dependent on implementation and interface. Hence all are restricted to their particular field, and all require their participants to conform to certain assumptions about implementation and interface. This failure to support service autonomy is conceptually unattractive and impractical. In this thesis we show how to describe and matchmake service capabilities and client needs in a way that is implementation- and interface-independent. The result is a service discovery solution that fully supports the right of services to choose their own implementation and interface. Our representation is capable of capturing capabilities across a range of service types, from Web services to agents to data sources, while ignoring the implementation and interface details that distinguish them. Thus, our solution unifies these fields for description and discovery purposes, allowing data sources with complex language interfaces to compete against form-based Web services and frame-and-slot agents, for example. Moreover, our solution captures all of the most important aspects of capability, such as the conceptual meaning of, and limitations on, what a service can achieve; what requests can be expressed through a service's interface; and limitations on what attributes of information a service can return. The provision of an interface-independent capability description raises the additional question of how to enable a client to invoke the service to which it has been matched and correctly interpret the results returned; we solve this by providing an interface description that maps from client objectives onto invocations, and from returned results onto a canonical result format.
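As a toy sketch only (not the representation developed in the thesis; all service names and attributes are invented), matchmaking can be thought of as checking that a need's concept, required attributes, and limits are subsumed by an advertised capability:

```python
# Toy sketch of capability-based matchmaking (not the thesis representation):
# a capability advertises the concept it covers, the attributes it can return,
# and a limitation on what it can achieve; a need matches when all of its
# requirements are subsumed. All names and services are hypothetical.
from dataclasses import dataclass

@dataclass
class Capability:
    service: str
    concept: str                      # the kind of thing the service provides
    attributes: frozenset             # attributes it can return
    max_results: int = 10_000         # a limitation on what it can achieve

@dataclass
class Need:
    concept: str
    required_attributes: frozenset
    min_results: int = 1

def matches(need: Need, cap: Capability) -> bool:
    return (need.concept == cap.concept
            and need.required_attributes <= cap.attributes
            and need.min_results <= cap.max_results)

registry = [
    Capability("flight-ws",   "Flight", frozenset({"origin", "destination", "price"})),
    Capability("hotel-agent", "Hotel",  frozenset({"city", "price", "rating"})),
]

need = Need("Flight", frozenset({"origin", "price"}))
print([c.service for c in registry if matches(need, c)])   # -> ['flight-ws']
```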
190

Bivariate relationship modelling on bounded spaces with application to the estimation of forest foliage cover by Landsat satellite ETM-plus sensor

Moffiet, Trevor Noel January 2008 (has links)
Research Doctorate - Doctor of Philosophy (PhD) / Due to the effects of global warming and climate change, there is currently intense and growing international interest in suitable modelling methods for relating satellite remotely sensed spectral imagery of vegetated landscapes to the biophysical structural variables in those landscapes across regional, continental, or global scales. Of particular interest here is the satellite optical remote sensing of forest foliage cover—measured as foliage projective cover (FPC)—by the Landsat ETM+ (Enhanced Thematic Mapper Plus) sensor. In the remote sensing literature, different empirical and physical modelling approaches exist for relating remotely sensed imagery to the landscape parameters of interest, each with its own advantages and disadvantages. These approaches may be broadly categorised as belonging to one, or a combination, of: spectral mixture analysis (SMA) modelling, canopy reflectance modelling, multiple regression (MR) modelling, or spectral vegetation index (SVI) modelling. This thesis uses the SVI approach, partly in comparison to the MR approach. Both the SVI and MR approaches require field-based data to establish the relationship between the biophysical parameter and the spectral index or spectral responses within defined spectral bandwidths. Surrogate measures of the biophysical parameter are sometimes used extensively to establish this relationship, and therefore a separate calibration relationship is required. This has inherent problems when the output of one model is substituted into the next and the effects of carry-over of error from one model to the next are not considered. My main goal is therefore to develop a modelling approach that allows a larger set of one or more surrogate measures to be combined with a smaller set of 'true' measures of the biophysical parameter in a single model for establishing the relationship with the SVI and hence the spectral imagery. Success in meeting the goal is the illustration of a working model using real data. In working towards this goal, two new modelling ideas are developed and synthesised into an overall modelling framework for estimating FPC from spectral imagery. The modelling framework, which has potential for use in other applications, allows different types of data, including different calibration relationships between variables, to be incorporated while avoiding the usual stepwise approach of linking separate relationship models and their variables. One contribution that is new to both remote sensing and statistical modelling practice involves a polar transformation of the principal components of a multi-spectral image of a local reference landscape to produce a set of empirically based, invariant, three-dimensional spectral index transformations that have potential for application to the spectral images of different regional landscapes and possibly global landscapes. In particular, the vegetation index from the set has approximate bounded properties that we exploit for modelling its contribution to residual variation in its relationships with the biophysical variables measured on the ground. The other contribution to statistical modelling practice, with potential for application by a wide range of disciplines, is the direct modelling of interdependent relationships between pairs of bounded variates, each considered to have a measurement-error structure that can be modelled as though it were similar to sampling variation.
Associated with this particular contribution is the development of novel geometric methods to construct approximate prediction bounds and to assist with model interpretations.
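The exact transformation is not given in the abstract, but the general idea of a polar transformation of image principal components can be sketched as follows (synthetic data; not the thesis method): compute principal components over the spectral bands and express the first two in polar coordinates, so that the angle is bounded by construction.

```python
# Minimal sketch of a polar transformation of image principal components.
# This is NOT the thesis transformation; data and scaling are synthetic.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical reference image: 10,000 pixels x 6 spectral bands.
pixels = rng.normal(size=(10_000, 6)) @ rng.normal(size=(6, 6)) + 50.0

# Principal components of the band space (eigenvectors of the covariance).
centred = pixels - pixels.mean(axis=0)
cov = np.cov(centred, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]            # sort by decreasing variance
pc = centred @ eigvecs[:, order[:2]]         # first two components

# Polar transformation: radius (a brightness-like magnitude) and angle,
# the latter bounded to (-pi, pi] and usable as a bounded spectral index.
radius = np.hypot(pc[:, 0], pc[:, 1])
angle = np.arctan2(pc[:, 1], pc[:, 0])

print("angle range:", angle.min(), angle.max())   # bounded by construction
```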
