31 |
Multivariate Analysis of Diverse Data for Improved Geostatistical Reservoir Modeling. Hong, Sahyun. 11 1900.
Improved numerical reservoir models are constructed when all available diverse data sources are accounted for to the maximum extent possible. Integrating diverse data is not a simple problem because the data differ in precision, in quality, and in their relevance to the primary variables being modeled, and they may be nonlinearly related. Previous approaches rely on a strong Gaussian assumption or on combining source-specific probabilities that are calibrated individually from each data source.
This dissertation develops different approaches to integrate diverse earth science data. The first approach is based on probability combination. Each data source is calibrated to generate an individual conditional probability, and these probabilities are merged by a combination model. Existing models are reviewed, and a new combination model with a new weighting scheme is proposed. The weaknesses of probability combination schemes (PCS) are addressed. As an alternative to PCS, this dissertation develops a multivariate analysis technique that models the multivariate distribution without a parametric distribution assumption and without ad hoc probability combination procedures. The method accounts for nonlinear features and different data types. Once the multivariate distribution is modeled, the marginal distribution constraints are evaluated. A sequential iteration algorithm is proposed for this evaluation: it compares the marginal distributions extracted from the modeled multivariate distribution with the known marginal distributions and corrects the multivariate distribution accordingly. Ultimately, the corrected distribution satisfies all axioms of probability distribution functions as well as the complex features among the given data.
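The marginal-correction step can be pictured as an iterative proportional fitting style loop. The sketch below is an illustrative Python implementation under that assumption (the dissertation's own sequential iteration algorithm may differ): a discretized joint distribution is repeatedly rescaled along each axis until its marginals match known targets.

```python
import numpy as np

def correct_to_marginals(joint, target_marginals, tol=1e-8, max_iter=500):
    """Rescale a discretized joint distribution so that each of its marginals
    matches a known target marginal (iterative proportional fitting sketch).

    joint            : nonnegative ndarray, any number of dimensions
    target_marginals : list of 1-D arrays, one per axis, each summing to 1
    """
    p = joint / joint.sum()
    for _ in range(max_iter):
        max_err = 0.0
        for axis, target in enumerate(target_marginals):
            # Current marginal along this axis
            other_axes = tuple(a for a in range(p.ndim) if a != axis)
            current = p.sum(axis=other_axes)
            max_err = max(max_err, np.abs(current - target).max())
            # Rescale slices so the marginal matches the target
            ratio = np.where(current > 0, target / np.maximum(current, 1e-12), 0.0)
            shape = [1] * p.ndim
            shape[axis] = -1
            p = p * ratio.reshape(shape)
        p = p / p.sum()
        if max_err < tol:
            break
    return p

# Example: a 2-D joint corrected to uniform marginals on both axes
joint = np.array([[0.4, 0.1], [0.2, 0.3]])
corrected = correct_to_marginals(joint, [np.array([0.5, 0.5]), np.array([0.5, 0.5])])
print(corrected.sum(axis=1), corrected.sum(axis=0))  # both close to [0.5, 0.5]
```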
The methodology is applied to several problems, including: (1) integration of continuous data for categorical attribute modeling, (2) integration of continuous data and a discrete geologic map for categorical attribute modeling, and (3) integration of continuous data for continuous attribute modeling. Results are evaluated against defined criteria such as the fairness of the estimated probability or probability distribution and reasonable reproduction of the input statistics.
|
32 |
Tabular Representation of Schema Mappings: Semantics and Algorithms. Rahman, Md. Anisur. 27 May 2011.
Our thesis investigates a mechanism for representing schema mappings in tabular form and examines the utility of this new representation.
Schema mapping is a high-level specification that describes the relationship between two database schemas. Schema mappings constitute essential building blocks of data integration, data exchange, and peer-to-peer data sharing systems. Global-and-local-as-view (GLAV) is one of the approaches for specifying schema mappings, and tableaux are used for expressing queries and functional dependencies on a single database in tabular form. In our thesis, we first introduce a tabular representation of GLAV mappings. We find that this tabular representation helps to solve many mapping-related algorithmic and semantic problems; a well-known example is finding the minimal instance of the target schema for a given instance of the source schema and a set of mappings between the source and target schemas. Second, we show that our proposed tabular mapping can be used as an operator on an instance of the source schema to produce an instance of the target schema that is `minimal' and `most general' in nature. Third, building on an existing tableaux-based mechanism for deciding the equivalence of two queries, we extend that mechanism to deduce equivalence between two schema mappings using their corresponding tabular representations. Fourth, since redundant conjuncts in a schema mapping make data exchange, data integration, and data sharing operations more time consuming, we present an algorithm that utilizes the tabular representations to reduce the number of constraints in the schema mappings. Fifth, whereas at present either schema-level mappings or data-level mappings are used for data sharing, we introduce and give the semantics of bi-level mappings, which combine schema-level and data-level mappings, and we show that bi-level mappings are more effective for data sharing systems. Finally, we implemented our algorithms and developed a software prototype to evaluate our proposed strategies.
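As a rough illustration of what a tableau-style encoding of a GLAV mapping can look like, the sketch below represents one mapping as a pair of source and target tableaux whose shared variables express the correspondence, and applies it to a source instance with a chase-like step. The relation names, the `$`-variable convention, and the `apply_mapping` helper are assumptions for this example (and existential labelled-null variables are ignored for brevity); they are not the thesis's actual representation or algorithms.

```python
from itertools import product

# A tableau is a list of (relation, tuple-of-terms); terms starting with '$'
# are variables, anything else is a constant (illustrative convention).
glav_mapping = {
    "source": [("Emp", ("$n", "$d"))],                           # body over the source schema
    "target": [("Person", ("$n",)), ("WorksIn", ("$n", "$d"))],  # head over the target schema
}

def apply_mapping(mapping, source_instance):
    """Chase-like application of one tabular mapping: find every assignment of
    the source tableau's variables into the source instance, then emit the
    corresponding target tuples."""
    facts = set()
    body = mapping["source"]
    candidates = [source_instance.get(rel, []) for rel, _ in body]
    for combo in product(*candidates):
        binding, ok = {}, True
        for (rel, terms), tup in zip(body, combo):
            for term, value in zip(terms, tup):
                if term.startswith("$"):
                    if binding.setdefault(term, value) != value:
                        ok = False          # inconsistent variable binding
                else:
                    ok = ok and (term == value)
        if not ok:
            continue
        for rel, terms in mapping["target"]:
            facts.add((rel, tuple(binding.get(t, t) for t in terms)))
    return facts

source = {"Emp": [("ann", "sales"), ("bob", "it")]}
print(sorted(apply_mapping(glav_mapping, source)))
```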
|
33 |
A Practical Approach to Merging Multidimensional Data Models. Mireku Kwakye, Michael. 30 November 2011.
Schema merging is the process of incorporating data models into an integrated, consistent schema from which query solutions satisfying all incorporated models can be derived. The efficiency of such a process relies on an effective semantic representation of the chosen data models, as well as on the mapping relationships between the elements of the source data models.
Consider a scenario where, as a result of company mergers or acquisitions, a number of related but possibly disparate data marts need to be integrated into a global data warehouse. The ability to retrieve data across these disparate but related data marts poses an important challenge. Intuitively, forming an all-inclusive data warehouse involves the tedious tasks of identifying related fact and dimension table attributes, as well as designing a schema merge algorithm for the integration. Additionally, evaluating the combined set of correct answers to queries likely to be posed independently to such data marts becomes difficult to achieve.
Model management refers to a high-level, abstract programming language designed to manipulate schemas and mappings efficiently. In particular, model management operations such as match, compose mappings, apply functions, and merge offer a way to handle the above-mentioned data integration problem within the domain of data warehousing.
In this research, we introduce a methodology, based on model management, for integrating star-schema source data marts into a single consolidated data warehouse. The methodology comprises three main streamlined steps that facilitate the generation of a global data warehouse: we adopt techniques for deriving attribute correspondences and for schema mapping discovery, and we then formulate and design a merge algorithm based on multidimensional star schemas, which is the core contribution of this research. Our approach focuses on delivering a polynomial-time solution suited to the expected volume of data and its associated large-scale query processing.
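As an illustrative companion to the three-step methodology, the sketch below wires attribute matching, mapping discovery, and a merge step into one pipeline for two toy star schemas. The schema contents, the name-similarity matcher, and the union-based merge rule are simplifying assumptions for this example, not the thesis's actual correspondence, mapping, or merge algorithms.

```python
from difflib import SequenceMatcher

# Two toy star-schema data marts: fact-table attributes plus dimensions.
sales_mart = {"fact": ["order_id", "qty", "amount"],
              "dims": {"customer": ["cust_id", "cust_name", "region"],
                       "date": ["date_key", "month", "year"]}}
orders_mart = {"fact": ["order_no", "quantity", "amount"],
               "dims": {"client": ["client_id", "client_name", "country"],
                        "date": ["date_key", "month", "year"]}}

def match(attrs_a, attrs_b, threshold=0.6):
    """Step 1: derive attribute correspondences via simple name similarity."""
    if not attrs_b:
        return []
    pairs = []
    for a in attrs_a:
        best = max(attrs_b, key=lambda b: SequenceMatcher(None, a, b).ratio())
        if SequenceMatcher(None, a, best).ratio() >= threshold:
            pairs.append((a, best))
    return pairs

def merge(mart_a, mart_b):
    """Steps 2-3: treat the correspondences as (trivial) mappings and merge,
    keeping one attribute per matched pair and the union of everything else."""
    fact_corr = dict(match(mart_a["fact"], mart_b["fact"]))
    merged = {"fact": mart_a["fact"] +
                      [b for b in mart_b["fact"] if b not in fact_corr.values()],
              "dims": {}}
    dim_corr = dict(match(list(mart_a["dims"]), list(mart_b["dims"])))
    for name, attrs in mart_a["dims"].items():
        other = mart_b["dims"].get(dim_corr.get(name, ""), [])
        attr_corr = dict(match(attrs, other))
        merged["dims"][name] = attrs + [x for x in other if x not in attr_corr.values()]
    # Dimensions of mart_b that matched nothing are carried over unchanged.
    for name, attrs in mart_b["dims"].items():
        if name not in dim_corr.values():
            merged["dims"][name] = list(attrs)
    return merged

print(merge(sales_mart, orders_mart))
```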
The experimental evaluation shows that an integrated schema, alongside instance data, can be derived based on the type of mappings adopted in the mapping discovery step. The adoption of Global-And-Local-As-View (GLAV) mapping models delivered a maximally-contained or exact representation of all fact and dimensional instance data tuples needed in query processing on the integrated data warehouse. Additionally, different forms of conflicts, such as semantic conflicts for related or unrelated dimension entities and descriptive conflicts for differing attribute data types, were encountered and resolved in the developed solution. Finally, this research has highlighted some critical and inherent issues regarding functional dependencies in mapping models, integrity constraints at the source data marts, and multi-valued dimension attributes. These issues were encountered during the integration of the source data marts and when evaluating queries processed on the merged data warehouse against those processed on the independent data marts.
|
34 |
Integration of heterogeneous data types using self-organizing maps. Bourennani, Farid. 01 July 2009.
With the growth of computer networks and the advancement of hardware technologies, unprecedented volumes of data have become accessible in a distributed fashion, forming heterogeneous data sources. Understanding and combining these data into data warehouses, or merging remote public data into existing databases, can significantly enrich the information they provide. This problem is called data integration: combining data residing at different sources and providing the user with a unified view of these data. There are two issues in making use of remote data sources: (1) discovering relevant data sources, and (2) performing the proper joins between the local data source and the relevant remote databases. Both can be solved if one can effectively identify semantically related attributes between the local data sources and the available remote data sources. However, performing these tasks manually is time consuming because of the large data sizes and the unavailability of schema documentation; an automated tool would therefore be more suitable. Automatically detecting similar entities based on content is challenging for three reasons. First, because the number of records is voluminous, it is difficult to perceive or discover information structures or relationships. Second, the schemas of the databases are unfamiliar, so detecting relevant data is difficult. Third, the database entity types are heterogeneous, and there is no existing solution for extracting a richer classification result from processing two different data types, or at least from textual and numerical data.
We propose to utilize self-organizing maps (SOM) to aid the visual exploration of large data volumes. The unsupervised classification property of SOM facilitates the integration of completely unfamiliar relational database tables and attributes based on their contents. In order to accommodate the heterogeneous data types found in relational databases, we extended the term frequency-inverse document frequency (TF-IDF) measure to handle numerical and textual attribute types through unified vectorization processing. The resulting map allows the user to browse the heterogeneously typed database attributes and discover clusters of documents (attributes) having similar content.
The discovered clusters can significantly aid the manual or automated construction of data integrity constraints in data cleaning, or of schema mappings for data integration.
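One way to picture the unified vectorization idea is to tokenize textual attribute values directly and to discretize numerical values into bin tokens, so that both attribute types share one TF-IDF space. The binning scheme, the toy attributes, and the use of cosine similarity in place of a full SOM are assumptions made for this sketch rather than the thesis's actual processing.

```python
import math
from collections import Counter

def tokens_for(values):
    """Turn an attribute's values into tokens: words for text, bin labels for numbers."""
    toks = []
    for v in values:
        if isinstance(v, (int, float)):
            toks.append(f"num_bin_{int(v // 10)}")   # crude equal-width binning
        else:
            toks.extend(str(v).lower().split())
    return toks

def tfidf(docs):
    """docs: {attribute_name: token list} -> {attribute_name: {token: weight}}."""
    df = Counter(tok for toks in docs.values() for tok in set(toks))
    n = len(docs)
    out = {}
    for name, toks in docs.items():
        tf = Counter(toks)
        out[name] = {t: (c / len(toks)) * math.log((1 + n) / (1 + df[t]))
                     for t, c in tf.items()}
    return out

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy attributes from two databases: a textual pair and a numerical pair.
attributes = {
    "db1.customer.city":  ["Toronto", "Ottawa", "Toronto"],
    "db2.client.town":    ["Ottawa", "Toronto", "Montreal"],
    "db1.order.amount":   [12.5, 18.0, 95.0],
    "db2.purchase.total": [11.0, 17.5, 99.0],
}
vecs = tfidf({name: tokens_for(vals) for name, vals in attributes.items()})
print(round(cosine(vecs["db1.customer.city"], vecs["db2.client.town"]), 3))
print(round(cosine(vecs["db1.order.amount"], vecs["db2.purchase.total"]), 3))
```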
|
37 |
Manifold Integration: Data Integration on Multiple Manifolds. Choi, Hee Youl. May 2010.
In data analysis, data points are usually analyzed based on their relations to other points (e.g., distance or inner product). This kind of relation can be analyzed on the manifold of the data set, and manifold learning is an approach to understanding such relations. Various manifold learning methods have been developed, and their effectiveness has been demonstrated in many real-world problems in pattern recognition and signal processing. However, most existing manifold learning algorithms only consider one manifold based on one dissimilarity matrix. In practice, multiple measurements may be available and could be utilized. In pattern recognition systems, data integration has been an important consideration for improved accuracy given multiple measurements, and some data integration algorithms have been proposed to address this issue. These integration algorithms mostly use statistical information from the data set, such as the uncertainty of each data source, but they do not use structural information (i.e., the geometric relations between data points). Such structure is naturally described by a manifold.
Even though manifold learning and data integration have been successfully used for data analysis, they have not been considered in a single integrated framework. When we have multiple measurements generated from the same data set and mapped onto different manifolds, those measurements can be integrated using the structural information on these multiple manifolds. Furthermore, we can better understand the structure of the data set by combining the multiple measurements on each manifold using data integration techniques.
In this dissertation, I present a new concept, manifold integration, a data integration method that uses the structure of data expressed in multiple manifolds. To achieve manifold integration, I formulated the manifold integration concept and derived three manifold integration algorithms. Experimental results showed the algorithms' effectiveness in classification and dimension reduction. Moreover, I showed that manifold integration has good theoretical and neuroscientific applications. I expect the manifold integration approach to serve as an effective framework for analyzing multimodal data sets on multiple manifolds, and I expect that my research on manifold integration will catalyze both manifold learning and data integration research.
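A simple way to prototype the idea of integrating structural information from several manifolds is to combine the dissimilarity matrices produced by different measurements and embed the result with classical multidimensional scaling. The scale normalization, the averaging rule, and the MDS embedding below are assumptions for illustration, not one of the dissertation's three algorithms.

```python
import numpy as np

def classical_mds(dist, k=2):
    """Classical MDS: embed points so pairwise distances approximate `dist`."""
    n = dist.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    b = -0.5 * j @ (dist ** 2) @ j               # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(b)
    order = np.argsort(vals)[::-1][:k]           # top-k eigenpairs
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))

def integrate_manifolds(dissimilarities, k=2):
    """Combine several dissimilarity matrices (one per measurement) by simple
    averaging after per-matrix scale normalization, then embed jointly."""
    scaled = [d / d.max() for d in dissimilarities]
    combined = np.mean(scaled, axis=0)
    return classical_mds(combined, k)

# Two noisy distance measurements of the same four points on a line.
rng = np.random.default_rng(0)
base = np.abs(np.subtract.outer(np.arange(4.0), np.arange(4.0)))
measurements = [base + rng.uniform(0, 0.1, base.shape) * (base > 0) for _ in range(2)]
measurements = [(m + m.T) / 2 for m in measurements]   # keep matrices symmetric
embedding = integrate_manifolds(measurements, k=1)
print(embedding.ravel())
```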
|
38 |
Production Data Integration into High Resolution Geologic Models with Trajectory-based Methods and A Dual Scale Approach. Kim, Jong Uk. August 2009.
Inverse problems associated with reservoir characterization are typically underdetermined and often have difficulties associated with stability and convergence of the solution. A common approach to address this issue is the introduction of prior constraints, regularization, or reparameterization to reduce the number of estimated parameters.
We propose a dual-scale approach to production data integration that relies on a combination of coarse-scale and fine-scale inversions while preserving the essential features of the geologic model. To begin with, we sequentially coarsen the fine-scale geological model by grouping layers in such a way that the heterogeneity measure of an appropriately defined 'static' property is minimized within the layers and maximized between the layers. Our coarsening algorithm results in a non-uniform coarsening of the geologic model with minimal loss of heterogeneity, and the 'optimal' number of layers is determined based on a bias-variance trade-off criterion. The coarse-scale model is then updated using production data via a generalized travel time inversion. The coarse-scale inversion proceeds much faster than a direct fine-scale inversion because of the significantly reduced parameter space. Furthermore, the iterative minimization is much more effective because at larger scales there are fewer local minima and those tend to be farther apart. At the end of the coarse-scale inversion, a fine-scale inversion may be carried out, if needed; this constitutes the outer iteration in the overall algorithm. The fine-scale inversion is carried out only if the data misfit is deemed to be unsatisfactory.
We also propose a fast and robust approach to calibrating geologic models to transient pressure data using a trajectory-based approach that is based on a high-frequency asymptotic expansion of the diffusivity equation. Trajectory- or ray-based methods are routinely used in seismic tomography. In this work, we investigate seismic rays and compare them with streamlines, and we then examine the applicability of streamline-based methods for transient pressure data inversion. Specifically, the high-frequency asymptotic approach allows us to analytically compute the sensitivity of the pressure responses with respect to reservoir properties such as porosity and permeability. It facilitates a very efficient methodology for the integration of pressure data into geologic models.
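The layer-grouping idea in the coarsening step can be illustrated with a greedy sketch that merges adjacent layers whose 'static' property values are most similar and scores each candidate layer count with a simple bias-variance style criterion. The property values, the merging rule, and the scoring function (including its penalty weight) are assumptions for this example, not the dissertation's algorithm.

```python
import numpy as np

def coarsen_layers(static_prop, n_coarse):
    """Greedily merge adjacent fine layers with the most similar mean 'static'
    property until n_coarse groups remain. Returns a list of index groups."""
    groups = [[i] for i in range(len(static_prop))]
    while len(groups) > n_coarse:
        means = [np.mean([static_prop[i] for i in g]) for g in groups]
        diffs = [abs(means[i + 1] - means[i]) for i in range(len(groups) - 1)]
        j = int(np.argmin(diffs))                 # most similar adjacent pair
        groups[j:j + 2] = [groups[j] + groups[j + 1]]
    return groups

def score(static_prop, groups):
    """Bias-variance style criterion: within-group variance (loss of heterogeneity
    in the coarse model) plus a penalty proportional to the number of groups."""
    within = sum(np.var([static_prop[i] for i in g]) * len(g) for g in groups)
    return within / len(static_prop) + 0.005 * len(groups)

# Fine-scale 'static' property (e.g., a flow-capacity-like measure) per layer.
prop = np.array([0.9, 0.85, 0.88, 0.2, 0.25, 0.22, 0.6, 0.62, 0.58, 0.61])
candidates = {k: coarsen_layers(prop, k) for k in range(1, len(prop) + 1)}
best_k = min(candidates, key=lambda k: score(prop, candidates[k]))
print(best_k, candidates[best_k])   # expect 3 groups matching the three plateaus
```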
|
39 |
Field scale history matching and assisted history matching using streamline simulation. Kharghoria, Arun. 15 November 2004.
In this study, we apply the streamline-based production data integration method to condition a multimillion-cell geologic model to historical production response for a giant Saudi Arabian reservoir. The field has been under peripheral water injection with 16 injectors and 70 producers, and there is also a strong aquifer influx into the field. A total of 30 years of production history, with detailed rate, infill-well, and re-perforation schedules, was incorporated via multiple pressure updates during streamline simulation. Gravity and compressibility effects were also included to account for water slumping and aquifer support. To our knowledge, this is the first and largest such application of production data integration to geologic models accounting for realistic field conditions. We have developed novel techniques to analytically compute the sensitivities of the production response in the presence of gravity and changing field conditions, which makes our method computationally extremely efficient: the field application takes less than 6 hours to run on a PC.
The geologic model derived after conditioning to the production response was validated using field surveillance data. In particular, the flood front movement, the aquifer encroachment, and the bypassed oil locations obtained from the geologic model were found to be consistent with field observations. Finally, an examination of the permeability changes during production data integration revealed that most of these changes were aligned with the facies distribution, particularly the 'good' facies distribution, with no resulting loss in geologic realism.
We also propose a novel assisted history matching procedure for finite difference simulators using streamline-derived sensitivity calculations. Unlike existing assisted history matching techniques, where the user is required to manually adjust the parameters, this procedure combines the rigor of finite difference models and the efficiency of streamline simulators to perform history matching. The finite difference simulator is used to solve for pressure, flux, and saturations, which in turn are used as input to the streamline simulator for estimating the parameter sensitivities analytically. The streamline-derived sensitivities are then used to update the reservoir model, and the updated model is then used in the finite difference simulator in an iterative mode until a satisfactory history match is obtained.
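The iterative loop described above can be sketched in outline form. The simulator interfaces, the misfit measure, and the simple least-squares update below are placeholders assumed for illustration, since the thesis's simulators and update rule are not spelled out here.

```python
import numpy as np

def assisted_history_match(model, observed, fd_simulate, sl_sensitivities,
                           tol=1e-3, max_outer=20, step=0.5):
    """Outline of the assisted history-matching loop: a finite-difference run
    supplies pressure/flux/saturation fields, a streamline pass turns them into
    analytic sensitivities, and the model is updated until the misfit is small."""
    for _ in range(max_outer):
        state, predicted = fd_simulate(model)          # pressures, fluxes, saturations
        residual = observed - predicted                # production-data misfit
        if np.linalg.norm(residual) < tol:
            break
        sens = sl_sensitivities(model, state)          # d(response)/d(parameters)
        # Gauss-Newton style least-squares update (illustrative choice)
        update, *_ = np.linalg.lstsq(sens, residual, rcond=None)
        model = model + step * update
    return model

# Tiny synthetic check: a linear "simulator" whose true parameters we recover.
true_model = np.array([2.0, -1.0, 0.5])
forward = np.array([[1.0, 0.2, 0.0], [0.0, 1.0, 0.3], [0.1, 0.0, 1.0]])
observed = forward @ true_model
fd_simulate = lambda m: (None, forward @ m)
sl_sensitivities = lambda m, state: forward
print(assisted_history_match(np.zeros(3), observed, fd_simulate, sl_sensitivities))
```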
The assisted history matching procedure has been tested on both synthetic and field examples. The results show a significant speed-up compared with history matching using conventional finite difference simulators alone.
|
40 |
Geostatistical data integration in complex reservoirs. Elahi Naraghi, Morteza. 03 February 2015.
One of the most challenging issues in reservoir modeling is to integrate information coming from different sources at disparate scales and precision. The primary data are borehole measurements, but in most cases these are too sparse to construct accurate reservoir models, so the information from borehole measurements has to be supplemented with other, secondary data. The secondary data for reservoir modeling can be static, such as seismic data, or dynamic, such as production history, well test data, or time-lapse seismic data. Several algorithms for integrating different types of data have been developed. A novel method for data integration based on the permanence of ratios hypothesis was proposed by Journel in 2002. The premise of the permanence of ratios hypothesis is to assess the information from each data source separately and then merge the information while accounting for the redundancy between the information sources. This redundancy is accounted for using parameters (tau or nu parameters; Krishnan, 2004). The primary goal of this thesis is to derive a practical expression for the tau parameters and to demonstrate the procedure for calibrating these parameters using the available data.
This thesis presents two new algorithms for data integration in reservoir modeling that overcome some of the limitations of current methods. We present an extension to the direct-sampling-based multiple-point statistics method, together with a methodology for integrating secondary soft data in that framework. The algorithm is based on direct pattern search through an ensemble of realizations. We show that the proposed methodology is suitable for modeling complex channelized reservoirs and that it reduces the uncertainty associated with production performance due to the integration of secondary data. We subsequently present the permanence of ratios hypothesis for data integration in detail. We present analytical equations for calculating the redundancy factor for discrete or continuous variable modeling, and then show how this factor can be inferred from available data for different scenarios. We implement the method to model a carbonate reservoir in the Gulf of Mexico and show that it performs better than using primary hard and secondary soft data within the traditional geostatistical framework.
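For reference, the tau model that implements the permanence of ratios idea combines single-source conditional probabilities through distance ratios, with the tau exponents carrying the redundancy between sources. The sketch below is a generic implementation of that published model (Journel, 2002), not the specific calibrated expression derived in the thesis; the facies example values are made up for illustration.

```python
def tau_combine(prior, conditionals, taus=None):
    """Combine P(A) and single-source probabilities P(A|D_i) into an estimate of
    P(A | D_1, ..., D_n) with the tau model (permanence of ratios when all tau=1).

    Uses distance ratios x = (1 - p) / p:
        x / x0 = prod_i (x_i / x0) ** tau_i,   P(A | all data) = 1 / (1 + x)
    """
    if taus is None:
        taus = [1.0] * len(conditionals)        # tau_i = 1 assumes no redundancy
    x0 = (1.0 - prior) / prior
    x = x0
    for p_i, tau_i in zip(conditionals, taus):
        x_i = (1.0 - p_i) / p_i
        x *= (x_i / x0) ** tau_i
    return 1.0 / (1.0 + x)

# Two data sources both raise the probability of a 'channel sand' facies.
prior = 0.3
p_seismic, p_welltest = 0.6, 0.7
print(tau_combine(prior, [p_seismic, p_welltest]))              # full tau = 1 case
print(tau_combine(prior, [p_seismic, p_welltest], [1.0, 0.5]))  # partly redundant source
```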
|