11. Towards a Data Quality Framework for Heterogeneous Data

Micic, Natasha; Neagu, Daniel; Campean, Felician; Habib Zadeh, Esmaeil. 22 April 2017.
Every industry generates significant data output as a product of its working processes, and with the recent advent of big data mining and integrated data warehousing there is a clear case for a robust methodology for assessing data quality to ensure sustainable and consistent processing. In this paper a review of Data Quality (DQ) across multiple domains is conducted in order to propose connections between their methodologies. This critical review suggests that, in the process of DQ assessment, heterogeneous data sets are rarely treated as distinct types of data that need an alternative data quality assessment framework. We discuss the need for such a directed DQ framework and the opportunities foreseen in this research area, and propose to address them through degrees of heterogeneity.

12. Temporal Event Modeling of Social Harm with High Dimensional and Latent Covariates

Liu, Xueying. 08 1900.
Indiana University-Purdue University Indianapolis (IUPUI) / The counting process is fundamental to many real-world problems involving event data. The Poisson process, often used as the background intensity of the Hawkes process, is the most commonly used point process. The Hawkes process, a self-exciting point process, fits temporal event data, spatio-temporal event data, and event data with covariates. We study a Hawkes process fitted to heterogeneous drug overdose data via a novel semi-parametric approach. The counting process is also related to survival data, in that both study the occurrence of events over time. We fit a Cox model to temporal event data with a large corpus that is processed into high-dimensional covariates, and study the significant features that influence the intensity of events.
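
For context, the "self-exciting" behaviour referred to above is conventionally captured by the conditional intensity of the Hawkes process; a standard textbook form (not quoted from the thesis) is

\[
\lambda(t) = \mu(t) + \sum_{t_i < t} \phi(t - t_i), \qquad \phi(\tau) = \alpha \beta e^{-\beta \tau},
\]

where \(\mu(t)\) is the background (e.g. Poisson) intensity, the sum runs over all past event times \(t_i\), and \(\phi\) is a triggering kernel, here shown with the common exponential choice.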

13. Exploring Methods for Comparing Similarity of Dimensionally Inconsistent Multivariate Numerical Data

Micic, Natasha; Neagu, Daniel; Torgunov, Denis; Campean, Felician. 28 June 2018.
When developing multivariate data classification and clustering methodologies for data mining, it is clear that most contributions in the literature consider only data sets whose records consistently share the same attributes. There are, however, many cases in current big data analytics applications where data sets on the same topic, and even from the same source, measure differing attributes, for a multitude of reasons (whether the specific design of an experiment or poor data quality and consistency). We define this class of data as dimensionally inconsistent multivariate data, a topic that can be considered a subclass of Big Data Variety research. This paper explores some methodologies commonly used in multivariate classification and clustering tasks and considers how they could be adapted to compare dimensionally inconsistent data sets. The study focuses on adapting two similarity measures, the Robinson-Foulds tree distance and Variation of Information, for comparing the clusterings produced by hierarchical clustering algorithms on the raw multivariate data. The results from experiments on engineering data highlight that adapting pairwise measures to exclude non-common attributes from traditional distance metrics may not be the best method of classification. We suggest that more specialised similarity metrics are required to address the challenges presented by dimensionally inconsistent multivariate data, with specific applications for big engineering data analytics. / Jaguar Land-Rover
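
To make the "exclude non-common attributes" adaptation above concrete, here is a minimal sketch (our own illustration; the record format and the normalisation choice are assumptions, not the paper's code) of a pairwise distance restricted to shared attributes:

```python
import numpy as np

def common_attribute_distance(a: dict, b: dict) -> float:
    """Euclidean distance over only the attributes present in both records.

    `a` and `b` map attribute names to numeric values; records from
    dimensionally inconsistent data sets may have different key sets.
    """
    common = sorted(set(a) & set(b))
    if not common:
        return float("nan")  # no shared attributes: distance undefined
    va = np.array([a[k] for k in common], dtype=float)
    vb = np.array([b[k] for k in common], dtype=float)
    # Normalise by the number of shared attributes so distances computed
    # over different overlap sizes remain loosely comparable.
    return float(np.linalg.norm(va - vb) / np.sqrt(len(common)))

# Hypothetical engine-test records measuring overlapping attribute sets.
r1 = {"torque": 310.0, "rpm": 4200.0, "temp": 88.5}
r2 = {"torque": 295.0, "rpm": 4150.0, "fuel_rate": 12.3}
print(common_attribute_distance(r1, r2))
```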

14. Contributions to Engineering Big Data Transformation, Visualisation and Analytics. Adapted Knowledge Discovery Techniques for Multiple Inconsistent Heterogeneous Data in the Domain of Engine Testing

Jenkins, Natasha N. January 2022.
In the automotive sector, engine testing generates vast data volumes that are mainly beneficial to requesting engineers. However, these tests are often not revisited for further analysis due to inconsistent data quality and a lack of structured assessment methods. Moreover, the absence of a tailored knowledge discovery process hinders effective preprocessing, transformation, analytics, and visualization of data, restricting the potential for historical data insights. Another challenge arises from the heterogeneous nature of test structures, resulting in varying measurements, data types, and contextual requirements across different engine test datasets. This thesis aims to overcome these obstacles by introducing a specialized knowledge discovery approach for the distinctive Multiple Inconsistent Heterogeneous Data (MIHData) format characteristic of engine testing. The proposed methods include adapting data quality assessment and reporting, classifying engine types through compositional features, employing modified dendrogram similarity measures for classification, performing customized feature extraction, transformation, and structuring, generating and manipulating synthetic images to enhance data visualization, and applying adapted list-based indexing for multivariate engine test summary data searches. The thesis demonstrates how these techniques enable exploratory analysis, visualization, and classification, presenting a practical framework to extract meaningful insights from historical data within the engineering domain. The ultimate objective is to facilitate the reuse of past data resources, contributing to informed decision-making processes and enhancing comprehension within the automotive industry. Through its focus on data quality, heterogeneity, and knowledge discovery, this research establishes a foundation for optimized utilization of historical Engine Test Data (ETD) for improved insights. / Soroptimist International Bradford
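
The "modified dendrogram similarity measures" are specific to the thesis, but a plain, unmodified baseline for comparing two dendrograms, the cophenetic-distance correlation, can be sketched as follows (illustrative data and parameter choices only, not the thesis's method):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Two views of the same 20 engine tests over 5 measurements; the second
# is a noisy copy, standing in for an inconsistently measured re-run.
X = rng.normal(size=(20, 5))
Y = X + rng.normal(scale=0.3, size=X.shape)

# Build a dendrogram per view, then compare their cophenetic distances.
dx = cophenet(linkage(X, method="average"))
dy = cophenet(linkage(Y, method="average"))
r, _ = pearsonr(dx, dy)  # correlation as a crude dendrogram similarity
print(f"dendrogram similarity (cophenetic correlation): {r:.3f}")
```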

15. Study of network-service disruptions using heterogeneous data and statistical learning

Erjongmanee, Supaporn. 21 January 2011.
The study of network-service disruptions caused by large-scale disturbances has mainly focused on assessing network damage; how the disruptions unfold in relation to social organizations, weather, and power resources has received far less attention. The goal of this research is to study the responses of network-service disruptions caused by large-scale disturbances with respect to (1) the temporal and logical network, and (2) external factors such as weather and power resources, using real, publicly available heterogeneous data composed of network measurements, user inputs, organizations, geographic locations, weather, and power outage reports. Network-service disruptions at the subnet level caused by Hurricane Katrina in 2005 and Hurricane Ike in 2008 are used as case studies. The analysis of disruption responses with respect to the temporal and logical network shows that subnets became unreachable in a dependent manner within organizations, across organizations, and across autonomous systems; temporal dependence thus also reflects logical dependence. In addition, subnet unreachability is analyzed with respect to the external factors. Subnet unreachability and the storms are found to be only weakly correlated. This weak correlation motivated a search for root causes, which revealed that the majority of subnet unreachability reportedly occurred because of power outages or a lack of power generators. Using the power outage data, subnet unreachability and power outages are found to be strongly correlated.

16. An integrated latent construct modeling framework for predicting physical activity engagement and health outcomes

Hoklas, Megan Marie. 02 February 2015.
The health and well-being of individuals is related to their activity-travel patterns. Individuals who undertake physically active episodes such as walking and bicycling are likely to have improved health outcomes compared to individuals with sedentary auto-centric lifestyles. Activity-based travel demand models are able to predict the activity-travel patterns of individuals with a high degree of fidelity, providing rich information for transportation and public health professionals to infer the health outcomes that may be experienced by individuals in various geographic and demographic market segments. However, models of activity-travel demand do not account for the attitudinal factors and lifestyle preferences that affect activity-travel and mode use patterns. Such attitude and preference variables are virtually never collected explicitly in travel surveys, rendering it difficult to include them in model specifications. This paper applies Bhat's (2014) Generalized Heterogeneous Data Model (GHDM) approach, whereby latent constructs representing the degree to which individuals are health conscious and inclined to pursue physical activities may be modeled as a function of observed socio-economic and demographic variables and then included as explanatory factors in models of activity-travel outcomes and walk and bicycle use. The model system is estimated on the 2005-2006 National Health and Nutrition Examination Survey (NHANES) sample, demonstrating the efficacy of the approach and the importance of including such latent constructs in model specifications that purport to forecast activity and time use patterns.
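
The GHDM is a full simultaneous-equations system; as a heavily simplified sketch of the latent-construct idea only (not the model's actual specification), each latent construct \(z_\ell^*\) is written as a linear-in-parameters function of observed covariates and then enters the outcome equations:

\[
z_\ell^* = \boldsymbol{\alpha}_\ell' \mathbf{w} + \eta_\ell, \qquad
y = \boldsymbol{\beta}' \mathbf{x} + \boldsymbol{\gamma}' \mathbf{z}^* + \varepsilon,
\]

where \(\mathbf{w}\) collects the observed socio-economic and demographic variables, \(\mathbf{x}\) the other exogenous variables, and the error terms \(\eta_\ell\) and \(\varepsilon\) induce the correlation between attitudes and activity-travel outcomes.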

17. Heterogeneous data in information system integration: A case study designing and implementing an integration mechanism

Brostedt, Nathan. January 2017.
The data of different information systems is heterogeneous. As systems are integrated, it is necessary to bridge inconsistencies in order to reduce data heterogeneity. To integrate heterogeneous systems, a mediator can be used: the mediator acts as a middle layer between systems, handling the transfer and translation of data. A case study was conducted in which a prototype integration mechanism for exchanging genealogical data, based on the mediator concept, was developed. Further, a genealogical system was developed to make use of the integration mechanism, integrating with a genealogy service. To test the reusability of the integration mechanism, a file import/export system and a system for exporting data from the genealogy service to a file were developed. The mechanism was based on a neutral entity model that integrating systems could translate to; a neutralizing/de-neutralizing mechanism translated data between the neutral entity model and each system-specific entity model. The integration mechanism was added to the genealogy system as an addon and successfully integrated genealogical data. The outcomes included: the integration mechanism can be added as an addon to one or more of the systems being integrated, and the neutral entity model, together with the neutralizing/de-neutralizing mechanism, proved valuable.
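
As a rough illustration of the neutral-entity-model idea (all class and function names here are hypothetical, not taken from the thesis):

```python
from dataclasses import dataclass

# Hypothetical system-specific record, e.g. from a genealogy service.
@dataclass
class ServicePerson:
    full_name: str
    birth: str  # "YYYY-MM-DD"

# Neutral entity model shared by all integrated systems.
@dataclass
class NeutralPerson:
    given_name: str
    family_name: str
    birth_date: str

def neutralize(p: ServicePerson) -> NeutralPerson:
    """Translate a system-specific entity into the neutral model."""
    given, _, family = p.full_name.partition(" ")
    return NeutralPerson(given, family, p.birth)

def de_neutralize(p: NeutralPerson) -> ServicePerson:
    """Translate the neutral model back into the system-specific form."""
    return ServicePerson(f"{p.given_name} {p.family_name}", p.birth_date)

# The mediator moves data between systems only via the neutral model,
# so each system needs just one neutralize/de-neutralize pair rather
# than one translator per pair of systems.
record = ServicePerson("Ada Lovelace", "1815-12-10")
print(de_neutralize(neutralize(record)))
```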

18. Towards robust steganalysis: binary classifiers and large, heterogeneous data

Lubenko, Ivans. January 2013.
The security of a steganography system is defined by our ability to detect it. It is no surprise, then, that steganography and steganalysis both depend heavily on the accuracy and robustness of our detectors. This is especially true when real-world data is considered, due to its heterogeneity. The difficulty of such data manifests itself in a penalty that has periodically been reported to affect the performance of detectors built on binary classifiers; this is known as cover source mismatch. It remains unclear how the performance drop associated with cover source mismatch can be mitigated, or even measured. In this thesis we present a robust methodology to empirically measure its effect on the detection accuracy of steganalysis classifiers. Some basic machine-learning methods, originating in domain adaptation, are proposed to counter it. Specifically, we test two hypotheses through an empirical investigation: first, that linear classifiers are more robust than non-linear classifiers to cover source mismatch in real-world data; and second, that linear classifiers are so robust that, given sufficiently large mismatched training data, they can equal the performance of any classifier trained on small matched data. With the help of theory we draw several nontrivial conclusions from our results. The penalty from cover source mismatch may, in fact, be a combination of two types of error: estimation error and adaptation error. We show that the relatedness between training and test data, as well as the choice of classifier, both affect adaptation error, which, as we argue, ultimately defines a detector's robustness. This provides a novel framework for reasoning about what is required to improve the robustness of steganalysis detectors. While our empirical results may be viewed as a first step towards this goal, we show that our approach provides clear advantages over earlier methods. To our knowledge this is the first study of this scale and structure.
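
A minimal sketch of the kind of matched-versus-mismatched measurement described above (synthetic features and a generic linear classifier; the thesis's actual steganalysis features and data sets are its own):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def make_source(shift, n=2000, d=50):
    """Toy 'cover source': cover vs stego features with a source-specific shift."""
    X = rng.normal(loc=shift, size=(n, d))
    y = rng.integers(0, 2, size=n)
    X[y == 1] += 0.15  # weak stego signal
    return X, y

Xa, ya = make_source(shift=0.0)   # source A: training and matched test
Xb, yb = make_source(shift=0.5)   # source B: mismatched test

clf = LogisticRegression(max_iter=1000).fit(Xa[:1000], ya[:1000])
acc_matched = clf.score(Xa[1000:], ya[1000:])
acc_mismatched = clf.score(Xb, yb)
# The gap between the two accuracies is an empirical measure of the
# cover source mismatch penalty for this (linear) detector.
print(f"matched: {acc_matched:.3f}  mismatched: {acc_mismatched:.3f}")
```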

19. DATAWAREHOUSE APPROACH TO DECISION SUPPORT SYSTEM FROM DISTRIBUTED, HETEROGENEOUS SOURCES

Sannellappanavar, Vijaya Laxmankumar. 05 October 2006.
No description available.

20. Contributions for Handling Big Data Heterogeneity. Using Intuitionistic Fuzzy Set Theory and Similarity Measures for Classifying Heterogeneous Data

Ali, Najat. January 2019.
A huge amount of data is generated daily by digital technologies such as social media, web logs, traffic sensors, on-line transactions, tracking data, videos, and so on. This has led to the archiving and storage of larger and larger datasets, many of which are multi-modal or contain different types of data, contributing to the problem now known as "Big Data". In the area of Big Data, the volume, variety and velocity problems remain difficult to solve. The work presented in this thesis focuses on the variety aspect of Big Data. For example, data can come in various and mixed formats for the same feature (attribute) or for different features, and can be identified mainly by one of the following data types: real-valued, crisp and linguistic values. The increasing variety and ambiguity of such data are particularly challenging to process and make it difficult to build accurate machine learning models. Data heterogeneity therefore requires new methods of analysis and modelling to enable useful information extraction and the modelling of achievable tasks. In this thesis, new approaches are proposed for handling heterogeneous Big Data. Two techniques for filtering heterogeneous data objects are proposed: Two-Dimensional Similarity Space (2DSS) for data described by numeric and categorical features, and Three-Dimensional Similarity Space (3DSS) for real-valued, crisp and linguistic data. Both filtering techniques are used in this research to reduce noise in the initial dataset and make the dataset more homogeneous. Furthermore, a new similarity measure based on intuitionistic fuzzy set theory is proposed to handle the heterogeneity and ambiguity within crisp and linguistic data. In addition, new combined similarity models are proposed which allow heterogeneous data objects represented by a combination of crisp and linguistic values to be compared. Diverse examples are used to illustrate and discuss the efficiency of the proposed similarity models. The thesis also presents a modification of the k-Nearest Neighbour classifier, called k-Nearest Neighbour Weighted Average (k-NNWA), to classify heterogeneous datasets described by real-valued, crisp and linguistic data. Finally, the thesis introduces a novel classification model, called FCCM (Filter Combined Classification Model), for heterogeneous data classification; the proposed model combines the advantages of 3DSS and the k-NNWA classifier and outperforms the latter algorithm. All the proposed models and techniques have been applied to weather datasets and evaluated using accuracy, F-score and ROC area measures. The experiments revealed that the proposed filtering techniques are an efficient approach for removing noise from heterogeneous data and improving the performance of classification models. Moreover, the experiments showed that the proposed similarity measure for intuitionistic fuzzy data is capable of handling the fuzziness of heterogeneous data, and that intuitionistic fuzzy set theory offers some promise for solving Big Data problems by handling the uncertainty and heterogeneity of the data.
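
For orientation, an intuitionistic fuzzy value pairs a membership degree mu with a non-membership degree nu (mu + nu <= 1), leaving a hesitation margin pi = 1 - mu - nu; a generic textbook similarity between two such values (not necessarily the thesis's proposed measure) can be sketched as:

```python
def ifs_similarity(a, b):
    """Similarity between two intuitionistic fuzzy values.

    Each value is a (mu, nu) pair with mu + nu <= 1; hesitation is
    pi = 1 - mu - nu. Uses a normalized Hamming-style distance over
    (mu, nu, pi), a common textbook construction.
    """
    mu_a, nu_a = a
    mu_b, nu_b = b
    pi_a = 1 - mu_a - nu_a
    pi_b = 1 - mu_b - nu_b
    dist = (abs(mu_a - mu_b) + abs(nu_a - nu_b) + abs(pi_a - pi_b)) / 2
    return 1 - dist

# e.g. the linguistic value "warm" as encoded by two different sources:
print(ifs_similarity((0.7, 0.2), (0.6, 0.3)))  # -> 0.9
```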
