1 |
Clustering Multiple Contextually Related Heterogeneous Datasets
Hossain, Mahmood, 09 December 2006
Traditional clustering is typically based on a single feature set. In some domains, several feature sets may be available to represent the same objects, but it may not be easy to compute a useful and effective integrated feature set. We hypothesize that clustering individual datasets and then combining them using a suitable ensemble algorithm will yield better quality clusters compared to the individual clustering or clustering based on an integrated feature set. We present two classes of algorithms to address the problem of combining the results of clustering obtained from multiple related datasets where the datasets represent identical or overlapping sets of objects but use different feature sets. One class of algorithms was developed for combining hierarchical clusterings generated from multiple datasets and another class of algorithms was developed for combining partitional clusterings generated from multiple datasets. The first class of algorithms, called EPaCH, is based on graph-theoretic principles and uses the association strengths of objects in the individual cluster hierarchies. The second class of algorithms, called CEMENT, uses an EM (Expectation Maximization) approach to progressively refine the individual clusterings until the mutual entropy between them converges toward a maximum. We have applied our methods to the problem of clustering a document collection consisting of journal abstracts from ten different Library of Congress categories. After several natural language preprocessing steps, both syntactic and semantic feature sets were extracted. We present empirical results that include the comparison of our algorithms with several baseline clustering schemes using different cluster validation indices. We also present the results of one-tailed paired t-tests performed on cluster qualities.
Our methods are shown to yield higher quality clusters than the baseline clustering schemes that include the clustering based on individual feature sets and clustering based on concatenated feature sets. When the sets of objects represented in two datasets are overlapping but not identical, our algorithms outperform all baseline methods for all indices.
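To make the idea of combining clusterings from several feature sets concrete, here is a minimal sketch of a generic co-association ensemble. It is not EPaCH or CEMENT, whose details the abstract does not give; the function names, the 0.5 threshold, and the connected-components grouping are all assumptions chosen for illustration.

```python
from itertools import combinations


def co_association(labelings):
    """Fraction of input clusterings in which each pair of objects shares a cluster."""
    n = len(labelings[0])
    m = len(labelings)
    matrix = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += 1.0 / m
    return matrix


def consensus_clusters(labelings, threshold=0.5):
    """Merge objects whose co-association exceeds the (assumed) threshold.

    Groups are the connected components of the thresholded co-association
    graph, found with a small union-find structure.
    """
    n = len(labelings[0])
    co = co_association(labelings)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if co[i][j] > threshold:
            parent[find(i)] = find(j)

    # Relabel components as 0, 1, 2, ... in order of first appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n)]


# Three clusterings of the same five objects, e.g. one per feature set.
labelings = [
    [0, 0, 1, 1, 2],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
]
print(consensus_clusters(labelings))
```

Each input labeling could come from a different feature set (for example, syntactic versus semantic features); the consensus keeps only the groupings that a majority of the individual clusterings agree on.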
|
2 |
Classification of heterogeneous data based on data type impact of similarity
Ali, N., Neagu, Daniel, Trundle, Paul R., 11 August 2018
Real-world datasets are increasingly heterogeneous, showing a mixture of numerical, categorical and other feature types. The main challenge for mining heterogeneous datasets is how to deal with the heterogeneity present in the dataset records. Although some existing classifiers (such as decision trees) can handle heterogeneous data in specific circumstances, the performance of such models may still be improved, because heterogeneity involves specific adjustments to similarity measurements and calculations. Moreover, heterogeneous data is still treated inconsistently and in an ad hoc manner. In this paper, we study the problem of heterogeneous data classification: our purpose is to use heterogeneity as a positive feature of the data classification effort by consistently using the similarity between data objects. We address the heterogeneity issue by studying the impact of mixing data types in the calculation of data objects’ similarity. To reach our goal, we propose an algorithm that divides the initial data records, based on pairwise similarity, into classification subtasks, with the aim of increasing the quality of the data subsets and applying specialized classifier models to them. The performance of the proposed approach is evaluated on 10 publicly available heterogeneous data sets. The results show that the models achieve better performance for heterogeneous datasets when using the proposed similarity process.
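As an illustration of how similarity between mixed-type records may be computed, the sketch below uses a Gower-style coefficient: numeric features contribute a range-normalised closeness and categorical features contribute an exact-match score. This is a standard measure swapped in for illustration, not the similarity process proposed in the paper; the function name and the `kinds` and `ranges` parameters are assumptions.

```python
def gower_similarity(a, b, kinds, ranges):
    """Average per-feature similarity of two mixed-type records.

    kinds[k] is "num" or "cat"; ranges[k] is the observed value range of
    numeric feature k and is ignored for categorical features.
    """
    total = 0.0
    for x, y, kind, value_range in zip(a, b, kinds, ranges):
        if kind == "num":
            if value_range:
                total += 1.0 - abs(x - y) / value_range
            else:
                total += 1.0  # a constant feature cannot separate records
        else:
            total += 1.0 if x == y else 0.0
    return total / len(kinds)


# Hypothetical records: (age, colour, score).
kinds = ("num", "cat", "num")
ranges = (20, None, 4)
print(gower_similarity((30, "red", 1.0), (40, "red", 3.0), kinds, ranges))
```

A pairwise matrix of such scores could then drive the split of records into more homogeneous subsets, each handled by its own classifier.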
|
3 |
Efficient network based approaches for pattern recognition and knowledge discovery from large and heterogeneous datasets
Zhu, Cheng, 25 October 2013
No description available.
|
4 |
On the discovery of relevant structures in dynamic and heterogeneous data
Preti, Giulia, 22 October 2019
We are witnessing an explosion of available data coming from a huge number of sources and domains, which is leading to the creation of ever larger and richer datasets.
Understanding, processing, and extracting useful information from those datasets requires specialized algorithms that take into consideration both the dynamism and the heterogeneity of the data they contain.
Although several pattern mining techniques have been proposed in the literature, most of them fall short in providing interesting structures when the data can be interpreted differently from user to user, when it can change from time to time, and when it has different representations.
In this thesis, we propose novel approaches that go beyond the traditional pattern mining algorithms, and can effectively and efficiently discover relevant structures in dynamic and heterogeneous settings.
In particular, we address the task of pattern mining in multi-weighted graphs, pattern mining in dynamic graphs, and pattern mining in heterogeneous temporal databases.
In pattern mining in multi-weighted graphs, we consider the problem of mining patterns for a new category of graphs called multi-weighted graphs. In these graphs, nodes and edges can carry multiple weights that represent, for example, the preferences of different users or applications, and that are used to assess the relevance of the patterns.
We introduce a novel family of scoring functions that assign a score to each pattern based on both the weights of its appearances and their number, and that respect the anti-monotone property, pivotal for efficient implementations.
We then propose a centralized and a distributed algorithm that solve the problem both exactly and approximately. The approximate solution has better scalability in terms of the number of edge weighting functions, while achieving good accuracy in the results found.
An extensive experimental study shows the advantages and disadvantages of our strategies, and proves their effectiveness.
Then, in pattern mining in dynamic graphs, we focus on the particular task of discovering structures that are both well-connected and correlated over time, in graphs where nodes and edges can change over time.
These structures represent edges that are topologically close and exhibit a similar behavior of appearance and disappearance in the snapshots of the graph.
To this aim, we introduce two measures for computing the density of a subgraph whose edges change in time, and a measure to compute their correlation.
The density measures are able to detect subgraphs that are silent in some periods of time but highly connected in others, and thus they can detect events or anomalies that happened in the network.
The correlation measure can identify groups of edges that tend to co-appear, as well as edges that are characterized by similar levels of activity.
For both variants of density measure, we provide an effective solution that enumerates all the maximal subgraphs whose density and correlation exceed given minimum thresholds, but can also return a more compact subset of representative subgraphs that exhibit high levels of pairwise dissimilarity.
Furthermore, we propose an approximate algorithm that scales well with the size of the network, while achieving a high accuracy.
We evaluate our framework with an extensive set of experiments on both real and synthetic datasets, and compare its performance with the main competitor algorithm.
The results confirm the correctness of the exact solution, the high accuracy of the approximate one, and the superiority of our framework over the existing solutions.
In addition, they demonstrate the scalability of the framework and its applicability to networks of different nature.
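The abstract does not define its density and correlation measures, so the following is only a sketch under stated assumptions: correlation is taken to be the Pearson coefficient between the presence vectors of two edges across snapshots, and density is taken to be the average degree of a subgraph within one snapshot. Both function names and both definitions are stand-ins for illustration.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation of two equal-length presence vectors (0 if either is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    dev_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    dev_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    if dev_u == 0 or dev_v == 0:
        return 0.0
    return cov / (dev_u * dev_v)


def average_degree_density(edges, snapshot):
    """Average degree 2|E_t| / |V| of a subgraph, counting only edges present in the snapshot."""
    present = [edge for edge in edges if edge in snapshot]
    nodes = {node for edge in edges for node in edge}
    return 2 * len(present) / len(nodes) if nodes else 0.0


# Presence of two edges over five snapshots (1 = edge exists in that snapshot).
edge_ab = [1, 1, 0, 0, 1]
edge_bc = [1, 1, 0, 0, 1]
print(pearson(edge_ab, edge_bc))

triangle = [("a", "b"), ("b", "c"), ("a", "c")]
snapshot = {("a", "b"), ("b", "c")}
print(average_degree_density(triangle, snapshot))
```

Under these assumptions, a subgraph would be reported when its per-snapshot density and the pairwise correlation of its edges both exceed user-given thresholds.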
Finally, we address the problem of entity resolution in heterogeneous temporal databases, which are datasets that contain records that give different descriptions of the status of real-world entities at different periods of time, and thus are characterized by different sets of attributes that can change over time.
Detecting records that refer to the same entity in such a scenario requires a record similarity measure that takes into account the temporal information and that is aware of the absence of a common fixed schema between the records.
However, existing record matching approaches either ignore the dynamism in the attribute values of the records, or assume that all the records share the same set of attributes throughout time.
In this thesis, we propose a novel time-aware schema-agnostic similarity measure for temporal records to find pairs of matching records, and integrate it into an exact and an approximate algorithm.
The exact algorithm can find all the maximal groups of pairwise similar records in the database.
The approximate algorithm, on the other hand, can achieve higher scalability with the size of the dataset and the number of attributes, by relying on a technique called meta-blocking. This algorithm can find a good-quality approximation of the actual groups of similar records, by adopting an effective and efficient clustering algorithm.
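For readers unfamiliar with meta-blocking, the sketch below shows the general technique in its simplest form: schema-agnostic token blocking, followed by weighting each candidate pair by the number of blocks it shares and pruning pairs below the mean weight. It is a generic illustration, not the thesis's algorithm; in particular it ignores timestamps, and the function names and sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def token_blocks(records):
    """Schema-agnostic token blocking: one block per token, ignoring attribute names."""
    blocks = defaultdict(set)
    for record_id, attributes in records.items():
        for value in attributes.values():
            for token in str(value).lower().split():
                blocks[token].add(record_id)
    return blocks


def meta_block(records):
    """Keep candidate pairs whose shared-block count is at least the mean pair weight."""
    weights = defaultdict(int)
    for ids in token_blocks(records).values():
        for pair in combinations(sorted(ids), 2):
            weights[pair] += 1
    if not weights:
        return set()
    mean_weight = sum(weights.values()) / len(weights)
    return {pair for pair, weight in weights.items() if weight >= mean_weight}


# Hypothetical records with different attribute sets (no shared schema).
records = {
    "r1": {"name": "Ann Lee", "city": "Trento"},
    "r2": {"full_name": "Ann Lee", "employer": "Acme"},
    "r3": {"name": "Bob Ray", "location": "Trento"},
}
print(meta_block(records))
```

Only the surviving pairs would be passed to the more expensive record similarity measure, which is where the scalability gain comes from.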
|