41 |
An Ontology-Based Personalized Document Clustering Approach
Huang, Tse-hsiu 05 August 2004
With the proliferation of electronic commerce and knowledge economy environments, both persons and organizations increasingly have generated and consumed large amounts of online information, typically available as textual documents. To manage this rapid growth of the number of textual documents, people often use categories or folders to organize their documents. These document grouping behaviors are intentional acts that reflect the persons' (or organizations') preferences with regard to semantic coherency, or relevant groupings between subjects. For this thesis, we design and implement an ontology-based personalized document clustering (OnPEC) technique by incorporating both an individual user's partial clustering and an ontology into the document clustering process. Our use of a target user's partial clustering supports the personalization of document categorization, whereas our use of the ontology turns document clustering from a feature-based to a concept-based approach. In addition, we combine two hierarchical agglomerative clustering (HAC) approaches (i.e., pre-cluster-based and atomic-based) in our proposed OnPEC technique. Using the clustering effectiveness achieved by a traditional content-based document clustering technique and previously proposed feature-based document clustering (PEC) techniques as performance benchmarks, we find that use of partial clusters improves document clustering effectiveness, as measured by cluster precision and cluster recall. Moreover, for both OnPEC and PEC techniques, the clustering effectiveness of pre-cluster-based HAC methods greatly outperforms that of atomic-based HAC methods.
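The abstract does not define the two HAC variants in detail. One plausible reading, assumed here, is that the pre-cluster-based variant starts agglomeration from the user's partial clusters, while the atomic-based variant starts from single documents. The Python sketch below illustrates that reading only; the function names (`cosine`, `average_link`, `hac`), the `seeds` parameter, the cosine similarity, and the average-link merge criterion are all assumptions of this illustration, not details taken from the thesis.

```python
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def average_link(c1, c2, vectors):
    """Mean pairwise similarity between the documents of two clusters."""
    sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
    return sum(sims) / len(sims)


def hac(vectors, k, seeds=None):
    """Merge clusters greedily until only k remain.

    seeds: optional partial clustering (lists of document indices).
    With seeds, the seeded groups are the starting clusters (assumed
    "pre-cluster-based" reading); without seeds, every document starts
    alone (assumed "atomic-based" reading).
    """
    seeds = [list(s) for s in (seeds or [])]
    seeded = {i for s in seeds for i in s}
    clusters = seeds + [[i] for i in range(len(vectors)) if i not in seeded]
    while len(clusters) > k:
        # Pick the most similar pair of clusters; a < b always holds here.
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_link(clusters[p[0]], clusters[p[1]], vectors),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Four toy documents in a two-term feature space.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
atomic = hac(docs, k=2)                     # starts from single documents
pre_clustered = hac(docs, k=2, seeds=[[0, 2]])  # user filed docs 0 and 2 together
```

In the seeded run, documents 0 and 2 remain together even though their contents differ, which is the sense in which a partial clustering can personalize the result.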
|
42 |
Preference-Anchored Document Clustering Technique: Effects of Term Relationships and Thesaurus
Lin, Hao-hsiang 30 August 2006
According to the context theory of classification, the document-clustering behaviors of individuals not only involve the attributes (including contents) of documents but also depend on who is doing the task and in what context. Thus, effective document-clustering techniques need to take into account users' categorization preferences and generate document clusters from different preferential perspectives. The Preference-Anchored Document Clustering (PAC) technique was proposed for supporting preference-based document-clustering. Specifically, PAC takes a user's categorization preference into consideration and subsequently generates a set of document clusters from this specific preferential perspective. In this study, we investigate two research questions concerning the PAC technique. The first research question is "whether the incorporation of broader-term expansion (i.e., the proposed PAC2 technique in this study) will improve the effectiveness of preference-based document-clustering," whereas the second research question is "whether the use of a statistical-based thesaurus constructed from a larger document corpus will improve the effectiveness of preference-based document-clustering." Compared with the effectiveness achieved by PAC, our empirical results show that the proposed PAC2 technique neither improves nor deteriorates the effectiveness of preference-based document-clustering when the complete set of anchoring terms is used. However, when only a partial set of anchoring terms is provided, PAC2 does not improve, and can even degrade, the effectiveness of preference-based document-clustering. As to the second research question, our empirical results suggest that the use of a statistical-based thesaurus constructed from a larger document corpus (i.e., the ACM corpus consisting of 14,729 documents) does not improve the effectiveness of PAC and PAC2 for preference-based document-clustering.
|
43 |
Personalized Document Clustering: Technique Development and Empirical Evaluation
Wu, Chia-Chen 14 August 2003
With the proliferation of electronic commerce and knowledge economy environments, both organizations and individuals generate and consume a large amount of online information, typically available as textual documents. To manage the ever-increasing volume of documents, organizations and individuals typically organize their documents into categories to facilitate document management and subsequent information access and browsing. However, document grouping behaviors are intentional acts, reflecting individuals' (or organizations') preferential perspective on semantic coherency or relevant groupings between subjects. Thus, an effective document clustering technique needs to address this preferential perspective on document grouping and support personalized document clustering. In this thesis, we designed and implemented a personalized document clustering approach by incorporating an individual's partial clustering into the document clustering process. Combining two document representation methods (i.e., feature refinement and feature weighting) with two clustering processes (i.e., pre-cluster-based and atomic-based), four personalized document clustering techniques are proposed. Using the clustering effectiveness achieved by a traditional content-based document clustering technique as the performance benchmark, our evaluation results suggest that the use of partial clusters improves document clustering effectiveness. Moreover, the pre-cluster-based technique outperforms the atomic-based one, and the feature weighting method for document representation achieves higher clustering effectiveness than the feature refinement method does.
|
44 |
Spatial association between the locations of roots and water flow paths in highly structured soil
Gardiner, Nathan Thomas 17 February 2005
Considerable evidence exists that the majority of low tension water flow through highly structured clayey soil occurs in a small fraction of total pore space and that the flow paths converge as depth increases. In structured clayey soils, water tends to flow in locations where macroporosity is high, and roots tend to favor this condition as well. Water reduces the strength and mechanical impedance of the soil. Mechanical impedance of clayey soils tends to be extremely high when the soils are dry, so one might expect a positive spatial correlation between the location of roots and the location of water flow paths in highly structured clayey soils. Understanding the location of roots in soil relative to the location of water flow paths is important in understanding how plants obtain nutrients and water for growth, and it would also be of considerable importance in phytoremediation research and research into the prevention of groundwater contamination. This experiment was designed to map the locations of flow paths and roots and then measure the spatial association of the two.
A pasture on Ships clay along the Brazos River was chosen as the research site. Three plots were irrigated with an Erioglaucine dye solution used to stain flow paths.
After irrigation the soil was excavated to a depth of 25 cm. On the resulting horizontal plane the dye stain pattern was mapped using photography. The locations of roots were mapped on clear plastic sheets. During mapping the roots were categorized by size. The mapping procedure was repeated at depths of 45 cm and 75 cm for all plots. The root maps were overlaid on the photographic images and analyzed for a spatial association. There was no evidence the smallest (> 1 mm diameter) roots were not randomly distributed. The results did show that the larger roots were not randomly distributed, and evidence pointed to a clustering of roots in and around the dye stained flow paths. However, the data fell short of establishing a spatial association. The lack of more conclusive data was likely the result of inaccuracies in the mapping.
|
45 |
Fuzzy Clustering Analysis
Karim, Ehsanul, Madani, Sri Phani Venkata Siva Krishna, Yun, Feng January 2010
The objective of this thesis is to discuss the use of fuzzy logic in pattern recognition. There are different fuzzy approaches to recognizing the pattern and structure in data, and the choice of approach depends entirely on the type of data. Pattern recognition, as we know, involves various mathematical transforms that render the pattern or structure with desired properties, such as the identification of a probabilistic model that explains the process generating the observed data. With this basic school of thought, we turn to fuzzy logic for the process of pattern recognition. Fuzzy logic, like any other mathematical field, has its own set of principles, types, representations, and usage. Hence our work focuses primarily on exploring the ways in which fuzzy logic is applied to pattern recognition and on understanding the results; these are covered in the topics that follow. Pattern recognition is the collection of all approaches that understand, represent, and process the data as segments and features by using fuzzy sets. The representation and processing depend on the selected fuzzy technique and on the problem to be solved. In the broadest sense, pattern recognition is any form of information processing for which both the input and output are different kinds of data: medical records, aerial photos, market trends, library catalogs, galactic positions, fingerprints, psychological profiles, cash flows, chemical constituents, demographic features, stock options, and military decisions. Most pattern recognition techniques involve treating the data as a variable and applying standard processing techniques to it.
|
46 |
Bayesian inference for random partitions
Sundar, Radhika 05 December 2013
I consider statistical inference for clustering, that is, the arrangement of experimental units in homogeneous groups. In particular, I discuss clustering for multivariate binary outcomes. Binary data is not very informative, making it less meaningful to proceed with traditional (deterministic) clustering methods. Meaningful inference needs to account for and report the considerable uncertainty associated with any reported cluster arrangement. I review and implement an approach that was proposed in the recent literature.
|
47 |
Functional Analysis of Real World Truck Fuel Consumption Data
Vogetseder, Georg January 2008
This thesis covers the analysis of sparse and irregular fuel consumption data of long distance haulage articulated trucks. It is shown that this kind of data is hard to analyse with multivariate as well as with functional methods. To be able to analyse the data, Principal Components Analysis through Conditional Expectation (PACE) is used, which enables the use of observations from many trucks to compensate for the sparsity of observations in order to get continuous results. The principal component scores generated by PACE can then be used to get rough estimates of the trajectories for single trucks as well as to detect outliers. The data-centric approach of PACE is very useful for enabling functional analysis of sparse and irregular data. Functional analysis is desirable for this data because it sidesteps feature extraction and enables a more natural view of the data.
|
48 |
Clustering in the Presence of Noise
Haghtalab, Nika 08 August 2013
Clustering, which is partitioning data into groups of similar objects, has a wide range of applications. In many cases unstructured data makes up a significant part of the input. Attempting to cluster such part of the data, which can be referred to as noise, can disturb the clustering on the remaining domain points. Despite the practical need for a framework of clustering that allows a portion of the data to remain unclustered, little research has been done so far in that direction. In this thesis, we take a step towards addressing the issue of clustering in the presence of noise in two parts. First, we develop a platform for clustering that has a cluster devoted to the "noise" points. Second, we examine the problem of "robustness" of clustering algorithms to the addition of noise.
In the first part, we develop a formal framework for clustering that has a designated noise cluster. We formalize intuitively desirable input-output properties of clustering algorithms that have a noise cluster. We review some previously known algorithms, introduce new algorithms for this setting, and examine them with respect to the introduced properties.
In the second part, we address the problem of robustness of clustering algorithms to the addition of unstructured data. We propose a simple and efficient method to turn any centroid-based clustering algorithm into a noise-robust one that has a noise cluster. We discuss several rigorous measures of robustness and prove performance guarantees for our method with respect to these measures under the assumption that the noise-free data satisfies some niceness properties and the noise satisfies some mildness properties. We also prove that more straightforward ways of adding robustness to clustering algorithms fail to achieve the above-mentioned guarantees.
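The abstract does not spell out the proposed method, so the following Python sketch only illustrates the general idea of a centroid-based algorithm with a designated noise cluster. It is an assumption of this illustration, not the thesis's construction, and it carries none of the guarantees mentioned above. A Lloyd-style k-means loop sends every point farther than a hypothetical threshold `delta` from all centers to a noise label, and those points are ignored when centers are updated. The names `NOISE`, `assign`, and `truncated_kmeans` are invented for this example.

```python
import math

NOISE = -1  # label for the designated noise cluster (a convention of this sketch)


def assign(points, centers, delta):
    """Label each point with its nearest center, or NOISE if none is within delta."""
    labels = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        best = min(range(len(centers)), key=dists.__getitem__)
        labels.append(best if dists[best] <= delta else NOISE)
    return labels


def truncated_kmeans(points, centers, delta, iters=20):
    """Lloyd-style iterations in which noise points do not pull on any center."""
    centers = [tuple(c) for c in centers]
    for _ in range(iters):
        labels = assign(points, centers, delta)
        new_centers = []
        for j, c in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                # Coordinate-wise mean of the points currently assigned to center j.
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(c)  # keep an empty cluster's center unchanged
        if new_centers == centers:
            break
        centers = new_centers
    return assign(points, centers, delta), centers


# Two tight groups plus one far-away point that should not distort either center.
data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (50.0, 50.0)]
labels, centers = truncated_kmeans(data, centers=[(0.0, 0.0), (10.0, 10.0)], delta=3.0)
```

With this toy input the far-away point receives the noise label and both centers settle on the means of their own groups. How to choose `delta`, and whether such a rule is robust in the sense the thesis studies, is left open here.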
|
49 |
Clustering Microarray Data Via a Bayesian Infinite Mixture Model
Givari, Dena 04 January 2013
Clustering microarray data is a helpful way of identifying genes which are biologically related. Unfortunately, when attempting to cluster microarray data, certain issues must be considered including: the uncertainty in the number of true clusters; the expression of a given gene is often affected by the expression of other genes; and microarray data is usually high dimensional. This thesis outlines a Bayesian infinite Gaussian mixture model which addresses the issues outlined above by: not requiring the researcher to specify the number of clusters expected, applying a non-diagonal covariance structure, and using mixtures of factor analyzers and extensions thereof to structure the covariance matrix such that it is based on a few latent variables. This approach will be illustrated on real and simulated data.
|
50 |
A Data Cleaning Framework for Trajectory Clustering
Idrissov, Agzam Y. Unknown Date
No description available.
|