61 |
Robust methods for locating multiple dense regions in complex datasets /Gupta, Gunjan Kumar. January 1900 (has links) (PDF)
Thesis (Ph. D.)--University of Texas at Austin, 2006. / Vita. Includes bibliographical references.
|
62 |
Kullback-Leibler estimation of probability measures with an application to clustering /Sheehy, Anne. January 1987 (has links)
Thesis (Ph. D.)--University of Washington, 1987. / Vita. Bibliography: leaves [192]-194.
|
63 |
GEE with large cluster sizes : high-dimensional working correlation models /Chung, Hyoju, January 2006 (has links)
Thesis (Ph. D.)--University of Washington, 2006. / Vita. Includes bibliographical references (p. 79-82).
|
64 |
Optimal clustering : genetic constrained K-means and linear programming algorithms /Zhao, Jianmin, January 2006 (has links)
Thesis (Ph. D.)--Virginia Commonwealth University, 2006. / Prepared for: Dept. of Biostatistics. Bibliography: leaves 99-104. Also available online via the Internet.
|
65 |
Adaptive Double Self-Organizing Map for Clustering Gene Expression Data /Wang, Dali January 2003 (has links) (PDF)
No description available.
|
66 |
Cluster detection and analysis with geo-spatial datasets using a hybrid statistical and neural networks hierarchical approach /Majeed, Salar Mustafa January 2010 (has links)
Spatial datasets contain information relating to the locations of incidents of phenomena such as crime and disease. Areas that contain a higher than expected incidence of the phenomena, given background population and census datasets, are of particular interest. By analysing the locations of potential influences, it may be possible to establish where a cause and effect relationship is present in the observed process. Cluster detection techniques can be applied to such datasets in order to reveal information relating to the spatial distribution of the cases. Research in these areas has mainly concentrated on either computational or statistical aspects of cluster detection. Each clustering algorithm has its own strengths and weaknesses. The main weaknesses that make these algorithms unreliable are estimating the number of clusters, testing the number of components, selecting initial seeds (centroids), running time and memory requirements. Consequently, a new cluster detection methodology has been developed in this thesis based on knowledge drawn from both statistical and computing domains. This methodology is based on a hybrid of statistical methods using properties of probability rather than distance to associate data with clusters. No previous knowledge of the dataset is required and the number of clusters is not predetermined. It performs efficiently in terms of memory requirements, running time and cluster quality. The algorithm for determining both the centre of clusters and the existence of the clusters themselves was applied and tested on simulated and real datasets. The results obtained from the identification of hotspots were compared with the results of other available algorithms such as CLAP (Cluster Location Analysis Procedure), SaTScan and GAM (Geographical Analysis Machine). The outputs are very similar.
The GIS presented in this thesis encompasses the SCS algorithm, statistics and neural networks for developing a hybrid predictive crime model; mapping and visualizing crime data and the corresponding population in the study region; and visualizing the locations of the obtained clusters and the burglary incidence concentration ‘hotspots’ specified by the SCS clustering algorithm. Naturally, the quality of the results is subject to the accuracy of the data used. GIS is used in this thesis for developing a methodology for modelling data containing multiple functions. The census data used throughout this construction provided a useful source of geo-demographic information. The obtained datasets were used for predictive crime modelling. This thesis has benefited from several existing methodologies to develop a hybrid modelling approach. The methodology was applied to real data on burglary incidence distribution in the study region. Relevant principles of statistics, Geographical Information Systems, neural networks and the SCS algorithm were utilized for the analysis of observed data. Regression analysis was used for building a predictive crime model and combined with neural networks with the aim of developing a new hierarchical neural network approach to generate a more reliable prediction. The promising results were compared with the non-hierarchical back-propagation (BP) neural network and multiple regression analysis. The average percentage accuracy achieved by the new methodology at the testing stage increased by 13% compared with the non-hierarchical BP performance. In general, the analysis reveals a number of predictors that increase the risk of burglary in the study region. Specifically, these are living in a ‘one person’ or ‘lone parent’ household, or in a household whose occupants are unemployed or in elementary or intermediate occupations. As for household space, the results indicate that the risk of burglary increases for households living in shared houses.
|
67 |
Determining geographical causal relationships through the development of spatial cluster detection and feature selection techniques /Jarvis, Paul S. January 2006 (has links)
Spatial datasets contain information relating to the locations of incidents of a disease or other phenomena. Appropriate analysis of such datasets can reveal information about the distribution of cases of the phenomena. Areas that contain higher than expected incidence of the phenomena, given the background population, are of particular interest. Such clusters of cases may be affected by external factors. By analysing the locations of potential influences, it may be possible to establish whether a cause and effect relationship is present within the dataset. This thesis describes research that has led to the development and application of cluster detection and feature selection techniques in order to determine whether causal relationships are present within generic spatial datasets. The techniques are described and demonstrated, and their effectiveness established by testing them using synthetic datasets. The techniques are then applied to a dataset supplied by the Welsh Leukaemia Registry that details all cases of leukaemia diagnosed in Wales between 1990 and 2000. Cluster detection techniques can be used to provide information about case distribution. A novel technique, CLAP, has been developed that scans the study region and identifies the statistical significance of the levels of incidence in specific areas. Feature selection techniques can be used to identify the extent to which a selection of inputs impact upon a given output. Results from CLAP are combined with details of the locations of potential causal factors, in the form of a numerical dataset that can be analysed using feature selection techniques. Established techniques and a newly developed technique are used for the analysis. Results from such analysis allow conclusions to be drawn as to whether geographical causal relationships are apparent.
|
68 |
Mass-resolved resonant two-photon ionisation spectroscopy of jet-cooled Cu2 and Ag2 /Butler, Andrew Michael January 1990 (has links)
Clusters of the transition metals were generated by laser vaporisation of a sample of the metal into the throat of a pulsed supersonic expansion. This allowed clusters with internal temperatures as low as 5 K to be routinely prepared. Mass-selective detection was accomplished by multi-photon ionisation of the clusters within the ion source of a time-of-flight mass spectrometer. Use of a tunable laser to carry out electronic excitation, prior to ionisation, allowed mass-resolved resonant two-photon ionisation spectra of the clusters to be recorded. Real time control of the experiment and automated data logging was achieved using software developed to run on an IBM PC-AT microcomputer. This allowed multiple ion signals to be recorded simultaneously whilst carrying out R2PI or time-resolved studies on the metal cluster species in the beam. Resonant two-photon ionisation spectroscopic studies were carried out on the (0-0) and (1-0) bands of the J ← X system of Cu2 and the A ← X system of Ag2. The 0.04 cm-1 bandwidth of the tunable dye laser used allowed rotationally resolved spectra to be recorded. The spectra recorded for these systems showed them both to be ΔΛ = 0 (or ΔΩ = 0) transitions. The J state of Cu2 was assigned to the 1Σu+ state derived from the ?P + atomic limit at Dg(X) + 45821 cm-1. Rotational analysis of the spectra yielded the following constants for the Cu2 isotopomer: Be = 0.1166(1) cm-1, αe = 0.0021(1) cm-1. This gave Re = 2.138(1) Å for the J state, shorter than the ground state bond length. Accordingly the transition was assigned to 3dπg → 4pπu, to give the above assignment. The rotational constants obtained, for the 107Ag2 isotopomer, from analysis of the spectra of the A ← X system of Ag2, were: Be = 0.0447(3) cm-1, αe = 0.0004(2) cm-1, and Bq = 0.0490(18) cm-1. These gave bond lengths of Re = 2.649(9) Å and Rq = 2.530(46) Å.
The observed ΔΩ = 0 transition agreed with the previous assignment of the A state as 0u+ arising from the 5sσg → 5sσu promotion.
|
69 |
Extending linear grouping analysis and robust estimators for very large data sets /Harrington, Justin 11 1900 (has links)
Cluster analysis is the study of how to partition data into homogeneous subsets so that the partitioned data share some common characteristic. In one to three dimensions, the human eye can distinguish well between clusters of data if clearly separated. However, when there are more than three dimensions and/or the data is not clearly separated, an algorithm is required which needs a metric of similarity that quantitatively measures the characteristic of interest.
Linear Grouping Analysis (LGA, Van Aelst et al. 2006) is an algorithm for clustering data around hyperplanes, and is most appropriate when: 1) the variables are related/correlated, which results in clusters with an approximately linear structure; and
2) it is not natural to assume that one variable is a “response”, and the remainder the “explanatories”.
LGA measures the compactness within each cluster via the sum of squared orthogonal distances to hyperplanes formed from the data.
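The compactness measure described above can be illustrated with a minimal sketch. It assumes the two-dimensional special case, where each hyperplane is a line fitted by orthogonal regression, and it evaluates the objective for a fixed assignment of points to clusters. The function names are hypothetical and are not taken from the dissertation or from any LGA software.

```python
import math


def orthogonal_line_fit(points):
    """Fit a line to 2-D points by orthogonal regression.

    Returns the centroid and a unit normal vector of the fitted line.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    # Angle of the principal axis (direction of largest spread).
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    # The normal is perpendicular to the principal axis.
    normal = (-math.sin(theta), math.cos(theta))
    return (mx, my), normal


def squared_orthogonal_distance(point, centroid, normal):
    """Squared perpendicular distance from a point to the fitted line."""
    d = (point[0] - centroid[0]) * normal[0] + (point[1] - centroid[1]) * normal[1]
    return d * d


def lga_objective(clusters):
    """Sum of squared orthogonal distances over all clusters.

    `clusters` is a list of clusters, each a list of (x, y) tuples.
    """
    total = 0.0
    for pts in clusters:
        centroid, normal = orthogonal_line_fit(pts)
        total += sum(squared_orthogonal_distance(p, centroid, normal) for p in pts)
    return total
```

A clustering whose groups each lie exactly on a line gives an objective of zero; the more the points scatter around their fitted lines, the larger the value. In higher dimensions the normal vector would instead come from the eigenvector of the cluster covariance matrix with the smallest eigenvalue.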
In this dissertation, we extend the scope of problems to which LGA can be applied. The first extension relates to the linearity requirement inherent within LGA, and proposes a new method of non-linearly transforming the data into a Feature Space, using the Kernel Trick, such that in this space the data might then form linear clusters. A possible side effect of this transformation is that the dimension of the transformed space is significantly larger than the number of observations in a given cluster, which causes problems with orthogonal regression. Therefore, we also introduce a new method for calculating the distance of an observation to a cluster when its covariance matrix is rank deficient.
The second extension concerns the combinatorial problem of optimizing an LGA objective function, and adapts an existing algorithm, called BIRCH, for use in providing fast, approximate solutions, particularly for the case when data does not fit in memory. We also provide solutions based on BIRCH for two other challenging optimization problems in the field of robust statistics, and demonstrate, via simulation study as well as application on actual data sets, that the BIRCH solution compares favourably to the existing state-of-the-art alternatives, and in many cases finds a better solution. / Science, Faculty of / Statistics, Department of / Graduate
|
70 |
Advances in categorical data clustering /Zhang, Yiqun 29 August 2019 (has links)
Categorical data are common in various research areas, and clustering is a prevalent technique used to analyse them. However, two challenging problems are encountered in categorical data clustering analysis. The first is that most categorical data distance metrics were actually proposed for nominal data (i.e., a categorical data set that comprises only nominal attributes), ignoring the fact that ordinal attributes are also common in various categorical data sets. As a result, these nominal data distance metrics cannot account for the order information of ordinal attributes and may thus inappropriately measure the distances for ordinal data (i.e., a categorical data set that comprises only ordinal attributes) and mixed categorical data (i.e., a categorical data set that comprises both ordinal and nominal attributes). The second problem is that most hierarchical clustering approaches were actually designed for numerical data and have very high computation costs; that is, with time complexity O(N^2) for a data set with N data objects. These issues have presented huge obstacles to the clustering analysis of categorical data. To address the ordinal data distance measurement problem, we study the characteristics of ordered possible values (also called 'categories' interchangeably in this thesis) of ordinal attributes and propose a novel ordinal data distance metric, which we call the Entropy-Based Distance Metric (EBDM), to quantify the distances between ordinal categories. The EBDM adopts cumulative entropy as a measure to indicate the amount of information in the ordinal categories and simulates the thinking process of changing one's mind between two ordered choices to quantify the distances according to the amount of information in the ordinal categories. The order relationship and the statistical information of the ordinal categories are both considered by the EBDM for more appropriate distance measurement.
Experimental results illustrate the superiority of the proposed EBDM in ordinal data clustering. In addition to designing an ordinal data distance metric, we further propose a unified categorical data distance metric that is suitable for distance measurement of all three types of categorical data (i.e., ordinal data, nominal data, and mixed categorical data). The extended version uniformly defines distances and attribute weights for both ordinal and nominal attributes, by which the distances measured for the two types of attributes of a mixed categorical data set can be directly combined to obtain the overall distances between data objects with no information loss. Extensive experiments on all three types of categorical data sets demonstrate the effectiveness of the unified distance metric in clustering analysis of categorical data. To address the hierarchical clustering problem of large-scale categorical data, we propose a fast hierarchical clustering framework called the Growing Multi-layer Topology Training (GMTT). The most significant merit of this framework is its ability to reduce the time complexity of most existing hierarchical clustering frameworks (i.e., O(N^2)) to O(N^1.5) without sacrificing the quality (i.e., clustering accuracy and hierarchical details) of the constructed hierarchy. According to our design, the GMTT framework is applicable to categorical data clustering simply by adopting a categorical data distance metric. To make the GMTT framework suitable for the processing of streaming categorical data, we also provide an incremental version of GMTT that can dynamically adopt new inputs into the hierarchy via local updating. Theoretical analysis proves that the GMTT frameworks have time complexity O(N^1.5). Extensive experiments show the efficacy of the GMTT frameworks and demonstrate that they achieve more competitive categorical data clustering performance by adopting the proposed unified distance metric.
|