  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
201

Mining-Based Category Evolution for Text Databases

Dong, Yuan-Xin 18 July 2000 (has links)
As text repositories grow in number and size and global connectivity improves, the amount of online information in the form of free-format text is growing extremely rapidly. In many large organizations, huge volumes of textual information are created and maintained, and there is a pressing need to support efficient and effective information retrieval, filtering, and management. Text categorization is essential to the efficient management and retrieval of documents. Past research on text categorization mainly focused on developing or adopting statistical classification or inductive learning methods for automatically discovering text categorization patterns from a training set of manually categorized documents. However, as documents accumulate, the pre-defined categories may not capture the characteristics of the documents. In this study, we proposed a mining-based category evolution (MiCE) technique to adjust the categories based on the existing categories and their associated documents. According to the empirical evaluation results, the proposed technique, MiCE, was more effective than the discovery-based category management approach, insensitive to the quality of original categories, and capable of improving classification accuracy.
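As a small, concrete illustration of statistical text categorization in general (this is not the MiCE technique; the categories and documents are invented for the example), a centroid-based classifier over term-frequency vectors might look like this:

```python
from collections import Counter
import math

def tf_vector(text):
    # Bag-of-words term-frequency vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(docs):
    # Sum of term frequencies over a category's training documents.
    c = Counter()
    for d in docs:
        c.update(tf_vector(d))
    return c

def categorize(doc, categories):
    # Assign the document to the category with the most similar centroid.
    vec = tf_vector(doc)
    return max(categories, key=lambda name: cosine(vec, categories[name]))

categories = {
    "sports": centroid(["the team won the match", "players scored goals"]),
    "finance": centroid(["stocks fell on the market", "investors traded shares"]),
}
label = categorize("the market rallied as investors bought stocks", categories)
```

A category-evolution scheme like MiCE would then adjust the category set itself as new documents drift away from these centroids.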
202

GA-Based fuzzy clustering applied to irregular

Lai, Fun-Zhu 10 February 2003 (has links)
Building a rule-based classification system from a training data set is an important research topic in data mining, knowledge discovery and expert systems. Recently, the GA-based fuzzy approach has been shown to be an effective way to design an efficient evolutionary fuzzy system. This thesis proposes a three-layer genetic algorithm, optionally combined with simulated annealing, for selecting a small number of fuzzy if-then rules to build a compact fuzzy classification system. The rule selection problem has three objectives: (1) maximize the number of correctly classified patterns, (2) minimize the number of fuzzy if-then rules, and (3) minimize the number of required features. Genetic algorithms are applied to solve this problem: a set of fuzzy if-then rules is coded as a binary string and treated as an individual, and the fitness of each individual is determined by the three objectives of the combinatorial optimization problem. Simulated annealing (SA) is optionally combined with the three-layer genetic algorithm to effectively select the layer control genes. The performance of the proposed method on training and test data is examined by computer simulations on the iris and spiral data sets and compared with existing approaches. It is shown empirically that the proposed method outperforms existing methods in the design of optimal fuzzy systems.
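A stripped-down sketch of the three-objective rule selection as a weighted-sum GA fitness (the rule pool, weights and GA parameters are all invented; the thesis's three-layer encoding and simulated-annealing step are omitted):

```python
import random

random.seed(0)

# Hypothetical rule pool: each rule covers some training patterns
# and consumes some features (both sets invented for the sketch).
RULES = [
    ({0, 1, 2}, {0}),
    ({2, 3}, {1}),
    ({4, 5}, {0, 1}),
    ({0, 1, 2, 3}, {2}),
    ({5}, {2}),
]

def fitness(chromosome, w_acc=10.0, w_rules=1.0, w_feats=1.0):
    # Weighted sum of the three objectives: reward coverage, penalize
    # the number of rules and the number of distinct features used.
    selected = [r for bit, r in zip(chromosome, RULES) if bit]
    covered = set().union(*(r[0] for r in selected)) if selected else set()
    features = set().union(*(r[1] for r in selected)) if selected else set()
    return w_acc * len(covered) - w_rules * len(selected) - w_feats * len(features)

def evolve(pop_size=20, generations=30, mut=0.1):
    pop = [[random.randint(0, 1) for _ in RULES] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # elitist truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(RULES))
            child = a[:cut] + b[cut:]           # one-point crossover
            children.append([1 - g if random.random() < mut else g for g in child])
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
```

Raising `w_rules` or `w_feats` trades accuracy for compactness, which is exactly the tension among the three objectives.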
203

Bootstrapping in a high dimensional but very low sample size problem

Song, Juhee 16 August 2006 (has links)
High Dimension, Low Sample Size (HDLSS) problems have received much attention recently in many areas of science. Analysis of microarray experiments is one such area. Numerous studies are ongoing to investigate the behavior of genes by measuring the abundance of mRNA (messenger RiboNucleic Acid), i.e., gene expression. The HDLSS data investigated in this dissertation consist of a large number of data sets, each of which has only a few observations. We assume a statistical model in which measurements from the same subject have the same expected value and variance. All subjects have the same distribution up to location and scale. Information from all subjects is shared in estimating this common distribution. Our interest is in testing the hypothesis that the mean of measurements from a given subject is 0. Commonly used tests of this hypothesis, the t-test, sign test and traditional bootstrapping, do not necessarily provide reliable results, since there are only a few observations for each data set. We motivate a mixture model having C clusters and 3C parameters to overcome the small sample size problem. Standardized data are pooled after assigning each data set to one of the mixture components. To get reasonable initial parameter estimates when density estimation methods are applied, we apply clustering methods including agglomerative and K-means. The Bayesian Information Criterion (BIC) and a new criterion, WMCV (Weighted Mean of within Cluster Variance estimates), are used to choose an optimal number of clusters. Density estimation methods including a maximum likelihood unimodal density estimator and kernel density estimation are used to estimate the unknown density. Once the density is estimated, a bootstrapping algorithm that selects samples from the estimated density is used to approximate the distribution of test statistics.
The t-statistic and an empirical likelihood ratio statistic are used, since their distributions are completely determined by the distribution common to all subjects. A method to control the false discovery rate is used to perform simultaneous tests on all small data sets. Simulated data sets and a set of cDNA (complementary DeoxyriboNucleic Acid) microarray experiment data are analyzed by the proposed methods.
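The final step, bootstrapping from an estimated density rather than from the raw observations, can be sketched as a smoothed bootstrap (the data, bandwidth and sample sizes are invented; the dissertation's mixture model and unimodal density estimator are not reproduced):

```python
import math
import random
import statistics

random.seed(1)

def t_stat(xs):
    # One-sample t-statistic for the hypothesis mean = 0.
    n = len(xs)
    return statistics.mean(xs) / (statistics.stdev(xs) / math.sqrt(n))

def kde_sample(pooled, n, bandwidth=0.3):
    # Smoothed bootstrap: resampling pooled values and adding Gaussian
    # noise is equivalent to sampling from a Gaussian kernel density estimate.
    return [random.choice(pooled) + random.gauss(0.0, bandwidth) for _ in range(n)]

def bootstrap_pvalue(data, pooled_null, n_boot=2000):
    observed = abs(t_stat(data))
    hits = sum(1 for _ in range(n_boot)
               if abs(t_stat(kde_sample(pooled_null, len(data)))) >= observed)
    return hits / n_boot

# Pooled standardized values standing in for the common null distribution.
pooled = [random.gauss(0.0, 1.0) for _ in range(500)]
small_set = [2.1, 1.8, 2.4]   # one subject's few observations (invented)
p = bootstrap_pvalue(small_set, pooled)
```

Because every subject shares the pooled density, the bootstrap distribution of the statistic is far better estimated than it could be from any single three-observation data set.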
204

The GDense Algorithm for Clustering Data Streams with High Quality

Lin, Shu-Yi 25 June 2009 (has links)
In recent years, mining data streams has been widely studied. A data stream is a sequence of dynamic, continuous, unbounded and real-time data items with a very high data rate that can only be read once. In data mining, clustering is one of the useful techniques for discovering interesting patterns in the underlying data objects. The clustering problem can be defined formally as follows: given n data points in a d-dimensional metric space, partition the data points into k clusters such that the data points within a cluster are more similar to each other than to data points in different clusters. In the data streams environment, the difficulties of clustering include storage overhead, low clustering quality and low updating efficiency. Current clustering algorithms can be broadly classified into four categories: partition-based, hierarchical, density-based and grid-based approaches. The advantage of a grid-based algorithm is that it can handle large databases. In a density-based approach, the insertion or deletion of data affects the current clustering only in the neighborhood of that data. Combining the advantages of the grid-based and density-based approaches, the CDS-Tree algorithm was proposed. Although it can handle large databases, its clustering quality is restricted by the grid partition and the threshold of a dense cell. Therefore, in this thesis, we present a new high-quality clustering algorithm for data streams, GDense. The GDense algorithm achieves high quality through two kinds of partition, cells and quadcells, and two kinds of threshold, the dense-cell threshold and one quarter of it for quadcells. Moreover, in the data insertion part of our GDense algorithm, the 7 cases take 3 factors about the cell and the quadcell into consideration; in the deletion part, the 10 cases take 5 factors about the cell into consideration.
From our simulation results, under every condition tested (the number of data points, the number of cells, the size of the sliding window, and the threshold of a dense cell), the clustering purity of our GDense algorithm is always higher than that of the CDS-Tree algorithm. Moreover, we compare the purity of our GDense algorithm and the CDS-Tree algorithm in the presence of outliers. No matter whether the number of outliers is large or small, the clustering purity of our GDense algorithm remains higher than that of CDS-Tree, improving the clustering purity by about 20% compared to the CDS-Tree algorithm.
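The cell machinery shared by grid- and density-based approaches can be sketched as follows (cell size, dense threshold and points are invented; GDense's quadcells, sliding window and insertion/deletion cases are not modeled):

```python
from collections import defaultdict, deque

def grid_cluster(points, cell_size=1.0, dense_threshold=3):
    # Map each point to its grid cell.
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    # Keep only dense cells (at least dense_threshold points).
    dense = {c for c, pts in cells.items() if len(pts) >= dense_threshold}
    # Merge neighbouring dense cells (8-connectivity) into clusters via BFS.
    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        comp, queue = [], deque([start])
        seen.add(start)
        while queue:
            cx, cy = queue.popleft()
            comp.append((cx, cy))
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        clusters.append(comp)
    return clusters

pts = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.2),   # dense cell (0, 0)
       (1.1, 0.2), (1.3, 0.4), (1.2, 0.1),   # dense cell (1, 0)
       (5.1, 5.1)]                            # sparse cell, treated as noise
clusters = grid_cluster(pts)
```

In a streaming setting only the counts per cell need to be maintained, which is why insertions and deletions touch just the affected cell and its neighbours.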
205

CUDIA: a probabilistic cross-level imputation framework using individual auxiliary information

Park, Yubin 17 February 2012 (has links)
In healthcare-related studies, individual patient or hospital data are often not publicly available due to privacy restrictions, legal issues or reporting norms. However, such measures may be provided at a higher or more aggregated level, such as state-level or county-level summaries, or averages over health zones such as Hospital Referral Regions (HRR) or Hospital Service Areas (HSA). Such levels constitute partitions over the underlying individual-level data, which may not match the groupings that would have been obtained if one clustered the data based on individual-level attributes. Moreover, treating aggregated values as representatives for the individuals can result in the ecological fallacy. How can one run data mining procedures on such data, where different variables are available at different levels of aggregation or granularity? In this thesis, we seek a better utilization of variably aggregated datasets, which are possibly assembled from different sources. We propose a novel "cross-level" imputation technique that models the generative process of such datasets using a Bayesian directed graphical model. The imputation is based on the underlying data distribution and is shown to be unbiased. This imputation can be further utilized in subsequent predictive modeling, yielding improved accuracies. Experimental results using a simulated dataset and the Behavioral Risk Factor Surveillance System (BRFSS) dataset are provided to illustrate the generality and capabilities of the proposed framework.
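A toy version of the cross-level idea, where auxiliary individual-level values supply the within-group variation while published group means pin down the level, might look as follows; this simple mean-matching shift is a stand-in for illustration, not CUDIA's Bayesian graphical model:

```python
import statistics

def cross_level_impute(individuals, group_means):
    """individuals: list of (group_id, auxiliary_value) pairs.
    group_means: target variable known only as a per-group average."""
    imputed = {}
    groups = {}
    for i, (g, aux) in enumerate(individuals):
        groups.setdefault(g, []).append(i)
        imputed[i] = aux          # naive first guess from the auxiliary variable
    # Shift each group so the imputed values reproduce the published mean.
    for g, idxs in groups.items():
        shift = group_means[g] - statistics.mean(imputed[i] for i in idxs)
        for i in idxs:
            imputed[i] += shift
    return imputed

# Two groups whose outcome is only published as a group mean (invented numbers).
people = [(0, 1.0), (0, 3.0), (1, 2.0), (1, 6.0)]
values = cross_level_impute(people, {0: 10.0, 1: 20.0})
```

Unlike simply assigning every individual the group mean (the ecological-fallacy trap the abstract warns about), the imputed values preserve individual-level variation while still aggregating correctly.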
206

Minimally supervised induction of morphology through bitexts

Moon, Taesun, Ph. D. 17 January 2013 (has links)
A knowledge of morphology can be useful for many natural language processing systems, so much effort has been expended on developing accurate computational tools for morphology that lemmatize, segment and generate new forms. The most powerful and accurate of these have been manually encoded, an endeavor that is without exception expensive and time-consuming. Consequently, there have been many attempts to reduce this cost through unsupervised or minimally supervised algorithms and learning methods for the acquisition of morphology. These efforts have yet to produce a tool that approaches the performance of manually encoded systems. Here, I present a strategy for morphological clustering and segmentation that is minimally supervised but more linguistically informed than previous unsupervised approaches. This study attempts to induce, from an unannotated text, clusters of words that are inflectional variants of each other; a set of inflectional suffixes by part of speech is then induced from these clusters. This level of detail is made possible by a method known as alignment and transfer (AT), among other names, which uses aligned bitexts to transfer linguistic resources developed for one language (the source) to another (the target). The approach has the further advantage of allowing a reduction in the amount of training data without a significant degradation in performance, making it useful in applications targeted at data collected from endangered languages. In the current study, however, I use English as the source and German as the target for ease of evaluation and for certain typological properties of German. The two main tasks, clustering and segmentation, are approached sequentially, with the clustering informing the segmentation to allow for greater accuracy in morphological analysis.
While the performance of these methods does not exceed the current roster of unsupervised or minimally supervised approaches to morphology acquisition, this study integrates more learning methods than previous ones. Furthermore, it attempts to learn inflectional rather than derivational morphology, a crucial distinction in linguistics.
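The suffix-induction stage can be illustrated with a longest-common-prefix split over one induced cluster (the German verb forms are an invented example; the actual alignment-and-transfer pipeline is far richer):

```python
import os

def induce_suffixes(cluster):
    """Split each word of an inflectional cluster into a shared stem + suffix.

    os.path.commonprefix works character-wise on any strings, so it
    doubles here as a longest-common-prefix finder."""
    stem = os.path.commonprefix(cluster)
    return stem, sorted({w[len(stem):] for w in cluster})

# A hypothetical induced cluster of German present-tense forms.
stem, suffixes = induce_suffixes(["spielen", "spielst", "spielt", "spiele"])
```

Grouping such suffix sets by the part of speech transferred from the English side of the bitext yields the per-POS inflectional paradigms the abstract describes.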
207

Critical flow and pattern formation of granular matter on a conveyor belt

Κανελλόπουλος, Γεώργιος 09 February 2009 (has links)
We study the flow of granular material on a conveyor belt consisting of K connected, vertically vibrated compartments. A steady inflow is applied to the top compartment and our goal is to describe the conditions that ensure a continuous flow all the way down to the Kth compartment. In contrast to normal fluids, flowing granular matter has a tendency to form clusters (due to the inelasticity of the particle collisions [Goldhirsch and Zanetti, 1993]); when this happens the flow stops and the outflow from the Kth compartment vanishes. Given the dimensions of the conveyor belt and the vibration strength, we determine the critical value of the inflow beyond which cluster formation is inevitable. Fortunately, the clusters are announced in advance (already below the critical value of the inflow) by the appearance of a wavy density profile along the K compartments.
The critical flow and the associated wavy profile are explained quantitatively in terms of a dynamical flux model [Eggers, 1999; Van der Weele et al., 2001]. The same model enables us to formulate a method to greatly increase the critical value of the inflow, improving the capacity of the conveyor belt by a factor of two or even more.
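The compartment dynamics can be sketched numerically with a non-monotonic flux of the general form used in this literature, F(n) = A·n²·exp(−B·n²); the form and all constants here are assumptions for illustration, not the fitted model of the thesis. Below the flux maximum the chain settles into uniform flow; above it the first compartment clusters:

```python
import math

def flux(n, A=1.0, B=1.0):
    # Non-monotonic granular flux (assumed Eggers-type form): it rises for
    # small n but falls again once a compartment becomes crowded.
    return A * n * n * math.exp(-B * n * n)

def simulate(inflow, K=5, dt=0.02, steps=50_000):
    # Forward-Euler integration of dn_k/dt = F(n_{k-1}) - F(n_k),
    # with a constant inflow feeding compartment 0.
    n = [0.0] * K
    for _ in range(steps):
        out = [flux(v) for v in n]
        n[0] += dt * (inflow - out[0])
        for k in range(1, K):
            n[k] += dt * (out[k - 1] - out[k])
    return n

F_max = flux(1.0)               # flux maximum sits at n = 1/sqrt(B)
steady = simulate(0.5 * F_max)  # subcritical inflow: smooth, uniform flow
jammed = simulate(1.5 * F_max)  # supercritical inflow: compartment 0 clusters
```

Because the flux decreases once a compartment fills past the maximum, any inflow above F_max makes the first compartment grow without bound while the downstream compartments starve, which is exactly the clustering transition described above.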
208

The relation of location analysis to clustering algorithms

Χατζηθωμά, Ανδρούλα 02 May 2008 (has links)
This thesis reviews the most important problems of Location Analysis and compares them with clustering algorithms. A numerical example of one such comparison is also presented.
209

Computational intelligence and clustering

Κανδηλιώτης, Στέφανος 17 September 2008 (has links)
This thesis applies clustering algorithms and artificial neural networks to human genome data in order to separate the sample into groups according to whether or not a person has some kind of illness, or to determine the type of illness. Experiments conducted with each technique (clustering, ANNs), and with a combination of both, are presented.
210

Identifying Deviating Systems with Unsupervised Learning

Panholzer, Georg January 2008 (has links)
We present a technique to identify deviating systems among a group of systems in a self-organized way. A compressed representation of each system is used to compute similarity measures, which are combined in an affinity matrix of all systems. Deviation detection and clustering are then used to identify deviating systems based on this affinity matrix. The compressed representation is computed with Principal Component Analysis and Kernel Principal Component Analysis. The similarity measure between two compressed representations is based on the angle between the spaces spanned by the principal components, but other methods of calculating a similarity measure are suggested as well. The subsequent deviation detection is carried out by computing the probability of each system being observed given all the other systems. Clustering of the systems is done with hierarchical clustering and spectral clustering. The whole technique is demonstrated on four data sets of mechanical systems, two from a simulated cooling system and two from human gait. The results show its applicability on these mechanical systems.
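A sketch of the subspace-angle similarity and affinity matrix (the data and dimensions are invented; the probabilistic deviation score and the clustering steps of the thesis are replaced here by a simple least-total-affinity check):

```python
import numpy as np

rng = np.random.default_rng(0)

def pca_basis(X, k=2):
    # Orthonormal basis for the top-k principal directions (rows of X = samples).
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T                            # shape: (n_features, k)

def subspace_similarity(U, V):
    # Cosines of the principal angles are the singular values of U^T V.
    s = np.linalg.svd(U.T @ V, compute_uv=False)
    return float(np.clip(s, 0.0, 1.0).prod())  # 1.0 means identical subspaces

def affinity_matrix(systems, k=2):
    bases = [pca_basis(X, k) for X in systems]
    m = len(bases)
    A = np.eye(m)
    for i in range(m):
        for j in range(i + 1, m):
            A[i, j] = A[j, i] = subspace_similarity(bases[i], bases[j])
    return A

# Three systems sharing one variance structure plus one deviating system.
normal = [rng.normal(size=(200, 4)) * np.array([3.0, 2.0, 0.1, 0.1]) for _ in range(3)]
deviant = [rng.normal(size=(200, 4)) * np.array([0.1, 0.1, 3.0, 2.0])]
A = affinity_matrix(normal + deviant)
outlier = int(np.argmin(A.sum(axis=0)))        # least similar to all the others
```

Spectral clustering, as used in the thesis, would operate directly on this affinity matrix instead of the argmin shortcut used here.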
