  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
201

Clustering and Inconsistent Information: A Kernelization Approach

Cao, Yixin May 2012 (has links)
Clustering is the unsupervised classification of patterns into groups, which is easy provided the pattern data are consistent. However, real data are almost always tainted with inconsistencies, which make clustering a hard problem; indeed, the most widely studied formulations, correlation clustering and hierarchical clustering, are both NP-hard. In the graph representation of data, inconsistencies also frequently present themselves as cycles, also called deadlocks, and breaking cycles by removing vertices is the objective of the classical feedback vertex set (FVS) problem. This dissertation studies three problems, correlation clustering, hierarchical clustering, and disjoint-FVS (a variation of FVS), from a kernelization approach. A kernelization algorithm provably reduces a problem instance in polynomial time, speeding up further processing by other approaches. For each of the problems studied, an efficient kernelization algorithm with linear or sub-quadratic running time is presented. All the kernels obtained in this dissertation have linear size with very small constants. Better parameterized algorithms based on these kernels are also designed for the last two problems. Finally, possible directions for future research are briefly discussed.
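As a concrete (and heavily simplified) illustration of what a kernelization step looks like for correlation clustering in its cluster-editing form, the sketch below applies one standard textbook-style reduction rule: a connected component that is already a clique is consistent, needs no edits, and can be removed from the instance. This is offered as an assumed example, not as one of the dissertation's own rules.

```python
# Reduction rule sketch for cluster editing: drop components that are already cliques.
import networkx as nx

def remove_clique_components(G: nx.Graph) -> nx.Graph:
    """Return a reduced instance with all already-consistent (clique) components removed."""
    keep = []
    for comp in nx.connected_components(G):
        k = len(comp)
        if G.subgraph(comp).number_of_edges() != k * (k - 1) // 2:   # not a clique -> keep it
            keep.extend(comp)
    return G.subgraph(keep).copy()

# usage: a triangle (already a clique) plus a path; only the inconsistent path survives
G = nx.Graph([(1, 2), (2, 3), (1, 3), (4, 5), (5, 6)])
print(sorted(remove_clique_components(G).nodes()))   # [4, 5, 6]
```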
202

Parity Conditions and the Efficiency of the NTD/USD 30 and 90 Day Forward Markets

Hsing, Kuo 24 December 2004 (has links)
An efficient market leaves no intertemporal arbitrage opportunities in asset prices. The volatility clustering effect, however, means that conditional volatility can be modelled and exploited, so derivatives such as options, currency exchange and swaps may still offer arbitrage profits. This paper examines forward market efficiency for the NTD/USD exchange rate: covered interest parity (CIP), uncovered interest parity (UIP) and the forward market efficiency (FME) hypothesis are tested in the 30- and 90-day forward markets using GARCH-M and EGARCH models. In the empirical tests we find that the NTD/USD interest rate spread is I(0); the stationarity of the interest differential also implies a stable relationship between Taiwanese and US monetary policy. Using Taylor's (1989) covered interest arbitrage models, the empirical results show only small positive profit opportunities on NTD or US returns, so covered interest parity may hold, and the NTD/USD exchange market appears more efficient after reopening, although central bank policy intervention remains influential. We then test the market efficiency hypotheses on the basis of Domowitz and Hakkio's (1985) ARCH-M model, employing GARCH-M and EGARCH models to estimate the risk premium, and use Felmingham's (2003) regression equation to test forward market efficiency. The empirical results show that not only do the CIP and UIP theories fail, but the forward market efficiency hypothesis also cannot hold, so whether future spot rates can be predicted by forward rates remains worth investigating. This may indicate that foreign securities are imperfect substitutes for domestic securities of equivalent maturity and that arbitrage profit opportunities exist between Taiwan and the USA. There are many arguments over whether forward rates are an unbiased predictor of future spot rates, and forward market efficiency depends on the presence of a time-varying premium; ultimately, therefore, the unbiasedness of forward rates is an empirical, and not a theoretical, issue.
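A minimal sketch of the covered interest parity relation underlying the CIP tests: the forward rate implied by the interest differential is compared with a quoted forward rate, and a persistent gap would signal covered arbitrage profits. All rates and prices below are hypothetical placeholders, not data from the study.

```python
# Covered interest parity check: implied forward vs. quoted forward (illustrative numbers only).
def cip_forward(spot, i_domestic, i_foreign, days, basis=360):
    """Forward rate (domestic per foreign) implied by covered interest parity with simple rates."""
    t = days / basis
    return spot * (1 + i_domestic * t) / (1 + i_foreign * t)

spot = 33.50                   # hypothetical NTD per USD spot rate
i_ntd, i_usd = 0.015, 0.030    # hypothetical simple annual interest rates
quoted_forward = 33.40         # hypothetical quoted forward rate
for days in (30, 90):
    implied = cip_forward(spot, i_ntd, i_usd, days)
    print(f"{days}-day implied forward {implied:.4f}, deviation {quoted_forward - implied:+.4f}")
```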
203

Video Scene Change Detection Using Support Vector Clustering

Kao, Chih-pang 13 October 2005 (has links)
As the era of digitisation arrives, large amounts of multimedia data (images, video, etc.) are stored digitally in databases, and retrieval systems for them become increasingly important. Video contains a huge number of frames, so to search effectively and quickly, the first step is to detect the places where the scene changes, cut the video into scenes, find a key frame for each scene, and use the key frames as the index for retrieval. Scene changes divide into abrupt and gradual transitions. Even within the same scene, however, violent motion or camera movement often occurs and can be confused with a gradual transition. This paper therefore extracts the main components of every frame in the video using principal component analysis (PCA) to reduce noise interference, and then classifies these feature points with support vector clustering, so that close feature points are assigned to the same class. If feature points lie between two different groups, the scene is changing slowly in the video, and a scene change is detected in this way.
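A rough sketch of the PCA-then-cluster idea described above, under two stated assumptions: frames are supplied as flattened grayscale vectors, and KMeans stands in for support vector clustering (which the thesis uses but which is not available in scikit-learn). A boundary is guessed wherever consecutive frames fall into different clusters.

```python
# PCA feature extraction per frame, then clustering; cluster-label changes mark scene boundaries.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def detect_scene_changes(frames, n_components=10, n_scenes=5):
    feats = PCA(n_components=n_components).fit_transform(frames)       # denoised frame features
    labels = KMeans(n_clusters=n_scenes, n_init=10).fit_predict(feats) # stand-in for SVC
    # report a boundary where consecutive frames belong to different clusters
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]

# usage with random stand-in data: 200 "frames" of 32x32 pixels
frames = np.random.rand(200, 32 * 32)
print(detect_scene_changes(frames))
```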
204

Mining-Based Category Evolution for Text Databases

Dong, Yuan-Xin 18 July 2000 (has links)
As text repositories grow in number and size and global connectivity improves, the amount of online information in the form of free-format text is growing extremely rapidly. In many large organizations, huge volumes of textual information are created and maintained, and there is a pressing need to support efficient and effective information retrieval, filtering, and management. Text categorization is essential to the efficient management and retrieval of documents. Past research on text categorization mainly focused on developing or adopting statistical classification or inductive learning methods for automatically discovering text categorization patterns from a training set of manually categorized documents. However, as documents accumulate, the pre-defined categories may not capture the characteristics of the documents. In this study, we proposed a mining-based category evolution (MiCE) technique to adjust the categories based on the existing categories and their associated documents. According to the empirical evaluation results, the proposed technique, MiCE, was more effective than the discovery-based category management approach, insensitive to the quality of original categories, and capable of improving classification accuracy.
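As a toy illustration only (this is not the MiCE technique itself), one simple way to let a category "evolve" is to re-cluster its documents and split the category when they separate into clearly distinct groups; the function name, vectorizer choice and splitting rule below are assumptions made for illustration.

```python
# Split an existing category into two sub-categories when its documents separate cleanly.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def maybe_split_category(docs, min_docs=10, threshold=0.25):
    """Return two sub-groups if the category's documents separate well, else the original group."""
    if len(docs) < min_docs:
        return [docs]
    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
    if silhouette_score(X, labels) < threshold:        # documents do not separate clearly
        return [docs]
    return [[d for d, l in zip(docs, labels) if l == k] for k in (0, 1)]
```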
205

GA-Based fuzzy clustering applied to irregular

Lai, Fun-Zhu 10 February 2003 (has links)
Building a rule-based classification system for a training data set is an important research topic in the areas of data mining, knowledge discovery and expert systems. Recently, the GA-based fuzzy approach has been shown to be an effective way to design an efficient evolutionary fuzzy system. In this thesis, a three-layer genetic algorithm with simulated annealing for selecting a small number of fuzzy if-then rules to build a compact fuzzy classification system is proposed. The rule selection problem has three objectives: (1) maximize the number of correctly classified patterns, (2) minimize the number of fuzzy if-then rules, and (3) minimize the number of required features. Genetic algorithms are applied to solve this problem. A set of fuzzy if-then rules is coded into a binary string and treated as an individual in the genetic algorithm. The fitness of each individual is specified by the three objectives of the combinatorial optimization problem. Simulated annealing (SA) is optionally combined with the three-layer genetic algorithm to effectively select some layer control genes. The performance of the proposed method on training and test data is examined by computer simulations on the iris and spiral data sets and compared with existing approaches. It is shown empirically that the proposed method outperforms existing methods in the design of optimal fuzzy systems.
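A minimal sketch of how the three objectives can be folded into a single GA fitness value: reward correctly classified patterns and penalise the number of selected rules and the number of features those rules use. The weights, the rule representation and the `classify` helper are assumptions, not the thesis's exact encoding.

```python
# Fitness of a binary-string individual that selects a subset of candidate fuzzy if-then rules.
def fitness(individual, rules, classify, patterns, w_acc=10.0, w_rules=1.0, w_feats=1.0):
    """individual: list of 0/1 bits selecting rules; each rule exposes a set attribute .features;
    classify(selected_rules, x) returns the predicted class for pattern x."""
    selected = [r for bit, r in zip(individual, rules) if bit]
    correct = sum(1 for x, y in patterns if classify(selected, x) == y)
    used_features = set().union(*(r.features for r in selected)) if selected else set()
    # maximize accuracy, minimize rule count and feature count
    return w_acc * correct - w_rules * len(selected) - w_feats * len(used_features)
```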
206

Bootstrapping in a high dimensional but very low sample size problem

Song, Juhee 16 August 2006 (has links)
High Dimension, Low Sample Size (HDLSS) problems have received much attention recently in many areas of science. Analysis of microarray experiments is one such area. Numerous ongoing studies investigate the behavior of genes by measuring the abundance of mRNA (messenger ribonucleic acid), i.e., gene expression. The HDLSS data investigated in this dissertation consist of a large number of data sets, each of which has only a few observations. We assume a statistical model in which measurements from the same subject have the same expected value and variance, and all subjects have the same distribution up to location and scale. Information from all subjects is shared in estimating this common distribution. Our interest is in testing the hypothesis that the mean of measurements from a given subject is 0. Commonly used tests of this hypothesis, the t-test, the sign test and traditional bootstrapping, do not necessarily provide reliable results since there are only a few observations in each data set. We motivate a mixture model having C clusters and 3C parameters to overcome the small sample size problem. Standardized data are pooled after assigning each data set to one of the mixture components. To obtain reasonable initial parameter estimates when density estimation methods are applied, we use clustering methods including agglomerative clustering and K-means. The Bayesian Information Criterion (BIC) and a new criterion, WMCV (Weighted Mean of within-Cluster Variance estimates), are used to choose an optimal number of clusters. Density estimation methods, including a maximum likelihood unimodal density estimator and kernel density estimation, are used to estimate the unknown density. Once the density is estimated, a bootstrapping algorithm that draws samples from the estimated density is used to approximate the distribution of test statistics. The t-statistic and an empirical likelihood ratio statistic are used, since their distributions are completely determined by the distribution common to all subjects. A method to control the false discovery rate is used to perform simultaneous tests on all small data sets. Simulated data sets and a set of cDNA (complementary deoxyribonucleic acid) microarray experiment data are analyzed by the proposed methods.
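A simplified sketch of the bootstrapping step (omitting the mixture model and density estimation described above): standardized observations pooled across subjects stand in for the estimated common distribution, and the null distribution of the t-statistic is approximated by resampling from that pool. Sample sizes and data below are illustrative.

```python
# Bootstrap the null distribution of a small-sample t-statistic from pooled standardized data.
import numpy as np

def bootstrap_pvalue(subject, pooled, n_boot=10_000, rng=np.random.default_rng(0)):
    n = len(subject)
    t_obs = subject.mean() / (subject.std(ddof=1) / np.sqrt(n))
    t_null = []
    for _ in range(n_boot):
        s = rng.choice(pooled, size=n, replace=True)          # resample from pooled, mean-zero data
        t_null.append(s.mean() / (s.std(ddof=1) / np.sqrt(n)))
    t_null = np.asarray(t_null)
    return np.mean(np.abs(t_null) >= abs(t_obs))              # two-sided bootstrap p-value

# illustrative use: 3 observations for one subject, pooled data from many standardized subjects
pooled = np.random.default_rng(1).standard_normal(3000)
subject = np.array([0.8, 1.1, 0.4])
print(bootstrap_pvalue(subject, pooled))
```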
207

The GDense Algorithm for Clustering Data Streams with High Quality

Lin, Shu-Yi 25 June 2009 (has links)
In recent years, mining data streams has been widely studied. A data stream is a sequence of dynamic, continuous, unbounded and real-time data items arriving at a very high rate that can only be read once. In data mining, clustering is one of the useful techniques for discovering interesting structure in the underlying data objects. The clustering problem can be defined formally as follows: given n data points in a d-dimensional metric space, partition the data points into k clusters such that the data points within a cluster are more similar to each other than to data points in different clusters. In the data stream environment, the difficulties of clustering include storage overhead, low clustering quality and low updating efficiency. Current clustering algorithms can be broadly classified into four categories: partition, hierarchical, density-based and grid-based approaches. The advantage of grid-based algorithms is that they can handle large databases, while in density-based approaches the insertion or deletion of data affects the current clustering only in the neighborhood of that data. Combining the advantages of the grid-based and density-based approaches, the CDS-Tree algorithm was proposed; although it can handle large databases, its clustering quality is restricted by the grid partition and the dense-cell threshold. Therefore, in this thesis we present a new high-quality clustering algorithm for data streams, GDense. The GDense algorithm achieves high quality through two kinds of partition, cells and quadcells, and two kinds of threshold, the dense-cell threshold and one quarter of it. Moreover, in the data insertion part of our GDense algorithm, seven cases take three factors about the cell and the quadcell into consideration, and in the deletion part, ten cases take five factors about the cell into consideration. Our simulation results show that, regardless of the conditions (the number of data points, the number of cells, the size of the sliding window, and the dense-cell threshold), the clustering purity of our GDense algorithm is always higher than that of the CDS-Tree algorithm. We also compare the purity of the GDense and CDS-Tree algorithms in the presence of outliers: whether the number of outliers is large or small, the clustering purity of GDense remains higher than that of CDS-Tree, with an improvement of about 20% in clustering purity over the CDS-Tree algorithm.
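A bare-bones sketch of the grid/density bookkeeping such algorithms rely on (this is not the GDense algorithm itself, which additionally uses quadcells and case analysis on insertion and deletion): each point is hashed to a grid cell, counts are maintained under insertion and deletion, and a cell is reported as dense once its count reaches a threshold. Cell size and threshold are assumed values.

```python
# Grid-based density bookkeeping for streaming points: insert, delete, and list dense cells.
from collections import defaultdict

class GridDensity:
    def __init__(self, cell_size=1.0, dense_threshold=5):
        self.cell_size = cell_size
        self.dense_threshold = dense_threshold
        self.counts = defaultdict(int)

    def _cell(self, point):
        # map a d-dimensional point to the integer coordinates of its grid cell
        return tuple(int(x // self.cell_size) for x in point)

    def insert(self, point):
        self.counts[self._cell(point)] += 1

    def delete(self, point):
        # e.g. a point sliding out of the window
        self.counts[self._cell(point)] -= 1

    def dense_cells(self):
        return [c for c, n in self.counts.items() if n >= self.dense_threshold]
```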
208

CUDIA: a probabilistic cross-level imputation framework using individual auxiliary information

Park, Yubin 17 February 2012 (has links)
In healthcare-related studies, individual patient or hospital data are often not publicly available due to privacy restrictions, legal issues or reporting norms. However, such measures may be provided at a higher or more aggregated level, such as state-level, county-level summaries or averages over health zones such as Hospital Referral Regions (HRR) or Hospital Service Areas (HSA). Such levels constitute partitions over the underlying individual level data, which may not match the groupings that would have been obtained if one clustered the data based on individual-level attributes. Moreover, treating aggregated values as representatives for the individuals can result in the ecological fallacy. How can one run data mining procedures on such data where different variables are available at different levels of aggregation or granularity? In this thesis, we seek a better utilization of variably aggregated datasets, which are possibly assembled from different sources. We propose a novel "cross-level" imputation technique that models the generative process of such datasets using a Bayesian directed graphical model. The imputation is based on the underlying data distribution and is shown to be unbiased. This imputation can be further utilized in a subsequent predictive modeling, yielding improved accuracies. The experimental results using a simulated dataset and the Behavioral Risk Factor Surveillance System (BRFSS) dataset are provided to illustrate the generality and capabilities of the proposed framework.
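A small synthetic illustration of the ecological fallacy mentioned above (this is not the CUDIA model): the correlation computed on group averages can be strongly positive even when the relationship inside every group is negative, so aggregates are poor stand-ins for individual-level values. All data below are simulated.

```python
# Within-group vs. aggregate-level correlation on synthetic "county" data.
import numpy as np

rng = np.random.default_rng(0)
groups = []
for m in (0.0, 2.0, 4.0):                                   # three "counties" with rising baselines
    x = rng.normal(m, 1.0, size=100)
    y = m - 0.8 * (x - m) + rng.normal(0.0, 0.3, size=100)  # negative relation inside each group
    groups.append((x, y))

within_r = np.mean([np.corrcoef(x, y)[0, 1] for x, y in groups])
x_agg = np.array([x.mean() for x, _ in groups])
y_agg = np.array([y.mean() for _, y in groups])
agg_r = np.corrcoef(x_agg, y_agg)[0, 1]

print(f"mean within-group r: {within_r:+.2f}")   # strongly negative
print(f"aggregate-level r:   {agg_r:+.2f}")      # close to +1
```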
209

Minimally supervised induction of morphology through bitexts

Moon, Taesun, Ph. D. 17 January 2013 (has links)
A knowledge of morphology can be useful for many natural language processing systems. Thus, much effort has been expended in developing accurate computational tools for morphology that lemmatize, segment and generate new forms. The most powerful and accurate of these have been manually encoded, such endeavors being without exception expensive and time-consuming. There have consequently been many attempts to reduce this cost through unsupervised or minimally supervised algorithms and learning methods for the acquisition of morphology. These efforts have yet to produce a tool that approaches the performance of manually encoded systems. Here, I present a strategy for dealing with morphological clustering and segmentation in a minimally supervised manner, but one that is more linguistically informed than previous unsupervised approaches. That is, this study attempts to induce, from an unannotated text, clusters of words that are inflectional variants of each other; a set of inflectional suffixes by part of speech is then induced from these clusters. This level of detail is made possible by a method known as alignment and transfer (AT), among other names, an approach that uses aligned bitexts to transfer linguistic resources developed for one language (the source) to another (the target). This approach has the further advantage of allowing a reduction in the amount of training data without a significant degradation in performance, making it useful in applications targeted at data collected from endangered languages. In the current study, however, I use English as the source and German as the target for ease of evaluation and for certain typological properties of German. The two main tasks, clustering and segmentation, are approached sequentially, with the clustering informing the segmentation to allow for greater accuracy in morphological analysis. While the performance of these methods does not exceed the current roster of unsupervised or minimally supervised approaches to morphology acquisition, this work integrates more learning methods than previous studies. Furthermore, it attempts to learn inflectional morphology as opposed to derivational morphology, which is a crucial distinction in linguistics.
210

Critical flow and pattern formation of granular matter on a conveyor belt

Κανελλόπουλος, Γεώργιος 09 February 2009 (has links)
We study the flow of granular material on a conveyor belt consisting of K connected, vertically vibrated compartments. A steady inflow is applied to the top compartment and our goal is to describe the conditions that ensure a continuous flow all the way down to the Kth compartment. In contrast to normal fluids, flowing granular matter has a tendency to form clusters (due to the inelasticity of the particle collisions [Goldhirsch and Zanetti, 1993]); when this happens the flow stops and the outflow from the Kth compartment vanishes. Given the dimensions of the conveyor belt and the vibration strength, we determine the critical value of the inflow beyond which cluster formation is inevitable. Fortunately, the clusters are announced in advance (already below the critical value of the inflow) by the appearance of a wavy density profile along the K compartments. The critical flow and the associated wavy profile are explained quantitatively in terms of a dynamical flux model [Eggers, 1999; Van der Weele et al., 2001]. This same model enables us to formulate a method to greatly increase the critical value of the inflow, improving the capacity of the conveyor belt by a factor two or even more.
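A rough numerical sketch of a compartmentalized flux model of the kind cited above (Eggers, 1999; Van der Weele et al., 2001), assuming the common flux form F(n) = A·n²·exp(−B·n²) for the outflow of a vibrated compartment; the values of A, B, K and the inflow are illustrative. With A = B = 1 the flux peaks near 0.37, so an inflow below that value yields a steady density profile and a matching outflow, while a larger inflow makes the first compartment cluster and the outflow vanish.

```python
# Integrate the compartment densities of a K-compartment conveyor belt under a steady inflow,
# with each compartment's outflow given by the assumed flux F(n) = A * n**2 * exp(-B * n**2).
import numpy as np

def simulate(K=5, inflow=0.3, A=1.0, B=1.0, dt=0.01, steps=200_000):
    n = np.zeros(K)                                   # particle fraction in each compartment
    flux = lambda x: A * x**2 * np.exp(-B * x**2)
    for _ in range(steps):
        out = flux(n)                                 # outflow of every compartment
        dn = -out
        dn[1:] += out[:-1]                            # each compartment receives its neighbour's outflow
        dn[0] += inflow                               # steady inflow into the first compartment
        n += dt * dn
    return n, flux(n[-1])                             # density profile and outflow from the last compartment

profile, outflow = simulate()
print("density profile:", np.round(profile, 3), "outflow:", round(float(outflow), 3))
```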
