Global ETD Search

11	Text-based language identification for the South African languages Botha, Gerrit Reinier 04 September 2008 (has links) We investigate the factors that determine the performance of text-based language identification, with a particular focus on the 11 official languages of South Africa. Our study uses n-gram statistics as features for classification. In particular, we compare support vector machines, Naïve Bayesian and difference-in-frequency classifiers on different amounts of input text and various values of n, for different amounts of training data. For a fixed value of n the support vector machines generally outperforms the other classifiers, but the simpler classifiers are able to handle larger values of n. The additional computational complexity of training the support vector machine classifier may not be justified in light of importance of using a large value of n, except possibly for small sizes of the input window when limited training data is available. We find that it is more difficult to discriminate languages within language families then those across families. The accuracy on small input strings is low due to this reason, but for input strings of 100 characters or more there is only a slight confusion within families and accuracies as high as 99.4% are achieved. For the smallest input strings studied here, which consist of 15 characters, the best accuracy achieved is only 83%, but when the languages in different families are grouped together, this corresponds to a usable 95.1% accuracy. The relationship between the amount of training data and the accuracy achieved is found to depend on the window size – for the largest window (300 characters) about 400 000 characters are sufficient to achieve close-to-optimal accuracy, whereas improvements in accuracy are found even beyond 1.6 million characters of training data. Finally, we show that the confusions between the different languages in our set can be used to derive informative graphical representations of the relationships between the languages. / Dissertation (MEng)--University of Pretoria, 2008. / Electrical, Electronic and Computer Engineering / unrestricted Naïve bayesian classification Support vector machine N-gram statistics Text-based language identification Difference-in-frequency classification UCTD
12	Praktické uplatnění technologií data mining ve zdravotních pojišťovnách / Practical applications of data mining technologies in health insurance companies Kulhavý, Lukáš January 2010 (has links) This thesis focuses on data mining technology and its possible practical use in the field of health insurance companies. Thesis defines the term data mining and its relation to the term knowledge discovery in databases. The term data mining is explained, inter alia, with methods describing the individual phases of the process of knowledge discovery in databases (CRISP-DM, SEMMA). There is also information about possible practical applications, technologies and products available in the market (both products available free and commercial products). Introduction of the main data mining methods and specific algorithms (decision trees, association rules, neural networks and other methods) serves as a theoretical introduction, on which are the practical applications of real data in real health insurance companies build. These are applications seeking the causes of increased remittances and churn prediction. I have solved these applications in freely-available systems Weka and LISP-Miner. The objective is to introduce and to prove data mining capabilities over this type of data and to prove capabilities of Weka and LISP-Miner systems in solving tasks due to the methodology CRISP-DM. The last part of thesis is devoted the fields of cloud and grid computing in conjunction with data mining. It offers an insight into possibilities of these technologies and their benefits to the technology of data mining. Possibilities of cloud computing are presented on the Amazon EC2 system, grid computing can be used in Weka Experimenter interface.
13	Financial Time Series Analysis using Pattern Recognition Methods Zeng, Zhanggui January 2008 (has links) Doctor of Philosophy / This thesis is based on research on financial time series analysis using pattern recognition methods. The first part of this research focuses on univariate time series analysis using different pattern recognition methods. First, probabilities of basic patterns are used to represent the features of a section of time series. This feature can remove noise from the time series by statistical probability. It is experimentally proven that this feature is successful for pattern repeated time series. Second, a multiscale Gaussian gravity as a pattern relationship measurement which can describe the direction of the pattern relationship is introduced to pattern clustering. By searching for the Gaussian-gravity-guided nearest neighbour of each pattern, this clustering method can easily determine the boundaries of the clusters. Third, a method that unsupervised pattern classification can be transformed into multiscale supervised pattern classification by multiscale supervisory time series or multiscale filtered time series is presented. The second part of this research focuses on multivariate time series analysis using pattern recognition. A systematic method is proposed to find the independent variables of a group of share prices by time series clustering, principal component analysis, independent component analysis, and object recognition. The number of dependent variables is reduced and the multivariate time series analysis is simplified by time series clustering and principal component analysis. Independent component analysis aims to find the ideal independent variables of the group of shares. Object recognition is expected to recognize those independent variables which are similar to the independent components. This method provides a new clue to understanding the stock market and to modelling a large time series database. Financial time series Pattern recognition Probabilistic relaxation clustering Multi-scale pattern clustering Principal component analysis Bayesian classification Principal component analysis Independent component analysis Object matching
14	Financial Time Series Analysis using Pattern Recognition Methods Zeng, Zhanggui January 2008 (has links) Doctor of Philosophy / This thesis is based on research on financial time series analysis using pattern recognition methods. The first part of this research focuses on univariate time series analysis using different pattern recognition methods. First, probabilities of basic patterns are used to represent the features of a section of time series. This feature can remove noise from the time series by statistical probability. It is experimentally proven that this feature is successful for pattern repeated time series. Second, a multiscale Gaussian gravity as a pattern relationship measurement which can describe the direction of the pattern relationship is introduced to pattern clustering. By searching for the Gaussian-gravity-guided nearest neighbour of each pattern, this clustering method can easily determine the boundaries of the clusters. Third, a method that unsupervised pattern classification can be transformed into multiscale supervised pattern classification by multiscale supervisory time series or multiscale filtered time series is presented. The second part of this research focuses on multivariate time series analysis using pattern recognition. A systematic method is proposed to find the independent variables of a group of share prices by time series clustering, principal component analysis, independent component analysis, and object recognition. The number of dependent variables is reduced and the multivariate time series analysis is simplified by time series clustering and principal component analysis. Independent component analysis aims to find the ideal independent variables of the group of shares. Object recognition is expected to recognize those independent variables which are similar to the independent components. This method provides a new clue to understanding the stock market and to modelling a large time series database. Financial time series Pattern recognition Probabilistic relaxation clustering Multi-scale pattern clustering Principal component analysis Bayesian classification Principal component analysis Independent component analysis Object matching
15	Metody klasifikace www stránek / Methods for Classification of WWW Pages Svoboda, Pavel January 2009 (has links) The main goal of this master's thesis was to study the main principles of classification methods. Basic principles of knowledge discovery process, data mining and using an external class CSSBox are described. Special attantion was paid to implementation of a ,,k-nearest neighbors`` classification method. The first objective of this work was to create training and testing data described by 'n' attributes. The second objective was to perform experimental analysis to determine a good value for 'k', the number of neighbors.

Page generated in 0.1023 seconds