Global ETD Search

121	Avaliação de métodos não-supervisionados de seleção de atributos para mineração de textos / Evaluation of unsupervised feature selection methods for Text Mining Nogueira, Bruno Magalhães 27 March 2009 (has links) Selecionar atributos é, por vezes, uma atividade necessária para o correto desenvolvimento de tarefas de aprendizado de máquina. Em Mineração de Textos, reduzir o número de atributos em uma base de textos é essencial para a eficácia do processo e a compreensibilidade do conhecimento extraído, uma vez que se lida com espaços de alta dimensionalidade e esparsos. Quando se lida com contextos nos quais a coleção de textos é não-rotulada, métodos não-supervisionados de redução de atributos são utilizados. No entanto, não existe forma geral predefinida para a obtenção de medidas de utilidade de atributos em métodos não-supervisionados, demandando um esforço maior em sua realização. Assim, este trabalho aborda a seleção não-supervisionada de atributos por meio de um estudo exploratório de métodos dessa natureza, comparando a eficácia de cada um deles na redução do número de atributos em aplicações de Mineração de Textos. Dez métodos são comparados - Ranking porTerm Frequency, Ranking por Document Frequency, Term Frequency-Inverse Document Frequency, Term Contribution, Term Variance, Term Variance Quality, Método de Luhn, Método LuhnDF, Método de Salton e Zone-Scored Term Frequency - sendo dois deles aqui propostos - Método LuhnDF e Zone-Scored Term Frequency. A avaliação se dá em dois focos, supervisionado, pelo medida de acurácia de quatro classificadores (C4.5, SVM, KNN e Naïve Bayes), e não-supervisionado, por meio da medida estatística de Expected Mutual Information Measure. Aos resultados de avaliação, aplica-se o teste estatístico de Kruskal-Wallis para determinação de significância estatística na diferença de desempenho dos diferentes métodos de seleção de atributos comparados. Seis bases de textos são utilizadas nas avaliações experimentais, cada uma relativa a um grande domínio e contendo subdomínios, os quais correspondiam às classes usadas para avaliação supervisionada. Com esse estudo, este trabalho visa contribuir com uma aplicação de Mineração de Textos que visa extrair taxonomias de tópicos a partir de bases textuais não-rotuladas, selecionando os atributos mais representativos em uma coleção de textos. Os resultados das avaliações mostram que não há diferença estatística significativa entre os métodos não-supervisionados de seleção de atributos comparados. Além disso, comparações desses métodos não-supervisionados com outros supervisionados (Razão de Ganho e Ganho de Informação) apontam que é possível utilizar os métodos não-supervisionados em atividades supervisionadas de Mineração de Textos, obtendo eficiência compatível com os métodos supervisionados, dado que não detectou-se diferença estatística nessas comparações, e com um custo computacional menor / Feature selection is an activity sometimes necessary to obtain good results in machine learning tasks. In Text Mining, reducing the number of features in a text base is essential for the effectiveness of the process and the comprehensibility of the extracted knowledge, since it deals with high dimensionalities and sparse contexts. When dealing with contexts in which the text collection is not labeled, unsupervised methods for feature reduction have to be used. However, there aren\'t any general predefined feature quality measures for unsupervised methods, therefore demanding a higher effort for its execution. So, this work broaches the unsupervised feature selection through an exploratory study of methods of this kind, comparing their efficacies in the reduction of the number of features in the Text Mining process. Ten methods are compared - Ranking by Term Frequency, Ranking by Document Frequency, Term Frequency-Inverse Document Frequency, Term Contribution, Term Variance, Term Variance Quality, Luhn\'s Method, LuhnDF Method, Salton\'s Method and Zone-Scored Term Frequency - and two of them are proposed in this work - LuhnDF Method and Zone-Scored Term Frequency. The evaluation process is done in two ways, supervised, through the accuracy measure of four classifiers (C4.5, SVM, KNN and Naïve Bayes), and unsupervised, using the Expected Mutual Information Measure. The evaluation results are submitted to the statistical test of Kruskal-Wallis in order to determine the statistical significance of the performance difference of the different feature selection methods. Six text bases are used in the experimental evaluation, each one related to one domain and containing sub domains, which correspond to the classes used for supervised evaluation. Through this study, this work aims to contribute with a Text Mining application that extracts topic taxonomies from unlabeled text collections, through the selection of the most representative features in a text collection. The evaluation results show that there is no statistical difference between the unsupervised feature selection methods compared. Moreover, comparisons of these unsupervised methods with other supervised ones (Gain Ratio and Information Gain) show that it is possible to use unsupervised methods in supervised Text Mining activities, obtaining an efficiency compatible with supervised methods, since there isn\'t any statistical difference the statistical test detected in these comparisons, and with a lower computational effort Aprendizado de máquina Aprendizado não-supervisionado Feature selection Machine learning Mineração de textos Seleção de atributos Text mining Unsupervised learning
122	Image-based Process Monitoring via Generative Adversarial Autoencoder with Applications to Rolling Defect Detection January 2019 (has links) abstract: Image-based process monitoring has recently attracted increasing attention due to the advancement of the sensing technologies. However, existing process monitoring methods fail to fully utilize the spatial information of images due to their complex characteristics including the high dimensionality and complex spatial structures. Recent advancement of the unsupervised deep models such as a generative adversarial network (GAN) and generative adversarial autoencoder (AAE) has enabled to learn the complex spatial structures automatically. Inspired by this advancement, we propose an anomaly detection framework based on the AAE for unsupervised anomaly detection for images. AAE combines the power of GAN with the variational autoencoder, which serves as a nonlinear dimension reduction technique with regularization from the discriminator. Based on this, we propose a monitoring statistic efficiently capturing the change of the image data. The performance of the proposed AAE-based anomaly detection algorithm is validated through a simulation study and real case study for rolling defect detection. / Dissertation/Thesis / Masters Thesis Industrial Engineering 2019 Industrial engineering Information technology Computer science adversarial autoencoder anomaly detection generative adversarial networks machine learning statistic unsupervised learning
123	Novelty Detection Of Machinery Using A Non-Parametric Machine Learning Approach Angola, Enrique 01 January 2018 (has links) A novelty detection algorithm inspired by human audio pattern recognition is conceptualized and experimentally tested. This anomaly detection technique can be used to monitor the health of a machine or could also be coupled with a current state of the art system to enhance its fault detection capabilities. Time-domain data obtained from a microphone is processed by applying a short-time FFT, which returns time-frequency patterns. Such patterns are fed to a machine learning algorithm, which is designed to detect novel signals and identify windows in the frequency domain where such novelties occur. The algorithm presented in this paper uses one-dimensional kernel density estimation for different frequency bins. This process eliminates the need for data dimension reduction algorithms. The method of "pseudo-likelihood cross validation" is used to find an independent optimal kernel bandwidth for each frequency bin. Metrics such as the "Individual Node Relative Difference" and "Total Novelty Score" are presented in this work, and used to assess the degree of novelty of a new signal. Experimental datasets containing synthetic and real novelties are used to illustrate and test the novelty detection algorithm. Novelties are successfully detected in all experiments. The presented novelty detection technique could greatly enhance the performance of current state-of-the art condition monitoring systems, or could also be used as a stand-alone system. Audio Processing Condition Monitoring Machine Learning Novelty Detection Statistical Learning Unsupervised Learning Computer Sciences Mathematics Mechanical Engineering
124	Bayesian non-parametric parsimonious mixtures for model-based clustering / Modèles de mélanges Bayésiens non-paramétriques parcimonieux pour la classification automatique Bartcus, Marius 26 October 2015 (has links) Cette thèse porte sur l’apprentissage statistique et l’analyse de données multi-dimensionnelles. Elle se focalise particulièrement sur l’apprentissage non supervisé de modèles génératifs pour la classiﬁcation automatique. Nous étudions les modèles de mélanges Gaussians, aussi bien dans le contexte d’estimation par maximum de vraisemblance via l’algorithme EM, que dans le contexte Bayésien d’estimation par Maximum A Posteriori via des techniques d’échantillonnage par Monte Carlo. Nous considérons principalement les modèles de mélange parcimonieux qui reposent sur une décomposition spectrale de la matrice de covariance et qui oﬀre un cadre ﬂexible notamment pour les problèmes de classiﬁcation en grande dimension. Ensuite, nous investiguons les mélanges Bayésiens non-paramétriques qui se basent sur des processus généraux ﬂexibles comme le processus de Dirichlet et le Processus du Restaurant Chinois. Cette formulation non-paramétrique des modèles est pertinente aussi bien pour l’apprentissage du modèle, que pour la question diﬃcile du choix de modèle. Nous proposons de nouveaux modèles de mélanges Bayésiens non-paramétriques parcimonieux et dérivons une technique d’échantillonnage par Monte Carlo dans laquelle le modèle de mélange et son nombre de composantes sont appris simultanément à partir des données. La sélection de la structure du modèle est eﬀectuée en utilisant le facteur de Bayes. Ces modèles, par leur formulation non-paramétrique et parcimonieuse, sont utiles pour les problèmes d’analyse de masses de données lorsque le nombre de classe est indéterminé et augmente avec les données, et lorsque la dimension est grande. Les modèles proposés validés sur des données simulées et des jeux de données réelles standard. Ensuite, ils sont appliqués sur un problème réel diﬃcile de structuration automatique de données bioacoustiques complexes issues de signaux de chant de baleine. Enﬁn, nous ouvrons des perspectives Markoviennes via les processus de Dirichlet hiérarchiques pour les modèles Markov cachés. / This thesis focuses on statistical learning and multi-dimensional data analysis. It particularly focuses on unsupervised learning of generative models for model-based clustering. We study the Gaussians mixture models, in the context of maximum likelihood estimation via the EM algorithm, as well as in the Bayesian estimation context by maximum a posteriori via Markov Chain Monte Carlo (MCMC) sampling techniques. We mainly consider the parsimonious mixture models which are based on a spectral decomposition of the covariance matrix and provide a ﬂexible framework particularly for the analysis of high-dimensional data. Then, we investigate non-parametric Bayesian mixtures which are based on general ﬂexible processes such as the Dirichlet process and the Chinese Restaurant Process. This non-parametric model formulation is relevant for both learning the model, as well for dealing with the issue of model selection. We propose new Bayesian non-parametric parsimonious mixtures and derive a MCMC sampling technique where the mixture model and the number of mixture components are simultaneously learned from the data. The selection of the model structure is performed by using Bayes Factors. These models, by their non-parametric and sparse formulation, are useful for the analysis of large data sets when the number of classes is undetermined and increases with the data, and when the dimension is high. The models are validated on simulated data and standard real data sets. Then, they are applied to a real diﬃcult problem of automatic structuring of complex bioacoustic data issued from whale song signals. Finally, we open Markovian perspectives via hierarchical Dirichlet processes hidden Markov models. Apprentissage non-supervisé Modèles de mélange Mélanges parcimonieux Unsupervised learning Mixture models Parsimonious mixtures Bayesian non-parametric learning
125	Nonlinear Dimensionality Reduction with Side Information Ghodsi Boushehri, Ali January 2006 (has links) In this thesis, I look at three problems with important applications in data processing. Incorporating side information, provided by the user or derived from data, is a main theme of each of these problems. <br /><br /> This thesis makes a number of contributions. The first is a technique for combining different embedding objectives, which is then exploited to incorporate side information expressed in terms of transformation invariants known to hold in the data. It also introduces two different ways of incorporating transformation invariants in order to make new similarity measures. Two algorithms are proposed which learn metrics based on different types of side information. These learned metrics can then be used in subsequent embedding methods. Finally, it introduces a manifold learning algorithm that is useful when applied to sequential decision problems. In this case we are given action labels in addition to data points. Actions in the manifold learned by this algorithm have meaningful representations in that they are represented as simple transformations. Mathematics Statistics Computer Science Artificial intelligence Machine learning Dimensionality reduction Manifold learning Unsupervised learning High dimensional data
126	Nonlinear Dimensionality Reduction with Side Information Ghodsi Boushehri, Ali January 2006 (has links) In this thesis, I look at three problems with important applications in data processing. Incorporating side information, provided by the user or derived from data, is a main theme of each of these problems. <br /><br /> This thesis makes a number of contributions. The first is a technique for combining different embedding objectives, which is then exploited to incorporate side information expressed in terms of transformation invariants known to hold in the data. It also introduces two different ways of incorporating transformation invariants in order to make new similarity measures. Two algorithms are proposed which learn metrics based on different types of side information. These learned metrics can then be used in subsequent embedding methods. Finally, it introduces a manifold learning algorithm that is useful when applied to sequential decision problems. In this case we are given action labels in addition to data points. Actions in the manifold learned by this algorithm have meaningful representations in that they are represented as simple transformations. Mathematics Statistics Computer Science Artificial intelligence Machine learning Dimensionality reduction Manifold learning Unsupervised learning High dimensional data
127	Separation and Analysis of Multichannel Signals Parry, Robert Mitchell 09 October 2007 (has links) Music recordings contain the mixed contribution of multiple overlapping instruments. In order to better understand the music, it would be beneficial to understand each instrument independently. This thesis focuses on separating the individual instrument recordings within a song. In particular, we propose novel algorithms for separating instrument recordings given only their mixture. When the number of source signals does not exceed the number of mixture signals, we focus on a subclass of source separation algorithms based on joint diagonalization. Each approach leverages a different form of source structure. We introduce repetitive structure as an alternative that leverages unique repetition patterns in music and compare its performance against the other techniques. When the number of source signals exceeds the number of mixtures (i.e. the underdetermined problem), we focus on spectrogram factorization techniques for source separation. We extend single-channel techniques to utilize the additional spatial information in multichannel recordings, and use phase information to improve the estimation of the underlying components. Source separation Independent component analysis Audio processing Unsupervised learning Time-frequency rerpesentations Music Sound Recording and reproducing Source separation (Signal processing)
128	GPR Method for the Detection and Characterization of Fractures and Karst Features: Polarimetry, Attribute Extraction, Inverse Modeling and Data Mining Techniques Sassen, Douglas Spencer 2009 December 1900 (has links) The presence of fractures, joints and karst features within rock strongly influence the hydraulic and mechanical behavior of a rock mass, and there is a strong desire to characterize these features in a noninvasive manner, such as by using ground penetrating radar (GPR). These features can alter the incident waveform and polarization of the GPR signal depending on the aperture, fill and orientation of the features. The GPR methods developed here focus on changes in waveform, polarization or texture that can improve the detection and discrimination of these features within rock bodies. These new methods are utilized to better understand the interaction of an invasive shrub, Juniperus ashei, with subsurface flow conduits at an ecohydrologic experimentation plot situated on the limestone of the Edwards Aquifer, central Texas. First, a coherency algorithm is developed for polarimetric GPR that uses the largest eigenvalue of a scattering matrix in the calculation of coherence. This coherency is sensitive to waveshape and unbiased by the polarization of the GPR antennas, and it shows improvement over scalar coherency in detection of possible conduits in the plot data. Second, a method is described for full-waveform inversion of transmission data to quantitatively determine fracture aperture and electromagnetic properties of the fill, based on a thin-layer model. This inversion method is validated on synthetic data, and the results from field data at the experimentation plot show consistency with the reflection data. Finally, growing hierarchical self-organizing maps (GHSOM) are applied to the GPR data to discover new patterns indicative of subsurface features, without representative examples. The GHSOMs are able to distinguish patterns indicating soil filled cavities within the limestone. Using these methods, locations of soil filled cavities and the dominant flow conduits were indentified. This information helps to reconcile previous hydrologic experiments conducted at the site. Additionally, the GPR and hydrologic experiments suggests that Juniperus ashei significantly impacts infiltration by redirecting flow towards its roots occupying conduits and soil bodies within the rock. This research demonstrates that GPR provides a noninvasive tool that can improve future subsurface experimentation. Ground Penetrating Radar GPR Hydrogeology ecohydrology unsupervised learning inverse modeling image enhancement data mining coherency fractures karst
129	Higher-Ordered Feedback Architectures : a Comparison Jason, Henrik January 2002 (has links) <p>This dissertation aim is to investigate the application of higher-ordered feedback architectures, as a control system for an autonomous robot, on delayed response task problems in the area of evolutionary robotics. For the two architectures of interest a theoretical and practical experiment study is conducted to elaborate how these architectures cope with the road-sign problem, and extended versions of the same. In the theoretical study conducted in this dissertation focus is on the features of the architectures, how they behave and act in different kinds of road-sign problem environments in earlier work. Based on this study two problem environments are chosen for practical experiments. The two experiments that are tested are the three-way and multiple stimuli road-sign problems. Both architectures seams to be cope with the three-way road-sign problem. Although, both architectures are shown to have difficulties solving the multiple stimuli road-sign problem with the current experimental setting used.</p><p>This work leads to two insights in the way these architectures cope with and behave in the three-way road-sign problem environment and delayed response tasks. The robot seams to learn to explicitly relate its actions to the different stimuli settings that it is exposed to. Firstly, both architectures forms higher abstracted representations of the inputs from the environment. These representations are used to guide the robots actions in the environment in those situations were the raw input not was enough to do the correct actions. Secondly, it seams to be enough to have two internal representations of stimuli setting and offloading some stimuli settings, relying on the raw input from the environment, to solve the three-way road-sign problem.</p><p>The dissertation works as an overview for new researchers on the area and also as take-off for the direction to which further investigations should be conducted of using higher-ordered feedback architectures.</p> Evolutionary Robotics Higher-ordered feedback architectures Virtual modulation Self-organization Unsupervised Learning Computer and systems science Data- och systemvetenskap
130	Unsupervised discovery of activity primitives from multivariate sensor data Minnen, David 08 July 2008 (has links) This research addresses the problem of temporal pattern discovery in real-valued, multivariate sensor data. Several algorithms were developed, and subsequent evaluation demonstrates that they can efficiently and accurately discover unknown recurring patterns in time series data taken from many different domains. Different data representations and motif models were investigated in order to design an algorithm with an improved balance between run-time and detection accuracy. The different data representations are used to quickly filter large data sets in order to detect potential patterns that form the basis of a more detailed analysis. The representations include global discretization, which can be efficiently analyzed using a suffix tree, local discretization with a corresponding random projection algorithm for locating similar pairs of subsequences, and a density-based detection method that operates on the original, real-valued data. In addition, a new variation of the multivariate motif discovery problem is proposed in which each pattern may span only a subset of the input features. An algorithm that can efficiently discover such "subdimensional" patterns was developed and evaluated. The discovery algorithms are evaluated by measuring the detection accuracy of discovered patterns relative to a set of expected patterns for each data set. The data sets used for evaluation are drawn from a variety of domains including speech, on-body inertial sensors, music, American Sign Language video, and GPS tracks. Activity discovery Temporal pattern discovery Data mining Unsupervised learning Sensor analysis Detectors Algorithms Pattern recognition systems Time-series analysis Probabilities

Search results