Global ETD Search

161	Hypothesis testing and feature selection in semi-supervised data Sechidis, Konstantinos January 2015 A characteristic of most real-world problems is that collecting unlabelled examples is easier and cheaper than collecting labelled ones. As a result, learning from partially labelled data is a crucial and demanding area of machine learning, and extending techniques from fully to partially supervised scenarios is a challenging problem. Our work focuses on two types of partially labelled data that can occur in binary problems: semi-supervised data, where the labelled set contains both positive and negative examples, and positive-unlabelled data, a more restricted version of partial supervision where the labelled set consists of only positive examples. In both settings, it is very important to explore a large number of features in order to derive useful and interpretable information about our classification task, and to select a subset of features that contains most of the useful information. In this thesis, we address three fundamental and tightly coupled questions concerning feature selection in partially labelled data; all three relate to the highly controversial issue of when additional unlabelled data improves performance in partially labelled learning environments and when it does not. The first question is: what are the properties of statistical hypothesis testing in such data? Second, given the widespread criticism of significance testing, what can we do in terms of effect size estimation, that is, quantifying how strong the dependency is between a feature X and the partially observed label Y? Finally, in the context of feature selection, how well can features be ranked by estimated measures when the population values are unknown? The answers to these questions provide a comprehensive picture of feature selection in partially labelled data. 
Interesting applications include the estimation of mutual information quantities, structure learning in Bayesian networks, and the investigation of how human-provided prior knowledge can overcome the restrictions of partial labelling. One direct contribution of our work is to enable valid statistical hypothesis testing and estimation in positive-unlabelled data. Focusing on a generalised likelihood ratio test and on estimating mutual information, we provide five key contributions. (1) We prove that assuming all unlabelled examples are negative cases is sufficient for independence testing, but not for power analysis. (2) We suggest a new methodology that compensates for this and enables power analysis, allowing sample size determination for observing an effect with a desired power by incorporating the user's prior knowledge of the prevalence of positive examples. (3) We show a new capability, supervision determination, which can determine a priori the number of labelled examples the user must collect before being able to observe a desired statistical effect. (4) We derive an estimator of the mutual information in positive-unlabelled data, together with its asymptotic distribution. (5) Finally, we show how to rank features with and without prior knowledge. We also derive extensions of these results to semi-supervised data. In another extension, we investigate how our results can be used for Markov blanket discovery in partially labelled data. While there are many different algorithms for deriving the Markov blanket of fully supervised nodes, the partially labelled problem is far more challenging, and there is a lack of principled approaches in the literature. Our work constitutes a generalization of the conditional tests of independence to partially labelled binary target variables, and can handle the two main partially labelled scenarios: positive-unlabelled and semi-supervised. 
The result is a significantly deeper understanding of how to control false negative errors in Markov blanket discovery procedures and of how unlabelled data can help. Finally, we present how our results can be used for information-theoretic feature selection in partially labelled data. Our work naturally extends feature selection criteria suggested for fully supervised data to partially labelled scenarios. These criteria can capture both the relevancy and the redundancy of the features and can be used for semi-supervised and positive-unlabelled data. 519.5
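Contribution (1) above says that treating unlabelled examples as negatives is enough for independence testing. The following is a minimal sketch of that idea, not the thesis's implementation: it assumes a discrete feature and a surrogate label `s` (1 for a labelled positive, 0 for an unlabelled example), uses the plug-in mutual information estimate, and relies on the standard identity G = 2·N·I(X;S) with the information measured in nats. The function names and the threshold constant are hypothetical choices made for illustration.

```python
import math
from collections import Counter

# 5% critical value of the chi-square distribution with 1 degree of freedom
# (appropriate for a binary feature against a binary surrogate label).
CHI2_CRITICAL_1DOF_5PCT = 3.841


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in nats from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def g_statistic(xs, surrogate_labels):
    """G-test statistic computed as 2 * N * I(X;S)."""
    return 2.0 * len(xs) * mutual_information(xs, surrogate_labels)


def looks_dependent(xs, surrogate_labels):
    """Reject independence at the 5% level (binary X and binary S assumed)."""
    return g_statistic(xs, surrogate_labels) > CHI2_CRITICAL_1DOF_5PCT
```

As the abstract notes, this surrogate test addresses independence only; power analysis needs the additional prior knowledge about the prevalence of positives that the thesis describes.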
162	Minimização de funções decomponíveis em curvas em U definidas sobre cadeias de posets -- algoritmos e aplicações / Minimization of decomposable in U-shaped curves functions defined on poset chains -- algorithms and applications Marcelo da Silva Reis 28 November 2012 O problema de seleção de características, no contexto de Reconhecimento de Padrões, consiste na escolha de um subconjunto X de um conjunto S de características, de tal forma que X seja "ótimo" dentro de algum critério. Supondo a escolha de uma função custo c apropriada, o problema de seleção de características é reduzido a um problema de busca que utiliza c para avaliar os subconjuntos de S e assim detectar um subconjunto de características ótimo. Todavia, o problema de seleção de características é NP-difícil. Na literatura existem diversos algoritmos e heurísticas propostos para abordar este problema; porém, quase nenhuma dessas técnicas explora o fato que existem funções custo cujos valores são estimados a partir de uma amostra e que descrevem uma "curva em U" nas cadeias do reticulado Booleano (P(S),<=), um fenômeno bem conhecido em Reconhecimento de Padrões: conforme aumenta-se o número de características consideradas, há uma queda no custo do subconjunto avaliado, até o ponto em que a limitação no número de amostras faz com que seguir adicionando características passe a aumentar o custo, devido ao aumento no erro de estimação. Em 2010, Ris e colegas propuseram um novo algoritmo para resolver esse caso particular do problema de seleção de características, que aproveita o fato de que o espaço de busca pode ser organizado como um reticulado Booleano, assim como a estrutura de curvas em U das cadeias do reticulado, para encontrar um subconjunto ótimo. Neste trabalho estudamos a estrutura do problema de minimização de funções custo cujas cadeias são decomponíveis em curvas em U (problema U-curve), provando que o mesmo é NP-difícil. 
Mostramos que o algoritmo de Ris e colegas possui um erro que o torna de fato sub-ótimo, e propusemos uma versão corrigida e melhorada do mesmo, o algoritmo U-Curve-Search (UCS). Apresentamos também duas variações do algoritmo UCS que controlam o espaço de busca de forma mais sistemática. Introduzimos dois novos algoritmos branch-and-bound para abordar o problema, chamados U-Curve-Branch-and-Bound (UBB) e Poset-Forest-Search (PFS). Para todos os algoritmos apresentados nesta tese, fornecemos análise de complexidade de tempo e, para alguns deles, também prova de corretude. Implementamos todos os algoritmos apresentados utilizando o arcabouço featsel, também desenvolvido neste trabalho; realizamos experimentos ótimos e sub-ótimos com instâncias de dados reais e simulados e analisamos os resultados obtidos. Por fim, propusemos um relaxamento do problema U-curve que modela alguns tipos de projeto de classificadores; também provamos que os algoritmos UCS, UBB e PFS resolvem esta versão generalizada do problema. / The feature selection problem, in the context of Pattern Recognition, consists of choosing a subset X of a set S of features such that X is "optimal" under some criterion. If we assume the choice of a proper cost function c, then the feature selection problem reduces to a search problem, which uses c to evaluate the subsets of S and thereby find an optimal feature subset. However, the feature selection problem is NP-hard. 
Although the literature offers a myriad of algorithms and heuristics to tackle this problem, almost none of those techniques exploits the fact that there are cost functions whose values are estimated from a sample and describe a "U-shaped curve" in the chains of the Boolean lattice (P(S),<=), a well-known phenomenon in Pattern Recognition: as the number of considered features increases, the cost of the evaluated subset falls, up to the point at which the limited number of samples makes adding further features raise the cost, owing to the increase in estimation error. In 2010, Ris et al. proposed a new algorithm to solve this particular case of the feature selection problem: to find an optimal feature subset, their algorithm takes into account the fact that the search space may be organized as a Boolean lattice and that the chains of this lattice describe U-shaped curves. In this work, we studied the structure of the minimization problem for cost functions whose chains are decomposable in U-shaped curves (the U-curve problem), and proved that this problem is NP-hard. We showed that the algorithm introduced by Ris et al. has an error that leads to suboptimal solutions, and proposed a corrected and improved version, the U-Curve-Search (UCS) algorithm. Moreover, to manage the search space in a more systematic way, we also presented two modifications of the UCS algorithm. We introduced two new branch-and-bound algorithms to tackle the U-curve problem, namely U-Curve-Branch-and-Bound (UBB) and Poset-Forest-Search (PFS). For each algorithm presented in this thesis, we provided a time complexity analysis and, for some of them, also a proof of correctness. We implemented each algorithm in the featsel framework, which was also developed in this work; we performed optimal and suboptimal experiments with instances from real and simulated data, and analyzed the results. 
Finally, we proposed a generalization of the U-curve problem that models some kinds of classifier design; we proved the correctness of the UCS, UBB, and PFS algorithms for this generalized version of the U-curve problem. branch-and-bound busca ótima seleção de características U-curve branch-and-bound feature selection optimal search U-curve
163	Algorithms for Accelerating Machine Learning with Wide and Deep Models / Wide・Deepモデルを用いた機械学習を高速化するためのアルゴリズム Ida, Yasutoshi 23 March 2021 Kyoto University / Doctoral degree by coursework (new system) / Doctor of Informatics / 甲第23310号 / 情博第746号 / 新制||情||127 (University Library) / Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University / Chief examiner: Prof. 鹿島久嗣; examiners: Prof. 田中利幸, Prof. 山下信雄 / Conferred under Article 4, Paragraph 1 of the Degree Regulations / DFAM Machine Learning Sparsity-Inducing Norms Deep Learning Feature Selection Efficient Algorithm 007
164	Optimization of Insert-Tray Matching using Machine Learning Hedberg, Karolina January 2021 The manufacturing process of carbide inserts at Sandvik Coromant consists of several operations. During some of these, the inserts are positioned on trays. For some inserts the trays are pre-defined, but for others the insert-tray matching is partly improvised. The goal of this thesis project is to examine whether machine learning can be used to predict which tray to use for a given insert. It also investigates which insert features determine the choice of tray. The study uses insert and tray data from four blasting operations and considers a set of standardized inserts, since the tray matching for these is assumed to be well tuned. The predictions are made with the supervised learning algorithm k-nearest neighbors. Identifying the determining features is treated as a feature selection problem and is done with the ReliefF algorithm. The classification results show that the classifiers are overfitting. The main reason for this is probably that the datasets contain features that together uniquely determine which tray is used. This was not detected during the feature selection, since ReliefF identifies features that are individually relevant to the output. One idea for avoiding overfitting is to exclude these defining features from the dataset. Further work is thus recommended. Machine learning Supervised learning Feature selection Computer and Information Sciences Data- och informationsvetenskap
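The abstract above predicts a tray from insert features with k-nearest neighbors. A minimal sketch of that classifier follows, assuming numeric feature vectors and Euclidean distance; the feature values and tray names are hypothetical and are not taken from the thesis data.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to query.

    train: list of (feature_vector, label) pairs with numeric features.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical inserts described by two made-up numeric features.
training_inserts = [
    ((1.0, 1.0), "tray_A"),
    ((1.2, 0.9), "tray_A"),
    ((0.9, 1.1), "tray_A"),
    ((5.0, 5.0), "tray_B"),
    ((5.2, 4.8), "tray_B"),
    ((4.9, 5.1), "tray_B"),
]
```

If some combination of features identifies the tray uniquely, as the abstract suspects, a nearest-neighbor model can simply memorise that mapping, which is consistent with the overfitting the author reports.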
165	Predikce povahy spamových krátkých textů textovým klasifikátorem / Machine Learning Text Classifier for Short Texts Category Prediction Drápela, Karel January 2018 This thesis deals with the categorization of short spam texts from SMS messages. The first part summarizes current methods for text classification and is followed by a description of several commonly used classifiers. The following chapters describe the test data analysis, the program implementation, and the results. The program is able to predict text categories based on a predefined set of classes and also to estimate classification accuracy on training data. For the two category types that I designed, the classifier reached accuracies of 82% and 92%. Both preprocessing and feature selection had a positive impact on the resulting accuracy. It is possible to improve this accuracy further by removing the portion of samples that are difficult to classify. With 80% recall it is possible to increase accuracy by 8-10%.
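The last two sentences describe trading coverage for accuracy by leaving out samples that are hard to classify. The sketch below shows one way such a trade-off can be measured, under the assumption that "80% recall" means keeping the 80% of samples the classifier is most confident about; that reading, the function name, and the toy predictions are all hypothetical.

```python
def accuracy_with_rejection(predictions, coverage):
    """Accuracy on the most confident fraction of the predictions.

    predictions: list of (confidence, predicted_label, true_label).
    coverage: fraction of samples to keep, between 0 and 1.
    """
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    kept = ranked[: max(1, round(coverage * len(ranked)))]
    correct = sum(1 for _, predicted, actual in kept if predicted == actual)
    return correct / len(kept)


# Toy predictions: the two least confident ones are wrong.
toy_predictions = [
    (0.95, "a", "a"), (0.90, "b", "b"), (0.85, "a", "a"), (0.80, "a", "a"),
    (0.75, "b", "b"), (0.70, "a", "a"), (0.65, "b", "b"), (0.60, "b", "b"),
    (0.55, "a", "b"), (0.50, "b", "a"),
]
```

The gain depends entirely on how well the confidence score separates correct from incorrect predictions; the toy data is constructed so that the separation is perfect.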
166	Klasifikace stupně gliomů v MR datech mozku / Classification of glioma grading in brain MRI Olešová, Kristína January 2020 This thesis deals with the classification of gliomas into high- and low-grade tumours according to their aggressiveness, and with overall survival prediction based on magnetic resonance imaging. The data used in this work come from the BRATS challenge 2019, and each set contains information from 4 MRI weighting sequences. The thesis is implemented in the Python programming language and the Jupyter Notebooks environment. The PyRadiomics software is used to calculate image features. The goal of this work is to determine the best tumour region and weighting sequence for calculating image features and, consequently, to select the set of features that is best for classifying tumour grade and predicting survival. Part of the thesis is dedicated to survival prediction using a set of statistical tests, specifically Cox regression.
167	Novel Data Mining Methods for Virtual Screening of Biological Active Chemical Compounds Soufan, Othman 23 November 2016 Drug discovery is a process that takes many years and hundreds of millions of dollars to reach a confident conclusion about a specific treatment. Part of this sophisticated process is based on preliminary investigations that suggest a set of chemical compounds as candidate drugs for the treatment. Computational resources have been playing a significant role in this part through a step known as virtual screening. From a data mining perspective, the availability of rich data resources is key to training prediction models. Yet the difficulties imposed by the rapid growth of data and its dimensionality are inevitable. In this thesis, I address the main challenges that arise when data mining techniques are used for virtual screening. To achieve efficient virtual screening using data mining, I start by addressing the problem of feature selection and analyse the best ways to describe a chemical compound for enhanced screening performance. High-throughput screening (HTS) assay data used for virtual screening are characterized by a great class imbalance. To handle this class imbalance, I suggest using a novel algorithm called DRAMOTE to narrow down promising candidate chemicals aimed at interaction with specific molecular targets before they are experimentally evaluated. Existing works are mostly proposed for small-scale virtual screening that uses a few thousand interactions. Thus, I propose enabling large-scale (or big) virtual screening by learning from millions of interactions while exploiting any relevant dependency for better accuracy. A novel solution called DRABAL, which incorporates structure learning of a Bayesian network as a step to model dependency between the HTS assays, is shown to achieve significant improvements over existing state-of-the-art approaches. 
high-throughput screening Data Mining virtual screening Feature Selection multilabel learning
168	Výběr příznaků metodou Dynamická vzájemná informace / Feature Selection Based on Dynamic Mutual Information Manga, Marek January 2014 This work analyzes and discusses the implementation of a feature selection method called Dynamic Mutual Information (DMIFS). The original description of DMIFS contains several irregularities, so DMIFS cannot be implemented exactly as the original method. The results of the implemented DMIFS are compared with the results of the original DMIFS. These results show that the implemented DMIFS is similar to the original. The next part of the work describes the design of two new methods based on DMIFS. The first method, called DmRMR, merges mRMR and DMIFS. Several tests showed that DmRMR has better performance but worse stability. The second method, called WDMIFS, is a weighted version of DMIFS based on the AdaBoost algorithm. WDMIFS performs worse than DMIFS. Finally, a manual for implementing DMIFS in RapidMiner and Weka is provided.
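The abstract mentions mRMR as one of the building blocks of DmRMR. As background, here is a minimal sketch of plain greedy mRMR in its difference form (relevance minus mean redundancy) for discrete features. It is not DMIFS, DmRMR, or WDMIFS, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in nats from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def mrmr(features, target, k):
    """Greedy mRMR: pick k features maximising relevance minus redundancy.

    features: dict mapping a feature name to its list of discrete values.
    target: list of class labels, aligned with the feature columns.
    """
    relevance = {
        name: mutual_information(column, target)
        for name, column in features.items()
    }
    selected = []
    while len(selected) < min(k, len(features)):
        best_name, best_score = None, -math.inf
        for name in features:
            if name in selected:
                continue
            redundancy = 0.0
            if selected:
                redundancy = sum(
                    mutual_information(features[name], features[chosen])
                    for chosen in selected
                ) / len(selected)
            score = relevance[name] - redundancy
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
    return selected


# Toy data: f2 duplicates f1, while f3 carries complementary information.
toy_target = [0, 0, 1, 1, 2, 2, 3, 3]
toy_features = {
    "f1": [0, 0, 0, 0, 1, 1, 1, 1],
    "f2": [0, 0, 0, 0, 1, 1, 1, 1],
    "f3": [0, 0, 1, 1, 0, 0, 1, 1],
}
```

On this toy data the redundancy term keeps the duplicate pair f1 and f2 from being selected together, so the complementary feature f3 is always among the two chosen features.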
169	Vliv selekce příznaků metodou HFS na shlukovou analýzu / Effect of HFS Based Feature Selection on Cluster Analysis Malásek, Jan January 2015 This master's thesis focuses on cluster analysis. Clustering has its roots in many areas, including data mining, statistics, biology and machine learning. The aim of this thesis is to review the literature on cluster analysis methods and on methods for determining the number of clusters, and to give a short survey of feature selection methods for unsupervised learning. An important part of this thesis is a software implementation for comparing different cluster analysis methods, focused on finding the optimal number of clusters and assigning data points to the correct classes. The program also includes an implementation of the HFS feature selection method. The experimental validation of the methods was carried out in the Matlab environment. The final part of the thesis compares the success of the clustering methods on data with known output classes and assesses the contribution of the HFS feature selection method for unsupervised learning to the quality of cluster analysis.
170	Sentiment analysis of movie reviews in Chinese Zhang, Jun January 2020 Sentiment analysis aims at figuring out the opinions of users towards a certain service or product. In this research, the aim is to classify the sentiments of users based on the comments they have posted on the Douban movie website. In this thesis, I try two different ways of classifying the sentiments: the first classifies comments into five classes of ratings from 1 to 5, and the second classifies comments into three classes of ratings: negative, neutral and positive. For the latter, ratings of 1 and 2 are grouped as negative, ratings of 3 as neutral, and ratings of 4 and 5 as positive. First, Term Frequency Inverse Document Frequency (TF-IDF) is used as the feature extraction technique for machine learning algorithms. Chi Square and Mutual Information are used for feature selection. The selected features are fed into different machine learning methods: Logistic Regression, Linear SVC, SGD classifier and Multinomial Naive Bayes. The performance of models with feature selection is compared with the performance of models without feature selection for 5-class as well as 3-class classification. In addition, fastText and Skip-Gram are used as embedding methods for the deep learning algorithms LSTM and BiLSTM. fastText is also used both as an embedding method and as a classifier. The aim is to compare different machine learning and deep learning algorithms using different vectorization methods to see which model performs best on both 5-class and 3-class classification. The two classification strategies are also compared with each other in terms of error analysis, in order to identify the similarities and differences in the misclassifications they make. sentiment analysis classification strategies feature selection machine learning embedding deep learning Humanities and the Arts Humaniora och konst
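Two steps in the abstract above are easy to make concrete: the grouping of 1 to 5 ratings into three classes, and TF-IDF weighting. The sketch below assumes already tokenised documents and uses one common unsmoothed TF-IDF variant (term frequency divided by document length, times the natural log of N over document frequency); library implementations often use smoothed variants, so the exact weights in the thesis may differ. The function names are hypothetical.

```python
import math
from collections import Counter


def group_rating(stars):
    """Map a 1-5 rating to the three-class scheme: 1-2 negative, 3 neutral,
    4-5 positive."""
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs: list of token lists. Weight = (count / doc length) * ln(N / df),
    where df is the number of documents containing the term.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

A term that occurs in every document gets weight zero under this variant, which is why very common words contribute nothing to the feature vector.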

Search results