
An Iterative Feature Perturbation Method for Gene Selection from Microarray Data

Canul Reich, Juana 11 June 2010 (has links)
Gene expression microarray datasets often consist of a limited number of samples relative to a large number of expression measurements, usually on the order of thousands of genes. These characteristics pose a challenge to any classification model, as they might negatively impact its prediction accuracy. Therefore, dimensionality reduction is a core process prior to any classification task. This dissertation introduces the iterative feature perturbation method (IFP), an embedded gene selector that iteratively discards non-relevant features. IFP considers as relevant those features which, after perturbation with noise, cause a change in the predictive accuracy of the classification model; non-relevant features cause no such change. We apply IFP to 4 cancer microarray datasets: colon cancer (cancer vs. normal), leukemia (subtype classification), Moffitt colon cancer (prognosis predictor) and lung cancer (prognosis predictor). We compare results obtained by IFP to those of SVM-RFE and the t-test, using a linear support vector machine as the classifier in all cases. We do so using the original entire set of features in the datasets, and using a preselected set of 200 features (based on p values) from each dataset. When using the entire set of features, the IFP approach results in comparable accuracy (and higher at some points) with respect to SVM-RFE on 3 of the 4 datasets. The simple t-test feature ranking typically produces classifiers with the highest accuracy across the 4 datasets. When using 200 features chosen by the t-test, the accuracy results show up to 3% performance improvement for both IFP and SVM-RFE across the 4 datasets. We corroborate these results with an AUC analysis and a statistical analysis using the Friedman/Holm test. Similar to the application of the t-test, we used the methods information gain and reliefF as filters and compared all three.
Results of the AUC analysis show that IFP and SVM-RFE obtain the highest AUC value when applied on the t-test-filtered datasets. This result is additionally corroborated with statistical analysis. The percentage of overlap between the gene sets selected by any two methods across the four datasets indicates that different sets of genes can and do result in similar accuracies. We created ensembles of classifiers using the bagging technique with IFP, SVM-RFE and the t-test, and showed that their performance can be at least equivalent to those of the non-bagging cases, as well as better in some cases.
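The perturbation criterion described above can be sketched in a few lines. This is a minimal illustration, not the dissertation's implementation: it assumes Gaussian noise and 3-fold cross-validated accuracy as the relevance signal, and the names (`ifp_rank`, `noise_scale`) are ours.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def ifp_rank(X, y, noise_scale=1.0, seed=0):
    """Rank features by iteratively discarding the one whose perturbation
    with Gaussian noise changes cross-validated accuracy the least."""
    rng = np.random.default_rng(seed)
    remaining = list(range(X.shape[1]))
    discarded = []
    while len(remaining) > 1:
        base = cross_val_score(LinearSVC(), X[:, remaining], y, cv=3).mean()
        changes = []
        for j in range(len(remaining)):
            Xp = X[:, remaining].copy()
            Xp[:, j] = Xp[:, j] + rng.normal(0.0, noise_scale, size=len(Xp))
            acc = cross_val_score(LinearSVC(), Xp, y, cv=3).mean()
            changes.append(abs(base - acc))   # small change => likely irrelevant
        discarded.append(remaining.pop(int(np.argmin(changes))))
    return remaining + discarded[::-1]        # most relevant feature first
```

On real microarray data (thousands of genes), this inner retraining loop is what makes IFP an embedded rather than a filter method, and also what dominates its cost.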

HYBRID FEATURE SELECTION IN NETWORK INTRUSION DETECTION USING DECISION TREE

Chenxi Xiong (9028061) 27 June 2020 (has links)
The intrusion detection system has been widely studied and deployed by researchers to provide better security for computer networks. The increasing attack volume and the dramatic advancement of machine learning have made the cooperation between intrusion detection systems and machine learning a hot topic and a promising solution for cybersecurity. Machine learning usually involves a training process over a huge amount of sample data, and huge input data may negatively affect the training and detection performance of the machine learning model. Feature selection therefore becomes a crucial technique to rule out irrelevant and redundant features from the dataset. This study applied a feature selection approach that combines advanced feature selection algorithms with attack-characteristic features to produce the optimal feature subset for the machine learning model in network intrusion detection. The optimal feature subset was created using the CSE-CIC-IDS2018 dataset, the most up-to-date benchmark dataset with comprehensive attack diversity and features. The results of the experiment were produced using machine learning models with a decision tree classifier and analyzed with respect to accuracy, precision, recall, and F1 score.
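The evaluation pipeline described above, filter-style feature selection feeding a decision-tree classifier scored on accuracy, precision, recall, and F1, can be sketched as follows. A synthetic dataset stands in for CSE-CIC-IDS2018 (which must be obtained separately), and the filter choice (`f_classif`, k=10) is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the intrusion-detection data.
X, y = make_classification(n_samples=500, n_features=30, n_informative=8,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

model = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),        # filter step: keep 10 features
    ("tree", DecisionTreeClassifier(random_state=42)),
])
model.fit(X_tr, y_tr)
pred = model.predict(X_te)
scores = {
    "accuracy": accuracy_score(y_te, pred),
    "precision": precision_score(y_te, pred),
    "recall": recall_score(y_te, pred),
    "f1": f1_score(y_te, pred),
}
```

Wrapping selection and classification in one `Pipeline` keeps the filter fitted on training folds only, which avoids leaking test information into the selected subset.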

Strojové učení na malých datových množinách s velkým počtem atributů / Machine learning on small datasets with large number of features

Beran, Jakub January 2020 (has links)
Machine learning models are difficult to employ in biology-related research. On the one hand, the availability of features increases as we can obtain gene expressions and other omics information. On the other hand, the number of available observations is still low due to the high costs associated with obtaining the data for a single subject. In this work, we therefore focus on the set of problems where the number of observations is smaller than the number of features. We analyse different combinations of feature selection and classification models and study which combinations work best. To assess these model combinations, we introduce two simulation studies and several real-world datasets. We conclude that most classification models benefit from feature pre-selection using feature selection models. Also, we define model-based thresholds for the number of observations above which we observe increased feature selection stability and quality. Finally, we identify a relation between feature selection False Discovery Rate and stability expressed in terms of the Jaccard index.
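Selection stability in terms of the Jaccard index, as used above, is straightforward to compute: a common recipe is the average pairwise Jaccard similarity between the feature sets selected on resampled versions of the data. A minimal sketch (function names are ours):

```python
import numpy as np

def jaccard(a, b):
    """Jaccard similarity between two sets of selected feature indices."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

def selection_stability(selections):
    """Average pairwise Jaccard index over feature sets selected on resampled data."""
    pairs = [(i, j) for i in range(len(selections))
             for j in range(i + 1, len(selections))]
    return float(np.mean([jaccard(selections[i], selections[j]) for i, j in pairs]))
```

Feeding in the selected sets from, say, bootstrap replicates gives a single stability score in [0, 1], where 1 means the selector always picks the same features.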

Imbalanced Data Classification with the K-Closest Resemblance Classifier for Remote Sensing and Social Media Texts

Duan, Cheng 10 November 2020 (has links)
Data imbalance has been a challenge in many areas of automatic classification. Many popular approaches, including over-sampling, under-sampling, and the Synthetic Minority Oversampling Technique (SMOTE), have been developed and tested in previous research. A big problem with these techniques is that they try to solve the problem by modifying the original data rather than truly overcoming the imbalance and letting the classifiers learn. The imbalanced data challenge also exists in tasks in areas like remote sensing and depression detection. Researchers have made efforts to overcome the challenge by adopting methods at the data pre-processing step. However, in remote sensing and depression detection tasks, the main interest is still in applying new classifiers such as deep learning models, which have powerful classification ability but still do not consider data imbalance a prime factor in lower classification performance. In this thesis, we demonstrate the performance of the K-Closest Resemblance (K-CR) classifier in evaluation experiments on an urban land cover classification dataset and on two depression detection datasets. The latter two datasets consist of social media texts (tweets); we therefore propose to adopt a feature selection technique, Term Frequency - Category-Based Term Weights (TF-CBTW), and various word embedding techniques (Word2Vec, FastText, GloVe, and the language model BERT). This feature selection method was not applied before in similar settings, and we show that it helps to improve the efficiency and the results of the K-CR classifier. Our three experiments show that K-CR can achieve comparable performance on the majority classes and better performance on minority classes when compared to other classifiers such as Random Forest, K-Nearest Neighbour, Support Vector Machines, Multi-layer Perceptron, Convolutional Neural Networks, and Long Short-Term Memory.
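The key intuition above, scoring each class by its closest resembling training examples so that majority-class size does not dominate the decision, can be illustrated with a toy rule. This is one plausible reading of the K-CR idea and is not the thesis's implementation: cosine similarity, the per-class top-k averaging, and the name `kcr_predict` are all our assumptions.

```python
import numpy as np

def kcr_predict(X_train, y_train, X_test, k=3):
    """Illustrative per-class resemblance rule: for each test point, average the
    cosine similarity to the k closest training examples of each class, then
    predict the class with the highest score (imbalance-insensitive by design)."""
    classes = np.unique(y_train)
    preds = []
    for x in X_test:
        scores = []
        for c in classes:
            Xc = X_train[y_train == c]
            sims = Xc @ x / (np.linalg.norm(Xc, axis=1) * np.linalg.norm(x) + 1e-12)
            top = np.sort(sims)[-min(k, len(sims)):]   # k closest resemblances
            scores.append(top.mean())
        preds.append(classes[int(np.argmax(scores))])
    return np.array(preds)
```

Because each class contributes a fixed number (k) of neighbours regardless of its size, a 10:1 majority class gains no extra voting weight, which is the contrast with plain K-Nearest Neighbour drawn in the abstract.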

Class discovery via feature selection in unsupervised settings

Curtis, Jessica 13 February 2016 (has links)
Identifying genes linked to the appearance of certain types of cancers and their phenotypes is a well-known and challenging problem in bioinformatics. Discovering marker genes which, upon genetic mutation, drive the proliferation of different types and subtypes of cancer is critical for the development of advanced tests and therapies that will specifically identify, target, and treat certain cancers. Therefore, it is crucial to find methods that are successful in recovering "cancer-critical genes" from the (usually much larger) set of all genes in the human genome. We approach this problem in the statistical context as a feature (or variable) selection problem for clustering, in the case where the number of important features is typically small (or rare) and the signal of each important feature is typically minimal (or weak). Genetic datasets typically consist of hundreds of samples (n), each with tens of thousands of gene-level measurements (p), resulting in the well-known statistical "large p, small n" problem. The class or cluster identification is based on the clinical information associated with the type or subtype of the cancer (either known or unknown) for each individual. We discuss and develop novel feature ranking methods, which complement and build upon current methods in the field. These ranking methods are used to select features which contain the most significant information for clustering. Retaining only a small set of useful features based on this ranking aids in both a reduction in data dimensionality and the identification of a set of genes that are crucial in understanding cancer subtypes. In this work, we present an outline of cutting-edge feature selection methods and provide a detailed explanation of our own contributions to the field. We explain both the practical properties and theoretical advantages of the new tools that we have developed.
Additionally, we explore a well-developed case study applying these new feature selection methods to different levels of genetic data to explore their practical implementation within the field of bioinformatics.
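The rank-then-cluster pipeline described above can be sketched with a crude stand-in for the thesis's ranking statistics: score each feature by how bimodal it looks (here, negative excess kurtosis), keep the top-ranked features, then cluster on them. Only the shape of the pipeline is faithful; the scoring function, `rank_and_cluster`, and its parameters are illustrative assumptions.

```python
import numpy as np
from scipy.stats import kurtosis
from sklearn.cluster import KMeans

def rank_and_cluster(X, n_keep=2, n_clusters=2, seed=0):
    """Rank features by a crude bimodality score, keep the best, then cluster."""
    scores = -kurtosis(X, axis=0)   # well-separated bimodal => negative excess kurtosis
    keep = np.argsort(scores)[::-1][:n_keep]
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return keep, km.fit_predict(X[:, keep])
```

In the rare/weak regime the abstract targets, the whole difficulty lies in making the ranking statistic sensitive enough that the few weak informative features beat the noise features; this sketch only works when the signal is strong.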

Learning with Attributed Networks: Algorithms and Applications

January 2019 (has links)
Attributes, which delineate the properties of data, and connections, which describe the dependencies of data, are two essential components for characterizing most real-world phenomena. The synergy between these two principal elements renders a unique data representation: the attributed network. In many cases, people are inundated with vast amounts of data that can be structured into attributed networks, and their use has been attractive to researchers and practitioners in different disciplines. For example, in social media, users interact with each other and also post personalized content; in scientific collaboration, researchers cooperate and are distinguished from peers by their unique research interests; in complex disease studies, rich gene expression data complement the gene-regulatory networks. Clearly, attributed networks are ubiquitous and form a critical component of modern information infrastructure. Gaining deep insights from such networks requires a fundamental understanding of their unique characteristics and an awareness of the related computational challenges. My dissertation research aims to develop a suite of novel learning algorithms to understand, characterize, and gain actionable insights from attributed networks, to benefit high-impact real-world applications. In the first part of this dissertation, I mainly focus on developing learning algorithms for attributed networks in a static environment at two different levels: (i) the attribute level, by designing feature selection algorithms to find high-quality features that are tightly correlated with the network topology; and (ii) the node level, by presenting network embedding algorithms that learn discriminative node embeddings by preserving node proximity w.r.t. network topology structure and node attribute similarity.
As changes are essential components of attributed networks and the results of learning algorithms become stale over time, in the second part of this dissertation I propose a family of online algorithms for attributed networks in a dynamic environment that continuously update the learning results on the fly. Developing application-aware learning algorithms is also more desirable with a clear understanding of the application domains and their unique intents. As such, in the third part of this dissertation, I am committed to advancing real-world applications on attributed networks by incorporating the objectives of external tasks into the learning process. / Dissertation/Thesis / Doctoral Dissertation Computer Science 2019
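The attribute-level goal above, finding features tightly correlated with the network topology, can be illustrated with a Laplacian-score-style criterion: a feature is a good fit to the topology when it varies little across edges. This is a standard surrogate we chose for illustration, not the dissertation's algorithm; the function name and normalization are our assumptions.

```python
import numpy as np

def laplacian_smoothness(A, X):
    """Per-feature smoothness x^T L x over adjacency matrix A, normalized by
    feature variance; lower values mean the feature agrees with the topology."""
    L = np.diag(A.sum(axis=1)) - A           # unnormalized graph Laplacian
    Xc = X - X.mean(axis=0)                  # center each feature
    num = np.einsum("ij,ik,kj->j", Xc, L, Xc)
    return num / (Xc.var(axis=0) * X.shape[0] + 1e-12)
```

Ranking features by this score and keeping the smoothest ones is the simplest version of topology-aware attribute selection; features constant within densely connected communities score (near) zero.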

Feature selection through visualisation for the classification of online reviews

Koka, Keerthika 17 April 2017 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / The purpose of this work is to prove that visualization is at least as powerful as the best automatic feature selection algorithms. This is achieved by applying our visualization technique to the classification of online reviews into fake and genuine reviews. Our technique uses radial charts and color overlaps to explore the best feature selection through visualization for classification. Every review is treated as a radial translucent red or blue membrane, with its dimensions determining the shape of the membrane. This work also shows how dimension ordering and combination are relevant to the feature selection process. In brief, the whole idea is to give a structure to each text review based on certain attributes, compare how different or similar the structures of the different or same categories are, and highlight the key features that contribute most to the classification. Colors and saturations aid in the feature selection process. Our visualization technique helps the user gain insights into high-dimensional data by providing means to eliminate the worst features right away, pick some of the best features without statistical aids, and understand the behavior of the dimensions in different combinations.
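The radial encoding described above maps each review's feature vector onto evenly spaced axes and closes the resulting shape into a polygon; the translucent red/blue class membranes would then be drawn by overlaying these polygons. A minimal geometric sketch (the function name and axis convention are ours):

```python
import numpy as np

def radar_polygon(values):
    """Vertices (x, y) of the closed polygon a feature vector traces on a
    radial chart, with one evenly spaced axis per feature."""
    v = np.asarray(values, dtype=float)
    angles = np.linspace(0.0, 2.0 * np.pi, len(v), endpoint=False)
    x, y = v * np.cos(angles), v * np.sin(angles)
    # Repeat the first vertex to close the polygon.
    return np.column_stack([np.append(x, x[0]), np.append(y, y[0])])
```

Rendering then reduces to filling these polygons with translucent class colors (e.g., with matplotlib's `fill`), so that overlap density, which drives the feature selection insight in the abstract, becomes visible.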

Computational Methods for Analyzing Protein Complexes and Protein-Protein Interactions / タンパク質複合体および相互作用の情報解析手法

Ruan, Peiying 23 March 2015 (has links)
京都大学 / 0048 / 新制・課程博士 / 博士(情報学) / 甲第19109号 / 情博第555号 / 新制||情||98(附属図書館) / 32060 / 京都大学大学院情報学研究科知能情報学専攻 / (主査)教授 阿久津 達也, 教授 山本 章博, 教授 鹿島 久嗣 / 学位規則第4条第1項該当 / Doctor of Informatics / Kyoto University / DFAM

Identification of Discriminating Motifs in Heart Rate Time Series Data of Soccer Players

Ravindranathan, Sampurna January 2018 (has links)
No description available.

Innovations of random forests for longitudinal data

Wonkye, Yaa Tawiah 07 August 2019 (has links)
No description available.
