51.
Novel Methods of Biomarker Discovery and Predictive Modeling using Random Forest (January 2017)
Abstract: Random forest (RF) is a popular and powerful machine learning technique. It can be used for classification, regression, and unsupervised clustering. In its original form, introduced by Leo Breiman, RF is used as a predictive model to generate predictions for new observations. Recent research has proposed several RF-based methods for feature selection and for generating prediction intervals, but they are limited in their applicability and accuracy. In this dissertation, RF is applied to build a predictive model for a complex dataset and used as the basis for two novel methods, one for biomarker discovery and one for generating prediction intervals.
First, a biodosimetry model was developed using RF to determine absorbed radiation dose from gene expression measured in blood samples of potentially exposed individuals. To improve the prediction accuracy of the biodosimetry, day-specific models were built to deal with the day interaction effect, and a nested modeling technique was proposed. The nested models can fit this complex data, which exhibits large variability and non-linear relationships.
Second, a panel of biomarkers was selected using a data-driven feature selection method as well as by hand-picking, considering prior knowledge and other constraints. To incorporate domain knowledge, a method called Know-GRRF was developed based on guided regularized RF. This method incorporates domain knowledge as a penalty term to regulate the selection of candidate features in RF, adding flexibility to data-driven feature selection and improving model interpretability. Know-GRRF showed significant improvement in cross-species prediction when cross-species correlation was used to guide the selection of biomarkers. The method can also compete with existing methods when intrinsic data characteristics are used as an alternative to domain knowledge in simulated datasets.
Lastly, a novel non-parametric method, RFerr, was developed to generate prediction intervals using RF regression. This method is widely applicable to any predictive model and was shown to have better coverage and precision than existing methods on the real-world radiation dataset, as well as on benchmark and simulated datasets. / Dissertation/Thesis / Doctoral Dissertation Biomedical Informatics 2017
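As a rough illustration of prediction intervals from RF regression (not the RFerr algorithm itself, whose details are not given in this abstract), one common approach derives interval bounds from quantiles of the out-of-bag residuals; the sketch below assumes scikit-learn and synthetic data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Hypothetical illustration: prediction interval = point prediction plus
# quantiles of out-of-bag residuals. This is a generic approach, NOT the
# RFerr algorithm described in the dissertation.
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)

# Out-of-bag residuals approximate out-of-sample prediction error.
oob_resid = y - rf.oob_prediction_
lo, hi = np.quantile(oob_resid, [0.025, 0.975])  # 95% error quantiles

point = rf.predict(X[:5])
intervals = np.column_stack([point + lo, point + hi])
print(intervals)
```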
52.
On Feature Selection Stability: A Data Perspective (January 2013)
Abstract: The rapid growth of high-throughput technologies over the last few decades has made manual processing of the generated data impracticable; even worse, machine learning and data mining techniques can seem paralyzed by these massive datasets. High dimensionality is one of the most common challenges for machine learning and data mining tasks. Feature selection aims to reduce dimensionality by selecting a small subset of the features that performs at least as well as the full feature set. Generally, learning performance (e.g., classification accuracy) and algorithm complexity are used to measure the quality of a feature selection algorithm. Recently, the stability of feature selection algorithms has gained increasing attention as a new indicator, reflecting the need to select similar subsets of features each time the algorithm is run on the same dataset, even in the presence of a small amount of perturbation. To address the selection stability issue, we must first understand the causes of instability. In this dissertation, we investigate the causes of instability in high-dimensional datasets using well-known feature selection algorithms and find that stability is mostly data-dependent. Based on these findings, we propose a framework to improve selection stability by addressing its main causes. In particular, data noise greatly impacts both stability and learning performance, so we propose to reduce it in order to improve both. However, current noise reduction approaches cannot distinguish between data noise and variation among samples from different classes; we overcome this limitation with Supervised noise reduction via Low-Rank Matrix Approximation (SLRMA). The proposed framework proved successful on different types of high-dimensional datasets, such as microarray and image datasets. However, it cannot handle unlabeled data; hence, we propose Local SVD to overcome this limitation. / Dissertation/Thesis / Ph.D. Computer Science 2013
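Selection stability of the kind studied here is often quantified by how similar the selected subsets are across perturbed versions of the same data. A minimal sketch, assuming scikit-learn and a simple univariate filter, measures the average pairwise Jaccard similarity across bootstrap resamples; it illustrates the stability indicator, not the SLRMA or Local SVD methods.

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Estimate selection stability as the average pairwise Jaccard similarity of
# feature subsets selected on bootstrap resamples of the same dataset.
X, y = make_classification(n_samples=200, n_features=500, n_informative=10,
                           random_state=0)

rng = np.random.default_rng(0)
subsets = []
for _ in range(20):
    idx = rng.choice(len(y), size=len(y), replace=True)        # perturbed sample
    selector = SelectKBest(f_classif, k=20).fit(X[idx], y[idx])
    subsets.append(set(np.flatnonzero(selector.get_support())))

jaccard = [len(a & b) / len(a | b) for a, b in combinations(subsets, 2)]
print(f"mean pairwise Jaccard stability: {np.mean(jaccard):.3f}")
```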
53.
Enhanced Contour Description for People Detection in Images. Du, Xiaoyun, January 2014
People detection is an attractive topic in computer vision, with many useful applications in daily life, for instance intelligent surveillance and driver assistance systems. It is a challenging problem, as people adopt a wide range of poses, wear diverse clothing, and appear against different kinds of backgrounds with significant changes in illumination. In this thesis, some advanced techniques and powerful tools are presented in order to design a robust people detection system. First, a baseline model is implemented by combining the Histogram of Oriented Gradients descriptor and a linear Support Vector Machine; this baseline obtains good performance on the well-known INRIA dataset. Second, an advanced model with a two-layer cascade framework is proposed that achieves both accurate detection and lower computational complexity. In the first layer, the baseline model is used as a filter to generate candidates; in this procedure, most positive samples survive and the majority of negative samples are rejected according to a preset threshold. The second layer uses a more discriminative model: the Variational Local Binary Patterns descriptor and the Histogram of Oriented Gradients descriptor are combined into a new discriminative feature, and multi-scale feature descriptors are used to improve the discriminative power of the Variational Local Binary Patterns feature. Feature selection is then performed using the Feature Generating Machine to obtain a concise descriptor from this concatenated feature, and a Histogram Intersection Kernel Support Vector Machine is employed as an efficient classifier. The bootstrapping algorithm is used in the training procedure to exploit the information in the dataset. Finally, our approach achieves good performance on the INRIA dataset, with results superior to the baseline model.
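A minimal sketch of the baseline stage described above (HOG descriptor plus a linear SVM), assuming scikit-image and scikit-learn; random arrays stand in for the cropped detection windows of the INRIA dataset.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

# Sketch of the HOG + linear SVM baseline; random arrays stand in for the
# grayscale 128x64 pedestrian / background crops of the INRIA dataset.
rng = np.random.default_rng(0)
windows = rng.random((200, 128, 64))           # detection windows (rows x cols)
labels = rng.integers(0, 2, size=200)          # 1 = person, 0 = background

features = np.array([
    hog(w, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    for w in windows
])

clf = LinearSVC(C=0.01).fit(features, labels)
print("training accuracy:", clf.score(features, labels))
```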
54.
Embedded Feature Selection for Model-based Clustering (January 2020)
Abstract: Model-based clustering is a sub-field of statistical modeling and machine learning. Mixture models use probabilities to describe the degree to which a data point belongs to each cluster, and these probabilities are updated iteratively during clustering. While mixture models have demonstrated superior performance in handling noisy data in many fields, challenges remain for high-dimensional datasets. Among a large number of features, some may not actually contribute to delineating the cluster profiles; including these "noisy" features confuses the model's identification of the real cluster structure and increases computational time. Recognizing this issue, in this dissertation I propose a new feature selection algorithm for continuous data first and then extend it to mixed data types. Finally, I conduct uncertainty quantification of the feature selection results as the third topic.
The first topic is an embedded feature selection algorithm termed the Expectation-Selection-Maximization (ESM) model, which can automatically select features while optimizing the parameters of a Gaussian mixture model. I introduce a relevancy index (RI) that reveals the contribution of each feature to the clustering process to assist feature selection. I demonstrate the efficacy of ESM on two synthetic datasets, four benchmark datasets, and an Alzheimer's disease dataset.
The second topic focuses on extending the ESM algorithm to handle mixed data types. The Gaussian mixture model is generalized to the Generalized Model of Mixture (GMoM), which can handle not only continuous features but also binary and nominal features.
The last topic is uncertainty quantification (UQ) of the feature selection. A new algorithm termed ESOM is proposed, which takes variance information into consideration while conducting feature selection. A set of outliers is also generated during the feature selection process to infer the uncertainty in the input data. Finally, the selected features and detected outlier instances are evaluated by visual comparison. / Dissertation/Thesis / Doctoral Dissertation Industrial Engineering 2020
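The abstract does not define the relevancy index, so the following is only a hypothetical sketch of the general idea: fit a Gaussian mixture and rank each feature by how strongly its cluster-conditional means separate relative to its within-cluster spread (assumes scikit-learn; the score is an invented stand-in, not the ESM/ESOM algorithms or their RI).

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Hypothetical relevance score per feature: between-cluster variance of the
# feature's soft cluster means relative to its pooled within-cluster variance.
# This only illustrates ranking features during model-based clustering; it is
# not the ESM/ESOM algorithm or its relevancy index (RI).
X_info, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)
X_noise = np.random.default_rng(0).normal(size=(300, 6))   # uninformative features
X = np.hstack([X_info, X_noise])

gm = GaussianMixture(n_components=3, random_state=0).fit(X)
resp = gm.predict_proba(X)                                   # soft assignments

weights = resp.sum(axis=0)
means = (resp.T @ X) / weights[:, None]                      # per-cluster feature means
overall = X.mean(axis=0)
between = (weights[:, None] * (means - overall) ** 2).sum(axis=0)
within = (resp[:, :, None] * (X[:, None, :] - means) ** 2).sum(axis=(0, 1))
relevance = between / within
print(np.round(relevance, 3))   # the first four (informative) features score higher
```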
55.
A Moving-window penalization method and its applications. Bao, Minli, 01 August 2017
Genome-wide association studies (GWAS) have played an important role in identifying genetic variants underlying human complex traits. However, their success is hindered by weak effects at causal variants and noise at non-causal variants. Penalized regression can be applied to handle GWAS problems, but GWAS data have particular characteristics: consecutive genetic markers are usually highly correlated due to linkage disequilibrium.
This thesis introduces a moving-window penalized method for GWAS that smooths the effects of consecutive SNPs. Simulation studies indicate that this moving-window penalized method provides improved true positive findings. Its practical utility is demonstrated by applying it to the Genetic Analysis Workshop 16 Rheumatoid Arthritis data.
Next, the moving-window penalty is applied to the generalized linear model. We call this approach smoothed lasso (SLasso). Coordinate descent algorithms are presented in detail for both quadratic and logistic loss, and asymptotic properties are discussed. Based on SLasso, we then discuss a two-stage method called MW-Ridge. Simulation results show that while SLasso can provide more true positive findings than Lasso, it has the side effect of including more unrelated random noise. MW-Ridge eliminates this side effect and results in high true positive rates and low false detection rates. Applicability to real data is illustrated using the GAW 16 Rheumatoid Arthritis data.
The SLasso and MW-Ridge approaches are then generalized to multivariate response data, which can be transformed into univariate response data. The causal variants are not required to be the same for different response variables. We found that no matter how the causal variants are matched, whether fully matched or 60% matched, MW-Ridge consistently outperforms Lasso by detecting all true positives with lower false detection rates.
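The exact SLasso and MW-Ridge estimators are not specified in this abstract; the sketch below shows one plausible form of a "smoothed lasso": squared-error loss with an L1 penalty plus a quadratic penalty on differences of consecutive coefficients, solved by proximal gradient descent. The penalty form, step size, and tuning values are assumptions for illustration only.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def smoothed_lasso(X, y, lam_l1=0.1, lam_smooth=1.0, n_iter=2000):
    """Proximal-gradient sketch for
        0.5 * ||y - X b||^2 + lam_l1 * ||b||_1 + lam_smooth * ||D b||^2,
    where D differences consecutive coefficients (adjacent SNPs).
    A generic 'smoothed lasso' illustration, not the exact SLasso / MW-Ridge
    estimators proposed in the thesis."""
    n, p = X.shape
    D = np.diff(np.eye(p), axis=0)                    # (p-1) x p difference matrix
    lr = 1.0 / (np.linalg.norm(X, 2) ** 2 + 8 * lam_smooth)  # step from Lipschitz bound
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) + 2 * lam_smooth * (D.T @ (D @ b))
        b = soft_threshold(b - lr * grad, lr * lam_l1)
    return b

# Toy example: consecutive "SNPs" 10-14 carry a smooth block of true effects.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
beta_true = np.zeros(50)
beta_true[10:15] = [1.0, 2.0, 3.0, 2.0, 1.0]
y = X @ beta_true + rng.normal(scale=0.5, size=100)
print(np.round(smoothed_lasso(X, y)[8:17], 2))
```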
56.
A Comparative Study of Feature Selection and Classification Methods for Gene Expression Data. Abusamra, Heba, 05 1900
Microarray technology has enriched the study of gene expression in such a way that scientists can now measure the expression levels of thousands of genes in a single experiment. Microarray gene expression data have gained great importance in recent years due to their role in disease diagnosis and prognosis, which helps in choosing the appropriate treatment plan for patients. Although this technology has ushered in a new era of molecular classification, interpreting gene expression data remains a difficult problem and an active research area due to its inherent "high dimension, low sample size" nature. Such problems pose great challenges to existing classification methods. Thus, effective feature selection techniques are often needed to help correctly classify different tumor types, leading to a better understanding of genetic signatures as well as improved treatment strategies.
This thesis presents a comparative study of state-of-the-art feature selection methods, classification methods, and combinations of the two, based on gene expression data. We compared the efficiency of three classification methods, support vector machines, k-nearest neighbor, and random forest, and eight feature selection methods: information gain, twoing rule, sum minority, max minority, Gini index, sum of variances, t-statistics, and one-dimensional support vector machine. Five-fold cross-validation was used to evaluate classification performance. Two publicly available gene expression datasets of glioma were used for this study.
Different experiments were conducted to compare the performance of the classification methods with and without feature selection. The results reveal the important role of feature selection in classifying gene expression data: by performing feature selection, classification accuracy can be significantly boosted using a small number of genes. The relationships among the features selected by the different methods are investigated, and the features most frequently selected in each fold across all methods are evaluated for both datasets.
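A minimal sketch of this kind of comparison using scikit-learn, with mutual information standing in for the filter criteria listed above and feature selection nested inside each cross-validation fold; the specific classifiers, selector, and synthetic dataset are illustrative, not the thesis's exact setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in for a "high dimension, low sample size" expression matrix.
X, y = make_classification(n_samples=80, n_features=2000, n_informative=30,
                           random_state=0)

classifiers = {
    "SVM": SVC(kernel="linear"),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

for name, clf in classifiers.items():
    # Selection sits inside the pipeline so each CV fold selects its own genes.
    pipe = make_pipeline(SelectKBest(mutual_info_classif, k=50), clf)
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```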
57.
Self-Learning Prediction System for Optimisation of Workload Management in a Mainframe Operating System. Bensch, Michael; Brugger, Dominik; Rosenstiel, Wolfgang; Bogdan, Martin; Spruth, Wilhelm. 06 November 2018
We present a framework for extraction and prediction of online workload data from the workload manager of a mainframe operating system. To boost overall system performance, the prediction will be incorporated into the workload manager so that preventive action can be taken before a bottleneck develops. Model and feature selection automatically create a prediction model based on given training data, thereby keeping the system flexible. We tailor data extraction, preprocessing, and training to this specific task, keeping in mind the nonstationarity of business processes. Using error measures suited to our task, we show that our approach is promising. To conclude, we discuss our first results and give an outlook on future work.
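As a hypothetical sketch of the prediction step, the example below builds lagged workload features and validates a simple regressor with time-ordered splits, which respects the nonstationarity concern mentioned above; the lag count, model choice, and synthetic series are assumptions, not the framework's actual design.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Hypothetical sketch: predict the next workload value from lagged observations,
# validated with time-ordered splits (no shuffling, respecting nonstationarity).
rng = np.random.default_rng(0)
t = np.arange(2000)
workload = 50 + 10 * np.sin(2 * np.pi * t / 288) + rng.normal(scale=2.0, size=t.size)

n_lags = 12
X = np.column_stack([workload[i:-(n_lags - i)] for i in range(n_lags)])
y = workload[n_lags:]

scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=TimeSeriesSplit(n_splits=5),
                         scoring="neg_mean_absolute_error")
print("MAE per split:", np.round(-scores, 2))
```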
58.
Feature Selection and Analysis for Standard Machine Learning Classification of Audio Beehive Samples. Gupta, Chelsi, 01 August 2019
Beekeepers need to inspect their hives regularly in order to protect them from various stressors, but manual inspection requires a lot of time and effort. Hence, many researchers have started using electronic beehive monitoring (EBM) systems to collect critical information from beehives and alert beekeepers of possible threats to the hive. EBM collects information by placing multiple sensors in the hive, which gather data in the form of video, audio, or temperature readings.
This thesis involves the automatic classification of audio samples from a beehive into bee buzzing, cricket chirping, and ambient noise using machine learning models. Classifying samples into these three categories will help beekeepers determine the health of their hives by analyzing the sound patterns in a typical audio sample. Abnormalities in the classification pattern over a period of time can alert beekeepers to potential risks to the hives, such as attack by foreign bodies (Varroa mites or wing virus), climate changes, and other stressors.
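A minimal sketch of such a three-class audio classifier, assuming MFCC summary statistics as features (an assumption; the thesis's actual feature set may differ) and using librosa plus scikit-learn, with synthetic clips standing in for real beehive recordings.

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical sketch: summarize each clip with MFCC statistics and train a
# standard classifier on the three classes (bee buzzing, cricket chirping,
# ambient noise). MFCCs are an assumed feature choice, and synthetic noise
# stands in for real beehive recordings.
sr = 22050
rng = np.random.default_rng(0)
clips = [rng.normal(size=sr * 2).astype(np.float32) for _ in range(30)]  # 2 s clips
labels = rng.integers(0, 3, size=len(clips))     # 0 = bee, 1 = cricket, 2 = ambient

def clip_features(signal, sr):
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

X = np.array([clip_features(c, sr) for c in clips])
scores = cross_val_score(RandomForestClassifier(random_state=0), X, labels, cv=3)
print("cross-validated accuracy:", scores.mean())
```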
59.
The Impact of Cost on Feature Selection for Classifiers. McCrae, Richard Clyde, 01 January 2018
Supervised machine learning models are increasingly being used for medical diagnosis. The diagnostic problem is formulated as a binary classification task in which trained classifiers make predictions based on a set of input features. In diagnosis, these features are typically procedures or tests with associated costs, and the cost of applying a trained classifier may be estimated as the total cost of obtaining values for the features that serve as its inputs. Obtaining classifiers based on a low-cost set of input features with acceptable classification accuracy is of interest to practitioners and researchers. What makes this problem even more challenging is that the costs associated with features vary with patients and service providers and change over time.
This dissertation aims to address this problem by proposing a method for obtaining low cost classifiers that meet specified accuracy requirements under dynamically changing costs. Given a set of relevant input features and accuracy requirements, the goal is to identify all qualifying classifiers based on subsets of the feature set. Then, for any arbitrary costs associated with the features, the cost of the classifiers may be computed and candidate classifiers selected based on cost-accuracy tradeoff. Since the number of relevant input features k tends to be large for typical diagnosis problems, training and testing classifiers based on all 2^k-1 possible non-empty subsets of features is computationally prohibitive. Under the reasonable assumption that the accuracy of a classifier is no lower than that of any classifier based on a subset of its input features, this dissertation aims to develop an efficient method to identify all qualifying classifiers.
This study used two types of classifiers, artificial neural networks and classification trees, both of which have proved promising for numerous problems documented in the literature. The approach was first to measure the accuracy obtained with the classifiers when all features were used. Then, reduced accuracy thresholds were arbitrarily established that could be satisfied by subsets of the complete feature set. Threshold values for three measures (true positive rate, true negative rate, and overall classification accuracy) were considered for the classifiers. Two cost functions were used for the features: one used unit costs and the other random costs. Additional manipulation of costs was also performed.
The order in which features were removed was found to have a material impact on the effort required: removing the most important features first was most efficient, while removing the least important features first was least efficient. The accuracy and cost measures were combined to produce a Pareto-optimal frontier, which consistently contained few elements; at most 15 subsets were on the frontier even when there were hundreds of thousands of acceptable feature sets. Most of the computational time is spent training and testing the models. Given costs, models on the Pareto-optimal frontier can be efficiently identified and presented to decision makers. Both the neural networks and the decision trees performed comparably, suggesting that any classifier could be employed.
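The frontier construction described above can be illustrated with a small utility that, given (cost, accuracy) pairs for the qualifying feature-subset classifiers, keeps only the non-dominated ones; this is a generic sketch, not the dissertation's exact procedure.

```python
def pareto_frontier(candidates):
    """Return the (cost, accuracy) pairs not dominated by any other candidate:
    a model is dropped if some other model is at least as cheap and at least as
    accurate, with a strict improvement in one of the two. An illustrative
    utility, not the dissertation's exact procedure."""
    frontier = []
    for i, (cost_i, acc_i) in enumerate(candidates):
        dominated = any(
            cost_j <= cost_i and acc_j >= acc_i and (cost_j < cost_i or acc_j > acc_i)
            for j, (cost_j, acc_j) in enumerate(candidates) if j != i
        )
        if not dominated:
            frontier.append((cost_i, acc_i))
    return sorted(set(frontier))

# Example: (total feature cost, accuracy) for several qualifying feature subsets.
models = [(10, 0.90), (4, 0.85), (4, 0.88), (7, 0.88), (2, 0.80), (12, 0.91)]
print(pareto_frontier(models))  # -> [(2, 0.8), (4, 0.88), (10, 0.9), (12, 0.91)]
```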
60.
Binary Classification With First Phase Feature Selection for Gene Expression Survival Data. Loveless, Ian, 28 August 2019
No description available.