21

Application of Hyper-geometric Hypothesis-based Quantification and Markov Blanket Feature Selection Methods to Generate Signals for Adverse Drug Reaction Detection

Zhang, Yi January 2012
No description available.
22

Effective Linear-Time Feature Selection

Pradhananga, Nripendra January 2007
The classification learning task requires selection of a subset of features to represent the patterns to be classified, because both the performance of the classifier and the cost of classification are sensitive to the choice of features used to construct it. Exhaustive search is impractical since it examines every possible combination of features. Heuristic and random searches run faster, but the problem persists for high-dimensional datasets. We investigate a heuristic, forward, wrapper-based approach, called Linear Sequential Selection, which limits the search space at each iteration of the feature selection process. We then introduce randomization into the search space, yielding a second algorithm, Randomized Linear Sequential Selection. Our experiments demonstrate that both methods are faster, find smaller subsets, and can even increase classification accuracy. We also explore ensemble learning, proposing two ensemble creation methods, Feature Selection Ensemble and Random Feature Ensemble, both of which apply a feature selection algorithm to create the individual classifiers of the ensemble. Our experiments show that both methods work well with high-dimensional data.
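For orientation, the sketch below shows the textbook wrapper approach that Linear Sequential Selection builds on: plain sequential forward selection, in which the feature whose addition most improves cross-validated accuracy is added at each iteration. This is not the thesis's algorithm; the classifier, scoring and stopping rule are illustrative assumptions.

```python
# Minimal sketch of wrapper-based sequential forward selection: at each
# iteration, add the single feature whose inclusion best improves the
# cross-validated accuracy of the wrapped classifier.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_select(X, y, max_features):
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    clf = KNeighborsClassifier()
    while remaining and len(selected) < max_features:
        scores = [(np.mean(cross_val_score(clf, X[:, selected + [j]], y, cv=5)), j)
                  for j in remaining]
        score, j = max(scores)
        if score <= best_score:          # stop when no candidate helps
            break
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return selected, best_score
```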
23

Developing integrated data fusion algorithms for a portable cargo screening detection system

Ayodeji, Akiwowo January 2012
Towards a one-size-fits-all solution to cocaine detection at borders, this thesis proposes a systematic cocaine detection methodology that can use raw data output from a fibre optic sensor to produce a set of unique features whose decisions can be combined into a reliable output. This multidisciplinary research makes use of real data sourced from a cocaine-analyte-detecting fibre optic sensor developed by one of the collaborators, City University London. The research advocates a two-step approach. In the first step, the raw sensor data are collected and stored; level-one fusion, i.e. analysis, pre-processing and feature extraction, is performed at this stage. In the second step, using experimentally pre-determined thresholds, each feature decides on the detection of cocaine or otherwise, with a corresponding posterior probability. High-level sensor fusion is then performed locally on this output to combine the decisions and their probabilities at each time interval. The output of every time interval is stored in the database and used as prior data for the next interval. The final output is a decision on the detection of cocaine. The key contributions of this thesis include investigating the use of data fusion techniques as a solution for overcoming challenges in the real-time detection of cocaine using fibre optic sensor technology, together with an innovative user interface design. A generalizable sensor fusion architecture is suggested and implemented using the Bayesian and Dempster-Shafer techniques, and the results from the implemented experiments show great promise for this architecture, especially in overcoming sensor limitations. A 5-fold cross-validation system using a 12-13-1 neural network was used to validate the feature selection process, yielding true positive and false alarm rates of 89.5% and 10.5% respectively, with a correlation coefficient of 0.8. Using the Bayesian technique it is possible to achieve 100% detection, whilst the Dempster-Shafer technique achieves 95% detection using the same features as inputs to the data fusion system.
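A minimal sketch of the Bayesian decision-fusion step described above follows. The structure is assumed from the abstract rather than taken from the thesis: each feature reports a posterior probability of cocaine presence, the probabilities are fused under an independence assumption, and the fused posterior is carried forward as the prior for the next time interval.

```python
# Naive-Bayes style fusion of per-feature detection probabilities.
# p_prior: probability of cocaine presence carried over from the previous
# time interval; p_features: each feature's posterior for "cocaine present".
def fuse_bayesian(p_prior, p_features):
    odds = p_prior / (1.0 - p_prior)
    for p in p_features:
        p = min(max(p, 1e-6), 1 - 1e-6)   # guard against exact 0/1 inputs
        odds *= p / (1.0 - p)             # independence assumption
    return odds / (1.0 + odds)

# The fused output of one interval is stored and reused as the next prior.
p = 0.5
for interval in [[0.8, 0.7, 0.55], [0.9, 0.65, 0.6]]:
    p = fuse_bayesian(p, interval)
print(f"posterior after two intervals: {p:.3f}")
```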
24

The influence of human factors on users' preferences of web-based applications: a data mining approach

Clewley, Natalie Christine January 2010
As the Web is fast becoming an integral feature of daily life, designers face the challenge of designing Web-based applications for an increasingly diverse user group. To develop applications that successfully meet the needs of this user group, designers have to understand the influence of human factors upon users' needs and preferences. To address this issue, this thesis presents an investigation that analyses the influence of three human factors, namely cognitive style, prior knowledge and gender differences, on users' preferences for Web-based applications. In particular, two applications are studied: Web search tools and Web-based instruction tools. Previous research has suggested a number of relationships between these three human factors, so this thesis was driven by three research questions. Firstly, to what extent are the two cognitive style dimensions of Witkin's Field Dependence/Independence and Pask's Holism/Serialism similar? Secondly, to what extent do computer experts share the preferences of Internet experts, and computer novices those of Internet novices? Finally, to what extent are Field Independent users, experts and males alike, and Field Dependent users, novices and females alike? As traditional statistical analysis methods would struggle to capture such relationships effectively, this thesis proposes an integrated data mining approach that combines feature selection and decision trees to capture users' preferences. From this, a framework is developed that integrates the combined effect of the three human factors and can be used to inform system designers. The findings suggest, firstly, that there are links between these three human factors. In terms of cognitive style, the relationship between Field Dependent users and Holists is clearer than that between Field Independent users and Serialists. In terms of prior knowledge, although there is a link between computer experience and Internet experience, computer experts were shown to have preferences similar to those of Internet novices. Considering all three human factors together, the links between cognitive style and gender, and between cognitive style and system experience, were found to be stronger than the relationship between system experience and gender. This work contributes both theory and methodology to multiple academic communities, including human-computer interaction, information retrieval and data mining. In terms of theory, it deepens understanding of the effects of single and multiple human factors on users' preferences for Web-based applications. In terms of methodology, it proposes an integrated data mining analysis approach and shows that it is able to capture users' preferences.
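In very reduced form, the integrated approach described above pairs a feature-selection step with a decision tree whose branches read off as preference rules. The sketch below is a hedged illustration with invented data and column semantics, not the thesis's actual pipeline.

```python
# Sketch: select the most informative human-factor features, then fit a
# decision tree whose branches expose human-readable preference rules.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
# Invented data: columns = [field_dependence, computer_expertise, gender, ...]
X = rng.integers(0, 2, size=(200, 5))
y = rng.integers(0, 2, size=200)          # preferred tool A vs. tool B

selector = SelectKBest(mutual_info_classif, k=3).fit(X, y)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(selector.transform(X), y)
print(export_text(tree))                  # branches read as preference rules
```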
25

Development of a simple artificial intelligence method to accurately subtype breast cancers based on gene expression barcodes

Esterhuysen, Fanechka Naomi January 2018
Magister Scientiae - MSc / INTRODUCTION: Breast cancer is a highly heterogeneous disease. The complexity of achieving an accurate diagnosis and an effective treatment regimen lies within this heterogeneity. Subtypes of the disease are not simply molecular, i.e. hormone receptor over-expression or absence; the tumour itself is heterogeneous in terms of tissue of origin, metastases, and histopathological variability. Accurate tumour classification vastly improves treatment decisions, patient outcomes and 5-year survival rates. Gene expression studies aided by transcriptomic technologies such as microarrays and next-generation sequencing (e.g. RNA-Sequencing) have advanced oncology researchers' and clinicians' understanding of the complex molecular portraits of malignant breast tumours. Mechanisms governing cancers, including tumorigenesis, gene fusions, gene over-expression and suppression, and cellular process and pathway involvement, have been elucidated through comprehensive analyses of the cancer transcriptome. Over the past 20 years, gene expression signatures discovered with both microarray and RNA-Seq have reached clinical and commercial application through the development of tests such as Mammaprint®, OncotypeDX®, and FoundationOne® CDx, all of which focus on chemotherapy sensitivity, prediction of cancer recurrence, and tumour mutational level. The Gene Expression Barcode (GExB) algorithm was developed to allow easy interpretation and integration of microarray data through data normalization with frozen RMA (fRMA) pre-processing and conversion of relative gene expression to a sequence of 1's and 0's. Unfortunately, the algorithm has not yet been developed for RNA-Seq data; however, implementing the GExB with feature selection would contribute to a robust machine-learning-based breast cancer and subtype classifier. METHODOLOGY: For microarray data, we applied the GExB algorithm to generate barcodes for normal breast and breast tumour samples. A two-class classifier for malignancy was developed through feature selection on barcoded samples, selecting genes with 85% stable absence or presence within a tissue type that were differentially stable between tissues. A multi-class feature selection method was employed to identify genes with variable expression in one subtype but 80% stable absence or presence in all other subtypes, i.e. 80% in n-1 subtypes. For RNA-Seq data, a barcoding method had to be developed that could mimic the GExB algorithm for microarray data; a z-score-to-barcode method was implemented, together with differential gene expression analysis selecting the top 100 genes as informative features for classification. The accuracy and discriminatory capability of both the microarray-based and the RNA-Seq-based gene signatures were assessed through unsupervised and supervised machine-learning algorithms, i.e. K-means and hierarchical clustering, as well as binary and multi-class Support Vector Machine (SVM) implementations. RESULTS: The GExB-FS method for microarray data yielded 85-probe and 346-probe informative sets for the two-class and multi-class classifiers, respectively. The two-class classifier predicted samples as either normal or malignant with 100% accuracy, and the multi-class classifier predicted molecular subtype with 96.5% accuracy with SVM.
Combining RNA-Seq differential expression analysis for feature selection with the z-score-to-barcode method resulted in a two-class classifier for malignancy, and a multi-class classifier distinguishing normal-from-healthy, normal-adjacent-tumour (from cancer patients), and breast tumour samples, each with 100% accuracy. Most notably, a normal-adjacent-tumour gene expression signature emerged that differentiated this tissue from the normal breast tissue of healthy individuals. CONCLUSION: A potentially novel method for microarray and RNA-Seq data transformation, feature selection and classifier development was established. The universal applicability of the microarray signatures and the validity of the z-score-to-barcode method were demonstrated by the 95% accurate classification of RNA-Seq barcoded samples with a microarray-discovered gene expression signature. The results of this comprehensive study into the discovery of robust gene expression signatures hold immense potential for further R&D towards implementation at the clinical endpoint, and for translation to simpler and cost-effective laboratory methods such as qPCR-based tests.
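A z-score-to-barcode transformation of the kind described might look like the following sketch; the per-gene reference means and standard deviations and the z threshold are illustrative assumptions, not values from the thesis.

```python
# Sketch: convert an expression matrix (samples x genes) into a binary
# "barcode" by z-scoring each gene against reference statistics and calling
# a gene expressed (1) when its z-score exceeds a threshold.
import numpy as np

def expression_to_barcode(expr, ref_mean, ref_sd, z_threshold=1.0):
    z = (expr - ref_mean) / ref_sd        # per-gene z-scores
    return (z > z_threshold).astype(np.int8)

expr = np.log2(np.random.default_rng(2).poisson(50, size=(4, 6)) + 1.0)
ref_mean, ref_sd = expr.mean(axis=0), expr.std(axis=0) + 1e-9
print(expression_to_barcode(expr, ref_mean, ref_sd))
```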
26

Predictor Selection in Linear Regression: L1 regularization of a subset of parameters and comparison of L1 regularization and stepwise selection

Hu, Qing 11 May 2007
Background: Feature selection, also known as variable selection, is a technique that selects a subset from a large collection of possible predictors to improve the prediction accuracy of a regression model. The first objective of this project is to investigate for which data structures LASSO outperforms the forward stepwise method. The second objective is to develop a feature selection method, Feature Selection by L1 Regularization of a Subset of Parameters (LRSP), which selects the model by combining prior knowledge of the inclusion of some covariates, if any, with the information collected from the data. Mathematically, LRSP minimizes the residual sum of squares subject to the sum of the absolute values of a subset of the coefficients being less than a constant. In this project, LRSP is compared with LASSO, forward selection, and ordinary least squares to investigate their relative performance on different data structures. Results: Simulation results indicate that for a moderate number of small effects, forward selection outperforms LASSO in both prediction accuracy and variable selection performance when the variance of the model error term is smaller, regardless of the correlations among the covariates; forward selection also performs better at variable selection when the variance of the error term is larger but the correlations among the covariates are smaller. LRSP was shown to be an efficient method for problems where prior knowledge of the inclusion of covariates is available, and it can also be applied to problems with nuisance parameters, such as linear discriminant analysis.
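Written out, the LRSP criterion described verbally above is the constrained least-squares problem

```latex
\hat{\beta}^{\mathrm{LRSP}} \;=\; \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2
\qquad \text{subject to} \qquad \sum_{j \in S} \lvert \beta_j \rvert \;\le\; t,
```

where the notation is assumed here: $S$ indexes the coefficients without prior evidence for inclusion, so covariates known to belong in the model are left unpenalized, and $t$ is the constant bounding the penalized coefficients.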
27

Data Mining Techniques for Prognosis in Pancreatic Cancer

Floyd, Stuart 03 May 2007
This thesis focuses on the use of data mining techniques to investigate the expected survival time of patients with pancreatic cancer. Clinical patient data have been useful in showing overall population trends in patient treatment and outcomes. Models built on patient level data also have the potential to yield insights into the best course of treatment and the long-term outlook for individual patients. Within the medical community, logistic regression has traditionally been chosen for building predictive models in terms of explanatory variables or features. Our research demonstrates that the use of machine learning algorithms for both feature selection and prediction can significantly increase the accuracy of models of patient survival. We have evaluated the use of Artificial Neural Networks, Bayesian Networks, and Support Vector Machines. We have demonstrated (p<0.05) that data mining techniques are capable of improved prognostic predictions of pancreatic cancer patient survival as compared with logistic regression alone.
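A reduced sketch of the kind of comparison reported above, pitting baseline logistic regression against an SVM under cross-validation with a paired significance test, is shown below. The synthetic dataset and model settings are invented stand-ins; the thesis's clinical data and exact models are not reproduced here.

```python
# Sketch: compare cross-validated accuracy of logistic regression vs. an
# SVM and test the per-fold differences with a paired t-test.
import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
logit = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
svm = cross_val_score(SVC(), X, y, cv=10)
t_stat, p_value = ttest_rel(svm, logit)
print(f"logit={logit.mean():.3f}  svm={svm.mean():.3f}  p={p_value:.3f}")
```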
28

OPTIMAL PARAMETER SETTING OF SINGLE AND MULTI-TASK LASSO

Huiting Su 04 January 2019
This thesis considers the problem of feature selection when the number of predictors is larger than the number of samples. The performance of supersaturated designs (SSD) working with the least absolute shrinkage and selection operator (LASSO) is studied in this setting. To achieve higher feature selection correctness, self-voting LASSO is implemented to select the tuning parameter while approximately optimizing the probability of achieving sign correctness. Furthermore, we derive the probability of achieving direction correctness, and extend self-voting LASSO to multi-task self-voting LASSO, which has a group screening effect across multiple tasks.
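For context, "sign correctness" in the LASSO literature is usually formalized as the probability that the estimate recovers the true sign pattern; a standard definition (assumed here, not quoted from the thesis) is

```latex
\mathrm{SC}(\lambda) \;=\; \mathbb{P}\left( \operatorname{sign}\bigl(\hat{\beta}(\lambda)\bigr) = \operatorname{sign}\bigl(\beta^{*}\bigr) \right),
```

where $\beta^{*}$ is the true coefficient vector and $\hat{\beta}(\lambda)$ the LASSO estimate at tuning parameter $\lambda$; self-voting LASSO chooses $\lambda$ so as to approximately maximize this probability.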
29

Segmentation and lesion detection in dermoscopic images

Eltayef, Khalid Ahmad A. January 2017
Malignant melanoma is one of the most fatal forms of skin cancer. It has also become increasingly common, especially among white-skinned people exposed to the sun. Early detection of melanoma is essential to raise survival rates, since the disease is treatable and often curable when caught at an early stage. Working out the dermoscopic clinical features of melanoma (pigment network and lesion borders) is a vital step for dermatologists, who require an accurate method of reaching the correct clinical diagnosis and of ensuring the right area receives the correct treatment. These structures are among the main cues distinguishing melanoma from non-melanoma disease. However, determining these clinical features can be a time-consuming, subjective (even for trained clinicians) and challenging task for several reasons: lesions vary considerably in size and colour; contrast between an affected area and the surrounding healthy skin is low, especially in the early stages; and elements such as hair, reflections, oils and air bubbles are present on almost all images. This thesis aims to provide an accurate, robust and reliable automated dermoscopy image analysis technique to facilitate the early detection of malignant melanoma. In particular, four innovative methods are proposed for region segmentation and classification: two for pigmented region segmentation, one for pigment network detection, and one for lesion classification. For boundary delineation, four pre-processing operations (Gabor filtering, image sharpening, Sobel filtering and image inpainting) are integrated in the segmentation approach to remove unwanted objects (noise) and enhance the appearance of lesion boundaries in the image. Lesion border segmentation is then performed using two alternative approaches. In the first, Fuzzy C-means clustering with a Markov Random Field detects the lesion boundary by iteratively relabelling pixels across clusters. In the second, Particle Swarm Optimization is combined with the Markov Random Field to perform a local search and properly reassign each image pixel to its cluster, achieving greater accuracy. For pigment network detection, the aforementioned pre-processing is applied to remove most of the hair while preserving the image information and increasing the visibility of the pigment network structures; a Gabor filter with connected component analysis is then used to detect the pigment network lines, before several features are extracted and fed to an Artificial Neural Network classifier. In the lesion classification approach, K-means is applied to the segmented lesion to separate it into homogeneous clusters from which important features are extracted; an Artificial Neural Network with Radial Basis Functions is then trained on representative features to classify the given lesion as melanoma or not. The lesion border segmentation methods, Fuzzy C-means with Markov Random Field and the combination of Particle Swarm Optimization with Markov Random Field, achieved average accuracies of 94.00% and 94.74% respectively, while the lesion classification stage, using features extracted from pigment network structures and from segmented lesions, achieved average accuracies of 90.1% and 95.97% respectively. All results were obtained on the public PH2 database of 200 images and compared with existing methods in the literature, demonstrating that the proposed approach is accurate, robust, and efficient in segmenting the lesion boundary as well as in classification.
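The pre-processing stage described above can be pictured with the short OpenCV sketch below: a bank of Gabor filters highlights elongated hair-like structures, the strongest responses form a mask, and inpainting fills the masked pixels. Kernel sizes, thresholds and the percentile cut-off are illustrative assumptions, not the thesis's settings.

```python
# Sketch: Gabor-based hair detection followed by inpainting, in the spirit
# of the pre-processing the abstract describes (not the thesis code).
import cv2
import numpy as np

def remove_hair(bgr_image):
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    # Respond to elongated structures (hairs) at several orientations.
    responses = []
    for theta in np.arange(0, np.pi, np.pi / 8):
        kernel = cv2.getGaborKernel((17, 17), 4.0, theta, 10.0, 0.5)
        responses.append(cv2.filter2D(gray, cv2.CV_32F, kernel))
    hair_response = np.max(responses, axis=0)
    # Keep the strongest 5% of responses as the hair mask, then thicken it.
    mask = (hair_response > np.percentile(hair_response, 95)).astype(np.uint8) * 255
    mask = cv2.dilate(mask, np.ones((3, 3), np.uint8))
    # Fill the masked pixels from their surroundings.
    return cv2.inpaint(bgr_image, mask, 5, cv2.INPAINT_TELEA)
```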
30

Using functional annotation to characterize genome-wide association results

Fisher, Virginia Applegate 11 December 2018
Genome-wide association studies (GWAS) have successfully identified thousands of variants robustly associated with hundreds of complex traits, but the biological mechanisms driving these results remain elusive. Functional annotation, describing the roles of known genes and regulatory elements, provides additional information about associated variants. This dissertation explores the potential of these annotations to explain the biology behind observed GWAS results. The first project develops a random-effects approach to genetic fine mapping of trait-associated loci. Functional annotation and estimates of the enrichment of genetic effects in each annotation category are integrated with linkage disequilibrium (LD) within each locus and GWAS summary statistics to prioritize variants with plausible functionality. Applications of this method to simulated and real data show good performance in a wider range of scenarios relative to previous approaches. The second project focuses on the estimation of enrichment by annotation category. I derive the distribution of GWAS summary statistics as a function of annotations and LD structure and perform maximum likelihood estimation of enrichment coefficients in two simulated scenarios. The resulting estimates are less variable than those of previous methods, but the asymptotic theory of standard errors is often not applicable due to non-convexity of the likelihood function. In the third project, I investigate the problem of selecting an optimal set of tissue-specific annotations with greatest relevance to a trait of interest. I consider three selection criteria defined in terms of the mutual information between functional annotations and GWAS summary statistics. These algorithms correctly identify enriched categories in simulated data, but in an application to a GWAS of BMI the penalty for redundant features outweighs the modest relationships with the outcome, yielding empty selected feature sets, owing to the weaker overall association and the high similarity between tissue-specific regulatory features. All three projects require little in the way of prior hypotheses about the mechanism of genetic effects. Such data-driven approaches have the potential to illuminate unanticipated biological relationships, but they are also limited by the high dimensionality of the data relative to the moderate strength of the signals under investigation. These approaches advance the set of tools available to researchers to draw biological insights from GWAS results.
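As a rough illustration of the third project's selection criterion (not the dissertation's code; the data below are simulated stand-ins), the mutual information between binary functional annotations and per-variant GWAS statistics can be estimated with scikit-learn and used to rank annotation categories:

```python
# Illustrative sketch: rank tissue-specific annotation categories by the
# mutual information between each binary annotation and per-variant GWAS
# summary statistics (e.g. chi-squared values). Inputs are invented.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n_variants, n_annotations = 10_000, 25
annotations = rng.integers(0, 2, size=(n_variants, n_annotations))  # 0/1 marks
chisq = rng.chisquare(df=1, size=n_variants)                        # GWAS stats

# discrete_features=True tells the estimator the annotations are binary.
mi = mutual_info_regression(annotations, chisq, discrete_features=True,
                            random_state=0)
ranked = np.argsort(mi)[::-1]
print("Top annotation columns by estimated MI:", ranked[:5])
```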
