  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
21

The influence of human factors on user's preferences of web-based applications : a data mining approach

Clewley, Natalie Christine January 2010 (has links)
As the Web is fast becoming an integral feature of many of our daily lives, designers are faced with the challenge of designing Web-based applications for an increasingly diverse user group. In order to develop applications that successfully meet the needs of this user group, designers have to understand the influence of human factors upon users' needs and preferences. To address this issue, this thesis presents an investigation into the influence of three human factors, namely cognitive style, prior knowledge and gender differences, on users' preferences for Web-based applications. In particular, two applications are studied: Web search tools and Web-based instruction tools. Previous research has suggested a number of relationships between these three human factors, so this thesis was driven by three research questions. Firstly, to what extent are the two cognitive style dimensions of Witkin's Field Dependence/Independence and Pask's Holism/Serialism similar? Secondly, to what extent do computer experts have the same preferences as Internet experts, and computer novices the same preferences as Internet novices? Finally, to what extent are Field Independent users, experts and males alike, and Field Dependent users, novices and females alike? As traditional statistical analysis methods would struggle to capture such relationships effectively, this thesis proposes an integrated data mining approach that combines feature selection and decision trees to capture users' preferences. From this, a framework is developed that integrates the combined effect of the three human factors and can be used to inform system designers. The findings suggest, firstly, that there are links between these three human factors. In terms of cognitive style, the relationship between Field Dependent users and Holists can be seen more clearly than the relationship between Field Independent users and Serialists.
In terms of prior knowledge, although a link between computer experience and Internet experience is shown, computer experts are shown to have preferences similar to those of Internet novices. In terms of the relationship between all three human factors, the results highlighted that the links between cognitive style and gender, and between cognitive style and system experience, are stronger than the link between system experience and gender. This work contributes both theory and methodology to multiple academic communities, including human-computer interaction, information retrieval and data mining. In terms of theory, it has helped to deepen the understanding of the effects of single and multiple human factors on users' preferences for Web-based applications. In terms of methodology, an integrated data mining analysis approach was proposed and shown to be able to capture users' preferences.
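The combination of feature selection and decision-tree induction used here can be illustrated with an information-gain ranking, the criterion at the heart of both steps. The survey data below is entirely hypothetical (made-up users and preferences, with cognitive style deliberately constructed to be the strongest factor), not data from the thesis:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy reduction in `labels` from splitting on a categorical feature."""
    total, n, cond = entropy(labels), len(labels), 0.0
    for v in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == v]
        cond += len(subset) / n * entropy(subset)
    return total - cond

# Hypothetical survey: rows = users, columns = human factors.
# FD = Field Dependent, FI = Field Independent.
cognitive_style = ["FD", "FI", "FD", "FI", "FD", "FI"]
gender          = ["m",  "m",  "f",  "f",  "m",  "f"]
preference      = ["map", "list", "map", "list", "map", "list"]

features = {"cognitive_style": cognitive_style, "gender": gender}
ranked = sorted(features, reverse=True,
                key=lambda k: information_gain(features[k], preference))
print(ranked[0])  # the factor a decision tree would branch on first
```

A decision-tree learner greedily branches on the feature with the highest gain, which is why ranking factors this way doubles as a feature-selection step.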
22

Development of a simple artificial intelligence method to accurately subtype breast cancers based on gene expression barcodes

Esterhuysen, Fanechka Naomi January 2018 (has links)
Magister Scientiae - MSc / INTRODUCTION: Breast cancer is a highly heterogeneous disease. The complexity of achieving an accurate diagnosis and an effective treatment regimen lies within this heterogeneity. Subtypes of the disease are not simply molecular, i.e. hormone receptor over-expression or absence; the tumour itself is heterogeneous in terms of tissue of origin, metastases, and histopathological variability. Accurate tumour classification vastly improves treatment decisions, patient outcomes and 5-year survival rates. Gene expression studies built on transcriptomic technologies such as microarrays and next-generation sequencing (e.g. RNA-Sequencing) have deepened oncology researchers' and clinicians' understanding of the complex molecular portraits of malignant breast tumours. Mechanisms governing cancers, including tumorigenesis, gene fusions, gene over-expression and suppression, and cellular process and pathway involvement, have been elucidated through comprehensive analyses of the cancer transcriptome. Over the past 20 years, gene expression signatures discovered with both microarray and RNA-Seq have reached clinical and commercial application through the development of tests such as Mammaprint®, OncotypeDX®, and FoundationOne® CDx, all of which focus on chemotherapy sensitivity, prediction of cancer recurrence, and tumour mutational level. The Gene Expression Barcode (GExB) algorithm was developed to allow for easy interpretation and integration of microarray data through data normalization with frozen RMA (fRMA) preprocessing and conversion of relative gene expression to a sequence of 1's and 0's. Unfortunately, the algorithm has not yet been developed for RNA-Seq data. However, implementation of the GExB with feature-selection would contribute to a robust machine-learning based breast cancer and subtype classifier. METHODOLOGY: For microarray data, we applied the GExB algorithm to generate barcodes for normal breast and breast tumour samples.
A two-class classifier for malignancy was developed through feature-selection on barcoded samples, selecting for genes with 85% stable absence or presence within a tissue type and differential stability between tissues. A multi-class feature-selection method was employed to identify genes with variable expression in one subtype but 80% stable absence or presence in all other subtypes, i.e. 80% in n-1 subtypes. For RNA-Seq data, a barcoding method needed to be developed that could mimic the GExB algorithm for microarray data. A z-score-to-barcode method was implemented, together with differential gene expression analysis selecting the top 100 genes as informative features for classification. The accuracy and discriminatory capability of both the microarray-based and the RNA-Seq-based gene signatures were assessed through unsupervised and supervised machine-learning algorithms, i.e. K-means and hierarchical clustering, as well as binary and multi-class Support Vector Machine (SVM) implementations. RESULTS: The GExB-FS method for microarray data yielded an 85-probe and a 346-probe informative set for the two-class and multi-class classifiers, respectively. The two-class classifier predicted samples as either normal or malignant with 100% accuracy, and the multi-class classifier predicted molecular subtype with 96.5% accuracy with SVM. Combining RNA-Seq DE analysis for feature-selection with the z-score-to-barcode method resulted in a two-class classifier for malignancy, and a multi-class classifier for normal (from healthy individuals), normal-adjacent-tumour (from cancer patients), and breast tumour samples, with 100% accuracy. Most notably, a normal-adjacent-tumour gene expression signature emerged that differentiated this tissue from normal breast tissue in healthy individuals. CONCLUSION: A potentially novel method for microarray and RNA-Seq data transformation, feature selection and classifier development was established.
The universal applicability of the microarray signatures and the validity of the z-score-to-barcode method were demonstrated by 95% accurate classification of RNA-Seq barcoded samples with a microarray-discovered gene expression signature. The results of this comprehensive study into the discovery of robust gene expression signatures hold immense potential for further R&D towards implementation at the clinical endpoint, and for translation to simpler and more cost-effective laboratory methods such as qPCR-based tests.
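The z-score-to-barcode idea can be illustrated in a few lines: per-gene z-scores taken across samples are thresholded into 1's (expressed) and 0's (silent). This is a minimal sketch on a toy matrix with an assumed threshold of one standard deviation, not the thesis's actual fRMA-based pipeline:

```python
import numpy as np

def zscore_barcode(expr, threshold=1.0):
    """Convert a genes x samples expression matrix into a binary barcode:
    1 = expressed (per-gene z-score above threshold), 0 = silent."""
    mu = expr.mean(axis=1, keepdims=True)
    sd = expr.std(axis=1, keepdims=True) + 1e-12   # guard zero-variance genes
    z = (expr - mu) / sd
    return (z > threshold).astype(int)

# Toy matrix: 3 genes x 4 samples (illustrative values, not real data)
expr = np.array([[1.0, 1.1, 0.9, 8.0],    # gene spikes in sample 4
                 [5.0, 5.0, 5.0, 5.0],    # flat gene -> never "on"
                 [0.1, 9.0, 0.2, 0.1]])   # gene spikes in sample 2
barcode = zscore_barcode(expr)
print(barcode)
```

Downstream classifiers then operate on these 0/1 vectors rather than on the continuous expression values.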
23

Predictor Selection in Linear Regression: L1 regularization of a subset of parameters and Comparison of L1 regularization and stepwise selection

Hu, Qing 11 May 2007 (has links)
Background: Feature selection, also known as variable selection, is a technique that selects a subset from a large collection of possible predictors to improve the prediction accuracy of a regression model. The first objective of this project is to investigate for what data structures LASSO outperforms the forward stepwise method. The second objective is to develop a feature selection method, Feature Selection by L1 Regularization of a Subset of Parameters (LRSP), which selects the model by combining prior knowledge of the inclusion of some covariates, if any, with the information collected from the data. Mathematically, LRSP minimizes the residual sum of squares subject to the sum of the absolute values of a subset of the coefficients being less than a constant. In this project, LRSP is compared with LASSO, Forward Selection, and Ordinary Least Squares to investigate their relative performance for different data structures. Results: Simulation results indicate that for a moderate number of small-sized effects, forward selection outperforms LASSO in both prediction accuracy and variable selection performance when the variance of the model error term is smaller, regardless of the correlations among the covariates; forward selection also performs better at variable selection when the variance of the error term is larger but the correlations among the covariates are smaller. LRSP was shown to be an efficient method for problems where prior knowledge of the inclusion of covariates is available, and it can also be applied to problems with nuisance parameters, such as linear discriminant analysis.
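The constrained formulation — minimize the residual sum of squares subject to an L1 bound on a subset of the coefficients — can be sketched in its penalized (Lagrangian) form with proximal gradient descent, soft-thresholding only the penalized coordinates. Everything below (data, mask, hyperparameters) is illustrative, not the project's actual settings:

```python
import numpy as np

def lrsp(X, y, penalized, lam=0.5, lr=0.01, n_iter=5000):
    """Proximal-gradient sketch of L1 Regularization of a Subset of
    Parameters: only coordinates marked True in `penalized` are
    soft-thresholded; covariates kept by prior knowledge go unpenalized."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n     # least-squares gradient
        beta -= lr * grad
        m = penalized                        # boolean mask over coefficients
        beta[m] = np.sign(beta[m]) * np.maximum(np.abs(beta[m]) - lr * lam, 0.0)
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, 1.5, 0.0, 0.0]) + 0.1 * rng.normal(size=200)
# prior knowledge says always keep features 0 and 1; penalize only 2 and 3
beta = lrsp(X, y, penalized=np.array([False, False, True, True]))
print(np.round(beta, 2))
```

The penalized coefficients of the irrelevant features are driven to exactly zero, while the protected coefficients are estimated without shrinkage.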
24

Data Mining Techniques for Prognosis in Pancreatic Cancer

Floyd, Stuart 03 May 2007 (has links)
This thesis focuses on the use of data mining techniques to investigate the expected survival time of patients with pancreatic cancer. Clinical patient data have been useful in showing overall population trends in patient treatment and outcomes. Models built on patient level data also have the potential to yield insights into the best course of treatment and the long-term outlook for individual patients. Within the medical community, logistic regression has traditionally been chosen for building predictive models in terms of explanatory variables or features. Our research demonstrates that the use of machine learning algorithms for both feature selection and prediction can significantly increase the accuracy of models of patient survival. We have evaluated the use of Artificial Neural Networks, Bayesian Networks, and Support Vector Machines. We have demonstrated (p<0.05) that data mining techniques are capable of improved prognostic predictions of pancreatic cancer patient survival as compared with logistic regression alone.
25

OPTIMAL PARAMETER SETTING OF SINGLE AND MULTI-TASK LASSO

Huiting Su (5930882) 04 January 2019 (has links)
This thesis considers the problem of feature selection when the number of predictors is larger than the number of samples. The performance of supersaturated designs (SSD) working with the least absolute shrinkage and selection operator (LASSO) is studied in this setting. In order to achieve higher feature selection correctness, self-voting LASSO is implemented to select the tuning parameter while approximately optimizing the probability of achieving Sign Correctness. Furthermore, we derive the probability of achieving Direction Correctness, and extend self-voting LASSO to multi-task self-voting LASSO, which has a group screening effect across multiple tasks.
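Under an orthogonal design the LASSO reduces to soft-thresholding the per-coefficient estimates, which makes the probability of Sign Correctness easy to estimate by Monte Carlo. The sketch below is only a simplified illustration of why the tuning parameter matters (too small admits noise variables, too large kills true effects); it is not the thesis's self-voting procedure:

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding operator, the LASSO solution under orthogonality."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sign_correctness_prob(beta, sigma, lam, n_rep=2000, seed=1):
    """Monte Carlo estimate of P(the LASSO estimate recovers sign(beta))
    when the estimate is beta + Gaussian noise, soft-thresholded at lam."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_rep):
        est = soft(beta + sigma * rng.normal(size=beta.size), lam)
        hits += np.array_equal(np.sign(est), np.sign(beta))
    return hits / n_rep

beta = np.array([3.0, -2.0, 0.0, 0.0])    # two true effects, two nulls
for lam in (0.1, 1.0, 2.5):               # under-, well-, and over-tuned
    print(lam, sign_correctness_prob(beta, sigma=0.5, lam=lam))
```

The middle tuning value dominates both extremes, which is the trade-off a data-driven tuning rule such as self-voting has to navigate.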
26

Segmentation and lesion detection in dermoscopic images

Eltayef, Khalid Ahmad A. January 2017 (has links)
Malignant melanoma is one of the most fatal forms of skin cancer. It has also become increasingly common, especially among white-skinned people exposed to the sun. Early detection of melanoma is essential to raise survival rates, since the disease is treatable and often curable when caught at an early stage. Working out the dermoscopic clinical features (pigment network and lesion borders) of melanoma is a vital step for dermatologists, who require an accurate method of reaching the correct clinical diagnosis and of ensuring the right area receives the correct treatment. These structures are among the main indicators of melanoma versus non-melanoma disease. However, determining these clinical features can be a time-consuming, subjective (even for trained clinicians) and challenging task for several reasons: lesions vary considerably in size and colour; contrast between an affected area and the surrounding healthy skin is low, especially in early stages; and several elements such as hair, reflections, oils and air bubbles are present on almost all images. This thesis aims to provide an accurate, robust and reliable automated dermoscopy image analysis technique to facilitate the early detection of malignant melanoma. In particular, four innovative methods are proposed for region segmentation and classification: two for pigmented region segmentation, one for pigment network detection, and one for lesion classification. In terms of boundary delineation, four pre-processing operations, comprising a Gabor filter, image sharpening, a Sobel filter and image inpainting, are integrated into the segmentation approach to remove unwanted objects (noise) and enhance the appearance of the lesion boundaries in the image. The lesion border segmentation is performed using two alternative approaches. In the first, Fuzzy C-means combined with a Markov Random Field detects the lesion boundary by iteratively relabeling pixels across clusters.
In the second, Particle Swarm Optimization is combined with the Markov Random Field to perform a local search and properly reassign each image pixel to its cluster, achieving greater accuracy for the same aim. With respect to pigment network detection, the aforementioned pre-processing is applied in order to remove most of the hair while preserving the image information and increasing the visibility of the pigment network structures. A Gabor filter with connected component analysis is then used to detect the pigment network lines, before several features are extracted and fed to an Artificial Neural Network as the classifier algorithm. In the lesion classification approach, K-means is applied to the segmented lesion to separate it into homogeneous clusters, from which important features are extracted; then, an Artificial Neural Network with Radial Basis Functions is trained on representative features to classify the given lesion as melanoma or not. The lesion border segmentation methods, Fuzzy C-means with Markov Random Field and the combination of Particle Swarm Optimization and Markov Random Field, achieved strong experimental results, with average accuracies of 94.00% and 94.74%, respectively. The lesion classification stage, using features extracted from pigment network structures and segmented lesions, achieved average accuracies of 90.1% and 95.97%, respectively. The results for the entire experiment were obtained using the public PH2 database, comprising 200 images, and were compared with existing methods in the literature, demonstrating that our proposed approach is accurate, robust, and efficient in segmenting the lesion boundary, in addition to its classification.
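The first segmentation step, Fuzzy C-means, assigns each pixel a degree of membership in every cluster rather than a hard label. A minimal one-dimensional sketch on toy grey-level values (the Markov Random Field refinement and all image pre-processing are omitted):

```python
import numpy as np

def fuzzy_cmeans(X, c=2, m=2.0, n_iter=100, seed=0):
    """Fuzzy C-means: alternate between weighted cluster centers and
    soft membership updates. Returns (centers, membership matrix U)."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)          # memberships sum to 1 per point
    for _ in range(n_iter):
        W = U ** m                             # fuzzified memberships
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)))         # closer center -> higher degree
        U /= U.sum(axis=1, keepdims=True)
    return centers, U

# Toy "image": grey-level intensities of dark-lesion vs bright-background pixels
X = np.array([[0.10], [0.15], [0.12], [0.80], [0.85], [0.90]])
centers, U = fuzzy_cmeans(X, c=2)
labels = U.argmax(axis=1)                      # hard labels for display
print(labels)
```

In the thesis this soft labeling is the starting point; the MRF (and, in the second method, Particle Swarm Optimization) then refines pixel assignments using spatial context.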
27

Using functional annotation to characterize genome-wide association results

Fisher, Virginia Applegate 11 December 2018 (has links)
Genome-wide association studies (GWAS) have successfully identified thousands of variants robustly associated with hundreds of complex traits, but the biological mechanisms driving these results remain elusive. Functional annotation, describing the roles of known genes and regulatory elements, provides additional information about associated variants. This dissertation explores the potential of these annotations to explain the biology behind observed GWAS results. The first project develops a random-effects approach to genetic fine mapping of trait-associated loci. Functional annotation and estimates of the enrichment of genetic effects in each annotation category are integrated with linkage disequilibrium (LD) within each locus and GWAS summary statistics to prioritize variants with plausible functionality. Applications of this method to simulated and real data show good performance in a wider range of scenarios relative to previous approaches. The second project focuses on the estimation of enrichment by annotation categories. I derive the distribution of GWAS summary statistics as a function of annotations and LD structure and perform maximum likelihood estimation of enrichment coefficients in two simulated scenarios. The resulting estimates are less variable than previous methods, but the asymptotic theory of standard errors is often not applicable due to non-convexity of the likelihood function. In the third project, I investigate the problem of selecting an optimal set of tissue-specific annotations with greatest relevance to a trait of interest. I consider three selection criteria defined in terms of the mutual information between functional annotations and GWAS summary statistics. 
These algorithms correctly identify enriched categories in simulated data, but in the application to a GWAS of BMI, the penalty for redundant features outweighs the modest relationships with the outcome, yielding empty selected feature sets, due to the weaker overall association and the high similarity between tissue-specific regulatory features. All three projects require little in the way of prior hypotheses regarding the mechanism of genetic effects. These data-driven approaches have the potential to illuminate unanticipated biological relationships, but they are also limited by the high dimensionality of the data relative to the moderate strength of the signals under investigation. These approaches advance the set of tools available to researchers for drawing biological insights from GWAS results.
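A mutual-information selection criterion with a redundancy penalty (sketched here in an mRMR-style form, an assumption for illustration; the dissertation's three criteria may differ) behaves exactly as described: a feature redundant with one already chosen is passed over even when it is individually informative.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Mutual information between two discrete variables, in bits."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(c / n * math.log2((c / n) / (px[a] / n * py[b] / n))
               for (a, b), c in pxy.items())

def select_mrmr(features, target, k=2):
    """Greedy relevance-minus-redundancy selection: pick the feature with
    the highest MI to the target minus its mean MI to features already chosen."""
    chosen, remaining = [], list(features)
    while remaining and len(chosen) < k:
        def score(name):
            rel = mutual_information(features[name], target)
            red = (sum(mutual_information(features[name], features[c])
                       for c in chosen) / len(chosen)) if chosen else 0.0
            return rel - red
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Hypothetical annotations: 'liver' tracks the trait, 'liver_copy' is
# identical to it (fully redundant), 'skin' is weakly related.
trait = [1, 1, 1, 0, 0, 0, 1, 0]
features = {
    "liver":      [1, 1, 1, 0, 0, 0, 0, 0],
    "liver_copy": [1, 1, 1, 0, 0, 0, 0, 0],
    "skin":       [1, 0, 1, 0, 1, 0, 1, 0],
}
print(select_mrmr(features, trait, k=2))  # the redundant copy is skipped
```

With many highly similar tissue-specific annotations and weak trait relationships, the redundancy term can dominate for every candidate, which is one way an empty selected set arises.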
28

Data Science techniques for predicting plant genes involved in secondary metabolites production

Muteba, Ben Ilunga January 2018 (has links)
Masters of Science / Plant genome analysis is currently experiencing a boost due to the reduced costs associated with next generation sequencing technologies. Knowledge of genetic background can be applied to guide targeted plant selection and breeding, and to facilitate natural product discovery and biological engineering. In medicinal plants, secondary metabolites are of particular interest because they often represent the main active ingredients associated with health-promoting qualities. Plant polyphenols are a highly diverse family of aromatic secondary metabolites that act as antimicrobial agents, UV protectants, and insect or herbivore repellents. Most genome mining tools developed to analyse genetic material have seldom addressed secondary metabolite genes and biosynthesis pathways, and little research has studied key enzyme factors that can predict a class of secondary metabolite genes from the polyketide synthases. The objectives of this study were twofold. Primarily, it aimed to identify the biological properties of secondary metabolite genes, focusing on a specific gene, naringenin-chalcone synthase or chalcone synthase (CHS). The study hypothesized that data science approaches to mining biological data, particularly secondary metabolite (SM) genes, would reveal important aspects of SM biosynthesis. Secondarily, it aimed to propose a proof of concept for classifying or predicting plant genes involved in polyphenol biosynthesis using data science techniques, conveying these techniques in computational analysis through machine learning algorithms and mathematical and statistical approaches.
Three specific challenges experienced while analysing secondary metabolite datasets were: 1) class imbalance, a lack of proportionality among protein sequence classes; 2) high dimensionality, the explosion of the feature space that arises when analysing bioinformatics datasets; and 3) variable sequence length, the fact that protein sequences differ in length. Given these inherent issues, developing precise classification and statistical models proves challenging. Therefore, the prerequisite for effective SM plant gene mining is dedicated data science techniques that can collect, prepare and analyse SM genes.
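A standard way around the third challenge, variable protein sequence lengths, is to map each sequence to a fixed-length k-mer count vector, so that sequences of any length become comparable feature vectors. A sketch with a deliberately shortened alphabet for readability (real protein work uses all 20 amino acids, giving 400 dipeptide features):

```python
from itertools import product

def kmer_features(seq, k=2, alphabet="ACDE"):
    """Map a variable-length sequence to a fixed-length vector of k-mer
    counts over the given alphabet."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    vec = [0] * len(kmers)
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        if km in index:              # skip k-mers with unknown residues
            vec[index[km]] += 1
    return vec

v1 = kmer_features("ACDE")           # length-4 sequence
v2 = kmer_features("ACDEACD")        # length-7 sequence
print(len(v1), len(v2))              # both vectors have the same length
```

Because every sequence lands in the same 4^k-dimensional space, downstream classifiers no longer see the length difference, though the high dimensionality and class imbalance still have to be handled separately.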
29

Integrated feature, neighbourhood, and model optimization for personalised modelling and knowledge discovery

Liang, Wen January 2009 (has links)
“Machine learning is the process of discovering and interpreting meaningful information, such as new correlations, patterns and trends, by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques” (Larose, 2005). In my understanding, machine learning is a process of using different analysis techniques to observe previously unknown, potentially meaningful information and to discover strong patterns and relationships in a large dataset. Professor Kasabov (2007b) classified computational models into three categories (global, local, and personalised), which are widely used in the areas of data analysis and decision support in general, and in medicine and bioinformatics in particular. Most recently, the concept of personalised modelling has been applied to various disciplines such as personalised medicine and personalised drug design for known diseases (e.g. cancer, diabetes, brain disease), as well as to other modelling problems in ecology, business, finance, crime prevention, and so on. The philosophy behind the personalised modelling approach is that every person is different, and thus each will benefit from having a personalised model and treatment. However, personalised modelling is not without issues, such as defining the correct number of neighbours or an appropriate number of features. As a result, the principal goal of this research is to study and address these issues and to create a novel framework and system for personalised modelling. The framework allows users to select and optimise the most important features and nearest neighbours for a new input sample in relation to a certain problem, based on a weighted variable distance measure, in order to obtain more precise prognostic accuracy and personalised knowledge than global and local modelling approaches provide.
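The core of a personalised model — choosing a neighbourhood for each new input sample under a weighted variable distance measure — can be sketched as a feature-weighted nearest-neighbour vote. The data and weights below are toy values; optimising the weights, the feature subset and the neighbourhood size per sample is the subject of the thesis and is not shown:

```python
import math

def personalised_predict(sample, data, labels, weights, k=3):
    """Predict a class for one new sample from its k nearest neighbours
    under a feature-weighted Euclidean distance. `weights` encodes
    feature importance (0 suppresses a feature entirely)."""
    def dist(a, b):
        return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))
    ranked = sorted(range(len(data)), key=lambda i: dist(sample, data[i]))
    votes = [labels[i] for i in ranked[:k]]
    return max(set(votes), key=votes.count)   # majority class in neighbourhood

# Toy data: feature 0 is informative, feature 1 is noise;
# the weight vector suppresses the noisy feature
data = [(0.0, 5.0), (0.1, 0.0), (0.2, 9.0), (1.0, 4.0), (1.1, 1.0), (0.9, 8.0)]
labels = ["healthy", "healthy", "healthy", "disease", "disease", "disease"]
pred = personalised_predict((0.15, 7.0), data, labels, weights=(1.0, 0.0), k=3)
print(pred)
```

Because the neighbourhood and weights are evaluated per input sample, each prediction is effectively made by its own small local model, which is the distinguishing idea relative to one global model for all samples.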
30

Identification of Driving Styles in Buses

Karginova, Nadezda January 2010 (has links)
It is important to detect faults in bus components at an early stage. Because driving style affects the wear of different components in the bus, identification of the driving style is important to minimize the number of failures in buses.

The identification of the driver's driving style was based on input data containing examples of driving runs of each class. K-nearest neighbor and neural network algorithms were used, and different models were tested.

It was shown that the results depend on the selected driving runs. A hypothesis was suggested that examples from different driving runs have different parameters, which affect the results of the classification.

The best results were achieved by using a subset of variables chosen with the help of a forward feature selection procedure. The percentage of correct classifications is about 89-90% for the k-nearest neighbor algorithm and 88-93% for the neural networks.

Feature selection allowed a significant improvement in the results of the k-nearest neighbor algorithm, and in the results of the neural network algorithm for the case when the training and testing data sets were selected from different driving runs. On the other hand, feature selection did not affect the neural network results for the case when the training and testing data sets were selected from the same driving runs.

Another way to improve the results is smoothing: computing the average class among a number of consecutive examples achieved a decrease in the error.
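The smoothing step — taking the prevailing class among consecutive per-sample predictions — can be sketched as a sliding majority vote. The class names and window size are illustrative, not the thesis's actual labels:

```python
from collections import Counter

def smooth_predictions(preds, window=3):
    """Majority vote over a sliding window of consecutive predictions,
    suppressing isolated misclassifications in a time-ordered run."""
    half = window // 2
    out = []
    for i in range(len(preds)):
        seg = preds[max(0, i - half): i + half + 1]   # window clipped at ends
        out.append(Counter(seg).most_common(1)[0][0])
    return out

# A run of per-sample driving-style predictions with two isolated errors
raw = ["calm", "calm", "harsh", "calm", "calm", "harsh", "harsh", "calm", "harsh"]
print(smooth_predictions(raw))
```

The isolated "harsh" at position 3 and the isolated "calm" at position 8 are voted away, while the genuine run of "harsh" predictions survives, which is how smoothing trades a little latency for a lower error rate.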
