Global ETD Search

131	Modern variable selection techniques in the generalised linear model with application in Biostatistics Millard, Salomi 10 1900 (has links) In a Biostatistics environment, the datasets to be analysed are frequently high-dimensional and multicollinearity is expected due to the nature of the features. However, many traditional approaches to statistical analysis and feature selection cease to be useful in the presence of high-dimensionality and multicollinearity. Penalised regression methods have proved to be practical and attractive for dealing with these problems. In this dissertation, we propose a new penalised approach, the modified elastic-net (MEnet), for statistical analysis and feature selection using a combination of the ridge and bridge penalties. This method is designed to deal with high-dimensional problems with highly correlated predictor variables. Furthermore, it has a closed-form solution, unlike the most frequently used penalised techniques, which makes it simple to implement on high-dimensional data. We show how this approach can be used to analyse high-dimensional data with binary responses, e.g., microarray data, and simultaneously select significant features. An extensive simulation study and analysis of a colon cancer dataset demonstrate the properties and practical aspects of the proposed method. / Mini Dissertation (MSc (Advanced Data Analytics))--University of Pretoria, 2020. / DSI-CSIR Interbursary Support (IBS) Programme / Statistics Industry HUB, Department of Statistics, University of Pretoria / Statistics / MSc / Restricted Mathematical statistics Penalised regression Feature selection UCTD
132	Machine Learning Identification of Protein Properties Useful for Specific Applications Khamis, Abdullah M. 31 March 2016 (has links) Proteins play critical roles in cellular processes of living organisms. It is therefore important to identify and characterize their key properties associated with their functions. Correlating protein’s structural, sequence and physicochemical properties of its amino acids (aa) with protein functions could identify some of the critical factors governing the specific functionality. We point out that not all functions of even well studied proteins are known. This, complemented by the huge increase in the number of newly discovered and predicted proteins, makes challenging the experimental characterization of the whole spectrum of possible protein functions for all proteins of interest. Consequently, the use of computational methods has become more attractive. Here we address two questions. The first one is how to use protein aa sequence and physicochemical properties to characterize a family of proteins. The second one focuses on how to use transcription factor (TF) protein’s domains to enhance accuracy of predicting TF DNA binding sites (TFBSs). To address the first question, we developed a novel method using computational representation of proteins based on characteristics of different protein regions (N-terminal, M-region and C-terminal) and combined these with the properties of protein aa sequences. We show that this description provides important biological insight about characterization of the protein functional groups. Using feature selection techniques, we identified key properties of proteins that allow for very accurate characterization of different protein families. We demonstrated efficiency of our method in application to a number of antimicrobial peptide families. 
To address the second question we developed another novel method that uses a combination of aa properties of DNA binding domains of TFs and their TFBS properties to develop machine learning models for predicting TFBSs. Feature selection is used to identify the most relevant characteristics of the aa for such modeling. In addition to reducing the number of required models to only 14 for several hundred TFs, the final prediction accuracy of our models appears dramatically better than with other methods. Overall, we show how to efficiently utilize properties of proteins in deriving more accurate solutions for two important problems of computational biology and bioinformatics. Machine Learning feature selection protein properties Bioinformatics
133	Evaluating and enhancing the security of cyber physical systems using machine learning approaches Sharma, Mridula 08 April 2020 (has links) The main aim of this dissertation is to address the security issues of the physical layer of Cyber Physical Systems. The network security is first assessed using a 5-level Network Security Evaluation Scheme (NSES). The network security is then enhanced using a novel Intrusion Detection System that is designed using Supervised Machine Learning. Defined as a complete architecture, this framework includes a complete packet analysis of radio traffic of Routing Protocol for Low-Power and Lossy Networks (RPL). A dataset of 300 different simulations of RPL network is defined for normal traffic, hello flood attack, DIS attack, increased version attack and decreased rank attack. The IDS is a multi-model detection model that provides an efficient detection against the known as well as new attacks. The model analysis is done with the cross-validation method as well as using the new data from a similar network. To detect the known attacks, the model performed at 99% accuracy rate and for the new attack, 85% accuracy is achieved. / Graduate CPS Supervised Machine Learning RPL Feature Selection
134	Towards an Efficient Artificial Neural Network Pruning and Feature Ranking Tool AlShahrani, Mona 24 May 2015 (has links) Artificial Neural Networks (ANNs) are known to be among the most effective and expressive machine learning models. Their impressive abilities to learn have been reflected in many broad application domains such as image recognition, medical diagnosis, online banking, robotics, dynamic systems, and many others. ANNs with multiple layers of complex non-linear transformations (a.k.a. Deep ANNs) have recently shown successful results in computer vision and speech recognition. ANNs are parametric models that approximate unknown functions and whose parameter values (weights) are adapted during training. An ANN's weights can be large in number, which makes the trained model more complex and raises the chance of "overfitting" the training data. In this study, we explore the effects of network pruning on the performance of ANNs and on the ranking of features that describe the data. A simplified ANN model has fewer parameters, needs less computation and trains faster. We investigate the use of Hessian-based pruning algorithms as well as simpler ones (i.e. non-Hessian-based) on nine datasets with varying numbers of input features and ANN parameters. The Hessian-based Optimal Brain Surgeon algorithm (OBS) is robust but slow. Therefore a faster parallel Hessian approximation is provided. An additional speedup is provided using a variant we name 'Simple n Optimal Brain Surgeon' (SNOBS), which represents a good compromise between robustness and time efficiency. For some of the datasets, the ANN pruning experiments show on average a 91% reduction in the number of ANN parameters and about a 60% to 90% reduction in the number of ANN input features, while maintaining accuracy comparable to or better than the case when no pruning is applied. 
Finally, we show through a comprehensive comparison with seven state-of-the-art feature filtering methods that the feature selection and ranking obtained as a byproduct of the ANN pruning is comparable in accuracy to these methods. Artificial Pruning Neural Network Feature Ranking
135	Feature Detection from Mobile LiDAR Using Deep Learning Liu, Xian 12 March 2019 (has links) No description available. Computer Science Deep learning, LiDAR, Feature Detection
136	A Pattern Recognition Approach to Electromyography Data Mitzev, Ivan Stefanov 07 August 2010 (has links) EMG classification is widely used in electric control of mechanically developed prosthesis, robots development, clinical application etc. It has been evaluated for years, but the main goal of this research is to develop an easy to implement and fast to execute pattern recognition method for classifying signals used for human gait analysis. This method is based on adding two new temporal features (form factor and standard deviation) for EMG signal recognition and using them along with several popular features (area under the curve, wavelength function-pathway and zero crossing rate) to come up with a low complexity suitable feature extraction. Results are presented for EMG data and a comparison with existing methods is made to validate the applicability of the foregoing method. It is shown that the best combination in terms of accuracy and time performance is given by spectral and temporal extraction features along with neural network recognition (NN) algorithm. feature extraction Mahalanobis distance electromyography pattern recognition
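The time-domain features named in entry 136 can be illustrated with a short sketch. This is not the author's implementation: the abstract gives no formulas, so the definitions below are assumptions. In particular, "form factor" is taken to be the waveform form factor (RMS divided by mean absolute value), "wavelength function-pathway" is read as the usual waveform length, and the function name `emg_features` is hypothetical.

```python
import math


def emg_features(signal):
    """Compute simple time-domain features for one window of EMG samples."""
    n = len(signal)
    if n < 2:
        raise ValueError("need at least two samples")

    mean = sum(signal) / n
    # Population standard deviation of the window.
    std = math.sqrt(sum((x - mean) ** 2 for x in signal) / n)

    # Area under the rectified curve, approximated with unit sample spacing.
    area = sum(abs(x) for x in signal)

    # Waveform length: summed absolute difference between neighbouring samples.
    pairs = list(zip(signal, signal[1:]))
    waveform_length = sum(abs(b - a) for a, b in pairs)

    # Zero crossing rate: fraction of neighbouring pairs with opposite signs.
    zero_crossing_rate = sum(1 for a, b in pairs if a * b < 0) / (n - 1)

    # Form factor (assumed definition): RMS divided by mean absolute value.
    rms = math.sqrt(sum(x * x for x in signal) / n)
    mean_abs = area / n
    form_factor = rms / mean_abs if mean_abs > 0 else 0.0

    return {
        "std": std,
        "area": area,
        "waveform_length": waveform_length,
        "zero_crossing_rate": zero_crossing_rate,
        "form_factor": form_factor,
    }
```

Each feature is a single pass over the window, which is consistent with the low-complexity goal stated in the abstract; a classifier would receive the resulting dictionary values as one feature vector per window.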
137	A comparison of Data Stores for the Online Feature Store Component : A comparison between NDB and Aerospike / En jämförelse av datalagringssystem för andvänding som Online Feature Store : En jämförelse mellan NDB och Aerospike Volminger, Alexander January 2021 (has links) This thesis aimed to investigate which Data Stores would be suitable for implementation as an Online Feature Store. This is a component in the Machine Learning infrastructure that needs to be able to handle low latency Reads at high throughput with high availability. The thesis evaluated the Data Stores with real feature workloads from Spotify's Search system. First an investigation was made to find suitable storage systems. NDB and Aerospike were selected because of their state-of-the-art performance together with their suitable functionality. These were then implemented as the Online Feature Store by batch Reading the feature data through a Java program and by using Google Dataflow to input data to the Data Stores. For 1 client NDB achieved about 35% higher batch Read throughput with around 30% lower P99 latency than Aerospike. For 8 clients NDB got 20% higher batch Read throughput, with a P99 latency difference that varied relative to Aerospike. But in an 8-node setup NDB achieved on average 35% lower latency. Aerospike achieved 50% faster Write speeds when writing feature data to the Data Stores. Both Data Stores' Read performance was found to suffer upon Writing to the data store at the same time as Reading, with the P99 Read latency increasing around 30% for both Data Stores. It was concluded that both Data Stores would work as an Online Feature Store. But NDB achieved better Read performance, which is one of the most important factors for this type of Feature Store. / This thesis investigated which data storage systems are suitable for implementation as an Online Feature Store. 
This is a component of the machine learning infrastructure that must handle fast reads with high throughput and high availability. The thesis studied this by evaluating data storage systems with real feature data from Spotify's search system. An investigation was first carried out to find promising data storage systems for this task. NDB and Aerospike were chosen because of their top performance and suitable functionality. These were then implemented as an Online Feature Store by batch reading the feature data with a Java program and by using Google Dataflow to load the feature data into the data storage systems. For 1 client, NDB achieved around 35% higher feature data throughput than Aerospike for batch reads, with approximately 30% lower P99 latency. For 8 clients, NDB achieved around 20% higher feature data throughput, with a P99 latency that varied more. But in the clusters with 8 nodes, NDB achieved on average 35% lower latency. Aerospike was 50% faster at writing the feature data to the data storage system. Both systems, however, suffered from worse read performance when writes to them occurred at the same time; the P99 read latency then rose by around 30% for both data storage systems. In summary, both of the investigated data storage systems worked as an Online Feature Store. But NDB had better read performance, which is one of the most important factors for this type of Feature Store. Feature Stores Data Stores NDB Aerospike NoSQL Online Feature Stores Data storage systems Computer and Information Sciences
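For readers unfamiliar with the P99 figures quoted in entry 137, the sketch below shows one common way to compute a tail-latency percentile and a relative change between two measurements. It is an illustration only: the abstract does not say which percentile definition the thesis used, so the nearest-rank method is an assumption, and the function names are hypothetical.

```python
import math


def percentile_nearest_rank(samples, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: the smallest rank whose share of the data is at least p percent.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def relative_change(baseline, observed):
    """Fractional change of observed relative to baseline (0.30 means +30%)."""
    return (observed - baseline) / baseline
```

With this definition, a P99 latency is the value that 99% of the measured read latencies do not exceed, and a statement such as "P99 rose by around 30%" corresponds to `relative_change(p99_before, p99_after)` being close to 0.30.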
138	Application of Hyper-geometric Hypothesis-based Quantification and Markov Blanket Feature Selection Methods to Generate Signals for Adverse Drug Reaction Detection Zhang, Yi January 2012 (has links) No description available. Mechanical Engineering Pharmacovigilance Data Mining Feature Selection
139	A Comparison of Unsupervised Methods for DNA Microarray Leukemia Data Harness, Denise 05 April 2018 (has links) (PDF) Advancements in DNA microarray data sequencing have created the need for sophisticated machine learning algorithms and feature selection methods. Probabilistic graphical models, in particular, have been used to identify whether microarrays or genes cluster together in groups of individuals having a similar diagnosis. These clusters of genes are informative, but can be misleading when every gene is used in the calculation. First feature reduction techniques are explored, however the size and nature of the data prevents traditional techniques from working efficiently. Our method is to use the partial correlations between the features to create a precision matrix and predict which associations between genes are most important to predicting Leukemia diagnosis. This technique reduces the number of genes to a fraction of the original. In this approach, partial correlations are then extended into a spectral clustering approach. In particular, a variety of different Laplacian matrices are generated from the network of connections between features, and each implies a graphical network model of gene interconnectivity. Various edge and vertex weighted Laplacians are considered and compared against each other in a probabilistic graphical modeling approach. The resulting multivariate Gaussian distributed clusters are subsequently analyzed to determine which genes are activated in a patient with Leukemia. Finally, the results of this are compared against other feature engineering approaches to assess its accuracy on the Leukemia data set. The initial results show the partial correlation approach of feature selection predicts the diagnosis of a Leukemia patient with almost the same accuracy as using a machine learning algorithm on the full set of genes. More calculations of the precision matrix are needed to ensure the set of most important genes is correct. 
Additionally more machine learning algorithms will be implemented using the full and reduced data sets to further validate the current prediction accuracy of the partial correlation method. Microarray Data Feature Reduction Applied Statistics Microarrays
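The link between a precision matrix and partial correlations that entry 139 relies on can be sketched as follows. This is a minimal illustration and not the author's code: it uses the standard identity that the partial correlation between features i and j equals minus the (i, j) precision entry divided by the square root of the product of the two diagonal entries. The helper names are hypothetical, and the plain matrix inversion shown here assumes a small, non-singular covariance matrix; microarray data with far more genes than samples would need a regularised precision estimate instead.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity; the right half becomes the inverse.
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def partial_correlations(covariance):
    """Partial correlation of each feature pair given all remaining features."""
    precision = invert(covariance)
    n = len(precision)
    result = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                denom = math.sqrt(precision[i][i] * precision[j][j])
                result[i][j] = -precision[i][j] / denom
    return result
```

A near-zero partial correlation suggests that two features are conditionally independent given the rest, so thresholding these values is one plausible way to keep only the strongest gene-to-gene associations.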
140	CNN MODEL FOR RECOGNITION OF TEXT-BASED CAPTCHAS AND ANALYSIS OF LEARNING BASED ALGORITHMS' VULNERABILITIES TO VISUAL DISTORTION Amiri Golilarz, Noorbakhsh 01 May 2023 (has links) (PDF) Owing to rapid progress in deep learning and neural networks, many approaches and state-of-the-art studies have been conducted in these fields, giving rise to various learning-based attacks that leave websites and portals vulnerable. Attacks of this kind weaken website security and can expose sensitive personal information. Preserving website security is now one of the most challenging tasks. CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart) is a kind of test that designers deploy on websites to distinguish humans from robots and so protect the sites from possible attacks. In this dissertation, we proposed a CNN-based approach to attack and break text-based CAPTCHAs. The proposed method has been compared with several state-of-the-art approaches in terms of recognition accuracy (RA). Based on the results, the developed method can break and recognize CAPTCHAs at high accuracy. Additionally, to examine how these CAPTCHAs can be made harder to break, we applied five types of distortion to them and calculated the recognition accuracy in the presence of each. The results indicate that adversarial noise can make CAPTCHAs much more difficult to break. The results have been compared with some state-of-the-art approaches. This analysis can help CAPTCHA developers incorporate such noise into their CAPTCHAs. This dissertation also presents a hybrid model based on CNN-SVM to solve text-based CAPTCHAs. The developed method contains four main steps: segmentation, feature extraction, feature selection, and recognition. 
For segmentation, we suggested using histogram and k-means clustering. For feature extraction, we developed a new CNN structure. The extracted features are passed through the mRMR algorithm to select the most efficient features. These selected features are fed into SVM for further classification and recognition. The results have been compared with several state-of-the-art methods to show the superiority of the developed approach. In general, this dissertation presented deep learning-based methods to solve text-based CAPTCHAs. The efficiency and effectiveness of the developed methods have been compared with various state-of-the-art methods. The developed techniques can break CAPTCHAs at high accuracy and also in a short time. We utilized Peak Signal to Noise Ratio (PSNR), ROC, accuracy, sensitivity, specificity, and precision to evaluate and measure the performance analysis of different methods. The results indicate the superiority of the developed methods. CAPTCHA CNN feature extraction recognition segmentation SVM
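Several of the evaluation measures listed in entry 140 have simple closed forms, sketched below. This is an illustration only and not the dissertation's evaluation code: the function names are hypothetical, images are represented as flat lists of pixel intensities, and a peak value of 255 (8-bit images) is assumed.

```python
import math


def psnr(original, distorted, max_value=255.0):
    """Peak Signal to Noise Ratio in dB between two equal-length pixel lists."""
    if not original or len(original) != len(distorted):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, distorted)) / len(original)
    if mse == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and precision from confusion counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "precision": tp / (tp + fp),
    }
```

A lower PSNR means a more heavily distorted image, so reporting PSNR alongside recognition accuracy gives a rough sense of how much distortion is needed before a recognizer starts to fail.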
