  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
151

Detecting Bots using Stream-based System with Data Synthesis

Hu, Tianrui 28 May 2020 (has links)
Machine learning has shown great success in building security applications, including bot detection. However, many machine learning models are difficult to deploy because model training requires a continuous supply of representative labeled data, which is expensive and time-consuming to obtain in practice. In this thesis, we address this problem by building a bot detection system with a data synthesis method and explore detecting bots with limited data. We collected network traffic from three online services in three different months within a year (23 million network requests). We develop a novel stream-based feature encoding scheme that allows our model to perform real-time bot detection on anonymized network data. We propose a data synthesis method that synthesizes unseen (or future) bot behavior distributions, enabling our system to detect bots with extremely limited labeled data. The synthesis method is distribution-aware, using two different generators in a Generative Adversarial Network to synthesize data for the clustered regions and the outlier regions of the feature space. We evaluate this idea and show that our method can train a model that outperforms existing methods with only 1% of the labeled data. We show that data synthesis also improves the model's sustainability over time and speeds up retraining. Finally, we compare data synthesis and adversarial retraining and show that they can complement each other to improve the model's generalizability. / Master of Science / An internet bot is computer-controlled software that performs simple, automated tasks over the internet. Although some bots are legitimate, many are operated to perform malicious actions that cause severe security and privacy issues. To address this problem, machine learning (ML) models, which have shown great success in building security applications, are widely used to detect bots, since they can identify hidden patterns by learning from data. However, many ML-based approaches are difficult to deploy because model training requires labeled data, which are expensive and time-consuming to obtain in practice, especially for security tasks. Meanwhile, the dynamically changing nature of malicious bots means that bot detection models need a continuous supply of representative labeled data to stay up to date, which makes bot detection more challenging. In this thesis, we build an ML-based bot detection system that detects advanced malicious bots in real time by processing network traffic data. To address the problem of limited and unrepresentative labeled data, we explore a data synthesis method for detecting bots with limited training data. Our proposed data synthesis method synthesizes unseen (or future) bot behavior distributions, enabling our system to detect bots with extremely limited labeled data. We evaluate our approach using real-world datasets we collected and show that our model outperforms existing methods using only 1% of the labeled data. We show that data synthesis also improves the model's sustainability over time and makes it easier to keep the model up to date. Finally, we show that our method can work in a complementary manner with adversarial retraining to improve the model's generalizability.
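To make the distribution-aware synthesis idea concrete, the following is a minimal sketch, assuming DBSCAN is used to separate clustered from outlier regions of the labeled bot samples and that a small fully connected GAN is trained per region; the feature encoding, network architectures, and training details in the thesis differ, so treat this as an illustration only.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import DBSCAN

def split_regions(X, eps=0.5, min_samples=5):
    """Split labeled bot samples into clustered vs. outlier regions."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    return X[labels != -1], X[labels == -1]

def make_generator(noise_dim, feat_dim):
    return nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(),
                         nn.Linear(64, feat_dim))

def make_discriminator(feat_dim):
    return nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                         nn.Linear(64, 1), nn.Sigmoid())

def train_gan(real, noise_dim=8, epochs=200, batch=64, lr=1e-3):
    """Train one small GAN on the samples of a single region."""
    feat_dim = real.shape[1]
    G, D = make_generator(noise_dim, feat_dim), make_discriminator(feat_dim)
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    bce = nn.BCELoss()
    real = torch.tensor(real, dtype=torch.float32)
    for _ in range(epochs):
        idx = torch.randint(0, len(real), (batch,))
        x_real, z = real[idx], torch.randn(batch, noise_dim)
        x_fake = G(z)
        # Discriminator step: real samples -> 1, synthetic samples -> 0.
        opt_d.zero_grad()
        loss_d = bce(D(x_real), torch.ones(batch, 1)) + \
                 bce(D(x_fake.detach()), torch.zeros(batch, 1))
        loss_d.backward()
        opt_d.step()
        # Generator step: fool the discriminator.
        opt_g.zero_grad()
        loss_g = bce(D(x_fake), torch.ones(batch, 1))
        loss_g.backward()
        opt_g.step()
    return G

def synthesize(X_bot, n_per_region=500, noise_dim=8):
    """Train one generator per region and pool the synthetic samples."""
    samples = []
    for region in split_regions(X_bot):
        if len(region) < 10:   # skip regions too small to train on
            continue
        G = train_gan(region, noise_dim)
        with torch.no_grad():
            samples.append(G(torch.randn(n_per_region, noise_dim)).numpy())
    return np.vstack(samples)
```

The synthetic samples would then be added to the small labeled set before retraining the detector.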
152

Methodology Development for Improving the Performance of Critical Classification Applications

Afrose, Sharmin 17 January 2023 (has links)
People interact with different critical applications in day-to-day life. Examples of critical applications include computer programs, autonomous vehicles, digital healthcare, and smart homes. There are inherent risks in these critical applications if they fail to perform properly. In this dissertation, we focus on developing methodologies for performance improvement in software security and healthcare prognosis. Cryptographic vulnerability tools are used to detect misuses of Java cryptographic APIs and thus classify secure and insecure parts of code. These detection tools are critical applications, as misuse of cryptographic libraries and APIs has devastating security and privacy implications. We develop two benchmarks that help developers identify secure and insecure code usage as well as improve their tools. We also perform a comparative analysis of four static analysis tools. The developed benchmarks enable the first scientific comparison of the accuracy and scalability of cryptographic API misuse detection. Many published detection tools (CryptoGuard, CrySL, Oracle Parfait) have used our benchmarks to improve their detection of insecure cases. We also examine the need for performance improvement in healthcare applications. Numerous prediction applications are developed to predict patients' health conditions. These are critical applications in which misdiagnosis can cause serious harm to patients, even death. Due to the imbalanced nature of many clinical datasets, our work provides empirical evidence of various prediction deficiencies in a typical machine learning model. We observe that missed death cases are 3.14 times more frequent than missed survival cases for mortality prediction. Existing sampling methods and other techniques are also not well equipped to achieve good performance. We design a double prioritized (DP) technique to mitigate representational bias or disparities across race and age groups. We show that DP consistently boosts minority-class recall for underrepresented groups, by up to 38.0%. Our DP method also outperforms existing methods, reducing the relative disparity in minority-class recall by up to 88%. Incorrect classification in these critical applications can have significant ramifications. Therefore, it is imperative to improve the performance of critical applications to alleviate risk and harm to people. / Doctor of Philosophy / We interact with many software applications on our devices in everyday life. Examples include calling transport using Lyft or Uber, shopping online using eBay, using social media via Twitter, and checking payment status from credit card or bank accounts. Much of this software uses cryptography to secure our personal and financial information. However, inappropriate or improper use of cryptography can let a malicious party gain access to sensitive information. To capture the inappropriate usage of cryptographic functions, several detection tools have been developed. However, suitable benchmarks are needed to compare the coverage and detection depth of these tools. To bridge this gap, we build two cryptographic benchmarks that are currently used by many tool developers to improve their tools and compare them with existing ones. In another domain, people see physicians and are admitted to hospitals when needed. Physicians also use software that assists them in caring for patients. Many of these tools are built using machine learning algorithms to predict patients' conditions. Historical medical information, or a clinical dataset, is taken as input to the prediction models. Clinical datasets contain information about patients of different races and ages. The number of samples in some groups of patients may be larger than in other groups. For example, many clinical datasets contain more white patients (i.e., the majority group) than Black patients (i.e., the minority group). Prediction models built on these imbalanced clinical data may provide inaccurate predictions for minority patients. Our work aims to improve prediction accuracy for minority patients in important medical applications, such as estimating the likelihood of a patient dying during an emergency room visit or surviving cancer. We design a new technique that builds customized prediction models for different demographic groups. Our results reveal that subpopulation-specific models show better performance for minority groups. Our work contributes to improving the medical care of minority patients in the age of digital health. Overall, our aim is to improve the performance of critical applications to help people by decreasing risk. Our developed methods can be applied to other critical application domains.
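A rough sketch of the prioritized-duplication idea follows, under the assumption that it amounts to progressively duplicating the minority-class samples of one prioritized demographic group and keeping the model with the best minority-class recall for that group; the classifier, duplication schedule, and evaluation protocol here are illustrative stand-ins, not the exact DP procedure in the dissertation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

def dp_style_train(X, y, group, target_group, max_units=20):
    """Progressively duplicate minority-class (y == 1) samples of one
    demographic group and keep the model with the best minority-class
    recall on that group. A held-out set should be used in practice;
    training data are reused here only for brevity."""
    prio = (y == 1) & (group == target_group)
    X_prio, y_prio = X[prio], y[prio]
    best_model, best_recall = None, -1.0
    for k in range(max_units + 1):
        X_aug = np.vstack([X] + [X_prio] * k)     # k extra copies
        y_aug = np.concatenate([y] + [y_prio] * k)
        model = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
        mask = group == target_group
        rec = recall_score(y[mask], model.predict(X[mask]))
        if rec > best_recall:
            best_model, best_recall = model, rec
    return best_model, best_recall
```

Repeating this per demographic group yields the subpopulation-specific models described in the lay summary.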
153

Using Artificial Life to Design Machine Learning Algorithms for Decoding Gene Expression Patterns from Images

Zaghlool, Shaza Basyouni 26 May 2008 (has links)
Understanding the relationship between gene expression and phenotype is important in many areas of biology and medicine. However, current methods for measuring gene expression, such as microarrays, are invasive, require a biopsy, and are expensive. These factors limit experiments to low-rate temporal sampling of gene expression and prevent longitudinal studies within a single subject, reducing their statistical power. Methods for non-invasive measurement of gene expression are therefore an important and current topic of research. An interesting approach to indirect measurement of gene expression has recently been reported (Segal et al., Nature Biotechnology 25(6), 2007) that uses existing imaging techniques and machine learning to estimate a function mapping image features to gene expression patterns, providing an image-derived surrogate for gene expression. However, the design of machine learning methods for this purpose is hampered by the cost of training and validation. My thesis shows that populations of artificial organisms simulating genetic variation can be used to design machine learning approaches for decoding gene expression patterns from images. If analysis of these images proves successful, the approach can be applied to real biomedical images, reducing the limitations of invasive imaging. The results showed that the box-counting dimension was a suitable feature extraction method, yielding a classification rate of at least 90% for mutation rates up to 40%. The box-counting dimension was also robust in dealing with distorted images. The performance of classifiers using the fractal dimension as a feature was, in fact, more sensitive to the mutation rate than to the applied distortion level. / Master of Science
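The box-counting dimension used here as an image feature can be estimated roughly as sketched below; the threshold, padding, and box sizes are generic choices, not necessarily the exact implementation in the thesis.

```python
import numpy as np

def box_counting_dimension(image, threshold=0.5):
    """Estimate the box-counting (fractal) dimension of a 2-D image by
    counting occupied boxes over a series of dyadic box sizes."""
    binary = np.asarray(image) > threshold
    # Pad to a square whose side is a power of two so boxes tile evenly.
    n = 2 ** int(np.ceil(np.log2(max(binary.shape))))
    padded = np.zeros((n, n), dtype=bool)
    padded[:binary.shape[0], :binary.shape[1]] = binary

    sizes, counts = [], []
    size = n
    while size >= 1:
        # Count boxes of side `size` containing at least one foreground pixel.
        blocks = padded.reshape(n // size, size, n // size, size)
        occupied = blocks.any(axis=(1, 3)).sum()
        if occupied > 0:
            sizes.append(size)
            counts.append(occupied)
        size //= 2
    # Slope of log(count) versus log(1/size) estimates the dimension.
    slope = np.polyfit(np.log(1.0 / np.array(sizes)), np.log(counts), 1)[0]
    return slope
```

The resulting scalar (or a vector of such values over image regions) can then feed a standard classifier.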
154

Machine Learning Classification of Gas Chromatography Data

Clark, Evan Peter 28 August 2023 (has links)
Gas Chromatography (GC) is a technique for separating volatile compounds that relies on adherence differences among the chemical components of a mixture. As conditions within the GC are changed, components of the mixture elute at different times. Sensors measure the elution and produce the data that become chromatograms. By analyzing the chromatogram, the presence and quantity of the mixture's constituent components can be determined. Machine Learning (ML) is a field consisting of techniques by which machines can independently analyze data to derive their own procedures for processing it. There are also techniques for enhancing the performance of ML algorithms: Feature Selection improves performance by using a specific subset of the data, Feature Engineering transforms the data to make processing more effective, and Data Fusion combines multiple sources of data to produce more useful data. This thesis applies machine learning algorithms to chromatograms. Five common machine learning algorithms are analyzed and compared: K-Nearest Neighbour (KNN), Support Vector Machines (SVM), Convolutional Neural Networks (CNN), Decision Trees, and Random Forests (RF). Feature Selection is tested by applying window sweeps with the KNN algorithm, Feature Engineering is applied via the Principal Component Analysis (PCA) algorithm, and Data Fusion is also tested. KNN and RF performed best overall. Feature Selection was very beneficial, PCA was helpful for some algorithms but less so for others, and Data Fusion was moderately beneficial. / Master of Science / Gas Chromatography is a method for separating a mixture into its constituent components. A chromatogram is a time series showing the detection of gas in the gas chromatography machine over time. With a properly set up gas chromatograph, different mixtures will produce different chromatograms. These differences allow researchers to determine the components of a mixture or differentiate compounds from each other. Machine Learning (ML) is a field encompassing methods by which machines can independently analyze data to derive the exact algorithms for processing it. There are many different machine learning algorithms that can accomplish this. There are also techniques that can process the data to make it more effective for use with machine learning: Feature Engineering transforms the data, Feature Selection reduces the data to a subset, and Data Fusion combines different sources of data. Each of these processing techniques has many different implementations. This thesis applies machine learning to gas chromatography. ML systems are developed to classify mixtures based on their chromatograms. Five common machine learning algorithms are developed and compared, and some common Feature Engineering, Feature Selection, and Data Fusion techniques are also evaluated. Two of the algorithms were found to be more effective overall than the others. Feature Selection was found to be very beneficial, Feature Engineering was beneficial for some algorithms but less so for others, and Data Fusion was moderately beneficial.
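A minimal sketch of one such pipeline is shown below, combining a window-based feature-selection sweep, PCA feature engineering, and a KNN classifier with scikit-learn; the window handling, component counts, and neighbor settings are assumptions for illustration, not the configurations evaluated in the thesis.

```python
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def evaluate_knn_on_chromatograms(X, y, window=None, n_components=20, k=5):
    """X: (n_samples, n_timepoints) chromatogram intensities; y: mixture labels.
    `window` is an optional (start, stop) index range acting as a simple
    feature-selection step; PCA plays the feature-engineering role.
    n_components must not exceed the training-fold size."""
    if window is not None:
        X = X[:, window[0]:window[1]]
    model = make_pipeline(StandardScaler(),
                          PCA(n_components=min(n_components, X.shape[1])),
                          KNeighborsClassifier(n_neighbors=k))
    return cross_val_score(model, X, y, cv=5).mean()

# Hypothetical usage: sweep candidate windows and keep the best-scoring one.
# best = max(candidate_windows,
#            key=lambda w: evaluate_knn_on_chromatograms(X, y, w))
```

Data fusion would amount to concatenating (or otherwise combining) chromatograms from multiple sensors before the same pipeline.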
155

Spiking Neural Networks for Low-Power Medical Applications

Smith IV, Lyle Clifford 27 August 2024 (has links)
Artificial intelligence is a swiftly growing field, and many researchers are investigating whether AI can serve as a diagnostic aid in the medical domain. However, a primary weakness of traditional machine learning for many applications is poor energy efficiency, which may hamper its effective use in medicine on portable or edge systems. To be more effective, new energy-efficient machine learning paradigms must be investigated for medical applications; in addition, smaller models with fewer parameters would be better suited to medical edge systems. By processing data as a series of "spikes" instead of continuous values, spiking neural networks (SNNs) may be the right model architecture to address these concerns. This work investigates the proposed advantages of SNNs compared to more traditional architectures when tested on various medical datasets. We compare the energy efficiency of SNN and recurrent neural network (RNN) solutions by finding sizes of each architecture that achieve similar accuracy, and we assess the energy consumption of each comparable network using standard evaluation tools. On the SEED human emotion dataset, SNN architectures achieved up to 20x lower energy per inference than an RNN while maintaining similar classification accuracy. SNNs also achieved 30x lower energy consumption on the PTB-XL ECG dataset with similar classification accuracy. These results show that spiking neural networks are more energy efficient than traditional machine learning models at inference time while maintaining a similar level of accuracy for various medical classification tasks. This superior energy efficiency makes it possible for medical SNNs to operate on edge and portable systems. / Master of Science / As artificial intelligence grows in popularity, especially with the rise of new large language models like ChatGPT, a weakness in traditional architectures becomes more pronounced: these AI models require ever-increasing amounts of energy to operate. Thus, there is a need for more energy-efficient AI models, such as the spiking neural network (SNN). In SNNs, information is processed as a series of spiking signals, like in the biological brain. This allows the resulting architecture to be highly energy efficient and well adapted to processing time-series data. Medicine is a domain that often encounters time-series data and would benefit from greater energy efficiency. This work investigates the proposed advantages of spiking neural networks when applied to various classification tasks in the medical domain. Specifically, both an SNN and a traditional recurrent neural network (RNN) were trained on medical datasets for brain-signal and heart-signal classification. Sizes of each architecture were found that achieved similar classification accuracy, and the energy consumption of each comparable network was assessed. For the SEED brain-signal dataset, the SNN achieved similar classification accuracy to the RNN while consuming as little as 5% of the energy per inference. Similarly, the SNN consumed 30x less energy than the RNN while classifying the PTB-XL ECG dataset. These results show that the SNN architecture is more energy efficient than traditional RNNs for various medical tasks at inference time and may serve as a solution to the energy consumption problem of medical AI.
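The inference-energy comparison can be approximated to first order by counting operations, as in the sketch below: dense multiply-accumulates for the RNN versus spike-driven accumulates for the SNN. The per-operation energy figures, layer sizes, time steps, and spike rate are assumptions chosen for illustration, not the dedicated tools or numbers used in the thesis.

```python
# Illustrative 45 nm CMOS per-operation energies often quoted in the
# literature; treat them as assumptions.
E_MAC = 4.6e-12   # J per multiply-accumulate
E_AC = 0.9e-12    # J per accumulate

def rnn_energy(n_in, n_hidden, n_out, timesteps):
    # Dense MACs every timestep: input, recurrent, and output projections.
    macs = timesteps * (n_in * n_hidden + n_hidden * n_hidden
                        + n_hidden * n_out)
    return macs * E_MAC

def snn_energy(n_in, n_hidden, n_out, timesteps, spike_rate=0.05):
    # Only spiking neurons trigger synaptic updates, and each update is an
    # accumulate rather than a multiply-accumulate.
    syn_ops = timesteps * spike_rate * (n_in * n_hidden + n_hidden * n_hidden
                                        + n_hidden * n_out)
    return syn_ops * E_AC

if __name__ == "__main__":
    # Hypothetical network sizes for an EEG-style input.
    rnn = rnn_energy(62, 256, 3, timesteps=200)
    snn = snn_energy(62, 256, 3, timesteps=200, spike_rate=0.05)
    print(f"RNN ~{rnn * 1e6:.2f} uJ, SNN ~{snn * 1e6:.2f} uJ, "
          f"ratio {rnn / snn:.0f}x")
```

The actual savings depend strongly on the measured spike sparsity, which is why matched-accuracy networks are compared in the thesis.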
156

An application of machine learning to statistical physics: from the phases of quantum control to satisfiability problems

Day, Alexandre G.R. 27 February 2019 (has links)
This dissertation presents a study of machine learning methods with a focus on applications to statistical and condensed matter physics, in particular the problems of quantum state preparation, spin glasses, and constraint satisfiability. We start by introducing core principles of machine learning such as overfitting, the bias-variance tradeoff, and the disciplines of supervised, unsupervised, and reinforcement learning. This discussion is set in the context of recent applications of machine learning to statistical physics and condensed matter physics. We then present the problem of quantum state preparation and show how reinforcement learning, along with stochastic optimization methods, can be applied to identify and define phases of quantum control. Reminiscent of condensed matter physics, the underlying phases of quantum control are identified via a set of order parameters and further detailed in terms of their universal implications for optimal quantum control. In particular, casting the optimal quantum control problem as an optimization problem, we show that it exhibits a generic glassy phase and establish a connection with the fields of spin-glass physics and constraint satisfiability problems. We then demonstrate how unsupervised learning methods can be used to obtain important information about the complexity of the phases described. We end by presenting a novel clustering framework, termed HAL (hierarchical agglomerative learning), which exploits out-of-sample accuracy estimates of machine learning classifiers to perform robust clustering of high-dimensional data, and we show applications of HAL to various clustering problems.
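A minimal sketch of one such stochastic optimization loop is given below, assuming a bang-bang control parameterization (a sequence of +/-1 control values) and a user-supplied black-box fidelity function; the actual protocols, algorithms, and order parameters studied in the dissertation differ, so this is only an illustration of the search setting.

```python
import numpy as np

def stochastic_hill_climb(fidelity, n_steps=60, n_flips=2,
                          n_iters=2000, seed=0):
    """Stochastic local search over bang-bang protocols. `fidelity` maps a
    protocol (array of +/-1 values) to a scalar objective, e.g. the overlap
    of the evolved state with the target state, and is supplied by the user."""
    rng = np.random.default_rng(seed)
    protocol = rng.choice([-1, 1], size=n_steps)
    best_f = fidelity(protocol)
    for _ in range(n_iters):
        candidate = protocol.copy()
        flip = rng.choice(n_steps, size=n_flips, replace=False)
        candidate[flip] *= -1              # flip a few control bangs
        f = fidelity(candidate)
        if f >= best_f:                    # accept non-worsening moves
            protocol, best_f = candidate, f
    return protocol, best_f

# Restarting from many random protocols and inspecting the spread of final
# fidelities is one simple probe of a glassy landscape with many
# near-degenerate local optima.
```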
157

Integrated Process Modeling and Data Analytics for Optimizing Polyolefin Manufacturing

Sharma, Niket 19 November 2021 (has links)
Polyolefins are among the most widely used commodity polymers, with applications in films, packaging, and the automotive industry. Modeling polymerization processes that produce polyolefins, including high-density polyethylene (HDPE), polypropylene (PP), and linear low-density polyethylene (LLDPE) using Ziegler-Natta catalysts with multiple active sites, is a complex and challenging task. In our study, we integrate process modeling and data analytics to improve and optimize polyolefin manufacturing processes. Most of the current literature on polyolefin modeling does not consider all of the commercially important production targets when quantifying the relevant polymerization reactions and their kinetic parameters based on measurable plant data. We develop an effective methodology to estimate the kinetic parameters that have the most significant impact on specific production targets, and to develop kinetics using all commercially important production targets, validated against industrial polyolefin processes. We showcase the utility of dynamic models for efficient grade transitions in polyolefin processes, and we also use the dynamic models for inferential control of polymer processes. We thus present a methodology for building first-principles polyolefin process models that are scientifically consistent but tend to be less accurate due to the many modeling assumptions in a complex system. Data analytics and machine learning (ML) have been applied in the chemical process industry for accurate predictions in data-based soft sensors and for process monitoring and control. They are especially useful for polymer processes, since polymer quality measurements such as melt index and molecular weight are usually less frequent than the continuous process variable measurements. We showcase the use of predictive machine learning models such as neural networks to predict polymer quality indicators, and we demonstrate the utility of causal models such as partial least squares to study the causal effect of process parameters on polymer quality variables. Although ML models produce accurate results, they can over-fit the data and produce scientifically inconsistent results beyond the operating data range. It is therefore increasingly important to develop hybrid models combining data-based ML models and first-principles models. We present a broad perspective on hybrid process modeling and optimization combining scientific knowledge and data analytics in bioprocessing and chemical engineering, using a science-guided machine learning (SGML) approach rather than just direct combinations of first-principles and ML models. We present a detailed review of the scientific literature on the hybrid SGML approach and propose a systematic classification of hybrid SGML models according to their methodology and objective. We identify themes and methodologies that have not been explored much in chemical engineering applications, such as the use of scientific knowledge to improve the ML model architecture and learning process for more scientifically consistent solutions. We apply hybrid SGML techniques, such as inverse modeling and science-guided loss functions, to industrial polyolefin processes; many of these techniques have not previously been applied to such polymer applications. / Doctor of Philosophy / Almost everything we see around us, from furniture and electronics to bottles and cars, is made fully or partially from plastic polymers. The two most popular polymers, which together account for almost two-thirds of polymer production globally, are polyethylene (PE) and polypropylene (PP), collectively known as polyolefins. Hence, the optimization of polyolefin manufacturing processes with the aid of simulation models is critical and profitable for the chemical industry. Modeling a chemical or polymer process is helpful for process scale-up, product quality estimation and monitoring, and new process development. To build a good simulation model, we need to validate its predictions against actual industrial data. A polyolefin process has complex reaction kinetics with multiple parameters that need to be estimated to accurately match the industrial process. We have developed a novel strategy for estimating the kinetics for the model, including the reaction chemistry and polymer quality information, validated against an industrial process. We have thus developed a science-based model that incorporates knowledge of reaction kinetics, thermodynamics, and heat and mass balances for the polyolefin process. The science-based model is scientifically consistent but may not be very accurate due to many modeling assumptions. Therefore, for applications requiring very high accuracy in predicting polymer quality targets such as melt index (MI) and density, data-based techniques might be more appropriate. Recently, we may have heard a lot about artificial intelligence (AI) and machine learning (ML); the basic principle behind these methods is to make the model learn from data in order to predict. The process data measured in a chemical or polymer plant can be utilized for such data analysis, and we can build ML models to predict polymer targets like MI as a function of the input process variables. ML model predictions are very accurate within the operating range of the dataset on which the model is trained, but outside that range they may give scientifically inconsistent results. Thus, there is a need to combine data-based models and scientific models. In our research, we showcase novel approaches to integrating science-based models and data-based ML methodology, which we term hybrid science-guided machine learning (SGML). The hybrid SGML methods applied to polyolefin processes yield not only accurate but also scientifically consistent predictions, which can be used for polyolefin process optimization in applications such as process development and quality monitoring.
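One common hybrid (SGML) pattern, a first-principles prediction corrected by an ML model trained on its residuals, can be sketched as follows; the first-principles function, the melt-index target, and the gradient-boosting corrector are illustrative stand-ins rather than the specific kinetic and data-driven models developed in this work.

```python
from sklearn.ensemble import GradientBoostingRegressor

class HybridMeltIndexModel:
    """Hybrid model sketch: a first-principles estimate corrected by an ML
    model trained on its residuals. `first_principles_fn` maps a matrix of
    process variables X to a melt-index estimate and is a user-supplied
    stand-in for a kinetic/thermodynamic model."""

    def __init__(self, first_principles_fn):
        self.fp = first_principles_fn
        self.residual_model = GradientBoostingRegressor()

    def fit(self, X, y):
        y_fp = self.fp(X)
        self.residual_model.fit(X, y - y_fp)   # learn the systematic error
        return self

    def predict(self, X):
        return self.fp(X) + self.residual_model.predict(X)

# Hypothetical usage:
# hybrid = HybridMeltIndexModel(kinetic_model_mi).fit(X_train, mi_train)
# mi_pred = hybrid.predict(X_test)
```

This residual-correction form is only one of several SGML patterns reviewed in the dissertation; others, such as science-guided loss functions or architecture constraints, embed the scientific knowledge directly into the ML model itself.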
158

Leveraging Infrared Imaging with Machine Learning for Phenotypic Profiling

Liu, Xinwen January 2024 (has links)
Phenotypic profiling systematically maps and analyzes observable traits (phenotypes) exhibited by cells, tissues, organisms or systems in response to various conditions, including chemical, genetic and disease perturbations. This approach seeks to comprehensively understand the functional consequences of perturbations on biological systems, thereby informing diverse research areas such as drug discovery, disease modeling, functional genomics and systems biology. The corresponding techniques should capture high-dimensional features to distinguish phenotypes affected by different conditions. Current methods mainly include fluorescence imaging, mass spectrometry and omics technologies, coupled with computational analysis, to quantify diverse features such as morphology, metabolism and gene expression in response to perturbations. Yet they face challenges of high cost, complicated operation and strong batch effects. Vibrational imaging offers an alternative for phenotypic profiling, providing a sensitive, cost-effective and easily operated approach to capture the biochemical fingerprint of phenotypes. Among vibrational imaging techniques, infrared (IR) imaging has the further advantages of high throughput, fast imaging speed and full spectral coverage compared with Raman imaging. However, current biomedical applications of IR imaging mainly concentrate on "digital disease pathology", which uses label-free IR imaging with machine learning for tissue pathology classification and disease diagnosis. This thesis contributes the first comprehensive study of using IR imaging for phenotypic profiling, focusing on three key areas. First, IR-active vibrational probes are systematically designed to enhance metabolic specificity, thereby enriching the measured features and improving sensitivity and specificity for phenotype discrimination. Second, experimental workflows are established for phenotypic profiling using IR imaging across biological samples at various levels, including cellular, tissue and organ, in response to drug and disease perturbations. Lastly, complete data analysis pipelines are developed, including data preprocessing, statistical analysis and machine learning methods, with additional algorithmic developments for analyzing and mapping phenotypes. Chapter 1 lays the groundwork by delving into the theory of IR spectroscopy and the instrumentation of IR imaging, establishing a foundation for the subsequent studies. Chapter 2 discusses the principles of popular machine learning methods applied in IR imaging, including supervised learning, unsupervised learning and deep learning, providing the algorithmic backbone for later chapters. It also provides an overview of existing biomedical applications using label-free IR imaging combined with machine learning, giving a deeper understanding of the current research landscape and the focal points of IR imaging in traditional biomedical studies. Chapters 3-5 focus on applying IR imaging coupled with machine learning to the novel application of phenotypic profiling. Chapter 3 explores the design and development of IR-active vibrational probes for IR imaging. Three types of vibrational probes (azide, 13C-based and deuterium-based) are introduced to study the dynamic metabolic activities of proteins, lipids and carbohydrates in cells, small organisms and mice for the first time. The developed probes greatly improve the metabolic specificity of IR imaging, enhancing its sensitivity towards different phenotypes. Chapter 4 studies the combination of IR imaging, heavy water labeling and unsupervised learning for tissue metabolic profiling, providing a novel method to map a metabolic tissue atlas in complex mammalian systems. In particular, cell type-, tissue- and organ-specific metabolic profiles are identified with spatial information in situ. This method further captures metabolic changes during brain development and characterizes the intratumor metabolic heterogeneity of glioblastoma, showing great promise for disease modeling. Chapter 5 develops Vibrational Painting (VIBRANT), a method using IR imaging, multiplexed vibrational probes and supervised learning for cellular phenotypic profiling of drug perturbations. Three IR-active vibrational probes were designed to measure distinct essential metabolic activities in human cancer cells. More than 20,000 single-cell drug responses were collected, corresponding to 23 drug treatments. Supervised learning is used to accurately predict the drug mechanism of action at the single-cell level with minimal batch effects. We further designed an algorithm to discover drug candidates with novel mechanisms of action and to evaluate drug combinations. Overall, VIBRANT has demonstrated great potential across multiple areas of phenotypic drug screening.
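A minimal sketch of the supervised profiling step is given below, assuming single-cell IR spectra arranged as rows of a matrix and a simple normalize, PCA, LDA pipeline; the probe design, preprocessing, and classifiers used in the thesis are more elaborate, so this only illustrates the overall shape of the analysis.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def profile_phenotypes(spectra, labels, n_components=30):
    """spectra: (n_cells, n_wavenumbers) single-cell IR spectra;
    labels: the perturbation (e.g., drug treatment) applied to each cell.
    Returns mean cross-validated accuracy of a normalize -> PCA -> LDA
    pipeline standing in for the supervised-learning step."""
    # Vector-normalize each spectrum to reduce intensity/thickness effects.
    X = spectra / np.linalg.norm(spectra, axis=1, keepdims=True)
    model = make_pipeline(StandardScaler(),
                          PCA(n_components=n_components),
                          LinearDiscriminantAnalysis())
    return cross_val_score(model, X, labels, cv=5).mean()
```

Per-class accuracies or confusion matrices over the drug treatments would then characterize how well the measured vibrational features separate mechanisms of action.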
159

Expert Knowledge Elicitation for Machine Learning: Insights from a Survey and Industrial Case Study

Svensson, Samuel, Persson, Oskar January 2023 (has links)
While machine learning has shown success in many fields, it can be challenging to apply when training data are insufficient. By incorporating knowledge into the machine learning pipeline, one can overcome such limitations. Eliciting expert knowledge can therefore play an important role in the machine learning project pipeline. Expert knowledge comes in many forms, and it is seldom easy to elicit and formalize it in a way that is easily implementable in a machine learning project. While it has been done, little focus has been placed on how. Furthermore, the motivations for eliciting knowledge in a particular way, as well as the challenges involved in the elicitation, are not always addressed either. Making educated decisions about knowledge elicitation can therefore be challenging for researchers. Hence, this work aims to explore and categorize how expert knowledge elicitation has previously been done by researchers. This was done by developing a taxonomy that was then used to analyze articles. A total of 43 articles were found, containing 97 elicitation paths that were categorized in order to identify trends and common approaches. The findings from our study were used to provide guidance for an industrial case in its initial stage, showing how the taxonomy presented in this work can be applied in a real-world scenario.
160

Applicability analysis of computation double entendre humor recognition with machine learning methods

Johansson, David January 2016 (has links)
No description available.
