
Robustifying Machine Learning based Security Applications

Jan, Steve T. K. 27 August 2020 (has links)
In recent years, machine learning (ML) has been explored and employed in many fields. However, there are growing concerns about the robustness of machine learning models. These concerns are further amplified in security-critical applications: attackers can manipulate the inputs (i.e., adversarial examples) to cause machine learning models to make mistakes, and it is very challenging to obtain large amounts of attacker data. Together, these issues make applying machine learning to security-critical applications difficult. In this dissertation, we present approaches to robustifying three machine learning based security applications. First, we start from adversarial examples in image recognition. We develop a method to generate robust adversarial examples that remain effective in the physical domain. Our core idea is to use an image-to-image translation network to simulate the digital-to-physical transformation process when generating robust adversarial examples. We further show that these robust adversarial examples can improve the robustness of machine learning models through adversarial retraining. The second application is bot detection. We show that existing machine learning models are not effective when only limited attacker data is available. We develop a data synthesis method to address this problem. The key novelty is that our synthesis is distribution-aware: we use two different generators in a Generative Adversarial Network to synthesize data for the clustered regions and the outlier regions of the feature space. We show that detection performance using only 1% of the attacker data is close to that of existing methods trained with 100% of the attacker data. The third component of this dissertation is phishing detection. By designing a novel measurement system, we search for and detect phishing websites that adopt evasion techniques not only at the page content level but also at the web domain level.
The key novelty is that our system is built on observations of the evasive behaviors of phishing pages in practice. We also study how existing browsers defend against phishing websites that impersonate trusted entities at the web domain level. Our results show that existing browsers are not yet effective at detecting them. / Doctor of Philosophy / Machine learning (ML) refers to computer algorithms that aim to identify hidden patterns in data. In recent years, machine learning has been widely used in many fields, ranging from natural language processing to autonomous driving. However, there are growing concerns about the robustness of machine learning models, and these concerns are further amplified in security-critical applications: attackers can manipulate their inputs (i.e., adversarial examples) to cause machine learning models to make wrong predictions, and it is expensive and difficult to obtain large amounts of attacker data because attackers are rare compared to normal users. These challenges make applying machine learning in security-critical applications difficult. In this dissertation, we seek to build better defenses in three types of machine learning based security applications. The first is image recognition: by developing a method to generate realistic adversarial examples, we make machine learning models more robust against adversarial examples through adversarial retraining. The second is bot detection: we develop a data synthesis method to detect malicious bots when only limited malicious bot data is available. The third is phishing detection: we implement a tool to detect domain name impersonation and to detect phishing pages using dynamic and static analysis.
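The adversarial-example and adversarial-retraining ideas in this abstract can be shown in a minimal sketch. This is an illustration of the general concept only (an FGSM-style attack on logistic regression over synthetic 2-D data), not the dissertation's image-to-image translation method.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, epochs=200, lr=0.5):
    # plain gradient descent on the logistic loss
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        w -= lr * (X.T @ (sigmoid(X @ w) - y)) / len(y)
    return w

def fgsm(X, y, w, eps):
    # fast gradient sign method: step along the sign of dLoss/dInput
    grad_x = (sigmoid(X @ w) - y)[:, None] * w[None, :]
    return X + eps * np.sign(grad_x)

def accuracy(X, y, w):
    return float(np.mean((sigmoid(X @ w) > 0.5) == y))

# two well-separated Gaussian classes
X = np.vstack([rng.normal(-2, 1, (200, 2)), rng.normal(2, 1, (200, 2))])
y = np.concatenate([np.zeros(200), np.ones(200)])

w = train(X, y)
X_adv = fgsm(X, y, w, eps=1.5)
clean_acc = accuracy(X, y, w)
adv_acc = accuracy(X_adv, y, w)

# adversarial retraining: augment the training set with adversarial examples
w_robust = train(np.vstack([X, X_adv]), np.concatenate([y, y]))
robust_adv_acc = accuracy(X_adv, y, w_robust)
print(clean_acc, adv_acc, robust_adv_acc)
```

On this toy data the attack noticeably lowers accuracy; for a linear model retraining mainly rescales the weights, whereas the dissertation's setting (deep models, physical-domain perturbations) is far richer.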

Learning-based Cyber Security Analysis and Binary Customization for Security

Tian, Ke 13 September 2018 (has links)
This thesis presents machine-learning based malware detection and post-detection rewriting techniques for mobile and web security problems. In mobile malware detection, we focus on detecting repackaged mobile malware. We design and demonstrate an Android repackaged malware detection technique based on code heterogeneity analysis. In post-detection rewriting, we aim at enhancing app security with bytecode rewriting. We describe how flow- and sink-based risk prioritization improves the rewriting scalability. We build an interface prototype with natural language processing in order to customize apps according to natural language inputs. In web malware detection for iframe injection, we present a tag-level detection system that aims to detect the injection of malicious iframes for both online and offline cases. Our system detects malicious iframes by combining selective multi-execution and machine learning algorithms. We design multiple contextual features, considering iframe style, destination, and context properties. / Ph. D. / Our computing systems are vulnerable to different kinds of attacks. Cyber security analysis has been a concern ever since the advent of telecommunications and electronic computers. In recent years, researchers have developed various tools to protect the confidentiality, integrity, and availability of data and programs. However, new challenges are emerging in mobile and web security. Mobile malware is on the rise and threatens both data and system integrity on Android. Furthermore, web-based iframe attacks are extensively used by attackers to distribute malicious content after compromising vulnerable sites. This thesis presents malware detection and post-detection rewriting techniques for both mobile and web security. In mobile malware detection, we focus on detecting repackaged mobile malware. We propose a new Android repackaged malware detection technique based on code heterogeneity analysis.
In post-detection rewriting, we aim at enhancing app security with bytecode rewriting based on flow- and sink-based risk prioritization. To increase the feasibility of rewriting, our work showcases a new application of app customization with a friendlier user interface. In web malware detection for iframe injection, we developed a tag-level detection system that aims to detect the injection of malicious iframes for both online and offline cases. Our system detects malicious iframes by combining selective multi-execution and machine learning. We design multiple contextual features, considering iframe style, destination, and context properties.
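The intuition behind code heterogeneity analysis can be sketched with a toy dissimilarity measure: repackaged malware grafts injected code onto a benign host app, so feature vectors of different code regions tend to be dissimilar. The per-region API-call histograms below are invented for illustration; the thesis's actual features and partitioning are much richer.

```python
from collections import Counter
import math

def cosine(a, b):
    # cosine similarity between two sparse count vectors
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in keys)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def heterogeneity(regions):
    # average pairwise dissimilarity (1 - cosine) between code regions
    pairs = [(i, j) for i in range(len(regions)) for j in range(i + 1, len(regions))]
    if not pairs:
        return 0.0
    return sum(1 - cosine(regions[i], regions[j]) for i, j in pairs) / len(pairs)

# hypothetical API-call histograms per code region
benign_app = [Counter(ui=10, net=2), Counter(ui=8, net=3), Counter(ui=9, net=2)]
repackaged = [Counter(ui=10, net=2), Counter(sms=7, crypto=5, net=9)]

benign_score = heterogeneity(benign_app)   # low: regions look alike
mal_score = heterogeneity(repackaged)      # high: injected region stands out
print(benign_score, mal_score)
```

A classifier would then use such heterogeneity statistics (among other features) to flag likely repackaged apps.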

Assessing annual urban change and its impacts on evapotranspiration

Wan, Heng 19 June 2020 (has links)
Land Use Land Cover Change (LULCC) is a major component of global environmental change and can have huge impacts on biodiversity, water yield and quality, climate, soil condition, food security, and human welfare. Of all the LULCC types, urbanization is considered the most impactful. Monitoring past and current urbanization processes can provide valuable information for ecosystem services evaluation and policy-making. The National Land Cover Database (NLCD) provides land use land cover data covering the entire United States, and it is widely used as the land use land cover input in numerous environmental models. One major drawback of NLCD is that it is updated only every five years, which makes it unsatisfactory for models requiring land use land cover data at a higher temporal resolution. This dissertation integrated a rich time series of Landsat imagery and NLCD to achieve annual urban change mapping in the Washington D.C. metropolitan area by using time series change point detection methods. Three different change point detection methods were tested and compared to identify the optimal one. One major limitation of this change point approach to annual urban mapping is that it relies heavily on NLCD; as a result, it is not applicable to near real-time monitoring of urban change. To achieve near real-time identification of urban change, this research applied machine learning-based classification models, including random forests and Artificial Neural Networks (ANN), to automatically detect urban changes using a rich time series of Landsat imagery as input. Urban growth can result in a higher probability of flooding by reducing infiltration and evapotranspiration (ET). ET plays an important role in stormwater mitigation and flood reduction, so assessing changes in ET under different urban growth scenarios can yield valuable information for urban planners and policy makers.
In this study, spatially explicit annual ET data at 30-m resolution were generated for Virginia Beach by integrating daily ET data derived from the METRIC model with Landsat imagery. Annual ET rates across the major land cover types were compared, and the results indicated that converting forests to urban land could result in a large reduction in ET, thus increasing flood probability. Furthermore, we developed statistical models to explain spatial ET variation using high-resolution (1 m) land cover data. The results showed that annual ET increases with canopy cover and decreases with impervious cover and water table depth. / Doctor of Philosophy / Monitoring past and current urbanization processes is important for ecosystem services evaluation and policy-making because urban growth has huge impacts on the environment. First, this dissertation designed and compared three different methods for annual urban change mapping in the Washington D.C. metropolitan area by using a rich time series of Landsat imagery and the National Land Cover Database (NLCD). Then, machine-learning based classification models were implemented to achieve near real-time identification of urban change. Finally, spatially explicit evapotranspiration (ET) data for Virginia Beach, a case study location, were generated, and annual ET rates for major land cover types were compared to assess urbanization's impacts on ET.
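The change-point idea behind annual urban mapping can be sketched generically: per pixel, find the year at which an annual spectral-index series is best split into two constant segments. The NDVI-like series below is synthetic, and this least-squares split is only a stand-in for the dissertation's actual detectors.

```python
import numpy as np

def change_point(series):
    """Index that best splits the series into two constant segments (least squares)."""
    series = np.asarray(series, dtype=float)
    best_k, best_cost = None, np.inf
    for k in range(1, len(series)):
        left, right = series[:k], series[k:]
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k

rng = np.random.default_rng(1)
# synthetic per-pixel series: vegetated (~0.7) until 2012, impervious (~0.2) after
years = np.arange(2005, 2020)
ndvi = np.where(years < 2012, 0.7, 0.2) + rng.normal(0, 0.03, len(years))

k = change_point(ndvi)
print("detected conversion year:", years[k])
```

Running this per pixel over a Landsat stack yields a map of estimated conversion years, which can then be reconciled against NLCD epochs.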

Modeling and Analysis of Non-Linear Dependencies using Copulas, with Applications to Machine Learning

Karra, Kiran 21 September 2018 (has links)
Many machine learning (ML) techniques rely on probability, random variables, and stochastic modeling. Although statistics pervades this field, there is a large disconnect between the copula modeling and machine learning communities. Copulas are stochastic models that capture the full dependence structure between random variables and allow flexible modeling of multivariate joint distributions. Elidan was the first to recognize this disconnect and introduced copula-based models to the ML community that demonstrated orders of magnitude better performance than non-copula-based models [Elidan, 2013]. However, these models are only applicable to continuous random variables, while real-world data is often naturally modeled jointly as continuous and discrete. This report details our work in bridging this gap: modeling and analyzing data that is jointly continuous and discrete using copulas. Our first research contribution details the modeling of jointly continuous and discrete random variables using the copula framework with Bayesian networks, termed Hybrid Copula Bayesian Networks (HCBN) [Karra and Mili, 2016], a continuation of Elidan's work on Copula Bayesian Networks [Elidan, 2010]. In this work, we extend the theorems proved by Nešlehová [2007] from bivariate to multivariate copulas with discrete and continuous marginal distributions. Using the multivariate copula with discrete and continuous marginal distributions as a theoretical basis, we construct an HCBN that can model all possible permutations of discrete and continuous random variables for parent and child nodes, unlike the popular conditional linear Gaussian network model. Finally, we demonstrate on numerous synthetic datasets and a real-life dataset that our HCBN compares favorably, from a modeling and flexibility viewpoint, to other hybrid models including the conditional linear Gaussian and the mixture of truncated exponentials models.
Our second research contribution deals with the analysis side and discusses how one may use copulas for exploratory data analysis. To this end, we introduce a nonparametric copula-based index for detecting the strength and monotonicity structure of linear and nonlinear statistical dependence between pairs of random variables or stochastic signals. Our index, termed the Copula Index for Detecting Dependence and Monotonicity (CIM), satisfies several desirable properties of measures of association, including Rényi's properties, the data processing inequality (DPI), and consequently self-equitability. Synthetic data simulations reveal that the statistical power of CIM compares favorably to other state-of-the-art measures of association that are proven to satisfy the DPI. Simulation results with real-world data reveal CIM's unique ability to detect the monotonicity structure among stochastic signals and to find interesting dependencies in large datasets. Additionally, simulations show that CIM performs favorably compared to estimators of mutual information when discovering Markov network structure. Our third research contribution deals with how to assess an estimator's performance in the scenario where multiple estimates of the strength of association between random variables need to be rank ordered. More specifically, we introduce a new property of estimators of the strength of statistical association, which helps characterize how well an estimator will perform in scenarios where dependencies between continuous and discrete random variables need to be rank ordered. The new property, termed the estimator response curve, is easily computable and provides a marginal-distribution-agnostic way to assess an estimator's performance. It overcomes notable drawbacks of current metrics of assessment, including statistical power, bias, and consistency.
We utilize the estimator response curve to test various measures of the strength of association that satisfy the data processing inequality (DPI), and show that the CIM estimator's performance compares favorably to the kNN, vME, AP, and HMI estimators of mutual information. The estimators identified as suboptimal according to the estimator response curve perform worse than the more optimal estimators when tested with real-world data from four different areas of science, all with varying dimensionalities and sizes. / Ph. D. / Many machine learning (ML) techniques rely on probability, random variables, and stochastic modeling. Although statistics pervades this field, many traditional machine learning techniques rely on linear statistical techniques and models. For example, the correlation coefficient, a widely used construct in modern data analysis, is only a measure of linear dependence and cannot fully capture non-linear interactions. In this dissertation, we aim to address some of these gaps, and how they affect machine learning performance, using the mathematical construct of copulas. Our first contribution deals with accurate probabilistic modeling of real-world data, where the underlying data is both continuous and discrete. We show that even though the copula construct has some limitations with respect to discrete data, it is still amenable to modeling large real-world datasets probabilistically. Our second contribution deals with the analysis of non-linear datasets. Here, we develop a new measure of statistical association that can handle discrete, continuous, or mixed random variables related by any general association pattern. We show that our new metric satisfies several desirable properties and compare its performance to other measures of statistical association.
Our final contribution provides a framework for understanding how an estimator of statistical association affects end-to-end machine learning performance. Here, we develop the estimator response curve, a new way to characterize the performance of an estimator of statistical association. We then show that the estimator response curve can help predict how well an estimator performs in algorithms that require statistical associations to be rank ordered.
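A hedged illustration of the core copula idea in this work: the dependence structure lives in the ranks (the empirical copula), so any strictly increasing transform of a marginal changes Pearson correlation but leaves rank-based dependence untouched. The data below is synthetic, and Spearman's rho stands in for the thesis's more sophisticated CIM index.

```python
import numpy as np

def empirical_copula(x, y):
    # normalized ranks in (0, 1]: the empirical copula sample
    n = len(x)
    u = (np.argsort(np.argsort(x)) + 1) / n
    v = (np.argsort(np.argsort(y)) + 1) / n
    return u, v

def spearman(x, y):
    # Pearson correlation of the rank-transformed (copula) sample
    u, v = empirical_copula(x, y)
    return float(np.corrcoef(u, v)[0, 1])

rng = np.random.default_rng(2)
x = rng.normal(size=1000)
y = x + 0.3 * rng.normal(size=1000)
y_t = np.exp(3 * y)       # strictly increasing, highly non-linear transform

pearson_raw, pearson_t = np.corrcoef(x, y)[0, 1], np.corrcoef(x, y_t)[0, 1]
spearman_raw, spearman_t = spearman(x, y), spearman(x, y_t)
print("Pearson: ", pearson_raw, "->", pearson_t)
print("Spearman:", spearman_raw, "->", spearman_t)
```

Pearson correlation collapses under the marginal transform while the rank-based measure is exactly invariant, which is why copula-based indices are attractive for detecting general monotone dependence.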

Accelerating Catalyst Discovery via Ab Initio Machine Learning

Li, Zheng 03 December 2019 (has links)
In recent decades, machine learning techniques have received an explosion of interest in the domain of high-throughput materials discovery, largely attributable to the fast-growing development of quantum-chemical methods and learning algorithms. Nevertheless, machine learning for catalysis is still in its initial stages due to our insufficient knowledge of structure-property relationships. In this regard, we demonstrate a holistic machine-learning framework that serves as a surrogate model for expensive density functional theory calculations to facilitate the discovery of high-performance catalysts. The framework, which integrates descriptor-based kinetic analysis, material fingerprinting, and machine learning algorithms, can rapidly explore a broad materials space with enormous compositional and configurational degrees of freedom prior to expensive quantum-chemical calculations and/or experimental testing. Importantly, advanced machine learning approaches (e.g., global sensitivity analysis, principal component analysis, and exploratory analysis) can be utilized to shed light on the underlying physical factors governing catalytic activity on diverse types of catalytic materials with different applications. Chapter 1 introduces basic concepts and knowledge relating to computational catalyst design. Chapters 2 and 3 demonstrate the methodology for constructing machine-learning models for bimetallic catalysts. In Chapter 4, the multi-functionality of the machine-learning models is illustrated by elucidating metalloporphyrins' underlying structure-property relationships. In Chapter 5, an uncertainty-guided machine learning strategy is introduced to tackle the challenge of data deficiency for perovskite electrode materials design in the electrochemical water splitting cell.
/ Doctor of Philosophy / Machine learning and deep learning techniques have revolutionized a range of industries in recent years and have huge potential to improve every aspect of our daily lives. Essentially, machine learning provides algorithms with the ability to automatically discover the hidden patterns of data without being explicitly programmed. Because of this, machine learning models have achieved huge success in applications such as website recommendation systems, online fraud detection, robotic technologies, image recognition, etc. Nevertheless, implementing machine learning techniques in the field of catalyst design remains difficult due to two primary challenges. The first challenge is our insufficient knowledge about the structure-property relationships for diverse material systems. Typically, developing a physically intuitive material featurization method requires in-depth expert knowledge about the underlying physics of the material system, and this remains an active field of research. The second challenge is the lack of training data in academic research. In many cases, collecting a sufficient amount of training data is not feasible due to the limitations of computational or experimental resources. Consequently, a machine learning model optimized with small data tends to be over-fitted and can provide biased predictions with huge uncertainties. To address the above-mentioned challenges, this thesis focuses on the development of robust feature methods and strategies for a variety of catalyst systems using density functional theory (DFT) calculations. Through the case studies in the following chapters, we show that bulk electronic structure characteristics are successful features for capturing the adsorption properties of metal alloys and metal oxides, while molecular graphs are robust features for molecular properties, e.g., the energy gap, of metal-organic compounds.
In addition, we demonstrate that an adaptive machine learning workflow is an effective strategy for tackling the data deficiency issue in the search for perovskite catalysts for the oxygen evolution reaction.
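The surrogate-model screening loop described above can be sketched generically: fit a cheap regressor from material fingerprints to a DFT-computed target, then rank a large candidate pool so only the most promising structures go to full DFT. Everything below, including the three-component "fingerprint" and the linear ground truth, is a synthetic stand-in, not the thesis's actual descriptors.

```python
import numpy as np

def ridge_fit(X, y, lam=1e-3):
    # closed-form ridge regression: (X'X + lam I)^-1 X'y
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_feat), X.T @ y)

rng = np.random.default_rng(3)

# pretend each row is a material fingerprint; the target mimics an adsorption energy
true_w = np.array([1.5, -2.0, 0.5])
X_train = rng.normal(size=(50, 3))               # 50 "DFT-computed" materials
y_train = X_train @ true_w + rng.normal(0, 0.05, 50)

w = ridge_fit(X_train, y_train)

X_pool = rng.normal(size=(10_000, 3))            # cheap-to-fingerprint candidates
scores = X_pool @ w
top = np.argsort(scores)[:10]                    # e.g., most negative predicted energy
print("send these candidate indices to DFT:", top)
```

The point of the loop is leverage: 50 expensive calculations train a model that triages 10,000 candidates, and only the shortlist is verified with quantum chemistry.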

Investigating the Convergent, Discriminant, and Predictive Validity of the Mental Toughness Situational Judgment Test

Flannery, Nicholas Martin 19 June 2020 (has links)
This study investigated the validity of scores on a workplace-based measure of mental toughness, the Mental Toughness Situational Judgment Test (MTSJT). The goal of the study was to determine whether MTSJT scores predicted supervisor ratings 1) differentially compared to other measures of mental toughness, grit, and resilience, and 2) incrementally beyond cognitive ability and conscientiousness. Further, two machine learning algorithms, elastic nets and random forests, were used to model predictions at both the item and scale level. MTSJT scores provided the most accurate predictions overall when modeled at the item level via a random forest approach. The MTSJT was the only measure to consistently provide incremental validity when predicting supervisor ratings. The results further emphasize the growing importance of both mental toughness and machine learning algorithms to industrial/organizational psychologists. / Doctor of Philosophy / The study investigated whether the Mental Toughness Situational Judgment Test (MTSJT), a measure of mental toughness directly in the workplace, could predict employees' supervisor ratings. Further, the study aimed to understand whether the MTSJT was a better predictor than other measures of mental toughness, grit, resilience, intelligence, and conscientiousness. The study used machine learning algorithms to generate predictive models using both question-level and scale-level scores. The results suggested that MTSJT scores predicted supervisor ratings at both the question and scale level using a random forest model. Further, the MTSJT was a better predictor than most other measures included in the study. The results emphasize the growing importance of both mental toughness and machine learning algorithms to industrial/organizational psychologists.
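One of the two algorithms named above, the elastic net, can be sketched with a generic proximal-gradient (ISTA) fit. The "item scores" and "supervisor ratings" below are synthetic stand-ins, not study data; the sketch only shows why item-level modeling with an elastic net performs variable selection.

```python
import numpy as np

def elastic_net(X, y, l1=0.05, l2=0.05, lr=0.01, iters=5000):
    # proximal gradient: gradient step on MSE + ridge, then L1 soft-threshold
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n + l2 * w
        w = w - lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * l1, 0.0)
    return w

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 20))            # 20 item-level predictors
beta = np.zeros(20)
beta[:3] = [1.0, -0.8, 0.6]               # only three items actually matter
y = X @ beta + rng.normal(0, 0.1, 300)    # synthetic "supervisor ratings"

w = elastic_net(X, y)
print("selected items:", np.flatnonzero(np.abs(w) > 1e-6))
```

The L1 part zeroes out uninformative items while the L2 part stabilizes the fit, which is what makes elastic nets attractive when many item scores are candidate predictors.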

Assessing Structure–Property Relationships of Crystal Materials using Deep Learning

Li, Zheng 05 August 2020 (has links)
In recent years, deep learning technologies have received huge attention and interest in the field of high-performance material design. This is primarily because deep learning algorithms inherently have huge advantages over conventional machine learning models in processing massive amounts of unstructured data with high performance. Moreover, deep learning models are capable of recognizing the hidden patterns among unstructured data automatically, without relying on excessive human domain knowledge. Nevertheless, constructing a robust deep learning model for assessing materials' structure-property relationships remains a non-trivial task due to the highly flexible model architectures and the challenge of selecting appropriate material representation methods. In this regard, we develop advanced deep learning models and implement them for predicting quantum-chemically calculated properties (e.g., formation energy) for an enormous number of crystal systems. Chapter 1 briefly introduces the fundamental theory behind deep learning models (e.g., CNNs, GNNs) and advanced analysis methods (e.g., saliency maps). In Chapter 2, a convolutional neural network (CNN) model is established to find the correlation between the physically intuitive partial electronic density of states (PDOS) and the formation energies of crystals. Importantly, advanced analysis methods (i.e., saliency map analysis) are utilized to shed light on the underlying physical factors governing the energy properties. In Chapter 3, we introduce the methodology of implementing cutting-edge graph neural network (GNN) models to learn desired properties from an enormous number of crystal structures. / Master of Science / Machine learning technologies, particularly deep learning, have demonstrated remarkable progress in facilitating the high-throughput materials discovery process.
In essence, machine learning algorithms have the ability to uncover hidden patterns in data and make appropriate decisions without being explicitly programmed. Nevertheless, implementing machine learning models in the field of material design remains a challenging task. One of the biggest limitations is our insufficient knowledge about the structure-property relationships of material systems. The performance of machine learning models is to a large degree determined by the underlying material representation method, which typically requires experts with in-depth knowledge of the material systems. Thus, designing effective feature representation methods is the most crucial aspect of machine learning model development, and the process takes a significant amount of manual effort. Even though tremendous efforts have been made in recent years, progress on robust feature representation methods is still slow. In this regard, we attempt to automate the feature engineering process with the assistance of advanced deep learning algorithms. Unlike conventional machine learning models, our deep learning models (i.e., convolutional neural networks and graph neural networks) are capable of processing massive amounts of structured data such as spectra and crystal graphs. Specifically, the deep learning models are explicitly designed to learn, in an automatic fashion, the hidden latent variables contained in crystal structures and to provide accurate predictions. We believe deep learning models have huge potential to simplify the machine learning modeling process and facilitate the discovery of promising functional materials.
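The graph-network idea from Chapter 3 can be reduced to a minimal, untrained sketch: atoms are nodes, neighbor features are mean-aggregated and transformed each round, and a permutation-invariant readout pools node features into one crystal-level vector for property prediction. The 4-atom graph and random weights below are toy stand-ins, not a trained model.

```python
import numpy as np

def mp_step(H, A, W):
    # one message-passing round: mean-aggregate neighbors, transform, ReLU
    deg = A.sum(axis=1, keepdims=True)
    msg = (A @ H) / np.maximum(deg, 1)
    return np.maximum(msg @ W, 0.0)

def readout(H):
    # permutation-invariant pooling to a per-crystal vector
    return H.mean(axis=0)

def forward(H, A, W1, W2):
    return readout(mp_step(mp_step(H, A, W1), A, W2))

rng = np.random.default_rng(5)
# toy "crystal graph": 4 atoms, symmetric adjacency
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
H0 = rng.normal(size=(4, 8))               # initial atom features
W1, W2 = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))

g = forward(H0, A, W1, W2)                 # crystal-level embedding
w_out = rng.normal(size=8)                 # linear head, e.g., formation energy
print("predicted property (untrained):", float(g @ w_out))
```

A key property worth noting is that relabeling the atoms leaves the prediction unchanged, which is what makes graph representations natural for crystals.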

Classification of Faults in Railway Ties Using Computer Vision and Machine Learning

Kulkarni, Amruta Kiran 30 June 2017 (has links)
This work focuses on automated classification of railway ties based on their condition, using aerial imagery. Four approaches are explored and compared to achieve this goal: handcrafted features, HOG features, transfer learning, and a proposed CNN architecture. Mean test accuracy per class and the Quadratic Weighted Kappa score are used as performance metrics, which are particularly suited to the ordered classification in this work. The transfer learning approach outperforms the handcrafted and HOG features by a significant margin. The proposed CNN architecture caters to the unique nature of the railway tie images and their defects. Its performance is superior to that of the handcrafted and HOG features, and it requires significantly fewer parameters than the transfer learning approach. Data augmentation boosts the performance of all approaches. The problem of label noise is also analyzed. The techniques proposed in this work will help reduce the time, cost, and dependency on experts involved in traditional railway tie inspections and will facilitate efficient documentation and planning for the maintenance of railway ties. / Master of Science
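The core of the HOG features compared above can be sketched in a few lines: gradient orientation histograms computed per cell. Real HOG adds overlapping blocks and block normalization; this hedged mini-version keeps only the central computation, on a synthetic 16x16 "tie image" with a single vertical edge.

```python
import numpy as np

def hog_cells(img, cell=8, bins=9):
    # per-cell histograms of unsigned gradient orientation, weighted by magnitude
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180      # unsigned orientation in [0, 180)
    h, w = img.shape
    feats = []
    for i in range(0, h - cell + 1, cell):
        for j in range(0, w - cell + 1, cell):
            m = mag[i:i + cell, j:j + cell].ravel()
            a = ang[i:i + cell, j:j + cell].ravel()
            hist, _ = np.histogram(a, bins=bins, range=(0, 180), weights=m)
            feats.append(hist)
    return np.concatenate(feats)

# synthetic image: left half dark, right half bright (one vertical edge)
img = np.tile(np.repeat([0.0, 1.0], 8), (16, 1))
f = hog_cells(img)
print("feature length:", f.shape[0])               # 4 cells x 9 bins
```

The vertical edge produces purely horizontal gradients, so all histogram mass lands in the first (near-0 degree) orientation bin; a classifier then operates on such feature vectors.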

Natural Language Driven Image Edits using a Semantic Image Manipulation Language

Mohapatra, Akrit 04 June 2018 (has links)
Language provides us with a powerful tool to articulate and express ourselves! Understanding and harnessing the expressions of natural language can open the doors to a vast array of creative applications. In this work we explore one such application: natural language based image editing. We propose a novel framework to go from free-form natural language commands to fine-grained image edits. Recent progress in the field of deep learning has motivated solving most tasks using end-to-end deep convolutional frameworks. Such methods have been shown to be very successful, even achieving super-human performance in some cases. Although this progress shows significant promise, we believe there is still work to be done before such methods can be effectively applied to a task like fine-grained image editing. We approach the problem by dissecting the inputs (image and language query) and focusing on understanding the language input using traditional natural language processing (NLP) techniques. We start by parsing the input query to identify the entities, attributes, and relationships, and generate a command entity representation. We define our own high-level image manipulation language that serves as an intermediate programming language, connecting natural language requests that represent a creative intent over an image to the lower-level operations needed to execute them. The semantic command entity representations are mapped into this high-level language to carry out the intended execution. / Master of Science
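The parsing stage described above can be illustrated with a toy version: map a free-form edit request to a small "command entity" dict. The vocabulary, the action names, and the dict layout below are invented for illustration and are not the thesis's actual manipulation language.

```python
import re

# hypothetical vocabulary for the sketch
ACTIONS = {"increase": "ADJUST_UP", "decrease": "ADJUST_DOWN",
           "remove": "DELETE", "blur": "BLUR"}
ATTRIBUTES = {"brightness", "contrast", "saturation"}
STOPWORDS = {"the", "of", "a", "an", "please"}

def parse(query):
    """Map a free-form edit request to a toy command-entity representation."""
    tokens = re.findall(r"[a-z]+", query.lower())
    cmd = {"action": None, "attribute": None, "object": None}
    for tok in tokens:
        if tok in ACTIONS and cmd["action"] is None:
            cmd["action"] = ACTIONS[tok]
        elif tok in ATTRIBUTES:
            cmd["attribute"] = tok
        elif cmd["action"] and tok not in STOPWORDS:
            cmd["object"] = tok
    return cmd

print(parse("Please increase the brightness of the sky"))
print(parse("blur the background"))
```

A downstream stage would then map such command entities into the intermediate manipulation language and execute the corresponding low-level image operations.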

Synthesizing a Hybrid Benchmark Suite with BenchPrime

Wu, Xiaolong 09 October 2018 (has links)
This paper presents BenchPrime, an automated benchmark analysis toolset that is systematic and extensible to analyze the similarity and diversity of benchmark suites. BenchPrime takes multiple benchmark suites and their evaluation metrics as inputs and generates a hybrid benchmark suite comprising only essential applications. Unlike prior work, BenchPrime uses linear discriminant analysis rather than principal component analysis, and it selects the best clustering algorithm and the optimal number of clusters in an automated and metric-tailored way, thereby achieving high accuracy. In addition, BenchPrime ranks the benchmark suites in terms of their application set diversity and estimates how unique each benchmark suite is compared to the other suites. As a case study, this work for the first time compares DenBench with MediaBench and MiBench using four different metrics to provide a multi-dimensional understanding of the benchmark suites. For each metric, BenchPrime measures to what degree DenBench applications are irreplaceable by those in MediaBench and MiBench. This provides a means of identifying an essential subset from the three benchmark suites without compromising the application balance of the full set. The experimental results show that the necessity of including DenBench applications varies across the target metrics and that significant redundancy exists among the three benchmark suites. / Master of Science / Representative benchmarks are widely used in research to achieve an accurate and fair evaluation of hardware and software techniques. However, redundant applications in a benchmark set can skew the average towards redundant characteristics, overestimating the benefit of any proposed research. This work proposes a machine learning-based framework, BenchPrime, that generates a hybrid benchmark suite comprising only essential applications.
In addition, BenchPrime ranks the benchmark suites in terms of their application set diversity and estimates how unique each benchmark suite is compared to other suites.
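The selection idea can be sketched generically: cluster benchmark feature vectors and keep one representative (medoid) per cluster, so the hybrid suite covers the same behavior space with fewer applications. Plain k-means with a farthest-point initialization stands in here for BenchPrime's automated, metric-tailored choice of clustering algorithm; the "benchmark characteristics" are synthetic.

```python
import numpy as np

def init_centers(X, k, seed=0):
    # farthest-point heuristic: spreads initial centers across the data
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    return np.array(centers)

def kmeans(X, k, iters=50):
    centers = init_centers(X, k)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers

def representatives(X, labels, centers):
    # medoid: the real benchmark closest to each cluster center
    reps = []
    for c in range(centers.shape[0]):
        idx = np.flatnonzero(labels == c)
        reps.append(int(idx[np.argmin(((X[idx] - centers[c]) ** 2).sum(-1))]))
    return sorted(reps)

rng = np.random.default_rng(6)
# synthetic "benchmark characteristics": three distinct behavior groups of 10 apps
X = np.vstack([rng.normal(m, 0.2, (10, 4)) for m in (0.0, 2.0, 4.0)])

labels, centers = kmeans(X, 3)
reps = representatives(X, labels, centers)
print("hybrid suite:", reps)
```

Each representative then stands in for its cluster, which is how a hybrid suite keeps only "essential" applications while preserving the diversity of the full set.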
