31
Learning control knowledge within an explanation-based learning framework
Desimone, Roberto V., January 1989 (has links)
No description available.
32
Development and validation of deep learning classifiers for antimicrobial peptide prediction
Yan, Jie Lu, January 2018 (has links)
University of Macau / Faculty of Science and Technology. / Department of Computer and Information Science
33
Understanding the Phishing Ecosystem
Le Page, Sophie, 08 July 2019 (has links)
In “phishing attacks”, phishing websites mimic trustworthy websites in order to steal sensitive information from end-users. Despite research by both academia and industry on anti-phishing detection techniques, phishing has increasingly become an online threat. Our inability to slow down phishing attacks shows that we need to go beyond detection and focus more on understanding the phishing ecosystem. In this thesis, we make three contributions toward understanding the phishing ecosystem and offer insight for future anti-phishing efforts.
First, we provide a new, comparative study of the life cycle of phishing and malware attacks. Specifically, we use public click-through statistics of the Bitly URL shortening service to analyze the click-through rate and timespan of phishing and malware attacks before (and after) they were reported. We find that the efforts against phishing attacks are stronger than those against malware attacks. We also find phishing activity indicating that mitigation strategies are not taking down phishing websites fast enough.
Second, we develop a method that finds similarities between the DOMs of phishing attacks, since phishing attacks are known to be variations of previous attacks. We find that existing methods do not capture the structure of the DOM, and we question whether they fail to catch some of the similar attacks. We accordingly evaluate the feasibility of applying Pawlik and Augsten’s recent implementation of Tree Edit Distance (AP-TED) calculations as a way to compare DOMs and identify similar phishing attack instances. Our method agrees with existing ones that 94% of our phishing database consists of replicas. It also discriminates the similarities better, but at a higher computational cost. The high agreement between methods strengthens the understanding that most phishing attacks are variations, which affects future anti-phishing strategies.
Third, we develop a domain classifier that exploits the history and internet presence of a domain with machine learning techniques. It uses only publicly available information to determine whether a known phishing website is hosted on a legitimate but compromised domain, in which case the domain owner is also a victim, or whether the domain itself is maliciously registered. This is especially relevant given the recent adoption of the General Data Protection Regulation (GDPR), which prevents certain registration information from being made publicly available. Our classifier achieves 94% accuracy on future malicious domains, while maintaining 88% and 92% accuracy on malicious and compromised datasets, respectively, from two other sources. Accurate domain classification offers insight into different take-down strategies and into registrars’ prevention of fraudulent registrations.
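The AP-TED comparison described above lends itself to a short illustration. The sketch below is a minimal, hedged example, not the thesis's implementation: it assumes the third-party `apted` Python package (a port of Pawlik and Augsten's AP-TED) and toy DOM trees written in that package's bracket notation.

```python
# Minimal sketch: compare two toy DOM trees with AP-TED and normalize by size.
# Assumes the third-party `apted` package; real phishing-page DOMs would first
# be serialized into this bracket notation from parsed HTML.
from apted import APTED
from apted.helpers import Tree

text_a = "{html{head{title}}{body{div{form{input}{input}}}}}"
text_b = "{html{head{title}}{body{div{form{input}{input}{input}}}}}"
dom_a, dom_b = Tree.from_text(text_a), Tree.from_text(text_b)

# Edit distance: the number of node insertions, deletions, and renames needed
# to turn one tree into the other. Smaller means more similar page structure.
distance = APTED(dom_a, dom_b).compute_edit_distance()

# Normalizing by the larger tree's node count makes pages of different sizes
# comparable; a small normalized distance flags a likely replica attack.
size = max(text_a.count("{"), text_b.count("{"))
print(distance, distance / size)
```

A pairwise pass with such a distance over a phishing database, thresholded on the normalized value, would group attack instances into families of replicas, at the quadratic cost the abstract alludes to.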
34
Investigation on prototype learning. January 2000 (has links)
Keung Chi-Kin. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2000. / Includes bibliographical references (leaves 128-135). / Abstracts in English and Chinese. / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Classification --- p.2 / Chapter 1.2 --- Instance-Based Learning --- p.4 / Chapter 1.2.1 --- Three Basic Components --- p.5 / Chapter 1.2.2 --- Advantages --- p.6 / Chapter 1.2.3 --- Disadvantages --- p.7 / Chapter 1.3 --- Thesis Contributions --- p.7 / Chapter 1.4 --- Thesis Organization --- p.8 / Chapter 2 --- Background --- p.10 / Chapter 2.1 --- Improving Instance-Based Learning --- p.10 / Chapter 2.1.1 --- Scaling-up Nearest Neighbor Searching --- p.11 / Chapter 2.1.2 --- Data Reduction --- p.12 / Chapter 2.2 --- Prototype Learning --- p.12 / Chapter 2.2.1 --- Objectives --- p.13 / Chapter 2.2.2 --- Two Types of Prototype Learning --- p.15 / Chapter 2.3 --- Instance-Filtering Methods --- p.15 / Chapter 2.3.1 --- Retaining Border Instances --- p.16 / Chapter 2.3.2 --- Removing Border Instances --- p.21 / Chapter 2.3.3 --- Retaining Center Instances --- p.22 / Chapter 2.3.4 --- Advantages --- p.23 / Chapter 2.3.5 --- Disadvantages --- p.24 / Chapter 2.4 --- Instance-Abstraction Methods --- p.25 / Chapter 2.4.1 --- Advantages --- p.30 / Chapter 2.4.2 --- Disadvantages --- p.30 / Chapter 2.5 --- Other Methods --- p.32 / Chapter 2.6 --- Summary --- p.34 / Chapter 3 --- Integration of Filtering and Abstraction --- p.36 / Chapter 3.1 --- Incremental Integration --- p.37 / Chapter 3.1.1 --- Motivation --- p.37 / Chapter 3.1.2 --- The Integration Method --- p.40 / Chapter 3.1.3 --- Issues --- p.41 / Chapter 3.2 --- Concept Integration --- p.42 / Chapter 3.2.1 --- Motivation --- p.43 / Chapter 3.2.2 --- The Integration Method --- p.44 / Chapter 3.2.3 --- Issues --- p.45 / Chapter 3.3 --- Difference between Integration Methods and Composite Classifiers --- p.48 / Chapter 4 --- The PGF Framework --- p.49 / Chapter 4.1 --- The PGF1 Algorithm --- p.50 / Chapter 4.1.1 --- Instance-Filtering Component --- p.51 / Chapter 4.1.2 --- Instance-Abstraction Component --- p.52 / Chapter 4.2 --- The PGF2 Algorithm --- p.56 / Chapter 4.3 --- Empirical Analysis --- p.57 / Chapter 4.3.1 --- Experimental Setup --- p.57 / Chapter 4.3.2 --- Results of PGF Algorithms --- p.59 / Chapter 4.3.3 --- Analysis of PGF1 --- p.61 / Chapter 4.3.4 --- Analysis of PGF2 --- p.63 / Chapter 4.3.5 --- Overall Behavior of PGF --- p.66 / Chapter 4.3.6 --- Comparisons with Other Approaches --- p.69 / Chapter 4.4 --- Time Complexity --- p.72 / Chapter 4.4.1 --- Filtering Components --- p.72 / Chapter 4.4.2 --- Abstraction Component --- p.74 / Chapter 4.4.3 --- PGF Algorithms --- p.74 / Chapter 4.5 --- Summary --- p.75 / Chapter 5 --- Integrated Concept Prototype Learner --- p.77 / Chapter 5.1 --- Motivation --- p.78 / Chapter 5.2 --- Abstraction Component --- p.80 / Chapter 5.2.1 --- Issues for Abstraction --- p.80 / Chapter 5.2.2 --- Investigation on Typicality --- p.82 / Chapter 5.2.3 --- Typicality in Abstraction --- p.85 / Chapter 5.2.4 --- The TPA algorithm --- p.86 / Chapter 5.2.5 --- Analysis of TPA --- p.90 / Chapter 5.3 --- Filtering Component --- p.93 / Chapter 5.3.1 --- Investigation on Associate --- p.96 / Chapter 5.3.2 --- The RT2 Algorithm --- p.100 / Chapter 5.3.3 --- Analysis of RT2 --- p.101 / Chapter 5.4 --- Concept Integration --- p.103 / Chapter 5.4.1 --- The ICPL Algorithm --- p.104 / Chapter 5.4.2 --- Analysis of ICPL --- p.106 / Chapter 5.5 --- Empirical Analysis --- p.106 / Chapter 5.5.1 --- Experimental Setup --- p.106 / Chapter 5.5.2 --- Results of ICPL Algorithm --- p.109 / Chapter 5.5.3 --- Comparisons with Pure Abstraction and Pure Filtering --- p.110 / Chapter 5.5.4 --- Comparisons with Other Approaches --- p.114 / Chapter 5.6 --- Time Complexity --- p.119 / Chapter 5.7 --- Summary --- p.120 / Chapter 6 --- Conclusions and Future Work --- p.122 / Chapter 6.1 --- Conclusions --- p.122 / Chapter 6.2 --- Future Work --- p.126 / Bibliography --- p.128 / Chapter A --- Detailed Information for Tested Data Sets --- p.136 / Chapter B --- Detailed Experimental Results for PGF --- p.138
35
Training example adaptation for text categorization. January 2005 (has links)
Ko Hon Man. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2005. / Includes bibliographical references (leaves 68-72). / Abstracts in English and Chinese. / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Background and Motivation --- p.1 / Chapter 1.2 --- Thesis Organization --- p.4 / Chapter 2 --- Related Work --- p.6 / Chapter 2.1 --- Semi-supervised learning --- p.6 / Chapter 2.2 --- Hierarchical Categorization --- p.10 / Chapter 3 --- Framework Overview --- p.13 / Chapter 4 --- Inherent Concept Detection --- p.18 / Chapter 4.1 --- Data Preprocessing --- p.18 / Chapter 4.2 --- Concept Detection Algorithm --- p.22 / Chapter 4.3 --- Kernel-based Distance Measure --- p.27 / Chapter 5 --- Training Example Discovery from Unlabeled Documents --- p.33 / Chapter 5.1 --- Training Document Discovery --- p.33 / Chapter 5.2 --- Automatically determining the number of extracted positive examples --- p.37 / Chapter 5.3 --- Classification Model --- p.39 / Chapter 6 --- Experimental Evaluation --- p.44 / Chapter 6.1 --- Corpus Description --- p.44 / Chapter 6.2 --- Evaluation Metric --- p.49 / Chapter 6.3 --- Result Analysis --- p.50 / Chapter 7 --- Conclusions and Future Work --- p.66 / Bibliography --- p.68 / Chapter A --- Detailed result on the inherent concept detection process for the TDT and RCV1 corpora --- p.73
36
Learning by propagation. / CUHK electronic theses & dissertations collection. January 2008 (has links)
In this thesis, we consider the general problem of classifying a data set into a number of subsets, which has been one of the most fundamental problems in machine learning. Specifically, we mainly address four common learning problems in three active research fields: semi-supervised classification, semi-supervised clustering, and unsupervised clustering. The first problem we consider is semi-supervised classification from both unlabeled data and pairwise constraints. The pairwise constraints specify whether two objects belong to the same class or not. Our aim is to propagate the pairwise constraints to the entire data set. We formulate the propagation model as a semidefinite programming (SDP) problem, which can be globally solved reliably. Our approach is applicable to multi-class problems and handles class labels, pairwise constraints, or a mixture of them in a unified framework.
The second problem is semi-supervised clustering with pairwise constraints. We present a principled framework for learning a data-driven and constraint-consistent nonlinear mapping to reshape the data in a feature space. We formulate the problem as a small-scale SDP problem, whose size is independent of the numbers of the objects and the constraints. Thus it can be globally solved efficiently. Our framework has several attractive features. First, it can effectively propagate pairwise constraints, when available, to the entire data set. Second, it scales well to large-scale problems. Third, it can effectively handle noisy constraints. Fourth, in the absence of constraints, it becomes a novel kernel-based clustering algorithm that can discover linearly non-separable clusters.
Third, we deal with noise-robust clustering. Many clustering algorithms, including spectral clustering, often fail on noisy data. We propose a data warping model to map the data into a new space. During the warping, each object spreads its spatial information smoothly over the data graph to other objects. After the warping, hopefully each cluster becomes compact and different clusters become well-separated, including the noise cluster that is formed by the noise objects. The proposed clustering algorithm can handle significantly noisy data, and can find the number of clusters automatically.
Finally, we study how to construct an appropriate graph for spectral clustering. Given a local similarity matrix (a graph), we propose an iterative regularization procedure to iteratively enhance its cluster structure, leading to a global similarity matrix. Significant improvement in clustering performance is observed when the new graph is used for spectral clustering.
Li, Zhenguo. / Adviser: Liu Jianzhuang. / Source: Dissertation Abstracts International, Volume: 70-06, Section: B, page: 3604. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2008. / Includes bibliographical references (leaves 121-131). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstracts in English and Chinese. / School code: 1307.
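For context on the downstream algorithm, here is a minimal sketch of standard normalized spectral clustering (the Ng-Jordan-Weiss recipe) applied to a similarity matrix such as the one this thesis's regularization procedure produces; the iterative regularization itself is not reproduced here.

```python
# Minimal sketch of normalized spectral clustering on an n x n symmetric
# similarity matrix W. The thesis's contribution is constructing a better W;
# this is only the standard consumer of such a matrix.
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W: np.ndarray, k: int) -> np.ndarray:
    d = W.sum(axis=1)
    # Symmetrically normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}
    d_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    L = np.eye(len(W)) - d_inv_sqrt @ W @ d_inv_sqrt
    # Eigenvectors for the k smallest eigenvalues embed the objects in R^k.
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]
    # Row-normalize the embedding, then cluster it with k-means.
    U /= np.linalg.norm(U, axis=1, keepdims=True) + 1e-12
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```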
37
Predictive analytics of institutional attrition
Velumula, Sindhu, January 1900 (has links)
Master of Science / Department of Computer Science / William H. Hsu
Institutional attrition refers to the phenomenon of members of an organization leaving it over time - a costly challenge faced by many institutions. This work focuses on the problem of predicting attrition as an application of supervised machine learning for classification using summative historical variables. Raising the accuracy, precision, and recall of learned classifiers enables institutional administrators to take individualized preventive action based on the variables that are found to be relevant to the prediction that a particular member is at high risk of departure. This project focuses on using multivariate logistic regression on historical institutional data with wrapper-based feature selection to determine variables that are relevant to a specified classification task for prediction of attrition.
In this work, I first describe a detailed approach to the development of a machine learning pipeline for a range of predictive analytics tasks such as anticipating employee or student attrition. These include: data preparation for supervised inductive learning tasks; training various discriminative models; and evaluating these models using performance metrics such as precision, accuracy, and specificity/sensitivity analysis. Next, I document a synthetic human resource dataset created by data scientists at IBM for simulating employee performance and attrition.
I then apply supervised inductive learning algorithms such as logistic regression, support vector machines (SVM), random forests, and Naive Bayes to predict the attrition of individual employees based on a combination of personal and institution-wide factors. I compare the results of each algorithm to evaluate the predictive models for this classification task.
Finally, I generate basic visualizations common to many analytics dashboards, comprising results such as heat maps of the confusion matrix and the comparative accuracy, precision, recall and F1 score for each algorithm. From an applications perspective, once deployed, this model can be used by human capital services units of an employer to find actionable ways (training, management, incentives, etc.) to reduce attrition and potentially boost longer-term retention.
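The training-and-comparison steps above can be sketched with scikit-learn. Everything concrete below is an assumption for illustration: the file name `hr_attrition.csv`, the `Attrition` column, and the hyperparameters; the project's actual pipeline may differ.

```python
# Hedged sketch: wrapper-based feature selection for logistic regression, plus
# a comparison of the four classifier families named above on held-out data.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("hr_attrition.csv")  # hypothetical file for the IBM HR data
X = pd.get_dummies(df.drop(columns=["Attrition"]))
y = (df["Attrition"] == "Yes").astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Wrapper-based feature selection: greedily keep the features that most
# improve the logistic regression's cross-validated score.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000), n_features_to_select=10)

models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), selector, LogisticRegression(max_iter=1000)),
    "svm": make_pipeline(StandardScaler(), SVC()),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "naive_bayes": GaussianNB(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name)
    print(classification_report(y_te, model.predict(X_te)))
```

`classification_report` prints per-class precision, recall, and F1, the same metrics the dashboard visualizations above summarize.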
38
Prediction Intervals for Class Probabilities
Yu, Xiaofeng, January 2007 (has links)
Prediction intervals for class probabilities are of interest in machine learning because they quantify the uncertainty about the class probability estimate for a test instance. The idea is that all likely class probability values of the test instance are included, with a pre-specified confidence level, in the calculated prediction interval. This thesis proposes a probabilistic model for calculating such prediction intervals. Given the unobservability of class probabilities, a Bayesian approach is employed to derive a complete distribution of the class probability of a test instance from a set of class observations of training instances in the neighbourhood of the test instance. A random decision tree ensemble learning algorithm is also proposed, whose prediction output constitutes the neighbourhood used by the Bayesian model to produce a prediction interval for the test instance. The Bayesian model, used in conjunction with the ensemble learning algorithm and the standard nearest-neighbour classifier, is evaluated on artificial datasets and modified real datasets.
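As a concrete illustration of the Bayesian step, the sketch below assumes a Beta-Binomial model with a uniform prior over the neighbourhood's class observations; the thesis's exact model is not specified here and may differ.

```python
# Hedged sketch: a credible interval for the class probability of a test
# instance, given the class observations of its neighbourhood (e.g., the
# instances routed to the same leaves by the random decision tree ensemble).
from scipy.stats import beta

def class_probability_interval(positives: int, total: int, confidence: float = 0.95):
    # Uniform Beta(1, 1) prior + binomial likelihood -> Beta posterior.
    a, b = positives + 1, total - positives + 1
    return beta.interval(confidence, a, b)

# Example: 17 of 20 neighbourhood observations are positive.
low, high = class_probability_interval(17, 20)
print(f"95% interval for the class probability: [{low:.3f}, {high:.3f}]")
```

With more neighbourhood observations the posterior tightens, so the interval width itself conveys how much the point estimate can be trusted.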
39
Incorporating prior domain knowledge into inductive machine learning: its implementation in contemporary capital markets.
Yu, Ting, January 2007 (has links)
An ideal inductive machine learning algorithm produces a model that best approximates an underlying target function at reasonable computational cost. This requires the resultant model to be consistent with the training data and to generalize well over unseen data. Regular inductive machine learning algorithms rely heavily on numerical data as well as general-purpose inductive bias. However, certain environments contain rich domain knowledge prior to the learning task, and it is not easy for regular inductive learning algorithms to utilize this prior domain knowledge. This thesis discusses and analyzes various methods of incorporating prior domain knowledge into inductive machine learning through three key issues: consistency, generalization, and convergence. Additionally, three new methods are proposed and tested on data sets collected from capital markets. These methods utilize financial knowledge collected from various sources, such as experts and research papers, to facilitate the learning process of kernel methods (emerging inductive learning algorithms). The test results are encouraging and demonstrate that prior domain knowledge is valuable to inductive learning machines.
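As one simple illustration of the general idea, not one of the thesis's three methods, prior knowledge can enter a kernel method through the kernel itself, for example by blending a data-driven kernel with a knowledge-derived similarity matrix. All names and data below are hypothetical.

```python
# Hedged sketch: a convex combination of a data-driven RBF kernel and a prior
# similarity matrix is itself a valid (positive semi-definite) kernel, and can
# be fed to an SVM via scikit-learn's precomputed-kernel interface.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

def blended_kernel(X1, X2, K_prior, alpha=0.7):
    return alpha * rbf_kernel(X1, X2) + (1 - alpha) * K_prior

# Hypothetical data; K_prior would encode, e.g., expert-asserted relatedness
# of financial instruments, here just an identity placeholder.
rng = np.random.default_rng(0)
X_train = rng.random((50, 8))
y_train = rng.integers(0, 2, 50)
K_prior_train = np.eye(50)

clf = SVC(kernel="precomputed")
clf.fit(blended_kernel(X_train, X_train, K_prior_train), y_train)
# Prediction needs the blended kernel between test and training instances:
# clf.predict(blended_kernel(X_test, X_train, K_prior_test_train))
```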
40
Discovering hierarchy in reinforcement learning
Hengst, Bernhard, Computer Science & Engineering, Faculty of Engineering, UNSW, January 2003 (has links)
This thesis addresses the open problem of automatically discovering hierarchical structure in reinforcement learning. Current algorithms for reinforcement learning fail to scale as problems become more complex. Many complex environments empirically exhibit hierarchy and can be modeled as interrelated subsystems, each in turn with hierarchic structure. Subsystems are often repetitive in time and space, meaning that they recur as components of different tasks or occur multiple times in different circumstances in the environment. A learning agent may sometimes scale to larger problems if it successfully exploits this repetition. Evidence suggests that a bottom-up approach, which repetitively finds building blocks at one level of abstraction and uses them as background knowledge at the next level of abstraction, makes learning in many complex environments tractable. An algorithm, called HEXQ, is described that automatically decomposes and solves a multi-dimensional Markov decision problem (MDP) by constructing a multi-level hierarchy of interlinked subtasks, without being given the model beforehand. The effectiveness and efficiency of the HEXQ decomposition depend largely on the choice of representation in terms of the variables, their temporal relationship, and whether the problem exhibits a type of constrained stochasticity. The algorithm is first developed for stochastic shortest path problems and then extended to infinite horizon problems. The operation of the algorithm is demonstrated using a number of examples, including a taxi domain, various navigation tasks, the Towers of Hanoi, and a larger sporting problem. The main contributions of the thesis are the automation of (1) decomposition, (2) sub-goal identification, and (3) discovery of hierarchical structure for MDPs with states described by a number of variables or features. It points the way to further scaling opportunities that encompass approximations, partial observability, selective perception, relational representations, and planning. The longer-term research aim is to train rather than program intelligent agents.
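A minimal sketch of the first step of such a decomposition follows: ordering state variables by how frequently they change value during exploration, so the fastest-changing variable seeds the lowest level of the hierarchy. The environment interface here is hypothetical, and HEXQ's later stages (exit discovery, region construction, hierarchical solution) are not reproduced.

```python
# Hedged sketch: estimate per-variable change frequencies from a random
# exploration trajectory. Assumes a hypothetical `env` whose states are tuples
# of variable values, with `reset()`, `actions`, and `step(action)` returning
# the next state.
import random
from collections import Counter

def variable_order(env, steps: int = 10_000) -> list[int]:
    changes = Counter()
    state = env.reset()
    for _ in range(steps):
        next_state = env.step(random.choice(env.actions))
        for i, (old, new) in enumerate(zip(state, next_state)):
            if old != new:
                changes[i] += 1
        state = next_state
    # Most frequently changing variable first: the candidate for the lowest
    # level of the hierarchy (e.g., taxi position before passenger location).
    return sorted(changes, key=changes.get, reverse=True)
```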