11 |
Development and validation of deep learning classifiers for antimicrobial peptide prediction
Yan, Jie Lu January 2018
University of Macau / Faculty of Science and Technology. / Department of Computer and Information Science
|
12 |
Understanding the Phishing Ecosystem
Le Page, Sophie 08 July 2019
In “phishing attacks”, phishing websites mimic trustworthy websites in order to steal sensitive information from end-users. Despite research by both academia and industry on the development of anti-phishing detection techniques, phishing has become a growing online threat. Our inability to slow down phishing attacks shows that we need to go beyond detection and focus more on understanding the phishing ecosystem. In this thesis, we contribute in three ways to understanding the phishing ecosystem and offer insight for future anti-phishing efforts. First, we provide a new and comparative study on the life cycle of phishing and malware attacks. Specifically, we use public click-through statistics of the Bitly URL shortening service to analyze the click-through rate and timespan of phishing and malware attacks before (and after) they were reported. We find that the efforts against phishing attacks are stronger than those against malware attacks. We also find phishing activity indicating that mitigation strategies are not taking down phishing websites fast enough. Second, we develop a method that finds similarities between the DOMs of phishing attacks, since it is known that phishing attacks are variations of previous attacks. We find that existing methods do not capture the structure of the DOM, and question whether they are failing to catch some of the similar attacks. We accordingly evaluate the feasibility of applying Pawlik and Augsten’s recent implementation of Tree Edit Distance (AP-TED) calculations as a way to compare DOMs and identify similar phishing attack instances. Our method agrees with existing ones that 94% of our phishing database are replicas. It also better discriminates the similarities, but at a higher computational cost.
The high agreement between methods strengthens the understanding that most phishing attacks are variations, which affects future anti-phishing strategies. Third, we develop a domain classifier exploiting the history and internet presence of a domain with machine learning techniques. It uses only publicly available information to determine whether a known phishing website is hosted on a legitimate but compromised domain, in which case the domain owner is also a victim, or whether the domain itself is maliciously registered. This is especially relevant due to the recent adoption of the General Data Protection Regulation (GDPR), which prevents certain registration information from being made publicly available. Our classifier achieves 94% accuracy on future malicious domains, while maintaining 88% and 92% accuracy on malicious and compromised datasets respectively from two other sources. Accurate domain classification offers insight with regard to different take-down strategies, and with regard to registrars’ prevention of fraudulent registrations.
|
13 |
Training example adaptation for text categorization.
January 2005
Ko Hon Man. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2005. / Includes bibliographical references (leaves 68-72). / Abstracts in English and Chinese. / Chapter 1 --- Introduction / Chapter 1.1 --- Background and Motivation / Chapter 1.2 --- Thesis Organization / Chapter 2 --- Related Work / Chapter 2.1 --- Semi-supervised learning / Chapter 2.2 --- Hierarchical Categorization / Chapter 3 --- Framework Overview / Chapter 4 --- Inherent Concept Detection / Chapter 4.1 --- Data Preprocessing / Chapter 4.2 --- Concept Detection Algorithm / Chapter 4.3 --- Kernel-based Distance Measure / Chapter 5 --- Training Example Discovery from Unlabeled Documents / Chapter 5.1 --- Training Document Discovery / Chapter 5.2 --- Automatically determining the number of extracted positive examples / Chapter 5.3 --- Classification Model / Chapter 6 --- Experimental Evaluation / Chapter 6.1 --- Corpus Description / Chapter 6.2 --- Evaluation Metric / Chapter 6.3 --- Result Analysis / Chapter 7 --- Conclusions and Future Work / Bibliography / Chapter A --- Detailed result on the inherent concept detection process for the TDT and RCV1 corpora
|
14 |
Learning by propagation. / CUHK electronic theses & dissertations collection
January 2008
In this thesis, we consider the general problem of classifying a data set into a number of subsets, which has been one of the most fundamental problems in machine learning. Specifically, we mainly address the following four common learning problems in three active research fields: semi-supervised classification, semi-supervised clustering, and unsupervised clustering. The first problem we consider is semi-supervised classification from both unlabeled data and pairwise constraints. The pairwise constraints specify which two objects belong to the same class or not. Our aim is to propagate the pairwise constraints to the entire data set. We formulate the propagation model as a semidefinite programming (SDP) problem, which can be globally solved reliably. Our approach is applicable to multi-class problems and handles class labels, pairwise constraints, or a mixture of them in a unified framework. / The second problem is semi-supervised clustering with pairwise constraints. We present a principled framework for learning a data-driven and constraint-consistent nonlinear mapping to reshape the data in a feature space. We formulate the problem as a small-scale SDP problem, whose size is independent of the numbers of the objects and the constraints. Thus it can be globally solved efficiently. Our framework has several attractive features. First, it can effectively propagate pairwise constraints, when available, to the entire data set. Second, it scales well to large-scale problems. Third, it can effectively handle noisy constraints. Fourth, in the absence of constraints, it becomes a novel kernel-based clustering algorithm that can discover linearly non-separable clusters. / Third, we deal with noise robust clustering. Many clustering algorithms, including spectral clustering, often fail on noisy data. We propose a data warping model to map the data into a new space. During the warping, each object spreads its spatial information smoothly over the data graph to other objects. After the warping, hopefully each cluster becomes compact and different clusters become well-separated, including the noise cluster that is formed by the noise objects. The proposed clustering algorithm can handle significantly noisy data, and can find the number of clusters automatically. / Finally, we study how to construct an appropriate graph for spectral clustering. Given a local similarity matrix (a graph), we propose an iterative regularization procedure to iteratively enhance its cluster structure, leading to a global similarity matrix. Significant improvement of clustering performance is observed when the new graph is used for spectral clustering. / Li, Zhenguo. / Adviser: Liu Jianzhuang. / Source: Dissertation Abstracts International, Volume: 70-06, Section: B, page: 3604. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2008. / Includes bibliographical references (leaves 121-131). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstracts in English and Chinese. / School code: 1307.
|
15 |
Predictive analytics of institutional attrition
Velumula, Sindhu January 1900
Master of Science / Department of Computer Science / William H. Hsu / Institutional attrition refers to the phenomenon of members of an organization leaving it over time - a costly challenge faced by many institutions. This work focuses on the problem of predicting attrition as an application of supervised machine learning for classification using summative historical variables. Raising the accuracy, precision, and recall of learned classifiers enables institutional administrators to take individualized preventive action based on the variables that are found to be relevant to the prediction that a particular member is at high risk of departure. This project focuses on using multivariate logistic regression on historical institutional data with wrapper-based feature selection to determine variables that are relevant to a specified classification task for prediction of attrition.
In this work, I first describe a detailed approach to the development of a machine learning pipeline for a range of predictive analytics tasks such as anticipating employee or student attrition. These include: data preparation for supervised inductive learning tasks; training various discriminative models; and evaluating these models using performance metrics such as precision, accuracy, and specificity/sensitivity analysis. Next, I document a synthetic human resource dataset created by data scientists at IBM for simulating employee performance and attrition.
I then apply supervised inductive learning algorithms such as logistic regression, support vector machines (SVM), random forests, and Naive Bayes to predict the attrition of individual employees based on a combination of personal and institution-wide factors. I compare the results of each algorithm to evaluate the predictive models for this classification task.
Finally, I generate basic visualizations common to many analytics dashboards, comprising results such as heat maps of the confusion matrix and the comparative accuracy, precision, recall and F1 score for each algorithm. From an applications perspective, once deployed, this model can be used by human capital services units of an employer to find actionable ways (training, management, incentives, etc.) to reduce attrition and potentially boost longer-term retention.
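The evaluation metrics named in this abstract (accuracy, precision, recall and F1 score) can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch of that arithmetic, not the thesis's own code; the function name `classification_metrics` and the toy labels are assumptions made for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    # Cells of the confusion matrix.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = employee left, 0 = employee stayed.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

Reporting precision and recall alongside accuracy matters for attrition data, where departures are typically the minority class and accuracy alone can look good for a classifier that never predicts a departure.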
|
16 |
Prediction Intervals for Class Probabilities
Yu, Xiaofeng January 2007
Prediction intervals for class probabilities are of interest in machine learning because they can quantify the uncertainty about the class probability estimate for a test instance. The idea is that all likely class probability values of the test instance are included, with a pre-specified confidence level, in the calculated prediction interval. This thesis proposes a probabilistic model for calculating such prediction intervals. Given the unobservability of class probabilities, a Bayesian approach is employed to derive a complete distribution of the class probability of a test instance based on a set of class observations of training instances in the neighbourhood of the test instance. A random decision tree ensemble learning algorithm is also proposed, whose prediction output constitutes the neighbourhood that is used by the Bayesian model to produce a prediction interval (PI) for the test instance. The Bayesian model, which is used in conjunction with the ensemble learning algorithm and the standard nearest-neighbour classifier, is evaluated on artificial datasets and modified real datasets.
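To make the idea of an interval around a class probability concrete, the sketch below computes an equal-tailed Bayesian interval from the class labels observed in a test instance's neighbourhood. It is a simplified illustration rather than the model proposed in the thesis: the uniform Beta(1, 1) prior, the Monte Carlo approximation, and the name `class_probability_interval` are all assumptions.

```python
import random


def class_probability_interval(n_positive, n_total, confidence=0.95,
                               n_samples=20000, seed=0):
    """Equal-tailed interval for a class probability.

    Assumes a uniform Beta(1, 1) prior, so the posterior after seeing
    n_positive positives among n_total neighbours is
    Beta(1 + n_positive, 1 + n_total - n_positive).
    """
    rng = random.Random(seed)
    alpha = 1 + n_positive
    beta = 1 + n_total - n_positive
    # Approximate the posterior quantiles by sorting random draws.
    draws = sorted(rng.betavariate(alpha, beta) for _ in range(n_samples))
    tail = (1 - confidence) / 2
    lower = draws[int(tail * n_samples)]
    upper = draws[int((1 - tail) * n_samples) - 1]
    return lower, upper


# Hypothetical neighbourhood: 7 of 10 nearby training instances are positive.
lower, upper = class_probability_interval(n_positive=7, n_total=10)
```

With more neighbours showing the same class proportion, the interval narrows, which is the sense in which such an interval quantifies uncertainty about the estimate.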
|
17 |
Discovering hierarchy in reinforcement learning
Hengst, Bernhard, Computer Science & Engineering, Faculty of Engineering, UNSW January 2003
This thesis addresses the open problem of automatically discovering hierarchical structure in reinforcement learning. Current algorithms for reinforcement learning fail to scale as problems become more complex. Many complex environments empirically exhibit hierarchy and can be modeled as interrelated subsystems, each in turn with hierarchic structure. Subsystems are often repetitive in time and space, meaning that they reoccur as components of different tasks or occur multiple times in different circumstances in the environment. A learning agent may sometimes scale to larger problems if it successfully exploits this repetition. Evidence suggests that a bottom-up approach that repetitively finds building-blocks at one level of abstraction and uses them as background knowledge at the next level of abstraction makes learning in many complex environments tractable. An algorithm, called HEXQ, is described that automatically decomposes and solves a multi-dimensional Markov decision problem (MDP) by constructing a multi-level hierarchy of interlinked subtasks without being given the model beforehand. The effectiveness and efficiency of the HEXQ decomposition depend largely on the choice of representation in terms of the variables, their temporal relationship and whether the problem exhibits a type of constrained stochasticity. The algorithm is first developed for stochastic shortest path problems and then extended to infinite horizon problems. The operation of the algorithm is demonstrated using a number of examples including a taxi domain, various navigation tasks, the Towers of Hanoi and a larger sporting problem. The main contributions of the thesis are the automation of (1) decomposition, (2) sub-goal identification, and (3) discovery of hierarchical structure for MDPs with states described by a number of variables or features.
It points the way to further scaling opportunities that encompass approximations, partial observability, selective perception, relational representations and planning. The longer term research aim is to train rather than program intelligent agents.
|
18 |
Calibrating recurrent sliding window classifiers for sequential supervised learning
Joshi, Saket Subhash 03 October 2003
Sequential supervised learning problems involve assigning a class label to
each item in a sequence. Examples include part-of-speech tagging and text-to-speech
mapping. A very general-purpose strategy for solving such problems is
to construct a recurrent sliding window (RSW) classifier, which maps some window
of the input sequence plus some number of previously-predicted items into
a prediction for the next item in the sequence. This paper describes a general-purpose
implementation of RSW classifiers and discusses the highly practical
issue of how to choose the size of the input window and the number of previous
predictions to incorporate. Experiments on two real-world domains show that
the optimal choices vary from one learning algorithm to another. They also
depend on the evaluation criterion (number of correctly-predicted items versus
number of correctly-predicted whole sequences). We conclude that window
sizes must be chosen by cross-validation. The results have implications for the
choice of window sizes for other models including hidden Markov models and
conditional random fields. / Graduation date: 2004
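The recurrent sliding window idea described above can be sketched as follows: each item is classified from a window of the input centred on it plus a few of the classifier's own earlier predictions. This is an illustrative sketch only, not the implementation described in the thesis; the names `rsw_features`, `rsw_predict` and `toy_classifier`, the symmetric window, and the padding symbol are assumptions.

```python
def rsw_features(inputs, previous_predictions, position,
                 input_half_width, n_previous, pad="_"):
    """Build the feature list for one position of the sequence."""
    # Input window centred on the current position, padded at the edges.
    window = [inputs[i] if 0 <= i < len(inputs) else pad
              for i in range(position - input_half_width,
                             position + input_half_width + 1)]
    # The most recent predictions, padded at the start of the sequence.
    history = [previous_predictions[i] if i >= 0 else pad
               for i in range(position - n_previous, position)]
    return window + history


def rsw_predict(inputs, classify, input_half_width=1, n_previous=1):
    """Label a sequence left to right, feeding predictions back in."""
    predictions = []
    for position in range(len(inputs)):
        features = rsw_features(inputs, predictions, position,
                                input_half_width, n_previous)
        predictions.append(classify(features))
    return predictions


def toy_classifier(features):
    """Hypothetical stand-in for a trained model: vowel or consonant."""
    centre = features[1]  # centre of a window with half-width 1
    return "V" if centre in "aeiou" else "C"


labels = rsw_predict(list("cat"), toy_classifier)
```

The two parameters `input_half_width` and `n_previous` correspond to the window size and number of previous predictions whose choice the abstract says should be made by cross-validation.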
|
19 |
A study of model-based average reward reinforcement learning
Ok, DoKyeong 09 May 1996
Reinforcement Learning (RL) is the study of learning agents that improve
their performance from rewards and punishments. Most reinforcement learning
methods optimize the discounted total reward received by an agent, while, in many
domains, the natural criterion is to optimize the average reward per time step. In this
thesis, we introduce a model-based average reward reinforcement learning method
called "H-learning" and show that it performs better than other average reward and
discounted RL methods in the domain of scheduling a simulated Automatic Guided
Vehicle (AGV).
We also introduce a version of H-learning which automatically explores the
unexplored parts of the state space, while always choosing an apparently best action
with respect to the current value function. We show that this "Auto-exploratory H-learning"
performs much better than the original H-learning under many previously
studied exploration strategies.
To scale H-learning to large state spaces, we extend it to learn action models
and reward functions in the form of Bayesian networks, and approximate its value
function using local linear regression. We show that both of these extensions are very
effective in significantly reducing the space requirement of H-learning, and in making
it converge much faster in the AGV scheduling task. Further, Auto-exploratory H-learning
synergistically combines with Bayesian network model learning and value
function approximation by local linear regression, yielding a highly effective average
reward RL algorithm.
We believe that the algorithms presented here have the potential to scale to
large applications in the context of average reward optimization. / Graduation date: 1996
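The distinction between the two optimization criteria mentioned in this abstract can be shown with a small numerical sketch. It does not implement H-learning; it only compares the discounted total reward with the average reward per time step on two made-up reward streams. The function names, the reward sequences, and the discount factor of 0.5 are assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Discounted total reward: sum of gamma**t * r_t."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))


def average_reward(rewards):
    """Average reward per time step."""
    return sum(rewards) / len(rewards)


# Two hypothetical policies observed for 20 time steps.
steady = [1] * 20                  # small reward on every step
delayed = [0, 0, 0, 0, 10] * 4     # larger reward on every fifth step

steady_scores = (discounted_return(steady, 0.5), average_reward(steady))
delayed_scores = (discounted_return(delayed, 0.5), average_reward(delayed))
```

Under heavy discounting the steady policy scores higher, while the delayed policy earns twice the reward per step on average, so the two criteria can rank the same behaviours differently.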
|
20 |
Automatic Segmentation of Lung Carcinoma Using 3D Texture Features in Co-registered 18-FDG PET/CT Images
Markel, Daniel 14 December 2011
Variability between oncologists in defining the tumor during radiation therapy planning
can be as high as 700% by volume. Robust, automated definition of tumor boundaries
has the ability to significantly improve treatment accuracy and efficiency. However, the information provided in computed tomography (CT) is not sensitive enough to differences between tumor and healthy tissue, and positron emission tomography (PET) is hampered by blurriness and low resolution. The textural characteristics of thoracic tissue were investigated and compared with those of tumors found within 21 patient PET and CT images in order to enhance the differences and the boundary between cancerous and healthy tissue. A pattern recognition approach was used to learn the textural characteristics of each from these samples and to classify voxels as being either normal or abnormal.
The approach was compared to a number of alternative methods and found to have the
highest overlap with an oncologist's tumor definition.
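The abstract does not say which overlap measure was used. As one common possibility, the sketch below computes the Dice coefficient between an automatic segmentation and a manual one, each represented as a set of voxel coordinates. The choice of Dice, the function name `dice_overlap`, and the toy voxel sets are assumptions.

```python
def dice_overlap(mask_a, mask_b):
    """Dice coefficient between two sets of (x, y, z) voxel coordinates."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0  # two empty segmentations agree trivially
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical segmentations of four voxels each, sharing three.
automatic = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0)}
manual = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 1, 1)}
overlap = dice_overlap(automatic, manual)
```

A value of 1 means the two segmentations label exactly the same voxels and 0 means they share none.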
|