Global ETD Search

11	Detecting Lateral Movement in Microsoft Active Directory Log Files: A supervised machine learning approach
Uppströmer, Viktor; Råberg, Henning. January 2019
Cyber attacks pose a serious threat to companies and organisations worldwide. With the average cost of a data breach reaching $3.86 million, there is strong demand for ways to detect attacks as early as possible. Advanced persistent threats (APTs) are sophisticated cyber attacks that maintain a long presence inside the victim's network. After the initial intrusion, the attacker's focus shifts to gaining control of as many devices on the network as possible. This stage, called lateral movement, is one of the most critical steps in an APT. The purpose of the thesis is to investigate how, and how well, lateral movement can be detected with a machine learning approach. Five machine learning algorithms are compared using repeated cross-validation followed by statistical testing to determine the best-performing algorithm and the importance of individual features. The dataset comes from an Active Directory domain controller; its features are built by correlating log entries that share a workstation name, IP address, or account name. It combines a synthetic part and a real part, yielding a semi-synthetic dataset that poses a multiclass classification problem. The experiment shows that all five algorithms reach an accuracy of 0.998. RF achieves the highest F1-score (0.88) and recall (0.858), SVM the highest precision (0.972), and DT the lowest training time (1237 ms). Based on these results, the thesis concludes that RF, SVM, and DT each perform best in different scenarios. For instance, SVM should be used if a low number of false positives is important, whereas RF should be used if balanced performance across the metrics matters most. The results also indicate that a large share of the examined features can be disregarded in future experiments, as they did not affect the performance of any of the classifiers.
Keywords: Advanced Persistent Threat; Lateral Movement; Active Directory; Multiclass Classification; Intrusion Detection System; Computer Systems
12	Human layout estimation using structured output learning
Mittal, Arpit. January 2012
In this thesis, we investigate the problem of human layout estimation in unconstrained still images. This involves predicting the spatial configuration of body parts. We start our investigation with pictorial structure models and propose an efficient method of model fitting using skin regions. To detect the skin, we learn a colour model locally from the image by detecting the facial region. The resulting skin detections are also used for hand localisation. Our next contribution is a comprehensive dataset of 2D hand images. We collected this dataset from publicly available image sources, and annotated images with hand bounding boxes. The bounding boxes are not axis aligned, but are rather oriented with respect to the wrist. Our dataset is quite exhaustive as it includes images of different hand shapes and layout configurations. Using our dataset, we train a hand detector that is robust to background clutter and lighting variations. Our hand detector is implemented as a two-stage system. The first stage involves proposing hand hypotheses using complementary image features, which are then evaluated by the second stage classifier. This improves both precision and recall and results in a state-of-the-art hand detection method. In addition, we develop a new method of non-maximum suppression based on super-pixels. We also contribute an efficient training algorithm for structured output ranking. In our algorithm, we reduce the time complexity of an expensive training component from quadratic to linear. This algorithm has a broad applicability and we use it for solving human layout estimation and taxonomic multiclass classification problems. For human layout, we use different body part detectors to propose part candidates. These candidates are then combined and scored using our ranking algorithm. By applying this bottom-up approach, we achieve accurate human layout estimation despite variations in viewpoint and layout configuration. In the multiclass classification problem, we define the misclassification error using a class taxonomy. The problem then reduces to a structured output ranking problem and we use our ranking method to optimise it. This allows inclusion of semantic knowledge about the classes and results in a more meaningful classification system. Lastly, we substantiate our ranking algorithm with theoretical proofs and derive the generalisation bounds for it. These bounds prove that the training error reduces to the lowest possible error asymptotically.
006.3
13	Multi-label Classification with Multiple Label Correlation Orders And Structures
Posinasetty, Anusha. January 2016
Multilabel classification has attracted much interest in recent times due to the wide applicability of the problem and the challenges involved in learning a classifier for multilabeled data. A crucial aspect of multilabel classification is to discover the structure and order of correlations among labels and their effect on the quality of the classifier. In this work, we propose a structural Support Vector Machine (structural SVM) based framework which enables us to systematically investigate the importance of label correlations in multi-label classification. The proposed framework is very flexible: it provides a unified approach to handling multiple correlation orders and structures in an adaptive manner, and it helps to assess effectively the importance of label correlations in improving generalization performance. We perform extensive empirical evaluation on several datasets from different domains and present results on various performance metrics. Our experiments provide, for the first time, interesting insights into the following questions: a) Are label correlations always beneficial in multilabel classification? b) What effect do label correlations have on multiple performance metrics typically used in multilabel classification? c) Is label correlation order significant and, if so, what would be the favorable correlation order for a given dataset and a given performance metric? d) Can we make useful suggestions on the label correlation structure?
Keywords: Multi Label Classification; Structural Support Vector Machine; Machine Learning; Multiclass Classification; Multi-Label Classification Algorithms; Structural SVM; Computer Science
14	Bayes Optimal Feature Selection for Supervised Learning
Saneem Ahmed, C G. January 2014
The problem of feature selection is critical in several areas of machine learning and data analysis, such as cancer classification using gene expression data and text categorization. In this work, we consider feature selection for supervised learning problems, where one wishes to select a small set of features that facilitate learning a good prediction model in the reduced feature space. Our interest is primarily in filter methods, which select features independently of the learning algorithm to be used and are generally faster to implement than other types of feature selection algorithms. Many common filter methods for feature selection make use of information-theoretic criteria, such as those based on mutual information, to guide their search process. However, even in simple binary classification problems, mutual information based methods do not always select the best set of features in terms of the Bayes error. In this thesis, we develop a general approach for selecting a set of features that directly aims to minimize the Bayes error in the reduced feature space with respect to the loss or performance measure of interest. We show that the mutual information based criterion is a special case of our setting when the loss function of interest is the logarithmic loss for class probability estimation. We give a greedy forward algorithm for approximately optimizing this criterion and demonstrate its application to several supervised learning problems, including binary classification (with 0-1 error, cost-sensitive error, and F-measure), binary class probability estimation (with logarithmic loss), bipartite ranking (with pairwise disagreement loss), and multiclass classification (with multiclass 0-1 error). Our experiments suggest that the proposed approach is competitive with several state-of-the-art methods.
Keywords: Data Analysis; Logarithms; Supervised Learning; Bayes Optimality; Binary Classification; Bipartite Ranking; Multiclass Classification; Bayes Optimal Feature Selection; Optimal Feature Selection; Bayes Error; Binary Class Probability Estimation; Supervised Learning Problems; Computer Science
15	Sparse Multiclass And Multi-Label Classifier Design For Faster Inference
Bapat, Tanuja. 12 1900
Many real-world problems, such as hand-written digit recognition or semantic scene classification, are treated as multiclass or multi-label classification problems. Solutions to these problems using support vector machines (SVMs) are well studied in the literature. In this work, we focus on building sparse max-margin classifiers for multiclass and multi-label classification. A sparse representation of the resulting classifier is important for both efficient training and fast inference, especially when the training and test sets are large. Very few of the existing multiclass and multi-label classification algorithms have given importance to controlling the sparsity of the designed classifiers directly. Further, these algorithms were not found to be scalable. Motivated by this, we propose new formulations for sparse multiclass and multi-label classifier design and also give efficient algorithms to solve them. The formulation for sparse multi-label classification also incorporates prior knowledge of label correlations. In both cases, the classification model is designed using a common set of basis vectors across all the classes. These basis vectors are greedily added to an initially empty model to approximate the target function. The sparsity of the classifier can be controlled by a user-defined parameter, $d_{\max}$, which indicates the maximum number of common basis vectors. The computational complexity of these algorithms for multiclass and multi-label classifier design is $O(l k^2 d_{\max}^2)$, where $l$ is the number of training set examples and $k$ is the number of classes. The inference time for the proposed multiclass and multi-label classifiers is $O(k d_{\max})$. Numerical experiments on various real-world benchmark datasets demonstrate that the proposed algorithms result in sparse classifiers that require fewer basis vectors than state-of-the-art algorithms to attain the same generalization performance. A very small value of $d_{\max}$ results in a significant reduction in inference time. Thus, the proposed algorithms provide useful alternatives to the existing algorithms for sparse multiclass and multi-label classifier design.
Keywords: Artificial Intelligence; Machine Learning; Multiclass Classification; Multi-label Classification; Sparse Max-Margin Classifiers; Support Vector Machine (SVM); Sparse Classifiers; Computer Science
16	Algorithmes de poursuite stochastiques et inégalités de concentration empiriques pour l'apprentissage statistique / Stochastic pursuit algorithms and empirical concentration inequalities for machine learning
Peel, Thomas. 29 November 2013
The first part of this thesis introduces new algorithms for the sparse decomposition of signals. Based on Matching Pursuit (MP), they address the following problem: how to reduce the computation time of the selection step of MP, which is often very costly. In response, we sub-sample the dictionary at each iteration, in both rows and columns. We show that this theoretically grounded approach performs well in practice. We then propose an iterative block coordinate gradient descent algorithm for feature selection in the multiclass classification setting. It relies on error-correcting output codes, which turn the task into a simultaneous sparse representation problem for signals. The second part presents new empirical Bernstein-type concentration inequalities. First, they concern the theory of U-statistics and are used to derive generalization bounds for ranking algorithms. These bounds take advantage of a variance estimator, for which we propose an efficient computation algorithm. Then, we present an empirical version of the Bernstein-type inequality for martingales proposed by Freedman [1975]. Again, the strength of our bound lies in the introduction of a variance estimator that can be computed from the data. This allows us to propose generalization bounds for online learning algorithms that improve on the state of the art and pave the way for a new family of learning algorithms that take advantage of this empirical information.
Keywords: Matching Pursuit; Stochastic Algorithms; Feature Selection; Multiclass Classification; Empirical Bernstein Inequalities; U-Statistics; Martingales; Ranking; Online Learning; Generalization Bounds
17	Contributions à l'étude et à la reconnaissance automatique de la parole en Fongbe / Contributions to the study of automatic speech recognition on Fongbe
Laleye, Frejus Adissa Akintola. 10 December 2016
One of the difficulties facing an under-resourced language is the absence of services based on text and speech processing technologies. In this thesis, we address the acoustic study of isolated and continuous speech in Fongbe in the context of automatic speech recognition (ASR). The tonal complexity of spoken Fongbe and its recently adopted writing convention led us to study the language across the whole ASR chain. In addition to the linguistic resources collected to support building the algorithms (vocabularies, large text and speech corpora, pronunciation dictionaries), we propose a complete recipe of algorithms for the automatic processing of Fongbe, based on an acoustic study of its different sounds. The recipe includes algorithms for classifying and recognizing isolated phonemes and for segmenting continuous speech into syllables. We also present a methodology for developing acoustic models and language models to facilitate speech recognition in Fongbe. We propose and evaluate grapheme-based acoustic modeling (since Fongbe does not yet have a phonetic dictionary) and assess the impact of tonal pronunciation on the performance of a Fongbe ASR system. Finally, the written and spoken resources collected for Fongbe, together with the experimental results obtained for each stage of the ASR chain, validate the potential of the proposed methods and algorithms.
Keywords: Fongbe; Automatic speech recognition; Automatic speech segmentation; Rényi entropy; Grapheme-based acoustical modeling; Language modeling; Fusion of decisions; Multiclass classification; DBN; Fuzzy logic
