101 |
Efficient Kernel Methods for Statistical Detection
Su, Wanhua 20 March 2008
This research is motivated by a drug discovery problem: the AIDS anti-viral database from the National Cancer Institute. The objective of the study is to develop effective statistical methods to model the relationship between the chemical structure of a compound and its activity against the HIV-1 virus. The resulting structure-activity model can then be used to predict the activity of new compounds and thus help identify active chemical compounds that can be used as drug candidates. Since active compounds are generally rare in a compound library, we recognize the drug discovery problem as an application of the so-called statistical detection problem. In a typical statistical detection problem, we have data {Xi, Yi}, where Xi is the predictor vector of the ith observation and Yi ∈ {0,1} is its class label. The objective of a statistical detection problem is to identify class-1 observations, which are extremely rare. Besides drug discovery, other applications of statistical detection include direct marketing and fraud detection.
We propose a computationally efficient detection method called LAGO, which stands for "locally adjusted GO estimator". The original idea is inspired by an ancient game known today as "GO". The construction of LAGO consists of two steps. In the first step, we estimate the density of class 1 with an adaptive bandwidth kernel density estimator. The kernel functions are located at and only at the class-1 observations. The bandwidth of the kernel function centered at a certain class-1 observation is calculated as the average distance between this class-1 observation and its K-nearest class-0 neighbors. In the second step, we adjust the density estimated in the first step locally according to the density of class 0. It can be shown that the amount of adjustment in the second step is approximately inversely proportional to the bandwidth calculated in the first step.
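The first step of the construction can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis's implementation: the Gaussian kernel, the Euclidean distance, and the function names `class1_bandwidths` and `class1_density` are choices made here for illustration, and the second (local adjustment) step is deliberately left out.

```python
import math

def class1_bandwidths(class1, class0, k):
    """Bandwidth of each class-1 point: mean distance to its k nearest class-0 points.

    Assumes k <= len(class0); points are tuples of floats.
    """
    bandwidths = []
    for x in class1:
        dists = sorted(math.dist(x, z) for z in class0)
        bandwidths.append(sum(dists[:k]) / k)
    return bandwidths

def class1_density(x, class1, bandwidths):
    """Adaptive-bandwidth kernel density estimate of class 1 at x.

    One kernel sits on each class-1 point; the Gaussian kernel is an assumption.
    """
    dim = len(x)
    total = 0.0
    for xi, r in zip(class1, bandwidths):
        norm = (2.0 * math.pi * r * r) ** (dim / 2.0)
        total += math.exp(-0.5 * math.dist(x, xi) ** 2 / (r * r)) / norm
    return total / len(class1)
```

A class-1 point surrounded by nearby class-0 points receives a narrow kernel, while an isolated class-1 point receives a wide one, which is what makes the estimator adaptive.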
Application to the NCI data demonstrates that LAGO is superior to methods such as K nearest neighbors and support vector machines.
One drawback of the existing LAGO is that it only provides a point estimate of a test point's possibility of being class 1, ignoring the uncertainty of the model. In the second part of this thesis, we present a Bayesian framework for LAGO, referred to as BLAGO. This Bayesian approach enables quantification of uncertainty. Non-informative priors are adopted. The posterior distribution is calculated over a grid of (K, alpha) pairs by integrating out beta0 and beta1 using the Laplace approximation, where K and alpha are the two parameters used to construct the LAGO score. The parameters beta0 and beta1 are the coefficients of the logistic transformation that converts the LAGO score to the probability scale. BLAGO provides proper probabilistic predictions that have support on (0,1) and captures the uncertainty of the predictions as well. By avoiding Markov chain Monte Carlo algorithms and using the Laplace approximation, BLAGO is computationally very efficient. Because it does not need cross-validation, BLAGO is even more computationally efficient than LAGO.
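To illustrate the integration step, the sketch below applies the Laplace approximation to a two-parameter logistic model for one fixed vector of scores. It assumes a flat prior on (beta0, beta1), so the integrand is just the likelihood, and it uses plain Newton iterations to find the mode; the function names and these choices are assumptions for illustration, not details confirmed by the abstract.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def _stats(b0, b1, scores, labels):
    """Log-likelihood, gradient and negative Hessian of the logistic model."""
    ll = g0 = g1 = h00 = h01 = h11 = 0.0
    for s, y in zip(scores, labels):
        p = sigmoid(b0 + b1 * s)
        w = p * (1.0 - p)
        ll += y * math.log(p) + (1 - y) * math.log(1.0 - p)
        g0 += y - p
        g1 += (y - p) * s
        h00 += w
        h01 += w * s
        h11 += w * s * s
    return ll, g0, g1, h00, h01, h11

def laplace_log_evidence(scores, labels, iters=50):
    """Approximate log of the integral of the likelihood over (beta0, beta1).

    Assumes a flat prior and data that are not perfectly separable.
    """
    b0 = b1 = 0.0
    for _ in range(iters):
        _, g0, g1, h00, h01, h11 = _stats(b0, b1, scores, labels)
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    ll, _, _, h00, h01, h11 = _stats(b0, b1, scores, labels)
    det = h00 * h11 - h01 * h01
    # Laplace: log-likelihood at the mode + (d/2) log(2*pi) - 0.5 log det(H), with d = 2.
    return ll + math.log(2.0 * math.pi) - 0.5 * math.log(det), (b0, b1)
```

Evaluating this quantity once per (K, alpha) grid point would give unnormalized posterior weights over the grid without any sampling.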
|
102 |
Impacts of midpoint FACTS controllers on the coordination between generator phase backup protection and generator capability limits
Elsamahy, Mohamed Salah Kamel 15 July 2011
The thesis reports the results of comprehensive studies carried out to explore the impact of midpoint FACTS Controllers (STATCOM and SVC) on the generator distance phase backup protection in order to identify important issues that protection engineers need to consider when designing and setting a generator protection system. In addition, practical, feasible and simple solutions to mitigate the adverse impact of midpoint FACTS Controllers on the generator distance phase backup protection are explored.
The results of these studies show that midpoint FACTS Controllers have an adverse effect on the generator distance phase backup protection. This adverse effect, which can be in the form of underreach, overreach or a time delay, varies according to the fault type, fault location and generator loading. Moreover, it has been found that the adverse effect of the midpoint FACTS Controllers extends to affect the coordination between the generator distance phase backup protection and the generator steady-state overexcited capability limit.
The Support Vector Machines classification technique is proposed as a replacement for the existing generator distance phase backup protection relay in order to alleviate potential problems. It has been demonstrated that this technique is a very promising solution, as it is fast, reliable and has a high performance efficiency. This will result in enhancing the coordination between the generator phase backup protection and the generator steady-state overexcited capability limit in the presence of midpoint FACTS Controllers.
The thesis also presents the results of investigations carried out to explore the impact of the generator distance phase backup protection relay on the generator overexcitation thermal capability. The results of these investigations reveal that with the relay settings according to the current standards, the generator is over-protected and the generator distance phase backup protection relay restricts the generator overexcitation thermal capability during system disturbances. This restriction does not allow the supply of the maximum reactive power of the generating unit during such events. The restriction on the generator overexcitation thermal capability caused by the generator distance phase backup protection relay highlights the necessity to revise the relay settings. The proposed solution in this thesis is to reduce the generator distance phase backup protection relay reach in order to provide secure performance during system disturbances.
|
103 |
Search and Analysis of the Sequence Space of a Protein Using Computational Tools
Dubey, Anshul 25 August 2006
A new approach to the process of Directed Evolution is proposed, which utilizes different machine learning algorithms. Directed Evolution is a process of improving a protein for catalytic purposes by introducing random mutations in its sequence to create variants. Through these mutations, Directed Evolution explores the sequence space, which is defined as all the possible sequences for a given number of amino acids. Each variant sequence is assigned to one of two classes, positive or negative, according to its activity or stability. By employing machine learning algorithms for feature selection on the sequences of these variants, the attributes (amino acids) that are important for the classification into positive or negative can be identified. Support Vector Machines (SVMs) were utilized to identify the important individual amino acids of a protein, which have to be preserved to maintain its activity. The results for the case of beta-lactamase show that such residues can be identified with high accuracy while using a small number of variant sequences. Another class of machine learning problems, Boolean Learning, was used to extend this approach to identifying interactions between the different amino acids in a protein's sequence using the variant sequences. It was shown through simulations that such interactions can be identified for any protein with a reasonable number of variant sequences. For experimental verification of this approach, two fluorescent proteins, mRFP and DsRed, were used to generate variants, which were screened for fluorescence. Using Boolean Learning, an interacting pair was identified, which was shown to be important for the fluorescence. It was also shown through experiments and simulations that knowing such pairs can increase the fraction of active variants in the library.
A Boolean Learning algorithm was also developed for this application, which can learn Boolean functions from data in the presence of classification noise.
|
104 |
Coevolution Based Prediction Of Protein-protein Interactions With Reduced Training Data
Pamuk, Bahar 01 February 2009
Protein-protein interactions are important for the prediction of protein functions, since two interacting proteins usually have similar functions in a cell. Available protein interaction networks are incomplete, but they can be used to predict new interactions in a supervised learning framework. However, when the known protein network includes a large number of protein pairs, the training time of the machine learning algorithm becomes quite long. In this thesis, our aim is to predict protein-protein interactions from a known portion of the interaction network. We used Support Vector Machines (SVM) as the machine learning algorithm, trained on the already known protein pairs in the network. We chose phylogenetic profiles of proteins to form the feature vectors required by the learner, since the evolutionary similarity of two proteins gives a reasonable indication of whether the two proteins interact. Because the training time of SVM becomes quite long for large data sets, we reduced the data size in a sensible way while keeping approximately the same prediction accuracy.
We applied a number of clustering techniques to extract the most representative data and features in a two-category framework. Since the training data set is a two-dimensional matrix, we applied data reduction methods in both dimensions, i.e., in both data size and feature vector size. We observed that the data clustered by the k-means clustering technique gave superior prediction accuracy compared to another data clustering algorithm that was also developed for reducing data size for SVM training. Still, the true positive and false positive rates (TPR-FPR) of the training data sets constructed by the two clustering methods did not show conclusively which method outperforms the other. On the other hand, we applied feature selection methods to the feature vectors of the training data, selecting the features that are most representative in a biological and in a statistical sense. We used the phylogenetic tree of organisms to identify the organisms that are evolutionarily significant. Additionally, we applied Fisher's test to select the features that are most representative statistically. The accuracy and TPR-FPR values obtained by the feature selection methods did not allow a definite conclusion on the performance comparison. However, the phylogenetic tree method resulted in acceptable prediction values when compared to Fisher's test.
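One way to picture k-means-based data reduction is to cluster each class separately and train on the cluster centroids instead of the raw points. The sketch below does this with a plain Lloyd iteration; clustering per class, using centroids as the reduced training set, and initializing from the first k points are all assumptions made for illustration, and the abstract does not say the thesis proceeded this way.

```python
def kmeans(points, k, iters=100):
    """Lloyd's algorithm on tuples of numbers; deterministic start from the first k points."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (squared Euclidean distance).
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        new = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new == centroids:
            break
        centroids = new
    return centroids

def reduce_training_set(positives, negatives, k):
    """Replace each class by k centroids, labelled +1 and -1."""
    return ([(c, 1) for c in kmeans(positives, k)] +
            [(c, -1) for c in kmeans(negatives, k)])
```

The reduced set has 2k labelled points regardless of the original data size, which is what shortens SVM training; the cost is that fine detail near the class boundary may be averaged away.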
|
105 |
Kernel Methods Fast Algorithms and real life applications
Vishwanathan, S V N 06 1900
Support Vector Machines (SVM) have recently gained prominence in the field of machine learning and pattern classification (Vapnik, 1995, Herbrich, 2002, Scholkopf and Smola, 2002). Classification is achieved by finding a separating hyperplane in a feature space, which can be mapped back onto a non-linear surface in the input space. However, training an SVM involves solving a quadratic optimization problem, which tends to be computationally intensive. Furthermore, it can be subject to stability problems and is non-trivial to implement. This thesis proposes a fast iterative Support Vector training algorithm which overcomes some of these problems.
Our algorithm, which we christen Simple SVM, works mainly for the quadratic soft margin loss (also called the l2 formulation). We also sketch an extension for the linear soft-margin loss (also called the l1 formulation). Simple SVM works by incrementally changing a candidate Support Vector set using a locally greedy approach, until the supporting hyperplane is found within a finite number of iterations. It is derived by a simple (yet computationally crucial) modification of the incremental SVM training algorithms of Cauwenberghs and Poggio (2001) which allows us to perform update operations very efficiently. Constant-time methods for initialization of the algorithm and experimental evidence for the speed of the proposed algorithm, when compared to methods such as Sequential Minimal Optimization and the Nearest Point Algorithm are given. We present results on a variety of real life datasets to validate our claims.
In many real-life applications, especially for the l2 formulation, the kernel matrix $K \in \mathbb{R}^{n \times n}$ can be written as
$$K = ZZ^{\top} + \Lambda,$$
where $Z \in \mathbb{R}^{n \times m}$ with $m \ll n$ and $\Lambda \in \mathbb{R}^{n \times n}$ is diagonal with nonnegative entries. Hence the matrix $K - \Lambda$ is rank-degenerate. Extending the work of Fine and Scheinberg (2001) and Gill et al. (1975), we propose an efficient factorization algorithm which can be used to find an $LDL^{\top}$ factorization of $K$ in $O(nm^2)$ time. The modified factorization, after a rank-one update of $K$, can be computed in $O(m^2)$ time. We show how the Simple SVM algorithm can be sped up by taking advantage of this new factorization. We also demonstrate applications of our factorization to interior point methods. We show a close relation between the LDV factorization of a rectangular matrix and our $LDL^{\top}$ factorization (Gill et al., 1975).
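To see why a diagonal-plus-low-rank kernel matrix is cheap to work with, consider the simplest case of a single column z (m = 1). The sketch below solves the linear system with the Sherman-Morrison identity in O(n) time without ever forming the n x n matrix. This is a different and much simpler technique than the factorization proposed in the thesis, and it additionally assumes that every diagonal entry is strictly positive; the function name is illustrative.

```python
def solve_diag_plus_rank_one(lam, z, b):
    """Solve (diag(lam) + z z^T) x = b, assuming every lam[i] > 0.

    Sherman-Morrison: x = A^{-1} b - A^{-1} z (z^T A^{-1} b) / (1 + z^T A^{-1} z),
    with A = diag(lam), so every product costs O(n).
    """
    u = [bi / li for bi, li in zip(b, lam)]  # A^{-1} b
    v = [zi / li for zi, li in zip(z, lam)]  # A^{-1} z
    scale = (sum(zi * ui for zi, ui in zip(z, u)) /
             (1.0 + sum(zi * vi for zi, vi in zip(z, v))))
    return [ui - scale * vi for ui, vi in zip(u, v)]
```

The same idea extends to m columns through the Woodbury identity, at the cost of one m x m solve.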
An important feature of SVMs is that they can work with data from any input domain as long as a suitable mapping into a Hilbert space can be found; in other words, given the input data we should be able to compute a positive semi-definite kernel matrix of the data (Scholkopf and Smola, 2002). In this thesis we propose kernels on a variety of discrete objects, such as strings, trees, Finite State Automata, and Pushdown Automata. We show that our kernels include as special cases the celebrated Pair-HMM kernels (Durbin et al., 1998, Watkins, 2000), the spectrum kernel (Leslie et al., 2002), convolution kernels for NLP (Collins and Duffy, 2001), graph diffusion kernels (Kondor and Lafferty, 2002) and various other string-matching kernels.
Because of their widespread applications in bio-informatics and web document based algorithms, string kernels are of special practical importance. By intelligently using the matching statistics algorithm of Chang and Lawler (1994), we propose, perhaps, the first ever algorithm to compute string kernels in linear time. This obviates dynamic programming with quadratic time complexity and makes string kernels a viable alternative for the practitioner. We also propose extensions of our string kernels to compute kernels on trees efficiently. This thesis presents a linear time algorithm for ordered trees and a log-linear time algorithm for unordered trees.
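For orientation, the spectrum kernel mentioned above scores two strings by the length-k substrings they share. The sketch below is a simple hash-based baseline, not the matching-statistics algorithm proposed in the thesis; the function names are illustrative.

```python
from collections import Counter

def kmer_counts(s, k):
    """Count every length-k substring of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(s, t, k):
    """Sum over shared k-mers of (count in s) * (count in t)."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    return sum(n * ct[u] for u, n in cs.items() if u in ct)
```

For example, `spectrum_kernel("abab", "baba", 2)` evaluates to 4: "ab" occurs twice in the first string and once in the second, and "ba" occurs once and twice.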
In general, SVMs require time proportional to the number of Support Vectors for prediction. If the dataset is noisy, a large fraction of the data points become Support Vectors, and the time required for prediction increases. In many applications, such as search engines or web document retrieval, the dataset is noisy, yet the speed of prediction is critical. We propose a method for string kernels by which the prediction time can be reduced to linear in the length of the sequence to be classified, regardless of the number of Support Vectors. We achieve this by using a weighted version of our string kernel algorithm.
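The idea of making prediction time independent of the number of Support Vectors can be illustrated with a k-mer (spectrum) kernel: the weighted contributions of all Support Vectors are folded into one lookup table ahead of time, so classifying a new string takes one pass over its k-mers. This is a hash-table sketch under that assumption, not the weighted algorithm of the thesis; `coeffs` stands for the signed dual coefficients of a trained SVM and is hypothetical input.

```python
from collections import Counter, defaultdict

def kmer_counts(s, k):
    """Count every length-k substring of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def build_weight_table(support_strings, coeffs, k):
    """Precompute w[u] = sum_i coeffs[i] * (count of k-mer u in support string i)."""
    table = defaultdict(float)
    for s, c in zip(support_strings, coeffs):
        for u, n in kmer_counts(s, k).items():
            table[u] += c * n
    return dict(table)

def decision_value(x, table, k, bias=0.0):
    """Evaluate bias + sum_i coeffs[i] * K(s_i, x) with one table lookup per k-mer of x."""
    return bias + sum(table.get(x[i:i + k], 0.0) for i in range(len(x) - k + 1))
```

After the table is built, the cost of `decision_value` depends only on the length of `x`, not on how many support strings went into the table.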
We explore the relationship between dynamic systems and kernels. We define kernels on various kinds of dynamic systems including Markov chains (both discrete and continuous), diffusion processes on graphs and Markov chains, Finite State Automata, various linear time-invariant systems, etc. Trajectories are used to define kernels induced on initial conditions by the underlying dynamic system. The same idea is extended to define kernels on a dynamic system with respect to a set of initial conditions. This framework leads to a large number of novel kernels and also generalizes many previously proposed kernels.
Lack of adequate training data is a problem which plagues classifiers. We propose a new method to generate virtual training samples in the case of handwritten digit data. Our method uses the two-dimensional suffix tree representation of a set of matrices to encode an exponential number of virtual samples in linear space, thus leading to an increase in classification accuracy. This in turn leads us naturally to a compact, data-dependent representation of a test pattern, which we call the description tree. We propose a new kernel for images and demonstrate a quadratic time algorithm for computing it by using the suffix tree representation of an image. We also describe a method to reduce the prediction time to quadratic in the size of the test image by using techniques similar to those used for string kernels.
|
106 |
Blind image and video quality assessment using natural scene and motion models
Saad, Michele Antoine 05 November 2013
We tackle the problems of no-reference/blind image and video quality evaluation. The approach we take is that of modeling the statistical characteristics of natural images and videos, and utilizing deviations from those natural statistics as indicators of perceived quality. We propose a probabilistic model of natural scenes and a probabilistic model of natural videos to drive our image and video quality assessment (I/VQA) algorithms respectively. The VQA problem is considerably different from the IQA problem since it imposes a number of challenges on top of the challenges faced in the IQA problem, namely those arising from the temporal dimension in video, which plays an important role in influencing human perception of quality. We compare our IQA approach to the state of the art in blind, reduced-reference and full-reference methods, and we show that it is top performing. We compare our VQA approach to the state of the art in reduced-reference and full-reference methods (no blind VQA methods that perform reliably well exist), and show that our algorithm performs as well as the top performing full-reference and reduced-reference algorithms in predicting human judgments of quality.
|
107 |
Rapid identification of oral Actinomyces species of the subgingival biofilm using MALDI-TOF-MS
Borgmann, Toralf Harald 25 November 2015
Actinomycetes are part of the resident flora of the human digestive tract, the urogenital system and the skin. The time-consuming isolation and identification of actinomycetes by conventional methods often proves very difficult. In recent years, however, matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF-MS) has developed into an alternative to established procedures and is now a fast and simple method for bacterial identification. Our study examines the usefulness of this method for the fast and reliable identification of oral actinomycetes isolated from the subgingival biofilm of patients with periodontal disease. In this study, eleven different reference strains from the ATCC and DSMZ strain collections and 674 clinical strains were examined. All strains were first identified by biochemical methods and then identified and classified from the acquired MALDI-TOF-MS data by similarity analyses and classification methods. The genotype of the reference strains and of 232 clinical strains was determined by sequencing of the 16S rDNA. Sequencing confirmed the identification of the reference strains. These strains and the actinomycetes unambiguously identified by 16S rDNA sequencing were used to build a MALDI-TOF-MS database. Classification methods were applied to enable differentiation and identification. Our results show that combining data acquisition by MALDI-TOF-MS with processing of the data by SVM algorithms is a good option for the identification and differentiation of oral actinomycetes.
|
108 |
Acoustic impulse detection algorithms for application in gunshot localization
Van der Merwe, J. F. January 2012
M. Tech. Electrical Engineering. / Attempts to find computationally efficient ways to identify and extract gunshot impulses from signals. Areas of study include Generalised Cross Correlation (GCC), sidelobe minimisation utilising Least Square (LS) techniques as well as training algorithms using a Reproducing Kernel Hilbert Space (RKHS) approach. It also incorporates Support Vector Machines (SVM) to train a network to recognise gunshot impulses. By combining these individual research areas more optimal solutions are obtainable.
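As a point of reference for the cross-correlation approach, the sketch below estimates the delay between two sensor signals by plain time-domain cross-correlation, which corresponds to GCC with unit frequency weighting. It is a brute-force illustration with a hypothetical function name, not the method developed in the thesis, and it omits the frequency-domain weighting that GCC variants normally add.

```python
def estimate_delay(x, y, max_lag):
    """Return the lag (in samples) at which y best matches x.

    A positive result means the event appears later in y than in x.
    """
    best_lag, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        val = 0.0
        for n, xn in enumerate(x):
            m = n + lag
            if 0 <= m < len(y):
                val += xn * y[m]  # correlation of x[n] with y[n + lag]
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag
```

With several microphones, pairwise delays of this kind are the usual input to a localization step.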
|
109 |
Neural networks and support vector machines
Κυρίτσης, Κωνσταντίνος 01 October 2014
The aim of this dissertation is to compare two major families of methods, Artificial Neural Networks and Support Vector Machines (SVMs), the latter very popular in recent years, on data classification and regression.
The first chapter covers topics in data mining and data classification. The second chapter addresses several topics from the vast field of neural networks: why they were created, their theory, several of their topologies and architectures, and the needs for better results that arose from their advantages and disadvantages. The third chapter deals with Support Vector Machines: why they are so popular, how they are realized theoretically and geometrically, what they achieve, and their advantages and disadvantages. The fourth chapter uses experimental results on real data sets (patterns or instances) to compare Artificial Neural Networks with SVMs. It asks which indicators ultimately determine the better classifier overall, and whether "better" means more accurate, faster, or something in between. The fifth chapter explains what regression is and compares the main algorithms from Artificial Neural Networks and Support Vector Machines. The sixth chapter describes a Java application, implemented as part of the dissertation, that performs classification and regression on data from arff files. It focuses only on classification and regression; what distinguishes it from Weka is its prediction feature, in which the user supplies an instance and the application makes a prediction for it. Finally come the epilogue, the appendices, which cover basic theoretical concepts referred to in earlier chapters, UML diagrams of the classes implemented for classification and prediction in Weka, and some pieces of Java code from the implementation of the program.
The dissertation includes an extensive bibliography, which is cited throughout. Considerable effort was made to convey how much important work has been done in the field of Artificial Intelligence, from Alan Turing and McCulloch and Pitts to Vapnik, Osuna, Platt and many others who followed.
|
110 |
Plant-wide Performance Monitoring and Controller Prioritization
Pareek, Samidh Unknown Date
No description available.
|