Global ETD Search

91	Generalization Performance of Margin Multi-category Classifiers / Performances en généralisation des classifieurs multi-classes à marge Musayeva, Khadija 23 September 2019 Cette thèse porte sur la théorie de la discrimination multi-classe à marge. Elle a pour cadre la théorie statistique de l’apprentissage de Vapnik et Chervonenkis. L’objectif est d’établir des bornes de généralisation possédant une dépendance explicite au nombre C de catégories, à la taille m de l’échantillon et au paramètre de marge gamma, lorsque la fonction de perte considérée est une fonction de perte à marge possédant la propriété d’être lipschitzienne. La borne de généralisation repose sur la performance empirique du classifieur ainsi que sur sa "capacité". Dans cette thèse, les mesures de capacité considérées sont les suivantes : la complexité de Rademacher, les nombres de recouvrement et la dimension fat-shattering. Nos principales contributions sont obtenues sous l’hypothèse que les classes de fonctions composantes calculées par le classifieur ont des dimensions fat-shattering polynomiales et que les fonctions composantes sont indépendantes. Dans le contexte du schéma de calcul introduit par Mendelson, qui repose sur les relations entre les mesures de capacité évoquées plus haut, nous étudions l’impact que la décomposition au niveau de l’une de ces mesures de capacité a sur les dépendances (de la borne de généralisation) à C, m et gamma. En particulier, nous démontrons que la dépendance à C peut être considérablement améliorée par rapport à l’état de l’art si la décomposition est reportée au niveau du nombre de recouvrement ou de la dimension fat-shattering. Ce changement peut affecter négativement le taux de convergence (dépendance à m), ce qui souligne le fait que l’optimisation par rapport aux trois paramètres fondamentaux se traduit par la recherche d’un compromis. 
/ This thesis deals with the theory of margin multi-category classification, and is based on the statistical learning theory founded by Vapnik and Chervonenkis. We are interested in deriving generalization bounds with explicit dependencies on the number C of categories, the sample size m and the margin parameter gamma, when the loss function considered is a Lipschitz continuous margin loss function. Generalization bounds rely on the empirical performance of the classifier as well as its "capacity". In this work, the following scale-sensitive capacity measures are considered: the Rademacher complexity, the covering numbers and the fat-shattering dimension. Our main contributions are obtained under the assumption that the classes of component functions implemented by a classifier have polynomially growing fat-shattering dimensions and that the component functions are independent. In the context of the pathway of Mendelson, which relates the Rademacher complexity to the covering numbers and the latter to the fat-shattering dimension, we study the impact that decomposing at the level of one of these capacity measures has on the dependencies on C, m and gamma. In particular, we demonstrate that the dependency on C can be substantially improved over the state of the art if the decomposition is postponed to the level of the metric entropy or the fat-shattering dimension. On the other hand, this negatively impacts the rate of convergence (dependency on m), which indicates that optimizing the dependencies on the three basic parameters amounts to looking for a trade-off. Apprentissage Théorie de l’apprentissage Discrimination multi-classe Risques garantis Classifieurs à marge Statistical learning theory Multi-category classification Risk bounds Margin classifiers 006.3 518.1
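For readers unfamiliar with the kind of result this abstract refers to, the block below gives the standard textbook form of a Rademacher-complexity generalization bound for a loss taking values in [0, 1]. It is a generic illustration, not the bound derived in the thesis: the symbols L, L_{\gamma,m}, \mathcal{F}_\gamma, R_m and \delta are notation assumed here, with L(g) the expected margin loss of a classifier g, L_{\gamma,m}(g) its empirical counterpart on a sample of size m, \mathcal{F}_\gamma the class of margin-loss functions induced by the classifiers, and R_m its expected Rademacher complexity.

```latex
% With probability at least 1 - \delta over an i.i.d. sample of size m,
% simultaneously for every classifier g in the class considered:
\[
  L(g) \;\le\; L_{\gamma,m}(g) \;+\; 2\,R_m(\mathcal{F}_\gamma)
  \;+\; \sqrt{\frac{\ln(1/\delta)}{2m}}
\]
```

In a bound of this shape, the dependencies on C, m and gamma discussed in the abstract would enter through the way the capacity term R_m(\mathcal{F}_\gamma) is itself upper bounded, for instance via covering numbers and then the fat-shattering dimension.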
92	A New Contribution To Nonlinear Robust Regression And Classification With Mars And Its Applications To Data Mining For Quality Control In Manufacturing Yerlikaya, Fatma 01 September 2008 Multivariate adaptive regression splines (MARS) is a modern methodology from statistical learning that is important in both classification and regression, with an increasing number of applications in many areas of science, economy and technology. MARS is very useful for high-dimensional problems and shows great promise for fitting nonlinear multivariate functions. The MARS technique does not impose any particular class of relationship between the predictor variables and the outcome variable of interest. In other words, a special advantage of MARS lies in its ability to estimate the contribution of the basis functions so that both the additive and interaction effects of the predictors are allowed to determine the response variable. The function fitted by MARS is continuous, whereas the one fitted by classification and regression trees (CART) is not. MARS thus becomes an alternative to CART. The MARS algorithm for estimating the model function consists of two complementary algorithms: the forward and backward stepwise algorithms. In the first step, the model is built by adding basis functions until a maximum level of complexity is reached. The backward stepwise algorithm then removes the least significant basis functions from the model. In this study, we propose not to use the backward stepwise algorithm. Instead, we construct a penalized residual sum of squares (PRSS) for MARS as a Tikhonov regularization problem, which is also known as ridge regression. We treat this problem using continuous optimization techniques, which we expect to become an important complementary technology and an alternative to the backward stepwise algorithm. 
In particular, we apply the elegant framework of conic quadratic programming, a well-structured area of convex optimization that resembles linear programming and hence permits the use of interior point methods. The boundaries of this optimization problem are determined by a multiobjective optimization approach, which provides us with many alternative solutions. Based on these theoretical and algorithmic studies, this MSc thesis also contains applications to data investigated in a TÜBİTAK project on quality control. In these applications, MARS and our new method are compared. QA Numerical Analysis 297-299.4
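The MARS building blocks named in this abstract can be sketched in a few lines. The example below is a minimal, one-dimensional illustration under stated assumptions: it uses the usual mirrored truncated-linear ("hinge") basis functions and a plain squared-coefficient ridge penalty as a stand-in for the Tikhonov term. The function names, the `(knot, coef_right, coef_left)` layout and the penalty form are hypothetical; the thesis's actual PRSS and its conic quadratic programming solution are not reproduced here.

```python
def hinge_pair(x, knot):
    """Return the mirrored hinge values (max(0, x - knot), max(0, knot - x))."""
    return max(0.0, x - knot), max(0.0, knot - x)


def mars_predict(x, intercept, terms):
    """Evaluate an additive one-dimensional MARS-style model at x.

    terms is a list of (knot, coef_right, coef_left) tuples; this layout is
    an assumption made for the sketch, not taken from the thesis.
    """
    y = intercept
    for knot, coef_right, coef_left in terms:
        right, left = hinge_pair(x, knot)
        y += coef_right * right + coef_left * left
    return y


def penalized_rss(xs, ys, intercept, terms, lam):
    """Residual sum of squares plus a ridge-type penalty lam * sum(coef**2).

    The simple squared-coefficient penalty is a simplified stand-in for a
    Tikhonov regularization term.
    """
    rss = sum((y - mars_predict(x, intercept, terms)) ** 2
              for x, y in zip(xs, ys))
    penalty = sum(c_r ** 2 + c_l ** 2 for _, c_r, c_l in terms)
    return rss + lam * penalty
```

With `lam = 0` the objective reduces to the ordinary residual sum of squares; increasing `lam` trades goodness of fit against the size of the basis-function coefficients, which is the role the abstract assigns to regularization in place of backward pruning.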
93	A Mathematical Contribution Of Statistical Learning And Continuous Optimization Using Infinite And Semi-infinite Programming To Computational Statistics Ozogur-akyuz, Sureyya 01 February 2009 A subfield of artificial intelligence, machine learning (ML), is concerned with the development of algorithms that allow computers to “learn”. ML is the process of training a system with a large number of examples, extracting rules and finding patterns in order to make predictions on new data points (examples). The most common machine learning schemes are supervised, semi-supervised, unsupervised and reinforcement learning. These schemes apply to natural language processing, search engines, medical diagnosis, bioinformatics, detecting credit fraud, stock market analysis, classification of DNA sequences, speech and handwriting recognition in computer vision, to name just a few. In this thesis, we focus on Support Vector Machines (SVMs), which are among the most powerful methods currently available in machine learning. As a first motivation, we develop a model selection tool integrated into SVMs in order to solve a particular problem of computational biology, namely the prediction of eukaryotic pro-peptide cleavage sites, applied to real data collected from the NCBI data bank. Based on our biological example, the model selection method is then generalized to all kinds of learning problems. In ML algorithms, one of the crucial issues is the representation of the data. Discrete geometric structures and, especially, linear separability of the data play an important role in ML. If the data are not linearly separable, a kernel function maps them into a higher-dimensional space in which they are linearly separable. As the data become heterogeneous and large-scale, single kernel methods become insufficient to classify nonlinear data. 
Convex combinations of kernels were developed to classify this kind of data [8]. Nevertheless, the selection of kernel combinations is limited to a finite choice. In order to overcome this limitation, we propose a novel method of “infinite” kernel combinations for learning problems with the help of infinite and semi-infinite programming regarding all elements in kernel space. This makes it possible to study variations of combinations of kernels when considering heterogeneous data in real-world applications. Combination of kernels can be done, e.g., along a homotopy parameter or a more specific parameter. Looking at all infinitesimally fine convex combinations of the kernels from the infinite kernel set, the margin is maximized subject to an infinite number of constraints with a compact index set and an additional (Riemann-Stieltjes) integral constraint due to the combinations. After a parametrization in the space of probability measures, it becomes semi-infinite. We analyze the regularity conditions which satisfy the Reduction Ansatz and discuss the type of distribution functions within the structure of the constraints and our bilevel optimization problem. Finally, we adapted well-known numerical methods of semi-infinite programming to our new kernel machine. We improved the discretization method for our specific model and proposed two new algorithms. We proved the convergence of the numerical methods and we analyzed the conditions and assumptions of these convergence theorems such as optimality and convergence.
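The finite convex combination of kernels that this abstract takes as its starting point can be illustrated as follows. This is a minimal sketch under stated assumptions: it combines Gaussian kernels of different widths with nonnegative weights summing to one. The choice of Gaussian kernels, the function names and the parameters are hypothetical, and the "infinite" combination proposed in the thesis (an integral over a kernel parameter with respect to a probability measure) is only approximated by such a finite sum.

```python
import math


def gaussian_kernel(x, y, width):
    """Gaussian (RBF) kernel value between two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * width ** 2))


def combined_kernel(x, y, widths, weights):
    """Convex combination sum_k weights[k] * K_{widths[k]}(x, y).

    The weights must be nonnegative and sum to one, so that the result
    is itself a valid kernel.
    """
    if len(widths) != len(weights):
        raise ValueError("widths and weights must have the same length")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be nonnegative and sum to one")
    return sum(w * gaussian_kernel(x, y, s)
               for w, s in zip(weights, widths))
```

A discretization of a continuous family of kernels, as mentioned at the end of the abstract, would correspond to choosing a finite grid of `widths` and optimizing over the associated `weights`.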
94	Résumé des Travaux en Statistique et Applications des Statistiques Clémençon, Stéphan 01 December 2006 This report briefly presents the core of my research activity since my doctoral thesis [53], whose main aim was to extend the use of recent advances in computational harmonic analysis for adaptive nonparametric estimation in the i.i.d. setting (such as wavelet analysis) to statistical estimation for Markovian data. As explained in [123], results on concentration-of-measure properties (i.e. probability and moment inequalities over certain function classes suited to nonlinear approximation) are indispensable for exploiting these analytical tools in a probabilistic framework and for obtaining statistical estimation procedures whose rates of convergence surpass those of earlier methods. In [53] (see also [54], [55] and [56]), an analysis method based on renewal theory, the so-called 'regenerative' method (see [185]), which consists in dividing the trajectories of a Harris recurrent Markov chain into asymptotically i.i.d. segments, was used extensively to establish the required probabilistic results, the long-term behavior of Markov processes being governed by renewal processes (which randomly define the segments of the trajectory). Once the estimator has been constructed, it is then important to quantify the uncertainty inherent in the estimate it provides (measured by specific quantiles, the variance, or certain appropriate functionals of the distribution of the statistic considered). 
In this respect, and beyond the extreme simplicity of its implementation (since it simply amounts to drawing i.i.d. samples from the original sample and recomputing the statistic on the new sample, the bootstrap sample), the bootstrap has major theoretical advantages over the asymptotic Gaussian approximation (the bootstrap distribution automatically approximates the second-order structure in the Edgeworth expansion of the distribution of the statistic). It seemed natural to me to consider the problem of extending the traditional bootstrap procedure to Markovian data. Through work carried out in collaboration with Patrice Bertail, the regenerative method proved to be not only a powerful analytical tool for establishing limit theorems or inequalities, but also a source of practical methods for statistical estimation: the proposed generalization of the bootstrap consists in resampling a random number of regenerative data segments (or approximations of them) so as to mimic the renewal structure underlying the data. This approach also proved relevant to many other statistical problems. The first part of the report thus essentially aims to present the principle of renewal-based statistical methods for Harris Markov chains. The second part of the report is devoted to the construction and study of statistical methods for learning to rank objects, and no longer merely to classify them (i.e. assign them a label), in a supervised setting. This difficult problem is of crucial importance in many application domains, ranging from the design of indicators for medical diagnosis to information retrieval (search engines), and raises ambitious theoretical and algorithmic questions that have not yet been satisfactorily resolved. 
One possible approach is to reduce the problem to the classification of pairs of observations, as suggested by a criterion widely used in the applications mentioned above (the AUC criterion) to evaluate the relevance of an ordering. In work carried out in collaboration with Gabor Lugosi and Nicolas Vayatis, several results were obtained in this direction, requiring the study of U-processes: the novel aspect of the problem lies in the fact that the natural estimator of the risk here takes the form of a U-statistic. However, in many applications such as information retrieval, only the ordering of the most relevant objects truly matters, and the search for criteria corresponding to such problems (known as localized ranking problems) and for algorithms for building rules that yield optimal 'rankings' with respect to these criteria is a crucial issue in this field. Several developments along these lines have been made in a series of works (still ongoing) in collaboration with Nicolas Vayatis. Finally, the third part of the report reflects my interest in applications of probabilistic concepts and statistical methods. Owing to my initial training, I was naturally led to consider applications in finance first. And although historical approaches generally arouse little enthusiasm in this field, I gradually became convinced of the important role that nonparametric statistical methods could play in analyzing the massive data (very high-dimensional and 'high-frequency' in nature) available in finance, in order to detect hidden structures and exploit them for market risk assessment or portfolio management, for example. This point of view is illustrated by the brief presentation, in this third part, of work carried out along these lines in collaboration with Skander Slim. 
In recent years, I have had the opportunity to meet applied mathematicians and scientists working in other fields that can also benefit from advances in probabilistic modeling and statistical methods. I was thus able to address applications related to toxicology, more precisely to the problem of assessing the risks of contamination through food, during my year of secondment to the Institut National de la Recherche Agronomique within the Metarisk unit, a multidisciplinary unit entirely devoted to food risk analysis. For example, I was able to use my expertise in Markovian modeling to propose a stochastic model describing the temporal evolution of the quantity of contaminant present in the body (so as to account for both the accumulation phenomenon due to successive intakes and the pharmacokinetics specific to the contaminant, which governs the elimination process), together with suitable statistical inference methods, in joint work with Patrice Bertail and Jessica Tressou. This line of research is still ongoing, and one may hope that it will eventually provide a basis for public health recommendations. In addition, I am currently fortunate to be working with Hector de Arazoza, Bertran Auvert, Patrice Bertail, Rachid Lounes and Viet-Chi Tran on the stochastic modeling of the HIV epidemic, based on epidemiological data recorded for the population of Cuba, which constitute one of the best-documented databases on the evolution of an epidemic of this type. 
And although this project essentially aims to obtain a numerical model (making it possible to forecast the incidence of the epidemic in the short term, so as to plan the production of the required quantity of antiretrovirals, for example), it has led us to tackle ambitious theoretical questions, ranging from the existence of a quasi-stationary measure describing the long-term evolution of the epidemic to problems related to the incomplete nature of the available epidemiological data. Unfortunately, I cannot discuss these questions here without risking distorting them; the presentation of the mathematical problems encountered in this project would by itself merit an entire report. [MATH] Mathematics Markov chain/process regenerative process nonparametric statistics bootstrap limit theorems supervised statistical learning ranking applications to biosciences application to finance
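The abstract above describes the AUC criterion as a comparison of pairs of observations whose natural empirical estimator has the form of a U-statistic. The sketch below illustrates this pairwise form: the empirical AUC is the average, over all (positive, negative) pairs, of an indicator that the positive item is scored higher. The function name, the 0/1 label encoding and the convention of counting ties as one half are assumptions made for illustration and are not taken from the report.

```python
def empirical_auc(scores, labels):
    """Empirical AUC as an average over (positive, negative) pairs.

    scores: real-valued scores, higher meaning "more likely positive".
    labels: 1 for positive items, 0 for negative items (assumed encoding).
    Ties between a positive and a negative score count as 1/2 (assumed
    convention).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative item")

    concordant = 0.0
    for s_pos in positives:
        for s_neg in negatives:
            if s_pos > s_neg:
                concordant += 1.0   # pair ranked in the correct order
            elif s_pos == s_neg:
                concordant += 0.5   # tie
    return concordant / (len(positives) * len(negatives))
```

Because every term depends on a pair of observations rather than on a single one, the terms of the average are not independent, which is why the report points to U-statistics and U-processes rather than ordinary empirical means.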
95	Multistage stochastic programming models for the portfolio optimization of oil projects Chen, Wei, 1974- 20 December 2011 Exploration and production (E&P) involves the upstream activities from looking for promising reservoirs to extracting oil and selling it to downstream companies. E&P is the most profitable business in the oil industry. However, it is also the most capital-intensive and risky. Hence, the proper assessment of E&P projects with effective management of uncertainties is crucial to the success of any upstream business. This dissertation concentrates on developing portfolio optimization models to manage E&P projects. The idea is not new, but it has been mostly restricted to the conceptual level due to the inherent difficulty of capturing interactions among projects. We disentangle these complications by modeling the project portfolio optimization problem as multistage stochastic programs with mixed integer programming (MIP) techniques. Due to the disparate nature of uncertainties, we separately consider explored and unexplored oil fields. We model portfolios of real options and portfolios of decision trees for the two cases, respectively. The resulting project portfolio models provide rigorous and consistent treatments to optimally balance the total rewards and the overall risk. For explored oil fields, oil price fluctuations dominate the geologic risk. The field development process can hence be modeled and assessed as sequentially compounded options with our optimization-based option pricing models. We can further model the portfolio of real options to solve the dynamic capital budgeting problem for oil projects. For unexplored oil fields, the geologic risk plays the dominant role in determining how a field is optimally explored and developed. We can model the E&P process as a decision tree in the form of an optimization model with MIP techniques. 
By applying the inventory-style budget constraints, we can pool multiple project-specific decision trees to get the multistage E&P project portfolio optimization (MEPPO) model. The resulting large scale MILP is efficiently solved by a decomposition-based primal heuristic algorithm. The MEPPO model requires a scenario tree to approximate the stochastic process of the geologic parameters. We apply statistical learning, Monte Carlo simulation, and scenario reduction methods to generate the scenario tree, in which prior beliefs can be progressively refined with new information. / text Options pricing Real options Portfolio of real options Decision analysis Decision trees Portfolio of projects Risk management Exploration and production Oil and gas Oil field development Scenario generation Statistical learning Project dependence
96	Computational intelligence methods on biomedical signal analysis and data mining in medical records Vladutu, Liviu-Mihai 05 May 2009 This thesis is centered around the development and application of computationally effective solutions based on artificial neural networks (ANN) for biomedical signal analysis and data mining in medical records. The ultimate goal of this work in the field of Biomedical Engineering is to provide the clinician with the best possible information needed to make an accurate diagnosis (in our case of myocardial ischemia) and to propose advanced mathematical models for recovering the complex dependencies between the variables of a physical process from a set of perturbed observations. After describing some of the types of ANN mainly used in this work, we start by designing a model for pattern classification that constructs several local models for neighborhoods of the state space. For this task, we use the novel k-windows clustering algorithm to automatically detect neighborhoods in the state space. This algorithm, with a slight modification (unsupervised k-windows algorithm), has the ability to endogenously determine the number of clusters present in the data set during the clustering process. We used this method together with the other two mentioned below (NetSOM and sNet-SOM) for the problem of ischemia detection. Next, we propose the utilization of a statistically extracted distance measure in the context of Generalized Radial Basis Function (GRBF) networks. The main properties of the GRBF networks are retained in a new metric space, called Statistical Distance Metric (SDM). The regularization potential of these networks can be realized with this type of distance. 
Furthermore, the recent engineering of neural networks offers effective solutions for learning smooth functionals that lie on high-dimensional spaces. We tested this solution with an application from bioinformatics, one example from data mining of commercial databases and finally with some examples using medical databases from a Machine Learning Repository. We continue by establishing the network self-organizing map (NetSOM) model, which attempts to generalize the regularization and ordering potential of the basic SOM from the space of vectors to the space of approximating functions. It becomes a device for the ordering of local experts (i.e. independent neural networks) over its lattice of neurons and for their selection and coordination. Finally, an alternative to NetSOM is proposed, which uses unsupervised ordering based on self-organizing maps (SOM) for the "simple" regions and a two-stage learning process for the "difficult" ones. It differs from the previous model (NetSOM) in two respects: first, we replaced the fixed-size SOM with a dynamically expanded map; second, the supervised learning was based this time on Radial Basis Function (RBF) networks and Support Vector Machines (SVM). This tool (called sNet-SOM) was used in two fields: ischemia detection and data mining. / Η παρούσα διδακτορική διατριβή είναι επικεντρωμένη γύρω από την ανάπτυξη και εφαρμογή, με χαμηλές υπολογιστικές απαιτήσεις, βασισμένες σε Τεχνητά Νευρωνικά Δίκτυα, για την Ανάλυση Βιοϊατρικών σημάτων και Data Mining σε Ιατρικά Δεδομένα. Απώτερος σκοπός της παρούσης διατριβής στον τομέα της Βιοϊατρικής Τεχνολογίας είναι να παρέχει στους ιατρούς με την καλύτερη δυνατή πληροφόρηση για να κάνουν μια ακριβή διάγνωση (στην περίπτωση του ισχαιμικού μυοκαρδίου) και να προτείνει αναπτυγμένα μαθηματικά μοντέλα για να ανακάμψει πολύπλοκες εξαρτήσεις μεταξύ τον μεταβλητών μιας φυσικής διεργασίας από ένα σύνολο διαφορετικών παρατηρήσεων. 
Μετά την περιγραφή μερικών από τους βασικούς τύπους τεχνητών Νευρωνικών Δικτύων που χρησιμοποιούνται στην παρούσα διατριβή, εμείς αρχίσαμε να σχεδιάζουμε ένα μοντέλο για ταξινόμηση προτύπων κατασκευάζοντας πολλά τοπικά μοντέλα γειτονικά με τον παρόντα χώρο. Για αυτό το σκοπό εμείς χρησιμοποιούμε το αλγόριθμο για clustering k-windows για να ανιχνεύει αυτόματα γειτονιές στον παρόντα χώρο. Αυτός ο αλγόριθμος με μια ελαφριά τροποποίηση έχει την ικανότητα να καθορίζει ενδογενώς την παρουσία του αριθμού τον clusters στο σύνολο τον δεδομένων κατά την διάρκεια της διαδικασίας του clustering. Όταν η διαδικασία του clustering ολοκληρώνεται ένα εκπαιδευμένο Εμπροσθοτροφοδοτούμενο Νευρωνικό Δίκτυο δρα ως ο τοπικός προβλέπτης για κάθε cluster. Εν συνεχεία, προτείνουμε τη χρήση εξαγόμενης στατιστικής μετρητικής απόστασης, μέσα στο γενικότερο πλαίσιο των δικτύων ( GRBF). Οι κύριες λειτουργίες των GRBF (Generalized Radial Basis Functions) δικτύων διατηρούνται στο καινούργιο μετρητικό χώρο. Η δυναμική κανονικοποίηση αυτών των δικτύων μπορεί να πραγματοποιηθεί με αυτό τον τύπο αποστάσεων. Επιπλέον η πρόσφατη τεχνολογία των ΝΝ (Neural Networks) προσφέρει αποτελεσματικές λύσεις για τη μάθηση ομαλών συναρτήσεων που βρίσκεται σε υψηλούς διαστατικούς χώρους. Δοκιμάσαμε αυτή τη λύση σε εφαρμογή βιοπληροφορικής, μία από εμπορικές βάσεις δεδομένων και τέλος με μερικά παραδείγματα χρησιμοποιώντας βάσεις δεδομένων από το UCI (University of California at Irvine) από το ιατρικό πεδίο. Συνεχίζοντας, καθιδρύουμε το δίκτυο NetSOM (network Self-Οrganizing Map), που προσπαθεί να γενικεύσει (generalize) την κανονικοποίηση (regularization) και να δώσει δυναμικές εντολές (ordering) του βασικού SOM από το διανυσματικό χώρο στο χώρο των προσεγγιστικών συναρτήσεων. Αποτελεί μια εντολοδόχο διαδικασία για τους τοπικούς ειδικούς πάνω από το πλέγμα των νευρώνων και για την επιλογή και το συντονισμό τους. 
Τέλος, αναλύεται μια εναλλακτική λύση του NetSOM, που χρησιμοποιεί μη εκπαιδευμένες εντολές βασισμένες στο SOMs για τις “απλές ” περιοχές και για τις “δύσκολες ” μια διαδικασία μάθησης 2-επιπέδων. Υπάρχουν 2 διαφορές στα αποτελέσματα από την σύγκριση με το προηγούμενο μοντέλο (NetSOM), η πρώτη είναι ότι αντικαταστήσαμε (we replaced) a fixed-size των SOM με ένα πιο δυναμικό ταίριασμα (mapping) και η δεύτερη, η εκπαιδευόμενη εκμάθηση βασίστηκε αυτή τη φορά στην RBF και στις μηχανές υποστήριξης διανυσμάτων (SVM). Αυτό το εργαλείο χρησιμοποιήθηκε στην αναγνώριση των ισχαιμιών και εξόρυξη δεδομένων από βάσεις δεδομένων. Artificial neural networks Statistical learning Ischemia detection K-windows clustering 610.285 Στατιστική μάθηση Αλγόριθμοι για clustering
97	Supervised metric learning with generalization guarantees Bellet, Aurélien 11 December 2012 In recent years, the crucial importance of metrics in machine learning algorithms has led to an increasing interest in optimizing distance and similarity functions using knowledge from training data to make them suitable for the problem at hand. This area of research is known as metric learning. Existing methods typically aim at optimizing the parameters of a given metric with respect to some local constraints over the training sample. The learned metrics are generally used in nearest-neighbor and clustering algorithms. When data consist of feature vectors, a large body of work has focused on learning a Mahalanobis distance, which is parameterized by a positive semi-definite matrix. Recent methods offer good scalability to large datasets. Less work has been devoted to metric learning from structured objects (such as strings or trees), because it often involves complex procedures. Most of the work has focused on optimizing a notion of edit distance, which measures (in terms of number of operations) the cost of turning an object into another. We identify two important limitations of current supervised metric learning approaches. First, they can improve the performance of local algorithms such as k-nearest neighbors, but metric learning for global algorithms (such as linear classifiers) has not really been studied so far. Second, and perhaps more importantly, the question of the generalization ability of metric learning methods has been largely ignored. In this thesis, we propose theoretical and algorithmic contributions that address these limitations. Our first contribution is the derivation of a new kernel function built from learned edit probabilities. Unlike other string kernels, it is guaranteed to be valid and parameter-free. 
Our second contribution is a novel framework for learning string and tree edit similarities inspired by the recent theory of (epsilon,gamma,tau)-good similarity functions and formulated as a convex optimization problem. Using uniform stability arguments, we establish theoretical guarantees for the learned similarity that give a bound on the generalization error of a linear classifier built from that similarity. In our third contribution, we extend the same ideas to metric learning from feature vectors by proposing a bilinear similarity learning method that efficiently optimizes the (epsilon,gamma,tau)-goodness. The similarity is learned based on global constraints that are more appropriate to linear classification. Generalization guarantees are derived for our approach, highlighting that our method minimizes a tighter bound on the generalization error of the classifier. Our last contribution is a framework for establishing generalization bounds for a large class of existing metric learning algorithms. It is based on a simple adaptation of the notion of algorithmic robustness and allows the derivation of bounds for various loss functions and regularizers. Metric learning Statistical learning Convex optimization Classification Structured data Edit distance Generalization bounds
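The edit distance that this abstract describes as the usual starting point for metric learning on strings can be sketched as below. This is the classical Levenshtein distance with unit costs for insertion, deletion and substitution, shown only as a baseline illustration: the thesis learns edit costs or probabilities from data, which this fixed-cost version does not do. The function name is hypothetical.

```python
def edit_distance(source, target):
    """Levenshtein distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn source into target."""
    # previous_row[j] holds the distance between the first i-1 characters
    # of source and the first j characters of target.
    previous_row = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current_row = [i]  # turning source[:i] into "" takes i deletions
        for j, target_char in enumerate(target, start=1):
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            substitution = previous_row[j - 1] + (source_char != target_char)
            current_row.append(min(deletion, insertion, substitution))
        previous_row = current_row
    return previous_row[-1]
```

A learned variant would replace the three unit costs with operation-specific costs fitted to training data; keeping only two rows of the dynamic-programming table is a design choice that limits memory to the length of `target`.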
98	Learning based event model for knowledge extraction and prediction system in the context of Smart City / Un modèle de gestion d'évènements basé sur l'apprentissage pour un système d'extraction et de prédiction dans le contexte de Ville Intelligente Kotevska, Olivera 30 January 2018 Des milliards de «choses» connectées à l’internet constituent les réseaux symbiotiques de périphériques de communication (par exemple, les téléphones, les tablettes, les ordinateurs portables), les appareils intelligents, les objets (par exemple, la maison intelligente, le réfrigérateur, etc.) et des réseaux de personnes comme les réseaux sociaux. La notion de réseaux traditionnels se développe et, à l'avenir, elle ira au-delà, y compris plus d'entités et d'informations. Ces réseaux et ces dispositifs détectent, surveillent et génèrent constamment une grande quantité de données sur tous les aspects de la vie humaine. L'un des principaux défis dans ce domaine est que le réseau se compose de «choses» qui sont hétérogènes à bien des égards, les deux autres, c'est qu'ils changent au fil du temps, et il y a tellement d'entités dans le réseau qui sont essentielles pour identifier le lien entre eux. Dans cette recherche, nous abordons ces problèmes en combinant la théorie et les algorithmes du traitement des événements avec les domaines d'apprentissage par machine. Notre objectif est de proposer une solution possible pour mieux utiliser les informations générées par ces réseaux. Cela aidera à créer des systèmes qui détectent et répondent rapidement aux situations qui se produisent dans la vie urbaine afin qu'une décision intelligente puisse être prise pour les citoyens, les organisations, les entreprises et les administrations municipales. Les médias sociaux sont considérés comme une source d'information sur les situations et les faits liés aux utilisateurs et à leur environnement social. 
Au début, nous abordons le problème de l'identification de l'opinion publique pour une période donnée (année, mois) afin de mieux comprendre la dynamique de la ville. Pour résoudre ce problème, nous avons proposé un nouvel algorithme pour analyser des données textuelles complexes et bruyantes telles que Twitter-messages-tweets. Cet algorithme permet de catégoriser automatiquement et d'identifier la similarité entre les sujets d'événement en utilisant les techniques de regroupement. Le deuxième défi est de combiner les données du réseau avec diverses propriétés et caractéristiques en format commun qui faciliteront le partage des données entre les services. Pour le résoudre, nous avons créé un modèle d'événement commun qui réduit la complexité de la représentation tout en conservant la quantité maximale d'informations. Ce modèle comporte deux ajouts majeurs : la sémantiques et l’évolutivité. La partie sémantique signifie que notre modèle est souligné avec une ontologie de niveau supérieur qui ajoute des capacités d'interopérabilité. Bien que la partie d'évolutivité signifie que la structure du modèle proposé est flexible, ce qui ajoute des fonctionnalités d'extensibilité. Nous avons validé ce modèle en utilisant des modèles d'événements complexes et des techniques d'analyse prédictive. Pour faire face à l'environnement dynamique et aux changements inattendus, nous avons créé un modèle de réseau dynamique et résilient. Il choisit toujours le modèle optimal pour les analyses et s'adapte automatiquement aux modifications en sélectionnant le meilleur modèle. Nous avons utilisé une approche qualitative et quantitative pour une sélection évolutive de flux d'événements, qui réduit la solution pour l'analyse des liens, l’optimale et l’alternative du meilleur modèle. 
/ Billions of “things” connected to the Internet constitute the symbiotic networks of communication devices (e.g., phones, tablets, and laptops), smart appliances (e.g., fridge, coffee maker and so forth) and networks of people (e.g., social networks). The concept of traditional networks (e.g., computer networks) is therefore expanding and will in the future go beyond them, including more entities and information. These networks and devices are constantly sensing, monitoring and generating a vast amount of data on all aspects of human life. One of the main challenges in this area is that the network consists of “things” that are heterogeneous in many ways; another is that the state of the interconnected objects changes over time; and a third is that the network contains so many entities that it is crucial to identify their interdependencies in order to better monitor and predict the network's behavior. In this research, we address these problems by combining the theory and algorithms of event processing with machine learning. Our goal is to propose a possible solution for making better use of the information generated by these networks. It will help to create systems that detect and respond promptly to situations occurring in urban life so that smart decisions can be made for citizens, organizations, companies and city administrations. Social media is treated as a source of information about situations and facts related to the users and their social environment. First, we tackle the problem of identifying public opinion for a given period (year, month) to get a better understanding of city dynamics. To solve this problem, we propose a new algorithm to analyze complex and noisy textual data such as Twitter messages (tweets). This algorithm permits automatic categorization and similarity identification between event topics by using clustering techniques. 
The second challenge is combining network data with various properties and characteristics in a common format that will facilitate data sharing among services. To solve it, we created a common event model that reduces the representation complexity while keeping the maximum amount of information. This model has two major additions: semantics and scalability. The semantic part means that our model is underpinned by an upper-level ontology that adds interoperability capabilities. The scalability part means that the structure of the proposed model is flexible, allowing new entries and features to be added. We validated this model by using complex event patterns and predictive analytics techniques. To deal with the dynamic environment and unexpected changes, we created a dynamic, resilient network model. It always chooses the optimal model for analytics and automatically adapts to changes by selecting the next best model. We used a qualitative and quantitative approach for scalable event stream selection, which narrows down the solution for link analysis and for the optimal and alternative best models. It also provides efficient analysis of relationships between data streams, such as correlation, causality and similarity, to identify relevant data sources that can act as an alternative data source or complement the analytics process. Ville intelligente Apprentissage automatique Intelligence artificielle Apprentissage statistique Médias sociaux Récupération de l'information Smart City Machine Learning Artificial Intelligence Statistical learning Social Media Information Retrieval 004
99	Application de l'Analyse en Composantes Principales pour étudier l'adaptation biologique en génomique des populations / Application of Principal Component Analysis to study biological adaptation in population genomics Luu, Keurcien 21 December 2017 (has links) L'identification de gènes ayant permis à des populations de s'adapter à leur environnement local constitue une des problématiques majeures du domaine de la génétique des populations. Les méthodes statistiques actuelles répondant à cette problématique ne sont plus adaptées aux données de séquençage nouvelle génération (NGS). Nous proposons dans cette thèse de nouvelles statistiques adaptées à ces nouveaux volumes de données, destinées à la détection de gènes sous sélection. Nos méthodes reposent exclusivement sur l'Analyse en Composantes Principales, dont nous justifierons l'utilisation en génétique des populations. Nous expliquerons également les raisons pour lesquelles nos approches généralisent les méthodes statistiques existantes et démontrons l'intérêt d'utiliser une approche basée sur l'Analyse en Composantes Principales en comparant nos méthodes à celles de l'état de l'art. Notre travail a notamment abouti au développement de pcadapt, une librairie R permettant l'utilisation de nos statistiques de détection sur des données génétiques variées. / Identifying genes involved in local adaptation is of major interest in population genetics. Current statistical methods for genome scans are no longer suited to the analysis of Next Generation Sequencing (NGS) data. We propose new statistical methods to perform genome scans on massive datasets. Our methods rely exclusively on Principal Component Analysis, whose use in population genetics is discussed extensively. We also explain why our approaches can be seen as extensions of existing methods and demonstrate how our PCA-based statistics compare with state-of-the-art methods. 
Our work has led to the development of pcadapt, an R package designed for outlier detection for various genetic data. Génétique des populations Machine Learning Apprentissage statistique Séquençage nouvelle génération Bio-Informatique Population Genetics Machine Learning Statistical Learning Next-Generation Sequencing Bioinformatics 004 570 510
100	Camera-Based Friction Estimation with Deep Convolutional Neural Networks Jonnarth, Arvi January 2018 (has links) During recent years, great progress has been made within the field of deep learning, and more specifically, within neural networks. Deep convolutional neural networks (CNNs) have been especially successful within image processing in tasks such as image classification and object detection. Car manufacturers, amongst other actors, are starting to realize the potential of deep learning and have begun applying it to autonomous driving. This is not a simple task, and many challenges still lie ahead. One sub-problem that needs to be solved is automatically determining the road conditions, including the friction. Since many modern cars are equipped with cameras these days, it is only natural to approach this problem with CNNs. This is what has been done in this thesis. First, a data set is gathered which consists of 37,000 labeled road images that are taken through the front window of a car. Second, CNNs are trained on this data set to classify the friction of a given road. Gathering road images and labeling them with the correct friction is a time-consuming and difficult process, and requires human supervision. For this reason, experiments are made on a second data set, which consists of 54,000 simulated images. These images are captured from the racing game World Rally Championship 7 and are used in addition to the real images, to investigate what can be gained from this. Experiments conducted during this thesis show that CNNs are a good approach for the problem of estimating the road friction. The limiting factor, however, is the data set. Not only does the data set need to be much bigger, but it also has to include a much wider variety of driving conditions. Friction is a complex property and depends on many variables, and CNNs are only effective on the type of data that they have been trained on. 
For these reasons, new data has to be gathered by actively seeking different driving conditions in order for this approach to be deployable in practice. / Under de senaste åren har det gjorts stora framsteg inom maskininlärning, särskilt gällande neurala nätverk. Djupa neurala nätverk med faltningslager, eller faltningsnätverk (eng. convolutional neural network) har framför allt varit framgångsrika inom bildbehandling i problem så som bildklassificering och objektdetektering. Biltillverkare, bland andra aktörer, har nu börjat att inse potentialen av maskininlärning och påbörjat dess tillämpning inom autonom körning. Detta är ingen enkel uppgift och många utmaningar finns fortfarande framöver. Ett delproblem som måste lösas är ett sätt att automatiskt avgöra väglaget, där friktionen ingår. Eftersom många nya bilar är utrustade med kameror är det naturligt att försöka tackla detta problem med faltningsnätverk, vilket är varför detta har gjorts under detta examensarbete. Först samlar vi in en datamängd beståendes av 37 000 bilder tagna på vägar genom framrutan av en bil. Dessa bilder kategoriseras efter friktionen på vägen. Sedan tränar vi faltningsnätverk på denna datamängd för att klassificera friktionen. Att samla in vägbilder och att kategorisera dessa är en tidskrävande och svår process och kräver mänsklig övervakning. Av denna anledning utförs experiment på en andra datamängd beståendes av 54 000 simulerade bilder. Dessa har blivit insamlade genom spelet World Rally Championship 7 där syftet är att undersöka om prestandan på nätverken kan ökas genom simulerat data och därmed minska kravet på storleken av den riktiga datamängden. De experiment som har utförts under examensarbetet visar på att faltningsnätverk är ett bra tillvägagångssätt för att skatta vägfriktionen. Den begränsande faktorn i det här fallet är datamängden. Datamängden behöver inte bara vara större, men den måste framför allt täcka in ett bredare urval av väglag och väderförhållanden. 
Friktion är en komplex egenskap och beror på många variabler, och faltningsnätverk är endast effektiva på den typen av data som de har tränats på. Av dessa anledningar behöver ny data samlas in genom att aktivt söka efter nya körförhållanden om detta tillvägagångssätt ska vara tillämpbart i praktiken. Machine Learning Deep Learning Statistical Learning Friction Estimation Computer Vision Neural Networks Convolutional Neural Networks Digital Image Processing Computer and Information Sciences Data- och informationsvetenskap
