Στοχαστικός (γραμμικός) προγραμματισμός / Stochastic (linear) programming
Μαγουλά, Ναταλία 07 April 2011
Πολλά είναι τα προβλήματα απόφασης τα οποία μπορούν να μοντελοποιηθούν ως προβλήματα γραμμικού προγραμματισμού. Πολλές όμως είναι και οι καταστάσεις όπου δεν είναι λογικό να υποτεθεί ότι οι παράμετροι του μοντέλου καθορίζονται προσδιοριστικά. Για παράδειγμα, μελλοντικές παραγωγικότητες σε ένα πρόβλημα παραγωγής, εισροές σε μία δεξαμενή που συνδέεται με έναν υδροσταθμό παραγωγής ηλεκτρικού ρεύματος, απαιτήσεις στους διάφορους κόμβους σε ένα δίκτυο μεταφορών κλπ, είναι καταλληλότερα μοντελοποιημένες ως αβέβαιες παράμετροι, οι οποίες χαρακτηρίζονται στην καλύτερη περίπτωση από τις κατανομές πιθανότητας.
Η αβεβαιότητα γύρω από τις πραγματοποιημένες τιμές εκείνων των παραμέτρων δεν μπορεί να εξαλειφθεί πάντα εξαιτίας της εισαγωγής των μέσων τιμών τους ή μερικών άλλων (σταθερών) εκτιμήσεων κατά τη διάρκεια της διαδικασίας μοντελοποίησης. Δηλαδή ανάλογα με την υπό μελέτη κατάσταση, το γραμμικό προσδιοριστικό μοντέλο μπορεί να μην είναι το κατάλληλο μοντέλο για την περιγραφή του προβλήματος που θέλουμε να λύσουμε. Σε αυτή τη διπλωματική υπογραμμίζουμε την ανάγκη να διευρυνθεί το πεδίο της μοντελοποίησης των προβλημάτων απόφασης που παρουσιάζονται στην πραγματική ζωή με την εισαγωγή του στοχαστικού προγραμματισμού. / There are many practical decision problems that can be modeled as linear programs. However, there are also many situations in which it is unreasonable to assume that the coefficients of the model are deterministically fixed. For instance, future productivities in a production problem, inflows into a reservoir connected to a hydro power station, demands at various nodes in a transportation network, and so on, are often more appropriately modeled as uncertain parameters, which are at best characterized by probability distributions.
The uncertainty about the realized values of those parameters cannot always be eliminated simply by inserting their mean values or some other (fixed) estimates during the modelling process. That is, depending on the practical situation under consideration, the deterministic linear model may not be the appropriate model for describing the problem we want to solve. In this project we emphasize the need to broaden the scope of modelling real-life decision problems by introducing stochastic programming.
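The point that mean values cannot always stand in for uncertain parameters can be illustrated with a toy production problem. The sketch below is not taken from the thesis: the scenario data, `UNIT_COST`, `UNIT_PRICE` and all function names are assumptions chosen for illustration. It compares the decision obtained by plugging in the mean demand with the decision that maximizes expected profit over the demand scenarios.

```python
# Toy illustration (all numbers are hypothetical, not from the thesis).
# Decide a production quantity x before demand is known.
# Profit in a scenario with demand d: UNIT_PRICE * min(x, d) - UNIT_COST * x.

SCENARIOS = [(50.0, 0.3), (100.0, 0.4), (150.0, 0.3)]  # (demand, probability)
UNIT_COST = 8.0    # assumed cost per unit produced
UNIT_PRICE = 10.0  # assumed selling price per unit


def expected_profit(x, scenarios=SCENARIOS):
    """Expected profit of producing x units over all demand scenarios."""
    return sum(p * (UNIT_PRICE * min(x, d) - UNIT_COST * x) for d, p in scenarios)


def mean_value_decision(scenarios=SCENARIOS):
    """Deterministic model: replace demand by its mean and produce exactly that."""
    return sum(p * d for d, p in scenarios)


def stochastic_decision(scenarios=SCENARIOS):
    """Maximize expected profit.

    Expected profit is piecewise linear and concave in x, so a maximizer
    lies at zero or at one of the scenario demand levels.
    """
    candidates = [0.0] + [d for d, _ in scenarios]
    return max(candidates, key=lambda x: expected_profit(x, scenarios))


if __name__ == "__main__":
    x_mean = mean_value_decision()
    x_stoch = stochastic_decision()
    print("mean-value decision:", x_mean, "expected profit:", expected_profit(x_mean))
    print("stochastic decision:", x_stoch, "expected profit:", expected_profit(x_stoch))
```

With these assumed numbers the mean-value model produces more than the scenario-based model and earns a lower expected profit, which is the kind of gap that motivates a stochastic formulation.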
Política antitruste e sua consistência: uma análise das decisões do Sistema Brasileiro de Defesa da Concorrência relativas aos Atos de Concentração / An analysis of the Brazilian Antitrust Policy Consistency
Cardoso, Diego Soares 20 May 2013
Financiadora de Estudos e Projetos / The goal of competition policy, also known as antitrust policy, is to promote welfare and economic efficiency by preserving fair competition in markets. Merger control is one of the main responsibilities of antitrust institutions. Prohibitions and restrictions of merger operations affect market structures, thus making these decisions relevant to economic agents. This Master's thesis analyzes the decisions made by Brazilian antitrust institutions regarding merger processes. Data were collected from public documents issued from 2004 to 2011. Bivariate analysis, discrete choice models and classification decision trees show that these merger control decisions are consistent with Brazilian antitrust law. Consistent competition policy reduces uncertainty, aligns expectations and increases the efficiency of antitrust law enforcement. Therefore, this research contributes to a better understanding of Brazilian competition policy related to merger control and its decision drivers. / As políticas de defesa da concorrência, ou políticas antitruste, visam ao maior bem-estar social por meio da manutenção de ambientes concorrenciais que promovam a eficiência econômica. No Brasil, os órgãos que compõem o Sistema Brasileiro de Defesa da Concorrência são os responsáveis pelas decisões sobre os agentes econômicos a fim de atingir os objetivos das políticas antitruste. Nesse âmbito, as decisões que influenciam a estrutura de mercados por meio das restrições e vetos a processos como fusões e aquisições de empresas - os julgamentos de Atos de Concentração - apresentam elevada relevância. Este trabalho realiza uma avaliação das decisões do Sistema Brasileiro de Defesa da Concorrência relativas aos Atos de Concentração. Para tal, foram coletados dados a partir dos documentos públicos emitidos pelos órgãos antitruste no período entre 2004 e 2011. 
Por meio da aplicação de modelos de regressão de escolha discreta e árvores de decisão induzidas, verificou-se que tais decisões são consistentes com as regras antitruste brasileiras. A consistência com regras estabelecidas possibilita uma maior eficiência na aplicação das políticas de defesa da concorrência, uma vez que reduz as incertezas dos agentes econômicos, alinha as expectativas e facilita a condução dos processos. Nesse sentido, esta investigação contribui para uma melhor compreensão dos fatores que influenciam as decisões dos órgãos brasileiros de defesa da concorrência, oferecendo também indicativos que auxiliam na verificação da eficiência da aplicação de tais políticas.
Forêts uniformément aléatoires et détection des irrégularités aux cotisations sociales / Detection of irregularities in social contributions using random uniform forests
Ciss, Saïp 20 June 2014
Nous présentons dans cette thèse une application de l'apprentissage statistique à la détection des irrégularités aux cotisations sociales. L'apprentissage statistique a pour but de modéliser des problèmes dans lesquels il existe une relation, généralement non déterministe, entre des variables et le phénomène que l'on cherche à évaluer. Un aspect essentiel de cette modélisation est la prédiction des occurrences inconnues du phénomène, à partir des données déjà observées. Dans le cas des cotisations sociales, la représentation du problème s'exprime par le postulat de l'existence d'une relation entre les déclarations de cotisation des entreprises et les contrôles effectués par les organismes de recouvrement. Les inspecteurs du contrôle certifient le caractère exact ou inexact d'un certain nombre de déclarations et notifient, le cas échéant, un redressement aux entreprises concernées. L'algorithme d'apprentissage "apprend", grâce à un modèle, la relation entre les déclarations et les résultats des contrôles, puis produit une évaluation de l'ensemble des déclarations non encore contrôlées. La première partie de l'évaluation attribue un caractère régulier ou irrégulier à chaque déclaration, avec une certaine probabilité. La seconde estime les montants de redressement espérés pour chaque déclaration. Au sein de l'URSSAF (Union de Recouvrement des cotisations de Sécurité sociale et d'Allocations Familiales) d'Île-de-France, et dans le cadre d'un contrat CIFRE (Conventions Industrielles de Formation par la Recherche), nous avons développé un modèle de détection des irrégularités aux cotisations sociales que nous présentons et détaillons tout au long de la thèse. L'algorithme fonctionne sous le logiciel libre R. Il est entièrement opérationnel et a été expérimenté en situation réelle durant l'année 2012. Pour garantir ses propriétés et résultats, des outils probabilistes et statistiques sont nécessaires et nous discutons des aspects théoriques ayant accompagné sa conception. 
Dans la première partie de la thèse, nous effectuons une présentation générale du problème de la détection des irrégularités aux cotisations sociales. Dans la seconde, nous abordons la détection spécifiquement, à travers les données utilisées pour définir et évaluer les irrégularités. En particulier, les seules données disponibles suffisent à modéliser la détection. Nous y présentons également un nouvel algorithme de forêts aléatoires, nommé "forêt uniformément aléatoire", qui constitue le moteur de détection. Dans la troisième partie, nous détaillons les propriétés théoriques des forêts uniformément aléatoires. Dans la quatrième, nous présentons un point de vue économique, lorsque les irrégularités aux cotisations sociales ont un caractère volontaire, cela dans le cadre de la lutte contre le travail dissimulé. En particulier, nous nous intéressons au lien entre la situation financière des entreprises et la fraude aux cotisations sociales. La dernière partie est consacrée aux résultats expérimentaux et réels du modèle, dont nous discutons. Chacun des chapitres de la thèse peut être lu indépendamment des autres et quelques notions sont redondantes afin de faciliter l'exploration du contenu. / We present in this thesis an application of machine learning to the detection of irregularities in social contributions. In France, these are all the contributions owed by employees and companies to the "Sécurité sociale", the French social welfare system (replacement income in case of unemployment, health insurance, pensions, ...). Social contributions are paid by companies to the URSSAF network, which is in charge of collecting them. Our main goal was to build a model able to detect irregularities with a low false positive rate. We begin the thesis by presenting the URSSAF, how irregularities can arise, how we can handle them and what data we can use. 
We then describe a new machine learning algorithm that we developed for this purpose, "random uniform forests" (and its R package "randomUniformForest"), a variant of Breiman's "Random Forests" (tm) that shares the same principles but applies them in a different way. We present the theoretical background of the model and provide several examples. We then use it to show, when irregularities are fraudulent, how the financial situation of firms can affect their propensity for fraud. In the last chapter, we provide a full evaluation of the social contribution declarations of all firms in Île-de-France for the year 2013, using the model to predict whether declarations present irregularities.
Vision-based moving pedestrian recognition from imprecise and uncertain data / Reconnaissance de piétons par vision à partir de données imprécises et incertaines
Zhou, Dingfu 05 December 2014
La mise en oeuvre de systèmes avancés d’aide à la conduite (ADAS) basée vision, est une tâche complexe et difficile surtout d’un point de vue robustesse en conditions d’utilisation réelles. Une des fonctionnalités des ADAS vise à percevoir et à comprendre l’environnement de l’ego-véhicule et à fournir l’assistance nécessaire au conducteur pour réagir à des situations d’urgence. Dans cette thèse, nous nous concentrons sur la détection et la reconnaissance des objets mobiles car leur dynamique les rend plus imprévisibles et donc plus dangereux. La détection de ces objets, l’estimation de leurs positions et la reconnaissance de leurs catégories sont importants pour les ADAS et la navigation autonome. Par conséquent, nous proposons de construire un système complet pour la détection des objets en mouvement et la reconnaissance basées uniquement sur les capteurs de vision. L’approche proposée permet de détecter tout type d’objets en mouvement en fonction de deux méthodes complémentaires. L’idée de base est de détecter les objets mobiles par stéréovision en utilisant l’image résiduelle du mouvement apparent (RIMF). La RIMF est définie comme l’image du mouvement apparent causé par le déplacement des objets mobiles lorsque le mouvement de la caméra a été compensé. Afin de détecter tous les mouvements de manière robuste et de supprimer les faux positifs, les incertitudes liées à l’estimation de l’ego-mouvement et au calcul de la disparité doivent être considérées. Les étapes principales de l’algorithme sont les suivantes : premièrement, la pose relative de la caméra est estimée en minimisant la somme des erreurs de reprojection des points d’intérêt appariées et la matrice de covariance est alors calculée en utilisant une stratégie de propagation d’erreurs de premier ordre. Ensuite, une vraisemblance de mouvement est calculée pour chaque pixel en propageant les incertitudes sur l’ego-mouvement et la disparité par rapport à la RIMF. 
Enfin, la probabilité de mouvement et le gradient de profondeur sont utilisés pour minimiser une fonctionnelle d’énergie de manière à obtenir la segmentation des objets en mouvement. Dans le même temps, les boîtes englobantes des objets mobiles sont générées en utilisant la carte des U-disparités. Après avoir obtenu la boîte englobante de l’objet en mouvement, nous cherchons à reconnaître si l’objet en mouvement est un piéton ou pas. Par rapport aux algorithmes de classification supervisée (comme le boosting et les SVM) qui nécessitent un grand nombre d’exemples d’apprentissage étiquetés, notre algorithme de boosting semi-supervisé est entraîné avec seulement quelques exemples étiquetés et de nombreuses instances non étiquetées. Les exemples étiquetés sont d’abord utilisés pour estimer les probabilités d’appartenance aux classes des exemples non étiquetés, et ce à l’aide de modèles de mélange de gaussiennes après une étape de réduction de dimension réalisée par une analyse en composantes principales. Ensuite, nous appliquons une stratégie de boosting sur des arbres de décision entraînés à l’aide des instances étiquetées de manière probabiliste. Les performances de la méthode proposée sont évaluées sur plusieurs jeux de données de classification de référence, ainsi que sur la détection et la reconnaissance des piétons. Enfin, l’algorithme de détection et de reconnaissances des objets en mouvement est testé sur les images du jeu de données KITTI et les résultats expérimentaux montrent que les méthodes proposées obtiennent de bonnes performances dans différents scénarios de conduite en milieu urbain. / Building vision-based Advanced Driver Assistance Systems (ADAS) is a complex and challenging task in real-world traffic scenarios. ADAS aim to perceive and understand the environment surrounding the ego-vehicle and to provide the driver with the assistance needed to react to emergencies. 
In this thesis, we focus only on detecting and recognizing moving objects because they are more dangerous than static ones. Detecting these objects, estimating their positions and recognizing their categories are important for ADAS and autonomous navigation. Consequently, we propose to build a complete system for moving object detection and recognition based on vision sensors. The proposed approach can detect any kind of moving object based on two adjacent frames only. The core idea is to detect the moving pixels by using the Residual Image Motion Flow (RIMF). The RIMF is defined as the residual image changes caused by moving objects once the camera motion has been compensated. In order to robustly detect all kinds of motion and remove false positive detections, uncertainties in the ego-motion estimation and disparity computation should also be considered. The main steps of our general algorithm are the following: first, the relative camera pose is estimated by minimizing the sum of the reprojection errors of matched features, and its covariance matrix is calculated using a first-order error propagation strategy. Next, a motion likelihood for each pixel is obtained by propagating the uncertainties of the ego-motion and disparity to the RIMF. Finally, the motion likelihood and the depth gradient are used in a graph-cut-based approach to obtain the segmentation of the moving objects. At the same time, the bounding boxes of the moving objects are generated based on the U-disparity map. After obtaining the bounding box of a moving object, we want to classify the object as a pedestrian or not. Compared to supervised classification algorithms (such as boosting and SVM), which require a large number of labeled training instances, our proposed semi-supervised boosting algorithm is trained with only a few labeled instances and many unlabeled instances. 
First, labeled instances are used to estimate the probabilistic class labels of the unlabeled instances using Gaussian Mixture Models after a dimension reduction step performed via Principal Component Analysis. Then, we apply a boosting strategy on decision stumps trained using the resulting soft-labeled instances. The performance of the proposed method is evaluated on several state-of-the-art classification datasets, as well as on a pedestrian detection and recognition problem. Finally, both our moving object detection and recognition algorithms are tested on the public image dataset KITTI, and the experimental results show that the proposed methods achieve good performance in different urban scenarios.
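The first-order error propagation step mentioned in this abstract can be written generically as follows. This is the standard linearized rule, given only as an illustration; the symbols $f$, $x$, $\hat{x}$, $\Sigma_x$ and $J$ are notation assumed here and do not come from the thesis.

```latex
\[
y = f(x), \qquad
\Sigma_y \;\approx\; J \,\Sigma_x\, J^{\top}, \qquad
J = \left.\frac{\partial f}{\partial x}\right|_{x=\hat{x}}
\]
```

In this reading, $x$ would collect the uncertain inputs (for example ego-motion parameters and disparity) and $y$ a derived quantity such as the residual flow at a pixel, whose covariance $\Sigma_y$ can then feed a per-pixel motion likelihood.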
Modelação e análise da vida útil (metrológica) de medidores tipo indução de energia elétrica ativa / Modeling and analysis of the (metrological) service life of induction-type active electric energy meters
Silva, Marcelo Rubia da [UNESP] 27 August 2010
Hodnocení strategických záměrů za podmínek rizika - rozhodování o investování prostřednictvím aparátu rozhodovacích stromů / Classification of strategical plans under conditions of the risk - decision making of investment by the apparatus of decision trees
JÍCHOVÁ, Romana January 2009
In my thesis I dealt with capital investment decision making, with methods for evaluating investments, and with decision making under risk and uncertainty. The aim of the thesis was to apply mathematical methods to the selection among investment options. The main task was to show the possibility of using decision trees, which are graphical instruments for describing the actions available to the decision maker. The practical part describes the process of building a decision tree on the example of the sale of real estate and on the example of oil extraction.
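How a decision tree supports a choice under risk can be sketched by "rolling back" expected values from the leaves to the root. The example below is only an illustration under assumed data: the node layout, the option names and the payoffs are hypothetical and are not taken from the thesis, and maximizing expected value is just one possible decision criterion.

```python
# Minimal expected-value rollback on a decision tree (hypothetical data).
# Node kinds:
#   {"type": "outcome", "value": v}
#   {"type": "chance", "branches": [(probability, child), ...]}
#   {"type": "decision", "options": [(name, child), ...]}


def rollback(node):
    """Return the expected value of a node, choosing the best option at decisions."""
    kind = node["type"]
    if kind == "outcome":
        return node["value"]
    if kind == "chance":
        return sum(p * rollback(child) for p, child in node["branches"])
    if kind == "decision":
        return max(rollback(child) for _, child in node["options"])
    raise ValueError("unknown node type: %r" % kind)


def best_option(decision_node):
    """Return (name, expected value) of the best option at a decision node."""
    name, child = max(decision_node["options"], key=lambda opt: rollback(opt[1]))
    return name, rollback(child)


# Assumed example: sell a property now, or renovate first and sell later.
TREE = {
    "type": "decision",
    "options": [
        ("sell now", {"type": "outcome", "value": 100.0}),
        ("renovate", {
            "type": "chance",
            "branches": [
                (0.6, {"type": "outcome", "value": 150.0}),  # market holds
                (0.4, {"type": "outcome", "value": 60.0}),   # market drops
            ],
        }),
    ],
}

if __name__ == "__main__":
    print(best_option(TREE))
```

A risk-averse decision maker might replace the expected value at chance nodes with a utility-based score; the tree structure and the rollback stay the same.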
Contributions to decision tree based learning / Contributions à l’apprentissage de l’arbre des décisions
Qureshi, Taimur 08 July 2010
Advances in data collection methods, storage and processing technology are providing a unique challenge and opportunity for automated data learning techniques which aim at producing high-level information, or models, from data. A typical knowledge discovery process consists of data selection, data preparation, data transformation, data mining and interpretation/validation of the results. Thus, we develop automatic learning techniques which contribute to the data preparation, transformation and mining tasks of knowledge discovery. In doing so, we try to improve the prediction accuracy of the overall learning process. Our work focuses on decision tree based learning and thus, we introduce various preprocessing and transformation techniques such as discretization, fuzzy partitioning and dimensionality reduction to improve this type of learning. However, these techniques can be used in other learning methods, e.g. discretization can also be used for naive Bayes classifiers. The data preparation step represents almost 80 percent of the problem and is both time consuming and critical for the quality of modeling. Discretization of continuous features is an important problem that has effects on accuracy, complexity, variance and understandability of the induction models. In this thesis, we propose and develop resampling based aggregation techniques that improve the quality of discretization. We then validate them by comparing with other discretization techniques and with an optimal partitioning method on 10 benchmark data sets. The second part of our thesis is concerned with automatic fuzzy partitioning for soft decision tree induction. A soft or fuzzy decision tree is an extension of classical crisp tree induction in which fuzzy logic is embedded into the induction process, with the effect of more accurate models and reduced variance, while remaining interpretable and autonomous. We modify the above resampling based partitioning method to generate fuzzy partitions. 
In addition, we propose, develop and validate another fuzzy partitioning method that improves the accuracy of the decision tree. Finally, we adopt a topological learning scheme and perform non-linear dimensionality reduction. We modify an existing manifold learning based technique and see whether it can enhance the predictive power and interpretability of classification. / La recherche avancée dans les méthodes d'acquisition de données ainsi que les méthodes de stockage et les technologies d'apprentissage, s'attaquent défi d'automatiser de manière systématique les techniques d'apprentissage de données en vue d'extraire des connaissances valides et utilisables. La procédure de découverte de connaissances s'effectue selon les étapes suivants: la sélection des données, la préparation de ces données, leurs transformation, le fouille de données et finalement l'interprétation et validation des résultats trouvés. Dans ce travail de thèse, nous avons développé des techniques qui contribuent à la préparation et la transformation des données ainsi qu'a des méthodes de fouille des données pour extraire les connaissances. A travers ces travaux, on a essayé d'améliorer l'exactitude de la prédiction durant tout le processus d'apprentissage. Les travaux de cette thèse se basent sur les arbres de décision. On a alors introduit plusieurs approches de prétraitement et des techniques de transformation; comme le discrétisation, le partitionnement flou et la réduction des dimensions afin d'améliorer les performances des arbres de décision. Cependant, ces techniques peuvent être utilisées dans d'autres méthodes d'apprentissage comme la discrétisation qui peut être utilisées pour la classification bayesienne. Dans le processus de fouille de données, la phase de préparation de données occupe généralement 80 percent du temps. En autre, elle est critique pour la qualité de la modélisation. 
La discrétisation des attributs continus demeure ainsi un problème très important qui affecte la précision, la complexité, la variance et la compréhension des modèles d'induction. Dans cette thèse, nous avons proposes et développé des techniques qui ce basent sur le ré-échantillonnage. Nous avons également étudié d'autres alternatives comme le partitionnement flou pour une induction floue des arbres de décision. Ainsi la logique floue est incorporée dans le processus d'induction pour augmenter la précision des modèles et réduire la variance, en maintenant l'interprétabilité. Finalement, nous adoptons un schéma d'apprentissage topologique qui vise à effectuer une réduction de dimensions non-linéaire. Nous modifions une technique d'apprentissage à base de variété topologiques `manifolds' pour savoir si on peut augmenter la précision et l'interprétabilité de la classification.
Supervised Learning of Piecewise Linear Models
Manwani, Naresh January 2012
Supervised learning of piecewise linear models is a well studied problem in the machine learning community. The key idea in piecewise linear modeling is to properly partition the input space and learn a linear model for every partition. Decision trees and regression trees are classic examples of piecewise linear models for classification and regression problems.
The existing approaches for learning decision/regression trees can be broadly classified into two classes, namely, fixed structure approaches and greedy approaches. In the fixed structure approaches, the tree structure is fixed beforehand by fixing the number of non-leaf nodes, the height of the tree and the paths from the root node to every leaf node. Mixture of experts and hierarchical mixture of experts are examples of fixed structure approaches for learning piecewise linear models. Parameters of the models are found using, e.g., maximum likelihood estimation, for which the expectation maximization (EM) algorithm can be used. Fixed structure piecewise linear models can also be learnt using risk minimization under an appropriate loss function. Learning an optimal decision tree using a fixed structure approach is a hard problem: constructing an optimal binary decision tree is known to be NP-complete. On the other hand, greedy approaches do not assume any parametric form or any fixed structure for the decision tree classifier. Most of the greedy approaches learn tree structured piecewise linear models in a top-down fashion. These are built by binary or multi-way recursive partitioning of the input space. The main issue in top-down decision tree induction is to choose an appropriate objective function to rate the split rules. The objective function should be easy to optimize. Top-down decision trees are easy to implement and understand, but there are no optimality guarantees due to their greedy nature. Regression trees are built in a similar way to decision trees. In regression trees, every leaf node is associated with a linear regression function.
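The role of the objective function that rates split rules in top-down induction can be made concrete with a small sketch. The code below scores axis-parallel splits with the Gini impurity, a common textbook choice; it is an illustration under that assumption and not the objective function proposed in the thesis. The function names `gini` and `best_split` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure or empty node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Greedy search for the axis-parallel split with the lowest weighted impurity.

    rows   : list of equal-length feature tuples
    labels : list of class labels, one per row
    Returns (weighted impurity, feature index, threshold), or None if no
    feature takes more than one distinct value.
    """
    best = None
    n = len(rows)
    for feature in range(len(rows[0])):
        values = sorted(set(row[feature] for row in rows))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2.0
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best


if __name__ == "__main__":
    rows = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (10.0, 2.0), (11.0, 5.0), (12.0, 1.0)]
    labels = [0, 0, 0, 1, 1, 1]
    print(best_split(rows, labels))
```

A full top-down learner would apply `best_split` recursively to the left and right subsets until a stopping rule is met; each choice is locally optimal only, which is why no global optimality guarantee follows.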
All piecewise linear modeling techniques deal with two main tasks, namely, partitioning the input space and learning a linear model for every partition. However, partitioning the input space and learning linear models for the different partitions are not independent problems. Simultaneous optimal estimation of the partitions and of the linear model for every partition is a combinatorial problem and hence computationally hard. Nevertheless, piecewise linear models provide better insights into the classification or regression problem by giving an explicit representation of the structure in the data. The information captured by piecewise linear models can be summarized in terms of simple rules, so that they can be used to analyze the properties of the domain from which the data originates. These properties make piecewise linear models, like decision trees and regression trees, extremely useful in many data mining applications and place them among the top data mining algorithms.
In this thesis, we address the problem of supervised learning of piecewise linear models for classification and regression. We propose novel algorithms for learning piecewise linear classifiers and regression functions. We also address the problem of noise tolerant learning of classifiers in the presence of label noise.
We propose a novel algorithm for learning polyhedral classifiers, which are the simplest form of piecewise linear classifiers. Polyhedral classifiers are useful when points of the positive class fall inside a convex region and all the negative class points are distributed outside the convex region. Then the region of the positive class can be well approximated by a simple polyhedral set. The key challenge in optimally learning a fixed structure polyhedral classifier is to identify subproblems, where each subproblem is a linear classification problem. This is a hard problem, and identifying polyhedral separability is known to be NP-complete. The goal of any polyhedral learning algorithm is to efficiently handle the underlying combinatorial problem while achieving good classification accuracy. Existing methods for learning a fixed structure polyhedral classifier are based on solving non-convex constrained optimization problems. These approaches do not efficiently handle the combinatorial aspect of the problem and are computationally expensive. We propose a method of model based estimation of posterior class probability to learn polyhedral classifiers. We solve an unconstrained optimization problem using a simple two-step algorithm (similar to the EM algorithm) to find the model parameters. To the best of our knowledge, this is the first attempt to form an unconstrained optimization problem for learning polyhedral classifiers. We then modify our algorithm to also find the number of required hyperplanes automatically. We experimentally show that our approach is better than the existing polyhedral learning algorithms in terms of training time, performance and complexity.
Most often, class conditional densities are multimodal. In such cases, each class region may be represented as a union of polyhedral regions and hence a single polyhedral classifier is not sufficient. To handle such situations, a generic decision tree is required. Learning an optimal fixed structure decision tree is a computationally hard problem. On the other hand, top-down decision trees have no optimality guarantees due to their greedy nature. However, top-down decision tree approaches are widely used as they are versatile and easy to implement. Most of the existing top-down decision tree algorithms (CART, OC1, C4.5, etc.) use impurity measures to assess the goodness of hyperplanes at each node of the tree. These measures do not properly capture the geometric structures in the data. We propose a novel decision tree algorithm that, at each node, selects hyperplanes based on an objective function which takes into consideration the geometric structure of the class regions. The resulting optimization problem turns out to be a generalized eigenvalue problem and hence is efficiently solved. We show through empirical studies that our approach leads to smaller trees and better performance compared to other top-down decision tree approaches. We also provide some theoretical justification for the proposed method of learning decision trees.
Piecewise linear regression is similar to the corresponding classification problem. For example, in regression trees, each leaf node is associated with a linear regression model. Thus the problem is once again that of (simultaneous) estimation of optimal partitions and learning a linear model for each partition. Regression trees, the hinge hyperplane method and mixture of experts are some of the approaches to learn continuous piecewise linear regression models. Many of these algorithms are computationally intensive. We present a method of learning a piecewise linear regression model which is computationally simple and is capable of learning discontinuous functions as well. The method is based on the idea of K-plane regression, which can identify a set of linear models given the training data. K-plane regression is a simple algorithm motivated by the philosophy of k-means clustering. However, this simple algorithm has several problems. It does not give a model function with which we can predict the target value for any given input. Also, it is very sensitive to noise. We propose a modified K-plane regression algorithm which can learn continuous as well as discontinuous functions. The proposed algorithm still retains the spirit of the k-means algorithm, and after every iteration it improves the objective function. The proposed method learns a proper piecewise linear model that can be used for prediction. The algorithm is also more robust to additive noise than K-plane regression.
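The k-means-like alternation behind basic K-plane regression can be sketched as follows. This is a minimal version for one-dimensional inputs, written under several assumptions: it is the plain algorithm, not the modified one proposed in the thesis, the initialization by contiguous chunks of sorted points is an arbitrary choice, and the names `fit_line` and `k_plane_regression` are hypothetical.

```python
def fit_line(points):
    """Ordinary least squares fit of y = a*x + b to a non-empty list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0.0:
        # All x equal: fall back to a horizontal line through the mean.
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def k_plane_regression(points, k, iterations=20):
    """Alternate between assigning points to lines and refitting the lines.

    Returns a list of k (slope, intercept) pairs.
    """
    n = len(points)
    if n < k:
        raise ValueError("need at least k points")
    ordered = sorted(points)
    # Assumed initialization: fit one line per contiguous chunk of sorted points.
    lines = [fit_line(ordered[i * n // k:(i + 1) * n // k]) for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for x, y in points:
            # Assignment step: the line with the smallest squared residual wins.
            j = min(range(k), key=lambda i: (y - (lines[i][0] * x + lines[i][1])) ** 2)
            groups[j].append((x, y))
        # Update step: refit each line; keep the old line if its group is empty.
        new_lines = [fit_line(g) if g else lines[i] for i, g in enumerate(groups)]
        if new_lines == lines:
            break
        lines = new_lines
    return lines


if __name__ == "__main__":
    data = [(float(x), 2.0 * x + 1.0) for x in range(5)]
    data += [(float(x), -1.0 * x + 10.0) for x in range(5, 10)]
    print(k_plane_regression(data, k=2))
```

Note that the result is only a set of lines: nothing here says which line to use for a new input, which is the limitation of the basic algorithm that the abstract points out.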
While learning classifiers, one normally assumes that the class labels in the training data set are noise free. However, in many applications like spam filtering, text classification etc., the training data can be mislabeled due to subjective errors. In such cases, the standard learning algorithms (SVM, AdaBoost, decision trees etc.) start overfitting on the noisy points, which leads to poor test accuracy. Thus, analyzing the vulnerabilities of classifiers to label noise has recently attracted growing interest from the machine learning community. The existing noise tolerant learning approaches first try to identify the noisy points and then learn a classifier on the remaining points. In this thesis, we address the issue of developing learning algorithms which are inherently noise tolerant. An algorithm is inherently noise tolerant if the classifier it learns with noisy samples has the same performance on test data as the one learnt from noise free samples. Algorithms having such robustness (under suitable assumptions on the noise) are attractive for learning with noisy samples. Here, we consider non-uniform label noise, which is a generic noise model. In non-uniform label noise, the probability of the class label for an example being incorrect is a function of the feature vector of the example. (We assume that this probability is less than 0.5 for all feature vectors.) This can account for most cases of noisy data sets. There is no provably optimal algorithm for learning noise tolerant classifiers in the presence of non-uniform label noise. We propose a novel characterization of the noise tolerance of an algorithm. We analyze the noise tolerance properties of the risk minimization framework, as risk minimization is a common strategy for classifier learning. We show that risk minimization under 0-1 loss has the best noise tolerance properties. None of the convex loss functions considered have such noise tolerance properties. 
Empirical risk minimization under the 0-1 loss is a hard problem, as the 0-1 loss function is not differentiable. We propose a gradient-free stochastic optimization technique to minimize the risk under the 0-1 loss function for noise-tolerant learning of linear classifiers. We show (under some conditions) that the algorithm converges asymptotically to the global minimum of the risk under the 0-1 loss function. We demonstrate the noise tolerance of the algorithm through simulation experiments.
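To make the idea of gradient-free minimization of the 0-1 risk concrete, here is a minimal sketch. The abstract does not specify the thesis's optimization technique, so a plain random local search is swapped in purely for illustration, and it carries no convergence guarantee. The helpers `make_noisy_data`, `zero_one_risk` and `random_search_01`, the noise function and all parameter values are assumptions made for this example.

```python
import numpy as np

def make_noisy_data(n, seed=0):
    """Hypothetical toy data: linearly separable labels with non-uniform flips."""
    rng = np.random.default_rng(seed)
    X = np.hstack([rng.normal(size=(n, 2)), np.ones((n, 1))])  # last column = bias
    w_true = np.array([1.0, -1.0, 0.0])
    y_clean = np.where(X @ w_true >= 0, 1, -1)
    # Flip probability depends on the feature vector and stays below 0.5.
    eta = 0.1 + 0.3 / (1.0 + np.exp(-X[:, 0]))
    flip = rng.random(n) < eta
    y_noisy = np.where(flip, -y_clean, y_clean)
    return X, y_clean, y_noisy

def zero_one_risk(w, X, y):
    """Empirical 0-1 risk of the linear classifier sign(X @ w), labels in {-1, +1}."""
    return float(np.mean(np.sign(X @ w) != y))

def random_search_01(X, y, n_iter=2000, step=0.5, seed=0):
    """Random local search on the unit sphere; uses no gradient information."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)
    best = zero_one_risk(w, X, y)
    for _ in range(n_iter):
        cand = w + step * rng.normal(size=w.shape)   # random perturbation
        cand /= np.linalg.norm(cand)
        risk = zero_one_risk(cand, X, y)
        if risk <= best:                             # keep candidates that are no worse
            w, best = cand, risk
    return w, best

X_demo, y_clean_demo, y_noisy_demo = make_noisy_data(2000)
w_demo, noisy_risk_demo = random_search_01(X_demo, y_noisy_demo)
print("0-1 risk on noisy training labels:", noisy_risk_demo)
print("0-1 risk on clean labels:", zero_one_risk(w_demo, X_demo, y_clean_demo))
```

The classifier is trained only on the noisy labels, and the clean labels are used afterwards to check how much the label noise affected it. Comparing the two printed risks gives a rough empirical picture of the noise tolerance discussed in the abstract.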
Návrh a implementace Data Mining modelu v technologii MS SQL Server / Design and implementation of Data Mining model with MS SQL Server technology
Peroutka, Lukáš, January 2012
This thesis focuses on the design and implementation of a data mining solution with real-world data. The task is analysed and processed, and its results are evaluated. The mined data set contains study records of students from the University of Economics, Prague (VŠE) over the course of the past three years. The first part of the thesis focuses on the theory of data mining, the definition of the term, and the history and development of this particular field. Current best practices and methodology are described, as well as methods for determining the quality of data and methods for data pre-processing ahead of the actual data mining task. The most common data mining techniques are introduced, including their basic concepts, advantages and disadvantages. The theoretical basis is then used to implement a concrete data mining solution with educational data. The source data set is described and analysed, and some of the data are chosen as input for the created models. The solution is based on the MS SQL Server data mining platform, and its goal is to find, describe and analyse potential associations and dependencies in the data. The results of the respective models are evaluated, including their potential added value. Possible extensions and suggestions for further development of the solution are also mentioned.
Rare dileptonic B meson decays at LHCb
Morda, Alessandro, 28 September 2015
The rare B0(s)→ll decays are generated by Flavour Changing Neutral Currents, hence they can proceed only through loop processes. For this reason, and because of an additional helicity suppression, their branching ratios are predicted to be very small in the Standard Model (SM), but the presence of virtual New Physics particles can radically modify this prediction.
A part of the original work presented in this thesis is devoted to the optimization of the Multi Variate Analysis (MVA) classifier for the search for the B0(s)→μμ decay with the full dataset collected at LHCb. This dataset was also combined with the one collected by CMS, and the B0(s)→μμ decay was observed for the first time. In view of the update of the analysis aiming to improve the sensitivity to the B0(s)→μμ mode, a new isolation variable, exploiting a topological vertexing algorithm, has been developed, and additional studies for a further optimization of the MVA classifier performance have been carried out. Another part of the original work presented in this thesis concerns the definition of a selection chain for the search for B0(s)→ττ decays in the final state where both τ decay into three charged π and a ν. The presence of two ν in the final state of the decay makes the reconstruction of the momenta of the two τ difficult. Nevertheless, the possibility of measuring the two decay vertices of the τ, as well as the B candidate production vertex, allows geometrical constraints to be imposed that can be used in the reconstruction of the τ momenta. In particular, a new algorithm for the full event-by-event reconstruction of these momenta and of their related variables is presented and discussed.
