271 |
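Multivariat dataanalys för att undersöka skillnader i undervisnings- och bedömningspraxis i kursen kemi 2 / Multivariate data analysis to investigate differences in teaching and assessment practice in the course Chemistry 2
Larsson, Daniel, January 2018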
Although the research community advocates formative assessment, there is currently very large variation in how formative assessment is implemented in schools and in the effects it has. Methods for mapping formative assessment practice are needed in order to distinguish "good" from "less good" practice. The aim of this thesis was to map and distinguish, using a student questionnaire and multivariate projection methods such as PCA and PLS-DA, the formative assessment practice of six upper secondary school classes that had completed the course Chemistry 2. A secondary aim was to use the same tools to characterize and distinguish the frequencies of the different teaching activities carried out within the same course and classes. The study showed, in a graphical and illustrative way, large variation in how formative assessment was experienced within the surveyed classes. Furthermore, PCA proved to be an excellent tool for identifying student responses that lay outside the "normal" variation. A PLS-DA analysis revealed a difference in the frequencies of teaching activities between two municipal schools and one private school, although these results should be interpreted with some caution.
|
272 |
Kernel methods for flight data monitoring / Méthodes à noyau pour l'analyse de données de vols appliquées aux opérations aériennes
Chrysanthos, Nicolas, 24 October 2014
Flight Data Monitoring (FDM) is the process by which an airline routinely collects, processes, and analyses the data recorded in aircraft with the goal of improving overall safety or operational efficiency. The goal of this thesis is to investigate machine learning methods, and in particular kernel methods, for the detection of atypical flights that may present problems that cannot be found using traditional methods. Atypical flights may present safety or operational issues and thus need to be studied by an FDM expert. In the first part, we propose a novel method for anomaly detection that is suited to the constraints of the field of FDM. We rely on a novel dimensionality reduction technique called kernel entropy component analysis to design a method which is both unsupervised and robust. In the second part, we address the most salient issue in the field of FDM, which is how the data is structured. Firstly, we extend the method to take into account parameters of diverse types such as continuous, discrete or angular. Secondly, we explore techniques to take into account the temporal aspect of flights and propose a new kernel in the family of dynamic time warping techniques, and demonstrate that it is faster to compute than competing techniques and is positive definite. We illustrate our approach with promising results on real-world datasets from the airlines TAP and Transavia comprising several hundred flights.
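For readers unfamiliar with dynamic time warping, the following is a minimal sketch of the classic DTW distance between two one-dimensional sequences. It is not the kernel proposed in the thesis, whose definition the abstract does not give; the function name `dtw_distance` and the absolute-difference local cost are illustrative assumptions.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences.

    Illustrative only: this is the textbook recurrence, not the kernel
    proposed in the thesis.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local cost
            # extend the cheapest of the three admissible predecessor alignments
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Two recordings of the same shape at different speeds align at zero cost.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
```

The quadratic table makes the cost O(n·m) per pair of flights, which is one reason computation speed matters when comparing hundreds of time series pairwise.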
|
273 |
Application of multivariate statistics and Geographic Information Systems (GIS) to map groundwater quality in the Beaufort West area, Western Cape, South Africa
Solomon, Henok Goitom, January 2013
Magister Scientiae - MSc (Environ & Water Science) / Groundwater in arid and semi-arid areas like the Karoo region of South Africa is an important source of fresh water for domestic, agricultural and industrial use. As a scarce resource, it requires extensive quality control and protection through innovative methods and efficient strategies. The town of Beaufort West and its vicinity use groundwater as a major source of municipal and private water supply. Forty-nine groundwater samples were collected from spatially referenced boreholes located in and around the town of Beaufort West and were analyzed for EC, pH, TDS, TH, SAR, TA, Ca²⁺, Mg²⁺, Na⁺, K⁺, HCO₃⁻, Cl⁻, NO₃⁻ and SO₄²⁻ according to SANS 241 standards and tested for ionic balance. The groundwater of the study area was characterized using WHO and South African drinking water quality standards as well as TDS and salinity hazard classifications. These comparisons and classifications characterized the groundwater of the study area as hard to very hard, with low to medium salinity hazard. These results are in accordance with the dominance of the ions Ca²⁺, Na⁺, HCO₃⁻ and Cl⁻ in the groundwater samples. Linear relationships between the hydrochemical variables were analysed through correlation and multiple regression analysis to relate the groundwater quality to the underlying hydrogeochemical processes. These linear relationships explained the contribution of the measured variables towards the salinity, hardness and anthropogenic contamination of the groundwater. The groundwater of the study area was also assessed using conventional trilinear diagrams and scatter plots to interpret the water quality and determine the major ion chemistry. The conventional methods highlighted the sources of the hydrochemical variables through analysis and interpretation of rock-water interaction and evaporation processes.
To supplement these conventional methods and reveal hidden hydrogeochemical phenomena, multivariate statistical analyses were employed. Factor analysis reduced the hydrochemical variables to three factors (Hardness, Alkalinity and Landuse) that characterize the groundwater quality in relation to the source of its hydrochemistry. Furthermore, a combination of cluster analysis (CA) and discriminant analysis (DA) was used to classify the groundwater into different hydrochemical facies and determine the dominant hydrochemical variables that characterize these facies. The classification results were also compared with the trilinear diagrammatic interpretations to highlight the advantages of these multivariate statistical methods. The CA and DA classifications resulted in six different hydrochemical facies that are characterized by NO₃⁻, Na⁺ and pH. These three hydrochemical variables explain 93.9% of the differences between the water types and highlight the influence of natural hydrogeochemical and anthropogenic processes on the groundwater quality. The results of all the univariate, bivariate, multivariate statistical and conventional hydrogeochemical analyses were analyzed spatially using ArcGIS 10.0. The spatial analysis employed the Inverse Distance Weighted (IDW) interpolation method to predict values at unmeasured locations, followed by reclassification of the interpolation results for classification purposes. The results of the different analysis methods employed in the thesis illustrate that the groundwater in the study area is generally hard but permissible in the absence of a better alternative water source, and useful for irrigation.
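The Inverse Distance Weighted interpolation mentioned above can be sketched in a few lines. This is a generic illustration of the technique, not the ArcGIS implementation: the abstract does not state the power parameter or the search neighborhood, so `power=2.0` and the use of all samples are assumptions, and the function name and sample values are hypothetical.

```python
import math


def idw_interpolate(samples, x, y, power=2.0):
    """Estimate a value at (x, y) from (xi, yi, value) samples by IDW.

    Each sample is weighted by 1 / distance**power, so nearby boreholes
    dominate the estimate. power=2.0 is an assumed, commonly used setting.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    weighted_sum = 0.0
    weight_total = 0.0
    for xi, yi, value in samples:
        dist = math.hypot(x - xi, y - yi)
        if dist == 0.0:
            return value  # query point coincides with a measured location
        weight = 1.0 / dist ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical TDS readings (mg/L) at two boreholes on a line.
boreholes = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
midpoint_estimate = idw_interpolate(boreholes, 1.0, 0.0)
```

Because the weights are normalized, the estimate always lies between the smallest and largest sampled values, which is worth keeping in mind when reading IDW maps of sparse borehole data.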
|
274 |
Modèles prudents en apprentissage statistique supervisé / Cautious models in supervised machine learning
Yang, Gen, 22 March 2016
In some areas of supervised machine learning (e.g. medical diagnostics, computer vision), predictive models are evaluated not only on their accuracy but also on their ability to obtain a more reliable representation of the data and the induced knowledge, in order to allow for cautious decision making. This is the problem we studied in this thesis. Specifically, we examined two existing approaches from the literature for making models and predictions more cautious and more reliable: the framework of imprecise probabilities and that of cost-sensitive learning. Both areas aim to make models and inferences more reliable and cautious, yet few existing studies have attempted to bridge the two frameworks, owing to both theoretical and practical problems. Our contributions are to clarify and to resolve these problems. Theoretically, few existing studies have addressed how to quantify the different classification errors when set-valued predictions are produced and when the costs of mistakes are not equal (in terms of consequences). Our first contribution has been to establish general properties and guidelines for quantifying the misclassification costs for set-valued predictions. These properties have led us to derive a general formula, which we call the generalized discounted cost (GDC), which allows the comparison of classifiers whatever the form of their predictions (singleton or set-valued) in the light of a risk aversion parameter. Practically, most classifiers based on imprecise probabilities fail to integrate generic misclassification costs efficiently because the computational complexity increases by an order of magnitude (or more) when non-unitary costs are used. This problem has led to our second contribution, the implementation of a classifier that can manage the probability intervals produced by imprecise probabilities and generic error costs with the same order of complexity as in the case where standard probabilities and unitary costs are used.
The classifier relies on a binary decomposition technique, nested dichotomies. The properties and prerequisites of this technique have been studied in detail. In particular, we saw that nested dichotomies are applicable to all imprecise probabilistic models and that they reduce the imprecision level of imprecise models without loss of predictive power. Various experiments were conducted throughout the thesis to illustrate and support our contributions. We characterized the behavior of the GDC using ordinal data sets. These experiments have highlighted the differences between a model that uses the standard probability framework to produce indeterminate predictions and a model based on imprecise probabilities. The latter is generally more competent because it distinguishes two sources of uncertainty (ambiguity and lack of information), although the combined use of these two types of models is also of particular interest, as it can help the decision-maker improve the data quality or the classifiers. In addition, experiments conducted on a wide variety of data sets showed that the use of nested dichotomies significantly improves the predictive power of an indeterminate model with generic costs.
|
275 |
Bayesian multiple hypotheses testing with quadratic criterion / Test bayésien entre hypothèses multiples avec critère quadratique
Zhang, Jian, 04 April 2014
The anomaly detection and localization problem can be treated as a multiple hypotheses testing (MHT) problem in the Bayesian framework. The Bayesian test with the 0−1 loss function is a standard solution for this problem, but the alternative hypotheses may differ considerably in importance in practice. The 0−1 loss function does not reflect this fact, whereas the quadratic loss function is more appropriate. The objective of the thesis is the design of a Bayesian test with the quadratic loss function and its asymptotic study.
The test is constructed in two steps. In the first step, a Bayesian test with the quadratic loss function for the MHT problem without the null hypothesis is designed, and the lower and upper bounds of the misclassification probabilities are calculated. The second step constructs a Bayesian test for the MHT problem with the null hypothesis. The lower and upper bounds of the false alarm probabilities, the missed detection probabilities and the misclassification probabilities are calculated. From these bounds, the asymptotic equivalence between the proposed test and the standard one with the 0−1 loss function is studied. Numerous simulation experiments and an acoustic experiment illustrate the effectiveness of the new statistical test.
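The difference between the two loss functions can be made concrete with a small sketch of the generic Bayes decision rule, which picks the hypothesis with the smallest posterior expected loss. This is not the test constructed in the thesis: the abstract does not give the exact form of the quadratic loss, so the squared distance between scalar parameters, the parameter values and the posterior probabilities below are all hypothetical.

```python
def bayes_decision(posterior, loss):
    """Return the index j minimizing sum_i posterior[i] * loss[i][j].

    loss[i][j] is the cost of deciding hypothesis j when hypothesis i is true.
    """
    k = len(posterior)
    risks = [sum(posterior[i] * loss[i][j] for i in range(k)) for j in range(k)]
    return min(range(k), key=risks.__getitem__)


def zero_one_loss(k):
    """Every wrong decision costs 1, whichever hypotheses are confused."""
    return [[0.0 if i == j else 1.0 for j in range(k)] for i in range(k)]


def quadratic_loss(thetas):
    """Assumed form: cost grows with the squared gap between parameters."""
    return [[(ti - tj) ** 2 for tj in thetas] for ti in thetas]


# Hypothetical setup: three hypotheses with scalar parameters and posteriors.
thetas = [0.0, 1.0, 10.0]
posterior = [0.40, 0.35, 0.25]

map_choice = bayes_decision(posterior, zero_one_loss(len(thetas)))   # index 0
quad_choice = bayes_decision(posterior, quadratic_loss(thetas))      # index 1
```

With the 0−1 loss the rule reduces to choosing the most probable hypothesis (index 0 here). With the quadratic loss it moves to index 1, because wrongly ruling out the distant third hypothesis is penalized far more heavily than confusing the two close ones.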
|
276 |
Regressão logística e análise discriminante na predição da recuperação de portfólios de créditos do tipo non-performing loans / Logistic regression and discriminant analysis in prediction of the recovery of non-performing loans credits portfolio
Silva, Priscila Cristina, 23 February 2017
Customers whose credit agreements have been in arrears for more than 90 days are classified as non-performing loans and are a concern for credit institutions because there is no guarantee that the amount owed will be repaid. Collection scoring models are applied to this type of customer; their main objective is to predict which debtors are likely to honor their debts, that is, these models focus on credit recovery. Models based on statistical prediction techniques, such as logistic regression and discriminant analysis, can be applied to the recovery of these credits. The aim of this work was therefore to apply logistic regression and discriminant analysis models to predict the recovery of non-performing loan credit portfolios. The database used was provided by the company Serasa Experian and contains a sample of ten thousand customers with twenty independent variables and a binary response (dependent) variable indicating whether or not the defaulting customer paid their debt. The sample was divided into training, validation and test sets, and the models named in the objective were applied individually. Then, two new models, one logistic regression and one discriminant analysis, were built from the outputs of the individually applied models. Both the individually applied models and the new models performed well, most notably the new discriminant analysis model, which achieved a higher percentage of correct classifications than the new logistic regression model. It was concluded, based on these results, that the models are a good option for predicting the recovery of non-performing loan credit portfolios.
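As a rough illustration of how discriminant analysis separates a binary response such as paid versus not paid, here is a minimal two-class Fisher discriminant for two-dimensional inputs. It is a sketch only and not the model from the thesis, which used twenty variables: the function names, the equal-priors threshold and the toy points are all assumptions made for this example.

```python
def fisher_lda_2d(class0, class1):
    """Fit a two-class Fisher discriminant on 2-D points.

    Returns (w, threshold) where w = Sw^-1 (m1 - m0) and the threshold is the
    projected midpoint of the class means (equal priors are assumed).
    """
    def mean(points):
        n = len(points)
        return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

    def scatter(points, m):
        sxx = sum((p[0] - m[0]) ** 2 for p in points)
        sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
        syy = sum((p[1] - m[1]) ** 2 for p in points)
        return sxx, sxy, syy

    m0, m1 = mean(class0), mean(class1)
    a0, b0, c0 = scatter(class0, m0)
    a1, b1, c1 = scatter(class1, m1)
    # Pooled within-class scatter matrix Sw = [[a, b], [b, c]]
    a, b, c = a0 + a1, b0 + b1, c0 + c1
    det = a * c - b * b
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    # Closed-form 2x2 inverse applied to the difference of the class means
    w = ((c * dx - b * dy) / det, (a * dy - b * dx) / det)
    threshold = 0.5 * (w[0] * (m0[0] + m1[0]) + w[1] * (m0[1] + m1[1]))
    return w, threshold


def predict(w, threshold, point):
    """Assign class 1 if the projection onto w exceeds the threshold."""
    return 1 if w[0] * point[0] + w[1] * point[1] > threshold else 0


# Hypothetical, well-separated toy data (two made-up predictor variables).
not_paid = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
paid = [(4.0, 4.0), (5.0, 4.0), (4.0, 5.0), (5.0, 5.0)]
w, threshold = fisher_lda_2d(not_paid, paid)
```

In practice a library implementation would be used for twenty variables; the closed-form 2×2 inverse here only keeps the example dependency-free.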
|
277 |
Hardware / Algorithm Integration for Pharmaceutical Analysis
Casey J Smith (8755572), 29 April 2020
New experimental strategies and algorithmic approaches were devised and tested to improve the analysis of pharmaceutically relevant materials. These new methods were developed to address key bottlenecks in the design of amorphous solid dispersions for the delivery of low-solubility active pharmaceutical ingredients in final dosage forms exhibiting high bioavailability.
|
278 |
Machine learning methods for seasonal allergic rhinitis studies
Feng, Zijie, January 2021
Seasonal allergic rhinitis (SAR) is an allergen-triggered disease influenced by both environmental and genetic factors. Some researchers have studied SAR using traditional genetic methodologies. As technology has advanced, a new technique called single-cell RNA sequencing (scRNA-seq) has been developed, which generates high-dimensional data. We apply two machine learning (ML) algorithms, random forest (RF) and partial least squares discriminant analysis (PLS-DA), for cell source classification and gene selection based on SAR scRNA-seq time-series data from three allergic patients and four healthy controls, denoised by single-cell variational inference (scVI). We additionally propose a new fitting method, consisting of bootstrap and cubic smoothing splines, to fit the averaged gene expression per cell from different populations. In summary, we find that both RF and PLS-DA provide high classification accuracy, and that RF is preferable given its stable performance and strong gene-selection ability. Based on our analysis, 10 genes have discriminatory power to classify cells of allergic patients and healthy controls at any time point. Although no literature was found showing direct connections between these 10 genes and SAR, the potential associations are indirectly supported by some studies. This suggests that it may be possible to alert allergic patients before a disease outbreak based on their genetic information. Meanwhile, our experimental results indicate that ML algorithms may reveal associations between genes and SAR that traditional techniques do not, which will need to be examined from a genetics perspective in the future.
|
279 |
Adaptace systémů pro rozpoznání mluvčího / Adaptation of Speaker Recognition Systems
Novotný, Ondřej, January 2014
In this paper, we propose techniques for the adaptation of speaker recognition systems. The aim of this work is to create an adaptation method for Probabilistic Linear Discriminant Analysis. Special attention is given to unsupervised adaptation. Our tests identify suitable clustering techniques for estimating speaker identities and the number of speakers in the adaptation dataset. For the tests, we use the NIST and Switchboard corpora.
|
280 |
Identifikace obličeje / Face Identification
Macenauer, Oto, January 2010
This document introduces the reader to the area of face recognition. Various methods are described and categorized to help the reader understand the face recognition process. The main focus of this document is on the problems of current face recognition and on possible ways to solve them so that face recognition can be widely deployed. The second part of this work focuses on the implementation of two selected methods, Linear Discriminant Analysis and Principal Component Analysis. These methods are compared with each other, and the results are given at the end of the work.
|