Global ETD Search

1	Statistical Learning and Behrens Fisher Distribution Methods for Heteroscedastic Data in Microarray Analysis Manandhr-Shrestha, Nabin K. 29 March 2010 The aim of the present study is to identify the genes that are differentially expressed between two conditions and to use them to predict the class of new samples from microarray data. Microarray data analysis poses many challenges to statisticians because of its high dimensionality and small sample size, dubbed the "small n, large p" problem. Microarray data have been studied extensively by statisticians and geneticists. The data are generally assumed to follow a normal distribution with equal variances in the two conditions, but this assumption does not hold in general. Because the number of replications is very small, the sample estimates of the variances are not reliable for testing. We therefore take a Bayesian approach to approximating the variances in the two conditions. Because the number of genes to be tested is usually large and the test is repeated thousands of times, there is a multiplicity problem. When the hypothesis test is applied gene by gene to several thousand genes, there is a high chance of selecting false genes as differentially expressed, even when the significance level is set very small. For the test to be reliable, the probability of selecting true positives should be high. To control the false positives arising from multiple comparisons, we apply the False Discovery Rate (FDR) correction, in which the p-value of each gene is compared with its corresponding threshold. A gene is then declared differentially expressed if its p-value is less than the threshold. We have developed a new method of selecting informative genes based on a Bayesian version of the Behrens-Fisher distribution, which allows unequal variances in the two conditions. 
Since the assumption of equal variances fails in most situations, and equal variance is a special case of unequal variance, we address the problem of finding differentially expressed genes in the unequal-variance case. We found that the developed method selects the genes that are actually expressed in simulated data, and we compared it with recent methods such as Fox and Dimmic's t-test method and Tusher and Tibshirani's SAM method, among others. The next step of this research is to check whether the genes selected by the proposed Behrens-Fisher method are useful for classifying samples. By combining the Behrens-Fisher gene selection method with other statistical learning methods, we obtained better classification results. The reason is the method's ability to select genes using both prior knowledge and the data. In microarray data, because of the small sample size and the large number of variables, the variance estimate obtained from the sample is unreliable in the sense that it is not positive definite and not invertible. We therefore derived the Bayesian version of the Behrens-Fisher distribution to remove that insufficiency. The efficiency of this method has been demonstrated by applying it to three real microarray data sets and calculating the misclassification error rates on the corresponding test sets. Moreover, we compared our results with those of other popular methods from the literature, such as Nearest Shrunken Centroid and Support Vector Machines. We studied the classification performance of different classifiers before and after taking the correlation between genes into account. Classification performance improved significantly once the correlation was accounted for. The performance of the classifiers was measured by misclassification rates and the confusion matrix. 
Another problem in the multiple testing of a large number of hypotheses is the correlation among the test statistics, which we have taken into account. If there were no correlation, the shape of the normalized histogram of the test statistics would be unaffected. As shown by Efron, the degree of correlation among the test statistics either widens or narrows the tails of that histogram. Thus the usual rejection region obtained from the significance level is not sufficient. The rejection region should be redefined accordingly, depending on the degree of correlation. The effect of the correlation on selecting the appropriate rejection region has also been studied. Genes; False Discovery Rate; Multiple Testing; Correlation; Classification; American Studies; Arts and Humanities; Mathematics; Statistics and Probability
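The abstract above describes comparing each gene's p-value with "its corresponding threshold" but does not name the FDR procedure. As a minimal sketch, assuming the Benjamini-Hochberg step-up procedure (an assumption, not confirmed by the abstract), the comparison could look like this. The function name `benjamini_hochberg` and the level `q` are illustrative.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per hypothesis: True where it is rejected.

    Assumed procedure: Benjamini-Hochberg step-up at FDR level q.
    The k-th smallest p-value is compared with the threshold (k / m) * q.
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k whose p-value is at or below its threshold.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k (step-up rule).
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for four genes.
flags = benjamini_hochberg([0.02, 0.02, 0.03, 0.5], q=0.05)
```

Because the rule is step-up, a gene whose p-value exceeds its own threshold can still be selected when a higher-ranked gene passes. In the usage line, the smallest p-value (0.02) is above its threshold of 0.0125 but is still rejected.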
2	Modélisation, création et évaluation de flux de terminologies et de terminologies d'interface : application à la production d'examens complémentaires de biologie et d'imagerie médicale. Griffon, Nicolas 25 October 2013 Computerizing prescriptions in healthcare institutions offers many theoretical, clinical, and economic benefits: fewer prescriptions, better clinical relevance, fewer medical errors, and so on. These benefits remain theoretical because, in practice, computerized prescribing runs into many problems, among them the interoperability and usability of software solutions. Using interface terminologies within terminology flows could overcome these problems. The main objective of this work was to model and develop such terminology flows for the production of laboratory and medical imaging examinations, and then to evaluate their benefits in terms of interoperability and usability. Process analysis techniques led to a model of terminology flows that appears to be common to many domains. The flows themselves were built from interface terminologies, created for the occasion, and from recognized national or international reference terminologies. For the evaluation, we applied specific methods that had been developed while integrating an iconic interface terminology into a search engine for medical guidelines and into a medical record. The terminology flows created caused substantial information loss between the different information systems. In imaging, the prescription interface terminology was significantly easier to use than the other terminologies; no such difference was found in the field of biology. 
Although the terminology flows are not yet functional, the interface terminologies are available to any healthcare institution or software vendor and should facilitate the deployment of prescription support software. Data Display; Task Performance and Analysis; Diagnostic Imaging/Classification; Evaluation Studies; Laboratory Tests/Classification; Medical Subject Headings; Electronic Prescribing; Terminology; Terminology as Topic
3	Generalized Modeling and Estimation of Rating Classes and Default Probabilities Considering Dependencies in Cross and Longitudinal Section Tillich, Daniel 30 March 2017 Our sample (Xit; Yit) consists of pairs of variables. The real variable Xit measures the creditworthiness of individual i in period t. The Bernoulli variable Yit is the default indicator of individual i in period t. The objective is to estimate a credit rating system, in particular to divide the range of creditworthiness into several rating classes, each with a homogeneous default risk. The field of change point analysis provides a way to estimate the breakpoints between the rating classes. So far, the literature considers only models without dependencies or with dependence only in cross section. This contribution proposes multi-period models that include dependencies in cross section as well as in longitudinal section. Furthermore, estimators for the model parameters are suggested. The estimators are applied to a data set from a German credit bureau. Zeitabhängigkeit; Rating-Klassifizierung; Kreditrisiko; Split-Point; Regression mit Diskontinuitäten; time dependence; rating classification; credit risk; split-point; regression with discontinuities; ddc:330; rvk:QH 400
4	Strojové učení v klasifikaci obrazu / Machine Learning in Image Classification Král, Jiří January 2011 This project deals with the analysis and testing of algorithms and statistical models that could potentially improve the results of FIT BUT in the ImageNet Large Scale Visual Recognition Challenge and TRECVID. A multinomial model was tested. The Phonotactic Intersession Variation Compensation (PIVCO) model was used to reduce random effects in the image representation and for dimensionality reduction. PIVCO dimensionality reduction achieved the best mean average precision while reducing the representation to one-twentieth of its original dimension. A KPCA model was tested as an approximation of kernel SVM. All statistical models were tested on the Pascal VOC 2007 dataset.
5	Generalized Modeling and Estimation of Rating Classes and Default Probabilities Considering Dependencies in Cross and Longitudinal Section Tillich, Daniel 30 March 2017 Our sample (Xit; Yit) consists of pairs of variables. The real variable Xit measures the creditworthiness of individual i in period t. The Bernoulli variable Yit is the default indicator of individual i in period t. The objective is to estimate a credit rating system, in particular to divide the range of creditworthiness into several rating classes, each with a homogeneous default risk. The field of change point analysis provides a way to estimate the breakpoints between the rating classes. So far, the literature considers only models without dependencies or with dependence only in cross section. This contribution proposes multi-period models that include dependencies in cross section as well as in longitudinal section. Furthermore, estimators for the model parameters are suggested. The estimators are applied to a data set from a German credit bureau. info:eu-repo/classification/ddc/330; ddc:330
6	Estimation in discontinuous Bernoulli mixture models applicable in credit rating systems with dependent data Tillich, Daniel, Lehmann, Christoph 30 March 2017 Objective: We consider the following problem from credit risk modeling: Our sample (Xi; Yi), 1 ≤ i ≤ n, consists of pairs of variables. The first variable Xi measures the creditworthiness of individual i. The second variable Yi is the default indicator of individual i. It has two states: Yi = 1 indicates a default, Yi = 0 a non-default. A default occurs if individual i cannot meet its contractual credit obligations, i.e. it cannot repay its outstanding debt on schedule. As a first step, our objective is to estimate the threshold between good and bad creditworthiness by dividing the range of Xi into two rating classes: one class with good creditworthiness and a low probability of default, and another with bad creditworthiness and a high probability of default. Methods: Given observations of individual creditworthiness Xi and defaults Yi, the field of change point analysis provides a natural way to estimate the breakpoint between the rating classes. To account for dependency between the observations, the literature proposes a combination of three model classes: a breakpoint model, a linear one-factor model for the creditworthiness Xi, and a Bernoulli mixture model for the defaults Yi. We generalize the dependency structure further and use a generalized link between the systematic factor and the idiosyncratic factor of creditworthiness. The systematic factor can thus change not only the location but also the shape of the distribution of creditworthiness. Results: For the case of two rating classes, we propose several estimators for the breakpoint and for the default probabilities within the rating classes. We prove the strong consistency of these estimators in the given non-i.i.d. framework. The theoretical results are illustrated by a simulation study. 
Finally, we give an overview of research opportunities. Regression mit Sprung; Änderungspunkt; Splitpunkt; Kreditrisiko; Ratingklassifizierung; Ausfallwahrscheinlichkeit; Abhängigkeit; starke Konsistenz; regression with jump; change point; split point; credit risk; rating classification; default probability; dependence; strong consistency; ddc:330; rvk:QH 400
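The two-class breakpoint problem in the abstract above can be illustrated with a deliberately simple estimator. The sketch below is not the authors' estimator and ignores the dependence structure their models address. It assumes independent observations and picks the split of the creditworthiness scores that minimizes the within-class residual sum of squares of the default indicator. The name `estimate_breakpoint` is illustrative.

```python
def estimate_breakpoint(x, y):
    """Least-squares split point for 0/1 defaults y given scores x.

    Assumption: observations are treated as independent.
    Returns (breakpoint, default rate below it, default rate above it).
    """
    pairs = sorted(zip(x, y))
    n = len(pairs)
    total_defaults = sum(d for _, d in pairs)
    best = None
    left_defaults = 0
    for k in range(1, n):
        left_defaults += pairs[k - 1][1]
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no split is possible between tied scores
        p_left = left_defaults / k
        p_right = (total_defaults - left_defaults) / (n - k)
        # For 0/1 data, the residual sum of squares of a class with
        # size n_j and default rate p_j equals n_j * p_j * (1 - p_j).
        rss = k * p_left * (1 - p_left) + (n - k) * p_right * (1 - p_right)
        if best is None or rss < best[0]:
            midpoint = (pairs[k - 1][0] + pairs[k][0]) / 2
            best = (rss, midpoint, p_left, p_right)
    if best is None:
        raise ValueError("need at least two distinct scores")
    return best[1:]


# Hypothetical data: low scores default, high scores do not.
result = estimate_breakpoint([1, 2, 3, 4, 5, 6], [1, 1, 1, 0, 0, 0])
```

The two returned rates estimate the default probabilities within the two rating classes. Handling dependent observations, as the abstract proposes, would require the factor and mixture models it names.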
7	Klasifikace testovacích manévrů z letových dat / Classification of Testing Maneuvers from Flight Data Funiak, Martin January 2015 A flight data recorder is a device designed to record flight data from various sensors in aircraft. Flight data analysis plays an important role in the development and testing of avionics. Aircraft characteristics are often tested and evaluated by means of test maneuvers. The data measured during one flight are stored in a single flight record, which may contain several test maneuvers. The goal of this thesis is to identify basic test maneuvers using the measured flight data. The theoretical part describes flight maneuvers and the format of the measured flight data. The analytical part describes research in classification based on statistics and the probability theory needed to understand complex Gaussian mixture models. The thesis presents an implementation in which Gaussian mixture models are used to classify test maneuvers. The proposed solution was tested on data obtained from a flight simulator and from a real aircraft. Gaussian mixture models proved to be a suitable solution for this task. Possible further development of the work is described in the final chapter.
8	Estimation in discontinuous Bernoulli mixture models applicable in credit rating systems with dependent data Tillich, Daniel, Lehmann, Christoph 30 March 2017 Objective: We consider the following problem from credit risk modeling: Our sample (Xi; Yi), 1 ≤ i ≤ n, consists of pairs of variables. The first variable Xi measures the creditworthiness of individual i. The second variable Yi is the default indicator of individual i. It has two states: Yi = 1 indicates a default, Yi = 0 a non-default. A default occurs if individual i cannot meet its contractual credit obligations, i.e. it cannot repay its outstanding debt on schedule. As a first step, our objective is to estimate the threshold between good and bad creditworthiness by dividing the range of Xi into two rating classes: one class with good creditworthiness and a low probability of default, and another with bad creditworthiness and a high probability of default. Methods: Given observations of individual creditworthiness Xi and defaults Yi, the field of change point analysis provides a natural way to estimate the breakpoint between the rating classes. To account for dependency between the observations, the literature proposes a combination of three model classes: a breakpoint model, a linear one-factor model for the creditworthiness Xi, and a Bernoulli mixture model for the defaults Yi. We generalize the dependency structure further and use a generalized link between the systematic factor and the idiosyncratic factor of creditworthiness. The systematic factor can thus change not only the location but also the shape of the distribution of creditworthiness. Results: For the case of two rating classes, we propose several estimators for the breakpoint and for the default probabilities within the rating classes. We prove the strong consistency of these estimators in the given non-i.i.d. framework. The theoretical results are illustrated by a simulation study. 
Finally, we give an overview of research opportunities. info:eu-repo/classification/ddc/330 ddc:330
9	Fouille de données d'usage du Web : Contributions au prétraitement de logs Web Intersites et à l'extraction des motifs séquentiels avec un faible support Tanasa, Doru 03 June 2005 The past fifteen years have seen exponential growth of the Web, both in the number of available Web sites and in the number of their users. This growth has generated very large volumes of data on how users browse the Web, recorded in Web log files. In addition, site owners have expressed the need to understand their visitors better in order to meet their expectations more effectively. Web Usage Mining (WUM), a fairly recent research field, is precisely the process of knowledge discovery from data (KDD) applied to Web usage data. It comprises three main steps: data preprocessing, pattern discovery, and analysis (or interpretation) of the results. A WUM process extracts behavior patterns from usage data and, optionally, from information about the site (structure and content) and about its users (profiles). The volume of usage data to analyze and its low quality (in particular its lack of structure) are the main problems in WUM. Classical data mining algorithms applied to such data generally give disappointing results with respect to users' practices (for example, obvious sequential patterns of no interest). In this thesis, we make two important contributions to the WUM process, both implemented in our AxisLogMiner toolbox. 
We propose a general methodology for preprocessing Web logs and a general divisive methodology with three approaches (along with associated concrete methods) for discovering sequential patterns with low support. Our first contribution concerns the preprocessing of Web usage data, an area still rarely addressed in the literature. The originality of the proposed preprocessing methodology is that it takes into account the multi-site aspect of WUM, which is essential for understanding the practices of users who navigate transparently across, for example, several Web sites of the same organization. In addition to integrating the main existing work on this topic, our methodology proposes four distinct steps: merging the log files, cleaning, structuring, and aggregating the data. In particular, we propose several heuristics for removing Web robots, aggregated variables describing sessions and visits, and the storage of these data in a relational model. Several experiments showed that our methodology greatly reduces (by up to a factor of 10) the number of initial requests and yields richer, structured logs for the subsequent data mining step. Our second contribution targets the discovery, from a large preprocessed log file, of minority behaviors corresponding to sequential patterns with very low support. To this end, we propose a general methodology that divides the preprocessed log file into sub-logs, following three approaches to extracting low-support sequential patterns (Sequential, Iterative, and Hierarchical). These have been implemented in concrete hybrid methods that combine clustering and sequential pattern extraction algorithms. 
Several experiments on logs from academic sites allowed us to discover interesting sequential patterns with very low support, which a classical Apriori-type algorithm could not find. Finally, we propose a toolbox called AxisLogMiner, which supports our preprocessing methodology and, currently, two concrete hybrid methods for discovering sequential patterns in WUM. This toolbox has been used for many log-file preprocessing runs and for experiments with our implemented methods. Web usage mining (WUM); Web access logs; WUM methodology; WUM preprocessing; multi-site WUM; Web data mining; data mining; sequential pattern extraction; low support; unsupervised classification; divisive methodology; WUM toolbox; Apriori-GST; AxisLogMiner
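The preprocessing steps named in the abstract above (merging log files, cleaning, structuring) can be sketched in a few lines. This is a rough illustration, not the AxisLogMiner implementation. The record layout, the robot keyword list `ROBOT_HINTS`, and the 30-minute `SESSION_GAP` are all assumptions made for the example, and the aggregation step is omitted.

```python
from collections import defaultdict

ROBOT_HINTS = ("bot", "crawler", "spider")  # assumed robot heuristic
SESSION_GAP = 30 * 60  # assumed session timeout, in seconds


def preprocess(logs):
    """Merge, clean, and sessionize multi-site Web logs.

    logs: one list per site of (timestamp, user, user_agent, url) tuples
    (a hypothetical record layout).
    Returns a dict mapping each user to a list of sessions (lists of URLs).
    """
    # 1. Merge: combine the per-site logs into one time-ordered stream.
    merged = sorted(record for log in logs for record in log)
    # 2. Clean: drop requests whose user agent looks like a Web robot.
    cleaned = [
        r for r in merged
        if not any(hint in r[2].lower() for hint in ROBOT_HINTS)
    ]
    # 3. Structure: start a new session after a long gap for the same user.
    sessions = defaultdict(list)
    last_seen = {}
    for ts, user, _agent, url in cleaned:
        if user not in last_seen or ts - last_seen[user] > SESSION_GAP:
            sessions[user].append([])
        sessions[user][-1].append(url)
        last_seen[user] = ts
    return dict(sessions)


# Hypothetical logs from two sites of the same organization.
site_a = [(0, "u1", "Mozilla", "a.example/home"),
          (100, "u1", "Mozilla", "a.example/docs")]
site_b = [(50, "u1", "Mozilla", "b.example/"),
          (60, "x", "Googlebot", "b.example/"),
          (5000, "u1", "Mozilla", "b.example/news")]
sessions = preprocess([site_a, site_b])
```

Merging before sessionizing is what lets a single session span both sites, which is the multi-site aspect the abstract emphasizes.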
