31

Aspects of statistical disclosure control

Smith, Duncan Geoffrey January 2012 (has links)
This work concerns the evaluation of statistical disclosure control risk by adopting the position of the data intruder. The underlying assertion is that risk metrics should be based on the actual inferences that an intruder can make. Ideally, metrics would also take into account how sensitive the inferences would be, but that is subjective. A parallel theme is that of the knowledgeable data intruder: an intruder who has the technical skills to maximally exploit the information contained in released data. This also raises the issue of computational costs and the benefits of using good algorithms. A metric for attribution risk in tabular data is presented. It addresses the issue that most measures for tabular data are based on the risk of identification. The metric can also take into account assumed levels of intruder knowledge regarding the population, and it can be applied to both exact and perturbed collections of tables. An improved implementation of the Key Variable Mapping System (Elliot et al., 2010) is presented. The problem is more precisely defined in terms of categorical variables rather than responses to survey questions. This allows much more efficient algorithms to be developed, leading to significant performance increases. The advantages and disadvantages of alternative matching strategies are investigated, and some are shown to dominate others. The costs of searching for a match are also considered, providing insight into how a knowledgeable intruder might tailor a strategy to balance the probability of a correct match against the time and effort required to find one. A novel approach to model determination in decomposable graphical models is described. It offers purely computational advantages over existing schemes, but allows data sets to be more thoroughly checked for disclosure risk. It is shown that a Bayesian strategy for matching between a sample and a population offers much higher probabilities of a correct match than traditional strategies would suggest. The Special Uniques Detection Algorithm (Elliot et al., 2002; Manning et al., 2008), for identifying risky sample counts of 1, is compared against Bayesian alternatives (using Markov chain Monte Carlo and simulated annealing). It is shown that the alternatives are better at identifying risky sample uniques, and can do so with reduced computational costs.
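As a rough, hypothetical illustration of the starting point for special-uniques detection (not the SUDA algorithm itself, which searches for and scores minimal unique attribute subsets), the following sketch flags sample uniques in categorical microdata:

```python
from collections import Counter

# Toy microdata: each record is a tuple of categorical key variables
# (hypothetical age band, sex, region values).
records = [
    ("16-24", "F", "North"),
    ("16-24", "F", "North"),
    ("25-34", "M", "South"),
    ("65+", "F", "East"),  # occurs once: a sample unique
]

counts = Counter(records)

# Flag records whose key-variable combination occurs exactly once;
# SUDA goes further and scores minimal unique subsets of the keys.
sample_uniques = [r for r in records if counts[r] == 1]
print(sample_uniques)
```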
32

Using Naive Bayes and N-Gram for Document Classification / Användning av Naive Bayes och N-Gram för dokumentklassificering

Farah Mohamed, Khalif January 2015 (has links)
The purpose of this degree project is to present, evaluate and improve probabilistic machine-learning methods for supervised text classification. We explore two probabilistic methods, the Naive Bayes algorithm and character-level n-grams, and then compare them. Probabilistic algorithms such as Naive Bayes and character-level n-grams are among the most effective methods in text classification, but they need a large training set to give accurate results. Because of its overly simple independence assumptions, Naive Bayes can be a poor classifier. To rectify this, we try to improve the algorithm by using transformed word and n-gram counts.
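A minimal sketch of a character-level n-gram Naive Bayes classifier of the kind evaluated here, using scikit-learn on a toy corpus (illustrative only; accurate results require a large training set):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus; real evaluations need a much larger training set.
docs = ["the striker scored a late goal",
        "parliament passed the budget bill",
        "the midfielder signed a new contract",
        "the senate debated the new law"]
labels = ["sports", "politics", "sports", "politics"]

# analyzer="char_wb" builds character n-grams within word boundaries;
# ngram_range=(2, 4) uses 2- to 4-character grams.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    MultinomialNB(alpha=1.0),  # Laplace-smoothed counts
)
model.fit(docs, labels)
print(model.predict(["the goalkeeper saved a penalty"]))  # expected: 'sports'
```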
33

La moyenne bayésienne pour les modèles basés sur les graphes acycliques orientés / Bayesian model averaging for models based on directed acyclic graphs

Bouzite, Fatima Ezzahraa 08 April 2022 (has links)
Causal inference methods are useful for answering many research questions in different fields, including epidemiology. Directed acyclic graphs are important tools for causal inference; among other things, they can be used to identify the confounding variables to adjust for when fitting statistical models, so that the effect of a treatment can be estimated without bias. These graphs are built from knowledge of the application domain, yet this knowledge is sometimes insufficient to assume that the constructed graph is correct, and a researcher can often propose several graphs for the same problem. In this project, we develop an alternative to traditional Bayesian model averaging that is based on a set of graphs proposed by a user. To implement it, we first estimate the likelihood of the data under the model implied by each graph in order to determine the posterior probability of each graph. For each graph, a set of adjustment covariates sufficient to control confounding bias is identified, and the causal effect is estimated using appropriate approaches by adjusting for these covariates. Finally, the overall causal effect is estimated as a weighted average of the graph-specific estimates. The performance of this approach is studied in a simulation study whose data-generating mechanism is inspired by the Study of Osteoporotic Fractures (SOF), with different scenarios varying the relationships between the variables. The simulation study shows a good overall performance of our method compared with traditional Bayesian model averaging. The application of this approach is illustrated using data from the SOF, where the objective is to estimate the effect of physical activity on the risk of hip fractures.
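A minimal sketch of the final averaging step, with made-up log marginal likelihoods and graph-specific effect estimates standing in for the quantities the method would compute:

```python
import numpy as np

# Hypothetical inputs: for each candidate DAG, the log marginal
# likelihood of the data and the causal-effect estimate obtained after
# adjusting for that DAG's sufficient adjustment set (made-up numbers).
log_marg_lik = np.array([-1052.3, -1049.8, -1060.1])
effect_hat = np.array([0.42, 0.35, 0.51])
prior = np.full(3, 1.0 / 3.0)  # uniform prior over the proposed graphs

# Posterior probability of each graph (log-sum-exp for stability).
log_post = np.log(prior) + log_marg_lik
log_post -= log_post.max()
post = np.exp(log_post)
post /= post.sum()

# Model-averaged causal effect: weighted mean of graph-specific estimates.
print(post, np.sum(post * effect_hat))
```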
34

Bayesian adaptive variable selection in linear models: a generalization of Zellner's informative g-prior

Ndiaye, Djibril 14 May 2022 (has links)
Bayesian inference is about recovering the full conditional posterior distribution of the parameters of a statistical model. This exercise, however, can be challenging to undertake if the model specification is not available a priori, as is typically the case. This thesis proposes a new framework to select the subset of regressors that are the relevant features explaining a target variable in linear regression models. We generalize Zellner's g-prior with a random matrix, and we present a likelihood-based search algorithm that uses Bayesian tools to compute the posterior distribution of the model parameters over all candidate models, with model selection based on the maximum a posteriori (MAP). We use Markov chain Monte Carlo (MCMC) methods to sample the model parameters from their underlying distributions. We then use these simulations to derive a posterior distribution for the model parameters, introducing a new parameter that allows us to control how the selection of variables is done. Using simulated datasets, we show that our algorithm chooses the correct variables more often and has higher predictive power than other widely used variable selection methods such as the adaptive Lasso and the Bayesian adaptive Lasso, as well as well-known machine learning algorithms. Taken together, this framework and its promising performance under various model environments highlight that simulation tools and Bayesian inference methods can be efficiently combined to deal with well-known problems that have long loomed over the variable selection literature, particularly in high dimensions.
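For illustration only, a sketch of exhaustive model search under the classical fixed-g Zellner prior, using the standard closed-form Bayes factor against the intercept-only null; the thesis itself generalizes g to a random matrix and searches via MCMC rather than enumeration:

```python
import numpy as np
from itertools import combinations

def log_bf_gprior(y, X, subset, g):
    """Log Bayes factor of the model using `subset` of predictors
    against the intercept-only null, under a fixed-g Zellner prior."""
    n, p = len(y), len(subset)
    Xs = np.column_stack([np.ones(n), X[:, subset]])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return (0.5 * (n - p - 1) * np.log1p(g)
            - 0.5 * (n - 1) * np.log1p(g * (1.0 - r2)))

rng = np.random.default_rng(0)
n, k = 100, 5
X = rng.normal(size=(n, k))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=n)  # true subset {0, 2}

g = float(n)  # unit-information choice of g
models = [list(s) for r in range(1, k + 1) for s in combinations(range(k), r)]
best = max(models, key=lambda s: log_bf_gprior(y, X, s, g))
print(best)  # MAP model under a uniform prior over models
```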
35

Hat Bayes eine Chance? / Does Bayes Stand a Chance?

Sontag, Ralph 10 May 2004 (has links) (PDF)
Workshop "Netz- und Service-Infrastrukturen" Hat Bayes eine Chance? Seit einigen Monaten oder Jahren werden verstärkt Bayes-Filter eingesetzt, um die Nutz-E-Mail ("`Ham"') vom unerwünschten "`Spam"' zu trennen. Diese stoßen jedoch leicht an ihre Grenzen. In einem zweiten Abschnitt wird ein Filtertest der Zeitschrift c't genauer analysiert.
36

Bayes kontra Spam / Bayes versus Spam

Sontag, Ralph 02 July 2003 (has links)
Workshop Mensch-Computer-Vernetzung. Current spam detection has glaring shortcomings, and the growing volume of spam demands new approaches to get the plague under control. The talk explains how spam detection can be substantially improved with the help of Bayes' theorem.
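A minimal sketch of the Bayes' theorem computation such a filter rests on, with hypothetical token probabilities standing in for estimates from labelled training mail:

```python
import math

# Hypothetical per-token probabilities estimated from labelled mail.
p_tok_spam = {"viagra": 0.30, "free": 0.20, "meeting": 0.01}
p_tok_ham = {"viagra": 0.001, "free": 0.05, "meeting": 0.10}
p_spam = 0.4  # prior probability that a message is spam

def spam_probability(tokens):
    """P(spam | tokens) via Bayes' theorem with a naive independence
    assumption, computed in log space for numerical stability."""
    log_s = math.log(p_spam)
    log_h = math.log(1.0 - p_spam)
    for t in tokens:
        if t in p_tok_spam:
            log_s += math.log(p_tok_spam[t])
            log_h += math.log(p_tok_ham[t])
    return 1.0 / (1.0 + math.exp(log_h - log_s))

print(spam_probability(["free", "viagra"]))  # close to 1 (spam)
print(spam_probability(["meeting"]))         # close to 0 (ham)
```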
37

A Bayesian Test of Independence for Two-way Contingency Tables Under Cluster Sampling

Bhatta, Dilli 19 April 2013 (has links)
We consider a Bayesian approach to the study of independence in a two-way contingency table obtained from a two-stage cluster sampling design. We study the association between two categorical variables when (a) there are no covariates and (b) there are covariates at both unit and cluster levels. Our main idea for the Bayesian test of independence is to convert the cluster sample into an equivalent simple random sample which provides a surrogate of the original sample. This surrogate sample is then used to compute the Bayes factor to make an inference about independence. For the test of independence without covariates, the Rao-Scott corrections to the standard chi-squared (or likelihood ratio) statistic were developed. They are "large sample" methods and provide appropriate inference when there are large cell counts, but they are less successful when there are small cell counts. We have developed a methodology to overcome the limitations of the Rao-Scott correction. We use a hierarchical Bayesian model to convert the observed cluster samples to simple random samples; this provides the surrogate samples from which the distribution of the Bayes factor can be derived to make an inference about independence, and we use a sampling-based method to fit the model. For the test of independence with covariates, we first convert the cluster sample with covariates to a cluster sample without covariates. We use a multinomial logistic regression model with random effects to accommodate the cluster effects. Our idea is to fit the cluster samples to the random-effect models and predict new samples by adjusting with the covariates; this provides the cluster sample without covariates. We then use a hierarchical Bayesian model to convert this cluster sample to a simple random sample, which allows us to calculate the Bayes factor to make an inference about independence. We use Markov chain Monte Carlo methods to fit our models. We apply our first method to the Third International Mathematics and Science Study (1995) for third grade U.S. students, in which we study the association between mathematics test scores and the communities the students come from, and between science test scores and those communities. We also provide a simulation study which establishes our methodology as a viable alternative to the Rao-Scott approximations for relatively small two-stage cluster samples. We apply our second method to data from the Trends in International Mathematics and Science Study (2007) for fourth grade U.S. students to assess the association between the mathematics and science scores represented as categorical variables, and again provide a simulation study. The results show that if there is a strong association between the two categorical variables, there is no difference in the significance of the test between the model (a) with covariates and (b) without covariates. However, in the simulation studies, there is a noticeable difference in the significance of the test between the two models in borderline cases (i.e., situations of marginal significance).
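As a simplified illustration (simple random sampling only, not the cluster-sampling surrogate approach developed here), a Bayes factor for independence in a two-way table can be computed from Dirichlet-multinomial marginal likelihoods, because under independence the likelihood factorises into row-margin and column-margin terms:

```python
import numpy as np
from scipy.special import gammaln

def log_dirmult(counts, alpha=1.0):
    """Log marginal likelihood kernel of multinomial counts under a
    symmetric Dirichlet(alpha) prior (multinomial coefficients cancel
    in the Bayes factor, so they are omitted)."""
    counts = np.asarray(counts, dtype=float)
    k = counts.size
    return (gammaln(k * alpha) - k * gammaln(alpha)
            + gammaln(counts + alpha).sum()
            - gammaln(counts.sum() + k * alpha))

table = np.array([[30, 10],
                  [20, 40]])  # hypothetical two-way counts

# Saturated model: one multinomial over all cells of the table.
log_m1 = log_dirmult(table.ravel())

# Independence model: p_ij = r_i * c_j, so the marginal likelihood
# factorises into a row-margin term and a column-margin term.
log_m0 = log_dirmult(table.sum(axis=1)) + log_dirmult(table.sum(axis=0))

print(log_m1 - log_m0)  # log Bayes factor; > 0 favours association
```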
38

Programové nástroje pro analýzu diskrétních problémů teorie rozhodování / Software tools for analysis of discrete problems of Decision theory

Chlum, Ladislav January 2007 (has links)
The first chapter surveys software for solving discrete decision problems. The next two chapters describe the theoretical foundations in detail; these then serve as a guide for designing the individual algorithms and the data model. The penultimate chapter is a detailed program analysis that can serve as a starting point for development in any environment. It gives basic information on which methods should be implemented, what input data each method requires, and a clear description of the algorithms. The final chapter first introduces the enterprise information system Microsoft Business Solutions – Axapta, which serves as the development environment for implementing the new module "Diskrétní úlohy" (Discrete problems). The new module was designed to be completely independent of the functionality of the rest of the system. It contains twelve different models for multi-criteria evaluation of alternatives (conjunctive, disjunctive, permutation, lexicographic, ORESTE, WSA, TOPSIS, AGREPREF, ELECTRE I, the PROMETHEE class of methods, and MAPPAC) and Bayesian analysis for solving single-criterion decision problems under risk.
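A generic sketch of Bayesian analysis for a single-criterion decision problem under risk (illustrative payoffs and probabilities, not the module's actual implementation): the Bayes decision rule picks the action with the highest posterior expected payoff.

```python
import numpy as np

# Payoff matrix: rows = actions, columns = states of the world.
payoff = np.array([[100.0, -20.0],   # risky action
                   [ 40.0,  30.0]])  # safe action
prior = np.array([0.5, 0.5])

# Bayesian update after observing an imperfect signal:
# likelihood[i] = P(signal observed | state i).
likelihood = np.array([0.8, 0.3])
posterior = likelihood * prior
posterior /= posterior.sum()

# Bayes decision rule: choose the action maximising expected payoff
# under the posterior state probabilities.
expected = payoff @ posterior
print(posterior, expected.argmax())
```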
39

Improving Multi-class Text Classification with Naive Bayes

Rennie, Jason D. M. 01 September 2001 (has links)
There are numerous text documents available in electronic form, and more become available every day. Such documents represent a massive amount of information that is easily accessible. Seeking value in this huge collection requires organization; much of the work of organizing documents can be automated through text classification. The accuracy and our understanding of such systems greatly influence their usefulness. In this paper, we seek 1) to advance the understanding of commonly used text classification techniques, and 2) through that understanding, to improve the tools that are available for text classification. We begin by clarifying the assumptions made in the derivation of Naive Bayes, noting basic properties and proposing ways for its extension and improvement. Next, we investigate the quality of Naive Bayes parameter estimates and their impact on classification. Our analysis leads to a theorem which gives an explanation for the improvements that can be found in multiclass classification with Naive Bayes using Error-Correcting Output Codes. We use experimental evidence on two commonly used data sets to exhibit an application of the theorem. Finally, we show fundamental flaws in a commonly used feature selection algorithm and develop a statistics-based framework for text feature selection. Greater understanding of Naive Bayes and the properties of text allows us to make better use of it in text classification.
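A hedged sketch of Error-Correcting Output Codes around a Naive Bayes base learner, using scikit-learn on a toy corpus (the paper's own experiments use standard benchmark data sets):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OutputCodeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["the team won the game", "stocks fell sharply today",
        "a new galaxy was discovered", "the coach praised the players",
        "markets rallied on earnings", "the telescope images were released"]
labels = ["sport", "finance", "science", "sport", "finance", "science"]

# ECOC assigns each class a binary codeword and trains one Naive Bayes
# classifier per bit; prediction picks the class with the nearest codeword.
model = make_pipeline(
    CountVectorizer(),
    OutputCodeClassifier(MultinomialNB(), code_size=2, random_state=0),
)
model.fit(docs, labels)
print(model.predict(["the players won a big game"]))  # expected: 'sport'
```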
40

Acoustic Feature Transformation Combining Average and Maximum Classification Error Minimization Criteria

TAKEDA, Kazuya, KITAOKA, Norihide, SAKAI, Makoto 01 July 2010 (has links)
No description available.
