1 |
Statistiques de scan : théorie et application à l'épidémiologie / Scan statistics: theory and application to epidemiology
Genin, Mickaël 03 December 2013
The notion of a cluster refers to the aggregation of events in time and/or space. In many fields, experts observe aggregations of events, and the question arises whether these aggregations can be considered normal (the result of chance) or not. From a probabilistic point of view, normality can be described by a null hypothesis of random distribution of events. The detection of event clusters is an area of statistics that has expanded considerably over recent decades. The scientific community first developed methods for the one-dimensional setting (e.g. time) and later extended them to the multidimensional case, notably the two-dimensional one (space). Among the methods for detecting clusters of events, three main types of tests can be distinguished. The first consists of global tests, which detect an overall tendency toward aggregation without locating any clusters. The second consists of focused tests, which are used when a priori knowledge makes it possible to define a point source (a date or spatial location) and to test for aggregation around it. The third consists of cluster detection tests (tests without a predefined point source), which locate clusters of events without a priori knowledge and test their statistical significance. In this thesis, we focus on the last category, and more particularly on methods based on scan statistics. These methods emerged in the early 1960s and make it possible to detect clusters of events and to determine whether they are "normal" (the result of chance) or "abnormal".
The detection step is performed by sweeping a window, called the scanning window, across the study domain (discrete or continuous; e.g. time or space) in which the events are observed. This step yields a set of windows, each defining a potential cluster. A scan statistic is a random variable defined as the maximum number of events observed in any of these windows. Scan statistics are used as test statistics to check that the observations are independent and follow a given distribution, against an alternative hypothesis favoring the existence of a cluster within the studied region. The main difficulty lies in determining the distribution of the scan statistic under the null hypothesis. Because it is defined as the maximum of a sequence of dependent random variables, the dependence being due to the overlap of the scanning windows, explicit solutions exist only in very rare cases. Part of the literature therefore focuses on methods (exact formulas and, above all, approximations) for determining the distribution of scan statistics. Moreover, in the two-dimensional setting, the scanning window can take various geometric shapes (rectangular, circular, ...) that could influence the approximation of the distribution of the scan statistic. To our knowledge, however, no study has evaluated this influence. In the spatial setting, the spatial scan statistics developed by M. Kulldorff are by far the most widely used methods for spatial cluster detection. Their principle is to scan the study area with circular windows and to select as the most likely cluster the one that maximizes a likelihood ratio test statistic. Statistical inference for this statistic is carried out through Monte Carlo simulations. However, for very large databases and/or when high accuracy is required for the p-value associated with the detected cluster, Monte Carlo simulations are extremely time-consuming.
First, we evaluated the influence of the scanning window shape on the distribution of two-dimensional discrete scan statistics. A simulation study performed with square, rectangular and discrete circular scanning windows showed that the distributions of the associated scan statistics are very close to one another yet significantly different. A simulation study also showed that the power of the scan statistics depends on the shape of the scanning window and on that of the cluster present under the alternative hypothesis. [...]
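As a rough illustration of the ideas in this abstract, the sketch below computes a one-dimensional discrete scan statistic (the maximum event count over all windows of fixed length) and estimates its p-value by Monte Carlo simulation. It is a minimal sketch, not the thesis's method: the function names, the uniform-placement null model and the example counts are all assumptions made for illustration.

```python
import random


def scan_statistic(counts, window):
    """Maximum number of events in any run of `window` consecutive cells."""
    current = sum(counts[:window])
    best = current
    for i in range(window, len(counts)):
        # Slide the window one cell to the right.
        current += counts[i] - counts[i - window]
        best = max(best, current)
    return best


def monte_carlo_p_value(counts, window, n_sim=999, seed=0):
    """Estimate P(S >= observed) under an assumed null model in which
    every event falls in a cell chosen uniformly at random."""
    rng = random.Random(seed)
    n_cells, n_events = len(counts), sum(counts)
    observed = scan_statistic(counts, window)
    exceed = 0
    for _ in range(n_sim):
        sim = [0] * n_cells
        for _ in range(n_events):
            sim[rng.randrange(n_cells)] += 1
        if scan_statistic(sim, window) >= observed:
            exceed += 1
    # The +1 terms count the observed data set as one of the replicates.
    return observed, (exceed + 1) / (n_sim + 1)


if __name__ == "__main__":
    # Hypothetical daily event counts with a burst around days 5 to 7.
    counts = [0, 1, 0, 0, 5, 4, 6, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
    print(monte_carlo_p_value(counts, window=3))
```

The cost grows linearly with the number of replicates, which is the practical limitation the abstract raises for large data sets or when a very precise p-value is needed.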
|
2 |
Στατιστικές συναρτήσεις σάρωσης και αξιοπιστία συστημάτων / Scan statistics and systems' reliability
Πήττα, Θεοδώρα 22 December 2009
The aim of this dissertation is to connect the scan statistic S_{n,m}, which represents the maximum number of successes contained in a moving window of length m that scans n consecutive Bernoulli trials, with the reliability of a consecutive k-within-m-out-of-n failure system (k-within-m-out-of-n:F system).
First, we evaluate the probability mass function and the cumulative distribution function of the random variable S_{n,m}. We do so by relating S_{n,m} to the random variable T_k^{(m)}, which denotes the waiting time until k successes are contained for the first time in a moving window of length m (a scan of type k/m) over a sequence of Bernoulli trials, with 1 denoting a success and 0 a failure. The probability mass function and the cumulative distribution function of T_k^{(m)} are evaluated by two methods: (i) the Markov chain embedding method and (ii) recursive schemes. Through T_k^{(m)} we then obtain the probability mass function and the cumulative distribution function of S_{n,m} [Glaz and Balakrishnan (1999), Balakrishnan and Koutras (2002)].
Next, we evaluate the reliability, R, of the consecutive k-within-m-out-of-n failure system (Griffith, 1986). Such a system fails if and only if there exist m consecutive components that include at least k failed ones (1≤k≤m≤n). Exact formulae for the reliability are presented for k=2 as well as for m=n,n-1,n-2,n-3 (Sfakianakis, Kounias and Hillaris, 1992). A recursive algorithm for evaluating the reliability is also given (Malinowski and Preuss, 1994). Using a dual relation between the cumulative distribution function of T_k^{(m)}, and therefore of S_{n,m}, and the reliability R, we connect the reliability of this system with the scan statistic S_{n,m}.
Finally, we briefly present some other applications of scan statistics in molecular biology [Karlin and Ghandour (1985), Glaz and Naus (1991), etc.], in quality control [Roberts, 1958] and in other fields.
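The dual relation mentioned above can be made concrete for small systems. If a failed component is treated as a "success" of the Bernoulli trial, the system survives exactly when no window of m consecutive components contains k or more failures, so R = P(S_{n,m} ≤ k-1). The sketch below is an illustration only, not the recursions or formulae cited in the dissertation: it tracks the most recent m-1 component states as the state of a small Markov chain and checks the result against brute-force enumeration. The function names and the assumption of independent components with a common failure probability q are introduced here for the example.

```python
from itertools import product


def reliability(n, m, k, q):
    """P(no m consecutive components contain >= k failures), assuming
    independent components that each fail with probability q."""
    # State: tuple of the most recent (at most m - 1) outcomes, 1 = failed.
    states = {(): 1.0}
    for _ in range(n):
        nxt = {}
        for hist, prob in states.items():
            for x, px in ((1, q), (0, 1.0 - q)):
                window = hist + (x,)
                if sum(window) >= k:
                    continue  # system has failed; drop this path
                new_hist = window[-(m - 1):] if m > 1 else ()
                nxt[new_hist] = nxt.get(new_hist, 0.0) + prob * px
        states = nxt
    return sum(states.values())


def reliability_brute_force(n, m, k, q):
    """Same quantity by enumerating all 2**n outcomes (small n only)."""
    total = 0.0
    for outcome in product((0, 1), repeat=n):
        # Scan statistic S_{n,m}: max failures in any window of length m.
        s = max(sum(outcome[i:i + m]) for i in range(n - m + 1))
        if s < k:
            f = sum(outcome)
            total += q ** f * (1.0 - q) ** (n - f)
    return total


if __name__ == "__main__":
    print(reliability(8, 3, 2, 0.1), reliability_brute_force(8, 3, 2, 0.1))
```

The state space grows as 2^(m-1), so this sketch is only practical for short windows; it is meant to show the relation R = P(S_{n,m} < k), not to replace the cited algorithms.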
|
3 |
Estudo do microcrédito na cidade de Goiânia: o espaço é relevante? / A study of microcredit in the city of Goiânia: is space relevant?
Oliveira, Felipe Resende 12 March 2014
CNPQ / This study examines whether the local environment influences the loans granted by the Banco do Povo de Goiânia. It also seeks to determine whether the environment influences the clustering of defaulting borrowers. The data were provided by the Banco do Povo de Goiânia. The study is grounded in models of information diffusion. Spatial clusters are detected with the scan statistics model, in which the probability distributions associated with the data under spatial randomness are the Poisson and Bernoulli distributions. The results indicate the existence of clusters among the entrepreneurs. When clients who are 30 or more days in default are analyzed, the method indicates that they are randomly distributed across the municipality of Goiânia.
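For the Bernoulli model mentioned in this abstract, candidate windows in a spatial scan are usually ranked by a log-likelihood ratio. The sketch below shows the standard Bernoulli-model form of that ratio for a single window. It is a hedged illustration: the abstract does not say how the data were coded, so treating defaulting clients as "cases" and all clients as the population is an assumption, and the function and parameter names are hypothetical.

```python
import math


def _xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def bernoulli_llr(c, n, C, N):
    """Log-likelihood ratio for a window holding c cases among n
    individuals, when the whole study area holds C cases among N.
    Returns 0 unless the case rate inside exceeds the rate outside."""
    if n == 0 or n == N:
        return 0.0
    inside = c / n
    outside = (C - c) / (N - n)
    if inside <= outside:
        return 0.0
    alternative = (
        _xlogy(c, inside)
        + _xlogy(n - c, 1.0 - inside)
        + _xlogy(C - c, outside)
        + _xlogy(N - n - (C - c), 1.0 - outside)
    )
    null = _xlogy(C, C / N) + _xlogy(N - C, 1.0 - C / N)
    return alternative - null


if __name__ == "__main__":
    # Hypothetical window: 5 defaulters among 10 clients,
    # against 10 defaulters among 100 clients overall.
    print(bernoulli_llr(5, 10, 10, 100))
```

The most likely cluster is the window with the largest ratio, and its significance is normally assessed by recomputing that maximum on randomly relabelled data.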
|
4 |
Separating Features from Noise with Persistence and Statistics
Wang, Bei January 2010
In this thesis, we explore techniques in statistics and persistent homology that detect features in data sets such as graphs, triangulations and point clouds. We accompany our theorems with algorithms and experiments to demonstrate their effectiveness in practice.
We start with the derivation of graph scan statistics, a measure useful for assessing the statistical significance of a subgraph in terms of edge density. We cluster graphs into densely connected subgraphs based on this measure. We give algorithms for finding such clusterings and experiment on real-world data.
We next study statistics on persistence for piecewise-linear functions defined on the triangulations of topological spaces. We derive persistence pairing probabilities among vertices in the triangulation. We also provide upper bounds for the expected total persistence.
We continue by examining the elevation function defined on the triangulation of a surface. Its local maxima, obtained by persistence pairing, are useful in describing features of the triangulations of protein surfaces. We describe an algorithm to compute these local maxima, with a run time ten thousand times faster in practice than the previous method. We connect this improvement with the total Gaussian curvature of the surfaces.
Finally, we study a stratification learning problem: given a point cloud sampled from a stratified space, which points belong to the same stratum at a given scale level? We assess the local structure of a point in relation to its neighbors using kernel and cokernel persistent homology. We prove the effectiveness of this assessment through several inference theorems, under the assumption of a dense sample. The topological inference theorem relates the sample density to the homological feature size. The probabilistic inference theorem provides sample estimates to assess the local structure with confidence. We describe an algorithm that computes the kernel and cokernel persistence diagrams and prove its correctness. We further experiment on simple synthetic data. / Dissertation
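Persistence pairing is easiest to see in the simplest possible case. The sketch below computes 0-dimensional sublevel-set persistence pairs for a piecewise-linear function on a path (a one-dimensional triangulation), using a union-find structure and the elder rule. This is a deliberately reduced illustration and not the thesis's algorithm, which works on surface triangulations and the elevation function; the function name and the example values are assumptions.

```python
def persistence_pairs(values):
    """0-dimensional sublevel-set persistence of a piecewise-linear
    function on a path with vertex values `values`.
    Returns (birth_vertex, death_vertex) index pairs; the component
    born at the global minimum never dies and is left unpaired."""
    n = len(values)
    order = sorted(range(n), key=lambda j: (values[j], j))
    rank = {v: r for r, v in enumerate(order)}
    parent = [None] * n  # None means "vertex not yet added"
    lowest = {}          # component root -> index of its minimum vertex

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    pairs = []
    for i in order:  # add vertices in increasing function value
        parent[i] = i
        lowest[i] = i
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] is not None:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # Elder rule: the component with the higher minimum dies.
                if rank[lowest[ri]] < rank[lowest[rj]]:
                    old, young = ri, rj
                else:
                    old, young = rj, ri
                if lowest[young] != i:  # skip zero-persistence pairs
                    pairs.append((lowest[young], i))
                parent[young] = old
    return pairs


if __name__ == "__main__":
    print(persistence_pairs([3, 1, 4, 0, 2, 5]))
```

Each pair matches a local minimum with the local maximum at which its component merges into an older one, and the difference of the two function values is the persistence of that feature.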
|
5 |
Some new anomaly detection methods with applications to financial data
Zhao, Zhicong 06 August 2021
Novel clustering methods are presented and applied to financial data. First, a scan-statistics method for detecting price point clusters in financial transaction data is considered. The method is applied to Electronic Benefit Transfer (EBT) transaction data of the Supplemental Nutrition Assistance Program (SNAP). For a given vendor, a distribution is fitted to the transaction amounts by maximum likelihood estimation, and the amounts are then mapped to the unit interval via a natural copula transformation. Next, a new Markov-type relation for order statistics on the unit interval is developed. The relation is used to characterize the distribution of the minimum exceedance of all copula-transformed transaction amounts above an observed order statistic. Conditional on the observed order statistics, independent and asymptotically identical indicator functions are constructed, and the success probability is specified as a function of the gaps between consecutive order statistics. The success probabilities are shown to be a function of the hazard rate of the transformed transaction distribution. If gaps are smaller than expected, the corresponding indicator functions are more likely to equal one. A scan statistic is then applied to the sequence of indicator functions to detect locations where too many gaps are smaller than expected. These sets of gaps are then flagged as anomalous price point clusters. It is noted that prominent price point clusters appearing in the data may be a historical vestige of previous versions of SNAP that used paper "food stamps", which are now outdated. The second part of the project develops a novel clustering method in which the time series of daily total EBT transaction amounts are clustered by periodicity. The scheme works by normalizing the time series of daily total transaction amounts for two distinct vendors and taking the daily differences between the two series. The difference series is then examined for periodicity via a novel F statistic.
We find that the monthly periodicities of vendors can be clustered by type of store using the F statistic as a proxy for a distance metric. This may indicate that the spending preferences of SNAP benefit recipients vary by day of the month. However, it raises further questions about potential forcing mechanisms and the apparent changes in appetite for spending.
|
6 |
ANTIMICROBIAL RESISTANCE OF HUMAN CAMPYLOBACTER JEJUNI INFECTIONS FROM SASKATCHEWAN
Otto, Simon James Garfield 29 April 2011
Saskatchewan is the only province in Canada to have routinely tested the antimicrobial susceptibility of all provincially reported human cases of campylobacteriosis. From 1999 to 2006, 1378 human Campylobacter species infections were tested for susceptibility at the Saskatchewan Disease Control Laboratory using the Canadian Integrated Program for Antimicrobial Resistance Surveillance panel and minimum inhibitory concentration (MIC) breakpoints. Of these, 1200 were C. jejuni and 129 were C. coli, with the remainder made up of C. lari, C. laridis, C. upsaliensis and undifferentiated Campylobacter species. Campylobacter coli had significantly higher prevalences of ciprofloxacin resistance (CIPr), erythromycin resistance (ERYr), combined CIPr-ERYr resistance and multidrug resistance (to three or more drug classes) than C. jejuni. Logistic regression models indicated that CIPr in C. jejuni decreased from 1999 to 2004 and then increased in 2005 and 2006. The risk of CIPr was significantly higher in the winter months (January to March) than in other seasons. A comparison of logistic regression and Cox proportional hazards survival models found that the latter were better able to detect significant temporal trends in CIPr and tetracycline resistance by directly modeling MICs, but that these trends were more difficult to interpret. Scan statistics detected significant spatial clusters of CIPr C. jejuni infections in urban centers (Saskatoon and Regina) and temporal clusters in the winter months; the space-time permutation model did not detect any space-time clusters. Bernoulli scan tests were computationally the fastest for cluster detection, compared with ordinal MIC and multinomial antibiogram models. eBURST analysis of antibiogram patterns showed a marked distinction between case and non-case isolates from the scan statistic clusters.
Multilevel logistic regression models detected significant individual and regional contextual risk factors for infection with CIPr C. jejuni. Patients who were infected in the winter, who were 40 to 45 years of age, who lived in urban regions and who lived in regions of moderately high poultry density had higher risks of a resistant infection. These results advance the epidemiologic knowledge of CIPr C. jejuni in Saskatchewan and provide novel analytical methods for antimicrobial resistance surveillance data in Canada. / Saskatchewan Disease Control Laboratory (Saskatchewan Ministry of Health); Laboratory for Foodborne Zoonoses (Public Health Agency of Canada); Centre for Foodborne, Environmental and Zoonotic Infectious Diseases (Public Health Agency of Canada); Ontario Veterinary College Blake Graham Fellowship
|