441 |
Inference on quantile regression for mixed models with applications to GeneChip data / Wang, Huixia, January 2006
Thesis (Ph.D.)--University of Illinois at Urbana-Champaign, 2006. / Source: Dissertation Abstracts International, Volume: 67-11, Section: B, page: 6488. Adviser: Xuming He. Includes bibliographical references (leaves 111-113). Available on microfilm from ProQuest Information and Learning.
|
442 |
Marginal mixture analysis of correlated bounded-response data / Yang, Yan, January 2006
Thesis (Ph.D.)--University of Illinois at Urbana-Champaign, 2006. / Source: Dissertation Abstracts International, Volume: 67-11, Section: B, page: 6489. Adviser: Douglas G. Simpson. Includes bibliographical references (leaves 90-92). Available on microfilm from ProQuest Information and Learning.
|
443 |
Partially Bayesian variable selection in classification trees / Noe, Douglas Alan, January 2006
Thesis (Ph.D.)--University of Illinois at Urbana-Champaign, 2006. / Source: Dissertation Abstracts International, Volume: 67-11, Section: B, page: 6486. Adviser: Xuming He. Includes bibliographical references (leaves 104-105). Available on microfilm from ProQuest Information and Learning.
|
444 |
Relaxations of Differential Privacy and Risk/Utility Evaluations of Synthetic Data and Fidelity Measures / McClure, David R., January 2015
Many organizations collect data that would be useful to public researchers but cannot be shared because of promises of confidentiality to those who participated in the study. This thesis evaluates the risks and utility of several existing release methods and develops new ones with different risk/utility tradeoffs.
In Chapter 2, I present a new risk metric, called model-specific probabilistic differential privacy (MPDP), which is a relaxed version of differential privacy that allows the risk of a release to be based on the worst case among plausible datasets instead of all possible datasets. In addition, I develop a generic algorithm called local sensitivity random sampling (LSRS) that, under certain assumptions, is guaranteed to give releases that meet MPDP for any query with computable local sensitivity. I demonstrate, using several well-known queries, that LSRS releases have much higher utility than the standard differentially private release mechanism, the Laplace Mechanism, at only marginally higher risk.
In Chapter 3, using two synthesis models, I empirically characterize the risks of releasing synthetic data under the standard “all but one” assumption on intruder background knowledge, as well as the effect that decreasing the number of observations the intruder knows beforehand has on that risk. I find in these examples that even in the “all but one” case, there is no risk except to extreme outliers, and even then the risk is mild. I find that the effect that removing observations from an intruder’s background knowledge has on risk depends heavily on how well that intruder can fill in those missing observations: the risk remains fairly constant if he/she can fill them in well, and the risk drops quickly if he/she cannot.
In Chapter 4, I characterize the risk/utility tradeoffs for an augmentation of synthetic data called fidelity measures (see Section 1.2.3). Fidelity measures were proposed in Reiter et al. (2009) to quantify the degree to which the results of an analysis performed on a released synthetic dataset match the results of the same analysis performed on the confidential data. I compare the risk/utility of two different fidelity measures, the confidence interval overlap (Karr et al., 2006) and a new fidelity measure I call the mean predicted probability difference (MPPD). Simultaneously, I compare the risk/utility tradeoffs of two different private release mechanisms, LSRS and a heuristic release method called “safety zones”. I find that the confidence interval overlap can be applied to a wider variety of analyses and is more specific than MPPD, but MPPD is more robust to the influence of individual observations in the confidential data, which means it can be released with less noise than the confidence interval overlap at the same level of risk. I also find that while safety zones are much simpler to compute and generally have good utility (whereas the utility of LSRS depends on the value of ε), they are also much more vulnerable to context-specific attacks that, while not easy for an intruder to implement, are difficult to anticipate. / Dissertation
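For readers unfamiliar with the baseline that the abstract compares against, the following is a minimal sketch of the standard Laplace mechanism applied to a counting query. It is not the thesis's LSRS algorithm or its MPDP metric; the function names `laplace_mechanism` and `private_count` and the example data are hypothetical and chosen only for illustration.

```python
import random


def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Return true_value plus Laplace(0, sensitivity / epsilon) noise."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    scale = sensitivity / epsilon
    # The difference of two independent Exponential(rate = 1/scale) draws
    # follows a Laplace distribution with location 0 and the given scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_value + noise


def private_count(records, predicate, epsilon, rng):
    """Release a noisy count of the records that satisfy predicate.

    Adding or removing one record changes a count by at most 1,
    so the global sensitivity used here is 1.
    """
    true_count = sum(1 for record in records if predicate(record))
    return laplace_mechanism(true_count, sensitivity=1.0, epsilon=epsilon, rng=rng)


if __name__ == "__main__":
    rng = random.Random(0)
    ages = [23, 35, 41, 58, 62, 70]  # illustrative data only
    print(private_count(ages, lambda age: age >= 40, epsilon=1.0, rng=rng))
```

Smaller values of ε give a larger noise scale and therefore lower utility, which is the kind of tradeoff the abstract describes.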
|
445 |
Association of the CHRNA3-CHRNA5-CHRNB4 gene region with body mass index and blood pressure using different variable selection methods (CHRNA3-CHRNA5-CHRNB4 -geenialueen assosiaatio painoindeksiin ja verenpaineeseen eri muuttujanvalintamenetelmillä) / Liuski, H. (Heli), 23 May 2013
The aim of this work is to determine whether the CHRNA3-CHRNA5-CHRNB4 gene region is associated with the responses under study, body mass index and blood pressure. This gene region has previously been found to be associated with smoking, and for this reason the study analyses the effect of the gene region separately in non-smokers and smokers. Eighteen SNPs suspected of affecting the responses were selected from the gene region. The data come from the Northern Finland Birth Cohort 1966, for which 10 principal components were constructed to correct for population structure. The study also assesses whether these principal components are needed.
Possible associations between the gene region and the responses are analysed using different variable selection methods. The study focuses on forward selection and the best subset algorithm. Forward selection is run with different starting models and sets of variables to uncover associations. The best subset algorithm selects from one to eight variables, with four different subsets for each number of variables.
The results show a clear association between certain SNPs and the responses. The SNPs rs6495309, rs1996371 and rs4887077 are found to affect body mass index in smokers, whereas the SNP rs1948 is found to affect systolic blood pressure. The effect of the SNPs on the responses appears to be stronger in smokers than in non-smokers.
The results as such cannot be generalised, but they are indicative for future studies. Studies in a larger population are needed to confirm the validity of the results. The loci affecting body mass index and blood pressure may vary between populations because of genetic background.
|
446 |
The EM and MCEM algorithms as tools in maximum likelihood estimation (EM- ja MCEM-algoritmi apuvälineenä suurimman uskottavuuden estimoinnissa) / Kuismin, M. (Markku), 05 December 2013
This thesis studies the Expectation-Maximization algorithm (EM algorithm), which is based on the method of maximum likelihood (ML). The main emphasis is on a theoretical examination of the algorithm's properties; real research problems and empirical data are not considered.
First, the algorithm is examined mathematically, in the manner of the ML method. This theoretical part is based mainly on McLachlan and Krishnan's book The EM Algorithm and Extensions (1997). The algorithm is used to study a mixture of two normal distributions and the associated parameters. This example is based mainly on the article by Louis (1982).
In addition to the EM algorithm, the Monte Carlo EM algorithm (MCEM algorithm) is studied. The algorithm is applied to the analysis of the parameters of a simple generalized linear mixed model. The data are binary data simulated following the example in McCulloch's article. The main sources for this part are McCulloch's article Maximum likelihood algorithms for generalized linear mixed models (1997) and Robert and Casella's book Introducing Monte Carlo Methods with R (2010).
Finally, the estimates computed with the constructed MCEM algorithm are compared with another Markov chain Monte Carlo method. For this purpose the simulated data are also analysed in a Bayesian manner, applying Gibbs sampling to simulate the posteriors of the parameters. The main source is the lecture notes Johdatus bayesiläiseen tilastotieteeseen (2013) written by Läärä.
The EM algorithm gave very good ML estimates for the mixture of normal distributions. The algorithm is sensitive to the choice of starting values, and with starting values far from the ML estimates it must be run for a long time to maximise the likelihood. For the MCEM algorithm, choosing too large a Monte Carlo sample size mainly slows the algorithm unreasonably and does nothing to help it stabilise. The MCEM algorithm did not yield estimates that maximised the likelihood function. Computing bootstrap estimates from the data gave better results, with a likelihood value higher than that of the MCEM estimates.
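As an illustration of the kind of computation this abstract describes, here is a minimal sketch of the EM algorithm for a mixture of two normal distributions. It is a generic textbook version and is not taken from the thesis; the function name `em_two_normals`, the starting values and the simulated data are assumptions made for the example, and the code has no safeguards against a component's standard deviation collapsing to zero.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of the normal distribution N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def em_two_normals(data, init, n_iter=100):
    """Fit w*N(m1, s1^2) + (1-w)*N(m2, s2^2) by EM.

    init is the tuple (w, m1, s1, m2, s2) of starting values.
    Returns the final parameters and the log-likelihood evaluated
    at the parameters used in each E step.
    """
    w, m1, s1, m2, s2 = init
    n = len(data)
    loglik_history = []
    for _ in range(n_iter):
        # E step: posterior probability that each point came from component 1.
        resp = []
        loglik = 0.0
        for x in data:
            a = w * normal_pdf(x, m1, s1)
            b = (1.0 - w) * normal_pdf(x, m2, s2)
            resp.append(a / (a + b))
            loglik += math.log(a + b)
        loglik_history.append(loglik)
        # M step: weighted ML updates of the weight, means and standard deviations.
        n1 = sum(resp)
        n2 = n - n1
        m1 = sum(r * x for r, x in zip(resp, data)) / n1
        m2 = sum((1.0 - r) * x for r, x in zip(resp, data)) / n2
        s1 = math.sqrt(sum(r * (x - m1) ** 2 for r, x in zip(resp, data)) / n1)
        s2 = math.sqrt(sum((1.0 - r) * (x - m2) ** 2 for r, x in zip(resp, data)) / n2)
        w = n1 / n
    return (w, m1, s1, m2, s2), loglik_history


if __name__ == "__main__":
    rng = random.Random(1)
    # Simulated data: 40% from N(0, 1) and 60% from N(5, 1) (illustrative only).
    sample = [rng.gauss(0.0, 1.0) if rng.random() < 0.4 else rng.gauss(5.0, 1.0)
              for _ in range(2000)]
    params, history = em_two_normals(sample, (0.5, -1.0, 1.0, 6.0, 1.0))
    print(params, history[-1])
```

Running the sketch from starting values far from the ML estimates typically needs more iterations, in line with the sensitivity to starting values noted in the abstract.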
|
447 |
High dimensional land cover inference using remotely sensed MODIS data / Glanz, Hunter S., 12 March 2016
Image segmentation persists as a major statistical problem, with the volume
and complexity of data expanding alongside new technologies. Land cover
classification, one of the most studied problems in Remote Sensing, provides an
important example of image segmentation whose needs transcend the choice of
a particular classification method. That is, the challenges associated with
land cover classification pervade the analysis process from data
pre-processing to estimation of a final land cover map. Many of the same
challenges also plague the task of land cover change detection.
Multispectral, multitemporal data with inherent spatial relationships have
hardly received adequate treatment due to the large size of the data and
the presence of missing values.
In this work we propose a novel, concerted application of methods which
provide a unified way to estimate model parameters, impute missing data,
reduce dimensionality, classify land cover, and detect land cover changes.
This comprehensive analysis adopts a Bayesian approach which incorporates
prior knowledge to improve the interpretability, efficiency, and versatility
of land cover classification and change detection. We explore a parsimonious,
parametric model that allows for a natural application of principal components
analysis to isolate important spectral characteristics while preserving
temporal information. Moreover, it allows us to impute missing data and
estimate parameters via expectation-maximization (EM). A significant byproduct
of our framework includes a suite of training data assessment tools. To
classify land cover, we employ a spanning tree approximation to a lattice
Potts prior to incorporate spatial relationships in a judicious way and more
efficiently access the posterior distribution of pixel labels. We then achieve
exact inference of the labels via the centroid estimator. To detect land
cover changes, we develop a new EM algorithm based on the same parametric model.
We perform simulation studies to validate our models and methods, and
conduct an extensive continental scale case study using MODIS data. The results
show that we successfully classify land cover and recover the spatial patterns
present in large scale data. Application of our change point method
to an area in the Amazon successfully identifies the progression of
deforestation through portions of the region.
|
448 |
Hierarchical Bayesian models for genome-wide association studies / Johnston, Ian, 08 April 2016
I consider a well-known problem in the field of statistical genetics called a genome-wide association study (GWAS), where the goal is to identify a set of genetic markers that are associated with a disease. A typical GWAS data set contains, for thousands of unrelated individuals, a set of hundreds of thousands of markers, a set of other covariates such as age, gender, smoking status and other risk factors, and a response variable that indicates the presence or absence of a particular disease. Due to biological phenomena such as the recombination of DNA and linkage disequilibrium, parents are more likely to pass parts of DNA that lie close to each other on a chromosome together to their offspring; this non-random association between adjacent markers leads to strong correlation between markers in GWAS data sets. As a statistician, I reduce the complex problem of GWAS to its essentials, i.e. variable selection on a large-p-small-n data set that exhibits multicollinearity, and develop solutions that complement and advance the current state-of-the-art methods. Before outlining and explaining my contributions to the field in detail, I present a literature review that summarizes the history of GWAS and the relevant tools and techniques that researchers have developed over the years for this problem.
|
449 |
Two sequential tests against cyclic trend / Roberts, Helen Murray, January 1960
Let the chance variables $x_1, x_2, \ldots, x_n$ have the joint cumulative distribution function $F(x_1, x_2, \ldots, x_n)$, and assume that $F$ is continuous. Let ^n be the class of all continuous cumulative distribution functions. Let $W_n$ be the class of all continuous cumulative distribution functions of the form $F(x_1, x_2, \ldots, x_n) = F(x_1)F(x_2)\cdots F(x_n)$. The hypothesis of randomness states that $F(x_1, x_2, \ldots, x_n)$, assumed to belong to ^n, actually belongs to $W_n$.
In this dissertation two sequential tests of randomness proposed by Noether are studied. In the first sequential test the alternative to randomness is characterized by a stochastic relation of the type $x_i = x_{i-1} + U_i$; in the second sequential test the alternative is characterized by an irregular cyclical trend.
The first test is based on the statistic $T_m$, which is equal to the number of rank positions $x_{m+1}$ may take, given the ranks of $(x_1, x_2, \ldots, x_m)$, so as to convert $(z_1, z_2, \ldots, z_{m-1})$ into $(z_1, z_2, \ldots, z_m)$, where $z_i = \operatorname{sign}(x_{i+1} - x_i)$. It is shown that under the null hypothesis $T_m$ is an unbiased estimate of a corresponding population parameter, and that it is a biased estimate of that parameter under the alternative hypothesis.
The properties of $T_m$ under the null hypothesis are then examined, and it is shown that $T_m$ and $T_{m+k}$ ($k > 2$) are independent. It follows from this property, by the Hoeffding and Robbins theorem, that $\sum \log T_i$ is asymptotically normal.
It is shown, both under the null and the alternative hypotheses, that this test terminates with probability one. By sampling some idea is gained about the number of observations needed for the test to terminate under both hypotheses, and also about the effect of this modified test on the probabilities of Type I and Type II errors.
The second sequential test is based on runs up-and-down. This test is described and then a modification of this test is studied. Runs of three different lengths are considered and the corresponding parameter determined. By sampling some idea is obtained about the number of observations needed for the modified test to terminate, and the effect of this test on the probabilities of Type I and Type II errors. The first sequential test, the sequential run test and the modified sequential run test are compared.
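To make the notion of runs up-and-down concrete, the sketch below computes the signs of successive differences, sign(x_{i+1} - x_i), and groups them into runs with their lengths. It illustrates only the raw quantities on which such a test is built, not Noether's sequential decision rule or the modification studied in the dissertation; the function names and the example sequence are hypothetical.

```python
def signs_of_differences(xs):
    """Return the signs (+1 or -1) of successive differences x[i+1] - x[i].

    Ties are rejected because the observations are assumed to come from
    a continuous distribution, where ties have probability zero.
    """
    signs = []
    for previous, current in zip(xs, xs[1:]):
        if current == previous:
            raise ValueError("tied observations are not allowed")
        signs.append(1 if current > previous else -1)
    return signs


def runs_up_and_down(xs):
    """Return a list of (sign, length) pairs, one per run up (+1) or down (-1)."""
    runs = []
    for sign in signs_of_differences(xs):
        if runs and runs[-1][0] == sign:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([sign, 1])  # start a new run
    return [(sign, length) for sign, length in runs]


if __name__ == "__main__":
    observations = [3, 5, 8, 6, 4, 7, 9, 10, 2]  # illustrative data only
    print(runs_up_and_down(observations))
    # prints [(1, 2), (-1, 2), (1, 3), (-1, 1)]
```

A series with a cyclical trend tends to produce longer runs than a random series, which is why run lengths are a natural basis for a test against that alternative.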
|
450 |
A Correlated Random Effects Model for Nonignorable Missing Data in Value-Added Assessment of Teacher Effects / January 2012
Value-added models (VAMs) are used by many states to assess contributions of individual teachers and schools to students' academic growth. The generalized persistence VAM, one of the most flexible in the literature, estimates the “value added” by individual teachers to their students' current and future test scores by employing a mixed model with a longitudinal database of test scores. There is concern, however, that missing values that are common in the longitudinal student scores can bias value-added assessments, especially when the models serve as a basis for personnel decisions (such as promoting or dismissing teachers), as they are being used in some states. Certain types of missing data require that the VAM be modeled jointly with the missingness process in order to obtain unbiased parameter estimates. This dissertation studies two problems. First, the flexibility and multimembership random effects structure of the generalized persistence model lead to computational challenges that have limited the model's availability. To this point, no methods have been developed for scalable maximum likelihood estimation of the model. An EM algorithm to compute maximum likelihood estimates efficiently is developed, making use of the sparse structure of the random effects and error covariance matrices. The algorithm is implemented in the package GPvam in R statistical software. Illustrations of the gains in computational efficiency achieved by the estimation procedure are given. Furthermore, to address the presence of potentially nonignorable missing data, a flexible correlated random effects model is developed that extends the generalized persistence model to jointly model the test scores and the missingness process, allowing the process to depend on both students and teachers. The joint model gives the ability to test the sensitivity of the VAM to the presence of nonignorable missing data.
Estimation of the model is challenging due to the non-hierarchical dependence structure and the resulting intractable high-dimensional integrals. Maximum likelihood estimation of the model is performed using an EM algorithm with fully exponential Laplace approximations for the E step. The methods are illustrated with data from university calculus classes and with data from standardized test scores from an urban school district. / Dissertation/Thesis / Ph.D. Mathematics 2012
|