Global ETD Search

1	Methods from Statistical Computing for Genetic Analysis of Complex Traits Mahjani, Behrang January 2016 The goal of this thesis is to explore, improve and implement some advanced modern computational methods in statistics, focusing on applications in genetics. The thesis has three major directions. First, we study likelihoods for genetic analysis of experimental populations. Here, maximizing the likelihood can be viewed as a computational global optimization problem. We introduce a faster optimization algorithm called PruneDIRECT, and explain how it can be parallelized for permutation testing using the Map-Reduce framework. We have implemented PruneDIRECT as an open source R package, and also as Software as a Service for cloud infrastructures (QTLaaS). The second part of the thesis focuses on using sparse matrix methods for solving linear mixed models with large correlation matrices. For populations with known pedigrees, we show that the inverse of the covariance matrix is sparse. We describe how to use this sparsity to develop a new method to maximize the likelihood and calculate the variance components. In the final part of the thesis we study computational challenges of psychiatric genetics, using only pedigree information. The aim is to investigate the existence of maternal effects in obsessive compulsive behavior. We add the maternal effects to the linear mixed model used in the second part of this thesis, and we describe the computational challenges of working with binary traits. / eSSENCE Statistical Computing QTL mapping Global Optimization Linear Mixed Models
2	Are There Too Many R Packages? Hornik, Kurt January 2012 The number of R extension packages available from the CRAN repository has grown tremendously over the past 10 years. We look at this phenomenon in more detail, and discuss some of its consequences. In particular, we argue that the statistical computing community needs a more common understanding of software quality, and better domain-specific semantic resources.
3	Inference procedures based on order statistics Frey, Jesse C. 01 August 2005 No description available. Statistics Order statistics Ranked-set sampling Statistical computing Confidence bands
4	A diagnostic function to examine candidate distributions to model univariate data Richards, John January 1900 Master of Science / Department of Statistics / Suzanne Dubnicka / To help with identifying distributions to effectively model univariate continuous data, the R function diagnostic is proposed. The function will aid in determining reasonable candidate distributions that the data may have come from. It uses a combination of the Pearson goodness of fit statistic, Anderson-Darling statistic, Lin’s concordance correlation between the theoretical quantiles and observed quantiles, and the maximum difference between the theoretical quantiles and the observed quantiles. The function generates reasonable candidate distributions, QQ plots, and histograms with superimposed density curves. When a simulation study was done, the function worked adequately; however, it was also found that many of the distributions look very similar if the parameters are chosen carefully. The function was then used to attempt to decipher which distribution could be used to model weekly grocery expenditures of a family household. statistics R statistical computing goodness of fit probability distributions diagnostic Statistics (0463)
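Two of the quantities named in the abstract above, Lin's concordance correlation between theoretical and observed quantiles and the maximum quantile difference, can be illustrated with a minimal sketch. This is not the thesis's R function diagnostic: it is written in Python, handles only a normal candidate fitted by the sample mean and standard deviation, and uses plotting positions (i + 0.5)/n. All of those choices, and both function names, are assumptions made for illustration.

```python
from statistics import NormalDist, fmean


def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient between two equal-length samples."""
    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # CCC = 2*cov / (var_x + var_y + (mean_x - mean_y)^2)
    return 2.0 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)


def normal_qq_diagnostics(data):
    """Compare observed quantiles with those of a fitted normal distribution.

    Hypothetical helper: returns (concordance, max absolute quantile difference).
    """
    observed = sorted(data)
    n = len(observed)
    fitted = NormalDist.from_samples(observed)  # mean and sample stdev of the data
    # Assumed plotting positions (i + 0.5)/n, which stay strictly inside (0, 1).
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    ccc = concordance_correlation(theoretical, observed)
    max_diff = max(abs(t - o) for t, o in zip(theoretical, observed))
    return ccc, max_diff
```

A concordance close to 1 together with a small maximum difference suggests the candidate distribution is plausible; a full tool along the lines the abstract describes would repeat the comparison over several candidate families.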
5	Transparent and Efficient I/O for Statistical Computing Zhang, Yi January 2012 Statistical analysis of massive array data is becoming indispensable in answering important scientific and business questions. Most analysis tasks consist of multiple steps, each making one or multiple passes over the arrays to be analyzed and generating intermediate results. In the big data setting, storage and I/O efficiency is a key to efficient analytics. Because of the distinct characteristics of disk-resident arrays and the operations performed on them, we need a computing environment that is easy to use, scalable to big data, and different from traditional, CPU- and memory-centric solutions. R is a popular computing environment for statistical/numerical data analysis. Like many such environments, R performs poorly for large datasets. This dissertation presents RIOT (R with I/O Transparency), a framework to make R programs I/O-efficient in a way transparent to users. RIOT-DB, an implementation of RIOT using a relational database system as its backend, significantly outperforms R in many big-data scenarios. RIOT users are insulated from the data management backend and I/O optimization specifics. Because of this transparency, RIOT is easy to adopt by the majority of the R users. While RIOT-DB demonstrates the feasibility of transparent I/O efficiency and the potential of database-style inter-operator optimizations, it also reveals significant deficiencies of database systems in handling statistical computation. To improve the efficiency of array storage, RIOT uses a novel storage structure called Linearized-Array B-tree, or LAB-tree. LAB-tree supports flexible array layouts and automatically adapts to varying sparsity across parts of an array and over time. It also implements splitting strategies and update batching policies with good theoretical guarantees and/or practical performance. While LAB-tree removes many I/O inefficiencies that arise in accessing individual arrays, programs consisting of multiple operators need further optimization. To this end, RIOT incorporates an I/O optimization framework, RIOTShare, which is able to jointly optimize I/O sharing and array layouts for a broad range of analysis tasks expressible in nested-loop forms. RIOTShare explores the middle ground between the high-level, database-style operator-based query optimization and low-level, compiler-style loop-based code optimization. In sum, combining a transparent language binding mechanism, an efficient and flexible storage engine, and an accurate I/O sharing and array layout optimizer, RIOT provides a systematic solution for data-intensive array-based statistical computing. / Dissertation Computer science Databases Input/Output Polyhedral optimization R Scientific computing Statistical computing
6	What Drives Package Authors to Participate in the R Project for Statistical Computing? Exploring Motivation, Values, and Work Design Mair, Patrick, Hofmann, Eva, Gruber, Kathrin, Hatzinger, Reinhold, Zeileis, Achim, Hornik, Kurt January 2015 One of the cornerstones of the R system for statistical computing is the multitude of packages contributed by numerous package authors. This makes an extremely broad range of statistical techniques and other quantitative methods freely available. So far no empirical study has investigated psychological factors that drive authors to participate in the R project. This article presents a study of R package authors, collecting data on different types of participation (number of packages, participation in mailing lists, participation in conferences), three psychological scales (types of motivation, psychological values, and work design characteristics), as well as various sociodemographic factors. The data are analyzed using item response models and subsequent generalized linear models, showing that the most important determinants for participation are a hybrid form of motivation and the social characteristics of the work design. Other factors are found to have less impact or influence only specific aspects of participation. (authors' abstract)
7	Motives for Participation in Open-Source Software Projects: A Survey among R Package Authors Mair, Patrick, Hofmann, Eva, Gruber, Kathrin, Hatzinger, Reinhold, Zeileis, Achim, Hornik, Kurt 04 1900 One of the cornerstones of the R system for statistical computing is the multitude of contributed packages making an extremely broad range of statistical techniques and other quantitative methods freely available. This study investigates which factors are the crucial determinants of package authors' participation in the R project. For this purpose a survey was conducted among R package authors, collecting data on different types of participation in the R project, three psychometric scales (hybrid forms of motivation, work design characteristics, and values), as well as various sociodemographic factors. These data are analyzed using item response theory and generalized linear models, showing that the most important determinants for participation are a hybrid form of motivation and the knowledge characteristics of the work design. Other factors are found to have less impact or influence only specific aspects of participation. (authors' abstract) / Series: Research Report Series / Department of Statistics and Mathematics RVK ST 250, QR 770, SR 870
8	Análise, imputação de dados e interfaces computacionais em estudos de séries temporais epidemiológicas / Analysis, data imputation and computer interfaces in time-series epidemiologic studies Washington Leite Junger 01 April 2008 Air pollution is a public health problem in major urban areas, and its effects are frequently observed in morbidity and mortality due to respiratory and cardiovascular causes, lung cancer, decreased respiratory function, school absenteeism, and pregnancy outcomes. Studies also suggest that children and the elderly are the most susceptible groups. This thesis presents studies on the effects of air pollution on health in the city of Rio de Janeiro and tackles some methodological issues in data analysis and missing-data imputation in epidemiologic time series. Daily time series were used to estimate the effect of air pollution on deaths among the elderly due to lung cancer during 2000 and 2001. The purpose of the study was to evaluate whether air pollution is associated with premature deaths among people who already belong to an at-risk population. Another study was conducted to assess the relationship between air pollution and low birth weight of singleton full-term babies. A cross-sectional design was used on data available for the year 2002. Moderate effects of air pollution were estimated in both studies. Methodological aspects of epidemiologic studies on air pollution are also addressed. A data imputation method is presented and implemented as a library for the statistical package R. The imputation methodology is evaluated through simulation and compared to other methods often used for data imputation in time series of air pollutant concentrations. The proposed method outperformed those traditionally used. A brief review of the methodology used in time-series studies on the effects of air pollution on health is also presented. The topics covered in the review are also implemented as a library for the analysis of epidemiologic time series in R. The use of the library is exemplified with an analysis of data on hospital admissions of children due to respiratory causes in the city of Rio de Janeiro. The methodological studies were carried out under the umbrella of the multi-city study to assess the effects of air pollution on health in Latin America, the ESCALA Project. Environmental epidemiology Air pollution Time series Statistical modeling Statistical computing Epidemiology
10	Iterated Grid Search Algorithm on Unimodal Criteria Kim, Jinhyo 02 June 1997 The unimodality of a function seems a simple concept. But in the Euclidean space R^m, m=3,4,..., it is not easy to define. We have an easy tool to find the minimum point of a unimodal function. The goal of this project is to formalize and support distinctive strategies that typically guarantee convergence. Support is given both by analytic arguments and by a simulation study. Application is envisioned in low-dimensional but non-trivial problems. The convergence of the proposed iterated grid search algorithm is presented along with the results of particular application studies. It has been recognized that derivative methods, such as Newton-type methods, are not entirely satisfactory, so a variety of other tools are being considered as alternatives. Many other tools have been rejected because of apparent manipulative difficulties. But in our current research, we focus on the simple algorithm and the guaranteed convergence for unimodal functions to avoid the possible chaotic behavior of the function. Furthermore, in case the loss function to be optimized is not unimodal, we suggest a weaker condition: almost (noisy) unimodality, under which the iterated grid search finds an estimated optimum point. / Ph. D. statistical computing nonlinear estimation statistical optimization statistical simulation Iterated Grid Search grid dichotomous search unimodality quasi-convexity envelope condition number derivative-free LD5655.V856 1997.K56
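The general idea of a derivative-free, iterated grid search can be sketched as follows. This is a generic illustration, not the algorithm or the convergence conditions developed in the dissertation: the function name, the box-shrinking rule (one grid step on either side of the best grid point) and the default parameters are all assumptions, and the shrinking rule is only known to be safe for well-behaved cases such as one-dimensional unimodal or separable convex functions.

```python
import itertools


def iterated_grid_search(f, lower, upper, points_per_axis=5, iterations=30):
    """Minimize f over the box [lower, upper] by repeatedly refining a grid.

    Hypothetical sketch: f takes a sequence of coordinates and returns a float.
    """
    lower, upper = list(lower), list(upper)
    best = None
    for _ in range(iterations):
        # Grid step and grid coordinates along each axis of the current box.
        steps = [(hi - lo) / (points_per_axis - 1) for lo, hi in zip(lower, upper)]
        axes = [[lo + i * h for i in range(points_per_axis)]
                for lo, h in zip(lower, steps)]
        # Evaluate f at every grid point and keep the one with the smallest value.
        best = min(itertools.product(*axes), key=f)
        # Shrink the box to one grid step around the best point, clipped to the
        # current box so the search region never grows.
        new_lower = [max(lo, b - h) for lo, b, h in zip(lower, best, steps)]
        new_upper = [min(hi, b + h) for hi, b, h in zip(upper, best, steps)]
        lower, upper = new_lower, new_upper
    return list(best), f(best)
```

Each pass costs points_per_axis ** m function evaluations in m dimensions, which is consistent with the abstract's emphasis on low-dimensional problems; with five points per axis the box width at least halves on every pass.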
