71. Sparse Principal Component Analysis for High-Dimensional Data: A Comparative Study
Bonner, Ashley J.
Background: Through unprecedented advances in technology, high-dimensional datasets have exploded into many fields of observational research. For example, it is now common to expect thousands or millions of genetic variables (p) measured on only a limited number of study participants (n). Determining the important features is statistically difficult, as multivariate analysis techniques become overwhelmed and break down mathematically when n < p. Principal Component Analysis (PCA) is a commonly used multivariate method for dimension reduction and data visualization but suffers from these issues. A collection of Sparse PCA methods has been proposed to counter these flaws but has not been tested in comparative detail. Methods: The performance of three Sparse PCA methods was evaluated through simulations. Data were generated for 56 different data structures, varying p, the number of underlying groups, and the variance structure within them. Estimation and interpretability of the principal components (PCs) were rigorously tested. The Sparse PCA methods were also applied to a real gene expression dataset. Results: All Sparse PCA methods improved upon classical PCA. Some methods were best at obtaining an accurate leading PC only, whereas others were better for subsequent PCs. The optimal choice of Sparse PCA method changed as within-group correlation and across-group variances varied; fortunately, one method repeatedly worked well under the most difficult scenarios. When the methods were applied to real data, the sparsest methods detected concise groups of gene expression features. Conclusions: Sparse PCA methods provide a new, insightful way to detect important features amidst complex high-dimensional data. / Master of Science (MSc)
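The thesis does not list the specific Sparse PCA implementations compared; the sketch below is only a minimal illustration of the kind of comparison described, using simulated block-structured data with n < p and the SPCA implementation in the elasticnet R package (an assumed choice). All dimensions and tuning values are placeholders.

```r
## Minimal sketch (not the thesis code): compare classical PCA with one
## sparse PCA variant on simulated block-structured data with n < p.
## Package choice (elasticnet::spca) and all tuning values are assumptions.
library(elasticnet)

set.seed(1)
n <- 50; p <- 200                        # n < p, as in the simulations
k <- 4                                   # number of underlying variable groups
groups <- rep(1:k, each = p / k)
Z <- matrix(rnorm(n * k), n, k)          # latent group factors
X <- Z[, groups] * 0.9 + matrix(rnorm(n * p, sd = 0.5), n, p)

## Classical PCA: loadings are dense and hard to interpret
pca <- prcomp(X, scale. = TRUE)

## Sparse PCA (SPCA of Zou, Hastie & Tibshirani via elasticnet);
## 'para' = number of non-zero loadings requested per component (assumed)
spc <- spca(scale(X), K = k, para = rep(p / k, k),
            type = "predictor", sparse = "varnum")

## Compare sparsity of the leading loading vectors
sum(abs(pca$rotation[, 1]) > 1e-8)       # typically p (dense)
sum(abs(spc$loadings[, 1]) > 1e-8)       # at most p/k (sparse)
```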
72. STATISTICAL AND METHODOLOGICAL ISSUES ON COVARIATE ADJUSTMENT IN CLINICAL TRIALS
Chu, Rong
Background and objectives: We investigate three issues related to the adjustment for baseline covariates in late-phase clinical trials: (1) the analysis of correlated outcomes in multicentre RCTs, (2) the assessment of the probability and implications of prognostic imbalance in RCTs, and (3) the adjustment for baseline confounding in cohort studies.

Methods: Project 1: We investigated the properties of six statistical methods for analyzing continuous outcomes in multicentre randomized controlled trials (RCTs) where within-centre clustering was possible. We simulated studies over a range of intraclass correlation (ICC) values and several centre configurations. Project 2: We simulated data from RCTs evaluating a binary outcome, varying the risk of the outcome, the effect of the treatment, the power and prevalence of a binary prognostic factor (PF), and the sample size. We compared logistic regression models with and without adjustment for the PF in terms of bias, standard error, confidence interval coverage, and statistical power. A tool to assess the sample size required to control for chance imbalance was proposed. Project 3: We conducted a prospective cohort study to evaluate the effect of tuberculosis (TB) at the initiation of antiretroviral therapy (ART) on all-cause mortality, using a Cox proportional hazards model on propensity score (PS) matched patients to control for potential confounding. We assessed the robustness of the results using sensitivity analyses.

Results and conclusions: Project 1: All six methods produce unbiased estimates of the treatment effect in multicentre trials. Adjusting for centre as a random intercept leads to the most efficient estimation of the treatment effect and hence should be used in the presence of clustering. Project 2: The probability of prognostic imbalance in small trials can be substantial. Covariate adjustment improves estimation accuracy and statistical power, and hence should be performed when strong PFs are observed. Project 3: After controlling for the important confounding variables, HIV patients who had TB at the initiation of ART had a moderately increased risk of overall mortality. / Doctor of Philosophy (PhD)
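A minimal sketch of the analysis recommended in Project 1 (centre as a random intercept) follows, assuming the lme4 package and invented simulation settings (20 centres, ICC of 0.1, treatment effect of 0.3); it is not the thesis code.

```r
## Sketch of Project 1's recommendation: adjust for centre as a random
## intercept. lme4 and all simulation settings are illustrative assumptions.
library(lme4)

set.seed(2)
centres <- 20; per_centre <- 30
centre  <- rep(seq_len(centres), each = per_centre)
treat   <- rbinom(centres * per_centre, 1, 0.5)
sigma_c <- sqrt(0.1); sigma_e <- sqrt(0.9)          # ICC = 0.1
y <- 0.3 * treat + rnorm(centres, sd = sigma_c)[centre] +
     rnorm(centres * per_centre, sd = sigma_e)
d <- data.frame(y, treat, centre = factor(centre))

## Ignoring clustering vs. adjusting for centre as a random intercept
fit_naive  <- lm(y ~ treat, data = d)
fit_random <- lmer(y ~ treat + (1 | centre), data = d)

summary(fit_random)$coefficients["treat", ]         # treatment effect and SE
```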
73. LIKELIHOOD INFERENCE FOR LEFT TRUNCATED AND RIGHT CENSORED LIFETIME DATA
Mitra, Debanjan
Left truncation arises in many situations because the failure of a unit is observed only if it fails after a certain period. Moreover, the units under study may not be followed until all of them fail, and the experimenter may have to stop at a certain time when some of the units are still working; this introduces right censoring into the data. Some commonly used lifetime distributions are the lognormal, Weibull, and gamma, all of which are special cases of the flexible generalized gamma family. Likelihood inference via the Expectation Maximization (EM) algorithm is used to estimate the model parameters of the lognormal, Weibull, gamma, and generalized gamma distributions based on left truncated and right censored data. The asymptotic variance-covariance matrices of the maximum likelihood estimates (MLEs) are derived using the missing information principle. Using the asymptotic variances and the asymptotic normality of the MLEs, asymptotic confidence intervals for the parameters are constructed. For comparison purposes, the Newton-Raphson (NR) method is also used for parameter estimation, and asymptotic confidence intervals corresponding to the NR method and to a parametric bootstrap are also obtained. The performance of all these methods of inference is studied through Monte Carlo simulations. With regard to prediction analysis, the probability that a right censored unit will still be working at a future year is estimated, and an asymptotic confidence interval for this probability is then derived by the delta method. All the methods of inference developed here are illustrated with numerical examples. / Doctor of Philosophy (PhD)
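The sketch below illustrates the likelihood being maximized for left truncated and right censored Weibull data, using direct numerical optimization rather than the EM or Newton-Raphson schemes developed in the thesis; it is a simplified stand-in, with all simulation settings assumed.

```r
## Sketch: direct maximum likelihood for left-truncated, right-censored
## Weibull data. Each unit contributes f(t)^d * S(t)^(1-d) / S(tau), where
## tau is its truncation (entry) time. Simulation values are assumptions.
set.seed(3)
n      <- 500
shape0 <- 1.5; scale0 <- 10
t_full <- rweibull(n, shape0, scale0)        # latent failure times
tau    <- runif(n, 0, 6)                     # left-truncation (entry) times
keep   <- t_full > tau                       # only units failing after entry are seen
t_obs  <- t_full[keep]; tau <- tau[keep]
cens   <- 12                                 # administrative right censoring
d      <- as.numeric(t_obs <= cens)          # 1 = failure observed
time   <- pmin(t_obs, cens)

negloglik <- function(par) {
  a <- exp(par[1]); b <- exp(par[2])         # shape, scale (log-parameterized)
  ll <- d * dweibull(time, a, b, log = TRUE) +
        (1 - d) * pweibull(time, a, b, lower.tail = FALSE, log.p = TRUE) -
        pweibull(tau, a, b, lower.tail = FALSE, log.p = TRUE)
  -sum(ll)
}
fit <- optim(c(0, 2), negloglik, hessian = TRUE)
exp(fit$par)                                  # MLEs of (shape, scale)
sqrt(diag(solve(fit$hessian)))                # asymptotic SEs on the log scale
```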
74. LIKELIHOOD-BASED INFERENTIAL METHODS FOR SOME FLEXIBLE CURE RATE MODELS
Pal, Suvra
Recently, the Conway-Maxwell Poisson (COM-Poisson) cure rate model has been proposed, which includes as special cases some of the well-known cure rate models discussed in the literature. Data obtained from cancer clinical trials are often right censored, and the expectation maximization (EM) algorithm can be used efficiently to determine the maximum likelihood estimates (MLEs) of the model parameters based on right censored data.

By assuming the lifetime distribution to be exponential, lognormal, Weibull, or gamma, the necessary steps of the EM algorithm are developed for the COM-Poisson cure rate model and some of its special cases. The inferential method is examined by means of an extensive simulation study. Model discrimination within the COM-Poisson family is carried out by likelihood ratio tests as well as by information-based criteria. The proposed method is then illustrated with data on cutaneous melanoma recurrence. As the lifetime distributions considered are not nested, it is not possible to carry out a formal statistical test to determine which of them provides an adequate fit to the data. For this reason, the wider class of generalized gamma distributions, which contains all of the above lifetime distributions as special cases, is considered. The steps of the EM algorithm are then developed for this general class of distributions, and a simulation study is carried out to evaluate the performance of the proposed estimation method. Model discrimination within the generalized gamma family is carried out by likelihood ratio tests and information-based criteria. Finally, for the cutaneous melanoma data, the two-way flexibility of the COM-Poisson family and the generalized gamma family is utilized to carry out a two-way model discrimination, selecting a parsimonious competing-cause distribution along with a lifetime distribution that provides the best fit to the data. / Doctor of Philosophy (PhD)
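As a hedged illustration only, the sketch below implements an EM algorithm for the standard mixture (Bernoulli) cure rate model with exponential lifetimes, one special case within the family of cure rate models discussed above; it is not the general COM-Poisson algorithm developed in the thesis, and all simulation settings are assumed.

```r
## Sketch of an EM algorithm for the standard mixture cure rate model with
## exponential lifetimes -- a special case, not the thesis's general
## COM-Poisson algorithm. Settings (cure fraction 0.3, rate 0.2) are assumed.
set.seed(4)
n     <- 1000
cured <- rbinom(n, 1, 0.3)                       # latent cure indicator
t_lat <- ifelse(cured == 1, Inf, rexp(n, 0.2))   # susceptibles fail, cured never do
cens  <- runif(n, 5, 25)
time  <- pmin(t_lat, cens)
d     <- as.numeric(t_lat <= cens)               # event indicator

pi_s <- 0.5; lam <- 0.1                          # initial values
for (iter in 1:200) {
  ## E-step: posterior probability that a censored unit is susceptible
  S <- exp(-lam * time)
  w <- ifelse(d == 1, 1, pi_s * S / (pi_s * S + (1 - pi_s)))
  ## M-step: closed-form updates for the susceptible fraction and the rate
  pi_s <- mean(w)
  lam  <- sum(d) / sum(w * time)
}
c(cure_fraction = 1 - pi_s, rate = lam)          # compare to true (0.3, 0.2)
```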
75. IDENTIFYING AND OVERCOMING OBSTACLES TO SAMPLE SIZE AND POWER CALCULATIONS IN FMRI STUDIES
Guo, Qing, 25 September 2014
Functional magnetic resonance imaging (fMRI) is a popular technique for studying brain function and neural networks. fMRI studies are often characterized by small sample sizes and rarely consider statistical power when setting a sample size. This can lead to data dredging and hence to false positive findings. With the widespread use of fMRI studies in clinical disorders, the vulnerability of participants points to an ethical imperative for reliable results, so as to uphold the promise typically made to participants that the study results will help understand their conditions. While important, power-based sample size calculations can be challenging. The majority of fMRI studies are observational, i.e., they are not designed to randomize participants to test the efficacy and safety of a therapeutic intervention. My PhD thesis therefore addresses two objectives: first, to identify potential obstacles to implementing sample size calculations, and second, to provide solutions to these obstacles in observational clinical fMRI studies. This thesis contains three projects.

Implementing a power-based sample size calculation requires specification of effect sizes and variances. Typically in health research, these input parameters are estimated from the results of previous studies; however, they often seem to be lacking in the fMRI literature. Project 1 addresses the first objective through a systematic review of 100 fMRI studies with clinical participants, examining how often the observed input parameters were reported in the results section in a way that would help design a new, well-powered study. Results confirmed that both input estimates and sample size calculations were rarely reported. The omission of observed inputs in the results section is an impediment to carrying out sample size calculations for future studies.

Uncertainty in input parameters is typically dealt with using sensitivity analysis; however, this can result in a wide range of candidate sample sizes, making it difficult to settle on one. Project 2 suggests a cost-efficiency approach as a short-term strategy to deal with the uncertainty in input data and, through an example, illustrates how it narrowed the range of candidate sample sizes by choosing the one that maximizes return on investment.

Routine reporting of the input estimates can thus facilitate sample size calculations for future studies. Moreover, increasing the overall quality of reporting in fMRI studies helps reduce bias in reported input estimates and hence helps ensure rigorous sample size calculations in the long run. Project 3 is a systematic review of the overall reporting quality of observational clinical fMRI studies; it highlights under-reported areas for improvement and suggests creating a shortened checklist of essential details, adapted from the guidelines proposed by Poldrack et al. (2008), to accommodate the strict word limits of reports of observational clinical fMRI studies.

In conclusion, this PhD thesis facilitates future sample size and power calculations in the fMRI literature by identifying impediments, by providing a short-term solution to overcome them using a cost-efficiency approach in conjunction with conventional methods, and by suggesting a long-term strategy to ensure rigorous sample size calculations through improving the overall quality of reporting. / Doctor of Philosophy (PhD)
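For readers unfamiliar with the calculation being advocated, a minimal example follows; the effect size and standard deviation are invented placeholders standing in for the input estimates that, as Project 1 shows, are rarely reported.

```r
## Minimal example of the kind of power-based sample size calculation the
## thesis argues fMRI reports should make possible. The effect size and SD
## are invented placeholders; in practice they must come from input
## estimates reported in earlier studies.
power.t.test(delta = 0.5, sd = 1.0, sig.level = 0.05, power = 0.80,
             type = "two.sample")        # roughly 64 participants per group

## A stricter per-test alpha (e.g., corrected for many regions tested)
## raises the required sample size accordingly:
power.t.test(delta = 0.5, sd = 1.0, sig.level = 0.001, power = 0.80)
```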
76. Latent Growth Model Approach to Characterize Maternal Prenatal DNA Methylation Trajectories
Lapato, Dana, 01 January 2019
Background. DNA methylation (DNAm) is a removable chemical modification to the DNA sequence intimately associated with genomic stability, cellular identity, and gene expression. DNAm patterning reflects joint contributions from genetic, environmental, and behavioral factors. As such, differences in DNAm patterns may explain interindividual variability in risk liability for complex traits like major depression (MD). Hundreds of significant DNAm loci have been identified using cross-sectional association studies. This dissertation builds on that foundational work to explore novel statistical approaches for longitudinal DNAm analyses. Methods. Repeated measures of genome-wide DNAm and of social and environmental determinants of health were collected up to six times across pregnancy and the first year postpartum as part of the Pregnancy, Race, Environment, Genes (PREG) Study. Statistical analyses were completed using a combination of the R statistical environment, Bioconductor packages, MplusAutomation, and Mplus software. Prenatal maternal DNAm was measured using the Infinium HumanMethylation450 BeadChip. Latent growth curve models were used to analyze the repeated measures of maternal DNAm and to quantify site-level latent DNAm trajectories over the course of pregnancy. The purpose was to characterize the location and nature of prenatal DNAm changes and to test the influence of clinical and demographic factors on prenatal DNAm remodeling. Results. Over 1,300 sites had DNAm trajectories significantly associated with either maternal age or lifetime MD. Many of the genomic regions overlapping the significant results replicated previous age- and MD-related genetic and DNAm findings. Discussion. Future work should capitalize on the progress made here in integrating structural equation modeling (SEM) with longitudinal omics-level measures.
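The dissertation fitted its growth models in Mplus (via MplusAutomation); the sketch below is only a rough lavaan analogue for a single CpG site with three prenatal measurement occasions, using simulated data and hypothetical variable names.

```r
## Rough lavaan analogue (an assumption -- the dissertation used Mplus) of a
## site-level linear latent growth curve model: three prenatal DNAm
## measurements m1-m3, with maternal age and lifetime MD predicting the
## latent intercept and slope. All variable names and values are hypothetical.
library(lavaan)

set.seed(6)
n   <- 300
age <- rnorm(n, 30, 5); md <- rbinom(n, 1, 0.2)
i0  <- 0.5 + 0.002 * age + 0.02 * md + rnorm(n, sd = 0.03)   # true intercepts
s0  <- 0.01 + 0.005 * md + rnorm(n, sd = 0.01)               # true slopes
dnam_site <- data.frame(
  m1 = i0 + 0 * s0 + rnorm(n, sd = 0.02),
  m2 = i0 + 1 * s0 + rnorm(n, sd = 0.02),
  m3 = i0 + 2 * s0 + rnorm(n, sd = 0.02),
  age = age, md = md)

lgm <- '
  i =~ 1*m1 + 1*m2 + 1*m3        # latent intercept
  s =~ 0*m1 + 1*m2 + 2*m3        # latent linear slope over visits
  i ~ age + md
  s ~ age + md
'
fit <- growth(lgm, data = dnam_site)
summary(fit, fit.measures = TRUE)
```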
77. On the Performance of some Poisson Ridge Regression Estimators
Zaldivar, Cynthia, 28 March 2018
Multiple regression models play an important role in analyzing and making predictions about data. Prediction accuracy decreases when two or more explanatory variables in the model are highly correlated. One solution is to use ridge regression. The purpose of this thesis is to study the performance of available ridge regression estimators for Poisson regression models in the presence of moderately to highly correlated explanatory variables. As performance criteria, we use the mean square error (MSE), the mean absolute percentage error (MAPE), and the percentage of times the maximum likelihood (ML) estimator produces a higher MSE than the ridge regression estimator. A Monte Carlo simulation study was conducted to compare the performance of the estimators under three experimental conditions: correlation, sample size, and intercept. It is evident from the simulation results that all ridge estimators performed better than the ML estimator. Based on these results, we propose new estimators, which performed very well compared to the original estimators. Finally, the estimators are illustrated using data on recreational habits.
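A sketch of the general idea on simulated, highly correlated predictors follows, using glmnet's L2-penalized Poisson fit as one readily available ridge implementation; this is not necessarily one of the specific estimators evaluated in the thesis, and the simulation settings are assumed.

```r
## Sketch: ridge (L2-penalized) Poisson regression on correlated predictors
## via glmnet, compared with the maximum likelihood fit. This is an
## illustrative implementation, not the thesis's specific estimators.
library(MASS); library(glmnet)

set.seed(7)
n <- 50; p <- 4; rho <- 0.95
Sigma <- matrix(rho, p, p); diag(Sigma) <- 1       # highly correlated X
X     <- mvrnorm(n, rep(0, p), Sigma)
beta  <- c(0.5, -0.3, 0.4, 0.2)
y     <- rpois(n, exp(1 + X %*% beta))

fit_ml    <- glm(y ~ X, family = poisson)                    # ML estimator
fit_ridge <- cv.glmnet(X, y, family = "poisson", alpha = 0)  # alpha = 0: ridge

coef(fit_ml)
coef(fit_ridge, s = "lambda.min")                  # shrunken coefficients
```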
78. Geographic Factors of Residential Burglaries - A Case Study in Nashville, Tennessee
Hall, Jonathan A., 01 November 2010
This study examines geographic patterns and geographic factors of residential burglary in the Nashville, TN area over a twenty-year period, at five-year intervals starting in 1988. The purpose of this study is to identify which geographic factors have affected residential burglary rates and whether the geographic patterns of residential burglary changed over the study period. Several criminological theories guide this study, the most prominent being Social Disorganization Theory and Routine Activities Theory; both focus on the relationship between place and crime. A number of spatial analysis methods are therefore adopted to analyze residential burglary rates at the block group level for each study year. Spatial autocorrelation approaches, particularly the Global and Local Moran's I statistics, are utilized to detect hotspots of residential burglary. To understand the underlying geographic factors of residential burglary, both ordinary least squares (OLS) and geographically weighted regression (GWR) analyses are conducted to examine the relationships between residential burglary rates and various geographic factors, such as Percentages of Minorities, Singles, Vacant Housing Units, Renter Occupied Housing Units, and Persons below Poverty Line.
The findings indicate that residential burglaries exhibit clustered patterns, forming various hotspots around the study area, especially in the central city, and that these hotspots tended to move in a northeasterly direction over the study period of 1988-2008. Overall, four of the five geographic factors under examination show positive correlations with the residential burglary rate at the block group level. Percentages of Vacant Housing Units and Persons below Poverty Line (both indicators of neighborhood economic well-being) are strong indicators of crime, while Percentages of Minorities (an ethnic heterogeneity indicator) and Renter Occupied Housing Units (a residential turnover indicator) show only modest correlations. Counter-intuitively, Percentage of Singles (another indicator of residential turnover) in fact appears to deter residential burglary; the reason for this is not entirely clear.
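A compact sketch of the two analysis stages on a toy lattice of block groups appears below, assuming the spdep package for the Moran statistics and base R for the OLS model; the thesis does not specify its software, and the simulated covariates and rates are placeholders.

```r
## Sketch of the two analysis stages on simulated block-group data:
## global / local Moran's I for hotspot detection and an OLS model of
## burglary rates on census covariates. Packages and values are assumptions.
library(spdep)

set.seed(8)
nb  <- cell2nb(10, 10)                    # 10 x 10 lattice of "block groups"
lw  <- nb2listw(nb, style = "W")
vac <- runif(100); pov <- runif(100)      # % vacant housing, % below poverty
rate <- 30 * vac + 25 * pov + rnorm(100, sd = 5)   # burglary rate per 1,000

moran.test(rate, lw)                      # global spatial clustering
head(localmoran(rate, lw))                # local Moran's I (hotspot candidates)

ols <- lm(rate ~ vac + pov)
summary(ols)                              # associations with geographic factors
## A GWR fit (e.g., with the spgwr or GWmodel packages) would additionally
## let these coefficients vary over space, as in the thesis.
```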
79. Statistical methodology in the solution of the traveling salesman problem and in the evaluation of algorithms: a study applied to computational transgenetics
Ramos, Iloneide Carlos de Oliveira, 03 March 2005
Combinatorial optimization problems have engaged a large number of researchers in the search for approximate solutions, since it is generally accepted that such problems cannot be solved in polynomial time. Initially, these solutions were based on heuristics; currently, metaheuristics are more often used for this task, especially those based on evolutionary algorithms. The two main contributions of this work are: the creation of a heuristic, called the "Operon", for constructing the information chains needed to implement transgenetic (evolutionary) algorithms, relying mainly on statistical methodology (Cluster Analysis and Principal Component Analysis); and the use of statistical analyses suited to evaluating the performance of the algorithms developed to solve these problems. The aim of the Operon is to construct good-quality, dynamic information chains that promote an "intelligent" search of the solution space. The Traveling Salesman Problem (TSP) is the focus of the applications, which are based on a transgenetic algorithm called ProtoG. A strategy is also proposed for renewing part of the chromosome population, triggered by a minimum threshold on the coefficient of variation of the individuals' fitness function, calculated over the population. Statistical methodology is used to evaluate the performance of four algorithms: the proposed ProtoG, two memetic algorithms, and a Simulated Annealing algorithm. Three performance analyses are proposed. The first uses logistic regression, based on the probability that the algorithm under test finds the optimal solution of a TSP instance. The second uses survival analysis, based on the distribution of the execution time observed until the optimal solution is reached. The third uses a non-parametric analysis of variance of the percent error of the solution (PES), the percentage by which the solution found exceeds the best solution available in the literature. Six experiments were conducted on sixty-one Euclidean TSP instances with up to 1,655 cities. The first two experiments deal with tuning four parameters of the ProtoG algorithm to improve its performance; the last four evaluate the performance of ProtoG against the three other algorithms. For these sixty-one instances, statistical tests provide evidence that ProtoG outperforms the three other algorithms on fifty instances. In addition, for the thirty-six instances considered in the last three experiments, in which performance was evaluated through the PES, the average PES obtained with ProtoG was below 1% for almost half of the instances; the largest average, 3.52%, occurred for an instance with 1,173 cities. ProtoG can therefore be considered a competitive algorithm for solving the TSP, since average PES values above 10% are not uncommonly reported in the literature for instances of this size.
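The sketch below illustrates only the evaluation methodology (logistic regression on whether a run reached the known optimum and a non-parametric comparison of PES), not the transgenetic algorithm itself; the per-run results are simulated purely for illustration.

```r
## Sketch of the evaluation methodology only (not the ProtoG algorithm):
## logistic regression on whether each run found the known optimum, and a
## non-parametric comparison of percent error of the solution (PES).
## The per-run results below are simulated purely for illustration.
set.seed(9)
algos <- c("ProtoG", "memetic1", "memetic2", "SA")
runs  <- data.frame(algorithm = rep(algos, each = 30))        # 30 runs each
p_opt <- c(ProtoG = 0.7, memetic1 = 0.5, memetic2 = 0.45, SA = 0.3)
runs$found_opt <- rbinom(nrow(runs), 1, p_opt[runs$algorithm])
runs$pes <- ifelse(runs$found_opt == 1, 0, rexp(nrow(runs), rate = 0.5))

## Probability of reaching the optimum, by algorithm
summary(glm(found_opt ~ algorithm, family = binomial, data = runs))

## Non-parametric comparison of PES across algorithms
kruskal.test(pes ~ algorithm, data = runs)
## Time-to-optimum comparisons could analogously use survival models
## (e.g., the survival package), as described in the abstract.
```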
80. Niche-Based Modeling of Japanese Stiltgrass (Microstegium vimineum) Using Presence-Only Information
Bush, Nathan, 23 November 2015
The Connecticut River watershed is experiencing a rapid invasion of aggressive non-native plant species, which threaten watershed function and structure. Volunteer-based monitoring programs such as the University of Massachusetts' OutSmart Invasive Species Project, the Early Detection and Distribution Mapping System (EDDMapS), and the Invasive Plant Atlas of New England (IPANE) have gathered valuable invasive plant data. These programs provide a unique opportunity for researchers to model invasive plant species using citizen-sourced data. This study took advantage of these large data sources to model invasive plant distribution, to determine which environmental and biophysical predictors are most influential in dispersal, and to identify a suitable presence-only model for use by conservation biologists and land managers at varying spatial scales. The research focused on an invasive plant species of high interest, Japanese stiltgrass (Microstegium vimineum), which was identified as a threat by U.S. Fish and Wildlife Service refuge biologists and managers but for which no practical and systematic multi-scale approach to detection has yet been developed. Environmental and biophysical variables include factors directly affecting species physiology and locality, such as annual temperatures, growing degree days, soil pH, available water supply, elevation, proximity to hydrology and roads, and NDVI. The spatial scales selected for this study are New England (regional), the Connecticut River watershed (watershed), and the Salmon River Division of the U.S. Fish and Wildlife Service's Silvio O. Conte National Fish and Wildlife Refuge (local). At each spatial scale, three modeling approaches were implemented: a maximum entropy habitat model using the MaxEnt software, an ecological niche factor analysis (ENFA) using the openModeller software, and a generalized linear model (GLM) fitted in the statistical software R. Results suggest that the performance of each modeling algorithm varies among spatial scales. The best-fitting model identified for each scale will be useful for refuge biologists and managers in determining where to allocate resources and which areas are prone to invasion. The regional-scale results tell managers which areas are at broad-scale risk of M. vimineum invasion under current climatic conditions. The watershed-scale results will be practical for protecting areas designated as most critical for ensuring the persistence of rare and endangered species and their habitats. The local-scale, or fine-scale, analysis will be directly useful for on-the-ground conservation efforts: managers and biologists can direct resources to areas where M. vimineum is most likely to occur, effectively improving early detection and rapid response (EDRR).
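Of the three approaches compared, only the GLM lends itself to a compact sketch; below is a presence/background logistic regression on simulated covariates, with all values invented. The MaxEnt and ENFA fits in the thesis used separate software (MaxEnt and openModeller).

```r
## Sketch of the GLM component of the comparison: a presence/background
## logistic regression on simulated environmental covariates. All covariate
## distributions and coefficients are invented for illustration only.
set.seed(10)
n_pres <- 200; n_back <- 2000                     # presences vs. background points
temp   <- c(rnorm(n_pres, 10, 1.5), rnorm(n_back, 8, 3))     # mean annual temperature
dist_r <- c(rexp(n_pres, 1 / 200), rexp(n_back, 1 / 800))    # distance to roads (m)
ndvi   <- c(rbeta(n_pres, 6, 2), rbeta(n_back, 3, 3))        # greenness index
pres   <- c(rep(1, n_pres), rep(0, n_back))

glm_fit <- glm(pres ~ temp + dist_r + ndvi, family = binomial)
summary(glm_fit)

## Predicted probabilities serve as a relative habitat-suitability index
head(predict(glm_fit, type = "response"))
```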