1. Clustering Algorithm for Zero-Inflated Data

January 2020
Zero-inflated data are common in biomedical research. In cluster analysis, heuristic approaches fail to provide inferential properties for the outcome, while existing model-based approaches only work for mixtures of multivariate normal distributions. In this dissertation, I developed two new model-based clustering algorithms: the multivariate zero-inflated log-normal and the multivariate zero-inflated Poisson clustering algorithms. I then applied these methods to questionnaire data and compared the resulting clusters to those derived under a multivariate normal assumption. Associations between the clustering results and clinical outcomes were also investigated.
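As a rough illustration of what model-based clustering of zero-inflated counts involves, the sketch below fits a K-component mixture of zero-inflated Poisson distributions by EM. It is a minimal sketch of the general idea, not the dissertation's algorithm: it assumes conditional independence of dimensions within a cluster, and the function names, initialization, and numerical clamps are all illustrative choices (the zero-inflated log-normal variant is not reproduced here).

```python
# Minimal EM sketch for a K-component mixture of zero-inflated Poisson (ZIP)
# distributions, with dimensions treated as independent within each cluster.
import numpy as np
from scipy.special import gammaln, logsumexp

def zip_logpmf(x, pi, lam):
    """Elementwise log pmf of a zero-inflated Poisson: P(0) = pi + (1-pi)e^-lam."""
    pois = -lam + x * np.log(lam) - gammaln(x + 1.0)        # Poisson log pmf
    zero = np.log(pi + (1.0 - pi) * np.exp(-lam))           # log P(X = 0)
    return np.where(x == 0, zero, np.log(1.0 - pi) + pois)

def zip_mixture_em(X, K, n_iter=200, seed=0):
    """Cluster a nonnegative count matrix X (n x d) into K ZIP components."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.full(K, 1.0 / K)                                  # mixing weights
    pi = rng.uniform(0.1, 0.5, size=(K, d))                  # zero-inflation probs
    lam = X[rng.choice(n, K, replace=False)] + 0.5           # Poisson rates
    for _ in range(n_iter):
        # E-step: responsibility r[i, k] of cluster k for observation i.
        logp = np.stack([zip_logpmf(X, pi[k], lam[k]).sum(axis=1)
                         for k in range(K)], axis=1) + np.log(w)
        r = np.exp(logp - logsumexp(logp, axis=1, keepdims=True))
        # Within each cluster, probability that an observed zero is structural.
        z = pi[:, None, :] / (pi[:, None, :] + (1 - pi[:, None, :]) * np.exp(-lam[:, None, :]))
        z = np.where(X[None] == 0, z, 0.0)                   # shape (K, n, d)
        # M-step: closed-form weighted updates.
        rk = np.maximum(r.sum(axis=0), 1e-12)                # effective cluster sizes
        w = rk / n
        pi = np.clip((r.T[:, :, None] * z).sum(axis=1) / rk[:, None], 1e-6, 1 - 1e-6)
        at_risk = (r.T[:, :, None] * (1 - z)).sum(axis=1)    # non-structural mass
        lam = np.maximum((r.T @ X) / np.maximum(at_risk, 1e-12), 1e-6)
    return r.argmax(axis=1), w, pi, lam
```

Calling zip_mixture_em(X, K=3) on an n-by-d count matrix returns hard cluster labels along with the fitted mixing weights, zero-inflation probabilities, and Poisson rates, from which inferential summaries could then be built.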
2. Statistical analysis of large scale data with perturbation subsampling

Yao, Yujing. January 2022
The past two decades have witnessed rapid growth in the amount of data available to us. Many fields, including physics, biology, and medical studies, generate enormous datasets with a large sample size, a high number of dimensions, or both. For example, some datasets in physics contain millions of records; a Statista survey forecasts that in 2022 there will be over 86 million health app users in the United States, generating massive mHealth data; and more and more large studies, such as the UK Biobank study, have been carried out. This gives us unprecedented access to data and allows us to extract and infer vital information, but it also poses new challenges for statistical methodologies and computational algorithms.

For increasingly large datasets, computation can be a major hurdle to valid analysis. Conventional statistical methods lack the scalability to handle such large sample sizes, and data storage and processing may exceed usual computer capacity. The UK Biobank genotype and phenotype dataset, for instance, contains about 500,000 individuals with more than 800,000 genotyped single nucleotide polymorphism (SNP) measurements per person, a size that may well exceed a computer's physical memory. Further, high dimensionality combined with a large sample size can lead to heavy computational cost and algorithmic instability. The aim of this dissertation is to provide statistical approaches that address these issues.

Chapter 1 reviews the existing literature. In Chapter 2, a novel perturbation subsampling approach based on independent and identically distributed stochastic weights is developed for the analysis of large scale data. The method is justified for optimizing convex criterion functions by establishing asymptotic consistency and normality of the resulting estimators; it provides consistent point and variance estimators simultaneously and is also feasible in a distributed framework. Its finite sample performance is examined through simulation studies and real data analysis.

In Chapter 3, a repeated block perturbation subsampling method is developed for the analysis of large scale longitudinal data using the generalized estimating equation (GEE) approach, a general method for analyzing longitudinal data by fitting marginal models. The proposed method provides consistent point and variance estimators simultaneously, and the asymptotic properties of the resulting subsample estimators are studied. Finite sample performance is evaluated through simulation studies and an mHealth data analysis.

With the development of technology, large scale high dimensional data are also increasingly prevalent, and conventional high dimensional methods such as the adaptive lasso (AL) lack the scalability to process such large sample sizes. Chapter 4 introduces the repeated perturbation subsampling adaptive lasso (RPAL), a new procedure that combines perturbation and subsampling to yield a robust, computationally efficient estimator for variable selection, statistical inference, and finite sample false discovery control in the analysis of big data. RPAL is well suited to modern parallel and distributed computing architectures while retaining generic applicability and statistical efficiency. The theoretical properties of RPAL are studied, simulation studies compare the proposed estimator to the full data estimator and to traditional subsampling estimators, and the method is illustrated with the analysis of omics datasets.
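To give a feel for the flavor of perturbation subsampling, the toy sketch below applies it to ordinary least squares, a convex criterion: it fits weighted least squares on one size-m subsample under repeated draws of i.i.d. Exp(1) weights and rescales the spread of the perturbed estimates to approximate the full-data sampling variance. The Exp(1) weight choice, the m/n rescaling, and the function name are illustrative assumptions, not the dissertation's exact construction.

```python
# Toy perturbation subsampling for OLS: one subsample, B weight perturbations.
import numpy as np

def perturbation_subsample_ols(X, y, m, B=500, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    idx = rng.choice(n, size=m, replace=False)   # draw one subsample of size m
    Xs, ys = X[idx], y[idx]
    betas = np.empty((B, p))
    for b in range(B):
        w = rng.exponential(1.0, size=m)         # i.i.d. mean-1 stochastic weights
        Xw = Xs * w[:, None]
        # Weighted normal equations: (X'WX) beta = X'Wy.
        betas[b] = np.linalg.solve(Xs.T @ Xw, Xw.T @ ys)
    point = betas.mean(axis=0)                   # point estimate from perturbed fits
    # The spread of perturbed estimates tracks Var(subsample OLS) ~ V/m;
    # rescaling by m/n targets the full-data estimator's variance ~ V/n.
    cov = np.cov(betas, rowvar=False) * (m / n)
    return point, cov
```

Because each replicate is independent, the loop parallelizes trivially, which is consistent with the abstract's point that the approach suits a distributed framework.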
3. Data Quality Assessment for the Secondary Use of Person-Generated Wearable Device Data: Assessing Self-Tracking Data for Research Purposes

Cho, Sylvia. January 2021
The Quantified Self movement has led to routine use of consumer wearables, generating large amounts of person-generated wearable device data. This gives researchers an opportunity to conduct studies with large-scale person-generated wearable device data without having to collect data in a costly and time-consuming way. However, wearable device data have known challenges, such as missing or inaccurate data, which raise the need to assess data quality before conducting research. Currently, there is a lack of in-depth understanding of the data quality challenges of using person-generated wearable device data for research purposes and of how data quality assessment should be conducted. Data quality assessment can be especially burdensome for those without domain knowledge of a specific data type, as may be the case for emerging biomedical data sources. The goal of this dissertation is to advance knowledge of the data quality challenges and assessment of person-generated wearable device data and to facilitate data quality assessment for those without domain knowledge of this emerging data type.

The dissertation consists of two aims: (1) identifying the data quality dimensions important for assessing the quality of person-generated wearable device data for research purposes, and (2) designing and evaluating an interactive data quality characterization tool that supports researchers in assessing the fitness-for-use of fitness tracker data. For the first aim, a multi-method approach was taken: a literature review, a survey, and focus group discussion sessions. We found that intrinsic data quality dimensions applicable to electronic health record data, such as conformance, completeness, and plausibility, also apply to person-generated wearable device data. In addition, because our focus was on assessing data quality for research purposes, contextual fitness-for-use dimensions such as breadth and density completeness and temporal data granularity were identified. For the second aim, we followed an iterative design process, from understanding informational needs to designing a prototype and evaluating the usability of the final version of the tool. The tool allows users to customize the definition of data completeness (fitness-for-use measures) and provides a data summary of the cohort that meets that definition. We found that an interactive tool that incorporates fitness-for-use measures and allows customization of data completeness can support fitness-for-use assessment more accurately and in less time than a tool that presents only intrinsic data quality measures.
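To make the customizable completeness idea concrete, the sketch below shows the kind of check such a tool might run over daily fitness tracker summaries: the researcher supplies their own fitness-for-use definition (here, a minimum of wear hours per day and a minimum number of qualifying days per user) and gets back the cohort meeting it plus a short summary. This is a hypothetical illustration, not the dissertation's tool; the column names and thresholds are assumptions.

```python
# Hypothetical fitness-for-use completeness check for daily wearable summaries.
import pandas as pd

def complete_cohort(df, min_wear_hours=10, min_days=60):
    """df: one row per user-day with columns user_id, date, wear_hours, steps."""
    valid = df[df["wear_hours"] >= min_wear_hours]           # plausibility/wear gate
    days_per_user = valid.groupby("user_id")["date"].nunique()
    keep = days_per_user[days_per_user >= min_days].index    # fitness-for-use gate
    cohort = valid[valid["user_id"].isin(keep)]
    summary = {
        "users_meeting_definition": len(keep),
        "median_valid_days": float(days_per_user.loc[keep].median()) if len(keep) else 0.0,
        "mean_daily_steps": float(cohort["steps"].mean()) if len(cohort) else 0.0,
    }
    return cohort, summary
```

Surfacing such a summary interactively, rather than leaving researchers to script it themselves, is what lets users without domain knowledge of the data type judge fitness-for-use quickly.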
