Global ETD Search

1	Model-based data mining methods for identifying patterns in biomedical and health data Hilton, Ross P. 07 January 2016 In this thesis we provide statistical and model-based data mining methods for pattern detection with applications to biomedical and healthcare data sets. In particular, we examine applications in costly acute or chronic disease management. In Chapter II, we consider nuclear magnetic resonance experiments in which we seek to locate and demix smooth, yet highly localized components in a noisy two-dimensional signal. By using wavelet-based methods we are able to separate components from the noisy background, as well as from other neighboring components. In Chapter III, we pilot methods for identifying profiles of patient utilization of the healthcare system from large, highly-sensitive, patient-level data. We combine model-based data mining methods with clustering analysis in order to extract longitudinal utilization profiles. We transform these profiles into simple visual displays that can inform policy decisions and quantify the potential cost savings of interventions that improve adherence to recommended care guidelines. In Chapter IV, we propose new methods integrating survival analysis models and clustering analysis to profile patient-level utilization behaviors while controlling for variations in the population’s demographic and healthcare characteristics and explaining variations in utilization due to different state-based Medicaid programs, as well as access and urbanicity measures. Component identification Healthcare utilization Sequence clustering Latent variable model Medicaid system
2	Modeling social factors of HIV risk in Mexico Valencia, Celina I. January 2017 Background: Human Immunodeficiency Virus (HIV) and Acquired Immunodeficiency Syndrome (AIDS) are an urgent public health issue in Mexico. Mexico has witnessed a 122% increase in reported prevalence of HIV since 2001 (Holtz et al., 2014). Country estimates suggest there are between 140,000 and 230,000 individuals living with HIV in Mexico (CENSIDA, 2014), and approximately 50% of them are unaware that they are living with the virus (CENSIDA, 2014). Despite a federal universal HIV program implemented in 2011, HIV in Mexico has not reached a chronic infectious disease status as seen in other regions of the globe (Deeks, 2013). The mortality rate among individuals with HIV/AIDS in Mexico is 4.2 per 100,000 (CENSIDA, 2014). There is a paucity of findings regarding social and epidemiological data focused on populations outside traditional at-risk populations of HIV in Mexico (Martin-Onraët et al., 2016). Analyzing aggregate country-level data for Mexico provides necessary insights to better understand previously unconsidered social factors that are informing sexual and reproductive health trends influencing HIV health patterns. Methods: Secondary analyses were performed on Mexico's Encuesta Nacional de Salud y Nutrición 2012 (ENSANUT). Mexico’s ENSANUT is a probabilistic aggregate national dataset with a multistage stratified cluster sampling design (Janssen et al., 2013). ENSANUT is Mexico’s equivalent to the National Health and Nutrition Examination Survey (NHANES) in the United States. Data are collected via self-report interviews conducted at the participant's home. A structured questionnaire was administered to individuals 20 years of age and older (≥ 20), from whom sexual and reproductive data were collected. The ENSANUT adult study sub-sample (n=46,227) comprises 42.75% men and 57.25% women. 
A general linear model (GLM), principal component analysis (PCA), chi-square tests (χ²), and logistic regressions were applied to the adult study subsample to disentangle social factors associated with sexually transmitted infections (STIs) in the population. Quantitative analyses were conducted in SAS 9.4. Findings: Men were more likely to have an STI diagnosis (OR=3.60; 95% CI 3.00, 4.32, p<0.001). Previous HIV testing was found to be protective for STI diagnosis across both genders (OR=0.82, 95% CI 0.72, 0.94, p<0.001). Co-infections of HIV/gonorrhea and HIV/syphilis (n=20) were the highest in the study population. The latent variable model indicates mental health and access to health care resources are critical for positive sexual and reproductive health outcomes in Mexico. Mental health was found to be non-protective for STI risk among the study population (OR=1.59, 95% CI 1.41, 1.81, p<0.0001). Policy recommendations: 1. Increased access and utilization of HIV resources and mental health services would benefit the study population. Further qualitative research is needed to better understand the barriers to health care access and utilization in these two domains; 2. Increase in preventative programs and health initiatives that encourage established strategies for positive sexual and reproductive health outcomes. These strategies include: universal human papillomavirus (HPV) vaccines, wide availability of Pre-Exposure Prophylaxis (PrEP), and routine HIV/STI screenings; 3. Alternative data collection strategies for ENSANUT which are culturally appropriate for sexual and reproductive health constructs. HIV latent variable model Mexico population data sexually transmitted diseases social risk
3	Probabilistic Modeling of Multi-relational and Multivariate Discrete Data Wu, Hao 07 February 2017 Modeling and discovering knowledge from multi-relational and multivariate discrete data is a crucial task that arises in many research and application domains, e.g. text mining, intelligence analysis, epidemiology, and social science. In this dissertation, we study and address three problems involving the modeling of multi-relational discrete data and multivariate multi-response count data, viz. (1) discovering surprising patterns from multi-relational data, (2) constructing a generative model for multivariate categorical data, and (3) simultaneously modeling multivariate multi-response count data and estimating covariance structures between multiple responses. To discover surprising multi-relational patterns, we first study the "where do I start?" problem originating from intelligence analysis. By studying nine methods with origins in association analysis, graph metrics, and probabilistic modeling, we identify several classes of algorithmic strategies that can supply starting points to analysts, and thus help to discover interesting multi-relational patterns from datasets. To actually mine for interesting multi-relational patterns, we represent the multi-relational patterns as dense and well-connected chains of biclusters over multiple relations, and model the discrete data by the maximum entropy principle, such that in a statistically well-founded way we can gauge the surprisingness of a discovered bicluster chain with respect to what we already know. We design an algorithm for approximating the most informative multi-relational patterns, and provide strategies to incrementally organize discovered patterns into the background model. We illustrate how our method is adept at discovering the hidden plot in multiple synthetic and real-world intelligence analysis datasets. 
Our approach naturally generalizes traditional attribute-based maximum entropy models for single relations, and further supports iterative, human-in-the-loop knowledge discovery. To build a generative model for multivariate categorical data, we apply the maximum entropy principle to propose a categorical maximum entropy model such that in a statistically well-founded way we can optimally use given prior information about the data, and are unbiased otherwise. Generally, inferring the maximum entropy model could be infeasible in practice. Here, we leverage the structure of the categorical data space to design an efficient model inference algorithm to estimate the categorical maximum entropy model, and we demonstrate how the proposed model is adept at estimating underlying data distributions. We evaluate this approach against both simulated data and US census datasets, and demonstrate its feasibility using an epidemic simulation application. Modeling data with multivariate count responses is a challenging problem due to the discrete nature of the responses. Existing methods for univariate count responses cannot be easily extended to the multivariate case since the dependency among multiple responses needs to be properly accounted for. To model multivariate data with multiple count responses, we propose a novel multivariate Poisson log-normal model (MVPLN). By simultaneously estimating the regression coefficients and inverse covariance matrix over the latent variables with an efficient Monte Carlo EM algorithm, the proposed model takes advantage of associations among multiple count responses to improve prediction accuracy. Simulation studies and applications to real-world data are conducted to systematically evaluate the performance of the proposed method in comparison with conventional methods. / Ph. D. / In this decade of big data, massive data of various types are generated every day from different research areas and industry sectors. 
Among all these types of data, text data, i.e. text documents, are important to many research and real-world applications. One challenge faced when analyzing massive text data is deciding which documents to investigate first to initialize the analysis, and how to identify stories and plots, if any, that hide inside the massive text documents. For example, in intelligence analysis, when analyzing intelligence documents, some common questions that analysts ask are ‘How is a suspect connected to the passenger manifest on this flight?’ and ‘How do distributed terrorist cells interface with each other?’. This crucial task is known as storytelling. In the first half of this dissertation, we study this problem and design mathematical models and computer algorithms to automatically identify useful information from text data to help analysts discover hidden stories and plots in massive text documents. We also incorporate visual analytics techniques and design a visualization system to support human-in-the-loop exploratory data analysis so that analysts can interact with the algorithms and models iteratively to investigate given datasets. In the second half of this dissertation, we study two problems that arise from the domain of public health. When an epidemic of a certain disease occurs, e.g. during flu season, public health officials need to make policies in advance to prevent or alleviate the epidemic. A data-driven approach would be to make such public health policies using simulation results and predictions based on historical data. One problem usually faced in epidemic simulation is that researchers would like to run simulations with real-world data so that the simulation results can be close to real-world scenarios, but at the same time protect the private information of individuals. To solve this problem, we design and implement a mathematical model that can generate a realistic synthetic population from the U.S. Census Survey to help conduct the epidemic simulation.
Using flu as an example, we also propose a mathematical model to study associations between different types of flu with the information collected from social media, like Twitter. We believe that identifying such associations between different types of flu will help officials to make appropriate public health policies. Multivariate Discrete Data Multi-relational Data Maximum Entropy Modeling Subjective Interestingness Latent Variable Model Multivariate Poisson Regression Covariance Estimation.
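As an illustration of the multivariate Poisson log-normal idea named in this abstract, one common way to write such a model is sketched below. The abstract does not give the exact specification, so every symbol here is an assumption made for illustration: $\mathbf{x}_i$ is the covariate vector for observation $i$, $B$ a matrix of regression coefficients, $\mathbf{z}_i$ a latent Gaussian vector with covariance $\Sigma$, and $y_{ij}$ the $j$-th of $K$ count responses.

```latex
\begin{aligned}
\mathbf{z}_i \mid \mathbf{x}_i &\sim \mathcal{N}\!\left(B^{\top}\mathbf{x}_i,\ \Sigma\right), \\
y_{ij} \mid z_{ij} &\sim \mathrm{Poisson}\!\left(\exp(z_{ij})\right), \qquad j = 1,\dots,K .
\end{aligned}
```

Under this reading, the counts are conditionally independent given the latent vector, and dependence among the responses enters only through $\Sigma$ (or its inverse, which the abstract says is estimated jointly with the regression coefficients).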
4	Partial Least Squares for Serially Dependent Data Singer, Marco 04 August 2016 No description available. 510 Mathematik (PPN61756535X)
5	Bayesian Methods for Genetic Association Studies Xu, Lizhen 08 January 2013 We develop statistical methods for tackling two important problems in genetic association studies. First, we propose a Bayesian approach to overcome the winner's curse in genetic studies. Second, we consider a Bayesian latent variable model for analyzing longitudinal family data with pleiotropic phenotypes. Winner's curse in genetic association studies refers to the estimation bias of the reported odds ratios (OR) for an associated genetic variant from the initial discovery samples. It is a consequence of the sequential procedure in which the estimated effect of an associated genetic marker must first pass a stringent significance threshold. We propose a hierarchical Bayes method in which a spike-and-slab prior is used to account for the possibility that the significant test result may be due to chance. We examine the robustness of the method using different priors corresponding to different degrees of confidence in the testing results and propose a Bayesian model averaging procedure to combine estimates produced by different models. The Bayesian estimators yield smaller variance than the conditional likelihood estimator and outperform the latter in low-power studies. We investigate the performance of the method with simulations and applications to four real data examples. Pleiotropy occurs when a single genetic factor influences multiple quantitative or qualitative phenotypes, and it is present in many genetic studies of complex human traits. Longitudinal family studies combine the features of longitudinal studies in individuals and cross-sectional studies in families. Therefore, they provide more information about the genetic and environmental factors associated with the trait of interest. 
We propose a Bayesian latent variable modeling approach to model multiple phenotypes simultaneously in order to detect the pleiotropic effect and allow for longitudinal and/or family data. An efficient MCMC algorithm is developed to obtain the posterior samples by using hierarchical centering and parameter expansion techniques. We apply spike and slab prior methods to test whether the phenotypes are significantly associated with the latent disease status. We compute Bayes factors using path sampling and discuss their application in testing the significance of factor loadings and the indirect fixed effects. We examine the performance of our methods via extensive simulations and apply them to the blood pressure data from a genetic study of type 1 diabetes (T1D) complications. winner's curse spike and slab prior Hierarchical Bayes Model Bayesian Model Averaging Latent variable model pleiotropy genetic association studies Markov chain Monte Carlo path sampling Bayesian inference 0463
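The winner's curse described in this abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the thesis's method: the function name, the true log odds ratio, the standard error, and the one-sided z-threshold are all hypothetical values chosen for illustration, and each study's estimate is assumed to be normally distributed around the true effect.

```python
import random
import statistics


def simulate_winners_curse(true_log_or=0.15, se=0.08, z_threshold=3.0,
                           n_studies=20000, seed=0):
    """Compare the mean of all simulated estimates with the mean of the
    estimates that pass a significance threshold (hypothetical settings)."""
    rng = random.Random(seed)
    all_estimates = []
    significant = []
    for _ in range(n_studies):
        # Assumption: each discovery study yields an unbiased, normally
        # distributed estimate of the log odds ratio.
        estimate = rng.gauss(true_log_or, se)
        all_estimates.append(estimate)
        # Only estimates whose one-sided z-statistic clears the threshold
        # would be reported as associations.
        if estimate / se > z_threshold:
            significant.append(estimate)
    return (statistics.mean(all_estimates),
            statistics.mean(significant),
            len(significant))


if __name__ == "__main__":
    mean_all, mean_sig, n_sig = simulate_winners_curse()
    print(f"mean of all estimates:         {mean_all:.3f}")
    print(f"mean of significant estimates: {mean_sig:.3f} (n={n_sig})")
```

Because only estimates above `z_threshold * se` survive the selection step, their average must exceed the true effect whenever that cutoff lies above it, which is the upward bias the abstract refers to. Correcting it, for example with the hierarchical Bayes approach the abstract proposes, is outside the scope of this sketch.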
7	Modélisation et classification dynamique de données temporelles non stationnaires / Dynamic classification and modeling of non-stationary temporal data El Assaad, Hani 11 December 2014 This thesis addresses the unsupervised classification of data when the characteristics of the classes may evolve over time. In this case, one also speaks of dynamic classification of non-stationary temporal data. The application context of this work is the diagnosis, by pattern recognition, of complex dynamic systems whose operating classes may evolve over time as a result of wear, progressive loss of adjustment, or varying operating contexts. A dynamic probabilistic model, based on both mixture distributions and state-space dynamic models, was therefore proposed. Given the complex structure of this model, a variational variant of the EM algorithm was proposed for learning its parameters. With a view to the fast processing of data streams, a sequential version of this algorithm was also developed, along with a strategy for dynamically choosing the number of classes. A series of experiments conducted on simulated data and on real data acquired from the railway switch system made it possible to evaluate the potential of the proposed approaches. / Nowadays, diagnosis and monitoring for predictive maintenance of railway components are key subjects for both operators and manufacturers. They seek to anticipate upcoming maintenance actions, reduce maintenance costs and increase the availability of the rail network. In order to maintain the components at a satisfactory level of operation, the implementation of a reliable diagnostic strategy is required. 
In this thesis, we are interested in a main component of railway infrastructure, the railway switch; an important safety device whose failure could heavily impact the availability of the transportation system. The diagnosis of this system is therefore essential and can be done by exploiting sequential measurements acquired successively while the state of the system is evolving over time. These measurements consist of power consumption curves that are acquired during several switch operations. The shape of these curves is indicative of the operating state of the system. The aim is to track the temporal dynamic evolution of railway component state under different operating contexts by analyzing the specific data in order to detect and diagnose problems that may lead to functioning failure. This thesis tackles the problem of temporal data clustering within a broader context of developing innovative tools and decision-aid methods. We propose a new dynamic probabilistic approach within a temporal data clustering framework. This approach is based on both Gaussian mixture models and state-space models. The main challenge facing this work is the estimation of model parameters associated with this approach because of its complex structure. In order to meet this challenge, a variational approach has been developed. The results obtained on both synthetic and real data highlight the advantage of the proposed algorithms compared to other state of the art methods in terms of clustering and estimation accuracy Diagnostic Classification automatique Modèle de mélange Données temporelles non stationnaires Classes évolutives Filtre de Kalman Diagnosis Clustering Dynamic latent variable model Temporal data clustering Evolving clusters Kalman filter
8	Determinants of Union Member Attitudes Towards Employee Involvement Programs Hoell, Robert Craig 02 October 1998 This study investigates the role social information and personal dispositions play in the development of attitudes of unionized employees towards employee involvement programs. A theoretical model was developed in order to understand how social information and dispositions form union member attitudes towards employee involvement programs. This was designed from models of employee involvement and attitude formation. Data were collected from employees at electrical power generation facilities. Measures of organizational and union commitment, locus of control, participativeness, social information provided by the company, social information provided by the union, and employee involvement attitudes were gathered through a survey distributed at the facilities. General affect and satisfaction towards four types of employee involvement programs union members are most likely to encounter were measured. Specific hypotheses were developed in order to test and analyze parts of the theoretical model. While the results were at times contrary to the hypothesized relationships within the model, the data fit with the theorized model well enough to provide support for it. This model effectively demonstrated how employee involvement attitudes are formed from such data, and the relationships between the variables measured. / Ph. D. Structural Equation Model Social Information Processing Organizational Behavior Employee Attitudes Industrial Relations Labor Relations Labor Unions Employee Participation Employee Involvement Latent Variable Model LISREL Human Resource Management
9	Latent Variable Models of Categorical Responses in the Bayesian and Frequentist Frameworks Farouni, Tarek January 2014 No description available. Educational Tests and Measurements Psychology Statistics Psychometrics Latent Variable Model Model Identifiability Categorical Item Factor Analysis Multidimensional IRT Nonlinear Factor Analysis
10	Predicting Linguistic Structure with Incomplete and Cross-Lingual Supervision Täckström, Oscar January 2013 Contemporary approaches to natural language processing are predominantly based on statistical machine learning from large amounts of text, which has been manually annotated with the linguistic structure of interest. However, such complete supervision is currently only available for the world's major languages, in a limited number of domains and for a limited range of tasks. As an alternative, this dissertation considers methods for linguistic structure prediction that can make use of incomplete and cross-lingual supervision, with the prospect of making linguistic processing tools more widely available at a lower cost. An overarching theme of this work is the use of structured discriminative latent variable models for learning with indirect and ambiguous supervision; as instantiated, these models admit rich model features while retaining efficient learning and inference properties. The first contribution to this end is a latent-variable model for fine-grained sentiment analysis with coarse-grained indirect supervision. The second is a model for cross-lingual word-cluster induction and the application thereof to cross-lingual model transfer. The third is a method for adapting multi-source discriminative cross-lingual transfer models to target languages, by means of typologically informed selective parameter sharing. The fourth is an ambiguity-aware self- and ensemble-training algorithm, which is applied to target language adaptation and relexicalization of delexicalized cross-lingual transfer parsers. The fifth is a set of sequence-labeling models that combine constraints at the level of tokens and types, and an instantiation of these models for part-of-speech tagging with incomplete cross-lingual and crowdsourced supervision. 
In addition to these contributions, comprehensive overviews are provided of structured prediction with no or incomplete supervision, as well as of learning in the multilingual and cross-lingual settings. Through careful empirical evaluation, it is established that the proposed methods can be used to create substantially more accurate tools for linguistic processing, compared to both unsupervised methods and to recently proposed cross-lingual methods. The empirical support for this claim is particularly strong in the latter case; our models for syntactic dependency parsing and part-of-speech tagging achieve the hitherto best published results for a wide number of target languages, in the setting where no annotated training data is available in the target language. linguistic structure prediction structured prediction latent-variable model semi-supervised learning multilingual learning cross-lingual learning indirect supervision partial supervision ambiguous supervision part-of-speech tagging dependency parsing named-entity recognition sentiment analysis
