101

Multiple imputation of missing data: an example of application in the Pró-Saúde Study

Thaís de Paulo Rangel 05 March 2013
Funding: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES). Missing data are a common problem in epidemiologic studies and, depending on how they arise, the resulting estimates may be biased. The literature offers several techniques for dealing with the issue, and multiple imputation has received growing attention in recent years. This dissertation presents the results of applying multiple imputation of missing data in the context of the Pró-Saúde Study, a longitudinal study among civil servants at a university in Rio de Janeiro, Brazil. In the first paper, after simulating the occurrence of missing data, the color/race of the female participants was imputed and analyzed with a previously established survival model whose outcome was the self-reported history of uterine leiomyoma. The procedure was replicated 100 times to determine the distribution of the coefficients and standard errors of the imputed variable. Although the data were collected cross-sectionally (baseline data of the Pró-Saúde Study, gathered in 1999 and 2001), the participants' follow-up histories were reconstructed from self-reported information, creating a situation in which the Cox proportional hazards model could be applied. In the scenarios evaluated, imputation gave adequate results, including in the performance analyses. The technique performed well when the missingness mechanism was MAR (missing at random) and the proportion of missing data was 10%. Imputing the missing information and combining the estimates from the m = 10 resulting datasets produced a bias of 0.0011 for black women and 0.0015 for brown (mixed-race) women, corroborating the efficiency of multiple imputation in this scenario; other configurations yielded similar results. In the second paper, a tutorial was developed to guide the application of multiple imputation in epidemiologic studies, which should make the technique easier to use for Brazilian researchers not yet familiar with the procedure. The basic steps and decisions needed to impute a dataset are presented, and one of the scenarios from the first paper serves as an application example. All analyses were performed in the R statistical software, version 2.15, and the scripts are presented at the end of the text.
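The dissertation's R 2.15 scripts are not reproduced in this record, but the core procedure is short enough to sketch. Below is a minimal, illustrative Python version with assumed synthetic data and a deliberately simplified imputation model (random draws from the observed distribution); the lifelines library supplies the Cox model. It imposes roughly 10% MAR missingness on a binary race indicator, imputes m = 10 times, fits a Cox model per completed dataset, and pools with Rubin's rules.

```python
# Sketch: multiple imputation (m = 10) of a covariate under ~10% MAR
# missingness, a Cox model per completed dataset, and Rubin's rules.
# Illustrative only -- the dissertation used R 2.15 and a proper
# imputation model; this one just resamples observed values.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "time": rng.exponential(10, n),              # follow-up time
    "event": rng.integers(0, 2, n),              # outcome indicator
    "age": rng.normal(45, 10, n),
    "race_black": rng.integers(0, 2, n).astype(float),
})
# MAR: missingness in race depends only on the observed covariate (age).
p_miss = np.where(df["age"] > df["age"].median(), 0.2, 0.0)
df.loc[rng.random(n) < p_miss, "race_black"] = np.nan

m = 10
coefs, ses = [], []
observed = df["race_black"].dropna().to_numpy()
for _ in range(m):
    completed = df.copy()
    holes = completed["race_black"].isna()
    completed.loc[holes, "race_black"] = rng.choice(observed, holes.sum())
    cph = CoxPHFitter().fit(completed, duration_col="time", event_col="event")
    coefs.append(cph.params_["race_black"])
    ses.append(cph.standard_errors_["race_black"])

# Rubin's rules: total variance = within + (1 + 1/m) * between.
coefs, ses = np.array(coefs), np.array(ses)
q_bar = coefs.mean()
t_var = (ses ** 2).mean() + (1 + 1 / m) * coefs.var(ddof=1)
print(f"pooled log-HR = {q_bar:.4f}, pooled SE = {np.sqrt(t_var):.4f}")
```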
102

Application of genomic technologies to the horse

Corbin, Laura Jayne January 2013
The publication of a draft equine genome sequence and the release by Illumina of a 50,000-marker single-nucleotide polymorphism (SNP) genotyping chip have provided equine researchers with the opportunity to use new approaches to study the relationships between genotype and phenotype. In particular, it is hoped that the use of high-density markers applied to population samples will enable progress to be made on more complex diseases. The first objective of this thesis is to explore the potential for the equine SNP chip to enable such studies to be performed in the horse. The second objective is to investigate the genetic background of osteochondrosis (OC) in the horse. These objectives have been tackled using 348 Thoroughbreds from the US, divided into cases and controls, and a further 836 UK Thoroughbreds, the majority with no phenotype data. All horses had been genotyped with the Illumina Equine SNP50 BeadChip. Linkage disequilibrium (LD) is the non-random association of alleles at neighbouring loci; the reliance of many genomic methodologies on LD between neutral markers and causal variants makes it an important characteristic of genome structure. In this thesis, the genomic data have been used to study the extent of LD in the Thoroughbred and the results considered in terms of genome coverage; the results suggest that the SNP chip offers good coverage of the genome. Published theoretical relationships between LD and historical effective population size (Ne) were exploited to make accuracy predictions for genome-wide evaluation (GWE). A subsequent in-depth exploration of this theory cast some doubt on the reliability of the approach for estimating Ne, but the general conclusion, that the Thoroughbred population has a small Ne and that GWE should therefore be feasible in this population, remains valid. In the course of these studies, possible errors embedded within the current sequence assembly were identified using empirical approaches. Osteochondrosis is a developmental orthopaedic disease which affects the joints of young horses; it is considered multifactorial in origin, with a variety of environmental factors and heredity implicated. In this thesis, a genome-wide association study was carried out to identify quantitative trait loci (QTL) associated with OC, and a single SNP was found to be significantly associated with the disease. The low heritability of OC combined with the apparent lack of major QTL suggests GWE as an alternative approach to tackle this disease. A GWE analysis was carried out on the same dataset, but the resulting genomic breeding values had no predictive ability for OC status. This, combined with the small number of significant QTL, indicates a lack of power, which could be addressed in the future by increasing sample size. An alternative to genotyping more horses on the 50K SNP chip would be to use a low-density SNP panel and impute the remaining genotypes; the final chapter of this thesis examines the feasibility of this approach in the Thoroughbred. Results suggest that genotyping only a subset of samples at high density and the remainder at lower density could be an effective strategy to enable greater progress in equine genomics. Finally, this thesis provides an outlook on the future of genomics in the horse.
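As a rough illustration of the LD-based reasoning described above: r² between two biallelic SNPs can be computed from genotype dosages, and Sved's (1971) approximation E[r²] ≈ 1/(1 + 4·Ne·c) can be inverted to estimate Ne. The sketch below uses simulated genotypes and a hypothetical inter-marker recombination fraction, not the thesis's SNP50 data or its exact estimation procedure.

```python
# Sketch: pairwise LD (r^2) from genotype dosages and a crude Ne estimate
# via Sved's approximation E[r^2] ~= 1 / (1 + 4*Ne*c). Simulated,
# independent SNPs stand in for the real chip data, so the numbers are
# only illustrative (finite-sample bias correction of r^2 is omitted).
import numpy as np

rng = np.random.default_rng(0)
n_horses, n_snps = 300, 200
freqs = rng.uniform(0.1, 0.9, n_snps)
geno = rng.binomial(2, freqs, (n_horses, n_snps)).astype(float)

def r_squared(a: np.ndarray, b: np.ndarray) -> float:
    """Squared Pearson correlation of genotype dosages (0/1/2)."""
    return float(np.corrcoef(a, b)[0, 1] ** 2)

# Mean r^2 over adjacent SNP pairs.
r2 = np.mean([r_squared(geno[:, i], geno[:, i + 1]) for i in range(n_snps - 1)])

# Invert Sved's formula at an assumed inter-marker recombination fraction c.
c = 0.0005   # hypothetical ~0.05 cM between adjacent markers
ne = (1.0 / r2 - 1.0) / (4.0 * c)
print(f"mean adjacent-pair r^2 = {r2:.4f}, implied Ne ~ {ne:.0f}")
```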
103

Strategies for the treatment of variables with missing data during the development of predictive models

Fernando Assunção 09 May 2012
Predictive models are increasingly used by the market to help companies with risk mitigation, portfolio growth, customer retention, and fraud prevention, among other goals. During model development, however, it is common for some of the predictive variables to have missing values, making it necessary to adopt a procedure for treating these variables. Given this scenario, the aim of this study is to discuss methodologies for handling missing data in predictive models, encouraging the use of some that are already known in academia but not yet used by the market. The study describes seven methodologies, all of which were applied empirically to a dataset from the development of a credit scoring model. Seven models were built on this dataset (one for each methodology described), and their results were evaluated and compared through performance measures widely used by the market (KS, Gini, ROC curve, and approval curve). In this application, the best-performing techniques were the one that treats missing data as a separate category (a technique already used by the market) and the one that groups missing data into the conceptually most similar category. The worst performer was the methodology that simply discards the variable with missing data, another procedure commonly seen in the market.
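A brief sketch of the best-performing idea (missing values as their own category) together with the KS and Gini measures used for comparison; the data, binning and model below are assumptions for illustration, not the study's credit scoring base.

```python
# Sketch: treat missing values as their own category in a binned
# scorecard-style predictor, then score with Gini and KS.
# Synthetic data, illustrative only.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(1)
n = 5000
y = rng.integers(0, 2, n)                                   # 1 = bad payer
income = np.where(y == 1, rng.normal(3.0, 1.0, n), rng.normal(4.0, 1.0, n))
income[rng.random(n) < 0.15] = np.nan                       # 15% unfilled

# Missing as its own category: bin the variable, send NaN to "MISSING".
bins = pd.cut(pd.Series(income), bins=5).astype(str).replace("nan", "MISSING")
X = pd.get_dummies(bins, drop_first=True)

score = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

auc = roc_auc_score(y, score)                               # Gini = 2*AUC - 1
fpr, tpr, _ = roc_curve(y, score)                           # KS = max(TPR - FPR)
print(f"Gini = {2 * auc - 1:.3f}, KS = {np.max(tpr - fpr):.3f}")
```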
104

Machine learning with The Cancer Genome Atlas head and neck squamous cell carcinoma dataset: improving usability by addressing inconsistency, sparsity, and high-dimensionality

Rendleman, Michael 01 May 2019
In recent years, more data have become available for historical oncology case analysis. A large dataset describing over 500 patient cases of head and neck squamous cell carcinoma is a potential goldmine for improving oncological decision support, but the best approaches for extracting useful inferences are unknown. With so much information, from DNA and RNA sequencing to clinical records, computational learning is needed to find associations and biomarkers. The available data are sparse and inconsistent, and some datatypes are very large. We processed clinical records with an expert oncologist and used complex modeling methods to substitute (impute) treatment information for cases in which it was missing. We used machine learning algorithms to test whether imputed data are useful for predicting patient survival. We saw no difference in the ability to predict patient survival with the imputed data, though imputed treatment variables were more important to the survival models. To deal with the large number of features in RNA expression data, we used two approaches: using all the data on high-performance computers, and transforming the data into a smaller set of features (sparse principal components, or SPCs). We compared the performance of survival models with both datasets and saw no differences; however, the SPC models trained more quickly while also allowing us to pinpoint the biological processes each SPC is involved in, informing future biomarker discovery. We also examined ten processed molecular features for survival prediction ability and found some predictive power, though not enough to be clinically useful.
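A hedged sketch of the SPC step described above, using scikit-learn's SparsePCA on random stand-in data; the matrix size, component count and penalty are not taken from the thesis.

```python
# Sketch: compress a high-dimensional expression matrix into sparse
# principal components (SPCs). Each SPC has only a few nonzero gene
# loadings, which can be inspected for biological interpretation.
# Random data stands in for the TCGA HNSCC expression matrix.
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(7)
n_patients, n_genes = 120, 500
expr = rng.normal(size=(n_patients, n_genes))
gene_names = np.array([f"GENE_{i}" for i in range(n_genes)])

spca = SparsePCA(n_components=10, alpha=2.0, random_state=0)
spcs = spca.fit_transform(expr)        # 120 x 10: features for survival models
print("SPC feature matrix shape:", spcs.shape)

# Each SPC involves only the genes with nonzero loadings.
for k in range(3):
    nz = np.flatnonzero(spca.components_[k])
    print(f"SPC {k}: {len(nz)} genes, e.g. {gene_names[nz[:5]]}")
```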
105

Anomaly detection in unknown environments using wireless sensor networks

Li, YuanYuan 01 May 2010
This dissertation addresses the problem of distributed anomaly detection in Wireless Sensor Networks (WSN). A challenge of designing such systems is that the sensor nodes are battery powered, often have different capabilities and generally operate in dynamic environments. Programming such sensor nodes at a large scale can be a tedious job if the system is not carefully designed. Data modeling in distributed systems is important for determining the normal operation mode of the system. Being able to model the expected sensor signatures for typical operations greatly simplifies the human designer’s job by enabling the system to autonomously characterize the expected sensor data streams. This, in turn, allows the system to perform autonomous anomaly detection to recognize when unexpected sensor signals are detected. This type of distributed sensor modeling can be used in a wide variety of sensor networks, such as detecting the presence of intruders, detecting sensor failures, and so forth. The advantage of this approach is that the human designer does not have to characterize the anomalous signatures in advance. The contributions of this approach include: (1) providing a way for a WSN to autonomously model sensor data with no prior knowledge of the environment; (2) enabling a distributed system to detect anomalies in both sensor signals and temporal events online; (3) providing a way to automatically extract semantic labels from temporal sequences; (4) providing a way for WSNs to save communication power by transmitting compressed temporal sequences; (5) enabling the system to detect time-related anomalies without prior knowledge of abnormal events; and, (6) providing a novel missing data estimation method that utilizes temporal and spatial information to replace missing values. The algorithms have been designed, developed, evaluated, and validated experimentally in synthesized data, and in real-world sensor network applications.
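As a rough illustration of contribution (6), a missing reading can be estimated by blending a temporal prediction from the node's own series with a spatial prediction from its neighbours. The linear interpolation and 50/50 weighting below are placeholders for illustration, not the dissertation's actual estimator.

```python
# Sketch: estimate a missing sensor reading as a weighted blend of a
# temporal prediction (interpolation of the node's own series) and a
# spatial prediction (mean of neighbouring nodes at the same instant).
# The 0.5/0.5 weighting is a placeholder, not the dissertation's method.
import numpy as np

def estimate_missing(series: np.ndarray, t: int, neighbours: np.ndarray,
                     w_time: float = 0.5) -> float:
    """series: this node's readings with np.nan at time t.
    neighbours: readings of nearby nodes at time t."""
    known = ~np.isnan(series)
    temporal = np.interp(t, np.flatnonzero(known), series[known])
    spatial = np.nanmean(neighbours)
    return w_time * temporal + (1 - w_time) * spatial

readings = np.array([20.1, 20.4, np.nan, 21.0, 21.2])  # dropout at t = 2
nearby = np.array([20.6, 20.8, 20.5])                  # neighbours at t = 2
print(f"estimated reading: {estimate_missing(readings, 2, nearby):.2f}")
```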
106

Methodology and applications in imputation, food consumption and obesity research

Kyureghian, Gayaneh May 2009
Obesity is a rapidly growing public health threat as well as an economic problem in the United States. Recent changes in eating habits, especially the relative increase in food away from home (FAFH) consumption over the last three decades, raise the possibility of a causal link between obesity and FAFH. This study confirms the positive, significant association between body mass index and FAFH consumption in adults, consistent with previous findings in the economic and nutrition literature. This work goes a step further, however: we demonstrate that FAFH consumption at quick-service restaurants has a significantly larger effect on body mass index than FAFH consumption at full-service restaurants. Further disaggregation of FAFH by meal occasion reveals that lunch consumed away from home has the largest positive effect on body mass index compared with the other meal occasions (breakfast, dinner and snacks). Survey data with missing observations or latent variables are not a rare phenomenon. Missing-value imputation methods fall into two groups, depending on whether an explicit underlying statistical model exists: explicit modeling methods include unconditional mean imputation, conditional mean and regression imputation, stochastic regression imputation, and multiple imputation, while methods based on implicit modeling include hot deck and cold deck imputation. In the second essay, we review imputation methods commonly used in the agricultural economics literature; our analysis reveals a strong preference among researchers for regression imputation. We consider several alternative (regression, mean and median) single-imputation methods to impute and append prices of foods consumed at home (foods commercially purchased and foods prepared from ingredients) in the National Health and Nutrition Examination Survey (NHANES) dietary intake data. We demonstrate the superiority of regression imputation over mean and median imputation for commercially prepared foods; for ingredient foods, the results are ambiguous, with no imputation method clearly outperforming the others.
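A compact sketch contrasting three of the single-imputation families named above (unconditional mean, regression, and implicit-model hot deck) on toy price data; the NHANES application itself is not reproduced here.

```python
# Sketch: three single-imputation methods from the essay's taxonomy --
# unconditional mean, regression, and hot deck. Toy data, not NHANES.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 200
df = pd.DataFrame({"quantity": rng.uniform(1, 10, n)})
df["price"] = 2.0 + 0.5 * df["quantity"] + rng.normal(0, 0.5, n)
df.loc[rng.random(n) < 0.2, "price"] = np.nan        # 20% missing

obs = df.dropna()
miss = df["price"].isna()

# 1. Unconditional mean imputation.
mean_imp = df["price"].fillna(obs["price"].mean())

# 2. Regression imputation: predict price from quantity.
reg = LinearRegression().fit(obs[["quantity"]], obs["price"])
reg_imp = df["price"].copy()
reg_imp[miss] = reg.predict(df.loc[miss, ["quantity"]])

# 3. Hot deck: fill each hole with a random observed donor's value.
hot_imp = df["price"].copy()
hot_imp[miss] = rng.choice(obs["price"].to_numpy(), miss.sum())

for name, s in [("mean", mean_imp), ("regression", reg_imp), ("hot deck", hot_imp)]:
    print(f"{name:10s} imputed-series variance = {s.var():.3f}")
```

Note how mean imputation visibly shrinks the variance of the completed series, one reason the essay's comparison favors regression-based approaches.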
107

A Monte Carlo Study: The Impact of Missing Data in Cross-Classification Random Effects Models

Alemdar, Meltem 12 August 2009
Unlike multilevel data with a purely nested structure, cross-classified data not only may be clustered into hierarchically ordered units but also may belong to more than one unit at a given level of a hierarchy. In a cross-classified design, students at a given school might come from several different neighborhoods, and one neighborhood might send students to a number of different schools. In this scenario, schools and neighborhoods are cross-classified factors, and cross-classified random effects modeling (CCREM) should be used to analyze the data appropriately. A common problem in any type of multilevel analysis is the presence of missing data at any given level; little research has been conducted in the multilevel literature on the impact of missing data, and none in the area of cross-classified models. The purpose of this study was to examine the effect of data that are missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) on CCREM estimates, while exploring multiple imputation to handle the missing data. In addition, this study examined the impact of including in the imputation model an auxiliary variable correlated with the variable with missingness (the level-1 predictor). This study expanded on the CCREM Monte Carlo simulation work of Meyers (2004) by additionally studying the effect of missing data, and of the method used to handle them, on CCREM. The results demonstrated that, in general, multiple imputation met Hoogland and Boomsma's (1998) relative bias criterion for parameter estimates (less than 5% in magnitude) under the different missing data patterns. For the standard error estimates, substantial relative bias (defined by Hoogland and Boomsma as greater than 10%) was found in some conditions: when multiple imputation was used to handle the missing data, substantial bias was found in the standard errors in most cells where data were MNAR, and this bias increased with the percentage of missing data.
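The three missingness mechanisms and the relative-bias criterion are easy to state in code. A minimal sketch follows, with an arbitrary predictor, missingness rates and cutoffs (the study's CCREM data generation is far more involved):

```python
# Sketch: imposing MCAR, MAR and MNAR missingness on a level-1 predictor x,
# plus Hoogland & Boomsma's relative-bias check (< 5% in magnitude for
# parameter estimates). Rates and cutoffs are illustrative choices.
import numpy as np

rng = np.random.default_rng(11)
n = 10_000
z = rng.normal(size=n)                         # auxiliary variable
x = 2.0 + 1.5 * z + 0.5 * rng.normal(size=n)   # predictor, true mean 2.0

mcar = rng.random(n) < 0.20                          # ignores the data
mar  = rng.random(n) < np.where(z > 0, 0.40, 0.0)    # depends on observed z
mnar = rng.random(n) < np.where(x > 2, 0.40, 0.0)    # depends on x itself

for name, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    rb = (x[~mask].mean() - 2.0) / 2.0         # relative bias of the mean
    flag = "ok (<5%)" if abs(rb) < 0.05 else "substantial"
    print(f"{name}: complete-case relative bias = {rb:+.2%}  [{flag}]")
```

Dropping incomplete cases is harmless under MCAR but visibly biases the estimate under MAR and MNAR, which is what motivates multiple imputation in the study.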
108

Statistical Evaluation of Continuous-Scale Diagnostic Tests with Missing Data

Wang, Binhuan 12 June 2012
Receiver operating characteristic (ROC) curve methodology is the standard statistical methodology for assessing the accuracy of diagnostic tests or biomarkers. Currently, the most widely used statistical methods for inference on ROC curves are complete-data-based parametric, semi-parametric or nonparametric methods; these methods cannot be used in diagnostic applications with missing data. In practice, missing diagnostic data occur commonly, for reasons such as medical tests being too expensive, too time-consuming or too invasive. This dissertation aims to develop new nonparametric statistical methods for evaluating the accuracy of diagnostic tests or biomarkers in the presence of missing data. Specifically, novel nonparametric methods are developed, under different types of missing data, for (i) inference on the area under the ROC curve (AUC, a summary index of the diagnostic accuracy of the test) and (ii) joint inference on the sensitivity and specificity of a continuous-scale diagnostic test. The dissertation provides a general framework that combines empirical likelihood and general estimating equations with nuisance parameters for the joint inference of sensitivity and specificity with missing diagnostic data. The proposed methods have sound theoretical properties; the theoretical development is challenging because the proposed profile log-empirical-likelihood ratio statistics are not the standard sum of independent random variables. The new methods combine the power of likelihood-based approaches with the jackknife method in ROC studies, and are therefore expected to be more robust, more accurate and less computationally intensive than existing methods in the evaluation of competing diagnostic tests.
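For context, the AUC in (i) has a simple nonparametric estimator, the Mann-Whitney statistic. Computed on complete cases only, as in the sketch below, it is exactly the naive baseline that motivates methods able to use records with missing test results (synthetic scores, illustrative only):

```python
# Sketch: nonparametric (Mann-Whitney) estimate of the AUC, computed on
# complete cases only -- the naive baseline that methods handling missing
# diagnostic data improve upon. Synthetic test scores.
import numpy as np

rng = np.random.default_rng(5)
diseased = rng.normal(1.0, 1.0, 150)
healthy = rng.normal(0.0, 1.0, 200)
diseased[rng.random(150) < 0.2] = np.nan   # some tests never performed
healthy[rng.random(200) < 0.2] = np.nan

def auc_mann_whitney(pos: np.ndarray, neg: np.ndarray) -> float:
    """P(score_pos > score_neg) + 0.5 * P(tie), over complete cases."""
    pos, neg = pos[~np.isnan(pos)], neg[~np.isnan(neg)]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

print(f"complete-case AUC estimate: {auc_mann_whitney(diseased, healthy):.3f}")
```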
109

Nonparametric Bayesian Methods for Multiple Imputation of Large Scale Incomplete Categorical Data in Panel Studies

Si, Yajuan January 2012
The thesis develops nonparametric Bayesian models to handle incomplete categorical variables in high-dimensional data sets within the framework of multiple imputation. It presents methods for ignorable missing data in cross-sectional studies and for potentially non-ignorable missing data in panel studies with refreshment samples.

The first contribution is a fully Bayesian, joint modeling approach to multiple imputation for categorical data based on Dirichlet process mixtures of multinomial distributions. The approach automatically models complex dependencies while remaining computationally expedient. I illustrate the repeated-sampling properties of the approach using simulated data; it offers better performance than the default chained-equations methods often used in such settings. I apply the methodology to impute missing background data in the 2007 Trends in International Mathematics and Science Study.

For the second contribution, I extend the nonparametric Bayesian imputation engine to handle a mix of potentially non-ignorable attrition and ignorable item nonresponse in multiple-wave panel studies. Ignoring attrition in models for panel data can bias inference if the reason for attrition is systematic and related to the missing values, and panel data alone cannot estimate the attrition effect without untestable assumptions about the missing data mechanism. Refreshment samples offer an extra data source that can be used to estimate the attrition effect while reducing reliance on strong assumptions about that mechanism. I consider two novel Bayesian approaches to handle attrition and item nonresponse simultaneously under multiple imputation in a two-wave panel with one refreshment sample, when the variables involved are categorical and high-dimensional.

First, I present a semi-parametric selection model that includes an additive non-ignorable attrition model with main effects of all variables, including demographic variables and outcome measures in waves 1 and 2. The survey variables are modeled jointly using a Bayesian mixture of multinomial distributions, and I develop posterior computation algorithms for the selection model under different prior choices for the regression coefficients of the attrition model.

Second, I propose two Bayesian pattern mixture models that use latent classes to model the dependency among the variables and the attrition: a dependent latent pattern mixture model, in which variables are modeled via latent classes and attrition enters as a covariate in the class-allocation weights, and a joint latent pattern mixture model, in which attrition and variables are modeled jointly via latent classes. Simulation studies show that the pattern mixture models can recover true parameter values even when inferences based on the panel alone are biased by attrition. I apply both the selection and pattern mixture models to data from the 2007-2008 Associated Press/Yahoo News election panel study.
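To make the first contribution concrete, here is a heavily simplified, illustrative Gibbs sampler for a truncated Dirichlet process mixture of product-multinomials with missing categorical cells. The truncation level, priors, sweep count and data are all assumptions for the sketch, not the thesis's implementation.

```python
# Sketch: Gibbs sampler for a truncated DP mixture of product-multinomials
# with missing categorical cells, in the spirit of the imputation engine
# described above. Deliberately simplified for illustration.
import numpy as np

rng = np.random.default_rng(8)
n, p, C, K, alpha = 500, 6, 4, 20, 1.0     # rows, variables, categories,
X = rng.integers(0, C, (n, p))             # truncation level, DP concentration
X[rng.random((n, p)) < 0.1] = -1           # -1 marks a missing cell

theta = rng.dirichlet(np.ones(C), (K, p))  # theta[k, j] = P(category | class k)
log_pi = np.full(K, -np.log(K))            # start from uniform class weights

for sweep in range(50):
    # 1. Sample class assignments from the observed cells only.
    logp = np.tile(log_pi, (n, 1))
    for j in range(p):
        obs = X[:, j] >= 0
        logp[obs] += np.log(theta[:, j, X[obs, j]] + 1e-12).T
    probs = np.exp(logp - logp.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    z = (probs.cumsum(axis=1) > rng.random((n, 1))).argmax(axis=1)

    # 2. Stick-breaking update of the mixture weights.
    counts = np.bincount(z, minlength=K)
    rest = counts[::-1].cumsum()[::-1] - counts      # members of later sticks
    v = rng.beta(1 + counts, alpha + rest).clip(1e-10, 1 - 1e-10)
    log_pi = np.log(v) + np.concatenate(([0.0], np.cumsum(np.log(1 - v[:-1]))))

    # 3. Conjugate Dirichlet update of each class's category probabilities.
    for k in range(K):
        for j in range(p):
            xkj = X[z == k, j]
            tab = np.bincount(xkj[xkj >= 0], minlength=C)
            theta[k, j] = rng.dirichlet(1 + tab)

# 4. Impute each missing cell from its row's class-specific multinomial.
X_imp = X.copy()
for i, j in zip(*np.where(X == -1)):
    X_imp[i, j] = rng.choice(C, p=theta[z[i], j])
print("remaining missing cells:", int((X_imp == -1).sum()))
```

Repeating step 4 at several well-separated sweeps would yield the multiply-imputed datasets that the multiple-imputation framework combines.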