  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
91

Freeway Short-Term Traffic Flow Forecasting by Considering Traffic Volatility Dynamics and Missing Data Situations

Zhang, Yanru. August 2011 (has links)
Short-term traffic flow forecasting is a critical function in advanced traffic management systems (ATMS) and advanced traveler information systems (ATIS). Accurate forecasting results are useful to indicate future traffic conditions and assist traffic managers in seeking solutions to congestion problems on urban freeways and surface streets. There is new research interest in short-term traffic flow forecasting due to recent developments in ITS technologies. Previous research involves technologies in multiple areas, and a significant number of forecasting methods exist in the literature. However, forecasting reliability is not properly addressed in existing studies. Most forecasting methods focus only on the expected value of traffic flow, assuming constant variance when performing forecasting. This approach does not account for the volatile nature of traffic flow data. This thesis demonstrates that the variance of traffic flow data is not constant and that dependency exists. A volatility model studies the dependency among the variance of traffic flow data and provides a prediction range to indicate the reliability of traffic flow forecasting. We propose an ARIMA-GARCH (AutoRegressive Integrated Moving Average - Generalized AutoRegressive Conditional Heteroskedasticity) model to study the volatile nature of traffic flow data. Another shortcoming of existing studies is that most methods have limited forecasting abilities when there is missing data in historical or current traffic flow data. We developed a General Regression Neural Network (GRNN) based multivariate forecasting method to deal with this issue. This method uses upstream information to predict traffic flow at the studied site. The study results indicate that the ARIMA-GARCH model outperforms other methods in non-missing data situations, while the GRNN model performs better in missing data situations.
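The volatility idea in this abstract can be sketched numerically. The snippet below is an illustrative sketch, not the thesis's calibrated model: it simulates an AR(1) series driven by GARCH(1,1) innovations — the kind of variance dependency the abstract describes — and computes a one-step-ahead forecast with a conditional prediction interval. All parameter values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "traffic flow" series: AR(1) conditional mean driven by GARCH(1,1)
# innovations, so the variance itself is serially dependent.
n = 500
omega, alpha, beta = 0.2, 0.3, 0.6   # GARCH(1,1) parameters (made up)
phi = 0.7                            # AR(1) coefficient (made up)

h = np.empty(n)                      # conditional variances
e = np.empty(n)                      # innovations
y = np.empty(n)                      # observed series
h[0] = omega / (1.0 - alpha - beta)  # start at the unconditional variance
e[0] = np.sqrt(h[0]) * rng.standard_normal()
y[0] = e[0]
for t in range(1, n):
    h[t] = omega + alpha * e[t - 1] ** 2 + beta * h[t - 1]
    e[t] = np.sqrt(h[t]) * rng.standard_normal()
    y[t] = phi * y[t - 1] + e[t]

# One-step-ahead forecast: the GARCH recursion gives a time-varying
# variance, hence a prediction interval that widens in volatile periods.
h_next = omega + alpha * e[-1] ** 2 + beta * h[-1]
mean_next = phi * y[-1]
lo = mean_next - 1.96 * np.sqrt(h_next)
hi = mean_next + 1.96 * np.sqrt(h_next)
```

The point of the interval is reliability: in calm periods `h_next` is small and the range is tight; after a burst of large innovations the range widens.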
92

Dimension Reduction and Covariance Structure for Multivariate Data, Beyond Gaussian Assumption

Maadooliat, Mehdi 2011 August 1900 (has links)
Storage and analysis of high-dimensional datasets are always challenging. Dimension reduction techniques are commonly used to reduce the complexity of the data and obtain the informative aspects of datasets. Principal Component Analysis (PCA) is one of the most commonly used dimension reduction techniques. However, PCA does not work well when there are outliers or the data distribution is skewed. Gene expression index estimation is an important problem in bioinformatics. Some of the popular methods in this area are based on PCA, and thus may not work well when there is non-Gaussian structure in the data. To address this issue, a likelihood-based data transformation method with a computationally efficient algorithm is developed. Also, a new multivariate expression index is studied, and its performance is compared with the commonly used univariate expression index. As an extension of the gene expression index estimation problem, a general procedure that integrates data transformation with PCA is developed. In particular, this general method can handle missing data and data with functional structure. It is well known that PCA can be obtained by the eigendecomposition of the sample covariance matrix. Another focus of this dissertation is to study the covariance (or correlation) structure under a non-Gaussian assumption. An important issue in modeling the covariance matrix is the positive-definiteness constraint. The modified Cholesky decomposition of the inverse covariance matrix has been considered to address this issue in the literature. An alternative Cholesky decomposition of the covariance matrix is considered and used to construct an estimator of the covariance matrix under a multivariate-t assumption. The advantage of this alternative Cholesky decomposition is the decoupling of the correlations and the variances.
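The "decoupling" advantage mentioned at the end of this abstract can be shown in a few lines. The sketch below (illustrative numbers, not the dissertation's estimator) factors a covariance matrix as Sigma = D L Lᵀ D, where D holds the standard deviations and L is the Cholesky factor of the correlation matrix, so variances and correlations can be modeled separately while positive definiteness stays automatic.

```python
import numpy as np

# Toy covariance matrix (symmetric positive definite, values made up).
Sigma = np.array([[4.0, 1.2, 0.8],
                  [1.2, 1.0, 0.3],
                  [0.8, 0.3, 2.0]])

# Decouple variances from correlations: Sigma = D C D.
sd = np.sqrt(np.diag(Sigma))
D = np.diag(sd)
C = Sigma / np.outer(sd, sd)        # correlation matrix

# Cholesky factor of the correlation matrix. Because diag(C) = 1,
# every row of L has unit Euclidean norm, so correlation parameters
# can be modeled without touching the variances.
L = np.linalg.cholesky(C)

# Any choice of positive sd and valid L reconstructs a positive
# definite covariance matrix -- the constraint is satisfied by design.
Sigma_hat = D @ L @ L.T @ D
```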
93

The problem of missing residential mobility information in the German Microcensus: an evaluation of two statistical approaches with the Socio-Economic Panel

Bašić, Edin. January 2008 (has links)
Also published as: Berlin, Freie Universität, dissertation, 2008.
94

Socioeconomic and Racial/Ethnic Disparities in Cognitive Trajectories among the Oldest Old: The Role of Vascular and Functional Health

January 2011 (has links)
abstract: Identifying modifiable causes of chronic disease is essential to prepare for the needs of an aging population. Cognitive decline is a precursor to the development of Alzheimer's and other dementing diseases, representing some of the most prevalent and least understood sources of morbidity and mortality associated with aging. To contribute to the literature on cognitive aging, this work focuses on the role of vascular and physical health in the development of cognitive trajectories while accounting for the socioeconomic context in which health disparities develop. The Assets and Health Dynamics among the Oldest-Old study provided a nationally representative sample of non-institutionalized adults age 65 and over in 1998, with biennial follow-up continuing until 2008. Latent growth models with adjustment for non-random missing data were used to assess vascular, physical, and social predictors of cognitive change. A core aim of this project was examining socioeconomic and racial/ethnic variation in vascular predictors of cognitive trajectories. Results indicated that diabetes and heart problems were directly related to an increased rate of memory decline in whites, whereas these risk factors were associated with baseline word-recall for blacks only when conditioned on gender and household assets. These results support the vascular hypotheses of cognitive aging and attest to the significance of socioeconomic and racial/ethnic variation in vascular influences on cognitive health. The second substantive portion of this dissertation used parallel process latent growth models to examine the co-development of cognitive and functional health. Initial word-recall scores were consistently associated with later functional limitations, but baseline functional limitations were not consistently associated with later word-recall scores. Gender and household income moderated this relationship, and indicators of lifecourse SES were better equipped to explain variation in initial cognitive and functional status than change in these measures over time. Overall, this work suggests that research examining associations between cognitive decline, chronic disease, and disability must account for the social context in which individuals and their health develop. Also, these findings indicate that reducing socioeconomic and racial/ethnic disparities in cognitive health among the aging requires interventions early in the lifecourse, as disparities in cognitive trajectories were solidified prior to late old age. / Dissertation/Thesis / Ph.D. Sociology 2011
95

Machine Learning for Incomplete Data

Mesquita, Diego Parente Paiva January 2017 (has links)
MESQUITA, Diego Parente Paiva. Machine Learning for incomplete data. 2017. 55 f. Dissertação (Mestrado em Ciência da Computação), Universidade Federal do Ceará, Fortaleza, 2017. / Methods based on basis functions (such as the sigmoid and q-Gaussian functions) and similarity measures (such as distances or kernel functions) are widely used in machine learning and related fields. These methods often take for granted that data is fully observed and are not equipped to handle incomplete data in an organic manner. This assumption is often flawed, as incomplete data is a fact in various domains such as medical diagnosis and sensor analytics. Therefore, one might find it useful to be able to estimate the value of these functions in the presence of partially observed data. We propose methodologies to estimate the Gaussian kernel, the Euclidean distance, the Epanechnikov kernel, and arbitrary basis functions in the presence of possibly incomplete feature vectors. To obtain such estimates, the incomplete feature vectors are treated as continuous random variables and, based on that, we take the expected value of the transforms of interest.
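The expected-value idea in this abstract has a closed form for the Gaussian kernel when missing coordinates are modeled as independent normals. The function below is a sketch of that computation; the function name and the independence assumption are ours for illustration, not necessarily the thesis's exact formulation.

```python
import numpy as np

def expected_gaussian_kernel(x, y, missing, mu, var, sigma2=1.0):
    """E[k(x, Y)] for k(a, b) = exp(-||a - b||^2 / (2 * sigma2)),
    where coordinates of Y flagged in `missing` are treated as
    independent N(mu_i, var_i) random variables and the rest equal y.
    Each missing dimension contributes the Gaussian convolution
    sqrt(sigma2 / (sigma2 + var_i)) *
    exp(-(x_i - mu_i)^2 / (2 * (sigma2 + var_i)))."""
    out = 1.0
    for i in range(len(x)):
        if missing[i]:
            s2 = sigma2 + var[i]
            out *= np.sqrt(sigma2 / s2) * np.exp(-(x[i] - mu[i]) ** 2 / (2 * s2))
        else:
            out *= np.exp(-(x[i] - y[i]) ** 2 / (2 * sigma2))
    return out
```

A Monte Carlo average over draws of the missing coordinate converges to the same value, which is a quick way to sanity-check the closed form.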
96

Simultaneous Variable and Feature Group Selection in Heterogeneous Learning: Optimization and Applications

January 2014 (has links)
abstract: Advances in data collection technologies have made it cost-effective to obtain heterogeneous data from multiple data sources. Very often, the data are of very high dimension, and feature selection is preferred in order to reduce noise, save computational cost, and learn interpretable models. Due to the multi-modal nature of heterogeneous data, it is interesting to design efficient machine learning models that are capable of performing variable selection and feature group (data source) selection simultaneously (a.k.a. bi-level selection). In this thesis, I carry out research along this direction with a particular focus on designing efficient optimization algorithms. I start with a unified bi-level learning model that contains several existing feature selection models as special cases. The proposed model is then further extended to tackle block-wise missing data, one of the major challenges in the diagnosis of Alzheimer's Disease (AD). Moreover, I propose a novel interpretable sparse group feature selection model that greatly facilitates the procedure of parameter tuning and model selection. Last but not least, I show that by solving the sparse group hard thresholding problem directly, the sparse group feature selection model can be further improved in terms of both algorithmic complexity and efficiency. Promising results are demonstrated in an extensive evaluation on multiple real-world data sets. / Dissertation/Thesis / Doctoral Dissertation Computer Science 2014
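One standard building block for the bi-level (variable plus group) selection described above is the proximal operator of the sparse group lasso penalty. The sketch below is a generic version of that operator, not the thesis's specific algorithm or its hard-thresholding variant: the elementwise step zeroes single variables, and the group step zeroes whole feature groups (data sources).

```python
import numpy as np

def prox_sparse_group(v, groups, lam1, lam2):
    """Proximal operator of lam1 * ||w||_1 + lam2 * sum_g ||w_g||_2.
    First soft-threshold elementwise (variable-level sparsity), then
    shrink each group's norm (group-level sparsity): groups whose
    thresholded norm is below lam2 are zeroed out entirely."""
    w = np.sign(v) * np.maximum(np.abs(v) - lam1, 0.0)
    out = np.zeros_like(w)
    for g in groups:
        norm = np.linalg.norm(w[g])
        if norm > lam2:
            out[g] = (1.0 - lam2 / norm) * w[g]   # keep group, shrink it
    return out
```

Raising `lam1` removes individual variables; raising `lam2` removes entire data sources, which is the bi-level behavior the abstract refers to.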
97

Planned Missing Data in Mediation Analysis

January 2015 (has links)
abstract: This dissertation examines a planned missing data design in the context of mediational analysis. The study considered a scenario in which the high cost of an expensive mediator limited sample size, but in which less expensive mediators could be gathered on a larger sample size. Simulated multivariate normal data were generated from a latent variable mediation model with three observed indicator variables, M1, M2, and M3. Planned missingness was implemented on M1 under the missing completely at random mechanism. Five analysis methods were employed: latent variable mediation model with all three mediators as indicators of a latent construct (Method 1), auxiliary variable model with M1 as the mediator and M2 and M3 as auxiliary variables (Method 2), auxiliary variable model with M1 as the mediator and M2 as a single auxiliary variable (Method 3), maximum likelihood estimation including all available data but incorporating only mediator M1 (Method 4), and listwise deletion (Method 5). The main outcome of interest was empirical power to detect the mediated effect. The main effects of mediation effect size, sample size, and missing data rate performed as expected, with power increasing for increasing mediation effect sizes, increasing sample sizes, and decreasing missing data rates. Consistent with expectations, power was greatest for analysis methods that included all three mediators, and power decreased with analysis methods that included less information. Across all design cells relative to the complete data condition, Method 1 with 20% missingness on M1 produced only 2.06% loss in power for the mediated effect; with 50% missingness, 6.02% loss; and with 80% missingness, only 11.86% loss. Method 2 exhibited 20.72% power loss at 80% missingness, even though the total amount of data utilized was the same as Method 1. Methods 3–5 exhibited greater power loss. 
Compared to an average power loss of 11.55% across all levels of missingness for Method 1, average power losses for Methods 3, 4, and 5 were 23.87%, 29.35%, and 32.40%, respectively. In conclusion, planned missingness in a multiple mediator design may permit higher quality characterization of the mediator construct at feasible cost. / Dissertation/Thesis / Doctoral Dissertation Psychology 2015
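The design logic above — power to detect the mediated effect falls as planned missingness on the mediator grows — can be illustrated with a much smaller simulation than the dissertation's. The sketch below uses a single observed mediator, MCAR deletion, and a Sobel test; the dissertation's latent variable models and full-information maximum likelihood estimation are not reproduced here, and all sample and effect sizes are illustrative.

```python
import numpy as np

def ols(X, y):
    """Least squares fit returning coefficients and standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    return beta, np.sqrt(np.diag(sigma2 * XtX_inv))

def mediation_power(n=100, a=0.4, b=0.4, miss_rate=0.5, reps=300, rng=None):
    """Empirical power of the Sobel test for the mediated effect a*b
    when a fraction of mediator observations is deleted MCAR
    (listwise deletion, i.e. the weakest method in the abstract)."""
    rng = rng if rng is not None else np.random.default_rng(2)
    hits = 0
    for _ in range(reps):
        x = rng.standard_normal(n)
        m = a * x + rng.standard_normal(n)
        y = b * m + rng.standard_normal(n)
        keep = rng.random(n) > miss_rate        # MCAR deletion on the mediator
        xs, ms, ys = x[keep], m[keep], y[keep]
        one = np.ones(keep.sum())
        ba, sa = ols(np.column_stack([one, xs]), ms)        # a-path: m ~ x
        bb, sb = ols(np.column_stack([one, ms, xs]), ys)    # b-path: y ~ m + x
        ahat, se_a = ba[1], sa[1]
        bhat, se_b = bb[1], sb[1]
        z = ahat * bhat / np.sqrt(ahat**2 * se_b**2 + bhat**2 * se_a**2)
        hits += abs(z) > 1.96
    return hits / reps
```

Running this with `miss_rate` at 0.0 and 0.6 reproduces the qualitative pattern in the abstract: power drops as planned missingness rises, and the drop is what auxiliary-variable and latent-construct methods are designed to soften.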
98

Three-Level Multiple Imputation: A Fully Conditional Specification Approach

January 2015 (has links)
abstract: Currently, there is a clear gap in the missing data literature for three-level models. To date, the literature has only focused on the theoretical and algorithmic work required to implement three-level imputation using the joint model (JM) method of imputation, leaving relatively no work done on the fully conditional specification (FCS) method. Moreover, the literature lacks any methodological evaluation of three-level imputation. Thus, this thesis serves two purposes: (1) to develop an algorithm in order to implement FCS in the context of a three-level model and (2) to evaluate both imputation methods. The simulation investigated a random intercept model under both 20% and 40% missing data rates. The findings of this thesis suggest that the estimates for both JM and FCS were largely unbiased, gave good coverage, and produced similar results. The sole exception for both methods was the slope for the level-3 variable, which was modestly biased. The bias exhibited by the methods could be due to the small number of clusters used. This finding suggests that future research ought to investigate and establish clear recommendations for the number of clusters required by these imputation methods. To conclude, this thesis serves as a preliminary start in tackling a much larger issue and gap in the current missing data literature. / Dissertation/Thesis / Masters Thesis Psychology 2015
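The FCS idea — impute each incomplete variable, in turn, from a model given the other variables, then cycle — can be sketched for the single-level case. The helper below is our illustrative skeleton (linear models, normal residual noise); the thesis's actual contribution is the three-level clustered extension, which this sketch does not implement.

```python
import numpy as np

def fcs_impute(X, n_iter=10, rng=None):
    """Fully conditional specification (chained equations), single level:
    each variable with missing values is regressed on all the others,
    and its gaps are redrawn from the fitted model plus residual noise.
    Cycling repeatedly lets the imputations stabilize."""
    rng = rng if rng is not None else np.random.default_rng(3)
    X = X.copy()
    miss = np.isnan(X)
    col_mean = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):            # crude starting fill
        X[miss[:, j], j] = col_mean[j]
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            obs = ~miss[:, j]
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])
            beta, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            resid = X[obs, j] - A[obs] @ beta
            sigma = resid.std(ddof=A.shape[1])
            X[miss[:, j], j] = (A[miss[:, j]] @ beta
                                + sigma * rng.standard_normal(miss[:, j].sum()))
    return X
```

Drawing with residual noise (rather than plugging in the regression mean) is what keeps the imputed values usable for variance estimation later.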
99

Multiple imputation of missing data: an application in the Pró-Saúde Study

Thaís de Paulo Rangel 05 March 2013 (has links)
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / Missing data are a common problem in epidemiologic studies and, depending on the way they occur, the resulting estimates may be biased. The literature shows several techniques to deal with this subject, and multiple imputation has been receiving attention in recent years. This dissertation presents the results of applying multiple imputation of missing data in the context of the Pró-Saúde Study, a longitudinal study among civil servants at a university in Rio de Janeiro, Brazil. In the first paper, after simulation of missing data, the variable color/race of the female servants was imputed and analyzed through a previously established survival model, which had the self-reported history of uterine leiomyoma as the outcome. The process was replicated a hundred times in order to determine the distribution of the coefficients and standard errors of the variable being imputed. Although the data were cross-sectionally collected (baseline data of the Pró-Saúde Study, gathered in 1999 and 2001), the follow-up of the servants was reconstructed from self-reported information, creating a situation in which the Cox proportional hazards model could be applied. In the scenarios evaluated, imputation showed satisfactory results, including in the performance assessment. The technique performed well when the missing-data mechanism was MAR (Missing At Random) and the percentage of non-response was 10%. Imputing the data and combining the estimates obtained in the 10 generated datasets (m = 10) yielded a bias of 0.0011 for the black category and 0.0015 for the brown (mixed-race) category, corroborating the efficiency of imputation in this scenario. Other configurations showed similar results. In the second paper, a tutorial was developed for applying multiple imputation in epidemiologic studies, which should facilitate the use of the technique by Brazilian researchers not yet familiar with the procedure. The basic steps and decisions necessary to impute a dataset are presented, and one of the scenarios of the first paper is used as an application example. All analyses were conducted in the R statistical software, version 2.15, and the scripts used are presented at the end of the text.
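The step of "combining the estimates obtained in the 10 generated datasets" is Rubin's pooling rule. A minimal generic sketch (not the dissertation's R scripts) for a single scalar parameter:

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Combine estimates from m imputed datasets using Rubin's rules:
    pooled point estimate, total variance (within-imputation plus
    inflated between-imputation), and the degrees of freedom of the
    pooled t reference distribution."""
    estimates = np.asarray(estimates, float)
    variances = np.asarray(variances, float)
    m = len(estimates)
    qbar = estimates.mean()            # pooled point estimate
    ubar = variances.mean()            # within-imputation variance
    b = estimates.var(ddof=1)          # between-imputation variance
    t = ubar + (1.0 + 1.0 / m) * b     # total variance
    df = (m - 1) * (1.0 + ubar / ((1.0 + 1.0 / m) * b)) ** 2
    return qbar, t, df
```

The between-imputation term is what makes the pooled standard error honestly larger than any single imputed dataset's standard error.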
100

Phylogenomic data for inference of evolutionary relationships among species of the genus Cereus Mill. (Cactaceae, Cereeae)

Juliana Rodrigues Bombonato 04 June 2018 (has links)
Phylogenomic studies using Next Generation Sequencing (NGS) are becoming increasingly common. The use of Double Digest Restriction Site Associated DNA Sequencing (ddRADSeq) markers to this end is promising, at least considering its cost-effectiveness in large datasets of non-model groups, as well as the genome-wide representation recovered in the data. Here we used ddRADSeq to infer the species-level phylogeny of the genus Cereus (Cactaceae). This genus comprises about 25 recognized, predominantly South American species distributed into four subgenera. Our sample includes representatives of Cereus, species from the closely allied genera Cipocereus and Praecereus, and outgroups. The ddRADSeq library was prepared using the EcoRI and HpaII enzymes. After quality control (fragment size and quantification), the library was sequenced on an Illumina HiSeq 2500. The bioinformatic processing of the raw FASTQ files included adapter trimming, quality filtering (FastQC, MultiQC, and SeqyClean software) and SNP calling (iPyRAD software). Three scenarios of permissiveness to missing data were carried out in iPyRAD, recovering datasets with 333 (up to 40% missing data), 1440 (up to 60% missing data) and 6141 (up to 80% missing data) loci. For each dataset, Maximum Likelihood (ML) trees were generated using two supermatrices: linked SNPs and Loci. In general, we observed a few inconsistencies between ML trees generated in different software (IQ-TREE and RAxML) or based on the two matrix types (linked SNPs and Loci). On the other hand, accuracy and resolution were improved using the largest dataset (up to 80% missing data). Overall, we present a phylogeny with unprecedented resolution for the genus Cereus, which was resolved as a likely monophyletic group composed of four main clades with high support in their internal relationships. Further, our data contribute information to the debate about increasing tolerated missing data when conducting phylogenetic analysis with RAD loci.
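The three missing-data scenarios above amount to a per-locus threshold filter. The sketch below mimics that filtering on a random presence/absence matrix; `filter_loci` and its parameter are our illustration of the idea, not iPyRAD's actual interface.

```python
import numpy as np

# Toy loci-by-sample presence matrix: True = locus recovered in that
# sample, False = missing. Values are randomly generated for illustration.
rng = np.random.default_rng(4)
presence = rng.random((1000, 20)) > 0.5      # 1000 loci, 20 samples

def filter_loci(presence, max_missing):
    """Keep loci whose per-locus fraction of missing samples is at most
    max_missing. Raising the threshold (more permissive) retains more
    loci, trading completeness for matrix size -- the trade-off the
    40% / 60% / 80% scenarios in the abstract explore."""
    miss_frac = 1.0 - presence.mean(axis=1)
    return presence[miss_frac <= max_missing]

strict = filter_loci(presence, 0.4)          # analogue of the 40% dataset
permissive = filter_loci(presence, 0.8)      # analogue of the 80% dataset
```

As in the abstract, the permissive threshold yields a much larger supermatrix, which here (and often with RAD loci) is what improves resolution despite the extra gaps.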
