Global ETD Search

21	Modelování vývoje výše škodních událostí / Modeling development of incurred value of claim Kantorová, Petra January 2010 This diploma thesis focuses on estimating the incurred value of a claim and the probability that the claim remains open (not settled) at a specific stage of the insurance settlement process. A change in the incurred value of a claim marks a change of stage in the settlement process. A generalized linear model is used to model these changes. The classical linear regression model is a special case of this theory, with stricter assumptions. Among other things, the generalized linear model allows heteroscedasticity to be handled in an unusual way, using a joint model. This model is applied in the practical part of the thesis. Logistic regression, which is part of generalized linear model theory, is used to model the probability of the claim remaining open. The model output is presented graphically, in particular as graphs of the probability that the level of a given claim will fall within a certain range.
22	Concave selection in generalized linear models Jiang, Dingfeng 01 May 2012 A family of concave penalties, including the smoothly clipped absolute deviation (SCAD) and minimax concave penalties (MCP), has been shown to have attractive properties in variable selection. The computation of concave penalized solutions, however, is a difficult task. We propose a majorization minimization by coordinate descent (MMCD) algorithm to compute the solutions of concave penalized generalized linear models (GLM). In contrast to the existing algorithms that use local quadratic or local linear approximation of the penalty, the MMCD majorizes the negative log-likelihood by a quadratic loss, but does not use any approximation to the penalty. This strategy avoids the computation of scaling factors in iterative steps and hence improves the efficiency of coordinate descent. Under certain regularity conditions, we establish the theoretical convergence property of the MMCD algorithm. We implement this algorithm in a penalized logistic regression model using the SCAD and MCP penalties. Simulation studies and a data example demonstrate that the MMCD works sufficiently fast for the penalized logistic regression in high-dimensional settings where the number of covariates is much larger than the sample size. Grouping structure among predictors exists in many regression applications. We first propose an l2 grouped concave penalty to incorporate such group information in a regression model. The l2 grouped concave penalty performs group selection and includes group Lasso as a special case. An efficient algorithm is developed and its theoretical convergence property is established under certain regularity conditions. The group selection property of the l2 grouped concave penalty is desirable in some applications, while in other applications selection at both group and individual levels is needed. 
Hence, we propose an l1 grouped concave penalty for variable selection at both individual and group levels. An efficient algorithm is also developed for the l1 grouped concave penalty. Simulation studies are performed to evaluate the finite-sample performance of the two grouped concave selection methods. The new grouped penalties are also used in analyzing two motivating datasets. The results from both the simulation and real data analyses demonstrate certain benefits of using grouped penalties. Therefore, the proposed concave group penalties are valuable alternatives to the standard concave penalties. concave penalty generalized linear model high dimensional data MCP SCAD variable selection Biostatistics
23	Multiple Learning for Generalized Linear Models in Big Data Xiang Liu (11819735) 19 December 2021 Big data is an enabling technology in digital transformation. It complements ordinary linear models and generalized linear models well, as training well-performing ordinary linear models and generalized linear models requires huge amounts of data. With the help of big data, ordinary and generalized linear models can be well trained and thus offer better services to human beings. However, many challenges remain in training ordinary linear models and generalized linear models on big data. Among the most prominent are the computational challenges, which refer to the memory inflation and training inefficiency issues that occur when processing data and training models. Hundreds of algorithms were proposed by experts to alleviate or overcome the memory inflation issues. However, the solutions obtained are locally optimal solutions. Additionally, most of the proposed algorithms require loading the dataset into RAM many times when updating the model parameters. If multiple model hyper-parameters need to be computed and compared, e.g. in ridge regression, parallel computing techniques are applied in practice. Thus, multiple learning with sufficient statistics arrays is proposed to tackle the memory inflation and training inefficiency issues. Distributed Computing big data Linear regression analyses Distributed computing Sufficient statistics Generalized Linear Model
24	Extensions to Bayesian generalized linear mixed effects models for household tuberculosis transmission McIntosh, Avery Isaac 12 May 2017 Understanding tuberculosis transmission is vital for efforts at interrupting the spread of disease. Household contact studies that follow persons sharing a household with a TB case (so-called household contacts) and test for latent TB infection by tuberculin skin test conversion give investigators vital information about risk factors for TB transmission. In these studies, investigators often assume secondary cases are infected by the primary TB case, despite substantial evidence that infection from a source outside the home is often equally likely, especially in high-prevalence settings. Investigators may discard information on contacts who test positive at study initiation due to uncertainty of the infection source, or assume infected contacts were infected from the index case prior to study initiation. With either assumption, information on transmission dynamics is lost or incomplete, and estimates of household risk factors for transmission will be biased. This dissertation describes an approach to modeling TB transmission that accounts for community-acquired transmission in the estimation of transmission risk factors from household contact study data. The proposed model generates population-specific estimates of the probability a contact of an infectious case will be infected from a source outside the home (a vital statistic for planning effective interventions to halt disease spread) in addition to estimates of household transmission predictors. We first describe the model analytically, and then apply it to synthetic datasets under different risk scenarios. We then fit the model to data taken from three household contact studies in different locations: Brazil, India, and Uganda. 
Infection predictors such as contact sleeping proximity to the index case and index case disease severity are underestimated in standard models compared to the proposed method, and non-household TB infection risk increases with age stratum, reflecting longer at-risk duration for community-based exposure for older contacts. This analysis will aid public health planners in understanding how best to interrupt TB spread in disparate populations by characterizing where transmission risk is greatest and which risk factors influence household-acquired transmission. Finally, we present an open-source software package in the R environment titled upmfit for modular implementation of the Bayesian Markov Chain Monte Carlo methods used to estimate the model. / 2018-05-10T00:00:00Z Biostatistics Bayesian mixed effects Generalized linear model Hierarchical models Infection Tuberculosis
25	Relational Outlier Detection: Techniques and Applications Lu, Yen-Cheng 10 June 2021 Outlier detection has attracted growing interest. Unlike typical outlier detection problems, relational outlier detection focuses on detecting abnormal patterns in datasets that contain relational implications within each data point. Furthermore, unlike traditional outlier detection, which focuses only on numerical data, modern outlier detection models must be able to handle data of various types and structures. Detecting relational outliers should consider (1) Dependencies among different data types, (2) Data types that are not continuous or do not have ordinal characteristics, such as binary, categorical or multi-label, and (3) Special structures in the data. This thesis focuses on the development of relational outlier detection methods and real-world applications in datasets that contain non-numerical, mixed-type, and special structure data in three tasks, namely (1) outlier detection in mixed-type data, (2) categorical outlier detection in music genre data, and (3) outlier detection in categorized time series data. For the first task, existing solutions for mixed-type data mostly focus on computational efficiency, and their strategies are mostly heuristic driven, lacking a statistical foundation. The proposed contributions of our work include: (1) Constructing a novel unsupervised framework based on a robust generalized linear model (GLM), (2) Developing a model that is capable of capturing large variances of outliers and dependencies among mixed-type observations, and designing an approach for approximating the analytically intractable Bayesian inference, and (3) Conducting extensive experiments to validate effectiveness and efficiency. For the second task, we extended and applied the modeling strategy to a real-world problem. 
The existing solutions to the specific task are mostly supervised, and the traditional outlier detection methods only focus on detecting outliers by the data distributions, ignoring the input-output relation between the genres and the extracted features. The proposed contributions of our work for this task include: (1) Proposing an unsupervised outlier detection framework for music genre data, (2) Extending the GLM based model in the first task to handle categorical responses and developing an approach to approximate the analytically intractable Bayesian inference, and (3) Conducting experiments to demonstrate that the proposed method outperforms the benchmark methods. For the third task, we focused on improving the outlier detection performance in the second task by proposing a novel framework and expanded the research scope to general categorized time-series data. Existing studies have suggested a large number of methods for automatic time series classification. However, there is a lack of research focusing on detecting outliers from manually categorized time series. The proposed contributions of our work for this task include: (1) Proposing a novel semi-supervised robust outlier detection framework for categorized time-series datasets, (2) Further extending the new framework to an active learning system that takes user insights into account, and (3) Conducting a comprehensive set of experiments to demonstrate the performance of the proposed method in real-world applications. / Doctor of Philosophy / In recent years, outlier detection has been one of the most important topics in the data mining and machine learning research domain. Unlike typical outlier detection problems, relational outlier detection focuses on detecting abnormal patterns in datasets that contain relational implications within each data point. 
Detecting relational outliers should consider (1) Dependencies among different data types, (2) Data types that are not continuous or do not have ordinal characteristics, such as binary, categorical or multi-label, and (3) Special structures in the data. This thesis focuses on the development of relational outlier detection methods and real-world applications in datasets that contain non-numerical, mixed-type, and special structure data in three tasks, namely (1) outlier detection in mixed-type data, (2) categorical outlier detection in music genre data, and (3) outlier detection in categorized time series data. The first task aims at constructing a novel unsupervised framework, developing a model that is capable of capturing the normal pattern and the effects, and designing an approach for model fitting. In the second task, we further extended and applied the modeling strategy to a real-world problem in the music technology domain. For the third task, we expanded the research scope from the previous task to general categorized time-series data, and focused on improving the outlier detection performance by proposing a novel semi-supervised framework. Relational Outlier Detection Generalized Linear Model Robust Estimation Music Genre Recognition Time Series Outlier Detection
26	A Consensus Model for Predicting the Distribution of the Threatened Plant Telephus Spurge (Euphorbia Telephioides) Bracken, Jason 02 December 2016 No description available. Conservation Biology Botany SDM species distribution model Telephus spurge Euphorbia telephioides GLM generalized linear model
27	Semiparametric Methods for the Generalized Linear Model Chen, Jinsong 01 July 2010 The generalized linear model (GLM) is a popular model in many research areas. In the GLM, each outcome of the dependent variable is assumed to be generated from a particular distribution function in the exponential family. The mean of the distribution depends on the independent variables. The link function provides the relationship between the linear predictor and the mean of the distribution function. In this dissertation, two semiparametric extensions of the GLM will be developed. In the first part of this dissertation, we have proposed a new model, called a semiparametric generalized linear model with a log-concave random component (SGLM-L). In this model, the estimate of the distribution of the random component has a nonparametric form while the estimate of the systematic part has a parametric form. In the second part of this dissertation, we have proposed a model, called a generalized semiparametric single-index mixed model (GSSIMM). A nonparametric component with a single index is incorporated into the mean function in the generalized linear mixed model (GLMM) assuming that the random component is following a parametric distribution. In the first part of this dissertation, since most of the literature on the GLM deals with the parametric random component, we relax the parametric distribution assumption for the random component of the GLM and impose a log-concave constraint on the distribution. An iterative numerical algorithm for computing the estimators in the SGLM-L is developed. We construct a log-likelihood ratio test for inference. 
In the second part of this dissertation, we use a single index model to generalize the GLMM to have a linear combination of covariates enter the model via a nonparametric mean function, because the linear model in the GLMM is not complex enough to capture the underlying relationship between the response and its associated covariates. The marginal likelihood is approximated using the Laplace method. A penalized quasi-likelihood approach is proposed to estimate the nonparametric function and parameters including single-index coefficients in the GSSIMM. We estimate variance components using marginal quasi-likelihood. Asymptotic properties of the estimators are developed using an idea similar to that of Yu (2008). A simulation example is carried out to compare the performance of the GSSIMM with that of the GLMM. We demonstrate the advantage of our approach using a study of the association between daily air pollutants and daily mortality adjusted for temperature and wind speed in various counties of North Carolina. / Ph. D. Penalized splines Generalized linear mixed model Generalized linear model Single-Index Model
28	Sea turtle bycatch by the U.S. Atlantic pelagic longline fishery: A simulation modeling analysis of estimation methods Barlow, Paige Fithian 01 September 2009 The U.S. pelagic longline fishery catches 98% of domestic swordfish landings but is also one of the three fisheries most affecting federally protected sea turtles (Crowder and Myers 2001, Witherington et al. 2009). Bycatch by fisheries is considered the main anthropogenic threat to sea turtles (NRC 1990). Accurate and precise bycatch estimates are imperative for sea turtle conservation and appropriate fishery management. However, estimation is complicated by only 8% observer coverage of fishing and data that are hierarchical in structure (i.e., multiple sets per trip), zero-heavy (i.e., bycatch is rare), and often overdispersed (i.e., larger variance than expected). Therefore, I evaluated two predominant bycatch estimation methods, the delta-lognormal method and generalized linear models, and investigated improvements in uncertainty incorporation. I constructed a simulation model to evaluate bycatch estimation at two spatial scales under ten spatial models of sea turtle, fishing set, and observer distributions. Results indicated that distributing observers relative to fishing effort and using the delta-lognormal-strata method was most appropriate. The delta-lognormal-strata 95% confidence interval (CI) was wider than statistically appropriate. The delta-lognormal-all sets pooled 95% CI was narrower but simulated bycatch was above the CI too frequently. Thus, I developed a bycatch estimate risk distribution to incorporate uncertainty in bycatch estimates. It gives managers access to the entire distribution of bycatch estimates and their choice of any risk level. Results support the management agency's observer distribution and estimation method but suggest a new procedure to incorporate uncertainty. This study is also informative for many similar datasets. 
/ Master of Science pelagic longline fishery sea turtle simulation model bycatch delta-lognormal estimation generalized linear model
29	Skogsväxters utbredning i relation till pH, latitud och trädsammansättning : Exkursion för ekologiundervisning Carlsson, Rebecka January 2016 This study investigated the impact of three edaphic factors on the distribution of forest plants in Sweden. Based on 2657 plots with 22 common species, Canonical Correspondence Analysis (CCA) and Generalized-linear-model (GLM) were performed with pH measurements in the top layer of the soil, latitude and deciduous tree proportion as explanatory variables. Variation in species occurrence could to a substantial degree be explained by pH, latitude and the proportion of timber volume of deciduous tree species. Furthermore, the majority of species were affected by the studied environmental variables. Therefore, these factors have an important role in the ecological interactions in the forest. All species also showed broad pH-niches with many occurrences spread out within the species' entire pH-range. Finally, the study relates to educational science through designing a meaningful excursion for secondary school when teaching ecology. Forest herb soil pH latitudinal gradient of Sweden Canonical Correspondence Analysis (CCA) Generalized-linear-model (GLM) excursion secondary school ecology
30	Modelos de transição para dados binários / Transition models for binary data Lara, Idemauro Antonio Rodrigues de 31 October 2007 Dados binários ou dicotômicos são comuns em muitas áreas das ciências, nas quais, muitas vezes, há interesse em registrar a ocorrência, ou não, de um evento particular. Por outro lado, quando cada unidade amostral é avaliada em mais de uma ocasião no tempo, tem-se dados longitudinais ou medidas repetidas no tempo. É comum também, nesses estudos, se ter uma ou mais variáveis explicativas associadas às variáveis respostas. As variáveis explicativas podem ser dependentes ou independentes do tempo. Na literatura, há técnicas disponíveis para a modelagem e análise desses dados, sendo os modelos disponíveis extensões dos modelos lineares generalizados. O enfoque do presente trabalho é dado aos modelos lineares generalizados de transição para a análise de dados longitudinais envolvendo uma resposta do tipo binária. Esses modelos são baseados em processos estocásticos e o interesse está em modelar as probabilidades de mudanças ou transições de categorias de respostas dos indivíduos no tempo. A suposição mais utilizada nesses processos é a da propriedade markoviana, a qual condiciona a resposta numa dada ocasião ao estado na ocasião anterior. Assim, são revistos os fundamentos para se especificar tais modelos, distinguindo-se os casos estacionário e não-estacionário. O método da máxima verossimilhança é utilizado para o ajuste dos modelos e estimação das probabilidades. Adicionalmente, apresentam-se testes assintóticos para comparar tratamentos, baseados na razão de chances e na diferença das probabilidades de transição. Outra questão explorada é a combinação do modelo de efeitos aleatórios com a do modelo de transição. Os métodos são ilustrados com um exemplo da área da saúde. 
Para esses dados, o processo é considerado estacionário de ordem dois e o teste proposto sinaliza diferença estatisticamente significativa a favor do tratamento ativo. Apesar de ser uma abordagem inicial dessa metodologia, verifica-se, que os modelos de transição têm notável aplicabilidade e são fontes para estudos e pesquisas futuras. / Binary or dichotomous data are quite common in many fields of science in which there is an interest in registering the occurrence of a particular event. On the other hand, when each sampled unit is evaluated on more than one occasion, we have longitudinal data or repeated measures over time. It is also common, in longitudinal studies, to have explanatory variables associated with response measures, which can be time dependent or independent. In the literature, there are many approaches to modeling and evaluating these data, where the models are extensions of generalized linear models. This work focuses on generalized linear transition models suitable for analyzing longitudinal data with binary response. Such models are based on stochastic processes and we aim to model the probabilities of change or transitions of individual response categories in time. The most used assumption in these processes is the Markov property, in which the response on one occasion depends on the immediately preceding response. Thus we review the fundamentals to specify these models, showing the differences between stationary and non-stationary processes. The maximum likelihood approach is used in order to fit the models and estimate the probabilities. Furthermore, we show asymptotic tests to compare treatments based on odds ratio and on the differences of transition probabilities. We also present a combination of a random-effects model with a transition model. The methods are illustrated with health data. For these data, the process is stationary of order two and the suggested test points to a statistically significant difference in favor of the active treatment. 
This work is an initial approach to transition models, which have high applicability and are great sources for further study and research. Análise de dados longitudinais Analysis of longitudinal data Generalized linear model Likelihood Modelos lineares generalizados Processos estocásticos Stochastic processes Verossimilhança

Search results