Global ETD Search

21	Gaussian copula modelling for integer-valued time series Lennon, Hannah January 2016 (has links) This thesis is concerned with the modelling of integer-valued time series. Such data arise naturally in many areas whenever a number of events is observed over time. The model considered in this study consists of a Gaussian copula with autoregressive-moving average (ARMA) dependence and discrete margins that can be specified or unspecified, with or without covariates. It can be interpreted as a 'digitised' ARMA model. An ARMA model is used for the latent process so that well-established methods in time series analysis can be used. Still, the computation of the log-likelihood poses many problems because it is the sum of 2^N terms involving the Gaussian cumulative distribution function, where N is the length of the time series. We consider a Monte Carlo Expectation-Maximisation (MCEM) algorithm for the maximum likelihood estimation of the model, which works well for small to moderate N. Then an Approximate Bayesian Computation (ABC) method is developed to take advantage of the fact that data can be simulated easily from an ARMA model and digitised. A spectral comparison method is used in the rejection-acceptance step. This is shown to work well for large N. Finally, we write the model in an R-vine copula representation and use a sequential algorithm for the computation of the log-likelihood. We evaluate the score and Hessian of the log-likelihood and give analytic solutions for the standard errors. The proposed methodologies are illustrated using simulation studies, which highlight the advantages of incorporating classic ideas from time series analysis into modern methods of model fitting. For illustration, we compare the three methods on US polio incidence data (Zeger, 1988) and discuss their relative merits. 519.5
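The ABC idea in the abstract above (simulate a latent ARMA process, digitise it, and accept or reject by comparing spectra) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's algorithm: it assumes an AR(1) latent process, Poisson margins, uniform priors, and a plain Euclidean distance between raw periodograms, and all function names are hypothetical.

```python
import cmath
import math
import random
from statistics import NormalDist


def poisson_quantile(u, lam):
    """Smallest k with P(Y <= k) >= u for Y ~ Poisson(lam)."""
    k, pmf = 0, math.exp(-lam)
    cdf = pmf
    while cdf < u and k < 1000:  # cap guards against u rounding to 1.0
        k += 1
        pmf *= lam / k
        cdf += pmf
    return k


def simulate_counts(phi, lam, n, rng):
    """Latent Gaussian AR(1) with unit marginal variance, digitised to Poisson counts."""
    std_normal = NormalDist()
    z = rng.gauss(0.0, 1.0)
    counts = []
    for _ in range(n):
        # Map the latent value to (0, 1), then through the Poisson quantile function.
        counts.append(poisson_quantile(std_normal.cdf(z), lam))
        z = phi * z + math.sqrt(1.0 - phi * phi) * rng.gauss(0.0, 1.0)
    return counts


def periodogram(x):
    """Raw periodogram of the mean-centred series at the positive Fourier frequencies."""
    n = len(x)
    mean = sum(x) / n
    centred = [v - mean for v in x]
    values = []
    for k in range(1, n // 2 + 1):
        s = sum(c * cmath.exp(-2j * math.pi * k * t / n) for t, c in enumerate(centred))
        values.append(abs(s) ** 2 / n)
    return values


def spectral_distance(x, y):
    """Root-mean-square difference between two periodograms (an assumed distance)."""
    px, py = periodogram(x), periodogram(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(px, py)) / len(px))


def abc_rejection(observed, n_draws, tolerance, rng):
    """Keep prior draws whose simulated series is spectrally close to the data."""
    accepted = []
    for _ in range(n_draws):
        phi = rng.uniform(-0.9, 0.9)   # assumed prior on the AR coefficient
        lam = rng.uniform(0.5, 10.0)   # assumed prior on the Poisson mean
        simulated = simulate_counts(phi, lam, len(observed), rng)
        if spectral_distance(observed, simulated) <= tolerance:
            accepted.append((phi, lam))
    return accepted
```

Because the periodogram is computed on the centred series, this distance carries little information about the marginal mean; a practical version would probably also compare summaries of the margins.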
22	Resgatando a diversidade genética e história demográfica de povos nativos americanos através de populações mestiças do sul do Brasil e Uruguai / Rescuing the genetic diversity and demographic history of native american peoples through mestizo populations of Southern Brazil and Uruguay Tavares, Gustavo Medina January 2018 (has links) After the arrival of the European conquerors, the Native American populations were decimated for multiple reasons, such as wars and diseases, which possibly led many autochthonous genetic lineages to extinction. However, during the European invasion of the Americas, colonizers and indigenous people admixed, and many genetic studies have shown an important Native American matrilineal contribution to the formation of the colonial population. Therefore, if many individuals in the current Brazilian urban population harbor Native American lineages in their mitochondrial DNA (mtDNA), much of the Native American genetic diversity that was lost during the colonial era may have been maintained by admixture in urban populations. In this case, these populations effectively represent an important reservoir of Native lineages in Brazil and other American countries, constituting the most accurate portrait of the pre-Columbian genetic diversity of Native populations. Based on this, the aims of the present study were 1) to compare the patterns of genetic diversity of Native American mtDNA lineages in Native populations from Southern Brazil and the surrounding admixed urban populations; and 2) to compare, using Approximate Bayesian Computation (ABC), the demographic history of both groups to estimate the level of reduction in the effective population size (Ne) of the indigenous groups studied here. We used previously published mtDNA hypervariable segment (HVS-I) data of indigenous origin from 396 Native American individuals (NAT) belonging to the Guarani, Kaingang, and Charrua groups, and from 309 individuals from Southern Brazilian and Uruguayan admixed urban populations (URB). The analyses of variability and genetic structure, as well as the neutrality tests, were carried out using Arlequin 3.5, and the mitochondrial haplotype network was estimated through the Median-Joining method available in Network 5.0.
Time estimates of effective population size were obtained using the Bayesian Skyline Plot available in the BEAST 1.8.4 package. Finally, the DIYABC 2.1 software was used to test evolutionary scenarios and to estimate the pre-contact (Nanc) and post-contact (Nnat) Native American Ne, and thereby the impact of the colonization process on Native American genetic variability. The results indicate that URB is the best predictor of ancestral Native diversity, having substantially greater genetic diversity than NAT, at least in the Southern Brazilian and Uruguayan regions (H = 0.96 vs. 0.85, Nhap = 131 vs. 27, respectively). Moreover, the haplogroup compositions are very distinct between these groups, suggesting that the Native population passed through bottleneck events affecting the haplogroups B2 and C1, and overrepresenting the haplogroup A2. In relation to demographic history, we observed that URB retains signals of population expansion dating back to the entry into the Americas. In contrast, these signals are eroded in NAT, which retains only signals of recent population contraction. According to our estimates, the population decline in NAT was around 300x (84–555x). In other words, the effective Native American population in this region would correspond to only 0.33% (0.18%–1.19%) of the ancestral population, corroborating the findings of other genetic studies and historical records. Population genetics Mitochondrial DNA Amerindians Uruguay Southern Brazil Approximate Bayesian computation Admixture
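As a quick arithmetic check on the figures in this abstract, the quoted percentages are the reciprocals of the fold-declines, writing N_nat and N_anc for the post- and pre-contact effective sizes as in the abstract:

```latex
\[
\frac{N_{\mathrm{nat}}}{N_{\mathrm{anc}}} \approx \frac{1}{300} \approx 0.33\%,
\qquad
\frac{1}{555} \approx 0.18\%,
\qquad
\frac{1}{84} \approx 1.19\%.
\]
```

The lower bound of the percentage interval therefore corresponds to the largest fold-decline (555x), and the upper bound to the smallest (84x).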
23	Inferring Viral Dynamics from Sequence Data Ibeh, Neke January 2016 (has links) One of the primary objectives of infectious disease research is uncovering the direct link that exists between viral population dynamics and molecular evolution. For RNA viruses in particular, evolution occurs at such a rapid pace that epidemiological processes become ingrained into gene sequences. Conceptually, this link is easy to make: as RNA viruses spread throughout a population, they evolve with each new host infection. However, developing a quantitative understanding of this connection is difficult. Thus, the emerging discipline of phylodynamics is centered on reconciling epidemiology and phylogenetics using genetic analysis. Here, we present two research studies that draw on phylodynamic principles in order to characterize the progression and evolution of the Ebola virus and the human immunodeficiency virus (HIV). In the first study, the interplay between selection and epistasis in the Ebola virus genome is elucidated through the ancestral reconstruction of a critical region in the Ebola virus glycoprotein. Hence, we provide a novel mechanistic account of the structural changes that led up to the 2014 Ebola virus outbreak. The second study applies an approximate Bayesian computation (ABC) approach to the inference of epidemiological parameters. First, we demonstrate the accuracy of this approach with simulated data. Then, we infer the dynamics of the Swiss HIV-1 epidemic, illustrating the applicability of this statistical method to the public health sector. Altogether, this thesis unravels some of the complex dynamics that shape epidemic progression, and provides potential avenues for facilitating viral surveillance efforts. phylodynamics phylogenetics approximate Bayesian computation virology epidemiology epistasis HIV/AIDS Ebola diversifying selection mucin-like domain
25	Modern Monte Carlo Methods and Their Application in Semiparametric Regression Thomas, Samuel Joseph 05 1900 (has links) Indiana University-Purdue University Indianapolis (IUPUI) / The essence of Bayesian data analysis is to ascertain posterior distributions. Posteriors generally do not have closed-form expressions for direct computation in practical applications. Analysts, therefore, resort to Markov Chain Monte Carlo (MCMC) methods for the generation of sample observations that approximate the desired posterior distribution. Standard MCMC methods simulate sample values from the desired posterior distribution via random proposals. As a result, the mechanism used to generate the proposals inevitably determines the efficiency of the algorithm. One of the modern MCMC techniques designed to explore the high-dimensional space more efficiently is Hamiltonian Monte Carlo (HMC), based on the Hamiltonian differential equations. Inspired by classical mechanics, these equations incorporate a latent variable to generate MCMC proposals that are likely to be accepted. This dissertation discusses how such a powerful computational approach can be used for implementing statistical models. Along this line, I created a unified computational procedure for using HMC to fit various types of statistical models. The procedure that I proposed can be applied to a broad class of models, including linear models, generalized linear models, mixed-effects models, and various types of semiparametric regression models. To facilitate the fitting of a diverse set of models, I incorporated new parameterization and decomposition schemes to ensure the numerical performance of Bayesian model fitting without sacrificing the procedure’s general applicability. As a concrete application, I demonstrate how to use the proposed procedure to fit a multivariate generalized additive model (GAM), a nonstandard statistical model with a complex covariance structure and numerous parameters. 
Byproducts of the research include two software packages that allow practical data analysts to use the proposed computational method to fit their own models. The research’s main methodological contribution is the unified computational approach that it presents for Bayesian model fitting, which can be used for standard and nonstandard statistical models. The availability of such a procedure greatly enhances the statistical modeler’s toolbox for implementing new and nonstandard statistical models. Bayesian Computation Generalized Additive Model Hamiltonian Monte Carlo Markov Chain Monte Carlo Semiparametric Regression
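The Hamiltonian Monte Carlo mechanism described in this abstract (an auxiliary momentum variable plus Hamiltonian dynamics to generate proposals that are likely to be accepted) can be illustrated with a minimal one-dimensional sketch. It is not the dissertation's procedure or its software packages: the function names, the unit mass, and the standard normal target used in the usage note are assumptions made for illustration.

```python
import math
import random


def leapfrog(q, p, grad_u, step, n_steps):
    """Integrate Hamiltonian dynamics with the leapfrog scheme (unit mass)."""
    p = p - 0.5 * step * grad_u(q)          # initial half step for momentum
    for i in range(n_steps):
        q = q + step * p                    # full step for position
        if i < n_steps - 1:
            p = p - step * grad_u(q)        # full step for momentum
    p = p - 0.5 * step * grad_u(q)          # final half step for momentum
    return q, p


def hmc_sample(u, grad_u, q0, n_samples, step, n_steps, rng):
    """Draw samples from the density proportional to exp(-u(q))."""
    samples, q = [], q0
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)             # fresh latent momentum each iteration
        h_old = u(q) + 0.5 * p * p
        q_new, p_new = leapfrog(q, p, grad_u, step, n_steps)
        h_new = u(q_new) + 0.5 * p_new * p_new
        # Metropolis correction for the integrator's discretisation error.
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            q = q_new
        samples.append(q)
    return samples
```

For a standard normal target one would pass `u = lambda q: 0.5 * q * q` and `grad_u = lambda q: q`; the step size and number of leapfrog steps are tuning choices that trade acceptance rate against trajectory length.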
26	Bayesian estimation of factor analysis models with incomplete data Merkle, Edgar C. 10 October 2005 (has links) No description available. Bayesian computation Factor analysis Missing data Incomplete data Data augmentation Multiple imputation
27	Likelihood-Free Bayesian Modeling Turner, Brandon Michael 15 December 2011 (has links) No description available. Quantitative Psychology approximate Bayesian computation Bayesian modeling likelihood-free inference memory models mixture algorithm
28	Calibration of Breast Cancer Natural History Models Using Approximate Bayesian Computation / Kalibrering av natural history models för bröstcancer med approximate bayesian computation Bergqvist, Oscar January 2020 (has links) Natural history models for breast cancer describe the unobservable disease progression. These models can either be fitted using likelihood-based estimation to data on individual tumour characteristics, or calibrated to fit statistics at a population level. Likelihood-based inference using individual level data has the advantage of ensuring model parameter identifiability. However, the likelihood function can be computationally heavy to evaluate or even intractable. In this thesis likelihood-free estimation using Approximate Bayesian Computation (ABC) will be explored. The main objective is to investigate whether ABC can be used to fit models to data collected in the presence of mammography screening. As a background, a literature review of ABC is provided. As a first step an ABC-MCMC algorithm is constructed for two simple models both describing populations in absence of mammography screening, but assuming different functional forms of tumour growth. The algorithm is evaluated for these models in a simulation study using synthetic data, and compared with results obtained using likelihood-based inference. Later, it is investigated whether ABC can be used for the models in presence of screening. The findings of this thesis indicate that ABC is not directly applicable to these models. However, by including a sub-model for tumour onset and assuming that all individuals in the population have the same screening attendance it was possible to develop an ABC-MCMC algorithm that carefully takes individual level data into consideration in the estimation procedure. Finally, the algorithm was tested in a simple simulation study using synthetic data. 
Future research is still needed to evaluate the statistical properties of the algorithm (using extended simulation) and to test it on observational data where previous estimates are available for reference. Approximate Bayesian Computation ABC breast cancer natural history models random effects Bayesian statistics likelihood-free inference Probability Theory and Statistics
29	Topics in Modern Bayesian Computation Qamar, Shaan January 2015 (has links) The collection of large volumes of rich and complex data has become ubiquitous in recent years, posing new challenges in methodological and theoretical statistics alike. Today, statisticians are tasked with developing flexible methods capable of adapting to the degree of complexity and noise in increasingly rich data gathered across a variety of disciplines and settings. This has spurred the need for novel multivariate regression techniques that can efficiently capture a wide range of naturally occurring predictor-response relations, identify important predictors and their interactions, and do so even when the number of predictors is large but the sample size remains limited. Meanwhile, efficient model fitting tools must evolve quickly to keep pace with the rapidly growing dimension and complexity of the data they are applied to. Aided by the success of modern computing, Bayesian methods have gained tremendous popularity in recent years. These methods provide a natural probabilistic characterization of uncertainty in the parameters and in predictions. In addition, they provide a practical way of encoding model structure that can lead to large gains in statistical estimation and more interpretable results. However, this flexibility is often hindered in applications to modern data, which are increasingly high dimensional, both in the number of observations n and the number of predictors p. Here, computational complexity and the curse of dimensionality typically render posterior computation inefficient.
In particular, Markov chain Monte Carlo (MCMC) methods, which remain the workhorse for Bayesian computation (owing to their generality and asymptotic accuracy guarantee), typically suffer data processing and computational bottlenecks as a consequence of (i) the need to hold the entire dataset (or available sufficient statistics) in memory at once; and (ii) having to evaluate the (often expensive to compute) data likelihood at each sampling iteration. This thesis divides into two parts. The first part concerns itself with developing efficient MCMC methods for posterior computation in the high dimensional large-n, large-p setting. In particular, we develop an efficient and widely applicable approximate inference algorithm that extends MCMC to the online data setting, and separately propose a novel stochastic search sampling scheme for variable selection in high dimensional predictor settings. The second part of this thesis develops novel methods for structured sparsity in the high-dimensional large-p, small-n regression setting. Here, statistical methods should scale well with the predictor dimension and be able to efficiently identify low dimensional structure so as to facilitate optimal statistical estimation in the presence of limited data. Importantly, these methods must be flexible enough to accommodate potentially complex relationships between the response and its associated explanatory variables. The first work proposes a nonparametric additive Gaussian process model to learn predictor-response relations that may be highly nonlinear and include numerous lower order interaction effects, possibly in different parts of the predictor space. A second work proposes a novel class of Bayesian shrinkage priors for multivariate regression with a tensor valued predictor.
Dimension reduction is achieved using a low-rank additive decomposition for the latter, enabling a highly flexible and rich structure within which excellent cell-estimation and region selection may be obtained through state-of-the-art shrinkage methods. In addition, the methods developed in these works come with strong theoretical guarantees. / Dissertation Statistics Approximate Bayesian computation High dimensional regression Nonparametric regression Scalable Markov chain Monte Carlo Structured additive models Variable selection
30	Multi-objective ROC learning for classification Clark, Andrew Robert James January 2011 (has links) Receiver operating characteristic (ROC) curves are widely used for evaluating classifier performance, having been applied to e.g. signal detection, medical diagnostics and safety critical systems. They allow examination of the trade-offs between true and false positive rates as misclassification costs are varied. Examination of the resulting graphs and calculation of the area under the ROC curve (AUC) allows assessment of how well a classifier is able to separate two classes and allows selection of an operating point with full knowledge of the available trade-offs. In this thesis a multi-objective evolutionary algorithm (MOEA) is used to find classifiers whose ROC graph locations are Pareto optimal. The Relevance Vector Machine (RVM) is a state-of-the-art classifier that produces sparse Bayesian models, but is unfortunately prone to overfitting. Using the MOEA, hyper-parameters for RVM classifiers are set, optimising them not only in terms of true and false positive rates but also a novel measure of RVM complexity, thus encouraging sparseness, and producing approximations to the Pareto front. Several methods for regularising the RVM during the MOEA training process are examined and their performance evaluated on a number of benchmark datasets demonstrating they possess the capability to avoid overfitting whilst producing performance equivalent to that of the maximum likelihood trained RVM. A common task in bioinformatics is to identify genes associated with various genetic conditions by finding those genes useful for classifying a condition against a baseline. Typically, datasets contain large numbers of gene expressions measured in relatively few subjects. As a result of the high dimensionality and sparsity of examples, it can be very easy to find classifiers with near perfect training accuracies but which have poor generalisation capability.
Additionally, depending on the condition and treatment involved, evaluation over a range of costs will often be desirable. An MOEA is used to identify genes for classification by simultaneously maximising the area under the ROC curve whilst minimising model complexity. This method is illustrated on a number of well-studied datasets and applied to a recent bioinformatics database resulting from the current InChianti population study. Many classifiers produce “hard”, non-probabilistic classifications and are trained to find a single set of parameters, whose values are inevitably uncertain due to limited available training data. In a Bayesian framework it is possible to ameliorate the effects of this parameter uncertainty by averaging over classifiers weighted by their posterior probability. Unfortunately, the required posterior probability is not readily computed for hard classifiers. In this thesis an Approximate Bayesian Computation Markov Chain Monte Carlo algorithm is used to sample model parameters for a hard classifier using the AUC as a measure of performance. The ability to produce ROC curves close to the Bayes optimal ROC curve is demonstrated on a synthetic dataset. Due to the large numbers of sampled parametrisations, averaging over them when rapid classification is needed may be impractical and thus methods for producing sparse weightings are investigated. 519.6
