1 |
Quantile-based generalized logistic distribution
Omachar, Brenda V. January 2014
This dissertation develops a new quantile-based generalized logistic distribution (GLDQB), using the quantile function of the generalized logistic distribution (GLO) as the basic building block. This four-parameter distribution is highly flexible with respect to distributional shape: its two shape parameters let it accommodate a wide range of skewness and kurtosis. The parameter space and the distributional shape properties are discussed at length. The distribution is characterized through its L-moments, and an algorithm is presented for estimating its parameters by the method of L-moments. The new distribution is then used to fit and approximate the probability distribution of a data set.
Dissertation (MSc)--University of Pretoria, 2014. / Statistics / MSc / Unrestricted
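To make the L-moment machinery mentioned in the abstract concrete, the sketch below computes sample L-moments from probability-weighted moments and fits the three-parameter GLO building block. It does not implement the four-parameter GLDQB, whose quantile function is not given in this listing. The parameterization (location `xi`, scale `alpha`, shape `k`) follows Hosking's convention and is an assumption here, as are all function names and the sample data.

```python
import math

def sample_lmoments(data):
    """Return (l1, l2, t3): first two sample L-moments and the L-skewness."""
    x = sorted(data)
    n = len(x)
    if n < 3:
        raise ValueError("need at least 3 observations")
    # Unbiased probability-weighted moments b0, b1, b2 (j is the 0-based rank).
    b0 = sum(x) / n
    b1 = sum(j * x[j] for j in range(n)) / (n * (n - 1))
    b2 = sum(j * (j - 1) * x[j] for j in range(n)) / (n * (n - 1) * (n - 2))
    l1 = b0
    l2 = 2 * b1 - b0
    l3 = 6 * b2 - 6 * b1 + b0
    return l1, l2, l3 / l2

def glo_quantile(F, xi, alpha, k):
    """GLO quantile function (Hosking's parameterization, assumed)."""
    if not 0.0 < F < 1.0:
        raise ValueError("F must lie strictly between 0 and 1")
    odds = (1.0 - F) / F
    if abs(k) < 1e-12:
        return xi - alpha * math.log(odds)  # logistic limit as k -> 0
    return xi + alpha * (1.0 - odds ** k) / k

def glo_fit_lmoments(l1, l2, t3):
    """Method-of-L-moments estimates (xi, alpha, k) for the GLO distribution."""
    k = -t3
    if abs(k) >= 1.0:
        raise ValueError("L-skewness must lie strictly between -1 and 1")
    if abs(k) < 1e-12:
        return l1, l2, 0.0  # logistic case: lambda1 = xi, lambda2 = alpha
    ratio = k * math.pi / math.sin(k * math.pi)  # equals lambda2 / alpha
    alpha = l2 / ratio
    xi = l1 - alpha * (1.0 / k - math.pi / math.sin(k * math.pi))
    return xi, alpha, k

# Hypothetical data, for illustration only.
sample = [12.1, 15.3, 9.8, 22.4, 13.7, 18.0, 11.2, 30.5, 14.9, 16.6]
xi, alpha, k = glo_fit_lmoments(*sample_lmoments(sample))
print("fitted GLO:", xi, alpha, k)
print("estimated 0.99 quantile:", glo_quantile(0.99, xi, alpha, k))
```

Working through the quantile function rather than the density is what makes this family convenient: L-moments are integrals of the quantile function, so both fitting and quantile estimation avoid numerical inversion.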
|
2 |
Contribution to Statistical Techniques for Identifying Differentially Expressed Genes in Microarray Data
Hossain, Ahmed 30 August 2011
With the development of DNA microarray technology, scientists can now measure the expression levels of thousands of genes (features or genomic biomarkers) simultaneously in a single experiment. Robust and accurate gene selection methods are required to identify differentially expressed genes across different samples for disease diagnosis or prognosis. The problem of identifying significantly differentially expressed genes can be stated as follows: given gene expression measurements from an experiment with two (or more) conditions, find the subset of genes whose expression levels differ significantly across these conditions.
Analysis of genomic data is challenging because of the high dimensionality of the data and the low sample size. Several mathematical and statistical methods currently exist for identifying significantly differentially expressed genes. These methods typically focus on gene-by-gene analysis within a parametric hypothesis testing framework. In this study, we propose three flexible procedures for analyzing microarray data.
The first method is parametric: it is based on a flexible distribution, the Generalized Logistic Distribution of Type II (GLDII), for which an approximate likelihood ratio test (ALRT) is developed. Although the method performs gene-by-gene analysis, the ALRT with the GLDII distributional assumption appears to provide a favourable fit to microarray data.
In the second method we propose a test statistic for testing whether the area under the receiver operating characteristic curve (AUC) for each gene is greater than 0.5, allowing different variances for each gene. This method is computationally less intensive and can identify genes that are reasonably stable with satisfactory prediction performance. The third method compares the AUCs of a pair of genes and is designed for selecting highly correlated genes in microarray datasets. We propose a nonparametric procedure for selecting genes whose expression levels are correlated with that of a "seed" gene in microarray experiments. The test proposed by DeLong et al. (1988) is the conventional nonparametric procedure for comparing correlated AUCs. It uses a consistent variance estimator and relies on the asymptotic normality of the AUC estimator. Our proposed method uses DeLong's variance estimation technique to compare pairs of genes and can identify genes with biologically sound implications.
In this thesis, we focus on the primary step in the gene selection process, namely the ranking of genes with respect to a statistical measure of differential expression. We assess the proposed approaches through extensive simulation studies and demonstrate the methods on real datasets. The simulation study indicates that the parametric method performs well across all settings of variance, sample size and treatment effect considered. Importantly, the method is found to be less sensitive to contamination by noise. The proposed nonparametric methods do not involve complicated formulas and do not require advanced programming skills. Both methods can also identify a large fraction of truly differentially expressed (DE) genes, especially when the sample size is large or outliers are present. We conclude that the proposed methods are good analytical tools for identifying DE genes for further biological and clinical analysis.
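The AUC-based idea can be illustrated with a short sketch. It is not the thesis's test statistic, which is not specified in this abstract; it is a generic one-sided z test of AUC > 0.5 for a single gene, using the empirical (Mann-Whitney) AUC and the DeLong et al. (1988) variance estimate. The function names and the expression values are hypothetical.

```python
import math

def auc_delong(pos, neg):
    """Empirical AUC and its DeLong variance estimate for one marker (e.g. one gene)."""
    m, n = len(pos), len(neg)
    if m < 2 or n < 2:
        raise ValueError("need at least two observations per group")

    def psi(x, y):
        # Mann-Whitney kernel: 1 if the positive exceeds the negative, 0.5 on ties.
        return 1.0 if x > y else 0.5 if x == y else 0.0

    # Structural components: one per positive sample and one per negative sample.
    v10 = [sum(psi(x, y) for y in neg) / n for x in pos]
    v01 = [sum(psi(x, y) for x in pos) / m for y in neg]
    auc = sum(v10) / m
    s10 = sum((v - auc) ** 2 for v in v10) / (m - 1)
    s01 = sum((v - auc) ** 2 for v in v01) / (n - 1)
    return auc, s10 / m + s01 / n

def auc_z_test(pos, neg):
    """One-sided z test of H0: AUC = 0.5 against H1: AUC > 0.5."""
    auc, var = auc_delong(pos, neg)
    if var == 0.0:
        # Degenerate case (e.g. perfect separation): the statistic is unbounded.
        z = 0.0 if auc == 0.5 else math.copysign(math.inf, auc - 0.5)
    else:
        z = (auc - 0.5) / math.sqrt(var)
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail normal probability
    return auc, z, p_value

# Hypothetical expression values of one gene in two conditions.
diseased = [3.0, 4.0, 5.0, 6.0]
healthy = [1.0, 2.0, 3.5, 2.5]
print(auc_z_test(diseased, healthy))
```

Ranking genes by such a statistic requires only pairwise comparisons within each gene, which is consistent with the abstract's remark that the nonparametric methods are computationally light. The normal approximation is asymptotic, so with very small groups the p-value should be read with caution.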
|
4 |
Bayesian Learning Under Nonnormality
Yilmaz, Yildiz Elif 01 December 2004
The naive Bayes classifier and maximum likelihood hypotheses in Bayesian learning are considered when the errors have a non-normal distribution. For the location and scale parameters, efficient and robust estimators obtained by the modified maximum likelihood (MML) estimation technique are used. In the naive Bayes classifier, the error distributions are allowed to differ from class to class and from feature to feature, and the Generalized Secant Hyperbolic (GSH) and Generalized Logistic (GL) distribution families are used instead of the normal distribution. It is shown that the resulting non-normal naive Bayes classifier classifies the data more accurately than the one based on the normality assumption. Furthermore, the maximum likelihood (ML) hypotheses are obtained under the assumption of non-normality, and these also produce better results than the conventional ML approach.
|
5 |
Goodness-of-fit Tests Based On Censored Samples
Cigsar, Candemir 01 July 2005
In this study, the most prominent goodness-of-fit tests for censored samples are reviewed. The power properties of goodness-of-fit statistics are investigated for the null hypothesis that a sample censored from the right, from the left, or from both sides comes from a uniform, normal or exponential distribution. By a similar argument, the extreme value, Student's t with 6 degrees of freedom and generalized logistic distributions are then discussed in detail through a comprehensive simulation study. A variety of real-life applications are given. Suitable test statistics for testing the above distributions with censored samples are also suggested in the conclusion.
|
6 |
Statistical Inference From Complete And Incomplete Data
Can Mutan, Oya 01 January 2010
Let X and Y be two random variables such that Y depends on X=x. This is a very common situation in many real-life applications. The problem is to estimate the location and scale parameters in the marginal distributions of X and Y and in the conditional distribution of Y given X=x. We are also interested in estimating the regression coefficient and the correlation coefficient. There is a cost constraint on observing X=x: the larger x is, the more expensive it becomes. The allowable sample size n is governed by a pre-determined total cost. This can lead to a situation where some of the largest X=x observations cannot be observed (Type II censoring). Two general methods of estimation are available: the method of least squares and the method of maximum likelihood. For most non-normal distributions, however, the latter is analytically and computationally problematic. Instead, we use the method of modified maximum likelihood estimation, which is known to be essentially as efficient as maximum likelihood estimation. The method has a distinct advantage: it yields estimators that are explicit functions of the sample observations and are therefore analytically and computationally straightforward. Specifically, the problem in this thesis is to evaluate the effect of the largest order statistics x(i) (i > n-r) in a random sample of size n (i) on the mean E(X) and variance V(X) of X, (ii) on the cost of observing the x-observations, (iii) on the conditional mean E(Y|X=x) and variance V(Y|X=x) and (iv) on the regression coefficient. It is shown that unduly large x-observations have a detrimental effect on the allowable sample size and on the estimators, both least squares and modified maximum likelihood. The advantage of not observing a few of the largest observations is evaluated. The distributions considered are the Weibull, the Generalized Logistic and the scaled Student's t.
|
7 |
Distribuição generalizada de chuvas máximas no Estado do Paraná. / Local and regional frequency analysis by LH-moments and generalized distributions
Pansera, Wagner Alessandro 07 December 2013
The purpose of hydrologic frequency analysis is to relate the magnitude of events to their frequency of occurrence by means of a probability distribution. Three generalized probability distributions can be used to study extreme hydrological events: generalized extreme value, generalized logistic and generalized Pareto. Several methods exist for estimating the parameters of probability distributions; L-moments are often used because they are computationally convenient. The reliability of quantiles with high return periods can be increased by using LH-moments, or higher-order L-moments. L-moments have been widely studied, but the literature on LH-moments is sparse, so more research is needed. In this study, LH-moments were therefore examined under two approaches commonly used in hydrology: (i) local frequency analysis (LFA) and (ii) regional frequency analysis (RFA). A database of annual maximum daily rainfall from 227 rainfall stations in Paraná State, covering 1976 to 2006, was assembled. LFA was carried out in two steps: (i) Monte Carlo simulations and (ii) application of the results to the database. The main result of the Monte Carlo simulations was that LH-moments make the 0.99 and 0.995 quantiles less biased. The simulations also made it possible to create an algorithm for performing LFA with the generalized distributions. The algorithm was applied to the database and made it possible to fit all 227 series studied. In RFA, the 227 stations were divided into 11 groups and regional growth curves were obtained; local quantiles were then obtained from the regional growth curves. The difference between the local quantiles obtained via LFA and those obtained via RFA was quantified. The differences can be approximately 33 mm for return periods of 100 years.
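As a rough illustration of what LH-moments measure, the sketch below computes the first two sample LH-moments of order `eta` using the standard unbiased order-statistic estimators from the LH-moment literature, and converts a return period into the corresponding non-exceedance probability. It is not the thesis's algorithm; the function names and the rainfall values are hypothetical, and `math.comb` assumes Python 3.8 or later.

```python
from math import comb

def sample_lh_moments(data, eta=0):
    """First two sample LH-moments of order eta (eta = 0 gives ordinary L-moments)."""
    x = sorted(data)
    n = len(x)
    if eta < 0 or n < eta + 2:
        raise ValueError("need eta >= 0 and at least eta + 2 observations")
    # lam1 estimates the expected maximum of eta + 1 draws; i is the 1-based rank.
    lam1 = sum(comb(i - 1, eta) * x[i - 1] for i in range(1, n + 1)) / comb(n, eta + 1)
    # lam2 estimates half the expected gap between the two largest of eta + 2 draws.
    lam2 = 0.5 * sum(
        (comb(i - 1, eta + 1) - comb(i - 1, eta) * (n - i)) * x[i - 1]
        for i in range(1, n + 1)
    ) / comb(n, eta + 2)
    return lam1, lam2

def return_period_to_prob(T):
    """Non-exceedance probability of the T-year event for annual maxima."""
    if T <= 1:
        raise ValueError("return period must exceed 1 year")
    return 1.0 - 1.0 / T

# Hypothetical annual maximum daily rainfall (mm), for illustration only.
rain = [62.0, 75.5, 58.3, 91.2, 70.4, 84.9, 66.1, 103.7, 79.8, 72.6]
for eta in (0, 1, 2):
    print(eta, sample_lh_moments(rain, eta))
print(return_period_to_prob(100), return_period_to_prob(200))
```

Raising `eta` shifts weight toward the largest observations, which is why these moments are of interest when the target is an upper-tail quantile such as the 0.99 or 0.995 quantile (return periods of 100 and 200 years).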
|