1 |
New Methods for Eliminating Inferior Treatments in Clinical Trials. Lin, Chen-ju. 26 June 2007.
Multiple comparisons and selection procedures are commonly studied in research and employed in applications. Clinical trials are one of the fields to which the subject of multiple comparisons is most extensively applied. Under the Federal Food, Drug, and Cosmetic Act, drug manufacturers must not only demonstrate the safety of their drug products but also establish effectiveness with substantial evidence in order to obtain marketing approval. However, error inflation occurs when more than two groups are compared at the same time. Designing a test procedure with high power while controlling the type I error therefore becomes an important issue.
The treatment with the largest population mean is considered the best one in the study. By excluding clearly inferior treatments, the potentially best treatments can receive increased resources and further investigation; hence, a small set of candidate best treatments is preferred. This thesis focuses on the problem of eliminating the less effective treatments among three in clinical trials. The goal is to increase the ability to identify every inferior treatment, provided that the probability of excluding any best treatment is guaranteed to be at most alpha. A step-down procedure is applied to solve the problem.
The general step-down procedure with fixed thresholds is conservative for this problem: the test is not efficient at rejecting the less effective treatments. We propose two methods with sharper thresholds that improve on current procedures and construct a subset containing only strictly inferior treatments. The first method, the restricted parameter space approach, is designed for the scenario in which prior information about the range of the treatment means is known. The second method, the step-down procedure with feedback, uses the observations to modify the threshold and controls the error rate over the whole parameter space. The new procedures have greater ability to detect inferior treatments than the standard procedure. In addition, simulations demonstrate that the type I error remains controlled under mild violations of the assumptions.
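As an illustration of the general idea (not the thesis's sharpened thresholds), the following is a minimal sketch of a step-down elimination procedure for three treatment arms; the Holm/Bonferroni-style critical values are placeholder assumptions.

```python
import numpy as np
from scipy import stats

def step_down_eliminate(samples, alpha=0.05):
    """Step-down elimination of inferior treatment arms (illustrative sketch).

    samples: list of 1-D arrays, one per arm. Returns indices of arms
    retained as possibly best. The critical values below are generic
    Bonferroni-style choices, not the sharpened thresholds of the thesis.
    """
    k = len(samples)
    n = min(len(s) for s in samples)
    means = np.array([s.mean() for s in samples])
    sp = np.sqrt(np.mean([s.var(ddof=1) for s in samples]))  # pooled sd
    order = np.argsort(means)            # test the smallest mean first
    best = means.max()
    retained = set(range(k))
    for step, i in enumerate(order[:-1]):
        m = k - 1 - step                 # comparisons still in play
        c = stats.norm.ppf(1 - alpha / m)
        if best - means[i] > c * sp * np.sqrt(2.0 / n):
            retained.discard(i)          # declare arm i inferior
        else:
            break                        # stop at the first non-rejection
    return sorted(retained)

rng = np.random.default_rng(0)
arms = [rng.normal(mu, 1.0, size=50) for mu in (0.0, 0.1, 0.8)]
print(step_down_eliminate(arms))         # typically keeps only arm 2
```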
|
2 |
Fizzy: feature subset selection for metagenomics. Ditzler, Gregory; Morrison, J. Calvin; Lan, Yemin; Rosen, Gail L. January 2015.
BACKGROUND: Some current software tools for comparative metagenomics give ecologists the ability to investigate and explore bacterial communities using α- & β-diversity. Feature subset selection, a sub-field of machine learning, can also provide unique insight into the differences between metagenomic or 16S phenotypes. In particular, feature subset selection methods can identify the operational taxonomic units (OTUs), or functional features, that have a high level of influence on the condition being studied. For example, in a previous study we used information-theoretic feature selection to understand the differences between the protein family abundances that best discriminate between age groups in the human gut microbiome. RESULTS: We have developed a new Python command-line tool for microbial ecologists, compatible with the widely adopted BIOM format, that implements information-theoretic subset selection methods for biological data formats. We demonstrate the software tool's capabilities on publicly available datasets. CONCLUSIONS: We have made the software implementation of Fizzy available to the public under the GNU GPL license. The standalone implementation can be found at http://github.com/EESI/Fizzy.
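Fizzy itself is a command-line tool; as a rough illustration of the information-theoretic filtering it performs, the sketch below ranks the features of a synthetic OTU-style abundance table by mutual information with the phenotype, using scikit-learn. The data and the cutoff of 15 features are placeholders, not Fizzy's defaults.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(1)
# toy stand-in for an OTU abundance table: 40 samples x 200 OTUs
X = rng.poisson(2.0, size=(40, 200)).astype(float)
y = rng.integers(0, 2, size=40)          # two phenotypes
X[y == 1, :5] += 4                       # make the first 5 OTUs informative

# rank OTUs by estimated mutual information with the phenotype
mi = mutual_info_classif(X, y, random_state=0)
top = np.argsort(mi)[::-1][:15]
print("top OTU indices:", top)
```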
|
3 |
Subset selection based on likelihood from uniform and related populations. Chotai, Jayanti. January 1979.
Let π1, π2, ..., πk be k (≥ 2) populations. Let πi (i = 1, 2, ..., k) be characterized by the uniform distribution on (ai, bi), where exactly one of ai and bi is unknown. With unequal sample sizes, suppose that we wish to select a random-size subset of the populations containing the one with the smallest value of θi = bi - ai. Rule Ri selects πi iff a likelihood-based k-dimensional confidence region for the unknown (θ1, ..., θk) contains at least one point having θi as its smallest component. A second rule, R, is derived through a likelihood ratio and is equivalent to that of Barr and Rizvi (1966) when the sample sizes are equal. Numerical comparisons are made. The results apply to the larger class of densities g(z; θi) = M(z)Q(θi) iff a(θi) < z < b(θi). Extensions to the cases when both ai and bi are unknown and when θmax is of interest are indicated.
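A toy Monte Carlo sketch of the setting, assuming each ai is known: the MLE of θi is the sample maximum minus ai, and a population is retained when its estimate is within a factor d of the smallest one. The constant d here is an arbitrary placeholder; the thesis derives its thresholds from a likelihood-based confidence region so that the best population is selected with probability at least 1 - alpha.

```python
import numpy as np

def select_smallest_theta(samples, a, d=1.25):
    """Illustrative subset-selection rule for uniform(a_i, b_i), a_i known.

    theta_hat_i = max(sample_i) - a_i is the MLE of theta_i = b_i - a_i.
    Retain population i iff theta_hat_i <= d * min_j theta_hat_j.
    The factor d > 1 is a placeholder, not the thesis's derived constant.
    """
    theta_hat = np.array([s.max() - ai for s, ai in zip(samples, a)])
    return np.where(theta_hat <= d * theta_hat.min())[0]

rng = np.random.default_rng(2)
a = [0.0, 0.0, 0.0]
theta = [1.0, 1.4, 2.0]                   # population 0 has the smallest range
samples = [rng.uniform(ai, ai + t, size=30) for ai, t in zip(a, theta)]
print(select_smallest_theta(samples, a))  # typically selects population 0
```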
|
4 |
Subset selection based on likelihood ratios: the normal means case. Chotai, Jayanti. January 1979.
Let π1, ..., πk be k (≥ 2) populations such that πi, i = 1, 2, ..., k, is characterized by the normal distribution with unknown mean μi and variance ai·σ², where ai is known and σ² may be unknown. Suppose that, on the basis of independent samples of size ni from πi (i = 1, 2, ..., k), we are interested in selecting a random-size subset of the given populations which hopefully contains the population with the largest mean. Based on likelihood ratios, several new procedures for this problem are derived in this report. Some of these procedures are compared with the classical procedure of Gupta (1956, 1965) and are shown to be better in certain respects. This is a slightly revised version of Statistical Research Report No. 1978-6.
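For reference, a minimal sketch of a Gupta-style subset selection rule for the equal-sample-size, known-variance case: retain population i iff its sample mean is within d·σ·sqrt(2/n) of the largest sample mean. The constant d below is a crude Bonferroni-style stand-in; Gupta's exact d solves an integral equation involving the normal cdf.

```python
import numpy as np
from scipy import stats

def gupta_subset(samples, sigma, p_star=0.95):
    """Gupta-style subset selection for normal means (equal n, known sigma).

    Retain population i iff mean_i >= max_j mean_j - d * sigma * sqrt(2/n).
    The value of d below is a conservative stand-in for the exact constant.
    """
    k = len(samples)
    n = len(samples[0])
    means = np.array([s.mean() for s in samples])
    d = stats.norm.ppf(p_star ** (1.0 / (k - 1)))  # crude stand-in for Gupta's d
    cutoff = means.max() - d * sigma * np.sqrt(2.0 / n)
    return np.where(means >= cutoff)[0]

rng = np.random.default_rng(3)
mus = [0.0, 0.2, 1.0]
data = [rng.normal(mu, 1.0, size=25) for mu in mus]
print(gupta_subset(data, sigma=1.0))      # usually keeps population 2
```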
|
5 |
"Abordagem genética para seleção de um conjunto reduzido de características para construção de ensembles de redes neurais: aplicação à língua eletrônica" / A genetic approach to feature subset selection for construction of neural network ensembles: an application to gustative sensorsFerreira, Ednaldo José 10 August 2005 (has links)
The irrelevant features present in databases from many domains degrade the prediction accuracy of classifiers induced by machine learning algorithms. Databases generated by an electronic tongue are typical examples in which the large number of irrelevant and redundant features harms classifier accuracy. There are basically two approaches to deal with this problem: feature subset selection and ensembles of classifiers. A good ensemble is composed of accurate and diverse classifiers, and an effective way to construct one is through feature selection. Ensemble feature selection has an additional objective: to find feature subsets that promote both accuracy and prediction diversity among the ensemble's classifiers. Genetic algorithms are promising techniques for ensemble feature selection. However, genetic search, like other search strategies, generally aims only at constructing the ensemble, allowing all features (relevant, irrelevant, and redundant) to be used. This work proposes a genetic-algorithm-based approach for constructing ensembles of artificial neural networks that use a reduced subset of the full feature set. To improve ensemble accuracy, two different approaches to training the neural networks were used: the first based on early stopping with the back-propagation algorithm, and the second based on multi-objective optimization. The results confirm the effectiveness of the proposed algorithm for constructing accurate neural network ensembles and demonstrate its efficiency in reducing the number of features, showing that it can build an ensemble from a reduced feature subset.
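The sketch below illustrates only the accuracy-driven core of such an approach: a small genetic algorithm that evolves bit-mask feature subsets scored by the cross-validated accuracy of a small neural network, with a mild size penalty. The population size, penalty weight, and network shape are arbitrary choices; the thesis additionally evolves subsets per ensemble member and rewards diversity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(4)
X, y = make_classification(n_samples=200, n_features=30, n_informative=6,
                           random_state=0)

def fitness(mask):
    """Cross-validated accuracy of a small network on the selected features,
    with a mild penalty on subset size to favour reduced feature sets."""
    if mask.sum() == 0:
        return 0.0
    clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=300, random_state=0)
    acc = cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()
    return acc - 0.002 * mask.sum()

def ga_feature_selection(n_feat, pop=10, gens=6, p_mut=0.05):
    population = rng.integers(0, 2, size=(pop, n_feat))
    for _ in range(gens):
        scores = np.array([fitness(ind) for ind in population])
        parents = population[np.argsort(scores)[-pop // 2:]]  # truncation selection
        children = []
        while len(children) < pop - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_feat)                     # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= (rng.random(n_feat) < p_mut).astype(child.dtype)  # mutation
            children.append(child)
        population = np.vstack([parents, children])
    best = max(population, key=fitness)
    return np.where(best == 1)[0]

print("selected features:", ga_feature_selection(X.shape[1]))
```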
|
6 |
Seleção de atributos relevantes para aprendizado de máquina utilizando a abordagem de Rough Sets. / Machine learning feature subset selection using Rough Sets approach. Pila, Adriano Donizete. 25 May 2001.
In supervised machine learning (ML), an induction algorithm is typically presented with a set of training examples, where each example is described by a vector of feature values and a class label. The task of the induction algorithm is to induce a classifier that will be useful in classifying new cases. In general, inductive learning algorithms rely on the provided data to build their classifiers; inadequate representation of the examples in the description language, as well as inconsistencies in the training examples, can make the learning task hard. One of the main problems in ML is Feature Subset Selection (FSS): the learning algorithm must select some subset of features upon which to focus its attention, while ignoring the rest. There are three main reasons that justify doing FSS. First, most computationally feasible ML algorithms do not work well in the presence of many features. Second, FSS may improve comprehensibility, since fewer features are used to induce symbolic concepts. Third, in some domains the cost of collecting and processing data is high. Basically, there are three approaches to FSS in ML: embedded, filter, and wrapper. Rough Sets theory (RS) is a mathematical approach developed in the early 1980s whose central construct is the reduct, and it is the approach treated in this work. A reduct is a minimal subset of features that preserves the same concept description as the entire set of features. This work focuses on the filter approach to FSS, using as filters the reducts obtained through RS. We describe a series of FSS experiments on nine natural datasets using RS reducts as well as other filters, after which the selected features are submitted to two symbolic ML algorithms. For each dataset and inducer, various measures are taken to compare performance, such as the number of selected features, accuracy, and the number of induced rules. We also present a case study on a real-world dataset from the medical area. The aim of this case study is twofold: to compare the performance of the induction algorithms and to evaluate the extracted knowledge with the aid of a specialist. Although the extracted knowledge is not surprising, it confirms some hypotheses previously made by the specialist using other methods. This shows that machine learning can also be viewed as a contribution to other scientific fields.
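As a sketch of the reduct idea, the following implements a greedy QuickReduct-style search: it grows a feature subset until its rough-set dependency degree (the fraction of examples whose equivalence class is label-consistent) matches that of the full feature set. The toy decision table is invented for illustration.

```python
from collections import defaultdict

def dependency(data, labels, attrs):
    """Rough-set dependency degree gamma: the fraction of rows whose
    equivalence class (w.r.t. attrs) is consistent on the label."""
    classes = defaultdict(set)
    for row, lab in zip(data, labels):
        classes[tuple(row[a] for a in attrs)].add(lab)
    consistent = sum(1 for row in data
                     if len(classes[tuple(row[a] for a in attrs)]) == 1)
    return consistent / len(data)

def quickreduct(data, labels):
    """Greedy (QuickReduct-style) approximation of a reduct."""
    all_attrs = list(range(len(data[0])))
    target = dependency(data, labels, all_attrs)
    reduct = []
    while dependency(data, labels, reduct) < target:
        best = max((a for a in all_attrs if a not in reduct),
                   key=lambda a: dependency(data, labels, reduct + [a]))
        reduct.append(best)
    return reduct

# toy decision table: the label is the XOR of attributes 0 and 2,
# so neither attribute alone suffices and attribute 1 is redundant
data = [(0, 1, 0), (0, 0, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 0, 0)]
labels = [0, 1, 1, 0, 1, 1]
print(quickreduct(data, labels))   # expect [0, 2]
```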
|
7 |
Essays on semi-parametric Bayesian econometric methods. Wu, Ruochen. January 2019.
This dissertation consists of three chapters on semi-parametric Bayesian econometric methods. Chapter 1 applies a semi-parametric method to demand systems and compares how well different approaches to linearly estimating the widely used Almost Ideal demand model, by either iteration or approximation, recover the true elasticities. Chapter 2, co-authored with Dr. Melvyn Weeks, introduces a new semi-parametric Bayesian Generalized Least Squares (GLS) estimator, which employs a Dirichlet Process prior to cope with potential heterogeneity in the error distributions. Two methods are discussed as special cases of the GLS estimator: the Seemingly Unrelated Regression for equation systems and the Random Effects Model for panel data, which can be applied to many fields, such as the demand analysis in Chapter 1. Chapter 3 focuses on subset selection for the efficiencies of firms, addressing the influence of heterogeneity in the efficiency distributions on subset selection by applying the semi-parametric Bayesian Random Effects Model introduced in Chapter 2.
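As a sketch of the Dirichlet Process prior's role, the following draws a truncated random distribution over error variances via stick-breaking and uses it to generate heterogeneous errors. The base measure, concentration parameter, and truncation level are arbitrary illustrative choices, not the estimator of Chapter 2.

```python
import numpy as np

rng = np.random.default_rng(5)

def stick_breaking(alpha, base_draw, n_atoms=100):
    """Draw a truncated random distribution G ~ DP(alpha, G0) via
    stick-breaking: v_k ~ Beta(1, alpha), w_k = v_k * prod_{j<k}(1 - v_j)."""
    v = rng.beta(1.0, alpha, size=n_atoms)
    w = v * np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
    return w / w.sum(), base_draw(n_atoms)   # renormalize truncated weights

# base measure G0 over error variances (an arbitrary illustrative choice)
draw_var = lambda n: 1.0 / rng.gamma(2.0, 1.0, size=n)
w, sigma2 = stick_breaking(alpha=2.0, base_draw=draw_var)

# heterogeneous regression errors: each observation's variance is an atom of G
idx = rng.choice(len(w), size=500, p=w)
errors = rng.normal(0.0, np.sqrt(sigma2[idx]))
print(round(errors.std(), 3))
```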
|
8 |
A New Generation of Mixture-Model Cluster Analysis with Information Complexity and the Genetic EM Algorithm. Howe, John Andrew. 01 May 2009.
In this dissertation, we extend several relatively new developments in statistical model selection and data mining in order to improve one of the workhorse statistical tools: mixture modeling (Pearson, 1894). The traditional mixture model assumes the data come from several populations of Gaussian distributions; what remains is to determine how many distributions there are, their population parameters, and the mixing proportions. However, real data often do not fit the restrictions of normality very well. Data from a single population exhibiting either asymmetrical or nonnormal tail behavior could be erroneously modeled as two populations, resulting in suboptimal decisions. To avoid these pitfalls, we develop the mixture model under a broader distributional assumption by fitting a group of multivariate elliptically-contoured distributions (Anderson and Fang, 1990; Fang et al., 1990). Special cases include the multivariate Gaussian and power exponential distributions, as well as the multivariate generalization of Student's t. This gives us the flexibility to model nonnormal tail and peak behavior, though the symmetry restriction still exists. The literature has many examples of research generalizing the Gaussian mixture model to other distributions (Farrell and Mersereau, 2004; Hasselblad, 1966; John, 1970a), but our effort is more general.

Further, we generalize the mixture model to be non-parametric by developing two types of kernel mixture model. First, we generalize the mixture model to use truly multivariate kernel density estimators (Wand and Jones, 1995). Additionally, we develop the power exponential product kernel mixture model, which allows the density to adjust to the shape of each dimension independently. Because kernel density estimators enforce no functional form, both of these methods can adapt to nonnormal asymmetric, kurtotic, and tail characteristics.

Over the past two decades or so, evolutionary algorithms have grown in popularity, as they have provided encouraging results in a variety of optimization problems. Several authors have applied the genetic algorithm, a subset of evolutionary algorithms, to mixture modeling, including Bhuyan et al. (1991), Krishna and Murty (1999), and Wicker (2006). These procedures have the benefit that they bypass computational issues that plague the traditional methods. We extend these initialization and optimization methods by combining them with our updated mixture models. Additionally, we "borrow" results from robust estimation theory (Ledoit and Wolf, 2003; Shurygin, 1983; Thomaz, 2004) in order to data-adaptively regularize population covariance matrices. Numerical instability of the covariance matrix can be a significant problem for mixture modeling, since estimation is typically done on a relatively small subset of the observations.

We likewise extend various information criteria (Akaike, 1973; Bozdogan, 1994b; Schwarz, 1978) to the elliptically-contoured and kernel mixture models. Information criteria guide model selection and estimation based on various approximations to the Kullback-Leibler divergence. Following Bozdogan (1994a), we use these tools to sequentially select the best mixture model, select the best subset of variables, and detect influential observations, all without making any subjective decisions. Over the course of this research, we developed a full-featured Matlab toolbox (M3) that implements all the new developments in mixture modeling presented in this dissertation.
We show results on both simulated and real-world datasets.

Keywords: mixture modeling, nonparametric estimation, subset selection, influence detection, evidence-based medical diagnostics, unsupervised classification, robust estimation.
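A minimal sketch of information-criterion-driven mixture model selection, using Gaussian components and BIC as a stand-in for the elliptically-contoured mixtures and ICOMP-type criteria developed in the dissertation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(8)
# two well-separated Gaussian clusters in 2-D
X = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(5, 1, (150, 2))])

# score candidate mixture sizes with an information criterion (BIC here;
# the dissertation develops ICOMP and related criteria for the same role)
models = [GaussianMixture(k, n_init=3, random_state=0).fit(X) for k in range(1, 6)]
bics = [m.bic(X) for m in models]
best_k = int(np.argmin(bics)) + 1
print("BIC-selected number of components:", best_k)   # expect 2
```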
|
9 |
Obtaining the Best Model Predictions and Parameter Estimates Using Limited Data. McLean, Kevin. 27 September 2011.
Engineers who develop fundamental models for chemical processes are often unable to estimate all of the model parameters due to problems with parameter identifiability and estimability. The literature concerning these two concepts is reviewed, and techniques for assessing parameter identifiability and estimability in nonlinear dynamic models are summarized. Modellers often face estimability problems when the available data are limited or noisy. In this situation, they must decide whether to conduct new experiments, change the model structure, or estimate only a subset of the parameters while leaving others at fixed values. Estimating only a subset of important model parameters is a technique often used by modellers who face estimability problems, and it may lead to better model predictions, with lower mean squared error (MSE), than the full model with all parameters estimated. Different methods in the literature for parameter subset selection are discussed and compared.
An orthogonalization algorithm combined with a recent MSE-based criterion has been used successfully to rank parameters from most to least estimable and to determine the parameter subset that should be estimated to obtain the best predictions. In this work, this strategy is applied to a batch reactor model using additional data, and the results are compared with computationally expensive leave-one-out cross-validation. A new simultaneous ranking and selection technique based on this MSE criterion is also described. Unfortunately, results from these parameter selection techniques are sensitive to the initial parameter values and to the uncertainty factors used to calculate sensitivity coefficients. A robustness test is proposed and applied to assess the sensitivity of the selected parameter subset to the initial parameter guesses. The selected parameter subsets are compared with those selected using another MSE-based method proposed by Chu et al. (2009). The computational efforts of these methods are compared and recommendations are provided to modellers. / Thesis (Master, Chemical Engineering), Queen's University, 2011.
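In the spirit of that orthogonalization algorithm, the sketch below ranks parameters by repeatedly choosing the sensitivity-matrix column with the largest residual norm after projecting out the columns already selected. The MSE-based cutoff that decides how many parameters to actually estimate is omitted, and the sensitivity matrix is a synthetic example.

```python
import numpy as np

def rank_estimable_params(Z):
    """Rank parameters from most to least estimable by orthogonalization:
    repeatedly pick the sensitivity-matrix column with the largest residual
    norm after projecting out the columns already chosen (sketch only)."""
    Z = np.asarray(Z, dtype=float)
    remaining = list(range(Z.shape[1]))
    ranked = []
    while remaining:
        if ranked:
            Q, _ = np.linalg.qr(Z[:, ranked])          # basis of chosen columns
            resid = Z - Q @ (Q.T @ Z)                  # residual sensitivities
        else:
            resid = Z
        norms = np.linalg.norm(resid[:, remaining], axis=0)
        best = remaining[int(np.argmax(norms))]
        ranked.append(best)
        remaining.remove(best)
    return ranked

# toy sensitivity matrix: parameter 2 has a large effect, while parameter 1
# is nearly collinear with parameter 0 (hence poorly estimable alongside it)
rng = np.random.default_rng(9)
z0 = rng.normal(size=20)
Z = np.column_stack([z0, z0 + 0.01 * rng.normal(size=20), 3 * rng.normal(size=20)])
print(rank_estimable_params(Z))    # parameter 2 ranked first
```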
|