  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
161

Wavelet methods and statistical applications: network security and bioinformatics

Kwon, Deukwoo 01 November 2005 (has links)
Wavelet methods possess versatile properties for statistical applications. We explore the advantages of using wavelets in two different research areas. First, we develop an integrated tool for online detection of network anomalies. We consider statistical change-point detection algorithms, both for local changes in variance and for jump detection, and propose modified versions of these algorithms based on moving-window techniques. We investigate their performance on simulated data and on network traffic data with several superimposed attacks. All detection methods are based on wavelet packet transforms. We also propose a Bayesian model for the analysis of high-throughput data where the outcome of interest has a natural ordering. The method provides a unified approach for identifying relevant markers and predicting class memberships, accomplished by building a stochastic search variable selection method into an ordinal model. We apply the methodology to the analysis of proteomic studies in prostate cancer, exploring wavelet-based techniques to remove noise from the protein mass spectra. The goal is to identify protein markers associated with prostate-specific antigen (PSA) level, an ordinal diagnostic measure currently used to stratify patients into different risk groups.
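The moving-window change-detection idea in this abstract can be illustrated in a few lines. The sketch below is not the thesis's algorithm: the simulated series, the hand-rolled level-1 Haar details, the window size, and the simple variance-ratio statistic are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "traffic" series: the variance jumps at t = 512, standing in
# for a superimposed attack (data and window size are illustrative).
x = np.concatenate([rng.normal(0, 1.0, 512), rng.normal(0, 3.0, 512)])

# Level-1 Haar detail coefficients (scaled differences of adjacent pairs);
# a variance change in x shows up as a variance change here as well.
d1 = (x[0::2] - x[1::2]) / np.sqrt(2)

# Moving-window statistic: ratio of the variance in two adjacent windows.
w = 64
stat = np.array([d1[i + w:i + 2 * w].var() / d1[i:i + w].var()
                 for i in range(len(d1) - 2 * w)])

# Declare a change where the ratio peaks; map back to the original axis.
change = 2 * (int(np.argmax(stat)) + w)
print(change)
```

The detected location should land near the true change at t = 512; an online version would instead compare the running statistic against a calibrated threshold.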
162

Clustering, Classification, and Factor Analysis in High Dimensional Data Analysis

Wang, Yanhong 17 December 2013 (has links)
Clustering, classification, and factor analysis are three popular data mining techniques. In this dissertation, we investigate these methods in high dimensional data analysis. Since there are many more features than samples, and most features are non-informative in high dimensional data, dimension reduction is necessary before clustering or classification can be performed. In the first part of this dissertation, we revisit an existing clustering procedure, optimal discriminant clustering (ODC; Zhang and Dai, 2009), and propose to use cross-validation to select its tuning parameter. We then develop a variant of ODC, sparse optimal discriminant clustering (SODC), for high dimensional data by adding a group-lasso type of penalty to ODC. We also demonstrate that both ODC and SODC can be used as dimension reduction tools for data visualization in cluster analysis. In the second part, three existing sparse principal component analysis (SPCA) methods, Lasso-PCA (L-PCA), Alternative Lasso PCA (AL-PCA), and sparse principal component analysis by choice of norm (SPCABP), are applied to a real data set from the International HapMap Project for AIM selection from genome-wide SNP data. Their classification accuracy is compared, and SPCABP is shown to outperform the other two SPCA methods. Third, we propose a novel method, sparse factor analysis by projection (SFABP), based on SPCABP, and propose to use cross-validation to select the tuning parameter and the number of factors. Our simulation studies show that SFABP performs better than unpenalized factor analysis when applied to classification problems.
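To make the sparse-PCA theme concrete, here is a generic soft-thresholding power-iteration sketch of sparse PCA on synthetic data. This is not L-PCA, AL-PCA, or SPCABP; it only shows the common mechanism those methods share: a penalty drives most loadings of the leading component to exactly zero, performing feature selection.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy high-dimensional data: 40 samples, 200 features, only the first
# five features carry the leading direction of variation.
n, p = 40, 200
scores = rng.normal(size=(n, 1))
X = 0.1 * rng.normal(size=(n, p))
X[:, :5] += scores * np.array([2.0, 1.8, 1.6, 1.4, 1.2])
X -= X.mean(axis=0)

def sparse_pc1(X, lam, iters=200):
    """First sparse loading via soft-thresholded power iteration."""
    v = np.sqrt((X ** 2).sum(axis=0))        # init from column norms
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = X.T @ (X @ v)
        w = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)  # soft-threshold
        norm = np.linalg.norm(w)
        if norm == 0:
            break
        v = w / norm
    return v

v = sparse_pc1(X, lam=10.0)
selected = np.flatnonzero(np.abs(v) > 1e-8)
print(selected)
```

With this penalty level only the five informative features keep nonzero loadings; in practice the penalty would be chosen by cross-validation, as the dissertation proposes for its tuning parameters.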
163

Monte Carlo methods for sampling high-dimensional binary vectors

Schäfer, Christian 14 November 2012 (has links) (PDF)
This thesis is concerned with Monte Carlo methods for sampling high-dimensional binary vectors from complex distributions of interest. If the state space is too large for exhaustive enumeration, these methods provide a means of estimating expected values with respect to functions of interest. Standard approaches are mostly based on random walk type Markov chain Monte Carlo, where the equilibrium distribution of the chain is the distribution of interest and its ergodic mean converges to the expected value. We propose a novel sampling algorithm based on sequential Monte Carlo methodology which copes well with multi-modal problems by virtue of an annealing schedule. The performance of the proposed sequential Monte Carlo sampler depends on the ability to sample proposals from auxiliary distributions which are, in a certain sense, close to the current distribution of interest. The core work of this thesis discusses strategies to construct parametric families for sampling binary vectors with dependencies. The usefulness of this approach is demonstrated in the context of Bayesian variable selection and combinatorial optimization of pseudo-Boolean objective functions.
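A heavily simplified annealed sequential Monte Carlo sketch for a pseudo-Boolean objective is shown below. It is not the thesis's sampler: the proposal family is a plain independent Bernoulli fit to the particle means (the thesis's contribution is precisely richer families with dependencies), and the bit-refresh move is a heuristic rather than an exact MCMC kernel.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 10, 2000                      # vector dimension, particle count

# Toy pseudo-Boolean objective: linear terms plus pairwise interactions.
b = rng.normal(size=d)
Q = rng.normal(size=(d, d))
Q = (Q + Q.T) / 2

def score(G):                        # G: (m, d) matrix of 0/1 rows
    return G @ b + np.einsum('ij,jk,ik->i', G, Q, G)

# Anneal from the uniform distribution toward pi(g) ∝ exp(3 * score(g)).
betas = np.linspace(0.0, 3.0, 31)
G = rng.integers(0, 2, size=(n, d)).astype(float)
for b0, b1 in zip(betas[:-1], betas[1:]):
    w = np.exp((b1 - b0) * score(G))
    w /= w.sum()
    G = G[rng.choice(n, size=n, p=w)]          # multinomial resampling
    p = G.mean(axis=0).clip(0.05, 0.95)        # fitted Bernoulli proposal
    flip = rng.random((n, d)) < 0.1            # heuristic bit refresh
    G = np.where(flip, (rng.random((n, d)) < p).astype(float), G)

best = G[np.argmax(score(G))]

# Sanity check against exhaustive enumeration (feasible only for d = 10).
all_G = ((np.arange(2 ** d)[:, None] >> np.arange(d)) & 1).astype(float)
print(score(best[None])[0], score(all_G).max())
```

Even this crude version concentrates the particle population on the high-scoring states as the temperature rises, which is the multi-modality argument made in the abstract.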
164

A fault diagnosis technique for complex systems using Bayesian data analysis

Lee, Young Ki 01 April 2008 (has links)
This research develops a fault diagnosis method for complex systems in the presence of uncertainties and the possibility of multiple solutions. Fault diagnosis is a challenging problem because the data used in diagnosis contain random errors and often systematic errors as well. Furthermore, fault diagnosis is fundamentally an inverse problem, so it inherits the unfavorable characteristics of inverse problems: the existence and uniqueness of a solution are not guaranteed, and the solution may be unstable. The weighted least squares method and its variations are traditionally used for solving inverse problems. However, the existing algorithms often fail to identify multiple solutions when they are present. In addition, they are not capable of selecting variables systematically, so they generally use the full model, which may contain unnecessary as well as necessary variables. Ignoring this model uncertainty often gives rise to the so-called smearing effect, whereby unnecessary variables are overestimated and necessary variables are underestimated. The proposed method solves the inverse problem using Bayesian inference. An engineering system can be parameterized using state variables, and the probability of each state variable is inferred from observations made on the system. A bias in an observation is treated as a variable, and its probability is inferred as well. To take the uncertainty of the model structure into account, multiple Bayesian models are created with various combinations of the state variables and the bias variables, and the results from all models are averaged according to how likely each model is. Gibbs sampling is used to approximate the updated probabilities. The method is demonstrated for two applications: the status matching of a turbojet engine and the fault diagnosis of an industrial gas turbine.
In the status matching application only physical faults in the components of a turbojet engine are considered, whereas in the fault diagnosis application sensor biases are considered as well as physical faults. The proposed method is tested under various faulty conditions using simulated measurements. Results show that the proposed method identifies physical faults and sensor biases simultaneously, and that multiple solutions can be identified. Overall, there is a clear improvement in the ability to identify correct solutions over the full model containing all state and bias variables.
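The core inferential move — treating a sensor bias as one more unknown and sampling it jointly with the state by Gibbs sampling — can be shown on a deliberately tiny toy problem. Everything below (one state, two sensors, conjugate normal priors) is an illustration, not the engine model of the thesis.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy system: one state variable x observed by two sensors, the second
# carrying an unknown bias b (values and priors are illustrative).
x_true, bias_true, sigma = 5.0, 1.5, 0.3
y1 = x_true + sigma * rng.normal(size=50)              # unbiased sensor
y2 = x_true + bias_true + sigma * rng.normal(size=50)  # biased sensor

# Gibbs sampling with conjugate priors x ~ N(0, 10^2), b ~ N(0, 10^2).
tau2 = 10.0 ** 2
x, b = 0.0, 0.0
xs, bs = [], []
for it in range(4000):
    # x | b: combine both sensors (sensor 2 corrected by the current bias)
    prec = 1 / tau2 + (len(y1) + len(y2)) / sigma ** 2
    mean = (y1.sum() + (y2 - b).sum()) / sigma ** 2 / prec
    x = mean + rng.normal() / np.sqrt(prec)
    # b | x: sensor 2 residuals only
    prec_b = 1 / tau2 + len(y2) / sigma ** 2
    mean_b = (y2 - x).sum() / sigma ** 2 / prec_b
    b = mean_b + rng.normal() / np.sqrt(prec_b)
    if it >= 1000:                                     # discard burn-in
        xs.append(x)
        bs.append(b)

print(np.mean(xs), np.mean(bs))
```

The posterior means recover both the state and the bias; the full method additionally averages over models with different subsets of state and bias variables to handle model uncertainty.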
165

Using a weighted bootstrap approach to identify risk factors associated with the sexual activity of entering first-year students at UWC

Brydon, Humphrey January 2013 (has links)
Magister Scientiae - MSc / This thesis examines the effect that the introduction of various techniques (weighting, bootstrapping and variable selection) has on the accuracy of the modelling process when using logistic regression. The data used in the modelling process concern the sexual activity of entering first-year students at the University of the Western Cape; by constructing logistic regression models on these data, predictor variables or factors associated with the sexual activity of these students are identified. The sample weighting technique utilised in this thesis assigned a weight to a student based on gender and racial representation in the sample compared with the entering first-year population. The use of sample weighting is shown to produce a more effective modelling process than modelling without weighting, and the bootstrapping procedure is shown to produce more accurate logistic regression models. Utilising more than 200 bootstrap samples did not necessarily produce models more accurate than those based on 200 bootstrap samples. It is, however, concluded that a weighted bootstrap modelling procedure results in more accurate models than a procedure without this intervention. The forward, backward, stepwise, Newton-Raphson and Fisher variable selection methods are used. The Newton-Raphson and Fisher methods are found not to be effective in a logistic modelling process, whereas the forward, backward and stepwise methods all produce very similar results.
Six predictor variables or factors are identified with respect to the sexual activity of the specified students: the age of the student; whether they consume alcohol; their racial grouping; whether an HIV test has been taken; the importance of religion in influencing their sexual behaviour; and whether they smoke. Conclusions are reached on improvements that could be made to the HIV prevention programme at UWC with reference to the sexual activity of entering first-years.
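A minimal sketch of the weighted-bootstrap idea on synthetic data follows. The covariates are hypothetical stand-ins for the survey items (a scaled continuous predictor and a binary one), and the weights mimic post-stratification adjustment; none of it comes from the UWC data.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical survey: intercept, a continuous predictor (e.g. scaled age)
# and a binary predictor (e.g. alcohol use); names are illustrative only.
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.integers(0, 2, n)])
true_beta = np.array([-0.5, 1.0, 0.8])
y = (rng.random(n) < 1 / (1 + np.exp(-(X @ true_beta)))).astype(float)

# Post-stratification style weights (hypothetical group proportions).
w = np.where(X[:, 2] == 1, 1.3, 0.8)

def fit_logistic(X, y, steps=25):
    """Logistic regression by Newton-Raphson (IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-np.clip(X @ beta, -30, 30)))
        H = (X * (p * (1 - p))[:, None]).T @ X       # observed information
        beta += np.linalg.solve(H, X.T @ (y - p))    # Newton step
    return beta

# Weighted bootstrap: resample cases with probability proportional to the
# weights, refit on each resample, and summarise the 200 fits.
B = 200
idx_list = [rng.choice(n, n, p=w / w.sum()) for _ in range(B)]
boots = np.array([fit_logistic(X[i], y[i]) for i in idx_list])
print(boots.mean(axis=0), boots.std(axis=0))
```

The bootstrap means recover the generating coefficients and the spread gives resampling-based standard errors, which is the accuracy comparison the thesis carries out at scale.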
166

Técnicas de análise multivariável aplicadas ao desenvolvimento de analisadores virtuais / Multivariate analysis techniques applied to the development of virtual analyzers

Facchin, Samuel January 2005 (has links)
The construction of a virtual analyzer rests on three pillars: the model, the variables that enter the model, and the strategy for correcting/updating the model. Mathematical models are classified by the level of process knowledge they embody, ranging from complex models based on fundamental relationships and physico-chemical laws, called white-box models, to models obtained through multivariate analysis techniques such as multiple linear regression and neural networks, referred to as black-box models. The present work analyzes two of these pillars: the models, focusing on those obtained through PLS-type dimensionality reduction techniques, and the variable selection methodologies used to build this class of models. First, the main linear and nonlinear variants of the PLS methodology are reviewed, from their original development to their combination with neural networks. Next, some techniques popularly used for variable selection in black-box models are presented, along with cross-validation techniques and strategies for selecting data for model calibration and validation. New approaches to variable selection are proposed, arising from the combination of data selection strategies with two variable selection methodologies. The results produced by these new approaches are compared with the classical method on linear and nonlinear case studies. The viability of the analyzed and developed techniques is verified by applying them to the development of a virtual analyzer for a distillation column simulated in the dynamic simulator Aspen Dynamics®. Finally, the steps and challenges of implementing a PLS-based virtual analyzer for a depropanizer tower at the raw materials plant of a petrochemical complex are presented.
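The PLS regression at the heart of such a soft sensor can be sketched with the classical NIPALS algorithm. The process data below are simulated (not from the distillation column of the thesis), and only the basic linear PLS1 variant is shown.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical column data: 15 process variables; the composition to be
# inferred depends on three of them (all values simulated).
n, p = 100, 15
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] - 1.5 * X[:, 3] + X[:, 7] + 0.1 * rng.normal(size=n)

def pls1(X, y, ncomp):
    """PLS1 by NIPALS; returns regression coefficients in X-space."""
    Xk, yk = X - X.mean(axis=0), y - y.mean()
    W, P, q = [], [], []
    for _ in range(ncomp):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)            # weight vector
        t = Xk @ w                        # scores
        p_ = Xk.T @ t / (t @ t)           # X loadings
        q_ = yk @ t / (t @ t)             # y loading
        Xk = Xk - np.outer(t, p_)         # deflation
        yk = yk - q_ * t
        W.append(w); P.append(p_); q.append(q_)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

beta = pls1(X, y, ncomp=3)
print(np.round(beta, 2))
```

With three latent variables the coefficients of the three informative process variables are recovered and the rest stay near zero; the thesis's variable selection strategies decide which columns enter such a model in the first place.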
167

Abordagens multivariadas para a seleção de variáveis com vistas à caracterização de medicamentos / Multivariate approaches to variable selection in order to characterize medicines

Yamashita, Gabrielli Harumi January 2015 (has links)
The authenticity of medicines has been investigated through profile analysis by attenuated total reflectance Fourier transform infrared spectroscopy (ATR-FTIR). Such analysis, however, typically yields data with a large number of noisy and correlated variables (wavelengths), calling for techniques that select the most relevant and informative variables and thereby make predictive and exploratory models more robust. This thesis tests approaches to variable selection aimed at clustering and classifying drug samples. First, three variable importance indices are derived from Principal Component Analysis (PCA) parameters; these indices guide an iterative variable elimination process aimed at more consistent clustering, with performance measured by the Silhouette Index. Next, a Genetic Algorithm (GA) is combined with the k-nearest-neighbor classifier (kNN) to select the subset of variables yielding the highest average accuracy in classifying samples as authentic or counterfeit. Finally, the ATR-FTIR data are split into intervals to select the most relevant spectroscopic regions for sample classification via kNN, and the GA is then applied to refine the retained intervals. Applying the proposed variable selection methods produced more accurate clusterings and classifications based on a reduced subset of variables.
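One common way to build a PCA-based variable importance index — squared loadings weighted by the variance each component explains — is sketched below on synthetic "spectra". This particular index form is an assumption for illustration; it is not necessarily one of the three indices derived in the dissertation.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy "spectra": 30 samples x 50 wavelengths; wavelengths 10-14 carry a
# band that separates two sample groups (all values synthetic).
n, p = 30, 50
X = rng.normal(0.0, 0.2, size=(n, p))
labels = np.repeat([0, 1], n // 2)
X[:, 10:15] += labels[:, None] * 1.0

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
expl = s ** 2 / (s ** 2).sum()                       # variance explained
k = int(np.searchsorted(np.cumsum(expl), 0.8)) + 1   # components for 80%

# Importance index: squared loadings weighted by explained variance.
importance = Vt[:k].T ** 2 @ expl[:k]
top = np.argsort(importance)[::-1][:5]
print(sorted(top.tolist()))
```

The five informative wavelengths receive the highest index values; an iterative scheme would now drop the lowest-ranked wavelengths and re-evaluate the clustering via the Silhouette Index.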
168

Sistemática para seleção de variáveis e determinação da condição ótima de operação em processos contínuos multivariados em múltiplos estágios / A framework for variable selection and determination of the optimal operating condition in multivariate, multiple-stage continuous processes

Loreto, Éverton Miguel da Silva January 2014 (has links)
This dissertation proposes an approach for selecting process variables and determining the optimal operating condition in multivariate, multiple-stage continuous processes. The proposed framework comprises six steps. After the process variables are identified and the production stages established, the data are pre-treated: observations with spurious values are discarded and the remaining data are standardized. Next, each stage is modeled by a Partial Least Squares (PLS) regression that associates the dependent variable of that stage with the independent variables of all preceding stages. Independent variables are then iteratively selected based on the PLS regression coefficients: at each iteration the variable with the smallest regression coefficient is removed and a new PLS model is generated. The prediction error is evaluated and further eliminations are made until the number of remaining variables equals the number of latent variables (the boundary condition for generating new PLS models). The subset yielding the lowest prediction error determines the most relevant process variables for each model. The set of PLS models built from the selected variables is then integrated into a quadratic program that defines the operating conditions minimizing the deviation between predicted and nominal values of the response variables. The proposed approach was validated through two numerical examples, the first using data from a poultry company and the second using simulated data.
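The coefficient-driven backward elimination loop can be sketched as follows. For brevity the sketch uses ordinary least squares coefficients as a stand-in for the PLS coefficients of the actual method, and a fixed floor of three variables as a stand-in for the latent-dimension stopping rule; the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy single-stage data: 12 candidate process variables, 3 of which
# actually drive the response (all values simulated).
n, p = 120, 12
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 2] - 1.0 * X[:, 5] + 0.8 * X[:, 9] + 0.3 * rng.normal(size=n)

def cv_error(X, y, folds=5):
    """Deterministic k-fold cross-validated squared error."""
    idx, err = np.arange(len(y)), 0.0
    for f in range(folds):
        te = idx % folds == f
        beta, *_ = np.linalg.lstsq(X[~te], y[~te], rcond=None)
        err += ((y[te] - X[te] @ beta) ** 2).sum()
    return err / len(y)

# Backward elimination: repeatedly drop the variable with the smallest
# absolute coefficient, tracking the subset with the lowest CV error.
active = list(range(p))
best_set, best_err = list(active), cv_error(X[:, active], y)
while len(active) > 3:                 # stand-in for the latent-dim floor
    beta, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
    active.pop(int(np.argmin(np.abs(beta))))
    err = cv_error(X[:, active], y)
    if err < best_err:
        best_set, best_err = list(active), err

print(best_set)
```

The retained set contains the three driving variables; in the full framework one such model is built per stage and the selected models feed the quadratic program.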
169

Seleção de variáveis no desenvolvimento, classificação e predição de produtos / Selection of variables for the development, classification, and prediction of products

Rossini, Karina January 2011 (has links)
This dissertation proposes variable selection methods, based on multivariate analysis, for data from descriptive sensory evaluations and infrared spectrum analyses, aimed at the food and chemical industries. Its objectives are: (i) to review the main multivariate data analysis techniques, how they are commonly organized, and how they can contribute to variable selection; (ii) to structure these techniques into methods that reduce the number of variables needed to characterize, classify, and predict products; (iii) to reduce the list of variables/attributes by selecting those that are relevant and non-redundant, thereby shortening panel sessions and reducing the fatigue imposed on panel members in sensory evaluations; (iv) to validate the proposed methods on real data; and (v) to compare different sensory analysis approaches for new product development. The proposed methods were evaluated through case studies on real data, and vary with the characteristics of the data analyzed (highly multicollinear or not, with or without a dependent/response variable). The methods performed well, leading to a significant reduction in the number of variables while yielding satisfactory goodness-of-fit or accuracy indices compared with retaining all variables or with other methods in the literature. We conclude that the proposed methods are suitable for selecting variables in sensory and infrared spectrum data.
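The "relevant and non-redundant" criterion can be illustrated with a greedy correlation filter on a toy sensory panel. This is a deliberate simplification of the dissertation's methods: it only shows how near-duplicate attributes are pruned so panelists need not score them all.

```python
import numpy as np

rng = np.random.default_rng(8)

# Toy sensory panel: 8 attributes scored on 60 products; attributes 4 and
# 5 are near-duplicates of attributes 0 and 1 (all values synthetic).
n = 60
base = rng.normal(size=(n, 4))
X = np.column_stack([base,
                     base[:, 0] + 0.05 * rng.normal(size=n),  # ~ attr 0
                     base[:, 1] + 0.05 * rng.normal(size=n),  # ~ attr 1
                     rng.normal(size=(n, 2))])                # independent

# Greedy redundancy filter: keep an attribute only if its absolute
# correlation with every attribute already kept stays below a cutoff.
cutoff, kept = 0.9, []
R = np.corrcoef(X, rowvar=False)
for j in range(X.shape[1]):
    if all(abs(R[j, k]) < cutoff for k in kept):
        kept.append(j)
print(kept)
```

The two redundant attributes are dropped while the independent ones survive; the dissertation's methods additionally weigh relevance to the response when one exists.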
170

Développement d'outils statistiques pour l'analyse de données transcriptomiques par les réseaux de co-expression de gènes / A systemic approach to the statistical analysis of transcriptomic data through co-expression network analysis

Brunet, Anne-Claire 17 June 2016 (has links)
New biotechnologies today make it possible to collect a great variety and quantity of biological data (genomic, proteomic, metagenomic...), opening new research avenues for understanding biological processes. In this thesis we focus on transcriptomic data, which characterize the activity or expression level of several tens of thousands of genes in a given cell. The aim was to propose statistical tools suited to analyzing this type of data, which poses "high dimension" problems (n << p) because samples are very small relative to the very large number of variables (here, gene expression levels).
The first part of the thesis presents supervised learning methods, such as Breiman's random forests and penalized regression models, used in the high-dimensional setting to select the genes (expression variables) most relevant to the pathology under study. We discuss the limits of these methods for selecting genes that are relevant not only statistically but also biologically, in particular when selecting within groups of highly correlated variables, that is, groups of co-expressed genes. Classical supervised learning methods assume that each gene can act in isolation in the model, which is rarely realistic: an observable biological trait results from a set of reactions within a complex system in which genes interact, and genes involved in the same biological function tend to be co-expressed (correlated expression).
In the second part, we therefore study gene co-expression networks, in which two genes are linked if they are co-expressed. More precisely, we seek to identify communities of genes on these networks, that is, groups of co-expressed genes, and then to select the communities most relevant to the pathology as well as the "key genes" of those communities. This aids biological interpretation, since a community of co-expressed genes can often be associated with a biological function. We propose an original and efficient approach that simultaneously addresses the modeling of the gene co-expression network and the detection of gene communities on that network. We demonstrate the performance of our approach by comparing it with existing, popular methods for analyzing gene co-expression networks (WGCNA and spectral methods). Finally, analyzing a real data set, we show in the last part of the thesis that our approach yields biologically convincing results that are easier to interpret and more robust than those obtained with classical supervised learning methods.
