101
Seleção de variáveis para classificação de bateladas produtivas. Kahmann, Alessandro (January 2013)
Databases derived from industrial processes are characterized by a large number of correlated, noisy variables and more variables than observations, making variable selection an important issue in process monitoring. This thesis proposes methods for variable selection aimed at classifying production batches. To that end, we propose new methods that combine Variable Importance Indices with classification tools for systematic variable elimination; the objective is to select the process variables with the highest discriminating ability for categorizing batches into classes. The methods rely on a basic framework: i) split the historical data into training and testing sets; ii) in the training set, generate a Variable Importance Index (VII) that ranks the variables according to their discriminating ability; iii) at each iteration, classify the samples and remove the variable with the smallest VII; iv) evaluate the candidate subsets through their Euclidean distance to a hypothetical optimum, thereby selecting the recommended subset of variables. These steps are tested with different classification tools and VIIs. Applying the proposed methods to real and simulated databases corroborates the robustness of the propositions on data with different levels of correlation and noise.
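A minimal sketch of the elimination framework in steps i-iv, assuming a k-nearest-neighbour classifier, a t-statistic-style importance index, and equal weighting of accuracy and subset size in the distance to the ideal point (all three are illustrative choices; the thesis tests several classifiers and indices):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def variable_importance(X, y):
    # Illustrative VII: absolute two-sample t-statistic per variable.
    g0, g1 = X[y == 0], X[y == 1]
    se = np.sqrt(g0.var(axis=0) / len(g0) + g1.var(axis=0) / len(g1))
    return np.abs(g0.mean(axis=0) - g1.mean(axis=0)) / (se + 1e-12)

def select_variables(X, y):
    # i) split into training and testing portions
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    # ii) rank variables by the VII, least discriminant first
    kept = list(np.argsort(variable_importance(X_tr, y_tr)))
    results = []
    # iii) classify, then drop the variable with the smallest VII
    while kept:
        acc = KNeighborsClassifier().fit(X_tr[:, kept], y_tr).score(X_te[:, kept], y_te)
        results.append((list(kept), acc))
        kept = kept[1:]
    # iv) pick the subset closest (Euclidean distance) to the ideal point:
    # perfect accuracy with a single retained variable
    p = X.shape[1]
    return min(results,
               key=lambda r: np.hypot(1 - r[1], (len(r[0]) - 1) / max(p - 1, 1)))[0]
```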
102
Predicting risk of cyberbullying victimization using lasso regression. Olaya Bucaro, Orlando (January 2017)
Today's adolescents' increased online presence and use of technology have created new places where bullying can occur. The aim of this thesis is to specify a model that can accurately predict the risk of cyberbullying victimization. The data come from a survey conducted at five secondary schools in Pereira, Colombia. A logistic regression model with random effects is used to predict cyberbullying exposure, with predictors selected by the lasso and tuned by cross-validation. The covariates considered include demographic, dietary habit, parental mediation, school performance, physical health, mental health and health risk variables such as alcohol and drug consumption. The final model retains the demographic, mental health and parental mediation variables and excludes the dietary habit, school performance, physical health and health risk variables; it has an overall prediction accuracy of 88%.
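Setting the random-effects structure aside, the core selection step can be sketched with off-the-shelf tools: an L1-penalized logistic regression whose penalty strength is chosen by cross-validation. The data below are placeholders, not the Pereira survey:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# X: survey covariates, y: 1 if the student reports cyberbullying victimization.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))        # placeholder for the survey covariates
y = rng.integers(0, 2, size=500)      # placeholder for the victimization flag

model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(penalty="l1", solver="saga", Cs=20, cv=10,
                         scoring="accuracy", max_iter=5000),
)
model.fit(X, y)
coefs = model.named_steps["logisticregressioncv"].coef_.ravel()
selected = np.flatnonzero(coefs != 0)  # lasso zeroes out irrelevant covariates
print(f"kept {selected.size} of {X.shape[1]} covariates")
```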
103
Bayesovský výběr proměnných / Bayesian variable selection. Jančařík, Joel (January 2017)
Variable selection is a common problem in statistical analysis, and Bayesian approaches to it became popular in the 1990s. We review classical Bayesian variable selection methods and set a common framework for them, covering indicator model selection methods and adaptive shrinkage methods for the normal linear model. The main benefit of this work is that it integrates Bayesian theory with Markov chain Monte Carlo (MCMC) theory: all derivations needed for the MCMC algorithms are provided. The methods are then applied to simulated and real data.
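As one concrete instance of the indicator-model family reviewed here, the sketch below runs Metropolis-Hastings over inclusion indicators in a normal linear model, scoring candidate models with a Zellner g-prior marginal likelihood under a uniform prior over models. Centred data, the g-prior, and the uniform model prior are assumptions of the sketch, not necessarily the thesis's exact setup:

```python
import numpy as np

def log_marginal(X, y, gamma, g=100.0):
    # Zellner g-prior log marginal likelihood (up to a constant) for the
    # normal linear model restricted to the variables flagged in gamma.
    n = len(y)
    if gamma.sum() == 0:
        return -0.5 * n * np.log(y @ y)
    Xg = X[:, gamma]
    beta_hat, *_ = np.linalg.lstsq(Xg, y, rcond=None)
    rss = y @ y - (g / (1 + g)) * (y @ Xg @ beta_hat)
    return -0.5 * gamma.sum() * np.log(1 + g) - 0.5 * n * np.log(rss)

def mcmc_indicator_selection(X, y, n_iter=5000, seed=0):
    # Metropolis-Hastings over inclusion indicators: propose flipping one
    # coordinate at a time, accept with the usual MH ratio.
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    gamma = np.zeros(p, dtype=bool)
    current = log_marginal(X, y, gamma)
    inclusion = np.zeros(p)
    for _ in range(n_iter):
        proposal = gamma.copy()
        j = rng.integers(p)
        proposal[j] = ~proposal[j]
        cand = log_marginal(X, y, proposal)
        if np.log(rng.uniform()) < cand - current:
            gamma, current = proposal, cand
        inclusion += gamma
    return inclusion / n_iter   # posterior inclusion probabilities
```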
104
Channel attribution modelling using clickstream data from an online store. Neville, Kevin (January 2017)
In marketing, user behaviour is analysed to discover which channels (for instance TV, social media, etc.) are important for increasing a user's intention to buy a product. The search for better channel attribution models than the common last-click model is a major concern for the marketing industry. In this thesis, a probabilistic model for channel attribution is developed and demonstrated to be more data-driven than the conventional last-click model; the modelling also attempts to include the time aspect, which has not been done in previous research. The model is based on studying different sequence lengths and computing conditional probabilities of conversion using logistic regression models. A clickstream dataset from an online store was analysed using the proposed model. The thesis provides evidence that the last-click model is not optimal for conducting these kinds of analyses.
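A toy version of the per-sequence-length modelling: condition on path length, encode each journey by its channel touch counts, and fit a logistic regression for conversion. The channel names and journeys below are fabricated for illustration:

```python
import numpy as np
from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Each journey is a list of channel touches plus a conversion flag.
journeys = [
    (["display", "search"], 1),
    (["social", "social", "search"], 0),
    (["email"], 1),
    (["display", "social"], 0),
] * 50

def fit_by_sequence_length(journeys, length):
    # Conditional model P(conversion | channel counts, path length = length).
    subset = [(path, conv) for path, conv in journeys if len(path) == length]
    vec = DictVectorizer(sparse=False)
    X = vec.fit_transform(Counter(path) for path, _ in subset)
    y = np.array([conv for _, conv in subset])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return dict(zip(vec.get_feature_names_out(), clf.coef_.ravel()))

# Positive coefficients suggest channels that raise the conversion odds
# for journeys of that length.
print(fit_by_sequence_length(journeys, 2))
```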
105
Modelling and design of the eco-system of causality for real-time systems. Danishvar, Morad (January 2015)
The purpose of this research work is to propose an improved method for real-time sensitivity analysis (SA) applicable to large-scale complex systems. Building on the EventTracker principle of the interrelation of causal events, it deploys the Rank Order Clustering (ROC) method to automatically group every relevant system input with the parameters that represent the system state (i.e. its outputs). The fundamental principle of event modelling is that the state of a given system is a function of every acquirable piece of knowledge or data (input) about events that occur within the system and its wider operational environment, unless proven otherwise. It therefore strives to build the theoretical and practical foundation for the engineering of input data. The proposed event modelling platform attempts to filter unwanted data and, more importantly, to include information that was thought to be irrelevant at the outset of the design process. The underpinning logic of the proposed Event Clustering technique (EventiC) is to build causal relationships between the events that trigger the inputs and outputs of the system. EventiC groups inputs with their relevant outputs and measures the impact of each input variable on the output variables over short spans of time (relative real-time). This grouping of relevant input-output event data by order of importance in real time is believed to be the key contribution to knowledge in this subject area. Our motivation is that components of current complex and organised systems are capable of generating and sharing information within their network of interrelated devices and systems. In addition to being an intelligent recorder of events, EventiC could also be a platform for preliminary data and knowledge construction. This improvement in the quality, and at times the quantity, of input data may lead to improved higher-level mathematical formalism; it is hoped that better models will translate into superior control and decision making. The projected outcome of this research work can therefore be used to predict, stabilize (control), and optimize (operational research) the behaviour of complex systems in the shortest possible time. For proof of concept, EventiC was designed in MATLAB and implemented using real-time data from the monitoring and control system of a typical cement manufacturing plant. The purpose of this deployment was to test and validate the concept, and to demonstrate whether the clusters of input data and their levels of importance against system performance indicators could be approved by industry experts. EventiC was also used as an input variable selection tool for improving the plant's existing fuzzy controller. Finally, EventiC was compared with its predecessor, EventTracker, on the same case study; the results revealed improvements in both computational efficiency and the quality of input variable selection.
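EventiC itself is not public, but the Rank Order Clustering step it names is King's classic algorithm from group technology. A sketch on a binary input-output incidence matrix (treating "input i co-triggers output j" as a 0/1 entry is an assumption of this illustration):

```python
import numpy as np

def rank_order_clustering(A, max_iter=100):
    # King's Rank Order Clustering: repeatedly sort the rows and columns of
    # a binary incidence matrix by the decimal value of their bit patterns;
    # at convergence, related rows/columns gather into near block-diagonal
    # groups. Here rows are system inputs, columns are output/state variables.
    A = np.asarray(A, dtype=float)
    rows = np.arange(A.shape[0])
    cols = np.arange(A.shape[1])
    for _ in range(max_iter):
        row_w = A @ (2.0 ** np.arange(A.shape[1])[::-1])   # row binary values
        r = np.argsort(-row_w, kind="stable")
        col_w = (2.0 ** np.arange(A.shape[0])[::-1]) @ A[r]  # column binary values
        c = np.argsort(-col_w, kind="stable")
        if (r == np.arange(len(r))).all() and (c == np.arange(len(c))).all():
            break  # both orderings already sorted: converged
        A, rows, cols = A[r][:, c], rows[r], cols[c]
    return A.astype(int), rows, cols
```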
106
Variable selection in joint modelling of mean and variance for multilevel data. Charalambous, Christiana (January 2011)
We propose to extend the use of penalized likelihood based variable selection methods to hierarchical generalized linear models (HGLMs) for jointly modelling both the mean and variance structures. We are interested in applying these new methods to multilevel structured data, hence we assume a two-level hierarchical structure, with subjects nested within groups. We consider a generalized linear mixed model (GLMM) for the mean, with a structured dispersion in the form of a generalized linear model (GLM). In the first instance, we model the variance of the random effects which are present in the mean model, or in other words the variation between groups (between-level variation). In the second scenario, we model the dispersion parameter associated with the conditional variance of the response, which could also be thought of as the variation between subjects (within-level variation). To do variable selection, we use the smoothly clipped absolute deviation (SCAD) penalty, a penalized likelihood variable selection method, which shrinks the coefficients of redundant variables to 0 and at the same time estimates the coefficients of the remaining important covariates. Our methods are likelihood based, and so in order to estimate the fixed effects in our models, we apply iterative procedures such as the Newton-Raphson method, in the form of the LQA algorithm proposed by Fan and Li (2001). We carry out simulation studies for both the joint models for the mean and variance of the random effects, as well as the joint models for the mean and dispersion of the response, to assess the performance of our new procedures against a similar process which excludes variable selection. The results show that our method increases both the accuracy and efficiency of the resulting penalized MLEs and has a 100% success rate in identifying the zero and non-zero components over 100 simulations. For the main real data analysis, we use the Health Survey for England (HSE) 2004 dataset. We investigate how obesity is linked to several factors such as smoking, drinking, exercise and long-standing illness, to name a few. We also examine whether there is variation in obesity between individuals and between households of individuals, as well as test whether that variation depends on some of the factors affecting obesity itself.
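The SCAD/LQA machinery can be sketched for the simplest case, a linear mean model with n > p; the thesis embeds the same penalty inside the HGLM mean and dispersion fits. The tolerances and the hard zero threshold below are illustrative choices:

```python
import numpy as np

def scad_derivative(beta, lam, a=3.7):
    # First derivative of the SCAD penalty of Fan and Li (2001);
    # a = 3.7 is their recommended default.
    b = np.abs(beta)
    return lam * ((b <= lam) +
                  np.clip(a * lam - b, 0, None) / ((a - 1) * lam) * (b > lam))

def lqa_estimate(X, y, lam, n_iter=50, eps=1e-6):
    # Local quadratic approximation (LQA): at each step the SCAD penalty is
    # replaced by a quadratic around the current estimate, giving a
    # ridge-like linear system; coefficients shrunk to (numerical) zero are
    # treated as excluded from the model.
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS start (needs n > p)
    for _ in range(n_iter):
        w = scad_derivative(beta, lam) / (np.abs(beta) + eps)
        beta_new = np.linalg.solve(X.T @ X + n * np.diag(w), X.T @ y)
        if np.max(np.abs(beta_new - beta)) < 1e-8:
            beta = beta_new
            break
        beta = beta_new
    beta[np.abs(beta) < 1e-4] = 0.0   # tiny coefficients count as zeroed out
    return beta
```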
108
Agrupamento de trabalhadores com perfis semelhantes de aprendizado utilizando técnicas multivariadas. Azevedo, Bárbara Brzezinski (January 2013)
Manufacturing of customized products leads to a wide variety of models, reduced batch sizes and frequent alternation of the tasks performed by workers. In this context, manual tasks are especially affected by the workers' adaptation to new product models, a learning process that takes place at different paces within a group of workers. This thesis aims at grouping workers with similar learning profiles so as to avoid bottlenecks in production lines caused by learning dissimilarities among workers. For that matter, we present approaches for clustering workers based on parameters derived from Learning Curve (LC) modelling. Such parameters, which characterize the workers' adaptation to tasks, are processed through Principal Component Analysis (PCA), and the PCA scores are used as clustering variables. Next, Kernel transformations of the parameters are also tested to improve clustering quality. The workers are clustered using the K-Means and Fuzzy C-Means techniques, and the quality of the resulting clusters is measured by the Silhouette Index. Finally, we suggest a variable importance index based on parameters derived from PCA to select the most relevant variables for clustering. The proposed approaches are applied to a footwear manufacturing process, yielding satisfactory results when compared to clustering without the prior data transformation or without variable selection.
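A sketch of the clustering pipeline with scikit-learn (K-Means branch only; Fuzzy C-Means needs a third-party package). The variable-importance formula at the end is one plausible reading of a loadings-weighted-by-explained-variance index, not necessarily the authors' exact definition:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_workers(lc_params, n_clusters=3, var_threshold=0.9):
    # lc_params: one row per worker, columns are fitted learning-curve
    # parameters (e.g. initial performance, learning rate, asymptote).
    Z = StandardScaler().fit_transform(lc_params)
    scores = PCA(n_components=var_threshold).fit_transform(Z)  # keep 90% variance
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(scores)
    return labels, silhouette_score(scores, labels)

def pca_variable_importance(lc_params):
    # Weight each variable's absolute loadings by the variance explained
    # by the corresponding components (illustrative importance index).
    Z = StandardScaler().fit_transform(lc_params)
    pca = PCA().fit(Z)
    return np.abs(pca.components_.T) @ pca.explained_variance_ratio_
```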
109
Concave selection in generalized linear models. Jiang, Dingfeng (01 May 2012)
A family of concave penalties, including the smoothly clipped absolute deviation (SCAD) and minimax concave penalty (MCP), has been shown to have attractive properties in variable selection. The computation of concave penalized solutions, however, is a difficult task. We propose a majorization minimization by coordinate descent (MMCD) algorithm to compute the solutions of concave penalized generalized linear models (GLMs). In contrast to existing algorithms that use local quadratic or local linear approximations of the penalty, the MMCD majorizes the negative log-likelihood by a quadratic loss but does not approximate the penalty. This strategy avoids computing scaling factors in the iterative steps and hence improves the efficiency of coordinate descent. Under certain regularity conditions, we establish the theoretical convergence property of the MMCD algorithm. We implement the algorithm in a penalized logistic regression model using the SCAD and MCP penalties. Simulation studies and a data example demonstrate that the MMCD works sufficiently fast for penalized logistic regression in high-dimensional settings where the number of covariates is much larger than the sample size.

Grouping structure among predictors exists in many regression applications. We first propose an l2 grouped concave penalty to incorporate such group information in a regression model. The l2 grouped concave penalty performs group selection and includes the group Lasso as a special case. An efficient algorithm is developed and its theoretical convergence property is established under certain regularity conditions. The group selection property of the l2 grouped concave penalty is desirable in some applications, while in others selection at both the group and individual levels is needed; hence, we also propose an l1 grouped concave penalty for variable selection at both levels, together with an efficient algorithm. Simulation studies evaluate the finite-sample performance of the two grouped concave selection methods, and the new grouped penalties are also used in analyzing two motivating datasets. The results from both the simulation and real data analyses demonstrate certain benefits of using grouped penalties; the proposed concave group penalties are therefore valuable alternatives to the standard concave penalties.
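The heart of the MMCD idea is that the logistic loss curvature is bounded by 1/4, so every coordinate update can reuse a fixed scaling factor instead of recomputing a Hessian term. A sketch for the MCP penalty on standardized predictors, with no intercept and an illustrative gamma (the fixed curvature v = 1/4 requires gamma > 4 here); this is a reading of the general strategy, not the dissertation's exact implementation:

```python
import numpy as np

def mcp_threshold(z, lam, gamma, v=0.25):
    # Minimizer of (v/2)*b**2 - z*b + MCP(|b|; lam, gamma); needs gamma > 1/v.
    if abs(z) > v * gamma * lam:
        return z / v                       # beyond the MCP flat region
    soft = np.sign(z) * max(abs(z) - lam, 0.0)
    return soft / (v - 1.0 / gamma)

def mmcd_logistic(X, y, lam, gamma=8.0, n_sweeps=100, tol=1e-6):
    # Coordinate descent for MCP-penalized logistic regression where each
    # one-dimensional loss is majorized by a quadratic with fixed curvature
    # v = 1/4 (since p(1-p) <= 1/4), avoiding per-step scaling factors.
    n, p = X.shape
    X = (X - X.mean(0)) / X.std(0)         # standardized columns
    beta, eta, v = np.zeros(p), np.zeros(n), 0.25
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            prob = 1.0 / (1.0 + np.exp(-eta))
            z = v * beta[j] + X[:, j] @ (y - prob) / n   # majorized working value
            b_new = mcp_threshold(z, lam, gamma, v)
            if b_new != beta[j]:
                eta += X[:, j] * (b_new - beta[j])       # refresh linear predictor
                max_change = max(max_change, abs(b_new - beta[j]))
                beta[j] = b_new
        if max_change < tol:
            break
    return beta
```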
110
Monitoring and diagnosis of process faults and sensor faults in manufacturing processes. Li, Shan (01 January 2008)
The substantial growth in the use of automated in-process sensing technologies creates great opportunities for manufacturers to detect abnormal manufacturing processes and identify the root causes quickly. It is critical to locate and distinguish two types of faults: process faults and sensor faults. Procedures to monitor and diagnose process and sensor mean-shift faults are presented under the assumption that the manufacturing process can be modeled by a linear fault-quality model.
A W control chart is developed to monitor the manufacturing process and quickly detect the occurrence of sensor faults. Since the W chart is insensitive to process faults, combining it with the U chart allows both process faults and sensor faults to be detected and distinguished. A unit-free index referred to as the sensitivity ratio (SR) is defined to measure the sensitivity of the W chart; it shows that this sensitivity is affected by the potential influence of the sensor measurement.
A Bayesian variable selection based fault diagnosis approach is presented to locate the root causes of abnormal processes. A Minimal Coupled Pattern (MCP) and its degree are defined to denote the coupled structure of a system. When fewer than half of the faults within an MCP occur, which is defined as sparse faults, the proposed fault diagnosis procedure identifies the correct root causes with high probability. Guidelines are provided for hyperparameter selection in the Bayesian hierarchical model, and an alternative CML method for hyperparameter selection is also discussed. Given the large number of potential process and sensor faults, an MCMC method such as the Metropolis-Hastings algorithm can be applied to approximate the posterior probabilities of candidate models.
The monitoring and diagnosis procedures are demonstrated and evaluated through an autobody assembly example.
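The W and U charts are constructions specific to this thesis and are not reproduced here. As a plainly swapped-in, generic stand-in for monitoring the assumed linear fault-quality model (sensor readings y = Gamma f + e) for mean shifts, a standard phase II Hotelling T-squared chart illustrates the setting:

```python
import numpy as np
from scipy import stats

def hotelling_t2_chart(Y_ref, Y_new, alpha=0.0027):
    # Phase II Hotelling T^2 for individual multivariate observations:
    # Y_ref is in-control reference data (m x p), Y_new the monitored
    # sensor readings; points above the F-based control limit signal a
    # mean shift somewhere in the fault-quality model.
    m, p = Y_ref.shape
    mu = Y_ref.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(Y_ref, rowvar=False))
    d = Y_new - mu
    t2 = np.einsum("ij,jk,ik->i", d, S_inv, d)      # row-wise quadratic form
    ucl = (p * (m + 1) * (m - 1)) / (m * (m - p)) * stats.f.ppf(1 - alpha, p, m - p)
    return t2, t2 > ucl
```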