61

起伏變遷型長期追蹤資料的分析方法研究 / The Analysis of Categorical Panel Data in Discrete Time with All Categories Communicating

盧宏益 Unknown Date (has links)
Many longitudinal studies in the social and medical sciences take repeated observations of a categorical outcome, along with several covariates, from follow-up subjects over a certain period of time. Such repeated observations are called longitudinal or panel data in the statistical literature. It is often of interest in these studies to investigate the relationship between the outcome and the covariates through regression modeling techniques. Commonly used models focus on assessing the contemporaneous or short-term effect of the covariates on the outcome and cannot incorporate time-varying covariates that are observed more or less frequently than the outcome. When the outcome fluctuates among different categories, it is often of interest to assess how covariates affect the evolution or trend of the underlying outcome process; such an assessment can be termed trend analysis of categorical panel data. In this thesis, we propose a Markov chain based regression model for analyzing nominal categorical panel data generated by a discrete-time outcome process. The proposed model focuses on assessing the trend effect of the covariates on the categorical outcome, and is able to utilize the complete information of covariates that are observed more or less frequently than the outcome.
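As a rough sketch of this kind of model (an illustration, not the author's exact formulation), a discrete-time Markov chain whose transition probabilities depend on covariates can be parameterized with multinomial logits; the hypothetical `transition_matrix` and `panel_loglik` functions below show one way to evaluate such a model, and the Markov structure is what lets covariates measured between outcome observations enter through products of one-step transition matrices.

```python
import numpy as np

def transition_matrix(x, beta):
    """Covariate-dependent transition probabilities via multinomial logits.

    x    : covariate vector at the current time point, shape (p,)
    beta : coefficients, shape (K, K, p); beta[j, k] holds the logit weights
           for moving from state j to state k (one destination per origin
           would normally be fixed at zero for identifiability).
    Returns a K x K matrix whose rows sum to one.
    """
    logits = beta @ x                                     # shape (K, K)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def panel_loglik(states, covariates, beta):
    """Log-likelihood of one subject's observed state sequence.

    states     : observed categories at times 0..T (integers in 0..K-1)
    covariates : covariate vectors at times 0..T-1, shape (T, p)
    """
    ll = 0.0
    for t in range(len(states) - 1):
        P = transition_matrix(covariates[t], beta)
        ll += np.log(P[states[t], states[t + 1]])
    return ll
```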
62

Geometric Methods for Mining Large and Possibly Private Datasets

Chen, Keke 07 July 2006 (has links)
With the wide deployment of data-intensive Internet applications and continued advances in sensing technology and biotechnology, large multidimensional datasets, possibly containing privacy-sensitive information, have been emerging. Mining such datasets has become increasingly common in business integration, large-scale scientific data analysis, and national security. The proposed research aims at exploring the geometric properties of the multidimensional datasets utilized in statistical learning and data mining, and providing novel techniques and frameworks for mining very large datasets while protecting the desired data privacy. The first main contribution of this research is the development of the iVIBRATE interactive visualization-based approach for clustering very large datasets. The iVIBRATE framework uniquely addresses the challenges in handling irregularly shaped clusters, domain-specific cluster definition, and cluster-labeling of the data on disk. It consists of the VISTA visual cluster rendering subsystem and the Adaptive ClusterMap Labeling subsystem. The second main contribution is the development of the "Best K Plot" (BKPlot) method for determining the critical clustering structures in multidimensional categorical data. The BKPlot method uniquely addresses two challenges in clustering categorical data: how to determine the number of clusters (the best K) and how to identify the existence of significant clustering structures. The method consists of the basic theory, the sample BKPlot theory for large datasets, and a testing method for identifying no-cluster datasets. The third main contribution of this research is the development of the theory of geometric data perturbation and its application in privacy-preserving data classification involving single-party or multiparty collaboration. The key to geometric data perturbation is finding a good randomly generated rotation matrix and an appropriate noise component that together provide a satisfactory balance between the privacy guarantee and data quality, considering possible inference attacks. When geometric perturbation is applied to collaborative multiparty data classification, it is challenging to unify the different geometric perturbations used by different parties. We study three protocols under the data-mining-service-oriented framework for unifying the perturbations: 1) the threshold-satisfied voting protocol, 2) the space adaptation protocol, and 3) the space adaptation protocol with a trusted party. The tradeoffs between the privacy guarantee, the model accuracy, and the cost are studied for the protocols.
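The core operation described above, perturbing data with a random rotation plus additive noise, can be sketched in a few lines. This is a generic illustration only; choosing the rotation and noise level to balance privacy against model accuracy, which is the substance of the thesis, is not addressed here.

```python
import numpy as np

def geometric_perturbation(X, noise_scale=0.1, seed=None):
    """Perturb a d x n data matrix X by a random rotation plus additive noise.

    A random orthogonal matrix R is drawn via QR decomposition of a Gaussian
    matrix, and the perturbed data is R @ X + Delta, where Delta is Gaussian
    noise with standard deviation `noise_scale`.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[0]
    R, _ = np.linalg.qr(rng.standard_normal((d, d)))     # random orthogonal matrix
    Delta = noise_scale * rng.standard_normal(X.shape)   # additive noise component
    return R @ X + Delta, R                              # perturbed data and the secret rotation

# Example: perturb 100 five-dimensional records (stored as columns).
X = np.random.default_rng(1).standard_normal((5, 100))
X_perturbed, R = geometric_perturbation(X, noise_scale=0.05, seed=42)
```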
63

Some properties of measures of disagreement and disorder in paired ordinal data

Högberg, Hans January 2010 (has links)
The measures studied in this thesis were a measure of disorder, D, and a measure of the individual part of the disagreement, the relative rank variance, RV, proposed by Svensson in 1993. The measure of disorder is a useful measure of order consistency in paired assessments on scales with different numbers of possible values. The measure of relative rank variance is useful for evaluating reliability and for evaluating change in qualitative outcome variables. In Paper I, an overview of methods used in the analysis of dependent ordinal data was given, and the methods were compared with regard to their assumptions, specifications, applicability, and implications for use. In Paper II, some standard models, tests, and measures were applied to two different research problems and their results compared. The sampling distribution of the measure of disorder was studied both analytically and by a simulation experiment in Paper III. The asymptotic normal distribution was shown by the theory of U-statistics, and simulation experiments for finite sample sizes and various amounts of disorder showed that the sampling distribution was approximately normal for sample sizes of about 40 to 60 for moderate sizes of D, and for smaller sample sizes for substantial sizes of D. The sampling distribution of the relative rank variance was studied in a simulation experiment in Paper IV. The simulation experiment showed that the sampling distribution was approximately normal for sample sizes of 60-100 for moderate sizes of RV, and for smaller sample sizes for substantial sizes of RV. In Paper V, a procedure for inference regarding relative rank variances from two or more samples was proposed. Pairwise comparisons using the jackknife technique for variance estimation, and the use of the normal distribution as an approximation for inference on parameters in independent samples (based on the results in Paper IV), were demonstrated. Moreover, applications of the Kruskal-Wallis test for independent samples and Friedman's test for dependent samples were conducted. / Statistical methods for ordinal data
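The jackknife step mentioned for Paper V is generic and easy to sketch; in the outline below the statistic of interest is left abstract, so a hypothetical `rank_variance` function implementing Svensson's RV (not shown here) could be passed in.

```python
import numpy as np

def jackknife_variance(paired_data, statistic):
    """Leave-one-out jackknife estimate of the variance of `statistic`.

    paired_data : array of n paired ordinal assessments, shape (n, 2)
    statistic   : callable mapping an (m, 2) array to a scalar, e.g. a
                  hypothetical rank_variance(data) implementing RV.
    """
    n = len(paired_data)
    theta_loo = np.array([
        statistic(np.delete(paired_data, i, axis=0)) for i in range(n)
    ])
    theta_bar = theta_loo.mean()
    # Standard jackknife variance formula: (n - 1)/n * sum of squared deviations.
    return (n - 1) / n * np.sum((theta_loo - theta_bar) ** 2)
```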
64

Multiple Imputation for Two-Level Hierarchical Models with Categorical Variables and Missing at Random Data

January 2016 (has links)
Accurate data analysis and interpretation of results may be influenced by many potential factors. The factors of interest in the current work are the chosen analysis model(s), the presence of missing data, and the type(s) of data collected. If analysis models are used which a) do not accurately capture the structure of relationships in the data, such as clustered/hierarchical data, b) do not allow or control for missing values present in the data, or c) do not accurately accommodate different data types, such as categorical data, then the assumptions associated with the model have not been met and the results of the analysis may be inaccurate. In the presence of clustered/nested data, hierarchical linear modeling or multilevel modeling (MLM; Raudenbush & Bryk, 2002) has the ability to predict outcomes for each level of analysis and across multiple levels (accounting for relationships between levels), providing a significant advantage over single-level analyses. When multilevel data contain missingness, multilevel multiple imputation (MLMI) techniques may be used to model both the missingness and the clustered nature of the data. With categorical multilevel data with missingness, categorical MLMI must be used. Two such routines for MLMI with continuous and categorical data were explored with missing at random (MAR) data: a formal Bayesian imputation and analysis routine in JAGS (R/JAGS) and a common MLM procedure of imputation via Bayesian estimation in BLImP with frequentist analysis of the multilevel model in Mplus (BLImP/Mplus). Manipulated variables included intraclass correlations, the number of clusters, and the rate of missingness. Results showed that with continuous data, R/JAGS returned more accurate parameter estimates than BLImP/Mplus for almost all parameters of interest across levels of the manipulated variables. Both R/JAGS and BLImP/Mplus encountered convergence issues and returned inaccurate parameter estimates when imputing and analyzing dichotomous data. Follow-up studies showed that JAGS and BLImP returned similar imputed datasets, but the choice of analysis software for MLM impacted the recovery of accurate parameter estimates. Implications of these findings and recommendations for further research are discussed. / Dissertation/Thesis / Doctoral Dissertation Educational Psychology 2016
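For readers unfamiliar with the first manipulated factor, the intraclass correlation in a two-level random-intercept model is simply the share of total variance attributable to clusters; a minimal sketch, with assumed variance components, is:

```python
def intraclass_correlation(tau00, sigma2):
    """ICC for a two-level random-intercept model: between-cluster variance
    tau00 divided by total variance (tau00 + within-cluster variance sigma2)."""
    return tau00 / (tau00 + sigma2)

# Example with assumed values: clusters account for a quarter of the total variance.
print(intraclass_correlation(tau00=0.25, sigma2=0.75))  # 0.25
```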
65

Outlier Detection with Applications in Graph Data Mining

Ranga Suri, N N R January 2013 (has links) (PDF)
Outlier detection is an important data mining task due to its applicability in many contemporary applications such as fraud detection and anomaly detection in networks. It assumes significance due to the general perception that outliers represent evolving novel patterns in data that are critical to many discovery tasks. Extensive use of various data mining techniques in different application domains gave rise to a rapid proliferation of research work on the outlier detection problem. This has led to the development of numerous methods for detecting outliers in various problem settings. However, most of these methods deal primarily with numeric data. Therefore, the problem of outlier detection in categorical data has been considered in this work, and several novel methods addressing various research issues have been developed. First, a ranking-based algorithm for detecting a likely set of outliers in given categorical data has been developed, employing two independent ranking schemes. Subsequently, the issue of data dimensionality has been addressed by proposing a novel unsupervised feature selection algorithm for categorical data. Similarly, the uncertainty associated with the outlier detection task has been dealt with by developing a novel rough-sets-based categorical clustering algorithm. Due to the networked nature of the data pertaining to many real-life applications such as computer communication networks, social networks of friends, citation networks of documents, and hyper-linked networks of web pages, outlier detection (also known as anomaly detection) in graph representations of network data turns out to be an important pattern discovery activity. Accordingly, a novel graph mining method has been envisaged in this thesis based on the concept of community detection in graphs. In addition to finding anomalous nodes and anomalous edges, this method is capable of detecting various higher-level anomalies that are arbitrary sub-graphs of the input graph. Subsequently, these ideas have been further extended in this thesis to characterize the time-varying behavior of outliers (anomalies) in dynamic network data by defining various categories of temporal outliers (anomalies). Characterizing the behavior of such outliers during the evolution of the network over time is critical for discovering different anomalous connectivity patterns with potential adverse effects, such as intrusions into a computer network. In order to deal with temporal outlier detection in single-instance network/graph data, the link prediction task has been leveraged in this thesis to produce multiple instances of the input graph. Thus, various outlier detection principles have been successfully applied for mining various categories of temporal outliers (anomalies) in the graph representation of network data.
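As a generic illustration of why categorical data needs its own outlier scoring (this is a simple frequency-based baseline, not the ranking schemes developed in the thesis), one can score each record by how rare its attribute values are:

```python
from collections import Counter

def rarity_scores(records):
    """Score categorical records by the average rarity of their attribute values.

    records : list of equal-length tuples of categorical values.
    Higher scores indicate records whose values are infrequent overall and are
    therefore more likely to be flagged as outliers by this baseline.
    """
    n = len(records)
    n_attrs = len(records[0])
    counts = [Counter(rec[j] for rec in records) for j in range(n_attrs)]
    return [
        sum(1.0 - counts[j][rec[j]] / n for j in range(n_attrs)) / n_attrs
        for rec in records
    ]

data = [("red", "cat"), ("red", "cat"), ("red", "dog"), ("blue", "fish")]
print(rarity_scores(data))  # the ("blue", "fish") record scores highest
```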
66

Modelos estatísticos para dados politômicos nominais em estudos longitudinais com uma aplicação à área agronômica / Statistical models for nominal polytomous data in longitudinal studies with an application to agronomy

Vinicius Menarin 14 January 2016 (has links)
Studies where the response is a categorical variable are quite common in many fields of science. In many situations this response is composed of more than two unordered categories, characterizing a nominal polytomous outcome, and, in general, the aim of the study is to associate the probability of occurrence of each category with the effects of explanatory variables. Furthermore, there are special types of study where many measurements are taken over time for the same sampling unit, called longitudinal studies. Such studies require statistical models that incorporate some kind of structure to support the dependence that tends to arise from repeated measurements on the same sampling unit. This work focuses on two extensions of the baseline-category logit model usually employed when there is a nominal polytomous response with independent observations. The first consists of a modification of the well-known generalized estimating equations for longitudinal data, based on local odds ratios, to describe the dependence between the levels of the response over the repeated measurements; this type of model is also known as a marginal model. The second approach adds random effects to the linear predictor of the baseline-category logit model, which also accounts for dependence between observations; this characterizes a baseline-category mixed model. There are substantial differences inherent in the interpretation of marginal and mixed models, which should be considered when choosing the most appropriate approach for each situation. Both methodologies are applied to data from an agronomic field experiment installed under a complete randomized block design with a factorial arrangement for the treatments. It was carried out over six seasons, characterizing the longitudinal structure, and the response is the type of vegetation observed in the field (tussocks, weeds, or bare ground). The results are satisfactory, even though the dependence found in the data is not strong, and likelihood-ratio and Wald tests point to several differences between treatments. Moreover, due to methodological differences between the two approaches, the marginal model based on generalized estimating equations seems more appropriate for these data.
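Both extensions build on the ordinary baseline-category logit model for independent observations; for orientation, a minimal sketch of that base model on simulated data is shown below using statsmodels' MNLogit. The data and coefficients are made up, and the longitudinal dependence that the marginal and mixed extensions handle is deliberately ignored here.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 300
x = rng.standard_normal(n)                       # one explanatory variable
# Hypothetical outcome with three unordered categories (e.g. tussocks, weeds,
# bare ground), generated only so the example is runnable.
logits = np.column_stack([np.zeros(n), 0.8 * x, -0.5 * x])
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
y = np.array([rng.choice(3, p=p) for p in probs])

X = sm.add_constant(x)
fit = sm.MNLogit(y, X).fit(disp=False)           # baseline-category logit model
print(fit.summary())
```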
67

Determining the number of classes in latent class regression models / A Monte Carlo simulation study on class enumeration

Luo, Sherry January 2021 (has links)
A Monte Carlo simulation study on class enumeration with latent class regression models. / Latent class regression (LCR) is a statistical method used to identify qualitatively different groups, or latent classes, within a heterogeneous population and is commonly used in the behavioural, health, and social sciences. Despite its wide application, which fit index correctly determines the number of latent classes is hotly debated. In addition, there are conflicting views on whether covariates should be included in the class enumeration process. We conduct a simulation study to determine the impact of covariates on class enumeration accuracy and to study the performance of several commonly used fit indices under different population models and modelling conditions. Our results indicate that, of the eight fit indices considered, the aBIC and BLRT proved to be the best-performing fit indices for class enumeration. Furthermore, we found that covariates should not be included in the enumeration procedure. Our results illustrate that an unconditional LCA model can enumerate classes as well as a conditional LCA model with its true covariate specification. Even in the presence of large covariate effects in the population, the unconditional model is capable of enumerating with high accuracy. As noted by Nylund and Gibson (2016), a misspecified covariate specification can easily lead to an overestimation of the number of latent classes. Therefore, we recommend performing class enumeration without covariates and determining a set of candidate latent class models with the aBIC. Once that set is determined, the BLRT can be applied to the candidate models to confirm whether its results match those of the aBIC. Separating the BLRT from the initial enumeration procedure still allows one to use it while reducing the heavy computational burden associated with this fit index. Subsequent analyses can then be pursued accordingly after the number of latent classes is determined. / Thesis / Master of Science (MSc)
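The aBIC recommended here is the sample-size adjusted BIC, which differs from the ordinary BIC only in the penalty term's effective sample size; a minimal sketch, with made-up log-likelihoods, for comparing two candidate class solutions:

```python
import math

def bic(loglik, n_params, n):
    """Ordinary BIC: -2*logL + p*ln(n)."""
    return -2 * loglik + n_params * math.log(n)

def abic(loglik, n_params, n):
    """Sample-size adjusted BIC: replaces n with (n + 2) / 24 in the penalty."""
    return -2 * loglik + n_params * math.log((n + 2) / 24)

# Example with assumed values: 2-class vs. 3-class solution; lower aBIC is preferred.
print(abic(-1850.4, n_params=9, n=500))
print(abic(-1828.7, n_params=14, n=500))
```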
68

The development of authentic virtual reality scenarios to measure individuals’ level of systems thinking skills and learning abilities

Dayarathna, Vidanelage L. 10 December 2021 (has links) (PDF)
This dissertation develops virtual reality modules to capture individuals' learning abilities and systems thinking skills in dynamic environments. In the first chapter, an immersive queuing theory teaching module is developed using virtual reality technology. The objective of the study is to present systems engineering concepts in a more sophisticated environment and measure students' learning abilities. Furthermore, the study explores the performance gaps between male and female students on manufacturing systems concepts. To investigate gender bias in performance on the developed VR module, three efficacy measures (the simulation sickness questionnaire, the system usability scale, and the presence questionnaire) and two effectiveness measures (the NASA TLX assessment and a post-motivation questionnaire) were used. The second and third chapters aim to assess individuals' systems thinking skills when they engage with complex multidimensional problems. A modern complex system comprises many interrelated subsystems and various dynamic attributes. Understanding and handling large complex problems requires holistic critical thinkers in modern workplaces. Systems Thinking (ST) is an interdisciplinary domain that offers different ways to better understand the behavior and structure of a complex system. The developed scenario-based instrument measures students' cognitive tendency for complexity, change, and interaction when making decisions in a turbulent environment. The proposed complex systems scenarios are developed based on an established systems thinking instrument that can measure important aspects of systems thinking skills. The systems scenarios are built in a virtual environment that allows students to react to real-world situations and make decisions. The construct validity of the VR scenarios is assessed by comparing the high systems thinking scores obtained on the established ST instrument and on the developed VR scenarios. Furthermore, the efficacy of the VR scenarios is investigated using the simulation sickness questionnaire, the system usability scale, the presence questionnaire, and the NASA TLX assessment.
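Among the efficacy measures listed, the System Usability Scale has a fixed published scoring rule; the sketch below implements that standard 0-100 scoring and is purely illustrative, with nothing specific to the developed VR modules.

```python
def sus_score(responses):
    """Score a single 10-item System Usability Scale questionnaire.

    responses : ten integers in 1..5, in questionnaire order.
    Odd-numbered items contribute (response - 1), even-numbered items
    contribute (5 - response); the sum is scaled by 2.5 to give 0-100.
    """
    if len(responses) != 10:
        raise ValueError("SUS has exactly 10 items")
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)   # i is 0-based, so even index = odd item
        for i, r in enumerate(responses)
    ]
    return 2.5 * sum(contributions)

print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # example: 85.0
```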
69

An Analysis of Financial Planning for Employees of East Tennessee State University.

Campbell, Steven Roy 06 May 2006 (has links) (PDF)
The purpose of this study was to determine whether East Tennessee State University provides its employees appropriate financial planning services. In particular, it was unknown to what degree employees of East Tennessee State University had actively engaged in financial planning. The research was conducted during June and July 2005. Data were gathered by surveying faculty, staff, and retirees of the university. Ten percent of the population responded to the study. The survey instrument covered the areas of retirement, other financial planning services, and attitudes toward financial planning. The results of the data analysis gave insight into the degree to which employees of East Tennessee State University have actively engaged in financial planning. For example, over 20% of the respondents encouraged employees to start early in order to benefit from the time value of money. Fifteen percent of the respondents suggested that financial planning workshops be offered more frequently. Approximately 10% of the respondents preferred that an instructor be independent rather than a financial salesperson. The study added to the body of knowledge on financial planning for ETSU employees and established a historical database for the various programs offered within the ETSU system.
70

Data Mining Methods For Malware Detection

Siddiqui, Muazzam 01 January 2008 (has links)
This research investigates the use of data mining methods for malware (malicious program) detection and proposes a framework as an alternative to traditional signature-based detection methods. Traditional approaches that use signatures to detect malicious programs fail for new and unknown malware, for which signatures are not available. We present a data mining framework to detect malicious programs. We collected, analyzed, and processed several thousand malicious and clean programs to find the best features and build models that can classify a given program as malware or clean. Our research is closely related to information retrieval and classification techniques and borrows a number of ideas from those fields. We used a vector space model to represent the programs in our collection. Our data mining framework includes two separate and distinct classes of experiments. The first is a set of supervised learning experiments that used a dataset consisting of several thousand malicious and clean program samples to train, validate, and test an array of classifiers. In the second class of experiments, we proposed using sequential association analysis for feature selection and automatic signature extraction. With our experiments, we were able to achieve a detection rate as high as 98.4% and a false positive rate as low as 1.9% on novel malware.
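The pipeline described, representing programs as vectors and training classifiers to separate malicious from clean samples, follows the familiar document-classification pattern; a generic scikit-learn sketch on hypothetical token strings is shown below. This illustrates the general approach only, not the thesis's feature set, classifiers, or reported results.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Hypothetical data: each program is represented as a string of extracted
# tokens (e.g. opcode n-grams); labels are 1 for malware, 0 for clean.
programs = ["push mov call jmp", "mov mov xor ret", "call jmp jmp push", "xor ret mov push"]
labels = [1, 0, 1, 0]

# Vector space model: token counts (unigrams and bigrams) as features.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(programs)

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.5, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```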
