81 |
Análise de dados categorizados com omissão / Analysis of categorical data with missingness
Frederico Zanqueta Poleto 30 August 2006
Neste trabalho aborda-se aspectos teóricos, computacionais e aplicados de análises clássicas de dados categorizados com omissão. Uma revisão da literatura é apresentada enquanto se introduz os mecanismos de omissão, mostrando suas características e implicações nas inferências de interesse por meio de um exemplo considerando duas variáveis respostas dicotômicas e estudos de simulação. Amplia-se a modelagem descrita em Paulino (1991, Brazilian Journal of Probability and Statistics 5, 1-42) da distribuição multinomial para a produto de multinomiais para possibilitar a inclusão de variáveis explicativas na análise. Os resultados são desenvolvidos em formulação matricial adequada para a implementação computacional, que é realizada com a construção de uma biblioteca para o ambiente estatístico R, a qual é disponibilizada para facilitar o traçado das inferências descritas nesta dissertação. A aplicação da teoria é ilustrada por meio de cinco exemplos de características diversas, uma vez que se ajusta modelos estruturais lineares (homogeneidade marginal), log-lineares (independência, razão de chances adjacentes comum) e funcionais lineares (kappa, kappa ponderado, sensibilidade/especificidade, valor preditivo positivo/negativo) para as probabilidades de categorização. Os padrões de omissão também são variados, com omissões em uma ou duas variáveis, confundimento de células vizinhas, sem ou com subpopulações. / We consider theoretical, computational and applied aspects of classical categorical data analyses with missingness. We present a literature review while introducing the missingness mechanisms, highlighting their characteristics and implications in the inferences of interest by means of an example involving two binary responses and simulation studies. We extend the multinomial modeling scenario described in Paulino (1991, Brazilian Journal of Probability and Statistics 5, 1-42) to the product-multinomial setup to allow for the inclusion of explanatory variables. 
We develop the results in matrix formulation and implement the computational procedures via subroutines written for the R statistical environment. We illustrate the application of the theory by means of five examples with different characteristics, fitting structural linear (marginal homogeneity), log-linear (independence, constant adjacent odds ratio) and functional linear models (kappa, weighted kappa, sensitivity/specificity, positive/negative predictive value) for the marginal probabilities. The missingness patterns include missingness in one or two variables and confounding of neighboring cells, with or without explanatory variables.
|
82 |
"Konstrukcija i analiza klaster algoritma sa primenom u definisanju bihejvioralnih faktora rizika u populaciji odraslog stanovništva Srbije" / "Construction and analysis of cluster algorithm with application in defining behavioural risk factors in Serbian adult population"
Dragnić Nataša 23 June 2016
Klaster analiza ima dugu istoriju i mada se primenjuje u mnogim oblastima i dalje ostaju značajni izazovi. U disertaciji je prikazan uvod u neglatki optimizacioni pristup u klasterovanju, sa osvrtom na problem klasterovanja velikih skupova podataka. Međutim, ovi optimizacioni algoritmi bolje funkcionišu u radu sa neprekidnim podacima. Jedan od glavnih izazova u klaster analizi je rad sa velikim skupovima podataka sa kategorijalnim i kombinovanim (numerički i kategorijalni) tipovima promenljivih. Rad sa velikim brojem instanci (objekata) i velikim brojem dimenzija (promenljivih), može predstavljati problem u klaster analizi, zbog vremenske složenosti. Jedan od načina rešavanja ovog problema je redukovanje broja instanci, bez gubitka informacija.
Prvi cilj disertacije je bio upoređivanje rezultata klasterovanja na celom skupu i prostim slučajnim uzorcima sa kategorijalnim i kombinovanim podacima, za različite veličine uzorka i različit broj klastera. Nije utvrđena značajna razlika (p>0.05) u rezultatima klasterovanja na uzorcima obima 0.03m, 0.05m, 0.1m, 0.3m (gde je m obim posmatranog skupa) i celom skupu.
Drugi cilj disertacije je bio konstrukcija efikasnog postupka klasterovanja velikih skupova podataka sa kategorijalnim i kombinovanim tipovima promenljivih. Predloženi postupak se sastoji iz sledećih koraka: 1. klasterovanje na prostim slučajnim uzorcima određene kardinalnosti; 2. određivanje najboljeg klasterskog rešenja na uzorku, primenom odgovarajućeg kriterijuma validnosti; 3. dobijeni centri klastera iz ovog uzorka služe za klasterovanje ostatka skupa.
Treći cilj disertacije predstavlja primenu klaster analize u definisanju klastera bihejvioralnih faktora rizika u populaciji odraslog stanovništva Srbije, kao i analizu sociodemografskih karakteristika dobijenih klastera. Klaster analiza je primenjena na velikom reprezentativnom uzorku odraslog stanovništva Srbije, starosti 20 i više godina. Izdvojeno je pet jasno odvojenih klastera sa karakterističnim kombinacijama bihejvioralnih faktora rizika: Bez rizičnih faktora, Štetna upotreba alkohola i druge rizične navike, Nepravilna ishrana i druge rizične navike, Nedovoljna fizička aktivnost, Pušenje. Rezultati multinomnog logističkog regresionog modela ukazuju da ispitanici koji nisu u braku, lošijeg su materijalnog stanja, nižeg obrazovanja i žive u Vojvodini imaju veću šansu za prisustvo višestrukih bihejvioralnih faktora rizika. / Cluster analysis has a long history and a large number of clustering techniques have been developed in many areas; however, significant challenges still remain. In this thesis we provide an introduction to the nonsmooth optimization approach to clustering, with reference to clustering large datasets. Nevertheless, these optimization clustering algorithms work much better when a dataset contains only vectors with continuous features. One of the main challenges is the clustering of large datasets with categorical and mixed (numerical and categorical) data. Clustering with a large number of instances (objects) and a large number of dimensions (variables) can be problematic because of time complexity. One of the ways to solve this problem is to reduce the number of instances without loss of information.
The first aim of this thesis was to compare the results of cluster algorithms on the whole dataset and on simple random samples with categorical and mixed data, in terms of validity, for different numbers of clusters and different sample sizes. There were no significant differences (p>0.05) between the results obtained on samples of size 0.03m, 0.05m, 0.1m, 0.3m (where m is the size of the dataset) and on the whole dataset.
The second aim of this thesis was to develop an efficient clustering procedure for large datasets with categorical and mixed (numeric and categorical) values. The proposed procedure consists of the following steps: 1. clustering on simple random samples of a given cardinality; 2. finding the best cluster solution on a sample (by an appropriate validity measure); 3. using the cluster centers from this sample to cluster the remaining data.
The third aim of this thesis was to examine the clustering of lifestyle risk factors and the variation across different socio-demographic groups in a Serbian adult population. Cluster analysis was carried out on a large representative sample of Serbian adults aged 20 and over. We identified five homogeneous health behaviour clusters with specific combinations of risk factors: 'No Risk Behaviours', 'Drinkers with Risk Behaviours', 'Unhealthy diet with Risk Behaviours', 'Insufficient physical activity', 'Smoking'. Results of multinomial logistic regression indicated that adults who are single, less educated, of low socio-economic status and living in the region of Vojvodina are most likely to be part of the clusters with a high-risk profile.
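The three-step procedure described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the algorithm from the dissertation: it uses a toy k-modes-style clustering with Hamming distance for purely categorical records, replaces step 2 (choosing the best solution by a validity measure) with a single run, and all function names are hypothetical.

```python
import random
from collections import Counter


def hamming(a, b):
    """Number of positions where two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest(record, centers):
    """Index of the closest center (ties go to the lowest index)."""
    return min(range(len(centers)), key=lambda j: hamming(record, centers[j]))


def k_modes(rows, k, n_iter=20, seed=0):
    """Toy k-modes: each center is the column-wise mode of its cluster.

    Raises ValueError if rows contains fewer than k distinct records.
    """
    rng = random.Random(seed)
    centers = rng.sample(sorted(set(rows)), k)  # start from k distinct records
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for r in rows:
            groups[nearest(r, centers)].append(r)
        updated = [
            tuple(Counter(col).most_common(1)[0][0] for col in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if updated == centers:  # converged
            break
        centers = updated
    return centers


def cluster_via_sample(rows, k, fraction=0.1, seed=0):
    """Step 1: cluster a simple random sample. Step 3: label all records by nearest center."""
    rng = random.Random(seed)
    m = min(len(rows), max(k, int(fraction * len(rows))))
    sample = rng.sample(rows, m)
    centers = k_modes(sample, k, seed=seed)
    labels = [nearest(r, centers) for r in rows]
    return centers, labels
```

Only the sample is clustered iteratively; the remaining records are assigned in a single pass, which is where the time saving on large datasets would come from.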
|
83 |
Understanding Brigham Young University's Technology Teacher Education Program's Sucess in Attracting and Retaining Female Students
Cox, Katrina M. 12 July 2006
The purpose of the study was to understand why the Brigham Young University Technology Teacher Education program has attracted and retained a high number of females. This was done through a self-created survey composed of four forced responses, distributed to students in the Winter 2006 semester. Likert-scale questions were outlined according to the five theoretical influences on women in technology, as established by Welty and Puck (2001), and two of the three relationships of academia, as established by Haynie III (1999), along with three free-response questions regarding retention and attraction within the major. Findings suggested strong positive polarity in four of the five influences and in both relationships, with particular emphasis on subject content, positive teacher/student relationships, and an overall positive environment as major contributors to attraction and retention at this university. "Role Models, Mentors, and Peers" was the only influence that scored in the negative range. Though the effect size showed differences between males and females on individual questions as well as on the two relationships and "Messages from Counselors", no practical difference was found between the male and female perceptions under the five remaining general categories. In all three categories where a medium to large effect size was shown, females had more positive responses and perceptions than males.
|
84 |
State Level Earned Income Tax Credit’s Effects on Race and Age: An Effective Poverty Reduction Policy
Barone, Anthony J 01 January 2013
In this paper, I analyze the effectiveness of state-level Earned Income Tax Credit (EITC) programs in reducing poverty levels. I conducted this analysis for the years 1991 through 2011 using a panel data model with fixed effects. The main independent variables of interest were the state and federal EITC rates, minimum wage, gross state product, population, and unemployment, all by state. I found that increases to the state EITC rates provided only a slight decrease in both the overall white below-poverty population and the corresponding white population under 18, while both the overall and the under-18 black populations in this category realized moderate decreases in their poverty rates over the same time period. I also compare the effectiveness of state-level EITCs and the state-level minimum wage over the same time period for these demographic groups.
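To make the "panel data model with fixed effects" concrete, here is a minimal sketch of the within (demeaning) estimator for a single regressor. It is not the paper's specification: the function name, the single-regressor setup and the toy numbers are all hypothetical and exist only to show how state-level effects drop out.

```python
from collections import defaultdict


def within_slope(states, x, y):
    """Fixed-effects (within) slope estimate for one regressor.

    Each observation is demeaned by its state average, which removes any
    time-invariant state effect before ordinary least squares is applied.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # state -> [sum x, sum y, count]
    for s, xi, yi in zip(states, x, y):
        acc = sums[s]
        acc[0] += xi
        acc[1] += yi
        acc[2] += 1
    sxy = sxx = 0.0
    for s, xi, yi in zip(states, x, y):
        sx, sy, n = sums[s]
        dx = xi - sx / n
        dy = yi - sy / n
        sxy += dx * dy
        sxx += dx * dx
    return sxy / sxx


# Made-up toy panel (not real data): poverty = state effect - 0.5 * EITC rate.
states = ["A", "A", "A", "B", "B", "B"]
eitc = [0.0, 0.1, 0.2, 0.1, 0.2, 0.3]
poverty = [15.0 - 0.5 * r for r in eitc[:3]] + [11.0 - 0.5 * r for r in eitc[3:]]
slope = within_slope(states, eitc, poverty)
```

Because the two hypothetical states differ only in their intercepts (15.0 and 11.0), demeaning within each state recovers the common slope regardless of those levels.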
|
85 |
潛在移轉分析法與中位數法在長期追蹤資料分組的差異比較 / On classification of longitudinal data ─ comparison between Latent Transition Analysis and the method using Median as a cutpoint
李坤瑋, Lee, Kun Wei Unknown Date
當資料屬於類別型的長期追蹤資料(Longitudinal categorical data)時,除了可以透過廣義估計方程式(General estimate equation, GEE)來求解模型參數估計值外,潛在移轉分析(Latent transition analysis, LTA)法也是一種可行的資料分析方法。若資料的期數不多,也可以選擇將資料適度分群後使用羅吉斯迴歸分析(Logistic regression)法。當探討的反應變數為二元(Binary)型態,且觀察對象於每一期提供多個測量變數值的情況之下,廣義估計方程式與羅吉斯迴歸分析法的使用,文獻上常見先將所有的測量變數值加總後,以「中位數」作為分類的切割點。不同於以上兩種方法,潛在移轉分析法則是直接使用原始資料來取得觀察對象的潛在狀態相關訊息,因此與前二者的作法不同,可能導致後續的各項分析結果有所差異存在。
為了能夠了解造成中位數分類法與移轉分析法差異的可能因素,我們架構在潛在移轉分析法的模型下,以不同的參數設定來進行電腦模擬,比較各參數條件下的兩分類方法差異。結果發現各潛在狀態下的測量變數反應機率形式、第一期潛在狀態的組成比例等皆會對兩分類方法是否具有相同分類有所影響。另外,透過分析「青少年媒體使用與健康生活調查」的實際資料得知,潛在移轉分析會將大部分的觀察對象歸屬於「網路成癮」,而中位數分類法則是將大部分的觀察對象歸屬於「無網路成癮」。此外,可以注意到「沮喪」、「線上情色每星期平均使用天數」、及「父母相處狀況」這幾個控制變數與各分組結果的關聯性,於上述三種資料分析方法中有所不同。 / Several methods can be used to analyze longitudinal categorical data; among them, Latent Transition Analysis (LTA) and Generalized Linear Models estimated by Generalized Estimating Equations (GEE) are probably the most popular. In addition, if the number of periods is two, then with a certain grouping of the data, Logistic Regression can also be applied to perform the analyses.
When there is more than one manifest response variable for each study subject, LTA is able to classify the subjects in terms of the original manifest response variables and proceed with the necessary analyses. On the other hand, the GEE method and Logistic Regression lack this flexibility and require the manifest response variables to be transformed into a single categorical response variable first. One common way to form a binary response is to sum all manifest variables and then take the median as a cut-point.
In this study, we use simulations to explore the differences between the classification obtained directly from LTA and the one obtained using the median as a cut-point. An empirical study is also provided to illustrate the classification differences and the differences in the subsequent analyses using LTA, the GEE method, and the Logistic Regression approach.
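The median cut-point transformation mentioned in the abstract can be sketched as follows. The abstract does not say how totals equal to the median are handled, so this sketch assumes they fall in the lower group; the function name and the example scores are hypothetical.

```python
from statistics import median


def median_split(item_scores):
    """Sum each subject's manifest items, then label totals above the median as 1.

    Assumption: totals equal to the median are labelled 0.
    """
    totals = [sum(items) for items in item_scores]
    cut = median(totals)
    labels = [1 if t > cut else 0 for t in totals]
    return labels, cut


# Four hypothetical subjects, three binary items each.
labels, cut = median_split([[0, 1, 0], [1, 1, 1], [0, 0, 0], [1, 1, 0]])
```

Because the cut-point is the sample median, roughly half of the subjects land in each group by construction, whatever the underlying latent statuses are.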
|
86 |
Spatiotemporal Analyses of Recycled Water Production
Archer, Jana E. 01 May 2017
Increased demands on water supplies caused by population expansion, saltwater intrusion, and drought have led to water shortages, which may be addressed by the use of recycled water. Study I investigated recycled water production in Florida and California during 2009 to detect gaps in distribution and identify areas for expansion. Gaps were detected along the panhandle and in Miami, Florida, as well as in the northern and southwestern regions of California. Study II examined gaps in distribution, identified temporal change, and located areas for expansion for Florida in 2009 and 2015. Production increased in the northern and southern regions of Florida but decreased in Southwest Florida. Recycled water is an essential component of water management; broader adoption of recycled water will increase water conservation in water-stressed coastal communities by allocating recycled water to purposes that once used potable freshwater.
|
87 |
Das nichtparametrische Behrens-Fisher-Problem: ein studentisierter Permutationstest und robuste Konfidenzintervalle für den Shift-Effekt / The non-parametric Behrens-Fisher Problem: A Studentized Permutation Test and Robust Confidence Intervals for the Shift Effect
Neubert, Karin 07 July 2006
No description available.
|
88 |
具有額外或不足變異的群集類別資料之研究 / A Study of Modelling Categorical Data with Overdispersion or Underdispersion
蘇聖珠, Su, Sheng-Chu Unknown Date
進行調查時,最後的抽樣單位常是從不同的群集取得的,而同一群集內的樣本對象,因背景類似而對於某些問題常會傾向相同或類似的反應,研究者若忽略這種群內相關性,仍以獨立性樣本進行分析時,因其共變異數矩陣通常會與多項模式的共變異數矩陣相差懸殊,而造成所謂的額外變異或不足變異的現象。本文在不同的情況下,提出了Dirichlet-Multinomial模式(簡稱DM模式)、擴展的DM模式、以及兩種平均數-共變異數矩陣模式,以適當的彙整所有的群集資料。並討論DM與EDM模式中相關之參數及格機率之最大概似估計法,且分別對此兩種平均數-共變異數矩陣模式,提出求導廣義最小平方估計的程序。此外,也針對幾種特殊的二維表及三維表結構,探討對應的參數及格機率之估計方法。並提出計算簡易的Score統計檢定量以判斷群內相關(intra-cluster correlation)之存在性,及判斷資料集具有額外或不足變異,而對於不同母體的群內相關同質性檢定亦提出討論。 / This paper presents a modelling method for analyzing categorical data with overdispersion or underdispersion. In many studies, data are collected from different clusters, and members within the same cluster behave similarly. Thus, the responses of members within the same cluster are not independent, and the multinomial distribution is not the correct distribution for the observed counts. Therefore, the covariance matrix of the sample proportion vector tends to be much different from that of the multinomial model. We discuss four different models for fitting count data with overdispersion or underdispersion, which include the Dirichlet-Multinomial model (DM model), the extended DM model (EDM model), and two mean-covariance models. Maximum-likelihood estimation is discussed for the DM and EDM models. Procedures to derive generalized least squares estimates are proposed for the two mean-covariance models. We also discuss how to estimate the cell probabilities under several special structures of two-way and three-way tables. Easily computed score test statistics are derived for the DM and EDM models to test for the existence of intra-cluster correlation. A test of homogeneity of intra-cluster correlation among several populations is also derived.
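A small numerical sketch may help show why intra-cluster correlation inflates the covariance relative to the multinomial model. It uses the standard Dirichlet-multinomial variance formula for a single cell count; the function name and the numbers are hypothetical, and the parametrization in terms of an intra-cluster correlation `rho` is an assumption about notation, not taken from the thesis.

```python
def dm_cell_variance(n, pi, rho):
    """Variance of one cell count under a Dirichlet-multinomial model.

    n:   cluster size (number of members observed in the cluster)
    pi:  marginal probability of the cell
    rho: intra-cluster correlation; rho = 0 gives the multinomial variance
    """
    multinomial_variance = n * pi * (1.0 - pi)
    inflation = 1.0 + (n - 1) * rho
    return multinomial_variance * inflation


# Hypothetical cluster of 20 members, cell probability 0.3.
independent = dm_cell_variance(20, 0.3, 0.0)
correlated = dm_cell_variance(20, 0.3, 0.2)
ratio = correlated / independent
```

With these made-up inputs the variance is 4.8 times the multinomial value, so an analysis that assumes independence would understate the uncertainty considerably. A negative `rho` in the same formula would give a factor below 1, which corresponds to underdispersion.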
|
89 |
Item Response Theory in the Neurodegenerative Disease Data Analysis / Théorie de la réponse d'item dans l'analyse des données sur les maladies neurodégénératives
Wang, Wenjia 21 June 2017
Les maladies neurodégénératives, telles que la maladie d'Alzheimer (AD) et Charcot Marie Tooth (CMT), sont des maladies complexes. Leurs mécanismes pathologiques ne sont toujours pas bien compris et les progrès dans la recherche et le développement de nouvelles thérapies potentielles modifiant la maladie sont lents. Les données catégorielles, comme les échelles de notation et les données sur les études d'association génomique (GWAS), sont largement utilisées dans les maladies neurodégénératives dans le diagnostic, la prédiction et le suivi de la progression. Il est important de comprendre et d'interpréter ces données correctement si nous voulons améliorer la recherche sur les maladies neurodégénératives. Le but de cette thèse est d'utiliser la théorie psychométrique moderne: théorie de la réponse d’item pour analyser ces données catégoriques afin de mieux comprendre les maladies neurodégénératives et de faciliter la recherche de médicaments correspondante. Tout d'abord, nous avons appliqué l'analyse de Rasch afin d'évaluer la validité du score de neuropathie Charcot-Marie-Tooth (CMTNS), un critère important d'évaluation principal pour les essais cliniques de la maladie de CMT. Nous avons ensuite adapté le modèle Rasch à l'analyse des associations génétiques pour identifier les gènes associés à la maladie d'Alzheimer. Cette méthode résume les génotypes catégoriques de plusieurs marqueurs génétiques tels que les polymorphisme nucléotidique (SNPs) en un seul score génétique. Enfin, nous avons calculé l'information mutuelle basée sur la théorie de réponse d’item pour sélectionner les items sensibles dans ADAS-cog, une mesure de fonctionnement cognitif la plus utilisées dans les études de la maladie d'Alzheimer, afin de mieux évaluer le progrès de la maladie. / Neurodegenerative diseases, such as Alzheimer’s disease (AD) and Charcot Marie Tooth (CMT), are complex diseases. 
Their pathological mechanisms are still not well understood, and progress in the research and development of new potential disease-modifying therapies is slow. Categorical data such as rating scales and Genome-Wide Association Studies (GWAS) data are widely used in neurodegenerative diseases for diagnosis, prediction and progression monitoring. It is important to understand and interpret these data correctly if we want to improve research on these diseases. The purpose of this thesis is to use modern psychometric Item Response Theory to analyze these categorical data, in order to better understand neurodegenerative diseases and facilitate the corresponding drug research. First, we applied Rasch analysis to assess the validity of the Charcot-Marie-Tooth Neuropathy Score (CMTNS), a main endpoint for CMT clinical trials. We then adapted the Rasch model to the analysis of genetic associations and used it to identify genes associated with Alzheimer’s disease by summarizing the categorical genotypes of several genetic markers, such as Single Nucleotide Polymorphisms (SNPs), into one genetic score. Finally, to select sensitive items in one of the most widely used psychometric tests for Alzheimer’s disease, we calculated the mutual information based on the item response model to evaluate the sensitivity of each item on the ADAS-cog scale.
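For readers unfamiliar with the Rasch model, the sketch below shows its simplest, dichotomous form: the probability of a positive response depends only on the difference between a person parameter and an item parameter. Rating scales such as those discussed here typically have more than two categories, so this is a simplified illustration rather than the model fitted in the thesis, and the function name is hypothetical.

```python
import math


def rasch_probability(theta, b):
    """Probability of a positive response under the dichotomous Rasch model.

    theta: person parameter (e.g. severity or ability)
    b:     item parameter (difficulty)
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# A person located exactly at the item's difficulty responds positively half the time.
p_equal = rasch_probability(0.0, 0.0)
# The same person facing an easier item (b = -1) has a higher probability.
p_easier = rasch_probability(0.0, -1.0)
```

Since only the difference `theta - b` enters the formula, persons and items sit on a single common scale, which is what allows a set of categorical responses to be summarized as one score.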
|
90 |
Imputation multiple par analyse factorielle : Une nouvelle méthodologie pour traiter les données manquantes / Multiple imputation using principal component methods : A new methodology to deal with missing values
Audigier, Vincent 25 November 2015
Cette thèse est centrée sur le développement de nouvelles méthodes d'imputation multiples, basées sur des techniques d'analyse factorielle. L'étude des méthodes factorielles, ici en tant que méthodes d'imputation, offre de grandes perspectives en termes de diversité du type de données imputées d'une part, et en termes de dimensions de jeux de données imputés d'autre part. Leur propriété de réduction de la dimension limite en effet le nombre de paramètres estimés. Dans un premier temps, une méthode d'imputation simple par analyse factorielle de données mixtes est détaillée. Ses propriétés sont étudiées, en particulier sa capacité à gérer la diversité des liaisons mises en jeu et à prendre en compte les modalités rares. Sa qualité de prédiction est éprouvée en la comparant à l'imputation par forêts aléatoires. Ensuite, une méthode d'imputation multiple pour des données quantitatives basée sur une approche Bayésienne du modèle d'analyse en composantes principales est proposée. Elle permet d'inférer en présence de données manquantes y compris quand le nombre d'individus est petit devant le nombre de variables, ou quand les corrélations entre variables sont fortes. Enfin, une méthode d'imputation multiple pour des données qualitatives par analyse des correspondances multiples (ACM) est proposée. La variabilité de prédiction des données manquantes est reflétée via un bootstrap non-paramétrique. L'imputation multiple par ACM offre une réponse au problème de l'explosion combinatoire limitant les méthodes concurrentes dès lors que le nombre de variables ou de modalités est élevé. / This thesis proposes new multiple imputation methods that are based on principal component methods, which were initially used for exploratory analysis and visualisation of continuous, categorical and mixed multidimensional data. The study of principal component methods for imputation, never previously attempted, offers the possibility to deal with many types and sizes of data. This is because the number of estimated parameters is limited due to dimensionality reduction. First, we describe a single imputation method based on factor analysis of mixed data. We study its properties and focus on its ability to handle complex relationships between variables, as well as infrequent categories. Its high prediction quality is highlighted with respect to the state-of-the-art single imputation method based on random forests. Next, a multiple imputation method for continuous data using principal component analysis (PCA) is presented. This is based on a Bayesian treatment of the PCA model. Unlike standard methods based on Gaussian models, it can still be used when the number of variables is larger than the number of individuals and when correlations between variables are strong. Finally, a multiple imputation method for categorical data using multiple correspondence analysis (MCA) is proposed. The variability of prediction of missing values is introduced via a non-parametric bootstrap approach. This helps to tackle the combinatorial issues which arise from the large number of categories and variables. We show that multiple imputation using MCA outperforms the best current methods.
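The abstract does not describe how results from the several imputed datasets are combined. As general background on multiple imputation, and not as part of the methods proposed in the thesis, the sketch below applies Rubin's rules to pool one estimate across M imputed datasets. The function name and the numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, within_variances):
    """Pool one parameter estimate across imputed datasets using Rubin's rules.

    estimates:        the estimate obtained from each imputed dataset
    within_variances: the squared standard error from each imputed dataset
    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(within_variances)   # average sampling variance
    between = variance(estimates)     # variance between imputations (divisor m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


# Made-up results from three imputed datasets.
pooled, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what distinguishes multiple from single imputation: it carries the uncertainty about the missing values into the final variance.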
|