11 |
Unsupervised Data Mining by Recursive Partitioning
He, Aijing 16 September 2002 (has links)
No description available.
|
12 |
Recursive Partitioning of Models of a Generalized Linear Model Type
Rusch, Thomas 10 June 2012 (has links) (PDF)
This thesis is concerned with recursive partitioning of models of a generalized linear model type (GLM-type), i.e., maximum likelihood models with a linear predictor for the linked mean, a topic that has received constant interest over the last twenty years. The resulting tree (a ''model tree'') can be seen as an extension of classic trees that allows for a GLM-type model in the partitions. In this work, the focus lies on applied and computational aspects of model trees with GLM-type node models, to work out different areas where the combination of parametric models and trees is beneficial and to build a computational scaffold for future applications of model trees. In the first part, model trees are defined and some algorithms for fitting model trees with GLM-type node models are reviewed and compared in terms of their properties of tree induction and node model fitting. Additionally, the design of a particularly versatile algorithm, the MOB algorithm (Zeileis et al. 2008) in R, is described, and an in-depth discussion of how the functionality offered can be extended to various GLM-type models is provided. This is highlighted by an example of using partitioned negative binomial models to investigate the effect of health care incentives. Part 2 consists of three research articles where model trees are applied to different problems that frequently occur in the social sciences. The first uses trees with GLM-type node models and applies them to a data set of voters who show a non-monotone relationship between the frequency of attending past elections and the turnout in 2004. Three different types of model tree algorithms are used to investigate this phenomenon, and for two of them the resulting trees can explain the counter-intuitive finding. Here model trees are used to learn a nonlinear relationship between a target model and a large number of candidate variables to provide more insight into a data set. A second application area is also discussed, namely using model trees to detect ill-fitting subsets in the data. The second article uses model trees to model the number of fatalities in the Afghanistan war, based on the WikiLeaks Afghanistan war diary. Data pre-processing with a topic model generates predictors that are used as explanatory variables in a model tree for overdispersed count data. Here the combination of model trees and topic models allows database data, frequently encountered in data journalism, to be analysed flexibly, and provides a coherent description of fatalities in the Afghanistan war. The third paper uses a new framework built around model trees to approach the classic problem of segmentation, frequently encountered in marketing and management science. Here, the framework is used to segment a sample of the US electorate in order to identify likely and unlikely voters. It is shown that the framework's model trees enable accurate identification, which in turn allows efficient targeted mobilisation of eligible voters. (author's abstract)
|
13 |
Modeling Mortality Rates in the WikiLeaks Afghanistan War Logs
Rusch, Thomas, Hofmarcher, Paul, Hatzinger, Reinhold, Hornik, Kurt 09 1900 (has links) (PDF)
The WikiLeaks Afghanistan war logs contain more than 76,000 reports about fatalities and their circumstances in the US-led Afghanistan war, covering the period from January 2004 to December 2009. In this paper we use those reports to build statistical models to help us understand the mortality rates associated with specific circumstances. We choose an approach that combines Latent Dirichlet Allocation (LDA) with negative-binomial-based recursive partitioning. LDA is used to process the natural language information contained in each report summary. We estimate latent topics and assign each report to one of them. These topics, in addition to other variables in the data set, subsequently serve as explanatory variables for modeling the number of fatalities of the civilian population, ISAF Forces, Anti-Coalition Forces and the Afghan National Police or military, as well as the combined number of fatalities. Modeling is carried out with manifest mixtures of negative binomial distributions estimated with model-based recursive partitioning. For each group of fatalities, we identify segments with different mortality rates that correspond to a small number of topics and other explanatory variables as well as their interactions. Furthermore, we carve out the similarities between segments and connect them to stories that have been covered in the media. This provides an unprecedented description of the war in Afghanistan covered by the war logs. Additionally, our approach can serve as an example of how modern statistical methods may lead to extra insight when applied to problems of data journalism. (author's abstract) / Series: Research Report Series / Department of Statistics and Mathematics
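A minimal sketch of the preprocessing step described in this abstract, using the tm and topicmodels packages in R: each report summary is assigned to its most probable LDA topic, which can then serve as an explanatory variable. The data frame warlogs, its columns, and the number of topics are invented for illustration and are not the authors' actual settings.

```r
## Sketch: assign each report summary to its most probable LDA topic,
## then use the topic as an explanatory variable for fatality counts.
## `warlogs` with columns `summary` (text) and `fatalities` (count)
## and k = 10 topics are illustrative assumptions.
library(tm)
library(topicmodels)

corpus <- VCorpus(VectorSource(warlogs$summary))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(corpus)

lda_fit <- LDA(dtm, k = 10, control = list(seed = 1234))
warlogs$topic <- factor(topics(lda_fit))  # most probable topic per report
```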
|
14 |
Gaining Insight with Recursive Partitioning of Generalized Linear Models
Rusch, Thomas, Zeileis, Achim January 2013 (has links) (PDF)
Recursive partitioning algorithms separate a feature space into a set of disjoint rectangles. Then, usually, a constant is fitted in every partition. While this is a simple and intuitive approach, it may still lack interpretability as to how a specific relationship between dependent and independent variables may look. Or it may be that a certain model is assumed or of interest, and there are a number of candidate variables that may non-linearly give rise to different model parameter values. We present an approach that combines generalized linear models with recursive partitioning, offering enhanced interpretability of classical trees as well as an explorative way to assess a candidate variable's influence on a parametric model. This method conducts recursive partitioning of a generalized linear model by (1) fitting the model to the data set, (2) testing for parameter instability over a set of partitioning variables, and (3) splitting the data set with respect to the variable associated with the highest instability. The outcome is a tree where each terminal node is associated with a generalized linear model. We show the method's versatility and suitability for gaining additional insight into the relationship of dependent and independent variables with two examples, modelling voting behaviour and a failure model for debt amortization, and compare it to alternative approaches.
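The three-step procedure sketched in this abstract is available in R through the MOB framework of the partykit package. Below is a minimal sketch with glmtree(); the data set voters and all variable names are hypothetical placeholders, not the paper's actual data.

```r
## Sketch: a GLM tree via model-based recursive partitioning (MOB).
## Variables left of `|` enter the node-level logistic model; variables
## right of `|` are the candidate partitioning variables tested for
## parameter instability. Data and variable names are hypothetical.
library(partykit)

fit <- glmtree(turnout ~ past_votes | age + education + income,
               data = voters, family = binomial)

plot(fit)   # tree with a fitted GLM in each terminal node
coef(fit)   # node-wise coefficients of the logistic models
```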
|
15 |
Development and validation of clinical prediction models to diagnose acute respiratory infections in children and adults from Canadian Hutterite communities
Vuichard Gysin, Danielle January 2016 (has links)
Acute respiratory infections (ARI) caused by influenza and other respiratory viruses affect millions of people annually. Although usually self-limiting, a more complicated or severe course may occur in previously healthy people but is more likely in individuals with underlying illnesses. The most common viral agent is rhinovirus, whereas influenza is less frequent but is well known to cause winter epidemics. In primary care, rapid diagnosis of influenza virus infections is essential in order to provide treatment. Clinical presentations vary among the different pathogens but may overlap and may also depend on host factors. Predictive models have been developed for influenza, but study results may be biased because only individuals presenting with fever were included. Most of these models have not been adequately validated and their predictive power, therefore, is likely overestimated. The main objective of this thesis was to compare different mathematical models for the derivation of clinical prediction rules in individuals presenting with symptoms of ARI, to better distinguish between influenza, influenza A subtypes and entero-/rhinovirus-related illness in children and adults, and to evaluate model performance by using data-splitting for internal validation.
Data from a completed prospective cluster-randomized trial of the indirect effect of influenza vaccination in children of Hutterite communities served as the basis of my thesis. There were a total of 3288 first episodes per season of ARI in 2202 individuals and 321 (9.8%) influenza-positive events over three influenza seasons (2008-2011). The data set was divided into children under 18 years and adults. Both data sets were randomly split by subjects into a derivation (2/3 of the dataset) and a validation population (1/3 of the dataset). All predictive models were developed in the derivation sets. Demographic factors and the classical symptoms of ARI were evaluated with logistic regression and Cox proportional hazards models using forward stepwise selection, applying robust estimators to account for non-independent data, and by means of recursive partitioning. The beta coefficients of the independent predictors were used to develop different point scores. These scores were then tested in the validation groups, and performance between validation and derivation set was compared using receiver operating characteristic (ROC) curves. We determined sensitivities and specificities, positive and negative predictive values, and likelihood ratios at different cut-points which could reflect test and treatment thresholds. Fever, chills, and cough were the most important predictors in children, whereas chills and cough but not fever were most predictive of influenza virus infection in adults. Performance of the individual models was moderate, with areas under the receiver operating characteristic curves between 0.75 and 0.80 for the main outcome, influenza A or B virus infection. There was no statistically significant difference in performance between the derivation and validation sets for the main outcome. The results have shown that various mathematical models have similar discriminative ability to distinguish influenza from other respiratory viruses. The scores could assist clinicians in their decision-making. However, performance of the models was slightly overestimated due to potential clustering of data, and the results would first need to be validated in a different population before application in clinical practice. / Thesis / Master of Science (MSc) / Every year, millions of people are attacked by "the flu" or the common cold. Certain signs and symptoms apparently are more discriminative between the common cold and the flu. However, the decision between starting a simple symptom-orientated treatment, treating empirically for influenza, or ordering a rapid diagnostic test that has only moderate sensitivity and specificity can be challenging.
This thesis, therefore, aims to help physicians in their decision-making process by developing simple scores and decision trees for the diagnosis of influenza versus non-influenza respiratory infections.
Data from a completed trial of the indirect effect of influenza vaccination in children of Hutterite communities served as the basis of my thesis. There were a total of 3288 first seasonal episodes of ARI in 2202 individuals and 321 (9.8%) influenza-positive events over three influenza seasons (2008-2011). The data set was divided into children under 18 years and adults. Both data sets were split into a derivation and a validation set (= holdout group). Different mathematical models were applied to the derivation set, and demographic factors as well as the classical symptoms of ARI were evaluated. The scores generated from the most important factors that remained in the model were then tested in the validation group, and performance between validation and derivation set was compared. Accuracy was determined at different cut-points which could reflect test and treatment thresholds. Fever, chills, and cough were the most important predictors in children, whereas chills and cough but not fever were most predictive of influenza virus infection in adults. Performance of the individual models was moderate for the main outcome, influenza A or B virus infection. There was no statistically significant difference in performance between the derivation and validation sets for the main outcome. The results have shown that various mathematical models have similar discriminative ability to distinguish influenza from other respiratory viruses. The scores could assist clinicians in their decision-making. However, the results would first need to be validated in a different population before application in clinical practice.
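A hedged sketch of the score-construction step described above: logistic regression betas are rounded into integer points and the resulting score is evaluated with an ROC curve via the pROC package in R. The data frames derivation and validation and the variable names (flu, fever, chills, cough, coded 0/1) are invented placeholders.

```r
## Sketch: derive a simple point score from logistic regression betas
## and evaluate it with ROC analysis. Data and variable names are
## invented; symptoms are assumed to be coded as 0/1 indicators.
library(pROC)

fit <- glm(flu ~ fever + chills + cough, data = derivation,
           family = binomial)

## Round each beta relative to the smallest one to get integer points.
betas  <- coef(fit)[-1]                    # drop the intercept
points <- round(betas / min(abs(betas)))

## Score each subject in the validation set and assess discrimination.
X <- as.matrix(validation[, names(points)])
validation$score <- as.vector(X %*% points)
roc_val <- roc(validation$flu, validation$score)
auc(roc_val)   # AUCs of 0.75-0.80 are reported in the thesis
```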
|
16 |
Influencing Elections with Statistics: Targeting Voters with Logistic Regression Trees
Rusch, Thomas, Lee, Ilro, Hornik, Kurt, Jank, Wolfgang, Zeileis, Achim 03 1900 (has links) (PDF)
Political campaigning has become a multi-million dollar business. A substantial proportion of a campaign's budget is spent on voter mobilization, i.e., on identifying and influencing as many people as possible to vote. Based on data, campaigns use statistical tools to provide a basis for deciding who to target. While the data available is usually rich, campaigns have traditionally relied on a rather limited selection of information, often including only previous voting behavior and one or two demographic variables. Statistical procedures currently in use include logistic regression or standard classification tree methods like CHAID, but there is a growing interest in employing modern data mining approaches. Along the lines of this development, we propose a modern framework for voter targeting called LORET (for logistic regression trees) that employs trees (with possibly just a single root node) containing logistic regressions (with possibly just an intercept) in every leaf. Thus, LORET models contain logistic regression and classification trees as special cases and allow for a synthesis of both techniques under one umbrella. We explore various flavors of LORET models that (a) compare the effect of using the full set of available variables against using only limited information and (b) investigate their varying effects either as regressors in the logistic model components or as partitioning variables in the tree components. To assess model performance and illustrate targeting, we apply LORET to a data set of 19,634 eligible voters from the 2004 US presidential election. We find that augmenting the standard set of variables (such as age and voting history) with additional predictor variables (such as the household composition in terms of party affiliation and each individual's rank in the household) clearly improves predictive accuracy. We also find that LORET models based on tree induction outperform the unpartitioned competitors. Additionally, LORET models using both partitioning variables and regressors in the resulting nodes can improve the efficiency of allocating campaign resources while still providing intelligible models. / Series: Research Report Series / Department of Statistics and Mathematics
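Because a LORET model with a single root node reduces to an ordinary logistic regression, the comparison described in this abstract can be sketched in R with partykit's glmtree(). All data and variable names (voters, voted, age, history, hh_party, hh_rank) are hypothetical stand-ins for the paper's variables.

```r
## Sketch: an unpartitioned LORET model (plain logistic regression)
## versus a partitioned one (logistic regression tree). Variable
## names are hypothetical stand-ins.
library(partykit)

## Root-node-only LORET: ordinary logistic regression.
base <- glm(voted ~ age + history, data = voters, family = binomial)

## Partitioned LORET: intercept-only logistic models in the leaves,
## with demographics and household variables as splitting candidates.
tree <- glmtree(voted ~ 1 | age + history + hh_party + hh_rank,
                data = voters, family = binomial)

## Rank voters by predicted turnout probability for targeted mobilization.
p <- predict(tree, newdata = voters, type = "response")
head(voters[order(p), ])   # least likely voters, candidates for contact
```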
|
17 |
Model trees with topic model preprocessing: an approach for data journalism illustrated with the WikiLeaks Afghanistan war logs
Rusch, Thomas, Hofmarcher, Paul, Hatzinger, Reinhold, Hornik, Kurt 06 1900 (has links) (PDF)
The WikiLeaks Afghanistan war logs contain nearly 77,000 reports of incidents in the US-led Afghanistan war, covering the period from January 2004 to December 2009. The recent growth of data on complex social systems and the potential to derive stories from them has shifted the focus of journalistic and scientific attention increasingly toward data-driven journalism and computational social science. In this paper we advocate the usage of modern statistical methods for problems of data journalism and beyond, which may help journalistic and scientific work and lead to additional insight. Using the WikiLeaks Afghanistan war logs for illustration, we present an approach that builds intelligible statistical models for interpretable segments in the data, in this case to explore the fatality rates associated with different circumstances in the Afghanistan war. Our approach combines preprocessing by Latent Dirichlet Allocation (LDA) with model trees. LDA is used to process the natural language information contained in each report summary by estimating latent topics and assigning each report to one of them. Together with other variables these topic assignments serve as splitting variables for finding segments in the data to which local statistical models for the reported number of fatalities are fitted. Segmentation and fitting is carried out with recursive partitioning of negative binomial distributions. We identify segments with different fatality rates that correspond to a small number of topics and other variables as well as their interactions. Furthermore, we carve out the similarities between segments and connect them to stories that have been covered in the media. This gives an unprecedented description of the war in Afghanistan and serves as an example of how data journalism, computational social science and other areas with interest in database data can benefit from modern statistical techniques. (authors' abstract)
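A hedged sketch of the segmentation-and-fitting step in R with partykit. Since glmtree() has no built-in negative binomial family with estimated dispersion, a fixed-dispersion family from MASS is used as a stand-in; theta = 1 and all data and variable names are assumptions of this sketch, not the authors' estimation method.

```r
## Sketch: model tree for overdispersed counts. LDA topic assignments
## and report metadata act as partitioning variables; each segment gets
## its own intercept-only count model. theta = 1 is a fixed-dispersion
## placeholder, not an estimated value; data/variables are hypothetical.
library(partykit)
library(MASS)

nbtree <- glmtree(fatalities ~ 1 | topic + region + year,
                  data = warlogs,
                  family = negative.binomial(theta = 1))

plot(nbtree)        # segments with different fatality rates
exp(coef(nbtree))   # per-segment expected fatality counts (log link)
```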
|
18 |
Risk Factors for Suicidal Behaviour Among Canadian Civilians and Military Personnel: A Recursive Partitioning Approach
Rusu, Corneliu 05 April 2018 (has links)
Background: Suicidal behaviour is a major public health problem that has not abated over the past decade. Adopting machine learning algorithms that allow for combining risk factors, and may thereby increase the predictive accuracy of models of suicidal behaviour, is one promising avenue toward effective prevention and treatment.
Methods: We used the Canadian Community Health Survey – Mental Health and the Canadian Forces Mental Health Survey to build conditional inference random forest models of suicidal behaviour in the Canadian general population and the Canadian Armed Forces. We generated risk algorithms for suicidal behaviour in each sample. We performed within- and between-sample validation and reported the corresponding performance metrics.
Results: Only a handful of variables were important in predicting suicidal behaviour in the Canadian general population and the Canadian Armed Forces. Each model's performance on within-sample validation was satisfactory, with moderate to high sensitivity and high specificity, while the performance on between-sample validation was conditional on the size and heterogeneity of the training sample.
Conclusion: Using conditional inference random forest methodology on large nationally representative mental health surveys has the potential to generate models of suicidal behaviour that not only reflect its complex nature, but also indicate that the true positive cases are likely to be captured by this approach.
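A minimal sketch of the conditional inference random forest step using partykit in R; the data frame cchs_mh, the factor outcome suicidal, and the tuning values are invented placeholders for the survey data.

```r
## Sketch: conditional inference random forest for suicidal behaviour.
## Outcome (a factor) and predictors are invented placeholders for
## survey items; ntree and mtry are illustrative tuning values.
library(partykit)

cf <- cforest(suicidal ~ ., data = cchs_mh,
              ntree = 500, mtry = 5)

## Permutation variable importance: which few variables matter most.
vi <- varimp(cf)
sort(vi, decreasing = TRUE)[1:10]

## Out-of-bag predicted probabilities for within-sample validation.
p <- predict(cf, type = "prob", OOB = TRUE)
```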
|
19 |
Score-Based Approaches to Heterogeneity in Psychological Models
Arnold, Manuel 30 May 2022 (links)
Statistical models of human cognition and behavior often rely on aggregated data and may fail to consider heterogeneity, that is, differences across individuals or groups. If overlooked, heterogeneity can bias parameter estimates and may lead to false-positive or false-negative findings. Often, heterogeneity can be detected and predicted with the help of covariates. However, identifying predictors of heterogeneity can be a challenging task. To solve this issue, I propose two novel approaches for detecting and predicting individual and group differences with covariates.
This cumulative dissertation is composed of three projects. Project 1 advances the individual parameter contribution (IPC) regression framework, which allows studying heterogeneity in structural equation model (SEM) parameters by means of covariates. I evaluate the use of IPC regression for dynamic panel models, propose an alternative estimation technique, and derive IPCs for general maximum likelihood estimators. Project 2 illustrates how IPC regression can be used in practice. To this end, I provide a step-by-step introduction to the IPC regression implementation in the ipcr package for the R system for statistical computing. Finally, Project 3 progresses the SEM tree framework. SEM trees are a model-based recursive partitioning method for finding covariates that predict group differences in SEM parameters. Unfortunately, the original SEM tree implementation is computationally demanding. As a solution to this problem, I combine SEM trees with a family of score-based tests. The resulting score-guided SEM trees compute quickly, solving the runtime issues of the original SEM trees, and show favorable statistical properties.
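A hedged sketch of a score-guided SEM tree in R with the lavaan and semtree packages. The one-factor model, the data set psych_data, the covariates, and the control option method = "score" for score-based splitting are assumptions of this sketch rather than verified settings from the dissertation.

```r
## Sketch: score-guided SEM tree. A template SEM is fitted, then
## covariates are searched for parameter instability with score-based
## tests instead of refitting candidate splits by likelihood ratio.
## Model, data, and control settings are illustrative assumptions.
library(lavaan)
library(semtree)

model <- "f =~ y1 + y2 + y3 + y4"          # one-factor measurement model
fit   <- cfa(model, data = psych_data)

ctrl <- semtree.control(method = "score")  # score-based (fast) splitting
tree <- semtree(model = fit, data = psych_data,
                predictors = c("age", "gender", "group"),
                control = ctrl)

plot(tree)   # subgroups with different SEM parameter estimates
```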
|
20 |
Developing a methodology to account for commercial motor vehicles using microscopic traffic simulation models
Schultz, Grant George 30 September 2004
The collection and interpretation of data is a critical component of traffic and transportation engineering used to establish baseline performance measures and to forecast future conditions. One important source of traffic data is commercial motor vehicle (CMV) weight and classification data used as input to critical tasks in transportation design, operations, and planning. The evolution of Intelligent Transportation System (ITS) technologies has been providing transportation engineers and planners with an increased availability of CMV data. The primary sources of these data are automatic vehicle classification (AVC) and weigh-in-motion (WIM). Microscopic traffic simulation models have been used extensively to model the dynamic and stochastic nature of transportation systems including vehicle composition. One aspect of effective microscopic traffic simulation models that has received increased attention in recent years is the calibration of these models, which has traditionally been concerned with identifying the "best" parameter set from a range of acceptable values. Recent research has begun the process of automating the calibration process in an effort to accurately reflect the components of the transportation system being analyzed. The objective of this research is to develop a methodology in which the effects of CMVs can be included in the calibration of microscopic traffic simulation models. The research examines the ITS data available on weight and operating characteristics of CMVs and incorporates this data in the calibration of microscopic traffic simulation models. The research develops a methodology to model CMVs using microscopic traffic simulation models and then utilizes the output of these models to generate the data necessary to quantify the impacts of CMVs on infrastructure, travel time, and emissions. The research uses advanced statistical tools including principal component analysis (PCA) and recursive partitioning to identify relationships between data collection sites (i.e., WIM, AVC) such that the data collected at WIM sites can be utilized to estimate weight and length distributions at AVC sites. The research also examines methodologies to include the distribution or measures of central tendency and dispersion (i.e., mean, variance) into the calibration process. The approach is applied using the CORSIM model and calibrated utilizing an automated genetic algorithm methodology.
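A hedged sketch of the site-matching idea in R: principal component analysis summarizes site characteristics, and a regression tree fitted at WIM sites (where weights are observed) estimates weight measures at AVC sites. All data frames and column names are invented placeholders.

```r
## Sketch: PCA + recursive partitioning to transfer weight information
## from WIM sites to AVC sites. Data frames and columns are invented;
## site characteristics are assumed numeric.
pca <- prcomp(sites[, c("aadt", "pct_trucks", "lanes", "speed_limit")],
              scale. = TRUE)
sites <- cbind(sites, pca$x[, 1:2])        # keep first two components

library(rpart)
## Fit on WIM sites, where mean axle weight is observed...
tree <- rpart(mean_weight ~ PC1 + PC2,
              data = subset(sites, type == "WIM"))

## ...then estimate weights at AVC sites from the same components.
avc <- subset(sites, type == "AVC")
avc$est_weight <- predict(tree, newdata = avc)
```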
|