Locally weighted regression (LWR) is a powerful tool that allows the estimation of different sets of coefficients for
each location in the underlying data, challenging the assumption of stationary regression coefficients across
a study region. The accuracy of LWR largely depends on how a researcher establishes the relationship across
locations, which is often constructed using a weight matrix or function. This paper explores the different
kernel functions used to assign weights to observations, including Gaussian, bi-square, and tri-cubic, and
how the choice of weight variables and window size affects the accuracy of the estimates. We guide this
choice through cross-validation and show that the bi-square function outperforms the other kernel
functions considered. Our findings demonstrate that the optimal window size for LWR models depends on
the cross-validation (CV) approach employed. In our empirical application, full-sample CV favors a
larger window size, while CV by proxy favors a smaller one. Since the CV
by proxy approach focuses on the predictive ability of the model in the vicinity of one specific point (usually
a policy point/site), we note that guiding model choice through this approach makes more intuitive sense
when the aim of the researcher is to predict the outcome at one specific site (policy or target point). To
identify the optimal weight variables, while we suggest exploring various combinations of weight variables,
we argue that an efficient alternative is to merge all continuous variables in the dataset into a single weight
variable.
M.A.
Locally weighted regression (LWR) is a statistical technique that establishes a relationship between dependent
and explanatory variables, focusing primarily on data points in proximity to a specific point of
interest/target point. This technique assigns varying degrees of importance to the observations that are in
proximity to the target point, thereby allowing for the modeling of relationships that may exhibit spatial
variability within the dataset.
The accuracy of LWR largely depends on how researchers define relationships across different locations/studies,
which is often done using a “weight setting”. We define weight setting as a combination of weight
functions (determines how the observations around a point of interest are weighted before they enter the
model), weight variables (determines proximity between the point of interest and all other observations), and
window sizes (determines the number of observations that can be allowed in the local regression). To find
which weight setting is an optimal one or which combination of weight functions, weight variables, and window
sizes generates the lowest predictive error, researchers often employ a cross-validation (CV) approach.
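As a rough illustration of how the three parts of a weight setting fit together, the Python sketch below computes kernel weights from a vector of distances and a window size, then fits one local regression at a target point. The function names, the nearest-neighbor bandwidth, the truncation of the Gaussian kernel to the window, and the exact kernel scalings are assumptions made for illustration and are not taken from the thesis.

```python
import numpy as np

def kernel_weights(dist, window_size, kernel="bisquare"):
    """Weights for the `window_size` nearest observations, zero elsewhere.

    `dist` holds distances to the target point, built from whatever
    weight variables are chosen (an assumption of this sketch).
    """
    dist = np.asarray(dist, dtype=float)
    if not 1 <= window_size <= dist.size:
        raise ValueError("window_size must be between 1 and len(dist)")
    # Bandwidth: distance to the window_size-th nearest observation.
    h = np.sort(dist)[window_size - 1]
    inside = dist <= h  # ties at the boundary are all kept
    u = np.zeros_like(dist)
    if h > 0:
        u[inside] = dist[inside] / h  # scaled distance in [0, 1]
    if kernel == "gaussian":
        w = np.exp(-0.5 * u**2)
    elif kernel == "bisquare":
        w = (1 - u**2) ** 2
    elif kernel == "tricube":
        w = (1 - u**3) ** 3
    else:
        raise ValueError(f"unknown kernel: {kernel}")
    return np.where(inside, w, 0.0)

def lwr_predict(X, y, x0, dist, window_size, kernel="bisquare"):
    """Weighted least-squares fit around the target, evaluated at x0."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = kernel_weights(dist, window_size, kernel)
    keep = w > 0  # observations that enter the local regression
    A = np.column_stack([np.ones(keep.sum()), X[keep]])
    sw = np.sqrt(w[keep])
    # Scaling rows by sqrt(w) turns ordinary into weighted least squares.
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y[keep] * sw, rcond=None)
    return float(np.concatenate(([1.0], np.atleast_1d(x0))) @ beta)
```

Note that the bi-square and tri-cubic kernels give zero weight to the observation exactly at the bandwidth, so slightly fewer than `window_size` observations may carry positive weight.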
Cross-validation is a statistical method used to assess and validate the performance of a predictive model. It
entails removing a host observation (a point of interest), predicting that point from the remaining
observations, and evaluating the accuracy of the prediction by comparing it with the observed value.
In our study, we employ two CV approaches. The first is full-sample CV, in which we remove
a host observation and predict it using the full set of observations used in the given local regression. The
second is CV by proxy, which checks the accuracy of the prediction by the same mechanism as full-sample CV
but focuses only on nearby points that share similar characteristics
with the target point.
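The two CV approaches described above could be sketched in Python as follows. This is a minimal illustration under stated assumptions: distances are Euclidean in the regressors, the kernel is bi-square with a nearest-neighbor bandwidth, and "CV by proxy" is approximated by holding out only the `k` observations closest to the target point. The helper names (`bisquare_lwr_predict`, `loo_cv_rmse`, `proxy_targets`) are hypothetical, and the thesis may define the proxy set differently.

```python
import numpy as np

def bisquare_lwr_predict(X, y, x0, window_size):
    """Predict at x0 from a bi-square weighted linear fit on nearby points."""
    dist = np.linalg.norm(X - x0, axis=1)  # assumption: Euclidean distance
    h = np.sort(dist)[window_size - 1]     # nearest-neighbor bandwidth
    if h == 0:
        raise ValueError("window contains only points identical to x0")
    w = np.where(dist < h, (1 - (dist / h) ** 2) ** 2, 0.0)
    keep = w > 0
    A = np.column_stack([np.ones(keep.sum()), X[keep]])
    sw = np.sqrt(w[keep])
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y[keep] * sw, rcond=None)
    return float(np.concatenate(([1.0], x0)) @ beta)

def loo_cv_rmse(X, y, window_size, targets=None):
    """Leave-one-out RMSE over all points, or only over `targets`."""
    n = len(y)
    held_out = range(n) if targets is None else targets
    errors = []
    for i in held_out:
        mask = np.arange(n) != i  # drop the host observation
        pred = bisquare_lwr_predict(X[mask], y[mask], X[i], window_size)
        errors.append(y[i] - pred)
    return float(np.sqrt(np.mean(np.square(errors))))

def proxy_targets(X, x_target, k):
    """Indices of the k observations closest to the target point
    (an assumed stand-in for the proxy set)."""
    return np.argsort(np.linalg.norm(X - x_target, axis=1))[:k]
```

Under these assumptions, a window size could be chosen with `min(candidates, key=lambda m: loo_cv_rmse(X, y, m))` for full-sample CV, or by passing `targets=proxy_targets(X, x_target, k)` for the proxy variant. The two criteria need not select the same window size.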
We find that the bi-square function consistently outperforms the Gaussian and tri-cubic weight
functions, regardless of the CV approach. However, the choice of an optimal window size in LWR models
depends on the CV approach that we employ. While the full-sample CV method guides us toward the
selection of a larger window size, the CV by proxy directs us toward a smaller window size. In the context of
identifying the optimal weight variables, we recommend exploring various combinations of weight variables.
However, we also propose an efficient alternative, which involves merging all continuous variables within the
dataset into a single weight variable instead of striving to identify the best of thousands of different weight
variable settings.
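The abstract does not state how the continuous variables are merged. One possible reading, shown here purely as an assumption, is a single distance computed over all continuous variables after standardizing each one so that no variable dominates because of its units. The function name `combined_distance` is hypothetical.

```python
import numpy as np

def combined_distance(Z, z0):
    """Distance from each row of Z to the target z0 across all
    continuous variables, each scaled by its sample standard deviation.

    Assumption: standardized Euclidean distance; the thesis may use a
    different way of combining variables.
    """
    Z = np.asarray(Z, dtype=float)
    z0 = np.asarray(z0, dtype=float)
    std = Z.std(axis=0)
    std = np.where(std > 0, std, 1.0)  # leave constant columns unscaled
    return np.linalg.norm((Z - z0) / std, axis=1)
```

The resulting vector can serve as the single weight variable from which kernel weights are computed, replacing a search over many subsets of weight variables.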
Identifier | oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/116292 |
Date | January 2023 |
Creators | Puri, Roshan |
Contributors | Department of Statistics, Van Mullekom, Jennifer H., Moeltner, Klaus, Driscoll, Anne |
Publisher | Virginia Tech |
Source Sets | Virginia Tech Theses and Dissertation |
Language | en_US |
Detected Language | English |
Type | Thesis |
Format | ETD, application/pdf, application/pdf |
Rights | CC0 1.0 Universal, http://creativecommons.org/publicdomain/zero/1.0/ |