Linear regression techniques for identifying influential data and applications in commercial data analysis
Jacobs, Michael Kalman
27 September 2023
Recent literature contains many publications on techniques for identifying extreme data points (outliers) and influential observations or groups in sample data sets. This thesis begins by reviewing the statistics and distributional properties of the standard techniques, viz. the standardized residual as a test for outliers, and Cook's distance as a measure of influence. An outlier test which is distributionally neater than the standardized residual is proposed. In practical applications, ordinary least squares regression is often inappropriate, and the use of biased estimators may be preferable. In this thesis, the existing theory is extended to several alternative regression techniques. Ridge regression and generalized inverse regression are suitable techniques when the cross-product matrix is ill-conditioned. Restricted least squares regression, with exact or stochastic prior information, is used in many econometric applications. Models with selected variables are used to eliminate design faults or to reduce computational effort. New statistics are developed for all these techniques, the distributional results are proved, and computational formulae are developed. Computational problems may arise in the actual use of the various techniques, and these are investigated. Computer programs written in BASIC and suitable for microcomputer use are presented, making the techniques accessible to virtually any commercial environment. The performance of the various techniques is examined, using a controlled simulation study and a number of practical data sets drawn from several areas of South African commerce. This is, as far as can be ascertained, the first extensive practical South African study on the effects of influential data. It is shown that the presence of outliers or influential data can bias the results of any study significantly. It is recommended that no data analysis should be attempted without a preliminary scan for outliers and influential observations.
The techniques presented can be used advantageously even in data sets where the ultimate analysis does not involve linear regression. It is shown that influential data are not merely of nuisance value in the analysis but may contain valuable information in their own right.
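
As an illustration of the two standard diagnostics named in the abstract, the following is a minimal sketch in Python, not the thesis's BASIC programs. It assumes a simple linear regression with one predictor fitted by ordinary least squares, and uses the usual textbook definitions of the internally standardized residual and Cook's distance. The function name `influence_diagnostics` and the sample data are hypothetical and chosen only for this example.

```python
import math


def influence_diagnostics(x, y):
    """Fit y = a + b*x by OLS and return (standardized residuals, Cook's distances).

    Assumption: one predictor plus an intercept, so the number of
    fitted parameters is p = 2.
    """
    n = len(x)
    p = 2  # intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

    # Residual mean square (estimate of the error variance).
    s2 = sum(e ** 2 for e in residuals) / (n - p)

    # Leverage h_i: diagonal of the hat matrix, in closed form for one predictor.
    leverage = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]

    # Internally standardized residual: r_i = e_i / sqrt(s^2 * (1 - h_i)).
    standardized = [
        e / math.sqrt(s2 * (1.0 - h)) for e, h in zip(residuals, leverage)
    ]

    # Cook's distance: D_i = (r_i^2 / p) * h_i / (1 - h_i).
    cooks = [
        (r ** 2 / p) * h / (1.0 - h) for r, h in zip(standardized, leverage)
    ]
    return standardized, cooks


# Hypothetical data: roughly y = 2x + 1, with the last response contaminated.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8, 19.1, 35.0]

standardized, cooks = influence_diagnostics(x, y)
most_influential = cooks.index(max(cooks))
print("largest Cook's distance at observation index", most_influential)
```

In this made-up data set the contaminated observation also has high leverage, so a preliminary scan of this kind flags it on both measures before any further analysis is attempted.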