1 |
Minimizing Recommended Error Costs Under Noisy Inputs in Rule-Based Expert SystemsThola, Forest D. 01 January 2012 (has links)
This dissertation develops methods to minimize recommendation error costs when inputs to a rule-based expert system are prone to errors. The problem often arises in web-based applications where data are inherently noisy or provided by users who perceive some benefit from falsifying inputs. Prior studies proposed methods that attempted to minimize the probability of recommendation error, but did not take into account the relative costs of different types of errors. In situations where these differences are significant, an approach that minimizes the expected misclassification error costs has advantages over extant methods that ignore these costs.
Building on the existing literature, two new techniques - Cost-Based Input Modification (CBIM) and Cost-Based Knowledge-Base Modification (CBKM) were developed and evaluated. Each method takes as inputs (1) the joint probability distribution of a set of rules, (2) the distortion matrix for input noise as characterized by the probability distribution of the observed input vectors conditioned on their true values, and (3) the misclassification cost for each type of recommendation error. Under CBIM, for any observed input vector v, the recommendation is based on a modified input vector v' such that the expected error costs are minimized. Under CBKM the rule base itself is modified to minimize the expected cost of error.
The proposed methods were investigated as follows: as a control, in the special case where the costs associated with different types of errors are identical, the recommendations under these methods were compared for consistency with those obtained under extant methods. Next, the relative advantages of CBIM and CBKM were compared as (1) the noise level changed, and (2) the structure of the cost matrix varied.
As expected, CBKM and CBIM outperformed the extant Knowledge Base Modification (KM) and Input Modification (IM) methods over a wide range of input distortion and cost matrices, with some restrictions. Under the control, with constant misclassification costs, the new methods performed equally with the extant methods. As misclassification costs increased, CBKM outperformed KM and CBIM outperformed IM. Using different cost matrices to increase misclassification cost asymmetry and order, CBKM and CBIM performance increased. At very low distortion levels, CBKM and CBIM underperformed as error probability became more significant in each method's estimation. Additionally, CBKM outperformed CBIM over a wide range of input distortion as its technique of modifying an original knowledge base outperformed the technique of modifying inputs to an unmodified decision tree.
|
2 |
Prediction Performance of Survival ModelsYuan, Yan January 2008 (has links)
Statistical models are often used for the prediction of
future random variables. There are two types of prediction, point
prediction and probabilistic prediction. The prediction accuracy is
quantified by performance measures, which are typically based on
loss functions. We study the estimators of these performance
measures, the prediction error and performance scores, for point and
probabilistic predictors, respectively. The focus of this thesis is
to assess the prediction performance of survival models that analyze
censored survival times. To accommodate censoring, we extend the
inverse probability censoring weighting (IPCW) method, thus
arbitrary loss functions can be handled. We also develop confidence
interval procedures for these performance measures.
We compare model-based, apparent loss based and cross-validation
estimators of prediction error under model misspecification and
variable selection, for absolute relative error loss (in chapter 3)
and misclassification error loss (in chapter 4). Simulation results
indicate that cross-validation procedures typically produce reliable
point estimates and confidence intervals, whereas model-based
estimates are often sensitive to model misspecification. The methods
are illustrated for two medical contexts in chapter 5. The apparent
loss based and cross-validation estimators of performance scores for
probabilistic predictor are discussed and illustrated with an
example in chapter 6. We also make connections for performance.
|
3 |
Prediction Performance of Survival ModelsYuan, Yan January 2008 (has links)
Statistical models are often used for the prediction of
future random variables. There are two types of prediction, point
prediction and probabilistic prediction. The prediction accuracy is
quantified by performance measures, which are typically based on
loss functions. We study the estimators of these performance
measures, the prediction error and performance scores, for point and
probabilistic predictors, respectively. The focus of this thesis is
to assess the prediction performance of survival models that analyze
censored survival times. To accommodate censoring, we extend the
inverse probability censoring weighting (IPCW) method, thus
arbitrary loss functions can be handled. We also develop confidence
interval procedures for these performance measures.
We compare model-based, apparent loss based and cross-validation
estimators of prediction error under model misspecification and
variable selection, for absolute relative error loss (in chapter 3)
and misclassification error loss (in chapter 4). Simulation results
indicate that cross-validation procedures typically produce reliable
point estimates and confidence intervals, whereas model-based
estimates are often sensitive to model misspecification. The methods
are illustrated for two medical contexts in chapter 5. The apparent
loss based and cross-validation estimators of performance scores for
probabilistic predictor are discussed and illustrated with an
example in chapter 6. We also make connections for performance.
|
4 |
Some Advances in Classifying and Modeling Complex DataZhang, Angang 16 December 2015 (has links)
In statistical methodology of analyzing data, two of the most commonly used techniques are classification and regression modeling. As scientific technology progresses rapidly, complex data often occurs and requires novel classification and regression modeling methodologies according to the data structure. In this dissertation, I mainly focus on developing a few approaches for analyzing the data with complex structures.
Classification problems commonly occur in many areas such as biomedical, marketing, sociology and image recognition. Among various classification methods, linear classifiers have been widely used because of computational advantages, ease of implementation and interpretation compared with non-linear classifiers. Specifically, linear discriminant analysis (LDA) is one of the most important methods in the family of linear classifiers.
For high dimensional data with number of variables p larger than the number of observations n occurs more frequently, it calls for advanced classification techniques.
In Chapter 2, I proposed a novel sparse LDA method which generalizes LDA through a regularized approach for the two-class classification problem.
The proposed method can obtain an accurate classification accuracy with attractive computation, which is suitable for high dimensional data with p>n.
In Chapter 3, I deal with the classification when the data complexity lies in the non-random missing responses in the training data set. Appropriate classification method needs to be developed accordingly. Specifically, I considered the "reject inference problem'' for the application of fraud detection for online business. For online business, to prevent fraud transactions, suspicious transactions are rejected with unknown fraud status, yielding a training data with selective missing response. A two-stage modeling approach using logistic regression is proposed to enhance the efficiency and accuracy of fraud detection.
Besides the classification problem, data from designed experiments in scientific areas often have complex structures. Many experiments are conducted with multiple variance sources. To increase the accuracy of the statistical modeling, the model need to be able to accommodate more than one error terms. In Chapter 4, I propose a variance component mixed model for a nano material experiment data to address the between group, within group and within subject variance components into a single model. To adjust possible systematic error introduced during the experiment, adjustment terms can be added. Specifically a group adaptive forward and backward selection (GFoBa) procedure is designed to select the significant adjustment terms. / Ph. D.
|
Page generated in 0.1044 seconds