431. On Prediction and Filtering of Stock Index Returns. Hallgren, Fredrik. January 2011 (has links)
No description available.
432. Markov chain Monte Carlo for rare-event simulation in heavy-tailed settings. Gudmundsson, Thorbjörn. January 2013 (has links)
No description available.
433. BAGGED PREDICTION ACCURACY IN LINEAR REGRESSION. Kimby, Daniel. January 2022 (has links)
Bootstrap aggregation, or bagging, is a prominent statistical method proposed to improve predictive performance, and it is useful both to confirm the efficacy of such improvements and to expand upon them. This thesis investigates whether the results of Leo Breiman's (1996) paper "Bagging predictors", where bagging is shown to lower prediction error, can be replicated. Additionally, the predictive performance of weighted bagging is investigated, where the weights are a function of the residual variance. The data are simulated and consist of a numerical outcome variable and 30 independent variables. Linear regression is run with forward stepwise selection, selecting models with the lowest SSE, and predictions are saved for all 30 models. Separately, forward stepwise selection is run based on the significance of the added coefficient's p-value, saving only one final model. Prediction error is measured as mean squared error. The results suggest that, when models are selected by lowest SSE, both bagged methods reduce prediction error, with unweighted bagging performing best; this is congruent with Breiman's (1996) results, with minor differences. Under p-value selection, weighted bagging performs best. Further research should be conducted on real data to verify these results, in particular with regard to weighted bagging.
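A minimal sketch of the bagging loop described in this abstract is given below. The simulated data, the crude top-k correlation screening used as a stand-in for forward stepwise selection, and B = 50 bootstrap replicates are illustrative assumptions, not the thesis's actual setup.

```python
# Bagging for linear-regression predictions: refit on bootstrap resamples
# and average the predictions, then compare MSE against a single fit.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n, p = 200, 30
X = rng.normal(size=(n, p))
beta = np.concatenate([rng.normal(size=5), np.zeros(p - 5)])  # 5 active predictors
y = X @ beta + rng.normal(scale=2.0, size=n)
X_test = rng.normal(size=(1000, p))
y_test = X_test @ beta + rng.normal(scale=2.0, size=1000)

def fit_topk(Xtr, ytr, k=5):
    # crude stand-in for forward stepwise selection: keep the k predictors
    # most correlated with the response and fit OLS on that subset
    corr = np.array([abs(np.corrcoef(Xtr[:, j], ytr)[0, 1]) for j in range(Xtr.shape[1])])
    cols = np.argsort(corr)[-k:]
    return LinearRegression().fit(Xtr[:, cols], ytr), cols

# single (unbagged) model as a baseline
model, cols = fit_topk(X, y)
mse_single = mean_squared_error(y_test, model.predict(X_test[:, cols]))

# bagging: refit on bootstrap resamples and average the predictions
B = 50
preds = np.zeros((B, len(y_test)))
for b in range(B):
    idx = rng.integers(0, n, size=n)              # bootstrap sample with replacement
    model_b, cols_b = fit_topk(X[idx], y[idx])
    preds[b] = model_b.predict(X_test[:, cols_b])
mse_bagged = mean_squared_error(y_test, preds.mean(axis=0))

print(f"single-model MSE: {mse_single:.3f}  bagged MSE: {mse_bagged:.3f}")
```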
434. Evaluating Risk Factors with Regression : A Review and Simulation Study of Current Practices. Reinhammar, Ragna. January 2022 (has links)
The term "risk factor" is used synonymously with both predictor and causal factor, and causal aims of explanatory analyses are rarely stated explicitly. Consequently, the concepts of explaining and predicting are conflated in risk factor research. This thesis reviews the current practice of evaluating risk factors with regression in three medical journals and identifies three common covariate selection strategies: adjusting for a pre-specified set, univariable pre-filtering, and stepwise selection. The implication of "risk factor" varies in the reviewed articles and many authors make implicit causal definitions of the term. In the articles, logistic regression is the most frequently used model, and effect estimates are often reported as conditional odds ratios. The thesis compares current practices to estimating a marginal odds ratio in a simulation study mimicking data from Louapre et al. (2020). The marginal odds ratio is estimated with a regression imputation estimator and an Augmented Inverse Probability Weighting estimator. The simulation study illustrates the difference between conditional and marginal odds ratios and examines the performance of estimators under correctly specified and misspecified models. From the simulation, it is concluded that the estimators of the marginal odds ratio are consistent and robust against certain model misspecifications.
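The contrast between conditional and marginal odds ratios mentioned in this abstract can be illustrated with a regression imputation (standardisation) estimator. The sketch below uses simulated data as an illustrative assumption; it does not reproduce the Louapre et al. (2020) setting or the AIPW estimator.

```python
# Regression imputation for a marginal odds ratio: fit an outcome model,
# predict everyone's risk under exposure and non-exposure, average, and
# form the odds ratio of the two marginal risks.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 20_000
L = rng.normal(size=n)                               # covariate / confounder
A = rng.binomial(1, 1 / (1 + np.exp(-0.5 * L)))      # exposure depends on L
logit = -1.0 + 0.8 * A + 1.2 * L                     # true outcome model
Y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([A, L])
fit = LogisticRegression(C=1e6).fit(X, Y)            # large C: essentially unpenalised
cond_or = np.exp(fit.coef_[0][0])                    # conditional OR for A given L

# predict risks with A set to 1 and to 0 for every subject, then average
p1 = fit.predict_proba(np.column_stack([np.ones(n), L]))[:, 1].mean()
p0 = fit.predict_proba(np.column_stack([np.zeros(n), L]))[:, 1].mean()
marg_or = (p1 / (1 - p1)) / (p0 / (1 - p0))

print(f"conditional OR: {cond_or:.2f}  marginal OR: {marg_or:.2f}")
```

Because the logistic link is non-collapsible, the marginal odds ratio is typically closer to 1 than the conditional one even without confounding, which is the distinction the simulation study examines.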
435. Measures of statistical dependence for feature selection : Computational study / Mått på statistiskt beroende för funktionsval : Beräkningsstudie. Alshalabi, Mohamad. January 2022 (has links)
The importance of feature selection for statistical and machine learning models derives from their explainability and from the ability to explore new relationships, leading to new discoveries. Straightforward feature selection methods measure the dependencies between the potential features and the response variable. This thesis studies the selection of features according to a maximal statistical dependency criterion based on generalized Pearson's correlation coefficients, e.g., Wijayatunga's coefficient, and presents a framework for feature selection based on these coefficients for high-dimensional feature variables. The results are compared to those obtained by applying an elastic net regression (for high-dimensional data). The generalized Pearson's correlation coefficient is a metric-based measure where the metric is the Hellinger distance, regarded as a distance between probability distributions. Wijayatunga's coefficient was originally proposed for the discrete case; here, it is generalized to continuous variables by discretization and kernelization, and it is of particular interest to see how the measure behaves as the discretization bins are made finer. The study employs both synthetic and real-world data to illustrate the validity and power of this feature selection process. Moreover, a new method of normalization for mutual information is included. The results show that both measures have considerable potential in detecting associations. The feature selection experiment shows that elastic net regression is superior to the proposed method; nevertheless, further investigation could be done on this subject.
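The general construction behind such measures can be sketched as follows: discretise the two variables, estimate the joint distribution, and take the Hellinger distance between the joint and the product of its marginals. This is an illustrative assumption about the idea, not the exact definition of Wijayatunga's coefficient or its kernelized version.

```python
# Hellinger-distance based dependence between two continuous variables
# after discretisation: 0 for (empirical) independence, larger values
# indicate stronger dependence.
import numpy as np

def hellinger_dependence(x, y, bins=10):
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    joint = joint / joint.sum()                            # empirical joint pmf
    prod = np.outer(joint.sum(axis=1), joint.sum(axis=0))  # product of marginals
    return np.sqrt(0.5 * np.sum((np.sqrt(joint) - np.sqrt(prod)) ** 2))

rng = np.random.default_rng(2)
x = rng.normal(size=2000)
print(hellinger_dependence(x, rng.normal(size=2000)))                          # independent: near 0
print(hellinger_dependence(x, np.sin(4 * x) + 0.1 * rng.normal(size=2000)))    # wiggly dependence: larger
```

For feature screening, one would compute this score between each candidate feature and the response and keep the highest-scoring features; the bin count plays the role of the discretisation discussed in the abstract.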
436. Uncertainty quantification for neural network predictions / Kvantifiering av osäkerhet för prediktioner av neurala nätverk. Borgström, Jonas. January 2022 (has links)
Since their inception, machine learning methods have proven useful, and their usability continues to grow as new methods are introduced. However, as these methods are used for decision-making in most fields, such as weather forecasting, medicine, and stock market prediction, their reliability must be appropriately evaluated before the models are deployed. Uncertainty in machine learning and neural networks usually stems from two primary sources: the data used or the model itself. For most statistical and machine learning methods this uncertainty can be quantified directly, but neural networks lack inherent uncertainty quantification methods, which makes it more problematic. Furthermore, as the network architecture grows, so does the number of parameters to be estimated, so modeling prediction uncertainty through parameter uncertainty can become infeasible. There are, however, methods that quantify uncertainty in neural networks using Bayesian approximation. One such method is Monte Carlo Dropout, where the same input is passed repeatedly through different dropout-induced network structures; the resulting outputs are assumed to follow a normal distribution, from which the uncertainty can be quantified. The second method tests a new approach in which the neural network is first treated as a dimension reduction tool: the often large input feature space is mapped to the state space of the neurons in the last hidden layer, which can be chosen to be smaller. Using the information from this reduced feature space, a reduced parameter set for the neural network prediction can be defined, and an assumption of, for example, a multinomial-Dirichlet probability model for discrete classification can be made. Importantly, this reduced feature space can generate predictions for hypothetical inputs, which quantifies prediction uncertainty for the network predictions. This thesis aims to determine whether the uncertainty of neural network predictions can be quantified statistically by evaluating this new method, and the results of the two methods are then compared for differences in the quantified predictive uncertainty. The results show that, using the new method, predictive uncertainty could be quantified by first gathering the output range of each ReLU activation function, then simulating new data uniformly within these ranges and passing it into the softmax layer for classification, and finally applying the multinomial-Dirichlet distribution to quantify the uncertainty. The two methods offer comparable results when used to quantify predictive uncertainty.
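The first of the two approaches, Monte Carlo Dropout, can be sketched in a few lines: keep dropout active at prediction time, repeat the forward pass many times, and read the spread of the outputs as predictive uncertainty. The tiny two-layer network and its random "trained" weights below are illustrative assumptions, not the thesis's models.

```python
# Monte Carlo Dropout on a toy regression network: T stochastic forward
# passes with dropout left on, uncertainty read off as the output spread.
import numpy as np

rng = np.random.default_rng(3)
W1, b1 = rng.normal(size=(5, 16)), np.zeros(16)   # stand-in for trained weights
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

def forward(x, p_drop=0.2):
    h = np.maximum(x @ W1 + b1, 0.0)                                  # ReLU hidden layer
    mask = rng.binomial(1, 1 - p_drop, size=h.shape) / (1 - p_drop)   # inverted dropout, kept active
    return (h * mask) @ W2 + b2

x = rng.normal(size=(1, 5))                                           # one new input
samples = np.concatenate([forward(x) for _ in range(200)])            # T = 200 passes

print(f"prediction: {samples.mean():.3f} +/- {samples.std():.3f} (MC Dropout spread)")
```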
437. Modification of the RusBoost algorithm : A comparison of classifiers on imbalanced data / Modifikation av RusBoost algoritmen : En jämförelse av klassificeringsmetoder på obalanserad data. Forslund, Isak. January 2022 (has links)
In many situations data is imbalanced, meaning that the proportion of one class is larger than that of the other(s). Standard classifiers often produce undesirable results when the data is imbalanced, and different methods have been developed in an attempt to improve classification under such conditions. Examples are the algorithms AdaBoost, RusBoost, and SmoteBoost, which modify the cost for misclassified observations; the latter two also reduce the class imbalance when training the classifier. This thesis presents a new method, Modified RusBoost, in which the RusBoost algorithm is modified so that observations that are harder to classify correctly are assigned a lower probability of being removed in the under-sampling process. The performance of this method was compared with AdaBoost, RusBoost, and SmoteBoost on imbalanced data, and it was also investigated how imbalances affect the different classifiers. The methods were compared on 20 real data sets. Overall, Modified RusBoost performed better than, or comparably to, the other methods, indicating that this algorithm can be a good alternative when classifying imbalanced data. The results also showed that an increase of ρ, the ratio of majority to minority observations in a data set, has a negative impact on the performance of the algorithms; however, this negative impact affects all methods similarly.
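The core of the modification described here is the under-sampling step. A sketch of that idea follows: majority-class observations are kept with probability proportional to their current boosting weight, so hard-to-classify observations are less likely to be removed. The specific weighting function and names are illustrative assumptions, not the thesis's exact formula.

```python
# Weighted random under-sampling for one boosting round: hard examples
# (large boosting weight) are less likely to be removed from the majority class.
import numpy as np

def undersample_majority(idx_majority, boost_weights, n_keep, rng):
    w = boost_weights[idx_majority]
    p_keep = w / w.sum()                  # keep-probability grows with boosting weight
    return rng.choice(idx_majority, size=n_keep, replace=False, p=p_keep)

rng = np.random.default_rng(4)
idx_majority = np.arange(100)                         # indices of majority-class rows
boost_weights = rng.uniform(0.5, 2.0, size=100)       # stand-in for AdaBoost-style weights
kept = undersample_majority(idx_majority, boost_weights, n_keep=30, rng=rng)
print(len(kept), "majority observations kept for this boosting round")
```

In plain RusBoost the removal is uniform at random; replacing the uniform draw with this weighted draw is the modification the abstract describes.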
438. Sleep apnea prediction in a Swedish cohort : Can the STOP-Bang questionnaire be improved? / Sömnapnéprediktion i en svensk kohort : Kan STOP-Bang enkäten förbättras? Gladh, Miriam. January 2022 (has links)
Obstructive sleep apnea (OSA) is defined as more than five breathing pauses per hour of sleep, i.e., an apnea-hypopnea index (AHI) > 5. STOP-Bang is a questionnaire that predicts the risk of sleep apnea based on risk factors such as snoring, hypertension, and a neck circumference greater than 40 cm. Many individuals with OSA are undiagnosed, and patients with sleep apnea have an increased risk of complications after surgery, so it is important to identify these patients. This thesis aims to create models that predict the degree of sleep apnea, defined as no to mild sleep apnea (AHI < 15) or moderate to severe sleep apnea (AHI ≥ 15), using different methods: random forests, logistic regression, and linear discriminant analysis (LDA). Beyond these three methods, the STOP-Bang questionnaire, a weighted STOP-Bang, and a modified STOP-Bang are used to predict the degree of sleep apnea. The modified STOP-Bang uses the same feature variables as STOP-Bang, but the categorical feature variables are split differently and some feature variables are given more weight. STOP-Bang-style models using other feature variables, SCAPIS STOP-Bang, were also built to see whether prediction accuracy would improve. Prediction performance is also compared between genders for all models. Accuracy, specificity, and sensitivity were compared across the models. Among the models using the STOP-Bang feature variables, those with the highest area under the curve (AUC), with confidence intervals in parentheses, were the LDA and logistic regression models, with an AUC of 0.81 (0.78, 0.84). The confidence intervals for AUC, sensitivity, and accuracy overlapped for all models. The SCAPIS STOP-Bang model did not achieve better prediction accuracy. For all models, accuracy was higher for females than for males, but here too all confidence intervals overlapped.
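One of the modelling steps described above can be sketched briefly: a logistic regression on STOP-Bang-style binary risk factors predicting moderate to severe sleep apnea, evaluated with AUC. The simulated cohort, the coefficients, and the exact feature coding are illustrative assumptions, not the thesis's data.

```python
# Logistic regression on eight binary STOP-Bang-style items, with AUC on
# held-out data (outcome 1 = AHI >= 15).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(5)
n = 5000
# snoring, tiredness, observed apneas, pressure (hypertension),
# BMI > 35, age > 50, neck circumference > 40 cm, male gender
X = rng.binomial(1, 0.4, size=(n, 8))
logit = -2.5 + X @ np.array([0.6, 0.3, 0.8, 0.5, 0.7, 0.6, 0.9, 0.7])
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"AUC on held-out data: {auc:.2f}")
```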
439. Correlation coefficient based feature screening : With applications to microarray data / Korrelationsbaserad dimensionsreduktion med tillämpning på data från mikromatriser. Holma, Agnes. January 2022 (has links)
Measuring dependency between variables is of great importance when performing statistical analysis and can, for instance, be used for feature screening. It is therefore interesting to find measures that can quantify dependencies, even complex ones. Recently, the correlation coefficient ξn was proposed [1], which is fast to compute and works particularly well when dependencies show an oscillatory or wiggly pattern. In this thesis, the coefficient ξn was applied as a feature screening tool, and a comprehensive simulation study investigated how well the coefficient can find dependencies between predictor variables and a response variable. The results showed that, compared with two other fairly new and popular dependence measures, the Hilbert-Schmidt Independence Criterion and Distance Correlation (DC), ξn was better at detecting dependencies when variables were connected through sine or cosine functions, and worse when variables were connected through some other functions, such as exponential functions. As a feature screening tool, ξn and DC were also applied to real microarray data to investigate whether they could give better results than a t-test for feature screening. For this particular data set, the t-test was more efficient than DC or ξn.
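Assuming ξn here refers to the rank-based coefficient of Chatterjee (2021), which matches the description (fast to compute, sensitive to oscillatory dependence), a minimal sketch for the no-ties case follows: sort the pairs by x, rank the y values, and measure how much consecutive ranks jump.

```python
# Chatterjee-style rank correlation xi_n (no ties): near 0 for independence,
# approaching 1 when y is a (possibly very wiggly) function of x.
import numpy as np

def xi_n(x, y):
    n = len(x)
    order = np.argsort(x)                          # sort pairs by x
    ranks = np.argsort(np.argsort(y[order])) + 1   # ranks of y in that x-order
    return 1.0 - 3.0 * np.abs(np.diff(ranks)).sum() / (n**2 - 1)

rng = np.random.default_rng(6)
x = rng.uniform(-3, 3, size=2000)
print(xi_n(x, rng.normal(size=2000)))    # independent -> near 0
print(xi_n(x, np.sin(5 * x)))            # oscillatory but deterministic -> near 1
```

For screening, the score is computed between each feature and the response and the top-ranked features are retained, exactly as with DC or a t-test ranking.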
440. Classification Models for Activity Recognition using Smart Phone Accelerometers / Klassificeringsmodeller för aktivitetsigenkänning använder sig av accelerometrar för smarta telefoner. Kumar, Biswas. January 2022 (has links)
The huge amount of data generated by accelerometers in smartphones creates new opportunities for useful data mining applications, and machine learning algorithms can be used effectively for tasks such as the classification and clustering of physical activity patterns. This paper builds and evaluates a system that uses labeled data from real-world smartphone-based tri-axial accelerometers to perform activity recognition. Over a million observations, recorded at a frequency of 20 Hz, were filtered and pre-processed to extract relevant features for the classification task, and the features were selected to obtain higher classification accuracy. The supervised classification models, namely random forest, support vector machines, decision tree, naïve Bayes, and multinomial logistic regression, are evaluated and finally compared with unsupervised methods such as k-means and the self-organizing map (SOM) built on an unlabelled dataset. Statistical evaluation metrics such as accuracy, precision, and recall are used to compare the classification performance of the models. Interestingly, all supervised learning methods achieved very high accuracy (over 95%) on the labeled dataset, against 65% for the unsupervised SOM. Moreover, the methods showed very low similarity (23%) among themselves on the unlabelled dataset with the same selected features.
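The pipeline described in this abstract can be sketched end to end: segment the tri-axial 20 Hz accelerometer stream into fixed windows, compute simple per-axis features, and fit a random forest. The simulated signal, the 10-second window, and the feature set are illustrative assumptions, not the thesis's actual preprocessing.

```python
# Windowed feature extraction from a simulated tri-axial accelerometer
# signal, followed by a random forest activity classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
fs, win = 20, 200                                   # 20 Hz, 10-second windows
X_feat, y = [], []
for i in range(600):
    activity = i % 3                                # 0 = sitting, 1 = walking, 2 = jogging
    amp = [0.05, 0.5, 1.5][activity]
    t = np.arange(win) / fs
    sig = amp * np.sin(2 * np.pi * (1 + activity) * t)[:, None] \
          + rng.normal(scale=0.1, size=(win, 3))
    X_feat.append(np.concatenate([sig.mean(axis=0), sig.std(axis=0)]))  # 6 features per window
    y.append(activity)
X_feat, y = np.array(X_feat), np.array(y)

X_tr, X_te, y_tr, y_te = train_test_split(X_feat, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.2f}")
```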