31

Predicting social unrest events in South Africa using LSTM neural networks

Zambezi, Samantha 21 September 2021 (has links)
This thesis demonstrates an approach to predicting the count of social unrest events in South Africa. A comparison is made between a traditional forecasting approach and neural networks: the traditional method selected is the Autoregressive Integrated Moving Average (ARIMA) model, and the neural network implemented is the Long Short-Term Memory (LSTM) network. The basic theoretical concepts of ARIMA models and LSTM neural networks are explained, and the patterns of the social unrest time series are then analysed using exploratory time series techniques. The social unrest series contains a significant number of irregular fluctuations with a non-linear trend, a structure suggesting that traditional linear approaches would fail to model its non-linear behaviour. This thesis confirms that finding. Twelve experiments were conducted in which features, scaling procedures and model configurations (i.e. univariate and multivariate models) were varied. The multivariate LSTM achieved the lowest forecast errors, and its performance improved as more explanatory features were introduced. The ARIMA model's performance deteriorated with added complexity, and the univariate ARIMA produced lower forecast errors than the multivariate ARIMA. In conclusion, multivariate LSTM neural networks are useful for predicting social unrest events.
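A minimal sketch of the kind of comparison the abstract describes is given below: an ARIMA baseline next to a small univariate LSTM trained on lagged windows of the count series. The file name, window length, ARIMA order and network size are illustrative assumptions, not the thesis's actual configuration.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from tensorflow import keras

def make_windows(series, lookback=12):
    # Turn a 1-D series into (samples, lookback, 1) windows and next-step targets.
    X, y = [], []
    for i in range(len(series) - lookback):
        X.append(series[i:i + lookback])
        y.append(series[i + lookback])
    return np.array(X)[..., None], np.array(y)

counts = np.loadtxt("unrest_counts.csv")      # monthly event counts (assumed file)
train, test = counts[:-24], counts[-24:]

# ARIMA baseline: a linear model of the differenced counts.
arima_pred = ARIMA(train, order=(1, 1, 1)).fit().forecast(steps=len(test))

# LSTM: a nonlinear map from the previous 12 counts to the next one.
X_train, y_train = make_windows(train)
model = keras.Sequential([
    keras.layers.LSTM(32, input_shape=(12, 1)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, epochs=50, verbose=0)
```

In the multivariate setting the same windowing applies, with the extra explanatory features stacked along the last axis of the input tensor.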
32

Natural Language Financial Forecasting: The South African Context

Katende, Simon 24 August 2021 (has links)
The stock market plays a fundamental role in any country's economy, as it efficiently directs the flow of savings and investments in ways that advance the accumulation of capital and the production of goods and services. Factors that affect the price movement of stocks include company news and performance, macroeconomic factors, market sentiment and unforeseeable events. The conventional prediction approach is based on historical numerical data such as price trends and trading volumes. This thesis reviews the literature of Natural Language Financial Forecasting (NLFF) and proposes novel implementation techniques that use Stock Exchange News Service (SENS) announcements to predict stock price trends with machine learning methods. Deep learning has recently sparked interest in the data science communities, but the literature on its application to stock prediction, especially in emerging markets like South Africa, is still limited. In this thesis, announcements were labelled using a more statistically relevant technique known as an event study. Classical textual preprocessing and representation techniques were replaced with state-of-the-art sentence embeddings. Deep learning models (a Deep Neural Network (DNN)) were then compared to classical models (Logistic Regression (LR)). These models were trained, optimised and deployed using the TensorFlow Machine Learning (ML) framework on Google Cloud AI Platform. The comparison between the performance results of the models shows that both the DNN and LR have the operational capability to use information dissemination as a means to assist market participants with their trading decisions.
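The event-study labelling step might look like the following sketch: a market model is fitted on a pre-event estimation window, abnormal returns are cumulated over the event window, and the sign of the cumulative abnormal return (CAR) becomes the class label. Window lengths and the function name are assumptions for illustration, not the thesis's exact procedure.

```python
import numpy as np

def label_announcement(stock_ret, market_ret, event_idx,
                       est_window=120, event_window=3):
    # Market-model alpha/beta estimated on the pre-event window.
    est = slice(event_idx - est_window, event_idx)
    beta, alpha = np.polyfit(market_ret[est], stock_ret[est], 1)
    # Abnormal returns over the event window, cumulated into a CAR.
    ev = slice(event_idx, event_idx + event_window)
    car = (stock_ret[ev] - (alpha + beta * market_ret[ev])).sum()
    return "up" if car > 0 else "down"   # class label for the announcement
```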
33

Sequential nonparametric estimation via Hermite series estimators

Stephanou, Michael Jared 25 February 2021 (has links)
Algorithms for estimating the statistical properties of streams of data in real time, as well as for the efficient analysis of massive data sets, are becoming particularly pertinent given the increasing ubiquity of such data. In this thesis we introduce novel approaches to sequential (online) estimation in both stationary and non-stationary settings, based on Hermite series density estimators. In the univariate context we apply Hermite series based distribution function estimators to sequential cumulative distribution function estimation. These distribution function estimators are particularly useful because they allow the sequential estimation of the full cumulative distribution function. This is in contrast to the empirical distribution function estimator and the smooth kernel distribution function estimator, which only allow sequential cumulative probability estimation at predefined values on the support of the associated density function. We explore the asymptotic consistency and robustness properties of the Hermite series based cumulative distribution function estimator, thereby redressing a gap in the literature. Given the sequential Hermite series based distribution function estimator, we obtain sequential quantile estimates numerically. Our algorithms go beyond existing sequential quantile estimation algorithms in that they allow arbitrary quantiles (as opposed to pre-specified quantiles) to be estimated at any point in time, in both the static and dynamic quantile estimation settings. In the bivariate context we introduce a Hermite series based sequential estimator for the Spearman's rank correlation coefficient and provide algorithms applicable in both the stationary and non-stationary settings. To treat the non-stationary setting, we introduce a novel, exponentially weighted estimator for the Spearman's rank correlation, which allows the local nonparametric correlation of a bivariate data stream to be tracked. To the best of our knowledge this is the first algorithm to be proposed for estimating a time-varying Spearman's rank correlation that does not rely on a moving window approach. We explore the practical effectiveness of the Hermite series based estimators through real data and simulation studies, demonstrating competitive performance compared to leading existing algorithms. The potential applications of this work are manifold. Our sequential distribution function and quantile estimation algorithms can be applied to real-time anomaly and outlier detection, real-time provisioning for future demand, and real-time risk estimation, for example. The Hermite series based Spearman's rank correlation estimator can be applied to fast and robust online calculation of correlation which may vary over time. Possible machine learning applications include fast feature selection and hierarchical clustering on massive data sets, amongst others.
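The core of a sequential Hermite series estimator can be sketched as follows: the coefficients a_k of the density expansion f(x) ≈ Σ_k a_k h_k(x), with h_k the orthonormal Hermite functions, are updated online in O(N) work per observation, and the full CDF can then be recovered at any time by numerical integration. The truncation degree, integration grid and class interface are illustrative assumptions rather than the thesis's implementation.

```python
import numpy as np
from scipy.special import eval_hermite, factorial

def hermite_functions(x, N):
    # Orthonormal Hermite functions h_k(x), k = 0..N, at a scalar point x.
    k = np.arange(N + 1)
    norm = 1.0 / np.sqrt(2.0**k * factorial(k) * np.sqrt(np.pi))
    return norm * np.exp(-x**2 / 2.0) * np.array(
        [eval_hermite(int(j), x) for j in k])

class SequentialHermiteCDF:
    def __init__(self, N=20):
        self.N, self.n, self.a = N, 0, np.zeros(N + 1)

    def update(self, x):
        # Online coefficient update: a running mean of h_k(x_i) over the stream.
        self.n += 1
        self.a += (hermite_functions(x, self.N) - self.a) / self.n

    def cdf(self, x, lo=-10.0, hi=10.0, m=2001):
        # Recover the CDF by integrating the (clipped) density estimate.
        grid = np.linspace(lo, hi, m)
        H = np.array([hermite_functions(g, self.N) for g in grid])
        density = np.clip(H @ self.a, 0.0, None)
        F = np.cumsum(density) * (grid[1] - grid[0])
        return np.interp(x, grid, F / F[-1])   # normalise to a proper CDF
```

Because the coefficient vector is a running mean, an exponentially weighted variant for the non-stationary setting only changes the update line.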
34

Changes in rainfall seasonality in the Western Cape, South Africa: an exploration of methods for determining the start and end of the rainfall season

Ivey, Peter 29 January 2021 (has links)
The aim of this thesis is to detect and analyse changes in the seasonality of rainfall for various groups of weather stations in the Western Cape area. Weather stations with similar seasonal patterns are first grouped together using clustering algorithms. The start and end dates of the rainfall season for the different groups of weather stations are estimated and then compared over time to determine whether there have been any changes. Once these start and end dates have been estimated, the length of the rainfall season is estimated and compared over time. Studies have been performed globally and over southern Africa attempting to analyse rainfall patterns and changes. However, rainfall is the most unstable climate variable in terms of time and space and is therefore difficult to predict (Yaman, 2018). Most studies have pointed toward an increase in extreme events on both sides of the scale, i.e. more intense flooding and more severe drought. Some places are also starting to experience more rainfall than before, whilst others are starting to experience more drought. The impacts of these rainfall changes are already being felt, with many areas being forced to adapt to the new conditions. Better decisions can be made with a better understanding of how rainfall seasons are changing. In the agricultural industry, better-informed decisions about when the rainfall season is likely to start and end can result in more optimal crop yields. Changes in rainfall can also affect the type of crops that should be planted, and farmers will be better able to prepare for drought if they are better informed as to when drought periods are likely to occur. In terms of disaster risk management, the more that is known about rainfall patterns, the better prepared regions can be for the inevitable increase in extreme events, and cities can put better systems in place now to deal with potential future crises. Cape Town is an example of a city that could have been better prepared for the recent drought crisis had there been a better understanding of rainfall trends. With more accurate rainfall information, adaptation can become a proactive rather than a reactionary process.
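The station-grouping step might be sketched as follows: each station's mean monthly rainfall profile is normalised so that clusters reflect seasonal shape rather than overall wetness, and a simple cumulative-fraction rule then gives one possible start-of-season month. The file name, number of clusters and 10% threshold are illustrative assumptions; the thesis compares several definitions of the start and end of the season.

```python
import numpy as np
from sklearn.cluster import KMeans

# rainfall: (n_stations, 12) array of mean monthly totals (assumed file).
rainfall = np.load("station_monthly_means.npy")

# Normalise each profile so clusters reflect seasonal shape, not wetness.
profiles = rainfall / rainfall.sum(axis=1, keepdims=True)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(profiles)

# One simple start-of-season definition: the first month at which the
# cumulative fraction of the annual total crosses a 10% threshold.
start_month = (np.cumsum(profiles, axis=1) >= 0.1).argmax(axis=1) + 1
```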
35

Building a question answering system for the introduction to statistics course using supervised learning techniques

Leonhardt, Waldo 04 February 2021 (has links)
Question Answering (QA) is the task of automatically generating an answer to a question asked by a human in natural language. Open-domain QA is still a difficult problem to solve even after 60 years of research in this field, as trying to answer questions covering a wide range of subjects is a complex matter. Closed-domain QA is, on the other hand, more achievable, as the context for asking questions is restricted and allows for more accurate interpretation. This dissertation explores how a QA system could be built for the Introduction to Statistics course taught online at the University of Cape Town (UCT), for the purpose of answering administrative queries. This course runs twice a year and students tend to ask similar administrative questions each time the course is run. If a QA system can successfully answer these questions automatically, it would save lecturers the time of having to do so manually, as well as enabling students to receive answers immediately. For a machine to be able to interpret natural language questions, methods are needed to transform text into numbers while still preserving the meaning of the text; the field of Natural Language Processing (NLP) offers the building blocks for the methods used in this study. After predicting the category of a new question using Multinomial Logistic Regression (MLR), the past question most similar to the new question is retrieved and its answer is used for the new question. Five classifiers were compared to see which provides the best results for categorising a new question: Naive Bayes, Logistic Regression, Support Vector Machines, Stochastic Gradient Descent and Random Forests. The cosine similarity method was used to find the most similar past question. The Round-Trip Translation (RTT) technique was explored as a text augmentation method, in an attempt to increase the dataset size. Methods were compared using the initial base dataset of 744 questions and the extended dataset of 6 614 questions generated by the RTT technique. In addition to these two datasets, features based on Bag-of-Words (BoW), Term Frequency times Inverse Document Frequency (TF-IDF), Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDiA), pre-trained Global Vector (GloVe) word embeddings and custom-engineered features were also compared. This study found that a model using an MLR classifier with TF-IDF unigram and bigram features (built on the smaller 744-question dataset) performed the best, with a test F1-measure of 84.8%. Models using a Stochastic Gradient Descent classifier also performed very well with a variety of features, indicating that Stochastic Gradient Descent is the most versatile classifier. No significant improvements were found using the extended RTT dataset of 6 614 questions, although this dataset was used by the model that ranked eighth. A simulator was also built to illustrate and test how a bot (an autonomous program on a network that is able to interact with users) could facilitate the auto-answering of student questions. This simulator proved very useful and helped to identify that questions relating to the Course Information Pack had been excluded from the data initially sourced, as students had been asking such questions through other platforms. Building a QA system using a small dataset proved to be very challenging.
Restricting the domain of questions and focusing only on administrative queries was helpful. A great deal of data cleaning was needed, and all past answers had to be rewritten and standardised, as the raw answers were too specific and did not generalise well. The features that performed best for cosine similarity, and for extracting the most similar past question, were LSA topics built from TF-IDF unigram features. Using LSA topics as the input for cosine similarity, instead of the raw TF-IDF features, resolved the "curse of dimensionality". Issues with cosine similarity were observed in cases where it favoured short documents, which often led to the selection of the wrong past question. As an alternative, the use of more advanced language-modelling-based similarity measures is suggested for future study: either pre-trained word embeddings such as GloVe could be used as a language model, or a custom language model could be trained. A generic UCT language model could be valuable, and it would be preferable to build such a language model using the entire digital content of Vula across all faculties where students converse, ask questions or post comments. Building a QA system using this UCT language model is foreseen to offer better results, as terms like "Vula", "DP", "SciLab" and "jdlt1" would be endowed with more meaning.
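Taken together, the pipeline described above might be sketched as follows: TF-IDF unigram and bigram features feed an MLR category classifier, while retrieval of the most similar past question runs cosine similarity over LSA topics rather than raw TF-IDF vectors. The data-loading helper and the number of LSA components are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

past_questions, categories, answers = load_past_qa()   # assumed helper

vec = TfidfVectorizer(ngram_range=(1, 2))              # unigrams + bigrams
X = vec.fit_transform(past_questions)

# Logistic regression behaves multinomially for multiclass targets here.
clf = LogisticRegression(max_iter=1000).fit(X, categories)

# LSA topics mitigate the dimensionality issues noted above.
svd = TruncatedSVD(n_components=100, random_state=0)
topics = svd.fit_transform(X)

def answer(new_question):
    q = vec.transform([new_question])
    category = clf.predict(q)[0]                        # step 1: categorise
    sims = cosine_similarity(svd.transform(q), topics)  # step 2: retrieve
    return category, answers[int(sims.argmax())]
```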
36

A variance shift model for outlier detection and estimation in linear and linear mixed models

Gumedze, Freedom Nkhululeko January 2009 (has links)
Outliers are data observations that fall outside the usual conditional ranges of the response data. They are common in experimental research data, for example due to transcription errors or faulty experimental equipment. Often outliers are quickly identified and addressed, that is, corrected, removed from the data, or retained for subsequent analysis. However, in many cases they are completely anomalous and it is unclear how to treat them. Case deletion techniques are established methods for detecting outliers in linear fixed effects analysis. The extension of these methods to detecting outliers in linear mixed models has not been entirely successful in the literature. This thesis focuses on a variance shift outlier model as an approach to detecting and assessing outliers in both linear fixed effects and linear mixed effects analysis. A variance shift outlier model assumes a variance shift parameter, ωi, for the ith observation, where ωi is unknown and estimated from the data. Estimated values of ωi indicate observations with possibly inflated variances relative to the remainder of the observations in the data set, and hence outliers. When outliers lurk within anomalous elements in the data set, a variance shift outlier model offers an opportunity to include the anomalies in the analysis, but down-weighted using the variance shift estimate ω̂i. This down-weighting might be considered preferable to omitting data points (as in case-deletion methods); for very large values of ωi a variance shift outlier model is approximately equivalent to the case deletion approach. We commence with a detailed review of parameter estimation and inferential procedures for the linear mixed model. This review is necessary for the development of the variance shift outlier model as a method for detecting outliers in linear fixed and linear mixed models, and is followed by a discussion of the status of current research into linear mixed model diagnostics. Different types of residuals in the linear mixed model are defined, and a decomposition of the leverage matrix for the linear mixed model leads to interpretable leverage measures. A detailed review of a variance shift outlier model in linear fixed effects analysis is given. The purpose of this review is firstly to gain insight into the general case (the linear mixed model) and secondly to develop the model further in linear fixed effects analysis. A variance shift outlier model can be formulated as a linear mixed model, so that the calculations required to estimate the parameters of the model are those associated with fitting a linear mixed model; hence the model can be fitted using standard software packages. Likelihood ratio and score test statistics are developed as objective measures for the variance shift estimates. The proposed test statistics initially assume balanced longitudinal data with a Gaussian distributed response variable. The dependence of the proposed test statistics on the second derivatives of the log-likelihood function is also examined. For the single-case outlier in linear fixed effects analysis, analytical expressions for the proposed test statistics are obtained. A resampling algorithm is proposed for assessing the significance of the proposed test statistics and for handling the problem of multiple testing. A variance shift outlier model is then adapted to detect a group of outliers in a fixed effects model. Properties and performance of the likelihood ratio and score test statistics are also investigated.
A variance shift outlier model for detecting single-case outliers is also extended to linear mixed effects analysis under Gaussian assumptions for the random effects and the random errors. The variance parameters are estimated using the residual maximum likelihood method. Likelihood ratio and score tests are also constructed for this extended model. Two distinct computing algorithms, which constrain the variance parameter estimates to be positive, are given. Properties of the resulting variance parameter estimates from each computing algorithm are also investigated. A variance shift outlier model for detecting single-case outliers in linear mixed effects analysis is extended to detect groups of outliers, or subjects having outlying profiles with random intercepts and random slopes that are inconsistent with the corresponding model elements for the remaining subjects in the data set. The issue of influence on the fixed effects under a variance shift outlier model is also discussed.
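For intuition, a single-case variance shift search in the fixed-effects case can be sketched as below: for each observation i the model Var(e_i) = σ²ωi is profiled over ωi ≥ 1 by maximum likelihood, and a likelihood ratio against ωi = 1 flags candidate outliers. This is a simplified ML illustration, not the thesis's REML-based procedure with its analytical test statistics.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def vsom_lrt(X, y):
    n = len(y)

    def negloglik(w, i):
        # Profile ML: beta and sigma^2 are maximised out by weighted least
        # squares under Var = sigma^2 * diag(1, ..., w_i, ..., 1).
        v = np.ones(n)
        v[i] = w
        beta = np.linalg.solve(X.T @ (X / v[:, None]), X.T @ (y / v))
        r = y - X @ beta
        sig2 = (r * r / v).sum() / n
        return 0.5 * (n * np.log(sig2) + np.log(v).sum())

    stats = []
    for i in range(n):
        fit = minimize_scalar(negloglik, bounds=(1.0, 1e4),
                              args=(i,), method="bounded")
        stats.append(2.0 * (negloglik(1.0, i) - fit.fun))
    return np.array(stats)   # large values flag candidate outliers
```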
37

Contributions to Linear Regression diagnostics using the singular value decomposition: Measures to Identify Outlying Observations, Influential Observations and Collinearity in Multivariate Data

Ramaboa, Kutlwano January 2010 (has links)
No description available.
38

The use of ringing data in the study of climatic influences on common passerines

Jansen, Dorine Yvette Manon January 2016 (has links)
To understand the potential impact of forecast increases in climatic variability, we need to determine the impact of climatic stochasticity on demographic rates. This thesis used available long-term ringing data collected by volunteers, augmented by data from research projects, to investigate the influence of climatic variation on the survival of 10 common passerines in southern Africa. Through sheer numbers, common species are fundamental to ecosystem functioning. Migratory species are subject to climatic stochasticity in their breeding and wintering grounds, and during migration. In a population of African Reed Warblers Acrocephalus baeticatus (an azonal wetland specialist), a capture-mark-recapture model correlated higher temperature in the breeding grounds with higher adult survival (1998-2010), but, contrary to expectations, not wetter winters. A spatial analysis using a multi-state model in a Bayesian framework did not link survival in populations across southern Africa to environmental seasonality. However, as hypothesised, migratory populations appeared to survive better than sedentary populations. Increased climatic variation could synchronise survival of species assemblages and of colonies in meta-populations. I investigated a 3-species assemblage in climatically stable fynbos (2000-2007) and a 4-species assemblage in more seasonal wetland (1999-2013) with a hierarchical model, run in WinBUGS, with a temporal, synchronous (common) and asynchronous (species-specific) component. Comparison of models with and without climatic covariates quantified the impact of climatic stochasticity as a synchronising and desynchronising agent. As expected, the wetland assemblage exhibited more synchronous and asynchronous variation in survival than the fynbos assemblage, but the analysis did not find evidence of climatic forcing. Demographic rates of a population of 25 colonies of a Sociable Weaver Philetairus socius meta-population in savanna near Kimberley did not correlate with climatic indices during 1993-2014. Age-specific survival and fecundity of the largest colony were influenced by climatic variation, reinforcing earlier inference that colonies respond differently to environmental stochasticity. An integrated population model using count, ringing and productivity data enabled the first estimation of annual fecundity, juvenile survival and recruitment. The volunteer data yielded the first estimates of adult survival for two African endemics and estimates for a second population for three other species. A review of volunteer ringing resulted in recommendations to improve its use from a demographic perspective.
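The basic capture-mark-recapture building block underlying these analyses is the Cormack-Jolly-Seber (CJS) likelihood; a minimal sketch with constant apparent survival phi and recapture probability p follows. The hierarchical, covariate-dependent models in the thesis extend this building block, and the `histories` matrix layout here is an assumption for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def cjs_negloglik(params, histories):
    phi, p = 1.0 / (1.0 + np.exp(-np.asarray(params)))  # logit -> (0, 1)
    T = histories.shape[1]
    # chi[t]: probability of never being seen again after occasion t.
    chi = np.ones(T)
    for t in range(T - 2, -1, -1):
        chi[t] = (1 - phi) + phi * (1 - p) * chi[t + 1]
    ll = 0.0
    for h in histories:
        seen = np.flatnonzero(h)
        first, last = seen[0], seen[-1]
        for t in range(first, last):                    # survive, then seen or not
            ll += np.log(phi) + (np.log(p) if h[t + 1] else np.log(1 - p))
        ll += np.log(chi[last])                         # never seen again
    return -ll

# histories: 0/1 matrix, rows = birds, columns = ringing occasions (assumed).
# fit = minimize(cjs_negloglik, x0=[0.0, 0.0], args=(histories,))
```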
39

The construction of a partial least squares biplot

Oyedele, Opeoluwa Funmilayo January 2014 (has links)
In multivariate analysis, data matrices are often very large, which sometimes makes it difficult to describe their structure and to make a visual inspection of the relationship between their respective rows (samples) and columns (variables). For this reason biplots, the joint graphical display of the rows and columns of a data matrix, can be useful tools for analysis. Since they were first introduced, biplots have been employed in a number of multivariate methods, such as Correspondence Analysis (CA), Principal Component Analysis (PCA), Canonical Variate Analysis (CVA) and Discriminant Analysis (DA), as a form of graphical display of data. Another possible employment is in Partial Least Squares (PLS). First introduced as a regression method, PLS is more flexible than multivariate regression, but better suited than Principal Component Regression (PCR) for the prediction of a set of response variables from a large set of predictor variables. Employing the biplot in PLS gave rise to the PLS biplot, a new addition to the biplot family. In the current study, this biplot was successfully applied to sensory data to investigate the relationships between sensory panel characteristics and chemical quality measurements of sixteen olive oils. It was also applied to a large set of mineral sorting production data to investigate the relationships between the output variables and the process factors used to produce a final product. Furthermore, the PLS biplot was applied to Binomial-distributed data concerning the diabetes testing of Indian women and to Poisson-distributed data showing the diversity of arboreal marsupials (possums) in the Montane ash forest. After these applications, it is proposed that the PLS biplot is a useful graphical tool for displaying results from the (univariate) Partial Least Squares-Generalized Linear Model (PLS-GLM) analysis of a data set. With Partial Least Squares Regression (PLSR) being a valuable method for modelling high-dimensional data, especially in chemometrics, the PLS biplot was also successfully applied to a cereal evaluation containing one hundred and forty-five infrared spectra and six chemical properties, and to a gene expression data set with two thousand genes.
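A basic PLS biplot can be sketched with standard tools: samples are plotted as scores on the first two PLS components, and predictor and response variables as loading vectors. The arrays `X`, `Y`, `x_names` and `y_names` are assumed inputs; the thesis's PLS-GLM and spectral extensions are not reproduced here.

```python
import matplotlib.pyplot as plt
from sklearn.cross_decomposition import PLSRegression

pls = PLSRegression(n_components=2).fit(X, Y)    # X: predictors, Y: responses
scores = pls.x_scores_                           # sample coordinates

fig, ax = plt.subplots()
ax.scatter(scores[:, 0], scores[:, 1], s=10)     # samples as points
for load, names, colour in [(pls.x_loadings_, x_names, "grey"),
                            (pls.y_loadings_, y_names, "red")]:
    for j, name in enumerate(names):             # variables as vectors
        ax.arrow(0, 0, load[j, 0], load[j, 1], color=colour)
        ax.annotate(name, (load[j, 0], load[j, 1]))
plt.show()
```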
40

Contributions to spatial uncertainty modelling in GIS : small sample data

Guo, Danni January 2007 (has links)
Environmental data are very costly and difficult to collect, and are often vague (subjective) or imprecise in nature (e.g. the hazard level of a pollutant is classified as "harmful for human beings"). These practical realities (fuzziness and small datasets) lead to uncertainty, which is addressed by my research objective: "To model spatial environmental data with fuzzy uncertainty, and to explore the use of small sample data in spatial modelling predictions, within a Geographic Information System (GIS)." The methodologies underlying the theoretical foundations for spatial modelling are examined, such as geostatistics, fuzzy mathematics, Grey System Theory, and (V,·) Credibility Measure Theory. Fifteen papers, including three journal papers, were written in contribution to the development of spatial fuzzy and grey uncertainty modelling, to which I contributed 50 to 65%. The methods and theories are merged together in these papers and applied to two datasets: PM10 air pollution data and soil dioxin data. The papers can be classified into two broad categories: fuzzy spatial GIS modelling and grey spatial GIS modelling. In fuzzy spatial GIS modelling, the fuzzy uncertainty (Zadeh, 1965) in environmental data is addressed. The thesis developed a fuzzy membership grades kriging approach by converting fuzzy subsets spatial modelling into membership grade spatial modelling. As this method develops, fuzzy membership grades kriging is put on the foundation of credibility measure theory, and a fully data-assimilated membership function is approached in terms of the maximum fuzzy entropy principle. This variable modelling method for dealing with fuzzy data is a unique contribution to the fuzzy spatial GIS modelling literature. In grey spatial GIS modelling, spatial prediction using small sample data is addressed. The thesis developed a Grey GIS modelling approach in which two-dimensional, order-less spatial observations are converted into two one-dimensional ordered data sequences. The thesis papers also explored foundational problems within the grey differential equation models (Deng, 1985). It is discovered that the coupling feature of grey differential equations, together with the help of an e-similarity measure, generalises the classical GM(1,1) model into broader classes of extended GM(1,1) models, in order to fully assimilate the sample data information. The development of grey spatial GIS modelling is a creative contribution to handling small sample data.
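The classical GM(1,1) grey model mentioned above, which fits a first-order grey differential equation to a short (small-sample) data sequence, can be sketched as follows; the extended GM(1,1) variants developed in the papers are not reproduced here.

```python
import numpy as np

def gm11_forecast(x0, steps=1):
    n = len(x0)
    x1 = np.cumsum(x0)                                # accumulated sequence (AGO)
    z1 = 0.5 * (x1[1:] + x1[:-1])                     # background values
    B = np.column_stack([-z1, np.ones(n - 1)])
    a, b = np.linalg.lstsq(B, x0[1:], rcond=None)[0]  # developing coeff., grey input
    k = np.arange(n + steps)
    x1_hat = (x0[0] - b / a) * np.exp(-a * k) + b / a # solution of the grey ODE
    return np.diff(x1_hat, prepend=0.0)[n:]           # inverse AGO -> forecasts

# e.g. gm11_forecast(np.array([2.87, 3.28, 3.34, 3.62, 3.79]), steps=2)
```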
