571 |
A GLMM analysis of data from the Sinovuyo Caring Families Program (SCFP)
Nhapi, Raymond T, 13 February 2019 (has links)
We present an analysis of data from a longitudinal randomized controlled trial that assessed the impact of an intervention program aimed at improving the quality of childcare within families. The SCFP was a group-based program implemented over two separate waves conducted in Khayelitsha and Nyanga. Data were collected at baseline, post-test and one-year follow-up via questionnaires (self-assessment) and observational video coding. Multiple imputation (using chained equations) was used to impute missing information. Generalized linear mixed-effects models (GLMMs) were used to assess the impact of the intervention program on the responses, adjusting for possible confounding variables. The responses were summed scores that were often right-skewed and zero-inflated. All effects (fixed and random) were estimated by maximum likelihood. Primarily, an intention-to-treat analysis was performed, after which a per-protocol analysis was also implemented with participants who attended a specified number of the group sessions. All GLMMs were fitted within the multiple imputation framework.
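A minimal sketch of this kind of analysis, assuming Python with statsmodels: each multiply imputed dataset is completed by chained equations, a mixed model is fitted to it, and the fixed-effect estimates are pooled with Rubin's rules. The formula, column names (score, arm, wave, family_id) and the use of a linear mixed model are illustrative assumptions, not the thesis's actual GLMM specification or software.

```python
# Sketch: fit a mixed model on each multiply imputed dataset and pool the
# fixed-effect estimates with Rubin's rules. Column names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.imputation.mice import MICEData

def pooled_mixed_model(df, formula, groups, m=10):
    imp = MICEData(df)                        # chained-equations imputation
    estimates, variances = [], []
    for _ in range(m):
        imp.update_all()                      # one pass of chained equations
        completed = imp.data.copy()
        fit = smf.mixedlm(formula, completed, groups=completed[groups]).fit()
        estimates.append(fit.fe_params.values)
        variances.append(fit.bse_fe.values ** 2)
    q = np.mean(estimates, axis=0)            # pooled point estimates
    u_bar = np.mean(variances, axis=0)        # within-imputation variance
    b = np.var(estimates, axis=0, ddof=1)     # between-imputation variance
    total_var = u_bar + (1 + 1 / m) * b       # Rubin's total variance
    return pd.DataFrame({"estimate": q, "se": np.sqrt(total_var)},
                        index=fit.fe_params.index)

# Example call (hypothetical data frame and column names):
# results = pooled_mixed_model(scfp_df, "score ~ arm * wave", groups="family_id")
```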
|
572 |
Anomaly detection in a mobile data network
Salzwedel, Jason Paul, 14 February 2020 (has links)
The dissertation investigated the creation of an anomaly detection approach to identify anomalies in the SGW (Serving Gateway) elements of an LTE network. Unsupervised techniques were compared and used to identify and remove anomalies from the training data set. This “cleaned” data set was then used to train an autoencoder in a semi-supervised approach. The resulting autoencoder was able to identify normal observations. A subsequent data set was then analysed by the autoencoder, and the resulting reconstruction errors were compared to ground-truth events to investigate the effectiveness of the autoencoder's anomaly detection capability.
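A minimal sketch of the semi-supervised approach described above, assuming a Keras autoencoder: the network is trained on the "cleaned" (assumed-normal) SGW feature vectors, and new observations are flagged when their reconstruction error exceeds a threshold. The architecture, epochs and threshold quantile are illustrative assumptions, not the dissertation's configuration.

```python
# Sketch: train an autoencoder on assumed-normal data, then flag observations
# whose reconstruction error is unusually large. Settings are illustrative.
import numpy as np
import tensorflow as tf

def build_autoencoder(n_features, code_dim=8):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(code_dim, activation="relu"),   # bottleneck
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(n_features, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

def detect_anomalies(ae, x_train_clean, x_new, quantile=0.995):
    ae.fit(x_train_clean, x_train_clean, epochs=20, batch_size=256, verbose=0)
    # Threshold taken from reconstruction errors on the clean training data.
    train_err = np.mean((ae.predict(x_train_clean, verbose=0) - x_train_clean) ** 2, axis=1)
    threshold = np.quantile(train_err, quantile)
    new_err = np.mean((ae.predict(x_new, verbose=0) - x_new) ** 2, axis=1)
    return new_err > threshold, new_err   # boolean anomaly flags + scores
```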
|
573 |
Low volatility alternative equity indices
Oladele, Oluwatosin Seun, January 2015 (has links)
In recent years, there has been increasing interest in constructing low volatility portfolios. These portfolios have shown significant outperformance when compared with market capitalization-weighted portfolios. This study analyses low volatility portfolios in South Africa using sectors instead of individual stocks as the building blocks for portfolio construction. The empirical results from back-testing these portfolios show significant outperformance relative to their market capitalization-weighted equity benchmark counterpart (ALSI). A further analysis delves into the construction of low volatility portfolios using the Top 40 and Top 100 stocks. The results also show significant outperformance over the market capitalization-weighted portfolio (ALSI), with the portfolios constructed using the Top 100 stocks performing better than those constructed using the Top 40 stocks. Finally, the low volatility portfolios are also blended with typical portfolios (the ALSI and SWIX indices) in order to establish their usefulness as effective portfolio strategies. The results show that the low volatility Single Index Model (SIM) portfolio and the equally weighted low-beta portfolio (Lowbeta) were the superior performers based on their Sharpe ratios.
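A sketch of one simple low volatility construction, assuming monthly return data in pandas: sector (or stock) weights proportional to inverse historical volatility, compared with a benchmark via the Sharpe ratio. This is not the thesis's SIM or low-beta methodology, only an illustration of the general low volatility weighting idea; the data-frame names are hypothetical.

```python
# Sketch: inverse-volatility "low vol" weights and a Sharpe-ratio comparison.
import numpy as np
import pandas as pd

def inverse_vol_weights(returns: pd.DataFrame, lookback=36):
    vol = returns.tail(lookback).std()    # trailing volatility per sector/stock
    w = 1.0 / vol
    return w / w.sum()                    # normalise to sum to one

def sharpe_ratio(returns: pd.Series, rf=0.0, periods_per_year=12):
    excess = returns - rf / periods_per_year
    return np.sqrt(periods_per_year) * excess.mean() / excess.std()

# Example (hypothetical data): monthly sector returns and an ALSI benchmark.
# w = inverse_vol_weights(sector_returns)
# port = (sector_returns * w).sum(axis=1)
# print(sharpe_ratio(port), sharpe_ratio(alsi_returns))
```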
|
574 |
Machine learning methods for individual acoustic recognition in a species of field cricket
Dlamini, Gciniwe, 18 February 2019 (has links)
Crickets, like other insects, play a vital role in maintaining balance in the ecosystem. The ability to identify individual crickets is therefore crucial, as it enables ecologists to estimate important population metrics such as population densities, which in turn are used to investigate ecological questions pertaining to these insects. In this research, classification models were developed to recognise individual field crickets of the species Plebeiogryllus guttiventris based solely on audio recordings of their calls. Recent advances in technology have made data collection easier, and consequently large volumes of data, including acoustic data, have become available to ecologists. The task of acoustic animal identification thus requires models that are well suited to training on large datasets. It is for this reason that convolutional neural networks (CNNs) and recurrent neural networks (RNNs) were utilised in this research. The results of these models were compared to those of a baseline random forest (RF) model, as RFs can also be used for acoustic classification. Mel-frequency cepstral coefficients (MFCCs), raw acoustic samples and two temporal features were extracted from each chirp in the cricket recordings and used as inputs to train the machine learning models. The raw acoustic samples were only used in the deep neural network (DNN) models (CNNs and RNNs), as these models have been successful when trained on other raw forms of data such as images (for example, Krizhevsky et al. (2012)). Training on the MFCC features was conducted in two ways: the DNN models were trained on MFCC matrices that each spanned a chirp, whereas the RF models were trained on the MFCC frame vectors, because RFs can only train on vector representations of observations, not matrices. The frame-level MFCC predictions obtained from the RF model were then aggregated into chirp-level predictions to facilitate comparison with the other classification models. The best classification performance was achieved by the RF model trained on the MFCC features, with a score of 99.67%. The worst performance was observed for the RF model trained on the temporal features, which scored 67%. The DNN models attained classification accuracies of 98.6% on average when trained on both the MFCC features and the raw acoustic samples. These results show that individual recognition of crickets from their calls can be achieved with great success through machine learning. Moreover, the performance of the deep learning models when trained on the raw acoustic samples indicates that the feature (MFCC) extraction step can be bypassed; the deep learning algorithms can be trained directly on the raw acoustic data and still achieve excellent results.
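A minimal sketch of the RF pipeline described above, assuming librosa and scikit-learn: MFCC frame vectors are extracted per chirp, a random forest is trained at frame level, and frame predictions are aggregated to a chirp-level prediction by majority vote. The MFCC settings and forest size are illustrative assumptions.

```python
# Sketch: frame-level MFCC features -> random forest -> chirp-level majority vote.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def chirp_mfcc_frames(audio, sr, n_mfcc=13):
    # librosa returns one column per frame; transpose so rows are frame vectors.
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc).T

def train_frame_rf(chirps, labels, sr):
    X, y = [], []
    for audio, lab in zip(chirps, labels):
        frames = chirp_mfcc_frames(audio, sr)
        X.append(frames)
        y.extend([lab] * len(frames))     # every frame inherits the chirp label
    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    rf.fit(np.vstack(X), np.array(y))
    return rf

def predict_chirp(rf, audio, sr):
    frame_preds = rf.predict(chirp_mfcc_frames(audio, sr))
    values, counts = np.unique(frame_preds, return_counts=True)
    return values[np.argmax(counts)]      # majority vote over frames
```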
|
575 |
Unravelling black box machine learning methods using biplots
Rowan, Adriaan, 14 February 2020 (has links)
Following the development of new mathematical techniques, the improvement of computer processing power and the increased availability of possible explanatory variables, the financial services industry is moving toward the use of new machine learning methods, such as neural networks, and away from older methods such as generalised linear models. However, their use is currently limited because they are seen as “black box” models, which give predictions without justification and are therefore neither understood nor trusted. The goal of this dissertation is to expand on the theory and use of biplots to visualise the impact of the various input factors on the output of the machine learning black box. Biplots are used because they give an optimal two-dimensional representation of the data set on which the machine learning model is based. The biplot allows every point on the biplot plane to be converted back to the original input dimensions – in the same format as is used by the machine learning model. This allows the output of the model to be represented by colour coding each point on the biplot plane according to the output of an independently calibrated machine learning model. The interaction of the changing prediction probabilities – represented by the coloured output – with the data points, variable axes and category level points represented on the biplot allows the machine learning model to be interpreted both globally and locally. By visualising the models and their predictions, this dissertation aims to remove the stigma of calling non-linear models “black box” models and to encourage their wider application in the financial services industry.
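A simplified sketch of the colour-coded biplot idea, assuming an ordinary PCA biplot plane as a stand-in for the dissertation's biplot machinery: a grid of points on the two-dimensional plane is mapped back to the original input dimensions and coloured by the predicted probability of an independently fitted black-box classifier.

```python
# Sketch: colour a 2-D biplot plane by a black-box model's predicted probability.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

def biplot_prediction_surface(X, y, grid_steps=200):
    pca = PCA(n_components=2).fit(X)
    Z = pca.transform(X)                              # 2-D biplot coordinates
    model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000).fit(X, y)

    xs = np.linspace(Z[:, 0].min(), Z[:, 0].max(), grid_steps)
    ys = np.linspace(Z[:, 1].min(), Z[:, 1].max(), grid_steps)
    gx, gy = np.meshgrid(xs, ys)
    grid_2d = np.c_[gx.ravel(), gy.ravel()]
    grid_full = pca.inverse_transform(grid_2d)        # back to original dimensions
    prob = model.predict_proba(grid_full)[:, 1].reshape(gx.shape)

    plt.contourf(gx, gy, prob, levels=20, cmap="RdYlBu")
    plt.scatter(Z[:, 0], Z[:, 1], c=y, edgecolor="k", s=15)
    plt.xlabel("biplot axis 1")
    plt.ylabel("biplot axis 2")
    plt.colorbar(label="predicted probability")
    plt.show()
```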
|
576 |
Classification and visualisation of text documents using networks
Phaweni, Thembani, 14 February 2019 (has links)
In both text classification and text visualisation, graph/network-theoretic methods can be applied effectively. For text classification we assessed the effectiveness of graph/network summary statistics for developing weighting schemes and features to improve test accuracy. For text visualisation we developed a framework using established visual cues from the graph visualisation literature to communicate information intuitively. The final output of the visualisation component of the dissertation was a tool that allows members of the public to produce a visualisation from a text document. We represented a text document as a graph/network: the words were nodes and an edge was created when a pair of words appeared within a pre-specified distance (window) of each other. The text document model is a matrix representation of a document collection such that it can be integrated into a machine or statistical learning algorithm. The entries of this matrix can be weighted according to various schemes. We used the graph/network representation of a text document to create features and weighting schemes that could be applied to the text document model. Because this approach was not well developed for text classification, we applied different edge weighting methods, window sizes, weighting schemes and features. We also applied three machine learning algorithms: naïve Bayes, neural networks and support vector machines. We compared our various graph/network approaches to the traditional document model with term frequency-inverse document frequency (TF-IDF) weighting. We were interested in establishing whether or not the use of graph weighting schemes and graph features could increase test accuracy for text classification tasks. As far as we can tell from the literature, this is the first attempt to use graph features to weight bag-of-words features for text classification. These methods had been applied to information retrieval (Blanco & Lioma, 2012), and it seemed they could also be applied to text classification. The text visualisation field seemed divorced from the text summarisation and information retrieval fields, in that text co-occurrence relationships were not treated with equal importance. Developments in the graph/network visualisation literature could be taken advantage of for the purposes of text visualisation. We created a framework for text visualisation using the graph/network representation of a text document. We used force-directed algorithms to visualise the document, and established visual cues like colour, size and proximity in space to convey information through the visualisation. We also applied clustering and part-of-speech tagging to allow for filtering and isolating of specific information within the visualised document. We demonstrated this framework with four example texts. We found that total degree, a graph weighting scheme, outperformed term frequency on average. The effect of graph features depended heavily on the machine learning method used: for the problems we considered, graph features increased accuracy for SVM classifiers, had little effect for neural networks and decreased accuracy for naïve Bayes classifiers. Therefore the impact on test accuracy of adding graph features to the document model depends on the machine learning algorithm used. The visualisation of text graphs is able to convey meaningful information regarding the text at a glance through established visual cues.
Related words are close together in visual space and often connected by thick edges. Large nodes often represent important words. Modularity clustering is able to extract thematically consistent clusters from text graphs, allowing clusters to be isolated and investigated individually to understand specific themes within a document. The use of part-of-speech tagging is effective in both reducing the number of words displayed and increasing the relevance of the words displayed. This was made clear through the use of part-of-speech tags applied to the Internal Resistance of Apartheid Wikipedia webpage: the webpage was reduced to its proper nouns, which contained much of the important information in the text. Classification accuracy is important in text classification, a task that is often performed on vast numbers of documents. Much of the research in text classification is aimed at increasing classification accuracy, either through feature engineering or by optimising machine learning methods. The finding that total degree outperformed term frequency on average provides an alternative avenue for achieving higher test accuracy. The finding that the addition of graph features can increase test accuracy when matched with the right machine learning algorithm suggests that further research should be conducted on the role graph features can play in text classification. Text visualisation is used as an exploratory tool and as a means of quickly and easily conveying text information. The framework we developed is able to create automated text visualisations that intuitively convey information for short and long text documents. This can greatly reduce the amount of time it takes to assess the content of a document, which can increase general access to information.
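A minimal sketch of the graph-of-words representation and the total degree weighting scheme discussed above, assuming networkx: words are nodes, edges join words co-occurring within a window, and a word's weighted degree replaces its term frequency in the document vector. The tokenisation is deliberately simplistic.

```python
# Sketch: build a co-occurrence graph from a text and derive degree-based weights.
import networkx as nx

def text_to_graph(text: str, window: int = 3) -> nx.Graph:
    tokens = [t.lower() for t in text.split() if t.isalpha()]
    g = nx.Graph()
    g.add_nodes_from(tokens)
    for i, w in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:   # words within the window
            if other != w:
                # Accumulate co-occurrence counts as edge weights.
                if g.has_edge(w, other):
                    g[w][other]["weight"] += 1
                else:
                    g.add_edge(w, other, weight=1)
    return g

def degree_weights(g: nx.Graph) -> dict:
    # "Total degree" weighting: sum of co-occurrence edge weights per word.
    return dict(g.degree(weight="weight"))

# vec = degree_weights(text_to_graph("the quick brown fox jumps over the lazy dog"))
```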
|
577 |
Statistical Modeling and Testing of Shapes of Planar Objects
Unknown Date (has links)
This dissertation studies statistical shape analysis of planar objects. The focus is on two different representations. The first considers only the boundary of planar shapes; a comprehensive analysis framework including quantification, registration, statistical summary and modeling is illustrated in (A. Srivastava and E. Klassen, 2016). Here, we study the hypothesis testing problem under this boundary representation. The second representation considers both the boundary and the interior of the objects, and the goal is to construct a shape analysis framework as in (A. Srivastava and E. Klassen, 2016). First, we apply the DISCO test, a nonparametric k-sample test proposed for Euclidean space, to the shape space of planar closed curves. There are two common frameworks for shape analysis of planar closed curves: Kendall's shape analysis and elastic shape analysis. The former treats a planar closed curve as discrete points while the latter treats it as a continuous parametric function. The test statistic for the DISCO test is based on pairwise distances between curves. Thus, we deploy five shape metrics from the two shape spaces while accounting for the scales of the curves. The specific problem that we are interested in is whether acute exercise (Run) affects mitochondrial morphology as observed in images of skeletal muscles of mice. In the data collection, other factors are also involved, including cell, animal, type of mitochondria (SS/IMF) and exercise (Sedentary/Run). These factors form a hierarchical data structure, which makes it hard to test the factor exercise. We propose a compression method to rule out the effect of the inner factor and then test on the outer factor. After confirming the significance of the factor cell and ignoring the significance of the factor animal, the SS mitochondria and the IMF mitochondria are found to be significant for all five shape metrics. However, when testing the factor exercise, the only significant case occurs with the scaled elastic metric for SS mitochondria. Second, we construct an elastic shape analysis framework for planar shapes that includes the boundary as well as the interior. Recent developments in elastic shape analysis (ESA) are motivated by the fact that it provides comprehensive frameworks for simultaneous registration, deformation, and comparison of shapes such as functions, curves and surfaces. These methods achieve computational efficiency using certain square-root representations that transform invariant elastic metrics into Euclidean metrics, allowing for applications of standard algorithms and statistical tools. For analyzing shapes of embeddings of the unit square/disk in ℝ², we introduce a tensor field of symmetric positive definite matrices as the mathematical representation. In addition to the desirable invariance properties, the main advantage of this representation is that it simplifies the calculation of the elastic metric, geodesics and registration. The inversion from the tensor field back to the embedding is complicated, and we use a gradient descent method. The optimal reparametrization between two embeddings is found through the multiresolution algorithm of (H. Laga, Q. Xie, I. H. Jermyn, and A. Srivastava, 2017). We demonstrate the proposed theory using a statistical analysis of fly wings and leaves. / A Dissertation submitted to the Department of Statistics in partial fulfillment of the requirements for the degree of Doctor of Philosophy. / 2019 / September 27, 2019. 
/ k-sample test, mitochondria, planar shapes, shape analysis / Includes bibliographical references. / Anuj Srivastava, Professor Directing Dissertation; Eric Klassen, University Representative; Wei Wu, Committee Member; Fred Huffer, Committee Member.
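A simplified stand-in for the pairwise-distance-based testing described above: a two-sample energy-distance permutation test computed from a precomputed matrix of shape distances (any of the five shape metrics could supply it). The actual DISCO test is a k-sample decomposition of dispersion; this sketch only illustrates the core idea of comparing within- and between-group pairwise distances.

```python
# Sketch: two-sample energy-distance permutation test on a shape-distance matrix D.
import numpy as np

def energy_statistic(D, idx_a, idx_b):
    between = D[np.ix_(idx_a, idx_b)].mean()
    within_a = D[np.ix_(idx_a, idx_a)].mean()
    within_b = D[np.ix_(idx_b, idx_b)].mean()
    return 2 * between - within_a - within_b

def permutation_test(D, labels, n_perm=2000, seed=0):
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    idx_a, idx_b = np.where(labels == 0)[0], np.where(labels == 1)[0]
    observed = energy_statistic(D, idx_a, idx_b)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(labels)                 # shuffle group labels
        pa, pb = np.where(perm == 0)[0], np.where(perm == 1)[0]
        count += energy_statistic(D, pa, pb) >= observed
    return observed, (count + 1) / (n_perm + 1)        # permutation p-value
```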
|
578 |
Shape Constrained Single Index Models for Biomedical Studies
Unknown Date (has links)
For many biomedical, environmental and economic studies with an unknown non-linear relationship between the response and its multiple predictors, a single index model provides practical dimension reduction and good physical interpretation. However, widespread use of existing Bayesian analyses for such models is lacking in biostatistics due to some major impediments, including slow mixing of the Markov chain Monte Carlo (MCMC), an inability to deal with missing covariates and a lack of theoretical justification of the rate of convergence. We present a new Bayesian single index model with an associated MCMC algorithm that incorporates an efficient Metropolis-Hastings (MH) step for the conditional distribution of the index vector. Our method leads to a model with good biological interpretation and prediction, implementable Bayesian inference, fast convergence of the MCMC, and a first-time extension to accommodate missing covariates. We also obtain, for the first time, a set of sufficient conditions for obtaining the optimal rate of convergence of the overall regression function. We illustrate the practical advantages of our method and computational tool via re-analysis of an environmental study. We further propose a frequentist and a Bayesian method for monotone single-index models, using the Bernstein polynomial basis to represent the link function. The monotonicity of the unknown link function yields a clinically interpretable index, along with the relative importance of the covariates on the index. We develop a computationally simple, iterative, profile-likelihood-based method for the frequentist analysis. To ease the computational complexity of the Bayesian analysis, we also develop a novel and efficient Metropolis-Hastings step to sample from the conditional posterior distribution of the index parameters. These methodologies and their advantages over existing methods are illustrated via simulation studies. The methods are also used to analyze depression-based measures among adolescent girls. / A Dissertation submitted to the Department of Statistics in partial fulfillment of the requirements for the degree of Doctor of Philosophy. / Summer Semester 2018. / July 13, 2018. / Adolescent Depression, Gaussian Processes, Markov Chain Monte Carlo, Mode Aligned Proposal Density, Monotone Single Index Models, Single Index Models / Includes bibliographical references. / Debajyoti Sinha, Professor Co-Directing Dissertation; Debdeep Pati, Professor Co-Directing Dissertation; Greg Hajcak, University Representative; Elizabeth Slate, Committee Member; Eric Chicken, Committee Member.
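A sketch of the monotone-link representation described above: the link is a Bernstein polynomial whose coefficients are constrained to be nondecreasing, evaluated at the single index rescaled to [0, 1]. The estimation machinery (profile likelihood or the Metropolis-Hastings step) is not shown; function and variable names are illustrative.

```python
# Sketch: a monotone link built from the Bernstein polynomial basis, evaluated
# at a single index x'beta. Nonnegative increments give nondecreasing
# coefficients, which guarantee a monotone link function.
import numpy as np
from scipy.special import comb

def bernstein_basis(t, degree):
    t = np.clip(np.asarray(t, dtype=float), 0.0, 1.0)
    k = np.arange(degree + 1)
    return comb(degree, k) * t[:, None] ** k * (1 - t[:, None]) ** (degree - k)

def monotone_link(index, increments, lo, hi):
    coefs = np.cumsum(np.abs(increments))   # nondecreasing Bernstein coefficients
    t = (index - lo) / (hi - lo)            # rescale the index to [0, 1]
    return bernstein_basis(t, len(coefs) - 1) @ coefs

# Example: a monotone single-index mean function m(x) = g(x'beta)
# beta = beta / np.linalg.norm(beta)        # identifiability constraint
# idx = X @ beta
# yhat = monotone_link(idx, increments, lo=idx.min(), hi=idx.max())
```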
|
579 |
Scalable Nonconvex Optimization Algorithms: Theory and Applications
Unknown Date (has links)
Modern statistical problems often involve minimizing objective functions that are not necessarily convex or smooth. In this study, we focus on developing scalable algorithms for nonconvex optimization with statistical guarantees. We first investigate a broad surrogate framework defined by generalized Bregman divergence functions for developing scalable algorithms. Local linear approximation, mirror descent, iterative thresholding, and DC programming can all be viewed as particular instances. The Bregman re-characterization enables us to choose suitable measures of computational error to establish global convergence rate results even for nonconvex problems in high-dimensional settings. Moreover, under some regularity conditions, the sequence of iterates in Bregman surrogate optimization can be shown to approach the statistical truth within the desired accuracy geometrically fast. The algorithms can be accelerated with careful control of relaxation and stepsize parameters. Simulation studies are performed to support the theoretical results. An important application of nonconvex optimization is robust estimation. Outliers occur widely in big-data and high-dimensional applications and may severely affect statistical estimation and inference. A framework of outlier-resistant estimation is introduced to robustify an arbitrarily given loss function. It has a close connection to the method of trimming but explicitly includes outlyingness parameters for all samples, which greatly facilitates computation, theory, and parameter tuning. To address the issues of nonconvexity and nonsmoothness, we develop scalable algorithms that are easy to implement and have guaranteed fast convergence. In particular, we introduce a new means to alleviate the requirement on the initial value, so that on regular datasets the number of data resamplings can be substantially reduced. Moreover, based on combined statistical and computational treatments, we are able to develop new tools for nonasymptotic robust analysis regarding a general loss. The obtained estimators, though not necessarily globally or even locally optimal, enjoy minimax rate optimality in both low and high dimensions. Experiments in regression, classification, and neural networks show excellent performance of the proposed methodology in robust parameter estimation and variable selection. / A Dissertation submitted to the Department of Statistics in partial fulfillment of the requirements for the degree of Doctor of Philosophy. / Summer Semester 2018. / July 6, 2018. / Includes bibliographical references. / Yiyuan She, Professor Directing Dissertation; Paul Beaumont, University Representative; Xufeng Niu, Committee Member; Jinfeng Zhang, Committee Member.
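Iterative thresholding is named above as one instance of the Bregman surrogate framework. The sketch below is a generic iterative soft-thresholding (ISTA) routine for l1-penalised least squares; it is not the dissertation's general algorithm or its acceleration scheme, only the simplest member of the family.

```python
# Sketch: iterative soft-thresholding (ISTA) for l1-penalised least squares,
# one of the surrogate-type algorithms mentioned in the abstract.
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(X, y, lam, n_iter=500):
    n, p = X.shape
    beta = np.zeros(p)
    step = n / np.linalg.norm(X, 2) ** 2       # 1/L for the (1/2n)||y - Xb||^2 loss
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n        # gradient of the smooth part
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta
```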
|
580 |
Bayesian Hierarchical Models That Incorporate New Sources of Dependence for Boundary Detection and Spatial Prediction
Unknown Date (has links)
Spatial boundary analysis has attracted considerable attention in several disciplines, including engineering, spatial statistics, and computer science. The inferential question of interest is to identify rapid surface changes of an unobserved latent process. We extend Curvilinear Wombling, the current state-of-the-art method for point-referenced data that is curve-measure based and limited to a single spatial scale, to multiscale settings. Specifically, we propose a multiscale representation of the directional derivative of the Karhunen-Loève (DDKL) expansion to perform multiscale direction-based boundary detection. By aggregating the curvilinear wombling measure, we extend curvilinear boundary detection from curves to zones. Furthermore, we propose a direction-based multiscale curvilinear boundary error criterion to quantify the curvilinear boundary fallacy (CBF), an analogue of the ecological fallacy in the spatial change-of-support literature. We refer to this metric as the criterion for boundary aggregation error (BAGE). Several theoretical results are derived to motivate BAGE. In particular, we show that no boundary aggregation error exists when the directional derivatives of the eigenfunctions of a Karhunen-Loève expansion are constant across spatial scales. We illustrate the use of our model through a simulated example and an application to Mediterranean wind measurements. The American Community Survey (ACS) is an ongoing survey administered by the U.S. Census Bureau, which periodically publishes estimates of important demographic statistics over pre-specified administrative areas. Spatially referenced binomial count data are widely present in the ACS. Since the sample size of a binomial count is often itself a realization of a random process, the probability of an outcome and the probability of the number of trials are possibly correlated. We therefore consider a joint model for both the binomial outcome and the number of trials. Latent Gaussian process (LGP) models are widely used to analyze non-Gaussian ACS count data. However, they pose computational problems; for example, LGPs may involve subjective tuning of parameters in Metropolis-Hastings steps. To improve the computational performance of Gibbs sampling, we include the latent multivariate logit-beta (MLB) distribution in our joint model for binomial and negative binomial count data. The closed-form full-conditional distributions are straightforward to simulate from without subjective tuning steps within a Gibbs sampler. We illustrate the proposed model through simulations and an application to ACS poverty estimates at the county level. / A Dissertation submitted to the Department of Statistics in partial fulfillment of the requirements for the degree of Doctor of Philosophy. / 2019 / October 24, 2019. / Includes bibliographical references. / Jonathan R. Bradley, Professor Co-Directing Dissertation; Xufeng Niu, Professor Co-Directing Dissertation; Kevin Speer, University Representative; Fred Huffer, Committee Member.
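The computational point above is that conjugacy yields closed-form Gibbs updates with no Metropolis tuning. The sketch below illustrates that idea with a much simpler Poisson-gamma hierarchical count model whose full conditionals are both available in closed form; it is not the latent multivariate logit-beta model for binomial and negative binomial counts developed in the dissertation.

```python
# Sketch: a Gibbs sampler with closed-form full conditionals for a Poisson-gamma
# hierarchical count model (a simplified stand-in for the conjugate-update idea).
import numpy as np

def gibbs_poisson_gamma(y, alpha=1.0, c=1.0, d=1.0, n_draws=2000, seed=0):
    """Model: y_i ~ Poisson(lam_i), lam_i ~ Gamma(alpha, beta), beta ~ Gamma(c, d)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    k = len(y)
    beta = 1.0
    lam_draws = np.empty((n_draws, k))
    for s in range(n_draws):
        # Closed-form full conditionals (shape/rate); numpy's gamma uses scale = 1/rate.
        lam = rng.gamma(alpha + y, 1.0 / (beta + 1.0))
        beta = rng.gamma(c + k * alpha, 1.0 / (d + lam.sum()))
        lam_draws[s] = lam
    return lam_draws

# Example: posterior means of small-area Poisson rates from observed counts y.
# post_mean = gibbs_poisson_gamma(y).mean(axis=0)
```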
|