About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations. Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

A statistical approach to automated detection of multi-component radio sources

Smith, Jeremy Stewart 24 February 2021 (has links)
Advances in radio astronomy are allowing deeper and wider areas of the sky to be observed than ever before. Source counts of future radio surveys are expected to number in the tens of millions. Source-finding techniques are used to identify sources in a radio image; however, these techniques identify single distinct sources and struggle to identify multi-component sources, that is, cases where two or more distinct sources belong to the same underlying physical phenomenon, such as a radio galaxy. Identification of such phenomena is an important step in generating catalogues from surveys, on which much of radio astronomy science is based. Historically, multi-component sources were identified by visual inspection; however, the size of future surveys makes manual identification prohibitive. An algorithm to automate this process using statistical techniques is proposed and demonstrated on two radio images. The output of the algorithm is a catalogue in which nearest-neighbour source pairs are assigned a probability score of being components of the same physical object. By applying several selection criteria, pairs of sources that are likely to form multi-component sources can be determined. Radio image cutouts are then generated from this selection and may be used as input to radio source classification techniques. Successful identification of multi-component sources using this method is demonstrated.
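The abstract does not give implementation details, so the following is only a minimal sketch of the general idea: pair each catalogued source with its nearest neighbour and attach a probability-like score. The k-d tree pairing and the toy score combining angular separation and flux similarity are illustrative assumptions, not the thesis's actual statistical model.

```python
import numpy as np
from scipy.spatial import cKDTree

def pair_candidates(ra, dec, flux, scale_arcsec=30.0):
    """Pair each source with its nearest neighbour and attach an
    illustrative probability-like score (not the thesis's actual model)."""
    coords = np.column_stack([ra, dec])          # small-field flat-sky approximation
    tree = cKDTree(coords)
    dist, idx = tree.query(coords, k=2)          # k=2: the point itself + nearest neighbour
    sep_deg = dist[:, 1]
    neighbour = idx[:, 1]

    # Toy score: close pairs with similar fluxes score higher.
    sep_term = np.exp(-(sep_deg * 3600.0 / scale_arcsec) ** 2)
    flux_term = np.minimum(flux, flux[neighbour]) / np.maximum(flux, flux[neighbour])
    score = sep_term * flux_term

    return [{"source": i, "neighbour": int(neighbour[i]), "score": float(score[i])}
            for i in range(len(ra))]

# Hypothetical usage with three sources (positions in degrees, arbitrary flux units):
catalogue = pair_candidates(np.array([150.10, 150.11, 151.50]),
                            np.array([2.20, 2.21, 2.90]),
                            np.array([1.0, 0.8, 5.0]))
```

Pairs whose score exceeds a chosen threshold would then be cut out of the image for downstream classification.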
2

A temporal prognostic model based on dynamic Bayesian networks: mining medical insurance data

Mbaka, Sarah Kerubo 10 September 2021 (has links)
A prognostic model is a formal combination of multiple predictors from which the risk probability of a specific diagnosis can be modelled for patients. Prognostic models have become essential instruments in medicine. They are used for prediction, guiding doctors towards informed diagnoses and patient-specific decisions, or helping to plan the utilization of resources for patient groups with similar prognostic paths. Dynamic Bayesian networks (DBNs) theoretically provide a very expressive and flexible model for solving temporal problems in medicine. However, this involves various challenges due both to the nature of the clinical domain and to the nature of the DBN modelling and inference process itself. The challenges from the clinical domain include insufficient knowledge of the temporal interactions of processes in the medical literature, the sparse nature and variability of medical data collection, and the difficulty of preparing and abstracting clinical data into a suitable format without losing valuable information in the process. Challenges relating to the DBN methodology and implementation include the lack of tools that allow easy modelling of temporal processes; overcoming this challenge would help solve various clinical temporal reasoning problems. In this thesis, we addressed these challenges while building a temporal network, with explanations of the effects of predisposing factors such as age and gender and of the progression of all diagnoses, using claims data from an insurance company in Kenya. We showed that our network could differentiate the probability of exposure to a diagnosis given age and gender, and the possible paths given a patient's history. We also presented evidence that the more patient history is provided, the better the prediction of future diagnoses.
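As a rough illustration only, the sketch below rolls a discretised diagnosis state forward through age- and gender-conditioned transition matrices, which is the basic two-slice mechanism a DBN of this kind relies on. The states, bands and probabilities are invented for the example and are not taken from the thesis or its claims data.

```python
import numpy as np

# Hypothetical two-slice model: P(diagnosis_t | diagnosis_{t-1}, age_band, gender).
# States and probabilities are illustrative, not estimated from the thesis data.
STATES = ["healthy", "hypertension", "diabetes"]

TRANSITIONS = {
    ("40-60", "F"): np.array([[0.90, 0.07, 0.03],
                              [0.05, 0.85, 0.10],
                              [0.02, 0.08, 0.90]]),
    ("40-60", "M"): np.array([[0.88, 0.08, 0.04],
                              [0.04, 0.84, 0.12],
                              [0.02, 0.07, 0.91]]),
}

def predict(history, age_band, gender, horizon=3):
    """Roll a patient's last observed state forward `horizon` claim periods."""
    T = TRANSITIONS[(age_band, gender)]
    belief = np.zeros(len(STATES))
    belief[STATES.index(history[-1])] = 1.0      # start from the last diagnosis
    for _ in range(horizon):
        belief = belief @ T                      # one time-slice of the DBN
    return dict(zip(STATES, belief.round(3)))

print(predict(["healthy", "hypertension"], "40-60", "F"))
```

Conditioning the transition matrix on more history, as the thesis reports, is what sharpens the forward prediction.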
3

Designing an event display for the Transition Radiation Detector in ALICE

Perumal, Sameshan 15 September 2021 (has links)
We document here a successful design study for an event display focused on the Transition Radiation Detector (TRD) within A Large Ion Collider Experiment (ALICE) at the European Organisation for Nuclear Research (CERN). Reviews of the fields of particle physics and visualisation are presented to motivate formally designing this display for two different audiences. We formulate a methodology, based on successful design studies in similar fields, that involves experimental physicists in the design process as domain experts. An iterative approach incorporating in-person interviews is used to define a series of visual components applying best practices from literature. Interactive event display prototypes are evaluated with potential users, and refined using elicited feedback. The primary artefact is a portable, functional, effective, validated event display – a series of case studies evaluate its use by both scientists and the general public. We further document use cases for, and hindrances preventing, the adoption of event displays, and propose novel data visualisations of experimental particle physics data. We also define a flexible intermediate JSON data format suitable for web-based displays, and a generic task to convert historical data to this format. This collection of artefacts can guide the design of future event displays. Our work makes the case for a greater use of high quality data visualisation in particle physics, across a broad spectrum of possible users, and provides a framework for the ongoing development of web-based event displays of TRD data.
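The dissertation defines a flexible intermediate JSON format for web-based displays; the exact schema is not given in the abstract, so the converter below uses hypothetical field names purely to show the shape such a conversion task could take.

```python
import json

def event_to_json(event_id, tracks):
    """Convert one event's TRD tracks into a flat, display-friendly JSON
    document. Field names here are illustrative, not the dissertation's schema."""
    return json.dumps({
        "eventId": event_id,
        "detector": "TRD",
        "tracks": [
            {
                "pt": t["pt"],                      # transverse momentum, GeV/c
                "charge": t["charge"],
                "points": [[p["x"], p["y"], p["z"]] for p in t["points"]],
            }
            for t in tracks
        ],
    }, indent=2)

# Hypothetical usage:
print(event_to_json(42, [{"pt": 1.7, "charge": -1,
                          "points": [{"x": 2.9, "y": 0.1, "z": -0.4}]}]))
```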
4

An exploration of media repertoires in South Africa: 2002-2014

Bakker, Hans-Peter 11 March 2020 (has links)
This dissertation explores trends in media engagement in South Africa over a period from 2002 until 2014. It utilises data from the South African Audience Research Foundation’s All Media and Products Surveys. Using factor analysis, six media repertoires are identified and, utilising structural equation modelling, marginal means for various demographic categories by year are estimated. Measurement error is determined with the aid of bootstrapping. These estimates are plotted to provide visual aids in interpreting model parameters. The findings show general declines in engagement with traditional media and growth in internet engagement, but these trends can vary markedly for different demographic groups. The findings also show that for many South Africans traditional media such as television remain dominant.
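A minimal sketch of the two statistical building blocks mentioned, under assumed data: factor extraction over media-use items and a bootstrap estimate of the error on a group mean. It is not the dissertation's structural equation model or its AMPS variables.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Hypothetical survey matrix: rows are respondents, columns are media-use items.
X = rng.normal(size=(500, 12))

# Extract six repertoires (latent factors), as in the dissertation.
fa = FactorAnalysis(n_components=6, random_state=0)
scores = fa.fit_transform(X)                 # respondent-level factor scores

# Bootstrap the mean score on the first repertoire for one demographic group.
group = scores[:250, 0]                      # illustrative subgroup
boot_means = [rng.choice(group, size=group.size, replace=True).mean()
              for _ in range(2000)]
print(group.mean(), np.std(boot_means))      # point estimate and bootstrap SE
```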
5

Unsupervised Machine Learning Application for the Identification of Kimberlite Ore Facie using Convolutional Neural Networks and Deep Embedded Clustering

Langton, Sean 25 February 2022 (has links)
Mining is a key economic contributor to many regions globally, especially those in developing nations. The design and operation of the processing plants associated with each of these mines is highly dependent on the composition of the feed material. The aim of this research is to demonstrate the viability of implementing a computer vision solution to provide online information about the composition of material entering the plant, thus allowing plant operators to adjust equipment settings and process parameters accordingly. Data is collected in the form of high-resolution images, captured every few seconds, of material on the main feed conveyor belt into the Kao Diamond Mine processing plant. The modelling phase of the research is implemented in two stages. The first stage involves the implementation of a Mask Region-based Convolutional Neural Network (Mask R-CNN) model with a ResNet-101 CNN backbone for instance segmentation of individual rocks in each image. These individual rock images are extracted and used for the second phase of the modelling pipeline, which utilizes an unsupervised clustering method known as Convolutional Deep Embedded Clustering with Data Augmentation (ConvDEC-DA). The clustering phase of this research provides a method to group feed material rocks into their respective types or facies, using features developed from the auto-encoder portion of the ConvDEC-DA model. While this research focuses on clustering kimberlite rocks according to their respective facies, similar implementations are possible for a wide range of mining and rock types.
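The ConvDEC-DA internals are not spelled out in the abstract; the sketch below shows only the standard deep embedded clustering step that such methods share: soft cluster assignments via a Student's t kernel over auto-encoder embeddings, and the sharpened target distribution used to refine the clusters. The embeddings and cluster count are assumed for illustration.

```python
import numpy as np

def dec_distributions(z, centroids, alpha=1.0):
    """Standard DEC step on auto-encoder embeddings z of shape (n, d):
    soft assignments q via a Student's t kernel, and the sharpened
    target distribution p used to refine the clusters."""
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    q /= q.sum(axis=1, keepdims=True)

    f = q.sum(axis=0)                        # per-cluster soft frequencies
    p = (q ** 2) / f
    p /= p.sum(axis=1, keepdims=True)
    return q, p

# Hypothetical usage: 6 rock embeddings, 2 candidate facies clusters.
rng = np.random.default_rng(1)
z = rng.normal(size=(6, 8))
q, p = dec_distributions(z, z[:2])           # initialise centroids from two samples
```

Training then minimises the KL divergence between q and p while updating the encoder, which pulls each segmented rock image towards its facies cluster.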
6

A Machine Learning Approach to Predicting the Employability of a Graduate

Modibane, Masego 12 February 2020 (has links)
For many credit-offering institutions, such as banks and retailers, credit scores play an important role in the decision-making process for credit applications. It is difficult to source the traditional information required to calculate these scores for applicants who do not have a credit history, such as recently graduated students. Thus, alternative credit scoring models are sought to generate a score for these applicants. The aim of this dissertation is to build a machine learning classification model that can predict a student's likelihood of becoming employed, based on their student data (for example, their GPA and the degree(s) they hold). The resulting model could serve as a feature that these institutions use in their decision to approve a credit application from a recently graduated student.
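A minimal sketch of the kind of classifier described, on invented student features and labels; the dissertation's actual model, features and data are not shown here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical student records: GPA, number of degrees, internship flag.
X = np.column_stack([rng.uniform(2.0, 4.0, 1000),
                     rng.integers(1, 3, 1000),
                     rng.integers(0, 2, 1000)])
# Toy label: employability loosely tied to GPA and internship experience.
y = ((0.6 * X[:, 0] + X[:, 2] + rng.normal(0, 0.5, 1000)) > 2.6).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
print("employment probability:", clf.predict_proba([[3.4, 1, 1]])[0, 1])
```

The predicted probability, rather than the hard class label, is what a lender would plug in as an additional scoring feature.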
7

Collaborative Genre Tagging

Leslie, James 19 November 2020 (has links)
Recommender systems (RSs) are used extensively in online retail and on media streaming platforms to help users filter the plethora of options at their disposal. Their goal is to provide users with suggestions of products or artworks that they might like. Content-based RSs make use of user and/or item metadata to predict user preferences, while collaborative filtering (CF) has proven to be an effective approach in tasks such as predicting the movie or music preferences of users in the absence of any metadata. Latent factor models have been used to achieve state-of-the-art accuracy in many CF settings, playing an especially large role in beating the benchmark set in the Netflix Prize in 2008. These models learn latent features for users and items to predict the preferences of users. The first latent factor models made use of matrix factorisation to learn latent factors, but more recent approaches have made use of neural architectures with embedding layers. This master's dissertation outlines collaborative genre tagging (CGT), a transfer learning application of CF that makes use of latent factors to predict genres of movies, using only explicit user ratings as model inputs.
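To make the latent factor idea concrete, here is a minimal embedding-layer matrix factorisation in PyTorch: a rating is predicted as the dot product of learned user and item vectors plus biases. The dimensions and training step are hypothetical, and the CGT genre-prediction head built on top of such factors is not shown.

```python
import torch
import torch.nn as nn

class MatrixFactorisation(nn.Module):
    """Latent factor model: a rating is the dot product of user and item
    embeddings plus per-user and per-item biases (illustrative sketch)."""
    def __init__(self, n_users, n_items, k=32):
        super().__init__()
        self.user = nn.Embedding(n_users, k)
        self.item = nn.Embedding(n_items, k)
        self.user_bias = nn.Embedding(n_users, 1)
        self.item_bias = nn.Embedding(n_items, 1)

    def forward(self, u, i):
        dot = (self.user(u) * self.item(i)).sum(dim=1)
        return dot + self.user_bias(u).squeeze(1) + self.item_bias(i).squeeze(1)

# Hypothetical training step on explicit ratings.
model = MatrixFactorisation(n_users=1000, n_items=500)
u = torch.tensor([3, 7]); i = torch.tensor([42, 10]); r = torch.tensor([4.0, 2.5])
loss = nn.functional.mse_loss(model(u, i), r)
loss.backward()                              # gradients for an optimiser step
```

In a transfer-learning setting like CGT, the learned item embeddings would be reused as inputs to a separate genre classifier.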
8

Forecasting and modelling the VIX using Neural Networks

Netshivhambe, Nomonde 12 April 2023 (has links) (PDF)
This study investigates the volatility forecasting ability of neural network models. In particular, we focus on the performance of Multi-Layer Perceptron (MLP) and Long Short-Term Memory (LSTM) neural networks in predicting the CBOE Volatility Index (VIX). The inputs into these models include the VIX, GARCH(1,1) fitted values and various financial and macroeconomic explanatory variables, such as S&P 500 returns and the oil price. In addition, this study segments the data into two sub-periods, namely a calm and a crisis period in the financial market. The segmentation of the periods caters for changes in the predictive power of the aforementioned models under different market conditions. When forecasting the VIX, we show that the best-performing model is found in the calm period. In addition, we show that the MLP has more predictive power than the LSTM.
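A sketch of the two architectures being compared, in PyTorch, with a hypothetical feature set and window length (lagged VIX, GARCH(1,1) fitted values, S&P 500 returns, oil price); the study's exact layer sizes and inputs are not reproduced here.

```python
import torch
import torch.nn as nn

N_FEATURES, LOOKBACK = 4, 22   # e.g. lagged VIX, GARCH(1,1) fit, S&P 500 return, oil price

# One-step-ahead MLP on a flattened window of features.
mlp = nn.Sequential(nn.Linear(N_FEATURES * LOOKBACK, 64), nn.ReLU(),
                    nn.Linear(64, 32), nn.ReLU(),
                    nn.Linear(32, 1))

# LSTM reading the same window as a sequence.
class LSTMForecaster(nn.Module):
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                    # x: (batch, LOOKBACK, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])      # forecast from the last hidden state

window = torch.randn(8, LOOKBACK, N_FEATURES)          # hypothetical batch
vix_mlp = mlp(window.flatten(1))
vix_lstm = LSTMForecaster(N_FEATURES)(window)
```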
9

Log mining to develop a diagnostic and prognostic framework for the MeerLICHT telescope

Roelf, Timothy Brian 20 April 2023 (has links) (PDF)
In this work we present the approach taken to address the problems of anomalous fault detection and system delays experienced by the MeerLICHT telescope. We make use of the abundantly available console logs, which record all aspects of the telescope's operation, to obtain information. The MeerLICHT operational team must devote time to manually inspecting the logs during system downtime to discover faults. This task is laborious and time-inefficient given the large size of the logs, and it does not suit the time-sensitive nature of many of the surveys in which the telescope participates. We used the novel approach of hidden Markov modelling to address the problems of fault detection and system delays experienced by MeerLICHT. We were able to train the model in three separate ways, showing some success at fault detection and none at addressing the system delays.
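The abstract does not describe the model structure, so the sketch below shows only the generic mechanism an HMM-based log monitor could use: score a sequence of discretised log events with the standard forward algorithm, and treat unusually low likelihoods as anomalous. The states, event types and probabilities are invented for the example.

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Standard HMM forward algorithm: log-likelihood of an observation
    sequence, with per-step scaling for numerical stability."""
    alpha = pi * B[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return loglik

# Illustrative 2-state model over 3 discretised log-event types.
pi = np.array([0.8, 0.2])
A = np.array([[0.9, 0.1], [0.2, 0.8]])                  # hidden-state transitions
B = np.array([[0.7, 0.25, 0.05], [0.1, 0.3, 0.6]])      # event emission probabilities
normal = forward_loglik([0, 0, 1, 0], pi, A, B)
odd    = forward_loglik([2, 2, 2, 2], pi, A, B)          # scores much lower
print(normal, odd)
```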
10

Natural Language Processing on Data Warehouses

Maree, Stiaan 27 October 2022 (has links) (PDF)
The main problem addressed in this research was how to use natural language to query data in a data warehouse. To this effect, two natural language processing models were developed and compared on a classic star-schema sales data warehouse with sales facts and date, location and item dimensions. Utterances are queries that people make in natural language, for example, "What is the sales value for mountain bikes in Georgia for 1 July 2005?" The first model, the heuristics model, implemented an algorithm that steps through the sequence of utterance words and matches the longest number of consecutive words at the highest grain of the hierarchy. In contrast, the embedding model implemented the word2vec algorithm to create different kinds of vectors from the data warehouse. These vectors are aggregated, and the cosine similarity between vectors is then used to identify concepts in the utterances that can be converted to a programming language. To understand question style, a survey was set up, which then helped shape the random utterances created for the evaluation of both methods. The first key insight, and the main premise for the embedding model to work, is a three-step process of creating three types of vectors. The first step is to train vectors (word vectors) for each individual word in the data warehouse; this is called word embeddings. For instance, the word 'bike' will have a vector. The next step is to average the word vectors for each unique column value (column vectors) in the data warehouse, leaving an entry like 'mountain bike' with one vector which is the average of the vectors for 'mountain' and 'bike'. Lastly, the user's utterance is averaged (utterance vectors) using the word vectors created in step one, and then, using cosine similarity, the utterance vector is matched to the closest column vectors in order to identify data warehouse concepts in the utterance. The second key insight was to train word vectors first for location and then separately for item - in other words, per dimension (one set for location, and one set for item). Removing stop words was the third key insight, and the last key insight was to use Global Vectors to initialise the training of the word vectors. The results of the evaluation of the models indicated that the embedding model was ten times faster than the heuristics model. In terms of accuracy, the embedding model (95.6% accurate) also outperformed the heuristics model (70.1% accurate). The practical application of the research is that these models can be used as a component in a chatbot on data warehouses. Combined with a Structured Query Language (SQL) query generation component, and with Application Programming Interfaces built on top of it, this facilitates the quick and easy distribution of data; no knowledge of a programming language such as SQL is needed to query the data.
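A minimal sketch of the embedding model's matching step as described in the abstract, assuming pretrained word vectors are already available: column vectors are word-vector averages per unique column value, the utterance is averaged the same way, and cosine similarity picks the closest column values. The vocabulary, dimensionality and crude plural handling below are illustrative assumptions, not the dissertation's pipeline.

```python
import numpy as np

# Hypothetical word vectors (in the dissertation these come from word2vec
# trained per dimension and initialised with Global Vectors).
rng = np.random.default_rng(0)
word_vecs = {w: rng.normal(size=50)
             for w in ["mountain", "bike", "road", "georgia", "california"]}

def average(words):
    vecs = [word_vecs[w] for w in words if w in word_vecs]
    return np.mean(vecs, axis=0)

# Column vectors: one averaged vector per unique column value in the warehouse.
column_vecs = {"mountain bike": average(["mountain", "bike"]),
               "road bike": average(["road", "bike"]),
               "Georgia": average(["georgia"])}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Utterance vector matched to the closest column values.
utterance = "what is the sales value for mountain bikes in georgia"
u = average(w.rstrip("s") for w in utterance.split())   # crude handling of 'bikes'
matches = sorted(column_vecs, key=lambda c: cosine(u, column_vecs[c]), reverse=True)
print(matches[:2])   # likely ['mountain bike', 'Georgia'] for this toy setup
```

The identified column values would then be handed to the SQL generation component to build the actual query.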
