Retailers have collected steadily more data over the past decades. The growing volume of collected data has increased demand for data analytics expertise and tools, which the Umeå-founded company Infobaleen provides. A recurring challenge when developing such tools is the data itself. Because relevant open data sets are difficult to find, synthetic data has become increasingly popular. By using artificially generated data, developers gain more control over the input when testing and presenting their work. However, most existing methods either depend on real-world data as input or produce results that look synthetic and are difficult to extend. In this thesis, I introduce a method specifically designed to generate synthetic transactional data stochastically. I first examined real-world data provided by Infobaleen to determine empirically which statistical distributions are suitable for my algorithm. I then modelled individual decision-making using points in an embedding space, where the distances between points serve as the basis for probability weights unique to each individual. This solution creates data distributed similarly to real-world data and enables retroactive data enrichment using the same embeddings. The result is a data set that looks genuine to the human eye but is entirely synthetic. Infobaleen already uses this model to generate data when presenting its product to potential customers and partners.
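The embedding idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how distances are converted into probability weights, so the softmax over negative Euclidean distance used here is an assumption, as are the function names `purchase_weights` and `sample_transactions`, the `temperature` parameter, and the toy 2-D embeddings. It is not the thesis's actual algorithm.

```python
import math
import random


def purchase_weights(customer, products, temperature=1.0):
    """Map customer-to-product distances to probability weights.

    Assumption: a softmax over negative Euclidean distance. The source
    only says that distance serves as the basis for the weights.
    """
    distances = [math.dist(customer, p) for p in products]
    # Shift by the smallest distance so the largest exponent is exp(0).
    d_min = min(distances)
    scores = [math.exp(-(d - d_min) / temperature) for d in distances]
    total = sum(scores)
    return [s / total for s in scores]


def sample_transactions(customers, products, n_per_customer, seed=0):
    """Draw (customer_id, product_id) rows using per-customer weights."""
    rng = random.Random(seed)
    rows = []
    for c_id, c_vec in enumerate(customers):
        weights = purchase_weights(c_vec, products)
        picks = rng.choices(range(len(products)), weights=weights, k=n_per_customer)
        rows.extend((c_id, p_id) for p_id in picks)
    return rows


# Hypothetical embeddings: two products near the origin, one far away.
products = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
customers = [(0.1, 0.0), (4.8, 5.1)]
transactions = sample_transactions(customers, products, n_per_customer=5)
```

Because each customer sits at a different point, each gets a different weight vector over the same products, which is one way to read "individually unique probability weights". Attributes added later could reuse the same coordinates, in the spirit of the retroactive enrichment the abstract mentions.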
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:umu-204096 |
Date | January 2023 |
Creators | Lundgren, Veronica |
Publisher | Umeå universitet, Institutionen för fysik |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |