The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations. Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Pattern analysis of the user behaviour in a mobile application using unsupervised machine learning / Mönsteranalys av användarbeteenden i en mobilapp med hjälp av oövervakad maskininlärning

Hrstic, Dusan Viktor January 2019 (has links)
The continuously increasing amount of logged data increases the possibilities of making new discoveries about how users interact with the application for which the data is logged. Traces in the data may reveal specific user behavioural patterns, which can show how the application is utilized and thereby how its development can be improved. In this thesis, unsupervised machine learning techniques are used to group users depending on their utilization of the SEB Privat Android mobile application. The user interactions in the application are first extracted, then various data preprocessing techniques are applied to prepare the data for clustering, and finally two clustering algorithms, HDBSCAN and K-medoids, are performed to cluster the data. Three types of user behaviour have been found by both K-medoids and HDBSCAN: users who tend to interact more with the application and navigate through its deeper layers, users who only make a quick check of their account balance or transactions, and finally regular users. Among the resulting features chosen with the help of feature selection methods, 73% are related to user behaviour. The findings can be used by developers to improve the user interface and overall functionality of the application. The user flow can thus be optimized according to the patterns in which users tend to navigate through the application.
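The K-medoids step described in the abstract can be sketched with a toy implementation (a hypothetical minimal version for illustration, not the thesis code): sessions are represented as numeric feature vectors, and each cluster is summarized by its most central member.

```python
import random
from math import dist

def k_medoids(points, k, iters=20, seed=0):
    """Cluster points by alternating assignment and medoid update (PAM-style)."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for p in points:
            nearest = min(medoids, key=lambda m: dist(p, m))
            clusters[nearest].append(p)
        # Re-pick each medoid as the member minimizing total in-cluster distance.
        new_medoids = []
        for members in clusters.values():
            best = min(members, key=lambda c: sum(dist(c, p) for p in members))
            new_medoids.append(best)
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids, clusters

# Toy "session feature" vectors: two obvious groups of users.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, clusters = k_medoids(pts, k=2)
sizes = sorted(len(v) for v in clusters.values())
print(sizes)  # two groups of three
```

Unlike k-means, each cluster representative is an actual data point, which makes the resulting "typical user session" directly interpretable.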
2

Evaluation of the correlation between test cases dependency and their semantic text similarity

Andersson, Filip January 2020 (has links)
An important step in developing software is to test the system thoroughly. Testing software requires generating test cases, which can reach large numbers, and it is important that they are executed in the correct order. The information needed to schedule the test cases correctly is critical but not always available, so getting the order right requires a lot of manual work and valuable resources. By instead analyzing their test specifications, it could be possible to detect the functional dependencies between test cases. This study presents a natural language processing (NLP) based approach and performs cluster analysis on a set of test cases to evaluate the correlation between test case dependencies and their semantic similarities. After an initial feature selection, the similarities between test cases are calculated with the cosine distance function. The result of the similarity calculation is then clustered using the HDBSCAN clustering algorithm. The clusters represent relations between test cases: test cases with close similarities are put in the same cluster, as they are expected to share dependencies. The clusters are then validated against a ground truth containing the correct dependencies. The result is an F-score of 0.7741. The approach in this study is applied to an industrial testing project at Bombardier Transportation in Sweden.
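The similarity step can be illustrated with a dependency-free sketch (hypothetical, not the study's implementation): a bag-of-words tf-idf representation of each test specification, compared with the cosine distance.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Bag-of-words tf-idf with smoothed idf."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append([tf[w] / len(toks) * idf[w] for w in vocab])
    return vecs

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

# Invented toy test specifications, not Bombardier's.
specs = [
    "verify login with valid password",
    "verify login with invalid password",
    "measure engine temperature under load",
]
v = tfidf_vectors(specs)
d01 = cosine_distance(v[0], v[1])  # similar specs: small distance
d02 = cosine_distance(v[0], v[2])  # unrelated specs: maximal distance
print(d01 < d02)
```

Pairs with small cosine distance would then land in the same HDBSCAN cluster and be flagged as likely dependent.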
3

Automated error matching system using machine learning and data clustering : Evaluating unsupervised learning methods for categorizing error types, capturing bugs, and detecting outliers.

Bjurenfalk, Jonatan, Johnson, August January 2021 (has links)
For large and complex software systems, it is a time-consuming process to manually inspect the error logs produced by their test suites. Whether for identifying abnormal faults or finding bugs, it is a process that limits development progress and requires experience. An automated solution for such processes could potentially lead to efficient fault identification and bug reporting, while also enabling developers to spend more time on improving system functionality. Three unsupervised clustering algorithms are evaluated for the task: HDBSCAN, DBSCAN, and X-Means. In addition, HDBSCAN, DBSCAN, and an LSTM-based autoencoder are evaluated for outlier detection. The dataset consists of error logs produced by a robotic test system. These logs are cleaned and pre-processed using stopword removal, stemming, term frequency-inverse document frequency (tf-idf), and singular value decomposition (SVD). Two domain experts are tasked with evaluating the results produced by the clustering and outlier detection. Results indicate that X-Means outperforms the other clustering algorithms when tasked with automatically categorizing error types and capturing bugs. Furthermore, none of the outlier detection methods yielded sufficient results. However, X-Means clusters of size one were found to accurately represent outliers occurring in the error log dataset. In conclusion, the domain experts deemed X-Means to be a helpful tool for categorizing error types, capturing bugs, and detecting outliers.
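The tf-idf-plus-SVD part of the preprocessing chain can be sketched with NumPy; the toy term-document count matrix below is invented for illustration and is not the thesis data.

```python
import numpy as np

# Toy term-document count matrix (rows = error-log documents, cols = terms).
counts = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
], dtype=float)

# tf-idf weighting: term frequency scaled by log inverse document frequency.
tf = counts / counts.sum(axis=1, keepdims=True)
df = (counts > 0).sum(axis=0)
idf = np.log(counts.shape[0] / df)
tfidf = tf * idf

# Rank-2 truncation via SVD keeps only the dominant latent directions.
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
k = 2
reduced = U[:, :k] * s[:k]  # documents embedded in a 2-D latent space
print(reduced.shape)  # (3, 2)
```

In the reduced space, the two logs that share vocabulary end up close together, which is exactly what a downstream clustering algorithm needs.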
4

Identifying Machine States and Sensor Properties for a Digital Machine Template : Automatically recognize states in a machine using multivariate time series cluster analysis

Viking, Jakob January 2021 (has links)
Digital twins have become a large part of new cyber-physical systems, as they allow a physical object to be simulated in the digital world. Alongside these developments, machines have become more intelligent, allowing them to produce more data than ever before. Within the area of digital twins, there is a need for a less complex approach than a fully optimised digital twin, one that is more like a digital shadow of the physical object. Therefore, the focus of this thesis is to study machine states and statistical distributions for all sensors in a machine. Whereas the majority of studies in the literature focus on generating data from a digital twin, this study focuses on what characteristics a digital twin has. The solution defines the term digital machine template: a description containing the states and the statistical properties of each sensor in a given machine. The primary approach is to create a proof-of-concept application that uses traditional data mining technologies and clustering to analyze how many states there are in a machine and how the sensor data is structured. The result is a digital machine template containing all the states a machine might have and the possible statistical distributions of each sensor in each state. The template opens the possibility of using it as a basis for creating a digital twin, allowing the development time to be shorter than for a regular digital twin. More research still needs to be done, as the less complex approach may lead to information being missed or interpreted incorrectly. Nevertheless, it shows promise as a less complex way of looking at digital twins, which may become necessary as digital twins grow ever more complex.
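As an illustration of the state-identification idea, a minimal k-means sketch (hypothetical; the abstract does not name this exact algorithm) groups synthetic two-sensor readings into candidate machine states, from which per-state statistics could then be derived.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means: each centroid becomes a candidate machine state."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Label each sensor reading by its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# Two synthetic sensors (e.g. vibration, temperature) in two regimes:
# an "idle" regime and a "running" regime. Invented data.
idle = np.random.default_rng(1).normal([0.1, 20.0], 0.05, size=(50, 2))
running = np.random.default_rng(2).normal([5.0, 80.0], 0.05, size=(50, 2))
X = np.vstack([idle, running])
labels, centroids = kmeans(X, k=2)
print(sorted(np.bincount(labels).tolist()))  # [50, 50]
```

Each recovered state's member readings can then be fitted with a statistical distribution per sensor, giving the entries of the digital machine template.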
5

Text mining Twitter social media for Covid-19 : Comparing latent semantic analysis and latent Dirichlet allocation

Sheikha, Hassan January 2020 (has links)
In this thesis, Twitter social media data is mined for information about the Covid-19 outbreak during the month of March, from the 3rd to the 31st. 100,000 tweets were collected from Harvard's open-source data and recreated using Hydrate. This data is analyzed further using different natural language processing (NLP) methodologies, such as term frequency-inverse document frequency (TF-IDF), lemmatization, tokenization, Latent Semantic Analysis (LSA), and Latent Dirichlet Allocation (LDA). The results of the LSA and LDA algorithms are dimensionality-reduced data that is clustered using the HDBSCAN and K-means clustering algorithms for later comparison. Different methodologies are used to determine the optimal parameters for the algorithms. This is all done in the Python programming language, as there are libraries supporting this research, the most important being scikit-learn. The frequent words of each cluster are then displayed and compared with factual data regarding the outbreak to discover whether there are any correlations. The factual data is collected by the World Health Organization (WHO) and visualized in graphs at ourworldindata.org. Correlations with the results are also sought in news articles to find significant moments that may have affected the top words in the clustered data. The news articles with good timelines used for correlating incidents are those of NBC News and the New York Times. The results show no direct correlations with the data reported by WHO; however, looking at the timelines reported by news sources, some correlation can be seen with the clustered data. The combination of LDA and HDBSCAN yielded the most desirable results in comparison to the other combinations of dimension reduction and clustering. This was largely due to the use of GridSearchCV on LDA to determine the ideal parameters for the LDA models on each dataset, as well as how well HDBSCAN clusters its data in comparison to K-means.
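The GridSearchCV-over-LDA step can be sketched with scikit-learn; the tweets and the parameter grid below are invented for illustration and are not the thesis dataset or settings.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Invented toy tweets standing in for the Covid-19 corpus.
tweets = [
    "covid cases rising in new york",
    "new covid cases reported today",
    "lockdown measures announced in europe",
    "europe extends lockdown measures",
    "vaccine trials show early promise",
    "early vaccine trial results promising",
]
X = CountVectorizer(stop_words="english").fit_transform(tweets)

# Grid search over topic counts; LDA scores folds by approximate log-likelihood.
grid = GridSearchCV(
    LatentDirichletAllocation(random_state=0, max_iter=20),
    param_grid={"n_components": [2, 3, 4]},
    cv=2,
)
grid.fit(X)
best = grid.best_params_["n_components"]
print(best)
```

The fitted `grid.best_estimator_` can then produce the per-document topic mixtures that are handed to HDBSCAN or K-means for clustering.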
6

Clustering and Summarization of Chat Dialogues : To understand a company’s customer base / Klustring och Summering av Chatt-Dialoger

Hidén, Oskar, Björelind, David January 2021 (has links)
The Customer Success department at Visma handles about 200,000 customer chats each year; the chat dialogues are stored and contain both questions and answers. To get an idea of what customers ask about, the Customer Success department has to read a random sample of the chat dialogues manually. This thesis develops and investigates an analysis tool for the chat data, based on clustering and summarization. The approach aims to decrease the time spent on the analysis and to increase its quality. Models for clustering (K-means, DBSCAN, and HDBSCAN) and extractive summarization (K-means, LSA, and TextRank) are compared. Each algorithm is combined with three different text representations (TF-IDF, S-BERT, and FastText) to create models for evaluation. These models are evaluated against a test set created for the purpose of this thesis. Silhouette Index and Adjusted Rand Index are used to evaluate the clustering models. The ROUGE measure together with a qualitative evaluation is used to evaluate the extractive summarization models. In addition, the best clustering model is further evaluated to understand how different data sizes impact performance. TF-IDF unigrams together with HDBSCAN or K-means obtained the best results for clustering, whereas FastText together with TextRank obtained the best results for extractive summarization. This thesis applies known models to a textual domain of customer chat dialogues, something that, to our knowledge, has not previously been done in the literature.
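The TextRank style of extractive summarization can be sketched without dependencies. The details below (word-overlap similarity with log-length normalization, damping factor 0.85) follow the common TextRank formulation, and the chat lines are invented, not Visma's data.

```python
import math

def textrank_summary(sentences, top_n=1, d=0.85, iters=30):
    """Rank sentences with PageRank over a word-overlap similarity graph."""
    toks = [set(s.lower().split()) for s in sentences]
    n = len(sentences)
    # Edge weight: word overlap normalized by sentence lengths (TextRank-style).
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                overlap = len(toks[i] & toks[j])
                norm = math.log(len(toks[i]) + 1) + math.log(len(toks[j]) + 1)
                sim[i][j] = overlap / norm
    out = [sum(row) for row in sim]
    scores = [1.0] * n
    for _ in range(iters):
        scores = [
            (1 - d) + d * sum(
                sim[j][i] / out[j] * scores[j]
                for j in range(n) if out[j] > 0
            )
            for i in range(n)
        ]
    ranked = sorted(range(n), key=lambda i: -scores[i])
    return [sentences[i] for i in ranked[:top_n]]

chat = [
    "how do i reset my invoice password",
    "you can reset the password from settings",
    "the weather is nice today",
]
print(textrank_summary(chat, top_n=1))
```

The middle sentence wins because it overlaps with the question while the small-talk line is weakly connected, which is the behaviour a dialogue summarizer wants.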
7

Predictive maintenance using NLP and clustering support messages

Yilmaz, Ugur January 2022 (has links)
Communication with customers is a major part of the customer experience as well as a great source of data mining. More businesses are engaging with consumers via text messages. Before 2020, 39% of businesses already used some form of text messaging to communicate with their consumers, and many more were expected to adopt the technology after 2020[1]. Email response rates are merely 8%, compared to a response rate of 45% for text messaging[2]. A significant portion of this communication involves customer enquiries or support messages sent in both directions. According to estimates, more than 80% of today's data is stored in an unstructured format (such as text, image, audio, or video)[3], with a significant portion of it being stated in ambiguous natural language. When analyzing such data, qualitative data analysis techniques are usually employed. To facilitate the automated examination of huge corpora of textual material, researchers have turned to natural language processing techniques[4]. In light of the statistics above, Billogram[5] has decided that support messages between creditors and recipients can be mined for predictive maintenance purposes, such as early identification of an outlier like a bug, defect, or wrongly built feature. As a one-sentence goal definition, Billogram is looking for an answer to "why are people reaching out to begin with?" This thesis project discusses implementing unsupervised clustering of support messages using natural language processing methods, together with performance metrics of the results, to answer Billogram's question. The research also covers intent recognition of the clustered messages in two different ways, one automatic and one semi-manual, and the results are discussed and compared. The LDA and manual intent assignment approach of the first experiment produced 100 topics with a 0.293 coherence score. The second approach produced 158 clusters with UMAP and HDBSCAN, with automatic intent recognition. Creating clusters helps identify issues which can be subjects of increased focus, automation, or even down-prioritizing; therefore, this research lands in the predictive maintenance[9] area. This study, which will improve over time with more iterations in the company, also contains the preliminary work for "labeling" or "describing" clusters and their intents.
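UMAP and HDBSCAN require their own libraries (umap-learn, hdbscan). As a dependency-free stand-in for the density-based clustering idea, here is a minimal DBSCAN sketch on toy 2-D "message embeddings" (invented data, not Billogram's); it shows the key property for support messages: clusters emerge from density and stray messages are kept aside as noise.

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: grow clusters outward from dense core points."""
    n = len(points)
    labels = [None] * n  # None = unvisited, -1 = noise

    def neighbors(i):
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neigh = neighbors(i)
        if len(neigh) < min_pts:
            labels[i] = -1            # not dense enough: mark as noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neigh if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:       # noise point on a cluster border
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_pts:    # core point: keep expanding
                queue.extend(k for k in jn if labels[k] is None)
    return labels

# Toy 2-D "message embeddings": two dense topics and one stray message.
msgs = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (20, 20)]
labels = dbscan(msgs, eps=1.5, min_pts=2)
print(labels)  # [0, 0, 0, 1, 1, -1]
```

HDBSCAN extends this idea by varying the density threshold hierarchically, which removes the need to pick a single `eps`.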
8

Clustering on groups for human tracking with 3D LiDAR

Utterström, Simon January 2023 (has links)
3D LiDAR people detection and tracking applications rely on extracting individual people from the point cloud for reliable tracking. A recurring problem for these applications is under-segmentation caused by people standing close to or interacting with each other, which in turn causes the system to lose tracking. To address this challenge, we propose Kernel Density Estimation Clustering with Grid (KDEG), based on Kernel Density Estimation Clustering. KDEG leverages a grid to store density estimates computed in parallel, finding cluster centers by selecting local density maxima in the grid. KDEG reaches a remarkable accuracy of 98.4%, compared to HDBSCAN and Scan Line Run (SLR) with 80.1% and 62.0% accuracy respectively. Furthermore, KDEG is measured to be highly efficient, with a running time similar to the state-of-the-art methods SLR and Curved Voxel Clustering. To show the potential of KDEG, an experiment with a real tracking application on two people walking shoulder to shoulder was performed. This experiment saw a significant increase in the number of accurately tracked frames, from 5% to 78%, by utilizing KDEG, displaying great potential for real-world applications. In parallel, we also explored HDBSCAN as an alternative to DBSCAN. We propose a number of modifications to HDBSCAN, including the projection of points to the ground plane, for improved clustering of human groups. HDBSCAN with the proposed modifications demonstrates a commendable accuracy of 80.1%, surpassing DBSCAN while maintaining a similar running time. Running time is, however, found to be lacking for both HDBSCAN and DBSCAN compared to more efficient methods like KDEG and SLR. / The work was carried out on site at Chuo University in Tokyo, without involvement from Umeå University such as an exchange programme or similar. The work was partly funded by the Scandinavia-Japan Sasakawa Foundation. The work did not run during a regular semester; it started 2023-05-01 and ended in August 2023.
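The grid-based density idea behind KDEG can be sketched as follows (an illustrative reconstruction from the abstract, not the thesis implementation): estimate a Gaussian kernel density on a regular grid over the ground plane, then take local maxima of the grid as cluster centers, so that two people standing close still produce two distinct peaks.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two people walking shoulder to shoulder: overlapping point blobs
# projected to the ground plane (synthetic data).
a = rng.normal([0.0, 0.0], 0.15, size=(200, 2))
b = rng.normal([0.6, 0.0], 0.15, size=(200, 2))
pts = np.vstack([a, b])

# Gaussian kernel density estimate evaluated on a regular grid.
xs = np.linspace(-0.8, 1.4, 40)
ys = np.linspace(-0.8, 0.8, 40)
gx, gy = np.meshgrid(xs, ys, indexing="ij")
grid = np.stack([gx.ravel(), gy.ravel()], axis=1)
h = 0.1  # kernel bandwidth
d2 = ((grid[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
density = np.exp(-d2 / (2 * h * h)).sum(axis=1).reshape(40, 40)

# Cluster centers = strict local maxima of the density grid, above a
# height threshold that discards low-density ripple.
core = density[1:-1, 1:-1]
neigh = np.stack([
    density[:-2, 1:-1], density[2:, 1:-1],
    density[1:-1, :-2], density[1:-1, 2:],
])
peaks = (core > neigh.max(axis=0)) & (core > 0.6 * density.max())
print(int(peaks.sum()))
```

A distance-based method with one global threshold would merge these two blobs; the density surface still dips between them, which is what lets a peak-finding approach keep the two people separate.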
