11 |
Datautvinning av klickdata : Kombination av klustring och klassifikation / Data mining of click data : Combination of clustering and classification
Zhang, Xianjie; Bogic, Sebastian. January 2018
Owners of websites and applications usually profit when users click on their links. These links can be advertisements or items for sale, among other things. Many data analysis studies predict whether a link will be clicked, but only a few focus on what needs to be adjusted to get the link clicked. The problem facing the company Flygresor.se is that it lacks a tool that lets its customers, travel agencies, analyze their tickets and then adjust the attributes of those trips. The requested solution was an application that suggests how to change the tickets so that they receive more clicks and thereby generate more sales. A prototype was built that uses two data mining methods: clustering with the DBSCAN algorithm and classification with the k-nearest neighbor (k-NN) algorithm. These algorithms were used together with an evaluation process, called DNNA, which analyzed the results from the two algorithms and suggested changes to the attributes of the items. The combination of the algorithms and DNNA was tested and evaluated as a solution to the problem. The program was able to predict which ticket attributes needed to be adjusted for the tickets to receive more clicks. The recommended adjustments were reasonable, but the results could not be compared with similar tools since none had been published.
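The classification step mentioned above can be illustrated with a minimal k-NN sketch. It is not the thesis's prototype and does not reproduce DNNA; the feature names (price, number of stops), the scaling, and the data are hypothetical, chosen only to show how a majority vote among nearest neighbors labels a ticket.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical, pre-scaled ticket features: (price, number of stops).
tickets = [
    ((0.2, 0.0), "clicked"),
    ((0.3, 0.0), "clicked"),
    ((0.25, 0.5), "clicked"),
    ((0.8, 1.0), "not clicked"),
    ((0.9, 0.5), "not clicked"),
    ((0.7, 1.0), "not clicked"),
]

print(knn_predict(tickets, (0.75, 1.0)))  # expensive ticket with a stop
print(knn_predict(tickets, (0.3, 0.0)))   # same ticket after a hypothetical adjustment
```

In this toy data the first query is labeled "not clicked" and the second "clicked", which hints at how comparing a ticket with its neighbors could motivate an attribute change. How the thesis actually derives its suggestions is not described in the abstract.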
|
12 |
Deinterleaving of radar pulses with batch processing to utilize parallelism / Gruppering av radar pulser med batch-bearbetning för att utnyttja parallelism
Lind, Emma; Stahre, Mattias. January 2020
The threat level in an environment (in this thesis, specifically for aircraft) can be determined by analyzing radar signals. This task is critical and has to be solved quickly and with high accuracy. The received electromagnetic pulses have to be identified in order to classify a radar emitter. Usually, several emitters transmit radar pulses at the same time in an environment. These pulses need to be sorted into groups, where each group contains pulses from the same emitter. This thesis aims to find a fast and accurate way to sort the pulses in parallel. The selected approach analyzes batches of pulses in parallel to exploit the advantages of a multi-threaded Central Processing Unit (CPU) or a Graphics Processing Unit (GPU). First, a suitable clustering algorithm had to be selected. Second, an optimal batch size had to be determined to achieve high clustering performance and to process the batches of pulses rapidly in parallel. A quantitative, experiment-based method was used to measure clustering performance, execution time, system response, and parallelism as a function of batch size when using the selected clustering algorithm. The algorithm selected for clustering the data was Density-based Spatial Clustering of Applications with Noise (DBSCAN), because it does not require the number of clusters to be specified in advance, can find clusters of arbitrary shape in a data set, and has low time complexity. The evaluation showed that parallel batch processing can be implemented while still achieving high clustering performance, compared to a sequential implementation that used the maximum likelihood method. An optimal batch size in terms of data points and cutoff time is hard to determine, since the batch size depends strongly on the input data. Therefore, one batch size might not be optimal in terms of clustering performance and system response for all streams of data. A solution could be to determine optimal batch sizes in advance for different streams of data and then adapt the batch size to the stream at hand. However, with a high level of parallelism, an additional delay is introduced that depends on the difference between the time it takes to collect data points into a batch and the time it takes to process the batch. The system will therefore be slower to output its result for a given batch than a sequential system. For a time-critical system, a high level of parallelism might be unsuitable since it leads to slower response times.
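The batch approach described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the pulse features (frequency, pulse width), the parameter values, and the function names are assumptions, and `ThreadPoolExecutor` only shows the structure of per-batch parallelism (in CPython, threads do not speed up CPU-bound work such as this).

```python
from concurrent.futures import ThreadPoolExecutor
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks unclustered points."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force neighborhood query; includes the point itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


def cluster_batches(pulses, batch_size, eps, min_pts, workers=4):
    """Split the pulse stream into batches and cluster each batch independently."""
    batches = [pulses[i:i + batch_size] for i in range(0, len(pulses), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda batch: dbscan(batch, eps, min_pts), batches))


# Hypothetical pulses as (frequency, pulse width) from two interleaved emitters,
# plus one stray pulse at (6.0, 9.0).
pulses = [(9.0, 1.0), (3.0, 5.0), (9.1, 1.1), (3.1, 5.1),
          (9.0, 1.1), (3.0, 5.1), (6.0, 9.0), (9.1, 1.0)]

print(cluster_batches(pulses, batch_size=4, eps=0.3, min_pts=2))
print(cluster_batches(pulses, batch_size=8, eps=0.3, min_pts=2))
```

With a batch size of 4, the second batch contains only one pulse from the second emitter, so that pulse is labeled as noise. With a batch size of 8, both emitters are recovered. This toy case mirrors the abstract's point that a suitable batch size depends on the input data.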
|
13 |
Unsupervised Machine Learning Based Anomaly Detection in Stockholm Road Traffic / Oövervakad Maskininlärning baserad Anomali Detektion i Stockholms Trafikdata
Hellström, Vilma. January 2023
This thesis is a study of anomaly detection in vehicle traffic data in central Stockholm. Anomaly detection is an important tool in the analysis of traffic data for improved urban planning. Two unsupervised machine learning models are used: the DBSCAN clustering model and the LSTM deep learning neural network. A modified version of each model is also employed, incorporating adaptations that exploit diurnal traffic variations to improve the quality of the results. The performance of the models is then analysed and compared. To evaluate the models, two types of synthetic anomalies were used: a straightforward one and a more complex variant. The results indicate that all models show some ability to detect both anomalies. The models perform better on the simpler anomaly, where LSTM and DBSCAN give comparable results. In contrast, LSTM outperforms DBSCAN on the more complex anomaly. Notably, the modified versions of both models consistently perform better than the more conventional versions. This suggests that LSTM outperforms DBSCAN as anomalies become more complex, presumably owing to LSTM's proficiency in identifying intricate patterns. However, this relationship warrants further investigation in future research.
|
14 |
6G RF Waveform with AI for Human Presence Detection in Indoor Environments
Stratigi, Eirini. January 2022
Wireless communication equipment is widely available, and the number of transmitters and receivers keeps increasing. In addition to communications, wireless nodes can be used for sensing. This project focuses on human presence detection in indoor environments using measurements such as CSI that can be extracted from radio receivers and labeled using a camera and AI computer vision techniques (the YOLO framework). Our goal is to determine whether a room is empty or contains one or two people by using machine learning algorithms. We have selected SVM (Support Vector Machines) and CNN (Convolutional Neural Networks). These methods are evaluated in different scenarios, such as different locations, bandwidths of 20, 40 and 120 MHz, carrier frequencies of 2.4 and 5 GHz, high and low SNR values, and different antenna configurations (MIMO, SIMO, SISO). Both methods perform very well for classification, and CNN in particular performs better than SVM at low SNR. We found that some of the measurements seemed to be outliers, and the clustering algorithm DBSCAN was used to identify them. Last but not least, we explore whether radio can complement computer vision in presence detection, since radio waves may propagate through walls and opaque obstacles.
|
15 |
Bi-filtration and stability of TDA mapper for point cloud data
Bungula, Wako Tasisa. 01 August 2019
TDA mapper is an algorithm used to visualize and analyze big data. TDA mapper is applied to a dataset, X, equipped with a filter function f from X to R. The output of the algorithm is an abstract graph (or simplicial complex). The abstract graph captures topological and geometric information of the underlying space of X.
One question of interest for TDA mapper is whether a mapper graph is stable. That is, if a dataset X is perturbed by a small value and the perturbed dataset is denoted X∂, we would like to compare the TDA mapper graph of X with the TDA mapper graph of X∂. Given a topological space X, Tamal Dey, Facundo Memoli, and Yusu Wang proved that TDA mapper is stable if the cover of the image of f satisfies certain conditions. That is, the mapper graph of X differs from the mapper graph of X∂ by a small value measured via homology.
The goal of this thesis is three-fold. The first is to introduce a modified TDA mapper algorithm. The fundamental difference between TDA mapper and the modified version is that the modified version avoids the use of a filter function. In a comparison of the mapper graph outputs, the proposed modified mapper is shown to capture more geometric and topological features. We discuss the advantages and disadvantages of the modified mapper.
Tamal Dey, Facundo Memoli, and Yusu Wang showed that a filtration of covers induces a filtration of simplicial complexes, which in turn induces a filtration of homology groups. While Tamal Dey, Facundo Memoli, and Yusu Wang focused on TDA mapper's application to topological spaces, the second goal of this thesis is to show that DBSCAN clustering gives a filtration of covers when TDA mapper is applied to a point cloud. Hence, DBSCAN gives a filtration of mapper graphs (simplicial complexes) and homology groups. More importantly, DBSCAN gives a filtration of covers, mapper graphs, and homology groups in three parameter directions: bin size, epsilon, and Minpts. Hence, there is a multi-dimensional filtration of covers, mapper graphs, and homology groups. We also note that single-linkage clustering is a special case of DBSCAN clustering, so the results proved for DBSCAN also hold when single-linkage is used. However, complete-linkage does not give a filtration of covers in the direction of bin size, hence no filtration of simplicial complexes and homology groups exists when complete-linkage is applied to cluster a dataset. In general, the results hold for any clustering algorithm that gives a filtration of covers.
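The nesting behaviour in the epsilon direction can be illustrated with a small sketch. It covers only the clustering step, not the full mapper construction with bins, and it assumes the convention that a point counts itself in its epsilon-neighbourhood, so that Minpts = 1 makes every point a core point. Under that assumption the DBSCAN clusters are the connected components of the epsilon-neighbourhood graph, which coincide with single-linkage clusters cut at height epsilon. The point cloud and function names are hypothetical.

```python
from math import dist


def eps_components(points, eps):
    """Connected components of the graph joining points at distance <= eps."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), set()).add(i)
    return {frozenset(group) for group in groups.values()}


def refines(fine, coarse):
    """True if every cluster in `fine` lies inside some cluster of `coarse`."""
    return all(any(f <= c for c in coarse) for f in fine)


# Hypothetical point cloud on a line.
cloud = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (6.0, 0.0)]
levels = [eps_components(cloud, eps) for eps in (0.5, 1.0, 1.5, 4.0)]

print([len(level) for level in levels])                          # → [4, 3, 2, 1]
print(all(refines(a, b) for a, b in zip(levels, levels[1:])))    # → True
```

Each clustering refines the next one as epsilon grows, which is the kind of nested family that a filtration requires. The bin-size and Minpts directions discussed in the abstract are not covered by this sketch.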
The third (and last) goal of this thesis is to prove that two multi-dimensional persistence modules (one: with respect to the original dataset, X; two: with respect to the ∂-perturbation of X) are 2∂-interleaved. In other words, the mapper graphs of X and X∂ differ by a small value as measured by homology.
|
16 |
Detecting Metagame Shifts in League of Legends Using Unsupervised Learning
Peabody, Dustin P. 18 May 2018
Over the many years since their inception, the complexity of video games has risen considerably. With this increase in complexity comes an increase in the number of possible choices for players and increased difficulty for developers who try to balance the effectiveness of these choices. In this thesis we demonstrate that unsupervised learning can give game developers extra insight into their own games, providing them with a tool that can potentially alert them to problems faster than they would otherwise be able to find them. Specifically, we use DBSCAN to examine League of Legends and the metagame that players have formed through their choices, and we attempt to detect when the metagame shifts. This may give the developer insight into what changes to make to achieve a more balanced, fun game.
|
17 |
Implementation and Evaluation of Image Retrieval Method Utilizing Geographic Location Metadata
Lundstedt, Magnus. January 2009
Multimedia retrieval systems are very important today, with millions of content creators all over the world generating huge multimedia archives. Recent developments allow for content-based image and video retrieval. These methods are often quite slow, especially when applied to a library of millions of media items. In this research a novel image retrieval method is proposed, which utilizes spatial metadata on images. By finding clusters of images based on their geographic location (the spatial metadata) and combining this information with existing content-based image retrieval algorithms, the proposed method enables efficient presentation of high-quality image retrieval results to system users. The clustering methods considered include Vector Quantization, Vector Quantization LBG, and DBSCAN. Clustering was performed on three different similarity measures: spatial metadata, histogram similarity, and texture similarity. For histogram similarity there are many different distance metrics to use when comparing histograms; Euclidean, Quadratic Form, and Earth Mover's Distance were studied, as were three different color spaces: RGB, HSV, and CIE Lab.
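Of the histogram metrics listed above, the Euclidean distance is the simplest to sketch. The example below is an illustration under stated assumptions, not the thesis's implementation: it uses RGB pixels given as 0 to 255 integer triples, a hypothetical choice of 4 bins per channel, and non-empty images.

```python
from math import dist


def color_histogram(pixels, bins=4):
    """Normalized joint RGB histogram with `bins` levels per channel."""
    hist = [0.0] * bins ** 3
    for r, g, b in pixels:
        # Map each 0..255 channel value to a bin index 0..bins-1.
        ri, gi, bi = r * bins // 256, g * bins // 256, b * bins // 256
        hist[ri * bins * bins + gi * bins + bi] += 1.0
    total = len(pixels)  # assumes at least one pixel
    return [count / total for count in hist]


def histogram_distance(pixels_a, pixels_b, bins=4):
    """Euclidean distance between the normalized histograms of two images."""
    return dist(color_histogram(pixels_a, bins), color_histogram(pixels_b, bins))


# Hypothetical 4-pixel "images".
red = [(250, 10, 10)] * 4
also_red = [(200, 30, 20)] * 4
blue = [(10, 10, 250)] * 4

print(histogram_distance(red, also_red))
print(histogram_distance(red, blue))
```

The two reddish images fall into the same bin and have distance 0, while red and blue share no bins and are at distance √2, the maximum for normalized histograms with disjoint support. A pairwise distance of this kind could serve as input to a clustering method such as DBSCAN.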
|
18 |
Study of Protein Interfaces with Clustering
Bergqvist, Jonathan. January 2018
Protein-protein interactions occur in nature and have different functions. The interacting surface between two interacting proteins contains the respective proteins' interface residues. In this thesis, a series of Python scripts is presented that can perform interface-interface comparisons with the method InterComp to obtain a distance matrix of different protein interfaces. The distance matrix can be studied with clustering algorithms such as DBSCAN. The result of clustering with DBSCAN shows that, for the 77,017 protein interfaces studied and the tested values of the parameters Eps and MinPts, a majority of the protein interfaces are part of a single cluster, while most of the remaining interfaces are noise. The conclusion of this thesis concerns how the tested values of Eps and MinPts affect the number of clusters obtained with DBSCAN.
|
19 |
Anomaly Detection in District Heating using a Clustering based approach
Nguyen, Minh-Tung; Baduni, Metjan. January 2021
The global demand for energy has increased in recent years. In Northern Europe and North America, centralized production and distribution of heat energy is commonly referred to as District Heating (DH). Efficient delivery of heat in the DH system is crucial not only for building occupants but also for the companies that supply the energy. DH efficiency is challenged by faults that negatively impact system performance. Data collected from substations can be analyzed to identify potential faults and reduce the associated economic costs. The aim of this study is to use unsupervised machine learning to identify potential clusters of buildings in a time series dataset collected from buildings in a medium-sized Swedish town. We propose to find anomalies in two ways: first, by identifying possible clusters of buildings and finding buildings that do not belong to any cluster, which can constitute potential anomalies; second, by studying how transitions in cluster membership can help identify abnormal behavior over different time windows. A data mining experiment was conducted by analyzing the energy profiles of 90 buildings over a period of 8 weeks in 2017 using the DBSCAN algorithm. The results suggest that the winter period is more appropriate than the summer period for forming possible clusters, because less noise is encountered in winter. Clustering each week separately can reveal more about the anomalies. Finally, the periodic transitions between clusters and the ranking of clusters based on scaled distance can improve anomaly detection by flagging buildings for further inspection.
|
20 |
Detection of Deviations in Beehives Based on Sound Analysis and Machine Learning
Hodzic, Amer; Hoang, Danny. January 2021
Honeybees are an essential part of our ecosystem as they take care of most of the pollination in the world. They also produce honey, which is the main reason beekeeping was introduced in the first place. As the production of honey is affected by the living conditions of the honeybees, the beekeepers aim to maintain the health of the honeybee societies. TietoEVRY, together with HSB Living Lab, introduced connected beehives in a project named BeeLab. The goal of BeeLab is to provide a service to monitor and gain knowledge about honeybees using the data collected with different sensors. Today they measure weight, temperature, air pressure, and humidity. It is known that honeybees produce different sounds when different events are occurring in the beehive. Therefore BeeLab wants to introduce sound monitoring to their service. This project aims to investigate the possibility of detecting deviations in beehives based on sound analysis and machine learning. This includes recording sound from beehives followed by preprocessing of sound data, feature extraction, and applying a machine learning algorithm on the sound data. An experiment is done using Mel-Frequency Cepstral Coefficients (MFCC) to extract sound features and applying the DBSCAN machine learning algorithm to investigate the possibilities of detecting deviations in the sound data. The experiment showed promising results as deviating sounds used in the experiment were grouped into different clusters.
|