1 |
Improving the Analysis of MINI-LINK Test Data Using Unsupervised Machine Learning
Nerella, Dinesh Kumar (January 2023)
No description available.
|
2 |
Towards a fully automated extraction and interpretation of tabular data using machine learning
Hedbrant, Per (January 2019)
Motivation: A challenge for researchers at CBCS is efficiently managing data formats that change frequently. This handling includes importing data into a common format, regardless of the output of the various instruments used. There are commercial solutions available for this process, but to our knowledge all of them require prior generation of templates to which the data must conform. A significant amount of time is spent on manual pre-processing, converting from one format to another. There are currently no solutions that use pattern recognition to locate and automatically recognise data structures in a spreadsheet.
Problem Definition: The desired solution is a self-learning Software-as-a-Service (SaaS) for automated recognition and loading of data stored in arbitrary formats. The aim of this study is threefold: A) Investigate whether unsupervised machine learning methods can be used to label different types of cells in spreadsheets. B) Investigate whether a hypothesis-generating algorithm can be used to label different types of cells in spreadsheets. C) Advise on choices of architecture and technologies for the SaaS solution.
Method: A pre-processing framework is built that can read and pre-process any type of spreadsheet into a feature matrix. Different datasets are read and clustered. The usefulness of reducing the dimensionality is also investigated. A hypothesis-driven algorithm is built and adapted to the two data formats CBCS uses most frequently. Choices of architecture and technologies for the SaaS solution are discussed, including system design patterns, web development framework and database.
Result: The reading and pre-processing framework is in itself a valuable result, due to its general applicability. No satisfactory results are found with the mini-batch K-means clustering method. When reading data from only one format, the dimensionality can be reduced from 542 to around 40 dimensions. The hypothesis-driven algorithm can consistently interpret the format it is designed for. More work is needed to make it more general.
Implication: The study contributes to the desired solution in the short term through the hypothesis-generating algorithm, and in a more generalisable way through the unsupervised learning approach. The study also contributes by initiating a conversation around the system design choices.
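To make the idea of reading a spreadsheet into a feature matrix more concrete, here is a minimal sketch in Python. The abstract does not describe the thesis's actual 542-dimensional feature set, so the five features used here (emptiness, numeric content, text length, relative row and column position) and the function names `cell_features` and `sheet_to_matrix` are assumptions chosen purely for illustration.

```python
from typing import List, Optional


def _is_number(text: str) -> bool:
    """Return True if the text parses as a float (note: float() also accepts 'nan' and 'inf')."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def cell_features(value: Optional[str], row: int, col: int,
                  n_rows: int, n_cols: int) -> List[float]:
    """Hypothetical per-cell features; not the feature set used in the thesis."""
    text = (value or "").strip()
    return [
        1.0 if text == "" else 0.0,                # cell is empty
        1.0 if text and _is_number(text) else 0.0,  # cell holds a number
        float(len(text)),                           # length of the cell text
        row / max(n_rows - 1, 1),                   # relative row position in [0, 1]
        col / max(n_cols - 1, 1),                   # relative column position in [0, 1]
    ]


def sheet_to_matrix(sheet: List[List[str]]) -> List[List[float]]:
    """Flatten a sheet (list of rows) into one feature vector per cell, row by row."""
    n_rows = len(sheet)
    n_cols = max((len(r) for r in sheet), default=0)
    matrix = []
    for i, r in enumerate(sheet):
        for j in range(n_cols):
            # Ragged rows are padded with missing (None) cells.
            value = r[j] if j < len(r) else None
            matrix.append(cell_features(value, i, j, n_rows, n_cols))
    return matrix
```

Each row of the resulting matrix describes one cell, so a clustering method could in principle be run on it directly to group header-like, data-like and empty cells.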
|
3 |
Unsupervised induction of semantic roles
Lang, Joel (January 2012)
In recent years, a considerable amount of work has been devoted to the task of automatic frame-semantic analysis. Given the relative maturity of syntactic parsing technology, which is an important prerequisite, frame-semantic analysis represents a realistic next step towards broad-coverage natural language understanding and has been shown to benefit a range of natural language processing applications such as information extraction and question answering. Due to the complexity which arises from variations in syntactic realization, data-driven models based on supervised learning have become the method of choice for this task. However, the reliance on large amounts of semantically labeled data, which is costly to produce for every language, genre and domain, presents a major barrier to the widespread application of the supervised approach. This thesis therefore develops unsupervised machine learning methods, which automatically induce frame-semantic representations without making use of semantically labeled data. If successful, unsupervised methods would render manual data annotation unnecessary and therefore greatly benefit the applicability of automatic frame-semantic analysis. We focus on the problem of semantic role induction, in which all the argument instances occurring together with a specific predicate in a corpus are grouped into clusters according to their semantic role. Our hypothesis is that semantic roles can be induced without human supervision from a corpus of syntactically parsed sentences, by combining the syntactic relations conveyed through parse trees with lexical-semantic information. We argue that semantic role induction can be guided by three linguistic principles. The first is the well-known constraint that semantic roles are unique within a particular frame. The second is that the arguments occurring in a specific syntactic position within a specific linking all bear the same semantic role. The third principle is that the (asymptotic) distribution over argument heads is the same for two clusters which represent the same semantic role. We consider two approaches to semantic role induction based on two fundamentally different perspectives on the problem. Firstly, we develop feature-based probabilistic latent structure models which capture the statistical relationships that hold between the semantic role and other features of an argument instance. Secondly, we conceptualize role induction as the problem of partitioning a graph whose vertices represent argument instances and whose edges express similarities between these instances. The graph thus represents all the argument instances for a particular predicate occurring in the corpus. The similarities with respect to different features are represented on different edge layers, and accordingly we develop algorithms for partitioning such multi-layer graphs. We empirically validate our models and the principles they are based on and show that our graph partitioning models have several advantages over the feature-based models. In a series of experiments on both English and German, the graph partitioning models outperform the feature-based models and yield significantly better scores than a strong baseline which directly identifies semantic roles with syntactic positions. In sum, we demonstrate that relatively high-quality shallow semantic representations can be induced without human supervision and foreground a promising direction of future research aimed at overcoming the problem of acquiring large amounts of lexical-semantic knowledge.
|
4 |
Anomaly Detection in District Heating using a Clustering based approach
Nguyen, Minh-Tung; Baduni, Metjan (January 2021)
The global demand for energy has increased in recent years. In Northern Europe and North America, centralized production and distribution of heat energy is commonly referred to as District Heating (DH). Efficient delivery of heat in the DH system is crucial not only for the building dwellers but also for the companies that supply such energy. DH efficiency has to overcome several challenges that result from faults that negatively impact its performance. Data collected from substations can be analyzed to identify potential faults and reduce the associated economic costs. The aim of this study is to use unsupervised machine learning to identify potential clusters of buildings in a time series dataset collected from buildings in a medium-sized Swedish town. We propose to find the anomalies in two ways: firstly, by identifying possible clusters of buildings and finding buildings which do not belong to any cluster and can therefore constitute potential anomalies; secondly, by studying how transitions in cluster membership can help us identify abnormal behavior over different time windows. A data mining experiment has been conducted by analyzing the energy profiles of 90 buildings over a period of 8 weeks in 2017 using the DBSCAN algorithm. Results suggest that the winter period is more appropriate for the formation of possible clusters than the summer period, due to less noise being encountered in winter. Clustering every week separately can tell us more about the anomalies. Lastly, the periodic transitions between the clusters and the ranking of the clusters based on scaled distance can help improve the anomaly detection by flagging buildings for further inspection.
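The first approach described in this abstract, treating buildings that fall outside every cluster as candidate anomalies, can be sketched with a small from-scratch DBSCAN. This is only an illustration: the abstract does not say which implementation, distance measure or parameter values were used, so the Euclidean distance, the `eps` and `min_pts` parameters and the function names below are assumptions.

```python
import math
from typing import List, Optional, Sequence

NOISE = -1  # label for points that belong to no cluster (candidate anomalies)


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def region_query(points: List[Sequence[float]], idx: int, eps: float) -> List[int]:
    """Indices of all points within eps of points[idx] (including idx itself)."""
    return [j for j, p in enumerate(points) if euclidean(points[idx], p) <= eps]


def dbscan(points: List[Sequence[float]], eps: float, min_pts: int) -> List[int]:
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels: List[Optional[int]] = [None] * len(points)  # None means unvisited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point, keep expanding
    return [label if label is not None else NOISE for label in labels]
```

With each building represented by a numeric profile for one time window, the buildings labelled `NOISE` would be the ones singled out for inspection. Running the function once per week and comparing the labels gives a simple view of the membership transitions mentioned as the second approach.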
|
5 |
Anomaly Detection in Riding Behaviours : Using Unsupervised Machine Learning Methods on Time Series Data from Micromobility Services
Hansson, Indra; Congreve Lifh, Julia (January 2022)
The global micromobility market is a fast-growing market valued at USD 40.19 billion in 2020. As the market grows, it is of great importance for companies to gain market share in order to stay competitive and be the first choice within micromobility services. This can be achieved by, e.g., offering a micromobility service that is safe for both riders and other road users. With state-of-the-art technology, preventing accidents and preventing misuse of scooters and cities' infrastructure are achievable. This study is conducted in collaboration with Voi Technology, a Swedish micromobility company that is committed to eliminating all serious injuries and fatalities in its value chain by 2030. Given such an ambition, the aim of the thesis is to evaluate the possibility of using unsupervised machine learning for anomaly detection with sensor data, to distinguish abnormal from normal riding behaviours. The study evaluates two machine learning algorithms: isolation forest and artificial neural networks, namely autoencoders. Beyond assessing the models' ability to detect abnormal riding behaviours in general, they are evaluated on their ability to find certain behaviours. Model evaluation is performed by simulating different abnormal riding behaviours. The data preparation performed for the models includes transforming the time series data into non-overlapping windows of a specific size containing descriptive statistics. The result obtained shows that finding a one-size-fits-all type of anomaly detection model did not work as desired for either the isolation forest or the autoencoder. Further, the result indicates that one of the abnormal riding behaviours appears to be easier to distinguish, which motivates evaluating models created with the aim of distinguishing that specific behaviour. Hence, a simple moving average is also implemented to explore the performance of a very basic forecasting method. For this method, the data transformation described above is not performed, as the method uses a sliding window of a specific size that is run on a single feature corresponding to an entire scooter ride. The result shows that it is possible to isolate one type of abnormal riding behaviour using the autoencoder model. Additionally, the simple moving average model can also be used to detect the behaviour in question. Of the two models, deploying the simple moving average is recommended due to its simplicity. / [Translated from Swedish] The global micromobility market is a fast-growing market that was valued at USD 40.19 billion in 2020. As the market grows, so do the demands on companies to offer high-quality products and services in order to gain a strong market position, be competitive and remain the first choice of their customers. This can be achieved by, among other things, offering micromobility services that are safe for both the rider and other road users. With the help of the latest technology, accidents can be prevented and harmful use of scooters and cities' infrastructure can be avoided. This study is carried out in collaboration with Voi Technology, a Swedish micromobility company that has committed to eliminating all serious injuries and fatalities in its value chain by 2030. In line with such an ambition, the aim of the thesis is to evaluate the possibility of using unsupervised machine learning for anomaly detection in sensor data, to distinguish abnormal from normal riding behaviours. The study evaluates two machine learning algorithms: isolation forest and artificial neural networks, more specifically autoencoders. Beyond assessing the models' ability to detect abnormal riding behaviours in general, the models are evaluated on their ability to find particular riding behaviours. By simulating different abnormal riding behaviours, the models can be evaluated. The data preparation performed for the models includes transforming the raw time series data into non-overlapping windows of a specific size, consisting of descriptive statistics. The result obtained shows that neither the isolation forest nor the autoencoder performs as expected, and that the wish to find a general model capable of detecting anomalies of different kinds does not appear to be fulfilled. Further, the result indicates that one particular abnormal riding behaviour appears easier to distinguish than the rest, which motivates evaluating models created with the aim of detecting that specific behaviour. Consequently, a moving average is implemented to explore the performance of a very basic prediction method. For this method, the previously mentioned data transformation is not performed, since the method uses a moving average that is applied to one variable belonging to a complete ride. The subsequent analysis shows that the autoencoder model is able to distinguish this type of abnormal riding behaviour. The result also shows that a moving average is able to detect the behaviour in question. Of the two models, implementing a moving average is recommended because of its simplicity.
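Two steps in the abstract above lend themselves to a short sketch: summarising a time series as non-overlapping windows of descriptive statistics, and using a simple moving average as a basic forecast whose error flags unusual readings. The abstract does not give the window size, the statistics used or the decision rule, so the choice of mean, standard deviation, minimum and maximum, the fixed `threshold` and both function names are assumptions for illustration only.

```python
from statistics import mean, pstdev
from typing import Dict, List


def window_statistics(series: List[float], size: int) -> List[Dict[str, float]]:
    """Summarise the series in non-overlapping windows of `size` samples.

    A trailing partial window (fewer than `size` samples) is dropped.
    """
    windows = []
    for start in range(0, len(series) - size + 1, size):
        chunk = series[start:start + size]
        windows.append({
            "mean": mean(chunk),
            "std": pstdev(chunk),
            "min": min(chunk),
            "max": max(chunk),
        })
    return windows


def moving_average_anomalies(series: List[float], size: int,
                             threshold: float) -> List[int]:
    """Flag indices whose value differs from the mean of the preceding
    `size` values (the moving-average forecast) by more than `threshold`."""
    flagged = []
    for i in range(size, len(series)):
        forecast = mean(series[i - size:i])
        if abs(series[i] - forecast) > threshold:
            flagged.append(i)
    return flagged
```

The windowed statistics would serve as input rows for a model such as an isolation forest or autoencoder, while the moving-average function runs directly on one feature of a single ride, which matches the distinction the abstract draws between the two kinds of data preparation.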
|
6 |
Unsupervised Topic Modeling to Improve Stormwater Investigations
Arvidsson, David (January 2022)
Stormwater investigations are an important part of the detail plan that companies and industries are required to write. The detail plan is used to show that an area is well suited for, among other things, construction. Writing these detail plans is a costly and time-consuming process, and it is not uncommon for them to be rejected. This is because it is difficult to find information about the criteria that must be met and what must be addressed within the investigation. This thesis aims to make this problem less ambiguous by applying the topic modeling algorithm LDA (latent Dirichlet allocation) to identify the structure of stormwater investigations. Moreover, sentences that contain words from the topic modeling are extracted to show how each word can be used in the context of writing a stormwater investigation. Finally, a knowledge graph is created from the extracted topics and sentences. The result of this study indicates that topic modeling and NLP (natural language processing) can be used to identify the structure of stormwater investigations. Furthermore, they can be used to extract useful information that can serve as guidance when learning about and writing stormwater investigations.
|
7 |
Detection of Deviations in Beehives Based on Sound Analysis and Machine Learning
Hodzic, Amer; Hoang, Danny (January 2021)
Honeybees are an essential part of our ecosystem as they take care of most of the pollination in the world. They also produce honey, which is the main reason beekeeping was introduced in the first place. As the production of honey is affected by the living conditions of the honeybees, beekeepers aim to maintain the health of the honeybee colonies. TietoEVRY, together with HSB Living Lab, introduced connected beehives in a project named BeeLab. The goal of BeeLab is to provide a service to monitor and gain knowledge about honeybees using data collected with different sensors. Today they measure weight, temperature, air pressure, and humidity. It is known that honeybees produce different sounds when different events occur in the beehive. Therefore, BeeLab wants to introduce sound monitoring to its service. This project aims to investigate the possibility of detecting deviations in beehives based on sound analysis and machine learning. This includes recording sound from beehives, followed by preprocessing of the sound data, feature extraction, and applying a machine learning algorithm to the sound data. An experiment is done using Mel-Frequency Cepstral Coefficients (MFCC) to extract sound features and applying the DBSCAN clustering algorithm to investigate the possibilities of detecting deviations in the sound data. The experiment showed promising results, as the deviating sounds used in the experiment were grouped into different clusters.
|
8 |
Visual Analysis of Industrial Multivariate Time-Series Data : Effective Solution to Maximise Insights from Blow Moulding Machine Sensory Data
Musleh, Maath (January 2021)
Developments in the field of data analytics provide a boost for small-sized factories. These factories are eager to take full advantage of the potential insights in remotely collected data to minimise cost and maximise quality and profit. This project aims to process, cluster and visualise sensory data from a blow moulding machine in a plastic production factory. In collaboration with Lean Automation, we aim to develop a data visualisation solution that enables decision-makers in a plastic factory to improve their production process. We investigate three different aspects of the solution: methods for processing multivariate time-series data, clustering approaches for the collected sensory data, and visualisation techniques that maximise insights into the production process. We use a formative evaluation method to develop a solution that meets the partners' requirements and best practices within the field. Through building the MTSI dashboard tool, we hope to answer questions on optimal techniques to represent, cluster and visualise multivariate time-series data.
|
9 |
Clustering SQL-queries using unsupervised machine learning
Schmidt, Thomas (January 2022)
Decerno has created a business system that utilizes Microsoft's Entity Framework (EF), which is an object-database mapper. It can automatically generate SQL queries from code written in C#. Some of these queries have started to display a significant increase in query response time, which requires further examination. The generated queries vary in length from 3 to around 2500 tokens, which makes it difficult to get an overview of which types of queries are consistently slow. This thesis examines the possibility of using neural networks based on the transformer model in conjunction with the autoencoder in order to create feature-rich embeddings from the SQL queries. The networks presented in this thesis are tasked with capturing the semantics of the SQL queries such that semantically similar queries are mapped close to one another in the latent feature space. In order to investigate the impact of embedding dimension, several transformer-based networks are constructed that calculate embeddings with varying embedding dimensions. The dimensionality reduction algorithm UMAP is applied to the higher-dimensional embeddings so that the clustering algorithm DBSCAN can be applied successfully. The results show that unsupervised machine learning can be used to create feature-rich embeddings from SQL queries, but that higher-dimensional embeddings are required, as the models that encoded the SQL queries into embeddings with 5 dimensions or fewer did not yield satisfactory results. Thus, some sort of dimensionality reduction algorithm is required when adopting the method proposed in this thesis. Furthermore, the results did not indicate any correlation between semantic similarity and average response times.
|
10 |
Oövervakad maskininlärning för att upptäcka bottar i online-tävlingar [Unsupervised machine learning for detecting bots in online competitions]
Saari, Lukas; Mårtensson, Emil (January 2016)
[Translated from Swedish] Digital marketing is currently a fast-growing industry, and its actors are constantly looking for new ways to conduct marketing. This report studies one of these actors, Adoveo, whose value proposition is to include a competition element in advertising campaigns that gives participants the chance to win prizes. The problem, however, is that the prizes risk not being awarded to human participants but instead to bots that take part in the competitions both an inhuman number of times and with inhumanly good results. The purpose of the report is to use data from this actor to try to distinguish human participants from bots. To this end, two unsupervised machine learning algorithms were applied to cluster the data points: Gaussian Mixture Model and K-means. The result was an unclear cluster structure in which no cluster could be reliably identified as human or bot-like. The main causes of this uncertainty were the design of the advertising competitions and the insufficiency of the attributes in the studied data. Recommendations were given on how these problems could be addressed. Finally, an analysis was carried out of the business value of bot-proof competitions and the added value this creates for the company. The analysis showed that the business value of bot-proofing the competitions would be large, as it would provide advantages with respect to consumers as well as advertisers and competitors. / Digital marketing is a fast-growing market and its actors are constantly looking for innovative and new ways of marketing. In this paper, an actor on this market called Adoveo is studied. Their specialization and value proposition is to include a competition part in their advertisement campaigns, giving participants the possibility to win a prize. What could turn out to be problematic is that the prizes are not awarded to human contestants, instead going to a bot that can participate in the competition with unreasonably good results. The purpose of this paper is to try to separate bots from human contestants using the data provided by Adoveo. To that end, two unsupervised machine learning algorithms were implemented to cluster the data points: Gaussian Mixture Model and K-Means. The result was an uninterpretable cluster structure from which no reliable identification of bot-like and human-like behaviour could be made. The reason behind this was twofold: the design of the competition and a lack of decisive attributes in the data. Recommendations were provided on how both of these issues could be rectified. Finally, an analysis was provided of the business value of bot-securing competitions and the value it gives to the company. The analysis showed that bot-securing the competitions would be beneficial, because it would give a competitive advantage over competitors and also improve business with advertisers and consumers.
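Of the two clustering algorithms named in this abstract, K-means is the simpler to sketch from scratch; the Gaussian Mixture Model is not shown here. The sketch below is an illustration only and is not taken from the thesis: the function name `kmeans`, the seeded random initialisation and the idea of describing each contestant by numeric attributes such as number of entries and best score are assumptions.

```python
import math
import random
from typing import List, Optional, Tuple


def kmeans(points: List[List[float]], k: int, iterations: int = 100,
           seed: int = 0) -> Tuple[List[int], List[List[float]]]:
    """Plain K-means (Lloyd's algorithm) with seeded random initial centroids.

    Returns one cluster label per point and the final centroids.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels: Optional[List[int]] = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, new_labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
        if new_labels == labels:
            break  # assignments no longer change: converged
        labels = new_labels
    return (labels if labels is not None else []), centroids
```

In practice the attributes would usually be scaled to comparable ranges before clustering, since K-means relies on Euclidean distance. Note also that the cluster numbers carry no meaning by themselves; deciding which cluster, if any, looks bot-like is the interpretation step that the abstract reports could not be done reliably with the available attributes.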
|