251 |
Advanced Algorithms for Classification and Anomaly Detection on Log File Data : Comparative study of different Machine Learning Approaches. Wessman, Filip. January 2021 (has links)
Background: A problematic area in today's large-scale distributed systems is the exponentially growing amount of log data. Finding anomalies by observing and monitoring this data with manual human inspection becomes progressively more challenging, complex, and time-consuming, yet it is vital for keeping these systems available around the clock. Aim: The main objective of this study is to determine which Machine Learning (ML) algorithms are the most suitable and whether they can live up to the needs and requirements regarding optimization and efficiency in log data monitoring, including which specific steps of the overall problem can be improved by using these algorithms for anomaly detection and classification on different real, provided data logs. Approach: An initial pre-study is conducted; logs are collected and then preprocessed with the log parsing tool Drain and regular expressions. The approach consisted of a combination of K-Means + XGBoost and, respectively, Principal Component Analysis (PCA) + K-Means + XGBoost. These were trained, tested, and individually evaluated with different metrics against two datasets, one being a server data log and the other an HTTP access log. Results: The results showed that both approaches performed very well on both datasets, able to classify, detect, and make predictions on log data events with high accuracy and precision and low calculation time. It was further shown that when the model is applied without dimensionality reduction (PCA), the prediction results are slightly better, by a few percent. As for the prediction time, there was a marginal to nonexistent difference between running with and without PCA. Conclusions: Overall, the differences between the results with and without PCA are very small, but in essence it is better not to use PCA and instead apply the original data to the ML models. The models' performance is generally very dependent on the data being applied: the initial preprocessing steps, its size, and its structure, which affect the calculation time the most.
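As a rough illustration of the kind of pipeline this abstract describes, the hedged sketch below combines K-Means, an optional PCA step, and XGBoost using scikit-learn and the xgboost package. The feature matrix X (event counts after Drain parsing), the binary labels y, and all parameter values are assumptions for illustration, not the thesis's actual code.

```python
# Hedged sketch of the two pipelines compared in the abstract
# (K-Means + XGBoost, with and without a PCA step). Illustrative only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score
from xgboost import XGBClassifier

def run_pipeline(X, y, use_pca=False, n_clusters=8):
    # X: assumed event-count matrix from Drain-parsed logs, y: binary anomaly labels
    if use_pca:
        X = PCA(n_components=0.95).fit_transform(X)   # keep 95% of the variance
    # append the K-Means cluster id as an extra feature for the classifier
    cluster_id = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    X_aug = np.column_stack([X, cluster_id])
    X_tr, X_te, y_tr, y_te = train_test_split(X_aug, y, test_size=0.3, stratify=y)
    clf = XGBClassifier(n_estimators=200, max_depth=6).fit(X_tr, y_tr)
    y_hat = clf.predict(X_te)
    return accuracy_score(y_te, y_hat), precision_score(y_te, y_hat)
```

Comparing the returned scores for `use_pca=True` and `use_pca=False` mirrors the with/without-PCA comparison discussed in the abstract.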
|
252 |
Modul víceúrovňových asociačních pravidel systému pro dolování z dat / Multi-Level Association Rules Module of a Data Mining System. Pospíšil, Jan. January 2010 (has links)
This thesis focuses on the problem of implementing a multilevel association rule mining module for an existing data mining project. Two main algorithms are explained: Apriori and MLT2L1. The thesis continues with the implementation of the data mining module, as well as the design of the DMSL elements. The final chapters deal with an example data mining task and a comparison of its results, together with a description of the thesis's overall achievements.
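A minimal, hedged sketch of the Apriori frequent-itemset step mentioned above follows; candidate pruning and the multilevel MLT2L1 extension are omitted, and it is illustrative only, not the module's actual implementation.

```python
# Simplified Apriori: generate frequent itemsets by level-wise support counting.
# Subset-based candidate pruning and rule generation are omitted for brevity.
def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = {i for t in transactions for i in t}
    # frequent 1-itemsets
    freq = [{frozenset([i]) for i in items
             if sum(i in t for t in transactions) / n >= min_support}]
    k = 2
    while freq[-1]:
        # candidate k-itemsets as unions of frequent (k-1)-itemsets
        cands = {a | b for a in freq[-1] for b in freq[-1] if len(a | b) == k}
        freq.append({c for c in cands
                     if sum(c <= t for t in transactions) / n >= min_support})
        k += 1
    return [s for level in freq for s in level]

# Example (hypothetical data):
# apriori([["milk", "bread"], ["milk", "beer"], ["milk", "bread", "beer"]], 0.5)
```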
|
253 |
Återskapa mänskligt beteende med artificiell intelligens i 2D top-down wave shooter spel / Recreate human behaviour with artificial intelligence in 2D top-down wave shooter game. Bjärehall, Johannes; Hallberg, Johan. January 2020 (links)
This work examines human-like behaviour in behaviour trees and LSTM networks. A game was created and tested in a study in which the participants played together with each agent in random order in order to judge the agents' behaviour. The results of the study showed that the behaviour tree was perceived as the more human-like variant by the participants, regardless of the order in which they played with each agent. This outcome is most likely explained by insufficient time and limited CPU power to develop the LSTM agent further. To improve and continue the work, more time could be spent on training the LSTM network and fine-tuning the behaviour tree. To improve the test, real multiplayer functionality should be implemented so that the agents can be compared against real human players.
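A hedged sketch of the kind of LSTM agent discussed, assuming PyTorch: it maps a short window of recorded game states to the next action. Dimensions and training details are illustrative assumptions, not the project's actual setup.

```python
# Illustrative LSTM imitation agent: predicts the next action from a window
# of observed game states, trained against actions recorded from human play.
import torch
import torch.nn as nn

class LSTMAgent(nn.Module):
    def __init__(self, state_dim=16, hidden_dim=64, n_actions=9):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, states):           # states: (batch, time, state_dim)
        out, _ = self.lstm(states)
        return self.head(out[:, -1])     # logits for the next action

# Training would use cross-entropy against recorded human actions, e.g.
# loss = nn.CrossEntropyLoss()(agent(state_window), human_action_ids)
```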
|
254 |
Object Classification using Language Models. From, Gustav. January 2022 (links)
In today's modern digital world, more and more emails and messages must be sent, processed, and handled. Categorizing and classifying these pieces of text can take a very long time and costs companies time and money. If the classification could be done automatically by a computer, based on the content of the text or message, it would be a major gain for Easit AB and its customers. To facilitate the task of text classification, Easit needs a solution made up of one language model and one classifier model. The language model converts raw text into a vector that represents the text, and the classifier determines which predefined labels fit the vector. The end goal is not to create the best possible solution; it is to build a general understanding of different language and classifier models and of how to build a system that is both fast and accurate. BERT was the primary language model during evaluation, but doc2Vec and One-Hot encoding were also tested. The classifiers consisted of boundary-condition models or dense neural networks, all trained without knowledge of which language model the text vectors came from. The validation accuracy reported for the IMDB comment dataset with BERT ranged between 75% and 94%, depending mostly on the language model and not on the classifier. The knowledge from this work resulted in a recommendation to Easit for an alternative-based system solution. / In today's modern digital world, ever more email matters and messages must be sent and processed. Categorizing and classifying them can take a very long time and costs companies time and money. If the classification could be done automatically based on the text content, it would be a major gain for Easit AB and their customers. To facilitate text classification, Easit needs a two-part solution consisting of a language model and a classifier. The language model converts text into a vector that represents the text, and the classifier interprets which predefined labels fit the vector. The goal is not to create the best solution, but to build general knowledge about how to design a system that classifies text accurately and efficiently. When evaluating different language models, BERT models were primarily used, but doc2Vec and One-Hot encoding were also tested. The classifier consisted of boundary-condition models or dense neural networks trained entirely without knowledge of which language model produced the text vectors. The validation accuracy on the IMDB comment dataset with BERT was between 75% and 94%, depending primarily on the language model. Neural networks are best suited as classifiers, mainly because of their scalability to multiple labels. The knowledge from the work resulted in a recommendation to Easit for an alternative-based system solution.
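A hedged sketch of the two-part design described above (a language model producing a text vector, plus a separate classifier), assuming Hugging Face Transformers and scikit-learn; the model choice and classifier settings are illustrative, not Easit's or the thesis's actual configuration.

```python
# Language model turns raw text into a fixed vector; a separate classifier
# maps the vector to labels, with no knowledge of which encoder produced it.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.neural_network import MLPClassifier

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0].numpy()   # [CLS] token as the text vector

clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300)
# clf.fit(embed(train_texts), train_labels)
# clf.predict(embed(["great movie, would watch again"]))
```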
|
255 |
Imbalanced Learning and Feature Extraction in Fraud Detection with Applications / Obalanserade Metoder och Attribut Aggregering för Upptäcka Bedrägeri, med Appliceringar. Jacobson, Martin. January 2021 (links)
This thesis deals with fraud detection in a real-world environment, with datasets coming from Svenska Handelsbanken. The goal was to investigate how well machine learning can classify fraudulent transactions and how new additional features affected classification. The models used were EFSVM, RUTSVM, CS-SVM, ELM, MLP, Decision Tree, Extra Trees, and Random Forests. To determine the best results, the Matthews Correlation Coefficient was used as the performance metric, which has been shown to have a medium bias for imbalanced datasets. Each model could deal with highly imbalanced datasets, which are common in fraud detection. The best results were achieved with Random Forest and Extra Trees. The best scores were around 0.4 for the real-world datasets, though the score itself says little on its own; it is more a testimony to the dataset's separability. These scores were obtained when using aggregated features and not the standard raw dataset. The recall scores were around 0.88-0.93, with an increase in precision of 34.4%-67%, resulting in a large decrease in false positives. Evaluation results showed a great difference compared to the test runs, either a substantial increase or decrease. Two theories as to why are discussed: a large distribution change in the evaluation set, and the sample-size increase (100%) for evaluation, which could have led to the tests not being representative of the performance. Feature aggregation was a central topic of this thesis, with the main focus on behaviour features that can describe patterns and habits of customers. These fell into five categories: the sender's fraud history, the sender's transaction history, the sender's time-of-transaction history, the sender's history with the receiver, and the receiver's history. Of these, the best performance increase came from the first, which gave the top score; the other datasets did not show as much potential, with most not improving the results. Further studies need to be done before discarding these features, to be certain that they do not improve performance. Together with the data aggregation, a tool (t-SNE) for visualizing high-dimensional data was used to great success. It gave an early understanding of what newly added features would bring to classification. For the best dataset it could be seen that a new sub-cluster of transactions had been created, leading to the belief that classification scores could improve, which they did. Feature selection and PCA reduction techniques were also studied; PCA showed good results and increased performance, while feature selection showed no conclusive improvements. Over- and under-sampling were used and neither improved the scores, though undersampling could maintain the results, which is interesting when increasing the dataset. / This thesis deals with detecting fraud in a real-world environment with data from Svenska Handelsbanken. The goal was to investigate how well machine learning can classify fraudulent transactions, and how new features help classification. The methods used were EFSVM, RUTSVM, CS-SVM, ELM, MLP, Decision Tree, Extra Trees, and Random Forests. For evaluating the results, the Matthews Correlation Coefficient is used, which has been shown to depend only slightly on class imbalance. Each model has built-in settings for handling imbalanced datasets, which is important for detecting fraud.
In terms of results, Random Forest and Extra Trees turned out to be the best, without performing p-tests; this is because the datasets were relatively small, so small differences in results are not certain. The highest scores were around 0.4; the absolute value says nothing more than giving an indication of the degree of separation between the classes. The best results were obtained when the new aggregated features were used instead of the standard dataset. These results had recall values of 0.88-0.93, and for these the precision increased by 34.4%-67%, giving a large decrease in false positives. The evaluation results differed greatly from the test results, showing either a substantial increase or decrease. Two possible reasons were discussed: a change in the evaluation data relative to the test data, or that the size increase (100%) for evaluation led to the tests not being representative. Feature aggregation was a central topic, with a focus on behaviour patterns that describe customers' habits. These fell into five categories: the sender's fraud history, the sender's transaction history, the sender's time-of-transaction history, the sender's history with the receiver, and the receiver's history. Of these, the largest performance increase came from the fraud history; the other features did not show equally positive results, and most did not improve the scores. Further and more extensive studies must be done before these features can be said to be useful or not. Together with the data aggregation, t-SNE was used successfully to visualize high-dimensional data. With t-SNE, an early understanding can be gained of what added features can be expected to bring to classification. For the best dataset, a new cluster could be seen to have been created, which can be interpreted as the data being more descriptive. The results were also expected to improve there, which they did. Feature selection and PCA dimensionality reduction were studied, and PCA showed improved results. Over- and under-sampling were tested and could not improve the results, although undersampling could maintain them, which is interesting if the dataset grows.
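A hedged sketch of the evaluation setup implied by the abstract, assuming scikit-learn: tree ensembles scored with the Matthews Correlation Coefficient, recall, and precision on an imbalanced fraud dataset. All parameters are illustrative assumptions, not the thesis's actual configuration.

```python
# Compare Random Forest and Extra Trees on imbalanced data with MCC,
# recall, and precision (the metrics highlighted in the abstract).
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import matthews_corrcoef, recall_score, precision_score
from sklearn.model_selection import train_test_split

def evaluate(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y)
    models = {
        "random_forest": RandomForestClassifier(n_estimators=300, class_weight="balanced"),
        "extra_trees": ExtraTreesClassifier(n_estimators=300, class_weight="balanced"),
    }
    results = {}
    for name, model in models.items():
        y_hat = model.fit(X_tr, y_tr).predict(X_te)
        results[name] = {
            "mcc": matthews_corrcoef(y_te, y_hat),
            "recall": recall_score(y_te, y_hat),
            "precision": precision_score(y_te, y_hat),
        }
    return results
```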
|
256 |
Machine Learning Evaluation of Natural Language to Computational Thinking : On the possibilities of coding without syntax. Björkman, Desireé. January 2020 (links)
Voice commands are used in today's society to offer services such as adding events to a calendar, reporting the weather, and controlling the lights at home. This project tries to extend the possibilities of voice commands by improving an earlier proof-of-concept system that interprets intentions given in natural language into program code. The improvement was made by mixing linguistic methods and neural networks to increase the accuracy and flexibility of the interpretation of the input. A user testing phase was conducted to determine whether the improvement would attract users to the interface. The results showed potential for educational use in computational thinking and identified the issues that must be overcome for the system to become a general programming tool.
|
257 |
AI methods for identifying process defects in advanced manufacturing with rare labeled data. Senanayaka Mudiyanselage, Ayantha Umesh. 08 August 2023 (links) (PDF)
This dissertation aims to provide efficient process defect identification methods for advanced manufacturing environments using AI tools and algorithms under limited labeled data availability. Asset and equipment quality is highly sensitive for sustaining good performance and safety in various manufacturing domains. Internally generated process imperfections degrade the optimum performance and mechanical attributes of finished products. The evolution of big data and intelligent sensing systems enables data-driven defect identification in advanced manufacturing environments. Widely adopted data-driven process anomaly detection methods assume that the training (source) and testing (target) data follow the same distribution and that labeled data are available in both source and target domains. However, the source and target sometimes follow different distributions in real-world manufacturing environments, as the diversity of industrialization processes leads to heterogeneous data collection under different production conditions. Such a case significantly limits the performance of AI algorithms when a distribution discrepancy exists.
Moreover, labeling data is typically costly and time-consuming, meaning that identifying process defects is limited by rare labeled data. In addition, realistic industrial applications contain far fewer defect data than normal data, as well as unforeseen target defects, which complicates understanding process behavior in various respects. Therefore, we introduced methodological principles, including unsupervised grouping, transfer learning, data augmentation, and ensemble learning, to address these limitations in advanced operations. First, a rapid porosity prediction methodology for additive manufacturing (AM) processes under varying process conditions is developed by leveraging knowledge transfer from existing process conditions. Second, the design of an effective classification method for time-series signals to advance predictive maintenance (PdM) for machine state prediction is discussed. Finally, a data augmentation-based stacking classifier approach is developed to enhance the precision of porosity prediction even when limited porosity data is available.
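A hedged sketch of a data augmentation-based stacking classifier for rare defect labels, assuming scikit-learn and imbalanced-learn; SMOTE as the augmentation step and the particular base learners are assumptions, since the abstract does not specify them.

```python
# Augment the rare defect class, then train a stacked ensemble on the
# balanced data. Illustrative only, not the dissertation's actual method.
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def fit_porosity_model(X, y):
    # Oversample the minority (defect) class before training the ensemble.
    X_aug, y_aug = SMOTE(k_neighbors=3).fit_resample(X, y)
    stack = StackingClassifier(
        estimators=[("rf", RandomForestClassifier(n_estimators=200)),
                    ("svm", SVC(probability=True))],
        final_estimator=LogisticRegression(max_iter=1000),
    )
    return stack.fit(X_aug, y_aug)
```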
|
258 |
Convolutional Neural Networks for Indexing Transmission Electron Microscopy Patterns: a Proof of Concept. Tomczak, Nathaniel. 26 May 2023 (links)
No description available.
|
259 |
Machine Learning on Terrain Data and Logged Vehicle Data to Gain Insights into Operating Conditions for an Articulated Hauler. Sun, Tianren; Wang, Yen Chieh. January 2022 (links)
Manufacturers can develop next-generation products and services for their customers from data gathered and analyzed about customers' usage conditions. In this research, the operating conditions of articulated haulers are collected and analyzed with machine learning algorithms to predict the type of operational topography and road surface. To achieve this, elevation data and satellite images gathered from Microsoft Azure Maps are used as data sources to identify the topography and road surface on which the machines operated. In the end, two machine learning models are trained, with the machines' inclination records and road roughness records respectively, to classify topography and road surface. For the topography classifier, the topography is categorized into four terrain labels: "Low Hills", "Mountains", "Plains", and "Tablelands & High Hills". The road surface is classified into "Paved" and "Unpaved". A Convolutional Neural Network (CNN) image classification model is built for labeling the satellite images instead of labeling them manually. The results indicate that the predictions for the topography labels "Plains" and "Tablelands & High Hills", which account for the majority of the raw dataset, perform best; on the contrary, the road surface classifier still needs further improvement. In addition, an analysis and discussion of the imbalanced dataset is included, showing the limited effect of an extremely imbalanced dataset. Finally, conclusions and future work are given.
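A hedged sketch of a small CNN image classifier for labeling satellite tiles with the four terrain classes named in the abstract, assuming PyTorch; the architecture and input size are illustrative assumptions, not the thesis's actual network.

```python
# Minimal CNN that maps a satellite image tile to one of the four terrain
# labels used in the abstract. Illustrative architecture only.
import torch
import torch.nn as nn

TERRAIN_CLASSES = ["Low Hills", "Mountains", "Plains", "Tablelands & High Hills"]

class TerrainCNN(nn.Module):
    def __init__(self, n_classes=len(TERRAIN_CLASSES)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):                # x: (batch, 3, H, W) satellite tiles
        return self.classifier(self.features(x).flatten(1))
```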
|
260 |
A Machine Learning Approach that Integrates Clinical Data and PTM Proteomics Identifies a Mechanism of ACK1 Activation and Stabilization in Cancer. Loku Balasooriyage, Eranga Roshan Balasooriya. 08 August 2022 (links)
Identification of novel cancer driver mutations is crucial for targeted cancer therapy, yet it is a difficult task, especially for low-frequency drivers. To identify cancer driver mutations, we developed a machine learning (ML) model to predict cancer hotspots. Here, we applied the ML program to 32 non-receptor tyrosine kinases (NRTKs) and identified 36 potential cancer driver mutations, with high-probability mutations in 10 genes, including ABL1, ABL2, JAK1, JAK3, and ACK1. ACK1 is a member of the poorly understood ACK family of NRTKs, which also includes TNK1. Although ACK1 is an established oncogene and a high-interest therapeutic target, the exact mechanism of ACK1 regulation is largely unknown and there is still no ACK1 inhibitor in clinical use. The ACK kinase family has a unique domain arrangement with, most notably, a predicted ubiquitin-association (UBA) domain at its C-terminus. The presence of a functional UBA domain on a kinase is unique to the ACK family, but the role of the UBA domain on ACK1 is unknown. Interestingly, the ML program identified the ACK1 mutation p633fs*, which truncates the Mig6 homology region (MHR) and UBA domains, as a cancer driver mutation. Our data suggest that the ACK1 UBA domain helps activate full-length ACK1 through induced proximity. It also acts as a mechanism of negative feedback by tethering ACK1 to ubiquitinated cargo that is ultimately degraded. Indeed, our preliminary data suggest that truncation of the ACK1 UBA stabilizes ACK1 protein levels, which results in spontaneous ACK1 oligomerization and activation. Furthermore, our data suggest that removal of the MHR domain hyperactivates ACK1. Thus, our data provide a model explaining how human mutations in ACK1 convert the kinase into an oncogenic driver. In conclusion, our data reveal a mechanism of ACK1 activation and potential strategies for targeting the kinase in cancer.
|