71

NOVA: Automated Detection of Violent Threats in Swedish Online Environments

Lindén, Kevin, Moshfegh, Arvin January 2023 (has links)
Social media and online environments have become an integral part of society, allowing for self-expression, information sharing, and discussion online. However, these platforms are also used to express hate and threats of violence. Violent threats online lead to negative consequences, such as an unsafe online environment, self-censorship, and endangered democracy. Manually detecting and moderating threats online is challenging due to the vast amounts of data uploaded daily, and scholars have called for efficient tools based on machine learning to tackle this problem. A further challenge is that few threat-focused datasets and models exist, especially for low-resource languages such as Swedish, making identifying and detecting threats difficult. This study therefore aims to develop a practical and effective tool that automatically detects and identifies online threats in Swedish. A tailored Swedish threat dataset is generated to fine-tune KBLab’s Swedish BERT model. The research question that guides this project is: “How effective is a fine-tuned BERT model in classifying texts as threatening or non-threatening in Swedish online environments?” To the authors’ knowledge, no existing model can detect threats in Swedish. The study uses design science research to develop the artifact and evaluates the artifact’s performance using experiments. The dataset is generated during design and development by manually annotating translated English, synthetic, and authentic Swedish data, and the BERT model is fine-tuned using hyperparameters from previous research. The resulting dataset comprises 6,040 posts, split into 39% threats and 61% non-threats. The model, NOVA, achieved good performance on the test set and in the wild, successfully differentiating threats from non-threats. NOVA achieved almost perfect recall but lower precision, indicating room for improvement: it may be too lenient when classifying threats, which could be attributed to the complexity and ambiguity of threats and the relatively small dataset. Nevertheless, NOVA can be used as a filter to identify threatening posts among vast amounts of online data.
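A minimal sketch of the fine-tuning setup the abstract describes, using the Hugging Face Transformers API: KBLab's Swedish BERT loaded with a binary classification head. The label encoding, example posts, and training-step details are illustrative assumptions, not the thesis's exact configuration.

```python
# Sketch: fine-tuning KBLab's Swedish BERT for binary threat classification.
# Labels, example posts, and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "KB/bert-base-swedish-cased"  # KBLab's Swedish BERT

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=2  # assumed encoding: 0 = non-threat, 1 = threat
)

# A tiny batch of annotated posts (hypothetical examples).
posts = ["Jag ska hitta dig.", "Vilken fin dag det är idag!"]
labels = torch.tensor([1, 0])
batch = tokenizer(posts, padding=True, truncation=True, return_tensors="pt")

# One training step: the model computes cross-entropy loss when given labels.
# (Optimizer step and training loop omitted for brevity.)
model.train()
outputs = model(**batch, labels=labels)
outputs.loss.backward()
```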
72

Low-resource Language Question Answering System with BERT

Jansson, Herman January 2021 (has links)
The complexity of staying at the forefront of information retrieval systems is constantly increasing. BERT, a recent natural language processing technology, has reached superhuman performance on reading comprehension tasks in high-resource languages. However, several researchers have stated that multilingual models are not sufficient for low-resource languages, since they lack a thorough understanding of those languages. Recently, a Swedish pre-trained BERT model was introduced, trained on significantly more Swedish data than the multilingual models currently available. This study compares multilingual and Swedish monolingual BERT models for question answering, fine-tuning them on both an English and a machine-translated Swedish SQuADv2 dataset. The models are evaluated on the SQuADv2 benchmark and within an implemented question answering system built on the classical retriever-reader methodology. The study introduces a naive and a more robust prediction method for the proposed question answering system, and identifies a sweet spot for each model approach integrated into the system. The question answering system is evaluated and compared against another question answering library at the leading edge of the area, using a custom-crafted Swedish evaluation dataset. The results show that the model fine-tuned from the Swedish pre-trained model on the Swedish SQuADv2 dataset was superior in all evaluation metrics except speed. The comparison between the systems resulted in a higher evaluation score but a slower prediction time for this study’s system.
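A sketch of the retriever-reader pattern the abstract builds on, under stated assumptions: a retriever (not shown, e.g. BM25) supplies candidate passages, and a BERT-style reader extracts the best answer span from each. The checkpoint name and the confidence cutoff that mimics a "more robust" no-answer behaviour are illustrative, not the thesis's actual models.

```python
# Sketch: reader stage of a retriever-reader QA system.
# Checkpoint and threshold are illustrative assumptions.
from transformers import pipeline

reader = pipeline("question-answering",
                  model="distilbert-base-cased-distilled-squad")

def answer(question: str, passages: list[str], min_score: float = 0.3):
    """Run the reader over retrieved passages and keep the best span."""
    best = None
    for passage in passages:  # passages come from a retriever, e.g. BM25
        result = reader(question=question, context=passage)
        if best is None or result["score"] > best["score"]:
            best = result
    # A "robust" variant returns no answer below a confidence threshold,
    # mirroring SQuADv2's unanswerable questions.
    return best if best and best["score"] >= min_score else None
```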
73

Designing and evaluating an algorithm to pick out minority comments online

Liu, Elin January 2022 (has links)
Social media and online discussion forums have allowed people to hide behind a veil of anonymity, which has made these platforms feel unsafe for people whose opinion differs from the majority's. Recent research on robots and bots has found that they are a good option for inducing cooperation or acting as a conversation partner that encourages critical thinking. These robots and bots are based on an algorithm able to identify and classify comments left by users, usually into positive and negative. The problem addressed in this thesis is to explore the possibility of creating an algorithm that can classify comments and pick out a minority opinion with an accuracy of at least 90%. The purpose is to create one of the vital algorithms for a larger project, and the goal is to provide a functioning algorithm with an accuracy of at least 90% for future implementations. The research approach is quantitative. The results show that it is possible to create an algorithm able to classify and identify comments and also pick out a minority opinion. Furthermore, the algorithm achieved an accuracy of at least 90% in the classification of comments, which makes the search for a minority opinion much easier. / Social media and online discussion forums have allowed people to hide behind their computer screens and remain anonymous. This has made social media an unsafe place for people who do not share the majority's opinion on various topics of discussion. Recent research on robots and social bots has found that they are effective at getting people to cooperate and that they are good conversation partners that elicit more critical thinking. These robots and social bots are based on an algorithm that can identify and classify comments left by social media users, usually into positive or negative. The problem the thesis attempts to solve is whether it is possible to create an algorithm that can identify and classify comments, but also find and extract an opinion that is not part of the majority, with an accuracy of at least 90%. The purpose is to create an important building block for a larger research project. The goal of the thesis is to create a functioning algorithm for future research that can hopefully counteract bias in social media. The thesis takes a quantitative approach. The results show that it is possible to create an algorithm that can classify comments and also find an opinion that is not part of the majority. Moreover, the algorithm has high accuracy in classification, which facilitates the search for such an opinion.
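A minimal sketch of the core idea, assuming an off-the-shelf classifier: label each comment, then return the comments in the least frequent class as the minority opinion. The thesis's actual model, labels, and data are not specified here, so everything below is illustrative.

```python
# Sketch: classify comments, then extract the least common class.
# The generic sentiment pipeline stands in for the thesis's classifier.
from collections import Counter
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # default English model, assumed

def minority_comments(comments: list[str]) -> list[str]:
    labels = [classifier(c)[0]["label"] for c in comments]
    counts = Counter(labels)
    minority_label, _ = counts.most_common()[-1]  # least frequent class
    return [c for c, l in zip(comments, labels) if l == minority_label]
```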
74

Investigations of Free Text Indexing Using NLP : Comparisons of Search Algorithms and Models in Apache Solr / Undersöka hur fritextindexering kan förbättras genom NLP

Sundstedt, Alfred January 2023 (has links)
As natural language processing advances and applications such as OpenAI's gain considerable popularity in society, businesses are encouraged to integrate NLP into their systems, both to improve the user experience and to provide users with the information they request. For case management systems, a complicated task is providing the user with relevant documents, since customers often have large databases containing similar information; this presumes that the user's query matches the requested topic perfectly. Imagine if there were a way to search by context, via established NLP models like BERT, instead of formulating the perfect prompt. Imagine if the system understood its content. This thesis aims to investigate, from a user perspective, how a free text index can be improved using NLP, and to implement such an improvement. Using AI to support a free text index, in this case Apache Solr, can make it easier for users to find the specific content they are looking for, and it is interesting to see how search can be improved with the help of NLP models to present more relevant results. NLP can improve user prompts, known as queries, and assist in indexing the information. The task is a practical investigation: configuring the free text database Apache Solr with and without NLP support, indexing content with each search model, letting the models return results for a set of user queries, and evaluating those results. The investigated search models were a string-based model, an OpenNLP model, and BERT models segmented at paragraph and sentence level. A hybrid search model combining OpenNLP and BERT at paragraph level was the best solution overall.
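One way such a hybrid could look in code, as a hedged sketch: fetch a keyword-based result list from Solr via pysolr, then re-rank it by embedding similarity to the query. The Solr URL, the "content" field, and the encoder checkpoint are assumptions; the thesis's actual OpenNLP analysis chain and paragraph-level BERT index are more involved.

```python
# Sketch: Solr keyword retrieval followed by semantic re-ranking.
# URL, field name, and encoder are illustrative assumptions.
import pysolr
from sentence_transformers import SentenceTransformer, util

solr = pysolr.Solr("http://localhost:8983/solr/documents")
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def hybrid_search(query: str, rows: int = 20):
    hits = list(solr.search(query, rows=rows))      # string-based retrieval
    texts = [hit.get("content", "") for hit in hits]
    q_emb = encoder.encode(query, convert_to_tensor=True)
    d_embs = encoder.encode(texts, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, d_embs)[0]         # semantic re-ranking
    return sorted(zip(hits, scores.tolist()),
                  key=lambda pair: pair[1], reverse=True)
```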
75

Lights, Camera, BERT! : Autonomizing the Process of Reading and Interpreting Swedish Film Scripts

Henzel, Leon January 2023 (has links)
In this thesis, the autonomization of extracting information from PDFs of Swedish film scripts through various machine learning techniques and named entity recognition (NER) is explored. It is also explored whether the labeled data needed for the NER task can be reduced to some degree, with the goal of saving time. The autonomization process is split into two subsystems: one for extracting larger chunks of text, and one for extracting relevant information as named entities from some of those text chunks using NER. The methods explored for accelerating labeling time for NER are active learning and self-learning. For active learning, three methods are explored: Logprob and Word Entropy as uncertainty-based methods, and active learning by processing surprisal (ALPS) as a diversity-based method. For self-learning, Logprob and Word Entropy are used, as they are uncertainty-based sampling methods. The results show that ALPS is the highest-performing active learning method when it comes to saving labeling time for NER. For self-learning, Word Entropy proved successful, whereas Logprob could not sufficiently be used. The entire script-reading system is evaluated by competing against a human extracting information from a film script, where the human and the system compete on time and accuracy; accuracy is defined as a custom F1-score based on the F1-score for NER. Overall, the system performs orders of magnitude faster than a human while still retaining fairly high accuracy. The subsystem for extracting named entities had quite low accuracy, which is hypothesised to be mainly due to high data imbalance and too little diversity in the training data.
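A sketch of uncertainty-based sample selection in the spirit of the Word Entropy method above: score each unlabeled sentence by the mean entropy of its per-token label distributions and send the most uncertain ones for annotation. The array shapes and batch interface are illustrative assumptions.

```python
# Sketch: Word-Entropy-style uncertainty sampling for NER active learning.
import numpy as np

def word_entropy(token_probs: np.ndarray) -> float:
    """token_probs: (num_tokens, num_labels) softmax outputs for one sentence."""
    eps = 1e-12  # avoid log(0)
    entropies = -(token_probs * np.log(token_probs + eps)).sum(axis=1)
    return float(entropies.mean())

def select_for_annotation(batch_probs: list[np.ndarray], k: int) -> list[int]:
    """Return indices of the k sentences the model is least certain about."""
    scores = [word_entropy(p) for p in batch_probs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```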
76

Compressing Deep Learning models for Natural Language Understanding

Ait Lahmouch, Nadir January 2022 (has links)
Natural language processing (NLP) tasks have in recent years proven particularly effective when pre-trained language models such as BERT are used. However, the enormous computational resources required to train such models make them difficult to use in practice. To solve this problem, compression methods have been developed. In this project, some of these methods for compressing neural networks for text processing are studied, implemented, and tested. In our case, the most effective method was Knowledge Distillation, which consists of transferring knowledge from a large neural network, called the teacher, to a small neural network, called the student. There are several variants of this approach, differing in complexity, and we look at two of them in this project. The first transfers knowledge between a neural network and a smaller bidirectional LSTM using only the output of the larger model. The second, more complex method also encourages the student model to learn from the intermediate layers of the teacher model in order to extract knowledge. The final goal of this project is to give the company's data scientists ready-to-use compression methods for future projects that require deep neural networks for NLP. / Natural language processing (NLP) tasks have proven to be particularly effective when using pre-trained language models such as BERT. However, the enormous demand on computational resources required to train such models makes their use in the real world difficult. To overcome this problem, compression methods have emerged in recent years. In this project, some of these neural network compression approaches for text processing are studied, implemented, and tested. In our case, the most efficient method was Knowledge Distillation, which consists of transferring knowledge from a large neural network, called the teacher, to a small neural network, called the student. There are several variants of this approach, which differ in their complexity; we consider two of them in this project. The first allows knowledge transfer between any neural network and a smaller bidirectional LSTM, using only the output of the larger model. The second, more complex approach also encourages the student model to learn from the intermediate layers of the teacher model for incremental knowledge extraction. The ultimate goal of this project is to provide the company’s data scientists with ready-to-use compression methods for their future projects requiring the use of deep neural networks for NLP.
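A minimal sketch of the first, output-only distillation variant, assuming the standard soft-target recipe: the student minimizes a weighted mix of cross-entropy on gold labels and KL divergence toward the teacher's temperature-softened logits. The temperature and weighting below are common defaults, not the project's exact values.

```python
# Sketch: standard output-level knowledge distillation loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # Soft targets: match the teacher's distribution at temperature T.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2  # standard scaling from Hinton et al.
    # Hard targets: usual cross-entropy on the gold labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```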
77

Contextualising government reports using Named Entity Recognition

Aljic, Almir, Kraft, Theodor January 2020 (has links)
The science of making a computer understand and process text, natural language processing, is a topic of great interest among researchers. This study aims to further that research by comparing the BERT algorithm and classic logistic regression at identifying names of public organizations. The results show that BERT outperforms its competitor on this task, using data consisting of public state inquiries and reports. Furthermore, a literature study was conducted to explore how an NER system can be implemented into the management of an organization. The study found that there are many ways of doing such an implementation, but suggests three main areas of focus to ensure success: recognising the right entities, trust in the system, and presentation of data. / The science of making computers understand and work with free text, language technology, is a field that has become popular among researchers. This thesis seeks to extend that field by comparing BERT with logistic regression in detecting mentions of Swedish public agencies through NER. BERT shows better results than the logistic regression model at identifying the names of agencies in texts from state inquiries and reports. A literature study was also conducted to examine how an NER system can be implemented in an organization. The study showed that there are several ways to do this, but primarily suggests three areas to focus on for a successful implementation: using the right entities, trustworthiness of the system, and presentation of data.
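A sketch of how BERT-based NER of the kind compared above can be run with the Transformers token-classification pipeline. The Swedish checkpoint named below is an assumption about what is publicly available; the thesis trained its own model on state inquiries and reports.

```python
# Sketch: Swedish NER with a BERT token-classification pipeline.
# The checkpoint name is an assumed public model, not the thesis's own.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="KB/bert-base-swedish-cased-ner",  # assumed public checkpoint
    aggregation_strategy="simple",           # merge word pieces into spans
)

text = "Regeringen gav Skatteverket och Försäkringskassan i uppdrag att samverka."
for entity in ner(text):
    print(entity["word"], entity["entity_group"], round(entity["score"], 3))
```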
78

Evaluation of BERT-like models for small scale ad-hoc information retrieval / Utvärdering av BERT-liknande modeller för småskalig ad-hoc informationshämtning

Roos, Daniel January 2021 (has links)
Measuring semantic similarity between two sentences is an ongoing research field with big leaps being taken every year. This thesis looks at using modern methods of semantic similarity measurement for an ad-hoc information retrieval (IR) system. The main challenge tackled was answering the question "What happens when you don’t have situation-specific data?". Using encoder-based transformer architectures pioneered by Devlin et al., which excel at fine-tuning to situationally specific domains, this thesis shows just how well the presented methodology can work and makes recommendations for future attempts at similar domain-specific tasks. It also shows an example of how a web application can be created to make use of these fast-learning architectures.
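A hedged sketch of the core retrieval operation when no situation-specific training data exists: embed the query and documents with a pre-trained encoder and rank by cosine similarity. The encoder checkpoint and the toy documents are assumptions; the thesis evaluates several BERT-like encoders.

```python
# Sketch: zero-shot semantic ranking with a pre-trained sentence encoder.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

documents = [
    "The invoice module exports monthly reports as CSV.",
    "Password resets are handled by the identity service.",
]
query = "How do I export reports?"

doc_embs = encoder.encode(documents, convert_to_tensor=True)
q_emb = encoder.encode(query, convert_to_tensor=True)
scores = util.cos_sim(q_emb, doc_embs)[0]  # cosine similarity per document
best = int(scores.argmax())
print(documents[best], float(scores[best]))
```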
79

Knowledge Distillation of DNABERT for Prediction of Genomic Elements / Kunskapsdestillation av DNABERT för prediktion av genetiska attribut

Palés Huix, Joana January 2022 (has links)
Understanding the information encoded in the human genome and the influence of each part of the DNA sequence is a fundamental problem for society and may be key to unveiling the mechanisms of common diseases. With the latest technological developments in the genomics field, many research institutes have the tools to collect massive amounts of genomic data. Nevertheless, there is a lack of tools that can process and analyse these datasets in a biologically reliable and efficient manner. Many deep learning solutions have been proposed for current genomic tasks, but most of the time the main research interest lies in the underlying biological mechanisms rather than in high scores on the predictive metrics themselves. Recently, the state of the art in deep learning has shifted towards large transformer models, which use an attention mechanism that can be leveraged for interpretability. The main drawback of these large models is that they require a lot of memory and have high inference time, which may make their use unfeasible in practical applications. In this work, we test the appropriateness of knowledge distillation for obtaining smaller, equally performing models that genomic researchers can easily fine-tune to solve their scientific problems. DNABERT, a transformer model pre-trained on DNA data, is distilled following two strategies: DistilBERT and MiniLM. Four student models of different sizes are obtained and fine-tuned for promoter identification. They are evaluated on three key aspects: classification performance, usability, and biological relevance of the predictions. The latter is assessed by visually inspecting the attention maps of TATA-promoter predictions, which are expected to show a peak of attention at the well-known TATA motif present in these sequences. The results show that it is indeed possible to obtain significantly smaller models that perform equally well on the promoter identification task, with no major differences between the two techniques tested. The smallest distilled model loses less than 1% on all performance metrics evaluated (accuracy, F1 score, and Matthews correlation coefficient) and gains a 7.3x increase in inference speed, while having only 15% of DNABERT's parameters. The attention maps for the student models show that they successfully learn to mimic DNABERT's general understanding of DNA.
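A sketch of how attention maps like those inspected above can be extracted, assuming the Transformers API: request attentions in the forward pass and average over heads in the last layer. The generic English checkpoint stands in for DNABERT, which consumes k-merized DNA sequences instead.

```python
# Sketch: extracting and head-averaging attention maps for inspection.
# bert-base-uncased stands in for DNABERT; the mechanics are the same.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("TATA boxes mark many eukaryotic promoters",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one (batch, heads, seq, seq) tensor per layer.
last_layer = outputs.attentions[-1][0]   # (heads, seq, seq)
attention_map = last_layer.mean(dim=0)   # average over heads -> (seq, seq)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(tokens, attention_map.shape)
```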
80

Analysis of Syntactic Behaviour of Neural Network Models by Using Gradient-Based Saliency Method : Comparative Study of Chinese and English BERT, Multilingual BERT and RoBERTa

Zhang, Jiayi January 2022 (has links)
Neural network models such as the Transformer-based BERT, mBERT and RoBERTa achieve impressive performance (Devlin et al., 2019; Lewis et al., 2020; Liu et al., 2019; Raffel et al., 2020; Y. Sun et al., 2019), but we still know little about their inner workings due to the complex techniques they implement, such as multi-head self-attention. Attention is commonly taken as a crucial way to explain model outputs, but in recent years several studies have argued that attention may not provide faithful and reliable explanations (Jain and Wallace, 2019; Pruthi et al., 2020; Serrano and Smith, 2019; Wiegreffe and Pinter, 2019). Bastings and Filippova (2020) therefore propose that saliency may give better model interpretations, since it is designed to find which tokens contribute to the prediction, i.e. the exact goal of explanation. In this thesis, we investigate the extent to which syntactic structure is reflected in BERT, mBERT and RoBERTa trained on English and Chinese, using the gradient-based saliency method introduced by Simonyan et al. (2014), and we examine the dependencies that our models and baselines predict. We find that our models can predict some dependencies, especially those with shorter mean distance and more fixed positions of heads and dependents, even though in theory all our models can handle global dependencies. BERT usually has the highest overall accuracy at connecting dependents to their corresponding heads, followed by mBERT and RoBERTa, yet all three models in fact have similar results on individual relations. Moreover, models trained on English perform better than models trained on Chinese, possibly because of the flexibility of the Chinese language.
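A minimal sketch of the gradient-based saliency method of Simonyan et al. (2014) as applied to a BERT-style model: embed the tokens, take the gradient of a chosen output score with respect to the input embeddings, and use the per-token gradient norm as saliency. The checkpoint and the scored quantity (each position's own-token logit) are simplifying assumptions; the thesis scores dependency-related predictions.

```python
# Sketch: gradient saliency over input embeddings for a BERT model.
# Checkpoint and scoring target are illustrative assumptions.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("The keys to the cabinet are on the table",
                   return_tensors="pt")
# Embed tokens ourselves so we can take gradients w.r.t. the embeddings.
embeddings = model.get_input_embeddings()(inputs["input_ids"]).detach()
embeddings.requires_grad_(True)

outputs = model(inputs_embeds=embeddings,
                attention_mask=inputs["attention_mask"])
# Score each position's own-token logit and sum into a scalar target.
logits = outputs.logits[0]                                 # (seq, vocab)
target = logits.gather(1, inputs["input_ids"][0].unsqueeze(1)).sum()
target.backward()

saliency = embeddings.grad[0].norm(dim=-1)  # one score per token
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(list(zip(tokens, saliency.tolist())))
```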
