Global ETD Search

11	Using WordNet Synonyms and Hypernyms in Automatic Topic Detection Wargärde, Nicko January 2020 Detecting topics by extracting keywords from written text using TF-IDF has been studied and successfully used in many applications. Adding a semantic layer to TF-IDF-based topic detection using WordNet synonyms and hypernyms has been explored in document clustering, either by assigning concepts that describe texts or by adding all synonyms and hypernyms of the occurring words to a list of keywords. This paper explores a new method in which TF-IDF scores are calculated and the TF-IDF scores of WordNet synset members are added together for all occurring synonyms and/or hypernyms. The approach is evaluated by comparing keywords extracted using TF-IDF and the proposed method, SynPlusTF-IDF, against manually assigned keywords in a database of scientific abstracts. Because topic detection is widely used in many contexts and applications, improving current methods is of great value: they can become more accurate at extracting correct and relevant keywords from written text. An experiment was conducted comparing the two methods, with accuracy measured using precision, recall, and F1-scores. The F1-scores ranged from 0.11131 to 0.14264 for different variables. The results show that SynPlusTF-IDF is not better at topic detection than TF-IDF, and that both methods performed poorly at topic detection with the chosen dataset. topic detection TF-IDF SynPlusTF-IDF keyword extraction WordNet synsets synonyms hypernyms Computer Sciences
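The idea of adding synset members' TF-IDF scores together can be illustrated with a small sketch. This is not the thesis's implementation: the TF-IDF weighting, the function names `tf_idf` and `add_synset_scores`, and the hand-written synonym group are all assumptions made for illustration, and a real system would look the synsets up in WordNet.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per tokenized document (plain tf * log(N/df))."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return scores

def add_synset_scores(doc_scores, synsets):
    """Give every occurring synset member the summed score of all occurring members."""
    boosted = dict(doc_scores)
    for members in synsets:
        present = [m for m in members if m in doc_scores]
        total = sum(doc_scores[m] for m in present)
        for m in present:
            boosted[m] = total
    return boosted

# Hypothetical toy data; the synonym group stands in for a WordNet synset.
docs = [["car", "automobile", "engine"], ["engine", "fuel"], ["bank", "loan"]]
synsets = [{"car", "automobile"}]
plain = tf_idf(docs)
boosted = add_synset_scores(plain[0], synsets)
```

In this toy example "car" and "automobile" each receive the sum of their two scores, while "engine", which has no occurring synonym, keeps its original score.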
12	Towards Cyberbullying-free social media in smart cities: a unified multi-modal approach Kumari, K., Singh, J.P., Dwivedi, Y.K., Rana, Nripendra P. 27 September 2020 Smart cities are shifting people's presence from the physical world to the cyber world (cyberspace). Along with the benefits for society, the troubles of the physical world, such as bullying, aggression and hate speech, are also establishing a strong presence in cyberspace. This paper aims to mine social media posts to identify bullying comments that contain text as well as images. We propose a unified representation of text and image together, which eliminates the need for separate learning modules for image and text. A single-layer Convolutional Neural Network model is used with the unified representation. The major finding of this research is that text represented as an image is a better model for encoding the information. We also found that a single-layer Convolutional Neural Network gives better results with a two-dimensional representation. In the current setup, we used three layers of text and three layers of a colour image to represent the input, which gives a recall of 74% for the bullying class with one layer of Convolutional Neural Network. / Ministry of Electronics and Information Technology (MeitY), Government of India Convolutional Neural network Cyberbullying Deep learning Online social network TF-IDF
13	Automatisk extraktion av nyckelord ur ett kundforum / Automatic keyword extraction from a customer forum Ekman, Sara January 2018 Conversations in a customer forum span different topics, and the language is inconsistent. The texts do not meet the requirements usually placed on material for automatic keyword extraction. This essay examines how keywords can be extracted automatically from a customer forum despite these difficulties. The study focuses on three aspects of keyword extraction. The first concerns how the established keyword extraction method TFIDF performs compared to four methods created with the unusual structure of the material in mind. The second is whether different ways of calculating word frequency affect the result. The third is how the methods perform when they use only posts, only titles, or both text types in their extractions. Non-parametric tests were used to evaluate the extractions. A number of Friedman tests show that the methods in some cases differ in their ability to identify relevant keywords. In post-hoc tests between the highest-performing methods, one of the new methods in one case performs significantly better than the other new methods, but not better than TFIDF. No difference was found between the use of different text types or ways of calculating word frequency. For future research, reliability testing of manually annotated keywords is recommended. A larger sample should be used than in the current study, and various suggestions are given for improving the scoring of extracted keywords. Automatic keyword extraction Information extraction Noisy text TFIDF User generated text General Language Studies and Linguistics
14	Categorization of Swedish e-mails using Supervised Machine Learning / Kategorisering av svenska e-postmeddelanden med användning av övervakad maskininlärning Mann, Anna, Höft, Olivia January 2021 Society today is becoming more digitalized, and e-mail is a common means of communication. The company Auranest currently has a filter for categorizing e-mails, but it is a few years old. The filter sorts out e-mails that are valuable to jobseekers, that is, e-mails through which employers may make contact. The company wants to know whether the categorization can be performed with a different method and improved. This degree project investigates whether the categorization can be performed with higher accuracy using machine learning. Three supervised machine learning algorithms, Naïve Bayes, Support Vector Machine (SVM), and Decision Tree, were examined, and the algorithm with the highest results was compared with Auranest's existing filter. Accuracy, precision, recall, and F1 score were used to determine which machine learning algorithm achieved the highest results, both among the algorithms and in comparison with Auranest's filter. The results showed that SVM achieved the best results in all metrics. The comparison between Auranest's existing filter and SVM showed that SVM performed better in all calculated metrics, with an accuracy of 99.5% for SVM and 93.03% for Auranest's filter. Accuracy was the only metric on which the two produced similar results; for the other metrics, there was a noticeable difference. Classification categorization e-mails preprocessing TF-IDF machine learning supervised learning Naïve Bayes Support Vector Machine Decision Tree Computer Sciences
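The four metrics named in this abstract can be computed from a confusion matrix. The sketch below is only an illustration: the function name `binary_metrics` and the label vectors are hypothetical and are not data from the thesis.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a single positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: 1 = e-mail relevant to a jobseeker, 0 = everything else.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
scores = binary_metrics(y_true, y_pred)
```

With these made-up labels the accuracy is 0.75 while precision, recall and F1 are each 2/3, which shows how accuracy alone can look better than the class-specific metrics when the positive class is the minority.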
15	@TheRealDonaldTrump’s tweets correlation with stock market volatility / @TheRealDonaldTrump's tweets korrelation med volatiliteten på aktiemarkanden Olofsson, Isak January 2020 The purpose of this study is to analyze whether any tweet-specific data posted by Donald Trump correlates with the volatility of the stock market. If any properties of President Trump's tweets correlate with volatility, the goal is to find a subset of regressors with the highest possible predictability. The content of the tweets is used as the basis for the regressors. The method used is multiple linear regression, with tweet and volatility data ranging from 2010 to 2020. The Cboe VIX is used as the measure of volatility, and the regressors in the model focus on the content of Trump's tweets, with TF-IDF used to transform words into numerical values. The results of the study imply that the chosen regressors display a small but significant correlation, with an adjusted R² = 0.4501, between Trump's tweets and market volatility. The findings include 78 words that correlate with stock market volatility when they are part of President Trump's tweets. The stock market is a large and complex system with many unknowns, which complicates the process of simplifying and quantifying data from only one source into a regression model with high predictability. Donald Trump Volatility Cboe VIX Twitter Stock Market TF-IDF Regression analysis Statistics Applied mathematics Financial mathematics Probability Theory and Statistics
16	A comparison of different methods in their ability to compare semantic similarity between articles and press releases / En jämförelse av olika metoder i deras förmåga att jämföra semantisk likhet mellan artiklar och pressmeddelanden Andersson, Julius January 2022 The goal of a press release is to have its information spread as widely as possible. A suitable way to distribute the information is to target journalists who are likely to spread it further. Deciding which journalists to target has traditionally been done manually, without intelligent digital assistance, and has therefore been a time-consuming task. Machine learning can assist the user by predicting a ranking of journalists based on the article each has written that is most semantically similar to the press release. The purpose of this thesis was to compare different methods in their ability to measure semantic similarity between articles and press releases when used for the task of ranking journalists. Three methods were chosen for comparison: (1) TF-IDF with cosine similarity, (2) TF-IDF with soft-cosine similarity, and (3) sentence mover’s distance (SMD) with SBERT. Based on the proposed heuristic success metric, both TF-IDF methods outperformed the SMD method. The best-performing method was TF-IDF with soft-cosine similarity. Semantic similarity TF-IDF SBERT Cosine similarity Soft-cosine similarity Sentence mover’s distance Computer and Information Sciences
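The first of the three compared methods, TF-IDF with cosine similarity, can be sketched as follows. This is a minimal illustration and not the thesis's pipeline: the whitespace tokenization, the smoothed IDF formula, the function names and the sample texts are all assumptions. Each journalist is scored by the single article most similar to the press release, as the abstract describes.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Sparse TF-IDF vectors; the smoothed IDF is an assumed weighting choice."""
    n = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    return [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in token_lists]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def rank_journalists(press_release, articles_by_journalist):
    """Rank journalists by their single most similar article to the press release."""
    names, texts = [], [press_release.lower().split()]
    for name, articles in articles_by_journalist.items():
        for article in articles:
            names.append(name)
            texts.append(article.lower().split())
    vectors = tfidf_vectors(texts)
    best = {}
    for name, vec in zip(names, vectors[1:]):
        best[name] = max(best.get(name, 0.0), cosine(vectors[0], vec))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

# Hypothetical sample data.
ranking = rank_journalists(
    "new electric car battery launch",
    {"alice": ["electric car battery review", "city council vote"],
     "bob": ["football match report"]},
)
```

Soft-cosine similarity would additionally credit related but non-identical terms through a term-similarity matrix, which this sketch does not attempt.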
17	Maskininlärning för dokumentklassificering av finansiella dokument med fokus på fakturor / Machine Learning for Document Classification of Financial Documents with Focus on Invoices Khalid Saeed, Nawar January 2022 Automated document classification is an essential technique that aims to process and manage documents in digital form. Many companies strive for a text classification methodology that can solve a plethora of problems. One of these problems is classifying and organizing a massive number of documents based on a set of predefined categories. This thesis aims to help Medius, a company that works with invoice workflow, to classify the documents processed in its invoice workflow into invoices and non-invoices. This was accomplished by implementing and evaluating various machine learning classification methods in terms of their accuracy and efficiency for the task of financial document classification, where only invoices are of interest. Furthermore, the pre-processing steps necessary for achieving good performance are considered when evaluating the classification methods. In this study, two document representation methods, Term Frequency Inverse Document Frequency (TF-IDF) and Doc2Vec, were used to represent the documents as fixed-length vectors. The representation aims to reduce the complexity of the documents and make them easier to handle. In addition, three classification methods were used to automate the document classification process for invoices: Logistic Regression, Multinomial Naïve Bayes and Support Vector Machine. The results indicate that all classification methods that used TF-IDF to represent the documents as vectors give high performance and accuracy. The accuracy of all three classification methods is over 90%, which was the requirement for the study to be considered successful. Moreover, Logistic Regression appears to cope with this task more easily than the other methods, classifying the documents more efficiently. A test on real data flowing into Medius’ invoice workflow shows that Logistic Regression is able to correctly classify up to 96% of the documents. In conclusion, Logistic Regression together with TF-IDF is determined to be the most appropriate of the tested methods overall. Doc2Vec, by contrast, fails to provide a good result because the data set is not suited to, or sufficient for, the method to work well. Document classification Text classification Invoices NLP TF-IDF Doc2vec Machine Learning Logistic Regression Multinomial Naïve Bayes Support Vector Machine. Computer Sciences
18	A Method for Recommending Computer-Security Training for Software Developers Nadeem, Muhammad 12 August 2016 Vulnerable code may cause security breaches in software systems, resulting in financial and reputational losses for organizations in addition to the loss of their customers’ confidential data. Delivering proper software security training to software developers is key to preventing such breaches. Conventional training methods do not take into account the code written by the developers over time, which makes these training sessions less effective. We propose a method for recommending computer-security training that helps identify focused and narrow areas in which developers need training. The proposed method leverages static analysis techniques, using the vulnerabilities flagged in the source code as the basis for suggesting the most appropriate training topics to different software developers. Moreover, it uses public vulnerability repositories as its knowledge base to suggest community-accepted solutions to different security problems. Such mitigation strategies are platform independent, which further strengthens the utility of the system. This research discusses the proposed architecture of the recommender system, case studies to validate the system architecture, tailored algorithms to improve the performance of the system, and a human-subject evaluation conducted to determine the usefulness of the system. Our evaluation suggests that the proposed system successfully retrieves relevant training articles from the public vulnerability repository. The human subjects found these articles to be suitable for training. The human subjects also found the proposed recommender system as effective as a commercial tool. FindBugs Static code analysis Jaccard index tf–idf training NVD CWE software vulnerabilities software security Recommender system
19	Recommending Answers to Math Questions Using KL-Divergence and the Approximate XML Tree Matching Approach Gao, Siqi 30 May 2023 Mathematics is the science and study of quantity, structure, space, and change. It seeks out patterns, formulates new conjectures, and establishes the truth by rigorous deduction from appropriately chosen axioms and definitions. The study of mathematics makes a person better at solving problems. It gives someone skills that (s)he can use across other subjects and apply in many different job roles. In the modern world, builders use mathematics every day to do their work, since construction workers add, subtract, divide, multiply, and work with fractions. Mathematics is clearly a major contributor to many areas of study. For this reason, retrieving, ranking, and recommending Math answers, which is an application of Math information retrieval (IR), deserves attention and recognition: a reliable recommender system helps users find the relevant answers to Math questions and benefits all Math learners whenever they need help solving a Math problem, regardless of the time and place. Such a recommender system can enhance the learning experience and enrich its users' knowledge of Math. We have developed MaRec, a recommender system that retrieves and ranks Math answers based on their textual content and embedded formulas in answering a Math question. MaRec (i) applies KL-divergence to rank the textual content of a potential answer A with respect to the textual content of a Math question Q, and (ii) represents the Math formulas in Q and A as XML trees and determines their subtree matching scores in ranking A as an answer to Q. The design of MaRec is simple, since it does not require the training and test process mandated by machine learning-based Math IR systems, which is tedious to set up and time-consuming to run. Empirical studies show that MaRec significantly outperforms (i) three existing state-of-the-art Math IR systems based on an offline evaluation, and (ii) a top-of-the-line machine learning system based on an online performance analysis. Math IR system formula search tree matching KL-divergence recommender systems TF-IDF diversity Physical Sciences and Mathematics
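Ranking answer text by KL-divergence can be sketched with smoothed unigram language models. The abstract does not give MaRec's exact formulation, so everything below is an assumption for illustration: the add-alpha smoothing, the direction of the divergence (question relative to answer), the whitespace tokenization, the function names and the sample texts. A lower divergence is treated as a better match.

```python
import math
from collections import Counter

def language_model(tokens, vocabulary, alpha=1.0):
    """Add-alpha smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {t: (counts[t] + alpha) / total for t in vocabulary}

def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)

def rank_answers(question, answers):
    """Order answers by increasing KL divergence from the question's model."""
    q_tokens = question.lower().split()
    tokenized = [a.lower().split() for a in answers]
    vocabulary = set(q_tokens).union(*tokenized)
    q_model = language_model(q_tokens, vocabulary)
    scored = [
        (kl_divergence(q_model, language_model(tokens, vocabulary)), answer)
        for tokens, answer in zip(tokenized, answers)
    ]
    return [answer for _, answer in sorted(scored)]

# Hypothetical sample data.
ranked = rank_answers(
    "how to solve a quadratic equation",
    ["bake the cake for one hour",
     "use the quadratic formula to solve the equation"],
)
```

Smoothing over a shared vocabulary keeps every probability positive, so the logarithm is always defined. The formula-matching half of the system (XML subtree matching) is not covered by this sketch.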
20	Sumarizace českých textů z více zdrojů / Multi-source Text Summarization for Czech Brus, Tomáš January 2012 This work focuses on the summarization task for a set of articles on the same topic. It discusses several possible approaches to summarization and ways to assess the quality of the resulting summaries. Part of this work is an implementation of the described algorithms and their application to selected texts. The input texts come from several Czech news servers and are represented as deep syntactic trees (the so-called tectogrammatical layer).
