  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Exploring the Relationship Between Vocabulary Scaling and Algorithmic Performance in Text Classification for Large Datasets

Fearn, Wilson Murray 05 December 2019 (has links)
Text analysis is a significant branch of natural language processing, and includes many different sub-fields such as topic modeling, document classification, and sentiment analysis. Unsurprisingly, those who do text analysis are concerned with the runtime of their algorithms. Some of these algorithms have runtimes that depend jointly on the size of the corpus being analyzed, as well as the size of that corpus's vocabulary. Trivially, a user may reduce the amount of data they feed into their model to speed it up, but we assume that users will be hesitant to do this as more data tends to lead to better model quality. On the other hand, when the runtime also depends on the vocabulary of the corpus, a user may instead modify the vocabulary to attain a faster runtime. Because elements of the vocabulary also add to model quality, this puts users in the position of needing to modify the corpus vocabulary in order to reduce the runtime of their algorithm while maintaining model quality. To this end, we look at the relationship between model quality and runtime for text analysis by examining the effect that current techniques in vocabulary reduction have on algorithmic runtime, and comparing that with their effect on model quality. Despite the fact that this is an important relationship to investigate, it appears little work has been done in this area. We find that most preprocessing methods do not have much of an effect on more modern algorithms, but proper rare word filtering gives the best results, in the form of significant runtime reductions together with slight improvements in accuracy and a vocabulary size that scales efficiently as we increase the size of the data.
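The rare-word filtering the abstract highlights can be sketched as follows. This is a minimal illustration, not the thesis's implementation; the `min_count` threshold and the toy corpus are assumptions chosen for demonstration:

```python
from collections import Counter

def filter_rare_words(docs, min_count=2):
    """Drop tokens that appear fewer than min_count times across the corpus.

    This shrinks the vocabulary (and hence the runtime of vocabulary-dependent
    algorithms) while, per the abstract's findings, often leaving model
    quality intact or slightly improved.
    """
    counts = Counter(tok for doc in docs for tok in doc)
    vocab = {tok for tok, c in counts.items() if c >= min_count}
    return [[tok for tok in doc if tok in vocab] for doc in docs], vocab

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "rare", "typo"]]
filtered, vocab = filter_rare_words(docs, min_count=2)
# singletons such as "cat", "dog", "rare" are removed; "the" and "sat" survive
```

In practice the threshold is tuned against held-out accuracy, since the trade-off between vocabulary size and model quality is exactly what the thesis measures.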
2

A Rule-Based Normalization System for Greek Noisy User-Generated Text

Toska, Marsida January 2020 (has links)
The ever-growing usage of social media platforms generates vast amounts of textual data daily, which could potentially serve as a great source of information. Therefore, mining user-generated data for commercial, academic, or other purposes has already attracted the interest of the research community. However, the informal writing which often characterizes online user-generated texts poses a challenge for automatic text processing with Natural Language Processing (NLP) tools. To mitigate the effect of noise in these texts, lexical normalization has been proposed as a preprocessing method, which, in short, is the task of converting non-standard word forms into canonical ones. The present work aims to contribute to this field by developing a rule-based normalization system for Greek tweets. We perform an analysis of the categories of out-of-vocabulary (OOV) word forms identified in the dataset and define hand-crafted rules, which we combine with edit distance (the Levenshtein distance approach) to tackle noise in the cases under scope. To evaluate the performance of the system, we perform both an intrinsic and an extrinsic evaluation, the latter exploring the effect of normalization on part-of-speech tagging. The results of the intrinsic evaluation suggest that our system has an accuracy of approx. 95%, compared to approx. 81% for the baseline. In the extrinsic evaluation, a boost of approx. 8% in tagging performance is observed when the text has been preprocessed through lexical normalization.
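A normalization step of the kind described, pairing a lexicon lookup with Levenshtein distance, might look like the minimal sketch below. The `LEXICON` contents and the `max_dist` threshold are illustrative assumptions, and the thesis's hand-crafted rules for Greek are not reproduced here:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# An illustrative in-vocabulary word list; a real system would use a full lexicon.
LEXICON = {"hello", "world", "normalization"}

def normalize(token, max_dist=2):
    """Map an OOV token to its closest in-vocabulary word within max_dist edits,
    leaving it unchanged if nothing is close enough."""
    if token in LEXICON:
        return token
    best = min(LEXICON, key=lambda w: levenshtein(token, w))
    return best if levenshtein(token, best) <= max_dist else token
```

For example, `normalize("helo")` maps to `"hello"`, one deletion away, while a token far from every lexicon entry is passed through untouched.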
3

Τεχνικές και μηχανισμοί συσταδοποίησης χρηστών και κειμένων για την προσωποποιημένη πρόσβαση περιεχομένου στον Παγκόσμιο Ιστό / Techniques and mechanisms for clustering users and texts for personalized content access on the World Wide Web

Τσόγκας, Βασίλειος 16 April 2015 (has links)
With the reality of the ever-increasing information sources on the internet, both in size and in indexed content, it becomes necessary to have mechanisms that assist users in getting the information they need, exactly the moment they need it. The delivery of content personalized to user needs is deemed a necessity nowadays, due to the combinatorial explosion of information visible in every corner of the world wide web. Swift and effective solutions are needed to tame this information overload, solutions achievable only through analysis of the problems at hand and the application of modern mathematical and computational methods. This Ph.D. dissertation aims at the design, development, and finally the evaluation of mechanisms and novel algorithms from the areas of information retrieval, natural language processing, and machine learning, which provide a high level of filtering of internet information for the end user. More precisely, across the various stages of information processing, techniques and mechanisms are developed that gather, index, filter, and return textual content from the world wide web well suited to users' tastes — techniques and mechanisms that aim to go beyond the usual information-delivery norms of today. The kernel of this dissertation is the development of a clustering mechanism that operates both on news articles and on web users. Within this context, several classical clustering algorithms were studied and evaluated for the case of news articles, allowing us to estimate how effective each algorithm is within this domain of interest and giving us a clear choice of which algorithm to extend for our work. As a second phase, we formulated a clustering algorithm for news articles that makes use of the external knowledge base WordNet and is adapted to the requirements of diversity and quick churn of news articles originating from the web. Another central goal of this dissertation is the modeling of the browsing behavior of users within the context of our recommendation system, as well as the automatic evaluation of these behaviors, with the desired outcome of predicting the preferences users will express in the future. User modeling has direct application to personalization through the prediction of user preferences. As a result, we formulated a personalization algorithm that takes into consideration a plethora of parameters that indirectly reveal user preferences.
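Setting aside the WordNet enrichment the dissertation adds, the core news-article clustering could be approximated by a simple single-pass ("leader") clustering over bag-of-words cosine similarity. The similarity threshold and the toy articles below are illustrative assumptions, not details from the dissertation:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def leader_cluster(docs, threshold=0.3):
    """Single-pass clustering: each article joins the first cluster whose
    leader it resembles closely enough, otherwise it starts a new cluster.
    This suits streams of quickly churning news articles."""
    leaders, clusters = [], []
    for i, doc in enumerate(docs):
        vec = Counter(doc)
        for k, leader in enumerate(leaders):
            if cosine(vec, leader) >= threshold:
                clusters[k].append(i)
                break
        else:
            leaders.append(vec)
            clusters.append([i])
    return clusters

articles = [
    "stocks fall as markets react to rates".split(),
    "markets and stocks fall on rate fears".split(),
    "local team wins football championship".split(),
]
# the two finance stories share enough terms to cluster; the sports story stands alone
```

A WordNet-aware variant would additionally count two articles as similar when they use different words from the same synset, which is the kind of adaptation the dissertation describes.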
4

Statistické metody ve stylometrii / Statistical methods in stylometry

Dupal, Pavel January 2017 (has links)
The aim of this thesis is to provide an overview of some commonly used methods in the area of authorship attribution (stylometry). The text begins with a recap of the field's history from the end of the 19th century to the present, and the required terminology from text mining is presented and explained. What follows is a list of selected methods from multidimensional statistics (principal component analysis, cluster analysis) and machine learning (Support Vector Machines, Naive Bayes) and their application to stylometric problems, including several methods created specifically for use in this field (bootstrap consensus tree, contrast analysis). Finally, these same methods are applied to a practical problem of authorship verification, based on a corpus built from the works of four internet writers.
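A toy illustration of the stylometric idea underlying the methods listed: authors differ measurably in their rates of common function words. The sketch below uses a simplified nearest-profile classifier standing in for the multivariate techniques named; the function-word list and texts are assumptions for demonstration only:

```python
from collections import Counter

# A short illustrative function-word list; real stylometry uses hundreds.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "a", "that", "it"]

def profile(text):
    """Relative frequency of common function words — a classic stylometric feature."""
    toks = text.lower().split()
    counts = Counter(toks)
    n = len(toks) or 1
    return [counts[w] / n for w in FUNCTION_WORDS]

def attribute(unknown, candidates):
    """Attribute a text to the author whose mean profile is closest (L1 distance),
    a simplified relative of Burrows-style delta measures."""
    u = profile(unknown)
    def dist(author_texts):
        mean = [sum(col) / len(author_texts)
                for col in zip(*map(profile, author_texts))]
        return sum(abs(a, ) if False else abs(a - b) for a, b in zip(u, mean))
    return min(candidates, key=lambda author: dist(candidates[author]))
```

Replacing the nearest-profile step with an SVM or Naive Bayes classifier over the same feature vectors gives the machine-learning variants the thesis surveys.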
5

Exploring NMF and LDA Topic Models of Swedish News Articles

Svensson, Karin, Blad, Johan January 2020 (has links)
The ability to automatically analyze and segment news articles by their content is a growing research field. This thesis explores the unsupervised machine learning method of topic modeling, applied to Swedish news articles to generate topics that describe and segment articles. Specifically, the algorithms non-negative matrix factorization (NMF) and latent Dirichlet allocation (LDA) are implemented and evaluated. Their usefulness in the news media industry is assessed by their ability to serve as a uniform categorization framework for news articles. This thesis fills a research gap by studying the application of topic modeling to Swedish news articles and contributes by showing that this can yield meaningful results. It is shown that Swedish text data requires extensive data preparation for successful topic models, and that nouns exclusively — especially common nouns — are the most suitable words to use. Furthermore, the results show that both NMF and LDA are valuable as content analysis tools and categorization frameworks, but they have different characteristics and are hence optimal for different use cases. Lastly, the conclusion is that topic models have issues, since they can generate unreliable topics that could be misleading for news consumers, but that they can nonetheless be powerful methods for organizations to analyze and segment articles efficiently at a grand scale internally. The thesis project is a collaboration with one of Sweden's largest media groups, and its results have led to a topic modeling implementation for large-scale content analysis to gain insight into readers' interests.
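NMF, one of the two topic models compared, can be sketched with the classic Lee–Seung multiplicative-update rules: a nonnegative term-document matrix V is factored as V ≈ W·H, where columns of W are topics and H gives per-document topic weights. The toy matrix and iteration count below are assumptions for illustration, not the thesis's setup:

```python
import numpy as np

def nmf(V, k, iters=500, seed=0):
    """Multiplicative-update NMF: factor V ~= W @ H with W, H >= 0.
    Updates preserve nonnegativity because factors start positive and are
    only ever multiplied by nonnegative ratios."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + 1e-3
    H = rng.random((k, m)) + 1e-3
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# Toy term-document matrix of rank 2: two disjoint "topics" (finance vs sports, say)
V = np.array([[2, 1, 0, 0],
              [4, 2, 0, 0],
              [0, 0, 1, 3],
              [0, 0, 2, 6]], dtype=float)
W, H = nmf(V, k=2)
err = np.linalg.norm(V - W @ H)   # small once the two topics are recovered
```

LDA differs in being a probabilistic generative model fitted by variational inference or Gibbs sampling rather than matrix factorization, which is one source of the differing characteristics the thesis reports.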
6

Metody stemmingu používané při dolování textu / Stemming Methods Used in Text Mining

Adámek, Tomáš January 2010 (has links)
The main theme of this master's thesis is a description of text mining. The document focuses on English texts and their automatic preprocessing. The main part of the thesis analyses various stemming algorithms (Lovins, Porter, and Paice/Husk). Stemming is a procedure for automatically conflating semantically related terms via the use of rule sets. The next part of the thesis describes the design of an application supporting the various stemming algorithms. The application is based on the Java platform, using the Swing graphics library and the MVC architecture. The following chapter describes the implementation of the application and of the stemming algorithms. The last part of the thesis describes experiments with the stemming algorithms and compares them with respect to text classification results.
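Rule-based suffix stripping of the kind these algorithms perform can be sketched as follows. The rule list is a toy subset chosen for illustration; real Lovins/Porter/Paice-Husk rule sets are far larger and include stem-measure conditions this sketch omits:

```python
# Ordered rules, longest suffixes first, each mapping a suffix to a replacement.
SUFFIX_RULES = [
    ("ational", "ate"),
    ("ization", "ize"),
    ("fulness", "ful"),
    ("iveness", "ive"),
    ("ies", "y"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]

def stem(word):
    """Strip the first matching suffix, keeping at least a 3-letter stem.
    A toy stand-in for the measure conditions real stemmers apply."""
    for suffix, repl in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + repl
    return word
```

So "relational" conflates to "relate" and "ponies" to "pony", while a full Porter stemmer would additionally handle cases like the doubled consonant in "running".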
7

Zpracování uživatelských recenzí / Processing of User Reviews

Cihlářová, Dita January 2019 (has links)
Very often, people buy goods on the Internet that they cannot see or try, and therefore rely on reviews from other customers. However, there may be too many reviews for a human to process them quickly and comfortably. The aim of this work is to offer an application that can recognize, in Czech reviews, which features of a product are most commented on and whether the commentary is positive or negative. The results can save e-shop customers a lot of time and provide interesting feedback to product manufacturers.
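A lexicon-based sketch of the general idea — attributing each sentiment word to the nearest product-feature mention in the sentence. The aspect terms and sentiment lexicons below are illustrative English assumptions; the thesis's actual method for Czech is not reproduced here:

```python
POSITIVE = {"great", "excellent", "good", "fast"}
NEGATIVE = {"bad", "slow", "poor", "broken"}
ASPECTS = {"battery", "screen", "price", "delivery"}  # hypothetical product features

def aspect_sentiment(review):
    """Score each mentioned aspect by summing the polarity of sentiment words
    attributed to it (nearest aspect term in the same sentence)."""
    results = {}
    for sentence in review.lower().replace("!", ".").split("."):
        toks = sentence.split()
        aspects = [(i, t) for i, t in enumerate(toks) if t in ASPECTS]
        for i, t in enumerate(toks):
            polarity = 1 if t in POSITIVE else -1 if t in NEGATIVE else 0
            if polarity and aspects:
                nearest = min(aspects, key=lambda a: abs(a[0] - i))[1]
                results[nearest] = results.get(nearest, 0) + polarity
    return results
```

For instance, "the battery is great. delivery was slow" yields a positive score for the battery and a negative one for delivery; a production system would replace the lexicons with a trained classifier.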
8

Evaluating the robustness of DistilBERT to data shift in toxicity detection / Evaluering av DistilBERTs robusthet till dataskifte i en kontext av identifiering av kränkande språk

Larsen, Caroline January 2022 (has links)
With the rise of social media, cyberbullying and the online spread of hate have become serious problems with devastating consequences. Mentimeter is an interactive presentation tool enabling the presentation audience to participate by typing their own answers to questions asked by the presenter. As the Mentimeter product is commonly used in schools, there is a need for a strong toxicity detection program that filters out offensive and profane language. This thesis focuses on text pre-processing and robustness to data shift within the problem domain of toxicity detection for English text. Initially, it is investigated whether lemmatization, spelling correction, and removal of stop words are suitable pre-processing strategies for toxicity detection. The pre-trained DistilBERT model was fine-tuned using an English Twitter dataset that had been pre-processed with a number of different techniques. The results indicate that none of the above-mentioned strategies has a positive impact on model performance. Lastly, modern methods are applied to train a toxicity detection model adjusted to anonymous Mentimeter user text data. For this purpose, a balanced Mentimeter dataset with 3654 datapoints was created and annotated by the thesis author. The best-performing model from the pre-processing experiment was iteratively fine-tuned and evaluated with an increasing amount of Mentimeter data. Based on the results, it is concluded that state-of-the-art performance can be achieved even with relatively few datapoints for fine-tuning: with around 500–2500 training datapoints, F1-scores between 0.90 and 0.94 were obtained on a Mentimeter test set. These results show that it is possible to create a customized, high-performing toxicity detection program using just a small dataset.
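The F1-score used to report the results above is the harmonic mean of precision and recall on the positive (toxic) class; a minimal self-contained sketch:

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class: 2PR / (P + R), with P = tp/(tp+fp)
    and R = tp/(tp+fn). Returns 0.0 for degenerate denominators."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Because F1 ignores true negatives, it is a common choice for imbalanced detection tasks like toxicity filtering, where the negative (non-toxic) class dominates.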
