Spelling suggestions: "subject:"automated text classification"" "subject:"utomated text classification""
1 |
NOVA: Automated Detection of Violent Threats in Swedish Online EnvironmentsLindén, Kevin, Moshfegh, Arvin January 2023 (has links)
Social media and online environments have become an integral part of society, allowing for self-expression, information sharing, and discussions online. However, these platforms are also used to express hate and threats of violence. Violent threats online lead to negative consequences, such as an unsafe online environment, self-censorship, and endangering democracy. Manually detecting and moderating threats online is challenging due to the vast amounts of data uploaded daily. Scholars have called for efficient tools based on machine learning to tackle this problem. Another challenge is that few threat-focused datasets and models exist, especially for low-resource languages such as Swedish, making identifying and detecting threats challenging. Therefore, this study aims to develop a practical and effective tool to automatically detect and identify online threats in Swedish. A tailored Swedish threat dataset will be generated to fine-tune KBLab’s Swedish BERT model. The research question that guides this project is “How effective is a fine-tuned BERT model in classifying texts as threatening or non-threatening in Swedish online environments?”. To the authors’ knowledge, no existing model can detect threats in Swedish. This study uses design science research to develop the artifact and evaluates the artifact’s performance using experiments. The dataset will be generated during the design and development by manually annotating translated English, synthetic, and authentic Swedish data. The BERT model will be fine-tuned using hyperparameters from previous research. The generated dataset comprised 6,040 posts split into 39% threats and 61% non-threats. The model, NOVA, achieved good performance on the test set and in the wild - successfully differentiating threats from non-threats. NOVA achieved almost perfect recall but a lower precision - indicating room for improvement. NOVA might be too lenient when classifying threats, which could be attributed to the complexity and ambiguity of threats and the relatively small dataset. Nevertheless, NOVA can be used as a filter to identify threatening posts online among vast amounts of data.
|
2 |
Classification automatique de textes pour les revues de littérature mixtes en santéLanglois, Alexis 12 1900 (has links)
Les revues de littérature sont couramment employées en sciences de la santé pour justifier et interpréter les résultats d’un ensemble d’études. Elles permettent également aux chercheurs, praticiens et décideurs de demeurer à jour sur les connaissances. Les revues dites systématiques mixtes produisent un bilan des meilleures études portant sur un même sujet tout en considérant l’ensemble des méthodes de recherche quantitatives et qualitatives. Leur production est ralentie par la prolifération des publications dans les bases de données bibliographiques et la présence accentuée de travaux non scientifiques comme les éditoriaux et les textes d’opinion. Notamment, l’étape d’identification des études pertinentes pour l’élaboration de telles revues s’avère laborieuse et requiert un temps considérable. Traditionnellement, le triage s’effectue en utilisant un ensemble de règles établies manuellement. Dans cette étude, nous explorons la possibilité d’utiliser la classification automatique pour exécuter cette tâche.
La famille d’algorithmes ayant été considérée dans le comparatif de ce travail regroupe les arbres de décision, la classification naïve bayésienne, la méthode des k plus proches voisins, les machines à vecteurs de support ainsi que les approches par votes. Différentes méthodes de combinaison de caractéristiques exploitant les termes numériques, les symboles ainsi que les synonymes ont été comparés. La pertinence des concepts issus d’un méta-thésaurus a également été mesurée.
En exploitant les résumés et les titres d’approximativement 10 000 références, les forêts d’arbres de décision admettent le plus haut taux de succès (88.76%), suivies par les machines à vecteurs de support (86.94%). L’efficacité de ces approches devance la performance des filtres booléens conçus pour les bases de données bibliographiques. Toutefois, une sélection judicieuse des entrées de la collection d’entraînement est cruciale pour pallier l’instabilité du modèle final et la disparité des méthodologies quantitatives et qualitatives des études scientifiques existantes. / The interest of health researchers and policy-makers in literature reviews has continued to increase over the years. Mixed studies reviews are highly valued since they combine results from the best available studies on various topics while considering quantitative, qualitative and mixed research methods. These reviews can be used for several purposes such as justifying, designing and interpreting results of primary studies. Due to the proliferation of published papers and the growing number of nonempirical works such as editorials and opinion letters, screening records for mixed studies reviews is time consuming. Traditionally, reviewers are required to manually identify potential relevant studies. In order to facilitate this process, a comparison of different automated text classification methods was conducted in order to determine the most effective and robust approach to facilitate systematic mixed studies reviews.
The group of algorithms considered in this study combined decision trees, naive Bayes classifiers, k-nearest neighbours, support vector machines and voting approaches. Statistical techniques were applied to assess the relevancy of multiple features according to a predefined dataset. The benefits of feature combination for numerical terms, synonyms and mathematical symbols were also measured. Furthermore, concepts extracted from a metathesaurus were used as additional features in order to improve the training process.
Using the titles and abstracts of approximately 10,000 entries, decision trees perform the best with an accuracy of 88.76%, followed by support vector machine (86.94%). The final model based on decision trees relies on linear interpolation and a group of concepts extracted from a metathesaurus. This approach outperforms the mixed filters commonly used with bibliographic databases like MEDLINE. However, references chosen for training must be selected judiciously in order to address the model instability and the disparity of quantitative and qualitative study designs.
|
Page generated in 0.1564 seconds