101

Cyberbullying detection in Urdu language using machine learning

Khan, Sara, Qureshi, Amna 11 January 2023 (has links)
Cyberbullying has become a significant problem with the surge in the use of social media. The most basic way to prevent cyberbullying on social media platforms is to identify and remove offensive comments, but it is hard for humans to read and moderate all comments manually. Current research therefore focuses on using machine learning to detect and eliminate cyberbullying. However, while most of this work has been conducted on English text, little to no work can be found for Urdu. This paper aims to detect cyberbullying in users' Urdu-language comments posted on Twitter using machine learning and Natural Language Processing (NLP) techniques. To the best of our knowledge, cyberbullying detection on Urdu text comments has not been attempted before, owing to the lack of a publicly available standard Urdu dataset. We therefore create a dataset of offensive user-generated Urdu comments from Twitter, with the comments classified into five categories. N-gram techniques are used to extract features at the character and word levels, and various supervised machine learning techniques are applied to the dataset to detect cyberbullying. Precision, recall, accuracy and F1 score are used to analyse the performance of the machine learning techniques.
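A minimal sketch of the character/word n-gram plus supervised-classifier pipeline described in the abstract, using scikit-learn. The toy comments, labels, and the choice of a linear SVM are assumptions for illustration, not details taken from the paper or its Urdu dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.svm import LinearSVC

# Word-level and character-level n-gram features, combined.
features = make_union(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
)
model = make_pipeline(features, LinearSVC())

# Placeholder examples; real training data would be labelled Urdu tweets.
comments = ["comment one", "comment two", "comment three", "comment four"]
labels = ["offensive", "neutral", "offensive", "neutral"]
model.fit(comments, labels)
print(model.predict(["a new unseen comment"]))
```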
102

Distributed representations for compositional semantics

Hermann, Karl Moritz January 2014 (has links)
The mathematical representation of semantics is a key issue for Natural Language Processing (NLP). A lot of research has been devoted to finding ways of representing the semantics of individual words in vector spaces. Distributional approaches—meaning distributed representations that exploit co-occurrence statistics of large corpora—have proved popular and successful across a number of tasks. However, natural language usually comes in structures beyond the word level, with meaning arising not only from the individual words but also the structure they are contained in at the phrasal or sentential level. Modelling the compositional process by which the meaning of an utterance arises from the meaning of its parts is an equally fundamental task of NLP. This dissertation explores methods for learning distributed semantic representations and models for composing these into representations for larger linguistic units. Our underlying hypothesis is that neural models are a suitable vehicle for learning semantically rich representations and that such representations in turn are suitable vehicles for solving important tasks in natural language processing. The contribution of this thesis is a thorough evaluation of our hypothesis, as part of which we introduce several new approaches to representation learning and compositional semantics, as well as multiple state-of-the-art models which apply distributed semantic representations to various tasks in NLP. Part I focuses on distributed representations and their application. In particular, in Chapter 3 we explore the semantic usefulness of distributed representations by evaluating their use in the task of semantic frame identification. Part II describes the transition from semantic representations for words to compositional semantics. Chapter 4 covers the relevant literature in this field. Following this, Chapter 5 investigates the role of syntax in semantic composition. For this, we discuss a series of neural network-based models and learning mechanisms, and demonstrate how syntactic information can be incorporated into semantic composition. This study allows us to establish the effectiveness of syntactic information as a guiding parameter for semantic composition, and answer questions about the link between syntax and semantics. Following these discoveries regarding the role of syntax, Chapter 6 investigates whether it is possible to further reduce the impact of monolingual surface forms and syntax when attempting to capture semantics. Asking how machines can best approximate human signals of semantics, we propose multilingual information as one method for grounding semantics, and develop an extension to the distributional hypothesis for multilingual representations. Finally, Part III summarizes our findings and discusses future work.
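A minimal sketch of the compositional idea the abstract describes: word vectors are combined into a representation for a larger linguistic unit. Shown here is simple (weighted) additive composition, the most basic baseline, not one of the neural composition models developed in the thesis; the toy 4-dimensional vectors are assumptions for illustration.

```python
import numpy as np

embeddings = {
    "red":  np.array([0.9, 0.1, 0.0, 0.2]),
    "car":  np.array([0.1, 0.8, 0.3, 0.0]),
    "fast": np.array([0.2, 0.2, 0.9, 0.1]),
}

def compose(words, weights=None):
    """Additive composition: the phrase vector is a weighted sum of word vectors."""
    vecs = [embeddings[w] for w in words]
    weights = weights or [1.0] * len(vecs)
    return sum(w * v for w, v in zip(weights, vecs))

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

phrase = compose(["fast", "red", "car"])
for w in ["red", "car", "fast"]:
    # Similarity of the composed phrase vector to each of its component words.
    print(w, round(cosine(phrase, embeddings[w]), 3))
```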
103

Lexical vagueness handling using fuzzy logic in human robot interaction

Guo, Xiao January 2011 (has links)
Lexical vagueness is a ubiquitous phenomenon in natural language. Most previous work in natural language processing (NLP) treats lexical ambiguity, rather than lexical vagueness, as the main problem in natural language understanding. Lexical vagueness is usually regarded as a solution rather than a problem, since precise information is often not provided in conversation. However, lexical vagueness is clearly an obstacle in human robot interaction (HRI), since robots are expected to understand their users' utterances precisely in order to provide reliable services. This research aims to develop novel lexical vagueness handling techniques that enable service robots to understand their users' utterances precisely and therefore provide reliable services. A novel integrated system to handle lexical vagueness is proposed, based on an in-depth understanding of lexical ambiguity and lexical vagueness: why they exist, how they are expressed, how they differ, and the mainstream techniques used to handle them. The integrated system consists of two blocks: a lexical ambiguity handling block and a lexical vagueness handling block. The first block removes syntactic ambiguity and lexical ambiguity; the second block then models and removes lexical vagueness. Experimental results show that robots endowed with the developed integrated system are able to understand their users' utterances and can therefore provide reliable services.
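A minimal illustration of treating a vague lexical term with fuzzy logic: "near" is modelled as a fuzzy set over distances with a graded membership function. The term, the distances, and the thresholds are assumptions chosen for illustration and are not taken from the thesis.

```python
def near_membership(distance_m, full=1.0, zero=5.0):
    """Degree (0..1) to which a distance counts as 'near'.

    Distances up to `full` metres are fully 'near'; membership then decreases
    linearly and reaches 0 at `zero` metres.
    """
    if distance_m <= full:
        return 1.0
    if distance_m >= zero:
        return 0.0
    return (zero - distance_m) / (zero - full)

for d in [0.5, 2.0, 3.5, 6.0]:
    print(f"{d} m -> membership in 'near': {near_membership(d):.2f}")
```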
104

DEFENDING BERT AGAINST MISSPELLINGS

Nivedita Nighojkar (8063438) 06 April 2021 (has links)
Defending models against Natural Language Processing (NLP) adversarial attacks is challenging because of the discrete nature of text data. However, given the variety of NLP applications, it is important to make text processing models more robust and secure. This work develops techniques that help text processing models such as BERT combat adversarial samples that contain misspellings. The resulting models are more robust than off-the-shelf spelling checkers.
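A minimal sketch of the kind of character-level misspelling perturbation such defences must handle. This is an illustrative attack-style perturbation, not the defence technique developed in the thesis, and the specific random edit operations are assumptions.

```python
import random

def misspell(word, rng=random):
    """Apply one random character edit (swap, drop, or duplicate) to a word."""
    if len(word) < 3:
        return word
    i = rng.randrange(1, len(word) - 1)
    op = rng.choice(["swap", "drop", "duplicate"])
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "drop":
        return word[:i] + word[i + 1:]
    return word[:i] + word[i] + word[i:]

rng = random.Random(0)
sentence = "the service was absolutely wonderful"
print(" ".join(misspell(w, rng) for w in sentence.split()))
```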
105

A Study of Media Polarization with Authorship Attribution Methods

Yifei Hu (9193709) 14 December 2020 (has links)
Media polarization is a serious issue that can affect people's views on anything from a scientific fact to the perceived results of a presidential election. Media outlets in the United States are aligned along a political spectrum, representing different stances on various issues. Without providing any false information (but usually by omitting some facts), media outlets can report events by deliberately using words and styles that favor particular political positions.

This research investigated U.S. media polarization with authorship attribution approaches, analyzing stylistic differences between left-leaning and right-leaning media and discovering specific linguistic patterns that make news articles display biased political attitudes. Several authorship attribution models were tested while controlling for topic, stance, and style, and were applied to media companies and their identity within a political spectrum. Style features that capture semantic and/or sentiment-related information, such as stance taking, were compared with features that seemingly do not, such as part-of-speech tags. The results demonstrate that articles can be successfully classified as left-leaning or right-leaning regardless of their stance. Finally, we provide an analysis of the patterns that we found.
106

Weighted Aspects for Sentiment Analysis

Byungkyu Yoo (14216267) 05 December 2022 (has links)
When people write a review of a business, they write and rate it based on their personal experience of that business. Sentiment analysis is a natural language processing technique that determines the sentiment of text, including reviews. However, unlike computers, humans weight their personal experience by the preferences and observations they deem important, while ignoring components that matter less to them personally. Traditional sentiment analysis does not consider such preferences. To utilize these human preferences in sentiment analysis, this paper explores various methods of weighting aspects in an attempt to improve sentiment analysis accuracy. Two types of methods are considered. The first applies human preference by assigning weights to aspects when calculating the overall sentiment. The second uses the results of the first to improve the accuracy of traditional supervised sentiment analysis. The results show that the methods achieve high accuracy when people have strong opinions, but the aspect weights do not significantly improve the accuracy overall.
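A minimal sketch of the first method described in the abstract: per-aspect sentiment scores are combined with per-person aspect weights to produce an overall sentiment. The aspect names, scores, and weights are toy assumptions, not values from the paper.

```python
def weighted_sentiment(aspect_scores, aspect_weights):
    """Weighted average of aspect-level sentiment scores (each in [-1, 1])."""
    total_weight = sum(aspect_weights.get(a, 1.0) for a in aspect_scores)
    return sum(score * aspect_weights.get(aspect, 1.0)
               for aspect, score in aspect_scores.items()) / total_weight

# A reviewer who cares far more about food than parking:
scores  = {"food": 0.9, "service": 0.2, "parking": -0.8}
weights = {"food": 3.0, "service": 1.0, "parking": 0.5}
print(weighted_sentiment(scores, weights))   # positive overall despite parking
```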
107

Large Language Models for Unsupervised Keyphrase Extraction and Biomedical Data Analytics

Haoran Ding (18825838) 03 September 2024 (has links)
<p dir="ltr">Natural Language Processing (NLP), a vital branch of artificial intelligence, is designed to equip computers with the ability to comprehend and manipulate human language, facilitating the extraction and utilization of textual data. NLP plays a crucial role in harnessing the vast quantities of textual data generated daily, facilitating meaningful information extraction. Among the various techniques, keyphrase extraction stands out due to its ability to distill concise information from extensive texts, making it invaluable for summarizing and navigating content efficiently. The process of keyphrase extraction usually begins by generating candidates first and then ranking them to identify the most relevant phrases. Keyphrase extraction can be categorized into supervised and unsupervised approaches. Supervised methods typically achieve higher accuracy as they are trained on labeled data, which allows them to effectively capture and utilize patterns recognized during training. However, the dependency on extensive, well-annotated datasets limits their applicability in scenarios where such data is scarce or costly to obtain. On the other hand, unsupervised methods, while free from the constraints of labeled data, face challenges in capturing deep semantic relationships within text, which can impact their effectiveness. Despite these challenges, unsupervised keyphrase extraction holds significant promise due to its scalability and lower barriers to entry, as it does not require labeled datasets. This approach is increasingly favored for its potential to aid in building extensive knowledge bases from unstructured data, which can be particularly useful in domains where acquiring labeled data is impractical. As a result, unsupervised keyphrase extraction is not only a valuable tool for information retrieval but also a pivotal technology for the ongoing expansion of knowledge-driven applications in NLP.</p><p dir="ltr">In this dissertation, we introduce three innovative unsupervised keyphrase extraction methods: AttentionRank, AGRank, and LLMRank. Additionally, we present a method for constructing knowledge graphs from unsupervised keyphrase extraction, leveraging the self-attention mechanism. The first study discusses the AttentionRank model, which utilizes a pre-trained language model to derive underlying importance rankings of candidate phrases through self-attention. This model employs a cross-attention mechanism to assess the semantic relevance between each candidate phrase and the document, enhancing the phrase ranking process. AGRank, detailed in the second study, is a sophisticated graph-based framework that merges deep learning techniques with graph theory. It constructs a candidate phrase graph using mutual attentions from a pre-trained language model. Both global document information and local phrase details are incorporated as enhanced nodes within the graph, and a graph algorithm is applied to rank the candidate phrases. The third study, LLMRank, leverages the strengths of large language models (LLMs) and graph algorithms. It employs LLMs to generate keyphrase candidates and then integrates global information through the text's graphical structures. This process reranks the candidates, significantly improving keyphrase extraction performance. The fourth study explores how self-attention mechanisms can be used to extract keyphrases from medical literature and generate query-related phrase graphs, improving text retrieval visualization. 
The mutual attentions of medical entities, extracted using a pre-trained model, form the basis of the knowledge graph. This, coupled with a specialized retrieval algorithm, allows for the visualization of long-range connections between medical entities while simultaneously displaying the supporting literature. In summary, our exploration of unsupervised keyphrase extraction and biomedical data analysis introduces novel methods and insights in NLP, particularly in information extraction. These contributions are crucial for the efficient processing of large text datasets and suggest avenues for future research and applications.</p>
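A minimal sketch of scoring keyphrase candidates by the self-attention they receive from the rest of the document, loosely in the spirit of the attention-based ranking described above. It is not the thesis's AttentionRank implementation; the model name, the layer/head averaging scheme, and the simple character-span matching of candidates are assumptions for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModel

def score_candidates(text, candidates, model_name="bert-base-uncased"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_attentions=True)
    enc = tokenizer(text, return_tensors="pt", truncation=True,
                    return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]          # (seq_len, 2) char spans
    with torch.no_grad():
        out = model(**enc)
    # Average attention over layers, batch, and heads -> (seq_len, seq_len),
    # then sum over queries to get the attention each token *receives*.
    att = torch.stack(out.attentions).mean(dim=(0, 1, 2))
    received = att.sum(dim=0)                       # (seq_len,)

    scores, lowered = {}, text.lower()
    for cand in candidates:
        start = lowered.find(cand.lower())
        if start == -1:
            continue
        end = start + len(cand)
        # Tokens whose character span overlaps the candidate's span.
        idx = [i for i, (s, e) in enumerate(offsets.tolist())
               if s < end and e > start and e > s]
        if idx:
            scores[cand] = received[idx].mean().item()
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example usage (candidate phrases supplied by hand here):
print(score_candidates(
    "Keyphrase extraction distills concise information from extensive texts.",
    ["keyphrase extraction", "concise information", "texts"]))
```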
108

A Framework to Identify Online Communities for Social Media Analysis

Nikhil Mehta (9750842) 16 October 2024 (has links)
<p dir="ltr">Easy access, variety of content, and fast widespread interactions are some of the reasons that have made social media increasingly popular in our society. This has lead to many people use social media everyday for a variety of reasons, such as interacting with friends or consuming news content. Thus, understanding content on social media is more important than ever.</p><p dir="ltr">An increased understanding on social media can lead to improvements on a large number of important tasks. In this work, we particularly focus on fake news detection and political bias detection. Fake news, text published by news sources with an intent to spread misinformation and sway beliefs, is ever prevalent in today's society. Detecting it is an important and challenging problem to prevent large scale misinformation and maintain a healthy society. In a similar way, detecting the political bias of news content can provide insights about the different perspectives on social media.</p><p dir="ltr">In this work, we view the problem of understanding social media as reasoning over the relationships between sources, the articles they publish, and the engaging users. We start by analyzing these relationships in a graph-based framework, and then use Large Language Models to do the same. We hypothesize that the key to understanding social media is understanding these relationships, such as identifying which users have similar perspectives, or which articles are likely to be shared by similar users.</p><p dir="ltr">Throughout this thesis, we propose several frameworks to capture the relationships on social media better. We initially tackle this problem using supervised learning systems, improving them to achieve strong performance. However, we find that automatedly modeling the complexities of the social media landscape is challenging. On the contrary, having humans analyze and interact with all news content to find relationships, is not scalable. Thus, we then propose to approach enhance our supervised approaches by approaching the social media understanding problem \textit{interactively}, where humans can interact to help an automated system learn a better social media representation quality.</p><p dir="ltr">On real world events, our experiments show performance improvements in detecting the factuality and political bias of news sources, both when trained with and without minimal human interactions. We particularly focus on one of the most challenging setups of this task, where test data is unseen and focuses on new topics when compared with the training data. This realistic setting shows the real world impact of our work in improving social media understanding.</p>
109

Methods for measuring semantic similarity of texts

Gaona, Miguel Angel Rios January 2014 (has links)
Measuring semantic similarity is a task needed in many Natural Language Processing (NLP) applications. For example, in Machine Translation evaluation, semantic similarity is used to assess the quality of the machine translation output by measuring the degree of equivalence between a reference translation and the machine translation output. The problem of semantic similarity (Corley and Mihalcea, 2005) is defined as measuring and recognising semantic relations between two texts. Semantic similarity covers different types of semantic relations, mainly bidirectional and directional. This thesis proposes new methods to address the limitations of existing work on both types of semantic relations. Recognising Textual Entailment (RTE) is a directional relation where a text T entails the hypothesis H (entailment pair) if the meaning of H can be inferred from the meaning of T (Dagan and Glickman, 2005; Dagan et al., 2013). Most RTE methods rely on machine learning algorithms. de Marneffe et al. (2006) propose a multi-stage architecture where a first stage determines an alignment between the T-H pairs, to be followed by an entailment decision stage. A limitation of such approaches is that instead of recognising a non-entailment, an alignment that fits an optimisation criterion will be returned, but the alignment by itself is a poor predictor for non-entailment. We propose an RTE method following a multi-stage architecture, where both stages are based on semantic representations. Furthermore, instead of using simple similarity metrics to predict the entailment decision, we use a Markov Logic Network (MLN). The MLN is based on rich relational features extracted from the output of the predicate-argument alignment structures between T-H pairs. This MLN learns to reward pairs with similar predicates and similar arguments, and penalise pairs otherwise. The proposed methods show promising results. A source of errors was found to be the alignment step, which has low coverage. However, we show that when an alignment is found, the relational features improve the final entailment decision. The task of Semantic Textual Similarity (STS) (Agirre et al., 2012) is defined as measuring the degree of bidirectional semantic equivalence between a pair of texts. The STS evaluation campaigns use datasets that consist of pairs of texts from NLP tasks such as Paraphrasing and Machine Translation evaluation. Methods for STS are commonly based on computing similarity metrics between the pair of sentences, where the similarity scores are used as features to train regression algorithms. Existing methods for STS achieve high performance on certain tasks but poor results on others, particularly on unknown (surprise) tasks. Our solution to alleviate this unbalanced performance is to model STS in the context of Multi-task Learning using Gaussian Processes (MTL-GP) (Álvarez et al., 2012) and state-of-the-art STS features (Šarić et al., 2012). We show that the MTL-GP outperforms previous work on the same datasets.
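A minimal sketch of the STS-as-regression framing described above: simple pairwise similarity features feed a Gaussian Process regressor that predicts a similarity score. This is single-task GP regression for illustration only, not the thesis's MTL-GP model, and the toy data and features are assumptions.

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def features(s1, s2):
    """Two toy similarity features: word-level Jaccard overlap and length difference."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    overlap = len(t1 & t2) / max(len(t1 | t2), 1)
    len_diff = abs(len(t1) - len(t2)) / max(len(t1), len(t2), 1)
    return [overlap, len_diff]

# Toy training pairs with gold similarity scores on a 0-5 scale.
pairs = [("a man is playing a guitar", "a man plays guitar", 4.8),
         ("a dog runs in the park", "a cat sleeps on the sofa", 0.6),
         ("the stock market fell today", "shares dropped sharply today", 3.9)]
X = [features(a, b) for a, b, _ in pairs]
y = [score for _, _, score in pairs]

gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True).fit(X, y)
print(gp.predict([features("a man is playing music", "a man plays a guitar")]))
```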
110

Generation of referring expressions for an unknown audience

Kutlák, Roman January 2014 (has links)
When computers generate text, they have to consider how to describe the entities mentioned in the text. This situation becomes more difficult when the audience is unknown, as it is not clear what information is available to the addressees. This thesis investigates the generation of descriptions in situations where an algorithm does not have a precise model of the addressees' knowledge. The thesis starts with the collection and analysis of a corpus of descriptions of famous people. The analysis of the corpus revealed a number of useful patterns, which informed the remainder of this thesis. One of the difficult questions is how to choose information that helps addressees identify the described person. This thesis introduces a corpus-based method for determining which properties are more likely to be known by the addressees, and a probability-based method for identifying properties that are distinguishing. One of the patterns observed in the collected corpus is the inclusion of multiple properties, each of which uniquely identifies the referent. This thesis introduces a novel corpus-based method for determining how many properties to include in a description. Finally, a number of algorithms that leverage the findings of the corpus analysis, together with their computational implementation, are proposed and tested in an evaluation involving human participants. The proposed algorithms outperformed the Incremental Algorithm both in the number of correctly identified referents and in providing a better mental image of the referent. The main contributions of this thesis are: (1) a corpus-based analysis of descriptions produced for an unknown audience; (2) a computational heuristic for estimating what information is likely to be known to addressees; and (3) algorithms that can generate referring expressions that benefit addressees without requiring an explicit model of the addressees' knowledge.
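A minimal sketch of greedy distinguishing-description generation, in the spirit of the Incremental Algorithm that the thesis compares against: properties of the target are added until no other entity matches them all. It is not one of the thesis's corpus-based algorithms, and the toy knowledge base and the property preference order are assumptions for illustration.

```python
def describe(target, entities, preference):
    """Pick properties of `target` until no other entity matches them all."""
    distractors = {e for e in entities if e != target}
    description = {}
    for prop in preference:
        if not distractors:
            break
        value = entities[target].get(prop)
        if value is None:
            continue
        ruled_out = {e for e in distractors if entities[e].get(prop) != value}
        if ruled_out:                       # only keep properties that help
            description[prop] = value
            distractors -= ruled_out
    return description if not distractors else None   # None: not distinguishing

entities = {
    "e1": {"occupation": "physicist", "nationality": "German", "era": "20th century"},
    "e2": {"occupation": "physicist", "nationality": "Danish", "era": "20th century"},
    "e3": {"occupation": "composer",  "nationality": "German", "era": "18th century"},
}
print(describe("e1", entities, ["occupation", "nationality", "era"]))
# -> {'occupation': 'physicist', 'nationality': 'German'}
```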
