641 |
Towards Generation of Creative Software Requirements, by Do, Quoc Anh, Jr, 07 August 2020
The increasingly competitive software industry, where multiple systems serve the same application domain and compete for customers, favors software with creative features. To promote software creativity, research has proposed multi-day workshops with experienced facilitators, and semi-automated tools that provide limited support for creative thinking. Such approaches are either time-consuming and demand substantial involvement from analysts with creative abilities, or are useful only for existing large-scale software with a rich issue-tracking system. In this dissertation, we present different approaches leveraging advanced natural language processing and machine learning techniques to provide automated support for generating creative software requirements with minimal human intervention. A controlled experiment is conducted to assess the effectiveness of our automated framework compared to the traditional brainstorming technique. The results demonstrate our framework's ability to generate creative features for a wide range of stakeholders and provoke innovative thinking among developers with various experience levels.
|
642 |
An Investigation Into ALM as a Knowledge Representation Library Language, by Lloyd, Benjamin Tyler, 15 December 2022
No description available.
|
643 |
Comparing Text Similarity Functions For Outlier Detection: In a Dataset with Small Collections of Titles, by Rabo, Vide and Winbladh, Erik, January 2022
Detecting when a title is put in an incorrect data category can be of interest for commercial digital services, such as streaming platforms, since they group movies by genre. Another example of a beneficiary is price comparison services, which categorise offers by their respective product. In order to find data points that differ significantly from the majority (outliers), outlier detection can be applied; a title in the wrong category is an example of an outlier. Outlier detection algorithms may require a metric that quantifies the dissimilarity between two points, and text similarity functions can provide such a metric when comparing text data. The question therefore arises: "Which text similarity function is best suited for detecting incorrect titles in practical environments such as commercial digital services?" In this thesis, different text similarity functions are evaluated when set to detect outlying (incorrect) product titles, with both efficiency and effectiveness taken into consideration. Results show that the variance in performance between functions is generally small, with a few exceptions. The overall top performer is Sørensen-Dice, a function that divides the number of common words by the total number of words found in both strings. While the function is efficient in the sense that it identifies most outliers in a practical time frame, it is not likely to find all of them, and is therefore deemed not effective enough to be applied on its own in practical use. It might be better applied as part of a larger system, or in combination with manual analysis.
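The abstract's description of the top performer can be made concrete with a short sketch. This is an illustrative, assumed implementation of the standard word-level Sørensen-Dice coefficient (2·|A∩B| / (|A|+|B|)) paired with a naive nearest-neighbour outlier score, not the thesis's actual code:

```python
def dice_similarity(a: str, b: str) -> float:
    """Sorensen-Dice coefficient over word sets: 2*|A & B| / (|A| + |B|)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return 2 * len(words_a & words_b) / (len(words_a) + len(words_b))

# A title's outlier score: one minus its best similarity to any other title.
titles = ["the matrix", "the matrix reloaded", "baking sourdough bread"]
scores = [1 - max(dice_similarity(t, u) for u in titles if u is not t)
          for t in titles]
print(scores.index(max(scores)))  # -> 2 (the title that fits the collection worst)
```

Thresholding such scores yields a simple detector; the thesis compares several similarity functions in this role.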
|
644 |
Creating eye-catching headlines using BART, by Despinoy, Eva, January 2022
Social media is a significant factor in information distribution today, and this information landscape contains many different posts competing for the user's attention. Different factors can help catch the user's interest, and one of them is the headline of the message. A headline can be more or less eye-catching, which can make the reader more or less inclined to interact with the post. The theme of this study is the automated creation of eye-catching headlines that stay truthful to the content of their articles, using automatic text summarization. The method consisted of fine-tuning BART, an existing model for text summarization. Other papers have addressed this problem with different models and varying success, but none have used this method. It was deemed an interesting method because it is less time- and energy-consuming than creating and training a new model entirely from scratch, and could therefore be easily replicated if the results were positive. The BartForConditionalGeneration model implemented by the HuggingFace library was fine-tuned using the Popular News Articles dataset by Web.io. This method showed positive results. The resulting headlines were deemed faithful to the original ones, with a ROUGE-2 recall score of 0.541. They were comparably eye-catching to the human-written headlines: the human respondents ranked them almost the same, with an average rank of 1.692 for the human-written headlines and 1.821 for fine-tuned BART, and they received an average score of 3.31 on a 1-to-5 attractiveness scale. They were also deemed very comprehensible, with an average score of 0.95 on a scale from 0 to 1.
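The ROUGE-2 recall used to quantify faithfulness can be sketched in a few lines. This is an illustrative bigram-overlap implementation, not the exact scorer used in the study:

```python
from collections import Counter

def bigrams(text):
    """Lowercased word bigrams of a string."""
    toks = text.lower().split()
    return [tuple(toks[i:i + 2]) for i in range(len(toks) - 1)]

def rouge2_recall(candidate: str, reference: str) -> float:
    """Fraction of the reference headline's bigrams found in the candidate."""
    ref, cand = Counter(bigrams(reference)), Counter(bigrams(candidate))
    if not ref:
        return 0.0
    overlap = sum(min(n, cand[bg]) for bg, n in ref.items())
    return overlap / sum(ref.values())

print(rouge2_recall("local team wins big final",
                    "local team wins the final"))  # -> 0.5
```

A recall of 0.541, as reported, means roughly half of the reference headline's bigrams reappear in the generated one.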
|
645 |
Suggesting Missing Information in Text Documents, by Hodgson, Grant Michael, 01 January 2018
A key part of contract drafting involves thinking of issues that have not been addressed and adding language that will address the missing issues. To assist attorneys with this task, we present a pipeline approach for identifying missing information within a contract section. The pipeline takes a contract section as input and includes 1) identifying sections that are similar to the input section from a corpus of contract sections; and 2) identifying and suggesting information from the similar sections that is missing from the input section. By taking advantage of sentence embedding and principal component analysis, this approach suggests sentences that are helpful for finishing a contract. Through synthetic experiments and a user study, we show that sentence suggestions are more useful than the state-of-the-art topic-suggestion algorithm.
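The two pipeline steps can be sketched end to end. As an assumption for illustration, bag-of-words cosine similarity stands in for the sentence embeddings and PCA used in the actual system, and the thresholds are invented:

```python
from collections import Counter
from math import sqrt

def vec(text: str) -> Counter:
    """Bag-of-words vector (stand-in for a sentence embedding)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def suggest_missing(input_section, corpus_sections,
                    section_sim=0.3, sentence_sim=0.5):
    """1) keep corpus sections similar to the input section;
    2) suggest their sentences that have no close match in the input."""
    input_sents = input_section.split(". ")
    suggestions = []
    for section in corpus_sections:
        if cosine(vec(input_section), vec(section)) < section_sim:
            continue
        for sent in section.split(". "):
            if all(cosine(vec(sent), vec(s)) < sentence_sim
                   for s in input_sents):
                suggestions.append(sent)
    return suggestions

lease = "The tenant shall pay rent monthly. Rent is due on the first"
corpus = ["The tenant shall pay rent monthly. Late payment incurs a fee",
          "This section covers trademark licensing terms"]
print(suggest_missing(lease, corpus))  # -> ['Late payment incurs a fee']
```

The unrelated trademark section is filtered out in step 1, and only the sentence with no counterpart in the input survives step 2.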
|
646 |
Toward Annotation Efficiency in Biased Learning Settings for Natural Language Processing, by Effland, Thomas, January 2023
The goal of this thesis is to improve the feasibility of building applied NLP systems for more diverse and niche real-world use-cases of extracting structured information from text. A core factor in determining this feasibility is the cost of manually annotating enough unbiased labeled data to achieve a desired level of system accuracy, and our goal is to reduce this cost. We focus on reducing this cost by making contributions in two directions: (1) easing the annotation burden by leveraging high-level expert knowledge in addition to labeled examples, thus making approaches more annotation-efficient; and (2) mitigating known biases in cheaper, imperfectly labeled real-world datasets so that we may use them to our advantage. A central theme of this thesis is that high-level expert knowledge about the data and task can allow for biased labeling processes that focus experts on only manually labeling aspects of the data that cannot be easily labeled through cheaper means. This combination allows for more accurate models with less human effort. We conduct our research on this general topic through three diverse problems with immediate applications to real-world settings.
First, we study an applied problem in biased text classification. We encounter a rare-event text classification system that has been deployed for several years. We are tasked with improving this system's performance using only the severely biased incidental feedback provided by the experts over years of system use. We develop a method that combines importance weighting and an unlabeled data imputation scheme that exploits the selection-bias of the feedback to train an unbiased classifier without requiring additional labeled data. We experimentally demonstrate that this method considerably improves the system performance.
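The bias-correction idea can be illustrated with a toy inverse-propensity weighting scheme. The function and numbers below are hypothetical, and the actual method additionally imputes unlabeled data:

```python
def weighted_loss(losses, selection_probs):
    """Importance-weighted mean loss: an example that had probability p of
    entering the biased feedback log is up-weighted by 1/p."""
    weights = [1.0 / p for p in selection_probs]
    total = sum(w * l for w, l in zip(weights, losses))
    return total / sum(weights)

# A rare-event example observed with p=0.1 counts ten times as much
# as a common example observed with p=1.0.
print(weighted_loss([0.2, 0.8], [1.0, 0.1]))
```

Reweighting in this way makes the average over the biased sample an unbiased estimate of the loss on the true distribution, which is the principle behind training an unbiased classifier from selection-biased feedback.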
Second, we tackle an applied problem in named entity recognition (NER): learning tagging models from data with very low recall for annotated entities. To solve this issue we propose a novel loss, the Expected Entity Ratio (EER), that uses an uncertain estimate of the proportion of entities in the data to counteract the false-negative bias in the data, encouraging the model to have the correct ratio of entities in expectation. We justify the principles of our approach with theory showing that it recovers the true tagging distribution under mild conditions. Additionally, we provide extensive empirical results showing it to be practically useful: it meets or exceeds the performance of state-of-the-art baselines across a variety of languages, annotation scenarios, and amounts of labeled data. We also show that, when combined with our approach, a novel sparse annotation scheme can outperform exhaustive annotation for modest annotation budgets.
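A scalar caricature of the EER idea, with an assumed tolerance-band form for intuition only (the real loss operates on per-token model marginals):

```python
def eer_penalty(entity_probs, target_ratio, tolerance=0.05):
    """Penalize the model when its expected fraction of entity tokens
    falls outside a tolerance band around the estimated true ratio."""
    expected = sum(entity_probs) / len(entity_probs)
    return max(0.0, abs(expected - target_ratio) - tolerance)

# Two of ten tokens look like entities; a target of 0.15 +/- 0.05 is met.
probs = [0.9, 0.9, 0.05, 0.02, 0.01, 0.03, 0.02, 0.01, 0.03, 0.03]
print(round(eer_penalty(probs, 0.15), 3))  # -> 0.0
```

Because unannotated entities drag the model toward predicting too few entities, a term like this pushes the expected entity ratio back toward the (uncertain) estimate instead of toward zero.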
Third, we study the challenging problem of syntactic parsing in low-resource languages. We approach the problem from a cross-lingual perspective, building on a state-of-the-art transfer-learning approach that underperforms on "distant" languages with little to no representation in the training corpus. Motivated by the field of syntactic typology, we introduce a general method, Expected Statistic Regularization (ESR), that regularizes the parser on distant languages according to their expected typological syntax statistics. We also contribute general approaches for estimating the loss supervision parameters from the task formalism or from small amounts of labeled data. We present seven broad classes of descriptive statistic families and provide extensive experimental evidence showing that using these statistics for regularization is complementary to deep learning approaches in low-resource transfer settings.
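ESR can be caricatured in the same spirit: penalize the gap between the parser's expected statistics and typology-derived targets. The squared-gap form and the numbers below are assumptions for illustration:

```python
def esr_penalty(model_stats, target_stats, weights=None):
    """Sum of weighted squared gaps between the parser's expected
    statistics and their typology-derived target values."""
    weights = weights or [1.0] * len(target_stats)
    return sum(w * (m - t) ** 2
               for w, m, t in zip(weights, model_stats, target_stats))

# e.g. an adjective-before-noun rate of 0.80 against a typological target
# of 0.95, plus a second statistic at 0.30 against a target of 0.25.
print(round(esr_penalty([0.80, 0.30], [0.95, 0.25]), 3))  # -> 0.025
```

The appeal is that such targets can come from linguistic typology or a handful of labeled sentences, rather than from a full treebank in the distant language.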
In conclusion, this thesis contributes approaches for reducing the annotation cost of building applied NLP systems through the use of high-level expert knowledge to impart additional learning signal on models and cope with cheaper biased data. We publish implementations of our methods and results, so that they may facilitate future research and applications. It is our hope that the frameworks proposed in this thesis will help to democratize access to NLP for producing structured information from text in wider-reaching applications by making them faster and cheaper to build.
|
647 |
Deep Learning Methods to Investigate Online Hate Speech and Counterhate Replies to Mitigate Hateful Content, by Albanyan, Abdullah Abdulaziz, 05 1900
Hateful content and offensive language are commonplace on social media platforms. Many surveys show that high percentages of social media users experience online harassment. Previous efforts have been made to detect and remove online hate content automatically; however, removing users' content restricts free speech. A complementary strategy that does not interfere with free speech is to counter the hate with new content that diverts the discourse away from it. In this dissertation, we address the lack of previous work on counterhate arguments by analyzing and detecting them. First, we study the relationships between hateful tweets and replies. Specifically, we analyze their fine-grained relationships by indicating whether the reply counters the hate, provides a justification, attacks the author of the tweet, or adds additional hate. The most striking finding is that most replies agree with the hateful tweets; only 20% of them counter the hate. Second, we focus on hate directed toward individuals and detect authentic counterhate arguments from online articles. We propose a methodology that assures the authenticity of the argument and its specificity to the individual of interest, and show that finding arguments in online articles is an efficient alternative to counterhate generation approaches, which may hallucinate unsupported arguments. Third, we investigate the replies to counterhate tweets beyond whether the reply agrees or disagrees with the counterhate tweet. We analyze the language of the counterhate tweet that leads to certain types of replies and predict which counterhate tweets may elicit more hate instead of stopping it. We find that counterhate tweets containing profanity elicit replies that agree with the counterhate tweet. This dissertation presents several corpora, detailed corpus analyses, and deep learning-based approaches for the three tasks described above.
|
648 |
Countering Hate Speech: Modeling User-Generated Web Content Using Natural Language Processing, by Yu, Xinchen, 07 1900
Social media is considered a particularly conducive arena for hate speech. Counter speech, a "direct response that counters hate speech," is a remedy for addressing it. Unlike content moderation, counter speech does not interfere with the principle of free and open public spaces for debate. This dissertation focuses on (a) the automatic detection and (b) analyses of the effectiveness of counter speech and its fine-grained strategies in user-generated web content. The first goal is to identify counter speech. We create a corpus with 6,846 instances through crowdsourcing, and specifically investigate the role of conversational context in the annotation and detection of counter speech. The second goal is to assess and predict the conversational outcomes of counter speech. We propose a new metric that measures conversation incivility based on the numbers of uncivil and civil comments as well as the unique authors involved in the discourse, and use the metric to evaluate the outcomes of replies to hate speech. The third goal is to establish a fine-grained taxonomy of counter speech. We present a theoretically grounded taxonomy that differentiates counter speech addressing the author of hate speech from counter speech addressing its content. We further compare the conversational outcomes of different types of counter speech and build models to identify each type. We conclude by discussing our contributions and future research directions on using user-generated counter speech to combat online hatred.
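The abstract does not spell the incivility metric out, so the formula below is purely a hypothetical illustration of how counts of uncivil comments, civil comments, and unique authors might be combined into a single score:

```python
def conversation_incivility(n_uncivil, n_civil, n_authors):
    """Hypothetical score: net uncivil comments, normalized by the
    number of unique authors in the thread (formula assumed)."""
    if n_authors == 0:
        return 0.0
    return max(0, n_uncivil - n_civil) / n_authors

# A thread with 5 uncivil and 2 civil comments from 3 distinct authors.
print(conversation_incivility(5, 2, 3))  # -> 1.0
```

Normalizing by unique authors distinguishes a pile-on by many users from a single user posting repeatedly, which is one plausible reason to include author counts in such a metric.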
|
649 |
Numerical Reasoning in NLP: Challenges, Innovations, and Strategies for Handling Mathematical Equivalency, by Liu, Qianying, 25 September 2023
Kyoto University / New-system doctoral program / Doctor of Informatics / Kou No. 24929 / Joho-Hakase No. 840 / Shinsei||Joho||140 (University Library) / Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University / (Chief examiner) Program-Specific Professor Sadao Kurohashi; Professor Tatsuya Kawahara; Professor Ko Nishino / Qualified under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Informatics / Kyoto University / DFAM
|
650 |
Interpretable natural language processing models with deep hierarchical structures and effective statistical training, by Zhaoxin Luo (17328937), 03 November 2023
<p dir="ltr">The research focuses on improving natural language processing (NLP) models by integrating the hierarchical structure of language, which is essential for understanding and generating human language. The main contributions of the study are:</p><ol><li><b>Hierarchical RNN Model:</b> Development of a deep Recurrent Neural Network model that captures both explicit and implicit hierarchical structures in language.</li><li><b>Hierarchical Attention Mechanism:</b> Use of a multi-level attention mechanism to help the model prioritize relevant information at different levels of the hierarchy.</li><li><b>Latent Indicators and Efficient Training:</b> Integration of latent indicators using the Expectation-Maximization algorithm and reduction of computational complexity with Bootstrap sampling and layered training strategies.</li><li><b>Sequence-to-Sequence Model for Translation:</b> Extension of the model to translation tasks, including a novel pre-training technique and a hierarchical decoding strategy to stabilize latent indicators during generation.</li></ol><p dir="ltr">The study claims enhanced performance in various NLP tasks with results comparable to larger models, with the added benefit of increased interpretability.</p>
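The multi-level attention mechanism (item 2 above) can be sketched with scalar "scores" standing in for the model's vector representations; this is an assumed toy, not the thesis's implementation:

```python
from math import exp

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def hierarchical_pool(doc):
    """Word-level attention inside each sentence, then sentence-level
    attention over the pooled sentence scores."""
    sent_scores = []
    for word_scores in doc:
        w = softmax(word_scores)              # word-level attention weights
        sent_scores.append(sum(a * x for a, x in zip(w, word_scores)))
    d = softmax(sent_scores)                  # sentence-level attention
    return sum(a * x for a, x in zip(d, sent_scores))

doc = [[0.1, 2.0, 0.3],   # one scalar "score" per word, per sentence
       [0.5, 0.4]]
print(round(hierarchical_pool(doc), 3))
```

Each level lets the model concentrate mass on the most relevant units, words within a sentence and sentences within a document, mirroring the explicit hierarchy the thesis builds into its RNN.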
|