31

Cross-Lingual and Genre-Supervised Parsing and Tagging for Low-Resource Spoken Data

Fosteri, Iliana January 2023 (has links)
Dealing with low-resource languages is challenging because there is not enough data to train machine-learning models to make predictions on them. One way to address this problem is to use data from higher-resource languages, enabling transfer of learning from those languages to the low-resource targets. The present study focuses on dependency parsing and part-of-speech tagging of low-resource languages belonging to the spoken genre, i.e., languages whose treebank data is transcribed speech: Beja, Chukchi, Komi-Zyrian, Frisian-Dutch, and Cantonese. Our approach investigates different types of transfer languages, employing MaChAmp, a state-of-the-art parser and tagger that uses contextualized word embeddings (mBERT and XLM-R in particular). The main idea is to explore how genre match, language similarity, neither, or a combination of the two affects model performance on these downstream tasks for our selected target treebanks. Our findings suggest that capturing speech-specific dependency relations requires incorporating at least some genre-matching source data, while language-similarity-matched source data are the better choice when the task at hand is part-of-speech tagging. We also explore the impact of multi-task learning in one of our proposed methods, but observe only minor differences in model performance.
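As a rough illustration of the transfer-language comparison described above, the sketch below shows how candidate source treebanks might be selected by genre match, language similarity, or both. The treebank names, file paths, and metadata are invented placeholders; the sketch stands in for, rather than reproduces, the study's actual MaChAmp training configuration with mBERT/XLM-R embeddings.

```python
# Minimal, hypothetical sketch of source-treebank selection under the three
# transfer strategies compared above (genre match, language similarity, both).
# All names and paths below are placeholders, not the study's actual data.
from dataclasses import dataclass
from typing import List


@dataclass
class Treebank:
    name: str
    language: str
    genre: str          # e.g. "spoken" or "written"
    conllu_path: str    # path to a UD-style .conllu file


# Invented candidate source treebanks, for illustration only.
SOURCES = [
    Treebank("dutch_written", "Dutch", "written", "nl_written.conllu"),
    Treebank("dutch_spoken", "Dutch", "spoken", "nl_spoken.conllu"),
    Treebank("english_spoken", "English", "spoken", "en_spoken.conllu"),
]


def select_sources(related_languages: List[str], strategy: str) -> List[Treebank]:
    """Pick source treebanks by genre match, language similarity, or both."""
    if strategy == "genre":
        return [t for t in SOURCES if t.genre == "spoken"]
    if strategy == "language":
        return [t for t in SOURCES if t.language in related_languages]
    if strategy == "both":
        return [t for t in SOURCES
                if t.genre == "spoken" and t.language in related_languages]
    raise ValueError(f"unknown strategy: {strategy}")


# e.g. for a Frisian-Dutch target, Dutch counts as a similar language:
for strategy in ("genre", "language", "both"):
    picked = select_sources(["Dutch"], strategy)
    print(strategy, "->", [t.name for t in picked])
```

In the experiments described, the selected source treebanks would then be fed to the parser/tagger as training data, and labelled attachment and tagging accuracy compared across the strategies.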
32

A Comparative Analysis of Text Usage and Composition in Goscinny's Le petit Nicolas, Goscinny's Astérix, and Albert Uderzo's Astérix

Meyer, Dennis Scott 05 March 2012 (has links) (PDF)
The goal of this thesis is to analyze the textual composition of René Goscinny's Astérix and Le petit Nicolas, demonstrating how they differ and why. Taking a statistical look at the comparative qualities of each series, the thesis highlights and compares the structural differences and similarities in language use between the two series and their respective media. Though one might expect more complicated language use in traditional text by virtue of its format, analysis of average word length, average sentence length, lexical diversity, the prevalence of specific forms (the passé composé, possessive pronouns, etc.), and preferred collocations (ils sont fous, ces Romains !) yields a more nuanced picture. Though Le petit Nicolas has longer sentences and more relative pronouns (and hence more clauses per sentence on average), Astérix has longer words and more lexical diversity. A similar comparison of the albums of Astérix written by Goscinny to those of Uderzo, paying additional attention to the structural elements of each album (usage of narration and sound effects, for example), shows that Goscinny's love of reusing phrases is far greater than Uderzo's, and that the two have very different ideas of timing as expressed in narration boxes.
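For readers curious how such surface statistics are computed, the following is a small self-contained sketch of average word length, average sentence length, and lexical diversity (type–token ratio) over raw text. The tokenisation is deliberately naive and the sample sentence is invented; the thesis itself works from the full Le petit Nicolas and Astérix corpora.

```python
# Naive surface statistics of the kind compared in the thesis: average word
# length, average sentence length, and lexical diversity (type-token ratio).
import re


def text_stats(text: str) -> dict:
    # Split sentences on terminal punctuation; \w is Unicode-aware in Python 3,
    # so accented French words are tokenised correctly.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\b[\w'-]+\b", text.lower())
    return {
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "avg_sent_len": len(words) / len(sentences),
        "lexical_diversity": len(set(words)) / len(words),  # type-token ratio
    }


sample = "Ils sont fous, ces Romains ! Astérix et Obélix partent en voyage."
print(text_stats(sample))
```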
33

NATURAL LANGUAGE PROCESSING-BASED AUTOMATED INFORMATION EXTRACTION FROM BUILDING CODES TO SUPPORT AUTOMATED COMPLIANCE CHECKING

Xiaorui Xue (13171173) 29 July 2022 (has links)
The traditional manual code compliance checking process is time-consuming, costly, and error-prone (Zhang & El-Gohary, 2015). Therefore, automated code compliance checking systems have emerged as an alternative. However, computer software cannot directly process regulatory information in unstructured building code texts. To support automated code compliance checking, building codes need to be transformed into a computer-processable, structured format. In particular, the problem that most automated code compliance checking systems can only check a limited number of building code requirements stands out.

The transformation of building code requirements into a computer-processable, structured format is a natural language processing (NLP) task that requires POS tagging accuracy on building codes beyond the state of the art. To address this need, this dissertation research provides a method to improve the performance of POS taggers by means of error-driven transformational rules that revise machine-tagged POS results. These rules fix POS tagging errors in two steps: first, a rule locates errors in the POS tagging by their context; second, it replaces the erroneous POS tag with the correct POS tag stored in the rule. A dataset of POS-tagged building codes, the Part-of-Speech Tagged Building Codes (PTBC) dataset (Xue & Zhang, 2019), was published in the Purdue University Research Repository (PURR). Testing on this dataset showed that the method corrected 71.00% of the errors in POS tagging results for building codes, increasing POS tagging accuracy on building codes from 89.13% to 96.85%.

This dissertation research also provides a new POS tagger tailored to building codes. The proposed POS tagger combines neural network models with error-driven transformational rules. The neural network model contains a pre-trained model and one or more trainable neural layers, and was trained and fine-tuned on the PTBC dataset (Xue & Zhang, 2019). A high-performance configuration was identified, consisting of one bidirectional Long Short-Term Memory (LSTM) recurrent neural network (RNN) trainable layer, a BERT-Base-Cased pre-trained model, and 50 epochs of training. This model achieved 91.89% precision without error-driven transformational rules and 95.11% precision with them, outperforming the previously most advanced POS tagger's 89.82% precision on building codes.

Other automated information extraction methods were also developed in this dissertation. Some automated code compliance checking systems represent building codes as logic clauses and use pattern-matching-based rules to convert building codes from natural language text to logic clauses (Zhang & El-Gohary, 2017). A ruleset expansion method was developed that can expand the range of checkable building codes of such systems by expanding their pattern-matching-based ruleset. The ruleset expansion method guarantees: (1) backward compatibility with the building codes that the ruleset was already able to process, and (2) forward compatibility with building codes that the ruleset may need to process in the future. The method was validated on Chapters 5 and 10 of the International Building Code 2015 (IBC 2015), with Chapter 10 used as the training dataset and Chapter 5 as the testing dataset. A gold standard of logic clauses was published in the Logic Clause Representation of Building Codes (LCRBC) dataset (Xue & Zhang, 2021), and the expanded pattern-matching-based rules were published in the dissertation (Appendix A). Compared to the baseline ruleset, the expanded ruleset increased the precision, recall, and F1-score of predicate-level logic clause generation by 10.44%, 25.72%, and 18.02%, to 95.17%, 96.60%, and 95.88%, respectively.

Most existing automated code compliance checking research has focused on regulatory information stored as text in building codes. However, a comprehensive automated code compliance checking process should also be able to check regulatory information stored in other parts of the codes, such as tables. Therefore, this dissertation research provides a semi-automated information extraction and transformation method for processing tabular information in building codes. The proposed method semi-automatically detects table layouts and stores the extracted information of a table in a database, which automated code compliance checking systems can then query for the regulatory information in the corresponding table. The algorithm's initial implementation accurately processed 91.67% of the tables in a testing dataset composed of the tables in Chapter 10 of IBC 2015; after iterative upgrades, the updated method correctly processed all tables in the testing dataset.
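The two-step error-driven transformational rules lend themselves to a short illustration. The sketch below is a minimal, hypothetical Brill-style implementation in which a rule's context is simply the tag of the preceding token; the rule, the Penn-style tags, and the example sentence are invented and are not taken from the PTBC dataset or the dissertation's actual ruleset.

```python
# Hedged Brill-style sketch of error-driven transformational rules: each rule
# matches an erroneous machine-assigned tag in a given context (here, the tag
# of the preceding token) and replaces it with the stored correct tag.
from dataclasses import dataclass
from typing import List, Tuple

Tagged = List[Tuple[str, str]]  # [(token, tag), ...]


@dataclass
class TransformationalRule:
    wrong_tag: str    # tag the machine tagger assigned
    correct_tag: str  # tag to substitute
    prev_tag: str     # context: tag of the preceding token

    def apply(self, sent: Tagged) -> Tagged:
        out = list(sent)
        for i in range(1, len(out)):
            token, tag = out[i]
            # Step 1: locate the error by its context.
            if tag == self.wrong_tag and out[i - 1][1] == self.prev_tag:
                # Step 2: replace the erroneous tag with the stored correct tag.
                out[i] = (token, self.correct_tag)
        return out


# Hypothetical rule: a word tagged as a base-form verb (VB) right after a
# determiner (DT) is usually a noun (NN) in building code sentences.
rules = [TransformationalRule(wrong_tag="VB", correct_tag="NN", prev_tag="DT")]

machine_tagged = [("the", "DT"), ("exit", "VB"), ("shall", "MD"),
                  ("be", "VB"), ("illuminated", "VBN")]
for rule in rules:
    machine_tagged = rule.apply(machine_tagged)
print(machine_tagged)  # ('exit', 'NN') after the rule fires
```

In an error-driven setup like the one described, rules of this shape would typically be derived by comparing machine-tagged output against hand-corrected tags and keeping the substitutions that fix the most errors.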
34

Die deelwoord in Afrikaans : perspektiewe vanuit ʼn kognitiewe gebruiksgebaseerde beskrywingsraamwerk [The participle in Afrikaans: perspectives from a cognitive usage-based descriptive framework] / Anna Petronella Butler

Butler, Anna Petronella January 2014 (has links)
During an annotation project of 60 000 Afrikaans tokens by CTexT (North-West University), the developers had to answer difficult questions with regard to the annotation of the participle specifically. One of the main reasons for this difficulty is that the different sources that offer descriptions of the participle in Afrikaans conflict in those descriptions and, depending on which source is consulted, would yield different annotations. In order to clarify how the participle in Afrikaans should be annotated, the available literature was surveyed to determine the exact nature of the participle in Afrikaans. The descriptions of the participle in Afrikaans were further situated in the context of how participles are described in English and Dutch. The conclusion reached is that the participle form of the verb in Afrikaans should be distinguished from the periphrastic construction form of the verb that appears in past and passive constructions. Furthermore, this study determined to what extent a cognitive usage-based descriptive framework could contribute towards a better understanding of the participle in Afrikaans. The first conclusion reached is that a characterisation of the participle within this framework enables one to make conceptual sense of the morphological structure of the participle. The study shows how the morphological structure of the participle is responsible for the fact that the verbal character of the participle stays intact while the participle functions as another word class. Another conclusion reached regarding the characterisation of the past and passive constructions from a cognitive usage-based descriptive framework is that the framework makes it possible to distinguish conceptually between the periphrastic form of the verb and the participle form of the verb. Lastly, the study determined to what extent new insights into the participle in Afrikaans could lead to alternative lemmatisation and part-of-speech tagging of participles in the NCHLT corpus. The conclusion reached is that participles are, for the most part, lemmatised satisfactorily. Proposals made to improve the lemmatisation protocol include: (i) distinguishing in the protocol between periphrastic forms of the verb and the participle form of the verb; (ii) repeating the guideline for the lemmatisation of compound verbs that was provided for verb lemmatisation under the lemmatisation guidelines for participles; (iii) adding more lexicalised adjectives to the existing list in the protocol; and (iv) suggesting a guideline that would allow one to distinguish consistently between participles that can function as adverbs and participles that can function as prepositions. The conclusion reached after the analysis of the part-of-speech protocol is that the part-of-speech tag set in Afrikaans does not allow the specific attributes and values of participles to be taken into account. Participles in the Afrikaans tag set are tagged strictly according to the function of the word. Although such an approach is very practical, it results in a linguistically poorer part-of-speech tag that ignores the verbal character of the participle. An alternative strategy is therefore suggested for the part-of-speech tagging of participles, in which the attributes and values of the verb tag are adapted. / MA (Linguistics and Literary Theory), North-West University, Potchefstroom Campus, 2014
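To make the contrast concrete, here is a purely illustrative sketch of the difference between function-based tagging and an adapted verb tag that keeps the participle's verbal character; the tag labels and attribute names are invented for illustration and are not the actual NCHLT tag set or the study's protocol.

```python
# Illustrative contrast (invented tag labels): a function-based tag records only
# the word class the participle fills, while an adapted verb tag keeps a verb
# label whose attributes mark the participle form and its syntactic function.
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Analysis:
    token: str
    tag: str
    attributes: Dict[str, str] = field(default_factory=dict)


word = "gebreekte"  # attributive use of the participle of "breek" (to break)

# Function-based tagging: the participle is simply tagged as an adjective.
function_based = Analysis(token=word, tag="ADJ")

# Adapted verb tag: the participle stays a verb, with attributes for its
# form and its function in the sentence.
verb_based = Analysis(token=word, tag="VERB",
                      attributes={"form": "participle", "function": "attributive"})

print(function_based)
print(verb_based)
```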