  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Rule analysis and social analysis

Hazell, Laurence Paul January 1986
This thesis investigates the use of rules in the analysis of language mastery and human action, which are both viewed as social phenomena. The investigation is conducted through an examination of two analyses of the use of language in everyday social life and documents how each formulates a different understanding of rule-following in explaining linguistic and social action. The analyses in question are 'Speech Act Theory' and 'Ethnomethodology'. The principal idea of speech act theory is that social action is rule-governed, and the theory attempts to explain the possibility of meaningful social interaction on that basis. The rigidities imposed by the notion of rule-governance frustrate that aim. The thesis then turns to an examination of ethnomethodology and conversation analysis and contrasts the notion of rule-orientation developed by that perspective. From that examination it becomes clear that what is on offer is not just a greater flexibility in the use of rules, but a restructuring of the concept of analysis itself. It is argued that this restructuring amounts to a reflexive conception of analysis. Its meaning and implications are enlarged upon through a close scrutiny of the later philosophy of Wittgenstein, particularly his concern with the nature of rule-following in his 'Philosophical Investigations'. The thesis argues that his concern with rules was motivated by his insight that their use as 'explanations' of action said as much about the formulator of the rule as about the activities the rules were held to formulate. The thesis concludes by outlining the meaning of this analytic reflexivity for social scientific findings.
2

Efficient development of human language technology resources for resource-scarce languages / Martin Johannes Puttkammer

Puttkammer, Martin Johannes January 2014
The development of linguistic data, especially annotated corpora, is imperative for the human language technology enablement of any language. The annotation process is, however, often time-consuming and expensive. As such, various projects make use of several strategies to expedite the development of human language technology resources. For resource-scarce languages – those with limited resources, finances and expertise – the efficiency of these strategies has not been conclusively established. This study investigates the efficiency of some of these strategies in the development of resources for resource-scarce languages, in order to provide recommendations for future projects facing decisions regarding which strategies they should implement. For all experiments, Afrikaans is used as an example of a resource-scarce language. Two tasks, viz. lemmatisation of text data and orthographic transcription of audio data, are evaluated in terms of quality and in terms of the time required to perform the task. The main focus of the study is on the skill level of the annotators, software environments which aim to improve the quality and time needed to perform annotations, and whether it is beneficial to annotate more data, or to increase the quality of the data. We outline and conduct systematic experiments on each of the three focus areas in order to determine the efficiency of each. First, we investigated the influence of a respondent’s skill level on data annotation by using untrained, sourced respondents for annotation of linguistic data for Afrikaans. We compared data annotated by experts, novices and laymen. From the results it was evident that the experts outperformed the non-experts on both tasks, and that the differences in performance were statistically significant. Next, we investigated the effect of software environments on data annotation to determine the benefits of using tailor-made software as opposed to general-purpose or domain-specific software. The comparison showed that, for these two specific projects, it was beneficial in terms of time and quality to use tailor-made software rather than domain-specific or general-purpose software. However, in the context of linguistic annotation of data for resource-scarce languages, the additional time needed to develop tailor-made software is not justified by the savings in annotation time. Finally, we compared systems trained with data of varying levels of quality and quantity, to determine the impact of quality versus quantity on the performance of systems. When comparing systems trained with gold standard data to systems trained with more data containing a low level of errors, the systems trained with the erroneous data were statistically significantly better. Thus, we conclude that it is more beneficial to focus on the quantity rather than on the quality of training data. Based on the results and analyses of the experiments, we offer some recommendations regarding which of the methods should be implemented in practice. For a project aiming to develop gold standard data, the highest quality annotations can be obtained by using experts to double-blind annotate data in tailor-made software (if provided for in the budget or if the development time can be justified by the savings in annotation time). 
For a project that aims to develop a core technology, experts or trained novices should be used to single-annotate data in tailor-made software (if provided for in the budget or if the development time can be justified by the savings in annotation time). / PhD (Linguistics and Literary Theory), North-West University, Potchefstroom Campus, 2014
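The quality-versus-quantity finding above lends itself to a small illustrative experiment. The Python sketch below is only a hedged approximation of the design described in the abstract, not the study's actual setup: the data are synthetic, and the 5% label-noise rate and logistic-regression model are assumptions chosen for brevity. It trains the same model once on a small error-free set and once on a larger noisy set, then compares held-out accuracy.

# Hedged sketch (not the thesis's actual experiment): train the same model on
# a small error-free set and on a larger set with simulated annotation errors,
# then compare held-out accuracy. Data, noise rate and model are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=12000, n_features=20, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=2000, random_state=0)

# "Gold standard": small but error-free training set.
X_gold, y_gold = X_pool[:1000], y_pool[:1000]

# "More but noisier": larger training set with ~5% of the labels flipped.
X_big, y_big = X_pool[:8000], y_pool[:8000].copy()
flip = rng.random(len(y_big)) < 0.05
y_big[flip] = 1 - y_big[flip]

for name, (X_train, y_train) in [("gold, 1000 items", (X_gold, y_gold)),
                                 ("noisy, 8000 items", (X_big, y_big))]:
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))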
3

Conditional Discrimination and Stimulus Equivalence: Effects of Suppressing Derived Symmetrical Responses on the Emergence of Transitivity.

Jones, Aaron A. 05 1900
Symmetry suppression was conducted for five subjects who demonstrated a tendency to derive equivalence relations based on conditional discrimination training in a match-to-sample procedure. Symmetry suppression was applied in three consecutive sessions in which symmetrical responses were suppressed for one stimulus class in the first condition, two stimulus classes in the second condition, and all three stimulus classes in the final condition. Symmetry suppression slowed the emergence of transitivity for two subjects and prevented it for the other three. Results indicated that unplanned features of stimulus configurations emerged as discriminative variables that controlled selection responses and altered the function of consequent stimuli. Disruption of cognitive development by conflicting contingencies in natural learning environments is discussed.
4

Vocal combinations in guenon communication / Des combinaisons vocales dans la communication de cercopithèques forestiers

Coye, Camille 05 July 2016
It is generally accepted that comparative studies on animal communication can provide insights into the coevolution of social life, vocal communication and cognitive capacities, and notably into the emergence of some features of human language. Recent studies have suggested that non-human primates possess combinatorial abilities that may allow a diversification of vocal repertoires, or richer communication, in spite of limited articulatory capacities. However, the functions of combined calls and the information that receivers can extract from them remain poorly understood. This thesis investigated call combination systems in two species of guenons: Campbell's monkey (Cercopithecus campbelli) and Diana monkey (Cercopithecus diana). Firstly, I studied the combinatorial structure of combined calls, and their relevance to receivers, in both species using playback experiments.
The results confirmed the presence of a suffixation mechanism that reduces the urgency of the danger signalled by the alarm calls of male Campbell's monkeys. They also showed that the combined calls of female Diana monkeys convey information linearly via their two units, which signal the caller's emotional state and identity respectively. Secondly, an observational study of the contexts in which female Campbell's monkeys emit simple and combined calls revealed a flexible use of combination, reflecting the immediate need to remain cryptic (simple calls) or to signal the caller's identity (combined calls). Finally, I compared the communication systems of the females of the two species to identify their similarities and differences. As predicted by their close phylogenetic relatedness, their repertoires are mostly based on homologous structures. However, the females differ strongly in their use of those structures. In particular, the large number of calls combined by Diana monkeys considerably increases their vocal repertoire compared to Campbell's monkeys. Given that the combinations are non-random, meaningful to receivers and used flexibly according to context, I propose a parallel with a rudimentary form of semantic morphosyntax and discuss more generally the possible existence of similar capacities in other non-human animals.
5

Automatic lemmatisation for Afrikaans / by Hendrik J. Groenewald

Groenewald, Hendrik Johannes January 2006
A lemmatiser is an important component of various human language technology applications for any language. At present, a rule-based lemmatiser for Afrikaans already exists, but this lemmatiser produces disappointingly low accuracy figures. The performance of the current lemmatiser serves as motivation for developing another lemmatiser based on an approach other than language-specific rules. The alternative method of lemmatiser construction investigated in this study is memory-based learning. Thus, in this research project we develop an automatic lemmatiser for Afrikaans called Lia ("Lemma-identifiseerder vir Afrikaans", 'Lemmatiser for Afrikaans'). In order to construct Lia, the following research objectives are set: i) to define the classes for Afrikaans lemmatisation, ii) to determine the influence of data size and various feature options on the performance of Lia, and iii) to automatically determine the algorithm and parameter settings that deliver the best performance in terms of linguistic accuracy, execution time and memory usage. In order to achieve the first objective, we investigate the processes of inflection and derivation in Afrikaans, since automatic lemmatisation requires a clear distinction between the two. We proceed to define the inflectional categories for Afrikaans, which represent a number of affixes that should be removed from word-forms during lemmatisation. The classes for automatic lemmatisation in Afrikaans are derived from these affixes. It is subsequently shown that accuracy, as well as memory usage and execution time, increases as the amount of training data is increased, and that the various feature options have a significant effect on the performance of Lia. The algorithmic parameters and data representation that deliver the best results are determined with PSearch, a programme that implements Wrapped Progressive Sampling in order to determine a set of possibly optimal algorithmic parameters for each of the TiMBL classification algorithms. Evaluation indicates that an accuracy figure of 92.8% is obtained when training Lia with the best-performing parameters for the IB1 algorithm on feature-aligned data with 20 features. This result indicates that memory-based learning is indeed more suitable than rule-based methods for Afrikaans lemmatiser construction. / Thesis (M.Ing. (Computer and Electronical Engineering))--North-West University, Potchefstroom Campus, 2007.
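As a rough illustration of how memory-based lemmatisation of this kind works, the sketch below imitates TiMBL's IB1 (nearest-neighbour) setup with a generic k-NN classifier: each word is encoded as a fixed-width, right-aligned window of character features, and the class to be predicted is the suffix rewrite that maps the word to its lemma. The toy training pairs, the 10-character window and the rewrite-class encoding are assumptions made for the example; they are not Lia's actual data, feature set or class inventory.

# Hedged sketch of memory-based lemmatisation in the spirit of TiMBL's IB1,
# using a generic nearest-neighbour classifier over character-window features.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder

WIDTH = 10  # characters per word window (Lia's best run used 20 features)

def features(word):
    # Right-align the word in a fixed-width window, padding with "_".
    return list(("_" * WIDTH + word)[-WIDTH:])

def rewrite_class(word, lemma):
    # Class label "-N+S": strip N trailing characters, then append suffix S.
    i = 0
    while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
        i += 1
    return f"-{len(word) - i}+{lemma[i:]}"

# Invented Afrikaans (word, lemma) pairs for illustration only.
pairs = [("loop", "loop"), ("lopende", "loop"), ("gelope", "loop"),
         ("boeke", "boek"), ("kinders", "kind"), ("gesing", "sing")]

enc = OneHotEncoder(handle_unknown="ignore")
X = enc.fit_transform([features(w) for w, _ in pairs])
y = [rewrite_class(w, l) for w, l in pairs]
knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)

def lemmatise(word):
    cls = knn.predict(enc.transform([features(word)]))[0]
    strip, suffix = cls[1:].split("+")
    base = word[:len(word) - int(strip)] if int(strip) else word
    return base + suffix

print(lemmatise("verkopende"))  # the nearest stored example decides the rewrite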
6

Mathematical Expression Recognition based on Probabilistic Grammars

Álvaro Muñoz, Francisco 15 June 2015
Mathematical notation is well known and used all over the world. Humankind has evolved from simple methods of representing counts to the current well-defined mathematical notation, able to express complex problems. Furthermore, mathematical expressions constitute a universal language in scientific fields, and many information resources containing mathematics have been created during the last decades. However, in order to access all that information efficiently, scientific documents have to be digitized or produced directly in electronic formats. Although most people are able to understand and produce mathematical information, introducing math expressions into electronic devices requires learning specific notations or using editors. Automatic recognition of mathematical expressions aims at filling this gap between the knowledge of a person and the input accepted by computers. This way, printed documents containing math expressions could be digitized automatically, and handwriting could be used for direct input of math notation into electronic devices. This thesis is devoted to developing an approach to mathematical expression recognition. In this document we propose an approach for recognizing any type of mathematical expression (printed or handwritten) based on probabilistic grammars. To do so, we develop a formal statistical framework from which several probability distributions are derived. Throughout the document, we deal with the definition and estimation of all these probabilistic sources of information. Finally, we define the parsing algorithm that globally computes the most probable mathematical expression for a given input according to the statistical framework. An important point in this study is to provide objective performance evaluation and to report results using public data and standard metrics. We inspected the problems of automatic evaluation in this field and looked for the best solutions. We also report several experiments using public databases, and we participated in several international competitions. Furthermore, we have released most of the software developed in this thesis as open source. We also explore some of the applications of mathematical expression recognition. In addition to the direct applications of transcription and digitization, we report two important proposals. First, we developed mucaptcha, a method to tell humans and computers apart by means of math handwriting input, which represents a novel application of math expression recognition. Second, we tackled the problem of layout analysis of structured documents using the statistical framework developed in this thesis, because both are two-dimensional problems that can be modeled with probabilistic grammars. The approach developed in this thesis for mathematical expression recognition has obtained good results at different levels. It has produced several scientific publications in international conferences and journals, and has been awarded in international competitions. / Álvaro Muñoz, F. (2015). Mathematical Expression Recognition based on Probabilistic Grammars [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/51665
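The core computation behind parsing with a probabilistic grammar can be sketched briefly. The fragment below is a one-dimensional CYK parser over a token sequence with an invented toy grammar in Chomsky normal form; the thesis itself extends such parsing to two-dimensional mathematical layout, so only the dynamic-programming search for the most probable derivation carries over.

# Hedged sketch: probabilistic CYK over a token sequence with a toy grammar.
# The grammar, probabilities and input are invented for illustration.
import math
from collections import defaultdict

# Rules as (lhs, rhs, probability); rhs is a 1-tuple (terminal) or a 2-tuple.
rules = [
    ("Expr", ("Expr", "OpTerm"), 0.4), ("Expr", ("digit",), 0.6),
    ("OpTerm", ("Op", "Expr"), 1.0),
    ("Op", ("+",), 0.5), ("Op", ("*",), 0.5),
]

def cyk(tokens):
    n = len(tokens)
    best = defaultdict(lambda: -math.inf)  # (i, j, symbol) -> best log-prob
    for i, tok in enumerate(tokens):       # fill unary (terminal) cells
        for lhs, rhs, p in rules:
            if rhs == (tok,):
                best[(i, i + 1, lhs)] = max(best[(i, i + 1, lhs)], math.log(p))
    for span in range(2, n + 1):           # combine adjacent sub-derivations
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for lhs, rhs, p in rules:
                    if len(rhs) == 2:
                        score = (math.log(p) + best[(i, k, rhs[0])]
                                 + best[(k, j, rhs[1])])
                        best[(i, j, lhs)] = max(best[(i, j, lhs)], score)
    return best[(0, n, "Expr")]

# "digit + digit * digit", pre-tokenised into terminal classes:
print(cyk(["digit", "+", "digit", "*", "digit"]))  # log-prob of the best parse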
7

Lexical resources in psycholinguistic research

January 2012
Experimental and quantitative research in the field of human language processing and production depends strongly on the quality of the underlying language material: besides its size, representativeness, variety and balance have been discussed as important factors that influence the design, analysis and interpretation of experiments and their results. This volume brings together creators and users of both general-purpose and specialized lexical resources used in psychology, psycholinguistics, neurolinguistics and cognitive research. It aims to be a forum for reporting experiences and results, reviewing problems and discussing perspectives on any linguistic data used in the field.
8

Outomatiese genreklassifikasie vir hulpbronskaars tale [Automatic genre classification for resource-scarce languages] / Dirk Snyman

Snyman, Dirk Petrus January 2012
When working in the terrain of text processing, metadata about a particular text plays an important role. Metadata is often generated using automatic text classification systems, which classify a text into one or more predefined classes or categories based on its contents. One of the dimensions by which a text can be classified is its genre. In this study the development of an automatic genre classification system in a resource-scarce environment is presented. The study aims to: i) investigate the techniques and approaches that are generally used for automatic genre classification systems, and identify the best approach for Afrikaans (a resource-scarce language), ii) transfer this approach to other indigenous South African resource-scarce languages, and iii) investigate the effectiveness of technology recycling for closely related languages in a resource-scarce environment. To achieve the first goal, five machine learning approaches generally used for text classification were identified from the literature, together with five common approaches to feature extraction. Two different approaches to the identification of genre classes are presented. The machine learning, feature extraction and genre class identification approaches were used in a series of experiments to identify the best approach to genre classification for a resource-scarce language. The best combination was identified as the multinomial naïve Bayes algorithm, using a bag-of-words feature representation, classifying texts into three abstract classes. This results in an f-score (performance measure) of 0.929, and it was subsequently shown that this approach can be successfully applied to other indigenous South African languages. To investigate the viability of technology recycling for genre classification systems for closely related languages, Dutch test data was classified using an Afrikaans genre classification system, and it is shown that this approach works well. A pre-processing step was implemented using a machine translation system to increase the compatibility between Afrikaans and Dutch by translating the Dutch texts before classification. This results in an f-score of 0.577, indicating that technology recycling between closely related languages has merit. This approach can be used to promote and fast-track the development of genre classification systems in a resource-scarce environment. / MA (Linguistics and Literary Theory), North-West University, Potchefstroom Campus, 2013
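The best-performing configuration identified above (a multinomial naïve Bayes classifier over bag-of-words features, with three abstract genre classes) is straightforward to sketch with off-the-shelf tools. In the sketch below, the toy Afrikaans sentences and the class labels are invented placeholders, not the study's corpus or its actual classes.

# Hedged sketch of multinomial naive Bayes genre classification over a
# bag-of-words representation; texts and labels are invented placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "die verdagte is gister in die hof aangekla",      # news-like
    "hiermee bevestig ons die ontvangs van u brief",   # official/administrative
    "sy het stadig deur die stil huis geloop",         # fiction-like
]
train_labels = ["informative", "instructive", "narrative"]  # hypothetical classes

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["die minister het vandag die begroting aangekondig"]))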
