Global ETD Search

11	Kvantitativní charakteristiky termínů / Quantitative Characteristics of Terms Kováříková, Dominika January 2014 The new method of automatic term recognition, TERMIT, focuses not only on a high number of correctly labeled terms but also on the most important attributes of a term (in terms of their role in the automatic term identification process). The method is based on data mining, i.e. finding meaningful information in very large corpus data. It was able both to successfully identify terms in academic texts and to find the constitutive features of a term as a terminological unit. The single-word term (SWT) can be characterized as a word with a low frequency in the corpus (SYN2010) that occurs considerably more often in specialized texts of a given field than in non-academic texts and occurs in a small number of academic disciplines; its distribution in the corpus (SYN2010) is uneven, as is the distance between its occurrences. The multi-word term (MWT) is a stable collocation that consists of low-frequency words and contains at least one (and often more) single-word term. Based on the characteristics of SWT and MWT, it is possible to classify individual tokens in texts as terms or non-terms with a success rate of more than 95 %. Automatically identified terms can be used to identify the percentage of SWT or MWT in different academic disciplines, as well as to find terms shared by two or more domains in order to assess their...
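One of the SWT characteristics named in the abstract above, that a term occurs considerably more often in specialized texts than in non-academic texts, can be illustrated with a relative-frequency ratio. The sketch below is not the TERMIT method: the function names, the smoothing constant, and the toy token lists are assumptions made purely for illustration, and the abstract's other features (corpus dispersion, number of disciplines) are left out.

```python
from collections import Counter


def relative_frequency(counts, word):
    """Share of all tokens in a corpus that are `word` (0.0 for an empty corpus)."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


def specificity_ratio(word, specialized_tokens, general_tokens, smoothing=1e-9):
    """Ratio of relative frequencies: specialized corpus vs. general corpus.

    Values well above 1 suggest the word is typical of the specialized texts.
    `smoothing` (an assumed constant) avoids division by zero for words
    that never occur in the general corpus.
    """
    spec = Counter(specialized_tokens)
    gen = Counter(general_tokens)
    return (relative_frequency(spec, word) + smoothing) / (
        relative_frequency(gen, word) + smoothing
    )


# Toy data, invented for the example.
specialized = ["enzyme", "binds", "the", "substrate", "the", "enzyme"]
general = ["the", "dog", "saw", "the", "cat", "and", "the", "enzyme"]

print(specificity_ratio("enzyme", specialized, general))  # above 1
print(specificity_ratio("the", specialized, general))     # below 1
```

A real classifier would combine such a ratio with the other features the abstract lists rather than thresholding it alone.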
12	Las secuencias formulaicas en la adquisición de español L2 / Multi-word structures in Spanish L2 acquisition Moreno Teva, Inmaculada January 2012 The main purpose of this study is to observe the effect on L2 acquisition of studying abroad for approximately four months among Swedish non-native speakers (NNSs) of Spanish, with respect to their use of multi-word structures (MWSs) as compared to native speakers (NSs). In addition, this evolutionary study has a secondary aim, which is to see the effect of the activity type on the amount and distribution of the MWSs encountered. The study shows positive effects of a study abroad period on L2 use and, particularly, on MWSs. The amount and variety of the NNSs’ MWSs increased during their stay in Spain, and the differences from the NSs in the use of MWSs diminished or, in some cases, even disappeared. The improvement in the NNSs’ discourse competence is notable. The study also shows that the type of task affects the results. Thus, the negotiations, which contain specialised vocabulary that participants are familiar with, yield a higher token frequency of MWSs among NSs and NNSs than the focus group discussions, which are freer and more spontaneous. The negotiations also yield a higher token frequency of conceptual MWSs, especially noun phrases, because the specialised vocabulary is more complex and subject to greater nominalisation. The focus group discussions have a higher token frequency of own-management MWSs than the negotiations, which is attributed to higher communicative pressure. On the other hand, the token frequency of interaction management MWSs is higher among the NSs in the mixed group discussions than in those with only NSs, as a result of collaborative interaction between the NSs and the NNSs. Individual differences among NNSs have also been observed, and five profiles have been distinguished.
These differences generally decrease by the end of the stay, which also indicates a positive development. There is a positive development in all profiles, which is reflected in significant changes in the amount and variety of the MWSs, their distribution across categories, or the emergence of more complex types. A direct link has also been observed between communication orientedness, participation in conversation and a positive development. Keywords: activity type; discourse competence; evolutionary study; multi-word structures; individual differences; Spanish L2 acquisition; studying abroad effect; task variation. Subject: Languages and Literature (Språk och litteratur)
13	Víceslovná slovesa v mluvě rodilých a nerodilých mluvčích angličtiny. / Multi-word verbs in speech of native and non-native speakers of English. Divišová, Klára January 2020 The present thesis is concerned with the use of multi-word verbs (MWV) in the speech of native and non-native (Czech) speakers of English. More precisely, it aims to give a quantitative as well as qualitative analysis of the use of three main MWV categories: phrasal verbs (PhV), prepositional verbs (PrV) and phrasal-prepositional verbs (PPV). In addition, it summarizes the main research areas in the field of MWV, one of them being the avoidance of MWV by non-native speakers of English, which inspired this study. The material comes from two spoken corpora: the LINDSEI_CZ corpus of Czech speakers and its reference corpus LOCNEC of English native speakers. The analysis tests three hypotheses: non-native speakers use MWV less than native speakers do, prepositional verbs are the MWV category favoured by non-native speakers, and non-native speakers overuse certain MWV. The results show that the biggest difference is in the use of PhV, as the non-native speakers use significantly fewer PhV than the native speakers; their usage of PPV and especially PrV is rather comparable to that of native speakers. Non-native speakers also overuse (and conversely underuse) certain MWV that are far less (or conversely more)...
14	Zpracování turkických jazyků / Processing of Turkic Languages Ciddi, Sibel January 2014 Title: Processing of Turkic Languages. Author: Sibel Ciddi. Department: Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague. Supervisor: RNDr. Daniel Zeman, Ph.D. Abstract: This thesis presents several methods for the morphological processing of Turkic languages, such as Turkish, which pose a specific set of challenges for natural language processing. In order to alleviate the problems with lack of large language resources, it makes the data sets used for morphological processing and expansion of lexicons publicly available for further use by researchers. Data sparsity, caused by highly productive and agglutinative morphology in Turkish, imposes difficulties in processing of Turkish text, especially for methods using purely statistical natural language processing. Therefore, we evaluated a publicly available rule-based morphological analyzer, TRmorph, based on finite state methods and technologies. In order to enhance the efficiency of this analyzer, we worked on expansion of lexicons, by employing heuristics-based methods for the extraction of named entities and multi-word expressions. Furthermore, as a preprocessing step, we introduced a dictionary-based recognition method for tokenization of multi-word expressions. This method complements...
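The dictionary-based recognition of multi-word expressions mentioned in the abstract above can be pictured as a greedy longest-match pass over the token stream. The following is a minimal sketch under that assumption, not the thesis's actual implementation or anything tied to TRmorph: the function name, the underscore joiner, and the tiny lexicon are all hypothetical.

```python
def merge_mwes(tokens, mwe_lexicon, joiner="_"):
    """Greedy longest-match merging of multi-word expressions.

    `mwe_lexicon` is a set of token tuples (a hypothetical dictionary).
    At each position the longest matching entry is merged into one token.
    """
    max_len = max((len(entry) for entry in mwe_lexicon), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first, down to length 2.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in mwe_lexicon:
                merged.append(joiner.join(candidate))
                i += n
                matched = True
                break
        if not matched:
            merged.append(tokens[i])
            i += 1
    return merged


# Toy lexicon, invented for the example.
lexicon = {("New", "York"), ("New", "York", "City")}
print(merge_mwes(["I", "love", "New", "York", "City", "today"], lexicon))
# ['I', 'love', 'New_York_City', 'today']
```

Preferring the longest match keeps a shorter entry from splitting a longer one; for an agglutinative language the lookup would presumably need to run on lemmas or stems rather than surface forms, which this sketch does not attempt.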
15	Accès à l'information dans les grandes collections textuelles en langue arabe / Information access in large Arabic textual collections El Mahdaouy, Abdelkader 16 December 2017 Face à la quantité d'information textuelle disponible sur le web en langue arabe, le développement des Systèmes de Recherche d'Information (SRI) efficaces est devenu incontournable pour retrouver l'information pertinente. La plupart des SRIs actuels de la langue arabe reposent sur la représentation par sac de mots et l'indexation des documents et des requêtes est effectuée souvent par des mots bruts ou des racines. Ce qui conduit à plusieurs problèmes tels que l'ambigüité et la disparité des termes, etc. Dans ce travail de thèse, nous nous sommes intéressés à apporter des solutions aux problèmes d'ambigüité et de disparité des termes pour l'amélioration de la représentation des documents et le processus de l'appariement des documents et des requêtes. Nous apportons quatre contributions au niveau de processus de représentation, d'indexation et de recherche d'information en langue arabe. La première contribution consiste à représenter les documents à la fois par des termes simples et des termes complexes. Cela est justifié par le fait que les termes simples seuls et isolés de leur contexte sont ambigus et moins précis pour représenter le contenu des documents. Ainsi, nous avons proposé une méthode hybride pour l’extraction de termes complexes en langue arabe, en combinant des propriétés linguistiques et des modèles statistiques. Le filtre linguistique repose à la fois sur l'étiquetage morphosyntaxique et la prise en compte des variations pour sélectionner les termes candidats. Pour sélectionner les termes candidats pertinents, nous avons introduit une mesure d'association permettant de combiner l'information contextuelle avec les degrés de spécificité et d'unité.
La deuxième contribution consiste à explorer et évaluer les systèmes de recherche d’informations permettant de tenir compte de l’ensemble des éléments d’indexation (termes simples et complexes). Par conséquent, nous étudions plusieurs extensions des modèles existants de RI pour l'intégration des termes complexes. En outre, nous explorons une panoplie de modèles de proximité. Pour la prise en compte des dépendances de termes dans les modèles de RI, nous introduisons une condition caractérisant de tels modèles et leur validation théorique. La troisième contribution permet de pallier le problème de disparité des termes en proposant une méthode pour intégrer la similarité entre les termes dans les modèles de RI en s'appuyant sur les représentations distribuées des mots (RDMs). L'idée sous-jacente consiste à permettre aux termes similaires à ceux de la requête de contribuer aux scores des documents. Les extensions des modèles de RI proposées dans le cadre de cette méthode sont validées en utilisant les contraintes heuristiques d'appariement sémantique. La dernière contribution concerne l'amélioration des modèles de rétro-pertinence (Pseudo Relevance Feedback PRF). Étant basée également sur les RDM, notre méthode permet d'intégrer la similarité entre les termes d'expansions et ceux de la requête dans les modèles standards PRF. La validation expérimentale de l'ensemble des contributions apportées dans le cadre de cette thèse est effectuée en utilisant la collection standard TREC 2002/2001 de la langue arabe. / Given the amount of Arabic textual information available on the web, developing effective Information Retrieval Systems (IRSs) has become essential to retrieve relevant information. Most of the current Arabic IRSs are based on the bag-of-words representation, where documents are indexed using surface words, roots or stems.
Two main drawbacks of the latter representation are the ambiguity of Single Word Terms (SWTs) and term mismatch. The aim of this work is to deal with SWT ambiguity and term mismatch. Accordingly, we propose four contributions to improve Arabic content representation, indexing, and retrieval. The first contribution consists of representing Arabic documents using Multi-Word Terms (MWTs). This is motivated by the fact that MWTs are more precise representational units and less ambiguous than isolated SWTs. Hence, we propose a hybrid method to extract Arabic MWTs, which combines linguistic and statistical filtering of MWT candidates. The linguistic filter uses POS tagging to identify MWT candidates that fit a set of syntactic patterns and handles the problem of MWT variation. Then, the statistical filter ranks MWT candidates using our proposed association measure, which combines contextual information with both termhood and unithood measures. In the second contribution, we explore and evaluate several IR models for ranking documents using both SWTs and MWTs. Additionally, we investigate a wide range of proximity-based IR models for Arabic IR. Then, we introduce a formal condition that IR models should satisfy to deal adequately with term dependencies. The third contribution consists of a method based on Distributed Representation of Word vectors, namely Word Embedding (WE), for Arabic IR. It relies on incorporating WE semantic similarities into existing probabilistic IR models in order to deal with term mismatch. The aim is to allow distinct but semantically similar terms to contribute to document scores. The last contribution is a method to incorporate WE similarity into Pseudo-Relevance Feedback (PRF) for Arabic Information Retrieval. The main idea is to select expansion terms using their distribution in the set of top pseudo-relevant documents along with their similarity to the original query terms.
The experimental validation of all the proposed contributions is performed using the standard Arabic TREC 2002/2001 collection. Keywords: Recherche d'Information; Dépendance de Termes; Termes Complexes; Disparité des mots; Représentations Distribuées des Mots; Information Retrieval; Arabic Natural Language Processing; Term Dependencies; Multi-Word Terms; Term Mismatch; Word Embedding. 004 490
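The idea in the third contribution above, letting distinct but semantically similar terms contribute to document scores, can be sketched with cosine similarity over word vectors. This is a toy illustration only: it does not reproduce the probabilistic IR model extensions of the thesis, and the function names, the similarity threshold, and the two-dimensional vectors are assumptions invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def soft_match_score(query_terms, doc_terms, vectors, threshold=0.7):
    """Sum, over query terms, of the best similarity to any document term.

    Exact matches count as 1.0; other pairs count only if their embedding
    similarity reaches `threshold` (an assumed cut-off).
    """
    score = 0.0
    for q in query_terms:
        best = 0.0
        for d in doc_terms:
            if q == d:
                sim = 1.0
            elif q in vectors and d in vectors:
                sim = cosine(vectors[q], vectors[d])
            else:
                sim = 0.0
            if sim >= threshold:
                best = max(best, sim)
        score += best
    return score


# Toy embeddings, invented for the example.
vectors = {"car": [1.0, 0.0], "automobile": [0.9, 0.1], "banana": [0.0, 1.0]}

print(soft_match_score(["car"], ["automobile"], vectors))  # close to 1
print(soft_match_score(["car"], ["banana"], vectors))      # 0.0
```

A pure bag-of-words matcher would score the first document at zero, which is the term-mismatch problem the abstract describes.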
16	Nové názvy profesí / New Names of Professions Kožuriková, Daniela January 2016 The thesis New Names of Professions is divided into a theoretical and a practical part. In the theoretical part, basic terms from the world of work are defined; the essential term is profession. Furthermore, neologisms are defined in the first part in accordance with professional literature without being confined to denomination of persons. For the purposes of the practical part, new names of professions were excerpted from the list of professions of the employment office and from a list in Učitelské noviny. Their occurrence was further verified using the EDA and Newton media databases and the results of the research were compared with the created list. An alphabetical list of the excerpted professions and a thematically divided list of those professions are an integral part of this work.
17	Lexikální koselekce v anglickém textu nerodilých mluvčích / Lexical coselections in non-native speaker English text Felcmanová, Andrea January 2012 The research reported in this thesis explores the degree of authenticity of the formulaic language used by NNSs and the extent to which a learner's L1 interferes in the production of different types of multi-word units, namely non-idiomatic recurrent three- and four-word combinations (lexical bundles), phrasal and prepositional verbs, and collocations. Drawing on Granger's Contrastive Interlanguage Analysis (CIA, 1996), the investigation is conducted on two different learner sample corpora, which are subsequently contrasted with a native sample corpus. The study aims to show that multi-word units pose a challenge for learners for several reasons. In general terms, learners are assumed to operate predominantly on what Sinclair calls the open-choice principle; that is to say, their production will be less idiomatic than that of native speakers. This assumption is independently tested on different types of phraseological combinations. As regards non-idiomatic recurrent word combinations, learners are expected to be more repetitive in their three- and four-word combinations and to be less creative in their writing. Concerning phrasal verbs, only a small number is expected in the non-native writing, whereas prepositional verbs are considered problematic for learners due to the...
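The recurrent three- and four-word combinations (lexical bundles) discussed in the abstract above can be extracted by counting contiguous n-grams and keeping those that recur. The sketch below is only illustrative: the function name and the `min_count` cut-off are assumptions, and corpus studies usually also normalise frequencies per million words and require a bundle to occur across several texts.

```python
from collections import Counter


def recurrent_ngrams(tokens, n, min_count=2):
    """Return contiguous n-grams that occur at least `min_count` times."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: count for gram, count in grams.items() if count >= min_count}


# Toy text, invented for the example.
tokens = "on the other hand it is on the other hand".split()

print(recurrent_ngrams(tokens, 3))
# {('on', 'the', 'other'): 2, ('the', 'other', 'hand'): 2}
print(recurrent_ngrams(tokens, 4))
# {('on', 'the', 'other', 'hand'): 2}
```

Comparing such counts between a learner corpus and a native corpus is one simple way to look at the repetitiveness the abstract expects in learner writing.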
