411 |
Marco Polo's Travels Revisited: From Motion Event Detection to Optimal Path Computation in 3D Maps / Niekler, Andreas; Wolska, Magdalena; Wiegmann, Matti; Stein, Benno; Burghardt, Manuel; Thiel, Marvin. 11 July 2024 (has links)
In this work, we present a workflow for the semi-automatic extraction
of geo-references and motion events from the book 'The Travels
of Marco Polo'. These are then used to create 3D renderings
of space and movement that allow readers to visually trace
Marco Polo's route and experience the journey in its entirety.
|
412 |
Facilitating forgiveness: an NLP approach to forgiving / Von Krosigk, Beate Christine. 31 May 2004 (has links)
Facilitating forgiveness: an NLP approach to forgiving is an attempt at uncovering features of the blocks that prevent people from forgiving. These blocks to forgiveness can be detected in the real-life situations of the six individuals who told me their stories. The inner thoughts, feelings, and subsequent behaviour that prevented them from forgiving others are clearly uncovered in their stories. The facilitation process highlights the features that created the blocks in the past, thus preventing forgiveness from occurring. The blocks, with their accompanying features, reveal what needs to be clarified or changed in order to eventually enable the hurt individuals to forgive those who have hurt them. The application of discourse analysis to the stories of hurt links the individuals' real-life stories, within their contexts and with regard to unforgiveness, to the research findings of the existing body of knowledge, thereby creating a complexly interwoven, comprehensive understanding of the individuals' thoughts, feelings, and behaviours in conjunction with their developmental phases within their socio-cultural contexts.
Neuro-linguistic programming (NLP) is the instrument with which forgiving is facilitated in the six individuals who expressed a conscious desire to forgive because they were unable to do so on their own. Their emotions kept them in a place in which they were forced to relive the hurtful event as if it were happening in the present. Arresting the process of reliving negative emotions requires a new way of being in the world. The assumption that this can be learnt is based on the results of a previous study, in which forgiveness was uncovered by means of the grounded theory approach as a cognitive process (Von Krosigk, 2000). The results of the previous research, in conjunction with the results and insights from this research study, are presented in the form of a grounded theory model of forgiveness. / Psychology / D. Litt. et Phil. (Psychology)
|
413 |
Désambiguisation de sens par modèles de contextes et son application à la Recherche d’Information / Brosseau-Villeneuve, Bernard. 12 1900 (has links)
Il est connu que les problèmes d'ambiguïté de la langue ont un effet néfaste sur les résultats des systèmes de Recherche d'Information (RI). Toutefois, les efforts de recherche visant à intégrer des techniques de Désambiguisation de Sens (DS) à la RI n'ont pas porté fruit. La plupart des études sur le sujet obtiennent effectivement des résultats négatifs ou peu convaincants. De plus, des investigations basées sur l'ajout d'ambiguïté artificielle concluent qu'il faudrait une très haute précision de désambiguation pour arriver à un effet positif. Ce mémoire vise à développer de nouvelles approches plus performantes et efficaces, se concentrant sur l'utilisation de statistiques de cooccurrence afin de construire des modèles de contexte. Ces modèles pourront ensuite servir à effectuer une discrimination de sens entre une requête et les documents d'une collection.
Dans ce mémoire à deux parties, nous ferons tout d'abord une investigation de la force de la relation entre un mot et les mots présents dans son contexte, proposant une méthode d'apprentissage du poids d'un mot de contexte en fonction de sa distance du mot modélisé dans le document. Cette méthode repose sur l'idée que des modèles de contextes faits à partir d'échantillons aléatoires de mots en contexte devraient être similaires. Des expériences en anglais et en japonais montrent que la force de relation en fonction de la distance suit généralement une loi de puissance négative. Les poids résultant des expériences sont ensuite utilisés dans la construction de systèmes de DS Bayes Naïfs. Des évaluations de ces systèmes sur les données de l'atelier Semeval en anglais pour la tâche Semeval-2007 English Lexical Sample, puis en japonais pour la tâche Semeval-2010 Japanese WSD, montrent que les systèmes ont des résultats comparables à l'état de l'art, bien qu'ils soient bien plus légers, et ne dépendent pas d'outils ou de ressources linguistiques.
La deuxième partie de ce mémoire vise à adapter les méthodes développées à des applications de Recherche d'Information. Ces applications ont la difficulté additionnelle de ne pas pouvoir dépendre de données créées manuellement. Nous proposons donc des modèles de contextes à variables latentes basés sur l'Allocation Dirichlet Latente (LDA). Ceux-ci seront combinés à la méthode de vraisemblance de requête par modèles de langue. En évaluant le système résultant sur trois collections de la conférence TREC (Text REtrieval Conference), nous observons une amélioration proportionnelle moyenne de 12% du MAP et 23% du GMAP. Les gains se font surtout sur les requêtes difficiles, augmentant la stabilité des résultats. Ces expériences seraient la première application positive de techniques de DS sur des tâches de RI standard. / It is known that the ambiguity present in natural language has a negative effect on the effectiveness of Information Retrieval (IR) systems. However, up to now, efforts made to integrate Word Sense Disambiguation (WSD) techniques into IR systems have not been successful; past studies end up with either poor or unconvincing results. Furthermore, investigations based on the addition of artificial ambiguity show that a very high disambiguation accuracy would be needed in order to observe gains. This thesis aims to develop efficient and effective approaches for WSD, using co-occurrence statistics to build context models. Such models can then be used to perform word sense discrimination between a query and the documents of a collection.
In this two-part thesis, we start by investigating the strength of the relation between a word and the words present in its context, proposing an approach to learn a function mapping word distance to count weights. This method is based on the idea that context models made from random samples of words in context should be similar. Experiments in English and Japanese show that the strength of relation roughly follows a negative power law. The weights resulting from these experiments are then used in the construction of Naïve Bayes WSD systems. Evaluations of these systems in English on the Semeval-2007 English Lexical Sample (ELS) task, and in Japanese on the Semeval-2010 Japanese WSD (JWSD) task, show that the systems achieve state-of-the-art accuracy even though they are much lighter and do not rely on linguistic tools or resources.
The second part of this thesis aims to adapt the new methods to IR applications. Such applications put heavy constraints on performance and available resources. We thus propose the use of corpus-based latent context models based on Latent Dirichlet Allocation (LDA). The models are combined with the query likelihood Language Model (LM) approach for IR. Evaluating the systems on three collections from the Text REtrieval Conference (TREC), we observe average proportional improvements of 12% in MAP and 23% in GMAP. The gains are mostly made on hard queries, improving the robustness of the results. To our knowledge, these experiments are the first positive application of WSD techniques on standard IR tasks.
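As a hedged illustration of the distance-weighted Naïve Bayes approach described above (the tiny corpus, the sense labels, the window size and the power-law exponent below are all invented for the example, not taken from the thesis):

```python
# Toy distance-weighted Naive Bayes sense classifier. Context words contribute
# fractional counts w(d) = d**(-ALPHA), echoing the negative power law the
# thesis observes; ALPHA = 1.0 and the window size are illustrative assumptions.
import math
from collections import defaultdict

ALPHA = 1.0

def weight(distance):
    """Strength of relation between the target and a context word at `distance`."""
    return distance ** -ALPHA

def train(examples, window=5):
    """examples: list of (tokens, target_index, sense) tuples."""
    counts = defaultdict(lambda: defaultdict(float))  # sense -> word -> weighted count
    priors = defaultdict(float)                       # sense -> example count
    for tokens, t, sense in examples:
        priors[sense] += 1
        for i, w in enumerate(tokens):
            d = abs(i - t)
            if 0 < d <= window:
                counts[sense][w] += weight(d)
    return counts, priors

def classify(counts, priors, tokens, t, window=5):
    vocab = {w for cnt in counts.values() for w in cnt}
    best, best_score = None, -math.inf
    for sense, cnt in counts.items():
        total = sum(cnt.values())
        score = math.log(priors[sense])
        for i, w in enumerate(tokens):
            d = abs(i - t)
            if 0 < d <= window:
                # Laplace-smoothed likelihood, itself weighted by distance.
                p = (cnt.get(w, 0.0) + 1.0) / (total + len(vocab))
                score += weight(d) * math.log(p)
        if score > best_score:
            best, best_score = sense, score
    return best
```

A classifier of this shape needs no parser or lexical resource, only tokenized text, which matches the abstract's claim of lightness.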
|
414 |
Le repérage automatique des entités nommées dans la langue arabe : vers la création d'un système à base de règles / Zaghouani, Wajdi. January 2009 (has links)
Thesis digitized by the Division de la gestion de documents et des archives, Université de Montréal.
|
417 |
Integrating Natural Language Processing (NLP) and Language Resources Using Linked Data / Hellmann, Sebastian. 12 January 2015 (has links) (PDF)
This thesis is a compendium of scientific works and engineering
specifications that have been contributed to a large community of
stakeholders to be copied, adapted, mixed, built upon and exploited in
any way possible to achieve a common goal: Integrating Natural Language
Processing (NLP) and Language Resources Using Linked Data.
The explosion of information technology in the last two decades has led
to a substantial growth in quantity, diversity and complexity of
web-accessible linguistic data. These resources become even more useful
when linked with each other and the last few years have seen the
emergence of numerous approaches in various disciplines concerned with
linguistic resources and NLP tools. It is the challenge of our time to
store, interlink and exploit this wealth of data accumulated in more
than half a century of computational linguistics, of empirical,
corpus-based study of language, and of computational lexicography in all
its heterogeneity.
The vision of the Giant Global Graph (GGG) was conceived by Tim
Berners-Lee with the aim of connecting all data on the Web and allowing
new relations to be discovered between these openly accessible data.
This vision has been pursued by the Linked Open Data (LOD) community,
where the cloud of published datasets comprises 295 data repositories
and more than 30 billion RDF triples (as of September 2011).
RDF is based on globally unique and accessible URIs, and it was
specifically designed to establish links between such URIs (or
resources). This is captured in the Linked Data paradigm, which
postulates four rules: (1) referred entities should be designated by
URIs, (2) these URIs should be resolvable over HTTP, (3) data should be
represented by means of standards such as RDF, and (4) a resource should
include links to other resources.
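As a rough sketch of the four rules (all URIs and the toy vocabulary below are invented for the example and use no RDF library), the triples can be built and serialized with plain Python:

```python
# Minimal sketch of the Linked Data rules with plain Python, no RDF library.
# All URIs below are hypothetical examples, not resources from the thesis.

BASE = "http://example.org/resource/"   # rule 1: entities are designated by URIs
                                        # rule 2: these URIs should resolve over HTTP

def triple(s, p, o):
    """Represent one RDF statement as a (subject, predicate, object) tuple."""
    return (s, p, o)

# rule 3: data is expressed in a standard model (subject-predicate-object triples)
graph = [
    triple(BASE + "Leipzig", "http://www.w3.org/2000/01/rdf-schema#label", '"Leipzig"'),
    # rule 4: resources link to other resources (here, a DBpedia URI)
    triple(BASE + "Leipzig", "http://www.w3.org/2002/07/owl#sameAs",
           "http://dbpedia.org/resource/Leipzig"),
]

def to_ntriples(g):
    """Serialize the toy graph in an N-Triples-like syntax."""
    lines = []
    for s, p, o in g:
        obj = o if o.startswith('"') else f"<{o}>"
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)

print(to_ntriples(graph))
```

The `owl:sameAs` link in the second triple is what lets a consumer hop from one dataset into another, which is the mechanism behind the LOD cloud's growth discussed next.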
Although it is difficult to precisely identify the reasons for the
success of the LOD effort, advocates generally argue that open licenses
as well as open access are key enablers for the growth of such a network
as they provide a strong incentive for collaboration and contribution by
third parties. In his keynote at BNCOD 2011, Chris Bizer argued that
with RDF the overall data integration effort can be “split between data
publishers, third parties, and the data consumer”, a claim that can be
substantiated by observing the evolution of many large data sets
constituting the LOD cloud.
As noted in the acknowledgement section, parts of this thesis have
received extensive feedback from other scientists, practitioners and
industry in many different ways. The main contributions of this thesis
are summarized here:
Part I – Introduction and Background.
During his keynote at the Language Resources and Evaluation Conference
in 2012, Sören Auer stressed the decentralized, collaborative,
interlinked and interoperable nature of the Web of Data. The keynote
provides strong evidence that Semantic Web technologies such as Linked
Data are on their way to becoming mainstream for the representation of
language resources. The jointly written companion publication for the
keynote was later extended as a book chapter in The People’s Web Meets
NLP and serves as the basis for “Introduction” and “Background”,
outlining some stages of the Linked Data publication and refinement
chain. Both chapters stress the importance of open licenses and open
access as enablers for collaboration and the ability to interlink data
on the Web as a key feature of RDF, and discuss scalability issues and
decentralization. Furthermore, we elaborate on how conceptual
interoperability can be achieved by (1) re-using vocabularies, (2) agile
ontology development, (3) meetings to refine and adapt ontologies and
(4) tool support to enrich ontologies and match schemata.
Part II - Language Resources as Linked Data.
“Linked Data in Linguistics” and “NLP & DBpedia, an Upward Knowledge
Acquisition Spiral” summarize the results of the Linked Data in
Linguistics (LDL) Workshop in 2012 and the NLP & DBpedia Workshop in
2013 and give a preview of the MLOD special issue. In total, five
proceedings – three published at CEUR (OKCon 2011, WoLE 2012, NLP &
DBpedia 2013), one Springer book (Linked Data in Linguistics, LDL 2012)
and one journal special issue (Multilingual Linked Open Data, MLOD to
appear) – have been (co-)edited to create incentives for scientists to
convert and publish Linked Data and thus to contribute open and/or
linguistic data to the LOD cloud. Based on the disseminated call for
papers, 152 authors contributed one or more accepted submissions to our
venues and 120 reviewers were involved in peer-reviewing.
“DBpedia as a Multilingual Language Resource” and “Leveraging the
Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Linked
Data Cloud” contain this thesis’ contribution to the DBpedia Project in
order to further increase the size and inter-linkage of the LOD Cloud
with lexical-semantic resources. Our contribution comprises extracted
data from Wiktionary (an online, collaborative dictionary similar to
Wikipedia) in more than four languages (now six) as well as
language-specific versions of DBpedia, including a quality assessment of
inter-language links between Wikipedia editions and internationalized
content negotiation rules for Linked Data. In particular, this work
created the foundation for a DBpedia Internationalisation Committee,
with members from over 15 different languages, whose common goal is to
push DBpedia as a free and open multilingual language resource.
Part III - The NLP Interchange Format (NIF).
“NIF 2.0 Core Specification”, “NIF 2.0 Resources and Architecture” and
“Evaluation and Related Work” constitute one of the main contributions
of this thesis. The NLP Interchange Format (NIF) is an RDF/OWL-based
format that aims to achieve interoperability between Natural Language
Processing (NLP) tools, language resources and annotations. The core
specification describes which URI schemes and RDF vocabularies must be
used for (parts of) natural language texts and annotations in order to
create an RDF/OWL-based interoperability layer with NIF built upon
Unicode code points in Normal Form C. The classes and properties of the
NIF Core Ontology are then described to formally define the relations
between text, substrings and their URI schemes, followed by
the evaluation of NIF.
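As a hypothetical sketch of the offset-based URI scheme (the document URI and the sentence below are invented; the `#char=begin,end` fragment follows the RFC 5147 style that NIF's URI schemes build on), annotation URIs over NFC-normalized text can be minted like this:

```python
import unicodedata

def char_uri(doc_uri, text, begin, end):
    """Mint a NIF-style offset URI for a substring of `text`.

    Offsets are counted over the NFC-normalized form of the text,
    mirroring NIF's grounding in Unicode code points in Normal Form C.
    `doc_uri` is a hypothetical document identifier, not a real resource.
    """
    nfc = unicodedata.normalize("NFC", text)
    assert 0 <= begin <= end <= len(nfc), "offsets must lie within the text"
    return f"{doc_uri}#char={begin},{end}", nfc[begin:end]

doc = "http://example.org/doc1"
text = "My favourite actress is Natalie Portman!"
uri, anchored = char_uri(doc, text, 24, 39)
print(uri)       # http://example.org/doc1#char=24,39
print(anchored)  # Natalie Portman
```

Because the URI encodes the exact substring position, any NLP tool emitting RDF about that span addresses the same resource, which is the interoperability layer the specification is after.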
In a questionnaire, we asked 13 developers using NIF about their
experience. UIMA, GATE and Stanbol are extensible NLP frameworks, and
NIF was not yet able to provide off-the-shelf NLP domain ontologies for
all possible domains, but only for the plugins used in this study. After
inspecting the software, the developers nevertheless agreed that NIF is
adequate to provide generic RDF output based on NIF using literal
objects for annotations. All developers were able to map their internal
data structures to NIF URIs to serialize RDF output (adequacy). The
development effort in hours (ranging between 3 and 40) as well as the
number of code lines (ranging between 110 and 445) suggest that
implementing a NIF wrapper is easy and fast for an average developer.
Furthermore, the evaluation contains a comparison to other formats and
an evaluation of the available URI schemes for web annotation.
In order to collect input from the wide group of stakeholders, a total
of 16 presentations were given, with extensive discussions and feedback,
which led to constant improvement of NIF from 2010 until 2013. After the
release of NIF (version 1.0) in November 2011, a total of 32 vocabulary
employments and implementations for different NLP tools and converters
were reported (8 by the (co-)authors, including the Wiki-link corpus, 13
by people participating in our survey and 11 more of which we have
heard). Several roll-out meetings and tutorials were held (e.g. in
Leipzig and Prague in 2013) and are planned (e.g. at LREC 2014).
Part IV - The NLP Interchange Format in Use.
“Use Cases and Applications for NIF” and “Publication of Corpora using
NIF” describe 8 concrete instances where NIF has been successfully used.
One major contribution is the usage of NIF as the recommended RDF
mapping in the Internationalization Tag Set (ITS) 2.0 W3C standard,
together with the conversion algorithms from ITS to NIF and back. One
outcome of the discussions in the standardization meetings and telephone
conferences for ITS 2.0 was the conclusion that there was no alternative
RDF format or vocabulary other than NIF with the required features to
fulfill the working group charter. Five further uses of NIF are
described for the Ontology of Linguistic Annotations (OLiA), the RDFaCE
tool, the Tiger Corpus Navigator, the OntosFeeder and visualisations of
NIF using the RelFinder tool. These 8 instances provide an implemented
proof-of-concept of the features of NIF. This part also describes the
conversion and hosting of the huge Google Wikilinks corpus, with 40
million annotations for 3 million web sites; the resulting RDF dump
contains 477 million triples in a 5.6 GB compressed dump file in Turtle
syntax. Finally, it describes how NIF can be used to publish extracted
facts from news feeds in the RDFLiveNews tool as Linked Data.
Part V - Conclusions.
This part provides lessons learned for NIF, conclusions and an outlook
on future work. Most of the contributions are already summarized above.
One particular aspect worth mentioning is the increasing number of
NIF-formatted corpora for Named Entity Recognition (NER) that have come
into existence after the publication of the main NIF paper, Integrating
NLP using Linked Data, at ISWC 2013. These include the corpora converted
by Steinmetz, Knuth and Sack for the NLP & DBpedia workshop and an
OpenNLP-based CoNLL converter by Brümmer. Furthermore, we are aware of
three LREC 2014 submissions that leverage NIF: NIF4OGGD - NLP
Interchange Format for Open German Governmental Data, N^3 – A Collection
of Datasets for Named Entity Recognition and Disambiguation in the NLP
Interchange Format, and Global Intelligent Content: Active Curation of
Language Resources using Linked Data, as well as an early implementation
of a GATE-based NER/NEL evaluation framework by Dojchinovski and Kliegr.
Further funding for the maintenance, interlinking and publication of
Linguistic Linked Data, as well as support and improvements of NIF, is
available via the expiring LOD2 EU project and the CSA EU project LIDER,
which started in November 2013. Based on the evidence of successful
adoption presented in this thesis, we can expect a decent to high chance
of the Linked Data technology stack, and the NIF standard in particular,
reaching critical mass in the field of Natural Language Processing and
Language Resources.
|
419 |
Typologie contrastive des pronoms personnels en hongrois et en mordve erzya / Contrastive typology of personal pronouns in Hungarian and Erzya Mordvin / Hevér-Joly, Krisztina. 29 January 2015 (has links)
La présente thèse constitue une étude typologique contrastive des allomorphies pronominales dans deux langues finno-ougriennes : en hongrois et en mordve erzya. On entend ici par typologie contrastive une approche typologique fondée sur la mise en contraste des structures de deux ou plusieurs langues, y compris des langues de la même famille linguistique, afin d’explorer des propriétés à la fois spécifiques et universelles. De ce point de vue, le hongrois et le mordve s’avèrent particulièrement pertinents en termes de structuration des systèmes de marques pronominales, en raison de propriétés morphologiques caractéristiques de l’ouralien central et oriental, tels que l’existence d’une double conjugaison (subjective et objective, voire « objective définie », en mordve), qui induit des séries allomorphiques complexes, tout en suivant des principes réducteurs universels (syncrétisme, sous-spécification et surspécification de certaines marques ou conditions de marquage morphonologique). Cette thèse comprend neuf chapitres, distribués sur trois volets. Le premier volet décrit les structures et les étapes de la modélisation des systèmes pronominaux dans les deux langues. Dans le premier chapitre, nous présentons des généralités historiques et structurales du hongrois et du mordve erzya, ainsi que la place que ces langues occupent parmi les langues finno-ougriennes, du point de vue de la classification et de la typologie. Une série de particularités importantes pour la compréhension des deux systèmes, en termes d’organisation structurale, concerne les propriétés allomorphiques des unités fonctionnelles et relationnelles de type pronominal, telles que l’harmonie vocalique, les suffixes casuels, le système verbal, et l’ordre des mots. 
Le deuxième chapitre concerne le lien entre les pronoms personnels et des catégories grammaticales fondamentales telles qu’animacité, nombre, personne, définitude, et aboutit à la conclusion que c’est le pronom personnel qui est particulièrement marqué par ces catégories grammaticales – les mêmes qui peuvent avoir, dans les langues du monde, une incidence sur la construction ou l’organisation des systèmes de classes flexionnelles. Le troisième chapitre présente une approche historiographique du hongrois et du mordve erzya; le quatrième chapitre propose une réanalyse de la flexion pronominale erzya, en suivant les mêmes principes que ceux jadis préconisés par András Kornai dans son analyse du système de la flexion nominale du hongrois (Kornai 1994), dans la mesure où ce modèle morphologique traite l’affixation comme une opération sur des traits combinés. Le deuxième volet de cette recherche développe des études de cas exploratoires dans une perspective de TAL : un corpus d’erzya littéraire et un corpus d’erzya biblique sont analysés contrastivement en suivant les démarches et le paramétrage requis par le logiciel Trameur. Le troisième volet sort de l’analyse des registres stylistiques au sein d’une langue donnée pour revenir à une typologie contrastive structurale hongrois-mordve. Dans le dernier chapitre, nous proposons une synthèse de ces deux aspects de la typologie contrastive : contrastes de registres intralangue, contraste de structures interlangues, en fonction d’un ensemble de paramètres partagés. La synergie entre la méthode lexicométrique et la typologie générale constitue l’un des principaux apports heuristiques de cette thèse, dont le but est de développer une typologie des langues finno-ougriennes qui tienne davantage compte de la contrastivité des structures et de leur relativisme que des grands traits catégoriels interlangues, davantage sujets aux biais empiriques et méthodologiques que peuvent recéler les grands corpus. 
/ This dissertation provides a contrastive typological study of pronoun allomorphy in two Finno-Ugric languages: Hungarian and Erzya Mordvin. Contrastive typology is a typological approach that contrasts the structures of two or more languages, including languages from the same family, to explore both specific and universal properties. From this standpoint, Hungarian and Mordvin are particularly relevant with respect to the structure of pronoun markers, owing to morphological characteristics of the central and eastern languages of the Uralic family, such as double conjugation paradigms (subjective and objective, and even an "objective definite" paradigm in Mordvin). This results in complex allomorphic patterns that nevertheless follow universal principles (syncretism, under-specification and over-specification of certain markers, and the conditions of morphonological exponence). The first part describes the structures and modelling stages of the pronominal system in both languages. In the first chapter, we present historical and structural generalities about Hungarian and Erzya Mordvin and the place they occupy within the Finno-Ugric group, from the point of view of classification and typology. A series of features important for understanding the two systems in terms of structural organization concerns the allomorphic properties of functional and relational units of the pronominal type, such as vowel harmony, case suffixes, the verbal system, and word order. The second chapter deals with the relationship between personal pronouns and basic grammatical categories such as animacy, number, person and definiteness, and concludes that the personal pronoun is the element most marked by these categories, the same categories that may shape, in the languages of the world, the construction and organisation of inflectional classes.
The third chapter takes a historiographical approach to Hungarian and Erzya, outlining research on the evolutionary periods of both systems. The fourth chapter provides a reanalysis of pronominal inflection in Erzya, following the same principles as those applied in András Kornai's analysis of the nominal inflection system of Hungarian (Kornai, 1994), in that this morphological model treats affixation as an operation on combined features. The second part of this research develops exploratory case studies from the perspective of NLP (French: TAL): a literary corpus and a biblical corpus of Erzya are analysed contrastively following the steps and settings required by the Trameur software. The third part moves from the analysis of stylistic registers within a given language back to a Hungarian-Mordvin contrastive structural typology. In the last chapter, we propose a synthesis of these two aspects of contrastive typology, intra-language register contrasts and inter-language structural contrasts, based on a set of shared parameters. The synergy between the lexicometric method and general typology is one of the main heuristic contributions of this thesis, whose aim is to develop a typology of the Finno-Ugric languages that takes greater account of the contrastivity of structures and their relativism than of broad inter-language categorical traits, which are more subject to the empirical and methodological biases that large corpora may conceal.
|
420 |
Metody sumarizace dokumentů na webu / Methods of Document Summarization on the Web / Belica, Michal. January 2013 (has links)
The work deals with automatic summarization of documents in HTML format. Czech has been chosen as the language of the web documents. The project focuses on text summarization algorithms. The work also covers document preprocessing for summarization and the conversion of text into a representation suitable for summarization algorithms. General text mining is briefly discussed as well, but the project is mainly focused on automatic document summarization. Two simple summarization algorithms are introduced; the main attention is then paid to an advanced algorithm that uses latent semantic analysis. The result of the work is the design and implementation of a summarization module for the Python language. The final part of the work contains an evaluation of the summaries generated by the implemented summarization methods and a subjective comparison by the author.
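As a rough sketch of the latent-semantic-analysis approach the abstract mentions (this is not the thesis's module; the tokenization, term weighting and scoring are deliberately simplified assumptions), an extractive summarizer can rank sentences by their weight in the top singular vectors of a term-sentence matrix:

```python
# Toy LSA-based extractive summarization: build a term-sentence matrix,
# take its SVD, and score each sentence by its singular-value-weighted
# length in the latent topic space. Illustrative simplification only.
import numpy as np

def lsa_summarize(sentences, num_sentences=1, rank=2):
    # Vocabulary over whitespace-tokenized, lowercased words.
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}

    # Term-sentence matrix A: A[i, j] = count of term i in sentence j.
    A = np.zeros((len(vocab), len(sentences)))
    for j, s in enumerate(sentences):
        for w in s.lower().split():
            A[index[w], j] += 1

    # SVD: the columns of Vt describe sentences in the latent topic space.
    _, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(rank, len(sigma))

    # Score each sentence by its weighted norm over the top k topics.
    scores = np.sqrt(((sigma[:k, None] * Vt[:k]) ** 2).sum(axis=0))

    # Return the top-scoring sentences in original document order.
    top = sorted(np.argsort(scores)[::-1][:num_sentences])
    return [sentences[j] for j in top]
```

Scoring by the weighted norm over the leading topics favors sentences that carry the document's dominant themes, which is the intuition behind LSA summarizers.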
|