11 |
Rozpoznávání emocí v česky psaných textech / Recognition of emotions in Czech texts
Červenec, Radek, January 2011
With advances in information and communication technologies over the past few years, the amount of information stored as electronic text documents has been growing rapidly. Since the human ability to process and analyze large amounts of information is limited, there is an increasing demand for tools that can analyze these documents automatically and make use of their emotional content. Such systems have extensive applications. The purpose of this work is to design and implement a system for identifying expressions of emotion in Czech texts. The proposed system is based mainly on machine learning methods, so the design and creation of a training set is described as well. The training set is then used to train an SVM classifier. To improve classification results, additional components were integrated into the system, such as a lexical database, a lemmatizer, and a derived keyword dictionary. The thesis also presents the results of classifying text documents into the defined emotion classes and evaluates various approaches to categorization.
|
12 |
A Comparative Analysis of Text Usage and Composition in Goscinny's Le petit Nicolas, Goscinny's Astérix, and Albert Uderzo's Astérix
Meyer, Dennis Scott, 05 March 2012
The goal of this thesis is to analyze the textual composition of René Goscinny’s Astérix and Le petit Nicolas, demonstrating how they differ and why. Taking a statistical look at the comparative qualities of each series of works, the structural differences and similarities in language use in these two series and their respective media are highlighted and compared. Though one might expect more complicated language use in traditional text by virtue of its format, analysis of average word length, average sentence length, lexical diversity, the prevalence of specific forms (the passé composé, possessive pronouns, etc.), and preferred collocations (ils sont fous, ces romains !) shows interesting results. Though Le petit Nicolas has longer sentences and more relative pronouns (and hence more clauses per sentence on average), Astérix has longer words and more lexical diversity. A similar comparison of the albums of Astérix written by Goscinny to those of Uderzo, paying additional attention to the structural elements of each album (usage of narration and sound effects, for example) shows that Goscinny's love of reusing phrases is far greater than Uderzo's, and that the two have very different ideas of timing as expressed in narration boxes.
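The measures named in this abstract (average word length, average sentence length, lexical diversity) can be illustrated with a minimal sketch. It is not the thesis's tooling: the function name `text_stats`, the regex-based sentence and word splitting, and the use of the type-token ratio as the lexical diversity measure are all assumptions made here for illustration.

```python
import re

def text_stats(text):
    """Return average word length, average sentence length (in words),
    and type-token ratio for a piece of text.

    Assumptions: sentences end at '.', '!' or '?', and a word is any run
    of letters. An elided form such as "l'ami" is therefore counted as
    two words, which a real French tokenizer would handle more carefully.
    """
    # Crude sentence split on terminal punctuation; drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Unicode-aware runs of letters (no digits, no underscore), lowercased.
    words = [w.lower() for w in re.findall(r"[^\W\d_]+", text)]
    if not words or not sentences:
        return {"avg_word_len": 0.0, "avg_sentence_len": 0.0,
                "type_token_ratio": 0.0}
    return {
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "avg_sentence_len": len(words) / len(sentences),
        # Distinct word forms divided by total words.
        "type_token_ratio": len(set(words)) / len(words),
    }

if __name__ == "__main__":
    print(text_stats("Ils sont fous, ces Romains ! Ils sont fous."))
```

The type-token ratio falls as texts get longer, so comparing two corpora with it is only meaningful on samples of similar size.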
|
13 |
Metody sumarizace textových dokumentů / Methods of Text Document Summarization
Pokorný, Lubomír, January 2012
This thesis deals with single-document summarization of text data. Part of it is devoted to data preparation, mainly normalization: several stemming algorithms are listed and lemmatization is described. The main part is devoted to Luhn's summarization method and an extension of it that uses the WordNet dictionary. The Oswald summarization method is described and applied as well. The application that was designed and implemented generates abstracts automatically using these methods. A set of experiments was developed to verify the correct functionality of the application and of the extension of Luhn's method.
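As a rough illustration of the kind of scoring Luhn's method performs, the sketch below ranks sentences by clusters of frequent content words. It is a simplified version under stated assumptions, not the thesis's implementation: the stop list, the `min_freq` threshold, the gap limit of 4, and all function names are hypothetical, and neither the WordNet extension nor any stemming or lemmatization step is included.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption; real systems use a fuller one).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it",
             "on", "are", "was"}

def tokenize(sentence):
    """Lowercased runs of letters."""
    return [w.lower() for w in re.findall(r"[^\W\d_]+", sentence)]

def luhn_score(tokens, significant, max_gap=4):
    """Score a sentence by its best cluster of significant words:
    (significant words in cluster)^2 / (cluster length in tokens).
    Two significant words share a cluster when at most max_gap
    other tokens lie between them."""
    positions = [i for i, t in enumerate(tokens) if t in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            count += 1
        else:
            # Close the current cluster and start a new one.
            best = max(best, count * count / (prev - start + 1))
            start, count = pos, 1
        prev = pos
    return max(best, count * count / (prev - start + 1))

def summarize(text, n_sentences=1, min_freq=2):
    """Return the n highest-scoring sentences in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text)
                 if s.strip()]
    tokenized = [tokenize(s) for s in sentences]
    freq = Counter(t for toks in tokenized for t in toks
                   if t not in STOPWORDS)
    # Significant words: non-stopwords occurring at least min_freq times.
    significant = {w for w, c in freq.items() if c >= min_freq}
    ranked = sorted(range(len(sentences)),
                    key=lambda i: luhn_score(tokenized[i], significant),
                    reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]

if __name__ == "__main__":
    doc = ("Cats chase mice. The weather was fine. "
           "Cats and mice are old enemies of cats.")
    print(summarize(doc, n_sentences=1))
```

Running stemming or lemmatization before counting frequencies would let inflected forms of the same word count together, which matters far more for Czech than for English.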
|
14 |
Překlad z češtiny do angličtiny / Czech-English Translation
Petrželka, Jiří, January 2010
This master's thesis describes the principles of statistical machine translation and demonstrates how to build the Moses statistical machine translation system. In the preparatory phase, freely available bilingual Czech-English corpora are surveyed. An empirical analysis of the running time of multithreaded word-alignment tools shows that MGIZA++ can achieve up to a fivefold speedup and PGIZA++ up to an eightfold speedup (compared with GIZA++). Three methods of morphological pre-processing of the Czech training data are tested using simple non-factored models. While simple lemmatization can lower BLEU, more sophisticated approaches mostly raise it. The positive effects of morphological pre-processing fade as the corpus grows. The relationship between other corpus characteristics (size, genre, additional data) and the resulting BLEU is measured empirically. The final system is trained on the CzEng 0.9 corpus and evaluated on the test set from the WMT 2010 workshop.
|