1 |
High-Performance Knowledge-Based Entity Extraction
Middleton, Anthony M. 01 January 2009
Human language records most of the information and knowledge produced by organizations and individuals. The machine-based process of analyzing information in natural language form is called natural language processing (NLP). Information extraction (IE) is the process of analyzing machine-readable text and identifying and collecting information about specified types of entities, events, and relationships.
Named entity extraction is an area of IE concerned specifically with recognizing and classifying proper names for persons, organizations, and locations in natural language. Extant approaches to the design and implementation of named entity extraction systems include: (a) knowledge-engineering approaches, which rely on domain experts to hand-craft NLP rules that recognize and classify named entities; (b) supervised machine-learning approaches, in which a previously tagged corpus of named entities is used to train algorithms that incorporate statistical and probabilistic methods for NLP; and (c) hybrid approaches, which incorporate aspects of both (a) and (b).
Performance of IE systems is evaluated using the metrics of precision and recall, which measure the accuracy and completeness of the IE task, respectively. Previous research has shown that utilizing a large knowledge base of known entities has the potential to improve overall entity extraction precision and recall. Although existing methods typically incorporate dictionary-based features, these dictionaries have been limited in size and scope.
The problem addressed by this research was the design, implementation, and evaluation of a new high-performance knowledge-based hybrid processing approach and associated algorithms for named entity extraction, combining rule-based natural language parsing and memory-based machine learning classification facilitated by an extensive knowledge base of existing named entities. The hybrid approach implemented by this research resulted in improved precision and recall performance approaching human-level capability compared to existing methods measured using a standard test corpus. The system design incorporated a parallel processing system architecture with capabilities for managing a large knowledge base and providing high throughput potential for processing large collections of natural language text documents.
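The precision and recall metrics mentioned in the abstract above can be illustrated with a minimal sketch. It assumes that an extracted entity counts as correct only when both its text and its type match a gold-standard annotation; the function name `precision_recall` and the sample entities are hypothetical and are not taken from the thesis or its test corpus.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of (text, type) entities."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of extracted entities that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold-standard entities that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up illustration data (not from any real corpus).
gold = {
    ("Ada Lovelace", "PERSON"),
    ("London", "LOCATION"),
    ("Acme Corp", "ORGANIZATION"),
}
predicted = {
    ("Ada Lovelace", "PERSON"),
    ("London", "ORGANIZATION"),   # right span, wrong type: counted as an error
    ("Acme Corp", "ORGANIZATION"),
    ("Paris", "LOCATION"),        # spurious entity
}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Stricter or looser matching rules (for example, partial span overlap) would change both figures, so the matching criterion should always be stated alongside the scores.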
|
2 |
Automatic lemmatisation for Afrikaans / by Hendrik J. Groenewald
Groenewald, Hendrik Johannes January 2006
A lemmatiser is an important component of various human language technology applications for any language. At present, a rule-based lemmatiser for Afrikaans already exists, but this lemmatiser produces disappointingly low accuracy figures. The performance of the current lemmatiser serves as motivation for developing another lemmatiser based on an alternative approach than language-specific rules. The alternative method of lemmatiser construction investigated in this study is memory-based learning.
Thus, in this research project we develop an automatic lemmatiser for Afrikaans called Lia, "Lemma-identifiseerder vir Afrikaans" 'Lemmatiser for Afrikaans'. In order to construct Lia, the following research objectives are set: i) to define the classes for Afrikaans lemmatisation, ii) to determine the influence of data size and various feature options on the performance of Lia, iii) to automatically determine the algorithm and parameter settings that deliver the best performance in terms of linguistic accuracy, execution time and memory usage.
In order to achieve the first objective, we investigate the processes of inflection and derivation in Afrikaans, since automatic lemmatisation requires a clear distinction between inflection and derivation. We proceed to define the inflectional categories for Afrikaans, which represent a number of affixes that should be removed from word-forms during lemmatisation. The classes for automatic lemmatisation in Afrikaans are derived from these affixes. It is subsequently shown that accuracy as well as memory usage and execution time increase as the amount of training data is increased and that the various feature options have a significant effect on the performance of Lia. The algorithmic parameters and data representation that deliver the best results are determined by the use of PSearch, a programme that implements Wrapped Progressive Sampling in order to determine a set of possibly optimal algorithmic parameters for each of the TiMBL classification algorithms.
Evaluation indicates that an accuracy figure of 92,8% is obtained when training Lia with the best performing parameters for the IB1 algorithm on feature-aligned data with 20 features. This result indicates that memory-based learning is indeed more suitable than rule-based methods for Afrikaans lemmatiser construction. / Thesis (M.Ing. (Computer and Electronical Engineering))--North-West University, Potchefstroom Campus, 2007.
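The idea of memory-based lemmatisation described in this abstract can be sketched as a nearest-neighbour lookup over stored examples. The sketch below is an illustration under stated assumptions, not the Lia system or the TiMBL API: it uses a plain overlap distance in the spirit of IB1, only 4 right-aligned character features instead of the 20 mentioned above, a class label that is simply the suffix to strip, and a tiny hand-made memory of Afrikaans word-forms. All function names are hypothetical.

```python
from collections import Counter


def features(word, n=4):
    """Right-align the last n characters, padding short words with '_'."""
    return tuple(word[-n:].rjust(n, "_"))


def overlap_distance(a, b):
    """Count the feature positions at which two instances differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(word, memory, n=4):
    """Return the majority class among the stored instances nearest to word."""
    target = features(word, n)
    scored = [(overlap_distance(target, features(w, n)), label) for w, label in memory]
    nearest = min(distance for distance, _ in scored)
    votes = Counter(label for distance, label in scored if distance == nearest)
    return votes.most_common(1)[0][0]


def lemmatise(word, memory):
    """Strip the predicted suffix; an empty class means 'leave unchanged'."""
    suffix = classify(word, memory)
    if suffix and word.endswith(suffix):
        return word[: len(word) - len(suffix)]
    return word


# Toy instance base (word-form, suffix to strip); an assumption for illustration.
MEMORY = [
    ("honde", "e"), ("boeke", "e"), ("stoele", "e"),
    ("tafels", "s"), ("appels", "s"), ("motors", "s"),
    ("hond", ""), ("boek", ""), ("tafel", ""),
]

print(lemmatise("perde", MEMORY))
```

A real system would store many thousands of instances and weight the features, which is where the parameter search mentioned in the abstract comes in.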
|
4 |
Pattern Acquisition Methods for Information Extraction Systems
Marcińczuk, Michał January 2007
This master's thesis deals with Event Recognition in the reports of Polish stockholders. Event Recognition is one of the Information Extraction tasks. The thesis compares two approaches to Event Recognition: manual and automatic. The manual approach uses regular expressions, which also serve as a baseline for the automatic approach. In the automatic approach, three Machine Learning methods were applied. The initial experiment compares Decision Trees, naive Bayes and Memory-Based Learning. A modification of the standard Memory-Based Learning method is presented, the goal of which is to create a classifier that uses only positive examples in the classification task. The performance of the modified Memory-Based Learning method is presented and compared to the baseline and to the other Machine Learning methods. The initial experiment uses one type of annotation, the meeting date. The final experiment is conducted using three types of annotations: the meeting time, the meeting date and the meeting place. The experiments show that the classification can be performed using only one class of instances with the same level of performance.
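A regular-expression baseline of the kind the abstract mentions might look like the following minimal sketch. The pattern is an assumption made for illustration (numeric dates written as `dd.mm.yyyy`); the thesis's actual expressions and the date formats in its reports are not given in the abstract, and `DATE_PATTERN` and `find_dates` are hypothetical names.

```python
import re

# Hypothetical pattern: day.month.year with a four-digit year, e.g. 15.06.2007.
DATE_PATTERN = re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b")


def find_dates(text):
    """Return every substring of text that looks like a dd.mm.yyyy date."""
    return [match.group(0) for match in DATE_PATTERN.finditer(text)]


# Made-up sentence, not taken from any real report.
print(find_dates("The general meeting is convened for 15.06.2007 at 10:00."))
```

A pattern like this finds date-shaped strings but cannot tell a meeting date from any other date in the report, which is the gap the learned classifiers are meant to close.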
|
5 |
Automatsko određivanje vrsta riječi u morfološki složenom jeziku / Automatic parts of speech determination in a morphologically complex language
Dimitrijević Strahinja 24 July 2015
The study tested the extent to which our cognitive system can rely on phonotactic information, i.e., possible/permissible combinations of phonemes/graphemes, in the tasks of automatic processing and production of words in languages with rich inflectional morphology.
To answer this question, three studies were conducted. In the first study, support vector machines (SVM) were used to discriminate between inflected parts of speech (PoS). In the second study, inflected word forms were produced with memory-based learning (MBL). Based on the results of the second study, a behavioral experiment was conducted as the third study to test the cognitive plausibility of the MBL model and of the information it used.
The discrimination of inflected PoS was performed using permissible sequences of two and three characters/sounds (i.e., bigrams and trigrams), whose frequency of occurrence within individual grammatical types was calculated depending on their position in a word: at the beginning, at the end, inside the word, and irrespective of position. The maximum accuracy, approximately 95%, was obtained with bigrams irrespective of position in a word and an SVM model using the RBF kernel function. Such high accuracy suggests that the distribution of bigrams is informative about the types of inflected words. The least informative were bigrams at the end and at the beginning of words.
The MBL model was used in the task of automatic production of inflected forms: for a given word, the required inflected form was generated from the phonotactic information in its last four syllables. On a sample of 89024 inflected words taken from the Frequency Dictionary of the Serbian Language (daily press), the model achieved an accuracy of about 92%, using the leave-one-out method and a constant nearest-neighbor set size of 7 (k = 7). Several factors that influenced this accuracy were identified, in particular part of speech, grammatical type, formation method and number of examples within one grammatical type, number of exceptions, and number of phonological alternations.
The visual lexical decision experiment revealed that words that the MBL model produced incorrectly also had longer reaction times. We therefore concluded that the MBL model might be cognitively plausible. In addition, we reconfirmed the informativeness of phonotactic information, this time in a human comprehension task.
Overall, the findings of the three studies support a significant role for phonotactic information in both the processing and the production of morphologically complex words. The results also suggest that this information should be taken into account when discussing the emergence of larger linguistic units and patterns.
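The position-dependent bigram counts that the abstract describes as input to the SVM can be sketched as follows. This is a minimal illustration under assumptions: words are treated as plain character strings, the positions are reduced to "initial", "final" and "all", and the function names are hypothetical. It does not reproduce the study's feature set or its dictionary data.

```python
from collections import Counter


def positional_bigrams(word):
    """Split a word into character bigrams, grouped by position in the word."""
    bigrams = [word[i:i + 2] for i in range(len(word) - 1)]
    return {
        "initial": bigrams[:1],   # first bigram, if the word has one
        "final": bigrams[-1:],    # last bigram, if the word has one
        "all": bigrams,           # every bigram, irrespective of position
    }


def bigram_counts(words, position="all"):
    """Count how often each bigram occurs at the given position over words."""
    counts = Counter()
    for word in words:
        counts.update(positional_bigrams(word.lower())[position])
    return counts


# Two forms of the same Serbian noun, used only as an illustration.
forms = ["kuća", "kuće"]
print(bigram_counts(forms, "all"))
print(bigram_counts(forms, "final"))
```

Counts like these, computed separately for each grammatical type, would then be turned into feature vectors for a classifier; that step is omitted here.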
|
6 |
Data-driven syntactic analysis
Megyesi, Beata January 2002
No description available.
|