
Probabilistic tree transducers for grammatical error correction

Thesis (MSc)--Stellenbosch University, 2013. / ENGLISH ABSTRACT: We investigate the application of weighted tree transducers to correcting grammatical
errors in natural language. Weighted finite-state transducers (FSTs) have been
used successfully in a wide range of natural language processing (NLP) tasks, even
though the expressiveness of the linguistic transformations they perform is limited.
Recently, there has been an increase in the use of weighted tree transducers and
related formalisms that can express syntax-based natural language transformations
in a probabilistic setting.
The NLP task that we investigate is the automatic correction of grammar errors
made by English language learners. In contrast to spelling correction, which can
be performed with very high accuracy, the performance of grammar correction
systems is still low for most error types. Commercial grammar correction systems
mostly use rule-based methods. The most common approach in recent grammatical
error correction research is to use statistical classifiers that make local decisions about
the occurrence of specific error types. The approach that we investigate is related to
a number of other approaches inspired by statistical machine translation (SMT) or
based on language modelling. Corpora of language learner writing annotated with
error corrections are used as training data.
Our baseline model is a noisy-channel FST model consisting of an n-gram language
model and an FST error model, which performs word insertion, deletion and
replacement operations. The tree transducer model we use to perform error correction
is a weighted top-down tree-to-string transducer, formulated to perform transformations
between parse trees of correct sentences and incorrect sentences. Using
an algorithm developed for syntax-based SMT, transducer rules are extracted from
training data in which the correct versions of the sentences have been parsed. Rule weights
are also estimated from the training data. Hypothesis sentences generated by the
tree transducer are reranked using an n-gram language model.
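
The noisy-channel idea behind the baseline can be illustrated with a minimal sketch. It is not the thesis implementation: it replaces the weighted FSTs with plain Python, uses a toy bigram language model with add-k smoothing, and assumes a uniform penalty for each word insertion, deletion or replacement. The corpus, the function names (`train_bigram`, `lm_logprob`, `candidates`, `correct`) and the values of `k` and `edit_prob` are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus of correct sentences (illustration only).
CORPUS = ["she goes to school", "he goes to work", "they go to school"]


def train_bigram(sentences):
    """Count bigrams and their left contexts over the corpus."""
    contexts, bigrams, vocab = Counter(), Counter(), {"</s>"}
    for sentence in sentences:
        toks = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(toks[:-1])
        bigrams.update(zip(toks, toks[1:]))
        vocab.update(sentence.split())
    return contexts, bigrams, vocab


def lm_logprob(words, model, k=0.01):
    """Add-k smoothed bigram log-probability of a list of words."""
    contexts, bigrams, vocab = model
    toks = ["<s>"] + words + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + k) / (contexts[a] + k * len(vocab)))
        for a, b in zip(toks, toks[1:])
    )


def candidates(words, vocab):
    """Yield (candidate, edit_count) pairs reachable with at most one word edit."""
    options = sorted(vocab - {"</s>"})
    yield words, 0
    for i in range(len(words)):
        yield words[:i] + words[i + 1:], 1              # deletion
        for w in options:
            if w != words[i]:
                yield words[:i] + [w] + words[i + 1:], 1  # replacement
    for i in range(len(words) + 1):
        for w in options:
            yield words[:i] + [w] + words[i:], 1          # insertion


def correct(observed, model, edit_prob=0.05):
    """Pick the candidate maximising log P_lm(candidate) + log P_err(observed | candidate)."""
    words = observed.split()
    best, _ = max(
        candidates(words, model[2]),
        key=lambda c: lm_logprob(c[0], model) + c[1] * math.log(edit_prob),
    )
    return " ".join(best)


model = train_bigram(CORPUS)
print(correct("she go to school", model))  # she goes to school
print(correct("he goes to work", model))   # he goes to work
```

In this sketch the language model prefers fluent word sequences while the edit penalty discourages unnecessary changes, so an already fluent sentence is left untouched. A real system would learn the edit probabilities from the annotated learner corpora rather than fix them by hand.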
We perform experiments to evaluate the performance of different configurations
of the proposed models. Our implementation uses an existing tree transducer toolkit.
To make decoding time feasible, sentences are split into clauses and heuristic
pruning is performed during decoding. We consider different modelling choices in the
construction of transducer rules. The evaluation of our models is based on precision
and recall. Experiments are performed to correct various error types on two learner
corpora. The results show that our system is competitive with existing approaches
on several error types. / AFRIKAANSE OPSOMMING (English translation): We investigate the application of weighted tree
transducers to automatically correcting grammatical errors in natural language. Weighted
finite-state transducers are used successfully in a wide range of tasks in natural language
processing, although the expressiveness of the linguistic transformations they perform is
limited. Recently there has been an increase in the use of weighted tree transducers
and related formalisms that represent syntactic transformations in natural language in a
probabilistic framework.
The natural language processing application that we investigate is the automatic
correction of language errors made by learners of English. While spell checking
in English can be done with very high accuracy, the performance of
grammar correction systems is still relatively weak for most error types. Commercial grammar
correction systems predominantly use rule-based methods. The most common
approach in recent research on grammatical error correction is to use statistical
classifiers that make local decisions about the occurrence of specific error types.
The approach that we investigate is related to a number of other approaches
that are inspired by statistical machine translation or based on language
modelling. Corpora of language learner writing annotated with error corrections
are used as training data.
Our baseline system is a noisy-channel finite-state transducer model consisting
of an n-gram language model and an error model that performs insertion, deletion and
replacement operations at the word level. The tree transducer model that we use
for grammatical error correction is a weighted top-down tree-to-string transducer
formulated to perform transformations between parse trees of correct sentences
and incorrect sentences. An algorithm developed for syntax-based
statistical machine translation is used to extract rules from the training data, in which
the correct versions of the sentences have been parsed. Rule weights
are also estimated from the training data. Hypothesis sentences generated by the tree
transducer are reranked using an n-gram language model.
We perform experiments to evaluate the effectiveness of different configurations
of the proposed models. Our implementation uses an existing
tree transducer software package. To reduce decoding time,
sentences are split into phrases and the search space is pruned heuristically. We consider various
modelling choices in the construction of transducer rules. The evaluation of our
models is based on precision and recall. Experiments are performed
to correct various error types on two learner corpora. The results show that our
model is competitive with existing approaches on several error types.
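
The abstract states that evaluation is based on precision and recall but does not specify how edits are matched. The following is a minimal sketch under the assumption that each proposed or gold-standard correction is represented as a hypothetical `(sentence_id, token_position, replacement)` tuple and that only exact matches count as correct. The function name `precision_recall` and the convention of returning 1.0 for an empty set are also assumptions.

```python
def precision_recall(proposed, gold):
    """Compare proposed corrections with gold annotations by exact match.

    Both arguments are iterables of hashable edit descriptions, here
    assumed to be (sentence_id, token_position, replacement) tuples.
    """
    proposed, gold = set(proposed), set(gold)
    true_positives = len(proposed & gold)
    # Assumed convention: an empty set gives a score of 1.0 rather than an error.
    precision = true_positives / len(proposed) if proposed else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall


# Hypothetical edits, for illustration only.
system_edits = [(0, 1, "goes"), (1, 3, "the"), (2, 0, "An")]
gold_edits = [(0, 1, "goes"), (1, 3, "a"), (3, 2, "in"), (4, 1, "has")]

precision, recall = precision_recall(system_edits, gold_edits)
print(precision, recall)  # one of three proposals is correct; one of four gold edits is found
```

Precision here measures how many of the system's proposed corrections are right, and recall measures how many of the annotated errors the system corrects.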

Identifier: oai:union.ndltd.org:netd.ac.za/oai:union.ndltd.org:sun/oai:scholar.sun.ac.za:10019.1/85592
Date: 12 1900
Creators: Buys, Jan Moolman
Contributors: Van der Merwe, A. B., Stellenbosch University. Faculty of Science. Dept. of Mathematical Sciences.
Publisher: Stellenbosch : Stellenbosch University
Source Sets: South African National ETD Portal
Language: en_ZA
Detected Language: Unknown
Type: Thesis
Format: 108 p.
Rights: Stellenbosch University
