
THE VOCABULARY OF EXTENSIVE READING: A CORPUS ANALYSIS OF GRADED READERS

The importance of input for language learning cannot be overstated. One method of providing input to learners at an appropriate level is extensive reading, in which learners read an abundance of texts. In practice, for learners of English as a second or foreign language, these texts are often graded readers: books that have been written for and classified into a particular difficulty level. Previous studies of the language in these texts have been limited in size and scope, often including books from a single publisher or series. However, if these books are meant to serve as the primary source of input for students in extensive reading programs, it is important not only to better understand the language in them, but also to understand how books from different series and publishers compare with one another. Therefore, in this study I investigated the single- and multiword expressions present in graded readers for three purposes. First, I wished to better understand the difficulty of the texts by analyzing the vocabulary within them and learning how much vocabulary knowledge is required to reach the 95% and 98% lexical coverage thresholds. Second, I wished to investigate the multiword expressions (MWEs) present in graded readers to better understand which MWEs students are exposed to when reading these books. Third, I investigated how the use of MWEs differs between graded readers at each level of text difficulty, as defined by the reading levels of the Extensive Reading Foundation (ERF).
To address these aims, I utilized a large corpus of 1,872 graded readers containing 16,448,662 tokens. Using this corpus, I calculated the coverage figures for all texts within each level to determine the vocabulary required to reach the 95% and 98% levels of coverage. These coverage figures were calculated using two kinds of lists, frequency-based and difficulty-based, each meant to represent learner word knowledge. The frequency-based lists were the New General Service List (New GSL; Brezina & Gablasova, 2015), another list by the same name, which I refer to as the NGSL (Browne, 2014), and Nation's (2020) BNC/COCA list, based on the British National Corpus (BNC) and the Corpus of Contemporary American English (COCA). The difficulty-based list was the Scale of English Word Knowledge–Japanese (SEWK-J), a word list designed to estimate vocabulary difficulty for Japanese learners of English (Mizumoto et al., 2021; Pinchbeck, 2019).
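As a rough illustration of the coverage calculation described above, the following Python sketch finds the lowest rank in a ranked word list at which a text reaches a given coverage threshold. It is not the procedure used in the study: the function name, the whitespace tokenization, and the toy word list are all assumptions made for the example, and a real analysis would also need to handle lemmas, flemmas or word families, proper nouns, and off-list words according to the list being used.

```python
from collections import Counter


def min_rank_for_coverage(tokens, ranked_list, threshold):
    """Return the smallest rank r such that the words ranked 1..r in
    `ranked_list` account for at least `threshold` (e.g. 0.95) of `tokens`.
    Returns None if the list never reaches the threshold for this text.
    Hypothetical helper, written only for illustration."""
    if not tokens:
        return None
    # Map each listed word to its 1-based rank.
    rank_of = {word: rank for rank, word in enumerate(ranked_list, start=1)}
    # Count how many tokens of the text fall at each rank.
    tokens_at_rank = Counter()
    for token in tokens:
        rank = rank_of.get(token)
        if rank is not None:
            tokens_at_rank[rank] += 1
    # Accumulate coverage from the most frequent rank downwards.
    covered = 0
    total = len(tokens)
    for rank in sorted(tokens_at_rank):
        covered += tokens_at_rank[rank]
        if covered / total >= threshold:
            return rank
    return None


# Toy data (assumed): an 8-token "text" and a 5-word ranked list.
text = "the cat sat on the mat the cat".split()
word_list = ["the", "cat", "on", "sat", "mat"]

print(min_rank_for_coverage(text, word_list, 0.75))
print(min_rank_for_coverage(text, word_list, 0.95))
```

Running such a function over every book at an ERF level and then summarizing the resulting ranks (for example, by median or 25th percentile) would give per-level figures of the kind reported in the abstract.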
The results of the single-word analyses showed that graded readers start to become accessible at the minimum 95% threshold of known vocabulary at around rank 1,700 in the lemma-based New GSL, rank 1,250 in the flemma-based NGSL, and the first 1,000-word level of the level-6 word family-based BNC/COCA lists (based on the 25th percentiles for ERF level 1 using those lists). Studying beyond those ranks and levels should give students access to a wide range of graded readers, at both the 95% and 98% coverage thresholds, unless using the New GSL, which was much more limited in its ability to provide coverage. The median rank needed for sufficient coverage rises with each ERF level, no matter which list is used. There is also considerable overlap between levels, allowing learners to move between levels easily, as far as lexical requirements are concerned. These findings indicate that ERF levels incrementally guide learners towards more and more authentic language and texts. Similarly, the SEWK-J provides coverage of the majority of books, making it suitable for comparing a wide range of books together under the same framework. Differences between ERF levels in the SEWK-J ranks required to reach 95% and 98% were more less noticeable than those for the pedagogically focused frequency-based lists.
Next, I investigated the degree to which publisher-declared headword counts are representative of the number of headwords in each graded reader. Using the headword ranges provided by publishers tends to overestimate the number of word types needed for 95% coverage, except at the lowest ERF level. If 98% coverage is expected, then a general trend towards underestimation was found at the lowest ERF levels.
Following up on these single-word analyses, I then investigated the MWEs within the graded reader corpus to produce a list of the most frequent MWEs, which I compared with those in a large comparison corpus, the COCA. These results indicated that graded readers are a good source of 2-, 3-, 4-, and 5-grams, with more occurring in graded readers than in the COCA.
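The n-gram comparison above can be pictured with a minimal Python sketch that counts contiguous n-grams and normalizes a count to a per-million-token rate so that corpora of different sizes can be compared. The function names, the whitespace tokenization, and the toy sentence are assumptions for illustration only; the abstract does not specify how n-grams were extracted or normalized in the study.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram (as a tuple of tokens) in `tokens`.
    Hypothetical helper, written only for illustration."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def per_million(count, corpus_size):
    """Convert a raw count to a rate per million tokens."""
    return count * 1_000_000 / corpus_size


# Toy data (assumed): an 8-token "corpus".
corpus = "once upon a time once upon a hill".split()

bigrams = ngram_counts(corpus, 2)
trigrams = ngram_counts(corpus, 3)

print(bigrams.most_common(2))
print(trigrams[("once", "upon", "a")])
print(per_million(bigrams[("once", "upon")], len(corpus)))
```

Normalized rates of this kind are one common way to compare how often the same n-gram appears in two corpora of very different sizes, such as a graded reader corpus and the COCA.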
Next, I examined the degree to which the most useful MWEs were included, defined as MWEs in the Phrasal Expressions List (PHRASE; Martinez & Schmitt, 2012) and the Phrasal Verbs Pedagogical List (PHaVE; Garnier & Schmitt, 2015). Graded readers tended to include the most pedagogically important MWEs and phrasal verbs at all ERF levels. Those PHRASE and PHaVE list items that were most common in the large reference corpora used in their creation were also found to be most common in the graded reader corpus (GRC), suggesting that graded readers are a good source of comprehensible input using these forms.
Finally, using studies of L2 speaking and writing at different levels of proficiency as a guide (Siyanova-Chanturia & Spina, 2020; Tavakoli & Uchihara, 2020), I conducted an exploratory investigation into whether MWE usage in graded readers follows similar trajectories as graded reader difficulty levels increase. It was found that 2-grams that are infrequent and strongly associated in unsimplified text tend to become more common as ERF levels increase.

Applied Linguistics

Identifier: oai:union.ndltd.org:TEMPLE/oai:scholarshare.temple.edu:20.500.12613/8955
Date: 08 1900
Creators: Kramer, Brandon, 0000-0003-3910-0810
Contributors: Pinchbeck, Geoffrey G., 1967-, Beglar, David, Vitta, Joseph P., Nakata, Tatsuya
Publisher: Temple University. Libraries
Source Sets: Temple University
Language: English
Detected Language: English
Type: Thesis/Dissertation, Text
Format: 483 pages
Rights: IN COPYRIGHT - This Rights Statement can be used for an Item that is in copyright. Using this statement implies that the organization making this Item available has determined that the Item is in copyright and either is the rights-holder, has obtained permission from the rights-holder(s) to make their Work(s) available, or makes the Item available under an exception or limitation to copyright (including Fair Use) that entitles it to make the Item available. http://rightsstatements.org/vocab/InC/1.0/
Relation: http://dx.doi.org/10.34944/dspace/8919, Theses and Dissertations
