Text classification in the BNC using corpus and statistical methods

The main part of this thesis sets out to develop a system of categories within a text typology. Although there exist many different approaches to the classification of text into categories, this research fills a gap in the literature, as most work on text classification is based on features external to the text such as the text's purpose, the aim of discourse, and the medium of communication. Text categories that have been set up based on some external features are not linguistically defined. In consequence, texts that belong to the same type are not necessarily similar in their linguistic forms. Even Biber's (1988) linguistically-oriented work was based on externally defined registers. Further, establishing text categories based on text-external features favours theoretical and qualitative approaches to text classification. These approaches can be seen as top-down approaches where external features are defined functionally in advance, and subsequently patterns of linguistic features are described in relation to each function. In such a case, the process of linking texts with a particular type is not done in a systematic way. In this thesis, I show how a text typology based on similarities in linguistic form can be developed systematically using a multivariate statistical technique; namely, cluster analysis. Following a review of various possible approaches to multivariate statistical analysis, I argue that cluster analysis is the most appropriate for systematising the study of text classification, because it has the distinctive feature of placing objects into distinct groupings based on their overall similarities across multiple variables. Cluster analysis identifies these groupings algorithmically. The objects to be clustered in my thesis are the written texts in the British National Corpus (BNC). I will make use of the written part only, since results of previous research which attempts to classify texts of this dataset were not very beneficial.
Takahashi (2006), for instance, identified merely a broad distinction between formal and informal styles in the written part; whereas in the spoken part, he could come up with insightful results. Thus, it seems justifiable to look at the part of the BNC which Takahashi found intractable, using a different multivariate technique, to see if this methodology allows patterns to emerge in the dataset. Further, there are two other reasons to use the written BNC. First, some studies (e.g. Akinnaso 1982; Chafe and Danielewicz 1987) suggest that distinctions between text varieties based on frequencies of linguistic features can be identified even within one mode of communication, i.e. writing. Second, analysing written text varieties has direct implications for pedagogy (Biber and Conrad 2009). The variables measured in the written texts of the BNC are linguistic features that have functional associations. However, any linguistic feature can be interpreted functionally; hence, we cannot say that there is an easy way to decide on a list of linguistic features to investigate text varieties. In this thesis, the list of linguistic features is informed by some aspects of Systemic Functional Theory (SFT) and characteristics identified in previous research on writing, as opposed to speech. SFT lends itself to the interpretation of how language is used through functional associations of linguistic features, treating meaning and form as two inseparable notions. This characteristic of SFT can be one source to inform my research to some extent, which assumes that a model of text-types can be established by investigating not only the linguistic features shared in each type, but also the functions served by these linguistic features in each type. However, there is no commitment in this study to aspects of SFT other than those I have discussed here.
Similarly, the linguistic features that reflect characteristics of speech and writing identified in previous research also have a crucial role in distinguishing between different texts. For instance, writing is elaborate, and this is associated with linguistic features such as subordinate clauses, prepositional phrases, adjectives, and so on. However, these characteristics do not only reflect the distinction between speech and writing; they can also distinguish between different spoken texts or different written texts (see Akinnaso 1982). Thus, the linguistic features seen as important from these two perspectives are included in my list of linguistic features. To make the list more principled and exhaustive, I also consult a comprehensive corpus-based work on the English language, along with some microscopic studies examining individual features in different registers. The linguistic features include personal pronouns, passive constructions, prepositional phrases, nominalisation, modal auxiliaries, adverbs, and adjectives. Computing a cluster analysis based on this data is a complex process with many steps. At each step, several alternative techniques are available. Choosing among the available techniques is a non-trivial decision, as multiple alternatives are in common use by statisticians. I demonstrate a process of testing several combinations of clustering methods in order to determine the most useful/stable clustering combination(s) for use in the classification of texts by their linguistic features. To test the robustness of the clustering techniques and to validate the cluster analysis, I use three validation techniques for cluster analysis, namely the cophenetic coefficient, the adjusted Rand index, and the AU p-value. The findings of the cluster analysis represent a plausible attempt to systematise the study of the diversity of texts by means of automatic classification. Initially, the cluster analysis resulted in 16 clusters/text types.
However, a thorough investigation of those 16 clusters reveals that some clusters represent quite similar text types. Thus, it is possible to establish overall headings for similar types, reflecting their shared linguistic features. The resulting typology contains six major text types: persuasion, narration, informational narration, exposition, scientific exposition, and literary exposition. Cluster analysis thus proves to be a powerful tool for structuring the data, if used with caution. The way it is implemented in this study constitutes an advance in the field of text typology. Finally, a small-scale case study of the validity of the text typology is carried out. A questionnaire is used to find out whether and to what extent my taxonomy corresponds to native speakers' understanding of textual variability, that is, whether the taxonomy has some mental reality for native speakers of English. The results showed that native speakers of English, on the one hand, are good at explicitly identifying the grammatical features associated with scientific exposition and narration; but on the other hand, they are not so good at identifying the grammatical features associated with literary exposition and persuasion. The results also showed that participants seem to have difficulties in identifying grammatical features of informational narration. The results of this small-scale case study indicate that the text typology in my thesis is, to some extent, a phenomenon that native speakers are aware of, and thus we can justify placing our trust in the results - at least in their general pattern, if not in every detail.
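As an illustration of one of the validation techniques named above, the adjusted Rand index can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the procedure used in the thesis: the function name `adjusted_rand_index` and the two example labelings are hypothetical, and the abstract does not say which software or settings were used. The index compares two partitions of the same texts, for example the cluster assignments produced by two different combinations of clustering methods; values near 1 indicate that the two partitions largely agree, and values near 0 indicate agreement no better than chance.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same texts")
    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        # Fewer than two items: the partitions cannot disagree.
        return 1.0

    # Contingency counts: texts placed in cluster i of A and cluster j of B.
    cell_counts = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())

    # Agreement expected by chance, and the maximum attainable agreement.
    expected = sum_rows * sum_cols / total_pairs
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        # Degenerate case, e.g. both partitions put every text in one cluster.
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


# Hypothetical cluster labels for six texts from two clustering runs.
run_one = [0, 0, 1, 1, 2, 2]
run_two = [1, 1, 0, 0, 2, 2]  # same grouping, different label names
print(adjusted_rand_index(run_one, run_two))
```

Because the index depends only on which texts are grouped together, renaming the clusters does not change its value, which makes it suitable for comparing the output of different clustering combinations.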

Identifier: oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:658020
Date: January 2011
Creators: Mohamed, Ghada
Publisher: Lancaster University
Source Sets: Ethos UK
Detected Language: English
Type: Electronic Thesis or Dissertation
