Thesis digitized by the Direction des bibliothèques of the Université de Montréal / While content selection has been intensively explored in the sentence extraction approach to automatic summarization, there is generally little work on the other process, content condensation. To understand this process of condensation, we propose a partial typology based on whether a linguistic unit is replaced, deleted, compressed into fewer essential units, or combined with another unit. We propose four important categories of condensation processes (generalization, deletion, compression, and aggregation), together with their inverse processes, e.g. insertion and expansion, which were occasionally observed. To guide the usage of the same term for similar operations, we borrow definitions from linguistics. The type and function of the linguistic units involved are also discussed. We carried out an empirical analysis of 57 author-written abstracts of on-line journal articles in entomology, tracing each abstract sentence back to the plausible source sentences in the corresponding full text. Unlike other studies that focus on the resulting abstract, our study focuses on the processes leading to the production of abstract sentences from corresponding full-text sentences. We do not, however, propose an algorithm for abstracting, or account for all the conditions under which individual condensation operations may apply. While a range of substitutes were used in abstracting, about half of the lexical units in our abstracts share the same stem as their source words, or are derived forms of them. Only a small proportion of substitutes were synonyms; the rest were (quasi-)synonyms or imprecise equivalents. Authors tend to use less technical forms in abstracts, possibly in anticipation of non-specialist abstract readers. 
Numerical expressions are rendered less precise although no less accurate: absolute numbers and decimals are rounded off, and percentages are replaced by ratios or fractions. These observations are consistent with the "new" context of an abstract, where only the gist of a document's content need be re-conveyed. Among the linguistic units commonly deleted are metadiscourse phrases and segments of text (e.g. parenthetical and apposed texts) that provide details and precision in the full text but are out of place in an abstract. Redundancies inserted for various reasons, or units deemed to be implicit to the comprehension of targeted readers, are also often removed. While deletion is an important sub-process of condensation, we observed some instances of adding experimental and other details to compact more information into the abstract. The expansion or "unpacking" of compact linguistic units was also observed. The secondary role of the inverse processes observed calls for a review of the meaning of condensation, from "not giving as much detail or using fewer words" to include the adding of information in order to make a unit of text informatively compact. Among the linguistic units compressed are verbal complexes containing a support verb or a catenative. Like semantically empty support verbs (e.g. X caused decreases in Y = X reduced Y), some catenatives too may be deleted without significant change in meaning to the verbal complex (e.g. X was allowed to hatch = X hatched). Redundancy in meaning between an adjective and a noun in a noun phrase, e.g. functional role, may be removed, and the phrase compressed to just the stem of the adjective, i.e. function. While not frequently occurring in the corpus studied, the compression of such units may be described by rules and hence might be operationalized for automatic abstracting. Aggregation, the combining of units of text within or between sentences, is an important sub-process of condensation. 
Two-thirds of the sentences in the abstracts studied were written using multiple source sentences, and more sentences were combined without than with the use of an explicit sign, such as a connective, a colon or a semi-colon. If research in summarization is to progress beyond sentence selection, then we must work towards: (a) a clear distinction between operations that are condensation processes and those that are not; (b) bringing operationally similar processes together under the same designation; and (c) a greater understanding of the sub-processes constituting condensation. To this end, our provisional typology for condensation, together with the range of types of linguistic units involved and their functions, is a first step towards advancing research into content condensation. We have only just begun to identify the condensation sub-processes in operation during abstracting. The factors that are critical to the interplay of these processes still need to be investigated.
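The remark that compression of support-verb, catenative, and redundant adjective-noun units "may be described by rules" can be illustrated with a minimal sketch. This is not the thesis's method (the abstract states that no abstracting algorithm is proposed); the `RULES` table and the `compress` function are hypothetical names, and the patterns cover only the three examples quoted in the abstract.

```python
import re

# Hypothetical hand-written rewrite rules (an assumption for illustration).
# Each rule pairs a regular expression with its compressed replacement and
# covers only the literal examples quoted in the abstract.
RULES = [
    # Support verb: "X caused decreases in Y" = "X reduced Y"
    (re.compile(r"\bcaused decreases in\b"), "reduced"),
    # Catenative: "X was allowed to hatch" = "X hatched"
    (re.compile(r"\bwas allowed to hatch\b"), "hatched"),
    # Redundant adjective + noun: "functional role" -> "function"
    (re.compile(r"\bfunctional role\b"), "function"),
]


def compress(sentence: str) -> str:
    """Apply every rule in order and return the compressed sentence."""
    for pattern, replacement in RULES:
        sentence = pattern.sub(replacement, sentence)
    return sentence


if __name__ == "__main__":
    print(compress("The treatment caused decreases in egg production."))
    print(compress("Each egg was allowed to hatch."))
```

A realistic system would need part-of-speech and lemma information rather than literal strings, since the abstract notes that the conditions under which each operation applies are not fully accounted for.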
Identifier | oai:union.ndltd.org:umontreal.ca/oai:papyrus.bib.umontreal.ca:1866/33221 |
Date | 04 1900 |
Creators | Chuah, Choy-Kim |
Contributors | Kittredge, Richard |
Source Sets | Université de Montréal |
Language | English |
Detected Language | English |
Type | thesis, thèse |
Format | application/pdf |