Developing an XML-based, exploitable linguistic database of the Hebrew text of Gen. 1:1-2:3

The thesis discusses a series of related techniques that prepare and transform raw linguistic data for advanced processing in order to unveil hidden grammatical patterns. A threedimensional array is identified as a suitable data structure to build a data cube to capture multidimensional linguistic data in a computer's temporary storage facility. It also enables online analytical processing, like slicing, to be executed on this data cube in order to reveal various subsets and presentations of the data. XML is investigated as a suitable mark-up language to permanently store such an exploitable databank of Biblical Hebrew linguistic data. This concept is illustrated by tagging a phonetic transcription of Genesis 1:1-2:3 on various linguistic levels and manipulating this databank. Transferring the data set between an XML file and a threedimensional array creates a stable environment allowing editing and advanced processing of the data in order to confirm existing knowledge or to mine for new, yet undiscovered, linguistic features. Two experiments are executed to demonstrate possible text-mining procedures. Finally, visualisation is discussed as a technique that enhances interaction between the human researcher and the computerised technologies supporting the process of knowledge creation. Although the data set is very small there are exciting indications that the compilation and analysis of aggregate linguistic data may assist linguists to perform rigorous research, for example regarding the definitions of semantic functions and the mapping of these functions onto the syntactic module. / Thesis (PhD (Information Technology))--University of Pretoria, 2008. / Information Science / unrestricted

Identiferoai:union.ndltd.org:netd.ac.za/oai:union.ndltd.org:up/oai:repository.up.ac.za:2263/26750
Date28 July 2008
CreatorsKroeze, J.H. (Jan Hendrik)
ContributorsProf T J D Bothma, jan.kroeze@gmail.com, Dr M C Matthee
Source SetsSouth African National ETD Portal
Detected LanguageEnglish
TypeThesis
Rights©University of Pretoria 2008 B23/

Page generated in 0.0024 seconds