<p>This thesis explores and visualizes relationships among documents using self-organizing maps (SOM), a type of artificial neural network that projects high-dimensional data onto low-dimensional views. The source data are the full Extensible Markup Language (XML) texts of A Standard Corpus of Present Day Edited American English. First, the XML files are transformed into a term-document matrix through stop word removal, stemming, tf-idf (term frequency–inverse document frequency) weighting, and global filtering; each row of this matrix represents a document as an n-dimensional vector. Second, these vectors are clustered and visualized by a SOM consisting of neurons, where each neuron relates to a set of documents that share a certain number of terms. Third, a network is constructed from the SOM: its vertex set consists of the neurons and the documents, and its line set consists of the links between neurons and documents. Finally, this network is exported to Pajek for analysis and final visualization.</p>
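<p>The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's actual implementation: the stop word list, the toy documents, the grid size, the learning schedule, and all function names (<code>tfidf_matrix</code>, <code>train_som</code>, <code>to_pajek</code>) are hypothetical. XML parsing, stemming, and global filtering are omitted for brevity, and each document is linked only to its single closest neuron.</p>

```python
import math
import random
from collections import Counter

# Toy stop word list (an assumption; the thesis does not give its list).
STOP_WORDS = {"the", "a", "of", "and", "in", "on", "is"}

def tfidf_matrix(docs):
    """Return (terms, rows) with one tf-idf row vector per document."""
    tokenized = [[w for w in d.lower().split() if w not in STOP_WORDS] for d in docs]
    terms = sorted({w for toks in tokenized for w in toks})
    df = Counter(w for toks in tokenized for w in set(toks))  # document frequency
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        rows.append([tf[t] * math.log(len(docs) / df[t]) for t in terms])
    return terms, rows

def best_matching_unit(weights, vec):
    """Index of the neuron whose weight vector is closest to vec."""
    dists = [sum((w - v) ** 2 for w, v in zip(neuron, vec)) for neuron in weights]
    return dists.index(min(dists))

def train_som(rows, grid_w, grid_h, epochs=50, lr0=0.5, seed=0):
    """Train a small rectangular SOM; returns one weight vector per neuron."""
    rng = random.Random(seed)
    dim = len(rows[0])
    coords = [(x, y) for y in range(grid_h) for x in range(grid_w)]
    weights = [[rng.random() for _ in range(dim)] for _ in coords]
    sigma0 = max(grid_w, grid_h) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # linearly decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # shrinking neighbourhood radius
        for vec in rows:
            bx, by = coords[best_matching_unit(weights, vec)]
            for (x, y), neuron in zip(coords, weights):
                d2 = (x - bx) ** 2 + (y - by) ** 2
                h = math.exp(-d2 / (2.0 * sigma ** 2))  # Gaussian neighbourhood
                for i in range(dim):
                    neuron[i] += lr * h * (vec[i] - neuron[i])
    return weights

def to_pajek(weights, rows, doc_labels):
    """Build Pajek .net text: neurons and documents as vertices, links as edges."""
    n = len(weights)
    lines = [f"*Vertices {n + len(rows)}"]
    lines += [f'{i + 1} "neuron{i + 1}"' for i in range(n)]
    lines += [f'{n + j + 1} "{label}"' for j, label in enumerate(doc_labels)]
    lines.append("*Edges")
    for j, vec in enumerate(rows):
        lines.append(f"{best_matching_unit(weights, vec) + 1} {n + j + 1}")
    return "\n".join(lines)

# Hypothetical toy documents standing in for the corpus texts.
docs = ["the cat sat on the mat", "a cat and a dog", "stock market prices", "market trading in stock"]
terms, rows = tfidf_matrix(docs)
weights = train_som(rows, grid_w=2, grid_h=2)
print(to_pajek(weights, rows, [f"doc{j + 1}" for j in range(len(docs))]))
```

<p>The printed text can be saved with a <code>.net</code> extension and opened in Pajek. Linking each document only to its best-matching neuron is a simplification chosen here; the thesis may define the neuron-document linkage differently.</p>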
Identifier | oai:union.ndltd.org:UPSALLA/oai:DiVA.org:hig-129 |
Date | January 2007 |
Creators | Lu, Weiping |
Publisher | University of Gävle, Department of Technology and Built Environment |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, text |