1.
Text retrieval and rule extraction from documents with biological content. Γαϊτάνου, Ευφροσύνη. 01 October 2008.
The rapid growth of the World Wide Web has given every user around the world the ability to access any kind of information immediately, quickly, and effectively. Millions of records of information are posted on the Internet every day, so the volume of circulating information grows at an exponential rate.
At the press of a button, a wealth of information, even on the most specialized topic, appears on the user's screen, ready to be read and processed. It is precisely this oversupply of information that makes any kind of processing of the data by the user, even simple reading, very difficult or outright impossible.
A tool for retrieving documents and extracting terms and rules from a very large document collection would allow the user to recover useful information quickly, without having to read and manually process all of those documents.
Especially in the sensitive field of the Bio-sciences, where the inability to process the available information and to extract useful connections and conclusions hampers scientific research, there is a pressing need for tools that facilitate knowledge mining from documents with biological content.
This diploma thesis presents techniques for extracting knowledge and rules from electronic documents on the Internet that concern the scientific field of Biology.
Our effort focuses mainly on mining knowledge from documents that deal with a specific topic in Biology (e.g., transcription factors), a goal that would otherwise be difficult or impossible to reach, since the number of documents is prohibitive for detailed study by an expert or a group of experts, let alone by an ordinary user.
First, we describe how the documents referring to the specific topic of interest are retrieved from the National Library of Medicine digital library and how the document collection to be processed is built. This collection undergoes lexical analysis and processing, during which the most important terms of each document are retained and the rest are discarded. In this way a set of the most representative terms per document is created, together with their frequency of occurrence in the documents.
Next, we apply data clustering techniques in order to create groups of terms as well as groups of documents. As part of this effort, we experimented with several well-known clustering techniques (the k-means and hierarchical single-linkage algorithms), and we also implemented the ISODATA algorithm from scratch in the Matlab development environment.
Our study concludes with the application of the Latent Semantic Indexing technique before the clustering of the data and a comparison of the results.
In the clusters produced by this process we observe connections between terms and documents and, moreover, the possibility of drawing conclusions and even of mining genuinely new knowledge in specific fields of the science of Biology. / The rapid growth of the World Wide Web has offered every user around the globe immediate, quick, and effective access to every kind of information. Every day, millions of information records on every subject are added to the Internet, so the volume of available information grows exponentially.
By pressing a single button, a wealth of information, even on the most specialized topic, is laid out on the user's screen, ready to be read and processed. It is exactly this oversupply that makes it difficult or even impossible for an ordinary user to process all the available data, or even just to read it.
Clearly, a tool that makes it feasible to retrieve documents and to extract terms and rule associations from a huge document collection would let users recover valuable information quickly, without having to read or manually process all of those documents.
Especially in the Bio-sciences, the inability to process the available information and to extract useful connections and conclusions is an obstacle to scientific research. There is therefore a pressing need for tools that facilitate text mining from documents with biological content.
In this master's thesis we present techniques for extracting knowledge and rules from digital documents retrieved from the Internet, with special reference to the scientific field of Biology.
Our work focuses mainly on knowledge extraction from documents on a specific biological topic (e.g., transcription factors), a task that is very difficult, and in some cases impossible, to accomplish manually: the number of available documents is prohibitive even for an expert or a group of experts to read and process, let alone an ordinary user.
First, we describe the retrieval of the documents on the specific biological topic of interest from the National Library of Medicine and the construction of our document collection. This collection is then lexically analyzed: only the most important terms of each document are kept, while the rest are ignored. In this way a set of the most representative terms per document is created, along with the frequency with which each term appears in each document.
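A minimal sketch of this term-extraction step is given below. The tokenizer, stop-word list, and top-k cutoff are illustrative assumptions rather than the thesis' exact lexical-analysis rules, and all function and variable names are hypothetical.

```python
# Sketch: build a term-by-document frequency matrix, keeping only the most
# frequent terms of each document (stop-word list and cutoff are assumptions).
from collections import Counter
import re

STOP_WORDS = {"the", "of", "and", "in", "to", "a", "is", "for", "with", "that"}

def top_terms(text, k=20):
    """Tokenize a document, drop stop words, and return its k most frequent terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS and len(t) > 2)
    return dict(counts.most_common(k))

def term_document_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] is the count of term j in document i."""
    per_doc = [top_terms(d) for d in documents]
    vocabulary = sorted(set().union(*per_doc))
    matrix = [[doc.get(term, 0) for term in vocabulary] for doc in per_doc]
    return vocabulary, matrix

abstracts = ["transcription factors bind dna ...", "gene expression is regulated by ..."]
vocab, tdm = term_document_matrix(abstracts)
```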
Next, we apply clustering techniques to this term-by-document set in order to produce clusters of terms as well as clusters of documents. During this step, several well-known clustering techniques are tested, such as the k-means algorithm and hierarchical single-linkage clustering, and we also describe our own implementation of the ISODATA algorithm. All clustering algorithms tested here were implemented in Matlab 6.5.
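As an illustration of this clustering step, the sketch below uses off-the-shelf routines; the thesis' own code was written in Matlab, and the libraries, cluster counts, and random stand-in matrix here are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

# Stand-in term-by-document counts: 30 documents x 50 terms.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(30, 50)).astype(float)

# k-means: partition the documents into k clusters by Euclidean distance.
doc_clusters_kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Hierarchical single linkage: repeatedly merge the two closest clusters, then cut at k clusters.
Z = linkage(X, method="single", metric="euclidean")
doc_clusters_single = fcluster(Z, t=5, criterion="maxclust")

# Clustering the transpose groups terms instead of documents.
term_clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X.T)
```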
Our research concludes with the application of the Latent Semantic Indexing (LSI) technique to the term-by-document set before the clustering step; we compare the resulting clusters with those obtained without applying LSI before clustering.
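A hedged sketch of what applying LSI before clustering can look like is shown below: the term-by-document matrix is reduced with a truncated SVD and the documents are clustered in the reduced space. The rank k, cluster count, and stand-in data are assumptions, not the thesis' settings.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
A = rng.poisson(1.0, size=(200, 40)).astype(float)   # stand-in terms x documents matrix

k = 10                                               # number of latent concepts to keep
U, s, Vt = np.linalg.svd(A, full_matrices=False)
docs_lsi = (np.diag(s[:k]) @ Vt[:k, :]).T            # each document as a k-dimensional vector

clusters_lsi = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(docs_lsi)
clusters_raw = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(A.T)
# Comparing clusters_lsi with clusters_raw mirrors the comparison of clustering
# with and without LSI described in the abstract.
```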
It is in these clusters that we find many connections between terms and documents and, even more, the ability to extract not only conclusions about the topic of the documents in each cluster but also genuinely new knowledge about specific scientific fields of Biology.
2.
Modeling Potential Native Plant Species Distributions in Rich County, Utah. Peterson, Kathryn A. 01 May 2009.
Georeferenced field data were used to develop logistic regression models of the geographic distribution of 38 frequently common plant species throughout Rich County, Utah, to assist in the future correlation of Natural Resources Conservation Service Ecological Site Descriptions to soil map units. Field data were collected primarily during the summer of 2007 and augmented with previously existing data collected in 2001 and 2006. Several abiotic parameters and Landsat Thematic Mapper imagery were used to stratify the study area into sampling units prior to the 2007 field season. Models were initially evaluated using an independent dataset extracted from data collected by the Bureau of Land Management and by another research project conducted in Rich County by Utah State University. With this independent dataset, model accuracy statistics varied widely across individual species, but the average model sensitivity (modeling a species as common where it was common in the independent dataset) was 0.626, and the average overall correct classification rate was 0.683. Because of concerns about the appropriateness of the independent dataset for evaluation, models were also evaluated using an internal cross-validation procedure. Model accuracy statistics computed by this procedure averaged 0.734 for sensitivity and 0.813 for the overall correct classification rate, and they varied less across species. Despite the concerns with the independent dataset, we wanted to determine whether models would be improved, based on internal cross-validation accuracy statistics, by adding these data to the original training data. Results indicated that the original training data, collected with this modeling effort in mind, were better for choosing model parameters, but that model coefficients were sometimes better when computed using the combined dataset.
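As a rough illustration of how one per-species presence model and the two reported accuracy statistics could be computed, a sketch follows; the predictors, data, threshold, and cross-validation settings are hypothetical, not the study's actual covariates.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))                  # e.g. elevation, slope, aspect, a TM band ratio
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.8, size=300) > 0).astype(int)  # 1 = species common

model = LogisticRegression(max_iter=1000)
y_hat = cross_val_predict(model, X, y, cv=10)  # internal cross-validation

sensitivity = np.mean(y_hat[y == 1] == 1)      # plots where the species was common, predicted as common
overall_ccr = np.mean(y_hat == y)              # overall correct classification rate
print(f"sensitivity={sensitivity:.3f}, overall={overall_ccr:.3f}")
```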
3.
Evaluating Long-Term Land Cover Changes for Malheur Lake, Oregon Using ENVI and ArcGIS. Woods, Ryan Joseph. 01 December 2015.
Land cover change over time can be a useful indicator of variations in a watershed, such as the patterns of drought in an area. I present a case study using remotely sensed images from Landsat satellites over a 30-year period to generate classifications representing land cover categories, which I use to quantify land cover change in the watershed areas that contribute to Malheur, Mud, and Harney Lakes. I selected images about every 4 to 6 years, from late June to late July, in an attempt to capture peak vegetation growth and to avoid cloud cover. Complete coverage of the watershed required that I select an image that included the lakes, an image to the north, and an image to the west of the lakes to capture the watershed areas for each chosen year. I used the watershed areas defined by the HUC-8 shapefiles; the relevant watersheds are Harney-Malheur Lakes, Donner und Blitzen, Silver, and Silvies. To summarize the land cover classes that could be discriminated from the Landsat images in the area, I used an unsupervised classification algorithm, the Iterative Self-Organizing Data Analysis Technique (ISODATA), to identify different classes from the pixels. I then used the ISODATA results and visual inspection of calibrated Landsat images and Google Earth imagery to create Regions of Interest (ROIs) for the following land cover classes: Water, Shallow Water, Vegetation, Dark Vegetation, Salty Area, and Bare Earth. The ROIs were used in the following supervised classification algorithms to classify land cover for the area: maximum likelihood, minimum distance, and Mahalanobis distance. Using ArcGIS, I removed most of the misclassified area from the classified images with the Landsat CDR, combined the main, north, and west images, and then extracted the watersheds from the combined image. The area in acres of each land cover class in each watershed was computed and recorded in graphs and tables.

After comparing the three supervised classifications using the amount of area classified into each category, the normalized area in each category, and the raster datasets, I determined that the minimum distance classification algorithm produced the most accurate land cover classification. I investigated the correlation of the land cover classes with average precipitation, average discharge, average summer high temperature, and drought indicators. For the most part, the land cover changes correlate with the weather; land use changes, groundwater, and error in the land cover classes may account for the instances of discrepancy. The correlations of the land cover classes with the weather data are statistically significant, except for Dark Vegetation and Bare Earth. This study shows that Landsat imagery has the necessary components to create and track land cover change over time, and the results can be useful in hydrological studies and can be applied to models.
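A minimal sketch of the minimum-distance classifier applied to a multi-band image, and of converting pixel counts to acres, is shown below. The band count, class statistics, and stand-in data are hypothetical; the thesis performed these classifications in ENVI on Landsat bands.

```python
import numpy as np

CLASSES = ["Water", "Shallow Water", "Vegetation", "Dark Vegetation", "Salty Area", "Bare Earth"]

def minimum_distance_classify(image, class_means):
    """image: (rows, cols, bands); class_means: (n_classes, bands). Returns a class index per pixel."""
    pixels = image.reshape(-1, image.shape[-1]).astype(float)
    # Euclidean distance from every pixel to every class mean; assign the nearest class.
    dists = np.linalg.norm(pixels[:, None, :] - class_means[None, :, :], axis=2)
    return dists.argmin(axis=1).reshape(image.shape[:2])

rng = np.random.default_rng(0)
image = rng.integers(0, 255, size=(100, 100, 6)).astype(float)   # stand-in 6-band scene
class_means = rng.uniform(0, 255, size=(len(CLASSES), 6))        # ROI mean spectra (hypothetical)

labels = minimum_distance_classify(image, class_means)
pixel_area_acres = (30 * 30) / 4046.86                           # one 30 m Landsat pixel in acres
acres_per_class = {c: np.sum(labels == i) * pixel_area_acres for i, c in enumerate(CLASSES)}
```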