Global ETD Search

61	A probabilistic approach for cluster based polyrepresentative information retrieval Abbasi, Muhammad Kamran January 2015 (has links) Document clustering in information retrieval (IR) is considered an alternative to rank-based retrieval approaches, because of its potential to support user interactions beyond just typing in queries. Similarly, the Principle of Polyrepresentation (multi-evidence: combining multiple cognitively and/or functionally different information need or information object representations for improving an IR system's performance) is an established approach in cognitive IR with plausible applicability in the domain of information seeking and retrieval. The combination of these two approaches can assimilate their respective individual strengths in order to further improve the performance of IR systems. The main goal of this study is to combine cognitive and cluster-based IR approaches for improving the effectiveness of (interactive) information retrieval systems. In order to achieve this goal, polyrepresentative information retrieval strategies for cluster browsing and retrieval have been designed, focusing on the evaluation aspect of such strategies. This thesis addresses the challenge of designing and evaluating an Optimum Clustering Framework (OCF) based model, implementing probabilistic document clustering for interactive IR. Thus, polyrepresentative cluster browsing strategies have been devised. With these strategies a simulated user based method has been adopted for evaluating the polyrepresentative cluster browsing and searching strategies. The proposed approaches are evaluated for information need based polyrepresentative clustering as well as document based polyrepresentation and the combination thereof. For document-based polyrepresentation, the notion of citation context is exploited, which has special applications in scientometrics and bibliometrics for science literature modelling.
The information need polyrepresentation, on the other hand, utilizes the various aspects of user information need, which is crucial for enhancing the retrieval performance. Besides describing a probabilistic framework for polyrepresentative document clustering, one of the main findings of this work is that the proposed combination of the Principle of Polyrepresentation with document clustering has the potential of enhancing the user interactions with an IR system, provided that the various representations of information need and information objects are utilized. The thesis also explores interactive IR approaches in the context of polyrepresentative interactive information retrieval when it is combined with document clustering methods. Experiments suggest there is a potential in the proposed cluster-based polyrepresentation approach, since statistically significant improvements were found when comparing the approach to a BM25-based baseline in an ideal scenario. Further marginal improvements were observed when cluster-based re-ranking and cluster-ranking based comparisons were made. The performance of the approach depends on the underlying information object and information need representations used, which confirms findings of previous studies where the Principle of Polyrepresentation was applied in different ways. 025.04
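The abstract above compares against a BM25-based baseline without describing it. As a point of reference only, here is a minimal sketch of Okapi BM25 scoring over tokenized documents. The parameter values `k1=1.2` and `b=0.75`, the smoothed idf variant, and the toy documents are assumptions for illustration and are not taken from the thesis.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query` with Okapi BM25.

    k1 and b are conventional defaults, not values reported in the thesis.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed idf; the "+ 1.0" keeps the value non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy collection (illustrative only).
docs = [
    "document clustering for information retrieval".split(),
    "probabilistic ranking of documents".split(),
    "cooking recipes adapted by a community".split(),
]
scores = bm25_scores("clustering retrieval".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

A cluster-based approach would then be compared against the ranking induced by these scores; how the thesis configures its baseline is not stated in the abstract.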
62	La programmation DC et DCA pour certaines classes de problèmes en apprentissage et fouille de donées [i.e. données] / DC programming and DCA for some classes of problems in machine learning and data mining Nguyen, Manh Cuong 19 May 2014 (has links) La classification (supervisée, non supervisée et semi-supervisée) est une thématique importante de la fouille de données. Dans cette thèse, nous nous concentrons sur le développement d'approches d'optimisation pour résoudre certains types des problèmes issus de la classification de données. Premièrement, nous avons examiné et développé des algorithmes pour résoudre deux problèmes classiques en apprentissage non supervisée : la maximisation du critère de modularité pour la détection de communautés dans des réseaux complexes et les cartes auto-organisatrices. Deuxièmement, pour l'apprentissage semi-supervisée, nous proposons des algorithmes efficaces pour le problème de sélection de variables en semi-supervisée Machines à vecteurs de support. Finalement, dans la dernière partie de la thèse, nous considérons le problème de sélection de variables en Machines à vecteurs de support multi-classes. Tous ces problèmes d'optimisation sont non convexe de très grande dimension en pratique. Les méthodes que nous proposons sont basées sur les programmations DC (Difference of Convex functions) et DCA (DC Algorithms) étant reconnues comme des outils puissants d'optimisation. Les problèmes évoqués ont été reformulés comme des problèmes DC, afin de les résoudre par DCA. En outre, compte tenu de la structure des problèmes considérés, nous proposons différentes décompositions DC ainsi que différentes stratégies d'initialisation pour résoudre un même problème. 
Tous les algorithmes proposés ont été testés sur des jeux de données réelles en biologie, réseaux sociaux et sécurité informatique / Classification (supervised, unsupervised and semi-supervised) is one of the important research topics of data mining, with many applications in various fields. In this thesis, we focus on developing optimization approaches for solving some classes of optimization problems in data classification. Firstly, for unsupervised learning, we considered and developed algorithms for two well-known problems: modularity maximization for community detection in complex networks and the data visualization problem with Self-Organizing Maps. Secondly, for semi-supervised learning, we investigated effective algorithms to solve the feature selection problem in semi-supervised Support Vector Machines. Finally, for supervised learning, we are interested in the feature selection problem in multi-class Support Vector Machines. All of these problems are large-scale non-convex optimization problems. Our methods are based on DC (Difference of Convex functions) programming and DCA (DC Algorithms), which are well known as powerful tools in optimization. The considered problems were reformulated as DC programs and DCA was then used to obtain the solution. Also, taking into account the structure of the considered problems, we provide appropriate DC decompositions and suitable strategies for choosing initial points for DCA in order to improve its efficiency. All the proposed algorithms have been tested on real-world datasets from biology, social networks and computer security. Classification de données Fouille de données Apprentissage Optimisation Programmations DC et DCA 025.04 519.7
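To make the DC programming idea in the abstract above concrete, here is a minimal sketch of the generic DCA iteration on a one-dimensional toy problem. The objective f(x) = x^4 - 2x^2, its decomposition into g(x) = x^4 and h(x) = 2x^2, and the function name `dca` are assumptions chosen for illustration; none of the thesis's actual problems or decompositions is reproduced here.

```python
import math


def dca(x0, steps=50, tol=1e-10):
    """Minimize f(x) = x**4 - 2*x**2 with DCA, where f = g - h,
    g(x) = x**4 and h(x) = 2*x**2 are both convex (toy example)."""
    x = x0
    for _ in range(steps):
        # Step 1: y_k = h'(x_k), a (sub)gradient of h at the current point.
        y = 4.0 * x
        # Step 2: x_{k+1} = argmin_x g(x) - y*x, i.e. solve 4*x**3 = y.
        x_new = math.copysign(abs(y / 4.0) ** (1.0 / 3.0), y)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x


x_pos = dca(0.5)    # converges towards the local minimizer x = 1
x_neg = dca(-0.5)   # converges towards the local minimizer x = -1
```

The two runs show why the abstract stresses initialization: DCA is a local method, so the starting point decides which critical point is reached (starting exactly at 0 stays at the critical point 0).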
63	Gérer et exploiter des connaissances produites par une communauté en ligne : application au raisonnement à partir de cas / Managing and exploiting knowledge produced by an e-community : application to case-based reasoning Gaillard, Emmanuelle 22 June 2016 (has links) Cette thèse propose deux approches pour améliorer la qualité des réponses d'un système de raisonnement à partir de cas (RàPC) utilisant des connaissances produites par une communauté en ligne. La première approche concerne la mise en œuvre d'un modèle permettant de gérer la fiabilité des connaissances produites par la communauté sous la forme d'un score. Ce score de fiabilité est utilisé d'une part pour filtrer les connaissances non fiables afin qu'elles ne soient pas utilisées par le système de RàPC et d'autre part pour classer les réponses retournées par le système. La deuxième approche concerne la représentation de la typicalité entre sous-classes et classes dans une organisation hiérarchique. La typicalité est alors utilisée pour réorganiser les connaissances hiérarchiques utilisées par le système de RàPC. L'apport de ces deux approches a été évalué dans le cadre de eTaaable, un système de RàPC qui adapte des recettes de cuisine en utilisant des connaissances produites par une communauté en ligne. L'évaluation montre que la gestion de la fiabilité des connaissances produites par la communauté améliore la qualité des réponses retournées par eTaaable. De même, l'évaluation montre que l'utilisation par eTaaable des hiérarchies des connaissances réorganisées en exploitant la typicalité améliore également la qualité des réponses / This research work presents two approaches to improve the quality of the results returned by a case-based reasoning (CBR) system exploiting knowledge produced by an e-community. The first approach relies on a new model to manage the trustworthiness of the knowledge produced by the e-community.
In this model, trustworthiness is represented by a score, which is used to filter out untrustworthy knowledge so that the CBR system no longer uses it. The score is also used to rank the CBR results. The second approach addresses the issue of representing the typicality between subclasses and classes in a hierarchy. Typicality is then used to reorganize the hierarchical knowledge used by the CBR system. Both approaches have been evaluated in the framework of eTaaable, a CBR system that adapts cooking recipes using knowledge coming from an e-community. The evaluations show that managing the trustworthiness of the knowledge produced by an e-community improves the quality of the results returned by eTaaable. The evaluations also show that eTaaable returns better results when using knowledge reorganized according to typicality. Raisonnement à partir de cas Communauté en ligne Fiabilité Typicalité Case-Based reasoning E-Community Trustworthiness Typicality 006.33 025.04
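The two uses of the trustworthiness score described above (filtering knowledge, then ranking results) can be sketched as follows. This is only an illustration under stated assumptions: the threshold of 0.5, the rule that an answer is as trustworthy as its least trustworthy knowledge unit, and all names and data are hypothetical and do not come from eTaaable.

```python
def filter_knowledge(knowledge, threshold=0.5):
    """Keep only knowledge units whose trust score reaches the threshold.

    `knowledge` maps a knowledge-unit id to a score in [0, 1] (assumed scale).
    """
    return {unit: score for unit, score in knowledge.items() if score >= threshold}


def rank_answers(answers, trusted):
    """Rank answers by the lowest trust score among the units they rely on.

    `answers` is a list of (label, [unit ids]); answers relying on a
    filtered-out unit are dropped. The min-aggregation is an assumption.
    """
    usable = [(label, used) for label, used in answers
              if all(unit in trusted for unit in used)]
    return sorted(usable,
                  key=lambda a: min((trusted[u] for u in a[1]), default=1.0),
                  reverse=True)


# Hypothetical community knowledge and candidate adaptations.
knowledge = {"lime~lemon": 0.9, "butter~margarine": 0.7, "salt~sugar": 0.1}
answers = [
    ("use margarine", ["butter~margarine"]),
    ("use sugar", ["salt~sugar"]),
    ("use lemon", ["lime~lemon"]),
]
trusted = filter_knowledge(knowledge)
ranked = [label for label, _ in rank_answers(answers, trusted)]
```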
64	Χρήση θεματικών ταξινομιών για την αυτόματη δημιουργία και οργάνωση εξατομικευμένων καταλόγων διαδικτύου : ένας πρότυπος αλγόριθμος ταξινόμησης / Usage of thematic taxonomy for the automatic creation and organization of specialized network catalogs Κρίκος, Βλάσης 16 May 2007 (has links) Οι εξατομικευμένοι κατάλογοι διαδικτύου εμφανίστηκαν σχεδόν ταυτόχρονα με την εμφάνιση των φυλλομετρητών διαδικτύων, και από τότε όλοι οι φυλλομετρητές ενσωματώνουν απλά συστήματα διαχείρισης των εξατομικευμένων καταλόγων. Με τον όρο εξατομικευμένοι κατάλογοι εννοούμε τις προσωπικές συλλογές από ιστοσελίδες που ένας χρήστης διαδικτύου αποθηκεύει κατά την ώρα της πλοήγησης στον Παγκόσμιο Ιστό. Οι εξατομικευμένοι κατάλογοι διαδικτύου χρησιμοποιούνται σαν «προσωπικός χώρος πληροφορίας του δικτύου» για να βοηθούν τους ανθρώπους να θυμούνται και να ανακτούν ενδιαφέρουσες ιστοσελίδες από το διαδίκτυο. Στην εργασία αυτή παρουσιάζουμε ένα πρότυπο σύστημα διαχείρισης εξατομικευμένων καταλόγων διαδικτύου ορίζοντας τις προϋποθέσεις που πρέπει να πληρεί ώστε να είναι εύχρηστο και αποτελεσματικό. Το σύστημα αυτό έχει όλες τις δυνατότητες που έχουν τα εμπορικά αλλά και τα πρότυπα συστήματα διαχείρισης bookmarks. Επιπλέον διαθέτει καινοτόμες λειτουργίες που το καθιστούν μοναδικό. Παράλληλα παρουσιάζουμε αναλυτικά έναν πρότυπο αλγόριθμο κατάταξης, τον αλγόριθμο κατάταξης με βάση την συνάφεια των σελίδων με τις κατηγορίες στις οποίες ανήκουν. Τον αλγόριθμο αυτόν τον συγκρίνουμε με τον δημοφιλή αλγόριθμος γενικής κατάταξης το PageRank. Από το πείραμα που κάναμε προκύπτει ότι ο αλγόριθμος που προτείνουμε είναι πιο κατάλληλος για την ταξινόμηση των σελίδων σε θεματικές κατηγορίες από το PageRank. /
Personalized web catalogs (bookmarks) appeared almost simultaneously with the first web browsers, and since then all browsers have incorporated simple systems for managing them. By personalized catalogs we mean the personal collections of web pages that a user stores while browsing the World Wide Web. Personalized web catalogs are used as a "personal web information space" that helps people remember and retrieve interesting web pages. In this work we present a prototype system for managing personalized web catalogs, defining the requirements it must meet in order to be usable and effective. The system offers all the capabilities of commercial and prototype bookmark management systems. In addition, it provides innovative functions that make it unique. We also present in detail a novel ranking algorithm, which ranks pages by their relevance to the categories they belong to. We compare this algorithm with PageRank, the popular general-purpose ranking algorithm. Our experiment indicates that the proposed algorithm is better suited than PageRank to ordering pages within thematic categories. Θεματικές ταξινομίες 025.04 Bookmarks Ranking algorithm Thematic taxonomy
65	Η χρήση σημασιολογικών δικτύων για τη διαχείριση του περιεχομένου του παγκόσμιου ιστού / Managing the web content through the use of semantic networks Στάμου, Σοφία 25 June 2007 (has links) Η παρούσα διατριβή πραγματεύεται την ενσωμάτωση ενός σημασιολογικού δικτύου λημμάτων σ’ ένα σύνολο εφαρμογών Διαδικτύου για την αποτελεσματική διαχείριση του περιεχομένου του Παγκόσμιου Ιστού. Τα δίκτυα σημασιολογικά συσχετισμένων λημμάτων αποτελούν ένα είδος ηλεκτρονικών λεξικών στα οποία καταγράφεται σημασιολογική πληροφορία για τα λήμματα που περιλαμβάνουν, όπου τα τελευταία αποθηκεύονται σε μια δενδρική δομή δεδομένων. Ο τρόπος δόμησης του περιεχομένου των σημασιολογικών δικτύων παρουσιάζει αρκετές ομοιότητες με την οργάνωση που ακολουθούν οι ιστοσελίδες στον Παγκόσμιο Ιστό, με αποτέλεσμα τα σημασιολογικά δίκτυα να αποτελούν έναν σημασιολογικό πόρο άμεσα αξιοποιήσιμο από ένα πλήθος εφαρμογών Διαδικτύου που καλούνται να διαχειριστούν αποδοτικά το πλήθος των δεδομένων που διακινούνται στον Παγκόσμιο Ιστό. Μετά από επισκόπηση των τεχνικών που παρουσιάζονται στη διεθνή βιβλιογραφία για τη διαχείριση του περιεχομένου του Παγκόσμιου Ιστού, προτείνεται και υλοποιείται ένα πρότυπο μοντέλο διαχείρισης ιστοσελίδων, το οποίο κάνοντας εκτεταμένη χρήση ενός εμπλουτισμένου σημασιολογικού δικτύου λημμάτων, εντοπίζει εννοιολογικές ομοιότητες μεταξύ του περιεχομένου διαφορετικών ιστοσελίδων και με βάση αυτές επιχειρεί και κατορθώνει την αυτοματοποιημένη και αποδοτική δεικτοδότηση, κατηγοριοποίηση και ταξινόμηση του πλήθους των δεδομένων του Παγκόσμιου Ιστού. Για την επίδειξη του μοντέλου διαχείρισης ιστοσελίδων που παρουσιάζεται, υιοθετούμε το μοντέλο πλοήγησης στους θεματικούς καταλόγους του Παγκόσμιου Ιστού και καταδεικνύουμε πειραματικά τη συμβολή των σημασιολογικών δικτύων σε όλα τα στάδια της δημιουργίας θεματικών καταλόγων Διαδικτύου. 
Συγκεκριμένα, εξετάζεται η συνεισφορά των σημασιολογικών δικτύων: (i) στον ορισμό και εμπλουτισμό των θεματικών κατηγοριών των καταλόγων του Παγκόσμιου Ιστού, (ii) στην επεξεργασία και αποσαφήνιση του περιεχομένου των ιστοσελίδων, (iii) στον αυτόματο εμπλουτισμό των θεματικών κατηγοριών ενός δικτυακού καταλόγου, (iv) στην ταξινόμηση των ιστοσελίδων που έχουν δεικτοδοτηθεί στις αντίστοιχες θεματικές κατηγορίες ενός καταλόγου, (v) στη διαχείριση των περιεχομένων των θεματικών καταλόγων με τρόπο που να διασφαλίζει την παροχή χρήσιμων ιστοσελίδων προς τους χρήστες, και τέλος (vi) στην αναζήτηση πληροφορίας στους θεματικούς καταλόγους του Παγκόσμιου Ιστού. Η επιτυχία του προτεινόμενου μοντέλου επιβεβαιώνεται από τα αποτελέσματα ενός συνόλου πειραματικών εφαρμογών που διενεργήθηκαν στο πλαίσιο της παρούσας διατριβής, όπου καταδεικνύεται η συμβολή των σημασιολογικών δικτύων στην αποτελεσματική διαχείριση των πολυάριθμων και δυναμικά μεταβαλλόμενων ιστοσελίδων του Παγκόσμιου Ιστού. Η σπουδαιότητα του προτεινόμενου μοντέλου διαχείρισης ιστοσελίδων, έγκειται στο ότι, εκτός από αυτόνομο εργαλείο διαχείρισης και οργάνωσης ιστοσελίδων, συνιστά το πρώτο επίπεδο επεξεργασίας σε ευρύτερο πεδίο εφαρμογών, όπως είναι η εξαγωγή περιλήψεων, η εξόρυξη πληροφορίας, η θεματικά προσανατολισμένη προσκομιδή ιστοσελίδων, ο υπολογισμός του ρυθμού μεταβολής των δεδομένων του Παγκόσμιου Ιστού, η ανίχνευση ιστοσελίδων με παραποιημένο περιεχόμενο, κτλ. / This dissertation addresses the incorporation of a semantic network into a set of Web-based applications for the effective management of Web content. Semantic networks are a kind of machine readable dictionaries, which encode semantic information for the lemmas they contain, where the latter are stored in a tree structure. 
Semantic networks store their contents in a way similar to the organization that Web pages exhibit on the Web graph, a feature that makes semantic networks readily usable by several Web applications that aim at the efficient management of the proliferating and constantly changing Web data. After an overview of the techniques that have been employed for managing Web content, we propose and implement a novel Web data management model, which relies on an enriched semantic network for locating semantic similarities between the contents of distinct Web pages. Based on these similarities, our model achieves the automatic and effective indexing, categorization and ranking of the numerous pages that are available on the Web. To demonstrate the potential of our Web data management model, we adopt the navigation model of Web thematic directories and experimentally show the contribution of semantic networks throughout the construction of Web catalogs. More specifically, we study the contribution of semantic networks in: (i) determining and enriching the thematic categories of Web directories, (ii) processing and disambiguating the contents of Web pages, (iii) automatically improving the thematic categories of Web directories, (iv) ordering Web pages that have been assigned to the respective categories of a Web directory, (v) managing the contents of Web directories in a way that ensures the availability of useful Web data to the directories’ users, and (vi) searching for information in the contents of Web directories. The contribution of our model is confirmed by the experimental results that we obtained from a number of test applications that we ran in the framework of our study. The results demonstrate the contribution of semantic networks to the effective management of the dynamically evolving Web content.
The practical outcome of the research presented herein, besides offering a fully fledged infrastructure for the efficient manipulation and organization of Web data, can play a key role in the development of numerous applications, such as text summarization, information extraction, topic-focused crawling, measuring the Web’s evolution, spam detection, and so forth. Σημασιολογικά δίκτυα Παγκόσμιος ιστός Κατηγοριοποίηση Ταξινόμηση Αποσαφήνιση 025.04 Semantic networks World Wide Web Classification Ranking Disambiguation
66	Αποδοτικοί αλγόριθμοι εξατομίκευσης βασισμένοι σε εξόρυξη γνώσης απο δεδομένα χρήσης Web / Effective personalization algorithms based on Web usage mining Ρήγκου, Μαρία 25 June 2007 (has links) Το Web αποτελεί πλέον µια τεράστια αποθήκη πληροφοριών και συνεχίζει να µεγαλώνει εκθετικά, ενώ η ανθρώπινη ικανότητα να εντοπίζει, να επεξεργάζεται και να αντιλαµβάνεται τις πληροφορίες παραµένει πεπερασµένη. Το πρόβληµα στις µέρες µας δεν είναι η πρόσβαση στην πληροφορία, αλλά το ότι όλο και περισσότεροι άνθρωποι µε διαφορετικές ανάγκες και προτιµήσεις πλοηγούνται µέσα σε περίπλοκες δοµές Web χάνοντας στην πορεία το στόχο της αναζήτησής τους. Η εξατοµίκευση, µια πολυσυλλεκτική ερευνητική περιοχή, αποτελεί µια από τις πιο πολλά υποσχόµενες προσεγγίσεις για τη λύση του προβλήµατος του πληροφοριακού υπερφόρτου, παρέχοντας κατάλληλα προσαρµοσµένες εµπειρίες πλοήγησης. Η διατριβή εξετάζει αλγοριθµικά θέµατα που σχετίζονται µε την υλοποίηση αποδοτικών σχηµάτων εξατοµίκευσης σε περιβάλλον web, βασισµένων σε εξόρυξη γνώσης από δεδοµένα χρήσης web. Οι τεχνικές ανακάλυψης προτύπων που µελετώνται περιλαµβάνουν το clustering, την εξόρυξη κανόνων συσχέτισης και την ανακάλυψη σειριακών προτύπων, ενώ οι προτεινόµενες λύσεις εξατοµίκευσης που βασίζονται στις δύο τελευταίες τεχνικές συνδυάζουν τα δεδοµένα χρήσης µε δεδοµένα περιεχοµένου και δοµής. Ειδικότερα, στο πρώτο κεφάλαιο της διατριβής, ορίζεται το επιστηµονικό πεδίο των σύγχρονων τεχνολογιών εξατοµίκευσης στο περιβάλλον του web, εστιάζοντας στη στενή σχέση τους µε το χώρο του web mining, στοιχειοθετώντας µε αυτό τον τρόπο το γενικότερο πλαίσιο αναφοράς. Στη συνέχεια, περιγράφονται τα διαδοχικά στάδια της τυπικής διαδικασίας εξατοµίκευσης µε έµφαση στη φάση ανακάλυψης προτύπων και τις τεχνικές machine learning που χρησιµοποιούνται σε δεδοµένα χρήσης web και το κεφάλαιο ολοκληρώνεται µε µια συνοπτική περιγραφή της συµβολής της διατριβής στο πεδίο της εξατοµίκευσης σε περιβάλλον web. 
Στο δεύτερο κεφάλαιο προτείνεται ένας αλγόριθµος για εξατοµικευµένο clustering, που βασίζεται σε µια δοµή range tree που διατρέχεται σε πρώτη φάση για τον εντοπισµό των web αντικειµένων που ικανοποιούν τα ατοµικά κριτήρια του χρήστη. Στα αντικείµενα αυτά, εφαρµόζεται στη συνέχεια clustering, ώστε να είναι δυνατή η αποδοτικότερη διαχείρισή τους και να διευκολυνθεί η διαδικασία λήψης αποφάσεων από πλευράς χρήστη. O αλγόριθµος που προτείνεται αποτελεί βελτίωση του αλγόριθµου kmeans range, καθώς εκµεταλλεύεται το range tree που έχει ήδη κατασκευαστεί κατά το βήµα της εξατοµίκευσης και το χρησιµοποιεί ως τη βασική δοµή πάνω στην οποία στηρίζεται το βήµα του clustering χρησιµοποιώντας εναλλακτικά του k-means, τον αλγόριθµο k-windows. Ο συνολικός αριθµός παραµέτρων που χρησιµοποιούνται για την µοντελοποίηση των αντικειµένων υπαγορεύει και τον αριθµό των διαστάσεων του χώρου εργασίας. Η συνολική πολυπλοκότητα χρόνου του αλγορίθµου είναι ίση µε O(logd-2n+v), όπου n είναι ο συνολικός αριθµός των στοιχείων που δίνονται σαν είσοδος και v είναι το µέγεθος της απάντησης. Στο τρίτο κεφάλαιο της διατριβής προτείνεται ένα αποδοτικό σχήµα πρόβλεψης µελλοντικών δικτυακών αιτήσεων βασισµένο στην εξόρυξη σειριακών προτύπων πλοήγησης (navigation patterns) από αρχεία server log, σε συνδυασµό µε την τοπολογία των συνδέσµων του website και τη θεµατική κατηγοριοποίηση των σελίδων του. Τα µονοπάτια που ακολουθούν οι χρήστες κατά την πλοήγηση καταγράφονται, συµπληρώνονται µε τα κοµµάτια που λείπουν λόγω caching και διασπώνται σε συνόδους και σε επεισόδια, ώστε να προκύψουν σηµασιολογικά πλήρη υποσύνολά τους. Τα πρότυπα που εντοπίζονται στα επεισόδια µοντελοποιούνται µε τη µορφή n-grams και οι αποφάσεις πρόβλεψης βασίζονται στη λογική ενός µοντέλου n-gram+ που προσοµοιάζει το all Kth-τάξης µοντέλο Markov και πιο συγκεκριµένα, το επιλεκτικό µοντέλο Markov. 
Η υβριδική προσέγγιση που υιοθετεί το προτεινόµενο σχήµα, επιτυγχάνει 100% coverage, ενώ κατά τις πειραµατικές µετρήσεις το άνω όριο της ακρίβειας έφθασε το 71,67% στο σύνολο των προβλέψεων που επιχειρήθηκαν. Το χαρακτηριστικό του πλήρους coverage καθιστά το σχήµα κατάλληλο για συστήµατα παραγωγής συστάσεων, ενώ η ακρίβεια µπορεί να βελτιωθεί περαιτέρω αν µεγαλώσει το παράθυρο πρόβλεψης. Στο τέταρτο κεφάλαιο της διατριβής, εξετάζεται η ενσωµάτωση λειτουργιών εξατοµίκευσης στις ηλεκτρονικές µαθησιακές κοινότητες και προτείνεται ένα σύνολο από δυνατότητες εξατοµίκευσης που διαφοροποιούνται ως προς τα δεδοµένα στα οποία βασίζονται, την τεχνική εξόρυξης προτύπων που χρησιµοποιούν και την αντίστοιχη πολυπλοκότητα υλοποίησης. Οι υπηρεσίες αυτές περιλαµβάνουν: (α) εξατοµίκευση µε βάση το ρόλο του χρήστη, (β) εξατοµίκευση µε βάση το βαθµό δραστηριοποίησης του χρήστη, (γ) εξατοµίκευση µε βάση την ανακάλυψη προτύπων στα ατοµικά ιστορικά µελέτης των εκπαιδευόµενων και (δ) εξατοµίκευση µε βάση συσχετίσεις του περιεχοµένου των µαθηµάτων. / The Web has become a huge repository of information and keeps growing exponentially under no editorial control, while the human capability to find, read and understand content remains constant. Providing people with access to information is not the problem; the problem is that people with varying needs and preferences navigate through large Web structures, missing the goal of their inquiry. Web personalization is one of the most promising approaches for alleviating this information overload, providing tailored Web experiences. The present dissertation investigates algorithmic issues concerning the implementation of effective personalization scenarios in the web environment, based on web usage mining. 
The pattern discovery techniques deployed comprise clustering, association rule mining and sequential pattern discovery, while the proposed personalization schemas based on the latter two techniques integrate usage data with content and structure information. The first chapter introduces the scientific field of current web personalization technology, focusing on its close relation with the web mining domain, thereby providing the general framework of the dissertation. Next, the typical web personalization process is described, with emphasis on the pattern discovery phase, along with an overview of the machine learning techniques applied to web usage data. The chapter concludes with a synoptic description of the contribution of the dissertation to the web personalization research and applications domain. The second chapter introduces an algorithm for personalized clustering based on a range tree structure, used for identifying all web objects satisfying a set of predefined personal user preferences. The returned objects go through a clustering phase before reaching the end user, thus allowing more effective manipulation and supporting the decision-making process. The proposed algorithm improves on the k-means range algorithm, as it uses the range tree already constructed during the personalized filtering phase as the basic structure on which the clustering step is based, applying the k-windows algorithm instead of k-means. The total number of parameters used for modeling the web objects dictates the number of dimensions of the Euclidean space representation. The time complexity of the algorithm is O(log^{d-2} n + v), where d is the number of dimensions, n is the total number of web objects and v is the size of the answer. The third chapter proposes an effective prediction schema for web requests based on extracting sequential navigational patterns from server log files, combined with the website link structure and the thematic categorization of its content pages.
The schema records the paths followed by users when browsing through the website pages, completes them with the missing parts (due to caching) and identifies sessions and episodes, so as to derive meaningful path subsets. The patterns extracted from the episodes are modeled in the form of n-grams and the prediction decisions are based on an n-gram+ model that resembles an all-Kth-order Markov model and, more specifically, a selective Markov model. The hybrid approach adopted achieves full-coverage prediction, and precision reached an upper limit of 71.67% in the experimental setting. The full-coverage feature makes the proposed schema quite suitable for recommendation engines, while precision can be improved further by using a larger prediction window. The fourth chapter examines the integration of personalized functionalities in the framework of electronic learning communities and studies the advantages derived from generating dynamic adaptations of the layout, the content and the learning scenarios delivered to each community student based on personal data, needs and preferences. More specifically, the chapter proposes a set of personalization functions differentiated by the data they use, the pattern discovery technique they apply and the resulting implementation complexity. These services comprise: (a) personalization based on the user role in the community, (b) personalization based on the level of user activity, (c) personalization based on discovery of association rules in the personal progress files of students, and (d) personalization based on predefined content correlations among learning topics. Εξατομίκευση Αλγόριθμος Εξόρυξη γνώσης Παγκόσμιος ιστός 025.04 Personalization Algorithm Web Data mining
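The n-gram based prediction described in the third chapter of the abstract above can be illustrated with a generic longest-match model that backs off to shorter contexts, in the spirit of an all-Kth-order Markov model. This is a sketch under assumptions, not the thesis's n-gram+ model: the maximum order of 3, the back-off to an order-0 context (which is what gives every request a prediction), and the toy episodes are all hypothetical.

```python
from collections import Counter, defaultdict


def train(episodes, max_order=3):
    """Count, for every context of length 0..max_order, which page follows it."""
    model = defaultdict(Counter)
    for episode in episodes:
        for i, page in enumerate(episode):
            for k in range(max_order + 1):
                if i - k < 0:
                    break
                model[tuple(episode[i - k:i])][page] += 1
    return model


def predict(model, history, max_order=3):
    """Predict the next page from the longest context seen in training.

    Falls back to shorter contexts, down to the empty context (overall most
    frequent page), so a prediction is returned whenever the model is non-empty.
    """
    for k in range(min(max_order, len(history)), -1, -1):
        context = tuple(history[len(history) - k:])
        if context in model:
            return model[context].most_common(1)[0][0]
    return None


# Hypothetical episodes extracted from a server log.
episodes = [
    ["home", "products", "cart"],
    ["home", "products", "cart"],
    ["home", "about"],
    ["blog", "products", "specs"],
]
model = train(episodes)
next_after_home_products = predict(model, ["home", "products"])
next_after_blog_products = predict(model, ["blog", "products"])
next_after_unseen = predict(model, ["unknown-page"])
```

Using the longest matching context first tends to favour precision, while the fall-back keeps coverage complete; how the thesis balances the two is described only at the level of the abstract.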
67	Σημασιολογική προσωποποίηση στον παγκόσμιο ιστό / Semantic personalization in the world wide web Βόπη, Αγορίτσα 07 February 2008 (has links) Η αναζήτηση πληροφορίας στο Παγκόσμιο Ιστό λόγω της ραγδαίας αύξησης του όγκου του αποτελεί ένα δύσκολο και χρονοβόρο εγχείρημα. Επιπρόσθετα, η συνωνυμία και η πολυσημία συμβάλλουν στη δυσκολία εύρεσης πληροφορίας. Στα πλαίσια αυτής της διπλωματικής εργασίας αναπτύχθηκε μια μεθοδολογία για την προσωποποίηση των αποτελεσμάτων μιας μηχανής αναζήτησης ώστε αυτά να ανταποκρίνονται στα ενδιαφέροντα των χρηστών. Η μεθοδολογία αποτελείται από δύο τμήματα, το εκτός σύνδεσης τμήμα και το συνδεδεμένο τμήμα. Στο εκτός σύνδεσης τμήμα χρησιμοποιώντας τα αρχεία πρόσβασης της μηχανής αναζήτησης και εξάγεται πληροφορία για τις επιλογές του χρήστη. Στη συνέχεια πραγματοποιείται η σημασιολογική κατηγοριοποίηση των προηγούμενων επιλογών των χρηστών με χρήση μιας οντολογίας, που αναπτύχθηκε με βάση τους καταλόγους του ODP. Κατόπιν, αναπτύσσεται το προφίλ του χρήστη με βάση την οντολογία αναφοράς που χρησιμοποιήθηκε και στη φάση της σημασιολογικής αντιστοίχισης. Στη συνέχεια, με χρήση αλγορίθμου ομαδοποίησης γίνεται ομαδοποίηση των χρηστών με βάση τα ενδιαφέροντά τους. Στο συνδεδεμένο τμήμα ο αλγόριθμος προσωποποίησης χρησιμοποιεί τις ομάδες που δημιουργήθηκαν στο μη συνδεδεμένο τμήμα και τη σημασιολογική αντιστοίχηση των αποτελεσμάτων της μηχανής αναζήτησης και αναδιοργανώνει τα αποτελέσματά της προωθώντας στις πρώτες θέσεις επιλογής τα αποτελέσματα που είναι περισσότερο σχετικά με τις προτιμήσεις της ομάδας στην οποία ανήκει ο χρήστης. Η μεθοδολογία που προτείνεται έχει εφαρμοστεί σε πειραματική υλοποίηση δίνοντας τα επιθυμητά αποτελέσματα για την προσωποποίηση σύμφωνα με τις σημασιολογικές ομάδες χρηστών. / During the recent years the World Wide Web has been developed rapidly making the efficient searching of information difficult and time-consuming. 
In this work, we propose a web search results personalization methodology that couples data mining techniques with the underlying semantics of the web content. To this end, we exploit reference ontologies that emerge from web catalogs (such as ODP), which can scale to the growth of the web. Our methodology uses ontologies to provide semantic profiling of users’ interests based on the implicit logging of their behavior and the on-the-fly semantic analysis and annotation of the web result summaries. Following this, the logged web clickthrough data are submitted to offline processing in order to form semantic clusters of interesting categories according to the users’ perspective. Finally, profiles of semantic clusters are combined with the emerging profile of the active user in order to apply a sophisticated re-ranking of search engine results. Experimental evaluation of our approach shows that the objectives expected from semantic clustering of users in search engines are achievable. Προσωποποίηση Σημασιολογικός ιστός Οντολογίες Ομαδοποίηση 025.04 Personalization Semantic web Ontologies Clustering
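The online re-ranking step described above can be sketched as a stable sort of the engine's results by how strongly the user's cluster is interested in each result's category. Everything concrete here is an assumption for illustration: the category labels, the interest weights, the URLs, and the choice to break ties by the engine's original order are not taken from the thesis.

```python
def rerank(results, group_profile):
    """Promote results whose category matters most to the user's cluster.

    `results` is a list of (url, category) in the engine's original order;
    `group_profile` maps a category to an interest weight (assumed in [0, 1]).
    Ties keep the original engine order.
    """
    indexed = list(enumerate(results))
    indexed.sort(key=lambda item: (-group_profile.get(item[1][1], 0.0), item[0]))
    return [result for _, result in indexed]


# Hypothetical engine output, annotated with ontology categories.
results = [
    ("a.example", "Sports"),
    ("b.example", "Science"),
    ("c.example", "Arts"),
    ("d.example", "Science"),
]
group_profile = {"Science": 0.7, "Arts": 0.2}
reranked = [url for url, _ in rerank(results, group_profile)]
```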
68	Σχεδίαση και υλοποίηση συστήματος αξιολόγησης της δομής και του περιεχομένου ιστότοπων για κινητές συσκευές / Design and implementation of a system for evaluating the structure and content of websites for mobile devices Στεφανής, Βασίλειος 12 February 2008 (has links) In recent years, access to the Web has extended beyond desktop PCs to mobile phones, PDAs, and mobile devices of every kind. In developing countries, the number of users who browse the Web from mobile devices is in fact larger than the number who browse it from desktop PCs. Creating web content has also become easier, thanks to the many tools that promise fast and easy content production without requiring special knowledge from their users. The question is which characteristics web sites and their content should have in order to offer the best browsing experience to users of mobile devices. The World Wide Web Consortium (W3C) has compiled the practices that should be followed when delivering Web content to mobile devices (Mobile Web Best Practices). Conforming to these practices is necessary mainly because of the limitations of mobile devices: the small screen size, the text input method, the available memory, the low computational power, the data transfer speed, and the limited battery life. Based on these practices, W3C has published a set of tests that can be applied to the structure and content of a web page. Web pages that pass the tests can offer an acceptable browsing experience to users of mobile devices. Some of the practices define tests that are machine-verifiable, while others define tests that also require human judgment. This thesis first presents and analyzes the W3C Mobile Web Best Practices. A system for evaluating the structure and content of mobile web sites was then designed and implemented. The system analyzes a web site, crawls its web pages, and checks every page against the W3C tests. Its final goal is to produce a report and a rating for the web site as a whole. Particular emphasis was placed on retrieving and evaluating pages and content of the web site that belong to the "hidden web". Finally, the system's users can assign importance weights to the tests that are performed. 025.04 Evaluation Mobile OK Hidden web Mobile web
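As a rough illustration of the last point in record 68, the sketch below shows one way per-page test results and user-assigned importance weights could be combined into a single site rating. The test names, the weights, and the weighted-average formula are all assumptions made for illustration; the abstract does not say how the thesis computes its rating.

```python
def site_rating(page_results, weights):
    """Return a 0-100 rating: the weighted share of passed tests over all pages.

    page_results: list of dicts mapping test name -> True (pass) / False (fail).
    weights: dict mapping test name -> non-negative importance weight.
    Both structures and the formula are hypothetical, not the thesis's method.
    """
    earned = 0.0
    possible = 0.0
    for results in page_results:
        for test, passed in results.items():
            w = weights.get(test, 1.0)  # tests without a weight default to 1.0
            possible += w
            if passed:
                earned += w
    if possible == 0:
        return 0.0  # no tests were run, so there is nothing to rate
    return 100.0 * earned / possible


# Hypothetical test names, loosely styled after mobile best-practice checks.
pages = [
    {"page_size_limit": True, "no_frames": True, "images_have_alt": False},
    {"page_size_limit": False, "no_frames": True, "images_have_alt": True},
]
weights = {"page_size_limit": 3.0, "no_frames": 1.0, "images_have_alt": 2.0}

print(round(site_rating(pages, weights), 1))
```

Raising the weight of a test makes failures of that test cost more of the final rating, which is the effect the user-adjustable weights are meant to have.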
69	Διαχείριση ψηφιακών αντικειμένων - σχεδιασμός, ανάπτυξη και υλοποίηση συστήματος / Digital asset management - design, development and implementation of a system Σαλούρος, Δημήτριος 18 April 2008 (has links) The creation, presentation and exchange of information, as well as the collection, organization and storage of information carriers, are tasks that people have carried out for as long as they have existed. What makes the problem larger and harder in today's information society is the amount of information in digital form (digital content) that has to be handled, the speed at which it is produced, and the ways in which it is presented, exchanged, organized and stored. The advent of the World Wide Web has dramatically affected all these activities, giving us new tools and ways of managing and delivering digital material. The creation of digital content from an ever-increasing number of analog and digital sources, and the need to represent it in an endless list of different types and formats, have greatly changed the ways in which it is managed. Nowadays, organizations with large volumes of digital material exhibit and distribute it over the Web, enriching and extending their electronic services and applications through computer networks. This requires integrating specialized information systems into the organization's business logic, as well as skilled staff to use and support them. Adopting such systems is a critical factor for an organization's future growth and for the return on its investments (ROI). However, the Web is a particularly hostile environment in terms of security, which directly affects the commercial (or non-commercial) exploitation of the digital content it carries. In any case, organizations have to deal with the open security challenges of the Internet, because these can cause large data losses, lead to financial ruin, and damage the organization's reputation and the public's trust in it. The aim of this work is a thorough presentation of Digital Asset Management Systems (DAMS). In the first of its two parts, we present the architecture of such systems, the infrastructures on which they are built, the services and applications they provide, and the ways in which they are integrated with other information systems and with the Internet. In the second part, we describe the detailed design and implementation of a secure and robust architectural model for Internet-based DAMS. Use of the World Wide Web is a critical requirement for a range of administrative tasks, such as collecting, managing and distributing valuable assets. We set out the fundamental operational and security requirements on which the design is based, and we describe the cryptographic primitives and techniques used to provide data safety and secure user interaction in demanding on-line collaboration environments. Finally, we provide a prototype implementation of an Internet-based DAMS built on our architectural model and discuss the technical issues that arise. The model is designed as a standalone solution, but it can be flexibly adapted to broader management infrastructures as well as to existing DAMS platforms. 025.04 Digital assets Digital asset management
70	Ανάκτηση κειμένου και εξαγωγή κανόνων από κείμενα με βιολογικό περιεχόμενο / Text retrieval and rule extraction from documents with biological concept Γαϊτάνου, Ευφροσύνη 01 October 2008 (has links) The rapid growth of the World Wide Web has given users around the globe immediate, quick and effective access to every kind of information. Millions of records are added to the Internet every day, so the volume of available information grows exponentially. At the press of a button, a plethora of information, even about the most specialized topic, is laid out on the user's screen, ready to be read and processed. This very oversupply makes it difficult or even impossible for a user to process the available data, or even just to read it. A tool for retrieving documents and extracting terms and rules from a very large document collection would let users retrieve valuable information quickly, without having to read and manually process all of the documents. In the biosciences in particular, the inability to process the available information and to extract useful connections and conclusions is an obstacle to scientific research, so there is a pressing need for tools that facilitate knowledge mining from texts with biological content. In this master's thesis we present techniques for extracting knowledge and rules from electronic documents on the Internet that concern the scientific field of Biology. Our effort focuses mainly on extracting knowledge from documents on a specific biological topic (e.g. transcription factors), a task that would otherwise be difficult or even impossible, because the number of documents is prohibitive for detailed study by an expert or a group of experts, let alone by an ordinary user. First, we describe how documents on the specific topic of interest are retrieved from the National Library of Medicine and how the document collection is constructed. This collection undergoes lexical analysis and processing, during which the most important terms of each document are kept and the rest are discarded. In this way, a set of the most representative terms per document is created, along with the frequency with which each term appears in each document. Next, we apply clustering techniques to this terms-by-documents set in order to produce clusters of terms as well as clusters of documents. In this step we experimented with several well-known clustering techniques (the k-means algorithm and the single-linkage hierarchical algorithm), and we also wrote our own implementation of the ISODATA algorithm. All clustering algorithms tested here were implemented in Matlab 6p5. Our research concludes with the application of the Latent Semantic Indexing (LSI) technique to the terms-by-documents set before the clustering step; we compare the resulting clusters with those obtained without LSI. In the resulting clusters we find many connections between terms and documents and, beyond that, the possibility of drawing conclusions about the subject of the documents in each cluster and of mining genuinely new knowledge about specific fields of Biology. 025.04 LSI Clustering ISODATA Cancer Biological data Biological terms
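To make the clustering step of record 70 concrete, here is a minimal sketch of plain k-means over a small document-by-term frequency matrix. It is written in Python rather than Matlab, it is not the thesis's code, and it covers neither ISODATA nor LSI; the matrix values, the term labels in the comment, and the choice of seeding the centroids with the first k documents are all assumptions for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(vectors, k, max_iter=100):
    """Plain k-means; the first k vectors seed the centroids (deterministic)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [-1] * len(vectors)
    for _ in range(max_iter):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    return labels


# Made-up frequencies; rows are documents, columns are the hypothetical terms
# "gene", "promoter", "binding", "tumor", "therapy".
docs = [
    [4, 3, 2, 0, 0],
    [0, 0, 1, 5, 3],
    [3, 4, 3, 0, 1],
    [0, 1, 0, 4, 4],
]

labels = kmeans(docs, k=2)
print(labels)
```

Clustering the transposed matrix (terms as rows) with the same function would give clusters of terms instead of clusters of documents.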
