  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
461

Extraction automatique de connaissances pour la décision multicritère / Automatic knowledge extraction for multicriteria decision-making

Plantié, Michel 29 September 2006 (has links) (PDF)
This thesis, without taking sides, addresses the delicate subject of cognitive automation. It proposes a complete software chain to support each stage of the decision process. In particular, it deals with automating the learning phase by making actionable knowledge (knowledge useful for action) a computational entity that algorithms can manipulate.
The model underlying our interactive group decision support system (SIADG, système interactif d'aide à la décision de groupe) relies heavily on automatic knowledge processing. Data mining, multicriteria analysis and optimization are complementary techniques that combine to build a decision artifact akin to a cybernetic interpretation of the economist Simon's decision model. The epistemic uncertainty inherent in a decision is measured by the decision risk, which analyzes the discriminating factors between alternatives. Several attitudes toward controlling decision risk can be considered: the SIADG can be used to validate, verify or refute a point of view. In all cases, the control exerted over epistemic uncertainty is not neutral with respect to the dynamics of the decision process. Instrumenting the learning phase of the decision process thus leads to designing the actuator of a feedback loop intended to regulate the decision dynamics. Our model sheds formal light on the links between epistemic uncertainty, decision risk and decision stability.
We analyze the fundamental concepts of actionable knowledge (CA, connaissance actionnable) and automatic indexing on which our natural language processing (NLP) models and tools rest. In this cybernetic view of decision-making, actionable knowledge takes on a new interpretation: it is the knowledge manipulated by the SIADG's actuator to control the decision dynamics. A brief survey of the most proven learning techniques for automatic knowledge extraction in NLP is given. All of these notions and techniques are then applied to the specific problem of automatically extracting CAs within a multicriteria evaluation process. Finally, an application example, a video rental store manager seeking to optimize his investments according to his customers' preferences, revisits and illustrates the computerized process as a whole.
462

Concept Mining: A Conceptual Understanding based Approach

Shehata, Shady January 2009 (has links)
With the rapid daily growth of information, there is a considerable need to extract and discover valuable knowledge from data sources such as the World Wide Web. Most common text mining techniques are based on the statistical analysis of a term, either a word or a phrase. These techniques treat documents as bags of words and pay no attention to the meaning of the document content. In addition, statistical analysis of term frequency captures the importance of a term within a document only. However, two terms can have the same frequency in their documents while one contributes more to the meaning of its sentences than the other. Therefore, there is a strong need for a model that captures the meaning of linguistic utterances in a formal structure. The underlying model should indicate the terms that capture the semantics of the text. In this case, the model can capture the terms that present the concepts of the sentence, which leads to discovering the topic of the document. A new concept-based model is introduced that analyzes terms at the sentence, document and corpus levels rather than at the document level only. The concept-based model can effectively discriminate between terms that are unimportant with respect to sentence semantics and terms that hold the concepts representing the sentence meaning. The proposed model consists of a concept-based statistical analyzer, a conceptual ontological graph representation, a concept extractor and a concept-based similarity measure. A term that contributes to the sentence semantics is assigned two different weights, one by the concept-based statistical analyzer and one by the conceptual ontological graph representation. These two weights are combined into a new weight. The concept extractor selects the concepts with the maximum combined weights. The similarity between documents is calculated using a new concept-based similarity measure.
The proposed similarity measure takes full advantage of the concept analysis measures at the sentence, document and corpus levels in calculating the similarity between documents. Large sets of experiments using the proposed concept-based model on different datasets in text clustering, categorization and retrieval are conducted. The experiments provide an extensive comparison between traditional weighting and the concept-based weighting obtained by the concept-based model. Experimental results in text clustering, categorization and retrieval demonstrate a substantial enhancement of quality using: (1) concept-based term frequency (tf), (2) conceptual term frequency (ctf), (3) the concept-based statistical analyzer, (4) the conceptual ontological graph, and (5) the concept-based combined model. In text clustering, the evaluation relies on two quality measures, the F-measure and the entropy. In text categorization, the evaluation relies on three quality measures: the micro-averaged F1, the macro-averaged F1 and the error rate. In text retrieval, the evaluation relies on three quality measures: the precision at 10 documents retrieved, P(10), the preference measure (bpref), and the mean uninterpolated average precision (MAP). All of these quality measures improve when the newly developed concept-based model is used to enhance the quality of text clustering, categorization and retrieval.
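Two of the retrieval measures named in this abstract, precision at k and uninterpolated average precision, can be illustrated with a short sketch. The function names and the toy ranking are assumptions made for illustration; they are not taken from the thesis, and MAP would be the mean of `average_precision` over a set of queries.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked[:k]
    return sum(1 for doc in top_k if doc in relevant) / k


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Uninterpolated average precision for a single ranked list.

    Precision is sampled at every rank where a relevant document
    appears, and the sum is divided by the number of relevant documents.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Toy example (hypothetical document ids).
ranked_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = {"d1", "d3"}
p_at_2 = precision_at_k(ranked_docs, relevant_docs, k=2)
ap = average_precision(ranked_docs, relevant_docs)
```

In this toy ranking the relevant documents sit at ranks 1 and 3, so average precision averages the precision values observed at those two ranks.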
463

Supplementing consumer insights at Electrolux by mining social media: An exploratory case study

Chaudhary, Amit January 2011 (has links)
Purpose – The aim of this thesis is to explore the possibility of text mining social media for consumer insights from an organizational perspective. Design/methodology/approach – An exploratory, single-case embedded case study with an inductive approach and a partially mixed, concurrent, dominant-status mixed-method research design. The case study contains three different studies that triangulate the research findings and support the research objective of using social media for consumer insights for new products, new ideas and the research and development process of any organization. Findings – Text mining is a useful, novel, flexible and unobtrusive method for harnessing the hidden information in social media. By text mining social media, an organization can find consumer insights in a large data set; this initiative requires an understanding of social media and its building blocks. In addition, a consumer-focused product development approach not only drives social media mining but is also enriched by consumer insights from social media. Research limitations/implications – Text mining is a relatively new subject, and a focus on developing better analytical toolkits would promote the use of this novel method. Researchers in the field of consumer-driven new product development can use social media as additional evidence in their research. Practical implications – The consumer insights gained from text mining social media within a workable ethical policy have positive implications for any organization. Unlike conventional marketing research methods, text mining of social media is cost- and time-effective. Originality/value – This thesis attempts to make innovative use of text mining tools from the field of computer science to mine social media for a better understanding of consumers, thereby enriching the field of marketing research in a cross-industry effort.
The ability of consumers to spread electronic word of mouth (eWOM) using social media is no secret, and organizations should now consider social media as a source to supplement, if not replace, the insights captured using conventional marketing research methods. Keywords – Social media, Web 2.0, Consumer generated content, Text mining, Mixed methods design, Consumer insights, Marketing research, Case study, Analytic coding, Hermeneutics, Asynchronous, Emergent strategy. Paper type – Master thesis
465

Fuzzy Cluster-Based Query Expansion

Tai, Chia-Hung 29 July 2004 (has links)
Advances in information and network technologies have fostered the creation and availability of a vast amount of online information, typically in the form of text documents. Information retrieval (IR) pertains to determining the relevance between a user query and documents in the target collection, then returning those documents that are likely to satisfy the user's information needs. One challenging issue in IR is word mismatch, which occurs when concepts can be described by different words in the user queries and/or documents. Query expansion is a promising approach for dealing with word mismatch in IR. In this thesis, we develop a fuzzy cluster-based query expansion technique to solve the word mismatch problem. Using existing expansion techniques (i.e., global analysis and non-fuzzy cluster-based query expansion) as performance benchmarks, our empirical results suggest that the fuzzy cluster-based query expansion technique can provide a more accurate query result than the benchmark techniques can.
466

Discovering Discussion Activity Flows in an On-line Forum Using Data Mining Techniques

Hsieh, Lu-shih 22 July 2008 (has links)
In the Internet era, more and more courses are taught through a course management system (CMS) or learning management system (LMS). In an asynchronous virtual learning environment, an instructor needs to be aware of the progress of discussions in forums and may intervene if necessary to facilitate students' learning. This research proposes a discussion forum activity flow tracking system, called FAFT (Forum Activity Flow Tracer), to automatically monitor the discussion activity flow of threaded forum postings in a CMS/LMS. As CMS/LMS platforms become popular for facilitating learning activities, the proposed FAFT can help instructors identify students' interaction types in discussion forums. FAFT adopts modern data/text mining techniques to discover the patterns of forum discussion activity flows, which instructors can use to facilitate online learning activities. FAFT consists of two subsystems: activity classification (AC) and activity flow discovery (AFD). A posting can be perceived as one of six types: announcement, questioning, clarification, interpretation, conflict, or assertion. AC adopts a cascade model to classify the activity types of posts in a discussion thread. An empirical evaluation on a repository of postings from senior high school online earth science courses shows that AC can effectively facilitate the coding process and that the cascade model can deal with the imbalanced distribution of discussion postings. AFD adopts a hidden Markov model (HMM) to discover the activity flows. A discussion activity flow can be presented as an HMM diagram that an instructor can use to predict which type of discussion activity flow a thread may follow. The empirical results of the HMM on an online senior high school earth science forum show that FAFT can effectively predict the type of a discussion activity flow.
Thus, the proposed FAFT can be embedded in a course management system to automatically predict the activity flow type of a discussion thread and, in turn, reduce teachers' workload in managing online discussion forums.
467

應用資料採礦技術於購物中心顧客群消費行為之研究 / The Application of Data Mining on Shopping Behavior of Shopping Mall Customers

范瀞云, Fan,Jing Yun Unknown Date (has links)
Rising national income has increased customers' purchasing power. Material needs are no longer what they used to be: consumers' demand for leisure and entertainment must now be satisfied as well, so large shopping malls that combine shopping, dining and entertainment are gradually expanding. This study applies data mining techniques to analyze questionnaire data provided by T Shopping Mall. The questionnaire has four parts: basic consumer information, consumer behavior and preferences, satisfaction, and suggestions. We use statistical methods to analyze differences in consumer behavior between members and non-members, and then derive market segmentation and marketing decisions intended to increase the number of customers, raise willingness to spend and loyalty, attract non-members to shop and apply for membership, and strengthen customers' reliance on T Shopping Mall. The results are expected to serve as a reference for T Shopping Mall's future marketing plans.
468

脈絡下的保護責任:文本探勘的再詮釋 / Contextualizing Responsibility to Protect: Re-Interpretation of Text Mining

張道宜, Chang, Tao Yi Unknown Date (has links)
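The Responsibility to Protect (R2P) is one of the most prominent, and at the same time most contested, concepts in the international community today. Some hold that it helps realize international human rights and enables the international community to reach out to people in need; some scholars see it as a way to resolve the dispute between sovereignty and human rights; others regard it as merely "humanitarian intervention" in a new guise, no more than a means for Western powers to interfere in other states. Over time, when R2P won the unanimous consensus of UN member states at the 2005 World Summit, some argued that its original ambition of remedying the "humanitarian intervention dilemma" and reforming the framework of international law had been reduced to a reaffirmation of the existing framework, which amounted to a marked conceptual change. Yet when the Arab Spring swept the Middle East and North Africa and human rights violations broke out under many governments, R2P again drew attention; in 2011 the UN Security Council even invoked it as a key justification for intervening in Libya, giving it content quite different from that of the 2005 World Summit. Although seen as a major victory since the concept took shape, this also set off a new round of debate over R2P. This thesis argues that one of the main reasons for the controversy is that R2P's definition and its relationship with sovereignty remain unclear. For supporters, it is a new human rights enforcement mechanism, distinct from humanitarian intervention and grounded in the current notion of "responsible sovereignty", that meets the principles of timeliness and effectiveness. For opponents, it is humanitarian intervention rebranded through the word "responsibility": however appealing the rhetoric, it cannot hide the intent to interfere with other states' integrity and undermine the system of state sovereignty for the sake of national self-interest. To address this controversy, the study uses corpus linguistics methods to answer the following research questions. First, what does "R2P" actually mean to the state representatives who take part in UN decision-making, and has its emergence within the UN gradually changed the meaning of "sovereignty", as some scholars claim? Second, if the concepts of sovereignty and R2P are indeed linked, what is that relationship? The results show, first, that within the UN Security Council the 2005 World Summit consensus did replace the earlier concept and changed the content of R2P, although the earlier notion of "prevention" survived. Second, the emergence of R2P did add more "responsibility" to "sovereignty": the shift is modest within the Security Council, but when R2P is invoked deliberately, the importance of "responsibility" is especially stressed. Third, although many scholars claim that a consensus on R2P has formed and that the focus should be on implementation rather than debate, the only party that truly prioritizes implementation may be the UN Secretary-General himself. / Although generally recognized in the World Summit Outcome Document, the Responsibility to Protect (R2P) is one of the most controversial concepts in International Relations (IR), and its relationship with sovereignty is among the most debated aspects. To address these questions, this thesis examines the verbatim meeting records of the United Nations (UN) with the help of discourse analysis and digital toolkits. While scholars of IR and political thought have analyzed R2P's theoretical, definitional, legal and implementation dimensions, little attention has been paid to its discursive change and the mutual influence between the two concepts. Current text mining techniques enable researchers to carry out full, large-scale research on “big texts” and to extract the linguistic context behind them.
In general, this thesis aims to advance IR studies in two ways: first, by establishing a contextual understanding of the conceptual change of R2P and sovereignty and finding whether hidden information exists behind those texts; second, if text mining and related toolkits do help fulfil this aim, by showing that they may be a new research technique applicable in IR. The thesis investigates the present understandings of sovereignty and R2P in IR. It hypothesizes, first, that most existing research on R2P has neglected the role of language and, second, that the emergence of R2P may be related to the conceptual change of sovereignty in the twenty-first century.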
469

Αποτελεσματικές τεχνικές διαχείρισης δεδομένων στον Παγκόσμιο Ιστό / Efficient techniques for Web data management

Ιωάννου, Ζαφειρία-Μαρίνα 24 November 2014 (has links)
The evolution of computer technology, along with advances in database technology, has contributed to the development of new efficient and automated techniques for the effective collection, storage and management of data. As a result, the volume of stored and widely available online data is growing rapidly, and the need for effective analytical methods for extracting relevant information is becoming increasingly urgent. As an emerging field of interdisciplinary applications, data mining combines traditional data analysis methods with sophisticated algorithms and plays an important role in processing large volumes of data. Data visualization refers to the study of techniques for the visual representation of data, including graphics, animation, 3D depictions and other multimedia tools. The main goal of data visualization techniques is to present a set of data clearly and effectively, enabling conclusions to be drawn and correlations to be discovered that would otherwise remain unknown. While several data visualization techniques have been presented in the relevant literature, in recent years the scientific community has also focused on visualizing the results of data mining. In this master's thesis, we propose an efficient data mining technique that is based on well-known clustering methods, such as the hierarchical algorithm and the Spherical K-means algorithm, and is suitable for the analysis and extraction of useful knowledge from different types of datasets.
The proposed technique was applied to two different types of data: a) textual data from the PubMed database, b) numerical data from the FINDbase database. Furthermore, we present a study of visualization techniques and the development of modern visualization applications for the effective representation of both the original data of a collection (before processing) and the results obtained by the proposed clustering technique.
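For readers unfamiliar with Spherical K-means, one of the clustering methods this abstract names, the following is a minimal pure-Python sketch of the general algorithm: vectors are scaled to unit length, each point is assigned to the centroid with the highest cosine similarity, and each centroid is recomputed as the normalized mean of its members. The function names, the fixed initial centroids and the toy data are assumptions for illustration, not the thesis's implementation.

```python
import math
from typing import List, Tuple

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a: Vector, b: Vector) -> float:
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(points: List[Vector], init_centroids: List[Vector],
                     iterations: int = 10) -> Tuple[List[int], List[Vector]]:
    """Cluster unit-normalized points by cosine similarity."""
    pts = [normalize(p) for p in points]
    centroids = [normalize(c) for c in init_centroids]
    labels = [0] * len(pts)
    for _ in range(iterations):
        # Assignment step: pick the centroid with the highest cosine similarity.
        labels = [max(range(len(centroids)), key=lambda j: dot(p, centroids[j]))
                  for p in pts]
        # Update step: normalized mean of each cluster's members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(pts, labels) if lab == j]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[j] = normalize(mean)
    return labels, centroids


# Toy data: two directions in the plane (hypothetical values).
data = [[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [0.1, 3.0]]
labels, centroids = spherical_kmeans(data, [[1.0, 0.0], [0.0, 1.0]])
```

Because only direction matters after normalization, `[1.0, 0.0]` and `[2.0, 0.1]` fall into the same cluster despite their different lengths, which is why the method is commonly used on term-frequency vectors of text.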
470

Text mining : μια νέα προτεινόμενη μέθοδος με χρήση κανόνων συσχέτισης / Text mining: a new proposed method using association rules

Νασίκας, Ιωάννης 14 September 2007 (has links)
Text mining is a new research field that tries to solve the problem of information overload by using techniques from data mining, machine learning, natural language processing, information retrieval, information extraction and knowledge management. The first part of this thesis describes this new research field in detail and distinguishes it from related fields. The main goal of text mining is to help users extract information from large textual resources. Two of the most important tasks are document categorization and document clustering. Document clustering is receiving increasing attention due to the explosive growth of the WWW, digital libraries, technical documentation, medical data, etc. The most critical problems for document clustering are the high dimensionality of natural language text and the choice of features used to represent a domain. Thus, an increasing number of researchers have concentrated on investigating the relative effectiveness of various dimension reduction techniques and the relationship between the features selected to represent text and the quality of the final clustering. There are two important types of dimension reduction techniques: transformation methods and selection methods. The second part of this thesis presents a new selection method that tries to tackle these problems. The proposed methodology is based on association rule mining. We also present and analyze empirical tests that demonstrate the performance of the proposed methodology. The results show that the dimension was reduced; however, the more we tried to reduce it with our methodology, the more precision was lost. An attempt was made to improve the results through a different feature selection, and such efforts are ongoing.
The choice of similarity measure is also important in document clustering. This thesis reviews several such measures from the literature and compares them in a related application. The thesis has seven chapters. The first chapter gives a brief review of text mining. The second chapter describes the tasks, methods and tools used in text mining. The third chapter presents document representation, the various similarity measures and an application that compares them. The fourth chapter covers the different dimension reduction methods, and the fifth presents our own methodology for the problem. The sixth chapter applies our methodology to experimental data. The thesis ends with our conclusions and directions for future research.
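The abstract does not spell out how association rules drive feature selection. As general background only, the sketch below computes the two standard association rule quantities, support and confidence, for term pairs, treating each document as a set of terms. The function names, thresholds and toy documents are assumptions for illustration and are not taken from the thesis.

```python
from itertools import combinations
from typing import List, Set, Tuple

Rule = Tuple[str, str, float, float]  # (antecedent, consequent, support, confidence)


def support(itemset: Set[str], docs: List[Set[str]]) -> float:
    """Fraction of documents that contain every term in the itemset."""
    return sum(1 for d in docs if itemset <= d) / len(docs)


def term_pair_rules(docs: List[Set[str]], min_support: float,
                    min_confidence: float) -> List[Rule]:
    """Return rules x -> y over term pairs that pass both thresholds."""
    vocab = sorted(set().union(*docs))
    rules: List[Rule] = []
    for a, b in combinations(vocab, 2):
        pair_support = support({a, b}, docs)
        if pair_support < min_support:
            continue
        # Check the rule in both directions: a -> b and b -> a.
        for x, y in ((a, b), (b, a)):
            confidence = pair_support / support({x}, docs)
            if confidence >= min_confidence:
                rules.append((x, y, pair_support, confidence))
    return rules


# Toy corpus: each document reduced to its set of terms (hypothetical).
corpus = [
    {"text", "mining", "rules"},
    {"text", "mining"},
    {"text", "clustering"},
    {"mining", "rules"},
]
rules = term_pair_rules(corpus, min_support=0.5, min_confidence=0.6)
```

One plausible use, stated here as an assumption, is to keep only terms that appear in rules passing the thresholds, which would shrink the vocabulary used to represent each document.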
