21

Effective Techniques for Indonesian Text Retrieval

Asian, Jelita, jelitayang@gmail.com January 2007 (has links)
The Web is a vast repository of data, and information on almost any subject can be found with the aid of search engines. Although the Web is international, the majority of research on finding information has focused on languages such as English and Chinese. In this thesis, we investigate information retrieval techniques for Indonesian. Although Indonesia is the fourth most populous country in the world, little attention has been given to searching Indonesian documents. Stemming is the process of reducing morphological variants of a word to a common stem form. Previous research has shown that stemming is language-dependent. Although several stemming algorithms have been proposed for Indonesian, there is no consensus on which gives better performance. We empirically explore these algorithms, showing that even the best algorithm still has scope for improvement. We propose novel extensions to this algorithm and develop a new Indonesian stemmer, and show that these can improve stemming correctness by up to three percentage points; our approach makes fewer than one error in thirty-eight words. We propose a range of techniques to enhance the performance of Indonesian information retrieval. These techniques include stopping; sub-word tokenisation; identification of proper nouns; and modifications to existing similarity functions. Our experiments show that many of these techniques can increase retrieval performance, with the highest increase achieved when we use n-grams of size five to tokenise words. We also present an effective method for identifying the language of a document; this allows various information retrieval techniques to be applied selectively depending on the language of target documents. We also address the problem of automatic creation of parallel corpora --- collections of documents that are direct translations of each other --- which are essential for cross-lingual information retrieval tasks. Well-curated parallel corpora are rare and, for many languages such as Indonesian, do not exist at all. We describe algorithms that we have developed to automatically identify parallel documents for Indonesian and English. Unlike most current approaches, which consider only the context and structure of the documents, our approach is based on the document content itself. Our algorithms do not make any prior assumptions about the documents, and are based on the Needleman-Wunsch algorithm for global alignment of protein sequences. Our approach works well in identifying Indonesian-English parallel documents, especially when no translation is performed. It can increase the separation value, a measure that discriminates good matches of parallel documents from bad matches, by approximately ten percentage points. We also investigate the applicability of our identification algorithms to other languages that use the Latin alphabet. Our experiments show that, with minor modifications, our alignment methods are effective for English-French, English-German, and French-German corpora, especially when the documents are not translated. Our technique can increase the separation value for the European corpus by up to twenty-eight percentage points. Together, these results provide a substantial advance in understanding techniques that can be applied for effective Indonesian text retrieval.
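
For illustration, a minimal Python sketch of Needleman-Wunsch global alignment over token sequences, the algorithm this parallel-document identification builds on; the scoring parameters and exact-token matching below are assumptions, not the weighting used in the thesis.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of sequences a and b (Needleman-Wunsch).

    Illustrative scoring values; the thesis adapts the algorithm with
    its own weights for aligning document token streams.
    """
    n, m = len(a), len(b)
    # dp[i][j] = best score aligning a[:i] with b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]

# A high score relative to document length suggests a parallel pair;
# shared tokens (numbers, names) can match even without translation.
score = needleman_wunsch("berita utama 2007".split(), "main news 2007".split())
```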
22

Outomatiese Setswana lemma-identifisering (Automatic Setswana lemma identification) / Jeanetta Hendrina Brits

Brits, Jeanetta Hendrina January 2006 (has links)
Within the context of natural language processing, a lemmatiser is one of the most important core technology modules that has to be developed for a particular language. A lemmatiser reduces the words in a corpus to the corresponding lemmas of those words in the lexicon. A lemma is defined as the meaningful base form from which other more complex forms (i.e. variants) are derived. Before a lemmatiser can be developed for a specific language, the concept "lemma" as it applies to that language should first be defined clearly. This study concludes that, in Setswana, only stems (and not roots) can act independently as words; therefore, only stems should be accepted as lemmas in the context of automatic lemmatisation for Setswana. Five of the seven parts of speech in Setswana can be viewed as closed classes, which means that these classes are not extended by means of regular morphological processes. The two other parts of speech (nouns and verbs) require the implementation of alternation rules to determine the lemma. Such alternation rules were formalised in this study for the purpose of developing a Setswana lemmatiser, with the existing Setswana grammars as their basis. This made it possible to determine how precisely the formalisation of these existing grammars lemmatises Setswana words. FSA 6, the software developed by Van Noord (2002), is one of the best-known applications available for the development of finite state automata and transducers. Regular expressions based on the formalised morphological rules were used in FSA 6 to create finite state transducers, and the code subsequently generated by FSA 6 was implemented in the lemmatiser. The metric used to evaluate the lemmatiser is precision. On a test corpus of 1 000 words, the lemmatiser obtained 70.92%. In a separate evaluation on 500 complex nouns and 500 complex verbs, it obtained 70.96% and 70.52% respectively, while precision on 500 complex and simplex nouns was 78.45% and on complex and simplex verbs 79.59%. These quantitative results give only an indication of the relative precision of the grammars; nevertheless, they provided analysed data with which the grammars could be evaluated qualitatively. The study concludes with an overview of how these results might be improved in the future. / Thesis (M.A. (African Languages))--North-West University, Potchefstroom Campus, 2006.
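
As an illustration of lemmatisation by alternation rules, here is a minimal Python sketch using regular-expression rewrite rules in place of FSA 6 finite state transducers; the two rules shown are hypothetical placeholders, not the thesis's formalised Setswana grammar.

```python
import re

# Hypothetical alternation rules in the spirit of the thesis's formalised
# grammar; these are NOT actual Setswana rules, just placeholders showing
# how ordered regex rewrite rules can stand in for a transducer.
RULES = [
    (re.compile(r'^di(.+)$'), r'\1'),   # strip a hypothetical plural noun-class prefix
    (re.compile(r'^(.+)ile$'), r'\1a'), # map a hypothetical perfect suffix back to the stem vowel
]

def lemmatise(word: str) -> str:
    """Apply the first matching alternation rule; otherwise the word is its own lemma."""
    for pattern, repl in RULES:
        if pattern.match(word):
            return pattern.sub(repl, word)
    return word
```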
23

Metody stemmingu používané při dolování textu / Stemming Methods Used in Text Mining

Adámek, Tomáš January 2010 (has links)
The main theme of this master's thesis is a description of text mining. The thesis focuses on English texts and their automatic preprocessing. Its main part analyses various stemming algorithms (Lovins, Porter and Paice/Husk); stemming is a procedure for automatically conflating semantically related terms via the use of rule sets. The next part describes the design of an application supporting various types of stemming algorithms. The application is built on the Java platform, using the Swing graphics library and the MVC architecture. The following chapter describes the implementation of the application and of the stemming algorithms. The last part describes experiments with the stemming algorithms and compares them with respect to text classification results.
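
For illustration, a radically simplified suffix-stripping pass in the spirit of the Porter-style rule sets the thesis analyses; the real algorithms condition each rule on the measure of the remaining stem and run several ordered phases. Python is used here rather than the thesis's Java.

```python
# Ordered (suffix, replacement) rules; first match wins, mirroring how
# rule-set stemmers conflate morphological variants to one stem.
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def strip_suffix(word: str) -> str:
    """One simplified stripping pass; keeps at least a two-letter stem."""
    for suffix, repl in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)] + repl
    return word

assert strip_suffix("caresses") == "caress"
assert strip_suffix("stemming") == "stemm"  # illustrates why real rules need more conditions
```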
24

Odvození slovníku pro nástroj Process Inspector na platformě SharePoint / Derivation of Dictionary for Process Inspector Tool on SharePoint Platform

Pavlín, Václav January 2012 (has links)
This master's thesis presents methods for mining important pieces of information from text. It analyses the problem of term extraction from a large document collection and describes an implementation in C# with Microsoft SQL Server. The system uses stemming and a number of statistical methods for term extraction. The project also compares the methods used and proposes a process for deriving the dictionary.
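
The abstract does not name the statistical methods; as an illustration, here is a minimal tf-idf ranking sketch for candidate dictionary terms (tf-idf is an assumed stand-in, and Python is used rather than the thesis's C#).

```python
import math
from collections import Counter

def tfidf_terms(documents, top_k=5):
    """Rank candidate dictionary terms by their best tf-idf score.

    `documents` are non-empty lists of already-stemmed tokens; tf-idf is
    one common statistical choice, assumed here for illustration.
    """
    df = Counter()
    for doc in documents:
        df.update(set(doc))          # document frequency of each term
    n = len(documents)
    scores = Counter()
    for doc in documents:
        tf = Counter(doc)
        for term, freq in tf.items():
            scores[term] = max(scores[term],
                               (freq / len(doc)) * math.log(n / df[term]))
    return [t for t, _ in scores.most_common(top_k)]
```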
25

Metody pro získávání asociačních pravidel z dat / Methods for Mining Association Rules from Data

Uhlíř, Martin January 2007 (has links)
The aim of this thesis is to implement the Multipass-Apriori method for mining association rules from text data. After an introduction to the field of knowledge discovery, the specific aspects of text mining are discussed. Preprocessing is a very important step in the mining process; in this case the use of stemming and a stop-word dictionary is necessary. The next part of the thesis deals with the meaning, usage and generation of association rules. The main part focuses on the description of the Multipass-Apriori method, which was implemented. On the basis of the executed tests, the optimal way of dividing partitions was determined, as well as the best way of sorting the itemsets. As part of the testing, the Multipass-Apriori method was compared with the Apriori method.
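
For illustration, a minimal Python sketch of plain level-wise Apriori, the base algorithm that Multipass-Apriori extends; the toy transactions and support threshold are assumptions for the example only.

```python
def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining (plain Apriori, not the
    partitioned Multipass variant the thesis implements)."""
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k_sets = {s for s in items
              if sum(s <= t for t in transactions) >= min_support}
    k = 1
    while k_sets:
        for s in k_sets:
            frequent[s] = sum(s <= t for t in transactions)
        # candidate generation: join frequent k-itemsets into (k+1)-itemsets
        candidates = {a | b for a in k_sets for b in k_sets if len(a | b) == k + 1}
        k_sets = {c for c in candidates
                  if sum(c <= t for t in transactions) >= min_support}
        k += 1
    return frequent

txns = [frozenset(t) for t in (["stem", "stop"], ["stem", "rule"], ["stem", "stop", "rule"])]
print(apriori(txns, min_support=2))  # frequent itemsets with their supports
```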
26

Clustering of Web Services Based on Semantic Similarity

Konduri, Aparna 12 May 2008 (has links)
No description available.
27

世界城市的概念輪廓與連結:以Flickr Tags為例 / The World Cities Concept Profiling and Concatenation: A Case Study on Flickr Tags

曹期鈞, Tsao, Chi Chun Unknown Date (has links)
在這社會網路蓬勃發展之中、網際網路頻寬與速度相繼提昇的資訊年代，結合網路科技所衍生的Flickr網路相簿因應而生。Flickr提供許多API程式讓使用者或有興趣研究的專家學者能透過Flickr所收集及其所探討的議題，來觀察社會網路的變化情形。 社會網路主要是由節點以及節點間彼此相連結所形成，常見的網路模型大致可分為One-mode與Two-mode兩種網路結構，而本文則採用內部同時有兩種類節點、由兩個城市與Tags共同組合而成的Two-mode網路為基礎架構，期望藉此來闡述一個Tags系統分析法，利用Flickr使用者收集、標註之Flickr標記來與世界城市的概念輪廓相連結，透過提取城市語義分配給Flickr上照片的Tags，以及解決Part-Of-Speech (POS)、詞幹還原及雜訊處理…等問題，來達成依據排名結果分析出城市概念輪廓的最終目的。 除此之外，本文還運用了Flickr tag資料來彙整出41個城市的前100名tag，再篩選出前10名的tag，將其與相關的城市歸類一起比較。本文亦使用字詞共現指標(Tag co-occurrence)來計算與該城市的關聯性，再利用此法則來歸納出這兩個城市字詞共同出現的機會，以便於了解城市與城市之間的關連字詞組合。最後，本研究亦透過Flickr網站本身Popular Tags經由分析及匯出標籤雲的結果來與本文之實驗結果相對照，本實驗85%的吻合度驗證了可靠性。 / The Flickr web album service was born in an information age of growing social networks and improving internet bandwidth and speed. Users and researchers can observe changes in social networks through the topics collected on Flickr, using the API programs Flickr provides. A social network is formed by nodes and the links between them, and common network models can be broadly divided into one-mode and two-mode structures. This study builds on a two-mode network containing two types of nodes, cities and tags, and develops an approach for profiling the concept of world cities: it extracts city semantics from the tags that Flickr users assign to photos, handling Part-of-Speech (POS) tagging, stemming and noise removal, so that city concept profiles can be derived from the ranking results. The top 100 tags were collected for 41 cities, and the top 10 tags for each city were then extracted and compared across related cities. Tag co-occurrence was applied to compute the relatedness of city pairs, so that the word combinations connecting two cities can be understood from their co-occurrence opportunities. Finally, the results were compared against the Popular Tags published by Flickr itself; an 85% agreement demonstrated the reliability of the approach.
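
As an illustration, a minimal Python sketch of one common formulation of tag co-occurrence between two cities' tag sets (the Jaccard index; the thesis's exact measure may differ).

```python
def tag_cooccurrence(tags_a, tags_b):
    """Share of tags two cities have in common (Jaccard index).

    The thesis computes a tag co-occurrence measure between city tag
    sets; Jaccard is one standard formulation, assumed here.
    """
    a, b = set(tags_a), set(tags_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Two hypothetical cities' top tags; a higher score means more shared vocabulary.
print(tag_cooccurrence(["temple", "night", "market"], ["market", "harbour", "night"]))
```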
28

Vyhledávání informací v digitálních knihovnách / Digital Library Information Retrieval

Hochmal, Petr Unknown Date (has links)
This thesis deals with methods of information retrieval. Firstly, it describes models of information retrieval and methods of retrieval evaluation. It then explains the principles of input text processing for IR using a stop-word list and a stemmer. Furthermore, it shows how queries can be expanded with synonyms using a thesaurus, presents methods for handling the appearance of phrases in queries, and introduces the idea of ranking documents by the degree of phrase-occurrence similarity in documents. The second part of the thesis describes the design of a complete IR system using the vector model, query expansion with synonyms and phrase handling. This system has been implemented in C# as an application for retrieval and administration of documents in digital libraries. The effectiveness of the system is evaluated at the end of the thesis by several tests.
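
The core of any such vector-model system is ranking documents by similarity to the query vector. Here is a minimal Python sketch of cosine similarity over raw term frequencies; a full system such as this one would typically apply tf-idf weighting plus the phrase and synonym handling described above, and the thesis's own implementation is in C#.

```python
import math
from collections import Counter

def cosine(query_terms, doc_terms):
    """Cosine similarity between raw term-frequency vectors.

    Raw counts are used for brevity; weighted vectors slot in the same way.
    """
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

# Documents are ranked by descending cosine score against the query.
print(cosine("digital library search".split(), "search in a digital library".split()))
```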
29

A security architecture for protecting dynamic components of mobile agents

Yao, Ming January 2004 (has links)
New techniques, languages and paradigms have facilitated the creation of distributed applications in several areas. Perhaps the most promising paradigm is the one that incorporates the mobile agent concept. A mobile agent in a large scale network can be viewed as a software program that travels through a heterogeneous network, crossing various security domains and executing autonomously in its destination. Mobile agent technology extends the traditional network communication model by including mobile processes, which can autonomously migrate to new remote servers. This basic idea results in numerous benefits including flexible, dynamic customisation of the behavior of clients and servers and robust interaction over unreliable networks. In spite of its advantages, widespread adoption of the mobile agent paradigm is being delayed due to various security concerns. Currently available mechanisms for reducing the security risks of this technology do not efficiently cover all the existing threats. Due to the characteristics of the mobile agent paradigm and the threats to which it is exposed, security mechanisms must be designed to protect both agent hosting servers and agents. Protection of agent-hosting servers' security is a reasonably well researched issue, and many viable mechanisms have been developed to address it. Protecting agents is technically more challenging and solutions to do so are far less developed. The primary added complication is that, as an agent traverses multiple servers that are trusted to different degrees, the agent's owner has no control over the behavior of the agent-hosting servers. Consequently the hosting servers can subvert the computation of the passing agent. Since it is infeasible to force remote servers to enact a security policy that would prevent them from corrupting the agent's data, cryptographic mechanisms defined by the agent's owner may be one of the feasible solutions for protecting that data. Hence the focus of this thesis is the development and deployment of cryptographic mechanisms for securing mobile agents in an open environment. Firstly, requirements for securing mobile agents' data are presented. For a sound mobile agent application, the data an agent collects from each visited server must be given integrity protection. In some applications where servers intend to remain anonymous and will reveal their identities only under certain circumstances, privacy is also necessitated. Aimed at these properties, four new schemes are designed to achieve different security levels: two are directed at preserving the integrity of the agent's data, while the other two focus on attaining data privacy. Four new security techniques were designed to support these schemes. The first is joint keys, to discourage two servers from colluding to forge a victim server's signature. The second is recoverable key commitment, to enable detection of any illegal operation of hosting servers on an agent's data. The third is conditionally anonymous digital signature schemes, utilising anonymous public-key certificates, to allow any server to digitally sign a document without leaking its identity. The fourth is servers' pseudonyms, analogues of identities, which enable servers to be recognised as legitimate while their identities remain unknown to anyone; pseudonyms can be deanonymised with the assistance of authorities.
Apart from these new techniques, other mechanisms such as hash chaining relationships and a mandatory verification process are adopted in the new schemes. To enable the inter-operability of these mechanisms, a security architecture is therefore developed that integrates compatible techniques into a generic solution for securing an agent's data. The architecture can be used independently of the particular mobile agent application under consideration, and can guide and support developers in the analysis of security issues during the design and implementation of services and applications based on mobile agent technology.
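
As an illustration of the hash-chaining mechanism mentioned above, here is a minimal Python sketch in which each visited server extends a digest chain over the agent's accumulated data, so that tampering with any earlier entry invalidates every later digest; the seeding and payload format are assumptions, not the thesis's exact protocol.

```python
import hashlib

def chain_entry(prev_digest: bytes, server_id: str, payload: bytes) -> bytes:
    """Digest binding this server's contribution to everything before it."""
    return hashlib.sha256(prev_digest + server_id.encode() + payload).digest()

# The agent owner seeds the chain; each visited server extends it.
digest = hashlib.sha256(b"agent-owner-nonce").digest()
for server, data in [("serverA", b"offer:100"), ("serverB", b"offer:95")]:
    digest = chain_entry(digest, server, data)

# Verification replays the chain over the collected entries; any later
# modification of an earlier entry changes every subsequent digest.
```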
30

Σχεδιασμός και υλοποίηση δημοσιογραφικού RDF portal με μηχανή αναζήτησης άρθρων / Design and implementation of a journalistic RDF portal with an article search engine

Χάιδος, Γεώργιος 11 June 2013 (has links)
Το Resource Description Framework (RDF) αποτελεί ένα πλαίσιο περιγραφής πόρων ως μεταδεδομένα για το σημασιολογικό ιστό. Ο σκοπός του σημασιολογικού ιστού είναι η εξέλιξη και επέκταση του υπάρχοντος παγκόσμιου ιστού, έτσι ώστε οι χρήστες του να μπορούν ευκολότερα να αντλούν συνδυασμένη την παρεχόμενη πληροφορία. Ο σημερινός ιστός είναι προσανατολισμένος στον άνθρωπο. Για τη διευκόλυνση σύνθετων αναζητήσεων και σύνθεσης επιμέρους πληροφοριών, ο ιστός αλλάζει προσανατολισμό, έτσι ώστε να μπορεί να ερμηνεύεται από μηχανές και να απαλλάσσει το χρήστη από τον επιπλέον φόρτο. Η πιο φιλόδοξη μορφή ενσωμάτωσης κατάλληλων μεταδεδομένων στον παγκόσμιο ιστό είναι με την περιγραφή των δεδομένων με RDF triples αποθηκευμένων ως XML. Το πλαίσιο RDF περιγράφει πόρους, ορισμένους με Uniform Resource Identifiers (URI’s) ή literals με τη μορφή υποκείμενου-κατηγορήματος-αντικειμένου. Για την ορθή περιγραφή των πόρων ενθαρρύνεται από το W3C η χρήση υπαρχόντων λεξιλογίων και σχημάτων , που περιγράφουν κλάσεις και ιδιότητες. Στην παρούσα εργασία γίνεται υλοποίηση ενός δημοσιογραφικού RDF portal. Για τη δημιουργία RDF/XML, έχουν χρησιμοποιηθεί τα λεξιλόγια και σχήματα που συνιστούνται από το W3C καθώς και των DCMI και PRISM. Επίσης χρησιμοποιείται για την περιγραφή typed literals to XML σχήμα του W3C και ένα σχήμα του portal. Η δημιουργία των μεταδεδομένων γίνεται αυτόματα από το portal με τη χρήση των στοιχείων που συμπληρώνονται στις φόρμες δημοσίευσης άρθρων και δημιουργίας λογαριασμών. Για τον περιορισμό του χώρου αποθήκευσης τα μεταδεδομένα δεν αποθηκεύονται αλλά δημιουργούνται όταν ζητηθούν. Στην υλοποίηση έχει δοθεί έμφαση στην ασφάλεια κατά τη δημιουργία λογαριασμών χρήστη με captcha και κωδικό ενεργοποίησης με hashing. Για τη διευκόλυνση του έργου του αρθρογράφου, έχει εισαχθεί και επεκταθεί ο TinyMCE Rich Text Editor, o οποίος επιτρέπει τη μορφοποίηση του κειμένου αλλά και την εισαγωγή εικόνων και media. Ο editor παράγει αυτόματα HTML κώδικα από το εμπλουτισμένο κείμενο. Οι δυνατότητες του editor επεκτάθηκαν κυρίως με τη δυνατότητα για upload εικόνων και media και με την αλλαγή κωδικοποίησης για συμβατότητα με τα πρότυπα της HTML5. Για επιπλέον συμβατότητα με την HTML5 εισάγονται από το portal στα άρθρα ετικέτες σημασιολογικής δομής. Εκτός από τα άρθρα που δημιουργούνται με τη χρήση του Editor, δημοσιοποιούνται και άρθρα από εξωτερικές πηγές. Στη διαδικασία που είναι αυτόματη και επαναλαμβανόμενη, γίνεται επεξεργασία και αποθήκευση μέρους των δεδομένων των εξωτερικών άρθρων. Στον αναγνώστη του portal παρουσιάζεται ένα πρωτοσέλιδο και σελίδες ανά κατηγορία με τα πρόσφατα άρθρα. Στο portal υπάρχει ενσωματωμένη μηχανή αναζήτησης των άρθρων, με πεδία για φιλτράρισμα χρονικά, κατηγορίας, αρθρογράφου-πηγής αλλά και λέξεων κλειδιών. Οι λέξεις κλειδιά προκύπτουν από την περιγραφή του άρθρου στη φόρμα δημιουργίας ή αυτόματα. Όταν τα άρθρα προέρχονται από εξωτερικές πηγές, η διαδικασία είναι υποχρεωτικά αυτόματη. Για την αυτόματη ανεύρεση των λέξεων κλειδιών από ένα άρθρο χρησιμοποιείται η συχνότητα της λέξης στο άρθρο, με τη βαρύτητα που δίνεται από την HTML για τη λέξη (τίτλος, έντονη γραφή), κανονικοποιημένη για το μέγεθος του άρθρου και η συχνότητα του λήμματος της λέξης σε ένα σύνολο άρθρων που ανανεώνεται. Για την ανάκτηση των άρθρων χρησιμοποιείται η τεχνική των inverted files για όλες τις λέξεις κλειδιά. Για τη μείωση του όγκου των δεδομένων και την επιτάχυνση απάντησης ερωτημάτων, αφαιρούνται από την περιγραφή λέξεις που παρουσιάζουν μεγάλη συχνότητα και μικρή αξία ανάκτησης πληροφορίας “stop words”. 
Η επιλογή μιας αντιπροσωπευτικής λίστας με stop words πραγματοποιήθηκε με τη χρήση ενός σώματος κειμένων από άρθρα εφημερίδων, τη μέτρηση της συχνότητας των λέξεων και τη σύγκριση τους με τη λίστα stop words της Google. Επίσης για τον περιορισμό του όγκου των δεδομένων αλλά και την ορθότερη απάντηση των ερωτημάτων, το portal κάνει stemming στις λέξεις κλειδιά, παράγοντας όρους που μοιάζουν με τα λήμματα των λέξεων. Για το stemming έγινε χρήση της διατριβής του Γεώργιου Νταή του Πανεπιστημίου της Στοκχόλμης που βασίζεται στη Γραμματική της Νεοελληνικής Γραμματικής του Μανώλη Τριανταφυλλίδη. Η επιστροφή των άρθρων στα ερωτήματα που περιλαμβάνουν λέξεις κλειδιά γίνεται με κατάταξη εγγύτητας των λέξεων κλειδιών του άρθρου με εκείνο του ερωτήματος. Γίνεται χρήση της συχνότητας των λέξεων κλειδιών και της συχνότητας που έχουν οι ίδιες λέξεις σε ένα σύνολο άρθρων που ανανεώνεται. Για την αναζήτηση γίνεται χρήση θησαυρού συνώνυμων λέξεων. / The Resource Description Framework (RDF) is a framework for describing resources as metadata for the Semantic Web. The aim of the Semantic Web is the development and expansion of the existing Web so that users can more easily acquire the supplied information in integrated form. Today's Web is human-oriented; to facilitate complex queries and the combination of acquired data, the Web is changing orientation so that it can be interpreted by machines, relieving the user of the extra burden. The most ambitious form of incorporating appropriate metadata into the Web is describing data with RDF triples stored as XML. The RDF framework describes resources, identified by Uniform Resource Identifiers (URIs) or literals, in subject-predicate-object form. The use of existing RDF vocabularies to describe classes and properties is encouraged by the W3C. In this work a journalistic news RDF portal has been developed. The RDF/XML is created using vocabularies and schemas recommended by the W3C as well as DCMI and PRISM. The metadata is created automatically from the data supplied when a new article is published. To facilitate the journalist's job, a rich text editor (TinyMCE) that enables formatting text and inserting images and media has been integrated and extended; the editor automatically generates HTML code from rich text in a graphical environment. Its capabilities were extended mainly to support uploading images and media and to change the encoding for better compatibility with the HTML5 standards. Apart from articles created with the editor, the portal also integrates articles published by external sources, in a process that is fully automatic and repetitive. The reader of the portal is presented with a front page and pages of recent articles per category. The portal includes a search engine with fields for filtering by time, category, journalist or source, and keywords. Keywords can be supplied by the publisher or selected automatically; when articles come from external sources, the process is necessarily automatic. For automatic keyword selection, the frequency of each word in the article is used, weighted by the HTML emphasis given to the word (e.g. title, bold), normalised for the size of the article, together with the stem frequency of the word in a continually refreshed set of articles. For the retrieval of articles the search engine uses an index of inverted files over all keywords. 
To reduce the data volume and accelerate query processing, words with high frequency and low information-retrieval value ("stop words") are removed. A representative list of stop words was chosen by using a corpus of newspaper articles, measuring word frequencies, and comparing them with Google's stop-word list. To further reduce the volume of data and improve recall, the portal stems the keywords, producing terms that resemble the lemmas of the words. For stemming, the rule-based algorithm presented in the thesis of Georgios Ntais at Stockholm University, which is based on the Modern Greek grammar of Manolis Triantafyllidis, was used. Articles returned for a keyword query are ranked by the proximity of the article's keywords to those of the query, using the frequency of the keywords and the frequency of the same words in a continually refreshed set of articles. The search also uses a thesaurus of synonyms.
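
A minimal Python sketch of the inverted-file structure the portal describes, with stop-word removal and a pluggable stemmer; the English stop-word list and the identity stemmer below are placeholders, not the portal's Greek resources.

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "of"}  # placeholder; the portal derives its Greek list from a news corpus

def index_articles(articles, stem=lambda w: w):
    """Build an inverted file: term -> set of article ids.

    `stem` would be the rule-based Greek stemmer the portal uses; the
    identity function stands in here for illustration.
    """
    inverted = defaultdict(set)
    for art_id, text in articles.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:
                inverted[stem(word)].add(art_id)
    return inverted

# Query processing intersects the posting sets of the (stemmed) query terms.
index = index_articles({1: "the market opens", 2: "a market report"})
print(index["market"])  # {1, 2}
```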
