1 |
Automatic text processing for Korean language free text retrieval
Lee, Hyo Sook January 2000
No description available.
|
2 |
Effective retrieval techniques for Arabic text
Nwesri, Abdusalam F Ahmad, nwesri@yahoo.com January 2008
Arabic is a major international language, spoken in more than 23 countries, and the lingua franca of the Islamic world. The number of Arabic-speaking Internet users has grown over nine-fold in the Middle East between 2000 and 2007, yet research in Arabic Information Retrieval (AIR) has not advanced as in other languages such as English. In this thesis, we explore techniques that improve the performance of AIR systems. Stemming is considered one of the most important factors to improve retrieval effectiveness of AIR systems. Most current stemmers remove affixes without checking whether the removed letters are actually affixes. We propose lexicon-based improvements to light stemming that distinguish core letters from proper Arabic affixes. We devise rules to stem most affixes and show their effects on retrieval effectiveness. Using the TREC 2001 test collection, we show that applying relevance feedback with our rules produces significantly better results than light stemming. Techniques for Arabic information retrieval have been studied in depth on clean collections of newswire dispatches. However, the effectiveness of such techniques is not known on other noisy collections in which text is generated using automatic speech recognition (ASR) systems and queries are generated using machine translations (MT). Using noisy collections, we show that normalisation, stopping and light stemming improve results as in normal text collections but that n-grams and root stemming decrease performance. Most recent AIR research has been undertaken using collections that are far smaller than the collections used for English text retrieval; consequently, the significance of some published results is debatable. Using the LDC Arabic GigaWord collection that contains more than 1 500 000 documents, we create a test collection of 90 topics with their relevance judgements. Using this test collection, we show empirically that for a large collection, root stemming is not competitive.
Of the approaches we have studied, lexicon-based stemming approaches perform better than light stemming approaches alone. Arabic text commonly includes foreign words transliterated into Arabic characters. Several transliterated forms may be in common use for a single foreign word, but users rarely use more than one variant during search tasks. We test the effectiveness of lexicons, Arabic patterns, and n-grams in distinguishing foreign words from native Arabic words. We introduce rules that help filter foreign words and improve the n-gram approach used in language identification. Our combined n-grams and lexicon approach successfully identifies 80% of all foreign words with a precision of 93%. To find variants of a specific foreign word, we apply phonetic and string similarity techniques and introduce novel algorithms to normalise them in Arabic text. We modify phonetic techniques used for English to suit the Arabic language, and compare several techniques to determine their effectiveness in finding foreign word variants. We show that our algorithms significantly improve recall. We also show that expanding queries using variants identified by our Soutex4 phonetic algorithm results in a significant improvement in precision and recall. Together, the approaches described in this thesis represent an important step towards realising highly effective retrieval of Arabic text.
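The contrast the abstract draws between blind light stemming and lexicon-checked stemming can be sketched as follows. This is only a minimal illustration: the affix lists are a small illustrative set and the two-word `LEXICON` is hypothetical; neither is taken from the thesis, whose actual rules are not given in the abstract.

```python
# Illustrative affix lists (an assumption, NOT the thesis's rule set).
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي"]

# Hypothetical toy lexicon: "ولد" (boy) and "كتاب" (book).
LEXICON = {"ولد", "كتاب"}


def light_stem(word, min_len=2):
    """Blindly strip at most one prefix and one suffix."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_len:
            word = word[:-len(s)]
            break
    return word


def lexicon_stem(word, lexicon):
    """Strip affixes only when the remainder is attested in the lexicon."""
    if word in lexicon:
        return word  # the leading/trailing letters are core letters
    candidates = {word}
    for p in PREFIXES:
        if word.startswith(p):
            candidates.add(word[len(p):])
    for c in list(candidates):
        for s in SUFFIXES:
            if c.endswith(s):
                candidates.add(c[:-len(s)])
    attested = sorted(c for c in candidates if c in lexicon)
    # Fall back to blind light stemming when nothing is attested.
    return min(attested, key=len) if attested else light_stem(word)
```

With these toy lists, `light_stem` strips the initial "و" of "ولد" even though it is a core letter there, while `lexicon_stem` leaves the word intact because it is attested as a whole.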
|
3 |
Morfologická segmentace českých slov / Morphological segmentation of Czech Words
Vidra, Jonáš January 2018
In linguistics, words are usually considered to be composed of morphemes: units that carry meaning and are not further subdivisible. The task of this thesis is to create an automatic method for segmenting Czech words into morphemes, usable within the network of Czech derivational relations DeriNet. We created two different methods. The first one finds morpheme boundaries by differentiating words against their derivational parents, and transitively against their whole derivational family. It explicitly models morphophonological alternations and finds the best boundaries using maximum likelihood estimation. At worst, the results are slightly worse than those of the state-of-the-art method Morfessor FlatCat, and they are significantly better in some settings. The second method is a neural network made to jointly predict segmentation and derivational parents, trained using the output of the first method and the derivational pairs from DeriNet. Our hypothesis that such joint training would increase the quality of the segmentation over training purely on the segmentation task seems to hold in some cases, but not in others. The neural model performs worse than the first one, possibly because it is trained on data that already contains some errors, which it then multiplies.
|
4 |
The development of a blasthole stemming performance evaluation model using a purpose built testing facility
Boshoff, Dawid 26 November 2009
The ability of an explosive to break rock is influenced considerably by the extent of confinement in the blasthole and it is believed that confinement is improved by the use of stemming. The aim of this paper is to present the first and second stages of results in developing a stemming performance testing and evaluation facility for small diameter boreholes. The results showed that different stemming products have differences in terms of their functionality, which can have a major impact on the efficiency of rock breaking. Two test procedures were used, one through the exclusive use of compressed air and the second using a purpose-built high pressure test rig with small quantities of explosives. Both tests were used to identify and evaluate the ability of various stemming products to resist the escape of explosive gas through the collar of a blasthole. Extensive research was conducted to determine the types of stemming products most commonly used in South African underground hard rock mines, and the differences in design between the various products are discussed. The first stage of tests using compressed air only did not prove adequate to predict with certainty the pressure behaviour in the borehole of a particular product under high pressure conditions. The purpose-built high pressure test rig did not prove to be a very effective tool to test stemming products under high pressure conditions. The test rig only incorporated the effect of gas pressure on the stemming product and in doing so omitted to take the effect of the shock wave into account. This study proved that taking only the gas pressure generated in the blasthole into account is not sufficient to effectively test stemming product design. A more comprehensive study should include the effect of gas pressure in the borehole, shock waves generated by the explosive and also the coefficient of friction of both the surface of the stemming product and the inside of the blasthole.
/ Dissertation (MEng)--University of Pretoria, 2009. / Mining Engineering / unrestricted
|
5 |
Jak kvalita lemmatizace ovlivňuje výsledky vyhledávání dokumentů v českém jazyce / Effect of the Czech Stemming Algorithm on the Document Retrieval
Pytelka, Petr January 2012
This thesis deals with the measurement of the quality of the stemming/lemmatization algorithm for the Czech language in document processing systems and provides an analysis of the results. The theoretical part of the thesis describes the principles of full-text search, the possibilities of implementation, and the common problems which have to be solved in connection with the processing of natural language. Methods of evaluating the quality of lemmatization, using recall and precision, are discussed. In addition, the theoretical part covers the method of measuring the index of under-stemming and over-stemming, which can be applied for the purposes of a more detailed evaluation. An experiment for evaluating the lemmatization algorithms is proposed in the second part of the thesis. A specialized application has been developed to perform the experiment in three different systems, namely Apache Lucene, the PostgreSQL database system and Microsoft SQL Server. The experiment is based on the Prague Dependency Treebank corpus. It has been carried out both for the corpus as a whole and for selected word classes separately. Further analysis of the results for the Czech stemmer in Apache Lucene leads to a proposal for several modifications of the algorithm. Such modifications result in measurable improvements. The results achieved show how the metrics discussed, together with the values measured, can be used to improve the lemmatization algorithms and thus the full-text search for the Czech language.
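The under-stemming and over-stemming indices mentioned above can be illustrated with a pairwise count, in the spirit of Paice's indices. The sketch below assumes a hand-made grouping of words that ought to conflate and a toy stemmer; the function name `stemming_indices` and the sample data are hypothetical, and the abstract does not say that the thesis computes the indices exactly this way.

```python
from itertools import combinations


def stemming_indices(groups, stem):
    """Return (under-stemming index, over-stemming index).

    groups: list of word lists; words in one list should share a stem.
    stem:   callable mapping a word to its stem.
    """
    labelled = [(w, g) for g, words in enumerate(groups) for w in words]
    same = same_missed = diff = diff_merged = 0
    for (w1, g1), (w2, g2) in combinations(labelled, 2):
        merged = stem(w1) == stem(w2)
        if g1 == g2:
            same += 1
            same_missed += not merged   # should merge, but did not
        else:
            diff += 1
            diff_merged += merged       # should stay apart, but merged
    ui = same_missed / same if same else 0.0
    oi = diff_merged / diff if diff else 0.0
    return ui, oi


# Toy data: a crude "stemmer" that keeps the first three letters.
groups = [["run", "runs", "ran"], ["runway"]]
ui, oi = stemming_indices(groups, lambda w: w[:3])
```

Here two of the three same-group pairs are missed ("ran" does not conflate with "run" or "runs") and two of the three cross-group pairs are wrongly merged ("runway" collapses into "run"), so both indices come out at 2/3.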
|
6 |
An Evaluation of Existing Light Stemming Algorithms for Arabic Keyword Searches
Brittany E. Rogerson 17 November 2008
The field of Information Retrieval recognizes the importance of stemming in improving retrieval effectiveness. This same tool, when applied to searches conducted in the Arabic language, increases the relevancy of documents returned and expands searches to encompass the general meaning of a word instead of the word itself. Since the Arabic language relies mainly on triconsonantal roots for verb forms and derives nouns by adding affixes, words with similar consonants are closely related in meaning. Stemming allows a search term to focus more on the meaning of a term and closely related terms and less on specific character matches. This paper discusses the strengths of light stemming, the best techniques, and components for algorithmic affix-based stemmers used in keyword searching in the Arabic language.
|
7 |
Enhancing a Web Crawler with Arabic Search
Nguyen, Qui V. 25 July 2012
Many advantages of the Internet (ease of access, limited regulation, vast potential audience, and fast flow of
information) have turned it into the most popular way to communicate and exchange ideas. Criminal and terrorist
groups also use these advantages to turn the Internet into their new play/battle fields to conduct their illegal/terror
activities. There are millions of Web sites in different languages on the Internet, but the lack of foreign language
search engines makes it impossible to analyze foreign language Web sites efficiently. This thesis will enhance an
open source Web crawler with Arabic search capability, thus improving an existing social networking tool to perform
page correlation and analysis of Arabic Web sites. A social networking tool with Arabic search capabilities could
become a valuable tool for the intelligence community. Its page correlation and analysis results could be used to
collect open source intelligence and build a network of Web sites that are related to terrorist or criminal activities.
|
8 |
A stemming algorithm for Latvian
Kreslins, Karlis January 1996
The thesis covers construction, application and evaluation of a stemming algorithm for advanced information searching and retrieval in Latvian databases. Its aim is to examine the following two questions: Is it possible to apply to Latvian a suffix removal algorithm originally designed for English? Can stemming in Latvian produce the same or better information retrieval results than manual truncation? In order to achieve these aims, the role and importance of automatic word conflation both for document indexing and information retrieval are characterised. A review of literature, which analyzes and evaluates different types of stemming techniques and the retrospective development of stemming algorithms, justifies the necessity to apply this advanced IR method to Latvian as well. Comparative analysis of the morphological structure of English and Latvian determined the selection of Porter's suffix removal algorithm as a basis for the Latvian stemmer. An extensive list of Latvian stopwords, including conjunctions, particles and adverbs, was designed and added to the initial stemmer in order to eliminate insignificant words from further processing. A number of specific modifications and changes related to the Latvian language were made to the structure and rules of the original stemming algorithm. Analysis of word stemming based on a Latvian electronic dictionary and Latvian text fragments confirmed that the suffix removal technique can be successfully applied to the Latvian language as well. An evaluation study of user search statements revealed that the stemming algorithm can, to a certain extent, improve the effectiveness of information retrieval.
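The combination of stopword removal and suffix stripping described above can be sketched very roughly as below. This is a single-pass longest-match stripper, far simpler than a Porter-style rule cascade; the three stopwords and the handful of noun endings are illustrative assumptions, not the lists designed in the thesis.

```python
# A few Latvian conjunctions ("and", "but", "or"); illustrative only.
STOPWORDS = {"un", "bet", "vai"}

# A handful of noun endings, longest first; an assumed toy list.
SUFFIXES = sorted(["ām", "as", "a", "u", "s", "i"], key=len, reverse=True)


def stem(word, min_stem=3):
    """Remove the longest matching ending, keeping at least min_stem letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


def index_terms(text):
    """Lowercase, drop stopwords, and stem the remaining words."""
    return [stem(w) for w in text.lower().split() if w not in STOPWORDS]
```

For example, the inflected forms "grāmatas" and "grāmatu" (forms of "grāmata", book) both reduce to "grāmat", so a query using one form would match documents containing the other.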
|
9 |
Making Sense of Online Reviews: A Machine Learning Approach: An Abstract
Harrison, Dana E., Ajjan, Haya 01 January 2020
It is estimated that 80% of companies’ data is unstructured. Unstructured data, or data that is not predefined by numerical values, continues to grow at a rapid pace. Images, text, videos and voice are all examples of unstructured data. Companies can use this type of data to leverage novel insights unavailable through more easily manageable, structured data. Unstructured data, however, creates a challenge since it often requires substantial coding prior to performing an analysis. The purpose of this study is to describe the steps and introduce computational methods that can be adopted to further explore unstructured, online reviews. The unstructured nature of online reviews requires extensive text analytics processing. This study introduces methods for text analytics including tokenization at the sentence level, lemmatization or stemming to reduce inflectional forms of the words appearing in the text, and a ‘bag of n-grams’ approach. We will also introduce lexicon-based feature engineering and methods to develop new lexicons for capturing theoretically established constructs and relationships that are specific to the domain of study. The numeric features generated in the analysis will then be analyzed using machine learning algorithms. This process can be applied to the analysis of other unstructured data such as dyadic information exchange between customer service, salespeople, customers and channel members. Although these are not a comprehensive set of examples, companies can apply results from unstructured data analysis to examine a variety of outcomes related to customer decisions, managing channels and mitigating potential crisis situations. Understanding interdisciplinary methods of analyzing unstructured data is critical as the availability of this type of data continues to accelerate and enables researchers to develop theoretical contributions within the marketing discipline.
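Two of the preprocessing steps named above, sentence-level tokenization and a bag of n-grams, might look like the following in a minimal form. The regular expressions and function names are assumptions made for illustration; the stemming or lemmatization step is omitted, and the study's actual pipeline is not described in the abstract.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence split: break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokens(sentence):
    """Lowercased word tokens; punctuation is discarded."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def bag_of_ngrams(text, n_max=2):
    """Count all 1..n_max-grams, never crossing a sentence boundary."""
    bag = Counter()
    for sentence in split_sentences(text):
        toks = tokens(sentence)
        for n in range(1, n_max + 1):
            for i in range(len(toks) - n + 1):
                bag[" ".join(toks[i:i + n])] += 1
    return bag


review_bag = bag_of_ngrams("Great phone. Great battery life!")
```

Tokenizing per sentence means the bigram "phone great" is never counted, which keeps n-gram features from spanning unrelated statements in a review. The resulting counts are the kind of numeric features that could then be passed to a machine learning model.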
|
10 |
Nalezení slovních kořenů v češtině / Stemming of Czech Words
Hellebrand, David January 2010
The goal of this master's thesis is to develop a stemming algorithm for the Czech language based on grammatical rules. The thesis describes the stemming process and compares existing stemming algorithms. The basics of Czech grammar and the Snowball language are also described. The main part of the thesis concerns the implementation of the new Czech stemming algorithm.
|