  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Personalized User Trending Topics

Nerusupalli, Sathvik January 2011 (has links)
No description available.
2

Detecting Near-Duplicate Documents using Sentence-Level Features and Machine Learning

Liao, Ting-Yi 23 October 2012 (has links)
Effectively finding near-duplicate documents in a large collection has become an important problem. In this paper, we propose a new method for detecting near-duplicate documents in a large-scale dataset. The method has three parts: feature selection, similarity measurement, and discriminant derivation. In feature selection, each document is first preprocessed: signals, stop words, and so on are removed. We then weight each term within its sentence and select the terms with the highest weights; these terms, collected over all sentences, form the document's feature set. Similarity measurement applies a similarity function to compute the similarity between two feature sets. Discriminant derivation uses a support vector machine (SVM), a supervised learning strategy that trains a classifier from training patterns, to decide whether a document is a near-duplicate. For document characteristics, sentence-level features are more effective than term-level features. Moreover, learning a discriminant with an SVM avoids the trial-and-error effort required by conventional methods to find a threshold, i.e., a discriminant value that defines the relation between documents. The final experimental analysis shows that our method is more effective at near-duplicate document detection than other methods.
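The pipeline this abstract describes (per-sentence term weighting, feature-set similarity, then a learned discriminant) can be sketched roughly as follows. The stop-word list, the frequency-based term weights, and the choice of Jaccard as the similarity function are illustrative assumptions, and the SVM stage is omitted here in favor of the bare similarity score it would be trained on:

```python
from collections import Counter
import re

STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in"}  # toy list

def sentence_features(text, terms_per_sentence=2):
    """Collect the highest-weighted terms of each sentence as the
    document's feature set (weights here are simple frequencies)."""
    features = set()
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = [w for w in re.findall(r"[a-z]+", sentence)
                 if w not in STOP_WORDS]
        for term, _ in Counter(words).most_common(terms_per_sentence):
            features.add(term)
    return features

def jaccard(a, b):
    """Similarity of two feature sets; the thesis feeds such values to
    an SVM instead of comparing against a hand-tuned threshold."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

doc1 = "Near-duplicate detection finds copied documents. Detection uses sentence features."
doc2 = "Near-duplicate detection finds copied documents. Detection relies on sentence features."
sim = jaccard(sentence_features(doc1), sentence_features(doc2))
```

In a full implementation, similarity values like `sim` (possibly several, from different similarity functions) would form the input vector for the SVM classifier.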
3

Určení základního tvaru slova / Determination of basic form of words

Šanda, Pavel January 2011 (has links)
Lemmatization is an important preprocessing step for many text-mining applications. The lemmatization process is similar to stemming, with the difference that it determines not only the word stem but also tries to determine the basic form of the word, using the Brute Force and Suffix Stripping methods. The main aim of this paper is to present methods for algorithmically improving Czech lemmatization. The training data sets created for this work are part of the paper and can be freely used in student and academic work dealing with similar problems.
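The Suffix Stripping idea mentioned above can be sketched as a rule table applied longest-suffix-first; the rules below are illustrative toy examples, not the thesis's actual Czech rule set, and Brute Force would instead look the word form up in a precomputed form-to-lemma dictionary:

```python
# Toy suffix-stripping lemmatizer; the suffix table is illustrative only.
SUFFIX_RULES = [            # (suffix to strip, replacement)
    ("ování", "ovat"),      # e.g. "testování" -> "testovat"
    ("ech", ""),
    ("ami", "a"),
    ("y", "a"),
]

def lemmatize(word):
    """Apply the first matching rule, trying longer suffixes first;
    return the word unchanged when no rule matches."""
    for suffix, replacement in sorted(SUFFIX_RULES, key=lambda r: -len(r[0])):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)] + replacement
    return word
```

A real Czech rule set must also handle stem alternations, which is where a dictionary-backed (Brute Force) lookup complements pure suffix stripping.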
4

Prototyp för att öka exponeringen av skönlitteratur på internet / A prototype for increasing the exposure of fiction on the internet

Viderberg, Arvid, Hammersberg, Hampus January 2018 (has links)
On the internet today, the information used to expose books, such as genre, author, places, and summary, is generated manually. The full text of books is not publicly available on the internet due to copyright law, so this type of information cannot be generated automatically. One solution is to construct a prototype that processes the original work and automatically generates information that can be exposed on the internet without exposing the entire book. This report compares three algorithms for processing books: stemming, stop-word filtering, and scrambling of sentences within paragraphs. The algorithms are compared with respect to generating relevant information for four services: search engines, automatic metadata, smart ads, and text summarization. Search engines let a user search for, e.g., the book's title or a sentence from the book. Automatic metadata automatically extracts descriptive information from the book. Smart ads use descriptive information to recommend and promote books. Text summarization can automatically create a brief descriptive summary of the book. The information stored from the books should be relevant only to these services and should have no literary value for a human reader. The results show that the combinations scrambling of sentences → stop-word filtering and stop-word filtering → scrambling of sentences are optimal in terms of searchability. It is also recommended to add stemming as an extra step in processing the original work, since it generates more relevant automatic metadata for the book.
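Two of the three transformations compared above, stop-word filtering and scrambling of sentences within a paragraph, can be sketched as follows. The stop-word list and sentence splitter are simplified assumptions; the key property the sketch preserves is that each sentence stays intact after scrambling, so exact-phrase search still works while the paragraph loses its readable order:

```python
import random
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is"}  # toy list

def filter_stop_words(sentence):
    """Drop stop words while keeping the remaining word order."""
    return " ".join(w for w in sentence.split()
                    if w.lower() not in STOP_WORDS)

def scramble_sentences(paragraph, seed=0):
    """Shuffle sentence order within one paragraph; each sentence
    stays intact, so sentence-level searchability is preserved."""
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]
    random.Random(seed).shuffle(sentences)
    return sentences

paragraph = "The fox ran. The dog slept. The bird sang."
# One of the report's recommended orders: scrambling, then stop-word filtering.
processed = [filter_stop_words(s) for s in scramble_sentences(paragraph)]
```

Stemming would be applied as a further per-word step on `processed` to improve the automatic-metadata service.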
5

Metody pro získávání asociačních pravidel z dat / Methods for Mining Association Rules from Data

Uhlíř, Martin January 2007 (has links)
The aim of this thesis is to implement the Multipass-Apriori method for mining association rules from text data. After an introduction to the field of knowledge discovery, the specific aspects of text mining are discussed. Preprocessing is a very important part of the mining process; in this case, stemming and a stop-word dictionary are necessary. The next part of the thesis deals with the meaning, usage, and generation of association rules. The main part focuses on the description of the implemented Multipass-Apriori method. Based on the executed tests, the optimal way of dividing partitions was determined, as well as the best way of sorting the itemsets. As part of the testing, the Multipass-Apriori method was compared with the Apriori method.
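For context, the baseline Apriori method that the thesis benchmarks against can be sketched as level-wise frequent-itemset mining; this is only the shared candidate-generation core, and the Multipass variant's partitioning of the data (the subject of the thesis) is deliberately omitted:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Baseline Apriori: find all itemsets appearing in at least
    min_support transactions, growing candidates one level at a time."""
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([i]) for t in transactions for i in t}
    frequent, k = {}, 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: build (k+1)-candidates from pairs of frequent k-itemsets.
        candidates = {a | b for a, b in combinations(list(level), 2)
                      if len(a | b) == k + 1}
        k += 1
    return frequent

# Toy "documents as term sets" input, in the spirit of text mining.
freq = apriori([{"stem", "stop"}, {"stem", "rule"},
                {"stem", "stop", "rule"}], min_support=2)
```

Association rules are then derived from `freq` by splitting each frequent itemset into an antecedent and a consequent and checking confidence.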
6

Vyhledávání informací v české Wikipedii / Information Retrieval in Czech Wikipedia

Balgar, Marek January 2011 (has links)
The main task of this Master's thesis is to understand questions of information retrieval and text classification. The research focuses on text data, semantic dictionaries, and especially knowledge inferred from Wikipedia. The thesis also describes the implementation of a querying system based on this knowledge. Finally, the properties and possible improvements of the system are discussed.
7

Metody sumarizace textových dokumentů / Methods of Text Document Summarization

Pokorný, Lubomír January 2012 (has links)
This thesis deals with single-document summarization of text data. Part of it is devoted to data preparation, mainly normalization; some stemming algorithms are listed, and lemmatization is described as well. The main part is devoted to Luhn's summarization method and its extension using the WordNet dictionary. The Oswald summarization method is described and applied as well. The designed and implemented application automatically generates abstracts using these methods. A set of experiments was developed that verified the correct functionality of the application and of the extension to Luhn's summarization method.
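A simplified version of Luhn's sentence scoring can be sketched as follows. The significance criterion here (most frequent non-stop words) and the scoring formula are simplifications of Luhn's original cluster-based measure, and the thesis's WordNet extension, which would enlarge the significant-term set with synonyms, is omitted:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in", "it"}  # toy list

def luhn_summary(text, n_sentences=1, top_terms=3):
    """Pick the sentences densest in significant terms: each sentence
    scores (count of significant words)^2 / sentence length."""
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOP_WORDS]
    significant = {t for t, _ in Counter(words).most_common(top_terms)}

    def score(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        hits = sum(1 for t in tokens if t in significant)
        return hits ** 2 / max(len(tokens), 1)

    return sorted(sentences, key=score, reverse=True)[:n_sentences]

text = ("Stemming reduces words. "
        "Stemming and lemmatization normalize text. "
        "Cats purr.")
summary = luhn_summary(text)
```

Luhn's full method additionally requires significant words to occur in clusters separated by few insignificant words; the density ratio above approximates that idea for short sentences.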
