1 |
Rozpoznávání a propojování pojmenovaných entit / Named Entity Recognition and Linking
Taufer, Pavel, January 2017
The goal of this master's thesis is to design and implement a named entity recognition and linking algorithm. Part of this goal is to propose and create a knowledge base to be used by the algorithm. Because of the limited amount of data for languages other than English, we want to be able to train our method on one language and then transfer the learned parameters to other languages that do not have enough training data. The thesis consists of a description of available knowledge bases and existing methods, followed by the design and implementation of our own knowledge base and entity linking method. Our method achieves state-of-the-art results on a few variants of the AIDA CoNLL-YAGO dataset. It also obtains comparable results on a sample of Czech annotated data from the PDT dataset using parameters trained on the English CoNLL dataset.
|
2 |
Extracting Salient Named Entities from Financial News Articles / Extrahering av centrala entiteter från finansiella nyhetsartiklar
Grönberg, David, January 2021
This thesis explores approaches for extracting company mentions that play a central role in financial news articles. The thesis introduces the task of salient named entity extraction (SNEE): extract all salient named entity mentions in a text document. A neural sequence labeling approach is explored to address the SNEE task in an end-to-end fashion, using both a single-task and a multi-task learning setup. In order to train the models, a new procedure for automatically creating SNEE annotations for an existing news article corpus is explored. The neural sequence labeling approaches are compared against a two-stage approach utilizing NLP parsers, a knowledge base and a salience classifier. Textual features inspired by related work in salient entity detection are evaluated to determine which combination of features results in the highest performance on the SNEE task when used by a salience classifier. The experiments show that the difference in performance between the two-stage approach and the best-performing sequence labeling approach is marginal, demonstrating the potential of the end-to-end sequence labeling approach on the SNEE task.
|
3 |
Creating a Graph Database from a Set of Documents / Skapandet av en grafdatabas från ett set av dokument
Nikolic, Vladan, January 2015
In the context of search, it may be advantageous in some use cases to have documents saved in a graph database rather than a document-oriented database. Graph databases are able to model relationships between objects, in this case documents, in ways that allow for efficient retrieval, as well as search queries that are slightly more specific or complex. This report explores the possibilities of storing an existing set of documents in a graph database. A Named Entity Recognizer was used on a set of news articles in order to extract entities from each news article's body of text. News articles that contain the same entities are then connected to each other in the graph. Ideas to improve this entity extraction are also explored. The method of evaluation used in this report proved not to be ideal for this task, in that it gave only a relative measure, not an absolute one. As such, no absolute answer with regard to the quality of the method can be presented. It is clear that improvements can be made, and the result should be subject to further study.
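The linking step described in this abstract (articles that mention the same entities are connected in the graph) can be sketched as follows. This is a minimal illustration, not the report's implementation: it assumes entity extraction has already been run, and the function name, article identifiers and entity strings are all hypothetical. It uses plain Python structures rather than a graph database.

```python
from collections import defaultdict
from itertools import combinations


def build_shared_entity_graph(article_entities):
    """Connect every pair of articles that share at least one entity.

    `article_entities` maps an article id to the set of entity strings
    found in it. Returns {(article_a, article_b): shared_entities},
    with article_a < article_b so each pair appears once.
    """
    # Invert the mapping: entity -> articles that mention it.
    entity_to_articles = defaultdict(set)
    for article_id, entities in article_entities.items():
        for entity in entities:
            entity_to_articles[entity].add(article_id)

    # Every pair of articles under the same entity gets an edge,
    # labelled with the entities the pair has in common.
    edges = defaultdict(set)
    for entity, article_ids in entity_to_articles.items():
        for a, b in combinations(sorted(article_ids), 2):
            edges[(a, b)].add(entity)
    return dict(edges)


# Hypothetical NER output for four toy articles.
articles = {
    "a1": {"Stockholm", "Volvo"},
    "a2": {"Volvo", "Geely"},
    "a3": {"Stockholm"},
    "a4": {"Oslo"},
}
graph = build_shared_entity_graph(articles)
```

Inverting the mapping first avoids comparing every article with every other article directly; only articles that actually co-mention an entity are paired.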
|
4 |
Using text mining to identify crime patterns from Arabic crime news report corpus
Alruily, Meshrif, January 2012
Most text mining techniques have been proposed only for English text, and even here, most research has been conducted on specific texts related to special contexts within the English language, such as politics, medicine and crime. In contrast, although Arabic is a widely spoken language, few mining tools have been developed to process Arabic text, and some Arabic domains have not been studied at all. Arabic has a very complex morphology because it is highly inflectional, and dealing with texts written in Arabic is therefore highly complicated. This research studies the crime domain in the Arabic language, exploiting unstructured text using text mining techniques. Developing a system for extracting important information from crime reports would be useful for police investigators, both for accelerating the investigative process (instead of reading entire reports) and for conducting further or wider analyses. We propose the Crime Profiling System (CPS) to extract crime-related information (crime type, crime location and nationality of persons involved in the event), automatically construct dictionaries for the existing information, cluster crime documents based on certain attributes and utilize visualisation techniques to assist in crime data analysis. The proposed information extraction approach is novel, and it relies on computational linguistic techniques to identify the abovementioned information, i.e. without using predefined dictionaries (e.g. lists of location names) or an annotated corpus. The language used in crime reporting is studied to identify patterns of interest using a corpus-based approach. Frequency analysis, collocation analysis and concordance analysis are used to perform the syntactic analysis in order to discover the local grammar. Moreover, the Self Organising Map (SOM) approach is adopted in order to perform the clustering and visualisation tasks for crime documents based on crime type, location or nationality.
This clustering technique is improved because only refined data containing meaningful keywords extracted through the information extraction process are fed into it, i.e. the data are cleaned by removing noise. As a result, the quantity of data fed into the SOM is greatly reduced, which saves memory, data loading time and the execution time needed to perform the clustering, so the computation of the SOM is accelerated. Finally, the quantization error is reduced, which leads to high-quality clustering. The outcome of the clustering stage is also visualised, and the system is able to provide statistical information in the form of graphs and tables about crimes committed within certain periods of time and within a particular area.
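The collocation analysis mentioned in this abstract can be illustrated with a small window-based counter. This is only a sketch under stated assumptions: the abstract does not describe the tooling used, so the function name `collocates`, the window size and the English toy sentence (standing in for tokenised Arabic text) are all hypothetical.

```python
from collections import Counter


def collocates(tokens, node, window=2):
    """Count the words occurring within `window` tokens of `node`.

    `tokens` is an already tokenised text; `node` is the word whose
    neighbourhood is inspected (for example a crime-reporting verb).
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the node word itself
                counts[tokens[j]] += 1
    return counts


# Toy English stand-in for a tokenised crime report.
tokens = (
    "police arrested a man in riyadh after "
    "police arrested a suspect in jeddah"
).split()
neighbours = collocates(tokens, "arrested", window=2)
```

Words that recur near the same node word across many reports are candidates for the recurring patterns that a local grammar would encode.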
|
5 |
Benoemde-entiteitherkenning vir Afrikaans [Named-entity recognition for Afrikaans] / G.D. Matthew
Matthew, Gordon Derrac, January 2013
According to the Constitution of South Africa, the government is required to make all information available to the public in the ten indigenous languages of South Africa (excluding English). For this reason, the government made the information that already existed for these ten languages available to the public, and an effort is also being made to increase the amount of resources available in these languages (Groenewald & Du Plooy, 2010). This release of information further helps to implement Krauwer's (2003) idea that there is an inventory of the minimal number of language-related resources required for a language to be competitive at the level of research and teaching. This inventory is known as the "Basic Language Resource Kit" (BLARK). Since most of the languages in South Africa are resource scarce, it is in the best interest of the cultural growth of the country that each of the indigenous South African languages develops its own BLARK. In Chapter 1, the need for the development of an implementable named entity recogniser (NER) for Afrikaans is discussed by first referring to the Constitution of South Africa's (Republic of South Africa, 2003) language policy. Secondly, the guidelines of BLARK (Krauwer, 2003) are discussed, followed by a discussion of an audit that focuses on the number of resources and the distribution of human language technology for all eleven South African languages (Sharma Grover, Van Huyssteen & Pretorius, 2010). The audit conducted by Sharma Grover et al. (2010) established that there is a shortage of text-based tools for Afrikaans. This study addresses this need for text-based tools by focusing on the development of a NER for Afrikaans. In Chapter 2 a description is given of what an entity and a named entity are.
Later in the chapter the process of technology recycling is explained, by referring to other studies where the idea of technology recycling has been applied successfully (Rayner et al., 1997). Lastly, an analysis is done on the differences that may occur between Afrikaans and Dutch named entities. These differences are divided into three categories, namely: identical cognates, non-identical cognates and unrelated entities.
Chapter 3 begins with a description of Frog (Van den Bosch et al., 2007), the Dutch NER used in this study, and the functions and operation of its NER component. This is followed by a description of the Afrikaans-to-Dutch converter (A2DC) (Van Huyssteen & Pilon, 2009), and finally the various experiments that were completed are explained. The study consists of six experiments, the first of which was to determine the results of Frog on Dutch data. The second experiment evaluated the effectiveness of Frog on unchanged (raw) Afrikaans data. The following two experiments evaluated the results of Frog on "Dutched" Afrikaans data. The last two experiments evaluated the effectiveness of Frog on raw and "Dutched" Afrikaans data with the addition of gazetteers as part of the pre-processing step. In conclusion, a summary is given of the comparisons between the NER for Afrikaans that was developed in this study and the NER component that Puttkammer (2006) used in his tokeniser. Finally, a few suggestions for future research are proposed. / MA (Applied Language and Literary Studies), North-West University, Vaal Triangle Campus, 2013
|
7 |
Using Freebase, An Automatically Generated Dictionary, And A Classifier To Identify A Person's Profession In Tweets
Hall, Abraham, 01 January 2013
Algorithms for classifying pre-tagged person entities in tweets into one of eight profession categories are presented. A classifier using a semi-supervised learning algorithm that takes into consideration the local context surrounding the entity in the tweet, hashtag information, and topic signature scores is described. In addition to the classifier, this research investigates two dictionaries containing the professions of persons. These two dictionaries are used in their own classification algorithms, which are independent of the classifier. The method for creating the first dictionary dynamically from the web, and the algorithm that accesses this dictionary to classify a person into one of the eight profession categories, are explained next. The second dictionary is Freebase, an openly available online database that is maintained by its online community. The algorithm that uses Freebase for classifying a person into one of the eight professions is described. The results show that classifications made using the automatically constructed dictionary, Freebase, or the classifier are all moderately successful. The results also show that classifications made with the automatically constructed person dictionary are slightly more accurate than classifications made using Freebase. Various hybrid methods combining the classifier and the two dictionaries are also explained. The results of those hybrid methods show significant improvement over any of the individual methods.
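The abstract does not say how the hybrid methods combine the two dictionaries with the classifier. One plausible arrangement, shown here purely as an assumption, is to consult the dictionaries in order and fall back to the context classifier when neither contains the person. Every name, label and data item below is hypothetical, and the keyword rule merely stands in for the semi-supervised classifier.

```python
def classify_profession(name, context, dictionaries, classifier):
    """Return a profession label for `name`.

    Dictionaries are tried in the given order and the first hit wins;
    if none of them knows the person, the context classifier decides.
    """
    key = name.strip().lower()
    for lookup in dictionaries:
        label = lookup.get(key)
        if label is not None:
            return label
    return classifier(context)


# Hypothetical stand-ins for the web-built dictionary and Freebase.
web_dictionary = {"serena williams": "athlete"}
freebase_like = {"serena williams": "athlete", "angela merkel": "politician"}


def keyword_classifier(context):
    """Toy placeholder for the semi-supervised context classifier."""
    return "musician" if "album" in context.lower() else "unknown"


sources = [web_dictionary, freebase_like]
from_dictionary = classify_profession(
    "Angela Merkel", "met with ministers today", sources, keyword_classifier
)
from_classifier = classify_profession(
    "Some Person", "new album out now", sources, keyword_classifier
)
```

Placing the web-built dictionary first reflects the abstract's finding that it was slightly more accurate than Freebase, but the ordering is a design choice of this sketch, not something the source specifies.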
|
8 |
Person Name Recognition In Turkish Financial Texts By Using Local Grammar Approach
Bayraktar, Ozkan, 01 September 2007
Named entity recognition (NER) is the task of identifying the named entities (NEs) in texts and classifying them into semantic categories such as person, organization, and place names and time, date, monetary, and percent expressions. NER has two principal aims: identification of NEs and classification of them into semantic categories. The local grammar (LG) approach has recently been shown to be superior to other NER techniques, such as the probabilistic approach, the symbolic approach, and the hybrid approach, in terms of being able to work with untagged corpora. Unlike most other NER systems, the LG approach does not require any dictionaries or gazetteers, which are lists of proper nouns (PNs) used in NER applications. As a consequence, it is able to recognize NEs in previously unseen texts at minimal cost. Most NER systems are costly due to manual rule compilation, especially in large tagged corpora. They also require some semantic and syntactic analyses to be applied before the pattern generation process, which can be avoided by using the LG approach.
In this thesis, we tried to acquire LGs for person names from a large untagged Turkish financial news corpus by using an approach recently applied successfully to a Reuters financial English news corpus by H. N. Traboulsi. We explored its applicability to the Turkish language by using frequency, collocation, and concordance analyses. In addition, we constructed a list of Turkish reporting verbs. This is an important part of the study because there is no major study of reporting verbs in Turkish.
|
9 |
Outomatiese Afrikaanse tekseenheididentifisering [Automatic Afrikaans tokenisation] / deur Martin J. Puttkammer
Puttkammer, Martin Johannes, January 2006
An important core technology in the development of human language technology applications is an automatic morphological analyser. Such a morphological analyser consists of various modules, one of which is a tokeniser. At present no tokeniser exists for Afrikaans and it has therefore been impossible to develop a morphological analyser for Afrikaans. Thus, in this research project such a tokeniser is being developed, and the project therefore has two objectives: i) to postulate a tag set for integrated tokenisation, and ii) to develop an algorithm for integrated tokenisation.
In order to achieve the first objective, a tag set for the tagging of sentences, named entities, words, abbreviations and punctuation is proposed specifically for the annotation of Afrikaans texts. It consists of 51 tags, which can be expanded in future in order to establish a larger, more specific tag set. The postulated tag set can also be simplified according to the level of specificity required by the user.
It is subsequently shown that an effective tokeniser cannot be developed using only linguistic or only statistical methods. This is due to the complexity of the task: rule-based modules should be used for certain processes (for example sentence recognition), while other processes (for example named-entity recognition) can only be executed successfully by means of a machine-learning module. It is argued that a hybrid system (a system where rule-based and statistical components are integrated) would achieve the best results on Afrikaans tokenisation.
Various rule-based and statistical techniques, including a TiMBL-based classifier, are then employed to develop such a hybrid tokeniser for Afrikaans. The final tokeniser achieves an f-score of 97.25% when the complete set of tags is used. For sentence recognition an f-score of 100% is achieved. The tokeniser also recognises 81.39% of named entities. When a simplified tag set (consisting of only 12 tags) is used to annotate named entities, the f-score rises to 94.74%.
The conclusion of the study is that a hybrid approach is indeed suitable for Afrikaans sentencisation, named-entity recognition and tokenisation. The tokeniser will improve if it is trained with more data, while the expansion of gazetteers as well as the tag set will also lead to a more accurate system. / Thesis (M.A. (Applied Language and Literary Studies))--North-West University, Potchefstroom Campus, 2006.
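For readers unfamiliar with the f-score figures quoted in this abstract, the sketch below shows how such a score is commonly computed from precision and recall. The abstract does not state which variant or matching criterion the thesis used, so the balanced F1 default and exact-match comparison of (start, end, tag) spans are assumptions, and the spans and tag names are invented.

```python
def f_score(gold, predicted, beta=1.0):
    """Compute the F-beta score for two sets of annotations.

    `gold` and `predicted` are sets of hashable items, here
    (start, end, tag) spans; an item is correct only if it matches
    a gold item exactly. beta=1.0 gives the balanced F1 score.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Invented annotations: 4 gold spans, 3 predicted, 2 of them correct.
gold_spans = {(0, 2, "NAME"), (5, 6, "ABBR"), (9, 9, "PUNCT"), (12, 14, "NAME")}
predicted_spans = {(0, 2, "NAME"), (5, 6, "ABBR"), (10, 11, "NAME")}
score = f_score(gold_spans, predicted_spans)
```

Here precision is 2/3 and recall is 2/4, so the balanced score is their harmonic mean, 4/7.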
|