1 |
Keywords in the mist: Automated keyword extraction for very large documents and back of the book indexing. Csomai, Andras 05 1900 (has links)
This research addresses the problem of automatic keyphrase extraction from large documents and back of the book indexing. The potential benefits of automating this process are far-reaching, from improving information retrieval in digital libraries, to saving countless man-hours by helping professional indexers create back of the book indexes. The dissertation introduces a new methodology to evaluate automated systems, which allows for a detailed, comparative analysis of several techniques for keyphrase extraction. We introduce and evaluate both supervised and unsupervised techniques, designed to balance the resource requirements of an automated system and the best achievable performance. Additionally, a number of novel features are proposed, including a statistical informativeness measure based on chi statistics; an encyclopedic feature that taps into the vast knowledge base of Wikipedia to establish the likelihood of a phrase referring to an informative concept; and a linguistic feature based on sophisticated semantic analysis of the text using current theories of discourse comprehension. The resulting keyphrase extraction system is shown to outperform the current state of the art in supervised keyphrase extraction by a large margin. Moreover, a fully automated back of the book indexing system based on the keyphrase extraction system was shown to lead to back of the book indexes closely resembling those created by human experts.
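To make the encyclopedic feature concrete, it can be pictured as a "keyphraseness" ratio: how often a phrase is used as a Wikipedia link anchor relative to how often it appears at all. The short Python sketch below illustrates that idea; the dictionary names and counts are invented placeholders, not the dissertation's data or code.

# Hypothetical precomputed Wikipedia statistics (placeholder values).
anchor_counts = {"information retrieval": 1200, "mist": 15}        # phrase used as a link anchor
occurrence_counts = {"information retrieval": 1500, "mist": 9000}  # phrase occurring anywhere

def keyphraseness(phrase: str) -> float:
    """Estimate P(phrase is linked | phrase occurs) from the precomputed counts."""
    occurrences = occurrence_counts.get(phrase, 0)
    if occurrences == 0:
        return 0.0
    return anchor_counts.get(phrase, 0) / occurrences

for phrase in ("information retrieval", "mist"):
    print(phrase, round(keyphraseness(phrase), 3))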
|
2 |
An Investigation into User Text Query and Text Descriptor Construction Pfitzner, Darius Mark, pfit0022@flinders.edu.au January 2009 (has links)
Cognitive limitations such as those described in Miller's (1956) work on channel capacity and Cowan's (2001) work on short-term memory are factors in determining user cognitive load and, in turn, task performance. Inappropriate user cognitive load can reduce user efficiency in goal realization. For instance, if the user's attentional capacity is not appropriately applied to the task, distractor processing tends to appropriate capacity from it. Conversely, if a task drives users beyond their short-term memory envelope, information loss may occur in its translation to long-term memory and subsequent retrieval for task-based processing.
To manage user cognitive capacity in the task of text search, the interface should allow users to draw on their powerful and innate pattern recognition abilities. This harmonizes with Johnson-Laird's (1983) proposal that propositional representation is tied to mental models. Combined with the theory that knowledge is highly organized when stored in memory, an appropriate approach to cognitive load optimization would be to graphically present single documents, or clusters thereof, with an appropriate number and type of descriptors. These descriptors are commonly words and/or phrases.
Information theory research suggests that words have different levels of importance in document topic differentiation. Although keyword identification is well researched, there is a lack of basic research into human preference regarding query formation and the heuristics users employ in search. This lack extends to features as elementary as the number of words preferred to describe and/or search for a document. Understanding these preferences will help balance the processing overheads of tasks like clustering against user cognitive load, realizing a more efficient document retrieval process. Common approaches such as search engine log analysis cannot provide this degree of understanding and do not allow clear identification of the intended set of target documents.
This research endeavours to improve the manner in which text search returns are presented so that user performance in real-world situations is enhanced. To this end we explore both how to present search information and results graphically to facilitate optimal cognitive and perceptual load/utilization, and how people use textual information in describing documents or constructing queries.
|
3 |
一個對單篇中文文章擷取關鍵字之演算法 / A Keyword Extraction Algorithm for Single Chinese Document 吳泰勳, Wu, Tai Hsun Unknown Date (has links)
In the past 14 years, the Taiwan e-Learning and Digital Archives Program has digitized and archived national cultural and natural assets under 15 topics, such as biology, archaeology, and geology. The goal of the work presented in this thesis is to automatically extract keywords from documents in the digital archives, and the techniques developed along with the work can be used to build a connection between digital archives and news articles. Because there are always new words or new uses of words in news articles, in this thesis we propose an algorithm that can automatically extract topic keywords from a single Chinese document without using a corpus or dictionary. Given a document in Chinese, the algorithm first uses a bigram-based approach to divide the text into bigrams of Chinese characters, so the smallest unit of a term is two characters (e.g., 「中文」). Next, the algorithm calculates term frequencies of the bigrams, filters out those with low term frequencies, and clusters the remaining frequent terms. Finally, the algorithm calculates chi-square values to produce the keywords that are most related to the topic of the given document. The distribution of word co-occurrence in a document is an important signal: if a term's co-occurrence distribution with the frequent terms deviates strongly from what would be expected, the term is very likely a keyword. Unlike English, where sentences contain explicit delimiters between words, Chinese sentences have no spaces between characters, which makes Chinese word segmentation considerably more difficult. The proposed algorithm performs Chinese word segmentation with the bigram-based approach, and we compare the segmented words with those given by CKIP and the Stanford Chinese Segmenter under different settings: whether or not infrequent terms are filtered out, whether or not frequent terms are clustered by a clustering algorithm, and whether keywords are scored by chi-square values or by term frequency. The dataset used in the experiments is taken from the Academia Sinica Digital Resources site, and the ground truth is provided by Gainwisdom, which was developed by the Computer Systems and Communication Lab of the Institute of Information Science, Academia Sinica. According to the experimental results, some of the segmented words given by the bigram-based approach are the same as those given by CKIP or the Stanford Chinese Segmenter, while some of them have stronger connections to the topics of the documents. The main advantage of the bigram-based approach is that it does not require a corpus or dictionary. Finally, the proposed algorithm was developed with the aim of promoting the digital archives: by extracting topic keywords from articles on currently popular topics and linking those keywords to related archive materials, it is hoped that the algorithm can drive a new wave of interest in the digital archives.
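As an illustration of the dictionary-free pipeline described above (bigram segmentation, frequent-term selection, and chi-square scoring of co-occurrence deviation), the Python sketch below shows the general technique on a toy input. The sentence-level co-occurrence counting and the exact chi-square formulation are simplifying assumptions for illustration, not the author's implementation.

from collections import Counter, defaultdict

def character_bigrams(sentence):
    chars = [c for c in sentence if not c.isspace()]
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]

def chi_square_keywords(sentences, num_frequent=10, top_k=5):
    sentence_terms = [set(character_bigrams(s)) for s in sentences]
    frequency = Counter(t for terms in sentence_terms for t in terms)
    frequent = [t for t, _ in frequency.most_common(num_frequent)]

    # co[t][g]: number of sentences in which candidate t co-occurs with frequent term g.
    co = defaultdict(Counter)
    for terms in sentence_terms:
        for t in terms:
            for g in frequent:
                if g in terms and g != t:
                    co[t][g] += 1

    # Expected co-occurrence with g is proportional to g's share of all co-occurrences;
    # large deviations from this expectation suggest the candidate is topical.
    total_per_frequent = Counter()
    for t in co:
        for g, count in co[t].items():
            total_per_frequent[g] += count
    grand_total = sum(total_per_frequent.values()) or 1

    scores = {}
    for t, counts in co.items():
        n_t = sum(counts.values())
        chi_square = 0.0
        for g in frequent:
            expected = n_t * total_per_frequent[g] / grand_total
            if expected > 0:
                chi_square += (counts[g] - expected) ** 2 / expected
        scores[t] = chi_square
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

print(chi_square_keywords(["數位典藏計畫典藏國家文物", "關鍵字連結數位典藏資料與時事"]))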
|
4 |
Summarization and keyword extraction on customer feedback data : Comparing different unsupervised methods for extracting trends and insight from text Skoghäll, Therése, Öhman, David January 2022 (has links)
Polestar has more than doubled its amount of customer feedback during the last couple of months, and this amount is forecast to increase even more. Manually reading this feedback is expensive and time-consuming, and for this reason there is a need to analyse the customer feedback automatically. The company wants to understand the customer and extract the trends and topics that concern the consumer in order to improve the customer experience. Over the last couple of years, as natural language processing has developed immensely, new state-of-the-art language models have pushed the boundaries in all types of benchmark tasks. In this thesis, three different extractive summarization models and three different keyword extraction methods have been tested and evaluated for extracting information from text, based on two quantitative measures and human evaluation. This master thesis has shown that extractive summarization models with a Transformer-based text representation are best at capturing the context of a text. Based on the quantitative results and the company's needs, TextRank with a Transformer-based embedding was chosen as the final extractive summarization model. For keyword extraction, the best overall model was YAKE!, based on the quantitative measures and human validation.
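As an illustration of the chosen combination, TextRank run over Transformer-based sentence embeddings, the Python sketch below ranks sentences by PageRank over a cosine-similarity graph. The embedding model name and the use of the sentence-transformers and networkx libraries are assumptions made for this example, not necessarily the setup used in the thesis.

import networkx as nx
from sentence_transformers import SentenceTransformer, util

def summarize(sentences, num_sentences=2):
    # Embed every sentence with an assumed general-purpose transformer model.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(sentences, convert_to_tensor=True)
    # Build a fully connected graph weighted by cosine similarity between sentences.
    similarity = util.cos_sim(embeddings, embeddings).cpu().numpy()
    graph = nx.from_numpy_array(similarity)
    # TextRank is PageRank run over this sentence graph.
    scores = nx.pagerank(graph)
    ranked = sorted(range(len(sentences)), key=scores.get, reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(ranked)]  # preserve original sentence order

print(summarize([
    "Customer feedback volume has more than doubled.",
    "Reading all feedback manually is expensive and slow.",
    "Extractive summarization picks the most central sentences automatically.",
]))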
|
5 |
Using WordNet Synonyms and Hypernyms in Automatic Topic Detection Wargärde, Nicko January 2020 (has links)
Detecting topics by extracting keywords from written text using TF-IDF has been studied and successfully used in many applications. Adding a semantic layer to TF-IDF-based topic detection using WordNet synonyms and hypernyms has been explored in document clustering, either by assigning concepts that describe texts or by adding all synonyms and hypernyms of the occurring words to a list of keywords. A new method, where TF-IDF scores are calculated and WordNet synset members' TF-IDF scores are added to all occurring synonyms and/or hypernyms, is explored in this paper. Here, such an approach is evaluated by comparing keywords extracted using TF-IDF and the newly proposed method, SynPlusTF-IDF, against manually assigned keywords in a database of scientific abstracts. As topic detection is widely used in many contexts and applications, improving current methods is of great value, since the methods can become more accurate at extracting correct and relevant keywords from written text. An experiment was conducted comparing the two methods, with their accuracy measured using precision and recall and by calculating F1-scores. The F1-scores ranged from 0.11131 to 0.14264 for different variables, and the results show that SynPlusTF-IDF is not better at topic detection than TF-IDF; both methods performed poorly at topic detection with the chosen dataset.
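A condensed Python sketch of the SynPlusTF-IDF idea follows: compute ordinary TF-IDF scores, then add each term's score onto its WordNet synonyms and hypernyms. The exact aggregation and candidate selection in the paper may differ; this is an illustrative approximation using standard NLTK and scikit-learn calls.

from collections import defaultdict
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet") beforehand
from sklearn.feature_extraction.text import TfidfVectorizer

def syn_plus_tfidf(documents, doc_index=0, use_hypernyms=True, top_k=10):
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(documents)
    terms = vectorizer.get_feature_names_out()
    base_scores = dict(zip(terms, tfidf[doc_index].toarray().ravel()))

    boosted = defaultdict(float, base_scores)
    for term, score in base_scores.items():
        if score == 0:
            continue
        related = set()
        for synset in wn.synsets(term):
            related.update(lemma.name() for lemma in synset.lemmas())            # synonyms
            if use_hypernyms:
                for hypernym in synset.hypernyms():
                    related.update(lemma.name() for lemma in hypernym.lemmas())  # hypernyms
        for word in related - {term}:
            boosted[word] += score  # propagate the term's TF-IDF weight to related words
    return sorted(boosted.items(), key=lambda item: item[1], reverse=True)[:top_k]

docs = ["Neural networks learn hierarchical representations from data.",
        "Topic detection assigns keywords to scientific abstracts."]
print(syn_plus_tfidf(docs, doc_index=1))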
|
6 |
Student Interaction Network Analysis on Canvas LMS Desai, Urvashi 01 May 2020 (has links)
No description available.
|
7 |
CE Standard Documents Keyword Extraction and Comparison Between Different Machine Learning Methods Huang, Junhao January 2018 (has links)
Conformité Européenne (CE) approval is a complex task for producers in Europe. The producers need to search for the necessary standard documents and carry out the tests themselves. CE-CHECK is a website which provides a document searching service, and the company's engineers want to use machine learning methods to analyse the documents, with the results used to improve the searching system. The first task is to construct an automatic keyword extraction system to analyse the standard documents. This paper applied three different machine learning methods to this task and tested their performance: Conditional Random Field (CRF), joint-layer Recurrent Neural Network (RNN), and bidirectional Long Short-Term Memory network (Bi-LSTM). CRF is a traditional probabilistic model which is widely used in sequential processing. RNN and LSTM are neural network models which have shown impressive performance on natural language processing in recent years. The result of the tests was that Bi-LSTM had the best performance: its keyword extraction recall was 76.97%, while RNN reached 72.99% and CRF 70.18%. In conclusion, Bi-LSTM is the best model for this keyword extraction task, and its accuracy is high enough to provide reliable results. The model is also robust, performing well on documents from different fields. The Bi-LSTM model can analyse all documents in less than five minutes, whereas manual work takes months, so it saves both time and cost. The results can be used in the searching system and in further document analysis.
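For readers unfamiliar with the Bi-LSTM setup, keyword extraction can be framed as sequence labelling, with each token tagged as being part of a keyword or not. The PyTorch sketch below shows such a tagger; the layer sizes and the simple two-tag scheme are assumptions for illustration, not the thesis configuration.

import torch
import torch.nn as nn

class BiLSTMKeywordTagger(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_tags=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_tags)  # keyword / non-keyword

    def forward(self, token_ids):                # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)     # (batch, seq_len, embed_dim)
        outputs, _ = self.lstm(embedded)         # (batch, seq_len, 2 * hidden_dim)
        return self.classifier(outputs)          # per-token tag logits

# Tiny smoke test with random token ids and random tags.
model = BiLSTMKeywordTagger(vocab_size=5000)
logits = model(torch.randint(1, 5000, (4, 20)))  # 4 sentences, 20 tokens each
loss = nn.CrossEntropyLoss()(logits.view(-1, 2), torch.randint(0, 2, (80,)))
loss.backward()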
|
8 |
Towards terminology-based keyword extraction Krassow, Cornelius January 2022 (has links)
The digitization of information has provided an overflow of data in many areas of society, including the clinical sector. However, confidentiality issues concerning the privacy of both clinicians and patients have hampered research into how to best deal with this kind of "clinical" data. An example of clinical data which can be found in abundance is the Electronic Medical Record, or EMR for short. EMRs contain information about a patient's medical history, such as summaries of earlier visits, prescribed medications and more. These EMRs can be quite extensive, and reading them in full can be time-consuming, especially considering the often hectic nature of hospital work. Giving clinicians the ability to gain insight into what information is of importance when dealing with extensive EMRs might therefore be very useful. Keyword extraction methods, developed in the field of language technology, aim to automatically extract the most important terms or phrases from a text. Successfully applying these methods to EMR data could lend clinicians a helping hand when short on time. Clinical data are very domain-specific, however, requiring different kinds of expert knowledge depending on what field of medicine is being investigated. Due to the scarcity of research on not only clinical keyword extraction but clinical data as a whole, foundational groundwork in how to best deal with the domain-specific demands of a clinical keyword extractor needs to be laid. By exploring how the two unsupervised approaches YAKE! and KeyBERT deal with the domain-specific task of implant-focused keyword extraction, the limitations of clinical keyword extraction are tested. Furthermore, the performance of a general BERT model is compared with that of a model fine-tuned on domain-specific data. Finally, an attempt is made to create a domain-specific set of gold-standard keywords by combining unsupervised approaches to keyword extraction. The results show that unsupervised approaches perform poorly when dealing with domain-specific tasks that do not have a clear correlation to the main domain of the data. Fine-tuned BERT models seem to perform almost as well as a general model when tasked with implant-focused keyword extraction, although further research is needed. Finally, the use of unsupervised approaches in conjunction with manual evaluations provided by domain experts shows some promise.
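Both unsupervised extractors examined in the thesis are available as open-source Python packages, and a minimal usage sketch looks roughly as follows; the example sentence and parameter choices are invented for illustration.

import yake
from keybert import KeyBERT

text = ("The patient received a cemented hip implant; "
        "the prosthesis was fixed without complications.")

# YAKE!: statistical, corpus-free extraction; lower scores mean more relevant phrases.
yake_extractor = yake.KeywordExtractor(lan="en", n=2, top=5)
print(yake_extractor.extract_keywords(text))

# KeyBERT: embeds the document and candidate phrases with a BERT-style model
# and ranks candidates by cosine similarity to the document.
kb = KeyBERT()  # defaults to a general-purpose sentence-transformer model
print(kb.extract_keywords(text, keyphrase_ngram_range=(1, 2), top_n=5))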
|
9 |
Analysis of Remarks Using Clustering and Keyword Extraction : Clustering Remarks on Electrical Installations and Identifying the Clusters by Extracting Keywords / Analys av anmärkningar med hjälp av klustring och extrahering av nyckelord : Klustring av anmärkningar på elektriska installationer och identifiering av klustren med hjälp av extrahering av nyckelord Stiff, Philip January 2018 (has links)
Nowadays it is common for companies to gather and sit on a lot of data related to their business. This data is often too large to be analyzed by hand, and it is therefore becoming more and more common to automate the analysis, e.g. by running machine learning methods on the data. In this project we attempt to analyze an unstructured dataset consisting of remarks, found by inspectors, on electrical installations. This is done by first clustering the dataset, with the goal of having each cluster represent a specific type of error found in the field, and then extracting ten keywords from each cluster. We investigate whether these keywords can be used to represent the clusters' contents in a way that could be useful for a future end-user application. The solution developed in this project was evaluated through a form in which the respondents were shown example remarks from a random subset of clusters and got to evaluate both how well the extracted keywords matched the examples and to what degree the example remarks from the same cluster represented the same kind of error. We received a total of 22 responses, from 8 professional inspectors and 14 laymen. Our results show that the extracted keywords make sense in connection with the example remarks from the form and show promise in describing the content of a cluster. Also, for a majority of the clusters a clear consensus can be seen among the respondents on which keywords they considered relevant. However, the average number of keywords that the respondents considered relevant for each remark (1.40) was deemed too low for us to be able to recommend the solution. Additionally, the clustering quality follows the same pattern: it shows promise but does not give fully satisfactory results in this study. For future work, a larger study should be conducted in which several combinations of clustering and keyword extraction methods are evaluated more thoroughly, in order to draw more decisive conclusions.
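The pipeline described above can be sketched in Python roughly as follows: vectorize the free-text remarks, cluster them, and describe each cluster by its ten highest-weighted terms. The invented Swedish example remarks and the specific choice of TF-IDF with k-means are assumptions for illustration; the thesis may have combined other clustering and keyword extraction methods.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

remarks = [
    "Kabel saknar dragavlastning vid uttag",
    "Jordfelsbrytare fungerar ej vid test",
    "Dragavlastning saknas på kabel i central",
    "Jordfelsbrytare löser inte ut vid provning",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(remarks)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(matrix)

terms = vectorizer.get_feature_names_out()
for cluster_id, centroid in enumerate(kmeans.cluster_centers_):
    top_terms = [terms[i] for i in centroid.argsort()[::-1][:10]]  # ten keywords per cluster
    print(cluster_id, top_terms)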
|
10 |
Test Case Selection from Test Specifications using Natural Language Processing Gupta, Alok January 2023 (has links)
The Cloud Radio Access Network (RAN) is a groundbreaking technology employed in the telecommunications industry, offering flexible, scalable, and cost-effective solutions for seamless wireless network services. However, testing Cloud RAN applications presents significant challenges due to their complexity, potentially leading to delays and increased costs. A paramount solution to overcome these obstacles is test automation. Automating the testing process not only dramatically reduces manual effort but also enhances testing accuracy and efficiency, expediting the delivery of high-quality products. In the current era of cutting-edge advancements, artificial intelligence (AI) and machine learning (ML) play a transformative role in revolutionizing Cloud RAN testing. These innovative technologies enable rapid identification and resolution of complex issues, surpassing traditional methods. The objective of this thesis is to adopt an AI-enabled approach to Cloud RAN test automation, harnessing the potential of machine learning and natural language processing (NLP) techniques to automatically select test cases from test instructions. Relevant keywords are extracted from the test instructions using NLP techniques, and the performance of three keyword extraction methods is compared, with spaCy proving to be the superior keyword extractor. Using the extracted keywords, test script prediction is conducted through two distinct approaches: using test script names and using test script contents. In both cases, Random Forest emerges as the top-performing model, showcasing its effectiveness with diverse datasets, regardless of oversampling or undersampling data augmentation techniques. A rule-based approach is then used to determine the order of execution among the predicted test scripts. The research findings highlight the significant impact of AI and ML techniques in streamlining test case selection and automation for Cloud RAN applications. The proposed AI-enabled approach optimizes the testing process, resulting in faster product delivery, reduced manual workload, and overall cost savings.
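A rough Python sketch of the two stages described above, keyword extraction from a test instruction with spaCy followed by test-script prediction with a Random Forest, is given below. The instruction texts, script labels, and model choices are invented placeholders, not the thesis's actual data or code.

import spacy
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

nlp = spacy.load("en_core_web_sm")  # assumed small English model, installed separately

def extract_keywords(instruction: str) -> str:
    """Keep noun chunks as a crude keyword representation of the instruction."""
    doc = nlp(instruction.lower())
    return " ".join(chunk.text for chunk in doc.noun_chunks if not chunk.root.is_stop)

# Hypothetical training data: instructions paired with the script that covers them.
instructions = [
    "Verify cell setup succeeds after baseband restart",
    "Check handover between two cells under load",
]
test_scripts = ["cell_setup_test", "handover_test"]

model = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=0))
model.fit([extract_keywords(text) for text in instructions], test_scripts)
print(model.predict([extract_keywords("Verify handover during high load")]))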
|