1 |
A study of model parameters for scaling up word to sentence similarity tasks in distributional semantics. Milajevs, Dmitrijs. January 2018 (has links)
Representation of sentences that captures semantics is an essential part of natural language processing systems, such as information retrieval or machine translation. The representation of a sentence is commonly built by combining the representations of the words it consists of. Similarity between words is widely used as a proxy to evaluate semantic representations. Word similarity models are well studied and are shown to correlate positively with human similarity judgements. Current evaluation of models of sentential similarity builds on the results obtained in lexical experiments. The main focus is how the lexical representations are used, rather than what they should be. It is often assumed that the optimal representations for word similarity are also optimal for sentence similarity. This work discards that assumption and systematically looks for lexical representations that are optimal for measuring similarity between sentences. We find that the best representation for word similarity is not always the best for sentence similarity, and vice versa. The best models in word similarity tasks perform best with additive composition. However, the best result on compositional tasks is achieved with Kronecker-based composition. There are representations that are equally good in both tasks when used with multiplicative composition. The systematic study of the parameters of similarity models reveals that the more information lexical representations contain, the more attention should be paid to noise. In particular, word vectors in models with a feature size on the order of the vocabulary size should be sparse, but if a small number of context features is used then the vectors should be dense. Given the right lexical representations, compositional operators achieve state-of-the-art performance, improving over models that use neural word embeddings. To avoid overfitting, either several test datasets should be used or parameter selection should be based on the parameters' average behaviour.
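To make the composition operators concrete, here is a minimal NumPy sketch of additive, multiplicative, and Kronecker-based composition (the vectors below are illustrative stand-ins, not the thesis's trained lexical representations):

```python
import numpy as np

def additive(u, v):
    # Element-wise sum: the standard additive composition baseline.
    return u + v

def multiplicative(u, v):
    # Element-wise (Hadamard) product composition.
    return u * v

def kronecker(u, v):
    # Kronecker product: two d-dimensional word vectors compose into a
    # d*d-dimensional representation of the pair.
    return np.kron(u, v)

def cosine(a, b):
    # Similarity between two composed representations.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

u, v = np.random.rand(50), np.random.rand(50)  # toy word vectors
print(kronecker(u, v).shape)                   # (2500,)
print(cosine(additive(u, v), multiplicative(u, v)))
```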
|
2 |
詞義相似度的社會網路分析研究 / A study on word similarity with social network analysis. 溫文喆. Unknown Date (has links)
Social network analysis represents social relations in the form of networks. Originally a tool purely for analysing social interaction, it has in recent years been widely applied in sociology, organisational research, information science, biology, linguistics, and many other fields. By drawing on mathematical graph theory and ever-improving computing power, social network analysis can uncover regularities in the behaviour of individuals from perspectives unavailable to earlier approaches. Word similarity, meanwhile, is one of the foundational topics underlying technologies such as information retrieval, and many methods for measuring it have been proposed in recent years.
This study applies social network analysis to English words. It proposes different ways of constructing networks, using corpora as the data source and defining network nodes and link relations, with co-occurrence networks as the basis. By varying the conditions under which the networks are generated and filtered, it examines whether adjusting established social network analysis properties or indices can provide an alternative way of measuring word similarity. The resulting networks and computed properties are then validated against established synonym benchmarks from existing word similarity research, and the applicability of social network analysis to word similarity research is discussed further.
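As a rough illustration of the co-occurrence-network idea, the following Python sketch builds a network from a corpus and derives a similarity score from node neighbourhoods (the window size, frequency threshold, and choice of neighbourhood overlap as the measure are assumptions for illustration, not the thesis's actual construction):

```python
import networkx as nx
from collections import Counter

def cooccurrence_graph(sentences, window=2, min_count=2):
    # Count co-occurrences of word pairs within a fixed window,
    # then keep only pairs seen at least min_count times.
    counts = Counter()
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for u in tokens[i + 1:i + 1 + window]:
                if w != u:
                    counts[tuple(sorted((w, u)))] += 1
    g = nx.Graph()
    g.add_weighted_edges_from(
        (a, b, c) for (a, b), c in counts.items() if c >= min_count)
    return g

def neighbour_similarity(g, w1, w2):
    # Jaccard overlap of the two words' neighbourhoods as one possible
    # network-based similarity measure.
    n1, n2 = set(g[w1]), set(g[w2])
    return len(n1 & n2) / len(n1 | n2) if n1 | n2 else 0.0
```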
|
3 |
Finding Synonyms in Medical Texts: Creating a system for automatic synonym extraction from medical texts. Cederblad, Gustav. January 2018 (has links)
This thesis describes the work of creating an automatic system for identifying synonyms and semantically related words in medical texts. Before this work, as a part of the project E-care@home, medical texts had been classified as either lay or specialized by both a lay annotator and an expert annotator. The lay annotator, in this case, is a person without any medical knowledge, whereas the expert annotator has professional knowledge in medicine. Using these texts made it possible to create co-occurrence matrices from which the related words could be identified. Fifteen medical terms were chosen as system input. The Dice similarity of these words in a context window of ten words around them was calculated. As output, five candidate related terms for each medical term were returned. Only unigrams were considered. The candidate related terms were evaluated using a questionnaire in which 223 healthcare professionals rated the similarity on a scale from one to five. A Fleiss kappa test showed that the agreement among these raters was 0.28, which is fair agreement. The evaluation further showed a significant correlation between the human ratings and the relatedness score (Dice similarity): words with higher Dice similarity tended to get a higher human rating. However, the Dice similarity interval in which the words got the highest average human rating was 0.35-0.39. This result means that there is much room for improving the system. Further development of the system should remove the unigram limitation and expand the corpus to provide more accurate and reliable results.
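A small sketch of the Dice-based relatedness score described above (the tokenisation, exact counting scheme, and example term are assumptions for illustration):

```python
from collections import Counter

def dice_scores(tokens, target, window=10):
    # Dice coefficient 2*f(x,y) / (f(x) + f(y)) between the target term
    # and every unigram co-occurring with it in a +/-window context.
    freq = Counter(tokens)
    co = Counter()
    for i, w in enumerate(tokens):
        if w == target:
            for u in tokens[max(0, i - window):i + window + 1]:
                if u != target:
                    co[u] += 1
    return {u: 2 * c / (freq[target] + freq[u]) for u, c in co.items()}

# Five candidate related terms per input term, mirroring the system's output:
# top5 = sorted(dice_scores(tokens, "diabetes").items(),
#               key=lambda kv: -kv[1])[:5]
```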
|
4 |
Zpracování obrazu a automatické řešení křížovek / Image processing and automatic solving of crosswords. Kobath, Martin. January 2017 (has links)
This Master’s thesis focuses on the development of an Android application that can recognize the grid and clue text of a Swedish crossword and find a solution for it in a crossword dictionary. The thesis proposes a process of tile segmentation using corner detection, recognizes text using Tesseract OCR, and searches for solutions in a local database. The developed application is tested on a gallery of photos captured with a mobile phone. The tile segmentation and solution searching provide good results; the rest of the process provides unsatisfactory results due to inaccurate OCR output.
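A rough sketch of the recognition pipeline described above, using OpenCV corner detection and Tesseract OCR (library choices and parameters are assumptions; the thesis's own implementation targets Android, and deriving tile crops from the detected corners is elided here):

```python
import cv2
import pytesseract

def recognize_crossword(image_path):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Corner detection as a basis for locating the grid tiles.
    corners = cv2.goodFeaturesToTrack(gray, maxCorners=500,
                                      qualityLevel=0.01, minDistance=10)
    # In the full pipeline each clue tile would be cropped from the grid;
    # here the whole image is OCRed as a simplification.
    clue_text = pytesseract.image_to_string(gray, config="--psm 6")
    return corners, clue_text
```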
|
5 |
Deriving A Better Metric To Assess the Quality of Word Embeddings Trained On Limited Specialized Corpora. Munbodh, Mrinal. January 2020 (has links)
No description available.
|
6 |
Improvement Of Corpus-based Semantic Word Similarity Using Vector Space Model. Esin, Yunus Emre. 01 July 2009 (links) (PDF)
This study presents a new approach for finding semantically similar words in corpora using window-based context methods. Previous studies mainly concentrate on either finding new combinations of distance-weight measurement methods or proposing new context methods. The main difference of this new approach is that it reprocesses the outputs of the existing methods to update the representation of the related word vectors used for measuring semantic distance between words, improving the results further. Moreover, this technique provides a solution to the data sparseness of vectors, a common problem in methods that use the vector space model.
The main advantage of this new approach is that it is applicable to many of the existing word similarity methods that use the vector space model. The most important advantage, however, is that it improves the performance of some of these existing word similarity measuring methods.
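For context, a minimal sketch of the window-based vector space model that such methods build on (the window size and sparse count representation are illustrative assumptions):

```python
from collections import defaultdict, Counter

def context_vectors(sentences, window=2):
    # Sparse window-based co-occurrence counts: one context vector per word.
    vecs = defaultdict(Counter)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            for u in context:
                vecs[w][u] += 1
    return vecs

def cosine(c1, c2):
    # Cosine similarity between two sparse count vectors; the small overlap
    # of sparse vectors is exactly the data sparseness problem noted above.
    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    n1 = sum(v * v for v in c1.values()) ** 0.5
    n2 = sum(v * v for v in c2.values()) ** 0.5
    return dot / (n1 * n2) if n1 and n2 else 0.0
```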
|
7 |
Discovering Implant Terms in Medical Records. Jerdhaf, Oskar. January 2021 (has links)
Implant terms are terms like "pacemaker" that indicate the presence of artificial objects in the human body. These implant terms are key to determining whether a patient can safely undergo Magnetic Resonance Imaging (MRI). However, identifying these terms in medical records is time-consuming, laborious, and expensive, yet necessary for taking the correct precautions before an MRI scan. Automating this process is of great interest to radiologists, as it ideally saves time, prevents mistakes, and as a result saves lives. The electronic medical records (EMR) contain the documented medical history of a patient, including any implants or objects that an individual would have inside their body. Information about such objects and implants is of great interest when determining if and how a patient can be scanned using MRI. This information is unfortunately not easily extracted automatically: the sparse presence of implant terms and the unusual structure of medical records compared to most written text make the task very difficult to automate by simple means. By leveraging recent advancements in Artificial Intelligence (AI), this thesis explores the ability to identify and extract such terms automatically in Swedish EMRs. For the task of identifying implant terms in medical records, a generally trained Swedish Bidirectional Encoder Representations from Transformers (BERT) model is used, which is then fine-tuned on Swedish medical records. Using this model, a variety of approaches are explored, namely BERT-KDTree, BERT-BallTree, cosine brute force, and unsupervised named entity recognition (NER). The results show that BERT-KDTree and BERT-BallTree are the most rewarding methods. Results from both methods have been evaluated by domain experts and appear promising for such an early stage, given the difficulty of the task. The evaluation of BERT-BallTree shows that multiple methods of extraction may be preferable, as they provide different but still useful terms. Cosine brute force is deemed an unrealistic approach due to its computational and memory requirements. The NER approach was deemed too impractical and laborious to justify for this study, yet potentially useful, or even more suitable, under a different set of conditions and goals. While there is much to be explored and improved, these experiments are a clear indication that automatic identification of implant terms is possible, as a large number of implant terms were successfully discovered using automated means.
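A rough sketch of a BERT-plus-BallTree lookup in the spirit of the approach described above (the model name, mean pooling, and seed terms are assumptions for illustration; the thesis fine-tunes a Swedish BERT on medical records):

```python
import numpy as np
import torch
from sklearn.neighbors import BallTree
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("KB/bert-base-swedish-cased")
model = AutoModel.from_pretrained("KB/bert-base-swedish-cased")

def embed(terms):
    # Mean-pooled token embeddings as a simple term representation.
    rows = []
    for term in terms:
        enc = tokenizer(term, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state
        rows.append(hidden.mean(dim=1).squeeze(0).numpy())
    return np.vstack(rows)

vocab = ["pacemaker", "stent", "protes", "insulinpump"]  # illustrative terms
tree = BallTree(embed(vocab))

# Nearest vocabulary terms to a known implant term:
dist, idx = tree.query(embed(["pacemaker"]), k=3)
print([vocab[i] for i in idx[0]])
```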
|