1 |
An Enhanced Conditional Random Field Model for Chinese Word Segmentation
Huang, Jhao-ming, 03 February 2010
In Chinese, the smallest meaningful unit is the word, which is composed of a sequence of characters. A Chinese sentence is a sequence of words with no separators between them. In information retrieval and data mining, a sequence of Chinese characters must therefore be segmented before the resulting segments can be used. This process is called Chinese word segmentation. Research on Chinese word segmentation has been under way for many years. Although some recent work achieves very high performance, the recall of words that are not in the dictionary reaches only sixty to seventy percent. The approach described in this paper uses linear-chain conditional random fields (CRFs) to segment Chinese words more accurately. Our study uses a discriminatively trained model with two proposed feature templates to decide the boundaries between characters. We also propose three further methods, duplicate word repartition, date representation repartition, and segment refinement, to improve the accuracy of the processed segments. In the experiments, we test several approaches and compare the results with those of Li et al. and of Lau and King on three Chinese word corpora. The results show that the improved feature template, which uses prefix and postfix information, increases both recall and precision; for example, the F-measure reaches 0.964 on the MSR dataset. By detecting repeated characters, duplicated characters can be repartitioned better without extra resources. Wrongly segmented dates can be repartitioned better with the proposed method, which handles numbers, dates, and measure words. If a word is segmented differently from the corresponding standard segmentation corpus, a proper segment can be produced by repartitioning the assembled segment formed from the current segment and its adjacent segment.
In summary, for Chinese word segmentation with conditional random fields, we propose a feature template that yields better results and three methods that address other specific segmentation problems.
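To make the boundary-tagging view and the reported metrics concrete, here is a minimal Python sketch. It is not the thesis's implementation: the B/M/E/S tag set is an assumption (the abstract does not name its tagging scheme), and the helper names `to_bmes`, `word_spans`, and `prf` are hypothetical. It shows how a segmented sentence maps to per-character labels of the kind a linear-chain CRF would predict, and how word-level precision, recall, and F-measure are commonly computed by comparing word spans.

```python
def to_bmes(words):
    """Map a list of words to per-character tags (assumed B/M/E/S scheme)."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")  # single-character word
        else:
            # Begin, zero or more Middle, End
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def word_spans(words):
    """Return the set of (start, end) character offsets, one per word."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def prf(gold, pred):
    """Word-level precision, recall, and F-measure of pred against gold."""
    g, p = word_spans(gold), word_spans(pred)
    correct = len(g & p)  # a word counts only if both boundaries match
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


gold = ["我們", "喜歡", "自然語言", "處理"]
pred = ["我們", "喜歡", "自然", "語言", "處理"]
print("".join(to_bmes(gold)))
print(prf(gold, pred))
```

In this toy sentence the prediction splits one gold word in two, so three of its five words match the gold standard: precision is 3/5 and recall is 3/4.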
|
2 |
基於領域詞典之詞彙-語義網路建構方法研究 - 以財務金融領域詞典為例 / The Construction of a Lexical-semantic Network Based on Domain Dictionary: Dictionary of Finance and Banking as an Example
曾建勛 (Tzeng, Jian Shuin), Unknown Date
A domain dictionary contains many specialized terms together with their definitions, but the relations among those terms remain hidden in the dictionary. In this thesis, we apply natural language processing techniques and propose a method that uses a domain dictionary to uncover the relations among its terms and to construct a domain-specific lexical-semantic network.
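As a rough illustration of how relations hidden in definitions might be surfaced, the sketch below links each headword to every other headword that occurs in its definition. This is an assumed, simplified technique, not the method of the thesis, which the abstract does not describe. The function name `build_network` and the toy English entries are hypothetical.

```python
def build_network(dictionary):
    """Link each term to the other dictionary terms found in its definition.

    dictionary: mapping of term -> definition text.
    Returns a mapping of term -> sorted list of related terms.
    """
    edges = {}
    for term, definition in dictionary.items():
        edges[term] = sorted(
            other
            for other in dictionary
            # Plain substring matching: an assumption made for brevity.
            if other != term and other in definition
        )
    return edges


# Toy entries invented for illustration only.
toy = {
    "bond": "a debt security issued by a borrower that pays interest",
    "interest": "the price paid for borrowing money",
    "yield": "the interest earned on a bond, expressed as a rate",
}

for term, related in build_network(toy).items():
    print(term, "->", related)
```

Substring matching is crude. For a Chinese dictionary the definitions would first need word segmentation so that only whole terms are matched, and a fuller system would also label the type of each relation.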
|
3 |
支援數位人文研究之文本自動標註系統發展與使用評估研究 / Development and evaluation of an automatic text annotation system for supporting digital humanities research
劉鎮宇 (Liu, Chen Yu), Unknown Date
In traditional humanities research, scholars have worked mostly with texts in printed form, such as rare ancient books and historical documents. With the arrival of the information society, however, many research institutions have been digitizing these paper materials and building digital archive databases, which has greatly changed the humanities research environment and the channels for acquiring knowledge. Text research based on digital reading has therefore become an inevitable trend.
For this reason, this study develops an “automatic text annotation system” to support digital humanities research. Using the concept of Linked Data, the system gathers and integrates resources from different databases and annotates texts automatically, so that users can consult resources from other databases immediately while interpreting a text. It also provides a friendly reading interface with text annotations to help humanities scholars interpret materials through reading. Using an experimental research method, the study compares the “automatic text annotation system” with the “MARKUS semi-automatic text annotation system” to determine whether the two differ significantly in reading achievement and technology acceptance when supporting humanities scholars in interpreting text data. Semi-structured in-depth interviews are also conducted to understand the scholars’ opinions and perceptions of the “automatic text annotation system”, and the study further analyzes whether reading achievement, technology acceptance, and user behavior process are correlated for that system.
The experimental results show that reading achievement with the automatic text annotation system is higher than with the MARKUS semi-automatic text annotation system, but the difference is not statistically significant. The technology acceptance analysis shows that technology acceptance of the automatic text annotation system is significantly better than that of the MARKUS semi-automatic text annotation system. According to the interviews, the reading interface of the automatic text annotation system is simple and clear, and more suitable for reading than that of the MARKUS semi-automatic text annotation system. Whether the reading interface is easy to use and useful is a key factor in whether humanities scholars accept a system for assisting digital humanities research. A comparison of similar functions in the two systems also finds that the vocabulary lookup, link-to-source-website, and add-annotation functions of the automatic text annotation system are more intuitive and easier to use than those of the MARKUS semi-automatic text annotation system. In addition, humanities scholars generally consider the sentence segmentation function more important than the automatic word segmentation function, and among the linked source databases they find Moedict (萌典) the most helpful. Finally, there is no significant correlation between reading achievement and user behavior process with the automatic text annotation system.
|