61
Probabilistic models for information extraction: from cascaded approach to joint approach. / CUHK electronic theses & dissertations collection. January 2010 (has links)
Based on these observations and analysis, we propose a joint discriminative probabilistic framework to optimize all relevant subtasks simultaneously. This framework defines a joint probability distribution for both segmentations in sequence data and relations of segments in the form of an exponential family. This model allows tight interactions between segmentations and relations of segments and offers a natural way to model IE tasks. Since exact parameter estimation and inference are prohibitively intractable, a structured variational inference algorithm is developed to perform parameter estimation approximately. For inference, we propose a strong bi-directional Metropolis-Hastings (MH) approach to find the MAP assignments for joint segmentations and relations, exploiting mutual benefits in both directions so that segmentation can aid relation extraction and vice versa. / Information Extraction (IE) aims at identifying specific pieces of information (data) in an unstructured or semi-structured textual document and transforming unstructured information in a corpus of documents or Web pages into a structured database. There are several representative tasks in IE: named entity recognition (NER), which aims at identifying phrases that denote particular types of named entities; entity relation extraction, which aims at discovering the events or relations involving those entities; and coreference resolution, which aims at determining whether two extracted mentions of entities refer to the same object. IE is useful for a wide variety of applications. / The end-to-end performance of high-level IE systems for compound tasks is often hampered by the use of cascaded frameworks. The integrated model we proposed can alleviate some of these problems, but it is only loosely coupled: parameter estimation is performed independently and information is allowed to flow in only one direction. In this top-down integration model, the decision of the bottom sub-model can guide the decision of the upper sub-model, but not vice versa. Thus, deep interactions and dependencies between different tasks can hardly be well captured. / We have investigated and developed a cascaded framework that combines entity extraction with qualitative domain knowledge, based on undirected, discriminatively-trained probabilistic graphical models. This framework consists of two stages and combines statistical learning with first-order logic. As a pipeline model, the first stage is a base model and the second stage is used to validate and correct the errors made by the base model. We incorporated domain knowledge that can be well formulated in first-order logic to extract entity candidates from the base model. We have applied this framework and achieved encouraging results in Chinese NER on the People's Daily corpus. / We perform extensive experiments on three important IE tasks using real-world datasets, namely Chinese NER, entity identification and relationship extraction from Wikipedia's encyclopedic articles, and citation matching, to test our proposed models, including the bidirectional model, the integrated model, and the joint model. Experimental results show that our models significantly outperform current state-of-the-art probabilistic models, such as decoupled and joint models, illustrating the feasibility and promise of our proposed approaches. (Abstract shortened by UMI.)
/ We present a general, strongly-coupled, and bidirectional architecture based on discriminatively trained factor graphs for information extraction, which consists of two components---segmentation and relation. First we introduce joint factors connecting variables of relevant subtasks to capture dependencies and interactions between them. We then propose a strong bidirectional Markov chain Monte Carlo (MCMC) sampling inference algorithm which allows information to flow in both directions to find the approximate maximum a posteriori (MAP) solution for all subtasks. Notably, our framework is considerably simpler to implement, and outperforms previous ones. / Yu, Xiaofeng. / Adviser: Lam Wai. / Source: Dissertation Abstracts International, Volume: 72-04, Section: B, page: . / Thesis (Ph.D.)--Chinese University of Hong Kong, 2010. / Includes bibliographical references (leaves 109-123). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. Ann Arbor, MI : ProQuest Information and Learning Company, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese.
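As a rough illustration of the joint inference idea sketched above (a toy, not the thesis's actual model, factors, or features), the following Metropolis-Hastings search looks for a MAP-like joint assignment in which one score function couples per-token segmentation labels and a relation label, so a change on either side can raise or lower the score of the other. All tokens, labels, and weights are invented for the example.

```python
# Toy sketch: MH search for a joint MAP assignment over segmentation + relation.
# The score couples both subtasks, so evidence can flow in either direction.
import math
import random

random.seed(0)

tokens = ["Yale", "University", "is", "in", "New", "Haven"]
SEG_LABELS = ["O", "ORG", "LOC"]          # toy segmentation labels
REL_LABELS = ["none", "located_in"]       # toy relation between the two entities

def score(seg, rel):
    """Unnormalized log-score with a joint factor linking segmentation and relation."""
    s = 0.0
    s += sum(1.0 for i in (0, 1) if seg[i] == "ORG")      # local segmentation factors
    s += sum(1.0 for i in (4, 5) if seg[i] == "LOC")
    s += sum(0.5 for i in (2, 3) if seg[i] == "O")
    if rel == "located_in":                               # joint factor
        s += 2.0 if ("ORG" in seg and "LOC" in seg) else -2.0
    return s

seg, rel = ["O"] * len(tokens), "none"
best = (list(seg), rel, score(seg, rel))
for step in range(5000):
    new_seg, new_rel = list(seg), rel
    if random.random() < 0.5:                             # propose a segmentation flip
        new_seg[random.randrange(len(tokens))] = random.choice(SEG_LABELS)
    else:                                                 # propose a relation flip
        new_rel = random.choice(REL_LABELS)
    delta = score(new_seg, new_rel) - score(seg, rel)
    if delta >= 0 or random.random() < math.exp(delta):   # MH acceptance
        seg, rel = new_seg, new_rel
        if score(seg, rel) > best[2]:
            best = (list(seg), rel, score(seg, rel))

print("best assignment:", list(zip(tokens, best[0])), "relation:", best[1])
```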
62
An empirical study on Chinese text compression: from character-based to word-based approach. January 1997 (has links)
by Kwok-Shing Cheng. / Thesis (M.Phil.)--Chinese University of Hong Kong, 1997. / Includes bibliographical references (leaves 114-120). / Abstract --- p.i / Acknowledgement --- p.iii / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Importance of Text Compression --- p.1 / Chapter 1.2 --- Motivation of this Research --- p.2 / Chapter 1.3 --- Characteristics of Chinese --- p.2 / Chapter 1.3.1 --- Huge size of character set --- p.3 / Chapter 1.3.2 --- Lack of word segmentation --- p.3 / Chapter 1.3.3 --- Rich semantics --- p.3 / Chapter 1.4 --- Different Coding Schemes for Chinese --- p.4 / Chapter 1.4.1 --- Big5 Code --- p.4 / Chapter 1.4.2 --- GB (Guo Biao) Code --- p.4 / Chapter 1.4.3 --- HZ (Hanzi) Code --- p.5 / Chapter 1.4.4 --- Unicode Code --- p.5 / Chapter 1.5 --- Modeling and Coding for Chinese Text --- p.6 / Chapter 1.6 --- Static and Adaptive Modeling --- p.6 / Chapter 1.7 --- One-Pass and Two-Pass Modeling --- p.8 / Chapter 1.8 --- Ordering of models --- p.9 / Chapter 1.9 --- Two Sets of Benchmark Files and the Platform --- p.9 / Chapter 1.10 --- Outline of the Thesis --- p.11 / Chapter 2 --- A Survey of Chinese Text Compression --- p.13 / Chapter 2.1 --- Entropy for Chinese Text --- p.14 / Chapter 2.2 --- Weakness of Traditional Compression Algorithms on Chinese Text --- p.15 / Chapter 2.3 --- Statistical Class Algorithms for Compressing Chinese --- p.16 / Chapter 2.3.1 --- Huffman coding scheme --- p.17 / Chapter 2.3.2 --- Arithmetic Coding Scheme --- p.22 / Chapter 2.3.3 --- Restricted Variable Length Coding Scheme --- p.26 / Chapter 2.4 --- Dictionary-based Class Algorithms for Compressing Chinese --- p.27 / Chapter 2.5 --- Experiments and Results --- p.32 / Chapter 2.6 --- Chapter Summary --- p.35 / Chapter 3 --- Indicator Dependent Huffman Coding Scheme --- p.37 / Chapter 3.1 --- Chinese Character Identification Routine --- p.37 / Chapter 3.2 --- Reduction of Header Size --- p.39 / Chapter 3.3 --- Semi-adaptive IDC for Chinese Text --- p.44 / Chapter 3.3.1 --- Theoretical Analysis of Partition Technique for Compression --- p.48 / Chapter 3.3.2 --- Experiments and Results of the Semi-adaptive IDC --- p.50 / Chapter 3.4 --- Adaptive IDC for Chinese Text --- p.54 / Chapter 3.4.1 --- Experiments and Results of the Adaptive IDC --- p.57 / Chapter 3.5 --- Chapter Summary --- p.58 / Chapter 4 --- Cascading LZ Algorithms with Huffman Coding Schemes --- p.59 / Chapter 4.1 --- Variations of Huffman Coding Scheme --- p.60 / Chapter 4.1.1 --- Analysis of EPDC and PDC --- p.60 / Chapter 4.1.2 --- "Analysis of PDC, 16Huff and IDC" --- p.65 / Chapter 4.1.3 --- Time and Memory Consumption --- p.71 / Chapter 4.2 --- "Cascading LZSS with PDC, 16Huff and IDC" --- p.73 / Chapter 4.2.1 --- Experimental Results --- p.76 / Chapter 4.3 --- "Cascading LZW with PDC, 16Huff and IDC" --- p.79 / Chapter 4.3.1 --- Experimental Results --- p.82 / Chapter 4.4 --- Chapter Summary --- p.84 / Chapter 5 --- Applying Compression Algorithms to Word-segmented Chinese Text --- p.85 / Chapter 5.1 --- Background of word-based compression algorithms --- p.86 / Chapter 5.2 --- Terminology and Benchmark Files for Word Segmentation Model --- p.88 / Chapter 5.3 --- Word Segmentation Model --- p.88 / Chapter 5.4 --- Chinese Entropy from Byte to Word --- p.91 / Chapter 5.5 --- The Generalized Compression and Decompression Model for Word-segmented Chinese text --- p.92 / Chapter 5.6 --- Applying Huffman Coding Scheme to Word-segmented Chinese text --- p.94 / Chapter 5.7 --- Applying WLZSSHUF to Word-segmented Chinese text --- p.97 / Chapter 5.8 --- Applying WLZWHUF to Word-segmented Chinese text --- p.102 / Chapter 5.9 --- Match Ratio and Compression Ratio --- p.105 / Chapter 5.10 --- Chapter Summary --- p.108 / Chapter 6 --- Concluding Remarks --- p.110 / Chapter 6.1 --- Conclusions --- p.110 / Chapter 6.2 --- Contributions --- p.111 / Chapter 6.3 --- Future Directions --- p.112 / Chapter 6.3.1 --- Integrate Decremental Coding Scheme with IDC --- p.112 / Chapter 6.3.2 --- Re-order the Character Sequences in the Sliding Window of LZSS --- p.113 / Chapter 6.3.3 --- Multiple Huffman Trees for Word-based Compression --- p.113 / Bibliography --- p.114
63
Automatic construction and adaptation of wrappers for semi-structured web documents. January 2003 (has links)
Wong Tak Lam. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2003. / Includes bibliographical references (leaves 88-94). / Abstracts in English and Chinese. / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Wrapper Induction for Semi-structured Web Documents --- p.1 / Chapter 1.2 --- Adapting Wrappers to Unseen Web Sites --- p.6 / Chapter 1.3 --- Thesis Contributions --- p.7 / Chapter 1.4 --- Thesis Organization --- p.8 / Chapter 2 --- Related Work --- p.10 / Chapter 2.1 --- Related Work on Wrapper Induction --- p.10 / Chapter 2.2 --- Related Work on Wrapper Adaptation --- p.16 / Chapter 3 --- Automatic Construction of Hierarchical Wrappers --- p.20 / Chapter 3.1 --- Hierarchical Record Structure Inference --- p.22 / Chapter 3.2 --- Extraction Rule Induction --- p.30 / Chapter 3.3 --- Applying Hierarchical Wrappers --- p.38 / Chapter 4 --- Experimental Results for Wrapper Induction --- p.40 / Chapter 5 --- Adaptation of Wrappers for Unseen Web Sites --- p.52 / Chapter 5.1 --- Problem Definition --- p.52 / Chapter 5.2 --- Overview of Wrapper Adaptation Framework --- p.55 / Chapter 5.3 --- Potential Training Example Candidate Identification --- p.58 / Chapter 5.3.1 --- Useful Text Fragments --- p.58 / Chapter 5.3.2 --- Training Example Generation from the Unseen Web Site --- p.60 / Chapter 5.3.3 --- Modified Nearest Neighbour Classification --- p.63 / Chapter 5.4 --- Machine Annotated Training Example Discovery and New Wrapper Learning --- p.64 / Chapter 5.4.1 --- Text Fragment Classification --- p.64 / Chapter 5.4.2 --- New Wrapper Learning --- p.69 / Chapter 6 --- Case Study and Experimental Results for Wrapper Adaptation --- p.71 / Chapter 6.1 --- Case Study on Wrapper Adaptation --- p.71 / Chapter 6.2 --- Experimental Results --- p.73 / Chapter 6.2.1 --- Book Domain --- p.74 / Chapter 6.2.2 --- Consumer Electronic Appliance Domain --- p.79 / Chapter 7 --- Conclusions and Future Work --- p.83 / Bibliography --- p.88 / Chapter A --- Detailed Performance of Wrapper Induction for Book Domain --- p.95 / Chapter B --- Detailed Performance of Wrapper Induction for Consumer Electronic Appliance Domain --- p.99
64
Relevance feedback-based optimization of search queries for Patents. Cheng, Sijin. January 2019 (has links)
In this project, we design a search query optimization system based on the user's relevance feedback, generating customized query strings for existing patent alerts. First, the Rocchio algorithm is used to generate a search string by analyzing the characteristics of related and unrelated patents. Then a collaborative filtering recommendation algorithm is used to rank the query results, taking into account previous relevance feedback and patent features rather than only the similarity between the query and the patents, as the traditional method does. To further explore the performance of the optimization system, we design and conduct a series of evaluation experiments with TF-IDF as a baseline method. Experiments show that, with the generated search strings, the proportion of unrelated patents in the search results is significantly reduced over time: in four months, the precision of the retrieved results improves from 53.5% to 72%. Moreover, the ranking performance of the proposed method is better than that of the baseline; in terms of precision, the top-10 results of the recommendation algorithm are about 5 percentage points higher than the baseline, and the top-20 about 7.5 percentage points higher. We conclude that the proposed approach can effectively optimize patent search results by learning from relevance feedback.
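As a rough illustration of the Rocchio step described above (the corpus, weights, and term-selection heuristic below are placeholders, not the thesis's implementation), the query vector is moved toward the centroid of patents judged relevant and away from the centroid of those judged unrelated, and the heaviest resulting terms become the new search string:

```python
# Hedged sketch of Rocchio-style query refinement from relevance feedback.
# alpha/beta/gamma and the toy corpus are assumptions, not the thesis's values.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

patents = [
    "battery cell thermal management for electric vehicles",
    "lithium battery electrode coating process",
    "packaging machine for beverage bottles",          # judged unrelated
]
relevant_ids, unrelated_ids = [0, 1], [2]

vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(patents).toarray()
query_vec = vectorizer.transform(["battery thermal management"]).toarray()[0]

alpha, beta, gamma = 1.0, 0.75, 0.25                   # classic Rocchio weights
new_query = (alpha * query_vec
             + beta * doc_vecs[relevant_ids].mean(axis=0)
             - gamma * doc_vecs[unrelated_ids].mean(axis=0))
new_query = np.clip(new_query, 0.0, None)              # drop negative term weights

# Take the highest-weighted terms as the refined search string.
terms = np.array(vectorizer.get_feature_names_out())
print("expanded query terms:", list(terms[np.argsort(new_query)[::-1][:5]]))
```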
65
Text mining: symbolic methods for ontology construction and knowledge-guided semantic annotation. Toussaint, Yannick. 21 November 2011 (has links) (PDF)
There is no turnkey tool for extracting knowledge from texts, and the passage from natural language to knowledge is highly contextualized and dependent on the task one has set oneself. We show that the challenge of extracting knowledge from texts remains very broad today, with a great many research directions connected to information retrieval, natural language processing, data mining, and knowledge representation, each of which comprises many very active subfields. The research programme I wish to develop can be seen as a path through these domains that aims to create a (semantic) continuum between the different stages of text mining. Knowledge extraction from texts is above all a construction of knowledge, and it presupposes methodological coherence between the stages of text mining. I have chosen to anchor my work in formal methods, aiming in particular at a representation of knowledge in logic, more specifically in description logics. Despite the restrictions this choice entails, notably concerning interaction with human domain experts and the updating or correction of an ontology, a formal representation remains, in my view, the solution for reasoning over texts and ensuring the consistency of an ontology. While the ultimate goal of a mining process is to build a formal representation that can support reasoning, in this research programme I have concentrated on knowledge construction using pattern-based methods, association rule extraction, and formal concept analysis. The interest of these approaches is that they maintain a constant link between the texts and the knowledge: modifying the texts changes the knowledge, and conversely modifying the knowledge (external resources, for example) changes the annotation of the texts and the ontology. Cooperative environments could eventually integrate our work and thus facilitate the synergy between human and automatic processes.
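As a small, invented illustration of the formal concept analysis mentioned above (the toy context and attribute names below are not from the author's work), each formal concept pairs a set of objects with exactly the attributes they all share, which is one way such methods keep a constant, recomputable link between annotated texts and derived knowledge:

```python
# Minimal formal concept analysis sketch over an invented object/attribute context.
from itertools import combinations

context = {                       # object -> set of attributes
    "doc_A": {"protein", "enzyme"},
    "doc_B": {"protein"},
    "doc_C": {"enzyme", "membrane"},
}
objects = set(context)
attributes = set().union(*context.values())

def intent(objs):                 # attributes shared by every object in objs
    return set.intersection(*(context[o] for o in objs)) if objs else set(attributes)

def extent(attrs):                # objects carrying every attribute in attrs
    return {o for o in objects if attrs <= context[o]}

# Enumerate formal concepts (extent, intent) by closing every subset of objects.
concepts = set()
for r in range(len(objects) + 1):
    for objs in combinations(sorted(objects), r):
        ext = extent(intent(set(objs)))              # closure of the object subset
        concepts.add((frozenset(ext), frozenset(intent(ext))))

for ext, shared in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(ext), "->", sorted(shared))
```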
66
Annotated resources, a challenge for content analysis: towards a methodology for manual corpus annotation. Fort, Karën. 07 December 2012 (has links) (PDF)
Manual corpus annotation has become a fundamental issue for Natural Language Processing (NLP): annotated corpora are used both to build and to evaluate NLP tools. Yet the manual annotation process is still poorly understood, and the tools proposed to support it are often misused, so the quality of the resulting annotations cannot be guaranteed. This thesis proposes a unified view of manual corpus annotation for NLP. The work draws on a range of experience managing and taking part in annotation campaigns, as well as on collaborations with other researchers. We first propose a global methodology for managing manual corpus annotation campaigns, resting on two main pillars: an organization of annotation campaigns that places evaluation at the heart of the process, and an analysis grid for the dimensions of complexity of an annotation campaign. A second strand of the work concerns the campaign manager's tools. Through a series of experiments on the morpho-syntactic annotation of English, we measured the precise influence of automatic pre-annotation on the quality and speed of human correction. We also provided practical solutions for evaluating manual annotation, giving the campaign manager the means to select the most appropriate measures. Finally, we brought to light the processes at work and the tools required for an annotation campaign, thereby instantiating the methodology we describe.
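One family of measures a campaign manager might choose from is chance-corrected inter-annotator agreement; the sketch below computes Cohen's kappa for two annotators over invented part-of-speech labels (an illustration only, not code or data from the thesis):

```python
# Cohen's kappa between two annotators on the same items (invented toy data).
from collections import Counter

annotator_a = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB"]
annotator_b = ["NOUN", "VERB", "ADJ",  "ADJ", "NOUN", "NOUN"]

n = len(annotator_a)
observed = sum(a == b for a, b in zip(annotator_a, annotator_b)) / n

# Expected chance agreement from each annotator's label distribution.
dist_a, dist_b = Counter(annotator_a), Counter(annotator_b)
expected = sum((dist_a[l] / n) * (dist_b[l] / n) for l in set(dist_a) | set(dist_b))

kappa = (observed - expected) / (1 - expected)
print(f"observed={observed:.2f} expected={expected:.2f} kappa={kappa:.2f}")
```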
67
Generating documents by means of computational registers. Oldham, Joseph Dowell. January 2000 (has links)
Thesis (Ph. D.)--University of Kentucky, 2000. / Title from document title page. Document formatted into pages; contains ix, 169 p. : ill. Includes abstract. Includes bibliographical references (p. 160-167).
68
On the construction and application of compressed text indexes. Hon, Wing-kai, 韓永楷. January 2004 (has links)
Computer Science and Information Systems / Doctoral / Doctor of Philosophy
69
Investigating the Efficacy of XML and Stylesheets to Render Electronic Courseware for Multiple Learning Styles. du Toit, Masha. 01 June 2007 (has links)
The objective of this project was to test the efficacy of using Extensible Markup Language (XML) - in particular the DocBook 5.0b5 schema - and Extensible Stylesheet Language Transformations (XSLT) to render electronic courseware that can be dynamically re-formatted according to a student's individual learning style.
The text of a typical lesson was marked up in XML according to the DocBook schema, and several XSLT stylesheets were created to transform the XML document into different versions, each according to particular learning needs. These learning needs were drawn from the Felder-Silverman learning style model. The notes had links to trigger JavaScript functions that allowed the student to reformat the notes to produce different views of the lesson.
The dynamic notes were tested on twelve users, who filled out a feedback questionnaire. Feedback was largely positive and suggested that users were able to navigate according to their learning style, although there were some usability issues caused by the program's incompatibility with some browsers. However, the user test is not the most critical part of the evaluation: it served to confirm that the notes were usable, while the analysis of the use of XSLT and DocBook is the key aspect of this project. XML, and in particular the DocBook schema, proved a useful tool in these circumstances, being easy to learn, well supported, and having an appropriate structure for a project of this type.
The use of XSLT, on the other hand, was not so straightforward. Learning a declarative language was a challenge, as was using XSLT to transform the notes as this project required. A particular problem was the need to move content from one area of the document to another - hiding it in some cases and revealing it in others. The solution was not straightforward to achieve with XSLT and does not take proper advantage of the strengths of the technology. The fact that the XSLT processor uses the DOM API, which requires loading the entire XML document into memory, is particularly problematic here, where the document is constantly transformed and re-transformed. The manner in which stylesheets are assigned, as well as the need to use DOM objects to edit the source tree, meant that JavaScript was needed to provide the required interactivity. These mechanisms limited browser compatibility and caused the program to freeze on older machines. The problems with browser compatibility and synchronous data loading are not insurmountable, however, and can be overcome with careful use of JavaScript and asynchronous data retrieval (AJAX).
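For illustration, the kind of parameter-driven toggling described above can be sketched as a small XSLT stylesheet applied from Python via lxml; the element names, learning-style values, and the use of lxml here are assumptions for the example, not the project's actual DocBook markup or its browser-side JavaScript mechanism:

```python
# Sketch: select content for one learning style with an XSLT parameter (via lxml).
from lxml import etree

lesson = etree.XML("""<lesson>
  <para audience="visual">A diagram-first explanation.</para>
  <para audience="verbal">A text-first explanation.</para>
</lesson>""")

stylesheet = etree.XML("""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:param name="style" select="'verbal'"/>
  <!-- identity transform: copy everything by default -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <!-- keep only paragraphs aimed at the requested learning style -->
  <xsl:template match="para">
    <xsl:if test="@audience = $style"><xsl:copy-of select="."/></xsl:if>
  </xsl:template>
</xsl:stylesheet>""")

transform = etree.XSLT(stylesheet)
visual_view = transform(lesson, style=etree.XSLT.strparam("visual"))
print(str(visual_view))
```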
70
Improving searchability of automatically transcribed lectures through dynamic language modelling. Marquard, Stephen. 01 December 2012 (has links)
Recording university lectures through lecture capture systems is increasingly common. However, a single continuous audio recording is often unhelpful for users, who may wish to navigate quickly to a particular part of a lecture, or locate a specific lecture within a set of recordings.
A transcript of the recording can enable faster navigation and searching. Automatic speech recognition (ASR) technologies may be used to create automated transcripts, to avoid the significant time and cost involved in manual transcription.
Low accuracy of ASR-generated transcripts may however limit their usefulness. In particular, ASR systems optimized for general speech recognition may not recognize the many technical or discipline-specific words occurring in university lectures. To improve the usefulness of ASR transcripts for the purposes of information retrieval (search) and navigating within recordings, the lexicon and language model used by the ASR engine may be dynamically adapted for the topic of each lecture.
A prototype is presented which uses the English Wikipedia as a semantically dense, large language corpus to generate a custom lexicon and language model for each lecture from a small set of keywords. Two strategies for extracting a topic-specific subset of Wikipedia articles are investigated: a naïve crawler which follows all article links from a set of seed articles produced by a Wikipedia search from the initial keywords, and a refinement which follows only links to articles sufficiently similar to the parent article. Pair-wise article similarity is computed from a pre-computed vector space model of Wikipedia article term scores generated using latent semantic indexing.
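A rough sketch of the similarity-based crawl decision (with placeholder article snippets, LSI dimensionality, and threshold rather than the thesis's actual settings) might look like the following:

```python
# Sketch: follow a candidate link only if its LSI vector is close enough to the parent's.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

articles = {   # placeholder article texts, not real Wikipedia content
    "Fourier transform": "decomposition of a function into its oscillatory components",
    "Fast Fourier transform": "an algorithm computing the discrete Fourier transform quickly",
    "Baroque music": "a style of Western art music composed from roughly 1600 to 1750",
}
titles = list(articles)

tfidf = TfidfVectorizer(stop_words="english").fit_transform(articles.values())
lsi = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)  # LSI vectors

parent = titles.index("Fourier transform")
sims = cosine_similarity(lsi[parent:parent + 1], lsi)[0]

THRESHOLD = 0.6   # follow a link only if the article is similar enough to its parent
for title, sim in zip(titles, sims):
    decision = "follow" if sim >= THRESHOLD and title != titles[parent] else "skip"
    print(f"{title:25s} similarity={sim:+.2f} -> {decision}")
```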
The CMU Sphinx4 ASR engine is used to generate transcripts from thirteen recorded lectures from Open Yale Courses, using the English HUB4 language model as a reference and the two topic-specific language models generated for each lecture from Wikipedia.
Three standard metrics – Perplexity, Word Error Rate and Word Correct Rate – are used to evaluate the extent to which the adapted language models improve the searchability of the resulting transcripts, and in particular improve the recognition of specialist words. Ranked Word Correct Rate is proposed as a new metric better aligned with the goals of improving transcript searchability and specialist word recognition.
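For reference, standard Word Error Rate is the word-level edit distance between the reference and the hypothesis divided by the reference length; a minimal sketch follows (the Ranked Word Correct Rate proposed in the thesis is its own metric and is not reproduced here):

```python
# Word Error Rate via word-level edit distance (illustrative, not the thesis's code).
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edits turning the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                                   # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                                   # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,           # deletion
                           dp[i][j - 1] + 1,           # insertion
                           dp[i - 1][j - 1] + sub)     # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the fourier transform of a signal",
                      "the fourier transformer of signal"))   # 2 errors / 6 words ≈ 0.33
```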
Analysis of recognition performance shows that the language models derived using the similarity-based Wikipedia crawler outperform models created using the naïve crawler, and that transcripts using similarity-based language models have better perplexity and Ranked Word Correct Rate scores than those created using the HUB4 language model, but worse Word Error Rates.
It is concluded that English Wikipedia may successfully be used as a language resource for unsupervised topic adaptation of language models to improve recognition performance for better searchability of lecture recording transcripts, although possibly at the expense of other attributes such as readability.