161 |
Exploiting common search interests across languages for web search. / 利用跨語言的共同搜索興趣幫助萬維網搜索 / CUHK electronic theses & dissertations collection / Li yong kua yu yan de gong tong sou suo xing qu bang zhu wan wei wang sou suo / January 2010
This work studies a new approach in Web search that caters for users' cross-lingual information needs by using the common search interests found across different languages. We assume a generic scenario in which monolingual users want to find relevant information under three general settings: (1) finding relevant information in a foreign language, which requires machine translation of the search results into the user's own language; (2) finding relevant information in multiple languages including the source language, which also requires machine translation to back-translate the search results; (3) finding relevant information only in the user's language, where, because of the intrinsically cross-lingual nature of many queries, monolingual search can be assisted by cross-lingual information from another language. / We approach the problem by substantially extending two core mechanisms of information retrieval for Web search across languages, namely query formulation and relevance ranking. First, unlike traditional cross-lingual methods such as query translation and expansion, we propose a novel Cross-Lingual Query Suggestion model that leverages large-scale search engine query logs to learn to suggest closely related queries in the target language for a given source-language query. The rationale behind our approach is the ever-increasing common search interests across Web users in different languages. Second, we generalize the usefulness of common search interests to enhance the relevance ranking of documents by exploiting the correlation among the search results derived from bilingual queries, overcoming the weakness of traditional relevance estimation, which uses information from only a single language or from different languages separately. To this end, we attempt to learn a ranking function that incorporates various similarity measures among the retrieved documents in different languages.
By modeling the commonality or similarity of search results, relevant documents in one language can help the relevance estimation of documents in a different language and hence improve the overall relevance estimation. The same intuition applies to all three settings described above. / Gao, Wei. / Adviser: Kaw-Fai Wong. / Source: Dissertation Abstracts International, Volume: 72-04, Section: B, page: . / Thesis (Ph.D.)--Chinese University of Hong Kong, 2010. / Includes bibliographical references (leaves 114-122). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. Ann Arbor, MI : ProQuest Information and Learning Company, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese.
|
162 |
A corpus-based approach for cross-lingual information retrieval. / CUHK electronic theses & dissertations collection / Digital dissertation consortium / January 2004
Li Kar Wing. / "July 2004." / Thesis (Ph.D.)--Chinese University of Hong Kong, 2004. / Includes bibliographical references (p. 127-139). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. Ann Arbor, MI : ProQuest Information and Learning Company, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Mode of access: World Wide Web. / Abstracts in English and Chinese.
|
163 |
Language evolution from a simulation perspective: on the coevolution of compositionality and regularity. / CUHK electronic theses & dissertations collection / January 2007
In addition to individual learning mechanisms, the thesis further explores the effects of cultural transmission and of social and semantic structures on language evolution. First, it simulates some major forms of cultural transmission, and discusses the role of conventionalization during horizontal transmission in language evolution. Second, it traces the emergence and maintenance of language in some stable social structures, and explores the role of popular agents in language evolution, the relationship between mutual understanding and social hierarchy, and the effect of exoteric communications on the convergence of communal languages. Finally, it studies language maintenance given different semantic spaces, and illustrates that the semantic structure may cause bias in the constituent word order, which can help to predict the word order bias in human languages. These explorations examine the role of self-organization in language evolution, prompt a reconsideration of the bottleneck effect during cultural transmission, and shed light on the study of the effects of social structure on language evolution. / The thesis presents a multi-agent computational model to explore a key question in language emergence, i.e., whether syntactic abilities result from innate, species-specific competences, or whether they evolve from domain-general abilities through gradual adaptations. The model simulates a process of coevolutionary emergence of two linguistic universals (compositionality, in the form of lexical items; and regularity, in the form of constituent word orders) in human language, i.e., the acquisition and conventionalization of these features coevolve during the transition from a holistic signaling system to a compositional language. It also traces a "bottom-up" process of syntactic development, i.e., agents, by reiterating local orders between two lexical items, can gradually form global order(s) to regulate multiple lexical items in sentences.
These results suggest that compositionality, regularity, and correlated linguistic abilities could have emerged as a result of some domain-general abilities, such as pattern extraction and sequential learning. / Gong, Tao. / "May 2007." / Adviser: William S-Y. Wang. / Source: Dissertation Abstracts International, Volume: 69-01, Section: A, page: 0200. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2007. / Includes bibliographical references (p. 317-346). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstracts in English and Chinese. / School code: 1307.
|
164 |
Unsupervised learning of Arabic non-concatenative morphology / Khaliq, Bilal / January 2015
Unsupervised approaches to learning the morphology of a language play an important role in computer processing of language from both a practical and a theoretical perspective, due to their minimal reliance on manually produced linguistic resources and human annotation. Such approaches have been widely researched for the problem of concatenative affixation, but less attention has been paid to the intercalated (non-concatenative) morphology exhibited by Arabic and other Semitic languages. The aim of this research is to learn the root and pattern morphology of Arabic, with accuracy comparable to manually built morphological analysis systems. The approach is kept free from human supervision or manual parameter settings, assuming only that roots and patterns intertwine to form a word. Promising results were obtained by applying a technique adapted from previous work in concatenative morphology learning, which uses machine learning to determine relatedness between words. The output, with probabilistic relatedness values between words, was then used to rank all possible roots and patterns to form a lexicon. Analysis using triliteral roots resulted in correct root identification accuracy of approximately 86% for inflected words. Although the machine learning-based approach is effective, it is conceptually complex. An alternative, simpler and computationally efficient approach was therefore devised to obtain morpheme scores based on comparative counts of roots and patterns. In this approach, root and pattern scores are defined in terms of each other in a mutually recursive relationship, converging to an optimized morpheme ranking. This technique gives slightly better accuracy while being conceptually simpler and more efficient. The approach, after further enhancements, was evaluated on a version of the Quranic Arabic Corpus, attaining a final accuracy of approximately 93%.
A comparative evaluation shows this to be superior to two existing, widely used, manually built Arabic stemmers, thus demonstrating the practical feasibility of unsupervised learning of non-concatenative morphology.
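The mutually recursive scoring of roots and patterns described above can be illustrated with a small sketch. This is not the thesis's actual formula, which the abstract does not give; it assumes a generic HITS-style mutual reinforcement, fixed three-letter roots, and Latin-transliterated toy words. The function names `candidate_splits`, `score_morphemes`, and `best_split` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_splits(word, root_len=3):
    """Yield every (root, pattern) split of a word.

    The root is any ordered choice of root_len letters; the pattern is the
    word with those letter slots replaced by '-'.
    """
    for positions in combinations(range(len(word)), root_len):
        root = "".join(word[i] for i in positions)
        pattern = "".join("-" if i in positions else ch
                          for i, ch in enumerate(word))
        yield root, pattern


def score_morphemes(words, iterations=50):
    """Assumed HITS-style update: a root scores highly if it combines with
    high-scoring patterns, and a pattern scores highly if it combines with
    high-scoring roots. Scores are renormalised after every half-step."""
    edges = set()
    for word in set(words):
        edges.update(candidate_splits(word))

    root_score = {r: 1.0 for r, _ in edges}
    pattern_score = {p: 1.0 for _, p in edges}
    for _ in range(iterations):
        new_root = defaultdict(float)
        for r, p in edges:
            new_root[r] += pattern_score[p]
        total = sum(new_root.values())
        root_score = {r: s / total for r, s in new_root.items()}

        new_pattern = defaultdict(float)
        for r, p in edges:
            new_pattern[p] += root_score[r]
        total = sum(new_pattern.values())
        pattern_score = {p: s / total for p, s in new_pattern.items()}
    return root_score, pattern_score


def best_split(word, root_score, pattern_score):
    """Pick the split whose root and pattern scores have the largest product."""
    return max(
        candidate_splits(word),
        key=lambda rp: root_score.get(rp[0], 0.0) * pattern_score.get(rp[1], 0.0),
    )


# Toy, transliterated word list (an assumption for illustration only).
words = ["katab", "kutib", "kaatib", "daras", "duris", "daaris"]
roots, patterns = score_morphemes(words)
print(best_split("katab", roots, patterns))
```

In this toy list the roots `ktb` and `drs` share all three vowel templates, so repeated reinforcement concentrates the score on them, while accidental splits that occur with only one root fade away.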
|
165 |
Perspective Identification in Informal Text / Elfardy, Hebatallah / January 2017
This dissertation studies the problem of identifying the ideological perspective of people as expressed in their written text. One's perspective is often expressed in one's stance towards polarizing topics. We are interested in studying how nuanced linguistic cues can be used to identify the perspective of a person in informal genres. Moreover, we are interested in exploring the problem from a multilingual perspective, comparing and contrasting the linguistic devices used in English informal-genre datasets discussing American ideological issues and in Arabic discussion fora posts related to Egyptian politics.
Our first and foremost goal is building computational systems that can successfully identify the perspective from which a given informal text is written, while studying which linguistic cues work best for each language and drawing insights into the similarities and differences between the notion of perspective in the two studied languages. We build computational systems that can successfully identify the stance of a person in English informal text that deals with different topics determined by one's perspective, such as the legalization of abortion, the feminist movement, gay rights, and gun rights; additionally, we are able to identify a more general notion of perspective, namely the 2012 choice of presidential candidate, as well as build systems for automatically identifying different elements of a person's perspective given an Egyptian discussion forum comment. The systems utilize several lexical and semantic features for both languages. Specifically, for English we explore the use of word sense disambiguation, opinion features, and latent and frame semantics, as well as Linguistic Inquiry and Word Count features; in Arabic, in addition to using sentiment and latent semantics, we study whether linguistic code-switching (LCS) between the standard and dialectal forms of the language can help as a cue for uncovering the perspective from which a comment was written.
This leads us to the challenge of devising computational systems that can handle LCS in Arabic. The Arabic language has a diglossic nature, where the standard form of the language (MSA) coexists with the regional dialects (DA) corresponding to the native mother tongue of Arabic speakers in different parts of the Arab world. DA is ubiquitous in written informal genres, and in most cases it is code-switched with MSA. The presence of code-switching degrades the performance of almost any MSA-only trained Natural Language Processing tool when applied to DA or to code-switched MSA-DA content. In order to solve this challenge, we build a state-of-the-art system, AIDA, to computationally handle token- and sentence-level code-switching.
On a conceptual level, for handling and processing Egyptian ideological perspectives, we note the lack of a taxonomy for the most common perspectives among Egyptians and the lack of corresponding annotated corpora. In solving this challenge, we develop a taxonomy for the most common community perspectives among Egyptians and use an iterative feedback-loop process to devise guidelines on how to successfully annotate a given online discussion forum post with different elements of a person's perspective. Using the proposed taxonomy and annotation guidelines, we annotate a large set of Egyptian discussion fora posts to identify a comment's perspective as conveyed in the priority expressed by the comment, as well as the stance on major political entities.
|
166 |
Conditional random fields with dynamic potentials for Chinese named entity recognition. / January 2008
Wu, Yiu Kei. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2008. / Includes bibliographical references (p. 69-75). / Abstracts in English and Chinese. / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Chinese NER Problem --- p.1 / Chapter 1.2 --- Contribution of Our Proposed Framework --- p.3 / Chapter 2 --- Related Work --- p.6 / Chapter 2.1 --- Hidden Markov Models --- p.7 / Chapter 2.2 --- Maximum Entropy Models --- p.8 / Chapter 2.3 --- Conditional Random Fields --- p.10 / Chapter 3 --- Our Proposed Model --- p.14 / Chapter 3.1 --- Background --- p.14 / Chapter 3.1.1 --- Problem Formulation --- p.14 / Chapter 3.1.2 --- Conditional Random Fields --- p.16 / Chapter 3.1.3 --- Semi-Markov Conditional Random Fields --- p.26 / Chapter 3.2 --- The Formulation of Our Proposed Model --- p.28 / Chapter 3.2.1 --- The Main Principle --- p.28 / Chapter 3.2.2 --- The Detailed Formulation --- p.36 / Chapter 3.2.3 --- Adapting Features from Original CRF to CRFDP --- p.51 / Chapter 4 --- Experiments --- p.54 / Chapter 4.1 --- Datasets --- p.55 / Chapter 4.2 --- Features --- p.57 / Chapter 4.3 --- Evaluation Metrics --- p.61 / Chapter 4.4 --- Results and Discussion --- p.63 / Chapter 5 --- Conclusions and Future Work --- p.67 / Bibliography --- p.69 / A --- p.76 / B --- p.78 / C --- p.88
|
167 |
Application of Boolean Logic to Natural Language Complexity in Political Discourse / Taing, Austin / 01 January 2019
Press releases are a major influence on public opinion of a politician, since they are a primary means of communicating with the public and directing discussion. Thus, the public’s ability to digest them is an important factor for politicians to consider. This study employs several well-studied measures of linguistic complexity and proposes a new one to examine whether politicians change their language to become more or less difficult to parse in different situations. It uses 27,500 press releases from the US Senate from 2004 to 2008 and examines election cycles and natural disasters, namely hurricanes, as situations where politicians’ language may change. We calculate the syntactic complexity measures clauses per sentence, T-unit length, and complex-T ratio, as well as the Automated Readability Index and Flesch Reading Ease of each press release. We also propose a proof-of-concept measure called logical complexity to test whether classical Boolean logic can be applied as a practical linguistic complexity measure. We find that language becomes more complex in coastal senators’ press releases concerning hurricanes, but see no significant change for those in election cycles. Our measure shows similar results to the well-established ones, suggesting that logical complexity is a useful lens for measuring linguistic complexity.
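Two of the measures named above, the Automated Readability Index and Flesch Reading Ease, have standard published formulas, sketched below. The tokenisation and the vowel-run syllable counter are simplifying assumptions made for illustration and are not taken from the study; the helper names are hypothetical.

```python
import re


def _sentences(text):
    # Assumption: a sentence ends at '.', '!' or '?'.
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def _words(text):
    return re.findall(r"[A-Za-z0-9']+", text)


def _syllables(word):
    # Naive heuristic: one syllable per run of vowels, at least one per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def automated_readability_index(text):
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43."""
    words, sentences = _words(text), _sentences(text)
    chars = sum(len(re.sub(r"[^A-Za-z0-9]", "", w)) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


def flesch_reading_ease(text):
    """FRE = 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    words, sentences = _words(text), _sentences(text)
    syllables = sum(_syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / len(sentences)
            - 84.6 * syllables / len(words))


simple = "The cat sat. The dog ran."
dense = "Legislators deliberated extensively regarding comprehensive appropriations."
print(automated_readability_index(simple), flesch_reading_ease(simple))
print(automated_readability_index(dense), flesch_reading_ease(dense))
```

A higher ARI and a lower Flesch Reading Ease both indicate text that is harder to read, so the two scores move in opposite directions on the same passage.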
|
168 |
Creation of a pronunciation dictionary for automatic speech recognition : a morphological approach / Nkosi, Mpho Caselinah / January 2012
Thesis (M.Sc. (Computer Science)) --University of Limpopo, 2012 / Pronunciation dictionaries or lexicons play an important role in guiding the predictive powers of an Automatic Speech Recognition (ASR) system. As the use of automatic speech recognition systems increases, there is a need for the development of dictionaries that cover a large number of inflected word forms to enhance the performance of ASR systems. The main purpose of this study is to investigate the contribution of the morphological approach to creating a more comprehensive and broadly representative Northern Sotho pronunciation dictionary for Automatic Speech Recognition systems.
The Northern Sotho verbs together with morphological rules are used to generate more valid inflected word forms in the Northern Sotho language for the creation of a pronunciation dictionary. The pronunciation dictionary is developed using the Dictionary Maker tool. The Hidden Markov Model Toolkit is used to develop a simple ASR system in order to evaluate the performance of the ASR system when using the created pronunciation dictionary.
|
169 |
Logic of Shared Significations on Internet Relay Chat / Mercier, David-Olivier / 01 October 2019
Through the observation of conversations on Internet Relay Chat and the quantitative analysis of “chat-logs”, I investigate the characteristics of this form of communication unique to the digital realm. My research rests on a theoretical framework integrating the semiotics and pragmatism of Charles S. Peirce (as primary groundwork) with the philosophy of Ludwig Wittgenstein and the sociology of Erving Goffman, to grasp shared significations in cyberspace simultaneously as logical process and as social practice. This exploratory case study yields evidence supporting the potential fruitfulness of Peircean philosophy as the foundation for a new paradigm in empirical communication research, and successfully puts to the test a particular type of method (computational and diagrammatic) suggested for such research.
|
170 |
Controlled Languages in Software User Documentation / Steensland, Henrik, Dervisevic, Dina / January 2005
In order to facilitate comprehensibility and translation, the language used in software user documentation must be standardized. If the terminology and language rules are standardized and consistent, the time and cost of translation will be reduced. For this reason, controlled languages have been developed. Controlled languages are subsets of other languages, purposely limited by restricting the terminology and grammar that is allowed.
The goal of this thesis is to investigate how using a controlled language can improve comprehensibility and translatability of software user documentation written in English. In order to reach our goal, we have performed a case study at IFS AB. We specify a number of research questions that help satisfy some of the goals of IFS and, when generalized, fulfill the goal of this thesis.
A major result of our case study is a list of sixteen controlled language rules. Some examples of these rules are control of the maximum allowed number of words in a sentence, and control of when the author is allowed to use past participles. We have based our controlled language rules on existing controlled languages, style guides, research reports, and the opinions of technical writers at IFS.
When we applied these rules to different user documentation texts at IFS, we managed to increase the readability score for each of the texts. Also, during an assessment test of readability and translatability, the rewritten versions were chosen in 85% of the cases by experienced technical writers at IFS.
Another result of our case study is a prototype application that shows that it is possible to develop and use a software checker that helps authors write documentation according to our suggested controlled language rules.
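A checker for one of the rule types mentioned above, a maximum number of words per sentence, might be sketched as follows. This is not the thesis's prototype; the 20-word limit, the sentence splitter, and the name `check_sentence_length` are assumptions chosen for illustration.

```python
import re

# Hypothetical limit; the abstract does not state the actual maximum.
MAX_WORDS_PER_SENTENCE = 20


def check_sentence_length(text, max_words=MAX_WORDS_PER_SENTENCE):
    """Return (sentence number, word count, sentence) for each sentence
    that exceeds max_words. Sentences are split after '.', '!' or '?'
    followed by whitespace, which is a simplification."""
    violations = []
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    for number, sentence in enumerate(sentences, start=1):
        count = len(sentence.split())
        if count > max_words:
            violations.append((number, count, sentence))
    return violations


sample = (
    "Click Save. "
    "If you want to change the settings that were defined by the administrator "
    "during the installation of the application then you must first contact "
    "the administrator and request the required permissions."
)
for number, count, sentence in check_sentence_length(sample):
    print(f"Sentence {number} has {count} words (limit {MAX_WORDS_PER_SENTENCE}).")
```

Rules about grammar, such as restricting past participles, would need part-of-speech information and are outside the scope of this sketch.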
|