  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Statistical Learning in a Bilingual Environment

Tsui, Sin Mei 30 August 2018 (has links)
Statistical learning refers to the ability to track regular patterns in sensory input from the ambient environment. This learning mechanism can exploit a wide range of statistical structures (e.g., frequency, distribution, and co-occurrence probability). Given its regularities and hierarchical structure, language is essentially a pattern-based system, and researchers have therefore argued that statistical learning is fundamental to language acquisition (e.g., Saffran, 2003). Indeed, young infants and adults can find words in artificial languages by tracking syllable co-occurrence probabilities and extracting words on that basis (e.g., Saffran, Aslin & Newport, 1996a). However, prior studies have mainly focused on whether learners can statistically segment words from a single language; whether learners can segment words from two artificial languages remains largely unknown. Given that the majority of the global population is bilingual (Grosjean, 2010), it is necessary to study whether learners can use the statistical learning mechanism to segment words from two language inputs, which is the focus of this thesis. I examined adult and infant learners to answer three questions: (i) Can learners use French and English phonetic cues within a single individual’s speech to segment words successfully from two languages? (ii) Do bilinguals outperform monolinguals? (iii) Do specific factors, such as cognitive ability or bilingual experience, underlie any potential bilingual advantage in word segmentation across two languages? In Study 1, adult learners generally could use French and English phonetic cues to segment words from two overlapping artificial languages. Importantly, simultaneous bilinguals who had learned French and English from birth segmented more correct words than monolinguals, multilinguals, and sequential French-English bilinguals.
Early bilingual experience may lead learners to be more flexible when processing information in new environments and/or more sensitive to subtle cues that mark changes in language input. Further, individuals’ cognitive abilities were not related to segmentation performance, suggesting that the observed simultaneous-bilingual segmentation advantage is not related to any bilingual cognitive advantage (Bialystok, Craik, & Luk, 2012). In Study 2, I tested 9.5-month-olds, who are actively discovering words in their natural environment, in an infant version of the adult task. Surprisingly, monolingual, but not bilingual, infants successfully used French and English phonetic cues to segment words from two languages. The observed difference in segmentation may be related to how infants process native and non-native phonetic cues, as the French phonetic cues are non-native to monolingual infants but native to bilingual infants. Finally, the observed difference in segmentation ability was again not driven by cognitive skills. In sum, the current thesis provides evidence that both adults and infants can use phonetic cues to statistically segment words from two languages. The implications of why early bilingualism plays a role in determining learners’ segmentation ability are discussed.
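The word-finding mechanism the thesis builds on, tracking syllable co-occurrence (transitional) probabilities and positing word boundaries at probability dips (Saffran, Aslin & Newport, 1996a), can be sketched as follows. The two-word toy language and its syllables are invented for illustration:

```python
from collections import defaultdict

def transitional_probabilities(syllables):
    """Estimate P(next | current) for each adjacent syllable pair."""
    pair_counts = defaultdict(int)
    first_counts = defaultdict(int)
    for a, b in zip(syllables, syllables[1:]):
        pair_counts[(a, b)] += 1
        first_counts[a] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}

def segment_at_dips(syllables, tps):
    """Place a word boundary wherever the transitional probability
    drops below both of its neighbours (a local minimum)."""
    probs = [tps[(a, b)] for a, b in zip(syllables, syllables[1:])]
    words, start = [], 0
    for i in range(1, len(probs) - 1):
        if probs[i] < probs[i - 1] and probs[i] < probs[i + 1]:
            words.append(syllables[start:i + 1])
            start = i + 1
    words.append(syllables[start:])
    return words
```

With within-word transitional probabilities near 1.0 and lower probabilities across word boundaries, the dips recover the original words from the continuous syllable stream.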
2

Efficient database management based on complex association rules

Zhang, Heng January 2017 (has links)
The large amounts of data accumulated by applications are stored in databases. At such volumes, name conflicts and missing values sometimes occur, which prevents certain types of analysis. In this work, we solve the name conflict problem by comparing the similarity of the data and converting the test data into the form of a given template dataset. Many methods exist to discover knowledge from a given dataset; one popular method is association rule mining, which finds associations between items. This study unifies incomplete data based on association rules. However, most rules produced by traditional association rule mining are item-to-item rules, which alone do not solve the problem. The data recovery system is instead based on complex association rules and can find two further types of rules: prefix pattern-to-item and suffix pattern-to-item rules. Using complex association rules, several missing values are filled in. To find the frequent prefixes and frequent suffixes, the system uses an FP-tree to reduce time cost and redundancy. A phrase segmentation method, which splits a sentence into phrases based on the cohesion ("viscosity") between adjacent words, can also be used in this system. Additionally, data compression and hash maps were used to speed up the search.
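As an illustrative sketch of the item-to-item rules that the system generalizes, the following mines A → B rules from transactions by simple support and confidence counting. This is not the thesis's implementation: a production system would use an FP-tree rather than enumerating pairs, and the grocery items are invented examples:

```python
from itertools import combinations
from collections import Counter

def mine_item_rules(transactions, min_support=2, min_confidence=0.8):
    """Mine item-to-item rules a -> b: keep the rule when the pair
    occurs at least min_support times and P(b | a) >= min_confidence."""
    item_counts = Counter()
    pair_counts = Counter()
    for t in transactions:
        items = set(t)
        item_counts.update(items)
        pair_counts.update(combinations(sorted(items), 2))
    rules = {}
    for (a, b), n in pair_counts.items():
        if n < min_support:
            continue
        if n / item_counts[a] >= min_confidence:
            rules[a] = b          # a reliably implies b
        if n / item_counts[b] >= min_confidence:
            rules[b] = a          # b reliably implies a
    return rules
```

A rule such as `butter -> bread` could then fill in a missing value in a record that contains `butter` but lacks the associated item.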
3

Nonlinguistic Pitch and Timing Patterns in Word Segmentation

Raybourn, Tracey L. 13 August 2010 (has links)
No description available.
4

Using Text Mining Techniques for Automatically Classifying Public Opinion Documents

Chen, Kuan-hsien 19 January 2009 (has links)
In a democratic society, the number of public opinion documents increases by the day, and there is a pressing need to classify these documents automatically. Traditional approaches to document classification involve word segmentation and the use of stop words, corpora, and grammar analysis to retrieve the key terms of documents. However, with the emergence of new terms, traditional methods that rely on a dictionary or thesaurus may suffer lower accuracy. This study therefore proposes a new method that does not require a pre-built dictionary or thesaurus and is applicable to documents written in any language and to unstructured text. Specifically, the classification method employs a genetic algorithm. In this method, each training document is represented by several chromosomes, and the characteristic terms of the document are determined from the gene values of these chromosomes. The fitness function, which the genetic algorithm requires to evaluate an evolved chromosome, considers the similarity to the chromosomes of documents of other types. This study used FAQ data from the Taipei City Mayor's e-mail box to evaluate the proposed method while varying document length. The results show that the proposed method achieves an average accuracy of 89%, an average precision of 47%, and an average recall of 45%; the F-measure reaches up to 0.7. The results confirm that the number of training documents, the content of the training documents, the similarity between document types, and document length all contribute to the effectiveness of the proposed method.
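A minimal, hypothetical sketch of the idea, not the thesis's actual chromosome encoding: documents are modeled as term sets, a chromosome is a candidate set of characteristic terms, and fitness rewards terms that appear in the document's own class while penalizing overlap with other classes:

```python
import random

def fitness(chromosome, own_docs, other_docs):
    """Score a term set: + for hits in own-class documents,
    - for hits in other-class documents."""
    own = sum(len(chromosome & d) for d in own_docs)
    other = sum(len(chromosome & d) for d in other_docs)
    return own - other

def evolve_terms(own_docs, other_docs, vocab, n_terms=2,
                 pop_size=20, generations=40, seed=0):
    """Tiny elitist GA: keep the fitter half, mutate it to refill."""
    rng = random.Random(seed)
    pop = [frozenset(rng.sample(vocab, n_terms)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(c, own_docs, other_docs), reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        for c in survivors:
            mutated = set(c)
            mutated.remove(rng.choice(sorted(mutated)))  # drop one term
            mutated.add(rng.choice(vocab))               # add a random term
            children.append(frozenset(mutated))
        pop = survivors + children
    return max(pop, key=lambda c: fitness(c, own_docs, other_docs))
```

The real fitness function in the study compares chromosomes across document types; here the set-overlap score stands in for that similarity measure.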
5

An Enhanced Conditional Random Field Model for Chinese Word Segmentation

Huang, Jhao-ming 03 February 2010 (has links)
In Chinese, the smallest meaningful unit is the word, which is composed of a sequence of characters, and a sentence is a sequence of words written without any separation between them. In information retrieval and data mining, a sequence of Chinese characters must be segmented before it can be used; this process is called Chinese word segmentation. Research on Chinese word segmentation has been pursued for many years, and although some recent work achieves very high overall performance, recall on words not in the dictionary reaches only sixty or seventy percent. The approach described in this thesis uses linear-chain conditional random fields (CRFs) for more accurate Chinese word segmentation. Our study uses a discriminatively trained model with two proposed feature templates for deciding the boundaries between characters. We also propose three further methods, duplicate word repartition, date representation repartition, and segment refinement, to enhance the accuracy of the resulting segments. In the experiments, we test several approaches and compare the results with those of Li et al. and of Lau and King on three Chinese word corpora. The results show that the improved feature template, which exploits prefix and suffix information, increases both recall and precision; for example, the F-measure reaches 0.964 on the MSR dataset. By detecting repeated characters, duplicated characters can be better repartitioned without extra resources. Wrongly segmented dates can likewise be better repartitioned by the proposed method, which handles numbers, dates, and measure words.
If a word is segmented differently from the corresponding gold-standard segmentation corpus, a proper segment can be produced by repartitioning the assembled segment composed of the current segment and its adjacent segment. In applying conditional random fields to Chinese word segmentation, we have thus proposed a feature template that yields better results and three methods that address specific segmentation problems.
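The abstract does not state the exact label set; assuming the common BMES scheme (Begin/Middle/End/Single) used with linear-chain CRF segmenters, turning a tagged character sequence back into words can be sketched as:

```python
def tags_to_words(chars, tags):
    """Decode a BMES tag sequence (Begin/Middle/End/Single), e.g. from a
    linear-chain CRF, into a list of words."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":             # single-character word
            if current:            # recover from an ill-formed sequence
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":           # word begins
            if current:
                words.append(current)
            current = ch
        elif tag == "M":           # word continues
            current += ch
        else:                      # "E": word ends
            words.append(current + ch)
            current = ""
    if current:                    # flush a dangling partial word
        words.append(current)
    return words
```

The CRF itself only predicts the per-character tags; this decoding step is where boundary decisions become actual segments, which is also where post-processing such as the duplicate word and date repartition methods would apply.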
6

Understanding Patterns in Infant-Directed Speech in Context: An Investigation of Statistical Cues to Word Boundaries

Hartman, Rose 01 May 2017 (has links)
People talk about coherent episodes of their experience, leading to strong dependencies between words and the contexts in which they appear. Consequently, language within a context is more repetitive and more coherent than language sampled from across contexts. In this dissertation, I investigated how patterns in infant-directed speech differ under context-sensitive compared to context-independent analysis. In particular, I tested the hypothesis that cues to word boundaries may be clearer within contexts. Analyzing a large corpus of transcribed infant-directed speech, I implemented three different approaches to defining context: a top-down approach using the occurrence of key words from pre-determined context lists, a bottom-up approach using topic modeling, and a subjective coding approach where contexts were determined by open-ended, subjective judgments of coders reading sections of the transcripts. I found substantial agreement among the context codes from the three different approaches, but also important differences in the proportion of the corpus that was identified by context, the distribution of the contexts identified, and some characteristics of the utterances selected by each approach. I discuss implications for the use and interpretation of contexts defined in each of these three ways, and the value of a multiple-method approach in the exploration of context. To test the strength of statistical cues to word boundaries in context-specific sub-corpora relative to a context-independent analysis of cues to word boundaries, I used a resampling procedure to compare the segmentability of context sub-corpora defined by each of the three approaches to a distribution of random sub-corpora, matched for size for each context sub-corpus. 
Although my analyses confirmed that context-specific sub-corpora are indeed more repetitive, the data did not support the hypothesis that speech within contexts provides richer information about the statistical dependencies among phonemes than is available when the same dependencies are analyzed without respect to context. Alternative hypotheses and future directions for further elucidating this phenomenon are discussed.
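The resampling comparison can be sketched as follows, using 1 minus the type/token ratio as a stand-in repetitiveness statistic (the dissertation's actual segmentability measure is based on dependencies among phonemes, and all names here are illustrative):

```python
import random

def repetitiveness(utterances):
    """1 - type/token ratio: higher means more repeated words."""
    tokens = [w for utt in utterances for w in utt]
    return 1.0 - len(set(tokens)) / len(tokens)

def resample_percentile(corpus, context, statistic, n_resamples=1000, seed=0):
    """Where does statistic(context) fall in the null distribution of the
    same statistic over random, size-matched sub-corpora of the corpus?"""
    rng = random.Random(seed)
    observed = statistic(context)
    null = [statistic(rng.sample(corpus, len(context)))
            for _ in range(n_resamples)]
    return observed, sum(v < observed for v in null) / n_resamples
```

A context sub-corpus whose statistic sits far in the upper tail of the size-matched null distribution is more repetitive (or more segmentable) than chance sampling would predict.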
7

Evaluation of word segmentation algorithms applied on handwritten text

Isaac, Andreas January 2020 (has links)
The aim of this thesis is to build word segmentation algorithms and evaluate how they perform when extracting words from historical handwritten documents. Since historical documents often contain background noise, the thesis also investigates whether applying a background removal algorithm affects the final result. Three types of historical handwritten documents are used to compare the output of two different word segmentation algorithms. The results indicate that the background removal algorithm increases the accuracy of the word segmentation. The newly developed word segmentation algorithm successfully extracts a majority of the words, while the pre-existing algorithm has difficulties with some documents. A key conclusion is that the type of document largely determines whether a poor result is obtained; hence, different algorithms may be needed rather than a single one for all document types.
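The thesis does not name its background removal algorithm; one standard choice for separating ink from noisy paper before segmentation is Otsu's global threshold, sketched here in NumPy as an assumption rather than the author's method:

```python
import numpy as np

def otsu_threshold(gray):
    """Pick the grey level that maximises between-class variance
    between 'ink' and 'background' pixels."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = gray.size
    sum_all = np.dot(np.arange(256), hist)
    best_t, best_var = 0, -1.0
    w0, sum0 = 0.0, 0.0
    for t in range(256):
        w0 += hist[t]                      # pixels at or below t
        if w0 == 0 or w0 == total:
            continue
        sum0 += t * hist[t]
        m0 = sum0 / w0                     # mean of the dark class
        m1 = (sum_all - sum0) / (total - w0)  # mean of the bright class
        var = w0 * (total - w0) * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def remove_background(gray):
    """Binarise: ink (<= threshold) to black, background to white."""
    t = otsu_threshold(gray)
    return np.where(gray <= t, 0, 255).astype(np.uint8)
```

Handwriting on aged paper is roughly bimodal in grey level, which is the case Otsu handles well; the thesis's observation that document type drives the result suggests a single global threshold would not suit every document either.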
8

Sensitivity to determiners and DP segmentation by Brazilian infants

Uchôa, Danielle Novais 05 April 2013 (has links)
Funding: CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico). / This study investigates infants' sensitivity to the phonetic form of determiners and analyzes whether this sensitivity helps 13-month-old Brazilian babies acquiring Portuguese to segment the determiner phrase (DP) into smaller units (determiner + noun). The theoretical approach adopted seeks to reconcile a psycholinguistic treatment of language acquisition with a linguistic theory by integrating the Phonological Bootstrapping model (MORGAN & DEMUTH, 1996; CHRISTOPHE et al., 1997) with the Minimalist Program, especially its conception of the Faculty of Language (HAUSER, CHOMSKY & FITCH, 2002), understood from two perspectives: the narrow sense (FLN, Faculty of Language in the narrow sense) and the broad sense (FLB, Faculty of Language in the broad sense). This reconciliation allows us to explain how the child arrives at the syntax of her language from distributional and prosodic cues available at the phonic interface. Studies conducted in several languages, including Brazilian Portuguese (Name, 2002), suggest that by around 10 months of age children are already able to recognize function words in the speech stream from their acoustic and distributional characteristics, using them as cues for lexical and syntactic access.
The hypotheses are that (i) at 13 months, the child is sensitive to the phonic form of determiners, distinguishing real determiners from nonsense determiners, and (ii) is able to segment a DP consisting of a real determiner + a nonsense noun. Our results suggest that 13-month-old babies are sensitive to the phonic form of the determiners of their language, reacting differently when presented with real determiners (o / um / este / aquele) or nonsense determiners (ône / ór / ugi / ófupi). They also suggest that the children were able to segment the DP into smaller units, since they reacted differently to the familiarized nonsense nouns depending on whether they were preceded by a real determiner or a nonsense determiner.
9

A Japanese-to-English Statistical Machine Translation System for Technical Documents / 技術文書に対する日英統計的機械翻訳システム

Sudoh, Katsuhito 23 January 2015 (has links)
Doctoral thesis (Doctor of Informatics), Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University. No abstract available.
10

Exploiting Vocabulary, Morphological, and Subtree Knowledge to Improve Chinese Syntactic Analysis / 語彙的、形態的、および部分木知識を用いた中国語構文解析の精度向上

Shen, Mo 23 March 2016 (has links)
Doctoral thesis (Doctor of Informatics), Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University. No abstract available.
