11 |
Novas abordagens para o problema do alinhamento múltiplo de sequências / New approaches for the multiple sequence alignment problem
Almeida, André Atanasio Maranhão (1981-), 22 August 2018
Advisor: Zanoni Dias / Thesis (doctorate) - Universidade Estadual de Campinas, Instituto de Computação / Made available in DSpace on 2018-08-22T15:29:14Z (GMT). / Previous issue date: 2013
Abstract: Sequence alignment is widely recognized as one of the most important tasks in bioinformatics. Its importance stems from its role as a basic operation used by many other procedures in the field, such as database searches, visualization of the effect of evolution across a protein family, construction of phylogenetic trees, and identification of conserved motifs. Sequences can be aligned in pairs, a problem for which an exact algorithm with time complexity O(l^2) is known for sequences of length l. Three or more sequences can also be aligned simultaneously, which is called multiple sequence alignment (MSA). MSA, which is used in tasks such as detecting patterns that characterize protein families and predicting secondary and tertiary protein structures, is an NP-hard problem. In this work, heuristic methods were developed for the multiple alignment of protein sequences. The main existing approaches and methods were studied, and a series of implementations and evaluations was carried out.
First, 342 multiple aligners were built using the progressive approach. This widely used approach to MSA construction consists of three steps. In the first, a distance matrix is computed. Next, a guide tree is generated from the matrix and, finally, the MSA is built through pairwise alignments whose order is defined by the tree. The aligners developed combine different methods applied to each step. For computing the distance matrices, two methods were developed, both of which can also generate pairwise sequence alignments. One builds the alignment from local alignments, and the other uses a logarithmic function to penalize gaps. Other methods available in the PHYLIP tool were also used. For generating the guide trees, the classic UPGMA and Neighbor Joining methods were used, through implementations available in R. For building the multiple alignment, the single-block selection and closest-pair selection methods were implemented; these methods, which select the pair of alignments to be merged in the current cycle, are commonly used for that task. For merging a pair of alignments, 12 methods were implemented, inspired by two commonly used methods: consensus alignment and profile alignment. All possible combinations of these methods were made, resulting in 342 aligners. They were evaluated with respect to the quality of the alignments they generate, and the performance of the methods used in each step was also evaluated.
Next, evaluations were performed in the context of consistency-based alignment. In this approach, an MSA is considered optimal when it agrees with the majority of the optimal alignments for the n(n − 1)/2 pairwise alignments contained in the MSA. Changes were made to MUMMALS, a known multiple aligner that uses this approach. The k-mer counting method was modified by varying the value of k and the compressed alphabet and, in a separate evaluation, the initial part of the algorithm was replaced: the methods for computing the distance matrix and for generating the guide tree were swapped for others that had performed well in the tests of the progressive approach. In total, 89 variations of the original MUMMALS algorithm were implemented and evaluated and, although MUMMALS already produces high-quality alignments, significant improvements were achieved.
The work concluded with the implementation and evaluation of iterative algorithms. These are characterized by their dependence on other aligners to produce initial alignments. The iterative aligner refines these alignments through a series of cycles until alignment quality stabilizes. Two non-stochastic iterative aligners were implemented and evaluated, as well as a genetic algorithm (GA) for generating MSAs. In this genetic algorithm, implemented as a parameterizable environment for running genetic algorithms for MSA, called ALGAe, several experiments were carried out that progressively raised the quality of the generated alignments. Other approaches to multiple alignment construction were included in ALGAe: block-based, consensus-based, and template-based. The first was applied to generate individuals for the initial population. Block-based aligners were implemented using two distinct approaches and, for one of them, five variations were implemented. The second was applied to define a crossover operator, which uses the M-COFFEE tool to perform consensus-based alignments from individuals of the GA's current population, and the third was used to define a fitness function, which uses the PSIPRED tool to predict the secondary structures of the sequences. ALGAe allows a wide variety of new evaluations to be carried out. / Doctorate / Computer Science / Doctor of Computer Science
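The guide-tree step of the progressive approach can be sketched with UPGMA, one of the two tree-building methods named in the abstract. This is only a minimal illustration and not the thesis implementation (which relied on implementations available in R): the function name `upgma_merge_order`, the dictionary-based distance format, and the toy distances are assumptions made for the example.

```python
from itertools import combinations


def upgma_merge_order(names, dist):
    """Return the sequence of cluster merges chosen by UPGMA.

    names: list of sequence labels (hypothetical input format)
    dist:  dict mapping frozenset({a, b}) -> distance between labels a and b
    """
    # Each cluster is a tuple of the original labels it contains.
    clusters = [(n,) for n in names]
    # Distances between current clusters, keyed by a frozenset of two clusters.
    d = {frozenset({(a,), (b,)}): dist[frozenset({a, b})]
         for a, b in combinations(names, 2)}
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of clusters.
        x, y = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merged = x + y
        merges.append((x, y))
        clusters = [c for c in clusters if c not in (x, y)]
        for c in clusters:
            # UPGMA update: size-weighted average of the two old distances.
            d[frozenset({merged, c})] = (
                len(x) * d[frozenset({x, c})] + len(y) * d[frozenset({y, c})]
            ) / len(merged)
        clusters.append(merged)
    return merges


# Toy distance matrix (made-up values, for illustration only).
toy = {("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
       ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10}
toy_dist = {frozenset(k): v for k, v in toy.items()}
print(upgma_merge_order(["A", "B", "C", "D"], toy_dist))
```

In a progressive aligner, the merge order returned here would determine which pairs of sequences or partial alignments are aligned first.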
|
12 |
Towards Dynamic Programming on Generalized Data Structures: and Applications of Dynamic Programming in Bioinformatics
Berkemer, Sarah Juliane, 11 March 2020
Dynamic programming (DP) is a method for solving optimization problems. The problem is divided into overlapping subproblems, and an optimal solution is computed for each subproblem. These are then combined into the overall solution. Partial solutions are stored in a table so that each one has to be computed only once. In this way, a search space of exponential size can be searched in polynomial time and an optimal solution found. Dynamic programming was developed by Bellman in 1952, and one of its first applications was the detection of typing errors in programming.
DP algorithms are used often and in many different ways in bioinformatics, for example in the comparison of gene sequences, known as sequence alignment, or in the prediction of molecular structures. The amount of data, and with it the amount of analysis, is growing steadily, which is why new and more complex data structures are becoming increasingly important. One goal is therefore to develop DP algorithms that can be applied to data structures more complex than strings. Through the principle of algebraic dynamic programming (ADP), DP algorithms can be decomposed into smaller components that can then be developed further and modified independently of one another.
The thesis is divided into two parts. The first part contains the theoretical work on developing dynamic programming algorithms. It begins by presenting principles and definitions of dynamic programming (Chapter 2) to support a better understanding of the subsequent chapters.
The second part of the thesis shows various bioinformatics applications of DP algorithms to biological data. A first chapter (Chapter 5) introduces fundamentals of biological data and algorithms that are then used in the following chapters.
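The table-filling idea described in the first paragraph of this abstract can be illustrated with the Levenshtein edit distance between two strings, a standard textbook DP that also relates to the typo-detection application mentioned there. This is a generic sketch chosen for illustration; it is not code from the thesis, and the function name `edit_distance` is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # table[i][j] holds the edit distance between the prefixes a[:i] and b[:j],
    # so every subproblem is solved exactly once.
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        table[i][0] = i  # delete all i characters of a[:i]
    for j in range(len(b) + 1):
        table[0][j] = j  # insert all j characters of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,         # deletion
                table[i][j - 1] + 1,         # insertion
                table[i - 1][j - 1] + cost,  # match or substitution
            )
    return table[len(a)][len(b)]


print(edit_distance("kitten", "sitting"))  # prints 3
```

The number of possible edit scripts grows exponentially with string length, while the table has only (len(a) + 1) × (len(b) + 1) cells, which is the exponential-to-polynomial reduction the abstract refers to.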
|
13 |
Studying Evolutionary Change: Transdisciplinary Advances in Understanding and Measuring Evolution
Retzlaff, Nancy, 20 April 2020
Evolutionary processes can be found in almost any historical, i.e. evolving, system that erroneously copies from the past. Well-studied examples originate not only in evolutionary biology but also in historical linguistics. Yet an approach that would bind together studies of such evolving systems is still elusive. This thesis is an attempt to narrow this gap to some extent.
An evolving system can be described using characters that identify its changing features. While the problem of a proper choice of characters is beyond the scope of this thesis and remains in the hands of experts, we concern ourselves with some theoretical as well as data-driven approaches.
Given a well-chosen set of characters describing a system of different entities such as homologous genes, i.e. genes of the same origin in different species, we can build a phylogenetic tree. Consider the special case of gene clusters containing paralogous genes, i.e. genes of the same origin within a species, usually located close together, such as the well-known HOX cluster. These clusters are formed by stepwise duplication of their members, often involving unequal crossing over that forms hybrid genes. Gene conversion and possibly other mechanisms of concerted evolution further obfuscate phylogenetic relationships. Hence, it is very difficult or even impossible to disentangle the detailed history of gene duplications in gene clusters. The expansion of gene clusters through unequal crossing over, as proposed by Walter Gehring, leads to distinctive patterns of genetic distances. We show that this special class of distances still helps to extract phylogenetic information from the data.
Disregarding genome rearrangements, we find that the shortest Hamiltonian path then coincides with the ordering of paralogous genes in a cluster. This observation can be used to detect ancient genomic rearrangements of gene clusters and to distinguish gene clusters whose evolution was dominated by unequal crossing over within genes from those that expanded through other mechanisms.
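The link between a shortest Hamiltonian path and gene order can be illustrated on a toy distance matrix. In the sketch below, the distance between two genes is simply the difference of their positions in the cluster; this is an assumption made for the example, not the distance model derived in the thesis. The brute-force search over all permutations is only practical for a handful of genes, and the name `shortest_hamiltonian_path` is hypothetical.

```python
from itertools import permutations


def shortest_hamiltonian_path(dist):
    """Brute-force the shortest path visiting every node exactly once.

    dist: square, symmetric matrix (list of lists) of pairwise distances.
    Returns (order, length), where order is a tuple of node indices.
    """
    n = len(dist)
    best_order, best_len = None, float("inf")
    for order in permutations(range(n)):
        # A path and its reverse have the same length; keep one of them.
        if order[0] > order[-1]:
            continue
        length = sum(dist[order[i]][order[i + 1]] for i in range(n - 1))
        if length < best_len:
            best_order, best_len = order, length
    return best_order, best_len


# Toy cluster: gene i sits at position pos[i] on the chromosome (made-up data).
pos = [2, 0, 3, 1]
toy_dist = [[abs(p - q) for q in pos] for p in pos]
order, length = shortest_hamiltonian_path(toy_dist)
print(order, length)  # prints (1, 3, 0, 2) 3
```

The recovered order lists the genes by increasing chromosomal position, which is the behaviour the abstract describes for clusters without rearrangements.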
While the evolution of DNA or protein sequences is well studied and can be formally described, we find that this does not hold for other systems such as language evolution. This is due to a lack of detectable mechanisms that drive the evolutionary processes in other fields. Hence, it is hard to quantify distances between entities, e.g. languages, and therefore between the characters describing them. Starting out with distortions of distances, we first see that poor choices of the distance measure can lead to incorrect phylogenies. Given that phylogenetic inference requires additive metrics, we can infer the correct phylogeny from a distance matrix D if there is a monotonic, subadditive function ζ such that ζ^−1(D) is additive. We compute the metric-preserving transformation ζ as the solution of an optimization problem. This result shows that the problem of phylogeny reconstruction is well defined even if a detailed mechanistic model of the evolutionary process is missing.
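Additivity of a distance matrix can be checked with the four-point condition: for every four taxa, the two largest of the three pairwise sums must be equal. The sketch below applies it to a small tree metric, to the same metric distorted by a square root, and to the distorted metric after the distortion is undone. The square root is just one convenient monotonic, subadditive function chosen for illustration; the thesis computes ζ by solving an optimization problem, which is not reproduced here. The name `is_additive` and the branch lengths are assumptions.

```python
from itertools import combinations
from math import sqrt


def is_additive(d, tol=1e-9):
    """Four-point condition: in every quartet, the two largest sums coincide."""
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Path lengths in the quartet tree ((A,B),(C,D)) with leaf branches 1, 2, 3, 4
# and an internal branch of length 1 (made-up values).
tree = [
    [0, 3, 5, 6],
    [3, 0, 6, 7],
    [5, 6, 0, 7],
    [6, 7, 7, 0],
]
distorted = [[sqrt(x) for x in row] for row in tree]    # zeta applied to the tree metric
restored = [[x * x for x in row] for row in distorted]  # zeta^-1 applied to the distorted data

print(is_additive(tree), is_additive(distorted), is_additive(restored))
# prints True False True
```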
Yet, this does not hinder studies of language evolution using automated tools. As the number of large digital corpora has increased, so have the possibilities to study them automatically. The obvious parallels between historical linguistics and phylogenetics have led to many studies adapting bioinformatics tools to linguistic purposes. Here, we use jAlign to calculate bigram alignments, i.e. alignments computed by an algorithm that takes the adjacency of letters into account. Its performance is tested in different cognate recognition tasks.
One major obstacle in using pairwise alignments is the systematic errors they make, such as the underestimation of gaps and their misplacement. Applying multiple sequence alignments instead of a pairwise algorithm implicitly includes more evolutionary information and thus can overcome the problem of correct gap placement. They can be seen as a generalization of the string-to-string edit problem to more than two strings. With the steady increase in computational power, exact dynamic programming solutions have become feasible in practice for 3- and 4-way alignments as well. For the pairwise (2-way) case, there is a clear distinction between local and global alignments. As more sequences are considered, this distinction, which can in fact be made independently for both ends of each sequence, gives rise to a rich set of partially local alignment problems. So far these have remained largely unexplored. Thus, a general formal framework that gives rise to a classification of partially local alignment problems is introduced. It leads to a generic scheme that guides the principled design of exact dynamic programming solutions for particular partially local alignment problems.
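A minimal sketch of an exact three-way global alignment score by dynamic programming is given below. It assumes a sum-of-pairs column score with illustrative match, mismatch and gap values; these scoring choices and the name `three_way_score` are assumptions, not taken from the thesis. Only the fully global case is covered; the partially local variants classified in the thesis are not modeled.

```python
from itertools import product


def three_way_score(a, b, c, match=1, mismatch=-1, gap=-1):
    """Optimal sum-of-pairs score of a global alignment of three strings."""

    def column(x, y, z):
        # Score one alignment column; None stands for a gap character.
        s = 0
        for p, q in ((x, y), (x, z), (y, z)):
            if p is None and q is None:
                continue  # gap against gap contributes nothing
            if p is None or q is None:
                s += gap
            elif p == q:
                s += match
            else:
                s += mismatch
        return s

    # D[(i, j, k)] = best score for aligning a[:i], b[:j], c[:k].
    D = {(0, 0, 0): 0}
    # product() iterates in lexicographic order, so every predecessor cell
    # is filled in before the cell that depends on it.
    for i, j, k in product(range(len(a) + 1), range(len(b) + 1), range(len(c) + 1)):
        if (i, j, k) == (0, 0, 0):
            continue
        best = float("-inf")
        # Each move consumes a character from a non-empty subset of the strings.
        for di, dj, dk in product((0, 1), repeat=3):
            if (di, dj, dk) == (0, 0, 0):
                continue
            pi, pj, pk = i - di, j - dj, k - dk
            if min(pi, pj, pk) < 0:
                continue
            s = D[(pi, pj, pk)] + column(
                a[pi] if di else None,
                b[pj] if dj else None,
                c[pk] if dk else None,
            )
            best = max(best, s)
        D[(i, j, k)] = best
    return D[(len(a), len(b), len(c))]


print(three_way_score("AC", "AC", "AC"))  # prints 6
```

The table has one cell per triple of prefix lengths and seven predecessor moves per cell, so the cost grows with the product of the three sequence lengths; this is why exact solutions are limited to a few sequences.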
|
14 |
Improved Bayesian methods for detecting recombination and rate heterogeneity in DNA sequence alignments
Mantzaris, Alexander Vassilios, January 2011
DNA sequence alignments are usually not homogeneous. Mosaic structures may result as a consequence of recombination or rate heterogeneity. Interspecific recombination, in which DNA subsequences are transferred between different (typically viral or bacterial) strains, may result in a change of the topology of the underlying phylogenetic tree. Rate heterogeneity corresponds to a change of the nucleotide substitution rate. Various methods for simultaneously detecting recombination and rate heterogeneity in DNA sequence alignments have recently been proposed, based on complex probabilistic models that combine phylogenetic trees with factorial hidden Markov models or multiple changepoint processes. The objective of my thesis is to identify potential shortcomings of these models and explore ways to improve them. One shortcoming that I have identified is related to an approximation made in various recently proposed Bayesian models. The Bayesian paradigm requires the solution of an integral over the space of parameters. To render this integration analytically tractable, these models assume that the vectors of branch lengths of the phylogenetic tree are independent among sites. While this approximation reduces the computational complexity considerably, I show that it leads to the systematic prediction of spurious topology changes in the Felsenstein zone, that is, the area in the branch lengths configuration space where maximum parsimony consistently infers the wrong topology due to long-branch attraction. I demonstrate these failures by using two Bayesian hypothesis tests, based on an inter- and an intra-model approach to estimating the marginal likelihood. I then propose a revised model that addresses these shortcomings, and demonstrate its improved performance on a set of synthetic DNA sequence alignments systematically generated around the Felsenstein zone.
The core model explored in my thesis is a phylogenetic factorial hidden Markov model (FHMM) for detecting two types of mosaic structures in DNA sequence alignments, related to recombination and rate heterogeneity. The focus of my work is on improving the modelling of the latter aspect. Earlier research efforts by other authors have modelled different degrees of rate heterogeneity with separate hidden states of the FHMM. Their work fails to appreciate the intrinsic difference between two types of rate heterogeneity: long-range regional effects, which are potentially related to differences in the selective pressure, and the short-term periodic patterns within the codons, which merely capture the signature of the genetic code. I have improved these earlier phylogenetic FHMMs in two respects. Firstly, by sampling the rate vector from the posterior distribution with RJMCMC, I have made the modelling of regional rate heterogeneity more flexible, and I infer the number of different degrees of divergence directly from the DNA sequence alignment, thereby dispensing with the need to arbitrarily select this quantity in advance. Secondly, I explicitly model within-codon rate heterogeneity via a separate rate modification vector. In this way, the within-codon effect of rate heterogeneity is imposed on the model a priori, which facilitates the learning of the biologically more interesting effect of regional rate heterogeneity a posteriori. I have carried out simulations on synthetic DNA sequence alignments, which have borne out my conjecture. The existing model, which does not explicitly include the within-codon rate variation, has to model both effects with the same modelling mechanism. As expected, it was found to fail to disentangle these two effects. By contrast, I have found that my new model clearly separates within-codon rate variation from regional rate heterogeneity, resulting in more accurate predictions.
|
15 |
Nimuendajú revisitado: arqueologia da antiga Guiana Brasileira / Nimuendajú riviewed: Archaeology of ancient Brazilian Guyana
Fonseca Júnior, João Aires Ataide da, 16 December 2008
This work is a methodological effort to apply an archaeological predictive model to sites in the State of Amapá, Brazil, known as Stone Alignments (Alinhamentos de Pedra). After analysis of historical documents from the 1920s and of research carried out in the 1940s, together with the surveys conducted by the Museu Goeldi in 2005, it was possible to test the proposed predictive model in the field. Its construction also drew on discussions of the formation processes of the archaeological record and on tests of hypotheses about these sites that have been raised since the first research at the end of the nineteenth century. The results achieved, although preliminary, provided an overview of the history of Amazonian archaeology and an assessment of how technologies such as Geographic Information Systems (GIS) can yield positive results for archaeological research in the region.
|
16 |
Chatt som umgängesform : Unga skapar nätgemenskap / Chat room communities : Young people aligning on the internet
Sjöberg, Jeanette, January 2010
This dissertation focuses on social interaction patterns between young people in an online chat room, analyzing how social order is displayed and constituted. An overall issue concerns when and how the participants manage to co-create social communities within this setting. The data draw on an ethnographic study in which chat room observations and online recordings were carried out over three years. Methodological guidelines from discursive psychology and conversation analysis have been used in making detailed sequential analyses of chat room interactions. The thesis builds on social practice theories, including sociocultural theorizing and studies of language socialization, and work on positionings. The findings show that familiarity with chat language, including the use of emoticons and leet speak, as well as familiarity with netiquette and conversational routines such as greeting and parting routines, are vital for the participants in order to become part of local groups and alignments. Playful improvisation is an important feature of chat room interaction. Moreover, full participation requires involvement in the lives of co-participants and extended dialogues over time. In the process of moving from peripheral to more central participation, the participants formed alignments with other participants and positioned themselves and their co-participants in the chat room. Such alignments were often founded on a shared taste in, for example, musical genres and everyday consumption patterns. Shared views on school, sex and relationships, as well as age or gender alignments, also played a role in the creation of local communities. Conversely, issues of exclusion were recurrent features of chat room interplay. All things considered, this created participation patterns that formed local hierarchies which were not fixed or static, but rather fleeting and dynamic. And yet, the participants generally did not transcend or challenge contemporary age and gender boundaries.
|
17 |
Identification Of Functionally Orthologous Protein Groups In Different Species Based On Protein Network Alignment
Yaveroglu, Omer Nebil, 01 September 2010
In this study, an algorithm named ClustOrth is proposed for determining and matching functionally orthologous protein clusters in different species. The algorithm requires, as prior information, the protein interaction networks of the organisms to be compared and the GO terms of the proteins in these networks. After determining the functionally related protein groups using the Repeated Random Walks algorithm, the method maps the identified protein groups to one another according to a defined similarity metric. To evaluate the similarities of protein groups, graph-theoretical information is used together with context information about the proteins. The clusters are aligned using GO-term-based protein similarity measures defined in previous studies. These alignments are used to evaluate cluster similarities by defining a cluster similarity metric from the protein similarities. The top-scoring cluster alignments are considered orthologous. Comparison against several data sources providing orthology information has shown that the defined cluster similarity metric can be used to make inferences about the orthology of protein groups. Comparison with a protein orthology prediction algorithm named ISORANK also showed that the ClustOrth algorithm is successful in determining orthologies between proteins. However, the cluster similarity metric is too strict, and many cluster matches fail to score highly on it. For this reason, the number of predictions made is low. This problem can be overcome by introducing additional sources of information about the proteins in the clusters when evaluating them. The ClustOrth algorithm also outperformed the NetworkBLAST algorithm, which aims to find orthologous protein clusters by using protein sequence information directly to determine orthologies. It can be concluded that this study is one of the leading studies addressing the protein cluster matching problem for computationally identifying orthologous functional modules of protein interaction networks.
|
19 |
Functional heterointerfaces via electromodulation spectroscopy
Khong, Siong-Hee, January 2010
Functional heterojunctions in organic electronic devices are interfaces formed either between a conducting electrode and an organic semiconductor or between two different organic semiconductors in blended and multilayered structures. This thesis is primarily concerned with the energy level alignment and the interfacial electronic structures at functional heterojunctions encountered in electronic devices made with solution-processable semiconducting polymers. The electronic structures across these heterointerfaces are investigated with the combined use of electromodulation and photoemission spectroscopic techniques. Electromodulation and ultraviolet photoemission spectroscopic techniques enable direct determination of the surface work functions of electrodes at the electrode/semiconducting polymer interfaces. We overcame the inherent problems of electromodulation spectroscopy, which undermine accurate determination of interfacial electronic structures, by performing electroabsorption (EA) measurements at reduced temperatures. We show in this thesis that low-temperature EA spectroscopy is a surface-sensitive technique that can determine the interface electronic structures in electrode/polymer semiconductor/electrode diodes. Using this technique, we demonstrate that the energy level alignments in these solution-processed organic electronic devices are determined by the surface work functions of passivated metals rather than by those of clean metals encountered in ultrahigh vacuum. This thesis also presents our studies of the electronic structures in polymeric diodes with type II donor-acceptor heterojunctions using EA spectroscopy. We show that minimising the measurement temperature and attenuating the EA illumination intensity enable accurate determination of the electronic structures in these devices. We demonstrate that the electronic structures and the performance characteristics of multilayered polymer light-emitting diodes are also determined by the surface work functions of passivated metals. Our investigations confirm that electronic doping of the organic active layers, rather than minimisation of the Schottky barriers at electrode/polymer contacts, holds the key to realising high-performance organic light-emitting devices.
|
20 |
Ordföljdsvariation inom kardinaltalssystem : Extraktion av ordföljdstypologi ur parallella texter / Numeral-dependent word order of cardinal numbers
Kann, Amanda, January 2019
Typological word order classification for cardinal numerals has generally used a binary pre- or postnominal model, but in some languages word order behaviour has been shown to vary between individual cardinal numerals. This phenomenon can be studied quantitatively on a larger typological scale using massively parallel texts, given a suitable language-independent method for parsing non-annotated texts. In this study, cardinal numeral-dependent word order variation is extracted from Bible translations in 1336 languages through word alignment and annotation transfer: source texts are marked up with syntactic and lexical annotation, which is transferred to non-annotated word-aligned data in the other languages, and word order tendencies are measured statistically for each numeral and language. Classification of dominant numeral word order using the transferred annotations agreed with the manually compiled classifications in the WALS database for 87 % of the languages in common, which is in line with previous evaluations of similar methods. Variation in word order between individual cardinal numerals was found in a substantial share of the languages examined, supporting the case for a more detailed classification of numeral word order typology. Analysis of serial word order variation, where a cardinal numeral of a certain value separates continuous numeral sequences with different dominant word orders, showed that by far the most common structure for this type of variation was a prenominal expression for 1 in languages whose dominant numeral word order was classified as postnominal.
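The per-numeral classification step can be sketched as a simple tally over word-aligned observations. The abstract only says that word order tendencies are "measured statistically", so the majority-vote rule, the `(numeral, order)` input format, and the name `dominant_orders` below are assumptions made for illustration, and the observations are invented.

```python
from collections import Counter, defaultdict


def dominant_orders(observations):
    """Classify each numeral as 'pre', 'post' or 'mixed' by majority vote.

    observations: iterable of (numeral, order) pairs, where order is 'pre'
    (numeral before the noun) or 'post' (numeral after the noun).
    """
    counts = defaultdict(Counter)
    for numeral, order in observations:
        counts[numeral][order] += 1
    result = {}
    for numeral, c in counts.items():
        if c["pre"] > c["post"]:
            result[numeral] = "pre"
        elif c["post"] > c["pre"]:
            result[numeral] = "post"
        else:
            result[numeral] = "mixed"
    return result


# Invented observations for one hypothetical language: 1 is prenominal,
# while 2 and 3 follow the noun.
obs = [(1, "pre"), (1, "pre"), (1, "post"),
       (2, "post"), (2, "post"),
       (3, "post"), (3, "post"), (3, "pre")]
print(dominant_orders(obs))  # prints {1: 'pre', 2: 'post', 3: 'post'}
```

A language like the one in this toy example would show the serial pattern the abstract reports as most common: a prenominal 1 in an otherwise postnominal numeral system.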
|