1

A statistical investigation into the provenance of De Doctrina Christiana, attributed to John Milton

Tweedie, Fiona Jane January 1997
The aim of this study is to conduct an objective investigation into the provenance of De Doctrina Christiana, a theological treatise attributed to Milton since its discovery in 1823. This attribution was questioned in 1991, provoking a series of papers, one of which makes a plea for an objective analysis, which I aim to supply. I begin by reviewing critically some techniques that have recently been applied to stylometry, including methods from artificial intelligence, linguistics and statistics. The chapter concludes with an investigation into the QSUM technique, finding it to be invalid. As De Doctrina Christiana is written in neo-Latin, I examine previous stylometric work on Latin, then turn to historical matters, including censorship and the physical characteristics of the manuscript. The text is the only theological work in the extant Milton canon. As genre as well as authorship affects style, I consider theories of genre, which influence the choice of suitable control texts. Chapter seven deals with the methodology used in the study. The analysis then follows a hierarchical structure. I first establish which techniques distinguish between Milton and the control texts while maintaining the internal consistency of the authors; the most frequently occurring words are found to be good discriminators. I then use this technique to examine De Doctrina Christiana alongside the Milton and control texts. A clear difference is found between texts from the polemic and exegetical genres, and samples from De Doctrina Christiana fall into two groups. This heterogeneity forms the third part of the analysis. No apparent difference is found between sections of the text written by different amanuenses, but the Epistle appears to be markedly more Miltonic than the rest. In addition, postulated insertions into chapter X of Book I appear to show a Miltonic influence. I conclude by examining the hypothesis of a Ramist ordering to the text.
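The discriminator this abstract settles on, relative frequencies of the most frequently occurring words, can be sketched in a few lines of Python. The word-list size, the Euclidean distance, the nearest-profile decision rule, and the placeholder texts below are illustrative assumptions, not Tweedie's actual procedure.

    from collections import Counter
    import math

    def most_frequent_words(texts, n=50):
        # The n most frequent words across all texts, used as discriminators.
        pooled = Counter()
        for t in texts:
            pooled.update(t.lower().split())
        return [w for w, _ in pooled.most_common(n)]

    def word_freqs(text, vocab):
        # Relative frequencies of the marker words in one text.
        tokens = text.lower().split()
        counts = Counter(tokens)
        total = len(tokens) or 1
        return [counts[w] / total for w in vocab]

    def distance(p, q):
        # Euclidean distance between two frequency profiles.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    milton_texts = ["...", "..."]    # placeholder Milton samples
    control_texts = ["...", "..."]   # placeholder control-author samples
    disputed = "..."                 # placeholder disputed sample

    vocab = most_frequent_words(milton_texts + control_texts)
    milton_profile = word_freqs(" ".join(milton_texts), vocab)
    control_profile = word_freqs(" ".join(control_texts), vocab)
    sample = word_freqs(disputed, vocab)

    verdict = "Milton" if distance(sample, milton_profile) < distance(sample, control_profile) else "control"
    print("Disputed sample is closer to the", verdict, "profile")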
2

Structures in complex systems : Playing dice with networks and books

Bernhardsson, Sebastian January 2009
Complex systems are neither perfectly regular nor completely random. They consist of a multitude of players who, in many cases, play together in a way that makes their combined strength greater than the sum of their individual achievements. It is often very effective to represent these systems as networks where the actual connections between the players take on a crucial role. Networks exist all around us and are an important part of our world, from the protein machinery inside our cells to social interactions and man-made communication systems. Many of these systems have developed over a long period of time and are constantly undergoing changes driven by complicated microscopic events. These events are often too complicated for us to accurately resolve, making the world seem random and unpredictable. There are, however, ways of using this unpredictability in our favor by replacing the true events with much simpler stochastic rules giving effectively the same outcome. This allows us to capture the macroscopic behavior of the system, to extract important information about its dynamics, and to learn about the reasons for what we observe. Statistical mechanics gives the tools to deal with such large systems driven by underlying random processes under various external constraints, much like how intracellular networks are driven by random mutations under the constraint of natural selection. This similarity makes it interesting to combine the two and to apply some of the tools provided by statistical mechanics to biological systems. In this thesis, several null models are presented, with this viewpoint in mind, to capture and explain different types of structural properties of real biological networks. The most recent major transition in evolution is the development of language, both spoken and written. This thesis also takes up the subject of quantitative linguistics through the eyes of a physicist, here called linguaphysics. Also in this case the data is analyzed under an assumption of underlying randomness. It is shown that some statistical properties of books, previously thought to be universal, turn out to exhibit author-specific size dependencies. A meta book theory is put forward which explains this dependency by describing the writing of a text as pulling a section out of a huge, individual, abstract mother book.
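The null-model idea described above, replacing complicated microscopic events with simple stochastic rules and asking whether an observed structure exceeds what chance alone would produce, can be illustrated with a generic degree-preserving rewiring sketch. The networkx library, the karate-club example graph, and average clustering as the test property are illustrative assumptions; the null models developed in the thesis are tailored to biological networks.

    import networkx as nx

    def degree_preserving_null(G, swaps_per_edge=10, seed=0):
        # Randomized copy of G with the same degree sequence, obtained by
        # repeatedly swapping pairs of edges (a standard network null model).
        H = G.copy()
        n = swaps_per_edge * H.number_of_edges()
        nx.double_edge_swap(H, nswap=n, max_tries=100 * n, seed=seed)
        return H

    # Compare an observed property (here: average clustering) with its null expectation.
    G = nx.karate_club_graph()  # stand-in for a real biological network
    observed = nx.average_clustering(G)
    null_values = [nx.average_clustering(degree_preserving_null(G, seed=s)) for s in range(20)]
    print("observed:", round(observed, 3), "null mean:", round(sum(null_values) / len(null_values), 3))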
3

Authorship Attribution Through Words Surrounding Named Entities

Jacovino, Julia Maureen 03 April 2014
In text analysis, authorship attribution can be approached in a variety of ways, and the field of computational linguistics becomes more important as the need for authorship attribution and text analysis becomes more widespread. For this research, the pre-existing authorship attribution software Java Graphical Authorship Attribution Program (JGAAP) incorporates a named entity recognizer, specifically the Stanford Named Entity Recognizer, to probe texts of similar genre and to aid in identifying the correct author. This research specifically examines the words authors use around named entities in order to test how well these words attribute authorship. / McAnulty College and Graduate School of Liberal Arts / Computational Mathematics / MS / Thesis
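A minimal sketch of the feature in question, the words an author uses around named entities, follows. spaCy stands in here for the Stanford Named Entity Recognizer that the research plugs into JGAAP, and the two-token window and example sentence are illustrative assumptions.

    from collections import Counter
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

    def entity_context_words(text, window=2):
        # Count the words appearing within `window` tokens of any named entity.
        doc = nlp(text)
        counts = Counter()
        for ent in doc.ents:
            left = doc[max(ent.start - window, 0):ent.start]
            right = doc[ent.end:min(ent.end + window, len(doc))]
            for tok in list(left) + list(right):
                if tok.is_alpha:
                    counts[tok.lower_] += 1
        return counts

    print(entity_context_words("When Milton wrote in London, he addressed Parliament directly."))

Counts like these, gathered per candidate author, would then feed an ordinary attribution classifier.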
4

Teorie komunikace jakožto explanatorní princip přirozené víceúrovňové segmentace textů / The Theory of Communication as an Explanatory Principle for the Natural Multilevel Text Segmentation

Milička, Jiří January 2016
1. Phonemes, words, clauses and sentences are not a logical necessity of language, unlike distinctive features and morphemes. 2. Despite this, such nested segmentation is very firmly present in languages and in our concepts of language description, 3. because nested segmentation and inserting redundancy on multiple levels is an efficient way to get the language signal through a burst-noise channel. 4. There are various strategies for how redundancy can be added and what kind of redundancy can be added. 5. The segment delimiter is expressed by some additional information, and the amount of delimiting information is independent of the length of the segment it delimits. This principle can serve as the basis for a successful model of Menzerath's relation.
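For reference, the relation named in the final point is conventionally written as the Menzerath-Altmann law; the formula below is its standard textbook form, not the author's specific delimiter-based model.

    % Menzerath-Altmann relation (standard form): the mean size y of the
    % constituents decreases with the size x of the construct containing them;
    % a, b, c are empirical parameters fitted per language and per level
    % (e.g. mean word length in syllables vs. sentence length in words).
    y(x) = a \, x^{b} \, e^{-c x}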
5

Universality and variability in the statistics of data with fat-tailed distributions: the case of word frequencies in natural languages

Gerlach, Martin 10 March 2016
Natural language is a remarkable example of a complex dynamical system which combines variation and universal structure emerging from the interaction of millions of individuals. Understanding the statistical properties of texts is not only crucial in applications of information retrieval and natural language processing, e.g. search engines, but also allows deeper insights into the organization of knowledge in the form of written text. In this thesis, we investigate the statistical and dynamical processes underlying the co-existence of universality and variability in word statistics. We combine a careful statistical analysis of large empirical databases on language usage with analytical and numerical studies of stochastic models. We find that the fat-tailed distribution of word frequencies is best described by a generalized Zipf's law characterized by two scaling regimes, in which the values of the parameters are extremely robust with respect to time as well as the type and the size of the database under consideration, depending only on the particular language. We provide an interpretation of the two regimes in terms of a distinction of words into a finite core vocabulary and a (virtually) infinite non-core vocabulary. Proposing a simple generative process of language usage, we can establish the connection to the problem of vocabulary growth, i.e. how the number of different words scales with the database size, from which we obtain a unified perspective on different universal scaling laws simultaneously appearing in the statistics of natural language. On the one hand, our stochastic model accurately predicts the expected number of different items as measured in empirical data spanning hundreds of years and 9 orders of magnitude in size, showing that the supposed vocabulary growth over time is mainly driven by database size and not by a change in vocabulary richness. On the other hand, analysis of the variation around the expected size of the vocabulary shows anomalous fluctuation scaling, i.e. the vocabulary is a non-self-averaging quantity, and therefore fluctuations are much larger than expected. We derive how this results from topical variations in a collection of texts coming from different authors, disciplines, or times, which manifest themselves in the form of correlations between the frequencies of different words due to their semantic relations. We explore the consequences of topical variation in applications to language change and topic models, emphasizing the difficulties (and presenting possible solutions) arising from the fact that the statistics of word frequencies are characterized by a fat-tailed distribution. First, we propose an information-theoretic measure based on the Shannon-Gibbs entropy and suitable generalizations quantifying the similarity between different texts, which allows us to determine how fast the vocabulary of a language changes over time. Second, we combine topic models from machine learning with concepts from community detection in complex networks in order to infer large-scale (mesoscopic) structures in a collection of texts. Finally, we study language change of individual words on historical time scales, i.e. how a linguistic innovation spreads through a community of speakers, providing a framework to quantitatively combine microscopic models of language change with empirical data that is only available on a macroscopic level (i.e. averaged over the population of speakers).
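The two empirical quantities at the center of the abstract, the rank-frequency (Zipf) distribution and the growth of the vocabulary with database size, can be computed with a short sketch such as the one below. The whitespace tokenization, the placeholder corpus file, and the sampling step are illustrative assumptions, not the preprocessing used in the thesis.

    from collections import Counter

    def zipf_curve(tokens):
        # (rank, frequency) pairs, most frequent word type first.
        counts = Counter(tokens)
        ordered = sorted(counts.values(), reverse=True)
        return list(enumerate(ordered, start=1))

    def vocabulary_growth(tokens, step=1000):
        # Number of distinct word types seen after every `step` tokens.
        seen, curve = set(), []
        for i, tok in enumerate(tokens, start=1):
            seen.add(tok)
            if i % step == 0:
                curve.append((i, len(seen)))
        return curve

    tokens = open("corpus.txt", encoding="utf-8").read().lower().split()  # placeholder corpus
    print(zipf_curve(tokens)[:5])         # steep head of the rank-frequency distribution
    print(vocabulary_growth(tokens)[:5])  # sublinear growth expected (Heaps-like behavior)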
6

L'attribution du "Roman de Violette" / Attribution of the novel Le Roman de Violette

Petrova, Anastassia 29 April 2015
Attributing a style to an author can be a problematic undertaking. Certain "cases" surprise, such as that of Le Roman de Violette, published under the names of Alexandre Dumas père and Henriette de Mannoury d'Ectot. Beyond the socio-literary question, a stylistic one arises. Three lines of inquiry open up: biographical and historical analysis; stylistic analysis; and the composition and linguistic configuration of the verbal material constituting the "style", to which quantitative analysis contributes. On the basis of the corpus studied (Le Roman de Violette, Les Cousines de la colonelle by Henriette de Mannoury d'Ectot, and Une aventure d'amour by Dumas), the question of the respective roles of Dumas and Mannoury d'Ectot in the creation of Le Roman de Violette is posed. To reach this objective, several philological methods were applied: biographical and contextual inquiry, analysis of biographies, archival documents and correspondence, and linguistic and stylistic analysis coupled with quantitative methods. For the attribution of the work, the thesis combines several statistical approaches. The approach based on the "theory of pattern recognition" developed at the Laboratory of Applied Linguistic Studies of Saint Petersburg State University, applied here for the first time to the French language, proved decisive for the results obtained. It allows us to conclude, from the systematic analysis of syntactic elements, that Le Roman de Violette was written by Mannoury d'Ectot, while the literary and biographical analysis suggests that the idea of the text came from Dumas.
7

Measuring coselectional constraint in learner corpora: A graph-based approach

Shadrova, Anna Valer'evna 24 July 2020
This corpus-linguistic thesis analyzes the acquisition of coselectional constraint in learners of German as a second language in a quasi-longitudinal design based on the Kobalt corpus. Supplemented by a number of statistical analyses, the thesis primarily develops a graph-based analysis making use of Louvain modularity. The graph metric is computed for a range of subcorpora chosen by various criteria, and extensive internal validation is performed through a number of sampling techniques. Results robustly indicate a dependency of modularity on language acquisition progress, higher modularity in L1 vs. L2, lower modularity in Belarusian vs. Chinese learners, and a U-shaped learning development in Belarusian, but not in Chinese, learners. Group differences are discussed from typological, cognitive, cultural-discursive, and register perspectives. Finally, future applications of graph-based modeling in core-linguistic research are outlined. In addition, some gaps in the theoretical discussion of coselection phenomena (phraseology, idiomaticity, collocation) in usage-based linguistics are identified and a multidimensional, functional model is proposed as an alternative.
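To make the central graph metric concrete, the sketch below builds a word co-occurrence graph from a (sub)corpus and computes the modularity of its Louvain partition. networkx, the whitespace tokenization, the five-token window, and the placeholder corpus file are illustrative assumptions rather than the thesis's actual pipeline.

    import networkx as nx  # louvain_communities requires networkx >= 2.8

    def cooccurrence_graph(sentences, window=5):
        # Undirected graph whose edge weights count within-window co-occurrences.
        G = nx.Graph()
        for sent in sentences:
            tokens = sent.lower().split()
            for i, w1 in enumerate(tokens):
                for w2 in tokens[i + 1:i + window]:
                    if w1 != w2:
                        w = G[w1][w2]["weight"] + 1 if G.has_edge(w1, w2) else 1
                        G.add_edge(w1, w2, weight=w)
        return G

    sentences = open("subcorpus.txt", encoding="utf-8").read().splitlines()  # placeholder corpus
    G = cooccurrence_graph(sentences)
    communities = nx.community.louvain_communities(G, weight="weight", seed=0)
    print("Louvain modularity:", nx.community.modularity(G, communities, weight="weight"))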
