Global ETD Search

21	Modeling Alcohol Consumption Using Blog Data
Koh, Kok Chuan. 05 1900
How do the content and writing style of people who drink alcoholic beverages stand out from those of non-drinkers? How much can we learn about a person's alcohol consumption behavior by reading text that they have authored? This thesis attempts to extend the methods used in authorship attribution and authorship profiling research to the automatic identification of the human action of drinking alcoholic beverages. I examine how a psycholinguistic dictionary (the Linguistic Inquiry and Word Count lexicon, developed by James Pennebaker), together with Kenneth Burke's concept of words as symbols of human action and James Wertsch's concept of mediated action, provides a framework for analyzing meaningful data patterns in the content of blogs written by consumers of alcoholic beverages. The contributions of this thesis to the research field are twofold. First, I show that it is possible to automatically identify blog posts whose content relates to the consumption of alcoholic beverages. Second, I provide a framework and tools to model human behavior through text analysis of blog data.
Keywords: Natural language processing; blog data; LIWC; PBAA; linguistics inquiry and word count; profile based authorship attribution; alcohol consumption; word symbols; mediated action
22	Idiolekto požymiai elektroniniuose laiškuose / Features of Idiolect in E-mails
Žalkauskaitė, Gintarė. 18 January 2012
This study aims to establish whether an author's idiolect can be recognized in the language of personal e-mails and to determine which lexical and graphical features can be linked to idiolect. The data come from a 65,000-word corpus of personal, informal e-mails written in Lithuanian by six authors. The WordSmith Tools software was used to generate frequency lists for six subcorpora, each representing one person's language, and the frequency data for the six authors were compared using the contrastive method. Lexical and graphical elements that one person used more or less often than the others, and that were not determined by topic, were linked to that author's idiolect. The analysis yields a classification of lexical and graphical elements that can help in recognizing idiolect. At the lexical level, the main differences between idiolects lie in the use of words expressing modality and stance, and in the words and abbreviations chosen from among possible variants. At the graphical level, idiolects can be recognized from punctuation marks, emoticons and graphic symbols used at different frequencies. Based on the results, the dissertation gives recommendations for experts who carry out forensic linguistic authorship examinations.
Keywords: Philology; Idiolect; Individual writing style; Forensic linguistics; Authorship attribution; Personal electronic mails
23	Features of Idiolect in E-mails / Idiolekto požymiai elektroniniuose laiškuose
Žalkauskaitė, Gintarė. 18 January 2012
This study aims to establish whether an author's idiolect can be recognized in the language of personal e-mails and to determine which lexical and graphical features can be linked to idiolect. The data come from a 65,000-word corpus of personal, informal e-mails written in Lithuanian by six authors. The WordSmith Tools software was used to generate frequency lists for six subcorpora, each representing one person's language, and the frequency data for the six authors were compared using the contrastive method. Lexical and graphical elements that one person used more or less often than the others, and that were not determined by topic, were linked to that author's idiolect. The analysis yields a classification of lexical and graphical elements that can help in recognizing idiolect. At the lexical level, the main differences between idiolects lie in the use of words expressing modality and stance, and in the words and abbreviations chosen from among possible variants. At the graphical level, idiolects can be recognized from punctuation marks, emoticons and graphic symbols used at different frequencies. Based on the results, the dissertation gives recommendations for experts who carry out forensic linguistic authorship examinations.
Keywords: Philology; Idiolect; Individual writing style; Forensic linguistics; Authorship attribution; Personal electronic mails
24	Stylometric Embeddings for Book Similarities / Stilometriska vektorer för likhet mellan böcker
Chen, Beichen. January 2021
Stylometry is the field of research aimed at defining features for quantifying writing style. Its most studied question is authorship attribution: given a set of texts with known authorship, determine the author of a new, unseen document. In this study, a number of lexical and syntactic stylometric feature sets were extracted from two datasets, a smaller one containing 27 books by 25 authors and a larger one containing 11,063 books by 316 authors. Neural networks were used to transform the features into embeddings, after which the nearest-neighbor method was used to attribute each unknown text to the author of its closest neighbor in the embedding space. On the smaller dataset the approach achieved an accuracy of 91.25% using the frequencies of the 50 most common function words, dependency relations, and part-of-speech (POS) tags as features. On the larger dataset it achieved 69.18% accuracy using a similar feature set with the 100 most common function words. Beyond author attribution, a user test showed the model's potential for representing similarity between authors' styles, which could be applied to recommending books to readers based on style.
Keywords: Stylometry; Authorship attribution; Embeddings; Neural networks; Natural language processing; Book recommendations; Computer Sciences
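The nearest-neighbor attribution step described in entry 24 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: it compares raw function-word frequency vectors with cosine similarity instead of learned neural embeddings, and the ten-word `FUNCTION_WORDS` list and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny function-word list (the thesis used the 50 or 100 most common).
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "it", "was", "i"]


def profile(text):
    """Return the relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_author(unknown_text, known_texts):
    """known_texts maps author name -> sample text; return the most similar author."""
    target = profile(unknown_text)
    return max(known_texts, key=lambda a: cosine(target, profile(known_texts[a])))


if __name__ == "__main__":
    samples = {
        "A": "the cat and the dog and the bird of the house",
        "B": "i was in a room that i was to leave",
    }
    print(nearest_author("the fox and the hen of the farm", samples))
```

Replacing `profile` with a learned embedding would bring the sketch closer to the setup the abstract describes; the nearest-neighbor lookup itself would stay the same.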
25	Atribuição automática de autoria de obras da literatura brasileira
Nobre Neto, Francisco Dantas. 19 January 2010
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
Authorship attribution consists of assigning a document of unknown authorship to one of a set of previously selected candidate authors. Knowing the authorship of a text can be useful for detecting plagiarism in a literary work or for properly crediting the author of a book. The most intuitive way for a human to analyze a text is to select some of its characteristics. The study of selecting attributes of a written document, such as average word length and vocabulary richness, is known as stylometry. For a human analyst, establishing the authorship of an unknown text can take months and is tiring work. Some computational tools extract such characteristics from the text and leave the subjective analysis to the researcher. Other computational methods, in addition to extracting attributes, perform the attribution itself based on the characteristics gathered from the text. Techniques such as neural networks, decision trees and classification methods have been applied in this context, with results that make them relevant to the problem. This work presents a data compression method, Prediction by Partial Matching (PPM), as a solution to the authorship attribution problem for Brazilian literary works. The writers and works that make up the author database were selected mainly for their representativeness in the national literature; the availability of the books in electronic format was also considered. PPM identifies authorship without any subjective interference in the text analysis and, unlike other methods, does not rely on attributes extracted from the text. The correct classification rate obtained with PPM in this work was approximately 93%, while related works report rates between 72% and 89%. Authorship attribution was also carried out with an SVM approach, using two groups of attributes, one word-based and the other based on function-word frequency, which yielded correct classification rates of 36.6% and 88.4%, respectively.
Keywords: Authorship Attribution; Prediction by Partial Matching (PPM); Natural Language Processing (NLP); Brazilian literature; stylometry
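The idea behind compression-based attribution in entry 25 can be illustrated with a short sketch. It is only an approximation under stated assumptions: Python's standard library has no PPM implementation, so `zlib` (DEFLATE) stands in for PPM here, and the function names `cross_cost` and `attribute` are hypothetical. The unknown text is assigned to the author whose reference corpus makes it cheapest to compress.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the data after DEFLATE compression at the highest level."""
    return len(zlib.compress(data, 9))


def cross_cost(reference: str, unknown: str) -> int:
    """Extra compressed bytes needed for `unknown` once `reference` has been seen."""
    ref = reference.encode("utf-8")
    both = ref + unknown.encode("utf-8")
    return compressed_size(both) - compressed_size(ref)


def attribute(unknown: str, corpora: dict) -> str:
    """corpora maps author name -> reference text; return the cheapest author."""
    return min(corpora, key=lambda author: cross_cost(corpora[author], unknown))


if __name__ == "__main__":
    corpora = {
        "A": "the quick brown fox jumps over the lazy dog. " * 20,
        "B": "lorem ipsum dolor sit amet consectetur adipiscing elit. " * 20,
    }
    print(attribute("the quick brown fox jumps over the lazy dog. ", corpora))
```

No stylometric attributes are extracted at any point, which matches the abstract's remark that the method does not rely on attributes extracted from the text. A real PPM model predicts each symbol from its preceding context and would be expected to separate authors more sharply than DEFLATE does.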
26	Personal information prediction from written texts
Bibi, Khalil. 03 1900
Authorship attribution (AA) is a field of research that has existed since the 1960s. It consists of identifying the author of a text based on other texts whose authors are known. This is done by extracting features of the writing style and the content of the text. In this master's thesis, two subproblems of AA were addressed: predicting the author's gender and age, using a corpus collected from online blogs. Several textual features were compared using machine learning methods, and deep learning methods were also applied. For the gender classification task, the best results were obtained by a majority vote over the predictions of several classifiers. For the age classification task, the best results were obtained with a classifier trained on TF-IDF features.
Keywords: Authorship attribution; natural language processing; machine learning; deep learning; privacy
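The majority-vote combination mentioned in entry 26 for gender classification can be sketched as below. This is a generic illustration, not the thesis's system: the function names and the example labels are hypothetical, and the individual classifiers are assumed to have already produced one label per text.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label among the classifiers' predictions for one text.

    Ties are broken in favor of the label that appears first in `labels`.
    """
    label, _count = Counter(labels).most_common(1)[0]
    return label


def vote_all(per_model_predictions):
    """per_model_predictions holds one list of labels per classifier, aligned by text."""
    return [majority_vote(votes) for votes in zip(*per_model_predictions)]


if __name__ == "__main__":
    predictions = [
        ["female", "male", "male"],    # hypothetical classifier 1
        ["female", "female", "male"],  # hypothetical classifier 2
        ["male", "female", "male"],    # hypothetical classifier 3
    ]
    print(vote_all(predictions))
```

Using an odd number of classifiers avoids ties in a two-class task such as this one.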
27	Investigating the use of forensic stylistic and stylometric techniques in the analyses of authorship on a publicly accessible social networking site (Facebook)
Michell, Colin Simon. 2013 July 1900
This research study examines the forensic application of a selection of stylistic and stylometric techniques in a simulated authorship attribution case involving texts from the social networking site Facebook. Eight participants each submitted 2,000 words of self-authored text from their personal Facebook messages, and one of them submitted an extra 2,000 words to act as the ‘disputed text’. The texts were analysed first at the 1,000-word level (the first 1,000 words received) and then at the 2,000-word level to determine what effect text length has on the effectiveness of the chosen style markers (keywords, function words, most frequently occurring words, punctuation, use of digitally mediated communication features, and spelling). Although the author of the disputed text was accurately identified at the 1,000-word level, the results were not entirely conclusive; at the 2,000-word level the results were more promising, with certain style markers being particularly effective.
Linguistics / MA (Linguistics)
Keywords: Facebook; Authorship attribution; Style markers; Idiolect; Forensic linguistics; Forensic stylistics; Stylometrics; WordSmith Tools; 363.2565; Language and the Internet; Internet -- Law and legislation; Computer crimes -- Investigation; Authorship -- Identification
28	Investigating the use of forensic stylistic and stylometric techniques in the analyses of authorship on a publicly accessible social networking site (Facebook)
Michell, Colin Simon. 07 1900
This research study examines the forensic application of a selection of stylistic and stylometric techniques in a simulated authorship attribution case involving texts from the social networking site Facebook. Eight participants each submitted 2,000 words of self-authored text from their personal Facebook messages, and one of them submitted an extra 2,000 words to act as the ‘disputed text’. The texts were analysed first at the 1,000-word level (the first 1,000 words received) and then at the 2,000-word level to determine what effect text length has on the effectiveness of the chosen style markers (keywords, function words, most frequently occurring words, punctuation, use of digitally mediated communication features, and spelling). Although the author of the disputed text was accurately identified at the 1,000-word level, the results were not entirely conclusive; at the 2,000-word level the results were more promising, with certain style markers being particularly effective.
Linguistics and Modern Languages / M.A. (Linguistics)
Keywords: Facebook; Authorship attribution; Style markers; Idiolect; Forensic linguistics; Forensic stylistics; Stylometrics; WordSmith Tools; 363.2565; Language and the Internet; Internet -- Law and legislation; Computer crimes -- Investigation; Authorship -- Identification
29	Stylometry: Quantifying Classic Literature For Authorship Attribution: A Machine Learning Approach
Yousif, Jacob; Scarano, Donato. January 2024
Classic literature is rich linguistically, historically, and culturally, making it valuable for future studies. This project selected a set of 48 classic books for stylometric analysis, adopting an approach from related work: divide the books into text segments, quantify the resulting segments, and analyze the books using the quantified values to understand their linguistic attributes. In addition, the project carried out classification tasks with two further objectives. First, the quantified values of the text segments were used for classification with models such as LightGBM and TabNet to assess how well this approach works for authorship attribution. Second, a state-of-the-art model, RoBERTa, was applied directly to the segmented texts to evaluate its performance in authorship attribution. The results uncovered the characteristics of the books to a reasonable degree. For the authorship attribution tasks, the results suggest that segmenting and quantifying text through stylometric analysis, combined with supervised machine learning algorithms, is practical, although the approach may still require further improvement to achieve optimal performance. Lastly, RoBERTa demonstrated high performance in authorship attribution tasks.
Keywords: Authorship Attribution; Classic Literature Analysis; Clustering; Data Science; Deep Learning; Feature Engineering; Feature Extraction; Gradient Descent; K-Means; LightGBM; Machine Learning; Multiclass Classification; NLP; Neural Network; RoBERTa; Stylometric Analysis; Stylometry; TabNet; t-SNE; Text Mining; Transformer Models; Computer Sciences; Computer and Information Sciences
