Data Segmentation Using NLP: Gender and Age

Natural language processing (NLP) enables a computer to read, decipher, and interpret human languages, and ultimately to use them in ways that further the understanding of interaction and communication between humans and computers. When appropriate data is available, NLP makes it possible to determine not only the sentiment of a text but also information about the author of an online post. Previous studies suggest that NLP can go deeper into such subjective information, enabling author classification from text data. This thesis addresses the lack of demographic insight into online user data by studying language use in texts. It compares four popular yet diverse machine learning algorithms for gender and age segmentation. During the project, the age analysis was abandoned due to insufficient data. The online texts were analysed and quantified into 118 parameters based on linguistic differences. Using supervised learning, the researchers correctly predicted gender in 82% of cases when analysing data from English online users. It is important to note that the training and test data may be correlated. Language is complex and, in this case, the more complex methods, SVM and neural networks, performed better than the less complex Naive Bayes and logistic regression.
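As a rough illustration of what quantifying a text into numeric parameters can look like, the following minimal Python sketch computes a handful of stylistic features. The feature names, word lists, and the function `extract_features` are hypothetical and chosen only for illustration; this record does not list the 118 parameters used in the thesis, and the sketch does not reproduce them.

```python
import re
from collections import Counter

# Hypothetical word lists for illustration only; the thesis's actual
# 118 parameters are not given in this record.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
ARTICLES = {"a", "an", "the"}


def extract_features(text: str) -> dict[str, float]:
    """Quantify a text into a few numeric stylistic features."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = len(tokens) or 1  # avoid division by zero on empty input
    counts = Counter(tokens)
    return {
        "token_count": float(len(tokens)),
        "avg_word_length": sum(len(t) for t in tokens) / n,
        "first_person_ratio": sum(counts[w] for w in FIRST_PERSON) / n,
        "article_ratio": sum(counts[w] for w in ARTICLES) / n,
        "exclamation_ratio": text.count("!") / max(len(text), 1),
    }


features = extract_features("I love my new bike! The ride was great.")
for name, value in features.items():
    print(f"{name}: {value:.3f}")
```

A feature vector of this kind, computed per text, is the sort of input a supervised classifier such as Naive Bayes, logistic regression, an SVM, or a neural network could be trained on.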

Identifier oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-434622
Date January 2021
Creators Demmelmaier, Gustav, Westerberg, Carl
Publisher Uppsala universitet, Avdelningen för datalogi
Source Sets DiVA Archive at Uppsala University
Language English
Detected Language English
Type Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format application/pdf
Rights info:eu-repo/semantics/openAccess
Relation UPTEC STS, 1650-8319 ; 21001
