Global ETD Search

Data Segmentation Using NLP: Gender and Age

Natural language processing (NLP) makes it possible for a computer to read, decipher, and interpret human languages, and ultimately to use them in ways that further the understanding of interaction and communication between humans and computers. When appropriate data is available, NLP can determine not only the sentiment of a text but also information about the author of an online post. Previous studies indicate that NLP can go deeper into subjective information, enabling author classification from text data. This thesis addresses the lack of demographic insight into online user data by studying language use in texts. It compares four popular yet diverse machine learning algorithms for gender and age segmentation. During the project, the age analysis was abandoned due to insufficient data. The online texts were analysed and quantified into 118 parameters based on linguistic differences. Using supervised learning, the researchers correctly predicted gender in 82% of cases when analysing data from English online users. It is important to note that the training and test data may be correlated. Language is complex, and in this case the more complex methods, SVM and neural networks, performed better than the less complex Naive Bayes and logistic regression.
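As a rough illustration of what quantifying a text into linguistic parameters can look like, the sketch below computes a handful of simple stylometric features in plain Python. The 118 parameters used in the thesis are not listed in this record, so the feature names, the word lists, and the function `extract_features` are all hypothetical stand-ins chosen for illustration, not the authors' method.

```python
import re

# Hypothetical word lists: the thesis's actual 118 parameters are not given
# in this record, so these are illustrative stand-ins only.
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}
ARTICLES = {"a", "an", "the"}


def extract_features(text: str) -> dict:
    """Quantify a text into a few illustrative linguistic parameters."""
    words = re.findall(r"[a-z']+", text.lower())
    n = len(words) or 1  # avoid division by zero on empty input
    # Split on runs of sentence-ending punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "word_count": float(len(words)),
        "avg_word_length": sum(len(w) for w in words) / n,
        "first_person_ratio": sum(w in FIRST_PERSON for w in words) / n,
        "article_ratio": sum(w in ARTICLES for w in words) / n,
        "exclamation_ratio": text.count("!") / n,
        "avg_sentence_length": len(words) / (len(sentences) or 1),
    }


features = extract_features("I love this! We saw the film twice.")
for name, value in features.items():
    print(f"{name}: {value}")
```

A feature dictionary of this kind, computed per text, could then be turned into a fixed-length numeric vector and passed to a supervised classifier such as the four compared in the thesis.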

http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-434622

Natural language processing

Computer and Information Sciences

Data- och informationsvetenskap

Identifier	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-434622
Date	January 2021
Creators	Demmelmaier, Gustav; Westerberg, Carl
Publisher	Uppsala universitet, Avdelningen för datalogi
Source Sets	DiVA Archive at Uppsala University
Language	English
Detected Language	English
Type	Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess
Relation	UPTEC STS, 1650-8319 ; 21001

