Language Identification on Short Textual Data

Language identification is the task of automatically detecting the language(s) in which a given text or document is written, and it is the first step for further natural language processing tasks. The task has been studied for decades, but most work has focused on long texts rather than short ones, which have proved more challenging because they carry less syntactic and semantic information. In this work, we present approaches to the problem based on deep learning techniques, traditional methods, and their combination. The proposed ensemble model, composed of a learning-based method and a dictionary-based method, achieves 89.6% accuracy on our newly generated gold test set, surpassing the Google Translate API by 3.7% and the industry-leading tool Langid.py by 26.1%. / Thesis / Master of Applied Science (MASc)
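As a rough illustration of how a learning-based method and a dictionary-based method might be combined, the following minimal sketch blends the two scores with a fixed weight. It is not the thesis's actual model: the word lists, the `weight` parameter, and the `model_probs` input (standing in for the output of some trained classifier) are all hypothetical and chosen only for illustration.

```python
# Hypothetical toy word lists; a real dictionary-based method would
# rely on much larger lexicons.
WORDLISTS = {
    "en": {"the", "and", "is", "you", "thank"},
    "fr": {"le", "et", "est", "vous", "merci"},
    "es": {"el", "y", "es", "usted", "gracias"},
}


def dictionary_scores(text):
    """Return, per language, the fraction of tokens found in its word list."""
    tokens = text.lower().split()
    if not tokens:
        return {lang: 0.0 for lang in WORDLISTS}
    return {
        lang: sum(token in words for token in tokens) / len(tokens)
        for lang, words in WORDLISTS.items()
    }


def ensemble_predict(text, model_probs, weight=0.5):
    """Blend classifier probabilities with dictionary scores.

    model_probs is assumed to map language codes to probabilities
    produced by a separate learning-based classifier (not shown here).
    weight is the assumed share given to the classifier's output.
    """
    dict_scores = dictionary_scores(text)
    combined = {
        lang: weight * model_probs.get(lang, 0.0)
        + (1 - weight) * dict_scores[lang]
        for lang in WORDLISTS
    }
    # Pick the language with the highest blended score.
    return max(combined, key=combined.get)
```

In this sketch, a short input such as "merci beaucoup" can be labeled French even when the assumed classifier slightly prefers English, because the dictionary match shifts the blended score. This reflects the general motivation for ensembles on short text, where a single method has little evidence to work with.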

http://hdl.handle.net/11375/25126

Natural Language Processing

Language identification

Textual data

Identifier	oai:union.ndltd.org:mcmaster.ca/oai:macsphere.mcmaster.ca:11375/25126
Date	January 2020
Creators	Cui, Yexin
Contributors	Chen, Jun, Electrical and Computer Engineering
Source Sets	McMaster University
Language	English
Detected Language	English
Type	Thesis
