Modeling natural language is among the fundamental challenges of artificial intelligence and the design of interactive machines, with applications spanning domains such as dialogue systems, text generation, and machine translation. We propose a discriminatively trained log-linear model to learn the distribution of words following a given context. Because of data sparsity, the model must be appropriately regularized with a penalty term. We design a penalty term that encodes the structure of the feature space to avoid overfitting and improve generalization while capturing long-range dependencies. Properties of specific structured penalties can further be exploited to reduce the number of parameters required to encode the model. The outcome is an efficient model that captures long-range dependencies in language without a significant increase in time or space requirements.

In a log-linear model, both training and testing become increasingly expensive as the number of classes grows. In a language model, the number of classes is the size of the vocabulary, which is typically very large. A common trick is to cluster the classes and apply the model in two steps: the first step picks the most probable cluster, and the second picks the most probable word from the chosen cluster. This idea generalizes to a deeper hierarchy with multiple levels of clustering. However, the performance of the resulting hierarchical classifier depends on how well the clustering suits the problem. We study different strategies for building the hierarchy of categories from observations.
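As a rough illustration of the two-step trick described above, the sketch below factors word prediction into a cluster choice followed by a word choice within that cluster. The vocabulary, cluster assignments, and randomly initialized weights are illustrative assumptions for a toy example, not the models or training procedure developed in the thesis.

import numpy as np

# Minimal sketch of two-level hierarchical prediction over a toy vocabulary.
# Cluster assignments and weight matrices are placeholders, not learned parameters.

rng = np.random.default_rng(0)

vocab = ["the", "cat", "sat", "dog", "ran", "on"]
clusters = {0: [0, 1, 2], 1: [3, 4, 5]}   # word indices grouped into two clusters
dim = 8                                    # dimension of the context feature vector

W_cluster = rng.normal(size=(len(clusters), dim))                     # cluster-level weights
W_word = {c: rng.normal(size=(len(ws), dim)) for c, ws in clusters.items()}  # per-cluster word weights

def softmax(scores):
    scores = scores - scores.max()
    e = np.exp(scores)
    return e / e.sum()

def predict(context_features):
    # Step 1: pick the most probable cluster given the context.
    p_cluster = softmax(W_cluster @ context_features)
    c = int(p_cluster.argmax())
    # Step 2: pick the most probable word within the chosen cluster.
    p_word = softmax(W_word[c] @ context_features)
    i = int(p_word.argmax())
    return vocab[clusters[c][i]], float(p_cluster[c] * p_word[i])

print(predict(rng.normal(size=dim)))

With a balanced hierarchy, each prediction scores only one cluster distribution and one within-cluster distribution, rather than the full vocabulary, which is where the computational saving comes from.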
Identifier | oai:union.ndltd.org:CCSD/oai:tel.archives-ouvertes.fr:tel-01001634 |
Date | 11 February 2014 |
Creators | Nelakanti, Anil Kumar |
Publisher | Université Pierre et Marie Curie - Paris VI |
Source Sets | CCSD theses-EN-ligne, France |
Language | English |
Detected Language | English |
Type | PhD thesis |