Summary: Universal Language Model Fine-Tuning
10 Jun 2018
In early 2018, Jeremy Howard and Sebastian Ruder came up with one of the first methods to significantly improve the performance of deep learning on NLP tasks by leveraging transfer learning. Their findings were reported in their paper, Universal Language Model Fine-tuning for Text Classification (ULMFiT).
The idea behind transfer learning is simple: [Peter Martigny, 2018]
- The hidden layers of a model learn general knowledge, so the model can be used as one big featurizer.
- We can download a pre-trained model and remove the last layer of the network (the fully-connected layer, which projects the features).
- We can then replace it with a classifier of our choice, tailored to our task (say, a binary classifier to label a sentence as positive or negative).
- Thus, we only need to train the classification layer.
- The data we use may be different from the data that was used for pre-training.
- So we do a fine-tuning step, where we train all layers for a much shorter amount of time.
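The recipe above can be sketched with a toy model (this is a hypothetical NumPy illustration, not the actual ULMFiT code): the pre-trained layers are kept frozen and used as a featurizer, and only a new task-specific classification head is trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend this weight matrix was learned during pre-training.
W_hidden = rng.normal(size=(10, 32))  # frozen hidden layer (10 -> 32)

def featurize(x):
    """The frozen pre-trained layers, used as one big featurizer."""
    return np.tanh(x @ W_hidden)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_head(X, y, lr=0.1, epochs=300):
    """Train only the new binary classification head (logistic regression)."""
    w = np.zeros(W_hidden.shape[1])       # fresh head, trained from scratch
    feats = featurize(X)                  # frozen layers: computed once
    for _ in range(epochs):
        p = sigmoid(feats @ w)
        w -= lr * feats.T @ (p - y) / len(y)  # gradient of the logistic loss
    return w

# Toy binary task: the label is the sign of the first input feature.
X = rng.normal(size=(200, 10))
y = (X[:, 0] > 0).astype(float)
w_head = train_head(X, y)
accuracy = ((sigmoid(featurize(X) @ w_head) > 0.5) == y).mean()
```

Because the featurizer is frozen, its activations can be computed once and reused every epoch, which is part of what makes this setup so much cheaper than training end to end.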
Pre-paper era
Common Technologies
- Traditional approaches such as one-hot encoding and bag-of-words do not capture information about a word's meaning or context.
- Word vectors were used as the de facto representation technique in NLP.
- Word2vec became popular as an approximation to language modeling, followed by others such as GloVe.
- Transfer learning had meanwhile become commonplace in other domains such as computer vision [ImageNet].
Limitations
- Word vectors used previously learned knowledge only in the first layer of the model; the rest of the network still needed to be trained from scratch.
- Since language can be quite ambiguous, a huge amount of labelled data, and the capacity to process it, was required.
- It is not the idea of LM fine-tuning but our lack of knowledge of how to train language models effectively that has been hindering wider adoption.
- LMs overfit to small datasets and suffered catastrophic forgetting when fine-tuned with a classifier.
The paper
Concept in brief
The paper proposes using an AWD-LSTM model for transfer learning from a LM [source] task to the classification [target] task in the following manner:
- Pre-train on a general language modeling (LM) task:
  - Wikitext-103 was used for this purpose [Merity et al., 2017b]. It consists of 28,595 preprocessed English Wikipedia articles and 103 million words.
  - This step is performed only once, and improves the performance and convergence of downstream models.
- Fine-tune the LM on the target task:
  - Allows training a robust LM even for small datasets.
  - Different learning rates are used for different layers [discriminative fine-tuning].
  - Learning rates are first increased linearly, then decreased gradually after a cut [slanted triangular learning rates].
- Fine-tune for text classification:
  - The LM is gradually unfrozen, starting from the last layer.
  - Variable-length backpropagation sequences are used [Merity et al., 2017a].
  - The classifier is fine-tuned for forward and backward language models, and the average is used as the final output.
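The two learning-rate tricks used during fine-tuning follow closed-form rules from the paper, sketched here in plain Python (function and variable names are mine, not the paper's):

```python
def slanted_triangular_lr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular schedule: a short linear warm-up to lr_max,
    then a long linear decay back towards lr_max / ratio.
    t: current iteration, T: total iterations."""
    cut = int(T * cut_frac)               # iteration at which lr peaks
    if t < cut:
        p = t / cut                                      # increasing phase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))   # decay phase
    return lr_max * (1 + p * (ratio - 1)) / ratio

def discriminative_lrs(top_lr, n_layers, decay=2.6):
    """Discriminative fine-tuning: each lower layer uses the learning
    rate of the layer above it divided by 2.6 (the paper's value).
    Returns rates ordered from the first (lowest) to the last layer."""
    lrs = [top_lr]
    for _ in range(n_layers - 1):
        lrs.append(lrs[-1] / decay)
    return lrs[::-1]

# Example: the peak rate on a 100-iteration schedule, spread over 4 layers.
peak = slanted_triangular_lr(10, T=100)   # the cut falls at iteration 10
per_layer = discriminative_lrs(peak, n_layers=4)
```

Lower layers capture more general knowledge, which is why they are given the smaller learning rates and are unfrozen last.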
Experiments and Results
- ULMFiT with 100 labelled examples matches training from scratch with up to 20k examples [IMDb sentiment analysis].
- The accuracy of ULMFiT with 1,000 examples matches the accuracy of a FastText model trained from scratch on the full dataset [Kaggle project].
- Combining the predictions of a forward and a backward LM-classifier results in a boost of around 0.5–0.7.
- Even the regular LM reaches good performance on larger datasets.
- ‘Discr’ and ‘Stlr’ improve performance across all three datasets and are necessary on TREC-6.
- ULMFiT is the only method that shows excellent performance universally.
- Performance remains similar or improves until late epochs.
ULMFiT obtained reductions in error rates over the state-of-the-art as follows:

| IMDb | DBpedia | Yelp-bi | Yelp-full |
| ---- | ------- | ------- | --------- |
| 22%  | 4.8%    | 18.2%   | 2.0%      |

Note: 10% of the training set was used and error rates with unidirectional LMs were reported. The classifier was fine-tuned for 50 epochs.
After the paper
With the emergence of better language models, we will be able to further improve our model's performance.
The success achieved by ULMFiT has spurred interest in applying transfer learning to NLP. In just the few months after the paper, several frameworks were proposed (especially fine-tuned language models). A few of the popular ones are:
- Generative Pre-training [Radford et al., 2018]:
  - The input is transformed differently during pre-training, depending on the task to be performed.
  - Uses a Transformer network instead of the LSTM used here.
- Deep contextualized word representations [Peters et al., 2018]:
  - Word embeddings should incorporate word-level characteristics as well as contextual semantics.
  - Instead of just using the final layer, the vectors of all internal states are weighted and combined to form the final embeddings.
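That last point can be sketched numerically: rather than taking only the top layer's hidden state, the states of all layers are mixed using softmax-normalised scalar weights plus a global scale (a hypothetical toy illustration; the scalar values below are made up, in the real model they are learned per task).

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d = 3, 8
layer_states = rng.normal(size=(n_layers, d))   # hidden states for one token

s_raw = np.array([0.2, -0.1, 0.5])         # per-layer scalars (toy values)
s = np.exp(s_raw) / np.exp(s_raw).sum()    # softmax over layers
gamma = 1.0                                # task-specific global scale

# The weighted sum over layers is the token's final contextual embedding.
embedding = gamma * (s[:, None] * layer_states).sum(axis=0)
```

Because the mixing weights are learned per downstream task, each task can emphasise whichever layers carry the most useful information for it.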