Summary: Universal Language Model Fine-Tuning
10 Jun 2018
In early 2018, Jeremy Howard and Sebastian Ruder came up with one of the first methods to significantly improve the performance of deep learning on NLP tasks by leveraging transfer learning. Their findings were reported in their paper, Universal Language Model Fine-tuning for Text Classification (ULMFiT).
The idea behind transfer learning is simple: [Peter Martigny, 2018]
- The hidden layers of a model learn general knowledge, so the model can be used as one big featurizer.
- We can download a pre-trained model and remove the last layer of the network (the fully-connected layer, which projects the features).
- We can then replace it with a classifier of our choice, tailored to our task (say, a binary classifier to label a sentence as positive or negative).
- Thus, we only need to train the classification layer.
- The data we use may be different from the data that was used for pre-training.
- So we do a fine-tuning step, where we train all layers for a much shorter amount of time.
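The recipe above can be sketched with a toy model (this is a hypothetical NumPy illustration, not the actual ULMFiT code): the pre-trained layers are kept frozen and used as a featurizer, and only a new task-specific classification head is trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend this weight matrix was learned during pre-training.
W_hidden = rng.normal(size=(10, 32))  # frozen hidden layer (10 -> 32)

def featurize(x):
    """The frozen pre-trained layers, used as one big featurizer."""
    return np.tanh(x @ W_hidden)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_head(X, y, lr=0.1, epochs=300):
    """Train only the new binary classification head (logistic regression)."""
    w = np.zeros(W_hidden.shape[1])       # fresh head, trained from scratch
    feats = featurize(X)                  # frozen layers: computed once
    for _ in range(epochs):
        p = sigmoid(feats @ w)
        w -= lr * feats.T @ (p - y) / len(y)  # gradient of the logistic loss
    return w

# Toy binary task: the label is the sign of the first input feature.
X = rng.normal(size=(200, 10))
y = (X[:, 0] > 0).astype(float)
w_head = train_head(X, y)
accuracy = ((sigmoid(featurize(X) @ w_head) > 0.5) == y).mean()
```

Because the featurizer is frozen, its activations can be computed once and reused every epoch, which is part of what makes this setup so much cheaper than training end to end.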
Pre-paper era
Common Technologies
- Traditional approaches such as one-hot encoding and bag-of-words do not capture information about a word's meaning or context.
- Word vectors were used as the de facto representation technique in NLP.
- Word2vec became popular as an approximation to language modeling, followed by others such as GloVe.
- Transfer learning had meanwhile become commonplace in other domains such as computer vision [ImageNet].
Limitations
- Word vectors used previously learned knowledge only in the first layer of the model; the rest of the network still needed to be trained from scratch.
- Since language can be quite ambiguous, a huge amount of labelled data, and the capacity to process it, was required.
- It is not the idea of LM fine-tuning but our lack of knowledge of how to train language models effectively that has been hindering wider adoption.
- LMs overfit to small datasets and suffered catastrophic forgetting when fine-tuned with a classifier.
The paper
Concept in brief
The paper proposes using an AWD-LSTM model for transfer learning from a LM [source] task to the classification [target] task in the following manner:
- Pre-train on a general language modeling (LM) task:
  - Wikitext-103 was used for this purpose [Merity et al., 2017b]. It consists of 28,595 preprocessed English Wikipedia articles and 103 million words.
  - This step is performed only once, and improves the performance and convergence of downstream models.
- Fine-tune the LM on the target task:
  - Allows training a robust LM even for small datasets.
  - Different learning rates are used for different layers [discriminative fine-tuning].
  - Learning rates are first increased linearly, then decreased gradually after a cut [slanted triangular learning rates].
- Fine-tune for text classification:
  - The LM is gradually unfrozen, starting from the last layer.
  - Variable-length backpropagation sequences are used [Merity et al., 2017a].
  - The classifier is fine-tuned for forward and backward language models, and the average is used as the final output.
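The two learning-rate tricks used during fine-tuning follow closed-form rules from the paper, sketched here in plain Python (function and variable names are mine, not the paper's):

```python
def slanted_triangular_lr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular schedule: a short linear warm-up to lr_max,
    then a long linear decay back towards lr_max / ratio.
    t: current iteration, T: total iterations."""
    cut = int(T * cut_frac)               # iteration at which lr peaks
    if t < cut:
        p = t / cut                                      # increasing phase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))   # decay phase
    return lr_max * (1 + p * (ratio - 1)) / ratio

def discriminative_lrs(top_lr, n_layers, decay=2.6):
    """Discriminative fine-tuning: each lower layer uses the learning
    rate of the layer above it divided by 2.6 (the paper's value).
    Returns rates ordered from the first (lowest) to the last layer."""
    lrs = [top_lr]
    for _ in range(n_layers - 1):
        lrs.append(lrs[-1] / decay)
    return lrs[::-1]

# Example: the peak rate on a 100-iteration schedule, spread over 4 layers.
peak = slanted_triangular_lr(10, T=100)   # the cut falls at iteration 10
per_layer = discriminative_lrs(peak, n_layers=4)
```

Lower layers capture more general knowledge, which is why they are given the smaller learning rates and are unfrozen last.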
Experiments and Results
- ULMFiT with 100 labelled examples matches training from scratch with up to 20k examples [IMDb sentiment analysis].
- The accuracy of ULMFiT with 1,000 examples matches the accuracy of a FastText model trained from scratch on the full dataset [Kaggle project].
- Combining the predictions of a forward and a backward LM-classifier results in a boost of around 0.5–0.7.
- Even the regular LM reaches good performance on larger datasets.
- ‘Discr’ and ‘Stlr’ improve performance across all three datasets and are necessary on TREC-6.
- ULMFiT is the only method that shows excellent performance universally.
- Performance remains similar or improves until late epochs.
ULMFiT obtained reductions in error rates over the state-of-the-art as follows:

| IMDb | DBpedia | Yelp-bi | Yelp-full |
| ---- | ------- | ------- | --------- |
| 22%  | 4.8%    | 18.2%   | 2.0%      |

Note: 10% of the training set was used and error rates with unidirectional LMs were reported. The classifier was fine-tuned for 50 epochs.
After the paper
With the emergence of better language models, we will be able to further improve our model's performance.
The success achieved by ULMFiT has spurred interest in applying transfer learning to NLP. In just the few months after the paper, several frameworks were proposed (especially fine-tuned language models). A few of the popular ones are:
- Generative Pre-training [Radford et al., 2018]:
  - The input is transformed differently during pre-training, depending on the task to be performed.
  - Uses a Transformer network instead of the LSTM used here.
- Deep contextualized word representations [Peters et al., 2018]:
  - Word embeddings should incorporate word-level characteristics as well as contextual semantics.
  - Instead of just using the final layer, the vectors of all internal states are weighted and combined to form the final embeddings.
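That last point can be sketched numerically: rather than taking only the top layer's hidden state, the states of all layers are mixed using softmax-normalised scalar weights plus a global scale (a hypothetical toy illustration; the scalar values below are made up, in the real model they are learned per task).

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d = 3, 8
layer_states = rng.normal(size=(n_layers, d))   # hidden states for one token

s_raw = np.array([0.2, -0.1, 0.5])         # per-layer scalars (toy values)
s = np.exp(s_raw) / np.exp(s_raw).sum()    # softmax over layers
gamma = 1.0                                # task-specific global scale

# The weighted sum over layers is the token's final contextual embedding.
embedding = gamma * (s[:, None] * layer_states).sum(axis=0)
```

Because the mixing weights are learned per downstream task, each task can emphasise whichever layers carry the most useful information for it.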