SarcDetect: Sarcasm detection with transfer learning

ULMFiT is a great technique for applying transfer learning in NLP. It has been shown to achieve strong results with small amounts of data. So, when I came across the SemEval2018 task on irony detection, I wanted to put it to the test.


Usually in NLP, we use word embeddings, which are vectors of floats that represent individual words. In ULMFiT, we instead start from a pretrained language model and fine-tune it for our task. A language model not only has word embeddings, but has also been trained to build a representation of full sentences and documents.
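
To see why a sentence-level representation matters for a task like irony detection, here is a small sketch. The two-dimensional vectors are invented for illustration (real embeddings are learned and much larger), and average_embedding is a hypothetical helper, not part of ULMFiT. Averaging word vectors ignores word order, so two sentences built from the same words collapse to the same representation, while a language model reads the tokens in sequence and can, in principle, tell them apart.

```python
# Toy word embeddings: the values are made up for illustration only.
EMBEDDINGS = {
    "great": [1.0, 0.25],
    "not": [-0.5, 0.5],
    "bad": [-1.0, 0.25],
    "just": [0.0, 0.25],
}

def average_embedding(sentence):
    """Represent a sentence as the mean of its word vectors (word order is lost)."""
    vectors = [EMBEDDINGS[word] for word in sentence.split()]
    dims = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dims)]

a = average_embedding("not bad just great")
b = average_embedding("not great just bad")
print(a == b)  # True: both sentences get the same vector
```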

Pre-processing

We clean up the data a little by removing user mentions, links and numbers from each tweet. I also pulled out the hashtags and segmented them into words, to better capture their correlation with the tweet. Here is the code I used:

from nltk.tokenize import TweetTokenizer
import re
import wordsegment as ws
ws.load()

def preprocess(tweet):
  tknzr = TweetTokenizer(reduce_len=True, strip_handles=True)
  tokens = tknzr.tokenize(tweet)

  tweet = " ".join(tokens)  # we need to ensure that all the tokens are separated using the space

  tweet = tweet.lower()
  tweet = re.sub(r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', '', tweet)  # URLs
  tweet = re.sub(r'(?:@[\w_]+)', '', tweet)  # user-mentions
  tweet = re.sub(r'(?:\#+[\w_]+[\w\'_\-]*[\w_]+)', '', tweet)  # hashtags
  tweet = re.sub(r'(?:(?:\d+,?)+(?:\.?\d+)?)', '', tweet)  # numbers

  return tweet

def segmentHashes(tweet):
  tags = " ".join(re.findall(r"#(\w+)", tweet))
  return " ".join(ws.segment(tags)).lower()
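
To make the effect of the cleanup concrete, here is a simplified sketch that uses only the standard library. It skips TweetTokenizer and wordsegment and uses shorter regular expressions, so the helper names strip_noise and extract_hashtags and the sample tweet are illustrative assumptions, not the code used for the experiments.

```python
import re

def strip_noise(tweet):
    """Lowercase a tweet and drop URLs, user mentions, hashtags and numbers."""
    tweet = tweet.lower()
    tweet = re.sub(r"https?://\S+", "", tweet)      # URLs (simplified pattern)
    tweet = re.sub(r"@\w+", "", tweet)              # user mentions
    tweet = re.sub(r"#\w+", "", tweet)              # hashtags
    tweet = re.sub(r"\d+(?:[.,]\d+)*", "", tweet)   # numbers
    return " ".join(tweet.split())                  # collapse leftover whitespace

def extract_hashtags(tweet):
    """Return the hashtag bodies (without '#'), lowercased."""
    return [tag.lower() for tag in re.findall(r"#(\w+)", tweet)]

sample = "@bob Love waiting 45 minutes for the bus #NotReally http://t.co/abc"
print(strip_noise(sample))       # love waiting minutes for the bus
print(extract_hashtags(sample))  # ['notreally']
```

The pipeline above additionally passes the extracted tags through wordsegment, which aims to split a run-together tag such as notreally into separate words.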
  

Preparing the data

Let’s get the training, validation and test data into the necessary format. We split the labelled data 70-30 into training and validation sets.

from fastai.text import *
from sklearn.model_selection import train_test_split
import wordsegment as ws

path = ''
ws.load()

df = pd.read_csv(path + 'SemEval2018-Task3-master/datasets/train/SemEval2018-T3-train-taskB_emoji_ironyHashtags.txt', delimiter='\t')
df['hashes'] = df.apply(lambda row: segmentHashes(row['Tweet text']), axis=1)
df['Tweet text'] = df.apply(lambda row: preprocess(row['Tweet text']), axis=1)

train_df, valid_df = train_test_split(df, test_size=0.3)  # keep 70% for training, hold out 30% for validation

test_df = pd.read_csv(path + 'SemEval2018-Task3-master/datasets/goldtest_TaskB/SemEval2018-T3_gold_test_taskB_emoji.txt', delimiter='\t')
test_df['hashes'] = test_df.apply(lambda row: segmentHashes(row['Tweet text']), axis=1)
test_df['Tweet text'] = test_df.apply(lambda row: preprocess(row['Tweet text']), axis=1)
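
Conceptually, the split shuffles the rows and holds out a fixed fraction for validation. The sketch below shows that idea with the standard library only; split_train_valid, its seed and the ten stand-in rows are hypothetical and are not how scikit-learn implements train_test_split.

```python
import random

def split_train_valid(rows, valid_fraction=0.3, seed=42):
    """Shuffle rows reproducibly and hold out valid_fraction of them for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_valid = round(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]

rows = list(range(10))  # stand-in for ten labelled tweets
train_rows, valid_rows = split_train_valid(rows)
print(len(train_rows), len(valid_rows))  # 7 3
```

With scikit-learn itself, passing random_state to train_test_split makes the split reproducible in the same way the seed does here.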

Data sample

Building our classifier

Fine-tune the language model

We first get a pre-trained model to use as our base. Since we are classifying English tweets, the WT103 model (a language model pretrained on WikiText-103, available from fastai) works fine for our case. Fine-tuning it for tweets is as simple as:

# Language model data
data_lm = TextLMDataBunch.from_df(
    path = 'salil',
    train_df = train_df,
    valid_df = valid_df,
    test_df = test_df,
    label_cols = ['Label'],
    text_cols = ['Tweet text', 'hashes']
)
learn = language_model_learner(data_lm, pretrained_model=URLs.WT103_1, drop_mult=0.7, callback_fns=ShowGraph)
learn.fit_one_cycle(1, 1e-2)

Loss graph

We unfreeze the model and fine-tune it:

learn.unfreeze()
learn.fit_one_cycle(1, 1e-3)
learn.save_encoder('tweet_enc')  # save the fine-tuned encoder so the classifier can load it later
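
fit_one_cycle trains with a one-cycle learning rate policy: the rate warms up to the maximum we pass in and then anneals back down. The sketch below is a simplified version of that idea, not fastai's implementation; the function one_cycle_lr and the constants pct_start, div_factor and final_div are assumptions chosen for illustration.

```python
import math

def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3, div_factor=25.0, final_div=1e4):
    """Learning rate at `step` for a simplified one-cycle schedule.

    Warm up from max_lr / div_factor to max_lr over the first pct_start of
    training, then anneal down to max_lr / final_div, both along a cosine curve.
    """
    def cosine(start, end, pct):
        return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)

    warmup_steps = int(total_steps * pct_start)
    if step < warmup_steps:
        return cosine(max_lr / div_factor, max_lr, step / warmup_steps)
    pct = (step - warmup_steps) / max(1, total_steps - warmup_steps - 1)
    return cosine(max_lr, max_lr / final_div, pct)

# 100 training steps with the same maximum rate as the first call above (1e-2).
schedule = [one_cycle_lr(s, 100, 1e-2) for s in range(100)]
print(schedule.index(max(schedule)))  # 30: the peak sits at the end of the warm-up
```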

Loss graph

We can test our language model by having it generate some tweets for us:

Example prediction

Fine-tune for classification

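Now we fine-tune the model for classification. We load the encoder from the language model, train the classifier head first, and then gradually unfreeze the earlier layers.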

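# Classifier data: this definition is an assumption (it is not shown above); it reuses the language model's vocabulary
data_clas = TextClasDataBunch.from_df(
    path = 'salil', train_df = train_df, valid_df = valid_df, test_df = test_df,
    vocab = data_lm.train_ds.vocab, label_cols = ['Label'], text_cols = ['Tweet text', 'hashes']
)
learn = text_classifier_learner(data_clas, drop_mult=0.5)
learn.load_encoder('tweet_enc')  # load the encoder from the fine-tuned language model
learn.fit_one_cycle(1, 1e-2)  # train only the classifier head
learn.freeze_to(-2)  # unfreeze the last two layer groups
learn.fit_one_cycle(1, slice(5e-3/2., 5e-3))
learn.unfreeze()  # unfreeze everything; earlier layers get lower learning rates
learn.fit_one_cycle(1, slice(2e-3/100, 2e-3))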
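
Passing a slice as the learning rate asks for discriminative learning rates: the earliest layer group trains at the low end of the range and the last group at the high end. The sketch below spreads the rates geometrically between the two ends, which is my understanding of how fastai interprets slice(low, high). The helper spread_learning_rates and the count of three layer groups are hypothetical, chosen to keep the output readable.

```python
def spread_learning_rates(lowest, highest, n_groups):
    """Geometrically spaced learning rates, one per layer group.

    The earliest layer group gets `lowest` and the last group gets `highest`.
    """
    if n_groups == 1:
        return [highest]
    ratio = (highest / lowest) ** (1 / (n_groups - 1))
    return [lowest * ratio ** i for i in range(n_groups)]

# Same range as the final fit_one_cycle call above; three groups is a made-up count.
rates = spread_learning_rates(2e-3 / 100, 2e-3, 3)
print(["%.0e" % r for r in rates])
```

The intuition is that early layers hold general language knowledge and should change slowly, while the layers closest to the classifier need to adapt the most.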

The classifier is tuned and ready to classify.

Results

Without much work, we reached an accuracy of around 85% on the validation set.

Accuracy of 85%