SarcDetect: Sarcasm detection with transfer learning

15 Sep 2018 | I am not sarcastic!

ULMFiT is a great technique for applying transfer learning in NLP. It has been shown to achieve strong results with small amounts of data. So, when I came across the SemEval-2018 task on irony detection, I wanted to put it to the test.

Pre-processing
Preparing the data
Building our classifier
- Fine-tune the language model
- Fine-tune for classification
Results

Usually in NLP, we use word embeddings: vectors of floats that represent individual words. In ULMFiT, we instead take a pretrained language model and fine-tune it for our task. A language model not only contains word embeddings, but has also been trained to build representations of full sentences and documents.
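
To make the difference concrete, here is a toy sketch with random vectors and a tiny untrained recurrent network. It is not ULMFiT's actual architecture; the vocabulary, dimensions and weights are made up for illustration. Averaging static word vectors gives the same result for any word order, while a sequence encoder does not.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
# Toy static embeddings: one random vector per word (illustrative only)
emb = {w: rng.normal(size=dim) for w in ["dog", "bites", "man"]}

def bag_of_embeddings(tokens):
    """Average the static word vectors; word order is ignored."""
    return np.mean([emb[t] for t in tokens], axis=0)

# Random weights for a tiny, untrained recurrent encoder
W_h = rng.normal(size=(dim, dim)) * 0.5
W_x = rng.normal(size=(dim, dim)) * 0.5

def recurrent_encoding(tokens):
    """Final hidden state of the toy RNN; depends on word order."""
    h = np.zeros(dim)
    for t in tokens:
        h = np.tanh(W_h @ h + W_x @ emb[t])
    return h

a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]

same_bag = np.allclose(bag_of_embeddings(a), bag_of_embeddings(b))
same_seq = np.allclose(recurrent_encoding(a), recurrent_encoding(b))
print(same_bag, same_seq)
```

The two sentences share a bag-of-embeddings vector but get different sequence encodings, which is the kind of context a pretrained language model brings to a task like irony detection.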

Pre-processing

We clean up the data a little by removing user mentions, links, hashtags and numbers from each tweet. I also extracted the hashtags and segmented them into words, to better capture their correlation with the tweet. Here is the code I used:

from nltk.tokenize import TweetTokenizer
import re
import wordsegment as ws
ws.load()

def preprocess(tweet):
  tknzr = TweetTokenizer(reduce_len=True, strip_handles=True)
  tokens = tknzr.tokenize(tweet)

  tweet = " ".join(tokens)  # we need to ensure that all the tokens are separated using the space

  tweet = tweet.lower()
  tweet = re.sub(r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', '', tweet)  # URLs
  tweet = re.sub(r'(?:@[\w_]+)', '', tweet)  # user-mentions
  tweet = re.sub(r'(?:\#+[\w_]+[\w\'_\-]*[\w_]+)', '', tweet)  # hashtags
  tweet = re.sub(r'(?:(?:\d+,?)+(?:\.?\d+)?)', '', tweet)  # numbers

  return tweet

def segmentHashes(tweet):
  # Collect the hashtag bodies (without '#'), then split them into words
  tags = " ".join(re.findall(r"#(\w+)", tweet))
  return " ".join(ws.segment(tags)).lower()
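
To see why the hashtags are pulled out before cleaning, here is a simplified sketch that uses only the standard library. `extract_hashtags` and `strip_noise` are hypothetical stand-ins for `segmentHashes` and `preprocess` above (no tokenizer, no word segmentation, simpler regexes), and the sample tweet is made up.

```python
import re

def extract_hashtags(tweet):
    """Return the hashtag bodies (without '#') in order of appearance."""
    return re.findall(r"#(\w+)", tweet)

def strip_noise(tweet):
    """Lower-case the tweet, then drop URLs, mentions, hashtags and numbers."""
    tweet = tweet.lower()
    tweet = re.sub(r"https?://\S+", "", tweet)     # URLs
    tweet = re.sub(r"@\w+", "", tweet)             # user mentions
    tweet = re.sub(r"#\w+", "", tweet)             # hashtags
    tweet = re.sub(r"\d+(?:[.,]\d+)*", "", tweet)  # numbers
    return " ".join(tweet.split())                 # collapse leftover spaces

sample = "@bob Loving the 3 hour delay http://t.co/abc #SoFun #not"

tags = extract_hashtags(sample)   # must run on the raw tweet
clean = strip_noise(sample)

print(tags)
print(clean)
```

Once the tweet has been cleaned, the hashtags are gone, so the extraction has to happen on the raw text. This is why the data preparation step fills the `hashes` column before overwriting `Tweet text`.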
  

Preparing the data

Let’s get the training, validation and test data into the necessary format. We split the data 70-30 into training and validation sets.

from fastai.text import *
from sklearn.model_selection import train_test_split
import wordsegment as ws

path = ''
ws.load()

df = pd.read_csv(path + 'SemEval2018-Task3-master/datasets/train/SemEval2018-T3-train-taskB_emoji_ironyHashtags.txt', delimiter='\t')
df['hashes'] = df.apply(lambda row: segmentHashes(row['Tweet text']), axis=1)
df['Tweet text'] = df.apply(lambda row: preprocess(row['Tweet text']), axis=1)

train_df, valid_df = train_test_split(df, test_size=0.3)  # hold out 30% for validation

test_df = pd.read_csv(path + 'SemEval2018-Task3-master/datasets/goldtest_TaskB/SemEval2018-T3_gold_test_taskB_emoji.txt', delimiter='\t')
test_df['hashes'] = test_df.apply(lambda row: segmentHashes(row['Tweet text']), axis=1)
test_df['Tweet text'] = test_df.apply(lambda row: preprocess(row['Tweet text']), axis=1)
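
As a quick check of the split proportions, here is a minimal sketch that uses a toy list in place of the tweet data frame. In scikit-learn, `test_size` is the fraction that is held out, so a split with 70% for training and 30% for validation corresponds to `test_size=0.3`. The toy data and the `random_state` value are illustrative assumptions.

```python
from sklearn.model_selection import train_test_split

samples = list(range(10))  # toy stand-in for the rows of the tweet data frame

# test_size is the held-out share: 30% validation, 70% training
train, valid = train_test_split(samples, test_size=0.3, random_state=0)

print(len(train), len(valid))
```

Passing a fixed `random_state` also makes the split reproducible between runs, which helps when comparing validation accuracy across experiments.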

Data sample

Building our classifier

Fine-tune the language model

We first get a pretrained model to use as our base. Since we are classifying English tweets, the WT103 model (from fastai), a language model pretrained on WikiText-103, works fine for our case. Fine-tuning it for tweets is as simple as:

# Language model data
data_lm = TextLMDataBunch.from_df(
    path = 'salil',
    train_df = train_df,
    valid_df = valid_df,
    test_df = test_df,
    label_cols = ['Label'],
    text_cols = ['Tweet text', 'hashes']
)
learn = language_model_learner(data_lm, pretrained_model=URLs.WT103_1, drop_mult=0.7, callback_fns=ShowGraph)
learn.fit_one_cycle(1, 1e-2)

Loss graph

We unfreeze the model and fine-tune it:

learn.unfreeze()
learn.fit_one_cycle(1, 1e-3)
learn.save_encoder('tweet_enc')  # save the fine-tuned encoder; the classifier loads it below

Loss graph
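
`fit_one_cycle` trains under the one-cycle policy: the learning rate ramps up to the given maximum and then anneals to a much smaller value. Here is a rough sketch of such a schedule. The function `one_cycle_lr`, the cosine shape, and the defaults `div_factor=25` and `pct_start=0.3` are assumptions for illustration, so fastai's exact schedule may differ.

```python
import math

def one_cycle_lr(step, total_steps, lr_max, div_factor=25.0, pct_start=0.3):
    """Learning rate at a 0-based step: cosine warm-up, then cosine decay."""
    warmup_steps = int(total_steps * pct_start)
    lr_start = lr_max / div_factor   # where the warm-up begins
    lr_end = lr_start / 1e4          # where the decay ends (assumed ratio)
    if step < warmup_steps:
        pct = step / warmup_steps
        begin, end = lr_start, lr_max
    else:
        pct = (step - warmup_steps) / (total_steps - warmup_steps)
        begin, end = lr_max, lr_end
    # Cosine interpolation: equals `begin` at pct=0 and `end` at pct=1
    return end + (begin - end) * (1 + math.cos(math.pi * pct)) / 2

total = 100
schedule = [one_cycle_lr(s, total, lr_max=1e-2) for s in range(total)]
print(f"start={schedule[0]:.1e} peak={max(schedule):.1e} last={schedule[-1]:.1e}")
```

The short warm-up lets training start gently, and the long decay lets the model settle, which is one reason a single epoch per stage can be enough here.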

We can test our language model by having it generate some tweets for us:

Example prediction

Fine-tune for classification

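Now we fine-tune the model for classification, gradually unfreezing its layers: first only the classifier head is trained, then the last two layer groups, and finally the whole network.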

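# Assumption: data_clas mirrors data_lm and reuses its vocabulary
data_clas = TextClasDataBunch.from_df(
    path = 'salil', train_df = train_df, valid_df = valid_df, test_df = test_df,
    vocab = data_lm.train_ds.vocab, label_cols = ['Label'], text_cols = ['Tweet text', 'hashes']
)
learn = text_classifier_learner(data_clas, drop_mult=0.5)
learn.load_encoder('tweet_enc')  # encoder saved after language-model fine-tuning
learn.fit_one_cycle(1, 1e-2)  # train only the classifier head
learn.freeze_to(-2)  # unfreeze the last two layer groups
learn.fit_one_cycle(1, slice(5e-3/2., 5e-3))
learn.unfreeze()  # train all layers with discriminative learning rates
learn.fit_one_cycle(1, slice(2e-3/100, 2e-3))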
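
The `freeze_to` calls and the `slice(...)` learning rates work together: each stage trains more layer groups, and earlier groups get smaller learning rates than later ones. The sketch below mimics that behaviour in plain Python. The helper names, the four group names and the geometric spacing are assumptions for illustration, not fastai's actual code or the real layer groups of this model.

```python
import math

def spread_lrs(start, stop, n_groups):
    """Geometrically spaced learning rates, from earliest to last group."""
    if n_groups == 1:
        return [stop]
    ratio = (stop / start) ** (1 / (n_groups - 1))
    return [start * ratio ** i for i in range(n_groups)]

def trainable_groups(groups, freeze_to):
    """Groups left trainable when everything before `freeze_to` is frozen."""
    return groups[freeze_to:]

# Hypothetical layer groups, ordered from input side to output side
groups = ["encoder_low", "encoder_mid", "encoder_high", "classifier_head"]

stages = [
    ("head only", -1, (1e-2, 1e-2)),
    ("last two groups", -2, (5e-3 / 2, 5e-3)),
    ("all groups", 0, (2e-3 / 100, 2e-3)),
]

for label, n, (lo, hi) in stages:
    lrs = spread_lrs(lo, hi, len(groups))
    active = trainable_groups(groups, n)
    print(label, active, [f"{lr:.1e}" for lr in lrs])
```

Keeping the learning rate small for the early groups protects the general language knowledge learned during pretraining, while the later groups adapt faster to the irony labels.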

The classifier is tuned and ready to classify.

Results

With little tuning, we reached an accuracy of around 85% on the validation set.

Accuracy of 85%

Salil