NLP: Beginners Guide To Featurize “Story Based” Text: MPST (Movie Plot Synopses Tags)

Hello peeps! Before we dive into our topic, I first recommend you check the above Kaggle link so that you can understand the MPST problem better; you can also download the dataset from there (if you are interested ;)). I have used the MPST dataset to demonstrate some of the most common beginner techniques to featurize "story based" text data. In MPST, the text (synopsis) we get is the story of the movie, collected from IMDb.

A Little More About Why Tag Prediction Is Important?

When we tag something, it conveys information about that thing in just a "word". Tagging is a popular way to gather community feedback about online items. It's also known as collaborative tagging or social tagging.

User-generated tags in recommendation systems like IMDb and MovieLens provide different types of summarized attributes of movies. These tags are effective search keywords, and they are also useful for discovering social interests and improving recommendation performance.

Now, why is Tag Prediction important and how can it be useful? Companies like Netflix recommend movies to users based on what they watched in the past. Suppose a new movie arrives and we know its synopsis. By predicting tags for that movie from its synopsis, we can recommend it to users who previously watched different movies carrying the same tags. This is just one of the many applications of tag prediction.

Enough theory now. Let’s start the main part.

Objective :

To predict tags for the given movie synopses.

Metrics used :

  • F1_micro
  • Hamming Loss
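To make these two metrics concrete, here is a minimal sketch on a toy multilabel problem (the arrays are made up purely for illustration):

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss

# 3 movies, 4 possible tags, one-hot encoded
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 1],
                   [1, 1, 0, 0]])

# F1_micro pools TP/FP/FN over all tag predictions, so frequent tags dominate
print(f1_score(y_true, y_pred, average='micro'))

# Hamming loss: fraction of individual tag predictions that are wrong
print(hamming_loss(y_true, y_pred))
```

Both metrics score each (movie, tag) cell individually, which is exactly what we want for a multilabel problem.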

MPST Dataset :

Let me clarify things here in case the image is not clearly visible to you. The important detail is the set of column names in our dataset: imdb_id (id of the movie), title (movie title), plot_synopsis (movie story), tags (like horror, murder, romantic), split (train-test split) and synopsis_source (source of the movie details). Of all these columns, we will use only two: plot_synopsis (the story of the movie; the text to featurize) and tags (our class labels).

Exploratory Data Analysis :

The dataset may contain repeated rows, so we will remove duplicates first.

sorted_data=all_data.sort_values('imdb_id', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
all_data=sorted_data.drop_duplicates(subset={"title"}, keep='first', inplace=False)

Now we will check the distribution and repetition of tags.

vectorizer = CountVectorizer(tokenizer = lambda x: x.split(', '))
tag_vect = vectorizer.fit_transform(all_data['tags'].values)
tags = vectorizer.get_feature_names()
print("Number of unique tags :", tag_vect.shape[1])
print("Some of the tags we have :", tags[:10])

If you run the above code, you will see that there are 71 unique tags in our dataset.

freqs = tag_vect.sum(axis=0).A1
data_tag_count = {'tag': tags, 'count': freqs}
df_tag_count = pd.DataFrame(data_tag_count)
df_tag_count = df_tag_count.sort_values('count', axis=0, ascending=False)
tag_counts = df_tag_count['count'].values

plt.figure(figsize=(20, 10))
plt.plot(tag_counts)
plt.title("Distribution of number of times each tag appeared")
plt.xlabel("Tag number")
plt.ylabel("Number of times tag appeared")


Next, we will check the minimum and maximum number of tags a movie can have.

tag_quest_count = tag_vect.sum(axis=1).tolist()
tag_quest_count = [int(j) for i in tag_quest_count for j in i]
print('We have total {} datapoints.'.format(len(tag_quest_count)))
print("Maximum number of tags per movie: %d" % max(tag_quest_count))
print("Minimum number of tags per movie: %d" % min(tag_quest_count))
print("Avg. number of tags per movie: %f" % ((sum(tag_quest_count) * 1.0) / len(tag_quest_count)))

If you run the above code, you will see that a movie can have at most 25 tags and at least 1 tag, with about 3 tags on average.

Let's check how the number of tags per movie (from 1 to 25) is distributed across our dataset.

import seaborn as sns

sns.countplot(tag_quest_count, palette='gist_rainbow')
plt.title("Number of tags per movie")
plt.xlabel("Number of tags")
plt.ylabel("Number of movies")


We can see that the distribution is heavily skewed: most movies carry only a handful of tags, and a few tags account for most of the occurrences. This imbalance is why I chose F1_micro as a metric.

Let’s see which are those most repeating tags.

i = np.arange(30)
plt.figure(figsize=(20, 10))
plt.bar(i, df_tag_count['count'].values[:30])
plt.title('Frequency of top 30 tags')
plt.xticks(i, df_tag_count['tag'][:30], rotation=90)


We can see that Murder, Violence and Flashback are the top three most occurring tags.

NOTE: As I showed earlier, a movie has 3 tags on average. So we can build a baseline model by assigning these three most occurring tags to every movie as its "predicted" tags, and then compare the performance of our other models against this baseline.
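As a hedged sketch of that baseline (the function and array names here are hypothetical, not from the case-study code), assuming one-hot encoded train labels:

```python
import numpy as np

def baseline_predictions(y_train_onehot, n_test, top_k=3):
    """Predict the top_k most frequent train tags for every test movie."""
    y = np.asarray(y_train_onehot)
    tag_freq = y.sum(axis=0)                       # per-tag counts on train
    top_tags = np.argsort(tag_freq)[::-1][:top_k]  # indices of the most frequent tags
    pred = np.zeros((n_test, y.shape[1]), dtype=int)
    pred[:, top_tags] = 1                          # same top_k tags for every movie
    return pred
```

Scoring this constant prediction with F1_micro and Hamming loss gives the floor that any real model should beat.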

Data Cleaning :

In this data cleaning process, we will clean our text data (plot_synopsis) by removing special characters and numbers, and by applying stemming. We will do it twice, because for TF-IDF and Avg-W2V, stemmed text is fine; but if we are trying to find the sentic levels and emotions in the story, the flow of the story, parts of speech, etc., then it is better not to stem the text.

I have used the below helper function to clean data. You can modify it according to your need.

import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can't", "can not", phrase)
    phrase = re.sub(r"couldn't", "could not", phrase)
    phrase = re.sub(r"wouldn't", "would not", phrase)
    phrase = re.sub(r"shouldn't", "should not", phrase)
    phrase = re.sub(r"don't", "do not", phrase)
    phrase = re.sub(r"doesn't", "does not", phrase)
    phrase = re.sub(r"haven't", "have not", phrase)
    phrase = re.sub(r"hasn't", "has not", phrase)
    phrase = re.sub(r"ain't", "not", phrase)
    phrase = re.sub(r"hadn't", "had not", phrase)
    phrase = re.sub(r"didn't", "did not", phrase)
    phrase = re.sub(r"wasn't", "was not", phrase)
    phrase = re.sub(r"aren't", "are not", phrase)
    phrase = re.sub(r"isn't", "is not", phrase)
    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

stop_words = stopwords.words('english')
sno = SnowballStemmer('english')

# Cleaning data with stemming: for tfidf and avg w2v lexical features
cleaned_final_data = []
for sentance in all_data['plot_synopsis'].values:
    sentance = re.sub(r"http\S+", "", sentance)
    sentance = BeautifulSoup(sentance, 'lxml').get_text()
    sentance = decontracted(sentance)
    sentance = re.sub(r"\S*\d\S*", "", sentance).strip()
    sentance = re.sub(r'[^A-Za-z\.]+', ' ', sentance)
    sentance = ' '.join(e.lower() for e in sentance.split() if e.lower() not in stop_words)
    sentance = ' '.join(sno.stem(word) for word in sentance.split())
    cleaned_final_data.append(sentance)
all_data['plot_synopsis_cleaned'] = cleaned_final_data

# Cleaning data without stemming: for sentiment and emotional features
no_backslash_synopsis = []
for sentance in all_data['plot_synopsis'].values:
    sentance = re.sub(r"http\S+", "", sentance)
    sentance = BeautifulSoup(sentance, 'lxml').get_text()
    sentance = re.sub(r"\\'", "'", sentance)  # drop stray backslashes before apostrophes
    sentance = decontracted(sentance)
    sentance = re.sub(r"\S*\d\S*", "", sentance).strip()
    sentance = re.sub(r'[^A-Za-z]+', ' ', sentance)
    sentance = ' '.join(sentance.split())
    no_backslash_synopsis.append(sentance)
all_data['no_backslash_synopsis'] = no_backslash_synopsis

Now we have cleaned data at hand, ready for featurization. But before we start, let's also build our class labels. Since each movie can have multiple outputs (tags), this is a multilabel classification problem. We will construct a vector of length 71 (the total number of tags) and one-hot encode the tags of each movie. We can do it manually or simply use CountVectorizer.

vectorizer = CountVectorizer(tokenizer=lambda x: x.split(', '), binary=True)
multilabel_y = vectorizer.fit_transform(all_data['tags'].values)

Featurising Story Based Cleaned Text Data (without stemming) :

I will be using below mentioned libs and packages :

  • SenticNet
  • SenticPhrase
  • FrameNet (NLTK)
  • TextBlob
  • Pattern
  • FastText (pre-trained word vectors, via Gensim)
  • Topic Modeling- LDA (Gensim)

Before we build the emotion- and sentiment-based features, I want to throw some light on how we will work with them: just a basic idea of what we are going to do next.

Getting Emotions And Sentiments :

Sentiments are inherent part of stories and one of the key elements that determine the possible experiences found from a story. For example, depressive stories are expected to be full of sadness, anger, disgust and negativity, whereas a funny movie is possibly full of joy and surprise.

Human emotions are characterized by four affective dimensions (attention, sensitivity, aptitude and pleasantness). Each of these affective dimensions is represented by six different activation levels called sentic levels. These make up 24 distinct labels called 'elementary emotions' that represent the total emotional state of the human mind. The SenticNet and SenticPhrase knowledge base consists of 50,000 commonsense concepts with their semantics, polarity values and scores for the four affective dimensions. We use this knowledge base to compute the average polarity, attention, sensitivity, aptitude, and pleasantness of each synopsis.

We will use SenticPhrase for full synopsis and SenticNet for a word.

Using SenticPhrase :

We can get the sentic-level values (attention, sensitivity, aptitude and pleasantness) with the .get_sentics() function and the moodtag values (anger, fear, joy, admiration, disgust, interest, surprise, sadness) with .get_moodtags(). We also use the polarity value, which tells us how emotionally negative or positive the story is; we get it with .get_polarity().

from sentic import SenticPhrase
import numpy as np

def get_sentics_polarity_moodtags(synopsis):
    sp = SenticPhrase(synopsis)
    dict_sentics = sp.get_sentics()
    dict_moodtags = sp.get_moodtags()
    all_tags = list(dict_moodtags.keys())
    polarity = sp.get_polarity()

    vect = np.zeros(13, dtype=float)
    if dict_sentics.get('pleasantness'):
        vect[0] = float(dict_sentics.get('pleasantness'))
    if dict_sentics.get('attention'):
        vect[1] = float(dict_sentics.get('attention'))
    if dict_sentics.get('sensitivity'):
        vect[2] = float(dict_sentics.get('sensitivity'))
    if dict_sentics.get('aptitude'):
        vect[3] = float(dict_sentics.get('aptitude'))
    if '#anger' in all_tags:
        vect[4] = float(dict_moodtags.get('#anger'))
    if '#admiration' in all_tags:
        vect[5] = float(dict_moodtags.get('#admiration'))
    if '#joy' in all_tags:
        vect[6] = float(dict_moodtags.get('#joy'))
    if '#interest' in all_tags:
        vect[7] = float(dict_moodtags.get('#interest'))
    if '#disgust' in all_tags:
        vect[8] = float(dict_moodtags.get('#disgust'))
    if '#sadness' in all_tags:
        vect[9] = float(dict_moodtags.get('#sadness'))
    if '#surprise' in all_tags:
        vect[10] = float(dict_moodtags.get('#surprise'))
    if '#fear' in all_tags:
        vect[11] = float(dict_moodtags.get('#fear'))
    vect[12] = float(polarity)
    return vect

Using the above function, we can get the emotions and sentiments for a whole synopsis. But as a story goes from beginning to end, many emotions may run in parallel, and at some points a specific emotion may dominate the others. So instead of getting the emotion and sentiment values for the whole synopsis at once, it is better to divide the synopsis into chunks and then get the emotion values for each chunk. This captures the Flow of the Story.

Divide an example synopsis into 5 chunks :

synopsis = no_backslash_synopsis[122] 
split_synopsis = synopsis.split(' ')
chunk_size = int(len(split_synopsis)/5)
chunk1 = ' '.join(e for e in split_synopsis[0:chunk_size])
chunk2 = ' '.join(e for e in split_synopsis[chunk_size:chunk_size*2])
chunk3 = ' '.join(e for e in split_synopsis[chunk_size*2:chunk_size*3])
chunk4 = ' '.join(e for e in split_synopsis[chunk_size*3:chunk_size*4])
chunk5 = ' '.join(e for e in split_synopsis[chunk_size*4:])

Obtaining emotions- fear, joy and anger for each chunk :

anger = []
joy = []
fear = []
for chunk in [chunk1, chunk2, chunk3, chunk4, chunk5]:
    sp = SenticPhrase(chunk)
    dict_tags = sp.get_moodtags()
    anger.append(dict_tags.get('#anger', 0))
    joy.append(dict_tags.get('#joy', 0))
    fear.append(dict_tags.get('#fear', 0))

Plotting emotions for movie :

import matplotlib.pyplot as plt

plt.plot(anger)
plt.plot(joy)
plt.plot(fear)
plt.title("flow of emotions")
plt.legend(['anger', 'joy', 'fear'], loc='best')


We can see that in the first chunk of the story, fear and anger are very low while joy is at the top. In the second chunk, some anger shoots up. Similarly, at the end of the story, fear overtakes anger. But overall, joy dominates.

So it is more reasonable to divide a synopsis into chunks and then measure emotions, rather than obtaining them from the whole synopsis at once. But in this case study, I divided each synopsis into 2 chunks, not 5. The reason is that not every synopsis is as long as the example above; some synopses have fewer than 600 words, and we don't want our chunks to be too sparse. Hence I decided to divide into two chunks only.

Using TextBlob and SenticNet for verb and noun emotions and sentic levels :

Verbs and nouns tell us more about what is going on inside the story. If we see verbs like kill, run, scream etc., then we can easily get an idea that something bad is happening in the story, which generates a sense of negative emotion.

We will use the TextBlob library to extract the verbs (VB) and nouns (NN) from the synopsis, and then get the sentic-level, emotion and polarity values for each verb and noun. In the end, we take the average of all these values for one synopsis.

from textblob import TextBlob
from senticnet.senticnet import SenticNet

sn = SenticNet()
count = 0
blob = TextBlob(x_train['no_backslash_synopsis'].values[9063])
for word, pos in blob.tags:
    if pos in ('VB', 'NN', 'VBZ', 'VBP', 'VBD', 'VBN', 'VBG', 'NNS', 'NNP', 'NNPS'):
        if word not in stop_words and len(word) > 2:
            try:
                concept_info = sn.concept(word.lower())
                count += 1
            except Exception:
                print("sentic concept not present :", word)
print("\nTotal words marked by TextBlob :", len(blob.tags))
print("\nTotal words parsed by SenticNet :", count)

If you run the above code, you will see that SenticNet fails to give sentic levels and emotions for many of the verbs and nouns tagged by TextBlob. The reason is that, for SenticNet, the word should be in its base verb form and singular. So words like NNP: resides, stallions, returns, advantages; VBZ: Gets, Approves; VBN: proposed etc. are not parsed by SenticNet. We need to convert these words to the base verb form and make them singular. We will use NLTK's WordNet lemmatizer for that.

We also have some incorrect spellings like educatiion, husbanc, conditiion etc. We will try to correct these spelling mistakes as much as we can, using the Pattern library.

Improved SenticNet :

from nltk.stem.wordnet import WordNetLemmatizer
from pattern import en
from pattern.text.en import singularize

# Applying NLTK WordNet on output of TextBlob
count = 0
blob = TextBlob(x_train['no_backslash_synopsis'].values[9063])
for word, pos in blob.tags:
    if pos in ('VB', 'NN', 'VBZ', 'VBP', 'VBD', 'VBN', 'VBG', 'NNS', 'NNP', 'NNPS'):
        if word not in stop_words and len(word) > 2:
            try:
                word = WordNetLemmatizer().lemmatize(word.lower(), 'v')  # verb base form conversion
                word = singularize(word)
                concept_info = sn.concept(word)
                count += 1
            except Exception:
                try:
                    # list of (suggestion, probability) tuples: [(suggestion1, p1), (suggestion2, p2), ...] with p1 > p2
                    word = en.spelling.suggest(word.lower())[0][0]
                    word = WordNetLemmatizer().lemmatize(word.lower(), 'v')
                    word = singularize(word)
                    concept_info = sn.concept(word)
                    count += 1
                except Exception:
                    print("sentic concept not present :", word)
print("\nTotal words marked by TextBlob :", len(blob.tags))
print("\nTotal words parsed by SenticNet :", count)

Another sentic feature: Frame Sentics

For every verb and noun, if we can get the related contextual words, the feature becomes more descriptive. This is called a Frame. Using NLTK, we can access the frames built on the FrameNet corpus, and almost every word is related to some frame. We can think of frames as 'related words'.

Let's see the frames related to the word 'music':

from nltk.corpus import framenet as fn
[ for f in fn.frames_by_lemma('music')]
['People_by_vocation', 'Performers', 'Performing_arts']

FrameNet on full synopses :

word_pos = []
blob = TextBlob(x_train['no_backslash_synopsis'].values[121])
for word, pos in blob.tags:
    if pos == 'VB' or pos == 'NN':
        if word not in stop_words:
            names = [ for f in fn.frames_by_lemma(word.lower())]
            if len(names) != 0:
                word_pos.append(names)
            else:
                print("frame not present :", word)

If you run the above code, you will see the 'related words' for every verb (VB) and noun (NN) in the synopsis. But I want to mention that not every word tagged as a verb or noun by TextBlob is present in NLTK's FrameNet. In the above example, you will see that the word 'toreador' is not present in FrameNet. We will simply skip such words when we build our vectors.

How we will obtain features from these frame words :

We will collect the frame words over the whole train set and use CountVectorizer to build features for train and test.

# getting train set frame words
from nltk.corpus import framenet as fn
from tqdm import tqdm_notebook as tqdm
import itertools

all_word_pos_train = []
for synopsis in tqdm(x_train['no_backslash_synopsis'].values):
    blob = TextBlob(synopsis)
    word_pos = []
    for word, pos in blob.tags:
        if pos == 'VB' or pos == 'NN':
            if word not in stop_words:
                names = [ for f in fn.frames_by_lemma(word)]
                if len(names) != 0:
                    word_pos.append(names)
    word_pos = list(itertools.chain.from_iterable(word_pos))
    all_word_pos_train.append(' '.join(word_pos))
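To finish this pipeline, here is a hedged sketch of the CountVectorizer step (frame_features and its arguments are illustrative names; each document is the space-joined string of frame names for one synopsis, as built above):

```python
from sklearn.feature_extraction.text import CountVectorizer

def frame_features(train_docs, test_docs):
    """train_docs/test_docs: one string of space-joined frame names per synopsis."""
    vec = CountVectorizer()
    train_feats = vec.fit_transform(train_docs)  # learn the frame vocabulary on train
    test_feats = vec.transform(test_docs)        # reuse the same vocabulary on test
    return train_feats, test_feats, vec
```

Fitting only on the train documents keeps the test set out of the vocabulary, which avoids leakage.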

So far we have made three types of features based on the story :

  • Sentiments and emotional values from whole synopses after dividing them into chunks.
  • Sentiment and emotion for every verb and noun present in the synopses.
  • Frames or related words featurization using CountVectorizer.

Now let's move on to our two lexical features. For these we will use the cleaned text with stemming; remember that for the above three features, we used the cleaned text without stemming.

FastText :

FastText (which is essentially an extension of the word2vec model) treats each word as composed of character n-grams, so the vector for a word is the sum of the vectors of its character n-grams. For example, the word vector for "apple" is the sum of the vectors of the n-grams <ap, app, appl, apple, apple>, ppl, pple, pple>, ple, ple>, le> (assuming the smallest n-gram size [minn] is 3 and the largest [maxn] is 6; < and > mark the word boundaries). You can check the other differences between plain Word2Vec and FastText here.
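To see exactly which character n-grams a word decomposes into, here is a tiny helper (a sketch of the decomposition rule described above, not the actual FastText code):

```python
def fasttext_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams FastText would use for `word`."""
    word = '<' + word + '>'  # add boundary markers
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(word) - n + 1):
            grams.append(word[i:i + n])
    return grams

print(fasttext_ngrams('apple'))
```

This makes it easy to check the "apple" example above and to see why FastText can still produce vectors for out-of-vocabulary words.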

from gensim.models import KeyedVectors

# Creating the model
en_model = KeyedVectors.load_word2vec_format('wiki-news-300d-1M-subword.vec')

# Getting the tokens (a set, for fast membership checks later)
all_words = set(en_model.vocab)

# Printing out number of tokens available
print("Number of Tokens: {}".format(len(all_words)))

Now we will compute Avg-W2V vectors using the FastText model loaded above.

from tqdm import tqdm

train_avgw2v = []  # the avg-w2v vector for each synopsis is stored in this list
for sent in tqdm(x_train['plot_synopsis_cleaned'].values):  # for each synopsis
    sent_vec = np.zeros(300)  # these FastText vectors are 300-dimensional
    cnt_words = 0  # num of words with a valid vector in the synopsis
    for word in sent.split():  # for each word in the synopsis
        if word in all_words:
            sent_vec += en_model[word]
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    train_avgw2v.append(sent_vec)

Topic Modeling : Latent Dirichlet Allocation (LDA)

LDA builds a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions. We are going to apply LDA to our synopses and split them into topics. It works like an unsupervised learning model: the algorithm assigns each synopsis to topics, along with a probability score for each assignment.


import gensim
import nltk
nltk.download('wordnet')

# tokenized, cleaned (stemmed) synopses
processed_docs_train = [doc.split() for doc in x_train['plot_synopsis_cleaned'].values]
dictionary_train = gensim.corpora.Dictionary(processed_docs_train)
bow_corpus_train = [dictionary_train.doc2bow(doc) for doc in processed_docs_train]

Note : alpha and eta are hyperparameters in LdaMulticore, but I am using the default (symmetric) values. num_topics can also be treated as a hyperparameter, but given the size of our corpus I just took 100.

lda_model = gensim.models.LdaMulticore(bow_corpus_train, num_topics=100, id2word=dictionary_train, passes=2, workers=2)

for index, score in sorted(lda_model[bow_corpus_train[1]], key=lambda tup: -1 * tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))

Above, I demonstrated topic-modeling classification on a single synopsis (bow_corpus_train[1]). The top probability score is about 0.71, the highest, which tells us that the example synopsis belongs to that topic with probability roughly 0.71. You can also multiply these topic weights with your word vectors to give them more weight. LDA topic modeling can be useful in many ways, but in this case study I just used it to assign each synopsis to the topic it belongs to, among the 100 topics chosen on the train set.
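The "multiply topic weights with word vectors" idea can be sketched like this (a hypothetical helper, not part of the case-study code):

```python
import numpy as np

def topic_weighted_vector(avg_w2v_vec, lda_topic_probs):
    """lda_topic_probs: list of (topic_id, probability) pairs, e.g. lda_model[bow]."""
    top_prob = max(prob for _, prob in lda_topic_probs)
    # scale the synopsis vector by its strongest topic membership
    return np.asarray(avg_w2v_vec, dtype=float) * top_prob
```

A synopsis that belongs strongly to one topic keeps most of its vector magnitude, while a diffuse synopsis gets damped.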

topic_modeling_feat_train = []
for synopsis in bow_corpus_train:
    t = lda_model[synopsis]  # list of (topic_id, probability) pairs
    temp_dict = dict((y, x) for x, y in t)  # probability -> topic_id
    # keep the topic with the highest probability as the feature
    topic_modeling_feat_train.append(temp_dict[max(temp_dict.keys())])

That's all for the featurization of story-based text data for now. After this, you can use Logistic Regression with OneVsRest, since this is a multilabel classification problem.

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import precision_score, recall_score, f1_score

classifier = OneVsRestClassifier(LogisticRegression(C=1.0, penalty='l2'))
classifier.fit(train_final_all_feats.tocsr(), y_train)  # stacked train feature matrix
predictions = classifier.predict(test_final_all_feats.tocsr())

print("Accuracy :", metrics.accuracy_score(y_test, predictions))
print("Hamming loss ", metrics.hamming_loss(y_test, predictions))

precision = precision_score(y_test, predictions, average='micro')
recall = recall_score(y_test, predictions, average='micro')
f1 = f1_score(y_test, predictions, average='micro')

print("Micro-average quality numbers")
print("Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}".format(precision, recall, f1))
Accuracy : 0.03815406976744186
Hamming loss 0.055452628562070096
Micro-average quality numbers
Precision: 0.2795, Recall: 0.2433, F1-measure: 0.2602

Feature Wise Metrics Score :

As you can see in the feature-wise score table above, the baseline model still has the highest f1_micro, but its accuracy is very low. The reason is that the baseline predicts the three most occurring tags for every movie; hence its f1_micro is high, but its hamming loss and accuracy are poor. With the features we built, we get better accuracy and a slightly improved hamming loss.

It would be great if you try out this nice NLP problem and share your hamming loss and f1_micro scores in the comments. If you find more useful story-based features, please share a link too. Feel free to ask questions; I enjoy discussions. And don't forget to clap if you enjoyed reading. Thanks!!