
Multi-label classification to predict movie genres


What are we actually going to achieve?

We are going to assign genres to films based on their plot summaries.

For that, we need multi-label classification. There are three types of classification: binary, multi-class, and multi-label.

You can see the major differences between the three cases:

  • In binary classification, the target variable ‘y’ takes only two labels, t1 and t2.
  • In multi-class classification, ‘y’ can take more than two labels, but every input still receives exactly one of them.
  • In multi-label classification, an input can carry multiple labels at once.

We cannot apply traditional classification algorithms directly on this kind of dataset. Why? Because these algorithms expect a single label for every input, when instead we have multiple labels. It’s an intriguing challenge and one that we will solve in this article.

We will use the CMU Movie Summary Corpus open dataset for our project. You can download it from the CMU Movie Summary Corpus page (http://www.cs.cmu.edu/~ark/personas/).

This dataset contains multiple files, but we’ll focus on only two of them for now:

  • movie.metadata.tsv: Metadata for 81,741 movies, extracted from the November 4, 2012 dump of Freebase. The movie genre tags are available in this file
  • plot_summaries.txt: Plot summaries of 42,306 movies extracted from the November 2, 2012 dump of English-language Wikipedia. Each line contains the Wikipedia movie ID (which indexes into movie.metadata.tsv) followed by the plot summary

Our Strategy to Build a Movie Genre Prediction Model

We know that we can’t use single-label classification algorithms directly on a multi-label dataset. Therefore, we’ll first have to transform our target variable. Let’s see how to do this using a dummy dataset:

Here, X holds the features and y the labels; it is a multi-label dataset. Now, we will use the Binary Relevance approach to transform our target variable, y. First, we extract the unique labels in the dataset:

Unique labels = [ t1, t2, t3, t4, t5 ]

There are 5 unique tags in the data. Next, we need to replace the current target variable with multiple target variables, each belonging to the unique labels of the dataset. Since there are 5 unique labels, there will be 5 new target variables with values 0 and 1 as shown below:
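To make this concrete, here is a minimal sketch of the transformation on a toy dataset, using sklearn's MultiLabelBinarizer (the label names t1 to t5 are just placeholders):

from sklearn.preprocessing import MultiLabelBinarizer

# toy multi-label targets: each input can carry several labels at once
y = [["t1", "t2"], ["t3"], ["t2", "t4", "t5"], ["t1", "t5"]]

mlb = MultiLabelBinarizer()
y_binary = mlb.fit_transform(y)

print(mlb.classes_)  # ['t1' 't2' 't3' 't4' 't5']
print(y_binary)      # one 0/1 column per unique label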

Import the libraries
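A plausible set of imports covering everything used below (my assumption, not the author's exact list):

import json
import re

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer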

Read the dataset
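A sketch of the loading step (the variable name meta is my choice; the TSV ships without a header row):

meta = pd.read_csv("movie.metadata.tsv", sep="\t", header=None)
meta.head()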

Take a look at the dataset. The first thing to notice is that the file has no proper column headers, so pandas assigns numeric labels. Let's rename the columns we need.
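Assuming the standard CMU column order (Wikipedia movie ID in column 0, movie name in column 2, genres in column 8):

meta.rename(columns={0: "movie_id", 2: "movie_name", 8: "genre"}, inplace=True)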

Here, we have renamed only the columns we are interested in.

Now, we will load the movie plot dataset.
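Each line of plot_summaries.txt is a tab-separated pair of Wikipedia movie ID and plot text, so a simple line-by-line read works (a sketch):

plots = []
with open("plot_summaries.txt", "r", encoding="utf-8") as f:
    for line in f:
        movie_id, plot = line.split("\t", 1)
        plots.append((movie_id, plot.strip()))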

Next, split the movie IDs and the plots into two separate lists. We will use these lists to form a dataframe:
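For example (the list names here are my own):

movie_ids = [movie_id for movie_id, plot in plots]
summaries = [plot for movie_id, plot in plots]

movies = pd.DataFrame({"movie_id": movie_ids, "plot": summaries})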

movies.head()

Add the movie names and their genres from the movie metadata file by merging the latter into the former on the movie_id column.
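A sketch of the merge; note the movie IDs are read as integers from the metadata file but as strings from the plot file, so the dtypes need aligning first:

meta["movie_id"] = meta["movie_id"].astype(str)  # align dtypes before merging
movies = pd.merge(movies, meta[["movie_id", "movie_name", "genre"]], on="movie_id")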

We have to convert the genre column from a dictionary-like string into a list of genre names.

Since the column's data type is string, we can't access the genre names with .values() directly. Therefore, we need the json module.
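A sketch of the parsing step:

genres = []
for row in movies["genre"]:
    # each entry is a JSON string mapping Freebase IDs to genre names
    genres.append(list(json.loads(row).values()))

movies["genre"] = genres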

Now the genre column holds clean lists of genre names.

If you look at the dataset closely, you may notice some rows with an empty genre list. We are going to remove those rows:

movies = movies[movies["genre"].str.len() != 0]

Next, we want the count of unique genre tags. We can flatten all the genre lists and count the distinct values:

all_genres = sum(movies["genre"].tolist(), [])
len(set(all_genres))

The output is 363 unique genre tags.

The dominating tags are ‘Family’, ‘Romance’, ‘Crime’, ‘Comedy’, ‘Science’, ‘Action’, and ‘World’.

Next, remove noise from the plot column: lowercase the text and strip extra whitespace, backslashes, and apostrophes.
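One way to write the cleaning function (the regexes are my reconstruction of the steps listed above):

def clean_text(text):
    text = text.lower()                       # drop capitalization
    text = re.sub(r"\\", " ", text)           # drop backslashes
    text = re.sub(r"'", "", text)             # drop apostrophes
    text = re.sub(r"[^a-z]", " ", text)       # keep letters only
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

movies["clean_plot"] = movies["plot"].apply(clean_text)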

Let’s look at the cleaned plots. I’ve processed only the first 1,000 rows here, since cleaning the full corpus takes a lot of computational power.

The cleaned text is much easier to work with.

Converting Text to Features

I mentioned earlier that we will treat this multi-label classification problem as a Binary Relevance problem. Hence, we will now one-hot encode the target variable, i.e., genre, using sklearn’s MultiLabelBinarizer(). Since there are 363 unique genre tags, there are going to be 363 new target variables.
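A sketch of the encoding, with the binarizer named mu_b to match the inverse_transform call later on:

mu_b = MultiLabelBinarizer()
y = mu_b.fit_transform(movies["genre"])
y.shape  # (number of movies, 363)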

Now, it’s time to turn our focus to extracting features from the cleaned version of the movie plots data.

We will use TF-IDF features, and we will split the data into training and validation sets.
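A sketch of both steps; the TfidfVectorizer settings and the 80/20 split are my assumptions, not values from the original post:

tf_idf = TfidfVectorizer(max_df=0.8, max_features=10000)

Xtrain, Xval, ytrain, yval = train_test_split(
    movies["clean_plot"], y, test_size=0.2, random_state=9
)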

Now we can create features for the train and the validation sets:

xtrain_tfidf = tf_idf.fit_transform(Xtrain)
xval_tfidf = tf_idf.transform(Xval)  # transform only: the vocabulary must be fit on the training set alone

Build the model

Remember, we will have to build a model for every one-hot encoded target variable. Since we have 363 target variables, we will have to fit 363 different models with the same set of predictors (TF-IDF features).
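Rather than looping over 363 estimators by hand, sklearn's OneVsRestClassifier fits one binary classifier per target column for us. A sketch using LogisticRegression (LinearSVC can be swapped in the same way):

log_reg = LogisticRegression()
clf = OneVsRestClassifier(log_reg)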

Fit the model on the train set

clf.fit(xtrain_tfidf, ytrain)

Predict movie genres

y_pred = clf.predict(xval_tfidf)
y_pred[2050]

This will be the output (abridged):

array([0, 0, 0, ..., 0, 0, 0])

(The full output is a 363-element binary vector with a single 1 marking the predicted genre.)

Convert it back into genre tags with the binarizer's inverse_transform:

mu_b.inverse_transform(y_pred)[2050]

Output:

('Silent film',)

You can calculate the F1 score of this model:

f1_score(yval, y_pred, average="micro")

I created two models, one using LogisticRegression() and one using LinearSVC().

Of the two, LinearSVC() gives the higher F1 score.

The score is not spectacular, but considering the time required to train the model and the data available, it does a pretty decent job.

You can increase the score by tuning the hyperparameters of the different models.

Let me know how you increased the score.

