Data Science in the Real World

Blending NB And SVM

Before getting into the NB-SVM model variant, let's discuss when NB performs better than SVM. Understanding this is important for appreciating why blending the two models can yield better performance.

NB and SVM each have their own options and hyperparameters (for SVM, the choice of kernel function, for example), and both are sensitive to parameter tuning: a different parameter selection can significantly change their output. So if you have a result showing NB performing better than SVM, that is only true for the selected parameters; with another parameter selection you might find that SVM performs better.

In general, if the independence assumption of NB is reasonably satisfied by the variables in your dataset and the degree of class overlap is small (i.e. there is a potential linear decision boundary), NB can be expected to perform well. On some datasets, with optimization such as wrapper-based feature selection, NB may even beat other classifiers. And even when it only achieves comparable performance, NB is often preferable because of its speed.
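To make this concrete, here is a minimal sketch of how you might compare the two as baselines with scikit-learn cross-validation. The tiny corpus below is invented purely to illustrate the pattern; on real data the ranking will depend on the dataset and the chosen parameters.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Toy corpus, invented only to show the comparison pattern
docs = ["loved this film", "great acting and story", "what a wonderful movie",
        "terrible plot", "boring and far too long", "awful, would not recommend"] * 10
labels = [1, 1, 1, 0, 0, 0] * 10

for clf in (MultinomialNB(), LinearSVC()):
    pipe = make_pipeline(CountVectorizer(), clf)
    scores = cross_val_score(pipe, docs, labels, cv=5)
    print(type(clf).__name__, scores.mean())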

Variants of NB and SVM are often used as baseline methods for text classification, but their performance varies greatly depending on the model variant, the features used, and the task/dataset. Based on these observations, Wang and Manning (2012) identified simple NB and SVM variants which outperform most published results on text datasets, sometimes providing a new state-of-the-art performance level.

Now let’s get into the implementation. Here I am using the IMDB dataset to classify movie reviews into positive and negative classes.

The sentiment classification task consists of predicting the polarity (positive or negative) of a given text.

To get the dataset, in your terminal run the following commands:

wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
gunzip aclImdb_v1.tar.gz
tar -xvf aclImdb_v1.tar

Tokenizing and term-document matrix creation

# texts_labels_from_folders, tokenize and TextClassifierData below come from the old fastai library (v0.7)
from fastai.nlp import *
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

PATH='data/aclImdb/'
names = ['neg','pos']
%ls {PATH}
aclImdb_v1.tar.gz imdbEr.txt imdb.vocab models/ README test/ tmp/ train/
%ls {PATH}train
aclImdb/ all_val/ neg/ tmp/ unsupBow.feat urls_pos.txt
all/ labeledBow.feat pos/ unsup/ urls_neg.txt urls_unsup.txt
%ls {PATH}train/pos | head
trn,trn_y = texts_labels_from_folders(f'{PATH}train',names)
val,val_y = texts_labels_from_folders(f'{PATH}test',names)
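If you don't have the old fastai library installed, texts_labels_from_folders is straightforward to reproduce: it reads every file in the neg/ and pos/ subfolders and returns the texts together with integer labels (0 for neg, 1 for pos). A minimal sketch:

import os, glob
import numpy as np

def texts_labels_from_folders(path, folders):
    # Read every review file under path/<folder>/ and label it with the folder's index
    texts, labels = [], []
    for idx, label in enumerate(folders):
        for fname in glob.glob(os.path.join(path, label, '*.*')):
            texts.append(open(fname, encoding='utf-8').read())
            labels.append(idx)
    return texts, np.array(labels)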

Here is the text of the first review:

trn[0]
"Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly."
trn_y[0]
0

CountVectorizer converts a collection of text documents to a matrix of token counts (part of sklearn.feature_extraction.text).

veczr = CountVectorizer(tokenizer=tokenize)

fit_transform(trn) finds the vocabulary in the training set and transforms the training set into a term-document matrix. Since we have to apply the same transformation to the validation set, the second line uses just the method transform(val). trn_term_doc and val_term_doc are sparse matrices; trn_term_doc[i] represents training document i and contains the count of each vocabulary word appearing in that document.

trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)
trn_term_doc
<25000x75132 sparse matrix of type '<class 'numpy.int64'>'
with 3749745 stored elements in Compressed Sparse Row format>
trn_term_doc[0]
<1x75132 sparse matrix of type '<class 'numpy.int64'>'
with 93 stored elements in Compressed Sparse Row format>
vocab = veczr.get_feature_names(); vocab[5000:5005]
['aussie', 'aussies', 'austen', 'austeniana', 'austens']
w0 = set([o.lower() for o in trn[0].split(' ')]); w0
len(w0)

92

veczr.vocabulary_['absurd']

1297
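Putting these two pieces together, we can query the term-document matrix directly: the column index returned by veczr.vocabulary_ tells us where a word's count lives in each row.

# Count of the word 'absurd' in the first training review:
# row 0 of the term-document matrix, column given by the vocabulary lookup
trn_term_doc[0, veczr.vocabulary_['absurd']]

This should come back as 2 for this review, since 'absurd' appears twice in the text above.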

Naive Bayes

We define the log-count ratio $r$ for each word $f$:

$r = \log \frac{\text{ratio of feature $f$ in positive documents}}{\text{ratio of feature $f$ in negative documents}}$

where the ratio of feature $f$ in positive documents is the number of times a positive document has the feature divided by the number of positive documents.
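As a hypothetical worked example (the counts are invented; the IMDB training set has 12,500 reviews per class): suppose a feature appears in 100 positive reviews but only 10 negative ones. With the +1 smoothing used in the code below,

$r = \log \frac{(100+1)/(12500+1)}{(10+1)/(12500+1)} = \log \frac{101}{11} \approx 2.22$

so this feature votes fairly strongly for the positive class, while a word that is equally common in both classes gets $r \approx 0$.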

def pr(y_i):
    # smoothed count of each feature in class y_i, divided by the number of documents in that class
    p = x[y==y_i].sum(0)
    return (p+1) / ((y==y_i).sum()+1)

x=trn_term_doc
y=trn_y

r = np.log(pr(1)/pr(0))                     # log-count ratio for every feature
b = np.log((y==1).mean() / (y==0).mean())   # log of the ratio of class priors

Here is the formula for Naive Bayes: we predict positive whenever the log-likelihood ratio plus the log prior ratio is greater than zero, i.e. when $x \cdot r^T + b > 0$.

pre_preds = val_term_doc @ r.T + b
preds = pre_preds.T>0
(preds==val_y).mean()
0.80691999999999997

…and binarized Naive Bayes.

x=trn_term_doc.sign()
r = np.log(pr(1)/pr(0))
pre_preds = val_term_doc.sign() @ r.T + b
preds = pre_preds.T>0
(preds==val_y).mean()
0.83016000000000001

Logistic regression

Here is how we can fit logistic regression where the features are the unigrams.

m = LogisticRegression(C=1e8, dual=True)
m.fit(x, y)
preds = m.predict(val_term_doc)
(preds==val_y).mean()
0.85504000000000002
m = LogisticRegression(C=1e8, dual=True)
m.fit(trn_term_doc.sign(), y)
preds = m.predict(val_term_doc.sign())
(preds==val_y).mean()
0.85487999999999997

…and the regularized version

m = LogisticRegression(C=0.1, dual=True)
m.fit(x, y)
preds = m.predict(val_term_doc)
(preds==val_y).mean()
0.88275999999999999
m = LogisticRegression(C=0.1, dual=True)
m.fit(trn_term_doc.sign(), y)
preds = m.predict(val_term_doc.sign())
(preds==val_y).mean()
0.88404000000000005
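A note if you run these logistic regression cells with a recent scikit-learn: dual=True is only supported by the liblinear solver, which is no longer the default, so you may need to request it explicitly. A sketch with the parameter values otherwise unchanged:

# With newer scikit-learn versions, pass the liblinear solver explicitly when using dual=True
m = LogisticRegression(C=0.1, dual=True, solver='liblinear')
m.fit(trn_term_doc.sign(), y)
preds = m.predict(val_term_doc.sign())
(preds==val_y).mean()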

Trigram with NB features

Our next model is a version of logistic regression with Naive Bayes features, as described in Wang and Manning's NBSVM paper. For every document we compute binarized features as described above, but this time we use bigrams and trigrams too. Each feature is then weighted by its log-count ratio, and a logistic regression model is trained on these weighted features to predict sentiment.

veczr = CountVectorizer(ngram_range=(1,3), tokenizer=tokenize, max_features=800000)
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)
trn_term_doc.shape
(25000, 800000)
vocab = veczr.get_feature_names()
vocab[200000:200005]
['by vast', 'by vengeance', 'by vengeance .', 'by vera', 'by vera miles']
y=trn_y
x=trn_term_doc.sign()
val_x = val_term_doc.sign()
r = np.log(pr(1) / pr(0))
b = np.log((y==1).mean() / (y==0).mean())

Here we fit regularized logistic regression where the features are the trigrams.

m = LogisticRegression(C=0.1, dual=True)
m.fit(x, y);
preds = m.predict(val_x)
(preds.T==val_y).mean()
0.90500000000000003
np.exp(r)
matrix([[ 0.94678, 0.85129, 0.78049, ..., 3. , 0.5 , 0.5 ]])

Here we fit regularized logistic regression where the features are the trigrams’ log-count ratios.

x_nb = x.multiply(r)
m = LogisticRegression(dual=True, C=0.1)
m.fit(x_nb, y);
val_x_nb = val_x.multiply(r)
preds = m.predict(val_x_nb)
(preds.T==val_y).mean()

0.91768000000000005

fast.ai NBSVM

sl=2000
# Here is how we get a model from a bag of words
md = TextClassifierData.from_bow(trn_term_doc, trn_y, val_term_doc, val_y, sl)

learner = md.dotprod_nb_learner()
learner.fit(0.02, 1, wds=1e-6, cycle_len=1)

[ 0. 0.0251 0.12003 0.91552]
learner.fit(0.02, 2, wds=1e-6, cycle_len=1)
[ 0. 0.02014 0.11387 0.92012] 
[ 1. 0.01275 0.11149 0.92124]
learner.fit(0.02, 2, wds=1e-6, cycle_len=1)
[ 0. 0.01681 0.11089 0.92129] 
[ 1. 0.00949 0.10951 0.92223]

I have used NB-SVM in the ongoing Kaggle Toxic Comment Classification competition to reach the top 8%. NB-SVM can be used for other NLP problems as well, but different feature engineering should be tried for each task.
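To reuse the trick outside this notebook, the NB-feature transform and the logistic regression can be wrapped into a small scikit-learn-style estimator. This is a minimal sketch under the same assumptions as above (binarized counts, +1 smoothing), not the exact implementation used in the competition:

import numpy as np
from scipy import sparse
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.linear_model import LogisticRegression

class NbSvmClassifier(BaseEstimator, ClassifierMixin):
    # Logistic regression on binarized counts weighted by NB log-count ratios
    def __init__(self, C=0.1):
        self.C = C

    def _pr(self, x, y, y_i):
        # smoothed feature counts for class y_i divided by the number of documents in that class
        p = x[y == y_i].sum(0)
        return (p + 1) / ((y == y_i).sum() + 1)

    def fit(self, x, y):
        x, y = sparse.csr_matrix(x).sign(), np.asarray(y)   # binarize counts
        self.r_ = sparse.csr_matrix(np.log(self._pr(x, y, 1) / self._pr(x, y, 0)))
        self.clf_ = LogisticRegression(C=self.C, dual=True, solver='liblinear')
        self.clf_.fit(x.multiply(self.r_), y)
        return self

    def predict(self, x):
        return self.clf_.predict(sparse.csr_matrix(x).sign().multiply(self.r_))

With the matrices built earlier, NbSvmClassifier(C=0.1).fit(trn_term_doc, trn_y) followed by (clf.predict(val_term_doc) == val_y).mean() should reproduce essentially the same pipeline as the manual steps above.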

Source: Artificial Intelligence on Medium