### Blog: How Negative Sampling work on word2vec?

During neural network training, it always adjust all neuron weight so that it learn how to do the prediction correctly. In NLP, we may face more than 100 k (or even 1M) words and it will cause performance (in term of time) concern. How can we reduce the training sample in a better way ?We have hierarchical softmax previously but word2vec introduces negative sampling methodology to resolve this problem.

### What is that?

When we try to predict word (we call it as context), within a certain window (e.g. 5), those word are considered as positive word. So that we can either use those positive word to predict the context word (CBOW) or using context word to predict positive word (skip-garm).

However, we need some false label to do the prediction. So we need to figure out some negative case for that. So we can pick some non-surrounding words (we call it as negative word) to be a false label.

### How do we select negative samples?

To reduce the number of neuron weight updating to reduce training time and having a better prediction result, negative sampling is introduced in word2vec . For example, we have 10 positive words and 1 predicting words, then the total number of neuron weight updating operations is 11 instead of updating whole corpus’s neuron weight.

Will introduce two way to do the sampling which are simple sampling and adjusted sampling

#### Simple Sampling

The basic sampling technique is picking data point randomly but it has a drawback which is high frequency data point will be pick according to the distribution.

For example, we have a bag including 3 apples, 10 oranges and 1 banana. If we pick 1 fruit from bag, the probability of orange is 0.71 (10/14), apple is 0.21 (3/14), banana is 0.07 (1/14).

#### Adjusted Sampling

Since we do not want to pick a high frequency words all the time as those words usually provide less value than rare words. In word2vec c implementation, they apply the power of 3/4 to the probability. For example, the probability of orange is 0.71 which is converted to 0.77 while banana’s probability is converted from 0.07 to 0.14. In that cases, the probability of banana is doubled.

# Copy from gensim

power = 0.75

for word_index in xrange(vocab_size):train_words_pow += wv.vocab[wv.index2word[word_index]].count**power

cumulative = 0.0

for word_index in xrange(vocab_size):cumulative += wv.vocab[wv.index2word[word_index]].count**power

self.cum_table[word_index] = round(cumulative / train_words_pow * domain)

What if the predicting word is picked from negative sampling? According to gensim implementation, it will try to pick another word again.

# Copy from gensim

word_indices = [predict_word.index]

while len(word_indices) < model.negative + 1:

w = model.cum_table.searchsorted(model.random.randint(model.cum_table[-1]))if w != predict_word.index:

word_indices.append(w)

### Hierarchical Softmax

What is the benefit of using hierarchical softmax? By design, it is a binary tree, so the computational complexity reduce to log2 V (log base is 2)

→ O(V) to O(log2 V)

High frequency words are assigned short codes and it spends less time to access it high frequency words then rare words. But it is also a disadvantage as it takes longer time to those low frequency words. That

Assume that the node is word vector, it is -1 (or 0) if we go left while it is 1 if going right.

# Copy from gensim

sgn = (-1.0) ** predict_word.codelprob = -log(expit(-sgn * prod_term))

model.running_training_loss += sum(lprob)

### Code

If you are using gensim, only need to define whether using negative sampling or hierarchical softmax by passing parameter is okay.

# Copy from gensim API

hs(int {1,0}) – If 1, hierarchical softmax will be used for model training. If set to 0, and negative is non-zero, negative sampling will be used.negative(int) – If > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used.

### General Practice

- According to paper, it suggests that using 5 ~ 20 negative words in smaller data set while only using 2–5 words for large dataset

### Like to learn?

I am Data Scientist in Bay Area. Focusing on state-of-the-art in Data Science, Artificial Intelligence , especially in NLP and platform related. Feel free to connect with me on LinkedIn or following me on Medium or Github.

## Leave a Reply