
Natural Language Processing


Natural Language Processing (NLP) has become an important part of modern systems. It is used extensively in search engines, conversational interfaces, document processors, and so on.


Machines can handle structured data well, but when it comes to working with free-form text, they have a hard time. The goal of NLP is to develop algorithms that enable computers to understand free-form text and derive meaning from natural language.

One of the most challenging things about processing free-form natural language is the sheer number of variations. Context plays a very important role in how a particular sentence is understood. Humans are great at this because we have been trained for many years; we immediately use our past knowledge to understand the context and know what the other person is talking about.

To address this issue, NLP researchers started developing various applications using machine learning approaches. To build such applications, we need to collect a large corpus of text and then train the algorithm to perform various tasks like categorizing text, analyzing sentiments, or modeling topics. These algorithms are trained to detect patterns in input text data and derive insights from it.

In this article, we will discuss various underlying concepts that are used to analyze text and build NLP applications. This will enable us to understand how to extract meaningful information from the given text data. We will use a Python package called Natural Language Toolkit (NLTK) to build these applications. Make sure that you install this before you proceed. You can install it by running the following command on your Terminal:

$ pip3 install nltk

You can find more information about NLTK at http://www.nltk.org.

In order to access all the datasets provided by NLTK, we need to download them. Open up a Python shell by typing the following on your Terminal:

$ python3

We are now inside the Python shell. Type the following to download the data:

>>> import nltk
>>> nltk.download()
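Running nltk.download() without arguments opens an interactive downloader. If you prefer to grab only the resources used in this article from a script, a minimal sketch like the following should work (the resource names assume a reasonably recent NLTK release):

import nltk

# Download only the corpora and models used in this article
for resource in ['punkt', 'wordnet', 'brown', 'names',
                 'movie_reviews', 'stopwords']:
    nltk.download(resource)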

We will also use a package called gensim in this article. It’s a robust semantic modeling library that’s useful for many applications. You can install it by running the following command on your Terminal:

$ pip3 install gensim

You might need another package called pattern for gensim to function properly. You can install it by running the following command on your Terminal:

$ pip3 install pattern

You can find more information about gensim at https://radimrehurek.com/gensim. Now that you have installed NLTK and gensim, let's proceed with the discussion.

Tokenizing text data

When we deal with text, we need to break it down into smaller pieces for analysis. This is where tokenization comes into the picture. It is the process of dividing the input text into a set of pieces like words or sentences. These pieces are called tokens. Depending on what we want to do, we can define our own methods to divide the text into many tokens. Let’s take a look at how to tokenize the input text using NLTK.

Create a new Python file and import the following packages:

from nltk.tokenize import sent_tokenize, \
        word_tokenize, WordPunctTokenizer

Define some input text that will be used for tokenization:

# Define input text 
input_text = "Do you know how tokenization works? It's actually quite interesting! Let's analyze a couple of sentences and figure it out."

Divide the input text into sentence tokens:

# Sentence tokenizer 
print("\nSentence tokenizer:")
print(sent_tokenize(input_text))

Divide the input text into word tokens:

# Word tokenizer 
print("\nWord tokenizer:")
print(word_tokenize(input_text))

Divide the input text into word tokens using word punct tokenizer:

# WordPunct tokenizer 
print("\nWord punct tokenizer:")
print(WordPunctTokenizer().tokenize(input_text))

If you run the code, you will see the three sets of tokens on your Terminal.

We can see that the sentence tokenizer divides the input text into sentences. The two word tokenizers behave differently when it comes to punctuation: for example, the word “It’s” is split differently by the WordPunct tokenizer as compared to the regular word tokenizer.
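If you want to see the difference on just that contraction, a quick sketch like this (separate from the script above) prints what each tokenizer produces; the regular word tokenizer keeps the trailing ’s together as a single clitic token, while the WordPunct tokenizer splits at every punctuation boundary:

from nltk.tokenize import word_tokenize, WordPunctTokenizer

# Compare how the two word tokenizers handle a contraction
print(word_tokenize("It's"))
print(WordPunctTokenizer().tokenize("It's"))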

Converting words to their base forms using stemming

Working with text involves dealing with a lot of variation. We have to deal with different forms of the same word and enable the computer to understand that these different words have the same base form. For example, the word sing can appear in many forms, such as sang, singing, singer, and so on; all of these words share a related base meaning. Humans can easily identify these base forms and derive context.

When we analyze text, it's useful to extract these base forms, as it enables us to extract useful statistics from the input text. Stemming is one way to achieve this. The goal of a stemmer is to reduce words from their different forms to a common base form. It is basically a heuristic process that cuts off the ends of words to extract their base forms. Let's see how to do it using NLTK.

Create a new python file and import the following packages:

from nltk.stem.porter import PorterStemmer 
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

Define some input words:

input_words = ['writing', 'calves', 'be', 'branded', 'horse', 'randomize', 
'possibly', 'provision', 'hospital', 'kept', 'scratchy', 'code']

Create objects for Porter, Lancaster, and Snowball stemmers:

# Create various stemmer objects 
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer('english')

Create a list of names for table display and format the output text accordingly:

#Create a list of stemmer names for display 
stemmer_names = ['PORTER', 'LANCASTER', 'SNOWBALL']
formatted_text = '{:>16}' * (len(stemmer_names) + 1)
print('\n', formatted_text.format('INPUT WORD', *stemmer_names),
'\n', '='*68)

Iterate through the words and stem them using the three stemmers:

# Stem each word and display the output
for word in input_words:
    output = [word, porter.stem(word),
              lancaster.stem(word), snowball.stem(word)]
    print(formatted_text.format(*output))

If you run the code, you will see a table comparing the three stemmers on your Terminal.

Let’s talk a bit about the three stemming algorithms that are being used here. All of them basically try to achieve the same goal. The difference between them is the level of strictness that’s used to arrive at the base form.

The Porter stemmer is the least strict, and the Lancaster stemmer is the strictest. If you closely observe the outputs, you will notice the differences. The stemmers behave differently with words like possibly or provision. The stemmed outputs obtained from the Lancaster stemmer are a bit obfuscated because it reduces the words a lot; at the same time, the algorithm is really fast. A good rule of thumb is to use the Snowball stemmer because it's a good trade-off between speed and strictness.

Converting words to their base forms using lemmatization

Lemmatization is another way of reducing words to their base forms. In the previous section, we saw that some of the base forms obtained from the stemmers didn't make sense; for example, all three stemmers reduced calves to calv, which is not a real word. Lemmatization takes a more structured approach to solving this problem.

The lemmatization process uses a vocabulary and morphological analysis of words. It obtains the base forms by removing the inflectional word endings such as ing or ed. This base form of any word is known as the lemma. If you lemmatize the word calves, you should get calf as the output. One thing to note is that the output depends on whether the word is a verb or a noun. Let’s take a look at how to do this using NLTK.

Create a new python file and import the following packages:

from nltk.stem import WordNetLemmatizer

Define some input words. We will be using the same set of words that we used in the previous section so that we can compare the outputs.

input_words = ['writing', 'calves', 'be', 'branded', 'horse', 'randomize', 
'possibly', 'provision', 'hospital', 'kept', 'scratchy', 'code']

Create a lemmatizer object:

# Create lemmatizer object 
lemmatizer = WordNetLemmatizer()

Create a list of lemmatizer names for table display and format the text accordingly:

# Create a list of lemmatizer names for display 
lemmatizer_names = ['NOUN LEMMATIZER', 'VERB LEMMATIZER']
formatted_text = '{:>24}' * (len(lemmatizer_names) + 1)
print('\n', formatted_text.format('INPUT WORD', *lemmatizer_names),
'\n', '='*75)

Iterate through the words and lemmatize the words using Noun and Verb lemmatizers:

# Lemmatize each word and display the output
for word in input_words:
    output = [word, lemmatizer.lemmatize(word, pos='n'),
              lemmatizer.lemmatize(word, pos='v')]
    print(formatted_text.format(*output))

If you run the code, you will see a table comparing the noun and verb lemmatizers on your Terminal.

We can see that the noun lemmatizer works differently than the verb lemmatizer for words like writing or calves. If you compare these outputs to the stemmer outputs, you will see that there are differences too. The lemmatizer outputs are all meaningful, whereas the stemmer outputs may or may not be.
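If you only want to confirm this behavior for a couple of words, a small standalone sketch like the one below is enough; it simply prints the noun and verb lemmas side by side:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for word in ['writing', 'calves']:
    # Compare the lemma obtained when the word is treated as a noun vs. a verb
    print(word, '-> noun:', lemmatizer.lemmatize(word, pos='n'),
          '| verb:', lemmatizer.lemmatize(word, pos='v'))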

Dividing text data into chunks

Text data usually needs to be divided into pieces for further analysis. This process is known as chunking, and it is used frequently in text analysis. The conditions used to divide the text into chunks can vary based on the problem at hand. Chunking is not the same as tokenization, even though both divide text into pieces: during chunking we do not adhere to any grammatical constraints, and the only requirement is that the output chunks are meaningful for the task at hand.

When we deal with large text documents, it becomes important to divide the text into chunks to extract meaningful information. In this section, we will see how to divide the input text into a number of pieces.

Create a new python file and import the following packages:

import numpy as np 
from nltk.corpus import brown

Define a function to divide the input text into chunks. The first parameter is the text and the second parameter is the number of words in each chunk:

# Split the input text into chunks, where
# each chunk contains N words
def chunker(input_data, N):
    input_words = input_data.split(' ')
    output = []

Iterate through the words and divide them into chunks using the input parameter. The function returns a list:

    cur_chunk = []
    count = 0
    for word in input_words:
        cur_chunk.append(word)
        count += 1
        if count == N:
            output.append(' '.join(cur_chunk))
            count, cur_chunk = 0, []

    output.append(' '.join(cur_chunk))

    return output

Define the main function and read the input data using the Brown corpus. We will read 12,000 words in this case. You are free to read as many words as you want.

if __name__=='__main__':
    # Read the first 12000 words from the Brown corpus
    input_data = ' '.join(brown.words()[:12000])

Define the number of words in each chunk:

    # Define the number of words in each chunk
    chunk_size = 700

Divide the input text into chunks and display the output:

    chunks = chunker(input_data, chunk_size)
    print('\nNumber of text chunks =', len(chunks), '\n')
    for i, chunk in enumerate(chunks):
        print('Chunk', i+1, '==>', chunk[:50])

If you run the code, you will see the number of chunks and the first 50 characters of each chunk on your Terminal.

Extracting the frequency of terms using a Bag of Words model

One of the main goals of text analysis is to convert text into numeric form so that we can use machine learning on it. Let’s consider text documents that contain many millions of words. In order to analyze these documents, we need to extract the text and convert it into a form of numeric representation.

Machine learning algorithms need numeric data to work with so that they can analyze the data and extract meaningful information. This is where the Bag of Words model comes into the picture. This model extracts a vocabulary from all the words in the documents and builds a model using a document-term matrix. This allows us to represent every document as a bag of words: we just keep track of word counts and disregard the grammatical details and the word order.

Let’s see what a document-term matrix is all about. A document term matrix is basically a table that gives us counts of various words that occur in the document. So a text document can be represented as a weighted combination of various words. We can set thresholds and choose words that are more meaningful. In a way, we are building a histogram of all the words in the document that will be used as a feature vector. This feature vector is used for text classification.

Consider the following sentences:

  • Sentence 1: The children are playing in the hall
  • Sentence 2: The hall has a lot of space
  • Sentence 3: Lots of children like playing in an open space

If you consider all three sentences, we have the following 14 unique words:

  • the
  • children
  • are
  • playing
  • in
  • hall
  • has
  • a
  • lot
  • of
  • space
  • like
  • an
  • open

There are 14 distinct words here. Let’s construct a histogram for each sentence by using the word count in each sentence. Each feature vector will be 14-dimensional because we have 14 distinct words overall:

  • Sentence 1: [2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
  • Sentence 2: [1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0]
  • Sentence 3: [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1]

Now that we have extracted these feature vectors, we can use machine learning algorithms to analyze this data.
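As a quick sanity check, here is a minimal sketch (separate from the code that follows) that builds this kind of document-term matrix with scikit-learn's CountVectorizer. Note that its default tokenizer lowercases the text and drops single-character tokens such as a, so the resulting vectors will differ slightly from the hand-built histograms above:

from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    'The children are playing in the hall',
    'The hall has a lot of space',
    'Lots of children like playing in an open space'
]

# Each row of the matrix is a sentence, each column a word from the vocabulary
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # use get_feature_names() on older versions
print(matrix.toarray())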

Let's see how to build a Bag of Words model using the Brown corpus from NLTK and the CountVectorizer class from scikit-learn. Create a new Python file and import the following packages (the chunker function comes from the file we wrote in the previous section, saved as text_chunker.py):

import numpy as np 
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import brown
from text_chunker import chunker

Read the input data from the Brown corpus. We will read 5,400 words. You are free to read as many words as you want.

# Read the data from the Brown corpus 
input_data = ' '.join(brown.words()[:5400])

Define the number of words in each chunk:

# Number of words in each chunk 
chunk_size = 800

Divide the input text into chunks:

text_chunks = chunker(input_data, chunk_size)

Convert the chunks into dictionary items:

# Convert to dict items
chunks = []
for count, chunk in enumerate(text_chunks):
    d = {'index': count, 'text': chunk}
    chunks.append(d)

Extract the document-term matrix, where we get the count of each word in each chunk. We will achieve this using the CountVectorizer class, passing it two input parameters: the minimum document frequency (min_df) and the maximum document frequency (max_df). Here, document frequency refers to the number of chunks in which a word occurs; words outside this range are excluded from the vocabulary.

# Extract the document term matrix 
count_vectorizer = CountVectorizer(min_df=7, max_df=20)
document_term_matrix = count_vectorizer.fit_transform([chunk['text'] for chunk in chunks])

Extract the vocabulary and display it. The vocabulary refers to the list of distinct words that were extracted in the previous step.

# Extract the vocabulary and display it 
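# Note: scikit-learn 1.0+ renames this method to get_feature_names_out()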
vocabulary = np.array(count_vectorizer.get_feature_names())
print("\nVocabulary:\n", vocabulary)

Generate the names for display:

# Generate names for chunks
chunk_names = []
for i in range(len(text_chunks)):
    chunk_names.append('Chunk-' + str(i+1))

Print the document term matrix:

# Print the document term matrix
print("\nDocument term matrix:")
formatted_text = '{:>12}' * (len(chunk_names) + 1)
print('\n', formatted_text.format('Word', *chunk_names), '\n')
for word, item in zip(vocabulary, document_term_matrix.T):
    # 'item' is a 'csr_matrix' data structure
    output = [word] + [str(freq) for freq in item.data]
    print(formatted_text.format(*output))

If you run the code, you will see the document term matrix on your Terminal.

We can see all the words in the document term matrix and the corresponding counts in each chunk.

Building a category predictor

A category predictor is used to predict the category to which a given piece of text belongs. This is frequently used in text classification to categorize text documents. Search engines frequently use this tool to order the search results by relevance. For example, let’s say that we want to predict whether a given sentence belongs to sports, politics, or science. To do this, we build a corpus of data and train an algorithm. This algorithm can then be used for inference on unknown data.

In order to build this predictor, we will use a statistic called Term Frequency — Inverse Document Frequency (tf-idf). In a set of documents, we need to understand the importance of each word. The tf-idf statistic helps us understand how important a given word is to a document in a set of documents.

Let’s consider the first part of this statistic. The Term Frequency (tf) is basically a measure of how frequently each word appears in a given document. Since different documents have a different number of words, the exact numbers in the histogram will vary. In order to have a level playing field, we need to normalize the histograms. So we divide the count of each word by the total number of words in a given document to obtain the term frequency.

The second part of the statistic is the Inverse Document Frequency (idf), which is a measure of how unique a word is to a document in the given set of documents. When we compute the term frequency, the assumption is that all the words are equally important. But we cannot just rely on the frequency of each word, because words like and and the appear a lot. To balance the frequencies of these commonly occurring words, we need to reduce their weights and increase the weights of rare words. This also helps us identify words that are unique to each document, which in turn helps us formulate a distinctive feature vector.

To compute this statistic, we divide the number of documents that contain the given word by the total number of documents. This ratio is essentially the fraction of documents that contain the word. The inverse document frequency is then calculated by taking the negative logarithm of this ratio.
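As a rough illustration (not part of the category predictor code), here is how the raw statistic could be computed by hand on a hypothetical toy corpus; keep in mind that scikit-learn's TfidfTransformer uses a smoothed variant of this formula by default:

import math

# Hypothetical toy corpus: each document is a list of words
documents = [
    ['the', 'children', 'are', 'playing', 'in', 'the', 'hall'],
    ['the', 'hall', 'has', 'a', 'lot', 'of', 'space'],
    ['lots', 'of', 'children', 'like', 'playing', 'in', 'an', 'open', 'space']
]

def tf_idf(word, document, documents):
    # Term frequency: word count normalized by the document length
    tf = document.count(word) / len(document)

    # Fraction of documents that contain the word
    doc_fraction = sum(1 for doc in documents if word in doc) / len(documents)

    # Inverse document frequency: negative logarithm of that fraction
    idf = -math.log(doc_fraction)

    return tf * idf

print(round(tf_idf('are', documents[0], documents), 3))  # appears in only one document
print(round(tf_idf('the', documents[0], documents), 3))  # appears in two documents, so lower idf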

We then combine term frequency and inverse document frequency to formulate a feature vector to categorize documents. Let’s see how to build a category predictor.

Create a new python file and import the following packages:

from sklearn.datasets import fetch_20newsgroups 
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

Define the map of categories that will be used for training. We will be using five categories in this case. The keys in this dictionary object refer to the names in the scikit-learn dataset.

# Define the category map 
category_map = {'talk.politics.misc': 'Politics', 'rec.autos': 'Autos',
'rec.sport.hockey': 'Hockey', 'sci.electronics': 'Electronics',
'sci.med': 'Medicine'}

Get the training dataset using fetch_20newsgroups:

# Get the training dataset 
training_data = fetch_20newsgroups(subset='train',
categories=category_map.keys(), shuffle=True, random_state=5)

Extract the term counts using the CountVectorizer object:

# Build a count vectorizer and extract term counts 
count_vectorizer = CountVectorizer()
train_tc = count_vectorizer.fit_transform(training_data.data)
print("\nDimensions of training data:", train_tc.shape)

Create Term Frequency — Inverse Document Frequency (tf-idf) transformer and train it using the data:

# Create the tf-idf transformer 
tfidf = TfidfTransformer()
train_tfidf = tfidf.fit_transform(train_tc)

Define some sample input sentences that will be used for testing:

# Define test data 
input_data = [
'You need to be careful with cars when you are driving on slippery roads',
'A lot of devices can be operated wirelessly',
'Players need to be careful when they are close to goal posts',
'Political debates help us understand the perspectives of both sides'
]

Train a Multinomial Bayes classifier using the training data:

# Train a Multinomial Naive Bayes classifier 
classifier = MultinomialNB().fit(train_tfidf, training_data.target)

Transform the input data using the count vectorizer:

# Transform input data using count vectorizer 
input_tc = count_vectorizer.transform(input_data)

Transform the vectorized data using the tf-idf transformer so that it can be run through the inference model:

# Transform vectorized data using tfidf transformer 
input_tfidf = tfidf.transform(input_tc)

Predict the output using the tf-idf transformed vector:

# Predict the output categories 
predictions = classifier.predict(input_tfidf)

Print the output category for each sample in the input test data:

# Print the outputs
for sent, category in zip(input_data, predictions):
    print('\nInput:', sent, '\nPredicted category:',
          category_map[training_data.target_names[category]])

If you run the code, you will see the predicted category for each input sentence on your Terminal.

We can see intuitively that the predicted categories are correct.

Constructing a gender identifier

Gender identification is an interesting problem. In this case, we will use a heuristic to construct a feature vector and use it to train a classifier. The heuristic used here is the last N letters of a given name. For example, if the name ends with ia, it's most likely a female name, such as Amelia or Genelia. On the other hand, if the name ends with rk, it's likely a male name, such as Mark or Clark. Since we are not sure of the exact number of letters to use, we will play around with this parameter and find out what the best answer is. Let's see how to do it.

Create a new python file and import the following packages:

import random 

from nltk import NaiveBayesClassifier
from nltk.classify import accuracy as nltk_accuracy
from nltk.corpus import names

Define a function to extract the last N letters from the input word:

# Extract last N letters from the input word
# and that will act as our "feature"
def extract_features(word, N=2):
    last_n_letters = word[-N:]
    return {'feature': last_n_letters.lower()}

Define the main function and extract training data from the names corpus that ships with NLTK. This data contains labeled male and female names:

if __name__=='__main__':
    # Create training data using labeled names available in NLTK
    male_list = [(name, 'male') for name in names.words('male.txt')]
    female_list = [(name, 'female') for name in names.words('female.txt')]
    data = (male_list + female_list)

Seed the random number generator and shuffle the data:

    # Seed the random number generator
    random.seed(5)

    # Shuffle the data
    random.shuffle(data)

Create some sample names that will be used for testing:

    # Create test data
    input_names = ['Alexander', 'Danielle', 'David', 'Cheryl']

Define the percentage of data that will be used for training and testing:

    # Define the number of samples used for train and test
    num_train = int(0.8 * len(data))

We will be using the last N characters as the feature vector to predict the gender. We will vary this parameter to see how the performance changes. In this case, N will go from 1 to 5:

    # Iterate through different lengths to compare the accuracy
    for i in range(1, 6):
        print('\nNumber of end letters:', i)
        features = [(extract_features(n, i), gender) for (n, gender) in data]

Separate the data into training and testing:

        train_data, test_data = features[:num_train], features[num_train:]

Build a NaiveBayes Classifier using the training data:

        classifier = NaiveBayesClassifier.train(train_data)

Compute the accuracy of the classifier using the inbuilt method available in NLTK:

        # Compute the accuracy of the classifier
        accuracy = round(100 * nltk_accuracy(classifier, test_data), 2)
        print('Accuracy = ' + str(accuracy) + '%')

Predict the output for each name in the input test list:

        # Predict outputs for input names using the trained classifier model
        for name in input_names:
            print(name, '==>', classifier.classify(extract_features(name, i)))

If you run the code, you will see the accuracy for each number of end letters, along with the predicted outputs for the test names, on your Terminal.

We can see that the accuracy peaked at two letters and then started decreasing after that.

Building a sentiment analyzer

Sentiment analysis is the process of determining the sentiment of a given piece of text. For example, it can be used to determine whether a movie review is positive or negative. This is one of the most popular applications of natural language processing. We can add more categories as well, depending on the problem at hand. This technique is generally used to get a sense of how people feel about a particular product, brand, or topic. It is frequently used to analyze marketing campaigns, opinion polls, social media presence, product reviews on e-commerce sites, and so on. Let's see how to determine the sentiment of a movie review.

We will use a Naive Bayes classifier to build this sentiment analyzer. We first need to extract all the unique words from the text. The NLTK classifier needs this data to be arranged in the form of a dictionary so that it can ingest it. Once we divide the text data into training and testing datasets, we will train the Naive Bayes classifier to classify the reviews into positive and negative. We will also print out the top informative words that indicate positive and negative reviews. This information is interesting because it tells us what words are being used to denote various reactions.

Create a new python file and import the following packages:

from nltk.corpus import movie_reviews 
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy as nltk_accuracy

Define a function to construct a dictionary object based on the input words and return it:

# Extract features from the input list of words
def extract_features(words):
    return dict([(word, True) for word in words])

Define the main function and load the labeled movie reviews:

if __name__=='__main__':
    # Load the reviews from the corpus
    fileids_pos = movie_reviews.fileids('pos')
    fileids_neg = movie_reviews.fileids('neg')

Extract the features from the movie reviews and label it accordingly:

    # Extract the features from the reviews
    features_pos = [(extract_features(movie_reviews.words(
            fileids=[f])), 'Positive') for f in fileids_pos]
    features_neg = [(extract_features(movie_reviews.words(
            fileids=[f])), 'Negative') for f in fileids_neg]

Define the split between training and testing. In this case, we will allocate 80% for training and 20% for testing:

    # Define the train and test split (80% and 20%)
    threshold = 0.8
    num_pos = int(threshold * len(features_pos))
    num_neg = int(threshold * len(features_neg))

Separate the feature vectors for training and testing:

    # Create training and testing datasets
    features_train = features_pos[:num_pos] + features_neg[:num_neg]
    features_test = features_pos[num_pos:] + features_neg[num_neg:]

Print the number of datapoints used for training and testing:

    # Print the number of datapoints used
    print('\nNumber of training datapoints:', len(features_train))
    print('Number of test datapoints:', len(features_test))

Train a Naive Bayes classifier using the training data and compute the accuracy using the inbuilt method available in NLTK:

    # Train a Naive Bayes classifier
    classifier = NaiveBayesClassifier.train(features_train)
    print('\nAccuracy of the classifier:', nltk_accuracy(
            classifier, features_test))

Print the top N most informative words:

    N = 15
    print('\nTop ' + str(N) + ' most informative words:')
    for i, item in enumerate(classifier.most_informative_features()):
        print(str(i+1) + '. ' + item[0])
        if i == N - 1:
            break

Define sample sentences to be used for testing:

    # Test input movie reviews
    input_reviews = [
        'The costumes in this movie were great',
        'I think the story was terrible and the characters were very weak',
        'People say that the director of the movie is amazing',
        'This is such an idiotic movie. I will not recommend it to anyone.'
    ]

Iterate through the sample data and predict the output:

print("\nMovie review predictions:") 
for review in input_reviews:
print("\nReview:", review)

Compute the probabilities for each class:

        # Compute the probabilities
        probabilities = classifier.prob_classify(extract_features(review.split()))

Pick the maximum value among the probabilities:

        # Pick the maximum value
        predicted_sentiment = probabilities.max()

Print the predicted output class (positive or negative sentiment):

        # Print outputs
        print("Predicted sentiment:", predicted_sentiment)
        print("Probability:", round(probabilities.prob(predicted_sentiment), 2))

If you run the code, you will see the accuracy of the classifier, the top 15 most informative words, and the predictions for the sample reviews on your Terminal.

We can see and verify intuitively that the predictions are correct.

Topic modeling using Latent Dirichlet Allocation

Topic modeling is the process of identifying patterns in text data that correspond to a topic. If the text contains multiple topics, then this technique can be used to identify and separate those themes within the input text. We do this to uncover hidden thematic structure in the given set of documents.

Topic modeling helps us organize our documents in an optimal way, which can then be used for analysis. One thing to note about topic modeling algorithms is that they don't need any labeled data. It is a form of unsupervised learning, where the algorithm identifies the patterns on its own. Given the enormous volumes of text data generated on the Internet, topic modeling becomes very important because it enables us to summarize all this data, which would otherwise not be possible.

Latent Dirichlet Allocation is a topic modeling technique where the underlying intuition is that a given piece of text is a combination of multiple topics. Let’s consider the following sentence — Data visualization is an important tool in financial analysis. This sentence has multiple topics like data, visualization, finance, and so on. This particular combination helps us identify this text in a large document. In essence, it is a statistical model that tries to capture this idea and create a model based on it. The model assumes that documents are generated from a random process based on these topics. A topic is basically a distribution over a fixed vocabulary of words. Let’s see how to do topic modeling in Python.

We will use a library called gensim in this section. We have already installed this library in the first section of this article. Make sure that you have it before you proceed. Create a new python file and import the following packages:

from nltk.tokenize import RegexpTokenizer 
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from gensim import models, corpora

Define a function to load the input data. The input file contains 10 line-separated sentences:

# Load input data
def load_data(input_file):
    data = []
    with open(input_file, 'r') as f:
        for line in f.readlines():
            data.append(line[:-1])

    return data

Define a function to process the input text. The first step is to tokenize it:

# Processor function for tokenizing, removing stop
# words, and stemming
def process(input_text):
    # Create a regular expression tokenizer
    tokenizer = RegexpTokenizer(r'\w+')

We then need to stem the tokenized text:

    # Create a Snowball stemmer
    stemmer = SnowballStemmer('english')

We need to remove the stop words from the input text because they don’t add information. Let’s get the list of stop-words:

    # Get the list of stop words
    stop_words = stopwords.words('english')

Tokenize the input string:

    # Tokenize the input string
    tokens = tokenizer.tokenize(input_text.lower())

Remove the stop-words:

    # Remove the stop words
    tokens = [x for x in tokens if not x in stop_words]

Stem the tokenized words and return the list:

    # Perform stemming on the tokenized words
    tokens_stemmed = [stemmer.stem(x) for x in tokens]

    return tokens_stemmed

Define the main function and load the input data from the file data.txt provided to you:

if __name__=='__main__':
    # Load input data
    data = load_data('data.txt')

Tokenize the text:

    # Create a list for sentence tokens
    tokens = [process(x) for x in data]

Create a dictionary based on the tokenized sentences:

    # Create a dictionary based on the sentence tokens
    dict_tokens = corpora.Dictionary(tokens)

Create a document term matrix using the sentence tokens:

    # Create a document-term matrix
    doc_term_mat = [dict_tokens.doc2bow(token) for token in tokens]

We need to provide the number of topics as the input parameter. In this case, we know that the input text has two distinct topics. Let’s specify that.

    # Define the number of topics for the LDA model
    num_topics = 2

Generate the Latent Dirichlet Model:

    # Generate the LDA model
    ldamodel = models.ldamodel.LdaModel(doc_term_mat,
            num_topics=num_topics, id2word=dict_tokens, passes=25)

Print the top 5 contributing words for each topic:

    num_words = 5
    print('\nTop ' + str(num_words) + ' contributing words to each topic:')
    for item in ldamodel.print_topics(num_topics=num_topics, num_words=num_words):
        print('\nTopic', item[0])

        # Print the contributing words along with their relative contributions
        list_of_strings = item[1].split(' + ')
        for text in list_of_strings:
            weight = text.split('*')[0]
            word = text.split('*')[1]
            print(word, '==>', str(round(float(weight) * 100, 2)) + '%')

If you run the code, you will see the top contributing words for each topic on your Terminal.

We can see that it does a reasonably good job of separating the two topics — mathematics and history. If you look into the text, you can verify that each sentence is either about mathematics or history.

Source: Artificial Intelligence on Medium
