ProjectBlog: Tokenization And POS Tagging Using NLTK Library

NLTK is a rich Python library for many natural language processing (NLP) tasks, and tokenization and POS tagging are among them. Here I am going to explain what these two operations are and how to apply them to a given text file.

Tokens are the individual words of a text, and tokenization is the process of taking a sentence or a longer piece of text and breaking it into those individual words. For example, if the given sentence is “NLP is easy to understand”, then NLP, is, easy, to, and understand are the tokens, and breaking the sentence into them is tokenization.

To make the concept concrete, I have worked through an example in Python, using a text file named ‘what_is_nlp.txt’ as input.


The following code reads the text file:
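A minimal sketch of this step, using only the Python standard library. The helper name read_text_file and the UTF-8 encoding are assumptions made for illustration; the file name is the one mentioned above.

```python
from pathlib import Path


def read_text_file(path, encoding="utf-8"):
    """Return the full contents of a text file as a single string."""
    return Path(path).read_text(encoding=encoding)


# Assumes what_is_nlp.txt sits in the current working directory.
path = Path("what_is_nlp.txt")
if path.exists():
    text = read_text_file(path)
    print(text)
else:
    print(f"{path} not found; place it next to this script first.")
```

Reading the whole file into one string is convenient here because the NLTK tokenizers take a string as input.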

Tokenization of the text document into sentences
Further tokenization into words

Part-of-speech tagging (POS tagging, or POST), also called word-category disambiguation, is the process of marking up each word in a text (corpus) with the part of speech that the word belongs to.

POS Tagging

Below is the list of the different POS tags, each with a description of the part of speech it is used for.
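For English, nltk.pos_tag uses the Penn Treebank tagset by default. As a quick reference, the sketch below keeps the 36 word-level tags of that tagset in a dictionary. The names PENN_TREEBANK_TAGS and describe_tag are made up for this example, and punctuation tags are left out.

```python
# Penn Treebank word-level POS tags and what each one marks.
PENN_TREEBANK_TAGS = {
    "CC": "coordinating conjunction",
    "CD": "cardinal number",
    "DT": "determiner",
    "EX": "existential there",
    "FW": "foreign word",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "JJR": "adjective, comparative",
    "JJS": "adjective, superlative",
    "LS": "list item marker",
    "MD": "modal",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "NNPS": "proper noun, plural",
    "PDT": "predeterminer",
    "POS": "possessive ending",
    "PRP": "personal pronoun",
    "PRP$": "possessive pronoun",
    "RB": "adverb",
    "RBR": "adverb, comparative",
    "RBS": "adverb, superlative",
    "RP": "particle",
    "SYM": "symbol",
    "TO": "to",
    "UH": "interjection",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBG": "verb, gerund or present participle",
    "VBN": "verb, past participle",
    "VBP": "verb, non-3rd person singular present",
    "VBZ": "verb, 3rd person singular present",
    "WDT": "wh-determiner",
    "WP": "wh-pronoun",
    "WP$": "possessive wh-pronoun",
    "WRB": "wh-adverb",
}


def describe_tag(tag):
    """Return the description of a POS tag, or a fallback for unlisted tags."""
    return PENN_TREEBANK_TAGS.get(tag, "unknown tag")


print(describe_tag("VBZ"))  # verb, 3rd person singular present
print(describe_tag("JJ"))   # adjective
```

Looking up each tag from a tagged sentence in this dictionary turns output such as ('is', 'VBZ') into something readable.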

The Python code can be obtained from

