Naive Bayes Text Classification in Kotlin for Android, Without TensorFlow
Text classification is an important task in Natural Language Processing because of its wide range of uses. In this post, we will approach it in a non-deep-learning way, without TensorFlow or neural networks. The classifier will run inside an Android application, so we need to write it in Kotlin or Java.
But why Kotlin? Why not TensorFlow or Python?
TensorFlow and TensorFlow Lite can run efficiently ( sometimes in mind-blowing ways ) on Android. And if this algorithm can be created in Kotlin ( native to Android ), a similar one could be created in any programming language, like C, C++, or even Swift ( native to iOS ).
Sometimes a classifier coded natively for a platform can perform far better than TensorFlow or its APIs. We also get more control over how it works and how inference is performed.
Which Machine Learning algorithm are we going to use? What are we exactly creating?
We will build a Naive Bayes text classifier in Kotlin which will ultimately run on an Android device.
Talking about Naive Bayes text classification:

Naive Bayes text classification uses the power of Bayes’ theorem to classify a document ( text ) into a certain class:

P( C | X ) = P( X | C ) P( C ) / P( X )

If we cast the equation according to our needs for text classification, it becomes:

P( C | x₁ , x₂ … xₙ ) ∝ P( C ) P( x₁ | C ) P( x₂ | C ) … P( xₙ | C )

where we represent our document as tokens x₁, x₂ … xₙ and C is the class for which we will calculate the probability. The denominator P( x₁ , x₂ … xₙ ) is omitted, since for both classes ( C₁ and C₂ ) it remains constant and acts only as a normalizing constant.
We will calculate the probabilities for two classes, namely SPAM ( C₁ ) and HAM ( C₂ ). The one with the higher probability will be our output.
For each class, we have a vocabulary — a set of words that occur in spam or ham messages — which represents that class’s corpus.
Let’s Start With Kotlin.
First, we will define our corpora: two bags of words which contain the spam and ham words respectively ( the second one is named negativeBagOfWords ).
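A minimal sketch of what these corpora might look like. The word lists, and the name positiveBagOfWords for the spam bag, are illustrative assumptions; a real app would use a much larger labeled vocabulary:

```kotlin
// Illustrative corpora; real ones would come from a labeled spam/ham dataset.
// positiveBagOfWords is an assumed name for the spam bag; negativeBagOfWords
// follows the article and holds the ham words.
val positiveBagOfWords = arrayOf(
    "win", "winner", "free", "prize", "claim", "urgent", "offer", "cash"
)
val negativeBagOfWords = arrayOf(
    "meeting", "project", "schedule", "report", "team", "lunch", "tomorrow", "notes"
)
```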
Now, we create a new class named Classifier for handling the classification task. We need to define two constants and a method which extracts tokens from a given piece of text ( by removing unnecessary words, punctuation, etc. ), so that getTokens( document ) = tokens. Hence we can transform a document D into a set of tokens x₁ , x₂ … xₙ.
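One possible sketch of this class — the constants and the stop-word list here are assumptions for illustration, not the article’s exact values:

```kotlin
class Classifier {

    companion object {
        // Two illustrative constants: labels for the classifier's output.
        const val CLASS_SPAM = "SPAM"
        const val CLASS_HAM = "HAM"
    }

    // An assumed list of "unnecessary" words to drop before classification.
    private val stopWords = setOf("a", "an", "the", "is", "are", "of", "to", "and", "in")

    // getTokens( document ) = tokens: lowercase, strip punctuation, drop stop words.
    fun getTokens(document: String): Array<String> {
        return document.lowercase()
            .replace(Regex("[^a-z ]"), " ")
            .split(" ")
            .filter { it.isNotBlank() && it !in stopWords }
            .toTypedArray()
    }
}
```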
Finding the probabilities
First, we need to find P( C ), the class prior. This is nothing but the fraction of words from both corpora that belong to class C.
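A sketch of this prior, assuming the two word arrays defined earlier ( the function name is illustrative ):

```kotlin
// P( C ): the fraction of all corpus words that belong to class C.
fun findClassProbability(classCorpus: Array<String>, otherCorpus: Array<String>): Double {
    val totalWords = classCorpus.size + otherCorpus.size
    return classCorpus.size.toDouble() / totalWords
}
```

With equal-sized corpora, both priors come out to 0.5.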
Next, we need to find P( X | C ), the probability of the document X given that it belongs to class C. Before that, we will need a method to find P( xᵢ | C ), where xᵢ is a token in the given document.
class_vocab is one of the corpora; it represents the C in P( xᵢ | C ). Wondering where the 1 came from? That’s Laplace smoothing. If P( xᵢ | C ) were 0 whenever xᵢ does not exist in our corpus, the whole product P( X | C ) would become 0. Adding 1 solves this problem.
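A sketch of such a method. I assume add-one smoothing in a common form here — 1 added to the numerator and the number of distinct words added to the denominator; the article’s exact denominator may differ:

```kotlin
// P( xᵢ | C ) with Laplace ( add-one ) smoothing.
// classVocab plays the role of class_vocab in the text.
fun findTokenProbability(token: String, classVocab: Array<String>): Double {
    val count = classVocab.count { it == token }
    // +1 in the numerator avoids zero probabilities; the denominator is
    // offset by the number of distinct words so probabilities stay below 1.
    return (count + 1.0) / (classVocab.size + classVocab.distinct().size)
}
```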
Now, we need to multiply all the P( xᵢ | C ) together and finally multiply it with P( C ) which is our class probability in the method below.
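A sketch of that method, reusing the smoothed token probability in the assumed form from above ( the function name is illustrative ):

```kotlin
// P( C | X ) ∝ P( C ) · Π P( xᵢ | C ), with add-one smoothing per token.
fun findClassScore(
    tokens: Array<String>,
    classVocab: Array<String>,
    classProbability: Double
): Double {
    var score = classProbability // start from the prior P( C )
    for (token in tokens) {
        val count = classVocab.count { it == token }
        score *= (count + 1.0) / (classVocab.size + classVocab.distinct().size)
    }
    return score
}
```

In a production classifier you would sum logarithms instead of multiplying raw probabilities, to avoid floating-point underflow on long documents.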
That’s all. Now we need to check which class has the higher likelihood.
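Putting the pieces together, a compact end-to-end sketch — the classify signature, vocabularies, and smoothing form are illustrative assumptions, not the article’s exact code:

```kotlin
// Score both classes for a document and return the label of the higher one.
fun classify(document: String, spamVocab: Array<String>, hamVocab: Array<String>): String {
    // Tokenize: lowercase, strip punctuation ( stop-word removal skipped for brevity ).
    val tokens = document.lowercase()
        .replace(Regex("[^a-z ]"), " ")
        .split(" ")
        .filter { it.isNotBlank() }

    // P( C ) times the product of smoothed token probabilities P( xᵢ | C ).
    fun score(vocab: Array<String>): Double {
        var s = vocab.size.toDouble() / (spamVocab.size + hamVocab.size)
        for (t in tokens) {
            val count = vocab.count { it == t }
            s *= (count + 1.0) / (vocab.size + vocab.distinct().size)
        }
        return s
    }

    return if (score(spamVocab) > score(hamVocab)) "SPAM" else "HAM"
}
```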
You can see the full code at one glance in this gist.
That was long and a bit math-heavy, but that’s the end!
Hope you liked the idea of Naive Bayes in Kotlin. Feel free to share your feedback in the comments section below.
It’s my first math-heavy article, so please forgive any errors of notation or precision. :-)
Happy Machine Learning.