Blog: Markov Chains and HMMs
In this article, we’ll focus on Markov Models, where an when they should be used, and Hidden Markov Models. This article will focus on the theoretical part. In a second article, I’ll present Python implementations of these subjects.
Markov Models, and especially Hidden Markov Models (HMM) are used for :
- Speech recognition
- Writing recognition
- Object or face detection
- Economic Scenario Generation and specific Finance tasks
- And several NLP tasks …
This article was originally published on my personal blog: https://maelfabien.github.io/machinelearning/HMM_1/#
I publish all my articles and the corresponding code on this repository :
I. Stochastic model
Let’s start by defining what a stochastic model is. It is essentially a discrete time process indexed at times 1,2,… that takes values, called “states”, which are observed: q1, q2,…. The states simply correspond to the actual values of the process, usually defined by a finite space: S=1,…Q.
The process starts at an initial state q1. Then, according to transition probabilities, we move between the states. We can compute the probability of a sequence of states using Bayes Rule :
In order to characterize the model, we need :
- The initial probability P(q1)
- All the transition probabilities
As you might guess, this is complex to achieve since we need to know a lot of parameters.
II. Discrete Time Markov Chain Models (DTMC)
1. What is a Markov Chain?
Discrete Time Markov Chain (DTMC) are time and event discrete stochastic process. Markov Chains rely on the Markov Property that there is a limited dependence within the process :
Let’s illustrate this: Consider a simple maze in which a mouse is trapped. We will denote qt the position of the maze in which the mouse stands after t steps. We will assume that the mouse does not have a memory of the steps it took within the maze. It simply goes to the position randomly, following the probability written next to each move.
The states here could represent many things, including in NLP. For example, we could have :
- 1 = Noun,
- 2 = Verb,
- 3 = Adjective…
And we would be interested in the probabilities have a verb after a noun for example.
2. Transition probabilities
A Discrete Time Markov chain is said to be homogeneous if its transition probabilities do not depend on the time t :
We can summarize the process is a transition matrix denoted A=[aij], i ∈ 1…Q, j ∈ 1…Q. A transition matrix is stochastic if :
- All entries are non-negative
- Each line sums to 1
In our example, the transition matrix would be :
Note that if A is stochastic, then A^n is stochastic.
There are several ways to describe a state. Let pii be the probability of returning to state i after leaving i :
- A state i is transient if pii<1
- A state i is recurrent if pii=1
- A state i is absorbing if aii=1
Therefore, a state is positive recurrent if the average time before return to this same state denoted Tii is finite.
A DTMC is irreducible if a state j can be reached in a finite number of steps from any other state i. An irreducible DTMC is, in fact, a strongly connected graph.
A state in a discrete-time Markov chain is periodic if the chain can return to the state only at multiples of some integer larger than 1.
For example :
Otherwise, it is called aperiodic. A state with a self-loop is always aperiodic.
4. Sojourn time
Let Ti be the time spent in state i before jumping to other states.
Then, Ti, the sojourn time, follows a geometric distribution :
The expected average time spent is E(T)=1 / aii
5. M-step transition
The probability of going from i to j in m steps is denoted by :
We can see a22(4) as the probability for the mouse to be in position 2 at time t=4. Therefore, the probability of going from i to j in exactly n steps is given by fij(n) where :
6. Probability distribution of states
Let πi(n) be the probability of being in state i at time n : πi(n)=P(Xn=i)
Then, π(n)=[π1(n),π2(n),…] is the vector of probability distribution which depends on :
- the initial transition matrix A
- the initial distribution π(0)
Notice that π(n+1) = π(n)A and recursively :
For an irreducible/aperiodic DTMC, the distribution π(n) converges to a limit vector π which is independent of π(0) and is the unique solution of: π = πP
And ∑i πi = 1
πi is also called stationary probabilities, steady state or equilibrium distribution.
7. Generating sequences
To simulate the path of the mouse in the maze, we might want to generate sequences.
When we want to generate a sequence, we start from an initial state q1=1 for example. The general idea is that :
- We pick a random number to know which state we should start from
- Then, pick a random number to know which state we move to
Suppose we are given the following simple model :
This corresponds to the following matrix A and initial vector of probabilities π:
The generator works as follows, by drawing successively random number of identifying which transition refers to.
At the first step, we pick a random number and see where it falls in the initial vector of probabilities. This gives us our first state.
Then, we pick the following number, which corresponds, from the state q1, to the transition probabilities (first row of matrix A). If the value is smaller than 0.3, we stay in q1. Else, we move to q2. And so on…
8. Decoding sequences
The aim of decoding a sequence is to identify the most likely path that lead to the current state. For example, if the mouse is in state 3 and went there in 5 steps, you want to identify the most likely path.
9. Use cases
Markov Chains can be used :
- To identify the language of a sentence by decoding the sequences of characters and identifying the most likely language.
- To predict macroeconomic situations like market crashes and cycles between recession and expansion.
- To predict asset and option prices, and calculating credit risks.
III. Hidden Markov Models
Hidden Markov Models (HMM) are widely used for :
- speech recognition
- writing recognition
- object or face detection
- part-of-speech tagging and other NLP tasks…
I recommend checking the introduction made by Luis Serrano on HMM.
We will be focusing on Part-of-Speech (PoS) tagging. Part-of-speech tagging is the process by which we are able to tag a given word as being a noun, pronoun, verb, adverb…
PoS can, for example, be used for Text to Speech conversion or Word sense disambiguation.
In this specific case, the same word
bear has completely different meanings, and the corresponding PoS is therefore different.
Let’s consider the following scenario. In your office, there are 2 colleagues that talk a lot. You know they either talk about Work or Holidays. Since they look cool, you’d like to join them. But you’re too far to understand the whole conversation, and you only get some words of the sentence
Before joining the conversation, in order not to sound too weird, you’d like to guess whether he talks about Work or Holidays. For example, here is the kind of sentence your friends might be pronouncing :
A. Emission probabilities
You only hear distinctively the words python or bear, and try to guess the context of the sentence. Since your friends are Python developers, when they talk about work, they talk about Python 80% of the time.
These probabilities are called the Emission Probabilities.
B. Transition probabilities
You listen to their conversations and keep trying to understand the subject every minute. There is some sort of coherence in the conversation of your friends. Indeed, if one hour they talk about work, there is a lower probability that the next minute they talk about holidays.
We can define what we call the Hidden Markov Model for this situation :
The probabilities to change the topic of the conversation or not are called the transition probabilities.
- The words you understand are called the observations since you observe them.
- The subject they talk about is called the hidden state since you can’t observe it.
C. Discrete-Time Hidden Markov Models
An HMM λ is a sequence made of a combination of 2 stochastic processes :
- An observed one: O=o1,o2,…,oT, here the words
- A hidden one: q=q1,q2,…qT, here the topic of the conversation. This is called the state of the process.
An HMM model is defined by :
- The vector of initial probabilities π=[π1,…πq], where πi=P(q1=i)
- A transition matrix for unobserved sequence A : A=[aij]=P(qt=j ∣ qt−1=j)
- A matrix of the probabilities of the observations B=[bki]=P(ot=sk ∣ qt=i)
What are the main hypothesis behind HMMs?
- Independence of the observations conditionally to the hidden states : P(o1,…,ot,…,oT ∣ q1,…,qt,…,qT, λ) = ∏i P(ot ∣ qt, λ)
- The stationary Markov Chain : P(q1,q2,…,qT) = P(q1)P(q2∣q1)P(q3∣q2)…P(qT∣qT−1)
- Joint probability for a sequence of observations and states : P(o1,o2,…oT,q1,…,qT ∣ λ) = P(o1,…,oT ∣ q1,…,qT, λ) P(q1,…,qT)
A HMM is a subcase of Bayesian Networks.
D. Find the transition probabilities
Transition probabilities are based on the observations we have made. We can suppose that after carefully listening, every minute, we manage to understand the topic they were talking about. This does not give us the full information on the topic they are currently talking about though.
You have 15 observations, taken over the last 15 minutes, W denotes Work and H Holidays.
We notice that in 2 cases out of 5, the topic Work lead to the topic Holidays, which explains the transition probability in the graph above.
E. Find the emission probabilities
Well, since we have observations on the topic they were discussing, and we observe the words that were used during the discussion, we can define estimates of the emission probabilities :
F. Probability for a topic at a random time
Suppose that you have to grab a coffee, and when you come back, they are still talking. You have no clue what they are talking about! What is at that random moment the probability that they are talking about Work or Holidays?
We can count from the previous observations: 10 times they were talking about Holidays, 5 times about Work. Therefore, it states that we have 1/3 chance that they talk about Work and 2/3 chance that they talk about Holidays.
G. If you hear the word “Python”, what is the probability of each topic?
If you hear the word “Python”, the probability that the topic is Work or Holidays is defined by Bayes Theorem!
Which heads to 57%.
H. If you hear a sequence of words, what is the probability of each topic?
Let’s start with 2 observations in a row. Let’s suppose that we hear the words “Python” and “Bear” in a row. What are the possible combinations?
- Python was linked to Work, Bear was linked to work
- Python was linked to Holidays, Bear was linked to work
- Python was linked to Holidays, Bear was linked to Holidays
- Python was linked to Work, Bear was linked to Holidays
These scenarios can be summarized this way :
Therefore, the most likely hidden state is Holidays and Holidays. What if you hear more than 2 words? Let’s say 50? It becomes really challenging to compute all the possible paths! This is why the Viterbi Algorithm was introduced, to overcome this issue.
I. Decoding with Viterbi Algorithm
The main idea behind the Viterbi Algorithm is that when we compute the optimal decoding sequence, we don’t keep all the potential paths, but only the path corresponding to the maximum likelihood.
Here’s how it works. We start with a sequence of observed events, say
Python, Python, Python, Bear, Bear, Python. This sequence corresponds simply to a sequence of observations : P(o1, o2,…, oT ∣ λm)
For the first observation, the probability that the subject is Work given that we observe Python is the probability that it is Work times the probability that it is Python given that it is Work.
The most likely sequence of states simply corresponds to :
We can then move on to the next observation. Here’s what will happen :
For each position, we compute the probability using the fact that the previous topic was either Work or Holidays, and for each case, we only keep the maximum since our aim is to find the maximum likelihood. Therefore, the next step is to estimate the same thing for the Holidays topic and keep the maximum between the 2 paths.
If you decode the whole sequence, you should get something similar to this (I’ve rounded the values at each step, so you might get slightly different results):
The most likely sequence when we observe
Python, Python, Python, Bear, Bear, Python is therefore
Work, Work, Work, Holidays, Holidays, Holidays.
If you finally go talk to your colleagues after such a long stalking time, you should expect them to be talking about Holidays :)
The joint probability of the best sequence of potential states ending in state i at time t and corresponding to observations o1,…,oT is denoted by δT(i). This is one of the potential paths described above.
By recursion, it can be shown that :
Where bj denotes a probability of the matrix of observations B and aij denotes a value of the transition matrix for unobserved sequence. Those parameters are estimated from the sequence of observations and states available. The δ is simply the maximum we take at each step when moving forward.
I won’t really go into further details here. You should simply remember that there are 2 ways to solve Viterbi, forward (as we have seen) and backward.
When we only observe partially the sequence and face incomplete data, the EM algorithm is used.
J. Generating a sequence
As we have seen with Markov Chains, we can generate sequences with HMMs. In order to do so, we need to :
- Generate first the hidden state q1
- From q1, generate o1, e.g Work then Python
- Then generate the transition q1 to q2
- From q2, generate o2
How does the process work? As stated above, this is now a 2 step process, where we first generate the state, then the observation.
In this article, we covered the basic theory of Markov Chains and HMMs, including terminology, hypothesis, properties, sequence generation, and decoding.
I hope this introduction to Markov Chains and Hidden Markov Models was helpful. I have listed my sources in the section below.
In the next article, I’ll try to illustrate these concepts in Python.
- Ecole National Supérieure des Mines: https://www.emse.fr/~xie/SJTU/Ch4DMC.ppt
- MVE220 Financial Risk : http://www.math.chalmers.se/Stat/Grundutb/CTH/mve220/1617/redingprojects16-17/IntroMarkovChainsandApplications.pdf
- HMMs by Udacity : https://www.youtube.com/watch?v=kqSzLo9fenk
- Personal blog MC : https://maelfabien.github.io/machinelearning/HMM_1/#
- Personal blog HMM : https://maelfabien.github.io/machinelearning/HMM_2/#