### Blog: Probabilistic Reasoning for Sequential Data

In the world of machine learning, we encounter many kinds of data, such as images, text, video, and sensor readings. Different kinds of data call for different modelling techniques. Sequential data is data in which the ordering of the samples matters.

Time-series data is a particular manifestation of sequential data: a sequence of time-stamped values obtained from a data source such as a sensor, a microphone, or a stock market feed. Time-series data has a number of important characteristics that need to be modelled in order to analyze it effectively.

The measurements in time-series data are taken at regular time intervals and correspond to predetermined variables. These measurements are stored on a timeline, and their order of appearance is crucial; we use this order to extract patterns from the data.

In this article, we will see how to build models that describe time-series data, or any sequence in general. These models help us understand the behaviour of the time-series variable and predict future values based on past behaviour.

Time-series data analysis is used extensively in financial analysis, sensor data analysis, speech recognition, economics, weather forecasting, manufacturing, and many other fields. We will explore a variety of scenarios where we encounter time-series data and see how to build solutions. We will use the `pandas` library to handle all the time-series related operations, along with a couple of other useful packages, `hmmlearn` and `pystruct`. Make sure you install them before you proceed.

You can install them by running the following commands in your Terminal:

```shell
$ pip3 install pandas
$ pip3 install hmmlearn
$ pip3 install pystruct
$ pip3 install cvxopt
```

If you get an error when installing `cvxopt`, you will find further instructions at http://cvxopt.org/install. Now that you have successfully installed the packages, let's go ahead to the next section.

#### Handling time-series data with Pandas

Let’s get started by learning how to handle time-series data in Pandas. In this section, we will convert a sequence of numbers into time series data and visualize it. Pandas provides options to add timestamps, organize data, and then efficiently operate on it.

Create a new Python file and import the following packages:

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
```

Define a function to read the data from the input file. The parameter `index` indicates the column that contains the relevant data:

```python
def read_data(input_file, index):
    # Read the data from the input file
    input_data = np.loadtxt(input_file, delimiter=',')
```

Define a `lambda` function to convert the year and month values into the Pandas date format:

```python
    # Lambda function to convert strings to Pandas date format
    to_date = lambda x, y: str(int(x)) + '-' + str(int(y))
```
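For instance, the helper turns the year and month floats that `np.loadtxt` produces into a date string that Pandas can parse. Here is a standalone sketch of the same `lambda`:

```python
# Same helper as above, shown outside the function for illustration
to_date = lambda x, y: str(int(x)) + '-' + str(int(y))

print(to_date(2000.0, 3.0))   # 2000-3
print(to_date(1994.0, 12.0))  # 1994-12
```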

Use this `lambda` function to get the start date from the first line in the input file:

```python
    # Extract the start date
    start = to_date(input_data[0, 0], input_data[0, 1])
```

The `pandas` library treats the end date as exclusive when we perform these operations, so we need to advance the date in the last line by one month:

```python
    # Extract the end date
    if input_data[-1, 1] == 12:
        year = input_data[-1, 0] + 1
        month = 1
    else:
        year = input_data[-1, 0]
        month = input_data[-1, 1] + 1

    end = to_date(year, month)
```
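The rollover logic above can be isolated into a tiny helper (the name `next_month` is ours, not part of the original script) and checked on both branches:

```python
def next_month(year, month):
    # December rolls over into January of the following year
    if month == 12:
        return year + 1, 1
    return year, month + 1

print(next_month(2004, 12))  # (2005, 1)
print(next_month(2004, 5))   # (2004, 6)
```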

Create a list of date indices using the start and end dates with a monthly frequency:

```python
    # Create a date list with a monthly frequency
    date_indices = pd.date_range(start, end, freq='M')
```
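To see why the end date had to be advanced by one month: with `freq='M'`, `pd.date_range` emits month-end timestamps, and the month-end of the end date itself falls outside the range. A quick sketch with made-up dates:

```python
import pandas as pd

# Data covering January through March 2000, so end is advanced to April
indices = pd.date_range('2000-1', '2000-4', freq='M')

print(len(indices))       # 3 month-end timestamps
print(indices[-1].month)  # 3, so March is included
```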

Create a Pandas `Series` using these timestamps:

```python
    # Add timestamps to the input data to create time-series data
    output = pd.Series(input_data[:, index], index=date_indices)

    return output
```

Define the main function and specify the input file:

```python
if __name__=='__main__':
    # Input filename
    input_file = 'data_2D.txt'
```

Specify the columns that contain the data:

```python
    # Specify the columns that need to be converted
    # into time-series data
    indices = [2, 3]
```

Iterate through the columns and read the data in each one:

```python
    # Iterate through the columns and plot the data
    for index in indices:
        # Convert the column to timeseries format
        timeseries = read_data(input_file, index)
```

Plot the time-series data:

```python
        # Plot the data
        plt.figure()
        timeseries.plot()
        plt.title('Dimension ' + str(index - 1))

    plt.show()
```

Save this file as `timeseries.py`, because the later scripts import `read_data` from it. If you run the code, you will see two figures.

The following screenshot indicates the data in the first dimension:

The second screenshot indicates the data in the second dimension:

#### Slicing time-series data

Now that we know how to handle time-series data, let's see how to slice it. Slicing refers to dividing the data into sub-intervals and extracting the relevant information, which is very useful when working with time-series datasets. Instead of using positional indices, we will use timestamps to slice the data.
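Before applying this to the real dataset, here is a minimal sketch of timestamp-based slicing on a synthetic monthly series (the values and date range below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic monthly series covering 1994 through 2013
index = pd.date_range('1994-01', periods=240, freq='M')
series = pd.Series(np.arange(240), index=index)

# Year-level slicing: both endpoints are inclusive
print(len(series['2003':'2011']))      # 9 years x 12 months = 108

# Month-level slicing: February 1998 through July 2006
print(len(series['1998-2':'2006-7']))  # 102 months
```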

Create a new Python file and import the following packages:

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from timeseries import read_data
```

Load the third column (zero-indexed column `2`) from the input data file:

```python
# Load input data
index = 2
data = read_data('data_2D.txt', index)
```

Define the start and end years, and then plot the data with year-level granularity:

```python
# Plot data with year-level granularity
start = '2003'
end = '2011'
plt.figure()
data[start:end].plot()
plt.title('Input data from ' + start + ' to ' + end)
```

Define the start and end months, and then plot the data with month-level granularity:

```python
# Plot data with month-level granularity
start = '1998-2'
end = '2006-7'
plt.figure()
data[start:end].plot()
plt.title('Input data from ' + start + ' to ' + end)

plt.show()
```

The full code is given in the file `slicer.py`. If you run the code, you will see two figures. The first screenshot shows the data from *2003* to *2011*:

The second screenshot shows the data from *February 1998* to *July 2006*:

#### Operating on time-series data

Pandas allows us to operate on time-series data efficiently and perform various operations like filtering and addition. You can simply set some conditions and Pandas will filter the dataset and return the right subset. You can add two time-series variables as well. This allows us to build various applications quickly without having to reinvent the wheel.
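Both operations, boolean filtering and index-aligned addition, can be sketched on a toy dataframe (the values below are made up for illustration):

```python
import pandas as pd

idx = pd.date_range('2000-01', periods=6, freq='M')
data = pd.DataFrame({'dim1': [10, 50, 20, 60, 30, 70],
                     'dim2': [40, 10, 35, 5, 45, 25]}, index=idx)

# Keep only the rows satisfying both conditions
filtered = data[(data['dim1'] < 45) & (data['dim2'] > 30)]
print(len(filtered))  # 3 rows satisfy dim1 < 45 and dim2 > 30

# Addition aligns the two series on their shared timestamps
total = data['dim1'] + data['dim2']
print(total.iloc[0])  # 10 + 40 = 50
```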

Create a new Python file and import the following packages:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from timeseries import read_data
```

Define the input filename:

```python
# Input filename
input_file = 'data_2D.txt'
```

Load the third and fourth columns into separate variables:

```python
# Load data
x1 = read_data(input_file, 2)
x2 = read_data(input_file, 3)
```

Create a Pandas dataframe object by naming the two dimensions:

```python
# Create pandas dataframe for slicing
data = pd.DataFrame({'dim1': x1, 'dim2': x2})
```

Plot the data by specifying the start and end years:

```python
# Plot data
start = '1968'
end = '1975'
data[start:end].plot()
plt.title('Data overlapped on top of each other')
```

Filter the data using conditions and then display it. In this case, we will take all the datapoints in `dim1` that are less than `45` and all the values in `dim2` that are greater than `30`:

```python
# Filtering using conditions
# - 'dim1' is smaller than a certain threshold
# - 'dim2' is greater than a certain threshold
data[(data['dim1'] < 45) & (data['dim2'] > 30)].plot()
plt.title('dim1 < 45 and dim2 > 30')
```

We can also add two series in Pandas. Let's add `dim1` and `dim2` between the given start and end dates (the variable is renamed `total` here, since it holds a sum rather than a difference):

```python
# Adding two series
plt.figure()
total = data[start:end]['dim1'] + data[start:end]['dim2']
total.plot()
plt.title('Summation (dim1 + dim2)')

plt.show()
```

The full code is given in the file `operator.py`. If you run the code, you will see three screenshots. The first screenshot shows the data from *1968* to *1975*:

The second screenshot shows the filtered data:

The third screenshot shows the summation result:


#### Extracting statistics from time-series data

In order to extract meaningful insights from time-series data, we have to compute statistics from it, such as the mean, variance, correlation, and maximum value. These statistics have to be computed on a rolling basis, using a sliding window of a predetermined size. When we visualize these rolling statistics over time, interesting patterns emerge. Let's see how to extract these statistics from time-series data.
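The idea of a rolling statistic is easy to see on a tiny series: each output value is the statistic of the last `window` inputs, so the first `window - 1` entries are undefined. A minimal sketch:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Rolling mean with a window of 3 values
r = s.rolling(window=3).mean()
print(r.tolist())  # [nan, nan, 2.0, 3.0, 4.0]
```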

Create a new Python file and import the following packages:

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from timeseries import read_data
```

Define the input filename:

```python
# Input filename
input_file = 'data_2D.txt'
```

Load the third and fourth columns into separate variables:

```python
# Load input data in time series format
x1 = read_data(input_file, 2)
x2 = read_data(input_file, 3)
```

Create a pandas dataframe by naming the two dimensions:

```python
# Create pandas dataframe for slicing
data = pd.DataFrame({'dim1': x1, 'dim2': x2})
```

Extract the maximum and minimum values along each dimension:

```python
# Extract max and min values
print('\nMaximum values for each dimension:')
print(data.max())
print('\nMinimum values for each dimension:')
print(data.min())
```

Extract the overall mean and the row-wise mean for the first *12* rows:

```python
# Extract overall mean and row-wise mean values
print('\nOverall mean:')
print(data.mean())
print('\nRow-wise mean:')
print(data.mean(axis=1)[:12])
```

Plot the rolling mean using a window size of `24`:

```python
# Plot the rolling mean using a window size of 24
data.rolling(center=False, window=24).mean().plot()
plt.title('Rolling mean')
```

Print the correlation coefficients:

```python
# Extract correlation coefficients
print('\nCorrelation coefficients:\n', data.corr())
```
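As a quick sanity check on `corr()`: when one column is an exact linear function of another, the correlation coefficient is `1.0` (the toy data below is made up):

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                   'b': [2.0, 4.0, 6.0, 8.0]})

# 'b' is exactly 2 * 'a', so the correlation is perfect
print(df.corr().loc['a', 'b'])  # 1.0
```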

Plot the rolling correlation using a window size of *60*:

```python
# Plot rolling correlation using a window size of 60
plt.figure()
plt.title('Rolling correlation')
data['dim1'].rolling(window=60).corr(other=data['dim2']).plot()

plt.show()
```

The full code is given in the file `stats_extractor.py`. If you run the code, you will see two figures. The first screenshot shows the rolling mean:

The second screenshot shows the rolling correlation:

You will see the following on your Terminal:

If you scroll down, you will see row-wise mean values and the correlation coefficients printed on your Terminal:

The correlation coefficients in the preceding figures indicate the level of correlation of each dimension with all the other dimensions. A correlation of `1.0` indicates perfect correlation, whereas a correlation of `0.0` indicates that the variables are not related to each other.

#### Generating data using Hidden Markov Models

A Hidden Markov Model (HMM) is a powerful technique for analyzing sequential data. It assumes that the system being modeled is a Markov process with hidden states: the underlying system is in one of a set of possible states at any given time, and it goes through a sequence of state transitions, producing a sequence of outputs. We can only observe the outputs, not the states; the states are hidden from us. Our goal is to model the data so that we can infer the hidden state transitions from observed outputs.

In order to understand HMMs, let’s consider the example of a salesman who has to travel between the following three cities for his job — London, Barcelona, and New York. His goal is to minimize the traveling time so that he can be more efficient. Considering his work commitments and schedule, we have a set of probabilities that dictate the chances of going from city *X* to city *Y*. In the information given below, *P(X -> Y)* indicates the probability of going from city *X* to city *Y*:

```
P(London -> London) = 0.10
P(London -> Barcelona) = 0.70
P(London -> NY) = 0.20
P(Barcelona -> Barcelona) = 0.15
P(Barcelona -> London) = 0.75
P(Barcelona -> NY) = 0.10
P(NY -> NY) = 0.05
P(NY -> London) = 0.60
P(NY -> Barcelona) = 0.35
```

Let's represent this information with a transition matrix (rows are the current city, columns the next city):

|           | London | Barcelona | NY   |
|-----------|--------|-----------|------|
| London    | 0.10   | 0.70      | 0.20 |
| Barcelona | 0.75   | 0.15      | 0.10 |
| NY        | 0.60   | 0.35      | 0.05 |

Now that we have all the information, let’s go ahead and set the problem statement. The salesman starts his journey on Tuesday from London and he has to plan something on Friday. But that will depend on where he is. What is the probability that he will be in Barcelona on Friday? This table will help us figure it out.

If we do not have a Markov Chain to model this problem, then we will not know what his travel schedule looks like. Our goal is to say with a good amount of certainty that he will be in a particular city on a given day. If we denote the transition matrix by *T* and the current day by *X(i)*, then:

*X(i+1) = X(i).T*

In our case, Friday is *3* days away from Tuesday. This means we have to compute *X(i+3)*. The computations will look like this:

*X(i+1) = X(i).T*

*X(i+2) = X(i+1).T*

*X(i+3) = X(i+2).T*

So in essence:

*X(i+3) = X(i).T³*

Since the salesman starts in London on Tuesday, we set *X(i)* as given here:

*X(i) = [1 0 0]*

The next step is to compute the cube of the matrix. There are many tools available online to perform matrix operations, such as http://matrix.reshish.com/multiplication.php. If you do all the matrix calculations, you will get the following probabilities for Friday:

*P(London) = 0.31*

*P(Barcelona) = 0.53*

*P(NY) = 0.16*
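These numbers are straightforward to verify with NumPy instead of an online calculator:

```python
import numpy as np

# Transition matrix: rows are the current city, columns the next city
# Order: London, Barcelona, NY
T = np.array([[0.10, 0.70, 0.20],
              [0.75, 0.15, 0.10],
              [0.60, 0.35, 0.05]])

# The salesman is in London on Tuesday
x = np.array([1.0, 0.0, 0.0])

# Three transitions take us from Tuesday to Friday
probs = x @ np.linalg.matrix_power(T, 3)
print(np.round(probs, 2))  # [0.31 0.53 0.16]
```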

We can see that there is a higher chance of him being in Barcelona than in any other city. This makes geographical sense as well because Barcelona is closer to London compared to New York. Let’s see how to model HMMs in Python.

Create a new Python file and import the following packages:

```python
import datetime
import numpy as np
import matplotlib.pyplot as plt
from hmmlearn.hmm import GaussianHMM
from timeseries import read_data
```

Load data from the input file:

```python
# Load input data
data = np.loadtxt('data_1D.txt', delimiter=',')
```

Extract the third column for training:

```python
# Extract the data column (third column) for training
X = np.column_stack([data[:, 2]])
```

Create a Gaussian HMM with 5 components and diagonal covariance:

```python
# Create a Gaussian HMM
num_components = 5
hmm = GaussianHMM(n_components=num_components,
        covariance_type='diag', n_iter=1000)
```

Train the HMM:

```python
# Train the HMM
print('\nTraining the Hidden Markov Model...')
hmm.fit(X)
```

Print the mean and variance values for each component of the HMM:

```python
# Print HMM stats
print('\nMeans and variances:')
for i in range(hmm.n_components):
    print('\nHidden state', i+1)
    print('Mean =', round(hmm.means_[i][0], 2))
    print('Variance =', round(np.diag(hmm.covars_[i])[0], 2))
```

Generate `1200` samples using the trained HMM model and plot them:

```python
# Generate data using the HMM model
num_samples = 1200
generated_data, _ = hmm.sample(num_samples)
plt.plot(np.arange(num_samples), generated_data[:, 0], c='black')
plt.title('Generated data')

plt.show()
```

The full code is given in the file `hmm.py`. If you run the code, you will see a plot of the 1200 generated samples:

You will see the following printed on your Terminal:


#### Identifying alphabet sequences with Conditional Random Fields

Conditional Random Fields (CRFs) are probabilistic models that are frequently used to analyze structured data. We use them to label and segment sequential data in various forms. One thing to note about CRFs is that they are discriminative models. This is in contrast to HMMs, which are generative models.

A CRF defines a conditional probability distribution over the label sequence given the sequence of measurements, and we use this framework to build our model. In contrast, HMMs define a joint distribution over the observation sequence and the labels.

One of the main advantages of CRFs is that they are conditional by nature, which is not the case with HMMs. CRFs do not assume any independence between output observations. HMMs assume that the output at any given time is statistically independent of the previous outputs, given the current state. HMMs need this assumption to ensure that the inference process works in a robust way, but the assumption is not always true: real-world data is filled with temporal dependencies.

CRFs tend to outperform HMMs in a variety of applications such as natural language processing, speech recognition, and biotechnology. In this section, we will use CRFs to analyze sequences of letters. Create a new Python file and import the following packages:

```python
import os
import argparse
import string
import pickle
import numpy as np
import matplotlib.pyplot as plt

from pystruct.datasets import load_letters
from pystruct.models import ChainCRF
from pystruct.learners import FrankWolfeSSVM
```

Define a function to parse the input arguments. We can pass the `C` value as an input parameter. The `C` parameter controls how much we want to penalize misclassification: a higher value imposes a higher penalty for misclassification during training, but the model might overfit; a lower value lets the model generalize better, but also imposes a lower penalty for misclassifying the training datapoints.

```python
def build_arg_parser():
    parser = argparse.ArgumentParser(
            description='Trains a Conditional Random Field classifier')
    parser.add_argument("--C", dest="c_val", required=False, type=float,
            default=1.0, help='C value to be used for training')
    return parser
```
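The parser can be exercised without touching the command line by handing `parse_args` an explicit argument list:

```python
import argparse

def build_arg_parser():
    parser = argparse.ArgumentParser(
            description='Trains a Conditional Random Field classifier')
    parser.add_argument("--C", dest="c_val", required=False, type=float,
            default=1.0, help='C value to be used for training')
    return parser

# Simulate "python script.py --C 2.5"
args = build_arg_parser().parse_args(['--C', '2.5'])
print(args.c_val)  # 2.5

# With no argument, it falls back to the default
print(build_arg_parser().parse_args([]).c_val)  # 1.0
```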

Define a class to handle all the functionality of building the CRF model. We will use a chain CRF model trained with `FrankWolfeSSVM`:

```python
# Class to model the CRF
class CRFModel(object):
    def __init__(self, c_val=1.0):
        self.clf = FrankWolfeSSVM(model=ChainCRF(),
                C=c_val, max_iter=50)
```

Define a function to load the training data:

```python
    # Load the training data
    def load_data(self):
        alphabets = load_letters()
        X = np.array(alphabets['data'])
        y = np.array(alphabets['labels'])
        folds = alphabets['folds']

        return X, y, folds
```

Define a function to train the CRF model:

```python
    # Train the CRF
    def train(self, X_train, y_train):
        self.clf.fit(X_train, y_train)
```

Define a function to evaluate the accuracy of the CRF model:

```python
    # Evaluate the accuracy of the CRF
    def evaluate(self, X_test, y_test):
        return self.clf.score(X_test, y_test)
```

Define a function to run the trained CRF model on an unknown datapoint:

```python
    # Run the CRF on unknown data
    def classify(self, input_data):
        return self.clf.predict(input_data)[0]
```

Define a function to convert a list of label indices into a string of letters:

```python
# Convert indices to alphabets
def convert_to_letters(indices):
    # Create a numpy array of all alphabets
    alphabets = np.array(list(string.ascii_lowercase))
```

Extract the letters:

```python
    # Extract the letters based on input indices
    output = np.take(alphabets, indices)
    output = ''.join(output)

    return output
```
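As a quick sanity check, label indices map directly onto lowercase letters (the helper is restated here so the snippet runs on its own):

```python
import string
import numpy as np

def convert_to_letters(indices):
    # Map label indices to lowercase letters and join them
    alphabets = np.array(list(string.ascii_lowercase))
    return ''.join(np.take(alphabets, indices))

print(convert_to_letters([0, 2, 4]))  # ace
print(convert_to_letters([18, 4, 16, 20, 4, 13, 2, 4]))  # sequence
```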

Define the main function and parse the input arguments:

```python
if __name__=='__main__':
    args = build_arg_parser().parse_args()
    c_val = args.c_val
```

Create the CRF model object:

```python
    # Create the CRF model
    crf = CRFModel(c_val)
```

Load the input data and separate it into train and test sets:

```python
    # Load the train and test data
    X, y, folds = crf.load_data()
    X_train, X_test = X[folds == 1], X[folds != 1]
    y_train, y_test = y[folds == 1], y[folds != 1]
```

Train the CRF model:

```python
    # Train the CRF model
    print('\nTraining the CRF model...')
    crf.train(X_train, y_train)
```

Evaluate the accuracy of the CRF model and print it:

```python
    # Evaluate the accuracy
    score = crf.evaluate(X_test, y_test)
    print('\nAccuracy score =', str(round(score*100, 2)) + '%')
```

Run it on some test datapoints and print the output:

```python
    indices = range(3000, len(y_test), 200)
    for index in indices:
        print("\nOriginal =", convert_to_letters(y_test[index]))
        predicted = crf.classify([X_test[index]])
        print("Predicted =", convert_to_letters(predicted))
```

If you run the code, you will see the following output on your Terminal:

If you scroll to the end, you will see the following on your Terminal:

As we can see, it predicts most of the words correctly.

#### Stock market analysis

We will analyze stock market data in this section using Hidden Markov Models. This is an example where the data is already organized and timestamped. We will use the dataset available through the `matplotlib` package, which contains the stock values of various companies over the years. Note that the `matplotlib.finance` module used below has been removed from recent versions of matplotlib, and the Yahoo quotes endpoint it relied on is no longer available; on a modern setup you would load the quotes from a CSV file or a dedicated data package instead. Hidden Markov Models are generative models that can analyze such time-series data and extract the underlying structure. We will use this model to analyze stock price variations and generate outputs.

Create a new Python file and import the following packages:

```python
import datetime
import warnings
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.finance import quotes_historical_yahoo_ochl as quotes_yahoo
from hmmlearn.hmm import GaussianHMM
```

Load historical stock quotes from September 4, 1970 to May 17, 2016. You are free to choose any date range you wish:

```python
# Load historical stock quotes from matplotlib package
start = datetime.date(1970, 9, 4)
end = datetime.date(2016, 5, 17)
stock_quotes = quotes_yahoo('INTC', start, end)
```

Extract the closing quote each day and the volume of shares traded that day:

```python
# Extract the closing quotes everyday
closing_quotes = np.array([quote[2] for quote in stock_quotes])

# Extract the volume of shares traded everyday
volumes = np.array([quote[5] for quote in stock_quotes])[1:]
```

Take the percentage difference of the closing quotes each day:

```python
# Take the percentage difference of closing stock prices
diff_percentages = 100.0 * np.diff(closing_quotes) / closing_quotes[:-1]
```
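On a toy price series, the percentage-difference computation looks like this. Note that the result is one element shorter than the input, which is why the volumes and dates arrays are trimmed with `[1:]`:

```python
import numpy as np

closing = np.array([100.0, 110.0, 99.0])

# Day-over-day percentage change relative to the previous close
pct = 100.0 * np.diff(closing) / closing[:-1]
print(pct)  # [ 10. -10.]
```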

Since the differencing reduces the length of the array by `1`, you need to adjust the dates array too (`np.int` has been removed from recent versions of NumPy, so plain `int` is used here):

```python
# Take the list of dates starting from the second value
dates = np.array([quote[0] for quote in stock_quotes], dtype=int)[1:]
```

Stack the two data columns to create the training dataset:

```python
# Stack the differences and volume values column-wise for training
training_data = np.column_stack([diff_percentages, volumes])
```

Create and train a Gaussian HMM with 7 components and diagonal covariance:

```python
# Create and train Gaussian HMM
hmm = GaussianHMM(n_components=7, covariance_type='diag', n_iter=1000)
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    hmm.fit(training_data)
```

Use the trained HMM model to generate `300` samples. You can choose to generate any number of samples you want:

```python
# Generate data using the HMM model
num_samples = 300
samples, _ = hmm.sample(num_samples)
```

Plot the generated values for the difference percentages:

```python
# Plot the difference percentages
plt.figure()
plt.title('Difference percentages')
plt.plot(np.arange(num_samples), samples[:, 0], c='black')
```

Plot the generated values for the volume of shares traded:

```python
# Plot the volume of shares traded
plt.figure()
plt.title('Volume of shares')
plt.plot(np.arange(num_samples), samples[:, 1], c='black')
plt.ylim(ymin=0)

plt.show()
```

If you run the code, you will see the following two screenshots. The first screenshot shows the difference percentages generated by the HMM:

The second screenshot shows the values generated by the HMM for volume of shares traded:
