Probabilistic Reasoning for Sequential Data


In the world of machine learning, we encounter many types of data, such as images, text, video, sensor readings, and so on. Different types of data require different types of modelling techniques. Sequential data refers to data where the ordering is important.


Time-series data is a particular manifestation of sequential data. It is basically time-stamped values obtained from any data source such as sensors, microphones, stock markets, and so on. Time-series data has a lot of important characteristics that need to be modelled in order to effectively analyze the data.

The measurements that we encounter in time-series data are taken at regular time intervals and correspond to predetermined parameters. These measurements are arranged on a timeline for storage, and the order of their appearance is very important. We use this order to extract patterns from the data.

In this article, we will see how to build models that describe the given time-series data, or any sequence in general. These models are used to understand the behaviour of the time-series variable. We then use them to predict future values based on past behaviour.

Time-series data analysis is used extensively in financial analysis, sensor data analysis, speech recognition, economics, weather forecasting, manufacturing, and many other domains. We will explore a variety of scenarios where we encounter time-series data and see how we can build solutions. We will be using a library called Pandas to handle all the time-series-related operations. We will also use a couple of other useful packages, hmmlearn and pystruct, in this article. Make sure you install them before you proceed.

You can install them by running the following commands on your Terminal:

$ pip3 install pandas
$ pip3 install hmmlearn
$ pip3 install pystruct
$ pip3 install cvxopt

If you get an error when installing cvxopt, you will find further instructions at http://cvxopt.org/install. Now that you have successfully installed the packages, let’s go ahead to the next section.

Handling time-series data with Pandas

Let’s get started by learning how to handle time-series data in Pandas. In this section, we will convert a sequence of numbers into time-series data and visualize it. Pandas provides options to add timestamps, organize data, and then efficiently operate on it.

Create a new Python file and import the following packages:

import numpy as np 
import matplotlib.pyplot as plt
import pandas as pd

Define a function to read the data from the input file. The parameter index indicates the column number that contains the relevant data:

def read_data(input_file, index):
    # Read the data from the input file
    input_data = np.loadtxt(input_file, delimiter=',')

Define a lambda function to convert strings to Pandas date format:

    # Lambda function to convert strings to Pandas date format
    to_date = lambda x, y: str(int(x)) + '-' + str(int(y))

Use this lambda function to get the start date from the first line in the input file:

    # Extract the start date
    start = to_date(input_data[0, 0], input_data[0, 1])

The Pandas library treats the end date as exclusive when we build the date range, so we need to increase the date field in the last line by one month:

    # Extract the end date
    if input_data[-1, 1] == 12:
        year = input_data[-1, 0] + 1
        month = 1
    else:
        year = input_data[-1, 0]
        month = input_data[-1, 1] + 1

    end = to_date(year, month)

Create a list of indices with dates using the start and end dates with a monthly frequency:

    # Create a date list with a monthly frequency
    date_indices = pd.date_range(start, end, freq='M')

Create a Pandas Series object using these timestamps:

    # Add timestamps to the input data to create time-series data
    output = pd.Series(input_data[:, index], index=date_indices)

    return output
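
Before moving on, it is worth seeing why the end date had to be bumped by one month. With freq='M', pd.date_range emits month-end timestamps, so the end month itself never makes it into the range. A quick check with made-up dates:

import pandas as pd

# freq='M' generates month-end stamps; since '1994-4' parses to April 1,
# April's own month-end falls past it and is excluded
print(pd.date_range('1994-1', '1994-4', freq='M'))
# DatetimeIndex(['1994-01-31', '1994-02-28', '1994-03-31'], dtype='datetime64[ns]', freq='M')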

Define the main function and specify the input file:

if __name__=='__main__':
    # Input filename
    input_file = 'data_2D.txt'

Specify the columns that contain the data:

    # Specify the columns that need to be converted
    # into time-series data
    indices = [2, 3]

Iterate through the columns and read the data in each column:

    # Iterate through the columns and plot the data
    for index in indices:
        # Convert the column to time-series format
        timeseries = read_data(input_file, index)

Plot the time-series data:

        # Plot the data
        plt.figure()
        timeseries.plot()
        plt.title('Dimension ' + str(index - 1))

    plt.show()

The full code is given in the file timeseries.py (the following sections import read_data from it). If you run the code, you will see two screenshots.

The following screenshot indicates the data in the first dimension:

The second screenshot indicates the data in the second dimension:

Slicing time-series data

Now that we know how to handle time-series data, let’s see how we can slice it. Slicing refers to dividing the data into various sub-intervals and extracting the relevant information. This is very useful when working with time-series datasets. Instead of using indices, we will use timestamps to slice our data.

Create a new Python file and import the following packages:

import numpy as np 
import matplotlib.pyplot as plt
import pandas as pd

from timeseries import read_data

Load the third column (zero-indexed) from the input data file:

# Load input data 
index = 2
data = read_data('data_2D.txt', index)

Define the start and end years, and then plot the data with year-level granularity:

# Plot data with year-level granularity 
start = '2003'
end = '2011'
plt.figure()
data[start:end].plot()
plt.title('Input data from ' + start + ' to ' + end)

Define the start and end months, and then plot the data with month-level granularity:

# Plot data with month-level granularity 
start = '1998-2'
end = '2006-7'
plt.figure()
data[start:end].plot()
plt.title('Input data from ' + start + ' to ' + end)

plt.show()

The full code is given in the file slicer.py. If you run the code, you will see two figures. The first screenshot shows the data from 2003 to 2011:

The second screenshot shows the data from February 1998 to July 2006:

Operating on time-series data

Pandas allows us to operate on time-series data efficiently and perform various operations like filtering and addition. You can simply set some conditions and Pandas will filter the dataset and return the right subset. You can add two time-series variables as well. This allows us to build various applications quickly without having to reinvent the wheel.

Create a new Python file and import the following packages:

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt

from timeseries import read_data

Define the input filename:

# Input filename 
input_file = 'data_2D.txt'

Load the third and fourth columns into separate variables:

# Load data 
x1 = read_data(input_file, 2)
x2 = read_data(input_file, 3)

Create a Pandas dataframe object by naming the two dimensions:

# Create pandas dataframe for slicing 
data = pd.DataFrame({'dim1': x1, 'dim2': x2})

Plot the data by specifying the start and end years:

# Plot data 
start = '1968'
end = '1975'
data[start:end].plot()
plt.title('Data overlapped on top of each other')

Filter the data using conditions and then display it. In this case, we will take all the datapoints in dim1 that are less than 45 and all the values in dim2 that are greater than 30:

# Filtering using conditions 
# - 'dim1' is smaller than a certain threshold
# - 'dim2' is greater than a certain threshold
data[(data['dim1'] < 45) & (data['dim2'] > 30)].plot()
plt.title('dim1 < 45 and dim2 > 30')

We can also add two series in Pandas. Let’s add dim1 and dim2 between the given start and end dates:

# Add the two series
plt.figure()
total = data[start:end]['dim1'] + data[start:end]['dim2']
total.plot()
plt.title('Summation (dim1 + dim2)')

plt.show()

The full code is given in the file operator.py. If you run the code, you will see three screenshots. The first screenshot shows the data from 1968 to 1975:

The second screenshot shows the filtered data:

The third screenshot shows the summation result:


Extracting statistics from time-series data

In order to extract meaningful insights from time-series data, we have to extract statistics from it: mean, variance, correlation, maximum value, and so on. These statistics have to be computed on a rolling basis, using a window of predetermined size that slides along the timeline. When we visualize the statistics over time, we will see interesting patterns. Let’s see how to extract these statistics from time-series data.
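
To make the idea of a rolling statistic concrete before we start, here is a rolling mean on a toy series (made-up values):

import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Each output is the mean of the current value and the two before it;
# the first two entries are NaN because the window is not yet full
print(s.rolling(window=3).mean())
# 0    NaN
# 1    NaN
# 2    2.0
# 3    3.0
# 4    4.0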

Create a new Python file and import the following packages:

import numpy as np 
import matplotlib.pyplot as plt
import pandas as pd

from timeseries import read_data

Define the input filename:

# Input filename 
input_file = 'data_2D.txt'

Load the third and fourth columns into separate variables:

# Load input data in time series format 
x1 = read_data(input_file, 2)
x2 = read_data(input_file, 3)

Create a pandas dataframe by naming the two dimensions:

# Create pandas dataframe for slicing 
data = pd.DataFrame({'dim1': x1, 'dim2': x2})

Extract maximum and minimum values along each dimension:

# Extract max and min values 
print('\nMaximum values for each dimension:')
print(data.max())
print('\nMinimum values for each dimension:')
print(data.min())

Extract the overall mean and the row-wise mean for the first 12 rows:

# Extract overall mean and row-wise mean values 
print('\nOverall mean:')
print(data.mean())
print('\nRow-wise mean:')
print(data.mean(axis=1)[:12])

Plot the rolling mean using a window size of 24:

# Plot the rolling mean using a window size of 24 
data.rolling(center=False, window=24).mean().plot()
plt.title('Rolling mean')

Print the correlation coefficients:

# Extract correlation coefficients 
print('\nCorrelation coefficients:\n', data.corr())

Plot the rolling correlation using a window size of 60:

# Plot rolling correlation using a window size of 60 
plt.figure()
plt.title('Rolling correlation')
data['dim1'].rolling(window=60).corr(other=data['dim2']).plot()

plt.show()

The full code is given in the file stats_extractor.py. If you run the code, you will see two screenshots. The first screenshot shows the rolling mean:

The second screenshot shows the rolling correlation:

You will see the following on your Terminal:

If you scroll down, you will see row-wise mean values and the correlation coefficients printed on your Terminal:

The correlation coefficients in the preceding figures indicate the level of correlation of each dimension with all the other dimensions. A correlation of 1.0 indicates perfect correlation, whereas a correlation of 0.0 indicates no linear relationship between the variables.
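
As a quick sanity check on that interpretation, two columns that are perfect linear functions of each other (made-up values) produce an off-diagonal coefficient of 1.0:

import numpy as np
import pandas as pd

x = pd.Series(np.arange(10, dtype=float))
data = pd.DataFrame({'dim1': x, 'dim2': 2 * x + 3})  # 'dim2' is a linear function of 'dim1'
print(data.corr())  # both off-diagonal entries are 1.0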

Generating data using Hidden Markov Models

A Hidden Markov Model (HMM) is a powerful technique for analyzing sequential data. It assumes that the system being modeled is a Markov process with hidden states. This means that the underlying system can be in one of a set of possible states. It goes through a sequence of state transitions, thereby producing a sequence of outputs. We can only observe the outputs, not the states; hence these states are hidden from us. Our goal is to model the data so that we can infer the state transitions of unknown data.

In order to understand HMMs, let’s consider the example of a salesman who has to travel between the following three cities for his job — London, Barcelona, and New York. His goal is to minimize the traveling time so that he can be more efficient. Considering his work commitments and schedule, we have a set of probabilities that dictate the chances of going from city X to city Y. In the information given below, P(X -> Y) indicates the probability of going from city X to city Y:

P(London -> London) = 0.10

P(London -> Barcelona) = 0.70

P(London -> NY) = 0.20

P(Barcelona -> Barcelona) = 0.15

P(Barcelona -> London) = 0.75

P(Barcelona -> NY) = 0.10

P(NY -> NY) = 0.05

P(NY -> London) = 0.60

P(NY -> Barcelona) = 0.35

Let’s represent this information with a transition matrix:

            London    Barcelona    NY
London        0.10       0.70     0.20
Barcelona     0.75       0.15     0.10
NY            0.60       0.35     0.05

Now that we have all the information, let’s go ahead and set up the problem statement. The salesman starts his journey in London on Tuesday and has to plan something for Friday, which depends on where he will be. What is the probability that he will be in Barcelona on Friday? This matrix will help us figure it out.

If we do not have a Markov chain to model this problem, then we will not know what his travel schedule looks like. Our goal is to say with a good amount of certainty that he will be in a particular city on a given day. If we denote the transition matrix by T and the probability distribution over cities on day i by X(i), then:

X(i+1) = X(i).T

In our case, Friday is 3 days away from Tuesday. This means we have to compute X(i+3). The computations will look like this:

X(i+1) = X(i).T

X(i+2) = X(i+1).T

X(i+3) = X(i+2).T

So in essence:

X(i+3) = X(i).T³

Since the salesman is in London on Tuesday, we need to set X(i) as given here:

X(i) = [1.00 0.00 0.00]

Multiplying this vector by T³ simply picks out the first row of the cubed matrix.

The next step is to compute the cube of the matrix. There are many tools available online to perform matrix operations, such as http://matrix.reshish.com/multiplication.php. If you do all the matrix calculations, then you will see that you get the following probabilities for Friday:

P(London) = 0.31

P(Barcelona) = 0.53

P(NY) = 0.16
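
If you would rather verify this with NumPy than with an online matrix calculator, here is a minimal sketch:

import numpy as np

# Transition matrix (rows: current city, columns: next city),
# ordered London, Barcelona, NY
T = np.array([[0.10, 0.70, 0.20],
              [0.75, 0.15, 0.10],
              [0.60, 0.35, 0.05]])

# He is in London on Tuesday
x = np.array([1.0, 0.0, 0.0])

# Three transitions take us from Tuesday to Friday
print(np.round(x @ np.linalg.matrix_power(T, 3), 2))  # [0.31 0.53 0.16]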

We can see that there is a higher chance of him being in Barcelona than in any other city. This makes geographical sense as well, because Barcelona is closer to London than New York is. Let’s see how to model HMMs in Python.

Create a new Python file and import the following packages:

import datetime 

import numpy as np
import matplotlib.pyplot as plt
from hmmlearn.hmm import GaussianHMM

from timeseries import read_data

Load data from the input file:

# Load input data 
data = np.loadtxt('data_1D.txt', delimiter=',')

Extract the third column for training:

# Extract the data column (third column) for training 
X = np.column_stack([data[:, 2]])

Create a Gaussian HMM with 5 components and diagonal covariance:

# Create a Gaussian HMM 
num_components = 5
hmm = GaussianHMM(n_components=num_components,
                  covariance_type='diag', n_iter=1000)

Train the HMM:

# Train the HMM 
print('\nTraining the Hidden Markov Model...')
hmm.fit(X)

Print the mean and variance values for each component of the HMM:

# Print HMM stats 
print('\nMeans and variances:')
for i in range(hmm.n_components):
    print('\nHidden state', i+1)
    print('Mean =', round(hmm.means_[i][0], 2))
    print('Variance =', round(np.diag(hmm.covars_[i])[0], 2))

Generate 1200 samples using the trained HMM model and plot them:

# Generate data using the HMM model 
num_samples = 1200
generated_data, _ = hmm.sample(num_samples)
plt.plot(np.arange(num_samples), generated_data[:, 0], c='black')
plt.title('Generated data')

plt.show()

The full code is given in the file hmm.py. If you run the code, you will see the following screenshot that shows the 1200 generated samples:

You will see the following printed on your Terminal:


Identifying alphabet sequences with Conditional Random Fields

Conditional Random Fields (CRFs) are probabilistic models that are frequently used to analyze structured data. We use them to label and segment sequential data in various forms. One thing to note about CRFs is that they are discriminative models. This is in contrast to HMMs, which are generative models.

A CRF defines a conditional probability distribution over the label sequence given the measurements, and we use this framework to build our model. With HMMs, by contrast, we have to define a joint distribution over the observation sequence and the labels.
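
For reference, a linear-chain CRF (the structure that ChainCRF implements, up to parameterization details) models the conditional distribution as:

p(y | x) = (1 / Z(x)) * exp( sum over t and k of w_k * f_k(y(t-1), y(t), x, t) )

Here the f_k are feature functions defined on adjacent label pairs and the observations, the w_k are learned weights, and Z(x) is a normalizing constant that depends only on x. Because we never have to model p(x), the features can be arbitrary, overlapping functions of the observations.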

One of the main advantages of CRFs is that they are conditional by nature, which is not the case with HMMs. CRFs do not assume any independence between output observations. HMMs assume that the output at any given time is statistically independent of the previous outputs, given the hidden state; they need this assumption to keep inference tractable. But this assumption is not always a good fit, because real-world data is full of temporal dependencies.

CRFs tend to outperform HMMs in a variety of applications, such as natural language processing, speech recognition, biotechnology, and so on. In this section, we will discuss how to use CRFs to analyze sequences of letters. Create a new Python file and import the following packages:

import os 
import argparse
import string
import pickle

import numpy as np
import matplotlib.pyplot as plt
from pystruct.datasets import load_letters
from pystruct.models import ChainCRF
from pystruct.learners import FrankWolfeSSVM

Define a function to parse the input arguments. We can pass the C value as an input parameter. The C parameter controls how much we want to penalize misclassification: a higher value imposes a heavier penalty for misclassified training points, but risks overfitting the model, whereas a lower value lets the model generalize better at the cost of tolerating more misclassified training points.

def build_arg_parser():
    parser = argparse.ArgumentParser(description='Trains a Conditional '
                                     'Random Field classifier')
    parser.add_argument("--C", dest="c_val", required=False, type=float,
                        default=1.0, help='C value to be used for training')
    return parser

Define a class to handle all the functionality of building the CRF model. We will use a chain CRF model with FrankWolfeSSVM:

# Class to model the CRF
class CRFModel(object):
    def __init__(self, c_val=1.0):
        self.clf = FrankWolfeSSVM(model=ChainCRF(),
                                  C=c_val, max_iter=50)

Define a function to load the training data:

    # Load the training data
    def load_data(self):
        alphabets = load_letters()
        X = np.array(alphabets['data'])
        y = np.array(alphabets['labels'])
        folds = alphabets['folds']

        return X, y, folds

Define a function to train the CRF model:

    # Train the CRF
    def train(self, X_train, y_train):
        self.clf.fit(X_train, y_train)

Define a function to evaluate the accuracy of the CRF model:

    # Evaluate the accuracy of the CRF
    def evaluate(self, X_test, y_test):
        return self.clf.score(X_test, y_test)

Define a function to run the trained CRF model on an unknown datapoint:

    # Run the CRF on unknown data
    def classify(self, input_data):
        return self.clf.predict(input_data)[0]

Define a function to convert a list of label indices back into letters:

# Convert indices to letters
def convert_to_letters(indices):
    # Create a NumPy array of all lowercase letters
    alphabets = np.array(list(string.ascii_lowercase))

Extract the letters:

    # Extract the letters based on input indices
    output = np.take(alphabets, indices)
    output = ''.join(output)

    return output

Define the main function and parse the input arguments:

if __name__=='__main__':
    args = build_arg_parser().parse_args()
    c_val = args.c_val

Create the CRF model object:

    # Create the CRF model
    crf = CRFModel(c_val)

Load the input data and separate it into train and test sets:

    # Load the train and test data
    X, y, folds = crf.load_data()
    X_train, X_test = X[folds == 1], X[folds != 1]
    y_train, y_test = y[folds == 1], y[folds != 1]

Train the CRF model:

    # Train the CRF model
    print('\nTraining the CRF model...')
    crf.train(X_train, y_train)

Evaluate the accuracy of the CRF model and print it:

    # Evaluate the accuracy
    score = crf.evaluate(X_test, y_test)
    print('\nAccuracy score =', str(round(score*100, 2)) + '%')

Run it on some test datapoints and print the output:

    indices = range(3000, len(y_test), 200)
    for index in indices:
        print("\nOriginal =", convert_to_letters(y_test[index]))
        predicted = crf.classify([X_test[index]])
        print("Predicted =", convert_to_letters(predicted))

If you run the code, you will see the following output on your Terminal:

If you scroll to the end, you will see the following on your Terminal:

As we can see, it predicts most of the words correctly.

Stock market analysis

We will analyze stock market data in this section using Hidden Markov Models. This is an example where the data is already organized and timestamped. We will use the dataset available through the matplotlib package. (Note that the finance module has since been removed from matplotlib and the Yahoo Finance endpoint it calls has been discontinued, so running this example requires an older matplotlib release and may require an alternative data source.) The dataset contains the stock values of various companies over the years. Hidden Markov Models are generative models that can analyze such time-series data and extract the underlying structure. We will use this model to analyze stock price variations and generate the outputs.

Create a new Python file and import the following packages:

import datetime 
import warnings

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.finance import quotes_historical_yahoo_ochl as quotes_yahoo
from hmmlearn.hmm import GaussianHMM

Load historical stock market quotes from September 4, 1970 to May 17, 2016. You are free to choose any date range you wish.

# Load historical stock quotes from matplotlib package 
start = datetime.date(1970, 9, 4)
end = datetime.date(2016, 5, 17)
stock_quotes = quotes_yahoo('INTC', start, end)

Extract the closing quote each day and the volume of shares traded that day:

# Extract the closing quotes every day
closing_quotes = np.array([quote[2] for quote in stock_quotes])

# Extract the volume of shares traded every day
volumes = np.array([quote[5] for quote in stock_quotes])[1:]

Take the percentage difference of closing quotes each day:

# Take the percentage difference of closing stock prices 
diff_percentages = 100.0 * np.diff(closing_quotes) / closing_quotes[:-1]

Since the differencing reduces the length of the array by 1, you need to adjust the date array too:

# Take the list of dates starting from the second value 
dates = np.array([quote[0] for quote in stock_quotes], dtype=int)[1:]

Stack the two data columns to create the training dataset:

# Stack the differences and volume values column-wise for training 
training_data = np.column_stack([diff_percentages, volumes])

Create and train the Gaussian HMM with 7 components and diagonal covariance:

# Create and train Gaussian HMM 
hmm = GaussianHMM(n_components=7, covariance_type='diag', n_iter=1000)
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    hmm.fit(training_data)

Use the trained HMM model to generate 300 samples. You can choose to generate any number of samples you want.

# Generate data using the HMM model 
num_samples = 300
samples, _ = hmm.sample(num_samples)

Plot the generated values for difference percentages:

# Plot the difference percentages 
plt.figure()
plt.title('Difference percentages')
plt.plot(np.arange(num_samples), samples[:, 0], c='black')

Plot the generated values for volume of shares traded:

# Plot the volume of shares traded 
plt.figure()
plt.title('Volume of shares')
plt.plot(np.arange(num_samples), samples[:, 1], c='black')
plt.ylim(ymin=0)

plt.show()

If you run the code, you will see the following two screenshots. The first screenshot shows the difference percentages generated by the HMM:

The second screenshot shows the values generated by the HMM for volume of shares traded:
