Time Series Machine Learning (and Feature Engineering) in R
Machine learning is a powerful way to analyze Time Series. With innovations in the tidyverse modeling infrastructure (tidymodels), we now have a common set of packages to perform machine learning in R. These packages include parsnip, recipes, tune, and workflows. But what about Machine Learning with Time Series Data? The key is Feature Engineering. (Read the updated article at Business Science)
The timetk package has a feature engineering innovation in version 0.1.3: a recipe step called step_timeseries_signature() for Time Series Feature Engineering, designed to fit right into the tidymodels workflow for machine learning with time series data.
This small innovation creates 25+ time series features, which have a big impact on the performance of our machine learning models. Further, these “core features” are the basis for creating 200+ time series features to improve forecasting performance. Let’s see how to do Time Series Machine Learning in R.
Time Series Feature Engineering with the Time Series Signature
The time series signature is a collection of useful engineered features that describe the time series index of a time-based data set. It contains 25+ time series features that can be used to forecast time series with common seasonal and trend patterns:
Trend in Seconds Granularity: index.num
Yearly Seasonality: Year, Month, Quarter
Weekly Seasonality: Week of Month, Day of Month, Day of Week, and more
Daily Seasonality: Hour, Minute, Second
Weekly Cyclic Patterns: 2 weeks, 3 weeks, 4 weeks
We can then build 200+ new features from these 25+ core features by applying well-thought-out time series feature engineering strategies.
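To make the idea of a time series signature concrete, here is a minimal base R sketch that derives a handful of calendar features from a date index. It is not timetk's implementation: the column names and the 2-week cycle definition are assumptions chosen for illustration, and the real signature contains many more columns.

```r
# Sketch only: a few signature-style features built with base R.
# Column names are illustrative assumptions, not timetk's output.
dates <- seq(as.Date("2011-01-01"), by = "day", length.out = 10)
lt    <- as.POSIXlt(dates)  # broken-down calendar fields for each date

signature_sketch <- data.frame(
  date      = dates,
  index_num = as.numeric(dates) * 86400,  # trend: seconds since 1970-01-01
  year      = lt$year + 1900,
  quarter   = lt$mon %/% 3 + 1,
  month     = lt$mon + 1,
  mday      = lt$mday,                    # day of month
  wday      = lt$wday + 1,                # day of week, 1 = Sunday
  yday      = lt$yday + 1,                # day of year
  week2     = (lt$yday %/% 7 + 1) %% 2    # 2-week cycle flag (assumed definition)
)

head(signature_sketch, 3)
```

Every column is a deterministic function of the timestamp alone, which is why the same features can be computed for future dates at forecast time.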
Time Series Forecast Strategy
6-Month Forecast of Bike Transaction Counts
In this tutorial, you will learn how to use machine learning to predict future outcomes in a time-based data set. The example uses a well-known time series data set, the Bike Sharing Dataset, from the UCI Machine Learning Repository. The objective is to build a model and predict the next 6 months of daily Bike Sharing transaction counts.
Feature Engineering Strategy
I’ll use timetk to build a basic Machine Learning Feature Set using the new step_timeseries_signature() function, which is added to a preprocessing specification built with the recipes package. I’ll show how you can add interaction terms, dummy variables, and more to build 200+ new features from the pre-packaged feature set.
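As a rough illustration of how dummy variables and interaction terms multiply a small feature set, the sketch below uses base R's model.matrix() instead of recipes steps. The data, the weekday labels, and the column names are assumptions made for this example only.

```r
# Sketch only: dummy variables and an interaction term via model.matrix().
# The tutorial does this with recipes steps; this is a base R stand-in.
dates       <- seq(as.Date("2011-01-01"), by = "day", length.out = 14)
wday_labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

features <- data.frame(
  wday  = factor(as.POSIXlt(dates)$wday, levels = 0:6, labels = wday_labels),
  trend = as.numeric(dates - min(dates))  # days since the first observation
)

# One-hot dummies: dropping the intercept gives one column per weekday
dummies <- model.matrix(~ wday - 1, data = features)

# Dummies, trend, and a weekday-by-trend interaction (Sunday is the baseline)
interactions <- model.matrix(~ wday * trend, data = features)

dim(dummies)       # 14 rows, 7 columns
dim(interactions)  # 14 rows, 14 columns
```

Two input columns already expand to 14 model columns here, which is the same mechanism that grows 25+ core features into 200+.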
Machine Learning Strategy
We’ll then perform Time Series Machine Learning using parsnip and workflows to construct and train a GLM-based time series machine learning model. The model is evaluated on out-of-sample data. A final model is then trained on the full dataset and extended to a future dataset containing 6 months of daily timestamps.
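The three stages of this strategy (fit on the training region, score on held-out data, refit on everything and predict future dates) can be sketched in base R. This is a simplified stand-in under stated assumptions: the data are synthetic rather than the bike counts, glm() replaces the parsnip and workflows model, and make_features() is a hypothetical helper.

```r
# Sketch only: synthetic daily data and base R glm() stand in for the
# bike counts and the parsnip/workflows model used in the tutorial.
set.seed(123)
dates <- seq(as.Date("2011-01-01"), as.Date("2012-12-31"), by = "day")

# Hypothetical helper: features computed from the date alone
make_features <- function(d) {
  data.frame(
    date  = d,
    trend = as.numeric(d),                                # days since 1970-01-01
    month = factor(as.POSIXlt(d)$mon + 1, levels = 1:12)  # yearly seasonality
  )
}

data_tbl <- make_features(dates)
data_tbl$value <- 3000 + 4 * (data_tbl$trend - min(data_tbl$trend)) +
  1500 * sin(2 * pi * as.POSIXlt(dates)$yday / 365) +
  rnorm(length(dates), sd = 300)

# 1. Time-based split: rows before the cutoff train, the rest test
cutoff    <- as.Date("2012-07-01")
train_tbl <- data_tbl[data_tbl$date <  cutoff, ]
test_tbl  <- data_tbl[data_tbl$date >= cutoff, ]

# 2. Fit on the training region, then score on the held-out region
fit_train <- glm(value ~ trend + month, data = train_tbl, family = gaussian())
test_pred <- predict(fit_train, newdata = test_tbl)
rmse      <- sqrt(mean((test_tbl$value - test_pred)^2))

# 3. Refit on all data, then predict 6 months of future daily timestamps
fit_full   <- glm(value ~ trend + month, data = data_tbl, family = gaussian())
future_tbl <- make_features(seq(max(dates) + 1, by = "day", length.out = 181))
future_tbl$value <- predict(fit_full, newdata = future_tbl)
```

Because every feature comes from the timestamp, the future table can be built before any future values are known. That is what makes the final forecasting step possible.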
Time Series Forecast using Feature Engineering
How to Learn Forecasting Beyond this Tutorial
I can’t possibly show you all the Time Series Forecasting techniques you need to learn in this post, which is why I have a NEW Advanced Time Series Forecasting Course on its way. The course includes detailed explanations from 3 Time Series Competitions. We go over competition solutions and show how you can integrate the key strategies into your organization’s time series forecasting projects. Check out the course page, and Sign-Up to get notifications on the Advanced Time Series Forecasting Course (Coming soon).
Prerequisites
Please use timetk 0.1.3 or greater for this tutorial. You can install via remotes::install_github("business-science/timetk") until released on CRAN.
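One way to check the version requirement before running the tutorial is sketched below. needs_install() is a hypothetical helper written for this example, not a function from timetk or remotes.

```r
# Hypothetical helper (not part of timetk): TRUE when a package is missing
# or older than the minimum version required.
needs_install <- function(pkg, min_version) {
  !requireNamespace(pkg, quietly = TRUE) ||
    utils::packageVersion(pkg) < min_version
}

if (needs_install("timetk", "0.1.3")) {
  message('timetk >= 0.1.3 not found; ',
          'install with remotes::install_github("business-science/timetk")')
}
```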
Before we get started, load the following packages.
library(workflows)
library(parsnip)
library(recipes)
library(yardstick)
library(glmnet)
library(tidyverse)
library(tidyquant)
library(timetk) # Use >= 0.1.3, remotes::install_github("business-science/timetk")
Data
We’ll be using the Bike Sharing Dataset from the UCI Machine Learning Repository. Download the data and select the “day.csv” file which is aggregated to daily periodicity.
# Read data
bikes <- read_csv("2020-03-18-timeseries-ml/day.csv")
# Select date and count
bikes_tbl <- bikes %>%
    select(dteday, cnt) %>%
    rename(date  = dteday,
           value = cnt)
A visualization will help understand how we plan to tackle the problem of forecasting the data. We’ll split the data into two regions: a training region and a testing region.
# Visualize data and training/testing regions
bikes_tbl %>%
    ggplot(aes(x = date, y = value)) +
    geom_rect(xmin = as.numeric(ymd("2012-07-01")),
              xmax = as.numeric(ymd("2013-01-01")),
              ymin = 0, ymax = 10000,
              fill = palette_light()[[4]], alpha = 0.01) +
    annotate("text", x = ymd("2011-10-01"), y = 7800,
             color = palette_light()[[1]], label = "Train Region") +
    annotate("text", x = ymd("2012-10-01"), y = 1550,
             color = palette_light()[[1]], label = "Test Region") +
    geom_point(alpha = 0.5, color = palette_light()[[1]]) +
    labs(title = "Bikes Sharing Dataset: Daily Scale", x = "") +
    theme_tq()
Split the data into train and test sets at “2012-07-01”.
# Split into training and test sets
train_tbl <- bikes_tbl %>% filter(date < ymd("2012-07-01"))
test_tbl <- bikes_tbl %>% filter(date >= ymd("2012-07-01"))
Modeling
Start with the training set, which has the “date” and “value” columns.
# Training set
train_tbl
## # A tibble: 547 x 2
##    date       value
##    <date>     <dbl>