Time Series Machine Learning (and Feature Engineering) in R
Machine learning is a powerful way to analyze Time Series. With innovations in the tidyverse modeling infrastructure (tidymodels), we now have a common set of packages to perform machine learning in R. These packages include recipes and workflows. But what about Machine Learning with Time Series Data? The key is Feature Engineering. (Read the updated article at Business Science)
The timetk package has a feature engineering innovation in version 0.1.3: a recipe step called step_timeseries_signature() for Time Series Feature Engineering, designed to fit right into the tidymodels workflow for machine learning with time series data.
This small innovation creates 25+ time series features, which have a big impact on improving our machine learning models. Further, these “core features” are the basis for creating 200+ time-series features to improve forecasting performance. Let’s see how to do Time Series Machine Learning in R.
Time Series Feature Engineering with the Time Series Signature
The time series signature is a collection of useful engineered features that describe the time series index of a time-based data set. It contains 25+ time-series features that can be used to forecast time series that contain common seasonal and trend patterns:
Trend in Seconds Granularity: index.num
Yearly Seasonality: Year, Month, Quarter
Weekly Seasonality: Week of Month, Day of Month, Day of Week, and more
Daily Seasonality: Hour, Minute, Second
Weekly Cyclic Patterns: 2 weeks, 3 weeks, 4 weeks
We can then build 200+ new features from these 25+ core features by applying well-thought-out time series feature engineering strategies.
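To make the signature concrete, here is a minimal sketch that expands a short daily index into its signature with timetk's tk_get_timeseries_signature(). The seven dates are made up for illustration, and only a handful of the columns listed above are printed.

```r
library(timetk)

# A short daily index (dates chosen only for illustration)
idx <- seq(as.Date("2011-01-01"), by = "day", length.out = 7)

# One row per timestamp, one column per engineered feature
signature_tbl <- tk_get_timeseries_signature(idx)

# Inspect a few trend, yearly, weekly, and cyclic features
signature_tbl[, c("index", "index.num", "year", "quarter", "month",
                  "wday.lbl", "mday", "week2", "week3", "week4")]
```

The same features are what step_timeseries_signature() adds inside a recipe, where each column name is prefixed with the name of the date column.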
Time Series Forecast Strategy
6-Month Forecast of Bike Transaction Counts
In this tutorial, you will learn methods to implement machine learning to predict future outcomes in a time-based data set. The tutorial example uses a well-known time series dataset, the Bike Sharing Dataset, from the UCI Machine Learning Repository. The objective is to build a model and predict the next 6 months of Bike Sharing daily transaction counts.
Feature Engineering Strategy
I’ll use timetk to build a basic Machine Learning Feature Set using the new step_timeseries_signature() function, which is part of a preprocessing specification built with the recipes package. I’ll show how you can add interaction terms, dummy variables, and more to build 200+ new features from the pre-packaged feature set.
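A preprocessing specification along these lines might look like the sketch below. It is not the exact recipe from the full article: the data is simulated (only the "date" and "value" column names come from this tutorial), and the choice of which features to drop, normalize, and interact is an assumption made for illustration.

```r
library(recipes)
library(timetk)

# Simulated daily data with "date" and "value" columns (a stand-in, not the bike data)
set.seed(123)
train_tbl <- data.frame(
  date = seq(as.Date("2011-01-01"), as.Date("2012-06-30"), by = "day")
)
train_tbl$value <- 4000 + 3 * seq_len(nrow(train_tbl)) +
  rnorm(nrow(train_tbl), sd = 300)

recipe_spec <- recipe(value ~ ., data = train_tbl) |>
  # Expand the date column into the core signature features (date_year, date_month, ...)
  step_timeseries_signature(date) |>
  # Drop the raw date and features that carry no information for daily data
  step_rm(date) |>
  step_rm(contains("iso"), contains("xts"), contains("hour"),
          contains("minute"), contains("second"), contains("am.pm")) |>
  # Put the large-valued trend features on a common scale
  step_normalize(contains("index.num"), date_year) |>
  # One-hot encode the labeled month and weekday features
  step_dummy(contains("lbl"), one_hot = TRUE) |>
  # Interaction terms: let the day-of-month effect differ by month
  step_interact(~ starts_with("date_month.lbl"):date_day)

# Estimate the recipe on the training data and apply it
features_tbl <- bake(prep(recipe_spec), new_data = train_tbl)
dim(features_tbl)
```

Each additional step_interact() call multiplies the number of columns, which is how a couple dozen core features can grow into a much larger feature set.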
Machine Learning Strategy
We’ll then perform Time Series Machine Learning using workflows to construct and train a GLM-based time series machine learning model. The model is evaluated on out-of-sample data. A final model is trained on the full dataset and extended to a future dataset containing 6 months of daily timestamp data.
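The train, evaluate, refit, and forecast sequence can be sketched as follows. Several things here are assumptions rather than details from this post: the data is simulated, the model is an elastic net fit through parsnip's linear_reg() with the glmnet engine (the text only says "GLM-based"), the penalty and mixture values are illustrative, and 180 days stands in for "6 months".

```r
library(recipes)
library(parsnip)
library(workflows)
library(timetk)

# Simulated daily series with trend and yearly seasonality (not the real bike data)
set.seed(123)
dates <- seq(as.Date("2011-01-01"), as.Date("2012-12-31"), by = "day")
bikes_tbl <- data.frame(
  date  = dates,
  value = 4000 + 3 * seq_along(dates) +
    1500 * sin(2 * pi * as.numeric(format(dates, "%j")) / 365) +
    rnorm(length(dates), sd = 300)
)
train_tbl <- bikes_tbl[bikes_tbl$date <  as.Date("2012-07-01"), ]
test_tbl  <- bikes_tbl[bikes_tbl$date >= as.Date("2012-07-01"), ]

# Preprocessing: signature features, cleanup, scaling, one-hot encoding
recipe_spec <- recipe(value ~ ., data = train_tbl) |>
  step_timeseries_signature(date) |>
  step_rm(date) |>
  step_rm(contains("iso"), contains("xts"), contains("hour"),
          contains("minute"), contains("second"), contains("am.pm")) |>
  step_normalize(contains("index.num"), date_year) |>
  step_dummy(contains("lbl"), one_hot = TRUE)

# Model: elastic net regression (assumed engine and hyperparameters)
model_spec <- linear_reg(penalty = 10, mixture = 0.7) |>
  set_engine("glmnet")

# Workflow: bundle the recipe and the model so they are fit together
workflow_spec <- workflow() |>
  add_recipe(recipe_spec) |>
  add_model(model_spec)

# Train on the training region and evaluate on out-of-sample data
workflow_trained <- fit(workflow_spec, data = train_tbl)
test_pred <- predict(workflow_trained, new_data = test_tbl)
mae <- mean(abs(test_tbl$value - test_pred$.pred))

# Refit on the full dataset and predict 180 future daily timestamps
workflow_final <- fit(workflow_spec, data = bikes_tbl)
future_tbl <- data.frame(
  date = seq(max(bikes_tbl$date) + 1, by = "day", length.out = 180)
)
future_pred <- predict(workflow_final, new_data = future_tbl)
```

Because the recipe lives inside the workflow, the future data only needs a date column; the same feature engineering is applied automatically at prediction time.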
Time Series Forecast using Feature Engineering
How to Learn Forecasting Beyond this Tutorial
I can’t possibly show you all the Time Series Forecasting techniques you need to learn in this post, which is why I have a NEW Advanced Time Series Forecasting Course on its way. The course includes detailed explanations from 3 Time Series Competitions. We go over competition solutions and show how you can integrate the key strategies into your organization’s time series forecasting projects. Check out the course page, and Sign-Up to get notifications on the Advanced Time Series Forecasting Course (Coming soon).
Please use timetk 0.1.3 or greater for this tutorial. You can install it via remotes::install_github("business-science/timetk") until it is released on CRAN.
Before we get started, load the following packages.
We’ll be using the Bike Sharing Dataset from the UCI Machine Learning Repository. Download the data and select the “day.csv” file which is aggregated to daily periodicity.
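As a rough sketch of the loading step, the helper below reads a file laid out like day.csv and keeps only the date and the daily count, renamed to "date" and "value". The helper name read_bikes() is hypothetical, dteday and cnt are the date and total-count columns in the UCI file, and the tiny stand-in file holds placeholder numbers so the example runs without the download.

```r
# Hypothetical helper: read a day.csv-style file and keep the date and daily count
read_bikes <- function(path) {
  raw <- read.csv(path, stringsAsFactors = FALSE)
  data.frame(
    date  = as.Date(raw$dteday),  # dteday: calendar date of the observation
    value = raw$cnt               # cnt: total transactions for that day
  )
}

# Tiny stand-in file with placeholder values (not real bike counts)
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "instant,dteday,cnt",
  "1,2011-01-01,100",
  "2,2011-01-02,120",
  "3,2011-01-03,90"
), tmp)

bikes_tbl <- read_bikes(tmp)
str(bikes_tbl)
```

With the real download, the call would point at wherever you saved day.csv instead of the temporary file.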
A visualization will help understand how we plan to tackle the problem of forecasting the data. We’ll split the data into two regions: a training region and a testing region.
Split the data into train and test sets at “2012-07-01”.
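One way to express this split in base R is shown below. The values are simulated over 2011-01-01 to 2012-12-31, and putting the split date itself into the test set is an assumption; dplyr::filter() on the date column would do the same job.

```r
# Simulated daily series (random values standing in for the bike counts)
set.seed(1)
dates <- seq(as.Date("2011-01-01"), as.Date("2012-12-31"), by = "day")
bikes_tbl <- data.frame(
  date  = dates,
  value = round(runif(length(dates), min = 1000, max = 8000))
)

split_date <- as.Date("2012-07-01")

# Training region: everything before the split date
train_tbl <- bikes_tbl[bikes_tbl$date <  split_date, ]
# Testing region: the split date and everything after it
test_tbl  <- bikes_tbl[bikes_tbl$date >= split_date, ]

c(train = nrow(train_tbl), test = nrow(test_tbl))
# train = 547, test = 184
```

Splitting by date rather than at random keeps the test set strictly in the future of the training set, which mirrors how the forecast will be used.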
Start with the training set, which has the “date” and “value” columns.