Blog: SQLFlow: Our Motivation and the Prototype
SQLFlow allows users to write standard SQL statements, which are acceptable by the SQL engine, and statements with extended syntax to support AI functionality. For example, this statement trains a model using data from table flowers:
SELECT * FROM flowers TRAIN DNNClassifier LABEL kind INTO my_model;
The following statement predicts kind in table more_flowers using the trained model:
SELECT * FROM more_flowers PREDICT more_flowers.kind USING my_model;
SQLFlow is under active development and works with MySQL and TensorFlow at the moment. In the near future, we will support more SQL engines. Another important piece of our work is towards distributed model training and prediction. The ultimate goal is to abstract end-to-end AI development lifecycle, from data to model, with pluggable backend engine and intelligent data-to-feature mapping.
Many engineers in SQLFlow team have experience with Google, Baidu, and Hulu, where SQL is not a predominant programming language. However, SQL is such one at Ant Financial, where decisions and actions in financial products depend on data analysis, just like what is happening in LinkedIn where data scientists play a major role in sales strategy of data products.
An additional motivation comes from the observation that software engineers who usually program in Java and C++ would like to learn more SQL in depth if it can describe and build data pipelines from log collection, data augmentation, to AI, which would usually costs weeks of engineering hours in Java or C++. With SQLFlow we wish to reduce to 10 minutes or less.
Extending from SQL
We start building SQLFlow on top of SQL language, because SQL natively describes data pipelines. In addition, it brings in the advantage that engineers can use SQL to describe their purpose (end state) without providing detailed procedure. This advantage usually leads to concise source code that describes complex data pipeline. The only problem here is SQL doesn’t support AI natively and we need to extend it. A common approach is to add UDF (user-defined function). However, if we go for UDFs, we’d have to re-implement them for every supported engines SQLFlow connects to, which could be tedious. So we choose another approach which is to extend SQL syntax for AI. In particular, we begin with adding two clauses, TRAIN and PREDICT, as illustrated in the example above, to the SELECT statement.
Integration with SQL Engines
We hope our SQL extension is extremely easy to learn and integrates well with existing SQL programs despite many existing ones contain long and nested SELECT statements. Our SQLFlow users could simply append a TRAIN or PREDICT clause in those programs and add AI functionalities with ease.
One challenge here is that although all SQL engines tend to be compatible with the standard SQL, each of them might have some unique dialects. As a result, their users may use SELECT statements with some unique features without being aware of it. We don’t want users to modify their existing SQL statements, we only want them to append a TRAIN or PREDICT clause. To achieve this goal, we need to integrate parsers of various SQL engines with SQLFlow parser. This is not complete yet, we will publish a design document describing our approach soon. Your contribution will be very welcome!
Another challenge is that SQLFlow needs to convert data from SQL engines into model inputs. The conversion is necessary for AI applications in the industry but less discussed in the academic world. Luckily, TensorFlow community published feature column API, and users can invoke API to describe how to convert data into model inputs. Therefore, the challenge is to decide automatically which set of feature column APIs to invoke to convert the result of a SELECT statement into model inputs. For other AI frameworks, we will hopefully use a similar approach.
There are many attempts to enhance SQL to support AI:
- Microsoft SQL Server: Microsoft SQL Server has the machine learning service that runs machine learning programs in R or Python as an external script.
- Teradata SQL for DL: Teradata also provides a RESTful service, which is callable from the extended SQL SELECT syntax.
- Google BigQuery: Google BigQuery enables machine learning in SQL by introducing the CREATE MODEL statement.
These solutions try to extend a specific SQL engine with AI functionality. SQLFlow is not tied to any specific engine, instead we want to work with the community to support as many as possible, hence, the open sourcing of SQLFlow in its very early stage.
In addition to contribution from SQLFlow team, we would like to acknowledge contributors below. Tony (Kuisong) Tong from Snapchat is the designer and primary contributor of the Jupyter Notebook magic command. Duomu Ming also contribute to the SQLFlow website. If you are interested in enabling AI capability for SQL, be sure to check out SQLFlow. We have related documents to help you further walk through the code base.