Blog: Columnar Database: The contribution to Artificial Intelligence.
Dear Readers, in this article, I am going to discuss about what columnar database is, its use, how to create it and which system supports it. Going a little behind, I am giving you a glance of some databases which we have used till date and how this columnar database came into the picture. We have used Traditional Transactional DB, NO SQL Databases and Columnar Databases.
Transactional Databases:Few years ago the main purpose of databases was to store info and provide the information as and when required. The operations were mainly write heavy and information was stored in normalized form to avoid redundancy and maintain the integrity of information. These are called Online Transactional Process Database.
The most popular OLTP databases are:Oracle,SQL Server,My SQL
The core properties of these systems is ACID properties:Atomicity, Consistency, Isolation, Durability
With time these systems have also developed and support a variety of features like horizontal and vertical scaling, JSON support, in memory operations, caching, fault tolerance, replication etc.
NO SQL Database:With time the data has grown so much just for the fact over 80% of data that we have as of now is generated the in last 5 years for, obvious reasons like smartphones popularity, internet availability. In order to analyse such huge data, we need some advanced systems which are built specifically for analytical querying.
Some of the examples for OLTP and OLAP querying are:
· What is the current balance of a customer?
· What was the last transaction done by the customer?
· Add a new customer
· Update the phone number for a customer.
· How many transactions of the amount greater than 1000 are done by the customer in last year?
· How many new customers are added and have deposited over 10K in the last 5 years?
In OLTP we are majorly working in 1 dimension but in the OLAP systems we are slicing and dicing the dimensions and working with multiple records. Also in OLAP queries, we are mostly reading the data and not updating it.
To cater to these kinds of requirements advanced systems like Hadoop, InfluxDB, Postgres, MongoDB came into picture.
These systems come under the category of NoSQL databases. All these systems solve some specific data problems.
NOSQL Databases are categorized in 4 classes:Document Store,Graph DB,Key Value Store,Columnar Stores
Columnar Stores: A columnar database is a database management system (DBMS) that stores data in columns instead of rows.
In a columnar database, all the column 1 values are physically together, followed by all the column 2 values, etc. The data is stored in record order, so the 100th entry for column 1 and the 100th entry for column 2 belong to the same input record. This allows individual data elements, such as customer name for instance, to be accessed in columns as a group, rather than individually row-by-row.
Row Storage: The data sequence consists of the data fields in one table row(Example Oracle RDBMS).
Column Storage — The data sequence consists of the entries in one table column ( Example HANA)
One of the main benefits of a columnar database is that data can be highly compressed. The compression permits columnar operations — like MIN, MAX, SUM, COUNT and AVG — to be performed very rapidly.
The following example shows the different usage of column and row storage, and positions them relative to row and column queries. Column storage is most useful for OLAP queries because these queries get just a few attributes from every data entry. But for traditional OLTP queries, it is more advantageous to store all attributes side-by-side in row tables
As the use of in-memory analytics increases, however, the relative benefits of row-oriented vs. column oriented databases may become less important. In-memory analytics is not concerned with efficiently reading and writing data to a hard disk. Instead, it allows data to be queried in random access memory (RAM).
Amazon Redshift Columnar Storage
As a large scale, cloud-hosted data warehouse solution, Redshift is optimized for fast data analysis on vast amounts of data. Because Redshift is used for analytical queries, which are concerned with aggregates across many rows, a columnar structure performs much better:
- Using columnar storage, each data block can hold column field values for as many as three times the records as row-based storage. This reduces the number of I/O operations by 2/3. In tables with very large numbers of columns and a large number of rows, storage efficiency is even greater.
- An added advantage is that a columnar structure enables much better compression. In a database table, each column contains the same data type, typically with similar data. This creates an opportunity for much more efficient compression compared to a traditional database structure.
- The savings in storage space also carry over to retrieving and storing the data in memory. Many database operations only need to access or operate on a small number of columns at a time, and so you can save memory space by only retrieving the columns you actually need for your query.
For example, consider a table that contains 100 columns. A query that uses five columns will only need to read about 5% of the data. This saving is repeated for possibly billions or even trillions of records in large datasets. In contrast, a row-wise database would read the blocks that contain the 95 unneeded columns as well.
Column Compression in Redshift: By default, Redshift stores data in a raw, uncompressed format, and you can choose whether to compress data. Each column within a table can use a different type of compression.
There are several ways to encode columnar data when compressing it; choosing the right type of encoding for each data type is key to achieving efficient compression. Redshift supports seven column encoding formats:
It is possible to let Redshift automatically select encoding for column compression, or select it manually when creating a table.
We can buy, install, and host a column-oriented database in your own data center, using software such as Apache Cassandra, and Apache HBase. If we have high-end hardware, we can expect good performance from on-premises databases, as long as the load is relatively constant. If we have variation in our workloads, we could see performance impacts. We will also need more people in our IT department to help manage the hardware and software.
Many organizations prefer to host their data warehouses in the cloud, using services such as Amazon Redshift and Google BigQuery. Cloud applications offer several benefits:
· No capital requirements for hardware
· Ability to architect for high availability with built-in fault tolerance
· Flexible capacity and near-infinite scalability to deal with elastic demands
· Always the latest technology
Challenges with Redshift’s Columnar Structure
For each column, you’ll need to define one of the supported Redshift datatypes. Unlike other RDBMS systems, Redshift is used solely for data warehousing. You will not see many of the data types like LOBs or LONG in Redshift.
We can design a table for data warehousing using data types like SMALLINT, INTEGER, and BIGINT to store whole numbers of various ranges. Use DECIMAL or NUMERIC to store values with user-defined precision. Similarly, we can use VARCHAR, DATE, TIMESTAMP for its respective data.
The challenge lies in data integrity. We need to be careful about how the conversion and compatibility of number data type works while manipulating or querying data. Make sure we select only the data types the table needs, rather than choosing data types at random. Careful selection of data type reduces the size of the rows, and the database, making reads and writes faster and more efficient.
We should not select data types nor design the ETL in a hurry as it can cause errors, or worse, data corruption. Knowing how each data type behaves and the events that occur when the checks are performed will be of help when working on table design. Compatibility checks and conversion occur during various database operations, and if the wrong data type is defined or data is not converted properly, data integrity can be compromised.
Panoply is a smart data warehouse based on Redshift, which can help us automatically map our data to Redshift’s data types, guaranteeing data integrity. Panoply learns our data and makes intelligent choices, making sure all data makes the proper transition from the old data types to the types allowed in Redshift.
Role of Columnar Database into AI: Recent years have witnessed an explosion of columnar databases such as Oracle 12c Database In-Memory Option, AWS Redshift, MonetDB and SAP HANA. Analytics on structured data stored in databases can be more than just SQL analytical queries. Databases are increasingly being used as inputs for machine learning (ML aka Big Data analytics), statistical processing, and deep learning (DL). We interchangeably use the term artificial intelligence (AI) to encompass both ML and DL. We introduce innovations to improve in-memory columnar databases during the process of creating features for use with ML and DL analysis. These innovations use extensions to the dictionary encoding data structures used to compress data for columnar VLDBs. The efficiency of these innovations increases the value of using columnar databases for AI. Given that, columnar databases are well suited for use with AI. When viewed as an analytics cycle a richer set of data reuse, information flows and feedbacks between databases and AI are evident.
THE SCOPE OF ANALYTICS: Traditionally, the approach many Big Data and Machine Learning (ML) practitioners have taken is to select data that is stored in file systems (CSV, JSON, …), then manipulate the data into the proper numerical format for subsequent ML algorithms (Linear Regression, Generalized Linear Models, Gradient Boosted Trees, SVD, …) to produce models and predictions. And finally DL practitioners have largely focused their analytics on image, video and text data using a variety of Deep Neural Networks (DNN) architectures to produce models and inferences. But recently, Practitioners are beginning to move out of their traditional silos and the analytical approaches are merging. The goal of these innovations is to advance DL and ML for full stack analytics. In full stack AI, a wide variety of data and information will need to be accessed. The many traditional advantages of very large databases for securely managing data are well suited for AI analytics cycle. We need performance and efficiency at every level of the stack.
ANALYTICS CYCLE: The analytics process is traditionally called the analytics pipeline. In the pipeline the data is selected, filtered, reduced dimensionally, and transformed to get it into the form required for ML or DL. The data is then analysed with the appropriate algorithms and finally the model or the results generated. It is rare that only one pass of the analytics is performed. Developing an understanding of subtleties of complex data generally requires multiple analyses or experiments. Often insights gained from one analysis are used to refine the next analysis. Thus we believe it is critical to recognize that the process is an analytical cycle.
A. Feature Characteristics: We next explore the characteristics of the data that will serve as input for analytics. In databases, we place data of the same type into each column (ex: age, gender, zip code, …). In ML each column can be a feature or the basis of a feature for analysis. Sometimes the data in a single ML feature is based on multiple columns. There are two major classifications of features, categorical and numerical. Categorical features take on a fixed number of discrete values that have no mathematical ordering (ex: gender has no greater than or less than ordering). Typical categorical features are represented using strings or integers. Numerical features are quantitative variables that can take on a range of numerical values and they have a mathematical ordering (ex: income).
B. Cardinality Cardinality is the number of unique values a column (feature) may contain. For example, while one may have millions of US customers, they will reside in one of 50 states, so the state column would have a maximum cardinality of 50. Of course, the true cardinality of a dataset may be less than the maximum possible (ex: having only customers in 38 states). In the process of data analysis we can use data binning to reduce the cardinality of the data. Binning will group high-cardinality data into a set of intervals. The number of buckets (cardinality) is often much less than the original. Binning can be helpful in two ways, first it reduces the effects of minor observation errors and secondly it aligns the analysis with business decisions (ex: target marketing segments). Binning can be either linear or non-linear. An example of linear Binning is to group people into equal size buckets (ex: 0–9, 10–19, 20–29, …). An example of a non-linear Binning of “age life-cycle groups” (ex: 0–3, 4–12, 13–16, 16–21, 21–65,…). 5.
COLUMNAR DATABASE The advantages of columnar databases to accelerate performance of queries are well known. By storing data in a columnar form, the database can access only the data it needs to perform a query without having to access the unneeded data in the rest of the row. Query performance is often dramatically faster. Many software and hardware techniques have been developed to accelerate performance on column databases such as compression.
A. Dictionary Columnar Compression Compression can dramatically improve the performance of analytical queries by reducing the amount of data that needs to be read. In addition, large multi-TB databases that could previously only be stored on disk can be completely stored in memory if compression techniques are used. Examples of 2x to 30x compression have been achieved using multi-level compression techniques. Columnar databases typically use dictionary compression and then this data additionally compressed with run-length encoded (RLE) or Huffman-based compression. Dictionary encoding provides data compression and also captures metadata that are used for queries, featurization as well as ML and DL analytics. In dictionary encoding each original data value is mapped to an integer. For example, in a state column, the string “Alabama” is mapped to “1”, “Alaska” is mapped to “2”, and “The State of Rhode Island and Providence Plantations” is mapped to “39.” Given that there are 50 states we can determine that only 6 bits are needed (ceil(log2(50)) in each column entry, which is a dramatic reduction in storage required. The dictionary encoding technique can also be used for analytics packages like Apache Spark SQL, which does processing on columnar data.
COLUMNAR DB STRUCTURE FOR AI: Data Scientists use a variety of tools to explore the data and perform featurization. This approach inherently involves making one or more intermediate copies of the data, which is both inefficient and has security issues. Since it is increasingly common for data scientists to use SQL or DSL-style SQL to both extract data from databases and transform them into features for ML frameworks, performing all the whole operations in a database would be an easy transition. With SQL one can already leverage the efficiencies of RDBMS and NoSQL databases manipulating large data at scale. Database execution benefits from a wide variety of innovations including: indexes, efficient joins, query optimizers, etc. In addition database optimizers have also become very efficient at extracting efficient parallelism from high-level SQL. Columnar databases (RDBMS and NoSQL) are well suited to efficiently access just the data required to construct DL and ML features. Working entirely within the database for featurization the unsecure working copies of the data are avoided as well.
A. Feature Transformations for DL and ML: There are several different ways to get features ready for analysis. Feature selection is the process of selecting a subset of relevant features for model use. Feature creation (or feature transformation) is the process of creating a new feature from other information.
B. Normalization Most data exists at a wide variety of scales. It is best to standardize (or normalize) all data into similar ranges. Depending on the data there are a variety of manners to “standardize” data, these include: linear scaling, min/max scaling, mean normalization, standard deviation normalization, logarithmic scaling, z-score transforms, and others.
C. One-hot encoding One-hot encoding maps a single low-cardinality column into multiple columns of binary vectors, each of which has at most a single one-value. Some also refer to one-hot encoding as “Categorical Identity Columns.” This is an effective encoding for categorical data of low cardinality of both numerical and string types.
D. Binarizer, quantile, hash buckets, and Binning Binarizer, quantile, hash buckets, and bucketization discretization are all different types of cardinality reduction techniques for ML analysis. Binarization is the process of thresholding numerical features to binary (either: 0.0 or 1.0) features. This can be done by linear thresholding or applying a logistic function. This can also be viewed as selective one-hot encoding. Quantile discretization is the term used when the range of is divided into number of quantiles or bins. Typically with quantile discretization the range of input values is divided in a linear fashion. If the quantile discretization is set to create 4 buckets then the output would be 4 one-hot columns (a 4-element vector). Hash buckets are another way to reduce cardinality of numeric columns. The hashing used is a modulo operator. While this may map completely unrelated features together it seems that this works well in practice, in part because features from other columns can be used to differentiate them. Binning is the term used for dividing the feature into buckets where the ranges of each bucket may be different. Typically the range of input values is divided in a non-linear fashion. Sometimes the term Binning is used as a general term to describe every type described in this section. All of these discretizations serve two purposes in DL and ML. First, they produce lower cardinality representations which reduce the chances of over-fitting. Second, they can present data in a manner that is closer to the needs of the desired output.
COLUMMAR DATABASE AND AI ARCHITECTURE: We propose an optimized full-stack architecture encompassing columnar database and AI analytics (DL and ML), to both maximize efficiency of data movement as well as address the needs of the analytics cycle, Optimized columnar database and AI architecture. When we use the term columnar database we are referring both to Columnar RDBMS and Columnar NoSQL. The database should manage and store all data, metadata, featurization methods, and information around every analysis. The guiding principles for this architecture are increasing data movement efficiency, providing performance at scale, and incorporating feedback based on learning to improve analysis. An example of the feedback mechanisms from the analysis to the columnar database is new bucketizations that are inferred from DL/ML analysis. These bucketizations may be useful for future analysis. We expect data sources to contain a wide variety of data, metadata and hyperparameters about all aspects of AI. Examples of information stored and managed by the columnar database include, but are not limited to:
• created features, metadata on the features, and the calculations used to derive these created features;
• the analysis performed and the hyperparameters used for each analysis;
• weights and biases from previous deep learning neural nets to facilitate transfer learning;
• feedback on the importance/ranking/relevance of each feature, as well as, various feature Binning that worked well or were learned in the course of training;
- results from the analysis may also be used for future analysis.
Looking more closely at the various featurization operations we can see where new AI algorithms can provide feedback or improvement for subsequent analysis.
First and foremost is the section of what type of featurization operation that should be performed. For example is it better to “Normalize” a column or to “Standardize” a column. Some of the operations below such as Binning should only be applied to low-cardinality data for practical concerns about number of output columns.
Various AI algorithms can also be created to provide improvement feedback for these operations. Databases also have the proper mechanism for collaborative work typical of most use cases in production analytics. Databases are also designed to have governance mechanisms for security and regulatory tracking. In the future we expect Wide & Deep architectures to evolve to Wide-To-Deep (WTD) architectures that employ a range of techniques orchestrated by intelligent agents. We also expect multi-level architectures where for instance an ML result feeds into DL and ML to DL. These multi-level architectures may also both features that are passed through to the next level as well as have new features added at a level. In this evolution the value of the columnar database architecture and the ADV innovation will still hold.
CONCLUSION The scope of the various disciplines of AI analytics is expanding to include every discipline in a unified full stack. In addition when we look at typical analytical workflows we find that the analytics pipeline is actually a rich analytic cycle that has data reuse as well as a variety of information flows and feedbacks. We also see many data types that can be analysed with a variety of algorithms. Finally we see that all of the data and analytics need to be stored and managed in a secure fashion. This has led us to see the importance of an integrated database (data warehouse, NoSQL, RDBMs) combined with AI analytics. It is well known that columnar databases are the most efficient for implementing data warehouses. The dictionary compression techniques used in columnar databases both reduce the data that needs to be stored and improve query execution. The ADV innovations described in this paper are additions to the dictionary for columnar data for accelerating the wide variety of featurization techniques that prepare data for AI analytics (ML & DL). The ADV columnar database innovation increases the efficiency of featurization by minimizing data movement and data duplication. We expect to seem more innovations for integrated columnar database and AI architecture as full stack AI is more fully explored.