Blog: Apache Hive and Applications — 1
The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive. Apache Hive enables advanced work on Apache Hadoop Distributed File System and MapReduce. It enables SQL developers to write Hive Query Language statements similar to standard SQL statements.
Today Hadoop has the privilege of being one of the most widespread technologies when it comes to crunching huge amounts of Big Data. But Hadoop is a like an ocean with a vast array of tools and technologies that are exclusively associated with it. One such technology is the Apache Hive technology. Apache Hive is a Hadoop component that is normally deployed by data analysts even though Apache Pig is deployed for the same purpose it is used more by researchers and programmers. It is an Open Source data warehousing system. It is exclusively used to query and analyze huge data sets stored in the Hadoop storage.
- Hive an essential tool in the Hadoop ecosystem that provides an SQL dialect for querying data stored in HDFS, other file systems that integrate with Hadoop such as MapR-FS and Amazon’s S3 and databases like HBase(the Hadoop database) and Cassandra.
- Hive is a Hadoop based system for querying and analysing large volumes of structured data which is stored on HDFS.
- Hive is a query engine built to work on top of Hadoop that can compile queries into MapReduce jobs and run them on the cluster.
- Hive was one of the first abstraction engines to be built on top of MapReduce.
- Started at Facebook to enable data analysts to analyse data in Hadoop by using familiar SQL syntax without having to learn how to write MapReduce.
When to use Hive
- Most suitable for data warehouse applications where relatively static data is analyzed.
- Fast response time is not required.
- Data is not changing rapidly.
- An abstraction to underlying MR program.
- Hive of course is a good choice for queries that lend themselves to being
- expressed in SQL, particularly long-running queries where fault tolerance is desirable.
- Hive can be a good choice if you’d like to write feature-rich, fault-tolerant, batch (i.e., not near-real-time) transformation or ETL jobs in a pluggable SQL engine.
Executing Hive Sql using Python
Here is a small example of a Python script to write Hive tables from CSV Data. In order to use this script you mush install “pyhive” by using below code.
“pip install pyhive”
Then you are ready to write to Hive script:
from pyhive import hive
conn = hive.Connection(host=”Server_IP”, port=port_no, username=”username”)
cursor = conn.cursor()
cursor.execute(“Hive Sql Script”)