Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, query, and analysis through a SQL-like query language called HiveQL. Because HiveQL queries are converted into MapReduce jobs, Connect for Big Data Sort can be invoked for the sort phases of those jobs when specified.
When a query is converted into a MapReduce job, Hive converts the fields inside the query into key/value pairs as follows:
- The join, group by, order by, and sort by fields are all grouped together to form the key, which is converted to a HiveKey object.
- The remaining fields are grouped together to form the value.
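As an illustration of this split, consider a hypothetical table named employees with the columns dept, name, and salary (these names are assumptions made for the example, not part of the product). In a query like the following sketch, the sort by field would be expected to form the key, and the other selected fields the value:

```sql
-- Hypothetical table and column names, for illustration only.
-- sort by field (dept)            -> grouped into the key (HiveKey)
-- remaining fields (name, salary) -> grouped into the value
SELECT dept, name, salary
FROM employees
SORT BY dept;
```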
Connect for Big Data Sort supports all non-user-defined data types that are allowed as HiveKey fields:
- TINYINT: 1-byte integer
- SMALLINT: 2-byte integer
- INT: 4-byte integer
- BIGINT: 8-byte integer
- BOOLEAN: TRUE/FALSE
- FLOAT: single precision
- DOUBLE: double precision
- STRING: sequence of characters in a specified character set
- BINARY (starting with Hive 0.8.0)
- TIMESTAMP (starting with Hive 0.8.0)
- DECIMAL (starting with Hive 0.11.0)
To enable Connect for Big Data Sort for HiveQL queries, edit the Hive script as follows:
- Set the required Connect for Big Data Sort properties shown in Connect for Big Data Sort Accelerator Properties using the Hive CLI syntax:
set <property_name>=<property_value>;
- Set the dmx.hive.support property to true:
set dmx.hive.support=true;
- Specify the location of the dmxhadoop jar file as follows, where <type> is the MapReduce version, as described at the beginning of Connect for Big Data Sort Accelerator Properties. This adds the file to the distributed cache as well as the CLASSPATH:
add JAR <connect_install>/lib/dmxhadoop_<type>.jar;
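Put together, the steps above might look like the following Hive script. This is a minimal sketch, not a definitive configuration: the placeholders <property_name>, <property_value>, <connect_install>, and <type> are left unfilled and must be replaced with values from Connect for Big Data Sort Accelerator Properties and your installation, and the employees table with its columns is a hypothetical example.

```sql
-- Step 1: set each required Connect for Big Data Sort property, one per line.
-- Left commented out because the names and values depend on your environment.
-- set <property_name>=<property_value>;

-- Step 2: enable Hive support.
set dmx.hive.support=true;

-- Step 3: add the dmxhadoop jar to the distributed cache and the CLASSPATH.
-- <connect_install> and <type> are placeholders to replace before running.
add JAR <connect_install>/lib/dmxhadoop_<type>.jar;

-- Hypothetical query; the order by field (dept) forms the key of the
-- resulting MapReduce job, so its sort phase is where the sort applies.
SELECT dept, name, salary
FROM employees
ORDER BY dept;
```

Placing the set and add JAR statements before the query keeps the settings in effect for the MapReduce job that the query generates.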