Hive Support - Connect_ETL - 9.13

Connect ETL for Big Data Sort User Guide

Product type: Software
Portfolio: Integrate
Product family: Connect
Product: Connect > Connect (ETL, Sort, AppMod, Big Data)
Version: 9.13
Language: English
Product name: Connect ETL
Title: Connect ETL for Big Data Sort User Guide
Copyright: 2023
First publish date: 2003
Last updated: 2023-09-11
Published on: 2023-09-11T19:03:59.237517

Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, query, and analysis through a SQL-like query language called HiveQL. Hive converts HiveQL queries into MapReduce jobs, so Connect for Big Data Sort can be invoked for the sort phases of those jobs when you specify it.

When a query is converted into a MapReduce job, Hive converts the fields inside the query into key/value pairs as follows:

  • The join, group by, order by, and sort by fields are all grouped together to form the key, which is converted to a HiveKey object.
  • The remaining fields are grouped together to form the value.
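
As an illustration of the key/value split described above, consider the following minimal HiveQL sketch. The table name `orders` and its columns are hypothetical and are not part of the product documentation.

```sql
-- Hypothetical table and column names, for illustration only.
-- customer_id is the ORDER BY field, so it forms the key (the HiveKey object).
-- order_total is a remaining field, so it forms the value.
SELECT customer_id, order_total
FROM orders
ORDER BY customer_id;
```

If the query also contained join, group by, or sort by fields, those fields would be grouped into the key in the same way.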

Connect for Big Data Sort supports all non-user-defined data types that are allowed as HiveKey fields:

  • TINYINT: 1-byte integer
  • SMALLINT: 2-byte integer
  • INT: 4-byte integer
  • BIGINT: 8-byte integer
  • BOOLEAN: TRUE/FALSE
  • FLOAT: single precision
  • DOUBLE: double precision
  • STRING: sequence of characters in a specified character set
  • BINARY (starting with Hive 0.8.0)
  • TIMESTAMP (starting with Hive 0.8.0)
  • DECIMAL (starting with Hive 0.11.0)
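
The sketch below shows a sort whose key fields use only built-in types from the list above. The table `events` and all column names are assumptions made for this example.

```sql
-- Hypothetical table; every column uses a built-in type from the list above.
CREATE TABLE events (
  event_id   BIGINT,
  priority   TINYINT,
  is_active  BOOLEAN,
  score      DOUBLE,
  label      STRING,
  created_at TIMESTAMP
);

-- priority (TINYINT) and created_at (TIMESTAMP) are the SORT BY fields,
-- so they are the fields grouped into the HiveKey for the sort.
SELECT event_id, priority, created_at, label
FROM events
SORT BY priority, created_at;
```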

To enable Connect for Big Data Sort for HiveQL queries, edit the Hive script as follows. Properties are set using this syntax:

set <property_name>=<property_value>;
  • Set the dmx.hive.support property to true:
set dmx.hive.support=true;
  • Specify the location of the dmxhadoop JAR file as follows, where <type> is the MapReduce version described at the beginning of Connect for Big Data Sort Accelerator Properties. This statement adds the file to both the distributed cache and the CLASSPATH:
add JAR <connect_install>/lib/dmxhadoop_<type>.jar;
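
Putting the two statements together, an edited Hive script might look like the following sketch. The `<connect_install>` and `<type>` placeholders are kept exactly as the guide writes them and must be replaced with values for your installation before the script will run. The query, the table `sales`, and its columns are hypothetical.

```sql
-- Enable Connect for Big Data Sort for the queries in this script.
set dmx.hive.support=true;

-- Add the dmxhadoop JAR file to the distributed cache and the CLASSPATH.
-- Replace <connect_install> and <type> with the values for your environment.
add JAR <connect_install>/lib/dmxhadoop_<type>.jar;

-- Hypothetical query: region is the group by and order by field,
-- so it forms the key for the sort phases.
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region
ORDER BY region;
```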