Connecting to Impala requires configuration steps before connections can be defined. Connection requirements and behavior differ between Impala sources and targets.
Impala source connections
JDBC connectivity
When Connect for Big Data reads from an Impala database table on an ETL server/edge node or in the cluster via JDBC, the data is staged temporarily in uncompressed format to a text-backed Impala table.
Impala target connections
JDBC connectivity
When Connect for Big Data writes to an Impala database table via JDBC, data is generally loaded directly into target tables. Writes are staged temporarily in compressed or uncompressed format to a text-backed Impala table only when one or more of the following conditions prevents direct loading:
- A target table has one or more partitions
- A parquet-backed target table has any timestamp columns
- A target table uses Truncate or Apply Change (CDC) dispositions
- The job runs on localnode or singleclusternode
- A user forces Connect for Big Data to stage data by setting the environment variable DMX_IMPALA_TARGET_FORCE_STAGING to 1, which uses the two-step process implemented in previous versions of Connect
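As a sketch of the last condition above, staging can be forced by exporting the environment variable in the shell session before the job runs. The job command shown in the comment is illustrative only; substitute your actual Connect for Big Data invocation.

```shell
# Force Connect for Big Data to stage Impala target writes in a
# text-backed Impala table (the two-step process implemented in
# previous versions of Connect).
export DMX_IMPALA_TARGET_FORCE_STAGING=1

# Run the job in the same session so it inherits the variable.
# The command below is a placeholder, not a documented invocation:
# dmxjob /run my_impala_load.dxj
```

Unset the variable (or leave it unset) to restore direct loading wherever none of the conditions above applies.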
Update and Upsert dispositions are supported only for Kudu tables.
At run time, Connect for Big Data accesses the Kudu JARs for Impala access from /opt/cloudera/parcels/CDH/lib/kudu on the edge/master node. You can override this default location with the environment variable KUDU_HOME. For example, export KUDU_HOME=/opt/cloudera/parcels/CDH/lib/kudu explicitly sets the location accessed at run time to /opt/cloudera/parcels/CDH/lib/kudu.
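A minimal sketch of the override, assuming the Kudu JARs have been placed somewhere other than the default parcel path (the alternate path below is hypothetical):

```shell
# Default location searched at run time:
#   /opt/cloudera/parcels/CDH/lib/kudu
# Point KUDU_HOME at the directory that actually holds the Kudu JARs;
# /opt/custom/kudu is an illustrative path, not a product default.
export KUDU_HOME=/opt/custom/kudu

# Export in the same session (or in the profile of the user running
# the job) so Connect for Big Data picks it up at run time.
```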