Connect to Databricks database tables

Connect ETL Installation Guide


Through JDBC connectivity, Connect for Big Data supports Databricks databases as sources and targets. This section describes how to create a JDBC Databricks database connection.

Before you create the connection, specify parameters in a Connect execution profile file to support the distributed Databricks configuration. See Work with the Connect Execution Profile File.

Databricks database connection requirements

Databricks requires a JDBC connection configuration with the driver name and location for all connections.

Before attempting to connect to a Databricks database, do the following:

  • Install the Connect server on an Amazon Elastic Compute Cloud (EC2) instance, Azure Virtual Machine (VM), or your local machine.
  • Specify JDBC and Spark parallelization parameters in the Connect JDBC configuration file. The parameters in this file, which is identified by the DMX_JDBC_INI_FILE environment variable, provide Connect with the mandatory and optional values required to access an Amazon S3 bucket or Microsoft Azure blob when starting a Databricks query (see the example following this list). Connect accesses Databricks using keys-based authentication. If no access keys are provided, Connect issues a UNIAMCRE error message and aborts the job.
  • Specify Databricks deployment configuration parameters in a Connect execution profile. Use a global, user, and/or job-specific execution profile. Without this configuration, the Databricks database tables are unreachable. See Work with the Connect Execution Profile File and Connect to Databricks File Systems (DBFS).
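
For example, on a Linux Connect server you might point Connect at the JDBC configuration file through the environment variable before running the job; the file path below is purely illustrative:

  export DMX_JDBC_INI_FILE=/opt/Connect/conf/jdbc.ini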

The parameters in the Connect JDBC configuration file for Databricks include the following (a sample appears after this list):

  • DriverName - Required JDBC driver name.
  • DriverClassPath - Required JDBC class path.
  • ANALYZETABLESTATISTICS - When set to y, Connect can run analyze queries that collect table statistics. The default value is n.
  • ANALYZECOLUMNSTATISTICS - When set to y, Connect can run analyze queries that collect column statistics. Default value is n.
  • MAXPARALLELSTREAMS - Optional integer representing the maximum number of parallel streams that can be established for loading data into the staging data file. By default, MAXPARALLELSTREAMS is set to the number of CPUs available on the client machine.
  • WORKTABLEDIRECTORY - Required path to an S3 bucket, Azure blob container, or Databricks File System (DBFS) store in which to stage data. You must mount an S3 bucket or Azure blob container using the Databricks File System (DBFS). Example URLs include:
    • s3a://dev for an S3 bucket
    • wasbs://dmxdbfstorage@dmxdbfstorage.blob.core.windows.net/dev for an Azure Blob
    • dbfs:/dev for a DBFS store
  • WORKTABLENCRYPTION - Server-side encryption algorithm for encrypting staging data in the S3 bucket. Valid values are AES256 and aws:kms.
  • DBFSMOUNTPOINT - DBFS mount point (DBFS path) required by WORKTABLEDIRECTORY. DBFSMOUNTPOINT is mandatory if the work table directory maps to an S3/Azure URL.
  • MAXWORKFILESIZE - Optional integer. The maximum size, in bytes, of each staging file written by a task. The default value is 134217728 (128 MB).
  • WORKTABLESCHEMA - Optional schema name to use for staging data. The default schema for staging data is the same as the target data schema.
  • WORKTABLECODEC - A compression codec used to compress the files in the staging directory. Valid values are gzip (default), bzip2, and uncompressed.
  • AWSACCESSKEYID - A 20-character, alphanumeric string that Amazon provides upon establishing an AWS account. Connect ignores this parameter unless WORKTABLEDIRECTORY is an S3 bucket. If Connect runs in EC2, AWSACCESSKEYID is optional.
  • AWSACCESSKEY - The 40-character string, also known as the secret access key, which Amazon provides upon establishing an AWS account. Connect ignores this parameter unless WORKTABLEDIRECTORY is an S3 bucket. If Connect runs in EC2, AWSACCESSKEY is optional.

Connect requires the access key id and the secret access key to send requests to an Amazon S3 bucket unless an AWS temporary session token is required, in which case Connect requires the access key id and AWS temporary session token. See the AWSTOKEN parameter below.

  • AWSTOKEN - An AWS temporary session token, granting temporary security credentials (temporary access keys and a security token) to any IAM user enabling them to access AWS services. This alternative authentication method replaces a full-access AWS storage access key. Connect ignores this parameter unless WORKTABLEDIRECTORY is an S3 bucket.
  • AzureStorageAccessKey - A 512-bit Azure Blob Storage access key for an active account; Microsoft issues two such keys when an Azure Portal account is established. If Connect runs in Azure, AzureStorageAccessKey is optional. If storage access is required and the key is missing or invalid, Connect issues an AZSQDWTERR error message and aborts the job. Connect ignores this parameter unless WORKTABLEDIRECTORY is an Azure blob container.
  • AzureStorageSAS - A shared access signature (SAS) URI that grants restricted access rights to Azure Storage resources. This alternative authentication method replaces a full-access Azure Storage access key. Connect ignores this parameter unless WORKTABLEDIRECTORY is an Azure blob container. 
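
The following is a minimal sketch of how these parameters might appear in the JDBC configuration file. Only the parameter names come from the list above; the section header, driver entries, paths, bucket name, mount point, and key placeholders are illustrative assumptions, and the exact file syntax is described in the Connect JDBC configuration file documentation.

  [Databricks]
  DriverName=<Databricks JDBC driver name>
  DriverClassPath=<path to the Databricks JDBC driver jar>
  WORKTABLEDIRECTORY=s3a://dev
  DBFSMOUNTPOINT=/mnt/dev
  MAXPARALLELSTREAMS=8
  WORKTABLECODEC=gzip
  AWSACCESSKEYID=<20-character access key id>
  AWSACCESSKEY=<40-character secret access key>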

Defining Databricks database table connections

In the Database Connection dialog, define a connection to a Databricks database as follows:

  • At DBMS, select Databricks.
  • At Access Method, select JDBC.
  • At Database, select a previously defined Databricks JDBC database connection URL (an example URL follows this list).
  • At Authentication, select Auto-detect.
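
As an illustration only, a Databricks JDBC connection URL generally takes a form similar to the following; the exact format depends on the JDBC driver you install, and the workspace host, HTTP path, and personal access token shown here are placeholders:

  jdbc:databricks://<workspace-host>:443/default;transportMode=http;ssl=1;httpPath=<http-path>;AuthMech=3;UID=token;PWD=<personal-access-token>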

Databricks target connections

Using a Databricks JDBC connection, Connect for Big Data can write supported Databricks data types to Databricks targets directly for optimal performance.

At the Target Database Table dialog, define a Databricks database table target:
  1. At Connection, select a previously defined Databricks target connection or select Add new... to add a new one.
  2. Select a table from the list of Tables, or select Create new... to create a new one.
    • User defined SQL statement is not supported.
    • All target disposition methods are supported. For information, see the Connect help topic, “Supported dispositions for target database tables.”
  3. On the Parameters tab, the following optional parameters are available for Databricks target database tables. Values specified here take precedence over their corresponding property in the JDBC configuration file, if any.
    • Analyze table statistics - enables analyze queries that collect table statistics.
    • Analyze column statistics - enables analyze queries that collect column statistics.
    • Maximum parallel streams - Optional integer representing the maximum number of parallel streams that Connect can establish for loading data into the staging data file. By default, this is set to the number of CPUs available on the client machine.
    • Work table directory - the parent-level directory in S3, Azure Blob storage, or DBFS in which Connect creates job-specific subdirectories.

      When the work table directory is an S3 bucket, you must mount the S3 bucket through DBFS (see the sketch following this procedure). For more details, see the Databricks documentation concerning Amazon S3.

      When the work table directory is an Azure blob container, you must mount the blob container through DBFS. For more details, see the Databricks documentation concerning Azure storage.
    • Work table schema - the schema used to create the staging table. By default, Connect creates the staging table in the same schema as the target table.
    • Work table codec - specifies the compression algorithm used to compress Databricks data. Valid values are gzip (default), bzip2, and uncompressed.
      Note: Set commit interval and Abort task if any record is rejected are not supported.
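
The DBFS mount itself is created on the Databricks side rather than by Connect. The following is a rough sketch of mounting an S3 bucket into DBFS from a Databricks notebook using dbutils.fs.mount; the bucket name, mount point, and key-based credential handling shown here are assumptions for illustration, and the current Databricks documentation for your workspace should be followed:

  # Illustrative only: mount an S3 bucket into DBFS from a Databricks notebook.
  # The bucket, mount point, and credentials below are placeholders.
  import urllib.parse

  access_key = "<20-character AWS access key id>"
  secret_key = "<40-character AWS secret access key>"
  # URL-encode the secret key because it is embedded in the source URL.
  encoded_secret = urllib.parse.quote(secret_key, safe="")

  dbutils.fs.mount(
      source=f"s3a://{access_key}:{encoded_secret}@dev",  # bucket from the s3a://dev example above
      mount_point="/mnt/dev"                              # value supplied to Connect as DBFSMOUNTPOINT
  )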

Defining Databricks source connections

Using a Spark JDBC connection, Connect can read supported Databricks data types from any Databricks table. For all ETL jobs, Connect for Big Data supports Databricks database tables as sources and as lookup sources.

At the Source Database Table dialog or at the Lookup Source Database Table dialog, define either a Databricks database table source or lookup source, respectively.

At Connection, select one of the following:
  • a previously defined Databricks source connection, or
  • Add new... to add a new connection.