Deploy Connect to a Databricks cluster in the cloud - Connect_ETL - 9.13

Connect ETL Installation Guide

Product type: Software
Portfolio: Integrate
Product family: Connect
Product: Connect > Connect (ETL, Sort, AppMod, Big Data)
Version: 9.13
Language: English
Product name: Connect ETL
Title: Connect ETL Installation Guide
Copyright: 2024
First publish date: 2003
Last updated: 2024-11-08
Published on: 2024-11-08T16:36:35.232000

To run jobs on a Databricks cluster, you must deploy the Connect Debian package to the cluster and install it on each node. Connect CloudFSUtil, a command-line utility included with Connect for Big Data, moves the required files to all the nodes in a cluster on Amazon Web Services (AWS) or Microsoft Azure. After you deploy Connect to Databricks, you can access Databricks file systems and database tables.

Requirements

Set up the Spark cluster, install Databricks, and configure JDBC for Spark and Databricks. When connecting to Databricks databases, we recommend using the latest JDBC drivers available on the Databricks website.

If your Connect for Big Data jobs connect to DB2, Oracle, or SQL Server databases, install the JDBC drivers on Databricks in a subdirectory of the work directory, that is, the directory that contains the dmxspark3ix.jar file and that is configured as the work directory in Connect execution profile files. The name of the subdirectory for each driver must be:

  • DB2: db2
  • Oracle: oracle
  • SQL Server: sqlserver
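
For illustration, assuming the configured work directory is dbfs:/connect/files/work (a hypothetical path), the resulting layout would be:

```
/connect/files/work/
    dmxspark3ix.jar
    db2/          <- DB2 JDBC driver jars
    oracle/       <- Oracle JDBC driver jars
    sqlserver/    <- SQL Server JDBC driver jars
```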

For information on work directory configuration, see Connect to Databricks File Systems (DBFS).

Before you use CloudFSUtil, you must configure it for your environment. For more information, see Connect CloudFSUtil configuration.

Run CloudFSUtil to copy the jars to the appropriate location. For example:
cloudfsutil -put mssql-jdbc-8.2.2.jre8.jar dbfs:/connect/files/work/sqlserver/mssql-jdbc-8.2.2.jre8.jar

You also need a user account with sudo permission to run the Debian install.

Prepare the install files

To install Connect on Databricks, you need the following files.

  • A DMExpress executable bin file for Connect, which is typically named dmexpress_$<DMX_version>-1.bin. Run this on your Linux system outside the cluster.
  • A license key package bin file for Connect, which is typically named dmexpresslicense-$<license_site_ID>_$<license_date>.bin. Run this on your Linux system outside the cluster.
  • A Debian install script. Databricks can use Debian initialization scripts during cluster creation to extract the Connect executable bin files and install them on each node in the cluster. If you do not configure the init script, then you need to install the Debian package manually on each node in the cluster. For more information, see Debian Package Installation.
    Note: Databricks requires UNIX (LF) end-of-line characters in the script to run properly.
A sample Debian installation script is shown below:
#!/bin/bash
workDir=/dbfs/mnt/azuregen2/work
debPackageDBFSPath=/dbfs/mnt/azuregen2/dir1
connectDEBPackageName=dmexpress_9.13.33-1.deb
cp -f $debPackageDBFSPath/$connectDEBPackageName $connectDEBPackageName
chmod a+x $connectDEBPackageName
sudo apt install ./$connectDEBPackageName
rm $connectDEBPackageName
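
Because Databricks requires UNIX (LF) line endings, a script edited on Windows may fail to run. One way to strip carriage returns before uploading is shown below; this is a sketch that assumes GNU sed, and install_connect.sh is the example script name used later in this guide:

```shell
# Strip Windows carriage returns so the init script runs on Databricks nodes.
# install_connect.sh is an example name; the guard makes this a no-op if the
# script is not present in the current directory.
if [ -f install_connect.sh ]; then
  sed -i 's/\r$//' install_connect.sh
fi
```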

Within the example Debian script, the following variables are set:

Variable               Value
debPackageDBFSPath     The mount point within Databricks where you keep the DMExpress bin executable files.
connectDEBPackageName  The name of the Debian package to be deployed.
workDir                A directory path in which to save Connect job staging materials. This should match the configured workDirectory in your Connect execution profile file. For information, see Databricks.
version                The version of Connect to install.

After the variable assignments, the script copies the Debian package from DBFS to the node, makes it executable, installs it with apt, and removes the local copy.

Move the install files to Databricks

Before you install the Connect Debian package on each node in the cluster, use CloudFSUtil, an executable included with Connect, to transfer the install files from your local computer (Windows or Linux, outside the Databricks cluster) to the DBFS mounted drive, where each node can access them. For example, from a directory containing all install files, run:
cloudfsutil -put dmexpress_9.13.11-1.bin wasbs://account.blob.core.windows.net/container/
cloudfsutil -put dmexpresslicense-70590_20200925-1.bin wasbs://account.blob.core.windows.net/container/
cloudfsutil -put dmexpresslicense-70590-20200925-1.bin s3a://bucket/
cloudfsutil -put dmexpress_9.13.11-1.deb dbfs:/rootdir/dmexpress_9.13.33-1.deb
cloudfsutil -put dmexpresslicense-70590_20200925-1.deb dbfs:/rootdir/dmexpresslicense-70590_20200925-1.deb

Note the following guidelines:

  • The Connect bin executables must reside on a mounted drive.
  • You can also move files using the Azure Storage portal or the AWS S3 console.
  • To make the Debian install script available during cluster initialization, you must save it to a DBFS root folder.
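
The individual cloudfsutil commands above can be wrapped in a small loop. This is a sketch, assuming the install files and the init script sit in the current directory and dbfs:/rootdir/ is the destination; adjust the file patterns and destination to your environment:

```shell
# Upload each install file that exists in the current directory to DBFS.
# File name patterns and dbfs:/rootdir/ are examples, not requirements.
for f in dmexpress_*.bin dmexpresslicense-*.bin dmexpress_*.deb dmexpresslicense-*.deb install_connect.sh; do
  if [ -e "$f" ]; then
    cloudfsutil -put "$f" "dbfs:/rootdir/$f"
  fi
done
```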

Deploy to Databricks clusters

Depending on your needs, you can deploy Connect to an interactive cluster or configure a job cluster that starts on demand each time a Connect job runs. Interactive clusters run continuously. Job clusters start only when a job runs, after which they are deleted. Configure the type of cluster appropriate for your needs and environment.

A typical deployment sequence for Connect ETL on Databricks on Azure

Prerequisites:

  • Install Connect ETL on a Databricks Gateway VM or an on-prem Linux system.
  • Upload the Connect ETL Debian installation package to the Databricks File System, and create and upload the initialization script to the Databricks File System. See the Debian Package Installation section.
  1. Create Databricks Connect ETL jobs and tasks as described in the Databricks section of the online help, and create an execution profile. This includes a Databricks job cluster; see the Deploy to a Databricks job cluster section below for more information. Upon execution, the Connect ETL job communicates this information to the Databricks control plane.
  2. The Databricks control plane retrieves the previously uploaded initialization script and Connect ETL Debian installation package from the Databricks File System.
  3. Databricks uses the information from step 2 to create a Databricks job cluster and install Connect ETL on it. It then executes the Connect ETL job and tasks described in step 1.
  4. The Connect ETL job running in the Databricks cluster ingests data from the on-prem sources defined in the Connect ETL tasks.
  5. The Connect ETL job running in the Databricks cluster writes data to the Databricks files and/or tables defined in the Connect ETL tasks.
  6. Once the Connect ETL job completes, Databricks terminates and cleans up the job cluster and returns control to Connect ETL running on the Databricks Gateway.

This section describes how to deploy to interactive and job Databricks clusters.

Deploy to a Databricks interactive cluster

You must first configure and create a Databricks interactive cluster before you deploy Connect for Big Data and run Connect jobs. For information on how to create an Azure Databricks cluster, see Create a cluster.

This procedure assumes you are configuring the Databricks cluster initialization script to include the Connect for Big Data Debian install commands. An alternative method is to run the Debian install script manually on each cluster node.

On the interactive cluster on which to install Connect, complete the following procedure:
  1. At the top of the cluster’s page, click Edit.
  2. Click Advanced Options.
  3. Under Spark, add the JAVA_HOME environment variable. For example,
    JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/jre
  4. Under Logging, set the destination to DBFS and the cluster log path to your desired location. For example:
    dbfs:/mnt/azuregen2/cluster-logs 
    Note: When you save these values, Databricks adds an additional subdirectory to the end of the log path for the cluster ID.
  5. Under Init Scripts, select Destination DBFS and add the path and filename of your Debian install script (see the Prepare the install files section above) as the Init Script Path value. For example:
    dbfs:/rootDir/install_connect.sh
  6. Click Confirm at the top of the cluster’s page to save these changes.
  7. To confirm a successful install:
    1. Click on Event Log on the cluster’s page to verify the INIT_SCRIPTS_STARTED and INIT_SCRIPTS_FINISHED messages and times.
    2. Review the software installed in /usr/dmexpress by running the following command within a notebook:
      ls -l /usr/dmexpress/
      If the cluster reported errors during Connect for Big Data install, review the init script logs stored in the init_scripts/cluster_ID_container_IP directory under the logging directory set up in the cluster’s Advanced Options. For example:
      ls -l /dbfs/mnt/azuregen2/cluster-logs/init_scripts
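
The checks in step 7 can be combined into one shell snippet run from a notebook %sh cell. This is a minimal sketch; dmexpress as the package name is an assumption based on the Debian package file name shown earlier:

```shell
# Check whether the Connect package is installed on this node and, if so,
# list its files. "dmexpress" is the assumed package name from
# dmexpress_9.13.33-1.deb.
if dpkg -s dmexpress >/dev/null 2>&1; then
  ls -l /usr/dmexpress/
else
  echo "dmexpress package not installed on this node"
fi
```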

Additional steps. After you deploy the interactive cluster, complete the following steps:

  1. Start the cluster. To start the cluster, click Start at the top of the cluster’s page. See your cluster documentation for more information.
  2. Create the Connect execution profile file. See Work with the Connect Execution Profile File and the Connect help topic “Execution profile file”.
  3. Configure Databricks source and target connections. See Databricks.
  4. Run Connect jobs on the cluster. Use the dmxjob command when you run Connect tasks and jobs on the cluster. The dmxjob command refers to the interactive deployment configuration in the Connect execution profile file. For more information, see the Connect help topic, “Running a job from the command prompt.”

Deploy to a Databricks job cluster

A Databricks job cluster is created automatically (on demand) each time a Connect job runs. Job clusters are created only at runtime; after the job runs, the cluster is deleted. Because the cluster is created on demand and exists only for the duration of the job, the run-time information that defines the Connect job must include the cluster configuration.

Complete the following steps:

Create a configuration file for the job cluster. When you view the configuration properties for an existing Databricks cluster (for example, in the Azure interface), you can view the properties in JSON format. You can use the JSON configuration of an existing interactive cluster as a template to create the JSON configuration file for a job cluster. For information on cluster configuration properties, see the Azure Databricks Jobs API webpage.
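
A minimal job-cluster configuration might look like the following sketch; the Spark version, node type, and worker count are illustrative values only, and the full property list is documented on the Azure Databricks Jobs API webpage:

```json
{
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 2
}
```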

Create the Connect execution profile file. You must reference the cluster configuration JSON file name in the clusterConfig property of the databricks deploymentConfiguration section. For example:
{"databricks" : {
  "deploymentConfigurations": [
    {
      "name": "databricks_interactive_cluster",
      "host": "https://westus.azuredatabricks.net",
      "token": "tokenname",
      "workDirectory": "dbfs:/mnt/azureblob/",
      "clusterID": "0111-311131-oka45"
    },
    {
      "name": "databricks_job_cluster",
      "host": "https://westus.azuredatabricks.net",
      "token": "tokenname",
      "workDirectory": "dbfs:/mnt/azureblob/",
      "clusterConfig": "/mydir/dbricks_config/create_job_cluster.json"
    }
  ]
}}
Run Connect in the Databricks cluster with the dmxjob command. Use the dmxjob command each time you run Connect tasks and jobs. The dmxjob command should reference both the Connect execution profile file and the job cluster deployment item in the profile file, as shown in the following example:
dmxjob /run /mydir/databricks/config/job/myjob_0001.dxj /runon SPARK databricks[databricks_job_cluster] /profile /nis4/mydir/databricks/config/databricks_profile/spark_execution_profile.json

For more information, see the Connect help topic, “Running a job from the command prompt.”