To run jobs on a Databricks cluster, you must deploy the Connect Debian package to the cluster and install it on each node. Connect CloudFSUtil, a command-line utility included with Connect for Big Data, moves the required files to all the nodes in a cluster on Amazon Web Services (AWS) or Microsoft Azure. After you deploy Connect to Databricks, your jobs can access Databricks file systems and database tables.
Requirements
Set up the Spark cluster, install Databricks, and configure JDBC for Spark and Databricks. When connecting to Databricks databases, we recommend using the latest JDBC drivers available on the Databricks website.
If your Connect for Big Data jobs connect to DB2, Oracle, or SQL Server databases, install the JDBC drivers on Databricks in a subdirectory of the work directory that contains the dmxspark3ix.jar file and that is configured in the Connect execution profile files. The name of the subdirectory for each driver must be:
- DB2: db2
- Oracle: oracle
- SQL Server: sqlserver
For information on work directory configuration, see Connect to Databricks File Systems (DBFS).
Before you use CloudFSUtil, you must configure it for your environment. For more information, see Connect CloudFSUtil configuration.
For example, the following CloudFSUtil command copies the SQL Server JDBC driver to the sqlserver subdirectory of the work directory:
cloudfsutil -put mssql-jdbc-8.2.2.jre8.jar dbfs:/connect/files/work/sqlserver/mssql-jdbc-8.2.2.jre8.jar
You also need a user account with sudo permission to run the Debian install.
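As a quick check (a minimal sketch; the exact sudoers configuration depends on your environment), you can confirm that the account can elevate with sudo:
# Prompts for a password if required and fails if the account lacks sudo rights.
sudo -v && echo "sudo access confirmed"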
Prepare the install files
To install Connect on Databricks, you require the following files.
- A DMExpress executable bin file for Connect, typically named dmexpress_$<DMX_version>-1.bin. Run this on your Linux system outside the cluster.
- A license key package bin file for Connect, typically named dmexpresslicense-$<license_site_ID>_$<license_date>.bin. Run this on your Linux system outside the cluster.
- A Debian install script. Databricks can use initialization scripts during cluster creation to install the Connect Debian package on each node in the cluster. If you do not configure the init script, you must install the Debian package manually on each node in the cluster. For more information, see Debian Package Installation.

Note: Databricks requires UNIX (LF) end-of-line characters in the script to run properly.

The following is an example Debian install script:
#!/bin/bash
# Variables: adjust these values to match your environment.
version=9.13.33
workDir=/dbfs/mnt/azuregen2/work
debPackageDBFSPath=/dbfs/mnt/azuregen2/dir1
connectDEBPackageName=dmexpress_${version}-1.deb
# Copy the Connect Debian package from DBFS to the node, install it, and clean up.
cp -f $debPackageDBFSPath/$connectDEBPackageName $connectDEBPackageName
chmod a+x $connectDEBPackageName
sudo apt install -y ./$connectDEBPackageName
rm $connectDEBPackageName
# Create the temporary storage directory.
sudo mkdir -p /usr/tmp
Within the example Debian script, the following variables are set:
variable | value |
---|---|
debPackageDBFSPath | The mount point within Databricks where you keep the DMExpress bin executable files. |
connectDEBPackageName | The name of the Debian package to be deployed. |
workDir | A directory path to which to save Connect job staging materials. This should match the configured workDirectory in your Connect execution profile file. For information, see Databricks. |
version | The version of Connect to install. |
After the variable assignments, the script uses the apt package manager to install the Connect Debian package and creates the /usr/tmp directory for temporary storage.
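Because Databricks requires UNIX (LF) end-of-line characters in the init script, convert the script before you upload it if you edited it on Windows. A minimal sketch using standard tools (the script name is an example):
# Convert CRLF line endings to LF; either command works.
dos2unix install_connect.sh
# or: sed -i 's/\r$//' install_connect.sh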
Move the install files to Databricks
Use CloudFSUtil to copy the install files to your cloud storage and to DBFS. For example, to copy the Connect executable and license bin files to Azure Blob Storage:
cloudfsutil -put dmexpress_9.13.11-1.bin wasbs://account.blob.core.windows.net/container/
cloudfsutil -put dmexpresslicense-70590_20200925-1.bin wasbs://account.blob.core.windows.net/container/
To copy the license bin file to an AWS S3 bucket:
cloudfsutil -put dmexpresslicense-70590_20200925-1.bin s3a://bucket/
To copy the Debian packages to a DBFS root folder:
cloudfsutil -put dmexpress_9.13.33-1.deb dbfs:/rootdir/dmexpress_9.13.33-1.deb
cloudfsutil -put dmexpresslicense-70590_20200925-1.deb dbfs:/rootdir/dmexpresslicense-70590_20200925-1.deb
Note the following guidelines:
- The Connect bin executables must reside on a mounted drive.
- You can also move files using the Azure Storage and AWS S3 portals.
- To make the Debian install script available during cluster initialization, you must save it to a DBFS root folder.
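After the uploads complete, you can verify that the files are in place, for example with the Databricks CLI (assuming it is installed and configured for your workspace):
databricks fs ls dbfs:/rootdir/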
Deploy to Databricks clusters
Depending on your needs, you deploy Connect to an interactive cluster or configure a job cluster to start on demand each time a Connect job runs. Interactive clusters run continuously. Job clusters start up only when a job runs, after which they are deleted. Configure the type of cluster appropriate for your needs and environment.
A typical deployment sequence for Connect ETL on Databricks on Azure
Prerequisites:
- Install Connect ETL on the Databricks Gateway VM or on an on-premises Linux system.
- Upload the Connect ETL Debian installation package to the Databricks File System, and create and upload the initialization script to the Databricks File System. See the Debian Package Installation section.
Sequence number | Description |
---|---|
1 | Create Databricks Connect ETL jobs and tasks as described in the Databricks section of the online help, and create an execution profile. The profile includes a Databricks job cluster; see the Deploy to a Databricks job cluster section below for more information. Upon execution, the Connect ETL job communicates this information to the Databricks control plane. |
2 | The Databricks control plane retrieves the previously uploaded initialization script and Connect ETL Debian installation package from the Databricks File System. |
3 | Databricks uses the information from Step 2 to create a Databricks job cluster and install Connect ETL on it. It then executes the Connect ETL job and tasks described in Step 1. |
4 | The Connect ETL job running in the Databricks cluster ingests data from the on-premises sources defined in the Connect ETL tasks. |
5 | The Connect ETL job running in the Databricks cluster writes data to the Databricks files and/or tables defined in the Connect ETL tasks. |
6 | Once the Connect ETL job completes, Databricks terminates and cleans up the job cluster and returns control to Connect ETL running on the Databricks Gateway. |
This section describes how to deploy to interactive and job Databricks clusters.
Deploy to a Databricks interactive cluster
You must first configure and create a Databricks interactive cluster before you deploy Connect for Big Data and run Connect jobs. For information on how to create an Azure Databricks cluster, see Create a cluster.
This procedure assumes that you configure the Databricks cluster initialization script to include the Connect for Big Data Debian install commands. An alternative method is to run the Debian install script manually on each cluster node.
- At the top of the cluster’s page, click Edit.
- Click Advanced Options.
- Under Spark, add the JAVA_HOME environment variable. For example:
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/jre
- Under Logging, set the destination to DBFS and the cluster log path to your desired location. For example:
dbfs:/mnt/azuregen2/cluster-logs
Note: When you save these values, Databricks adds an additional subdirectory to the end of the log path for the cluster ID.
- Under Init Scripts, select Destination DBFS and add the path and filename of your Debian install script (see the Prepare the install files section above) to the Init Script Path value. For example:
dbfs:/rootDir/install_connect.sh
- Click Confirm at the top of the cluster’s page to save these changes.
- To confirm a successful install:
- Click Event Log on the cluster’s page to verify the INIT_SCRIPTS_STARTED and INIT_SCRIPTS_FINISHED messages and times.
- Review the software installed in /usr/dmexpress by running the following command within a notebook:
ls -l /usr/dmexpress/
If the cluster reported errors during the Connect for Big Data install, review the init script logs stored in the init_scripts/cluster_ID_container_IP directory under the logging directory set up in the cluster’s Advanced Options. For example:
ls -l /dbfs/mnt/azuregen2/cluster-logs/init_scripts
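You can also confirm that the package is registered with dpkg from a notebook shell cell (a minimal check; the grep pattern assumes the package name begins with dmexpress):
dpkg -l | grep dmexpress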
Additional steps
After you deploy the interactive cluster, complete the following steps:
- Start the cluster. Click Start at the top of the cluster’s page. See your cluster documentation for more information.
- Create the Connect execution profile file. See Work with the Connect Execution Profile File and the Connect help topic “Execution profile file”.
- Configure Databricks source and target connections. See Databricks.
- Run Connect jobs on the cluster. Use the dmxjob command to run Connect tasks and jobs on the cluster; the command refers to the interactive deployment configuration in the Connect execution profile file. For more information, see the Connect help topic, “Running a job from the command prompt.” An example invocation follows this list.
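The following is a sketch of such an invocation, assuming an execution profile that defines a deployment configuration named databricks_interactive_cluster; the syntax mirrors the job cluster example later in this topic, and the job and profile paths are examples:
dmxjob /run /mydir/databricks/config/job/myjob_0001.dxj /runon SPARK databricks[databricks_interactive_cluster] /profile /mydir/databricks/config/databricks_profile/spark_execution_profile.json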
Deploy to a Databricks job cluster
A Databricks job cluster is created automatically (on demand) each time a Connect job runs and is deleted after the job completes. Because the cluster is created at run time and exists only for the duration of the job, the run-time information that defines the Connect job must include the cluster configuration.
Complete the following steps:
Create a configuration file for the job cluster. When you view the configuration properties of an existing Databricks cluster (for example, in the Azure interface), the properties are displayed in JSON format. You can use the JSON configuration of an existing interactive cluster as a template for the job cluster configuration file. For information on cluster configuration properties, see the Azure Databricks Jobs API webpage.
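A minimal sketch of such a file (for example, /mydir/dbricks_config/create_job_cluster.json, as referenced in the execution profile below) might look like the following; the property values are examples, and the exact set of properties depends on your workspace and your Databricks Jobs API version:
{
  "spark_version": "9.1.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 2,
  "init_scripts": [
    { "dbfs": { "destination": "dbfs:/rootDir/install_connect.sh" } }
  ],
  "cluster_log_conf": {
    "dbfs": { "destination": "dbfs:/mnt/azuregen2/cluster-logs" }
  },
  "spark_env_vars": {
    "JAVA_HOME": "/usr/lib/jvm/java-11-openjdk-amd64/jre"
  }
}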
{"databricks" : {
"deploymentConfigurations":[{
"name": "databricks_interactive_cluster",
"host": "https://westus.azuredatabricks.net",
"token": "tokenname",
"workDirectory": "dbfs:/mnt/azureblob/", "clusterID": "0111-311131-oka45"
},
"clusterConfig":
"/mydir/dbricks_config/create_job_cluster.json"
} ]
}
Run the Connect job with the dmxjob command, specifying the job cluster deployment configuration and the execution profile file. For example:
dmxjob /run /mydir/databricks/config/job/myjob_0001.dxj /runon SPARK databricks[databricks_job_cluster] /profile /nis4/mydir/databricks/config/databricks_profile/spark_execution_profile.json
For more information, see the Connect help topic, “Running a job from the command prompt.”