Operating system requirements
- Red Hat Enterprise Linux 8.9
- ext3, ext4, or XFS file system
Vertica prerequisites
The following prerequisites are required to run Vertica, the component responsible for running analytics for Data360 DQ+.
Vertica file system requirements
Vertica requires that you have one of the following file systems:
- ext3
- ext4
- XFS
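As a quick pre-flight check, the following sketch (illustrative, not part of the product) reports whether a file system type is one of the three supported types; the `df` invocation in the comment and any paths are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a file system type is supported by Vertica.
fs_supported() {
  case "$1" in
    ext3|ext4|xfs) echo "supported" ;;
    *)             echo "unsupported" ;;
  esac
}

# In practice, obtain the type of the target mount with something like:
#   df -T /data | awk 'NR==2 {print $2}'
fs_supported xfs     # prints "supported"
fs_supported btrfs   # prints "unsupported"
```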
Vertica package dependencies
Vertica requires the following packages be installed on the system:
- dialog
- mcelog
- sysstat
- gdb
- perl
- gcc-c++
Setting Vertica swap space
"Swap space" is a way to obtain more memory resources when RAM is full. It allows you to reserve space on a hard disk that can be used like RAM. To function properly, Data360 DQ+ needs a minimum of 2 GB swap space.
- Check if you have swap space already set up by running the following command:
free -m
- If you do not have swap space set up, you can use the following commands to create new swap space:
sudo mkdir -p /var/swap
sudo touch /var/swap/swap.1
sudo /bin/dd if=/dev/zero of=/var/swap/swap.1 bs=1M count=2050
sudo /sbin/mkswap /var/swap/swap.1
sudo chmod 600 /var/swap/swap.1
sudo /sbin/swapon /var/swap/swap.1
sudo /bin/sh -c "echo '/var/swap/swap.1 swap swap defaults 0 0' >> /etc/fstab"
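To confirm the 2 GB minimum is met, the swap total can be read from the `Swap:` line of `free -m` output. The sketch below (ours, with a made-up sample of `free -m` output; in practice pipe `free -m` in directly) parses that line and checks it against the minimum.

```shell
#!/usr/bin/env bash
# Hypothetical check: does reported swap meet the 2 GB (2048 MB) minimum?
swap_mb_from_free() {
  # Extract the total column of the "Swap:" line from `free -m` output.
  awk '/^Swap:/ {print $2}'
}

swap_ok() {
  # $1: swap size in MB
  [ "$1" -ge 2048 ] && echo "ok" || echo "too small"
}

# Sample `free -m` output (captured example; values are illustrative):
sample='              total        used        free
Mem:           7821        1234        6587
Swap:          2049           0        2049'

mb=$(printf '%s\n' "$sample" | swap_mb_from_free)
swap_ok "$mb"   # prints "ok" when swap >= 2048 MB
```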
Setting read-ahead for Vertica
- Run the following commands, inserting the disk path where indicated:
sudo /sbin/blockdev --setra 2050 {disk path here}
sudo /bin/sh -c "echo '/sbin/blockdev --setra 2050 {disk path here}' >> /etc/rc.local"
Setting SELinux to permissive mode
For Vertica to function properly, SELinux must be set to Permissive mode, as follows:
- Edit the /etc/selinux/config file to set SELINUX=permissive. This ensures that SELinux remains in permissive mode after a reboot.
If you want to switch SELinux to permissive mode immediately, within the current session, type the following command:
sudo setenforce permissive
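To verify the persistent setting, the configured mode can be read back from the config file. This sketch (ours) extracts the `SELINUX=` value from config-file text, using a made-up sample in place of the real /etc/selinux/config:

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the configured SELinux mode from config-file text.
selinux_mode() {
  sed -n 's/^SELINUX=\(.*\)$/\1/p'
}

# Sample config contents (in practice: selinux_mode < /etc/selinux/config):
sample_config='# This file controls the state of SELinux on the system.
SELINUXTYPE=targeted
SELINUX=permissive'

printf '%s\n' "$sample_config" | selinux_mode   # prints "permissive"
```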
Enabling or disabling transparent hugepages for Vertica
- Determine if transparent hugepages is enabled by running this command:
cat /sys/kernel/mm/transparent_hugepage/enabled
The output looks like [always] madvise never. The setting returned in brackets is your current setting.
- You can enable transparent hugepages by editing /etc/rc.local and adding this script:
if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
    echo always > /sys/kernel/mm/transparent_hugepage/enabled
fi
For systems that do not support /etc/rc.local, use the equivalent startup script that is run after the destination runlevel has been reached. For example, SuSE uses /etc/init.d/after.local.
- Reboot your system for the setting to take effect.
- To disable transparent hugepages, edit your boot loader configuration (for example /etc/grub.conf). Typically, you add the following to the end of the kernel line; however, consult the documentation for your system before editing your boot loader configuration:
transparent_hugepage=never
Alternatively, edit /etc/rc.local (on systems that support rc.local) and add this script:
if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
fi
For systems that do not support /etc/rc.local, use the equivalent startup script that is run after the destination runlevel has been reached. For example, SuSE uses /etc/init.d/after.local.
- Reboot your system for the setting to take effect.
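When scripting this check, the active setting is the bracketed word in the kernel's output. The sketch below (ours; the sample strings are illustrative) extracts it:

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the active setting (the bracketed word) from the
# output of `cat /sys/kernel/mm/transparent_hugepage/enabled`.
thp_current() {
  sed -n 's/.*\[\(.*\)\].*/\1/p'
}

echo '[always] madvise never' | thp_current   # prints "always"
echo 'always madvise [never]' | thp_current   # prints "never"
```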
For more information, see the Vertica documentation.
Hadoop prerequisites
Data360 DQ+ can connect to an existing Cloudera Data Platform (CDP) installation with Spark 3.3.x support configured. The Hadoop cluster can fulfill the role of Analysis processing. It is referred to elsewhere in this guide as, for example, "Compute primary" and "Compute secondary", and in the install.properties file as "Compute Master" and "Compute Slave" (or collectively as the "Compute Cluster").
To connect to your Hadoop cluster, you will need to provide values for a number of Hadoop-related properties in the install.properties file; see Hadoop connectivity properties (Compute properties).
The following steps must be completed in order to connect to an existing CDP cluster:
- Configure the keytab directories. For each Application Server node in your setup, the installation process expects a directory with a name matching the IP address of that node, and the Sagacity system user keytab must be in each of these directories. For example, the directory structure may look like this:
<keytabsDir>/[app_server_ip_address1]/sagacity.keytab
<keytabsDir>/[app_server_ip_address2]/sagacity.keytab
- Download the Hadoop YARN configuration file and save it at an accessible location. This will be checked by the verifyEnvironment script during installation.
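The expected layout can be sketched as follows (ours, using a temp dir and hypothetical IP addresses; in a real install, copy the actual sagacity.keytab into each per-IP directory under your keytabs directory):

```shell
#!/usr/bin/env bash
# Build the expected keytab layout in a temp dir, one directory per
# hypothetical Application Server IP address.
keytabs=$(mktemp -d)
for ip in 10.0.0.11 10.0.0.12; do
  mkdir -p "$keytabs/$ip"
  : > "$keytabs/$ip/sagacity.keytab"   # empty placeholder, not a real keytab
done

find "$keytabs" -name sagacity.keytab   # lists one keytab path per node
```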
Google Dataproc prerequisites
Data360 DQ+ can connect to Google Dataproc when running on Google Cloud Platform. A Dataproc cluster can fulfill the role of Analysis processing. It is referred to elsewhere in this guide as, for example, "Compute primary" and "Compute secondary", and in the install.properties file as "Compute Master" and "Compute Slave" (or collectively as the "Compute Cluster").
The steps in the following sections must be completed to connect to Google Dataproc.
Create a custom role
Create a custom role with the permissions required for Data360 DQ+ to use Google DataProc for executions.
- Go to IAM & Admin > Roles > CREATE ROLE. Give the role a name and ID as appropriate.
- Add these permissions:
dataproc.autoscalingPolicies.use
dataproc.clusters.get
dataproc.clusters.list
dataproc.clusters.use
dataproc.jobs.cancel
dataproc.jobs.create
dataproc.jobs.get
dataproc.jobs.list
dataproc.jobs.update
dataproc.operations.cancel
dataproc.operations.get
dataproc.operations.list
dataproc.workflowTemplates.create
dataproc.workflowTemplates.get
dataproc.workflowTemplates.instantiate
dataproc.workflowTemplates.instantiateInline
dataproc.workflowTemplates.list
iam.serviceAccounts.actAs
- Click Create.
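If you prefer to script role creation instead of using the console (for example with `gcloud iam roles create`, which takes a comma-separated `--permissions` list), a newline-separated permissions list can be joined into that form. This is a sketch with a short hypothetical subset of the permissions above:

```shell
#!/usr/bin/env bash
# Join a newline-separated permissions list into the comma-separated form
# expected by a --permissions flag. The subset below is illustrative.
perms='dataproc.clusters.get
dataproc.clusters.use
iam.serviceAccounts.actAs'

flag=$(printf '%s' "$perms" | tr '\n' ',')
echo "$flag"   # prints "dataproc.clusters.get,dataproc.clusters.use,iam.serviceAccounts.actAs"
```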
Create a service account
Use the GCP console to create a service account key, which is then used by Data360 DQ+ to make API calls to GCP services.
- Go to IAM & Admin > Service Accounts and click CREATE SERVICE ACCOUNT. Enter the service account name, then click Create.
- Go to Permissions > Grant Access. Select the custom role created above, click Save, then click the account you created.
- Go to KEYS > ADD KEY > Create New Key: Key type (JSON).
- Save the JSON file locally. This file will be used for configuring the GCP_SERVICE_ACCOUNT_KEY_FILE_LOCAL and GCP_SERVICE_ACCOUNT_KEY_FILE deployment properties.
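As a quick sanity check before configuring those properties, a service account key JSON contains `"type": "service_account"` and a `client_email` field. The sketch below is ours; the file contents are a made-up example, not real credentials:

```shell
#!/usr/bin/env bash
# Hypothetical sanity check of a downloaded service account key file.
key_file=$(mktemp)
cat > "$key_file" <<'EOF'
{
  "type": "service_account",
  "project_id": "my-project",
  "client_email": "dqplus@my-project.iam.gserviceaccount.com"
}
EOF

if grep -q '"type": "service_account"' "$key_file" \
   && grep -q '"client_email"' "$key_file"; then
  echo "looks like a service account key"
else
  echo "unexpected file contents"
fi
```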
Create keys in the Google Cloud Key Management Service (KMS)
Create the cryptography keys used by Data360 DQ+.
- Go to Security > Key Management.
- Create a new key ring if necessary by clicking Create key ring. Enter the Key ring name, for example: dqplus-key-ring. Use the created key ring name as the GCP_KMS_MASTER_KEY_RING_ID deployment property.
- Select a region, then click Create.
- Add a master key used by Data360 DQ+ to the key ring. Use the created key name as the GCP_KMS_MASTER_KEY_ID deployment property.
- KEY RINGS: Click on the key ring to use > CREATE KEY.
- Enter the key name, for example: dqplus-master-key.
- Protection Level (Software).
- Purpose (Symmetric encrypt/decrypt).
- Choose a key rotation period.
- Add key encrypt/decrypt permissions to the service account.
- Go to Key Management > KEY RINGS > Select Key Ring: Select the key to add permissions.
- Go to PERMISSIONS > ADD PRINCIPAL.
- New principals: Enter the service account you are using.
- Select a role: Select Cloud KMS CryptoKey Encrypter/Decrypter.
- Click SAVE.
- Repeat the steps above to create another key using the same key ring for Hadoop encryption, but a different key name. For example, dqplus-dqplusHadoopKey.
- Use the created key name as the HADOOP_ENCRYPTION_KEY deployment property.
Create a Google Cloud Storage Bucket
- Go to Cloud Storage in the GCP Console and click Create bucket.
- Name your bucket and choose where to store your data by selecting the Region and the project region.
- Choose a default storage class for your data (Standard).
- Choose how to control access to objects: select Enforce public access prevention on this bucket and Uniform access control.
- Click Create.
- After creating the bucket, select the bucket, open the Permissions tab, and click Add.
- Select the service account for the Data360 DQ+ principal.
- Select Storage Object Admin from the Cloud Storage service for the role and click Save.
Use the created bucket name as the DATAPROC_BUCKET_NAME and DATAPROC_LOGGING_BUCKET_NAME deployment properties.
Create Google Dataproc Cluster
- Prior to creating the cluster, create a shell script named dataproc_bootstrap.sh with the following content:
#!/usr/bin/env bash
sudo apt-get -y update
sudo apt install -y temurin-11-jdk
export JAVA_HOME="/usr/lib/jvm/temurin-11-jdk-amd64"
export PATH=$PATH:$JAVA_HOME/bin
echo $JAVA_HOME
java -version
echo "Successfully installed java 11 packages."
- Upload the dataproc_bootstrap.sh script to the <dqplus cloud storage bucket>/bootstrap/ folder.
- Use the Google Cloud command line to create the shared DataProc cluster:
gcloud dataproc clusters create <your cluster name> --autoscaling-policy <your policy id> --enable-component-gateway --region <your gcp region> --zone <your gcp zone> --num-masters 3 --master-machine-type n2-highmem-2 --master-boot-disk-size 1000 --num-workers 2 --worker-machine-type n2-highmem-4 --worker-boot-disk-size 1000 --num-secondary-workers 2 --secondary-worker-boot-disk-size 1000 --num-secondary-worker-local-ssds 0 --image-version 2.1-debian11 --properties yarn:yarn.nodemanager.remote-app-log-dir=gs://<dqplus cloud storage bucket>/logs/shared,yarn:yarn.log-aggregation.retain-seconds=-1,spark:spark.executorEnv.JAVA_HOME=/usr/lib/jvm/temurin-11-jdk-amd64,spark-env:JAVA_HOME=/usr/lib/jvm/temurin-11-jdk-amd64 --scopes 'https://www.googleapis.com/auth/cloud-platform' --initialization-actions 'gs://<dqplus cloud storage bucket>/bootstrap/dataproc_bootstrap.sh' --project <your project id>
The dataproc_bootstrap.sh script installs the required version of the Apache Spark software package on all nodes of the cluster, including nodes provisioned after cluster creation by the autoscaling policy.
The dataproc_bootstrap.sh script downloads the Apache Spark software package from https://archive.apache.org/dist/spark/spark-3.3.x/spark-3.3.x-bin-without-hadoop.tgz. Ensure that the cluster has access to this site. Alternatively, download this package yourself, put it on an internal site accessible to the cluster, and modify the dataproc_bootstrap.sh script to reference your site.
The command can be run from the GCP console using the Google Cloud Shell, for example: https://console.cloud.google.com/dataproc/clusters?cloudshell=true&project=project-name
Substitute the cluster name, project, region, and zone as needed. Use the Google storage bucket created for Data360 DQ+ earlier for the remote-app-log-dir logging property.
Use the created DataProc shared cluster name as the DATAPROC_SHARED_CLUSTER_NAME deployment property.
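Taken together, the Dataproc-related deployment properties named in this section might look like the following in install.properties. The property names come from this guide; every value shown is a placeholder, not a recommended setting:

```
GCP_SERVICE_ACCOUNT_KEY_FILE_LOCAL=/path/to/service-account-key.json
GCP_SERVICE_ACCOUNT_KEY_FILE=service-account-key.json
GCP_KMS_MASTER_KEY_RING_ID=dqplus-key-ring
GCP_KMS_MASTER_KEY_ID=dqplus-master-key
HADOOP_ENCRYPTION_KEY=dqplus-dqplusHadoopKey
DATAPROC_BUCKET_NAME=<your bucket name>
DATAPROC_LOGGING_BUCKET_NAME=<your bucket name>
DATAPROC_SHARED_CLUSTER_NAME=<your cluster name>
```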
To create an autoscaling policy, go to DataProc > Autoscaling policies, click CREATE POLICY, select Spark with dynamic allocation, and adjust if needed.

Other prerequisites
Before installing Data360 DQ+ you must ensure that you have completed the following steps:
- Run the following command on every machine that Data360 DQ+ will be installed on:
sudo systemctl start chronyd
Installation requires the chronyd service to be running on every node.
- Run the following command on each machine:
sudo yum install -y libtool-ltdl
Installation requires that libltdl is installed on every node.
- Set up a shared file system. This file system will be shared across all machines in your Data360 DQ+ cluster and it will be used to hold data, logs, and backup content. During installation, you will need to point to the path of this shared file system using the sagacitySharedMountPoint property. You can use any type of shared file system.
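Before setting the sagacitySharedMountPoint property, it is worth confirming the shared path exists and is writable on each node. This sketch is ours; a temp dir stands in for the real shared file system path:

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: is a candidate shared mount point usable?
check_shared() {
  if [ -d "$1" ] && [ -w "$1" ]; then echo "ok"; else echo "not usable"; fi
}

shared=$(mktemp -d)          # stand-in for the real shared file system path
check_shared "$shared"       # prints "ok"
check_shared /nonexistent    # prints "not usable"
```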