Setup requirements - Data360_DQ+ - 11.X

Data360 DQ+ Enterprise Installation

Product type
Software
Portfolio
Verify
Product family
Data360
Product
Data360 DQ+
Version
11.X
Language
English
Product name
Data360 DQ+
Title
Data360 DQ+ Enterprise Installation
Copyright
2024
First publish date
2016

Operating system requirements

  • RedHat Enterprise Linux 8.9
  • ext3, ext4 or XFS file system

Vertica prerequisites

The following prerequisites are required to run Vertica, the component responsible for running analytics for Data360 DQ+.

Vertica file system requirements

Vertica requires that you have one of the following file systems:

  • ext3
  • ext4
  • XFS

Vertica package dependencies

Vertica requires the following packages be installed on the system:

  • dialog
  • mcelog
  • sysstat
  • gdb
  • perl
  • gcc-c++
Note: If security policy restrictions prevent package installation via yum in your environment, you must install these packages manually on all machines that run Vertica (ComputeDb) before running the Data360 DQ+ installer. Otherwise, the installer installs the packages automatically.
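
Where yum access is permitted and you want to install the packages up front yourself, a minimal one-line sketch (adjust to your internal package source if yum is blocked):

    sudo yum install -y dialog mcelog sysstat gdb perl gcc-c++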

Setting Vertica swap space

"Swap space" is a way to obtain more memory resources when RAM is full. It allows you to reserve space on a hard disk that can be used like RAM. To function properly, Data360 DQ+ needs a minimum of 2 GB swap space.

  1. Check if you have swap space already set up by running the following command:

    free -m

  2. If you do not have swap space set up, you can use the following commands to create new swap space:
    sudo mkdir -p /var/swap
    sudo touch /var/swap/swap.1
    sudo /bin/dd if=/dev/zero of=/var/swap/swap.1 bs=1M count=2050
    sudo /sbin/mkswap /var/swap/swap.1
    sudo chmod 600 /var/swap/swap.1
    sudo /sbin/swapon /var/swap/swap.1
    sudo /bin/sh -c "echo '/var/swap/swap.1 swap swap defaults 0 0' >> /etc/fstab"

Note: Swap space must be configured on all machines in the Vertica (i.e. ComputeDb) cluster.

Setting read-ahead for Vertica

  1. Run the following commands, inserting the disk path where indicated:

    sudo /sbin/blockdev --setra 2050 {disk path here}

    sudo /bin/sh -c "echo '/sbin/blockdev --setra 2050 {disk path here}' >> /etc/rc.local"
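
    To confirm the new value, you can read the read-ahead setting back with blockdev (using the same disk path placeholder as above):

    sudo /sbin/blockdev --getra {disk path here}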

Setting SELinux to permissive mode

For Vertica to function properly, SELinux must be set to Permissive mode, as follows:

  1. Edit the /etc/selinux/config file to set SELINUX=permissive. This ensures that SELinux remains set to permissive after a reboot.

If you want to switch SELinux to permissive mode immediately, within the current session, run the following command:

sudo setenforce permissive
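
To verify that the change took effect, check the current mode; the output should read Permissive:

getenforce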

Enabling or disabling transparent hugepages for Vertica

  1. Determine if transparent hugepages is enabled by running this command:

    cat /sys/kernel/mm/transparent_hugepage/enabled

    Example output:

    [always] madvise never

    The setting returned in brackets is your current setting.

  2. You can enable transparent hugepages by editing /etc/rc.local and adding the script below. For systems that do not support /etc/rc.local, use the equivalent startup script that is run after the destination runlevel has been reached; for example, SuSE uses /etc/init.d/after.local.

    if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
        echo always > /sys/kernel/mm/transparent_hugepage/enabled
    fi

  3. Reboot your system for the setting to take effect.
  4. To disable transparent hugepages, edit your boot loader (for example /etc/grub.conf). Typically, you add the following to the end of the kernel line. However, consult the documentation for your system before editing your bootloader configuration:

    transparent_hugepage=never

    Alternatively, edit /etc/rc.local (on systems that support rc.local) and add this script:

    if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
        echo never > /sys/kernel/mm/transparent_hugepage/enabled
    fi

    For systems that do not support /etc/rc.local, use the equivalent startup script that is run after the destination runlevel has been reached; for example, SuSE uses /etc/init.d/after.local.

  5. Reboot your system for the setting to take effect.

For more information, see the Vertica documentation.

Hadoop prerequisites

Data360 DQ+ can connect to an existing Cloudera Data Platform (CDP) installation with Spark 3.3.x support configured. The Hadoop cluster can fulfill the role of Analysis processing. It is referred to elsewhere in this guide as, for example, "Compute primary and Compute secondary", and in the install.properties file as "Compute Master" and "Compute Slave" (or the "Compute Cluster" collectively).

To connect to your Hadoop cluster, you will need to provide values for a number of Hadoop-related properties in the install.properties file; see Hadoop connectivity properties (Compute properties).

The following steps must be completed in order to connect to an existing CDP cluster:

  1. Configure the keytab directories. For each Application Server node in your setup, the installation process expects a directory whose name matches the IP address of that Application Server node, and the Sagacity system user keytab must be placed in each of these directories. For example, the directory structure may look like this (a working sketch follows this list):
    <keytabsDir>/[app_server_ip_address1]/sagacity.keytab
                /[app_server_ip_address2]/sagacity.keytab
  2. Download the Hadoop YARN configuration file and save it at an accessible location; the file will be checked by the verifyEnvironment script during installation.
Note: The compute cluster for Spark 3.3.x should be running Java 11.
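
A minimal sketch of preparing the keytab directories, assuming two Application Server nodes and a hypothetical <keytabsDir> of /opt/dqplus/keytabs (the IP addresses and the keytab source path are placeholders):

    # Hypothetical keytab root directory and Application Server node IP addresses
    KEYTABS_DIR=/opt/dqplus/keytabs
    for ip in 10.0.0.11 10.0.0.12; do
        sudo mkdir -p "$KEYTABS_DIR/$ip"
        sudo cp /path/to/sagacity.keytab "$KEYTABS_DIR/$ip/sagacity.keytab"
    done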

Google Dataproc prerequisites

Data360 DQ+ can connect to Google Dataproc when running on Google Cloud Platform. A Dataproc cluster can fulfill the role of Analysis processing. It is referred to elsewhere in this guide as, for example, "Compute primary and Compute secondary", and in the install.properties file as "Compute Master" and "Compute Slave" (or the "Compute Cluster" collectively).

The steps in the following sections must be completed to connect to Google Dataproc.

Create a custom role

Create a custom role with the permissions required for Data360 DQ+ to use Google DataProc for executions.

  1. Go to IAM & Admin > Roles > CREATE ROLE. Give the role a name and ID as appropriate.
  2. Add these permissions:
    • dataproc.autoscalingPolicies.use
    • dataproc.clusters.get
    • dataproc.clusters.list
    • dataproc.clusters.use
    • dataproc.jobs.cancel
    • dataproc.jobs.create
    • dataproc.jobs.get
    • dataproc.jobs.list
    • dataproc.jobs.update
    • dataproc.operations.cancel
    • dataproc.operations.get
    • dataproc.operations.list
    • dataproc.workflowTemplates.create
    • dataproc.workflowTemplates.get
    • dataproc.workflowTemplates.instantiate
    • dataproc.workflowTemplates.instantiateInline
    • dataproc.workflowTemplates.list
    • iam.serviceAccounts.actAs
  3. Click Create.
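
Alternatively, the role can be created from the command line. A minimal sketch using gcloud; the role ID and title are illustrative placeholders:

    # Create a custom role carrying the permissions listed above
    gcloud iam roles create dqplusDataprocRole \
        --project=<your project id> \
        --title="Data360 DQ+ Dataproc" \
        --permissions=dataproc.autoscalingPolicies.use,dataproc.clusters.get,dataproc.clusters.list,dataproc.clusters.use,dataproc.jobs.cancel,dataproc.jobs.create,dataproc.jobs.get,dataproc.jobs.list,dataproc.jobs.update,dataproc.operations.cancel,dataproc.operations.get,dataproc.operations.list,dataproc.workflowTemplates.create,dataproc.workflowTemplates.get,dataproc.workflowTemplates.instantiate,dataproc.workflowTemplates.instantiateInline,dataproc.workflowTemplates.list,iam.serviceAccounts.actAs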

Create a service account

Use the GCP console to create a service account key, which is then used by Data360 DQ+ to make API calls to GCP services.

  1. Go to IAM & Admin > Service Accounts and click CREATE SERVICE ACCOUNT. Enter the service account name, then click Create.
  2. Go to Permissions > Grant Access. Select the custom role that you created above, click Save, then click the newly created service account.
  3. Go to KEYS > ADD KEY > Create New Key: Key type (JSON).
  4. Save the JSON file locally. This file will be used for configuring the GCP_SERVICE_ACCOUNT_KEY_FILE_LOCAL and GCP_SERVICE_ACCOUNT_KEY_FILE deployment properties.
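
The same steps can also be scripted. A minimal sketch using gcloud; the service account name (dqplus-sa) and the custom role ID (dqplusDataprocRole) are placeholders that should match what you created above:

    # Create the service account (name is a placeholder)
    gcloud iam service-accounts create dqplus-sa --display-name="Data360 DQ+"
    # Grant the custom role created earlier to the service account
    gcloud projects add-iam-policy-binding <your project id> \
        --member="serviceAccount:dqplus-sa@<your project id>.iam.gserviceaccount.com" \
        --role="projects/<your project id>/roles/dqplusDataprocRole"
    # Create and download a JSON key for the service account
    gcloud iam service-accounts keys create dqplus-sa-key.json \
        --iam-account=dqplus-sa@<your project id>.iam.gserviceaccount.com

In this sketch, dqplus-sa-key.json is the file you would reference in the GCP_SERVICE_ACCOUNT_KEY_FILE_LOCAL and GCP_SERVICE_ACCOUNT_KEY_FILE deployment properties.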

Create keys in the Google Cloud Key Management Service (KMS)

Create the cryptography keys used by Data360 DQ+.

  1. Go to Security > Key Management.
  2. Create a new key ring if necessary by clicking Create key ring. Enter the Key ring name, for example: dqplus-key-ring. Use the created key ring name as the GCP_KMS_MASTER_KEY_RING_ID deployment property.
  3. Select a region, then click Create.
  4. Add a master key used by Data360 DQ+ to the key ring. Use the created key name as the GCP_KMS_MASTER_KEY_ID deployment property.
    • KEY RINGS: Click on the key ring to use > CREATE KEY.
    • Enter the key name, for example: dqplus-master-key.
    • Protection Level (Software).
    • Purpose (Symmetric encrypt/decrypt).
    • Choose a key rotation period.
  5. Add key encrypt/decrypt permissions to the service account.
    • Go to Key Management > KEY RINGS > Select Key Ring: Select the key to add permissions.
    • Go to PERMISSIONS > ADD PRINCIPAL.
    • New principals: Enter the service account you are using.
    • Select a role: Select Cloud KMS CryptoKey Encrypter/Decrypter.
    • Click SAVE.
  6. Repeat the steps above to create another key in the same key ring for Hadoop encryption, using a different key name, for example: dqplus-dqplusHadoopKey.
  7. Use the created key name as the HADOOP_ENCRYPTION_KEY deployment property.
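
The key ring, keys, and permissions can also be created from the command line. A minimal sketch using gcloud; the key ring name, key names, region, and service account are placeholders matching the examples above, and key rotation is left to be configured separately:

    # Create the key ring and the two symmetric keys (names are placeholders)
    gcloud kms keyrings create dqplus-key-ring --location=<your gcp region>
    gcloud kms keys create dqplus-master-key \
        --keyring=dqplus-key-ring --location=<your gcp region> --purpose=encryption
    gcloud kms keys create dqplus-dqplusHadoopKey \
        --keyring=dqplus-key-ring --location=<your gcp region> --purpose=encryption
    # Allow the Data360 DQ+ service account to encrypt/decrypt with the master key
    gcloud kms keys add-iam-policy-binding dqplus-master-key \
        --keyring=dqplus-key-ring --location=<your gcp region> \
        --member="serviceAccount:dqplus-sa@<your project id>.iam.gserviceaccount.com" \
        --role="roles/cloudkms.cryptoKeyEncrypterDecrypter"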

Create a Google Cloud Storage Bucket

  1. Go to Cloud Storage in the GCP Console and click Create bucket.
  2. Name your bucket and choose where to store your data by selecting Region as the location type and choosing the region used by your project.

  3. Choose a default storage class for your data (standard).

  4. Choose how to control access to objects. Select Enforce public access prevention on this bucket and Uniform access control.

  5. Click Create.
  6. After creating the bucket, select the bucket, open the Permissions tab, and click Add.
  7. Select the Data360 DQ+ service account as the principal.
  8. For the role, select Storage Object Admin from the Cloud Storage service, then click Save.

Use the created bucket name as the DATAPROC_BUCKET_NAME and DATAPROC_LOGGING_BUCKET_NAME deployment properties.
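
The bucket can also be created and configured from the command line. A minimal sketch using gsutil; the bucket name, region, and service account are placeholders:

    # Create the bucket in the chosen region with uniform bucket-level access
    gsutil mb -l <your gcp region> -b on gs://<dqplus cloud storage bucket>
    # Enforce public access prevention on the bucket
    gsutil pap set enforced gs://<dqplus cloud storage bucket>
    # Grant the Data360 DQ+ service account the Storage Object Admin role on the bucket
    gsutil iam ch serviceAccount:dqplus-sa@<your project id>.iam.gserviceaccount.com:roles/storage.objectAdmin gs://<dqplus cloud storage bucket>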

Create Google Dataproc Cluster

  1. Prior to creating the cluster, create a shell script named dataproc_bootstrap.sh, with the following content:
    #!/usr/bin/env bash
    sudo apt-get -y update
    sudo apt install -y temurin-11-jdk
    export JAVA_HOME="/usr/lib/jvm/temurin-11-jdk-amd64"
    export PATH=$PATH:$JAVA_HOME/bin
    echo $JAVA_HOME
    java -version
    echo "Successfully installed java 11 packages."
  2. Upload the dataproc_bootstrap.sh script to the <dqplus cloud storage bucket>/bootstrap/ folder.
  3. Use the Google Cloud command line to create the shared DataProc cluster:


    gcloud dataproc clusters create <your cluster name> \
        --autoscaling-policy <your policy id> \
        --enable-component-gateway \
        --region <your gcp region> \
        --zone <your gcp zone> \
        --num-masters 3 --master-machine-type n2-highmem-2 --master-boot-disk-size 1000 \
        --num-workers 2 --worker-machine-type n2-highmem-4 --worker-boot-disk-size 1000 \
        --num-secondary-workers 2 --secondary-worker-boot-disk-size 1000 --num-secondary-worker-local-ssds 0 \
        --image-version 2.1-debian11 \
        --properties yarn:yarn.nodemanager.remote-app-log-dir=gs://<dqplus cloud storage bucket>/logs/shared,yarn:yarn.log-aggregation.retain-seconds=-1,spark:spark.executorEnv.JAVA_HOME=/usr/lib/jvm/temurin-11-jdk-amd64,spark-env:JAVA_HOME=/usr/lib/jvm/temurin-11-jdk-amd64 \
        --scopes 'https://www.googleapis.com/auth/cloud-platform' \
        --initialization-actions 'gs://<dqplus cloud storage bucket>/bootstrap/dataproc_bootstrap.sh' \
        --project <your project id>

The dataproc_bootstrap.sh script will install the required version of the Apache Spark software package on all nodes of the cluster, including nodes provisioned after cluster creation by the autoscaling policy.

The dataproc_bootstrap.sh script will download the Apache Spark software package from the https://archive.apache.org/dist/spark/spark-3.3.x/spark-3.3.x-bin-without-hadoop.tgz location. Ensure that the cluster has access to this site. Alternatively, download this package yourself, put it on an internal site accessible to the cluster, and modify the dataproc_bootstrap.sh script to reference your site.

The command can be run from the GCP console using Google Cloud Shell, for example: https://console.cloud.google.com/dataproc/clusters?cloudshell=true&project=project-name

Substitute the cluster name, project, region, and zone as needed. Use the Google Cloud Storage bucket created earlier for Data360 DQ+ as the value of the remote-app-log-dir property for logs.

Use the created DataProc shared cluster name as the DATAPROC_SHARED_CLUSTER_NAME deployment property.

Note: For the autoscaling policy reference, you can use an existing policy or create a new one. To create a new policy, go to Dataproc > Autoscaling policies, click CREATE POLICY, select Spark with dynamic allocation, and adjust the settings if needed.
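
To look up the ID of an existing autoscaling policy to reference in the cluster create command, you can run the following (the region is a placeholder):

    gcloud dataproc autoscaling-policies list --region=<your gcp region>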

Other prerequisites

Before installing Data360 DQ+ you must ensure that you have completed the following steps:

  1. Run the following command on every machine that Data360 DQ+ will be installed on:

    sudo systemctl start chronyd

    Installation requires the chronyd service to be running on every node.

  2. Run the following command on each machine:

    sudo yum install -y libtool-ltdl

    Installation requires that libltdl is installed on every node.

  3. Set up a shared file system. This file system will be shared across all machines in your Data360 DQ+ cluster and will be used to hold data, logs, and backup content. During installation, you will need to point to the path of this shared file system using the sagacitySharedMountPoint property. You can use any type of shared file system; an example using NFS is shown below.
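
    For example, a minimal sketch of using an NFS share as the shared file system, assuming a hypothetical export nfs-server:/export/dqplus-shared and mount point /mnt/dqplus-shared (run on every machine in the cluster; requires an NFS client such as nfs-utils):

    # Create the mount point and mount the hypothetical NFS export
    sudo mkdir -p /mnt/dqplus-shared
    sudo mount -t nfs nfs-server:/export/dqplus-shared /mnt/dqplus-shared
    # Persist the mount across reboots
    sudo /bin/sh -c "echo 'nfs-server:/export/dqplus-shared /mnt/dqplus-shared nfs defaults 0 0' >> /etc/fstab"

    In this example, /mnt/dqplus-shared would be the value supplied for the sagacitySharedMountPoint property.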