Details


The data store Details tab allows you to select which type of data store you want to create.

  1. Type a Name for the data store.
  2. Optionally, enter a Description for the data store.

    If you enter a description, this will be displayed in a tooltip when you hover over the data store in the Pipelines view.

  3. Select a Store Type. Choose from:
    • External - An external data store points to data coming into Data360 DQ+ from an external source, allowing for the exploration of virtually any set of data. You can create an external data store by uploading a file directly or by uploading data to an external location and then pointing to that location. This can take the form of uploading a file to an external repository and then pointing to that repository, or pointing to an existing database that contains your data.
    • Internal - An internal data store can hold data pushed from another (external) application, such as Assure DQ or ER DQ, or it can hold data from other data stores that are already set up within Data360 DQ+. These types of data stores are typically outputs from other data stages, such as an analysis that transforms a set of data stores to create a new one. Additionally, internal data stores can be used to hold the detail records for case stores within a database.
  4. Select a Store Repository Type. The options that are available depend on whether you have selected an External or Internal Store Type.

    External store repository types - For external data stores, "Store Repository Type" refers to the external location at which your data is stored.

    Data360 DQ+ Cloud edition - If you are using the Cloud edition of the product, you can choose from the following store repository types:

    • Default - Places your uploaded file into an Amazon S3 bucket if running on AWS, or uses Azure Data Lake Storage Gen2 if running on Azure. Each time you use the default external data store, you are accessing this bucket on S3 or the Azure Data Lake Storage. This is the option that will be chosen if you create an external data store via file upload.
    • Database - If you have an existing database, this option allows you to point to it to access data from specific tables.
    • Kafka - Allows you to read from or write to a Kafka channel. You can use a Kafka channel to handle streaming data, for example to store and aggregate logs, and to access the store in an analysis.
    • Spark Data Source - Allows the system to use any available Spark data source without having to add each one separately.
    • S3 - Only available on the AWS edition. Places your uploaded file into an Amazon S3 bucket. However, instead of uploading a file via Data360 DQ+, it requires you to upload files to an S3 bucket via AWS and then point Data360 DQ+ to that bucket. This option is preferable when you need to upload a large volume of data, because Data360 DQ+ does not currently support multi-file upload.

    Data360 DQ+ Enterprise edition - If you are using the Enterprise edition of the product, you can choose from the following store repository types:

    • Default - In the Enterprise edition of Data360 DQ+, Default refers to the Hadoop Distributed File System, or HDFS. Choosing Default places your uploaded file into HDFS. Each time you use the default external data store, you are essentially accessing HDFS.
    • Database - If you have an existing database, this option allows you to point to it to access data from specific tables.
    • Kafka - Allows you to read from or write to a Kafka channel. You can use a Kafka channel to handle streaming data, for example to store and aggregate logs, and to access the store in an analysis.
    • Spark Data Source - Allows the system to use any available Spark data source without having to add each one separately.
    • File System - If you have an existing file system other than HDFS, this option allows you to point to it to access data from specific files.
    • Hadoop Distributed File System (HDFS) - This option also utilizes HDFS, however, instead of uploading a file via Data360 DQ+, it requires you to upload files to HDFS and then point Data360 DQ+ to that location. This option is preferable when you need to upload a large volume of data, because the Data360 DQ+ UI does not currently support multi-file upload.

    Internal store repository types

    There are three internal data store repository types:

    • Default - This is the option that will be chosen if you are pushing data from another Infogix application, or if you are creating a data store output in an analysis.
    • Database - Allows you to create a data store that will be associated with and will hold detail records for a case store. To populate this type of data store, you need to construct an analysis that pushes data to it from other sources.
      Note: Internal database data stores should only be used when you need to associate a data store with a case store.

      This option includes an Indexes tab and a Screens tab that are not present in other data store types.

    • Delta Lake - Allows you to use a Delta Lake repository. This option is preferable when you need to handle a large volume of data, and allows records to be updated via an analysis after initial loading.
      Note: The Delta Lake store repository type is only supported on Cloud editions of the product and is not available on Enterprise deployments.
  5. Configure the following properties where available for your chosen repository type:
    • Create new definition version on each save - Select this option if you want to create a new version of your data store in the back-end file system each time you upload a new file or change the data store's configuration, and then save.
    • Log Activity
    • Allow Ad Hoc Queries - You can use ad hoc queries to perform direct SQL queries on files in data stores without having to first create a data view, allowing you to explore and understand your data before creating an analysis. Ad hoc queries should only be used when access is infrequent, and performance is not critical. Avoid using ad hoc queries in dashboards that expose data in file based data stores directly to end users. For more information, see Ad hoc queries.
    • Cache Query Results - This option can be used to improve performance. When Cache Query Results is selected, duplicate queries use cached data. Cached data expires one hour after it is cached.

      Depending on your data store settings, the Cache Query Results option might appear grayed out (disabled).

      By default, the Cache Query Results option is selected (and grayed out) for external data stores if the Store Repository Type is S3, HDFS or Database, and for Default external data stores where the Layout Type supports ad hoc querying. It is not selected (and grayed out) for all other data stores.

Additional configuration options

Depending on the selected Store Repository Type, additional configuration options will be displayed.

External default data store

  1. Select a Layout Type. Choose from:
    • Delimited
    • JSON
    • Parquet
    • ORC
    • Avro
    • Microsoft Excel
    • Line
    • XML
  2. Configure the Layout options. The available Layout options depend on the Layout Type that you have selected.

    If you select JSON, Parquet, ORC, Avro or XML, you can use the Transformation Builder to define the data store fields. For more information, see Using the Schema and Transformation Builder.

    Tip: For information on configuring layouts, see Configuring data store layouts.

External database data store

  1. Configure the Database Channel properties to allow Data360 DQ+ to connect to your database. You may need to contact the person who manages the database to gather the required information:
    • Database Type - The brand of database that you are using. If your database type is not listed, select OTHER; for example, to point to a DB2 database, select OTHER. Note that from release 2.1.1 onwards, DB2-based data stores cannot be used as data store outputs within an analysis.
    • JDBC Driver Class - The component that helps Data360 DQ+ to communicate with your database. Common drivers for each database type will autofill this parameter when a Database Type is selected.
    • URL - The external location of your database.
      Note: For a Cloudera Hive database, specify the URL in the following form: jdbc:hive2://<hostname>:10000/;principal=hive/<hostname>@INFOGIX.COM;ssl=true;sslTrustStore=${igx.runtime.dir}/cm-auto-global_truststore.jks
    • Username and Password - The credentials that you use to access the database.

      For the Enterprise edition of Data360 DQ+, the following configuration options are also available:

      • Authentication Type - If the Database Type is HIVE, you can choose between Username and Password authentication and Kerberos authentication.
      • Principal Name - If the Authentication Type is Kerberos, enter the Kerberos Principal Name, for example principal@example.com.
      • Keytab File Name - If the Authentication Type is Kerberos, enter the Keytab file name, prefaced by the path ${igx.runtime.dir}, for example ${igx.runtime.dir}/sagacity.hive.keytab. Ensure that you use distinct names for all keytab files, for example sagacity.hadoop.keytab, sagacity.kafka.keytab, and sagacity.hive.keytab.
      Note: Data360 DQ+ does not support Hive database data stores for output.

      Click Test to test your connection to the database. Testing should always be performed prior to field generation to ensure that communication with the database has been properly established.

  2. Configure the SQL Layout properties:
    • SQL - Write an SQL query to pull data from the tables you would like to include in your data store. During field generation, each field pulled from the database becomes a field in the data store. For example, the following statement would allow you to generate ID and CREATETIME fields in your data store:

      SELECT ID, CREATETIME FROM student_data_pull

      The following statement would allow you to create all the fields from student_data_pull in your data store:

      SELECT * FROM student_data_pull

      Tip: As with standard SQL, more advanced statements may also be made. For example, you could use a WHERE condition to pull in specific records for specific fields, or a JOIN to combine fields from multiple tables.
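
      A minimal sketch of such a statement, assuming hypothetical STUDENT and ENROLLMENT tables (the table and field names are for illustration only):

      SELECT s.ID, s.NAME, e.COURSE FROM STUDENT s JOIN ENROLLMENT e ON s.ID = e.STUDENT_ID WHERE e.ENROLL_YEAR = 2024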

      Using time stamp variables in SQL statements

      The following variables may be used in your SQL:

      • fromTimestamp/toTimestamp - Applies the timezone of the system's web application server to the time value.
      • fromTimestampUTC/toTimestampUTC - Applies the UTC timezone to the time value.
      • fromTimestampOrgTz/toTimestampOrgTz - Applies the timezone set in the Admin->System screen of the product to the time value.
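
      For example, assuming a hypothetical ORDERS table with a MODIFIED_TIME column, a statement restricted to the current processing window might look like the following, where <fromTimestamp> and <toTimestamp> stand for references to the variables in whatever form your environment expects:

      SELECT * FROM ORDERS WHERE MODIFIED_TIME >= <fromTimestamp> AND MODIFIED_TIME < <toTimestamp>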

    • Sample SQL - Optionally, write a statement that will be used when sampling the data store in an analysis. If a Sample SQL statement is not used, analysis sampling will occur according to the analysis settings.

      For example, you could write the following statements:

      SQL: SELECT * FROM EMPLOYEE where ID > 2

      Sample SQL: SELECT * FROM EMPLOYEE where ID > 5

      A data store that used these statements would bring in all records from the EMPLOYEE table that had an ID greater than 2. These records would be available for processing throughout the application. However, analysis sheets that displayed this data store's contents would only display records with IDs greater than 5.

    • Output Table Name - If the database data store is going to be used as a data store output within an analysis, you must specify the name of the table within the external database that the data store is using.
    • Concurrent Reads - You can use this setting to optimize the speed at which data is read from a data store's database, particularly when the database contains a large number of records. For example, if the database contained 1,000,000 records, a normal query would use a single read to read all 1,000,000 records. If you increased the number of Concurrent Reads to 10, the query would instead use 10 reads in parallel, with each read handling 100,000 records.

      To experiment with different numbers of Concurrent Reads, select User Defined and choose a number.

      Alternatively, you can let the system make this optimization for you by selecting System Defined.

      Note: If you have set the Concurrent Reads property to System Defined or User Defined, you must order your SQL query by a field that contains unique values for each record in the data set. Note that this only applies to the SQL property and not to the Sample SQL property.
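
      For example, reusing the hypothetical EMPLOYEE table from the earlier examples, where ID is unique for each record:

      SELECT * FROM EMPLOYEE WHERE ID > 2 ORDER BY ID
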
  3. Once you have configured your Database Channel and SQL Layout properties, select the Fields tab and use the Generate button to generate fields. Assuming you have successfully connected to the database, the generated fields and the records that populate them will be based on the SQL that you have written in your layout. See Fields.

External Kafka data store

  1. Select a Layout Type. Choose from:
    • JSON
    • Avro
    • Text
  2. Configure the Kafka Channel properties to connect Data360 DQ+ to your Kafka channel. You may need to contact the person who manages the Kafka cluster to gather the required information:
    • Servers - Enter a comma-separated list of server addresses. Entries should be formatted as <hostname>:<port>. Note that Data360 DQ+ does not validate server entries.
    • Topic - Enter a comma-separated list of topics.
    • This channel will be used to write to Kafka - Determines whether the channel will write to, or read from the Kafka channel.
    • Additional Connection Properties - Use this panel to define any other connection properties. For example, to specify an SSL connection, you could add a property called "kafka.security.protocol", with the value "SSL".
  3. Configure the Layout properties. The available Layout options depend on the Layout Type that you have selected:
    • JSON Layout - Click Build Schema and Transformations in the JSON Layout panel to open the Build Schema and Transformations dialog. Use this dialog to define the JSON or Avro schema for the Kafka data store, and the transform actions and the type and name of the output fields. For more information, see Using the Schema and Transformation Builder.
    • Avro Layout - You can use a Confluent Schema Registry to store and retrieve your Avro schemas, or provide a schema name and namespace.
      • Registry URL - Enter the URL of the Confluent Schema Registry, in the format <hostname>:<port>, for example 192.0.4.0:8081. A value is required for this field.
      • Schema ID - A value is required for this field. The default value is "latest".
      • Topic - Enter a comma-separated list of topics. A value is required for this field.
      • Schema Naming Strategy - Choose a naming strategy from the drop-down list. The default value is topic.name.

      Click Build Schema and Transformations to open the Build Schema and Transformations dialog. Use this dialog to define the Avro schema for the Kafka Data Store, and the transform actions and the type and name of the output fields. For more information, see Using the Schema and Transformation Builder.

      • Schema Name - Enter the schema name. This field is required if Confluent Schema Registry is not selected.
      • Schema Namespace - Enter the schema namespace. This field is required if Confluent Schema Registry is not selected.

    • Text - There are no applicable settings for a Layout Type of text, and the Layout panel is not displayed.

The following restrictions apply to Kafka data stores:

  • Transform and Profile tabs are not applicable to Kafka data stores.
  • Field definitions are not editable, and fields cannot be added, deleted, imported, or generated.
  • Fields can be tested only if This channel will be used to write to Kafka is not selected.

External Spark Data Source

  1. Spark Data Source Channel
    • Click Add to define the Spark parameters that are required to read the data from the database. Below is an example of the parameters used to read the data:
      • Example:
        • Name - read.spark.url
        • Value - jdbc:oracle:thin:@172.17.13.220:1521:CAFE
        • Name - read.spark.driver
        • Value - oracle.jdbc.driver.OracleDriver
        • Name - read.spark.user
        • Value - CAFEUSER
        • Name - read.spark.password
        • Value - Infogix1

        If you want to sample the data, you can use the following parameters:

        • Name - read.spark.sample
        • Value - 50%
    • This source also supports writing data to a database. Below is an example of the parameters used to write to the database:
      • Example:
        • Name - write.spark.url
        • Value - jdbc:oracle:thin:@10.2.20.120:1521:iadb
        • Name - write.spark.driver
        • Value - oracle.jdbc.driver.OracleDriver
        • Name - write.spark.user
        • Value - IRWHOUSE
        • Name - write.spark.password
        • Value - IRWHOUSE
  2. Spark Data Source Layout
    • Spark Format - This will be jdbc if you are fetching the data from an external database.
    • Click Add to define the Spark query, which will be the database query. Below is an example of parameters that can be used:
      • Example:
        • Name - read.spark.query
        • Value - select * from AFDS_TEST
        • Name - write.spark.dbtable
        • Value - AFDS_TEST
  3. Selection Criteria
    • Click Add to open the SQL Editor, which allows you to write and execute SQL queries.

External S3 data store

  1. Configure the S3 Channel properties to point Data360 DQ+ toward the location of the external data.
    • Bucket Name - The S3 bucket name.
    • Folder - The folder in the S3 bucket that contains the external data.
    • Filter Extension - This setting can be used when you have files in the same location that share the same name but have a different file extension. The specified Filter Extension will tell Data360 DQ+ which file to use to create the data store. For example, if you had a folder containing a customerData.txt and a customerData.csv, specifying either .txt or .csv as a Filter Extension would allow you to choose between the two.
    • Marker Extension - This setting can be used to prevent reading of files that should not be read until a Marker is present. It is available for the S3 and HDFS file systems.
    • Regular Expression for File Path - When configuring the external channel, you can specify the file path either as a literal string or as a regular expression (see the example after these steps).
    • Region - Choose the appropriate region-specific endpoints to connect with an AWS service.
    • Security Options - Choose relevant security options and fill in the additional information.
      • Internal Account - Select Internal Account if the S3 data is stored in the default region.
      • External Account - Select External Account if the S3 data is stored in a region other than the default region.
        • Access Key - The access key ID of the authenticated user that will connect to the AWS account.
        • Secret Key - The secret access key of the authenticated user that will connect to the AWS account.
      • Assume Role - Returns a set of temporary security credentials that you can use to access AWS resources that you might not normally have access to.
        • Role ARN - Provide the Amazon Resource Name (ARN) of the role to assume.
        • Session Name - Provide the identifier for the assumed role session.
  2. Configure the Layout options. The available Layout options depend on the Layout Type that you have selected.
    Tip: For information on configuring layouts, see Configuring data store layouts.
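
A hypothetical example of a Regular Expression for File Path: if daily files are uploaded to the bucket as customerData_20240101.csv, customerData_20240102.csv, and so on, a pattern such as customerData_\d{8}\.csv would match all of them. The file names and pattern shown here are for illustration only.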

External ADLS data store

  1. Configure the ADLS Channel properties to point Data360 DQ+ toward the location of the external data.
    • Path - The full path to the external data, specified with the following syntax: abfss://<file_system>@<storage_account_name>.dfs.core.windows.net/<path> (see the example after this procedure).
    • Filter Extension - This setting can be used when you have files in the same location that share the same name but have a different file extension. The specified Filter Extension will tell Data360 DQ+ which file to use to create the data store. For example, if you had a folder containing a customerData.txt and a customerData.csv, specifying either .txt or .csv as a Filter Extension would allow you to choose between the two.
    • Regular Expression for File Path - When configuring the external channel, either literal strings that represent the appropriate file path or regular expressions that represent the file path may be used.
    • Marker Extension - This setting can be used to prevent reading files that should not be read unless a Marker is found.
    Note: Filter Extension, Regular Expression for File Path, and Marker Extension have been removed from the ADLS Channel for the Delta Lake Layout.
  2. Configure the Security Options.
    • Storage Account Name - The name of the ADLS Gen2 storage account.
    • Application ID - The Application (client) ID for the Azure Active Directory application.
    • Service Credential - The client secret.
    • Directory ID - The Directory (tenant) ID for the Azure Active Directory application.
Note: The ADLS data store is supported only on the Azure Cloud edition of the product.
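
A hypothetical example of a full ADLS path, assuming a container named mycontainer, a storage account named mystorageacct, and data in a data/customers folder: abfss://mycontainer@mystorageacct.dfs.core.windows.net/data/customers. The container, account, and folder names are for illustration only.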

External Delta Lake Data Store

  1. Configure the Layout options. The available Layout options depend on the Layout Type that you have selected.
    Tip: For information on configuring layouts, see Configuring data store layouts.
    Note: The Layout options, Edit Transformations, and Selection Criteria are shown for the Delta Lake Layout, while Create new definition version on each save, Log Activity, Allow Ad Hoc Queries, and Cache Query Results are disabled.

Internal database data store

If you want to associate a data store with a case store, that is, to have it contain detail records that support the header records of the case store, you can populate an internal database data store by pushing data into it via an analysis, and then associate it with the case store by adding it to the case store's workflow.

Note: Internal database data stores should only be used when you need to associate a data store with a case store.

Prior to pushing data to an internal database data store via an analysis, all of the fields that the internal database data store will receive need to be created. If the internal database data store is created on-the-fly within the analysis, these fields will be created automatically. If you want to create the internal database data store ahead of time, you can create all of the required fields by using the Import Fields feature, and importing from the source data store.

When an internal database data store is added to a case store's workflow, it will become "owned" by the case store. Within the data store, this ownership is represented by the addition of the following system fields:

  • OwningObjectId - The system generated ID of the data stage that "owns" the data store.
  • OwningObjectType - The type of the data stage that "owns" the data store.
  • OwningSubObjectUID - The UID of the case record from the case store that owns the detail record in the data store. This UID must be generated and tied to header/detail records in an analysis using the UUID() function. For more information, see Case Stores.
  • OwningSubObjectID - The system-generated ID of the case record that owns the detail record in the data store (as opposed to its UID).
  • SysDeleted - A Boolean flag that indicates whether a record has been deleted and can be physically deleted from the system at any time.