Data store layouts - Data360_DQ+ - Latest

Data360 DQ+ Help

Product type
Software
Portfolio
Verify
Product family
Data360
Product
Data360 DQ+
Version
Latest
Language
English
Product name
Data360 DQ+
Title
Data360 DQ+ Help
Copyright
2024
First publish date
2016
Last updated
2024-10-09
Published on
2024-10-09T14:37:51.625264

Delimited layout

You can create a data store by using a delimited text file. Commonly used delimiters include commas and tabs, but you can use any character. Whichever symbol is used, it is the delimiter that tells Data360 DQ+ how to parse the external data.

The Delimited Layout options are described below:

  • Records Span Multiple Lines - You can use this setting when you have a field that contains line breaks which causes records in the data set to span multiple lines. For example, the following header (line 1) and record (lines 2-5) which spans multiple lines due to the presence of Line Feeds in the Description field:
    Agency_ID, Agency_Name, Description, Zipcode, City, State
    123, "Acme Labs", "Technical facility for testing big data concepts in LF
    the furthest reaches of space, time, and spreadsheets. Dedicated to LF
    unraveling the mysteries of the universe through data science. Also LF
    interested in interplanetary travel and data visualization.", "60555", "Warrenville", IL"
    Note: Selecting Records Span Multiple Lines enables the system to process a file containing records such as this example. Clearing this setting would likely result in an error in this case.
  • Has Header Row - If your data file contains a header row, you can select this option. Data360 DQ+ will then treat the first row in your file as a header and use this header to automatically name your fields.
  • Allow Lenient Parsing - Select this option for a data store with a Delimited or Excel Layout to ensure that the structure of data loaded to the data store does not need to exactly match the structure defined in the data store's definition. Additionally, selecting this option ensures that:
    • Data files can contain more fields than are present in the data store definition. For data files that contain more fields than are present in the data store definition, when lenient parsing is set, an Overflow Field property will become available when the data store is used as a data store input node in an analysis. Content from the extra fields will then be pushed into this Overflow Field in the JSON format when the analysis runs.
    • Data files can contain fewer fields than are present in the data store definition. For data files that contain fewer fields than are present in the data store definition, a Default Value field can be added to each data store field definition. When a file containing fewer fields than are present in the data store definition is used, the Default Value will be used to populate records in missing fields.
    • Data files can contain fields in an order that is different than the order specified in the data store definition.
    • Synonyms can be used for field names.
  • Determine Delimiters - Click Determine Delimiters to automatically detect delimiters when accessing data that resides in an external location. If your file uses common delimiters such as commas or tabs, this feature will detect them. If other delimiters are used, they must be entered manually.
    • Field Delimiters separate your columns.
    • Record Delimiters separate your rows. These are usually not visible in your source file, but can be shown by modifying the settings of various text editors.
    • Quote Character should be specified if paired quotes are used around string fields.

    For example, in the following diagram, specifying the Single Quote character prevents 'bla,ck' from being parsed into two fields due to the presence of a comma:

    Delimited data

JSON layout

You can configure your JSON layout by using the JSON schema and transformation builder.

Hadoop layouts

Data360 DQ+ supports the following Hadoop layouts:

  • Parquet
  • ORC
  • AVRO
  • Options - Use the Options panel to add Name / Value pairs.
  • Selection Criteria - Use the Selection Criteria panel to add SQL statements to define which records from the delimited file are included in the data store. In your SQL statements, use ${ResultTable} to refer to the table that your file represents. For example, given the following JSON:
    {"name": "Justin", "age": 19},
    {"name": "Michael", "age": 5},
    {"name": "Andy", "age": 30}

    You can select only the first and third record for inclusion in the data store by adding the following SQL statement:

    SELECT * FROM ${ResultTable} WHERE age >=18

Microsoft Excel layout

This layout option allows you to create data stores using Microsoft Excel files that have been saved in the .xlsx format. By using this layout type, you can create the data store using specific worksheets from the Excel file and skip empty rows and columns if you need to.

  • Skip Empty Columns - Select this option if your worksheets contain empty columns that you would like to skip and not include in the data store.
  • Skip Empty Rows - Select this option if worksheets contain empty rows that you would like to skip and not include in the data store.
  • Column Name Line Number - This is the row number of the worksheets' header, which can be used to generate the field names of the data store.
  • Identify Worksheets By - You can use the following options to be selective about the worksheets within your Excel file used to create your data store.

    Option

    Description

    By Numbers

    Specify worksheets by using a comma-delimited list of worksheet indexes with no spaces, where indexing starts at 1, for example 1,2,4

    By Number range

    Specify worksheets by using a numeric range, where indexing starts at 1.

    By Names

    Specify worksheets by name, by using a comma delimited list of worksheet names, for example august,september.

    By Name pattern

    Specify worksheets by using a regular expression. All worksheets that have names that match the regex pattern will be used.

Line layout

Data stores can also be created using non-delimited files that consist of a single line.

Generating a line layout data store

  1. Select a Store Repository Type. Files for line layout data stores can be uploaded to the default repository or accessed by pointing the data store to an external S3 location.
  2. Specify a Line Layout. To parse a file into multiple records you can use the line delimiter type.
    • New Line - Parses the file into a new record whenever a CRLF or LF character is found. Note that CRLF and LF characters are usually not visible characters in the file, they are line breaks. Here they are included to clarify the provided examples.

      For example, if your file read: 'thisCRLFthat'Choosing New Line would parse the file into the following two records: 'this' and 'that'.

    • Number of Characters - Takes the specified number of characters from each CRLF or LF separated value in the file and uses those characters to create a record. In other words, Number of Characters takes the first x characters from each line in the file.

      For example, if your file read: 'thisCRLFthat'Setting Number of Characters to three would parse the file into the following two records: 'thi' and 'tha'.

  3. Generate a field named "line" in the Fields tab. You can either do this manually, or by using the Generate button.
  4. If you chose the Default repository type, you need to upload a file to the data store.

After creating the line layout data store, you can then see how your file has been parsed by using the data store in an analysis.

XML layout

This layout type allows you to create data stores using XML files. Configure the following XML Layout properties to customize what contents are pulled from the file:

  • Row Tag - The string that is used to tag rows within your XML file, for example:
    <Roster>
    <Employee>
          <RetailStoreID>48</RetailStoreID>
          <WorkstationID>6</WorkstationID>
          <SequenceNumber>73</SequenceNumber>
          <StartDate>2014-09-30</StartDate>
          <EndDate>2015-09-30</EndDate>
          <OperatorID OperatorName="KERRY P">48237</OperatorID>
    </Employee>
    </Roster>

    You could use "Employee" as the Row Tag, signifying a structure where each <Employee> record represents a row in your data store. The tags within that row could then be used as data store fields. Here, those fields would be RetailStoreID, WorkstationID, SequenceNumber, StartDate, EndDate, and OperatorID.

  • Options - For details of the available options, see XML Data Source for Apache Spark.

    When using these options in the product, they must be prefixed with "read.xml".

    Note: The following features are not available:path, which is handled by the product, and rowTag, which is handled by the Row Tag property described above.

    Consider the following XML structure:

    <Roster>
       <Employee>
          <RetailStoreID>48</RetailStoreID>
          <WorkstationID>6</WorkstationID>
          <SequenceNumber>73</SequenceNumber>
          <StartDate>2014-09-30</StartDate>
          <EndDate>2015-09-30</EndDate>
    <OperatorID OperatorName="KERRY P">48237</OperatorID>
       </Employee>
    </Roster>

    In this example, the final field, OperatorID, contains an attribute, OperatorName. Using the excludeAttribute option, you could control whether this attribute was contained in the OperatorID field for this row.

    To exclude the OperatorName attribute, you would create the following Option:

    Name

    Value

    read.xml.excludeAttribute

    true

    This would populate the OperatorID field with: 48237

    To include the OperatorName attribute, you would create this Option instead:

    Name

    Value

    read.xml.excludeAttribute

    false

    This would populate the OperatorID field with: [KERRY P, 48237]

  • Selection Criteria - Use the Selection Criteria panel to add SQL statements to define which records from the XML file are included in the data store. For an example of how to use Selection Criteria, see Hadoop layouts.