Delimited layout
You can create a data store by using a delimited text file. Commonly used delimiters include commas and tabs, but you can use any character. Whichever symbol is used, it is the delimiter that tells Data360 DQ+ how to parse the external data.
The Delimited Layout options are described below:
-
Records Span Multiple Lines - You can use this setting when you have a field that contains line breaks which causes records in the data set to span multiple lines. For example, the following header (line 1) and record (lines 2-5) which spans multiple lines due to the presence of Line Feeds in the
Description
field:Agency_ID, Agency_Name, Description, Zipcode, City, State 123, "Acme Labs", "Technical facility for testing big data concepts in LF the furthest reaches of space, time, and spreadsheets. Dedicated to LF unraveling the mysteries of the universe through data science. Also LF interested in interplanetary travel and data visualization.", "60555", "Warrenville", IL"
Note: Selecting Records Span Multiple Lines enables the system to process a file containing records such as this example. Clearing this setting would likely result in an error in this case. - Has Header Row - If your data file contains a header row, you can select this option. Data360 DQ+ will then treat the first row in your file as a header and use this header to automatically name your fields.
-
Allow Lenient Parsing - Select this option for a data store with a Delimited or Excel Layout to ensure that the structure of data loaded to the data store does not need to exactly match the structure defined in the data store's definition. Additionally, selecting this option ensures that:
- Data files can contain more fields than are present in the data store definition. For data files that contain more fields than are present in the data store definition, when lenient parsing is set, an Overflow Field property will become available when the data store is used as a data store input node in an analysis. Content from the extra fields will then be pushed into this Overflow Field in the JSON format when the analysis runs.
- Data files can contain fewer fields than are present in the data store definition. For data files that contain fewer fields than are present in the data store definition, a Default Value field can be added to each data store field definition. When a file containing fewer fields than are present in the data store definition is used, the Default Value will be used to populate records in missing fields.
- Data files can contain fields in an order that is different than the order specified in the data store definition.
- Synonyms can be used for field names.
-
Determine Delimiters - Click Determine Delimiters to automatically detect delimiters when accessing data that resides in an external location. If your file uses common delimiters such as commas or tabs, this feature will detect them. If other delimiters are used, they must be entered manually.
- Field Delimiters separate your columns.
- Record Delimiters separate your rows. These are usually not visible in your source file, but can be shown by modifying the settings of various text editors.
- Quote Character should be specified if paired quotes are used around string fields.
For example, in the following diagram, specifying the Single Quote character prevents 'bla,ck' from being parsed into two fields due to the presence of a comma:
JSON layout
You can configure your JSON layout by using the JSON schema and transformation builder.
Hadoop layouts
Data360 DQ+ supports the following Hadoop layouts:
- Parquet
- ORC
- AVRO
- Options - Use the Options panel to add Name / Value pairs.
-
Selection Criteria - Use the Selection Criteria panel to add SQL statements to define which records from the delimited file are included in the data store. In your SQL statements, use
${ResultTable}
to refer to the table that your file represents. For example, given the following JSON:{"name": "Justin", "age": 19}, {"name": "Michael", "age": 5}, {"name": "Andy", "age": 30}
You can select only the first and third record for inclusion in the data store by adding the following SQL statement:
SELECT * FROM ${ResultTable} WHERE age >=18
Microsoft Excel layout
This layout option allows you to create data stores using Microsoft Excel files that have been saved in the .xlsx
format. By using this layout type, you can create the data store using specific worksheets from the Excel file and skip empty rows and columns if you need to.
- Skip Empty Columns - Select this option if your worksheets contain empty columns that you would like to skip and not include in the data store.
- Skip Empty Rows - Select this option if worksheets contain empty rows that you would like to skip and not include in the data store.
- Column Name Line Number - This is the row number of the worksheets' header, which can be used to generate the field names of the data store.
-
Identify Worksheets By - You can use the following options to be selective about the worksheets within your Excel file used to create your data store.
Option
Description
By Numbers
Specify worksheets by using a comma-delimited list of worksheet indexes with no spaces, where indexing starts at 1, for example
1,2,4
By Number range
Specify worksheets by using a numeric range, where indexing starts at 1.
By Names
Specify worksheets by name, by using a comma delimited list of worksheet names, for example
august,september
.By Name pattern
Specify worksheets by using a regular expression. All worksheets that have names that match the regex pattern will be used.
Line layout
Data stores can also be created using non-delimited files that consist of a single line.
Generating a line layout data store
- Select a Store Repository Type. Files for line layout data stores can be uploaded to the default repository or accessed by pointing the data store to an external S3 location.
- Specify a Line Layout. To parse a file into multiple records you can use the line delimiter type.
- New Line - Parses the file into a new record whenever a CRLF or LF character is found. Note that CRLF and LF characters are usually not visible characters in the file, they are line breaks. Here they are included to clarify the provided examples.
For example, if your file read: 'thisCRLFthat'Choosing New Line would parse the file into the following two records: 'this' and 'that'.
- Number of Characters - Takes the specified number of characters from each CRLF or LF separated value in the file and uses those characters to create a record. In other words, Number of Characters takes the first x characters from each line in the file.
For example, if your file read: 'thisCRLFthat'Setting Number of Characters to three would parse the file into the following two records: 'thi' and 'tha'.
- New Line - Parses the file into a new record whenever a CRLF or LF character is found. Note that CRLF and LF characters are usually not visible characters in the file, they are line breaks. Here they are included to clarify the provided examples.
- Generate a field named "line" in the Fields tab. You can either do this manually, or by using the Generate button.
- If you chose the Default repository type, you need to upload a file to the data store.
After creating the line layout data store, you can then see how your file has been parsed by using the data store in an analysis.
XML layout
This layout type allows you to create data stores using XML files. Configure the following XML Layout properties to customize what contents are pulled from the file:
-
Row Tag - The string that is used to tag rows within your XML file, for example:
<Roster> <Employee> <RetailStoreID>48</RetailStoreID> <WorkstationID>6</WorkstationID> <SequenceNumber>73</SequenceNumber> <StartDate>2014-09-30</StartDate> <EndDate>2015-09-30</EndDate> <OperatorID OperatorName="KERRY P">48237</OperatorID> </Employee> </Roster>
You could use "Employee" as the Row Tag, signifying a structure where each
<Employee>
record represents a row in your data store. The tags within that row could then be used as data store fields. Here, those fields would beRetailStoreID
,WorkstationID
,SequenceNumber
,StartDate
,EndDate
, andOperatorID
. -
Options - For details of the available options, see XML Data Source for Apache Spark.
When using these options in the product, they must be prefixed with "read.xml".
Note: The following features are not available:path, which is handled by the product, and rowTag, which is handled by the Row Tag property described above.Consider the following XML structure:
<Roster> <Employee> <RetailStoreID>48</RetailStoreID> <WorkstationID>6</WorkstationID> <SequenceNumber>73</SequenceNumber> <StartDate>2014-09-30</StartDate> <EndDate>2015-09-30</EndDate> <OperatorID OperatorName="KERRY P">48237</OperatorID> </Employee> </Roster>
In this example, the final field,
OperatorID
, contains an attribute,OperatorName
. Using the excludeAttribute option, you could control whether this attribute was contained in theOperatorID
field for this row.To exclude the
OperatorName
attribute, you would create the following Option:Name
Value
read.xml.excludeAttribute
true
This would populate the
OperatorID
field with: 48237To include the
OperatorName
attribute, you would create this Option instead:Name
Value
read.xml.excludeAttribute
false
This would populate the
OperatorID
field with: [KERRY P, 48237] - Selection Criteria - Use the Selection Criteria panel to add SQL statements to define which records from the XML file are included in the data store. For an example of how to use Selection Criteria, see Hadoop layouts.