Data Analyzer

Data360 Analyze Server Help

Product type: Software
Portfolio: Verify
Product family: Data360
Product: Data360 Analyze
Version: 3.12
Language: English
Product name: Data360 Analyze
Title: Data360 Analyze Server Help
Copyright: 2023
First publish date: 2016

Examines input data to determine its format, data model and statistical composition and ensures data gets associated with the correct data type by converting input strings or Unicode data into more appropriate data types, as needed.

For more information on data types, see Data types.

This node does not require any configuration. After running the Data Analyzer node, connect the converted data output pin to the next node in your data flow.

The Data Analyzer node also has two informational output pins, analysis and histogram.

Converted data

Outputs the input data where input string or Unicode data types have been converted to more appropriate data types, where necessary. Conversion is only performed on fields with a data type of string or Unicode. This is an optional output that is turned on by default; set the Perform Conversion property to False if you want to turn it off.

When converting data, the following coercions are attempted, in this order: Int, Long, Double, Bool, Date, Time.

For the date type, the following formats are attempted in this order: CCYY/M/D, CCYY/D/M, M/D/CCYY, D/M/CCYY, Y/M/D, Y/D/M, M/D/Y, D/M/Y, CCYY-M-D, CCYY-D-M, M-D-CCYY, D-M-CCYY, Y-M-D, Y-D-M, M-D-Y, D-M-Y.

For the time type, the following formats are attempted in this order: H:M:SPM, H:M:S PM, H:M:S, H:MPM, H:M PM, H:M.

See Date and time operators.
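The coercion order above can be illustrated with a rough Python sketch. This is not the node's implementation. The function names, the 32-bit and 64-bit ranges used for Int and Long, the accepted Bool literals, the reading of Y as a two-digit year, and the rule that a single format must fit every value in a field are all assumptions made for illustration.

```python
from datetime import datetime

# Date layouts from the list above; the "-" variants follow the "/" variants.
DATE_LAYOUTS = ["CCYY/M/D", "CCYY/D/M", "M/D/CCYY", "D/M/CCYY",
                "Y/M/D", "Y/D/M", "M/D/Y", "D/M/Y"]
# Assumed mapping of layout tokens to strptime directives (Y = two-digit year).
TOKENS = {"CCYY": "%Y", "Y": "%y", "M": "%m", "D": "%d"}
DATE_FORMATS = [sep.join(TOKENS[t] for t in layout.split("/"))
                for sep in ("/", "-") for layout in DATE_LAYOUTS]
# Assumed strptime equivalents of H:M:SPM, H:M:S PM, H:M:S, H:MPM, H:M PM, H:M.
TIME_FORMATS = ["%I:%M:%S%p", "%I:%M:%S %p", "%H:%M:%S",
                "%I:%M%p", "%I:%M %p", "%H:%M"]


def _ok(parse, value):
    """Return True if parse(value) succeeds without raising ValueError."""
    try:
        parse(value)
        return True
    except ValueError:
        return False


def _ranged_int(bits):
    """Build a parser for integers that fit in a signed width of `bits`."""
    def parse(value):
        number = int(value)
        if not -(2 ** (bits - 1)) <= number < 2 ** (bits - 1):
            raise ValueError("out of range")
        return number
    return parse


def _bool(value):
    """Accept only the literals true/false (an assumed set of literals)."""
    if value.strip().lower() not in ("true", "false"):
        raise ValueError("not a boolean literal")


def _each(parse):
    """Build a field check that passes when every value parses."""
    return lambda values: all(_ok(parse, v) for v in values)


def _fits_one_format(values, formats):
    """Pass when at least one format parses every value in the field."""
    for fmt in formats:
        if all(_ok(lambda v, fmt=fmt: datetime.strptime(v, fmt), v)
               for v in values):
            return True
    return False


# Checks are tried in the documented order; the first match wins.
CHECKS = [
    ("int", _each(_ranged_int(32))),
    ("long", _each(_ranged_int(64))),
    ("double", _each(float)),
    ("bool", _each(_bool)),
    ("date", lambda values: _fits_one_format(values, DATE_FORMATS)),
    ("time", lambda values: _fits_one_format(values, TIME_FORMATS)),
]


def infer_type(values):
    """Return the first type in the documented order that fits all values."""
    for name, check in CHECKS:
        if check(values):
            return name
    return "string"
```

Under these assumptions, infer_type(["15", "2", "3", "15", "15"]) returns "int" and infer_type(["Tea", "Coffee", "Water"]) falls through every check and returns "string".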

For example, you have the following input data:

Product_Code (unicode)  Product (unicode)
15                      Tea
2                       Coffee
3                       Water
15                      Tea
15                      Tea

Running the Data Analyzer node converts the Product_Code field to an int data type and the Product field to a string data type. The following output is displayed in the converted data output pin:

Product_Code (int)  Product (string)
15                  Tea
2                   Coffee
3                   Water
15                  Tea
15                  Tea

analysis

Gives a summary of the data types that have changed (for string or Unicode data types) and lists some simple metrics about the input data, including minimum and maximum values, minimum and maximum lengths, and the total number of NULLs. The null count can help to highlight data quality issues.
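As an illustration only, the metrics listed above could be computed for a single field as in the following Python sketch. The function name field_metrics, the as_number flag, and the use of None to stand for NULL are hypothetical; the sketch also assumes that minimum and maximum values are compared numerically when the discovered type is numeric.

```python
def field_metrics(values, as_number=False):
    """Summarize one field of string values; None stands for NULL."""
    present = [v for v in values if v is not None]
    # Assumption: numeric fields are ordered by value, other fields as text.
    key = float if as_number else str
    return {
        "min_value": min(present, key=key),
        "max_value": max(present, key=key),
        "min_length": min(len(v) for v in present),
        "max_length": max(len(v) for v in present),
        "null_count": len(values) - len(present),
        "distinct_values": len(set(present)),
    }


codes = field_metrics(["15", "2", "3", "15", "15"], as_number=True)
products = field_metrics(["Tea", "Coffee", "Water", "Tea", "Tea"])
```

Comparing "15", "2", and "3" as text would give a minimum of "15" and a maximum of "3", which is why the sketch orders numeric fields by value.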

The analysis output pin also contains the script that was used to translate the data from the input data type to the output data type.

Depending on the size of the input data set, the Data Analyzer node can take some time to process. If you want to optimize your data flow for a production environment, you can copy the contents of the #BRAINscript Conversion field from the analysis pin output into a Transform (Superseded) node to perform data conversion without using the Data Analyzer node.

For example, you have the following input data:

Product_Code (unicode)  Product (unicode)
15                      Tea
2                       Coffee
3                       Water
15                      Tea
15                      Tea

After running the Data Analyzer node, the analysis output pin displays the following information:

Field Name (string)  Current Type (string)  Discovered Type (string)  Min Value (unicode)  Max Value (unicode)  Min Length (int)  Max Length (int)  Null Count (int)  Distinct Value (int)  #BRAINscript Conversion (string)
Product_Code         unicode                int                       2                    15                   1                 2                 0                 3                     emit (if 'Product_Code'.trim() == "" then int(null) else 'Product_Code'.double().int()) as "Product_Code"
Product              unicode                string                    Coffee               Water                3                 6                 0                 3                     emit (if 'Product'.trim() == "" then str(null) else str('Product')) as "Product"
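To make the logic of the two generated conversion expressions easier to follow, here is a rough Python paraphrase. It is a reading aid, not BRAINscript and not something the node produces. The function names are hypothetical, and the sketch assumes that .double().int() narrows the same way Python's int(float(...)) does, that is, by truncating toward zero.

```python
from typing import Optional


def convert_product_code(raw: str) -> Optional[int]:
    """Paraphrase of the Product_Code expression."""
    # if 'Product_Code'.trim() == "" then int(null)
    if raw.strip() == "":
        return None
    # else 'Product_Code'.double().int(): parse as a double, then narrow.
    return int(float(raw))


def convert_product(raw: str) -> Optional[str]:
    """Paraphrase of the Product expression."""
    # if 'Product'.trim() == "" then str(null) else str('Product')
    if raw.strip() == "":
        return None
    return str(raw)
```

In both expressions, a value that is empty after trimming becomes a typed NULL rather than failing the conversion.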

histogram

Outputs a list of unique values within each input field, and a count of the total number of instances of each unique value. If you want to limit the number of unique values that are displayed in the histogram output, enter a maximum value in the Max Distinct Values Kept property.

For example, you have the following input data:

Product_Code (unicode)  Product (unicode)
15                      Tea
2                       Coffee
3                       Water
15                      Tea
15                      Tea

After running the Data Analyzer node, the histogram output pin displays a summary of the three unique values for each input field and a total count of each:

Product_Code (unicode)  Product_CodeCount (int)  Product (unicode)  ProductCount (int)
15                      3                        Tea                3
2                       1                        Coffee             1
3                       1                        Water              1
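The per-field counts in this output correspond to a simple frequency count. A minimal Python sketch, using the standard library's collections.Counter, reproduces the example figures. The function name histogram is hypothetical, and the row order shown by the node is not documented here, so the ordering by descending count is an assumption.

```python
from collections import Counter


def histogram(values):
    """Return (value, count) pairs, most frequent first."""
    return Counter(values).most_common()


code_counts = histogram(["15", "2", "3", "15", "15"])
product_counts = histogram(["Tea", "Coffee", "Water", "Tea", "Tea"])
```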

Properties

Empty String Is Null

Optionally specify whether strings that are either zero length or all whitespace are interpreted as NULL. If false, they are treated as distinct string values.

The default value is True.
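The effect of this property can be sketched in Python as follows. The function name normalize and its parameter are hypothetical, and None stands for NULL.

```python
def normalize(value, empty_string_is_null=True):
    """Map zero-length or all-whitespace strings to None when enabled."""
    if empty_string_is_null and value is not None and value.strip() == "":
        return None
    return value
```

With the property set to false, normalize("   ", empty_string_is_null=False) returns the whitespace string unchanged, so it counts as a distinct value.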

Convert Leading Zeros

Optionally specify whether strings that contain leading zeros are converted to a numeric type (int, long, or double) for analysis. Generally, leading zeros are an indicator that the data is not a number. For example, it could be a ZIP Code or an ID.

The default value is False.
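A minimal Python sketch of this rule is shown below. The function name, the parameter, and the exact definition of a leading zero (an optional sign, then a 0 followed by another digit) are assumptions made for illustration.

```python
import re

# Assumed definition: optional sign, then "0" followed by another digit.
_LEADING_ZERO = re.compile(r"^[+-]?0\d")


def may_convert_to_number(value, convert_leading_zeros=False):
    """Decide whether a string is a candidate for numeric conversion."""
    text = value.strip()
    if not convert_leading_zeros and _LEADING_ZERO.match(text):
        return False  # Likely a ZIP Code or an ID, so keep it as text.
    try:
        float(text)
        return True
    except ValueError:
        return False
```

Under this definition, "02134" stays a string by default, while "0" and "0.5" remain numeric candidates because the zero is not followed directly by another digit.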

Override Field

Optionally specify whether the #BRAINscript Conversion field is populated only for fields whose data type is changed.

If set to false, the #BRAINscript Conversion field is populated for all fields in the analysis output pin, regardless of whether a conversion occurs. If set to true, the field is populated only for fields that the node has detected can be converted to a different data type; for all other fields, it is empty.

The default value is False.

Max Distinct Values Kept

Optionally specify the maximum number of rows that are output on the histogram pin and the maximum number of distinct values reported on the analysis pin for each input field. Each distinct value for each field is stored in memory. If you do not want to limit the number of unique values that are listed, set this property to 0.

Note: Large data sets that have no limit set on the number of unique values to list can use a lot of memory.

If the input data contains more unique values than the maximum number set in this property, the value set in this property is reported as the total number of unique values discovered.
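The capping behavior can be sketched in Python as follows. The function name count_distinct and the parameter max_kept are hypothetical; the sketch only models the reported distinct count and the memory bound, not how the node handles counts for values that arrive after the cap is reached.

```python
def count_distinct(values, max_kept=0):
    """Count distinct values, storing at most max_kept of them (0 = no limit)."""
    seen = set()
    for value in values:
        if value in seen:
            continue
        if max_kept and len(seen) >= max_kept:
            continue  # Cap reached: new values are no longer stored.
        seen.add(value)
    return len(seen)
```

For example, count_distinct(["a", "b", "c", "a"], max_kept=2) returns 2, the cap, even though the input holds three distinct values.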

Time Limit In Seconds

Optionally specify the total number of seconds that the node analyzes the data. A value of zero means that this node will analyze all input data barring a limit placed by the Use Sample Set property. It is common to use either this property or the Use Sample Set property.

Use Sample Set

Optionally specify whether this node examines a sample of the input data as defined in the Sample Size property. If set to false, all data is evaluated barring a limit placed by the Time Limit In Seconds property.

The default value is False.

Sample Size

Optionally specify the size of the sample set of the input data to analyze in place of the full input set. This size can be written as either an absolute number of records or as a percentage of all input records. This value is only used if the Use Sample Set property is set to true.

Note: If specifying a percentage, enter a number followed by the % sign.
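The two accepted forms of this value could be interpreted as in the following Python sketch. The function name sample_record_count is hypothetical, and rounding a percentage down to a whole record count, as well as capping an absolute number at the size of the input, are assumptions.

```python
def sample_record_count(sample_size, total_records):
    """Translate a Sample Size value into a number of records."""
    text = sample_size.strip()
    if text.endswith("%"):
        percent = float(text[:-1])
        return int(total_records * percent / 100)  # Rounds down (assumed).
    return min(int(text), total_records)  # Never more than the input holds.
```

For an input of 200 records, "10%" gives 20 records and "50" gives 50 records.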

Perform Conversion

Optionally specify whether input string or Unicode data types are converted into more restrictive data types and output to the converted data output pin. If set to false, the converted data output pin does not output any data.

The default value is True.

Inputs and outputs

Inputs: Data.

Outputs: converted data, analysis, histogram.