Transform - Data360_Analyze - Latest

Data360 Analyze Server Help

Product type
Software
Portfolio
Verify
Product family
Data360
Product
Data360 Analyze
Version
Latest
Language
English
Product name
Data360 Analyze
Title
Data360 Analyze Server Help
Copyright
2024
First publish date
2016
Last updated
2024-11-28
Published on
2024-11-28T15:26:57.181000

Provides a wide range of filtering and transformation operations, using Python script to configure output fields, process input data and produce output records.

There are two Python script properties ConfigureFields and ProcessRecords. Together, these properties define the way that the node maps, filters and transforms the input records to the node outputs.

The ConfigureFields property is used to configure the output metadata of the node, either as fields which are mapped from an input field, or as new fields. This script is executed prior to reading any records, and must define all of the output fields and their properties (i.e. name, data type).

The ProcessRecords script is then called once per input record and produces output records by transforming these input records. Alternatively, if you wish to simply copy values straight over from inputs to outputs, this can be done as part of the initial mapping in ConfigureFields.

Both properties are populated with default scripting (out1 += in1) which, if left unchanged, will output all input fields and records.

Tip: See Python scripting for more information.

When typing Python script, auto indentation will insert four spaces. If you have pasted script into the editor that contains tabs, these will be highlighted in the code and you can use the Convert Tabs to Spaces option from the property menu to convert all tabs to spaces. This menu option will also convert any instances of two space indentation to four spaces.

Valid data types

  • bool
  • int
  • long
  • float
  • datetime.date
  • datetime.time
  • datetime.datetime
  • str
  • unicode

Examples

The following set of Python script snippets show how to achieve common tasks (all snippets assume the default pin names in1 and out1). Note that where the ProcessRecords property is excluded from the examples, assume a value of 'out1 += in1', copying all records over.

Passing all input data through

ConfigureFields

#map all of the fields from input pin in1 onto output pin out1out1 += in1

ProcessRecords

#copy all records overout1 += in1

Outputting specific fields

ConfigureFields

out1 += in1.Billing_Totalout1 += in1.Customer_ID

If a field name includes a space, it must be referenced in square brackets and single quotes, for example:

out1 += in1['Billing Total']

Filtering on row values

ConfigureFields

out1 += in1

ProcessRecords


        if in1['Billing Total'] > 100.00 and in1.Status == 'Active':
    out1 += in1
      

Excluding fields

ConfigureFields

out1 += in1out1 -= in1.Billing_Refout1 -= in1.Company_Code

Renaming a field

ConfigureFields

#map all of the fields overout1 += in1

#exclude customer ID and rename it on the outputout1 -= in1.Customer_IDout1.Customer_Ref = in1.Customer_ID

Note: You must write the exclude statement first, prior to inserting new field metadata that references the excluded input field, i.e. in this example, out1 -= in1.Customer_ID must be written before out1.Customer_Ref = in1.Customer_ID. Reversing the order of these statements will produce an incorrect result. In this case, the exclude statement will not only exclude the original field metadata, but will also remove any output field that is mapped from the excluded input field.

Outputting fields which match a pattern or regular expression

ConfigureFields

#Select the fields containing the string "Invoice"out1 += '*Invoice*'#Select the fields matching the regular expressionout1 += patterns.regex('.*?(due|present).*?Date', in1)

Creating and populating a new output field

ConfigureFields

#map all of the fields overout1 += in1

#add a new long integer fieldout1.RecordCount = long

ProcessRecords

out1 += in1out1.RecordCount = node.execCount

Accessing the type of the input fields and splitting fields using a delimiter

ConfigureFields

#Configure all fields from input 'in1' to be mapped
#to the corresponding fields on the output 'out1'
out1 += in1

## Check the type of field we are dealing with
fieldtype = in1['Data'].type()

## Create new string fields for the elements based on the input field's data type
if fieldtype == str:
    out1.c1 = str
    out1.c2 = str
    out1.c3 = str
    out1.c4 = str
    out1.c5 = str

if fieldtype == unicode:
    out1.c1 = unicode
    out1.c2 = unicode
    out1.c3 = unicode
    out1.c4 = unicode
    out1.c5 = unicode

ProcessRecords

## Split the input field on space chars
if in1['data'] is Null:
    splits = []
else:
    splits = in1['Data'].split(" ")

## ToDo Insert some error handling code here
if len(splits) > 5:
    ##Raise an error instead of the pass below pass

## Output all the input fields
out1 += in1

## Output the element values
if(len(splits) >0):
    out1.c1 = splits[0]
else:
    out1.c1 = Null

if(len(splits) >1):
    out1.c2 = splits[1]
else:
    out1.c2 = Null

if(len(splits) >2):
    out1.c2 = splits[1]
else:
    out1.c2 = Null

if(len(splits) >3):
    out1.c3 = splits[2]
else:
    out1.c3 = Null

if(len(splits) >4):
    out1.c4 = splits[3]
else:
    out1.c4 = Null

if(len(splits) >5):
    out1.c5 = splits[4]
else:
    out1.c5 = Null

Changing the data type of a field

ConfigureFields

#map all of the fields overout1 += in1

#specify the data type of the new output fieldout1.Account_Number = int

ProcessRecords

out1 += in1

#cast each value of Account_Number to an integerout1.Account_Number = int(in1.Account_Number)

In this second example, a string data type is converted to a date data type:

ConfigureFields

out1 += in1

#exclude original Transaction_Date fieldout1 -= in1.Transaction_Date

# Create new field called Transaction_Date with datetime.date formatout1.Transaction_Date = datetime.date

ProcessRecords

out1.Transaction_Date = datetime.datetime.strptime(in1.Transaction_Date, '%Y-%m-%d').date()

To handle blank records, in ProcessRecords:

out1 += in1

if in1.Transaction_Date != "":
    out1.Transaction_Date = datetime.datetime.strptime(in1.dueDate,'%Y-%m-%d').date()

else:
    out1.Transaction_Date = Null

To handle non-ASCIIcharacters, in ProcessRecords:

# Use the 'u' character prefix to convert the string to a Unicode string.

out1.color = str(fields[u'colör'])

Output all fields and the running sum of a particular field

ConfigureFields

out1 += in1out1.SumTotal = float

ProcessRecords

out1 += in1
if node.firstExec:
    sumTotal = 0

sumTotal += in1.total
out1.SumTotal = sumTotal

Logging an error and failing the node upon encountering an error condition

ConfigureFields

out1 += in1

ProcessRecords

out1 += in1
#AccountNumber is a required field, error if not present
if in1.AccountNumber is Null:
   msg = 'Account Number missing on record: '
   record = str(node.execCount)
   node.logger.error(msg + record)
   raise node.fail()

Tip: When processing a high volume of records (100+ million records) with large string values, consider using StringIO objects from the Python StringIO module for improved performance. For more information, see the Python documentation at https://docs.python.org/2/library/stringio.html.

Properties

ConfigureFields

Specify the Python script to configure the node outputs and output fields. This script is executed once in order to configure the output metadata prior to any records being processed.

The details of any mappings from input field to output field are stored in the mapper for the output which can be used in the ProcessRecords property.

The default script will configure the output fields on the default output to be the same as the input fields. The default script also contains (commented out) examples for other use cases.

If there are no naming conflicts and no spaces in an input or output name, it can be referenced directly using its name (e.g. in1, out1).

The inputs and outputs can also be referenced using the format: inputs[index] or inputs['name'] (and the equivalent format for outputs). As this script is executed prior to reading any records, the inputs and outputs references refer specifically to the input and output metadata.

Tip: See Python scripting for more information.

ProcessRecords

Specify the Python script to process the input records and write output records.

This script is executed once per input record. The default script will simply pass through all of the fields from the input record to the corresponding output field if specified in the ConfigureFields property.

The default script also contains (commented out) examples for other use cases. If there are no naming conflicts and no spaces in an input or output name, it can be referenced directly using its name (e.g. in1, out1)

The inputs and outputs can also be referenced using the format: inputs[index] or inputs['name'] (and the equivalent format for outputs).

The input and output references in this property refer to the input and output records as opposed to the metadata in the ConfigureFields property.

Tip: See Python scripting for more information.

GroupBy

Optionally specify the fields that you want to group by using the notation fields.<name> separating each field reference with a comma. To group in descending order, use the fn.desc function.

If left blank, all input records will be treated as a single group.

Example: fields.FirstName, fn.desc(fields.DOB)

Tip: See Pythonmodule support for more information.

ImplicitWriteMode

Optionally specify whether or not the output records will be automatically written by the node after processing each input record. Choose from:

  • Always - Unless an output record has been explicitly written within the Python script, or has been explicitly set to None, it will always be written.
  • Never - Output records will never be automatically written. They need to be explicitly written within the ProcessRecords script.
  • When Modified - Output records will be automatically written whenever something (including None) has been set on the output record, and the output has not been set to None, and the record has not been explicitly written.

The default value is When Modified.

Consider the following ProcessRecords script:

if in1.Status != 'Inactive':    out1 += in1

With ImplicitWriteMode set to When Modified, this record will only be written to the output if the 'Status' field value is not 'Inactive'.

If ImplicitWriteMode is set to Always, the record would be written to the output with the script above. To prevent this, you would use the following script:

if in1.Status != 'Inactive':    out1 += in1else    out1 += None

If ImplicitWriteMode is set to Never, then you would need to provide additional instructions to write the record to the output upon setting the value:

if in1.Status != 'Inactive':    out1 += in1    node.write(0, out1)

ImplicitWriteEvaluation

Optionally set when the ImplicitWriteMode is to be evaluated. Choose from:

  • Per Record - The ImplicitWriteMode property will be evaluated for each input record meaning records may be implicitly written per input record.
  • Group End - The ImplicitWriteMode property will be evaluated at the end of each group meaning records will only ever be implicitly written at the end of each group as configured by the GroupBy property.

The default value is Per Record.

UnsortedInputBehavior

Specify the behavior when input data has not been sorted. Choose from:

  • Error - The node will fail if the input records are not sorted according to the GroupBy criteria.
  • Log - If the input records are not sorted according to the GroupBy criteria, then a warning is logged the first time the problem is encountered however the node will continue processing.
  • Ignore - No action is taken if the input records are not sorted according to the GroupBy criteria.

The default value is Log.

Example data flows

A number of sample Data Flows are available from the Samples workspace, found in the Analyze Directory page.

In the Directory under the /Data360 Samples/Node Examples/ folder, you will find "Transform using Python", which shows examples of how to use this node.

Note: Upgrades will overwrite all data flows in the workspace. If you want to make changes to one of the sample data flows, we recommend you create a copy and save it elsewhere, using Save as...

Inputs and outputs

Inputs: in1.

Outputs: out1, multiple optional.