Provides a wide range of filtering and transformation operations, using Python script to configure output fields, process input data and produce output records.
There are two Python script properties ConfigureFields and ProcessRecords. Together, these properties define the way that the node maps, filters and transforms the input records to the node outputs.
The ConfigureFields property is used to configure the output metadata of the node, either as fields which are mapped from an input field, or as new fields. This script is executed prior to reading any records, and must define all of the output fields and their properties (i.e. name, data type).
The ProcessRecords script is then called once per input record and produces output records by transforming these input records. Alternatively, if you wish to simply copy values straight over from inputs to outputs, this can be done as part of the initial mapping in ConfigureFields.
Both properties are populated with default scripting (out1 += in1
) which, if left unchanged, will output all input fields and records.
When typing Python script, auto indentation will insert four spaces. If you have pasted script into the editor that contains tabs, these will be highlighted in the code and you can use the Convert Tabs to Spaces option from the property menu to convert all tabs to spaces. This menu option will also convert any instances of two space indentation to four spaces.
Valid data types
- bool
- int
- long
- float
- datetime.date
- datetime.time
- datetime.datetime
- str
- unicode
Examples
The following set of Python script snippets show how to achieve common tasks (all snippets assume the default pin names in1 and out1). Note that where the ProcessRecords property is excluded from the examples, assume a value of 'out1 += in1', copying all records over.
Passing all input data through
ConfigureFields
#map all of the fields from input pin in1 onto output pin out1out1 += in1
ProcessRecords
#copy all records overout1 += in1
Outputting specific fields
ConfigureFields
out1 += in1.Billing_Totalout1 += in1.Customer_ID
If a field name includes a space, it must be referenced in square brackets and single quotes, for example:
out1 += in1['Billing Total']
Filtering on row values
ConfigureFields
out1 += in1
ProcessRecords
if in1['Billing Total'] > 100.00 and in1.Status == 'Active':
out1 += in1
Excluding fields
ConfigureFields
out1 += in1out1 -= in1.Billing_Refout1 -= in1.Company_Code
Renaming a field
ConfigureFields
#map all of the fields overout1 += in1
#exclude customer ID and rename it on the outputout1 -= in1.Customer_IDout1.Customer_Ref = in1.Customer_ID
out1 -= in1.Customer_ID
must be written before out1.Customer_Ref = in1.Customer_ID
. Reversing the order of these statements will produce an incorrect result. In this case, the exclude statement will not only exclude the original field metadata, but will also remove any output field that is mapped from the excluded input field.Outputting fields which match a pattern or regular expression
ConfigureFields
#Select the fields containing the string "Invoice"out1 += '*Invoice*'#Select the fields matching the regular expressionout1 += patterns.regex('.*?(due|present).*?Date', in1)
Creating and populating a new output field
ConfigureFields
#map all of the fields overout1 += in1
#add a new long integer fieldout1.RecordCount = long
ProcessRecords
out1 += in1out1.RecordCount = node.execCount
Accessing the type of the input fields and splitting fields using a delimiter
ConfigureFields
#Configure all fields from input 'in1' to be mapped
#to the corresponding fields on the output 'out1'
out1 += in1
## Check the type of field we are dealing with
fieldtype = in1['Data'].type()
## Create new string fields for the elements based on the input field's data type
if fieldtype == str:
out1.c1 = str
out1.c2 = str
out1.c3 = str
out1.c4 = str
out1.c5 = str
if fieldtype == unicode:
out1.c1 = unicode
out1.c2 = unicode
out1.c3 = unicode
out1.c4 = unicode
out1.c5 = unicode
ProcessRecords
## Split the input field on space chars
if in1['data'] is Null:
splits = []
else:
splits = in1['Data'].split(" ")
## ToDo Insert some error handling code here
if len(splits) > 5:
##Raise an error instead of the pass below pass
## Output all the input fields
out1 += in1
## Output the element values
if(len(splits) >0):
out1.c1 = splits[0]
else:
out1.c1 = Null
if(len(splits) >1):
out1.c2 = splits[1]
else:
out1.c2 = Null
if(len(splits) >2):
out1.c2 = splits[1]
else:
out1.c2 = Null
if(len(splits) >3):
out1.c3 = splits[2]
else:
out1.c3 = Null
if(len(splits) >4):
out1.c4 = splits[3]
else:
out1.c4 = Null
if(len(splits) >5):
out1.c5 = splits[4]
else:
out1.c5 = Null
Changing the data type of a field
ConfigureFields
#map all of the fields overout1 += in1
#specify the data type of the new output fieldout1.Account_Number = int
ProcessRecords
out1 += in1
#cast each value of Account_Number to an integerout1.Account_Number = int(in1.Account_Number)
In this second example, a string data type is converted to a date data type:
ConfigureFields
out1 += in1
#exclude original Transaction_Date fieldout1 -= in1.Transaction_Date
# Create new field called Transaction_Date with datetime.date formatout1.Transaction_Date = datetime.date
ProcessRecords
out1.Transaction_Date = datetime.datetime.strptime(in1.Transaction_Date, '%Y-%m-%d').date()
To handle blank records, in ProcessRecords:
out1 += in1
if in1.Transaction_Date != "":
out1.Transaction_Date = datetime.datetime.strptime(in1.dueDate,'%Y-%m-%d').date()
else:
out1.Transaction_Date = Null
To handle non-ASCIIcharacters, in ProcessRecords:
# Use the 'u
' character prefix to convert the string to a Unicode string.
out1.color = str(fields[u'colör'])
Output all fields and the running sum of a particular field
ConfigureFields
out1 += in1out1.SumTotal = float
ProcessRecords
out1 += in1
if node.firstExec:
sumTotal = 0
sumTotal += in1.total
out1.SumTotal = sumTotal
Logging an error and failing the node upon encountering an error condition
ConfigureFields
out1 += in1
ProcessRecords
out1 += in1
#AccountNumber is a required field, error if not present
if in1.AccountNumber is Null:
msg = 'Account Number missing on record: '
record = str(node.execCount)
node.logger.error(msg + record)
raise node.fail()
StringIO
objects from the Python StringIO
module for improved performance. For more information, see the Python documentation at https://docs.python.org/2/library/stringio.html.Properties
ConfigureFields
Specify the Python script to configure the node outputs and output fields. This script is executed once in order to configure the output metadata prior to any records being processed.
The details of any mappings from input field to output field are stored in the mapper for the output which can be used in the ProcessRecords property.
The default script will configure the output fields on the default output to be the same as the input fields. The default script also contains (commented out) examples for other use cases.
If there are no naming conflicts and no spaces in an input or output name, it can be referenced directly using its name (e.g. in1, out1).
The inputs and outputs can also be referenced using the format: inputs[index] or inputs['name'] (and the equivalent format for outputs). As this script is executed prior to reading any records, the inputs and outputs references refer specifically to the input and output metadata.
ProcessRecords
Specify the Python script to process the input records and write output records.
This script is executed once per input record. The default script will simply pass through all of the fields from the input record to the corresponding output field if specified in the ConfigureFields property.
The default script also contains (commented out) examples for other use cases. If there are no naming conflicts and no spaces in an input or output name, it can be referenced directly using its name (e.g. in1, out1)
The inputs and outputs can also be referenced using the format: inputs[index] or inputs['name'] (and the equivalent format for outputs).
The input and output references in this property refer to the input and output records as opposed to the metadata in the ConfigureFields property.
GroupBy
Optionally specify the fields that you want to group by using the notation fields.<name>
separating each field reference with a comma. To group in descending order, use the fn.desc
function.
If left blank, all input records will be treated as a single group.
Example: fields.FirstName, fn.desc(fields.DOB)
ImplicitWriteMode
Optionally specify whether or not the output records will be automatically written by the node after processing each input record. Choose from:
- Always - Unless an output record has been explicitly written within the Python script, or has been explicitly set to None, it will always be written.
- Never - Output records will never be automatically written. They need to be explicitly written within the ProcessRecords script.
- When Modified - Output records will be automatically written whenever something (including None) has been set on the output record, and the output has not been set to None, and the record has not been explicitly written.
The default value is When Modified.
Consider the following ProcessRecords script:
if in1.Status != 'Inactive': out1 += in1
With ImplicitWriteMode set to When Modified, this record will only be written to the output if the 'Status' field value is not 'Inactive'.
If ImplicitWriteMode is set to Always, the record would be written to the output with the script above. To prevent this, you would use the following script:
if in1.Status != 'Inactive': out1 += in1else out1 += None
If ImplicitWriteMode is set to Never, then you would need to provide additional instructions to write the record to the output upon setting the value:
if in1.Status != 'Inactive': out1 += in1 node.write(0, out1)
ImplicitWriteEvaluation
Optionally set when the ImplicitWriteMode is to be evaluated. Choose from:
- Per Record - The ImplicitWriteMode property will be evaluated for each input record meaning records may be implicitly written per input record.
- Group End - The ImplicitWriteMode property will be evaluated at the end of each group meaning records will only ever be implicitly written at the end of each group as configured by the GroupBy property.
The default value is Per Record.
UnsortedInputBehavior
Specify the behavior when input data has not been sorted. Choose from:
- Error - The node will fail if the input records are not sorted according to the GroupBy criteria.
- Log - If the input records are not sorted according to the GroupBy criteria, then a warning is logged the first time the problem is encountered however the node will continue processing.
- Ignore - No action is taken if the input records are not sorted according to the GroupBy criteria.
The default value is Log.
Example data flows
A number of sample Data Flows are available from the Samples workspace, found in the Analyze Directory page.
In the Directory under the /Data360 Samples/Node Examples/
folder, you will find "Transform using Python", which shows examples of how to use this node.
Inputs and outputs
Inputs: in1.
Outputs: out1, multiple optional.