Imports XML data from an input data source into tabular data.
To run the XML Data node, set the source of the data that you want to import in the XmlData property. You can import data from a specified file name or input field. If you are processing data from a file, choose the (from Filename) variant of the XmlData property to specify the name of the file containing XML data. If you are processing data from an input field, the input field can either contain the file names of the data to process, or it can contain the data itself within the field. Choose the (from Filename Field) variant of the XmlData property to specify the name of an input field containing XML file names, or choose the (from Data Field) variant of the XmlData property to specify the name of the input field containing XML data.
For certain data structures, you might need to configure additional properties. If this is the case, and you try to run the node without setting additional properties, an error message will inform you which properties you need to set.
The node attempts to map each field in the structured data to an output with the same name as the field. The node must have at least three outputs; however, you can modify the names and corresponding purpose of the outputs. By default, the XML Data node has the following three output pins:
- Data - Outputs all parsed data fields which do not have a corresponding output with the same name. Corresponds to the optional DefaultOutput property, if specified. By default, the DefaultOutput property is Data. If you rename the Data output pin, specify the new name in the DefaultOutput property.
- Structure - Outputs information about the structure of the parsed data. Corresponds to the optional StructureOutput property, if specified. By default, the StructureOutput property is Structure. If you rename the Structure output pin, specify the new name in the StructureOutput property.
- Errors - Outputs any errors that occur during parsing. Corresponds to the optional ErrorsOutput property. By default, the ErrorsOutput property is Errors. If you rename the Errors output pin, specify the new name in the ErrorsOutput property.
For any output that is not specified as a DefaultOutput, StructureOutput or ErrorsOutput, the node attempts to map parsed data fields to an output based on the name of the output. Each parsed data field is included in at most one output. The node first attempts to find the "standard data output" (that is, an output that is not referenced by the DefaultOutput, StructureOutput or ErrorsOutput properties) whose name is the longest match for the prefix of the parsed data field name. If no such standard data output exists, and there is a DefaultOutput, the field is output to the DefaultOutput. If there is no DefaultOutput specified, the parsed data field is not included in any output, and the behavior of the node depends on the UnmappedFieldBehavior property.
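The mapping rule described above can be sketched in Python. This is only an illustration, not the node's implementation: the function name map_field_to_output is hypothetical, and the sketch assumes that a prefix must end on a nesting-character boundary (so an output named Data.Country would not capture a field named Data.CountryCode). The field names used in the usage lines are made-up placeholders.

```python
from typing import Optional, Sequence


def map_field_to_output(field: str,
                        standard_outputs: Sequence[str],
                        default_output: Optional[str],
                        nesting_char: str = ".") -> Optional[str]:
    """Return the output a parsed field would be written to, or None if unmapped."""
    best = None
    for name in standard_outputs:
        # Assumption: an output matches when its name equals the field name
        # or is a whole-segment prefix of it.
        if field == name or field.startswith(name + nesting_char):
            if best is None or len(name) > len(best):
                best = name  # keep the longest matching prefix
    # Fall back to the default output; None means the field is unmapped and
    # the UnmappedFieldBehavior property would decide what happens next.
    return best if best is not None else default_output


outputs = ["Data.Country", "Data.Country.City"]
print(map_field_to_output("Data.Country.City.name", outputs, "Data"))   # Data.Country.City
print(map_field_to_output("Data.Metadata.timestamp", outputs, "Data"))  # Data
```

When default_output is None, the function returns None for a field with no matching output, which corresponds to the case handled by UnmappedFieldBehavior.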
Example - Extracting specific XML elements
You have the following country data.xml file containing some general metadata, along with country demographic information:
- Select the XML Data node and in the XmlData (from Filename) property, navigate to the country data.xml file.
- Define a new output pin called Data.Country.
The node now has the following outputs:
- Data - Referenced in the DefaultOutput property by default.
- Structure - Referenced in the StructureOutput property by default.
- Errors - Referenced in the ErrorsOutput property by default.
- Data.Country - Not referenced by any property, considered a "standard data output".
- Run the node.
The Data output contains the Data.Metadata.timestamp and Data.Metadata.sequenceNumber fields because none of the other outputs match the name of these fields.
The Structure output specifies all input fields parsed from the input data, and identifies in which of the node outputs these fields have been included:
The Errors output contains information about any errors that occurred during parsing.
The Data.Country output contains the following Data.Country.<value> fields because "Data.Country" matches the prefix of these fields:
- Rename the Data.Country output pin to Data.Country.City, then save the data flow.
- Re-run the node.
The Data.Country.City output contains only the Data.Country.City.<value> fields because "Data.Country.City" matches the prefix of these fields:
All other fields (the Data.Metadata.timestamp and Data.Metadata.sequenceNumber fields, and the remaining Data.Country.<value> fields) are output on the Data pin.
Alternatively, to output only the Data.Country.City.<value> fields, and ignore all other fields:
- Delete the custom Data.Country.City output pin.
- Rename the default Data output pin to Data.Country.City.
- Set the UnmappedFieldBehavior property to Ignore, then save the data flow.
- Re-run the node.
The Data.Country.City output contains only the Data.Country.City.<value> fields:
Properties
XmlData
Specify the XML data that you want to import.
- Choose the (from Filename) variant of this property to specify the name of the file containing XML data.
- Choose the (from Filename Field) variant of this property to specify the name of an input field containing XML file names.
- Choose the (from Data Field) variant of this property to specify the name of the input field containing XML data.
A value is required for this property.
OutputNestingCharacter
Optionally specify the character to use to identify hierarchical relationships in the output fields. This applies to both the field names in the output metadata and to the mapping of fields to output pins. The default is ".".
For example, if the input data contains a field named Country which contains a sub-field named Population, and the OutputNestingCharacter property is set to ".", then the sub-field would be output as Country.Population. Similarly, to map this field to a specific output, set the name of the output to Country.Population.
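As a rough illustration of how a nesting character turns a hierarchy into flat field names, the following Python sketch walks a small XML document and joins element names with the chosen character. The flatten function and the sample population value are assumptions made for this example; they are not part of the node.

```python
import xml.etree.ElementTree as ET


def flatten(element, nesting_char=".", prefix=""):
    """Yield (hierarchical_name, text) pairs for every leaf element."""
    name = prefix + element.tag
    children = list(element)
    if not children:
        yield name, (element.text or "").strip()
    for child in children:
        # Each level of nesting adds the parent name plus the nesting character.
        yield from flatten(child, nesting_char, name + nesting_char)


doc = ET.fromstring("<Country><Population>100</Population></Country>")
fields = dict(flatten(doc))
print(fields)                   # {'Country.Population': '100'}
print(dict(flatten(doc, "/")))  # {'Country/Population': '100'}
```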
OutputReferenceIds
Optionally specify whether reference identifiers are included in the output.
The default value is True.
This property only has effect when more than one output is present and receiving data from the data source. Where hierarchical data is being flattened to multiple tabular outputs, the reference identifiers can be used to identify how the data in the different outputs is related.
You can use these identifiers in subsequent join nodes, if required, to reassemble the data or identify the relationships between the different outputs.
PassThroughFields
Optionally specify which input fields will "pass through" the node unchanged from the input to the output, assuming that the input exists. The input fields specified will appear on those output records which were produced as a result of the input fields. Choose from:
- All - Passes through all input data fields to the output.
- None - No input data fields are passed through to the output; as such, only the fields created by the node appear on the output.
- Used - Passes through all fields that the node used to create the output, including any input field referenced by a property, such as the Filename Field or Data Field if specified in the XmlData property.
- Unused (default) - Passes through all fields that the node did not use to create the output.
The default value is Unused.
If, for a given input record, the only data to be output is the pass through fields, whether or not these pass through fields are output depends on the setting specified in the AlwaysEmitPassThroughFields property.
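The four PassThroughFields options can be summarized with a short Python sketch. The function select_pass_through and the input field names (id, XmlPayload, region) are hypothetical and exist only to show which fields each option would keep.

```python
def select_pass_through(input_fields, used_fields, mode="Unused"):
    """Return the input fields that pass through for a given option."""
    used = set(used_fields)
    if mode == "All":
        return list(input_fields)
    if mode == "None":
        return []
    if mode == "Used":
        return [f for f in input_fields if f in used]
    if mode == "Unused":
        return [f for f in input_fields if f not in used]
    raise ValueError(f"unknown PassThroughFields value: {mode}")


# Assumed scenario: XmlPayload is the Data Field named in the XmlData property.
inputs = ["id", "XmlPayload", "region"]
print(select_pass_through(inputs, ["XmlPayload"]))          # ['id', 'region']
print(select_pass_through(inputs, ["XmlPayload"], "Used"))  # ['XmlPayload']
```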
RemoveCommonPrefixes
Optionally specify whether the node attempts to rename the output fields by removing common prefixes.
The default value is False.
For example, in an output, where the only fields from the parsed data to be output are Top.Middle.First and Top.Middle.Second, if this property is set to True, then the fields will be output as First and Second.
Note: This property only removes prefixes from the parsed data fields and not from pass through fields or reference Id fields.
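One way to picture the prefix removal is the following Python sketch. It is an approximation under stated assumptions: the function remove_common_prefix is hypothetical, prefixes are compared one whole segment at a time, and the last segment of each name is always kept so that no field ends up with an empty name.

```python
def remove_common_prefix(field_names, nesting_char="."):
    """Strip the leading name segments shared by every field in an output."""
    if not field_names:
        return []
    split = [name.split(nesting_char) for name in field_names]
    shortest = min(len(parts) for parts in split)
    common = 0
    # Count leading segments that are identical across all names,
    # always leaving at least the final segment in place.
    while common < shortest - 1 and len({parts[common] for parts in split}) == 1:
        common += 1
    return [nesting_char.join(parts[common:]) for parts in split]


print(remove_common_prefix(["Top.Middle.First", "Top.Middle.Second"]))
# ['First', 'Second']
```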
StructureOutput
Optionally specify the name of the output that receives the structure of the structured data. By default, the structure is written to the Structure output, which contains a record for each of the parsed data fields recognized from the input data source(s). The output contains the following pieces of information:
- The data type of the field.
- How the data was parsed from the input source (normally via the Data Field specified in the XmlData property).
- The hierarchical name of the field in the structured data.
- The name of the field in the output.
- The name of the output in which the field was included.
In general, the output and input name of a field will be the same, unless the RemoveCommonPrefixes property is set to true, or the fields need to be renamed to be written in the BRD format (for example through the use of the SubstituteInvalidCharacters property).
If this property is set, the corresponding output must exist. If this property is not set, but either the DefaultOutput or ErrorsOutput properties are set to 'Structure', then the default of 'Structure' in this property is ignored, and there will be no structure output records.
Note: While the 'Structure' output pin exists by default, this can be renamed. In such cases, unless the StructureOutput property is changed to match the name of one of the outputs, the node will not output any structure records.
ErrorsOutput
Optionally specify the name of the output to which errors will be written. The default value is Errors.
If this property is set, the corresponding output must exist. If this property is not set, but either the DefaultOutput or StructureOutput properties are set to Errors, then the default of Errors in this property is ignored, and there will be no output error records.
DefaultOutput
The node will attempt to map each field in the structured data to an output that bears the name of the field. For fields that do not have the corresponding output, the node will map them to the output that you specify in this property. The default value is Data.
If this optional property is set, the corresponding output must exist. If this property is not set, but either the ErrorsOutput or StructureOutput properties are set to Data, then the default of Data in this property is ignored, and there is no DefaultOutput to handle unmapped fields.
If this property is not specified, and the default output of Data does not exist, then the UnmappedFieldBehavior property is used to determine the action to take when a parsed data field cannot be mapped to any output.
WellFormedXml
Optionally specify whether the node assumes that the document contains well-formed XML data.
The default value is True.
If set to False, the node assumes that the document has no root element and the node will wrap a dummy root element around the document.
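The effect of wrapping a dummy root element can be reproduced with Python's standard xml.etree.ElementTree module. This is a sketch only: the function parse_fragments and the element name "root" are assumptions, and the simple string wrapping shown here would not work for input that starts with an XML declaration.

```python
import xml.etree.ElementTree as ET


def parse_fragments(text, well_formed=True, dummy_root="root"):
    """Parse XML text, optionally wrapping it in a dummy root element first."""
    if not well_formed:
        # A document with several top-level elements is not well formed,
        # so enclose it in a single placeholder element before parsing.
        text = f"<{dummy_root}>{text}</{dummy_root}>"
    return ET.fromstring(text)


fragment = "<row>1</row><row>2</row>"  # two top-level elements, no root
root = parse_fragments(fragment, well_formed=False)
values = [row.text for row in root]
print(values)  # ['1', '2']
```

Parsing the same fragment with well_formed=True raises xml.etree.ElementTree.ParseError, because a standard parser accepts only one top-level element.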
NamespaceAware
Optionally specify whether the node is namespace aware. If set to true, the node does not output namespaces as attributes, and all elements that are prefixed by namespaces are checked to see whether such namespaces are specified in the parent element.
If set to false (the default value), namespace attributes are treated the same as other attributes and are output.
For example, consider the following XML element with namespace attributes:
<myroot date="2013-01-01" xmlns:aop="http://www.springframework.org/schema/aop"
xmlns:aprop="http://www.springframework.org/schema/aprop">
- If the NamespaceAware property is set to true, the node outputs the date attribute.
- If the NamespaceAware property is set to false, the node outputs the date, xmlns:aop and xmlns:aprop attributes.
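The attribute filtering in this example can be sketched in Python as follows. The function output_attributes is hypothetical and covers only the attribute side of the property; it does not attempt the check that namespace prefixes on elements are declared.

```python
def output_attributes(attributes, namespace_aware=False):
    """Return the attributes that would be output for one element."""
    if not namespace_aware:
        return dict(attributes)
    # Namespace declarations are either "xmlns" or start with "xmlns:".
    return {name: value for name, value in attributes.items()
            if name != "xmlns" and not name.startswith("xmlns:")}


attrs = {
    "date": "2013-01-01",
    "xmlns:aop": "http://www.springframework.org/schema/aop",
    "xmlns:aprop": "http://www.springframework.org/schema/aprop",
}
print(list(output_attributes(attrs, namespace_aware=True)))   # ['date']
print(list(output_attributes(attrs, namespace_aware=False)))  # ['date', 'xmlns:aop', 'xmlns:aprop']
```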
Charset
Optionally specify the character set to be used when processing the XML data. For example: ASCII, UTF-8, UTF-16, UTF-32.
The default value depends on the source of the input data that is to be parsed.
If the data comes from an input field of type string, then the server character set is used by default. If the data comes from an input field of type Unicode, then UTF-8 is used by default. If the data comes from a file, the default depends on the XML data that is being processed. If the XML file contains an XML header specifying the character set, then this character set is used. Otherwise, the system default character set is used.
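The default selection described above amounts to a small decision procedure, sketched here in Python. Everything in the sketch is an assumption made for illustration: the function default_charset, the "field" and "file" source labels, the "string" and "unicode" type labels, and the regular expression used to read the encoding from the XML header. The server and system character sets are passed in because their actual values depend on the installation.

```python
import re
from typing import Optional

# Matches an XML declaration at the start of the data and captures its encoding.
_DECL = re.compile(rb'^<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')


def default_charset(source: str, *, field_type: Optional[str] = None,
                    file_head: bytes = b"",
                    server_charset: str, system_charset: str) -> str:
    """Pick the character set used when the Charset property is not set."""
    if source == "field":
        # Unicode input fields default to UTF-8, string fields to the server charset.
        return "UTF-8" if field_type == "unicode" else server_charset
    # File input: prefer the encoding named in the XML header, if there is one.
    match = _DECL.match(file_head)
    return match.group(1).decode("ascii") if match else system_charset


head = b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'
print(default_charset("file", file_head=head,
                      server_charset="US-ASCII", system_charset="UTF-8"))
# ISO-8859-1
```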
CharacterDataOutputFieldType
Optionally specify the type of the character-based (string/Unicode) output fields from the parsed data.
- Auto - If the node parses data from an input field, the character fields that are output from the structured contents will have the same type as the input field. If the node parses data from a file, the node will output character fields from the structured contents as Unicode.
- String - The fields will have string metadata.
- Unicode - The fields will have Unicode metadata.
The default value is Auto.
AlwaysEmitPassThroughFields
Optionally specify whether, even if the parsing of an input record results in no data fields that are to be included in an output, the pass through fields are still written. For instance, if the specified Data Field or Filename Field in the XmlData property is NULL, a record is still written containing the pass through fields.
If set to false (the default value), the pass through fields are output only when there are other records to include in that output with parsed data fields.
InputPrefix
Optionally specify a prefix to add to the pass through fields. The main objective for this property is to resolve the potential conflict where a node generated output field has the same name as an input field that you want to pass through.
For example, the input contains the following fields: EmployeeName, EmployeeAddress, and id. The PassThroughFields property is set to All and the InputPrefix property is set to PassThrough. In this case, the node outputs the following fields: PassThrough.EmployeeName, PassThrough.EmployeeAddress, and PassThrough.id.
ExcludeFieldPaths
Optionally specify field paths within the input data that should not be written to the output.
By default no field paths are excluded. Each field path to be excluded should be specified on a new line.
If, for example, there are fields such as UserRecord.FirstName, UserRecord.LastName, UserRecord.Details.Attribute1 and UserRecord.Details.Attribute2 in the input data, together with several other fields under UserRecord.Details, and you do not want any of the UserRecord.Details fields to be written to the outputs, then this property can be set to exclude UserRecord.Details and all fields under that path.
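A minimal Python sketch of this exclusion is shown below. The function apply_exclusions is hypothetical; it assumes that the property value holds one path per line and that excluding a path also excludes every field beneath it, as the example describes.

```python
def apply_exclusions(field_names, exclude_property, nesting_char="."):
    """Drop fields that equal, or sit beneath, any excluded field path."""
    # One field path per line; blank lines are ignored.
    paths = [line.strip() for line in exclude_property.splitlines() if line.strip()]

    def excluded(name):
        return any(name == path or name.startswith(path + nesting_char)
                   for path in paths)

    return [name for name in field_names if not excluded(name)]


fields = [
    "UserRecord.FirstName",
    "UserRecord.LastName",
    "UserRecord.Details.Attribute1",
]
print(apply_exclusions(fields, "UserRecord.Details"))
# ['UserRecord.FirstName', 'UserRecord.LastName']
```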
AttributeElementConflictBehavior
Optionally specify what to do when there is a naming conflict between an attribute and an element, for example in the following XML, on the "example" attribute and element:
<document example="1">
  <example>2</example>
</document>
Choose from:
- Prefix Attribute - The attribute is renamed to "a_<name>"
- Prefix Element - The element is renamed to "e_<name>"
- Error - The node errors.
The default value is Error.
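The three options can be illustrated with the XML above and a short Python sketch. The function resolve_conflicts is hypothetical; it handles a single element with attributes and direct child elements, which is enough to show the renaming.

```python
import xml.etree.ElementTree as ET

BEHAVIORS = ("Prefix Attribute", "Prefix Element", "Error")


def resolve_conflicts(element, behavior="Error"):
    """Return {field_name: value} for an element's attributes and child elements."""
    if behavior not in BEHAVIORS:
        raise ValueError(f"unknown behavior: {behavior}")
    attributes = dict(element.attrib)
    elements = {child.tag: child.text for child in element}
    fields = {}
    for name, value in attributes.items():
        if name in elements:
            if behavior == "Error":
                raise ValueError(f"attribute and element are both named {name!r}")
            if behavior == "Prefix Attribute":
                name = "a_" + name
        fields[name] = value
    for name, value in elements.items():
        if name in attributes and behavior == "Prefix Element":
            name = "e_" + name
        fields[name] = value
    return fields


doc = ET.fromstring('<document example="1"><example>2</example></document>')
print(resolve_conflicts(doc, "Prefix Attribute"))  # {'a_example': '1', 'example': '2'}
print(resolve_conflicts(doc, "Prefix Element"))    # {'example': '1', 'e_example': '2'}
```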
NoRecordForOutputBehavior
Optionally specify the behavior of the node when the imported XML data cannot be mapped to any output. Choose from:
- Error - The node errors and stops processing.
- Log - The node logs a message and continues processing.
- Ignore - The node continues processing.
The default value is Error.
Also note that in the cases of Log and Ignore, if there are no pass through fields because either there is no input, or there are no fields to pass through, the node cannot set up the output metadata, and so it will throw an error and stop processing.
PassThroughFieldConflictBehavior
Optionally specify the behavior of the node when there are parsed data fields (that is, data that has been manipulated by the node on import) which conflict with pass through fields (data that has been imported and is ready to output without further manipulation) on any given output. Choose from:
- Use PassThrough Field - The pass through field from the input is output. The parsed data field is not output.
- Use Data Field - The parsed data field is output. The pass through field from the input is not output.
- Error - The node errors and stops processing.
The default value is Error.
UnmappedFieldBehavior
Optionally specify the behavior of the node when there are parsed data fields (that is, data that has been manipulated by the node on import) that cannot be mapped to any output, and there is no default output (see the DefaultOutput property) that collects all such fields. Note: If the default output exists, this situation will not occur. Choose from:
- Error - The node errors and stops processing.
- Log - The node logs the situation and continues processing.
- Ignore - The node ignores the situation and continues processing.
The default value is Error.
NullValueBehavior
Optionally specify what to do if the Data Field or Filename Field specified in the XmlData property is null in any of the input records. Choose from:
- Error - The node errors when a null record is found.
- Log - The node logs the situation and continues processing.
- Ignore - The node ignores the situation and continues processing.
The default value is Error.
ErrorThreshold
Optionally specify the number of transfer errors that will cause the node to give up and fail.
Each record on the input pin is a "request". A transfer error is any error that causes a request to fail (e.g. a requested file does not exist). Setting this property instructs the node to continue processing requests as long as the number of errors remains below the given threshold.
An ErrorThreshold of 0 means that the node never fails on a transfer error (the node will still fail on more serious errors). The default value is 1; that is, the node fails on the first error encountered.
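The threshold semantics can be sketched as a simple loop in Python. The names process_requests and fake_fetch are hypothetical, and the sketch treats OSError (for example, a missing file) as the only kind of transfer error.

```python
def process_requests(requests, fetch, error_threshold=1):
    """Process each request, failing once the error count reaches the threshold."""
    results, errors = [], []
    for request in requests:
        try:
            results.append(fetch(request))
        except OSError as exc:  # assumed stand-in for a "transfer error"
            errors.append((request, exc))
            # A threshold of 0 means transfer errors never stop processing.
            if error_threshold > 0 and len(errors) >= error_threshold:
                raise RuntimeError(
                    f"error threshold of {error_threshold} reached") from exc
    return results, errors


def fake_fetch(name):
    """Stand-in for reading a file: names starting with 'missing' fail."""
    if name.startswith("missing"):
        raise FileNotFoundError(name)
    return f"<data from {name}>"


results, errors = process_requests(
    ["a.xml", "missing1.xml", "b.xml"], fake_fetch, error_threshold=0)
print(len(results), len(errors))  # 2 1
```

With the default threshold of 1, the same call raises RuntimeError as soon as missing1.xml fails; with a threshold of 2 it completes, because only one error occurs.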
SubstituteInvalidCharacters
There are some reserved characters which cannot appear in metadata. This optional property defines what to do if invalid characters appear in the input data and are to be used in the record metadata.
In certain cases, the same field may be present in the data with different types in different parts of the data. In these cases, there is no way to output the data in the type inferred from the input data without coercion.
- If set to true, invalid characters are replaced with characters that are acceptable in BRD metadata.
- If set to false (the default value), the node errors and stops processing.
MissingExcludeFieldBehavior
Optionally specify the behavior of the node when the ExcludeFieldPaths property contains field paths that do not exist in the input data to be parsed.
Choose from:
- Error - The node errors and stops processing.
- Log - The node logs the situation and continues processing.
- Ignore - The node ignores the situation and continues processing.
The default value is Log.
Inputs and outputs
Inputs: Multiple optional (input fields).
Outputs: Data, Structure, Errors, multiple optional.