XML Data - Data360_Analyze - Latest

Data360 Analyze Server Help

Product type
Software
Portfolio
Verify
Product family
Data360
Product
Data360 Analyze
Version
Latest
Language
English
Product name
Data360 Analyze
Title
Data360 Analyze Server Help
Copyright
2024
First publish date
2016
Last updated
2024-11-28
Published on
2024-11-28T15:26:57.181000

Imports XML data from an input data source into tabular data.

To run the XML Data node, set the source of the data that you want to import in the XmlData property. You can import data from a specified file name or input field. If you are processing data from a file, choose the (from Filename) variant of the XmlData property to specify the name of the file containing XML data. If you are processing data from an input field, the input field can either contain the file names of the data to process, or it can contain the data itself within the field. Choose the (from Filename Field) variant of the XmlData property to specify the name of an input field containing XML file names, or choose the (from Data Field) variant of the XmlData property to specify the name of the input field containing XML data.

Note: If you are processing data from multiple input records, where the structure of the data is not identical across all of the data sources, the node performs an implicit union of all fields in the data structure.

For certain data structures, you might need to configure additional properties. If this is the case, and you try to run the node without setting additional properties, an error message will inform you which properties you need to set.

The node will attempt to map each field in the structured data to an output with the same name as the field. There must be at least three outputs on the node, however you can modify the names and corresponding purpose of the outputs. By default, the XML Data node has the following three output pins:

  • Data - Outputs all parsed data fields which do not have a corresponding output with the same name.Corresponds to the optional DefaultOutput property, if specified. By default, the DefaultOutput property is Data. If you rename the Data output pin, specify the new name in the DefaultOutput property.
  • Structure - Outputs information about the structure of the parsed data. Corresponds to the optional StructureOutput property, if specified. By default, the StructureOutput property is Structure. If you rename the Structure output pin, specify the new name in the StructureOutput property.
  • Errors - Outputs any errors that occur during parsing. Corresponds to the optional ErrorsOutput property. By default, the ErrorsOutput property is Errors. If you rename the Errors output pin, specify the new name in the ErrorsOutput property.

For any output that is not specified as a DefaultOutput, StructureOutput or ErrorsOutput, the node attempts to map parsed data fields to an output based on the name of the output. Each parsed data field is only included in at most one output. The node first attempts to find the "standard data output" (that is, an output when none are specified in the DefaultOutput, StructureOutput or ErrorsOutput properties) with the longest prefix which matches the prefix of the parsed data field. If no such standard data outputs exist, and there is a DefaultOutput, the field is output to the DefaultOutput. If there is no DefaultOutput specified, the parsed data field is not included in any output, and the behavior of the node depends on the UnmappedFieldBehavior property.

Example - Extracting specific XML elements

You have the following country data.xml file containing some general metadata, along with country demographic information:

XML Data input

  1. Select the XML Data node and in the XmlData (from Filename) property, navigate to the country data.xml file.
  2. Define a new output pin called Data.Country.

    The node now has the following outputs:

    • Data - Referenced in the DefaultOutput property by default.
    • Structure - Referenced in the StructureOutput property by default.
    • Errors - Referenced in the ErrorsOutput property by default.
    • Data.Country - Not referenced by any property, considered a "standard data output".
  3. Run the node.

    The Data output contains the Data.Metadata.timestamp and Data.Metadata.sequenceNumber fields because none of the other outputs match the name of these fields.

    The Structure output specifies all input fields parsed from the input data, and identifies in which of the node outputs these fields have been included:

    XML Data output

    The Errors output contains information about any errors that occurred during parsing.

    The Data.Country output contains the following Data.Country.<value> fields because "Data.Country" matches the prefix of these fields:

    XML Data output

  4. Rename the Data.Country output pin to Data.Country.City, then save the data flow.
  5. Re-run the node.

    The Data.Country.City output contains only the Data.Country.City.<value> fields because "Data.Country.City" matches the prefix of these fields:

    XML Data output

    All other fields (Data.Metadata.timestamp and Data.Metadata.sequenceNumber fields, and the Data.Country.<value> fields) are output on the Data pin.

Tip: If you had two custom output pins on the node, one named Data.Country and one named Data.Country.City, the data would be output to the longest matching prefix. This means that the Data.Country.<value> fields would be output on the Data.Country pin and the Data.Country.City.<value> fields would be output on the Data.Country.City pin.

Alternatively, to output only the Data.Country.City.<value> fields, and ignore all other fields:

  1. Delete the custom Data.Country.City output pin.
  2. Rename the default Data output pin to Data.Country.City.
  3. Set the UnmappedFieldBehavior property to Ignore, then save the data flow.
  4. Re-run the node.

    The Data.Country.City output contains only the Data.Country.City.<value> fields:

    XML Data output

Properties

XmlData

Specify the XML data that you want to import.

  • Choose the (from Filename) variant of this property to specify the name of the file containing XML data.
  • Choose the (from Filename Field) variant of this property to specify the name of an input field containing XML file names.
  • Choose the (from Data Field) variant of this property to specify the name of the input field containing XML data.

A value is required for this property.

Note: If you are using the server edition and wish to use a file on your local machine, you must first upload it to the Data360 Analyze server. See Upload data files for more details.

OutputNestingCharacter

Optionally specify the character to use to identify hierarchical relationships in the output fields. This applies to both the field names in the output metadata and to the mapping of fields to output pins. The default is ".".

For example, if the input data contains a field named Country which contains a sub-field named Population, and the OutputNestingCharacter property is set to ".", then the sub-field would be output as Country.Population. Similarly, to map this field to a specific output, set the name of the output to Country.Population.

OutputReferenceIds

Optionally specify whether reference identifiers are included in the output.

The default value is True.

This property only has effect when more than one output is present and receiving data from the data source. Where hierarchical data is being flattened to multiple tabular outputs, the reference identifiers can be used to identify how the data in the different outputs is related.

You can use these identifiers in subsequent join nodes, if required, to reassemble the data or identify the relationships between the different outputs.

PassThroughFields

Optionally specify which input fields will "pass through" the node unchanged from the input to the output, assuming that the input exists. The input fields specified will appear on those output records which were produced as a result of the input fields. Choose from:

  • All - Passes through all input data fields to the output.
  • None - No input data fields are passed through to the output; as such, only the fields created by the node appear on the output.
  • Used - Passes through all fields that the node used to create the output, including any input field referenced by a property, such as the Filename Field or Data Field if specified in the XmlData property.
  • Unused (default) - Passes through all fields that the node did not use to create the output.

The default value is Unused.

Note: The pass through fields are only included in outputs which receive parsed data. Pass through fields are not included in the Structure or Errors outputs.

If, for a given input record, the only data to be output is the pass through fields, whether or not these pass through fields are output depends on setting specified in the AlwaysEmitPassThroughFields property.

RemoveCommonPrefixes

Optionally specify whether the node attempts to rename the output fields by removing common prefixes.

The default value is False.

For example, in an output, where the only fields from the parsed data to be output are Top.Middle.First and Top.Middle.Second, if this property is set to True, then the fields will be output as First and Second.

Note: This property only removes prefixes from the parsed data fields and not from pass through fields or reference Id fields.

StructureOutput

Optionally specify the name of the output to include the structure of the structured data, the default is to output the structure to the Structure output which contains a record for each of the parsed data fields recognized from the input data source(s). The output will contain the following pieces of information:

  • The data type of the field.
  • How the data was parsed from the input source (normally via the Data Fieldspecified in the XmlData property).
  • The hierarchical name of the field in the structured data.
  • The name of the field in the output.
  • The name of the output in which the field was included.

In general, the output and input name of a field will be the same, unless the RemoveCommonPrefixes property is set to true, or the fields need to be renamed to be written in the BRD format (for example through the use of the SubstituteInvalidCharacters property).

If this property is set, the corresponding output must exist. If this property is not set, but either the DefaultOutput or ErrorsOutput properties are set to 'Structure', then the default of 'Structure' in this property is ignored, and there will be no structure output records.

Note: While the 'Structure' output pin exists by default, this can be renamed. In such cases, unless the StructureOutput property is changed to match the name of one of the outputs, the node will not output any structure records.

ErrorsOutput

Optionally specify the name of the output to which errors will be written. The default value is Errors.

If this property is set, the corresponding output must exist. If this property is not set, but either the DefaultOutput or StructureOutput properties are set to Errors, then the default of Errors in this property is ignored, and there will be no output error records.

Note: While the Errors output pin exists by default, this can be renamed. In such cases, unless the ErrorsOutput property is changed to match the name of one of the outputs, the node will not output any error records.

DefaultOutput

The node will attempt to map each field in the structured data to an output that bears the name of the field. For fields that do not have the corresponding output, the node will map them to the output that you specify in this property. The default value is Data.

If this optional property is set, the corresponding output must exist. If this property is not set, but either the ErrorsOutput or StructureOutput properties are set to Data, then the default of Data in this property is ignored, and there is no DefaultOutput to handle unmapped fields.

Note: While the Data output pin exists by default, this can be renamed. In such cases, unless the DefaultOutput property is changed to match the name of one of the outputs, the node will not have a DefaultOutput to handle unmapped fields.

If this property is not specified, and the default output of Data does not exist, then the UnmappedFieldBehavior property is used to determine the action to take when a parsed data field cannot be mapped to any output.

WellFormedXml

Optionally specify whether the node assumes that the document contains well-formed XML data.

The default value is True.

If set to False, the node assumes that the document has no root element and the node will wrap a dummy root element around the document.

NamespaceAware

Optionally specify whether the node does not output namespaces as attributes and all elements that are prefixed by namespaces are checked to see if such namespaces are specified in the parent element.

If set to false (the default value), namespace attributes are treated the same as other attributes and are output.

For example, consider the following XML element with namespace attributes:

<myroot date="2013-01-01" xmlns:aop="http://www.springframework.org/schema/aop" xmlns:aprop="http://www.springframework.org/schema/aprop">

  • If the NamespaceAware property is set to true, the node outputs the date attribute.
  • If the NamespaceAware property is set to false, the node outputs the date, xmlns:aop and xmlns:aprop attributes.
Note: If this property is set to false, the SubstituteInvalidCharacters property will normally need to be set to true as the colon (":") character is not a valid metadata character.

Charset

Optionally specify the character set to be used when processing the XML data. For example: ASCII, UTF-8, UTF-16, UTF-32.

The default value depends on the source of the input data that is to be parsed.

If the data comes from an input field of type string, then the server character set is used by default. If the data comes from an input field of type Unicode, then UTF-8 is used by default. If the data comes from a file, the default depends on the XML data that is being processed. If the XML file contains an XML header specifying the character set, then this character set is used. Otherwise, the system default character set is used.

CharacterDataOutputFieldType

Optionally specify the type of the character-based (string/Unicode) output fields from the parsed data.

Note: This property does not affect the types of pass through fields. It also does not affect fields within the parsed data that have an inferred type which is not character-based unless the CoerceData property is set to something other than Never. Choose from:
  • Auto - If the node parses data from an input field, the character fields that are output from the structured contents will have the same type as the input field. If the node parses data from a file, the node will output character fields from the structured contents as Unicode.
  • String - The fields will have string metadata.
  • Unicode - The fields will have Unicode metadata.

The default value is Auto.

AlwaysEmitPassThroughFields

Optionally specify whether, even if the parsing of an input record results in no data fields that are to be included in an output, the pass through fields are still written. For instance, if the specified Data Field or Filename Field in the XmlData property is NULL, a record is still written containing the pass through fields.

If set to false (the default value), the pass through fields are output only when there are other records to include in that output with parsed data fields.

InputPrefix

Optionally specify a prefix to add to the pass through fields. The main objective for this property is to resolve the potential conflict where a node generated output field has the same name as an input field that you want to pass through.

Note: Such a conflict will not occur if you select None in the PassThroughFields property. In the absence of field name conflict, you may still want to highlight the pass through input fields by giving them a prefix. The default is to have no prefix.

For example, the input contains the following fields: EmployeeName, EmployeeAddress, and id. The PassThroughFields property is set to All and the InputPrefix property is set to PassThrough. In this case, the node outputs the following fields: PassThrough.EmployeeName, PassThrough.EmployeeAddress, PassThrough.Id.

ExcludeFieldPaths

Optionally specify field paths within the input data, which should not be written to the output.

By default no field paths are excluded. Each field path to be excluded should be specified on a new line.

If, for example, there are fields such as UserRecord.FirstName, UserRecord.LastName, UserRecord.Details.Attribute1 and UserRecord.Details.Attributes2 in the input data, together with several other fields under UserRecord.Details, and you do not want any of the UserRecord.Details fields to be written to the outputs, then this property can be set to exclude UserRecord.Details and all fields under that path.

Note: If any fields nested within a referenced exclude path have corresponding and explicitly named outputs, then those fields will still be written to that output.

AttributeElementConflictBehavior

Optionally specify what to do when there is a naming conflict between an attribute and an element, for example in the following XML, on the "example" attribute and element:

<document example="1"> <example>2</example> </document>

Choose from:

  • Prefix Attribute - The attribute is renamed to "a_<name>"
  • Prefix Element - The element is renamed to "e_<name>"
  • Error - The node errors.

The default value is Error.

NoRecordForOutputBehavior

Optionally specify the behavior of the node when the imported JSON data cannot be mapped to any output. Choose from:

  • Error - The node errors and stops processing.
  • Log - The node logs a message and continues processing.
  • Ignore - The node continues processing.

The default value is Error.

Note: This property specifies the behavior when the parsed data (that is, data that has been manipulated by the node on import, for example imported JSON data) cannot be mapped. Therefore, even if there are pass through fields (for example non-JSON data that has been imported and is ready to output without further manipulation), if parsed data cannot be mapped to the output, then if Error is selected, the node still errors.

Also note that in the cases of Log and Ignore, if there are no pass through fields because either there is no input, or there are no fields to pass through, the node cannot set up the output metadata, and so it will throw an error and stop processing.

PassThroughFieldConflictBehavior

Optionally specify the behavior of the node when there are parsed data fields (that is, data that has been manipulated by the node on import) which conflict with pass through fields (data that has been imported and is ready to output without further manipulation) on any given output. Choose from:

  • Use PassThrough Field - The pass through field from the input is output. The parsed data field is not output.
  • Use Data Field - The parsed data field is output. The pass through field from the input is not output.
  • Error - The node errors and stops processing

The default value is Error.

UnmappedFieldBehavior

Optionally specify the behavior of the node when there are parsed data fields (that is, data that has been manipulated by the node on import) that cannot be mapped to any output, and there is no default output (see the DefaultOutput property) that collects all such fields. Note: If the default output exists, this situation will not occur. Choose from:

  • Error - The node errors and stops processing.
  • Log - The node logs the situation and continues processing.
  • Ignore - The node ignores the situation and continues processing.

The default value is Error.

NullValueBehavior

Optionally specify what to do if the Data Field or Filename Field specified in the XmlData property is null in any of the input records. Choose from:

  • Error - The node errors when a null record is found.
  • Log - The node logs the situation and continues processing.
  • Ignore - The node ignores the situation and continues processing.

The default value is Error.

Note: This property only affects the Data Field or Filename Field specified in the XmlData property. Other input fields are not affected.

ErrorThreshold

Optionally specify the number of transfer errors that will cause the node to give up and fail.

Each record on the input pin is a "request". A transfer error is any error that causes a request to fail (e.g. a requested file does not exist). Setting this property instructs the node to continue processing requests as long as the number of errors remains below the given threshold.

An ErrorThreshold of 0 means never fail on a transfer error (the node will still fail on more serious errors). The default value is 1 i.e. the node fails on the first error encountered.

SubstituteInvalidCharacters

There are some reserved characters which cannot appear in metadata. This optional property defines what to do if invalid characters appear in the input data and are to be used in the record metadata.

In certain cases, the same field may be present in the data with different types in different parts of the data. In these cases, there is no way to output the data in the type inferred from the input data without coercion.

  • If set to true, invalid characters are substituted for acceptable BRD metadata characters.
  • If set to false (the default value), the node errors and stops processing.
Note: This only affects reserved characters in the metadata, such as colon (":") and newlines ("\n"). It is also required that the metadata be in the same character set as the server's character set. If this property is set to true, the node does not attempt to perform any special substitution operations on characters from a different character set that cannot be mapped to the character set of the metadata.

MissingExcludeFieldBehavior

Optionally specify the behavior of the node when there are field paths in the ExcludeFieldPaths property, which do not exist in the input data to be parsed.

Choose from:

  • Error - The node errors and stops processing.
  • Log - The node logs the situation and continues processing.
  • Ignore - The node ignores the situation and continues processing.

The default value is Log.

Inputs and outputs

Inputs: Multiple optional (input fields).

Outputs: Data, Structure, Errors, multiple optional.