You can use the Transformation Builder to manually generate a schema.
Adding the initial struct
Starting with a blank schema, you should first add a Struct field. This field will contain all the fields of your schema content.
- Click Add.
- Select Field from the drop-down list. The Add Transformation dialog is displayed.
- Leave Source Name blank.
- For Source Data Type, choose "Struct".
- Click Save.
Adding fields
Next you need to add fields and arrays, depending on the structure of your content.
- Select the Struct you created above.
- Click Add.
- Select Field from the drop-down list. The Add Transformation dialog is displayed.
- Enter the Source Name.
- Select the Source Data Type from the drop-down list. For information about the available types, see Source Data Type.
- Using the Output Transform Action property, select whether to "Copy" the source to the output, or "Ignore" the field.
- If the Source Data Type is "Number", use the Output Field Type property to define the number type of the output field. The available options are "Big Integer", "Decimal", "Floating Point", and "Integer".
- Check the Output Field Name. It is filled in automatically when you select the Source Data Type, with the same value as the Source Name.
- Click Save. The Add Transformation dialog closes, and the Edit Transformations window is updated, with the new field added as part of the Struct.
- Click OK to close the schema and transformations builder.
- Click Accept Changes.
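The result of the steps above is a single Struct that contains the fields you added. As a rough mental model only, the following Python sketch represents such a schema as plain data. The `Field` class, its attribute names, and the sample field names are hypothetical and chosen for illustration; they are not the product's storage format or API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    """Hypothetical stand-in for one row in the Edit Transformations window."""
    source_name: str                          # blank for the top-level Struct
    source_data_type: str                     # e.g. "Struct", "String", "Number"
    action: str = "Copy"                      # "Copy" or "Ignore"
    output_field_type: Optional[str] = None   # only meaningful for "Number"
    output_field_name: Optional[str] = None
    children: List["Field"] = field(default_factory=list)

    def __post_init__(self):
        # Assumption: the output name defaults to the source name.
        if self.output_field_name is None:
            self.output_field_name = self.source_name


# Step 1: the initial Struct, with Source Name left blank.
root = Field(source_name="", source_data_type="Struct")

# Step 2: fields added to the Struct.
root.children.append(Field("customer_id", "Number", output_field_type="Integer"))
root.children.append(Field("name", "String"))
root.children.append(Field("internal_notes", "String", action="Ignore"))


def output_names(struct: Field) -> List[str]:
    """Names of the child fields that are copied to the output."""
    return [c.output_field_name for c in struct.children if c.action == "Copy"]


print(output_names(root))  # ['customer_id', 'name']
```

In this model, a field set to "Ignore" stays in the schema but does not appear in the output.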
Editing fields
To edit an existing field:
- Select the corresponding row for the field in the table.
- Click Edit in the header bar. The Edit field Transformation dialog is displayed.
- Modify the transformation from the source field to its output by changing the value of one or more options. Note that some options might be disabled.
- Click Save.
- Click OK in the Edit Transformations window to confirm your changes.
Transformation options
Source Name
The name of the field in the source data.
Source Data Type
The Schema and Transformation Builder supports the following data types.
- Array - can be an array of elements of any of the other supported types, except Map. Arrays of mixed type are not supported. The transformation builder can handle arrays in a number of ways. You can copy an array, or explode the contents, choosing to include or exclude null values. For more information, see Array handling in the transformation builder.
- Boolean
- Date
- DateTime
- Map
- Number - can be Big Integer, Decimal, Floating Point, or Integer.
- String
- Struct
Notes:
- If you are using the transformation builder in a JSON Parser node, a Source Data Type of Number is available, and is selected as the default for numeric fields in the sample JSON data that you provide. You can choose a specific numeric type for the output field by selecting a numeric Output Field Type.
- If you add a non-array field to a schema, you cannot later change the field data type to an array type by editing the field. You will need to add a new field and delete the existing field.
- For Hortonworks (which is on an older version of Spark), when the transform action for a field of type "Array (Struct)" is set to "Copy", no child at any level in the field can be set to "Ignore".
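To make the type list more concrete, here is a minimal sketch of how sample JSON values could be matched to Source Data Type labels. It is an illustration under stated assumptions, not the logic the JSON Parser node actually uses: the function name `infer_source_type` is hypothetical, JSON objects are assumed to map to Struct rather than Map, and Date and DateTime are not detected because JSON has no date type.

```python
import json


def infer_source_type(value):
    """Return a Source Data Type label for a parsed JSON value (illustrative only)."""
    if value is None:
        return None       # a JSON null carries no type information
    # bool must be tested before int/float: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "Boolean"
    if isinstance(value, (int, float)):
        return "Number"
    if isinstance(value, str):
        return "String"   # dates arrive as strings in JSON
    if isinstance(value, dict):
        return "Struct"   # assumption: objects are treated as Struct, not Map
    if isinstance(value, list):
        element_types = {infer_source_type(v) for v in value if v is not None}
        if len(element_types) > 1:
            raise ValueError("arrays of mixed type are not supported")
        if not element_types:
            return "Array"  # empty or all-null array: element type unknown
        return f"Array ({element_types.pop()})"
    raise TypeError(f"unexpected value: {value!r}")


sample = json.loads(
    '{"id": 7, "active": true, "tags": ["a", "b"], "address": {"city": "Leeds"}}'
)
for name, value in sample.items():
    print(name, "->", infer_source_type(value))
```

An array such as `[1, "a"]` raises an error in this sketch, mirroring the rule that arrays of mixed type are not supported.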
Output Transform Action
The available Output Transform Actions depend on the Source Data Type value.
When the Source Data Type is an array, you can Copy the field into the output as an array, Ignore the field and exclude it from the output, or choose an Explode option to generate multiple output rows. For more information, see Array handling in the transformation builder.
When the Source Data Type is a non-array value, you can choose to Copy the field into the output, or Ignore the field and exclude it from the output.
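The difference between these actions is easiest to see on a few rows. The sketch below is a plain-Python illustration, not the product's implementation. It assumes that the explode options behave like the Spark SQL functions `explode` and `explode_outer`: one output row per array element, with rows whose array is null or empty either dropped or kept with a null value. The action labels "Explode" and "Explode Outer" and the function name `apply_array_action` are assumptions made for this example.

```python
def apply_array_action(rows, field_name, action):
    """Apply an Output Transform Action to one array field of each row (a dict)."""
    out = []
    for row in rows:
        values = row.get(field_name)
        rest = {k: v for k, v in row.items() if k != field_name}
        if action == "Copy":
            out.append(dict(row))        # keep the array as it is
        elif action == "Ignore":
            out.append(rest)             # drop the field from the output
        elif action in ("Explode", "Explode Outer"):
            if values:
                # One output row per array element.
                out.extend({**rest, field_name: v} for v in values)
            elif action == "Explode Outer":
                # Null or empty array: keep the row, with a null value.
                out.append({**rest, field_name: None})
            # Plain "Explode" drops rows whose array is null or empty.
        else:
            raise ValueError(f"unknown action: {action}")
    return out


rows = [{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": None}]

print(apply_array_action(rows, "tags", "Ignore"))
print(apply_array_action(rows, "tags", "Explode"))
print(apply_array_action(rows, "tags", "Explode Outer"))
```

With this input, "Explode" produces two rows (both for `id` 1), while "Explode Outer" produces three, because the row with the null array is kept.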
Related Field
This option is available only when the Source Data Type is an array, and the Output Transform Action is "Pos Explode" or "Pos Explode Outer".
Output Field Type
You can select an Output Field Type only when the Source Data Type is "Number" or "Array (Number)". Use this option to provide a specific numeric type for an output field.
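As a loose analogy, choosing an Output Field Type resembles converting a generic number to a specific numeric type. The Python sketch below shows the idea for a single Number and for an Array (Number). The `convert_number` function and the mapping to Python types are assumptions for illustration; the actual output types, ranges, and rounding behavior are determined by the product and are not modeled here.

```python
from decimal import Decimal


def _to_int(value):
    # Assumption: reject non-integral input rather than guess a rounding rule.
    if isinstance(value, float) and not value.is_integer():
        raise ValueError(f"{value!r} is not a whole number")
    return int(value)


# Hypothetical mapping from Output Field Type to a Python conversion.
CONVERTERS = {
    "Big Integer": _to_int,
    "Integer": _to_int,
    "Floating Point": float,
    # Going through str() keeps the decimal digits as written, e.g. 0.1.
    "Decimal": lambda v: Decimal(str(v)),
}


def convert_number(value, output_field_type):
    """Convert a number, or a list of numbers, to the chosen output type."""
    if value is None:
        return None
    if isinstance(value, list):          # the "Array (Number)" case
        return [convert_number(v, output_field_type) for v in value]
    return CONVERTERS[output_field_type](value)


print(convert_number(42, "Floating Point"))     # 42.0
print(convert_number(0.1, "Decimal"))           # 0.1
print(convert_number([1, 2.0, None], "Integer"))  # [1, 2, None]
```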
Output Field Name
The name of the field in the output.