The Split node allows you to split a String field into multiple fields using Regular Expressions.
Doing so requires four input parameters:
- Field to Split - This is the String field you would like to split into multiple fields.
- Number of Fields - This is the number of fields that the Field to Split is divided into, which is also the number of new fields added to your data set.
- Field Names - These are the names of the new fields created by the split. Names default to Field1, Field2, Field3, etc., but can be changed by double-clicking.
- Regular Expression Pattern - This is the RegExp you want to use to split your field. This is the most crucial component of the Split node, as it is what actually does the work of splitting the field. As such, it is covered in more detail below.
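As a rough mental model only, the four parameters can be pictured as the inputs to a plain Java split. The sketch below is not the node's implementation: the helper name `split` is hypothetical, the Number of Fields is taken to be the length of the `fieldNames` array, and the handling of missing or extra pieces is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SplitParametersSketch {
    // Hypothetical helper: maps the Split node inputs onto String.split.
    // fieldNames.length plays the role of "Number of Fields".
    static Map<String, String> split(String fieldToSplit, String[] fieldNames, String regex) {
        String[] pieces = fieldToSplit.split(regex);
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.length; i++) {
            // Assumption: a missing piece becomes an empty string; extra pieces are dropped.
            result.put(fieldNames[i], i < pieces.length ? pieces[i] : "");
        }
        return result;
    }

    public static void main(String[] args) {
        // "\\s" in Java source is the regex \s (any whitespace character).
        System.out.println(split("Ernest Hemingway", new String[] {"firstName", "lastName"}, "\\s"));
    }
}
```

Run as written, this prints `{firstName=Ernest, lastName=Hemingway}`.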
Split node Regular Expression Patterns
The Split node follows the rules of Regular Expressions in Java. This means that splitting is not always simply a matter of specifying the characters with which you want to split.
For example, if you had a name field that you wanted to split into a firstName field and a lastName field, you could split using the Space character, by pressing the Space bar once, or you could use the more formal RegExp version of Space, \\s.
Either case would result in the following:
name | firstName | lastName |
---|---|---|
Ernest Hemingway | Ernest | Hemingway |
Split Node Example 1 - Splitting on a Space
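Because the node follows Java's Regular Expression rules, Example 1 can be approximated in plain Java. This is a minimal sketch that assumes `String.split` behaves like the node for this input. The class and method names are hypothetical.

```java
import java.util.Arrays;

final class SplitOnSpaceSketch {
    // Splits a full name into its pieces using the given Java regex.
    static String[] splitName(String name, String regex) {
        return name.split(regex);
    }

    public static void main(String[] args) {
        // A literal space and the whitespace class \s give the same result here.
        // Inside a Java string literal the backslash is doubled, hence "\\s".
        System.out.println(Arrays.toString(splitName("Ernest Hemingway", " ")));
        System.out.println(Arrays.toString(splitName("Ernest Hemingway", "\\s")));
    }
}
```

Both calls print `[Ernest, Hemingway]`. Note that \\s also matches tabs and other whitespace characters, so the two patterns differ on input that contains them.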
Some cases require more special treatment, however. For example, if you wanted to parse a URL using the period character, and you simply used the period character . as your Regular Expression Pattern, you would find that the new fields come back empty.
url | protocolPrefix | companyName | worldWideSuffix |
---|---|---|---|
http://www.infogix.com | | | |
Split Node Example 2 - A RegExp That Doesn't Work
This is because in Regular Expressions the period . is a special character that matches any character, so it needs to be escaped to be treated as a literal period. In Data360 DQ+, special RegExp characters should be escaped using a double backslash \\
Escaping the period this way gives the pattern \\. which properly splits our URL field into the three fields shown below.
url | protocolPrefix | companyName | worldWideSuffix |
---|---|---|---|
http://www.infogix.com | http://www | infogix | com |
Split Node Example 3 - Splitting on \\.
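The difference between the unescaped and escaped period can be reproduced in plain Java. This is a sketch under the assumption that `String.split` mirrors the node's behavior for this input. The class and method names are hypothetical.

```java
import java.util.Arrays;

final class SplitOnPeriodSketch {
    // Splits a URL into its pieces using the given Java regex.
    static String[] splitUrl(String url, String regex) {
        return url.split(regex);
    }

    public static void main(String[] args) {
        String url = "http://www.infogix.com";
        // Unescaped: "." matches every character, so every piece between
        // matches is empty and String.split discards them all.
        System.out.println(Arrays.toString(splitUrl(url, ".")));
        // Escaped: "\\." in Java source is the regex \. and matches only a literal period.
        System.out.println(Arrays.toString(splitUrl(url, "\\.")));
    }
}
```

The first call prints `[]` and the second prints `[http://www, infogix, com]`, matching Examples 2 and 3.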
Data360 DQ+ follows the Java Regular Expression standard. The full pattern syntax is documented here:
https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html