Regardless of how different nodes operate, the Data360 Analyze Script that is used to configure them works the same way. For more information on the various aspects of the Data360 Analyze Script language, see the following sections.
Grammar and syntax
Data360 Analyze Script must be single-byte ASCII, limited to printable characters (33-126) plus standard whitespace characters: space, tab, carriage return, and linefeed (32, 9, 13, 10). All whitespace is equivalent. Data360 Analyze Script consists of a series of expressions and statements. Expressions have a value. They include literals, variables, and combinations. An expression can be used anywhere a value is required. Statements generally do not have values. They include variable declarations, output definitions, and procedural extensions. Statements are limited in where they may be used; some must be top-level, while some must be nested within other statements.
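The character-set rule above can be checked mechanically. The following Python sketch is illustrative only: `find_invalid_chars` is a hypothetical helper, not part of Data360 Analyze, and it simply reports any character outside the printable range 33-126 and the four permitted whitespace codes.

```python
# Hypothetical helper, not part of Data360 Analyze: flags characters that
# fall outside the set the text above allows in a script.
ALLOWED_WHITESPACE = {9, 10, 13, 32}  # tab, linefeed, carriage return, space


def find_invalid_chars(script_text):
    """Return (offset, character) pairs for characters outside the allowed set."""
    invalid = []
    for offset, ch in enumerate(script_text):
        code = ord(ch)
        if not (33 <= code <= 126 or code in ALLOWED_WHITESPACE):
            invalid.append((offset, ch))
    return invalid


ok_script = "today = date() # comment after Script\n"
bad_script = "emit foo as \u201coutfield\u201d"  # curly quotes are not ASCII

print(find_invalid_chars(ok_script))        # []
print(len(find_invalid_chars(bad_script)))  # 2
```

A check like this is mainly useful for catching typographic quotes or dashes pasted in from word processors.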
Simple expressions
Single quotes are used to identify input fields from the incoming data. Double quotes are used to indicate text (string literals). Upper and lower case are distinguished in string literals.
# Literals:
"say 'hi'" # string literal
18 # integer literal
3.14159 # double literal
true # boolean literal
Input fields may be referenced either by name or by ordinal position (1-based) and are demarcated by single-quotes. If the field identifier is ambiguous, then an input must be specified, either by name or by ordinal (1-based). Note that the input may always be specified, but it is not always required. In all cases, the input name or ordinal is relative to the node that is being configured. The input identifier precedes the field identifier and is followed by a colon (:).
# Input field expressions
'customer name' # simple field name
'billing:customer name' # input name:field name
'1:state' # from input 1
'2:state' # from input 2
A variable name begins with a letter or underscore, and contains only alphanumeric characters and underscores.
# Variable values
foo # value of variable "foo"
bar-baz # hyphens ok: value of variable "bar-baz"
_qux? # other chars too: value of variable "_qux?"
Code comments
When the # symbol is used in a line of code, everything that follows it on that line is treated as a comment and is not executed, unless the # is part of a valid token, such as a string literal. A semicolon (;) can also be used to indicate a comment.
A comment at the beginning of a line:
# Comment at the beginning of a line
A comment after a piece of Data360 Analyze Script:
today = date() # comment after Script
Combinations
Combinations allow you to construct powerful and complex expressions. A combination is the application of an operator to a (possibly empty) set of parameters, in the following format:
operator-expr(parameter-expr-1, parameter-expr-2 ... )
parameter-expr-1.operator-expr(parameter-expr-2 ... )
strcat("field1=", 'field1') # value: "field1=<field1>"
"field1=".strcat('field1') # value: "field1=<field1>"
date("5/2/1974", "M/D/CCYY") # value: 1974-05-02
"5/2/1974".date("M/D/CCYY") # value: 1974-05-02
Statements
Statements are used, for example, to define variables and configure outputs. Although they look like combinations, there are a few important differences. Firstly, statements often have a more specific syntax than combinations. For example, a statement type may require a string literal as a parameter, while an operator cannot differentiate between a string literal and a string-valued combination.
statement-name(parameter-1, parameter-2 ... )
parameter-1.statement-name(parameter-2 ... )
foo = bar + 1 # foo = bar + 1
(strcat "foo" "bar") = 8 # ERROR - does not define "foobar"
Additionally, statements may have particular context rules:
output "outfield" emit foo # ERROR - improper context
output "primary" { # defining an output
    emit foo as "outfield" # Here, the statement is correct
}
Finally, statement names are not values (as operators are), so you must explicitly provide the statement name for every statement.
output 2 { # defining output #2
    emit *
    if $flag excludeInputField "name" # ERROR
}
Statement properties may be other statements or expressions, depending on the statement type. However, since most statements do not have a value, you cannot use them inside expressions.
Floating point arithmetic
Data360 Analyze Script provides the double value type to represent floating-point numbers, that is, numbers with a decimal point. The double type is double precision (64 bits), and Data360 Analyze Script implements the IEEE 754 standard for floating-point arithmetic. Floating-point operations sometimes produce confusing results. These problems are not due to limitations of Data360 Analyze Script, but are common to all fixed-width representations of real numbers. Double-precision floating-point numbers cannot represent more than sixteen significant (decimal) digits.
# Truncation
(1.0e20 + 1).equals(1.0e20) # results in value of True
Comparisons between doubles can also lead to surprising results. Exact equality comparisons can fail even though the numbers "look" the same. For example, the number 0.2 cannot be represented exactly as a 64-bit double (the error will be of the order 1e-15). These problems can usually be avoided by comparing with a tolerance, or, better still, by comparing relative magnitudes with a tolerance. These limitations in floating-point number representations can be particularly troublesome in calculations of monetary values.
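The same behaviour can be reproduced in any language that uses IEEE 754 doubles. The sketch below uses Python rather than Data360 Analyze Script, and `nearly_equal` is a hypothetical helper whose tolerance values are arbitrary choices for illustration. It shows the truncation case, a failing exact comparison, and a comparison with a relative tolerance.

```python
import math

# Truncation: adding 1 to 1.0e20 is lost, because neighbouring doubles at
# that magnitude are much further apart than 1.
truncated = (1.0e20 + 1) == 1.0e20

# Exact equality can fail for values that "look" the same.
exact_equal = (0.1 + 0.2) == 0.3


def nearly_equal(x, y, rel_tol=1e-9, abs_tol=1e-12):
    """Compare with a relative tolerance, plus an absolute one for values near zero."""
    return math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)


print(truncated)                      # True
print(exact_equal)                    # False
print(nearly_equal(0.1 + 0.2, 0.3))   # True
```

The relative tolerance scales with the magnitude of the operands, which is why it is usually preferred over a fixed absolute tolerance.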
In this example, rounding returns different answers for values that differ by very small fractions of cents.
a = 2.905
b = 2.9 + .005
c = 2.9 + .004999999998
d = 2.9 + .005000000000000001
emit a,b,c,d
emit a.round(-2) as ar
emit b.round(-2) as br
emit c.round(-2) as cr
emit d.round(-2) as dr
This results in:
a:double | b:double | c:double | d:double | ar:double | br:double | cr:double | dr:double
2.905 | 2.905 | 2.905 | 2.905 | 2.91 | 2.91 | 2.9 | 2.91
To avoid this, perform the rounding on an integer value:
round(int(1000 * 2.905), 1)/1000
The floating-point value is first converted to an integer representing tenths of cents, then rounded, and finally converted back into a dollar amount.
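The idea of rounding money on exact values rather than on doubles can be sketched outside Data360 Analyze Script as well. The Python below is an analogue, not the product's method. It assumes the amount is available as text or as an integer count of tenths of cents, and both function names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_cents_decimal(amount_text):
    """Round a decimal amount given as text to whole cents, halves rounding up."""
    return Decimal(amount_text).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def round_tenths_of_cents(tenths):
    """Round a non-negative integer count of tenths of cents to whole cents."""
    return (tenths + 5) // 10


print(round_cents_decimal("2.905"))           # 2.91
print(round_cents_decimal("2.904999999998"))  # 2.90
print(round_tenths_of_cents(2905))            # 291
```

Because neither function ever holds the amount as a binary floating-point number, values that differ by tiny fractions of a cent round predictably.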
String and unicode literals
Data360 Analyze Script recognizes strings of the form u"xxx" as Unicode literals (rather than of type "string"). For example, u"abcde" represents a five-character Unicode value. The 'u' must be lower-case. U"xxx" will give an error.
"abc\u1234" # An ordinary string literal containing nine characters.
u"abcde" # A valid, five-character Unicode value.
In a Unicode literal, each occurrence of '\uhhhh' or '\Uhhhhhhhh' (where 'h' is a hexadecimal digit) is treated as an escape sequence that represents a single Unicode character. The '\u' must be followed by exactly four hexadecimal digits, and the '\U' must be followed by exactly eight hexadecimal digits. When a Unicode literal with one or more of these escape sequences appears as an operand to an operator, or as a parameter to a function, the system will replace each of these escape sequences with a single Unicode character.
The case of the 'u' or 'U' is significant given that one precedes a 4-digit value, and the other precedes an 8-digit value. However, the hexadecimal digits, following the 'u' or 'U' can be upper-case, lower-case or mixed-case.
u"abc\u0321d\u1064\n" # A valid Unicode value of seven characters,
# the last of which is a new-line character.
Double backslashes ('\\') will escape the escape, that is, a double backslash results in a single backslash character in the resulting Unicode.
u"abc\\u1234d" # A valid Unicode value of ten characters, "abc\u1234d"
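Python 3 string literals follow the same conventions for '\u', '\U' and doubled backslashes, so the character counts quoted above can be checked there. This is an analogue only: the snippet runs in Python, not in Data360 Analyze Script, and the variable names are illustrative.

```python
# Python 3 str literals, used here only as an analogue of Unicode literals.
seven = u"abc\u0321d\u1064\n"        # two \u escapes and one \n escape
wide = u"\U0001F600"                 # \U takes exactly eight hex digits
mixed_case = u"\u00e9" == u"\u00E9"  # hex digits may be upper or lower case
escaped = u"abc\\u1234d"             # the doubled backslash suppresses the \u escape

print(len(seven))    # 7
print(len(wide))     # 1
print(mixed_case)    # True
print(len(escaped))  # 10
print(escaped)       # abc\u1234d
```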
In Unicode literals, escape sequences other than '\u' and '\U' are treated in the same way as escape sequences in ordinary string literals. For example, '\n' in u"abc\n" is replaced with a single newline character, which is the same thing it would do with the non-Unicode literal "abc\n".
In string literals, Data360 Analyze checks any escape sequences '\u' and '\U' to be sure that they are well-formed, but they are not translated into Unicode characters. For example, "\u1234" contains a well-formed escape sequence, but it is not translated into a single Unicode character (in contrast, Data360 Analyze would translate u"\u1234" into a single character). Data360 Analyze will raise an error for a string literal such as "\u123" because the escape sequence is not well-formed ('\u' must be followed by four hexadecimal digits).
Data360 Analyze enforces the rule that escape sequences in both Unicode and string literals in Data360 Analyze Script must be legal escape sequences. As noted earlier, this rule applies even to string literals that contain '\u' or '\U' escape sequences, which are not replaced with Unicode characters. For example, in Data360 Analyze Script, the string "c:\users\devp\bob" would be an error for two reasons: The '\u' sequence is not followed by 4 hexadecimal digits, and the '\d' is not a standard escape character sequence either; also, Data360 Analyze Script would interpret the \b as a backspace character. Similarly, the literals "\z" and "\u123" and u"\u123" are not valid. Note that the server, by default, is just as strict about enforcing the escape sequence rule, but there are options to control its level of enforcement.
u"abc\u12345" # A valid Unicode value of five characters. The '\u1234' is an escape sequence.
# The '5' is an ordinary character that follows the escape sequence.
u"abc\\u1234d" # A valid Unicode value of ten characters, "abc\u1234d"
U"abcde" # Not valid Data360 Analyze Script. The 'U' before the string is not valid.
u"abc\u123" # Not valid Data360 Analyze Script syntax.
# The '\u' must be followed by four hexadecimal digits.
u"abc\" # Not valid Data360 Analyze Script syntax.
# The quote character is escaped, making this an unterminated string.
Escape sequences for string and Unicode literals
\\ | Backslash (\) |
\' | Single quote (') |
\" | Double quote (") |
\a | ASCII Bell (BEL) |
\b | ASCII Backspace (BS) |
\f | ASCII Formfeed (FF) |
\n | ASCII Linefeed (LF) |
\r | ASCII Carriage Return (CR) |
\t | ASCII Horizontal Tab (TAB) |
\v | ASCII Vertical Tab (VT) |
\ooo | Character with octal value ooo |
\xhh | Character with hexadecimal value hh |
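Python string literals use the same escape letters with their conventional ASCII meanings, where '\a' is the bell character (7) and '\b' is backspace (8). That makes Python a convenient place to look up the character code behind each sequence. The sketch below is a Python analogue, not Data360 Analyze Script, and the `ESCAPES` list is illustrative.

```python
# Each entry: (character produced by the escape, expected code, description).
ESCAPES = [
    ("\\", 92, "backslash"),
    ("\'", 39, "single quote"),
    ("\"", 34, "double quote"),
    ("\a", 7, "bell"),
    ("\b", 8, "backspace"),
    ("\f", 12, "formfeed"),
    ("\n", 10, "linefeed"),
    ("\r", 13, "carriage return"),
    ("\t", 9, "horizontal tab"),
    ("\v", 11, "vertical tab"),
    ("\101", 65, "octal 101 is 'A'"),
    ("\x41", 65, "hexadecimal 41 is 'A'"),
]

for char, code, description in ESCAPES:
    # Every escape yields exactly one character with the listed code.
    assert len(char) == 1 and ord(char) == code, description
    print(code, description)
```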
Literals of the form r"xxx" represent "raw" string types. Escape sequences (including '\u') are not validated or replaced in raw strings. The 'r' must be lower-case. There are two exceptions to the rule regarding not validating:
- Raw strings cannot end in a '\'; that is, r"abc\" will generate an error (unterminated string).
- The two-character sequence '\"' can appear in any position before the end of a raw string. Data360 Analyze recognizes the sequence as not terminating the string, but leaves it intact rather than replacing it with a single character ('"').
This means that you cannot have a stand-alone double-quote character in a raw literal (such as r"abc"def"), but you can have a '\"' sequence in a raw string (except at the end). Thus, r"abc\"def" is a valid eight-character string. Literals of the form ru"xxx" and ur"xxx" are recognized as being both of type "Raw" and "Unicode". The 'r' and the 'u' must be lower-case. Data360 Analyze will not validate or replace any escape sequences (including '\u') in raw Unicode literals. The value of unicode(r"xxx") is the same as the value of ru"xxx", for all strings 'xxx', because the unicode function does not interpret the '\u' or '\U' escape sequences, and Data360 Analyze does not interpret '\u' or '\U' in raw Unicode strings either.
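Python raw string literals happen to follow the same two exceptions, so the rules above can be tried out there. This is only an analogue and says nothing definitive about how Data360 Analyze parses raw literals. The `compiles` helper is hypothetical and just reports whether Python accepts a piece of source text.

```python
raw = r"abc\"def"            # the \" pair does not end the string and is kept intact
path = r"c:\users\devp\bob"  # no escape validation in a raw string


def compiles(source_text):
    """Return True if Python accepts source_text as an expression."""
    try:
        compile(source_text, "<example>", "eval")
        return True
    except SyntaxError:
        return False


print(len(raw))                  # 8
print(raw)                       # abc\"def
print(len(path))                 # 17
print(compiles('r"abc\\"'))      # False: a raw string cannot end in a backslash
print(compiles('r"abc\\"def"'))  # True
```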
Support for implicit source names
Field names can be resolved without an explicit source (input) name, provided that the reference is unambiguous.
output "out1" {
    emit 'input-one:foo' as "foo" # Always works
    emit 'foo' as "footoo" # Works as long as no other source has a "foo"
}
There must be exactly one source with a field of the given name (case-insensitive). This does not work for ordinal field identifiers:
output "out1" {
    emit 'input-one:3' as "foo" # Always works
    emit '3' as "footoo" # Only works in single-source environments
}
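The resolution rules described above can be modelled in a few lines. The Python sketch below is a hypothetical model, not the Data360 Analyze implementation. It assumes that references split at the first colon, that an all-digit identifier is an ordinal, and that input names match exactly. `resolve_field` and the sample `inputs` are invented for illustration.

```python
def resolve_field(ref, inputs):
    """Resolve a field reference to (input_name, field_name).

    inputs is an ordered list of (input_name, field_names) pairs.
    """
    source, sep, field = ref.partition(":")
    if not sep:  # no colon: the whole reference is the field identifier
        source, field = None, ref

    if source is None:
        candidates = inputs
    elif source.isdigit():  # input given by 1-based ordinal
        candidates = [inputs[int(source) - 1]]
    else:  # input given by name
        candidates = [entry for entry in inputs if entry[0] == source]

    if field.isdigit():  # field given by 1-based ordinal
        if len(candidates) != 1:
            raise ValueError("ordinal field needs exactly one source: " + ref)
        name, fields = candidates[0]
        return name, fields[int(field) - 1]

    # Field given by name: exactly one source may have it (case-insensitive).
    matches = [(name, f) for name, fields in candidates
               for f in fields if f.lower() == field.lower()]
    if len(matches) != 1:
        raise ValueError("missing or ambiguous field: " + ref)
    return matches[0]


inputs = [("input-one", ["id", "foo", "state"]),
          ("billing", ["customer name", "state"])]

print(resolve_field("input-one:foo", inputs))  # ('input-one', 'foo')
print(resolve_field("FOO", inputs))            # ('input-one', 'foo')
print(resolve_field("2:state", inputs))        # ('billing', 'state')
print(resolve_field("input-one:3", inputs))    # ('input-one', 'state')
```

With these sample inputs, `resolve_field("state", inputs)` and `resolve_field("3", inputs)` both raise `ValueError`, mirroring the ambiguous-name and multi-source ordinal cases.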
Value types
Data360 Analyze Script is generally not strict about value types. For instance, conditional expressions are allowed to have consequent and alternate expressions that differ in type. Variables can change types. In fact, sometimes the value type of an expression is dependent on input data field values. The exception to the rule is input/output fields; because data streams must have a constant, well-defined record schema, each field must have a constant, well-defined type. For input fields, the Data360 Analyze Script interpreter reads the stream’s metadata to determine field names and types. When output fields are defined in Data360 Analyze Script, however, there can be no ambiguity. The expression supplied as an output field value must have a type that does not vary by record.
output 1 {
    emit 'foo' as "foo" # OK
    emit (if $flag "hi" 7.3) as "bar" # ERROR -- ambiguous
    emit (if $flag 'foo' 'bar') as "baz" # MAYBE
}
Note that in this example, the definition of the output field baz will only be OK if the input fields foo and bar have the same type. This can also be a problem with the agg node if the initial expression and update expression do not have the same type.
Often the type of the expression will be determined by implicit conversion by arithmetic operators. When you perform arithmetic operations on two values of different numerical types, the following rules apply:
* Int and long -> long
* Int and double -> double
* Long and double -> double
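The three promotion rules amount to picking the "wider" of the two operand types. A minimal Python model of that ordering is shown below. `NUMERIC_RANK` and `promote` are hypothetical names, and the type names are plain strings rather than real Data360 Analyze types.

```python
# Hypothetical model: rank the numeric types from narrowest to widest.
NUMERIC_RANK = {"int": 0, "long": 1, "double": 2}


def promote(left, right):
    """Return the result type of an arithmetic operation on two numeric types."""
    return left if NUMERIC_RANK[left] >= NUMERIC_RANK[right] else right


print(promote("int", "long"))     # long
print(promote("int", "double"))   # double
print(promote("double", "long"))  # double
```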
In general, type ambiguity is not a problem, since the vast majority of expressions have a constant type. Where it does occur, it is easily fixed with a cast operator.
output 1 {
    emit (str (if $flag 'foo' 'bar')) as "baz" # baz is string
}
Value types (all values are immutable)
string | ASCII character sequences. |
unicode | Unicode character sequences. |
int | 32-bit signed integer. Valid range of values is from -2^31 to 2^31 - 1. |
long | 64-bit signed integer. Valid range of values is from -2^63 to 2^63 - 1. |
double | 64-bit double. Valid range of values is from -1.7976931348623157 * 10^308 to 1.7976931348623157 * 10^308. |
date | Year, month, and day. |
time | Hour, minute, and second. |
datetime | Year, month, day, hour, minute, and second. |
boolean | True or False. |
operator | Used in combinations. |
list | Ordered, indexed sequence of values of any (mixed) type. |
null | The null value has null type. |