R node - Data360_Analyze - Latest

Data360 Analyze Server Help

Product type: Software
Portfolio: Verify
Product family: Data360
Product: Data360 Analyze
Version: Latest
Language: English
Product name: Data360 Analyze
Title: Data360 Analyze Server Help
Copyright: 2024
First publish date: 2016
Last updated: 2024-11-28
Published on: 2024-11-28T15:26:57.181000

Runs an R script on an R server. This server can be either remote or on the same machine as the Data360 Analyze server.

Prerequisites

  • Download and install R from https://www.r-project.org/
  • Download the R Node Pack and follow the instructions in the "Installing and configuring R" PDF documentation.
Note: These software programs are not distributed by Precisely. The R node communicates with these programs to execute R scripts. The machine hosting the R environment must have sufficient available RAM to process the data.

Disclaimer: Open-source R is available under separate open source software license terms and is not part of Data360 Analyze. As such, open-source R is not within the scope of your license for Data360 Analyze. Open-source R is not supported, maintained, or warranted in any way by Precisely. Download and use of open-source R is solely at your own discretion and subject to the free open source license terms applicable to open-source R.

Working with the R node

The R node provides integration with the R processing engine for advanced statistical functionality. R is a popular, powerful, and extensible statistical computing platform that enables you to create complex computations and apply them to solve big data problems.

The R node can take a variable number of inputs which you can then make available in R as data frames.

The R node can extract data from R in the form of data frames or plot images and make them available in Data360 Analyze, creating a seamless integration between the two engines and allowing you to execute R scripts on data alongside traditional Data360 Analyze nodes.

The node uses Rserve to communicate with the R server. In contrast to other Data360 Analyze nodes, in particular the Java and Python nodes, the R node requires that all input and output data be loaded into memory (due to a limitation in R). When all of the data is loaded into memory, the RScript is run once over the entire input data set, as if the script were running in R outside of Data360 Analyze. After the script is run, all of the output data of the script is held in memory and the node outputs this data to the output pins.

All inputs are created in R as data frames. A variable with the same name as the input pin points to each input data frame. The columns of each data frame are named after the corresponding input field so that they can be accessed via "pinName$fieldName".
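
As a minimal sketch of this naming convention, the script below assumes an input pin named in1 with fields id and amount (all three names are hypothetical). Inside the node, in1 would already exist; the stand-in definition is included only so that the snippet also runs in a plain R session.

```r
# Stand-in for the data frame the node would create for an input pin.
# The pin name `in1` and the fields `id` and `amount` are hypothetical.
if (!exists("in1")) {
  in1 <- data.frame(id = 1:3, amount = c(10.5, 20.0, 7.25))
}

# Columns are reached as pinName$fieldName.
total_amount <- sum(in1$amount)

# Standard data frame indexing works on the input as well.
large_rows <- in1[in1$amount > 10, ]
```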

The node can output either images or data frames. To output a data frame, create a data frame in R and assign it to a variable with the same name as the output pin.

To designate an output pin as an image, call createImage("pinName"). This opens a new graphics device, makes it the current device, and assigns a placeholder object to a variable named pinName. You can then plot information and display text in the open device using standard R calls. Do not close the device, and do not change or overwrite the placeholder variable. In addition to the pin name, you can pass the following named arguments to change attributes of the generated image: width, height, units, pointsize, bg, and res. This is similar to creating a png or jpeg device, except that you cannot set the filename or type parameters. The node outputs images in JPEG format only. Cairo is not supported.
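
The following sketch shows both kinds of output in one script. It assumes two output pins named out1 and plot1 (hypothetical names) and uses only the createImage arguments listed above. Because createImage is supplied by the node, the call is guarded so that the snippet can also be tried in a plain R session, where the plotting step is simply skipped.

```r
# Hypothetical output pins: `out1` (data frame) and `plot1` (image).

# A data frame assigned to a variable named after the pin is written to that pin.
out1 <- data.frame(
  category = c("a", "b", "c"),
  count = c(4L, 9L, 2L),
  stringsAsFactors = FALSE
)

# createImage() only exists when the script runs inside the R node,
# so guard the call to keep this sketch runnable elsewhere.
if (exists("createImage")) {
  createImage("plot1", width = 800, height = 600, res = 96)
  barplot(out1$count, names.arg = out1$category, main = "Count by category")
  # Do not call dev.off() here, and do not reassign `plot1`.
}
```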

Pin-named variables that are neither a placeholder image variable nor a data frame will not be output. The behavior of the node in this case is defined by the InvalidOutputBehavior property.

R node data types

The node automatically converts input data to the correct R classes: "integer", "numeric", "factor", "character", "logical", "Date", and, finally, "POSIXct" for times and datetimes.

Strings are converted either to factors or to characters, as set by the ImportStringCoercion property. All output columns of the above types are converted back to Data360 Analyze data types (see Data types).

Strings and doubles are output via types as defined by the ExportStringCoercion and ExportDoubleCoercion properties.

All output columns of any other R type, including "complex", are converted to string during output.
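
To see which R class each column has, you can inspect the classes from within the script. The sketch below builds a stand-in data frame with one column per class listed above (the column names are hypothetical) and reports the primary class of each column. Inside the node, the same sapply call could be applied to an input data frame.

```r
# Stand-in data frame with one column per class the node produces on input.
df <- data.frame(
  i = 1:2,
  n = c(1.5, 2.5),
  f = factor(c("x", "y")),
  s = c("a", "b"),
  l = c(TRUE, FALSE),
  d = as.Date(c("2024-01-01", "2024-01-02")),
  t = as.POSIXct(c("2024-01-01 10:00:00", "2024-01-02 11:30:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# POSIXct columns carry two classes ("POSIXct", "POSIXt"); keep the first.
col_classes <- sapply(df, function(col) class(col)[1])
print(col_classes)
```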

Location of R server

If you have R on the same machine as the Data360 Analyze Server, you can leave the RUsername, RPassword, RHost, and RPort properties blank in the R node.

If you install R on a different machine to the one on which the Data360 Analyze Server is installed:

  • Use the RHost and RPort properties to tell the R node where R is located.
  • Consider securing R to prevent unauthorized use. If you do this, provide the RUsername and RPassword properties in the R node.

CAUTION:
There is a known limitation in Rserve such that, when Rserve is run on Windows, only one R node can be run at a time. If you start two R nodes at once, both will appear to run, but the second node will not start processing data until the first node has completed. As a consequence, if you stop the Data360 Analyze server while an R node is running on Windows, no other R nodes will be able to connect to the Rserve server for some time. The workaround is to restart Rserve: open Task Manager, end the Rserve.exe process, and then start Rserve again so that nodes can connect to R.

Properties

RScript

Specify the R script that is run on the R server.

A value is required for this property.

RUsername

Optionally specify a username to log into the R server.

If you specify an RUsername or RPassword, the node will attempt to log in. However, if both properties are blank, the node will assume that the R server does not require authentication and will not try to log in.

RPassword

Optionally specify a password to log into the R server.

If you specify an RUsername or RPassword, the node will attempt to log in. However, if both properties are blank, the node will assume that the R server does not require authentication and will not try to log in.

RHost

Optionally specify the hostname of the R server.

The node will try to connect to the specified server. If you do not specify a hostname, but you do specify a port, the node will connect to the local machine via IP address 127.0.0.1.

If both the RHost and RPort properties are blank, the node will use the Rserve client's default connection settings. As of version 0.6-8, this default is "localhost".

RPort

Optionally specify the port on the host to use to connect to the R server.

If no port is entered, the node will use the default port as set in the Rserve client. As of version 0.6-8, the default port is 6311.
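
The fallback rules for RHost and RPort can be summarized in a short sketch. The function name resolve_rserve_target is hypothetical, and the code is only an illustration of the documented rules, not the node's implementation. The default values shown are the ones stated above for Rserve client version 0.6-8.

```r
# Sketch of the documented RHost/RPort fallback rules (not the node's code).
# The function name is hypothetical.
resolve_rserve_target <- function(host = "", port = "") {
  blank <- function(x) is.null(x) || is.na(x) || !nzchar(x)

  if (blank(host) && blank(port)) {
    # Both blank: Rserve client defaults (as of 0.6-8: localhost, port 6311).
    return(list(host = "localhost", port = 6311L))
  }

  list(
    # Port given without a host: connect to the local machine by IP address.
    host = if (blank(host)) "127.0.0.1" else host,
    # Host given without a port: fall back to the client's default port.
    port = if (blank(port)) 6311L else as.integer(port)
  )
}

target <- resolve_rserve_target(port = "7000")
```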

ImportStringCoercion

Optionally specify how to convert string and Unicode input fields when moving data from Data360 Analyze into R. Choose from:

  • To Factor - Data frames convert character vectors to factors.
  • To Character - Data frames keep character vectors as characters.

The default value is To Factor.

R generally converts Data360 Analyze string and Unicode values into factors when creating data frames. Factors take a limited number of distinct values and are stored as integer vectors, which map to characters when displayed. They can be used in a variety of modeling functions, but sometimes it is more convenient for strings to simply stay strings and not be converted.
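
The difference between the two options can be seen in plain R. The sketch below uses a stand-in character vector (the values are hypothetical) to show how a factor stores integer codes that map to a fixed set of levels, and how to get the original strings back when a script receives factors but needs characters.

```r
# Stand-in values; in the node these would come from a string input field.
labels_chr <- c("red", "green", "red", "blue")

# "To Factor" style: distinct values become levels, data become integer codes.
labels_fct <- factor(labels_chr)
lvls <- levels(labels_fct)
codes <- as.integer(labels_fct)

# Converting a factor back to plain strings inside the script.
back <- as.character(labels_fct)
```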

ExportStringCoercion

Optionally specify how to export character vectors from R to Data360 Analyze. Choose from:

  • To Unicode - R represents all string values in data frames as character vectors or factors, both of which are implemented by Unicode strings. By contrast, Data360 Analyze has two field types for this class: string and Unicode. Unicode can contain all characters found in R while string can only hold a subset (only those found in the Data360 Analyze server's code page). Therefore, if data from R has characters that are not in the Data360 Analyze server's code page, select To Unicode to avoid errors when outputting the data.
  • To String - Only select this option if all characters in the output data frames are in the Data360 Analyze server's code page.

The default value is To Unicode.

ExportDoubleCoercion

Optionally specify how to export double vectors from R to Data360 Analyze. Data360 Analyze uses three field types for numbers: int, double, and long. R has no long type; it stores all numbers that do not fit into an int as doubles. Choose from:

  • To Long - Values are rounded to the nearest integer, provided they are within Epsilon of it. If you select this option when there are values that are too far away to be rounded, or that are larger than Long.MAX_VALUE or smaller than Long.MIN_VALUE, the node will fail with an error.
  • To Double - Values are exported without conversion.

The default value is To Double.

Epsilon

Optionally specify a decimal tolerance for rounding doubles to longs. If the ExportDoubleCoercion property is set to To Long, then double values from R will be rounded to the nearest integer if the distance between the two values is less than or equal to Epsilon.

The default value is 0.0, i.e. the node will only convert exact integers.
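
The rounding rule can be illustrated with a short sketch. The function name to_long is hypothetical, and the code only mirrors the rule as described here; it is not the node's implementation. Checking values in the script this way before output may help avoid a node failure when To Long is selected.

```r
# Sketch of the documented Epsilon rule (not the node's code).
# The function name is hypothetical.
to_long <- function(x, epsilon = 0.0) {
  nearest <- round(x)

  # A value qualifies only if it lies within epsilon of the nearest integer.
  too_far <- abs(x - nearest) > epsilon

  # 2^63 is the first double beyond the signed 64-bit (long) range.
  out_of_range <- nearest >= 2^63 | nearest < -2^63

  # NA values pass through unchanged in this sketch.
  if (any(too_far | out_of_range, na.rm = TRUE)) {
    stop("value cannot be converted to long within epsilon")
  }
  nearest
}

exact <- to_long(c(3, 4))                   # epsilon 0.0: exact integers only
rounded <- to_long(c(1, 2.0000001), 1e-6)   # within tolerance, so rounded
```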

InvalidOutputBehavior

Optionally specify how the node should react if the variables named after output pins do not exist in the R workspace or if they are not of class "data.frame".

The node checks whether each output variable is defined before exporting it. If the matching variable is of class data.frame, the node attempts to write out all records found in the data frame. If a variable of a given output pin's name is not found, or if that variable does not point to a data frame, the behavior of the node depends on the option selected in this property. Choose from:

  • Error - An error is logged and the node fails.
  • Log - A warning is logged and the node continues to run.
  • Ignore - No issue is reported.

The default value is Error.
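
A script can also run a similar check itself before it finishes. The sketch below is a hypothetical helper, check_output_pins, that reports which pin-named variables are missing or are not data frames. It assumes that pin variables live in the global environment and it does not account for image placeholder variables, which are also valid outputs.

```r
# Hypothetical helper: return the pin names that have no data frame behind them.
# Assumes pin variables live in `env` (the global environment by default).
check_output_pins <- function(pins, env = globalenv()) {
  ok <- vapply(pins, function(p) {
    exists(p, envir = env, inherits = FALSE) &&
      is.data.frame(get(p, envir = env))
  }, logical(1))
  pins[!ok]
}

# Usage with a scratch environment and hypothetical pin names.
scratch <- new.env()
assign("out1", data.frame(x = 1:2), envir = scratch)  # valid output
assign("out2", 42, envir = scratch)                   # defined, not a data frame
invalid <- check_output_pins(c("out1", "out2", "out3"), env = scratch)
```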

Inputs and outputs

Inputs: Multiple optional.

Outputs: Multiple optional.