Uploads files to a specified HDFS server using the WebHDFS API.
Enables users to access a secure Hadoop cluster by providing support for authentication using the Kerberos protocol.
For example, you can use the HDFS Upload node to publish results files to the Hadoop Distributed File System of a Hadoop cluster. When run, the node uploads the specified files to HDFS and outputs details of the files that were uploaded.
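As an illustration of what a WebHDFS upload involves, the sketch below builds the URL for the first step of the WebHDFS file creation operation (op=CREATE). In the WebHDFS REST API, a PUT request to this URL is answered with a redirect to the location that accepts the file data. The function name, host name, port and paths are hypothetical, and this is not necessarily how the node builds its requests internally; a cluster behind a Knox Gateway would typically use a different base path.

```python
from urllib.parse import quote, urlencode


def build_create_url(server_url, hdfs_path, user=None, permission=None, overwrite=False):
    """Build the WebHDFS URL for the first step of a file upload (op=CREATE)."""
    if not hdfs_path.startswith("/"):
        raise ValueError("hdfs_path must be an absolute path")
    params = {"op": "CREATE", "overwrite": "true" if overwrite else "false"}
    if permission is not None:
        # Three octal digits, e.g. "644"
        params["permission"] = permission
    if user is not None:
        # Identifies the user when security is not enabled on the cluster
        params["user.name"] = user
    base = server_url.rstrip("/")
    return f"{base}/webhdfs/v1{quote(hdfs_path)}?{urlencode(params)}"


# Hypothetical server and target path, for illustration only
url = build_create_url(
    "http://namenode.example.com:9870",
    "/results/out.csv",
    user="analyst",
    permission="644",
)
print(url)
```

The username and permission parameters correspond loosely to the ServerUsername and Permissions properties described below.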
Data360 Analyze also provides the HDFS Directory List node and the HDFS Download node. You can use the HDFS nodes together.
Authentication options
There are a number of authentication properties on the node. The specific properties that you need to configure are determined by your authentication method. The following table outlines which authentication properties you need to configure for each of the authentication methods:
Authentication method | ServerUsername | ServerPassword | KdcServerHost | KerberosSecurityRealm | KerberosConfiguration |
---|---|---|---|---|---|
Username | Yes | | | | |
Username and Password | Yes | Yes | | | |
Kerberos SSO | | | | | Yes |
Kerberos Isolated | Yes | Yes | Yes | Yes | |
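The table above can be expressed as a simple lookup, which is useful when checking a configuration before running the node. The following is a minimal sketch only; the dictionary and the missing_properties function are hypothetical helpers and are not part of Data360 Analyze.

```python
# Required properties per authentication method, as listed in the table above
REQUIRED_PROPERTIES = {
    "Username": {"ServerUsername"},
    "Username and Password": {"ServerUsername", "ServerPassword"},
    "Kerberos SSO": {"KerberosConfiguration"},
    "Kerberos Isolated": {
        "ServerUsername",
        "ServerPassword",
        "KdcServerHost",
        "KerberosSecurityRealm",
    },
}


def missing_properties(method, configured):
    """Return the required properties that have no value for the given method."""
    if method not in REQUIRED_PROPERTIES:
        raise ValueError(f"Unknown authentication method: {method}")
    supplied = {name for name, value in configured.items() if value}
    return sorted(REQUIRED_PROPERTIES[method] - supplied)


# Hypothetical configuration with the Kerberos properties left blank
missing = missing_properties(
    "Kerberos Isolated",
    {"ServerUsername": "analyst", "ServerPassword": "secret"},
)
print(missing)
```

Note that some properties fall back to a default when left blank (for example, KerberosConfiguration), so a missing value is not always an error in practice.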
Username authentication
Enter your HDFS server username in the ServerUsername property.
For this authentication method, the username is used to identify the user on the Hadoop cluster when security is not enabled on the cluster. The default value is "Username".
Username and password authentication
Enter your HDFS server username in the ServerUsername property and the corresponding password in the ServerPassword property. The username and password information is used to authenticate the user on the Hadoop cluster when it is secured by Knox Gateway perimeter security.
The username and password must correspond with the credentials configured in the authentication server used by the Knox Gateway.
Where SSL is used as the transport protocol, the Data360 Analyze server must be configured with the security certificate for the Knox Gateway server. The Java Runtime Environment (JRE) truststore file is used by default if no TrustStoreFile is specified. If the TrustStoreFile is specified, the TrustStoreFilePassword must also be specified.
Kerberos Isolated authentication
Enter your HDFS server username in the ServerUsername property and the corresponding password in the ServerPassword property. The username should be the Kerberos principal primary to be used to access the cluster, and the password is the principal's password on the cluster.
When using Kerberos Isolated authentication, there are additional Kerberos properties that you need to configure:
- KdcServerHost - Specify the name of the server hosting the Kerberos Key Distribution Center (KDC) e.g. kdc.example.com.
- KerberosSecurityRealm - Specify the name of the realm to use for Kerberos security e.g. HDFS.EXAMPLE.COM.
Kerberos SSO authentication
If you select Kerberos SSO in the ServerAuthenticationMethod property, the node uses Single Sign On to authenticate the user.
A Kerberos ticket-granting ticket (TGT) is a small amount of encrypted data that is issued by a server in the Kerberos authentication model to begin the authentication process. A kinit command is run to obtain or renew a Kerberos ticket-granting ticket. There are various methods by which a kinit command can be run to obtain a TGT:
- Your company may have configured your machine to perform a kinit operation when you log in.
- If you are using the MITKerberos client, you can use the user interface to manage TGTs.
- Alternatively, the kinit program can be invoked by adding C:\Program Files\Data360Analyze\jre\bin\kinit.exe to the laeenv.bat file.
You must also identify a valid Kerberos configuration file to be used by the authentication process in the KerberosConfiguration property. If you do not specify a value, a default file path is used, which depends on the platform on which the Data360 Analyze server is running. On Windows, the default file path is '%WINDIR%\krb5.ini' (e.g. C:\Windows\krb5.ini). On Linux, the default file path is '/etc/krb5.conf'.
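The platform-dependent default can be sketched as follows. This is an illustration only: the function name is hypothetical, and it assumes that the Windows default is resolved against the WINDIR environment variable, as the C:\Windows\krb5.ini example suggests.

```python
import ntpath


def default_krb5_path(is_windows, env):
    """Return the default Kerberos configuration path for the platform."""
    if is_windows:
        # Assumption: the Windows directory is taken from the WINDIR variable
        windows_dir = env.get("WINDIR", r"C:\Windows")
        return ntpath.join(windows_dir, "krb5.ini")
    return "/etc/krb5.conf"


windows_default = default_krb5_path(True, {"WINDIR": r"C:\Windows"})
linux_default = default_krb5_path(False, {})
print(windows_default)
print(linux_default)
```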
For Kerberos SSO authentication, you do not need to configure the ServerUsername, ServerPassword, KdcServerHost or KerberosSecurityRealm properties.
Properties
ServerUrl
Specify the URL of the HDFS server hosting the HDFS site (e.g. http://www.hdfs-server.example.com). The URL must be correctly formatted, or the node will fail.
Choose the (from Field) variant of this property to look up the value from an input field with the name specified.
A value is required for this property.
ServerPath
Specify the path of the server directory to which the file will be uploaded (must be an absolute path).
Choose the (from Field) variant of this property to look up the value from an input field with the name specified.
A value is required for this property.
LocalPath
Specify the path of the file to upload (must be an absolute path).
Choose the (from Field) variant of this property to look up the value from an input field with the name specified.
A value is required for this property.
Permissions
Optionally specify the permissions to assign to the uploaded file.
This value should consist of three octal digits (000 - 777) corresponding to the permissions for the user, group and other classes. If not specified, this attribute is not sent to the server, and the server assigns default permissions of "755".
Choose the (from Field) variant of this property to look up the value from an input field with the name specified.
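To show how a three-digit octal value maps to the user, group and other classes, the following sketch validates a Permissions value and expands each digit into read, write and execute flags. The parse_permissions function is a hypothetical helper for illustration and is not part of the node.

```python
import re


def parse_permissions(value):
    """Validate a three-digit octal string and expand it per permission class."""
    if not re.fullmatch(r"[0-7]{3}", value):
        raise ValueError(f"Expected three octal digits (000 - 777), got {value!r}")
    result = {}
    for name, digit in zip(("user", "group", "other"), value):
        bits = int(digit, 8)
        # Each octal digit is a bit mask: 4 = read, 2 = write, 1 = execute
        result[name] = "".join(
            flag if bits & mask else "-"
            for flag, mask in (("r", 4), ("w", 2), ("x", 1))
        )
    return result


default_permissions = parse_permissions("755")
print(default_permissions)
```

For "755", the user class receives read, write and execute permissions, while the group and other classes receive read and execute only.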
PassThroughFields
Optionally specify which input fields will "pass through" the node unchanged from the input to the output, assuming that the input exists. The input fields specified will appear on those output records which were produced as a result of the input fields.
The following options are available:
- All - Passes through all the input data fields to the output.
- None - Passes none of the input data fields to the output; as such, only the fields created by the node appear on the output.
- Used - Passes through all the fields that the node used to create the output. Used fields include any input field referenced by a property, be it explicitly (i.e., via a 'field1' reference) or via a field pattern (i.e., '1:foo*').
- Unused - Passes through all the fields that the node did not use to create the output.
If a naming conflict exists between a pass-through field and an explicitly named output field, an error will occur.
The default value is Used.
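The four options can be summarized with a short sketch. The select_pass_through function and the field names are hypothetical; the sketch only illustrates which input fields each option selects and how a naming conflict with an output field would be detected.

```python
def select_pass_through(mode, input_fields, used_fields, output_fields=()):
    """Return the input fields that pass through to the output for a given mode."""
    if mode == "All":
        selected = list(input_fields)
    elif mode == "None":
        selected = []
    elif mode == "Used":
        selected = [f for f in input_fields if f in used_fields]
    elif mode == "Unused":
        selected = [f for f in input_fields if f not in used_fields]
    else:
        raise ValueError(f"Unknown PassThroughFields option: {mode}")
    # A pass-through field may not share a name with a field the node creates
    conflicts = [f for f in selected if f in output_fields]
    if conflicts:
        raise ValueError(f"Naming conflict with output fields: {conflicts}")
    return selected


# Hypothetical input fields; only "local_path" is referenced by a property
fields = ["local_path", "batch_id", "comment"]
used = select_pass_through("Used", fields, used_fields={"local_path"})
unused = select_pass_through("Unused", fields, used_fields={"local_path"})
print(used, unused)
```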
ServerAuthenticationMethod
Optionally specify the authentication method to be used on the Hadoop Cluster. Choose from:
- Username - the username defined in the ServerUsername property is used to identify the user on the Hadoop cluster when security is not enabled on the cluster. If not supplied, the default username is "Username".
- Username and Password - The username defined in the ServerUsername property and the password defined in the ServerPassword property are used to authenticate the user on the Hadoop cluster when it is secured by Knox Gateway perimeter security.
- Kerberos SSO - The node uses Single Sign On to authenticate the user. The node will use an existing Kerberos ticket for the Hadoop cluster. The KerberosConfiguration property must identify a valid Kerberos configuration file to be used by the authentication process.
- Kerberos Isolated - The node uses the supplied credentials to authenticate the user. The ServerUsername, ServerPassword, KdcServerHost and KerberosSecurityRealm properties must be specified.
The default value is Username.
ServerUsername
Optionally specify the username on the HDFS server. This may contain the domain, if required, in the format "Domain\Username".
This property must be specified when the ServerAuthenticationMethod property is set to Kerberos Isolated, Username and Password or Username.
Choose the (from Field) variant of this property to look up the value from an input field with the name specified.
ServerPassword
Optionally specify the password for the user on the HDFS Server.
A value is required for this property when the ServerAuthenticationMethod property is set to Kerberos Isolated or Username and Password.
Choose the (from Field) variant of this property to look up the value from an input field with the name specified.
KdcServerHost
Optionally specify the name of the Kerberos Key Distribution Center server e.g. kdc.example.com.
This property must be specified when the ServerAuthenticationMethod property is set to Kerberos Isolated.
Choose the (from Field) variant of this property to look up the value from an input field with the name specified.
KerberosSecurityRealm
Optionally specify the name of the Kerberos realm e.g. EXAMPLE.COM.
This property must be specified when the ServerAuthenticationMethod property is set to Kerberos Isolated.
Choose the (from Field) variant of this property to look up the value from an input field with the name specified.
KerberosConfiguration
Optionally specify the file path of the Kerberos configuration file to be used during the authentication process. If not specified, a default file path is used, which depends on the platform on which the Data360 Analyze server is running. On Windows, the default file path is %WINDIR%\krb5.ini and on Linux, the default file path is /etc/krb5.conf.
A valid Kerberos configuration file must be identified by this property when the ServerAuthenticationMethod property is set to Kerberos SSO.
Choose the (from Field) variant of this property to look up the value from an input field with the name specified.
TrustStoreFile
Optionally specify the name of the truststore file.
The default is the Java Runtime Environment (JRE) truststore file.
TrustStoreFilePassword
Optionally specify the password relating to the TrustStoreFile.
A value is required for this property when the TrustStoreFile property is specified.
The default is the Java Runtime Environment (JRE) truststore file password.
FileExistsBehavior
Optionally specify what to do when a file being uploaded already exists on the server. Choose from:
- Error - Give a transfer error and skip the file.
- Log - Log a warning message and skip the file.
- Ignore - Skip the file.
- Overwrite - Overwrite the file.
- Update - Overwrite if the file being uploaded is newer than the existing file.
The default value is Error.
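The options above amount to a small decision rule, sketched below. The decide_transfer function is a hypothetical helper: it assumes that "newer" is determined by comparing modification times, which this documentation does not state explicitly.

```python
def decide_transfer(behavior, source_mtime, existing_mtime):
    """Return (transfer, report) for a file that already exists at the target.

    transfer is True when the existing file should be overwritten.
    report is None, "warning" or "error".
    """
    if behavior == "Overwrite":
        return True, None
    if behavior == "Update":
        # Assumption: "newer" means a later modification time
        return source_mtime > existing_mtime, None
    if behavior == "Ignore":
        return False, None
    if behavior == "Log":
        return False, "warning"
    if behavior == "Error":
        return False, "error"
    raise ValueError(f"Unknown FileExistsBehavior option: {behavior}")


# Hypothetical timestamps: the source file is newer than the existing file
decision = decide_transfer("Update", source_mtime=200, existing_mtime=100)
print(decision)
```

A result with report set to "error" would count towards the ErrorThreshold property described below.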
ErrorThreshold
Optionally specify the number of transfer errors that will cause the node to give up and fail.
Each record on the input pin is a "request". A transfer error is any error that causes a request to fail (e.g. a requested file does not exist). Setting this property instructs the node to continue processing requests as long as the number of errors remains below the given threshold.
An ErrorThreshold of 0 means never fail on a transfer error (the node will still fail on more serious errors).
The default value is 1 - the node fails on the first error encountered.
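The threshold behavior described above can be sketched as a loop over requests. The process_requests and fake_transfer functions are hypothetical and stand in for the node's internal processing; transfer errors are modeled here as OSError exceptions.

```python
def process_requests(requests, transfer, error_threshold=1):
    """Process each request, failing once the error count reaches the threshold.

    An error_threshold of 0 means transfer errors never cause a failure.
    """
    completed = []
    errors = 0
    for request in requests:
        try:
            completed.append(transfer(request))
        except OSError as exc:
            errors += 1
            if error_threshold and errors >= error_threshold:
                raise RuntimeError(
                    f"Error threshold of {error_threshold} reached"
                ) from exc
    return completed, errors


def fake_transfer(path):
    """Stand-in transfer that fails for paths containing 'missing'."""
    if "missing" in path:
        raise FileNotFoundError(path)
    return path


# With a threshold of 0, the failed request is counted but processing continues
completed, errors = process_requests(
    ["/data/a.csv", "/data/missing.csv", "/data/b.csv"],
    fake_transfer,
    error_threshold=0,
)
print(completed, errors)
```

With the default threshold of 1, the same input would stop at the second request, because the first error already reaches the threshold.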
Inputs and outputs
Inputs: 1 optional.
Outputs: uploaded files.