Downloads files from a specified HDFS server using the WebHDFS API.
Enables you to access a secure Hadoop cluster by providing support for authentication using the Kerberos protocol.
For example, you can use the HDFS Download node to transfer source or results files from the Hadoop Distributed File System of a Hadoop cluster to the machine hosting the Data360 Analyze server for local processing. When run, the node downloads the specified files and outputs details of the files that were downloaded. The node can be configured to write the files to disk or output the contents in a field.
Data360 Analyze also provides the HDFS Directory List node and the HDFS Upload node, and you can use the HDFS nodes together.
Authentication options
There are a number of authentication properties on the node. The specific properties that you need to configure are determined by your authentication method. The following table outlines which authentication properties you need to configure for each of the authentication methods:
Authentication method | ServerUsername | ServerPassword | KdcServerHost | KerberosSecurityRealm | KerberosConfiguration |
---|---|---|---|---|---|
Username | Yes | | | | |
Username and Password | Yes | Yes | | | |
Kerberos SSO | | | | | Yes |
Kerberos Isolated | Yes | Yes | Yes | Yes | |
Username authentication
Enter your HDFS server username in the ServerUsername property.
For this authentication method, the username is used to identify the user on the Hadoop cluster when security is not enabled on the cluster. The default value is "Username".
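For orientation, the sketch below shows how a username is conventionally passed to the WebHDFS REST API on a cluster without security enabled: as a `user.name` query parameter on the request URL. It is only an illustration of the underlying API, not the node's actual implementation; the function name, host, port, and file path are hypothetical.

```python
from urllib.parse import quote, urlencode


def webhdfs_open_url(server_url: str, server_path: str, username: str) -> str:
    """Build a WebHDFS OPEN (read file) URL for a cluster without security."""
    if not server_path.startswith("/"):
        raise ValueError("server_path must be an absolute path")
    # On an unsecured cluster the caller is identified by user.name only.
    query = urlencode({"op": "OPEN", "user.name": username})
    # WebHDFS exposes the file system under the /webhdfs/v1 prefix.
    return f"{server_url.rstrip('/')}/webhdfs/v1{quote(server_path)}?{query}"


# Hypothetical host, port, path, and user, for illustration only.
print(webhdfs_open_url("http://namenode.example.com:9870", "/data/results.csv", "analyst"))
```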
Username and password authentication
Enter your HDFS server username in the ServerUsername property and the corresponding password in the ServerPassword property. The username and password information is used to authenticate the user on the Hadoop cluster when it is secured by Knox Gateway perimeter security.
The username and password must correspond with the credentials configured in the authentication server used by the Knox Gateway.
Where SSL is used as the transport protocol, the Data360 Analyze server must be configured with the security certificate for the Knox Gateway server. The Java Runtime Environment (JRE) truststore file is used by default if no TrustStoreFile is specified. If the TrustStoreFile is specified, the TrustStoreFilePassword must also be specified.
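As a rough illustration of how a username and password are commonly presented to a gateway over HTTP, the sketch below builds an HTTP Basic `Authorization` header. It assumes the Knox Gateway is configured to accept HTTP Basic authentication, which may not match your deployment; the function name is hypothetical and this is not necessarily how the node sends credentials.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Return an HTTP Basic Authorization header for the given credentials."""
    # Basic auth is "username:password", Base64-encoded. It is not encrypted,
    # which is why SSL is normally used as the transport protocol.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

Because the header is only encoded rather than encrypted, it should be sent over an SSL connection that the server's truststore can verify.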
Kerberos Isolated authentication
Enter your HDFS server username in the ServerUsername property and the corresponding password in the ServerPassword property. The username should be the Kerberos principal primary to be used to access the cluster, and the password is the principal's password on the cluster.
When using Kerberos Isolated authentication, there are additional Kerberos properties that you need to configure:
- KdcServerHost - Specify the name of the server hosting the Kerberos Key Distribution Center (KDC), e.g. kdc.example.com.
- KerberosSecurityRealm - Specify the name of the realm to use for Kerberos security, e.g. HDFS.EXAMPLE.COM.
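To show how these values relate to one another, the sketch below forms a full principal name from a primary and a realm, and renders the equivalent minimal Kerberos configuration for a realm and its KDC. The helper names are hypothetical, and the node is not known to generate such a file; this only illustrates what the KdcServerHost and KerberosSecurityRealm values mean in standard Kerberos terms.

```python
def principal_name(primary: str, realm: str) -> str:
    """Combine a principal primary and a realm into primary@REALM."""
    return f"{primary}@{realm}"


def render_krb5_conf(realm: str, kdc_host: str) -> str:
    """Render a minimal krb5 configuration naming one realm and its KDC."""
    return (
        "[libdefaults]\n"
        f"    default_realm = {realm}\n"
        "\n"
        "[realms]\n"
        f"    {realm} = {{\n"
        f"        kdc = {kdc_host}\n"
        "    }\n"
    )


# Example values taken from the property descriptions above.
print(principal_name("analyst", "HDFS.EXAMPLE.COM"))
print(render_krb5_conf("HDFS.EXAMPLE.COM", "kdc.example.com"))
```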
Kerberos SSO authentication
If you select Kerberos SSO in the ServerAuthenticationMethod property, the node uses Single Sign On to authenticate the user.
A Kerberos ticket-granting ticket (TGT) is a small amount of encrypted data that is issued by a server in the Kerberos authentication model to begin the authentication process. A kinit command is run to obtain or renew a Kerberos ticket-granting ticket. There are various methods by which a kinit command can be run to obtain a TGT:
- Your company may have configured your machine to perform a kinit operation when you log in.
- If you are using the MIT Kerberos client, you can use the user interface to manage TGTs.
- Alternatively, the kinit program can be invoked by adding
C:\Program Files\Data360Analyze\jre\bin\kinit.exe
to the laeenv.bat file.
You must also identify a valid Kerberos configuration file to be used by the authentication process in the KerberosConfiguration property. If you do not specify a value, a default file path is used, which depends on the platform on which the Data360 Analyze server is running. On Windows, the default file path is '%WINDIR%\krb5.ini' (e.g. C:\Windows\krb5.ini). On Linux, the default file path is '/etc/krb5.conf'.
For Kerberos SSO authentication, you do not need to configure the ServerUsername, ServerPassword, KdcServerHost or KerberosSecurityRealm properties.
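The platform-dependent default described above can be sketched as follows. This is a minimal illustration, assuming the Windows directory is read from the `WINDIR` environment variable; the function names are hypothetical and the node's own lookup may differ.

```python
import ntpath
import os
import sys


def default_krb5_path(platform: str = sys.platform, environ=os.environ) -> str:
    """Return the default Kerberos configuration path for a platform."""
    if platform.startswith("win"):
        # Assumption: fall back to C:\Windows if WINDIR is not set.
        windir = environ.get("WINDIR", r"C:\Windows")
        return ntpath.join(windir, "krb5.ini")
    return "/etc/krb5.conf"


def resolve_krb5_path(kerberos_configuration: str = "") -> str:
    """Use the configured path if given, otherwise the platform default."""
    return kerberos_configuration or default_krb5_path()
```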
Properties
ServerUrl
Specify the URL of the HDFS server hosting the HDFS site (e.g. http://www.Hdfs-server.example.com). The URL must be correctly formatted, or the node will fail.
Choose the (from Field) variant of this property to look up the value from an input field with the name specified.
A value is required for this property.
ServerPath
Optionally specify the path of the server file to download. Must be an absolute path.
Choose the (from Field) variant of this property to look up the value from an input field with the name specified.
DataOutputMode
Optionally specify whether data is written to a file on disk or to an output field. Choose from:
- Field
- File
The default value is Field.
This property determines how DataOutputField and DataOutputDirectory behave.
DataOutputField
Optionally specify the name of the output field that contains either the response body or the filenames where the response body has been written.
The behavior of this property depends upon the DataOutputMode. If DataOutputMode is Field, it names the field where the response body is output.
If DataOutputMode is File, the output field named in the property contains the full path of the files that contain the response bodies.
The data type of this field is set by the DataOutputFieldType property. The DataOutputFieldType property is useful if the response body has Unicode text data and DataOutputMode is set to Field.
The default value is "_Output".
DataOutputDirectory
Optionally specify the directory where response bodies are written when DataOutputMode is set to File. When DataOutputDirectory is blank, files are written to the Data360 Analyze temporary directory. Otherwise, the files are written to the specified directory, which must exist and be writeable. By default, this node does not overwrite existing files. This behavior can be changed by configuring the ExceptionBehavior properties.
This property can only be filled in when DataOutputMode is set to File.
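The directory rules above can be sketched as a small validation step. This is an illustration only: the system temporary directory stands in for the Data360 Analyze temporary directory, and the function name is hypothetical.

```python
import os
import tempfile


def target_directory(data_output_directory: str = "") -> str:
    """Pick the directory that downloaded files would be written to."""
    if not data_output_directory:
        # Stand-in for the Data360 Analyze temporary directory (assumption).
        return tempfile.gettempdir()
    if not os.path.isdir(data_output_directory):
        raise NotADirectoryError(f"{data_output_directory} does not exist")
    if not os.access(data_output_directory, os.W_OK):
        raise PermissionError(f"{data_output_directory} is not writeable")
    return data_output_directory
```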
DataOutputFieldEncoding
Optionally specify whether and how to encode the HTTP response data when writing to the field specified by DataOutputField.
The data returned from an HDFS server via HTTP call can be either ASCII text, Unicode text, or binary. Since Data360 Analyze does not support binary data in records, this data must be encoded, or put in a valid data format, before being output to a pin. In addition, if DataOutputFieldType is set to String, then any Unicode data will also have to be encoded in order to avoid errors. Choose from:
- Auto - Determines whether to Base64 encode the data based on the Content-Type of the HTTP response. This setting will encode all data types except for text, html, and xml.
- Base64 - Encodes all DataOutputField values using Base64 encoding. This is the safest option.
- None - Do not encode any of the output data. If binary data comes in an HTTP response, then an error will be thrown, and the node will stop processing. This option should only be used when the user can guarantee that the returned data is not binary and is of the same type as the DataOutputFieldType.
The default value is Auto.
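The three encoding options can be illustrated with the sketch below. It assumes that the Auto mode treats any Content-Type containing "text", "html", or "xml" as text and that text is decoded as UTF-8; both are assumptions made for the example, and the function name is hypothetical.

```python
import base64

# Assumption: substrings of the Content-Type that mark a response as text.
TEXT_MARKERS = ("text", "html", "xml")


def encode_body(body: bytes, content_type: str, mode: str = "Auto") -> str:
    """Return the field value for a response body under a given encoding mode."""
    if mode == "Base64":
        return base64.b64encode(body).decode("ascii")
    if mode == "Auto":
        is_text = any(marker in content_type.lower() for marker in TEXT_MARKERS)
        if is_text:
            return body.decode("utf-8")
        return base64.b64encode(body).decode("ascii")
    if mode == "None":
        # Raises UnicodeDecodeError if the body is not valid text.
        return body.decode("utf-8")
    raise ValueError(f"unknown encoding mode: {mode}")
```

A consumer of Base64-encoded output can recover the original bytes with `base64.b64decode`.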
DataOutputFieldType
Optionally specify the type of the field named in DataOutputField. Choose from:
- String - The data output field will be a Data360 Analyze string type.
- Unicode - The data output field will be a Data360 Analyze Unicode string type.
The default value is Unicode.
ConvertTimesToLocal
Optionally specify whether to convert server times from UTC time zone to local time.
The default value is True.
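As an illustration of this conversion, the sketch below turns a server timestamp into either a UTC or a local time. It assumes the timestamp is expressed in milliseconds since the Unix epoch, which is how WebHDFS reports file modification times; the function name is hypothetical.

```python
from datetime import datetime, timezone


def server_time(ms_since_epoch: int, convert_to_local: bool = True) -> datetime:
    """Convert a millisecond epoch timestamp to a UTC or local datetime."""
    utc = datetime.fromtimestamp(ms_since_epoch / 1000, tz=timezone.utc)
    # astimezone() with no argument converts to the machine's local time zone.
    return utc.astimezone() if convert_to_local else utc
```

Both results refer to the same instant; only the time zone used to display it differs.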
PassThroughFields
Optionally specify which input fields will "pass through" the node unchanged from the input to the output, assuming that the input exists. The input fields specified will appear on those output records which were produced as a result of the input fields. Choose from:
- All - Passes through all the input data fields to the output.
- None - Passes none of the input data fields to the output; as such, only the fields created by the node appear on the output.
- Used - Passes through all the fields that the node used to create the output. Used fields include any input field referenced by a property, be it explicitly (i.e., via a 'field1' reference) or via a field pattern (i.e., '1:foo*').
- Unused - Passes through all the fields that the node did not use to create the output.
The default value is Used.
If a naming conflict exists between a pass-through field and an explicitly named output field, an error will occur.
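The four options and the naming-conflict rule can be sketched as simple list operations. The function names and the field names in the usage example are hypothetical; this is an illustration of the described behavior rather than the node's implementation.

```python
def pass_through(input_fields, used_fields, mode="Used"):
    """Select the input fields that are copied to the output."""
    used = set(used_fields)
    if mode == "All":
        return list(input_fields)
    if mode == "None":
        return []
    if mode == "Used":
        return [f for f in input_fields if f in used]
    if mode == "Unused":
        return [f for f in input_fields if f not in used]
    raise ValueError(f"unknown PassThroughFields value: {mode}")


def output_fields(input_fields, used_fields, node_fields, mode="Used"):
    """Combine pass-through fields with the fields created by the node."""
    passed = pass_through(input_fields, used_fields, mode)
    conflicts = set(passed) & set(node_fields)
    if conflicts:
        # A pass-through field may not share a name with a node output field.
        raise ValueError(f"naming conflict: {sorted(conflicts)}")
    return passed + list(node_fields)


# Hypothetical field names, for illustration only.
print(output_fields(["path", "owner"], ["path"], ["_Output"], "Used"))
```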
ServerAuthenticationMethod
Optionally specify the authentication method to be used on the Hadoop Cluster. Choose from:
- Username - The username defined in the ServerUsername property is used to identify the user on the Hadoop cluster when security is not enabled on the cluster. If not supplied, the default username is "Username".
- Username and Password - The username defined in the ServerUsername property and the password defined in the ServerPassword property are used to authenticate the user on the Hadoop cluster when it is secured by Knox Gateway perimeter security.
- Kerberos SSO - The node uses Single Sign On to authenticate the user. The node will use an existing Kerberos ticket for the Hadoop cluster. The KerberosConfiguration property must identify a valid Kerberos configuration file to be used by the authentication process.
- Kerberos Isolated - The node uses the supplied credentials to authenticate the user. The ServerUsername, ServerPassword, KdcServerHost and KerberosSecurityRealm properties must be specified.
The default value is Username.
ServerUsername
Optionally specify the username on the HDFS server. This may contain the domain, if required, in the format "Domain\Username".
This property must be specified when the ServerAuthenticationMethod property is set to Username, Username and Password, or Kerberos Isolated.
Choose the (from Field) variant of this property to look up the value from an input field with the name specified.
ServerPassword
Optionally specify the password for the user on the HDFS Server.
This property must be specified when the ServerAuthenticationMethod property is set to Username and Password or Kerberos Isolated.
Choose the (from Field) variant of this property to look up the value from an input field with the name specified.
KdcServerHost
Optionally specify the name of the Kerberos Key Distribution Center server, e.g. kdc.example.com.
This property must be specified when the ServerAuthenticationMethod property is set to Kerberos Isolated.
Choose the (from Field) variant of this property to look up the value from an input field with the name specified.
KerberosSecurityRealm
Optionally specify the name of the Kerberos realm, e.g. EXAMPLE.COM.
This property must be specified when the ServerAuthenticationMethod property is set to Kerberos Isolated.
Choose the (from Field) variant of this property to look up the value from an input field with the name specified.
KerberosConfiguration
Optionally specify the file path of the Kerberos configuration file to be used during the authentication process. If not specified, a default file path is used, which depends on the platform on which the Data360 Analyze server is running. On Windows, the default file path is %WINDIR%\krb5.ini and on Linux, the default file path is /etc/krb5.conf.
A valid Kerberos configuration file must be identified by this property when the ServerAuthenticationMethod property is set to Kerberos SSO.
Choose the (from Field) variant of this property to look up the value from an input field with the name specified.
TrustStoreFile
Optionally specify the name of the truststore file. The default value is the Java Runtime Environment (JRE) truststore file.
TrustStoreFilePassword
Optionally specify the password relating to the TrustStoreFile. This property is mandatory if the TrustStoreFile property is specified. The default value is the Java Runtime Environment (JRE) truststore file password.
FileExistsBehavior
Optionally specify what to do when a file being downloaded already exists on the local machine. Choose from:
- Error - Give a transfer error and skip the file.
- Log - Log a warning message and skip the file.
- Ignore - Skip the file.
- Overwrite - Overwrite the file.
- Update - Overwrite if the file being downloaded is newer than the existing file.
The default value is Error.
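The options above amount to a small decision table, sketched below. The action labels returned by the function are hypothetical, and the example assumes that Update silently skips a file that is not newer than the local copy, which the description does not state.

```python
def existing_file_action(behavior: str, remote_mtime: float, local_mtime: float) -> str:
    """Decide what to do when the download target already exists locally."""
    if behavior == "Update":
        # Assumption: a remote file that is not newer is skipped silently.
        return "write" if remote_mtime > local_mtime else "skip"
    actions = {
        "Error": "error_and_skip",   # counts as a transfer error
        "Log": "warn_and_skip",      # logs a warning message
        "Ignore": "skip",
        "Overwrite": "write",
    }
    if behavior not in actions:
        raise ValueError(f"unknown FileExistsBehavior value: {behavior}")
    return actions[behavior]
```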
ErrorThreshold
Optionally specify the number of transfer errors that will cause the node to give up and fail. Each record on the input pin is a "request". A transfer error is any error that causes a request to fail (e.g. a requested file does not exist). Setting this property instructs the node to continue processing requests as long as the number of errors remains below the given threshold.
An ErrorThreshold of 0 means never fail on a transfer error (the node will still fail on more serious errors). The default value is 1 - the node fails on the first error encountered.
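The threshold semantics can be sketched as a loop over requests. This is a minimal illustration, assuming transfer errors surface as `OSError` exceptions; the function names and the `download` callable are hypothetical.

```python
def run_requests(requests, download, error_threshold: int = 1):
    """Process each request, failing once the error count reaches the threshold."""
    errors = 0
    results = []
    for request in requests:
        try:
            results.append(download(request))
        except OSError as exc:  # stand-in for a transfer error (assumption)
            errors += 1
            # A threshold of 0 means transfer errors never stop the run.
            if error_threshold > 0 and errors >= error_threshold:
                raise RuntimeError(f"error threshold reached after {errors} error(s)") from exc
    return results, errors
```

With the default threshold of 1, the first failed request stops the run; with a threshold of 0, all requests are attempted and the error count is returned alongside the results.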
Inputs and outputs
Inputs: 1 optional (input fields).
Outputs: downloaded files.