HDFS Upload - Data360_Analyze - Latest

Data360 Analyze Server Help

Product type: Software
Portfolio: Verify
Product family: Data360
Product: Data360 Analyze
Version: Latest
Language: English
Copyright: 2024
First publish date: 2016
Last updated: 2024-11-28
Published on: 2024-11-28T15:26:57.181000

Uploads files to a specified HDFS server using the WebHDFS API.

Enables users to access a secure Hadoop cluster by providing support for authentication using the Kerberos protocol.

For example, you can use the HDFS Upload node to publish results files to the Hadoop Distributed File System of a Hadoop cluster. When run, the node uploads the specified files to HDFS and outputs details of the files that were uploaded.
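As background on the WebHDFS API mentioned above: in the WebHDFS REST API, a file is created with an HTTP PUT to a URL of the form /webhdfs/v1/<path>?op=CREATE. The NameNode answers with a redirect to a DataNode, and the file content is sent in a second PUT to that location. The following Python sketch only builds the first-step URL. The function name, the user name and the file path are hypothetical, and the sketch is not a description of how the node itself is implemented.

```python
from urllib.parse import quote, urlencode


def webhdfs_create_url(server_url, server_path, user=None, permission=None, overwrite=False):
    """Build the URL for the first step of a WebHDFS upload (op=CREATE).

    Illustrative helper only; it performs no network access.
    """
    if not server_path.startswith("/"):
        raise ValueError("server_path must be an absolute path")
    params = {"op": "CREATE", "overwrite": "true" if overwrite else "false"}
    if user is not None:
        params["user.name"] = user          # identifies the user when security is off
    if permission is not None:
        params["permission"] = permission   # three octal digits, e.g. "644"
    return f"{server_url.rstrip('/')}/webhdfs/v1{quote(server_path)}?{urlencode(params)}"


url = webhdfs_create_url(
    "http://www.hdfs-server.example.com:50070",
    "/results/summary.csv",   # hypothetical target path
    user="analyst",           # hypothetical user name
    permission="644",
)
print(url)
```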

Data360 Analyze also provides the HDFS Directory List node and the HDFS Download node. You can use the HDFS nodes together, for example:

[Figure: HDFS nodes]

Authentication options

There are a number of authentication properties on the node. The specific properties that you need to configure are determined by your authentication method. The following table outlines which authentication properties you need to configure for each of the authentication methods:

Authentication method   ServerUsername  ServerPassword  KdcServerHost  KerberosSecurityRealm  KerberosConfiguration
Username                Yes             -               -              -                      -
Username and Password   Yes             Yes             -              -                      -
Kerberos SSO            -               -               -              -                      Yes
Kerberos Isolated       Yes             Yes             Yes            Yes                    -
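
The table above can also be read as a lookup from authentication method to the properties that must be set. The following Python sketch illustrates that reading. The dictionary and the function name are hypothetical and are not part of the product.

```python
# Properties required for each authentication method, as listed in the table.
REQUIRED_PROPERTIES = {
    "Username": {"ServerUsername"},
    "Username and Password": {"ServerUsername", "ServerPassword"},
    "Kerberos SSO": {"KerberosConfiguration"},
    "Kerberos Isolated": {
        "ServerUsername",
        "ServerPassword",
        "KdcServerHost",
        "KerberosSecurityRealm",
    },
}


def missing_properties(method, configured):
    """Return the required properties that are unset or empty, sorted by name."""
    required = REQUIRED_PROPERTIES[method]
    return sorted(name for name in required if not configured.get(name))


# Example values below are placeholders.
missing = missing_properties(
    "Kerberos Isolated",
    {"ServerUsername": "alice", "ServerPassword": "secret"},
)
print(missing)
```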

Username authentication

Enter your HDFS server username in the ServerUsername property.

For this authentication method, the username is used to identify the user on the Hadoop cluster when security is not enabled on the cluster. The default value is "Username".

Username and password authentication

Enter your HDFS server username in the ServerUsername property and the corresponding password in the ServerPassword property. The username and password information is used to authenticate the user on the Hadoop cluster when it is secured by Knox Gateway perimeter security.

The username and password must correspond with the credentials configured in the authentication server used by the Knox Gateway.

Note: This option uses HTTP basic authentication, meaning that user credentials are passed to the cluster in (obfuscated) clear text. It is recommended that you use HTTPS (SSL) to secure the communication link between the Data360 Analyze server and the Knox Gateway server.

Where SSL is used as the transport protocol, the Data360 Analyze server must be configured with the security certificate for the Knox Gateway server. The Java Runtime Environment (JRE) truststore file is used by default if no TrustStoreFile is specified. If the TrustStoreFile is specified, the TrustStoreFilePassword must also be specified.
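
To illustrate why the note above recommends HTTPS: with HTTP basic authentication, the credentials are only Base64-encoded, and anyone who can read the request can decode them. The following Python sketch shows the encoding and its reversal. The function name and the credentials are hypothetical.

```python
import base64


def basic_auth_header(username, password):
    """Return the value of an HTTP basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


header = basic_auth_header("analyst", "s3cret")  # placeholder credentials

# Reversing the encoding needs no key, which is why the link should be encrypted.
decoded = base64.b64decode(header.split(" ", 1)[1]).decode("utf-8")
print(decoded)
```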

Kerberos Isolated authentication

Enter your HDFS server username in the ServerUsername property and the corresponding password in the ServerPassword property. The username should be the Kerberos principal primary to be used to access the cluster, and the password is the principal's password on the cluster.

When using Kerberos Isolated authentication, there are additional Kerberos properties that you need to configure:

  • KdcServerHost - Specify the name of the server hosting the Kerberos Key Distribution Center (KDC), e.g. kdc.example.com.
  • KerberosSecurityRealm - Specify the name of the realm to use for Kerberos security, e.g. HDFS.EXAMPLE.COM.

Kerberos SSO authentication

If you select Kerberos SSO in the ServerAuthenticationMethod property, the node uses Single Sign On to authenticate the user.

Note: For this type of authentication, a valid Kerberos ticket-granting ticket (TGT) is required for the Hadoop cluster.

A Kerberos ticket-granting ticket (TGT) is a small amount of encrypted data that is issued by a server in the Kerberos authentication model to begin the authentication process. A kinit command is run to obtain or renew a Kerberos ticket-granting ticket. There are various methods by which a kinit command can be run to obtain a TGT:

  • Your company may have configured your machine to perform a kinit operation when you log in.
  • If you are using the MIT Kerberos client, you can use the user interface to manage TGTs.
  • Alternatively, the kinit program can be invoked by adding C:\Program Files\Data360Analyze\jre\bin\kinit.exe to the laeenv.bat file.

You must also identify a valid Kerberos configuration file to be used by the authentication process in the KerberosConfiguration property. If you do not specify a value, a default file path is used which depends on the platform on which the Data360 Analyze server is running. On Windows, the default file path is '%WINDIR%\krb5.ini' (e.g. C:\Windows\krb5.ini). On Linux, the default file path is '/etc/krb5.conf'.
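
The platform-dependent default can be sketched in Python as follows. This is an illustration only: it assumes that the Windows default resolves under the directory named by the WINDIR environment variable, as the C:\Windows\krb5.ini example suggests, and the function name is hypothetical.

```python
import os
import platform


def default_krb5_config(system=None, environ=None):
    """Return the default Kerberos configuration path for a platform.

    'system' and 'environ' can be passed explicitly; by default they are
    taken from the machine the code runs on.
    """
    system = system or platform.system()
    environ = os.environ if environ is None else environ
    if system == "Windows":
        # Assumption: fall back to C:\Windows when WINDIR is not set.
        windir = environ.get("WINDIR", r"C:\Windows")
        return windir.rstrip("\\") + r"\krb5.ini"
    return "/etc/krb5.conf"


print(default_krb5_config("Windows", {"WINDIR": r"C:\Windows"}))
print(default_krb5_config("Linux", {}))
```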

For Kerberos SSO authentication, you do not need to configure the ServerUsername, ServerPassword, KdcServerHost or KerberosSecurityRealm properties.

Properties

ServerUrl

Specify the URL of the HDFS server hosting the HDFS site (e.g. http://www.hdfs-server.example.com). The URL must be correctly formatted, or the node will fail.

Note: If no port is specified in the URL then the port will default to 50470 if using the HTTPS protocol and port 50070 if using the HTTP protocol.
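
The port defaulting described in the note can be sketched in Python as follows. The function name is hypothetical, and the check for a well-formed URL is only a rough approximation of the validation that the node performs.

```python
from urllib.parse import urlsplit

# Default ports stated in the note above.
DEFAULT_PORTS = {"http": 50070, "https": 50470}


def resolve_port(server_url):
    """Return the explicit port from the URL, or the protocol default."""
    parts = urlsplit(server_url)
    if parts.scheme not in DEFAULT_PORTS or not parts.hostname:
        raise ValueError(f"not a valid HTTP(S) URL: {server_url!r}")
    return parts.port if parts.port is not None else DEFAULT_PORTS[parts.scheme]


print(resolve_port("http://www.hdfs-server.example.com"))
print(resolve_port("https://www.hdfs-server.example.com"))
print(resolve_port("https://www.hdfs-server.example.com:8443"))
```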

Choose the (from Field) variant of this property to look up the value from an input field with the name specified.

A value is required for this property.

ServerPath

Specify the path of the server directory to which the file will be uploaded (must be an absolute path).

Choose the (from Field) variant of this property to look up the value from an input field with the name specified.

A value is required for this property.

LocalPath

Specify the path of the file to upload (must be an absolute path).

Choose the (from Field) variant of this property to look up the value from an input field with the name specified.

A value is required for this property.

Permissions

Optionally specify the permissions to assign to the uploaded file.

This value should consist of three octal digits (000 - 777) corresponding to the permissions for the user, group and other classes. If not specified, this attribute is not sent to the server, and the server assigns default permissions of "755".
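
As an illustration of the three-digit octal format, the following Python sketch validates a value and renders it in the familiar rwx notation. The function name and the sample value "640" are hypothetical.

```python
import re
import stat


def parse_permissions(value):
    """Validate a three-digit octal permission string and return it as an int."""
    if not re.fullmatch(r"[0-7]{3}", value):
        raise ValueError(f"expected three octal digits (000 - 777), got {value!r}")
    return int(value, 8)


mode = parse_permissions("640")
# Render the mode for a regular file: user rw-, group r--, other ---.
rendered = stat.filemode(stat.S_IFREG | mode)
print(rendered)  # -rw-r-----
```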

Choose the (from Field) variant of this property to look up the value from an input field with the name specified.

PassThroughFields

Optionally specify which input fields will "pass through" the node unchanged from the input to the output, assuming that the input exists. The input fields specified will appear on those output records which were produced as a result of the input fields.

The following options are available:

  • All - Passes through all the input data fields to the output.
  • None - Passes none of the input data fields to the output; as such, only the fields created by the node appear on the output.
  • Used - Passes through all the fields that the node used to create the output. Used fields include any input field referenced by a property, be it explicitly (i.e., via a 'field1' reference) or via a field pattern (i.e., '1:foo*').
  • Unused - Passes through all the fields that the node did not use to create the output.

If a naming conflict exists between a pass-through field and an explicitly named output field, an error will occur.

The default value is Used.
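
The four options can be sketched in Python as a simple field filter. The function name and the field names are hypothetical, and the sketch does not model the naming-conflict check.

```python
def pass_through(input_fields, used_fields, option="Used"):
    """Return the input fields that are copied to the output, in input order."""
    used = set(used_fields)
    if option == "All":
        return list(input_fields)
    if option == "None":
        return []
    if option == "Used":
        return [name for name in input_fields if name in used]
    if option == "Unused":
        return [name for name in input_fields if name not in used]
    raise ValueError(f"unknown PassThroughFields option: {option!r}")


fields = ["LocalPath", "ServerPath", "BatchId"]   # hypothetical input fields
used = ["LocalPath", "ServerPath"]                # fields referenced by properties
print(pass_through(fields, used))                 # default option: Used
print(pass_through(fields, used, "Unused"))
```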

ServerAuthenticationMethod

Optionally specify the authentication method to be used on the Hadoop Cluster. Choose from:

  • Username - The username defined in the ServerUsername property is used to identify the user on the Hadoop cluster when security is not enabled on the cluster. If not supplied, the default username is "Username".
  • Username and Password - The username defined in the ServerUsername property and the password defined in the ServerPassword property are used to authenticate the user on the Hadoop cluster when it is secured by Knox Gateway perimeter security.
  • Kerberos SSO - The node uses Single Sign On to authenticate the user. The node will use an existing Kerberos ticket for the Hadoop cluster. The KerberosConfiguration property must identify a valid Kerberos configuration file to be used by the authentication process.
  • Kerberos Isolated - The node uses the supplied credentials to authenticate the user. The ServerUsername, ServerPassword, KdcServerHost and KerberosSecurityRealm properties must be specified.

The default value is Username.

ServerUsername

Optionally specify the username on the HDFS server. This may contain the domain, if required, in the format "Domain\Username".

This property must be specified when the ServerAuthenticationMethod property is set to Kerberos Isolated, Username and Password or Username.

Choose the (from Field) variant of this property to look up the value from an input field with the name specified.

ServerPassword

Optionally specify the password for the user on the HDFS Server.

A value is required for this property when the ServerAuthenticationMethod property is set to Kerberos Isolated or Username and Password.

Choose the (from Field) variant of this property to look up the value from an input field with the name specified.

KdcServerHost

Optionally specify the name of the Kerberos Key Distribution Center server, e.g. kdc.example.com.

This property must be specified when the ServerAuthenticationMethod property is set to Kerberos Isolated.

Choose the (from Field) variant of this property to look up the value from an input field with the name specified.

KerberosSecurityRealm

Optionally specify the name of the Kerberos realm, e.g. EXAMPLE.COM.

This property must be specified when the ServerAuthenticationMethod property is set to Kerberos Isolated.

Choose the (from Field) variant of this property to look up the value from an input field with the name specified.

KerberosConfiguration

Optionally specify the file path of the Kerberos configuration file to be used during the authentication process. If not specified, a default file path is used which depends on the platform on which the Data360 Analyze server is running. On Windows, the default file path is %WINDIR%\krb5.ini and on Linux, the default file path is /etc/krb5.conf.

A valid Kerberos configuration file must be identified by this property when the ServerAuthenticationMethod property is set to Kerberos SSO.

Choose the (from Field) variant of this property to look up the value from an input field with the name specified.

TrustStoreFile

Optionally specify the name of the truststore file.

The default is the Java Runtime Environment (JRE) truststore file.

TrustStoreFilePassword

Optionally specify the password relating to the TrustStoreFile.

A value is required for this property when the TrustStoreFile property is specified.

The default is the Java Runtime Environment (JRE) truststore file password.

FileExistsBehavior

Optionally specify what to do when a file being uploaded already exists on the server. Choose from:

  • Error - Give a transfer error and skip the file.
  • Log - Log a warning message and skip the file.
  • Ignore - Skip the file.
  • Overwrite - Overwrite the file.
  • Update - Overwrite if the file being uploaded is newer than the existing file.

The default value is Error.
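
The options above can be sketched in Python as a decision function for a file that already exists at the destination. The function name, the logger name and the use of modification times for the Update comparison are assumptions made for illustration.

```python
import logging

log = logging.getLogger("hdfs_transfer_sketch")  # hypothetical logger name


def should_transfer(behavior, source_mtime, existing_mtime):
    """Decide what to do when the destination file already exists.

    Returns a (transfer, is_error) pair: whether to transfer the file, and
    whether the outcome counts as a transfer error.
    """
    if behavior == "Error":
        return False, True                      # skip and report a transfer error
    if behavior == "Log":
        log.warning("file already exists; skipping")
        return False, False
    if behavior == "Ignore":
        return False, False                     # skip silently
    if behavior == "Overwrite":
        return True, False
    if behavior == "Update":
        # Assumption: "newer" is judged by modification time.
        return source_mtime > existing_mtime, False
    raise ValueError(f"unknown FileExistsBehavior: {behavior!r}")


print(should_transfer("Update", source_mtime=200, existing_mtime=100))
print(should_transfer("Error", source_mtime=200, existing_mtime=100))
```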

ErrorThreshold

Optionally specify the number of transfer errors that will cause the node to give up and fail.

Each record on the input pin is a "request". A transfer error is any error that causes a request to fail (e.g. a requested file does not exist). Setting this property instructs the node to continue processing requests as long as the number of errors remains below the given threshold.

An ErrorThreshold of 0 means never fail on a transfer error (the node will still fail on more serious errors).

The default value is 1 - the node fails on the first error encountered.
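
The threshold logic can be sketched in Python as a loop over requests. The function names are hypothetical, and treating OSError as the "transfer error" category is an assumption made for illustration.

```python
def process_requests(requests, transfer, error_threshold=1):
    """Process each request, failing once the error count reaches the threshold.

    A threshold of 0 means transfer errors never cause a failure.
    Returns the successful requests and the number of transfer errors.
    """
    uploaded, errors = [], 0
    for request in requests:
        try:
            transfer(request)
            uploaded.append(request)
        except OSError as exc:  # assumption: transfer errors surface as OSError
            errors += 1
            if error_threshold > 0 and errors >= error_threshold:
                raise RuntimeError(
                    f"error threshold reached after {errors} transfer error(s)"
                ) from exc
    return uploaded, errors


def fake_transfer(path):
    """Stand-in for a real upload: fails for paths that start with 'missing'."""
    if path.startswith("missing"):
        raise FileNotFoundError(path)


# With a threshold of 0, the failed request is counted but processing continues.
print(process_requests(["a.csv", "missing.csv", "b.csv"], fake_transfer, error_threshold=0))
```

With the default threshold of 1, the same input raises an error at the second request and "b.csv" is never processed.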

Inputs and outputs

Inputs: 1 optional.

Outputs: uploaded files.