HDFS Directory List

Data360 Analyze Server Help

Product type: Software
Portfolio: Verify
Product family: Data360
Product: Data360 Analyze
Version: 3.12
Language: English
Copyright: 2023
First publish date: 2016

Lists the contents of an HDFS directory using the WebHDFS API.

Enables you to access a secure Hadoop cluster by providing support for authentication using the Kerberos protocol.
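As a rough illustration of what a WebHDFS directory listing request looks like, the sketch below builds a LISTSTATUS URL. The /webhdfs/v1 prefix, the op=LISTSTATUS operation and the user.name parameter belong to the WebHDFS REST API. The function name, host and path are hypothetical, and this is not a description of how the node issues its requests internally.

```python
from urllib.parse import quote, urlencode


def liststatus_url(server_url, path="/", user=None):
    """Build a WebHDFS LISTSTATUS request URL for a directory (illustrative helper)."""
    base = server_url.rstrip("/")
    # Normalise the directory so that it always starts with a single slash.
    hdfs_path = "/" + path.strip("/")
    params = {"op": "LISTSTATUS"}
    if user:
        # user.name identifies the caller when security is not enabled on the cluster.
        params["user.name"] = user
    # WebHDFS exposes the file system under the /webhdfs/v1 prefix.
    return f"{base}/webhdfs/v1{quote(hdfs_path)}?{urlencode(params)}"


# Hypothetical host and directory, for illustration only.
print(liststatus_url("http://hdfs-server.example.com:50070", "/data/sales", "analyst"))
```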

Data360 Analyze also provides the HDFS Download node and the HDFS Upload node. You can use the HDFS nodes together, for example:

[Figure: HDFS nodes]

Authentication options

There are a number of authentication properties on the node. The specific properties that you need to configure are determined by your authentication method. The following table outlines which authentication properties you need to configure for each of the authentication methods:

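Authentication method | ServerUsername | ServerPassword | KdcServerHost | KerberosSecurityRealm | KerberosConfiguration
----------------------+----------------+----------------+---------------+-----------------------+----------------------
Username              | Yes            |                |               |                       |
Username and Password | Yes            | Yes            |               |                       |
Kerberos SSO          |                |                |               |                       | Yes
Kerberos Isolated     | Yes            | Yes            | Yes           | Yes                   |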
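The mapping in the table can also be written as a small lookup, which may help when checking a node configuration by hand. This is a minimal sketch: the dictionary mirrors the table above, while the helper name and the idea of passing properties as a dictionary are assumptions made for illustration and are not part of the product.

```python
# Properties that the table marks as needed for each authentication method.
REQUIRED_PROPERTIES = {
    "Username": {"ServerUsername"},
    "Username and Password": {"ServerUsername", "ServerPassword"},
    "Kerberos SSO": {"KerberosConfiguration"},
    "Kerberos Isolated": {
        "ServerUsername",
        "ServerPassword",
        "KdcServerHost",
        "KerberosSecurityRealm",
    },
}


def missing_properties(method, configured):
    """Return the sorted names of required properties that have no value (hypothetical helper)."""
    required = REQUIRED_PROPERTIES[method]
    return sorted(name for name in required if not configured.get(name))


# Example: Kerberos Isolated with only the credentials filled in.
print(missing_properties(
    "Kerberos Isolated",
    {"ServerUsername": "analyst", "ServerPassword": "secret"},
))
```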

Username authentication

Enter your HDFS server username in the ServerUsername property.

For this authentication method, the username is used to identify the user on the Hadoop cluster when security is not enabled on the cluster. The default value is "Username".

Username and password authentication

Enter your HDFS server username in the ServerUsername property and the corresponding password in the ServerPassword property. The username and password information is used to authenticate the user on the Hadoop cluster when it is secured by Knox Gateway perimeter security.

The username and password must correspond with the credentials configured in the authentication server used by the Knox Gateway.

Note: This option uses HTTP Basic authentication, meaning that user credentials are passed to the cluster Base64-encoded but not encrypted, which is effectively clear text. It is recommended that HTTPS (SSL) is used to secure the communication link between the Data360 Analyze server and the Knox Gateway server.

Where SSL is used as the transport protocol, the Data360 Analyze server must be configured with the security certificate for the Knox Gateway server. The Java Runtime Environment (JRE) truststore file is used by default if no TrustStoreFile is specified. If the TrustStoreFile is specified, the TrustStoreFilePassword must also be specified.
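To show why HTTPS is recommended with this option, the sketch below builds an HTTP Basic Authorization header and then decodes it again without needing any secret. The helper name and the credentials are hypothetical. The header format is the standard HTTP Basic scheme and is not specific to Data360 Analyze or Knox.

```python
import base64


def basic_auth_header(username, password):
    """Return the value of an HTTP Basic Authorization header (illustrative helper)."""
    raw = f"{username}:{password}".encode("utf-8")
    # Base64 is an encoding, not encryption: anyone who sees the header can reverse it.
    return "Basic " + base64.b64encode(raw).decode("ascii")


header = basic_auth_header("analyst", "s3cret")  # hypothetical credentials
print(header)

# Decoding the header recovers the credentials.
recovered = base64.b64decode(header.split(" ", 1)[1]).decode("utf-8")
print(recovered)
```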

Kerberos Isolated authentication

Enter your HDFS server username in the ServerUsername property and the corresponding password in the ServerPassword property. The username should be the Kerberos principal primary to be used to access the cluster, and the password is the principal's password on the cluster.

When using Kerberos Isolated authentication, there are additional Kerberos properties that you need to configure:

  • KdcServerHost - Specify the name of the server hosting the Kerberos Key Distribution Center (KDC), e.g. kdc.example.com.
  • KerberosSecurityRealm - Specify the name of the realm to use for Kerberos security, e.g. HDFS.EXAMPLE.COM.

Kerberos SSO authentication

If you select Kerberos SSO in the ServerAuthenticationMethod property, the node uses Single Sign On to authenticate the user.

Note: For this type of authentication, a valid Kerberos ticket-granting ticket (TGT) is required for the Hadoop cluster.

A Kerberos ticket-granting ticket (TGT) is a small amount of encrypted data that is issued by a server in the Kerberos authentication model to begin the authentication process. A kinit command is run to obtain or renew a Kerberos ticket-granting ticket. There are various methods by which a kinit command can be run to obtain a TGT:

  • Your company may have configured your machine to perform a kinit operation when you log in.
  • If you are using the MIT Kerberos client, you can use the user interface to manage TGTs.
  • Alternatively, the kinit program can be invoked by adding C:\Program Files\Data360Analyze\jre\bin\kinit.exe to the laeenv.bat file.

You must also identify a valid Kerberos configuration file to be used by the authentication process in the KerberosConfiguration property. If you do not specify a value, a default file path is used, which depends on the platform on which the Data360 Analyze server is running. On Windows, the default file path is '%WINDIR%\krb5.ini' (e.g. C:\Windows\krb5.ini). On Linux, the default file path is '/etc/krb5.conf'.
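The platform-dependent default can be pictured with the following sketch. It only restates the rule described above. The function name is hypothetical, and the fallback to C:\Windows when WINDIR is not set is an assumption added so that the example runs anywhere.

```python
import os


def default_krb5_path(platform_name=os.name, environ=os.environ):
    """Return the default Kerberos configuration path for a platform (illustrative helper)."""
    if platform_name == "nt":
        # On Windows the file is looked up in the Windows directory.
        # The C:\Windows fallback is an assumption for this sketch.
        windir = environ.get("WINDIR", "C:\\Windows")
        return windir + "\\krb5.ini"
    # On Linux the conventional location is /etc/krb5.conf.
    return "/etc/krb5.conf"


print(default_krb5_path("nt", {"WINDIR": "C:\\Windows"}))
print(default_krb5_path("posix", {}))
```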

For Kerberos SSO authentication, you do not need to configure the ServerUsername, ServerPassword, KdcServerHost or KerberosSecurityRealm properties.

Properties

ServerUrl

Enter the URL of the HDFS server hosting the HDFS site (e.g. http://www.hdfs-server.example.com). The URL must be correctly formatted, or the node will fail.

Note: If no port is specified in the URL then the port will default to 50470 if using the https protocol and port 50070 if using the http protocol.
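The port defaulting rule in the note can be sketched as follows. The port numbers come from the note above. The function name is hypothetical, and rejecting other URL schemes is an assumption of this sketch rather than documented node behavior.

```python
from urllib.parse import urlsplit

# Default ports stated in the note above.
DEFAULT_PORTS = {"http": 50070, "https": 50470}


def effective_port(server_url):
    """Return the port that applies to a ServerUrl value (illustrative helper)."""
    parts = urlsplit(server_url)
    if parts.scheme not in DEFAULT_PORTS:
        raise ValueError(f"unsupported or missing scheme in URL: {server_url!r}")
    # An explicit port in the URL takes precedence over the protocol default.
    return parts.port if parts.port is not None else DEFAULT_PORTS[parts.scheme]


print(effective_port("http://www.hdfs-server.example.com"))
print(effective_port("https://www.hdfs-server.example.com"))
print(effective_port("https://www.hdfs-server.example.com:8443"))
```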

Choose the (from Field) variant of this property to look up the value from an input field with the name specified.

A value is required for this property.

ServerPath

Optionally enter the path of the server directory to list. You can enter a literal value (default) or an input field.

The path includes the HDFS site name, the Document Library, and any nested folders within the Document Library, e.g. MyHdfsSite/SharedDocuments/SalesData.

The default value is "/", which points to the root of the HDFS server.

Pattern

Optionally enter a case-insensitive expression to select specific files from the specified ServerPath using pattern matching. "*" can be used to substitute for any sequence of characters. You can enter a literal value (default) or an input field.

e.g., "data*" can be used to select Data1.csv and data2.csv.
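The effect of a case-insensitive "*" pattern can be approximated with Python's fnmatch module, as in the sketch below. This is an analogy only: fnmatch also gives special meaning to "?" and "[...]", and the node's own matching rules may differ. The function name and file names are hypothetical.

```python
from fnmatch import fnmatchcase


def select_files(names, pattern):
    """Return the names matching a wildcard pattern, ignoring case (illustrative helper)."""
    lowered = pattern.lower()
    # Lower-casing both sides makes the comparison case-insensitive on every platform.
    return [name for name in names if fnmatchcase(name.lower(), lowered)]


print(select_files(["Data1.csv", "data2.csv", "sales.csv"], "data*"))
```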

ConvertTimesToLocal

Optionally specify whether to convert server times from UTC to local time.

If this property is set to True, times will be converted to local time.

By default, times are converted to the local time zone.
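The conversion can be pictured with the sketch below, which assumes the server reports a timestamp as milliseconds since the Unix epoch in UTC, as WebHDFS does for modificationTime. The function name is hypothetical, and the node's actual output format is not shown here.

```python
from datetime import datetime, timedelta, timezone


def to_local(epoch_ms, local_tz=None):
    """Convert a UTC epoch-millisecond timestamp to a local time (illustrative helper).

    When local_tz is None, the time zone of the machine running the code is used.
    """
    utc_time = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return utc_time.astimezone(local_tz)


# Example with a fixed UTC+2 offset so that the result does not depend on the machine.
print(to_local(0, timezone(timedelta(hours=2))).isoformat())
```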

PassThroughFields

Optionally specify which input fields will "pass through" the node unchanged from the input to the output, assuming that the input exists. The input fields specified will appear on those output records which were produced as a result of the input fields.

Choose from:

  • All - passes through all the input data fields to the output.
  • Used - passes through all the fields that the node used to create the output. Includes any input field referenced by a property, be it explicitly (i.e., via a 'field1' reference) or via a field pattern (i.e., '1:foo*').
  • Unused - passes through all the fields that the node did not use to create the output.
  • None - passes none of the input data fields to the output. Only the fields created by the node appear on the output.

The default value is Used.

If a naming conflict exists between a pass-through field and an explicitly named output field, an error will occur.
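The four PassThroughFields options and the naming conflict rule can be summarised in a few lines. This is a simplified sketch of the behavior described above, not the node's implementation. The function name, its arguments and the field names are hypothetical.

```python
def pass_through_fields(input_fields, used_fields, mode="Used", output_fields=()):
    """Select which input fields pass through to the output (illustrative helper)."""
    used = set(used_fields)
    if mode == "All":
        selected = list(input_fields)
    elif mode == "Used":
        selected = [field for field in input_fields if field in used]
    elif mode == "Unused":
        selected = [field for field in input_fields if field not in used]
    elif mode == "None":
        selected = []
    else:
        raise ValueError(f"unknown PassThroughFields value: {mode!r}")
    # A pass-through field must not share a name with a field the node creates.
    created = set(output_fields)
    conflicts = [field for field in selected if field in created]
    if conflicts:
        raise ValueError(f"naming conflict for fields: {conflicts}")
    return selected


fields = ["url", "path", "comment"]  # hypothetical input fields
print(pass_through_fields(fields, used_fields=["url", "path"], mode="Used"))
print(pass_through_fields(fields, used_fields=["url", "path"], mode="Unused"))
```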

ServerAuthenticationMethod

Optionally specify the authentication method to be used on the Hadoop Cluster. Choose from:

  • Username - The username that is specified in the ServerUsername property is used to identify the user on the Hadoop cluster when security is not enabled on the cluster. If not supplied, the default username is "Username".
  • Username and Password - The username that is specified in the ServerUsername property and the password that is specified in the ServerPassword property are used to authenticate the user on the Hadoop cluster when it is secured by Knox Gateway perimeter security.
  • Kerberos SSO - The node uses Single Sign On to authenticate the user. The node will use an existing Kerberos ticket for the Hadoop cluster. The KerberosConfiguration property must identify a valid Kerberos configuration file to be used by the authentication process.
  • Kerberos Isolated - The node uses the specified credentials to authenticate the user. The ServerUsername, ServerPassword, KdcServerHost and KerberosSecurityRealm properties must be specified.

The default value is Username.

ServerUsername

Optionally specify the username on the HDFS server. This may contain the domain, if required, in the format "Domain\Username".

You can enter a literal value (default) or an input field. This property must be specified when the ServerAuthenticationMethod property is set to Username, Username and Password, or Kerberos Isolated.

ServerPassword

Optionally specify the password for the user on the HDFS Server.

This property must be specified when the ServerAuthenticationMethod property is set to Kerberos Isolated or Username and Password.

KdcServerHost

Optionally specify the name of the Kerberos Key Distribution Center server, e.g. kdc.example.com.

This property must be specified when the ServerAuthenticationMethod property is set to Kerberos Isolated.

KerberosSecurityRealm

Optionally specify the name of the Kerberos realm, e.g. EXAMPLE.COM.

This property must be specified when the ServerAuthenticationMethod property is set to Kerberos Isolated.

KerberosConfiguration

Optionally specify the file path of the Kerberos configuration file to be used during the authentication process. If not specified, a default file path is used, which depends on the platform on which the Data360 Analyze server is running. On Windows, the default file path is %WINDIR%\krb5.ini (e.g. C:\Windows\krb5.ini) and on Linux, the default file path is /etc/krb5.conf.

A valid Kerberos configuration file must be identified by this property when the ServerAuthenticationMethod property is set to Kerberos SSO.

TrustStoreFile

Optionally specify the name of the truststore file.

The default is the Java Runtime Environment (JRE) truststore file.

TrustStoreFilePassword

Optionally specify the password relating to the TrustStoreFile.

A value is required in this property if the TrustStoreFile property is specified.

The default is the Java Runtime Environment (JRE) truststore file password.

Inputs and outputs

Inputs: 1 optional.

Outputs: out1.