For Connect to access data located in HDFS, a Hadoop distribution must be installed and configured as follows on the system where Connect jobs and tasks are executed:
- The hadoop command must be accessible to Connect:
  - Connect first looks for the hadoop command in $HADOOP_HOME/bin/hadoop, where the environment variable HADOOP_HOME is set to the directory where Hadoop is installed. Environment variables can be defined through the Environment Variables tab of the Connect Server dialog.
  - If HADOOP_HOME is not defined or the directory cannot be found, Connect looks for the hadoop command in the system path, to which some Hadoop distributions add it automatically (the sketch after this list illustrates this lookup order).
- The fs.default.name property in the core-site.xml configuration file must be set to point to the Hadoop file system.
- The HTTP namenode daemon must be running on the default port 50070. If you would like to use a different port number, please contact Technical Support.
- If the Hadoop cluster requires Kerberos authentication, you must use the dmxkinit utility to run your HDFS extract and load jobs and tasks.
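
The following Python sketch is a minimal illustration of the checks described above: the hadoop command lookup order, reading the fs.default.name property from core-site.xml, and verifying that the namenode HTTP daemon answers on port 50070. It is not part of Connect and does not show how Connect is implemented; the core-site.xml path and the hostname namenode.example.com are placeholder assumptions that vary by environment.

```python
import os
import shutil
import urllib.request
import xml.etree.ElementTree as ET


def locate_hadoop():
    """Mimic the lookup order above: $HADOOP_HOME/bin/hadoop first, then the system path."""
    hadoop_home = os.environ.get("HADOOP_HOME")
    if hadoop_home:
        candidate = os.path.join(hadoop_home, "bin", "hadoop")
        if os.path.isfile(candidate):
            return candidate
    # Fall back to the system path, where some Hadoop distributions add the command automatically.
    return shutil.which("hadoop")


def read_default_fs(core_site_path="/etc/hadoop/conf/core-site.xml"):
    """Read fs.default.name from core-site.xml (the path is an assumption; it varies by distribution)."""
    root = ET.parse(core_site_path).getroot()
    for prop in root.findall("property"):
        if prop.findtext("name") == "fs.default.name":
            return prop.findtext("value")
    return None


def namenode_http_reachable(host="namenode.example.com", port=50070):
    """Check that the namenode HTTP daemon responds on the default port 50070."""
    try:
        urllib.request.urlopen(f"http://{host}:{port}/", timeout=5)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    print("hadoop command:", locate_hadoop() or "not found")
    print("fs.default.name:", read_default_fs() or "not set")
    print("namenode HTTP port 50070 reachable:", namenode_http_reachable())
```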