There are multiple ways to consume the data produced by a Spark job, MapReduce job, or Hive query.
To verify that the output exists, use the following command, which lists the output files:
hadoop fs -ls /dir/on/hdfs/output
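For a successful MapReduce job, the listing typically contains a _SUCCESS marker file and one part file per reducer. The file names, sizes, and dates below are illustrative only:
-rw-r--r--   3 user group          0 2016-01-01 00:00 /dir/on/hdfs/output/_SUCCESS
-rw-r--r--   3 user group    1048576 2016-01-01 00:00 /dir/on/hdfs/output/part-r-00000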
To display the size of files and directories contained in the given output directory, use the following command:
hadoop fs -du /dir/on/hdfs/output
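On Hadoop 2 and later, the -s and -h flags can be added to show a single summarized total in human-readable units:
hadoop fs -du -s -h /dir/on/hdfs/output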
To view the contents of an output file, use the following command:
hadoop fs -cat /dir/on/hdfs/output/part-r-00000 | more
To view the first few lines of the file, use the following command:
hadoop fs -cat /dir/on/hdfs/output/part-r-00000 | head
To view the last kilobyte of the file, use the following command:
hadoop fs -tail /dir/on/hdfs/output/part-r-00000
The tail command also supports the Unix -f option, which monitors the specified file: as another process appends new lines to the file, tail updates the display.
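For example, to follow a part file while it is being written (the file name is illustrative):
hadoop fs -tail -f /dir/on/hdfs/output/part-r-00000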
To copy the output from HDFS to the Linux file system, use the following commands:
mkdir /pb/spectrum-bigdata-geocoding/out
hadoop fs -copyToLocal /dir/on/hdfs/output/* /pb/spectrum-bigdata-geocoding/out
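The get command performs the same copy to a local destination and is the more common spelling of copyToLocal:
hadoop fs -get /dir/on/hdfs/output/* /pb/spectrum-bigdata-geocoding/out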
To concatenate the output files from an HDFS directory into a single file on the Linux file system, use the following command:
hadoop fs -getmerge /dir/on/hdfs/output/* /pb/spectrum-bigdata-geocoding/out/merged_output.txt addnl
The addnl argument is optional; when specified, a newline character is appended to the end of each concatenated file.
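On newer Hadoop releases, the trailing addnl argument is replaced by the -nl flag, which precedes the paths:
hadoop fs -getmerge -nl /dir/on/hdfs/output /pb/spectrum-bigdata-geocoding/out/merged_output.txt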
To copy the output from HDFS to another location in HDFS, use the following command:
hadoop fs -cp /dir/on/hdfs/output/* /dir/on/hdfs/copy_of_output
To copy the output recursively from one HDFS cluster to another, use the following command (note that distcp is a hadoop subcommand, not an fs option):
hadoop distcp <path_of_output_dir_on_hdfs> <dir_path_on_other_hdfs>
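For example, with fully qualified cluster URIs (the NameNode host names and port here are placeholders):
hadoop distcp hdfs://namenode1:8020/dir/on/hdfs/output hdfs://namenode2:8020/dir/on/hdfs/copy_of_output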
To access the output in Hive, create an external table over the output directory using the following command:
hive> CREATE EXTERNAL TABLE hexbin (id string, wkt string, long double, lat double)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" LINES TERMINATED BY "\n"
      STORED AS TEXTFILE LOCATION '/dir/on/hdfs/hive_output';
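Once the table is created, the output can be queried in place; for example, to spot-check the first few rows:
hive> SELECT id, wkt FROM hexbin LIMIT 10;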
The output on the Linux file system can now be used for further processing.
For example, if the output contains WKT, it can be imported into a database or product that supports spatial objects in WKT format, such as PostGIS, FME, or SAP HANA.
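As an illustrative PostGIS sketch (the staging table name and SRID are assumptions, not part of the product output), WKT strings can be converted into geometries with ST_GeomFromText:
-- staging_hexbin is assumed to hold the raw text output (id, wkt, long, lat)
CREATE TABLE hexbin_geom AS
SELECT id, ST_GeomFromText(wkt, 4326) AS geom FROM staging_hexbin;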
Alternatively, a plug-in for MapInfo Professional can be used to pull data directly from HDFS into a native table. Refer to WKT2MapInfo for more information.