- Create an EMR cluster with the following configuration:
Table 1. Software Details

| Component | Version      |
|-----------|--------------|
| EMR       | emr-6.0.0    |
| Hadoop    | Amazon 3.2.1 |
| Hive      | Hive 3.1.2   |
| Tez       | Tez 0.9.2    |

Table 2. Hardware Details

| Nodes    | Machine Type | Machine Details                                                  |
|----------|--------------|------------------------------------------------------------------|
| Master   | m4.xlarge    | 4 vCore, 16 GiB memory, EBS only storage, EBS Storage: 400 GiB   |
| Slaves*2 | m4.xlarge    | 4 vCore, 16 GiB memory, EBS only storage, EBS Storage: 400 GiB   |

- Download the Property_Attributes_Assessment_yyyymm.zip file to /mnt/data/.
- Uncompress the file using the following command:
unzip Property_Attributes_Assessment_yyyymm.zip
- The data is extracted to /mnt/data/Property_Attributes_Assessment_yyyymm/Property_Attributes_Assessment_Data/property_attributes_assessment_usa.txt
- Start the Hive shell (command prompt).
- Run the create table script in the Hive shell.
- Load the data into Hive using the following command:
hive:>LOAD DATA LOCAL INPATH '/mnt/data/Property_Attributes_Assessment_yyyymm/Property_Attributes_Assessment_Data/property_attributes_assessment_usa.txt' OVERWRITE INTO TABLE asmt_d;
- Once the data is loaded, run the following query to count the rows in the table:
hive:>select count(*) from asmt_d;
Note: To verify the load, compare the number of records in the source file with the row count returned by the Hive query. The two counts should match.
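
The count check described in the note can be scripted. The following is a minimal sketch, not part of the original procedure. It assumes the source file holds one record per line with no header row and ends with a newline, and that the `hive` command-line client is on the path of the master node. The helper names `count_file_lines` and `compare_counts` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: compare the line count of the source file with the Hive row count.
# Assumptions: one record per line, no header row, file ends with a newline.
# The function names below are hypothetical helpers, not part of any tool.

# Print the number of lines in the given text file.
count_file_lines() {
  wc -l < "$1" | tr -d '[:space:]'
}

# Compare two counts; print the result and return non-zero on a mismatch.
compare_counts() {
  local file_count="$1" hive_count="$2"
  if [ "$file_count" -eq "$hive_count" ]; then
    echo "MATCH: $file_count rows"
  else
    echo "MISMATCH: file=$file_count hive=$hive_count"
    return 1
  fi
}

# Example usage on the cluster (commented out so the sketch has no side effects):
# DATA_FILE=/mnt/data/Property_Attributes_Assessment_yyyymm/Property_Attributes_Assessment_Data/property_attributes_assessment_usa.txt
# FILE_COUNT=$(count_file_lines "$DATA_FILE")
# HIVE_COUNT=$(hive -S -e 'select count(*) from asmt_d;' | tail -n 1)
# compare_counts "$FILE_COUNT" "$HIVE_COUNT"
```

Here `hive -S -e` runs the query in silent mode from the operating system shell, and `tail -n 1` keeps only the last output line in case warnings are printed first. If the create table script skips a header line, the file count would be expected to exceed the Hive count by one.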