Point In Polygon - Spectrum Location Intelligence for Big Data - 5.2.1

Location Intelligence SDK for Big Data Guide

Product type
Software
Portfolio
Locate
Product family
Spectrum
Product
Spatial Big Data > Location Intelligence SDK for Big Data
Version
5.2.1
Language
English
Product name
Location Intelligence for Big Data
Title
Location Intelligence SDK for Big Data Guide
Copyright
2024
First publish date
2015
Last updated
2024-10-16

This Spark job filters point records, keeping only those whose coordinates fall within a specified polygon (for example, the polygon of the continental USA).
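
Conceptually, the job performs a spatial containment test for every input point and keeps the points that pass it. The Scala sketch below is not the SDK's implementation; it only illustrates that idea using plain Spark and the open-source JTS library, with a hypothetical bounding polygon and placeholder column names (lon, lat):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}
import org.locationtech.jts.geom.{Coordinate, GeometryFactory}
import org.locationtech.jts.io.WKTReader

object PointInPolygonSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pip-sketch").master("local[*]").getOrCreate()

    // Hypothetical polygon in EPSG:4326 (lon/lat order), here a rough box over the continental USA.
    val polygon = new WKTReader().read(
      "POLYGON((-125 24, -66 24, -66 50, -125 50, -125 24))")

    // UDF that tests whether a lon/lat point falls inside the polygon.
    val geometryFactory = new GeometryFactory()
    val withinPolygon = udf { (lon: Double, lat: Double) =>
      polygon.contains(geometryFactory.createPoint(new Coordinate(lon, lat)))
    }

    // Read the point data and keep only the rows inside the polygon.
    val points = spark.read.option("header", "true").csv("points.csv")
    val filtered = points.filter(
      withinPolygon(col("lon").cast("double"), col("lat").cast("double")))
    filtered.show()
  }
}

In practice, the PointInPolygonDriver described in this topic does this work for you against TAB, shape, or geodatabase polygon data.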

To perform this operation, you will need:
  • A dataset containing point coordinates (latitude and longitude) in EPSG:4326
  • A TAB, shape, or geodatabase file describing the polygon

To run the operation:
  • Deploy the jar and the required input data to the cluster.
  • Start the Spark job using the following command:
    spark-submit
    --class com.precisely.bigdata.li.spark.app.pointinpolygon.PointInPolygonDriver
    --master local <dir-on-server>/location-intelligence-bigdata-spark3drivers_2.12-0-SNAPSHOT-all.jar
    --input <input-data-containing-points>
    --input-format=csv
    --csv header=true delimiter=','
    --table-file-type TAB
    --table-file-path <polygon-data-path>
    --table-file-name <file-name>
    --latitude-column-name <latitude-col-name-in-point-data>
    --longitude-column-name <longitude-col-name-in-point-data>
    --output <output-path>
    --output-fields <output-fields>
    --overwrite

The output of the Point In Polygon operation is either a CSV or Parquet file containing only the rows that fall within the specified polygon, with the requested output fields appended as additional columns.
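
For example, if the job wrote CSV output with a header row, you could inspect the filtered rows with standard Spark. The Scala sketch below is illustrative; the output path is a placeholder, and for Parquet output you would use spark.read.parquet instead:

import org.apache.spark.sql.SparkSession

object InspectPipOutput {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("inspect-pip-output").master("local[*]").getOrCreate()

    // Read the CSV written by the Point In Polygon job (header=true, delimiter=',').
    val result = spark.read
      .option("header", "true")
      .option("delimiter", ",")
      .csv("/user/sdkuser/pip_output") // placeholder output path

    // The requested output fields (for example ZIP, Name) appear as appended columns.
    result.printSchema()
    result.show(10, truncate = false)
  }
}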

Executing the Job

To run the Spark job, you must use the spark-submit script in Spark’s bin directory. Make sure to use the appropriate Spark3 jar for your installed distribution of Spark and Scala.

DriverClass:

com.precisely.bigdata.li.spark.app.pointinpolygon.PointInPolygonDriver

Scala 2.12:

/precisely/li/software/spark3/sdk/lib/location-intelligence-bigdata-spark3drivers_2.12-sdk_version-all.jar

For Example:

spark-submit
--class com.precisely.bigdata.li.spark.app.pointinpolygon.PointInPolygonDriver
--master local C:\python\jars\location-intelligence-bigdata-spark3drivers_2.12-0-SNAPSHOT-all.jar
--input C:\python\input\addressFabric50.csv
--input-format=csv
--csv header=true delimiter=','
--table-file-type TAB
--table-file-path C:\python\input\uszips
--table-file-name USZIPBDY.TAB
--latitude-column-name lat
--longitude-column-name lon
--output C:\python\output
--output-fields ZIP, Name
--include-empty-search-results
--overwrite
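
If you would rather start the same job from application code than from a shell, Spark's standard SparkLauncher API can wrap the invocation shown above. The Scala sketch below is illustrative only: it reuses the jar path, inputs, and column names from the example, and the exact argument quoting or splitting may need adjusting for your environment.

import org.apache.spark.launcher.SparkLauncher

object LaunchPointInPolygon {
  def main(args: Array[String]): Unit = {
    // Requires SPARK_HOME to be set (or call setSparkHome) so the launcher can find spark-submit.
    val job = new SparkLauncher()
      .setMaster("local")
      .setMainClass("com.precisely.bigdata.li.spark.app.pointinpolygon.PointInPolygonDriver")
      .setAppResource("C:\\python\\jars\\location-intelligence-bigdata-spark3drivers_2.12-0-SNAPSHOT-all.jar")
      .addAppArgs(
        "--input", "C:\\python\\input\\addressFabric50.csv",
        "--input-format=csv",
        "--csv", "header=true",
        "--table-file-type", "TAB",
        "--table-file-path", "C:\\python\\input\\uszips",
        "--table-file-name", "USZIPBDY.TAB",
        "--latitude-column-name", "lat",
        "--longitude-column-name", "lon",
        "--output", "C:\\python\\output",
        "--output-fields", "ZIP", "Name",
        "--include-empty-search-results",
        "--overwrite")
      .launch()
    // Wait for the job to finish; a non-zero exit code indicates failure.
    val exitCode = job.waitFor()
    println(s"Point In Polygon job exited with code $exitCode")
  }
}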

Job Parameters

All parameters are declared with a double dash. The required fields are in bold.
--output
    The location of the directory for the output.
    Example: --output /user/sdkuser/pip_output

--input
    The location of the input file.
    Example: --input /user/sdkuser/customers/addresses.csv

--table-file-type
    Type of the target polygon data file (TAB, shape, or geodatabase).
    Example: --table-file-type=TAB

--table-file-path
    Path to the polygon data files.
    Example: --table-file-path=/user/sdkuser/input/TABs/uszips

--table-file-name
    Name of the TAB, shape, or geodatabase file.
    Example: --table-file-name=USZIPBDY.TAB

--longitude-column-name
    Name of the column containing longitude values in the input point data.
    Example: --longitude-column-name=lon

--latitude-column-name
    Name of the column containing latitude values in the input point data.
    Example: --latitude-column-name=lat

--output-fields
    The requested fields to be included in the output. Multiple output field expressions should be separated by a space.
    Example: --output-fields ZIP, Name

--download-location
    Local path of the directory where input data will be downloaded. This path must exist, or be creatable by the Spark user executing the job, on every node.
    Note: This parameter is required if reference data is distributed remotely via HDFS or S3.
    Example: --download-location /precisely/downloads

--download-group
    Used only on POSIX-compliant platforms such as Linux. Specifies the operating system group that should be applied to the downloaded data on the local file system so that multiple Hadoop services can update the data when required. This group must be present on all nodes in the cluster, and the operating system user executing the Spark job must be a member of it.
    For more information, see Download Permissions.
    Note: Use only if reference data is distributed remotely via HDFS or S3.
    Example: --download-group dm_users

--limit
    Limits the number of records.
    Example: --limit 10

--libraries
    Libraries to be provided when the table-file-type is geodatabase.
    Example: --libraries hdfs:///pb/li/software/lib/linux-x86-64

--input-format
    The input format. Valid values: csv or parquet. If not specified, the default is csv.
    Example: --input-format=csv

--output-format
    The output format. Valid values: csv or parquet. If not specified, the default is the input-format value.
    Example: --output-format=csv

--csv
    Specifies the options to be used when reading and writing CSV input and output files.
    Common options and their default values:
      • delimiter: ,
      • quote: "
      • escape: \
      • header: false
    Specify individual options: --csv header=true or --csv delimiter='\t'
    Specify multiple options: --csv header=true delimiter='\t'

--parquet
    Specifies the options to be used when reading and writing Parquet input and output files.
    Example: --parquet compression=gzip

--include-empty-search-results
    This flag keeps rows in the output for points that were not within the polygon.
    Example: --include-empty-search-results

--overwrite
    Tells the job to overwrite the output directory. Otherwise, the job will fail if the directory already has content. This parameter does not take a value.
    Example: --overwrite
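
As an illustration of --include-empty-search-results: when the flag is supplied, points that did not fall inside the polygon are retained in the output, so you can separate matched rows from unmatched rows afterwards. The Scala sketch below is a hedged example that assumes CSV output with a header and assumes the appended ZIP column from the earlier example is empty for unmatched rows:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object SplitPipResults {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("split-pip-results").master("local[*]").getOrCreate()

    // Read the job's CSV output (placeholder path taken from the example above).
    val result = spark.read.option("header", "true").csv("C:\\python\\output")

    // Spark reads empty CSV fields as null by default, so a null ZIP is assumed
    // here to mark a point that was outside the polygon.
    val inside = result.filter(col("ZIP").isNotNull)
    val outside = result.filter(col("ZIP").isNull)
    println(s"inside polygon: ${inside.count()}, outside polygon: ${outside.count()}")
  }
}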