This Spark job filters point coordinates that fall within a specified polygon (for example, the polygon of the continental USA). It requires:
- A dataset containing point coordinates (latitude and longitude) in EPSG:4326
- A TAB, shape, or geodatabase file describing the polygon
To run the job:
- Deploy the jar and the required input data to the cluster.
- Start the Spark job using the following command:
spark-submit --class com.precisely.bigdata.li.spark.app.pointinpolygon.PointInPolygonDriver --master local <dir-on-server>/location-intelligence-bigdata-spark3drivers_2.12-0-SNAPSHOT-all.jar --input <input-data-containing-points> --input-format=csv --csv header=true delimiter=',' --table-file-type TAB --table-file-path <polygon-data-path> --table-file-name <file-name> --latitude-column-name <latitude-col-name-in-point-data> --longitude-column-name <longitude-col-name-in-point-data> --output <output-path> --output-fields <output-fields> --overwrite
The output of the Point-In-Polygon operation is either CSV or parquet, containing only the rows that fall within the specified polygon, with the columns passed as output fields appended.
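Conceptually, the operation filters each input point with a spatial contains test against the polygon. The following is a rough, self-contained Scala sketch of that idea only; it is not the driver's implementation. It uses a simple ray-casting test against a hard-coded bounding box instead of a real TAB, shape, or geodatabase boundary, and it omits appending output fields. The input path and the lat/lon column names are taken from the examples in this topic.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object PointInPolygonSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pip-sketch").getOrCreate()
    import spark.implicits._

    // Stand-in polygon: a crude bounding box over the continental USA,
    // given as (lon, lat) vertices, instead of a real polygon boundary file.
    val usaBox = Seq((-125.0, 24.0), (-125.0, 49.5), (-66.0, 49.5), (-66.0, 24.0))

    // Ray-casting point-in-polygon test over the vertex list.
    val insideUdf = udf { (lon: Double, lat: Double) =>
      var inside = false
      var j = usaBox.length - 1
      for (i <- usaBox.indices) {
        val (xi, yi) = usaBox(i)
        val (xj, yj) = usaBox(j)
        if (((yi > lat) != (yj > lat)) &&
            lon < (xj - xi) * (lat - yi) / (yj - yi) + xi) inside = !inside
        j = i
      }
      inside
    }

    // Point data with lat/lon columns, as passed to the driver via --input.
    val points = spark.read.option("header", "true")
      .csv("/user/sdkuser/customers/addresses.csv")

    // Keep only the points that fall inside the polygon.
    points
      .filter(insideUdf($"lon".cast("double"), $"lat".cast("double")))
      .write.mode("overwrite").csv("/user/sdkuser/pip_output_sketch")

    spark.stop()
  }
}
```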
Executing the Job
To run the Spark job, you must use the spark-submit script in Spark’s bin directory. Make sure to use the appropriate Spark3 jar for your installed distribution of Spark and Scala.
DriverClass:
com.precisely.bigdata.li.spark.app.pointinpolygon.PointInPolygonDriver
Scala 2.12:
/precisely/li/software/spark3/sdk/lib/location-intelligence-bigdata-spark3drivers_2.12-sdk_version-all.jar
For example:
spark-submit
--class com.precisely.bigdata.li.spark.app.pointinpolygon.PointInPolygonDriver
--master local C:\python\jars\location-intelligence-bigdata-spark3drivers_2.12-0-SNAPSHOT-all.jar
--input C:\python\input\addressFabric50.csv
--input-format=csv
--csv header=true delimiter=','
--table-file-type TAB
--table-file-path C:\python\input\uszips
--table-file-name USZIPBDY.TAB
--latitude-column-name lat
--longitude-column-name lon
--output C:\python\output
--output-fields ZIP, Name
--include-empty-search-results
--overwrite
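After the job completes, the output can be inspected with Spark. Below is a minimal Scala sketch, assuming the output was written as CSV with a header row to /user/sdkuser/pip_output (the path from the --output example in the parameter table); switch to spark.read.parquet for parquet output.

```scala
import org.apache.spark.sql.SparkSession

object InspectPipOutput {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("inspect-pip-output").getOrCreate()

    // Read the Point-In-Polygon output; use spark.read.parquet(...) instead
    // if the job was run with --output-format=parquet.
    val result = spark.read
      .option("header", "true")
      .csv("/user/sdkuser/pip_output")

    // Each row is an input point that fell inside the polygon, with the
    // requested output fields (for example ZIP and Name) appended.
    result.show(10, truncate = false)

    spark.stop()
  }
}
```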
Job Parameters
| Parameter | Description | Example |
|---|---|---|
| --output | The location of the directory for the output. | --output /user/sdkuser/pip_output |
| --input | The location of the input file. | --input /user/sdkuser/customers/addresses.csv |
| --table-file-type | Type of the target polygon data file (either TAB, shape, or geodatabase). | --table-file-type=TAB |
| --table-file-path | Path to the polygon data files. | --table-file-path=/user/sdkuser/input/TABs/uszips |
| --table-file-name | Name of the TAB, shape, or geodatabase file. | --table-file-name=USZIPBDY.TAB |
| --longitude-column-name | Name of the column containing longitude values in the input point data. | --longitude-column-name=lon |
| --latitude-column-name | Name of the column containing latitude values in the input point data. | --latitude-column-name=lat |
| --output-fields | The requested fields to be included in the output. Multiple output field expressions should be separated by a space. | --output-fields ZIP, Name |
| --download-location | Local path of the directory where input data will be downloaded to. This path must exist, or be able to be created by the Spark user executing the job, on every node. Note: This parameter is required if reference data is distributed remotely via HDFS or S3. | --download-location /precisely/downloads |
| --download-group | Used only on POSIX-compliant platforms such as Linux. It specifies the operating system group to apply to the downloaded data on the local file system, so that multiple Hadoop services can update the data when required. This group should be present on all nodes in the cluster, and the operating system user executing the Spark job should be a member of it. For more information, see Download Permissions. Note: Use only if reference data is distributed remotely via HDFS or S3. | --download-group dm_users |
| --limit | Limits the number of records. | --limit 10 |
| --libraries | Libraries to be provided when the table-file-type parameter is geodatabase. | --libraries hdfs:///pb/li/software/lib/linux-x86-64 |
| --input-format | The input format. Valid values: csv or parquet. If not specified, the default is csv. | --input-format=csv |
| --output-format | The output format. Valid values: csv or parquet. If not specified, the default is the input-format value. | --output-format=csv |
| --csv | The options to be used when reading and writing CSV input and output files (see the sketch after this table). | --csv header=true delimiter=',' |
| --parquet | The options to be used when reading and writing parquet input and output files. | --parquet compression=gzip |
| --include-empty-search-results | This flag keeps the rows in the output that did not fall within the polygon. | --include-empty-search-results |
| --overwrite | Including this parameter tells the job to overwrite the output directory; otherwise, the job fails if the directory already has content. This parameter does not have a value. | --overwrite |
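The --csv and --parquet options pass format options through when data is read and written. As an illustration only (an assumption about the mapping, not the driver's code), the example values above correspond to the following Spark reader and writer options, shown as they could be tried in spark-shell, where spark is the predefined SparkSession:

```scala
// --csv header=true delimiter=',' maps onto Spark's CSV reader options.
val df = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .csv("/user/sdkuser/customers/addresses.csv")

// --parquet compression=gzip maps onto the parquet writer's compression option.
df.write
  .option("compression", "gzip")
  .parquet("/user/sdkuser/pip_output")
```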