This Spark job takes a geometry string (in GeoJSON, WKT, KML, or WKB format) and searches a table of geometries for those within a specified distance of it. The number of geometries returned can be limited with the max-candidates parameter. By default, geometries are listed from nearest to farthest. The job requires:
- A dataset with a geometry string column in GeoJSON, WKT, WKB, or KML format (a sample input file appears after the command below)
- A TAB, shape, or geodatabase file containing the list of geometries to search
- Deploy the jar and the required input data to the cluster
- Start the Spark job using the following command:
spark-submit --class com.precisely.bigdata.li.spark.app.searchnearest.SearchNearestDriver --master local <dir-on-server>\location-intelligence-bigdata-spark3drivers_2.12-0-SNAPSHOT-all.jar --input <input-file-containing-the-source-geometry-string> --input-format=csv --csv header=true delimiter='\t' --table-file-type <TAB/shape/geodatabase> --table-file-path <target-geometry-data> --table-file-name <target-geometry-file-name> --geometry-column-name geometry --geometry-string-type <WKT/WKB/GeoJSON/KML> --distance 100 --distance-unit mi --max-candidates 20 --output <output-path> --output-fields <output-fields> --include-empty-search-results --overwrite
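For reference, a minimal tab-delimited input file with a WKT geometry column might look like the following (the id and wkt column names and the coordinates are illustrative, not required by the job):

```
id	wkt
1	POINT (-73.9857 40.7484)
2	POINT (-77.0365 38.8977)
```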
The output of the Search Nearest operation is the input dataset appended with the output-fields provided in the command. A column giving the distance from the source geometry to the target geometry is also added, and rows are ordered by ascending distance between the source and target geometries.
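As a quick sanity check, the output can be loaded back into Spark and inspected. This is a minimal sketch, assuming CSV output written with header=true and a tab delimiter (as in the example command above); the output path is a placeholder:

```scala
// A minimal sketch for inspecting Search Nearest output with Spark.
// Assumptions: CSV output written with header=true and a tab delimiter.
import org.apache.spark.sql.SparkSession

object InspectSearchNearestOutput {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("InspectSearchNearest").getOrCreate()
    val out = spark.read
      .option("header", "true")
      .option("delimiter", "\t")
      .csv("<output-path>")

    // Rows are already ordered from nearest to farthest, so no sort is needed.
    out.show(20, truncate = false)
    spark.stop()
  }
}
```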
Executing the Job
To run the Spark job, you must use the spark-submit script in Spark’s bin directory. Make sure to use the appropriate Spark3 jar for your installed distribution of Spark and Scala.
DriverClass:
com.precisely.bigdata.li.spark.app.searchnearest.SearchNearestDriver
Scala 2.12:
/precisely/li/software/spark3/sdk/lib/location-intelligence-bigdata-spark3drivers_2.12-sdk_version-all.jar
For example:
spark-submit
--class com.precisely.bigdata.li.spark.app.searchnearest.SearchNearestDriver
--master local C:\python\jars\location-intelligence-bigdata-spark3drivers_2.12-0-SNAPSHOT-all.jar
--input C:\python\input\geometryGeoJson.csv
--input-format=csv
--csv header=true delimiter='\t'
--table-file-type TAB
--table-file-path C:\python\input\usa_landmarks
--table-file-name LANDMARKS.TAB
--geometry-column-name wkt
--geometry-string-type WKT
--distance 50
--distance-unit mi
--max-candidates 20
--output C:\python\output\sn_geojson
--output-fields Name, State, Landmark
--include-empty-search-results
--overwrite
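Invalid geometry strings in the input are a common cause of job failures. Before submitting, the geometry column can be pre-validated; the sketch below uses the open-source JTS library (not part of this SDK) to check that strings parse as WKT:

```scala
// A sketch: pre-validate WKT geometry strings with the open-source JTS
// library before running the job. This only checks that each string
// parses as WKT; it is independent of the Search Nearest driver itself.
import org.locationtech.jts.io.{ParseException, WKTReader}

object ValidateWkt {
  def main(args: Array[String]): Unit = {
    val reader = new WKTReader()
    val samples = Seq(
      "POINT (-73.9857 40.7484)",  // valid WKT point
      "POINT (-73.9857, 40.7484)"  // invalid: WKT separates ordinates with spaces
    )
    samples.foreach { wkt =>
      try {
        val geom = reader.read(wkt) // throws ParseException on malformed WKT
        println(s"OK:  ${geom.getGeometryType} -> $wkt")
      } catch {
        case e: ParseException => println(s"BAD: $wkt (${e.getMessage})")
      }
    }
  }
}
```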
Job Parameters
Parameter | Description | Example |
---|---|---|
--output | The location of the directory for the output. | --output /user/sdkuser/search_nearest_output |
--input | The location of the input file. | --input /user/sdkuser/customers/addresses.csv |
--table-file-type | Type of the target geometry data file: TAB, shape, or geodatabase. | --table-file-type=TAB |
--table-file-path | Path to the geometry data files. | --table-file-path=/user/sdkuser/input/TABs/uszips |
--table-file-name | Name of the TAB, shape, or geodatabase file. | --table-file-name=USZIPBDY.TAB |
--geometry-column-name | Name of the column containing the string representation of the geometry. | --geometry-column-name wkt |
--geometry-string-type | Type of geometry string provided in the input file. Supported values: WKT, GeoJSON, WKB, or KML. | --geometry-string-type=WKT |
--distance | The absolute value of the distance from the source geometry within which to search for target geometries. | --distance 5 |
--distance-unit | Unit of measurement for the distance parameter. Valid values: mi (miles), km (kilometers), in (inches), ft (feet), yd (yards), mm (millimeters), cm (centimeters), m (meters), survey ft (US survey feet), nmi (nautical miles). | --distance-unit=mi |
--max-candidates | Limits the number of target geometries returned by the search. | --max-candidates 4 |
--output-fields | The requested fields to include in the output. Multiple output field expressions should be separated by a space. | --output-fields ZIP, Name |
--distance-column-name | Name of the distance column in the output. | --distance-column-name dist_between_geoms |
--libraries | Libraries to be provided when the table-file-type parameter is geodatabase. | --libraries hdfs:///pb/li/software/lib/linux-x86-64 |
--download-location | Local path of the directory where input data will be downloaded. This path must exist, or be able to be created by the spark user executing the job, on every node. Note: This parameter is required if reference data is distributed remotely via HDFS or S3. | --download-location /precisely/downloads |
--download-group | This property is only used on POSIX-compliant platforms such as Linux. It specifies the operating system group to apply to the downloaded data on the local file system, so that multiple Hadoop services can update the data when required. This group should be present on all nodes in the cluster, and the operating system user executing the spark job should be a member of it. For more information, see Download Permissions. Note: Use only if reference data is distributed remotely via HDFS or S3. | --download-group dm_users |
--limit | Limits the number of records processed. | --limit 10 |
--input-format | The input format. Valid values: csv or parquet. If not specified, the default is csv. | --input-format=csv |
--output-format | The output format. Valid values: csv or parquet. If not specified, the default is the input-format value. | --output-format=csv |
--csv | Specify the options to be used when reading and writing CSV input and output files. | --csv header=true delimiter='\t' |
--parquet | Specify the options to be used when reading and writing parquet input and output files. | --parquet compression=gzip |
--include-empty-search-results | Including this flag keeps input rows in the output even when no target geometry is found within the search distance. | --include-empty-search-results |
--overwrite | Including this parameter tells the job to overwrite the output directory; otherwise, the job fails if the directory already has content. This parameter does not take a value. | --overwrite |
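When --max-candidates allows several matches per source row but only the single nearest is needed, the output can be post-processed with a window function. A sketch, assuming the input carried a unique id column and the distance column was named distance via --distance-column-name (both column names are assumptions):

```scala
// A sketch: keep only the nearest match per source row. Assumes a unique
// "id" column from the input and a distance column named "distance"
// (set with --distance-column-name). Both column names are assumptions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object NearestOnly {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("NearestOnly").getOrCreate()
    val out = spark.read
      .option("header", "true")
      .option("delimiter", "\t")
      .csv("/user/sdkuser/search_nearest_output")

    // Rank candidates per source row by ascending distance, keep rank 1.
    val byDistance = Window.partitionBy("id").orderBy(col("distance").cast("double"))
    val nearest = out
      .withColumn("rank", row_number().over(byDistance))
      .where(col("rank") === 1)
      .drop("rank")

    nearest.show(20, truncate = false)
    spark.stop()
  }
}
```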