Search Nearest - Spectrum Location Intelligence for Big Data - 5.2.1

Location Intelligence SDK for Big Data Guide

Product type
Software
Portfolio
Locate
Product family
Spectrum
Product
Spatial Big Data > Location Intelligence SDK for Big Data
Version
5.2.1
Language
English
Product name
Location Intelligence for Big Data
Title
Location Intelligence SDK for Big Data Guide
Copyright
2024
First publish date
2015
Last updated
2024-10-16

This Spark job takes a geometry string (in GeoJSON, WKT, KML, or WKB format) and searches a table of geometries for those within a specified distance. The number of returned geometries can be limited with the max-candidates parameter. By default, geometries are listed from nearest to farthest.
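The geometry string is simply a serialized geometry carried in one column of the input dataset. As a minimal, hypothetical sketch (using the open-source JTS library, which this job does not require; any tool that emits valid WKT, WKB, GeoJSON, or KML will do), a WKT string for a point geometry could be produced like this:

import org.locationtech.jts.geom.{Coordinate, GeometryFactory}
import org.locationtech.jts.io.{WKBWriter, WKTWriter}

// Illustrative only: build a point and serialize it as WKT (and as hex-encoded WKB).
val factory = new GeometryFactory()
val point   = factory.createPoint(new Coordinate(-73.9857, 40.7484))
val wkt     = new WKTWriter().write(point)                   // "POINT (-73.9857 40.7484)"
val wkbHex  = WKBWriter.toHex(new WKBWriter().write(point))  // hex-encoded WKB alternative

The resulting string would populate the column named by --geometry-column-name.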

To perform this operation, you will need to:
  • Provide a dataset with a geometry string column in GeoJSON, WKT, WKB, or KML format
  • Provide a TAB, shape, or geodatabase file containing the geometries to search from
  • Deploy the jar and required input data to the cluster
  • Start the Spark job using the following command:
    spark-submit  
    --class com.precisely.bigdata.li.spark.app.searchnearest.SearchNearestDriver 
    --master local <dir-on-server>\location-intelligence-bigdata-spark3drivers_2.12-0-SNAPSHOT-all.jar  
    --input <input-file-containing-the-source-geometry-string> 
    --input-format=csv 
    --csv header=true delimiter='\t' 
    --table-file-type <TAB/shape/geodatabase> 
    --table-file-path <target-geometry-data> 
    --table-file-name <target-geometry-file-name> 
    --geometry-column-name geometry
    --geometry-string-type <WKT/WKB/GeoJSON/KML> 
    --distance 100 
    --distance-unit mi
    --max-candidates 20 
    --output <output-path> 
    --output-fields <output-fields> 
    --include-empty-search-results 
    --overwrite  

The output of the Search Nearest operation is the input dataset appended with the output-fields provided in the command. A column specifying the distance from the source geometry to the target geometry is also added to the output, and rows are ordered by ascending distance between the source and target geometries.
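If you want to inspect the results programmatically, a plain Spark DataFrame read is sufficient. The sketch below is illustrative only: the output path, the tab delimiter, and the distance column name (assumed here to have been set with --distance-column-name dist_between_geoms) are assumptions, not documented defaults.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("InspectSearchNearest").getOrCreate()

// Read the CSV output written by the Search Nearest job.
val results = spark.read
  .option("header", "true")
  .option("delimiter", "\t")
  .option("inferSchema", "true")
  .csv("/user/sdkuser/search_nearest_output")

// Rows are already ordered nearest to farthest per source geometry;
// sorting on the distance column here simply confirms that for a quick look.
results.orderBy("dist_between_geoms").show(20, truncate = false)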

Executing the Job

To run the Spark job, you must use the spark-submit script in Spark’s bin directory. Make sure to use the appropriate Spark3 jar for your installed distribution of Spark and Scala.

DriverClass:

com.precisely.bigdata.li.spark.app.searchnearest.SearchNearestDriver

Scala2.12:

/precisely/li/software/spark3/sdk/lib/location-intelligence-bigdata-spark3drivers_2.12-sdk_version-all.jar

For example:

spark-submit
--class com.precisely.bigdata.li.spark.app.searchnearest.SearchNearestDriver
--master local C:\python\jars\location-intelligence-bigdata-spark3drivers_2.12-0-SNAPSHOT-all.jar
--input C:\python\input\geometryGeoJson.csv
--input-format=csv
--csv header=true delimiter='\t'
--table-file-type TAB
--table-file-path C:\python\input\usa_landmarks
--table-file-name LANDMARKS.TAB
--geometry-column-name wkt
--geometry-string-type WKT
--distance 50
--distance-unit mi
--max-candidates 20
--output C:\python\output\sn_geojson
--output-fields Name, State, Landmark
--include-empty-search-results
--overwrite
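For reference, the input file in this example (C:\python\input\geometryGeoJson.csv) would be a tab-delimited file with a header row and a column named wkt containing the source geometry strings. The rows below are purely illustrative:

id	name	wkt
1	Empire State Building	POINT (-73.9857 40.7484)
2	Gateway Arch	POINT (-90.1847 38.6247)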

Job Parameters

All parameters are declared with a double dash. The required fields are in bold.
Parameter Description Example
--output The location of the directory for the output. --output /user/sdkuser/search_nearest_output
--input The location to the input file. --input /user/sdkuser/customers/addresses.csv
--table-file-type Type of target polygon data file (either TAB, shape, or geodatabase). --table-file-type=TAB
--table-file-path Path to polygon data files. --table-file-path=/user/sdkuser/input/TABs/uszips
--table-file-name Name of the TAB, shape, or geodatabase file. --table-file-name=USZIPBDY.TAB

--geometry-column-name Name of column containing string representation of geometry. --geometry-column-name wkt
--geometry-string-type Type of geometry string provided in the input file. Supported values are WKT, GeoJSON, WKB, or KML. --geometry-string-type=WKT
--distance The absolute distance from the source geometry within which target geometries are searched. --distance 5
--distance-unit Unit of measurement for the distance parameter. Valid values are: mi (miles), km (kilometers), in (inches), ft (feet), yd (yards), mm (millimeters), cm (centimeters), m (meters), survey ft (US Survey feet), nmi (nautical miles). --distance-unit=mi
--max-candidates Limits the number of target geometries returned for each source geometry. --max-candidates 4
--output-fields The requested fields to be included in the output. Multiple output field expressions should be separated by a space. --output-fields ZIP, Name
--distance-column-name Name of the distance column in the output. --distance-column-name dist_between_geoms

--libraries Libraries to be provided when the table-file-type parameter is geodatabase. --libraries hdfs:///pb/li/software/lib/linux-x86-64

--download-location Local path of the directory where input data will be downloaded. This path must exist or be able to be created by the Spark user executing the job, on every node. Note: This parameter is required if reference data is distributed remotely via HDFS or S3. --download-location /precisely/downloads
--download-group This property is only used on POSIX-compliant platforms such as Linux. It specifies the operating system group that should be applied to the downloaded data on the local file system, so that multiple Hadoop services can update the data when required. This group should be present on all nodes in the cluster, and the operating system user executing the Spark job should be a member of this group. For more information, see Download Permissions. Note: Use only if reference data is distributed remotely via HDFS or S3. --download-group dm_users
--limit Limits the number of records. --limit 10
--input-format The input format. Valid values: csv or parquet. If not specified, the default is csv. --input-format=csv
--output-format The output format. Valid values: csv or parquet. If not specified, the default is the input-format value. --output-format=csv
--csv Specify the options to be used when reading and writing CSV input and output files.

Common options and their default values:
  • delimiter: ,
  • quote: "
  • escape: \
  • header: false

Specify individual options:
  --csv header=true
  --csv delimiter='\t'

Specify multiple options:
  --csv header=true delimiter='\t'

--parquet Specify the options to be used when reading and writing parquet input and output files. --parquet compression=gzip
--include-empty-search-results This flag keeps input rows in the output even when no target geometry is found within the search distance. --include-empty-search-results
--overwrite Including this parameter will tell the job to overwrite the output directory. Otherwise, the job will fail if this directory already has content. This parameter does not have a value. --overwrite
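
When the reference data is distributed remotely (for example, on HDFS), the download parameters above come into play. A hypothetical invocation might look like the following; all paths, the master setting, and the group name are illustrative only and reuse the example values from the table:

spark-submit
--class com.precisely.bigdata.li.spark.app.searchnearest.SearchNearestDriver
--master yarn /precisely/li/software/spark3/sdk/lib/location-intelligence-bigdata-spark3drivers_2.12-sdk_version-all.jar
--input hdfs:///user/sdkuser/customers/addresses.csv
--input-format=csv
--csv header=true delimiter='\t'
--table-file-type TAB
--table-file-path hdfs:///user/sdkuser/input/TABs/uszips
--table-file-name USZIPBDY.TAB
--geometry-column-name wkt
--geometry-string-type WKT
--distance 5
--distance-unit mi
--max-candidates 4
--download-location /precisely/downloads
--download-group dm_users
--output hdfs:///user/sdkuser/search_nearest_output
--output-fields ZIP, Name
--overwrite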