JoinByDistance - Spectrum_Location_Intelligence_for_Big_Data - 5.2.1

Location Intelligence SDK for Big Data Guide

Product type: Software
Portfolio: Locate
Product family: Spectrum
Product: Spatial Big Data > Location Intelligence SDK for Big Data
Version: 5.2.1
Language: English
Product name: Location Intelligence for Big Data
Title: Location Intelligence SDK for Big Data Guide
Copyright: 2024
First publish date: 2015
Last updated: 2024-10-16
Published on: 2024-10-16T13:55:01.634374

Description

joinByDistance is an implicit method that joins two dataframes by proximity. It takes a longitude column and a latitude column from each dataframe; these give the locations of the records to be joined. Use it to enrich point data (for example, records loaded from a CSV file) with the attributes of points that lie within a maximum distance, such as finding all the POIs within half a mile of each of your customers.
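To make the idea of a distance join concrete, here is a minimal plain-Scala sketch that pairs records by brute force using the haversine great-circle distance. It is a conceptual illustration only, not how the SDK performs the join (the SDK runs on Spark dataframes and uses a geohash for the primary join). The names Customer, Poi, bruteForceDistanceJoin, and all sample coordinates are hypothetical.

```scala
// Conceptual sketch only: a brute-force distance join over in-memory collections.
// Customer, Poi, bruteForceDistanceJoin and the sample data are hypothetical names/values.
object DistanceJoinSketch {
  final case class Customer(id: String, lon: Double, lat: Double)
  final case class Poi(name: String, lon: Double, lat: Double)

  val EarthRadiusMeters = 6371008.8 // mean Earth radius
  val MetersPerMile = 1609.344

  // Great-circle (haversine) distance in meters between two longitude/latitude points.
  def haversineMeters(lon1: Double, lat1: Double, lon2: Double, lat2: Double): Double = {
    val dLat = math.toRadians(lat2 - lat1)
    val dLon = math.toRadians(lon2 - lon1)
    val a = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
        math.pow(math.sin(dLon / 2), 2)
    2 * EarthRadiusMeters * math.asin(math.sqrt(a))
  }

  // Pair every customer with every POI that lies within maxMeters of it.
  def bruteForceDistanceJoin(customers: Seq[Customer], pois: Seq[Poi],
                             maxMeters: Double): Seq[(Customer, Poi, Double)] =
    for {
      c <- customers
      p <- pois
      d = haversineMeters(c.lon, c.lat, p.lon, p.lat)
      if d <= maxMeters
    } yield (c, p, d)

  val customers = Seq(Customer("c1", -73.9857, 40.7484))
  val pois = Seq(
    Poi("near", -73.9851, 40.7490), // under a hundred meters away
    Poi("far", -73.9000, 40.8000)   // several kilometers away
  )

  // Keep only the POIs within half a mile of each customer.
  val matches = bruteForceDistanceJoin(customers, pois, 0.5 * MetersPerMile)
}
```

A brute-force comparison like this examines every pair of records, which is why a distributed join benefits from a coarse spatial key, such as a geohash, to narrow the candidates first.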

Syntax

import com.precisely.bigdata.li.spark.api.SpatialImplicits._

joinByDistance(df2: DataFrame, df1Longitude: Column, df1Latitude: Column, 
               df2Longitude: Column, df2Latitude: Column, 
               maxDistance: Length, geohashPrecision: Int): DataFrame
joinByDistance(df2: DataFrame, df1Longitude: Column, df1Latitude: Column, 
               df2Longitude: Column, df2Latitude: Column, 
               maxDistance: Length, geohashPrecision: Int,
               options: Map[DistanceJoinOption.DistanceJoinOption, Any]): DataFrame

Parameters

Note: The coordinate values must be in the CoordSysConstants.longLatWGS84 coordinate system.
Parameter         Type       Description
df2               DataFrame  The dataframe to join to.
df1Longitude      Column     The longitude value from the first dataframe.
df1Latitude       Column     The latitude value from the first dataframe.
df2Longitude      Column     The longitude value from the second dataframe.
df2Latitude       Column     The latitude value from the second dataframe.
maxDistance       Length     The search radius around each point in the first dataframe within which points from the second dataframe are matched.
geohashPrecision  Int        The geohash precision to be used for the primary join. Value must be between 1 and 12. The higher the number, the more memory may be required.
options           Map        Optional. Options that add extra attributes to the result of the join.
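When choosing a geohashPrecision, it can help to know how large a geohash cell is at each precision. The sketch below computes cell dimensions from the standard geohash encoding (5 bits per character, alternating between longitude and latitude, starting with longitude). This page does not describe how the SDK relates cell size to maxDistance, so treat the output as a rough aid for reasoning rather than a tuning rule. The object and method names are hypothetical.

```scala
// Sketch: approximate geohash cell size per precision, based on the standard encoding.
// GeohashCellSize and cellSizeDegrees are hypothetical names, not part of the SDK.
object GeohashCellSize {
  // Approximate length of one degree at the equator, in meters.
  val MetersPerDegree = 111320.0

  // Returns (width in degrees of longitude, height in degrees of latitude).
  def cellSizeDegrees(precision: Int): (Double, Double) = {
    require(precision >= 1 && precision <= 12, "precision must be between 1 and 12")
    val totalBits = 5 * precision    // each geohash character carries 5 bits
    val lonBits = (totalBits + 1) / 2 // longitude gets the extra bit when the total is odd
    val latBits = totalBits / 2
    (360.0 / math.pow(2, lonBits), 180.0 / math.pow(2, latBits))
  }

  def main(args: Array[String]): Unit =
    for (p <- 1 to 12) {
      val (w, h) = cellSizeDegrees(p)
      println(f"precision $p%2d: ${w * MetersPerDegree}%.1f m wide x ${h * MetersPerDegree}%.1f m high at the equator")
    }
}
```

For instance, precision 7 gives a cell of roughly 153 m on each side at the equator, while a half-mile search radius is about 805 m.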

Options

Key                 Type                            Description
DistanceColumnName  String                          Adds a column to the result dataframe that contains the distance calculated.
LimitMatches        Numeric                         Limits the number of joined results for each source dataframe record.
LimitMethod         LimitMethods enumeration value  The method used for ranking matches. See LimitMethods enumeration.

LimitMethods enumeration

Value      Description
RowNumber  Use the org.apache.spark.sql.functions.row_number window function for limiting matches.
Rank       Use the org.apache.spark.sql.functions.rank window function for limiting matches.
DenseRank  Use the org.apache.spark.sql.functions.dense_rank window function for limiting matches.
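The three methods differ only in how they treat ties, which affects how many rows survive a LimitMatches cutoff. The plain-Scala sketch below mimics the numbering that the row_number, rank, and dense_rank window functions assign to an ordered list. It assumes matches are ordered by ascending distance, which this page does not state explicitly; the distances and all names are hypothetical.

```scala
// Sketch of row_number / rank / dense_rank semantics on a pre-sorted list.
// Assumption: matches are ordered by ascending distance. All names here are hypothetical.
object LimitMethodSketch {
  // row_number: consecutive positions 1, 2, 3, ... regardless of ties.
  def rowNumber(sorted: Seq[Double]): Seq[Int] = sorted.indices.map(_ + 1)

  // rank: tied values share a position, and the next position skips ahead (gaps).
  def rank(sorted: Seq[Double]): Seq[Int] = sorted.map(d => sorted.count(_ < d) + 1)

  // dense_rank: tied values share a position, with no gaps afterwards.
  def denseRank(sorted: Seq[Double]): Seq[Int] =
    sorted.map(d => sorted.filter(_ < d).distinct.size + 1)

  // Keep the values whose assigned position is at most n.
  def limit(sorted: Seq[Double], positions: Seq[Int], n: Int): Seq[Double] =
    sorted.zip(positions).collect { case (d, pos) if pos <= n => d }

  // Hypothetical distances for one source record, sorted ascending, with a three-way tie.
  val distances = Seq(0.1, 0.2, 0.2, 0.2, 0.3)

  val keptByRowNumber = limit(distances, rowNumber(distances), 3) // positions 1,2,3,4,5
  val keptByRank      = limit(distances, rank(distances), 3)      // positions 1,2,2,2,5
  val keptByDenseRank = limit(distances, denseRank(distances), 3) // positions 1,2,2,2,3
}
```

With a limit of 3, row numbering keeps exactly three rows, rank keeps four (all tied rows share position 2), and dense rank keeps all five. Row numbering is the choice when a strict cap matters more than keeping every tied match.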

Return Values

Return Type  Description
DataFrame    The dataframe that is the result of the join.

Examples

This example returns a dataframe that is the result of a join where points from a second dataframe are located within a 0.5-mile buffer around each point in the first dataframe.

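import org.apache.spark.sql.functions.col
import com.precisely.bigdata.li.spark.api.SpatialImplicits._
import com.mapinfo.midev.unit.{Length, LinearUnit}

// Build the 0.5-mile search distance from a value and a MapInfo unit code
val searchRadius = "0.5"
val distanceUnit = "mi"
val distance = new Length(searchRadius.toDouble, LinearUnit.getFromMapInfoCode(distanceUnit))

// Join by proximity, using a geohash precision of 7 for the primary join
val resultDF = baseDF.joinByDistance(joinDF, col("longitude"), col("latitude"),
  col("lon"), col("lat"), distance, 7)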
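
This example sets an option that adds the calculated distance to the result in a column named outputDistance.

val distance = new Length(0.5, LinearUnit.getFromMapInfoCode("mi"))

// DistanceColumnName is one of the DistanceJoinOption keys listed under Options
val resultDF = baseDF.joinByDistance(joinDF, baseDF("Longitude"), baseDF("Latitude"),
  joinDF("Lon"), joinDF("Lat"), distance, 7,
  Map(DistanceColumnName -> "outputDistance"))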

This example limits the results to 5 matches for each source record, using the Rank method.

val distance = new Length(0.5, LinearUnit.getFromMapInfoCode("mi"))

// LimitMatches caps the matches per source record; LimitMethod selects how they are ranked
val resultDF = baseDF.joinByDistance(joinDF, baseDF("Longitude"), baseDF("Latitude"),
  joinDF("Lon"), joinDF("Lat"), distance, 7,
  Map(DistanceColumnName -> "outputDistance", LimitMatches -> 5,
    LimitMethod -> LimitMethods.Rank))