Description
joinByDistance is an implicit method that joins two dataframes on longitude and
latitude values, one pair from each dataframe, that represent the locations of the
records to be joined. This method can be used to enrich a CSV containing point data with
attributes associated with points within some maximum distance; for example, finding all the
POIs within half a mile of each of your customers.
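To make the result of a distance join concrete, here is a minimal brute-force sketch over in-memory collections. It is a conceptual illustration only, not how the SDK performs the join: the names `Point`, `haversineMeters`, and `naiveJoinByDistance` are hypothetical, and the haversine formula on a spherical Earth is an assumed distance model.

```scala
// Conceptual illustration only. Point, haversineMeters and naiveJoinByDistance
// are hypothetical names and are not part of the SDK.
final case class Point(id: String, lon: Double, lat: Double)

object NaiveDistanceJoin {
  // Mean Earth radius in meters (assumption: spherical Earth model).
  private val EarthRadiusMeters = 6371008.8

  // Great-circle distance between two longitude/latitude points, in meters.
  def haversineMeters(a: Point, b: Point): Double = {
    val dLat = math.toRadians(b.lat - a.lat)
    val dLon = math.toRadians(b.lon - a.lon)
    val h = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(a.lat)) * math.cos(math.toRadians(b.lat)) *
        math.pow(math.sin(dLon / 2), 2)
    2 * EarthRadiusMeters * math.asin(math.sqrt(h))
  }

  // Pairs every left point with every right point within maxMeters of it.
  def naiveJoinByDistance(
      left: Seq[Point],
      right: Seq[Point],
      maxMeters: Double): Seq[(Point, Point, Double)] =
    for {
      l <- left
      r <- right
      d = haversineMeters(l, r)
      if d <= maxMeters
    } yield (l, r, d)
}
```

Each output tuple corresponds to one joined row: the record from the first collection, the matching record from the second, and the distance between them.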
Syntax
import com.precisely.bigdata.li.spark.api.SpatialImplicits._
joinByDistance(df2: DataFrame, df1Longitude: Column, df1Latitude: Column,
df2Longitude: Column, df2Latitude: Column,
maxDistance: Length, geohashPrecision: Int): DataFrame
joinByDistance(df2: DataFrame, df1Longitude: Column, df1Latitude: Column,
df2Longitude: Column, df2Latitude: Column,
maxDistance: Length, geohashPrecision: Int,
options: Map[DistanceJoinOption.DistanceJoinOption, Any]): DataFrame
Parameters
Note: The coordinate values must be in the CoordSysConstants.longLatWGS84 coordinate system.
Parameter | Type | Description |
---|---|---|
df2 | DataFrame | The dataframe to join to. |
df1Longitude | Column | The longitude value from the first dataframe. |
df1Latitude | Column | The latitude value from the first dataframe. |
df2Longitude | Column | The longitude value from the second dataframe. |
df2Latitude | Column | The latitude value from the second dataframe. |
maxDistance | Length | The buffer length around point 1 to search for point 2. |
geohashPrecision | Int | The geohash precision to be used for the primary join. Value must be between 1 and 12. The higher the number, the more memory may be required. |
options | Map | Optional. Options that add extra attributes to the result of the join. |
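When choosing a geohashPrecision, it can help to see how large a geohash cell is at each precision. The sketch below assumes the standard geohash scheme (5 bits per character, bits alternating between longitude and latitude, longitude first). How the SDK uses the geohash internally is not described here, and `GeohashCellSize` is a hypothetical helper, not an SDK class.

```scala
// Hypothetical helper, assuming the standard geohash encoding.
object GeohashCellSize {
  /** Returns (widthDegrees, heightDegrees) of one geohash cell. */
  def cellSizeDegrees(precision: Int): (Double, Double) = {
    require(precision >= 1 && precision <= 12, "precision must be between 1 and 12")
    val totalBits = 5 * precision      // 5 bits per geohash character
    val lonBits = (totalBits + 1) / 2  // longitude gets the extra bit when odd
    val latBits = totalBits / 2
    (360.0 / math.pow(2, lonBits), 180.0 / math.pow(2, latBits))
  }
}

// Usage: print the cell size for every allowed precision.
(1 to 12).foreach { p =>
  val (w, h) = GeohashCellSize.cellSizeDegrees(p)
  println(f"precision $p%2d: $w%.9f x $h%.9f degrees")
}
```

Under this assumption, a precision of 7 gives cells of roughly 0.00137 degrees on each side, which can be compared against maxDistance when tuning the join.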
Options
Key | Type | Description |
---|---|---|
DistanceColumnName | String | Adds a column to the result dataframe that contains the distance calculated. |
LimitMatches | Numeric | Limits the number of joined results for each source dataframe record. |
LimitMethod | LimitMethods enumeration value | The method used for ranking matches. See LimitMethods enumeration. |
LimitMethods enumeration
Value | Description |
---|---|
RowNumber | Use the org.apache.spark.sql.functions.row_number window function for limiting matches. |
Rank | Use the org.apache.spark.sql.functions.rank window function for limiting matches. |
DenseRank | Use the org.apache.spark.sql.functions.dense_rank window function for limiting matches. |
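The three methods differ only in how they number tied matches, which changes how many rows survive a LimitMatches cutoff. The following plain-Scala sketch mimics the numbering of the three Spark window functions on a list of distances. It assumes matches are ordered by ascending distance and kept when their number is at most the limit; both are assumptions about the SDK, and `RankingSketch` is a hypothetical name.

```scala
// Hypothetical sketch of row_number / rank / dense_rank numbering.
// Input distances must already be sorted in ascending order.
object RankingSketch {
  // Sequential numbering; ties receive different numbers.
  def rowNumber(sorted: Seq[Double]): Seq[Int] =
    sorted.indices.map(_ + 1)

  // Ties share a number; the next distinct value skips ahead (gaps).
  def rank(sorted: Seq[Double]): Seq[Int] =
    sorted.map(d => sorted.indexOf(d) + 1)

  // Ties share a number; the next distinct value continues without gaps.
  def denseRank(sorted: Seq[Double]): Seq[Int] = {
    val distinctSorted = sorted.distinct
    sorted.map(d => distinctSorted.indexOf(d) + 1)
  }

  // Number of matches kept if rows numbered <= limit are retained (assumed rule).
  def kept(numbers: Seq[Int], limit: Int): Int = numbers.count(_ <= limit)
}

val distances = Seq(100.0, 250.0, 250.0, 400.0)
println(RankingSketch.rowNumber(distances)) // numbering 1, 2, 3, 4
println(RankingSketch.rank(distances))      // numbering 1, 2, 2, 4
println(RankingSketch.denseRank(distances)) // numbering 1, 2, 2, 3
```

With a limit of 2 in this example, row numbering would keep two matches, while rank and dense rank would each keep three because the tied matches share number 2.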
Return Values
Return Type | Description |
---|---|
DataFrame | The dataframe that is the result of the join. |
Examples
This example returns a dataframe that is the result of a join where points from a second dataframe are located within a 0.5-mile buffer around each point in the first dataframe.
val searchRadius = "0.5";
val distanceUnit = "mi";
val distance = new com.mapinfo.midev.unit.Length(searchRadius.toDouble,
com.mapinfo.midev.unit.LinearUnit.getFromMapInfoCode(distanceUnit))
val resultDF = baseDF.joinByDistance(joinDF, col("longitude"), col("latitude"), col("lon"),
col("lat"), distance, 7)
Example showing options set:
val distance = new Length(0.5, LinearUnit.getFromMapInfoCode("mi"))
val resultDF = baseDF.joinByDistance(joinDF, baseDF("Longitude"), baseDF("Latitude"), joinDF("Lon"), joinDF("Lat"),
distance, 7, Map(DistanceColumnName -> "outputDistance"))
Example where results are limited to 5 matches by the Rank method.
val distance = new Length(0.5, LinearUnit.getFromMapInfoCode("mi"))
val resultDF = baseDF.joinByDistance(joinDF, baseDF("Longitude"), baseDF("Latitude"), joinDF("Lon"), joinDF("Lat"),
distance, 7, Map(DistanceColumnName -> "outputDistance", LimitMatches -> 5, LimitMethod ->
LimitMethods.Rank))