Step 1: Create Cluster
Use an existing cluster or create a new one as per your requirements. Spark versions 3.0 through 3.5.2 are supported, so you can choose any version in that range. We recommend a cluster with at least 32 GB of memory and 4 cores.
Add the following configurations to the advanced section of the Spark configuration:
If you are using the Download Manager utility and your data is located in a remote location, for example on Amazon S3, then you also need to provide the credential information as environment variables.
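For example, if the data is on Amazon S3, the credentials can be supplied in the Environment Variables field of the cluster's advanced options. The variable names below are the standard AWS defaults and are shown only as an illustration; the values are placeholders:

AWS_ACCESS_KEY_ID=<your_access_key_id>
AWS_SECRET_ACCESS_KEY=<your_secret_access_key>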
Step 2: Install SDK jar
Install the following jar as a library on the cluster.
location_intelligence_bigdata_li_sdk_spark3_2_12_<version>.jar
For more information about how to install libraries, see Libraries.
Step 3: Create a notebook to perform the operations.
- Start by initializing the SparkSession and check whether you can import the SQLRegistrator class.
import com.precisely.bigdata.li.spark.api.udf.SQLRegistrator
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("testApplication")
  .getOrCreate()

// Register all of the SDK's custom spatial functions for use in Spark SQL
SQLRegistrator.registerAll()
This initializes the Spark session and registers all of the custom functions that we offer so that they can be used in Spark SQL.
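As an optional check, you can list the functions that were just registered. This sketch assumes the SDK's spatial functions follow the ST_ naming used elsewhere in this guide:

// List registered user functions whose names start with ST_
spark.sql("SHOW USER FUNCTIONS LIKE 'ST_*'").show(false)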
- Access the data from DBFS and open it as a Databricks table.
val data_file_path = "dbfs:/<dbfs_path_to_data>"

val df = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(data_file_path)

df.createOrReplaceTempView("quickGuideTable")
This uses Spark's read API to create a DataFrame from the CSV file, which can then be used to perform different Spark operations. Use whichever CSV options your data requires so that it is read as accurately as possible. For more information about CSV options, see Spark CSV options.
The last line creates a Databricks view from the DataFrame, against which we will run SQL commands.
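Before running spatial queries, it can help to preview the schema and a few rows of the view to confirm that the CSV was parsed as expected. The quickGuideTable view and the WKT, State_Name, and State columns referenced below are the ones used in the next step:

// Confirm column names and inferred types, then preview a few rows
df.printSchema()
spark.sql("SELECT * FROM quickGuideTable LIMIT 5").show(false)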
- Run spatial operations as SQL commands.
val resultDf = spark.sql(
  """SELECT CONCAT(ROUND(ST_Area(ST_GeomFromWKT(WKT), 'sq mi', 'SPHERICAL'), 2), ' sq mi') AS area,
    |       ST_Perimeter(ST_GeomFromWKT(WKT), 'mi', 'SPHERICAL') AS perimeter,
    |       State_Name,
    |       State
    |FROM quickGuideTable
    |ORDER BY perimeter""".stripMargin)
First, we convert the geometry data from WKT format to a geometry using ST_GeomFromWKT(), and then run the ST_Area() and ST_Perimeter() functions to calculate the area and perimeter of each geometry, respectively. The DataFrame resultDf stores the result of the SQL query. For more information about spatial functions, see Spark SQL functions.
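You can inspect the result in the notebook before writing it out, for example:

// Preview the first rows of the computed area and perimeter values
resultDf.show(10, false)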
- Write the resulting DataFrame to DBFS.
import org.apache.spark.sql.SaveMode // needed for SaveMode.Overwrite if not already in scope

resultDf.write.mode(SaveMode.Overwrite)
  .option("header", "true")
  .option("delimiter", ",")
  .format("csv")
  .save("dbfs:/<dbfs_file_path>/output_area_and_perimeter")
This writes (or overwrites) the output of the SQL query as CSV files at the specified path, using the specified options.
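As an optional verification step, the output can be read back into a DataFrame. The path below must match the one passed to save(), where <dbfs_file_path> remains a placeholder for your own location, and the verifyDf name is only illustrative:

// Read the CSV output back and preview it to confirm the write succeeded
val verifyDf = spark.read.format("csv")
  .option("header", "true")
  .load("dbfs:/<dbfs_file_path>/output_area_and_perimeter")
verifyDf.show(5, false)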