How to Extend Spark

In this post, we walk through extending a Spark application as well as the Spark APIs, using a few examples. These two kinds of extensions are sometimes related, and we start with extending a Spark application.

Pitfalls in Using Spark Catalog API

Spark SQL supports querying data via SQL. To use this feature, we must enable Spark with Hive support, because Spark uses the Hive Metastore to store metadata. By default, Spark uses an embedded database called Derby to store the metadata, but it can also be configured to use an external Hive Metastore. The Spark Hive configuration can be found here: Hive Tables - Spark 2.4.1 Documentation.
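
As a minimal sketch, connecting a Spark application to an external Hive Metastore might look like the following. The metastore URI and warehouse directory below are placeholder values, not settings from this post:

    from pyspark.sql import SparkSession

    # Enable Hive support and point Spark at an external Hive Metastore.
    # Replace the URI and warehouse path with your own values.
    spark = (
        SparkSession.builder
        .appName("hive-metastore-example")
        .config("hive.metastore.uris", "thrift://metastore-host:9083")
        .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
        .enableHiveSupport()
        .getOrCreate()
    )

Without these settings, enableHiveSupport() alone falls back to the default embedded Derby metastore mentioned above.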

Besides using DDL SQL to manipulate metadata stored in the Hive Metastore, Spark SQL also provides a minimalist API known as the Catalog API to manipulate metadata in Spark applications. The Spark Catalog API can be found here: Catalog (Spark 2.2.1 JavaDoc) and pyspark.sql.catalog — PySpark master documentation.
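
For illustration, here are a few common Catalog API calls in PySpark, assuming a SparkSession named spark with Hive support enabled; the database and table names are placeholders:

    # List all databases registered in the metastore.
    spark.catalog.listDatabases()

    # Switch the current database, like "use database_name" in SQL.
    spark.catalog.setCurrentDatabase("database_name")

    # List tables and columns known to the metastore.
    spark.catalog.listTables("database_name")
    spark.catalog.listColumns("table_name", "database_name")

    # Invalidate cached metadata for a table after it changes externally.
    spark.catalog.refreshTable("database_name.table_name")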

After using the Spark Catalog API for a while, I found some pitfalls.

Use Spark SQL With Hive Metastore

Spark SQL is a module for working with structured data:

    from pyspark.sql import SparkSession

    # `conf` is an existing SparkConf holding the application settings.
    spark = SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()
    spark.sql("use database_name")
    spark.sql("select * from table_name where id in (3085,3086,3087) and dt > '2019-03-10'").toPandas()

Hive is data warehouse software for reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax. Hive transforms SQL queries into MapReduce jobs; besides MapReduce, Hive can also use Spark or Tez as its execution engine.