Sys Tables in Apache Spark/Databricks

Ankit Choudhary
2 min read · Aug 9, 2020

Databricks does an amazing job of orchestrating Apache Spark. The Databricks Delta Table is a revolutionary invention and brings a set of new possibilities to the Big Data Ecosystem.

While companies are shifting to Databricks as their new Data Warehouse, there’s one feature of conventional database systems that I wish Databricks had: a dynamic system database (sys db) that holistically stores metadata about all of your Spark tables.

The sys tables (sys.tables, sys.columns, etc.) come in handy when you want to scan and search your database based on its metadata. One such use is searching for tables/columns containing PII. Wouldn’t it be great if you could just search for all the tables/columns whose names contain keywords like “email” or “phone”? With fast-moving data privacy acts like GDPR & CCPA, this can save a lot of time.
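To illustrate the idea, here is a minimal sketch of such a keyword scan in Python. The function name, the keyword list, and the sample rows are all hypothetical stand-ins for rows you would pull from a metadata table like the ones built below:

```python
# Hypothetical sketch: scan column metadata for PII-looking column names.
# The (database, table, column) tuples stand in for rows you would read
# from a sys-style metadata table.
PII_KEYWORDS = ("email", "phone", "ssn", "address")

def find_pii_columns(column_rows, keywords=PII_KEYWORDS):
    """Return the rows whose column name contains any PII keyword."""
    return [
        (db, table, col)
        for db, table, col in column_rows
        if any(k in col.lower() for k in keywords)
    ]

sample = [
    ("sales", "customers", "Customer_Email"),
    ("sales", "customers", "order_total"),
    ("hr", "employees", "home_phone"),
]

matches = find_pii_columns(sample)
# Customer_Email and home_phone match; order_total does not.
```

The same filter could just as easily be expressed as a SQL `LIKE`/`RLIKE` query once the metadata lives in a table.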

So, how do we get this sys db created in Databricks? Well, you have landed at the right place. Just follow the instructions below and start exploring.

Here’s the Entire Code on: Github

Once the code snippet/notebook has been processed, it stores the results in the tables below:

  • sys.db_tables : Contains the list of databases, their respective tables, and each object’s fully qualified name.
  • sys.db_table_columns : Contains the table-to-column mappings along with data types and comments (if any).
  • sys.db_table_properties : Contains other crucial table properties such as InputFormat, OutputFormat, storage properties, numFiles, sizeInBytes, etc.
  • sys.db_table_partitions : Contains the table-to-partition-column mappings.
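The linked GitHub repo has the actual implementation. As a rough sketch of how such a notebook could populate these tables (not the author’s code), the function below splits the rows returned by Spark’s `DESCRIBE TABLE EXTENDED <table>` command into the three buckets above: columns, partition columns, and table properties. The function name and sample rows are illustrative assumptions:

```python
def parse_describe_extended(rows):
    """Hypothetical helper: split DESCRIBE TABLE EXTENDED output rows
    (col_name, data_type, comment) into columns, partition columns,
    and table properties, based on the section headers Spark emits."""
    columns, partitions, properties = [], [], {}
    section = "columns"
    for name, dtype, comment in rows:
        name = name.strip()
        if not name:
            continue  # blank separator rows between sections
        if name == "# Partition Information":
            section = "partitions"
            continue
        if name == "# Detailed Table Information":
            section = "properties"
            continue
        if name.startswith("#"):
            continue  # repeated "# col_name" header row
        if section == "columns":
            columns.append({"column": name, "type": dtype, "comment": comment})
        elif section == "partitions":
            partitions.append(name)
        else:
            properties[name] = dtype
    return columns, partitions, properties

# Sample rows shaped like DESCRIBE TABLE EXTENDED output:
rows = [
    ("id", "bigint", None),
    ("email", "string", "customer email"),
    ("ds", "string", None),
    ("", "", ""),
    ("# Partition Information", "", ""),
    ("# col_name", "data_type", "comment"),
    ("ds", "string", None),
    ("", "", ""),
    ("# Detailed Table Information", "", ""),
    ("Database", "sales", ""),
    ("Table", "customers", ""),
]

cols, parts, props = parse_describe_extended(rows)
```

In a real notebook you would loop over `spark.sql("SHOW DATABASES")` and `spark.sql(f"SHOW TABLES IN {db}")`, feed each table through a parser like this, and write the results out to the four sys tables.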

Here’s a snapshot from a Test Notebook that I created to show the results.

(Actual table and database names have been redacted for security.)

It is recommended that you schedule this notebook as a job within your Databricks workspace and run it at a desired frequency (daily, weekly, your choice), so that the latest state of your workspace is captured in these metadata tables.

I hope Databricks incorporates this as one of its packaged offerings in a future release. Until then, I hope this blog helps you solve some of your use cases.

Happy Coding!


Ankit Choudhary

Programmer & Architect @ Deloitte in Python, Big Data, Azure/AWS by Profession. Biker, Chef & Philanthrope by Passion. Student Pilot.