How to audit your Databricks infrastructure?
Databricks Security Analysis Tool
Some time ago I came across an open-source tool from Databricks called the Security Analysis Tool. I was curious: security audits are common for infrastructure or applications, but I had never heard of one for Databricks, and I wondered what information and alerts this tool could provide.
This post summarizes my discovery of and experience with the tool, but I won't go over its installation, which is very well explained in the project's [ReadMe](https://github.com/databricks-industry-solutions/security-analysis-tool/blob/main/docs/setup.md).
What is it? Why deploy it?
The project's Readme presents it as follows: “Security Analysis Tool (SAT) is an observability tool that aims to improve the security hardening of Databricks deployments by making customers aware of deviations from established security best practices and by helping customers easily monitor the security health of Databricks account workspaces.”
The point is to get an overall analysis of your Databricks infrastructure, so you know whether you are following best practices and what to improve. Several areas are scrutinized:
- Network security
- Identity and access
- Data protection
- Governance
- Informational
The promise is appealing, but what does it deliver in practice? 🤔
Here is an example of a generated report:
Pros
The major positive point is that you can audit every workspace in your Databricks account with a single job, without maintaining an up-to-date list. Even if your infrastructure is in flux (the number of workspaces changes), there is no need to redeploy anything.
This audit is nothing more, nothing less than a Databricks job that runs a notebook, so you can schedule it to get an analysis over time without any maintenance.
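Since the audit is just a regular Databricks job, scheduling it amounts to attaching a cron schedule to it. Here is a minimal sketch using the Jobs 2.1 REST API; the host, token, and job ID are placeholders you would replace with your own values:

```python
import requests

# Placeholders: replace with your workspace URL, a personal access token,
# and the ID of the SAT driver job created by the setup.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"
SAT_JOB_ID = 123456789  # hypothetical job ID

# Attach a weekly schedule (every Monday at 06:00 UTC) to the existing job.
resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/update",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "job_id": SAT_JOB_ID,
        "new_settings": {
            "schedule": {
                "quartz_cron_expression": "0 0 6 ? * MON",
                "timezone_id": "UTC",
                "pause_status": "UNPAUSED",
            }
        },
    },
)
resp.raise_for_status()
```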
The audit report is viewable through a Databricks SQL Warehouse, so there is nothing to store or host anywhere other than Databricks.
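You can also query the results directly rather than going through the dashboards. Below is a sketch using the `databricks-sql-connector` package; the table and column names are assumptions based on the deployment I tested, so check the schema SAT created in your own workspace:

```python
from databricks import sql  # pip install databricks-sql-connector

# Placeholders: your workspace hostname, the warehouse HTTP path, and a PAT.
with sql.connect(
    server_hostname="<your-workspace>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
) as conn:
    with conn.cursor() as cur:
        # Table and column names are assumptions; adapt to your deployment.
        cur.execute(
            "SELECT check_id, check_time, score "
            "FROM security_analysis.security_checks "
            "ORDER BY check_time DESC LIMIT 10"
        )
        for row in cur.fetchall():
            print(row)
```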
The last point I would like to mention is that everything is already terraformed 😍, so it is very easy to test the tool and keep it running if it wins you over. Be careful though: there are some differences between the Terraform variables and what the Readme describes for a manual setup.
Cons
At the end of this POC, the first thing that came to mind was “Why isn't this integrated by default?”. Not that the tool is essential on a daily basis, but it should already be built into Databricks so that as many good practices as possible are validated during the build phase of a workspace.
This tool, although open source, is not free to run: you need a cluster for the infrastructure-scanning phase, plus a SQL Warehouse when you want to consult the audit dashboards. Moreover, since I chose not to touch the Terraform code provided by Databricks, the cost can be high: the cluster is quite large and the Photon option is enabled, for a processing time of at least 30 minutes, which rules out running this job frequently (unless you customize the Terraform code and the cluster).
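If you do want to run it more often, one option (a sketch I have not battle-tested, outside the provided Terraform) is to patch the job cluster to a smaller node type and disable Photon via the Jobs 2.1 API. The job ID, cluster key, and node type below are illustrative placeholders:

```python
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"
SAT_JOB_ID = 123456789  # hypothetical SAT job ID

# Replace the job cluster with a smaller, non-Photon one.
# Node type and worker count are illustrative; size them to your account.
resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/update",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "job_id": SAT_JOB_ID,
        "new_settings": {
            "job_clusters": [
                {
                    # Assumed cluster key; check your actual job spec.
                    "job_cluster_key": "sat_cluster",
                    "new_cluster": {
                        "spark_version": "13.3.x-scala2.12",
                        "node_type_id": "m5.large",
                        "num_workers": 1,
                        "runtime_engine": "STANDARD",  # Photon off
                    },
                }
            ]
        },
    },
)
resp.raise_for_status()
```

Note that any manual change like this will be reverted the next time the provided Terraform is applied, so editing the Terraform itself is the cleaner route.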
As for the report itself, what “disappointed” me the most is that most of the points flagged as critical (HIGH) only apply with Databricks' Enterprise plan; it's a shame you have to pay for the most expensive tier to get the most secure infrastructure possible (according to their best practices). The other points are of course interesting and lead you to discover a few things, but unfortunately nothing groundbreaking.
Conclusion
As you'll have gathered, I'm still a little disappointed, because the cost is high and the information provided, while not always obvious, isn't of great added value. But if you're a novice just starting out with Databricks, I'm sure you'll find it useful. This article has focused on Databricks SAT, but it's of course no miracle solution: you can also follow your cloud provider's best practices with the whole range of services at your disposal, and why not integrate Kics into your CI to validate your Terraform code against the Databricks provider (PS: I contributed that integration to the tool 😉).