December 11, 2023

Scale Azure Databricks secure network access to Azure Data Lake Storage

Many organisations store their data in a centralised data lake on Azure Storage accounts. This post looks at a way to scale secure network access between a centralised data lake storage account and the VNets that host Azure Databricks workspaces.

Figure: Architecture diagram of a storage account with Private Endpoints connected to Azure Databricks workspace VNets.

The recommended approach for scalable and secure network access to the data lake storage is to use Private Endpoints for the storage accounts.

A Private Endpoint relies on DNS resolution to automatically route connections from the VNet to the storage account over a private link. When you create a Private Endpoint with private DNS zone integration, Azure creates a private DNS zone linked to the VNet and adds the records needed for the private endpoint. However, if you're using your own DNS server, you might need to make additional changes to your DNS configuration.
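One way to check that DNS is routing traffic over the private link is to resolve the storage account's endpoint from a machine inside the VNet and confirm that it returns a private IP address. The following is a minimal Python sketch using only the standard library. The account name `mydatalake` is a hypothetical placeholder, and the check assumes the Private Endpoint was assigned an address from a private range.

```python
import ipaddress
import socket


def resolve_addresses(hostname: str) -> list[str]:
    """Return the distinct IP addresses that DNS returns for hostname."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def is_private_address(address: str) -> bool:
    """True if the address belongs to a private (non-public) range."""
    return ipaddress.ip_address(address).is_private


def resolves_privately(hostname: str) -> bool:
    """True if every address DNS returns for hostname is private."""
    addresses = resolve_addresses(hostname)
    return bool(addresses) and all(is_private_address(a) for a in addresses)


# Hypothetical usage, run from a VM or cluster inside a peered VNet:
# resolves_privately("mydatalake.dfs.core.windows.net")
```

If the function returns False from inside the VNet, the name is probably still resolving to the public endpoint, which usually points to a missing virtual network link or a custom DNS server that does not forward to the private DNS zone.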

For better scalability, create a dedicated VNet for Private Endpoints, grouped by environment (e.g. Dev, Test, Prod) or by project. Next, establish a VNet peering connection between the Private Endpoint VNet and all the VNets where your Azure Databricks workspaces are hosted. Additionally, you should establish a virtual network link between those VNets and the private DNS zone that contains the records for the Private Endpoint.

Each storage account supports up to 200 private endpoints and each VNet supports up to 500 VNet peering connections. If each Private Endpoint sits in its own dedicated VNet and that VNet is peered with up to 500 workspace VNets, this approach could scale to 200 × 500 = 100,000 VNets connected to a single centralised data lake storage account.
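The scaling estimate can be reproduced with a short calculation. This is only a sketch of the arithmetic: the two limits are the figures quoted in this post and may change over time, and the function name is made up for illustration. It assumes one dedicated VNet per Private Endpoint, with every peering slot used for a workspace VNet.

```python
# Limits as quoted in this post; check the current Azure limits before relying on them.
MAX_PRIVATE_ENDPOINTS_PER_STORAGE_ACCOUNT = 200
MAX_PEERINGS_PER_VNET = 500


def max_connected_vnets(
    private_endpoints: int = MAX_PRIVATE_ENDPOINTS_PER_STORAGE_ACCOUNT,
    peerings_per_vnet: int = MAX_PEERINGS_PER_VNET,
) -> int:
    """Upper bound on workspace VNets that can reach one storage account.

    Assumes each Private Endpoint lives in its own VNet and that VNet
    spends all of its peering connections on workspace VNets.
    """
    return private_endpoints * peerings_per_vnet


print(max_connected_vnets())  # 200 * 500
```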

Cost Optimization:

This approach involves additional costs. Each Private DNS zone costs $0.5/month plus $0.4 per million DNS queries. Each Private Endpoint costs $0.01/hour (i.e. about $7.2/month). A Private Endpoint incurs data processing charges of $0.02/GB (inbound and outbound combined), and VNet peering within the same region also incurs data processing charges of $0.02/GB (inbound and outbound combined). For example, processing 500 GB of data costs 500 × ($0.02 + $0.02) = $20.
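To see how these charges add up for a given workload, the figures above can be put into a small estimator. This is a rough sketch, not a pricing tool: the rates are the ones quoted in this post, the 720-hour month is an assumption that matches the $7.2/month figure, and the function and parameter names are hypothetical.

```python
HOURS_PER_MONTH = 720  # assumed 30-day month, consistent with $7.2/month

# Rates in USD as quoted in this post; actual pricing may differ by region and date.
PRIVATE_DNS_ZONE_PER_MONTH = 0.5
DNS_PER_MILLION_QUERIES = 0.4
PRIVATE_ENDPOINT_PER_HOUR = 0.01
PRIVATE_ENDPOINT_PER_GB = 0.02  # inbound and outbound combined
VNET_PEERING_PER_GB = 0.02      # same region, inbound and outbound combined


def data_processing_cost(gb: float) -> float:
    """Per-GB charges for traffic that crosses both the peering and the endpoint."""
    return gb * (PRIVATE_ENDPOINT_PER_GB + VNET_PEERING_PER_GB)


def monthly_cost(
    gb: float,
    dns_queries_millions: float = 0.0,
    private_endpoints: int = 1,
    dns_zones: int = 1,
) -> float:
    """Estimated monthly cost: fixed charges plus DNS queries plus data processing."""
    fixed = (
        dns_zones * PRIVATE_DNS_ZONE_PER_MONTH
        + private_endpoints * PRIVATE_ENDPOINT_PER_HOUR * HOURS_PER_MONTH
    )
    dns = dns_queries_millions * DNS_PER_MILLION_QUERIES
    return fixed + dns + data_processing_cost(gb)


print(f"Data processing for 500 GB: ${data_processing_cost(500):.2f}")
print(f"Monthly total for 500 GB:   ${monthly_cost(500):.2f}")
```

The fixed charges stay small, so for data-heavy workloads the per-GB term dominates the total.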

The cost scales linearly with the amount of data processed. To optimise cost, combine this solution (Private Endpoint plus VNet peering) with storage account firewall rules that allow access only from specific VNets. For scenarios where large volumes of data need to be processed, using the public endpoint with network access restricted to specific VNets helps to avoid the data processing charges.

This post was originally published at Azure Architecture Blog.

© Prakash P 2015 - 2023
