AWS Glue is a fully managed ETL service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. A Glue job is the business logic that automates extracting, transforming, and transferring data to different locations.
A Glue job creates a Hadoop task in the backend. Because it uses the Hadoop ecosystem, all the nodes in the cluster must be able to find each other by FQDN and communicate with each other. To satisfy Hadoop's DNS lookups, both forward and reverse lookups must succeed. When you use the default DHCP option set with AmazonProvidedDNS, DNS records are created automatically for the nodes launched in the VPC.
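To make the forward and reverse lookup requirement concrete, here is a minimal Python sketch that checks both directions for a single hostname. The function name `check_dns` and its injectable `forward` and `reverse` parameters are illustrative assumptions, not part of Glue or Hadoop; by default it uses the standard library resolver calls.

```python
import socket


def check_dns(fqdn, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Return (ip, reverse_name, ok) for fqdn.

    ok is True only when the forward lookup succeeds and the reverse
    lookup of the resulting IP returns the same name.
    """
    try:
        ip = forward(fqdn)  # forward lookup: name -> IPv4 address
    except (socket.gaierror, socket.herror):
        return None, None, False
    try:
        reverse_name = reverse(ip)[0]  # reverse lookup: address -> primary name
    except (socket.gaierror, socket.herror):
        return ip, None, False
    # Compare case-insensitively and ignore a trailing root dot.
    same = reverse_name.rstrip(".").lower() == fqdn.rstrip(".").lower()
    return ip, reverse_name, same
```

Run from an instance inside the VPC, a call such as `check_dns("ip-10-100-11-25.custom.domain")` would indicate whether that node name resolves in both directions with the DNS server the VPC is using.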
When you have a custom domain name and DNS server configured in the DHCP options of your VPC, DNS hostname records are not generated automatically. In this scenario, you can use one of the following two options to run a Glue job.
Option 1: Create small subnets in your VPC specifically for Glue jobs. Then create a hostname A record for every IP address in the subnet, using the naming convention ip-10-100-11-25.custom.domain, and create a PTR record for each hostname to enable reverse lookup. This enables the Glue job to resolve each FQDN to the appropriate IP address. An example configuration of a similar setup using BIND is explained in the AWS blog post titled “Launching and Running an Amazon EMR Cluster in your VPC – Part 2: Custom DNS”.
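Writing an A and a PTR record by hand for every address gets tedious even for a small subnet, so the records can be generated. The sketch below is one possible way to do it in Python; the function name `glue_dns_records` is an assumption for illustration, and the output uses generic zone-file style lines that you would adapt to your own DNS server.

```python
import ipaddress


def glue_dns_records(cidr, domain):
    """Return a list of (a_line, ptr_line) pairs for each host IP in cidr.

    Hostnames follow the ip-10-100-11-25.<domain> convention.
    """
    records = []
    # hosts() skips the network and broadcast addresses. It still includes
    # a few addresses that AWS reserves in every subnet; the extra records
    # are unused.
    for ip in ipaddress.ip_network(cidr).hosts():
        host = "ip-" + str(ip).replace(".", "-") + "." + domain
        a_line = f"{host}. IN A {ip}"
        # reverse_pointer gives e.g. 25.11.100.10.in-addr.arpa
        ptr_line = f"{ip.reverse_pointer}. IN PTR {host}."
        records.append((a_line, ptr_line))
    return records
```

For example, `glue_dns_records("10.100.11.16/28", "custom.domain")` produces one A line and one PTR line per host address in that subnet, which you can then split into your forward and reverse zones.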
Option 2: Create a separate VPC for Glue jobs (glue-jobs-vpc) with the default DHCP options, that is, the Amazon-provided domain name and the Amazon-provided DNS server. Configure VPC peering between the Glue jobs VPC (glue-jobs-vpc) and the production VPC, which has the custom DNS server configured and is where the datastore is located. With this setup, a Glue job runs in the Glue jobs VPC, where the default settings let it resolve FQDNs successfully, and it connects to the datastore over the VPC peering connection.
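The peering part of this option could be scripted roughly as follows. This is a sketch only: it assumes both VPCs are in the same account and region, that `ec2` is a boto3 EC2 client (for example `boto3.client("ec2")`), and that each VPC has a single route table to update. The function name `peer_vpcs` and all IDs and CIDR blocks passed to it are hypothetical placeholders.

```python
import ipaddress


def peer_vpcs(ec2, glue_vpc_id, prod_vpc_id,
              glue_rtb_id, prod_rtb_id, glue_cidr, prod_cidr):
    """Peer the Glue jobs VPC with the production VPC and add routes.

    ec2 is assumed to be a boto3 EC2 client. Returns the peering
    connection ID.
    """
    # VPC peering requires non-overlapping CIDR blocks, so fail early.
    if ipaddress.ip_network(glue_cidr).overlaps(ipaddress.ip_network(prod_cidr)):
        raise ValueError("VPC CIDR blocks overlap; peering would not work")

    resp = ec2.create_vpc_peering_connection(
        VpcId=glue_vpc_id, PeerVpcId=prod_vpc_id)
    pcx_id = resp["VpcPeeringConnection"]["VpcPeeringConnectionId"]

    # Same account and region assumed, so the same client can accept.
    ec2.accept_vpc_peering_connection(VpcPeeringConnectionId=pcx_id)

    # Each side needs a route to the other side's CIDR via the peering.
    ec2.create_route(RouteTableId=glue_rtb_id,
                     DestinationCidrBlock=prod_cidr,
                     VpcPeeringConnectionId=pcx_id)
    ec2.create_route(RouteTableId=prod_rtb_id,
                     DestinationCidrBlock=glue_cidr,
                     VpcPeeringConnectionId=pcx_id)
    return pcx_id
```

Security groups and network ACLs on the datastore side would still need to allow traffic from the Glue jobs VPC; that part is left out of the sketch.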
The following diagram shows a simple depiction of option 2 discussed above.
You learned two different approaches to running AWS Glue jobs while keeping your datastore in a VPC with a custom DNS configuration. With either option, plan the VPC and subnet sizes according to your needs.