AWS Glue is a fully managed ETL service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores.
A Glue job is the business logic that automates the extraction, transformation, and transfer of data to different locations.
A Glue job creates a Hadoop task in the backend, and because it uses the Hadoop ecosystem, all the nodes in the cluster must be able to find one another by FQDN and communicate with each other. To satisfy Hadoop's DNS requirements, both forward and reverse lookups must succeed. When you use the default DHCP option set with AmazonProvidedDNS, the DNS records for the nodes created in the VPC are generated automatically.
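The forward and reverse lookup requirement can be checked with a few lines of Python. This is a minimal sketch, not part of Glue itself: the function name check_forward_and_reverse is hypothetical, and the resolver functions are passed in as parameters only so that the check can be exercised without a live DNS server.

```python
import socket


def check_forward_and_reverse(fqdn,
                              forward=socket.gethostbyname,
                              reverse=socket.gethostbyaddr):
    """Return True if fqdn resolves to an IP and that IP resolves back to fqdn.

    forward and reverse default to the standard library resolvers; they are
    parameters (an assumption of this sketch) so they can be swapped in tests.
    """
    try:
        ip = forward(fqdn)                      # forward lookup: name -> IP
        name, _aliases, _addrs = reverse(ip)    # reverse lookup: IP -> name
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError
        return False
    # Compare case-insensitively and ignore a trailing root dot
    return name.rstrip(".").lower() == fqdn.rstrip(".").lower()


if __name__ == "__main__":
    # Example hostname taken from the naming convention used in this post
    print(check_forward_and_reverse("ip-10-100-11-25.custom.domain"))
```

Running this from an instance inside the VPC gives a quick indication of whether a node's hostname would satisfy both lookups.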
When you have a custom domain name and DNS server configured in the DHCP options of your VPC, the DNS hostname records are not generated automatically. In this scenario, you can use one of the following two options to run a Glue job.
Option 1: Create small subnets in your VPC specifically for Glue jobs. Then create a hostname A record for every IP address in the subnet, using a naming convention such as ip-10-100-11-25.custom.domain, and create PTR records for all the hostnames to enable reverse lookup. This enables the Glue job to resolve each FQDN to the appropriate IP address. An example configuration of a similar setup using BIND is explained in the AWS blog post titled “Launching and Running an Amazon EMR Cluster in your VPC – Part 2: Custom DNS”.
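Writing an A record and a PTR record by hand for every address in a subnet is tedious, so the records can be generated. The following is a minimal sketch using Python's standard ipaddress module. The function name glue_dns_records, the example subnet 10.100.11.16/28, and the exact zone-file layout are assumptions for illustration; adapt the output to whatever your DNS server expects.

```python
import ipaddress


def glue_dns_records(cidr, domain):
    """Yield (a_record, ptr_record) zone-file lines for each host address in cidr.

    Hostnames follow the ip-A-B-C-D.<domain> convention described above.
    """
    network = ipaddress.ip_network(cidr)
    # hosts() skips only the network and broadcast addresses. AWS reserves a
    # few more addresses per subnet, so some records will simply go unused.
    for ip in network.hosts():
        fqdn = f"ip-{str(ip).replace('.', '-')}.{domain}."
        a_record = f"{fqdn} IN A {ip}"
        # reverse_pointer gives e.g. 25.11.100.10.in-addr.arpa
        ptr_record = f"{ip.reverse_pointer}. IN PTR {fqdn}"
        yield a_record, ptr_record


if __name__ == "__main__":
    # Hypothetical Glue subnet and the example domain from this post
    for a_record, ptr_record in glue_dns_records("10.100.11.16/28", "custom.domain"):
        print(a_record)
        print(ptr_record)
```

The A records belong in the forward zone for the custom domain and the PTR records in the matching in-addr.arpa reverse zone.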
Option 2: Create a separate VPC for Glue jobs (glue-jobs-vpc) with the default DHCP options, that is, the Amazon-provided domain name and the Amazon-provided DNS server. Configure VPC peering between the Glue jobs VPC (glue-jobs-vpc) and the production VPC, which has the custom DNS server configured and is where the datastore is located. With this setup, a Glue job that you launch runs in the Glue jobs VPC, where it uses the default settings and can resolve the FQDNs, and it can connect to the datastore over the VPC peering connection.
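One thing to check before setting up the peering is that the two VPCs do not use overlapping CIDR blocks, because a peering connection cannot be created between VPCs whose address ranges overlap. A minimal sketch of that check follows; the function name cidrs_overlap and the example CIDR blocks are hypothetical.

```python
import ipaddress


def cidrs_overlap(cidr_a, cidr_b):
    """Return True if the two CIDR blocks share any addresses."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


if __name__ == "__main__":
    glue_jobs_vpc = "10.200.0.0/24"    # hypothetical CIDR for glue-jobs-vpc
    production_vpc = "10.100.0.0/16"   # hypothetical CIDR for the production VPC
    if cidrs_overlap(glue_jobs_vpc, production_vpc):
        print("CIDR blocks overlap; choose a different range for glue-jobs-vpc")
    else:
        print("CIDR blocks do not overlap")
```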
The following diagram shows a simple depiction of option 2 discussed above.
You learned two different approaches to running AWS Glue jobs while keeping your datastore in a VPC with a custom DNS configuration. With either option, plan the VPC and subnet sizes according to your needs.