PrivateLink - enables a private connection between the control plane and the data plane, eliminating exposure to the public internet.
Endpoints - enable traffic to be routed privately from the Databricks cluster to AWS-hosted services like S3, Kinesis and STS. This reduces the cost of data processed by the NAT Gateway and makes the connection more secure, because the traffic never reaches the public internet. See an explanation of the endpoints below:
S3 endpoint - needed not only for EC2 to reach the root bucket but also for S3 buckets that contain your data. It will save you money and add another layer of security by keeping traffic to S3 on the AWS private network.
Kinesis endpoint - is for internal logs that are collected from the cluster, including important security information, auditing information, and more.
STS endpoint - is for temporary credentials that can be passed to the EC2 instance.
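As a rough illustration of the three endpoints above, the sketch below builds the AWS service names a deployment would request for a given region. The `VpcEndpoint` class and `databricks_endpoints` helper are hypothetical names used only for this example, the region is a placeholder, and the endpoint types (Gateway for S3, Interface for Kinesis and STS) are an assumption based on the common Databricks-on-AWS setup rather than something this project states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VpcEndpoint:
    """Hypothetical description of one VPC endpoint."""
    service_name: str
    endpoint_type: str  # assumed: "Gateway" or "Interface"
    purpose: str


def databricks_endpoints(region: str) -> list[VpcEndpoint]:
    """Return the S3, Kinesis and STS endpoint definitions for a region."""
    prefix = f"com.amazonaws.{region}"
    return [
        # S3: root bucket plus the buckets that hold your data
        VpcEndpoint(f"{prefix}.s3", "Gateway", "root bucket and data buckets"),
        # Kinesis: internal logs collected from the cluster
        VpcEndpoint(f"{prefix}.kinesis-streams", "Interface", "internal cluster logs"),
        # STS: temporary credentials passed to the EC2 instances
        VpcEndpoint(f"{prefix}.sts", "Interface", "temporary credentials"),
    ]


# Example region only; substitute the region of your workspace.
for endpoint in databricks_endpoints("us-east-1"):
    print(endpoint.service_name, endpoint.endpoint_type, endpoint.purpose)
```

In a Terraform deployment the same three service names would typically be passed to the VPC endpoint resources; the Python form here is only meant to make the mapping explicit.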
Security Groups - Databricks cluster with least privilege security groups
Inbound - Only traffic from within the cluster itself.
Outbound - Traffic to the cluster itself on any port. Traffic to any IP through ports 443 (Databricks API, AWS API, library repositories), 80 (HTTP requests), 3306 (Metastore) and 6666 (PrivateLink).
Security Groups - Databricks VPC Endpoints with least privilege security groups
Inbound - Only traffic from any Databricks cluster node using ports 443 (Databricks API, AWS API, library repositories), 2443 (FIPS) and 6666 (PrivateLink). Also, traffic from the EC2 - Snowplow Databricks Loader on port 443.
Outbound - Only traffic to any Databricks cluster node using ports 443 (Databricks API, AWS API, library repositories), 2443 (FIPS) and 6666 (PrivateLink).
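The two security groups can be read as small allow-lists. The sketch below encodes the rules listed above as plain Python data and checks a connection against them. It is a simplified model for reasoning about the rules, not the Terraform that deploys them; the peer labels (`"self"`, `"cluster"`, `"snowplow-loader"`, `"any"`) and the `is_allowed` helper are hypothetical names introduced for this example.

```python
ALL_PORTS = None  # marker meaning "any port"

# Each rule is (peer, ports). Peer labels are illustrative, not AWS identifiers.
CLUSTER_SG = {
    "inbound": [("self", ALL_PORTS)],
    "outbound": [
        ("self", ALL_PORTS),
        ("any", {443, 80, 3306, 6666}),
    ],
}

ENDPOINT_SG = {
    "inbound": [
        ("cluster", {443, 2443, 6666}),
        ("snowplow-loader", {443}),
    ],
    "outbound": [("cluster", {443, 2443, 6666})],
}


def is_allowed(security_group, direction, peer, port):
    """Return True if any rule in the given direction permits peer and port."""
    for rule_peer, ports in security_group[direction]:
        peer_matches = rule_peer == "any" or rule_peer == peer
        port_matches = ports is ALL_PORTS or port in ports
        if peer_matches and port_matches:
            return True
    return False  # least privilege: anything not listed is denied


print(is_allowed(CLUSTER_SG, "outbound", "internet", 443))
print(is_allowed(CLUSTER_SG, "outbound", "internet", 22))
print(is_allowed(ENDPOINT_SG, "inbound", "snowplow-loader", 6666))
```

The default-deny return at the end of `is_allowed` is what "least privilege" means in practice: a port or peer that is not explicitly listed never gets through.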
Encryption - Customer Managed Keys (AWS KMS) encrypt all Databricks resources on the Data Plane. This ensures that only the Databricks AWS account (and my Terraform deployer AWS account) has access.
Managed Services Key - Encrypts the workspace's managed services data in the control plane, including notebooks, secrets, Databricks SQL queries, and Databricks SQL query history, with a CMK.
Workspace Storage Key - Encrypts the workspace's root S3 bucket and the clusters' EBS volumes with a CMK.
Data Governance - All roles and permissions are managed through Terraform, which makes it easy to keep track of who has what access.
DBT - Transformation Layer
Benefits of DBT
Package Library - DBT has many packages to pick from, including macros that make SQL coding even easier. The Snowplow package allows me to create standard Snowplow data models.
Incremental Loading - DBT only processes new data on each run, so SQL queries are much faster
Snapshots - Very easy to set up a macro for snapshots, almost no boilerplate code
Unit Testing - It's very easy to add tests (for example, that a column is unique or not null)
Dependency Management - DBT keeps track of all table dependencies so you don't have to worry about the correct orchestration of tables/views
Open Source - DBT is open source and can be used on top of many SQL databases through adapters, making it very versatile
Environment Awareness - DBT makes moving between environments pretty effortless
Documentation - Self-generated documentation in a clean and concise format
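Two of the benefits above, dependency management and incremental loading, can be illustrated in plain Python. The sketch below is a conceptual model only and is not how DBT is implemented: the model names, the `loaded_at` column, and the `build_order` and `new_rows` helpers are all hypothetical. The first function orders models so that each one is built after the models it selects from; the second keeps only the source rows that are newer than anything already in the target table.

```python
from graphlib import TopologicalSorter

# Hypothetical models: each name maps to the set of models it depends on.
MODEL_DEPENDENCIES = {
    "stg_events": set(),
    "page_views": {"stg_events"},
    "sessions": {"page_views"},
    "users": {"sessions"},
}


def build_order(dependencies):
    """Return model names ordered so every dependency is built first."""
    return list(TopologicalSorter(dependencies).static_order())


def new_rows(source_rows, target_rows, timestamp_key="loaded_at"):
    """Return only the source rows newer than the latest row in the target."""
    if not target_rows:
        # First run: nothing loaded yet, so process everything.
        return list(source_rows)
    high_water_mark = max(row[timestamp_key] for row in target_rows)
    return [row for row in source_rows if row[timestamp_key] > high_water_mark]


print(build_order(MODEL_DEPENDENCIES))

source = [
    {"event_id": 1, "loaded_at": 10},
    {"event_id": 2, "loaded_at": 20},
    {"event_id": 3, "loaded_at": 30},
]
target = [{"event_id": 1, "loaded_at": 10}]
print(new_rows(source, target))
```

Filtering on a high-water mark is one common incremental strategy; it assumes rows never arrive with a timestamp older than what has already been loaded.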