Terraform Provider
The Rockset Terraform Provider lets you deploy and manage Rockset resources programmatically with Terraform. You can idempotently deploy Rockset integrations, collections, and other resources alongside any third-party dependencies, which simplifies and automates the deployment process. The steps below show how to integrate Rockset into your existing CI/CD pipelines with Terraform.
Installation
First, we need to make sure Terraform is installed. We can follow the official installation steps here. After that we can verify that terraform is ready to use by checking the help command.
terraform -help
Rockset Provider
A Terraform provider adds a set of resources and/or data sources that Terraform can then deploy. You can read more about how providers work here. To use Rockset's provider, we first list it as a required provider and then declare the provider itself.
In a file called version.tf we add:
terraform {
  required_providers {
    rockset = {
      source  = "rockset/rockset"
      version = "~> 0.9"
    }
  }
}

provider "rockset" {
  # Optionally, you can put your apiserver endpoint and apikey here, but it is recommended to set these
  # as ENV variables instead.
}
Now, from the same directory, we can call terraform init, which initializes the working directory and downloads the required providers.
We will also need to set the environment variables ROCKSET_APIKEY and ROCKSET_APISERVER based on our account. You can view your region endpoints and API keys in the API Keys tab of the Rockset console. For Unix-based systems:
export ROCKSET_APIKEY=<Your key here>
export ROCKSET_APISERVER=<Your endpoint here>
Example: Amazon RDS (Hosted PostgreSQL)
To connect to Amazon RDS, a few configurations and resources have to be set up. Doing all of this manually is tedious, error-prone, and not easily repeatable, which makes it a perfect use case for Terraform. We will need to set up our Terraform configuration so that the deployment will:
- Configure PostgreSQL Server
- Create an AWS Kinesis Stream
- Set up the AWS Database Migration Service (DMS)
- Finally create the Rockset integration and collection
Download AWS Provider
First, let's update our version.tf to also import the AWS Terraform provider. You can also set any AWS provider configuration you need here. After updating this file, you will need to run terraform init again.
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4"
    }
    rockset = {
      source  = "rockset/rockset"
      version = "~> 0.9"
    }
  }
}

provider "rockset" {
}

provider "aws" {
  region = var.region
}
Variables of the form var.* are Terraform input variables, which allow us to create generic Terraform templates. You can define these variables as Terraform command-line arguments or in a separate file such as terraform.tfvars.
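As a rough illustration of how such input variables are declared, here is a minimal sketch of a variables.tf covering a few of the var.* names used in this guide. The file name, descriptions, types, and defaults are assumptions made for the example, and the remaining variables referenced later would follow the same pattern.

```hcl
# variables.tf (hypothetical file; types and descriptions are assumptions)

variable "region" {
  description = "AWS region to deploy into"
  type        = string
}

variable "name" {
  description = "Prefix applied to the names of the resources created in this guide"
  type        = string
}

variable "db_tables" {
  # Used with for_each later on: each.key names the stream and collection,
  # each.value is the PostgreSQL table name to replicate.
  description = "Map of a short key to the table name to replicate"
  type        = map(string)
}

variable "common_tags" {
  description = "Tags applied to every taggable AWS resource"
  type        = map(string)
  default     = {}
}
```

A matching terraform.tfvars could then look like the following (all values are placeholders):

```hcl
# terraform.tfvars (placeholder values)
region = "us-west-2"
name   = "example"

db_tables = {
  users  = "users"
  orders = "orders"
}
```

A single value can also be passed on the command line instead, for example terraform apply -var="region=us-west-2".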
Configure RDS
Configuration Warning
In this step we bring the existing RDS instance into our Terraform environment in order to enable backups and replication. Terraform will blindly apply any changes from the configuration file to the existing database. You can alternatively update this manually in the console and skip this step in the Terraform workflow.
Before updating your RDS service using Terraform you will need to make sure that you are properly authenticated with AWS. You can check here to learn more about AWS authentication and here to learn how AWS authentication interacts with Terraform.
Once authentication is set up, we can import our RDS state and apply an update to allow CDC streaming. We need to add some Terraform configuration for our AWS RDS instance. We will create a new file, rds.tf:
resource "aws_db_parameter_group" "rds-parameter-group" {
  name        = "example-param"
  family      = "postgres13"
  description = "RDS example cluster parameter group"

  parameter {
    apply_method = "pending-reboot"
    name         = "rds.logical_replication"
    value        = 1
  }
}

resource "aws_db_instance" "rds-instance" {
  # Fill in with any existing configuration which you will import from your existing RDS instance
  # below.
  instance_class          = var.rds_instance.instance_class
  parameter_group_name    = aws_db_parameter_group.rds-parameter-group.name
  backup_retention_period = var.rds_backup_retention_period
}
The most important thing above is that we have a parameter group with rds.logical_replication set to 1 and that we have set up backups on our RDS instance. Everything else you will want to inherit from your existing instance. To pull the state for your existing instance down into your local Terraform state, you will have to perform a Terraform import using your RDS identifier, which you can find in the AWS console. You can learn more about Terraform imports here.
terraform import aws_db_instance.rds-instance <RDS Identifier>
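If you are on Terraform 1.5 or newer, the same import can also be written declaratively with an import block, so that it runs as part of the normal plan and apply cycle. This is a sketch of an alternative to the command above, not a step this guide requires; the id value is a placeholder for your own RDS identifier.

```hcl
# Declarative alternative to the terraform import CLI command (Terraform 1.5+).
import {
  # Address of the resource block defined in rds.tf.
  to = aws_db_instance.rds-instance

  # Placeholder: replace with your RDS identifier from the AWS console.
  id = "my-rds-identifier"
}
```

With this block in place, terraform plan reports the pending import alongside any other changes, and the block can be removed once the apply has succeeded.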
Now that we have the Terraform state for our RDS instance we can update our Terraform file to better represent the RDS instance we already have. We can see a diff of these configurations by running:
terraform plan
This should give us an idea of what differs between our current Terraform file and what is already deployed (as recorded in our local state). Make any necessary changes to the Terraform file so that existing configuration is not lost. Note that the plan will mention that the RDS instance has to be restarted. This is because rds.logical_replication is a static parameter, which only takes effect after a reboot.
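While reconciling the configuration with the imported instance, one optional safeguard is Terraform's prevent_destroy lifecycle setting. The sketch below is an assumption layered on top of the rds.tf shown earlier, not part of the original setup: with it, any plan that would destroy or replace the database fails with an error instead of proceeding.

```hcl
resource "aws_db_instance" "rds-instance" {
  # Arguments reconciled from the imported instance go here, as in rds.tf.
  instance_class          = var.rds_instance.instance_class
  parameter_group_name    = aws_db_parameter_group.rds-parameter-group.name
  backup_retention_period = var.rds_backup_retention_period

  lifecycle {
    # Assumed safeguard: reject any plan that would destroy or replace
    # this instance, for example because an argument was left out.
    prevent_destroy = true
  }
}
```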
At this point if you're feeling comfortable with your changes run:
terraform apply
This will carry out the actual changes and report an error if anything breaks during the deploy.
Create Kinesis Streams
Now that our RDS instance is configured correctly we need to create a Kinesis stream which will serve as the glue between our RDS instance and Rockset. In a separate file called kinesis.tf we can add the following:
locals {
  all_kinesis_streams_array_json = [for o in aws_kinesis_stream.db_stream : o.arn]
}

data "aws_iam_policy_document" "dms_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      identifiers = ["dms.amazonaws.com"]
      type        = "Service"
    }
  }
}

data "aws_iam_policy_document" "dms_kinesis_role" {
  statement {
    actions = [
      "kinesis:DescribeStream",
      "kinesis:PutRecord",
      "kinesis:PutRecords"
    ]
    resources = local.all_kinesis_streams_array_json
  }
}

data "aws_iam_policy_document" "rockset_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::${var.rockset_account_id}:root"]
    }
    condition {
      test     = "StringEquals"
      variable = "sts:ExternalId"
      values   = [var.rockset_external_id]
    }
  }
}

data "aws_iam_policy_document" "rockset_kinesis_role" {
  statement {
    actions = [
      "kinesis:ListShards",
      "kinesis:DescribeStream",
      "kinesis:GetRecords",
      "kinesis:GetShardIterator"
    ]
    resources = local.all_kinesis_streams_array_json
  }
}

# These are the Kinesis streams (one per table in var.db_tables) that Rockset will connect to
# through the integration. All PostgreSQL updates flow into them.
resource "aws_kinesis_stream" "db_stream" {
  for_each            = var.db_tables
  name                = "${var.name}-stream-${each.key}"
  retention_period    = var.kinesis_retention_period
  shard_level_metrics = []

  stream_mode_details {
    stream_mode = "ON_DEMAND"
  }

  tags = var.common_tags
}

# This is the role that allows AWS DMS to write into your Kinesis streams.
resource "aws_iam_role" "kinesis_dms" {
  assume_role_policy = data.aws_iam_policy_document.dms_assume_role.json
  name               = "${var.name}-kinesis-dms"

  inline_policy {
    name   = "${var.name}-dms_kinesis_role"
    policy = data.aws_iam_policy_document.dms_kinesis_role.json
  }
}

# This is the role Rockset will use to connect to your Kinesis streams and ingest.
resource "aws_iam_role" "kinesis_rockset" {
  name               = "${var.name}-kinesis-rockset"
  assume_role_policy = data.aws_iam_policy_document.rockset_assume_role.json

  inline_policy {
    name   = "${var.name}-rockset_kinesis_policy"
    policy = data.aws_iam_policy_document.rockset_kinesis_role.json
  }
}
There's a lot going on above, so it's worth briefly talking about what it does. First, we create the Kinesis streams themselves, one per table in var.db_tables. You can read more about what Kinesis is here, but for now just think of it as a data stream that carries changes from RDS to Rockset and keeps the two in sync.
Next, we create two new roles with different policies. kinesis_dms is the role that DMS will use in the next step to provide the streams with updates. You can see from its associated policy at the top that it needs permission to put new records into Kinesis. kinesis_rockset is the role that the Rockset integration will use to read records from Kinesis. You can see that Rockset only needs to read records and does not have the kinesis:PutRecord or kinesis:PutRecords permissions.
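To check what kinesis.tf actually created, it can help to expose the stream ARNs and the Rockset role ARN as outputs. The following is a minimal sketch; the output names and the file name are assumptions, while the resource addresses are the ones defined above.

```hcl
# outputs.tf (hypothetical file)

output "kinesis_stream_arns" {
  description = "ARN of each per-table Kinesis stream, keyed like var.db_tables"
  # for_each turns aws_kinesis_stream.db_stream into a map of instances,
  # so we iterate over it to pick out each ARN.
  value = { for key, stream in aws_kinesis_stream.db_stream : key => stream.arn }
}

output "rockset_role_arn" {
  description = "ARN of the role that the Rockset integration assumes"
  value       = aws_iam_role.kinesis_rockset.arn
}
```

After an apply, terraform output prints these values, which makes it easier to confirm that one stream exists per table before moving on to DMS.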
Set up DMS
Now that we have a Kinesis resource set up let's stitch it together with our RDS instance. Create a new file called dms.tf:
resource "aws_iam_role" "dms-cloudwatch-logs-role" {
  assume_role_policy = data.aws_iam_policy_document.dms_assume_role.json
  name               = "${var.name}-dms-cloudwatch-logs-role"
}

resource "aws_iam_role_policy_attachment" "dms-cloudwatch-logs-role-AmazonDMSCloudWatchLogsRole" {
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonDMSCloudWatchLogsRole"
  role       = aws_iam_role.dms-cloudwatch-logs-role.name
}

resource "aws_iam_role" "dms-vpc-role" {
  assume_role_policy = data.aws_iam_policy_document.dms_assume_role.json
  name               = "${var.name}-dms-vpc-role"
}

resource "aws_iam_role_policy_attachment" "dms-vpc-role-AmazonDMSVPCManagementRole" {
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonDMSVPCManagementRole"
  role       = aws_iam_role.dms-vpc-role.name
}

resource "aws_dms_replication_subnet_group" "replication_subnet_group" {
  replication_subnet_group_description = "${var.name} dms replication subnet group"
  replication_subnet_group_id          = "${var.name}-dms-replication-subnet-group"
  subnet_ids                           = var.nodes_subnets
  tags                                 = var.common_tags
}

resource "aws_dms_replication_instance" "instance" {
  allocated_storage           = var.rds_instance_allocated_storage
  apply_immediately           = true
  auto_minor_version_upgrade  = true
  engine_version              = var.rds_instance_engine_version
  multi_az                    = var.replication_instance.multi_az
  publicly_accessible         = var.replication_instance.publicly_accessible
  replication_instance_class  = var.replication_instance.instance_class
  replication_instance_id     = "${var.name}-dms"
  replication_subnet_group_id = aws_dms_replication_subnet_group.replication_subnet_group.id
  tags                        = var.common_tags
  vpc_security_group_ids = [
    var.rds_instance.sg
  ]
}

resource "aws_dms_endpoint" "source" {
  endpoint_type               = "source"
  endpoint_id                 = "${var.name}-endpoint-source"
  engine_name                 = var.db_engine_name
  database_name               = aws_db_instance.rds-instance.db_name
  extra_connection_attributes = var.db_extra_connection_attributes
  username                    = aws_db_instance.rds-instance.username
  password                    = var.rds_instance.password
  port                        = aws_db_instance.rds-instance.port
  server_name                 = aws_db_instance.rds-instance.address
  ssl_mode                    = "none"
  tags                        = var.common_tags
}

resource "aws_dms_endpoint" "target" {
  for_each      = var.db_tables
  endpoint_type = "target"
  endpoint_id   = "${var.name}-endpoint-target-${each.key}"
  engine_name   = "kinesis"

  kinesis_settings {
    include_null_and_empty         = true
    include_table_alter_operations = true
    message_format                 = "json"
    stream_arn                     = aws_kinesis_stream.db_stream[each.key].arn
    service_access_role_arn        = aws_iam_role.kinesis_dms.arn
  }

  tags = var.common_tags
}

resource "aws_dms_replication_task" "task" {
  for_each                 = var.db_tables
  migration_type           = "full-load-and-cdc"
  replication_instance_arn = aws_dms_replication_instance.instance.replication_instance_arn
  replication_task_id      = "${var.name}-replication-task-${each.key}"
  source_endpoint_arn      = aws_dms_endpoint.source.endpoint_arn
  start_replication_task   = true
  table_mappings = jsonencode({
    "rules" : [
      {
        "rule-type" : "selection",
        "rule-id" : "1",
        "rule-name" : "1",
        "object-locator" : {
          "schema-name" : var.db_schema_pattern,
          "table-name" : each.value
        },
        "rule-action" : "include"
      }
    ]
  })
  tags                = var.common_tags
  target_endpoint_arn = aws_dms_endpoint.target[each.key].endpoint_arn

  lifecycle {
    ignore_changes = [replication_task_settings]
  }
}
Now that's a lot of Terraform. Let's see why we need all of this. In order to set up our DMS replication instance the following roles are needed:
- dms-vpc-role
- dms-cloudwatch-logs-role
- dms-access-for-endpoint
To read more about these roles and why they are needed for DMS, you can view the AWS Database Migration Service User Guide.
Next, we define the DMS subnet group, which is simply a collection of subnets that will be used by the DMS replication instance. Along with the subnet group, we also define the DMS replication instance itself. The replication instance is an EC2 instance that performs the actual data migration. It serves as a buffer between the DMS source and target: it reads from the source database and then applies any desired transformations for the target. The replication instance must be attached to two replication endpoints, which define the source database and the target. Finally, the aws_dms_replication_task defines the actual task to be performed, including what data should be read and what mappings should take place.
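The DMS configuration above reads several attributes from two object-valued variables, var.rds_instance and var.replication_instance. One way these could be declared is sketched below; the attribute names are taken from the references in this guide, while the types, descriptions, and the choice to mark the first variable as sensitive are assumptions.

```hcl
variable "rds_instance" {
  description = "Settings of the existing RDS instance that DMS reads from"
  type = object({
    instance_class = string
    sg             = string # ID of the security group attached to the replication instance
    password       = string # database password used by the DMS source endpoint
  })
  # Assumption: because the object carries a password, mark it sensitive.
  # Terraform then redacts every attribute of this object in plan output.
  sensitive = true
}

variable "replication_instance" {
  description = "Sizing and networking options for the DMS replication instance"
  type = object({
    instance_class      = string
    multi_az            = bool
    publicly_accessible = bool
  })
}
```

Keeping the password in its own variable would avoid redacting the non-secret attributes; grouping it here simply mirrors how the guide references var.rds_instance.password.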
We now have the major parts of this integration set up. Let's run a terraform apply to make sure there are no hiccups in setting up DMS between our RDS instance and Kinesis.
terraform apply
Create Rockset Integration and Collections
It's now time to create the Rockset integration and collections. Make sure that the API server endpoint and API key for Rockset are properly set in the provider block or in your local environment. Let's define one last Terraform file, rockset.tf:
resource "rockset_kinesis_integration" "rockset-rds-integration" {
  aws_role_arn = aws_iam_role.kinesis_rockset.arn
  name         = "MyFirstTerraformedRocksetIntegration"
}

resource "rockset_kinesis_collection" "rockset-rds-collection" {
  # Create one collection per table in var.db_tables.
  for_each  = var.db_tables
  name      = "ThisCollectionWasDeployedWithTerraform-${each.key}"
  workspace = "commons"

  source {
    format           = "postgres"
    integration_name = rockset_kinesis_integration.rockset-rds-integration.name
    # Kinesis stream that this collection will ingest from.
    stream_name = "${var.name}-stream-${each.key}"
    # The list of fields that will be used to construct the unique document id.
    dms_primary_key = var.compound_id_list
  }
}
One more terraform apply and we're all done! The above gives Rockset permission to start tailing the Kinesis streams we just created. If you navigate to the Rockset console, you should now see a new Amazon Kinesis integration and a corresponding collection for each table listed in var.db_tables.
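The collection block above rebuilds the stream name from var.name, so Terraform has no way of knowing that each collection depends on its Kinesis stream. A possible variant, sketched here as a suggestion and not as what the guide prescribes, reads the name from the stream resource instead. Terraform then infers the ordering and creates each stream before the collection that ingests from it.

```hcl
resource "rockset_kinesis_collection" "rockset-rds-collection" {
  for_each  = var.db_tables
  name      = "ThisCollectionWasDeployedWithTerraform-${each.key}"
  workspace = "commons"

  source {
    format           = "postgres"
    integration_name = rockset_kinesis_integration.rockset-rds-integration.name
    # Referencing the resource attribute (same value as the interpolated
    # string) adds an implicit dependency on the matching stream.
    stream_name     = aws_kinesis_stream.db_stream[each.key].name
    dms_primary_key = var.compound_id_list
  }
}
```

Note that var.compound_id_list is shared by every collection here, so this layout assumes all replicated tables use the same primary-key columns.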
Check out our blog on "How to Use Terraform with Rockset" for more info!