Terraform Provider

The Rockset Terraform Provider lets you deploy and manage Rockset resources programmatically with Terraform. This allows you to idempotently deploy Rockset integrations, collections, and other resources alongside any third-party dependencies, which simplifies and automates the deployment process. The steps below show how to integrate Rockset into your existing CI/CD pipelines with Terraform.

Installation

First, we need to make sure Terraform is installed. We can follow the official installation steps here. After that, we can verify that Terraform is ready to use by running the help command.

terraform -help

Rockset Provider

A Terraform provider adds a set of resources and/or data sources which Terraform will then be able to deploy. You can read more about how providers work here. To use Rockset's provider, we first set it as a required dependency and then declare the provider.

In a file called version.tf we add:

terraform {
  required_providers {
    rockset = {
      source  = "rockset/rockset"
      version = "~> 0.9"
    }
  }
}

provider "rockset" {
  # Optionally, you can put your API server endpoint and API key here, but it is recommended to set
  # these as environment variables instead.
}

Now from the same directory we can call terraform init which will initialize our Terraform state and pull any needed dependencies.

We will also need to set the environment variables ROCKSET_APIKEY and ROCKSET_APISERVER for your account. You can view your region endpoints and API keys in the API Keys tab of the Rockset console. For Unix-based systems:

export ROCKSET_APIKEY=<Your key here>
export ROCKSET_APISERVER=<Your endpoint here>

Example: Amazon RDS (Hosted PostgreSQL)

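To connect to Amazon RDS, a few configurations and resources have to be set up. Doing all of this manually is tedious, error-prone, and not easily repeatable, which makes it a perfect use case for Terraform. We will need to set up our Terraform configuration so that the deploy will:

  • Download the AWS provider
  • Configure RDS for backups and logical replication
  • Create Kinesis streams
  • Set up DMS
  • Create the Rockset integration and collections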

Download AWS Provider

First, let's update our version.tf to also import the AWS Terraform provider. You can also set any AWS Terraform configurations needed here. After updating this file, you will need to run terraform init again.

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4"
    }

    rockset = {
      source  = "rockset/rockset"
      version = "~> 0.9"
    }
  }
}

provider "rockset" {
}

provider "aws" {
 region = var.region
}

Variables of the form var.* are Terraform input variables, which allow us to create generic Terraform templates. You can set these variables with Terraform command-line arguments or in a separate file such as terraform.tfvars.
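As a minimal sketch of how such a variable could be declared, the block below defines the region variable used by the AWS provider. The file name variables.tf and the description text are assumptions for illustration; only the variable name region comes from the configuration above.

```hcl
# variables.tf (hypothetical file name; any .tf file in the same directory works)

# Declares the input referenced as var.region in the aws provider block.
variable "region" {
  type        = string
  description = "AWS region that hosts the RDS instance."
}
```

With this declaration in place, a line such as region = "us-west-2" in terraform.tfvars (the region value is only an example) or the flag -var="region=us-west-2" on the command line supplies the value.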

Configure RDS

🚧

Configuration Warning

In this step we bring the existing RDS instance into our Terraform environment in order to enable backups and replication. Terraform will blindly apply any changes from the configuration file to the existing database. You can alternatively update this manually in the console and skip this step in the Terraform workflow.

Before updating your RDS service using Terraform you will need to make sure that you are properly authenticated with AWS. You can check here to learn more about AWS authentication and here to learn how AWS authentication interacts with Terraform.

Once we have our authentication set up, we can work on importing our RDS state and applying an update to allow CDC streaming. We need to add some Terraform configuration for our AWS RDS instance. We will create a new file rds.tf:


resource "aws_db_parameter_group" "rds-parameter-group" {
  name        = "example-param"
  family      = "postgres13"
  description = "RDS example cluster parameter group"

  parameter {
    apply_method = "pending-reboot"
    name  = "rds.logical_replication"
    value = 1
  }
}

resource "aws_db_instance" "rds-instance" {
   # Fill in with any existing configuration which you will import from your existing RDS instance
   # below.

   instance_class       = var.rds_instance.instance_class
   parameter_group_name = aws_db_parameter_group.rds-parameter-group.name
   backup_retention_period = var.rds_backup_retention_period
}

The most important thing above is that we have a parameter group with rds.logical_replication set to 1 and that we have set up backups on our RDS instance. Everything else you will want to inherit from your existing instance. To pull the state for your existing instance down into your local Terraform state, you will have to perform a Terraform import using your RDS identifier, which you can find in the AWS console. You can learn more about Terraform imports here.

terraform import aws_db_instance.rds-instance <RDS Identifier>
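If you are on Terraform 1.5 or later, the same import can also be expressed declaratively with an import block, so that it runs as part of the normal plan and apply cycle. This is an alternative sketch rather than part of the original workflow, and the identifier shown is a hypothetical placeholder.

```hcl
# Declarative alternative to the `terraform import` command (requires Terraform 1.5+).
import {
  # Resource address declared in rds.tf.
  to = aws_db_instance.rds-instance

  # Hypothetical placeholder: replace with the RDS identifier from the AWS console.
  id = "my-existing-rds-identifier"
}
```

Once the import has been applied, the import block can be removed from the configuration without affecting the state.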


Now that we have the Terraform state for our RDS instance we can update our Terraform file to better represent the RDS instance we already have. We can see a diff of these configurations by running:

terraform plan

This should give us an idea of what is different between our current Terraform file and what is already deployed (as recorded in our local state). Make any necessary changes to the Terraform file so that existing configuration is not lost. Note that the plan will mention that the RDS instance has to be restarted. This is because rds.logical_replication is a static parameter, which only takes effect after the instance reboots.

At this point if you're feeling comfortable with your changes run:

terraform apply

This will carry out the actual changes and complain if anything breaks during the deploy.
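Because Terraform applies whatever the configuration says to the existing database, one optional safeguard is a lifecycle block on the instance. The sketch below repeats the aws_db_instance resource from rds.tf and adds prevent_destroy; this setting is an assumption on our part and not part of the original example.

```hcl
resource "aws_db_instance" "rds-instance" {
  # Fill in with the existing configuration imported from your RDS instance.

  instance_class          = var.rds_instance.instance_class
  parameter_group_name    = aws_db_parameter_group.rds-parameter-group.name
  backup_retention_period = var.rds_backup_retention_period

  # Assumption (not in the original example): make Terraform reject any plan
  # that would destroy this database instead of carrying it out.
  lifecycle {
    prevent_destroy = true
  }
}
```

This does not stop in-place modifications, so reviewing the terraform plan output before every apply remains necessary.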

Create Kinesis Streams

Now that our RDS instance is configured correctly we need to create a Kinesis stream which will serve as the glue between our RDS instance and Rockset. In a separate file called kinesis.tf we can add the following:

locals {
  all_kinesis_streams_array_json = [for o in aws_kinesis_stream.db_stream : o.arn]
}

data "aws_iam_policy_document" "dms_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      identifiers = ["dms.amazonaws.com"]
      type        = "Service"
    }
  }
}

data "aws_iam_policy_document" "dms_kinesis_role" {
  statement {
    actions = [
      "kinesis:DescribeStream",
      "kinesis:PutRecord",
      "kinesis:PutRecords"
    ]
    resources = local.all_kinesis_streams_array_json
  }
}

data "aws_iam_policy_document" "rockset_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::${var.rockset_account_id}:root"]
    }

    condition {
      test     = "StringEquals"
      variable = "sts:ExternalId"
      values   = [var.rockset_external_id]
    }
  }
}

data "aws_iam_policy_document" "rockset_kinesis_role" {
  statement{
    actions = ["kinesis:ListShards", "kinesis:DescribeStream",
               "kinesis:GetRecords", "kinesis:GetShardIterator"]
    resources = local.all_kinesis_streams_array_json
  }
}

# This is the Kinesis stream Rockset will connect to as an integration. It will be where all PostgreSQL updates flow to.
resource "aws_kinesis_stream" "db_stream" {
  for_each = var.db_tables

  name = "${var.name}-stream-${each.key}"
  retention_period = var.kinesis_retention_period
  shard_level_metrics = []

  stream_mode_details {
    stream_mode = "ON_DEMAND"
  }
  tags = var.common_tags
}

# This is the role that allows AWS DMS to write into your Kinesis stream.
resource "aws_iam_role" "kinesis_dms" {
  assume_role_policy = data.aws_iam_policy_document.dms_assume_role.json
  name = "${var.name}-kinesis-dms"

  inline_policy {
    name = "${var.name}-dms_kinesis_role"
    policy = data.aws_iam_policy_document.dms_kinesis_role.json
  }
}

# This is the Role Rockset will use to connect to your Kinesis stream and ingest.
resource "aws_iam_role" "kinesis_rockset" {
  name = "${var.name}-kinesis-rockset"
  assume_role_policy = data.aws_iam_policy_document.rockset_assume_role.json

  inline_policy {
    name = "${var.name}-rockset_kinesis_policy"
    policy = data.aws_iam_policy_document.rockset_kinesis_role.json
  }
}

There's a lot going on above, and it's worth briefly talking about what it does. First, we create the Kinesis resource itself. You can read more about what Kinesis is here, but just think of it as a streaming database that will keep RDS in sync with Rockset.

Next, we create two new roles with different policies. kinesis_dms is the role that will be used by DMS in the next step to provide the stream with updates. You can see from its associated policy at the top that it needs permission to put new records into Kinesis. kinesis_rockset is the role that will be used by the Rockset integration to read records from Kinesis. You can see that Rockset only needs to get records and does not have "PutRecords" permissions.
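The configuration in kinesis.tf references several input variables that are not declared in this guide. The following is a hedged sketch of how they might be declared; the types, descriptions, and default values are assumptions inferred from how each variable is used, not definitions taken from the original project.

```hcl
# Hypothetical declarations for the inputs referenced in kinesis.tf.

variable "name" {
  type        = string
  description = "Prefix applied to the names of the streams and IAM roles."
}

# for_each creates one stream per entry: each.key is used in resource names,
# and each.value is the table name used later in the DMS table mappings.
variable "db_tables" {
  type        = map(string)
  description = "Map of a short key to the PostgreSQL table name."
}

variable "kinesis_retention_period" {
  type        = number
  description = "Stream retention period in hours."
  default     = 24
}

variable "common_tags" {
  type        = map(string)
  description = "Tags applied to every resource that supports them."
  default     = {}
}

variable "rockset_account_id" {
  type        = string
  description = "AWS account ID that Rockset uses to assume the role."
}

variable "rockset_external_id" {
  type        = string
  description = "External ID required by the role's trust policy."
}
```

For example, setting db_tables = { users = "users", orders = "orders" } in terraform.tfvars (hypothetical table names) would create two streams named after the users and orders keys.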

Set up DMS

Now that we have a Kinesis resource set up let's stitch it together with our RDS instance. Create a new file called dms.tf:

resource "aws_iam_role" "dms-cloudwatch-logs-role" {
  assume_role_policy = data.aws_iam_policy_document.dms_assume_role.json
  name               = "${var.name}-dms-cloudwatch-logs-role"
}

resource "aws_iam_role_policy_attachment" "dms-cloudwatch-logs-role-AmazonDMSCloudWatchLogsRole" {
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonDMSCloudWatchLogsRole"
  role       = aws_iam_role.dms-cloudwatch-logs-role.name
}

resource "aws_iam_role" "dms-vpc-role" {
  assume_role_policy = data.aws_iam_policy_document.dms_assume_role.json
  name               = "${var.name}-dms-vpc-role"
}

resource "aws_iam_role_policy_attachment" "dms-vpc-role-AmazonDMSVPCManagementRole" {
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonDMSVPCManagementRole"
  role       = aws_iam_role.dms-vpc-role.name
}

resource "aws_dms_replication_subnet_group" "replication_subnet_group" {
  replication_subnet_group_description = "${var.name} dms replication subnet group"
  replication_subnet_group_id          = "${var.name}-dms-replication-subnet-group"

  subnet_ids = var.nodes_subnets

  tags = var.common_tags
}

resource "aws_dms_replication_instance" "instance" {
  allocated_storage           = var.rds_instance_allocated_storage
  apply_immediately           = true
  auto_minor_version_upgrade  = true
  engine_version              = var.rds_instance_engine_version
  multi_az                    = var.replication_instance.multi_az
  publicly_accessible         = var.replication_instance.publicly_accessible
  replication_instance_class  = var.replication_instance.instance_class
  replication_instance_id     = "${var.name}-dms"
  replication_subnet_group_id = aws_dms_replication_subnet_group.replication_subnet_group.id

  tags = var.common_tags

  vpc_security_group_ids = [
    var.rds_instance.sg
  ]
}

resource "aws_dms_endpoint" "source" {
  endpoint_type = "source"
  endpoint_id   = "${var.name}-endpoint-source"
  engine_name                 = var.db_engine_name
  database_name               = aws_db_instance.rds-instance.db_name
  extra_connection_attributes = var.db_extra_connection_attributes
  username    = aws_db_instance.rds-instance.username
  password    = var.rds_instance.password
  port        = aws_db_instance.rds-instance.port
  server_name = aws_db_instance.rds-instance.address
  ssl_mode    = "none"
  tags = var.common_tags
}


resource "aws_dms_endpoint" "target" {
  for_each = var.db_tables

  endpoint_type = "target"
  endpoint_id   = "${var.name}-endpoint-target-${each.key}"
  engine_name = "kinesis"
  kinesis_settings {
    include_null_and_empty         = true
    include_table_alter_operations = true
    message_format                 = "json"
    stream_arn              = aws_kinesis_stream.db_stream[each.key].arn
    service_access_role_arn = aws_iam_role.kinesis_dms.arn
  }
  tags = var.common_tags
}

resource "aws_dms_replication_task" "task" {
  for_each = var.db_tables

  migration_type           = "full-load-and-cdc"
  replication_instance_arn = aws_dms_replication_instance.instance.replication_instance_arn
  replication_task_id      = "${var.name}-replication-task-${each.key}"
  source_endpoint_arn    = aws_dms_endpoint.source.endpoint_arn
  start_replication_task = true
  table_mappings         = jsonencode({
    "rules": [
      {
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "1",
        "object-locator": {
          "schema-name": "${var.db_schema_pattern}",
          "table-name": "${each.value}"
        },
        "rule-action": "include"
      }
    ]
  })

  tags = var.common_tags
  target_endpoint_arn = aws_dms_endpoint.target[each.key].endpoint_arn
  lifecycle {
    ignore_changes = [replication_task_settings]
  }
}

Now that's a lot of Terraform. Let's see why we need all of this. In order to set up our DMS replication instance the following roles are needed:

  • dms-vpc-role
  • dms-cloudwatch-logs-role
  • dms-access-for-endpoint

To read more about these roles and why they are needed for DMS, you can view the AWS Database Migration Service guide.

Next, we define the DMS subnet group, which is simply a collection of subnets that will be used by the DMS replication instance. Along with the subnet group, we also define the DMS replication instance itself. The replication instance is an EC2 instance that performs the actual data migration. It serves as a buffer between the DMS source and target: it reads from the source database and then applies any desired transformations for the target. The replication instance must be attached to two replication endpoints, which define the source database and the target. Finally, aws_dms_replication_task defines the actual task to be performed, including what data should be read and what mappings should take place.
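The DMS configuration reads several structured inputs such as var.replication_instance and var.rds_instance. A minimal sketch of matching declarations is shown below; the object shapes are inferred from the attributes referenced in dms.tf and rds.tf, while the types, the sensitive flag, and the default engine name are assumptions. Inputs shared with other files, such as var.name and var.common_tags, are omitted here.

```hcl
# Hypothetical declarations for inputs referenced in dms.tf.

variable "replication_instance" {
  type = object({
    instance_class      = string
    multi_az            = bool
    publicly_accessible = bool
  })
  description = "Settings for the DMS replication instance."
}

variable "rds_instance" {
  type = object({
    instance_class = string
    sg             = string # ID of the security group attached to the RDS instance
    password       = string
  })
  description = "Settings of the existing RDS instance."

  # Assumption: mark the whole object sensitive so the password is redacted in plan output.
  sensitive = true
}

variable "nodes_subnets" {
  type        = list(string)
  description = "Subnet IDs used by the DMS replication subnet group."
}

variable "db_engine_name" {
  type        = string
  description = "DMS engine name of the source database."
  default     = "postgres"
}
```

Marking rds_instance as sensitive also redacts its other attributes in plan output; splitting the password into its own variable is a reasonable alternative if that becomes inconvenient.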

We now have the major parts of this integration set up. Let's run a terraform apply to make sure there are no hiccups in setting up DMS between our RDS instance and Kinesis.

terraform apply
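To check the result of the apply without opening the AWS console, output values can expose the identifiers of what was created. This is an optional sketch; the output names and the file they live in are our own choices, while the referenced resources and their arn and replication_task_arn attributes come from the configuration above and the AWS provider.

```hcl
# outputs.tf (hypothetical file name)

# One stream ARN per entry in var.db_tables.
output "kinesis_stream_arns" {
  value = { for key, stream in aws_kinesis_stream.db_stream : key => stream.arn }
}

# One DMS replication task ARN per entry in var.db_tables.
output "replication_task_arns" {
  value = { for key, task in aws_dms_replication_task.task : key => task.replication_task_arn }
}
```

After the apply finishes, running terraform output prints both maps, keyed by the entries of var.db_tables.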

Create Rockset Integration and Collections

It's now time to create the Rockset integration and a collection. Make sure that your API server endpoint and API key for Rockset are properly set in the provider block or in your local environment. Let's define one last Terraform file, rockset.tf:

resource "rockset_kinesis_integration" "rockset-rds-integration" {
  aws_role_arn = aws_iam_role.kinesis_rockset.arn
  name = "MyFirstTerraformedRocksetIntegration"
}


resource "rockset_kinesis_collection" "rockset-rds-collection" {
  for_each = var.db_tables

  name      = "ThisCollectionWasDeployedWithTerraform-${each.key}"
  workspace = "commons"

  # One collection is created for each table in var.db_tables.
  source {
    format = "postgres"
    integration_name = rockset_kinesis_integration.rockset-rds-integration.name

    # Kinesis stream that this collection will ingest from.
    stream_name = "${var.name}-stream-${each.key}"

    # The list of fields that will be used to construct the unique document id.
    dms_primary_key = var.compound_id_list
  }
}

One more terraform apply and all done! The above gives Rockset the permission to start tailing from the Kinesis streams we just created. If you navigate to the Rockset console you should now see a new Amazon Kinesis integration and corresponding collections for each table in our RDS instance.
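As a final hedged sketch, the block below declares the var.compound_id_list input used for dms_primary_key and adds an output listing the collections that were created. The list(string) type, the description, and the output name are assumptions; the workspace and name attributes are the arguments set in rockset.tf.

```hcl
# Hypothetical declaration for the primary key fields used by dms_primary_key.
variable "compound_id_list" {
  type        = list(string)
  description = "Fields that together form the unique document ID."
}

# Lists each collection as workspace.name, keyed by the entries of var.db_tables.
output "rockset_collections" {
  value = {
    for key, collection in rockset_kinesis_collection.rockset-rds-collection :
    key => "${collection.workspace}.${collection.name}"
  }
}
```

Running terraform output rockset_collections after the apply gives the fully qualified names to look for in the Rockset console.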

📘

Check out our blog on "How to Use Terraform with Rockset" for more info!