MSSQL
This page covers how to use a Microsoft SQL Server table as a data source in Rockset.
These step-by-step instructions show how to migrate data from your Microsoft SQL Server database table(s) into Rockset collection(s), using AWS Database Migration Service (DMS) to export your data to a target AWS Kinesis stream. Those Rockset collection(s) stay in sync with your Microsoft SQL Server database table(s) as they are updated in real time, by reading Microsoft SQL Server CDC streams. This includes:
- Setting up your Microsoft SQL Server Database
- Setting up a target AWS Kinesis Stream
- Setting up an AWS DMS Replication Instance and Migration Task
- Creating an Amazon Kinesis integration
These instructions will work with all of the following Microsoft SQL Server databases:
- SQL Server Versions: 2012, 2014, 2016, 2017, 2019
- Licenses: Enterprise, Developer, and Standard Edition (2016 SP1 and later)
Step 1: Microsoft SQL Server Database Configuration
For this step, there may be additional configuration steps required if you are using a self-managed version of Microsoft SQL Server. You can read more about setting up a self-managed Microsoft SQL Server database here.
[Optional] Step 1.1 - Create a Microsoft SQL Server user to allow DMS to access your data
We recommend using the sysadmin user account when designating a Microsoft SQL Server user for AWS DMS to export your data. If you choose not to use the sysadmin account, you will need to create a new user with the proper roles as well as several additional associated objects. You can read more about creating a new Microsoft SQL Server user for AWS DMS here.
Step 1.2 - Enable Periodic Backups of Microsoft SQL Server
Microsoft SQL Server must be configured for full backups, and you must perform a backup before initiating the migration. The recovery model must be Bulk logged or Full.
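On a self-managed instance, you can check and, if needed, switch the recovery model with T-SQL (substitute your database name; on AWS RDS the recovery model is managed through automated backups instead):

```sql
-- Check the current recovery model
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = '<DATABASE_NAME>';

-- Switch to full recovery if needed
ALTER DATABASE [<DATABASE_NAME>] SET RECOVERY FULL;
```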
On AWS RDS SQL Server instances, select the Microsoft SQL Server instance in the AWS Console and navigate to the Maintenance and Backups tab. If automatic backups are not enabled, modify the Microsoft SQL Server instance to enable the feature. After you are done, the Maintenance and Backups tab should look something like this:
Step 1.3 - Configure SQL Server database for ongoing replication
To replicate changes to DMS, the Microsoft SQL Server database and tables must be configured for change data capture. The specific commands depend on the environment and table schemas.
To replicate tables with primary keys in AWS RDS, you can invoke the following commands:
-- Enable CDC on the database level
exec msdb.dbo.rds_cdc_enable_db '<DATABASE_NAME>';
GO
-- Setup each table for replication
exec sys.sp_cdc_enable_table
@source_schema = N'dbo',
@source_name = N'<TABLE_NAME>',
@role_name = NULL,
@supports_net_changes = 1;
GO
-- increase the capture job polling interval to its maximum (86399 seconds)
EXEC sys.sp_cdc_change_job @job_type = 'capture', @pollinginterval = 86399;
-- restart the capture job so the change takes effect
EXEC sys.sp_cdc_stop_job 'capture';
EXEC sys.sp_cdc_start_job 'capture';
GO
For self-managed SQL Server databases, you can replicate tables with primary keys or unique constraints as follows when using the sysadmin role:
-- Enable CDC on the database level
use [<DATABASE_NAME>]
EXEC sys.sp_cdc_enable_db
GO
-- For each table with a unique key but no primary key
exec sys.sp_cdc_enable_table
@source_schema = N'<SCHEMA_NAME>',
@source_name = N'<TABLE_NAME>',
@index_name = N'<UNIQUE_INDEX_NAME>',
@role_name = NULL,
@supports_net_changes = 1
GO
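After enabling CDC, you can sanity-check the configuration with the standard CDC metadata that ships with SQL Server:

```sql
-- Confirm CDC is enabled at the database level
SELECT name, is_cdc_enabled
FROM sys.databases
WHERE name = '<DATABASE_NAME>';

-- List the tables currently enabled for CDC in the current database
EXEC sys.sp_cdc_help_change_data_capture;
GO
```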
The AWS DMS docs provide more details, particularly relevant when not using the sysadmin role.
Step 2: Set up your AWS Kinesis Streams
The AWS Kinesis Stream is the destination that DMS uses as the target of a migration job. We will create an individual migration task for each Microsoft SQL Server table you wish to connect with Rockset. Use the AWS Console as shown below to create an AWS Kinesis Data Stream.
It is recommended to use a separate Kinesis Data Stream for every database table you wish to connect with Rockset. See Best Practices for more information.
Step 2.1 Create an AWS Kinesis Stream
Create an AWS Kinesis Stream here.
If you select the Provisioned data stream capacity mode, choose a shard count that aligns with the maximum ingest rate of your Virtual Instance size. See Virtual Instances for information on Virtual Instance ingest rates and sizing. AWS Kinesis Shard limits can be found here. If you are not able to estimate the number of shards to configure here, please contact your Rockset solutions architect or customer support.
If you select the On-demand data stream capacity mode, Rockset will dynamically adjust ingest rate to match the number of shards, up to your Virtual Instance ingest rate limit.
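As a rough sizing sketch for the Provisioned mode: each Kinesis shard accepts up to 1 MiB/s or 1,000 records/s of writes (AWS-documented shard limits), so you can size by whichever dimension dominates. The peak-rate inputs below are illustrative placeholders, not Rockset-published figures; bound them by your Virtual Instance's ingest rate limit.

```python
import math

# AWS-documented per-shard write limits
SHARD_MB_PER_SEC = 1.0
SHARD_RECORDS_PER_SEC = 1000

def estimate_shards(peak_mb_per_sec: float, peak_records_per_sec: float) -> int:
    """Return the minimum shard count that covers both write limits."""
    by_bytes = math.ceil(peak_mb_per_sec / SHARD_MB_PER_SEC)
    by_records = math.ceil(peak_records_per_sec / SHARD_RECORDS_PER_SEC)
    return max(1, by_bytes, by_records)

# Example: 4.5 MiB/s peak and 2,500 records/s peak -> byte limit dominates
print(estimate_shards(4.5, 2500))  # → 5
```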
Step 2.2 Create a Role for AWS Kinesis Stream
You will need to create a role that has the correct PutRecord, PutRecords, and DescribeStream permissions to access Kinesis. To create this role, please follow these instructions.
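A minimal IAM policy granting those actions might look like the following sketch (the region, account ID, and stream name are placeholders); the role must also trust dms.amazonaws.com so DMS can assume it:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "kinesis:PutRecord",
        "kinesis:PutRecords",
        "kinesis:DescribeStream"
      ],
      "Resource": "arn:aws:kinesis:us-east-1:123456789012:stream/<STREAM_NAME>"
    }
  ]
}
```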
Step 3: Set up AWS Database Migration Service (DMS)
AWS DMS setup is a multi-step process. First, you create one or more DMS Replication Instances. Each Replication Instance comes with a set of resources that power the data migration from Microsoft SQL Server to Kinesis. Then you create a DMS Source Endpoint for each Microsoft SQL Server database, and one DMS Destination Endpoint for each Microsoft SQL Server database table that you want to connect with Rockset. The Source Endpoint is used to configure the name of the Microsoft SQL Server database, and the Destination Endpoint specifies the name of the corresponding Kinesis stream. Finally, you configure a Database Migration Task that specifies the Microsoft SQL Server table names that you want to connect with Rockset. Creating a Database Migration Task actually starts the migration of data from a specific Microsoft SQL Server table to the corresponding Kinesis stream.
Step 3.1 Create a DMS Replication Instance
AWS DMS uses a replication instance to connect to your source data store, read the source data, and format the data for consumption by the target data store. A replication instance also loads the data into the target data store. Most of this processing happens in memory; however, large transactions might require some buffering on disk. Cached transactions and log files are also written to disk. You can create a Replication Instance by using the AWS Console as follows:
The Instance class specifies the amount of resources provisioned for this replication instance. You can pick the Instance class based on the total amount of data to be migrated and how quickly you want the replication process to occur, although we recommend an Instance class of dms.c5.2xlarge or larger to ensure optimal performance during data ingest. While all versions of DMS Replication Instances should theoretically work for this process, we recommend using versions 3.4.7 or earlier, as some recent versions may be occasionally unstable when tailing CDC streams. Name the replication instance appropriately, especially if you are creating multiple replication instances.
Step 3.2 Create a DMS Source Endpoint for your Microsoft SQL Server database
A Source Endpoint specifies the location and access information of your Microsoft SQL Server database. You use the AWS Console to create a DMS Source Endpoint as follows:
- Pick the RDS instance name for the Microsoft SQL Server database.
- Pick an appropriate name for this Endpoint. You can specify access parameters by clicking on “Provide access information manually” and entering the database name and port of the Microsoft SQL Server database. You can use the Microsoft SQL Server user aws-dms and password that you created in Step 1.1.
If you want multiple Microsoft SQL Server database tables in a single Microsoft SQL Server database to be connected to Rockset, you only need to create a single Source Endpoint for that database. The same Source Endpoint can then be used by multiple Migration Tasks in a later step. You can click on the “Test endpoint connection (optional)” tab to verify that the access configurations to access the Microsoft SQL Server database are configured correctly.
AWS DMS versions 3.4.7 and later require that you configure AWS DMS to use VPC endpoints for all your source and target endpoints that interact with certain Amazon Web Services. Click here to configure VPC endpoints as AWS DMS source and target endpoints.
Step 3.3 Create a DMS Target Endpoint for every Microsoft SQL Server database table
The DMS Target Endpoint specifies the name of the Kinesis Stream that is the target of a migration task. Create a DMS Target Endpoint as follows:
Pick Amazon Kinesis as the Target Engine. Enter the ARN of the Kinesis Stream created in Step 2.1 in the Kinesis Stream ARN field. Additionally, enter the Service access role ARN of the role created in Step 2.2.
Again, AWS DMS versions 3.4.7 and later require that you configure AWS DMS to use VPC endpoints for all your target endpoints. Follow these instructions to configure VPC endpoints.
You can click on the “Test endpoint connection (optional)” tab to verify that the access configurations to access the Kinesis Stream service are configured correctly.
Step 3.4 Create a DMS Database Migration Task to migrate a Microsoft SQL Server table to a Kinesis Stream
A Migration Task connects one specific Microsoft SQL Server database table on a specific Microsoft SQL Server database to a specific Kinesis Stream. Use the AWS Console to specify the following:
Pick the option for Migration type as “Migrate existing data and replicate ongoing changes”. This option does a full dump of the Microsoft SQL Server database table to Kinesis first, followed by continuous replication through Microsoft SQL Server CDC using the same Kinesis Stream. Specify the following task settings:
Configure the following DMS task settings if you are working with a large table or if the update rate to your table is high.
"TargetMetadata": {
  "ParallelLoadThreads": 32,
  "ParallelLoadBufferSize": 1000,
  "BatchApplyEnabled": true,
  "ParallelApplyThreads": 32,
  "ParallelApplyBufferSize": 1000,
  "ParallelApplyQueuesPerThread": 128,
  "ParallelLoadQueuesPerThread": 128
},
"FullLoadSettings": {
  "CommitRate": 50000,
  "MaxFullLoadSubTasks": 49
}
The name of the Microsoft SQL Server table that you want to migrate is specified in the section titled Table Mappings. Note that if you are migrating data from a partitioned table in Microsoft SQL Server, you must specify the partition table name(s) in your selection rule rather than the parent table name. You must create a selection rule with the appropriate Table name field for each partition in the partitioned table.
Here is an example where a table named customers is being configured:
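If you prefer the JSON editor over the wizard, a selection rule for a single table might look like the following sketch (the schema name, table name, and rule name are illustrative):

```json
{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-customers",
      "object-locator": {
        "schema-name": "dbo",
        "table-name": "customers"
      },
      "rule-action": "include"
    }
  ]
}
```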
You can switch on Pre-migration assessment to verify that the settings are accurate. If you switch on the assessment, remember to manually start the task once the pre-migration assessment is complete.
Now that you have completed steps 1-3, you are ready to create the Rockset Kinesis integration.
Creating a Rockset Collection
- Go to collection creation and select the Microsoft SQL Server tile under Managed Integrations. This will guide you through a series of steps to set up the Kinesis Integration needed to create the collection.
- Populate the fields as shown below. Note that Format must be set to MSSQL. Also, it is mandatory to specify an immutable primary key from the source Microsoft SQL Server table during collection creation.
Ingesting Complex Types
The following is a table of custom SQL expressions that you can apply to fields as part of your collection's ingest transformation to ingest them with the desired type.
Microsoft SQL Server Data Type | Supported By DMS | Rockset Data Type | Rockset Ingest Transformation SQL Expression |
---|---|---|---|
BIGINT | Y | int | None |
BIT | Y | bool | None |
DECIMAL | Y | float | None |
INT | Y | int | None |
MONEY | Y | float | None |
NUMERIC (p,s) | Y | float | None |
SMALLINT | Y | int | None |
SMALLMONEY | Y | float | None |
TINYINT | Y | int | None |
REAL | Y | float | None |
FLOAT | Y | float | None |
DATETIME | Y | datetime | PARSE_DATETIME_ISO8601(datetime_column) |
DATETIME2 | Y | datetime | PARSE_DATETIME_ISO8601(datetime_column) |
SMALLDATETIME | Y | datetime | PARSE_DATETIME_ISO8601(datetime_column) |
DATE | Y | date | PARSE_DATE_ISO8601(date_column) |
TIME | Y | time | PARSE_TIME_ISO8601(time_column) |
DATETIMEOFFSET | Y | datetime | PARSE_DATETIME('%Y-%m-%d %H:%M:%E*S %Ez', datetime_column) |
CHAR | Y | string | None |
VARCHAR | Y | string | None |
VARCHAR (max) | Y | string | None |
NCHAR | Y | string | None |
NVARCHAR (length) | Y | string | None |
NVARCHAR (max) | Y | string | None |
BINARY | Y | bytes | FROM_HEX(SUBSTR(column, 3)) |
VARBINARY | Y | bytes | FROM_HEX(SUBSTR(column, 3)) |
VARBINARY (max) | Y | bytes | FROM_HEX(SUBSTR(column, 3)) |
TIMESTAMP | Y | bytes | FROM_HEX(SUBSTR(column, 3)) |
UNIQUEIDENTIFIER | Y | string | None |
HIERARCHYID | Y | string | None |
XML | Y | string | None |
GEOMETRY | Y | geometry | ST_GEOGFROMTEXT(geometry_column) |
GEOGRAPHY | Y | geography | ST_GEOGFROMTEXT(geography_column) |
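As an illustration, an ingest transformation combining several of the expressions above might look like the following (the column names are hypothetical):

```sql
SELECT
  _input.id,
  PARSE_DATETIME_ISO8601(_input.created_at) AS created_at,
  PARSE_DATE_ISO8601(_input.signup_date) AS signup_date,
  FROM_HEX(SUBSTR(_input.profile_photo, 3)) AS profile_photo,
  ST_GEOGFROMTEXT(_input.location) AS location
FROM _input
```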
Limitations and Unsupported Operations
Rockset only supports tables with primary keys.
Rockset also inherits many of the limitations imposed by SQL Server replication and DMS. Most notably:
- Sparse, memory-optimized (using In-Memory OLTP), and temporal tables aren't supported
- Renaming tables or columns via sp_rename isn't supported
- Some data manipulation language patterns aren't supported, e.g. SELECT * INTO new_table FROM existing_table
- Changes to computed fields in SQL Server aren't replicated
- Geometry columns are not supported in full LOB mode
- A secondary SQL Server database isn't supported as a source database for ongoing replication (CDC) tasks
Please review DMS documentation for more details.
Best Practices
We recommend only using one Kinesis stream per collection, i.e. one stream per MSSQL table. If multiple collections read from the same Kinesis stream, they will encounter rate limits imposed by Kinesis, which results in throttling. When encountering throttling, Rockset retries with backoff, which can increase data ingestion latencies. Read more about the limits here.
If using one stream for multiple tables, please note the following:
- Any table exported to the stream must have the id field, even if the tables are not included in any Rockset collection.
- The Primary Key when creating a Rockset collection using the Kinesis stream must be the id field.