This page covers how to migrate data from Elasticsearch to Rockset in a straightforward way. This includes:

  • Copying Elasticsearch data to S3 using the popular open source tool `elasticdump`

  • Creating an S3 Integration to securely connect buckets in your AWS account with Rockset

## Copy Elasticsearch Data to S3

`elasticdump` exports Elasticsearch data to S3 in JSON format.

Use the following commands to install `elasticdump`:
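`elasticdump` is distributed as an npm package, so a typical installation (assuming Node.js and npm are already available on your machine) is:

```shell
npm install -g elasticdump
```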



After installation, you will specify the input and the output using the following command:
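One common form of the export command writes an index to S3 as JSON; the index name, bucket, and credentials below are placeholders:

```shell
elasticdump \
  --s3AccessKeyId "<aws-access-key-id>" \
  --s3SecretAccessKey "<aws-secret-access-key>" \
  --input=http://localhost:9200/<your-index> \
  --output "s3://<your-bucket>/<your-index>.json"
```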



## Create an S3 Integration

The steps below show how to set up an Amazon S3 integration using either **AWS Cross-Account IAM Roles** (recommended) or **AWS Access Keys** (deprecated). An integration can provide access to one or more S3 buckets within your AWS account. You can use an integration to create collections that sync data from your S3 buckets.

### Step 1: Configure AWS IAM Policy

  1. Navigate to the **IAM Service** in the **AWS Management Console**.

  2. Set up a new policy by navigating to **Policies** and clicking "Create policy".

If you already have a policy set up for Rockset, you may update that existing policy.

For more details, refer to [AWS Documentation on IAM Policies](🔗).



  3. Set up read-only access to your S3 bucket. You can switch to the `JSON` tab and paste the policy shown below. You must replace `<your-bucket>` with the name of your S3 bucket. If you already have a Rockset policy set up, you can add the body of the `Statement` attribute to it.
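A minimal read-only policy consistent with the permissions described in "Why these Permissions?" looks like the following (`<your-bucket>` is a placeholder):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RocksetS3ReadOnly",
      "Effect": "Allow",
      "Action": [
        "s3:List*",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::<your-bucket>",
        "arn:aws:s3:::<your-bucket>/*"
      ]
    }
  ]
}
```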



**Configuration Tip**

If you are attempting to restrict the policy to the subdirectory `/a/b`, update the `Resource` object as follows:

```json
"Resource": ["arn:aws:s3:::<your-bucket>", "arn:aws:s3:::<your-bucket>/a/b/*"]
```

  4. Optionally, if you have an S3 bucket that is encrypted with a KMS key, append the following statement to the `Statement` attribute above.
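For an SSE-KMS-encrypted bucket, the additional statement needs `kms:Decrypt` on the key used to encrypt the objects; a sketch (the key ARN is a placeholder) is:

```json
{
  "Effect": "Allow",
  "Action": [
    "kms:Decrypt"
  ],
  "Resource": "arn:aws:kms:<region>:<account-id>:key/<key-id>"
}
```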


  5. Save the newly created or updated policy and give it a descriptive name. You will attach this policy to a user or role in the next step.

#### Why these Permissions?

  • `s3:List*` ⁠— Required. Rockset uses the `s3:ListBucket` and `s3:ListAllMyBuckets` permissions to read bucket and object metadata.

  • `s3:GetObject` — Required to retrieve objects from your Amazon S3 bucket.

#### Advanced Permissions

You can set up permissions for multiple buckets, or some specific paths by modifying the `Resource` ARNs. The format of the ARN for S3 is as follows: `arn:aws:s3:::bucket_name/key_name`.

You can substitute the following resources in the policy above to grant access to multiple buckets or prefixes as shown below:

  • All paths under _mybucket/salesdata_:

    • `arn:aws:s3:::mybucket`

    • `arn:aws:s3:::mybucket/salesdata/*`

  • All buckets starting with `sales`:

    • `arn:aws:s3:::sales*`

    • `arn:aws:s3:::sales*/*`

  • All buckets in your account:

    • `arn:aws:s3:::*`

    • `arn:aws:s3:::*/*`

For more details on how to specify a resource path, refer to [AWS documentation on S3 ARNs](🔗).

### Step 2: Configure Role / Access Key

There are two methods by which you can grant Rockset permissions to access your AWS resource. Although Access Keys are supported, Cross-Account roles are strongly recommended as they are more secure and easier to manage.

#### AWS Cross-Account IAM Role

The most secure way to grant Rockset access to your AWS account is to give Rockset's account cross-account access to yours. To do so, create an IAM Role that Rockset can assume on your behalf, with the newly created policy attached.

You will need information from the [Rockset Console](🔗) to create and save this integration.

  1. Navigate to the **IAM service** in the **AWS Management Console**.

  2. Set up a new role by navigating to **Roles** and clicking "Create role".

    If you already have a role for Rockset set up, you may re-use it and either add or update the above policy directly.


  3. Select "Another AWS account" as the type of trusted entity, and tick the box for "Require External ID". Fill in the Account ID and External ID fields with the values (**Rockset Account ID** and **External ID** respectively) found on the [Integration page of the Rockset Console](🔗) (under the **Cross-Account Role Option**). Click to continue.


  4. Choose the policy created for this role in Step 1 (or follow Step 1 now to create the policy if needed). Click to continue.


  5. Optionally, add any tags and click "Next". Name the role descriptively (such as _rockset-role_), then **record the Role ARN** for the Rockset integration in the [Rockset Console](🔗).
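The console steps above produce a role whose trust policy looks roughly like the following; the account ID and external ID shown are placeholders for the values from the Rockset Console:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::<rockset-account-id>:root" },
      "Action": "sts:AssumeRole",
      "Condition": { "StringEquals": { "sts:ExternalId": "<external-id>" } }
    }
  ]
}
```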

#### AWS Access Key (deprecated)

Navigate to the **IAM service** in the **AWS Management Console**.

  1. [Create a new user](🔗) by navigating to **Users** and clicking "Add User".

  2. Enter a name for the user and check the "Programmatic access" option. Click to continue.

  3. Choose "Attach existing policies directly" then select the policy you created in Step 1. Click through the remaining steps to finish creating the user.

  4. When the new user is successfully created you should see the **Access key ID** and **Secret access key** displayed on the screen.


  5. Record both these values in the [Rockset Console](🔗).

## Create a Collection

When creating a collection in Rockset, you can specify an S3 path (see details below) from which Rockset will ingest data. Rockset will continuously monitor for updates and ingest any new objects. Deleting an object from the source bucket will **not** remove that data from Rockset.

You can create a collection from an S3 source in the [Collections tab of the Rockset Console](🔗).



These operations can also be performed using any of the Rockset [client libraries](🔗), the [Rockset API](🔗), or the [Rockset CLI](🔗).

### Specifying S3 Path

You can ingest all data in a bucket by specifying just the bucket name or restrict to a subset of the objects in the bucket by specifying an additional prefix or pattern.

By default, if the S3 path has no special characters, a prefix match is performed. However, if any of the following special characters are used in the S3 path, it triggers pattern matching semantics.

  • `?` matches one character

  • `*` matches zero or more characters

  • `**` matches zero or more directories in a path

  • `{myparam}` matches a single path parameter and extracts its value into a field in `_meta.s3` named _myparam_

  • `{myparam:<regex>}` matches the specified regular expression and extracts its value into a field in `_meta.s3` named _myparam_

The following examples explain exactly how the patterns can be used:

  • `s3://bucket/xyz` - uses prefix match, matches all files that have a prefix of _xyz_.

  • `s3://bucket/xyz/t?st.csv` - matches _xyz/test.csv_ but also _xyz/tast.csv_ or _xyz/txst.csv_ in the bucket.

  • `s3://bucket/xyz/*.csv` - matches all _.csv_ files in the _xyz_ directory in the bucket.

  • `s3://bucket/xyz/**/test.json` - matches all _test.json_ files in any subdirectory under the _xyz_ path in the bucket.

  • `s3://bucket/05/2018/**/*.json` - matches all _.json_ files underneath any subdirectory under the _/05/2018/_ path in the bucket.

  • `s3://bucket/{month}/{year}/**/*.json` - matches the pattern according to the above rules. In addition, it extracts the value of the matched path segments `{month}`, `{year}` as fields of the form `_meta.s3.month` and `_meta.s3.year` associated with each document.

  • `s3://bucket/{timestamp:\d+}/**/*.json` - matches the pattern according to the above rules. In addition, it extracts the `timestamp` path segment value if it matches the regular expression `\d+` and places the value extracted into `_meta.s3.timestamp` associated with each document.
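To make the matching semantics concrete, here is an illustrative sketch (not Rockset's implementation) that translates these patterns into regular expressions:

```python
import re

def s3_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate the path-pattern semantics described above into a regex.
    Purely illustrative: '?' matches one character, '*' stays within a
    path segment, '**' spans directories, '{param}' / '{param:regex}'
    become named captures (the values Rockset surfaces as _meta.s3.<param>)."""
    out = []
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if pattern.startswith("**", i):
            out.append(".*")          # '**' crosses directory boundaries
            i += 2
        elif c == "*":
            out.append("[^/]*")       # '*' stays within one path segment
            i += 1
        elif c == "?":
            out.append("[^/]")        # '?' matches exactly one character
            i += 1
        elif c == "{":
            j = pattern.index("}", i)
            name, _, rx = pattern[i + 1 : j].partition(":")
            out.append(f"(?P<{name}>{rx or '[^/]+'})")  # named capture
            i = j + 1
        else:
            out.append(re.escape(c))
            i += 1
    return re.compile("".join(out) + "$")

rx = s3_pattern_to_regex("xyz/t?st.csv")
print(bool(rx.match("xyz/test.csv")))   # True
print(bool(rx.match("xyz/toast.csv")))  # False: '?' is a single character

m = s3_pattern_to_regex("{month}/{year}/**/*.json").match("05/2018/a/b/data.json")
print(m.group("month"), m.group("year"))  # 05 2018
```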

## Best Practices

  • While Rockset doesn't have an enforced upper bound on object sizes, Rockset recommends that objects in your S3 bucket stay between 5MiB and 10GiB

  • Large objects (>10GiB) can't take advantage of Rockset's parallel processing mechanism and will result in slow ingestion. Splitting large objects into a series of smaller ones will yield higher throughput and faster recovery from any intermittent issues

  • As a rule of thumb if you are going to split large objects we recommend aiming for ~1GiB in size. A readily available tool to split line-oriented data formats (like JSON or CSV) in all Unix systems is [Split](🔗)

  • The same applies to grouping multiple large files into a single archive (like zip or tar): Rockset recommends uploading large files to an S3 bucket without archiving them, so that ingestion can take advantage of the higher parallel throughput

  • Too many small files (`<5MiB`) will result in a lot of GET & PUT operations in S3, increasing the cost charged by AWS for your S3 usage

  • If you are using Parquet files, Rockset recommends they **do not exceed 3 GiB**, since larger files have been shown to cause issues that can result in complete failure of the bulk ingest operation
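The splitting advice above can be sketched with the standard `split` utility: `-C` splits at line boundaries, so no JSON record is cut in half. The sample file here is tiny for illustration; for a real export you would use something like `-C 1024m`:

```shell
# Create a small line-oriented sample file, then split it into numbered
# pieces (part-00, part-01, ...) of at most 16 bytes, at line boundaries.
printf '{"id":1}\n{"id":2}\n{"id":3}\n' > export.json
split -C 16 -d export.json part-
ls part-*
```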