
Amazon S3

This page covers how you can set up an Amazon S3 bucket as the data source for a Rockset collection.

If you do not already have data in an S3 bucket, refer to these instructions to create a new bucket and upload your data.

Create an Integration

If your data is in a public bucket, you can skip directly to creating the collection. Otherwise, Rockset needs credentials that grant it read access to the data. To set this up, create an integration using one of the supported credential mechanisms described in the integration instructions.
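For example, if you use an AWS IAM-based integration, the policy attached to the role or user that Rockset uses needs at least list and read permissions on the bucket. The sketch below is illustrative only: the bucket name is a placeholder, and the exact set of actions Rockset requires may differ, so consult the integration instructions for the precise policy.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetBucketLocation",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ]
    }
  ]
}
```

Note that `s3:ListBucket` applies to the bucket ARN itself, while `s3:GetObject` applies to the objects inside it (`/*`), which is why both `Resource` entries are needed.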

Create a Collection

When creating a collection in Rockset, you can specify an S3 path (see details below) from which Rockset will ingest data. Rockset will continuously monitor for updates and ingest any new objects. Deleting an object from the source bucket will not remove that data from Rockset.

In the Rockset console, you can create a collection from Workspace > Collections > Create Collection.


Using the CLI, you can create a collection by running the following command:

$ rock create collection my-first-s3-collection \
    s3://my-bucket/my-path-1 s3://my-bucket/my-path-2

Collection "my-first-s3-collection" was created successfully.

Note that these operations can also be performed using any of the Rockset client libraries.

Specifying S3 Path

You can ingest all data in a bucket by specifying just the bucket name or restrict to a subset of the objects in the bucket by specifying an additional prefix or pattern.

By default, if the S3 path contains no special characters, a prefix match is performed. However, if the path contains any of the following special characters, pattern matching semantics apply.

  • ? matches one character
  • * matches zero or more characters
  • ** matches zero or more directories in a path
  • {myparam} matches a single path parameter and extracts its value into a field in _meta.s3 named “myparam”
  • {myparam:<regex>} matches the specified regular expression and extracts its value into a field in _meta.s3 named “myparam”

A few examples are shown below to explain exactly how the patterns can be used.

  • s3://bucket/xyz - uses prefix match; matches all objects whose keys begin with xyz.
  • s3://bucket/xyz/t?st.csv - matches xyz/test.csv, but also xyz/tast.csv or xyz/txst.csv in bucket.
  • s3://bucket/xyz/*.csv - matches all .csv files in the xyz directory in bucket.
  • s3://bucket/xyz/**/test.json - matches all test.json files in any subdirectory under the xyz path in bucket.
  • s3://bucket/05/2018/**/*.json - matches all .json files anywhere under the 05/2018/ path in bucket.
  • s3://bucket/{month}/{year}/**/*.json - matches the pattern according to the above rules. In addition, it extracts the values of the matched path segments {month} and {year} as fields _meta.s3.month and _meta.s3.year on each document.
  • s3://bucket/{timestamp:\d+}/**/*.json - matches the pattern according to the above rules. In addition, if the timestamp path segment matches the regular expression \d+, its value is extracted into _meta.s3.timestamp on each document.
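To make the matching rules above concrete, here is a small Python sketch that translates an S3 path pattern into a regular expression with named groups for path parameters. This is an illustration of the documented semantics, not Rockset's actual implementation, and it does not model the prefix-match fallback for paths with no special characters.

```python
import re

def s3_pattern_to_regex(pattern):
    """Translate an S3 path pattern (?, *, **, {param}, {param:regex})
    into an anchored compiled regex with named capture groups."""
    out = []
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == '?':
            out.append('[^/]')             # ? matches exactly one character
            i += 1
        elif c == '*':
            if pattern[i:i + 3] == '**/':
                out.append('(?:[^/]+/)*')  # **/ spans zero or more directories
                i += 3
            elif pattern[i:i + 2] == '**':
                out.append('.*')
                i += 2
            else:
                out.append('[^/]*')        # * stays within one path segment
                i += 1
        elif c == '{':
            j = pattern.index('}', i)
            body = pattern[i + 1:j]
            if ':' in body:
                name, rx = body.split(':', 1)   # {param:<regex>}
            else:
                name, rx = body, '[^/]+'        # {param} matches one segment
            out.append(f'(?P<{name}>{rx})')     # captured into _meta.s3.<name>
            i = j + 1
        else:
            out.append(re.escape(c))
            i += 1
    return re.compile(''.join(out) + '$')

rx = s3_pattern_to_regex(r's3://bucket/{month}/{year}/**/*.json')
m = rx.match('s3://bucket/05/2018/a/b/data.json')
print(m.groupdict())  # {'month': '05', 'year': '2018'}
```

Running this against the examples above shows, for instance, that `t?st.csv` accepts `tast.csv`, and that `*.csv` does not cross a `/` while `**` does.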