Data Loading

Rockset supports data ingestion from a wide variety of sources through built-in connectors for:

  • Streams
  • Databases
  • Data lakes
  • Cloud data warehouses

To learn about our available data sources, check out our documentation on Data Sources. Data is ingested into what Rockset calls a Collection and can be joined with data from another collection.

💡

You can use our Write API or CDC streams to ingest data from anywhere. You're not limited by our built-in connectors.

There are a few things to consider when ingesting data.

First, use an Ingest Transformation and/or Rollups. These features let you execute SQL at ingest time to transform or aggregate your data, which can drastically reduce how much data you need to store in Rockset's hot storage. That said, each time you change an ingest transformation or rollup, you must reload your data.
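To make the storage saving concrete, here is a minimal Python sketch of what a rollup does conceptually: it collapses many raw events into one aggregated row per group. The field names (`user_id`, `amount`) and the function `rollup_events` are hypothetical and exist only for illustration. In Rockset the aggregation itself is written in SQL and runs at ingest, not in client code.

```python
from collections import defaultdict

def rollup_events(events):
    """Aggregate raw events into one row per user_id (illustrative only)."""
    totals = defaultdict(lambda: {"event_count": 0, "total_amount": 0})
    for event in events:
        row = totals[event["user_id"]]
        row["event_count"] += 1
        row["total_amount"] += event["amount"]
    # One output row per distinct user_id, sorted for stable output.
    return [{"user_id": uid, **agg} for uid, agg in sorted(totals.items())]

# Hypothetical sample data: three raw events from two users.
raw_events = [
    {"user_id": "a", "amount": 5},
    {"user_id": "a", "amount": 7},
    {"user_id": "b", "amount": 3},
]
rolled_up = rollup_events(raw_events)
print(len(raw_events), "raw rows ->", len(rolled_up), "stored rows")
```

Only the aggregated rows need to be stored, so the saving grows with the number of raw events per group.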

💡

Data Loading Tip

It's best practice to test ingest transformations and rollups on ~1000 rows of data rather than your entire dataset, to avoid using up your trial credits on ingest.

Once you verify these features are working properly on a subset of data, apply them to your working dataset.
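One way to prepare such a subset, assuming your source data is available locally as a newline-delimited JSON file, is to copy the first 1000 lines into a separate file and load that file first. This is a sketch only: the function name `write_sample` and the file paths are hypothetical, and the file format is an assumption.

```python
from itertools import islice

def write_sample(src_path, dst_path, max_rows=1000):
    """Copy at most max_rows lines from src_path to dst_path; return the count."""
    written = 0
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        # islice stops after max_rows lines without reading the whole file.
        for line in islice(src, max_rows):
            dst.write(line)
            written += 1
    return written
```

For example, `write_sample("events.jsonl", "events_sample.jsonl")` would produce a file with at most 1000 rows that you can load to test the transformation before ingesting the full dataset.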

This brings us to the next point: ingest performance, which is especially relevant if you're ingesting event streams, such as from Kafka or Kinesis.

Each Virtual Instance (VI) size has an implied ingestion rate limit, and each step up in VI size doubles that limit. So, if you're looking to achieve a particular end-to-end latency, make sure you're using an appropriately sized VI.
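Because the limit doubles with each size step, you can estimate how many steps up you need for a target ingest rate. The sketch below shows that arithmetic. The base limit passed in is a placeholder you would supply yourself, not a published Rockset figure, and both function names are hypothetical.

```python
def ingest_limit(base_limit, size_steps):
    """Limit after moving size_steps sizes up from a VI with base_limit."""
    return base_limit * 2 ** size_steps

def size_steps_needed(base_limit, target_rate):
    """Smallest number of size steps whose doubled limit covers target_rate."""
    if base_limit <= 0:
        raise ValueError("base_limit must be positive")
    steps = 0
    limit = base_limit
    while limit < target_rate:
        limit *= 2  # each step up doubles the limit
        steps += 1
    return steps

# Placeholder numbers: base limit of 1 unit/s, target of 5 units/s.
steps = size_steps_needed(1, 5)
print(steps, ingest_limit(1, steps))
```

With these placeholder numbers, three steps up are needed, because two steps only reach 4 units/s while three reach 8.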

📘

Check out this page to understand ingest performance at a particular VI size.

The last consideration when ingesting your own data is bulk ingest. Bulk ingest is triggered when you ingest 5 GiB or more from a particular source. When this happens, Rockset spins up additional machines to accelerate ingestion beyond the rate limits mentioned earlier.

Bulk ingest is only possible when a collection is created, and it is only supported for the following data sources:

  • AWS S3
  • DynamoDB
  • MongoDB
  • Google Cloud Storage
  • Azure Blob Storage
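
The conditions above can be summarized as a small check. This is a sketch of the rules as stated on this page, not a Rockset API: the function `qualifies_for_bulk_ingest` and the constant names are hypothetical, and the source names are simply the strings from the list above.

```python
GIB = 1024 ** 3
BULK_INGEST_THRESHOLD_BYTES = 5 * GIB  # 5 GiB, as stated above

# Sources listed above as supporting bulk ingest.
BULK_INGEST_SOURCES = {
    "AWS S3",
    "DynamoDB",
    "MongoDB",
    "Google Cloud Storage",
    "Azure Blob Storage",
}

def qualifies_for_bulk_ingest(source, size_bytes, is_new_collection):
    """Return True if all three conditions described above hold."""
    return (
        is_new_collection
        and source in BULK_INGEST_SOURCES
        and size_bytes >= BULK_INGEST_THRESHOLD_BYTES
    )

print(qualifies_for_bulk_ingest("AWS S3", 6 * GIB, True))
print(qualifies_for_bulk_ingest("AWS S3", 6 * GIB, False))
```

The first call meets all three conditions. The second fails only because the collection already exists.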