Rockset supports data ingestion from a wide variety of sources through built-in connectors for:

  • Streams

  • Databases

  • Data lakes

  • Cloud Data Warehouses

To learn about our available data sources, check out [this page](🔗). Data is ingested into what Rockset calls a **collection** and can be joined with data from another collection.

**Note:** You can use our **Write API** or **CDC streams** to ingest data from anywhere. You’re not limited by our built-in connectors.

There are a few things to consider when ingesting data.

First, use an [ingest transformation](🔗) and/or [rollups](🔗)! These features run SQL at ingest time to transform or aggregate your data, which can drastically reduce how much data you need to store in Rockset’s _hot storage_. That said, each time you change an ingest transformation or rollup, you must reload your data. It's best practice to test these features on roughly 1,000 rows of data rather than your entire dataset, to avoid using up your trial credits on ingest. Once you've verified that your rollups and transformations work properly on a subset of data, apply them to your working dataset.
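To make the storage savings concrete, here is a minimal Python sketch of what a rollup does conceptually. It is not Rockset's SQL syntax or API; it only imitates the effect of aggregating at ingest. The field names (`ts`, `device_id`, `value`) and the helpers `rollup_by_minute` and `sample_rows` are hypothetical, chosen purely for illustration.

```python
from collections import defaultdict
from datetime import datetime
from itertools import islice


def rollup_by_minute(events):
    """Aggregate raw events into one row per (minute, device_id) pair."""
    buckets = defaultdict(lambda: {"count": 0, "total": 0.0})
    for event in events:
        ts = datetime.fromisoformat(event["ts"])
        # Truncate the timestamp to the start of its minute.
        minute = ts.replace(second=0, microsecond=0).isoformat()
        key = (minute, event["device_id"])
        buckets[key]["count"] += 1
        buckets[key]["total"] += event["value"]
    return [
        {"minute": minute, "device_id": device,
         "count": agg["count"], "total": agg["total"]}
        for (minute, device), agg in sorted(buckets.items())
    ]


def sample_rows(rows, limit=1000):
    """Take at most `limit` rows, e.g. to test a transformation cheaply."""
    return list(islice(rows, limit))


# Hypothetical raw events; in practice these would come from your source.
raw_events = [
    {"ts": "2024-01-01T00:00:05", "device_id": "a", "value": 1.0},
    {"ts": "2024-01-01T00:00:40", "device_id": "a", "value": 2.0},
    {"ts": "2024-01-01T00:00:59", "device_id": "b", "value": 5.0},
    {"ts": "2024-01-01T00:01:10", "device_id": "a", "value": 4.0},
]

rolled_up = rollup_by_minute(sample_rows(raw_events))
for row in rolled_up:
    print(row)
```

Here four raw events collapse into three stored rows. With real event streams, where many events share a bucket, the reduction is typically much larger. Running the logic on a small sample first mirrors the advice above about testing on a subset before loading the full dataset.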

This brings us to the next point: _ingest performance_, which is especially relevant if you’re ingesting event streams, such as from Kafka or Kinesis.

Each Virtual Instance (VI) size has an implied ingestion rate limit, and each step up in VI size doubles that limit. So, if you’re looking to achieve a particular end-to-end latency, make sure you’re using an appropriately sized VI. Check out [this page](🔗) to understand ingest performance at a particular VI size.
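The doubling rule makes sizing easy to estimate. The sketch below assumes a hypothetical base limit of 1 MiB/s for the smallest VI; that number is a placeholder and not a Rockset figure, so take the actual limits from the ingest performance page linked above. The function names are illustrative as well.

```python
def ingest_limit(base_limit_mib_s: float, size_steps: int) -> float:
    """Limit after stepping up `size_steps` VI sizes, doubling at each step."""
    if size_steps < 0:
        raise ValueError("size_steps must be non-negative")
    return base_limit_mib_s * 2 ** size_steps


def smallest_steps_for(base_limit_mib_s: float, required_mib_s: float) -> int:
    """Fewest size steps above the base VI whose limit covers the required rate."""
    if base_limit_mib_s <= 0:
        raise ValueError("base_limit_mib_s must be positive")
    steps = 0
    while ingest_limit(base_limit_mib_s, steps) < required_mib_s:
        steps += 1
    return steps


BASE_LIMIT_MIB_S = 1.0  # hypothetical placeholder, not a published limit

# A stream producing 5 MiB/s needs a limit of at least 5 MiB/s.
steps = smallest_steps_for(BASE_LIMIT_MIB_S, 5.0)
print(steps, ingest_limit(BASE_LIMIT_MIB_S, steps))
```

Under this assumed base, a 5 MiB/s stream needs a VI three sizes above the smallest, because limits of 1, 2, and 4 MiB/s fall short and 8 MiB/s is the first that covers it.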

The last consideration when ingesting your own data is _bulk ingest_. Bulk ingest is triggered when you ingest 5 GiB or more from a particular source. When this happens, Rockset spins up additional machines to accelerate ingestion beyond the VI-based rate limits mentioned earlier.

**Note:** Bulk ingest is only available when a collection is created, and it is only supported for the following data sources:

  • AWS S3

  • DynamoDB

  • MongoDB

  • Google Cloud Storage

  • Azure Blob Storage
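
As a quick way to reason about the bulk ingest conditions, the sketch below encodes the three rules stated here: the 5 GiB threshold, the collection-creation requirement, and the supported sources. It is a local checklist only, not a call into any Rockset API, and the function and constant names are hypothetical.

```python
GIB = 1024 ** 3
BULK_INGEST_THRESHOLD_BYTES = 5 * GIB

# Sources listed above as supporting bulk ingest.
BULK_INGEST_SOURCES = {
    "AWS S3",
    "DynamoDB",
    "MongoDB",
    "Google Cloud Storage",
    "Azure Blob Storage",
}


def qualifies_for_bulk_ingest(source: str, size_bytes: int,
                              at_collection_creation: bool) -> bool:
    """Return True only if all three documented conditions hold."""
    return (
        at_collection_creation
        and source in BULK_INGEST_SOURCES
        and size_bytes >= BULK_INGEST_THRESHOLD_BYTES
    )


print(qualifies_for_bulk_ingest("AWS S3", 12 * GIB, True))    # large initial load
print(qualifies_for_bulk_ingest("AWS S3", 12 * GIB, False))   # existing collection
print(qualifies_for_bulk_ingest("Kafka", 12 * GIB, True))     # source not listed
```

Only the first call meets every condition. The other two fail on the collection-creation requirement and on the source list, respectively.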