This page covers how to use a MongoDB collection as a data source in Rockset. This includes how to:

  • Create a MongoDB <<glossary:Integration>> to securely connect collections in your MongoDB Atlas account or self-managed MongoDB cluster with Rockset.

  • Create a [<<glossary:Collection>>](🔗) that continuously syncs your data from a MongoDB collection into Rockset in real time.

## Create a MongoDB Integration

How you create a MongoDB integration depends on where your MongoDB cluster is located:

  • [MongoDB Atlas](🔗)

  • [Self-managed MongoDB in the public cloud or a data center](🔗)

## Create a Collection

Use the [Rockset-Mongo helper tool](🔗) to ingest collections over 300GB.

Once you create a collection backed by MongoDB, Rockset first scans the MongoDB collection to ingest all existing records, and then uses a MongoDB Change Stream to keep the Rockset collection updated as records are added, modified, or deleted in the MongoDB collection.

If your MongoDB collection is a [capped collection](🔗), MongoDB change streams do not receive delete events for old documents, so the Rockset collection can go out of sync. For this reason, we recommend setting [retention](🔗) on the Rockset collection at the time of creation.
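If you are not sure whether a source collection is capped, you can check its stats from the MongoDB side before creating the Rockset collection. A minimal pymongo sketch, with a placeholder connection string and placeholder database/collection names:

```python
from pymongo import MongoClient

# Placeholder connection string, database, and collection names.
client = MongoClient("mongodb+srv://user:password@cluster.example.mongodb.net")
stats = client["mydb"].command("collStats", "events")

if stats.get("capped"):
    print("Capped collection: set retention on the Rockset collection at creation time.")
```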

You can create a collection from a MongoDB source in the [Collections tab of the Rockset Console](🔗).



These operations can also be performed using any of the Rockset [client libraries](🔗), the [Rockset API](🔗), or the [Rockset CLI](🔗).
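For illustration, here is a minimal sketch of creating a MongoDB-backed collection through the REST API with Python's `requests`. The API key, region endpoint, workspace, integration, and source names are placeholders, and the optional `retention_secs` field addresses the capped-collection recommendation above; treat this as a sketch of the create collection request rather than a definitive recipe.

```python
import requests

API_KEY = "YOUR_API_KEY"                       # placeholder
API_SERVER = "https://api.usw2a1.rockset.com"  # placeholder region endpoint

body = {
    "name": "mongo_events",
    "sources": [{
        "integration_name": "my_mongodb_integration",  # placeholder integration
        "mongodb": {
            "database_name": "mydb",      # placeholder MongoDB database
            "collection_name": "events",  # placeholder MongoDB collection
        },
    }],
    # Optional: retention bounds the Rockset collection, which matters
    # especially when the source is a capped collection.
    "retention_secs": 7 * 24 * 60 * 60,
}

resp = requests.post(
    f"{API_SERVER}/v1/orgs/self/ws/commons/collections",  # "commons" is a placeholder workspace
    headers={"Authorization": f"ApiKey {API_KEY}"},
    json=body,
)
resp.raise_for_status()
print(resp.json())
```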

### How it works

When a MongoDB backed collection is created, indexing in Rockset occurs in two stages:

  1. A one-time full scan of the MongoDB collection in which all records are indexed and stored in the Rockset collection.

  2. Following that, continuous monitoring and sync of changes from the MongoDB collection (inserts, deletes and updates) to the Rockset collection in real-time using [MongoDB Change Streams](🔗).

Once a MongoDB backed collection is set up, it will be a replica of the MongoDB collection, up-to-date to within a few seconds.
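As an illustration of this two-stage pattern (a conceptual sketch, not Rockset's actual implementation), here is a minimal pymongo version; `index_document` is a hypothetical stand-in for whatever consumes the records:

```python
from pymongo import MongoClient

client = MongoClient("mongodb+srv://user:password@cluster.example.mongodb.net")
coll = client["mydb"]["events"]  # placeholder database/collection names

def index_document(doc):
    """Hypothetical stand-in for indexing a record downstream."""
    print(doc)

# Stage 1: one-time full scan of all existing documents.
for doc in coll.find():
    index_document(doc)

# Stage 2: continuous sync of inserts, updates, and deletes via a change stream.
with coll.watch() as stream:
    for change in stream:
        index_document(change)  # a real consumer would apply each change by type
```

A production implementation would also capture a change stream resume token before the scan so that updates arriving between the two stages are not missed.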

### Document Update Types

In the second stage above, when Rockset monitors changes through MongoDB Change Streams, the changes can be received in one of two modes:

  1. Partial Delta: MongoDB only sends the fields that have changed in the document. This is the default behavior.

  2. Full Document: MongoDB returns the most current majority-committed version of the updated document.

As an example, if your MongoDB document was `{"_id": "abc", "x": 1, "y": 2}` and you issued an update in MongoDB to change `y` to 5, you would see a different update depending on the mode:

  1. Partial Delta: `{"_id": "abc", "y": 5}`

  2. Full Document: `{"_id": "abc", "x": 1, "y": 5}`
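You can observe both shapes directly on a change stream. In pymongo, update events always carry the partial delta in `updateDescription.updatedFields`, and passing `full_document="updateLookup"` additionally attaches the current majority-committed document; a minimal sketch with placeholder names:

```python
from pymongo import MongoClient

coll = MongoClient("mongodb+srv://user:password@cluster.example.mongodb.net")["mydb"]["events"]

# full_document="updateLookup" corresponds to Full Document mode: MongoDB
# attaches the current majority-committed document to each update event.
with coll.watch(full_document="updateLookup") as stream:
    for change in stream:
        if change["operationType"] == "update":
            # Partial delta, e.g. {"y": 5}
            print(change["updateDescription"]["updatedFields"])
            # Full document, e.g. {"_id": "abc", "x": 1, "y": 5}
            print(change["fullDocument"])
```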

While receiving partial deltas is more efficient and puts less load on your MongoDB database and Rockset Virtual Instance, there are some notable drawbacks:

  • You cannot use an [<<glossary:Ingest Transformation>>](🔗) if your MongoDB source generates partial deltas. For example, the transformation `SELECT x + y AS z FROM _input` applied to the update above cannot be evaluated when only `x` or only `y` changes, because Rockset never receives the other field's value.

  • Certain update types, specifically related to nested objects and arrays, can also produce ambiguous partial deltas. This can lead to data inconsistencies between your MongoDB and Rockset collections.

For these reasons, we recommend that you enable Full Document Updates when creating your MongoDB collection. This can be done in the Rockset console or through the REST API's [create collection endpoint](🔗).

## MongoDB Best Practices

When the MongoDB database is under heavy load, it affects the speed at which Rockset can read updates. Below are some best practices for connecting MongoDB as a source with Rockset:

  • **Start bulk ingest when your MongoDB database is under light load**

    • This allows Rockset to do the one-time full scan of MongoDB without any read throttling

  • **Increase the read-throughput on the MongoDB cluster** for bulk ingest

    • Use common techniques to increase read performance for the initial scan. See some recommended techniques in this [blog](🔗) from our solution engineering team.

    • Prefer using a read replica to connect as a source with Rockset (see the connection sketch after this list). Refer to this [MongoDB doc](🔗) for details on how you can set up a connection string with a `readPreference` flag.

    • Rockset uses [majority read concern](🔗). Read concern `"majority"` guarantees that the data read has been acknowledged by a majority of the replica set members (i.e. the documents read are durable and guaranteed not to roll back).

      • Make sure that read concern `"majority"` is enabled by following the instructions in this [link](🔗).

      • Read concern `"majority"` is also a [requirement](🔗) for Change Streams in MongoDB 4.0 and earlier.

  • **Increase the oplog size**

    • See MongoDB's [recommendation](🔗) for workloads that might require a larger oplog size.

    • If the source MongoDB collection has a high rate of write and update operations, we recommend [increasing](🔗) the oplog size.

      • MongoDB [recommends](🔗) that a cluster's oplog be large enough to provide a 24-hour replication oplog window. For example, if you generate 1 GB of oplog per hour on average, your oplog should be at least 24 GB.

    • Set up alerts on your MongoDB project to trigger if the oplog churn (GB/hour) exceeds a specified threshold; the sketch after this list shows a rough way to check your current oplog window.

  • **Monitor [streaming ingest metrics](🔗) in Rockset**

    • If your org’s Virtual Instance is nearing its [peak streaming ingest rate](🔗), consider increasing its size to avoid increased data latency and slow queries.

      • Once the streaming ingest rate drops, you can decrease the Virtual Instance size again to control costs.

      • Using the [metrics endpoint](🔗), you can set alerts with your preferred monitoring tool.

    • If ingest is rate limited for a prolonged period, then depending on your oplog size and churn rate, Rockset might not be able to catch up with all the updates coming from MongoDB, and the collection will enter an unrecoverable error state that requires re-creating it.
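Putting the connection-level recommendations above together, here is a minimal pymongo sketch, assuming placeholder credentials, that connects with a secondary read preference and majority read concern, then roughly estimates the current replication oplog window (reading `local.oplog.rs` requires appropriate privileges):

```python
from pymongo import MongoClient

# Placeholder credentials and hostname. readPreference routes reads to
# secondaries; readConcernLevel=majority matches the read concern Rockset uses.
client = MongoClient(
    "mongodb+srv://user:password@cluster.example.mongodb.net/"
    "?readPreference=secondaryPreferred&readConcernLevel=majority"
)

# Rough oplog-window estimate: time span between oldest and newest oplog entries.
oplog = client["local"]["oplog.rs"]
first = oplog.find().sort("$natural", 1).limit(1).next()
last = oplog.find().sort("$natural", -1).limit(1).next()
window_hours = (last["ts"].time - first["ts"].time) / 3600
print(f"Replication oplog window: {window_hours:.1f} hours")
```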

## Collections over 300GB

We recommend using [this tool](🔗) for collections over 300GB.

## Known Limitations

The Rockset MongoDB integration does not currently support [time series collections](🔗) introduced in MongoDB 5.0. If you need support for the MongoDB time series collections, please contact our support team at [[email protected]](🔗).