MongoDB
This page covers how to use a MongoDB collection as a data source in Rockset. This includes:
- Create a MongoDB Integration to securely connect collections in your MongoDB Atlas account or self-managed MongoDB cluster with Rockset.
- Create a Collection that continuously syncs your data from a MongoDB collection into Rockset in real time.
Create a MongoDB Integration
A MongoDB integration can be created based on where your MongoDB cluster is located:
Create a Collection
Use the Rockset-Mongo helper tool to ingest collections over 300GB.
Once you create a collection backed by MongoDB, Rockset first scans the MongoDB collection to ingest its existing documents, and then uses MongoDB Change Streams to keep the Rockset collection updated as records are added to or changed in the MongoDB collection.
If your MongoDB collection is a capped collection, MongoDB change streams don't receive deletes for old documents, so the Rockset collection can go out of sync. For this reason, we recommend setting retention on the Rockset collection at the time of creation.
You can create a collection from a MongoDB source in the Collections tab of the Rockset Console.
These operations can also be performed using any of the Rockset client libraries, the Rockset API, or the Rockset CLI.
How it works
When a MongoDB backed collection is created, indexing in Rockset occurs in two stages:
- A one-time full scan of the MongoDB collection in which all records are indexed and stored in the Rockset collection.
- Following that, continuous monitoring and sync of changes from the MongoDB collection (inserts, deletes and updates) to the Rockset collection in real-time using MongoDB Change Streams.
Once a MongoDB-backed collection is set up, it is a replica of the MongoDB collection, up to date to within a few seconds.
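The two stages can be illustrated with a small in-memory simulation. This is only a sketch of the idea, not Rockset's implementation: the function names `initial_scan` and `apply_change` are hypothetical, the events are hand-written dictionaries modeled on MongoDB change events (`operationType`, `documentKey`, `fullDocument`, `updateDescription`), and only top-level fields are handled.

```python
from copy import deepcopy


def initial_scan(source):
    """Stage 1: copy every source document into a new replica keyed by _id."""
    return {doc["_id"]: deepcopy(doc) for doc in source}


def apply_change(replica, event):
    """Stage 2: apply a single change event to the replica (top-level fields only)."""
    doc_id = event["documentKey"]["_id"]
    op = event["operationType"]
    if op in ("insert", "replace"):
        replica[doc_id] = deepcopy(event["fullDocument"])
    elif op == "update":
        desc = event["updateDescription"]
        doc = replica[doc_id]
        doc.update(desc.get("updatedFields", {}))
        for field in desc.get("removedFields", []):
            doc.pop(field, None)
    elif op == "delete":
        replica.pop(doc_id, None)
    return replica


# Hypothetical source collection at the time the Rockset collection is created.
source = [{"_id": "abc", "x": 1, "y": 2}, {"_id": "def", "x": 7}]
replica = initial_scan(source)

# Hypothetical changes that arrive after the scan has finished.
events = [
    {"operationType": "insert", "documentKey": {"_id": "ghi"},
     "fullDocument": {"_id": "ghi", "x": 3}},
    {"operationType": "update", "documentKey": {"_id": "abc"},
     "updateDescription": {"updatedFields": {"y": 5}, "removedFields": []}},
    {"operationType": "delete", "documentKey": {"_id": "def"}},
]
for event in events:
    apply_change(replica, event)

print(replica)
```

After the three events are applied, the replica holds the updated `abc` document and the new `ghi` document, and `def` is gone.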
Document Update Types
In the second stage above, when Rockset monitors changes through MongoDB Change Streams, there are two modes in which these changes can be received.
- Partial Delta: MongoDB only sends the fields that have changed in the document. This is the default behavior.
- Full Document: MongoDB returns the most current majority-committed version of the updated document.
As an example, if your MongoDB document was {"_id": "abc", "x": 1, "y": 2} and you issued an update in MongoDB to change y to 5, you would see different updates depending on the mode:
- Partial Delta: {"_id": "abc", "y": 5}
- Full Document: {"_id": "abc", "x": 1, "y": 5}
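The difference between the two payloads can be reproduced with a short sketch. The helpers `partial_delta` and `full_document` are hypothetical names used for illustration only. They compare top-level fields and do not model removed fields or nested updates.

```python
def partial_delta(before, after):
    """Return _id plus only the top-level fields whose values changed."""
    delta = {"_id": after["_id"]}
    for key, value in after.items():
        if key == "_id":
            continue
        if key not in before or before[key] != value:
            delta[key] = value
    return delta


def full_document(after):
    """Return a copy of the whole document as it looks after the update."""
    return dict(after)


before = {"_id": "abc", "x": 1, "y": 2}
after = {**before, "y": 5}  # the update sets y to 5

print(partial_delta(before, after))  # {'_id': 'abc', 'y': 5}
print(full_document(after))          # {'_id': 'abc', 'x': 1, 'y': 5}
```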
While receiving partial deltas is more efficient and puts less load on your MongoDB database and Rockset Virtual Instance, there are some notable drawbacks:
- You cannot use an Ingest Transformation if your MongoDB source generates partial deltas. For example, with the update above, the transformation SELECT x + y AS z FROM _input cannot be evaluated when either x or y changes if Rockset only receives partial deltas.
- Certain update types, specifically those related to nested objects and arrays, can also produce ambiguous partial deltas. This can lead to data inconsistencies between your MongoDB and Rockset collections.
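To see why the transformation cannot be evaluated from a partial delta, consider a plain-Python stand-in for SELECT x + y AS z FROM _input. The function `transform` is hypothetical and only mimics the arithmetic. It does not show how Rockset itself treats a missing field.

```python
def transform(doc):
    """Python stand-in for: SELECT x + y AS z FROM _input."""
    return {"z": doc["x"] + doc["y"]}


full_update = {"_id": "abc", "x": 1, "y": 5}
partial_update = {"_id": "abc", "y": 5}  # x is absent because it did not change

print(transform(full_update))  # {'z': 6}

try:
    transform(partial_update)
except KeyError as missing:
    # The delta alone does not carry enough information to compute z.
    print(f"cannot compute z: field {missing} is not in the partial delta")
```

With the full document, both inputs are present and z can be recomputed. With the partial delta, x is missing, so the new value of z cannot be derived from the update alone.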
For these reasons, we recommend that you enable Full Document Updates when creating your MongoDB collection. This can be done in the Rockset console or through the REST API's create collection endpoint.
MongoDB Best Practices
When the MongoDB database is under heavy load, it affects the speed at which we can read updates. Below are some best practices for connecting MongoDB as a source with Rockset:
- Start bulk ingest when your MongoDB database is under light load
- This allows Rockset to do the one-time full scan of MongoDB without any read throttling
- Increase the read-throughput on the MongoDB cluster for bulk ingest
- Use common techniques to increase read performance for the initial scan. See some recommended techniques in this blog from our solution engineering team.
- Prefer using a read replica to connect as a source with Rockset. Refer to this MongoDB doc for details on how you can set up a connection string with a readPreference flag.
- Rockset uses majority read concern. Read concern "majority" guarantees that the data read has been acknowledged by a majority of the replica set members (i.e. the documents read are durable and guaranteed not to roll back).
- Make sure that majority read concern is enabled by following the instructions in this link. Majority read concern is also a requirement for Change Streams in MongoDB 4.0 and earlier.
- Increase the oplog size
- See the MongoDB recommendation for workloads that might require a larger oplog size
- If the source MongoDB collection has a high rate of write and update operations, we recommend increasing the oplog size.
- MongoDB recommends that the oplog size for a cluster should be enough to facilitate a 24-hour Replication Oplog Window. For example, if you are generating 1 GB of oplog per hour on average, then the recommendation is that your oplog is 24 GB.
- Set up alerts on your MongoDB project to trigger if the oplog churn (GB/hour) exceeds a specified threshold.
- Monitor streaming ingest metrics in Rockset
- If your org's Virtual Instance is nearing its peak streaming ingest rate, consider increasing its size to avoid an increase in data latency and slow queries
- Once the streaming ingest rate drops, you can scale the Virtual Instance back down for cost control
- Using the metrics endpoint you can set alerts with your preferred monitoring tool
- If the ingest keeps getting rate limited for a prolonged period of time, depending on your oplog size and churn rate, Rockset might not be able to catch up with all the updates coming from MongoDB, and the collection will enter an unrecoverable error state that will require re-creating it.
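Two of the practices above lend themselves to a short sketch: building a connection string that carries a readPreference option, and sizing the oplog from its churn rate. All function names here are hypothetical, the host, user, and password are placeholders, and the `secondaryPreferred` read preference mode is an assumed choice rather than a Rockset requirement.

```python
from urllib.parse import quote_plus, urlencode


def build_mongodb_uri(user, password, host, read_preference="secondaryPreferred"):
    """Build a connection string that asks MongoDB to prefer a secondary for reads."""
    options = urlencode({"readPreference": read_preference})
    return f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}@{host}/?{options}"


def recommended_oplog_gb(churn_gb_per_hour, window_hours=24):
    """Oplog size needed to retain `window_hours` of changes at the given churn."""
    return churn_gb_per_hour * window_hours


def oplog_window_hours(oplog_gb, churn_gb_per_hour):
    """How many hours of changes an oplog of the given size can retain."""
    return oplog_gb / churn_gb_per_hour


def churn_exceeds_threshold(churn_gb_per_hour, threshold_gb_per_hour):
    """True when oplog churn is above the alerting threshold."""
    return churn_gb_per_hour > threshold_gb_per_hour


# Placeholder credentials and host, for illustration only.
print(build_mongodb_uri("rockset_user", "p@ss", "cluster0.example.mongodb.net"))

print(recommended_oplog_gb(1))          # 24
print(oplog_window_hours(24, 3))        # 8.0
print(churn_exceeds_threshold(3, 1.5))  # True
```

The second call shows why churn matters: a 24 GB oplog that comfortably covers 24 hours at 1 GB/hour covers only 8 hours at 3 GB/hour, which shortens the time Rockset has to catch up after a period of rate limiting.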
Collections over 300GB
We recommend using this tool for collections over 300GB.
Known Limitations
The Rockset MongoDB integration does not currently support time series collections introduced in MongoDB 5.0. If you need support for MongoDB time series collections, please contact our support team.