Collections
A Collection is a set of Rockset documents. All documents within a collection, and all fields within a document, are mutable. Similar to tables in traditional SQL databases, collections are queried with FROM clauses in SQL queries.
Creating Collections
Collections can be created in the Collections tab of the Rockset Console or by using the Rockset API.
To create a collection using a managed integration with an external data source (such as MongoDB or Amazon S3), first set up the respective integration using the instructions in the Integrations section. We generally recommend mapping each data source (such as a MongoDB collection, DynamoDB table, or Kafka topic) to a single collection, and joining those collections at query time using JOIN when necessary.
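For example, the sketch below joins two hypothetical collections, users and orders, in the default commons workspace, each assumed to be backed by a different source:

```sql
-- Hypothetical collections: "users" (e.g. synced from MongoDB) and
-- "orders" (e.g. synced from DynamoDB), joined at query time.
SELECT
    u.name,
    o.order_id,
    o.total
FROM commons.users u
    JOIN commons.orders o ON u.id = o.user_id
WHERE o.total > 100;
```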
Using an external data source is not required.
You can always create collections using the Rockset Console (either empty, or from a file upload) or by using the Rockset API directly without referencing any sources. You can read more about how to create a collection from a self-managed data source.
Data Ingest Modes
When a collection is created using one of the managed integrations with an external data source, the data ingestion happens in two phases:
Phase 1: Bulk Ingest Mode
The upstream data set (often a consistent snapshot of it) is downloaded and indexed into the collection in a batch mode called Bulk Ingest Mode. In this phase, Rockset’s data connectors automatically scale the compute required to complete the phase as quickly and efficiently as possible. The compute consumed during this part of the data ingest does not come from your Virtual Instance; instead, it comes from dedicated, autoscaling Bulk Ingest Compute Units that are fully isolated from your Virtual Instance. You are charged for the total amount of Bulk Ingest Compute Units consumed in addition to your Virtual Instance. Bulk Ingest Mode often speeds up the initial data ingest by 10-100 times while consuming 5-10 times less compute compared to ingesting the same large data sets using your regular Virtual Instance. During this mode, the collection status will be set to “Bulk Ingest” and you will not be able to query the collection.
Phase 2: Streaming Ingest Mode
Once the initial bulk ingest phase completes, the data connectors seamlessly transition to a streaming ingest mode, and your collection becomes available for query processing. In this mode, the data connectors watch for changes in the upstream data source and replicate those changes to the downstream collection in real time. During this mode, the collection status will be set to “Ready”.
| Ingest Mode | Compute Resource | Compute Measure | Optimized For | Collection Status | Available for Queries |
| --- | --- | --- | --- | --- | --- |
| Bulk Ingest | Autoscaling Bulk Ingest Compute Units | Bulk Ingest Compute Units | Data ingestion throughput | Bulk Ingest | No |
| Streaming Ingest | Virtual Instance | Virtual Instance Seconds | Streaming ingest and query processing | Ready | Yes |
Streaming Ingest Mode with Snowflake Integration
With Snowflake integrations, updates from the data source do not sync during the Streaming Ingest Mode phase. See the Snowflake integration documentation for more information.
Bulk Ingest Mode
Bulk Ingest Mode is supported only for certain popular data sources: Amazon DynamoDB, MongoDB, Amazon S3, Azure Blob Storage, Google Cloud Storage, and Snowflake. Check the documentation for your particular data source to see whether Bulk Ingest Mode is supported.
If the upstream data set is small (typically less than 5 GiB), Bulk Ingest Mode is skipped, since it does not provide a meaningful benefit at that scale.
Also note that a collection will only ever enter Bulk Ingest mode immediately following its creation. If you plan to periodically ingest large volumes of data, it is recommended to load them into a new collection (whose creation will trigger Bulk Ingest mode) and then use collection aliases to point to the new collection so that your queries will seamlessly transition to the new collection.
Bulk Ingest Mode is available only to dedicated Virtual Instances.
For shared and free Virtual Instances, ingest speed is limited to 1 MiB/sec. You can view peak streaming ingest limits here.
Factors affecting Bulk Ingest Mode
Because Bulk Ingest Compute Units automatically scale to fit the needs of the initial bulk ingest phase, it is important to understand the factors that can make your bulk ingest more or less efficient. This section explains the factors that influence the efficiency, and thereby the cost, of your bulk ingest:
- Data Size
- This is probably the most intuitive factor. The larger your data set, the more compute is needed to ingest it.
- Ingesting a 100 GiB data set will require more Bulk Ingest Compute seconds than ingesting a 10 GiB one, assuming both of those data sets were uncompressed and in the same data format.
- Data Format
- If your source uses compressed files (Parquet, gzip, etc.), compute resources are required to decompress them. Some data formats, such as Parquet, can also cause significant memory bloat during the indexing process, consuming additional compute resources.
- Ingest Transformation
- Transformations require additional compute resources. Simple transformations, such as skipping rows when certain conditions are met, dropping fields, or renaming fields, are computationally trivial. However, if your ingest transformation involves parsing JSON-formatted string fields into native objects and arrays, casting strings to date/time fields, or complex JavaScript UDFs, those will require additional compute resources.
- Rollups
- Rollups are a special case of ingest transformation that aggregate an incoming data stream using a GROUP BY clause in the ingest transformation. When used appropriately, rollups can significantly reduce the compute required, since they reduce the total amount of data that needs to be indexed. But beware of relatively expensive aggregation functions such as ARRAY_AGG() or MAP_AGG(), which may require more compute resources. (See the sketch after this list.)
- Data Clustering
- If the new collection being created uses data clustering on certain fields, many of the data ingest throughput optimizations usually available during bulk ingest are currently disabled. This can result in both slower data ingestion during bulk ingest and additional compute consumption.
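As a rough illustration of the transformation and rollup factors above, here is a minimal rollup sketch. The field names (ts, endpoint, bytes) are hypothetical, and the functions shown are only illustrative; see the SQL Reference for what is available in ingest transformations.

```sql
-- Hypothetical rollup ingest transformation: cast an incoming string field to
-- a timestamp, then aggregate raw events per endpoint per minute before indexing.
SELECT
    DATE_TRUNC('minute', CAST(ts AS timestamp)) AS event_minute,
    endpoint,
    COUNT(*) AS request_count,
    SUM(bytes) AS total_bytes
FROM _input
GROUP BY
    DATE_TRUNC('minute', CAST(ts AS timestamp)),
    endpoint
```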
Best Practices for Bulk Ingest
If your new collection has a combination of the above factors, follow these best practices:
- The best way to estimate the actual compute required is to experiment with a small subset of your data first. For example, if you have a 10 TiB data set, first take a 100 GiB subset of it and import it using the same ingest transformation, data clustering, and all other collection configuration that you eventually plan to use on your larger data set. You can expect total bulk ingest compute consumption to grow roughly linearly with data size; in fact, due to built-in throughput optimizations, it will usually scale sub-linearly with data growth.
- If you have to experiment with complex ingest-time data transformations, first ingest a smaller subset, say 10,000 to 100,000 rows, of the raw untransformed data set. Then build the transformations you want to perform on that raw data collection as a view. Once the view returns the data in the form that fits the needs of your application, take the view definition SQL, change the FROM clause to _input, and use that as your ingest SQL transformation (sketched below).
- If you have to periodically reload all the data in your collection, then instead of deleting and inserting/updating data in place, import the new data set into a brand new collection and use collection aliasing to switch your queries once the new collection is ready for query processing.
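A minimal sketch of that workflow, assuming a hypothetical raw collection commons.raw_events with a JSON-encoded payload string and a created_at string field (the functions shown, such as JSON_PARSE, are illustrative; see the SQL Reference):

```sql
-- Step 1: experiment on a small raw collection; save this query as a view and
-- iterate until the output matches what your application needs.
SELECT
    JSON_PARSE(payload) AS payload,
    CAST(created_at AS timestamp) AS created_at
FROM commons.raw_events
WHERE payload IS NOT NULL;

-- Step 2: reuse the same SELECT as the ingest transformation for the
-- full-size collection by swapping the FROM clause to _input.
SELECT
    JSON_PARSE(payload) AS payload,
    CAST(created_at AS timestamp) AS created_at
FROM _input
WHERE payload IS NOT NULL;
```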
Ingest Transformation
Collections automatically create indexes on every field path included in the collection, including nested paths.
To transform incoming data, for example by indexing a compound or derived field, configure an ingest transformation that creates new fields by applying SQL expressions to the fields of incoming documents.
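For instance, a minimal ingest transformation sketch that indexes a derived field alongside the original fields (the first_name and last_name fields are hypothetical):

```sql
-- Keep all incoming fields and add a derived full_name field, which is
-- then indexed like any other field path.
SELECT
    *,
    CONCAT(first_name, ' ', last_name) AS full_name
FROM _input
```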
Retention Duration
For each collection, you can set a custom retention duration. This determines how long each document is retained after being added to the collection. Low retention values can be used to keep total data indexed low, while still ensuring your applications are querying the latest data. Retention duration is implemented by partitioning the data internally based on time ranges. Once the _event_time
for all documents within a partition has fallen out of the retention window, the partition is dropped all at once, efficiently deleting all of the expired documents. The time range for each partition is a function of the requested retention duration and is calculated so that roughly 12 partitions are always present. For example, with a retention duration of 1 year, each partition carries roughly 1 month of data, which is retained and deleted as a unit. Likewise, a collection with 1 month retention uses a partition window of ~2 days, 1 day retention a window of ~2 hours, and so on.
The minimum allowable retention for a collection is 1 hour.
Ingest Limit
During collection creation, you may set a limit on the amount of data you want ingested. This can be useful if you want to test that an ingest transformation behaves properly, or to query a sample of your data. Rockset monitors the amount of data read and stops ingesting once it notices ingestion has reached or passed the provided limit. The ingest limit is enforced on a best-effort basis, so more data may be ingested than requested.
Storage Compression
We offer two options for data compression to reduce the storage size of your collections: LZ4 (default) and ZStandard (abbreviated as Zstd).
- LZ4 is a lossless data compression algorithm optimized for compression and decompression speeds. LZ4 reduces the storage size of your collections, while still enabling low-latency queries.
- Zstd is a customizable lossless data compression algorithm that supports a range of options trading off compression speed against compression ratio. We currently support Zstd level 3 compression. Zstd offers a stronger compression ratio than LZ4, but has slower compression and decompression speeds. You can expect collections with Zstd compression to have comparatively smaller storage sizes.
Storage compression, in most cases, does not impact query performance. This is because most working data sets fit in the Virtual Instance cache, which is inherently uncompressed.
Shard Count
Collections are split up into smaller pieces called shards, which is the unit of parallelism for a collection.
The shard count for a collection is determined during the bulk ingest phase of ingestion and is immutable after collection creation. Each shard is roughly the same size and can support up to 100 GB of uncompressed, transformed data. The default shard count assigned to all collections is 16; however, if your collection has more data than 16 shards can support, the shard count increases by factors of 2 (32, 64, etc.) depending on the collection size. More shards provide more parallelism, which is advantageous for larger collections, as it can improve query and ingest performance. Fewer shards are assigned to smaller collections, since the communication overhead and memory usage of a greater shard count can result in decreased performance.
If a collection does not enter the bulk ingest phase, it will be assigned the default shard count of 16.
You may also specify a minimum shard count, currently available only via the Create Collection API endpoint. You can specify values of 4, 8, 16, or 32 as the minimum shard count using the shard_count_minimum parameter in the call. An example cURL request using this parameter:
curl --request POST \
--url https://api.usw2a1.rockset.com/v1/orgs/self/ws/commons/collections \
--header 'Authorization: ApiKey YOUR_ROCKSET_KEY' \
--header 'accept: application/json' \
--header 'content-type: application/json' \
--data '
{
"description": "example collection",
"name": "foo"
"shard_count_minimum": 8
}
'
The shard limit for a given VI will also determine the collection limit. See the Resource Allocations documentation for more information on shard and collection limits.
Special Fields
Every document ingested into Rockset collections has several system-generated fields which are automatically created by Rockset and added to each document. Learn more about special fields here.
The _events Collection
When your organization is created, Rockset will automatically create a collection named _events which is used for audit logging to provide visibility into your account. Learn more about the _events collection.
Updating Collections
When a collection is created from a managed integration, Rockset will automatically sync your collection with its data source, usually within seconds. For more information about individual source behavior, see the Data Sources section.
If you choose not to create your collection using a managed integration, or wish to make manual changes to data in your collection after Rockset has synced it with your external data source, you can learn more about manually adding, deleting, or patching documents.
Querying Collections
Collections can be queried using SQL the same way tables are queried in traditional SQL databases.
You can write and execute queries on your collections in the Query Editor tab of the Rockset Console. Queries can also be executed against collections using the Rockset API and SDKs (Node.js, Python, Java, or Go). SQL queries can also JOIN documents across different Rockset collections and Workspaces.
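For example, a sketch of a cross-workspace join, assuming hypothetical analytics and billing workspaces:

```sql
-- Hypothetical example: join collections that live in two different workspaces.
SELECT
    s.user_id,
    s.page,
    p.plan
FROM analytics.sessions s
    JOIN billing.plans p ON s.user_id = p.user_id
LIMIT 10;
```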
See our SQL Reference for the full list of supported data types, commands, functions, and more.
Collection Errors
Out-of-Sync Collections
Collections sourced from external systems such as Amazon DynamoDB or MongoDB can occasionally become out of sync with the data source. This happens rarely, when Rockset's replication pipeline encounters errors or bugs.
You can re-sync a collection by dropping and recreating it:
- Navigate to the Collection Details page for your collection by selecting it in the Collections tab of the Rockset Console.
- Click the "Clone Collection Settings" button in the overflow menu dropdown (indicated by the ... icon) on the top-right corner. This will open the Collection Creation form in a new tab, and populate the configuration from your existing collection. Double-check that this configuration is correct, but do not click the "Create Collection" button.
- In the original tab displaying your Collection Details, click the "Delete Collection" button. Wait at least 30 seconds for the collection deletion to complete.
- Click the "Create" button in the Collection Creation tab to recreate the collection.
Re-Syncing Collections Note
Recreating collections will incur additional ingest costs in Rockset and potentially additional charges on the data source (e.g. RCU charges in Amazon DynamoDB). Furthermore, because re-syncing involves dropping and recreating a collection, there will be a period of unavailability. Thus, if you choose to re-sync a collection, we recommend that you do so during a scheduled maintenance window.