Monitoring And Alerting

Metrics Endpoint

Beyond the Console Metrics page, additional metrics are accessible through the metrics endpoint in Prometheus / OpenMetrics format. This format is compatible with monitoring and alerting tools such as Prometheus, Datadog, and AWS CloudWatch, among many others.

$ curl https://$ROCKSET_SERVER/v1/orgs/self/metrics -u {API key}:
# HELP rockset_collections Number of collections.
# TYPE rockset_collections gauge
rockset_collections{virtual_instance_id="30",workspace_name="commons",} 20.0
rockset_collections{virtual_instance_id="30",workspace_name="myWorkspace",} 2.0
rockset_collections{virtual_instance_id="30",workspace_name="myOtherWorkspace",} 1.0
# HELP rockset_collection_size_bytes Collection size in bytes.
# TYPE rockset_collection_size_bytes gauge
rockset_collection_size_bytes{virtual_instance_id="30",workspace_name="commons",collection_name="_events",} 3.74311622E8
...
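
If you want to inspect the response outside of a monitoring tool, the text format is simple enough to parse by hand. The following is a minimal sketch, not an official client: the function name parse_metrics is hypothetical, it assumes the response looks like the sample above (including the trailing comma inside the label braces), and it does not unescape label values. A dedicated Prometheus client library is usually a better choice for production use.

```python
import re

# One sample line: metric name, optional {labels}, then the numeric value.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair such as workspace_name="commons"; escapes are left as-is.
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line (for example, an elided "...")
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples
```

Each returned tuple carries the metric name, a dictionary of labels, and the value as a float, so a value printed as 3.74311622E8 becomes 374311622.0.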

You can enable the metrics endpoint for your Virtual Instance from the Metrics tab in the Rockset Console.

The endpoint currently uses three metric types: Gauge, Counter, and Histogram. The Prometheus documentation on metric types describes each of them in more detail.

Note: Some metric types (e.g. Histogram) are represented through a set of sub-items. For example, the rockset_query_latency_seconds metric (a Histogram) would be represented by several rockset_query_latency_seconds_bucket records along with a rockset_query_latency_seconds_sum. Most monitoring clients will handle these complex types automatically on your behalf.
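
To make the histogram representation concrete, the sketch below estimates a percentile from a set of bucket records by linear interpolation, which is similar in spirit to what Prometheus-style tools do. It assumes the standard Prometheus convention in which each _bucket record carries an upper bound in an le label and a cumulative count as its value; the function name estimate_quantile and the bucket boundaries in the usage note are made up for illustration.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs, one per
    *_bucket record, with the last upper bound being +Inf.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            # Interpolation is impossible in the +Inf bucket or an empty one,
            # so fall back to the highest finite bound seen so far.
            if math.isinf(bound) or in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

For example, with hypothetical buckets [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)], estimate_quantile(0.95, ...) falls halfway through the 0.5 to 1.0 second bucket. The estimate can only be as precise as the bucket widths allow.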

The following metrics are provided and updated at one-minute intervals:

Virtual Instance Metrics

Metric | Type | Description
rockset_leaf_cpu_utilization_percentage | Gauge | Average leaf CPU utilization. Leaf nodes store and ingest data. Leaf CPU utilization reflects both data ingestion and query processing.
rockset_leaf_memory_utilization_percentage | Gauge | Average leaf memory utilization. Leaf nodes store and ingest data. Leaf memory utilization reflects both data ingestion and query processing.
rockset_agg_cpu_utilization_percentage | Gauge | Average aggregator CPU utilization. Aggregator nodes aggregate data during query execution.
rockset_agg_memory_utilization_percentage | Gauge | Average aggregator memory utilization. Aggregator nodes aggregate data during query execution.

Virtual Instance metrics are useful for monitoring compute usage and alerting when your VI is near the limits of its performance. Query performance and ingest latency may both degrade as these metrics near 100%.
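
As a rough illustration of alerting on these gauges, the sketch below flags any utilization metric at or above a threshold. It assumes the values are reported on a 0 to 100 scale, as the metric names suggest; the function name utilization_alerts and the default threshold of 85 are arbitrary choices for this example, not recommended settings.

```python
UTILIZATION_METRICS = (
    "rockset_leaf_cpu_utilization_percentage",
    "rockset_leaf_memory_utilization_percentage",
    "rockset_agg_cpu_utilization_percentage",
    "rockset_agg_memory_utilization_percentage",
)


def utilization_alerts(gauges, threshold=85.0):
    """Return (metric, value) pairs at or above the threshold, sorted by name.

    `gauges` maps metric names to their latest values; names that are not
    Virtual Instance utilization metrics are ignored.
    """
    return sorted(
        (name, value)
        for name, value in gauges.items()
        if name in UTILIZATION_METRICS and value >= threshold
    )
```

In a real deployment this check would normally live in your alerting tool as a rule evaluated over several minutes, so that a single brief spike does not trigger a page.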

Collection Metrics

Metric | Type | Description
rockset_collections | Gauge | Number of collections.
rockset_collection_size_bytes | Gauge | Collection size in bytes. Note that this size reflects the current storage size and will decrease as documents are deleted or expire under the configured retention duration.
rockset_collection_documents | Gauge | Number of documents currently in each collection.
rockset_collection_total_ingest_bytes | Counter | Number of bytes ingested over the history of each collection. Note that this count only ever increases and is therefore well suited for increase and rate functions to compute ingest over time.
rockset_collection_parse_errors | Counter | Number of parse errors for each collection.
rockset_collection_data_discovery_latency | Histogram | The duration (in seconds) from when new or updated data appears in a data source until Rockset first detects it. Elevated values for this metric often reflect configuration issues in the underlying data source (e.g. an inadequate number of RCUs provisioned for DynamoDB sources).
rockset_collection_data_process_latency | Histogram | The duration (in seconds) from when new or updated data is first detected by Rockset until the data is fully processed and queryable. Elevated values for this metric can be alleviated by allocating additional compute to your Virtual Instance.
rockset_data_discovery_latency | Histogram | Data discovery latency across all collections. Unlike the collection-specific metric, this metric continues to include data from deleted collections.
rockset_data_process_latency | Histogram | Data process latency across all collections. Unlike the collection-specific metric, this metric continues to include data from deleted collections.
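
The increase and rate functions mentioned for rockset_collection_total_ingest_bytes can be sketched as follows. This is an illustration of the general idea rather than the exact algorithm used by any particular monitoring tool: the function names are hypothetical, samples are (timestamp in seconds, counter value) pairs in time order, and a drop in value is treated as a counter reset, following the usual Prometheus convention.

```python
def counter_increase(samples):
    """Total increase over time-ordered (timestamp, value) samples."""
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # The counter restarted from zero, so count what accumulated since.
            increase += current
    return increase


def counter_rate(samples):
    """Average per-second rate of increase across the sampled window."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return 0.0
    return counter_increase(samples) / elapsed
```

For instance, three samples taken one minute apart with values 1000, 1600, and 2200 bytes give an increase of 1200 bytes and an average ingest rate of 10 bytes per second.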

Query Metrics

Metric | Type | Description
rockset_queries | Counter | Number of queries.
rockset_query_latency_seconds | Histogram | Query latency, including admission control duration. Because this metric is exposed as a histogram, you can compute any percentile (PXX) you like, with an accuracy of +/- ~15% in almost all cases.
rockset_query_admission_latency_seconds | Histogram | Admission control queue duration per query if admission control is enabled for your account.
rockset_query_queue_size | Gauge | Number of queries currently queued (throttled by admission control).
rockset_query_errors | Counter | Number of query execution errors, labeled by HTTP error code (e.g. 404, 500).
rockset_query_lambda_queries | Counter | Number of queries by Query Lambda. Note that the tag label is tracked if and only if the execution is specified by tag.
rockset_query_lambda_latency_seconds | Histogram | Query latency by Query Lambda. Note that the tag label is tracked if and only if the execution is specified by tag.
rockset_query_lambda_admission_latency_seconds | Histogram | Query admission latency by Query Lambda. Note that the tag label is tracked if and only if the execution is specified by tag.
rockset_query_lambda_errors | Counter | Number of query execution errors by Query Lambda. Note that the tag label is tracked if and only if the execution is specified by tag.

Reference Configurations & Templates

You can find reference configurations and templates for Prometheus, Datadog, Grafana and Alertmanager here.

Below is an example entry for the scrape_configs section of a Prometheus configuration file:

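  - job_name: Rockset Metrics API
    # Metrics are updated once per minute, so scraping more often adds no new data.
    scrape_interval: 1m
    scrape_timeout: 1m
    honor_timestamps: true
    static_configs:
      - targets:
        # Replace with your Rockset API server hostname ($ROCKSET_SERVER in the curl example).
        - api.usw2a1.rockset.com
    scheme: https
    # The API key is the basic-auth username; the password is left empty.
    basic_auth:
      username: <API Key>
      password: ""
    metrics_path: /v1/orgs/self/metrics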
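
If you need to pull the metrics from a script rather than from Prometheus, the same host, path, and basic-auth pattern applies. The sketch below uses only the Python standard library; the function names build_metrics_request and fetch_metrics are hypothetical, and the host and API key are placeholders you would supply yourself.

```python
import base64
import urllib.request


def build_metrics_request(host, api_key):
    """Build an HTTPS request that sends the API key as the basic-auth username."""
    # Equivalent to `curl -u {API key}:`, i.e. the password part is empty.
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"https://{host}/v1/orgs/self/metrics",
        headers={"Authorization": f"Basic {token}"},
    )


def fetch_metrics(host, api_key, timeout=60):
    """Fetch the metrics endpoint and return the response body as text."""
    request = build_metrics_request(host, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling fetch_metrics("api.usw2a1.rockset.com", api_key) should return the same text as the curl example, provided the metrics endpoint is enabled for your Virtual Instance.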