# Monitoring and Alerting

## Metrics Endpoint

Beyond the Console Metrics page, additional metrics are accessible through the metrics endpoint in Prometheus / OpenMetrics format. This format is compatible with monitoring and alerting tools such as Prometheus, Datadog, and AWS CloudWatch (among many others).
```shell
$ curl https://$ROCKSET_SERVER/v1/orgs/self/metrics -u {API key}:
# HELP rockset_collections Number of collections.
# TYPE rockset_collections gauge
rockset_collections{virtual_instance_id="30",workspace_name="commons",} 20.0
rockset_collections{virtual_instance_id="30",workspace_name="myWorkspace",} 2.0
rockset_collections{virtual_instance_id="30",workspace_name="myOtherWorkspace",} 1.0
# HELP rockset_collection_size_bytes Collection size in bytes.
# TYPE rockset_collection_size_bytes gauge
rockset_collection_size_bytes{virtual_instance_id="30",workspace_name="commons",collection_name="_events",} 3.74311622E8
...
```
You can enable the metrics endpoint for your Virtual Instance from the Metrics tab in the Rockset Console.
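If you want to inspect the payload without a full monitoring client, the exposition format above is simple enough to parse by hand. The sketch below (a minimal illustration, not a complete Prometheus parser; the sample line is copied from the output above) extracts the metric name, labels, and value from a single labeled sample line:

```python
import re

# A sample line copied from the Rockset metrics endpoint output above.
sample = 'rockset_collections{virtual_instance_id="30",workspace_name="commons",} 20.0'

# Minimal pattern for one labeled sample line: name{labels} value
LINE_RE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_sample(line):
    """Parse a single labeled sample line into (name, labels, value)."""
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError("not a labeled sample line: %r" % line)
    labels = dict(LABEL_RE.findall(m.group("labels")))
    return m.group("name"), labels, float(m.group("value"))

name, labels, value = parse_sample(sample)
print(name, labels, value)
```

In practice, a monitoring agent or a Prometheus client library will handle parsing (including `# HELP` / `# TYPE` comment lines and histogram sub-series) for you.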
You can read more about the three metric types currently used (Gauge, Counter, and Histogram) in the Prometheus documentation.
Note: Some metric types (e.g. Histogram) are represented through a set of sub-items. For example, the `rockset_query_latency_seconds` metric (a Histogram) would be represented by several `rockset_query_latency_seconds_bucket` records along with a `rockset_query_latency_seconds_sum`. Most monitoring clients will handle these complex types automatically on your behalf.
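As an illustration, assuming a Prometheus server is scraping this endpoint, a P95 query latency can be computed from the histogram buckets with `histogram_quantile` (a sketch; the range window and label grouping may need adjusting for your setup):

```promql
histogram_quantile(0.95, sum by (le) (rate(rockset_query_latency_seconds_bucket[5m])))
```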
The following metrics are provided and updated at one-minute intervals:
### Virtual Instance Metrics

| Metric | Type | Description |
|---|---|---|
| `rockset_leaf_cpu_utilization_percentage` | Gauge | Average leaf CPU utilization. Leaf nodes store and ingest data. Leaf CPU utilization reflects both data ingestion and query processing. |
| `rockset_leaf_memory_utilization_percentage` | Gauge | Average leaf memory utilization. Leaf nodes store and ingest data. Leaf memory utilization reflects both data ingestion and query processing. |
| `rockset_agg_cpu_utilization_percentage` | Gauge | Average aggregator CPU utilization. Aggregator nodes aggregate data during query execution. |
| `rockset_agg_memory_utilization_percentage` | Gauge | Average aggregator memory utilization. Aggregator nodes aggregate data during query execution. |
Virtual Instance metrics are useful for monitoring compute usage and alerting when your VI is near the limits of its performance. Query performance and ingest latency may both degrade as these metrics near 100%.
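For example, a Prometheus alerting rule can fire when leaf CPU stays high. The rule below is an illustrative sketch, not a Rockset recommendation; the alert name, threshold, and duration are hypothetical values you should tune for your workload:

```yaml
groups:
  - name: rockset-virtual-instance   # hypothetical rule group name
    rules:
      - alert: RocksetLeafCpuHigh    # hypothetical alert name
        expr: rockset_leaf_cpu_utilization_percentage > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Leaf CPU above 90% for 10 minutes; query and ingest performance may degrade."
```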
### Collection Metrics

| Metric | Type | Description |
|---|---|---|
| `rockset_collections` | Gauge | Number of collections. |
| `rockset_collection_size_bytes` | Gauge | Collection size in bytes. Note that this size reflects the current storage size and will decrease as documents expire via the specified retention duration or are deleted. |
| `rockset_collection_documents` | Gauge | Number of documents currently in each collection. |
| `rockset_collection_total_ingest_bytes` | Counter | Number of bytes ingested over the history of each collection. Note that this count only ever increases and is therefore well suited for `increase` and `rate` functions to compute ingest over time. |
| `rockset_collection_parse_errors` | Counter | Number of parse errors for each collection. |
| `rockset_collection_data_discovery_latency` | Histogram | The duration (in seconds) from when new or updated data appears in a data source until Rockset first detects it. Elevated values for this metric often reflect configuration issues in the underlying data source (e.g. an inadequate number of RCUs provisioned for DynamoDB sources). |
| `rockset_collection_data_process_latency` | Histogram | The duration (in seconds) from when new or updated data is first detected by Rockset until the data is fully processed and queryable. Elevated values for this metric can be alleviated by allocating additional compute to your Virtual Instance. |
| `rockset_data_discovery_latency` | Histogram | Data discovery latency across all collections. Unlike the collection-specific metric, this metric continues to include data from deleted collections. |
| `rockset_data_process_latency` | Histogram | Data process latency across all collections. Unlike the collection-specific metric, this metric continues to include data from deleted collections. |
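Because `rockset_collection_total_ingest_bytes` is a monotonically increasing counter, per-collection ingest throughput can be sketched with `rate`. This is an illustrative Prometheus query; the 15-minute window is an arbitrary choice:

```promql
# Average bytes ingested per second over the last 15 minutes, per collection
sum by (workspace_name, collection_name) (rate(rockset_collection_total_ingest_bytes[15m]))
```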
### Query Metrics

| Metric | Type | Description |
|---|---|---|
| `rockset_queries` | Counter | Number of queries. |
| `rockset_query_latency_seconds` | Histogram | Query latency, including admission control duration. Note that this metric is exposed as a histogram, so you can compute any PXX you'd like with an accuracy of +/- ~15% in almost all cases. |
| `rockset_query_admission_latency_seconds` | Histogram | Admission control queue duration per query, if admission control is enabled for your account. |
| `rockset_query_queue_size` | Gauge | Number of queries currently queued (throttled by admission control). |
| `rockset_query_errors` | Counter | Number of query execution errors, labeled by HTTP error code (e.g. `404`, `500`). |
| `rockset_query_lambda_queries` | Counter | Number of queries by Query Lambda. Note that the `tag` label is tracked if and only if the execution is specified by tag. |
| `rockset_query_lambda_latency_seconds` | Histogram | Query latency by Query Lambda. Note that the `tag` label is tracked if and only if the execution is specified by tag. |
| `rockset_query_lambda_admission_latency_seconds` | Histogram | Query admission latency by Query Lambda. Note that the `tag` label is tracked if and only if the execution is specified by tag. |
| `rockset_query_lambda_errors` | Counter | Number of query execution errors by Query Lambda. Note that the `tag` label is tracked if and only if the execution is specified by tag. |
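Since both `rockset_queries` and `rockset_query_errors` are counters, an overall error ratio can be sketched with `increase`. This is an illustrative Prometheus query, assuming both counters are scraped from the same endpoint; the one-hour window is arbitrary:

```promql
# Fraction of queries that errored over the last hour
sum(increase(rockset_query_errors[1h])) / sum(increase(rockset_queries[1h]))
```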
## Reference Configurations & Templates

You can find reference configurations and templates for Prometheus, Datadog, Grafana, and Alertmanager here.

Below is an example of a Prometheus `scrape_configs` entry:
```yaml
- job_name: Rockset Metrics API
  scrape_interval: 1m
  scrape_timeout: 1m
  honor_timestamps: true
  static_configs:
    - targets:
        - api.usw2a1.rockset.com
  scheme: https
  basic_auth:
    username: <API Key>
    password:
  metrics_path: /v1/orgs/self/metrics
```