Monitoring And Alerting
Metrics Endpoint
Beyond the Console Metrics page, additional metrics are accessible through the metrics endpoint in Prometheus/OpenMetrics format. This format is compatible with monitoring/alerting tools such as Prometheus, Datadog and AWS CloudWatch (among many others).
$ curl https://$ROCKSET_SERVER/v1/orgs/self/metrics -u {API key}:
# HELP rockset_collections Number of collections.
# TYPE rockset_collections gauge
rockset_collections{virtual_instance_id="30",workspace_name="commons",} 20.0
rockset_collections{virtual_instance_id="30",workspace_name="myWorkspace",} 2.0
rockset_collections{virtual_instance_id="30",workspace_name="myOtherWorkspace",} 1.0
# HELP rockset_collection_size_bytes Collection size in bytes.
# TYPE rockset_collection_size_bytes gauge
rockset_collection_size_bytes{virtual_instance_id="30",workspace_name="commons",collection_name="_events",} 3.74311622E8
...
You can enable the metrics endpoint for your Virtual Instance from the Metrics tab in the Rockset Console.
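If you want to inspect the endpoint's output without a full monitoring client, the text format shown above can be read with a few lines of code. The following is a minimal sketch, not an official client: the function name `parse_samples` and the `SAMPLE` text are illustrative assumptions, and the parser does not unescape label values or read optional timestamps.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
_SAMPLE_LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces: name="value" (value kept as written).
_LABEL_PAIR = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, float_value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_LINE.match(line)
        if match is None:
            continue  # not a sample line (e.g. an elided "..." in an excerpt)
        name, raw_labels, raw_value = match.groups()
        labels = dict(_LABEL_PAIR.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# Illustrative input copied from the example response above.
SAMPLE = """\
# HELP rockset_collections Number of collections.
# TYPE rockset_collections gauge
rockset_collections{virtual_instance_id="30",workspace_name="commons",} 20.0
rockset_collections{virtual_instance_id="30",workspace_name="myWorkspace",} 2.0
# TYPE rockset_collection_size_bytes gauge
rockset_collection_size_bytes{virtual_instance_id="30",workspace_name="commons",collection_name="_events",} 3.74311622E8
...
"""

for name, labels, value in parse_samples(SAMPLE):
    print(name, labels.get("workspace_name"), value)
```

For production monitoring, an existing Prometheus-compatible client is likely a better choice than a hand-written parser.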
The endpoint currently uses three metric types: Counter, Gauge and Histogram.
Some metric types (e.g. Histogram) are represented through a set of sub-items. For example, the rockset_query_latency_seconds metric (a Histogram) would be represented by several rockset_query_latency_seconds_bucket records along with a rockset_query_latency_seconds_sum. Most monitoring clients will handle these complex types automatically on your behalf.
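To make the bucket records less abstract, here is a rough sketch of how a percentile can be estimated from them by linear interpolation, similar in spirit to what Prometheus-style tools do. It assumes each bucket record carries a cumulative count and an upper bound in its `le` label, as is conventional for the Prometheus format. The function name `estimate_quantile` and the bucket values are made up for illustration.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs, one per
    *_bucket record; the largest bound is expected to be +Inf.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations recorded
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Nothing to interpolate within; fall back to the lower edge.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical latency buckets in seconds: (le, cumulative count).
example_buckets = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
print("P50 estimate:", estimate_quantile(0.50, example_buckets))
print("P90 estimate:", estimate_quantile(0.90, example_buckets))
```

Because the estimate depends on where the bucket boundaries fall, it is an approximation rather than an exact percentile.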
The following metrics are provided and updated at one-minute intervals:
Organization Metrics
Metric | Type | Description |
---|---|---|
rockset_metrics_updated_at | Gauge | Time Rockset scraped these values, in seconds since 1970. |
rockset_hot_storage_limit_bytes | Gauge | The hot storage size limit for your entire organization. Ingest will be disabled if you hit this limit. |
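Since values are refreshed at one-minute intervals, rockset_metrics_updated_at can be used to detect stale data before acting on other metrics. The sketch below shows one possible check; the helper names and the 180-second threshold (roughly three missed updates) are assumptions chosen for illustration, not recommended values.

```python
import time


def metrics_age_seconds(updated_at, now=None):
    """Seconds elapsed since the rockset_metrics_updated_at timestamp."""
    now = time.time() if now is None else now
    # Clamp at zero in case of small clock differences between machines.
    return max(0.0, now - updated_at)


def is_stale(updated_at, max_age_seconds=180.0, now=None):
    """True if the metrics are older than the (assumed) staleness threshold."""
    return metrics_age_seconds(updated_at, now) > max_age_seconds


# Hypothetical values: metrics scraped at t=1000, checked at t=1060 and t=1300.
print(is_stale(1000.0, now=1060.0))  # one minute old
print(is_stale(1000.0, now=1300.0))  # five minutes old
```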
Virtual Instance Metrics
Metric | Type | Description |
---|---|---|
rockset_leaf_cpu_utilization_percentage | Gauge | Average CPU utilization across the leaves in a Virtual Instance. Leaf nodes store and ingest data. Leaf CPU utilization reflects both data ingestion and query processing. |
rockset_leaf_memory_utilization_percentage | Gauge | Average memory utilization across the leaves in a Virtual Instance. Leaf nodes store and ingest data. Leaf memory utilization reflects both data ingestion and query processing. |
rockset_leaf_block_cache_utilization_percentage | Gauge | Percentage of total memory on the Virtual Instance that the block cache is using. The block cache is where Rockset caches data for reads. |
rockset_leaf_block_cache_allocation_percentage | Gauge | The block cache can use up to this percentage of total memory of the Virtual Instance. The block cache is where Rockset caches data for reads. |
rockset_leaf_block_cache_hit_percentage | Gauge | The hit rate measures how often the queried data is found in the block cache. This number is block cache hits / (block cache hits + block cache misses). |
rockset_leaf_memtable_utilization_percentage | Gauge | Percentage of total memory on the Virtual Instance that the memtable is using. The memtable is an in-memory data structure that stores recently updated data before flushing it to the on-disk storage (SST). We call this the ingest buffer or tailing buffer in the console. |
rockset_leaf_memtable_allocation_percentage | Gauge | The memtable can use up to this percentage of total memory of the Virtual Instance. The memtable is an in-memory data structure that stores recently updated data before flushing it to the on-disk storage (SST). We call this the ingest buffer or tailing buffer in the console. |
rockset_leaf_tailing_stopped_timestamp_seconds | Gauge | This value will show the timestamp of when tailing stopped on your Virtual Instance. If tailing is active, this value is 0. Tailing stops when you exceed the memory limit of your memtable. Periodically the VI will restart to try to recover to a stable state so you may see tailing resume temporarily. However, if the Virtual Instance continues to have insufficient memory, tailing will stop again. |
rockset_leaf_tailing_latency_seconds | Histogram | The duration (in seconds) from when new or updated data is processed across all collections by the Ingest Virtual Instance until the data is updated for the Query Virtual Instance or for the Ingest Virtual Instance's internal replica. |
rockset_leaf_total_tailing_bytes | Histogram | The number of bytes tailed across all collections from the Ingest Virtual Instance to the Query Virtual Instance or for the Ingest Virtual Instance's internal replica. |
rockset_leaf_cpu_attribution_query_milliseconds | Counter | Counter of leaf CPU milliseconds that can be attributed to queries. To calculate the percentage of CPU in this category, divide the rate of this metric by the rate of rockset_leaf_cpu_attribution_total_milliseconds. |
rockset_leaf_cpu_attribution_ingest_milliseconds | Counter | Counter of leaf CPU milliseconds that can be attributed to ingest. To calculate the percentage of CPU in this category, divide the rate of this metric by the rate of rockset_leaf_cpu_attribution_total_milliseconds. |
rockset_leaf_cpu_attribution_tailing_milliseconds | Counter | Counter of leaf CPU milliseconds that can be attributed to tailing. To calculate the percentage of CPU in this category, divide the rate of this metric by the rate of rockset_leaf_cpu_attribution_total_milliseconds. |
rockset_leaf_cpu_attribution_other_milliseconds | Counter | Counter of leaf CPU milliseconds that could not be attributed to one of the other categories. To calculate the percentage of CPU in this category, divide the rate of this metric by the rate of rockset_leaf_cpu_attribution_total_milliseconds. |
rockset_leaf_cpu_attribution_total_milliseconds | Counter | Counter of total leaf CPU milliseconds between all categories. |
Virtual Instance metrics are useful for monitoring compute usage and alerting when your VI is near the limits of its performance. Query performance and ingest latency may both degrade as these metrics near 100%.
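The CPU attribution calculation described in the table can be sketched from two scrapes of the counters. Because both rates cover the same interval, the ratio of rates equals the ratio of the counter increases. The function name `cpu_share_percent` and the sample numbers are hypothetical.

```python
def cpu_share_percent(category_start, category_end, total_start, total_end):
    """Percentage of leaf CPU attributed to one category between two scrapes.

    The category arguments come from one of the
    rockset_leaf_cpu_attribution_*_milliseconds counters, the total arguments
    from rockset_leaf_cpu_attribution_total_milliseconds, read at the same two
    points in time.
    """
    category_delta = category_end - category_start
    total_delta = total_end - total_start
    if total_delta <= 0 or category_delta < 0:
        # No CPU time recorded, or a counter went backwards between scrapes.
        return None
    return 100.0 * category_delta / total_delta


# Hypothetical counter readings taken one scrape apart.
share = cpu_share_percent(
    category_start=120_000, category_end=150_000,  # query milliseconds
    total_start=400_000, total_end=500_000,        # total milliseconds
)
print(f"Query share of leaf CPU: {share:.1f}%")
```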
Collection Metrics
Metric | Type | Description |
---|---|---|
rockset_collections | Gauge | Number of collections. |
rockset_collection_size_bytes | Gauge | Collection size in bytes. Note that this size reflects the current storage size and will decrease as documents expire via specified retention duration or are deleted. |
rockset_collection_documents | Gauge | Number of documents currently in each collection. |
rockset_collection_total_ingest_bytes | Counter | Number of bytes ingested over the history of each collection. Note that this count only ever increases and is therefore well suited for increase and rate functions to compute ingest over time. |
rockset_collection_parse_errors | Counter | Number of parse errors for each collection. |
rockset_collection_data_discovery_latency | Histogram | The duration (in seconds) from when new or updated data appears in a data source until Rockset first detects it. Elevated values for this metric often reflect configuration issues in the underlying data source (e.g. an inadequate number of RCUs provisioned for DynamoDB sources). |
rockset_collection_data_process_latency | Histogram | The duration (in seconds) from when new or updated data is first detected by Rockset until the data is fully processed and query-able. Elevated values for this metric can be alleviated by allocating additional compute to your Virtual Instance. |
rockset_collection_memtable_utilization_percentage | Gauge | Percentage of total memory on the Virtual Instance that the memtable is using to tail this collection. The memtable is an in-memory data structure that stores recently updated data before flushing it to the on-disk storage (SST). We call this the ingest buffer or tailing buffer in the console. |
rockset_data_discovery_latency | Histogram | Data discovery latency across all collections. Unlike the collection-specific metric, this metric continues to include data from deleted collections. |
rockset_data_process_latency | Histogram | Data process latency across all collections. Unlike the collection-specific metric, this metric continues to include data from deleted collections. |
rockset_leaf_collection_tailing_latency_seconds | Histogram | The duration (in seconds) from when new or updated data is processed for this collection by the Ingest Virtual Instance until the data is updated for the Query Virtual Instance or for the Ingest Virtual Instance's internal replica. |
rockset_leaf_collection_total_tailing_bytes | Histogram | The number of bytes tailed for this collection from the Ingest Virtual Instance to the Query Virtual Instance or for the Ingest Virtual Instance's internal replica. |
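As a sketch of the increase and rate calculations mentioned for rockset_collection_total_ingest_bytes, the code below computes ingest throughput from a series of scraped values. The function names and sample data are illustrative. Treating a drop in value as a counter reset is a defensive assumption borrowed from how Prometheus-style tools usually handle counters, not documented Rockset behavior.

```python
def counter_increase(samples):
    """Total increase of a counter over a list of (timestamp, value) samples.

    If a value is lower than the previous one, the counter is assumed to have
    restarted from zero, so the new value is counted as fresh growth.
    """
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    return increase


def bytes_per_second(samples):
    """Average ingest rate over the sampled window, or None if undefined."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return counter_increase(samples) / elapsed


# Hypothetical scrapes one minute apart: (unix_seconds, total_ingest_bytes).
scrapes = [(0, 1_000), (60, 4_000), (120, 7_000)]
print("Ingested bytes:", counter_increase(scrapes))
print("Bytes per second:", bytes_per_second(scrapes))
```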
Query Metrics
Metric | Type | Description |
---|---|---|
rockset_queries | Counter | Cumulative count of queries run on this Virtual Instance. |
rockset_query_latency_seconds | Histogram | Query latency, including admission control duration. Note that this metric is exposed as a histogram — you can compute any PXX that you'd like with an accuracy of +/- ~15% in almost all cases. |
rockset_query_admission_latency_seconds | Histogram | Admission control queue duration per query if admission control is enabled for your account. |
rockset_query_queue_size | Gauge | Number of queries currently queued (throttled by admission control). |
rockset_query_errors | Counter | Number of query execution errors, labeled by HTTP error code (e.g. 404, 500). |
rockset_query_lambda_queries | Counter | Number of queries by Query Lambda. Note that the tag label is tracked if and only if the execution is specified by tag. |
rockset_query_lambda_latency_seconds | Histogram | Query latency by Query Lambda. Note that the tag label is tracked if and only if the execution is specified by tag. |
rockset_query_lambda_admission_latency_seconds | Histogram | Query admission latency by Query Lambda. Note that the tag label is tracked if and only if the execution is specified by tag. |
rockset_query_lambda_errors | Counter | Number of query execution errors by Query Lambda. Note that the tag label is tracked if and only if the execution is specified by tag. |
rockset_running_queries | Gauge | Number of queries that are currently running on the Virtual Instance. |
Reference Configurations & Templates
Reference configurations and templates are available for Prometheus, Datadog, Grafana and Alertmanager.
Below is an example of a Prometheus scrape_configs entry:
- job_name: Rockset Metrics API
  scrape_interval: 1m
  scrape_timeout: 1m
  honor_timestamps: true
  static_configs:
    - targets:
        - api.usw2a1.rockset.com
  scheme: https
  basic_auth:
    username: <API Key>
    password:
  metrics_path: /v1/orgs/self/metrics
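For a custom integration, the same request can be built in code. The sketch below mirrors the settings in the example above: HTTPS, the /v1/orgs/self/metrics path, and basic auth with the API key as username and an empty password. The function names and the placeholder key are assumptions for illustration; only the standard library is used.

```python
import base64
import urllib.request


def build_metrics_request(server, api_key):
    """Build an HTTPS request for the metrics endpoint using basic auth."""
    url = f"https://{server}/v1/orgs/self/metrics"
    # Basic auth credentials: API key as username, empty password.
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def fetch_metrics(server, api_key, timeout=60):
    """Fetch the raw metrics text; raises urllib.error.URLError on failure."""
    request = build_metrics_request(server, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Placeholder key for illustration; no network call is made here.
request = build_metrics_request("api.usw2a1.rockset.com", "my-api-key")
print(request.full_url)
```

Calling `fetch_metrics` with a real server and API key would return the text shown in the earlier curl example.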