Backup and Restore Monitoring


CockroachDB includes metrics to monitor backup, restore, and scheduled backup jobs. You can use monitoring integrations to alert when there are anomalies, such as backups that have failed or restore jobs encountering a retryable error.

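Depending on whether you are using a CockroachDB Dedicated or CockroachDB Self-Hosted cluster, you can monitor backup and restore metrics for your cluster through the Prometheus endpoint (CockroachDB Dedicated and CockroachDB Self-Hosted) or the Datadog integration (CockroachDB Dedicated).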

We recommend setting up monitoring to alert when anomalies occur. You can then use SQL statements to inspect details relating to schedules, jobs, and backups.

For detail on managed backups that Cockroach Labs stores for your CockroachDB Cloud cluster, see the Backups page in the Cloud Console.

Note:

Metrics are reported per node, so you must retrieve metrics from every node in the cluster. For example, to monitor whether a scheduled backup fails, track schedules_backup_failed on each node.
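Because each node reports its own values, a cluster-wide view needs an aggregation step. The following is a minimal sketch of that step, assuming you have already collected one metric-name-to-value mapping per node. The node names, sample values, and helper functions are hypothetical and are not part of CockroachDB.

```python
from typing import List, Mapping


def total_across_nodes(per_node: Mapping[str, Mapping[str, float]], metric: str) -> float:
    """Sum one metric over every node; a node that has not reported it counts as 0."""
    return sum(values.get(metric, 0.0) for values in per_node.values())


def nodes_reporting(per_node: Mapping[str, Mapping[str, float]], metric: str) -> List[str]:
    """Return the sorted names of nodes where the metric is greater than 0."""
    return sorted(node for node, values in per_node.items() if values.get(metric, 0.0) > 0)


# Hypothetical samples, one mapping per node in the cluster.
samples = {
    "node1": {"schedules_backup_failed": 0.0, "schedules_backup_succeeded": 12.0},
    "node2": {"schedules_backup_failed": 2.0, "schedules_backup_succeeded": 3.0},
    "node3": {"schedules_backup_succeeded": 7.0},
}

print(total_across_nodes(samples, "schedules_backup_failed"))  # 2.0
print(nodes_reporting(samples, "schedules_backup_failed"))     # ['node2']
```

Checking only one node in this example (node1 or node3) would miss both failures, which is why every node has to be included.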

Prometheus endpoint

You can access the Prometheus endpoint (http://<host>:<http-port>/_status/vars) for backup and restore metrics with CockroachDB Dedicated or CockroachDB Self-Hosted clusters.

See the Monitor CockroachDB with Prometheus tutorial for guidance on installing and setting up Prometheus and Alertmanager to track metrics.
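For a quick check without a full Prometheus setup, the endpoint can also be read directly. The sketch below assumes the endpoint returns the standard Prometheus text exposition format and is reachable over plain HTTP (a secure cluster would likely need HTTPS and credentials). The function names, the prefix list, and the sample text are illustrative assumptions. Names are lowercased before matching so that any difference in capitalization between the endpoint output and this page does not matter.

```python
import urllib.request
from typing import Dict

# Prefixes of the backup and restore metrics listed on this page.
PREFIXES = ("schedules_backup_", "schedules_round_", "jobs_backup_", "jobs_restore_")


def parse_backup_metrics(text: str) -> Dict[str, float]:
    """Extract backup/restore samples from Prometheus text-format output."""
    metrics: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, rest = line.partition("{")
        if rest:
            # Labeled sample: the value follows the closing brace.
            _, _, rest = rest.rpartition("}")
        else:
            name, _, rest = line.partition(" ")
        name = name.lower()
        fields = rest.split()
        if not name.startswith(PREFIXES) or not fields:
            continue
        try:
            value = float(fields[0])
        except ValueError:
            continue  # ignore malformed samples
        # Assumption: series that differ only by label are summed.
        metrics[name] = metrics.get(name, 0.0) + value
    return metrics


def fetch_backup_metrics(host: str, http_port: int, timeout: float = 5.0) -> Dict[str, float]:
    """Read http://<host>:<http-port>/_status/vars from one node."""
    url = f"http://{host}:{http_port}/_status/vars"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_backup_metrics(resp.read().decode("utf-8"))


# Hypothetical endpoint output used to exercise the parser offline.
SAMPLE = """\
# TYPE schedules_backup_failed counter
schedules_backup_failed 1
schedules_backup_succeeded{node_id="1"} 41
jobs_restore_currently_running 0
sql_conns 7
"""

print(parse_backup_metrics(SAMPLE))
```

Calling fetch_backup_metrics once per node gives the per-node mappings that a cluster-wide check needs.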

Available metrics

We recommend the following guidelines:

  • Use the schedules_backup_last_completed_time metric to monitor the specific backup job or jobs you would use to recover from a disaster.
  • Configure alerting on the schedules_backup_last_completed_time metric to watch for cases where the timestamp has not moved forward as expected.

| Metric | Description |
|---|---|
| schedules_backup_succeeded | The number of scheduled backup jobs that have succeeded. |
| schedules_backup_started | The number of scheduled backup jobs that have started. |
| schedules_backup_last_completed_time | The Unix timestamp of the most recently completed backup by a schedule specified as maintaining this metric. Note: This metric only updates if the schedule was created with the updates_cluster_last_backup_time_metric option. |
| schedules_backup_failed | The number of scheduled backup jobs that have failed. Note: A stuck scheduled job will not increment this metric. |
| schedules_round_reschedule_wait | The number of schedules that were rescheduled due to a currently running job. A value greater than 0 indicates that a previous backup was still running when a new scheduled backup was supposed to start. This corresponds to the on_previous_running=wait schedule option. |
| schedules_round_reschedule_skip | The number of schedules that were skipped due to a currently running job. A value greater than 0 indicates that a previous backup was still running when a new scheduled backup was supposed to start. This corresponds to the on_previous_running=skip schedule option. |
| jobs_backup_currently_running | The number of backup jobs currently running in Resume or OnFailOrCancel state. |
| jobs_backup_fail_or_cancel_retry_error | The number of backup jobs that failed with a retryable error on their failure or cancelation process. |
| jobs_backup_fail_or_cancel_completed | The number of backup jobs that successfully completed their failure or cancelation process. |
| jobs_backup_fail_or_cancel_failed | The number of backup jobs that failed with a non-retryable error on their failure or cancelation process. |
| jobs_backup_resume_failed | The number of backup jobs that failed with a non-retryable error. |
| jobs_backup_resume_retry_error | The number of backup jobs that failed with a retryable error. |
| jobs_restore_resume_retry_error | The number of restore jobs that failed with a retryable error. |
| jobs_restore_resume_completed | The number of restore jobs that successfully resumed to completion. |
| jobs_restore_resume_failed | The number of restore jobs that failed with a non-retryable error. |
| jobs_restore_fail_or_cancel_failed | The number of restore jobs that failed with a non-retryable error on their failure or cancelation process. |
| jobs_restore_fail_or_cancel_retry_error | The number of restore jobs that failed with a retryable error on their failure or cancelation process. |
| jobs_restore_currently_running | The number of restore jobs currently running in Resume or OnFailOrCancel state. |
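The guideline about alerting when schedules_backup_last_completed_time has not moved forward can be expressed as a staleness check. This is a minimal sketch rather than a recommended alerting rule: the interval, the grace period, the per-node values, and the function names are hypothetical, and treating a value of 0 as "no backup has completed yet" is an assumption.

```python
import time
from typing import Mapping, Optional


def newest_completion(per_node: Mapping[str, float]) -> float:
    """Return the latest timestamp reported by any node (0.0 if there are none)."""
    return max(per_node.values(), default=0.0)


def backup_is_stale(
    last_completed_unix: float,
    expected_interval_s: float,
    grace_s: float = 0.0,
    now: Optional[float] = None,
) -> bool:
    """True if the last completed backup is older than interval plus grace period."""
    if now is None:
        now = time.time()
    if last_completed_unix <= 0:
        # Assumption: 0 means the metric has never been updated.
        return True
    return now - last_completed_unix > expected_interval_s + grace_s


# Hypothetical values: a daily schedule with a one-hour grace period.
reported = {"node1": 0.0, "node2": 1_700_000_000.0, "node3": 0.0}
last = newest_completion(reported)

print(backup_is_stale(last, 86_400, 3_600, now=1_700_050_000))  # False
print(backup_is_stale(last, 86_400, 3_600, now=1_700_100_000))  # True
```

Because a stuck scheduled job does not increment schedules_backup_failed, a timestamp-based check like this can catch cases that a failure counter alone would miss.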

Datadog integration

To use the Datadog integration with your CockroachDB Dedicated cluster, you can:

Available metrics in Datadog

| Metric | Description |
|---|---|
| schedules_backup_succeeded | The number of scheduled backup jobs that have succeeded. |
| schedules_backup_started | The number of scheduled backup jobs that have started. |
| schedules_backup_last_completed_time | The Unix timestamp of the most recently completed backup by a schedule specified as maintaining this metric. |
| schedules_backup_failed | The number of scheduled backup jobs that have failed. |
