Ceph cluster maintenance and related operations


The Platform uses the rook-operator to deploy and manage the openshift-storage Ceph cluster. This document provides best-practice recommendations for maintaining the Ceph cluster and related operations to ensure system stability and data integrity.
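
Before starting any of the operations below, you can verify that the Ceph cluster managed by the rook-operator is healthy. A minimal sketch, assuming the default openshift-storage namespace and the standard rook labels (resource and label names may differ in your installation):

Check the CephCluster resource and the rook-operator pod
oc get cephcluster -n openshift-storage
# label may differ between rook versions
oc get pods -n openshift-storage -l app=rook-ceph-operator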

1. Manual deep scrubbing

Disable automatic deep scrubbing after deploying the Platform to avoid performance impact during peak hours. However, periodically running deep scrubbing is critical for verifying object integrity. Learn how to re-enable deep scrubbing, trigger it manually, and run it efficiently.

Run deep scrubbing manually at least once every few weeks, depending on your SLA and data criticality.

Schedule it during low-load periods — e.g., weeknights or weekends. No formal maintenance window is required if there’s no impact on critical services.

📌 Avoid a PG backlog, where Placement Groups go without deep scrubbing for too long. In large clusters, run deep scrubbing daily with limited concurrency (osd_max_scrubs); smaller clusters may only need weekly runs.
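
Before scheduling manual runs, it can help to see how scrubbing is currently configured. A minimal check, assuming the same Ceph config path used throughout this document:

Check current scrub-related settings
# Number of pools that still have the nodeep-scrub flag set
ceph --conf=/var/lib/rook/openshift-storage/openshift-storage.config osd pool ls detail | grep -c nodeep-scrub
# Current scrub concurrency limit
ceph config --conf=/var/lib/rook/openshift-storage/openshift-storage.config get osd osd_max_scrubs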

1.1. Enabling deep scrubbing

To start deep scrubbing manually, first remove the nodeep-scrub flag from all pools and then run the Ceph CLI commands below from the rook-operator pod in the openshift-storage project.
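
A minimal sketch of opening a shell in the operator pod, assuming the default rook-ceph-operator deployment name (it may differ in your installation); all ceph commands in this document can then be run from that shell.

Open a shell in the rook-operator pod
# deployment name assumed; adjust if it differs
oc exec -it deploy/rook-ceph-operator -n openshift-storage -- bash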

Remove nodeep-scrub flags from all pools
for pool in $(ceph --conf=/var/lib/rook/openshift-storage/openshift-storage.config osd pool ls); do
  ceph --conf=/var/lib/rook/openshift-storage/openshift-storage.config osd pool set "$pool" nodeep-scrub false
done
Verify that flags have been removed
ceph --conf=/var/lib/rook/openshift-storage/openshift-storage.config osd pool ls detail
Run deep scrubbing on all OSDs (recommended)
ceph osd --conf=/var/lib/rook/openshift-storage/openshift-storage.config deep-scrub all
Alternative methods for initiating deep scrubbing (per OSD or per pool)
# For a specific OSD
ceph osd --conf=/var/lib/rook/openshift-storage/openshift-storage.config deep-scrub 0

# For all pools
for pool in $(ceph --conf=/var/lib/rook/openshift-storage/openshift-storage.config osd pool ls); do
  ceph --conf=/var/lib/rook/openshift-storage/openshift-storage.config osd pool deep-scrub "$pool"
done
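
To quickly confirm that deep scrubbing has actually started, you can count the PGs that are currently in the scrubbing+deep state; detailed monitoring options are described in section 2.

Count PGs currently being deep scrubbed
ceph pg --conf=/var/lib/rook/openshift-storage/openshift-storage.config dump pgs | grep -c 'scrubbing+deep'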

1.2. Speeding up deep scrubbing when backlog exists

If many PGs are waiting for deep scrubbing, you can speed up the process by:

  • Temporarily increasing the number of concurrent scrubbing processes, up to 4.

    OR

  • Extending the maintenance window and running deep scrubbing manually during weeknights.

Increase osd_max_scrubs to 4
ceph config --conf=/var/lib/rook/openshift-storage/openshift-storage.config set osd osd_max_scrubs 4

After the operation, revert osd_max_scrubs back to 1.
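
A sketch of the revert together with a check that the change took effect (the get command should report 1):

Revert osd_max_scrubs to 1 and verify
ceph config --conf=/var/lib/rook/openshift-storage/openshift-storage.config set osd osd_max_scrubs 1
ceph config --conf=/var/lib/rook/openshift-storage/openshift-storage.config get osd osd_max_scrubs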

➡️ See more in Concurrent scrubbing settings.


2. Monitoring scrubbing status

Use this section to monitor the health and performance of the scrubbing process. You’ll learn how to check cluster-wide scrubbing status, track deep scrubbing execution, and review Placement Group logs. These steps help you detect issues early and maintain data consistency.

2.1. Monitoring via Grafana Web Interface

Use the Grafana Dashboard — specifically the Ceph & PostgreSQL Deep Scrubbing Dashboard — to visualize scrubbing activities and OSD performance.

The dashboard shows:

  • Number of PGs undergoing deep scrubbing over time.

  • OSD read/write latency.

  • Current cluster health status.

Figure 1. Grafana Dashboard: Ceph deep scrubbing metrics

2.2. Monitoring via Ceph CLI

Use the following commands to track scrubbing progress from the command line:

Check overall cluster status
ceph -s --conf=/var/lib/rook/openshift-storage/openshift-storage.config
Check Placement Group (PG) stats
ceph --conf=/var/lib/rook/openshift-storage/openshift-storage.config pg stat
Find active scrubbing processes
ceph pg --conf=/var/lib/rook/openshift-storage/openshift-storage.config dump pgs | grep -E 'scrub|deep'
Query specific PG details
ceph pg 19.11 query --conf=/var/lib/rook/openshift-storage/openshift-storage.config
View timestamp of last deep scrub per PG
ceph --conf=/var/lib/rook/openshift-storage/openshift-storage.config pg dump pgs | awk '{print $1, $23}' | sort -k2 | column -t
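
If you only want PGs whose last deep scrub is older than a given number of days, the same dump can be filtered. A sketch that assumes the deep-scrub timestamp is field 23, as in the command above, and that GNU date is available in the pod:

List PGs not deep scrubbed in the last 7 days
# requires GNU date; DEEP_SCRUB_STAMP assumed to be field 23
cutoff=$(date -u -d '7 days ago' +%Y-%m-%d)
ceph --conf=/var/lib/rook/openshift-storage/openshift-storage.config pg dump pgs \
  | awk -v c="$cutoff" '$1 ~ /^[0-9]+\./ && $23 < c {print $1, $23}'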

3. Stopping deep scrubbing

Sometimes it’s necessary to stop or slow down deep scrubbing — for example, during peak usage or critical transactions. This section describes safe ways to reduce or completely halt the process.

3.1. Soft stop

The recommended method is to reduce the number of concurrent scrubbing processes (osd_max_scrubs) and allow ongoing scrubs to complete naturally. The minimum safe value is 1.

Reduce osd_max_scrubs to 1
ceph config --conf=/var/lib/rook/openshift-storage/openshift-storage.config set osd osd_max_scrubs 1

3.2. Force stop

In extreme cases — such as severe performance degradation or blocked I/O — you can stop deep scrubbing immediately by restarting OSDs with PGs in the scrubbing+deep state. Use this approach only when absolutely necessary.

  1. Identify PGs currently undergoing deep scrubbing:

    ceph pg --conf=/var/lib/rook/openshift-storage/openshift-storage.config dump pgs | grep -E 'scrub|deep'
    Example output

    19.11 554 0 0 0 0 72797490 0 0 854 854 active+clean+scrubbing+deep 2025-04-03T07:03:00.295292+0000 2326'7347 2326:18658 [0,2,1] 0 [0,2,1] 0 2263'7328 2025-04-03T07:03:00.295251+0000 2136'4471 2025-03-31T07:38:49.421946+0000 0

    Table 1. Field reference for ceph pg dump pgs output

    Field                            Description
    19.11                            PG ID: unique Placement Group identifier in pool_id.pg_num format.
    active+clean+scrubbing+deep      PG state: the PG is active, clean, and currently being deep scrubbed.
    [0,2,1]                          Acting set: the OSDs serving the PG; the first entry (OSD 0) is the primary.
    0                                Acting primary: the ID of the primary OSD for this PG.
    2025-03-31T07:38:49.421946+0000  deep_scrub_stamp: timestamp of the last completed deep scrub (field 23 of the output; the preceding 2025-04-03 timestamp is the last regular scrub).

  2. Delete the relevant OSD pod using OpenShift UI or CLI:

    oc delete pod rook-ceph-osd-0-example -n openshift-storage
Restarting the pod stops active scrubbing on that OSD and triggers PG recovery/rebalancing. Use this option only if no alternatives exist.
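
If you need to map an OSD ID from the acting set to its pod before deleting it, and to confirm that the cluster recovers afterwards, the following sketch relies only on the rook-ceph-osd-<id> pod naming shown in the example above:

Find the pod for OSD 0 and check cluster health after the restart
# pod name pattern taken from the example above
oc get pods -n openshift-storage | grep 'rook-ceph-osd-0-'
ceph -s --conf=/var/lib/rook/openshift-storage/openshift-storage.config

Wait until ceph -s reports the affected PGs as active+clean before performing any further maintenance.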