On 28/06/2023 21:26, Niklas Hambüchen wrote:
> I have increased the number of scrubs per OSD from 1 to 3 using `ceph config set osd osd_max_scrubs 3`. Now the problematic PG shows as scrubbing in `ceph pg ls`: active+clean+scrubbing+deep+inconsistent
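For reference, the sequence was roughly this (2.87 is the affected PG, taken from the logs below; exact invocations may differ slightly between releases):

# Re-request the deep scrub on the inconsistent PG:
ceph pg deep-scrub 2.87

# Allow more than one concurrent (deep-)scrub per OSD so the request is not starved:
ceph config set osd osd_max_scrubs 3

# Confirm the running OSD picked the new value up:
ceph tell osd.33 config get osd_max_scrubs

# Watch the PG state flip to active+clean+scrubbing+deep+inconsistent:
ceph pg ls | grep '2\.87'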
This succeeded! The deep-scrub fixed the PG and the cluster is healthy again. Thanks a lot!

So indeed the issue was that the deep-scrub I had asked for was simply never scheduled, because Ceph always picked some other scrub to do first on the relevant OSD. Increasing `osd_max_scrubs` beyond 1 made it possible to force the scrub to start.

I conclude that most of the information online, including the Ceph docs, does not give the correct advice when recommending `ceph pg repair`. Instead, the docs should make clear that a deep-scrub will fix such issues without involving `ceph pg repair`. I find the lack of documentation on this disturbing, because a disk failing and being replaced is an extremely common operation for a storage cluster.

Here are some relevant logs from the scrub recovery:

# grep '\b2\.87\b' /var/log/ceph/ceph-osd.33.log | grep deep
2023-05-16T16:33:58.398+0000 7f9a985e5640 0 log_channel(cluster) log [DBG] : 2.87 deep-scrub ok
2023-06-16T20:03:26.923+0000 7f9a985e5640 -1 log_channel(cluster) log [ERR] : 2.87 deep-scrub 0 missing, 1 inconsistent objects
2023-06-16T20:03:26.923+0000 7f9a985e5640 -1 log_channel(cluster) log [ERR] : 2.87 deep-scrub 1 errors
2023-06-26T05:06:17.412+0000 7f9b15bfe640 0 log_channel(cluster) log [INF] : osd.33 pg 2.87 Deep scrub errors, upgrading scrub to deep-scrub
2023-06-29T10:14:07.791+0000 7f9a985e5640 0 log_channel(cluster) log [DBG] : 2.87 deep-scrub ok

ceph.log:

2023-06-29T10:14:07.792432+0000 osd.33 (osd.33) 938 : cluster [DBG] 2.87 deep-scrub ok
2023-06-29T10:14:09.311257+0000 mgr.node-5 (mgr.2454216) 385434 : cluster [DBG] pgmap v385836: 832 pgs: 1 active+clean+scrubbing, 17 active+clean+scrubbing+deep, 814 active+clean; 68 TiB data, 210 TiB used, 229 TiB / 439 TiB avail; 80 MiB/s rd, 40 MiB/s wr, 45 op/s
2023-06-29T10:14:09.427733+0000 mon.node-4 (mon.0) 20923054 : cluster [INF] Health check cleared: OSD_SCRUB_ERRORS (was: 1 scrub errors)
2023-06-29T10:14:09.427758+0000 mon.node-4 (mon.0) 20923055 : cluster [INF] Health check cleared: PG_DAMAGED (was: Possible data damage: 1 pg inconsistent)
2023-06-29T10:14:09.427786+0000 mon.node-4 (mon.0) 20923056 : cluster [INF] Cluster is now healthy

From this, it seems bad that Ceph did not manage to schedule the cluster-fixing scrub within 7 days of the faulty disk being replaced, nor did it manage to schedule a human-requested deep-scrub within 2 days.

What mechanism in Ceph decides the scheduling of scrubs? I see the config value `osd_requested_scrub_priority`, which is "the priority set for user requested scrub on the work queue", but I cannot tell whether this also affects when a scrub gets started, or only the priority of its IO operations vs. e.g. client operations once the scrub is already running.
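In case it helps whoever knows the scheduling code: these are the scrub-related settings I have been looking at so far, though I cannot tell which of them actually gate when a scrub starts versus how it behaves once running (osd.33 is the OSD from the logs above):

# Scrub-related settings the OSD is actually running with:
ceph config show osd.33 | grep scrub

# In particular, the ones that look like they could delay the start of a scrub:
#   osd_max_scrubs, osd_scrub_begin_hour / osd_scrub_end_hour,
#   osd_scrub_load_threshold, osd_scrub_during_recovery,
#   osd_scrub_min_interval / osd_scrub_max_interval / osd_deep_scrub_interval,
#   osd_requested_scrub_priority

# Last scrub / deep-scrub timestamps for the affected PG:
ceph pg dump pgs | grep '^2\.87'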