pg repair or pg deep-scrub does not start

"Marcel Kuiper" <ceph@xxxxxxxx> · Tue, 2 Feb 2021 14:13:54 +0100

Hi

I've got an old cluster running Ceph 10.2.11 with the filestore backend. Last
week a PG was reported inconsistent with a scrub error:

# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 38.20 is active+clean+inconsistent, acting [1778,1640,1379]
1 scrub errors

I first tried 'ceph pg repair' but nothing seemed to happen, then

# rados list-inconsistent-obj 38.20 --format=json-pretty

showed that the problem was on osd 1379. The logs showed that this osd had
read errors, so I decided to mark it out for replacement. Later on I
removed it from the crush map and deleted the osd. My thinking was that
the missing replica would get backfilled onto another osd and everything
would be ok again. The PG did get another osd assigned, but the health error
stayed:

# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 38.20 is active+clean+inconsistent, acting [1778,1640,1384]
1 scrub errors

Now I get an error on:

# rados list-inconsistent-obj 38.20 --format=json-pretty
No scrub information available for pg 38.20
error 2: (2) No such file or directory

And if I try

# ceph pg deep-scrub 38.20
instructing pg 38.20 on osd.1778 to deep-scrub

The deep-scrub does not get scheduled. The same goes for running the
following on the storage node:

# ceph daemon osd.1778 trigger_scrub 38.20

Nothing appears in the logs concerning the scrubbing of PG 38.20. I do see
in the log that other PGs get (deep) scrubbed according to the automatic
scheduling.
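
For anyone wanting to repeat the log check, a minimal sketch of filtering a
log for the scrub messages of a single PG follows. It assumes the usual
"<pgid> scrub starts" / "<pgid> deep-scrub ok" wording of the messages and
the default log location; pg_scrub_lines is a hypothetical helper name, not a
ceph command.

```shell
# Hypothetical helper: print log lines that mention a scrub or deep-scrub of one PG.
# Usage: pg_scrub_lines PGID [LOGFILE...]   (reads stdin when no file is given)
pg_scrub_lines() {
    pgid=$1
    shift
    # Escape the dot so that "38.20" is matched literally by the regex.
    pattern=$(printf '%s' "$pgid" | sed 's/\./\\./g')
    # Require a non-digit, non-dot character (or line start) before the PG id,
    # so that "138.20" is not reported as a match for "38.20".
    grep -E "(^|[^0-9.])${pattern} (deep-)?scrub" "$@"
}

# Example (the log path is an assumption; adjust to the local setup):
# pg_scrub_lines 38.20 /var/log/ceph/ceph-osd.1778.log
```

An empty result (grep exit status 1) corresponds to what is described above:
no scrub of that PG was logged.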

There is no recovery going on, but just to be sure I set

# ceph daemon osd.1778 config set osd_scrub_during_recovery true

Also, the scrub load limit is set way higher than the actual system load.
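
The load comparison can be sketched roughly as below. This assumes that the
relevant setting is osd_scrub_load_threshold and that it is compared against
the 1-minute load average; load_below_threshold is a hypothetical helper name.

```shell
# Hypothetical helper: exit 0 when LOAD is below THRESHOLD, non-zero otherwise.
# awk is used because the shell itself cannot compare fractional numbers.
load_below_threshold() {
    awk -v load="$1" -v thr="$2" 'BEGIN { exit !(load + 0 < thr + 0) }'
}

# Example on the storage node. The admin socket prints a small JSON object,
# so everything except digits and dots is stripped to get the number
# (an assumption about the output shape; check it by hand first):
# thr=$(ceph daemon osd.1778 config get osd_scrub_load_threshold | tr -dc '0-9.')
# load=$(cut -d' ' -f1 /proc/loadavg)
# load_below_threshold "$load" "$thr" && echo "load is below the scrub threshold"
```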

I checked the other osds and there are no scrubs running on them when I
schedule the deep-scrub.

I found some reports from people who had the same problem, but no
solution was found (for example https://tracker.ceph.com/issues/15781).
Similar cases were reported even in luminous and mimic.

- Does anyone know which logging I should increase in order to get more
information as to why my deep-scrub does not get scheduled?
- Is there a way in jewel to see the list of scheduled scrubs and their
dates for an osd?
- Does someone have advice on how to proceed in clearing this PG error?
Thanks for any help

Marcel

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx