Hi All,
I have a single PG that keeps going inconsistent; a scrub is enough to
pick up the problem. The primary OSD's log shows something like:
2021-09-22 18:08:18.502 7f5bdcb11700 0 log_channel(cluster) log [DBG] :
1.3ff scrub starts
2021-09-22 18:08:18.880 7f5bdcb11700 -1 log_channel(cluster) log [ERR] :
1.3ff scrub : stat mismatch, got 3243/3244 objects, 67/67 clones,
3243/3244 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1/1
whiteouts, 13247338496/13251516416 bytes, 0/0 manifest objects, 0/0
hit_set_archive bytes.
2021-09-22 18:08:18.880 7f5bdcb11700 -1 log_channel(cluster) log [ERR] :
1.3ff scrub 1 errors
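For completeness, there's nothing exotic in how I trigger or inspect
this; from memory it's roughly:

  ceph pg scrub 1.3ff                # kick off the scrub by hand
  ceph health detail                 # flags the PG as inconsistent
  rados list-inconsistent-obj 1.3ff  # I'd expect this to come back
                                     # empty, since it's a PG-level stat
                                     # mismatch rather than a damaged
                                     # object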
It always repairs cleanly when I run ceph pg repair 1.3ff:
2021-09-22 18:08:47.533 7f5bdcb11700 0 log_channel(cluster) log [DBG] :
1.3ff repair starts
2021-09-22 18:15:58.218 7f5bdcb11700 -1 log_channel(cluster) log [ERR] :
1.3ff repair : stat mismatch, got 3243/3244 objects, 67/67 clones,
3243/3244 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1/1
whiteouts, 13247338496/13251516416 bytes, 0/0 manifest objects, 0/0
hit_set_archive bytes.
2021-09-22 18:15:58.218 7f5bdcb11700 -1 log_channel(cluster) log [ERR] :
1.3ff repair 1 errors, 1 fixed
It's happened multiple times, always with the same PG; no other PG is
doing this. It's a Nautilus v14.2.5 cluster using spinning disks with
separate DB/WAL on SSDs. I don't believe there's an underlying hardware
problem, but to rule that out I reweighted the primary OSD for this PG
to 0 so the data would move to another disk. The backfill has
completed, but on manually scrubbing the PG again it showed the same
inconsistency as above.
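For reference, the move went roughly like this (osd.NN below is a
placeholder for the real primary's id, which I don't have to hand):

  ceph pg map 1.3ff             # the first OSD in the up/acting set
                                # is the primary
  ceph osd reweight osd.NN 0    # (or ceph osd crush reweight osd.NN 0)
                                # to drain it so the PG backfills
                                # onto another disk
  ceph pg scrub 1.3ff           # re-scrub once backfill completes;
                                # it still came back inconsistent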
In case it's relevant, the only major activity I've performed recently
has been gradually adding new OSD nodes and disks to the cluster; prior
to this it had been up without issue for well over a year. When this
issue first presented, the primary OSD for this PG was the first of the
new OSDs I had added. The inconsistencies didn't start immediately
after adding it, though; they began some weeks later.
Any suggestions as to how I can get rid of this problem?
Should I try reweighting the other two OSDs for this PG to 0?
Or is this a known bug that needs a specific workaround, or simply an upgrade?
Thanks,
Simon.