Hi All,
I have a single PG that keeps going inconsistent; a scrub is enough to
pick up the problem. The primary OSD's log shows something like:
2021-09-22 18:08:18.502 7f5bdcb11700 0 log_channel(cluster) log [DBG] :
1.3ff scrub starts
2021-09-22 18:08:18.880 7f5bdcb11700 -1 log_channel(cluster) log [ERR] :
1.3ff scrub : stat mismatch, got 3243/3244 objects, 67/67 clones,
3243/3244 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1/1
whiteouts, 13247338496/13251516416 bytes, 0/0 manifest objects, 0/0
hit_set_archive bytes.
2021-09-22 18:08:18.880 7f5bdcb11700 -1 log_channel(cluster) log [ERR] :
1.3ff scrub 1 errors
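For completeness, there's nothing exotic in how I trigger or inspect
this; from memory it's roughly:

  ceph pg scrub 1.3ff                # kick off the scrub by hand
  ceph health detail                 # flags the PG as inconsistent
  rados list-inconsistent-obj 1.3ff  # I'd expect this to come back
                                     # empty, since it's a PG-level stat
                                     # mismatch rather than a damaged
                                     # object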
It always repairs cleanly when I run ceph pg repair 1.3ff:
2021-09-22 18:08:47.533 7f5bdcb11700 0 log_channel(cluster) log [DBG] :
1.3ff repair starts
2021-09-22 18:15:58.218 7f5bdcb11700 -1 log_channel(cluster) log [ERR] :
1.3ff repair : stat mismatch, got 3243/3244 objects, 67/67 clones,
3243/3244 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1/1
whiteouts, 13247338496/13251516416 bytes, 0/0 manifest objects, 0/0
hit_set_archive bytes.
2021-09-22 18:15:58.218 7f5bdcb11700 -1 log_channel(cluster) log [ERR] :
1.3ff repair 1 errors, 1 fixed
It's happened multiple times, always with the same PG; no other PG is
doing this. It's a Nautilus v14.2.5 cluster using spinning disks with
separate DB/WAL on SSDs. I don't believe there's an underlying hardware
problem, but to rule that out I reweighted the primary OSD for this PG
to 0 so the data would move to another disk. The backfill has
completed, but on manually scrubbing the PG again it showed the same
inconsistency as above.
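For reference, the move went roughly like this (osd.NN below is a
placeholder for the real primary's id, which I don't have to hand):

  ceph pg map 1.3ff             # the first OSD in the up/acting set
                                # is the primary
  ceph osd reweight osd.NN 0    # (or ceph osd crush reweight osd.NN 0)
                                # to drain it so the PG backfills
                                # onto another disk
  ceph pg scrub 1.3ff           # re-scrub once backfill completes;
                                # it still came back inconsistent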
In case it's relevant, the only major activity I've performed recently
has been gradually adding new OSD nodes and disks to the cluster; prior
to this it had been up without issue for well over a year. When this
issue first presented, the primary OSD for this PG was the first of the
new OSDs I had added. The inconsistencies didn't start immediately
after adding it, though; they began some weeks later.
Any suggestions as to how I can get rid of this problem?
Should I try reweighting the other two OSDs for this PG to 0?
Or is this a known bug that needs a specific workaround, or simply an upgrade?
Thanks,
Simon.