One PG keeps going inconsistent (stat mismatch)

Hi All,

I have a single PG that keeps going inconsistent. A scrub is enough to pick up the problem, and the primary OSD log shows something like:

2021-09-22 18:08:18.502 7f5bdcb11700 0 log_channel(cluster) log [DBG] : 1.3ff scrub starts
2021-09-22 18:08:18.880 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff scrub : stat mismatch, got 3243/3244 objects, 67/67 clones, 3243/3244 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1/1 whiteouts, 13247338496/13251516416 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
2021-09-22 18:08:18.880 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff scrub 1 errors
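To make the mismatch easier to read, here is a minimal Python sketch that pulls the counter pairs out of a "stat mismatch" line and reports only the ones that differ. It assumes the first number of each pair is what the scrub counted and the second is what the PG's stats record; the function name stat_mismatch_diff and the regular expression are illustrative only and are not part of any Ceph tooling.

```python
import re

# Matches counter pairs such as "3243/3244 objects" or "0/0 manifest objects".
PAIR = re.compile(r"(\d+)/(\d+) ([a-z_ ]+?)(?=[,.]|$)")

def stat_mismatch_diff(line):
    """Return {counter: second - first} for every counter whose pair differs.

    Assumption: the first number is the scrub's count and the second is
    the value recorded in the PG stats.
    """
    _, _, stats = line.partition("stat mismatch, got ")
    diffs = {}
    for first, second, name in PAIR.findall(stats):
        if first != second:
            diffs[name.strip()] = int(second) - int(first)
    return diffs

# The scrub error line from the log above.
LINE = (
    "2021-09-22 18:08:18.880 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : "
    "1.3ff scrub : stat mismatch, got 3243/3244 objects, 67/67 clones, "
    "3243/3244 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, "
    "1/1 whiteouts, 13247338496/13251516416 bytes, 0/0 manifest objects, "
    "0/0 hit_set_archive bytes."
)

print(stat_mismatch_diff(LINE))
```

For the line above, this reports a difference of 1 in the objects and dirty counters and 4177920 in bytes; all other counters agree.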

It always repairs ok when I run ceph pg repair 1.3ff:

2021-09-22 18:08:47.533 7f5bdcb11700 0 log_channel(cluster) log [DBG] : 1.3ff repair starts
2021-09-22 18:15:58.218 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff repair : stat mismatch, got 3243/3244 objects, 67/67 clones, 3243/3244 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1/1 whiteouts, 13247338496/13251516416 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
2021-09-22 18:15:58.218 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff repair 1 errors, 1 fixed

It has happened multiple times, always with the same PG; no other PG is doing this. It's a Nautilus v14.2.5 cluster using spinning disks with separate DB/WAL on SSDs. I don't believe there's an underlying hardware problem, but to make sure I reweighted the primary OSD for this PG to 0 to get it to move to another disk. The backfilling completed, but when I manually scrubbed the PG again it showed as inconsistent, as above.

In case it's relevant, the only major activity I've performed recently has been gradually adding new OSD nodes and disks to the cluster; prior to this it had been up without issue for well over a year. When this issue first presented, the primary OSD for this PG was the first new OSD I had added. The inconsistent PG issue didn't start immediately after adding it, though; it was some weeks later.

Any suggestions as to how I can get rid of this problem?
Should I try reweighting the other two OSDs for this PG to 0?
Or is this a known bug that requires some specific work or just an upgrade?

Thanks,
Simon.


