FWIW, I saw a similar problem on a cluster ~1 ago and noticed that the PG affected with "stat mismatch" was the very last PG of the pool (4.1fff in my case, with pg_num = 8192). I recall thinking that it looked more like a bug than a hardware issue and, assuming your pool has 1024 PGs, you may be hitting the same issue (a quick check of the PG-id arithmetic is at the end of this mail). It happened 2 or 3 times and then went away, possibly thanks to software updates (currently on 14.2.21).

Eric

> On 11 Oct 2021, at 18:44, Simon Ironside <sironside@xxxxxxxxxxxxx> wrote:
>
> Bump for any pointers here?
>
> tl;dr - I've got a single PG that keeps going inconsistent (stat mismatch). It always repairs ok but comes back every day now when it's scrubbed.
>
> If there are no suggestions I'll try upgrading to 14.2.22 and then reweighting the other OSDs that serve this PG to 0 (I've already done the primary) to try to force its recreation.
>
> Thanks,
> Simon.
>
> On 22/09/2021 18:50, Simon Ironside wrote:
>> Hi All,
>>
>> I have a recurring single PG that keeps going inconsistent. A scrub is enough to pick up the problem. The primary OSD log shows something like:
>>
>> 2021-09-22 18:08:18.502 7f5bdcb11700 0 log_channel(cluster) log [DBG] : 1.3ff scrub starts
>> 2021-09-22 18:08:18.880 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff scrub : stat mismatch, got 3243/3244 objects, 67/67 clones, 3243/3244 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1/1 whiteouts, 13247338496/13251516416 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
>> 2021-09-22 18:08:18.880 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff scrub 1 errors
>>
>> It always repairs ok when I run ceph pg repair 1.3ff:
>>
>> 2021-09-22 18:08:47.533 7f5bdcb11700 0 log_channel(cluster) log [DBG] : 1.3ff repair starts
>> 2021-09-22 18:15:58.218 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff repair : stat mismatch, got 3243/3244 objects, 67/67 clones, 3243/3244 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1/1 whiteouts, 13247338496/13251516416 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
>> 2021-09-22 18:15:58.218 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff repair 1 errors, 1 fixed
>>
>> It's happened multiple times, always with the same PG number; no other PG is doing this. It's a Nautilus v14.2.5 cluster using spinning disks with separate DB/WAL on SSDs. I don't believe there's an underlying hardware problem, but in a bid to make sure I reweighted the primary OSD for this PG to 0 to get it to move to another disk. The backfilling is complete, but on manually scrubbing the PG again it showed inconsistent as above.
>>
>> In case it's relevant, the only major activity I've performed recently has been gradually adding new OSD nodes and disks to the cluster; prior to this it had been up without issue for well over a year. The primary OSD for this PG was on the first new OSD I added when this issue first presented. The inconsistent PG issue didn't start happening immediately after adding it, though; it was some weeks later.
>>
>> Any suggestions as to how I can get rid of this problem?
>> Should I try reweighting the other two OSDs for this PG to 0?
>> Or is this a known bug that requires some specific work, or just an upgrade?
>>
>> Thanks,
>> Simon.
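
For reference, the "last PG of the pool" observation above is just hex arithmetic. If I have the format right, PG ids are <pool-id>.<pg-seed>, with the seed printed in hex and running from 0 to pg_num - 1. A quick shell check of the seed values (a sketch, assuming a bash-like shell):

$ printf '%x\n' $((1024 - 1))   # last PG seed when pg_num = 1024
3ff
$ printf '%x\n' $((8192 - 1))   # last PG seed when pg_num = 8192
1fff

That matches your 1.3ff and my 4.1fff, so if your pool 1 really has pg_num = 1024 then the affected PG would indeed be the last one of the pool, same as in my case.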