Re: recurring stat mismatch on PG

It's not necessarily a bug... Running a deep-scrub again will just tell you
the current state of the PG. That's safe to do at any time.

If it comes back inconsistent again, I'd repair the PG again, let it
finish completely, then scrub once again to double-check that the repair
worked.
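
For example, the cycle I mean looks roughly like this (a sketch, using
the pg id 19.1fff from your logs):

    # re-run the deep-scrub and check the result
    ceph pg deep-scrub 19.1fff
    ceph health detail | grep 19.1fff

    # if it is still inconsistent, repair and let it finish completely
    ceph pg repair 19.1fff

    # then deep-scrub once more to confirm the repair worked
    ceph pg deep-scrub 19.1fff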

Thinking back, I've seen PG 1fff have scrub errors like this in the past,
but not recently, indicating it was a listing bug of some sort. Perhaps
this is just a leftover stats error from a bug in mimic, and the complete
repair will fix this fully for you.

(Btw, I've never had a stats error like this result in a visible issue.
Repair should probably fix this transparently.)

.. Dan



On Sat, Oct 8, 2022, 11:27 Frank Schilder <frans@xxxxxx> wrote:

> Yes, primary OSD. Extracted with
>
>     grep -e scrub -e repair -e 19.1fff /var/log/ceph/ceph-osd.338.log
>
> and then only relevant lines copied.
>
> Yes, according to the tracker case I should just run a deep-scrub and see.
> I guess if this error was cleared by an aborted repair, that would be a new
> bug? I will do a deep-scrub and report back.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Dan van der Ster <dvanders@xxxxxxxxx>
> Sent: 08 October 2022 11:18:37
> To: Frank Schilder
> Cc: Ceph Users
> Subject: Re:  recurring stat mismatch on PG
>
> Is that the log from the primary OSD?
>
> About the restart, you should probably just deep-scrub again to see the
> current state.
>
>
> .. Dan
>
>
>
> On Sat, Oct 8, 2022, 11:14 Frank Schilder <frans@xxxxxx> wrote:
> Hi Dan,
>
> yes, 15.2.17. I remember that case and was expecting it to be fixed. Here
> is a relevant extract from the log:
>
> 2022-10-08T10:06:22.206+0200 7fa3c48c7700  0 log_channel(cluster) log
> [DBG] : 19.1fff deep-scrub starts
> 2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fffs0 deep-scrub : stat mismatch, got 64532/64531 objects,
> 1243/1243 clones, 64532/64531 dirty, 0/0 omap, 0/0 pinned, 0/0
> hit_set_archive, 1215/1215 whiteouts, 170978253582/170974059278 bytes, 0/0
> manifest objects, 0/0 hit_set_archive bytes.
> 2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fff deep-scrub 1 errors
> 2022-10-08T10:38:20.618+0200 7fa3c48c7700  0 log_channel(cluster) log
> [DBG] : 19.1fff repair starts
> 2022-10-08T10:54:25.801+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fffs0 repair : stat mismatch, got 64532/64531 objects,
> 1243/1243 clones, 64532/64531 dirty, 0/0 omap, 0/0 pinned, 0/0
> hit_set_archive, 1215/1215 whiteouts, 170978253582/170974059278 bytes, 0/0
> manifest objects, 0/0 hit_set_archive bytes.
> 2022-10-08T10:54:25.802+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fff repair 1 errors, 1 fixed
>
> Just completed a repair and it's gone for now. As an alternative
> explanation: we had this scrub error, I started a repair, but then OSDs in
> that PG were shut down and restarted. Is it possible that the repair was
> cancelled and the error cleared erroneously?
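>
> For what it's worth, whether the error flag is still set should be
> visible with something like this (a sketch, using the pg id from above):
>
>     ceph health detail | grep 19.1fff
>     ceph pg 19.1fff query | grep -i -e scrub -e inconsistent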
>
> Thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Dan van der Ster <dvanders@xxxxxxxxx>
> Sent: 08 October 2022 11:03:05
> To: Frank Schilder
> Cc: Ceph Users
> Subject: Re:  recurring stat mismatch on PG
>
> Hi,
>
> Is that 15.2.17? It reminds me of this bug -
> https://tracker.ceph.com/issues/52705 - where an object with a particular
> name would hash to ffffffff and cause a stat mismatch during scrub. But
> 15.2.17 should have the fix for that.
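>
> If you want to check which PG a particular object name hashes to,
> something like this should show it (a sketch; the pool and object names
> here are placeholders):
>
>     ceph osd map <poolname> <objectname>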
>
>
> Can you find the relevant osd log for more info?
>
> .. Dan
>
>
>
> On Sat, Oct 8, 2022, 10:42 Frank Schilder <frans@xxxxxx> wrote:
> Hi all,
>
> I seem to observe something strange on an octopus (latest) cluster. We have
> a PG with a stat mismatch:
>
> 2022-10-08T10:06:22.206+0200 7fa3c48c7700  0 log_channel(cluster) log
> [DBG] : 19.1fff deep-scrub starts
> 2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fffs0 deep-scrub : stat mismatch, got 64532/64531 objects,
> 1243/1243 clones, 64532/64531 dirty, 0/0 omap, 0/0 pinned, 0/0
> hit_set_archive, 1215/1215 whiteouts, 170978253582/170974059278 bytes, 0/0
> manifest objects, 0/0 hit_set_archive bytes.
> 2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fff deep-scrub 1 errors
>
> This exact same mismatch was found before and I executed a pg-repair that
> fixed it. Now it's back. Does anyone have an idea why this might be
> happening and how to deal with it?
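>
> For reference, by pg-repair I mean something like this (a sketch, with
> the pg id from above):
>
>     ceph pg repair 19.1fff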
>
> Thanks!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



