Re: Scrub stuck and 'pg has invalid (post-split) stat'

I found a config option that forces scrubbing of PGs with invalid stats; what is your current setting for it?

ceph config get osd osd_scrub_invalid_stats
true

The config reference states:

Forces extra scrub to fix stats marked as invalid.

But the default seems to be true, so I'd expect it's true in your case as well?
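If it is, it might still be worth checking what a single OSD actually runs with and re-applying the setting; osd.0 below is just a placeholder for one of your OSD ids:

ceph config show osd.0 osd_scrub_invalid_stats
ceph config set osd osd_scrub_invalid_stats true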

Quoting Cedric <yipikai7@xxxxxxxxx>:

Thanks Eugen for the suggestion. Yes, we have tried that, and also
re-peering the concerned PGs; still the same issue.

Looking at the code, it seems the "invalid (post-split) stats" message
is triggered when the PG has "stats_invalid": true; here is the result
of a query:

"stats_invalid": true,
                "dirty_stats_invalid": false,
                "omap_stats_invalid": false,
                "hitset_stats_invalid": false,
                "hitset_bytes_stats_invalid": false,
                "pin_stats_invalid": false,
                "manifest_stats_invalid": false,

I am also re-sending the cluster information that was lost in a
previous reply that missed the list. Don't hesitate to ask if more is
needed; I would be glad to provide it.

Cédric


On Thu, Feb 22, 2024 at 11:04 AM Eugen Block <eblock@xxxxxx> wrote:

Hm, I wonder if setting (and unsetting after a while) noscrub and
nodeep-scrub has any effect. Have you tried that?
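Something along these lines; the flags are cluster-wide and should be
removed again afterwards:

ceph osd set noscrub
ceph osd set nodeep-scrub
# observe for a while, then re-enable scrubbing:
ceph osd unset noscrub
ceph osd unset nodeep-scrub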

Quoting Cedric <yipikai7@xxxxxxxxx>:

> Update: we have run fsck and re-shard on all BlueStore volumes; it
> seems sharding had not been applied.
>
> Unfortunately scrubs and deep-scrubs are still stuck on PGs of the
> pool that is suffering the issue, but other PGs scrub fine.
>
> The next step will be to remove the cache tier as suggested, but that
> is not possible yet, as the PGs need to be scrubbed before the cache
> tier agent can activate.
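>
> For reference, the removal we are aiming for would be roughly the
> following (assuming vms is the base pool and vms_cache the cache pool,
> and that all dirty objects can eventually be flushed):
>
> ceph osd tier cache-mode vms_cache proxy
> rados -p vms_cache cache-flush-evict-all
> ceph osd tier remove-overlay vms
> ceph osd tier remove vms vms_cache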
>
> As we are struggling to make this cluster work again, any help
> would be greatly appreciated.
>
> Cédric
>
>> On 20 Feb 2024, at 20:22, Cedric <yipikai7@xxxxxxxxx> wrote:
>>
>> Thanks Eugen, sorry about the missed reply to all.
>>
>> The reason we still have the cache tier is that we were not able to
>> flush all dirty entries in order to remove it (as per the procedure).
>> The cluster was migrated from HDD/SSD to NVMe a while ago, but the
>> tiering remains, unfortunately.
>>
>> So for the moment we are trying to understand the root cause.
>>
>> On Tue, Feb 20, 2024 at 1:43 PM Eugen Block <eblock@xxxxxx> wrote:
>>>
>>> Please don't drop the list from your response.
>>>
>>> The first question that comes to mind is: why do you have a cache
>>> tier at all if your pools are on NVMe devices anyway? I don't see any
>>> benefit here. Did you try the suggested workaround and disable the
>>> cache tier?
>>>
>>> Quoting Cedric <yipikai7@xxxxxxxxx>:
>>>
>>>> Thanks Eugen, see attached infos.
>>>>
>>>> Some more details:
>>>>
>>>> - commands that actually hang: ceph balancer status ; rbd -p vms ls ;
>>>> rados -p vms_cache cache-flush-evict-all
>>>> - all scrubs running on vms_cache PGs stall and restart in a loop
>>>> without actually doing anything
>>>> - all I/O is at 0, both in ceph status and in iostat on the nodes
>>>>
>>>> On Tue, Feb 20, 2024 at 10:00 AM Eugen Block <eblock@xxxxxx> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> some more details would be helpful, for example what's the pool size
>>>>> of the cache pool? Did you issue a PG split before or during the
>>>>> upgrade? This thread [1] deals with the same problem; the described
>>>>> workaround was to set hit_set_count to 0 and disable the cache layer
>>>>> until the issue is resolved. Afterwards you could enable the cache
>>>>> layer again. But keep in mind that the cache tier code is entirely
>>>>> removed in Reef (IIRC).
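>>>>>
>>>>> If you want to try that workaround, it would be something along
>>>>> these lines (pool name taken from your earlier mail; the value to
>>>>> restore afterwards depends on your setup):
>>>>>
>>>>> ceph osd pool set vms_cache hit_set_count 0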
>>>>>
>>>>> Regards,
>>>>> Eugen
>>>>>
>>>>> [1]
>>>>> https://ceph-users.ceph.narkive.com/zChyOq5D/ceph-strange-issue-after-adding-a-cache-osd
>>>>>
>>>>> Quoting Cedric <yipikai7@xxxxxxxxx>:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Following an upgrade from Nautilus (14.2.22) to Pacific (16.2.13),
>>>>>> we are encountering an issue with a cache pool becoming completely
>>>>>> stuck; the relevant message is below:
>>>>>>
>>>>>> pg xx.x has invalid (post-split) stats; must scrub before tier agent
>>>>>> can activate
>>>>>>
>>>>>> In the OSD logs, scrubs for all PGs of this pool keep starting in a
>>>>>> loop without ever succeeding.
>>>>>>
>>>>>> What we already tried without luck so far (the commands are
>>>>>> sketched below the list):
>>>>>>
>>>>>> - shut down / restart the OSDs
>>>>>> - rebalance PGs between OSDs
>>>>>> - raise the memory target on the OSDs
>>>>>> - re-peer the PGs
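>>>>>>
>>>>>> Roughly, the commands behind those steps (ids and pgids are
>>>>>> placeholders, and the memory value is just an example):
>>>>>>
>>>>>> systemctl restart ceph-osd@<id>
>>>>>> ceph config set osd osd_memory_target 8589934592
>>>>>> ceph pg repeer <pgid>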
>>>>>>
>>>>>> Any idea what is causing this? Any help will be greatly appreciated.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Cédric
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx