I found a config option that forces a scrub of PGs with invalid stats;
what is your current setting for it?
ceph config get osd osd_scrub_invalid_stats
true
The config reference states:
> Forces extra scrub to fix stats marked as invalid.
But the default seems to be true, so I'd expect it's true in your case
as well?
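
If it is somehow off on your cluster, a minimal sketch of re-enabling it
and nudging an affected PG (the PG id below is a placeholder, not one of
yours) would be:

ceph config set osd osd_scrub_invalid_stats true
ceph pg deep-scrub <pgid>
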
Quoting Cedric <yipikai7@xxxxxxxxx>:
> Thanks Eugen for the suggestion, yes we have tried, also repeering
> concerned PGs, still the same issue.
>
> Looking at the code, it seems the post-split stats message is triggered
> when the PG has "stats_invalid": true; here is the result of a query (a
> quick way to pull just these flags is shown after the output):
>
> "stats_invalid": true,
> "dirty_stats_invalid": false,
> "omap_stats_invalid": false,
> "hitset_stats_invalid": false,
> "hitset_bytes_stats_invalid": false,
> "pin_stats_invalid": false,
> "manifest_stats_invalid": false,
>
> I also provide again the cluster information that was lost in my
> previous missed reply-all. Don't hesitate to ask for more if needed; I
> would be glad to provide it.
>
> Cédric
>
>
> On Thu, Feb 22, 2024 at 11:04 AM Eugen Block <eblock@xxxxxx> wrote:
>>
>> Hm, I wonder if setting (and unsetting after a while) noscrub and
>> nodeep-scrub has any effect. Have you tried that?
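>>
>> In command form (these are the plain cluster-wide flags, nothing
>> cluster-specific assumed), that would be roughly:
>>
>> ceph osd set noscrub
>> ceph osd set nodeep-scrub
>> # wait a while / until the stuck scrubs give up, then
>> ceph osd unset noscrub
>> ceph osd unset nodeep-scrub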
>>
>> Quoting Cedric <yipikai7@xxxxxxxxx>:
>>
>> > Update: we have run fsck and re-shard on all BlueStore volumes; it
>> > seems sharding was not applied.
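>> >
>> > For anyone following along, a rough sketch of such a pass (OSD id is
>> > a placeholder, the sharding string is the one documented as the
>> > Pacific default, service names assume a non-containerized deployment,
>> > and the OSD has to be stopped while the tool runs):
>> >
>> > systemctl stop ceph-osd@<id>
>> > ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-<id> fsck
>> > ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-<id> show-sharding
>> > ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-<id> \
>> >   --sharding "m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" reshard
>> > systemctl start ceph-osd@<id>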
>> >
>> > Unfortunately scrubs and deep-scrubs are still stuck on PGs of the
>> > pool that is suffering the issue, but other PGs scrub fine.
>> >
>> > The next step will be to remove the cache tier as suggested, but
>> > that is not possible yet, as the PGs need to be scrubbed before the
>> > cache tier agent can activate.
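>> >
>> > For reference, the removal sequence we are aiming for, per the
>> > documented procedure (assuming vms_cache is the writeback tier on top
>> > of vms, as used elsewhere in this thread), would be roughly:
>> >
>> > ceph osd tier cache-mode vms_cache proxy
>> > rados -p vms_cache cache-flush-evict-all
>> > ceph osd tier remove-overlay vms
>> > ceph osd tier remove vms vms_cache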
>> >
>> > As we are struggling to make this cluster work again, any help
>> > would be greatly appreciated.
>> >
>> > Cédric
>> >
>> >> On 20 Feb 2024, at 20:22, Cedric <yipikai7@xxxxxxxxx> wrote:
>> >>
>> >> Thanks Eugen, sorry about the missed reply-all.
>> >>
>> >> The reason we still have the cache tier is that we were not able
>> >> to flush all dirty entries in order to remove it (as per the
>> >> procedure), so the cluster has been migrated from HDD/SSD to NVMe a
>> >> while ago but the tiering remains, unfortunately.
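>> >>
>> >> A quick way to gauge how much dirty data is still stuck in the cache
>> >> pool (pool name as used elsewhere in this thread) is something like:
>> >>
>> >> ceph df detail                                 # DIRTY column, where shown, counts unflushed objects
>> >> rados -p vms_cache cache-try-flush-evict-all   # non-blocking variant of the flush/evict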
>> >>
>> >> So for now we are trying to understand the root cause.
>> >>
>> >> On Tue, Feb 20, 2024 at 1:43 PM Eugen Block <eblock@xxxxxx> wrote:
>> >>>
>> >>> Please don't drop the list from your response.
>> >>>
>> >>> The first question coming to mind is: why do you have a cache tier
>> >>> if all your pools are on NVMe devices anyway? I don't see any
>> >>> benefit here.
>> >>> Did you try the suggested workaround and disable the cache-tier?
>> >>>
>> >>> Quoting Cedric <yipikai7@xxxxxxxxx>:
>> >>>
>> >>>> Thanks Eugen, see attached infos.
>> >>>>
>> >>>> Some more details:
>> >>>>
>> >>>> - commands that actually hang: ceph balancer status ; rbd -p vms ls ;
>> >>>>   rados -p vms_cache cache-flush-evict-all
>> >>>> - all scrubs running on vms_cache PGs stall / restart in a loop
>> >>>>   without actually doing anything (see the sketch below the list)
>> >>>> - all I/O is at 0, both in ceph status and in iostat on the nodes
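>> >>>>
>> >>>> To watch the stalling, something along these lines (pool name as
>> >>>> above) lists the cache-pool PGs and those currently scrubbing:
>> >>>>
>> >>>> ceph pg ls-by-pool vms_cache             # state of every PG in the pool
>> >>>> ceph pg ls-by-pool vms_cache scrubbing   # only the PGs in a scrubbing state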
>> >>>>
>> >>>> On Tue, Feb 20, 2024 at 10:00 AM Eugen Block <eblock@xxxxxx> wrote:
>> >>>>>
>> >>>>> Hi,
>> >>>>>
>> >>>>> some more details would be helpful, for example: what is the pool
>> >>>>> size of the cache pool? Did you issue a PG split before or during
>> >>>>> the upgrade? This thread [1] deals with the same problem; the
>> >>>>> described workaround was to set hit_set_count to 0 and disable the
>> >>>>> cache layer until that is resolved. Afterwards you could enable
>> >>>>> the cache layer again. But keep in mind that the cache tier code
>> >>>>> is entirely removed in Reef (IIRC).
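>> >>>>>
>> >>>>> In command form, that workaround would look roughly like this
>> >>>>> (pool name taken from elsewhere in the thread; note the current
>> >>>>> values first so they can be restored, and "cache-mode proxy" is
>> >>>>> just one way to read "disable the cache layer"):
>> >>>>>
>> >>>>> ceph osd pool get vms_cache hit_set_count    # note the current value
>> >>>>> ceph osd pool set vms_cache hit_set_count 0
>> >>>>> ceph osd tier cache-mode vms_cache proxy     # stop promotions while scrubs catch up
>> >>>>> # once the PGs have scrubbed, restore the previous settings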
>> >>>>>
>> >>>>> Regards,
>> >>>>> Eugen
>> >>>>>
>> >>>>> [1]
>> >>>>> https://ceph-users.ceph.narkive.com/zChyOq5D/ceph-strange-issue-after-adding-a-cache-osd
>> >>>>>
>> >>>>> Quoting Cedric <yipikai7@xxxxxxxxx>:
>> >>>>>
>> >>>>>> Hello,
>> >>>>>>
>> >>>>>> Following an upgrade from Nautilus (14.2.22) to Pacific (16.2.13),
>> >>>>>> we have encountered an issue with a cache pool becoming completely
>> >>>>>> stuck; the relevant message is below:
>> >>>>>>
>> >>>>>> pg xx.x has invalid (post-split) stats; must scrub before tier
>> >>>>>> agent can activate
>> >>>>>>
>> >>>>>> In the OSD logs, scrubs are starting in a loop, without ever
>> >>>>>> succeeding, for all PGs of this pool.
>> >>>>>>
>> >>>>>> What we already tried without luck so far (usual CLI forms are
>> >>>>>> sketched after the list):
>> >>>>>>
>> >>>>>> - shutdown / restart of the OSDs
>> >>>>>> - rebalancing PGs between OSDs
>> >>>>>> - raising the memory available to the OSDs
>> >>>>>> - repeering the PGs
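>> >>>>>>
>> >>>>>> The usual CLI form of those steps (OSD/PG ids, weights and the
>> >>>>>> memory value are placeholders) is roughly:
>> >>>>>>
>> >>>>>> systemctl restart ceph-osd@<id>                # restart an OSD
>> >>>>>> ceph osd reweight <osd-id> <weight 0..1>       # nudge PGs onto other OSDs
>> >>>>>> ceph config set osd osd_memory_target <bytes>  # raise the OSD memory target
>> >>>>>> ceph pg repeer <pgid>                          # force a PG to re-peer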
>> >>>>>>
>> >>>>>> Any idea what is causing this? Any help will be greatly
>> >>>>>> appreciated.
>> >>>>>>
>> >>>>>> Thanks
>> >>>>>>
>> >>>>>> Cédric