Re: Scrub stuck and 'pg has invalid (post-split) stat'

Hm, I wonder if setting (and unsetting after a while) noscrub and nodeep-scrub has any effect. Have you tried that?
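
To be explicit, I mean the cluster-wide flags, i.e. something like:

  ceph osd set noscrub
  ceph osd set nodeep-scrub
  # ...wait a while, then let scrubbing resume:
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub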

Quoting Cedric <yipikai7@xxxxxxxxx>:

Update: we have run fsck and re-sharded all BlueStore volumes; it seems the sharding had never been applied.
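
For the record, the per-OSD steps were roughly the following (OSD stopped first, path and id adjusted per OSD; the sharding spec shown is, as far as I can tell, the Pacific default from bluestore_rocksdb_cfs):

  systemctl stop ceph-osd@<id>
  ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-<id>
  ceph-bluestore-tool show-sharding --path /var/lib/ceph/osd/ceph-<id>
  ceph-bluestore-tool reshard --path /var/lib/ceph/osd/ceph-<id> \
      --sharding "m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P"
  systemctl start ceph-osd@<id>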

Unfortunately, scrubs and deep-scrubs are still stuck on the PGs of the pool suffering the issue, while other PGs scrub fine.

The next step will be to remove the cache tier as suggested, but that is not possible yet, as the PGs need to be scrubbed before the cache tier agent can activate.
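
For reference, the removal we plan to do (once flush/evict works again) is roughly the documented procedure for a writeback tier, with vms as the base pool and vms_cache as the cache pool:

  ceph osd tier cache-mode vms_cache proxy
  rados -p vms_cache cache-flush-evict-all   # this is the step that currently hangs
  ceph osd tier remove-overlay vms
  ceph osd tier remove vms vms_cache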

As we are struggling to make this cluster work again, any help would be greatly appreciated.

Cédric

On 20 Feb 2024, at 20:22, Cedric <yipikai7@xxxxxxxxx> wrote:

Thanks Eugen, sorry about the missed reply-all.

The reason we still have the cache tier is that we were not able to flush all dirty entries in order to remove it (as per the procedure). The cluster was migrated from HDD/SSD to NVMe a while ago, but the tiering unfortunately remains.

So at the moment we are trying to understand the root cause.

On Tue, Feb 20, 2024 at 1:43 PM Eugen Block <eblock@xxxxxx> wrote:

Please don't drop the list from your response.

The first question that comes to mind is: why do you have a cache tier
if all your pools are on NVMe devices anyway? I don't see any benefit here.
Did you try the suggested workaround and disable the cache tier?

Quoting Cedric <yipikai7@xxxxxxxxx>:

Thanks Eugen, see attached infos.

Some more details:

- commands that actually hang: ceph balancer status ; rbd -p vms ls ;
rados -p vms_cache cache-flush-evict-all
- all scrubs running on vms_cache PGs stall and restart in a loop
without actually doing anything (see the checks below)
- all I/O is at 0, both in ceph status and in iostat on the nodes
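
The checks mentioned above are simply:

  ceph pg ls-by-pool vms_cache              # state of every PG in the cache pool
  ceph pg ls-by-pool vms_cache scrubbing    # only the PGs currently scrubbing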

On Tue, Feb 20, 2024 at 10:00 AM Eugen Block <eblock@xxxxxx> wrote:

Hi,

some more details would be helpful, for example: what is the pool size
of the cache pool? Did you issue a PG split before or during the
upgrade? This thread [1] deals with the same problem; the described
workaround was to set hit_set_count to 0 and disable the cache layer
until the issue is resolved. Afterwards you could enable the cache
layer again. But keep in mind that the cache tier code is entirely
removed in Reef (IIRC).
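
Untested on my side, but that workaround would look roughly like this,
with <basepool> being the backing pool and <cachepool> the cache tier
(Ceph may ask for --yes-i-really-mean-it on the cache-mode change):

  # stop tracking hit sets on the cache pool
  ceph osd pool set <cachepool> hit_set_count 0
  # take the cache layer out of the I/O path until the PGs have scrubbed
  ceph osd tier cache-mode <cachepool> none
  ceph osd tier remove-overlay <basepool>
  # later, to enable the cache layer again:
  ceph osd tier set-overlay <basepool> <cachepool>
  ceph osd tier cache-mode <cachepool> writeback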

Regards,
Eugen

[1]
https://ceph-users.ceph.narkive.com/zChyOq5D/ceph-strange-issue-after-adding-a-cache-osd

Quoting Cedric <yipikai7@xxxxxxxxx>:

Hello,

Following an upgrade from Nautilus (14.2.22) to Pacific (16.2.13), we
encountered an issue with a cache pool becoming completely stuck; the
relevant message is below:

pg xx.x has invalid (post-split) stats; must scrub before tier agent
can activate

In the OSD logs, scrubs keep starting in a loop for all PGs of this
pool, without ever succeeding.
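
The affected PGs are listed in ceph health detail, and a scrub can also
be requested by hand on one of them, e.g.:

  ceph health detail | grep 'post-split'
  ceph pg deep-scrub <pgid>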

What we have already tried so far, without luck (commands below):

- shutting down / restarting OSDs
- rebalancing PGs between OSDs
- raising OSD memory
- repeering PGs
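
For reference, the corresponding commands were along these lines (OSD ids,
PG ids and the memory value below are just examples):

  systemctl restart ceph-osd@<id>
  ceph osd pg-upmap-items <pgid> <from-osd> <to-osd>   # move a PG to another OSD
  ceph config set osd osd_memory_target 8589934592     # raise the OSD memory target (8 GiB here)
  ceph pg repeer <pgid>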

Any idea what is causing this? Any help would be greatly appreciated.

Thanks

Cédric

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



