Re: Scrub stuck and 'pg has invalid (post-split) stat'

Eugen Block <eblock@xxxxxx> · Mon, 26 Feb 2024 09:57:14 +0000

Hi,

thanks for the context. Was there any progress over the weekend? The
hanging commands seem to be MGR-related, and according to your output
there is only one MGR in your cluster. Can you deploy a second one
manually and then adopt it with cephadm? Can you also post the output
of 'ceph versions'?
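
For reference, a rough sketch of what that could look like. It assumes
the ceph-mgr package and its systemd unit are available on the target
host, and "mgr2" is only a placeholder daemon name, so adjust it to
your environment. The script only prints the commands unless you
export DRY_RUN=0.

```shell
#!/bin/sh
# Sketch only: prints each command unless DRY_RUN=0 is exported.
# MGR_NAME is a placeholder; "mgr2" is not taken from the cluster above.
MGR_NAME="${MGR_NAME:-mgr2}"
MGR_DIR="/var/lib/ceph/mgr/ceph-${MGR_NAME}"

run() {
    # Dry-run mode echoes the command (shell quoting is not shown);
    # otherwise the command is executed as given.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

show_versions() {
    # Reports which release each daemon type is running.
    run ceph versions
}

deploy_legacy_mgr() {
    # Create the data directory and a keyring for the new MGR,
    # then start it as a package-based (legacy) daemon.
    run mkdir -p "$MGR_DIR"
    run ceph auth get-or-create "mgr.${MGR_NAME}" \
        mon 'allow profile mgr' osd 'allow *' mds 'allow *' \
        -o "${MGR_DIR}/keyring"
    run chown -R ceph:ceph "$MGR_DIR"
    run systemctl start "ceph-mgr@${MGR_NAME}"
}

adopt_mgr() {
    # Hand the legacy daemon over to cephadm.
    run cephadm adopt --style legacy --name "mgr.${MGR_NAME}"
}

show_versions
deploy_legacy_mgr
adopt_mgr
```

Once the second MGR shows up as standby in 'ceph -s', it can take over
if the active one is stuck.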

Quoting florian.leduc@xxxxxxxxxx:

Hi,
A bit of history might help to understand why we have the cache tier.

We have been running OpenStack on top of Ceph for many years now (we
started with Mimic, then upgraded to Nautilus 2 years ago, and today
upgraded to Pacific). When we first set things up, we had a mix of
HDD and SSD devices in HCI mode for OpenStack Nova. After the upgrade
to Nautilus, we did a hardware refresh with brand-new NVMe devices
and transitioned from mixed devices to NVMe only. But we were never
able to evict all the data from the vms_cache pools, even with
aggressive eviction; the last resort would have been to stop all the
virtual instances, and that was not an option for our customers. So
we decided to move on, set cache-mode proxy, and have served data
from NVMe only since then. It has been like this for a year and a
half.

But today, after the upgrade, we cannot query any stats (with ceph
pg x.x query), rados queries hang, and scrubs hang even though all
PGs are "active+clean". The cluster reports no client activity, and
no recovery or rebalance either. Some other commands also hang,
e.g. "ceph balancer status".

--------------
bash-4.2$ ceph -s
  cluster:
    id:     <fsid>
    health: HEALTH_WARN
            mon is allowing insecure global_id reclaim
            noscrub,nodeep-scrub,nosnaptrim flag(s) set
            18432 slow ops, oldest one blocked for 7626 sec, daemons [osd.0,osd.1,osd.10,osd.11,osd.112,osd.113,osd.118,osd.119,osd.120,osd.122]... have slow ops.

  services:
    mon: 3 daemons, quorum mon1,mon2,mon3 (age 36m)
    mgr: bm9612541(active, since 39m)
    osd: 72 osds: 72 up (since 97m), 72 in (since 9h)
         flags noscrub,nodeep-scrub,nosnaptrim

  data:
    pools:   8 pools, 2409 pgs
    objects: 14.64M objects, 92 TiB
    usage:   276 TiB used, 143 TiB / 419 TiB avail
    pgs:     2409 active+clean
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
