Re: Scrub stuck and 'pg has invalid (post-split) stat'

Eugen Block <eblock@xxxxxx> · Wed, 28 Feb 2024 20:50:17 +0000

Hi,

great that you found a solution. Maybe that also helps to get rid of  
the cache-tier entirely?

Quoting Cedric <yipikai7@xxxxxxxxx>:

Hello,

Sorry for the late reply. Yes, we finally found a solution, which
was to split the cache_pool off onto dedicated OSDs. This cleared
the slow ops and allowed the cluster to serve clients again after
5 days of lockdown. Fortunately, the majority of VMs resumed well,
thanks to the virtio driver, which does not seem to have any
timeout.

It seems that at least one of the main culprits was storing both
the cold and hot data pools on the same OSDs (which in the end makes
total sense). Some of the other actions we took may also have had an
effect; we are still trying to troubleshoot the root cause of the
slow ops. Oddly, this was the 5th cluster we upgraded and all of
them have almost the same configuration, but this one handles 5x
more workload.
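
For readers who want to try the same separation: the thread does not
say how the cache pool was moved to dedicated OSDs, so the following
is only a sketch of one common way to do it, using a custom CRUSH
device class and a replicated rule restricted to that class. The
class name "cachetier", the rule name, the root "default", the
failure domain "host" and the OSD ids are all placeholders, not
values from this cluster. The script only prints the commands unless
APPLY=1 is set.

```shell
#!/usr/bin/env bash
set -eu

# Dry-run helper: prints each command; set APPLY=1 to execute it for real.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '+ %s\n' "$*"; fi
}

# pin_pool_to_class POOL CLASS OSD...
# Tags the given OSDs with a custom device class and points POOL at a
# replicated CRUSH rule restricted to that class.
pin_pool_to_class() {
  pool="$1"; class="$2"; shift 2
  for osd in "$@"; do
    # An OSD's existing class must be removed before a new one can be set.
    run ceph osd crush rm-device-class "$osd"
    run ceph osd crush set-device-class "$class" "$osd"
  done
  # Rule name, root "default" and failure domain "host" are assumptions.
  run ceph osd crush rule create-replicated "${class}_rule" default host "$class"
  run ceph osd pool set "$pool" crush_rule "${class}_rule"
}

# Hypothetical example: the class name and OSD ids are placeholders.
pin_pool_to_class vms_cache cachetier osd.100 osd.101 osd.102
```

Note that this only keeps the cold pool off those OSDs if the cold
pool's own rule is restricted to a different device class; a rule
without a class would still select them. Changing a pool's rule
triggers data movement, so it is worth reviewing the printed
commands first.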

Hope this helps.

Cédric

On 26 Feb 2024, at 10:57, Eugen Block <eblock@xxxxxx> wrote:

Hi,

thanks for the context. Was there any progress over the weekend?
The hanging commands seem to be MGR related, and there's only one
MGR in your cluster according to your output. Can you deploy a
second one manually, then adopt it with cephadm? Can you add the
output of 'ceph versions' as well?
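
A rough sketch of what that could look like, assuming a
package-based install on the new MGR host (so that the
ceph-mgr@<name> systemd unit and the default data path
/var/lib/ceph/mgr/ceph-<name> exist) and that the cephadm binary is
available there. The daemon name "node2" is a placeholder. The
script only prints the commands unless APPLY=1 is set, and the
printed form loses the shell quoting around the capability strings.

```shell
#!/usr/bin/env bash
set -eu

# Dry-run helper: prints each command; set APPLY=1 to execute it for real.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '+ %s\n' "$*"; fi
}

# add_standby_mgr NAME
# Creates a second ceph-mgr on the local host by hand, then hands it
# over to cephadm.
add_standby_mgr() {
  name="$1"
  dir="/var/lib/ceph/mgr/ceph-${name}"   # assumed default mgr data path
  run ceph versions                      # per-daemon release overview
  run mkdir -p "$dir"
  # Create the mgr key and write it into the daemon's data directory.
  run ceph auth get-or-create "mgr.${name}" \
      mon 'allow profile mgr' osd 'allow *' mds 'allow *' -o "${dir}/keyring"
  run chown -R ceph:ceph "$dir"
  run systemctl enable --now "ceph-mgr@${name}"
  # Convert the legacy daemon into a cephadm-managed container.
  run cephadm adopt --style legacy --name "mgr.${name}"
}

# Hypothetical example: "node2" is a placeholder daemon name.
add_standby_mgr node2
```

Once the new daemon shows up as a standby in 'ceph -s', the hanging
commands can be retried to see whether they are tied to the single
active MGR.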

Quoting florian.leduc@xxxxxxxxxx:

Hi,
A bit of history might help to understand why we have the cache tier.

We have been running OpenStack on top of Ceph for many years now
(we started with Mimic, upgraded to Nautilus 2 years ago, and today
upgraded to Pacific). At the beginning of the setup, we had a mix
of HDD+SSD devices in HCI mode for OpenStack Nova. After the
upgrade to Nautilus, we did a hardware refresh with brand-new NVMe
devices and transitioned from mixed devices to NVMe only. But we
were never able to evict all the data from the vms_cache pools
(even when being aggressive with the eviction; the last resort
would have been to stop all the virtual instances, and that was not
an option for our customers), so we decided to move on, set
cache-mode proxy, and serve data from NVMe only since then. It has
been like this for a year and a half.
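
For reference, the usual sequence for retiring a cache tier looks
roughly like the sketch below. It is not what was run on this
cluster: as described above, the eviction step never completed
here. The base pool name "vms" is an assumption (only vms_cache is
named in this thread). The script only prints the commands unless
APPLY=1 is set.

```shell
#!/usr/bin/env bash
set -eu

# Dry-run helper: prints each command; set APPLY=1 to execute it for real.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '+ %s\n' "$*"; fi
}

# remove_cache_tier BASE_POOL CACHE_POOL
remove_cache_tier() {
  base="$1"; cache="$2"
  # Proxy mode: requests pass through to the base pool, nothing new is cached.
  run ceph osd tier cache-mode "$cache" proxy
  # Flush dirty objects to the base pool and evict everything from the cache.
  run rados -p "$cache" cache-flush-evict-all
  # The cache pool should list no objects before it is detached.
  run rados -p "$cache" ls
  # Stop directing client I/O through the cache, then unlink the two pools.
  run ceph osd tier remove-overlay "$base"
  run ceph osd tier remove "$base" "$cache"
}

# Hypothetical example: "vms" is an assumed base pool name.
remove_cache_tier vms vms_cache
```

The overlay should only be removed once the cache pool is really
empty; objects still in use by running VMs are typically what keeps
the evict step from finishing.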

But today, after the upgrade, the situation is that we cannot
query any stats (with ceph pg x.x query), rados queries hang, and
scrubs hang even though all PGs are "active+clean". The cluster
reports no client activity, and no recovery or rebalance either.
Some other commands hang as well, e.g. "ceph balancer status".

--------------
bash-4.2$ ceph -s
 cluster:
   id:     <fsid>
   health: HEALTH_WARN
           mon is allowing insecure global_id reclaim
           noscrub,nodeep-scrub,nosnaptrim flag(s) set
           18432 slow ops, oldest one blocked for 7626 sec, daemons [osd.0,osd.1,osd.10,osd.11,osd.112,osd.113,osd.118,osd.119,osd.120,osd.122]... have slow ops.

 services:
   mon: 3 daemons, quorum mon1,mon2,mon3(age 36m)
   mgr: bm9612541(active, since 39m)
   osd: 72 osds: 72 up (since 97m), 72 in (since 9h)
        flags noscrub,nodeep-scrub,nosnaptrim

 data:
   pools:   8 pools, 2409 pgs
   objects: 14.64M objects, 92 TiB
   usage:   276 TiB used, 143 TiB / 419 TiB avail
   pgs:     2409 active+clean
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
