Hi,

A bit of history might help to understand why we have the cache tier.

We have been running OpenStack on top of Ceph for many years now (we started with Mimic, upgraded to Nautilus two years ago, and upgraded to Pacific today). At the beginning, the setup was a mix of HDD and SSD devices in HCI mode for OpenStack Nova. After the upgrade to Nautilus, we did a hardware refresh with brand new NVMe devices and transitioned from mixed devices to NVMe only.

But we were never able to evict all the data from the vms_cache pools, even when being aggressive with the eviction. The last resort would have been to stop all the virtual instances, and that was not an option for our customers. So we decided to move on, set cache-mode proxy, and have served data from NVMe only since then. It has been like this for a year and a half.

Today, after the upgrade, the situation is that we cannot query any stats (with ceph pg x.x query), rados query hangs, and scrub hangs, even though all PGs are "active+clean". The cluster reports no client activity, and no recovery or rebalance either. Some other commands also hang, e.g. "ceph balancer status".

--------------
bash-4.2$ ceph -s
  cluster:
    id:     <fsid>
    health: HEALTH_WARN
            mon is allowing insecure global_id reclaim
            noscrub,nodeep-scrub,nosnaptrim flag(s) set
            18432 slow ops, oldest one blocked for 7626 sec, daemons [osd.0,osd.1,osd.10,osd.11,osd.112,osd.113,osd.118,osd.119,osd.120,osd.122]... have slow ops.

  services:
    mon: 3 daemons, quorum mon1,mon2,mon3(age 36m)
    mgr: bm9612541(active, since 39m)
    osd: 72 osds: 72 up (since 97m), 72 in (since 9h)
         flags noscrub,nodeep-scrub,nosnaptrim

  data:
    pools:   8 pools, 2409 pgs
    objects: 14.64M objects, 92 TiB
    usage:   276 TiB used, 143 TiB / 419 TiB avail
    pgs:     2409 active+clean

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx