Hello all,

We are running several Ceph clusters and are facing an issue on one of them; we would appreciate some input on the problems we're seeing. We run Ceph in containers on CentOS Stream 8 and deploy with ceph-ansible.

While upgrading Ceph from 16.2.7 to 16.2.10, we noticed that OSDs were taking a very long time to restart on one of the clusters. (Other clusters were not impacted at all.) The OSD startup was sometimes so slow that we ended up with slow ops and one or two PGs stuck in a peering state. We interrupted the upgrade and the cluster runs fine now, although we have recently seen one OSD flapping and having trouble coming back to life.

We've checked a lot of things and read a lot of mails from this list; here is what we have found so far:

* this cluster has RBD pools for OpenStack and RGW pools; everything is replicated x3, except the RGW data pool which is EC 4+2
* we haven't found any hardware-related issues; we run fully on SSDs and they are all in good shape, there are no network issues, and RAM and CPU are available on all OSD hosts
* BlueStore with a collocated LVM setup
* we have seen the slow restart with almost all the OSDs we've upgraded (100 out of 350)
* on restart the ceph-osd process runs at 100% CPU, but we haven't seen anything weird on the host
* no DB spillover
* we have other clusters with the same hardware, and we don't see problems there

The only thing we found that looks suspicious is the number of op log entries for the PGs of the RGW index pool. `osd_max_pg_log_entries` is set to 10k, but `ceph pg dump` shows PGs with more than 100k log entries (the largest one has > 400k). Could this be the reason for the slow startup of the OSDs? If so, is there a way to trim these logs without too much impact on the cluster?

Let me know if additional info or logs are needed.

BR,
Gauvain
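PS: in case it helps others reproduce the check, below is a minimal sketch of how the oversized PG logs can be spotted, by comparing the per-PG log size reported by `ceph pg dump` against our 10k `osd_max_pg_log_entries` setting. The JSON field names (pg_map, pg_stats, pgid, log_size) are assumptions based on the output of recent Pacific releases and may need adjusting on other versions; the same numbers are also visible in the LOG column of the plain `ceph pg dump` output.

```python
#!/usr/bin/env python3
# Sketch: list PGs whose pg_log size exceeds osd_max_pg_log_entries.
# Assumes the ceph CLI is available on the host running the script.
import json
import subprocess

LIMIT = 10_000  # our osd_max_pg_log_entries value


def ceph_json(*args):
    """Run a ceph CLI command and parse its JSON output."""
    out = subprocess.check_output(["ceph", *args, "--format", "json"])
    return json.loads(out)


dump = ceph_json("pg", "dump")
# Depending on the Ceph version, pg_stats is either nested under "pg_map"
# or sits at the top level of the dump (assumption, adjust if needed).
pg_stats = dump.get("pg_map", dump).get("pg_stats", [])

oversized = [
    (pg["pgid"], pg["log_size"])
    for pg in pg_stats
    if pg.get("log_size", 0) > LIMIT
]

for pgid, log_size in sorted(oversized, key=lambda x: -x[1]):
    print(f"{pgid}: {log_size} pg_log entries (limit {LIMIT})")
```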