I have one or two more stability issues I'm trying to solve in a cluster that I inherited, and I just can't seem to figure them out. One issue may be the cause of the other. This is a Jewel 10.2.11 cluster with ~760 × 10 TB HDDs and 5 GB journals on SSD.

When a large number of files are deleted from CephFS (and possibly when leveldb compacts), OSDs stop responding to heartbeats and get marked down. They come back and start recovery, then other OSDs hit the same issue, and it only settles down once client load on the cluster eases up.

Is there a way to have leveldb compact more frequently, or to make it come up for air periodically so the OSD can still respond to heartbeats and process some IO? I thought splitting PGs would help, but we are still seeing the problem (previously ~20 PGs per OSD, now ~150). I still have enough free space on the SSDs to double, almost triple, the journal size, but I'm not sure that will help in this situation.

The other issue is that some IO just gets stuck while OSDs are being marked down and coming back across the cluster.

I've put rough sketches of the knobs and checks I've been looking at below, in case that helps frame the questions.
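On the compaction side, this is the sort of thing I had in mind for ceph.conf. I haven't verified that leveldb_compact_on_mount is honored by the OSD omap store on Jewel (and it only helps at startup, not under load), so treat it as a sketch rather than something I've tested:

    [osd]
    # Assumption: compact the omap leveldb when the OSD starts, so it
    # isn't carrying a backlog of tombstones into the next compaction.
    leveldb_compact_on_mount = true

    # Give slow-but-alive OSDs longer before peers report them down
    # (default is 20s). This also needs to be visible to the mons, or
    # the reporters and the mons will disagree about the grace period.
    osd_heartbeat_grace = 40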
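Since the trigger is mass deletes from CephFS, I've also been looking at throttling how fast the MDS purges strays, so the deletes reach the OSDs more gradually. My understanding is that Jewel has these throttles; the values below are just illustrative, not tested recommendations:

    [mds]
    # Throttle stray purging after mass deletes. Halving the defaults
    # is a guess at trading purge latency for OSD stability.
    mds_max_purge_files = 32         # default 64
    mds_max_purge_ops = 4096         # default 8192
    mds_max_purge_ops_per_pg = 0.25  # default 0.5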
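On the journal question, this is the procedure I'd follow if bigger journals turn out to be worth trying (osd.123 is just a placeholder id, and the OSD has to be stopped first):

    # With the OSD stopped, flush the journal contents to the filestore:
    ceph-osd -i 123 --flush-journal

    # Recreate/enlarge the journal partition, then initialize it:
    ceph-osd -i 123 --mkjournal

    # Start the OSD again.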
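For the stuck IO, this is how I've been trying to chase it down so far; if there's a better way to see why a request never gets requeued after an OSD comes back, I'd love to hear it (osd.123 again a placeholder):

    # Which requests are blocked, and for how long:
    ceph health detail | grep -Ei 'blocked|slow'

    # On the suspect OSD's host, via the admin socket: the ops it is
    # currently sitting on, with their age and current state:
    ceph daemon osd.123 dump_ops_in_flight

    # OSDs that are holding up peering (I believe this is in Jewel):
    ceph osd blocked-by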
Thanks,
Robert LeBlanc

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx