I have one or two more stability issues I'm trying to solve in a cluster that I inherited, and I just can't seem to figure them out. One issue may be the cause of the other. This is a Jewel 10.2.11 cluster with ~760 × 10 TB HDDs and 5 GB journals on SSD.

When a large number of files are deleted from CephFS (and possibly when leveldb compacts), OSDs stop responding to heartbeats and get marked down. They come back and start recovery, then other OSDs hit the same issue, and it only settles down once client load on the cluster eases up.

Is there a way to have leveldb compact more frequently, or to make it come up for air periodically so the OSD can still respond to heartbeats and process some IO? I thought splitting PGs would help, but we are still seeing the problem (previously ~20 PGs per OSD, now ~150). I still have enough free space on the SSDs to double, almost triple, the journal size, but I'm not sure that will help in this situation.

The other issue is that some IO just gets stuck while OSDs are being marked down and coming back across the cluster.

I've put rough sketches of the knobs and checks I've been looking at below, in case that helps frame the questions.
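On the compaction side, this is the sort of thing I had in mind for ceph.conf. I haven't verified that leveldb_compact_on_mount is honored by the OSD omap store on Jewel (and it only helps at startup, not under load), so treat it as a sketch rather than something I've tested:

    [osd]
    # Assumption: compact the omap leveldb when the OSD starts, so it
    # isn't carrying a backlog of tombstones into the next compaction.
    leveldb_compact_on_mount = true

    # Give slow-but-alive OSDs longer before peers report them down
    # (default is 20s). This also needs to be visible to the mons, or
    # the reporters and the mons will disagree about the grace period.
    osd_heartbeat_grace = 40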
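Since the trigger is mass deletes from CephFS, I've also been looking at throttling how fast the MDS purges strays, so the deletes reach the OSDs more gradually. My understanding is that Jewel has these throttles; the values below are just illustrative, not tested recommendations:

    [mds]
    # Throttle stray purging after mass deletes. Halving the defaults
    # is a guess at trading purge latency for OSD stability.
    mds_max_purge_files = 32         # default 64
    mds_max_purge_ops = 4096         # default 8192
    mds_max_purge_ops_per_pg = 0.25  # default 0.5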
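On the journal question, this is the procedure I'd follow if bigger journals turn out to be worth trying (osd.123 is just a placeholder id, and the OSD has to be stopped first):

    # With the OSD stopped, flush the journal contents to the filestore:
    ceph-osd -i 123 --flush-journal

    # Recreate/enlarge the journal partition, then initialize it:
    ceph-osd -i 123 --mkjournal

    # Start the OSD again.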
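For the stuck IO, this is how I've been trying to chase it down so far; if there's a better way to see why a request never gets requeued after an OSD comes back, I'd love to hear it (osd.123 again a placeholder):

    # Which requests are blocked, and for how long:
    ceph health detail | grep -Ei 'blocked|slow'

    # On the suspect OSD's host, via the admin socket: the ops it is
    # currently sitting on, with their age and current state:
    ceph daemon osd.123 dump_ops_in_flight

    # OSDs that are holding up peering (I believe this is in Jewel):
    ceph osd blocked-by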
Thanks,
Robert LeBlanc

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx