Ceph MDS randomly hangs with no useful error message

Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> · Fri, 17 Jan 2020 09:46:54 +0100

Hi,

We have a CephFS in our cluster with 3 MDSs to which more than 300 
clients connect at any given time. The FS contains about 80 TB of data 
and many millions of files, so it is important that metadata operations 
work smoothly even when listing large directories.

Previously, we had massive stability problems: the MDS nodes crashed or 
timed out regularly because they failed to recall caps fast enough, and 
afterwards they could not rejoin unless we reset the mds*_openfiles 
objects (see 
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/AOYWQSONTFROPB4DXVYADWW7V25C3G6Z/ 
for details).

We have managed to adjust our configuration to avoid this problem. This 
comes down mostly to adjusting the recall decay rate (which still isn't 
documented), massively reducing any scrubbing activities, allowing for 
no more than 10G for mds_cache_memory_limit (the default of 1G is way 
too low, but more than 10G seems to cause trouble during replay), 
increasing osd_map_message_max to 100, and setting osd_map_cache_size to 
150. We haven't seen crashes since. What we do see is that one of the 
MDS nodes randomly locks up: its ceph_mds_reply_latency metric goes up 
and then stays at a higher level than on any other MDS. The FS is not 
completely down as a result, but everything lags so badly that it is 
not usable.
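
For anyone wanting to reproduce the settings named above, here is a 
minimal sketch that only prints the corresponding "ceph config set" 
commands. The three option names and values are taken from this post. 
The config targets (mds, global, osd), the reading of "10G" as 10 GiB, 
and the helper name build_commands are assumptions for illustration. 
The recall decay rate is left out because neither an option name nor a 
value is given here.

```python
# Sketch only: prints commands, does not talk to a cluster.
GIB = 1024 ** 3  # assumption: "10G" in the post means 10 GiB

# (target, option, value). Option names and values come from the post;
# the targets are assumptions and may need adjusting for your cluster.
SETTINGS = [
    ("mds", "mds_cache_memory_limit", 10 * GIB),
    ("global", "osd_map_message_max", 100),
    ("osd", "osd_map_cache_size", 150),
]


def build_commands(settings):
    """Return one 'ceph config set' command line per setting."""
    return [
        f"ceph config set {who} {name} {value}"
        for who, name, value in settings
    ]


if __name__ == "__main__":
    for cmd in build_commands(SETTINGS):
        print(cmd)
```

Printing the commands instead of running them keeps the sketch safe to 
try; review each line before applying it.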

Unfortunately, all the hung MDS is reporting is:

   -77> 2020-01-17 09:29:17.891 7f34c967b700  0 mds.beacon.XXX Skipping beacon heartbeat to monitors (last acked 320.587s ago); MDS internal heartbeat is not healthy!
   -76> 2020-01-17 09:29:18.391 7f34c967b700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15

and ceph fs status reports only single-digit ops/s for all three MDSs 
(mostly flat 0). I ran ceph mds fail 1 to fail the MDS and force a 
standby to take over, which went without problems. Almost immediately 
after, all three now-active MDSs started reporting > 900 ops/s and the 
FS started working properly again. For some strange reason, the failed 
MDS didn't restart, though. It kept reporting the log message above 
until I manually restarted the daemon process.
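
Since the skipped-beacon message is the only symptom in the log, one 
way to catch the hang early might be to watch for it. The following is 
a rough sketch, not an established tool: it extracts the "last acked" 
age from log text in the format quoted above and flags the MDS once 
that age passes a threshold. The function names and the 60-second 
threshold are hypothetical choices.

```python
import re

# Matches the message format quoted above. \s+ tolerates the line
# wrapping that mail clients introduce into pasted log lines.
BEACON_RE = re.compile(
    r"Skipping\s+beacon\s+heartbeat\s+to\s+monitors\s+"
    r"\(last\s+acked\s+([0-9]+(?:\.[0-9]+)?)s\s+ago\)"
)


def last_acked_ages(log_text):
    """Return every 'last acked' age in seconds found in log_text."""
    return [float(m.group(1)) for m in BEACON_RE.finditer(log_text)]


def looks_hung(log_text, threshold=60.0):
    """True if any skipped-beacon message reports an age >= threshold.

    The default threshold is an arbitrary value for illustration.
    """
    ages = last_acked_ages(log_text)
    return bool(ages) and max(ages) >= threshold


if __name__ == "__main__":
    sample = (
        "-77> 2020-01-17 09:29:17.891 7f34c967b700  0 mds.beacon.XXX "
        "Skipping beacon heartbeat to monitors (last acked 320.587s ago); "
        "MDS internal heartbeat is not healthy!"
    )
    print(last_acked_ages(sample), looks_hung(sample))
```

A check like this could feed an alert, so that the affected rank gets 
failed over before clients notice the lag.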

Is anybody else experiencing such issues or are there any configuration 
parameters that I can tweak to avoid this behaviour?

Thanks
Janek
