Ceph MDS randomly hangs with no useful error message

Hi,

We have a CephFS in our cluster with 3 MDSs to which > 300 clients connect at any given time. The FS contains about 80 TB of data and many millions of files, so it is important that metadata operations work smoothly even when listing large directories.

Previously, we had massive stability problems: the MDS nodes would regularly crash or time out because they failed to recall caps fast enough, and they weren't able to rejoin afterwards without us resetting the mds*_openfiles objects (see https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/AOYWQSONTFROPB4DXVYADWW7V25C3G6Z/ for details).
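For anyone hitting the same issue, resetting those objects came down to removing them from the metadata pool while the affected ranks were down, roughly like this (pool and object names are illustrative; they depend on your metadata pool name and MDS ranks, and there may be more than one object per rank):

   rados -p cephfs_metadata rm mds0_openfiles.0
   rados -p cephfs_metadata rm mds1_openfiles.0
   rados -p cephfs_metadata rm mds2_openfiles.0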

We have managed to adjust our configuration to avoid this problem. It mostly comes down to:

- adjusting the recall decay rate (which still isn't documented),
- massively reducing any scrubbing activity,
- allowing no more than 10G for mds_cache_memory_limit (the default of 1G is way too low, but more than 10G seems to cause trouble during replay),
- increasing osd_map_message_max to 100 and osd_map_cache_size to 150.

We haven't seen crashes since. What we do see now is that one of the MDS daemons will randomly lock up, and its ceph_mds_reply_latency metric goes up and then stays at a higher level than on any other MDS. The FS isn't completely down as a result, but everything lags so massively that it's not usable.
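In ceph.conf terms the relevant part looks roughly like the following (section placement is just how we'd write it; by "recall decay rate" I mean the mds_recall_max_decay_rate knob, whose exact value I'm omitting here because it needs per-cluster tuning):

   [global]
   osd_map_message_max = 100
   osd_map_cache_size  = 150

   [mds]
   # recall decay rate (mds_recall_max_decay_rate) tuned per cluster -- still undocumented
   mds_cache_memory_limit = 10737418240   # 10G; the 1G default is far too low, >10G gave us trouble during replay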

Unfortunately, all the hung MDS is reporting is:

   -77> 2020-01-17 09:29:17.891 7f34c967b700  0 mds.beacon.XXX Skipping beacon heartbeat to monitors (last acked 320.587s ago); MDS internal heartbeat is not healthy!
   -76> 2020-01-17 09:29:18.391 7f34c967b700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15

and ceph fs status reports only single-digit ops/s for all three MDSs (mostly a flat 0). I ran ceph mds fail 1 to fail the MDS and force a standby to take over, which went without problems. Almost immediately afterwards, all three now-active MDSs started reporting > 900 ops/s and the FS started working properly again. Strangely, the failed MDS didn't restart, though: it kept logging the message above until I manually restarted the daemon process.
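For completeness, the sequence of commands was roughly the following (the last line assumes a systemd-managed MDS and uses a placeholder for the daemon id):

   ceph fs status            # ~0 ops/s on all three active MDSs
   ceph mds fail 1           # fail rank 1, a standby takes over, ops/s jump to > 900
   # the failed daemon kept logging the beacon message instead of rejoining, so:
   systemctl restart ceph-mds@<mds-id>   # on the node running the hung MDS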

Is anybody else experiencing such issues or are there any configuration parameters that I can tweak to avoid this behaviour?

Thanks
Janek

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



