On Fri, Jan 17, 2020 at 4:47 PM Janek Bevendorff
<janek.bevendorff@xxxxxxxxxxxxx> wrote:
>
> Hi,
>
> We have a CephFS in our cluster with 3 MDSs to which > 300 clients
> connect at any given time. The FS contains about 80 TB of data and many
> million files, so it is important that metadata operations work
> smoothly even when listing large directories.
>
> Previously, we had massive stability problems: the MDS nodes crashed or
> timed out regularly as a result of failing to recall caps fast enough,
> and they weren't able to rejoin afterwards without resetting the
> mds*_openfiles objects (see
> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/AOYWQSONTFROPB4DXVYADWW7V25C3G6Z/
> for details).
>
> We have managed to adjust our configuration to avoid this problem. This
> comes down mostly to adjusting the recall decay rate (which still isn't
> documented), massively reducing any scrubbing activity, allowing no more
> than 10G for mds_cache_memory_limit (the default of 1G is way too low,
> but more than 10G seems to cause trouble during replay), and increasing
> osd_map_message_max to 100 and osd_map_cache_size to 150. We haven't
> seen crashes since. What we do see, however, is that one of the MDS
> nodes randomly locks up and the ceph_mds_reply_latency metric goes up
> and then stays at a higher level than on any other MDS. The FS is not
> completely down, but everything lags massively to the point where it is
> not usable.
>
> Unfortunately, all the hung MDS reports is:
>
>  -77> 2020-01-17 09:29:17.891 7f34c967b700  0 mds.beacon.XXX Skipping
> beacon heartbeat to monitors (last acked 320.587s ago); MDS internal
> heartbeat is not healthy!
>  -76> 2020-01-17 09:29:18.391 7f34c967b700  1 heartbeat_map
> is_healthy 'MDSRank' had timed out after 15
>
> and "ceph fs status" reports only single-digit ops/s for all three MDSs
> (mostly a flat 0). I ran "ceph mds fail 1" to fail the MDS and force a
> standby to take over, which went without problems. Almost immediately
> afterwards, all three now-active MDSs started reporting > 900 ops/s and
> the FS started working properly again. For some strange reason, the
> failed MDS didn't restart, though. It kept reporting the log message
> above until I manually restarted the daemon process.

Looks like the MDS entered the same long (or infinite) loop. If this
happens again, could you attach gdb to it and run the command
'thread apply all bt' inside gdb?

> Is anybody else experiencing such issues, or are there any
> configuration parameters that I can tweak to avoid this behaviour?
>
> Thanks
> Janek
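
If the lockup shows up again, the suggested backtrace dump could be collected
roughly as follows on the affected MDS host (a sketch only; it assumes gdb and
the Ceph debug symbols are installed and that exactly one ceph-mds process is
running on that node):

  # attach to the running MDS, dump all thread backtraces, then detach
  gdb --batch -p $(pidof ceph-mds) -ex 'thread apply all bt' > mds-backtraces.txt

Attaching pauses the daemon while the backtraces are collected, so a short
stall of metadata operations is to be expected.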
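
For reference, the settings Janek names with explicit values could be applied
at runtime along these lines (a sketch only; the unnamed recall decay option is
left out, the 10G limit is written out in bytes, and putting the osd_map_*
options into the global section is an assumption about the deployment):

  # 10 GiB MDS cache instead of the 1 GiB default
  ceph config set mds mds_cache_memory_limit 10737418240
  # OSD map handling, as mentioned above
  ceph config set global osd_map_message_max 100
  ceph config set global osd_map_cache_size 150

The same values can also be kept in ceph.conf instead of the monitor-backed
config store.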