Re: Ceph MDS randomly hangs with no useful error message

Hi, I did as you asked and created a thread dump with GDB on the hanging MDS. Here's the result: https://pastebin.com/pPbNvfdb


On 17/01/2020 13:07, Yan, Zheng wrote:
On Fri, Jan 17, 2020 at 4:47 PM Janek Bevendorff
<janek.bevendorff@xxxxxxxxxxxxx> wrote:
Hi,

We have a CephFS in our cluster with three MDSs to which > 300 clients
connect at any given time. The FS contains about 80 TB of data and many
millions of files, so it is important that metadata operations work
smoothly even when listing large directories.

Previously, we had massive stability problems that caused the MDS nodes
to crash or time out regularly because they failed to recall caps fast
enough, and they were unable to rejoin afterwards without us resetting
the mds*_openfiles objects (see
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/AOYWQSONTFROPB4DXVYADWW7V25C3G6Z/
for details).

We have managed to adjust our configuration to avoid this problem. This
comes down mostly to adjusting the recall decay rate (which still isn't
documented), massively reducing any scrubbing activity, allowing no more
than 10G for mds_cache_memory_limit (the default of 1G is way too low,
but more than 10G seems to cause trouble during replay), increasing
osd_map_message_max to 100, and increasing osd_map_cache_size to 150. We
haven't seen crashes since. But what we do see is that one of the MDS
nodes randomly locks up and its ceph_mds_reply_latency metric goes up
and then stays at a higher level than that of any other MDS. The FS is
not completely down as a result, but everything lags so massively that
it becomes unusable.

Unfortunately, all the hung MDS is reporting is:

     -77> 2020-01-17 09:29:17.891 7f34c967b700  0 mds.beacon.XXX Skipping beacon heartbeat to monitors (last acked 320.587s ago); MDS internal heartbeat is not healthy!
     -76> 2020-01-17 09:29:18.391 7f34c967b700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15

and ceph fs status reports only single-digit ops/s for all three MDSs
(mostly flat 0). I ran ceph mds fail 1 to fail the MDS and force a
standby to take over, which went without problems. Almost immediately
after, all three now-active MDSs started reporting > 900 ops/s and the
FS started working properly again. For some strange reason, the failed
MDS didn't restart, though. It kept reporting the log message above
until I manually restarted the daemon process.
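For the record, the failover and the manual restart boiled down to the
following (the systemd unit name is an assumption and depends on how the
daemon is deployed):

     ceph fs status                  # ops/s close to a flat 0 on the hung rank
     ceph mds fail 1                 # fail rank 1 so a standby takes over
     # the failed daemon kept logging the beacon message instead of restarting,
     # so it had to be restarted by hand:
     systemctl restart ceph-mds@$(hostname -s)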

Looks like the MDS entered some long (or infinite) loop. If this happens
again, could you use gdb to attach to it and run the command 'thread
apply all bt' inside gdb?
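Something like this captures the backtraces of all threads
non-interactively (the pidof lookup is just one way to find the ceph-mds
PID; installing the ceph debuginfo packages first gives more useful
symbols):

     MDS_PID=$(pidof ceph-mds)        # or find the PID via ps/systemctl
     gdb --batch -p "$MDS_PID" -ex 'thread apply all bt' > mds-threads.txt 2>&1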

Is anybody else experiencing such issues or are there any configuration
parameters that I can tweak to avoid this behaviour?

Thanks
Janek

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
