MDS fails repeatedly while handling many concurrent metadata operations

Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> · Tue, 23 Jul 2019 20:29:44 +0200

Hi,

Disclaimer: I posted this before to the ceph.io mailing list, but from
the answers I didn't get and a look at the archives, I concluded that
that list is very dead. So apologies if anyone has read this before.

I am trying to copy our storage server to a CephFS. We have 5 MONs in
our cluster and (now) 7 MDS with max_mds = 4. The list (!) of files I am
trying to copy is about 23GB, so it's a lot of files. I am copying them
in batches of 25k using 16 parallel rsync processes over a 10G link.
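
The batching scheme described above could look roughly like the
following Python sketch. This is an illustration under assumptions, not
the script actually used: the function names, the work directory, and
the choice of rsync's --files-from option to pass each batch are all
hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from pathlib import Path

BATCH_SIZE = 25_000   # files per rsync invocation, as described above
WORKERS = 16          # parallel rsync processes, as described above


def batches(lines, size=BATCH_SIZE):
    """Yield successive lists of at most `size` items from an iterable."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def rsync_batch(index, paths, src_root, dst_root, work_dir):
    """Write one batch to a list file and pass it to rsync via --files-from."""
    list_file = Path(work_dir) / f"batch_{index:06d}.txt"
    list_file.write_text("".join(p + "\n" for p in paths))
    cmd = ["rsync", "-a", f"--files-from={list_file}", src_root, dst_root]
    return subprocess.run(cmd).returncode


def copy_all(file_list, src_root, dst_root, work_dir):
    """Run up to WORKERS rsync processes at a time over the whole file list."""
    with open(file_list) as fh, ThreadPoolExecutor(max_workers=WORKERS) as pool:
        stripped = (line.rstrip("\n") for line in fh)
        futures = [
            pool.submit(rsync_batch, i, chunk, src_root, dst_root, work_dir)
            for i, chunk in enumerate(batches(stripped))
        ]
        return [f.result() for f in futures]
```

Note that this sketch queues every batch up front, so with a file list
of this size a real run would need to bound the number of pending
batches.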

I started out with 5 MDSs / 2 active, but had repeated issues with
immense and growing cache sizes far beyond the theoretical maximum of
400k inodes which the 16 rsync processes could keep open at the same
time. The usual inode count was between 1 and 4 million and the cache
size between 20 and 80GB on average.
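
To make the mismatch explicit, here is the arithmetic as a short Python
snippet. It assumes that each rsync process holds at most one batch of
25k files open at a time, which is where the 400k figure comes from.

```python
BATCH_SIZE = 25_000        # files per batch
RSYNC_PROCESSES = 16       # parallel rsync processes

# Upper bound on inodes the clients could have open simultaneously.
max_open = BATCH_SIZE * RSYNC_PROCESSES

# Observed MDS inode counts, taken from the range quoted above.
for observed in (1_000_000, 4_000_000):
    ratio = observed / max_open
    print(f"{observed:>9,} cached inodes = {ratio:.1f}x the expected maximum")
```

The observed counts are therefore 2.5 to 10 times the expected maximum.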

After a while, the MDSs started failing under this load by either
crashing or being kicked from the quorum. I tried increasing the max
cache size, max log segments, and beacon grace period, but to no avail.
A crashed MDS often needs minutes to rejoin.
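
For reference, the three settings mentioned above presumably correspond
to the options mds_cache_memory_limit, mds_log_max_segments, and
mds_beacon_grace. The Python sketch below only builds the matching
"ceph config set" command lines. The values are hypothetical
placeholders, not the ones tried here, and "ceph config set" assumes a
release with the centralized config store (Mimic or later).

```python
import shlex

# Hypothetical values for illustration only; not the values used in this cluster.
SETTINGS = {
    "mds_cache_memory_limit": 64 * 1024**3,  # bytes
    "mds_log_max_segments": 256,             # journal segments
    "mds_beacon_grace": 60,                  # seconds
}


def config_commands(settings, who="mds"):
    """Build one `ceph config set` argv list per option."""
    return [
        ["ceph", "config", "set", who, name, str(value)]
        for name, value in settings.items()
    ]


# Print the commands instead of executing them.
for argv in config_commands(SETTINGS):
    print(shlex.join(argv))
```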

The MDSs fail with the following message:

   -21> 2019-07-22 14:00:05.877 7f67eacec700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
   -20> 2019-07-22 14:00:05.877 7f67eacec700  0 mds.beacon.XXX Skipping beacon heartbeat to monitors (last acked 24.0042s ago); MDS internal heartbeat is not healthy!
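
To see how long an MDS goes without an acknowledged beacon, the "last
acked" value can be pulled out of such log lines. The following is a
minimal Python sketch; the regular expression is based solely on the
line format shown above and the function name is hypothetical.

```python
import re

# Matches the "Skipping beacon heartbeat" message shown in the log excerpt.
BEACON_RE = re.compile(
    r"mds\.beacon\.(?P<name>\S+) Skipping beacon heartbeat to monitors "
    r"\(last acked (?P<age>[0-9.]+)s ago\)"
)


def skipped_beacons(log_text):
    """Return (mds_name, seconds_since_last_ack) for each skipped beacon."""
    # Mail clients wrap long log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return [(m["name"], float(m["age"])) for m in BEACON_RE.finditer(flat)]


sample = """
   -20> 2019-07-22 14:00:05.877 7f67eacec700  0 mds.beacon.XXX Skipping
beacon heartbeat to monitors (last acked 24.0042s ago); MDS internal
heartbeat is not healthy!
"""

for name, age in skipped_beacons(sample):
    print(f"mds.{name}: last beacon acked {age:.1f}s ago")
```

Tracking this value over time shows how close a daemon is to exceeding
the beacon grace period.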

I found the following thread, which seems to be about the same general
issue:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024944.html

Unfortunately, it does not really contain a solution except things I
have tried already. Though it does give some explanation as to why the
MDSs pile up so many open inodes. It appears that Ceph can't handle many
(write-only) operations on different files, since the clients hold on to
their capabilities and the MDS can't evict the inodes from its cache.
This baffles me: how am I supposed to use a CephFS if I cannot fill it
with files first?

The next thing I tried was increasing the number of active MDSs. Three
seemed to make it worse, but four worked surprisingly well.
Unfortunately, the crash came eventually and the rank-0 MDS got kicked.
Since then the standbys have been (not very successfully) playing
round-robin to replace it, only to be kicked repeatedly. This is the
status quo right now and it has been going on for hours with no end in
sight. The only option might be to kill all MDSs and let them restart
from empty caches.

While trying to rejoin, the MDSs keep logging the above-mentioned error
message followed by

2019-07-23 17:53:37.386 7f3b135a5700  0 mds.0.cache.ino(0x100019693f8) have open dirfrag * but not leaf in fragtree_t(*^3): [dir 0x100019693f8 /XXX_12_doc_ids_part7/ [2,head] auth{1=2,2=2} v=0 cv=0/0 state=1140850688 f() n() hs=17033+0,ss=0+0 | child=1 replicated=1 0x5642a2ff7700]

and then finally

2019-07-23 17:53:48.786 7fb02bc08700  1 mds.XXX Map has assigned me to become a standby

The other thing I noticed over the last few days is that after a
sufficient number of failures, the client locks up completely and the
mount becomes unresponsive, even after the MDSs are back. Sometimes this
lock-up is so catastrophic that I cannot even unmount the share with
umount -lf anymore, and rebooting the machine makes the kernel panic.
This looks like a bug to me.

I hope somebody can provide me with tips to stabilize our setup. I can
move data through our RadosGWs over 7x10Gbps from 130 nodes in parallel,
no problem. But I cannot even rsync a few TB of files from a single node
to the CephFS without knocking out the MDS daemons.

Any help is greatly appreciated!

Janek

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com