Here's some additional information. While the MDS daemons are trying to
restore rank 0, they always load up to about 26M inodes before being
replaced by a standby:

+------+--------+-------+------------+-------+-------+
| Rank | State  | MDS   | Activity   | dns   | inos  |
+------+--------+-------+------------+-------+-------+
| 0    | rejoin | node1 |            | 26.4M | 26.4M |
| 1    | active | node2 | Reqs: 0 /s | 17.9M | 17.9M |
| 2    | active | node3 | Reqs: 0 /s | 5861k | 5861k |
| 3    | active | node4 | Reqs: 0 /s | 7913k | 7913k |
+------+--------+-------+------------+-------+-------+

While the FS was still operational, all four MDSs handled around 5-15k
requests/s according to ceph fs status, although rank 0 always tended to
handle more than the rest.

I think most of the 26M inodes from the table above piled up around or
after the first MDS crash, although "after" seems unlikely, since the
failing MDS hung up the mount and the rsync processes were put into D
state while waiting for it to come back.

On 23/07/2019 21:29, Janek Bevendorff wrote:
> Thanks for your reply.
>
> On 23/07/2019 21:03, Nathan Fish wrote:
>> What Ceph version? Do the clients match? What CPUs do the MDS servers
>> have, and how is their CPU usage when this occurs?
> Sorry, I totally forgot to mention that while transcribing my post. The
> cluster runs Nautilus (I upgraded recently). The client was still on
> Mimic when I started, but an upgrade to Nautilus did not solve any of
> the problems.
>
> The MDS nodes have Xeon E5-2620 v4 CPUs @ 2.10GHz with 32 threads (dual
> CPU with 8 physical cores each) and 128GB RAM. CPU usage is rather
> mild. While MDSs are trying to rejoin, they tend to saturate a single
> thread briefly, but nothing spectacular. During normal operation, none
> of the cores is under any particular load.
>
>> While migrating to a Nautilus cluster recently, we had up to 14
>> million inodes open, and we increased the cache limit to 16GiB. Other
>> than warnings about oversized cache, this caused no issues.
> I tried settings of 1, 2, 5, 6, 10, 20, 50, and 90GB. Other than
> getting rid of the cache size warnings (and sometimes allowing an MDS
> to rejoin without being kicked again after a few seconds), it did not
> change much in terms of the actual problem. Right now I can change it
> to whatever I want; it doesn't do anything, because rank 0 keeps
> getting thrashed anyway (the other ranks are fine, but the CephFS is
> down regardless). Is there anything useful I can give you to debug
> this? Otherwise I would try killing the MDS daemons so I can at least
> restore the CephFS to a semi-operational state.
>
>> On Tue, Jul 23, 2019 at 2:30 PM Janek Bevendorff wrote:
>>> Hi,
>>>
>>> Disclaimer: I posted this before to the ceph.io mailing list, but
>>> from the answers I didn't get and a look at the archives, I concluded
>>> that that list is very dead. So apologies if anyone has read this
>>> before.
>>>
>>> I am trying to copy our storage server to a CephFS. We have 5 MONs in
>>> our cluster and (now) 7 MDSs with max_mds = 4. The list (!) of files
>>> I am trying to copy is about 23GB, so it's a lot of files. I am
>>> copying them in batches of 25k using 16 parallel rsync processes over
>>> a 10G link.
>>>
>>> I started out with 5 MDSs / 2 active, but had repeated issues with
>>> immense and growing cache sizes far beyond the theoretical maximum of
>>> 400k inodes which the 16 rsync processes could keep open at the same
>>> time. The usual inode count was between 1 and 4 million and the cache
>>> size between 20 and 80GB on average.
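
To make the arithmetic behind that 400k figure explicit: the copy job is
essentially a batch wrapper around rsync, along the lines of the sketch
below. The file list, paths, and batch prefix are placeholders, not our
actual setup, but the shape is the same: 16 workers x 25k files per
batch = at most 400k files (and hence caps/inodes) held open at once.

    # Split the full file list into 25k-line batches (hypothetical paths).
    split -l 25000 filelist.txt batch_
    # Feed the batches to 16 parallel rsync workers.
    ls batch_* | xargs -P 16 -I{} \
        rsync -a --files-from={} /srv/storage/ /mnt/cephfs/storage/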
>>>
>>> After a while, the MDSs started failing under this load by either
>>> crashing or being kicked from the quorum. I tried increasing the max
>>> cache size, max log segments, and beacon grace period, but to no
>>> avail. A crashed MDS often needs minutes to rejoin.
>>>
>>> The MDSs fail with the following message:
>>>
>>> -21> 2019-07-22 14:00:05.877 7f67eacec700  1 heartbeat_map is_healthy
>>> 'MDSRank' had timed out after 15
>>> -20> 2019-07-22 14:00:05.877 7f67eacec700  0 mds.beacon.XXX Skipping
>>> beacon heartbeat to monitors (last acked 24.0042s ago); MDS internal
>>> heartbeat is not healthy!
>>>
>>> I found the following thread, which seems to be about the same
>>> general issue:
>>>
>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024944.html
>>>
>>> Unfortunately, it does not really contain a solution beyond things I
>>> have tried already, though it does give some explanation as to why
>>> the MDSs pile up so many open inodes: Ceph appears unable to handle
>>> many (write-only) operations on different files, because the clients
>>> keep their capabilities open and the MDS cannot evict them from its
>>> cache. This is very baffling to me: how am I supposed to use a CephFS
>>> if I cannot fill it with files first?
>>>
>>> The next thing I tried was increasing the number of active MDSs.
>>> Three seemed to make it worse, but four worked surprisingly well.
>>> Unfortunately, the crash came eventually and the rank-0 MDS got
>>> kicked. Since then, the standbys have been (not very successfully)
>>> playing round-robin to replace it, only to be kicked repeatedly. This
>>> is the status quo right now and it has been going on for hours with
>>> no end in sight. The only option might be to kill all MDSs and let
>>> them restart from empty caches.
>>>
>>> While trying to rejoin, the MDSs keep logging the above-mentioned
>>> error message, followed by
>>>
>>> 2019-07-23 17:53:37.386 7f3b135a5700  0 mds.0.cache.ino(0x100019693f8)
>>> have open dirfrag * but not leaf in fragtree_t(*^3): [dir 0x100019693f8
>>> /XXX_12_doc_ids_part7/ [2,head] auth{1=2,2=2} v=0 cv=0/0
>>> state=1140850688 f() n() hs=17033+0,ss=0+0 | child=1 replicated=1
>>> 0x5642a2ff7700]
>>>
>>> and then finally
>>>
>>> 2019-07-23 17:53:48.786 7fb02bc08700  1 mds.XXX Map has assigned me
>>> to become a standby
>>>
>>> The other thing I noticed over the last few days is that after a
>>> sufficient number of failures, the client locks up completely and the
>>> mount becomes unresponsive, even after the MDSs are back. Sometimes
>>> this lock-up is so catastrophic that I cannot even unmount the share
>>> with umount -lf anymore, and rebooting the machine causes a kernel
>>> panic. This looks like a bug to me.
>>>
>>> I hope somebody can provide me with tips to stabilize our setup. I
>>> can move data through our RadosGWs over 7x10Gbps from 130 nodes in
>>> parallel, no problem. But I cannot even rsync a few TB of files from
>>> a single node to the CephFS without knocking out the MDS daemons.
>>>
>>> Any help is greatly appreciated!
>>>
>>> Janek
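
For completeness, the tuning mentioned above was applied roughly as
follows. This is Nautilus syntax via "ceph config set"; the values shown
are examples of what I tried, not a recommendation:

    # Raise the MDS cache memory limit (here: 16 GiB).
    ceph config set mds mds_cache_memory_limit 17179869184
    # Allow more journal segments before trimming is forced.
    ceph config set mds mds_log_max_segments 256
    # Give MDSs more time before the MONs remove them from the map.
    ceph config set global mds_beacon_grace 120

As said above, none of these changed the fundamental behavior.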
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com