Re: MDS fails repeatedly while handling many concurrent meta data operations

> Multiple active MDSs are a somewhat new feature, and they might obscure
> debugging information.
> I'm not sure what the best way to restore stability temporarily is,
> but if you can manage it,
> I would go down to one MDS, crank up the debugging, and try to
> reproduce the problem.

That didn't work, since Ceph wouldn't stop any of the extra ranks until
rank 0 was back up, which never happened. I tried restarting all MDS
daemons, which usually worked in the past, but didn't this time around.
So I increased the beacon grace time even further, to 120 seconds, which
finally did the job. Apparently the MDS needed around 45 seconds to load
those 26M inodes, well beyond the default grace of 15 seconds.
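
For reference, this is roughly what I set (value from memory; as far as
I understand, mds_beacon_grace is also read by the mons, so I set it
globally rather than per daemon):

    # give a rejoining MDS more time before the mons mark it laggy
    # and kick it (the default grace is 15 seconds)
    ceph config set global mds_beacon_grace 120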

I have it working again now with just one MDS. I started using multiple
MDSs because one didn't seem to be able to handle all the metadata
operations. It's probably worth noting that with rsync, this situation
only gets more difficult each time I have to restart the job, because
for all those millions of files, the ratio of metadata operations to
actual data transferred (and therefore time spent doing it) gets worse
with every pass.
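
Going back to a single active MDS afterwards was just a matter of
lowering max_mds again (the file system name is a placeholder); as far
as I can tell, Nautilus stops the surplus ranks automatically once rank
0 is healthy:

    ceph fs set <fsname> max_mds 1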

I tried cranking up the debug level to 20, but then I could only get up
to about 70 reqs/s, which is pathetic. I did notice a lot of trimming
messages in the logs, though, even after all clients had disconnected.
It took a while for it to settle. I now have debugging set to 5 and am
copying with 16 processes again. Depending on what folders are being
copied at the moment, I get around 4-12k reqs/s. With 120 seconds beacon
grace time, it hasn't crashed yet (which is remarkable considering my
previous tries), but I expect that it will at some point, because the
number of inodes is only going up. I cracked the 60GB cache mark rather
quickly. Right now it's at around 42M inodes and 93GB of cache. This can
only go on for so long before it hits the physical limits of the node.
I should add that at this point, after so many half-successful
attempts, almost all rsync jobs finish without transferring any actual
data.
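
In case anyone wants to follow along, this is roughly how I'm adjusting
the debug level at runtime and keeping an eye on the cache (the daemon
name is a placeholder):

    # raise or lower MDS debug logging without a restart
    ceph tell mds.* injectargs '--debug_mds 5'

    # per-rank requests/s and inode counts, plus cache memory usage
    ceph fs status
    ceph daemon mds.<name> cache status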

> How are your OSDs configured? Are they HDDs? Do you have WAL and/or DB
> devices on SSDs?
> Is the metadata pool on SSDs?

No, we do not have any dedicated WAL/DB (journal) devices, and the
metadata pool is not on SSDs either. Our cluster has 1216 OSDs at the
moment, all of them 10TB SAS spinning disks. That doesn't really seem to
be the problem, though: I can copy multiple GB per second into the
cluster via RGW without any issues.
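
(Should we ever add SSD OSDs, I assume moving the metadata pool onto
them would look roughly like this; the rule name and metadata pool name
below are placeholders:)

    # replicated CRUSH rule restricted to OSDs with device class "ssd"
    ceph osd crush rule create-replicated ssd-only default host ssd

    # switch the CephFS metadata pool over to that rule
    ceph osd pool set <metadata_pool> crush_rule ssd-only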


> On Tue, Jul 23, 2019 at 4:06 PM Janek Bevendorff wrote:
>> Thanks for your reply.
>>
>> On 23/07/2019 21:03, Nathan Fish wrote:
>>> What Ceph version? Do the clients match? What CPUs do the MDS servers
>>> have, and how is their CPU usage when this occurs?
>> Sorry, I totally forgot to mention that while transcribing my post. The
>> cluster runs Nautilus (I upgraded recently). The client still had Mimic
>> when I started, but an upgrade to Nautilus did not solve any of the
>> problems.
>>
>> The MDS nodes have Xeon E5-2620 v4 CPUs @ 2.10GHz with 32 threads (dual
>> CPU with 8 physical cores each) and 128GB RAM. CPU usage is rather mild.
>> While MDSs are trying to rejoin, they tend to saturate a single thread
>> briefly, but nothing spectacular. During normal operation, none of the
>> cores is under any significant load.
>>
>>> While migrating to a Nautilus cluster recently, we had up to 14
>>> million inodes open, and we increased the cache limit to 16GiB. Other
>>> than warnings about oversized cache, this caused no issues.
>> I tried settings of 1, 2, 5, 6, 10, 20, 50, and 90GB. Other than getting
>> rid of the cache size warnings (and sometimes allowing an MDS to rejoin
>> without being kicked again after a few seconds), it did not change much
>> in terms of the actual problem. Right now I can set it to whatever I
>> want; it makes no difference, because rank 0 keeps being trashed anyway
>> (the other ranks are fine, but the CephFS is down regardless). Is there
>> anything useful I can give you to debug this? Otherwise I would try
>> killing the MDS daemons so I can at least restore the CephFS to a
>> semi-operational state.
>>
>>
>>> On Tue, Jul 23, 2019 at 2:30 PM Janek Bevendorff wrote:
>>>> Hi,
>>>>
>>>> Disclaimer: I posted this before to the ceph.io mailing list, but from
>>>> the answers I didn't get and a look at the archives, I concluded that
>>>> that list is very dead. So apologies if anyone has read this before.
>>>>
>>>> I am trying to copy our storage server to a CephFS. We have 5 MONs in
>>>> our cluster and (now) 7 MDSs with max_mds = 4. The list (!) of files I am
>>>> trying to copy is about 23GB, so it's a lot of files. I am copying them
>>>> in batches of 25k using 16 parallel rsync processes over a 10G link.
>>>>
>>>> I started out with 5 MDSs / 2 active, but had repeated issues with
>>>> immense and growing cache sizes far beyond the theoretical maximum of
>>>> 400k inodes which the 16 rsync processes could keep open at the same
>>>> time. The usual inode count was between 1 and 4 million and the cache
>>>> size between 20 and 80GB on average.
>>>>
>>>> After a while, the MDSs started failing under this load by either
>>>> crashing or being kicked from the quorum. I tried increasing the max
>>>> cache size, max log segments, and beacon grace period, but to no avail.
>>>> A crashed MDS often needs minutes to rejoin.
>>>>
>>>> The MDSs fail with the following message:
>>>>
>>>>    -21> 2019-07-22 14:00:05.877 7f67eacec700  1 heartbeat_map is_healthy
>>>> 'MDSRank' had timed out after 15
>>>>    -20> 2019-07-22 14:00:05.877 7f67eacec700  0 mds.beacon.XXX Skipping
>>>> beacon heartbeat to monitors (last acked 24.0042s ago); MDS internal
>>>> heartbeat is not healthy!
>>>>
>>>> I found the following thread, which seems to be about the same general
>>>> issue:
>>>>
>>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024944.html
>>>>
>>>> Unfortunately, it does not really contain a solution except things I
>>>> have tried already. Though it does give some explanation as to why the
>>>> MDSs pile up so many open inodes. It appears that Ceph can't handle many
>>>> (write-only) operations on different files, since the clients keep their
>>>> capabilities open and the MDS can't evict them from its cache. This is
>>>> very baffling to me: how am I supposed to use a CephFS if I cannot fill
>>>> it with files in the first place?
>>>>
>>>> The next thing I tried was increasing the number of active MDSs. Three
>>>> seemed to make it worse, but four worked surprisingly well.
>>>> Unfortunately, the crash came eventually and the rank-0 MDS got kicked.
>>>> Since then the standbys have been (not very successfully) playing
>>>> round-robin to replace it, only to be kicked repeatedly. This is the
>>>> status quo right now, and it has been going on for hours with no end in
>>>> sight. The only option might be to kill all MDSs and let them restart
>>>> from empty caches.
>>>>
>>>> While trying to rejoin, the MDSs keep logging the above-mentioned error
>>>> message followed by
>>>>
>>>> 2019-07-23 17:53:37.386 7f3b135a5700  0 mds.0.cache.ino(0x100019693f8)
>>>> have open dirfrag * but not leaf in fragtree_t(*^3): [dir 0x100019693f8
>>>> /XXX_12_doc_ids_part7/ [2,head] auth{1=2,2=2} v=0 cv=0/0
>>>> state=1140850688 f() n() hs=17033+0,ss=0+0 | child=1 replicated=1
>>>> 0x5642a2ff7700]
>>>>
>>>> and then finally
>>>>
>>>> 2019-07-23 17:53:48.786 7fb02bc08700  1 mds.XXX Map has assigned me to
>>>> become a standby
>>>>
>>>> The other thing I noticed over the last few days is that after a
>>>> sufficient number of failures, the client locks up completely and the
>>>> mount becomes unresponsive, even after the MDSs are back. Sometimes this
>>>> lock-up is so catastrophic that I cannot even unmount the share with
>>>> umount -lf anymore, and rebooting the machine causes a kernel panic.
>>>> This looks like a bug to me.
>>>>
>>>> I hope somebody can provide me with tips to stabilize our setup. I can
>>>> move data through our RadosGWs over 7x10Gbps from 130 nodes in parallel,
>>>> no problem. But I cannot even rsync a few TB of files from a single node
>>>> to the CephFS without knocking out the MDS daemons.
>>>>
>>>> Any help is greatly appreciated!
>>>>
>>>> Janek
>>>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



