Multiple active MDSs are a somewhat new feature, and they might obscure debugging information. I'm not sure of the best way to restore stability temporarily, but if you can manage it, I would go down to one MDS, crank up the debugging, and try to reproduce the problem.

How are your OSDs configured? Are they HDDs? Do you have WAL and/or DB devices on SSDs? Is the metadata pool on SSDs?

(I've put rough command sketches inline below, next to the relevant parts of your mail.)

On Tue, Jul 23, 2019 at 4:06 PM Janek Bevendorff
<janek.bevendorff@xxxxxxxxxxxxx> wrote:
>
> Thanks for your reply.
>
> On 23/07/2019 21:03, Nathan Fish wrote:
> > What Ceph version? Do the clients match? What CPUs do the MDS servers
> > have, and how is their CPU usage when this occurs?
>
> Sorry, I totally forgot to mention that while transcribing my post. The cluster runs Nautilus (I upgraded recently). The client still had Mimic when I started, but an upgrade to Nautilus did not solve any of the problems.
>
> The MDS nodes have Xeon E5-2620 v4 CPUs @2.10GHz with 32 threads (dual CPUs with 8 physical cores each) and 128GB RAM. CPU usage is rather mild. While MDSs are trying to rejoin, they tend to briefly saturate a single thread, but nothing spectacular. During normal operation, none of the cores is particularly under load.
>
> > While migrating to a Nautilus cluster recently, we had up to 14
> > million inodes open, and we increased the cache limit to 16GiB. Other
> > than warnings about oversized cache, this caused no issues.
>
> I tried settings of 1, 2, 5, 6, 10, 20, 50, and 90GB. Other than getting rid of the cache size warnings (and sometimes allowing an MDS to rejoin without being kicked again after a few seconds), it did not change much in terms of the actual problem. Right now I can change it to whatever I want; it doesn't do anything, because rank 0 keeps being trashed anyway (the other ranks are fine, but the CephFS is down anyway). Is there anything useful I can give you to debug this? Otherwise I would try killing the MDS daemons so I can at least restore the CephFS to a semi-operational state.
>
>
> On Tue, Jul 23, 2019 at 2:30 PM Janek Bevendorff wrote:
> >> Hi,
> >>
> >> Disclaimer: I posted this before to the ceph.io mailing list, but from the answers I didn't get and a look at the archives, I concluded that that list is very dead. So apologies if anyone has read this before.
> >>
> >> I am trying to copy our storage server to a CephFS. We have 5 MONs in our cluster and (now) 7 MDSs with max_mds = 4. The list (!) of files I am trying to copy is about 23GB, so it's a lot of files. I am copying them in batches of 25k using 16 parallel rsync processes over a 10G link.
> >>
> >> I started out with 5 MDSs / 2 active, but had repeated issues with immense and growing cache sizes far beyond the theoretical maximum of 400k inodes which the 16 rsync processes could keep open at the same time. The usual inode count was between 1 and 4 million and the cache size between 20 and 80GB on average.
> >>
> >> After a while, the MDSs started failing under this load by either crashing or being kicked from the quorum. I tried increasing the max cache size, max log segments, and beacon grace period, but to no avail. A crashed MDS often needs minutes to rejoin.
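Interjecting right here, since this is the part I would double-check first: make sure those limits actually reached the running daemons and were not just written to a local ceph.conf. Purely a rough, untested sketch of what I mean, assuming Nautilus's centralized config; the values and the mds.<name> / <fs_name> placeholders are only examples, not recommendations:

  ceph config set mds mds_cache_memory_limit 17179869184   # e.g. 16 GiB; size it to the RAM you actually have
  ceph config set mds mds_log_max_segments 256
  ceph config set mds mds_beacon_grace 60
  ceph config set mon mds_beacon_grace 60                   # the MONs also consult this before kicking a laggy MDS
  ceph daemon mds.<name> config get mds_cache_memory_limit  # confirm what a running MDS actually uses

And for dropping back to a single active MDS with more verbose logging while you try to reproduce:

  ceph fs set <fs_name> max_mds 1
  ceph config set mds debug_mds 20
  ceph config set mds debug_ms 1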
> >>
> >> The MDSs fail with the following message:
> >>
> >> -21> 2019-07-22 14:00:05.877 7f67eacec700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> >> -20> 2019-07-22 14:00:05.877 7f67eacec700 0 mds.beacon.XXX Skipping beacon heartbeat to monitors (last acked 24.0042s ago); MDS internal heartbeat is not healthy!
> >>
> >> I found the following thread, which seems to be about the same general issue:
> >>
> >> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024944.html
> >>
> >> Unfortunately, it does not really contain a solution except things I have tried already, though it does give some explanation as to why the MDSs pile up so many open inodes. It appears that Ceph can't handle many (write-only) operations on different files, since the clients keep their capabilities open and the MDS can't evict them from its cache. This is very baffling to me, since how am I supposed to use a CephFS if I cannot fill it with files first?
> >>
> >> The next thing I tried was increasing the number of active MDSs. Three seemed to make it worse, but four worked surprisingly well. Unfortunately, the crash came eventually and the rank-0 MDS got kicked. Since then the standbys have been (not very successfully) playing round-robin to replace it, only to be kicked repeatedly. This is the status quo right now and it has been going on for hours with no end in sight. The only option might be to kill all MDSs and let them restart from empty caches.
> >>
> >> While trying to rejoin, the MDSs keep logging the above-mentioned error message followed by
> >>
> >> 2019-07-23 17:53:37.386 7f3b135a5700 0 mds.0.cache.ino(0x100019693f8) have open dirfrag * but not leaf in fragtree_t(*^3): [dir 0x100019693f8 /XXX_12_doc_ids_part7/ [2,head] auth{1=2,2=2} v=0 cv=0/0 state=1140850688 f() n() hs=17033+0,ss=0+0 | child=1 replicated=1 0x5642a2ff7700]
> >>
> >> and then finally
> >>
> >> 2019-07-23 17:53:48.786 7fb02bc08700 1 mds.XXX Map has assigned me to become a standby
> >>
> >> The other thing I noticed over the last few days is that after a sufficient number of failures, the client locks up completely and the mount becomes unresponsive, even after the MDSs are back. Sometimes this lock-up is so catastrophic that I cannot even unmount the share with umount -lf anymore, and rebooting the machine ends in a kernel panic. This looks like a bug to me.
> >>
> >> I hope somebody can provide me with tips to stabilize our setup. I can move data through our RadosGWs over 7x10Gbps from 130 nodes in parallel, no problem. But I cannot even rsync a few TB of files from a single node to the CephFS without knocking out the MDS daemons.
> >>
> >> Any help is greatly appreciated!
> >>
> >> Janek
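To answer the "is there anything useful I can give you" question above: a log from the rank-0 MDS with debugging turned up while it cycles through rejoin, plus a snapshot of what the daemon and the kernel client are doing at that moment, would be the most useful. Only a rough sketch of how I'd capture that; daemon and pool names are placeholders, and the last command assumes the kernel client with debugfs enabled:

  ceph fs status                              # which daemon currently holds rank 0 and what state it is in
  ceph health detail
  ceph daemon mds.<name> dump_ops_in_flight   # in-flight operations on the struggling MDS
  ceph daemon mds.<name> session ls           # which clients hold how many caps
  ceph daemon mds.<name> perf dump            # cache/inode counters, sampled a few times

  # on the client with the hung mount:
  cat /sys/kernel/debug/ceph/*/mdsc           # requests stuck waiting on an MDS

That would also (roughly) answer my hardware questions from the top:

  ceph osd tree                                  # the device class column shows hdd vs ssd
  ceph osd pool get <metadata_pool> crush_rule   # whether CephFS metadata is pinned to SSD OSDs

As for the unkillable mounts: a kernel client whose MDS requests never complete tends to block in uninterruptible sleep, which would be consistent with umount -lf hanging and the panic on reboot; getting rank 0 stable again is the real fix for that part.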