Addition: the MDS just crashed with a cache size of over 100GB. Nothing in the logs, though (not at this level, at least). It just went from spamming "handle_client_request" to its usual bootup log spill.

On 23/07/2019 23:52, Janek Bevendorff wrote:
>> Running multiple active MDSs is a somewhat new feature, and it might obscure debugging information.
>> I'm not sure what the best way to restore stability temporarily is, but if you can manage it, I would go down to one MDS, crank up the debugging, and try to reproduce the problem.
> That didn't work, since Ceph wouldn't drop any ranks until rank 0 was back up, which never happened. I tried restarting all MDS daemons, which usually worked in the past, but didn't this time around. So I increased the beacon grace time even further to 120 seconds, which actually did the job. Apparently 45 seconds is the time it needed to load those 26M inodes (the default is 15 seconds).
>
> I have it working again now with just one MDS. I started using multiple MDSs because a single one didn't seem to be able to handle all the metadata operations. It's probably worth noting that with rsync, the situation only gets more difficult each time I have to restart the job, because for all those millions of files, the ratio of metadata operations to actual data transferred (and therefore time spent on it) gets worse with every run.
>
> I tried cranking the debug level up to 20, but then I could only get up to about 70 reqs/s, which is pathetic. I did notice a lot of trimming messages in the logs, though, even after all clients had disconnected; it took a while to settle. I now have debugging set to 5 and am copying with 16 processes again. Depending on which folders are being copied at the moment, I get around 4-12k reqs/s. With a 120-second beacon grace time, it hasn't crashed yet (which is remarkable considering my previous tries), but I expect that it will at some point, because the number of inodes is only going up. I cracked the 60GB cache mark rather quickly; right now it's at around 42M inodes and 93GB of cache. This can only go on for so long until it reaches the physical limit of the node. I should add that at this point, after so many half-successful attempts, almost all rsync jobs finish with no actual data transferred at all.
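For the archives, the knobs discussed above boil down to roughly the following. This is only a sketch of what I ran: <fs_name> and mds.<name> are placeholders for our local names, and the values are simply the ones mentioned in this thread, not a recommendation.

    # go back to a single active MDS
    ceph fs set <fs_name> max_mds 1

    # give MDSs longer to report their beacons before the MONs replace them
    # (default is 15s; set in the global section so the MONs pick it up too)
    ceph config set global mds_beacon_grace 120

    # MDS cache memory limit, in bytes (I have tried everything from 1GB to 90GB)
    ceph config set mds mds_cache_memory_limit 21474836480   # 20GB

    # MDS debug logging; 20 dropped me to ~70 reqs/s, 5 is tolerable
    ceph config set mds debug_mds 5

    # inode count and cache usage of a running MDS, via its admin socket
    ceph daemon mds.<name> cache status

As noted above, the max_mds reduction did not actually take effect until rank 0 was back up again.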
>> How are your OSDs configured? Are they HDDs? Do you have WAL and/or DB devices on SSDs?
>> Is the metadata pool on SSDs?
> No, we do not have any dedicated journal drives. Our cluster has 1216 OSDs at the moment, which are 10TB SAS spinning disks. That doesn't really seem to be the problem, though: I can copy multiple GB per second into the cluster via RGW without any issues.
>
>
>> On Tue, Jul 23, 2019 at 4:06 PM Janek Bevendorff wrote:
>>> Thanks for your reply.
>>>
>>> On 23/07/2019 21:03, Nathan Fish wrote:
>>>> What Ceph version? Do the clients match? What CPUs do the MDS servers have, and how is their CPU usage when this occurs?
>>> Sorry, I totally forgot to mention that while transcribing my post. The cluster runs Nautilus (I upgraded recently). The client still had Mimic when I started, but an upgrade to Nautilus did not solve any of the problems.
>>>
>>> The MDS nodes have Xeon E5-2620 v4 CPUs @ 2.10GHz with 32 threads (dual CPUs with 8 physical cores each) and 128GB of RAM. CPU usage is rather mild. While MDSs are trying to rejoin, they briefly saturate a single thread, but nothing spectacular. During normal operation, none of the cores is particularly under load.
>>>
>>>> While migrating to a Nautilus cluster recently, we had up to 14 million inodes open, and we increased the cache limit to 16GiB. Other than warnings about oversized cache, this caused no issues.
>>> I tried settings of 1, 2, 5, 6, 10, 20, 50, and 90GB. Other than getting rid of the cache size warnings (and sometimes allowing an MDS to rejoin without being kicked again after a few seconds), it did not change much in terms of the actual problem. Right now I can set it to whatever I want and it doesn't do anything, because rank 0 keeps getting trashed anyway (the other ranks are fine, but the CephFS is down regardless). Is there anything useful I can give you to debug this? Otherwise I would try killing the MDS daemons so I can at least restore the CephFS to a semi-operational state.
>>>
>>>
>>>> On Tue, Jul 23, 2019 at 2:30 PM Janek Bevendorff wrote:
>>>>> Hi,
>>>>>
>>>>> Disclaimer: I posted this to the ceph.io mailing list before, but judging by the answers I didn't get and a look at the archives, I concluded that that list is pretty much dead. So apologies if anyone has read this before.
>>>>>
>>>>> I am trying to copy our storage server to a CephFS. We have 5 MONs in our cluster and (now) 7 MDSs with max_mds = 4. The list (!) of files I am trying to copy is about 23GB, so it's a lot of files. I am copying them in batches of 25k using 16 parallel rsync processes over a 10G link.
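(Side note in case someone wants to reproduce this kind of load: the copy job itself is nothing exotic. It is essentially the file list split into 25k-line chunks and fed to parallel rsyncs, roughly like the sketch below; the paths are of course placeholders for our local ones.)

    # split the file list into batches of 25k paths each
    # (-a 4 gives a wide enough suffix for thousands of batch files)
    split -a 4 -l 25000 filelist.txt /tmp/batch.

    # run up to 16 rsyncs in parallel, one batch file per process
    ls /tmp/batch.* | xargs -P 16 -I{} \
        rsync -a --files-from={} /local/source/ /mnt/cephfs/destination/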
>>>>> I started out with 5 MDSs / 2 active, but had repeated issues with immense and growing cache sizes far beyond the theoretical maximum of 400k inodes that the 16 rsync processes could keep open at the same time. The usual inode count was between 1 and 4 million and the cache size between 20 and 80GB on average.
>>>>>
>>>>> After a while, the MDSs started failing under this load by either crashing or being kicked from the quorum. I tried increasing the max cache size, the max log segments, and the beacon grace period, but to no avail. A crashed MDS often needs minutes to rejoin.
>>>>>
>>>>> The MDSs fail with the following message:
>>>>>
>>>>> -21> 2019-07-22 14:00:05.877 7f67eacec700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
>>>>> -20> 2019-07-22 14:00:05.877 7f67eacec700 0 mds.beacon.XXX Skipping beacon heartbeat to monitors (last acked 24.0042s ago); MDS internal heartbeat is not healthy!
>>>>>
>>>>> I found the following thread, which seems to be about the same general issue:
>>>>>
>>>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024944.html
>>>>>
>>>>> Unfortunately, it does not really contain a solution beyond things I have already tried, though it does give some explanation as to why the MDSs pile up so many open inodes. It appears that Ceph can't handle many (write-only) operations on different files, since the clients keep their capabilities open and the MDS can't evict them from its cache. This is very baffling to me: how am I supposed to use a CephFS if I cannot even fill it with files first?
>>>>>
>>>>> The next thing I tried was increasing the number of active MDSs. Three seemed to make it worse, but four worked surprisingly well. Unfortunately, the crash came eventually and the rank-0 MDS got kicked. Since then the standbys have been (not very successfully) playing round-robin trying to replace it, only to be kicked repeatedly. This is the status quo right now and it has been going on for hours with no end in sight. The only option might be to kill all MDSs and let them restart from empty caches.
>>>>>
>>>>> While trying to rejoin, the MDSs keep logging the above-mentioned error message followed by
>>>>>
>>>>> 2019-07-23 17:53:37.386 7f3b135a5700 0 mds.0.cache.ino(0x100019693f8) have open dirfrag * but not leaf in fragtree_t(*^3): [dir 0x100019693f8 /XXX_12_doc_ids_part7/ [2,head] auth{1=2,2=2} v=0 cv=0/0 state=1140850688 f() n() hs=17033+0,ss=0+0 | child=1 replicated=1 0x5642a2ff7700]
>>>>>
>>>>> and then finally
>>>>>
>>>>> 2019-07-23 17:53:48.786 7fb02bc08700 1 mds.XXX Map has assigned me to become a standby
>>>>>
>>>>> The other thing I have noticed over the last few days is that after a sufficient number of failures, the client locks up completely and the mount becomes unresponsive, even after the MDSs are back. Sometimes this lock-up is so catastrophic that I cannot even unmount the share with umount -lf anymore, and rebooting the machine ends in a kernel panic. This looks like a bug to me.
>>>>>
>>>>> I hope somebody can provide me with tips to stabilize our setup. I can move data through our RadosGWs over 7x10Gbps from 130 nodes in parallel, no problem. But I cannot even rsync a few TB of files from a single node to the CephFS without knocking out the MDS daemons.
>>>>>
>>>>> Any help is greatly appreciated!
>>>>>
>>>>> Janek

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com