On Mon, Feb 26, 2018 at 4:50 PM, Oliver Freyermuth
<freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
> On 26.02.2018 at 17:15, John Spray wrote:
>> On Mon, Feb 26, 2018 at 4:06 PM, Oliver Freyermuth
>> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>> On 26.02.2018 at 16:43, Patrick Donnelly wrote:
>>>> On Sun, Feb 25, 2018 at 3:49 PM, Oliver Freyermuth
>>>> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>>>> On 25.02.2018 at 21:50, John Spray wrote:
>>>>>> On Sun, Feb 25, 2018 at 4:45 PM, Oliver Freyermuth
>>>>>>> Now, with about 100,000,000 objects written, we are in a disaster situation.
>>>>>>> First off, the MDS could not restart anymore - it required >40 GB of memory,
>>>>>>> which (together with the 2 OSDs on the MDS host) exceeded RAM and swap.
>>>>>>> So it tried to recover and OOMed quickly afterwards. Replay was reasonably
>>>>>>> fast, but rejoin took many minutes:
>>>>>>> 2018-02-25 04:16:02.299107 7fe20ce1f700 1 mds.0.17657 rejoin_start
>>>>>>> 2018-02-25 04:19:00.618514 7fe20ce1f700 1 mds.0.17657 rejoin_joint_start
>>>>>>> and finally, 5 minutes later, OOM.
>>>>>>>
>>>>>>> I stopped half of the stress-test tars, which did not help - then I rebooted
>>>>>>> half of the clients, which did help and let the MDS recover just fine.
>>>>>>> So it seems there were too many client caps for the MDS to handle. I'm unsure
>>>>>>> why "tar" would cause so many open file handles.
>>>>>>> Is there anything that can be configured to prevent this from happening?
>>>>>>
>>>>>> Clients will generally hold onto capabilities for files they've
>>>>>> written out -- this is pretty sub-optimal for many workloads where
>>>>>> files are written out but not likely to be accessed again in the near
>>>>>> future. While clients hold these capabilities, the MDS cannot drop
>>>>>> things from its own cache.
>>>>>>
>>>>>> The way this is *meant* to work is that the MDS hits its cache size
>>>>>> limit, and sends a message to clients asking them to drop some files
>>>>>> from their local cache, and consequently release those capabilities.
>>>>>> However, this has historically been a tricky area with ceph-fuse
>>>>>> clients (there are some hacks for detecting kernel version and using
>>>>>> different mechanisms for different versions of fuse), and it's
>>>>>> possible that on your clients this mechanism is simply not working,
>>>>>> leading to a severely oversized MDS cache.
>>>>>>
>>>>>> The MDS should have been showing health alerts in "ceph status" about
>>>>>> this, but I suppose it's possible that it wasn't surviving long enough
>>>>>> to hit the timeout (60s) that we apply for warning about misbehaving
>>>>>> clients? It would be good to check the cluster log to see if you were
>>>>>> getting any health messages along the lines of "Client xyz failing to
>>>>>> respond to cache pressure".
>>>>>
>>>>> This indeed explains the high memory usage.
>>>>> I can also confirm seeing those health alerts, now that I check the logs.
>>>>> The systems (servers and clients) have all been running CentOS 7.4
>>>>> exclusively, so the kernels are rather old, but I would have hoped the
>>>>> relevant fixes had been backported by Red Hat.
>>>>>
>>>>> Is there anything one can do to limit clients' cache sizes?
>>>>
>>>> You said the clients are ceph-fuse running 12.2.3? Then they should have:
>>>>
>>>> http://tracker.ceph.com/issues/22339
>>>>
>>>> (Please double-check you're not running older clients by accident.)
>>>
>>> I can confirm all clients have been running 12.2.3.
>>> Is the issue really related? It looks like a remount-failure fix.
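
A side note on Patrick's point above about double-checking the client
versions: you can also verify this from the MDS side, since each client
session records the version the client reported when it connected. A rough
sketch -- "mds.a" below is just a placeholder for your active MDS daemon
name:

    # on the active MDS host; "mds.a" stands in for your MDS daemon name
    ceph daemon mds.a session ls | grep ceph_version
    # a cluster-wide summary of connected client releases also works
    ceph features
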
>>
>> The fuse client uses a remount internally to persuade the fuse kernel
>> module to really drop things from its cache (fuse doesn't provide the
>> ideal hooks for managing this stuff in network filesystems).
>
> Thanks for the explanation, now I understand!
>
>>
>>>> I have run small file tests with ~128 clients without issue. Generally
>>>> if there is an issue it is because clients are not releasing their
>>>> capabilities properly (due to invalidation bugs which should be caught
>>>> by the above backport) or the MDS memory usage exceeds RAM. If the
>>>> clients are not releasing their capabilities, you should see the
>>>> errors John described in the cluster log.
>>>>
>>>> You said in the original post that the `mds cache memory limit = 4GB`.
>>>> If that's the case, you really shouldn't be exceeding 40GB of RAM!
>>>> It's possible you have found a bug of some kind. I suggest tracking
>>>> the MDS cache statistics (which include the inode count in cache) by
>>>> collecting a `perf dump` via the admin socket. Then you can begin to
>>>> find out what's consuming all of the MDS memory.
>>>>
>>>> Additionally, I concur with John on digging into why the MDS is
>>>> missing heartbeats by collecting debug logs (`debug mds = 15`) at that
>>>> time. It may also shed light on the issue.
>>>
>>> Yes, I confirmed this earlier - indeed I found the "failing to respond
>>> to cache pressure" alerts in the logs.
>>> The RAM excess was initially "only" about 50-100 %, which was still
>>> fine - the main issue started after I tested MDS failover in this
>>> situation.
>>> If I understand correctly, the clients are only prevented from growing
>>> their caps to huge values if an MDS is running and actively preventing
>>> them from doing so. Correct?
>>
>> The clients have their own per-client limit on cache size
>> (client_cache_size) that they apply locally. They'll only hold caps
>> on things they have in cache, so this indirectly controls how many
>> caps they will ask for. However, if you were hitting 22339 or a
>> similar issue then even this limit may not be properly enforced.
>
> OK, understood. We have not touched this yet, so it should be 16384 inodes.
>
>
> There is another peculiarity in our setup, and I am not sure whether it
> matters.
> Our stress test was run with the same approach that users' analysis jobs
> will use later on.
> We use HTCondor as our workload management system, which takes care of
> starting the individual "jobs" on the worker node machines, which are the
> CephFS clients.
> In our case, the jobs are all encapsulated inside Singularity containers,
> which open up a new namespace environment for each job.
> This includes a PID namespace and a mount namespace...
> I am unsure how exactly the remounting done by the fuse client affects the
> mount as seen inside the namespace versus the mount as seen in the host
> namespace.
> I can certainly confirm that writing and reading work fine, but I'm unsure
> how the "remounting" is affected by this peculiarity.

That is certainly interesting information. I don't have the kernel
knowledge to say whether it would affect our remounting/invalidation
paths, but it seems plausible.

From what I hear, people using CephFS for container volumes are mostly
using the kernel client rather than fuse (the kernel client also has much
better small file performance in general).
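
If you want to try the kernel client on one of the test nodes for
comparison, it's the standard kernel CephFS mount -- roughly along these
lines, where the monitor address, user name and secret file are
placeholders for whatever your deployment uses:

    # kernel CephFS mount on a test node (placeholder mon host, cephx
    # user and secret file -- adjust to your environment)
    mount -t ceph mon1.example.net:6789:/ /mnt/cephfs \
        -o name=admin,secretfile=/etc/ceph/admin.secret

That would also take the fuse remount/invalidation machinery out of the
picture entirely for the container namespace scenario you describe.
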
> From another FUSE FS:
> https://sft.its.cern.ch/jira/browse/CVM-1478
> we have already learnt that while one can even perform "umount /cvmfs/XXX/"
> on the host, the cvmfs-fuse-helper stays running, since it is still used
> inside the mount namespace of the container.
> Since I don't fully understand to what extent cephfs remounts, and how this
> interacts with mount namespaces, I am unsure whether this is related at
> all, but it surely is special to our setup (and may become more and more
> common in the future, especially in HPC).
>
> We'll try to reproduce the issue overnight (which is the last opportunity
> before we let in the first test users), and I'll certainly look at the
> perf dump on the MDS.
>
> Is there also some way I can extract info from the clients?

Fuse clients have an "admin socket" like the server daemons do; it's
usually under /var/run/ceph somewhere, and you can do
"ceph daemon <path to .asok file> help" to see the available commands --
there are various status, perf dump, etc. ones that should include things
like how many items are in cache. (See the P.S. at the bottom of this mail
for a concrete example.)

John

>
> Cheers and thanks for your input!
> Oliver
>
>>
>>> However, since the failover took a few minutes (I played with the beacon
>>> timeouts and increased mds_log_max_segments and mds_log_max_expiring to
>>> check the impact on performance), this could well have been the main
>>> cause for the huge memory consumption. Do I understand correctly that the
>>> clients may grow their number of caps to huge values if all MDSs are down
>>> for a few minutes, since nobody holds their hands?
>>
>> No -- MDS daemons issue the caps, so clients can't get more without
>> talking to an MDS.
>>
>> John
>>
>>> This could explain why, when the MDS finally tried to come back after the
>>> config changes, it was flooded with a tremendous number of caps, which
>>> did not fit into memory + swap at all. This, in turn, made the MDS and
>>> the metadata OSDs from which it was feeding (on the same machine...) very
>>> slow, so it got stuck for quite a while in the rejoin phase and missed
>>> heartbeats, triggering another failover.
>>> By the time I noticed and understood this, several failovers had already
>>> happened and about an hour had passed.
>>>
>>>
>>> If my understanding is correct, this would mean the clients had quite
>>> some time to accumulate even more caps.
>>> I then increased the beacon timeout, which gave the very sluggish MDS
>>> (swapping, waiting for the metadata OSDs to feed it) enough grace to
>>> start up - and then it ran into an OOM condition, since too many caps
>>> were held for it to ever handle with our hardware.
>>>
>>> The only way out of this seems to be to kill off the actual clients - right?
>>>
>>> So if my assumption is correct, it would help to be able to control the
>>> maximum number of caps clients can hold, even if the MDS is briefly down
>>> for some reason. Is this feasible?
>>>
>>>>
>>>> Thanks for performing the test and letting us know the results.
>>>>
>>>
>>> No problem! We are trying to push the system to its limits before our
>>> users do; we still have 1-2 days for that, want to play a bit with the
>>> read patterns of the main application framework our users will run
>>> ( https://root.cern.ch/ ), and then our first users will start to do
>>> their best to break things apart.
>>>
>>> Cheers,
>>> Oliver
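
P.S. A concrete example of the admin socket poking mentioned above, since
you'll be watching the overnight run anyway. The .asok path below is only
an illustration -- check what actually exists under /var/run/ceph on your
client nodes, as the exact filename depends on the client name:

    # on a client node: find the ceph-fuse admin socket
    ls /var/run/ceph/
    # list the commands the client supports
    ceph daemon /var/run/ceph/ceph-client.admin.asok help
    # general status, and the performance counters
    ceph daemon /var/run/ceph/ceph-client.admin.asok status
    ceph daemon /var/run/ceph/ceph-client.admin.asok perf dump

The same pattern works against the MDS socket for the `perf dump` Patrick
suggested, i.e. "ceph daemon mds.<name> perf dump" on the MDS host.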