On Sun, Feb 25, 2018 at 3:49 PM, Oliver Freyermuth
<freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
> On 25.02.2018 at 21:50, John Spray wrote:
>> On Sun, Feb 25, 2018 at 4:45 PM, Oliver Freyermuth
>>> Now, with about 100,000,000 objects written, we are in a disaster situation.
>>> First off, the MDS could not restart anymore - it required >40 GB of memory, which (together with the 2 OSDs on the MDS host) exceeded RAM and swap.
>>> So it tried to recover and OOMed quickly after. Replay was reasonably fast, but join took many minutes:
>>> 2018-02-25 04:16:02.299107 7fe20ce1f700 1 mds.0.17657 rejoin_start
>>> 2018-02-25 04:19:00.618514 7fe20ce1f700 1 mds.0.17657 rejoin_joint_start
>>> and finally, 5 minutes later, OOM.
>>>
>>> I stopped half of the stress-test tar's, which did not help - then I rebooted half of the clients, which did help and let the MDS recover just fine.
>>> So it seems the client caps have been too many for the MDS to handle. I'm unsure why "tar" would cause so many open file handles.
>>> Is there anything that can be configured to prevent this from happening?
>>
>> Clients will generally hold onto capabilities for files they've
>> written out -- this is pretty sub-optimal for many workloads where
>> files are written out but not likely to be accessed again in the near
>> future. While clients hold these capabilities, the MDS cannot drop
>> things from its own cache.
>>
>> The way this is *meant* to work is that the MDS hits its cache size
>> limit, and sends a message to clients asking them to drop some files
>> from their local cache, and consequently release those capabilities.
>> However, this has historically been a tricky area with ceph-fuse
>> clients (there are some hacks for detecting kernel version and using
>> different mechanisms for different versions of fuse), and it's
>> possible that on your clients this mechanism is simply not working,
>> leading to a severely oversized MDS cache.
>>
>> The MDS should have been showing health alerts in "ceph status" about
>> this, but I suppose it's possible that it wasn't surviving long enough
>> to hit the timeout (60s) that we apply for warning about misbehaving
>> clients? It would be good to check the cluster log to see if you were
>> getting any health messages along the lines of "Client xyz failing to
>> respond to cache pressure".
>
> This explains the high memory usage indeed.
> I can also confirm seeing those health alerts, now that I check the logs.
> The systems have been (servers and clients) all exclusively CentOS 7.4,
> so kernels are rather old, but I would have hoped things have been backported
> by RedHat.
>
> Is there anything one can do to limit client's cache sizes?

You said the clients are ceph-fuse running 12.2.3? Then they should have:
http://tracker.ceph.com/issues/22339

(Please double check you're not running older clients by accident.)

I have run small file tests with ~128 clients without issue. Generally,
if there is an issue, it is because clients are not releasing their
capabilities properly (due to invalidation bugs which should be caught
by the above backport) or because the MDS memory usage exceeds RAM. If
the clients are not releasing their capabilities, you should see the
errors John described in the cluster log.

You said in the original post that the `mds cache memory limit = 4GB`.
If that's the case, you really shouldn't be exceeding 40GB of RAM! It's
possible you have found a bug of some kind. I suggest tracking the MDS
cache statistics (which include the inode count in cache) by collecting
a `perf dump` via the admin socket. Then you can begin to find out
what's consuming all of the MDS memory.

Additionally, I concur with John on digging into why the MDS is missing
heartbeats by collecting debug logs (`debug mds = 15`) at that time. It
may also shed light on the issue.

Thanks for performing the test and letting us know the results.
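
In case it helps, here is a minimal Python sketch of the `perf dump` tracking suggested above. It assumes the script runs on the MDS host with access to the admin socket, and that `ceph daemon mds.<name> perf dump` prints JSON. The counter names (`mds.inodes`, `mds.caps`, `mds_mem.rss`) are assumptions and may differ between releases, so check them against a real dump first. The function names are hypothetical.

```python
import json
import subprocess

# Assumed counter names as (section, counter) pairs; verify them against
# the output of your own MDS before relying on them.
COUNTERS = [("mds", "inodes"), ("mds", "caps"), ("mds_mem", "rss")]


def run_perf_dump(mds_name):
    """Run `ceph daemon mds.<name> perf dump` and parse its JSON output."""
    out = subprocess.check_output(
        ["ceph", "daemon", "mds." + mds_name, "perf", "dump"]
    )
    return json.loads(out)


def pick_counters(dump, counters=COUNTERS):
    """Return {"section.counter": value} for each requested counter found."""
    picked = {}
    for section, name in counters:
        value = dump.get(section, {}).get(name)
        if value is not None:
            picked[section + "." + name] = value
    return picked


def growth(before, after):
    """Per-counter difference between two pick_counters() results."""
    return {key: after[key] - before[key] for key in after if key in before}
```

Calling `pick_counters(run_perf_dump("a"))` at intervals (with `a` standing in for your MDS name) and passing consecutive results to `growth()` should show whether the inode and cap counts keep climbing past what a 4GB cache limit would allow.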

--
Patrick Donnelly