On Mon, Feb 26, 2018 at 4:06 PM, Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
> On 26.02.2018 at 16:43, Patrick Donnelly wrote:
>> On Sun, Feb 25, 2018 at 3:49 PM, Oliver Freyermuth
>> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>> On 25.02.2018 at 21:50, John Spray wrote:
>>>> On Sun, Feb 25, 2018 at 4:45 PM, Oliver Freyermuth
>>>>> Now, with about 100,000,000 objects written, we are in a disaster situation.
>>>>> First off, the MDS could not restart anymore - it required >40 GB of memory, which (together with the 2 OSDs on the MDS host) exceeded RAM and swap.
>>>>> So it tried to recover and OOMed quickly after. Replay was reasonably fast, but rejoin took many minutes:
>>>>> 2018-02-25 04:16:02.299107 7fe20ce1f700 1 mds.0.17657 rejoin_start
>>>>> 2018-02-25 04:19:00.618514 7fe20ce1f700 1 mds.0.17657 rejoin_joint_start
>>>>> and finally, 5 minutes later, OOM.
>>>>>
>>>>> I stopped half of the stress-test tar's, which did not help - then I rebooted half of the clients, which did help and let the MDS recover just fine.
>>>>> So it seems the client caps have been too many for the MDS to handle. I'm unsure why "tar" would cause so many open file handles.
>>>>> Is there anything that can be configured to prevent this from happening?
>>>>
>>>> Clients will generally hold onto capabilities for files they've
>>>> written out -- this is pretty sub-optimal for many workloads where
>>>> files are written out but not likely to be accessed again in the near
>>>> future. While clients hold these capabilities, the MDS cannot drop
>>>> things from its own cache.
>>>>
>>>> The way this is *meant* to work is that the MDS hits its cache size
>>>> limit, and sends a message to clients asking them to drop some files
>>>> from their local cache, and consequently release those capabilities.
>>>> However, this has historically been a tricky area with ceph-fuse
>>>> clients (there are some hacks for detecting kernel version and using
>>>> different mechanisms for different versions of fuse), and it's
>>>> possible that on your clients this mechanism is simply not working,
>>>> leading to a severely oversized MDS cache.
>>>>
>>>> The MDS should have been showing health alerts in "ceph status" about
>>>> this, but I suppose it's possible that it wasn't surviving long enough
>>>> to hit the timeout (60s) that we apply for warning about misbehaving
>>>> clients? It would be good to check the cluster log to see if you were
>>>> getting any health messages along the lines of "Client xyz failing to
>>>> respond to cache pressure".
>>>
>>> This explains the high memory usage indeed.
>>> I can also confirm seeing those health alerts, now that I check the logs.
>>> The systems have been (servers and clients) all exclusively CentOS 7.4,
>>> so kernels are rather old, but I would have hoped things have been backported
>>> by RedHat.
>>>
>>> Is there anything one can do to limit clients' cache sizes?
>>
>> You said the clients are ceph-fuse running 12.2.3? Then they should have:
>>
>> http://tracker.ceph.com/issues/22339
>>
>> (Please double check you're not running older clients by accident.)
>
> I can confirm all clients have been running 12.2.3.
> Is the issue really related? It looks like a remount-failure fix.

The fuse client uses a remount internally to persuade the fuse kernel
module to really drop things from its cache (fuse doesn't provide the
ideal hooks for managing this stuff in network filesystems).

>> I have run small file tests with ~128 clients without issue. Generally
>> if there is an issue it is because clients are not releasing their
>> capabilities properly (due to invalidation bugs which should be caught
>> by the above backport) or the MDS memory usage exceeds RAM.
>> If the clients are not releasing their capabilities, you should see the
>> errors John described in the cluster log.
>>
>> You said in the original post that the `mds cache memory limit = 4GB`.
>> If that's the case, you really shouldn't be exceeding 40GB of RAM!
>> It's possible you have found a bug of some kind. I suggest tracking
>> the MDS cache statistics (which include the inode count in cache) by
>> collecting a `perf dump` via the admin socket. Then you can begin to
>> find out what's consuming all of the MDS memory.
>>
>> Additionally, I concur with John on digging into why the MDS is
>> missing heartbeats by collecting debug logs (`debug mds = 15`) at that
>> time. It may also shed light on the issue.
>
> Yes, I confirmed this earlier - indeed I found the "failing to respond to cache pressure" alerts in the logs.
> The excess of RAM initially was "only" about 50 - 100 % which was still fine - the main issue started after I tested MDS failover in this situation.
> If I understand correctly, the clients are only prevented from growing their caps to huge values if an MDS is running
> and actively preventing them from doing so. Correct?

The clients have their own per-client limit on cache size
(client_cache_size) that they apply locally. They'll only hold caps
on things they have in cache, so this indirectly controls how many
caps they will ask for. However, if you were hitting 22339 or a
similar issue then even this limit may not be properly enforced.

> However, since the failover took a few minutes (I played with the beacon timeouts and increased the mds_log_max_segments and mds_log_max_expiring to check impact on performance),
> this could well have been the main cause for the huge memory consumption. Do I understand correctly that the clients may grow their number of caps
> to huge numbers if all MDS are down for a few minutes, since nobody holds their hands?

No -- MDS daemons issue the caps, so clients can't get more without
talking to an MDS.
John

> This could explain why, when the MDS wanted to finally come back after the config changes, it was flooded with a tremendous number of caps,
> which did not fit into memory + swap at all. This, in turn, made the MDS + the metadata OSDs from which it was feeding (on the same machine...)
> very slow, so it got stuck for quite a while in rejoin / joint phases, and missed heartbeats, triggering another failover.
> As soon as I noticed and understood this, several failovers had already happened, and about an hour had passed.
>
>
> If my understanding is correct, this would mean the clients had quite some time to accumulate even more caps.
> I increased the beacon timeout then, which gave the MDS, which was very sluggish (swapping, waiting for metadata OSDs to feed it) enough grace
> to start up - and then, it ran into OOM condition, since there were too many caps held for it to ever handle with our hardware.
>
> The only way out of this seems to be to kill off the actual clients - right?
>
> So if my assumption is correct, it would help to be able to control the maximum number of caps clients can hold,
> even if the MDS is shortly down for some reason. Is this feasible?
>
>>
>> Thanks for performing the test and letting us know the results.
>>
>
> No problem! We are trying to push the system to its limits now before our users do it,
> we still have 1-2 days to do that, and want to play a bit with the read patterns of the main application framework our users will run ( https://root.cern.ch/ ),
> and then our first users will start to do their best to break things apart.
>
> Cheers,
> Oliver
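
For anyone following the thread who wants to see which clients are holding the caps discussed above, here is a rough sketch, not something posted by the participants. It assumes you have saved the JSON output of `ceph daemon mds.<name> session ls` from the MDS admin socket, and that each session entry carries `id`, `num_caps` and `client_metadata.hostname` fields. Those field names are an assumption to check against your release, and the sample data below is made up purely so the snippet runs on its own.

```python
import json

# Made-up sample, shaped like the JSON that
# `ceph daemon mds.<name> session ls` is assumed to print.
# The field names ("id", "num_caps", "client_metadata") are assumptions;
# verify them against real output from your own MDS.
SAMPLE_SESSION_LS = """
[
  {"id": 4101, "num_caps": 950000, "client_metadata": {"hostname": "wn001"}},
  {"id": 4102, "num_caps": 1200, "client_metadata": {"hostname": "wn002"}},
  {"id": 4103, "num_caps": 2300000, "client_metadata": {}}
]
"""


def caps_by_client(session_ls_json):
    """Return (num_caps, client_id, hostname) tuples, largest cap count first."""
    sessions = json.loads(session_ls_json)
    rows = []
    for session in sessions:
        # Fall back to "unknown" when a session reports no hostname.
        host = session.get("client_metadata", {}).get("hostname", "unknown")
        rows.append((session.get("num_caps", 0), session.get("id"), host))
    # Sort on the cap count only, so missing ids never get compared.
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows


def total_caps(rows):
    """Sum the cap counts over all sessions."""
    return sum(row[0] for row in rows)


if __name__ == "__main__":
    rows = caps_by_client(SAMPLE_SESSION_LS)
    for num_caps, client_id, host in rows:
        print(f"client.{client_id} ({host}): {num_caps} caps")
    print("total caps:", total_caps(rows))
```

Sampling this every few minutes during a stress test would show whether the cap count of individual clients keeps growing after the "failing to respond to cache pressure" warnings appear, which is the symptom described earlier in the thread.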