On Mon, Feb 26, 2018 at 4:50 PM, Oliver Freyermuth
<freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
> On 26.02.2018 at 17:15, John Spray wrote:
>> On Mon, Feb 26, 2018 at 4:06 PM, Oliver Freyermuth
>> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>> On 26.02.2018 at 16:43, Patrick Donnelly wrote:
>>>> On Sun, Feb 25, 2018 at 3:49 PM, Oliver Freyermuth
>>>> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>>>> On 25.02.2018 at 21:50, John Spray wrote:
>>>>>> On Sun, Feb 25, 2018 at 4:45 PM, Oliver Freyermuth
>>>>>>> Now, with about 100,000,000 objects written, we are in a disaster situation.
>>>>>>> First off, the MDS could not restart anymore - it required >40 GB of memory,
>>>>>>> which (together with the 2 OSDs on the MDS host) exceeded RAM and swap.
>>>>>>> So it tried to recover and OOMed quickly afterwards. Replay was reasonably
>>>>>>> fast, but rejoin took many minutes:
>>>>>>> 2018-02-25 04:16:02.299107 7fe20ce1f700 1 mds.0.17657 rejoin_start
>>>>>>> 2018-02-25 04:19:00.618514 7fe20ce1f700 1 mds.0.17657 rejoin_joint_start
>>>>>>> and finally, 5 minutes later, OOM.
>>>>>>>
>>>>>>> I stopped half of the stress-test tars, which did not help - then I rebooted
>>>>>>> half of the clients, which did help and let the MDS recover just fine.
>>>>>>> So it seems there were too many client caps for the MDS to handle. I'm unsure
>>>>>>> why "tar" would cause so many open file handles.
>>>>>>> Is there anything that can be configured to prevent this from happening?
>>>>>>
>>>>>> Clients will generally hold onto capabilities for files they've
>>>>>> written out -- this is pretty sub-optimal for many workloads where
>>>>>> files are written out but not likely to be accessed again in the near
>>>>>> future. While clients hold these capabilities, the MDS cannot drop
>>>>>> things from its own cache.
>>>>>>
>>>>>> The way this is *meant* to work is that the MDS hits its cache size
>>>>>> limit, and sends a message to clients asking them to drop some files
>>>>>> from their local cache, and consequently release those capabilities.
>>>>>> However, this has historically been a tricky area with ceph-fuse
>>>>>> clients (there are some hacks for detecting kernel version and using
>>>>>> different mechanisms for different versions of fuse), and it's
>>>>>> possible that on your clients this mechanism is simply not working,
>>>>>> leading to a severely oversized MDS cache.
>>>>>>
>>>>>> The MDS should have been showing health alerts in "ceph status" about
>>>>>> this, but I suppose it's possible that it wasn't surviving long enough
>>>>>> to hit the timeout (60s) that we apply for warning about misbehaving
>>>>>> clients? It would be good to check the cluster log to see if you were
>>>>>> getting any health messages along the lines of "Client xyz failing to
>>>>>> respond to cache pressure".
>>>>>
>>>>> This indeed explains the high memory usage.
>>>>> I can also confirm seeing those health alerts, now that I check the logs.
>>>>> The systems (servers and clients) have all been running CentOS 7.4
>>>>> exclusively, so the kernels are rather old, but I would have hoped the
>>>>> relevant fixes had been backported by Red Hat.
>>>>>
>>>>> Is there anything one can do to limit clients' cache sizes?
>>>>
>>>> You said the clients are ceph-fuse running 12.2.3? Then they should have:
>>>>
>>>> http://tracker.ceph.com/issues/22339
>>>>
>>>> (Please double-check you're not running older clients by accident.)
>>>
>>> I can confirm all clients have been running 12.2.3.
>>> Is the issue really related? It looks like a remount-failure fix.
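
A side note on Patrick's point above about double-checking the client
versions: you can also verify this from the MDS side, since each client
session records the version the client reported when it connected. A rough
sketch -- "mds.a" below is just a placeholder for your active MDS daemon
name:

    # on the active MDS host; "mds.a" stands in for your MDS daemon name
    ceph daemon mds.a session ls | grep ceph_version
    # a cluster-wide summary of connected client releases also works
    ceph features
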
>>
>> The fuse client uses a remount internally to persuade the fuse kernel
>> module to really drop things from its cache (fuse doesn't provide the
>> ideal hooks for managing this stuff in network filesystems).
>
> Thanks for the explanation, now I understand!
>
>>
>>>> I have run small file tests with ~128 clients without issue. Generally
>>>> if there is an issue it is because clients are not releasing their
>>>> capabilities properly (due to invalidation bugs which should be caught
>>>> by the above backport) or the MDS memory usage exceeds RAM. If the
>>>> clients are not releasing their capabilities, you should see the
>>>> errors John described in the cluster log.
>>>>
>>>> You said in the original post that the `mds cache memory limit = 4GB`.
>>>> If that's the case, you really shouldn't be exceeding 40GB of RAM!
>>>> It's possible you have found a bug of some kind. I suggest tracking
>>>> the MDS cache statistics (which include the inode count in cache) by
>>>> collecting a `perf dump` via the admin socket. Then you can begin to
>>>> find out what's consuming all of the MDS memory.
>>>>
>>>> Additionally, I concur with John on digging into why the MDS is
>>>> missing heartbeats by collecting debug logs (`debug mds = 15`) at that
>>>> time. It may also shed light on the issue.
>>>
>>> Yes, I confirmed this earlier - indeed I found the "failing to respond
>>> to cache pressure" alerts in the logs.
>>> The RAM excess was initially "only" about 50-100 %, which was still
>>> fine - the main issue started after I tested MDS failover in this
>>> situation.
>>> If I understand correctly, the clients are only prevented from growing
>>> their caps to huge values if an MDS is running and actively preventing
>>> them from doing so. Correct?
>>
>> The clients have their own per-client limit on cache size
>> (client_cache_size) that they apply locally. They'll only hold caps
>> on things they have in cache, so this indirectly controls how many
>> caps they will ask for. However, if you were hitting 22339 or a
>> similar issue then even this limit may not be properly enforced.
>
> OK, understood. We have not touched this yet, so it should be 16384 inodes.
>
>
> There is another peculiarity in our setup, and I am not sure whether it
> matters.
> Our stress test was run with the same approach that users' analysis jobs
> will use later on.
> We use HTCondor as our workload management system, which takes care of
> starting the individual "jobs" on the worker node machines, which are the
> CephFS clients.
> In our case, the jobs are all encapsulated inside Singularity containers,
> which open up a new namespace environment for each job.
> This includes a PID namespace and a mount namespace...
> I am unsure how exactly the remounting done by the fuse client affects the
> mount as seen inside the namespace versus the mount as seen in the host
> namespace.
> I can certainly confirm that writing and reading work fine, but I'm unsure
> how the "remounting" is affected by this peculiarity.

That is certainly interesting information. I don't have the kernel
knowledge to say whether it would affect our remounting/invalidation
paths, but it seems plausible.

From what I hear, people using CephFS for container volumes are mostly
using the kernel client rather than fuse (the kernel client also has much
better small file performance in general).
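
If you want to try the kernel client on one of the test nodes for
comparison, it's the standard kernel CephFS mount -- roughly along these
lines, where the monitor address, user name and secret file are
placeholders for whatever your deployment uses:

    # kernel CephFS mount on a test node (placeholder mon host, cephx
    # user and secret file -- adjust to your environment)
    mount -t ceph mon1.example.net:6789:/ /mnt/cephfs \
        -o name=admin,secretfile=/etc/ceph/admin.secret

That would also take the fuse remount/invalidation machinery out of the
picture entirely for the container namespace scenario you describe.
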
> From another FUSE FS:
> https://sft.its.cern.ch/jira/browse/CVM-1478
> we have already learnt that while one can even perform "umount /cvmfs/XXX/"
> on the host, the cvmfs-fuse-helper stays running, since it is still used
> inside the mount namespace of the container.
> Since I don't fully understand to what extent cephfs remounts, and how this
> interacts with mount namespaces, I am unsure whether this is related at
> all, but it surely is special to our setup (and may become more and more
> common in the future, especially in HPC).
>
> We'll try to reproduce the issue overnight (which is the last opportunity
> before we let in the first test users), and I'll certainly look at the
> perf dump on the MDS.
>
> Is there also some way I can extract info from the clients?

Fuse clients have an "admin socket" like the server daemons do; it's
usually under /var/run/ceph somewhere, and you can do
"ceph daemon <path to .asok file> help" to see the available commands --
there are various status, perf dump, etc. ones that should include things
like how many items are in cache. (See the P.S. at the bottom of this mail
for a concrete example.)

John

>
> Cheers and thanks for your input!
> Oliver
>
>>
>>> However, since the failover took a few minutes (I played with the beacon
>>> timeouts and increased mds_log_max_segments and mds_log_max_expiring to
>>> check the impact on performance), this could well have been the main
>>> cause for the huge memory consumption. Do I understand correctly that the
>>> clients may grow their number of caps to huge values if all MDSs are down
>>> for a few minutes, since nobody holds their hands?
>>
>> No -- MDS daemons issue the caps, so clients can't get more without
>> talking to an MDS.
>>
>> John
>>
>>> This could explain why, when the MDS finally tried to come back after the
>>> config changes, it was flooded with a tremendous number of caps, which
>>> did not fit into memory + swap at all. This, in turn, made the MDS and
>>> the metadata OSDs from which it was feeding (on the same machine...) very
>>> slow, so it got stuck for quite a while in the rejoin phase and missed
>>> heartbeats, triggering another failover.
>>> By the time I noticed and understood this, several failovers had already
>>> happened and about an hour had passed.
>>>
>>>
>>> If my understanding is correct, this would mean the clients had quite
>>> some time to accumulate even more caps.
>>> I then increased the beacon timeout, which gave the very sluggish MDS
>>> (swapping, waiting for the metadata OSDs to feed it) enough grace to
>>> start up - and then it ran into an OOM condition, since too many caps
>>> were held for it to ever handle with our hardware.
>>>
>>> The only way out of this seems to be to kill off the actual clients - right?
>>>
>>> So if my assumption is correct, it would help to be able to control the
>>> maximum number of caps clients can hold, even if the MDS is briefly down
>>> for some reason. Is this feasible?
>>>
>>>>
>>>> Thanks for performing the test and letting us know the results.
>>>>
>>>
>>> No problem! We are trying to push the system to its limits before our
>>> users do; we still have 1-2 days for that, want to play a bit with the
>>> read patterns of the main application framework our users will run
>>> ( https://root.cern.ch/ ), and then our first users will start to do
>>> their best to break things apart.
>>>
>>> Cheers,
>>> Oliver
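
P.S. A concrete example of the admin socket poking mentioned above, since
you'll be watching the overnight run anyway. The .asok path below is only
an illustration -- check what actually exists under /var/run/ceph on your
client nodes, as the exact filename depends on the client name:

    # on a client node: find the ceph-fuse admin socket
    ls /var/run/ceph/
    # list the commands the client supports
    ceph daemon /var/run/ceph/ceph-client.admin.asok help
    # general status, and the performance counters
    ceph daemon /var/run/ceph/ceph-client.admin.asok status
    ceph daemon /var/run/ceph/ceph-client.admin.asok perf dump

The same pattern works against the MDS socket for the `perf dump` Patrick
suggested, i.e. "ceph daemon mds.<name> perf dump" on the MDS host.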