Re: CephFS very unstable with many small files

Am 26.02.2018 um 17:15 schrieb John Spray:
> On Mon, Feb 26, 2018 at 4:06 PM, Oliver Freyermuth
> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>> Am 26.02.2018 um 16:43 schrieb Patrick Donnelly:
>>> On Sun, Feb 25, 2018 at 3:49 PM, Oliver Freyermuth
>>> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>>> Am 25.02.2018 um 21:50 schrieb John Spray:
>>>>> On Sun, Feb 25, 2018 at 4:45 PM, Oliver Freyermuth
>>>>>> Now, with about 100,000,000 objects written, we are in a disaster situation.
>>>>>> First off, the MDS could not restart anymore - it required >40 GB of memory, which (together with the 2 OSDs on the MDS host) exceeded RAM and swap.
>>>>>> So it tried to recover and OOMed quickly after. Replay was reasonably fast, but join took many minutes:
>>>>>> 2018-02-25 04:16:02.299107 7fe20ce1f700  1 mds.0.17657 rejoin_start
>>>>>> 2018-02-25 04:19:00.618514 7fe20ce1f700  1 mds.0.17657 rejoin_joint_start
>>>>>> and finally, 5 minutes later, OOM.
>>>>>>
>>>>>> I stopped half of the stress-test tars, which did not help - then I rebooted half of the clients, which did help and let the MDS recover just fine.
>>>>>> So it seems there were too many client caps for the MDS to handle. I'm unsure why "tar" would cause so many open file handles.
>>>>>> Is there anything that can be configured to prevent this from happening?
>>>>>
>>>>> Clients will generally hold onto capabilities for files they've
>>>>> written out -- this is pretty sub-optimal for many workloads where
>>>>> files are written out but not likely to be accessed again in the near
>>>>> future.  While clients hold these capabilities, the MDS cannot drop
>>>>> things from its own cache.
>>>>>
>>>>> The way this is *meant* to work is that the MDS hits its cache size
>>>>> limit, and sends a message to clients asking them to drop some files
>>>>> from their local cache, and consequently release those capabilities.
>>>>> However, this has historically been a tricky area with ceph-fuse
>>>>> clients (there are some hacks for detecting kernel version and using
>>>>> different mechanisms for different versions of fuse), and it's
>>>>> possible that on your clients this mechanism is simply not working,
>>>>> leading to a severely oversized MDS cache.
>>>>>
>>>>> The MDS should have been showing health alerts in "ceph status" about
>>>>> this, but I suppose it's possible that it wasn't surviving long enough
>>>>> to hit the timeout (60s) that we apply for warning about misbehaving
>>>>> clients?  It would be good to check the cluster log to see if you were
>>>>> getting any health messages along the lines of "Client xyz failing to
>>>>> respond to cache pressure".
>>>>
>>>> This explains the high memory usage indeed.
>>>> I can also confirm seeing those health alerts, now that I check the logs.
>>>> The systems (servers and clients) have all been running exclusively CentOS 7.4,
>>>> so the kernels are rather old, but I would have hoped that the relevant fixes had been backported
>>>> by RedHat.
>>>>
>>>> Is there anything one can do to limit client's cache sizes?
>>>
>>> You said the clients are ceph-fuse running 12.2.3? Then they should have:
>>>
>>> http://tracker.ceph.com/issues/22339
>>>
>>> (Please double check you're not running older clients by accident.)
>>
>> I can confirm all clients have been running 12.2.3.
>> Is the issue really related? It looks like a remount-failure fix.
> 
> The fuse client uses a remount internally to persuade the fuse kernel
> module to really drop things from its cache (fuse doesn't provide the
> ideal hooks for managing this stuff in network filesystems).

Thanks for the explanation, now I understand! 

> 
>>> I have run small file tests with ~128 clients without issue. Generally
>>> if there is an issue it is because clients are not releasing their
>>> capabilities properly (due to invalidation bugs which should be caught
>>> by the above backport) or the MDS memory usage exceeds RAM. If the
>>> clients are not releasing their capabilities, you should see the
>>> errors John described in the cluster log.
>>>
>>> You said in the original post that the `mds cache memory limit = 4GB`.
>>> If that's the case, you really shouldn't be exceeding 40GB of RAM!
>>> It's possible you have found a bug of some kind. I suggest tracking
>>> the MDS cache statistics (which includes the inode count in cache) by
>>> collecting a `perf dump` via the admin socket. Then you can begin to
>>> find out what's consuming all of the MDS memory.
>>>
>>> Additionally, I concur with John on digging into why the MDS is
>>> missing heartbeats by collecting debug logs (`debug mds = 15`) at that
>>> time. It may also shed light on the issue.
>>
>> Yes, I confirmed this earlier - indeed I found the "failing to respond to cache pressure" alerts in the logs.
>> The excess of RAM initially was "only" about 50 - 100 %, which was still fine - the main issue started after I tested MDS failover in this situation.
>> If I understand correctly, the clients are only prevented from growing their caps to huge values if an MDS is running
>> and actively preventing them from doing so. Correct?
> 
> The clients have their own per-client limit on cache size
> (client_cache_size) that they apply locally.  They'll only hold caps
> on things they have in cache, so this indirectly controls how many
> caps they will ask for.  However, if you were hitting the 22339 or a
> similar issue then even this limit may not be properly enforced.

Ok, understood. We have not touched this yet, so it should still be at the default of 16384 inodes. 
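
To be sure, I'll double-check this on a few clients through the ceph-fuse admin socket (the socket path below is
just my guess at the default location, with <pid> filled in from the running ceph-fuse process):

  ceph daemon /var/run/ceph/ceph-client.admin.<pid>.asok config get client_cache_size

and compare it with the per-session cap counts the MDS reports via "session ls".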


There is another peculiarity in our setup, and I am not sure if it matters. 
Our stress test was running with the same approach the analysis jobs from our users will use later on. 
We use HTCondor here as a workload management system, which takes care of starting the separate "jobs" on the worker node machines,
which are the CephFS clients. 
In our case, the jobs are all encapsulated inside Singularity containers, which open up a new namespace environment for each job. 
This includes a PID namespace and a mount namespace... 
I am unsure how exactly the remounting done by the fuse client affects the mount as seen inside the container's namespace versus the mount as seen in the host namespace. 
I can confirm that writing and reading work fine, but I'm unsure how the "remounting" interacts with this peculiarity. 

From another FUSE filesystem:
https://sft.its.cern.ch/jira/browse/CVM-1478
we have already learnt that even though one can perform "umount /cvmfs/XXX/" on the host,
the cvmfs fuse helper stays running, since it is still in use inside the mount namespace of the container. 
Since I don't fully understand to what extent ceph-fuse remounts, and how this interacts with mount namespaces,
I am unsure whether this is related at all, but it certainly is special to our setup (and may become more and more common in the future,
especially in HPC). 
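
If it is useful, I could also compare how the ceph-fuse mount looks on the host versus inside a running job's
mount namespace, roughly like this (assuming the mounts show up with fstype "fuse.ceph-fuse", and with <job-pid>
being a PID I'd look up from a running job):

  findmnt -t fuse.ceph-fuse                              # view in the host mount namespace
  nsenter -t <job-pid> -m findmnt -t fuse.ceph-fuse      # view inside the container's mount namespace

to check whether the remount done by the client is visible in both places.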

We'll try to reproduce the issue overnight (which is the last occasion before we let our first test users in),
and I'll certainly look at the perf dump on the MDS. 
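
Concretely, I plan to collect something like this on the active MDS every few minutes while the tar jobs run
(the mds name below is just a placeholder for whatever shows up as active, and the counter names are what I
believe the relevant ones are - please correct me if I should watch others):

  ceph daemon mds.<name> perf dump > mds-perf-$(date +%s).json

and then keep an eye on the "mds_mem" section (ino / dn / rss) and the "mds" section (inodes, caps).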

Is there also some way I can extract similar info (cache size, number of held caps) from the clients? 
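
(I guess the ceph-fuse admin socket on the worker nodes might already expose something useful here, e.g.

  ceph daemon /var/run/ceph/ceph-client.admin.<pid>.asok perf dump
  ceph daemon /var/run/ceph/ceph-client.admin.<pid>.asok mds_sessions

but I'm not sure which of these actually shows the cache size and cap counts - please correct me if there is a
better way.)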

Cheers and thanks for your input!
	Oliver

> 
>> However, since the failover took a few minutes (I played with the beacon timeouts and increased the mds_log_max_segments and mds_log_max_expiring to check impact on performance),
>> this could well have been the main cause for the huge memory consumption. Do I understand correctly that the clients may grow their number of caps
>> to huge numbers if all MDS are down for a few minutes, since nobody holds their hands?
> 
> No -- MDS daemons issue the caps, so clients can't get more without
> talking to an MDS.
> 
> John
> 
>> This could explain why, when the MDS wanted to finally come back after the config changes, it was flooded with a tremendous number of caps,
>> which did not fit into memory + swap at all. This, in turn, made the MDS + the metadata OSDs from which it was feeding (on the same machine...)
>> very slow, so it got stuck for quite a while in rejoin / joint phases, and missed heartbeats, triggering another failover.
>> As soon as I noticed and understood this, several failovers had already happened, and about an hour had passed.
>>
>>
>> If my understanding is correct, this would mean the clients had quite some time to accumulate even more caps.
>> I increased the beacon timeout then, which gave the MDS, which was very sluggish (swapping, waiting for metadata OSDs to feed it) enough grace
>> to start up - and then, it ran into OOM condition, since there were too many caps held for it to ever handle with our hardware.
>>
>> The only way out of this seems to be to kill off the actual clients - right?
>>
>> So if my assumption is correct, it would help to be able to control the maximum number of caps clients can hold,
>> even if the MDS is shortly down for some reason. Is this feasible?
>>
>>>
>>> Thanks for performing the test and letting us know the results.
>>>
>>
>> No problem! We are trying to push the system to its limits now before our users do.
>> We still have 1-2 days for that, and want to play a bit with the read patterns of the main application framework our users will run ( https://root.cern.ch/ ),
>> and then our first users will start doing their best to break things apart.
>>
>> Cheers,
>>         Oliver
>>



