Re: CephFS very unstable with many small files

Am 26.02.2018 um 17:59 schrieb John Spray:
> On Mon, Feb 26, 2018 at 4:50 PM, Oliver Freyermuth
> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>> Am 26.02.2018 um 17:15 schrieb John Spray:
>>> On Mon, Feb 26, 2018 at 4:06 PM, Oliver Freyermuth
>>> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>>> Am 26.02.2018 um 16:43 schrieb Patrick Donnelly:
>>>>> On Sun, Feb 25, 2018 at 3:49 PM, Oliver Freyermuth
>>>>> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>>>>> Am 25.02.2018 um 21:50 schrieb John Spray:
>>>>>>> On Sun, Feb 25, 2018 at 4:45 PM, Oliver Freyermuth
>>>>>>>> Now, with about 100,000,000 objects written, we are in a disaster situation.
>>>>>>>> First off, the MDS could not restart anymore - it required >40 GB of memory, which (together with the 2 OSDs on the MDS host) exceeded RAM and swap.
>>>>>>>> So it tried to recover and OOMed quickly after. Replay was reasonably fast, but join took many minutes:
>>>>>>>> 2018-02-25 04:16:02.299107 7fe20ce1f700  1 mds.0.17657 rejoin_start
>>>>>>>> 2018-02-25 04:19:00.618514 7fe20ce1f700  1 mds.0.17657 rejoin_joint_start
>>>>>>>> and finally, 5 minutes later, OOM.
>>>>>>>>
>>>>>>>> I stopped half of the stress-test tar's, which did not help - then I rebooted half of the clients, which did help and let the MDS recover just fine.
>>>>>>>> So it seems the client caps have been too many for the MDS to handle. I'm unsure why "tar" would cause so many open file handles.
>>>>>>>> Is there anything that can be configured to prevent this from happening?
>>>>>>>
>>>>>>> Clients will generally hold onto capabilities for files they've
>>>>>>> written out -- this is pretty sub-optimal for many workloads where
>>>>>>> files are written out but not likely to be accessed again in the near
>>>>>>> future.  While clients hold these capabilities, the MDS cannot drop
>>>>>>> things from its own cache.
>>>>>>>
>>>>>>> The way this is *meant* to work is that the MDS hits its cache size
>>>>>>> limit, and sends a message to clients asking them to drop some files
>>>>>>> from their local cache, and consequently release those capabilities.
>>>>>>> However, this has historically been a tricky area with ceph-fuse
>>>>>>> clients (there are some hacks for detecting kernel version and using
>>>>>>> different mechanisms for different versions of fuse), and it's
>>>>>>> possible that on your clients this mechanism is simply not working,
>>>>>>> leading to a severely oversized MDS cache.
>>>>>>>
>>>>>>> The MDS should have been showing health alerts in "ceph status" about
>>>>>>> this, but I suppose it's possible that it wasn't surviving long enough
>>>>>>> to hit the timeout (60s) that we apply for warning about misbehaving
>>>>>>> clients?  It would be good to check the cluster log to see if you were
>>>>>>> getting any health messages along the lines of "Client xyz failing to
>>>>>>> respond to cache pressure".
>>>>>>
>>>>>> This explains the high memory usage indeed.
>>>>>> I can also confirm seeing those health alerts, now that I check the logs.
>>>>>> The systems have been (servers and clients) all exclusively CentOS 7.4,
>>>>>> so kernels are rather old, but I would have hoped things have been backported
>>>>>> by RedHat.
>>>>>>
>>>>>> Is there anything one can do to limit client's cache sizes?
>>>>>
>>>>> You said the clients are ceph-fuse running 12.2.3? Then they should have:
>>>>>
>>>>> http://tracker.ceph.com/issues/22339
>>>>>
>>>>> (Please double-check you're not running older clients by accident.)
>>>>
>>>> I can confirm all clients have been running 12.2.3.
>>>> Is the issue really related? It looks like a remount-failure fix.
>>>
>>> The fuse client uses a remount internally to persuade the fuse kernel
>>> module to really drop things from its cache (fuse doesn't provide the
>>> ideal hooks for managing this stuff in network filesystems).
>>
>> Thanks for the explanation, now I understand!
>>
>>>
>>>>> I have run small file tests with ~128 clients without issue. Generally
>>>>> if there is an issue it is because clients are not releasing their
>>>>> capabilities properly (due to invalidation bugs which should be caught
>>>>> by the above backport) or the MDS memory usage exceeds RAM. If the
>>>>> clients are not releasing their capabilities, you should see the
>>>>> errors John described in the cluster log.
>>>>>
>>>>> You said in the original post that the `mds cache memory limit = 4GB`.
>>>>> If that's the case, you really shouldn't be exceeding 40GB of RAM!
>>>>> It's possible you have found a bug of some kind. I suggest tracking
>>>>> the MDS cache statistics (which includes the inode count in cache) by
>>>>> collecting a `perf dump` via the admin socket. Then you can begin to
>>>>> find out what's consuming all of the MDS memory.
>>>>>
>>>>> Additionally, I concur with John on digging into why the MDS is
>>>>> missing heartbeats by collecting debug logs (`debug mds = 15`) at that
>>>>> time. It may also shed light on the issue.
>>>>
>>>> Yes, I confirmed this earlier - indeed I found the "failing to respond to cache pressure" alerts in the logs.
>>>> The excess of RAM initially was "only" about 50-100 %, which was still fine - the main issue started after I tested MDS failover in this situation.
>>>> If I understand correctly, the clients are only prevented from growing their caps to huge values if an MDS is running
>>>> and actively preventing them from doing so. Correct?
>>>
>>> The clients have their own per-client limit on cache size
>>> (client_cache_size) that they apply locally.  They'll only hold caps
>>> on things they have in cache, so this indirectly controls how many
>>> caps they will ask for.  However, if you were hitting the 22339 or a
>>> similar issue then even this limit may not be properly enforced.
>>
>> Ok, understood. This we did not touch yet, so it should be 16384 inodes.
>>
>>
>> There is another specialty in our setup, and I am not sure if it matters.
>> Our stress-test was running using the same approach analysis jobs from users will use later on.
>> We use HTCondor here as a workload management system, which takes care of starting the separate "jobs" on the worker node machines
>> which are the cephfs clients.
>> In our case, the jobs are all encapsulated inside Singularity containers, which open up a new namespace environment for each job.
>> This includes a PID namespace and a mount namespace...
>> I am unsure how exactly the remounting of the fuse client will affect the mount as seen in the namespace, and the mount as seen in the host namespace.
>> For sure, I can confirm that writing and reading works fine. But I'm unsure how the "remounting" is affected by this specialty.
> 
> That is certainly interesting information, I don't have the kernel
> knowledge to say whether it would affect our remounting/invalidation
> paths, but it seems plausible.
> 
> From what I hear, people using CephFS for container volumes are mostly
> using the kernel client rather than fuse (the kernel client also has
> much better small file performance in general).

Ah, sorry, this may have been misunderstood - the container filesystems themselves live on CVMFS
(which is read-only, well suited for small files, optimized for caching, and does deduplication via content-defined chunking,
compression etc.).
However, we use CephFS for the user data, so it is bind-mounted into the container environment and hence accessible from within the mount namespace.
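
To make that concrete, what happens per job is roughly the following (paths, image name and job script are just placeholders, not our exact setup):

    # host: cephfs is mounted once via ceph-fuse
    ceph-fuse /cephfs

    # per job: HTCondor starts the payload inside a Singularity container,
    # bind-mounting the host's cephfs mount into the job's mount namespace
    singularity exec --bind /cephfs:/cephfs /cvmfs/some.repo/container.img ./run_job.sh

So the ceph-fuse process itself lives in the host's namespaces, while each job only sees the bind mount inside its own private mount namespace - which is exactly the part I'm unsure about with respect to the remount-based cache invalidation.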

So in an ideal world, if users followed the admins' recommendations, CephFS would only contain large bulk data for input and output,
while the container filesystem lives on CVMFS. Any scripts, small executables etc. would be shipped by HTCondor to scratch directories on the worker nodes
(local disks with ext4).

However, from past experience we expect our users to play with things until they break, i.e. put their source code on CephFS, compile it there,
and maybe even put many small files (text files with data, empty files whose names alone carry the relevant information, small Monte Carlo samples etc.)
all on CephFS.
That's why we want to see how easy it is to break. We don't plan to put small files on it, but many users certainly will.

The kernel client is currently a no-go for us, mainly since we need quota support.
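
(For context: as far as I understand, quotas in CephFS are set as virtual extended attributes on directories and are currently only enforced by the fuse client. Roughly something like this - the path and limits are of course just examples:)

    # limit a user's directory to ~1 TB and 1 million files
    setfattr -n ceph.quota.max_bytes -v 1000000000000 /cephfs/user/somebody
    setfattr -n ceph.quota.max_files -v 1000000       /cephfs/user/somebody

    # check what is set
    getfattr -n ceph.quota.max_bytes /cephfs/user/somebody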

> 
>> From another Fuse FS:
>> https://sft.its.cern.ch/jira/browse/CVM-1478
>> we have already learnt that while one can even perform "umount /cvmfs/XXX/" on the host,
>> the cvmfs-fuse-helper stays running since it is still used inside the mount namespace of the container.
>> Since I don't fully understand to what extent cephfs remounts, and how this affects mount namespaces,
>> I am unsure if this is related at all, but it surely is special to our setup (but may become more and more common in the future,
>> especially in HPC).
>>
>> We'll try to reproduce the issue overnight (which is the last occasion before we want to embrace the first test users)
>> and I'll surely look at perf dump on the mds.
>>
>> Is there also some way I can extract info from the clients?
> 
> Fuse clients have an "admin socket" like the server daemons do, it's
> usually in /var/lib/ceph somewhere, and you can do "ceph daemon <path
> to .asok file> help" to see available commands -- there are various
> status, perf dump, etc ones that should include things like how many
> items are in cache.
> 
Thanks! I'll have a look once the issue reappears. Right now it seems I won't be able to re-run the test so quickly, since just deleting the old files
takes quite long, even with high parallelization. With 4 clients and some 100 processes running rm, we top out at about 4000 req/s handled by the MDS,
which uses >200 % CPU and causes the metadata OSDs to read about 120 MB/s from the SSDs while writing ~70 MB/s to them in parallel.
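
(In case the numbers are of interest to others: I'm watching this on the MDS host via the admin socket, roughly like this - the daemon name needs to be adapted:)

    # live per-second view of the MDS perf counters (requests, cache size, ...)
    ceph daemonperf mds.<name>

    # full counter dump, including the inode / cap counts in the mds_mem section
    ceph daemon mds.<name> perf dump

    # per-client session info, including num_caps
    ceph daemon mds.<name> session ls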

These in turn also eat about one CPU core each, but it seems to me the actual limit is the SSDs, since the MDS is probably doing many small synchronous read-modify-writes to RocksDB
with lots of fsyncs. To increase the req/s, we will likely have to add more SSDs (right now there are four 240 GB SSDs in a replica-4 pool, so performance is bottlenecked to that of a single SSD,
and small SSDs tend not to be very fast).
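
(For completeness, the pool layout and replica count can be checked e.g. like this - the metadata pool name below is just what we called ours:)

    # which pools back the filesystem, plus current MDS and pool usage
    ceph fs status

    # replica count of the metadata pool
    ceph osd pool get cephfs_metadata size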

This is also a nice stress test by itself, so I'd like to see how it performs overnight (I guess it will take the full night).

Cheers and many thanks!
	Oliver

> John
> 
>>
>> Cheers and thanks for your input!
>>         Oliver
>>
>>>
>>>> However, since the failover took a few minutes (I played with the beacon timeouts and increased the mds_log_max_segments and mds_log_max_expiring to check impact on performance),
>>>> this could well have been the main cause for the huge memory consumption. Do I understand correctly that the clients may grow their number of caps
>>>> to huge numbers if all MDS are down for a few minutes, since nobody holds their hands?
>>>
>>> No -- MDS daemons issue the caps, so clients can't get more without
>>> talking to an MDS.
>>>
>>> John
>>>
>>>> This could explain why, when the MDS wanted to finally come back after the config changes, it was flooded with a tremendous number of caps,
>>>> which did not fit into memory + swap at all. This, in turn, made the MDS + the metadata OSDs from which it was feeding (on the same machine...)
>>>> very slow, so it got stuck for quite a while in rejoin / joint phases, and missed heartbeats, triggering another failover.
>>>> As soon as I noticed and understood this, several failovers had already happened, and about an hour had passed.
>>>>
>>>>
>>>> If my understanding is correct, this would mean the clients had quite some time to accumulate even more caps.
>>>> I then increased the beacon timeout, which gave the MDS, which was very sluggish (swapping, waiting for metadata OSDs to feed it), enough grace
>>>> to start up - and then, it ran into OOM condition, since there were too many caps held for it to ever handle with our hardware.
>>>>
>>>> The only way out of this seems to be to kill off the actual clients - right?
>>>>
>>>> So if my assumption is correct, it would help to be able to control the maximum number of caps clients can hold,
>>>> even if the MDS is briefly down for some reason. Is this feasible?
>>>>
>>>>>
>>>>> Thanks for performing the test and letting us know the results.
>>>>>
>>>>
>>>> No problem! We are trying to push the system to its limits now before our users do it,
>>>> we still have 1-2 days to do that, and want to play a bit with the read patterns of the main application framework our users will run ( https://root.cern.ch/ ),
>>>> and then our first users will start to do their best to break things apart.
>>>>
>>>> Cheers,
>>>>         Oliver
>>>>
>>
>>
>>



