On 25.02.2018 at 21:50, John Spray wrote:
> On Sun, Feb 25, 2018 at 4:45 PM, Oliver Freyermuth
> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>> Dear Cephalopodians,
>>
>> in preparation for production, we have run very successful tests with large sequential data,
>> and just now a stress test creating many small files on CephFS.
>>
>> We use a replicated metadata pool (4 SSDs, 4 replicas) and a data pool on 6 hosts with 32 OSDs each, running with EC k=4 m=2.
>> Compression is activated (aggressive, snappy). All Bluestore, LVM, Luminous 12.2.3.
>> There are (at the moment) only two MDS's, one active, the other standby.
>>
>> For the test, we had 1120 client processes on 40 client machines (all cephfs-fuse!) each extract a tarball with 150k small files
>> ( http://distfiles.gentoo.org/snapshots/portage-latest.tar.xz ) into a separate subdirectory.
>
> Running these tests with numerous clients is valuable -- thanks for
> doing it. The automated testing of Ceph that happens before releases
> unfortunately does not include situations with more than one or two
> clients.
>
>> Things started out rather well (but expectedly slow); we had to increase
>> mds_log_max_segments => 240
>> mds_log_max_expiring => 160
>> due to https://github.com/ceph/ceph/pull/18624
>> and adjusted mds_cache_memory_limit to 4 GB.
>>
>> Even though the MDS machine has 32 GB, it is also running 2 OSDs (for metadata), so we have been careful with the cache
>> (e.g. due to http://tracker.ceph.com/issues/22599 ).
>>
>> After a while, we tested MDS failover and realized we had entered a flip-flop situation between the two MDS nodes we have.
>> Increasing mds_beacon_grace to 240 helped with that.
>
> In general, if you're in a situation where you're having to increase
> mds_beacon_grace, you already have pretty bad problems. It's a good
> time to stop and dig into what is tying up the MDS so badly that it
> can't even send a beacon to the monitor in a timely way. Perhaps at
> this point your MDS daemons were already hitting swap and becoming
> pathologically slow for that reason.

That's good to know! It happened when triggering the failover, while the MDS was in the rejoin state.
It seems that in this situation it was very tied up - and I believe it was indeed already swapping.

>
>> Now, with about 100,000,000 objects written, we are in a disaster situation.
>> First off, the MDS could not restart anymore - it required >40 GB of memory, which (together with the 2 OSDs on the MDS host) exceeded RAM and swap.
>> So it tried to recover and OOMed quickly afterwards. Replay was reasonably fast, but rejoin took many minutes:
>> 2018-02-25 04:16:02.299107 7fe20ce1f700 1 mds.0.17657 rejoin_start
>> 2018-02-25 04:19:00.618514 7fe20ce1f700 1 mds.0.17657 rejoin_joint_start
>> and finally, 5 minutes later, OOM.
>>
>> I stopped half of the stress-test tars, which did not help - then I rebooted half of the clients, which did help and let the MDS recover just fine.
>> So it seems the client caps were too many for the MDS to handle. I'm unsure why "tar" would cause so many open file handles.
>> Is there anything that can be configured to prevent this from happening?
>
> Clients will generally hold onto capabilities for files they've
> written out -- this is pretty sub-optimal for many workloads where
> files are written out but not likely to be accessed again in the near
> future. While clients hold these capabilities, the MDS cannot drop
> things from its own cache.
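
For reference, this is roughly how one could check whether clients are hoarding capabilities and whether the cluster is warning about it - a minimal sketch only, assuming Luminous command names and that it is run on the host of the active MDS (the exact JSON fields of "session ls" may differ between versions):

    # health warnings of the form "N clients failing to respond to cache pressure"
    ceph health detail

    # list the client sessions of the active MDS; "num_caps" shows how many
    # capabilities each client currently holds
    ceph daemon mds.<name> session ls

A client still holding tens of thousands of caps after an extract-and-forget workload like the tar test would match the behaviour described above.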
>
> The way this is *meant* to work is that the MDS hits its cache size
> limit, and sends a message to clients asking them to drop some files
> from their local cache, and consequently release those capabilities.
> However, this has historically been a tricky area with ceph-fuse
> clients (there are some hacks for detecting kernel version and using
> different mechanisms for different versions of fuse), and it's
> possible that on your clients this mechanism is simply not working,
> leading to a severely oversized MDS cache.
>
> The MDS should have been showing health alerts in "ceph status" about
> this, but I suppose it's possible that it wasn't surviving long enough
> to hit the timeout (60s) that we apply for warning about misbehaving
> clients? It would be good to check the cluster log to see if you were
> getting any health messages along the lines of "Client xyz failing to
> respond to cache pressure".

This indeed explains the high memory usage. Now that I check the logs, I can also confirm seeing those health alerts.
The systems (servers and clients) are all exclusively CentOS 7.4, so the kernels are rather old, but I would have hoped the relevant fixes had been backported by Red Hat.
Is there anything one can do to limit the clients' cache sizes?

Cheers and thanks for the very valuable information!
Oliver

>
> John
>
>
>
>> Now, I only lost some "stress test data", but later it might be users' data...
>>
>>
>> In parallel, I had reinstalled one OSD host.
>> It was backfilling well, but now, <24 hours later and before the backfill has finished, several OSD hosts are entering OOM conditions.
>> Our OSD hosts have 64 GB of RAM for 32 OSDs, which should be fine with the default Bluestore cache size of 1 GB. However, it seems the processes are using much more,
>> up to several GBs, until memory is exhausted. They then become sluggish, are kicked out of the cluster, come back, and finally at some point are OOMed.
>>
>> Now, I have restarted some OSD processes and hosts, which helped to reduce the memory usage - but now I have some OSDs crashing continuously,
>> leading to PG unavailability and preventing recovery from completing.
>> I have reported a ticket about that, with stacktrace and log:
>> http://tracker.ceph.com/issues/23120
>> This might well be a consequence of a previous OOM-killer condition.
>>
>> However, my final question after these ugly experiences is:
>> Did anybody ever stress-test CephFS with many small files?
>> Are those issues known? Can special configuration help?
>> Are the memory issues known? Are there solutions?
>>
>> We don't plan to use Ceph for many small files, but we don't have full control over our users, which is why we wanted to test this "worst case" scenario.
>> It would be really bad if we lost a production filesystem in such a situation, so the plan was to test now and know what happens before we enter production.
>> As of now, this looks really bad, and I'm not sure the cluster will ever recover.
>> I'll give it some more time, but we'll likely kill off all remaining clients next week and see what happens, and worst case recreate the Ceph cluster.
>>
>> Cheers,
>> Oliver
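
For completeness, the knobs touched in this thread would end up in ceph.conf roughly as sketched below. This is only an illustration, not a tested recommendation: the [mds] and mds_beacon_grace values are the ones from our test, while the [client] and [osd] values are assumptions on my part (Luminous option names, sizes in bytes):

    [mds]
    # journal trimming, see https://github.com/ceph/ceph/pull/18624
    mds_log_max_segments = 240
    mds_log_max_expiring = 160
    # target size of the MDS cache, in bytes (~4 GB here)
    mds_cache_memory_limit = 4294967296

    [global]
    # give a busy MDS more time before the monitors fail it over
    # (needs to be visible to the monitors, hence the global section)
    mds_beacon_grace = 240

    [client]
    # shrink the ceph-fuse caches so fewer inodes (and thus caps) stay pinned
    client_cache_size = 8192         # inodes, default 16384
    client_oc_size = 104857600       # object cache, default ~200 MB

    [osd]
    # with 32 OSDs sharing 64 GB of RAM, a smaller Bluestore cache leaves headroom
    bluestore_cache_size_hdd = 536870912
    bluestore_cache_size_ssd = 536870912

Whether the client-side settings actually help of course depends on the cache-pressure mechanism working for the installed fuse/kernel combination, as John describes above.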