Re: CephFS very unstable with many small files

On 25.02.2018 at 23:13, Vasu Kulkarni wrote:
> 
> 
>> On Feb 25, 2018, at 8:45 AM, Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>
>> Dear Cephalopodians,
>>
>> in preparation for production, we have run very successful tests with large sequential data,
>> and just now a stress-test creating many small files on CephFS. 
>>
>> We use a replicated metadata pool (4 SSDs, 4 replicas) and a data pool spread across 6 hosts with 32 OSDs each, running EC k=4 m=2. 
>> Compression is activated (aggressive, snappy). All Bluestore, LVM, Luminous 12.2.3. 
>> There are (at the moment) only two MDSs: one active, the other standby. 
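
For anyone trying to recreate a similar layout at smaller scale: the setup described above roughly corresponds to commands like the following. Pool names, PG counts and the CRUSH handling of the SSD/HDD split are only illustrative here, not necessarily what we used.

    ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
    ceph osd pool create cephfs_data 2048 2048 erasure ec-4-2
    ceph osd pool set cephfs_data allow_ec_overwrites true
    ceph osd pool set cephfs_data compression_mode aggressive
    ceph osd pool set cephfs_data compression_algorithm snappy
    ceph osd pool create cephfs_metadata 128 128 replicated
    ceph osd pool set cephfs_metadata size 4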
>>
>> For the test, we had 1120 client processes on 40 client machines (all cephfs-fuse!) extract a tarball with 150k small files
>> ( http://distfiles.gentoo.org/snapshots/portage-latest.tar.xz ) each into a separate subdirectory. 
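
That works out to 28 tar processes per client node (1120 / 40). A per-node driver along the following lines would reproduce the load; the paths and exact invocation here are just a sketch, not our actual script:

    # 28 parallel extractions per client node, each into its own subdirectory
    for i in $(seq 1 28); do
        mkdir -p /cephfs/stress/$(hostname)-$i
        tar -xJf /tmp/portage-latest.tar.xz -C /cephfs/stress/$(hostname)-$i &
    done
    wait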
>>
>> Things started out rather well (though expectedly slow); we had to increase
>> mds_log_max_segments => 240
>> mds_log_max_expiring => 160
>> due to https://github.com/ceph/ceph/pull/18624
>> and adjusted mds_cache_memory_limit to 4 GB. 
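
Expressed as a ceph.conf fragment, these MDS settings look roughly like this (note that mds_cache_memory_limit takes bytes); they can also be changed at runtime through the admin socket with "ceph daemon mds.<name> config set ...":

    [mds]
    mds_log_max_segments = 240
    mds_log_max_expiring = 160
    mds_cache_memory_limit = 4294967296   # 4 GiB, value is in bytes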
>>
>> Even though the MDS machine has 32 GB, it is also running 2 OSDs (for metadata) and so we have been careful with the cache
>> (e.g. due to http://tracker.ceph.com/issues/22599 ). 
>>
>> After a while, we tested MDS failover and realized we entered a flip-flop situation between the two MDS nodes we have.
>> Increasing mds_beacon_grace to 240 helped with that. 
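
Since the monitors are the ones deciding when to fail an MDS over, mds_beacon_grace needs to be visible to them as well, e.g. by setting it in the global section:

    [global]
    mds_beacon_grace = 240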
>>
>> Now, with about 100,000,000 objects written, we are in a disaster situation. 
>> First off, the MDS could not restart anymore: it required >40 GB of memory, which (together with the 2 OSDs on the MDS host) exceeded RAM and swap. 
>> So it tried to recover and OOMed again shortly after. Replay was reasonably fast, but rejoin took many minutes:
>> 2018-02-25 04:16:02.299107 7fe20ce1f700  1 mds.0.17657 rejoin_start
>> 2018-02-25 04:19:00.618514 7fe20ce1f700  1 mds.0.17657 rejoin_joint_start
>> and finally, 5 minutes later, OOM. 
>>
>> I stopped half of the stress-test tars, which did not help; I then rebooted half of the clients, which did help and let the MDS recover just fine. 
>> So it seems the client caps were simply too many for the MDS to handle. I'm unsure why "tar" would cause so many open file handles. 
>> Is there anything that can be configured to prevent this from happening? 
>> Right now I have only lost some "stress test data", but later it might be users' data... 
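
For what it's worth, the per-session cap counts can be watched on the active MDS through the admin socket, which at least makes the problem visible before it reaches the OOM stage:

    # lists client sessions, including the number of caps each one holds
    ceph daemon mds.<name> session ls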
>>
>>
>> In parallel, I had reinstalled one OSD host. 
>> It was backfilling well, but now, less than 24 hours later and before backfill has finished, several OSD hosts are running into OOM conditions. 
>> Our OSD hosts have 64 GB of RAM for 32 OSDs, which should be fine with the default bluestore cache size of 1 GB. However, the processes seem to use much more,
>> up to several GB each, until memory is exhausted. They then become sluggish, are kicked out of the cluster, come back, and at some point are finally OOM-killed. 
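
If the BlueStore cache really is the main consumer, it can be shrunk below the default, e.g. (value in bytes; the 512 MiB here is just an example, not something we have validated):

    [osd]
    bluestore_cache_size_hdd = 536870912   # 512 MiB instead of the 1 GiB default

The cache is not the only consumer, though: pglog, osdmaps and recovery/backfill buffers come on top of it, which presumably is why the processes grow well past the configured cache size. Throttling recovery (osd_max_backfills, osd_recovery_max_active) might also take some pressure off.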
> 
> 32 GB RAM for the MDS and 64 GB RAM for 32 OSDs per node looks very low for the scale you are trying. What is the size of each OSD device?
> Could you also dump the osd tree plus more cluster info into the tracker you raised, so that one could try to recreate this at a lower scale and check?

Done! 
All HDD OSDs are 4 TB, while the SSDs used for the metadata pool are 240 GB each. 
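For anyone recreating this at a lower scale, the relevant state is easy to collect with the usual commands, e.g.:

    ceph osd tree
    ceph osd df tree
    ceph -s
    ceph df detail
    ceph fs status
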
We had initially planned to use something more lightweight on CPU and RAM (BeeGFS or Lustre).
However, we ran into serious issues with BeeGFS, have some bad past experience with Lustre (although that was an old version),
and were really happy with Ceph's self-healing features, which also allow us to reinstall OSD hosts during an upgrade without downtime,
so we decided to repurpose the hardware. For this reason, the RAM is not really optimized (yet) for Ceph. 
We will now try to adapt the hardware as best as possible. 

Are there memory recommendations for a setup of this size? Anything's welcome. 
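
Just as a back-of-the-envelope check against the often-quoted rule of thumb of roughly 1 GB of RAM per TB of OSD capacity (plus headroom for recovery), our OSD hosts come out well short:

    # rough sketch only: 32 OSDs of 4 TB each per host vs. 64 GB installed
    osds=32; tb_per_osd=4; installed_gb=64
    echo "rule of thumb: ~$((osds * tb_per_osd)) GB wanted, ${installed_gb} GB installed"

By that measure something closer to 128 GB per OSD host would be needed, plus whatever the MDS requires if it shares a host with OSDs.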

Cheers and thanks!
	Oliver

> 
>>
>> Now, I have restarted some OSD processes and hosts, which helped to reduce the memory usage; but now I have some OSDs crashing continuously,
>> leading to PG unavailability and preventing recovery from completing. 
>> I have reported a ticket about that, with stacktrace and log:
>> http://tracker.ceph.com/issues/23120
>> This might well be a consequence of a previous OOM killer condition. 
>>
>> However, my final question after these ugly experiences is: 
>> Did somebody ever stress-test CephFS with many small files? 
>> Are those issues known? Can special configuration help? 
>> Are the memory issues known? Are there solutions? 
>>
>> We don't plan to use Ceph for many small files, but we don't have full control of our users, which is why we wanted to test this "worst case" scenario. 
>> It would be really bad if we lost a production filesystem due to such a situation, so the plan was to test now to know what happens before we enter production. 
>> As of now, this looks really bad, and I'm not sure the cluster will ever recover. 
>> I'll give it some more time, but we'll likely kill off all remaining clients next week and see what happens, and worst case recreate the Ceph cluster. 
>>
>> Cheers,
>> 	Oliver
>>
> 



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
