Re: CephFS very unstable with many small files

hi oliver,

>>> in preparation for production, we have run very successful tests with large sequential data,
>>> and just now a stress-test creating many small files on CephFS. 
>>>
>>> We use a replicated metadata pool (4 SSDs, 4 replicas) and a data pool with 6 hosts with 32 OSDs each, running in EC k=4 m=2. 
>>> Compression is activated (aggressive, snappy). All Bluestore, LVM, Luminous 12.2.3. 
(this is all afaik ;) so with EC k=4, a small file gets cut into 4 even
smaller parts. i'm not sure at what point the compression is applied, but
your small files might already be very small before they get cut into 4
tiny parts. this quickly becomes a pure iops problem performance-wise.
with filestore (and without compression) this was quite awful. we have
not retested with bluestore yet, but in the end a disk is just a disk:
writing 1 file results in 6 disk writes (k=4 data + m=2 coding chunks),
so you need a lot of iops and/or disks.
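
to put some numbers on that, here's a quick back-of-the-envelope sketch
in python. the constants are assumptions on my side: luminous bluestore
with the default min_alloc_size_hdd of 64 KiB (afaik), one rados object
per small file, and compression, bluestore metadata and wal/db traffic
ignored.

KIB = 1024

def ec_small_file(file_size, k=4, m=2, min_alloc=64 * KIB):
    """Return (disk_writes, bytes_on_disk) for one small file on an EC k+m pool."""
    chunk = -(-file_size // k)                   # each data chunk holds ~1/k of the file
    writes = k + m                               # one chunk per OSD: k data + m coding
    alloc = -(-chunk // min_alloc) * min_alloc   # chunk rounded up to the allocation unit
    return writes, writes * alloc

for size in (4 * KIB, 16 * KIB, 64 * KIB, 1024 * KIB):
    writes, on_disk = ec_small_file(size)
    print(f"{size // KIB:5d} KiB file -> {writes} disk writes, "
          f"{on_disk // KIB:5d} KiB on disk ({on_disk / size:.1f}x space)")

so a 16 KiB file becomes 6 writes of a 4 KiB chunk each, and every chunk
occupies a full 64 KiB allocation unit on its OSD. the iops cost is what
hurts most here; the space overhead drops back to the usual (k+m)/k = 1.5x
once files are bigger than k * min_alloc_size.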

<...>

>>> In parallel, I had reinstalled one OSD host. 
>>> It was backfilling well, but now, <24 hours later, before backfill has finished, several OSD hosts enter OOM condition. 
>>> Our OSD-hosts have 64 GB of RAM for 32 OSDs, which should be fine with the default bluestore cache size of 1 GB. However, it seems the processes are using much more,
>>> up to several GBs until memory is exhausted. They then become sluggish, are kicked out of the cluster, come back, and finally at some point they are OOMed. 
>>
>> 32GB RAM for MDS and 64GB RAM for 32 OSDs per node looks very low for the scale you are trying. what is the size of each osd device?
>> Could you also dump osd tree + more cluster info in the tracker you raised, so that one could try to recreate at a lower scale and check.
> 
> Done! 
> All HDD-OSDs have 4 TB, while the SSDs used for the metadata pool have 240 GB. 
the rule of thumb is 1 GB of RAM per 1 TB of OSD. that is a lot (and imho
one of the less nice things about ceph, but i'm not complaining ;)
most of the time this memory will not be used except for cache, but eg
recovery is one of the cases where it is used, and thus needed.

i have no idea what the real requirements are (i assume there's some
fixed amount per OSD and the rest scales more or less linearly with
volume). you could try some softraid on the disks to reduce the number
of OSDs per host, but i doubt that the fixed part is over 50%, so you
will probably end up having to add some memory or not use certain disks.
i don't know if you can limit the amount of volume used per disk, eg
only use 2 TB of a 4 TB disk, because then you could keep the iops.
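
to put your numbers next to that rule of thumb, a small sketch (again
just a sketch; the real footprint also depends on pg count, pglog length
and how much recovery/backfill is going on):

osds_per_host = 32
osd_size_tb = 4          # 4 TB HDDs
ram_gb = 64              # RAM installed per OSD host

rule_of_thumb_gb = osds_per_host * osd_size_tb      # 1 GB of RAM per TB of OSD
per_osd_budget_gb = ram_gb / osds_per_host          # what each ceph-osd may actually use

print(f"rule of thumb : {rule_of_thumb_gb} GB per host")
print(f"installed     : {ram_gb} GB per host")
print(f"per-osd budget: {per_osd_budget_gb:.1f} GB")

that's 128 GB suggested vs 64 GB installed, ie about 2 GB of budget per
ceph-osd. with the default 1 GB bluestore cache there is very little
headroom left for pglog and recovery state, which matches the OOMs you
are seeing. if adding memory is not an option, lowering
bluestore_cache_size_hdd (default 1 GB in luminous, afaik) buys back
some headroom, at the cost of cache hit rate.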

stijn

> We had initially planned to use something more lightweight on CPU and RAM (BeeGFS or Lustre),
> but we encountered serious issues with BeeGFS, have some bad past experience with Lustre (though with an old version),
> and were really happy with the self-healing features of Ceph, which also allow us to reinstall OSD hosts during an upgrade without downtime,
> so we have decided to repurpose the hardware. For this reason, the RAM is not really optimized (yet) for Ceph. 
> We will try to adapt hardware now as best as possible. 
> 
> Are there memory recommendations for a setup of this size? Anything's welcome. 
> 
> Cheers and thanks!
> 	Oliver
> 
>>
>>>
>>> Now, I have restarted some OSD processes and hosts, which helped to reduce the memory usage - but now I have some OSDs crashing continuously,
>>> leading to PG unavailability, and preventing recovery from completion. 
>>> I have reported a ticket about that, with stacktrace and log:
>>> http://tracker.ceph.com/issues/23120
>>> This might well be a consequence of a previous OOM killer condition. 
>>>
>>> However, my final question after these ugly experiences is: 
>>> Did somebody ever stresstest CephFS for many small files? 
>>> Are those issues known? Can special configuration help? 
>>> Are the memory issues known? Are there solutions? 
>>>
>>> We don't plan to use Ceph for many small files, but we don't have full control of our users, which is why we wanted to test this "worst case" scenario. 
>>> It would be really bad if we lost a production filesystem due to such a situation, so the plan was to test now to know what happens before we enter production. 
>>> As of now, this looks really bad, and I'm not sure the cluster will ever recover. 
>>> I'll give it some more time, but we'll likely kill off all remaining clients next week and see what happens, and worst case recreate the Ceph cluster. 
>>>
>>> Cheers,
>>> 	Oliver
>>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


