Re: CephFS very unstable with many small files

Hi Stijn, 

On 26.02.2018 at 07:58, Stijn De Weirdt wrote:
> hi oliver,
> 
>>>> in preparation for production, we have run very successful tests with large sequential data,
>>>> and just now a stress-test creating many small files on CephFS. 
>>>>
>>>> We use a replicated metadata pool (4 SSDs, 4 replicas) and a data pool with 6 hosts with 32 OSDs each, running in EC k=4 m=2. 
>>>> Compression is activated (aggressive, snappy). All Bluestore, LVM, Luminous 12.2.3. 
> (this is all afaik;) so with EC k=4, small files get cut in 4 smaller
> parts. i'm not sure when the compression is applied, but your small
> files might be very small files before they get cut in 4 tiny parts. this
> might become pure iops wrt performance.
> with filestore (and without compression), this was quite awful. we have
> not retested with bluestore yet, but in the end a disk is just a disk.
> writing 1 file results in 6 disk writes (4 data + 2 coding chunks), so you need a lot of iops and/or
> disks.
> 
> <...>

Thanks for these hints! 
I think in our case, the high number of disks / OSDs saves us from really noticing this. 
At least, checking with iotop / iostat during the stress testing, I saw almost no disk activity on the OSDs;
the MDS was the main bottleneck. 
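
To put rough numbers on the point about EC turning small files into pure IOPS, here is a minimal Python sketch. It assumes a deliberately simplified model: each file is a single object split evenly into k data chunks plus m coding chunks of the same size. It ignores striping, BlueStore allocation units, compression and metadata, so real numbers will differ. The function name ec_write_footprint is made up for illustration and is not part of any Ceph tooling.

```python
import math


def ec_write_footprint(file_size_bytes, k=4, m=2):
    """Rough per-file write footprint on an erasure-coded pool.

    Simplified model (assumption): one object per file, split evenly
    into k data chunks, plus m coding chunks of the same size.
    """
    chunk_bytes = math.ceil(file_size_bytes / k)  # size of each chunk
    disk_writes = k + m                           # one write per chunk
    return {
        "chunk_bytes": chunk_bytes,
        "disk_writes": disk_writes,
        "raw_bytes": chunk_bytes * disk_writes,
    }


# Example: a 4 KiB file on the k=4, m=2 pool described in this thread.
footprint = ec_write_footprint(4096, k=4, m=2)
print(footprint)
```

Under this model, a 4 KiB file becomes six chunks of 1 KiB each, which is why the workload ends up dominated by the number of writes rather than by bandwidth.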

> 
>>>> In parallel, I had reinstalled one OSD host. 
>>>> It was backfilling well, but now, <24 hours later, before backfill has finished, several OSD hosts enter OOM condition. 
>>>> Our OSD-hosts have 64 GB of RAM for 32 OSDs, which should be fine with the default bluestore cache size of 1 GB. However, it seems the processes are using much more,
>>>> up to several GBs until memory is exhausted. They then become sluggish, are kicked out of the cluster, come back, and finally at some point they are OOMed. 
>>>
>>> 32GB RAM for MDS, 64GB RAM for 32 OSDs per node looks very low on memory for the scale you are trying. what is the size of each osd device?
>>> Could you also dump osd tree + more cluster info in the tracker you raised, so that one could try to recreate at a lower scale and check.
>>
>> Done! 
>> All HDD-OSDs have 4 TB, while the SSDs used for the metadata pool have 240 GB. 
> the rule of thumb is 1GB per 1 TB. that is a lot (and imho one of the
> bad things about ceph, but i'm not complaining ;)
> most of the time this memory will not be used except for cache, but eg
> recovery is one of the cases where it is used, and thus needed.
> 
> i have no idea what the real requirements are (i assume there's some
> fixed amount per OSD and the rest is linear(?) with volume). so you can
> try to use some softraid on the disks to reduce the number of OSDs per
> host; but i doubt that the fixed part is over 50%, so you will probably
> end up with having to add some memory or not use certain disks. i don't
> know if you can limit the amount of volume per disk, eg only use 2TB of
> a 4TB disk, because then you can keep the iops.

It would likely be possible to double the RAM of the OSD hosts at an affordable price 
(only half of the DIMM slots are occupied - we already planned for the future, just did not expect this to be necessary so quickly). 
This would give us 128 GB per OSD host, which matches 32 * 4 TB = 128 TB, i.e. 1 GB of RAM per 1 TB of disk. 
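
The arithmetic behind this can be written down as a short Python sketch. It only encodes the numbers from this thread (32 OSDs per host, 4 TB per OSD, 64 GB now and 128 GB after the upgrade) and the 1 GB per 1 TB rule of thumb; it is not an official sizing formula, and the helper names are made up for illustration.

```python
def ram_per_osd_gb(host_ram_gb, osds_per_host):
    """RAM available per OSD if the host's memory were split evenly."""
    return host_ram_gb / osds_per_host


def rule_of_thumb_ram_gb(osds_per_host, tb_per_osd, gb_per_tb=1.0):
    """Host RAM suggested by the '1 GB per 1 TB' rule of thumb."""
    return osds_per_host * tb_per_osd * gb_per_tb


current_per_osd = ram_per_osd_gb(64, 32)    # today: 64 GB for 32 OSDs
upgraded_per_osd = ram_per_osd_gb(128, 32)  # after doubling the DIMMs
target_host_ram = rule_of_thumb_ram_gb(32, 4)

print(f"current:  {current_per_osd:.1f} GB per OSD")
print(f"upgraded: {upgraded_per_osd:.1f} GB per OSD")
print(f"rule of thumb: {target_host_ram:.0f} GB per host")
```

With 64 GB, each OSD gets about 2 GB, which leaves little headroom above a 1 GB BlueStore cache once recovery starts; 128 GB gives about 4 GB per OSD and meets the rule of thumb exactly.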

For the MDSes, the same is true: we could upgrade them to 64 GB or 96 GB without throwing away existing DIMMs. 
Potentially, we could even go for 128 GB, as Linh had in the HPC setup, and move the small DIMMs from the MDSes to the OSD hosts. 
We'll discuss... 

Many thanks for your very valuable input! 

Cheers,
	Oliver


> 
> stijn
> 
>> We had initially planned to use something more lightweight on CPU and RAM (BeeGFS or Lustre),
>> but since we encountered serious issues with BeeGFS, have some bad past experience with Lustre (but it was an old version)
>> and were really happy with the self-healing features of Ceph, which also allow us to reinstall OSD hosts during an upgrade without any downtime,
>> we have decided to repurpose the hardware. For this reason, the RAM is not really optimized (yet) for Ceph. 
>> We will try to adapt hardware now as best as possible. 
>>
>> Are there memory recommendations for a setup of this size? Anything's welcome. 
>>
>> Cheers and thanks!
>> 	Oliver
>>
>>>
>>>>
>>>> Now, I have restarted some OSD processes and hosts, which helped to reduce the memory usage - but now I have some OSDs crashing continuously,
>>>> leading to PG unavailability, and preventing recovery from completing. 
>>>> I have filed a ticket about that, with stacktrace and log:
>>>> http://tracker.ceph.com/issues/23120
>>>> This might well be a consequence of a previous OOM killer condition. 
>>>>
>>>> However, my final question after these ugly experiences is: 
>>>> Did somebody ever stresstest CephFS for many small files? 
>>>> Are those issues known? Can special configuration help? 
>>>> Are the memory issues known? Are there solutions? 
>>>>
>>>> We don't plan to use Ceph for many small files, but we don't have full control of our users, which is why we wanted to test this "worst case" scenario. 
>>>> It would be really bad if we lost a production filesystem due to such a situation, so the plan was to test now to know what happens before we enter production. 
>>>> As of now, this looks really bad, and I'm not sure the cluster will ever recover. 
>>>> I'll give it some more time, but we'll likely kill off all remaining clients next week and see what happens, and worst case recreate the Ceph cluster. 
>>>>
>>>> Cheers,
>>>> 	Oliver
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>

