Re: Ceph Users Feedback Survey

>> 
>> Hi, Neha, we are very excited to see this opportunity to communicate
>> with the community.
>> 
>> Ceph has been a very important solution for our data storage
>> requirements; however, we also see the following issues that are
>> holding back wider adoption of Ceph in our office:
>> 
>> 1. We continuously observe significant performance degradation in
>> both our rbd and rgw clusters while PGs are backfilling.
>>    a): According to our investigation, this degradation mostly comes
>> from PG removals, whose cost lies in listing the objects of the PGs
>> being removed. We believe this object listing is expensive because
>> BlueStore keeps its metadata in RocksDB, whose read performance may
>> not be great when it holds a very large number of keys. The situation
>> is even worse in our rgw clusters, because our users tend to store
>> small objects (mostly under 200KB), so the same total amount of data
>> produces far more metadata on a single OSD in the rgw clusters than
>> in the rbd clusters.

This is very much a known phenomenon.  Fortunately my own ML workload uses ~50MB objects.

Is your RGW index pool on NVMe SSDs?  How many of them, with how many PGs?  Spreading the index pool across more drives and increasing the PG count above the formal guidance may help.

And is your bucket pool on SSDs?
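
If you do end up bumping pg_num by hand, the sketch below is roughly what
I mean; it is only a minimal illustration (Python driving the ceph CLI),
and both the pool name "default.rgw.buckets.index" and the target of 256
PGs are assumptions you would adjust for your cluster:

    # Rough sketch only: bump pg_num on the RGW index pool via the ceph CLI.
    # The pool name and PG target below are assumptions, not recommendations.
    import subprocess

    POOL = "default.rgw.buckets.index"   # assumed default index pool name
    TARGET_PGS = "256"                   # assumed target, tune for your OSD count

    def ceph(*args: str) -> str:
        """Run a ceph CLI command and return its stdout."""
        result = subprocess.run(["ceph", *args], check=True,
                                capture_output=True, text=True)
        return result.stdout

    # Keep the autoscaler from reverting a manual bump on this pool.
    ceph("osd", "pool", "set", POOL, "pg_autoscale_mode", "off")
    # Raise pg_num; recent releases ramp pgp_num up for you, older ones
    # need it set explicitly as well.
    ceph("osd", "pool", "set", POOL, "pg_num", TARGET_PGS)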

>> from the one-to-many rgw-rados object mapping. As mentioned earlier,
>> our users tend to save small objects in our rgw clusters, and rgw
>> stores each of these small objects as one or more rados objects. On
>> the other hand, RADOS does PG backfilling at the granularity of
>> objects. So, in our rgw clusters, PG backfilling issues large numbers
>> of small reads to the underlying disks, which not only hurts client
>> IO performance but also slows down the backfilling itself
>> significantly. We think that organizing small rgw objects into large
>> rados objects, with a background compaction process that continuously
>> scans those rados objects and reclaims the space of removed rgw
>> objects, might solve this issue.

This is sometimes referred to as packing small objects -- the hole / compaction issue is an obvious wrinkle.
I was told a few years ago that certain folks had a prototype here, but I haven't seen a PR.
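
For what it's worth, the abstract shape of the idea is something like the
toy sketch below (Python, purely illustrative, not based on any actual
RGW code): small logical objects are appended into one large pack with an
offset/length index, deletes just leave holes, and a compaction pass
rewrites the live entries once dead space crosses a threshold. The hard
parts (crash consistency, where the index lives, concurrent writers) are
exactly what the sketch leaves out.

    # Toy illustration of packing small logical objects into one large blob
    # with hole reclamation.  Names and thresholds are invented for the example.
    from dataclasses import dataclass

    @dataclass
    class Extent:
        offset: int
        length: int

    class Pack:
        """One large backing blob holding many small logical objects."""

        def __init__(self, compact_threshold: float = 0.5):
            self.blob = bytearray()   # stands in for one big rados object
            self.index = {}           # name -> Extent of the live data
            self.dead_bytes = 0       # holes left behind by deletes/overwrites
            self.compact_threshold = compact_threshold

        def put(self, name: str, data: bytes) -> None:
            if name in self.index:    # an overwrite turns the old extent into a hole
                self.dead_bytes += self.index[name].length
            self.index[name] = Extent(len(self.blob), len(data))
            self.blob += data         # append-only, no in-place rewrites

        def get(self, name: str) -> bytes:
            ext = self.index[name]
            return bytes(self.blob[ext.offset:ext.offset + ext.length])

        def delete(self, name: str) -> None:
            self.dead_bytes += self.index.pop(name).length
            self._maybe_compact()

        def _maybe_compact(self) -> None:
            """Rewrite live extents into a fresh blob once dead space dominates."""
            if not self.blob:
                return
            if self.dead_bytes / len(self.blob) < self.compact_threshold:
                return
            new_blob, new_index = bytearray(), {}
            for name, ext in self.index.items():
                new_index[name] = Extent(len(new_blob), ext.length)
                new_blob += self.blob[ext.offset:ext.offset + ext.length]
            self.blob, self.index, self.dead_bytes = new_blob, new_index, 0

The payoff is that backfill then moves a handful of multi-megabyte packs
instead of thousands of sub-200KB objects; the open question any real
prototype has to answer is how to do that compaction safely under
concurrent client writes.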

>> 
>>   b): Since we are handling ML workloads, which involve large
>> numbers of small files, we want to make full use of the memory on the
>> MDS machine, so we usually set the MDS memory target to 80% of the
>> total memory. However, the MDS memory target does not seem to be a
>> strict limit, so there are cases in which MDSes are killed by the
>> kernel oom-killer, and when this happens the standby-replay MDS also
>> has a high probability of being killed by the oom-killer after it
>> becomes active. In these cases we often observe that, during startup,
>> the MDS needs more memory than its memory target to replay the
>> journal, and sometimes the memory is not enough and the MDSes never
>> come up again. Our workaround is to find other machines with more
>> memory to run the MDSes on. This hurts us really badly.

Do you have multiple MDS with pinning?
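
Also, 80% of RAM is aggressive for the cache target:
mds_cache_memory_limit governs the cache, not total MDS memory use, and
replay overhead sits on top of it, so leaving more headroom helps avoid
exactly the replay-then-OOM loop you describe. A rough sketch of the
limit plus pinning is below; the fs name "cephfs", the mount point
/mnt/cephfs, the 48 GiB figure and the directory names are all
assumptions for illustration:

    # Rough sketch: cap the MDS cache well below physical RAM and statically
    # pin top-level directories to MDS ranks.  The fs name, mount point,
    # memory figure and directory names are assumptions, not from the thread.
    import os
    import subprocess

    def ceph(*args: str) -> None:
        subprocess.run(["ceph", *args], check=True)

    # The cache memory limit is a target for the cache, not a hard cap on the
    # whole MDS process, so leave real headroom below physical RAM.
    ceph("config", "set", "mds", "mds_cache_memory_limit", str(48 * 1024**3))

    # Two active ranks, each serving a pinned subtree.
    ceph("fs", "set", "cephfs", "max_mds", "2")

    # ceph.dir.pin is a virtual xattr understood by CephFS clients; a value
    # of N pins that directory subtree to MDS rank N.
    os.setxattr("/mnt/cephfs/datasets/a", "ceph.dir.pin", b"0")
    os.setxattr("/mnt/cephfs/datasets/b", "ceph.dir.pin", b"1")

This won't turn the target into a hard limit, but it keeps the replay
spike inside physical memory rather than in the OOM killer's lap.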

>>   It looks to us like ML training is a standard
>> write-once-read-many workload, so maybe a simple FUSE implementation
>> on top of LibRGW would be good enough.
>> 
>> The above two issues are the main obstacles to our Ceph ambitions at
>> present; with them solved, we believe we can expand our Ceph usage
>> aggressively.
>> 
>> Thanks very much:-)
>> 
>> 
>>> 
>>> Thanks,
>>> Neha
>>> 
>>> [1] https://docs.google.com/forms/d/15aWxoG4wSQz7ziBaReVNYVv94jA0dSNQsDJGqmHCLMg/viewform?ts=65e87dd8&edit_requested=true
>> 
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx


