Re: Ceph Users Feedback Survey

On 5/18/24 07:35, Anthony D'Atri wrote:

Hi Neha, we are very excited to have this opportunity to communicate
with the community.

Ceph has been a very important solution for our data storage
requirements; however, we also see the following issues that are
obstructing Ceph's wider adoption in our office:

1. We continuously observe significant performance degradation in both
our RBD and RGW clusters when PG backfilling is in progress.
    a): According to our investigation, this degradation mostly comes
from PG removals, whose cost lies in listing the objects of the
to-be-removed PGs. We believe this object listing is so expensive
because BlueStore stores its metadata in RocksDB, whose read
performance can suffer when it holds a very large number of keys. The
situation is even worse in our RGW clusters, because our users tend to
store small objects (mostly under 200 KB), which means the same amount
of data produces far more metadata per OSD in the RGW clusters than in
the RBD clusters.
This is very much a known phenomenon.  Fortunately my own ML workload uses ~50MB objects.
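
For a rough sense of scale -- only the 200 KB figure comes from the mail
above, the other sizes are just illustrative -- the per-TiB object (and
therefore onode) count at a few object sizes:

    # Back-of-the-envelope: objects per TiB stored at different sizes.
    TiB = 2**40
    for size_bytes, label in [(200 * 1000, "200 KB RGW object"),
                              (4 * 2**20,  "4 MiB RBD chunk"),
                              (50 * 10**6, "50 MB ML object")]:
        print(f"{label:>18}: {TiB // size_bytes:>9,} objects per TiB")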

Is your RGW index pool on NVMe SSDs?  How many of them, with how many PGs?  Spreading the index pool across more drives and increasing the PG count above the formal guidance may help.

And is your bucket pool on SSDs?
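
If you do end up bumping pg_num on the index pool, something like the
sketch below works from the rados Python bindings (the pool name and
target pg_num are placeholders, and plain `ceph osd pool set` does the
same thing from the CLI):

    import json
    import rados

    INDEX_POOL = "default.rgw.buckets.index"   # substitute your index pool
    TARGET_PG_NUM = "128"                      # placeholder value

    with rados.Rados(conffile="/etc/ceph/ceph.conf") as cluster:
        # Equivalent to `ceph osd pool set <pool> pg_num <n>`; whether it
        # takes effect immediately depends on your pg_autoscaler settings.
        cmd = {"prefix": "osd pool set", "pool": INDEX_POOL,
               "var": "pg_num", "val": TARGET_PG_NUM}
        ret, _, err = cluster.mon_command(json.dumps(cmd), b"")
        print(ret, err)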

I would be very interested to hear if the compact-on-deletion feature helps in this case.  Here's the commit where it was added:

https://github.com/ceph/ceph/pull/47221/commits/fba5488728e89d9b0a1c1ab94b7024fcc81b3b15

The idea here is that when you are in a situation that involves interleaved lists and deletes without a lot of writes, you will end up repeatedly iterating over the tombstones for deleted entries.  If you don't have a concurrent write workload (say for OSDs that are having data migrated away from them), listing can become extremely expensive because there are no compactions cleaning the tombstones up.  This feature makes it so that if RocksDB encounters too many tombstones during iteration, it will automatically trigger a compaction.  You'll need the latest versions of Pacific/Quincy or Reef to be able to utilize/tune this feature, but we've seen it make a huge difference in several large production clusters.
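
If you want to try it, the tuning looks roughly like the snippet below.
I'm going from memory on the option names (the rocksdb_cf_compact_on_deletion*
family), so please confirm them with `ceph config help` on your release
before setting anything:

    import json
    import rados

    # Assumed option names/values for the compact-on-deletion feature --
    # verify with `ceph config help <name>` on your Ceph release first.
    OPTS = {
        "rocksdb_cf_compact_on_deletion": "true",
        "rocksdb_cf_compact_on_deletion_sliding_window": "32768",
        "rocksdb_cf_compact_on_deletion_trigger": "16384",
    }

    with rados.Rados(conffile="/etc/ceph/ceph.conf") as cluster:
        for name, value in OPTS.items():
            # Equivalent to `ceph config set osd <name> <value>`.
            cmd = {"prefix": "config set", "who": "osd",
                   "name": name, "value": value}
            ret, _, err = cluster.mon_command(json.dumps(cmd), b"")
            print(name, ret, err)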



from the one-to-many rgw-rados object mapping. As mentioned earlier,
our users tend to save small objects in our RGW clusters, and RGW
saves each of these small objects as one or more RADOS objects. On the
other hand, RADOS does PG backfilling at the granularity of individual
objects. So, in our RGW clusters, PG backfilling issues large numbers
of small data reads to the underlying disks, which not only hurts
client IO performance but also slows down the backfilling itself
significantly. We think that organizing small RGW objects into large
RADOS objects, with a background compaction process that continuously
scans those RADOS objects and reclaims the space of removed RGW
objects, would solve this issue.
This is sometimes referred to as packing small objects -- the hole / compaction issue is an obvious wrinkle.
I was told a few years ago that certain folks had a prototype here, but I haven't seen a PR.
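
Just to make the idea concrete, here is a toy sketch of the packing
scheme described above -- not anything that exists in RGW. The pool
name is made up and the index lives only in memory; a real design would
persist it (e.g. in omap) and run a background compactor to reclaim the
holes left by deletes:

    import rados

    index = {}  # object name -> (pack object, offset, length)

    with rados.Rados(conffile="/etc/ceph/ceph.conf") as cluster:
        with cluster.open_ioctx("pack-test") as ioctx:  # hypothetical pool
            pack, cursor = "pack.0000000000", 0
            # Append many small user objects into one large RADOS object.
            for name, data in [("a.jpg", b"x" * 100), ("b.jpg", b"y" * 200)]:
                ioctx.append(pack, data)
                index[name] = (pack, cursor, len(data))
                cursor += len(data)

            # Read one small object back out of the pack.
            p, off, length = index["b.jpg"]
            assert ioctx.read(p, length, off) == b"y" * 200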

   b): since we are trying to handle ML workloads, which involve
large numbers of small files, we want to make full use of the memory
on the MDS machines, so we usually configure the MDS memory target to
80% of the total memory. However, the MDS memory target is not a
strict limit, so there are cases in which MDSes are killed by the
kernel oom-killer, and when that happens the standby-replay MDS also
has a high probability of being killed by the oom-killer after it
becomes active. In this situation we often observe that the MDS needs
more memory than its memory target to replay the journal during
startup, and sometimes the memory is simply not enough and the MDSes
never come up again. Our workaround is to find other machines with
more memory to run the MDSes on. This hurts us really badly.
Do you have multiple MDS with pinning?

That was my thought too.  Once a single MDS can't keep up with trimming you can try to make it more aggressive, but it's also probably time to start thinking about going multi-active and using pinning (ephemeral or otherwise) to distribute both memory and trimming across multiple MDSes.  I suspect one of the problems here is that we keep iterating over items in the cache that we can't evict (we used to do this in bluestore as well), but I haven't looked at it closely.  At some point I'd really like to dig into the MDS cache and see if there's anything we can do to make this better.
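
For reference, going multi-active plus distributed ephemeral pinning is
only a couple of steps. Roughly (filesystem paths below are placeholders,
and I'm assuming a kernel mount where the ceph.dir.pin* xattrs are
available):

    import os

    # Assumes /mnt/cephfs is a kernel mount of the filesystem and that
    # max_mds has already been raised (e.g. `ceph fs set <fs> max_mds 2`).
    # Spread the immediate children of /home across the active ranks:
    os.setxattr("/mnt/cephfs/home", "ceph.dir.pin.distributed", b"1")

    # Or pin one hot subtree explicitly to rank 1:
    os.setxattr("/mnt/cephfs/projects", "ceph.dir.pin", b"1")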



   It looks to us like ML training is a standard
write-once-read-many workload, so maybe a simple FUSE implementation
based on LibRGW would be good enough.

ML training (at least for something like ResNet50) is all about ingesting images fast enough to keep the GPUs busy.  For H100s it's going to be about 1GB/s per card, give or take.  LibCephFS itself is actually pretty good for this kind of workload (and beats the kernel client!).  FUSE is a problem though.  In general, we need to figure out how to reduce the amount of work it takes to get data from the network to the GPU (even if it's not quite as fancy as something like GPUDirect).
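
For anyone who hasn't tried it, the libcephfs route from Python looks
roughly like this (paths and the read size are placeholders; error
handling omitted):

    import cephfs

    fs = cephfs.LibCephFS(conffile="/etc/ceph/ceph.conf")
    fs.mount()
    try:
        # Userspace client talks to the MDS/OSDs directly -- no FUSE hop.
        fd = fs.open(b"/datasets/train/img_000001.jpg", 'r')
        try:
            data = fs.read(fd, 0, 4 * 1024 * 1024)  # offset 0, up to 4 MiB
        finally:
            fs.close(fd)
    finally:
        fs.shutdown()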



The above two issues are the major obstacles to our Ceph ambitions at
present, and we believe we can expand our Ceph usage aggressively once
they are solved.

Thanks very much:-)


Thanks,
Neha

[1] https://docs.google.com/forms/d/15aWxoG4wSQz7ziBaReVNYVv94jA0dSNQsDJGqmHCLMg/viewform?ts=65e87dd8&edit_requested=true

--
Best Regards,
Mark Nelson
Head of Research and Development

Clyso GmbH
p: +49 89 21552391 12 | a: Minnesota, USA
w: https://clyso.com | e: mark.nelson@xxxxxxxxx

We are hiring: https://www.clyso.com/jobs/
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx



