Re: Ceph Users Feedback Survey

> Hi everyone,
>
> On behalf of the Ceph Foundation Board, I would like to announce the creation of, and cordially invite you to, the first of a recurring series of meetings focused solely on gathering feedback from the users of Ceph. The overarching goal of these meetings is to elicit feedback from the users, companies, and organizations who use Ceph in their production environments. You can find more details about the motivation behind this effort in our user survey [1] that we highly encourage all of you to take. This is an extension of the Ceph User Dev Meeting with concerted focus on Performance (led by Vincent Hsu, IBM) and Orchestration/Deployment (led by Matt Leonard, Bloomberg), to start off with. We would like to kick off this series of meetings on March 21, 2024. The survey will be open until March 18, 2024.
>
> Looking forward to hearing from you!

Hi Neha,

We are very excited to see this opportunity to communicate with the
community.

Ceph has been a very important solution for our data storage needs;
however, we also see the following issues that are holding back wider
Ceph adoption in our organization:

1. We continuously observe significant performance degradation in both
our RBD and RGW clusters while PGs are backfilling.
     a): According to our investigation, this degradation mostly comes
from PG removals, whose cost lies in listing the objects of the PGs
being removed. We believe this listing is expensive because BlueStore
stores its metadata in RocksDB, whose read performance may degrade
when it holds a very large number of keys. The situation is even worse
in our RGW clusters, because our users tend to store small objects
(mostly under 200KB), so the same total amount of data produces far
more metadata per OSD than in our RBD clusters. It hurts even more
when a cluster uses EC as its data redundancy policy, because those
small objects are split into even smaller pieces. As a result, with
the default PG removal settings our RGW clusters can become completely
unavailable when we try to expand them; on the other hand, we usually
cannot throttle PG removal heavily either, because we rely on it to
free disk space, and slowing it down too much can lead to FULL
clusters. It seems that if we could remove a PG's objects without
listing the PG, this issue would go away.
     b): It also looks to us like part of the degradation comes from
the one-to-many rgw-to-rados object mapping. As mentioned above, our
users tend to store small objects in our RGW clusters, RGW stores each
of them as one or more RADOS objects, and RADOS does PG backfilling at
object granularity. So in our RGW clusters, backfilling issues a large
number of small reads to the underlying disks, which not only hurts
client IO performance but also slows down the backfilling itself
significantly. We think that packing small RGW objects into larger
RADOS objects, combined with a background compaction process that
continuously scans those RADOS objects and reclaims the space of
deleted RGW objects, might solve this issue; a very rough sketch of
what we mean is included after item 2 below.
2. We also need to store large amounts of ML training data, which in
most cases also comes in the form of small files; however, CephFS does
not seem to handle this situation very well:
    a): we continuously see clients failing to respond to write-caps
recall requests even when they have no data left to flush to the
cluster, while the cluster does not detect such clients and therefore
never forces the release, which can block many other clients forever;
in these cases our administrators often have to evict the client
manually;
    b): since we are trying to handle ML workloads, which involve
large numbers of small files, we want to use the memory of the MDS
machine as fully as possible, so we usually set the MDS memory target
to 80% of total memory. However, the MDS memory target does not appear
to be a hard limit, so there are cases in which MDSes are killed by
the kernel OOM killer, and when that happens the standby-replay MDS
also has a high probability of being OOM-killed shortly after it
becomes active. In such cases we often observe that the MDS needs more
memory than its memory target to replay the journal during startup,
and sometimes there is simply not enough memory and the MDSes never
come up again. Our only workaround so far has been to find machines
with more memory to run the MDSes on. This hurts us really badly.
    It looks to us that ML training is a fairly standard
write-once-read-many workload, so maybe a simple FUSE implementation
based on LibRGW would be good enough; a rough sketch of that idea is
also included further below.
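
To make the packing idea in 1.b a bit more concrete, here is a very
rough, self-contained Python sketch of the kind of scheme we have in
mind. Every name in it is made up for illustration; it is not a
proposal for any real RGW/RADOS interface, just the shape of the idea:
small objects are appended into large blobs, deletion only marks
entries dead in an index, and a background compaction pass rewrites a
blob once enough of it is dead.

# Toy sketch: pack many small "rgw objects" into a few large blobs,
# delete by marking entries dead, and reclaim space with a background
# compaction pass. All names are hypothetical and only illustrate the
# idea in 1.b; this is not any real RGW/RADOS interface.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

BLOB_TARGET_SIZE = 4 * 1024 * 1024   # pack small objects into ~4MB blobs
COMPACT_DEAD_RATIO = 0.5             # rewrite a blob once half of it is dead


@dataclass
class Blob:
    data: bytearray = field(default_factory=bytearray)
    # object name -> (offset, length) of its bytes inside `data`
    live: Dict[str, Tuple[int, int]] = field(default_factory=dict)
    dead_bytes: int = 0


class PackedStore:
    def __init__(self):
        self.blobs: List[Blob] = [Blob()]
        self.index: Dict[str, int] = {}          # object name -> blob id

    def put(self, name: str, payload: bytes) -> None:
        if name in self.index:                   # overwrite == delete + put
            self.delete(name)
        blob_id = len(self.blobs) - 1
        if len(self.blobs[blob_id].data) + len(payload) > BLOB_TARGET_SIZE:
            self.blobs.append(Blob())            # start a new large blob
            blob_id += 1
        blob = self.blobs[blob_id]
        blob.live[name] = (len(blob.data), len(payload))
        blob.data.extend(payload)
        self.index[name] = blob_id

    def get(self, name: str) -> bytes:
        blob = self.blobs[self.index[name]]
        off, length = blob.live[name]
        return bytes(blob.data[off:off + length])

    def delete(self, name: str) -> None:
        # Deleting is only an index update; no listing, no data movement.
        blob = self.blobs[self.index.pop(name)]
        _, length = blob.live.pop(name)
        blob.dead_bytes += length

    def compact(self) -> None:
        # Background pass: rewrite any blob whose dead ratio is too high,
        # copying only the live entries into a fresh blob.
        for blob_id, blob in enumerate(self.blobs):
            if not blob.data:
                continue
            if blob.dead_bytes / len(blob.data) < COMPACT_DEAD_RATIO:
                continue
            fresh = Blob()
            for name, (off, length) in blob.live.items():
                fresh.live[name] = (len(fresh.data), length)
                fresh.data.extend(blob.data[off:off + length])
            self.blobs[blob_id] = fresh

With something like this, backfilling would move a handful of
multi-megabyte blobs instead of huge numbers of sub-200KB objects, and
the per-PG listing cost during PG removal should also shrink simply
because there would be far fewer RADOS objects and keys per PG.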

The above two issues are the main obstacles to our Ceph ambitions at
present, and we believe we can expand our Ceph usage aggressively once
they are solved.
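
As for the LibRGW-based FUSE idea at the end of item 2, below is an
equally rough, read-only sketch. It goes through the RGW S3 endpoint
with boto3 and fusepy rather than binding LibRGW directly (we are not
sure what the cleanest binding would be); ENDPOINT, BUCKET and the
mountpoint are placeholders, and error handling, caching and proper
directory semantics are all omitted.

# Very rough read-only FUSE sketch for a write-once-read-many training
# data set. Goes through the RGW S3 API (boto3) plus fusepy instead of
# LibRGW directly; ENDPOINT/BUCKET/mountpoint below are placeholders.

import errno
import stat
import sys

import boto3
from fuse import FUSE, FuseOSError, Operations

ENDPOINT = "http://rgw.example.com:8080"   # placeholder RGW endpoint
BUCKET = "training-data"                   # placeholder bucket name

s3 = boto3.client("s3", endpoint_url=ENDPOINT)


class ROBucketFS(Operations):
    """Exposes a flat bucket as a read-only directory of files."""

    def getattr(self, path, fh=None):
        if path == "/":
            return {"st_mode": stat.S_IFDIR | 0o555, "st_nlink": 2}
        try:
            head = s3.head_object(Bucket=BUCKET, Key=path.lstrip("/"))
        except Exception:
            raise FuseOSError(errno.ENOENT)
        return {"st_mode": stat.S_IFREG | 0o444,
                "st_nlink": 1,
                "st_size": head["ContentLength"]}

    def readdir(self, path, fh):
        yield "."
        yield ".."
        # Flat namespace for the sketch: every key is a top-level file.
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET):
            for obj in page.get("Contents", []):
                yield obj["Key"]

    def read(self, path, size, offset, fh):
        resp = s3.get_object(Bucket=BUCKET,
                             Key=path.lstrip("/"),
                             Range=f"bytes={offset}-{offset + size - 1}")
        return resp["Body"].read()


if __name__ == "__main__":
    # e.g.: python rofs.py /mnt/training
    FUSE(ROBucketFS(), sys.argv[1], foreground=True, ro=True)

It is obviously nowhere near production quality, but it shows how
little machinery a write-once-read-many path needs compared to a full
POSIX filesystem.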

Thanks very much :-)


>
> Thanks,
> Neha
>
> [1] https://docs.google.com/forms/d/15aWxoG4wSQz7ziBaReVNYVv94jA0dSNQsDJGqmHCLMg/viewform?ts=65e87dd8&edit_requested=true
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx


