Hi Xuehan,

Thank you for sharing your feedback! Those sound like good topics to discuss at the User Dev Meetings.

- Neha

> On Mar 13, 2024, at 11:14 AM, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
>
>> Hi everyone,
>>
>> On behalf of the Ceph Foundation Board, I would like to announce the creation of, and cordially invite you to, the first of a recurring series of meetings focused solely on gathering feedback from the users of Ceph. The overarching goal of these meetings is to elicit feedback from the users, companies, and organizations who use Ceph in their production environments. You can find more details about the motivation behind this effort in our user survey [1], which we highly encourage all of you to take. This is an extension of the Ceph User Dev Meeting, with a concerted focus on Performance (led by Vincent Hsu, IBM) and Orchestration/Deployment (led by Matt Leonard, Bloomberg) to start off with. We would like to kick off this series of meetings on March 21, 2024. The survey will be open until March 18, 2024.
>>
>> Looking forward to hearing from you!
>
> Hi Neha, we are very excited to see this opportunity to communicate with the community.
>
> Ceph has been a very important solution for our data storage requirements; however, we also see the following issues that are holding back wider Ceph adoption in our organization:
>
> 1. We continuously observe significant performance degradation in both our rbd and rgw clusters while PGs are backfilling.
>   a) According to our investigation, this degradation mostly comes from PG removals, whose cost lies in listing the objects of the to-be-removed PGs. We believe this listing is expensive because BlueStore keeps its metadata in RocksDB, whose read performance degrades when it holds a large number of keys. The situation is even worse in our rgw clusters, because our users tend to store small objects (mostly under 200KB), so the same total amount of data produces far more metadata per OSD than in the rbd clusters. It hurts even more when a cluster uses EC as the data redundancy policy, because those small objects are split into still smaller pieces. In practice, with the default PG removal configuration our rgw clusters can become completely unavailable when we try to expand them; on the other hand, we cannot throttle PG removal too aggressively either, because we rely on it to free disk space, and slowing it down results in FULL clusters (see the config sketch after the two issues below). It seems that if we could remove a PG's objects without listing the PG, this issue would be gone.
>   b) The degradation also appears to come from the one-to-many rgw-to-rados object mapping. As mentioned above, our users tend to store small objects in our rgw clusters, and rgw stores each of them as one or more rados objects, while RADOS backfills at the granularity of objects. So in our rgw clusters, PG backfilling issues large numbers of small reads to the underlying disks, which not only hurts client I/O performance but also slows down the backfilling itself significantly. We think that packing small rgw objects into large rados objects, combined with a background compaction process that continuously scans those rados objects and reclaims the space of deleted rgw objects, would solve this issue.
>
> 2. We also need to store large amounts of ML training data, which in most cases also takes the form of small files; however, CephFS does not seem to handle this situation very well:
>   a) We continuously see clients failing to respond to write-caps reclamation requests even when they have no data left to flush, and the cluster does not detect such clients and force the reclamation, which can block many other clients indefinitely; in these cases our administrators often have to evict the client manually.
>   b) Since we are handling ML workloads with large numbers of small files, we want to make full use of the memory on the MDS machines, so we usually set the MDS memory target to 80% of total memory. However, this target is not a strict limit, so there are cases in which MDSes are killed by the kernel OOM killer, and when that happens the standby-replay MDS also has a high probability of being OOM-killed after it becomes active. We often observe that an MDS needs more memory than its target while replaying the journal during startup, and sometimes the memory is simply not enough and the MDSes never come up again. Our workaround is to find machines with more memory and run the MDSes there. This hurts us really badly (see the config sketch below).
>
> It looks to us that ML training is a standard write-once-read-many workload, so maybe a simple FUSE implementation based on LibRGW would be good enough.
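>
> For concreteness, here is a rough sketch of the knobs we mean. The option names are the ones we believe govern these behaviours (osd_delete_sleep_* for PG removal throttling, mds_cache_memory_limit for the MDS memory target); the values are purely illustrative, not recommendations:
>
>     # 1(a): throttling PG removal on the OSDs -- sleeping longer between removal
>     # transactions protects client I/O during expansion, but delays the space
>     # reclamation we depend on, which is exactly the trade-off described above.
>     ceph config set osd osd_delete_sleep_hdd 2.0   # seconds between removal transactions on HDD OSDs
>     ceph config set osd osd_delete_sleep_ssd 0.1   # same throttle for SSD OSDs
>
>     # 2(b): the "MDS memory target" we refer to is, as far as we understand,
>     # mds_cache_memory_limit; it bounds the MDS cache rather than the whole
>     # process, so actual RSS can exceed it (e.g. during journal replay).
>     ceph config set mds mds_cache_memory_limit 107374182400   # 100 GiB in bytes, roughly the 80% sizing we use on a 128 GiB machine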
>
> The above two issues are the main obstacles to our Ceph ambitions at present, and we believe we can extend our Ceph usage aggressively once they are solved.
>
> Thanks very much :-)
>
>> Thanks,
>> Neha
>>
>> [1] https://docs.google.com/forms/d/15aWxoG4wSQz7ziBaReVNYVv94jA0dSNQsDJGqmHCLMg/viewform?ts=65e87dd8&edit_requested=true
>> _______________________________________________
>> Dev mailing list -- dev@xxxxxxx
>> To unsubscribe send an email to dev-leave@xxxxxxx