Re: RGW: max number of shards per bucket index

Casey Bodley <cbodley@xxxxxxxxxx> · Thu, 28 Apr 2022 13:18:00 -0400

On Tue, Apr 26, 2022 at 5:25 AM Cory Snyder <csnyder@xxxxxxxxx> wrote:
>
> Thanks for your input, Casey! Your response seems to align with my mental model.
>
> It makes sense that choosing the number of bucket index shards
> involves a tradeoff between write parallelism and bucket listing
> performance. Your point about the relevancy of the number of PGs is
> also reasonable. If those were the only constraints, I think that it
> would be fairly straightforward to develop heuristics for finding an
> appropriate number of shards to meet performance criteria on a
> particular bucket.
>
> The other issue that hasn't surfaced in this discussion yet, though,
> is the issue of large omap objects. Avoiding large omap objects seems
> to be the driver behind the current dynamic resharding logic; dynamic
> resharding isn't explicitly concerned about either write parallelism
> or bucket listing latency. It's surprisingly difficult to find details
> on precisely what problems are caused by large omap objects. Is it
> related to recovery times of those objects? Are there other issues?

recovery time is one aspect, yeah. if you haven't seen it, "Adventures
with large RGW buckets"
(https://lists.ceph.io/hyperkitty/list/dev@xxxxxxx/thread/36P62BOOCJBVVJCVUX5F5J7KYCGAAICV/)
was a good discussion on this subject, and Josh touches on the
recovery aspect at the end

in the past, large omap objects were hard on scrub too. but as far as
i know, the only 'other issue' now is just that rocksdb performance
drops off when there's too much data in omap overall. so if there's an
imbalance that causes too many keys to pile onto a single OSD, you'd
want more shards to even that out. but once you have enough shards (or
enough total buckets) to spread that omap data evenly across OSDs,
it's not clear to me that you'd see other benefits from further
splitting into more rados objects. i'd love to hear from someone more
familiar with bluestore
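
to illustrate the "enough shards to even that out" point, here's a rough python sketch, not a model of real placement. assumptions: each shard object is hashed straight to one OSD (real clusters map objects to PGs and then to OSDs through CRUSH, with replication), keys are split evenly across shards, and the `.dir.<bucket_id>.<shard>` naming plus the `osd_for_shard` and `keys_per_osd` helpers are purely illustrative.

```python
import hashlib
from collections import Counter


def osd_for_shard(bucket_id: str, shard: int, num_osds: int) -> int:
    """Hypothetical stand-in for CRUSH: hash the shard object name to an OSD id."""
    name = f".dir.{bucket_id}.{shard}".encode()
    return int.from_bytes(hashlib.sha256(name).digest()[:8], "big") % num_osds


def keys_per_osd(bucket_id: str, num_keys: int, num_shards: int, num_osds: int) -> list:
    """Return the omap key count landing on each OSD.

    Assumes keys are split as evenly as possible across the index shards.
    """
    load = Counter()
    base, extra = divmod(num_keys, num_shards)
    for shard in range(num_shards):
        shard_keys = base + (1 if shard < extra else 0)
        load[osd_for_shard(bucket_id, shard, num_osds)] += shard_keys
    return [load[osd] for osd in range(num_osds)]


if __name__ == "__main__":
    # Compare the most and least loaded OSD as the shard count grows.
    for shards in (1, 11, 101, 1999):
        load = keys_per_osd("example-bucket", 10_000_000, shards, 12)
        print(f"shards={shards:5d} max={max(load):9d} min={min(load):9d}")
```

with a single shard every key sits on one OSD. as the shard count grows past the OSD count the per-OSD totals converge, and in this toy model extra shards beyond that point change very little.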

in addition to rgw_max_objs_per_shard, i wanted to point out two
related changes that landed in octopus:

* the default shard count of new buckets was raised to 11 for write
parallelism. see https://github.com/ceph/ceph/pull/32660 and
https://github.com/ceph/ceph/pull/30875 for discussion

* dynamic resharding was limited to rgw_max_dynamic_shards=1999, to
serve as a bound for bucket listing latency. see
https://github.com/ceph/ceph/pull/30795

so while rgw_max_objs_per_shard alone isn't sufficient to capture all
of these dimensions, the three knobs together (objects per shard, the
default shard count, and the dynamic shard cap) might be
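
as a rough illustration of how those three knobs bound a shard count, here's a minimal python sketch. it is not the logic rgw actually runs when it reshards, which differs in its details. assumptions: 100000 is the upstream default for rgw_max_objs_per_shard (check your own cluster's config), and `suggested_shards` is a hypothetical helper name.

```python
# Values taken from the thread, except where noted as an assumption.
RGW_MAX_OBJS_PER_SHARD = 100_000   # assumption: upstream default, verify locally
DEFAULT_SHARDS = 11                # default for new buckets since octopus
RGW_MAX_DYNAMIC_SHARDS = 1999      # cap on dynamic resharding since octopus


def suggested_shards(
    num_objects: int,
    max_objs_per_shard: int = RGW_MAX_OBJS_PER_SHARD,
    min_shards: int = DEFAULT_SHARDS,
    max_shards: int = RGW_MAX_DYNAMIC_SHARDS,
) -> int:
    """Clamp the shard count needed to stay under max_objs_per_shard.

    The floor keeps some write parallelism for small buckets; the ceiling
    bounds bucket listing latency for huge ones.
    """
    if num_objects < 0 or max_objs_per_shard <= 0:
        raise ValueError("num_objects must be >= 0 and max_objs_per_shard > 0")
    needed = -(-num_objects // max_objs_per_shard)  # integer ceiling division
    return max(min_shards, min(needed, max_shards))


if __name__ == "__main__":
    for n in (0, 5_000_000, 1_000_000_000):
        print(n, suggested_shards(n))
```

past the cap the objects-per-shard target can no longer be met, so buckets that large end up with more than rgw_max_objs_per_shard entries in each shard.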

>
> Thanks,
>
> Cory
>
>
> On Mon, Apr 25, 2022 at 12:13 PM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
> >
> > On Fri, Apr 22, 2022 at 3:20 PM Cory Snyder <csnyder@xxxxxxxxx> wrote:
> > >
> > > Hi all,
> > >
> > > Does anyone have any guidance on the maximum number of bucket index shards
> > > to optimize performance for buckets with a huge number of objects? It seems
> > > like there is probably a threshold where performance starts to decrease
> > > with an increased number of shards (particularly bucket listings). More
> > > specifically, if I have N OSDs in the bucket index pool, does it make sense
> > > to allow a bucket to have more than N index shards?
> >
> > with respect to write parallelism, i think the most interesting limit
> > is the PG count of the index pool. my understanding is that the OSDs
> > can only handle a single write at a time per PG due to the rados
> > recovery model. so you'd expect to see index write performance
> > increase as you raise the shard count, but level off as you get closer
> > to that PG count
> >
> > > Perhaps some multiple
> > > of N makes sense, with the value of the multiplier influenced by
> > > osd_op_num_threads_per_shard and osd_op_num_shards?
> >
> > i'm less familiar with these OSD configurables, but it's possible that
> > they'd impose limits on parallelism below the PG count
> >
> > >
> > > Thanks in advance for any theoretical or empirical insights!
> >
> > if you need to list these huge buckets, you'll want to strike a
> > balance between write parallelism and the latency of bucket listing
> > requests. once that request latency reaches the client's retry timer,
> > you'll really start to see listing performance fall off a cliff
> >
> > >
> > > Cory Snyder
> > > _______________________________________________
> > > ceph-users mailing list -- ceph-users@xxxxxxx
> > > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> > >
> >
>