Re: rgw multisite: revisiting the design of 'async notifications'

a friendly reminder of last year's discussion. i still think we're
better off removing this feature: it's extra complexity in sync code
that's already too complicated, and it doesn't scale to multiple
gateways per zone. the only benefit it provides is in perceived sync
latency, at small scales, when we're mostly caught up already. in most
other cases, it's a detriment to sync performance because of the extra
network traffic, the duplicated processing of most datalog entries,
and the loss of temporal locality in the bucket sync cache.

On Tue, Mar 30, 2021 at 2:39 PM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>
> On Tue, Mar 30, 2021 at 1:52 PM Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx> wrote:
> >
> >
> >
> > On Tue, Mar 30, 2021 at 8:27 PM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
> >>
> >> in multisite, these async notifications are http messages that get
> >> periodically broadcast to peer zones as new entries are added to a
> >> shard of the mdlog or datalog. on the destination zones, they serve
> >> two purposes:
> >>
> >> * wake up the coroutines that were processing the given log shards, in
> >> case they were sleeping because there was nothing to do the last time
> >> they polled
> >>
> >> * for data sync only, these messages also carry the keys of each new
> >> datalog entry so we can trigger sync on the related bucket shards (in
> >> addition to the buckets we're already syncing from the datalog itself)
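
to make the first purpose concrete, here's a rough sketch of the wakeup
mechanism it implies. the names and threading model are illustrative
only - the actual rgw code is coroutine-based, not thread-based:

#include <chrono>
#include <condition_variable>
#include <mutex>

// each log shard has a processor that polls for new entries. when a
// poll finds nothing, it sleeps for the full poll interval - unless an
// async notification arrives and wakes it early.
class ShardPoller {
  std::mutex m;
  std::condition_variable cv;
  bool notified = false;

 public:
  // called on receipt of an async notification for this shard
  void wake() {
    std::lock_guard<std::mutex> l(m);
    notified = true;
    cv.notify_one();
  }

  // returns early if notified, otherwise after the full poll interval
  void wait_for_work(std::chrono::seconds poll_interval) {
    std::unique_lock<std::mutex> l(m);
    cv.wait_for(l, poll_interval, [this] { return notified; });
    notified = false;
  }
};

the wakeup saves us from waiting out the full poll interval when new
entries arrive; the cost is an http round trip per notification.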
> >>
> >> these notifications have been in place since jewel. as i understand it, the
> >> goal was to make replication feel more responsive to updates, but the
> >> model has two major flaws:
> >>
> >> * it doesn't scale to more than one gateway per zone. when
> >> broadcasting these notifications, we choose one radosgw endpoint from
> >> each peer zone - but we have no way to know which one of those is
> >> actually processing the log shards we're trying to notify. on receipt,
> >> data sync will cache all of these keys in a map of 'modified_shards',
> >> and the entries will just pile up in memory for the shards it isn't
> >> processing
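
to illustrate the pile-up, a sketch of the receive side. the map of
keys per shard matches the description above; the rest is assumed for
illustration, not the actual rgw code:

#include <map>
#include <mutex>
#include <set>
#include <string>

struct DataSyncNotifyCache {
  std::mutex mtx;
  // keys from notifications, grouped by datalog shard, waiting for
  // that shard's processor to consume them
  std::map<int, std::set<std::string>> modified_shards;

  void on_notify(int shard, const std::set<std::string>& keys) {
    std::lock_guard<std::mutex> l(mtx);
    // if another gateway in this zone is the one actually processing
    // 'shard', nothing on this gateway ever drains the entry, so it
    // just grows in memory
    modified_shards[shard].insert(keys.begin(), keys.end());
  }
};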
> >>
> >> * it reduces the apparent latency of sync on some buckets at the
> >> expense of overall sync throughput. not only does it prioritize sync
> >> of 'hot' buckets over buckets in the backlog, but for every bucket we
> >> sync via a notification, we'll re-sync it again when we get to its
> >> entry in the log. i don't think this tradeoff is a good one
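
and a sketch of the duplicated work, with assumed names:

#include <iostream>
#include <string>

struct BucketSyncDriver {
  // one full sync pass over a bucket shard
  void sync_bucket_shard(const std::string& key) {
    std::cout << "syncing bucket shard " << key << "\n";
  }
  void on_notify(const std::string& key) {
    sync_bucket_shard(key);  // first pass, triggered by the notification
  }
  void on_datalog_entry(const std::string& key) {
    // the datalog shard is still processed in order, so the same
    // bucket shard gets a second full pass when its entry comes up
    sync_bucket_shard(key);
  }
};

int main() {
  BucketSyncDriver driver;
  driver.on_notify("mybucket:0");        // synced now...
  driver.on_datalog_entry("mybucket:0"); // ...and synced again later
}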
> >>
> >>
> >> what does everyone else think? are there other reasons to keep sending these?
> >
> >
> > If it's not sent and there's no other sync happening, then no sync will happen at all until the polling period for the specific shard elapses. That means throughput is actually reduced in that case, as it takes longer to sync the same amount of data. I'm not sure that prioritizing hot buckets over the backlog is a bad idea.
> > It seems to me that maybe we can rethink how the notification is handled instead of eliminating it completely. Maybe we can find something that works and does not affect total system throughput.
> >
> > Yehuda
>
> thanks Yehuda,
>
> it seems like this is trying to optimize throughput in the case where
> we're mostly caught up, but that's not where it matters!
>
> if the backlog is small, all of the buckets it contains are hot, so
> the notifications aren't relevant for prioritization. they become more
> and more relevant as the size of the backlog grows - but if the
> backlog is large, we really should be prioritizing that or we'll just
> get further behind
>
> i'm sure there are cases where it's important to prioritize recent
> changes. but DR (disaster recovery) still seems to be the major use
> case for multisite, and there i think the overall sync delta is far
> more important than the ability to read changes right after they're
> written.
