Re: rgw multisite: revisiting the design of 'async notifications'

On Wed, Mar 30, 2022 at 10:23 AM Matt Benjamin <mbenjami@xxxxxxxxxx> wrote:
>
> I think it is worth pointing out that, as Yehuda states, it's a
> fundamental property of polling systems that you can't arbitrarily
> reduce the polling interval.  The current polling interval of 20s is
> already quite short, yet it doesn't seem able to overload any
> plausible network, so maybe it's a pretty good default value.

yeah, it's hard to reason about these tunings without specific
workloads in mind. it might be interesting to build some different
workloads, like spiky vs. smooth, and see how sync behaves over a
range of tunings like 5s, 20s, 1m, 5m
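
as a rough starting point for that kind of experiment, here is a toy model in python. it is only a sketch under stated assumptions: the workload generators, function names and parameters are all made up for illustration, and it only measures how long a write waits for the next poll (no transfer time, no sync backlog, no notifications)

```python
import math
import random

def smooth_workload(duration_s, rate_per_s):
    """Evenly spaced writes: one every 1/rate_per_s seconds (hypothetical model)."""
    step = 1.0 / rate_per_s
    return [i * step for i in range(int(duration_s * rate_per_s))]

def spiky_workload(duration_s, burst_every_s, burst_size, seed=0):
    """Bursts of writes packed into one second, separated by idle gaps."""
    rng = random.Random(seed)
    times = []
    start = 0.0
    while start < duration_s:
        times.extend(start + rng.random() for _ in range(burst_size))
        start += burst_every_s
    return sorted(times)

def poll_delays(write_times, interval_s):
    """Time each write waits until the first poll at or after it.

    Polls are assumed to happen at 0, interval_s, 2*interval_s, ...
    The max() clamps tiny negative values caused by float rounding.
    """
    return [max(0.0, math.ceil(t / interval_s) * interval_s - t)
            for t in write_times]

def summarize(write_times, interval_s, duration_s):
    """Trade-off for one tuning: how many polls vs. how long writes wait."""
    delays = poll_delays(write_times, interval_s)
    return {
        "interval_s": interval_s,
        "polls": int(duration_s // interval_s),
        "mean_delay_s": sum(delays) / len(delays),
        "max_delay_s": max(delays),
    }

if __name__ == "__main__":
    duration = 3600
    workloads = {
        "smooth": smooth_workload(duration, rate_per_s=1),
        "spiky": spiky_workload(duration, burst_every_s=600, burst_size=600),
    }
    for name, writes in workloads.items():
        for interval in (5, 20, 60, 300):
            print(name, summarize(writes, interval, duration))
```

a real comparison would need to measure actual sync behavior on a cluster; this only shows the shape of the polls-vs-wait trade-off for each interval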

>
> I feel like the basic notion of adding notifications was probably a
> good intuition, but right now the model does seem to be ill-defined.
> Without a quantitative model of behavior under a variety of
> conditions, it would be very hard to be confident of its efficiency.
> It's probably bad as currently tuned.  Just offhand, it seems like the
> main potential utility it has would be to spread activation to new
> bucket activations, in particular when there is a deep outstanding
> backlog of hints (all of which might be for buckets which have been
> ingesting data for a long period)?

data sync could be smarter about this, but it's hard to know what to
prioritize - the right answer will depend heavily on the use case

in earlier discussions we talked about the general desire to
prioritize at least some recent changes, but that too will depend. if
the use case is DR and you're trying to meet some SLA, you really
don't want to prioritize new changes over the oldest

the most obvious metric for prioritization is "oldest change first",
which is (kind of) what we get from the datalog itself: each entry
says "shard X of bucket Y changed at time T", and we expect these
entries to be in mostly-temporal order. so data incremental sync will
spawn bucket sync on the bucket shards with the oldest changes
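
to make the ordering concrete, a minimal python sketch of "oldest change first". the DatalogEntry type and oldest_first() are hypothetical names for illustration only, not rgw's actual datalog structures or api

```python
from typing import NamedTuple

class DatalogEntry(NamedTuple):
    """Hypothetical stand-in for 'shard X of bucket Y changed at time T'."""
    bucket: str
    shard: int
    timestamp: float

def oldest_first(entries):
    """Return (bucket, shard) pairs ordered by their oldest recorded change.

    Entries may arrive slightly out of order and may repeat the same
    bucket shard, so keep only the oldest timestamp per shard.
    """
    oldest = {}
    for e in entries:
        key = (e.bucket, e.shard)
        if key not in oldest or e.timestamp < oldest[key]:
            oldest[key] = e.timestamp
    # Break timestamp ties by key so the order is deterministic.
    return sorted(oldest, key=lambda k: (oldest[k], k))

if __name__ == "__main__":
    log = [
        DatalogEntry("a", 0, 10.0),
        DatalogEntry("b", 1, 5.0),
        DatalogEntry("a", 0, 3.0),
        DatalogEntry("c", 0, 7.0),
    ]
    for bucket, shard in oldest_first(log):
        print(f"sync bucket {bucket} shard {shard}")
```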

the main problem with this approach is just that bucket sync is
'greedy' and, once started, won't return until it syncs all the way to
the end of the remote's bilog. so when we trigger bucket sync on the
oldest change, we may go on to sync a bunch of newer stuff as well,
instead of going back and processing the next-oldest change in the
datalog. so we need some limit on bucket sync, based either on time or
on the number of objects we'll sync, to allow data sync to better
prioritize its backlog
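
a toy python model of the difference a limit makes. everything here is an assumption for illustration (run_data_sync, the max_entries budget, and bilogs represented as plain lists of timestamps); it is not how rgw's coroutines are structured, and a time-based limit would work the same way

```python
import heapq
from collections import deque

def run_data_sync(bilogs, max_entries=None):
    """Simulate data sync spawning bucket sync, oldest change first.

    bilogs: dict mapping a bucket shard name to its change timestamps
            in ascending order (a stand-in for the remote's bilog).
    max_entries: per-run budget for bucket sync; None means 'greedy',
            i.e. sync all the way to the end of the bilog.
    Returns the order in which (shard, timestamp) changes were synced.
    """
    if max_entries is not None and max_entries < 1:
        raise ValueError("max_entries must be at least 1")
    pending = {name: deque(ts) for name, ts in bilogs.items() if ts}
    # Heap keyed on the oldest unsynced change of each bucket shard.
    heap = [(q[0], name) for name, q in pending.items()]
    heapq.heapify(heap)
    order = []
    while heap:
        _, name = heapq.heappop(heap)
        q = pending[name]
        budget = len(q) if max_entries is None else min(max_entries, len(q))
        for _ in range(budget):
            order.append((name, q.popleft()))
        if q:
            # Stopped early: requeue at the next unsynced change so data
            # sync can pick whichever shard now has the oldest backlog.
            heapq.heappush(heap, (q[0], name))
    return order

if __name__ == "__main__":
    bilogs = {"a": [1, 4, 5, 6], "b": [2, 3]}
    print("greedy: ", run_data_sync(bilogs))
    print("limit 1:", run_data_sync(bilogs, max_entries=1))
```

in the greedy run, shard "a" syncs its changes at 4, 5 and 6 before shard "b" gets to its older changes at 2 and 3; with a budget of 1 the changes are synced in timestamp order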

as Adam points out, this ties into future 'sync fairness' work which
also needs a way to stop bucket sync early, so that other rgws get a
chance to grab the cls_locks for data sync

>
> Matt
>
> On Wed, Mar 30, 2022 at 9:55 AM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
> >
> > On Wed, Mar 30, 2022 at 7:57 AM Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx> wrote:
> > >
> > > If there are implementation, configuration or other optimization issues, then these should be handled.
> >
> > there are several known issues discussed in this thread. as i pointed
> > out with the load balancer example above, fixing them may not even be
> > tractable. but before we can decide whether it's worth the effort to
> > pursue these fixes, we as a team need to agree on what these
> > notifications are meant to accomplish
> >
> > it's clear to me that notifications don't help the DR use case. they
> > optimize for a "latency" that is not visible to users, at the expense
> > of overall sync bandwidth. they cause data sync to duplicate the
> > processing of *every event*. they can spam http requests every 200ms,
> > even if the other zone doesn't need the wakeups
> >
> > if notifications are worth fixing, we need to understand the cases
> > where they actually help. i still don't think you've shown any
> > compelling cases, let alone the motivation to optimize for those cases
> > at the expense of DR
> >
> > >
> > > On Wed, Mar 30, 2022 at 5:00 AM Yuval Lifshitz <ylifshit@xxxxxxxxxx> wrote:
> > > >
> > > > agree. if we have the polling time configurable we simplify the overall mechanism and keep test duration at bay.
> > > >
> > > > On Wed, Mar 30, 2022 at 12:12 AM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
> > > >>
> > > >> On Tue, Mar 29, 2022 at 3:56 PM Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx> wrote:
> > > >> >
> > > >> > How about this use case:
> > > >> > An rgw multisite test suite that checks that objects have been synced to the remote zone before it can continue to the next test.
> > > >> > Wanting to reduce latency shouldn't be controversial. Performance is not just bandwidth.
> > > >>
> > > >> granted, the multisite tests do write objects and then wait for them
> > > >> to show up on other zones, so they do observe this as actual latency
> > > >>
> > > >> but what actual use cases look like this? where else does 'replication
> > > >> time' mean the same thing as 'latency'?
> > > >>
> > > >> earlier in the thread we talked about making the polling interval
> > > >> configurable; multisite tests can just set that to 1. that knob might
> > > >> be good enough for some other use cases too?
> > > >>
> > > >> >
> > > >> > Yehuda
> > > >> >
> > > >> > On Tue, Mar 29, 2022, 2:56 PM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
> > > >> >>
> > > >> >> hi Yehuda,
> > > >> >>
> > > >> >> On Wed, Mar 23, 2022 at 9:46 AM Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx> wrote:
> > > >> >> >
> > > >> >> > On Tue, Mar 22, 2022 at 2:14 PM Adam C. Emerson <aemerson@xxxxxxxxxx> wrote:
> > > >> >> > >
> > > >> >> > > On 22/03/2022, Matt Benjamin wrote:
> > > >> >> > > > Just to be clear, why do we think it doesn't serve as an optimization?
> > > >> >> > >
> > > >> >> > > My thought being, if we're already saturated with syncing stuff,
> > > >> >> > > adding more work on top of it won't help anything.
> > > >> >> >
> > > >> >> > And what if we're not saturated? You're optimizing the high traffic
> > > >> >> > case by killing the low traffic case. If there are specific
> > > >> >> > implementation issues then address them, but I think this is very
> > > >> >> > valuable to some use cases.
> > > >> >>
> > > >> >> i'm still interested in exploring these use cases, to learn how async
> > > >> >> notifications can work with the rest of multisite sync to satisfy them
> > > >> >>
> > > >> >> it sounds like you're interested in use cases with very strict
> > > >> >> requirements on the sync delta, given that they demand a 'sensitivity'
> > > >> >> on the order of 200ms
> > > >> >>
> > > >> >> however, multisite does asynchronous replication. this means that no
> > > >> >> client can expect to read an object on a secondary zone immediately
> > > >> >> after writing it to the primary. this replication could be arbitrarily
> > > >> >> far behind. ultimately, we can't provide any guarantees about how long
> > > >> >> it will take for a given write to replicate
> > > >> >>
> > > >> >> so i'm having a lot of trouble coming up with use cases that are
> > > >> >> compatible with async replication, but are also 'killed' when we
> > > >> >> replace notifications every 200ms with polling at a 20s interval
> > > >> >>
> > > >> >> if async replication is the problem, we can't expect notifications to
> > > >> >> fix it. the client probably wants synchronous replication instead,
> > > >> >> which could just mean writing each object to both zones before
> > > >> >> completing
> > > >> >>
> > > >> >> if you're still advocating for these notifications, can you please
> > > >> >> help to frame the discussion here?
> > > >> >>
> > > >> >> >
> > > >> >> > Yehuda
> > > >> >> >
> > > >> >> > >
> > > >> >> > > > OTOH, as Yehuda points out, the intended purpose of the async
> > > >> >> > > > notifies was to implement polling avoidance--to provide wake-ups to
> > > >> >> > > > sync endpoints that might otherwise sleep/idle as replication events
> > > >> >> > > > accumulate.  This is a well established design pattern, and if we
> > > >> >> > > > remember that the async notifies are duplicating hints, it seems to
> > > >> >> > > > make sense.
> > > >> >> > >
> > > >> >> > > Measuring to see how consequential this is would be legitimate.
> > > >> >> > >
> > > >> >> > > I can imagine a world where if the primary has an idea what the
> > > >> >> > > secondary's polling period is, and there hasn't been much sync
> > > >> >> > > activity and the primary knows the secondary won't poll for a while,
> > > >> >> > > it might be worthwhile to send a single wakeup event when there's new
> > > >> >> > > data available telling it that there's new stuff in the data log.
> > > >> >> > >
> > > >> >> > > Whether this is worthwhile would depend heavily on how frequently the
> > > >> >> > > secondary polls the data log in the first place.
> > > >> >> > >
> > > >> >> > > _______________________________________________
> > > >> >> > > Dev mailing list -- dev@xxxxxxx
> > > >> >> > > To unsubscribe send an email to dev-leave@xxxxxxx
>
>
> --
>
> Matt Benjamin
> Red Hat, Inc.
> 315 West Huron Street, Suite 140A
> Ann Arbor, Michigan 48103
>
> http://www.redhat.com/en/technologies/storage
>
> tel.  734-821-5101
> fax.  734-769-8938
> cel.  734-216-5309
>



