Re: Distribute rados locks across several RGWs

From discussion on the PR and external conversations, I think the
simplest way to resolve this is that you actually can do a write and
return an error code; that mechanism is discussed at
https://github.com/ceph/ceph/pull/45958#discussion_r1038420098

But to explicate a bit on the rest of things:

On Fri, Dec 2, 2022 at 8:24 AM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>
> thanks Or and Greg!
>
> On Fri, Dec 2, 2022 at 1:36 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> >
> > On Thu, Dec 1, 2022 at 11:15 AM Or Friedmann <ofriedma@xxxxxxxxxx> wrote:
> >>
> >> Hi,
> >>
> >> Our team has been working on: https://github.com/ceph/ceph/pull/45958
> >> The main idea behind this solution is to allow several RGWs to fairly distribute locks for multisite syncing.
> >> The way the rgw will do it:
> >>
> >> Every RGW creates a vector whose size equals the number of locks it uses, stores each cell's own index as its value, and calls std::shuffle on the vector.
> >> Each time an RGW tries to acquire a lock, it calls lock() with a bid number, which is vector[shard_id], and an expiration time for the bid.
> >> When CLS_LOCK is called, it determines which RGW holds the lowest non-expired bid for that specific lock. If the RGW that called CLS_LOCK doesn't have the lowest bid, the call fails (unless it already holds the lock, in which case it renews); otherwise it acquires the lock.
> >> This lets the RGWs share the locks almost evenly among themselves, so the work can be done by several RGWs.
> >>
> >> Currently, to maintain the bid mapping for all bidded locks - that is, for every lock, which clients have tried to lock it and what their non-expired bids are - we use a static std::unordered_map and a static mutex inside the lock_obj function. The reason is that the map generally doesn't need to persist across restarts, a change of primary, or a change of OSD, but it does need to persist between calls.
> >
> >
> > I’m sorry, are you saying you’re trying to keep object class state in the OSD’s memory across invocations of the class functions?
> >
> > You absolutely cannot do that. Some reasons: how is rgw supposed to detect when the state is lost? What guarantees your locking continues to work when it’s dropped?
>
> the bids are time-sensitive hints that expire after a duration,
> because we don't want to reserve locks for clients that went away. if
> the OSD forgets this bid state, then cls_lock will just grant the next
> lock request as it normally would, then other active clients resubmit
> their bids as they continue to poll for the lock. so long as we don't
> forget this state too frequently (on the order of minutes), this still
> helps us spread the locks over active clients
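
First, to make sure I'm reading the client side right, my understanding
of the bidding described above is roughly the sketch below (hypothetical
names, not the PR's actual code):

    // Hedged sketch of the client-side bidding as I read it from the
    // description above; details surely differ from the PR.
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <numeric>
    #include <random>
    #include <vector>

    // Each RGW builds one shuffled permutation of [0, num_locks) at startup.
    std::vector<uint32_t> make_bids(std::size_t num_locks)
    {
      std::vector<uint32_t> bids(num_locks);
      std::iota(bids.begin(), bids.end(), 0);
      std::shuffle(bids.begin(), bids.end(),
                   std::mt19937{std::random_device{}()});
      return bids;
    }

    // On every poll of log shard `shard_id`, the lock request then carries
    // bids[shard_id] plus an expiration time for that bid, and the object
    // class is supposed to grant the lock to the lowest non-expired bid.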

Let me clarify a bit here. Since the "real" lock is a normal lock that
mutates disk state, I *think* failures here might not actually break
the lock semantics.
But the bigger concern I was going for is that by injecting this as
in-memory object class state, there's just this state floating around
that the OSD doesn't know anything about and can't keep the normal
consistency guarantees on. I think the "fallback" to disk state means
that you won't actually get conflicting lock grants, but under OSD
failure scenarios (where PGs often bounce around a lot), the bids that
happen to be present in memory are going to be a totally random mix of
old bids that haven't been cleaned up, new bids, and disappeared bids
(from the perspective of the client). So the fairness this algorithm
is trying to generate will just completely disappear when the cluster
goes unhealthy.
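
To make the "floating around" point concrete: as I understand the
description above, the bid tracking is roughly the shape sketched below
(simplified, hypothetical names), and nothing about it is visible to
peering, recovery, or a primary change:

    // Simplified sketch of the static in-memory bid map described above
    // (hypothetical names; the actual PR differs in detail).
    #include <chrono>
    #include <cstdint>
    #include <map>
    #include <mutex>
    #include <string>
    #include <unordered_map>

    struct bid_entry {
      uint32_t bid;
      std::chrono::steady_clock::time_point expires;
    };

    // Process-wide state inside the OSD, keyed by lock object and then by
    // client. It lives exactly as long as the process does, is invisible to
    // peering and recovery, and is silently empty on a new primary.
    static std::unordered_map<std::string,
                              std::map<std::string, bid_entry>> g_bids;
    static std::mutex g_bids_mutex;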

This kind of "behavior goes bad when the cluster is unhealthy"
scenario has proven to be devastating in practice — unhealthy clusters
have less throughput for client IO, which means sometimes they go from
"able to serve the workload" to "totally overwhelmed", and cluster
behaviors that increase the IO load when the cluster is unhealthy make
that much more likely and accelerate that death spiral. For RGW
syncing fairness specifically, it means the sync time boundaries will
get worse as the cluster becomes less healthy, which seems like rather
the opposite of what's desired.

>
> > Why do you think this gets to allocate what sure looks like an amount of memory that scales with bucket shard and rgw count across every single bucket in the system? Have you worked through the math on how much memory you’re demanding from the OSDs to satisfy this?
>
> these locks aren't used for bucket index shards, but for multisite
> replication logs which are statically dimensioned at 128 data log and
> 64 metadata log objects. but this is meant to be a generic feature for
> use elsewhere too, so we share your concerns about potentially
> unbounded workloads

That sounds better, but yes, in general that's just not an interface
we want to allow. Watch-notify has some of these properties and has
already been an issue, while being much harder to abuse. :)

>
> > The fact that this apparently works at all is an accident of implementation, and one that is likely to break with changes. (For instance: crimson.) It is definitely not part of the intended API.
> >
> > You need to design this so it works by updating disk state. I’d use omap entries; it seems like a good fit?
>
> during review, i had requested that these bid values be persisted in
> the `lock_info_t` struct that cls_lock stores in an object xattr. but
> as Or found, this strategy doesn't work in general. the bid values are
> sent with lock() requests, and we need to remember these bids even for
> the requests that we reject with an error like EBUSY - and rados write
> transactions are not applied on error
>
> we could consider creating a new API, separate from lock(),
> specifically for this bidding which could always return success.
> clients would need to resubmit these bids regularly in addition to
> their existing polling lock requests

As discussed on the PR and in my intro.
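
Whichever way it's plumbed (inside lock() via the write-and-return-an-error
mechanism, or as a separate always-succeeding bid op like you describe), the
persistent side is small. A rough sketch of the separate-op variant, with a
hypothetical request type and key naming, assuming the standard
cls_cxx_map_set_val omap helper:

    // Hedged sketch only, not the PR's code. It records the caller's bid
    // under an omap key so the lock() path can read the bids back, pick the
    // lowest non-expired one, and prune expired keys lazily on the same pass.
    #include <cerrno>
    #include <cstdint>
    #include <string>
    #include "include/encoding.h"
    #include "objclass/objclass.h"

    struct bid_request {      // hypothetical wire format, deliberately simple
      std::string client_id;
      uint32_t bid = 0;
      uint64_t expires_sec = 0;  // absolute expiry, seconds since epoch
    };

    static int submit_bid(cls_method_context_t hctx,
                          ceph::buffer::list *in, ceph::buffer::list *out)
    {
      (void)out;  // no response payload

      bid_request req;
      auto p = in->cbegin();
      try {
        ceph::decode(req.client_id, p);
        ceph::decode(req.bid, p);
        ceph::decode(req.expires_sec, p);
      } catch (const ceph::buffer::error &) {
        return -EINVAL;
      }

      ceph::buffer::list val;
      ceph::encode(req.bid, val);
      ceph::encode(req.expires_sec, val);

      // One tiny omap write per client per poll interval.
      return cls_cxx_map_set_val(hctx, "bid." + req.client_id, &val);
    }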

>
> we've seen from multisite workload testing that rgw already taxes this
> log pool heavily, reaching latencies of several minutes. because these
> bids are only hints, it seems ideal to keep them in memory to avoid
> all of these additional rados writes

I hear what you're saying, but at the frequency discussed this
shouldn't be a meaningful overhead. If we really do need to keep it in
memory, we'd need to develop some sort of memory sidecar channel we
can define semantics around (it would probably reset on every primary
interval change, and the object class would be able to see that this
had occurred) — and we'd need a strong argument for keeping it in RADOS at
all at that point.
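
Back-of-envelope with the numbers above: 128 data log + 64 metadata log
objects is 192 lock objects, each polled roughly every 30 seconds, so
persisting bids costs about 192/30, i.e. six or seven tiny omap writes per
second per syncing RGW (call it ~20/s if, say, three gateways are syncing),
which seems like noise next to the log appends themselves.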

> > In terms of the algorithm, I think you need some more math to prove its properties. Why is this bid system better than just having CLS_LOCK pick a random non-expired rgw from the set it has when a lock call comes in and the previous lock has expired?
>
> that sounds equivalent to the current design, except that the bid
> randomization happens on the clients instead. the important part is
> that the OSD tracks which clients are currently interested in the lock

Right, that's why I asked what the point of the bid shuffling is.
Maybe the bid system has some cool math properties that guarantee each
RGW gets exactly its fair share, whereas we've all learned from CRUSH
that real randomness generates a normal distribution. It's just not
apparent to me on its face, and in the absence of that kind of
guarantee it seems simpler to have an "I'm interested" list that the
object class chooses from when needed.
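
Concretely, the alternative I have in mind on the OSD side is something
like the sketch below (plain C++, leaving out the cls plumbing and how the
registrations get persisted):

    // Hypothetical sketch: the object class keeps a list of interested
    // clients with expiring registrations, and when the lock is free it
    // just picks one of the live ones uniformly at random. No client-side
    // bid shuffling is needed for that.
    #include <chrono>
    #include <cstddef>
    #include <random>
    #include <string>
    #include <vector>

    struct interested_client {
      std::string id;
      std::chrono::steady_clock::time_point expires;
    };

    std::string pick_next_holder(
        const std::vector<interested_client> &interested,
        std::chrono::steady_clock::time_point now,
        std::mt19937 &rng)
    {
      std::vector<const interested_client *> live;
      for (const auto &c : interested) {
        if (c.expires > now) {
          live.push_back(&c);
        }
      }
      if (live.empty()) {
        return {};  // nobody registered; grant to whoever asked, as today
      }
      std::uniform_int_distribution<std::size_t> dist(0, live.size() - 1);
      return live[dist(rng)]->id;
    }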

> > Also, how are these locks actually being queried and maintained? Does each rgw just ping every single shard at 30-second intervals with a lock attempt?
>
> essentially yes, syncing rgw instances will poll each log shard with
> cls_lock regularly

It looks from the PR like this is every 30 seconds? I was concerned
that this was going to happen on every bucket index shard and on a
more frequent basis, but as you've clarified that was not the case,
this seems fine. :)
-Greg

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx



