Re: Distribute rados locks across several RGWs

thanks Or and Greg!

On Fri, Dec 2, 2022 at 1:36 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> On Thu, Dec 1, 2022 at 11:15 AM Or Friedmann <ofriedma@xxxxxxxxxx> wrote:
>>
>> Hi,
>>
>> Our team has been working on: https://github.com/ceph/ceph/pull/45958
>> The main idea behind this solution is to allow several RGWs to fairly distribute locks for multisite syncing.
>> The way the RGW will do it:
>>
>> Every RGW will create a vector whose size is the lock count used by the RGW, store in every cell the index of that cell, and call std::shuffle on the vector.
>> Each time an RGW tries to acquire a lock, it will issue the lock call with a bid number, which is vector[shard_id], and an expiration time for the bid.
>> When CLS_LOCK is called, the function should know which RGW holds the lowest non-expired bid for the specific lock. If the RGW that called CLS_LOCK doesn't have the lowest bid, the call fails (though it will renew if it has already acquired the lock); otherwise it acquires the lock.
>> Using this method allows the RGWs to share the locks almost evenly between each other, so the work can be done by several RGWs.
>>
>> Currently, to maintain the bid mapping for all bidded locks (that is, for every lock, which clients tried to lock it and what their non-expired bids are), we use a static std::unordered_map and a static mutex inside the lock_obj function. The reason is that the map generally doesn't need to persist across restarts or a change of the primary or the OSD, but it does need to persist between calls.
>
>
> I’m sorry, are you saying you’re trying to keep object class state in the OSD’s memory across invocations of the class functions?
>
> You absolutely cannot do that. Some reasons: how is rgw supposed to detect when the state is lost? What guarantees your locking continues to work when it’s dropped?

the bids are time-sensitive hints that expire after a duration,
because we don't want to reserve locks for clients that went away. if
the OSD forgets this bid state, then cls_lock will just grant the next
lock request as it normally would, then other active clients resubmit
their bids as they continue to poll for the lock. so long as we don't
forget this state too frequently (on the order of minutes), this still
helps us spread the locks over active clients
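
to make this concrete, here's a rough standalone sketch of the scheme
as i understand it (not the actual cls_lock code; the names, the Bid
struct, and the std::map standing in for OSD memory are all
illustrative): the client side builds a shuffled per-shard bid vector,
and the OSD side tracks each client's latest bid until it expires,
granting the lock only to the lowest live bid while still recording
the bids of callers it rejects

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <map>
#include <numeric>
#include <random>
#include <string>
#include <vector>

// client side: each rgw builds a shuffled vector of bid values, one per
// lock shard, so the per-shard bids across instances are uniformly ordered
std::vector<int> make_bids(size_t num_shards, unsigned seed) {
  std::vector<int> bids(num_shards);
  std::iota(bids.begin(), bids.end(), 0); // bids[i] = i
  std::mt19937 rng{seed};
  std::shuffle(bids.begin(), bids.end(), rng);
  return bids;
}

// OSD side (in-memory, per lock): remember each client's latest bid until
// it expires. a lock request succeeds only for the lowest non-expired bid.
using Clock = std::chrono::steady_clock;
struct Bid { int value; Clock::time_point expires; };
using BidMap = std::map<std::string, Bid>; // locker id -> bid

bool has_lowest_bid(BidMap& bids, const std::string& who, int bid,
                    Clock::time_point now, Clock::duration ttl) {
  // drop expired bids, so clients that went away stop reserving locks
  for (auto i = bids.begin(); i != bids.end(); ) {
    if (i->second.expires <= now) i = bids.erase(i); else ++i;
  }
  // record/renew this caller's bid even if the request will be rejected
  bids[who] = Bid{bid, now + ttl};
  for (const auto& [name, b] : bids) {
    if (name != who && b.value < bid) return false; // outbid -> EBUSY
  }
  return true;
}
```

note that if the BidMap is lost, the first lock request afterward always
sees an empty map and succeeds, which is the graceful-degradation
behavior described above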

> Why do you think this gets to allocate what sure looks like an amount of memory that scales with bucket shard and rgw count across every single bucket in the system? Have you worked through the math on how much memory you’re demanding from the OSDs to satisfy this?

these locks aren't used for bucket index shards, but for multisite
replication logs which are statically dimensioned at 128 data log and
64 metadata log objects. but this is meant to be a generic feature for
use elsewhere too, so we share your concerns about potentially
unbounded workloads

> The fact that this apparently works at all is an accident of implementation, and one that is likely to break with changes. (For instance: crimson.) It is definitely not part of the intended API.
>
> You need to design this so it works by updating disk state. I’d use omap entries; it seems like a good fit?

during review, i had requested that these bid values be persisted in
the `lock_info_t` struct that cls_lock stores in an object xattr. but
as Or found, this strategy doesn't work in general. the bid values are
sent with lock() requests, and we need to remember these bids even for
the requests that we reject with an error like EBUSY - and rados write
transactions are not applied on error

we could consider creating a new API, separate from lock(),
specifically for this bidding which could always return success.
clients would need to resubmit these bids regularly in addition to
their existing polling lock requests
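
such an op might look roughly like this (purely a sketch; "submit_bid"
and its value encoding are made up, and a std::map stands in for the
object's omap). the point is just that the bid write is decoupled from
the lock decision, so the transaction always commits:

```cpp
#include <cstdint>
#include <map>
#include <string>

// hypothetical "submit_bid" op, separate from lock(): it only records the
// caller's bid and always succeeds, so the write commits even for callers
// that would go on to lose the lock() race with -EBUSY
using OmapLike = std::map<std::string, std::string>; // stand-in for omap

int submit_bid(OmapLike& omap, const std::string& locker,
               uint32_t bid, uint64_t expires_sec) {
  // one omap entry per locker, overwritten on each resubmission
  omap["bid." + locker] =
      std::to_string(bid) + ":" + std::to_string(expires_sec);
  return 0; // unconditional success, unlike lock()
}
```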

we've seen from multisite workload testing that rgw already taxes this
log pool heavily, reaching latencies of several minutes. because these
bids are only hints, it seems ideal to keep them in memory to avoid
all of these additional rados writes

>
> In terms of the algorithm, I think you need some more math to prove its properties. Why is this bid system better than just having CLS_LOCK pick a random non-expired rgw from the set it has when a lock call comes in and the previous lock has expired?

that sounds equivalent to the current design, except that the bid
randomization happens on the clients instead. the important part is
that the OSD tracks which clients are currently interested in the lock

> Also, how are these locks actually being queried and maintained? Does each rgw just ping every single shard at 30-second intervals with a lock attempt?

essentially yes, syncing rgw instances will poll each log shard with
cls_lock regularly
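
a single polling pass looks roughly like this (a sketch, not rgw code:
try_lock here is a stand-in for the cls_lock round trip, and the loop
would run once per renewal interval, on the order of 30 seconds):

```cpp
#include <functional>
#include <set>
#include <vector>

// one polling pass over all log shards: attempt cls_lock on each shard
// with this rgw's bid for that shard, and return the set of shards whose
// lock was acquired or renewed
std::set<int> poll_shards(int num_shards, const std::vector<int>& bids,
                          const std::function<bool(int, int)>& try_lock) {
  std::set<int> held;
  for (int shard = 0; shard < num_shards; ++shard) {
    if (try_lock(shard, bids[shard])) { // rados round trip per shard
      held.insert(shard);
    }
  }
  return held;
}
```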




> -Greg
>
>>
>>
>> We thought about other ways to maintain that information:
>>
>> For each lock_info_t maintain a smaller map that will include only clients and clients' bids.
>> The cons of this solution are, first, that we would need to write the xattr (write_lock()) whenever the bids change, which happens on every call; and second, that for every non-zero return value the write_lock() would not happen, so the map would not be updated at all.
>> We could maybe use ObjectContext and store the bid info for each lock there, but ObjectContexts do not stay in memory for long.
>>
>> What do you think could be the best way to go?
>>
>> Thanks
>> _______________________________________________
>> Dev mailing list -- dev@xxxxxxx
>> To unsubscribe send an email to dev-leave@xxxxxxx
>
