Re: Distribute rados locks across several RGWs

On Thu, Dec 1, 2022 at 11:15 AM Or Friedmann <ofriedma@xxxxxxxxxx> wrote:
Hi,

Our team has been working on: https://github.com/ceph/ceph/pull/45958
The main idea behind this solution is to allow several RGWs to fairly distribute the locks used for multisite syncing.
The way the RGW will do it (a rough sketch of steps 1 and 2 follows the list):
  1. Every RGW creates a vector whose size equals the number of locks it uses; each cell stores its own index as its value, and the vector is then shuffled with std::shuffle.
  2. Each time an RGW tries to take a lock, it calls the lock operation and includes a bid number, which is vector[shard_id], along with an expiration time for the bid.
  3. When CLS_LOCK is called, the function determines which RGW holds the lowest non-expired bid for that specific lock. If the calling RGW does not hold the lowest bid, the call fails (or renews, if it already holds the lock); otherwise it acquires the lock.
  4. This method lets the RGWs share the locks almost evenly, so the sync work can be spread across several RGWs.
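To make steps 1 and 2 concrete, the per-RGW side looks roughly like the sketch below. The names (make_bid_table, num_lock_shards, lock_bid_t) are illustrative, not the identifiers used in the PR:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <numeric>
    #include <random>
    #include <vector>

    // The bid an RGW attaches to a lock attempt for one shard.
    struct lock_bid_t {
      uint32_t bid;          // this RGW's bid for the shard
      uint32_t expire_secs;  // how long the bid stays valid on the OSD side
    };

    // Step 1: cell i holds value i, then the whole vector is shuffled so
    // every RGW ends up with its own random ordering of bids.
    std::vector<uint32_t> make_bid_table(std::size_t num_lock_shards) {
      std::vector<uint32_t> bids(num_lock_shards);
      std::iota(bids.begin(), bids.end(), 0u);
      std::mt19937 gen{std::random_device{}()};
      std::shuffle(bids.begin(), bids.end(), gen);
      return bids;
    }

    // Step 2: the bid sent along with the lock call for shard `shard_id`
    // would then be lock_bid_t{bids[shard_id], bid_expiry_secs}.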
Currently, to maintain the bid mapping for all bidded locks (that is, for every lock, which clients have tried to lock it and what their non-expired bids are), we use a static std::unordered_map and a static mutex inside the lock_obj function; this is the current implementation. The reasoning is that, in general, the map does not need to persist across restarts, a primary change, or an OSD replacement, but it does need to persist between calls.
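Schematically, that in-memory bookkeeping has roughly this shape (types and names here are illustrative, not the exact code in the PR):

    #include <chrono>
    #include <cstdint>
    #include <map>
    #include <mutex>
    #include <string>
    #include <unordered_map>

    struct bid_entry {
      uint32_t bid;                                   // the client's bid
      std::chrono::system_clock::time_point expires;  // bid ignored after this
    };

    // lock (object) name -> client id -> that client's current bid,
    // guarded by a static mutex inside lock_obj()
    static std::unordered_map<std::string,
                              std::map<std::string, bid_entry>> bid_map;
    static std::mutex bid_map_mutex;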

I’m sorry, are you saying you’re trying to keep object class state in the OSD’s memory across invocations of the class functions?

You absolutely cannot do that. Some reasons: how is rgw supposed to detect when the state is lost? What guarantees your locking continues to work when it’s dropped?
Why do you think this gets to allocate what sure looks like an amount of memory that scales with bucket shard and rgw count across every single bucket in the system? Have you worked through the math on how much memory you’re demanding from the OSDs to satisfy this?
The fact that this apparently works at all is an accident of implementation, and one that is likely to break with changes (for instance: crimson). It is definitely not part of the intended API.

You need to design this so it works by updating disk state. I’d use omap entries; it seems like a good fit?
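Roughly, the cls method could read the bid table from the object's omap, update it, and write it back, along these lines. This is a sketch only: bid_map_t and the omap key name are made up here, cls_cxx_map_get_val()/cls_cxx_map_set_val() are the existing objclass omap helpers, and it assumes the Ceph tree headers:

    #include <cerrno>
    #include <cstdint>
    #include <map>
    #include <string>
    #include "include/buffer.h"
    #include "include/encoding.h"
    #include "objclass/objclass.h"

    using ceph::decode;
    using ceph::encode;

    // Hypothetical: client id -> its current bid. A real version would also
    // carry each bid's expiration and use a proper versioned encoder.
    using bid_map_t = std::map<std::string, uint32_t>;

    static const std::string BID_OMAP_KEY = "lock_bids";  // made-up key name

    static int load_bids(cls_method_context_t hctx, bid_map_t *bids) {
      ceph::buffer::list bl;
      int r = cls_cxx_map_get_val(hctx, BID_OMAP_KEY, &bl);
      if (r == -ENOENT) {
        bids->clear();        // no bids recorded yet
        return 0;
      }
      if (r < 0) {
        return r;
      }
      try {
        auto iter = bl.cbegin();
        decode(*bids, iter);
      } catch (const ceph::buffer::error &) {
        return -EIO;
      }
      return 0;
    }

    static int store_bids(cls_method_context_t hctx, const bid_map_t &bids) {
      ceph::buffer::list bl;
      encode(bids, bl);
      return cls_cxx_map_set_val(hctx, BID_OMAP_KEY, &bl);
    }

Since the omap write is part of the same object operation as the lock update, the bid state survives OSD restarts and primary changes without any extra machinery.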

In terms of the algorithm, I think you need some more math to prove its properties. Why is this bid system better than just having CLS_LOCK pick a random non-expired rgw from the set it has when a lock call comes in and the previous lock has expired? Also, how are these locks actually being queried and maintained? Does each rgw just ping every single shard at 30-second intervals with a lock attempt?
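As a strawman, that alternative is little more than the following; `candidates` is a placeholder for whatever set of non-expired waiters cls_lock would have to track anyway:

    #include <cstddef>
    #include <random>
    #include <string>
    #include <vector>

    // Pick one client at random from the non-expired candidates; the caller
    // is assumed to have already filtered out expired entries.
    std::string pick_random_candidate(const std::vector<std::string> &candidates) {
      if (candidates.empty()) {
        return {};
      }
      static std::mt19937 gen{std::random_device{}()};
      std::uniform_int_distribution<std::size_t> dist(0, candidates.size() - 1);
      return candidates[dist(gen)];
    }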
-Greg



We thought about other ways to maintain that information:
  1. For each lock_info_t, maintain a smaller map that includes only the clients and their bids (see the sketch after this list).
    The cons of this solution are that we would need to write the xattr (write_lock()) whenever the bids change, which happens on every call, and that write_lock() does not happen for any non-zero return value, so in those cases the map would not be updated at all.
  2. We could perhaps use the ObjectContext and store the bid info for each lock there, but ObjectContexts do not stay in memory for long.
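For option 1, the shape we have in mind is roughly the following; lock_info_t is the existing cls_lock structure, while bid_info_t and the bids member are the hypothetical additions:

    #include <chrono>
    #include <cstdint>
    #include <map>
    #include <string>

    struct bid_info_t {                            // hypothetical addition
      uint32_t bid;
      std::chrono::system_clock::time_point expires;
    };

    // Stands in for cls_lock's lock_info_t; only the hypothetical `bids`
    // member is shown next to the existing lockers/lock_type/tag members.
    struct lock_info_sketch_t {
      // ... existing lockers, lock_type, tag members ...
      std::map<std::string, bid_info_t> bids;      // client id -> current bid
    };

The map would then be encoded and written out together with the rest of lock_info_t in write_lock(), which is exactly the extra xattr write mentioned above.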
What do you think could be the best way to go?

Thanks
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
