On Tue, Oct 27, 2020 at 4:10 PM Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>
> On Tue, Oct 27, 2020 at 6:50 PM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
> >
> > Hi Shyam,
> >
> > Thanks for starting this discussion.
> >
> > On Tue, Oct 27, 2020 at 8:12 AM Shyam Ranganathan <srangana@xxxxxxxxxx> wrote:
> > >
> > > Asks:
> > > -----
> > > This mail is to trigger a discussion on the potential solution,
> > > provided below, for the issue in the subject, and to gather other
> > > ideas/options that would enable the use case described.
> > >
> > > Use case/Background:
> > > --------------------
> > > Ceph is used by kubernetes to provide persistent storage (block and
> > > file, via RBD and CephFS respectively) to pods, via the CSI interface
> > > implemented in ceph-csi [1].
> > >
> > > One of the use cases that we want to solve is when multiple
> > > kubernetes clusters access the same Ceph storage cluster [2], and
> > > further these kubernetes clusters provide for DR (disaster recovery)
> > > of workloads when a peer kubernetes cluster becomes unavailable.
> > >
> > > IOW, if a workload is running on kubernetes cluster-a and has access
> > > to persistent storage, it can be migrated to cluster-b in case of a
> > > DR event in cluster-a, ensuring workload continuity and with it
> > > access to the same persistent storage (as the Ceph cluster is shared
> > > and available).
> > >
> > > Problem:
> > > --------
> > > The exact status of all clients/nodes in kubernetes cluster-a on a DR
> > > event is unknown; all may be down, or some may still be up and
> > > running, still accessing storage.
> > >
> > > This brings about the need to fence all IO from all
> > > nodes/container-networks on cluster-a on a DR event, prior to
> > > migrating the workloads to cluster-b.
> > >
> > > Existing solutions and issues:
> > > ------------------------------
> > > Current schemes to fence IO are per client [3], and further per image
> > > for RBD. This makes it a prerequisite that all client addresses in
> > > cluster-a are known, and are further unique across peer kubernetes
> > > clusters, for a fence/blocklist to be effective.
> > >
> > > Also, during recovery of kubernetes cluster-a, as kubernetes uses the
> > > current known state of the world (i.e. the workload "was" running on
> > > this cluster) and reconciles to the desired state of the world
> > > eventually, it is possible that re-mounts may occur prior to reaching
> > > the desired state of the world (which would be not to run the said
> > > workloads on this cluster).
> > >
> > > The recovery may hence cause the existing connection-based blocklists
> > > to be reset, as newer mounts/maps of the fs/image are performed on
> > > the recovering cluster.
> > >
> > > The issues above make the existing blocklist scheme either unreliable
> > > or cumbersome to deal with for all possible nodes in the respective
> > > kubernetes clusters.
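For concreteness, the per-client fencing referred to above amounts to
roughly the following sketch; the address and session id are made up, and
older releases spell the blocklist commands "blacklist":

  # fence a single RBD/kernel client by its entity address
  ceph osd blocklist add 192.168.1.11:0/3418200121
  # evict (and blocklist) a single CephFS client session
  ceph tell mds.0 client evict id=4305
  # later, lift the fence for that one address
  ceph osd blocklist rm 192.168.1.11:0/3418200121

Every client instance needs its own entry, which is exactly what makes this
hard to drive from kubernetes when the set of client addresses is unknown.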
> > > Potential solution:
> > > -------------------
> > > On discussing the above with Jason, he pointed to a potential
> > > solution (as follows) to resolve the problem:
> > >
> > > <snip>
> > > My suggestion would be to utilize CephX to revoke access to the
> > > cluster from site A when site B is promoted. The one immediate issue
> > > with this approach is that any clients with existing tickets will
> > > keep their access to the cluster until the ticket expires. Therefore,
> > > for this to be effective, we would need a transient CephX revocation
> > > list capability to essentially blocklist CephX clients for X period
> > > of time until we can be sure that their tickets have expired and are
> > > therefore no longer usable.
> > > </snip>
> > >
> > > The above is quite trivial from a kubernetes and ceph-csi POV, as
> > > each peer kubernetes cluster can be configured to use different
> > > cephx identities, which can thus be independently revoked and later
> > > reinstated, solving the issues laid out above.
> > >
> > > Revoking credentials for an existing cephx identity can be done by
> > > changing its existing authorization, and hence is readily available.
> > >
> > > The ability to provide a revocation list for existing valid tickets
> > > that clients already have would need to be developed.
> > >
> > > Thoughts and other options?
> >
> > While tempting, I think we're unnecessarily restricting our attention
> > to current options. It seems to me we should consider another
> > mechanism for blocklisting clients en masse. I would suggest having
> > clients add a "tag" to their sessions with Ceph daemons which can be
> > separately blocklisted. The tag can be derived from the cephx key they
> > use so it does not require updating all client code to send a tag
> > (like the kernel). The cephx credential would probably look like:
> > "mon `allow r tag=bar'".
>
> I like this idea as well, but I think the syntax for describing the
> tag feels a little funky since you aren't actually "allowing"
> anything. At this point, why not just extend blocklisting to support
> entity names in general and avoid the need to touch the caps?

An entity name glob may also work!

> Would there be other uses for this tag?

Perhaps it'd be another piece of metadata on a cap and not part of "mon",
to avoid confusion. The tags could also be useful for mass
removal/modification of auth credentials. That has potential use for
fine-grained access control with thousands of auth credentials.

> > Once a new tag is added to the blocklist distributed via the MonMap
> > (or OSDMap for consistency), daemons would need to go through their
> > open sessions and blocklist any matches.
> >
> > There are other applications beyond DR. We currently have a
> > heavyweight "registered clients" map in the MgrMap which records all
> > open RADOS instances by the mgr. These all need to be blocklisted if
> > the mgr fails over. It is racy to keep this up to date, so we see
> > virtually unavoidable and annoying test failures [1,2]. If we used a
> > "mgr.x" tag for the mgr.x credential (perhaps an implicit tag of
> > several), we could blocklist that instead to avoid keeping track
> > entirely.
>
> How would mgr.x unblocklist itself when it restarts?

Ya, this is the tricky part since there's no nonce. If we used tags, the
mgr could configure a tag with a nonce for itself spanning all sessions
(in g_ceph_context). The mons would then blocklist that tag. For entity
names, I guess you'd have to unblocklist it as part of startup.

(To make the tag idea concrete, I've appended a rough strawman of the DR
flow below my sig.)

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
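Appendix: a rough strawman of the tag-based DR flow discussed above. None
of the tag syntax exists today -- the "tag=" cap metadata and the tag-based
blocklist add/rm are hypothetical, and the names (client.k8s-cluster-a,
pool kubernetes-a, the 7200s expiry) are made up for illustration:

  # one cephx identity per kubernetes cluster; the mon cap carries the
  # strawman tag from the thread (hypothetical, does not exist today)
  ceph auth get-or-create client.k8s-cluster-a \
      mon 'allow r tag=k8s-cluster-a' \
      osd 'allow rwx pool=kubernetes-a'

  # on a DR event in cluster-a, fence every session opened with that
  # credential, for at least the cephx ticket lifetime (hypothetical)
  ceph osd blocklist add tag k8s-cluster-a 7200

  # once cluster-a has reconciled and should be let back in (hypothetical)
  ceph osd blocklist rm tag k8s-cluster-a

The point of the sketch is only that the kubernetes side never has to
enumerate client addresses: it just needs to know which credential (tag)
each peer cluster was handed.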