On Tue, Oct 27, 2020 at 4:10 PM Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>
> On Tue, Oct 27, 2020 at 6:50 PM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
> >
> > Hi Shyam,
> >
> > Thanks for starting this discussion.
> >
> > On Tue, Oct 27, 2020 at 8:12 AM Shyam Ranganathan <srangana@xxxxxxxxxx> wrote:
> > >
> > > Asks:
> > > -----
> > > This mail is to trigger a discussion on the potential solution,
> > > provided below, for the issue in the subject, and to gather other
> > > ideas/options that would enable the use case described.
> > >
> > > Use case/Background:
> > > --------------------
> > > Ceph is used by kubernetes to provide persistent storage (block and
> > > file, via RBD and CephFS respectively) to pods, via the CSI interface
> > > implemented in ceph-csi [1].
> > >
> > > One of the use cases that we want to solve is when multiple
> > > kubernetes clusters access the same Ceph storage cluster [2], and
> > > further these kubernetes clusters provide for DR (disaster recovery)
> > > of workloads when a peer kubernetes cluster becomes unavailable.
> > >
> > > IOW, if a workload is running on kubernetes cluster-a and has access
> > > to persistent storage, it can be migrated to cluster-b in case of a
> > > DR event in cluster-a, ensuring workload continuity and with it
> > > access to the same persistent storage (as the Ceph cluster is shared
> > > and available).
> > >
> > > Problem:
> > > --------
> > > The exact status of all clients/nodes in kubernetes cluster-a on a DR
> > > event is unknown; all may be down, or some may still be up and
> > > running, still accessing storage.
> > >
> > > This brings about the need to fence all IO from all
> > > nodes/container-networks on cluster-a on a DR event, prior to
> > > migrating the workloads to cluster-b.
> > >
> > > Existing solutions and issues:
> > > ------------------------------
> > > Current schemes to fence IO are per client [3], and further per image
> > > for RBD. This makes it a prerequisite that all client addresses in
> > > cluster-a are known, and are further unique across peer kubernetes
> > > clusters, for a fence/blocklist to be effective.
> > >
> > > Also, during recovery of kubernetes cluster-a, as kubernetes uses the
> > > current known state of the world (i.e. the workload "was" running on
> > > this cluster) and reconciles to the desired state of the world
> > > eventually, it is possible that re-mounts may occur prior to reaching
> > > the desired state of the world (which would be not to run the said
> > > workloads on this cluster).
> > >
> > > The recovery may hence cause the existing connection-based blocklists
> > > to be reset, as newer mounts/maps of the fs/image are performed on
> > > the recovering cluster.
> > >
> > > The issues above make the existing blocklist scheme either unreliable
> > > or cumbersome to deal with for all possible nodes in the respective
> > > kubernetes clusters.
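For concreteness, the per-client fencing referred to above amounts to
roughly the following sketch; the address and session id are made up, and
older releases spell the blocklist commands "blacklist":

  # fence a single RBD/kernel client by its entity address
  ceph osd blocklist add 192.168.1.11:0/3418200121
  # evict (and blocklist) a single CephFS client session
  ceph tell mds.0 client evict id=4305
  # later, lift the fence for that one address
  ceph osd blocklist rm 192.168.1.11:0/3418200121

Every client instance needs its own entry, which is exactly what makes this
hard to drive from kubernetes when the set of client addresses is unknown.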
> > > Potential solution:
> > > -------------------
> > > On discussing the above with Jason, he pointed to a potential
> > > solution (as follows) to resolve the problem:
> > >
> > > <snip>
> > > My suggestion would be to utilize CephX to revoke access to the
> > > cluster from site A when site B is promoted. The one immediate issue
> > > with this approach is that any clients with existing tickets will
> > > keep their access to the cluster until the ticket expires. Therefore,
> > > for this to be effective, we would need a transient CephX revocation
> > > list capability to essentially blocklist CephX clients for X period
> > > of time until we can be sure that their tickets have expired and are
> > > therefore no longer usable.
> > > </snip>
> > >
> > > The above is quite trivial from a kubernetes and ceph-csi POV, as
> > > each peer kubernetes cluster can be configured to use different
> > > cephx identities, which can thus be independently revoked and later
> > > reinstated, solving the issues laid out above.
> > >
> > > Revoking credentials for an existing cephx identity can be done by
> > > changing its existing authorization, and hence is readily available.
> > >
> > > The ability to provide a revocation list for existing valid tickets
> > > that clients already have would need to be developed.
> > >
> > > Thoughts and other options?
> >
> > While tempting, I think we're unnecessarily restricting our attention
> > to current options. It seems to me we should consider another
> > mechanism for blocklisting clients en masse. I would suggest having
> > clients add a "tag" to their sessions with Ceph daemons which can be
> > separately blocklisted. The tag can be derived from the cephx key they
> > use so it does not require updating all client code to send a tag
> > (like the kernel). The cephx credential would probably look like:
> > "mon `allow r tag=bar'".
>
> I like this idea as well, but I think the syntax for describing the
> tag feels a little funky since you aren't actually "allowing"
> anything. At this point, why not just extend blocklisting to support
> entity names in general and avoid the need to touch the caps?

An entity name glob may also work!

> Would there be other uses for this tag?

Perhaps it'd be another piece of metadata on a cap and not part of "mon",
to avoid confusion. The tags could also be useful for mass
removal/modification of auth credentials. That has potential use for
fine-grained access control with thousands of auth credentials.

> > Once a new tag is added to the blocklist distributed via the MonMap
> > (or OSDMap for consistency), daemons would need to go through their
> > open sessions and blocklist any matches.
> >
> > There are other applications beyond DR. We currently have a
> > heavyweight "registered clients" map in the MgrMap which records all
> > open RADOS instances by the mgr. These all need to be blocklisted if
> > the mgr fails over. It is racy to keep this up to date, so we see
> > virtually unavoidable and annoying test failures [1,2]. If we used a
> > "mgr.x" tag for the mgr.x credential (perhaps an implicit tag of
> > several), we could blocklist that instead to avoid keeping track
> > entirely.
>
> How would mgr.x unblocklist itself when it restarts?

Ya, this is the tricky part since there's no nonce. If we used tags, the
mgr could configure a tag with a nonce for itself spanning all sessions
(in g_ceph_context). The mons would then blocklist that tag. For entity
names, I guess you'd have to unblocklist it as part of startup.

(To make the tag idea concrete, I've appended a rough strawman of the DR
flow below my sig.)

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
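Appendix: a rough strawman of the tag-based DR flow discussed above. None
of the tag syntax exists today -- the "tag=" cap metadata and the tag-based
blocklist add/rm are hypothetical, and the names (client.k8s-cluster-a,
pool kubernetes-a, the 7200s expiry) are made up for illustration:

  # one cephx identity per kubernetes cluster; the mon cap carries the
  # strawman tag from the thread (hypothetical, does not exist today)
  ceph auth get-or-create client.k8s-cluster-a \
      mon 'allow r tag=k8s-cluster-a' \
      osd 'allow rwx pool=kubernetes-a'

  # on a DR event in cluster-a, fence every session opened with that
  # credential, for at least the cephx ticket lifetime (hypothetical)
  ceph osd blocklist add tag k8s-cluster-a 7200

  # once cluster-a has reconciled and should be let back in (hypothetical)
  ceph osd blocklist rm tag k8s-cluster-a

The point of the sketch is only that the kubernetes side never has to
enumerate client addresses: it just needs to know which credential (tag)
each peer cluster was handed.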