Re: Fencing an entire client cluster from access to Ceph (in kubernetes)

Hi Shyam,

Thanks for starting this discussion.

On Tue, Oct 27, 2020 at 8:12 AM Shyam Ranganathan <srangana@xxxxxxxxxx> wrote:
>
> Asks:
> -----
> This mail is to trigger a discussion on the potential solution
> provided below for the issue in the subject, and to gather other
> ideas/options that would enable the described use case.
>
> Use case/Background:
> --------------------
> Ceph is used by kubernetes to provide persistent storage (block and
> file, via RBD and CephFS respectively) to pods, via the CSI interface
> implemented in ceph-csi [1].
>
> One of the use cases that we want to solve is when multiple kubernetes
> clusters access the same Ceph storage cluster [2], and further these
> kubernetes clusters provide for DR (disaster recovery) of workloads,
> when a peer kubernetes cluster becomes unavailable.
>
> IOW, if a workload is running on kubernetes cluster-a and has access to
> persistent storage, it can be migrated to cluster-b in case of a DR
> event in cluster-a, ensuring workload continuity and with it access to
> the same persistent storage (as the Ceph cluster is shared and available).
>
> Problem:
> --------
> The exact status of all clients/nodes in kubernetes cluster-a at the
> time of a DR event is unknown; all may be down, or some may still be
> up and running and still accessing storage.
>
> This brings about the need to fence all IO from all
> nodes/container-networks on cluster-a, on a DR event, prior to migrating
> the workloads to cluster-b.
>
> Existing solutions and issues:
> ------------------------------
> Current schemes to fence IO are per client [3], and further per image
> for RBD. This makes it a prerequisite that all client addresses in
> cluster-a are known, and are further unique across peer kubernetes
> clusters, for a fence/blocklist to be effective.
>
> Also, during recovery of kubernetes cluster-a, as kubernetes starts
> from its last known state of the world (i.e. the workload "was"
> running on this cluster) and only eventually reconciles to the desired
> state of the world, re-mounts may occur before the desired state is
> reached (which would be to not run the said workloads on this cluster).
>
> The recovery may hence cause the existing connection-based blocklists
> to be reset, as newer mounts/maps of the fs/image are performed on the
> recovering cluster.
>
> The issues above make the existing blocklist scheme either unreliable
> or cumbersome to manage across all possible nodes in the respective
> kubernetes clusters.
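
For context, the per-client fencing referred to above amounts to
blocklisting each known client address. A minimal python-rados sketch
of that step follows; the command and field names are from memory and
may differ by release (older releases use "osd blacklist" /
"blacklistop"), and the address shown is of course made up:

    # Fence a single client instance by blocklisting its address.
    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf',
                          name='client.admin')
    cluster.connect()
    try:
        cmd = json.dumps({
            'prefix': 'osd blocklist',
            'blocklistop': 'add',
            # Hypothetical client instance address (entity_addr_t).
            'addr': '192.168.1.10:0/1234567890',
            'expire': 3600.0,   # optional, seconds
        })
        ret, outbuf, outs = cluster.mon_command(cmd, b'')
        print(ret, outs)
    finally:
        cluster.shutdown()

This has to be repeated for every address the fenced cluster may have
used, which is exactly the bookkeeping burden described above.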
>
> Potential solution:
> -------------------
> On discussing the above with Jason, he pointed out a potential
> solution (as follows) to resolve the problem:
>
> <snip>
> My suggestion would be to utilize CephX to revoke access to the cluster
> from site A when site B is promoted. The one immediate issue with this
> approach is that any clients with existing tickets will keep their
> access to the cluster until the ticket expires. Therefore, for this to
> be effective, we would need a transient CephX revocation list capability
> to essentially blocklist CephX clients for X period of time until we can
> be sure that their tickets have expired and are therefore no longer usable.
> </snip>
>
> The above is quite trivial from a kubernetes and ceph-csi POV, as each
> peer kubernetes cluster can be configured to use a different cephx
> identity, which can then be independently revoked and later
> reinstated, solving the issues laid out above.
>
> Revoking access for an existing cephx identity can be done by
> changing its existing authorization, and hence is readily available.
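
For illustration, here is a rough python-rados sketch of revoking
access by rewriting the identity's caps, assuming a per-cluster
identity named client.cluster-a (the identity name and the retained
cap are made up, and the exact JSON fields may vary by release):

    # Revoke cluster-a's data access by rewriting its caps.  Existing
    # tickets remain valid until they expire, which is the gap a
    # revocation list would close.
    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf',
                          name='client.admin')
    cluster.connect()
    try:
        cmd = json.dumps({
            'prefix': 'auth caps',
            'entity': 'client.cluster-a',
            # Drop osd/mds access; keep a minimal mon cap so the
            # identity still exists and can be reinstated later.
            'caps': ['mon', 'allow r'],
        })
        ret, outbuf, outs = cluster.mon_command(cmd, b'')
        print(ret, outs)
    finally:
        cluster.shutdown()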
>
> The ability to provide a revocation list for existing valid tickets
> that clients already hold would need to be developed.
>
> Thoughts and other options?

While that approach is tempting, I think it unnecessarily restricts
our attention to current options. It seems to me we should consider
another mechanism for blocklisting clients en masse. I would suggest
having clients add a "tag" to their sessions with Ceph daemons which
can be separately blocklisted. The tag can be derived from the cephx
key they use, so it does not require updating all client code (like
the kernel client) to send a tag. The cephx credential would probably
look like: "mon `allow r tag=bar'".
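
To make the shape of that concrete, here is a purely hypothetical
python-rados sketch of provisioning a per-cluster identity with such a
tag and then fencing by tag; neither the "tag=" cap syntax nor a
tag-based blocklist command exists today, and all names are made up:

    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf',
                          name='client.admin')
    cluster.connect()
    try:
        # Each kubernetes cluster gets its own identity; the tag is
        # carried in the caps, so clients need no code changes.
        create = json.dumps({
            'prefix': 'auth get-or-create',
            'entity': 'client.cluster-a',
            'caps': ['mon', 'allow r tag=cluster-a',   # proposed syntax
                     'osd', 'allow rw tag=cluster-a'],
        })
        cluster.mon_command(create, b'')

        # On a DR event, fence everything from cluster-a in one step
        # (hypothetical command variant).
        fence = json.dumps({
            'prefix': 'osd blocklist',
            'blocklistop': 'add',
            'tag': 'cluster-a',
            'expire': 3600.0,
        })
        cluster.mon_command(fence, b'')
    finally:
        cluster.shutdown()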

Once a new tag is added to the blocklist distributed via the MonMap
(or OSDMap for consistency), daemons would need to go through their
open sessions and blocklist any matches.
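
In (made-up) pseudocode, each daemon's map-update handling would then
do something along these lines; the real implementation would live in
the C++ daemons and the names here are purely illustrative:

    # Illustrative only: blocklist open sessions whose cephx-derived
    # tag appears in the newly distributed tag blocklist.
    def apply_tag_blocklist(sessions, blocklisted_tags):
        for session in sessions:
            if session.tag in blocklisted_tags:
                # Mark the session's instance address blocklisted and
                # drop the session.
                session.blocklist()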

There are other applications beyond DR. We currently have a
heavyweight "registered clients" map in the MgrMap which records all
RADOS instances opened by the mgr. These all need to be blocklisted if
the mgr fails over. It is racy to keep this map up to date, so we see
virtually unavoidable and annoying test failures [1,2]. If we used a
"mgr.x" tag for the mgr.x credential (perhaps one of several implicit
tags), we could blocklist that instead and avoid keeping track
entirely.

What do you think?

[1] https://tracker.ceph.com/issues/40867
[2] https://tracker.ceph.com/issues/43943

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx


