Fencing an entire client cluster from access to Ceph (in kubernetes)

Asks:
-----
This mail is intended to trigger a discussion on the potential solution, provided below, for the issue in the subject, and to gather other ideas/options that would enable the use case described here.

Use case/Background:
--------------------
Ceph is used by kubernetes to provide persistent storage (block and file, via RBD and CephFS respectively) to pods, via the CSI interface implemented in ceph-csi [1].

One of the use cases we want to solve is multiple kubernetes clusters accessing the same Ceph storage cluster [2], where these kubernetes clusters further provide DR (disaster recovery) for workloads when a peer kubernetes cluster becomes unavailable.

IOW, if a workload is running on kubernetes cluster-a and has access to persistent storage, it can be migrated to cluster-b in case of a DR event in cluster-a, ensuring workload continuity and with it access to the same persistent storage (as the Ceph cluster is shared and available).

Problem:
--------
The exact status of all clients/nodes in kubernetes cluster-a during a DR event is unknown; all may be down, or some may still be up and running and hence still accessing storage.

This brings about the need to fence all IO from all nodes/container-networks on cluster-a, on a DR event, prior to migrating the workloads to cluster-b.

Existing solutions and issues:
------------------------------
Current schemes to fence IO operate per client [3], and further per image for RBD. This makes it a prerequisite that all client addresses in cluster-a are known, and further are unique across peer kubernetes clusters, for a fence/blocklist to be effective.
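
For reference, today's per-client fencing looks roughly like the following (a minimal sketch in Python wrapping the ceph CLI; the client address is illustrative, and on Ceph releases prior to Pacific the subcommand is "blacklist" rather than "blocklist"):

    import subprocess

    def blocklist_client(addr: str) -> None:
        # Blocklist a single client address (addr:port/nonce) at the OSD level,
        # as per the RBD exclusive-lock and CephFS eviction docs [3].
        subprocess.run(["ceph", "osd", "blocklist", "add", addr], check=True)

    # This must be repeated for every possible client address in cluster-a, and
    # those addresses must be known in advance and unique across peer clusters.
    blocklist_client("192.168.10.21:0/3271543291")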

Also, during recovery of kubernetes cluster-a, kubernetes starts from the current known state of the world (i.e. the workload "was" running on this cluster) and only eventually reconciles to the desired state of the world, so it is possible that re-mounts occur prior to reaching the desired state (which would be to not run the said workloads on this cluster).

The recovery may hence cause the existing connection-based blocklists to be reset, as new mounts/maps of the fs/image are performed on the recovering cluster.

The issues above make the existing blocklist scheme either unreliable or cumbersome to deal with across all possible nodes in the respective kubernetes clusters.

Potential solution:
-------------------
On discussing the above with Jason, he pointed out a potential solution (as follows) to resolve the problem:

<snip>
My suggestion would be to utilize CephX to revoke access to the cluster from site A when site B is promoted. The one immediate issue with this approach is that any clients with existing tickets will keep their access to the cluster until the ticket expires. Therefore, for this to be effective, we would need a transient CephX revocation list capability to essentially blocklist CephX clients for X period of time until we can be sure that their tickets have expired and are therefore no longer usable.
</snip>

The above is quite trivial from a kubernetes and ceph-csi POV, as each peer kubernetes cluster can be configured to use a different cephx identity, which can then be independently revoked and later reinstated, solving the issues laid out above.
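
For example, each kubernetes cluster could be handed its own cephx identity when it is provisioned against the Ceph cluster (a sketch; the entity names, pool name and caps are illustrative and would map to each cluster's ceph-csi secret):

    import subprocess

    def create_cluster_identity(entity: str, pool: str) -> str:
        # One cephx identity per kubernetes cluster, e.g. client.k8s-cluster-a
        # and client.k8s-cluster-b, each granted access to the shared pool.
        result = subprocess.run(
            ["ceph", "auth", "get-or-create", entity,
             "mon", "profile rbd",
             "osd", f"profile rbd pool={pool}"],
            check=True, capture_output=True, text=True)
        return result.stdout  # keyring handed to that cluster's ceph-csi

    create_cluster_identity("client.k8s-cluster-a", "kubernetes")
    create_cluster_identity("client.k8s-cluster-b", "kubernetes")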

The ability to revoke access for an existing cephx identity is readily available, as it only requires changing the identity's existing authorization (caps).
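
Concretely, revoking and later reinstating access by rewriting the identity's caps might look like the following (a sketch using "ceph auth caps"; note this only prevents new tickets from being granted, which is exactly why a revocation list for already-issued tickets is needed):

    import subprocess

    def revoke(entity: str) -> None:
        # Clear all mon/osd caps so no new cephx tickets grant access.
        # Clients holding still-valid tickets are unaffected until expiry.
        subprocess.run(["ceph", "auth", "caps", entity, "mon", " ", "osd", " "],
                       check=True)

    def reinstate(entity: str, pool: str) -> None:
        # Restore the original caps once cluster-a is known to be clean again.
        subprocess.run(["ceph", "auth", "caps", entity,
                        "mon", "profile rbd",
                        "osd", f"profile rbd pool={pool}"],
                       check=True)

    revoke("client.k8s-cluster-a")          # at DR/promotion time
    # ... later, after recovery and reconciliation ...
    reinstate("client.k8s-cluster-a", "kubernetes")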

The ability to provide a revocation list for existing valid tickets that clients already hold would need to be developed.

Thoughts and other options?

Thanks,
Shyam

[1] Ceph-csi: https://github.com/ceph/ceph-csi
[2] DR use case in ceph-csi: https://github.com/ceph/ceph-csi/pull/1558
[3] RBD exclusive locks and blocklists: https://docs.ceph.com/en/latest/rbd/rbd-exclusive-locks/
    CephFS client eviction and blocklists: https://docs.ceph.com/en/latest/cephfs/eviction/