From: Jeff Layton <jlayton@xxxxxxxxxx>

This is an update of the patchset I originally posted back in late January. The basic idea is to add a new rados_cluster recovery backend that allows running ganesha servers to self-aggregate and work out the grace period amongst themselves, by following a set of simple rules and indicating their current and desired state in a shared RADOS object.

The patchset starts by extending the recovery_backend operations interface to cover handling of the grace period. All of the new operations collapse down to no-ops on the singleton recovery backends.

It then adds a new support library that abstracts out management of the shared RADOS object. This object tracks whether a cluster-wide grace period is in effect, and from what reboot epoch recovery is allowed. It also allows the cluster nodes to indicate whether they need a grace period (in order to allow recovery) and whether they are currently enforcing the grace period.

Next comes a new command-line tool for directly manipulating the shared object. This gives an admin a way to do things like request a grace period manually, remove a dead host from the cluster, and "fake up" other nodes in the cluster for testing purposes.

Finally, it adds a new recovery backend that plugs into the same library to allow ganesha to participate as a clustered node.

The immediate aim here is to allow us to do an active/active export of FSAL_CEPH from multiple heads, probably under some sort of container orchestration (e.g., Kubernetes). The underlying design, however, should be extensible to other clustered backends.

While this does work, it's still very much proof-of-concept code at this point. There is quite a bit of room for improvement, so I don't think this is quite ready for merge, but I'd appreciate any early feedback on the approach. Does anyone see any major red flags in this design that I haven't yet spotted?

There is one prerequisite for this set -- it currently relies on a patch to Ceph that is not yet in tree (to allow ganesha to immediately kill off the Ceph MDS session of its previous incarnation). That's still under development, but it's fairly straightforward.
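To make the shared-object scheme a bit more concrete, here is a minimal, illustrative sketch of how a node might flag itself as needing and enforcing the grace period by setting a per-node flag byte in the object's omap. This is not the rados_grace API from this series: the object name, the flag bits, and the grace_join() helper are all made up for illustration, and the real library additionally maintains the cluster-wide current/recovery epochs, which are not shown here.

/*
 * Illustrative sketch only -- not the API from this patchset. It assumes a
 * layout where the shared object's omap carries one flag byte per node;
 * GRACE_OID, NODE_NEED, NODE_ENFORCING and grace_join() are hypothetical.
 */
#include <stdio.h>
#include <rados/librados.h>

#define GRACE_OID	"grace"		/* shared grace-tracking object */
#define NODE_NEED	0x1		/* node needs a grace period */
#define NODE_ENFORCING	0x2		/* node is enforcing the grace period */

/*
 * Mark this node as needing and enforcing a grace period by setting its
 * flag byte in the shared object's omap.
 */
static int grace_join(rados_ioctx_t io, const char *nodeid)
{
	char flags = NODE_NEED | NODE_ENFORCING;
	const char *keys[1] = { nodeid };
	const char *vals[1] = { &flags };
	size_t lens[1] = { sizeof(flags) };
	rados_write_op_t op;
	int ret;

	op = rados_write_op_create();
	rados_write_op_omap_set(op, keys, vals, lens, 1);
	ret = rados_write_op_operate(op, io, GRACE_OID, NULL, 0);
	rados_write_op_release(op);
	return ret;
}

int main(int argc, char **argv)
{
	rados_t cluster;
	rados_ioctx_t io;
	int ret;

	if (argc < 3) {
		fprintf(stderr, "usage: %s <pool> <nodeid>\n", argv[0]);
		return 1;
	}

	ret = rados_create(&cluster, NULL);
	if (ret < 0)
		return 1;
	rados_conf_read_file(cluster, NULL);	/* default ceph.conf search */
	ret = rados_connect(cluster);
	if (ret < 0)
		goto out_shutdown;
	ret = rados_ioctx_create(cluster, argv[1], &io);
	if (ret < 0)
		goto out_shutdown;

	ret = grace_join(io, argv[2]);
	printf("grace_join(%s): %d\n", argv[2], ret);

	rados_ioctx_destroy(io);
out_shutdown:
	rados_shutdown(cluster);
	return ret < 0 ? 1 : 0;
}

In the real backend, updates like this would also need to be serialized against concurrent updates from other nodes and tied to the epoch bookkeeping described above; this sketch only shows the basic shape of the per-node state in the shared object.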
Jeff Layton (13):
  HASHTABLE: add a hashtable_for_each function
  reaper: add a way to wake up the reaper immediately
  main: initialize recovery backend earlier
  SAL: make some rados_kv symbols public
  SAL: add new try_lift_grace recovery operation
  SAL: add recovery operation to maybe start a grace period
  SAL: add new set_enforcing operation
  SAL: add a way to check for grace period being enforced cluster-wide
  main: add way to stall server until grace is being enforced
  support: add a rados_grace support library
  tools: add new rados_grace manipulation tool
  SAL: add new clustered RADOS recovery backend
  FSAL_CEPH: kill off old session before the mount

 src/CMakeLists.txt                        |   1 +
 src/FSAL/FSAL_CEPH/main.c                 |  39 ++
 src/MainNFSD/nfs_init.c                   |   8 -
 src/MainNFSD/nfs_lib.c                    |  13 +
 src/MainNFSD/nfs_main.c                   |  12 +
 src/MainNFSD/nfs_reaper_thread.c          |  11 +
 src/SAL/CMakeLists.txt                    |   3 +-
 src/SAL/nfs4_recovery.c                   |  90 ++-
 src/SAL/recovery/recovery_rados.h         |   6 +
 src/SAL/recovery/recovery_rados_cluster.c | 406 +++++++++++++++
 src/SAL/recovery/recovery_rados_kv.c      |   7 +-
 src/cmake/modules/FindCEPHFS.cmake        |   8 +
 src/doc/man/ganesha-core-config.rst       |   1 +
 src/hashtable/hashtable.c                 |  17 +
 src/include/config-h.in.cmake             |   1 +
 src/include/hashtable.h                   |   3 +
 src/include/nfs_core.h                    |   1 +
 src/include/rados_grace.h                 |  82 +++
 src/include/sal_functions.h               |  11 +-
 src/nfs-ganesha.spec-in.cmake             |   2 +
 src/support/CMakeLists.txt                |   4 +
 src/support/rados_grace.c                 | 678 ++++++++++++++++++++++
 src/tools/CMakeLists.txt                  |   4 +
 src/tools/rados_grace_tool.c              | 178 ++++++
 24 files changed, 1567 insertions(+), 19 deletions(-)
 create mode 100644 src/SAL/recovery/recovery_rados_cluster.c
 create mode 100644 src/include/rados_grace.h
 create mode 100644 src/support/rados_grace.c
 create mode 100644 src/tools/rados_grace_tool.c

-- 
2.17.0