On Wed, 2019-05-29 at 13:49 +0000, Stolte, Felix wrote:
> Hi,
>
> is anyone running an active-passive nfs-ganesha cluster with a cephfs
> backend and using the rados_kv recovery backend? My setup runs fine,
> but takeover is giving me a headache. On takeover I see the following
> messages in ganesha's log file:
>

Note that there are significant problems with the rados_kv recovery
backend. In particular, it does not properly handle the case where the
server crashes during the grace period. The rados_ng and rados_cluster
backends do handle those situations properly.

> 29/05/2019 15:38:21 : epoch 5cee88c4 : cephgw-e2-1 : ganesha.nfsd-9793[dbus_heartbeat] nfs_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 5
> 29/05/2019 15:38:21 : epoch 5cee88c4 : cephgw-e2-1 : ganesha.nfsd-9793[dbus_heartbeat] nfs_start_grace :STATE :EVENT :NFS Server recovery event 5 nodeid -1 ip 10.0.0.5
> 29/05/2019 15:38:21 : epoch 5cee88c4 : cephgw-e2-1 : ganesha.nfsd-9793[dbus_heartbeat] rados_kv_traverse :CLIENT ID :EVENT :Failed to lst kv ret=-2
> 29/05/2019 15:38:21 : epoch 5cee88c4 : cephgw-e2-1 : ganesha.nfsd-9793[dbus_heartbeat] rados_kv_read_recov_clids_takeover :CLIENT ID :EVENT :Failed to takeover
> 29/05/2019 15:38:26 : epoch 5cee88c4 : cephgw-e2-1 : ganesha.nfsd-9793[reaper] nfs_lift_grace_locked :STATE :EVENT :NFS Server Now NOT IN GRACE
>
> The result is clients hanging for up to 2 minutes. Has anyone run into
> the same problem?
>
> Ceph version: 12.2.11
> nfs-ganesha: 2.7.3
>

If I had to guess, the hanging is probably due to state that is being
held by the other node's MDS session that hasn't expired yet. Ceph v12
doesn't have the client reclaim interfaces that make more instantaneous
failover possible. That's new in v14 (Nautilus). See pages 12 and 13
here:

https://static.sched.com/hosted_files/cephalocon2019/86/Rook-Deployed%20NFS%20Clusters%20over%20CephFS.pdf

> ganesha.conf (identical on both nodes besides nodeid in RADOS_KV):
>
> NFS_CORE_PARAM {
> Enable_RQUOTA = false;
> Protocols = 3,4;
> }
>
> CACHEINODE {
> Dir_Chunk = 0;
> NParts = 1;
> Cache_Size = 1;
> }
>
> NFS_krb5 {
> Active_krb5 = false;
> }
>
> NFSv4 {
> Only_Numeric_Owners = true;
> RecoveryBackend = rados_kv;
> Grace_Period = 5;
> Lease_Lifetime = 5;

Yikes! That's _way_ too short a grace period and lease lifetime.
Ganesha will probably exit the grace period before the clients ever
realize the server has restarted, and they will fail to reclaim their
state.

> Minor_Versions = 1,2;
> }
>
> RADOS_KV {
> ceph_conf = '/etc/ceph/ceph.conf';
> userid = "ganesha";
> pool = "cephfs_metadata";
> namespace = "ganesha";
> nodeid = "cephgw-k2-1";
> }
>
> Any hint would be appreciated.

I consider ganesha's dbus-based takeover mechanism to be broken by
design, as it requires the recovery backend to do things that can't be
done atomically. If a crash occurs at the wrong time, the recovery
database can end up trashed and no one can reclaim anything.

If you really want an active/passive setup then I'd move away from
that and just have whatever clustering software you're using start up
the daemon on the active node after ensuring that it's shut down on
the passive one. With that, you can also use the rados_ng recovery
backend, which is more resilient in the face of multiple crashes. In
that configuration you would want to have the same config file on
both nodes, including the same nodeid so that you can potentially take
advantage of the RECLAIM_RESET interface to kill off the old session
quickly after the server restarts.
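Roughly, the recovery-related pieces of that shared config might then
look something like the untested sketch below. The "cephgw" nodeid is
just a placeholder (pick whatever shared value you like), and the rest
of your config (NFS_CORE_PARAM, CACHEINODE, etc.) can stay as it is:

NFSv4 {
    Only_Numeric_Owners = true;
    Minor_Versions = 1,2;
    RecoveryBackend = rados_ng;
    # No Grace_Period / Lease_Lifetime overrides here; let them fall
    # back to ganesha's much longer defaults.
}

RADOS_KV {
    ceph_conf = '/etc/ceph/ceph.conf';
    userid = "ganesha";
    pool = "cephfs_metadata";
    namespace = "ganesha";
    # Same nodeid on both nodes, so the newly-active server can use
    # RECLAIM_RESET to tear down the old session quickly.
    nodeid = "cephgw";
}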
You also need a much longer grace period.

Cheers,
--
Jeff Layton <jlayton@xxxxxxxxxxxxxxx>