On Wed, 2019-05-29 at 13:49 +0000, Stolte, Felix wrote:
> Hi,
>
> is anyone running an active-passive nfs-ganesha cluster with a cephfs
> backend and using the rados_kv recovery backend? My setup runs fine,
> but takeover is giving me a headache. On takeover I see the following
> messages in ganesha's log file:
>

Note that there are significant problems with the rados_kv recovery
backend. In particular, it does not properly handle the case where the
server crashes during the grace period. The rados_ng and rados_cluster
backends do handle those situations properly.

> 29/05/2019 15:38:21 : epoch 5cee88c4 : cephgw-e2-1 : ganesha.nfsd-9793[dbus_heartbeat] nfs_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 5
> 29/05/2019 15:38:21 : epoch 5cee88c4 : cephgw-e2-1 : ganesha.nfsd-9793[dbus_heartbeat] nfs_start_grace :STATE :EVENT :NFS Server recovery event 5 nodeid -1 ip 10.0.0.5
> 29/05/2019 15:38:21 : epoch 5cee88c4 : cephgw-e2-1 : ganesha.nfsd-9793[dbus_heartbeat] rados_kv_traverse :CLIENT ID :EVENT :Failed to lst kv ret=-2
> 29/05/2019 15:38:21 : epoch 5cee88c4 : cephgw-e2-1 : ganesha.nfsd-9793[dbus_heartbeat] rados_kv_read_recov_clids_takeover :CLIENT ID :EVENT :Failed to takeover
> 29/05/2019 15:38:26 : epoch 5cee88c4 : cephgw-e2-1 : ganesha.nfsd-9793[reaper] nfs_lift_grace_locked :STATE :EVENT :NFS Server Now NOT IN GRACE
>
> The result is clients hanging for up to 2 minutes. Has anyone run into
> the same problem?
>
> Ceph version: 12.2.11
> nfs-ganesha: 2.7.3
>

If I had to guess, the hanging is probably due to state that is being
held by the other node's MDS session that hasn't expired yet. Ceph v12
doesn't have the client reclaim interfaces that make more instantaneous
failover possible. That's new in v14 (Nautilus). See pages 12 and 13
here:

https://static.sched.com/hosted_files/cephalocon2019/86/Rook-Deployed%20NFS%20Clusters%20over%20CephFS.pdf

> ganesha.conf (identical on both nodes besides nodeid in RADOS_KV):
>
> NFS_CORE_PARAM {
> Enable_RQUOTA = false;
> Protocols = 3,4;
> }
>
> CACHEINODE {
> Dir_Chunk = 0;
> NParts = 1;
> Cache_Size = 1;
> }
>
> NFS_krb5 {
> Active_krb5 = false;
> }
>
> NFSv4 {
> Only_Numeric_Owners = true;
> RecoveryBackend = rados_kv;
> Grace_Period = 5;
> Lease_Lifetime = 5;

Yikes! That's _way_ too short a grace period and lease lifetime.
Ganesha will probably exit the grace period before the clients ever
realize the server has restarted, and they will fail to reclaim their
state.

> Minor_Versions = 1,2;
> }
>
> RADOS_KV {
> ceph_conf = '/etc/ceph/ceph.conf';
> userid = "ganesha";
> pool = "cephfs_metadata";
> namespace = "ganesha";
> nodeid = "cephgw-k2-1";
> }
>
> Any hint would be appreciated.

I consider ganesha's dbus-based takeover mechanism to be broken by
design, as it requires the recovery backend to do things that can't be
done atomically. If a crash occurs at the wrong time, the recovery
database can end up trashed and no one can reclaim anything.

If you really want an active/passive setup then I'd move away from
that and just have whatever clustering software you're using start up
the daemon on the active node after ensuring that it's shut down on
the passive one. With that, you can also use the rados_ng recovery
backend, which is more resilient in the face of multiple crashes. In
that configuration you would want to have the same config file on
both nodes, including the same nodeid so that you can potentially take
advantage of the RECLAIM_RESET interface to kill off the old session
quickly after the server restarts.
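Roughly, the recovery-related pieces of that shared config might then
look something like the untested sketch below. The "cephgw" nodeid is
just a placeholder (pick whatever shared value you like), and the rest
of your config (NFS_CORE_PARAM, CACHEINODE, etc.) can stay as it is:

NFSv4 {
    Only_Numeric_Owners = true;
    Minor_Versions = 1,2;
    RecoveryBackend = rados_ng;
    # No Grace_Period / Lease_Lifetime overrides here; let them fall
    # back to ganesha's much longer defaults.
}

RADOS_KV {
    ceph_conf = '/etc/ceph/ceph.conf';
    userid = "ganesha";
    pool = "cephfs_metadata";
    namespace = "ganesha";
    # Same nodeid on both nodes, so the newly-active server can use
    # RECLAIM_RESET to tear down the old session quickly.
    nodeid = "cephgw";
}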
You also need a much longer grace period.

Cheers,
--
Jeff Layton <jlayton@xxxxxxxxxxxxxxx>