Hello,
I recently spent some time setting up an HA NFS service in front of
CephFS (using Ceph 17.2.1).
Our requirements are:
- Access to a specific export must be restricted to certain IP addresses.
- Automatic "fast" failover when a NFS server goes down (main use case
is maintenance work, but also hardware failure etc.). It's not
completely clear how "fast" the failover must be. I think one minute is
probably OK, five minutes is probably not. Some clients accumulate a lot
of blocked processes the longer I/O to the export is blocked, so the
faster the failover, the better.
- Probably multiple active NFS servers in parallel. I haven't done any
performance testing yet, but I assume that a single Ganesha instance
won't deliver enough performance.
Because IP address restrictions for exports are not possible with the
ingress setup from [1] (Ganesha only sees the haproxy IP address [2]),
we cannot use it. Instead, I manually set up keepalived instances to
provide virtual NFS server IP addresses.
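For reference, when Ganesha sees the real client addresses, the
per-export restriction can be expressed roughly like the following
export spec (applied with "ceph nfs export apply <cluster_id> -i
export.json"; paths, file system name and addresses are placeholders,
and the exact field set is from memory):

    {
      "export_id": 1,
      "path": "/volumes/group/share",
      "pseudo": "/share",
      "access_type": "NONE",
      "squash": "none",
      "protocols": [4],
      "transports": ["TCP"],
      "fsal": {"name": "CEPH", "fs_name": "cephfs"},
      "clients": [
        {"addresses": ["192.0.2.0/24"], "access_type": "RW", "squash": "none"}
      ]
    }

With access_type set to NONE at the export level, only the addresses
listed under "clients" get (read-write) access.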
In the current test setup we have four nodes with a Ganesha instance on
each node. Each node is the primary for one virtual IP address, which is
quickly moved to a secondary node as soon as the local Ganesha instance
becomes unavailable (this is achieved with a track script in the
keepalived config that runs "/usr/bin/nc -z 127.0.0.1 2049").
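The relevant part of the keepalived configuration looks roughly like
this (interface name, router ID, priorities and addresses are
placeholders):

    vrrp_script chk_ganesha {
        # succeeds only while the local Ganesha instance still accepts
        # connections on the NFS port
        script "/usr/bin/nc -z 127.0.0.1 2049"
        interval 2
        fall 2
        rise 2
    }

    vrrp_instance VI_NFS_1 {
        state MASTER
        interface eth0
        virtual_router_id 51
        priority 150
        advert_int 1
        virtual_ipaddress {
            192.0.2.11/24
        }
        track_script {
            chk_ganesha
        }
    }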
On the protocol level the failover works very well. After stopping a
Ganesha instance, the virtual IP is quickly moved to another node, the
clients establish a new TCP connection to that node, and on the first
NFS call they learn that their session and client ID have become
invalid, so they establish a new client ID and session.
However, I/O to exports that were accessed via the failed Ganesha
instance then hangs for five minutes (even for client connections that
go through unaffected Ganesha instances). This seems to be due to the
MDS caps held by the failed instance, which take five minutes to time
out. Once the MDS session(s) held by the failed instance is/are manually
evicted, I/O to the exports is possible again immediately.
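The manual eviction amounts to something like this (file system name and
session ID are placeholders; with multiple active MDS ranks the session
would have to be found and evicted on every rank):

    # list the sessions on rank 0 and find the one belonging to the
    # failed Ganesha instance (e.g. via entity_id/hostname in
    # client_metadata)
    ceph tell mds.cephfs:0 session ls

    # evict that session by id; its caps are released immediately
    ceph tell mds.cephfs:0 session evict id=12345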
(Side note: the Ganesha instances are also supposed to enforce a grace
period after one instance goes down, but that doesn't seem to have much
practical effect. When I load a customized NFS config with "ceph nfs
cluster config set ..." after setting up a new NFS cluster, all Ganesha
instances are restarted, and afterwards they all permanently have the
"NEED" and "ENFORCING" flags set. From then on there effectively seems
to be no grace period enforcement anymore. I assume this is a bug that
might get fixed at some point, but the same behaviour can probably be
achieved deliberately by setting the Graceless option in the Ganesha
config.)
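For completeness, the customized config is loaded with something like
"ceph nfs cluster config set <cluster_id> -i ganesha-extra.conf", and my
understanding is that explicitly disabling the grace period would be a
fragment along these lines (untested on my side):

    NFSv4 {
        # disable the grace period entirely; clients can no longer
        # reclaim state after a restart or failover
        Graceless = true;
    }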
I have learned from [3] that when a Ganesha instance goes down, a new
instance with the same UUID must be started within five minutes. It
takes over the MDS session of the failed instance on startup, which also
means it must not already be running before the old instance fails.
Clients that were connected to the failed instance must then connect to
this new instance in order to reclaim their state. However, as far as I
know there is currently no mechanism that makes all of this happen.
Based on what I have experienced so far, I would like to propose an
alternative:
When a Ganesha instance fails, its IP address is moved to another node
where an instance is already running and the MDS session(s) of the
failed instance is/are evicted. Clients will connect to the other
instance where they learn that their NFS session and client ID are now
invalid. While they establish a new client ID they learn that they are
now connected to a different server (the new instance has a different
server owner and server scope). According to [4] reclaiming locks
(state) is not possible in this situation, so there's no need to enforce
a grace period. In this scenario clients that were connected to the
failed instance lose all their state, but other clients are unaffected.
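As a rough sketch of what the eviction step could look like (assuming
keepalived calls a notify hook that knows which instance previously
owned the virtual IP, that the Ganesha instances authenticate with
distinct entity IDs, that there is a single active MDS rank, and that jq
is available; all names are placeholders):

    #!/bin/bash
    # evict-failed-ganesha.sh (sketch): called when this node takes over
    # a virtual IP. Evicts the MDS session(s) of the Ganesha instance
    # that previously served that IP so its caps are released right away
    # instead of timing out after five minutes.

    FAILED_ENTITY="$1"   # e.g. "nfs.mynfs.2" (naming is an assumption)
    FS_NAME="cephfs"     # placeholder file system name

    ceph tell "mds.${FS_NAME}:0" session ls 2>/dev/null \
      | jq -r --arg e "$FAILED_ENTITY" \
          '.[] | select(.client_metadata.entity_id == $e) | .id' \
      | while read -r sid; do
            ceph tell "mds.${FS_NAME}:0" session evict id="$sid"
        done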
Some organizations may depend on proper state reclaim on failover and
therefore need a failover mechanism that provides it, but there are
probably also many organizations (including us) that don't need it and
would prefer a faster failover that doesn't block I/O for unaffected
clients.
Best regards,
Andreas
[1]
https://docs.ceph.com/en/quincy/cephadm/services/nfs/#high-availability-nfs
[2] https://tracker.ceph.com/issues/55663
[3]
https://lists.ceph.io/hyperkitty/list/dev@xxxxxxx/thread/G5VRS22AVOJS2BCWCYRI2TM7YT65UWCC/
[4] https://www.rfc-editor.org/rfc/rfc8881#name-state-reclaim