CephFS/Ganesha NFS HA

Hello,

I recently spent some time setting up an HA NFS service in front of CephFS (using Ceph 17.2.1).

Our requirements are:

- Access to a specific export must be restricted to certain IP addresses.
- Automatic "fast" failover when an NFS server goes down (the main use case is maintenance work, but also hardware failure etc.). It's not completely clear how "fast" the failover must be; I think one minute is probably OK, five minutes is probably not. Some clients accumulate a lot of blocked processes the longer I/O to the export is blocked, so the faster the failover, the better.
- Probably multiple active NFS servers in parallel. I haven't done any performance testing yet, but I assume that a single Ganesha instance doesn't yield enough performance.

Because IP address restrictions for exports are not possible with the ingress setup from [1] (Ganesha only sees the haproxy IP address [2]), we cannot use it. Instead, I manually set up keepalived instances to provide virtual NFS server IP addresses.
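For reference, the per-export IP restriction we need looks roughly like this in the Ganesha export config (only a sketch; export ID, paths, cephx user and client addresses are placeholders for our actual values):

    EXPORT {
        Export_ID = 100;
        Path = "/volumes/group/share";
        Pseudo = "/share";
        Protocols = 4;
        Access_Type = NONE;          # no access unless a CLIENT block matches
        Squash = No_Root_Squash;

        FSAL {
            Name = CEPH;
            User_Id = "nfs.mycluster.1";
            Filesystem = "cephfs";
        }

        CLIENT {
            Clients = 192.0.2.10, 192.0.2.11;   # only these addresses get access
            Access_Type = RW;
        }
    }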

In the current test setup we have four nodes and a Ganesha instance on each node. Each node is primary for one virtual IP address, which is quickly moved to the secondary node as soon as the local Ganesha instance becomes unavailable (this is achieved with a track script in the keepalived config that does "/usr/bin/nc -z 127.0.0.1 2049").
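The keepalived part looks roughly like this on each node (a sketch with placeholder interface, router ID, priority and VIP; the other nodes carry the same instance as BACKUP with lower priorities):

    vrrp_script chk_ganesha {
        # consider the local Ganesha dead if nothing listens on 2049
        script "/usr/bin/nc -z 127.0.0.1 2049"
        interval 2
        fall 2
        rise 2
    }

    vrrp_instance VI_NFS_1 {
        state MASTER
        interface eth0
        virtual_router_id 51
        priority 150
        advert_int 1
        virtual_ipaddress {
            192.0.2.101/24
        }
        track_script {
            chk_ganesha
        }
    }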

On the protocol level, the failover works very well. After stopping a Ganesha instance, the virtual IP is quickly moved to another node; the clients establish a new TCP connection to this node, and when sending the first NFS call they learn that their session and client ID have become invalid, so they establish a new client ID and session.

However, I/O to exports that were accessed via the failed Ganesha instance then hangs for five minutes (even for client connections that go through unaffected Ganesha instances). This seems to be due to the MDS caps that were held by the failed instance, which take five minutes to time out. When the MDS session(s) held by the failed instance is/are manually evicted, I/O to the exports is possible again immediately.
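For the record, the manual eviction is something along these lines (file system name and client ID are placeholders; the stale session can be identified e.g. by the hostname in its client metadata):

    # list the MDS client sessions and find the one belonging to the failed Ganesha instance
    ceph tell mds.cephfs:0 session ls

    # evict it by its client id
    ceph tell mds.cephfs:0 session evict id=4305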

(Sidenote: The Ganesha instances are also supposed to enforce a grace period after one instance goes down, but that doesn't seem to have much practical effect. When I load a customized NFS config with "ceph nfs cluster config set ..." after setting up a new NFS cluster, all Ganesha instances are restarted, and afterwards they all permanently have the "NEED" and "ENFORCING" flags set; from then on there effectively seems to be no grace period enforcement anymore. I assume this is a bug that might get fixed at some point, but the same behaviour can probably be achieved deliberately by setting the graceless option in the Ganesha config.)
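If someone wanted to make that behaviour explicit rather than relying on the (apparently broken) grace enforcement, I assume a config fragment like the following, applied with "ceph nfs cluster config set", would do it (untested on my side):

    NFSv4 {
        # never enter a grace period; clients cannot reclaim state after a restart
        Graceless = true;
    }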

I have learned from [3] that when one Ganesha instance goes down, a new instance must be started with the same UUID within five minutes. That new instance can then take over the MDS session; it must not already be running before the old instance fails, because the takeover happens on startup. Clients that were connected to the failed instance must connect to this new instance in order to reclaim their state.

However, as far as I know there's currently no mechanism to make this happen.

Based on what I have experienced so far, I would like to propose an alternative:

When a Ganesha instance fails, its IP address is moved to another node where an instance is already running, and the MDS session(s) of the failed instance is/are evicted. Clients will connect to the other instance, where they learn that their NFS session and client ID are now invalid. While establishing a new client ID, they learn that they are now connected to a different server (the new instance has a different server owner and server scope). According to [4], reclaiming locks (state) is not possible in this situation, so there's no need to enforce a grace period. In this scenario, clients that were connected to the failed instance lose all their state, but other clients are unaffected.
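As a rough illustration (not a finished implementation), the eviction could be triggered from the node that takes over the virtual IP, e.g. by a keepalived notify script along these lines. The file system name, the way the failed host is passed in, and the use of jq are all assumptions on my part:

    #!/bin/bash
    # Hypothetical notify script: run on the node that just became MASTER for the
    # virtual IP of a failed Ganesha instance. It evicts that instance's stale
    # CephFS MDS session(s) so that I/O through the surviving instances does not
    # hang until the caps time out.

    FAILED_HOST="$1"   # hostname of the node whose Ganesha instance went down
    FS_NAME="cephfs"   # CephFS file system name (placeholder)

    # Evict every MDS session whose client metadata points at the failed host.
    for id in $(ceph tell mds.${FS_NAME}:0 session ls |
                jq -r ".[] | select(.client_metadata.hostname == \"${FAILED_HOST}\") | .id"); do
        echo "evicting stale MDS session ${id} from ${FAILED_HOST}"
        ceph tell mds.${FS_NAME}:0 session evict id="${id}"
    done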

Some organizations may depend on proper state reclaim on failover and therefore need a failover mechanism that provides it, but there are probably also many organizations (including us) that don't need that and would prefer a faster failover that doesn't block I/O for unaffected clients.

Best regards,

Andreas

[1] https://docs.ceph.com/en/quincy/cephadm/services/nfs/#high-availability-nfs
[2] https://tracker.ceph.com/issues/55663
[3] https://lists.ceph.io/hyperkitty/list/dev@xxxxxxx/thread/G5VRS22AVOJS2BCWCYRI2TM7YT65UWCC/
[4] https://www.rfc-editor.org/rfc/rfc8881#name-state-reclaim