Hello,
I recently spent some time setting up an HA NFS service in front of
CephFS (using Ceph 17.2.1).
Our requirements are:
- Access to a specific export must be restricted to certain IP addresses.
- Automatic "fast" failover when a NFS server goes down (main use case
is maintenance work, but also hardware failure etc.). It's not
completely clear how "fast" the failover must be. I think one minute is
probably OK, five minutes is probably not. Some clients accumulate a lot
of blocked processes the longer I/O to the export is blocked, so the
faster the failover, the better.
- Probably multiple active NFS servers in parallel. I haven't done any
performance testing yet, but I assume that a single Ganesha instance
won't deliver enough performance.
Because IP address restrictions for exports are not possible with the
ingress setup from [1] (Ganesha only sees the haproxy IP address [2]),
we cannot use it. Instead, I manually set up keepalived instances to
provide virtual NFS server IP addresses.
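For reference, when Ganesha sees the real client addresses, the
per-export restriction can be expressed roughly like the following
export spec (applied with "ceph nfs export apply <cluster_id> -i
export.json"; paths, file system name and addresses are placeholders,
and the exact field set is from memory):

    {
      "export_id": 1,
      "path": "/volumes/group/share",
      "pseudo": "/share",
      "access_type": "NONE",
      "squash": "none",
      "protocols": [4],
      "transports": ["TCP"],
      "fsal": {"name": "CEPH", "fs_name": "cephfs"},
      "clients": [
        {"addresses": ["192.0.2.0/24"], "access_type": "RW", "squash": "none"}
      ]
    }

With access_type set to NONE at the export level, only the addresses
listed under "clients" get (read-write) access.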
In the current test setup we have four nodes with a Ganesha instance on
each node. Each node is the primary for one virtual IP address, which is
quickly moved to a secondary node as soon as the local Ganesha instance
becomes unavailable (this is achieved with a track script in the
keepalived config that runs "/usr/bin/nc -z 127.0.0.1 2049").
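The relevant part of the keepalived configuration looks roughly like
this (interface name, router ID, priorities and addresses are
placeholders):

    vrrp_script chk_ganesha {
        # succeeds only while the local Ganesha instance still accepts
        # connections on the NFS port
        script "/usr/bin/nc -z 127.0.0.1 2049"
        interval 2
        fall 2
        rise 2
    }

    vrrp_instance VI_NFS_1 {
        state MASTER
        interface eth0
        virtual_router_id 51
        priority 150
        advert_int 1
        virtual_ipaddress {
            192.0.2.11/24
        }
        track_script {
            chk_ganesha
        }
    }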
On the protocol level the failover works very well. After stopping a
Ganesha instance, the virtual IP is quickly moved to another node, the
clients establish a new TCP connection to that node, and on the first
NFS call they learn that their session and client ID have become
invalid, so they establish a new client ID and session.
However, I/O to exports that were accessed via the failed Ganesha
instance then hangs for five minutes (even for client connections that
go through unaffected Ganesha instances). This seems to be due to the
MDS caps held by the failed instance, which take five minutes to time
out. Once the MDS session(s) held by the failed instance is/are manually
evicted, I/O to the exports is possible again immediately.
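The manual eviction amounts to something like this (file system name and
session ID are placeholders; with multiple active MDS ranks the session
would have to be found and evicted on every rank):

    # list the sessions on rank 0 and find the one belonging to the
    # failed Ganesha instance (e.g. via entity_id/hostname in
    # client_metadata)
    ceph tell mds.cephfs:0 session ls

    # evict that session by id; its caps are released immediately
    ceph tell mds.cephfs:0 session evict id=12345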
(Side note: the Ganesha instances are also supposed to enforce a grace
period after one instance goes down, but that doesn't seem to have much
practical effect. When I load a customized NFS config with "ceph nfs
cluster config set ..." after setting up a new NFS cluster, all Ganesha
instances are restarted, and afterwards they all permanently have the
"NEED" and "ENFORCING" flags set. From then on there effectively seems
to be no grace period enforcement anymore. I assume this is a bug that
might get fixed at some point, but the same behaviour can probably be
achieved deliberately by setting the Graceless option in the Ganesha
config.)
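For completeness, the customized config is loaded with something like
"ceph nfs cluster config set <cluster_id> -i ganesha-extra.conf", and my
understanding is that explicitly disabling the grace period would be a
fragment along these lines (untested on my side):

    NFSv4 {
        # disable the grace period entirely; clients can no longer
        # reclaim state after a restart or failover
        Graceless = true;
    }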
I have learned from [3] that when a Ganesha instance goes down, a new
instance with the same UUID must be started within five minutes. It
takes over the MDS session of the failed instance on startup, which also
means it must not already be running before the old instance fails.
Clients that were connected to the failed instance must then connect to
this new instance in order to reclaim their state. However, as far as I
know there is currently no mechanism that makes all of this happen.
Based on what I have experienced so far, I would like to propose an
alternative:
When a Ganesha instance fails, its IP address is moved to another node
where an instance is already running and the MDS session(s) of the
failed instance is/are evicted. Clients will connect to the other
instance where they learn that their NFS session and client ID are now
invalid. While they establish a new client ID they learn that they are
now connected to a different server (the new instance has a different
server owner and server scope). According to [4] reclaiming locks
(state) is not possible in this situation, so there's no need to enforce
a grace period. In this scenario clients that were connected to the
failed instance lose all their state, but other clients are unaffected.
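As a rough sketch of what the eviction step could look like (assuming
keepalived calls a notify hook that knows which instance previously
owned the virtual IP, that the Ganesha instances authenticate with
distinct entity IDs, that there is a single active MDS rank, and that jq
is available; all names are placeholders):

    #!/bin/bash
    # evict-failed-ganesha.sh (sketch): called when this node takes over
    # a virtual IP. Evicts the MDS session(s) of the Ganesha instance
    # that previously served that IP so its caps are released right away
    # instead of timing out after five minutes.

    FAILED_ENTITY="$1"   # e.g. "nfs.mynfs.2" (naming is an assumption)
    FS_NAME="cephfs"     # placeholder file system name

    ceph tell "mds.${FS_NAME}:0" session ls 2>/dev/null \
      | jq -r --arg e "$FAILED_ENTITY" \
          '.[] | select(.client_metadata.entity_id == $e) | .id' \
      | while read -r sid; do
            ceph tell "mds.${FS_NAME}:0" session evict id="$sid"
        done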
Some organizations may depend on proper state reclaim on failover and
therefore need a failover mechanism that provides it, but there are
probably also many organizations (including us) that don't need it and
would prefer a faster failover that doesn't block I/O for unaffected
clients.
Best regards,
Andreas
[1]
https://docs.ceph.com/en/quincy/cephadm/services/nfs/#high-availability-nfs
[2] https://tracker.ceph.com/issues/55663
[3]
https://lists.ceph.io/hyperkitty/list/dev@xxxxxxx/thread/G5VRS22AVOJS2BCWCYRI2TM7YT65UWCC/
[4] https://www.rfc-editor.org/rfc/rfc8881#name-state-reclaim