Re: Huge headaches with NFS and ingress HA failover

On Wed, 2021-07-21 at 18:28 +0200, Andreas Weisker wrote:
> Hi,
> 
> we recently set up a new Pacific cluster with cephadm.
> Deployed nfs on two hosts and ingress on two other hosts (ceph orch 
> apply for nfs and ingress as on the docs page).
> 
> So far so good. ESXi connects via NFS 4.1, but the way ingress works 
> confuses me.
> 
> It distributes clients statically to one nfs daemon by their IP 
> addresses. (I know NFS won't like it if a client switches back and 
> forth all the time, because of reservations.)
> Three of our ESXi servers seem to connect to host1, the 4th one to 
> the other. This leads to a problem in ESXi where it doesn't recognize 
> the datastore as the same one the others see. I can't find out how 
> exactly ESXi determines that, but there must be different information 
> coming from these nfs daemons; nfs-ganesha doesn't behave exactly the 
> same on these hosts.
> 

I don't know much about ESXi, but my guess would be that it's looking at
the eir_server_owner during the EXCHANGE_ID operation. Those will be
different depending on which ganesha server you connect to. In this
configuration, you have two entirely different servers that just happen
to serve out the same data in the same way. There is no session trunking
or anything like that.
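
For reference, the kind of setup described above boils down to specs
along these lines, applied with ceph orch apply -i <file> (just a
sketch; the service id, hosts, ports and virtual IP here are made up):

    service_type: nfs
    service_id: foo
    placement:
      hosts:
        - host1
        - host2
    ---
    service_type: ingress
    service_id: nfs.foo
    placement:
      hosts:
        - host3
        - host4
    spec:
      backend_service: nfs.foo
      frontend_port: 2049
      monitor_port: 9000
      virtual_ip: 192.168.1.100/24

Every client mounts the one virtual_ip, and haproxy fans the
connections out to two independent ganesha daemons behind it.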

> Besides that, I wanted to do some failover tests before the cluster 
> goes live. I stopped one nfs server, but ingress (haproxy) doesn't 
> seem to care.
> On the haproxy stats page, both backends are listed with "no check", 
> so there is no failover happening for the NFS clients; haproxy does 
> not fail over to the other host. Datastores are disconnected and I'm 
> unable to connect new ones.
> 

I'm confused. You have haproxy directing clients to specific backend
servers based on their IP address. Why would you expect them to fail
over when one goes down? To be clear, ganesha's rados_cluster recovery
backend is a scale-out solution. It doesn't provide HA all by itself.
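
The haproxy config in that setup is essentially a TCP proxy that pins
each client to a backend by source-IP hash, roughly along these lines
(a sketch, not the literal file cephadm writes out; names and
addresses are made up):

    frontend nfs
        bind 192.168.1.100:2049
        mode tcp
        default_backend nfs-backend

    backend nfs-backend
        mode tcp
        balance source              # hash the client IP to pick a backend
        hash-type consistent
        # no "check" keyword on the server lines, hence "no check" on the
        # stats page and no health-based failover
        server nfs.foo.0 10.0.0.1:2049
        server nfs.foo.1 10.0.0.2:2049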

It _is_ possible to have the individual ganesha backends in an
active/passive configuration and then gang them together into a
scale-out cluster using rados_cluster, if you want both scale-out and HA.
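
The rados_cluster piece on each node is just a recovery-backend setting
plus a shared RADOS pool/namespace for the grace database, something
like this in ganesha.conf (a sketch; the pool, namespace and nodeid
values are assumptions):

    NFSv4 {
        RecoveryBackend = rados_cluster;
    }

    RADOS_KV {
        # all nodes point at the same pool/namespace for the grace db
        pool = "nfs-ganesha";
        namespace = "grace";
        # must be unique and stable for each ganesha instance
        nodeid = "node-a";
    }

The nodes coordinate entering and lifting grace via that shared
database, which is what makes it a scale-out cluster rather than an HA
pair by itself.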

> How is ingress supposed to detect a failed nfs server, and how do I 
> tell the ganesha daemons to be identical to each other?
> 

You can't. That's not how this works.

> Bonus question: Why can't keepalived just manage nfs-ganesha on two 
> hosts directly, instead of haproxy? It would eliminate an extra 
> network hop.
> 

Agreed. I'm not a fan of using haproxy like this. I don't see that it
provides anything of value, and it'll make it harder to use NFSv4
migration later, if we decide to add that support.
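
A keepalived-only setup would just float an address on the ganesha
host and track the daemon, along these lines (a sketch; the interface,
VIP and check command are assumptions):

    vrrp_script chk_ganesha {
        # any check that fails when the local ganesha is down will do
        script "/usr/bin/pgrep -x ganesha.nfsd"
        interval 2
    }

    vrrp_instance NFS_VIP {
        interface eth0
        state MASTER
        virtual_router_id 51
        priority 100
        virtual_ipaddress {
            192.168.1.100/24
        }
        track_script {
            chk_ganesha
        }
    }

That removes the extra hop, though the usual caveats about clients
moving between servers and state reclaim still apply.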

My recommendation is still to have a different address for each server
and just use round-robin DNS to distribute the clients.
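
i.e. a normal multi-A-record name, along the lines of (example zone
fragment, names and addresses made up):

    nfs.example.com.   300  IN  A  192.168.1.11   ; ganesha on host1
    nfs.example.com.   300  IN  A  192.168.1.12   ; ganesha on host2

Clients resolving the name get the addresses in rotating order, so
mounts spread across the servers, and each client still talks to one
well-defined ganesha instance.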

> Hope someone has a few insights on that. I've already spent way too 
> much time on this to switch to some other solution.

Hope this helps!
-- 
Jeff Layton <jlayton@xxxxxxxxxx>

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


