Re: cephadm docs on HA NFS

On Thu, 2021-07-29 at 13:14 -0500, Sage Weil wrote:
> On Thu, Jul 29, 2021 at 12:28 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > And that's exactly what we need for the OpenStack use case.  Clients
> > > are guest VMs, running who knows what.  We cannot assume or require
> > > anything about how they resolve hostnames to addresses.
> > > 
> > > -- Tom
> > > 
> > 
> > How are these IP addresses supplied to the tenants in an openstack
> > cluster? I guess this is using cinder?
> > 
> > If guests can't rely on DNS resolution, then perhaps cinder could be
> > taught to hand the guests a random address from a set? Or even better,
> > hand them more than one address. At least that way they could decide to
> > pick a different host for their initial mount if there was a problem
> > contacting the first one they try.
> > 
> > To reiterate, my main concern is that relying on an ingress controller
> > limits our options for improving the ability to scale in the future,
> > since it pretty much means you can't use NFSv4 migration at all.
> 
> Two thoughts.  First, if we want to distribute load across multiple
> IPs/servers via delegations, then that means we need to provide
> (ingress with) multiple IP addresses to work with.  IIUC the primary
> reason to do this would be to avoid proxying all of the NFS traffic
> for a given instance of NFS service through a single IP/node.  I
> suspect that roughly the same thing could also be accomplished with
> multiple instances of ingress for the same service and round-robin
> DNS.
> 
> Second, the primary benefit that ingress provides today (a known,
> stable, user-provided virtual IP address) is extremely valuable to
> real users.  The alternatives are something like: (1) query a ceph
> CLI/API to determine which IP to connect to, and if there is a
> failure, force unmount, (re)query the API, and remount the new IP;
> 

We have to do that anyway if DNS isn't in use, no? How do the
clients discover the (ingress) IP address to connect to in the first
place? Is there no way to give them more than one address to work with?
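
To sketch what I mean (purely illustrative -- the addresses are made
up, and this assumes cinder could be taught to hand the guest such a
list, which doesn't exist today):

#!/usr/bin/env python3
# Illustrative sketch: if a guest were handed a list of candidate NFS
# server addresses (hypothetically via cinder metadata), it could fall
# back on its own at initial-mount time.
import subprocess

CANDIDATE_SERVERS = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]  # made up
EXPORT = "/cephfs"
MOUNTPOINT = "/mnt/cephfs"

def mount_first_reachable(servers):
    """Try each address in turn; return the one that mounted, or None."""
    for addr in servers:
        cmd = ["mount", "-t", "nfs4", f"{addr}:{EXPORT}", MOUNTPOINT]
        if subprocess.run(cmd).returncode == 0:
            return addr
    return None

if __name__ == "__main__":
    srv = mount_first_reachable(CANDIDATE_SERVERS)
    print(f"mounted from {srv}" if srv else "all candidate servers failed")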

> (2)
> manually constrain placement of the NFS service to a specific host
> with a known IP, and lose NFS service if that host fails.  Neither of
> these are "HA" by any stretch of the imagination.
> 

If you're using a stable hashing scheme with your ingress controller,
don't you already have the #2 problem above? If one of the heads goes
down, you still have to wait for the replacement backend server to come
back before the client can reconnect. Worse yet, in that event the
client can't decide to try a different head in the cluster either.
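
To illustrate the pinning (a toy model of source-IP hashing, not
haproxy's actual balance algorithm; the backend names are invented):

import hashlib

BACKENDS = ["ganesha-a", "ganesha-b", "ganesha-c"]  # hypothetical daemons

def backend_for(client_ip, backends=BACKENDS):
    # Stable: a given client IP always lands on the same slot. That is
    # what lets it find its old state again after a restart, but it
    # also means the client is pinned -- it can't pick a different head.
    h = int(hashlib.sha256(client_ip.encode()).hexdigest(), 16)
    return backends[h % len(backends)]

print(backend_for("192.0.2.50"))  # same backend every time for this client
# If that backend dies, this client waits for a replacement daemon to
# fill the same slot; it has no way to fail over on its own.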

To be clear, there is _no_ requirement that a client connect to a
particular host in the cluster for the initial mount. The requirement is
that it must connect back to the same host in order to _reclaim_ state
if the NFS server goes down. If the client goes down, then its state is
toast anyway and we don't care which host it connects to.
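
Put as a toy model (just the invariant, not how ganesha/rados_cluster
actually tracks it; the names here are invented):

# Reclaim is only valid during the grace period, and only against the
# host that previously held the client's state. Initial mounts can go
# to any host.
state_held_on = {"client-1": "ganesha-a"}  # hypothetical prior state

def may_reclaim(client, server, in_grace=True):
    return in_grace and state_held_on.get(client) == server

print(may_reclaim("client-1", "ganesha-a"))  # True: same host, in grace
print(may_reclaim("client-1", "ganesha-b"))  # False: wrong host
print(may_reclaim("client-2", "ganesha-a"))  # False: nothing to reclaim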

I'm still not sold on the value of an ingress controller here. Granted,
it allows you to have one "true" IP address for the NFS server, but
that's really only useful at mount time. Meanwhile, it adds an extra
network hop and another point of failure.

> What the cephadm ingress service does now is essentially identical to
> what kubernetes does, except that the update interval isn't low enough
> (yet) for lock reclaim to work.  You get a known, stable IP, and NFS
> service can tolerate any cluster host failing.
> 

Yeah, we need to improve the update interval there regardless of
whether there is an ingress controller. 10 minutes is a _very_ long
time, and it violates a major assumption in rados_cluster.
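
Back-of-the-envelope (the grace window here is a typical NFSv4 default,
not a measured value from our setup):

FAILOVER_INTERVAL = 10 * 60  # seconds: roughly today's update interval
GRACE_PERIOD = 90            # seconds: a typical NFSv4 grace window

# Clients can only reclaim while the restarted server is in its grace
# period, so failover has to complete well inside that window.
if FAILOVER_INTERVAL >= GRACE_PERIOD:
    print("clients reconnect after grace expires -> locks are lost")
else:
    print("reclaim has a chance to succeed")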

> Ingress *also* uses haproxy to distribute load, but that probably
> isn't necessary for most users.  If you don't need scale-out then a
> single backend NFS daemon is sufficient and haproxy is doing nothing
> but proxying traffic from the host with the virtual IP to the host
> where the daemon is running.  This is really an implementation detail;
> we can swap out the implementation to do something else if we decide
> there are better tools to use...
> 
> As far as the user is concerned, ingress is "on" (and you provide the
> virtual IP) or it's not.  We can modify the implementation in the
> future to do whatever we like.
> 

Fair enough. I still think the value here doesn't justify the added
complexity, but if you feel you'll be able to migrate people off of this
scheme later, then so be it.

-- 
Jeff Layton <jlayton@xxxxxxxxxx>

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx


