On Thu, Jul 29, 2021 at 1:36 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > Second, the primary benefit that ingress provides today (a known,
> > stable, user-provided virtual IP address) is extremely valuable to
> > real users.  The alternatives are something like: (1) query a ceph
> > CLI/API to determine which IP to connect to, and if there is a
> > failure, force unmount, (re)query the API, and remount the new IP;
>
> We have to do that anyway if you're not using DNS, no? How do the
> clients discover the (ingress) IP address to connect to in the first
> place? Is there no way to give them more than one address to work with?

There are two differences: (1) the virtual IP is user-provided and
specified up front, which may not matter in many cases but might when
you need it to, say, exist within a particular subnet (vs reusing some
random host's existing static IP).  More importantly, (2) the IP will
not change even if there is a host failure.

The second point is the important one, since it's what allows the
service to be reliable.  That's a strict requirement for just about
everyone, right?

> > (2) manually constrain placement of the NFS service to a specific
> > host with a known IP, and lose NFS service if that host fails.
> > Neither of these is "HA" by any stretch of the imagination.
>
> If you're using a stable hashing scheme with your ingress controller,
> don't you already have the #2 problem above? If one of the heads goes
> down, you still have to wait until the replacement backend server comes
> back in order for it to reconnect.

Yes.  I'm not sure what the problem is or why this is bad.  IIUC this is
the best that ganesha can do: if a rank fails, we must start a new
ganesha instance with the same rank in another location and redirect
traffic to it.

> Worse yet, the client can't make a decision to try a different head in
> the cluster in that event either.

Do any such clients (or servers) exist that can do that?
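To make the static-hashing behavior above concrete, here's an illustrative Python sketch (not the actual ingress/haproxy implementation; the backend names are made up): because the hash of the client's source IP is stable, a given client always maps to the same ganesha rank, so after a failover it reconnects to whichever daemon was restarted with that rank.

```python
# Sketch of source-IP static hashing, assuming hypothetical backend
# names "ganesha.0".."ganesha.2" standing in for ganesha ranks.
import zlib

BACKENDS = ["ganesha.0", "ganesha.1", "ganesha.2"]

def pick_backend(client_ip: str, backends=BACKENDS) -> str:
    """Map a client source IP to a backend rank via a stable hash."""
    h = zlib.crc32(client_ip.encode())
    return backends[h % len(backends)]

# Deterministic: the same client hashes to the same rank before and
# after that rank's daemon is restarted on another host, which is what
# lets the client reclaim its state.
assert pick_backend("10.1.2.3") == pick_backend("10.1.2.3")
```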
> To be clear, there is _no_ requirement that a client connect to a
> particular host in the cluster for the initial mount. The requirement
> is that it must connect back to the same host in order to _reclaim_
> state if the NFS server goes down. If the client goes down, then its
> state is toast anyway and we don't care which host it connects to.

Understood... this is why we settled on this design (with static hashing
in the case where multiple backends are present).

> I'm still not sold on the value of an ingress controller here. True, it
> allows you to have one "true" IP address for the NFS server, but that's
> really only useful at mount time. Meanwhile, it's adding an extra
> network hop and point of failure.

Mount time is easy--we could provide the most recent ganesha's IP to
clients at mount time.  The problem is how to handle a failure when the
ganesha has to be restarted on some other host.  To keep clients from
breaking (force unmount + remount + lose all state) we have to restart a
new ganesha on the same IP, which must be virtual in the case that a
host may be down.

The only optional part of what ingress does now is the static hashing.
That was done because (1) we wanted a proxy to enable flexible placement
of the ganeshas independent of the virtual IP placement, and (2) we
wanted to enable load to be distributed.  Neither of those is a strict
requirement, but it fit together easily enough for NFS, and load
distribution *was* a strong requirement for RGW, which the ingress
service also covers.  (Even if it's more complex than necessary for just
NFS, it's less code overall than having the complex thing for RGW and a
separate, simpler thing for NFS only.)

But having a virtual IP and ganesha failover *is* a requirement for
anything resembling high availability.  Round-robin DNS doesn't give you
that; if a host fails, those clients are stuck until the host is
rebooted or replaced.
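For reference, the user-facing side of this is just a cephadm service spec along these lines (an illustrative sketch; the service id, count, ports, and IP are made up, and the field names follow the cephadm ingress spec as I understand it):

```yaml
service_type: ingress
service_id: nfs.foo
placement:
  count: 2
spec:
  backend_service: nfs.foo    # the ganesha service being fronted
  frontend_port: 2049         # NFS port exposed on the virtual IP
  monitor_port: 9049          # haproxy monitoring port
  virtual_ip: 10.0.0.100/24   # user-provided; stable across failover
```

The virtual_ip is the only thing the user really has to decide; everything else (haproxy placement, keepalived, hashing) is an implementation detail behind it.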
> > What the cephadm ingress service does now is essentially identical
> > to what kubernetes does, except that the update interval isn't low
> > enough (yet) for lock reclaim to work.  You get a known, stable IP,
> > and NFS service can tolerate any cluster host failing.
>
> Yeah, we need to improve the update interval there regardless of
> whether there is an ingress controller or not. 10 mins is a _very_
> long time, and that violates a major assumption in rados_cluster.

What is the rados_cluster assumption?

> > Ingress *also* uses haproxy to distribute load, but that probably
> > isn't necessary for most users.  If you don't need scale-out, then a
> > single backend NFS daemon is sufficient, and haproxy is doing
> > nothing but proxying traffic from the host with the virtual IP to
> > the host where the daemon is running.  This is really an
> > implementation detail; we can swap out the implementation to do
> > something else if we decide there are better tools to use...
> >
> > As far as the user is concerned, ingress is "on" (and you provide
> > the virtual IP) or it's not.  We can modify the implementation in
> > the future to do whatever we like.
>
> Fair enough. I still think the value here doesn't justify the added
> complexity, but if you feel you'll be able to migrate people off of
> this scheme later, then so be it.

If it were only for NFS, I'd agree, but ingress also covers RGW, which
does require haproxy, and I expect we'll use it for SMB as well.

Even so, it may still make sense to support a simpler mode with a single
backend instance and virtual IP only, since that would reduce the number
of hops when scale-out isn't needed... would that cover all of the bases
from your perspective?

sage

> --
> Jeff Layton <jlayton@xxxxxxxxxx>

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx