On Thu, 2021-07-29 at 15:22 -0500, Sage Weil wrote:
> On Thu, Jul 29, 2021 at 1:36 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:

> > > Second, the primary benefit that ingress provides today (a known, stable, user-provided virtual IP address) is extremely valuable to real users. The alternatives are something like: (1) query a ceph CLI/API to determine which IP to connect to, and if there is a failure, force unmount, (re)query the API, and remount the new IP;

> > We have to do that anyway if you're not using DNS, no? How do the clients discover the (ingress) IP address to connect to in the first place? Is there no way to give them more than one address to work with?

> There are two differences: (1) the virtual IP is user-provided and specified up front, which may not matter in many cases but might when you need it to, say, exist within a particular subnet (vs reuse some random host's existing static IP). More importantly, (2) the IP will not change even if there is a host failure. The second point is the important one, since it's what allows the service to be reliable. That's a strict requirement for just about everyone, right?

The IP address for a particular server cannot change, but IIUC with an ingress controller, the backend IP addresses also don't change. Basically the ingress controller just proxies your packets from the public IP addr to one of the backend private addrs. That does allow you to have a one-to-many relationship, but that's all it really gives you. It adds nothing to redundancy or resiliency. You're just hiding the fact that there are several backend servers.

> > > (2) manually constrain placement of the NFS service to a specific host with a known IP, and lose NFS service if that host fails. Neither of these are "HA" by any stretch of the imagination.

> > If you're using a stable hashing scheme with your ingress controller, don't you already have the #2 problem above? If one of the heads goes down, you still have to wait until the replacement backend server comes back in order for it to reconnect. Worse yet,

> Yes. I'm not sure what the problem is or why this is bad. IIUC this is the best that ganesha can do; if a rank fails, we must restart a new ganesha instance with the same rank in another location and redirect traffic to it.

Right, so at that point, the ingress controller buys you nothing. You're just adding a layer of indirection, but it's not really giving you anything (other than the fiction that you're dealing with a single host).

> > the client can't make a decision to try a different head in the cluster in that event either.

> Do any such clients (or servers) exist that can do that?

mount.nfs will try several addresses in turn based on DNS results. Tom mentioned that they are using manila and that they can't rely on DNS though, so YMMV there. There's no reason the manila "consumer" couldn't just try a list of servers until it gets a successful mount though. I don't know that much about openstack or manila though.
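Roughly what I'm picturing (a minimal sketch only -- this is not mount.nfs source, and the hostname below is just a made-up placeholder): walk the addresses that DNS hands back for the server name and take the first one that will accept a connection on the NFS port:

/*
 * Sketch: resolve a server name and report the first address that
 * accepts a TCP connection on 2049.  Not mount.nfs code; the default
 * hostname is a placeholder.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netdb.h>

int main(int argc, char **argv)
{
    const char *host = argc > 1 ? argv[1] : "nfs.example.com";
    struct addrinfo hints, *res, *ai;
    char addr[64];

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(host, "2049", &hints, &res)) {
        fprintf(stderr, "can't resolve %s\n", host);
        return 1;
    }

    for (ai = res; ai; ai = ai->ai_next) {
        int fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);

        if (fd < 0)
            continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0) {
            /* found a live server: this is the one to mount */
            getnameinfo(ai->ai_addr, ai->ai_addrlen, addr, sizeof(addr),
                        NULL, 0, NI_NUMERICHOST);
            printf("would mount from %s\n", addr);
            close(fd);
            freeaddrinfo(res);
            return 0;
        }
        close(fd);
    }
    freeaddrinfo(res);
    fprintf(stderr, "no server answered\n");
    return 1;
}

mount.nfs already does something along these lines when a name resolves to multiple addresses; the manila consumer could do the equivalent with an explicit list of servers instead of a DNS lookup.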
> > To be clear, there is _no_ requirement that a client connect to a particular host in the cluster for the initial mount. The requirement is that it must connect back to the same host in order to _reclaim_ state if the NFS server goes down. If the client goes down, then its state is toast anyway and we don't care which host it connects to.

> Understood... this is why we settled on this design (with static hashing in the case where multiple backends are present).

> > I'm still not sold on the value of an ingress controller here. True, it allows you to have one "true" IP address for the NFS server, but that's really only useful at mount time. Meanwhile, it's adding an extra network hop and point of failure.

> Mount time is easy--we could provide the most recent ganesha's IP to clients at mount time. The problem is how to handle a failure when the ganesha has to be restarted on some other host. To make clients not break (force unmount + remount + lose all state) we have to restart a new ganesha on the same IP, which must be virtual in the case that a host may be down.

The question here is whether the backend servers that are being resurrected are bringing their old IP addresses with them. If, on failover, you're resurrecting them with a completely new (private) address and then redirecting the ingress controller to use that instead of the old one, then I suppose that would work too. That seems like an awfully complicated way to go about this though -- you might as well just float the actual public IP address to the right host instead. If, on the other hand, you are resurrecting them with the same private address every time, then your ingress controller is really just pointless indirection.

> The only optional part of what ingress does now is the static hashing. That was done because (1) we wanted to have a proxy to enable flexible placement of the ganeshas independent of the virtual IP placement, and (2) to enable load to be distributed. Neither of those is a strict requirement, but it fit together easily enough for NFS, and load distribution *was* a strong requirement for RGW, which the ingress service also covers. (Even if it's more complex than necessary for just NFS, it's less code overall than having the complex thing for RGW and a separate simpler thing for NFS only.)

> But having a virtual IP and ganesha failover *is* a requirement for anything resembling high availability. Round-robin DNS doesn't give you that; if a host fails those clients are screwed until the host is rebooted or replaced.

They're screwed until the host is rebooted or replaced anyway. The ingress controller doesn't change that. All it gives you is the ability to hand out a single IP address and split up the hosts according to it.
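To spell out what I mean by "split up the hosts" (a toy sketch only -- this is not haproxy's actual source-hashing implementation, and every address below is invented): each client maps deterministically to one backend, so when that backend dies, exactly the clients that hash to it are stuck until a replacement with the same rank comes back:

/*
 * Toy illustration of static hashing: map each client IP to one of a
 * fixed set of backend ganesha addresses.  Not haproxy code; all of
 * the addresses are made up for the example.
 */
#include <stdio.h>

static const char *backends[] = {
    "192.168.0.11",    /* ganesha.0 */
    "192.168.0.12",    /* ganesha.1 */
    "192.168.0.13",    /* ganesha.2 */
};

static const char *backend_for(const char *client_ip)
{
    unsigned long h = 5381;
    const char *p;

    /* djb2 hash over the client address string */
    for (p = client_ip; *p; p++)
        h = h * 33 + (unsigned char)*p;

    return backends[h % (sizeof(backends) / sizeof(backends[0]))];
}

int main(void)
{
    const char *clients[] = { "10.0.0.5", "10.0.0.6", "10.0.0.7" };
    int i;

    /* every run maps a given client to the same backend */
    for (i = 0; i < 3; i++)
        printf("%s -> %s\n", clients[i], backend_for(clients[i]));
    return 0;
}

That determinism is what lets a reclaiming client land back on the same rank, but it's also why losing a backend strands that backend's share of the clients instead of spreading them across the survivors.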
> > > What the cephadm ingress service does now is essentially identical to what kubernetes does, except that the update interval isn't low enough (yet) for lock reclaim to work. You get a known, stable IP, and NFS service can tolerate any cluster host failing.

> > Yeah, we need to improve the update interval there regardless of whether there is an ingress controller or not. 10 mins is a _very_ long time, and that violates a major assumption in rados_cluster.

> What is the rados_cluster assumption?

5 minutes. We do this when setting up the export:

/*
 * Set long timeout for the session to ensure that MDS doesn't lose
 * state before server can come back and do recovery.
 */
ceph_set_session_timeout(export->cmount, 300);

But...that's just us trying to be gracious in case there are major problems getting things going again. You really want it back up much sooner, preferably within seconds of one going down. There may be stalls when trying to do stateful operations until the dead node is back.

> > > Ingress *also* uses haproxy to distribute load, but that probably isn't necessary for most users. If you don't need scale-out then a single backend NFS daemon is sufficient and haproxy is doing nothing but proxying traffic from the host with the virtual IP to the host where the daemon is running. This is really an implementation detail; we can swap out the implementation to do something else if we decide there are better tools to use...

> > > As far as the user is concerned, ingress is "on" (and you provide the virtual IP) or it's not. We can modify the implementation in the future to do whatever we like.

> > Fair enough. I still think the value here doesn't justify the added complexity, but if you feel you'll be able to migrate people off of this scheme later, then so be it.

> If it were only for NFS I'd agree, but ingress also covers RGW, which does require haproxy, and I expect we'll use it for SMB as well.

> Even so, it may still make sense to support a simpler mode with a single backend instance and virtual IP only, since that'll reduce the number of hops when scale-out isn't needed... would this cover all of the bases from your perspective?

The ideal thing is to just give the different ganesha heads different public IP addresses. Being able to live-balance (and shrink!) the cluster seems like something that would be valuable. You won't be able to do that with an ingress controller in the way.

--
Jeff Layton <jlayton@xxxxxxxxxx>

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx