On 29/07/21 15:32 -0400, Tom Barron wrote:
On 29/07/21 14:45 -0400, Jeff Layton wrote:
On Thu, 2021-07-29 at 14:25 -0400, Tom Barron wrote:
On 29/07/21 13:28 -0400, Jeff Layton wrote:
On Thu, 2021-07-29 at 08:49 -0400, Tom Barron wrote:
> On 29/07/21 08:18 -0400, Jeff Layton wrote:
> > On Mon, 2021-07-26 at 13:55 -0500, Sage Weil wrote:
> > > On Mon, Jul 26, 2021 at 1:26 PM Tom Barron <tbarron@xxxxxxxxxx> wrote:
> > > > On 26/07/21 09:17 -0500, Sage Weil wrote:
> > > > > The design is for cephadm to resurrect the failed ganesha rank on
> > > > > another host, just like k8s does. The current gap in the
> > > > > implementation is that the cephadm refresh is pretty slow (~10m) which
> > > > > means that probably won't happen before the NFS state times out. We
> > > > > are working on improving the refresh/responsiveness right now, but
> > > > > you're right that the current code isn't yet complete.
> > > > >
> > > > > I think a PR that updates the docs to note that the ingress mode isn't
> > > > > yet complete, and also adds a section that documents how to do
> > > > > round-robin DNS would be the best immediate step?
> > > >
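[For reference, "round-robin DNS" here just means publishing a single name with multiple A records, one per ganesha host, so that resolvers rotate among them. A hypothetical zone fragment (names and addresses invented) might look like:

```
; one name, multiple A records -- resolvers rotate among them
; short TTL so clients re-resolve soon after cephadm redeploys a daemon
nfs.example.com.  60  IN  A  192.0.2.11
nfs.example.com.  60  IN  A  192.0.2.12
nfs.example.com.  60  IN  A  192.0.2.13
```

Note this only helps clients that resolve the name; as discussed below, it does nothing for guests that are handed raw IPs.]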
> > > > OK, good to know that ganesha will be resurrected on another node (as
> > > > I had been thinking earlier) and that the refresh delay is being
> > > > worked on.
> > > >
> > > > Before Jeff's note I had also been assuming that the ingress isn't
> > > > just another single point of failure but behaves more like a k8s
> > > > ingress. Is that correct or not? FWIW, OpenStack wants to work with
> > > > IPs rather than DNS/round robin so this does matter.
> > >
> > > Yeah, ingress is meant to fill a similar role as k8s ingress, where
> > > that role is roughly "whatever magic is necessary to make traffic
> > > distributed and highly-available". Currently we use keepalived and
> > > haproxy, although that implementation could conceivably be switched
> > > around in the future. With cephadm, the endpoint is a single
> > > user-specified virtual IP.
> > >
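[For context, a cephadm ingress deployment is driven by a service spec applied with `ceph orch apply -i <file>`. A sketch of the spec shape, per the Pacific-era docs -- the service_id, ports, and addresses below are hypothetical values, not a tested configuration:

```yaml
service_type: ingress
service_id: nfs.foo            # hypothetical; matches the backend below
placement:
  count: 2                     # keepalived+haproxy pairs for HA
spec:
  backend_service: nfs.foo     # the nfs (ganesha) service to front
  frontend_port: 2049          # port clients mount against
  monitor_port: 9049           # haproxy status endpoint
  virtual_ip: 192.0.2.100/24   # the single user-specified VIP
```

]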
> > > We haven't implemented the orchestrator+k8s glue yet to control k8s
> > > ingress services, but my (limited) understanding is that there is a
> > > broad range of k8s ingress implementations that may have a slightly
> > > different model (e.g., ingress using AWS services may dynamically
> > > allocate an IP and/or DNS instead of asking the user to provide one).
> > >
> > > To make NFS work using round-robin DNS, the user would need to extract
> > > the list of IPs for ganeshas from cephadm (e.g., examine 'ceph orch ps
> > > --daemon-type nfs --format json'), probably on a periodic basis in
> > > case failures or configuration changes lead cephadm to redeploy
> > > ganesha daemons elsewhere in the cluster. Working on a documentation
> > > patch to describe this approach.
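[A minimal sketch of the extraction step described above. The field names ("hostname", "status_desc") are assumptions about the `ceph orch ps --format json` schema -- verify them against your cephadm version before scripting on top of this:

```python
import json

def ganesha_hosts(orch_ps_json: str) -> list:
    """Return the sorted set of hosts with a running ganesha daemon,
    given the JSON emitted by:
        ceph orch ps --daemon-type nfs --format json
    Field names here are assumed, not guaranteed."""
    daemons = json.loads(orch_ps_json)
    return sorted({d["hostname"] for d in daemons
                   if d.get("status_desc") == "running"})

# Example with a hand-made payload (NOT real cephadm output):
sample = json.dumps([
    {"daemon_type": "nfs", "daemon_id": "foo.0.0.host1",
     "hostname": "host1", "status_desc": "running"},
    {"daemon_type": "nfs", "daemon_id": "foo.1.0.host2",
     "hostname": "host2", "status_desc": "running"},
])
print(ganesha_hosts(sample))  # -> ['host1', 'host2']
```

Run periodically (as suggested above) and diffed against the published A records, this is enough to keep a round-robin DNS zone in sync with redeployments.]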
> > >
> >
> > When you add or remove a node from the cluster, you're adding or
> > removing an address as well. Updating DNS at that time would just be
> > another step. I'm not sure that requires any special scripting as these
> > should be fairly rare events, but I'm less versed in how cephadm does
> > these things.
> >
> > To be clear, my main concern with using an ingress controller is that it
> > doesn't seem to give you any real benefit. It makes sense for something
> > like a webserver where you can just redirect the client to another node
> > if one of the backend nodes goes down.
> >
> > With the current rados_cluster NFS clustering, you can't really do that
> > so it doesn't really give you anything wrt redundancy. All it does is
> > give you a single IP address to contact.
>
> And that's exactly what we need for the OpenStack use case. Clients
> are guest VMs, running who knows what. We cannot assume or require
> anything about how they resolve hostnames to addresses.
>
> -- Tom
>
How are these IP addresses supplied to the tenants in an OpenStack
cluster? I guess this is using Cinder?
No, Manila (Shared File Service).
There are APIs to expose "export locations" to users. For NFS these
are of the form <ip>:<path> .
If guests can't rely on DNS resolution, then perhaps Cinder could be
taught to hand the guests a random address from a set? Or even better,
hand them more than one address. At least that way they could decide to
pick a different host for their initial mount if there was a problem
contacting the first one they try.
We already allow for handing out more than one export location for a
share from Manila. The problem at hand though is that we need fault
tolerance on whatever IP a client uses to mount a share -- regardless
of whether there was only one or several export locations to choose
from when the original mount was done. The issue is not finding an
export location that works. It is, rather, providing an export
location that will continue to work when a ganesha node goes down. By
"continue to work", we expect e.g. that client mounts persist and
writes with hard mounts will block but eventually complete when
service has been restored.
No. The ingress controller cannot give you that _at_all_. It does
nothing for fault tolerance in this configuration because the clients
are being distributed among backend ganesha servers according to a hash
of the client's IP address.
If one of the backend ganesha heads goes down then any client of it will
have to wait for it to be resurrected, with the same backend IP address
-- period.
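[The pinning Jeff describes can be sketched in a few lines. This is illustrative only -- haproxy's actual source-hash implementation differs, and the backend names are invented -- but it shows why the backend choice is a pure function of the client address:

```python
import zlib

def pick_backend(client_ip: str, backends: list) -> str:
    """Toy source-IP hashing: the chosen backend depends only on the
    client address, so a given client is always pinned to one head.
    (crc32 stands in for haproxy's real hash; illustrative only.)"""
    return backends[zlib.crc32(client_ip.encode()) % len(backends)]

backends = ["ganesha-a", "ganesha-b", "ganesha-c"]  # hypothetical names
choice = pick_backend("198.51.100.7", backends)
# Same client, same head, every time:
assert all(pick_backend("198.51.100.7", backends) == choice
           for _ in range(5))
# And if a head is removed from the pool, the modulus changes and many
# unrelated clients get remapped to heads holding none of their state.
```

]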
We're willing to wait some time for the backend ganesha head to be
resurrected if the ingress (using hash of client's IP address) will
still direct the client to a backend ganesha head with the same
backend IP address and relevant state after the resurrection.
Worse, clients that are trying to do an initial mount will need to wait
as well because the decision of which server they mount is entirely up to
the ingress controller's hashing algorithm.
Yes. But in general our VM clients can't be assumed to round robin
among available servers (or to use a round-robin DNS service, as
discussed earlier). I guess the only way to avoid failed mounts in
this circumstance would be to stick with active/failover with a single
IP as we have now. But failed mounts during a ganesha service node
failure are less consequential for us than failed I/O on an existing
mount.
It really does nothing of value here that I can see.
I am *not* discounting your arguments, just trying to match potential
solutions with the OpenStack use case for Ganesha. To that end, I
need to check whether there is anything about them that is cephadm
specific. What I mean is, if there were the same kind of
active-active (load-sharing) Ganesha service with k8s resurrection of
failed ganesha nodes and a k8s ingress with a stable IP in front of
multiple Ganesha backend servers, would your arguments against the
value of the ingress service still apply?
A couple of years ago (before cephadm) we were all (I think) imagining
that your work in combination with k8s life cycle management of the
ganesha daemons and a k8s ingress might yield a decent solution for
the OpenStack use case.
And I should add that I *do* understand your point that any fault
tolerance is completely the result of resurrecting the failed ganesha
server node in time, not of the ingress. (And perhaps, that having an
ingress controller in front of multiple "active-active" ganesha back
ends may foster the misunderstanding that these are active-active in
the sense that any back end can take over another's work.)
The value of the ingress for OpenStack is that it provides a stable
service IP address that can be chosen by OpenStack when launching a
Ganesha cluster. We'd like to be able to launch an NFS cluster per
OpenStack tenant "at" an address on the tenant's private neutron
network, the way we can with proprietary NFS back-end solutions.
We have that degree of fault tolerance today with our (complex,
active-standby) pacemaker-corosync setup for Ganesha. Mounts use
export locations with Virtual IPs. When a node with a Ganesha server
behind the VIP dies, another takes its place. Our hope is to get
OpenStack out of the business of managing HA for Ganesha but still
maintain a similar level of fault tolerance because the Ceph
orchestrator will resurrect the downed Ganesha service -- in time,
with the right state, and connected by the ingress service to the
right preexisting mounts.
To reiterate, my main concern is that relying on an ingress controller
limits our options for improving the ability to scale in the future,
since it pretty much means you can't use NFSv4 migration at all.
> >
> > That sounds convenient, but eventually, I'd like to wire up the ability
> > to use NFSv4 fs_locations to redirect clients to other cluster nodes so
> > we can do things like rebalance the cluster or live-shrink it. The
> > ingress controller will just be getting in the way at that point as
> > you're subject to its hashing algorithm and that can't be changed on the
> > fly.
> >
> > --
> > Jeff Layton <jlayton@xxxxxxxxxx>
> >
>
--
Jeff Layton <jlayton@xxxxxxxxxx>
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx