Re: cephadm docs on HA NFS

On 29/07/21 13:28 -0400, Jeff Layton wrote:
On Thu, 2021-07-29 at 08:49 -0400, Tom Barron wrote:
On 29/07/21 08:18 -0400, Jeff Layton wrote:
> On Mon, 2021-07-26 at 13:55 -0500, Sage Weil wrote:
> > On Mon, Jul 26, 2021 at 1:26 PM Tom Barron <tbarron@xxxxxxxxxx> wrote:
> > > On 26/07/21 09:17 -0500, Sage Weil wrote:
> > > > The design is for cephadm to resurrect the failed ganesha rank on
> > > > another host, just like k8s does.  The current gap in the
> > > > implementation is that the cephadm refresh is pretty slow (~10m) which
> > > > means that probably won't happen before the NFS state times out.  We
> > > > are working on improving the refresh/responsiveness right now, but
> > > > you're right that the current code isn't yet complete.
> > > >
> > > > I think a PR that updates the docs to note that the ingress mode isn't
> > > > yet complete, and also adds a section that documents how to do
> > > > round-robin DNS would be the best immediate step?
> > >
> > > OK, good to know that ganesha will be resurrected on another node (as
> > > I had been thinking earlier) and that the refresh delay is being
> > > worked on.
> > >
> > > Before Jeff's note I had also been assuming that the ingress isn't
> > > just another single point of failure but behaves more like a k8s
> > > ingress.  Is that correct or not?  FWIW, OpenStack wants to work with
> > > IPs rather than DNS/round robin so this does matter.
> >
> > Yeah, ingress is meant to fill a similar role as k8s ingress, where
> > that role is roughly "whatever magic is necessary to make traffic
> > distributed and highly-available".  Currently we use keepalived and
> > haproxy, although that implementation could conceivably be switched
> > around in the future.  With cephadm, the endpoint is a single
> > user-specified virtual IP.
> >
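> > For reference, a cephadm ingress specification along these lines
> > might look like the following sketch. The service id, placement
> > count, virtual IP, and ports are placeholders, not values from this
> > thread, and the exact spec fields may vary by Ceph release:
> >
> >     service_type: ingress
> >     service_id: nfs.mynfs          # placeholder service id
> >     placement:
> >       count: 2                     # haproxy/keepalived instances
> >     spec:
> >       backend_service: nfs.mynfs   # the nfs service to front
> >       frontend_port: 2049          # port clients mount against
> >       monitor_port: 9000           # haproxy stats/health port
> >       virtual_ip: 10.0.0.100/24    # the user-specified VIP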
> > We haven't implemented the orchestrator+k8s glue yet to control k8s
> > ingress services, but my (limited) understanding is that there is a
> > broad range of k8s ingress implementations that may have a slightly
> > different model (e.g., ingress using AWS services may dynamically
> > allocate an IP and/or DNS instead of asking the user to provide one).
> >
> > To make NFS work using round-robin DNS, the user would need to extract
> > the list of IPs for ganeshas from cephadm (e.g., examine 'ceph orch ps
> > --daemon-type nfs --format json'), probably on a periodic basis in
> > case failures or configuration changes lead cephadm to redeploy
> > ganesha daemons elsewhere in the cluster.  Working on a documentation
> > patch to describe this approach.
> >
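A minimal sketch of that extraction step, assuming the `daemon_type`, `hostname`, and `status_desc` fields in the `ceph orch ps --format json` output (field names can differ between releases, so treat them as assumptions to verify against your cluster):

```python
import json

def nfs_daemon_hosts(orch_ps_json: str) -> list[str]:
    """Return hostnames of running NFS (ganesha) daemons, parsed from
    `ceph orch ps --daemon-type nfs --format json` output.

    Assumes each daemon record carries "daemon_type", "hostname", and
    "status_desc" keys; adjust for your cephadm version if they differ.
    """
    daemons = json.loads(orch_ps_json)
    return sorted(
        d["hostname"]
        for d in daemons
        if d.get("daemon_type") == "nfs"
        and d.get("status_desc") == "running"
    )

# Hypothetical sample output, trimmed to just the fields used above:
sample = '''[
  {"daemon_type": "nfs", "hostname": "host1", "status_desc": "running"},
  {"daemon_type": "nfs", "hostname": "host2", "status_desc": "running"},
  {"daemon_type": "nfs", "hostname": "host3", "status_desc": "error"}
]'''
print(nfs_daemon_hosts(sample))  # ['host1', 'host2']
```

A periodic job could feed the resulting hostnames (resolved to IPs) into round-robin A records, re-running whenever cephadm redeploys ganesha daemons.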
>
> When you add or remove a node from the cluster, you're adding or
> removing an address as well. Updating DNS at that time would just be
> another step. I'm not sure that requires any special scripting as these
> should be fairly rare events, but I'm less versed in how cephadm does
> these things.
>
> To be clear, my main concern with using an ingress controller is that it
> doesn't seem to give you any real benefit. It makes sense for something
> like a webserver where you can just redirect the client to another node
> if one of the backend nodes goes down.
>
> With the current rados_cluster NFS clustering, you can't really do that
> so it doesn't really give you anything wrt redundancy. All it does is
> give you a single IP address to contact.

And that's exactly what we need for the OpenStack use case.  Clients
are guest VMs, running who knows what.  We cannot assume or require
anything about how they resolve hostnames to addresses.

-- Tom


How are these IP addresses supplied to the tenants in an openstack
cluster? I guess this is using cinder? 

No, Manila (Shared File Service).

There are APIs to expose "export locations" to users. For NFS these are of the form <ip>:<path>.
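For illustration, a guest consumes such an export location directly by IP, with no name resolution involved (the address and path below are placeholders, not values from this thread):

```shell
# Hypothetical export location returned by Manila for an NFS share:
#   10.0.0.100:/volumes/_nogroup/share-uuid
mount -t nfs 10.0.0.100:/volumes/_nogroup/share-uuid /mnt/share
```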


If guests can't rely on DNS resolution, then perhaps cinder could be
taught to hand the guests a random address from a set? Or even better,
hand them more than one address. At least that way they could decide to
pick a different host for their initial mount if there was a problem
contacting the first one they try.

We already allow for handing out more than one export location for a share from Manila. The problem at hand though is that we need fault tolerance on whatever IP a client uses to mount a share -- regardless of whether there was only one or several export locations to choose from when the original mount was done. The issue is not finding an export location that works. It is, rather, providing an export location that will continue to work when a ganesha node goes down. By "continue to work", we expect e.g. that client mounts persist and writes with hard mounts will block but eventually complete when service has been restored.

We have that degree of fault tolerance today with our (complex, active-standby) pacemaker-corosync setup for Ganesha. Mounts use export locations with Virtual IPs. When a node with a Ganesha server behind the VIP dies, another takes its place. Our hope is to get OpenStack out of the business of managing HA for Ganesha but still maintain a similar level of fault tolerance because the Ceph orchestrator will resurrect the downed Ganesha service -- in time, with the right state, and connected by the ingress service to the right preexisting mounts.


To reiterate, my main concern is that relying on an ingress controller
limits our options for improving the ability to scale in the future,
since it pretty much means you can't use NFSv4 migration at all.


>
> That sounds convenient, but eventually, I'd like to wire up the ability
> to use NFSv4 fs_locations to redirect clients to other cluster nodes so
> we can do things like rebalance the cluster or live-shrink it. The
> ingress controller will just be getting in the way at that point as
> you're subject to its hashing algorithm and that can't be changed on the
> fly.
>
> --
> Jeff Layton <jlayton@xxxxxxxxxx>
>


--
Jeff Layton <jlayton@xxxxxxxxxx>


_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx



