On 29/07/21 08:18 -0400, Jeff Layton wrote:
On Mon, 2021-07-26 at 13:55 -0500, Sage Weil wrote:
On Mon, Jul 26, 2021 at 1:26 PM Tom Barron <tbarron@xxxxxxxxxx> wrote:
> On 26/07/21 09:17 -0500, Sage Weil wrote:
> > The design is for cephadm to resurrect the failed ganesha rank on
> > another host, just like k8s does. The current gap in the
> > implementation is that the cephadm refresh is pretty slow (~10m) which
> > means the failover probably won't happen before the NFS state times out. We
> > are working on improving the refresh/responsiveness right now, but
> > you're right that the current code isn't yet complete.
> >
> > I think a PR that updates the docs to note that the ingress mode isn't
> > yet complete, and also adds a section that documents how to do
> > round-robin DNS would be the best immediate step?
>
> OK, good to know that ganesha will be resurrected on another node (as
> I had been thinking earlier) and that the refresh delay is being
> worked on.
>
> Before Jeff's note I had also been assuming that the ingress isn't
> just another single point of failure but behaves more like a k8s
> ingress. Is that correct or not? FWIW, OpenStack wants to work with
> IPs rather than DNS/round robin so this does matter.
Yeah, ingress is meant to fill a role similar to k8s ingress, where
that role is roughly "whatever magic is necessary to make traffic
distributed and highly-available". Currently we use keepalived and
haproxy, although that implementation could conceivably be switched
around in the future. With cephadm, the endpoint is a single
user-specified virtual IP.
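For reference, the cephadm side of this currently looks roughly like the
following ingress service spec (field names per the cephadm ingress docs;
the service id, ports, and addresses here are hypothetical placeholders):

```yaml
service_type: ingress
service_id: nfs.mynfs          # ties the ingress to the nfs.mynfs service
placement:
  count: 2                     # two haproxy+keepalived instances for HA
spec:
  backend_service: nfs.mynfs   # the ganesha daemons to load-balance
  frontend_port: 2049          # port clients mount against, on the VIP
  monitor_port: 9049           # haproxy stats/health endpoint
  virtual_ip: 10.0.0.100/24    # the single user-specified virtual IP
```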
We haven't implemented the orchestrator+k8s glue yet to control k8s
ingress services, but my (limited) understanding is that there is a
broad range of k8s ingress implementations that may have a slightly
different model (e.g., ingress using AWS services may dynamically
allocate an IP and/or DNS instead of asking the user to provide one).
To make NFS work using round-robin DNS, the user would need to extract
the list of IPs for ganeshas from cephadm (e.g., examine 'ceph orch ps
--daemon-type nfs --format json'), probably on a periodic basis in
case failures or configuration changes lead cephadm to redeploy
ganesha daemons elsewhere in the cluster. I'm working on a
documentation patch to describe this approach.
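As a rough sketch of that periodic step, the JSON from 'ceph orch ps
--daemon-type nfs --format json' could be filtered down to the hosts with
running ganesha daemons, which then become the round-robin DNS A records.
The sample data and field names below (daemon_type, hostname, status_desc)
are assumptions about the orch ps output shape; verify them against your
Ceph release before relying on this:

```python
import json

# Sample output shaped like `ceph orch ps --daemon-type nfs --format json`.
# NOTE: the field names are assumptions, not a guaranteed schema.
sample = """
[
  {"daemon_type": "nfs", "daemon_id": "mynfs.0.host1", "hostname": "host1", "status_desc": "running"},
  {"daemon_type": "nfs", "daemon_id": "mynfs.1.host2", "hostname": "host2", "status_desc": "running"},
  {"daemon_type": "nfs", "daemon_id": "mynfs.2.host3", "hostname": "host3", "status_desc": "error"}
]
"""

def running_nfs_hosts(orch_ps_json: str) -> list[str]:
    """Return hostnames of currently running ganesha daemons, suitable
    for publishing as round-robin DNS A records."""
    daemons = json.loads(orch_ps_json)
    return sorted(
        d["hostname"]
        for d in daemons
        if d.get("daemon_type") == "nfs" and d.get("status_desc") == "running"
    )

print(running_nfs_hosts(sample))  # hosts to publish as A records
```

A cron job could diff this list against the current DNS zone and update
records whenever cephadm has moved a daemon.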
When you add or remove a node from the cluster, you're adding or
removing an address as well. Updating DNS at that time would just be
another step. I'm not sure that requires any special scripting as these
should be fairly rare events, but I'm less versed in how cephadm does
these things.
To be clear, my main concern with using an ingress controller is that it
doesn't seem to give you any real benefit. It makes sense for something
like a webserver where you can just redirect the client to another node
if one of the backend nodes goes down.
With the current rados_cluster NFS clustering you can't really do that,
so it doesn't buy you anything in terms of redundancy. All it does is
give you a single IP address to contact.
And that's exactly what we need for the OpenStack use case. Clients
are guest VMs, running who knows what. We cannot assume or require
anything about how they resolve hostnames to addresses.
-- Tom
That sounds convenient, but eventually, I'd like to wire up the ability
to use NFSv4 fs_locations to redirect clients to other cluster nodes so
we can do things like rebalance the cluster or live-shrink it. The
ingress controller will just get in the way at that point, since
you're subject to its hashing algorithm and that can't be changed on
the fly.
--
Jeff Layton <jlayton@xxxxxxxxxx>