Re: cephadm docs on HA NFS

Jeff Layton <jlayton@xxxxxxxxxx> · Thu, 29 Jul 2021 14:45:38 -0400

On Thu, 2021-07-29 at 14:25 -0400, Tom Barron wrote:
> On 29/07/21 13:28 -0400, Jeff Layton wrote:
> > On Thu, 2021-07-29 at 08:49 -0400, Tom Barron wrote:
> > > On 29/07/21 08:18 -0400, Jeff Layton wrote:
> > > > On Mon, 2021-07-26 at 13:55 -0500, Sage Weil wrote:
> > > > > On Mon, Jul 26, 2021 at 1:26 PM Tom Barron <tbarron@xxxxxxxxxx> wrote:
> > > > > > On 26/07/21 09:17 -0500, Sage Weil wrote:
> > > > > > > The design is for cephadm to resurrect the failed ganesha rank on
> > > > > > > another host, just like k8s does.  The current gap in the
> > > > > > > implementation is that the cephadm refresh is pretty slow (~10m) which
> > > > > > > means that probably won't happen before the NFS state times out.  We
> > > > > > > are working on improving the refresh/responsiveness right now, but
> > > > > > > you're right that the current code isn't yet complete.
> > > > > > > 
> > > > > > > I think a PR that updates the docs to note that the ingress mode isn't
> > > > > > > yet complete, and also adds a section that documents how to do
> > > > > > > round-robin DNS would be the best immediate step?
> > > > > > 
> > > > > > OK, good to know that ganesha will be resurrected on another node (as
> > > > > > I had been thinking earlier) and that the refresh delay is being
> > > > > > worked on.
> > > > > > 
> > > > > > Before Jeff's note I had also been assuming that the ingress isn't
> > > > > > just another single point of failure but behaves more like a k8s
> > > > > > ingress.  Is that correct or not?  FWIW, OpenStack wants to work with
> > > > > > IPs rather than DNS/round robin so this does matter.
> > > > > 
> > > > > Yeah, ingress is meant to fill a similar role as k8s ingress, where
> > > > > that role is roughly "whatever magic is necessary to make traffic
> > > > > distributed and highly-available".  Currently we use keepalived and
> > > > > haproxy, although that implementation could conceivably be switched
> > > > > around in the future.  With cephadm, the endpoint is a single
> > > > > user-specified virtual IP.
> > > > > 
> > > > > We haven't implemented the orchestrator+k8s glue yet to control k8s
> > > > > ingress services, but my (limited) understanding is that there is a
> > > > > broad range of k8s ingress implementations that may have a slightly
> > > > > different model (e.g., ingress using AWS services may dynamically
> > > > > allocate an IP and/or DNS instead of asking the user to provide one).
> > > > > 
> > > > > To make NFS work using round-robin DNS, the user would need to extract
> > > > > the list of IPs for ganeshas from cephadm (e.g., examine 'ceph orch ps
> > > > > --daemon-type nfs --format json'), probably on a periodic basis in
> > > > > case failures or configuration changes lead cephadm to redeploy
> > > > > ganesha daemons elsewhere in the cluster.  Working on a documentation
> > > > > patch to describe this approach.
> > > > > 
> > > > 
> > > > When you add or remove a node from the cluster, you're adding or
> > > > removing an address as well. Updating DNS at that time would just be
> > > > another step. I'm not sure that requires any special scripting as these
> > > > should be fairly rare events, but I'm less versed in how cephadm does
> > > > these things.
> > > > 
> > > > To be clear, my main concern with using an ingress controller is that it
> > > > doesn't seem to give you any real benefit. It makes sense for something
> > > > like a webserver where you can just redirect the client to another node
> > > > if one of the backend nodes goes down.
> > > > 
> > > > With the current rados_cluster NFS clustering, you can't really do that
> > > > so it doesn't really give you anything wrt redundancy. All it does is
> > > > give you a single IP address to contact.
> > > 
> > > And that's exactly what we need for the OpenStack use case.  Clients
> > > are guest VMs, running who knows what.  We cannot assume or require
> > > anything about how they resolve hostnames to addresses.
> > > 
> > > -- Tom
> > > 
> > 
> > How are these IP addresses supplied to the tenants in an openstack
> > cluster? I guess this is using cinder? 
> 
> No, Manila (Shared File Service).
> 
> There are APIs to expose "export locations" to users.  For NFS these 
> are of the form <ip>:<path> .
> 
> > 
> > If guests can't rely on DNS resolution, then perhaps cinder could be
> > taught to hand the guests a random address from a set? Or even better,
> > hand them more than one address. At least that way they could decide to
> > pick a different host for their initial mount if there was a problem
> > contacting the first one they try.
> 
> We already allow for handing out more than one export location for a 
> share from Manila.  The problem at hand though is that we need fault 
> tolerance on whatever IP a client uses to mount a share -- regardless 
> of whether there was only one or several export locations to choose 
> from when the original mount was done.  The issue is not finding an 
> export location that works.  It is, rather, providing an export 
> location that will continue to work when a ganesha node goes down.  By 
> "continue to work", we expect e.g. that client mounts persist and 
> writes with hard mounts will block but eventually complete when 
> service has been restored.

No. The ingress controller cannot give you that _at_all_. It does
nothing for fault tolerance in this configuration because the clients
are being distributed among backend ganesha servers according to a hash
of the client's IP address.

If one of the backend ganesha heads goes down then any client of it will
have to wait for it to be resurrected, with the same backend IP address
-- period.

Worse, clients that are trying to do an initial mount will need to wait
as well, because the decision of which server they mount is entirely up
to the ingress controller's hashing algorithm.

It really does nothing of value here that I can see.
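
To illustrate the hashing point, here is a minimal sketch of source-IP
pinning. It is not the ingress controller's actual algorithm: the
SHA-256 hash, the pick_backend() helper and the backend names are all
assumptions made up for illustration. The only property it is meant to
show is that a given client address always maps to the same backend.

```python
import hashlib
import ipaddress


def pick_backend(client_ip, backends):
    """Map a client IP to one backend by hashing the address.

    Illustrative stand-in only; a real load balancer uses its own hash.
    """
    if not backends:
        raise ValueError("no backends available")
    # Hash the packed address bytes so IPv4 and IPv6 are both handled.
    packed = ipaddress.ip_address(client_ip).packed
    digest = hashlib.sha256(packed).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


# Hypothetical ganesha ranks and client addresses (documentation range).
backends = ["ganesha.0", "ganesha.1", "ganesha.2"]
clients = ["192.0.2.10", "192.0.2.11", "192.0.2.12", "192.0.2.13"]

# The mapping is a pure function of the client address, so the client
# has no say in which backend it lands on.
assignment = {c: pick_backend(c, backends) for c in clients}
for client, backend in assignment.items():
    print(client, "->", backend)
```

Because the mapping depends only on the client address, a client whose
backend is down has nowhere else to go: retrying the mount through the
same virtual IP lands it on the same backend again.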

> 
> We have that degree of fault tolerance today with our (complex, 
> active-standby) pacemaker-corosync setup for Ganesha.  Mounts use 
> export locations with Virtual IPs.  When a node with a Ganesha server 
> behind the VIP dies, another takes its place.  Our hope is to get 
> OpenStack out of the business of managing HA for Ganesha but still 
> maintain a similar level of fault tolerance because the Ceph 
> orchestrator will resurrect the downed Ganesha service -- in time, 
> with the right state, and connected by the ingress service to the 
> right preexisting mounts.
> 
> > 
> > To reiterate, my main concern is that relying on an ingress controller
> > limits our options for improving the ability to scale in the future,
> > since it pretty much means you can't use NFSv4 migration at all.
> > 
> > 
> > > > 
> > > > That sounds convenient, but eventually, I'd like to wire up the ability
> > > > to use NFSv4 fs_locations to redirect clients to other cluster nodes so
> > > > we can do things like rebalance the cluster or live-shrink it. The
> > > > ingress controller will just be getting in the way at that point as
> > > > you're subject to its hashing algorithm and that can't be changed on the
> > > > fly.
> > > > 
> > > > --
> > > > Jeff Layton <jlayton@xxxxxxxxxx>
> > > > 
> > > 
> > 
> > -- 
> > Jeff Layton <jlayton@xxxxxxxxxx>
> > 
> 

-- 
Jeff Layton <jlayton@xxxxxxxxxx>
