Re: cephadm docs on HA NFS

On Mon, Jul 26, 2021 at 1:26 PM Tom Barron <tbarron@xxxxxxxxxx> wrote:
> On 26/07/21 09:17 -0500, Sage Weil wrote:
> >The design is for cephadm to resurrect the failed ganesha rank on
> >another host, just like k8s does.  The current gap in the
> >implementation is that the cephadm refresh is pretty slow (~10m) which
> >means that probably won't happen before the NFS state times out.  We
> >are working on improving the refresh/responsiveness right now, but
> >you're right that the current code isn't yet complete.
> >
> >I think a PR that updates the docs to note that the ingress mode isn't
> >yet complete, and also adds a section that documents how to do
> >round-robin DNS would be the best immediate step?
>
> OK, good to know that ganesha will be resurrected on another node (as
> I had been thinking earlier) and that the refresh delay is being
> worked on.
>
> Before Jeff's note I had also been assuming that the ingress isn't
> just another single point of failure but behaves more like a k8s
> ingress.  Is that correct or not?  FWIW, OpenStack wants to work with
> IPs rather than DNS/round robin so this does matter.

Yeah, ingress is meant to fill a role similar to a k8s ingress, where
that role is roughly "whatever magic is necessary to make traffic
distributed and highly available".  Currently we use keepalived and
haproxy, although that implementation could conceivably be swapped out
in the future.  With cephadm, the endpoint is a single user-specified
virtual IP.
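
As a concrete illustration (a minimal sketch only -- the spec fields
follow the cephadm NFS doc Jeff linked below, and the service name,
placement, and addresses here are placeholders), the virtual IP is
given in the ingress service spec that fronts an existing nfs service:

    # Hypothetical example: pair an ingress service (keepalived + haproxy)
    # with an existing cephadm nfs service named "foo".  The placement,
    # ports, and virtual_ip are placeholders -- adjust for your cluster.
    cat > nfs-ingress.yaml <<'EOF'
    service_type: ingress
    service_id: nfs.foo
    placement:
      count: 2
    spec:
      backend_service: nfs.foo    # the nfs service to front
      frontend_port: 2049         # port clients mount
      monitor_port: 9000          # haproxy status port
      virtual_ip: 10.0.0.100/24   # the single user-specified virtual IP
    EOF
    ceph orch apply -i nfs-ingress.yaml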

We haven't yet implemented the orchestrator+k8s glue to control k8s
ingress services, but my (limited) understanding is that there is a
broad range of k8s ingress implementations with slightly different
models (e.g., an ingress using AWS services may dynamically allocate
an IP and/or DNS name instead of asking the user to provide one).

To make NFS work using round-robin DNS, the user would need to extract
the list of IPs for the ganesha daemons from cephadm (e.g., examine
'ceph orch ps --daemon-type nfs --format json'), probably on a
periodic basis, in case failures or configuration changes lead cephadm
to redeploy ganesha daemons elsewhere in the cluster.  I'm working on
a documentation patch to describe this approach.
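
Roughly something like the following (a rough sketch of what the doc
patch will describe; it assumes the 'ceph orch ps' JSON includes a
'hostname' field per daemon and that those hostnames resolve to the
addresses you want to publish):

    # List the hosts currently running ganesha daemons and resolve them
    # to IPs for round-robin DNS A records.  Re-run periodically in case
    # cephadm redeploys ganeshas elsewhere in the cluster.
    ceph orch ps --daemon-type nfs --format json \
      | jq -r '.[].hostname' \
      | sort -u \
      | while read -r host; do
          getent hosts "$host"    # prints "<ip> <hostname>"
        done

The resulting IPs would then be published as multiple A records behind
a single name and refreshed whenever daemon placement changes.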

sage

>
> Thanks!
>
> -- Tom Barron
>
> >
> >
> >On Thu, Jul 22, 2021 at 9:32 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> >>
> >> I think we probably need to redo this bit of documentation:
> >>
> >>     https://docs.ceph.com/en/latest/cephadm/nfs/#high-availability-nfs
> >>
> >> I would just spin up a patch, but I think we might also just want to
> >> reconsider recommending an ingress controller at all.
> >>
> >> Some people seem to be taking this to mean that they can shoot down one
> >> of the nodes in the NFS server cluster, and the rest will just pick up
> >> the load. That's not at all how this works.
> >>
> >> If an NFS cluster node goes down, then it _must_ be resurrected in some
> >> fashion, period. Otherwise, the MDS will eventually (in 5 mins) time out
> >> the state it held and the NFS clients will not be able to reclaim their
> >> state.
> >>
> >> Given that, the bulleted list at the start of the doc above is wrong. We
> >> cannot do any sort of failover if there is a host failure. My assumption
> >> was that the orchestrator took care of starting up an NFS server
> >> elsewhere if the host it was running on went down. Is that not the case?
> >>
> >> In any case, I think we should reconsider recommending an ingress
> >> controller at all. It's really just another point of failure, and a lot
> >> of people seem to be misconstruing what guarantees that offers.
> >>
> >> Round-robin DNS would be a better option in this situation, and it
> >> wouldn't be as problematic if we want to support things like live
> >> shrinking the cluster in the future.
> >> --
> >> Jeff Layton <jlayton@xxxxxxxxxx>
> >>
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx


