On 26/07/21 13:55 -0500, Sage Weil wrote:
On Mon, Jul 26, 2021 at 1:26 PM Tom Barron <tbarron@xxxxxxxxxx> wrote:
On 26/07/21 09:17 -0500, Sage Weil wrote:
>The design is for cephadm to resurrect the failed ganesha rank on
>another host, just like k8s does. The current gap in the
>implementation is that the cephadm refresh is pretty slow (~10m), which
>means the restart probably won't happen before the NFS state times out. We
>are working on improving the refresh/responsiveness right now, but
>you're right that the current code isn't yet complete.
>
>I think a PR that updates the docs to note that the ingress mode isn't
>yet complete, and also adds a section that documents how to do
>round-robin DNS would be the best immediate step?
OK, good to know that ganesha will be resurrected on another node (as
I had been thinking earlier) and that the refresh delay is being
worked on.
Before Jeff's note I had also assumed that the ingress isn't just
another single point of failure but behaves more like a k8s ingress.
Is that correct? FWIW, OpenStack wants to work with IPs rather than
round-robin DNS, so this does matter.
Yeah, ingress is meant to fill a role similar to that of k8s ingress, where
that role is roughly "whatever magic is necessary to make traffic
distributed and highly-available". Currently we use keepalived and
haproxy, although that implementation could conceivably be switched
around in the future. With cephadm, the endpoint is a single
user-specified virtual IP.
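For reference, the ingress service is deployed from a spec roughly like
the one below (the service id, placement, ports, and virtual IP are
placeholders; check the cephadm docs for the exact fields supported by
your release):

```yaml
# Sketch of a cephadm ingress spec fronting an NFS service named "mynfs".
# All concrete values here are illustrative.
service_type: ingress
service_id: nfs.mynfs
placement:
  count: 2
spec:
  backend_service: nfs.mynfs   # the existing nfs service to load-balance
  frontend_port: 2049          # port clients mount against, on the VIP
  monitor_port: 9000           # haproxy status/health port
  virtual_ip: 10.0.0.123/24    # the single user-specified virtual IP
```

Applying that (e.g. `ceph orch apply -i ingress.yaml`) is what stands up
the keepalived + haproxy pair behind the virtual IP.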
We haven't implemented the orchestrator+k8s glue yet to control k8s
ingress services, but my (limited) understanding is that there is a
broad range of k8s ingress implementations that may have a slightly
different model (e.g., ingress using AWS services may dynamically
allocate an IP and/or DNS instead of asking the user to provide one).
To make NFS work using round-robin DNS, the user would need to extract
the list of IPs for ganeshas from cephadm (e.g., examine 'ceph orch ps
--daemon-type nfs --format json'), probably on a periodic basis in
case failures or configuration changes lead cephadm to redeploy
ganesha daemons elsewhere in the cluster. I'm working on a
documentation patch to describe this approach.
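As a sketch of that extraction step: the JSON from 'ceph orch ps' is a
list of daemon records, and you'd filter for running nfs daemons and
collect their hosts. The field names below are assumptions based on
typical cephadm output and may differ between releases, so verify
against a real cluster.

```python
import json

# Illustrative (assumed) output of:
#   ceph orch ps --daemon-type nfs --format json
# Hostnames, daemon ids, and field names are placeholders.
sample = """
[
  {"daemon_type": "nfs", "daemon_id": "foo.0.0.host1",
   "hostname": "host1", "status_desc": "running"},
  {"daemon_type": "nfs", "daemon_id": "foo.1.0.host2",
   "hostname": "host2", "status_desc": "running"}
]
"""

def ganesha_hosts(orch_ps_json):
    """Return the sorted hostnames currently running an nfs (ganesha) daemon."""
    daemons = json.loads(orch_ps_json)
    return sorted({d["hostname"] for d in daemons
                   if d.get("daemon_type") == "nfs"
                   and d.get("status_desc") == "running"})

print(ganesha_hosts(sample))  # -> ['host1', 'host2']
```

The hostnames would then be resolved to the IPs published in DNS, and
the whole check re-run periodically, since cephadm may redeploy a
ganesha daemon on a different host after a failure.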
sage
Thanks, Sage. From a consumer POV, any implementation that yields a
highly available [*] IP per NFS cluster is fine.
-- Tom
[*] maybe fault-tolerant, i.e. resurrection within NFS state timeouts,
equivalent to what can be done (reportedly) with rook-ceph.
Thanks!
-- Tom Barron
>
>
>On Thu, Jul 22, 2021 at 9:32 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>
>> I think we probably need to redo this bit of documentation:
>>
>> https://docs.ceph.com/en/latest/cephadm/nfs/#high-availability-nfs
>>
>> I would just spin up a patch, but I think we might also just want to
>> reconsider recommending an ingress controller at all.
>>
>> Some people seem to be taking this to mean that they can shoot down one
>> of the nodes in the NFS server cluster, and the rest will just pick up
>> the load. That's not at all how this works.
>>
>> If an NFS cluster node goes down, then it _must_ be resurrected in some
>> fashion, period. Otherwise, the MDS will eventually (in 5 mins) time out
>> the state it held and the NFS clients will not be able to reclaim their
>> state.
>>
>> Given that, the bulleted list at the start of the doc above is wrong. We
>> cannot do any sort of failover if there is a host failure. My assumption
>> was that the orchestrator took care of starting up an NFS server
>> elsewhere if the host it was running on went down. Is that not the case?
>>
>> In any case, I think we should reconsider recommending an ingress
>> controller at all. It's really just another point of failure, and a lot
>> of people seem to be misconstruing what guarantees it offers.
>>
>> Round-robin DNS would be a better option in this situation, and it
>> wouldn't be as problematic if we want to support things like live
>> shrinking the cluster in the future.
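[For concreteness, round-robin DNS here just means publishing one name
with multiple A records, one per ganesha host. The name, TTL, and
addresses below are placeholders; a short TTL helps when cephadm
redeploys a daemon to a new host:

; zone file fragment (illustrative)
nfs.example.com.  60  IN  A  10.0.0.11
nfs.example.com.  60  IN  A  10.0.0.12

Clients resolving nfs.example.com are then spread across the servers by
the resolver rather than by a load balancer in the data path.]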
>> --
>> Jeff Layton <jlayton@xxxxxxxxxx>
>>
>> _______________________________________________
>> Dev mailing list -- dev@xxxxxxx
>> To unsubscribe send an email to dev-leave@xxxxxxx