On Fri, 2021-07-30 at 09:55 -0500, Sage Weil wrote:
> I think we're talking past each other a bit here. Let me try to reset
> this thread a bit by summarizing the different options we've discussed:

I think you're right that I was misunderstanding. Thanks for the
clarification. Some comments inline below:

> 1. multiple ganeshas (existing nfs service)
>    - N ganesha daemons
>    - IPs are based on which machines they get scheduled on
>    - we could generate a list of IPs for clients to use at mount time
>    - if a host fails, any clients mounting that particular ganesha break
>    - currently implemented (base nfs service)
>
> 2. multiple ganeshas + round-robin DNS
>    - N ganesha daemons
>    - IPs are based on which machines they get scheduled on
>    - generate DNS with an A record for each ganesha
>    - once a client mounts a particular IP, it sticks to that IP
>    - if a host is down at mount time, clients will select a different
>      IP to mount
>    - if a host fails after mount, the client breaks
>    - can be implemented with the current base nfs service + some
>      scripting to update DNS
>
> 3. single ganesha + ingress
>    - 1 ganesha daemon
>    - M (probably 2 or 3) haproxy + keepalived hosts
>    - virtual IP on the proxy hosts
>    - clients mount a known, stable IP
>    - if a proxy host goes down, the virtual IP moves; no effect on
>      mounting or mounted clients
>    - if the ganesha host goes down, cephadm reschedules the ganesha
>      elsewhere and adjusts the proxy config (* this is slow currently,
>      will be fast Real Soon Now)
>    - currently implemented: nfs(count=1) + ingress
>
> 4. multiple ganeshas + ingress
>    - N ganesha daemons
>    - M (probably 2 or 3) haproxy + keepalived hosts
>    - virtual IP on the proxy hosts
>    - haproxy does static hashing
>    - clients mount a known, stable IP
>    - if a proxy host goes down, the virtual IP moves; no effect on
>      mounting or mounted clients
>    - if a ganesha host goes down, cephadm reschedules the ganesha
>      elsewhere and adjusts the proxy config (* this is slow currently,
>      will be fast Real Soon Now)
>    - if the ganesha count (N) is increased or decreased, some clients
>      get remapped to new ganeshas. I chose an haproxy hash function
>      that minimizes this, but basically you can't change N without
>      *some* disruption
>    - mostly implemented (except that cephadm update needs to be faster)
>
> 5. single ganesha + single virtual IP
>    - 1 ganesha daemon
>    - 1 virtual IP that follows the ganesha daemon
>    - on failure, cephadm would deploy ganesha elsewhere + move the
>      virtual IP
>    - not implemented
>
> 6. multiple ganeshas + multiple virtual IPs
>    - N ganesha daemons
>    - N virtual IPs
>    - requires ganesha changes to (1) make ganesha aware of peers and
>      (2) instruct clients to move around
>    - on failure, cephadm would deploy the failed ganesha elsewhere +
>      move that virtual IP
>    - not implemented (in cephadm or ganesha)
>
> So:
> - 1 is obviously not HA
> - 2 is better, but also not really usable since a failure breaks
>   mounted clients
> - 3 and 4 are what we implemented. This is partly because it's the
>   same ingress architecture as RGW, and partly because the overall
>   cephadm structure works well in terms of making 'ingress' a
>   separate, modular, optional service that layers on top of the base
>   NFS service. They (will, as soon as cephadm update frequency is
>   fixed) provide HA and scale-out.
> - 5 is simpler than 3/4, but more complex to implement in cephadm (no
>   shared code/architecture with RGW ingress) and does not provide any
>   scale-out capability.
> - 6 is what you'd like to move to. It would provide HA, scale-out,
>   and online scale up/down.

Ok, got it. So basically you're mainly using haproxy to implement some
poor-man's virtual IP handling. That obviously works, and you _could_
just set up multiple haproxy addresses to allow for scale-out.

As a side note, someone asking the other day about HA ganesha mentioned
using ucarp, which looks pretty simple:

    https://wiki.greentual.com/index.php/Ucarp

It may be worth considering that instead, but it may not give you much
if you need to deal with haproxy for RGW anyway. That said, floating a
VIP between machines is probably more efficient than proxying packets.
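(Just to sketch what I mean, purely as an illustration: the interface,
addresses, vhid, and password below are made-up placeholders and I
haven't tested this. Floating a single address with ucarp on each
candidate host looks roughly like:

    # run on every host that is allowed to own the VIP
    ucarp --interface=eth0 --srcip=192.0.2.11 --vhid=10 --pass=secret \
          --addr=192.0.2.100 \
          --upscript=/usr/local/bin/vip-up.sh \
          --downscript=/usr/local/bin/vip-down.sh

    # vip-up.sh would do something like:  ip addr add 192.0.2.100/24 dev eth0
    # vip-down.sh would do the opposite:  ip addr del 192.0.2.100/24 dev eth0

ucarp just elects a master over CARP and runs the up/down scripts to
add or remove the address, which is basically what keepalived does with
VRRP.)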
> > The ideal thing is to just give the different ganesha heads
> > different public IP addresses. Being able to live-balance (and
> > shrink!) the cluster seems like something that would be valuable.
> > You won't be able to do that with an ingress controller in the way.
>
> I think the ganesha changes you're talking about sound great, and 6
> sounds like the way to make it all come together.
>
> I disagree that 2 is a viable option, because a failure disrupts
> mounted clients. In the short term it might make sense since 3/4
> aren't yet complete, but we should keep in mind it requires additional
> plumbing/automation to generate the DNS records, whereas 3/4 have a
> single stable IP.
>
> I agree that 3/4 is more complicated than 5/6. The original pad that
> we used to make this decision is here:
>
>     https://pad.ceph.com/p/cephadm-nfs-ha
>
> 6 above matches option 2a in the pad. I think we didn't choose it
> because (1) we didn't know that modifying ganesha to move clients
> around was something being considered in the mid- to long term, and
> (2) it would make cephadm directly responsible for managing the
> virtual IPs, vs letting keepalived do it for us, which is new
> territory. Mostly, though, it (3) tightly couples the HA/ingress
> behavior with the NFS service itself, which makes the cephadm code
> less modular (doesn't reuse RGW ingress).
>
> Moving forward, I think this makes sense, each step building on the
> previous:
> - finish the work improving cephadm reconciliation frequency. This
>   will make 3/4 work properly and get us a fully functional HA
>   solution.
> - implement something like 6 above (2a in the pad). This would be
>   orthogonal to ingress, since the NFS service itself would be
>   provided the virtual IP list and the container start/stop would be
>   configuring/deconfiguring the VIPs.
> - extend ganesha to make it aware of its peers and do client
>   delegation. Aside from the basics, I'm not sure how the load
>   balancing and scale up/down parts would work... is there any plan
>   for that yet?

Ok, that all sounds reasonable.

For migration plans, nothing is fully baked. Implementing migration to
allow for taking a node out of the cluster live is not _too_ difficult.
I have a rough draft implementation of that here:

    https://github.com/jtlayton/nfs-ganesha/commits/fsloc

With that you can mark an export's config on a node with "Moved = true;"
and the node's clients should vacate it soon afterward.

What I don't have yet is a way to allow the client to reclaim state on a
different node. That mechanism could potentially be useful for breaking
the constraint that reclaiming hosts always need to go to the same
server. That would also enable us to live-move clients for balancing as
well. I haven't sat down to design anything like that yet, so I don't
know how difficult it is, but I think it's possible.
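To make the "Moved" part concrete: the idea is that you'd just flip a
flag in the export block on the node you want to drain. A rough sketch
follows; everything here other than the Moved setting is just a generic
CephFS export for illustration, not anything the branch adds:

    EXPORT {
        Export_Id = 100;
        Path = "/";
        Pseudo = "/cephfs";
        Access_Type = RW;
        Protocols = 4;

        # from the fsloc branch: tells this node's clients to vacate
        Moved = true;

        FSAL {
            Name = CEPH;
        }
    }

The interesting part is obviously what ganesha then does to steer the
clients elsewhere, which is what the branch above is working toward.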
--
Jeff Layton <jlayton@xxxxxxxxxx>