Re: cephadm docs on HA NFS

I think we're talking past each other a bit here.  Let me try to
reset the thread by summarizing the different options we've
discussed:

1. multiple ganeshas (existing nfs service)
- N ganesha daemons
- IPs are based on which machines they get scheduled on
- we could generate a list of IPs for clients to use at mount time
- if a host fails, any clients mounting that particular ganesha break.
- currently implemented (base nfs service)
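
For reference, the base nfs service in 1 is just a normal cephadm
service spec applied with 'ceph orch apply -i <file>'.  Roughly
something like the below (service_id and count are made-up examples,
and depending on the release you may also need pool/namespace or
port fields):

  service_type: nfs
  service_id: foo
  placement:
    count: 2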

2. multiple ganeshas + round-robin DNS
- N ganesha daemons
- IPs are based on which machines they get scheduled on
- generate DNS with A record for each ganesha
- once client mounts a particular IP, it sticks to that IP
- if a host is down at mount time, clients will select a different IP to mount
- if a host fails after mount, client breaks.
- can be implemented w/ current base nfs service + some scripting to update DNS
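
The "some scripting" in 2 could be as small as the sketch below.  It
is untested and assumes the 'ceph orch ps' JSON includes 'hostname'
and 'status_desc' fields, that jq is available, and that the hosts
resolve to the addresses you want to publish; the zone name and TTL
are placeholders:

  ceph orch ps --daemon_type nfs --format json \
    | jq -r '.[] | select(.status_desc == "running") | .hostname' \
    | while read host; do
        ip=$(getent hosts "$host" | awk '{print $1}')
        # emit an A record per running ganesha; feed these into the DNS zone
        echo "nfs.example.com. 60 IN A $ip"
      done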

3. single ganesha + ingress
- 1 ganesha daemon
- M (probably 2 or 3) haproxy + keepalived hosts
- virtual IP on the proxy hosts
- clients mount a known, stable IP
- if a proxy host goes down, virtual IP moves.  no effect on mounting
or mounted clients.
- if the ganesha host goes down, cephadm reschedules the ganesha elsewhere
and adjusts the proxy config.  (* this is slow currently, will be fast
Real Soon Now.)
- currently implemented: nfs(count=1) + ingress
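
Concretely, 3 is the nfs spec from 1 with count: 1, plus an ingress
spec layered on top.  Roughly like the below (the virtual IP, ports,
and service_id are made up; double-check the field names against the
current docs):

  service_type: ingress
  service_id: nfs.foo
  placement:
    count: 2
  spec:
    backend_service: nfs.foo
    frontend_port: 2049
    monitor_port: 9049
    virtual_ip: 10.0.0.100/24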

4. multiple ganesha + ingress
- N ganesha daemons
- M (probably 2 or 3) haproxy + keepalived hosts
- virtual IP on the proxy hosts
- haproxy does static hashing
- clients mount a known, stable IP
- if a proxy host goes down, virtual IP moves.  no effect on mounting
or mounted clients.
- if a ganesha host goes down, cephadm reschedules the ganesha
elsewhere and adjusts the proxy config.  (* this is slow currently,
will be fast Real Soon Now.)
- if the ganesha count (N) is increased or decreased, some clients get
remapped to new ganeshas.  I chose an haproxy hash function that
minimizes this.  but, basically, you can't change N without *some*
disruption.
- mostly implemented (except that cephadm update needs to be faster)
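
To illustrate the hashing in 4: the goal is that a given client
always lands on the same ganesha, and that adding or removing
backends only remaps a small fraction of clients.  One way to get
that behavior in haproxy is source-IP hashing with a consistent hash;
a hand-written approximation (not the literal config cephadm
generates, and addresses are placeholders):

  frontend nfs
      bind 10.0.0.100:2049
      mode tcp
      default_backend ganesha
  backend ganesha
      mode tcp
      balance source            # hash on client source IP
      hash-type consistent      # minimize remapping when N changes
      server nfs.foo.0 192.168.0.11:2049
      server nfs.foo.1 192.168.0.12:2049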

5. single ganesha + single virtual IP
- 1 ganesha daemon
- 1 virtual IP that follows the ganesha daemon
- on failure, cephadm would deploy ganesha elsewhere + move virtual IP
- not implemented

6. multiple ganesha + multiple virtual IPs
- N ganesha daemons
- N virtual IPs
- requires ganesha changes to (1) make ganesha aware of peers and (2)
instruct clients to move around
- on failure, cephadm would deploy failed ganesha elsewhere + move
that virtual IP
- not implemented (in cephadm or ganesha)

So:
- 1 is obviously not HA
- 2 is better, but also not really usable since a failure breaks
mounted clients.
- 3 and 4 are what we implemented.  This is partly because it's the
same ingress architecture as RGW, and partly because the overall
cephadm structure works well in terms of making 'ingress' a separate,
modular, optional service that layers on top of the base NFS service.
They will (as soon as the cephadm update frequency is fixed) provide
HA and scale-out.
- 5 is conceptually simpler than 3/4, but more complex to implement
in cephadm (no shared code/architecture with RGW ingress), and it
does not provide any scale-out capability.
- 6 is what you'd like to move to.  It would provide HA, scale-out,
and online scale up/down.


> The ideal thing is to just give the different ganesha heads different
> public IP addresses. Being able to live-balance (and shrink!) the
> cluster seems like something that would be valuable. You won't be able
> to do that with an ingress controller in the way.

I think the ganesha changes you're talking about sound great, and 6
sounds like the way to make it all come together.

I disagree that 2 is a viable option, because a failure disrupts
mounted clients.  In the short term it might make sense, since 3/4
aren't yet complete, but we should keep in mind that it requires
additional plumbing/automation to generate the DNS records, whereas
3/4 have a single stable IP.

I agree that 3/4 is more complicated than 5/6.  The original pad that
we used to make this decision is here:
 https://pad.ceph.com/p/cephadm-nfs-ha
6 above matches option 2a in the pad.  I think we didn't choose it
because (1) we didn't know that modifying ganesha to move clients
around was something being considered in the mid- to long-term, and
(2) it would make cephadm directly responsible for managing the
virtual IPs, vs letting keepalived do it for us, which is new
territory.  Mostly, though, it (3) tightly couples the HA/ingress
behavior with the NFS service itself, which makes the cephadm code
less modular (doesn't reuse RGW ingress).

Moving forward, I think the following makes sense, with each step
building on the previous:
- finish the work improving cephadm reconciliation frequency.  This
will make 3/4 work properly and get us a fully functional HA solution.
- implement something like 6 above (2a in the pad).  This would be
orthogonal to ingress, since the NFS service itself would be provided
the virtual IP list, and the container start/stop would handle
configuring/deconfiguring the VIPs (see the sketch after this list).
- extend ganesha to make it aware of its peers and do client
delegation.  Aside from the basics, I'm not sure how the load
balancing and scale up/down parts would work... is there any plan for
that yet?
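
For the VIP configure/deconfigure step mentioned above, the container
start/stop hooks could boil down to something like the following
(interface name and address are placeholders; the arping is optional
but speeds up failover by announcing the move):

  # on start
  ip addr add 10.0.0.101/24 dev eth0
  arping -c 3 -U -I eth0 10.0.0.101   # gratuitous ARP so peers update their caches
  # on stop
  ip addr del 10.0.0.101/24 dev eth0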

sage