On Fri, 2021-07-30 at 09:55 -0500, Sage Weil wrote:
> I think we're talking past each other a bit here. Let me try to reset
> this thread a bit by summarizing the different options we've discussed:

I think you're right that I was misunderstanding. Thanks for the
clarification. Some comments inline below:

> 1. multiple ganeshas (existing nfs service)
>    - N ganesha daemons
>    - IPs are based on which machines they get scheduled on
>    - we could generate a list of IPs for clients to use at mount time
>    - if a host fails, any clients mounting that particular ganesha break
>    - currently implemented (base nfs service)
>
> 2. multiple ganeshas + round-robin DNS
>    - N ganesha daemons
>    - IPs are based on which machines they get scheduled on
>    - generate DNS with an A record for each ganesha
>    - once a client mounts a particular IP, it sticks to that IP
>    - if a host is down at mount time, clients will select a different
>      IP to mount
>    - if a host fails after mount, the client breaks
>    - can be implemented with the current base nfs service + some
>      scripting to update DNS
>
> 3. single ganesha + ingress
>    - 1 ganesha daemon
>    - M (probably 2 or 3) haproxy + keepalived hosts
>    - virtual IP on the proxy hosts
>    - clients mount a known, stable IP
>    - if a proxy host goes down, the virtual IP moves; no effect on
>      mounting or mounted clients
>    - if the ganesha host goes down, cephadm reschedules the ganesha
>      elsewhere and adjusts the proxy config (* this is slow currently,
>      will be fast Real Soon Now)
>    - currently implemented: nfs(count=1) + ingress
>
> 4. multiple ganeshas + ingress
>    - N ganesha daemons
>    - M (probably 2 or 3) haproxy + keepalived hosts
>    - virtual IP on the proxy hosts
>    - haproxy does static hashing
>    - clients mount a known, stable IP
>    - if a proxy host goes down, the virtual IP moves; no effect on
>      mounting or mounted clients
>    - if a ganesha host goes down, cephadm reschedules the ganesha
>      elsewhere and adjusts the proxy config (* this is slow currently,
>      will be fast Real Soon Now)
>    - if the ganesha count (N) is increased or decreased, some clients
>      get remapped to new ganeshas. I chose an haproxy hash function
>      that minimizes this, but basically you can't change N without
>      *some* disruption
>    - mostly implemented (except that cephadm update needs to be faster)
>
> 5. single ganesha + single virtual IP
>    - 1 ganesha daemon
>    - 1 virtual IP that follows the ganesha daemon
>    - on failure, cephadm would deploy ganesha elsewhere + move the
>      virtual IP
>    - not implemented
>
> 6. multiple ganeshas + multiple virtual IPs
>    - N ganesha daemons
>    - N virtual IPs
>    - requires ganesha changes to (1) make ganesha aware of peers and
>      (2) instruct clients to move around
>    - on failure, cephadm would deploy the failed ganesha elsewhere +
>      move that virtual IP
>    - not implemented (in cephadm or ganesha)
>
> So:
> - 1 is obviously not HA
> - 2 is better, but also not really usable since a failure breaks
>   mounted clients
> - 3 and 4 are what we implemented. This is partly because it's the
>   same ingress architecture as RGW, and partly because the overall
>   cephadm structure works well in terms of making 'ingress' a
>   separate, modular, optional service that layers on top of the base
>   NFS service. They (will, as soon as cephadm update frequency is
>   fixed) provide HA and scale-out.
> - 5 is simpler than 3/4, but more complex to implement in cephadm (no
>   shared code/architecture with RGW ingress) and does not provide any
>   scale-out capability.
> - 6 is what you'd like to move to. It would provide HA, scale-out,
>   and online scale up/down.

Ok, got it. So basically you're mainly using haproxy to implement some
poor-man's virtual IP handling. That obviously works, and you _could_
just set up multiple haproxy addresses to allow for scale-out.

As a side note, someone asking the other day about HA ganesha mentioned
using ucarp, which looks pretty simple:

    https://wiki.greentual.com/index.php/Ucarp

It may be worth considering that instead, but it may not give you much
if you need to deal with haproxy for RGW anyway. That said, floating a
VIP between machines is probably more efficient than proxying packets.
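(Just to sketch what I mean, purely as an illustration: the interface,
addresses, vhid, and password below are made-up placeholders and I
haven't tested this. Floating a single address with ucarp on each
candidate host looks roughly like:

    # run on every host that is allowed to own the VIP
    ucarp --interface=eth0 --srcip=192.0.2.11 --vhid=10 --pass=secret \
          --addr=192.0.2.100 \
          --upscript=/usr/local/bin/vip-up.sh \
          --downscript=/usr/local/bin/vip-down.sh

    # vip-up.sh would do something like:  ip addr add 192.0.2.100/24 dev eth0
    # vip-down.sh would do the opposite:  ip addr del 192.0.2.100/24 dev eth0

ucarp just elects a master over CARP and runs the up/down scripts to
add or remove the address, which is basically what keepalived does with
VRRP.)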
> > The ideal thing is to just give the different ganesha heads
> > different public IP addresses. Being able to live-balance (and
> > shrink!) the cluster seems like something that would be valuable.
> > You won't be able to do that with an ingress controller in the way.
>
> I think the ganesha changes you're talking about sound great, and 6
> sounds like the way to make it all come together.
>
> I disagree that 2 is a viable option, because a failure disrupts
> mounted clients. In the short term it might make sense since 3/4
> aren't yet complete, but we should keep in mind it requires additional
> plumbing/automation to generate the DNS records, whereas 3/4 have a
> single stable IP.
>
> I agree that 3/4 is more complicated than 5/6. The original pad that
> we used to make this decision is here:
>
>     https://pad.ceph.com/p/cephadm-nfs-ha
>
> 6 above matches option 2a in the pad. I think we didn't choose it
> because (1) we didn't know that modifying ganesha to move clients
> around was something being considered in the mid- to long term, and
> (2) it would make cephadm directly responsible for managing the
> virtual IPs, vs letting keepalived do it for us, which is new
> territory. Mostly, though, it (3) tightly couples the HA/ingress
> behavior with the NFS service itself, which makes the cephadm code
> less modular (doesn't reuse RGW ingress).
>
> Moving forward, I think this makes sense, each step building on the
> previous:
> - finish the work improving cephadm reconciliation frequency. This
>   will make 3/4 work properly and get us a fully functional HA
>   solution.
> - implement something like 6 above (2a in the pad). This would be
>   orthogonal to ingress, since the NFS service itself would be
>   provided the virtual IP list and the container start/stop would be
>   configuring/deconfiguring the VIPs.
> - extend ganesha to make it aware of its peers and do client
>   delegation. Aside from the basics, I'm not sure how the load
>   balancing and scale up/down parts would work... is there any plan
>   for that yet?

Ok, that all sounds reasonable.

For migration plans, nothing is fully baked. Implementing migration to
allow for taking a node out of the cluster live is not _too_ difficult.
I have a rough draft implementation of that here:

    https://github.com/jtlayton/nfs-ganesha/commits/fsloc

With that you can mark an export's config on a node with "Moved = true;"
and the node's clients should vacate it soon afterward.

What I don't have yet is a way to allow the client to reclaim state on a
different node. That mechanism could potentially be useful for breaking
the constraint that reclaiming hosts always need to go to the same
server. That would also enable us to live-move clients for balancing as
well. I haven't sat down to design anything like that yet, so I don't
know how difficult it is, but I think it's possible.
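To make the "Moved" part concrete: the idea is that you'd just flip a
flag in the export block on the node you want to drain. A rough sketch
follows; everything here other than the Moved setting is just a generic
CephFS export for illustration, not anything the branch adds:

    EXPORT {
        Export_Id = 100;
        Path = "/";
        Pseudo = "/cephfs";
        Access_Type = RW;
        Protocols = 4;

        # from the fsloc branch: tells this node's clients to vacate
        Moved = true;

        FSAL {
            Name = CEPH;
        }
    }

The interesting part is obviously what ganesha then does to steer the
clients elsewhere, which is what the branch above is working toward.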
--
Jeff Layton <jlayton@xxxxxxxxxx>