--- On Mon, 1/12/09, Gordan Bobic <gordan@xxxxxxxxxx> wrote:

Ding, ding, ding, ding!!! I get it: you are using NFS to achieve blocking, which is exactly my #1 remaining gripe with glusterfs: it does not block! Please try explaining to the glusterfs devs why this is important to you! I am not sure that I made my case clear to them. It seems like your use of NFS is primarily based upon this (what I perceive to be) major remaining shortcoming of glusterfs. Would you give up NFS if blocking were implemented in glusterfs?

One remaining drawback to NFS, which you may not care about, is the fact that NFS servers should not themselves be NFS clients. My desired operational scenario is more of a "peer 2 peer" one, in which I would need my servers to be able to mount their own subvolumes.

> > So, the HA translator supports talking to two different servers with two different transport mechanisms and two different IPs. Bonding does not support anything like this as far as I can tell.
>
> True. Bonding is more transparent. You make two NICs into one virtual NIC and round-robin packets down them. If one NIC/path fails, all the traffic will fail over to the other NIC/path.

Another benefit of the HA translator is that you can have two entirely different paths, which is very hard to do with bonding. With bonding you are restricted to one IP. If you think about using a WAN, bonding would not allow you to access a remote server using two entirely different IPs that go through two entirely different WAN gateways. The HA translator should in theory make this very easy.

> No, not at all. Multiple servers, 1 floating IP per server. Floating as in it can be migrated to the other server if one fails. You balance the load by assigning half of your clients to one floating IP, and the other half of the clients to the other floating IP. So, when both servers are up, each handles half the load. If one server fails, its IP gets migrated to the other server, and all clients thereafter talk to the surviving server since it has both IPs (until the other server comes back up and asks for its IP address back).

Got it.

> >> This is a reinvention of a wheel. NFS already handles this gracefully for the use-case you are describing.
> >
> > I am lost, what does NFS have to do with it?
>
> It already handles the "server has gone away" situation gracefully. What I'm saying is that you can use GlusterFS underneath for mirroring the data (AFR) and re-export with NFS to the clients. If you want to avoid client-side AFR and still have graceful failover with lightweight transport, NFS is not a bad choice.

Uh, not exactly a good choice though; it seems like an awfully big hammer to use just because you think it's better than reinventing the wheel. I can see that it will work in your strict client/server use case, but not in "peer 2 peer". A simple HA translator would be a more flexible, better glusterfs-integrated solution, don't you think?

> > How do current tools achieve the following setup? Client A talks to Server A and submits a read request. The read request is received on Server A (TCP acked to the client), and then Server A dies. How will that request be completed without glusterfs returning an "endpoint not connected" error?
>
> You make client <-> server comms NFS. You make server <-> server comms GlusterFS.
>
> If the NFS server goes away, the client will keep retrying until the server returns. In this case, that would mean it'll keep retrying until the other server fails the IP address over to itself.
>
> This achieves:
> 1) server side AFR with GlusterFS for redundancy
> 2) client connects to a single server via NFS so there's no double-bandwidth used by the client
> 3) servers can fail over relatively transparently to the client

Makes sense.

> > No, I have not confirmed that this actually works with the HA translator, but I was told that the following would happen if it were used. Client A talks to Server A and submits a read request. The read request is received on Server A (TCP acked to the client), and then Server A dies. Client A will then in theory retry the read request on Server B. Bonding cannot do anything like this (since the read was TCP ACKed)?
>
> Agreed, if a server fails, bonding won't help. Cluster fail-over server-side, however, will, provided the network file system protocol can deal with it reasonably well.

Yes, but I fear you might still have a corner case where you can get some non-POSIX behavior with this setup, just as I mentioned I believe you would with the HA translator:

- Client 1 writes a seq # (1) to file foo via server A.
- Server A applies the write to file foo on both server A and server B, and dies without acking the write to client 1.
- Client 2 reads the seq # (1) from file foo via server B.
- Client 2 increments the seq # to 2 and writes it to file foo via server B.
- Client 1 retries its original write of 1 to file foo via server B (it believes the write failed on server A) and succeeds.

Beware: file foo now contains 1, yet client 2 clearly read it as 1 and successfully incremented it to 2. Tricky, but evil!
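To make the lost update concrete, here is that interleaving as a toy Python walk-through (the dicts just stand in for the two replicas and the retry is done by hand; this is obviously not real glusterfs code):

    # Toy model of the retry race described above. Replicas are plain
    # dicts and "replication" is a manual copy; illustration only.
    server_a = {}   # replica on server A
    server_b = {}   # replica on server B

    # Client 1 writes seq #1 via server A. Server A applies it to both
    # replicas, then dies before the ack reaches client 1.
    server_a["foo"] = 1
    server_b["foo"] = 1
    # (server A is now dead; client 1 never sees an ack)

    # Client 2, via server B, reads 1, increments it and writes 2.
    seq = server_b["foo"]        # reads 1
    server_b["foo"] = seq + 1    # writes 2, acked

    # Client 1 never got its ack, so it replays its original write via
    # server B, and that stale write succeeds.
    server_b["foo"] = 1

    print(server_b["foo"])       # prints 1: client 2's increment is silently lost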
> > I think that this is quite different from any bonding solution. Not better, different. If I were to use this it would not preclude me from also using bonding, but it solves a somewhat different problem. It is not a complete solution, it is a piece, but not a duplicated piece. If you don't like it, or it doesn't fit your backend use case, don't use it! :)
>
> If it can handle the described failure more gracefully than what I'm proposing, then I'm all for it. I'm just not sure there is that much scope for it being better, since the last write may not have made it to the mirror server anyway, so even if the protocol can re-try, it would need to have some kind of journaling, roll back the journal and replay the operation.

That's why I said "in theory" about the HA translator! :) I do not see anything in the code that actually keeps track of requests until they are replied to, but I was told that it can replay them. Can someone explain where this is done? I can't see how this is done without some type of RAM journal. I say RAM because requests need not survive a client crash; they simply need to hit the server disk before the client returns success. If the client crashes, the applications never got a confirmation, so those requests will not need to be replayed. Why do you think a client would need to be able to roll back the journal? It should just have to replay it, no rollback.

> This, however, is a much more complex approach (very similar to what GFS does), and there is a high price to pay in terms of performance when the nodes aren't on the same LAN.

With glusterfs's architecture it should not be much of a price, just the buffering of requests until they are completed.
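To be clear about what I mean by a RAM journal, here is a rough Python sketch; the JournalingClient class and its server objects are invented purely for illustration and are not anything that exists in the glusterfs code:

    # Client-side in-RAM journal: every request stays in memory until a
    # server acknowledges it; after a failover the unacknowledged ones
    # are simply replayed against the surviving server. No rollback.
    import itertools

    class JournalingClient:
        def __init__(self, servers):
            self.servers = list(servers)   # e.g. [server_a, server_b]
            self.pending = {}              # request id -> request, RAM only
            self.ids = itertools.count()

        def submit(self, request):
            rid = next(self.ids)
            self.pending[rid] = request    # journal before sending
            self._send(rid, request)

        def _send(self, rid, request):
            for server in self.servers:
                try:
                    server.handle(request)     # blocks until this server acks
                except ConnectionError:
                    continue                   # dead server: try the next one
                del self.pending[rid]          # acked, so drop it from the journal
                return
            # No server reachable: the request stays journaled so it can be
            # replayed later, instead of returning an error to the application.

        def replay(self):
            # Called after reconnecting: resend whatever was never acked.
            for rid, request in sorted(self.pending.items()):
                self._send(rid, request)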
> > No. I am proposing adding a complete transactional model to AFR so that if a write fails on one node, some policy can decide whether the same write should be committed or rolled back on the other nodes. Today, the policy is to simply apply it to the other nodes regardless. This is a recipe for split brain.
>
> OK, I get what you mean. It's basically the same problem I described above when I mentioned that you'd need some kind of a journal to roll back the operation that hasn't been fully committed.

I don't see it as the same as the case above at all, since above you do not need to roll back. In this case, depending on which side of the segregated network you are on, the journal may need to be rolled back or committed.

> > In the case of network segregation some policy should decide to allow writes to be applied to one side of the segregation and denied on the other. This does not require fencing (but it would be better with it); it could be a simple policy like "apply writes if a majority of nodes can be reached", and if not, fail (or block, which would be even better).
>
> Hmm... This could lead to an elastic shifting quorum. I'm not sure how you'd handle resyncing if nodes are constantly leaving/joining. It seems a bit non-deterministic.

I wasn't trying to focus on a specific policy, but I fail to see any actual problem as long as you always have a majority? Could you be specific about a problematic case? I would suggest other policies too, thus my request for an external hook.
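For what it's worth, the "majority" policy I have in mind is no smarter than the following Python sketch (the function names are invented and the reachability check is hand-waved; a real implementation would hook into whatever liveness information the translator already has):

    # "Apply writes only if a majority of nodes is reachable"; otherwise
    # fail (or, better, block) rather than write on the minority side of
    # a segregated network.
    def have_quorum(nodes, reachable):
        alive = sum(1 for node in nodes if reachable(node))
        return alive > len(nodes) // 2      # strict majority

    def apply_write(nodes, reachable, write):
        if not have_quorum(nodes, reachable):
            raise IOError("no quorum: refusing write")
        for node in nodes:
            if reachable(node):
                write(node)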
> > I guess what you call tiny, I call huge. Even if you have your heartbeat fencing occur in under a tenth of a second, that is time enough to split-brain a major portion of a filesystem. I would never trust it.
>
> In GlusterFS that problem exists anyway,

Why "anyway"? It exists, sure, but it's certainly something that I would hope gets fixed eventually.

> but it is largely mitigated by the fact that it works on file level rather than block device level.

Certainly not FS-devastating like it would be for a block device, but bad data is still bad data. It would be of no consolation to me that I have access to the rest of my FS if one really important file is corrupt!

> > To borrow your analogy, adding heartbeat to the current AFR: "It's a bit like fitting a big padlock on the door when there's a wall missing."
>
> :)
>
> > Every single write needs to ensure that it will not cause split brain for me to trust it.
>
> Sounds like GlusterFS isn't necessarily the solution for you, then. :(

It's not all bad, it's just not usable for some use cases yet.

> > If not, why would I bother with glusterfs over AFR instead of glusterfs over DRBD? Oh right, because I cannot get glusterfs to fail over without incurring connection errors on the client! ;) (not your beef, I know, from another thread)
>
> Precisely - which is why I originally suggested not using GlusterFS for client-server communication. :)

...

> And this is exactly why I suggested using NFS for the client<->server connection. NFS blocks until the server becomes contactable again.

Yes, but do you have any other suggestions besides NFS? Anything that can be safely used as both a client and a server? :)

> > But, to be clear, I am not disagreeing with you that the HA translator does not solve the split brain problem at all. Perhaps this is what is really "upsetting" you: not that it is "duplicated" functionality, but rather that it does not help AFR solve its split brain personality disorders, it only helps make them more available, thus making split brain even more likely!! ;(
>
> I'm not sure it makes it any worse WRT split-brain, it just seems that you are looking for GlusterFS+HA to provide you with exactly the same set of features that NFS+(server fail-over) already provides.

You are right, glusterfs + AFR + HA is probably no worse than glusterfs + AFR + NFS. But both make it slightly more likely to have split brain than simply glusterfs + AFR. And glusterfs + AFR itself is much more likely to split brain than glusterfs + DRBD.

> Of course, there could be advantages in GlusterFS behaving the same way as NFS when the server goes away if it's a single-server setup

I fail to see why behaving that way would not be desirable even if you have many servers and they all went down?

> - it would be easier to set up and a bit more elegant. But it wouldn't add any functionality that couldn't be re-created using the sort of a setup I described.

I guess just multi-path, multi-protocol (encrypt one, not the other...). Primarily flexibility; bonding is very limited. I would think that it might in some use cases increase bandwidth also. My reading on bonding suggests that if you are using separate switches you can get either HA bonding or link-aggregation bonding, but not both, right?

-Martin