Hi,

I must say I find the idea of a journal approach quite appealing, although the split-brain problem is an issue. That said, AFR volumes already have a split-brain problem: unplugging a network lead between two AFR sub-volumes is an easy demonstration of this, as both servers will assume the other is down and carry on. Would adding a journal make the issue any worse? (Or am I missing something?)

In terms of a real use-case, I've had lots of cluster issues relating to single nodes becoming unavailable for short periods. With the exception of "heartbeat" screwing up a DRBD setup (which was an internal software failure, rather than anything we would be looking to protect against), I've never experienced two nodes becoming isolated and potentially suffering from split-brain. (I accept it can and does happen, but I'm thinking it's not an everyday occurrence.)

So a journal would not be a perfect solution. However, a very limited amount of split-brain protection might be considered a "pretty good" solution in context, and it would provide excellent recovery metrics in most cases?

In terms of work, I'm guessing each write operation would need to put an additional (serial, path, offset, bytes, data) record to the journal volume. Each data volume would need to keep track of its most recent serial, then mount would need to check the journal and run playbacks for each sub-volume whose serial isn't up to the most recent serial in the journal.

If all this is done in a journal translator, it doesn't "sound" too onerous, nor like it would involve changing any other code?

Gareth.

----- Original Message -----
From: "Gordan Bobic" <gordan@xxxxxxxxxx>
To: "gluster-devel" <gluster-devel@xxxxxxxxxx>
Sent: Monday, April 28, 2008 7:56:16 PM GMT +00:00 GMT Britain, Ireland, Portugal
Subject: Re: Re; Load balancing ...

Martin Fick wrote:
> May I suggest an alternate approach?
> The rsync model seems like a nice one when you have no idea what the
> changes are, but with the glusterfs AFR it is possible to keep track
> of the changes. What about adding a journaling volume option to the
> AFR translator?

Sounds like you are effectively describing an extent-based volume, very similar to what DRBD does to limit the amount of sync required.

> So if changes cannot be written to Sub B they would be recorded in
> Journal A. When B comes back up and AFR notices a mismatch between a
> file on Sub A and Sub B and would normally query Sub A for the file
> contents, it could query Journal A first to see if the changes to the
> file are stored there. If so, Journal A could reply with just the
> changes instead of the whole file and AFR can then apply the changes
> to Sub B.

Split-brain handling of this would be impossible, and one version would always have to win. But other than that, I can see that would work.

> The journal volume would not actually be required and would be space
> limited; it would simply drop changes that it can no longer keep
> track of. If the journal does not have the change logged, everything
> would proceed as it does today: the subvolume would be queried for
> the whole file. This would be a little like the DRBD model, but more
> in line with the gluster way of doing things. It would be better than
> what DRBD does since it would be more granular. When space for
> changes runs out, whole files might have to be synced, but not
> necessarily the whole filesystem!

I think having an rsync-type syncing algorithm that can operate on the whole file would be more flexible and potentially provide enough of an improvement to make the complication of adding journals/extents not worthwhile.

> I realize that this is a major enhancement, and would be a lot of
> work, but then again, so probably would the rsync model
> implementation, would it not?

I haven't looked at the GlusterFS code (yet), but I would imagine that implementing rsync-like file sync would be _much_ less work than implementing extents/journals/undo logs.

> The advantage here is that consistency would be assured.

That is arguably fairly academic. Just use the rolling hash for rsync that is big enough that the probability of a false negative in the hashed block is around the same as the probability of a media error.

> The tradeoff between the journal and the rsync model is one of disk
> space for the journal versus CPU time for the rsync model. Certainly
> both could be implemented, the journal could be queried first, and if
> that fails, use the rsync method!
>
> Thoughts?

In the ideal world - yes. In practice, I think that just adding rsync capability for partial syncs would give most of the benefits for relatively little effort in terms of implementation.

Gordan

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxx
http://lists.nongnu.org/mailman/listinfo/gluster-devel
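
P.S. To make the journal idea at the top of this message a bit more concrete, here is a minimal Python sketch of the (serial, path, offset, bytes, data) record and the playback step. It is only an illustration under my own assumptions: the names Record, Journal, SubVolume and catch_up are hypothetical, nothing here is GlusterFS code or its translator API, and the journal is held in memory rather than on a journal volume. The journal is bounded and drops its oldest records; when that has happened, catch_up reports that a whole-file sync is needed, along the lines Martin suggested.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    serial: int   # monotonically increasing write counter
    path: str
    offset: int
    data: bytes   # the "bytes" length field is implied by len(data)


class Journal:
    """Bounded in-memory journal; the oldest records are dropped when full."""

    def __init__(self, max_records):
        self.records = deque(maxlen=max_records)
        self.serial = 0

    def append(self, path, offset, data):
        self.serial += 1
        self.records.append(Record(self.serial, path, offset, data))
        return self.serial

    def since(self, serial):
        """Records newer than `serial`, or None if some were already dropped."""
        if serial == self.serial:
            return []
        if not self.records or self.records[0].serial > serial + 1:
            return None
        return [r for r in self.records if r.serial > serial]


class SubVolume:
    """Toy sub-volume: files are bytearrays, plus the last applied serial."""

    def __init__(self):
        self.files = {}
        self.serial = 0

    def apply(self, rec):
        buf = self.files.setdefault(rec.path, bytearray())
        end = rec.offset + len(rec.data)
        if len(buf) < end:
            # Pad with zeros so the write fits at its offset.
            buf.extend(b"\0" * (end - len(buf)))
        buf[rec.offset:end] = rec.data
        self.serial = rec.serial


def catch_up(journal, vol):
    """Replay missed writes; return False if a whole-file sync is needed."""
    missed = journal.since(vol.serial)
    if missed is None:
        return False
    for rec in missed:
        vol.apply(rec)
    return True
```

In use, every write would be appended to the journal and applied to whichever sub-volumes are reachable; a sub-volume that comes back later is passed to catch_up, and only if that returns False does the existing whole-file sync have to run. Split-brain is not addressed at all here: the journal side always wins.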