On 03/23/2010 03:23 PM, Ed W wrote:
> I'm not an active Glusterfs user yet, but what worries me about gluster
> is this very casual attitude to split brain... Other cluster solutions
> take outages extremely seriously, to the point that they fence off the
> downed server until it's guaranteed back into a synchronised state...

I'm not sure I'd say the attitude is casual, so much as that it
emphasizes availability over consistency.

> Once a machine has gone down then it should be fenced off and not be
> allowed to serve files again until it's fully synced - otherwise you are
> just asking for a set of circumstances (however unlikely) to cause
> out-of-date data to be served...

This is a very common approach to a very common problem in clustered
systems, but it does require server-to-server communication (which
GlusterFS has historically avoided).

> A superb solution would be for the replication tracker to actually log
> and mark dirty anything it can't fully replicate. When the replication
> partner comes back up, these could then be treated as a priority sync
> list to get the servers back up to date?

To put a slight twist on that, it would be nice if clients knew which
servers were still in catch-up mode and did not direct traffic to them
except as part of the catch-up process. That process, in turn, should be
based on precise logging of changes on the survivors, so that only an
absolute minimum of files needs to be touched. That's kind of a whole
different replication architecture, but IMO it would be better for local
replication and practically necessary for wide-area use.
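
To make the "priority sync list" idea a bit more concrete, here is a rough
sketch (plain Python, not GlusterFS code; every name here is invented for
the example) of the kind of thing I mean: while a replica is unreachable,
the survivors record each path they modify in a dirty log; when the replica
returns, it is flagged as catching up, clients skip it for reads, and only
the logged paths are replayed onto it before it rejoins the read path.

    # Illustrative sketch only -- not GlusterFS code. DirtyLog, Replica,
    # write, readable_replicas and catch_up are invented names for this
    # example, not real APIs.

    import shutil
    from pathlib import Path


    class Replica:
        def __init__(self, root: str):
            self.root = Path(root)
            self.online = True
            self.catching_up = False


    class DirtyLog:
        """Paths changed while some replica was unreachable."""

        def __init__(self):
            self.paths = set()

        def record(self, relpath: str):
            self.paths.add(relpath)


    def write(survivor: Replica, log: DirtyLog, relpath: str, data: bytes):
        """Write on a surviving replica and log the change for later resync."""
        target = survivor.root / relpath
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        log.record(relpath)


    def readable_replicas(replicas):
        """Clients should only read from replicas that are fully caught up."""
        return [r for r in replicas if r.online and not r.catching_up]


    def catch_up(survivor: Replica, returning: Replica, log: DirtyLog):
        """Replay only the logged paths onto the returning replica,
        then let it serve clients again."""
        returning.online = True
        returning.catching_up = True
        for relpath in sorted(log.paths):
            src = survivor.root / relpath
            dst = returning.root / relpath
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
        log.paths.clear()
        returning.catching_up = False

The point of the sketch is the combination: the change log keeps the resync
set to exactly the files that were touched, and the catching_up flag is what
keeps clients from reading stale data in the meantime. Getting that right
across server failures is where the real work would be.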