David,

The problem you are facing is something we are already investigating. We
haven't root-caused it yet, but from what we have seen, this happens only on /
and only for the metadata changelog. It shows up as annoying log messages, but
it should not affect your functionality.

Avati

On Thu, Jan 27, 2011 at 2:03 PM, David Lloyd <david.lloyd at v-consultants.co.uk> wrote:

> Yes, it seemed really dangerous to me too. But with the lack of
> documentation, and lack of response from gluster (and the data is still on
> the old system too), I thought I'd give it a shot.
>
> Thanks for the explanation. The split-brain problem seems to come up fairly
> regularly, but I've not found any clear explanation of what to do in this
> situation. I'm starting to worry about what appears to be a rationing of
> information from gluster.com to the community at large.
>
> We're not in a position to purchase support, and I'm a sysadmin, not a
> developer. I hope to make a contribution in terms of testing and feedback
> and bug reports, but I'm seeing a lot of threads that seem to go nowhere,
> and it's getting a bit frustrating.
>
> David
>
> > This seems really dangerous to me. On a brick xxx, the trusted.afr.yyy
> > attribute consists of three unsigned 32-bit counters, indicating how many
> > uncommitted operations (data, metadata, and namespace respectively) might
> > exist at yyy. If xxx shows uncommitted operations at yyy but not vice
> > versa, then we know that xxx is more up to date and it should be the source
> > for self-heal. If two bricks show uncommitted operations at each other,
> > then we're in the infamous "split brain" scenario. Some client was unable
> > to clear the counter at xxx while another was unable to clear it at yyy, or
> > both xxx and yyy went down after the operation was complete but before they
> > could clear the counters for each other.
> >
> > In this case, it looks like a metadata operation (permission change) was in
> > this state. If the permissions are in fact the same both places then it
> > doesn't matter which way self-heal happens, or whether it happens at all.
> > In fact, it seems to me that AFR should be able to detect this particular
> > condition and not flag it as an error. In any case, I think you're probably
> > fine in this case but in general it's a very bad idea to clear these flags
> > manually because it can cause updates to be lost (if self-heal goes the
> > wrong way) or files to remain in an inconsistent state (if no self-heal
> > occurs).
> >
> > The real thing I'd wonder about is why both servers are so frequently
> > becoming unavailable at the same instant (switch problem?) and why
> > permission changes on the root are apparently so frequent that this often
> > results in a split-brain.
> >
> > _______________________________________________
> > Gluster-users mailing list
> > Gluster-users at gluster.org
> > http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
>
> --
> David Lloyd
> V Consultants
> www.v-consultants.co.uk
> tel: +44 7983 816501
> skype: davidlloyd1243
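
For anyone trying to follow the counter logic in the quoted explanation, here is
a rough sketch of it in Python. It is not the actual AFR code. It assumes the
12-byte trusted.afr.* value is three unsigned 32-bit integers in network
(big-endian) byte order, which the thread does not confirm, and the names
Pending, parse_afr and classify are made up for illustration. It also collapses
the three counters into a single "dirty or not" check per brick, which is
simpler than deciding per operation type.

```python
import struct
from typing import NamedTuple


class Pending(NamedTuple):
    """Counters from one trusted.afr.<peer> attribute (hypothetical helper type)."""
    data: int
    metadata: int
    namespace: int


def parse_afr(raw: bytes) -> Pending:
    """Decode a raw 12-byte attribute value into its three counters.

    Assumption: big-endian byte order; not confirmed by the thread.
    """
    if len(raw) != 12:
        raise ValueError("expected 12 bytes, got %d" % len(raw))
    return Pending(*struct.unpack(">III", raw))


def classify(xxx_about_yyy: Pending, yyy_about_xxx: Pending) -> str:
    """Apply the rule from the quoted text to a pair of bricks.

    xxx_about_yyy is trusted.afr.yyy as stored on brick xxx, and vice versa.
    """
    xxx_accuses = any(xxx_about_yyy)  # xxx records uncommitted ops at yyy
    yyy_accuses = any(yyy_about_xxx)  # yyy records uncommitted ops at xxx
    if xxx_accuses and yyy_accuses:
        return "split-brain"
    if xxx_accuses:
        return "heal from xxx"
    if yyy_accuses:
        return "heal from yyy"
    return "clean"


# Example: one pending metadata operation recorded on both sides.
meta_only = parse_afr(bytes.fromhex("000000000000000100000000"))
clean = parse_afr(bytes(12))
print(classify(meta_only, meta_only))
print(classify(meta_only, clean))
```

The hex string in the example is the kind of value you would get by dumping the
attribute on a brick with getfattr and hex encoding; the value itself is
invented here, not taken from David's system.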