self heal errors on 3.1.1 clients

David,
  The problem you are facing is something we are already investigating.
We haven't root-caused it yet, but from what we have seen it happens
only on / and only for the metadata changelog. It shows up as annoying
log messages, but it should not affect your functionality.

Avati

On Thu, Jan 27, 2011 at 2:03 PM, David Lloyd <david.lloyd at v-consultants.co.uk> wrote:

> Yes, it seemed really dangerous to me too. But with the lack of
> documentation, and lack of response from gluster (and the data is still on
> the old system too), I thought I'd give it a shot.
>
> Thanks for the explanation. The split-brain problem seems to come up fairly
> regularly, but I've not found any clear explanation of what to do in this
> situation. I'm starting to worry about what appears to be a rationing of
> information from gluster.com to the community at large.
>
> We're not in a position to purchase support, and I'm a sysadmin, not a
> developer. I hope to make a contribution in terms of testing and feedback
> and bug reports, but I'm seeing a lot of threads that seem to go nowhere,
> and it's getting a bit frustrating.
>
> David
>
>
>
> > This seems really dangerous to me.  On a brick xxx, the trusted.afr.yyy
> > attribute consists of three unsigned 32-bit counters, indicating how many
> > uncommitted operations (data, metadata, and namespace respectively) might
> > exist at yyy.  If xxx shows uncommitted operations at yyy but not vice
> > versa, then we know that xxx is more up to date and it should be the
> > source for self-heal.  If two bricks show uncommitted operations at each
> > other, then we're in the infamous "split brain" scenario.  Some client was
> > unable to clear the counter at xxx while another was unable to clear it at
> > yyy, or both xxx and yyy went down after the operation was complete but
> > before they could clear the counters for each other.
> >
> > In this case, it looks like a metadata operation (permission change) was
> > in this state.  If the permissions are in fact the same both places, then
> > it doesn't matter which way self-heal happens, or whether it happens at
> > all.  In fact, it seems to me that AFR should be able to detect this
> > particular condition and not flag it as an error.  In any case, I think
> > you're probably fine in this case, but in general it's a very bad idea to
> > clear these flags manually, because it can cause updates to be lost (if
> > self-heal goes the wrong way) or files to remain in an inconsistent state
> > (if no self-heal occurs).
> >
> > The real thing I'd wonder about is why both servers are so frequently
> > becoming unavailable at the same instant (switch problem?) and why
> > permission changes on the root are apparently so frequent that this often
> > results in a split-brain.
> >
> > _______________________________________________
> > Gluster-users mailing list
> > Gluster-users at gluster.org
> > http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
> >
>
>
>
> --
> David Lloyd
> V Consultants
> www.v-consultants.co.uk
> tel: +44 7983 816501
> skype: davidlloyd1243
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
>
>
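
For reference, the trusted.afr.* changelog values described in the quoted
explanation above can be read straight off a brick and decoded as the three
counters. Below is a minimal Python sketch, assuming the counters are stored
in network (big-endian) byte order; the brick path and replica name are
hypothetical placeholders, not taken from this thread:

import os
import struct

# Hypothetical names for illustration only -- substitute your own layout.
BRICK_ROOT = "/data/export/brick1"              # backend directory for one replica (here, the volume root)
XATTR_TO_PEER = "trusted.afr.testvol-client-1"  # changelog this brick keeps about its peer replica

def pending_counts(path, xattr):
    """Return (data, metadata, namespace) pending-operation counters.

    Assumes the xattr value is 12 bytes: three unsigned 32-bit integers in
    network byte order, matching the description quoted above.
    """
    raw = os.getxattr(path, xattr)
    return struct.unpack(">III", raw)

data, meta, ns = pending_counts(BRICK_ROOT, XATTR_TO_PEER)
print(f"pending ops recorded against peer: data={data} metadata={meta} namespace={ns}")

If the mirrored xattr on the other brick (the one that names this brick) also
shows non-zero counters, that is the mutual-accusation "split brain" case
described above; if only one side shows pending operations, that side is the
up-to-date source for self-heal.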

