self heal errors on 3.1.1 clients

jdarcy at redhat.com (Jeff Darcy) · Thu, 27 Jan 2011 09:01:11 -0500

On 01/26/2011 07:25 PM, David Lloyd wrote:
> Well, I did this and it seems to have worked. I was just guessing really,
> didn't have any documentation or advice from anyone in the know.
>
> I just reset the attributes on the root directory for each brick that was
> not all zeroes.
>
> I found it easier to dump the attributes without the '-e hex'
>
> g4:~ # getfattr -d  -m trusted.afr /mnt/glus1 /mnt/glus2
> getfattr: Removing leading '/' from absolute path names
> # file: mnt/glus1
> trusted.afr.glustervol1-client-2=0sAAAAAAAAAAEAAAAA
> trusted.afr.glustervol1-client-3=0sAAAAAAAAAAAAAAAA
>
> Then
> setfattr -n trusted.afr.glustervol1-client-2 -v 0sAAAAAAAAAAAAAAAA /mnt/glus1
>
> I did that on all the bricks that didn't have all A's
>
> next time i stat-ed the root of the filesystem on the client the self heal
> worked ok.
>
> I'm not comfortable advising you to do this as I'm really feeling my way
> here, but it looks as though it worked for me.

This seems really dangerous to me.  On a brick xxx, the trusted.afr.yyy 
attribute consists of three unsigned 32-bit counters, indicating how 
many uncommitted operations (data, metadata, and namespace respectively) 
might exist at yyy.  If xxx shows uncommitted operations at yyy but not 
vice versa, then we know that xxx is more up to date and it should be 
the source for self-heal.  If two bricks show uncommitted operations at 
each other, then we're in the infamous "split brain" scenario.  Some 
client was unable to clear the counter at xxx while another was unable 
to clear it at yyy, or both xxx and yyy went down after the operation 
was complete but before they could clear the counters for each other.
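To make the rule above concrete, here is a minimal sketch of the decision in Python.  It is not the AFR code; heal_direction and the tuple layout are names I made up for illustration, and real AFR may well evaluate each counter kind separately rather than lumping them together as this does.

```python
from typing import Tuple

# (data, metadata, namespace) pending counters, as described above.
Counters = Tuple[int, int, int]


def heal_direction(xxx_sees_yyy: Counters, yyy_sees_xxx: Counters) -> str:
    """Classify a pair of bricks from the counters each holds for the other.

    xxx_sees_yyy: counters stored on brick xxx in trusted.afr.yyy
    yyy_sees_xxx: counters stored on brick yyy in trusted.afr.xxx
    """
    x_blames_y = any(xxx_sees_yyy)  # xxx records uncommitted ops at yyy
    y_blames_x = any(yyy_sees_xxx)  # yyy records uncommitted ops at xxx

    if x_blames_y and y_blames_x:
        return "split-brain"
    if x_blames_y:
        return "heal from xxx to yyy"
    if y_blames_x:
        return "heal from yyy to xxx"
    return "in sync"


if __name__ == "__main__":
    print(heal_direction((0, 1, 0), (0, 0, 0)))
    print(heal_direction((0, 1, 0), (0, 1, 0)))
```

Zeroing the counters by hand forces the "in sync" answer on both sides regardless of what the bricks actually contain.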

In this case, it looks like a metadata operation (permission change) was 
left pending.  If the permissions are in fact the same both places then 
it doesn't matter which way self-heal happens, or whether it happens at 
all.  In fact, it seems to me that AFR should be able to detect this 
particular condition and not flag it as an error.  I think you're 
probably fine here, but in general it's a very bad idea to clear these 
flags manually because it can cause updates to be lost (if self-heal 
goes the wrong way) or files to remain in an inconsistent state (if no 
self-heal occurs).
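For anyone who wants to check which counter is set before touching anything, the value that getfattr printed can be decoded with a few lines of Python.  This is only a sketch: decode_afr is a name I made up, and it assumes the three counters are stored in network (big-endian) byte order and that the "0s" prefix marks base64 as in getfattr's default output.

```python
import base64
import struct


def decode_afr(value: str) -> dict:
    """Decode a trusted.afr.* value as printed by getfattr (base64, '0s' prefix)."""
    if not value.startswith("0s"):
        raise ValueError("expected a base64 value with the '0s' prefix")
    raw = base64.b64decode(value[2:])
    if len(raw) != 12:
        raise ValueError("expected 12 bytes (three 32-bit counters)")
    # Assumption: counters are big-endian unsigned 32-bit integers.
    data, metadata, namespace = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "namespace": namespace}


if __name__ == "__main__":
    # Value reported for trusted.afr.glustervol1-client-2 on /mnt/glus1.
    print(decode_afr("0sAAAAAAAAAAEAAAAA"))
    # prints {'data': 0, 'metadata': 1, 'namespace': 0}
```

Under those assumptions only the metadata counter is non-zero, which is what points at a pending permission or attribute change rather than file contents.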

The real thing I'd wonder about is why both servers are so frequently 
becoming unavailable at the same instant (switch problem?) and why 
permission changes on the root are apparently so frequent that this often 
results in a split-brain.