On 01/26/2011 07:25 PM, David Lloyd wrote:
> Well, I did this and it seems to have worked. I was just guessing really,
> didn't have any documentation or advice from anyone in the know.
>
> I just reset the attributes on the root directory for each brick that was
> not all zeroes.
>
> I found it easier to dump the attributes without the '-e hex'
>
> g4:~ # getfattr -d -m trusted.afr /mnt/glus1 /mnt/glus2
> getfattr: Removing leading '/' from absolute path names
> # file: mnt/glus1
> trusted.afr.glustervol1-client-2=0sAAAAAAAAAAEAAAAA
> trusted.afr.glustervol1-client-3=0sAAAAAAAAAAAAAAAA
>
> Then
> setfattr -n trusted.afr.glustervol1-client-2 -v 0sAAAAAAAAAAAAAAAA
> /mnt/glus1
>
> I did that on all the bricks that didn't have all A's
>
> next time i stat-ed the root of the filesystem on the client the self heal
> worked ok.
>
> I'm not comfortable advising you to do this as I'm really feeling my way
> here, but it looks as though it worked for me.

This seems really dangerous to me. On a brick xxx, the trusted.afr.yyy attribute consists of three unsigned 32-bit counters, indicating how many uncommitted operations (data, metadata, and namespace respectively) might exist at yyy. If xxx shows uncommitted operations at yyy but not vice versa, then we know that xxx is more up to date and it should be the source for self-heal.

If two bricks show uncommitted operations at each other, then we're in the infamous "split brain" scenario. Some client was unable to clear the counter at xxx while another was unable to clear it at yyy, or both xxx and yyy went down after the operation was complete but before they could clear the counters for each other.

In this case, it looks like a metadata operation (permission change) was in this state. If the permissions are in fact the same both places then it doesn't matter which way self-heal happens, or whether it happens at all. In fact, it seems to me that AFR should be able to detect this particular condition and not flag it as an error.
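
To make the counter layout concrete, here is a rough Python sketch that decodes the values from the getfattr output quoted above. It assumes the 12 bytes are three big-endian 32-bit counters in the order data, metadata, namespace, which matches the reading that a metadata operation was pending. The names Pending, decode_afr and heal_direction are made up for illustration, and heal_direction is only a simplified restatement of the rule described above, not the actual AFR code.

```python
import base64
import struct
from typing import NamedTuple


class Pending(NamedTuple):
    """Hypothetical container for the three pending-operation counters."""
    data: int
    metadata: int
    namespace: int


def decode_afr(value: str) -> Pending:
    """Decode a getfattr value such as '0sAAAAAAAAAAEAAAAA'.

    getfattr marks base64-encoded values with a '0s' prefix.
    Assumption: the payload is three big-endian unsigned 32-bit integers.
    """
    if not value.startswith("0s"):
        raise ValueError("expected a base64 value with a '0s' prefix")
    raw = base64.b64decode(value[2:])
    if len(raw) != 12:
        raise ValueError(f"expected 12 bytes, got {len(raw)}")
    return Pending(*struct.unpack(">III", raw))


def heal_direction(x_about_y: Pending, y_about_x: Pending) -> str:
    """Simplified version of the rule in the text, not the real AFR logic.

    x_about_y is the trusted.afr.yyy value stored on brick xxx, and
    y_about_x is the trusted.afr.xxx value stored on brick yyy.
    """
    x_accuses = any(x_about_y)
    y_accuses = any(y_about_x)
    if x_accuses and y_accuses:
        return "split-brain"
    if x_accuses:
        return "heal from xxx to yyy"
    if y_accuses:
        return "heal from yyy to xxx"
    return "clean"


# Values taken from the quoted getfattr output for /mnt/glus1.
print(decode_afr("0sAAAAAAAAAAEAAAAA"))  # Pending(data=0, metadata=1, namespace=0)
print(decode_afr("0sAAAAAAAAAAAAAAAA"))  # Pending(data=0, metadata=0, namespace=0)
```

Decoding the value before touching it at least tells you which kind of operation is pending and on which brick, which is the information that gets thrown away when the attribute is reset to all zeroes.
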
In any case, I think you're probably fine this time, but in general it's a very bad idea to clear these flags manually because it can cause updates to be lost (if self-heal goes the wrong way) or files to remain in an inconsistent state (if no self-heal occurs). The real thing I'd wonder about is why both servers are so frequently becoming unavailable at the same instant (switch problem?) and why permission changes on the root are apparently so frequent that this often results in a split-brain.