On 1/8/13 12:44 PM, Tomasz Chmielewski wrote:
> On 01/08/2013 06:33 PM, Tomasz Chmielewski wrote:
>
>> [root at ca2.sg1 /]# attr -l /data/gluster/lfd/techstudiolfc/pub
>>
>> Attribute "gfid" has a 16 byte value for
>> /data/gluster/lfd/techstudiolfc/pub
>>
>> Attribute "afr.shared-client-0" has a 12 byte value for
>> /data/gluster/lfd/techstudiolfc/pub
>>
>> Attribute "afr.shared-client-1" has a 12 byte value for
>> /data/gluster/lfd/techstudiolfc/pub
>
> Perhaps that would be useful, too - it differs on both servers
> (trusted.afr.shared-client-0 and trusted.afr.shared-client-1). What's its
> meaning? What situations could lead to them being different?
>
>
> [root at ca1.sg1 /]# getfattr -m . -d -e hex
> /data/gluster/lfd/techstudiolfc/pub
> getfattr: Removing leading '/' from absolute path names
> # file: data/gluster/lfd/techstudiolfc/pub
> trusted.afr.shared-client-0=0x000000000000000000000000
> trusted.afr.shared-client-1=0x000000000000001d00000000
> trusted.gfid=0x3700ee06f8f74ebc853ee8277c107ec2
>
>
> [root at ca2.sg1 /]# getfattr -m . -d -e hex
> /data/gluster/lfd/techstudiolfc/pub
> getfattr: Removing leading '/' from absolute path names
> # file: data/gluster/lfd/techstudiolfc/pub
> trusted.afr.shared-client-0=0x000000000000000300000000
> trusted.afr.shared-client-1=0x000000000000000000000000
> trusted.gfid=0x3700ee06f8f74ebc853ee8277c107ec2

I've written about these values at [1] and [2] if you want to understand
them at a conceptual level, but I'll also try to boil it down a bit.

The trusted.afr.shared-client-N values are arrays of three 32-bit counters
for how many updates we believe are still pending at each brick. Here we see
that ca1 (presumably corresponding to client-0) has a count of 0x1d for
client-1 (presumably corresponding to ca2). In other words, ca1 saw 29
updates that it doesn't know completed at ca2. At the same time, ca2 saw 3
operations that it doesn't know completed at ca1.
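
To make the decoding concrete, here is a rough Python sketch (not taken from
the GlusterFS code) that splits one of the hex values printed by getfattr
into its counters. It assumes each 12-byte value is three big-endian
unsigned 32-bit counters. The names decode_afr and FIELDS are made up for
illustration, and the labels "data" and "entry" for the first and third
counters are my assumption about the layout; the middle one is the metadata
counter.

```python
import struct

# Assumed layout: three big-endian unsigned 32-bit counters.
# "metadata" is the middle counter; "data" and "entry" are assumed labels
# for the first and third.
FIELDS = ("data", "metadata", "entry")


def decode_afr(hex_value):
    """Split a 12-byte trusted.afr.* value (getfattr -e hex form) into counters."""
    digits = hex_value[2:] if hex_value.startswith("0x") else hex_value
    raw = bytes.fromhex(digits)
    if len(raw) != 12:
        raise ValueError("expected a 12-byte value, got %d bytes" % len(raw))
    return dict(zip(FIELDS, struct.unpack(">3I", raw)))


# Values copied from the getfattr output quoted above.
on_ca1 = decode_afr("0x000000000000001d00000000")  # shared-client-1 as seen on ca1
on_ca2 = decode_afr("0x000000000000000300000000")  # shared-client-0 as seen on ca2

print(on_ca1)  # {'data': 0, 'metadata': 29, 'entry': 0}
print(on_ca2)  # {'data': 0, 'metadata': 3, 'entry': 0}
```

Only the middle counter is non-zero on either side, which matches the 29
and 3 pending operations mentioned above.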

When there seem to be updates that need to be propagated in both directions,
we don't know which ones should supersede which others, so we call it split
brain and decline to do anything lest we cause data loss.

In this particular case it's the middle counter that's affected - metadata
such as ownership or permissions rather than file contents. That means the
conflicting operations are not writes but chown, chmod, etc. This raises two
questions:

 * Why are operations being seen as having been initiated but not completed?
   Possible reasons include frequent network problems or server crashes.

 * Why are there so many metadata operations, when those are usually rare?

I can't answer these questions for you, but perhaps they'll give you some
ideas for what to investigate next. If you're still stumped, let us know and
we'll see what else we can think of.

[1] http://hekafs.org/index.php/2011/04/glusterfs-extended-attributes/
[2] http://hekafs.org/index.php/2012/03/glusterfs-algorithms-replication-present/