I read on another thread about checking the getfattr output for each
brick, but it tailed off before any explanation of what to do with this
information.

We have 8 bricks in the volume. Config is:

g1:~ # gluster volume info glustervol1

Volume Name: glustervol1
Type: Distributed-Replicate
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: g1:/mnt/glus1
Brick2: g2:/mnt/glus1
Brick3: g3:/mnt/glus1
Brick4: g4:/mnt/glus1
Brick5: g1:/mnt/glus2
Brick6: g2:/mnt/glus2
Brick7: g3:/mnt/glus2
Brick8: g4:/mnt/glus2
Options Reconfigured:
performance.write-behind-window-size: 100mb
performance.cache-size: 512mb
performance.stat-prefetch: on

and the getfattr outputs are:

g1:~ # getfattr -d -e hex -m trusted.afr /mnt/glus1
getfattr: Removing leading '/' from absolute path names
# file: mnt/glus1
trusted.afr.glustervol1-client-0=0x000000000000000000000000
trusted.afr.glustervol1-client-1=0x000000000000000000000000

g1:~ # getfattr -d -e hex -m trusted.afr /mnt/glus2
getfattr: Removing leading '/' from absolute path names
# file: mnt/glus2
trusted.afr.glustervol1-client-4=0x000000000000000000000000
trusted.afr.glustervol1-client-5=0x000000000000000000000000

g2:~ # getfattr -d -e hex -m trusted.afr /mnt/glus1
getfattr: Removing leading '/' from absolute path names
# file: mnt/glus1
trusted.afr.glustervol1-client-0=0x000000000000000000000000
trusted.afr.glustervol1-client-1=0x000000000000000000000000

g2:~ # getfattr -d -e hex -m trusted.afr /mnt/glus2
getfattr: Removing leading '/' from absolute path names
# file: mnt/glus2
trusted.afr.glustervol1-client-4=0x000000000000000000000000
trusted.afr.glustervol1-client-5=0x000000000000000000000000

g3:~ # getfattr -d -e hex -m trusted.afr /mnt/glus1
getfattr: Removing leading '/' from absolute path names
# file: mnt/glus1
trusted.afr.glustervol1-client-2=0x000000000000000000000000
trusted.afr.glustervol1-client-3=0x000000000000000100000000

g3:~ # getfattr -d -e hex -m trusted.afr /mnt/glus2
getfattr: Removing leading '/' from absolute path names
# file: mnt/glus2
trusted.afr.glustervol1-client-6=0x000000000000000000000000
trusted.afr.glustervol1-client-7=0x000000000000000000000000

g4:~ # getfattr -d -e hex -m trusted.afr /mnt/glus1
getfattr: Removing leading '/' from absolute path names
# file: mnt/glus1
trusted.afr.glustervol1-client-2=0x000000000000000100000000
trusted.afr.glustervol1-client-3=0x000000000000000000000000

g4:~ # getfattr -d -e hex -m trusted.afr /mnt/glus2
getfattr: Removing leading '/' from absolute path names
# file: mnt/glus2
trusted.afr.glustervol1-client-6=0x000000000000000000000000
trusted.afr.glustervol1-client-7=0x000000000000000000000000
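From what I've pieced together so far (my own reading of the AFR docs and
source, so treat this as an assumption rather than gospel): each trusted.afr
value is 24 hex digits, i.e. three 32-bit counters of pending data, metadata
and entry operations, in that order, and trusted.afr.<vol>-client-N on a
brick is that brick's record of operations still owed to the brick behind
client-N (client-0 = Brick1, client-1 = Brick2, and so on). If that's right,
the two non-zero values above mean g3:/mnt/glus1 and g4:/mnt/glus1 (the
Brick3/Brick4 replica pair) each hold a pending metadata changelog of 1
against the other - they blame each other, which I gather is exactly what
AFR calls metadata split-brain, and it's on the brick roots, matching the
errors James is seeing on '/'.

To avoid eyeballing hex, I'm scanning the bricks with this little script
(my own, adjust the brick paths to your layout):

  for b in /mnt/glus1 /mnt/glus2; do
    getfattr -d -e hex -m trusted.afr "$b" 2>/dev/null |
    awk -F'=0x' -v brick="$b" '/^trusted.afr/ {
      d = substr($2, 1, 8); m = substr($2, 9, 8); e = substr($2, 17, 8)
      if ($2 != "000000000000000000000000")
        printf "%s %s: data=%s metadata=%s entry=%s <- pending\n", brick, $1, d, m, e
    }'
  done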
Hope someone can help. Things still seem to be working, but slowed down.

Cheers

David

On 26 January 2011 17:07, David Lloyd <david.lloyd at v-consultants.co.uk> wrote:

> We started getting the same problem at almost exactly the same time.
>
> I get one of these messages every time I access the root of the mounted
> volume (and nowhere else, I think).
> This is also 3.1.1.
>
> I'm just starting to look into it, I'll let you know if I get anywhere.
>
> David
>
> On 26 January 2011 16:38, Burnash, James <jburnash at knight.com> wrote:
>
>> These errors are appearing in the file /var/log/glusterfs/<mountpoint>.log
>>
>> [2011-01-26 11:02:10.342349] I [afr-common.c:672:afr_lookup_done]
>> pfs-ro1-replicate-5: split brain detected during lookup of /.
>> [2011-01-26 11:02:10.342366] I [afr-common.c:716:afr_lookup_done]
>> pfs-ro1-replicate-5: background meta-data data self-heal triggered.
>> path: /
>> [2011-01-26 11:02:10.342502] E
>> [afr-self-heal-metadata.c:524:afr_sh_metadata_fix] pfs-ro1-replicate-2:
>> Unable to self-heal permissions/ownership of '/' (possible split-brain).
>> Please fix the file on all backend volumes
>>
>> Apparently the issue is the root of the storage pool, which in my case on
>> the backend storage servers is this path:
>>
>> /export/read-only - permissions are:
>> drwxr-xr-x 12 root root 4096 Dec 28 12:09 /export/read-only/
>>
>> Installation is GlusterFS 3.1.1 on servers and clients, servers running
>> CentOS 5.5, clients running CentOS 5.2.
>>
>> The volume info header is below:
>>
>> Volume Name: pfs-ro1
>> Type: Distributed-Replicate
>> Status: Started
>> Number of Bricks: 10 x 2 = 20
>> Transport-type: tcp
>>
>> Any ideas? I don't see a permission issue on the directory or its subs
>> themselves.
>>
>> James Burnash, Unix Engineering
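P.S. For anyone finding this in the archives later: the manual fix I intend
to try, based on the 3.x split-brain notes I could find (again my reading,
not an official procedure - corrections welcome), is:

1. Compare the directory's permissions/ownership on both bricks of the
   affected pair and make them identical, keeping whichever copy is good.
2. On the brick whose copy you're discarding, zero the changelog entry that
   blames the good brick, so the good brick becomes the only accuser and
   therefore the self-heal source (see the sketch below).
3. Do a lookup from a client so the background self-heal runs again.

In our case, keeping g3's copy, I believe that would mean running this
on g4 (client-2 being g3:/mnt/glus1, the brick we keep):

  setfattr -n trusted.afr.glustervol1-client-2 -v 0x000000000000000000000000 /mnt/glus1

and then, on any client (substitute your actual mount point):

  stat /your/client/mountpoint

I'd save the current xattrs first (getfattr -d -m . -e hex /mnt/glus1, output
redirected somewhere safe) in case this reasoning turns out to be wrong.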