Server outage, file sync/self-heal doesn't sync ALL files?!

Hi all!

We have another incident over here.

One of the servers (pserver12) in a pair (12 & 13) was rebooted.
After the 2 h outage, pserver13 showed 63 files not in sync.

Both servers are clients as well.

Starting pserver12 brought up the self-heal mechanism, but self-heal was
triggered for only 39 files within the first 10 min. Now the system seems
dormant and 24 files are left hanging.

On the other three servers no inconsistencies are seen.

tail of client log file:

[2011-04-29 14:48:23.820022] I 
[afr-self-heal-algorithm.c:526:sh_diff_loop_driver_done] 
0-storage0-replicate-2: diff self-heal on /pserver13-17: 1960 blocks of 
22736 were different (8.62%)
[2011-04-29 14:48:23.887651] E [afr-common.c:110:afr_set_split_brain] 
0-storage0-replicate-2: invalid argument: inode
[2011-04-29 14:48:23.887740] I 
[afr-self-heal-common.c:1527:afr_self_heal_completion_cbk] 
0-storage0-replicate-2: background  data self-heal completed on 
/pserver13-17
[2011-04-29 14:48:24.272220] I 
[afr-self-heal-algorithm.c:526:sh_diff_loop_driver_done] 
0-storage0-replicate-2: diff self-heal on /pserver13-19: 1960 blocks of 
22744 were different (8.62%)
[2011-04-29 14:48:24.341868] E [afr-common.c:110:afr_set_split_brain] 
0-storage0-replicate-2: invalid argument: inode
[2011-04-29 14:48:24.341959] I 
[afr-self-heal-common.c:1527:afr_self_heal_completion_cbk] 
0-storage0-replicate-2: background  data self-heal completed on 
/pserver13-19
[2011-04-29 14:48:24.758131] I 
[afr-self-heal-algorithm.c:526:sh_diff_loop_driver_done] 
0-storage0-replicate-2: diff self-heal on /pserver13-23: 1952 blocks of 
22752 were different (8.58%)
[2011-04-29 14:48:24.766054] E [afr-common.c:110:afr_set_split_brain] 
0-storage0-replicate-2: invalid argument: inode
[2011-04-29 14:48:24.766137] I 
[afr-self-heal-common.c:1527:afr_self_heal_completion_cbk] 
0-storage0-replicate-2: background  data self-heal completed on 
/pserver13-23
[2011-04-29 14:48:24.884613] I 
[afr-self-heal-algorithm.c:526:sh_diff_loop_driver_done] 
0-storage0-replicate-2: diff self-heal on /pserver13-10: 1952 blocks of 
22760 were different (8.58%)
[2011-04-29 14:48:24.895631] E [afr-common.c:110:afr_set_split_brain] 
0-storage0-replicate-2: invalid argument: inode
[2011-04-29 14:48:24.895721] I 
[afr-self-heal-common.c:1527:afr_self_heal_completion_cbk] 
0-storage0-replicate-2: background  data self-heal completed on 
/pserver13-10
0 root@pserver13:/var/log/glusterfs # date
Fri Apr 29 15:08:18 UTC 2011
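
To tally which paths the client log reports as healed, the messages above
can be parsed mechanically. This is only a sketch: it assumes each log
entry sits on a single line in the real log file (the entries above are
wrapped by the mail client) and that paths contain no spaces. The function
name summarize_self_heal is made up for illustration.

```python
import re

# "diff self-heal on <path>: <n> blocks of <m> were different"
DIFF_RE = re.compile(
    r"diff self-heal on (?P<path>\S+): "
    r"(?P<diff>\d+) blocks of (?P<total>\d+) were different"
)
# "background  data self-heal completed on <path>" (spacing varies)
DONE_RE = re.compile(r"background\s+data self-heal completed on (?P<path>\S+)")


def summarize_self_heal(log_lines):
    """Return (diff_stats, completed) extracted from client log lines."""
    diff_stats = {}    # path -> (differing blocks, total blocks)
    completed = set()  # paths with a completion message
    for line in log_lines:
        m = DIFF_RE.search(line)
        if m:
            diff_stats[m.group("path")] = (
                int(m.group("diff")),
                int(m.group("total")),
            )
        m = DONE_RE.search(line)
        if m:
            completed.add(m.group("path"))
    return diff_stats, completed


# Two entries from the log tail above, joined back onto single lines.
sample = [
    "[2011-04-29 14:48:23.820022] I "
    "[afr-self-heal-algorithm.c:526:sh_diff_loop_driver_done] "
    "0-storage0-replicate-2: diff self-heal on /pserver13-17: "
    "1960 blocks of 22736 were different (8.62%)",
    "[2011-04-29 14:48:23.887740] I "
    "[afr-self-heal-common.c:1527:afr_self_heal_completion_cbk] "
    "0-storage0-replicate-2: background  data self-heal completed on "
    "/pserver13-17",
]
stats, done = summarize_self_heal(sample)
print(stats, done)
```

Running this over the whole client log and taking len(done) gives the number
of files the client believes it has healed, which can then be compared with
the getfattr counts.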


Search for mismatch:

0 root@pserver13:~ # getfattr -R -d -e hex -m "trusted.afr." \
    /mnt/gluster/brick?/storage | grep -v 0x000000000000000000000000 | \
    grep -B1 -A1 trusted | grep -c file
getfattr: Removing leading '/' from absolute path names
*24*
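
The same count can be cross-checked without the grep chain by parsing a
saved copy of the getfattr dump. A minimal sketch, assuming the dump was
captured with the same flags as above (-d -e hex -m "trusted.afr."); the
function name pending_files, the client-5 attribute and the "clean-file"
entry in the sample are invented for illustration.

```python
def pending_files(getfattr_dump):
    """Map each file in a getfattr dump to its non-zero trusted.afr.* values."""
    pending = {}
    current = None
    for raw in getfattr_dump.splitlines():
        line = raw.strip()
        if line.startswith("# file:"):
            # Start of a new record; remember which file it belongs to.
            current = line[len("# file:"):].strip()
        elif line.startswith("trusted.afr.") and "=" in line and current:
            name, value = line.split("=", 1)
            # int(..., 16) accepts the "0x" prefix; zero means nothing pending.
            if int(value, 16) != 0:
                pending.setdefault(current, {})[name] = value
    return pending


# Hypothetical sample: the first record mirrors the real output format,
# the client-5 line and the second record are made up.
dump = """\
# file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-33
trusted.afr.storage0-client-4=0x270000010000000000000000
trusted.afr.storage0-client-5=0x000000000000000000000000

# file: mnt/gluster/brick0/storage/clean-file
trusted.afr.storage0-client-4=0x000000000000000000000000
"""
result = pending_files(dump)
print(len(result), "file(s) with pending changelog entries")
```

Unlike "grep -c file", this counts a path only when one of its attributes
is actually non-zero, so a path that happens to contain the word "file"
cannot skew the number.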


0 root@pserver13:~ # getfattr -R -d -e hex -m "trusted.afr." \
    /mnt/gluster/brick?/storage | grep -v 0x000000000000000000000000 | \
    grep -B1 trusted
getfattr: Removing leading '/' from absolute path names
# file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-33
trusted.afr.storage0-client-4=0x270000010000000000000000
--
# file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-26
trusted.afr.storage0-client-4=0x270000010000000000000000
--
# file: mnt/gluster/brick0/storage/images/1959/cd55c5f3-9aa1-bfd9-99a0-01c13a7d8559/hdd-images
trusted.afr.storage0-client-4=0x000000000000001600000001
--
# file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-24
trusted.afr.storage0-client-4=0x270000010000000000000000
--
# file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-8
trusted.afr.storage0-client-4=0x270000010000000000000000
--
# file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-21
trusted.afr.storage0-client-4=0x270000010000000000000000
--
# file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-22
trusted.afr.storage0-client-4=0x270000010000000000000000
--
# file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-30
trusted.afr.storage0-client-4=0x270000010000000000000000
--
# file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-20
trusted.afr.storage0-client-4=0x270000010000000000000000
--
# file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-9
trusted.afr.storage0-client-4=0x270000010000000000000000
--
# file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-38
trusted.afr.storage0-client-4=0x270000010000000000000000
--
# file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-18
trusted.afr.storage0-client-6=0x270000010000000000000000
--
# file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-2
trusted.afr.storage0-client-6=0x270000010000000000000000
--
# file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-23
trusted.afr.storage0-client-6=0x270000010000000000000000
--
# file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-4
trusted.afr.storage0-client-6=0x270000010000000000000000
--
# file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-3
trusted.afr.storage0-client-6=0x270000010000000000000000
--
# file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-34
trusted.afr.storage0-client-6=0x270000010000000000000000
--
# file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-37
trusted.afr.storage0-client-6=0x270000010000000000000000
--
# file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-12
trusted.afr.storage0-client-6=0x270000010000000000000000
--
# file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-27
trusted.afr.storage0-client-6=0x270000010000000000000000
--
# file: mnt/gluster/brick1/storage/images/1831/9a039a81-60fe-5fa3-f562-8f6d3828382b/hdd-images/13169
trusted.afr.storage0-client-6=0x100000020000000000000000
--
# file: mnt/gluster/brick1/storage/images/1959/cd55c5f3-9aa1-bfd9-99a0-01c13a7d8559/hdd-images
trusted.afr.storage0-client-6=0x000000000000001600000002
--
# file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-25
trusted.afr.storage0-client-6=0x270000010000000000000000
--
# file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-7
trusted.afr.storage0-client-6=0x270000010000000000000000
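
For reading the values above: as far as I understand the AFR changelog
layout, each trusted.afr.* value is 12 bytes holding three 32-bit
big-endian counters for pending data, metadata and entry operations, in
that order. A small decoder under that assumption (the function name
decode_afr_changelog is made up):

```python
import struct


def decode_afr_changelog(hex_value):
    """Split a 12-byte trusted.afr.* value into its three counters.

    Assumes the layout data / metadata / entry, each a 32-bit
    big-endian unsigned integer.
    """
    digits = hex_value[2:] if hex_value.lower().startswith("0x") else hex_value
    raw = bytes.fromhex(digits)
    if len(raw) != 12:
        raise ValueError("expected 12 bytes, got %d" % len(raw))
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}


# Values taken from the getfattr output above.
for value in (
    "0x270000010000000000000000",
    "0x000000000000001600000001",
    "0x100000020000000000000000",
):
    print(value, decode_afr_changelog(value))
```

Under this reading the plain files carry a non-zero data counter only,
while the two hdd-images directories carry metadata and entry counters.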



I could trigger the heal manually, but why isn't the sync/self-heal working
on all files shown as inconsistent? Or am I assuming something wrong here?
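
For the manual trigger, a sketch of what I have in mind: stat each
affected file through the client mount, since a lookup from a client is
what starts self-heal on a replicated file in this release, as far as I
know. The helper name trigger_lookup and the client mount point
/mnt/storage in the usage comment are assumptions, not taken from our
setup.

```python
import os


def trigger_lookup(brick_paths, brick_root, mount_root):
    """stat() each file through the client mount.

    brick_paths: paths as reported on the brick (e.g. by getfattr)
    brick_root:  brick directory those paths live under
    mount_root:  client mount point of the volume (assumed)
    Returns the client paths that could not be stat()ed.
    """
    failed = []
    for brick_path in brick_paths:
        # Translate the brick-side path into the matching client-side path.
        rel = os.path.relpath(brick_path, brick_root)
        client_path = os.path.join(mount_root, rel)
        try:
            os.stat(client_path)
        except OSError:
            failed.append(client_path)
    return failed


# Example call (not executed here; /mnt/storage is a hypothetical mount):
# trigger_lookup(
#     ["mnt/gluster/brick0/storage/de-dc1-c1-pserver5-33"],
#     "mnt/gluster/brick0/storage",
#     "/mnt/storage",
# )
```

The stat must go through the client mount, not the brick directory,
because only access through the client passes the replicate translator.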

Best, Martin


