Hello all,
we have a problem on a geo-replicated volume after upgrading from
glusterfs 3.3.2 to 3.4.6 on ubuntu 12.04.5 lts.
for example, an 'ls -l' on the mounted geo-replicated volume does not
show the entire content, while the same command on the underlying bricks
does show the entire content.
the events in chronological order:
we are running a 6-node distributed replicated volume (vol1) which is
geo-replicated to a 4-node distributed replicated volume (vol2).
disk space on vol2 became insufficient, so we needed to add two further
nodes.
at that point vol1 and vol2 were both running on ubuntu 12.04 lts /
glusterfs 3.3.2.
we stopped the geo-replication, stopped vol2 and updated the nodes of
vol2 to the latest ubuntu 12.04.5 release (dist-upgrade) and to
glusterfs 3.4.6. all gluster clients which make use of vol2 were also
updated from glusterfs-client 3.3.2 to 3.4.6.
then we added two further bricks to vol2 with the same software level
(ubuntu 12.04.5 lts, glusterfs 3.4.6) as the first four nodes and started
the volume vol2 again.
afterwards we started a rebalance process on vol2 and the
geo-replication on the master node of vol1. a check script on the
geo-replication master copies a test file to vol1 or deletes it from
vol1, depending on whether that file exists on vol2. everything seemed to
be ok so far...
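
for illustration, a check of this kind could look roughly like the
following. this is only a minimal sketch and not our real script: the
function name and its three arguments (master mount point, slave mount
point, test file name) are made up for the example.

```bash
#!/usr/bin/env bash
# sketch only: toggle_testfile and its arguments are hypothetical names.
# toggle_testfile MASTER_DIR SLAVE_DIR NAME
#   - if NAME has already arrived on the slave, delete it on the master
#   - otherwise create it on the master so geo-replication has to copy it
toggle_testfile() {
    local master_dir=$1 slave_dir=$2 name=$3
    if [ -e "$slave_dir/$name" ]; then
        rm -f -- "$master_dir/$name"
        echo "deleted"
    else
        touch -- "$master_dir/$name"
        echo "created"
    fi
}
```

run periodically, the test file should keep appearing and disappearing on
vol2 as long as geo-replication is working.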
after the rebalance process had finished (without errors) we observed an
abnormality on vol2: the data on vol2 is somehow unevenly distributed.
the first two brick pairs show a brick usage of about 80% while the
newly added pair shows a brick usage of about 50%. so we restarted the
rebalance process twice, but nothing changed...
however, more critical than that is the fact that since the update and
expansion of vol2 we cannot see/access all files by default on the
mounted vol2, while the files are visible in their brick directories...
example 1:
vol1 contains 446 files/directories, for example directory /sdn/1051
vol1 is mounted on /sdn :
[ 15:54:28 ] - root@vol1 /sdn $ls -l | wc -l
446
[ 15:55:06 ] - root@vol1 /sdn $ls -l | grep 1051
drwxrwxrwx 5 1007 1013 12288 Jan 22 07:42 1051
[ 15:55:46 ] - root@vol1 /sdn $du -ks 1051
5588129 1051
[ 15:56:03 ] - root@vol1 /sdn $
vol2 contains 304 files/directories, but 1051 is not listed. when i run a
'du -ks /sdn/1051' or an 'ls -l /sdn/1051' on vol2, the directory becomes
visible...
vol2 is mounted to /sdn :
[ 15:54:35 ] - root@vol2 /sdn $ls | wc -l
304
[ 15:56:19 ] - root@vol2 /sdn $ls -l | grep 1051
[ 15:56:28 ] - root@vol2 /sdn $du -ks 1051
5588001 1051
[ 15:56:43 ] - root@vol2 /sdn $ls -l | grep 1051
drwxrwxrwx 5 1007 1013 8255 Apr 17 15:56 1051
[ 15:56:59 ] - root@vol2 /sdn $ls | wc -l
305
example 2:
directory 2098 is visible on the brick but not on the gluster volume.
after listing that directory explicitly it is visible on the gluster
volume again.
[ 16:11:00 ] - root@vol2 /sdn $ls | grep 2098
[ 16:12:21 ] - root@vol2 /sdn $ls -l /gluster-export/ | grep 2098
drwxrwxrwx 4 1015 1013 4096 Jan 18 03:07 2098
[ 16:12:28 ] - root@vol2 /sdn $ls -l /sdn/2098
...
[ 16:13:12 ] - root@vol2 /sdn $ls -l | grep 2098
drwxrwxrwx 4 1015 1013 8237 Apr 17 16:13 2098
[ 16:13:27 ] - root@vol2 /sdn $
[ 16:13:27 ] - root@vol2 /sdn $ls | wc -l
306
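
to find all affected top-level entries at once, the brick listing can be
compared with the listing of the mounted volume. the following is only a
sketch: the function name missing_on_mount is made up, and it assumes
that names contain no newlines. on a distributed volume only the
directories are expected on every brick, so for plain files a single
brick is not a complete reference.

```bash
#!/usr/bin/env bash
# sketch only: missing_on_mount is a hypothetical helper name.
# missing_on_mount BRICK_DIR MOUNT_DIR
#   prints entries that exist in BRICK_DIR but are not returned by a plain
#   listing of MOUNT_DIR. hidden entries such as .glusterfs are skipped
#   because ls is called without -a.
missing_on_mount() {
    (
        # use one fixed collation for both sort and comm
        export LC_ALL=C
        comm -23 <(ls -1 -- "$1" | sort) <(ls -1 -- "$2" | sort)
    )
}

# usage with the paths from the examples above:
#   missing_on_mount /gluster-export /sdn
```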
i did not find any helpful hints in the gluster logs. currently i
frequently see the following messages, but the missing directories on
vol2 are not mentioned :
vol2 :
$tail -f sdn.log
[2015-04-17 14:00:14.816730] I [dht-layout.c:726:dht_layout_dir_mismatch] 1-aut-wien-01-dht: /1011 - disk layout missing
[2015-04-17 14:00:14.816745] I [dht-common.c:638:dht_revalidate_cbk] 1-aut-wien-01-dht: mismatching layouts for /1011
[2015-04-17 14:00:14.817590] I [dht-layout.c:726:dht_layout_dir_mismatch] 1-aut-wien-01-dht: /1005 - disk layout missing
[2015-04-17 14:00:14.817602] I [dht-common.c:638:dht_revalidate_cbk] 1-aut-wien-01-dht: mismatching layouts for /1005
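
to get an overview of which directories these messages refer to, the log
can be reduced to a unique list of paths. this is only a sketch: the
function name layout_mismatch_dirs is made up, and it assumes that each
log entry is on a single line in the log file (the entries in this mail
may be wrapped by the mail client).

```bash
#!/usr/bin/env bash
# sketch only: layout_mismatch_dirs is a hypothetical helper name.
# layout_mismatch_dirs LOGFILE
#   prints every path named in a "mismatching layouts for <path>" message,
#   once per path, sorted.
layout_mismatch_dirs() {
    sed -n 's/.*mismatching layouts for \(.*\)$/\1/p' "$1" | sort -u
}

# usage:
#   layout_mismatch_dirs sdn.log
```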
vol1 is slightly smaller than vol2 in terms of used space. all nodes are
using the same disk configuration and all bricks are xfs-formatted.
df -m :
vol1:/vol1 57217563 39230421 17987143 69% /sdn
vol2:/vol2 57217563 40399541 16818023 71% /sdn
currently i'm confused because i don't know the reason for this behaviour...
i guess it was not a good idea to update the geo-replication slave to
3.4.6 while the master is still running 3.3.2, but I'm not sure.
possibly there is an issue with 3.4.6 itself and geo-replication does
not have any influence on that.
for now i have stopped the geo-replication.
can somebody point me to the cause, or does anyone have helpful hints on
what to do next...?
best regards
dietmar
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users