Hello,
I'm looking into a use case, a rolling upgrade of Gluster cluster nodes, and my problem is that a node I take out of the cluster may still hold data it hasn't pushed to the rest of the cluster, even though I do a clean shutdown like "sync; service glusterd stop".
The reason for taking a node out might be to install more memory, disks and such, so quite normal maintenance all in all.
During the rolling upgrade the Gluster volume the nodes serve stays in use; data will be incoming and outgoing.
I've found an easy test case using two VMs that each act as a Gluster cluster node.
I create the volume as:
gluster volume create test-volume replica 2 transport tcp 192.168.0.1:/export/brick 192.168.0.9:/export/brick
From the guides etc. it seems like a common, basic setup.
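(For completeness: after creating it, the volume of course has to be started and mounted before the test; on my setup that is roughly the following, with the same address and mount point as in the commands further down.)
gluster volume start test-volume
# on each node, mount the volume locally through the FUSE client
mount -t glusterfs 192.168.0.9:test-volume /import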
After making sure everything works OK in normal mode with both servers alive, I shut down one node. Everything is still fine of course; the remaining node keeps serving the volume while the clients push data to it. Now, after we have maintained the first node, we bring it up again and it rejoins the cluster OK (replication starts). So now we want to maintain the second node and bring it down. Unfortunately, data it has on the volume might not have made it over to the first node before we stop it. I can see that because I check the md5sum of a data file just written to the volume from the node I am about to shut down, and at the same time check the md5sum of the file as it is seen on the node that was just maintained.
Here is how I do this. I start with node1 up and node2 down (simulating it being down for maintenance).
# node1
dd if=/dev/urandom of=/import/datafil bs=65535 count=4096; md5sum /import/datafil; ls -l /import/datafil; sync; umount /import; sync; service glusterd stop
# the above takes about 35 sec to finish, so after roughly 10 sec I start glusterd on the second node, simulating node2 coming back from maintenance.
# node2
service glusterd start; sleep 3; mount -t glusterfs 192.168.0.9:test-volume /import; while true; do md5sum /import/datafil; ls -l /import/datafil; sleep 1; done
root@p1-sr0-sl1:/var/log/glusterfs# dd if=/dev/urandom of=/import/datafil bs=65535 count=4096; md5sum /import/datafil; ls -l /import/datafil; /root/filesync /import/datafil; umount /import; service glusterd stop
4096+0 records in
4096+0 records out
268431360 bytes (268 MB) copied, 35.6098 s, 7.5 MB/s
6f7e441ccd11f8679ec824aafda56abc /import/datafil
-rw-r--r-- 1 root root 268431360 Apr 23 11:41 /import/datafil
fsync of /import/datafil ... done
Stopping glusterd:
root@p1-sr0-sl1:/var/log/glusterfs#
and on node2:
root@p1-sr0-sl9:~# service glusterd start; sleep 2; mount -t glusterfs 192.168.0.9:test-volume /import; while true; do md5sum /import/datafil; ls -l /import/datafil; sleep 1; done
Starting glusterd:
d05c8177b981b921b0c56980eaf3e33e /import/datafil
-rw-r--r-- 1 root root 172750260 Apr 23 11:41 /import/datafil
1d0cf10228cb341290fa43094cc67edf /import/datafil
-rw-r--r-- 1 root root 207221670 Apr 23 11:41 /import/datafil
f9a0f254c3239c6d8ebad1be05b27bf7 /import/datafil
-rw-r--r-- 1 root root 242413965 Apr 23 11:41 /import/datafil
md5sum: /import/datafil: Transport endpoint is not connected
-rw-r--r-- 1 root root 268431360 Apr 23 11:41 /import/datafil
e0d7bd9fa1fce24d65ccf89b8217231f /import/datafil
-rw-r--r-- 1 root root 268431360 Apr 23 11:41 /import/datafil
e0d7bd9fa1fce24d65ccf89b8217231f /import/datafil
-rw-r--r-- 1 root root 268431360 Apr 23 11:41 /import/datafil
....
So the sums do not match. It is quite obvious we got more and more data from node1 until it went down for maintenance (no more data is appended to the file after "Transport endpoint is not connected", and the md5sum stays the same).
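As a side note, one way I can check which brick actually holds the complete copy, and whether the replicate translator still has a heal queued, is to look at the bricks directly on each node. A rough sketch, assuming the brick path from the volume create above and the trusted.afr.<volume>-client-N changelog xattrs that AFR keeps on each file:
# run on each node, against the brick itself rather than the FUSE mount
md5sum /export/brick/datafil
ls -l /export/brick/datafil
# dump the xattrs; non-zero trusted.afr.test-volume-client-* values
# mean pending changes that still have to be healed to the other brick
getfattr -d -m . -e hex /export/brick/datafil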
Now if I start glusterd on node1 again:
root@p1-sr0-sl1:/var/log/glusterfs# service glusterd start
Starting glusterd:
root@p1-sr0-sl1:/var/log/glusterfs#
After a while it syncs up OK on node2:
...
e0d7bd9fa1fce24d65ccf89b8217231f /import/datafil
-rw-r--r-- 1 root root 268431360 Apr 23 11:41 /import/datafil
e0d7bd9fa1fce24d65ccf89b8217231f /import/datafil
-rw-r--r-- 1 root root 268431360 Apr 23 11:41 /import/datafil
99e935937799cba1edaab3aed622798a /import/datafil
-rw-r--r-- 1 root root 268431360 Apr 23 11:42 /import/datafil
2d6a9afbd3f8517baab5622d1337826f /import/datafil
-rw-r--r-- 1 root root 268431360 Apr 23 11:42 /import/datafil
badce4130e98cbe6675793680c6bf3d7 /import/datafil
-rw-r--r-- 1 root root 268431360 Apr 23 11:42 /import/datafil
6f7e441ccd11f8679ec824aafda56abc /import/datafil
-rw-r--r-- 1 root root 268431360 Apr 23 11:42 /import/datafil
6f7e441ccd11f8679ec824aafda56abc /import/datafil
-rw-r--r-- 1 root root 268431360 Apr 23 11:42 /import/datafil
6f7e441ccd11f8679ec824aafda56abc /import/datafil
-rw-r--r-- 1 root root 268431360 Apr 23 11:41 /import/datafil
6f7e441ccd11f8679ec824aafda56abc /import/datafil
-rw-r--r-- 1 root root 268431360 Apr 23 11:41 /import/datafil
6f7e441ccd11f8679ec824aafda56abc /import/datafil
-rw-r--r-- 1 root root 268431360 Apr 23 11:41 /import/datafil
^C
root@p1-sr0-sl9:~#
The metadata syncs, it seems (the file has the correct length and date), but the file content is just zeros from the point where it didn't manage to shuffle the data from node1 over to node2. It surprises me a bit that a glusterd node is allowed to leave the cluster without having written all of its locally unique data to the remaining nodes (in this case just one). Think of the same scenario with an NFS server: if a client has mounted the filesystem and pushed some data to it, we cannot unmount until all of it has been written cleanly to the NFS server.
It seems like sync is basically a no-op on GlusterFS volumes, which is probably by design, as can be understood from this (admittedly a bit old) thread:
So sync is not supposed to "spread into" FUSE userspace filesystems (is that still true?),
and thus we don't get a full sync done.
So, the question is: how do you know when a glusterd node is ready to be shut down (i.e. it no longer holds any locally unique data)?
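One partial answer I can think of is to wait until the self-heal daemon reports nothing pending before stopping glusterd; a sketch, assuming the "Number of entries:" lines that "gluster volume heal <volume> info" prints in 3.4:
# on the node about to go down
while true; do
    pending=$(gluster volume heal test-volume info | awk '/Number of entries:/ {sum += $NF} END {print sum+0}')
    [ "$pending" -eq 0 ] && break
    echo "$pending entries still pending heal, waiting..."
    sleep 5
done
sync
service glusterd stop
But that only covers data that AFR has already marked as needing heal on the bricks; whatever is still in flight in the client stack is a separate problem, which is where the syncfs() work mentioned below comes in.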
Is anyone else caring about this? What are your recipes for managing a nice, clean GlusterFS node shutdown (while still being sure the data is correct at all times)?
We are running glusterfs 3.4.2 on a 3.4-ish Linux kernel. I've ported the FUSE kernel/userland parts from the thread mentioned above so that syncfs() now reaches glusterfs. I've also started some tweaking of glusterfs in an attempt to have it flush its locally unique data when this syncfs() arrives.
I wonder if anyone else is looking into this area?
Per