Re: Testing failover and recovery

Hello,

Interesting; there seem to be several of us users with issues regarding recovery, but few to no replies... ;-)

I did some more testing over the weekend. Same initial workload (two glusterfs servers, one client that continuously
updates a file with timestamps) and then two simple testcases:

1. one of the glusterfs servers is constantly rebooting (just an initscript that sleeps for 60 seconds before issuing "reboot", sketched below the list)

2. similar to 1, but instead of rebooting itself, each server reboots the other, so the result is that a server
    comes up, waits for a bit and then reboots the other server.
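
For clarity, the initscript in testcase 1 is nothing more than a sketch like this (in testcase 2 the reboot command is simply issued towards the other server instead):

#!/bin/sh
# testcase 1: let the node settle for a minute after boot, then reboot it again
sleep 60
reboot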

During the whole weekend this has progressed nicely. The client runs all the time without issues, and the glusterfs server
that comes back (either the same one or alternating servers, depending on the testcase shown above) actively gets back into
sync and updates its copy of the file.

So it seems to me that we need to look deeper into the recovery case (of course, but it is interesting to know about the
nice & easy usecases as well). I'm surprised that recovery from a failover (to restore the redundancy) isn't getting
more attention here. Are we (and others that have difficulties in this area) running an unusual usecase?

BR,
Per


On Wed, Dec 4, 2013 at 12:17 PM, Per Hallsmark <per@xxxxxxxxxxxx> wrote:
Hello,

I've found GlusterFS to be an interesting project. I don't have much experience with it
(although I do from similar usecases with DRBD+NFS setups), so I set up some
testcases to try out failover and recovery.

For this I have a setup with two glusterfs servers (each is a VM) and one client (also a VM).
I'm using GlusterFS 3.4 btw.

The servers manage a gluster volume created as:

gluster volume create testvol rep 2 transport tcp gs1:/export/vda1/brick gs2:/export/vda1/brick
gluster volume start testvol
gluster volume set testvol network.ping-timeout 5
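
The volume can then be checked from either server with the usual commands:

gluster volume info testvol
gluster volume status testvol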

Then the client mounts this volume as:
mount -t glusterfs gs1:/testvol /import/testvol
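
(For completeness, the same mount as an /etc/fstab entry would be something like the line below; the backupvolfile-server option is supposed to let the client fetch the volfile from gs2 if gs1 happens to be down at mount time, though I haven't double-checked that option name on 3.4.)

gs1:/testvol  /import/testvol  glusterfs  defaults,_netdev,backupvolfile-server=gs2  0 0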

Everything seems to work well in normal usecases: I can write/read to the volume, take servers down and up again, etc.

As a fault scenario, I'm testing a fault injection like this:

1. continuously writing timestamps to a file on the volume from the client. It is automated in a small test script like:
:~/glusterfs-test$ cat scripts/test-gfs-client.sh 
#!/bin/sh

# Append a timestamp to a file on the gluster mount once per second and
# print the latest timestamp together with the file's md5sum.

gfs=/import/testvol

while true; do
	date +%s >> $gfs/timestamp.txt
	ts=`tail -1 $gfs/timestamp.txt`
	md5sum=`md5sum $gfs/timestamp.txt | cut -f1 -d" "`
	echo "Timestamp = $ts, md5sum = $md5sum"
	sleep 1
done
:~/glusterfs-test$

As can be seen, the client is a quite simple user of the glusterfs volume: low data rate and a single writer, for example.


2. disabling ethernet in one of the VMs (ifconfig eth0 down) to simulate a broken network

3. After a short while, the failed server is brought alive again (ifconfig eth0 up)

Steps 2 and 3 are also automated in a test script like:

:~/glusterfs-test$ cat scripts/fault-injection.sh 
#!/bin/sh

# fault injection script tailored for two glusterfs nodes named gs1 and gs2

# figure out which node is the peer
if [ "$HOSTNAME" = "gs1" ]; then
	peer="gs2"
else
	peer="gs1"
fi

# take the local network down for 10 seconds to simulate a broken network
inject_eth_fault() {
	echo "network down..."
	ifconfig eth0 down
	sleep 10
	ifconfig eth0 up
	echo "... and network up again."
}

# restart glusterd to get the node actively back into the cluster
recover() {
	echo "recovering from fault..."
	service glusterd restart
}

# every 60 seconds, inject a fault unless /tmp/nofault exists
# or the peer is unreachable
while true; do
	sleep 60
	if [ ! -f /tmp/nofault ]; then
		if ping -c 1 $peer; then
			inject_eth_fault
			recover
		fi
	fi
done
:~/glusterfs-test$


I then see that:

A. This goes well the first time; one server leaves the cluster and the client hangs for about 8 seconds before being able to write to the volume again.

B. When the failed server comes back, I can check from both servers that they see each other, and "gluster peer status" shows they believe the other is in connected state.

C. When the failed server comes back, it does not automatically start participating in syncing the volume etc. (the timestamp file on its local storage isn't updated).

D. If I restart the glusterd service (service glusterd restart), the failed node seems to get back to how it was before. Not always, though... The chance is higher if I have a long time between fault injections (long = 60 sec or so, with a forced faulty state of 10 sec).
With a period time of some minutes, I could have the cluster servicing the client OK for at least 8+ hours.
Shortening the period, I'm easily down to like 10-15 minutes.

E. Sooner or later I enter a state where the two servers seem to be up, each seeing its peer (gluster peer status) and such, but neither is serving the volume to the client.
I've tried to "heal" the volume in different ways but it doesn't help. Sometimes it is just that the timestamp copy on one of
the servers is ahead, which is the simpler case, but sometimes both timestamp files have data appended at the end that the other doesn't have.
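
(For reference, the healing attempts mentioned in E are basically just the standard commands, e.g.:

gluster volume heal testvol info
gluster volume heal testvol info split-brain
gluster volume heal testvol
gluster volume heal testvol full

but none of these get the volume served to the client again once I'm in that state.)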

To the questions: 

* Is it so that, from a design point of view, the glusterfs team has chosen that one shouldn't rely solely on the glusterfs daemons being able to recover from a faulty state? Is a cluster manager service (like heartbeat, for example) needed as part of the setup? That would make experience C understandable, and one could then use heartbeat or similar packages to start/stop services.

* What would then be the recommended procedure to recover from a faulty glusterfs node? (so that experiences D and E don't happen)

* What is the expected failover timing (of course depending on config, but say with a given ping timeout etc.),
  and the expected recovery timing (with similar dependency on config)?

* What/how is the glusterfs team testing to make sure that the failover and recovery/healing functionality etc. works?

Any opinion on whether the testcase itself is bad is of course also very welcome.

Best regards,
Per

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users
