outage post-mortem

Hi list,
I would like to describe an issue I had today with Gluster and ask for opinions:

I have a replicated volume with 2 replicas, holding about 1 TB of production data in roughly 100,000 files. The bricks sit on 2x Supermicro X9DR3-LN4F machines, each with an 18 TB RAID array, 64 GB of RAM and 2 Xeon CPUs, as recommended in the Red Hat hardware guidelines for storage servers. The two nodes have a 10 Gb link between them. I am running Gluster 3.4.2 on CentOS 6.5.
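
For reference, a 2-replica volume of this shape is created along these lines (hostnames, volume name and brick paths below are placeholders, not my actual configuration):

    # from one node, after peering the two servers
    gluster peer probe node2
    gluster volume create myvol replica 2 node1:/export/brick1 node2:/export/brick1
    gluster volume start myvol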

This storage is NFS-mounted on a lot of production servers. Only a very small part of the data is actually in active use; the rest is legacy.
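
The client mounts look roughly like this, using NFSv3 against one of the Gluster nodes (server name, volume name and mount point are placeholders):

    mount -t nfs -o vers=3,proto=tcp node1:/myvol /mnt/shared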

Due to an unrelated issue with one of the Supermicro servers (faulty memory), I had to take one of the nodes offline for 3 days.

When I brought it back up, some files and directories ended up in a heal-failed state (but no split-brain). Unfortunately, these were precisely the critical files that had been edited during those 3 days. On the NFS mounts, attempts to read these files resulted in I/O errors.
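
For context, the heal-failed entries I am referring to are the ones reported by the usual heal info commands (volume name is a placeholder):

    gluster volume heal myvol info
    gluster volume heal myvol info heal-failed
    gluster volume heal myvol info split-brain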

I was able to fix a few of these files by manually removing them from each brick and then copying them back onto the mounted volume. But I did not know what to do when entire directories were unreachable because of the "heal failed" state.
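
Roughly, for each file I managed to fix, the procedure was along these lines (brick and file paths are placeholders; I am not certain whether the matching gfid hard link under .glusterfs needs removing as well):

    # on both replicas, remove the broken copy directly on the brick
    rm /export/brick1/path/to/file
    # (guides often also remove the corresponding .glusterfs/xx/yy/<gfid> hard link)
    # then, from a client, put a good copy back through the mounted volume
    cp /some/good/copy /mnt/shared/path/to/file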

I later read that healing can take time and that heal-failed may be a transient state (is that correct? http://stackoverflow.com/questions/19257054/is-it-normal-to-get-a-lot-of-heal-failed-entries-in-a-gluster-mount). At the time, however, I assumed the data was beyond recovery, so I proceeded to destroy the Gluster volume. On one of the replicas I then moved the content of the brick to another directory, created a new volume with the same name, and copied the saved brick content onto the mounted volume. This took around 2 hours. Finally, I had to reboot all my NFS-client machines, which were stuck with "stale NFS file handle" errors.
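
In command form, the rebuild was essentially the following (volume name, hostnames and paths are placeholders, the create line assumes the same replica-2 layout as before, and the exact copy command may have differed):

    # stop and destroy the broken volume
    gluster volume stop myvol
    gluster volume delete myvol
    # on one replica, set the old brick content aside and recreate an empty brick
    mv /export/brick1 /export/brick1.old
    mkdir /export/brick1
    # recreate a volume with the same name and start it
    gluster volume create myvol replica 2 node1:/export/brick1 node2:/export/brick1
    gluster volume start myvol
    # mount the new volume and copy the saved data back in, skipping brick metadata
    mount -t glusterfs node1:/myvol /mnt/restore
    rsync -a --exclude=.glusterfs /export/brick1.old/ /mnt/restore/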

A few questions:
- I realize that I cannot expect 1 TB of data to heal instantly, but is there any way to know whether the system would eventually have recovered despite being reported as "heal failed"?
- If yes, how many files and how much data would I need to clean out of the volume to bring that time under 10 minutes?
- Would native Gluster mounts instead of NFS have helped here?
- Would any other course of action have resulted in a faster recovery?
- In such a situation, is there a way to make one replica authoritative about the correct state of the filesystem?

Thanks in advance for your replies.


