On Thu, Jan 10, 2013 at 12:50:48PM -0500, Liang Ma wrote:
> I assume to replace a failed replicate disk or node should be a
> standard procedure, isn't it? I couldn't find anything related to
> this in the 3.3 manual.

You'd have thought so, wouldn't you :-(

I know of two options.

(1) If the server itself is OK, but the brick filesystem you're exporting
from that server has died, then just stop glusterd, erase (mkfs) the
filesystem the brick is on, remount it, and restart glusterd. After a few
minutes, self-heal will kick in and copy the data back for replicated
volumes. At least, it did in a test setup I tried once. (There's a rough
command sketch in the P.S. below.)

This of course assumes your data filesystem is separate from your OS
filesystem, which I'd suggest is a good idea anyway.

(2) If the whole server has died, or you have to re-install the OS from
scratch, but the replacement server has the same hostname as the old one,
then there's a different procedure. It was documented at

  http://gluster.org/community/documentation/index.php/Gluster_3.2:_Brick_Restoration_-_Replace_Crashed_Server

for glusterfs 3.2. It is almost the same for glusterfs 3.3, except that
the config directory has moved (from /etc/glusterd to /var/lib/glusterd),
so it works if you change these two steps:

  grep server3 /var/lib/glusterd/peers/*
  echo UUID=... >/var/lib/glusterd/glusterd.info

(The P.P.S. below shows those two steps in a bit more context.)

HTH,

Brian.
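
P.S. A rough sketch of option (1), run on the affected server. The device,
mount point, filesystem type and volume name are only examples (I'm
assuming the brick is /dev/sdb1, mounted at /export/brick1, formatted as
xfs, and part of a volume called myvol); adjust them for your setup, and
use whatever your distro uses to stop/start services.

  # stop glusterd on the server with the dead brick filesystem
  service glusterd stop

  # recreate the brick filesystem (this destroys anything left on it)
  umount /export/brick1
  mkfs.xfs -f /dev/sdb1
  mount /dev/sdb1 /export/brick1

  # restart glusterd; self-heal should repopulate the brick for
  # replicated volumes after a few minutes
  service glusterd start

  # optionally, kick off a full self-heal instead of waiting
  gluster volume heal myvol full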
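
P.P.S. For option (2), those two changed steps sit in the wiki procedure
roughly like this ("server3" is just the wiki's name for the crashed and
rebuilt server; substitute your own hostname and the UUID you find):

  # on one of the surviving peers: the filename of the peers file that
  # matches is the UUID the old server had
  grep server3 /var/lib/glusterd/peers/*

  # on the rebuilt server, with glusterd stopped: give it that UUID back
  echo UUID=... >/var/lib/glusterd/glusterd.info

  # then start glusterd and follow the rest of the wiki page
  # (peer probe a surviving server, restart, let self-heal run)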