Kosher admin practices

peek at nimbios.org (Michael Peek) · Tue, 23 Jul 2013 13:53:58 -0400

Hi guys,

I have a replicated test cluster (four machines, two drives in each)
that I've been beating on.  I've just simulated one type of hardware
failure by remounting one of the drives read-only.
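
For anyone wanting to reproduce this kind of test, a read-only remount
can be sketched as below.  The mount point /export/brick1 is a made-up
placeholder, not a path from this cluster, and the helper only prints
the command instead of running it, since the real remount needs root.

```shell
# Hypothetical mount point of the brick's filesystem; substitute your own.
BRICK_MOUNT="${BRICK_MOUNT:-/export/brick1}"

# Print the command that remounts the given filesystem read-only.
# Nothing is executed here; run the printed line as root to apply it.
remount_ro_cmd() {
    printf 'mount -o remount,ro %s\n' "$1"
}

remount_ro_cmd "$BRICK_MOUNT"
```

Remounting with rw in place of ro should reverse the change on a healthy
drive.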

The manual covers many useful things: adding/removing peers;
starting/stopping, creating, expanding, shrinking, and deleting volumes;
etc.  But it doesn't cover how to replace a failed brick in a way that
minimizes frustration and the chance of data loss.

I can't unmount the brick because glusterfs still has open files on it.

If I stop glusterfs-server, that takes the other brick in the machine
out of commission too.

I have the same problem if I reboot the machine -- I take the other
brick out of service.

What's the correct way to deal with this?  Is there a way to tell
gluster to take a brick out of commission for replacement without
interrupting access to other bricks in the same machine?

Thanks for your help,

Michael Peek