Self-healing on 3.3.0 caused our 2-brick replicated cluster to freeze (client read/write timeout)

On 11/26/12 4:46 AM, ZHANG Cheng wrote:
> Early this morning our 2-brick replicated cluster had an outage. The
> disk space on one of the brick servers (brick02) was used up. By the
> time we responded to the disk-full alert, the issue had already lasted
> a few hours. We reclaimed some disk space and rebooted the brick02
> server, expecting that once it came back it would start self-healing.
> 
> It did start self-healing, but after just a couple of minutes, access
> to the gluster filesystem froze. Tons of "nfs: server brick not
> responding, still trying" messages popped up in dmesg. The load average
> on the app server went up to around 200 from the usual 0.10. We had to
> shut down the brick02 server, or stop the gluster server process on it,
> to get the cluster working again.

Have you checked the glustershd logs (they should be in /var/log/glusterfs)
on the bricks?  If there's nothing useful there, a statedump would also
help.  See the "gluster volume statedump" instructions in your
friendly local admin guide (section 10.4 for GlusterFS 3.3).  Most
helpful of all would be a bug report with any of this information plus a
description of your configuration.  You can either create a new one or
attach the info to an existing bug if one seems to fit.  The following
seems like it might be related, even though it's about virtual machines.

https://bugzilla.redhat.com/show_bug.cgi?id=881685
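
In case it's useful, the rough sequence on a brick server would look
something like the following; "myvol" stands in for your actual volume
name, and the statedump output directory depends on the
server.statedump-path volume option (/tmp is the usual default on 3.3):

  # self-heal daemon log, one per brick server
  less /var/log/glusterfs/glustershd.log

  # ask the brick processes for the volume to write a statedump
  gluster volume statedump myvol

  # the dump files land on the brick servers themselves; check the
  # statedump directory for files named after the brick path and PID
  ls /tmp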

