In addition, my self-heal split-brain info contains lots of repeated gfid strings. How should I deal with this situation? (A rough sketch of what I have in mind is at the end of this message, below the quoted thread.)

gluster> v heal staticvol info split-brain
Gathering Heal info on volume staticvol has been successful

Brick brick01:/exports/static
Number of entries: 108
at                    path on brick
-----------------------------------
2012-11-30 10:50:14 <gfid:f4a9368c-aa7b-4696-b17a-d09e3d26d5b9>
2012-11-30 09:32:46 <gfid:f4a9368c-aa7b-4696-b17a-d09e3d26d5b9>
2012-11-30 09:32:43 <gfid:022fd9fc-d725-4066-acbd-1e2b224710b0>
2012-11-30 09:23:36 <gfid:f4a9368c-aa7b-4696-b17a-d09e3d26d5b9>
2012-11-30 09:23:33 <gfid:022fd9fc-d725-4066-acbd-1e2b224710b0>
2012-11-30 09:10:14 <gfid:f4a9368c-aa7b-4696-b17a-d09e3d26d5b9>
2012-11-30 09:10:14 <gfid:022fd9fc-d725-4066-acbd-1e2b224710b0>
2012-11-30 08:59:15 <gfid:f4a9368c-aa7b-4696-b17a-d09e3d26d5b9>
...

Brick brick02:/exports/static
Number of entries: 499
at                    path on brick
-----------------------------------
2012-11-30 11:54:47 <gfid:f4a9368c-aa7b-4696-b17a-d09e3d26d5b9>
2012-11-30 10:39:55 <gfid:022fd9fc-d725-4066-acbd-1e2b224710b0>
2012-11-30 10:39:54 <gfid:f4a9368c-aa7b-4696-b17a-d09e3d26d5b9>
2012-11-30 10:29:55 <gfid:022fd9fc-d725-4066-acbd-1e2b224710b0>
2012-11-30 10:29:54 <gfid:f4a9368c-aa7b-4696-b17a-d09e3d26d5b9>
2012-11-30 10:19:55 <gfid:022fd9fc-d725-4066-acbd-1e2b224710b0>
2012-11-30 10:19:54 <gfid:f4a9368c-aa7b-4696-b17a-d09e3d26d5b9>
2012-11-30 10:09:56 <gfid:022fd9fc-d725-4066-acbd-1e2b224710b0>
2012-11-30 10:09:55 <gfid:f4a9368c-aa7b-4696-b17a-d09e3d26d5b9>
2012-11-30 09:59:55 <gfid:022fd9fc-d725-4066-acbd-1e2b224710b0>
...

On Fri, Nov 30, 2012 at 12:33 PM, ZHANG Cheng <czhang.oss at gmail.com> wrote:
> I have lots of lines like this in my log:
> [2012-11-30 12:27:22.203030] E
> [afr-self-heald.c:685:_link_inode_update_loc] 0-staticvol-replicate-0:
> inode link failed on the inode (00000000-0000-0000-0000-000000000000)
>
> I am running gluster 3.3.1.
>
> On Thu, Nov 29, 2012 at 6:58 PM, Jeff Darcy <jdarcy at redhat.com> wrote:
>> On 11/26/12 4:46 AM, ZHANG Cheng wrote:
>>> Early this morning our two-brick replicated cluster had an outage. The
>>> disk space on one of the brick servers (brick02) was used up. By the
>>> time we responded to the disk-full alert, the issue had already lasted
>>> a few hours. We reclaimed some disk space and rebooted the brick02
>>> server, expecting that once it came back it would self-heal.
>>>
>>> It did start self-healing, but after just a couple of minutes, access
>>> to the gluster filesystem froze. Tons of "nfs: server brick not
>>> responding, still trying" messages popped up in dmesg. The load average
>>> on the app server went up to around 200 from the usual 0.10. We had to
>>> shut down the brick02 server, or stop the gluster server process on it,
>>> to get the gluster cluster working again.
>>
>> Have you checked the glustershd logs (should be in /var/log/glusterfs)
>> on the bricks? If there's nothing useful there, a statedump would also
>> be useful. See the "gluster volume statedump" instructions in your
>> friendly local admin guide (section 10.4 for GlusterFS 3.3). Most
>> helpful of all would be a bug report with any of this information plus a
>> description of your configuration. You can either create a new one or
>> attach the info to an existing bug if one seems to fit. The following
>> seems like it might be related, even though it's for virtual machines.
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=881685
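
P.S. To make that split-brain list manageable, this is the rough approach I have in mind, assuming the standard .glusterfs layout on the brick; the grep pattern below is only illustrative, and it covers regular files (for a directory the .glusterfs entry is a symlink, so readlink would be needed instead of find):

  # Collapse the repeated heal-info entries down to the unique gfids:
  gluster volume heal staticvol info split-brain | grep -o '<gfid:[^>]*>' | sort -u

  # Map one gfid back to a real path on the brick; for regular files the
  # entry under .glusterfs is a hard link to the actual file:
  GFID=f4a9368c-aa7b-4696-b17a-d09e3d26d5b9
  find /exports/static -samefile "/exports/static/.glusterfs/f4/a9/$GFID" \
       -not -path '*/.glusterfs/*'

Once the two copies are identified, my understanding of the usual manual fix on 3.3 is to pick the bad copy on one brick, remove it along with its .glusterfs hard link, and let self-heal recreate it from the good brick.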
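
To follow up on Jeff's suggestion, this is what I plan to run next to gather more information (where the dump files land depends on the server.statedump-path setting, so the path below is only a guess):

  # Self-heal daemon log on each brick server:
  less /var/log/glusterfs/glustershd.log

  # Statedump of the brick processes for the volume:
  gluster volume statedump staticvol
  ls /tmp/*.dump*    # or wherever server.statedump-path points on this build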