Early this morning our two-brick replicated cluster had an outage. One of the brick servers (brick02) ran out of disk space, and by the time we responded to the disk-full alert the problem had already been going on for a few hours. We reclaimed some disk space and rebooted brick02, expecting it to self-heal once it came back up. It did start self-healing, but after just a couple of minutes access to the Gluster filesystem froze. Tons of "nfs: server brick not responding, still trying" messages appeared in dmesg, and the load average on the app servers shot up to around 200 from the usual 0.10. We had to shut down brick02 (or stop the gluster server process on it) to get the cluster working again.

How should we deal with this issue? Thanks in advance.

Our Gluster setup follows the official docs:

gluster> volume info

Volume Name: staticvol
Type: Replicate
Volume ID: fdcbf635-5faf-45d6-ab4e-be97c74d7715
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: brick01:/exports/static
Brick2: brick02:/exports/static

The underlying filesystem is XFS (on an LVM volume):

/dev/mapper/vg_node-brick on /exports/static type xfs (rw,noatime,nodiratime,nobarrier,logbufs=8)

The brick servers do not act as Gluster clients. Our app servers are the Gluster clients, mounting via NFS:

brick:/staticvol on /mnt/gfs-static type nfs (rw,noatime,nodiratime,vers=3,rsize=8192,wsize=8192,addr=10.10.10.51)

"brick" is a DNS round-robin record for brick01 and brick02.
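For reference, here is a sketch of what we are considering trying next time, to keep self-heal from starving client I/O. These are standard `gluster volume set` options, but whether they are available (and their exact defaults) depends on the Gluster version, so treat this as an assumption rather than a tested fix:

```shell
# Before re-adding the repaired brick, see how much there is to heal.
gluster volume heal staticvol info

# Copy only changed blocks during heal instead of re-reading whole files
# (assumption: the "diff" algorithm is supported by our Gluster version).
gluster volume set staticvol cluster.data-self-heal-algorithm diff

# Cap the number of files being background-healed concurrently,
# so client traffic is not crowded out.
gluster volume set staticvol cluster.background-self-heal-count 2

# Then trigger the heal deliberately, during a low-traffic window,
# rather than letting it start the moment the brick comes back.
gluster volume heal staticvol full
```

The idea is to bring brick02 back with healing throttled, watch `heal info` and the app-server load, and only relax the limits once the backlog has drained.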