Hi all,

I have noticed some interesting behavior from my gluster setup regarding NFS on Gluster 3.5.0.

My problem: I have 2 bricks in a replica volume (named gvol0). This volume is accessed through NFS. If I fail one of the servers, everything works as expected; gluster NFS continues to export the volume from the remaining brick. However, if I restart the glusterd, glusterfsd, and rpcbind services, or reboot the remaining host while the other brick is down, gluster NFS no longer exports the volume from the remaining brick. The volume does still appear to be served to gluster-fuse clients, though. Is this intended behavior, or is this a possible bug?

Here is a ps just after a brick fails, with 1 brick remaining to export the volume over gluster NFS:

[root@nfs0 ~]# ps aux | grep gluster
root 2145 0.0 0.1 518444 24972 ? Ssl 19:24 0:00 /usr/sbin/glusterfsd -s nfs0g --volfile-id gvol0.nfs0g.data-brick0-gvol0 -p /var/lib/glusterd/vols/gvol0/run/nfs0g-data-brick0-gvol0.pid -S /var/run/91885b40ac4835907081de3bdc235620.socket --brick-name /data/brick0/gvol0 -l /var/log/glusterfs/bricks/data-brick0-gvol0.log --xlator-option *-posix.glusterd-uuid=49f53699-babd-4731-9c56-582b2b90b27c --brick-port 49152 --xlator-option gvol0-server.listen-port=49152
root 2494 0.1 0.1 414208 19204 ? Ssl 19:46 0:02 /usr/sbin/glusterd --pid-file=/var/run/glusterd.pid
root 2511 0.0 0.4 471324 77868 ? Ssl 19:47 0:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log -S /var/run/b0f1e836c0c9f168518e0adba7187c10.socket
root 2515 0.0 0.1 334968 25408 ? Ssl 19:47 0:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/173a6cd55e36ea8e0ce0896d27533355.socket --xlator-option *replicate*.node-uuid=49f53699-babd-4731-9c56-582b2b90b27c

Here is a ps after restarting the remaining host, with the other brick still down:

[root@nfs0 ~]# ps aux | grep gluster
root 2134 0.1 0.0 280908 14684 ? Ssl 20:36 0:00 /usr/sbin/glusterd --pid-file=/var/run/glusterd.pid
root 2144 0.0 0.1 513192 17300 ? Ssl 20:36 0:00 /usr/sbin/glusterfsd -s nfs0g --volfile-id gvol0.nfs0g.data-brick0-gvol0 -p /var/lib/glusterd/vols/gvol0/run/nfs0g-data-brick0-gvol0.pid -S /var/run/91885b40ac4835907081de3bdc235620.socket --brick-name /data/brick0/gvol0 -l /var/log/glusterfs/bricks/data-brick0-gvol0.log --xlator-option *-posix.glusterd-uuid=49f53699-babd-4731-9c56-582b2b90b27c --brick-port 49152 --xlator-option gvol0-server.listen-port=49152

It appears the gluster NFS server (the glusterfs process started with --volfile-id gluster/nfs) is not being started back up after the reboot of the remaining host. Restarting glusterfsd on the remaining host still does not bring NFS up. However, if I start the gluster service on the host that serves the down brick, NFS starts up again without me restarting any services.

Here is the volume information:

Volume Name: gvol0
Type: Replicate
Volume ID: e88afc1c-50d3-4e2e-b540-4c2979219d12
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: nfs0g:/data/brick0/gvol0
Brick2: nfs1g:/data/brick0/gvol0
Options Reconfigured:
nfs.disable: 0
network.ping-timeout: 3

Is this a bug, or intended functionality?
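In case it helps with diagnosis, this is roughly how I have been checking whether gluster NFS is actually exporting the volume from the surviving host (nfs0g is the hostname from my setup above; these are just the standard Gluster/NFS tools, so adjust for your environment):

  gluster volume status gvol0 nfs           # shows the "NFS Server" entry per node and whether it is online
  showmount -e nfs0g                        # should list the gvol0 export when gluster NFS is up
  rpcinfo -p nfs0g | grep -E 'nfs|mountd'   # checks that the nfs/mountd programs are registered with rpcbind

After the reboot described above, only the fuse-side access comes back for me; the checks above are how I confirmed the NFS export itself is gone.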
----------
Brent Kolasinski
Computer Systems Engineer
Argonne National Laboratory
Decision and Information Sciences
ARM Climate Research Facility