On 07/09/2013 06:47 AM, Greg Scott wrote:
> I don't get this. I have a replicated volume and 2 nodes. My
> challenge is, when I take one node offline, the other node can no
> longer access the volume until both nodes are back online again.
>
> Details:
>
> I have 2 nodes, fw1 and fw2. Each node has an XFS file system:
> /gluster-fw1 on node fw1 and /gluster-fw2 on node fw2. Node fw1 is at
> IP address 192.168.253.1. Node fw2 is at 192.168.253.2.
>
> I created a gluster volume named firewall-scripts which is a replica
> of those two XFS file systems. The volume holds a bunch of config
> files common to both fw1 and fw2. The application is an active/standby
> pair of firewalls, and the idea is to keep the config files in a
> gluster volume.
>
> When both nodes are online, everything works as expected. But when I
> take either node offline, node fw2 behaves badly:
>
> [root@chicago-fw2 ~]# ls /firewall-scripts
> ls: cannot access /firewall-scripts: Transport endpoint is not connected
>
> And when I bring the offline node back online, node fw2 eventually
> behaves normally again.
>
> What's up with that? Gluster is supposed to be resilient and
> self-healing and able to stand up to this sort of abuse, so I must be
> doing something wrong.
>
> Here is how I set up everything -- it doesn't get much simpler than
> this, and my setup is right out of the Getting Started Guide, but
> using my own names.
>
> Here are the steps I followed, all from fw1:
>
> gluster peer probe 192.168.253.2
> gluster peer status
>
> Create and start the volume:
>
> gluster volume create firewall-scripts replica 2 transport tcp
> 192.168.253.1:/gluster-fw1 192.168.253.2:/gluster-fw2
> gluster volume start firewall-scripts
>
> On fw1:
>
> mkdir /firewall-scripts
> mount -t glusterfs 192.168.253.1:/firewall-scripts /firewall-scripts
>
> and add this line to /etc/fstab:
>
> 192.168.253.1:/firewall-scripts /firewall-scripts glusterfs defaults,_netdev 0 0
>
> On fw2:
>
> mkdir /firewall-scripts
> mount -t glusterfs 192.168.253.2:/firewall-scripts /firewall-scripts
>
> and add this line to /etc/fstab:
>
> 192.168.253.2:/firewall-scripts /firewall-scripts glusterfs defaults,_netdev 0 0
>
> That's it. That's the whole setup. When both nodes are online,
> everything replicates beautifully. But take one node offline and it
> all falls apart.
> Here is the output from gluster volume info, identical on both nodes:
>
> [root@chicago-fw1 etc]# gluster volume info
>
> Volume Name: firewall-scripts
> Type: Replicate
> Volume ID: 239b6401-e873-449d-a2d3-1eb2f65a1d4c
> Status: Started
> Number of Bricks: 1 x 2 = 2
> Transport-type: tcp
> Bricks:
> Brick1: 192.168.253.1:/gluster-fw1
> Brick2: 192.168.253.2:/gluster-fw2
> [root@chicago-fw1 etc]#
>
> Looking at /var/log/glusterfs/firewall-scripts.log on fw2, I see
> errors like this every couple of seconds:
>
> [2013-07-09 00:59:04.706390] I [afr-common.c:3856:afr_local_init]
> 0-firewall-scripts-replicate-0: no subvolumes up
> [2013-07-09 00:59:04.706515] W [fuse-bridge.c:1132:fuse_err_cbk]
> 0-glusterfs-fuse: 3160: FLUSH() ERR => -1 (Transport endpoint is not
> connected)
>
> And then when I bring fw1 back online, I see these messages on fw2:
>
> [2013-07-09 01:01:35.006782] I [rpc-clnt.c:1648:rpc_clnt_reconfig]
> 0-firewall-scripts-client-0: changing port to 49152 (from 0)
> [2013-07-09 01:01:35.006932] W [socket.c:514:__socket_rwv]
> 0-firewall-scripts-client-0: readv failed (No data available)
> [2013-07-09 01:01:35.018546] I
> [client-handshake.c:1658:select_server_supported_programs]
> 0-firewall-scripts-client-0: Using Program GlusterFS 3.3, Num
> (1298437), Version (330)
> [2013-07-09 01:01:35.019273] I
> [client-handshake.c:1456:client_setvolume_cbk]
> 0-firewall-scripts-client-0: Connected to 192.168.253.1:49152,
> attached to remote volume '/gluster-fw1'.
> [2013-07-09 01:01:35.019356] I
> [client-handshake.c:1468:client_setvolume_cbk]
> 0-firewall-scripts-client-0: Server and Client lk-version numbers are
> not same, reopening the fds
> [2013-07-09 01:01:35.019441] I
> [client-handshake.c:1308:client_post_handshake]
> 0-firewall-scripts-client-0: 1 fds open - Delaying child_up until they
> are re-opened
> [2013-07-09 01:01:35.020070] I
> [client-handshake.c:930:client_child_up_reopen_done]
> 0-firewall-scripts-client-0: last fd open'd/lock-self-heal'd -
> notifying CHILD-UP
> [2013-07-09 01:01:35.020282] I [afr-common.c:3698:afr_notify]
> 0-firewall-scripts-replicate-0: Subvolume 'firewall-scripts-client-0'
> came back up; going online.
> [2013-07-09 01:01:35.020616] I
> [client-handshake.c:450:client_set_lk_version_cbk]
> 0-firewall-scripts-client-0: Server lk version = 1
>
> So how do I make glusterfs survive a node failure, which is the whole
> point of all this?

It looks like the brick process on the fw2 machine is not running, and
hence when fw1 is down the entire replication process is stalled. Can
you do a ps and get the status of all the gluster processes, and ensure
that the brick process is up on fw2?
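For example, something along these lines on fw2 (a minimal sketch;
"firewall-scripts" is the volume name from your setup above, and the
exact process names can vary a little between gluster versions):

    # the management daemon -- should always be running
    ps aux | grep '[g]lusterd'

    # brick processes -- expect one glusterfsd per brick on this node
    ps aux | grep '[g]lusterfsd'

    # or ask gluster directly which bricks are online
    gluster volume status firewall-scripts

If glusterd is up but the glusterfsd brick process is missing,
restarting glusterd should respawn the brick for a started volume, and
"gluster volume start firewall-scripts force" usually brings a dead
brick back as well.

Regards,
Raghav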