So - answering myself with the (apparent) solution. The configuration IS correct as shown - the problems were elsewhere.

The primary cause seems to have been performing the GlusterFS native client mount on a virtual machine WITHOUT the "-O --disable-direct-io-mode" parameter. So I was mounting like this:

mount -t glusterfs jc1letgfs5:/test-pfs-ro1 /test-pfs2

When I should have been doing this:

mount -t glusterfs -O --disable-direct-io-mode jc1letgfs5:/test-pfs-ro1 /test-pfs2

Secondly, I changed the volume parameter "network.ping-timeout" from its default of 43 seconds to 10 seconds, in order to get faster recovery from a downed storage node:

gluster volume set pfs-rw1 network.ping-timeout 10

This configuration now survives the loss of either node of the two storage server mirrors. There is a noticeable delay the first time a command is issued on the mount point after one of the nodes has gone down, but subsequent commands return at the same speed as when all nodes were present.
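For completeness, here is a minimal sketch of the corrected client-side steps in one place. The mount and volume-set commands are exactly the ones above; the umount and the verification checks at the end are illustrative additions (not part of the original setup) and may need adjusting for your environment:

# Run as root on the (virtual machine) client.
# Unmount any existing mount first; ignore the error if nothing is mounted.
umount /test-pfs2 2>/dev/null

# Native client mount with direct I/O disabled, as described above.
mount -t glusterfs -O --disable-direct-io-mode jc1letgfs5:/test-pfs-ro1 /test-pfs2

# Sanity check: confirm the mount is present on the client ...
grep test-pfs-ro1 /proc/mounts

# ... and, on any storage node, confirm the reconfigured timeout shows up
# under "Options Reconfigured" for the volume it was set on (pfs-rw1 above).
gluster volume info pfs-rw1 | grep ping-timeout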
Thanks especially to all who helped, and to Anush, who helped me troubleshoot it from a different angle.

James Burnash, Unix Engineering

-----Original Message-----
From: gluster-users-bounces at gluster.org [mailto:gluster-users-bounces at gluster.org] On Behalf Of Burnash, James
Sent: Friday, March 11, 2011 11:31 AM
To: gluster-users at gluster.org
Subject: Re: Why does this setup not survive a node crash?

Could anyone else please take a peek at this and sanity check my configuration? I'm quite frankly at a loss and tremendously under the gun ... Thanks in advance to any kind souls.

James Burnash, Unix Engineering

-----Original Message-----
From: gluster-users-bounces at gluster.org [mailto:gluster-users-bounces at gluster.org] On Behalf Of Burnash, James
Sent: Thursday, March 10, 2011 3:55 PM
To: gluster-users at gluster.org
Subject: Why does this setup not survive a node crash?

Perhaps someone will see immediately, given the data below, why this configuration will not survive a crash of one node - it appears that any node crashed out of this set will cause gluster native clients to hang until the node comes back.

Given (2) initial storage servers (CentOS 5.5, Gluster 3.1.1):

I started out by creating a Distributed-Replicated volume with this command:

gluster volume create test-pfs-ro1 replica 2 jc1letgfs5:/export/read-only/g01 jc1letgfs6:/export/read-only/g01 jc1letgfs5:/export/read-only/g02 jc1letgfs6:/export/read-only/g02

which ran fine (though I did not attempt to crash one of the pair). I then added (2) more servers, identically configured, with this command:

gluster volume add-brick test-pfs-ro1 jc1letgfs7:/export/read-only/g01 jc1letgfs8:/export/read-only/g01 jc1letgfs7:/export/read-only/g02 jc1letgfs8:/export/read-only/g02
Add Brick successful

root at jc1letgfs5:~# gluster volume info

Volume Name: test-pfs-ro1
Type: Distributed-Replicate
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: jc1letgfs5:/export/read-only/g01
Brick2: jc1letgfs6:/export/read-only/g01
Brick3: jc1letgfs5:/export/read-only/g02
Brick4: jc1letgfs6:/export/read-only/g02
Brick5: jc1letgfs7:/export/read-only/g01
Brick6: jc1letgfs8:/export/read-only/g01
Brick7: jc1letgfs7:/export/read-only/g02
Brick8: jc1letgfs8:/export/read-only/g02
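As an aside (not from the original posts): a rough sketch of how one might exercise failover on this layout, by taking down one member of a mirror pair and timing the client. The service and process names are what I would expect on CentOS 5.x with Gluster 3.1, so verify them before relying on this:

# On one storage node of a mirror pair (say jc1letgfs6), simulate a crash by
# stopping the management daemon and the brick server processes:
service glusterd stop      # init script name is an assumption; check your packaging
pkill glusterfsd           # glusterfsd processes serve the bricks

# On a native client with the volume mounted at /test-pfs2, the first command
# after the node disappears stalls until the ping timeout expires; after that,
# operations should run at normal speed against the surviving mirror:
time ls /test-pfs2
time ls /test-pfs2

# Bring the node back when finished (glusterd restarts the brick processes):
service glusterd start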
And this volfile info out of the log file /var/log/glusterfs/etc-glusterd-mount-test-pfs-ro1.log:

[2011-03-10 14:38:26.310807] W [dict.c:1204:data_to_str] dict: @data=(nil)
Given volfile:
+------------------------------------------------------------------------------+
  1: volume test-pfs-ro1-client-0
  2:     type protocol/client
  3:     option remote-host jc1letgfs5
  4:     option remote-subvolume /export/read-only/g01
  5:     option transport-type tcp
  6: end-volume
  7:
  8: volume test-pfs-ro1-client-1
  9:     type protocol/client
 10:     option remote-host jc1letgfs6
 11:     option remote-subvolume /export/read-only/g01
 12:     option transport-type tcp
 13: end-volume
 14:
 15: volume test-pfs-ro1-client-2
 16:     type protocol/client
 17:     option remote-host jc1letgfs5
 18:     option remote-subvolume /export/read-only/g02
 19:     option transport-type tcp
 20: end-volume
 21:
 22: volume test-pfs-ro1-client-3
 23:     type protocol/client
 24:     option remote-host jc1letgfs6
 25:     option remote-subvolume /export/read-only/g02
 26:     option transport-type tcp
 27: end-volume
 28:
 29: volume test-pfs-ro1-client-4
 30:     type protocol/client
 31:     option remote-host jc1letgfs7
 32:     option remote-subvolume /export/read-only/g01
 33:     option transport-type tcp
 34: end-volume
 35:
 36: volume test-pfs-ro1-client-5
 37:     type protocol/client
 38:     option remote-host jc1letgfs8
 39:     option remote-subvolume /export/read-only/g01
 40:     option transport-type tcp
 41: end-volume
 42:
 43: volume test-pfs-ro1-client-6
 44:     type protocol/client
 45:     option remote-host jc1letgfs7
 46:     option remote-subvolume /export/read-only/g02
 47:     option transport-type tcp
 48: end-volume
 49:
 50: volume test-pfs-ro1-client-7
 51:     type protocol/client
 52:     option remote-host jc1letgfs8
 53:     option remote-subvolume /export/read-only/g02
 54:     option transport-type tcp
 55: end-volume
 56:
 57: volume test-pfs-ro1-replicate-0
 58:     type cluster/replicate
 59:     subvolumes test-pfs-ro1-client-0 test-pfs-ro1-client-1
 60: end-volume
 61:
 62: volume test-pfs-ro1-replicate-1
 63:     type cluster/replicate
 64:     subvolumes test-pfs-ro1-client-2 test-pfs-ro1-client-3
 65: end-volume
 66:
 67: volume test-pfs-ro1-replicate-2
 68:     type cluster/replicate
 69:     subvolumes test-pfs-ro1-client-4 test-pfs-ro1-client-5
 70: end-volume
 71:
 72: volume test-pfs-ro1-replicate-3
 73:     type cluster/replicate
 74:     subvolumes test-pfs-ro1-client-6 test-pfs-ro1-client-7
 75: end-volume
 76:
 77: volume test-pfs-ro1-dht
 78:     type cluster/distribute
 79:     subvolumes test-pfs-ro1-replicate-0 test-pfs-ro1-replicate-1 test-pfs-ro1-replicate-2 test-pfs-ro1-replicate-3
 80: end-volume
 81:
 82: volume test-pfs-ro1-write-behind
 83:     type performance/write-behind
 84:     subvolumes test-pfs-ro1-dht
 85: end-volume
 86:
 87: volume test-pfs-ro1-read-ahead
 88:     type performance/read-ahead
 89:     subvolumes test-pfs-ro1-write-behind
 90: end-volume
 91:
 92: volume test-pfs-ro1-io-cache
 93:     type performance/io-cache
 94:     subvolumes test-pfs-ro1-read-ahead
 95: end-volume
 96:
 97: volume test-pfs-ro1-quick-read
 98:     type performance/quick-read
 99:     subvolumes test-pfs-ro1-io-cache
100: end-volume
101:
102: volume test-pfs-ro1-stat-prefetch
103:     type performance/stat-prefetch
104:     subvolumes test-pfs-ro1-quick-read
105: end-volume
106:
107: volume test-pfs-ro1
108:     type debug/io-stats
109:     subvolumes test-pfs-ro1-stat-prefetch
110: end-volume

Any input would be greatly appreciated. I'm working beyond my deadline already, and I'm guessing that I'm not seeing the forest for the trees here.

James Burnash, Unix Engineering
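One last note that is not in the thread above but tends to matter with a replicated setup like this on 3.1.x: after a downed node rejoins, files written in the meantime are only repaired when they are next accessed (3.1 predates the "gluster volume heal" command). The commonly used way to force that at this release is to walk the mount from a client; the mount point below is the one from this thread:

# Run on a native client mount after the failed storage node is back.
# Statting every file makes the replicate translator check, and repair,
# any copies that went stale while the node was down.
find /test-pfs2 -print0 | xargs --null stat >/dev/null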