Hello All,

We recently switched from NFS to gluster for sharing images between a cluster of web servers. I have noticed a few issues and am hoping someone has some advice.

One of the main reasons for switching was redundancy: if one server goes down, the clients continue to read and write images, and when the server comes back it gets synced. When we had trouble with NFS, the whole site was crippled or down, and we are trying to get away from that.

We launched gluster into production a couple of weeks ago and have seen a few issues since then.

1) When one of the servers was under load (from something else running on the same box), we started getting a lot of timeout errors from the app. I turned off the gluster server on that host and things were OK again.

2) We rebooted one of the clients and gluster was not started on reboot, so we made some config changes and rebooted again to check that it came up. It did not (our issue, I know), so we started it manually. We then started getting a lot of timeout errors from our app, and soon after from all the other clients too. I ended up killing gluster and remounting all the clients, and things seem to be OK now. Sorry to be so vague, I just don't have a lot of data yet.

3) A client box became totally unresponsive and had to be power cycled. We suspect it was gluster related, as it had a really high load not long after the event above.

Here is a snip from the logs on one of the servers; it looks mostly like this:

2009-02-25 17:48:01 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (127.0.0.1:44380)
2009-02-25 17:48:01 W [posix.c:959:posix_create] brick-ns: open on /images/b/bd/Logo-southerncrosshumanitarian-org.jpg: No such file or directory
2009-02-25 17:49:01 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (127.0.0.1:44383)
2009-02-25 17:49:58 W [posix.c:959:posix_create] brick-ns: open on /images/a/a1/Logo-replica-designers-com.gif: No such file or directory
2009-02-25 17:50:01 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (127.0.0.1:44388)
2009-02-25 17:51:01 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (127.0.0.1:44391)
2009-02-25 17:51:47 W [posix.c:959:posix_create] brick-ns: open on /images/9/9b/Logo-brisbanetraybodys-com-au.png: No such file or directory
2009-02-25 17:52:01 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (127.0.0.1:44396)
2009-02-25 17:52:29 W [posix.c:959:posix_create] brick-ns: open on /images/e/e5/Logo-callverse-com.gif: No such file or directory
2009-02-25 17:53:01 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (127.0.0.1:53321)
2009-02-25 17:54:01 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (127.0.0.1:53326)
2009-02-25 17:55:01 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (127.0.0.1:53336)
2009-02-25 17:55:36 W [posix.c:959:posix_create] brick-ns: open on /images/6/6f/jigsaw-logo.png: No such file or directory
2009-02-25 17:56:01 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (127.0.0.1:53337)
2009-02-25 17:57:01 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (127.0.0.1:53342)
2009-02-25 17:58:01 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (127.0.0.1:54064)
2009-02-25 17:59:01 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (127.0.0.1:54067)
2009-02-25 18:00:01 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (127.0.0.1:54072)
2009-02-25 18:01:01 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (127.0.0.1:54079)
2009-02-25 18:01:30 W [posix.c:959:posix_create] brick-ns: open on /images/8/86/Portrait-KARACTERE.jpg: No such file or directory
2009-02-25 18:01:34 W [posix.c:959:posix_create] brick-ns: open on /images/e/e4/Cropped-Portrait-KARACTERE.jpg: No such file or directory
2009-02-25 18:02:01 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (127.0.0.1:54084)
2009-02-25 18:03:01 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (127.0.0.1:43887)

And from one of the clients:

2009-02-25 17:27:12 E [client-protocol.c:4430:client_lookup_cbk] brick-ns1: no proper reply from server, returning ENOTCONN
2009-02-25 17:27:12 E [client-protocol.c:325:client_protocol_xfer] brick-ns1: transport_submit failed
2009-02-25 17:29:13 C [client-protocol.c:212:call_bail] brick-ns1: bailing transport
2009-02-25 17:29:13 E [client-protocol.c:4834:client_protocol_cleanup] brick-ns1: forced unwinding frame type(2) op(6) reply=@0x2aaab4467d80
2009-02-25 17:29:13 E [client-protocol.c:4277:client_unlock_cbk] brick-ns1: no proper reply from server, returning ENOTCONN
2009-02-25 17:29:13 E [client-protocol.c:325:client_protocol_xfer] brick-ns1: transport_submit failed
2009-02-25 17:34:28 E [afr.c:4625:afr_create_cbk] afr-ns: (path=/images/8/8a/Portrait-Dan_Korn.jpg child=brick-ns2) op_ret=-1 op_errno=2
2009-02-25 17:34:28 E [afr.c:4625:afr_create_cbk] afr-ns: (path=/images/6/65/Cropped-Portrait-Dan_Korn.jpg child=brick-ns2) op_ret=-1 op_errno=2
2009-02-25 17:36:35 C [client-protocol.c:212:call_bail] brick-ns1: bailing transport
2009-02-25 17:36:35 E [client-protocol.c:4834:client_protocol_cleanup] brick-ns1: forced unwinding frame type(1) op(40) reply=@0x2aaab408f390
2009-02-25 17:36:35 E [client-protocol.c:4613:client_checksum_cbk] brick-ns1: no proper reply from server, returning ENOTCONN
2009-02-25 17:56:53 C [client-protocol.c:212:call_bail] brick-ns1: bailing transport
2009-02-25 17:56:53 E [client-protocol.c:4834:client_protocol_cleanup] brick-ns1: forced unwinding frame type(2) op(5) reply=@0x2aaab46410a0
2009-02-25 17:56:53 E [client-protocol.c:4246:client_lock_cbk] brick-ns1: no proper reply from server, returning ENOTCONN
2009-02-25 17:56:53 E [client-protocol.c:325:client_protocol_xfer] brick-ns1: transport_submit failed
2009-02-25 17:58:34 C [client-protocol.c:212:call_bail] brick-ns1: bailing transport
2009-02-25 17:58:34 E [client-protocol.c:4834:client_protocol_cleanup] brick-ns1: forced unwinding frame type(1) op(34) reply=@0x2aaab41634e0
2009-02-25 17:58:34 E [client-protocol.c:4430:client_lookup_cbk] brick-ns1: no proper reply from server, returning ENOTCONN
2009-02-25 17:58:34 E [client-protocol.c:325:client_protocol_xfer] brick-ns1: transport_submit failed
2009-02-25 17:59:24 C [client-protocol.c:212:call_bail] brick-ns1: bailing transport
2009-02-25 17:59:24 E [client-protocol.c:4834:client_protocol_cleanup] brick-ns1: forced unwinding frame type(2) op(6) reply=@0x2aaab407b890
2009-02-25 17:59:24 E [client-protocol.c:4277:client_unlock_cbk] brick-ns1: no proper reply from server, returning ENOTCONN

______________ server conf _____________

volume brick
  type storage/posix
  option directory /opt/glusterfs/share/
end-volume

volume brick-ns
  type storage/posix
  option directory /opt/glusterfs/share-ns/
end-volume

volume server
  type protocol/server
  option transport-type tcp/server
  option client-volume-filename /etc/glusterfs/glusterfs-client.vol
  subvolumes brick brick-ns
  option auth.ip.brick.allow 10.*     # Allow access to "brick" volume
  option auth.ip.brick-ns.allow 10.*  # Allow access to "brick-ns" volume
end-volume

# performance changes
volume locks
  type features/posix-locks
  option mandatory-locks on
  subvolumes brick
end-volume

volume iothreads
  type performance/io-threads
  option thread-count 8
  subvolumes locks
end-volume

___________ client config __________

volume brick1
  type protocol/client
  option transport-type tcp/client
  option remote-host cumulus.adm   # IP address of the remote brick
  option remote-subvolume brick    # name of the remote volume
end-volume

volume brick2
  type protocol/client
  option transport-type tcp/client
  option remote-host dbs3.adm
  option remote-subvolume brick
end-volume

volume brick-ns1
  type protocol/client
  option transport-type tcp/client
  option remote-host cumulus.adm
  option remote-subvolume brick-ns  # Note the different remote volume name.
end-volume

volume brick-ns2
  type protocol/client
  option transport-type tcp/client
  option remote-host dbs3.adm
  option remote-subvolume brick-ns  # Note the different remote volume name.
end-volume

volume afr1
  type cluster/afr
  subvolumes brick1 brick2
end-volume

volume afr-ns
  type cluster/afr
  subvolumes brick-ns1 brick-ns2
end-volume

volume unify
  type cluster/unify
  option namespace afr-ns
  option scheduler rr
  subvolumes afr1
end-volume

# performance changes
volume writebehind
  type performance/write-behind
  option aggregate-size 128KB
  option window-size 1MB
  subvolumes unify
end-volume

volume cache
  type performance/io-cache
  option cache-size 512MB
  subvolumes writebehind
end-volume

Any advice is greatly appreciated.
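
For anyone who wants to summarize logs like the ones above, here is a minimal Python sketch that tallies entries by severity, volume name, and function. The regular expression is inferred only from the sample lines in this post, so it may not cover every message format, and the names `parse_line` and `summarize` are made up for illustration; none of this is part of gluster itself.

```python
import re
from collections import Counter

# Pattern inferred from the sample lines in this post, e.g.
# 2009-02-25 17:29:13 C [client-protocol.c:212:call_bail] brick-ns1: bailing transport
LOG_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]) "
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\] "
    r"(?P<volume>[^:]+): (?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def summarize(lines):
    """Count log entries per (level, volume, function); unparsable lines are skipped."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        counts[(entry["level"], entry["volume"], entry["func"])] += 1
    return counts
```

Passing an open log file to `summarize` and printing `most_common()` on the result should show at a glance which messages dominate, for example how often `call_bail` fires on `brick-ns1` compared with the `EOF from peer` entries on the server.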