AFR fails to provide High Availability

Hello everyone,

we are experiencing the following problem on our HPC cluster, which uses a gluster filesystem built with unify over afr on several pairs of nodes. Every once in a while, one of our nodes freezes: it still replies to 'ping', but it allows neither ssh connections nor direct terminal access. While the node is in that state, the gluster filesystem crashes; if the frozen node is shut down, the filesystem works again (in degraded mode). Since most of our running jobs write to glusterfs, this is becoming quite a serious problem.
Our server spec files look like:

-------------------------------------------------------------------
### GlusterFS Server Volume Specification

### Export volume brick.
volume brick
      type storage/posix
      option directory /state/partition1/glfsdir/
end-volume

### Add network serving capability to brick.
volume server
      type protocol/server
      option transport-type tcp/server
      option listen-port 6996
      subvolumes brick
      option auth.ip.brick.allow 10.*.*.*
end-volume
-------------------------------------------------------------------

and our client spec files look like:

-------------------------------------------------------------------
### GlusterFS Client Volume Specification

### Add client feature and attach to remote subvolume of server
volume brick0-0
 type protocol/client
 option transport-type tcp/client
 option remote-host compute-0-0
 option remote-subvolume brick
end-volume

[... several of those, up to compute 7-5 ...]

volume brick7-5
 type protocol/client
 option transport-type tcp/client
 option remote-host compute-7-5
 option remote-subvolume brick
end-volume

###  Namespace brick
volume local-ns
   type protocol/client
   option transport-type tcp/client
   option remote-host vulcano
   option remote-subvolume brick-ns
end-volume

###  Automatic File Replication
volume afr1
   type cluster/afr
   subvolumes brick0-0 brick4-0
end-volume

[... several of those, up to afr24 ...]

volume afr24
   type cluster/afr
   subvolumes brick3-5 brick7-5
end-volume

###  Unify
volume unify
   type cluster/unify
   subvolumes afr1 afr2 afr3 afr4 afr5 afr6 afr7 afr8 afr9 afr10 afr11 afr12 afr13 afr14 afr15 afr16 afr17 afr18 afr19 afr20 afr21 afr22 afr23 afr24
   option namespace local-ns
# ALU scheduler
   option scheduler alu            # use the ALU scheduler
   option alu.limits.min-free-disk 5% # Don't create files on a volume with less than 5% free diskspace
## When deciding where to place a file, first look at the write-usage, then at
##   read-usage, disk-usage, open files, and finally the disk-speed-usage.
   option alu.order write-usage:read-usage:disk-usage:open-files-usage:disk-speed-usage
   option alu.write-usage.entry-threshold 20% # Kick in when the write-usage discrepancy is 20%
   option alu.write-usage.exit-threshold 15% # Don't stop until the discrepancy has been reduced to 5%
   option alu.read-usage.entry-threshold 20% # Kick in when the read-usage discrepancy is 20%
   option alu.read-usage.exit-threshold 4% # Don't stop until the discrepancy has been reduced to 16% (20% - 4%)
   option alu.disk-usage.entry-threshold 10GB # Kick in if the discrep. in disk-usage between volumes is more than 10GB
   option alu.disk-usage.exit-threshold 1GB # Don't stop writing to the least-used volume until the discrep. is 9GB
   option alu.open-files-usage.entry-threshold 1024 # Kick in if the discrepancy in open files is 1024
   option alu.open-files-usage.exit-threshold 32 # Stop when 992 files have been written in the least-used vol.
#   option alu.disk-speed-usage.entry-threshold # NEVER SET IT. SPEED IS CONSTANT!!!
#   option alu.disk-speed-usage.exit-threshold # NEVER SET IT. SPEED IS CONSTANT!!!
   option alu.stat-refresh.interval 10sec # Refresh the statistics used for decision-making every 10 seconds
#   option alu.stat-refresh.num-file-create 10 # Refresh the statistics used for decision-making after creating 10 files
## NUFA scheduler
#    option scheduler nufa
#    option nufa.local-volume-name afr24
end-volume
-------------------------------------------------------------------

The namespace is provided by the frontend 'vulcano', which does not otherwise contribute to the filesystem. The scheduler is NUFA on the nodes and ALU on the frontend. We recently added 'option self-heal on' to the afr bricks and 'option transport-timeout 10' to the basic node bricks ('brickX-X'), but that had no effect on our problem.
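
To make the two additions concrete, here is a sketch of how they would look in the client spec file, using brick0-0 and afr1 from above as examples. Only the option names and values are taken from the description; the position of each option line inside its volume block is an assumption.

-------------------------------------------------------------------
### Node brick with the shorter transport timeout
volume brick0-0
 type protocol/client
 option transport-type tcp/client
 option remote-host compute-0-0
 option remote-subvolume brick
 option transport-timeout 10   # added; placement within the block is assumed
end-volume

### AFR pair with self-heal enabled
volume afr1
   type cluster/afr
   subvolumes brick0-0 brick4-0
   option self-heal on         # added; placement within the block is assumed
end-volume
-------------------------------------------------------------------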

What we get in /var/log/glusterfs/glusterfs.log is always something like this (with two node freeze examples):

-------------------------------------------------------------------
2008-06-06 20:54:41 W [client-protocol.c:204:call_bail] brick6-0: activating bail-out. pending frames = 4. last sent = 2008-06-06 20:51:51. last received = 2008-06-06 20:43:08 transport-timeout = 108
2008-06-06 20:54:41 C [client-protocol.c:211:call_bail] brick6-0: bailing transport
2008-06-06 20:54:41 W [client-protocol.c:4759:client_protocol_cleanup] brick6-0: cleaning up state in transport object 0x554c10
2008-06-06 20:54:41 E [client-protocol.c:4809:client_protocol_cleanup] brick6-0: forced unwinding frame type(1) op(34) reply=@0x2a966a6880
2008-06-06 20:54:41 E [client-protocol.c:4405:client_lookup_cbk] brick6-0: no proper reply from server, returning ENOTCONN
2008-06-06 20:54:41 E [client-protocol.c:4809:client_protocol_cleanup] brick6-0: forced unwinding frame type(1) op(15) reply=@0x2a966a6880
2008-06-06 20:54:41 E [client-protocol.c:3866:client_statfs_cbk] brick6-0: no proper reply from server, returning ENOTCONN
2008-06-06 20:54:41 E [client-protocol.c:4809:client_protocol_cleanup] brick6-0: forced unwinding frame type(1) op(34) reply=@0x2a966a6880
2008-06-06 20:54:41 E [client-protocol.c:4405:client_lookup_cbk] brick6-0: no proper reply from server, returning ENOTCONN
2008-06-06 20:54:41 E [client-protocol.c:4809:client_protocol_cleanup] brick6-0: forced unwinding frame type(1) op(15) reply=@0x2a966a6880
2008-06-06 20:54:41 E [client-protocol.c:3866:client_statfs_cbk] brick6-0: no proper reply from server, returning ENOTCONN
2008-06-06 21:00:10 W [client-protocol.c:279:client_protocol_xfer] brick6-0: attempting to pipeline request type(1) op(35) with handshake
2008-06-06 21:00:10 W [client-protocol.c:4759:client_protocol_cleanup] brick6-0: cleaning up state in transport object 0x554c10
2008-06-06 21:00:10 E [client-protocol.c:4809:client_protocol_cleanup] brick6-0: forced unwinding frame type(1) op(35) reply=@0x2a9622d710
2008-06-06 21:00:10 E [tcp-client.c:190:tcp_connect] brick6-0: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:00:10 W [client-protocol.c:331:client_protocol_xfer] brick6-0: not connected at the moment to submit frame type(1) op(35)
2008-06-06 21:00:57 E [tcp-client.c:190:tcp_connect] brick6-0: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:02:32 E [tcp-client.c:190:tcp_connect] brick6-0: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:02:32 W [client-protocol.c:331:client_protocol_xfer] brick6-0: not connected at the moment to submit frame type(1) op(15)
2008-06-06 21:02:32 E [client-protocol.c:3866:client_statfs_cbk] brick6-0: no proper reply from server, returning ENOTCONN
2008-06-06 21:04:24 E [protocol.c:271:gf_block_unserialize_transport] local-ns: EOF from peer (10.1.1.1:6996)
2008-06-06 21:04:24 W [client-protocol.c:4759:client_protocol_cleanup] local-ns: cleaning up state in transport object 0x58b580
2008-06-06 21:10:24 E [protocol.c:271:gf_block_unserialize_transport] local-ns: EOF from peer (10.1.1.1:6996)
2008-06-06 21:10:24 W [client-protocol.c:4759:client_protocol_cleanup] local-ns: cleaning up state in transport object 0x58e980
2008-06-09 11:11:26 W [client-protocol.c:204:call_bail] brick6-5: activating bail-out. pending frames = 1. last sent = 2008-06-09 11:11:06. last received = 2008-06-09 04:02:03 transport-timeout = 10
2008-06-09 11:11:26 C [client-protocol.c:211:call_bail] brick6-5: bailing transport
2008-06-09 11:11:26 W [client-protocol.c:4759:client_protocol_cleanup] brick6-5: cleaning up state in transport object 0x56eef0
2008-06-09 11:11:26 E [client-protocol.c:4809:client_protocol_cleanup] brick6-5: forced unwinding frame type(1) op(15) reply=@0x2a96205fd0
2008-06-09 11:11:26 E [client-protocol.c:3866:client_statfs_cbk] brick6-5: no proper reply from server, returning ENOTCONN
2008-06-09 13:32:52 W [client-protocol.c:279:client_protocol_xfer] brick6-5: attempting to pipeline request type(1) op(15) with handshake
2008-06-09 13:33:09 W [client-protocol.c:204:call_bail] brick6-5: activating bail-out. pending frames = 1. last sent = 2008-06-09 13:32:52. last received = 1970-01-01 01:00:00 transport-timeout = 10
2008-06-09 13:33:09 C [client-protocol.c:211:call_bail] brick6-5: bailing transport
2008-06-09 13:33:09 W [client-protocol.c:4759:client_protocol_cleanup] brick6-5: cleaning up state in transport object 0x56eef0
2008-06-09 13:33:09 E [client-protocol.c:4809:client_protocol_cleanup] brick6-5: forced unwinding frame type(1) op(15) reply=@0x2a962009e0
2008-06-09 13:33:09 E [fuse-bridge.c:2487:fuse_thread_proc] glusterfs-fuse: fuse_chan_receive() returned -1 (25)
2008-06-09 13:33:09 E [client-protocol.c:3866:client_statfs_cbk] brick6-5: no proper reply from server, returning ENOTCONN
-------------------------------------------------------------------

On the frozen node, after rebooting, /var/log/glusterfs/glusterfsd.log shows (for the brick6-0 case):

-------------------------------------------------------------------
2008-06-06 21:10:02 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.205:901)
2008-06-06 21:10:24 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.1.1.1:995)
2008-06-06 21:10:25 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.252:997)
2008-06-06 21:10:25 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.251:997)
2008-06-06 21:10:25 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.250:997)
2008-06-06 21:10:26 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.249:996)
2008-06-06 21:10:26 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.248:997)
2008-06-06 21:10:26 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.247:997)
2008-06-06 21:10:26 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.246:997)
2008-06-06 21:10:27 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.245:997)
2008-06-06 21:10:27 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.244:997)
2008-06-06 21:10:27 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.243:997)
2008-06-06 21:10:28 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.242:997)
2008-06-06 21:10:28 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.241:996)
2008-06-06 21:10:28 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.240:997)
2008-06-06 21:10:29 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.239:996)
2008-06-06 21:10:29 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.238:997)
2008-06-06 21:10:29 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.237:997)
2008-06-06 21:10:30 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.236:997)
2008-06-06 21:10:30 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.235:997)
2008-06-06 21:10:30 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.234:997)
2008-06-06 21:10:31 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.233:997)
2008-06-06 21:10:31 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.232:997)
2008-06-06 21:10:31 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.231:997)
2008-06-06 21:10:32 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.230:997)
2008-06-06 21:10:32 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.229:997)
2008-06-06 21:10:32 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.228:997)
2008-06-06 21:10:33 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.227:997)
2008-06-06 21:10:33 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.226:997)
2008-06-06 21:10:33 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.225:997)
2008-06-06 21:10:34 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.224:997)
2008-06-06 21:10:34 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.223:997)
2008-06-06 21:10:34 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.222:997)
2008-06-06 21:10:35 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.221:997)
2008-06-06 21:10:35 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.220:997)
2008-06-06 21:10:35 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.219:996)
2008-06-06 21:10:35 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.218:997)
2008-06-06 21:10:36 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.217:996)
2008-06-06 21:10:36 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.255.255.216:997)
-------------------------------------------------------------------

and the client log says:

-------------------------------------------------------------------
2008-06-06 21:10:24 E [protocol.c:271:gf_block_unserialize_transport] local-ns: EOF from peer (10.1.1.1:6996) 2008-06-06 21:10:24 W [client-protocol.c:4759:client_protocol_cleanup] local-ns: cleaning up state in transport object 0x58d7f0 2008-06-06 21:10:25 E [protocol.c:271:gf_block_unserialize_transport] brick0-0: EOF from peer (10.255.255.252:6996) 2008-06-06 21:10:25 W [client-protocol.c:4759:client_protocol_cleanup] brick0-0: cleaning up state in transport object 0x51dc80 2008-06-06 21:10:25 E [protocol.c:271:gf_block_unserialize_transport] brick0-1: EOF from peer (10.255.255.251:6996) 2008-06-06 21:10:25 W [client-protocol.c:4759:client_protocol_cleanup] brick0-1: cleaning up state in transport object 0x522b80 2008-06-06 21:10:25 E [protocol.c:271:gf_block_unserialize_transport] brick0-2: EOF from peer (10.255.255.250:6996) 2008-06-06 21:10:25 W [client-protocol.c:4759:client_protocol_cleanup] brick0-2: cleaning up state in transport object 0x5274e0 2008-06-06 21:10:26 E [protocol.c:271:gf_block_unserialize_transport] brick0-3: EOF from peer (10.255.255.249:6996) 2008-06-06 21:10:26 W [client-protocol.c:4759:client_protocol_cleanup] brick0-3: cleaning up state in transport object 0x52be40 2008-06-06 21:10:26 E [protocol.c:271:gf_block_unserialize_transport] brick0-4: EOF from peer (10.255.255.248:6996) 2008-06-06 21:10:26 W [client-protocol.c:4759:client_protocol_cleanup] brick0-4: cleaning up state in transport object 0x5307a0 2008-06-06 21:10:26 E [protocol.c:271:gf_block_unserialize_transport] brick0-5: EOF from peer (10.255.255.247:6996) 2008-06-06 21:10:26 W [client-protocol.c:4759:client_protocol_cleanup] brick0-5: cleaning up state in transport object 0x535100 2008-06-06 21:10:27 E [protocol.c:271:gf_block_unserialize_transport] brick1-0: EOF from peer (10.255.255.246:6996) 2008-06-06 21:10:27 W [client-protocol.c:4759:client_protocol_cleanup] brick1-0: cleaning up state in transport object 0x539a60 2008-06-06 21:10:27 E 
[protocol.c:271:gf_block_unserialize_transport] brick1-1: EOF from peer (10.255.255.245:6996) 2008-06-06 21:10:27 W [client-protocol.c:4759:client_protocol_cleanup] brick1-1: cleaning up state in transport object 0x53e3c0 2008-06-06 21:10:27 E [protocol.c:271:gf_block_unserialize_transport] brick1-2: EOF from peer (10.255.255.244:6996) 2008-06-06 21:10:27 W [client-protocol.c:4759:client_protocol_cleanup] brick1-2: cleaning up state in transport object 0x542d20 2008-06-06 21:10:28 E [protocol.c:271:gf_block_unserialize_transport] brick1-3: EOF from peer (10.255.255.243:6996) 2008-06-06 21:10:28 W [client-protocol.c:4759:client_protocol_cleanup] brick1-3: cleaning up state in transport object 0x547680 2008-06-06 21:10:28 E [tcp-client.c:190:tcp_connect] local-ns: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:28 E [protocol.c:271:gf_block_unserialize_transport] brick1-4: EOF from peer (10.255.255.242:6996) 2008-06-06 21:10:28 W [client-protocol.c:4759:client_protocol_cleanup] brick1-4: cleaning up state in transport object 0x54bfe0 2008-06-06 21:10:28 E [tcp-client.c:190:tcp_connect] brick1-4: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:28 E [tcp-client.c:190:tcp_connect] brick0-0: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:28 E [protocol.c:271:gf_block_unserialize_transport] brick1-5: EOF from peer (10.255.255.241:6996) 2008-06-06 21:10:28 W [client-protocol.c:4759:client_protocol_cleanup] brick1-5: cleaning up state in transport object 0x550940 2008-06-06 21:10:28 E [tcp-client.c:190:tcp_connect] brick0-1: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:28 E [tcp-client.c:190:tcp_connect] brick1-5: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:28 E [protocol.c:271:gf_block_unserialize_transport] brick2-0: EOF from peer (10.255.255.240:6996) 2008-06-06 21:10:28 W [client-protocol.c:4759:client_protocol_cleanup] brick2-0: 
cleaning up state in transport object 0x5552a0 2008-06-06 21:10:29 E [tcp-client.c:190:tcp_connect] brick0-2: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:29 E [tcp-client.c:190:tcp_connect] brick2-0: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:29 E [protocol.c:271:gf_block_unserialize_transport] brick2-1: EOF from peer (10.255.255.239:6996) 2008-06-06 21:10:29 W [client-protocol.c:4759:client_protocol_cleanup] brick2-1: cleaning up state in transport object 0x559c00 2008-06-06 21:10:29 E [tcp-client.c:190:tcp_connect] brick0-3: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:29 E [tcp-client.c:190:tcp_connect] brick2-1: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:29 E [tcp-client.c:190:tcp_connect] brick1-4: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:29 E [protocol.c:271:gf_block_unserialize_transport] brick2-2: EOF from peer (10.255.255.238:6996) 2008-06-06 21:10:29 W [client-protocol.c:4759:client_protocol_cleanup] brick2-2: cleaning up state in transport object 0x55e560 2008-06-06 21:10:29 E [tcp-client.c:190:tcp_connect] brick2-2: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:29 E [tcp-client.c:190:tcp_connect] brick0-4: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:29 E [tcp-client.c:190:tcp_connect] brick1-5: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:29 E [protocol.c:271:gf_block_unserialize_transport] brick2-3: EOF from peer (10.255.255.237:6996) 2008-06-06 21:10:29 W [client-protocol.c:4759:client_protocol_cleanup] brick2-3: cleaning up state in transport object 0x562ec0 2008-06-06 21:10:30 E [tcp-client.c:190:tcp_connect] brick2-3: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:30 E [tcp-client.c:190:tcp_connect] brick0-5: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 
21:10:30 E [tcp-client.c:190:tcp_connect] brick2-0: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:30 E [protocol.c:271:gf_block_unserialize_transport] brick2-4: EOF from peer (10.255.255.236:6996) 2008-06-06 21:10:30 W [client-protocol.c:4759:client_protocol_cleanup] brick2-4: cleaning up state in transport object 0x567820 2008-06-06 21:10:30 E [tcp-client.c:190:tcp_connect] brick2-4: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:30 E [tcp-client.c:190:tcp_connect] brick1-0: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:30 E [tcp-client.c:190:tcp_connect] brick2-1: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:30 E [protocol.c:271:gf_block_unserialize_transport] brick2-5: EOF from peer (10.255.255.235:6996) 2008-06-06 21:10:30 W [client-protocol.c:4759:client_protocol_cleanup] brick2-5: cleaning up state in transport object 0x56c180 2008-06-06 21:10:30 E [tcp-client.c:190:tcp_connect] brick2-5: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:30 E [tcp-client.c:190:tcp_connect] brick1-1: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:30 E [tcp-client.c:190:tcp_connect] brick2-2: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:30 E [protocol.c:271:gf_block_unserialize_transport] brick3-0: EOF from peer (10.255.255.234:6996) 2008-06-06 21:10:30 W [client-protocol.c:4759:client_protocol_cleanup] brick3-0: cleaning up state in transport object 0x570ae0 2008-06-06 21:10:30 E [tcp-client.c:190:tcp_connect] brick1-2: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:30 E [tcp-client.c:190:tcp_connect] brick3-0: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:31 E [tcp-client.c:190:tcp_connect] brick2-3: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:31 E 
[protocol.c:271:gf_block_unserialize_transport] brick3-1: EOF from peer (10.255.255.233:6996) 2008-06-06 21:10:31 W [client-protocol.c:4759:client_protocol_cleanup] brick3-1: cleaning up state in transport object 0x575440 2008-06-06 21:10:31 E [tcp-client.c:190:tcp_connect] local-ns: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:31 E [tcp-client.c:190:tcp_connect] brick1-3: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:31 E [tcp-client.c:190:tcp_connect] brick3-1: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:31 E [tcp-client.c:190:tcp_connect] brick2-4: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:31 E [tcp-client.c:190:tcp_connect] brick0-0: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:31 E [tcp-client.c:190:tcp_connect] brick1-4: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:31 E [protocol.c:271:gf_block_unserialize_transport] brick3-2: EOF from peer (10.255.255.232:6996) 2008-06-06 21:10:31 W [client-protocol.c:4759:client_protocol_cleanup] brick3-2: cleaning up state in transport object 0x579da0 2008-06-06 21:10:31 E [tcp-client.c:190:tcp_connect] brick3-2: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:31 E [tcp-client.c:190:tcp_connect] brick2-5: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:31 E [tcp-client.c:190:tcp_connect] brick0-1: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:31 E [tcp-client.c:190:tcp_connect] brick1-5: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:31 E [protocol.c:271:gf_block_unserialize_transport] brick3-3: EOF from peer (10.255.255.231:6996) 2008-06-06 21:10:31 W [client-protocol.c:4759:client_protocol_cleanup] brick3-3: cleaning up state in transport object 0x57e700 2008-06-06 21:10:31 E [tcp-client.c:190:tcp_connect] brick3-3: non-blocking 
connect() returned: 111 (Connection refused) 2008-06-06 21:10:31 E [tcp-client.c:190:tcp_connect] brick3-0: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:32 E [tcp-client.c:190:tcp_connect] brick0-2: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:32 E [tcp-client.c:190:tcp_connect] brick2-0: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:32 E [protocol.c:271:gf_block_unserialize_transport] brick3-4: EOF from peer (10.255.255.230:6996) 2008-06-06 21:10:32 W [client-protocol.c:4759:client_protocol_cleanup] brick3-4: cleaning up state in transport object 0x583060 2008-06-06 21:10:32 E [tcp-client.c:190:tcp_connect] brick3-4: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:32 E [tcp-client.c:190:tcp_connect] brick3-1: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:32 E [tcp-client.c:190:tcp_connect] brick0-3: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:32 E [tcp-client.c:190:tcp_connect] brick2-1: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:32 E [protocol.c:271:gf_block_unserialize_transport] brick3-5: EOF from peer (10.255.255.229:6996) 2008-06-06 21:10:32 W [client-protocol.c:4759:client_protocol_cleanup] brick3-5: cleaning up state in transport object 0x5879c0 2008-06-06 21:10:32 E [tcp-client.c:190:tcp_connect] brick3-5: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:32 E [tcp-client.c:190:tcp_connect] brick3-2: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:32 E [tcp-client.c:190:tcp_connect] brick0-4: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:32 E [tcp-client.c:190:tcp_connect] brick2-2: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:32 E [protocol.c:271:gf_block_unserialize_transport] brick4-0: EOF from peer (10.255.255.228:6996) 2008-06-06 21:10:32 
W [client-protocol.c:4759:client_protocol_cleanup] brick4-0: cleaning up state in transport object 0x520680 2008-06-06 21:10:32 E [tcp-client.c:190:tcp_connect] brick4-0: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:33 E [tcp-client.c:190:tcp_connect] brick3-3: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:33 E [tcp-client.c:190:tcp_connect] brick0-5: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:33 E [tcp-client.c:190:tcp_connect] brick2-3: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:33 E [protocol.c:271:gf_block_unserialize_transport] brick4-1: EOF from peer (10.255.255.227:6996) 2008-06-06 21:10:33 W [client-protocol.c:4759:client_protocol_cleanup] brick4-1: cleaning up state in transport object 0x525000 2008-06-06 21:10:33 E [tcp-client.c:190:tcp_connect] brick4-1: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:33 E [tcp-client.c:190:tcp_connect] brick3-4: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:33 E [tcp-client.c:190:tcp_connect] brick1-0: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:33 E [tcp-client.c:190:tcp_connect] brick2-4: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:33 E [protocol.c:271:gf_block_unserialize_transport] brick4-2: EOF from peer (10.255.255.226:6996) 2008-06-06 21:10:33 W [client-protocol.c:4759:client_protocol_cleanup] brick4-2: cleaning up state in transport object 0x529960 2008-06-06 21:10:33 E [tcp-client.c:190:tcp_connect] brick4-2: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:33 E [tcp-client.c:190:tcp_connect] brick3-5: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:33 E [tcp-client.c:190:tcp_connect] brick1-1: non-blocking connect() returned: 111 (Connection refused) 2008-06-06 21:10:33 E [tcp-client.c:190:tcp_connect] brick2-5: 
non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:33 E [protocol.c:271:gf_block_unserialize_transport] brick4-3: EOF from peer (10.255.255.225:6996)
2008-06-06 21:10:33 W [client-protocol.c:4759:client_protocol_cleanup] brick4-3: cleaning up state in transport object 0x52e2c0
2008-06-06 21:10:33 E [tcp-client.c:190:tcp_connect] brick4-3: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:33 E [tcp-client.c:190:tcp_connect] brick4-0: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:34 E [tcp-client.c:190:tcp_connect] brick1-2: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:34 E [tcp-client.c:190:tcp_connect] brick3-0: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:34 E [protocol.c:271:gf_block_unserialize_transport] brick4-4: EOF from peer (10.255.255.224:6996)
2008-06-06 21:10:34 W [client-protocol.c:4759:client_protocol_cleanup] brick4-4: cleaning up state in transport object 0x532c20
2008-06-06 21:10:34 E [tcp-client.c:190:tcp_connect] brick4-4: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:34 E [tcp-client.c:190:tcp_connect] brick4-1: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:34 E [tcp-client.c:190:tcp_connect] brick1-3: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:34 E [tcp-client.c:190:tcp_connect] brick3-1: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:34 E [protocol.c:271:gf_block_unserialize_transport] brick4-5: EOF from peer (10.255.255.223:6996)
2008-06-06 21:10:34 W [client-protocol.c:4759:client_protocol_cleanup] brick4-5: cleaning up state in transport object 0x537580
2008-06-06 21:10:34 E [tcp-client.c:190:tcp_connect] brick1-4: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:34 E [tcp-client.c:190:tcp_connect] brick4-5: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:34 E [tcp-client.c:190:tcp_connect] brick4-2: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:34 E [tcp-client.c:190:tcp_connect] brick3-2: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:34 E [protocol.c:271:gf_block_unserialize_transport] brick5-0: EOF from peer (10.255.255.222:6996)
2008-06-06 21:10:34 W [client-protocol.c:4759:client_protocol_cleanup] brick5-0: cleaning up state in transport object 0x53bee0
2008-06-06 21:10:34 E [tcp-client.c:190:tcp_connect] brick1-5: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:34 E [tcp-client.c:190:tcp_connect] brick5-0: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:34 E [tcp-client.c:190:tcp_connect] brick4-3: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:35 E [tcp-client.c:190:tcp_connect] brick3-3: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:35 E [protocol.c:271:gf_block_unserialize_transport] brick5-1: EOF from peer (10.255.255.221:6996)
2008-06-06 21:10:35 W [client-protocol.c:4759:client_protocol_cleanup] brick5-1: cleaning up state in transport object 0x540840
2008-06-06 21:10:35 E [tcp-client.c:190:tcp_connect] brick2-0: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:35 E [tcp-client.c:190:tcp_connect] brick5-1: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:35 E [tcp-client.c:190:tcp_connect] brick4-4: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:35 E [tcp-client.c:190:tcp_connect] brick3-4: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:35 E [protocol.c:271:gf_block_unserialize_transport] brick5-2: EOF from peer (10.255.255.220:6996)
2008-06-06 21:10:35 W [client-protocol.c:4759:client_protocol_cleanup] brick5-2: cleaning up state in transport object 0x5451a0
2008-06-06 21:10:35 E [tcp-client.c:190:tcp_connect] brick2-1: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:35 E [tcp-client.c:190:tcp_connect] brick5-2: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:35 E [tcp-client.c:190:tcp_connect] brick4-5: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:35 E [tcp-client.c:190:tcp_connect] brick3-5: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:35 E [protocol.c:271:gf_block_unserialize_transport] brick5-3: EOF from peer (10.255.255.219:6996)
2008-06-06 21:10:35 W [client-protocol.c:4759:client_protocol_cleanup] brick5-3: cleaning up state in transport object 0x549b00
2008-06-06 21:10:35 E [tcp-client.c:190:tcp_connect] brick5-3: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:35 E [tcp-client.c:190:tcp_connect] brick2-2: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:35 E [tcp-client.c:190:tcp_connect] brick5-0: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:35 E [tcp-client.c:190:tcp_connect] brick4-0: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:36 E [protocol.c:271:gf_block_unserialize_transport] brick5-4: EOF from peer (10.255.255.218:6996)
2008-06-06 21:10:36 W [client-protocol.c:4759:client_protocol_cleanup] brick5-4: cleaning up state in transport object 0x54e460
2008-06-06 21:10:36 E [tcp-client.c:190:tcp_connect] brick5-4: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:36 E [tcp-client.c:190:tcp_connect] brick2-3: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:36 E [tcp-client.c:190:tcp_connect] brick5-1: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:36 E [tcp-client.c:190:tcp_connect] brick4-1: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:36 E [tcp-client.c:190:tcp_connect] local-ns: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:36 E [protocol.c:271:gf_block_unserialize_transport] brick5-5: EOF from peer (10.255.255.217:6996)
2008-06-06 21:10:36 W [client-protocol.c:4759:client_protocol_cleanup] brick5-5: cleaning up state in transport object 0x552dc0
2008-06-06 21:10:36 E [tcp-client.c:190:tcp_connect] brick5-5: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:36 E [tcp-client.c:190:tcp_connect] brick2-4: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:36 E [tcp-client.c:190:tcp_connect] brick0-0: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:36 E [tcp-client.c:190:tcp_connect] brick5-2: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:10:36 E [tcp-client.c:190:tcp_connect] brick4-2: non-blocking connect() returned: 111 (Connection refused)
2008-06-06 21:11:00 W [nufa.c:47:nufa_init] nufa: No option for limit min-free-disk given, defaulting it to 15
2008-06-06 21:11:00 W [nufa.c:55:nufa_init] nufa: No option for nufa.refresh-interval given, defaulting it to 30
-------------------------------------------------------------------
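
For anyone who wants an overview of a client log like the one above, a small script can tally the entries per volume. This is only a minimal sketch, assuming the entry layout seen in the log (timestamp, level, [file:line:function], volume name, message); the names ENTRY and summarize and the regular expression are illustrative and not part of GlusterFS.

```python
import re
from collections import Counter

# One entry: timestamp, level, [file:line:function], volume name, message.
# The message is matched lazily up to the next timestamp or the end of the
# text, so entries that were joined onto one line are still separated.
ENTRY = re.compile(
    r"(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s+([A-Z])\s+"
    r"\[([^\]]+)\]\s+([\w-]+):\s+"
    r"(.*?)(?=\s+\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}\s|\s*\Z)",
    re.DOTALL,
)

def summarize(log_text):
    """Count log entries per (volume, kind) pair."""
    counts = Counter()
    for _ts, _level, _where, volume, message in ENTRY.findall(log_text):
        # Collapse line wraps inside a message before classifying it.
        message = " ".join(message.split())
        if "Connection refused" in message:
            kind = "refused"
        elif "EOF from peer" in message:
            kind = "eof"
        else:
            kind = "other"
        counts[(volume, kind)] += 1
    return counts
```

Calling summarize(open(path).read()) on the client log (path being wherever that log is written) and printing the result sorted by volume shows at a glance which bricks dropped the connection and which ones are refusing reconnects.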

It seems to us that gluster still considers the frozen node to be alive, at least to some extent, and so does not drop it from the filesystem. Any ideas on what is happening, and how we could overcome it? Thanks in advance,

Ricardo Garcia Mayoral
Computational Fluid Mechanics
ETSI Aeronauticos, Universidad Politecnica de Madrid
Pz Cardenal Cisneros 3, 28040 Madrid, Spain.
Phone: (+34) 913363291  Fax: (+34) 913363295
e-mail: ricardo@xxxxxxxxxxxxxxxxxx




